"\n\ndf.to_excel('/path/to/directory/example.xlsx', index=False)\n\n"
Session 3: Intro to Pandas and great_tables
Links
Guide to Python Data Structures
Cancer Dataset
What Is Pandas? 🐼
Pandas is a powerful open-source data analysis and manipulation library in Python. It provides data structures, primarily the DataFrame and Series, which are optimized for handling and analyzing large datasets efficiently.
Data Structures:
Series: A one-dimensional labeled array, suitable for handling single columns or rows of data.
DataFrame: A two-dimensional table with labeled axes (rows and columns), much like a spreadsheet or SQL table, allowing you to work with data in rows and columns simultaneously.
Data Manipulation:
Pandas has functions for merging, reshaping, and aggregating datasets, which helps streamline data cleaning and preparation.
It can handle missing data, making it easy to filter or fill gaps in datasets.
Data Analysis:
Pandas provides extensive functionality for descriptive statistics, grouping data, and handling time series.
Integrates well with other libraries, making it easy to move data between libraries like NumPy for numerical computations and Matplotlib or Seaborn for visualization.
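A minimal sketch of the two core structures (the values and column names here are illustrative):

```python
import pandas as pd

# Series: a one-dimensional labeled array
s = pd.Series([17.99, 20.57, 19.69], name="radius_mean")

# DataFrame: a two-dimensional labeled table built from columns
df = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 19.69],
})

print(s.mean())   # vectorized arithmetic on a Series
print(df.shape)   # (3, 2): three rows, two columns
```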
Activate Python Environment and Download Cancer Dataset
Cancer Dataset is located here 👇
Loading the Dataset
<class 'pandas.core.frame.DataFrame'>
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
Viewing Basic Information
Checking the Dataset’s Shape
.shape returns a tuple with (number of rows, number of columns), which provides a basic overview of the dataset size.
Summarizing Column Information
.info() lists all columns, their data types, and counts of non-null values, helping identify any columns that may have missing data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Viewing Column Names
.columns lists column headers, while .tolist() converts it into a standard Python list for easier viewing.
Column Names: ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
Summary Statistics
.describe() generates essential statistics (mean, std, min, max, percentiles) for numeric columns, useful for identifying data distributions.
| | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.690000e+02 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 0.0 |
| mean | 3.037183e+07 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 1.250206e+08 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 8.670000e+03 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | NaN |
| 25% | 8.692180e+05 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | NaN |
| 50% | 9.060240e+05 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | NaN |
| 75% | 8.813129e+06 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | NaN |
| max | 9.113205e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | NaN |
8 rows × 32 columns
Using value_counts() on a Single Column
This method is the most direct way to check the frequency distribution of a single categorical column. It returns a pandas Series, indexed by category, with the counts as values.
Grouping by Categorical Variable
To see summary statistics grouped by a categorical variable in pandas, you can use the groupby() method along with describe() or specific aggregation functions like mean(), sum(), etc.
# Group by 'diagnosis' and get summary statistics for each group
grouped_summary = cancer_data.groupby('diagnosis').mean()
print(grouped_summary)
#Group by 'diagnosis' and get summary statistics for only one variable
grouped_radius_mean = cancer_data.groupby('diagnosis')['radius_mean'].mean()
print(grouped_radius_mean)
id radius_mean texture_mean perimeter_mean \
diagnosis
B 2.654382e+07 12.146524 17.914762 78.075406
M 3.681805e+07 17.462830 21.604906 115.365377
area_mean smoothness_mean compactness_mean concavity_mean \
diagnosis
B 462.790196 0.092478 0.080085 0.046058
M 978.376415 0.102898 0.145188 0.160775
concave points_mean symmetry_mean ... texture_worst \
diagnosis ...
B 0.025717 0.174186 ... 23.515070
M 0.087990 0.192909 ... 29.318208
perimeter_worst area_worst smoothness_worst compactness_worst \
diagnosis
B 87.005938 558.899440 0.124959 0.182673
M 141.370330 1422.286321 0.144845 0.374824
concavity_worst concave points_worst symmetry_worst \
diagnosis
B 0.166238 0.074444 0.270246
M 0.450606 0.182237 0.323468
fractal_dimension_worst Unnamed: 32
diagnosis
B 0.079442 NaN
M 0.091530 NaN
[2 rows x 32 columns]
diagnosis
B 12.146524
M 17.462830
Name: radius_mean, dtype: float64
Renaming Columns
To make column names more readable or consistent, you can use rename() to change specific names. Here’s how to rename columns like radius_mean to Radius Mean.
# Rename specific columns for readability. 'old': 'new'
new_columns={
'radius_mean': 'Radius Mean',
'texture_mean': 'Texture Mean',
'perimeter_mean': 'Perimeter Mean'
}
cancer_data = cancer_data.rename(columns=new_columns)
# Display the new column names to verify the changes
print("\nUpdated Column Names:", cancer_data.columns.tolist())
Updated Column Names: ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
Column Selection
Selecting specific columns is essential for focusing on particular aspects of the dataset. Here are some examples of both single and multiple column selections.
Diagnosis Column:
0 M
1 M
2 M
3 M
4 M
Name: diagnosis, dtype: object
Alternatively, you can select multiple columns.
Selected Columns:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Row Selection and Column Selection with loc and iloc
Selecting rows based on labels (loc)
- loc allows selection based on labels (e.g., column names or index labels) and is particularly useful for data subsets.
- loc uses row labels, so it is endpoint-inclusive.
Selected Rows with loc:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Integer-Based Indexing with iloc
- iloc allows selection based purely on integer positions, making it convenient for slicing and position-based operations.
- iloc uses row positions, making it endpoint-exclusive (similar to Python slicing)
Selected Rows with iloc:
diagnosis Radius Mean Texture Mean
0 M 17.99 10.38
1 M 20.57 17.77
2 M 19.69 21.25
3 M 11.42 20.38
4 M 20.29 14.34
Filtering
Filtering enables you to create subsets of data that match specific conditions. For example, we can filter by diagnosis to analyze only malignant (M) or benign (B) cases.
Malignant Cases:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
5 843786 M 12.45 15.70 82.57 477.1
6 844359 M 18.25 19.98 119.60 1040.0
7 84458202 M 13.71 20.83 90.20 577.9
8 844981 M 13.00 21.82 87.50 519.8
9 84501001 M 12.46 24.04 83.97 475.9
10 845636 M 16.02 23.24 102.70 797.8
11 84610002 M 15.78 17.89 103.60 781.0
12 846226 M 19.17 24.80 132.40 1123.0
13 846381 M 15.85 23.95 103.70 782.7
14 84667401 M 13.73 22.61 93.60 578.3
15 84799002 M 14.54 27.54 96.73 658.8
16 848406 M 14.68 20.13 94.74 684.5
17 84862001 M 16.13 20.68 108.10 798.8
18 849014 M 19.81 22.15 130.00 1260.0
22 8511133 M 15.34 14.26 102.50 704.4
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.30010 0.14710
1 0.08474 0.07864 0.08690 0.07017
2 0.10960 0.15990 0.19740 0.12790
3 0.14250 0.28390 0.24140 0.10520
4 0.10030 0.13280 0.19800 0.10430
5 0.12780 0.17000 0.15780 0.08089
6 0.09463 0.10900 0.11270 0.07400
7 0.11890 0.16450 0.09366 0.05985
8 0.12730 0.19320 0.18590 0.09353
9 0.11860 0.23960 0.22730 0.08543
10 0.08206 0.06669 0.03299 0.03323
11 0.09710 0.12920 0.09954 0.06606
12 0.09740 0.24580 0.20650 0.11180
13 0.08401 0.10020 0.09938 0.05364
14 0.11310 0.22930 0.21280 0.08025
15 0.11390 0.15950 0.16390 0.07364
16 0.09867 0.07200 0.07395 0.05259
17 0.11700 0.20220 0.17220 0.10280
18 0.09831 0.10270 0.14790 0.09498
22 0.10730 0.21350 0.20770 0.09756
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.60 2019.0 0.1622
1 ... 23.41 158.80 1956.0 0.1238
2 ... 25.53 152.50 1709.0 0.1444
3 ... 26.50 98.87 567.7 0.2098
4 ... 16.67 152.20 1575.0 0.1374
5 ... 23.75 103.40 741.6 0.1791
6 ... 27.66 153.20 1606.0 0.1442
7 ... 28.14 110.60 897.0 0.1654
8 ... 30.73 106.20 739.3 0.1703
9 ... 40.68 97.65 711.4 0.1853
10 ... 33.88 123.80 1150.0 0.1181
11 ... 27.28 136.50 1299.0 0.1396
12 ... 29.94 151.70 1332.0 0.1037
13 ... 27.66 112.00 876.5 0.1131
14 ... 32.01 108.80 697.7 0.1651
15 ... 37.13 124.10 943.2 0.1678
16 ... 30.88 123.40 1138.0 0.1464
17 ... 31.48 136.80 1315.0 0.1789
18 ... 30.88 186.80 2398.0 0.1512
22 ... 19.08 125.10 980.9 0.1390
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.26540 0.4601
1 0.1866 0.2416 0.18600 0.2750
2 0.4245 0.4504 0.24300 0.3613
3 0.8663 0.6869 0.25750 0.6638
4 0.2050 0.4000 0.16250 0.2364
5 0.5249 0.5355 0.17410 0.3985
6 0.2576 0.3784 0.19320 0.3063
7 0.3682 0.2678 0.15560 0.3196
8 0.5401 0.5390 0.20600 0.4378
9 1.0580 1.1050 0.22100 0.4366
10 0.1551 0.1459 0.09975 0.2948
11 0.5609 0.3965 0.18100 0.3792
12 0.3903 0.3639 0.17670 0.3176
13 0.1924 0.2322 0.11190 0.2809
14 0.7725 0.6943 0.22080 0.3596
15 0.6577 0.7026 0.17120 0.4218
16 0.1871 0.2914 0.16090 0.3029
17 0.4233 0.4784 0.20730 0.3706
18 0.3150 0.5372 0.23880 0.2768
22 0.5954 0.6305 0.23930 0.4667
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
5 0.12440 NaN
6 0.08368 NaN
7 0.11510 NaN
8 0.10720 NaN
9 0.20750 NaN
10 0.08452 NaN
11 0.10480 NaN
12 0.10230 NaN
13 0.06287 NaN
14 0.14310 NaN
15 0.13410 NaN
16 0.08216 NaN
17 0.11420 NaN
18 0.07615 NaN
22 0.09946 NaN
[20 rows x 33 columns]
You can also filter based on multiple conditions, such as finding rows where the diagnosis is “M” and radius_mean is greater than 15.
Note: you can’t use Python’s ‘and’ operator here, because ‘and’ works only with single True or False values, not element-wise on arrays or Series. Use & for element-wise AND (and | for OR), and wrap each condition in parentheses.
Large Malignant Cases (Radius Mean > 15):
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.8 1001.0
1 842517 M 20.57 17.77 132.9 1326.0
2 84300903 M 19.69 21.25 130.0 1203.0
4 84358402 M 20.29 14.34 135.1 1297.0
6 844359 M 18.25 19.98 119.6 1040.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
4 0.10030 0.13280 0.1980 0.10430
6 0.09463 0.10900 0.1127 0.07400
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.6 2019.0 0.1622
1 ... 23.41 158.8 1956.0 0.1238
2 ... 25.53 152.5 1709.0 0.1444
4 ... 16.67 152.2 1575.0 0.1374
6 ... 27.66 153.2 1606.0 0.1442
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
4 0.2050 0.4000 0.1625 0.2364
6 0.2576 0.3784 0.1932 0.3063
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
4 0.07678 NaN
6 0.08368 NaN
[5 rows x 33 columns]
Adding and Modifying Columns
You can create new columns in a DataFrame based on calculations using existing columns. For example, we can calculate the area_ratio by dividing area_worst by area_mean.
New Column 'area_ratio':
area_worst area_mean area_ratio
0 2019.0 1001.0 2.016983
1 1956.0 1326.0 1.475113
2 1709.0 1203.0 1.420615
3 567.7 386.1 1.470344
4 1575.0 1297.0 1.214341
Changing a Value Using .at
Suppose you have a DataFrame and want to update the value in the radius_mean column for a particular index.
# Access and print the original value at index 0 and column 'radius_mean'
original_value = cancer_data.at[0, 'Radius Mean']
print("Original Radius Mean at index 0:", original_value)
# Change the value at index 0 and column 'radius_mean' to 18.5
cancer_data.at[0, 'Radius Mean'] = 18.5
# Verify the updated value
updated_value = cancer_data.at[0, 'Radius Mean']
print("Updated Radius Mean at index 0:", updated_value)Original Radius Mean at index 0: 17.99
Updated Radius Mean at index 0: 18.5
Sorting by Columns
You can sort a dataset by columns. Here’s how to sort by diagnosis first and then by area_mean in ascending order.
# Sort by 'diagnosis' first, then by 'area_mean' within each diagnosis group
sorted_by_diagnosis_area = cancer_data.sort_values(by=['diagnosis', 'area_mean'], ascending=[True, True])
print("Data sorted by Diagnosis and Area Mean:\n", sorted_by_diagnosis_area[['diagnosis', 'area_mean', 'Radius Mean']].head())Data sorted by Diagnosis and Area Mean:
diagnosis area_mean Radius Mean
101 B 143.5 6.981
539 B 170.4 7.691
538 B 178.8 7.729
568 B 181.0 7.760
46 B 201.9 8.196
Reordering Columns to Move a Column to the End
You might also want to move a specific column to the end of the DataFrame, such as moving area_ratio to the last position.
# Move 'area_ratio' to the end of the DataFrame
columns_reordered = [col for col in cancer_data.columns if col != 'area_ratio'] + ['area_ratio']
cancer_data_with_area_ratio_last = cancer_data[columns_reordered]
# Display the reordered columns
print("Data with 'area_ratio' at the end:\n", cancer_data_with_area_ratio_last.head())
Data with 'area_ratio' at the end:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 18.50 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... perimeter_worst area_worst smoothness_worst compactness_worst \
0 ... 184.60 2019.0 0.1622 0.6656
1 ... 158.80 1956.0 0.1238 0.1866
2 ... 152.50 1709.0 0.1444 0.4245
3 ... 98.87 567.7 0.2098 0.8663
4 ... 152.20 1575.0 0.1374 0.2050
concavity_worst concave points_worst symmetry_worst \
0 0.7119 0.2654 0.4601
1 0.2416 0.1860 0.2750
2 0.4504 0.2430 0.3613
3 0.6869 0.2575 0.6638
4 0.4000 0.1625 0.2364
fractal_dimension_worst Unnamed: 32 area_ratio
0 0.11890 NaN 2.016983
1 0.08902 NaN 1.475113
2 0.08758 NaN 1.420615
3 0.17300 NaN 1.470344
4 0.07678 NaN 1.214341
[5 rows x 34 columns]
Method Chaining in Pandas
In Pandas, you can chain multiple methods together to create a pipeline.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
"name": [" Alice ", "BOB", "Charlie", None],
"score": [85, 92, None, 74]
})
# Clean the data using method chaining
clean_df = (
df
.dropna() # Method: drop rows with any NaNs
.assign( # Method: add or update columns
name_clean=lambda d: d["name"].str.strip().str.title()
)
.sort_values("score", ascending=False) # Method: sort by score
)
print(clean_df)
name score name_clean
1 BOB 92.0 Bob
0 Alice 85.0 Alice
Why .str.strip() and not just .strip()?
Calling df["name"].strip() raises an AttributeError: .strip() is a method on single strings, not on a Series of strings.
Why?
- df["name"] is a Series, not a string.
- .strip() is a string method that works on single strings.
- .str is the accessor that tells pandas: “apply this string method to each element in the Series.”
Rule of Thumb:
| You have… | Use… | Why? |
|---|---|---|
| A single string | "hello".strip() | It’s just Python |
| A Series of strings | df["col"].str.strip() | It’s pandas, operating on many strings |
Applying Functions in Pandas
Applying Functions to Columns
The .apply() method in pandas lets you apply a custom function to each element in a Series (column) or DataFrame. Here’s how to use it to categorize tumors based on area_mean.
Example: Categorizing Tumors by Size
Let’s create a custom function to categorize tumors as “Small”, “Medium”, or “Large” based on area_mean.
# Define a custom function to categorize tumors by area_mean
def categorize_tumor(size):
if size < 500:
return 'Small'
elif 500 <= size < 1000:
return 'Medium'
else:
return 'Large'
# Apply the function to the 'area_mean' column and create a new column 'tumor_size_category'
cancer_data['tumor_size_category'] = cancer_data['area_mean'].apply(categorize_tumor)
# Display the new column to verify the transformation
print("Tumor Size Categories:\n", cancer_data[['area_mean', 'tumor_size_category']].head())Tumor Size Categories:
area_mean tumor_size_category
0 1001.0 Large
1 1326.0 Large
2 1203.0 Large
3 386.1 Small
4 1297.0 Large
Using Lambda Functions for Quick Transformations
Lambda functions are useful for simple, one-line operations. For example, we can use a lambda function to convert diagnosis into numerical codes (0 for Benign, 1 for Malignant).
# Apply a lambda function to classify 'diagnosis' into numerical codes
cancer_data['diagnosis_code'] = cancer_data['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)
# Display the new column to verify the transformation
print("Diagnosis Codes:\n", cancer_data[['diagnosis', 'diagnosis_code']].head())Diagnosis Codes:
diagnosis diagnosis_code
0 M 1
1 M 1
2 M 1
3 M 1
4 M 1
Applying Multiple Conditions with .apply()
You can also use .apply() with a lambda function for more complex, multi-condition classifications.
Example: Adding a Column with Risk Levels Suppose we want to create a new column, risk_level, based on both diagnosis and area_mean:
- “High Risk” for Malignant tumors with area_mean above 1000.
- “Moderate Risk” for Malignant tumors with area_mean below 1000.
- “Low Risk” for Benign tumors.
# Apply a lambda function with multiple conditions to create a 'risk_level' column
cancer_data['risk_level'] = cancer_data.apply(
lambda row: 'High Risk' if row['diagnosis'] == 'M' and row['area_mean'] > 1000
else ('Moderate Risk' if row['diagnosis'] == 'M' else 'Low Risk'), axis=1
)
# Display the new column to verify the transformation
print("Risk Levels:\n", cancer_data[['diagnosis', 'area_mean', 'risk_level']].head())
# axis=1 tells apply() to work row by row; axis=0 (the default) works column by column
Risk Levels:
diagnosis area_mean risk_level
0 M 1001.0 High Risk
1 M 1326.0 High Risk
2 M 1203.0 High Risk
3 M 386.1 Moderate Risk
4 M 1297.0 High Risk
When to apply axis=
| You’re applying to… | Use .apply() on… | Do you need axis? |
|---|---|---|
| A single column (Series) | df['col'].apply(func) | No |
| Multiple columns (row-wise) | df.apply(func, axis=1) | Yes (axis=1) |
| Column-wise (less common) | df.apply(func) or axis=0 | Optional (default is 0) |
Exporting a Pandas DataFrame to CSV or XLSX
Missing values in Python
Missing values are common in data analysis. Python provides multiple ways to represent missing values, including None, np.nan, and pd.NA. Understanding their behavior is crucial for data cleaning, processing, and analysis.
Missing Values in Python: None
- Definition: None is a built-in Python object representing “no value.”
- Use Cases: Works with general Python objects but does not support mathematical operations.
Issue with None in arithmetic:
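A quick illustration (the exact error message may vary by Python version):

```python
x = None
try:
    x + 1
except TypeError as err:
    # None does not support arithmetic
    print("TypeError:", err)
```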
Missing Values in NumPy: np.nan
- np.nan represents missing values in numerical computations.
- np.nan is a floating-point value (float64).
- It cannot be checked with == because np.nan != np.nan; use np.isnan() instead.
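The check that produces the output below:

```python
import numpy as np

x = np.nan
# np.nan != np.nan, so test with np.isnan() rather than ==
if np.isnan(x):
    print("x is missing")
```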
x is missing
Behavior in Math Operations
Missing Values in Pandas: pd.NA
- pd.NA is Pandas’ missing value representation, introduced in Pandas 1.0.
- Works with nullable data types (Int64, Float64, boolean, string).
- Avoids automatic type conversion (e.g., integers remain integers).
Behavior in Math Operations
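As with np.nan, arithmetic propagates the missing value, but here the result stays pd.NA:

```python
import pandas as pd

print(pd.NA + 1)    # <NA>
print(pd.NA * 2)    # <NA>
```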
Why use pd.NA in Pandas datasets?
- Data Type Preservation
- np.nan is always a floating-point value, which forces integer columns to be upcast to floats.
- pd.NA allows a column to retain its original data type
- Consistency Across Types
- Can be used across all nullable data types
- Acts as a single indicator for missingness across data types (integer, boolean, string, datetime)
- Improved Missing Data Semantics
- pd.NA behaves more like a true “unknown” or “missing” value during operations, which can be more intuitive
- For example, comparing pd.NA with any value (including itself) results in pd.NA rather than a boolean
- Enables Nullable Extension Types
- The introduction of pd.NA facilitated the development of “nullable extension types”
- These can be thought of as “column types”.
- For example, typecasting a column as Int64 tells Pandas that this column is made up of integers, and can accept missing values (pd.NA)
- If we try to use np.nan instead, all values in the column would be forced to be floats.
Comparing None, np.nan, and pd.NA
| Feature | None | np.nan | pd.NA |
|---|---|---|---|
| Type | NoneType | float64 | Special Pandas scalar |
| Use Case | General Python | NumPy/Pandas numeric data | Pandas nullable types |
| Arithmetic Ops | Fails (None + 1) | Works but returns nan | Works but returns <NA> |
| Comparison (==) | None == None → True | np.nan == np.nan → False | pd.NA == pd.NA → <NA> |
| Check Method | if x is None: | if np.isnan(x): | if pd.isna(x): |
Handling Missing Values in Pandas
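A sketch of code that would produce the outputs below: a single object-dtype column holding all three markers, checked with .isna() and inspected element by element:

```python
import numpy as np
import pandas as pd

# One column mixing all three missing-value markers (stored as object dtype)
df = pd.DataFrame({"A": [1, np.nan, 3, None, pd.NA]})

print(df)           # np.nan renders as NaN; None and pd.NA as None and <NA>
print(df.isna())    # all three markers register as missing
for v in df["A"]:
    print(type(v))
```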
A
0 1
1 NaN
2 3
3 None
4 <NA>
A
0 False
1 True
2 False
3 True
4 True
<class 'int'>
<class 'float'>
<class 'int'>
<class 'NoneType'>
<class 'pandas._libs.missing.NAType'>
Filling Missing Values
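The output below corresponds to iterating over a nullable-integer Series; a sketch (the exact scalar types can vary slightly across pandas versions):

```python
import pandas as pd

# Nullable Int64 column: integers plus pd.NA for the gaps
s = pd.Series([1, pd.NA, 3, pd.NA, pd.NA], dtype="Int64")

for v in s:
    print(v)
    print(type(v))
print(s.dtype)
```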
1
<class 'numpy.int64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3
<class 'numpy.int64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Int64
Why does casting the column as int64 fail with missing values?
The Problem
The Fix
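A sketch of both halves: plain int64 has no way to represent a missing value, so the cast fails; the nullable Int64 dtype accepts pd.NA (the exact error message varies by pandas version):

```python
import pandas as pd

s = pd.Series([1, pd.NA, 3, pd.NA, pd.NA])

# The problem: NumPy's int64 cannot represent a missing value
try:
    s.astype("int64")
except (TypeError, ValueError) as err:
    print("astype('int64') failed:", err)

# The fix: the nullable Int64 extension dtype keeps integers and pd.NA together
s_int = s.astype("Int64")
for v in s_int:
    print(v, ":", type(v))
```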
1 : <class 'numpy.int64'>
<NA> : <class 'pandas._libs.missing.NAType'>
3 : <class 'numpy.int64'>
<NA> : <class 'pandas._libs.missing.NAType'>
<NA> : <class 'pandas._libs.missing.NAType'>
Casting to float64 works (unlike int64), with a caveat
1.0 : <class 'float'>
nan : <class 'float'>
3.0 : <class 'float'>
nan : <class 'float'>
nan : <class 'float'>
All pd.NA values are “recast” as np.nan, since this type of missing value is a float. However, this is usually not what you want.
Best practice: cast the column as the Pandas nullable datatype Float64, so that missing values stay pd.NA
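A sketch that produces the output below using the nullable Float64 dtype:

```python
import pandas as pd

# Nullable Float64 keeps the missing entries as pd.NA instead of np.nan
s = pd.Series([1, pd.NA, 3, pd.NA, pd.NA], dtype="Float64")

for v in s:
    print(v)
    print(type(v))
print(s.dtype)
```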
1.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Float64
Best Practices for Missing Data
- Use None for general Python objects.
- Use np.nan for numerical missing values in NumPy.
- Use pd.NA for missing values in Pandas with nullable data types.
- Always use isna() or isnull() when working with missing data in Pandas.
Other Pandas Nullable Data Types
| Pandas Nullable Data Type | Description | Typical Usage |
|---|---|---|
| Int8, Int16, Int32, Int64 | Nullable integer types (can hold pd.NA) | Use when you want integers with missing values |
| Float32, Float64 | Nullable float types (standard floats also support NaN) | Numeric data with decimals, missing values |
| boolean | Nullable Boolean type (True, False, pd.NA) | Binary categories with missing info |
| string | Pandas string data type (nullable) | Text data with potential nulls |
| category | Categorical type (can include NaN or pd.NA) | Categorical data, efficient storage |
| datetime64[ns] with pd.NaT | Datetime with nanosecond precision | Time series, datetime columns |
| timedelta64[ns] with pd.NaT | Timedeltas (differences between datetimes) | Duration calculations |
More on missing values
To drop missing values, you can use .dropna()
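A sketch that produces the output below (a nullable Float64 column named A):

```python
import pandas as pd

# Rows with missing values are dropped; surviving rows keep their index labels
s = pd.Series([1, pd.NA, 3, None, pd.NA], name="A", dtype="Float64")
print(s.dropna())
```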
0 1.0
2 3.0
Name: A, dtype: Float64
To find missing values, you can use isnull() with sum() to calculate the total number of missing values in each column.
Missing Values per Column:
id 0
diagnosis 0
Radius Mean 0
Texture Mean 0
Perimeter Mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
area_ratio 0
tumor_size_category 0
diagnosis_code 0
risk_level 0
dtype: int64
Dropping Columns with Excessive Missing Data
Since Unnamed: 32 has no data, it can be dropped from the DataFrame using .drop().
Columns after dropping 'Unnamed: 32': ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'area_ratio', 'tumor_size_category', 'diagnosis_code', 'risk_level']
great_tables for table generation
You can use the great_tables Python module from the great-tables package to explore and display data from a dataset in a clean and interactive format.
We’ll load the data, summarize it, and then build styled tables using great_tables.
Load and Inspect the Data
import pandas as pd
# Use forward slashes so backslashes aren't read as escape sequences
# (a Windows-style path like "..\session3\..." triggers
# SyntaxWarning: invalid escape sequence '\s')
df = pd.read_csv("../session3/example_data/Cancer_Data.csv")
df.shape, df.head()
((569, 32),
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... radius_worst texture_worst perimeter_worst area_worst \
0 ... 25.38 17.33 184.60 2019.0
1 ... 24.99 23.41 158.80 1956.0
2 ... 23.57 25.53 152.50 1709.0
3 ... 14.91 26.50 98.87 567.7
4 ... 22.54 16.67 152.20 1575.0
smoothness_worst compactness_worst concavity_worst concave points_worst \
0 0.1622 0.6656 0.7119 0.2654
1 0.1238 0.1866 0.2416 0.1860
2 0.1444 0.4245 0.4504 0.2430
3 0.2098 0.8663 0.6869 0.2575
4 0.1374 0.2050 0.4000 0.1625
symmetry_worst fractal_dimension_worst
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678
[5 rows x 32 columns])
Preview of dataset using great_tables
from great_tables import GT
# Select a subset of the columns for preview
preview_df = df[['id', 'diagnosis', 'radius_mean', 'texture_mean', 'area_mean']].head(5)
GT(preview_df).tab_header(
title="Breast Cancer Diagnosis Preview",
subtitle="Selected features from the first 5 records"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
| Breast Cancer Diagnosis Preview | ||||
| Selected features from the first 5 records | ||||
| id | diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 1,001.00 |
| 842517 | M | 20.57 | 17.77 | 1,326.00 |
| 84300903 | M | 19.69 | 21.25 | 1,203.00 |
| 84348301 | M | 11.42 | 20.38 | 386.10 |
| 84358402 | M | 20.29 | 14.34 | 1,297.00 |
Enhance the Table with Styling
Let’s add conditional formatting to highlight larger tumor areas.
| Styled Cancer Data Table | ||||
| With conditional formatting on tumor area | ||||
| id | diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 1,001.00 |
| 842517 | M | 20.57 | 17.77 | 1,326.00 |
| 84300903 | M | 19.69 | 21.25 | 1,203.00 |
| 84348301 | M | 11.42 | 20.38 | 386.10 |
| 84358402 | M | 20.29 | 14.34 | 1,297.00 |
Group Statistics by Diagnosis
Let’s summarize average values by diagnosis group (malignant vs. benign).
summary_df = df.groupby("diagnosis")[["radius_mean", "texture_mean", "area_mean"]].mean().reset_index()
GT(summary_df).tab_header(
title="Group-wise Summary",
subtitle="Mean values for radius, texture, and area by diagnosis"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
| Group-wise Summary | |||
| Mean values for radius, texture, and area by diagnosis | |||
| diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|
| B | 12.15 | 17.91 | 462.79 |
| M | 17.46 | 21.60 | 978.38 |