"\n\ndf.to_excel('/path/to/directory/example.xlsx', index=False)\n\n"
Guide to Python Data Structures
Cancer Dataset
Pandas is a powerful open-source data analysis and manipulation library in Python. It provides data structures, primarily the DataFrame and Series, which are optimized for handling and analyzing large datasets efficiently.
Series: A one-dimensional labeled array, suitable for handling single columns or rows of data.
DataFrame: A two-dimensional table with labeled axes (rows and columns), much like a spreadsheet or SQL table, allowing you to work with data in rows and columns simultaneously.
Pandas has functions for merging, reshaping, and aggregating datasets, which helps streamline data cleaning and preparation.
It can handle missing data, making it easy to filter or fill gaps in datasets.
Pandas provides extensive functionality for descriptive statistics, grouping data, and handling time series.
Pandas integrates well with other libraries, making it easy to move data between NumPy for numerical computations and Matplotlib or Seaborn for visualization.
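As a quick illustration of the two core structures (toy values, not the workshop data), a minimal sketch:

```python
import pandas as pd

# Series: a one-dimensional labeled array
radii = pd.Series([17.99, 20.57, 11.42], name="radius_mean")

# DataFrame: a two-dimensional labeled table
df = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 11.42],
})

print(radii)
print(df)

# The cancer dataset itself would typically be read from a CSV file, e.g.:
# cancer_data = pd.read_csv("data.csv")  # filename/path is an assumption
```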
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
.shape returns a tuple with (number of rows, number of columns), which provides a basic overview of the dataset size.
Dataset Shape: (569, 33)
.info() lists all columns, their data types, and counts of non-null values, helping identify any columns that may have missing data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
.columns lists column headers, while .tolist() converts it into a standard Python list for easier viewing.
Column Names: ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
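Put together, the inspection calls above look like this (sketched on a small stand-in DataFrame rather than the full dataset):

```python
import pandas as pd

# Toy frame standing in for cancer_data (illustrative values only)
df = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "M"],
    "radius_mean": [17.99, 20.57],
})

print("Shape:", df.shape)                # (rows, columns)
df.info()                                # dtypes and non-null counts per column
print("Columns:", df.columns.tolist())   # Index converted to a plain Python list
```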
.describe() generates essential statistics (mean, std, min, max, percentiles) for numeric columns, useful for identifying data distributions.
Note: Sometimes you want to exclude certain columns from this (such as patient IDs)
| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 0.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | NaN |
| 25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | NaN |
| 50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | NaN |
| 75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | NaN |
| max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | NaN |
8 rows × 31 columns
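One way to exclude an identifier column before summarizing is to drop it first; a sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 11.42],
})

# Drop the patient ID column before computing summary statistics
stats = df.drop(columns=["id"]).describe()
print(stats)
```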
Note: this doesn’t actually change cancer_data in memory — most DataFrame methods return a new object, so the original is untouched unless you reassign the result.
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
The .value_counts() method is straightforward when you want the frequency distribution of one specific categorical column. It returns a pandas Series object.
Diagnosis Counts:
diagnosis
B 357
M 212
Name: count, dtype: int64
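The counts above come from .value_counts(); a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"diagnosis": ["B", "M", "B", "B", "M"]})

# Frequency of each category in one column; the result is a Series
counts = df["diagnosis"].value_counts()
print(counts)
```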
To see summary statistics grouped by a categorical variable in pandas, you can use the groupby() method along with describe() or specific aggregation functions like mean(), sum(), etc.
id radius_mean texture_mean perimeter_mean \
diagnosis
B 2.654382e+07 12.146524 17.914762 78.075406
M 3.681805e+07 17.462830 21.604906 115.365377
area_mean smoothness_mean compactness_mean concavity_mean \
diagnosis
B 462.790196 0.092478 0.080085 0.046058
M 978.376415 0.102898 0.145188 0.160775
concave points_mean symmetry_mean ... texture_worst \
diagnosis ...
B 0.025717 0.174186 ... 23.515070
M 0.087990 0.192909 ... 29.318208
perimeter_worst area_worst smoothness_worst compactness_worst \
diagnosis
B 87.005938 558.899440 0.124959 0.182673
M 141.370330 1422.286321 0.144845 0.374824
concavity_worst concave points_worst symmetry_worst \
diagnosis
B 0.166238 0.074444 0.270246
M 0.450606 0.182237 0.323468
fractal_dimension_worst Unnamed: 32
diagnosis
B 0.079442 NaN
M 0.091530 NaN
[2 rows x 32 columns]
diagnosis
B 12.146524
M 17.462830
Name: radius_mean, dtype: float64
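A sketch of both patterns — group-wise aggregation over all numeric columns and a single-column mean — on toy values:

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["B", "B", "M", "M"],
    "radius_mean": [12.0, 13.0, 17.0, 19.0],
})

# Mean of every numeric column, per diagnosis group
print(df.groupby("diagnosis").mean(numeric_only=True))

# Mean of a single column, per group
group_means = df.groupby("diagnosis")["radius_mean"].mean()
print(group_means)
```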
To make column names more readable or consistent, you can use rename() to change specific names. Here’s how to rename columns like radius_mean to Radius Mean.
# Rename specific columns for readability. 'old': 'new'
new_columns={
'radius_mean': 'Radius Mean',
'texture_mean': 'Texture Mean',
'perimeter_mean': 'Perimeter Mean'
}
cancer_data = cancer_data.rename(columns=new_columns)
# Display the new column names to verify the changes
print("\nUpdated Column Names:", cancer_data.columns.tolist())
Updated Column Names: ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
Selecting specific columns is essential for focusing on particular aspects of the dataset. Here are some examples of both single and multiple column selections.
Diagnosis Column:
0 M
1 M
2 M
3 M
4 M
Name: diagnosis, dtype: object
Selected Columns:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Selected Rows with loc:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Selected Rows with iloc:
diagnosis Radius Mean Texture Mean
0 M 17.99 10.38
1 M 20.57 17.77
2 M 19.69 21.25
3 M 11.42 20.38
4 M 20.29 14.34
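The selections above can be sketched as follows (toy values; note that .loc slices by label and is end-inclusive, while .iloc slices by position and is end-exclusive):

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "Radius Mean": [17.99, 11.42, 12.45],
    "area_mean": [1001.0, 386.1, 477.1],
})

single = df["diagnosis"]                            # one column -> Series
subset = df[["diagnosis", "Radius Mean"]]           # list of columns -> DataFrame
by_label = df.loc[0:1, ["diagnosis", "area_mean"]]  # label-based, end-inclusive
by_position = df.iloc[0:2, 0:2]                     # position-based, end-exclusive
print(single, subset, by_label, by_position, sep="\n")
```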
Filtering enables you to create subsets of data that match specific conditions. For example, we can filter by diagnosis to analyze only malignant (M) or benign (B) cases.
Malignant Cases:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
5 843786 M 12.45 15.70 82.57 477.1
6 844359 M 18.25 19.98 119.60 1040.0
7 84458202 M 13.71 20.83 90.20 577.9
8 844981 M 13.00 21.82 87.50 519.8
9 84501001 M 12.46 24.04 83.97 475.9
10 845636 M 16.02 23.24 102.70 797.8
11 84610002 M 15.78 17.89 103.60 781.0
12 846226 M 19.17 24.80 132.40 1123.0
13 846381 M 15.85 23.95 103.70 782.7
14 84667401 M 13.73 22.61 93.60 578.3
15 84799002 M 14.54 27.54 96.73 658.8
16 848406 M 14.68 20.13 94.74 684.5
17 84862001 M 16.13 20.68 108.10 798.8
18 849014 M 19.81 22.15 130.00 1260.0
22 8511133 M 15.34 14.26 102.50 704.4
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.30010 0.14710
1 0.08474 0.07864 0.08690 0.07017
2 0.10960 0.15990 0.19740 0.12790
3 0.14250 0.28390 0.24140 0.10520
4 0.10030 0.13280 0.19800 0.10430
5 0.12780 0.17000 0.15780 0.08089
6 0.09463 0.10900 0.11270 0.07400
7 0.11890 0.16450 0.09366 0.05985
8 0.12730 0.19320 0.18590 0.09353
9 0.11860 0.23960 0.22730 0.08543
10 0.08206 0.06669 0.03299 0.03323
11 0.09710 0.12920 0.09954 0.06606
12 0.09740 0.24580 0.20650 0.11180
13 0.08401 0.10020 0.09938 0.05364
14 0.11310 0.22930 0.21280 0.08025
15 0.11390 0.15950 0.16390 0.07364
16 0.09867 0.07200 0.07395 0.05259
17 0.11700 0.20220 0.17220 0.10280
18 0.09831 0.10270 0.14790 0.09498
22 0.10730 0.21350 0.20770 0.09756
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.60 2019.0 0.1622
1 ... 23.41 158.80 1956.0 0.1238
2 ... 25.53 152.50 1709.0 0.1444
3 ... 26.50 98.87 567.7 0.2098
4 ... 16.67 152.20 1575.0 0.1374
5 ... 23.75 103.40 741.6 0.1791
6 ... 27.66 153.20 1606.0 0.1442
7 ... 28.14 110.60 897.0 0.1654
8 ... 30.73 106.20 739.3 0.1703
9 ... 40.68 97.65 711.4 0.1853
10 ... 33.88 123.80 1150.0 0.1181
11 ... 27.28 136.50 1299.0 0.1396
12 ... 29.94 151.70 1332.0 0.1037
13 ... 27.66 112.00 876.5 0.1131
14 ... 32.01 108.80 697.7 0.1651
15 ... 37.13 124.10 943.2 0.1678
16 ... 30.88 123.40 1138.0 0.1464
17 ... 31.48 136.80 1315.0 0.1789
18 ... 30.88 186.80 2398.0 0.1512
22 ... 19.08 125.10 980.9 0.1390
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.26540 0.4601
1 0.1866 0.2416 0.18600 0.2750
2 0.4245 0.4504 0.24300 0.3613
3 0.8663 0.6869 0.25750 0.6638
4 0.2050 0.4000 0.16250 0.2364
5 0.5249 0.5355 0.17410 0.3985
6 0.2576 0.3784 0.19320 0.3063
7 0.3682 0.2678 0.15560 0.3196
8 0.5401 0.5390 0.20600 0.4378
9 1.0580 1.1050 0.22100 0.4366
10 0.1551 0.1459 0.09975 0.2948
11 0.5609 0.3965 0.18100 0.3792
12 0.3903 0.3639 0.17670 0.3176
13 0.1924 0.2322 0.11190 0.2809
14 0.7725 0.6943 0.22080 0.3596
15 0.6577 0.7026 0.17120 0.4218
16 0.1871 0.2914 0.16090 0.3029
17 0.4233 0.4784 0.20730 0.3706
18 0.3150 0.5372 0.23880 0.2768
22 0.5954 0.6305 0.23930 0.4667
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
5 0.12440 NaN
6 0.08368 NaN
7 0.11510 NaN
8 0.10720 NaN
9 0.20750 NaN
10 0.08452 NaN
11 0.10480 NaN
12 0.10230 NaN
13 0.06287 NaN
14 0.14310 NaN
15 0.13410 NaN
16 0.08216 NaN
17 0.11420 NaN
18 0.07615 NaN
22 0.09946 NaN
[20 rows x 33 columns]
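The filter above is a boolean mask: a condition produces a Series of True/False, and indexing with it keeps only the True rows. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["M", "B", "M"],
    "Radius Mean": [17.99, 11.42, 20.57],
})

# Keep only the rows where the condition is True
malignant = df[df["diagnosis"] == "M"]
print(malignant)
```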
You can also filter based on multiple conditions, such as finding rows where the diagnosis is “M” and radius_mean is greater than 15.
Note: You can’t use Python’s ‘and’ operator here, because ‘and’ works with single True or False values, not arrays or Series. Use ‘&’ instead (and ‘|’ for or), wrapping each condition in parentheses.
Large Malignant Cases (Radius Mean > 15):
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.8 1001.0
1 842517 M 20.57 17.77 132.9 1326.0
2 84300903 M 19.69 21.25 130.0 1203.0
4 84358402 M 20.29 14.34 135.1 1297.0
6 844359 M 18.25 19.98 119.6 1040.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
4 0.10030 0.13280 0.1980 0.10430
6 0.09463 0.10900 0.1127 0.07400
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.6 2019.0 0.1622
1 ... 23.41 158.8 1956.0 0.1238
2 ... 25.53 152.5 1709.0 0.1444
4 ... 16.67 152.2 1575.0 0.1374
6 ... 27.66 153.2 1606.0 0.1442
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
4 0.2050 0.4000 0.1625 0.2364
6 0.2576 0.3784 0.1932 0.3063
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
4 0.07678 NaN
6 0.08368 NaN
[5 rows x 33 columns]
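A sketch of combining conditions with & (each condition wrapped in parentheses):

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["M", "B", "M"],
    "Radius Mean": [17.99, 11.42, 14.50],
})

# & combines element-wise; parentheses are required around each condition
large_malignant = df[(df["diagnosis"] == "M") & (df["Radius Mean"] > 15)]
print(large_malignant)
```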
You can create new columns in a DataFrame based on calculations using existing columns. For example, we can calculate the area_ratio by dividing area_worst by area_mean.
New Column 'area_ratio':
area_worst area_mean area_ratio
0 2019.0 1001.0 2.016983
1 1956.0 1326.0 1.475113
2 1709.0 1203.0 1.420615
3 567.7 386.1 1.470344
4 1575.0 1297.0 1.214341
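The area_ratio column above comes from simple element-wise arithmetic between two columns; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "area_mean": [1001.0, 386.1],
    "area_worst": [2019.0, 567.7],
})

# Element-wise division creates a new column
df["area_ratio"] = df["area_worst"] / df["area_mean"]
print(df)
```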
Suppose you have a DataFrame and want to update the value in the Radius Mean column (renamed earlier from radius_mean) at a particular index.
# Access and print the original value at index 0 in the 'Radius Mean' column
original_value = cancer_data.at[0, 'Radius Mean']
print("Original Radius Mean at index 0:", original_value)
# Change the value at index 0 in the 'Radius Mean' column to 18.5
cancer_data.at[0, 'Radius Mean'] = 18.5
# Verify the updated value
updated_value = cancer_data.at[0, 'Radius Mean']
print("Updated Radius Mean at index 0:", updated_value)
Original Radius Mean at index 0: 17.99
Updated Radius Mean at index 0: 18.5
You can sort a dataset by columns. Here’s how to sort by diagnosis first and then by area_mean in ascending order.
# Sort by 'diagnosis' first, then by 'area_mean' within each diagnosis group
sorted_by_diagnosis_area = cancer_data.sort_values(by=['diagnosis', 'area_mean'], ascending=[True, True])
print("Data sorted by Diagnosis and Area Mean:\n", sorted_by_diagnosis_area[['diagnosis', 'area_mean', 'Radius Mean']].head())Data sorted by Diagnosis and Area Mean:
diagnosis area_mean Radius Mean
101 B 143.5 6.981
539 B 170.4 7.691
538 B 178.8 7.729
568 B 181.0 7.760
46 B 201.9 8.196
You might also want to move a specific column to the end of the DataFrame, such as moving area_ratio to the last position.
# Move 'area_ratio' to the end of the DataFrame
columns_reordered = [col for col in cancer_data.columns if col != 'area_ratio'] + ['area_ratio']
cancer_data_with_area_ratio_last = cancer_data[columns_reordered]
# Display the reordered columns
print("Data with 'area_ratio' at the end:\n", cancer_data_with_area_ratio_last.head())Data with 'area_ratio' at the end:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 18.50 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... perimeter_worst area_worst smoothness_worst compactness_worst \
0 ... 184.60 2019.0 0.1622 0.6656
1 ... 158.80 1956.0 0.1238 0.1866
2 ... 152.50 1709.0 0.1444 0.4245
3 ... 98.87 567.7 0.2098 0.8663
4 ... 152.20 1575.0 0.1374 0.2050
concavity_worst concave points_worst symmetry_worst \
0 0.7119 0.2654 0.4601
1 0.2416 0.1860 0.2750
2 0.4504 0.2430 0.3613
3 0.6869 0.2575 0.6638
4 0.4000 0.1625 0.2364
fractal_dimension_worst Unnamed: 32 area_ratio
0 0.11890 NaN 2.016983
1 0.08902 NaN 1.475113
2 0.08758 NaN 1.420615
3 0.17300 NaN 1.470344
4 0.07678 NaN 1.214341
[5 rows x 34 columns]
In Pandas, you can chain multiple methods together to create a pipeline.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
"name": [" Alice ", "BOB", "Charlie", None],
"score": [85, 92, None, 74]
})
# Clean the data using method chaining
clean_df = (
df
.dropna() # Method: drop rows with any NaNs
.assign( # Method: add or update columns
name_clean=lambda d: d["name"].str.strip().str.title()
)
.sort_values("score", ascending=False) # Method: sort by score
)
print(clean_df)
name score name_clean
1 BOB 92.0 Bob
0 Alice 85.0 Alice
Why .str.strip() and not just .strip()? Calling .strip() directly on the Series raises AttributeError, because .strip() is a string method for single strings, not a Series of strings.
Why? - df["name"] is a Series — not a string. - .strip() is a string method that works on single strings. - .str is the accessor that tells pandas: “apply this string method to each element in the Series.”
| You have… | Use… | Why? |
|---|---|---|
| A single string | "hello".strip() | It’s just Python |
| A Series of strings | df["col"].str.strip() | It’s pandas, operating on many strings |
The .apply() method in pandas lets you apply a custom function to each element in a Series (column) or DataFrame. Here’s how to use it to categorize tumors based on area_mean.
Example: Categorizing Tumors by Size
Let’s create a custom function to categorize tumors as “Small”, “Medium”, or “Large” based on area_mean.
# Define a custom function to categorize tumors by area_mean
def categorize_tumor(size):
if size < 500:
return 'Small'
elif 500 <= size < 1000:
return 'Medium'
else:
return 'Large'
# Apply the function to the 'area_mean' column and create a new column 'tumor_size_category'
cancer_data['tumor_size_category'] = cancer_data['area_mean'].apply(categorize_tumor)
# Display the new column to verify the transformation
print("Tumor Size Categories:\n", cancer_data[['area_mean', 'tumor_size_category']].head())Tumor Size Categories:
area_mean tumor_size_category
0 1001.0 Large
1 1326.0 Large
2 1203.0 Large
3 386.1 Small
4 1297.0 Large
Lambda functions are useful for simple, one-line operations. For example, we can use a lambda function to convert diagnosis into numerical codes (0 for Benign, 1 for Malignant).
# Apply a lambda function to classify 'diagnosis' into numerical codes
cancer_data['diagnosis_code'] = cancer_data['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)
# Display the new column to verify the transformation
print("Diagnosis Codes:\n", cancer_data[['diagnosis', 'diagnosis_code']].head())Diagnosis Codes:
diagnosis diagnosis_code
0 M 1
1 M 1
2 M 1
3 M 1
4 M 1
You can also use .apply() with a lambda function for more complex, multi-condition classifications.
Example: Adding a Column with Risk Levels Suppose we want to create a new column, risk_level, based on both diagnosis and area_mean:
# Apply a lambda function with multiple conditions to create a 'risk_level' column
cancer_data['risk_level'] = cancer_data.apply(
lambda row: 'High Risk' if row['diagnosis'] == 'M' and row['area_mean'] > 1000
else ('Moderate Risk' if row['diagnosis'] == 'M' else 'Low Risk'), axis=1
)
# Display the new column to verify the transformation
print("Risk Levels:\n", cancer_data[['diagnosis', 'area_mean', 'risk_level']].head())
#Axis=1 tells the function to apply it to the rows. axis=0 (default) applies function to the columnsRisk Levels:
diagnosis area_mean risk_level
0 M 1001.0 High Risk
1 M 1326.0 High Risk
2 M 1203.0 High Risk
3 M 386.1 Moderate Risk
4 M 1297.0 High Risk
| You’re applying to… | Use .apply() on… | Do you need axis? |
|---|---|---|
| A single column (Series) | df['col'].apply(func) | No |
| Multiple columns (row-wise) | df.apply(func, axis=1) | Yes (axis=1) |
| Column-wise (less common) | df.apply(func) or axis=0 | Optional (default is 0) |
Missing values are common in data analysis. Python provides multiple ways to represent missing values, including None, np.nan, and pd.NA. Understanding their behavior is crucial for data cleaning, processing, and analysis.
None
None is a built-in Python object representing “no value.” Check for it with an identity test: if x is None: prints “x is missing”.
None does not support arithmetic: None + 1 raises TypeError: unsupported operand type(s). Comparing with == does work: None == None returns True.
np.nan
np.nan represents missing values in numerical computations. It is a floating-point value (float64), so printing it shows nan. Don’t test for it with ==, because np.nan == np.nan returns False; use np.isnan(x) instead, and if np.isnan(x): prints “x is missing”.
pd.NA
pd.NA is Pandas’ missing value representation, introduced in Pandas 1.0. It works with nullable dtypes (Int64, Float64, boolean, string) and displays as <NA>. Check for it with pd.isna(x), which prints “x is missing”; testing with == is incorrect, since pd.NA == pd.NA evaluates to <NA> rather than True.
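The correct check for each missing-value type can be sketched as:

```python
import numpy as np
import pandas as pd

x = None
assert x is None                      # identity check is the right test for None

y = np.nan
assert np.isnan(y)                    # use np.isnan(), never ==
assert not (y == y)                   # np.nan != np.nan, so == can never match

z = pd.NA
assert pd.isna(z)                     # pd.NA == pd.NA evaluates to <NA>, not True
assert pd.isna(None) and pd.isna(np.nan)  # pd.isna() also covers None and np.nan
```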
Comparing None, np.nan, and pd.NA

| Feature | None | np.nan | pd.NA |
|---|---|---|---|
| Type | NoneType | float64 | Special Pandas scalar |
| Use Case | General Python | NumPy/Pandas numeric data | Pandas nullable types |
| Arithmetic Ops | Fails (None + 1) | Works but returns nan | Works but returns <NA> |
| Comparison (==) | None == None → True | np.nan == np.nan → False | pd.NA == pd.NA → <NA> |
| Check Method | if x is None: | if np.isnan(x): | if pd.isna(x): |
A
0 1
1 NaN
2 3
3 None
4 <NA>
A
0 False
1 True
2 False
3 True
4 True
<class 'int'>
<class 'float'>
<class 'int'>
<class 'NoneType'>
<class 'pandas._libs.missing.NAType'>
1
<class 'numpy.int64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3
<class 'numpy.int64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Int64
Error: cannot convert NA to integer
1 : <class 'numpy.int64'>
<NA> : <class 'pandas._libs.missing.NAType'>
3 : <class 'numpy.int64'>
<NA> : <class 'pandas._libs.missing.NAType'>
<NA> : <class 'pandas._libs.missing.NAType'>
1.0 : <class 'float'>
nan : <class 'float'>
3.0 : <class 'float'>
nan : <class 'float'>
nan : <class 'float'>
All pd.NA values are “recast” as np.nan, since a plain float64 column can only represent missing values as floats. However, this is usually not what you want.
Best practice: cast the column to the Pandas nullable datatype Float64, so that missing values stay pd.NA.
1.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Float64
Summary:
- Use None for general Python objects.
- Use np.nan for numerical missing values in NumPy.
- Use pd.NA for missing values in Pandas with nullable data types.
- Use isna() or isnull() when working with missing data in Pandas.

| Pandas Nullable Data Type | Description | Typical Usage |
|---|---|---|
| Int8, Int16, Int32, Int64 | Nullable integer types (can hold pd.NA) | Use when you want integers with missing values |
| Float32, Float64 | Nullable float types (standard floats also support NaN) | Numeric data with decimals, missing values |
| boolean | Nullable Boolean type (True, False, pd.NA) | Binary categories with missing info |
| string | Pandas string data type (nullable) | Text data with potential nulls |
| category | Categorical type (can include NaN or pd.NA) | Categorical data, efficient storage |
| datetime64[ns] with pd.NaT | Datetime with nanosecond precision | Time series, datetime columns |
| timedelta64[ns] with pd.NaT | Timedeltas (differences between datetimes) | Duration calculations |
To drop missing values, you can use .dropna()
0 1.0
2 3.0
Name: A, dtype: Float64
To find missing values, you can use isnull() with sum() to calculate the total number of missing values in each column.
Missing Values per Column:
id 0
diagnosis 0
Radius Mean 0
Texture Mean 0
Perimeter Mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
area_ratio 0
tumor_size_category 0
diagnosis_code 0
risk_level 0
dtype: int64
Since Unnamed: 32 has no data, it can be dropped from the DataFrame using .drop().
Columns after dropping 'Unnamed: 32': ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'area_ratio', 'tumor_size_category', 'diagnosis_code', 'risk_level']
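The two steps — counting missing values per column and dropping the empty column — can be sketched as (remember .drop() returns a new DataFrame unless you reassign):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "Unnamed: 32": [np.nan, np.nan],  # all-missing column, as in the dataset
})

# Total number of missing values in each column
print(df.isnull().sum())

# Drop the empty column and reassign to keep the change
df = df.drop(columns=["Unnamed: 32"])
print(df.columns.tolist())
```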
You can use the great_tables Python module (from the great-tables package) to explore and display data from a dataset in a clean, presentation-ready format.
We’ll load the data, summarize it, and then build styled tables using great_tables.
((569, 32),
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... radius_worst texture_worst perimeter_worst area_worst \
0 ... 25.38 17.33 184.60 2019.0
1 ... 24.99 23.41 158.80 1956.0
2 ... 23.57 25.53 152.50 1709.0
3 ... 14.91 26.50 98.87 567.7
4 ... 22.54 16.67 152.20 1575.0
smoothness_worst compactness_worst concavity_worst concave points_worst \
0 0.1622 0.6656 0.7119 0.2654
1 0.1238 0.1866 0.2416 0.1860
2 0.1444 0.4245 0.4504 0.2430
3 0.2098 0.8663 0.6869 0.2575
4 0.1374 0.2050 0.4000 0.1625
symmetry_worst fractal_dimension_worst
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678
[5 rows x 32 columns])
from great_tables import GT
# Select a subset of the columns for preview
preview_df = df[['id', 'diagnosis', 'radius_mean', 'texture_mean', 'area_mean']].head(5)
GT(preview_df).tab_header(
title="Breast Cancer Diagnosis Preview",
subtitle="Selected features from the first 5 records"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
| Breast Cancer Diagnosis Preview | ||||
| Selected features from the first 5 records | ||||
| id | diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 1,001.00 |
| 842517 | M | 20.57 | 17.77 | 1,326.00 |
| 84300903 | M | 19.69 | 21.25 | 1,203.00 |
| 84348301 | M | 11.42 | 20.38 | 386.10 |
| 84358402 | M | 20.29 | 14.34 | 1,297.00 |
Let’s add conditional formatting to highlight larger tumor areas.
| Styled Cancer Data Table | ||||
| With conditional formatting on tumor area | ||||
| id | diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 1,001.00 |
| 842517 | M | 20.57 | 17.77 | 1,326.00 |
| 84300903 | M | 19.69 | 21.25 | 1,203.00 |
| 84348301 | M | 11.42 | 20.38 | 386.10 |
| 84358402 | M | 20.29 | 14.34 | 1,297.00 |
Let’s summarize average values by diagnosis group (malignant vs. benign).
summary_df = df.groupby("diagnosis")[["radius_mean", "texture_mean", "area_mean"]].mean().reset_index()
GT(summary_df).tab_header(
title="Group-wise Summary",
subtitle="Mean values for radius, texture, and area by diagnosis"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
| Group-wise Summary | |||
| Mean values for radius, texture, and area by diagnosis | |||
| diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|
| B | 12.15 | 17.91 | 462.79 |
| M | 17.46 | 21.60 | 978.38 |
Navigate to the follow-along file and try the practice problems!
Follow-along: https://github.com/PythonForRUsers/Python-Workshop.github.io/tree/main/FollowAlong
Pre-read: https://pythonforrusers.github.io/Python-Workshop.github.io/session4/sess4preread.html