"\n\ndf.to_excel('/path/to/directory/example.xlsx', index=False)\n\n"
Guide to Python Data Structures
Cancer Dataset
Pandas is a powerful open-source data analysis and manipulation library in Python. It provides data structures, primarily the DataFrame and Series, which are optimized for handling and analyzing large datasets efficiently.
Data Structures:
Series: A one-dimensional labeled array, suitable for handling single columns or rows of data.
DataFrame: A two-dimensional table with labeled axes (rows and columns), much like a spreadsheet or SQL table, allowing you to work with data in rows and columns simultaneously.
Data Manipulation:
Pandas has functions for merging, reshaping, and aggregating datasets, which helps streamline data cleaning and preparation.
It can handle missing data, making it easy to filter or fill gaps in datasets.
Data Analysis:
Pandas provides extensive functionality for descriptive statistics, grouping data, and handling time series.
It integrates well with other libraries, making it easy to move data to NumPy for numerical computations and to Matplotlib or Seaborn for visualization.
### Example Folder Structure
```
project_name/
    data/
        raw/
        processed/
    scripts/
    results/
    logs/
```
```python
# To create the folders
import os

# Define the working directory
base_dir = "G:\\dir_demo"

# Define the project folder
project_name = os.path.join(base_dir, "my_project")

# Define the subdirectories
subdirs = [
    "data/raw",
    "data/processed",
    "scripts",
    "results",
    "logs",
]

# Create the directories
for subdir in subdirs:
    path = os.path.join(project_name, subdir)
    os.makedirs(path, exist_ok=True)  # ensures no error if the folder already exists
    print(f"Created directory: {path}")
```
Created directory: G:\dir_demo\my_project\data/raw
Created directory: G:\dir_demo\my_project\data/processed
Created directory: G:\dir_demo\my_project\scripts
Created directory: G:\dir_demo\my_project\results
Created directory: G:\dir_demo\my_project\logs
Loading the Dataset
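The loading step itself is not shown above; here is a minimal sketch of the usual `pd.read_csv` call. The real file path is not given in this guide, so a tiny inline CSV (with made-up sample rows) stands in for it so the snippet runs anywhere:

```python
import io
import pandas as pd

# In practice this would be pd.read_csv("path/to/your/file.csv");
# the inline CSV below is a hypothetical stand-in for the real dataset.
csv_text = """id,diagnosis,radius_mean,area_mean
842302,M,17.99,1001.0
842517,M,20.57,1326.0
8510426,B,13.54,566.3
"""
cancer_data = pd.read_csv(io.StringIO(csv_text))
print(cancer_data.head())
print("Dataset Shape:", cancer_data.shape)
```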
<class 'pandas.core.frame.DataFrame'>
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
.shape returns a tuple with (number of rows, number of columns), which provides a basic overview of the dataset size.
Dataset Shape: (569, 33)
.info() lists all columns, their data types, and counts of non-null values, helping identify any columns that may have missing data.
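A quick sketch of the call on a small stand-in frame (the column names and values here are illustrative, not the full dataset; `empty_col` mirrors the all-null `Unnamed: 32` column):

```python
import numpy as np
import pandas as pd

# Stand-in frame; 'empty_col' mimics the dataset's all-NaN 'Unnamed: 32'
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "empty_col": [np.nan, np.nan, np.nan],
})

df.info()  # prints index info, each column's dtype, and non-null counts
```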
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Viewing Column Names
.columns lists column headers, while .tolist() converts it into a standard Python list for easier viewing.
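A minimal sketch on a stand-in frame (sample columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [842302], "diagnosis": ["M"], "radius_mean": [17.99]})

cols = df.columns.tolist()  # Index object -> plain Python list
print("Column Names:", cols)
```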
Column Names: ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
.describe() generates essential statistics (mean, std, min, max, percentiles) for numeric columns, useful for identifying data distributions.
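A small sketch of the call, using a few illustrative values rather than the full dataset:

```python
import pandas as pd

df = pd.DataFrame({"radius_mean": [17.99, 20.57, 13.54, 11.42]})

stats = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(stats)
```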
id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 5.690000e+02 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 0.0 |
mean | 3.037183e+07 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
std | 1.250206e+08 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
min | 8.670000e+03 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | NaN |
25% | 8.692180e+05 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | NaN |
50% | 9.060240e+05 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | NaN |
75% | 8.813129e+06 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | NaN |
max | 9.113205e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | NaN |
8 rows × 32 columns
Using value_counts() on a Single Column
This method is straightforward when you want to check the frequency distribution of one specific categorical column. It returns a pandas Series object.
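A sketch of the call on a small stand-in column (the sample values are illustrative):

```python
import pandas as pd

# Stand-in for the dataset's diagnosis column
cancer_data = pd.DataFrame({"diagnosis": ["B", "M", "B", "B", "M"]})

counts = cancer_data["diagnosis"].value_counts()  # returns a Series
print("Diagnosis Counts:")
print(counts)
```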
Diagnosis Counts:
diagnosis
B 357
M 212
Name: count, dtype: int64
To see summary statistics grouped by a categorical variable in pandas, you can use the groupby() method along with describe() or specific aggregation functions like mean(), sum(), etc.
```python
# Group by 'diagnosis' and get summary statistics for each group
grouped_summary = cancer_data.groupby('diagnosis').mean()
print(grouped_summary)

# Group by 'diagnosis' and get summary statistics for only one variable
grouped_radius_mean = cancer_data.groupby('diagnosis')['radius_mean'].mean()
print(grouped_radius_mean)
```
id radius_mean texture_mean perimeter_mean \
diagnosis
B 2.654382e+07 12.146524 17.914762 78.075406
M 3.681805e+07 17.462830 21.604906 115.365377
area_mean smoothness_mean compactness_mean concavity_mean \
diagnosis
B 462.790196 0.092478 0.080085 0.046058
M 978.376415 0.102898 0.145188 0.160775
concave points_mean symmetry_mean ... texture_worst \
diagnosis ...
B 0.025717 0.174186 ... 23.515070
M 0.087990 0.192909 ... 29.318208
perimeter_worst area_worst smoothness_worst compactness_worst \
diagnosis
B 87.005938 558.899440 0.124959 0.182673
M 141.370330 1422.286321 0.144845 0.374824
concavity_worst concave points_worst symmetry_worst \
diagnosis
B 0.166238 0.074444 0.270246
M 0.450606 0.182237 0.323468
fractal_dimension_worst Unnamed: 32
diagnosis
B 0.079442 NaN
M 0.091530 NaN
[2 rows x 32 columns]
diagnosis
B 12.146524
M 17.462830
Name: radius_mean, dtype: float64
To make column names more readable or consistent, you can use rename() to change specific names. Here’s how to rename columns like radius_mean to Radius Mean.
```python
# Rename specific columns for readability: {'old': 'new'}
new_columns = {
    'radius_mean': 'Radius Mean',
    'texture_mean': 'Texture Mean',
    'perimeter_mean': 'Perimeter Mean'
}
cancer_data = cancer_data.rename(columns=new_columns)

# Display the new column names to verify the changes
print("\nUpdated Column Names:", cancer_data.columns.tolist())
```
Updated Column Names: ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
Missing values are common in data analysis. Python provides multiple ways to represent missing values, including `None`, `np.nan`, and `pd.NA`. Understanding their behavior is crucial for data cleaning, processing, and analysis.

Missing Values in Python: `None`

`None` is a built-in Python object representing "no value."

x is missing

`np.nan`

`np.nan` represents missing values in numerical computations. It is a floating-point value (`float64`). Avoid comparing with `==`, because `np.nan != np.nan`.

x is missing
nan
False

`pd.NA`

`pd.NA` is Pandas' missing value representation, introduced in Pandas 1.0.

- Works with nullable data types (`Int64`, `Float64`, `boolean`, `string`).
- Avoids automatic type conversion (e.g., integers remain integers).

Behavior in Math Operations

<NA>
x is missing

Incorrect way: checking with `x == pd.NA` — the comparison returns `<NA>` rather than a boolean:

<NA>
Comparison of `None`, `np.nan`, and `pd.NA`:

| Feature | `None` | `np.nan` | `pd.NA` |
|---|---|---|---|
| Type | `NoneType` | `float64` | Special Pandas scalar |
| Use Case | General Python | NumPy/Pandas numeric data | Pandas nullable types |
| Arithmetic Ops | Fails (`None + 1`) | Works but returns `nan` | Works but returns `<NA>` |
| Comparison (`==`) | `None == None` → `True` | `np.nan == np.nan` → `False` | `pd.NA == pd.NA` → `<NA>` |
| Check Method | `if x is None:` | `if np.isnan(x):` | `if pd.isna(x):` |
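The check methods above can be exercised directly; a short sketch:

```python
import numpy as np
import pandas as pd

x, y, z = None, np.nan, pd.NA

# Each representation has its own idiomatic check...
checks = [x is None, np.isnan(y), pd.isna(z)]
print(checks)

# ...but pd.isna() handles all three uniformly
print(pd.isna(x), pd.isna(y), pd.isna(z))
```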
A
0 False
1 True
2 False
3 True
4 True
<class 'int'>
<class 'float'>
<class 'int'>
<class 'NoneType'>
<class 'pandas._libs.missing.NAType'>
Filling Missing Values
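A sketch of the fill/drop behavior shown below, assuming a small nullable `Float64` column (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

# Cast to pandas' nullable Float64 so missing entries display as <NA>
s = pd.Series([1.0, np.nan, 3.0], dtype="Float64")
print(s)

filled = s.fillna(0)   # replace missing values
dropped = s.dropna()   # or remove them entirely
print(filled.tolist())
print(dropped.tolist())
```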
1.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Float64
Make sure you typecast the column as a pandas nullable data type.
1.0
<class 'float'>
nan
<class 'float'>
3.0
<class 'float'>
nan
<class 'float'>
nan
<class 'float'>
float64
0 1.0
2 3.0
Name: A, dtype: float64
In short:

- `None` for general Python objects.
- `np.nan` for numerical missing values in NumPy.
- `pd.NA` for missing values in Pandas with nullable data types.
- Use `isna()` or `isnull()` when working with missing data in Pandas.

| Pandas Nullable Data Type | Description | Typical Usage |
|---|---|---|
| `Int8`, `Int16`, `Int32`, `Int64` | Nullable integer types (can hold `pd.NA`) | Use when you want integers with missing values |
| `Float32`, `Float64` | Nullable float types (standard floats also support `NaN`) | Numeric data with decimals, missing values |
| `boolean` | Nullable Boolean type (`True`, `False`, `pd.NA`) | Binary categories with missing info |
| `string` | Pandas string data type (nullable) | Text data with potential nulls |
| `category` | Categorical type (can include `NaN` or `pd.NA`) | Categorical data, efficient storage |
| `datetime64[ns]` with `pd.NaT` | Datetime with nanosecond precision | Time series, datetime columns |
| `timedelta64[ns]` with `pd.NaT` | Timedeltas (differences between datetimes) | Duration calculations |
To find missing values, you can use isnull() with sum() to calculate the total number of missing values in each column.
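A sketch of the call on a stand-in frame (`Unnamed: 32` here is a deliberately all-NaN column, mirroring the dataset):

```python
import numpy as np
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

missing_per_column = cancer_data.isnull().sum()
print("Missing Values per Column:")
print(missing_per_column)
```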
Missing Values per Column:
id 0
diagnosis 0
Radius Mean 0
Texture Mean 0
Perimeter Mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
dtype: int64
Dropping Columns with Excessive Missing Data

Since `Unnamed: 32` has no data, it can be dropped from the DataFrame using `.drop()`.
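A sketch of the drop on a stand-in frame (sample columns are illustrative):

```python
import numpy as np
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "Unnamed: 32": [np.nan, np.nan],
})

# axis=1 drops a column; axis=0 (the default) would drop a row label
cancer_data = cancer_data.drop("Unnamed: 32", axis=1)
print("Columns after dropping 'Unnamed: 32':", cancer_data.columns.tolist())
```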
Columns after dropping 'Unnamed: 32': ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
Selecting specific columns is essential for focusing on particular aspects of the dataset. Here are some examples of both single and multiple column selections.
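A sketch of both selection styles on a stand-in frame (sample values are illustrative):

```python
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "Radius Mean": [17.99, 20.57, 13.54],
    "area_mean": [1001.0, 1326.0, 566.3],
})

# Single brackets with one name -> Series
diagnosis_col = cancer_data["diagnosis"]
print(diagnosis_col)

# Double brackets (a list of names) -> DataFrame
selected = cancer_data[["diagnosis", "Radius Mean", "area_mean"]]
print(selected)
```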
Diagnosis Column:
0 M
1 M
2 M
3 M
4 M
Name: diagnosis, dtype: object
Alternatively, you can select multiple columns.
Selected Columns:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Selecting rows based on labels or positions is helpful for inspecting specific data points or subsets.
Label-Based Indexing with loc
loc allows selection based on labels (e.g., column names or index labels) and is particularly useful for data subsets.
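A sketch of the call on a stand-in frame; note that label slices with `loc` include both endpoints:

```python
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "Radius Mean": [17.99, 20.57, 13.54],
    "area_mean": [1001.0, 1326.0, 566.3],
})

# loc[row_labels, column_labels]; the label slice 0:1 includes both rows
subset = cancer_data.loc[0:1, ["diagnosis", "Radius Mean"]]
print(subset)
```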
Selected Rows with loc:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Integer-Based Indexing with iloc
iloc allows selection based purely on integer positions, making it convenient for slicing and position-based operations.
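A sketch on a stand-in frame; unlike `loc`, integer slices with `iloc` exclude the end position:

```python
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "Radius Mean": [17.99, 20.57, 13.54],
    "Texture Mean": [10.38, 17.77, 14.36],
})

# iloc[row_positions, column_positions]; 0:2 selects positions 0 and 1
subset = cancer_data.iloc[0:2, 0:2]
print(subset)
```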
Selected Rows with iloc:
diagnosis Radius Mean Texture Mean
0 M 17.99 10.38
1 M 20.57 17.77
2 M 19.69 21.25
3 M 11.42 20.38
4 M 20.29 14.34
Filtering enables you to create subsets of data that match specific conditions. For example, we can filter by diagnosis to analyze only malignant (M) or benign (B) cases.
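A sketch of the boolean-mask filter on a stand-in frame:

```python
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "B", "M"],
    "Radius Mean": [17.99, 13.54, 20.57],
})

# The boolean mask keeps only rows where the condition is True
malignant = cancer_data[cancer_data["diagnosis"] == "M"]
print(malignant)
```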
Malignant Cases:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
5 843786 M 12.45 15.70 82.57 477.1
6 844359 M 18.25 19.98 119.60 1040.0
7 84458202 M 13.71 20.83 90.20 577.9
8 844981 M 13.00 21.82 87.50 519.8
9 84501001 M 12.46 24.04 83.97 475.9
10 845636 M 16.02 23.24 102.70 797.8
11 84610002 M 15.78 17.89 103.60 781.0
12 846226 M 19.17 24.80 132.40 1123.0
13 846381 M 15.85 23.95 103.70 782.7
14 84667401 M 13.73 22.61 93.60 578.3
15 84799002 M 14.54 27.54 96.73 658.8
16 848406 M 14.68 20.13 94.74 684.5
17 84862001 M 16.13 20.68 108.10 798.8
18 849014 M 19.81 22.15 130.00 1260.0
22 8511133 M 15.34 14.26 102.50 704.4
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.30010 0.14710
1 0.08474 0.07864 0.08690 0.07017
2 0.10960 0.15990 0.19740 0.12790
3 0.14250 0.28390 0.24140 0.10520
4 0.10030 0.13280 0.19800 0.10430
5 0.12780 0.17000 0.15780 0.08089
6 0.09463 0.10900 0.11270 0.07400
7 0.11890 0.16450 0.09366 0.05985
8 0.12730 0.19320 0.18590 0.09353
9 0.11860 0.23960 0.22730 0.08543
10 0.08206 0.06669 0.03299 0.03323
11 0.09710 0.12920 0.09954 0.06606
12 0.09740 0.24580 0.20650 0.11180
13 0.08401 0.10020 0.09938 0.05364
14 0.11310 0.22930 0.21280 0.08025
15 0.11390 0.15950 0.16390 0.07364
16 0.09867 0.07200 0.07395 0.05259
17 0.11700 0.20220 0.17220 0.10280
18 0.09831 0.10270 0.14790 0.09498
22 0.10730 0.21350 0.20770 0.09756
... radius_worst texture_worst perimeter_worst area_worst \
0 ... 25.38 17.33 184.60 2019.0
1 ... 24.99 23.41 158.80 1956.0
2 ... 23.57 25.53 152.50 1709.0
3 ... 14.91 26.50 98.87 567.7
4 ... 22.54 16.67 152.20 1575.0
5 ... 15.47 23.75 103.40 741.6
6 ... 22.88 27.66 153.20 1606.0
7 ... 17.06 28.14 110.60 897.0
8 ... 15.49 30.73 106.20 739.3
9 ... 15.09 40.68 97.65 711.4
10 ... 19.19 33.88 123.80 1150.0
11 ... 20.42 27.28 136.50 1299.0
12 ... 20.96 29.94 151.70 1332.0
13 ... 16.84 27.66 112.00 876.5
14 ... 15.03 32.01 108.80 697.7
15 ... 17.46 37.13 124.10 943.2
16 ... 19.07 30.88 123.40 1138.0
17 ... 20.96 31.48 136.80 1315.0
18 ... 27.32 30.88 186.80 2398.0
22 ... 18.07 19.08 125.10 980.9
smoothness_worst compactness_worst concavity_worst \
0 0.1622 0.6656 0.7119
1 0.1238 0.1866 0.2416
2 0.1444 0.4245 0.4504
3 0.2098 0.8663 0.6869
4 0.1374 0.2050 0.4000
5 0.1791 0.5249 0.5355
6 0.1442 0.2576 0.3784
7 0.1654 0.3682 0.2678
8 0.1703 0.5401 0.5390
9 0.1853 1.0580 1.1050
10 0.1181 0.1551 0.1459
11 0.1396 0.5609 0.3965
12 0.1037 0.3903 0.3639
13 0.1131 0.1924 0.2322
14 0.1651 0.7725 0.6943
15 0.1678 0.6577 0.7026
16 0.1464 0.1871 0.2914
17 0.1789 0.4233 0.4784
18 0.1512 0.3150 0.5372
22 0.1390 0.5954 0.6305
concave points_worst symmetry_worst fractal_dimension_worst
0 0.26540 0.4601 0.11890
1 0.18600 0.2750 0.08902
2 0.24300 0.3613 0.08758
3 0.25750 0.6638 0.17300
4 0.16250 0.2364 0.07678
5 0.17410 0.3985 0.12440
6 0.19320 0.3063 0.08368
7 0.15560 0.3196 0.11510
8 0.20600 0.4378 0.10720
9 0.22100 0.4366 0.20750
10 0.09975 0.2948 0.08452
11 0.18100 0.3792 0.10480
12 0.17670 0.3176 0.10230
13 0.11190 0.2809 0.06287
14 0.22080 0.3596 0.14310
15 0.17120 0.4218 0.13410
16 0.16090 0.3029 0.08216
17 0.20730 0.3706 0.11420
18 0.23880 0.2768 0.07615
22 0.23930 0.4667 0.09946
[20 rows x 32 columns]
You can also filter based on multiple conditions, such as finding rows where the diagnosis is “M” and radius_mean is greater than 15.
Note: You can't use Python's `and` operator here, because `and` works with single True or False values, not arrays or Series. Use the element-wise `&` operator instead, with each condition in parentheses.
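A sketch of the combined filter on a stand-in frame:

```python
import pandas as pd

cancer_data = pd.DataFrame({
    "diagnosis": ["M", "B", "M"],
    "Radius Mean": [17.99, 13.54, 14.20],
})

# Element-wise & (not `and`); parenthesize each condition
large_malignant = cancer_data[
    (cancer_data["diagnosis"] == "M") & (cancer_data["Radius Mean"] > 15)
]
print(large_malignant)
```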
Large Malignant Cases (Radius Mean > 15):
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.8 1001.0
1 842517 M 20.57 17.77 132.9 1326.0
2 84300903 M 19.69 21.25 130.0 1203.0
4 84358402 M 20.29 14.34 135.1 1297.0
6 844359 M 18.25 19.98 119.6 1040.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
4 0.10030 0.13280 0.1980 0.10430
6 0.09463 0.10900 0.1127 0.07400
... radius_worst texture_worst perimeter_worst area_worst \
0 ... 25.38 17.33 184.6 2019.0
1 ... 24.99 23.41 158.8 1956.0
2 ... 23.57 25.53 152.5 1709.0
4 ... 22.54 16.67 152.2 1575.0
6 ... 22.88 27.66 153.2 1606.0
smoothness_worst compactness_worst concavity_worst concave points_worst \
0 0.1622 0.6656 0.7119 0.2654
1 0.1238 0.1866 0.2416 0.1860
2 0.1444 0.4245 0.4504 0.2430
4 0.1374 0.2050 0.4000 0.1625
6 0.1442 0.2576 0.3784 0.1932
symmetry_worst fractal_dimension_worst
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
4 0.2364 0.07678
6 0.3063 0.08368
[5 rows x 32 columns]
You can create new columns in a DataFrame based on calculations using existing columns. For example, we can calculate the area_ratio by dividing area_worst by area_mean.
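A sketch of the calculation on a stand-in frame (the two sample rows are illustrative):

```python
import pandas as pd

cancer_data = pd.DataFrame({
    "area_worst": [2019.0, 1956.0],
    "area_mean": [1001.0, 1326.0],
})

# Element-wise division creates the new column in one vectorized step
cancer_data["area_ratio"] = cancer_data["area_worst"] / cancer_data["area_mean"]
print(cancer_data[["area_worst", "area_mean", "area_ratio"]].head())
```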
New Column 'area_ratio':
area_worst area_mean area_ratio
0 2019.0 1001.0 2.016983
1 1956.0 1326.0 1.475113
2 1709.0 1203.0 1.420615
3 567.7 386.1 1.470344
4 1575.0 1297.0 1.214341
Changing a Value Using .at
Suppose you have a DataFrame and want to update the value in the radius_mean column for a particular index.
```python
# Access and print the original value at index 0, column 'Radius Mean'
original_value = cancer_data.at[0, 'Radius Mean']
print("Original Radius Mean at index 0:", original_value)

# Change the value at index 0, column 'Radius Mean' to 18.5
cancer_data.at[0, 'Radius Mean'] = 18.5

# Verify the updated value
updated_value = cancer_data.at[0, 'Radius Mean']
print("Updated Radius Mean at index 0:", updated_value)
```
Original Radius Mean at index 0: 17.99
Updated Radius Mean at index 0: 18.5
You can sort a dataset by columns. Here’s how to sort by diagnosis first and then by area_mean in ascending order.
```python
# Sort by 'diagnosis' first, then by 'area_mean' within each diagnosis group
sorted_by_diagnosis_area = cancer_data.sort_values(by=['diagnosis', 'area_mean'], ascending=[True, True])
print("Data sorted by Diagnosis and Area Mean:\n", sorted_by_diagnosis_area[['diagnosis', 'area_mean', 'Radius Mean']].head())
```
Data sorted by Diagnosis and Area Mean:
diagnosis area_mean Radius Mean
101 B 143.5 6.981
539 B 170.4 7.691
538 B 178.8 7.729
568 B 181.0 7.760
46 B 201.9 8.196
You might also want to move a specific column to the end of the DataFrame, such as moving area_ratio to the last position.
```python
# Move 'area_ratio' to the end of the DataFrame
columns_reordered = [col for col in cancer_data.columns if col != 'area_ratio'] + ['area_ratio']
cancer_data_with_area_ratio_last = cancer_data[columns_reordered]

# Display the reordered columns
print("Data with 'area_ratio' at the end:\n", cancer_data_with_area_ratio_last.head())
```
Data with 'area_ratio' at the end:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 18.50 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.60 2019.0 0.1622
1 ... 23.41 158.80 1956.0 0.1238
2 ... 25.53 152.50 1709.0 0.1444
3 ... 26.50 98.87 567.7 0.2098
4 ... 16.67 152.20 1575.0 0.1374
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364
fractal_dimension_worst area_ratio
0 0.11890 2.016983
1 0.08902 1.475113
2 0.08758 1.420615
3 0.17300 1.470344
4 0.07678 1.214341
[5 rows x 33 columns]
In Pandas, you can chain multiple methods together to create a pipeline.
```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "Charlie", None],
    "score": [85, 92, None, 74]
})

# Clean the data using method chaining
clean_df = (
    df
    .dropna()                               # drop rows with any NaNs
    .assign(                                # add or update columns
        name_clean=lambda d: d["name"].str.strip().str.title()
    )
    .sort_values("score", ascending=False)  # sort by score
)

print(clean_df)
```
name score name_clean
1 BOB 92.0 Bob
0 Alice 85.0 Alice
Why `.str.strip()` and not just `.strip()`?

Calling `.strip()` directly on the column raises `AttributeError`, because `.strip()` is meant for single strings, not a Series of strings.

- `df["name"]` is a Series — not a string.
- `.strip()` is a string method that works on single strings.
- `.str` is the accessor that tells pandas: "apply this string method to each element in the Series."
| You have… | Use… | Why? |
|---|---|---|
| A single string | `"hello".strip()` | It's just Python |
| A Series of strings | `df["col"].str.strip()` | It's pandas, operating on many strings |
Applying Functions to Columns
Using apply() to Apply Custom Functions
The .apply() method in pandas lets you apply a custom function to each element in a Series (column) or DataFrame. Here’s how to use it to categorize tumors based on area_mean.
Example: Categorizing Tumors by Size

Let's create a custom function to categorize tumors as "Small", "Medium", or "Large" based on area_mean.
```python
# Define a custom function to categorize tumors by area_mean
def categorize_tumor(size):
    if size < 500:
        return 'Small'
    elif 500 <= size < 1000:
        return 'Medium'
    else:
        return 'Large'

# Apply the function to 'area_mean' and create 'tumor_size_category'
cancer_data['tumor_size_category'] = cancer_data['area_mean'].apply(categorize_tumor)

# Display the new column to verify the transformation
print("Tumor Size Categories:\n", cancer_data[['area_mean', 'tumor_size_category']].head())
```
Tumor Size Categories:
area_mean tumor_size_category
0 1001.0 Large
1 1326.0 Large
2 1203.0 Large
3 386.1 Small
4 1297.0 Large
Using Lambda Functions for Quick Transformations
Lambda functions are useful for simple, one-line operations. For example, we can use a lambda function to convert diagnosis into numerical codes (0 for Benign, 1 for Malignant).
```python
# Apply a lambda function to classify 'diagnosis' into numerical codes
cancer_data['diagnosis_code'] = cancer_data['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

# Display the new column to verify the transformation
print("Diagnosis Codes:\n", cancer_data[['diagnosis', 'diagnosis_code']].head())
```
Diagnosis Codes:
diagnosis diagnosis_code
0 M 1
1 M 1
2 M 1
3 M 1
4 M 1
You can also use apply() with a lambda function for more complex, multi-condition classifications.
Example: Adding a Column with Risk Levels

Suppose we want to create a new column, risk_level, based on both diagnosis and area_mean:
```python
# Apply a lambda function with multiple conditions to create 'risk_level'
# axis=1 applies the function to each row; axis=0 (the default) applies it to each column
cancer_data['risk_level'] = cancer_data.apply(
    lambda row: 'High Risk' if row['diagnosis'] == 'M' and row['area_mean'] > 1000
    else ('Moderate Risk' if row['diagnosis'] == 'M' else 'Low Risk'),
    axis=1
)

# Display the new column to verify the transformation
print("Risk Levels:\n", cancer_data[['diagnosis', 'area_mean', 'risk_level']].head())
```
Risk Levels:
diagnosis area_mean risk_level
0 M 1001.0 High Risk
1 M 1326.0 High Risk
2 M 1203.0 High Risk
3 M 386.1 Moderate Risk
4 M 1297.0 High Risk
| You're applying to… | Use `.apply()` on… | Do you need `axis`? |
|---|---|---|
| A single column (Series) | `df['col'].apply(func)` | No |
| Multiple columns (row-wise) | `df.apply(func, axis=1)` | Yes (`axis=1`) |
| Column-wise (less common) | `df.apply(func)` or `axis=0` | Optional (default is `0`) |
You can use the `great_tables` Python module from the great-tables package to explore and display data from a dataset in a clean and interactive format. We'll load the data, summarize it, and then build styled tables using `great_tables`.
((569, 32),
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... radius_worst texture_worst perimeter_worst area_worst \
0 ... 25.38 17.33 184.60 2019.0
1 ... 24.99 23.41 158.80 1956.0
2 ... 23.57 25.53 152.50 1709.0
3 ... 14.91 26.50 98.87 567.7
4 ... 22.54 16.67 152.20 1575.0
smoothness_worst compactness_worst concavity_worst concave points_worst \
0 0.1622 0.6656 0.7119 0.2654
1 0.1238 0.1866 0.2416 0.1860
2 0.1444 0.4245 0.4504 0.2430
3 0.2098 0.8663 0.6869 0.2575
4 0.1374 0.2050 0.4000 0.1625
symmetry_worst fractal_dimension_worst
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678
[5 rows x 32 columns])
```python
from great_tables import GT

# Select a subset of the columns for preview
preview_df = df[['id', 'diagnosis', 'radius_mean', 'texture_mean', 'area_mean']].head(5)

GT(preview_df).tab_header(
    title="Breast Cancer Diagnosis Preview",
    subtitle="Selected features from the first 5 records"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
```
Breast Cancer Diagnosis Preview | ||||
Selected features from the first 5 records | ||||
id | diagnosis | radius_mean | texture_mean | area_mean |
---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 1,001.00 |
842517 | M | 20.57 | 17.77 | 1,326.00 |
84300903 | M | 19.69 | 21.25 | 1,203.00 |
84348301 | M | 11.42 | 20.38 | 386.10 |
84358402 | M | 20.29 | 14.34 | 1,297.00 |
Let’s add conditional formatting to highlight larger tumor areas.
Styled Cancer Data Table | ||||
With conditional formatting on tumor area | ||||
id | diagnosis | radius_mean | texture_mean | area_mean |
---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 1,001.00 |
842517 | M | 20.57 | 17.77 | 1,326.00 |
84300903 | M | 19.69 | 21.25 | 1,203.00 |
84348301 | M | 11.42 | 20.38 | 386.10 |
84358402 | M | 20.29 | 14.34 | 1,297.00 |
Let’s summarize average values by diagnosis group (malignant vs. benign).
```python
summary_df = df.groupby("diagnosis")[["radius_mean", "texture_mean", "area_mean"]].mean().reset_index()

GT(summary_df).tab_header(
    title="Group-wise Summary",
    subtitle="Mean values for radius, texture, and area by diagnosis"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
```
Group-wise Summary | |||
Mean values for radius, texture, and area by diagnosis | |||
diagnosis | radius_mean | texture_mean | area_mean |
---|---|---|---|
B | 12.15 | 17.91 | 462.79 |
M | 17.46 | 21.60 | 978.38 |
Navigate to the follow-along file and try the practice problems!