"\n\ndf.to_excel('/path/to/directory/example.xlsx', index=False)\n\n"
Session 3: Intro to Pandas and great_tables
Links
Guide to Python Data Structures
Cancer Dataset
What Is Pandas? 🐼
Pandas is a powerful open-source data analysis and manipulation library in Python. It provides data structures, primarily the DataFrame and Series, which are optimized for handling and analyzing large datasets efficiently.
Data Structures:
Series: A one-dimensional labeled array, suitable for handling single columns or rows of data.
DataFrame: A two-dimensional table with labeled axes (rows and columns), much like a spreadsheet or SQL table, allowing you to work with data in rows and columns simultaneously.
Data Manipulation:
Pandas has functions for merging, reshaping, and aggregating datasets, which helps streamline data cleaning and preparation.
It can handle missing data, making it easy to filter or fill gaps in datasets.
Data Analysis:
Pandas provides extensive functionality for descriptive statistics, grouping data, and handling time series.
Integrates well with other libraries, making it easy to move data between libraries like NumPy for numerical computations and Matplotlib or Seaborn for visualization.
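A minimal sketch of the two core structures (the values and column names here are illustrative):

```python
import pandas as pd

# Series: a one-dimensional labeled array
s = pd.Series([17.99, 20.57, 19.69], name="radius_mean")

# DataFrame: a two-dimensional labeled table built from columns
df = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 19.69],
})

print(s.mean())   # vectorized arithmetic on a Series
print(df.shape)   # (3, 2): three rows, two columns
```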
Activate Python Environment and Download Cancer Dataset
Cancer Dataset is located here 👇
Loading the Dataset
<class 'pandas.core.frame.DataFrame'>
| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
Viewing Basic Information
Checking the Dataset’s Shape
.shape returns a tuple with (number of rows, number of columns), which provides a basic overview of the dataset size.
Summarizing Column Information
.info() lists all columns, their data types, and counts of non-null values, helping identify any columns that may have missing data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Viewing Column Names
.columns lists column headers, while .tolist() converts it into a standard Python list for easier viewing.
Column Names: ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
Summary Statistics
.describe() generates essential statistics (mean, std, min, max, percentiles) for numeric columns, useful for identifying data distributions.
| | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.690000e+02 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 0.0 |
| mean | 3.037183e+07 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 1.250206e+08 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 8.670000e+03 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 | NaN |
| 25% | 8.692180e+05 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 | NaN |
| 50% | 9.060240e+05 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 | NaN |
| 75% | 8.813129e+06 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 | NaN |
| max | 9.113205e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 | NaN |
8 rows × 32 columns
Using value_counts() on a Single Column
This method is the most direct way to check the frequency distribution of a single categorical column. It returns a pandas Series, indexed by category, with the counts as values.
Grouping by Categorical Variable
To see summary statistics grouped by a categorical variable in pandas, you can use the groupby() method along with describe() or specific aggregation functions like mean(), sum(), etc.
# Group by 'diagnosis' and get summary statistics for each group
grouped_summary = cancer_data.groupby('diagnosis').mean()
print(grouped_summary)
#Group by 'diagnosis' and get summary statistics for only one variable
grouped_radius_mean = cancer_data.groupby('diagnosis')['radius_mean'].mean()
print(grouped_radius_mean)
id radius_mean texture_mean perimeter_mean \
diagnosis
B 2.654382e+07 12.146524 17.914762 78.075406
M 3.681805e+07 17.462830 21.604906 115.365377
area_mean smoothness_mean compactness_mean concavity_mean \
diagnosis
B 462.790196 0.092478 0.080085 0.046058
M 978.376415 0.102898 0.145188 0.160775
concave points_mean symmetry_mean ... texture_worst \
diagnosis ...
B 0.025717 0.174186 ... 23.515070
M 0.087990 0.192909 ... 29.318208
perimeter_worst area_worst smoothness_worst compactness_worst \
diagnosis
B 87.005938 558.899440 0.124959 0.182673
M 141.370330 1422.286321 0.144845 0.374824
concavity_worst concave points_worst symmetry_worst \
diagnosis
B 0.166238 0.074444 0.270246
M 0.450606 0.182237 0.323468
fractal_dimension_worst Unnamed: 32
diagnosis
B 0.079442 NaN
M 0.091530 NaN
[2 rows x 32 columns]
diagnosis
B 12.146524
M 17.462830
Name: radius_mean, dtype: float64
Renaming Columns
To make column names more readable or consistent, you can use rename() to change specific names. Here’s how to rename columns like radius_mean to Radius Mean.
# Rename specific columns for readability. 'old': 'new'
new_columns={
'radius_mean': 'Radius Mean',
'texture_mean': 'Texture Mean',
'perimeter_mean': 'Perimeter Mean'
}
cancer_data = cancer_data.rename(columns=new_columns)
# Display the new column names to verify the changes
print("\nUpdated Column Names:", cancer_data.columns.tolist())
Updated Column Names: ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']
Column Selection
Selecting specific columns is essential for focusing on particular aspects of the dataset. Here are some examples of both single and multiple column selections.
Diagnosis Column:
0 M
1 M
2 M
3 M
4 M
Name: diagnosis, dtype: object
Alternatively, you can select multiple columns.
Selected Columns:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Row Selection and Column Selection with loc and iloc
Selecting rows based on labels (loc)
- loc allows selection based on labels (e.g., column names or index labels) and is particularly useful for data subsets.
- loc uses row labels, so it is endpoint-inclusive.
Selected Rows with loc:
diagnosis Radius Mean area_mean
0 M 17.99 1001.0
1 M 20.57 1326.0
2 M 19.69 1203.0
3 M 11.42 386.1
4 M 20.29 1297.0
Integer-Based Indexing with iloc
- iloc allows selection based purely on integer positions, making it convenient for slicing and position-based operations.
- iloc uses row positions, making it endpoint-exclusive (similar to Python slicing)
Selected Rows with iloc:
diagnosis Radius Mean Texture Mean
0 M 17.99 10.38
1 M 20.57 17.77
2 M 19.69 21.25
3 M 11.42 20.38
4 M 20.29 14.34
Filtering
Filtering enables you to create subsets of data that match specific conditions. For example, we can filter by diagnosis to analyze only malignant (M) or benign (B) cases.
Malignant Cases:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
5 843786 M 12.45 15.70 82.57 477.1
6 844359 M 18.25 19.98 119.60 1040.0
7 84458202 M 13.71 20.83 90.20 577.9
8 844981 M 13.00 21.82 87.50 519.8
9 84501001 M 12.46 24.04 83.97 475.9
10 845636 M 16.02 23.24 102.70 797.8
11 84610002 M 15.78 17.89 103.60 781.0
12 846226 M 19.17 24.80 132.40 1123.0
13 846381 M 15.85 23.95 103.70 782.7
14 84667401 M 13.73 22.61 93.60 578.3
15 84799002 M 14.54 27.54 96.73 658.8
16 848406 M 14.68 20.13 94.74 684.5
17 84862001 M 16.13 20.68 108.10 798.8
18 849014 M 19.81 22.15 130.00 1260.0
22 8511133 M 15.34 14.26 102.50 704.4
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.30010 0.14710
1 0.08474 0.07864 0.08690 0.07017
2 0.10960 0.15990 0.19740 0.12790
3 0.14250 0.28390 0.24140 0.10520
4 0.10030 0.13280 0.19800 0.10430
5 0.12780 0.17000 0.15780 0.08089
6 0.09463 0.10900 0.11270 0.07400
7 0.11890 0.16450 0.09366 0.05985
8 0.12730 0.19320 0.18590 0.09353
9 0.11860 0.23960 0.22730 0.08543
10 0.08206 0.06669 0.03299 0.03323
11 0.09710 0.12920 0.09954 0.06606
12 0.09740 0.24580 0.20650 0.11180
13 0.08401 0.10020 0.09938 0.05364
14 0.11310 0.22930 0.21280 0.08025
15 0.11390 0.15950 0.16390 0.07364
16 0.09867 0.07200 0.07395 0.05259
17 0.11700 0.20220 0.17220 0.10280
18 0.09831 0.10270 0.14790 0.09498
22 0.10730 0.21350 0.20770 0.09756
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.60 2019.0 0.1622
1 ... 23.41 158.80 1956.0 0.1238
2 ... 25.53 152.50 1709.0 0.1444
3 ... 26.50 98.87 567.7 0.2098
4 ... 16.67 152.20 1575.0 0.1374
5 ... 23.75 103.40 741.6 0.1791
6 ... 27.66 153.20 1606.0 0.1442
7 ... 28.14 110.60 897.0 0.1654
8 ... 30.73 106.20 739.3 0.1703
9 ... 40.68 97.65 711.4 0.1853
10 ... 33.88 123.80 1150.0 0.1181
11 ... 27.28 136.50 1299.0 0.1396
12 ... 29.94 151.70 1332.0 0.1037
13 ... 27.66 112.00 876.5 0.1131
14 ... 32.01 108.80 697.7 0.1651
15 ... 37.13 124.10 943.2 0.1678
16 ... 30.88 123.40 1138.0 0.1464
17 ... 31.48 136.80 1315.0 0.1789
18 ... 30.88 186.80 2398.0 0.1512
22 ... 19.08 125.10 980.9 0.1390
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.26540 0.4601
1 0.1866 0.2416 0.18600 0.2750
2 0.4245 0.4504 0.24300 0.3613
3 0.8663 0.6869 0.25750 0.6638
4 0.2050 0.4000 0.16250 0.2364
5 0.5249 0.5355 0.17410 0.3985
6 0.2576 0.3784 0.19320 0.3063
7 0.3682 0.2678 0.15560 0.3196
8 0.5401 0.5390 0.20600 0.4378
9 1.0580 1.1050 0.22100 0.4366
10 0.1551 0.1459 0.09975 0.2948
11 0.5609 0.3965 0.18100 0.3792
12 0.3903 0.3639 0.17670 0.3176
13 0.1924 0.2322 0.11190 0.2809
14 0.7725 0.6943 0.22080 0.3596
15 0.6577 0.7026 0.17120 0.4218
16 0.1871 0.2914 0.16090 0.3029
17 0.4233 0.4784 0.20730 0.3706
18 0.3150 0.5372 0.23880 0.2768
22 0.5954 0.6305 0.23930 0.4667
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
5 0.12440 NaN
6 0.08368 NaN
7 0.11510 NaN
8 0.10720 NaN
9 0.20750 NaN
10 0.08452 NaN
11 0.10480 NaN
12 0.10230 NaN
13 0.06287 NaN
14 0.14310 NaN
15 0.13410 NaN
16 0.08216 NaN
17 0.11420 NaN
18 0.07615 NaN
22 0.09946 NaN
[20 rows x 33 columns]
You can also filter based on multiple conditions, such as finding rows where the diagnosis is “M” and radius_mean is greater than 15.
Note: you can’t use Python’s ‘and’ operator here, because ‘and’ works only with single True or False values, not element-wise on arrays or Series. Use & for element-wise AND (and | for OR), and wrap each condition in parentheses.
Large Malignant Cases (Radius Mean > 15):
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 17.99 10.38 122.8 1001.0
1 842517 M 20.57 17.77 132.9 1326.0
2 84300903 M 19.69 21.25 130.0 1203.0
4 84358402 M 20.29 14.34 135.1 1297.0
6 844359 M 18.25 19.98 119.6 1040.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
4 0.10030 0.13280 0.1980 0.10430
6 0.09463 0.10900 0.1127 0.07400
... texture_worst perimeter_worst area_worst smoothness_worst \
0 ... 17.33 184.6 2019.0 0.1622
1 ... 23.41 158.8 1956.0 0.1238
2 ... 25.53 152.5 1709.0 0.1444
4 ... 16.67 152.2 1575.0 0.1374
6 ... 27.66 153.2 1606.0 0.1442
compactness_worst concavity_worst concave points_worst symmetry_worst \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
4 0.2050 0.4000 0.1625 0.2364
6 0.2576 0.3784 0.1932 0.3063
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
4 0.07678 NaN
6 0.08368 NaN
[5 rows x 33 columns]
Adding and Modifying Columns
You can create new columns in a DataFrame based on calculations using existing columns. For example, we can calculate the area_ratio by dividing area_worst by area_mean.
New Column 'area_ratio':
area_worst area_mean area_ratio
0 2019.0 1001.0 2.016983
1 1956.0 1326.0 1.475113
2 1709.0 1203.0 1.420615
3 567.7 386.1 1.470344
4 1575.0 1297.0 1.214341
Changing a Value Using .at
Suppose you have a DataFrame and want to update the value in the radius_mean column for a particular index.
# Access and print the original value at index 0 and column 'radius_mean'
original_value = cancer_data.at[0, 'Radius Mean']
print("Original Radius Mean at index 0:", original_value)
# Change the value at index 0 and column 'radius_mean' to 18.5
cancer_data.at[0, 'Radius Mean'] = 18.5
# Verify the updated value
updated_value = cancer_data.at[0, 'Radius Mean']
print("Updated Radius Mean at index 0:", updated_value)Original Radius Mean at index 0: 17.99
Updated Radius Mean at index 0: 18.5
Sorting by Columns
You can sort a dataset by columns. Here’s how to sort by diagnosis first and then by area_mean in ascending order.
# Sort by 'diagnosis' first, then by 'area_mean' within each diagnosis group
sorted_by_diagnosis_area = cancer_data.sort_values(by=['diagnosis', 'area_mean'], ascending=[True, True])
print("Data sorted by Diagnosis and Area Mean:\n", sorted_by_diagnosis_area[['diagnosis', 'area_mean', 'Radius Mean']].head())Data sorted by Diagnosis and Area Mean:
diagnosis area_mean Radius Mean
101 B 143.5 6.981
539 B 170.4 7.691
538 B 178.8 7.729
568 B 181.0 7.760
46 B 201.9 8.196
Reordering Columns to Move a Column to the End
You might also want to move a specific column to the end of the DataFrame, such as moving area_ratio to the last position.
# Move 'area_ratio' to the end of the DataFrame
columns_reordered = [col for col in cancer_data.columns if col != 'area_ratio'] + ['area_ratio']
cancer_data_with_area_ratio_last = cancer_data[columns_reordered]
# Display the reordered columns
print("Data with 'area_ratio' at the end:\n", cancer_data_with_area_ratio_last.head())
Data with 'area_ratio' at the end:
id diagnosis Radius Mean Texture Mean Perimeter Mean area_mean \
0 842302 M 18.50 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... perimeter_worst area_worst smoothness_worst compactness_worst \
0 ... 184.60 2019.0 0.1622 0.6656
1 ... 158.80 1956.0 0.1238 0.1866
2 ... 152.50 1709.0 0.1444 0.4245
3 ... 98.87 567.7 0.2098 0.8663
4 ... 152.20 1575.0 0.1374 0.2050
concavity_worst concave points_worst symmetry_worst \
0 0.7119 0.2654 0.4601
1 0.2416 0.1860 0.2750
2 0.4504 0.2430 0.3613
3 0.6869 0.2575 0.6638
4 0.4000 0.1625 0.2364
fractal_dimension_worst Unnamed: 32 area_ratio
0 0.11890 NaN 2.016983
1 0.08902 NaN 1.475113
2 0.08758 NaN 1.420615
3 0.17300 NaN 1.470344
4 0.07678 NaN 1.214341
[5 rows x 34 columns]
Method Chaining in Pandas
In Pandas, you can chain multiple methods together to create a pipeline.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
"name": [" Alice ", "BOB", "Charlie", None],
"score": [85, 92, None, 74]
})
# Clean the data using method chaining
clean_df = (
df
.dropna() # Method: drop rows with any NaNs
.assign( # Method: add or update columns
name_clean=lambda d: d["name"].str.strip().str.title()
)
.sort_values("score", ascending=False) # Method: sort by score
)
print(clean_df)
name score name_clean
1 BOB 92.0 Bob
0 Alice 85.0 Alice
Why .str.strip() and not just .strip()?
Calling df["name"].strip() raises an AttributeError: .strip() is a method on single strings, not on a Series of strings.
Why?
- df["name"] is a Series, not a string.
- .strip() is a string method that works on single strings.
- .str is the accessor that tells pandas: “apply this string method to each element in the Series.”
Rule of Thumb:
| You have… | Use… | Why? |
|---|---|---|
| A single string | "hello".strip() | It’s just Python |
| A Series of strings | df["col"].str.strip() | It’s pandas, operating on many strings |
Applying Functions in Pandas
Applying Functions to Columns
The .apply() method in pandas lets you apply a custom function to each element in a Series (column) or DataFrame. Here’s how to use it to categorize tumors based on area_mean.
Example: Categorizing Tumors by Size
Let’s create a custom function to categorize tumors as “Small”, “Medium”, or “Large” based on area_mean.
# Define a custom function to categorize tumors by area_mean
def categorize_tumor(size):
if size < 500:
return 'Small'
elif 500 <= size < 1000:
return 'Medium'
else:
return 'Large'
# Apply the function to the 'area_mean' column and create a new column 'tumor_size_category'
cancer_data['tumor_size_category'] = cancer_data['area_mean'].apply(categorize_tumor)
# Display the new column to verify the transformation
print("Tumor Size Categories:\n", cancer_data[['area_mean', 'tumor_size_category']].head())Tumor Size Categories:
area_mean tumor_size_category
0 1001.0 Large
1 1326.0 Large
2 1203.0 Large
3 386.1 Small
4 1297.0 Large
Using Lambda Functions for Quick Transformations
Lambda functions are useful for simple, one-line operations. For example, we can use a lambda function to convert diagnosis into numerical codes (0 for Benign, 1 for Malignant).
# Apply a lambda function to classify 'diagnosis' into numerical codes
cancer_data['diagnosis_code'] = cancer_data['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)
# Display the new column to verify the transformation
print("Diagnosis Codes:\n", cancer_data[['diagnosis', 'diagnosis_code']].head())Diagnosis Codes:
diagnosis diagnosis_code
0 M 1
1 M 1
2 M 1
3 M 1
4 M 1
Applying Multiple Conditions with .apply()
You can also use .apply() with a lambda function for more complex, multi-condition classifications.
Example: Adding a Column with Risk Levels Suppose we want to create a new column, risk_level, based on both diagnosis and area_mean:
- “High Risk” for Malignant tumors with area_mean above 1000.
- “Moderate Risk” for Malignant tumors with area_mean below 1000.
- “Low Risk” for Benign tumors.
# Apply a lambda function with multiple conditions to create a 'risk_level' column
cancer_data['risk_level'] = cancer_data.apply(
lambda row: 'High Risk' if row['diagnosis'] == 'M' and row['area_mean'] > 1000
else ('Moderate Risk' if row['diagnosis'] == 'M' else 'Low Risk'), axis=1
)
# Display the new column to verify the transformation
print("Risk Levels:\n", cancer_data[['diagnosis', 'area_mean', 'risk_level']].head())
# axis=1 tells apply() to work row by row; axis=0 (the default) works column by column
Risk Levels:
diagnosis area_mean risk_level
0 M 1001.0 High Risk
1 M 1326.0 High Risk
2 M 1203.0 High Risk
3 M 386.1 Moderate Risk
4 M 1297.0 High Risk
When to apply axis=
| You’re applying to… | Use .apply() on… | Do you need axis? |
|---|---|---|
| A single column (Series) | df['col'].apply(func) | No |
| Multiple columns (row-wise) | df.apply(func, axis=1) | Yes (axis=1) |
| Column-wise (less common) | df.apply(func) or axis=0 | Optional (default is 0) |
Exporting a Pandas DataFrame to CSV or XLSX
Missing values in Python
Missing values are common in data analysis. Python provides multiple ways to represent missing values, including None, np.nan, and pd.NA. Understanding their behavior is crucial for data cleaning, processing, and analysis.
Missing Values in Python: None
- Definition: None is a built-in Python object representing “no value.”
- Use Cases: Works with general Python objects but does not support mathematical operations.
Issue with None in arithmetic:
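A quick illustration (the exact error message may vary by Python version):

```python
x = None
try:
    x + 1
except TypeError as err:
    # None does not support arithmetic
    print("TypeError:", err)
```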
Missing Values in NumPy: np.nan
- np.nan represents missing values in numerical computations.
- np.nan is a floating-point value (float64).
- It cannot be checked with == because np.nan != np.nan; use np.isnan() instead.
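The check that produces the output below:

```python
import numpy as np

x = np.nan
# np.nan != np.nan, so test with np.isnan() rather than ==
if np.isnan(x):
    print("x is missing")
```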
x is missing
Behavior in Math Operations
Missing Values in Pandas: pd.NA
- pd.NA is Pandas’ missing value representation, introduced in Pandas 1.0.
- Works with nullable data types (Int64, Float64, boolean, string).
- Avoids automatic type conversion (e.g., integers remain integers).
Behavior in Math Operations
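As with np.nan, arithmetic propagates the missing value, but here the result stays pd.NA:

```python
import pandas as pd

print(pd.NA + 1)    # <NA>
print(pd.NA * 2)    # <NA>
```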
Why use pd.NA in Pandas datasets?
- Data Type Preservation
- np.nan is always a floating-point value, which forces integer columns to be upcast to floats.
- pd.NA allows a column to retain its original data type
- Consistency Across Types
- Can be used across all nullable data types
- Acts as a single indicator for missingness across data types (integer, boolean, string, datetime)
- Improved Missing Data Semantics
- pd.NA behaves more like a true “unknown” or “missing” value during operations, which can be more intuitive
- For example, comparing pd.NA with any value (including itself) results in pd.NA rather than a boolean
- Enables Nullable Extension Types
- The introduction of pd.NA facilitated the development of “nullable extension types”
- These can be thought of as “column types”.
- For example, typecasting a column as Int64 tells Pandas that this column is made up of integers, and can accept missing values (pd.NA)
- If we try to use np.nan instead, all values in the column would be forced to be floats.
Comparing None, np.nan, and pd.NA
| Feature | None | np.nan | pd.NA |
|---|---|---|---|
| Type | NoneType | float64 | Special Pandas scalar |
| Use Case | General Python | NumPy/Pandas numeric data | Pandas nullable types |
| Arithmetic Ops | Fails (None + 1) | Works but returns nan | Works but returns <NA> |
| Comparison (==) | None == None → True | np.nan == np.nan → False | pd.NA == pd.NA → <NA> |
| Check Method | if x is None: | if np.isnan(x): | if pd.isna(x): |
Handling Missing Values in Pandas
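A sketch of code that would produce the outputs below: a single object-dtype column holding all three markers, checked with .isna() and inspected element by element:

```python
import numpy as np
import pandas as pd

# One column mixing all three missing-value markers (stored as object dtype)
df = pd.DataFrame({"A": [1, np.nan, 3, None, pd.NA]})

print(df)           # np.nan renders as NaN; None and pd.NA as None and <NA>
print(df.isna())    # all three markers register as missing
for v in df["A"]:
    print(type(v))
```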
A
0 1
1 NaN
2 3
3 None
4 <NA>
A
0 False
1 True
2 False
3 True
4 True
<class 'int'>
<class 'float'>
<class 'int'>
<class 'NoneType'>
<class 'pandas._libs.missing.NAType'>
Filling Missing Values
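The output below corresponds to iterating over a nullable-integer Series; a sketch (the exact scalar types can vary slightly across pandas versions):

```python
import pandas as pd

# Nullable Int64 column: integers plus pd.NA for the gaps
s = pd.Series([1, pd.NA, 3, pd.NA, pd.NA], dtype="Int64")

for v in s:
    print(v)
    print(type(v))
print(s.dtype)
```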
1
<class 'numpy.int64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3
<class 'numpy.int64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Int64
Why does casting the column as int64 fail with missing values?
The Problem
The Fix
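A sketch of both halves: plain int64 has no way to represent a missing value, so the cast fails; the nullable Int64 dtype accepts pd.NA (the exact error message varies by pandas version):

```python
import pandas as pd

s = pd.Series([1, pd.NA, 3, pd.NA, pd.NA])

# The problem: NumPy's int64 cannot represent a missing value
try:
    s.astype("int64")
except (TypeError, ValueError) as err:
    print("astype('int64') failed:", err)

# The fix: the nullable Int64 extension dtype keeps integers and pd.NA together
s_int = s.astype("Int64")
for v in s_int:
    print(v, ":", type(v))
```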
1 : <class 'numpy.int64'>
<NA> : <class 'pandas._libs.missing.NAType'>
3 : <class 'numpy.int64'>
<NA> : <class 'pandas._libs.missing.NAType'>
<NA> : <class 'pandas._libs.missing.NAType'>
Casting to float64 works (unlike int64), with a caveat
1.0 : <class 'float'>
nan : <class 'float'>
3.0 : <class 'float'>
nan : <class 'float'>
nan : <class 'float'>
All pd.NA values are “recast” as np.nan, since this type of missing value is a float. However, this is usually not what you want.
Best practice: cast the column as the Pandas nullable datatype Float64, so that missing values stay pd.NA
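A sketch that produces the output below using the nullable Float64 dtype:

```python
import pandas as pd

# Nullable Float64 keeps the missing entries as pd.NA instead of np.nan
s = pd.Series([1, pd.NA, 3, pd.NA, pd.NA], dtype="Float64")

for v in s:
    print(v)
    print(type(v))
print(s.dtype)
```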
1.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Float64
Best Practices for Missing Data
- Use None for general Python objects.
- Use np.nan for numerical missing values in NumPy.
- Use pd.NA for missing values in Pandas with nullable data types.
- Always use isna() or isnull() when working with missing data in Pandas.
Other Pandas Nullable Data Types
| Pandas Nullable Data Type | Description | Typical Usage |
|---|---|---|
| Int8, Int16, Int32, Int64 | Nullable integer types (can hold pd.NA) | Use when you want integers with missing values |
| Float32, Float64 | Nullable float types (standard floats also support NaN) | Numeric data with decimals, missing values |
| boolean | Nullable Boolean type (True, False, pd.NA) | Binary categories with missing info |
| string | Pandas string data type (nullable) | Text data with potential nulls |
| category | Categorical type (can include NaN or pd.NA) | Categorical data, efficient storage |
| datetime64[ns] with pd.NaT | Datetime with nanosecond precision | Time series, datetime columns |
| timedelta64[ns] with pd.NaT | Timedeltas (differences between datetimes) | Duration calculations |
More on missing values
To drop missing values, you can use .dropna()
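A sketch that produces the output below (a nullable Float64 column named A):

```python
import pandas as pd

# Rows with missing values are dropped; surviving rows keep their index labels
s = pd.Series([1, pd.NA, 3, None, pd.NA], name="A", dtype="Float64")
print(s.dropna())
```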
0 1.0
2 3.0
Name: A, dtype: Float64
To find missing values, you can use isnull() with sum() to calculate the total number of missing values in each column.
Missing Values per Column:
id 0
diagnosis 0
Radius Mean 0
Texture Mean 0
Perimeter Mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
area_ratio 0
tumor_size_category 0
diagnosis_code 0
risk_level 0
dtype: int64
Dropping Columns with Excessive Missing Data
Since Unnamed: 32 has no data, it can be dropped from the DataFrame using .drop().
Columns after dropping 'Unnamed: 32': ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'area_ratio', 'tumor_size_category', 'diagnosis_code', 'risk_level']
great_tables for table generation
You can use the great_tables Python module from the great-tables package to explore and display data from a dataset in a clean and interactive format.
We’ll load the data, summarize it, and then build styled tables using great_tables.
Load and Inspect the Data
import pandas as pd
# Use forward slashes so backslashes aren't read as escape sequences
# (a Windows-style path like "..\session3\..." triggers
# SyntaxWarning: invalid escape sequence '\s')
df = pd.read_csv("../session3/example_data/Cancer_Data.csv")
df.shape, df.head()
((569, 32),
id diagnosis radius_mean texture_mean perimeter_mean area_mean \
0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
... radius_worst texture_worst perimeter_worst area_worst \
0 ... 25.38 17.33 184.60 2019.0
1 ... 24.99 23.41 158.80 1956.0
2 ... 23.57 25.53 152.50 1709.0
3 ... 14.91 26.50 98.87 567.7
4 ... 22.54 16.67 152.20 1575.0
smoothness_worst compactness_worst concavity_worst concave points_worst \
0 0.1622 0.6656 0.7119 0.2654
1 0.1238 0.1866 0.2416 0.1860
2 0.1444 0.4245 0.4504 0.2430
3 0.2098 0.8663 0.6869 0.2575
4 0.1374 0.2050 0.4000 0.1625
symmetry_worst fractal_dimension_worst
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678
[5 rows x 32 columns])
Preview of dataset using great_tables
from great_tables import GT
# Select a subset of the columns for preview
preview_df = df[['id', 'diagnosis', 'radius_mean', 'texture_mean', 'area_mean']].head(5)
GT(preview_df).tab_header(
title="Breast Cancer Diagnosis Preview",
subtitle="Selected features from the first 5 records"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
| Breast Cancer Diagnosis Preview | ||||
| Selected features from the first 5 records | ||||
| id | diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 1,001.00 |
| 842517 | M | 20.57 | 17.77 | 1,326.00 |
| 84300903 | M | 19.69 | 21.25 | 1,203.00 |
| 84348301 | M | 11.42 | 20.38 | 386.10 |
| 84358402 | M | 20.29 | 14.34 | 1,297.00 |
Enhance the Table with Styling
Let’s add conditional formatting to highlight larger tumor areas.
| Styled Cancer Data Table | ||||
| With conditional formatting on tumor area | ||||
| id | diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 1,001.00 |
| 842517 | M | 20.57 | 17.77 | 1,326.00 |
| 84300903 | M | 19.69 | 21.25 | 1,203.00 |
| 84348301 | M | 11.42 | 20.38 | 386.10 |
| 84358402 | M | 20.29 | 14.34 | 1,297.00 |
Group Statistics by Diagnosis
Let’s summarize average values by diagnosis group (malignant vs. benign).
summary_df = df.groupby("diagnosis")[["radius_mean", "texture_mean", "area_mean"]].mean().reset_index()
GT(summary_df).tab_header(
title="Group-wise Summary",
subtitle="Mean values for radius, texture, and area by diagnosis"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
| Group-wise Summary | |||
| Mean values for radius, texture, and area by diagnosis | |||
| diagnosis | radius_mean | texture_mean | area_mean |
|---|---|---|---|
| B | 12.15 | 17.91 | 462.79 |
| M | 17.46 | 21.60 | 978.38 |