Session 3: Intro to Pandas and great_tables

Topic 5: What Is Pandas?

Pandas is a powerful open-source data analysis and manipulation library in Python. It provides data structures, primarily the DataFrame and Series, which are optimized for handling and analyzing large datasets efficiently.

Data Structures:

  • Series: A one-dimensional labeled array, suitable for handling single columns or rows of data.

  • DataFrame: A two-dimensional table with labeled axes (rows and columns), much like a spreadsheet or SQL table, allowing you to work with data in rows and columns simultaneously.
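A minimal sketch of both structures, using toy values invented for illustration (not the session dataset):

```python
import pandas as pd

# A Series: one-dimensional, with an index of labels
s = pd.Series([4.2, 3.1, 5.6], index=["a", "b", "c"])
print(s["b"])  # label-based access -> 3.1

# A DataFrame: two-dimensional, labeled rows and columns
df = pd.DataFrame({"name": ["Ada", "Ben"], "score": [91, 84]})
print(df["score"].mean())  # each column of a DataFrame is itself a Series -> 87.5
```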

Data Manipulation:

  • Pandas has functions for merging, reshaping, and aggregating datasets, which helps streamline data cleaning and preparation.

  • It can handle missing data, making it easy to filter or fill gaps in datasets.
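A small sketch of these ideas together, merging two toy tables (names and values invented for illustration), filling the gap the merge creates, then aggregating:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "group": ["A", "B", "A"]})
right = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})

# Merge on a shared key; id 3 has no match, so its 'value' becomes NaN
merged = pd.merge(left, right, on="id", how="left")

# Fill the gap, then aggregate by group
merged["value"] = merged["value"].fillna(0.0)
print(merged.groupby("group")["value"].sum())
```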

Data Analysis:

  • Pandas provides extensive functionality for descriptive statistics, grouping data, and handling time series.

  • Integrates well with other libraries, making it easy to move data between libraries like NumPy for numerical computations and Matplotlib or Seaborn for visualization.
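For instance, a column hands its data to NumPy in one call (toy values for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# .to_numpy() converts a column to a NumPy array for numerical work
arr = df["x"].to_numpy()
print(np.log(arr))  # NumPy ufuncs then apply elementwise
```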

Creating folders for project housekeeping

Example Folder Structure

project_name/
    data/
        raw/
        processed/
    scripts/
    results/
    logs/

# To create folders

import os

# Define the working directory
base_dir = "G:\\dir_demo"

# Define the project folder
project_name = os.path.join(base_dir, "my_project")

# Define the subdirectories
subdirs = [
    "data/raw",
    "data/processed",
    "scripts",
    "results",
    "logs",
]

# Create directories
for subdir in subdirs:
    path = os.path.join(project_name, subdir)
    os.makedirs(path, exist_ok=True)  #ensures no error if the folder already exists
    print(f"Created directory: {path}")
Created directory: G:\dir_demo\my_project\data/raw
Created directory: G:\dir_demo\my_project\data/processed
Created directory: G:\dir_demo\my_project\scripts
Created directory: G:\dir_demo\my_project\results
Created directory: G:\dir_demo\my_project\logs

Loading the Dataset

import os
import pandas as pd

# Load the dataset
cancer_data = pd.read_csv(r"..\session3\example_data\Cancer_Data.csv")  # raw string so backslashes aren't treated as escape sequences

print(type(cancer_data))


# Display the first few rows of the dataset
cancer_data.head()
<class 'pandas.core.frame.DataFrame'>
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN

5 rows × 33 columns

Viewing Basic Information

  • Checking the Dataset’s Shape

.shape returns a tuple with (number of rows, number of columns), which provides a basic overview of the dataset size.

# Display the shape of the dataset
print("Dataset Shape:", cancer_data.shape)
Dataset Shape: (569, 33)

  • Summarizing Column Information

.info() lists all columns, their data types, and counts of non-null values, helping identify any columns that may have missing data.

# Display column names, data types, and non-null counts
cancer_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Viewing Column Names

.columns lists column headers, while .tolist() converts it into a standard Python list for easier viewing.

# Display column names
print("Column Names:", cancer_data.columns.tolist())
Column Names: ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']

Summary Statistics

.describe() generates essential statistics (mean, std, min, max, percentiles) for numeric columns, useful for identifying data distributions.

# Generate summary statistics for numeric columns
cancer_data.describe()
id radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 0.0
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 ... 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946 NaN
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 ... 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061 NaN
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 ... 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040 NaN
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 ... 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460 NaN
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 ... 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040 NaN
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 ... 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080 NaN
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500 NaN

8 rows × 32 columns

Using value_counts() on a Single Column

This method is straightforward if you want to check the frequency distribution of one specific categorical column. It returns a pandas Series indexed by the unique values.

# Count occurrences of each unique value in the 'diagnosis' column
diagnosis_counts = cancer_data['diagnosis'].value_counts()
print("Diagnosis Counts:\n", diagnosis_counts)
Diagnosis Counts:
 diagnosis
B    357
M    212
Name: count, dtype: int64

To see summary statistics grouped by a categorical variable in pandas, you can use the groupby() method along with describe() or specific aggregation functions like mean(), sum(), etc.

# Group by 'diagnosis' and get summary statistics for each group
grouped_summary = cancer_data.groupby('diagnosis').mean()
print(grouped_summary)


#Group by 'diagnosis' and get summary statistics for only one variable
grouped_radius_mean = cancer_data.groupby('diagnosis')['radius_mean'].mean()
print(grouped_radius_mean)
                     id  radius_mean  texture_mean  perimeter_mean  \
diagnosis                                                            
B          2.654382e+07    12.146524     17.914762       78.075406   
M          3.681805e+07    17.462830     21.604906      115.365377   

            area_mean  smoothness_mean  compactness_mean  concavity_mean  \
diagnosis                                                                  
B          462.790196         0.092478          0.080085        0.046058   
M          978.376415         0.102898          0.145188        0.160775   

           concave points_mean  symmetry_mean  ...  texture_worst  \
diagnosis                                      ...                  
B                     0.025717       0.174186  ...      23.515070   
M                     0.087990       0.192909  ...      29.318208   

           perimeter_worst   area_worst  smoothness_worst  compactness_worst  \
diagnosis                                                                      
B                87.005938   558.899440          0.124959           0.182673   
M               141.370330  1422.286321          0.144845           0.374824   

           concavity_worst  concave points_worst  symmetry_worst  \
diagnosis                                                          
B                 0.166238              0.074444        0.270246   
M                 0.450606              0.182237        0.323468   

           fractal_dimension_worst  Unnamed: 32  
diagnosis                                        
B                         0.079442          NaN  
M                         0.091530          NaN  

[2 rows x 32 columns]
diagnosis
B    12.146524
M    17.462830
Name: radius_mean, dtype: float64

Renaming Columns

To make column names more readable or consistent, you can use rename() to change specific names. Here’s how to rename columns like radius_mean to Radius Mean.

# Rename specific columns for readability: {'old_name': 'new_name'}

new_columns = {
    'radius_mean': 'Radius Mean',
    'texture_mean': 'Texture Mean',
    'perimeter_mean': 'Perimeter Mean'
}

cancer_data = cancer_data.rename(columns=new_columns)

# Display the new column names to verify the changes
print("\nUpdated Column Names:", cancer_data.columns.tolist())

Updated Column Names: ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']

Missing values in Python

Missing values are common in data analysis. Python provides multiple ways to represent missing values, including None, np.nan, and pd.NA. Understanding their behavior is crucial for data cleaning, processing, and analysis.

Missing Values in Python: None

  • Definition: None is a built-in Python object representing “no value.”
  • Use Cases: Works with general Python objects but does not support mathematical operations.
x = None
if x is None:  # Best practice
    print("x is missing")
x is missing

Issue with None in arithmetic:

try:
    print(x + 1)
except TypeError:
    print("TypeError: Unsupported operand type(s)")
print(x == x)
TypeError: Unsupported operand type(s)
True

Missing Values in NumPy: np.nan

  • np.nan represents missing values in numerical computations.
  • np.nan is a floating-point value (float64).
  • Cannot be checked with == because np.nan != np.nan.
import numpy as np
x = np.nan
if np.isnan(x):  # Correct way to check for np.nan
    print("x is missing")
x is missing
  • Behavior in Math Operations
print(x + 10)  # Output: nan
print(x == x)  # Output: False
nan
False

Missing Values in Pandas: pd.NA

  • pd.NA is Pandas’ missing value representation, introduced in Pandas 1.0.

  • Works with nullable data types (Int64, Float64, boolean, string).

  • Avoids automatic type conversion (e.g., integers remain integers).

  • Behavior in Math Operations

import pandas as pd
x = pd.NA
print(x+1)
<NA>
# Checking for missingness
x = pd.NA
if pd.isna(x):  # Correct way
    print("x is missing")

try:
    if x == pd.na:  # AttributeError: the attribute is pd.NA, not pd.na
        print(x)
except AttributeError:
    print("Incorrect way: pd.na does not exist; use pd.NA")

print(x == x)
x is missing
Incorrect way: pd.na does not exist; use pd.NA
<NA>

Comparing None, np.nan, and pd.NA

| Feature         | None                | np.nan                    | pd.NA                  |
|-----------------|---------------------|---------------------------|------------------------|
| Type            | NoneType            | float64                   | Special Pandas scalar  |
| Use Case        | General Python      | NumPy/Pandas numeric data | Pandas nullable types  |
| Arithmetic Ops  | Fails (None + 1)    | Works but returns nan     | Works but returns <NA> |
| Comparison (==) | None == None → True | np.nan == np.nan → False  | pd.NA == pd.NA → <NA>  |
| Check Method    | if x is None:       | if np.isnan(x):           | if pd.isna(x):         |

Handling Missing Values in Pandas

import numpy as np
df = pd.DataFrame({"A": [1, np.nan, 3, None, pd.NA]})
print(df.isna())  # Identifies missing values
for x in df.iloc[:, 0]:
    print(type(x))
       A
0  False
1   True
2  False
3   True
4   True
<class 'int'>
<class 'float'>
<class 'int'>
<class 'NoneType'>
<class 'pandas._libs.missing.NAType'>

Filling Missing Values

df["A"] = df["A"].fillna(pd.NA).astype('Float64')  # 'Float64' (capital F) is pandas' nullable float dtype; missing values become pd.NA
for value in df.iloc[:, 0]:
    print(value)
    print(type(value))

print(df["A"].dtype)
1.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
3.0
<class 'numpy.float64'>
<NA>
<class 'pandas._libs.missing.NAType'>
<NA>
<class 'pandas._libs.missing.NAType'>
Float64

Make sure you typecast the column as a pandas nullable data type.

df["A"] = df["A"].astype('float64')  # lowercase 'float64' is NumPy's dtype, which cannot hold pd.NA, so missing values revert to np.nan
for value in df.iloc[:, 0]:
    print(value)
    print(type(value))

print(df["A"].dtype)
1.0
<class 'float'>
nan
<class 'float'>
3.0
<class 'float'>
nan
<class 'float'>
nan
<class 'float'>
float64

Dropping Missing Values

df = df.dropna()  # Removes rows with missing values
print(df["A"])
0    1.0
2    3.0
Name: A, dtype: float64

Best Practices

  • Use None for general Python objects.
  • Use np.nan for numerical missing values in NumPy.
  • Use pd.NA for missing values in Pandas with nullable data types.
  • Always use isna() or isnull() when working with missing data in Pandas.

Other Pandas Nullable Data Types

| Pandas Nullable Data Type   | Description                                             | Typical Usage                              |
|-----------------------------|---------------------------------------------------------|--------------------------------------------|
| Int8, Int16, Int32, Int64   | Nullable integer types (can hold pd.NA)                 | Integers with missing values               |
| Float32, Float64            | Nullable float types (standard floats also support NaN) | Numeric data with decimals, missing values |
| boolean                     | Nullable Boolean type (True, False, pd.NA)              | Binary categories with missing info        |
| string                      | Pandas string data type (nullable)                      | Text data with potential nulls             |
| category                    | Categorical type (can include NaN or pd.NA)             | Categorical data, efficient storage        |
| datetime64[ns] with pd.NaT  | Datetime with nanosecond precision                      | Time series, datetime columns              |
| timedelta64[ns] with pd.NaT | Timedeltas (differences between datetimes)              | Duration calculations                      |
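A quick sketch of a few of these dtypes holding pd.NA without changing type (toy values for illustration):

```python
import pandas as pd

# Nullable dtypes keep their type even with missing entries
ints = pd.Series([1, 2, pd.NA], dtype="Int64")
flags = pd.Series([True, pd.NA, False], dtype="boolean")
text = pd.Series(["yes", pd.NA], dtype="string")

print(ints.dtype)         # Int64 -- integers are not forced to float
print(ints.isna().sum())  # 1 missing value detected by isna()
```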

More on missing values

To find missing values, you can use isnull() with sum() to calculate the total number of missing values in each column.

# Count missing values in each column
missing_values = cancer_data.isnull().sum()
print("Missing Values per Column:")
print(missing_values)
Missing Values per Column:
id                           0
diagnosis                    0
Radius Mean                  0
Texture Mean                 0
Perimeter Mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed: 32                569
dtype: int64

Dropping Columns with Excessive Missing Data

Since Unnamed: 32 has no data, it can be dropped from the DataFrame using .drop().

# Drop the 'Unnamed: 32' column if it contains no data
cancer_data = cancer_data.drop(columns=['Unnamed: 32'])

# Verify the column has been dropped
print("\nColumns after dropping 'Unnamed: 32':", cancer_data.columns.tolist())

Columns after dropping 'Unnamed: 32': ['id', 'diagnosis', 'Radius Mean', 'Texture Mean', 'Perimeter Mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']

Column Selection

Selecting specific columns is essential for focusing on particular aspects of the dataset. Here are some examples of both single and multiple column selections.

# Select the 'diagnosis' column - diagnosis_column will be a series
diagnosis_column = cancer_data['diagnosis']
print("Diagnosis Column:\n", diagnosis_column.head())
Diagnosis Column:
 0    M
1    M
2    M
3    M
4    M
Name: diagnosis, dtype: object

Alternatively, you can select multiple columns.

# Select multiple columns: 'diagnosis', 'radius_mean', and 'area_mean' - selected_columns will be a pandas DataFrame

selected_columns = cancer_data[['diagnosis', 'Radius Mean', 'area_mean']]
print("Selected Columns:\n", selected_columns.head())
Selected Columns:
   diagnosis  Radius Mean  area_mean
0         M        17.99     1001.0
1         M        20.57     1326.0
2         M        19.69     1203.0
3         M        11.42      386.1
4         M        20.29     1297.0

Row Selection

Selecting rows based on labels or positions is helpful for inspecting specific data points or subsets.

Label-Based Indexing with loc

loc allows selection based on labels (e.g., column names or index labels) and is particularly useful for data subsets. Note that label-based slices are inclusive of both endpoints, so 0:4 selects five rows here.

# Select rows by labels (assuming integer index here) and specific columns
selected_rows_labels = cancer_data.loc[0:4, ['diagnosis', 'Radius Mean', 'area_mean']]
print("Selected Rows with loc:\n", selected_rows_labels)
Selected Rows with loc:
   diagnosis  Radius Mean  area_mean
0         M        17.99     1001.0
1         M        20.57     1326.0
2         M        19.69     1203.0
3         M        11.42      386.1
4         M        20.29     1297.0

Integer-Based Indexing with iloc

iloc allows selection based purely on integer positions, making it convenient for slicing and position-based operations. Unlike loc, its slices are end-exclusive (standard Python slicing), so 0:5 selects the first five rows.

# Select rows by integer position and specific columns
selected_rows_position = cancer_data.iloc[0:5, [1, 2, 3]]  # Select first 5 rows and columns at position 1, 2, 3
print("Selected Rows with iloc:\n", selected_rows_position)
Selected Rows with iloc:
   diagnosis  Radius Mean  Texture Mean
0         M        17.99         10.38
1         M        20.57         17.77
2         M        19.69         21.25
3         M        11.42         20.38
4         M        20.29         14.34

Filtering

Filtering enables you to create subsets of data that match specific conditions. For example, we can filter by diagnosis to analyze only malignant (M) or benign (B) cases.

# Filter rows where 'diagnosis' is "M" (Malignant)
malignant_cases = cancer_data[cancer_data['diagnosis'] == 'M']
print("Malignant Cases:\n", malignant_cases.head(20))
Malignant Cases:
           id diagnosis  Radius Mean  Texture Mean  Perimeter Mean  area_mean  \
0     842302         M        17.99         10.38          122.80     1001.0   
1     842517         M        20.57         17.77          132.90     1326.0   
2   84300903         M        19.69         21.25          130.00     1203.0   
3   84348301         M        11.42         20.38           77.58      386.1   
4   84358402         M        20.29         14.34          135.10     1297.0   
5     843786         M        12.45         15.70           82.57      477.1   
6     844359         M        18.25         19.98          119.60     1040.0   
7   84458202         M        13.71         20.83           90.20      577.9   
8     844981         M        13.00         21.82           87.50      519.8   
9   84501001         M        12.46         24.04           83.97      475.9   
10    845636         M        16.02         23.24          102.70      797.8   
11  84610002         M        15.78         17.89          103.60      781.0   
12    846226         M        19.17         24.80          132.40     1123.0   
13    846381         M        15.85         23.95          103.70      782.7   
14  84667401         M        13.73         22.61           93.60      578.3   
15  84799002         M        14.54         27.54           96.73      658.8   
16    848406         M        14.68         20.13           94.74      684.5   
17  84862001         M        16.13         20.68          108.10      798.8   
18    849014         M        19.81         22.15          130.00     1260.0   
22   8511133         M        15.34         14.26          102.50      704.4   

    smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0           0.11840           0.27760         0.30010              0.14710   
1           0.08474           0.07864         0.08690              0.07017   
2           0.10960           0.15990         0.19740              0.12790   
3           0.14250           0.28390         0.24140              0.10520   
4           0.10030           0.13280         0.19800              0.10430   
5           0.12780           0.17000         0.15780              0.08089   
6           0.09463           0.10900         0.11270              0.07400   
7           0.11890           0.16450         0.09366              0.05985   
8           0.12730           0.19320         0.18590              0.09353   
9           0.11860           0.23960         0.22730              0.08543   
10          0.08206           0.06669         0.03299              0.03323   
11          0.09710           0.12920         0.09954              0.06606   
12          0.09740           0.24580         0.20650              0.11180   
13          0.08401           0.10020         0.09938              0.05364   
14          0.11310           0.22930         0.21280              0.08025   
15          0.11390           0.15950         0.16390              0.07364   
16          0.09867           0.07200         0.07395              0.05259   
17          0.11700           0.20220         0.17220              0.10280   
18          0.09831           0.10270         0.14790              0.09498   
22          0.10730           0.21350         0.20770              0.09756   

    ...  radius_worst  texture_worst  perimeter_worst  area_worst  \
0   ...         25.38          17.33           184.60      2019.0   
1   ...         24.99          23.41           158.80      1956.0   
2   ...         23.57          25.53           152.50      1709.0   
3   ...         14.91          26.50            98.87       567.7   
4   ...         22.54          16.67           152.20      1575.0   
5   ...         15.47          23.75           103.40       741.6   
6   ...         22.88          27.66           153.20      1606.0   
7   ...         17.06          28.14           110.60       897.0   
8   ...         15.49          30.73           106.20       739.3   
9   ...         15.09          40.68            97.65       711.4   
10  ...         19.19          33.88           123.80      1150.0   
11  ...         20.42          27.28           136.50      1299.0   
12  ...         20.96          29.94           151.70      1332.0   
13  ...         16.84          27.66           112.00       876.5   
14  ...         15.03          32.01           108.80       697.7   
15  ...         17.46          37.13           124.10       943.2   
16  ...         19.07          30.88           123.40      1138.0   
17  ...         20.96          31.48           136.80      1315.0   
18  ...         27.32          30.88           186.80      2398.0   
22  ...         18.07          19.08           125.10       980.9   

    smoothness_worst  compactness_worst  concavity_worst  \
0             0.1622             0.6656           0.7119   
1             0.1238             0.1866           0.2416   
2             0.1444             0.4245           0.4504   
3             0.2098             0.8663           0.6869   
4             0.1374             0.2050           0.4000   
5             0.1791             0.5249           0.5355   
6             0.1442             0.2576           0.3784   
7             0.1654             0.3682           0.2678   
8             0.1703             0.5401           0.5390   
9             0.1853             1.0580           1.1050   
10            0.1181             0.1551           0.1459   
11            0.1396             0.5609           0.3965   
12            0.1037             0.3903           0.3639   
13            0.1131             0.1924           0.2322   
14            0.1651             0.7725           0.6943   
15            0.1678             0.6577           0.7026   
16            0.1464             0.1871           0.2914   
17            0.1789             0.4233           0.4784   
18            0.1512             0.3150           0.5372   
22            0.1390             0.5954           0.6305   

    concave points_worst  symmetry_worst  fractal_dimension_worst  
0                0.26540          0.4601                  0.11890  
1                0.18600          0.2750                  0.08902  
2                0.24300          0.3613                  0.08758  
3                0.25750          0.6638                  0.17300  
4                0.16250          0.2364                  0.07678  
5                0.17410          0.3985                  0.12440  
6                0.19320          0.3063                  0.08368  
7                0.15560          0.3196                  0.11510  
8                0.20600          0.4378                  0.10720  
9                0.22100          0.4366                  0.20750  
10               0.09975          0.2948                  0.08452  
11               0.18100          0.3792                  0.10480  
12               0.17670          0.3176                  0.10230  
13               0.11190          0.2809                  0.06287  
14               0.22080          0.3596                  0.14310  
15               0.17120          0.4218                  0.13410  
16               0.16090          0.3029                  0.08216  
17               0.20730          0.3706                  0.11420  
18               0.23880          0.2768                  0.07615  
22               0.23930          0.4667                  0.09946  

[20 rows x 32 columns]

You can also filter based on multiple conditions, such as finding rows where the diagnosis is “M” and radius_mean is greater than 15.

Note: You can’t use Python’s ‘and’ operator here, because ‘and’ works with single True or False values, not element-wise on arrays or Series. Use ‘&’ instead, with parentheses around each condition.
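A minimal sketch of what goes wrong with ‘and’ on a toy boolean Series:

```python
import pandas as pd

s = pd.Series([True, False, True])
try:
    s and s  # `and` asks for a single truth value, which a Series cannot give
except ValueError:
    print("ValueError: use & for element-wise logic")

print((s & s).tolist())  # element-wise: [True, False, True]
```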

# Filter for Malignant cases with radius_mean > 15
large_malignant_cases = cancer_data[(cancer_data['diagnosis'] == 'M') & (cancer_data['Radius Mean'] > 15)]
print("Large Malignant Cases (Radius Mean > 15):\n", large_malignant_cases.head())
Large Malignant Cases (Radius Mean > 15):
          id diagnosis  Radius Mean  Texture Mean  Perimeter Mean  area_mean  \
0    842302         M        17.99         10.38           122.8     1001.0   
1    842517         M        20.57         17.77           132.9     1326.0   
2  84300903         M        19.69         21.25           130.0     1203.0   
4  84358402         M        20.29         14.34           135.1     1297.0   
6    844359         M        18.25         19.98           119.6     1040.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
4          0.10030           0.13280          0.1980              0.10430   
6          0.09463           0.10900          0.1127              0.07400   

   ...  radius_worst  texture_worst  perimeter_worst  area_worst  \
0  ...         25.38          17.33            184.6      2019.0   
1  ...         24.99          23.41            158.8      1956.0   
2  ...         23.57          25.53            152.5      1709.0   
4  ...         22.54          16.67            152.2      1575.0   
6  ...         22.88          27.66            153.2      1606.0   

   smoothness_worst  compactness_worst  concavity_worst  concave points_worst  \
0            0.1622             0.6656           0.7119                0.2654   
1            0.1238             0.1866           0.2416                0.1860   
2            0.1444             0.4245           0.4504                0.2430   
4            0.1374             0.2050           0.4000                0.1625   
6            0.1442             0.2576           0.3784                0.1932   

   symmetry_worst  fractal_dimension_worst  
0          0.4601                  0.11890  
1          0.2750                  0.08902  
2          0.3613                  0.08758  
4          0.2364                  0.07678  
6          0.3063                  0.08368  

[5 rows x 32 columns]

Adding and Modifying Columns

You can create new columns in a DataFrame based on calculations using existing columns. For example, we can calculate the area_ratio by dividing area_worst by area_mean.

# Add a new column 'area_ratio' by dividing 'area_worst' by 'area_mean'
cancer_data['area_ratio'] = cancer_data['area_worst'] / cancer_data['area_mean']
print("New Column 'area_ratio':\n", cancer_data[['area_worst', 'area_mean', 'area_ratio']].head())
New Column 'area_ratio':
    area_worst  area_mean  area_ratio
0      2019.0     1001.0    2.016983
1      1956.0     1326.0    1.475113
2      1709.0     1203.0    1.420615
3       567.7      386.1    1.470344
4      1575.0     1297.0    1.214341

Changing a Value Using .at

Suppose you have a DataFrame and want to update the value in the radius_mean column for a particular index.

# Access and print the original value at index 0 and column 'Radius Mean'
original_value = cancer_data.at[0, 'Radius Mean']
print("Original Radius Mean at index 0:", original_value)


# Change the value at index 0 and column 'Radius Mean' to 18.5
cancer_data.at[0, 'Radius Mean'] = 18.5


# Verify the updated value
updated_value = cancer_data.at[0, 'Radius Mean']
print("Updated Radius Mean at index 0:", updated_value)
Original Radius Mean at index 0: 17.99
Updated Radius Mean at index 0: 18.5
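.at is the fastest accessor when you need exactly one scalar; .loc covers the same label-based access but also accepts slices and boolean masks. A minimal sketch on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"radius_mean": [17.99, 20.57]})

# .at: get/set a single scalar by row label and column name
df.at[0, "radius_mean"] = 18.5

# .loc: same label-based access, but slices and masks also work
first_two = df.loc[0:1, "radius_mean"]

print(df.at[0, "radius_mean"])  # 18.5
print(len(first_two))           # 2 (label slices in .loc are inclusive)
```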

Sorting by Columns

You can sort a dataset by columns. Here’s how to sort by diagnosis first and then by area_mean in ascending order.

# Sort by 'diagnosis' first, then by 'area_mean' within each diagnosis group
sorted_by_diagnosis_area = cancer_data.sort_values(by=['diagnosis', 'area_mean'], ascending=[True, True])
print("Data sorted by Diagnosis and Area Mean:\n", sorted_by_diagnosis_area[['diagnosis', 'area_mean', 'Radius Mean']].head())
Data sorted by Diagnosis and Area Mean:
     diagnosis  area_mean  Radius Mean
101         B      143.5        6.981
539         B      170.4        7.691
538         B      178.8        7.729
568         B      181.0        7.760
46          B      201.9        8.196
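The ascending= list can also mix directions per column. A toy sketch sorting diagnosis ascending but area descending within each group:

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["B", "M", "B", "M"],
    "area_mean": [300.0, 1200.0, 150.0, 900.0],
})

# Ascending by diagnosis, descending by area_mean within each diagnosis
out = df.sort_values(by=["diagnosis", "area_mean"], ascending=[True, False])
print(out["area_mean"].tolist())  # [300.0, 150.0, 1200.0, 900.0]
```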

Reordering Columns to Move a Column to the End

You might also want to move a specific column to the end of the DataFrame, such as moving area_ratio to the last position.

# Move 'area_ratio' to the end of the DataFrame
columns_reordered = [col for col in cancer_data.columns if col != 'area_ratio'] + ['area_ratio']
cancer_data_with_area_ratio_last = cancer_data[columns_reordered]

# Display the reordered columns
print("Data with 'area_ratio' at the end:\n", cancer_data_with_area_ratio_last.head())
Data with 'area_ratio' at the end:
          id diagnosis  Radius Mean  Texture Mean  Perimeter Mean  area_mean  \
0    842302         M        18.50         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   ...  texture_worst  perimeter_worst  area_worst  smoothness_worst  \
0  ...          17.33           184.60      2019.0            0.1622   
1  ...          23.41           158.80      1956.0            0.1238   
2  ...          25.53           152.50      1709.0            0.1444   
3  ...          26.50            98.87       567.7            0.2098   
4  ...          16.67           152.20      1575.0            0.1374   

   compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
0             0.6656           0.7119                0.2654          0.4601   
1             0.1866           0.2416                0.1860          0.2750   
2             0.4245           0.4504                0.2430          0.3613   
3             0.8663           0.6869                0.2575          0.6638   
4             0.2050           0.4000                0.1625          0.2364   

   fractal_dimension_worst  area_ratio  
0                  0.11890    2.016983  
1                  0.08902    1.475113  
2                  0.08758    1.420615  
3                  0.17300    1.470344  
4                  0.07678    1.214341  

[5 rows x 33 columns]
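If you prefer to mutate the DataFrame in place, pop removes the column and returns it as a Series, so reassigning it appends it as the last column. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "area_ratio": [2.0], "b": [3]})

# pop removes 'area_ratio'; the assignment then adds it back at the end
df["area_ratio"] = df.pop("area_ratio")
print(list(df.columns))  # ['a', 'b', 'area_ratio']
```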

Method Chaining in Pandas

In Pandas, you can chain multiple methods together to create a pipeline.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    "name": [" Alice ", "BOB", "Charlie", None],
    "score": [85, 92, None, 74]
})

# Clean the data using method chaining
clean_df = (
    df
    .dropna()                # Method: drop rows with any NaNs
    .assign(                 # Method: add or update columns
        name_clean=lambda d: d["name"].str.strip().str.title()
    )
    .sort_values("score", ascending=False)  # Method: sort by score
)

print(clean_df)
      name  score name_clean
1      BOB   92.0        Bob
0   Alice    85.0      Alice

Why .str.strip() and not just .strip()?

# This works:
df["name"].str.strip()

# This does NOT:
try:
    df["name"].strip()  # ❌ AttributeError
except AttributeError: 
    print("AttributeError: .strip is used for single strings, not a Series of strings")
AttributeError: .strip is used for single strings, not a Series of strings

Why? - df["name"] is a Series — not a string. - .strip() is a string method that works on single strings. - .str is the accessor that tells pandas: “apply this string method to each element in the Series.”

Rule of Thumb:

You have…              Use…                       Why?
A single string        "hello".strip()            It’s just Python
A Series of strings    df["col"].str.strip()      It’s pandas, operating on many strings

Applying Functions in Pandas

Applying Functions to Columns

Using apply() to Apply Custom Functions

The .apply() method in pandas lets you apply a custom function to each element in a Series (column) or DataFrame. Here’s how to use it to categorize tumors based on area_mean.

Example: Categorizing Tumors by Size

Let’s create a custom function to categorize tumors as “Small”, “Medium”, or “Large” based on area_mean.

# Define a custom function to categorize tumors by area_mean
def categorize_tumor(size):
    if size < 500:
        return 'Small'
    elif 500 <= size < 1000:
        return 'Medium'
    else:
        return 'Large'

# Apply the function to the 'area_mean' column and create a new column 'tumor_size_category'
cancer_data['tumor_size_category'] = cancer_data['area_mean'].apply(categorize_tumor)

# Display the new column to verify the transformation
print("Tumor Size Categories:\n", cancer_data[['area_mean', 'tumor_size_category']].head())
Tumor Size Categories:
    area_mean tumor_size_category
0     1001.0               Large
1     1326.0               Large
2     1203.0               Large
3      386.1               Small
4     1297.0               Large
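For purely numeric thresholds like these, pd.cut is a vectorized alternative to .apply(); right=False makes the bin edges match the size < 500 and size < 1000 logic above. A toy sketch:

```python
import pandas as pd

area = pd.Series([386.1, 750.0, 1203.0])

# Bins [0, 500), [500, 1000), [1000, inf) mirror the categorize_tumor thresholds
category = pd.cut(
    area,
    bins=[0, 500, 1000, float("inf")],
    labels=["Small", "Medium", "Large"],
    right=False,
)
print(category.tolist())  # ['Small', 'Medium', 'Large']
```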

Using Lambda Functions for Quick Transformations

Lambda functions are useful for simple, one-line operations. For example, we can use a lambda function to convert diagnosis into numerical codes (0 for Benign, 1 for Malignant).

# Apply a lambda function to classify 'diagnosis' into numerical codes
cancer_data['diagnosis_code'] = cancer_data['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

# Display the new column to verify the transformation
print("Diagnosis Codes:\n", cancer_data[['diagnosis', 'diagnosis_code']].head())
Diagnosis Codes:
   diagnosis  diagnosis_code
0         M               1
1         M               1
2         M               1
3         M               1
4         M               1
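The same recoding can be written with .map(), which also surfaces unexpected labels as NaN instead of silently coding them as 0. A toy sketch:

```python
import pandas as pd

diagnosis = pd.Series(["M", "B", "M"])

# Each value is looked up in the dict; missing keys become NaN
codes = diagnosis.map({"M": 1, "B": 0})
print(codes.tolist())  # [1, 0, 1]
```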

Applying Multiple Conditions with apply()

You can also use apply() with a lambda function for more complex, multi-condition classifications.

Example: Adding a Column with Risk Levels

Suppose we want to create a new column, risk_level, based on both diagnosis and area_mean:

  • “High Risk” for Malignant tumors with area_mean above 1000.
  • “Moderate Risk” for Malignant tumors with area_mean below 1000.
  • “Low Risk” for Benign tumors.

# Apply a lambda function with multiple conditions to create a 'risk_level' column
cancer_data['risk_level'] = cancer_data.apply(
    lambda row: 'High Risk' if row['diagnosis'] == 'M' and row['area_mean'] > 1000 
    else ('Moderate Risk' if row['diagnosis'] == 'M' else 'Low Risk'), axis=1
)

# Display the new column to verify the transformation
print("Risk Levels:\n", cancer_data[['diagnosis', 'area_mean', 'risk_level']].head())

# axis=1 applies the function row-wise; axis=0 (the default) applies it column-wise
Risk Levels:
   diagnosis  area_mean     risk_level
0         M     1001.0      High Risk
1         M     1326.0      High Risk
2         M     1203.0      High Risk
3         M      386.1  Moderate Risk
4         M     1297.0      High Risk

When to apply axis=

You’re applying to…            Use .apply() on…            Do you need axis?
A single column (Series)       df['col'].apply(func)       No
Multiple columns (row-wise)    df.apply(func, axis=1)      Yes (axis=1)
Column-wise (less common)      df.apply(func) or axis=0    Optional (default is 0)
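Row-wise .apply() with axis=1 is flexible but slow on large frames; numpy.select expresses the same three-way risk logic in vectorized form. A toy sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["M", "M", "B"],
    "area_mean": [1200.0, 400.0, 900.0],
})

# Conditions are checked in order; the first match wins
conditions = [
    (df["diagnosis"] == "M") & (df["area_mean"] > 1000),
    df["diagnosis"] == "M",
]
choices = ["High Risk", "Moderate Risk"]

df["risk_level"] = np.select(conditions, choices, default="Low Risk")
print(df["risk_level"].tolist())  # ['High Risk', 'Moderate Risk', 'Low Risk']
```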

To export a Pandas dataframe to CSV or XLSX

# Export to CSV ('/path/to/directory' is a placeholder)
df.to_csv('/path/to/directory/example.csv', index=False)  # index=False excludes the row indices

# Export to XLSX (requires an Excel engine such as openpyxl)
df.to_excel('/path/to/directory/example.xlsx', index=False)
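A self-contained round-trip sketch using a temporary directory in place of the placeholder path:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "diagnosis": ["M", "B"]})

# Write to CSV in a throwaway directory, then read it back
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "example.csv")
df.to_csv(csv_path, index=False)  # index=False drops the row labels

round_trip = pd.read_csv(csv_path)
print(round_trip.equals(df))  # True
```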

great_tables for table generation

You can use the great_tables Python module (from the great-tables package) to explore and display data from a dataset in a clean, publication-quality format.

We’ll load the data, summarize it, and then build styled tables using great_tables.

Load and Inspect the Data

import pandas as pd

# Load the dataset
df = pd.read_csv(r"..\session3\example_data\Cancer_Data.csv")  # raw string avoids backslash-escape issues on Windows

# Drop unnamed column
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# Show the shape and first few rows
df.shape, df.head()
((569, 32),
          id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
 0    842302         M        17.99         10.38          122.80     1001.0   
 1    842517         M        20.57         17.77          132.90     1326.0   
 2  84300903         M        19.69         21.25          130.00     1203.0   
 3  84348301         M        11.42         20.38           77.58      386.1   
 4  84358402         M        20.29         14.34          135.10     1297.0   
 
    smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
 0          0.11840           0.27760          0.3001              0.14710   
 1          0.08474           0.07864          0.0869              0.07017   
 2          0.10960           0.15990          0.1974              0.12790   
 3          0.14250           0.28390          0.2414              0.10520   
 4          0.10030           0.13280          0.1980              0.10430   
 
    ...  radius_worst  texture_worst  perimeter_worst  area_worst  \
 0  ...         25.38          17.33           184.60      2019.0   
 1  ...         24.99          23.41           158.80      1956.0   
 2  ...         23.57          25.53           152.50      1709.0   
 3  ...         14.91          26.50            98.87       567.7   
 4  ...         22.54          16.67           152.20      1575.0   
 
    smoothness_worst  compactness_worst  concavity_worst  concave points_worst  \
 0            0.1622             0.6656           0.7119                0.2654   
 1            0.1238             0.1866           0.2416                0.1860   
 2            0.1444             0.4245           0.4504                0.2430   
 3            0.2098             0.8663           0.6869                0.2575   
 4            0.1374             0.2050           0.4000                0.1625   
 
    symmetry_worst  fractal_dimension_worst  
 0          0.4601                  0.11890  
 1          0.2750                  0.08902  
 2          0.3613                  0.08758  
 3          0.6638                  0.17300  
 4          0.2364                  0.07678  
 
 [5 rows x 32 columns])

Preview of the dataset using great_tables

from great_tables import GT

# Select a subset of the columns for preview
preview_df = df[['id', 'diagnosis', 'radius_mean', 'texture_mean', 'area_mean']].head(5)

GT(preview_df).tab_header(
    title="Breast Cancer Diagnosis Preview",
    subtitle="Selected features from the first 5 records"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
Breast Cancer Diagnosis Preview
Selected features from the first 5 records
id diagnosis radius_mean texture_mean area_mean
842302 M 17.99 10.38 1,001.00
842517 M 20.57 17.77 1,326.00
84300903 M 19.69 21.25 1,203.00
84348301 M 11.42 20.38 386.10
84358402 M 20.29 14.34 1,297.00

Enhance the Table with Styling

Let’s add conditional formatting to highlight larger tumor areas.

GT(preview_df).tab_header(
    title="Styled Cancer Data Table",
    subtitle="With conditional formatting on tumor area"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2
).data_color(
    columns="area_mean",
    palette=["blue", "red"]
)
Styled Cancer Data Table
With conditional formatting on tumor area
id diagnosis radius_mean texture_mean area_mean
842302 M 17.99 10.38 1,001.00
842517 M 20.57 17.77 1,326.00
84300903 M 19.69 21.25 1,203.00
84348301 M 11.42 20.38 386.10
84358402 M 20.29 14.34 1,297.00

Group Statistics by Diagnosis

Let’s summarize average values by diagnosis group (malignant vs. benign).

summary_df = df.groupby("diagnosis")[["radius_mean", "texture_mean", "area_mean"]].mean().reset_index()

GT(summary_df).tab_header(
    title="Group-wise Summary",
    subtitle="Mean values for radius, texture, and area by diagnosis"
).fmt_number(columns=["radius_mean", "texture_mean", "area_mean"], decimals=2)
Group-wise Summary
Mean values for radius, texture, and area by diagnosis
diagnosis radius_mean texture_mean area_mean
B 12.15 17.91 462.79
M 17.46 21.60 978.38
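If you want more than the mean per group, .agg() accepts several statistics at once. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["B", "B", "M"],
    "radius_mean": [12.0, 14.0, 18.0],
})

# One row per diagnosis group, one column per statistic
summary = df.groupby("diagnosis")["radius_mean"].agg(["mean", "min", "max"])
print(summary)
```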

You Try!

Navigate to the follow-along file and try the practice problems!