Homework 3 - Exploratory Data Analysis & Clinical Data

Warning: this assignment is not yet released. Check back on February 11, 2026.

This assignment is due on Wednesday, February 25, 2026, before 11:59 PM.

Get Started:

Accept the assignment on GitHub Classroom — You’ll get your own private repository with starter code and data
Clone your repo and complete the exercises in hw3_eda.py
Commit regularly as you work (this is part of your grade!)
Push your completed work to GitHub before the deadline

Learning Objectives

By completing this assignment, you will:

Perform systematic exploratory data analysis on clinical tabular data
Handle missing values appropriately for medical datasets
Create meaningful visualizations of clinical variables
Engineer features from raw clinical measurements
Understand class imbalance and its implications for medical prediction

Background

Before building any predictive model, you must understand your data. In clinical settings, this is especially critical because:

Missing data is rarely random — A missing lab value might mean the test wasn’t ordered (patient seemed healthy) or the result was lost (different implications!)
Outliers might be real — That blood pressure of 250 might be a typo, or it might be a hypertensive crisis
Class imbalance is the norm — Most patients don’t have the disease you’re predicting
Domain knowledge matters — Knowing that an HbA1c of 6.5% or higher is a diagnostic criterion for diabetes can tell you more than a statistical test run without clinical context

This assignment uses the Pima Indians Diabetes Dataset, a classic clinical prediction dataset. Your task is to explore it thoroughly before any modeling begins.

The Dataset

The dataset contains diagnostic measurements from female patients of Pima Indian heritage, used to predict diabetes onset.

Feature	Description	Units
Pregnancies	Number of pregnancies	count
Glucose	Plasma glucose concentration (2hr OGTT)	mg/dL
BloodPressure	Diastolic blood pressure	mm Hg
SkinThickness	Triceps skin fold thickness	mm
Insulin	2-hour serum insulin	μU/mL
BMI	Body mass index	kg/m²
DiabetesPedigreeFunction	Diabetes likelihood score based on family history	unitless
Age	Age	years
Outcome	Diabetes diagnosis (1=yes, 0=no)	binary

Instructions

Part 1: Data Loading & Initial Exploration (20 points)

1.1 Load and Inspect (8 pts)

Load the dataset using pandas
Display basic info: shape, dtypes, first/last rows
Calculate summary statistics for all features
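
As a rough starting point, the three steps above might look like the sketch below. The inline CSV is a made-up three-row stand-in so the snippet runs on its own. In your repository you would pass the path of the starter data file to pd.read_csv instead (the file name is not stated here, so none is assumed).

```python
import io

import pandas as pd

# Made-up stand-in rows, NOT the real data. Replace the StringIO object
# with the path to the CSV file in your starter repository.
toy_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,110,70,25,85,27.4,0.410,34,0\n"
    "0,95,64,20,0,22.1,0.250,23,0\n"
    "5,165,82,0,0,35.8,0.730,51,1\n"
)
df = pd.read_csv(toy_csv)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head())   # first rows
print(df.tail())   # last rows

# count, mean, std, min, quartiles, and max for every numeric column
summary = df.describe()
print(summary)
```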

1.2 Identify Data Quality Issues (12 pts)

Some features have 0 values that are biologically impossible (e.g., Glucose=0, BMI=0)
Identify which features have this problem and how many rows are affected
Discuss: Are these missing values encoded as 0, or data entry errors?
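
One possible way to count suspect zeros is sketched below on a made-up toy frame. The suspect_cols list is an illustrative assumption, not the answer key: deciding which columns belong in it is part of the exercise.

```python
import pandas as pd

# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [0, 120, 95, 160],
    "BloodPressure": [70, 0, 80, 0],
    "BMI": [31.2, 0.0, 24.5, 28.9],
    "Outcome": [0, 1, 0, 1],
})

# Assumption for this sketch: zero is implausible in these columns.
# Pregnancies and Outcome are left out because zero is a valid value there.
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

is_zero = df[suspect_cols] == 0
zero_counts = is_zero.sum()                     # zeros per column
rows_affected = int(is_zero.any(axis=1).sum())  # rows with at least one zero

print(zero_counts)
print(f"{rows_affected} of {len(df)} rows contain at least one suspect zero")
```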

Part 2: Missing Data Analysis (25 points)

2.1 Visualize Missing Patterns (10 pts)

Treat biologically impossible zeros as missing values
Create a visualization showing the pattern of missingness across features
Are certain features more likely to be missing together?
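
Before plotting, it can help to build a boolean missingness table and a co-missingness count matrix. The sketch below does this on made-up toy data; the column list is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Glucose": [110, 0, 95, 130, 150],
    "SkinThickness": [0, 22, 0, 30, 0],
    "Insulin": [0, 80, 0, 120, 0],
})
cols = ["Glucose", "SkinThickness", "Insulin"]  # assumed suspect columns

# Recode the implausible zeros as NaN so pandas treats them as missing.
df[cols] = df[cols].replace(0, np.nan)

missing = df[cols].isna()       # True where a value is missing
missing_share = missing.mean()  # fraction missing per column

# Entry (i, j) counts the rows where columns i and j are both missing.
indicator = missing.astype(int)
co_missing = indicator.T @ indicator

print(missing_share)
print(co_missing)
```

The boolean table missing is a natural input for a heatmap (rows as patients, columns as features), and large off-diagonal counts in co_missing suggest features that tend to be missing together.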

2.2 Compare Missing vs. Non-Missing (10 pts)

For patients with missing Insulin values vs. those without:
- Compare the distribution of other features
- Compare the outcome rate (diabetes prevalence)
Is the data Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR)?
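
A minimal sketch of the comparison, on made-up toy data where the zeros have already been recoded as NaN. The insulin_status column name is a hypothetical helper introduced here.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Glucose": [100, 140, 90, 160, 120, 110],
    "Age": [25, 50, 30, 60, 40, 35],
    "Insulin": [np.nan, 130, np.nan, 200, 90, np.nan],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Label each patient by whether Insulin is missing.
df["insulin_status"] = np.where(df["Insulin"].isna(), "missing", "observed")

# Group means of the other features. The mean of the 0/1 Outcome column
# is the diabetes prevalence within each group.
comparison = df.groupby("insulin_status")[["Glucose", "Age", "Outcome"]].mean()
print(comparison)
```

As a general guide, similar groups are compatible with MCAR, while clear differences in observed features point toward MAR. MNAR cannot be confirmed from the observed values alone, so that part of your answer has to rest on clinical reasoning.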

2.3 Imputation Strategy (5 pts)

Propose and implement an imputation strategy for the missing values
Justify your choice (mean, median, model-based, or multiple imputation)
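
For reference, the simplest of the listed options (median imputation) could be sketched as follows on made-up toy data. This is only one candidate strategy, and the *_was_missing flag columns are a hypothetical addition that lets you trace imputed values later.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Glucose": [100.0, np.nan, 90.0, 160.0, 120.0],
    "BMI": [22.0, 30.0, np.nan, 35.0, 28.0],
})
cols = ["Glucose", "BMI"]

imputed = df.copy()

# Record which values were missing before they are overwritten.
for col in cols:
    imputed[f"{col}_was_missing"] = df[col].isna()

# median() skips NaN by default; fillna with a Series fills column by column.
medians = df[cols].median()
imputed[cols] = imputed[cols].fillna(medians)

print(imputed)
```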

Part 3: Visualization & Distributions (25 points)

3.1 Univariate Distributions (10 pts)

Create a figure showing the distribution of each feature:

Use histograms or KDE plots
Separate by outcome (diabetes vs. no diabetes)
Which features show the clearest separation between classes?
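
The following sketch shows one way to overlay per-class histograms with matplotlib. The data are randomly generated stand-ins, and the output file name is a hypothetical choice, so swap in the real DataFrame and your own naming.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data, not the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Glucose": np.concatenate([rng.normal(105, 15, 60), rng.normal(140, 20, 40)]),
    "BMI": np.concatenate([rng.normal(29, 5, 60), rng.normal(34, 6, 40)]),
    "Outcome": [0] * 60 + [1] * 40,
})

features = ["Glucose", "BMI"]
fig, axes = plt.subplots(1, len(features), figsize=(10, 4))
for ax, col in zip(axes, features):
    for outcome, label in [(0, "No diabetes"), (1, "Diabetes")]:
        values = df.loc[df["Outcome"] == outcome, col]
        # density=True rescales each class so unequal class sizes
        # do not dominate the comparison.
        ax.hist(values, bins=15, alpha=0.5, density=True, label=label)
    ax.set_title(f"{col} by outcome")
    ax.set_xlabel(col)
    ax.set_ylabel("Density")
    ax.legend()
fig.tight_layout()

out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)
out_path = out_dir / "univariate_by_outcome.png"  # hypothetical file name
fig.savefig(out_path, dpi=150)
plt.close(fig)
```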

3.2 Correlation Analysis (8 pts)

Create a correlation heatmap
Identify the features most correlated with the outcome
Are any features highly correlated with each other (multicollinearity)?
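
The numbers behind a heatmap can be inspected directly, as in this sketch on made-up toy data. The 0.8 cutoff for flagging feature pairs is an arbitrary illustrative threshold, not a rule from this assignment.

```python
import pandas as pd

# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Glucose": [90, 100, 110, 150, 160, 170],
    "BMI": [22, 30, 25, 33, 28, 36],
    "Age": [25, 45, 30, 50, 35, 60],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

corr = df.corr()  # Pearson correlation by default

# Rank features by the absolute size of their correlation with Outcome.
ranked = corr["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranked)

# List feature pairs whose absolute correlation exceeds the cutoff.
threshold = 0.8  # assumed cutoff, for illustration only
features = corr.drop(index="Outcome", columns="Outcome")
pairs = [
    (a, b, round(features.loc[a, b], 2))
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > threshold
]
print(pairs)
```

The corr matrix is what you would pass to a heatmap function. Keep in mind that Pearson correlation with a binary outcome only captures linear association.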

3.3 Clinical Visualizations (7 pts)

Create at least two clinically meaningful visualizations:

Example: BMI vs. Glucose colored by outcome
Example: Age distribution by number of pregnancies
Add appropriate titles, labels, and legends

Part 4: Feature Engineering (20 points)

4.1 Clinical Categories (8 pts)

Create categorical features based on clinical thresholds:

glucose_category: Normal (<100), Prediabetes (100-125), Diabetes (≥126), using the standard fasting plasma glucose cutoffs in mg/dL
bmi_category: Underweight (<18.5), Normal (18.5 to <25), Overweight (25 to <30), Obese (≥30)
age_group: Young (<30), Middle (30-49), Senior (≥50)
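
One way to build these categories is pd.cut with right=False, so that each bin includes its lower edge. The sketch below uses made-up boundary values to show where each cutoff lands. It assumes every threshold is inclusive on the lower side, as the "≥" signs suggest.

```python
import numpy as np
import pandas as pd

# Made-up values chosen to sit on or next to each cutoff.
df = pd.DataFrame({
    "Glucose": [99, 100, 125, 126],
    "BMI": [18.4, 18.5, 25.0, 30.0],
    "Age": [29, 30, 49, 50],
})

# right=False makes every bin closed on the left and open on the right,
# so 126 falls in "Diabetes" and 30.0 falls in "Obese".
df["glucose_category"] = pd.cut(
    df["Glucose"],
    bins=[-np.inf, 100, 126, np.inf],
    labels=["Normal", "Prediabetes", "Diabetes"],
    right=False,
)
df["bmi_category"] = pd.cut(
    df["BMI"],
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
df["age_group"] = pd.cut(
    df["Age"],
    bins=[-np.inf, 30, 50, np.inf],
    labels=["Young", "Middle", "Senior"],
    right=False,
)
print(df)
```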

4.2 Derived Features (7 pts)

Create at least two new features that might be predictive:

Example: glucose_insulin_ratio = Glucose / (Insulin + 1)
Example: age_bmi_interaction = Age × BMI
Justify why these features might be clinically meaningful
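
The two example features translate directly into pandas column arithmetic, sketched here on made-up toy rows.

```python
import pandas as pd

# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Glucose": [100.0, 150.0],
    "Insulin": [99.0, 0.0],
    "Age": [30, 50],
    "BMI": [25.0, 32.0],
})

# The +1 in the denominator avoids division by zero when Insulin is 0.
df["glucose_insulin_ratio"] = df["Glucose"] / (df["Insulin"] + 1)

# Product term: large only when both Age and BMI are large.
df["age_bmi_interaction"] = df["Age"] * df["BMI"]

print(df)
```

If a zero Insulin value actually encodes a missing measurement, the ratio for that row is not meaningful, so it is worth deciding whether to derive features before or after handling missing values.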

4.3 Feature Summary (5 pts)

Create a summary table comparing feature statistics for diabetic vs. non-diabetic patients
Calculate effect sizes (e.g., Cohen’s d) for continuous features
Which engineered features show the strongest association with outcome?
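
Cohen's d is the difference in group means divided by a pooled standard deviation. The sketch below implements the common pooled-variance form and applies it to made-up toy data. The cohens_d helper is written here for illustration and is not taken from a library.

```python
import numpy as np
import pandas as pd


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (ddof=1)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Made-up toy rows, not the real data.
df = pd.DataFrame({
    "Glucose": [140, 150, 160, 100, 110, 120],
    "BMI": [30.0, 34.0, 38.0, 26.0, 28.0, 30.0],
    "Outcome": [1, 1, 1, 0, 0, 0],
})
features = ["Glucose", "BMI"]

# Per-class mean and standard deviation for each feature.
summary = df.groupby("Outcome")[features].agg(["mean", "std"])

diabetic = df[df["Outcome"] == 1]
non_diabetic = df[df["Outcome"] == 0]
effect_sizes = pd.Series(
    {col: cohens_d(diabetic[col], non_diabetic[col]) for col in features}
)

print(summary)
print(effect_sizes)
```

This helper does not handle NaN, so drop or impute missing values before calling it.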

Part 5: Class Imbalance Analysis (10 points)

5.1 Quantify Imbalance (4 pts)

What is the ratio of positive to negative cases?
Visualize the class distribution
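
Class counts and the positive-to-negative ratio can be read off value_counts, as in this sketch with a made-up outcome column.

```python
import pandas as pd

# Made-up outcome column: 7 negatives and 3 positives.
df = pd.DataFrame({"Outcome": [0] * 7 + [1] * 3})

counts = df["Outcome"].value_counts()                # cases per class
shares = df["Outcome"].value_counts(normalize=True)  # fraction per class
pos_to_neg = counts[1] / counts[0]

print(counts)
print(f"positive:negative ratio = {pos_to_neg:.2f}")
```

The counts Series can be passed straight to a bar chart for the class distribution plot.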

5.2 Implications for Modeling (6 pts)

Answer these questions in code comments:

If you built a model that always predicted “no diabetes,” what would its accuracy be?
Why is accuracy a misleading metric for this dataset?
What metrics would be more appropriate? (Name at least 2)
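
To see how a constant "no diabetes" prediction scores, you can compare it with the true labels directly. The sketch uses made-up labels and computes recall only as one example of a class-sensitive metric. Your written answers should still be based on the real data.

```python
import numpy as np

# Made-up labels: 7 negatives and 3 positives.
y_true = np.array([0] * 7 + [1] * 3)

# A "model" that predicts the negative class for every patient.
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())

# Recall (sensitivity): the share of true positives that are detected.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```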

Submission via GitHub

Complete your work in hw3_eda.py
Save your figures to the outputs/ directory
Commit your changes with meaningful messages
Push to GitHub before the deadline

Deliverables

Your repository should contain:

hw3_eda.py — Completed code with comments
outputs/ — Generated figures (PNG files)
Clear commit history showing your progress

Grading Rubric

Component	Points
Part 1: Data Loading & Initial Exploration	20
1.1 Load and inspect	8
1.2 Identify data quality issues	12
Part 2: Missing Data Analysis	25
2.1 Visualize missing patterns	10
2.2 Compare missing vs non-missing	10
2.3 Imputation strategy	5
Part 3: Visualization & Distributions	25
3.1 Univariate distributions	10
3.2 Correlation analysis	8
3.3 Clinical visualizations	7
Part 4: Feature Engineering	20
4.1 Clinical categories	8
4.2 Derived features	7
4.3 Feature summary	5
Part 5: Class Imbalance	10
5.1 Quantify imbalance	4
5.2 Implications for modeling	6
Subtotal	100
Git Workflow
Multiple meaningful commits	-5 if missing
Clear commit messages	-5 if missing

Resources

Tips

Start with simple summaries — df.describe(), df.info(), df.isnull().sum()
Visualize before you analyze — Plots often reveal issues that statistics miss
Think clinically — Would a doctor find this feature meaningful?
Document your reasoning — Comments explaining why you made choices are as important as the code
Commit after each part — Don’t wait until the end