Warning: this assignment is not yet released. Check back on February 25, 2026.
This assignment is due on Wednesday, March 11, 2026 before 11:59PM.
Get Started:
- Accept the assignment on GitHub Classroom — You’ll get your own private repository with starter code and data
- Clone your repo and complete the exercises in `hw4_ml.py`
- Commit regularly as you work (this is part of your grade!)
- Push your completed work to GitHub before the deadline
Learning Objectives
By completing this assignment, you will:
- Train and evaluate classification models for clinical prediction
- Understand evaluation metrics beyond accuracy (ROC, PR curves, calibration)
- Apply cross-validation and avoid data leakage
- Interpret models using feature importance and SHAP values
- Recognize when a model is (and isn’t) clinically useful
Background
Building a machine learning model is straightforward. Building one that’s actually useful in clinical practice is hard. This assignment bridges that gap by focusing on the evaluation and interpretation aspects that determine whether a model could actually help patients.
You’ll work with the Pima Indians Diabetes Dataset from HW3, but now your goal is to build and rigorously evaluate predictive models. The key insight: a model with 0.85 AUC might be useless, while one with 0.75 AUC might save lives. It depends on calibration, clinical context, and what decisions the model informs.
The Dataset
Same dataset as HW3 — you can reuse your imputed data or start fresh:
| Feature | Description |
|---|---|
| Pregnancies | Number of pregnancies |
| Glucose | Plasma glucose concentration (2-hr OGTT, mg/dL) |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (μU/mL) |
| BMI | Body mass index (kg/m²) |
| DiabetesPedigreeFunction | Genetic risk score |
| Age | Age in years |
| Outcome | Diabetes diagnosis (1 = yes, 0 = no) — target variable |
Instructions
Part 1: Data Preparation & Train/Test Split (15 points)
1.1 Load and Prepare Data (5 pts)
- Load the diabetes dataset
- Handle missing values (zeros in Glucose, BMI, etc.) — you can use your approach from HW3
- Split into features (X) and target (y)
1.2 Train/Test Split (10 pts)
- Create a train/test split (80/20)
- Use stratification to maintain class balance
- Important: Document the class distribution in both sets
- Why is stratification important for imbalanced data?
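A minimal sketch of Parts 1.1–1.2, assuming the CSV lives at `data/diabetes.csv` (a hypothetical path — adjust to your repo's layout). The variable names here (`X_train`, `X_test`, etc.) are reused by the later sketches:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/diabetes.csv")  # hypothetical path; match your repo

# Zeros are physiologically impossible in these columns, so treat them as
# missing. Median imputation is one simple option; reuse your HW3 approach
# if you prefer. (Strictly, fitting the imputation on the training split
# alone is cleaner; the median changes little either way.)
for col in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())

X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# Stratified 80/20 split keeps the positive rate (~35%) the same in both
# sets, so test metrics aren't skewed by a chance imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train positive rate: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```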
Part 2: Baseline Models (25 points)
2.1 Logistic Regression (10 pts)
Train a logistic regression model:
- Use default parameters first
- Report accuracy, precision, recall, F1 on the test set
- Print the confusion matrix
- Which features have the largest coefficients? What does this mean clinically?
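A sketch for 2.1, assuming `X_train`/`X_test` from the Part 1 sketch. A pipeline scales features inside the fit, so the scaler never sees test data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(classification_report(y_test, y_pred, digits=3))  # precision/recall/F1
print(confusion_matrix(y_test, y_pred))

# Because features are standardized, coefficient magnitudes are comparable.
coefs = pd.Series(logreg[-1].coef_[0], index=X_train.columns)
print(coefs.sort_values(key=abs, ascending=False))
```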
2.2 Random Forest (10 pts)
Train a random forest classifier:
- Use default parameters first
- Report the same metrics as 2.1
- Extract and visualize feature importances
- Compare to logistic regression — which features matter most?
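A sketch for 2.2, again assuming the Part 1 variables. Tree ensembles don't need scaling, so the raw features go straight in:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test), digits=3))

# Impurity-based importances: quick to get, but they can be biased toward
# features with many distinct values, so compare with SHAP in Part 5.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(title="Random forest feature importances")
plt.tight_layout()
plt.savefig("outputs/rf_importances.png")
```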
2.3 Model Comparison (5 pts)
- Create a table comparing the two models
- Which model would you choose and why?
- Is accuracy the right metric to compare them?
Part 3: Evaluation Beyond Accuracy (25 points)
3.1 ROC Curve Analysis (8 pts)
For both models:
- Plot ROC curves on the same figure
- Calculate and display AUC for each
- Mark the point on each curve corresponding to the default threshold (0.5)
- What does the ROC curve tell you that accuracy doesn’t?
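A sketch for 3.1, assuming the fitted `logreg` and `rf` from the Part 2 sketches. `RocCurveDisplay` puts each curve's AUC in the legend; the loop then marks where the default 0.5 threshold lands:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_curve

fig, ax = plt.subplots()
for model, name in [(logreg, "Logistic regression"), (rf, "Random forest")]:
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
    # Find the ROC point whose threshold is closest to the default 0.5.
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    i = abs(thresholds - 0.5).argmin()
    ax.plot(fpr[i], tpr[i], "o", label=f"{name} @ 0.5")
ax.legend()
plt.savefig("outputs/roc_curves.png")
```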
3.2 Precision-Recall Curves (8 pts)
For both models:
- Plot Precision-Recall curves
- Calculate Average Precision (AP) for each
- Why are PR curves often more informative than ROC for medical prediction?
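A sketch for 3.2, same assumptions as 3.1. Average Precision appears in the legend; the dashed line is the prevalence baseline a random classifier would achieve:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay

fig, ax = plt.subplots()
PrecisionRecallDisplay.from_estimator(logreg, X_test, y_test,
                                      name="Logistic regression", ax=ax)
PrecisionRecallDisplay.from_estimator(rf, X_test, y_test,
                                      name="Random forest", ax=ax)
ax.axhline(y_test.mean(), ls="--", color="gray", label="Prevalence baseline")
ax.legend()
plt.savefig("outputs/pr_curves.png")
```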
3.3 Calibration Analysis (9 pts)
For one model (your choice):
- Create a calibration plot (reliability diagram)
- Is your model well-calibrated?
- Why does calibration matter for clinical decision support?
- If poorly calibrated, what could you do to fix it?
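A sketch for 3.3 using the random forest (swap in whichever model you chose). The reliability diagram bins predicted probabilities and plots them against the observed positive fraction in each bin:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay

CalibrationDisplay.from_estimator(rf, X_test, y_test, n_bins=10)
plt.savefig("outputs/calibration.png")

# If the curve strays from the diagonal, one fix is post-hoc calibration,
# e.g. sklearn's CalibratedClassifierCV (method="sigmoid" or "isotonic"),
# fitted with cross-validation on the training data only.
```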
Part 4: Cross-Validation & Hyperparameter Tuning (20 points)
4.1 K-Fold Cross-Validation (10 pts)
Using the training data only:
- Implement 5-fold stratified cross-validation
- Report mean and std of AUC across folds
- Why do we use cross-validation instead of a single validation split?
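A sketch for 4.1, assuming `X_train`/`y_train` and the `rf` model from earlier sketches. Note that the test set never appears here:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f} across {len(scores)} folds")
```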
4.2 Hyperparameter Tuning (10 pts)
For Random Forest:
- Use GridSearchCV or RandomizedSearchCV to tune:
  - `n_estimators`: [50, 100, 200]
  - `max_depth`: [3, 5, 10, None]
  - `min_samples_split`: [2, 5, 10]
- Report the best parameters
- Does tuning improve test set performance significantly?
- Warning: What’s the risk of tuning on cross-validation and then expecting the same performance on truly new data?
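A sketch for 4.2 with the grid above. `GridSearchCV` refits the best configuration on the full training set; only then do you score the held-out test set, once:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print(f"Best CV AUC: {search.best_score_:.3f}")

# One final, honest look at the test set with the tuned model.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")
```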
Part 5: Model Interpretability (15 points)
5.1 SHAP Values (10 pts)
Using your best model:
- Calculate SHAP values for the test set
- Create a SHAP summary plot (beeswarm)
- Create a SHAP bar plot (mean absolute SHAP values)
- Interpret: Which features drive predictions? In what direction?
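A sketch for 5.1, assuming the tuned forest from Part 4 (`search.best_estimator_`). For a binary sklearn forest, recent SHAP versions return one attribution per class, so the positive class is sliced out; this indexing may need adjusting for your SHAP version:

```python
import matplotlib.pyplot as plt
import shap

explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer(X_test)

# Keep only the positive-class ("has diabetes") attributions if SHAP
# returned one value per class.
sv = shap_values[:, :, 1] if shap_values.values.ndim == 3 else shap_values

shap.plots.beeswarm(sv, show=False)   # direction and magnitude per feature
plt.savefig("outputs/shap_beeswarm.png", bbox_inches="tight")
plt.close()
shap.plots.bar(sv, show=False)        # mean |SHAP| per feature
plt.savefig("outputs/shap_bar.png", bbox_inches="tight")
```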
5.2 Individual Predictions (5 pts)
Select two test patients:
- One correctly classified diabetic
- One false negative (missed diabetes)
For each:
- Show their feature values
- Create a SHAP waterfall plot
- Explain why the model made its prediction
- For the false negative: what went wrong?
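A sketch for 5.2, reusing `sv` and `search` from the previous sketches to pick one correctly flagged diabetic and one false negative:

```python
import matplotlib.pyplot as plt
import numpy as np
import shap

y_pred = search.best_estimator_.predict(X_test)
true_pos = np.where((y_test.values == 1) & (y_pred == 1))[0]
false_neg = np.where((y_test.values == 1) & (y_pred == 0))[0]

# Take the first example of each (check the arrays aren't empty first).
for i, label in [(true_pos[0], "true_positive"), (false_neg[0], "false_negative")]:
    print(f"\n--- {label} ---")
    print(X_test.iloc[i])  # the patient's feature values
    shap.plots.waterfall(sv[i], show=False)
    plt.savefig(f"outputs/waterfall_{label}.png", bbox_inches="tight")
    plt.close()
```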
Part 6: Fairness Analysis (10 points - REQUIRED)
6.1 Subgroup Performance (10 pts)
The Pima dataset includes Age. Analyze whether your model performs equitably:
- Split the test set into age groups: Young (<30), Middle (30-50), Senior (>50)
- Calculate AUC, sensitivity, and specificity for each subgroup
- Create a table comparing performance across groups (template below; see the code sketch at the end of this part)
| Age Group | N | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| Young (<30) | | | | |
| Middle (30-50) | | | | |
| Senior (>50) | | | | |
Answer these questions:
- Does your model perform equally well across age groups?
- If there are disparities, what might explain them?
- How would you address this before deployment?
Note: This dataset doesn’t include race/ethnicity, but in real clinical AI, you must examine performance across demographic groups. The Obermeyer et al. paper (required reading) shows what happens when you don’t.
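A sketch for 6.1, assuming the tuned model and test split from earlier sketches. `N` is reported because small strata make AUC and sensitivity noisy:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score, roc_auc_score

proba = search.best_estimator_.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

# Bin edges chosen so Young is <30, Middle is 30-50, Senior is >50.
groups = pd.cut(X_test["Age"], bins=[0, 29, 50, 120],
                labels=["Young (<30)", "Middle (30-50)", "Senior (>50)"])

rows = []
for g in groups.cat.categories:
    mask = (groups == g).to_numpy()
    yt, yp, pp = y_test.values[mask], pred[mask], proba[mask]
    rows.append({
        "Age Group": g,
        "N": int(mask.sum()),
        # AUC is undefined if a subgroup contains only one class.
        "AUC": roc_auc_score(yt, pp) if len(np.unique(yt)) == 2 else np.nan,
        "Sensitivity": recall_score(yt, yp, zero_division=0),
        "Specificity": recall_score(yt, yp, pos_label=0, zero_division=0),
    })
print(pd.DataFrame(rows).round(3))
```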
Reflection Questions
Answer these in code comments or a markdown cell:
- Clinical Utility: If this model were deployed, what threshold would you use? What’s the tradeoff between missing diabetics (false negatives) and unnecessary follow-ups (false positives)?
- Fairness Deep Dive: Based on your subgroup analysis, would you deploy this model as-is? What additional data would you want to collect?
- Limitations: What are three reasons this model might not work well in a different hospital system?
Submission via GitHub
- Complete your work in `hw4_ml.py`
- Save your figures to the `outputs/` directory
- Commit your changes with meaningful messages
- Push to GitHub before the deadline
Deliverables
Your repository should contain:
- `hw4_ml.py` — Completed code with comments
- `outputs/` — Generated figures (PNG files)
- Clear commit history showing your progress
Grading Rubric
| Component | Points |
|---|---|
| Part 1: Data Preparation | 15 |
| 1.1 Load and prepare data | 5 |
| 1.2 Train/test split with stratification | 10 |
| Part 2: Baseline Models | 25 |
| 2.1 Logistic regression | 10 |
| 2.2 Random forest | 10 |
| 2.3 Model comparison | 5 |
| Part 3: Evaluation Beyond Accuracy | 25 |
| 3.1 ROC curve analysis | 8 |
| 3.2 Precision-recall curves | 8 |
| 3.3 Calibration analysis | 9 |
| Part 4: Cross-Validation & Tuning | 20 |
| 4.1 K-fold cross-validation | 10 |
| 4.2 Hyperparameter tuning | 10 |
| Part 5: Interpretability | 15 |
| 5.1 SHAP values | 10 |
| 5.2 Individual predictions | 5 |
| Part 6: Fairness Analysis | 10 |
| 6.1 Subgroup performance | 10 |
| Subtotal | 110 |
| Git Workflow | |
| Multiple meaningful commits | -5 if missing |
| Clear commit messages | -5 if missing |
Resources
Tips
- Don’t chase AUC — a well-calibrated model with 0.75 AUC is often more useful than a poorly calibrated one with 0.85
- Interpret with domain knowledge — If Glucose isn’t a top feature, something might be wrong
- Watch for data leakage — Tune hyperparameters on training data only
- SHAP is slow — Use a subset of data if needed
- Commit after each part — Don’t wait until the end