Warning: this assignment is not yet released. Check back on February 25, 2026.
This assignment is due on Wednesday, March 11, 2026 before 11:59PM.

Get Started:

  1. Accept the assignment on GitHub Classroom — You’ll get your own private repository with starter code and data
  2. Clone your repo and complete the exercises in hw4_ml.py
  3. Commit regularly as you work (this is part of your grade!)
  4. Push your completed work to GitHub before the deadline

Learning Objectives

By completing this assignment, you will:


Background

Building a machine learning model is straightforward. Building one that’s actually useful in clinical practice is hard. This assignment bridges that gap by focusing on the evaluation and interpretation aspects that determine whether a model could actually help patients.

You’ll work with the Pima Indians Diabetes Dataset from HW3, but now your goal is to build and rigorously evaluate predictive models. The key insight: a model with 0.85 AUC can be clinically useless, while one with 0.75 AUC can save lives. Which is which depends on calibration, clinical context, and the decisions the model informs.


The Dataset

Same dataset as HW3 — you should use your imputed data or start fresh:

| Feature | Description |
|---|---|
| Pregnancies | Number of pregnancies |
| Glucose | Plasma glucose concentration (2-hour OGTT) |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-hour serum insulin (mu U/ml) |
| BMI | Body mass index (kg/m²) |
| DiabetesPedigreeFunction | Genetic risk score |
| Age | Age in years |
| Outcome | Diabetes diagnosis (1 = yes, 0 = no); target variable |

Instructions

Part 1: Data Preparation & Train/Test Split (15 points)

1.1 Load and Prepare Data (5 pts)
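A minimal sketch of this step (the file path is hypothetical; adjust to your repo, and use your imputed HW3 data if you have it). The key move, as in HW3, is recoding the biologically impossible zeros as missing values before modeling:

```python
import numpy as np
import pandas as pd

# df = pd.read_csv("data/diabetes.csv")  # hypothetical path; adjust to your repo
# Tiny stand-in frame so the sketch runs on its own:
df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 1],
})

# A zero glucose or BMI is physiologically impossible: it encodes "missing".
zero_means_missing = ["Glucose", "BMI"]  # also BloodPressure, SkinThickness, Insulin
df[zero_means_missing] = df[zero_means_missing].replace(0, np.nan)

X = df.drop(columns="Outcome")
y = df["Outcome"]
print(df.isna().sum().to_dict())
```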

1.2 Train/Test Split (10 pts)
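The split itself is one call; the detail that matters is `stratify=y`, which preserves the class balance in both halves. A sketch on synthetic stand-in data (your code will use the real Pima features and target):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the Pima shape and class imbalance.
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65, 0.35], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Stratification keeps the positive rate nearly identical in both splits.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```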


Part 2: Baseline Models (25 points)

2.1 Logistic Regression (10 pts)

Train a logistic regression model:
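One reasonable shape for this step (the scaler and `max_iter` value are choices, not requirements): wrap scaling and the classifier in a pipeline so the scaler is never fit on test data, and keep the predicted probabilities rather than only the hard labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; substitute your prepared Pima split.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Pipeline => the scaler is fit on training data only (no test-set leakage).
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

proba = logreg.predict_proba(X_test)[:, 1]  # risk scores for the positive class
print(round(logreg.score(X_test, y_test), 3))
```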

2.2 Random Forest (10 pts)

Train a random forest classifier:
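A matching sketch for the forest (the hyperparameters here are illustrative defaults; Part 4.2 tunes them properly). Tree ensembles do not need feature scaling, and they expose `feature_importances_` for a first look at what drives predictions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; substitute your prepared Pima split.
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

rf_proba = rf.predict_proba(X_test)[:, 1]
print(rf.feature_importances_.round(3))
```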

2.3 Model Comparison (5 pts)


Part 3: Evaluation Beyond Accuracy (25 points)

3.1 ROC Curve Analysis (8 pts)

For both models:
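scikit-learn gives both the curve and its summary statistic in two calls; a sketch with placeholder labels and scores (swap in `y_test` and each model's `predict_proba` output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, size=200)               # placeholder labels
proba = y_test * 0.4 + rng.random(200) * 0.6        # placeholder risk scores

fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(round(auc, 3))
# To plot: plt.plot(fpr, tpr), plus the chance diagonal plt.plot([0, 1], [0, 1], "--")
```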

3.2 Precision-Recall Curves (8 pts)

For both models:
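Same pattern for precision-recall, again on placeholder scores. The point worth internalizing: the no-skill baseline of a PR curve is the positive prevalence, not 0.5, which matters for an imbalanced outcome like diabetes:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, size=200)               # placeholder labels
proba = y_test * 0.4 + rng.random(200) * 0.6        # placeholder risk scores

precision, recall, thresholds = precision_recall_curve(y_test, proba)
ap = average_precision_score(y_test, proba)

# Unlike ROC, the no-skill baseline here is the positive prevalence, not 0.5.
print(round(ap, 3), "baseline:", round(y_test.mean(), 3))
```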

3.3 Calibration Analysis (9 pts)

For one model (your choice):
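A sketch of the two standard calibration tools, on placeholder scores: the reliability curve (does a predicted 30% risk correspond to a 30% observed event rate?) and the Brier score as a single-number summary:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, size=500)               # placeholder labels
proba = y_test * 0.4 + rng.random(500) * 0.6        # placeholder risk scores

# Reliability curve: within each probability bin, compare the mean predicted
# risk against the observed event rate. Perfect calibration lies on y = x.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
brier = brier_score_loss(y_test, proba)
print(np.round(frac_pos, 2), round(brier, 3))
```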


Part 4: Cross-Validation & Hyperparameter Tuning (20 points)

4.1 K-Fold Cross-Validation (10 pts)

Using the training data only:
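A sketch of the CV loop on stand-in training data. Stratified folds preserve the class balance within each fold, and reporting the spread alongside the mean shows how stable the estimate is:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for your training split; never fold in the held-out test set.
X_train, y_train = make_classification(n_samples=614, n_features=8,
                                       random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=cv, scoring="roc_auc")
print(scores.round(3), f"mean={scores.mean():.3f} +/- {scores.std():.3f}")
```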

4.2 Hyperparameter Tuning (10 pts)

For Random Forest:
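One common shape for the tuning step, with an illustrative (not prescribed) grid. The search fits only on training data, so the test set remains an honest final check:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for your training split.
X_train, y_train = make_classification(n_samples=614, n_features=8,
                                       random_state=42)

# Small illustrative grid; cost grows multiplicatively with each added value.
param_grid = {"n_estimators": [50, 200],
              "max_depth": [3, 6, None],
              "min_samples_leaf": [1, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)   # tuning touches only the training data
print(search.best_params_, round(search.best_score_, 3))
```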


Part 5: Model Interpretability (15 points)

5.1 SHAP Values (10 pts)

Using your best model:

5.2 Individual Predictions (5 pts)

Select two test patients:

For each:


Part 6: Fairness Analysis (10 points - REQUIRED)

6.1 Subgroup Performance (10 pts)

The Pima dataset includes Age. Analyze whether your model performs equitably:

| Age Group | N | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| Young (<30) | | | | |
| Middle (30-50) | | | | |
| Senior (>50) | | | | |
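The subgroup table can be filled programmatically. A sketch using synthetic stand-ins for the test-set ages, labels, and scores (swap in your real arrays); note that per-group AUC requires both classes to be present in the group:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
age = rng.integers(21, 70, size=300)                 # stand-in test ages
y_true = rng.integers(0, 2, size=300)                # stand-in labels
proba = y_true * 0.3 + rng.random(300) * 0.7         # stand-in risk scores

groups = np.select([age < 30, age <= 50], ["Young", "Middle"], default="Senior")
for g in ["Young", "Middle", "Senior"]:
    m = groups == g
    yt, yp = y_true[m], (proba[m] >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(yt, yp, labels=[0, 1]).ravel()
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    auc = roc_auc_score(yt, proba[m])  # assumes both classes in the group
    print(f"{g}: N={m.sum()} AUC={auc:.3f} sens={sens:.3f} spec={spec:.3f}")
```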

Answer these questions:

Note: This dataset doesn’t include race/ethnicity, but in real clinical AI, you must examine performance across demographic groups. The Obermeyer et al. paper (required reading) shows what happens when you don’t.


Reflection Questions

Answer these in code comments or a markdown cell:

  1. Clinical Utility: If this model were deployed, what threshold would you use? What’s the tradeoff between missed diabetes cases (false negatives) and unnecessary follow-ups (false positives)?

  2. Fairness Deep Dive: Based on your subgroup analysis, would you deploy this model as-is? What additional data would you want to collect?

  3. Limitations: What are three reasons this model might not work well in a different hospital system?


Submission via GitHub

  1. Complete your work in hw4_ml.py
  2. Save your figures to the outputs/ directory
  3. Commit your changes with meaningful messages
  4. Push to GitHub before the deadline

Deliverables

Your repository should contain:


Grading Rubric

| Component | Points |
|---|---|
| Part 1: Data Preparation | 15 |
| 1.1 Load and prepare data | 5 |
| 1.2 Train/test split with stratification | 10 |
| Part 2: Baseline Models | 25 |
| 2.1 Logistic regression | 10 |
| 2.2 Random forest | 10 |
| 2.3 Model comparison | 5 |
| Part 3: Evaluation Beyond Accuracy | 25 |
| 3.1 ROC curve analysis | 8 |
| 3.2 Precision-recall curves | 8 |
| 3.3 Calibration analysis | 9 |
| Part 4: Cross-Validation & Tuning | 20 |
| 4.1 K-fold cross-validation | 10 |
| 4.2 Hyperparameter tuning | 10 |
| Part 5: Interpretability | 15 |
| 5.1 SHAP values | 10 |
| 5.2 Individual predictions | 5 |
| Part 6: Fairness Analysis | 10 |
| 6.1 Subgroup performance | 10 |
| Subtotal | 110 |
| Git Workflow | |
| Multiple meaningful commits | -5 if missing |
| Clear commit messages | -5 if missing |

Resources


Tips