Not required for Spring 2026. Given the compressed end-of-semester timeline, HW7 is optional and will not be graded this term. The assignment is left up as a reference — the evaluation and explainability concepts are directly relevant to your final project’s field guide, so skim the parts that help your governance section. No submission needed.
Note: This assignment builds on your work from HW4 (ML) and HW5 (Deep Learning). If you don’t have working models from those assignments, starter models are provided in the repo.
By completing this assignment, you will:
A model with great AUC might still be clinically useless. This assignment goes beyond discrimination metrics to answer the questions clinicians actually care about:
These techniques bridge the gap between “my model has 0.85 AUC” and “this model is ready for clinical use.” They’re also essential components of the field guide you’ll write for your final project.
You’ll evaluate two models you’ve already built:
This reflects real-world practice: evaluation and explanation happen after you’ve built something, often by someone other than the original developer.
A well-calibrated model means that when it predicts 30% risk, about 30% of those patients actually have the outcome. This matters because clinicians interpret probabilities literally: a predicted risk of 30% may trigger an intervention that a predicted risk of 5% would not.
1.1 Calibration Plots (10 pts)
Using your HW4 diabetes prediction model:
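As a concrete starting point, here is a reliability-diagram sketch using `sklearn.calibration.calibration_curve`. The `y_test` and `y_prob` arrays below are synthetic stand-ins; substitute your HW4 model's test-set labels and predicted probabilities.

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Stand-in predictions: calibrated by construction, for illustration only
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)
y_test = (rng.uniform(size=500) < y_prob).astype(int)

# Bin the predictions and compare mean predicted risk vs. observed event rate
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10, strategy="quantile")

fig, ax = plt.subplots()
ax.plot(prob_pred, prob_true, marker="o", label="model")
ax.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed fraction with outcome")
ax.legend()
os.makedirs("outputs", exist_ok=True)
fig.savefig("outputs/calibration_original.png")
```

`strategy="quantile"` gives equal-count bins, which keeps the per-bin estimates stable when predictions cluster at low risk.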
1.2 Recalibration (10 pts)
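Two standard approaches here are Platt scaling and isotonic regression, both fit on held-out validation predictions and then applied to the test set. A minimal sketch on synthetic "overconfident" scores; the distortion (a doubled logit) and all data are illustrative stand-ins for your model's outputs.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Toy overconfident scores: true risk squashed through a doubled logit,
# which pushes predictions toward 0 and 1
rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, 10000)
y = (rng.uniform(size=10000) < true_p).astype(int)
raw = expit(2 * logit(true_p))

# Fit the recalibrator on a validation slice, evaluate on the rest
raw_val, raw_test = raw[:5000], raw[5000:]
y_val, y_test = y[:5000], y[5000:]

# Platt scaling: logistic regression on the raw score's log-odds
platt = LogisticRegression().fit(logit(raw_val).reshape(-1, 1), y_val)
platt_probs = platt.predict_proba(logit(raw_test).reshape(-1, 1))[:, 1]

# Isotonic regression: non-parametric, monotone remapping of the scores
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_val, y_val)
iso_probs = iso.predict(raw_test)

brier_raw = brier_score_loss(y_test, raw_test)
brier_platt = brier_score_loss(y_test, platt_probs)
brier_iso = brier_score_loss(y_test, iso_probs)
```

Note the split: recalibrating on the same data you evaluate on makes the "after" plot look better than it really is.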
1.3 Calibration Metrics & Interpretation (10 pts)
Calculate and report:
Write a short analysis (~150 words):
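Brier score and expected calibration error (ECE) are typical metrics for this part. sklearn provides the former; the ECE helper below (equal-width bins, 10 by default) and the synthetic data are illustrative sketches, not a required implementation.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-weighted mean |observed event rate - mean predicted risk|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so that y_prob == 1.0 is counted
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Toy check on synthetic calibrated scores: ECE should be near zero
rng = np.random.default_rng(2)
p = rng.uniform(0, 1, 10000)
y = (rng.uniform(size=10000) < p).astype(int)
ece = expected_calibration_error(y, p)
brier = brier_score_loss(y, p)
```

Keep in mind that the Brier score mixes discrimination and calibration, so report both numbers rather than relying on either alone.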
ROC curves tell you about discrimination. Decision curves tell you about clinical utility: does using this model lead to better decisions than simpler strategies?
2.1 Build Decision Curves (10 pts)
For your HW4 model:
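Decision curve analysis plots net benefit, NB(t) = TP/n - (FP/n) * t/(1-t), against the threshold probability t, for the model alongside the two reference strategies (treat everyone, treat no one). A sketch with synthetic stand-in predictions:

```python
import numpy as np

def net_benefit(y_true, y_prob, t):
    """Net benefit of intervening on patients with predicted risk >= t."""
    n = len(y_true)
    pred = y_prob >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    # false positives are discounted by the odds of the threshold
    return tp / n - (fp / n) * t / (1 - t)

# Synthetic stand-in predictions, calibrated by construction
rng = np.random.default_rng(3)
y_prob = rng.uniform(0, 1, 2000)
y = (rng.uniform(size=2000) < y_prob).astype(int)

thresholds = np.linspace(0.05, 0.5, 10)
prevalence = y.mean()
nb_model = np.array([net_benefit(y, y_prob, t) for t in thresholds])
nb_all = prevalence - (1 - prevalence) * thresholds / (1 - thresholds)  # treat everyone
nb_none = np.zeros_like(thresholds)                                      # treat no one
```

Plot all three curves on one axis; the model is only clinically useful over the threshold range where its curve sits above both references.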
2.2 Clinical Threshold Analysis (15 pts)
Assume the clinical context: patients predicted as high-risk will receive a preventive intervention (lifestyle counseling + more frequent monitoring).
Write a clinical interpretation (~200 words):
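For the write-up, it helps to report the operating characteristics at your chosen threshold. A small helper; the metric set and the per-100-patients framing are one reasonable choice for the intervention context above, not a required format.

```python
import numpy as np

def threshold_metrics(y_true, y_prob, t):
    """Confusion-matrix summary at decision threshold t."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(y_prob) >= t).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    return {
        "sensitivity": tp / (tp + fn),        # at-risk patients caught
        "specificity": tn / (tn + fp),        # low-risk patients spared intervention
        "ppv": tp / (tp + fp),                # flagged patients who truly need it
        "flagged_per_100": 100 * (tp + fp) / len(y_true),  # intervention burden
    }

# Tiny worked example: 4 patients, threshold 0.5
m = threshold_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], t=0.5)
```

Translating these into "per 100 patients screened" numbers makes the ~200-word clinical interpretation much easier to write.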
SHAP values explain individual predictions by attributing the prediction to each feature. But explanations can be misleading—your job is to interpret them critically.
3.1 Global Feature Importance (10 pts)
Answer: Do these align with clinical knowledge about diabetes risk factors? Any surprises?
3.2 Local Explanations (10 pts)
Select 3 individual patients from your test set:
For each patient:
3.3 Critical Evaluation (5 pts)
Answer briefly (~100 words):
Saliency maps show “where the model looked” when making a prediction. Grad-CAM is one popular method—but saliency maps can be misleading if not validated.
4.1 Generate Grad-CAM Visualizations (10 pts)
Using your HW5 medical image classifier:
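Grad-CAM weights each feature map of a late convolutional layer by the spatial mean of its gradient with respect to the target-class score, then takes a ReLU of the weighted sum and upsamples to image size. A self-contained sketch on a tiny stand-in CNN; for your HW5 model, grab the last conv layer's activations (forward/backward hooks are the usual mechanism) instead of returning them from `forward`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Toy stand-in for the HW5 classifier, exposing its last conv features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        feats = self.conv(x)              # (B, 16, H, W) feature maps
        pooled = feats.mean(dim=(2, 3))   # global average pool
        return self.head(pooled), feats

model = TinyCNN().eval()
x = torch.randn(1, 1, 32, 32)
logits, feats = model(x)
feats.retain_grad()                        # keep gradients on the non-leaf features
logits[0, logits.argmax()].backward()      # backprop the predicted-class score

weights = feats.grad.mean(dim=(2, 3), keepdim=True)   # per-channel importance
cam = F.relu((weights * feats).sum(dim=1))            # weighted combination, ReLU'd
cam = cam / (cam.max() + 1e-8)                        # normalize to [0, 1]
heatmap = F.interpolate(cam.unsqueeze(1), size=(32, 32), mode="bilinear")
```

Overlay `heatmap` on the input image with transparency for the deliverable figures; without the overlay the maps are hard to interpret.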
4.2 Sanity Check (5 pts)
Implement the sanity check from Adebayo et al. (2018):
If the saliency map looks similar with random weights, the explanation may not be trustworthy.
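A minimal version of the parameter-randomization test, using a plain input-gradient saliency map and a toy MLP as stand-ins for Grad-CAM and your HW5 classifier: compute the map, re-initialize the weights, recompute, and compare.

```python
import torch
import torch.nn as nn

def input_saliency(model, x):
    """|d(top-class score) / d(input)|: a simple saliency map."""
    x = x.clone().requires_grad_(True)
    out = model(x)
    out[0, out.argmax()].backward()
    return x.grad.abs().squeeze()

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(1, 1, 8, 8)

sal_orig = input_saliency(model, x)

# Randomization test: re-initialize the weights, recompute the map
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight)
        nn.init.zeros_(layer.bias)
sal_rand = input_saliency(model, x)

# High similarity means the map barely depends on the learned weights
corr = torch.corrcoef(torch.stack([sal_orig.flatten(), sal_rand.flatten()]))[0, 1]
```

Adebayo et al. randomize layers cascading from the top; this all-at-once version is the simplest variant of the same idea, and a rank correlation (e.g. on the flattened maps) is a common similarity measure.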
4.3 Interpretation (5 pts)
Answer briefly (~100 words):
| File | Description |
|---|---|
| `hw7_evaluation.py` | Main code for all parts |
| `outputs/calibration_original.png` | Original model calibration plot |
| `outputs/calibration_comparison.png` | Before/after recalibration comparison |
| `outputs/decision_curve.png` | Decision curve analysis plot |
| `outputs/shap_summary.png` | SHAP beeswarm plot |
| `outputs/shap_local_*.png` | SHAP plots for 3 individual patients |
| `outputs/gradcam_*.png` | Grad-CAM visualizations (4+ images) |
| `outputs/gradcam_sanity.png` | Sanity check comparison |
| `analysis.md` | Written interpretations for Parts 1.3, 2.2, 3.3, 4.3 |
| Component | Points |
|---|---|
| Part 1: Calibration Analysis | 30 |
| 1.1 Calibration plots | 10 |
| 1.2 Recalibration comparison | 10 |
| 1.3 Metrics & interpretation | 10 |
| Part 2: Decision Curve Analysis | 25 |
| 2.1 Decision curve plot | 10 |
| 2.2 Clinical threshold analysis | 15 |
| Part 3: SHAP Explanations | 25 |
| 3.1 Global feature importance | 10 |
| 3.2 Local explanations | 10 |
| 3.3 Critical evaluation | 5 |
| Part 4: Grad-CAM | 20 |
| 4.1 Grad-CAM visualizations | 10 |
| 4.2 Sanity check | 5 |
| 4.3 Interpretation | 5 |
| Total | 100 |
Calibration:
Decision Curves:
SHAP:
Grad-CAM: