Warning: this assignment is not yet released. Check back on April 15, 2026.
This assignment is due on Wednesday, April 22, 2026 before 11:59PM.

Get Started:

  1. Accept the assignment on GitHub Classroom; you'll get your own private repository with starter code
  2. Clone your repo and complete the exercises in hw7_evaluation.py
  3. Push your completed work to GitHub before the deadline

Note: This assignment builds on your work from HW4 (ML) and HW5 (Deep Learning). If you don’t have working models from those assignments, starter models are provided in the repo.


Learning Objectives

By completing this assignment, you will:

  1. Assess model calibration with reliability plots, recalibrate a model, and report calibration metrics
  2. Evaluate clinical utility with decision curve analysis
  3. Generate and critically interpret SHAP explanations for a tabular model
  4. Generate Grad-CAM saliency maps for an image model and sanity-check them


Background

A model with great AUC might still be clinically useless. This assignment goes beyond discrimination metrics to answer the questions clinicians actually care about: Can the predicted probabilities be taken at face value? Does acting on the model's predictions do more good than harm? Can individual predictions be explained, and can the explanations themselves be trusted?

These techniques bridge the gap between “my model has 0.85 AUC” and “this model is ready for clinical use.” They’re also essential components of the field guide you’ll write for your final project.


Scenario

You’ll evaluate two models you’ve already built:

  1. Diabetes prediction model from HW4 (tabular/ML)
  2. Medical image classifier from HW5 (imaging/DL)

This reflects real-world practice: evaluation and explanation happen after you’ve built something, often by someone other than the original developer.


Instructions

Part 1: Calibration Analysis (30 points)

A well-calibrated model means that when it predicts 30% risk, about 30% of those patients actually have the outcome. This matters because clinicians interpret predicted probabilities literally.

1.1 Calibration Plots (10 pts)

Using your HW4 diabetes prediction model, plot a calibration (reliability) curve on your held-out test set and save it as outputs/calibration_original.png.
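If you're unsure where to start, here is a minimal sketch using scikit-learn's calibration_curve; `model`, `X_test`, and `y_test` are placeholders for your own HW4 objects:

```python
# Reliability diagram: observed event rate vs. mean predicted risk per bin.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

y_prob = model.predict_proba(X_test)[:, 1]             # predicted risk
frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(mean_pred, frac_pos, "o-", label="HW4 model")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.savefig("outputs/calibration_original.png", dpi=150)
```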

1.2 Recalibration (10 pts)
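The calibration_comparison.png deliverable implies fitting a recalibration map and plotting before/after curves. Platt scaling is one simple option; a minimal sketch, assuming you hold out a calibration split (`X_cal`, `y_cal` are hypothetical names) separate from your final test set:

```python
# Platt-style recalibration: learn a mapping from the model's scores to
# calibrated probabilities on a calibration split, then apply it at test time.
from sklearn.linear_model import LogisticRegression

cal_scores = model.predict_proba(X_cal)[:, 1].reshape(-1, 1)
recal = LogisticRegression().fit(cal_scores, y_cal)    # sigmoid recalibration map
# (fitting on the log-odds of the scores is the classic Platt variant)

test_scores = model.predict_proba(X_test)[:, 1].reshape(-1, 1)
y_prob_recal = recal.predict_proba(test_scores)[:, 1]
# Re-run the 1.1 plotting code with y_prob and y_prob_recal on the same
# axes to produce outputs/calibration_comparison.png.
```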

1.3 Calibration Metrics & Interpretation (10 pts)

Calculate and report:
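Two commonly reported calibration metrics are the Brier score and expected calibration error (ECE); check the starter code for the exact metrics required. A minimal sketch, reusing `y_test` and `y_prob` from 1.1:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-weighted mean of |observed event rate - mean predicted risk|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if in_bin.any():
            ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
    return ece

y_true = np.asarray(y_test)
print(f"Brier score: {brier_score_loss(y_true, y_prob):.3f}")
print(f"ECE:         {expected_calibration_error(y_true, y_prob):.3f}")
```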

Write a short analysis (~150 words):


Part 2: Decision Curve Analysis (25 points)

ROC curves tell you about discrimination. Decision curves tell you about clinical utility: does using this model lead to better decisions than simpler strategies? The core quantity is net benefit: at threshold probability pt, net benefit = TP/N - (FP/N) × pt/(1 - pt), where patients with predicted risk at or above pt receive the intervention.

2.1 Build Decision Curves (10 pts)

For your HW4 model, plot net benefit across a range of threshold probabilities and compare it against the treat-all and treat-none strategies (outputs/decision_curve.png).
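A minimal sketch, with `y_test` and `y_prob` as in Part 1 (the threshold range is an assumption; pick one that covers clinically plausible values):

```python
# Decision curve: net benefit of the model vs. treat-all and treat-none.
import numpy as np
import matplotlib.pyplot as plt

y = np.asarray(y_test)
n = len(y)
thresholds = np.linspace(0.01, 0.60, 60)

def net_benefit(pt):
    treat = y_prob >= pt                     # intervene above threshold pt
    tp = np.sum(treat & (y == 1)) / n
    fp = np.sum(treat & (y == 0)) / n
    return tp - fp * pt / (1 - pt)

nb_model = [net_benefit(pt) for pt in thresholds]
prev = y.mean()                              # treat-all: everyone intervened on
nb_all = prev - (1 - prev) * thresholds / (1 - thresholds)

plt.plot(thresholds, nb_model, label="HW4 model")
plt.plot(thresholds, nb_all, label="Treat all")
plt.axhline(0, color="k", lw=1, label="Treat none")
plt.xlabel("Threshold probability")
plt.ylabel("Net benefit")
plt.ylim(bottom=-0.05)
plt.legend()
plt.savefig("outputs/decision_curve.png", dpi=150)
```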

2.2 Clinical Threshold Analysis (15 pts)

Assume the clinical context: patients predicted as high-risk will receive a preventive intervention (lifestyle counseling + more frequent monitoring).
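To anchor your interpretation in numbers, one standard decision-curve summary is the net reduction in interventions per 100 patients relative to treating everyone. A sketch, with pt = 0.15 as a purely illustrative threshold (justify your own choice clinically):

```python
import numpy as np

pt = 0.15                          # illustrative only; defend your own value
y = np.asarray(y_test)
treat = y_prob >= pt
tp = np.sum(treat & (y == 1)) / len(y)
fp = np.sum(treat & (y == 0)) / len(y)
nb_model = tp - fp * pt / (1 - pt)
nb_all = y.mean() - (1 - y.mean()) * pt / (1 - pt)

# Net interventions avoided per 100 patients vs. treat-all at threshold pt.
avoided = (nb_model - nb_all) / (pt / (1 - pt)) * 100
print(f"Model net benefit {nb_model:.3f} vs treat-all {nb_all:.3f}; "
      f"net interventions avoided per 100 patients: {avoided:.1f}")
```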

Write a clinical interpretation (~200 words):


Part 3: SHAP Explanations (25 points)

SHAP values explain individual predictions by attributing the prediction to each feature. But explanations can be misleading—your job is to interpret them critically.

3.1 Global Feature Importance (10 pts)

Answer: Do these align with clinical knowledge about diabetes risk factors? Any surprises?
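One way to produce the shap_summary.png deliverable, assuming `model` is your fitted HW4 classifier and `X_test` is a pandas DataFrame (so features are named):

```python
import shap
import matplotlib.pyplot as plt

explainer = shap.Explainer(model, X_test)   # auto-selects a suitable algorithm
sv = explainer(X_test)
# Some explainers return one output per class for binary models; if sv has
# a trailing class dimension, slice the positive class, e.g. sv[..., 1].
shap.plots.beeswarm(sv, show=False)
plt.savefig("outputs/shap_summary.png", dpi=150, bbox_inches="tight")
```

A bar plot of mean |SHAP| values (shap.plots.bar(sv)) is a common companion view for ranking features.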

3.2 Local Explanations (10 pts)

Select 3 individual patients from your test set:

For each patient:
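For the shap_local_*.png deliverables, a waterfall plot is one standard per-patient visualization; a minimal sketch, with placeholder patient indices and `sv` from the 3.1 sketch:

```python
import shap
import matplotlib.pyplot as plt

for i in (3, 47, 112):                       # placeholder indices; use your picks
    shap.plots.waterfall(sv[i], show=False)  # per-patient feature attributions
    plt.savefig(f"outputs/shap_local_{i}.png", dpi=150, bbox_inches="tight")
    plt.close()
```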

3.3 Critical Evaluation (5 pts)

Answer briefly (~100 words):


Part 4: Grad-CAM for Image Models (20 points)

Saliency maps show “where the model looked” when making a prediction. Grad-CAM is one popular method—but saliency maps can be misleading if not validated.

4.1 Generate Grad-CAM Visualizations (10 pts)

Using your HW5 medical image classifier, generate Grad-CAM heatmaps for at least 4 test images and save them as outputs/gradcam_*.png.
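If your HW5 classifier is a PyTorch CNN, Grad-CAM needs only two hooks on the last convolutional layer: global-average-pool the gradients to get channel weights, take the weighted sum of the activations, and apply ReLU. A minimal sketch (`target_layer` and the model/input names are placeholders for your architecture):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """x: (1, C, H, W) input tensor. Returns an (H, W) heatmap in [0, 1]."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    try:
        model.eval()
        logits = model(x)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)        # channel weights (GAP)
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False).squeeze()
    return (cam / cam.max().clamp(min=1e-8)).detach().cpu().numpy()
```

Overlay each heatmap on its input image (e.g., plt.imshow of the image, then the map with alpha blending) to produce the gradcam_*.png files.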

4.2 Sanity Check (5 pts)

Implement the parameter-randomization sanity check from Adebayo et al. (2018): randomize the trained model's weights and regenerate the saliency maps, saving the comparison as outputs/gradcam_sanity.png.

If the saliency map looks similar with random weights, the explanation may not be trustworthy.
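A minimal version of the check, reusing the grad_cam sketch from 4.1: deep-copy the model, re-initialize every layer's weights, and regenerate the map (Adebayo et al. also use cascading, layer-by-layer randomization; `layer_name` is a placeholder for your architecture):

```python
import copy
from scipy.stats import spearmanr

layer_name = "features.10"            # placeholder: name of your last conv layer
rand_model = copy.deepcopy(model)
for m in rand_model.modules():
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()          # fresh random weights in every layer

cam_trained = grad_cam(model, x, dict(model.named_modules())[layer_name])
cam_random = grad_cam(rand_model, x,
                      dict(rand_model.named_modules())[layer_name])

# High similarity between the two maps means the explanation does not
# depend on what the model learned. Rank correlation is one simple test.
rho, _ = spearmanr(cam_trained.ravel(), cam_random.ravel())
print(f"Spearman correlation, trained vs. random weights: {rho:.2f}")
```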

4.3 Interpretation (5 pts)

Answer briefly (~100 words):


Deliverables

File                                  Description
hw7_evaluation.py                     Main code for all parts
outputs/calibration_original.png      Original model calibration plot
outputs/calibration_comparison.png    Before/after recalibration comparison
outputs/decision_curve.png            Decision curve analysis plot
outputs/shap_summary.png              SHAP beeswarm plot
outputs/shap_local_*.png              SHAP plots for 3 individual patients
outputs/gradcam_*.png                 Grad-CAM visualizations (4+ images)
outputs/gradcam_sanity.png            Sanity check comparison
analysis.md                           Written interpretations for Parts 1.3, 2.2, 3.3, 4.3

Grading Rubric

Component                             Points
Part 1: Calibration Analysis              30
  1.1 Calibration plots                   10
  1.2 Recalibration comparison            10
  1.3 Metrics & interpretation            10
Part 2: Decision Curve Analysis           25
  2.1 Decision curve plot                 10
  2.2 Clinical threshold analysis         15
Part 3: SHAP Explanations                 25
  3.1 Global feature importance           10
  3.2 Local explanations                  10
  3.3 Critical evaluation                  5
Part 4: Grad-CAM                          20
  4.1 Grad-CAM visualizations             10
  4.2 Sanity check                         5
  4.3 Interpretation                       5
Total                                    100

Resources

Calibration:

Decision Curves:

SHAP:

Grad-CAM:


Tips