Not required for Spring 2026. Given the compressed end-of-semester timeline, HW7 is optional and will not be graded this term. The assignment is left up as a reference — the evaluation and explainability concepts are directly relevant to your final project’s field guide, so skim the parts that help your governance section. No submission needed.
Note: This assignment builds on your work from HW4 (ML) and HW5 (Deep Learning). If you don’t have working models from those assignments, starter models are provided in the repo.
By completing this assignment, you will:
A model with great AUC might still be clinically useless. This assignment goes beyond discrimination metrics to answer the questions clinicians actually care about:
These techniques bridge the gap between “my model has 0.85 AUC” and “this model is ready for clinical use.” They’re also essential components of the field guide you’ll write for your final project.
You’ll evaluate two models you’ve already built:
This reflects real-world practice: evaluation and explanation happen after you’ve built something, often by someone other than the original developer.
A well-calibrated model means that when it predicts 30% risk, about 30% of those patients actually have the outcome. This matters because clinicians interpret probabilities literally: a predicted risk of 30% may trigger an intervention that a predicted risk of 5% would not.
1.1 Calibration Plots (10 pts)
Using your HW4 diabetes prediction model:
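As a concrete starting point, here is a reliability-diagram sketch using `sklearn.calibration.calibration_curve`. The `y_test` and `y_prob` arrays below are synthetic stand-ins; substitute your HW4 model's test-set labels and predicted probabilities.

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Stand-in predictions: calibrated by construction, for illustration only
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)
y_test = (rng.uniform(size=500) < y_prob).astype(int)

# Bin the predictions and compare mean predicted risk vs. observed event rate
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10, strategy="quantile")

fig, ax = plt.subplots()
ax.plot(prob_pred, prob_true, marker="o", label="model")
ax.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed fraction with outcome")
ax.legend()
os.makedirs("outputs", exist_ok=True)
fig.savefig("outputs/calibration_original.png")
```

`strategy="quantile"` gives equal-count bins, which keeps the per-bin estimates stable when predictions cluster at low risk.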
1.2 Recalibration (10 pts)
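Two standard approaches here are Platt scaling and isotonic regression, both fit on held-out validation predictions and then applied to the test set. A minimal sketch on synthetic "overconfident" scores; the distortion (a doubled logit) and all data are illustrative stand-ins for your model's outputs.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Toy overconfident scores: true risk squashed through a doubled logit,
# which pushes predictions toward 0 and 1
rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, 10000)
y = (rng.uniform(size=10000) < true_p).astype(int)
raw = expit(2 * logit(true_p))

# Fit the recalibrator on a validation slice, evaluate on the rest
raw_val, raw_test = raw[:5000], raw[5000:]
y_val, y_test = y[:5000], y[5000:]

# Platt scaling: logistic regression on the raw score's log-odds
platt = LogisticRegression().fit(logit(raw_val).reshape(-1, 1), y_val)
platt_probs = platt.predict_proba(logit(raw_test).reshape(-1, 1))[:, 1]

# Isotonic regression: non-parametric, monotone remapping of the scores
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_val, y_val)
iso_probs = iso.predict(raw_test)

brier_raw = brier_score_loss(y_test, raw_test)
brier_platt = brier_score_loss(y_test, platt_probs)
brier_iso = brier_score_loss(y_test, iso_probs)
```

Note the split: recalibrating on the same data you evaluate on makes the "after" plot look better than it really is.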
1.3 Calibration Metrics & Interpretation (10 pts)
Calculate and report:
Write a short analysis (~150 words):
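Brier score and expected calibration error (ECE) are typical metrics for this part. sklearn provides the former; the ECE helper below (equal-width bins, 10 by default) and the synthetic data are illustrative sketches, not a required implementation.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-weighted mean |observed event rate - mean predicted risk|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # last bin is closed on the right so that y_prob == 1.0 is counted
        mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Toy check on synthetic calibrated scores: ECE should be near zero
rng = np.random.default_rng(2)
p = rng.uniform(0, 1, 10000)
y = (rng.uniform(size=10000) < p).astype(int)
ece = expected_calibration_error(y, p)
brier = brier_score_loss(y, p)
```

Keep in mind that the Brier score mixes discrimination and calibration, so report both numbers rather than relying on either alone.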
ROC curves tell you about discrimination. Decision curves tell you about clinical utility: does using this model lead to better decisions than simpler strategies?
2.1 Build Decision Curves (10 pts)
For your HW4 model:
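Decision curve analysis plots net benefit, NB(t) = TP/n - (FP/n) * t/(1-t), against the threshold probability t, for the model alongside the two reference strategies (treat everyone, treat no one). A sketch with synthetic stand-in predictions:

```python
import numpy as np

def net_benefit(y_true, y_prob, t):
    """Net benefit of intervening on patients with predicted risk >= t."""
    n = len(y_true)
    pred = y_prob >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    # false positives are discounted by the odds of the threshold
    return tp / n - (fp / n) * t / (1 - t)

# Synthetic stand-in predictions, calibrated by construction
rng = np.random.default_rng(3)
y_prob = rng.uniform(0, 1, 2000)
y = (rng.uniform(size=2000) < y_prob).astype(int)

thresholds = np.linspace(0.05, 0.5, 10)
prevalence = y.mean()
nb_model = np.array([net_benefit(y, y_prob, t) for t in thresholds])
nb_all = prevalence - (1 - prevalence) * thresholds / (1 - thresholds)  # treat everyone
nb_none = np.zeros_like(thresholds)                                      # treat no one
```

Plot all three curves on one axis; the model is only clinically useful over the threshold range where its curve sits above both references.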
2.2 Clinical Threshold Analysis (15 pts)
Assume the clinical context: patients predicted as high-risk will receive a preventive intervention (lifestyle counseling + more frequent monitoring).
Write a clinical interpretation (~200 words):
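For the write-up, it helps to report the operating characteristics at your chosen threshold. A small helper; the metric set and the per-100-patients framing are one reasonable choice for the intervention context above, not a required format.

```python
import numpy as np

def threshold_metrics(y_true, y_prob, t):
    """Confusion-matrix summary at decision threshold t."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(y_prob) >= t).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    return {
        "sensitivity": tp / (tp + fn),        # at-risk patients caught
        "specificity": tn / (tn + fp),        # low-risk patients spared intervention
        "ppv": tp / (tp + fp),                # flagged patients who truly need it
        "flagged_per_100": 100 * (tp + fp) / len(y_true),  # intervention burden
    }

# Tiny worked example: 4 patients, threshold 0.5
m = threshold_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1], t=0.5)
```

Translating these into "per 100 patients screened" numbers makes the ~200-word clinical interpretation much easier to write.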
SHAP values explain individual predictions by attributing the prediction to each feature. But explanations can be misleading—your job is to interpret them critically.
3.1 Global Feature Importance (10 pts)
Answer: Do these align with clinical knowledge about diabetes risk factors? Any surprises?
3.2 Local Explanations (10 pts)
Select 3 individual patients from your test set:
For each patient:
3.3 Critical Evaluation (5 pts)
Answer briefly (~100 words):
Saliency maps show “where the model looked” when making a prediction. Grad-CAM is one popular method—but saliency maps can be misleading if not validated.
4.1 Generate Grad-CAM Visualizations (10 pts)
Using your HW5 medical image classifier:
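Grad-CAM weights each feature map of a late convolutional layer by the spatial mean of its gradient with respect to the target-class score, then takes a ReLU of the weighted sum and upsamples to image size. A self-contained sketch on a tiny stand-in CNN; for your HW5 model, grab the last conv layer's activations (forward/backward hooks are the usual mechanism) instead of returning them from `forward`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Toy stand-in for the HW5 classifier, exposing its last conv features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        feats = self.conv(x)              # (B, 16, H, W) feature maps
        pooled = feats.mean(dim=(2, 3))   # global average pool
        return self.head(pooled), feats

model = TinyCNN().eval()
x = torch.randn(1, 1, 32, 32)
logits, feats = model(x)
feats.retain_grad()                        # keep gradients on the non-leaf features
logits[0, logits.argmax()].backward()      # backprop the predicted-class score

weights = feats.grad.mean(dim=(2, 3), keepdim=True)   # per-channel importance
cam = F.relu((weights * feats).sum(dim=1))            # weighted combination, ReLU'd
cam = cam / (cam.max() + 1e-8)                        # normalize to [0, 1]
heatmap = F.interpolate(cam.unsqueeze(1), size=(32, 32), mode="bilinear")
```

Overlay `heatmap` on the input image with transparency for the deliverable figures; without the overlay the maps are hard to interpret.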
4.2 Sanity Check (5 pts)
Implement the sanity check from Adebayo et al. (2018):
If the saliency map looks similar with random weights, the explanation may not be trustworthy.
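A minimal version of the parameter-randomization test, using a plain input-gradient saliency map and a toy MLP as stand-ins for Grad-CAM and your HW5 classifier: compute the map, re-initialize the weights, recompute, and compare.

```python
import torch
import torch.nn as nn

def input_saliency(model, x):
    """|d(top-class score) / d(input)|: a simple saliency map."""
    x = x.clone().requires_grad_(True)
    out = model(x)
    out[0, out.argmax()].backward()
    return x.grad.abs().squeeze()

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
x = torch.randn(1, 1, 8, 8)

sal_orig = input_saliency(model, x)

# Randomization test: re-initialize the weights, recompute the map
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight)
        nn.init.zeros_(layer.bias)
sal_rand = input_saliency(model, x)

# High similarity means the map barely depends on the learned weights
corr = torch.corrcoef(torch.stack([sal_orig.flatten(), sal_rand.flatten()]))[0, 1]
```

Adebayo et al. randomize layers cascading from the top; this all-at-once version is the simplest variant of the same idea, and a rank correlation (e.g. on the flattened maps) is a common similarity measure.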
4.3 Interpretation (5 pts)
Answer briefly (~100 words):
| File | Description |
|---|---|
| `hw7_evaluation.py` | Main code for all parts |
| `outputs/calibration_original.png` | Original model calibration plot |
| `outputs/calibration_comparison.png` | Before/after recalibration comparison |
| `outputs/decision_curve.png` | Decision curve analysis plot |
| `outputs/shap_summary.png` | SHAP beeswarm plot |
| `outputs/shap_local_*.png` | SHAP plots for 3 individual patients |
| `outputs/gradcam_*.png` | Grad-CAM visualizations (4+ images) |
| `outputs/gradcam_sanity.png` | Sanity check comparison |
| `analysis.md` | Written interpretations for Parts 1.3, 2.2, 3.3, 4.3 |
| Component | Points |
|---|---|
| Part 1: Calibration Analysis | 30 |
| 1.1 Calibration plots | 10 |
| 1.2 Recalibration comparison | 10 |
| 1.3 Metrics & interpretation | 10 |
| Part 2: Decision Curve Analysis | 25 |
| 2.1 Decision curve plot | 10 |
| 2.2 Clinical threshold analysis | 15 |
| Part 3: SHAP Explanations | 25 |
| 3.1 Global feature importance | 10 |
| 3.2 Local explanations | 10 |
| 3.3 Critical evaluation | 5 |
| Part 4: Grad-CAM | 20 |
| 4.1 Grad-CAM visualizations | 10 |
| 4.2 Sanity check | 5 |
| 4.3 Interpretation | 5 |
| Total | 100 |
Calibration:
Decision Curves:
SHAP:
Grad-CAM: