We introduce the course philosophy: you will learn to build AI tools for medicine, evaluate them with discipline, and write the field guide someone else could use. Key concepts include Models vs Systems, Metrics vs Readiness, and the governance mindset.
Course Overview & Expectations
Introduction to the course structure, grading, and what you’ll build by semester end. We discuss the three project tracks (imaging, NLP, structured data) and the field guide requirement.
The Field Guide Mindset: Models vs. Systems
Why building a model is the easy part. Deployment, workflow integration, governance, and monitoring are where most clinical AI projects fail. We reframe success as “Could a busy community clinic use this safely?”
Five Pillars of Clinical AI
Deep dive into the course philosophy: (1) Models vs Systems, (2) Metrics vs Readiness, (3) Governance as Quantitative Discipline, (4) Maturity Levels, (5) Field Guide Mindset.
What Makes Medical AI Different?
Unique challenges of healthcare: data privacy (HIPAA), class imbalance, distribution shift, high-stakes decisions, regulatory requirements, and the need for explainability.
Readings:
Peter Lee, Carey Goldberg, Isaac Kohane, The AI Revolution in Medicine: GPT-4 and Beyond (Introduction & Chapter 1)
Eric Topol, High-performance medicine: the convergence of human and artificial intelligence
(Optional) Obermeyer et al., Dissecting racial bias in an algorithm used to manage the health of populations
(Optional) Liu et al., JAMA 2019, How to Read Articles That Use Machine Learning
Setting up your development environment with uv for package management, Jupyter notebooks, and PyTorch basics. We’ll establish good coding practices for reproducible medical AI research. This module pairs with Homework 1 to ensure everyone has a working environment.
Development Environment Setup
Modern Python tooling: uv for fast package management, virtual environments for isolation, and why reproducibility matters in medical AI research. We’ll walk through the HW1 setup and troubleshoot common issues.
Git & GitHub for Medical AI
Version control isn’t just for software engineers—it’s essential for reproducible research and increasingly important in the age of AI-assisted coding. Clone, commit, push, branch, and pull request workflows. Your commit history tells the story of your work.
Jupyter Notebooks vs. Python Scripts
When to use notebooks (exploration, visualization, teaching) vs. scripts (production, testing, automation). Best practices for organizing medical AI projects. Introduction to the notebooks you’ll use throughout the course.
The Medical Data Landscape
Preview of the data types we’ll encounter: medical images (DICOM, NIfTI), structured clinical data (EHR, tabular), and clinical text (notes, reports). Understanding what makes medical data different from typical ML datasets.
Readings:
Patrick Mineault, The Good Research Code Handbook — Skim chapters 1-3 on project setup and organization
(Optional) GitHub Guides, Git Handbook — Reference if you’re new to Git
(Optional) Great video by Andrej Karpathy on building neural networks from scratch (2 hours) [Video] — Excellent deep dive if you want to understand what’s under the hood
Introduction to medical imaging data formats and handling. Covers DICOM, NIfTI, various imaging modalities, and the MONAI framework for medical image analysis. Includes an advanced track for radiation therapy data (RT structures, dose, DVH).
DICOM Fundamentals
The universal language of medical imaging. Understanding DICOM structure: headers, pixel data, and metadata. Reading CT, MR, and X-ray images with pydicom. Why DICOM matters for AI: the metadata is often more valuable than the pixels.
DICOM Coordinate Systems & Geometry
Patient coordinate systems, image orientation, and spatial relationships. Converting between pixel coordinates and physical space. Window/level for visualization. Common pitfalls that break AI models.
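A minimal sketch of the pixel-to-physical mapping and a simplified window/level, assuming the three geometry attributes (ImagePositionPatient, ImageOrientationPatient, PixelSpacing) have already been read from the header. The function names and the slice values are illustrative only, and the window function is a plain linear clamp rather than the exact formula in the DICOM standard.

```python
def pixel_to_patient(row, col, image_position, image_orientation, pixel_spacing):
    """Map a (row, col) pixel index to patient-space (x, y, z) in mm.

    image_position:    ImagePositionPatient, center of the first pixel
    image_orientation: ImageOrientationPatient, six direction cosines
    pixel_spacing:     PixelSpacing, (spacing between rows, spacing between columns)
    """
    row_dir = image_orientation[:3]   # direction in which the column index increases
    col_dir = image_orientation[3:]   # direction in which the row index increases
    row_spacing, col_spacing = pixel_spacing
    return tuple(
        image_position[k]
        + row_dir[k] * col_spacing * col
        + col_dir[k] * row_spacing * row
        for k in range(3)
    )


def apply_window(value, center, width):
    """Linearly rescale an intensity into [0, 1] for display (simplified)."""
    lower = center - width / 2.0
    scaled = (value - lower) / width
    return min(max(scaled, 0.0), 1.0)


# Hypothetical axial slice with anisotropic in-plane spacing.
position = (-100.0, -100.0, 50.0)
orientation = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
spacing = (2.0, 0.5)  # 2.0 mm between rows, 0.5 mm between columns

point = pixel_to_patient(10, 4, position, orientation, spacing)
print(point)                        # (-98.0, -80.0, 50.0)
print(apply_window(40, 40, 400))    # 0.5, mid-gray in a soft-tissue style window
```

The anisotropic spacing is deliberate: swapping the two PixelSpacing values is silent on isotropic images and wrong on everything else, which is one of the pitfalls this topic covers.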
Beyond DICOM: NIfTI, PNG, and Research Formats
When to use NIfTI (neuroimaging, volumetric analysis), when DICOM is overkill, and how to convert between formats. Working with public datasets (TCIA, PhysioNet, Grand Challenges) that may use different formats.
Introduction to MONAI
PyTorch-based framework for medical imaging AI. Data loading, transforms, and preprocessing pipelines. Why MONAI exists and when to use it vs. raw PyTorch.
Advanced Track: Radiation Therapy DICOM
For medical physics students: RT Structure Sets (contours), RT Dose (3D dose grids), RT Plan (beam parameters). Calculating DVH from dose and structure data. Real-world challenges in RT data handling.
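As a rough illustration of the DVH calculation, the sketch below builds a cumulative DVH from a flat list of voxel doses. It assumes the dose grid has already been resampled onto the structure mask, that all voxels have equal volume, and that partial-volume effects at the contour edge are ignored. The function name and dose values are made up for the example.

```python
def cumulative_dvh(structure_doses, bin_width=1.0):
    """Return [(dose_threshold, fraction of volume receiving >= threshold)]."""
    if not structure_doses:
        raise ValueError("structure contains no voxels")
    n_voxels = len(structure_doses)
    n_bins = int(max(structure_doses) / bin_width) + 1
    curve = []
    for i in range(n_bins):
        threshold = i * bin_width
        covered = sum(1 for d in structure_doses if d >= threshold)
        curve.append((threshold, covered / n_voxels))
    return curve


# Hypothetical doses (Gy) for the four voxels inside a tiny structure.
doses_gy = [10.0, 20.0, 20.0, 40.0]
dvh = cumulative_dvh(doses_gy, bin_width=10.0)
mean_dose = sum(doses_gy) / len(doses_gy)

for threshold, fraction in dvh:
    print(f"V{threshold:.0f}Gy = {fraction:.0%}")
print(f"mean dose = {mean_dose} Gy")
```

Real RT data adds the hard parts this sketch skips: rasterizing contours into a mask, aligning the dose grid with the image grid, and handling voxels that are only partly inside the structure.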
Readings:
DICOM Standard Browser — Bookmark this—you’ll reference it constantly
(Optional) MONAI Getting Started — Skim the tutorials section
(Optional) The Cancer Imaging Archive (TCIA) — Source for many public medical imaging datasets
Working with electronic health record (EHR) data, feature engineering for clinical variables, and exploratory data analysis with pandas. Understanding the unique challenges of clinical tabular data.
Clinical Tabular Data: What Makes It Different
EHR data vs. research datasets. Missing-data mechanisms (MCAR, MAR, MNAR) and why clinical data are rarely missing completely at random. Class imbalance in medical outcomes. Time-series nature of clinical encounters. Why standard ML assumptions often fail in healthcare.
Exploratory Data Analysis for Clinical Data
Systematic EDA workflow: distributions, correlations, outliers. Visualization techniques for clinical variables. Identifying data quality issues before they break your model. Using pandas-profiling (now maintained as ydata-profiling) and sweetviz for rapid exploration.
Data Cleaning & Preprocessing
Handling missing values: imputation strategies and when deletion is appropriate. Encoding categorical variables (one-hot, target encoding, ordinal). Feature scaling and normalization. Dealing with outliers in clinical measurements.
Feature Engineering for Clinical Prediction
Creating clinically meaningful features from raw data. Time-based features (trends, rates of change). Composite scores and risk indices. Domain knowledge as a feature engineering superpower.
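One way to turn a series of repeated measurements into a rate-of-change feature is an ordinary least-squares slope. This is a sketch under simple assumptions: timestamps are already expressed in hours, and the lab values below are invented for illustration.

```python
def trend_slope(times, values):
    """Least-squares slope of values against times (units of value per unit time)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all observations share the same timestamp")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx


# Hypothetical creatinine values (mg/dL) drawn at 0, 24, and 48 hours.
hours = [0.0, 24.0, 48.0]
creatinine_mg_dl = [1.0, 1.5, 2.0]

slope_per_hour = trend_slope(hours, creatinine_mg_dl)
print(f"{slope_per_hour * 24:.2f} mg/dL per day")  # 0.50 mg/dL per day
```

A slope uses every measurement in the window, so it is less sensitive to a single noisy draw than "last minus first", though it still assumes a roughly linear trend.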
Data Imbalance in Medical Outcomes
Why most medical events are rare. Oversampling (SMOTE), undersampling, and class weights. Evaluation metrics that matter when classes are imbalanced (precision-recall, AUPRC). When accuracy is misleading.
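The sketch below illustrates two of these points on an invented cohort with 5% prevalence: the "balanced" class-weight heuristic (total samples divided by the number of classes times the class count, the same formula scikit-learn documents for class_weight="balanced"), and why a model that never predicts the event can still look accurate.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical cohort: 95 patients without the event, 5 with it.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# A "model" that always predicts the majority class.
always_negative = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(always_negative, labels)) / sum(labels)

print(weights[1], round(weights[0], 3))  # 10.0 0.526
print(accuracy, recall)                  # 0.95 0.0
```

Accuracy of 95% with recall of 0% is the reason the topic points to precision-recall curves and AUPRC for rare outcomes.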
Readings:
(Optional) Pandas Documentation - 10 Minutes to pandas — Quick refresher if needed
Sterne et al., BMJ 2009, Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls — Classic paper on handling missing data
(Optional) Chawla et al., JAIR 2002, SMOTE: Synthetic Minority Over-sampling Technique — Original SMOTE paper—skim for concepts
(Optional) A Visual Introduction to Machine Learning — Beautiful interactive visualization of decision trees
Core machine learning concepts: classification, regression, evaluation metrics, cross-validation, and model selection. Emphasis on metrics that matter for clinical deployment vs. publication.
Supervised Learning: Regression
Linear regression as the foundation. Regularization (L1/Lasso, L2/Ridge) and why it matters for high-dimensional clinical data. Interpreting coefficients in medical contexts. When simple models beat complex ones.
Supervised Learning: Classification
Logistic regression for binary outcomes. Decision trees and their interpretability. Random forests and gradient boosting (XGBoost, LightGBM). SVMs briefly. Choosing the right algorithm for your clinical question.
Model Evaluation: Beyond Accuracy
Confusion matrices, sensitivity, specificity, PPV, NPV. ROC curves and AUC. Precision-recall curves for imbalanced data. Calibration plots and why they matter for clinical decision support. The metrics clinicians actually care about.
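The four headline metrics follow directly from the confusion-matrix counts. The sketch below uses invented counts for a screening test at 10% prevalence and assumes every denominator is nonzero.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # of patients with disease, fraction flagged
        "specificity": tn / (tn + fp),  # of patients without disease, fraction cleared
        "ppv": tp / (tp + fp),          # of flagged patients, fraction with disease
        "npv": tn / (tn + fn),          # of cleared patients, fraction without disease
    }


# Hypothetical cohort of 100 patients, 10 of whom have the disease.
metrics = diagnostic_metrics(tp=8, fp=18, tn=72, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, sensitivity and specificity are both 0.80, yet PPV is only about 0.31: at low prevalence most positive calls are false alarms, which is often the number a clinician acting on the alert cares about most.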
Cross-Validation & Model Selection
Train/validation/test splits and why they matter. K-fold cross-validation. Stratified sampling for imbalanced classes. Hyperparameter tuning with grid search and random search. Avoiding data leakage.
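A common source of leakage in medical data is the same patient appearing in both training and test folds. The sketch below splits by patient rather than by row; the function name and patient IDs are hypothetical, and it does not stratify by outcome. In practice scikit-learn's GroupKFold or StratifiedGroupKFold covers the same idea.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) so no patient spans both sides."""
    unique_patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_patients)
    # Deal shuffled patients into folds round-robin.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_patients)}
    for fold in range(n_folds):
        test = [i for i, pid in enumerate(patient_ids) if fold_of[pid] == fold]
        train = [i for i, pid in enumerate(patient_ids) if fold_of[pid] != fold]
        yield train, test


# Hypothetical dataset: one entry per scan, several scans per patient.
scan_patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
folds = list(patient_level_folds(scan_patients, n_folds=3))

for train, test in folds:
    test_patients = sorted({scan_patients[i] for i in test})
    print(f"test patients: {test_patients}, train rows: {len(train)}")
```

Splitting at the row level here would let the model see other scans of a test patient during training, which tends to inflate validation scores.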
Model Interpretability
Why black boxes are problematic in medicine. Feature importance from tree models. SHAP values for any model. LIME for local explanations. Building trust with clinicians through interpretable predictions.
From Model to Clinical Utility
Decision curve analysis: is your model actually useful? Net benefit and clinical thresholds. Comparing models to existing clinical practice. The gap between good AUC and clinical deployment.
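Net benefit at a single threshold probability can be computed from counts alone. The sketch below follows the standard decision-curve formula (true positives per patient minus false positives per patient, weighted by the odds of the threshold); the counts and the 20% threshold are invented for illustration.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of acting on model positives at a given threshold probability."""
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of intervening on every patient regardless of the model."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical: 100 patients, 10% prevalence, model flags 26 (8 true, 18 false).
threshold = 0.20
model_nb = net_benefit(tp=8, fp=18, n=100, threshold=threshold)
treat_all_nb = net_benefit_treat_all(prevalence=0.10, threshold=threshold)
treat_none_nb = 0.0  # intervening on no one is the zero reference

print(f"model: {model_nb:.3f}, treat all: {treat_all_nb:.3f}, treat none: {treat_none_nb:.3f}")
```

A full decision curve repeats this over a range of thresholds; the model is only useful where its net benefit exceeds both the treat-all and treat-none strategies.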
Fairness & Bias in Clinical AI
Algorithmic bias in healthcare: how it arises, why it matters, and what to do. The Obermeyer study on racial bias in risk prediction. Fairness metrics (demographic parity, equalized odds, calibration across groups). Subgroup analysis as standard practice. When “fair” models conflict with “accurate” ones. FDA’s emerging focus on bias evaluation.
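A subgroup analysis can start from per-group rates. The sketch below computes the selection rate (the quantity compared under demographic parity) and the true and false positive rates (compared under equalized odds). Group labels and predictions are invented, and the code assumes each group contains at least one positive and one negative case.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate, and false positive rate."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[group][key] += 1
    report = {}
    for group, c in counts.items():
        n = sum(c.values())
        report[group] = {
            "selection_rate": (c["tp"] + c["fp"]) / n,
            "tpr": c["tp"] / (c["tp"] + c["fn"]),
            "fpr": c["fp"] / (c["fp"] + c["tn"]),
        }
    return report


# Hypothetical predictions for two patient groups with identical outcome rates.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = rates_by_group(y_true, y_pred, groups)
parity_gap = abs(report["A"]["selection_rate"] - report["B"]["selection_rate"])
tpr_gap = abs(report["A"]["tpr"] - report["B"]["tpr"])
print(report)
print(f"demographic parity gap: {parity_gap}, TPR gap: {tpr_gap}")
```

Here both groups have the same true outcome rate, yet group B is flagged less often and half of its true cases are missed, the kind of gap that aggregate metrics hide.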
Readings:
James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning (ISLR) — Chapters 2-4 (statistical learning, regression, classification)
(Optional) Scikit-learn User Guide — Reference for implementation details
(Optional) Lundberg & Lee, NeurIPS 2017, A Unified Approach to Interpreting Model Predictions (SHAP) — The SHAP paper—technical but foundational
(Optional) Vickers & Elkin, Medical Decision Making 2006, Decision Curve Analysis: A Novel Method for Evaluating Prediction Models — How to evaluate if a model is clinically useful
Obermeyer et al., Science 2019, Dissecting racial bias in an algorithm used to manage the health of populations — Essential case study—how a widely-used algorithm encoded racial bias
(Optional) Rajkomar et al., Annals of Internal Medicine 2018, Ensuring Fairness in Machine Learning to Advance Health Equity — Practical framework for fairness in clinical ML
Convolutional neural networks, U-Net architecture for segmentation, and image classification models. Hands-on with PyTorch and MONAI for medical imaging tasks.
Neural Networks Fundamentals
From biological inspiration to artificial neurons. Activation functions (ReLU, sigmoid, softmax). Feedforward networks, backpropagation, and gradient descent. Loss functions for classification vs. regression. Building intuition before complexity.
Convolutional Neural Networks (CNNs)
Why convolutions work for images. Filters, feature maps, pooling, and stride. Classic architectures: LeNet, AlexNet, VGG, ResNet. What each layer learns. From ImageNet to medical imaging.
Transfer Learning & Fine-Tuning
Why training from scratch rarely makes sense in medical imaging. Using pretrained models (ImageNet, RadImageNet). Freezing layers, learning rates for fine-tuning. When transfer learning fails and what to do about it.
Image Segmentation with U-Net
The encoder-decoder architecture that changed medical imaging. Skip connections and why they matter. Loss functions for segmentation (Dice, cross-entropy). Variants: Attention U-Net, nnU-Net. Hands-on with MONAI.
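To make the Dice loss concrete, here is a framework-free sketch on flattened masks. It assumes predictions are per-voxel probabilities in [0, 1] and the target is a binary mask; the smoothing constant is a common convention, not a prescribed value. In the hands-on work a library implementation such as MONAI's DiceLoss would normally be used instead.

```python
def soft_dice(pred_probs, target_mask, smooth=1e-6):
    """Soft Dice coefficient between predicted probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred_probs, target_mask))
    total = sum(pred_probs) + sum(target_mask)
    # The smoothing term keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + smooth) / (total + smooth)


def dice_loss(pred_probs, target_mask):
    """Loss to minimize: 0 for perfect overlap, approaching 1 for none."""
    return 1.0 - soft_dice(pred_probs, target_mask)


# Hypothetical four-voxel example: the model over-segments by one voxel.
pred = [1.0, 1.0, 0.0, 0.0]
target = [1, 0, 0, 0]

score = soft_dice(pred, target)
print(f"Dice = {score:.3f}, loss = {dice_loss(pred, target):.3f}")  # Dice = 0.667, loss = 0.333
```

Because Dice is a ratio of overlap to total foreground, it is not dominated by the large background class the way per-voxel accuracy is, which is one reason it is popular for small structures.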
Training Deep Learning Models: Practical Considerations
Data augmentation for medical images (rotation, elastic deformation, intensity). Batch normalization, dropout, and regularization. Learning rate schedules. Early stopping and checkpointing. Debugging when training goes wrong.
Readings:
Deep Learning (Goodfellow, Bengio, Courville) - Chapter 9: CNNs — The foundational reference on CNNs
Ronneberger et al., MICCAI 2015, U-Net: Convolutional Networks for Biomedical Image Segmentation — The paper that started it all for medical image segmentation
(Optional) MONAI Tutorials - 2D Classification — Hands-on tutorial we’ll work through in class
Students work on a self-selected medical imaging project from approved options: chest X-ray classification, CT organ segmentation, pathology analysis, retinal imaging, or other approved imaging tasks. Includes a mini field guide deliverable.
Midterm Project Kickoff
Overview of project options, dataset access, and expectations. Forming teams (if applicable). Timeline and milestones. What makes a good mini field guide.
Work Session: Data Exploration & Baseline
In-class work time with instructor and TA support. Focus on loading data, EDA, and establishing a simple baseline model. Troubleshooting environment and data issues.
Work Session: Model Development
In-class work time focused on model architecture, training, and iteration. Peer feedback on approaches. Debugging common deep learning issues.
Midterm Presentations
Brief presentations of results and mini field guides. Peer feedback and discussion. What worked, what didn’t, and lessons learned.
Text processing, word embeddings, and the unique challenges of clinical natural language processing. Working with clinical notes, medical terminology, and de-identification considerations.
Introduction to Clinical NLP
Why clinical text is different: abbreviations, misspellings, implicit context, copy-paste artifacts. Types of clinical documents (notes, reports, discharge summaries). The de-identification challenge. HIPAA and the Safe Harbor de-identification method.
Text Preprocessing & Traditional NLP
Tokenization, stemming, lemmatization. Bag of words and TF-IDF. N-grams. Why these still matter even in the age of transformers. Building a simple text classifier with scikit-learn.
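A bare-bones TF-IDF can be written in a few lines, which helps show what the weights mean. This sketch uses naive whitespace tokenization and the plain log(N / document frequency) form; scikit-learn's TfidfVectorizer uses a smoothed variant with normalization by default, so its numbers will differ. The note snippets are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical note fragments.
notes = [
    "patient denies chest pain",
    "patient reports chest pain",
    "patient reports mild headache",
]
vectors = tf_idf(notes)
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

"patient" appears in every note and gets weight 0, while "denies" gets the highest weight in the first note. That is also a reminder of what bag-of-words misses: the negation matters clinically, but word order and scope are gone.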
Word Embeddings
From one-hot to distributed representations. Word2Vec, GloVe, and FastText. Domain-specific embeddings (BioWordVec, clinical embeddings). Visualizing embeddings with t-SNE. What embeddings capture and what they miss.
Clinical NLP Tools & Resources
Overview of clinical NLP tools: cTAKES, MetaMap, scispaCy, MedSpaCy. UMLS and medical ontologies. Named entity recognition for medications, diagnoses, and procedures. When to use off-the-shelf vs. custom solutions.
Readings:
Speech and Language Processing (Jurafsky & Martin) - Chapters 2, 6 — Text processing and vector semantics chapters
(Optional) Névéol et al., Journal of Biomedical Semantics 2018, Clinical Natural Language Processing in Languages Other Than English: Opportunities and Challenges — Good overview of clinical NLP challenges
(Optional) Neumann et al., BioNLP 2019, scispaCy: Fast and Robust Models for Biomedical NLP — Practical tool we’ll use in class
Transformer architectures, large language models, prompting strategies, and clinical applications. Understanding capabilities and limitations of LLMs in healthcare settings.
The Transformer Revolution
Attention is all you need. Self-attention and multi-head attention. Positional encodings. Encoder-decoder vs. decoder-only architectures. Why transformers replaced RNNs for most NLP tasks.
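Scaled dot-product attention is compact enough to write out by hand. The sketch below is a simplification: it uses the token vectors directly as queries, keys, and values, whereas a real transformer first applies learned projection matrices and runs several heads in parallel. The two-dimensional token vectors are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights): softmax(Q K^T / sqrt(d_k)) applied to V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical token embeddings; self-attention uses them as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and says how much one token draws from every other token, all computed in a single step rather than passed along a sequence as in an RNN.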
Large Language Models: GPT, BERT, and Beyond
Pre-training and fine-tuning paradigm. BERT for understanding, GPT for generation. Scaling laws and emergent capabilities. Medical LLMs: ClinicalBERT, PubMedBERT, Med-PaLM. What’s actually in these models.
Prompting & In-Context Learning
Zero-shot, few-shot, and chain-of-thought prompting. Prompt engineering for medical tasks. Retrieval-augmented generation (RAG). When prompting beats fine-tuning and vice versa.
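The shape of a retrieval-augmented, few-shot prompt can be shown without calling any model. The sketch below is a toy: retrieval uses word overlap (real systems typically use embedding similarity), the prompt layout is one arbitrary choice, and all passages and examples are made-up placeholder text.

```python
def retrieve(question, passages, k=1):
    """Rank passages by Jaccard word overlap with the question; return the top k."""
    q_terms = set(question.lower().split())

    def overlap(passage):
        p_terms = set(passage.lower().split())
        return len(q_terms & p_terms) / len(q_terms | p_terms)

    return sorted(passages, key=overlap, reverse=True)[:k]


def build_prompt(question, context_passages, examples):
    """Assemble instructions, retrieved context, few-shot examples, and the question."""
    lines = ["Answer using only the context. If the context is insufficient, say so.", ""]
    lines.append("Context:")
    lines += [f"- {passage}" for passage in context_passages]
    lines.append("")
    for example_q, example_a in examples:
        lines += [f"Q: {example_q}", f"A: {example_a}", ""]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)


# Placeholder knowledge base and one worked example (all invented).
passages = [
    "imaging suite hours are weekdays 8am to 5pm",
    "the pharmacy closes at 6pm on weekdays",
    "parking validation is available at the front desk",
]
examples = [("where is parking validated", "At the front desk.")]
question = "what are the imaging suite hours"

context = retrieve(question, passages, k=1)
prompt = build_prompt(question, context, examples)
print(prompt)
```

Grounding the answer in retrieved text and instructing the model to admit insufficient context are two inexpensive levers against hallucination, though neither guarantees a correct answer.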
LLMs in Clinical Practice: Opportunities & Risks
Current applications: clinical documentation, patient communication, decision support. Hallucination and factual accuracy in medical contexts. Bias and fairness concerns. Regulatory landscape. The responsible deployment question.
Readings:
(Optional) Vaswani et al., NeurIPS 2017, Attention Is All You Need — The transformer paper—technical but foundational
Lee, Goldberg, Kohane, The AI Revolution in Medicine (Chapters on GPT-4) — Accessible overview of LLMs in medicine
(Optional) Great video by Andrej Karpathy on Tokenization (2 hours 14 minutes) [Video] — Deep dive into how LLMs process text
(Optional) Singhal et al., Nature 2023, Large Language Models Encode Clinical Knowledge — The Med-PaLM paper; state of the art for medical QA at the time of publication
What governance means in clinical settings. Designing acceptance tests, choosing monitoring metrics, setting review cadences, and defining human-in-the-loop rules. Practical exercises in building monitoring dashboards.
What Governance Actually Means
Governance is not bureaucracy—it is constraints (what must always be true), experiments (acceptance tests and drift checks), and logs (traceability). We map familiar clinical concepts (commissioning, QA, chart rounds) to AI governance.
Acceptance Testing Design
How to design local validation: picking a representative cohort, choosing metrics that matter for your use case, setting thresholds, and documenting the process.
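One way to turn "setting thresholds" into a concrete pass/fail rule is to require that the lower confidence bound on a metric, not just its point estimate, clears the threshold. The sketch below uses the Wilson score interval for sensitivity; the 0.80 requirement and the cohort counts are hypothetical.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 for ~95%)."""
    if n == 0:
        raise ValueError("no cases to evaluate")
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def passes_acceptance(true_positives, positives, min_sensitivity):
    """Pass only if we are reasonably confident sensitivity meets the requirement."""
    return wilson_lower_bound(true_positives, positives) >= min_sensitivity


# Hypothetical local cohorts, both with observed sensitivity of 0.90.
small = passes_acceptance(true_positives=45, positives=50, min_sensitivity=0.80)
large = passes_acceptance(true_positives=180, positives=200, min_sensitivity=0.80)
print(round(wilson_lower_bound(45, 50), 3), small)    # 0.786 False
print(round(wilson_lower_bound(180, 200), 3), large)  # 0.851 True
```

The same observed sensitivity fails with 50 positive cases and passes with 200, which is why the size of the representative cohort belongs in the documented acceptance process.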
Monitoring & Drift Detection
Choosing 2-3 key monitoring stats, setting review cadence, defining who is responsible, and what to do when performance drifts. Real-world examples from clinical AI deployments.
Lab: Building a Simple Monitoring Dashboard
Hands-on exercise implementing a monitoring script or dashboard for a model you’ve trained. Track predictions, flag anomalies, and generate alerts.
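A monitoring script can be as small as one drift statistic and one alert rule. The sketch below computes the population stability index (PSI) between baseline and recent prediction scores, assuming scores lie in [0, 1]. The ten-bin layout, the floor for empty bins, and the 0.25 alert level are conventions chosen for this example, and the score lists are synthetic.

```python
import math


def histogram_fractions(scores, n_bins=10):
    """Fraction of scores falling in each equal-width bin on [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return [c / len(scores) for c in counts]


def population_stability_index(baseline_scores, recent_scores, n_bins=10, floor=1e-4):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = histogram_fractions(baseline_scores, n_bins)
    actual = histogram_fractions(recent_scores, n_bins)
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) for empty bins
        psi += (a - e) * math.log(a / e)
    return psi


# Synthetic scores: the recent batch is shifted toward higher risk.
baseline = [i / 100 for i in range(100)]
recent = [min(s + 0.3, 0.99) for s in baseline]

ALERT_LEVEL = 0.25  # assumed rule-of-thumb threshold, tune per deployment
drift = population_stability_index(baseline, recent)
print(f"PSI = {drift:.2f}", "ALERT" if drift > ALERT_LEVEL else "ok")
```

A score-distribution check like this needs no ground-truth labels, so it can run on every batch; label-dependent metrics such as sensitivity can then be reviewed on a slower cadence once outcomes are available.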
Readings:
Sculley et al., NeurIPS 2015, Hidden Technical Debt in Machine Learning Systems — Classic paper on ML systems in production
(Optional) FDA Guidance on AI/ML-Based Software as a Medical Device — Understand the regulatory landscape
Deployment considerations for clinical AI, writing effective field guides, communicating with non-technical clinicians, and understanding the path from research to clinical implementation.
The Gap Between Model and Product
Why most ML papers never become clinical tools. Understanding workflow integration, regulatory pathways, and the organizational factors that determine success.
Writing Effective Field Guides
How to write documentation that busy clinicians will actually read and use. Plain language, clear rules, simple checklists, and the 8-section field guide template.
Workshop: Peer Review of Draft Field Guides
Students exchange draft field guides and provide structured feedback. Practice explaining technical work to non-technical audiences.
Students present their final projects, which include both a technical artifact (model/pipeline) and a field guide document. Peer feedback and discussion of deployment readiness.
Final Project Work Session
Last in-class work time. Final debugging, polishing presentations, and completing field guides. One-on-one check-ins with instructor.
Final Presentations: Day 1
Student presentations (15-20 min each). Focus on the problem, approach, results, and field guide highlights. Q&A and peer feedback.
Final Presentations: Day 2
Remaining presentations. Course wrap-up: what we learned, where the field is going, and how to continue learning after the course.