
Beat the Baseline: 25 Experiments in Deep Learning

DermaMNIST Skin Lesion Classification — MPHY 6120 Module 6

0.929 · Best AUC (Cosine LR)
0.00 · Baseline Melanoma Recall
25 · Experiments Run
~14 min · Total Runtime (MPS)

The Big Question

You just hit 0.93 AUC on a 7-class skin lesion classifier. Congratulations!

But would you deploy this to a dermatology clinic?

Your 0.93 AUC model catches 41% of melanomas.

That means 6 out of 10 melanoma patients walk out the door undiagnosed. AUC is a useful optimization target, but it is not a clinical deployment metric.

Full Results

Sorted by macro AUC. Watch the melanoma recall column.

# · Experiment · Chapter · AUC · Mel Recall · Cancer Sens · Accuracy · Params · Time

The Story: 10 Chapters

Each experiment below details what changed and what it teaches.

Chapter 1: The Baseline

The baseline gets 68% accuracy by predicting "benign mole" for everything. It catches zero cancers. AUC of 0.82 sounds decent until you look at per-class recall: only melanocytic nevi (nv) are detected. This is the consequence of class imbalance — the model learns the shortcut of always predicting the majority class.
Exp 01 — Baseline AUC 0.822
AUC: 0.822 · Mel Recall: 0% · Cancer Sens: 0% · Accuracy: 68.0% · Params: 6,887 · Time: 8.3s
Model: 2-layer CNN (8 → 16 channels) · Optimizer: SGD(lr=0.1) · Epochs: 5 · Augmentation: None

The intentionally weak starting point. SGD at lr=0.1 is dangerously high — the model overshoots good minima. Only 8 and 16 filters mean the model can barely learn any useful features. With no augmentation, it memorizes the training set's class distribution instead of learning discriminative features. The 0% cancer recall means this model is clinically useless — every cancer patient walks out undiagnosed.
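For reference, a minimal PyTorch sketch of such a baseline. The 3×3 kernels and pooling layout are assumptions, though they happen to reproduce the card's 6,887 parameters:

    import torch.nn as nn

    class BaselineCNN(nn.Module):
        """Deliberately weak: 2 conv layers (8 -> 16 channels), no BN, no dropout."""
        def __init__(self, num_classes=7):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 8, kernel_size=3, padding=1),   # 28x28 -> 28x28
                nn.ReLU(),
                nn.MaxPool2d(2),                             # -> 14x14
                nn.Conv2d(8, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),                             # -> 7x7
            )
            self.classifier = nn.Linear(16 * 7 * 7, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))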

Chapter 2: Fix the Optimizer

Switching SGD → Adam(lr=1e-3) is often the single best one-line change. Adam maintains per-parameter adaptive learning rates, so it navigates loss landscapes that trip up vanilla SGD. But learning rate still matters — too low underfits, too high diverges.
Exp 02 — Adam lr=1e-3 AUC 0.846
AUC: 0.846 · Mel Recall: 9% · Cancer Sens: 4% · Accuracy: 69.0%
Changed: optimizer = optim.Adam(model.parameters(), lr=1e-3)

One line changed, AUC jumps +0.024. Adam's adaptive learning rates let each parameter move at its own pace. The model starts learning features beyond just "predict nv." This is the go-to optimizer for almost any deep learning project. Note: melanoma recall is still terrible because the architecture is still too weak to learn fine-grained skin lesion features.

Exp 03 — SGD lr=0.01 AUC 0.662
AUC: 0.662 · Mel Recall: 0% · Cancer Sens: 0% · Accuracy: 66.9%
Changed: optimizer = optim.SGD(model.parameters(), lr=0.01)

Lowering SGD's learning rate from 0.1 to 0.01 actually made things worse. Why? Without momentum, SGD at 0.01 crawls through the loss landscape in just 5 epochs. It barely learns anything beyond the majority-class shortcut. This shows that learning rate and optimizer choice interact — you can't tune them independently.

Exp 04 — Adam lr=1e-4 AUC 0.650
AUC: 0.650 · Mel Recall: 0% · Cancer Sens: 0% · Accuracy: 66.9%
Changed: optimizer = optim.Adam(model.parameters(), lr=1e-4)

Even Adam can underfit if the learning rate is too conservative. At 1e-4 with only 5 epochs, the model barely moves from initialization. This is why lr=1e-4 is typically used for fine-tuning pretrained models (which are already close to a good solution), not for training from scratch. Lesson: match your learning rate to your training budget.

Exp 05 — SGD + Momentum AUC 0.819
AUC: 0.819 · Mel Recall: 3% · Cancer Sens: 1% · Accuracy: 67.0%
Changed: optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

Adding momentum=0.9 gives SGD a "memory" of past gradient directions, letting it build up speed in consistent directions and dampen oscillations. This is how SGD is supposed to be used — vanilla SGD without momentum is almost never the right choice. Result is on par with the baseline but still behind Adam for this small model and short training run.

Chapter 3: Build a Better Architecture

More channels, deeper networks, BatchNorm, and Dropout each contribute. The baseline's 6,887 parameters are severely capacity-starved — it literally cannot represent the features needed to distinguish 7 skin lesion types. Going to 94k params with BatchNorm is the inflection point where the model starts catching melanoma.
Exp 06 — More Channels (32/64) AUC 0.898
AUC: 0.898 · Mel Recall: 19% · Cancer Sens: 31% · Params: 41,351
Model: 2-layer CNN with 32 → 64 channels (was 8 → 16) · Optimizer: Adam(lr=1e-3)

6x more parameters, AUC jumps from 0.846 to 0.898. More channels = more feature detectors. The model can now learn separate filters for different colors, edges, and textures in skin lesions. Melanoma recall hits 19% — still bad clinically, but the model is starting to notice that melanomas look different from nevi. This is a capacity story: you need enough parameters to represent the decision boundaries between 7 classes.

Exp 07 — Three Conv Blocks + AdaptivePool AUC 0.855
AUC: 0.855 · Mel Recall: 0% · Cancer Sens: 0% · Params: 94,151
Model: 3 blocks (32 → 64 → 128 channels) + AdaptiveAvgPool2d(1) · Optimizer: Adam(lr=1e-3)

Deeper, but AUC actually dropped vs. More Channels. Why? Deeper networks are harder to train — without BatchNorm, gradients degrade as they pass through more layers (the "vanishing gradient" problem). The 0% melanoma recall is a red flag: this model collapsed to predicting the majority class. Takeaway: depth without BatchNorm can hurt. AdaptiveAvgPool2d is still a good pattern — it replaces hardcoded spatial dimensions with parameter-free global pooling that works at any input resolution.

Exp 08 — BatchNorm AUC 0.897
AUC: 0.897 · Mel Recall: 41% · Cancer Sens: 57% · Params: 94,599
Model: 3 blocks (32 → 64 → 128) + BatchNorm2d after each conv · Optimizer: Adam(lr=1e-3)

Same architecture as Exp 07, but with BatchNorm after each conv layer. AUC jumps from 0.855 to 0.897, and melanoma recall explodes from 0% to 41%. BatchNorm normalizes each layer's inputs to zero mean and unit variance, which (a) stabilizes the distribution of layer inputs during training, (b) acts as mild regularization, and (c) allows higher learning rates. This is often the single most impactful architectural change you can make. This model becomes our base for the rest of the experiments.
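A sketch of that base model, assuming 3×3 convs arranged as conv → BatchNorm → ReLU → pool blocks (this layout reproduces the card's 94,599 parameters; dropping the BatchNorm2d line gives Exp 07's 94,151):

    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # The Exp 08 recipe: conv -> BatchNorm -> ReLU -> downsample.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    class ThreeBlockCNN(nn.Module):
        def __init__(self, num_classes=7):
            super().__init__()
            self.features = nn.Sequential(
                conv_block(3, 32),
                conv_block(32, 64),
                conv_block(64, 128),
                nn.AdaptiveAvgPool2d(1),  # global pool: no hardcoded spatial dims
            )
            self.classifier = nn.Linear(128, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))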

Exp 09 — Dropout AUC 0.897
AUC: 0.897 · Mel Recall: 50% · Cancer Sens: 57% · Params: 94,599
Added: nn.Dropout(0.3) before the classifier linear layer

AUC is nearly identical to BatchNorm alone, but melanoma recall jumped from 41% to 50%. Dropout randomly zeros 30% of activations during training, forcing the network to not rely on any single feature. This regularization makes the model more robust to subtle differences — exactly what you need for distinguishing melanoma from benign nevi. The AUC/recall divergence is starting to show: improvements in rare-class detection don't always move the overall metric.

Exp 10 — Hidden FC Layer AUC 0.841
AUC: 0.841 · Mel Recall: 66% · Cancer Sens: 50% · Params: 102,407
Classifier: Linear(128,64) → ReLU → Dropout(0.2) → Linear(64,7)

AUC dropped to 0.841, but melanoma recall hit 66% — the highest of any experiment! Adding a hidden FC layer gives the classifier more expressive power to separate classes in feature space. The AUC drop likely comes from worse calibration on majority classes. This is a key clinical tradeoff: this model would catch 2 out of 3 melanomas at the cost of more false positives on common lesions. In a derm clinic with follow-up biopsy capability, that tradeoff might be exactly right.

Chapter 4: Data Augmentation

A simple horizontal flip + the BatchNorm model = 0.92 AUC. Augmentation is "free data" — it shows the model different views of each image, reducing overfitting and improving generalization. But you need more epochs to benefit, since the model sees a different random version of each image on every pass.
Exp 11 — Horizontal Flip AUC 0.925
AUC: 0.925 · Mel Recall: 33% · Cancer Sens: 22% · Epochs: 10
train_transform: RandomHorizontalFlip() + ToTensor() · Epochs: 10 (was 5)

The simplest possible augmentation yields a massive AUC jump: 0.897 → 0.925. Skin lesions have no inherent left-right orientation, so horizontal flip is always safe. With 10 epochs, the model sees each image ~5 times flipped and ~5 times unflipped, effectively doubling the training data. But notice: melanoma recall actually dropped from 41% to 33%. The extra epochs let the model fit majority classes better, which boosts AUC while relatively neglecting rare classes.

Exp 12 — + Rotation AUC 0.921
AUC: 0.921 · Mel Recall: 50% · Cancer Sens: 51%
train_transform: RandomHorizontalFlip() + RandomRotation(15) + ToTensor()

Adding ±15° rotation simulates dermoscopes being held at different angles. AUC is slightly lower than flip-only, but melanoma recall jumped back to 50% and cancer sensitivity hit 51%. The rotation adds enough noise that the model can't just memorize spatial layout — it has to learn rotationally-invariant features, which happen to be more useful for distinguishing cancer from benign lesions.

Exp 13 — + Color Jitter AUC 0.915
AUC: 0.915 · Mel Recall: 31% · Cancer Sens: 23%
Added: ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)

Color jitter simulates lighting variation in dermoscopy. AUC dipped slightly, and melanoma recall dropped. Why? Color is actually informative for skin lesion classification — melanomas tend to have irregular colors. Too aggressive color jitter can wash out the signal that distinguishes dangerous lesions. Lesson: not all augmentation helps equally. Domain knowledge matters when choosing augmentations.

Exp 14 — + Normalize (GOTCHA) AUC 0.539
AUC: 0.539 · Mel Recall: 0% · Cancer Sens: 0%
Added: transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]) · BUT: val_transform stays as just ToTensor() (locked section!)

WORST RESULT IN THE ENTIRE SUITE.

This is an intentional trap. Normalizing with ImageNet statistics shifts pixel values to be centered around zero with unit variance. But the validation transform can't be changed (it's in a locked section), so val data stays in [0,1] range. The model learns features on normalized data and then sees completely different distributions at evaluation time. Preprocessing must be consistent between train and validation. This is one of the most common bugs in real ML pipelines and is extremely hard to debug because the model still trains fine — it only fails silently at evaluation.
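For contrast, here is what consistent preprocessing looks like when you control both pipelines, as a sketch. Define the normalization once and share it; the ImageNet statistics are just the ones from the card:

    from torchvision import transforms

    # One normalization, shared by both pipelines.
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15),
        transforms.ToTensor(),
        normalize,      # train sees normalized pixels...
    ])

    val_transform = transforms.Compose([
        transforms.ToTensor(),
        normalize,      # ...and validation sees the same distribution.
    ])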

Chapter 5: Handle Class Imbalance

Class weights are the most direct way to tell the model "melanoma matters more." They penalize misclassification of rare classes more heavily. AUC dips slightly because the model sacrifices accuracy on the dominant class (nv), but cancer detection improves substantially. This is almost always the right tradeoff in medicine.
Exp 15 — Class Weights AUC 0.905
AUC: 0.905 · Mel Recall: 56% · Cancer Sens: 29% · Accuracy: 62.4%
criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights)

Accuracy dropped from ~70% to 62% — but melanoma recall jumped to 56%. The inverse-frequency weights make a melanoma misclassification cost ~6x more than missing a nevus. The model now actively looks for melanoma features instead of defaulting to the safe "it's probably benign" prediction. In clinical terms: we went from catching 0 melanomas to catching more than half. The accuracy drop comes from nevi being misclassified more often — more false positives, but those just mean extra biopsies, not missed cancers.
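A sketch of one standard way to build such weights, assuming train_labels is a 1-D tensor of class indices (the notebook's exact formula may differ):

    import torch
    import torch.nn as nn

    counts = torch.bincount(train_labels, minlength=7).float()
    weights = counts.sum() / (len(counts) * counts)  # rare class -> large weight
    criterion = nn.CrossEntropyLoss(weight=weights)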

Exp 16 — Label Smoothing AUC 0.903
AUC: 0.903 · Mel Recall: 26% · Cancer Sens: 44%
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

Label smoothing replaces the hard target [0,0,0,0,1,0,0] with a soft target: the true class gets 1 - 0.1 + 0.1/7 ≈ 0.914 and every other class gets 0.1/7 ≈ 0.014. This prevents the model from becoming overconfident and acts as regularization. AUC is comparable to class weights, but melanoma recall is lower (26% vs 56%). Label smoothing helps generalization broadly but doesn't specifically target rare classes the way class weights do. For imbalanced medical data, class weights are usually more impactful.
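The arithmetic behind those soft targets, following PyTorch's definition (the target becomes a mixture of the one-hot vector and a uniform distribution):

    # target = (1 - eps) * one_hot + eps / K
    eps, K = 0.1, 7
    true_class_target = (1 - eps) + eps / K   # ~0.914
    other_class_target = eps / K              # ~0.014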

Chapter 6: Batch Size & Epochs

More epochs help (0.92 at 20 epochs), but watch for overfitting. Batch size has surprising effects on class balance — large batches smooth out gradients in a way that favors majority classes, devastating rare-class detection.
Exp 17 — Batch Size 64 AUC 0.899
AUC: 0.899 · Mel Recall: 8% · Cancer Sens: 35%
Changed: BATCH_SIZE = 64 (was 32)

Doubling batch size from 32 to 64 barely changed AUC but halved the gradient updates per epoch (219 → 110 steps). The smoother gradients favor majority classes. Training is faster (fewer steps), but melanoma recall collapsed from 41% to 8%. In imbalanced datasets, smaller batches create noisier gradients that actually help the model "notice" rare classes.

Exp 18 — Batch Size 128 AUC 0.911
AUC: 0.911 · Mel Recall: 1% · Cancer Sens: 24%
Changed: BATCH_SIZE = 128

Surprisingly high AUC (0.911!) but 1% melanoma recall. This is the poster child for why AUC alone is dangerous. With only ~55 gradient updates per epoch over 7007 images, and melanoma making up only a small slice of each 128-image batch, the averaged gradient is dominated by the common lesion classes. The model optimizes for the majority class and achieves great AUC by being very good at distinguishing the common lesions from each other. A student showing this AUC would look great on the leaderboard while deploying a clinically dangerous model.
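The batch-count arithmetic behind both experiments:

    import math

    n_train = 7007  # DermaMNIST training images
    for bs in (32, 64, 128):
        print(f"batch_size={bs}: {math.ceil(n_train / bs)} updates/epoch")
    # batch_size=32: 219, batch_size=64: 110, batch_size=128: 55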

Exp 19 — 20 Epochs AUC 0.923
AUC: 0.923 · Mel Recall: 46% · Cancer Sens: 51%
Changed: NUM_EPOCHS = 20 (was 10)

More training time helps across the board: AUC 0.923, melanoma recall 46%, cancer sensitivity 51%. The model has more time to learn features for rare classes. But check the training curves — the gap between train and val loss is widening by epoch 20. That's overfitting starting. Going to 50 epochs without regularization would likely hurt. The sweet spot depends on your architecture and regularization strategy.

Chapter 7: Learning Rate Scheduling

Cosine annealing achieved the highest AUC of any experiment: 0.929. The idea: start with a high learning rate to explore broadly, then gradually reduce it to fine-tune into a sharp minimum. CosineAnnealingLR is the modern default.
Exp 20 — Cosine LR Schedule AUC 0.929 — #1 AUC
AUC: 0.929 · Mel Recall: 41% · Cancer Sens: 45% · Epochs: 15
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15) · NUM_EPOCHS = 15

The leaderboard champion. CosineAnnealingLR smoothly decays the learning rate from 1e-3 to ~0 over 15 epochs following a cosine curve. Early epochs make big moves; late epochs do precision fine-tuning. The result: best-in-suite AUC at 0.929. But melanoma recall is only 41%. This model excels at the overall classification task but doesn't specifically prioritize cancer. This is the model students would be proudest of — and the one that would hurt patients if deployed without additional safeguards.
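In code, the scheduler wraps the optimizer and is stepped once per epoch. A sketch, where train_one_epoch and evaluate stand in for the notebook's training loop:

    import torch.optim as optim

    NUM_EPOCHS = 15
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

    for epoch in range(NUM_EPOCHS):
        train_one_epoch(model, train_loader, optimizer, criterion)  # assumed helper
        evaluate(model, val_loader)                                 # assumed helper
        scheduler.step()  # lr follows a cosine from 1e-3 down toward ~0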

Chapter 8: Stack the Wins

Combining BatchNorm + augmentation + class weights + cosine LR = best cancer detection. Individual improvements compound, but there are diminishing returns. The clinically-focused combined model trades leaderboard position for patient safety.
Exp 21 — Best Custom CNN AUC 0.891
AUC: 0.891 · Mel Recall: 56% · Cancer Sens: 60% · Params: 102,407
Model: 3-block BN + Dropout(0.4) + hidden FC (128 → 64 → 7) · Augmentation: HFlip + VFlip + Rotation(20) + ColorJitter · Loss: CrossEntropyLoss(weight=class_weights) · Optimizer: Adam(1e-3) + CosineAnnealingLR, 15 epochs, batch=64

Everything we learned combined into one model. AUC is "only" 0.891 (rank ~14 out of 25), but cancer sensitivity hits 60% and melanoma recall is 56%. The class weights pull the model toward detecting rare cancers, the augmentation improves generalization, and the cosine scheduler fine-tunes into a good minimum. This is the model a dermatologist would actually want as a screening aid: it catches more than half of all cancers while maintaining reasonable specificity.

Exp 22 — + Label Smoothing AUC 0.846
AUC: 0.846 · Mel Recall: 65% · Cancer Sens: 61%
Added: label_smoothing=0.1 to the weighted CrossEntropyLoss

Adding label smoothing on top of class weights pushed melanoma recall to 65% and cancer sensitivity to 61% — but AUC dropped to 0.846. The soft targets from label smoothing compound with class weights to create very strong pressure toward rare-class detection, at the cost of majority-class calibration. This is diminishing returns: each added technique provides less marginal benefit and can start interfering with others. The art of ML is knowing when to stop stacking.

Chapter 9: Transfer Learning

ResNet18 pretrained on ImageNet achieves 0.91 AUC — matching weeks of architecture search with one import statement. But pretrained features need fine-tuning for medical images, and 11M parameters on 7k training images risks overfitting.
Exp 23 — ResNet18 Fine-Tuned AUC 0.915
AUC: 0.915 · Mel Recall: 28% · Cancer Sens: 33% · Params: 11,180,103
Model: torchvision ResNet18 (ImageNet-pretrained) with fc → Linear(512,7) · Optimizer: Adam(lr=1e-4), CosineAnnealingLR, 10 epochs

11 million parameters pretrained on 1.2 million natural images. The conv layers already know edges, textures, colors, and shapes — they just need to adapt to dermoscopy. AUC of 0.915 is strong, but melanoma recall is only 28%. Without class weights, the massive model capacity is spent on majority classes. Transfer learning is powerful but not magic — you still need to handle imbalance. Also note: 87s training time vs ~12s for the custom CNN. The ImageNet features help on average but don't specifically target the clinical question.

Exp 24 — ResNet18 Frozen AUC 0.761
AUC: 0.761 · Mel Recall: 3% · Cancer Sens: 3%
All backbone params frozen (requires_grad=False), only fc layer trains

A common shortcut: freeze the pretrained backbone and only train the classification head. This failed badly (0.761, worse than baseline). Why? ImageNet features are learned from natural images (dogs, cars, buildings). Dermoscopy images are tiny (28×28), with very different visual statistics. The frozen features don't transfer well to this domain. Lesson: feature extraction without fine-tuning only works when domains are similar. Medical imaging almost always requires fine-tuning.
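A sketch of both setups using the current torchvision API (the weights enum and head details may differ from the notebook):

    import torch.nn as nn
    from torchvision import models

    # Exp 23: load ImageNet weights, swap the 1000-way head for 7 classes,
    # and fine-tune everything at a small learning rate.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 7)  # in_features == 512

    # Exp 24's shortcut: freeze the backbone so only the new head trains.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc.")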

Chapter 10: The Kitchen Sink

Everything we know, thrown at the problem. Modified conv1 for small images, all augmentations, class weights, label smoothing, cosine scheduling. This is what a competition submission looks like when you optimize for the right metric.
Exp 25 — Kitchen Sink AUC 0.869
AUC: 0.869 · Mel Recall: 58% · Cancer Sens: 59% · Params: 11,172,423 · Time: 138s
Model: ResNet18 with conv1 replaced (3×3 stride-1 for 28×28), maxpool removed, Dropout(0.3) → Linear(512,7) · Augmentation: HFlip + VFlip + Rotation(20) + RandomAffine + ColorJitter · Loss: CrossEntropyLoss(weight=class_weights, label_smoothing=0.1) · Optimizer: Adam(lr=1e-4) + CosineAnnealingLR, 15 epochs, batch=64

The "final boss" experiment. We adapted ResNet18 for 28×28 images by replacing the aggressive 7×7 stride-2 first conv with a 3×3 stride-1 conv (preserving spatial resolution) and removing the maxpool. Combined with class weights, aggressive augmentation, and label smoothing, this model prioritizes cancer detection over leaderboard rank. AUC of 0.869 wouldn't win any competitions, but 58% melanoma recall and 59% cancer sensitivity represent the best balance we achieved between overall performance and clinical utility.

The Verdict: Which Model Would You Deploy?

A dermatology practice needs a screening aid that catches cancer (high sensitivity), while keeping false positives manageable (patients sent for unnecessary biopsies). Missing a melanoma is catastrophic; an extra biopsy is inconvenient. The cost asymmetry is extreme.

Our Pick: Best Custom CNN (Exp 21)

AUC: 0.891 · Melanoma Recall: 56% · Cancer Sensitivity: 60% · Params: 102k · Inference: Fast (CPU OK)

Why this one?

In a real deployment, this model would flag suspicious lesions for dermatologist review. The 40% of missed melanomas motivates the next step: ensembling or multi-stage screening.

Runner-up: Custom + Label Smoothing (Exp 22) — 65% melanoma recall, 61% cancer sensitivity, but lower AUC (0.846). If your only goal is catching melanoma, this model wins. The tradeoff: more false positives on benign lesions, meaning more unnecessary biopsies.
Honorable mention: Hidden FC (Exp 10), with 66% melanoma recall (the highest in the suite) from just 5 epochs of training. A surprisingly strong result from a simple architecture change. The hidden FC layer gives the classifier enough expressive power to learn a non-linear decision boundary for melanoma, even with only 102k parameters.

What About Ensembles?

In real clinical AI, the answer is rarely a single model. A mixture-of-experts approach could combine the high-AUC Cosine LR model (good at overall classification) with the class-weighted Best Custom CNN (good at catching cancer). If either model flags a lesion as suspicious, send it for biopsy. This "OR" ensemble would have much higher sensitivity than any single model, at the cost of more false positives — a tradeoff most dermatologists would happily accept.
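A sketch of that decision rule. Melanoma is class index 4 in DermaMNIST (verify against your label mapping); model_auc and model_sens are the two trained models:

    import torch

    MEL = 4  # melanoma's class index in DermaMNIST

    @torch.no_grad()
    def flag_suspicious(images, model_auc, model_sens, threshold=0.5):
        """Refer for review/biopsy if EITHER model calls the lesion melanoma."""
        p1 = torch.softmax(model_auc(images), dim=1)[:, MEL]
        p2 = torch.softmax(model_sens(images), dim=1)[:, MEL]
        return (p1 > threshold) | (p2 > threshold)  # boolean mask per image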

The Stanford CNN that matched 21 dermatologists paired a single fine-tuned network with a clinical disease taxonomy, aggregating fine-grained predictions into the benign/malignant decisions clinicians care about; real deployments similarly layer ensembles and clinical decision rules on top of raw model outputs. The single-model results you see here are just the starting point.

The Punchline

The model that wins the leaderboard (0.93 AUC)
is not the model you'd deploy to a clinic.

Cosine LR catches 41% of melanomas. The Best Custom CNN with class weights catches 56%. In the real world, the metric you optimize is not always the metric that matters. Always look at per-class metrics for safety-critical applications.

Visualizations

Per-class recall heatmap
The heatmap tells the real story. Green = the model detects that disease. Red = it misses it. Look at the MELANOMA column: most high-AUC models are yellow or red. Only models with class weights (Best Custom CNN, Kitchen Sink) turn green.
Clinical metrics comparison
Left: High AUC doesn't guarantee high melanoma recall. The danger zone (pink) catches most models. Right: Blue bars (AUC) vs red/orange bars (cancer metrics) diverge significantly.
AUC ranked bar chart
All 25 experiments ranked by macro AUC. Nine experiments clear the 0.90 mark. The normalize gotcha is dead last.
Chapter groups
Experiments grouped by chapter. Within each chapter, you can see which single change helped most.
Progression arc
The narrative arc from experiment 1 to 25. Note the dip at exp 14 (normalize gotcha) and the recovery with combined strategies.
AUC vs time
Efficiency frontier: the custom CNNs cluster in the fast/high-AUC region. ResNet18 is slower but not always better.
AUC vs parameters
A well-tuned 95k-param CNN can match an 11M-param ResNet18. The baseline's 7k params are simply insufficient.

Key Takeaways for HW5

  1. Adam(lr=1e-3) is almost always better than SGD out of the box
  2. BatchNorm is often the single biggest architecture improvement
  3. Data augmentation is free performance — but keep it consistent with val
  4. Class weights trade overall accuracy for catching rare important classes
  5. CosineAnnealingLR is the modern default for scheduling
  6. Transfer learning is a powerful lever, but only with fine-tuning (frozen features failed here)
  7. The metric you optimize is not always the metric that matters