MPHY 6120 · Module 6 · Beat the Baseline Challenge
This guide won't give you the answers. It will point you in productive directions and help you understand why each change matters. Open only the hints you need — you'll learn more by struggling a bit first.
Remember: modify only the sections marked MODIFY THIS, then re-run with
uv run python challenge_beat_the_baseline.py
Where does your AUC put you?
These are one-line changes in Section 5 that should immediately improve your score.
The optimizer matters more than you think.
The baseline uses SGD with a learning rate of 0.1. That's... aggressive. Think about what happens when you take giant steps through a loss landscape with lots of valleys.
Look in the torch.optim module, specifically torch.optim.Adam. Swapping the optimizer in Section 5 is just one line. What default learning rate do most tutorials recommend? 1e-3 (0.001), which is 100x smaller than the baseline's SGD learning rate. Too high and you overshoot; too low and you underfit within a few epochs.
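The one-line swap can be sketched like this. The tiny nn.Linear stand-in model is only there to make the snippet self-contained; in the challenge script you would pass your own model's parameters.

```python
import torch
import torch.nn as nn

# Stand-in model so the sketch runs on its own; use your BaselineModel
# instance in the actual script.
model = nn.Linear(10, 7)

# Baseline (aggressive): optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# One-line swap: Adam with its usual default learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Adam adapts the step size per parameter, which is why it tolerates a much smaller (and safer) base learning rate than plain SGD here.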
5 epochs might not be enough.
The model sees each image only 5 times. Would you learn to distinguish 7 types of skin lesions from 5 passes through a textbook?
These changes go in Section 4. The baseline model is intentionally weak.
The baseline has 6,887 parameters. That's fewer parameters than pixels in a single training batch. The model literally cannot represent the features it needs. Think of it like trying to describe 7 types of skin lesions using only 10 words.
In the BaselineModel.__init__, look at the nn.Conv2d layers.
The second argument is the number of output channels (feature maps). More channels = more
features the model can detect.
If you change channel counts, you also need to update the nn.Linear
layer's input size to match. Think about what size the feature maps are after two rounds
of MaxPool2d(2) on a 28×28 image.
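A quick way to check the Linear layer's required input size is to push a dummy tensor through the conv stack and read off the shape. This sketch assumes 3-channel input and 3x3 convs with padding=1 (so the convs preserve spatial size and only the pools shrink it); adjust to match your own layer choices.

```python
import torch
import torch.nn as nn

# With padding-preserving convs, two rounds of MaxPool2d(2) take
# 28x28 -> 14x14 -> 7x7.
features = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.zeros(1, 3, 28, 28)   # one dummy RGB image
out = features(x)
print(out.shape)                # torch.Size([1, 64, 7, 7])

# So the Linear layer needs in_features = 64 * 7 * 7 = 3136.
head = nn.Linear(64 * 7 * 7, 7)
```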
Look up nn.BatchNorm2d. It goes after each Conv2d layer (before or after ReLU,
both work). It normalizes activations, stabilizes training, and often provides
the single biggest accuracy jump.
Usage: nn.BatchNorm2d(num_channels) where num_channels matches
the Conv2d output channels.
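One conv block with BatchNorm might look like the following sketch; the only thing to get right is that BatchNorm2d's argument matches the Conv2d output channels (32 here, chosen for illustration).

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU -> Pool; BatchNorm2d(32) matches the 32
# output channels of the Conv2d before it.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),   # normalizes each of the 32 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),
)
```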
Consider adding a third conv block. A common pattern: 32 → 64 → 128 channels, each block being Conv → BatchNorm → ReLU → Pool.
If you add more pooling layers, the spatial size shrinks. After two MaxPool2d(2) on
a 28×28 image you get 7×7. A third would give 3×3. Or try
nn.AdaptiveAvgPool2d(1) which collapses any spatial size to 1×1 —
very clean.
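Putting the pieces together, a three-block model could look like this sketch (channel counts and 3-channel input are assumptions, not requirements). AdaptiveAvgPool2d(1) means the Linear layer's input size is just the final channel count, no spatial arithmetic needed.

```python
import torch
import torch.nn as nn

class ThreeBlockCNN(nn.Module):
    """Illustrative 32 -> 64 -> 128 channel CNN. AdaptiveAvgPool2d(1)
    collapses any remaining spatial size to 1x1, so the classifier's
    in_features is simply 128."""
    def __init__(self, num_classes=7):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))      # (N, 128, 1, 1)
        return self.classifier(x.flatten(1)) # (N, num_classes)

model = ThreeBlockCNN()
print(model(torch.zeros(2, 3, 28, 28)).shape)  # torch.Size([2, 7])
```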
Overfitting? If your training accuracy is much higher than validation accuracy, the model is memorizing instead of learning. There's a standard technique to fight this.
nn.Dropout(p) randomly zeros out activations during training with probability
p. Try p=0.3 or p=0.5 before your final Linear layer.
It forces the model not to rely on any single neuron, and it automatically turns off during evaluation.
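A minimal placement sketch (the layer sizes are placeholders; match them to your own model):

```python
import torch.nn as nn

# Dropout just before the final Linear layer; p in the 0.3-0.5 range
# is a common starting point.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # zeros 50% of activations, but only in training mode
    nn.Linear(128, 7),
)
```

Calling model.eval() before validation disables dropout automatically, so you don't need any extra logic in the evaluation loop.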
These changes go in Section 2. You're creating "free" training data by showing the model transformed versions of each image.
The model sees 7,007 training images. That's tiny by deep learning standards. But does a skin lesion look different when flipped horizontally? Rotated 10 degrees? Slightly brighter? No. So we can generate new training examples for free.
torchvision.transforms. Add augmentations to train_transform
before ToTensor(). Start with the simplest one.
Only modify train_transform. The validation set needs to stay consistent so you can compare scores across experiments.
transforms.RandomHorizontalFlip() — 50% chance of mirroring each image.
Skin lesions have no left-right orientation, so this is always safe.
One line, often noticeable improvement.
Geometric: RandomRotation(degrees), RandomVerticalFlip(),
RandomAffine(degrees, translate)
Photometric: ColorJitter(brightness, contrast, saturation, hue)
Stack them in the Compose list. Order matters slightly but isn't critical. Start conservative (small values) and increase if it helps.
You might be tempted to add transforms.Normalize(mean=[...], std=[...])
with ImageNet statistics. Be very careful.
Think about what happens if your training data is normalized but your validation data isn't. The model learns on data centered around 0, then sees data in [0,1] range at test time. What do you think happens?
Since you can't modify val_transform (it's in a locked section),
adding Normalize to train_transform alone will hurt performance.
These are the techniques that separate good results from great ones. Changes span Section 4 and Section 5.
This dataset is heavily imbalanced. Look at the class distribution printed when you run the script. One class has ~4,700 images. Another has ~80. The model takes the easy path: predict the big class for everything and get 67% accuracy.
nn.CrossEntropyLoss accepts a weight parameter.
What if you made misclassifying a rare class cost more?
The idea: a class with 80 samples should have a higher weight than a class with 4,700. This forces the model to pay equal attention to all classes, not just the common ones.
A standard formula: weight = total_samples / (num_classes * class_count)
The training labels are already extracted as train_labels (a numpy array).
You can use np.bincount(train_labels) to count each class.
Compute inverse-frequency weights, convert to a torch.FloatTensor,
move to DEVICE, and pass as nn.CrossEntropyLoss(weight=your_weights).
Put this code in Section 5, before the criterion line.
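The recipe above can be sketched as follows. The small train_labels array here is a made-up stand-in with the same flavor of imbalance as the real dataset; in the script you would use the existing train_labels and move the tensor to DEVICE.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical labels standing in for the script's train_labels array.
train_labels = np.array([0] * 4700 + [1] * 80 + [2] * 220)

counts = np.bincount(train_labels)                    # samples per class
# Inverse-frequency weights: total / (num_classes * class_count)
weights = len(train_labels) / (len(counts) * counts)

class_weights = torch.as_tensor(weights, dtype=torch.float32)  # .to(DEVICE) in the script
criterion = nn.CrossEntropyLoss(weight=class_weights)
```

The rarest class ends up with the largest weight, so every mistake on it costs the model proportionally more.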
Should the learning rate stay the same the whole time? Think of searching for something in a room. First you scan broadly (big steps). Then you look carefully in the promising area (small steps).
Look at torch.optim.lr_scheduler. The training loop already handles schedulers —
if scheduler is not None, it calls scheduler.step()
once per epoch.
A popular choice: CosineAnnealingLR. It smoothly decays the learning rate
following a cosine curve from your initial LR down to ~0.
Create the optimizer first, then the scheduler:
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)
T_max should match your total number of epochs so the cosine curve
completes exactly one half-cycle.
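A self-contained sketch of the setup; NUM_EPOCHS and the nn.Linear stand-in model are placeholders for the script's own constants and model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

NUM_EPOCHS = 30              # stand-in for the script's epoch count
model = nn.Linear(10, 7)     # stand-in model

optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

# The training loop calls scheduler.step() once per epoch; the learning
# rate decays from 1e-3 toward ~0 along a cosine curve.
for epoch in range(NUM_EPOCHS):
    # ... train one epoch ...
    optimizer.step()     # placeholder for the real per-batch updates
    scheduler.step()
```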
What if someone already trained a powerful vision model on millions of images? What if you could use their feature detectors and just retrain the final classification layer for your task?
torchvision.models has pretrained models like ResNet18, trained on
ImageNet (1.2 million natural images). These models have already learned to detect
edges, textures, and patterns that transfer to medical images.
The idea: load a pretrained model, replace the final classification layer to output 7 classes, and fine-tune.
Load it: from torchvision import models
ResNet18 has a .fc attribute (the final fully-connected layer).
By default it outputs 1000 classes (ImageNet). You need to replace it with one that
outputs NUM_CLASSES (7).
Use a lower learning rate (try 1e-4) — the pretrained weights are already good, you just need to nudge them.
ResNet18 was designed for 224×224 images. Its first layer is a 7×7 conv with stride 2, followed by a maxpool with stride 2. On a 28×28 image, that aggressively downsamples to 7×7 in the first two operations, throwing away spatial information.
Advanced move: replace model.conv1 with a smaller conv (3×3, stride 1)
and replace model.maxpool with nn.Identity(). This preserves
more spatial resolution for small images.
This isn't about code. It's about what your model actually does.
Look at the per-class results printed at the end of each run.
Find the row for mel (melanoma). What's the recall?
Recall means: of all actual melanomas, what fraction did your model catch?
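In code, recall for one class is just true positives over all actual positives. The arrays below are hypothetical, with class 4 standing in for mel:

```python
import numpy as np

# Made-up labels and predictions; class 4 plays the role of melanoma.
y_true = np.array([4, 4, 4, 4, 0, 0, 1])
y_pred = np.array([4, 0, 4, 0, 0, 0, 1])

mel = 4
caught = np.sum((y_true == mel) & (y_pred == mel))  # melanomas the model caught
actual = np.sum(y_true == mel)                      # all actual melanomas
recall = caught / actual
print(recall)   # 0.5
```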
In medical AI, the metric you optimize is not always the metric that matters.
AUC measures overall discrimination across all classes equally. But missing a melanoma (life-threatening) is not the same as misclassifying a benign keratosis (inconvenient). A model optimized purely for AUC will spend most of its capacity on the common classes because that's where the easy points are.
Class weights (from Level 4). When you penalize melanoma misclassification more heavily, the model is forced to learn melanoma-specific features instead of taking the easy path of predicting the majority class.
Your AUC might actually go down when you add class weights. That's okay. You're trading leaderboard points for clinical utility. In the real world, that's the right tradeoff.
| Symptom | Likely Cause | Try This |
|---|---|---|
| AUC stuck at ~0.82 | Bad optimizer or LR | Switch to Adam, lower LR |
| AUC ~0.85, won't go higher | Model too small | More channels, add BatchNorm |
| Train acc >> val acc | Overfitting | Add Dropout, augmentation |
| Val loss going up | Too many epochs | Fewer epochs, or add scheduler |
| AUC dropped drastically | Train/val mismatch | Check if you transformed val data differently |
| Accuracy ~67%, model predicts one class | Class imbalance | Class weights in loss function |
| High AUC but 0% melanoma recall | Model ignoring rare classes | Class weights, smaller batch size |