MPHY 6120 · Module 6 · Beat the Baseline Challenge
This guide won't give you the answers. It will point you in productive directions and help you understand why each change matters. Open only the hints you need — you'll learn more by struggling a bit first.
Remember: modify only the sections marked MODIFY THIS, then re-run with
uv run python challenge_beat_the_baseline.py
Where does your AUC put you?
These are one-line changes in Section 5 that should immediately improve your score.
The optimizer matters more than you think.
The baseline uses SGD with a learning rate of 0.1. That's... aggressive. Think about what happens when you take giant steps through a loss landscape with lots of valleys.
Look in the torch.optim module, specifically torch.optim.Adam. Swapping the optimizer in Section 5 is just one line. What default learning rate do most tutorials recommend? 1e-3 (0.001), which is 100x smaller than the baseline's SGD learning rate. Too high and you overshoot; too low and you underfit within a few epochs.
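The one-line swap can be sketched like this. The tiny nn.Linear stand-in model is only there to make the snippet self-contained; in the challenge script you would pass your own model's parameters.

```python
import torch
import torch.nn as nn

# Stand-in model so the sketch runs on its own; use your BaselineModel
# instance in the actual script.
model = nn.Linear(10, 7)

# Baseline (aggressive): optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# One-line swap: Adam with its usual default learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Adam adapts the step size per parameter, which is why it tolerates a much smaller (and safer) base learning rate than plain SGD here.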
5 epochs might not be enough.
The model sees each image only 5 times. Would you learn to distinguish 7 types of skin lesions from 5 passes through a textbook?
These changes go in Section 4. The baseline model is intentionally weak.
The baseline has 6,887 parameters. That's fewer parameters than pixels in a single training batch. The model literally cannot represent the features it needs. Think of it like trying to describe 7 types of skin lesions using only 10 words.
In the BaselineModel.__init__, look at the nn.Conv2d layers.
The second argument is the number of output channels (feature maps). More channels = more
features the model can detect.
If you change channel counts, you also need to update the nn.Linear
layer's input size to match. Think about what size the feature maps are after two rounds
of MaxPool2d(2) on a 28×28 image.
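A quick way to check the Linear layer's required input size is to push a dummy tensor through the conv stack and read off the shape. This sketch assumes 3-channel input and 3x3 convs with padding=1 (so the convs preserve spatial size and only the pools shrink it); adjust to match your own layer choices.

```python
import torch
import torch.nn as nn

# With padding-preserving convs, two rounds of MaxPool2d(2) take
# 28x28 -> 14x14 -> 7x7.
features = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.zeros(1, 3, 28, 28)   # one dummy RGB image
out = features(x)
print(out.shape)                # torch.Size([1, 64, 7, 7])

# So the Linear layer needs in_features = 64 * 7 * 7 = 3136.
head = nn.Linear(64 * 7 * 7, 7)
```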
Look up nn.BatchNorm2d. It goes after each Conv2d layer (before or after ReLU,
both work). It normalizes activations, stabilizes training, and often provides
the single biggest accuracy jump.
Usage: nn.BatchNorm2d(num_channels) where num_channels matches
the Conv2d output channels.
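One conv block with BatchNorm might look like the following sketch; the only thing to get right is that BatchNorm2d's argument matches the Conv2d output channels (32 here, chosen for illustration).

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU -> Pool; BatchNorm2d(32) matches the 32
# output channels of the Conv2d before it.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),   # normalizes each of the 32 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),
)
```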
Consider adding a third conv block. A common pattern: 32 → 64 → 128 channels, each block being Conv → BatchNorm → ReLU → Pool.
If you add more pooling layers, the spatial size shrinks. After two MaxPool2d(2) on
a 28×28 image you get 7×7. A third would give 3×3. Or try
nn.AdaptiveAvgPool2d(1) which collapses any spatial size to 1×1 —
very clean.
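Putting the pieces together, a three-block model could look like this sketch (channel counts and 3-channel input are assumptions, not requirements). AdaptiveAvgPool2d(1) means the Linear layer's input size is just the final channel count, no spatial arithmetic needed.

```python
import torch
import torch.nn as nn

class ThreeBlockCNN(nn.Module):
    """Illustrative 32 -> 64 -> 128 channel CNN. AdaptiveAvgPool2d(1)
    collapses any remaining spatial size to 1x1, so the classifier's
    in_features is simply 128."""
    def __init__(self, num_classes=7):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))      # (N, 128, 1, 1)
        return self.classifier(x.flatten(1)) # (N, num_classes)

model = ThreeBlockCNN()
print(model(torch.zeros(2, 3, 28, 28)).shape)  # torch.Size([2, 7])
```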
Overfitting? If your training accuracy is much higher than validation accuracy, the model is memorizing instead of learning. There's a standard technique to fight this.
nn.Dropout(p) randomly zeros out activations during training with probability
p. Try p=0.3 or p=0.5 before your final Linear layer.
It forces the model not to rely on any single neuron, and it automatically turns off during evaluation.
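A minimal placement sketch (the layer sizes are placeholders; match them to your own model):

```python
import torch.nn as nn

# Dropout just before the final Linear layer; p in the 0.3-0.5 range
# is a common starting point.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # zeros 50% of activations, but only in training mode
    nn.Linear(128, 7),
)
```

Calling model.eval() before validation disables dropout automatically, so you don't need any extra logic in the evaluation loop.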
These changes go in Section 2. You're creating "free" training data by showing the model transformed versions of each image.
The model sees 7,007 training images. That's tiny by deep learning standards. But does a skin lesion look different when flipped horizontally? Rotated 10 degrees? Slightly brighter? No. So we can generate new training examples for free.
torchvision.transforms. Add augmentations to train_transform
before ToTensor(). Start with the simplest one.
Only modify train_transform. The validation set needs to stay consistent so you can compare scores across experiments.
transforms.RandomHorizontalFlip() — 50% chance of mirroring each image.
Skin lesions have no left-right orientation, so this is always safe.
One line, often noticeable improvement.
Geometric: RandomRotation(degrees), RandomVerticalFlip(),
RandomAffine(degrees, translate)
Photometric: ColorJitter(brightness, contrast, saturation, hue)
Stack them in the Compose list. Order matters slightly but isn't critical. Start conservative (small values) and increase if it helps.
You might be tempted to add transforms.Normalize(mean=[...], std=[...])
with ImageNet statistics. Be very careful.
Think about what happens if your training data is normalized but your validation data isn't. The model learns on data centered around 0, then sees data in [0,1] range at test time. What do you think happens?
Since you can't modify val_transform (it's in a locked section),
adding Normalize to train_transform alone will hurt performance.
These are the techniques that separate good results from great ones. Changes span Section 4 and Section 5.
This dataset is heavily imbalanced. Look at the class distribution printed when you run the script. One class has ~4,700 images. Another has ~80. The model takes the easy path: predict the big class for everything and get 67% accuracy.
nn.CrossEntropyLoss accepts a weight parameter.
What if you made misclassifying a rare class cost more?
The idea: a class with 80 samples should have a higher weight than a class with 4,700. This forces the model to pay equal attention to all classes, not just the common ones.
A standard formula: weight = total_samples / (num_classes * class_count)
The training labels are already extracted as train_labels (a numpy array).
You can use np.bincount(train_labels) to count each class.
Compute inverse-frequency weights, convert to a torch.FloatTensor,
move to DEVICE, and pass as nn.CrossEntropyLoss(weight=your_weights).
Put this code in Section 5, before the criterion line.
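The recipe above can be sketched as follows. The small train_labels array here is a made-up stand-in with the same flavor of imbalance as the real dataset; in the script you would use the existing train_labels and move the tensor to DEVICE.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical labels standing in for the script's train_labels array.
train_labels = np.array([0] * 4700 + [1] * 80 + [2] * 220)

counts = np.bincount(train_labels)                    # samples per class
# Inverse-frequency weights: total / (num_classes * class_count)
weights = len(train_labels) / (len(counts) * counts)

class_weights = torch.as_tensor(weights, dtype=torch.float32)  # .to(DEVICE) in the script
criterion = nn.CrossEntropyLoss(weight=class_weights)
```

The rarest class ends up with the largest weight, so every mistake on it costs the model proportionally more.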
Should the learning rate stay the same the whole time? Think of searching for something in a room. First you scan broadly (big steps). Then you look carefully in the promising area (small steps).
Look at torch.optim.lr_scheduler. The training loop already handles schedulers —
if scheduler is not None, it calls scheduler.step()
once per epoch.
A popular choice: CosineAnnealingLR. It smoothly decays the learning rate
following a cosine curve from your initial LR down to ~0.
Create the optimizer first, then the scheduler:
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)
T_max should match your total number of epochs so the cosine curve
completes exactly one half-cycle.
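A self-contained sketch of the setup; NUM_EPOCHS and the nn.Linear stand-in model are placeholders for the script's own constants and model.

```python
import torch
import torch.nn as nn
import torch.optim as optim

NUM_EPOCHS = 30              # stand-in for the script's epoch count
model = nn.Linear(10, 7)     # stand-in model

optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

# The training loop calls scheduler.step() once per epoch; the learning
# rate decays from 1e-3 toward ~0 along a cosine curve.
for epoch in range(NUM_EPOCHS):
    # ... train one epoch ...
    optimizer.step()     # placeholder for the real per-batch updates
    scheduler.step()
```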
What if someone already trained a powerful vision model on millions of images? What if you could use their feature detectors and just retrain the final classification layer for your task?
torchvision.models has pretrained models like ResNet18, trained on
ImageNet (1.2 million natural images). These models have already learned to detect
edges, textures, and patterns that transfer to medical images.
The idea: load a pretrained model, replace the final classification layer to output 7 classes, and fine-tune.
Load it: from torchvision import models
ResNet18 has a .fc attribute (the final fully-connected layer).
By default it outputs 1000 classes (ImageNet). You need to replace it with one that
outputs NUM_CLASSES (7).
Use a lower learning rate (try 1e-4) — the pretrained weights are already good, you just need to nudge them.
ResNet18 was designed for 224×224 images. Its first layer is a 7×7 conv with stride 2, followed by a maxpool with stride 2. On a 28×28 image, that aggressively downsamples to 7×7 in the first two operations, throwing away spatial information.
Advanced move: replace model.conv1 with a smaller conv (3×3, stride 1)
and replace model.maxpool with nn.Identity(). This preserves
more spatial resolution for small images.
This isn't about code. It's about what your model actually does.
Look at the per-class results printed at the end of each run.
Find the row for mel (melanoma). What's the recall?
Recall means: of all actual melanomas, what fraction did your model catch?
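In code, recall for one class is just true positives over all actual positives. The arrays below are hypothetical, with class 4 standing in for mel:

```python
import numpy as np

# Made-up labels and predictions; class 4 plays the role of melanoma.
y_true = np.array([4, 4, 4, 4, 0, 0, 1])
y_pred = np.array([4, 0, 4, 0, 0, 0, 1])

mel = 4
caught = np.sum((y_true == mel) & (y_pred == mel))  # melanomas the model caught
actual = np.sum(y_true == mel)                      # all actual melanomas
recall = caught / actual
print(recall)   # 0.5
```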
In medical AI, the metric you optimize is not always the metric that matters.
AUC measures overall discrimination across all classes equally. But missing a melanoma (life-threatening) is not the same as misclassifying a benign keratosis (inconvenient). A model optimized purely for AUC will spend most of its capacity on the common classes because that's where the easy points are.
Class weights (from Level 4). When you penalize melanoma misclassification more heavily, the model is forced to learn melanoma-specific features instead of taking the easy path of predicting the majority class.
Your AUC might actually go down when you add class weights. That's okay. You're trading leaderboard points for clinical utility. In the real world, that's the right tradeoff.
| Symptom | Likely Cause | Try This |
|---|---|---|
| AUC stuck at ~0.82 | Bad optimizer or LR | Switch to Adam, lower LR |
| AUC ~0.85, won't go higher | Model too small | More channels, add BatchNorm |
| Train acc >> val acc | Overfitting | Add Dropout, augmentation |
| Val loss going up | Too many epochs | Fewer epochs, or add scheduler |
| AUC dropped drastically | Train/val mismatch | Check if you transformed val data differently |
| Accuracy ~67%, model predicts one class | Class imbalance | Class weights in loss function |
| High AUC but 0% melanoma recall | Model ignoring rare classes | Class weights, smaller batch size |