Get Started: all code goes in hw6_nlp.py.

LLM Access (Parts 3-4): You’ll need access to at least two language models.
Recommended: OpenRouter (free, no GPU needed)
Alternative: Ollama (local) If you set up Ollama during the lab, that works too. See Resources below.
This assignment puts two generations of NLP side by side. You’ll build a traditional pipeline (preprocessing, TF-IDF, entity extraction) and an LLM-powered pipeline, then run them on the same clinical notes and compare.
The punchline isn’t “LLMs are better” — it’s more nuanced than that. Traditional tools are deterministic, fast, and free. LLMs are flexible but hallucinate, cost money, and need guardrails. Understanding when to use each is a core clinical AI skill.
This builds on the Module 8 Lab (LLM Arena): In class you explored model capabilities interactively. Here you’ll build evaluation pipelines programmatically.
The starter repo includes 20 synthetic discharge summaries (data/notes.json) with ground-truth annotations:
These notes mimic real clinical text without containing actual patient information.
1.1 Text Preprocessing (8 pts)
Clinical text is messy. Implement a preprocess_note(text) function that handles:
Run your preprocessor on 3 sample notes and show before/after output.
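As a starting point, here is a minimal sketch of what preprocess_note might look like. The abbreviation list here is illustrative only; expand it based on what you actually see in the notes.

```python
import re

def preprocess_note(text: str) -> str:
    """Minimal clinical-text cleanup: lowercase, expand a few common
    abbreviations, and collapse runs of whitespace."""
    abbreviations = {
        r"\bpt\b": "patient",                 # illustrative entries only;
        r"\bhx\b": "history",                 # grow this from the real notes
        r"\bsob\b": "shortness of breath",
    }
    text = text.lower()
    for pattern, expansion in abbreviations.items():
        text = re.sub(pattern, expansion, text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_note("Pt  has hx of SOB.\n"))
```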
1.2 TF-IDF Classification (12 pts)
Build a classifier to identify complex discharges (patients needing extra follow-up):
This is your baseline — you’ll compare LLM performance against it in Part 4.
2.1 Named Entity Recognition with scispaCy (10 pts)
Extract medical entities from the clinical notes:
# Setup (run once)
# uv add scispacy
# uv run python -m pip install \
# https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz
For each note, extract:
Compare extracted entities to the ground-truth annotations. Report precision and recall for medication extraction specifically.
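A simple set-based scorer is enough for this. A sketch, treating extraction as exact string match on medication names (you may want to relax this to case-insensitive or fuzzy matching):

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over extracted medication names."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall({"lisinopril", "metformin", "aspirin"},
                        {"lisinopril", "metformin", "warfarin"})
print(p, r)  # 2 of 3 predictions correct, 2 of 3 gold items found
```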
If scispaCy installation fails (e.g., network issues), implement a rule-based extractor using regex patterns for medications (drug name + dose + route + frequency) and report its precision/recall instead. Document what you tried.
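If you take the fallback route, a regex extractor might look like the sketch below. The pattern and the drug-name heuristic (a single capitalized word) are assumptions you should tune against the actual notes; real medication mentions are messier.

```python
import re

# Hypothetical pattern: drug name + dose + unit + route + frequency
MED_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+)\s+"
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|units)\s+"
    r"(?P<route>PO|IV|IM|SC|SL)\s+"
    r"(?P<freq>daily|BID|TID|QID|nightly|weekly)"
)

def extract_medications(text):
    """Return one dict per matched medication mention."""
    return [m.groupdict() for m in MED_PATTERN.finditer(text)]

meds = extract_medications(
    "Continue Lisinopril 10 mg PO daily and Metformin 500 mg PO BID."
)
print(meds)
```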
2.2 Negation Detection (10 pts)
Negation changes everything: “no pneumonia” is the opposite of “pneumonia.”
Implement a detect_negation(sentence, entity) function that returns PRESENT, ABSENT, or UNCERTAIN using prefix/suffix cue words (similar to what we built in the NLP walkthrough notebook).
Test on these cases and at least 5 of your own:
"No evidence of pneumonia" → pneumonia: ABSENT
"Patient has diabetes" → diabetes: PRESENT
"Ruled out PE" → PE: UNCERTAIN
"Patient denies chest pain" → chest pain: ABSENT
"History of stroke, now resolved" → stroke: ??? (discuss)
Discuss: Where does this simple approach fail? What would you need for production use?
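One possible shape for detect_negation, using a fixed-width window of text before the entity. The cue lists are deliberately small and meant to be extended as you find failure cases:

```python
NEG_CUES = ["no ", "denies", "without", "negative for"]
UNCERTAIN_CUES = ["ruled out", "rule out", "cannot exclude", "possible"]

def detect_negation(sentence, entity):
    """Classify an entity mention as PRESENT / ABSENT / UNCERTAIN using
    cue words in the text immediately preceding the entity."""
    s = sentence.lower()
    idx = s.find(entity.lower())
    if idx == -1:
        return "NOT_FOUND"
    window = s[max(0, idx - 40):idx]  # only look at nearby preceding text
    if any(cue in window for cue in UNCERTAIN_CUES):
        return "UNCERTAIN"
    if any(cue in window for cue in NEG_CUES):
        return "ABSENT"
    return "PRESENT"

print(detect_negation("No evidence of pneumonia", "pneumonia"))  # ABSENT
print(detect_negation("Ruled out PE", "PE"))                     # UNCERTAIN
```

Note that the "history of stroke, now resolved" case does not fit cleanly into any of the three labels, which is exactly the kind of failure worth discussing.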
Use the OpenRouter API (or Ollama) to access LLMs programmatically. The starter code includes a query_llm() helper function.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def query_llm(prompt, system_prompt="", model="meta-llama/llama-3.2-3b-instruct:free"):
    """Query an LLM via OpenRouter. Returns the response text."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0.3, max_tokens=1024
    )
    return response.choices[0].message.content
Important: Never hardcode your API key. Use an environment variable:
export OPENROUTER_API_KEY="sk-or-..."
3.1 Zero-Shot Information Extraction (10 pts)
Write a prompt that extracts structured information from a clinical note:
Return results as JSON. Test on at least 5 notes from the dataset.
For each note, verify the extracted data against ground truth. Report:
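Model replies are rarely clean JSON, so parse defensively. A sketch, where the prompt wording and field names are placeholders and the mocked reply stands in for a real query_llm call:

```python
import json
import re

# Hypothetical prompt template; adapt the fields to your extraction schema
EXTRACTION_PROMPT = """Extract the following from the clinical note as JSON with keys
"medications", "diagnoses", and "follow_up". Use only information stated in the
note; if a field is absent, return an empty list.

Note:
{note}
"""

def parse_json_reply(reply):
    """LLMs often wrap JSON in prose or code fences; pull out the first
    JSON object, or return None so you can count it as an invalid-JSON failure."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Mocked model reply; in the real pipeline this would be query_llm(...)
reply = ('Here you go:\n```json\n{"medications": ["lisinopril"], '
         '"diagnoses": ["HTN"], "follow_up": []}\n```')
print(parse_json_reply(reply))
```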
3.2 Prompt Engineering (10 pts)
Take your extraction prompt from 3.1 and improve it systematically:
Run all three versions on the same 5 notes. Create a comparison table:
| Note | Version | Medications Found | Correct | Hallucinated | Valid JSON |
|---|---|---|---|---|---|
| 1 | Baseline | … | … | … | … |
| 1 | Structured | … | … | … | … |
| 1 | Few-shot | … | … | … | … |
Which version performed best? Did any version eliminate hallucinations entirely?
3.3 Clinical Summarization (10 pts)
Write a prompt that generates a 3-sentence handoff summary for each note — the kind a night physician would read to take over care.
Generate summaries for 5 notes, then evaluate each summary on three criteria:
This is manual evaluation — read the note, read the summary, and judge. Include your evaluations as code comments or a results table.
This is the capstone: put traditional NLP and LLMs on the same task and compare rigorously.
4.1 Medication Extraction: scispaCy vs LLM (12 pts)
Run both your scispaCy pipeline (Part 2.1) and your best LLM prompt (Part 3.2) on the same 10 notes. Extract medications from each.
Create a comparison table:
| Method | Precision | Recall | F1 | Avg Time/Note | Cost/Note | Hallucinations |
|---|---|---|---|---|---|---|
| scispaCy | Free | N/A | ||||
| LLM (3B) | Free tier | |||||
| LLM (12B) | Free tier |
If you have access to a larger model (via Ollama or paid API), add a row for it.
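For the F1 and timing columns, a small helper sketch (extract_fn stands for either pipeline, scispaCy or LLM):

```python
import time

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def avg_seconds_per_note(extract_fn, notes):
    """Run an extractor over all notes; return results and mean wall-clock
    time per note (for the Avg Time/Note column)."""
    start = time.perf_counter()
    results = [extract_fn(note) for note in notes]
    return results, (time.perf_counter() - start) / len(notes)

print(f1_score(0.9, 0.75))  # about 0.818
```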
4.2 Hallucination Stress Test (10 pts)
Design 5 “trap” notes that test specific hallucination risks:
For each trap, run your LLM prompt and document whether it fell for the trap.
Report your hallucination rate: out of all the traps across all notes, what fraction produced hallucinated content?
4.3 Deployment Recommendation (8 pts)
Write a brief analysis (~250 words) answering:
Your hospital wants to auto-extract medication lists from discharge notes to populate the patient portal. Which approach would you recommend (scispaCy, small LLM, large LLM, or hybrid) and why?
What specific failure modes would you test for before deployment?
What human oversight would you require?
Frame this as a recommendation to a non-technical hospital administrator — clear, specific, and honest about limitations.
Your repository should contain:
| File | Description |
|---|---|
| `hw6_nlp.py` | All code with clear comments |
| `outputs/classification_results.txt` | Part 1.2 metrics and top features |
| `outputs/extraction_comparison.txt` | Part 3.2 prompt comparison table |
| `outputs/head_to_head.txt` | Part 4.1 comparison table |
| `outputs/hallucination_results.txt` | Part 4.2 trap results |
Written analyses (Parts 2.2 discussion, 3.3 evaluations, 4.3 recommendation) can be in code comments or a separate analysis.md.
| Component | Points |
|---|---|
| Part 1: Traditional NLP | 20 |
| 1.1 Text preprocessing | 8 |
| 1.2 TF-IDF classification + metrics | 12 |
| Part 2: Entity Extraction | 20 |
| 2.1 scispaCy NER + accuracy | 10 |
| 2.2 Negation detection + discussion | 10 |
| Part 3: LLM Clinical Tasks | 30 |
| 3.1 Zero-shot extraction + verification | 10 |
| 3.2 Prompt engineering comparison | 10 |
| 3.3 Summarization + manual evaluation | 10 |
| Part 4: Head-to-Head Evaluation | 30 |
| 4.1 scispaCy vs LLM comparison | 12 |
| 4.2 Hallucination stress test | 10 |
| 4.3 Deployment recommendation | 8 |
| Subtotal | 100 |
| Git Workflow | |
| Multiple meaningful commits | -5 if missing |
| API key committed to repo | -10 |
Option 1: OpenRouter (Recommended)
Free account, free models, no GPU needed.
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
# Free models to try:
# "meta-llama/llama-3.2-3b-instruct:free" (small, fast)
# "google/gemma-3-12b-it:free" (medium, better quality)
Option 2: Ollama (Local)
If you installed Ollama during the Module 8 lab:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# Use whatever models you pulled: "llama3.2", "gemma3:4b", etc.
Both options use the same openai Python library — just different base_url.
If you use OpenRouter, set your key in your shell before running (never commit it):

export OPENROUTER_API_KEY="sk-or-..."