Warning: this assignment is not yet released. Check back on February 11, 2026.
This assignment is due on Wednesday, February 25, 2026, before 11:59 PM.

Get Started:

  1. Accept the assignment on GitHub Classroom. You’ll get your own private repository with starter code and data.
  2. Clone your repo and complete the exercises in hw3_eda.py
  3. Commit regularly as you work (this is part of your grade!)
  4. Push your completed work to GitHub before the deadline

Learning Objectives

By completing this assignment, you will:


Background

Before building any predictive model, you must understand your data. In clinical settings, this is especially critical because:

This assignment uses the Pima Indians Diabetes Dataset, a classic clinical prediction dataset. Your task is to explore it thoroughly before any modeling begins.


The Dataset

The dataset contains diagnostic measurements from female patients of Pima Indian heritage, used to predict diabetes onset.

Feature                  | Description                                  | Units
Pregnancies              | Number of pregnancies                        | count
Glucose                  | Plasma glucose concentration (2-hour OGTT)   | mg/dL
BloodPressure            | Diastolic blood pressure                     | mm Hg
SkinThickness            | Triceps skin fold thickness                  | mm
Insulin                  | 2-hour serum insulin                         | μU/mL
BMI                      | Body mass index                              | kg/m²
DiabetesPedigreeFunction | Genetic risk score                           | score
Age                      | Age                                          | years
Outcome                  | Diabetes diagnosis (1 = yes, 0 = no)         | binary

Instructions

Part 1: Data Loading & Initial Exploration (20 points)

1.1 Load and Inspect (8 pts)

1.2 Identify Data Quality Issues (12 pts)
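
One check that often matters for this kind of clinical data is whether zeros appear in measurements that cannot physiologically be zero, since such zeros usually stand in for missing values. The following is a minimal sketch under stated assumptions: it uses pandas and a small synthetic DataFrame rather than the course data, and the helper names `count_implausible_zeros` and `zeros_to_nan`, as well as the column list, are hypothetical and not part of the starter code.

```python
import numpy as np
import pandas as pd

# Assumed list of columns where a value of 0 is not physiologically plausible.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_implausible_zeros(df, columns=ZERO_AS_MISSING):
    """Return the number of zero entries per column, for columns present in df."""
    present = [c for c in columns if c in df.columns]
    return (df[present] == 0).sum()


def zeros_to_nan(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with implausible zeros replaced by NaN."""
    out = df.copy()
    present = [c for c in columns if c in out.columns]
    out[present] = out[present].replace(0, np.nan)
    return out


# Synthetic rows for illustration only (not real patient data).
toy = pd.DataFrame(
    {
        "Pregnancies": [0, 2, 1],   # zero is a valid count here
        "Glucose": [148, 0, 85],
        "Insulin": [0, 0, 94],
        "BMI": [33.6, 26.6, 0.0],
    }
)
zero_counts = count_implausible_zeros(toy)
cleaned = zeros_to_nan(toy)
print(zero_counts)
print(cleaned.isna().sum())
```

Pregnancies is deliberately left out of the list because zero is a legitimate value for a count. Whether every zero in the other columns is truly missing is a judgment call that should be justified in your write-up.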


Part 2: Missing Data Analysis (25 points)

2.1 Visualize Missing Patterns (10 pts)

2.2 Compare Missing vs. Non-Missing (10 pts)
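
One way to approach this comparison is to group rows by whether a feature is missing and summarize the outcome in each group. The sketch below assumes missing values are already encoded as NaN and uses synthetic data; the function name `compare_by_missingness` is hypothetical.

```python
import numpy as np
import pandas as pd


def compare_by_missingness(df, column, target="Outcome"):
    """Summarize the target for rows where `column` is missing vs. observed."""
    status = pd.Series(
        np.where(df[column].isna(), "missing", "observed"),
        index=df.index,
        name=f"{column}_status",
    )
    # n counts rows with a non-null target; positive_rate is the mean of a 0/1 target.
    return df.groupby(status)[target].agg(n="count", positive_rate="mean")


# Synthetic rows for illustration only (not real patient data).
toy = pd.DataFrame(
    {
        "Insulin": [np.nan, 94.0, np.nan, 168.0, np.nan, 88.0],
        "Outcome": [1, 0, 1, 1, 0, 0],
    }
)
summary = compare_by_missingness(toy, "Insulin")
print(summary)
```

A clear difference in outcome rate between the two groups would suggest the values are not missing completely at random, which bears on the choice of imputation strategy.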

2.3 Imputation Strategy (5 pts)


Part 3: Visualization & Distributions (25 points)

3.1 Univariate Distributions (10 pts)

Create a figure showing the distribution of each feature:

3.2 Correlation Analysis (8 pts)

3.3 Clinical Visualizations (7 pts)

Create at least two clinically meaningful visualizations:


Part 4: Feature Engineering (20 points)

4.1 Clinical Categories (8 pts)

Create categorical features based on clinical thresholds:
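
As an illustration of threshold-based binning, the sketch below maps BMI to categories with `pd.cut`. The cut-offs used here are the standard WHO adult BMI ranges and are an assumption; the thresholds required by this assignment may differ. The function name `add_bmi_category` and the new column name `BMI_category` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed cut-offs: standard WHO adult BMI ranges (kg/m^2).
BMI_BINS = [0, 18.5, 25.0, 30.0, np.inf]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def add_bmi_category(df):
    """Return a copy of df with a categorical BMI_category column."""
    out = df.copy()
    # right=False makes each bin closed on the left: [18.5, 25), [25, 30), ...
    out["BMI_category"] = pd.cut(
        out["BMI"], bins=BMI_BINS, labels=BMI_LABELS, right=False
    )
    return out


# Synthetic values for illustration only.
toy = pd.DataFrame({"BMI": [17.0, 22.4, 25.0, 33.6, np.nan]})
with_cat = add_bmi_category(toy)
print(with_cat)
```

Missing BMI values stay missing in the new column. A BMI recorded as 0 would fall into the lowest bin, so placeholder zeros should be handled before binning.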

4.2 Derived Features (7 pts)

Create at least two new features that might be predictive:

4.3 Feature Summary (5 pts)


Part 5: Class Imbalance Analysis (10 points)

5.1 Quantify Imbalance (4 pts)
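
Class imbalance can be quantified with per-class counts, proportions, and the ratio of the majority to the minority class. The sketch below is a minimal example on a synthetic label vector; the function names `class_balance` and `imbalance_ratio` are hypothetical.

```python
import pandas as pd


def class_balance(y):
    """Return per-class counts and proportions, sorted by class label."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


def imbalance_ratio(y):
    """Return majority-class count divided by minority-class count."""
    counts = y.value_counts()
    return counts.max() / counts.min()


# Synthetic labels for illustration only: seven negatives, three positives.
toy_outcome = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 0], name="Outcome")
balance = class_balance(toy_outcome)
ratio = imbalance_ratio(toy_outcome)
print(balance)
print(f"imbalance ratio: {ratio:.2f}")
```

The majority-class proportion is also the accuracy of a model that always predicts the majority class, which makes it a useful baseline when discussing evaluation metrics.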

5.2 Implications for Modeling (6 pts)

Answer these questions in code comments:


Submission via GitHub

  1. Complete your work in hw3_eda.py
  2. Save your figures to the outputs/ directory
  3. Commit your changes with meaningful messages
  4. Push to GitHub before the deadline

Deliverables

Your repository should contain:


Grading Rubric

Component                                  | Points
Part 1: Data Loading & Initial Exploration | 20
  1.1 Load and inspect                     | 8
  1.2 Identify data quality issues         | 12
Part 2: Missing Data Analysis              | 25
  2.1 Visualize missing patterns           | 10
  2.2 Compare missing vs. non-missing      | 10
  2.3 Imputation strategy                  | 5
Part 3: Visualization & Distributions      | 25
  3.1 Univariate distributions             | 10
  3.2 Correlation analysis                 | 8
  3.3 Clinical visualizations              | 7
Part 4: Feature Engineering                | 20
  4.1 Clinical categories                  | 8
  4.2 Derived features                     | 7
  4.3 Feature summary                      | 5
Part 5: Class Imbalance                    | 10
  5.1 Quantify imbalance                   | 4
  5.2 Implications for modeling            | 6
Subtotal                                   | 100
Git Workflow                               |
  Multiple meaningful commits              | -5 if missing
  Clear commit messages                    | -5 if missing

Resources


Tips