Classification is one of the most common and useful tasks in machine learning. You have data about something — a patient, an email, a customer, a flower — and you want your model to assign it to the correct category. Spam or not spam. Sick or healthy. Will churn or will stay.

This guide walks through a complete classification project from start to finish. We will use the Breast Cancer dataset that is built directly into scikit-learn — no downloading required. Every step is explained clearly so you understand not just what to do but why you are doing it.

Download the Complete Python Script
The full classification project — data loading, preprocessing, three models and all evaluation metrics — ready to run in Jupyter or as a .py file.
Python script  ·  Full project  ·  Free

What Is Classification

In classification, the output is a discrete category rather than a continuous number. Instead of predicting "house price is $350,000" (that is regression), you predict "this tumour is malignant" or "this email is spam".

Classification comes in two forms. Binary classification means there are exactly two categories — yes or no, 0 or 1, spam or not spam. Multi-class classification means there are three or more categories — handwritten digits 0 through 9, species of flower, animal in a photo.

In this guide we will use binary classification on the Breast Cancer dataset. Given measurements taken from a tumour biopsy, the model predicts whether the tumour is malignant (cancerous) or benign (not cancerous).

ℹ️ Classification vs Regression reminder: if your output is a category (spam, not spam) use classification. If your output is a number (price, temperature, score) use regression. The algorithms are different but the overall pipeline — load, clean, engineer, train, evaluate — is almost identical.

The ML Pipeline

Every classification project follows the same steps in order. Rushing past any step causes problems later.

1. Load and explore the data: understand what you have — shape, feature types, target distribution, missing values.
2. Preprocess: fill missing values, encode text categories, scale numeric features.
3. Split into train and test: keep a clean test set the model never sees until the final evaluation.
4. Train multiple models: start simple (Logistic Regression), then try more powerful ones (Random Forest).
5. Evaluate thoroughly: use accuracy, precision, recall, F1, confusion matrix and ROC-AUC.
6. Cross-validate: check that results are consistent across different splits of the data.
7. Predict new samples: use the best model to classify unseen examples.

Setup and Loading the Data

Python — imports and loading the Breast Cancer dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)

# Load the dataset
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# target: 0 = malignant (cancerous), 1 = benign (not cancerous)

print(f"Dataset shape: {df.shape}")    # (569, 31)
print(f"Features: {df.shape[1] - 1}")  # 30 input features
print(f"Samples: {df.shape[0]}")       # 569 patients
print(df.head())
df.info()  # info() prints directly; no need to wrap it in print()

Exploring the Data

Before training anything, you need to understand your data. Two things matter most in classification EDA: is the target balanced (roughly equal samples per class), and which features differ most between the classes?

Checking Class Balance

Class imbalance means one category has far more examples than the other. For example, if 95% of emails are not spam and only 5% are spam, a model that always predicts "not spam" would score 95% accuracy — but it would be completely useless. Always check the class distribution before anything else.
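To see the trap concretely, scikit-learn's DummyClassifier implements exactly this always-predict-the-majority baseline. The 95/5 labels below are invented for illustration, not taken from the Breast Cancer data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented imbalanced labels: 95% "not spam" (0), 5% "spam" (1)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
preds = baseline.predict(X)

print(accuracy_score(y, preds))  # 0.95, looks impressive
print(recall_score(y, preds))    # 0.0, catches zero spam
```

The 95% accuracy hides a recall of zero on the minority class, which is why the metrics later in this guide look beyond accuracy.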

Python — checking class distribution
# Count samples in each class
class_counts = df['target'].value_counts()
print(class_counts)
# 1 (benign):    357  (62.7%)
# 0 (malignant): 212  (37.3%)
# Slightly imbalanced but acceptable — no special handling needed here

# sort_index() puts class 0 first so the bars match the tick labels below
# (value_counts() alone would put the larger benign class first)
plt.figure(figsize=(5, 4))
class_counts.sort_index().plot(kind='bar', color=['#f87171', '#4ade80'])
plt.xticks([0, 1], ['Malignant (0)', 'Benign (1)'], rotation=0)
plt.title('Class Distribution')
plt.ylabel('Number of samples')
plt.show()

Feature Distributions by Class

Box plots show you how each feature's values differ between the two classes. A feature where the two boxes barely overlap is a strong predictor — the model can use it to separate the classes easily.

Python — visualising which features separate the classes
# Plot the top 6 features most correlated with the target
top_features = (df.corr()['target'].abs()
                  .sort_values(ascending=False)[1:7]  # [0] is the target itself
                  .index.tolist())

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()
for i, feat in enumerate(top_features):
    df.boxplot(column=feat, by='target', ax=axes[i])
    axes[i].set_title(feat)
    axes[i].set_xlabel('0 = Malignant, 1 = Benign')
plt.suptitle('Top 6 Features by Class')
plt.tight_layout()
plt.show()

# Check missing values
print("Missing values:", df.isnull().sum().sum())  # 0 in this dataset

Preprocessing — Getting the Data Ready

Handling Missing Values

The Breast Cancer dataset is clean with no missing values. In real projects you will almost always have some. The same strategy applies as in regression — fill numeric columns with the median and categorical columns with the most frequent value.

Python — robust missing value handling for any dataset
from sklearn.impute import SimpleImputer

# Always check first
print(df.isnull().sum())

# If there are missing numeric values — fill with median
feature_cols = [c for c in df.columns if c != 'target']
imputer = SimpleImputer(strategy='median')
df[feature_cols] = imputer.fit_transform(df[feature_cols])

Encoding Categorical Features

All features in this dataset are already numeric (measurements from the biopsy). If your dataset has text categories like "neighbourhood" or "grade", convert them to numbers: use pd.get_dummies() for one-hot encoding, or an explicit mapping for ordinal categories where the order matters. Be careful with LabelEncoder on features: it assigns codes alphabetically, which usually does not match the real order.

Python — one-hot encoding for any text category columns
# If your dataset has text columns, encode them like this.
# Example: a "grade" column with values "low", "medium", "high"
df_example = pd.get_dummies(df_example, columns=['grade'], drop_first=True)

# Alternatively, for ordinal categories where order matters, map them
# explicitly so the numeric codes follow that order (low=1, medium=2, high=3).
# LabelEncoder would code them alphabetically (high=0, low=1, medium=2) —
# the wrong order.
df_example['grade'] = df_example['grade'].map({'low': 1, 'medium': 2, 'high': 3})

Feature Scaling

Logistic Regression learns its weights through iterative optimisation, which makes it sensitive to the scale of the features. If one feature ranges from 0 to 1 and another from 0 to 1000, the large-valued feature will dominate the learning process. Scaling puts every feature on a comparable scale.

⚠️ Fit the scaler on training data only. Always split first, then fit StandardScaler on training data only. Use that fitted scaler to transform both training and test sets. Never fit on test data — that is data leakage.

Train Test Split

We split the data into 80% for training and 20% for testing. The stratify=y parameter is especially important for classification — it ensures the train and test sets have the same class ratio as the full dataset. Without this, you might accidentally put most malignant samples in training and almost none in the test set.

Python — stratified train test split and scaling
X = df.drop('target', axis=1)
y = df['target']

# stratify=y ensures both sets have the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # keeps class ratio consistent in both sets
)

print(f"Train: {X_train.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")
print("Train class ratio:", y_train.value_counts(normalize=True).round(2).to_dict())
print("Test class ratio: ", y_test.value_counts(normalize=True).round(2).to_dict())

# Scale features — fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Logistic Regression — The Simple Baseline

Despite the name, Logistic Regression is a classification algorithm. It calculates the probability that a sample belongs to each class and picks the class with the higher probability. It works well when the relationship between features and class is roughly linear, and it is fast to train and easy to interpret.

Always start with Logistic Regression as your baseline. If it already performs well, you might not need anything more complex.

Python — Logistic Regression training and prediction
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)

lr_preds = lr.predict(X_test_scaled)
lr_probs = lr.predict_proba(X_test_scaled)[:, 1]  # probability of class 1 (benign)

print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_preds):.4f}")

# predict_proba gives you the confidence of each prediction.
# predict() picks the class with probability > 0.5 by default.
# You can move this threshold to trade false positives against false negatives.
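To make the threshold comment concrete, here is a small self-contained sketch with invented probabilities (not output from the real model). Since the probabilities here are P(benign), raising the benign threshold flags more borderline tumours as malignant for follow-up:

```python
import numpy as np

# Invented benign-probabilities for five tumours, as predict_proba would give
probs_benign = np.array([0.95, 0.70, 0.55, 0.30, 0.05])

# Default rule: predict benign (1) when P(benign) > 0.5
default_preds = (probs_benign > 0.5).astype(int)
print(default_preds)   # [1 1 1 0 0]

# Stricter rule: only call a tumour benign when at least 80% sure;
# everything else is flagged malignant (0) for follow-up
cautious_preds = (probs_benign >= 0.8).astype(int)
print(cautious_preds)  # [1 0 0 0 0]
```

The 0.55 and 0.70 cases flip from benign to malignant under the stricter rule: more false alarms, fewer missed cancers.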

Decision Tree — How Rules Are Learned

A Decision Tree learns a set of yes/no rules from the data. At each step it asks a question about one feature — "is the worst radius bigger than 16.8?" If yes, go left. If no, go right. It keeps splitting until it reaches a final prediction. The result is a tree of rules that is very easy to visualise and explain.

The downside is that Decision Trees tend to overfit. They memorise the training data too well and perform worse on new data. max_depth limits how deep the tree grows, which prevents overfitting.

Python — Decision Tree with depth control
dt = DecisionTreeClassifier(
    max_depth=5,          # limit depth to prevent overfitting
    min_samples_leaf=5,   # each leaf must have at least 5 samples
    random_state=42
)
dt.fit(X_train_scaled, y_train)

dt_preds = dt.predict(X_test_scaled)
dt_probs = dt.predict_proba(X_test_scaled)[:, 1]

print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_preds):.4f}")

# See the top rule in the tree
print("First split feature:", cancer.feature_names[dt.tree_.feature[0]])
print("Split threshold:    ", round(dt.tree_.threshold[0], 4))

Random Forest — Many Trees Voting Together

Random Forest builds hundreds of Decision Trees and takes the majority vote across all of them. Because each tree sees a different random sample of the data and a different random subset of features, the trees make different mistakes. When you average their votes, the mistakes cancel out and you get much better accuracy than any single tree.
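As a toy illustration of the voting step (the vote matrix below is invented, not output from a real forest), the majority vote is just the mean of the per-tree 0/1 predictions thresholded at 0.5:

```python
import numpy as np

# Invented 0/1 votes from 5 trees (rows) for 4 samples (columns)
tree_votes = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])

# Majority vote per sample: the mean vote crosses 0.5 when most trees say 1
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [1 0 0 1]
```

Notice that no single tree got every sample "right" in the same way, yet the ensemble settles on the majority answer for each column.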

Python — Random Forest classifier
rf = RandomForestClassifier(
    n_estimators=200,     # 200 trees
    max_depth=10,         # limit depth of each tree
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1             # use all CPU cores
)
rf.fit(X_train, y_train)  # Random Forest does not need scaling

rf_preds = rf.predict(X_test)
rf_probs = rf.predict_proba(X_test)[:, 1]

print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_preds):.4f}")

Evaluation Metrics — Beyond Accuracy

Accuracy alone is not enough. Imagine a cancer detector that calls everything benign — it would have 63% accuracy on this dataset but would miss every single cancer case. That is useless and dangerous.

For classification, especially medical or fraud detection, you need metrics that tell you about false positives and false negatives separately.

Confusion Matrix

A confusion matrix shows exactly where your model gets things right and wrong. For a binary classifier it is a 2x2 grid:

                     Predicted Malignant                   Predicted Benign
Actual Malignant     True Positive (TP)                    False Negative (FN) — missed cancer
Actual Benign        False Positive (FP) — false alarm     True Negative (TN)

Here the malignant class is treated as the "positive" class, since it is the case we most want to catch. (Note that scikit-learn's default positive label is class 1, which is benign in this dataset.)
Python — plotting the confusion matrix as a heatmap
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Malignant', 'Benign'],
                yticklabels=['Malignant', 'Benign'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()

plot_confusion_matrix(y_test, rf_preds, 'Random Forest — Confusion Matrix')
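If you need the four counts as numbers rather than a picture, they can be read straight off the matrix with ravel(). A self-contained sketch with invented labels (malignant = 0 treated as the positive class, so the top-left cell holds the true positives):

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual, cols = predicted, label order [0, 1]
tp, fn, fp, tn = cm.ravel()  # with malignant (0) as the positive class

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=3 FN=1 FP=1 TN=3
```

The single FN here is the dangerous cell: an actual malignant sample predicted benign.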

Classification Report

The classification report gives you precision, recall and F1 for each class. These three metrics together tell the full story:

  • Precision — of all the samples the model called malignant, what fraction actually were? High precision means few false alarms.
  • Recall — of all the actual malignant samples, what fraction did the model catch? High recall means few missed cases.
  • F1 score — the harmonic mean of precision and recall. Use this when you want a single number that balances both.
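The harmonic mean is easy to check by hand. For example, with an assumed precision of 0.90 and recall of 0.60:

```python
precision, recall = 0.90, 0.60

# Harmonic mean: pulled toward the smaller of the two values
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.72, well below the arithmetic mean of 0.75
```

The result sits closer to the weaker score, which is exactly why F1 punishes a model that trades recall away for precision (or vice versa).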
Python — full classification report for all three models
def evaluate_model(name, y_true, y_pred):
    print(f"\n{'=' * 50}")
    print(name)
    print('=' * 50)
    # Note: the single-number scores below use scikit-learn's default
    # positive class (1 = benign); pass pos_label=0 to score the malignant
    # class instead. The detailed report shows both classes.
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
    print("\nDetailed report:")
    print(classification_report(y_true, y_pred,
                                target_names=['Malignant', 'Benign']))

evaluate_model('Logistic Regression', y_test, lr_preds)
evaluate_model('Decision Tree', y_test, dt_preds)
evaluate_model('Random Forest', y_test, rf_preds)

ROC Curve and AUC Score

The ROC curve shows how your model's true positive rate and false positive rate change as you move the decision threshold from 0 to 1. The AUC (Area Under the Curve) summarises the whole curve as a single number. An AUC of 1.0 is a perfect classifier. An AUC of 0.5 is no better than random guessing.

Python — ROC curves for all three models on one chart
plt.figure(figsize=(8, 6))
for name, probs in [
    ('Logistic Regression', lr_probs),
    ('Decision Tree', dt_probs),
    ('Random Forest', rf_probs)
]:
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc = roc_auc_score(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves — All Models')
plt.legend()
plt.show()

Cross Validation — More Honest Results

A single train/test split might be lucky or unlucky depending on which samples ended up in each set. Cross-validation splits the data into 5 (or 10) folds. The model trains on 4 folds and tests on the remaining 1, repeats this 5 times rotating the test fold, then averages all 5 scores. This gives a far more reliable estimate of real-world performance.

Python — 5-fold cross-validation for all three models
from sklearn.model_selection import cross_val_score

for name, model, X_data in [
    ('Logistic Regression', lr, X_train_scaled),
    ('Decision Tree', dt, X_train_scaled),
    ('Random Forest', rf, X_train)
]:
    scores = cross_val_score(
        model, X_data, y_train,
        cv=5,          # 5 folds
        scoring='f1'   # use F1 score
    )
    print(f"{name}:")
    print(f"  CV F1 scores: {scores.round(3)}")
    print(f"  Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")

# A high std means results are inconsistent across folds — the model is unreliable
# A low std means consistent performance — the model is stable
A small standard deviation is good. If cross-validation F1 scores are 0.97, 0.98, 0.96, 0.97, 0.98 — that is very consistent. If they are 0.72, 0.91, 0.85, 0.69, 0.94 — the model is unreliable and you should investigate why.
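One subtlety worth knowing: X_train_scaled above was produced by fitting the scaler on the whole training set, so each CV fold's validation part has already influenced the scaling statistics. Wrapping the scaler and model in a Pipeline refits the scaler inside every fold. A minimal self-contained sketch on synthetic data (make_classification and the parameter values are illustrative, not from this guide's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so this sketch runs on its own
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler is refit on the training folds only, inside every CV split,
# so no scaling statistics leak from the held-out fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(scores.round(3), scores.mean().round(3))
```

On a dataset this clean the difference is tiny, but the Pipeline habit keeps cross-validation honest on harder problems.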

Feature Importance

Random Forest tells you which features mattered most in making its decisions. This is valuable for understanding the problem and for deciding which features to keep or remove.

Python — top 10 most important features
importance_df = pd.DataFrame({
    'Feature': cancer.feature_names,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print(importance_df.head(10))

plt.figure(figsize=(10, 6))
top10 = importance_df.head(10)
plt.barh(top10['Feature'], top10['Importance'])
plt.title('Top 10 Most Important Features (Random Forest)')
plt.xlabel('Importance Score')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Predicting New Samples

Once the model is trained and you are happy with its performance, using it on new data is straightforward. You build the feature array in the same format and column order as the training data, and call predict().

Python — classifying a new unseen patient sample
def predict_sample(model, sample_dict, feature_names):
    """Classify one new sample. sample_dict must have all feature values."""
    sample_df = pd.DataFrame([sample_dict])[feature_names]
    prediction = model.predict(sample_df)[0]
    probability = model.predict_proba(sample_df)[0]

    label = 'Benign' if prediction == 1 else 'Malignant'
    print(f"Prediction: {label}")
    print(f"Confidence: {max(probability) * 100:.1f}%")
    print(f"P(Malignant): {probability[0] * 100:.1f}%")
    print(f"P(Benign):    {probability[1] * 100:.1f}%")
    return label, probability

# Take a real sample from the test set to verify
sample = X_test.iloc[0].to_dict()
print(f"Actual label: {'Benign' if y_test.iloc[0] == 1 else 'Malignant'}")
predict_sample(rf, sample, cancer.feature_names)

Model Comparison

Model                 Accuracy   F1       AUC      Needs Scaling   Interpretable
Logistic Regression   ~0.974     ~0.979   ~0.997   Yes             Very
Decision Tree         ~0.939     ~0.950   ~0.944   No              Very
Random Forest         ~0.965     ~0.971   ~0.998   No              Moderate
ℹ️ Logistic Regression wins here. On a clean, well-scaled dataset like this one, Logistic Regression often matches or beats more complex models. Always try the simple model first — complexity has a cost in compute time, memory and interpretability.

⚡ Key Takeaways
  • Classification predicts a discrete category rather than a number. Binary classification has two classes. Multi-class classification has three or more.
  • Always check class balance first. A heavily imbalanced dataset (95% one class) makes accuracy meaningless. Use F1, precision and recall instead.
  • Use stratify=y in train_test_split so both training and test sets have the same class ratio as the original data.
  • Fit the scaler on training data only. Fitting on the whole dataset leaks test information and gives inflated accuracy scores.
  • Logistic Regression is an excellent baseline and often competitive with more complex models on clean data. Always start here before moving to trees or forests.
  • Decision Trees are easy to interpret but overfit without regularisation. Limit them with max_depth and min_samples_leaf.
  • Random Forest typically beats a single Decision Tree by averaging over 200 trees. Each tree sees different data and features, so errors cancel out.
  • Accuracy alone is not enough. For medical or fraud detection problems, always look at precision (false alarms) and recall (missed cases) separately. F1 balances both.
  • A confusion matrix shows exactly where the model goes wrong. In cancer detection, false negatives (missed cancers) are far more costly than false positives.
  • The ROC-AUC score measures how well a model separates the classes across all possible thresholds. An AUC of 0.5 is random. An AUC of 1.0 is perfect.
  • Use cross-validation to get a reliable estimate of model performance. A single train/test split can be lucky or unlucky. Five-fold cross-validation gives a much more trustworthy result.