Classification is one of the most common and useful tasks in machine learning. You have data about something — a patient, an email, a customer, a flower — and you want your model to assign it to the correct category. Spam or not spam. Sick or healthy. Will churn or will stay.

This guide walks through a complete classification project from start to finish. We will use the Breast Cancer dataset that is built directly into scikit-learn — no downloading required. Every step is explained clearly so you understand not just what to do but why you are doing it.

Download the Complete Python Script
The full classification project — data loading, preprocessing, three models and all evaluation metrics — ready to run in Jupyter or as a .py file.
Python script  ·  Full project  ·  Free

What Is Classification

In classification, the output is a discrete category rather than a continuous number. Instead of predicting "house price is $350,000" (that is regression), you predict "this tumour is malignant" or "this email is spam".

Classification comes in two forms. Binary classification means there are exactly two categories — yes or no, 0 or 1, spam or not spam. Multi-class classification means there are three or more categories — handwritten digits 0 through 9, species of flower, animal in a photo.

In this guide we will use binary classification on the Breast Cancer dataset. Given measurements taken from a tumour biopsy, the model predicts whether the tumour is malignant (cancerous) or benign (not cancerous).

ℹ️ Classification vs Regression reminder: if your output is a category (spam, not spam) use classification. If your output is a number (price, temperature, score) use regression. The algorithms are different but the overall pipeline — load, clean, engineer, train, evaluate — is almost identical.

The ML Pipeline

Every classification project follows the same steps in order. Rushing past any step causes problems later.

1. Load and explore the data: understand what you have — shape, feature types, target distribution, missing values.
2. Preprocess: fill missing values, encode text categories, scale numeric features.
3. Split into train and test: keep a clean test set the model never sees until the final evaluation.
4. Train multiple models: start simple (Logistic Regression), then try more powerful ones (Random Forest).
5. Evaluate thoroughly: use accuracy, precision, recall, F1, confusion matrix and ROC-AUC.
6. Cross-validate: check that results are consistent across different splits of the data.
7. Predict new samples: use the best model to classify unseen examples.

Setup and Loading the Data

Python — imports and loading the Breast Cancer dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)

# Load the dataset
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# target: 0 = malignant (cancerous), 1 = benign (not cancerous)

print(f"Dataset shape: {df.shape}")    # (569, 31)
print(f"Features: {df.shape[1] - 1}")  # 30 input features
print(f"Samples: {df.shape[0]}")       # 569 patients
print(df.head())
df.info()  # info() prints directly; no need to wrap it in print()

Exploring the Data

Before training anything, you need to understand your data. Two things matter most in classification EDA: is the target balanced (roughly equal samples per class), and which features differ most between the classes?

Checking Class Balance

Class imbalance means one category has far more examples than the other. For example, if 95% of emails are not spam and only 5% are spam, a model that always predicts "not spam" would score 95% accuracy — but it would be completely useless. Always check the class distribution before anything else.
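To see the trap concretely, scikit-learn's DummyClassifier implements exactly this always-predict-the-majority baseline. The 95/5 labels below are invented for illustration, not taken from the Breast Cancer data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented imbalanced labels: 95% "not spam" (0), 5% "spam" (1)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
preds = baseline.predict(X)

print(accuracy_score(y, preds))  # 0.95, looks impressive
print(recall_score(y, preds))    # 0.0, catches zero spam
```

The 95% accuracy hides a recall of zero on the minority class, which is why the metrics later in this guide look beyond accuracy.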

Python — checking class distribution
# Count samples in each class
class_counts = df['target'].value_counts()
print(class_counts)
# 1 (benign):    357  (62.7%)
# 0 (malignant): 212  (37.3%)
# Slightly imbalanced but acceptable — no special handling needed here

# sort_index() puts class 0 first so the bars match the tick labels below
# (value_counts() alone would put the larger benign class first)
plt.figure(figsize=(5, 4))
class_counts.sort_index().plot(kind='bar', color=['#f87171', '#4ade80'])
plt.xticks([0, 1], ['Malignant (0)', 'Benign (1)'], rotation=0)
plt.title('Class Distribution')
plt.ylabel('Number of samples')
plt.show()

Feature Distributions by Class

Box plots show you how each feature's values differ between the two classes. A feature where the two boxes barely overlap is a strong predictor — the model can use it to separate the classes easily.

Python — visualising which features separate the classes
# Plot the top 6 features most correlated with the target
top_features = (df.corr()['target'].abs()
                  .sort_values(ascending=False)[1:7]  # [0] is the target itself
                  .index.tolist())

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()
for i, feat in enumerate(top_features):
    df.boxplot(column=feat, by='target', ax=axes[i])
    axes[i].set_title(feat)
    axes[i].set_xlabel('0 = Malignant, 1 = Benign')
plt.suptitle('Top 6 Features by Class')
plt.tight_layout()
plt.show()

# Check missing values
print("Missing values:", df.isnull().sum().sum())  # 0 in this dataset

Preprocessing — Getting the Data Ready

Handling Missing Values

The Breast Cancer dataset is clean with no missing values. In real projects you will almost always have some. The same strategy applies as in regression — fill numeric columns with the median and categorical columns with the most frequent value.

Python — robust missing value handling for any dataset
from sklearn.impute import SimpleImputer

# Always check first
print(df.isnull().sum())

# If there are missing numeric values — fill with median
feature_cols = [c for c in df.columns if c != 'target']
imputer = SimpleImputer(strategy='median')
df[feature_cols] = imputer.fit_transform(df[feature_cols])

Encoding Categorical Features

All features in this dataset are already numeric (measurements from the biopsy). If your dataset has text categories like "neighbourhood" or "grade", convert them to numbers: use pd.get_dummies() for one-hot encoding, or an explicit mapping for ordinal categories where the order matters. Be careful with LabelEncoder on features: it assigns codes alphabetically, which usually does not match the real order.

Python — one-hot encoding for any text category columns
# If your dataset has text columns, encode them like this.
# Example: a "grade" column with values "low", "medium", "high"
df_example = pd.get_dummies(df_example, columns=['grade'], drop_first=True)

# Alternatively, for ordinal categories where order matters, map them
# explicitly so the numeric codes follow that order (low=1, medium=2, high=3).
# LabelEncoder would code them alphabetically (high=0, low=1, medium=2) —
# the wrong order.
df_example['grade'] = df_example['grade'].map({'low': 1, 'medium': 2, 'high': 3})

Feature Scaling

Logistic Regression learns its weights through iterative optimisation, which makes it sensitive to the scale of the features. If one feature ranges from 0 to 1 and another from 0 to 1000, the large-valued feature will dominate the learning process. Scaling puts every feature on a comparable scale.

⚠️ Fit the scaler on training data only. Always split first, then fit StandardScaler on training data only. Use that fitted scaler to transform both training and test sets. Never fit on test data — that is data leakage.

Train Test Split

We split the data into 80% for training and 20% for testing. The stratify=y parameter is especially important for classification — it ensures the train and test sets have the same class ratio as the full dataset. Without this, you might accidentally put most malignant samples in training and almost none in the test set.

Python — stratified train test split and scaling
X = df.drop('target', axis=1)
y = df['target']

# stratify=y ensures both sets have the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # keeps class ratio consistent in both sets
)

print(f"Train: {X_train.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")
print("Train class ratio:", y_train.value_counts(normalize=True).round(2).to_dict())
print("Test class ratio: ", y_test.value_counts(normalize=True).round(2).to_dict())

# Scale features — fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Logistic Regression — The Simple Baseline

Despite the name, Logistic Regression is a classification algorithm. It calculates the probability that a sample belongs to each class and picks the class with the higher probability. It works well when the relationship between features and class is roughly linear, and it is fast to train and easy to interpret.

Always start with Logistic Regression as your baseline. If it already performs well, you might not need anything more complex.

Python — Logistic Regression training and prediction
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)

lr_preds = lr.predict(X_test_scaled)
lr_probs = lr.predict_proba(X_test_scaled)[:, 1]  # probability of class 1 (benign)

print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_preds):.4f}")

# predict_proba gives you the confidence of each prediction.
# predict() picks the class with probability > 0.5 by default.
# You can move this threshold to trade false positives against false negatives.
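To make the threshold comment concrete, here is a small self-contained sketch with invented probabilities (not output from the real model). Since the probabilities here are P(benign), raising the benign threshold flags more borderline tumours as malignant for follow-up:

```python
import numpy as np

# Invented benign-probabilities for five tumours, as predict_proba would give
probs_benign = np.array([0.95, 0.70, 0.55, 0.30, 0.05])

# Default rule: predict benign (1) when P(benign) > 0.5
default_preds = (probs_benign > 0.5).astype(int)
print(default_preds)   # [1 1 1 0 0]

# Stricter rule: only call a tumour benign when at least 80% sure;
# everything else is flagged malignant (0) for follow-up
cautious_preds = (probs_benign >= 0.8).astype(int)
print(cautious_preds)  # [1 0 0 0 0]
```

The 0.55 and 0.70 cases flip from benign to malignant under the stricter rule: more false alarms, fewer missed cancers.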

Decision Tree — How Rules Are Learned

A Decision Tree learns a set of yes/no rules from the data. At each step it asks a question about one feature — "is the worst radius bigger than 16.8?" If yes, go left. If no, go right. It keeps splitting until it reaches a final prediction. The result is a tree of rules that is very easy to visualise and explain.

The downside is that Decision Trees tend to overfit. They memorise the training data too well and perform worse on new data. max_depth limits how deep the tree grows, which prevents overfitting.

Python — Decision Tree with depth control
dt = DecisionTreeClassifier(
    max_depth=5,          # limit depth to prevent overfitting
    min_samples_leaf=5,   # each leaf must have at least 5 samples
    random_state=42
)
dt.fit(X_train_scaled, y_train)

dt_preds = dt.predict(X_test_scaled)
dt_probs = dt.predict_proba(X_test_scaled)[:, 1]

print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_preds):.4f}")

# See the top rule in the tree
print("First split feature:", cancer.feature_names[dt.tree_.feature[0]])
print("Split threshold:    ", round(dt.tree_.threshold[0], 4))

Random Forest — Many Trees Voting Together

Random Forest builds hundreds of Decision Trees and takes the majority vote across all of them. Because each tree sees a different random sample of the data and a different random subset of features, the trees make different mistakes. When you average their votes, the mistakes cancel out and you get much better accuracy than any single tree.
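As a toy illustration of the voting step (the vote matrix below is invented, not output from a real forest), the majority vote is just the mean of the per-tree 0/1 predictions thresholded at 0.5:

```python
import numpy as np

# Invented 0/1 votes from 5 trees (rows) for 4 samples (columns)
tree_votes = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])

# Majority vote per sample: the mean vote crosses 0.5 when most trees say 1
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [1 0 0 1]
```

Notice that no single tree got every sample "right" in the same way, yet the ensemble settles on the majority answer for each column.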

Python — Random Forest classifier
rf = RandomForestClassifier(
    n_estimators=200,     # 200 trees
    max_depth=10,         # limit depth of each tree
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1             # use all CPU cores
)
rf.fit(X_train, y_train)  # Random Forest does not need scaling

rf_preds = rf.predict(X_test)
rf_probs = rf.predict_proba(X_test)[:, 1]

print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_preds):.4f}")

Evaluation Metrics — Beyond Accuracy

Accuracy alone is not enough. Imagine a cancer detector that calls everything benign — it would have 63% accuracy on this dataset but would miss every single cancer case. That is useless and dangerous.

For classification, especially medical or fraud detection, you need metrics that tell you about false positives and false negatives separately.

Confusion Matrix

A confusion matrix shows exactly where your model gets things right and wrong. For a binary classifier it is a 2x2 grid:

                     Predicted Malignant                   Predicted Benign
Actual Malignant     True Positive (TP)                    False Negative (FN) — missed cancer
Actual Benign        False Positive (FP) — false alarm     True Negative (TN)

Here the malignant class is treated as the "positive" class, since it is the case we most want to catch. (Note that scikit-learn's default positive label is class 1, which is benign in this dataset.)
Python — plotting the confusion matrix as a heatmap
def plot_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Malignant', 'Benign'],
                yticklabels=['Malignant', 'Benign'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)
    plt.show()

plot_confusion_matrix(y_test, rf_preds, 'Random Forest — Confusion Matrix')
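If you need the four counts as numbers rather than a picture, they can be read straight off the matrix with ravel(). A self-contained sketch with invented labels (malignant = 0 treated as the positive class, so the top-left cell holds the true positives):

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual, cols = predicted, label order [0, 1]
tp, fn, fp, tn = cm.ravel()  # with malignant (0) as the positive class

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=3 FN=1 FP=1 TN=3
```

The single FN here is the dangerous cell: an actual malignant sample predicted benign.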

Classification Report

The classification report gives you precision, recall and F1 for each class. These three metrics together tell the full story:

  • Precision — of all the samples the model called malignant, what fraction actually were? High precision means few false alarms.
  • Recall — of all the actual malignant samples, what fraction did the model catch? High recall means few missed cases.
  • F1 score — the harmonic mean of precision and recall. Use this when you want a single number that balances both.
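The harmonic mean is easy to check by hand. For example, with an assumed precision of 0.90 and recall of 0.60:

```python
precision, recall = 0.90, 0.60

# Harmonic mean: pulled toward the smaller of the two values
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.72, well below the arithmetic mean of 0.75
```

The result sits closer to the weaker score, which is exactly why F1 punishes a model that trades recall away for precision (or vice versa).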
Python — full classification report for all three models
def evaluate_model(name, y_true, y_pred):
    print(f"\n{'=' * 50}")
    print(name)
    print('=' * 50)
    # Note: the single-number scores below use scikit-learn's default
    # positive class (1 = benign); pass pos_label=0 to score the malignant
    # class instead. The detailed report shows both classes.
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
    print("\nDetailed report:")
    print(classification_report(y_true, y_pred,
                                target_names=['Malignant', 'Benign']))

evaluate_model('Logistic Regression', y_test, lr_preds)
evaluate_model('Decision Tree', y_test, dt_preds)
evaluate_model('Random Forest', y_test, rf_preds)

ROC Curve and AUC Score

The ROC curve shows how your model's true positive rate and false positive rate change as you move the decision threshold from 0 to 1. The AUC (Area Under the Curve) summarises the whole curve as a single number. An AUC of 1.0 is a perfect classifier. An AUC of 0.5 is no better than random guessing.

Python — ROC curves for all three models on one chart
plt.figure(figsize=(8, 6))
for name, probs in [
    ('Logistic Regression', lr_probs),
    ('Decision Tree', dt_probs),
    ('Random Forest', rf_probs)
]:
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc = roc_auc_score(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves — All Models')
plt.legend()
plt.show()

Cross Validation — More Honest Results

A single train/test split might be lucky or unlucky depending on which samples ended up in each set. Cross-validation splits the data into 5 (or 10) folds. The model trains on 4 folds and tests on the remaining 1, repeats this 5 times rotating the test fold, then averages all 5 scores. This gives a far more reliable estimate of real-world performance.

Python — 5-fold cross-validation for all three models
from sklearn.model_selection import cross_val_score

for name, model, X_data in [
    ('Logistic Regression', lr, X_train_scaled),
    ('Decision Tree', dt, X_train_scaled),
    ('Random Forest', rf, X_train)
]:
    scores = cross_val_score(
        model, X_data, y_train,
        cv=5,          # 5 folds
        scoring='f1'   # use F1 score
    )
    print(f"{name}:")
    print(f"  CV F1 scores: {scores.round(3)}")
    print(f"  Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")

# A high std means results are inconsistent across folds — the model is unreliable
# A low std means consistent performance — the model is stable
A small standard deviation is good. If cross-validation F1 scores are 0.97, 0.98, 0.96, 0.97, 0.98 — that is very consistent. If they are 0.72, 0.91, 0.85, 0.69, 0.94 — the model is unreliable and you should investigate why.
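One subtlety worth knowing: X_train_scaled above was produced by fitting the scaler on the whole training set, so each CV fold's validation part has already influenced the scaling statistics. Wrapping the scaler and model in a Pipeline refits the scaler inside every fold. A minimal self-contained sketch on synthetic data (make_classification and the parameter values are illustrative, not from this guide's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so this sketch runs on its own
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler is refit on the training folds only, inside every CV split,
# so no scaling statistics leak from the held-out fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(scores.round(3), scores.mean().round(3))
```

On a dataset this clean the difference is tiny, but the Pipeline habit keeps cross-validation honest on harder problems.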

Feature Importance

Random Forest tells you which features mattered most in making its decisions. This is valuable for understanding the problem and for deciding which features to keep or remove.

Python — top 10 most important features
importance_df = pd.DataFrame({
    'Feature': cancer.feature_names,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print(importance_df.head(10))

plt.figure(figsize=(10, 6))
top10 = importance_df.head(10)
plt.barh(top10['Feature'], top10['Importance'])
plt.title('Top 10 Most Important Features (Random Forest)')
plt.xlabel('Importance Score')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Predicting New Samples

Once the model is trained and you are happy with its performance, using it on new data is straightforward. You build the feature array in the same format and column order as the training data, and call predict().

Python — classifying a new unseen patient sample
def predict_sample(model, sample_dict, feature_names):
    """Classify one new sample. sample_dict must have all feature values."""
    sample_df = pd.DataFrame([sample_dict])[feature_names]
    prediction = model.predict(sample_df)[0]
    probability = model.predict_proba(sample_df)[0]

    label = 'Benign' if prediction == 1 else 'Malignant'
    print(f"Prediction: {label}")
    print(f"Confidence: {max(probability) * 100:.1f}%")
    print(f"P(Malignant): {probability[0] * 100:.1f}%")
    print(f"P(Benign):    {probability[1] * 100:.1f}%")
    return label, probability

# Take a real sample from the test set to verify
sample = X_test.iloc[0].to_dict()
print(f"Actual label: {'Benign' if y_test.iloc[0] == 1 else 'Malignant'}")
predict_sample(rf, sample, cancer.feature_names)

Model Comparison

Model                 Accuracy   F1       AUC      Needs Scaling   Interpretable
Logistic Regression   ~0.974     ~0.979   ~0.997   Yes             Very
Decision Tree         ~0.939     ~0.950   ~0.944   No              Very
Random Forest         ~0.965     ~0.971   ~0.998   No              Moderate
ℹ️ Logistic Regression wins here. On a clean, well-scaled dataset like this one, Logistic Regression often matches or beats more complex models. Always try the simple model first — complexity has a cost in compute time, memory and interpretability.

⚡ Key Takeaways
  • Classification predicts a discrete category rather than a number. Binary classification has two classes. Multi-class classification has three or more.
  • Always check class balance first. A heavily imbalanced dataset (95% one class) makes accuracy meaningless. Use F1, precision and recall instead.
  • Use stratify=y in train_test_split so both training and test sets have the same class ratio as the original data.
  • Fit the scaler on training data only. Fitting on the whole dataset leaks test information and gives inflated accuracy scores.
  • Logistic Regression is an excellent baseline and often competitive with more complex models on clean data. Always start here before moving to trees or forests.
  • Decision Trees are easy to interpret but overfit without regularisation. Limit them with max_depth and min_samples_leaf.
  • Random Forest typically beats a single Decision Tree by averaging over 200 trees. Each tree sees different data and features, so errors cancel out.
  • Accuracy alone is not enough. For medical or fraud detection problems, always look at precision (false alarms) and recall (missed cases) separately. F1 balances both.
  • A confusion matrix shows exactly where the model goes wrong. In cancer detection, false negatives (missed cancers) are far more costly than false positives.
  • The ROC-AUC score measures how well a model separates the classes across all possible thresholds. An AUC of 0.5 is random. An AUC of 1.0 is perfect.
  • Use cross-validation to get a reliable estimate of model performance. A single train/test split can be lucky or unlucky. Five-fold cross-validation gives a much more trustworthy result.