Binary Classification Via The Breast Cancer Dataset¶

The Breast Cancer Wisconsin (Diagnostic) Dataset is a widely used dataset in the field of machine learning and medical research. It contains features computed from digitized images of fine needle aspirates (FNAs) of breast masses. The dataset includes information about the characteristics of cell nuclei present in the images.

Key Features of the Dataset¶

Number of Instances: 569
Number of Attributes: 30 numeric features
Target Variable: Binary classification (malignant or benign)
Features: Measurements such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension

Attributes Information¶

The features are computed for each cell nucleus. For each of the ten measurements, the dataset includes the mean, the standard error, and the "worst" value (the mean of the three largest values), giving 30 features in total.
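The three-way grouping can be seen directly in the column names. This minimal sketch assumes the scikit-learn copy of the dataset and its naming convention (`mean <name>`, `<name> error`, `worst <name>`):

```python
from sklearn.datasets import load_breast_cancer

# Column names as shipped with scikit-learn's copy of the dataset
feature_names = [str(name) for name in load_breast_cancer().feature_names]

# Group the columns by the statistic they report
mean_cols = [n for n in feature_names if n.startswith("mean ")]
error_cols = [n for n in feature_names if n.endswith(" error")]
worst_cols = [n for n in feature_names if n.startswith("worst ")]

print(len(mean_cols), len(error_cols), len(worst_cols))
```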

Target Variable¶

The target variable indicates the diagnosis of the breast mass. In the scikit-learn copy of the dataset used in this notebook, 0 represents malignant and 1 represents benign.

This dataset is often used for classification tasks to build models that can predict whether a breast tumor is malignant or benign based on the given features.
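Label encodings differ between copies of this dataset, so it is worth checking the mapping before interpreting any metric. This short sketch uses scikit-learn's `load_breast_cancer` to print the label-to-name mapping and the number of samples per class:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by label: index 0 names label 0, index 1 names label 1
label_to_name = {label: str(name) for label, name in enumerate(data.target_names)}
print(label_to_name)

# Number of samples per label
class_counts = np.bincount(data.target)
print(dict(zip(label_to_name.values(), class_counts.tolist())))
```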

Example Usage in Machine Learning¶

Classification Algorithms: It can be used to train a variety of classification algorithms, including logistic regression, decision trees, random forests, and support vector machines.
Feature Selection: The dataset is also useful for performing feature selection to identify which features are most important for predicting breast cancer.
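As one illustration of feature selection, the sketch below ranks features with a univariate ANOVA F-test using scikit-learn's `SelectKBest`. The choice of `f_classif` and `k=5` is an assumption made for the example, not a recommendation from the dataset authors.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Score every feature with a one-way ANOVA F-test and keep the top 5
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

Univariate scores ignore interactions between features, so the selected subset is a starting point rather than a final answer.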

The dataset is available in the scikit-learn library, making it easy to load and use in Python for various machine learning tasks.

Model Tuner Library Instructions¶

This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.

Model Tuner Description¶

The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.

Key Features¶

Automatic Hyperparameter Tuning

The library can automatically tune hyperparameters for a variety of machine learning models using advanced optimization techniques.
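The general idea of an exhaustive hyperparameter search can be sketched with scikit-learn's `GridSearchCV`. This is a stand-in to illustrate the concept, not the internal implementation of model_tuner; the estimator and grid values are chosen only for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate values for two hyperparameters (2 x 2 = 4 combinations)
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```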

Cross-Validation

Integrated cross-validation evaluates each model on data it was not trained on, which gives a more reliable performance estimate and helps detect overfitting.
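For readers new to cross-validation, here is a minimal sketch using plain scikit-learn rather than model_tuner. The logistic regression pipeline and the five stratified folds are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(scores.round(3))
print(round(scores.mean(), 3))
```

Keeping preprocessing inside the pipeline avoids leaking information from the held-out fold into the training step.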

Documentation¶

For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.

The steps below cover installing and using the model_tuner library in a notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.

Installation¶

To install the model_tuner library, use the following command:

! pip install model_tuner

Collecting model_tuner
  Downloading model_tuner-0.0.15a0-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: joblib>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.4.2)
Requirement already satisfied: numpy>=1.21.6 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.26.4)
Requirement already satisfied: pandas>=1.3.5 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.3.2)
Requirement already satisfied: scipy>=1.7.3 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.13.1)
Requirement already satisfied: tqdm>=4.66.4 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (4.66.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0.2->model_tuner) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.3.5->model_tuner) (1.16.0)
Downloading model_tuner-0.0.15a0-py3-none-any.whl (20 kB)
Installing collected packages: model_tuner
Successfully installed model_tuner-0.0.15a0

Importing the Library¶

After installation, you can import the necessary components from the model_tuner library as shown below:

import model_tuner # import model_tuner to show version info.
from model_tuner import Model # Model class from model_tuner lib.
from sklearn.impute import SimpleImputer # for imputing missing values
from sklearn.preprocessing import StandardScaler # for feature scaling

Checking the Version¶

To confirm that the model_tuner library is installed correctly, display its help text, which includes the version number:

print(help(model_tuner))

Help on package model_tuner:

NAME
    model_tuner

DESCRIPTION
    The `model_tuner` library is a versatile and powerful tool designed to 
    facilitate the training, evaluation, and tuning of machine learning models. 
    It supports various functionalities such as handling imbalanced data, applying 
    different scaling and imputation techniques, calibrating models, and conducting 
    cross-validation. This library is particularly useful for model selection, 
    hyperparameter tuning, and ensuring optimal performance across different metrics.
    
    Version: 0.0.15a

PACKAGE CONTENTS
    bootstrapper
    main
    model_tuner_utils
    pickleObjects

DATA
    __email__ = 'lshpaner@ucla.edu; alafunnell@gmail.com; pp89@ucla.edu'

VERSION
    0.0.15a

AUTHOR
    Arthur Funnell, Leonid Shpaner, Panayiotis Petousis

FILE
    /usr/local/lib/python3.10/dist-packages/model_tuner/__init__.py


None
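A more compact way to read only the version is through the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the sketch assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print(installed_version("model_tuner"))
```

The help output above also shows a VERSION section, which Python's help system takes from the package's `__version__` attribute, so `model_tuner.__version__` is another option.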

Import Requisite Libraries¶

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
import xgboost as xgb

Load the dataset, define X, y¶

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X.head() # inspect the first 5 rows of data

y.head()

# Ensure y is one-dimensional (a no-op here, since y is already a Series)
if isinstance(y, pd.DataFrame):
    y = y.squeeze()

Create an Instance of the XGBClassifier¶

# Creating an instance of the XGBClassifier
xgb_model = xgb.XGBClassifier(
    random_state=222,
)

Define Hyperparameters for XGBoost¶

# Estimator name prefix for use in GridSearchCV or similar tools
estimator_name_xgb = "xgb"

# Define the hyperparameters for XGBoost
xgb_learning_rates = [0.1, 0.01, 0.05]  # Learning rate or eta
xgb_n_estimators = [100, 200, 300]  # Number of trees. Equivalent to n_estimators in GB
xgb_max_depths = [3, 5, 7]  # Maximum depth of the trees
xgb_subsamples = [0.8, 1.0]  # Subsample ratio of the training instances
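xgb_colsample_bytree = [0.8, 1.0]  # Subsample ratio of columns when constructing each tree

xgb_eval_metric = ["logloss"]  # Metric evaluated on the validation data during training
xgb_early_stopping_rounds = [10]  # Stop after 10 rounds without improvement
xgb_verbose = [False]  # Suppress per-round evaluation output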

# Combining the hyperparameters in a dictionary
xgb_pipeline_hyperparms_grid = {
    "xgb__learning_rate": xgb_learning_rates,
    "xgb__n_estimators": xgb_n_estimators,
    "xgb__max_depth": xgb_max_depths,
    "xgb__subsample": xgb_subsamples,
    "xgb__colsample_bytree": xgb_colsample_bytree,
    "xgb__eval_metric": xgb_eval_metric,
    "xgb__early_stopping_rounds": xgb_early_stopping_rounds,
    "xgb__verbose": xgb_verbose,
    "selectKBest__k": [5, 10, 20],
}
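The size of the search space is the product of the number of candidates for each key. This sketch restates the grid values from above and counts the combinations with scikit-learn's `ParameterGrid`. The result matches the 324 iterations reported by the progress bar during tuning.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid defined above
grid = {
    "xgb__learning_rate": [0.1, 0.01, 0.05],
    "xgb__n_estimators": [100, 200, 300],
    "xgb__max_depth": [3, 5, 7],
    "xgb__subsample": [0.8, 1.0],
    "xgb__colsample_bytree": [0.8, 1.0],
    "xgb__eval_metric": ["logloss"],
    "xgb__early_stopping_rounds": [10],
    "xgb__verbose": [False],
    "selectKBest__k": [5, 10, 20],
}

# 3 * 3 * 3 * 2 * 2 * 1 * 1 * 1 * 3 combinations
n_combinations = len(ParameterGrid(grid))
print(n_combinations)
```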

Initialize and Configure the Model¶

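# Initialize the Model object from model_tuner
# Note: assigning to `model_tuner` rebinds the name of the module imported earlier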
model_tuner = Model(
    pipeline_steps=[
        ("Preprocessor", SimpleImputer()),
    ],
    name="XGBoost_Breast_Cancer",
    estimator_name=estimator_name_xgb,
    calibrate=True,
    estimator=xgb_model,
    xgboost_early=True,
    kfold=False,
    selectKBest=True,
    stratify_y=False,
    grid=xgb_pipeline_hyperparms_grid,
    # randomized_grid=True,
    # n_iter=5,
    scoring=["roc_auc"],
    random_state=222,
    n_jobs=-1,
)

Perform Grid Search Parameter Tuning¶

# Perform grid search parameter tuning
model_tuner.grid_search_param_tuning(X, y)

100%|██████████| 324/324 [01:15<00:00,  4.31it/s]

Best score/param set found on validation set:
{'params': {'selectKBest__k': 20,
            'xgb__colsample_bytree': 0.8,
            'xgb__early_stopping_rounds': 10,
            'xgb__eval_metric': 'logloss',
            'xgb__learning_rate': 0.1,
            'xgb__max_depth': 3,
            'xgb__n_estimators': 170,
            'xgb__subsample': 0.8},
 'score': 0.9987212276214834}
Best roc_auc: 0.999

Fit The Model¶

# Get the training, validation, and test data
X_train, y_train = model_tuner.get_train_data(X, y)
X_valid, y_valid = model_tuner.get_valid_data(X, y)
X_test, y_test = model_tuner.get_test_data(X, y)

# Fit the model on the training data, passing the validation data for evaluation
model_tuner.fit(
    X_train, y_train, validation_data=(X_valid, y_valid), score="roc_auc",
)
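The validation report further down covers 114 samples, about 20% of the 569 rows. Assuming a 60/20/20 train/validation/test split (an assumption: the exact proportions and shuffling used by model_tuner are not shown in this notebook), a comparable split can be sketched with two calls to scikit-learn's `train_test_split`. The variable names `X_tr`, `X_va`, and `X_te` are hypothetical and chosen to avoid overwriting the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

# First hold out 20% of the rows as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=222
)

# Then take 25% of the remaining 80% (20% overall) as the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=222
)

print(len(X_tr), len(X_va), len(X_te))
```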

Return Metrics (Optional)¶

# Return metrics for the validation set
metrics = model_tuner.return_metrics(
    X_valid,
    y_valid,
)
metrics

Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
            Pos  Neg
--------------------------------------------------------------------------------
Actual: Pos 65 (tp)   3 (fn)
        Neg  0 (fp)  46 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        46
           1       1.00      0.96      0.98        68

    accuracy                           0.97       114
   macro avg       0.97      0.98      0.97       114
weighted avg       0.98      0.97      0.97       114

--------------------------------------------------------------------------------

Feature names selected:
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean compactness', 'mean concavity', 'mean concave points', 'radius error', 'perimeter error', 'area error', 'concavity error', 'concave points error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points']

{'Classification Report': {'0': {'precision': 0.9387755102040817,
   'recall': 1.0,
   'f1-score': 0.968421052631579,
   'support': 46.0},
  '1': {'precision': 1.0,
   'recall': 0.9558823529411765,
   'f1-score': 0.9774436090225563,
   'support': 68.0},
  'accuracy': 0.9736842105263158,
  'macro avg': {'precision': 0.9693877551020409,
   'recall': 0.9779411764705883,
   'f1-score': 0.9729323308270676,
   'support': 114.0},
  'weighted avg': {'precision': 0.9752953813104189,
   'recall': 0.9736842105263158,
   'f1-score': 0.9738029283735655,
   'support': 114.0}},
 'Confusion Matrix': array([[46,  0],
        [ 3, 65]]),
 'K Best Features': ['mean radius',
  'mean texture',
  'mean perimeter',
  'mean area',
  'mean compactness',
  'mean concavity',
  'mean concave points',
  'radius error',
  'perimeter error',
  'area error',
  'concavity error',
  'concave points error',
  'worst radius',
  'worst texture',
  'worst perimeter',
  'worst area',
  'worst smoothness',
  'worst compactness',
  'worst concavity',
  'worst concave points']}
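The figures in the classification report follow directly from the confusion matrix. As a worked check, this sketch takes the matrix printed in the output above and recomputes precision, recall, and F1 for class 1, along with overall accuracy:

```python
import numpy as np

# Confusion matrix from the output above:
# rows = actual label, columns = predicted label, in label order [0, 1]
cm = np.array([[46, 0],
               [3, 65]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # 65 / 65
recall_1 = tp / (tp + fn)         # 65 / 68
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()   # 111 / 114

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```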

Calibrate The Model¶

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Get the predicted probabilities for the test data from the uncalibrated model
y_prob_uncalibrated = model_tuner.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the uncalibrated model
prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(y_test, y_prob_uncalibrated, n_bins=10)

# Calibrate the model
if model_tuner.calibrate:
    model_tuner.calibrateModel(X, y, score="roc_auc")

# Predict on the test set
y_test_pred = model_tuner.predict_proba(X_test)[:,1]

Change back to CPU
Confusion matrix on validation set for roc_auc
--------------------------------------------------------------------------------
          Predicted:
            Pos  Neg
--------------------------------------------------------------------------------
Actual: Pos 65 (tp)   3 (fn)
        Neg  0 (fp)  46 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        46
           1       1.00      0.96      0.98        68

    accuracy                           0.97       114
   macro avg       0.97      0.98      0.97       114
weighted avg       0.98      0.97      0.97       114

--------------------------------------------------------------------------------
roc_auc after calibration: 0.9987212276214834

# Get the predicted probabilities for the test data from the calibrated model
y_prob_calibrated = model_tuner.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the calibrated model
prob_true_calibrated, prob_pred_calibrated = calibration_curve(y_test, y_prob_calibrated, n_bins=10,)

# Plot the calibration curves
plt.figure(figsize=(5, 5))
plt.plot(prob_pred_uncalibrated, prob_true_uncalibrated, marker='o', label='Uncalibrated XGBoost')
plt.plot(prob_pred_calibrated, prob_true_calibrated, marker='o', label='Calibrated XGBoost')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Predicted probability')
plt.ylabel('True probability in each bin')
plt.title('Calibration plot (reliability curve)')
plt.legend()
plt.show()
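A reliability plot is a visual check. The Brier score (mean squared error of the predicted probabilities, lower is better) is a single-number complement. This sketch uses scikit-learn's `CalibratedClassifierCV` with a gradient boosting classifier as a stand-in for XGBoost. It is not the `calibrateModel` method used above, and the sigmoid method and split sizes are assumptions made for the example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X_data, y_data = load_breast_cancer(return_X_y=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_data, y_data, test_size=0.25, stratify=y_data, random_state=0
)

# Uncalibrated model
base = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)
brier_uncal = brier_score_loss(y_hold, base.predict_proba(X_hold)[:, 1])

# Calibrated model: sigmoid (Platt) scaling fitted with internal 5-fold CV
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    method="sigmoid",
    cv=5,
).fit(X_fit, y_fit)
brier_cal = brier_score_loss(y_hold, calibrated.predict_proba(X_hold)[:, 1])

print(round(brier_uncal, 4), round(brier_cal, 4))
```

Calibration does not always lower the Brier score on a small hold-out set, so it helps to compare both numbers alongside the plot.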

Classification Report¶

print(model_tuner.classification_report)

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        46
           1       1.00      0.96      0.98        68

    accuracy                           0.97       114
   macro avg       0.97      0.98      0.97       114
weighted avg       0.98      0.97      0.97       114

Reference¶

Scikit-learn Developers. (2023). Breast Cancer Wisconsin (Diagnostic) Dataset. Scikit-learn Documentation.
https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-dataset.

Output of X.head() (first five rows; middle columns elided):

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678