UCI Machine Learning Repository Dataset - AIDS Clinical Trials Group Study 175

The UCI Machine Learning Repository is a well-known resource for accessing a wide range of datasets used for machine learning research and practice. One such dataset is the AIDS Clinical Trials Group Study dataset, which can be used to build and evaluate predictive models.

You can easily fetch this dataset using the ucimlrepo package. If you haven't installed it yet, you can do so by running pip install ucimlrepo.

Model Tuner Library Instructions

This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.

Model Tuner Description

The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.

Key Features

Automatic Hyperparameter Tuning

The library can automatically tune hyperparameters for a variety of machine learning models using advanced optimization techniques.
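
In this notebook, tuning is driven by the grid_search_param_tuning method over a user-defined hyperparameter grid, as shown in the worked example below.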

Cross-Validation

Integrated cross-validation ensures that the models are evaluated robustly, preventing overfitting.
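
Cross-validation is controlled through the Model constructor's kfold argument; the example below sets kfold=False in favor of dedicated train/validation/test splits.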

Documentation

For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.

By following these steps, you should be able to install and use the model_tuner library effectively in your notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.

Installation

To install the model_tuner library, use the following command:

In [1]:
! pip install model_tuner
Requirement already satisfied: model_tuner in /usr/local/lib/python3.10/dist-packages (0.0.15a0)
Requirement already satisfied: joblib>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.4.2)
Requirement already satisfied: numpy>=1.21.6 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.26.4)
Requirement already satisfied: pandas>=1.3.5 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.3.2)
Requirement already satisfied: scipy>=1.7.3 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.13.1)
Requirement already satisfied: tqdm>=4.66.4 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (4.66.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0.2->model_tuner) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.3.5->model_tuner) (1.16.0)

Importing the Library

After installation, you can import the necessary components from the model_tuner library as shown below:

In [2]:
import model_tuner  # import model_tuner to show version info.
from model_tuner import Model  # Model class from the model_tuner library
from sklearn.impute import SimpleImputer  # for imputing missing values

Checking the Version

To ensure that the model_tuner library is installed correctly, you can check its version and inspect its package details:

In [3]:
help(model_tuner)
Help on package model_tuner:

NAME
    model_tuner

DESCRIPTION
    The `model_tuner` library is a versatile and powerful tool designed to 
    facilitate the training, evaluation, and tuning of machine learning models. 
    It supports various functionalities such as handling imbalanced data, applying 
    different scaling and imputation techniques, calibrating models, and conducting 
    cross-validation. This library is particularly useful for model selection, 
    hyperparameter tuning, and ensuring optimal performance across different metrics.
    
    Version: 0.0.15a

PACKAGE CONTENTS
    bootstrapper
    main
    model_tuner_utils
    pickleObjects

DATA
    __email__ = 'lshpaner@ucla.edu; alafunnell@gmail.com; pp89@ucla.edu'

VERSION
    0.0.15a

AUTHOR
    Arthur Funnell, Leonid Shpaner, Panayiotis Petousis

FILE
    /usr/local/lib/python3.10/dist-packages/model_tuner/__init__.py


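Alternatively, a shorter check, assuming model_tuner follows the common convention of exposing a __version__ attribute:

# Print just the version string (assumes a standard __version__ attribute)
print(model_tuner.__version__)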

Binary Classification Via The AIDS Clinical Trials Group Study 175 Dataset

The AIDS Clinical Trials Group Study 175 Dataset is a healthcare dataset containing statistical and categorical information about patients diagnosed with AIDS. Initially published in 1996, it is often used to predict whether a patient will respond to different AIDS treatments.

Key Features of the Dataset

  • Number of Instances: 2,139
  • Number of Features: 23
  • Feature Type: Categorical, Integer
  • Subject Area: Health and Medicine
  • Associated Tasks: Classification, Regression

Dataset Information

  • Purpose of the Dataset: The dataset was created to examine the performance of two different types of AIDS treatments.
  • Funding: The creation of this dataset was funded by the AIDS Clinical Trials Group of the National Institute of Allergy and Infectious Diseases and General Research Center units funded by the National Center for Research Resources.
  • Instances Represent: The dataset includes health records of AIDS patients from the US only.
  • Sensitive Data: The dataset includes sensitive information such as ethnicity (race) and gender.
  • Data Preprocessing: No preprocessing was performed on the data.
  • Missing Values: The dataset does not have missing values.

Example Usage in Machine Learning

  • Predictive Modeling: The dataset can be used to train models that predict patient outcomes based on demographic and clinical features.
  • Treatment Efficacy Analysis: Researchers can use the dataset to compare the effectiveness of different AIDS treatments.
  • Health Data Analytics: This dataset is valuable for analyzing trends in the progression and treatment of AIDS among patients in the United States.

Accessing the Dataset

To work with the AIDS Clinical Trials Group Study 175 Dataset, you can load it using the ucimlrepo package. If you haven't installed it yet, install it with:

pip install ucimlrepo
In [4]:
! pip install ucimlrepo
Requirement already satisfied: ucimlrepo in /usr/local/lib/python3.10/dist-packages (0.0.7)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2.1.4)
Requirement already satisfied: certifi>=2020.12.5 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2024.8.30)
Requirement already satisfied: numpy<2,>=1.22.4 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.0.0->ucimlrepo) (1.16.0)

Load the dataset, define X, y

Once installed, you can quickly load the AIDS Clinical Trials Group Study dataset with a few simple commands:

In [5]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
aids_clinical_trials_group_study_175 = fetch_ucirepo(id=890)

# data (as pandas dataframes)
X = aids_clinical_trials_group_study_175.data.features
y = aids_clinical_trials_group_study_175.data.targets
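
The returned object also carries the dataset's metadata and per-variable documentation (per the ucimlrepo README), which you can print directly:

# Inspect dataset metadata and variable descriptions
print(aids_clinical_trials_group_study_175.metadata)
print(aids_clinical_trials_group_study_175.variables)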

Import Requisite Libraries

In [6]:
import pandas as pd
import numpy as np
import xgboost as xgb
In [7]:
X.head() # inspect the first 5 rows of data
Out[7]:
time trt age wtkg hemo homo drugs karnof oprior z30 ... gender str2 strat symptom treat offtrt cd40 cd420 cd80 cd820
0 948 2 48 89.8128 0 0 0 100 0 0 ... 0 0 1 0 1 0 422 477 566 324
1 1002 3 61 49.4424 0 0 0 90 0 1 ... 0 1 3 0 1 0 162 218 392 564
2 961 3 45 88.4520 0 1 1 90 0 1 ... 1 1 3 0 1 1 326 274 2063 1893
3 1166 3 47 85.2768 0 1 0 100 0 1 ... 1 1 3 0 1 0 287 394 1590 966
4 1090 0 43 66.6792 0 1 0 100 0 1 ... 1 1 3 0 0 0 504 353 870 782

5 rows × 23 columns
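
Before modeling, it is also worth a quick look at the class balance of the target; a minimal sketch, assuming y holds the binary outcome returned by fetch_ucirepo:

# Check the class balance of the target
print(y.value_counts())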

In [8]:
# If y was returned as a one-column DataFrame, convert it to a Series
if isinstance(y, pd.DataFrame):
    y = y.squeeze()

Check for zero-variance columns and drop accordingly

In [9]:
# Check for zero-variance columns and drop them
zero_variance_columns = X.columns[X.var() == 0]
if not zero_variance_columns.empty:
    X = X.drop(columns=zero_variance_columns)

Create an Instance of the XGBClassifier

In [10]:
# Creating an instance of the XGBClassifier
xgb_model = xgb.XGBClassifier(
    random_state=222,
)

Define Hyperparameters for XGBoost

In [11]:
# Estimator name prefix used in the pipeline parameter keys below
estimator_name_xgb = "xgb"

# Define the hyperparameters for XGBoost
xgb_learning_rates = [0.1, 0.01, 0.05]  # Learning rate (eta)
xgb_n_estimators = [100, 200, 300]  # Number of boosting rounds (trees)
xgb_max_depths = [3, 5, 7]  # Maximum depth of the trees
xgb_subsamples = [0.8, 1.0]  # Subsample ratio of the training instances
xgb_colsample_bytree = [0.8, 1.0]  # Subsample ratio of columns per tree

xgb_eval_metric = ["logloss"]  # Metric monitored for early stopping
xgb_early_stopping_rounds = [10]  # Stop after 10 rounds without improvement
xgb_verbose = [False]  # Suppress per-round training output

# Combining the hyperparameters in a dictionary
xgb_pipeline_hyperparms_grid = {
    "xgb__learning_rate": xgb_learning_rates,
    "xgb__n_estimators": xgb_n_estimators,
    "xgb__max_depth": xgb_max_depths,
    "xgb__subsample": xgb_subsamples,
    "xgb__colsample_bytree": xgb_colsample_bytree,
    "xgb__eval_metric": xgb_eval_metric,
    "xgb__early_stopping_rounds": xgb_early_stopping_rounds,
    "xgb__verbose": xgb_verbose,
    "selectKBest__k": [5, 10, 20],
}
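
Note that this grid spans 3 × 3 × 3 × 2 × 2 = 108 XGBoost settings; combined with the 3 candidate values of selectKBest__k, it yields the 324 combinations evaluated in the tuning progress bar below.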

Initialize and Configure the Model

In [12]:
from model_tuner import Model
In [13]:
# Initialize the Model object
model_tuner = Model(
    pipeline_steps=[
        ("Preprocessor", SimpleImputer()),
    ],
    name="XGBoost_AIDS",
    estimator_name=estimator_name_xgb,
    calibrate=True,
    estimator=xgb_model,
    xgboost_early=True,
    kfold=False,
    selectKBest=True,
    stratify_y=True,
    stratify_cols=["gender", "race"],
    grid=xgb_pipeline_hyperparms_grid,
    # randomized_grid=True,
    # n_iter=5,
    scoring=["roc_auc"],
    random_state=222,
    n_jobs=-1,
)
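
Here, stratify_y=True together with stratify_cols=["gender", "race"] is intended to keep the outcome distribution and the gender and race proportions consistent across the train, validation, and test splits.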

Perform Grid Search Parameter Tuning

In [14]:
# Perform grid search parameter tuning
model_tuner.grid_search_param_tuning(X, y)
100%|██████████| 324/324 [01:31<00:00,  3.54it/s]
Best score/param set found on validation set:
{'params': {'selectKBest__k': 20,
            'xgb__colsample_bytree': 1.0,
            'xgb__early_stopping_rounds': 10,
            'xgb__eval_metric': 'logloss',
            'xgb__learning_rate': 0.05,
            'xgb__max_depth': 5,
            'xgb__n_estimators': 100,
            'xgb__subsample': 1.0},
 'score': 0.946877967711301}
Best roc_auc: 0.947 


Fit the Model

In [15]:
# Retrieve the train, validation, and test splits
X_train, y_train = model_tuner.get_train_data(X, y)
X_valid, y_valid = model_tuner.get_valid_data(X, y)
X_test, y_test = model_tuner.get_test_data(X, y)

# Fit on the training data, using the validation set for early stopping
model_tuner.fit(
    X_train,
    y_train,
    validation_data=(X_valid, y_valid),
    score="roc_auc",
)

Return Metrics (Optional)

In [16]:
# Return metrics for the validation set
metrics = model_tuner.return_metrics(
    X_valid,
    y_valid,
)
metrics
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  89 (tp)   15 (fn)
        Neg  21 (fp)  303 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.95      0.94      0.94       324
           1       0.81      0.86      0.83       104

    accuracy                           0.92       428
   macro avg       0.88      0.90      0.89       428
weighted avg       0.92      0.92      0.92       428

--------------------------------------------------------------------------------

Feature names selected:
['time', 'trt', 'age', 'hemo', 'homo', 'drugs', 'karnof', 'oprior', 'z30', 'preanti', 'race', 'gender', 'str2', 'strat', 'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80']

Out[16]:
{'Classification Report': {'0': {'precision': 0.9528301886792453,
   'recall': 0.9351851851851852,
   'f1-score': 0.9439252336448598,
   'support': 324.0},
  '1': {'precision': 0.8090909090909091,
   'recall': 0.8557692307692307,
   'f1-score': 0.8317757009345793,
   'support': 104.0},
  'accuracy': 0.9158878504672897,
  'macro avg': {'precision': 0.8809605488850771,
   'recall': 0.895477207977208,
   'f1-score': 0.8878504672897196,
   'support': 428.0},
  'weighted avg': {'precision': 0.9179028870970327,
   'recall': 0.9158878504672897,
   'f1-score': 0.9166739453227356,
   'support': 428.0}},
 'Confusion Matrix': array([[303,  21],
        [ 15,  89]]),
 'K Best Features': ['time',
  'trt',
  'age',
  'hemo',
  'homo',
  'drugs',
  'karnof',
  'oprior',
  'z30',
  'preanti',
  'race',
  'gender',
  'str2',
  'strat',
  'symptom',
  'treat',
  'offtrt',
  'cd40',
  'cd420',
  'cd80']}
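
Because return_metrics hands back a plain dictionary, individual results can be pulled out programmatically. A small usage sketch based on the keys shown above:

# Access individual entries of the returned metrics dictionary
print(metrics["Classification Report"]["1"]["f1-score"])  # positive-class F1
print(metrics["Confusion Matrix"])  # 2x2 array, [[tn, fp], [fn, tp]]
print(metrics["K Best Features"][:5])  # first five selected features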

Calibrate the Model

In [17]:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Get the predicted probabilities for the test data from the uncalibrated model
y_prob_uncalibrated = model_tuner.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the uncalibrated model
prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(
    y_test,
    y_prob_uncalibrated,
    n_bins=10,
)
In [18]:
# Calibrate the model
if model_tuner.calibrate:
    model_tuner.calibrateModel(X, y, score="roc_auc")

# Predict probabilities on the test set
y_test_pred = model_tuner.predict_proba(X_test)[:, 1]
Change back to CPU
Confusion matrix on validation set for roc_auc
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  88 (tp)   16 (fn)
        Neg  20 (fp)  304 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.95      0.94      0.94       324
           1       0.81      0.85      0.83       104

    accuracy                           0.92       428
   macro avg       0.88      0.89      0.89       428
weighted avg       0.92      0.92      0.92       428

--------------------------------------------------------------------------------
roc_auc after calibration: 0.9467592592592593
In [19]:
# Get the predicted probabilities for the test data from the calibrated model
y_prob_calibrated = model_tuner.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the calibrated model
prob_true_calibrated, prob_pred_calibrated = calibration_curve(
    y_test,
    y_prob_calibrated,
    n_bins=5,
)

# Plot the calibration curves
plt.figure(figsize=(5, 5))
plt.plot(
    prob_pred_uncalibrated,
    prob_true_uncalibrated,
    marker="o",
    label="Uncalibrated XGBoost",
)
plt.plot(
    prob_pred_calibrated,
    prob_true_calibrated,
    marker="o",
    label="Calibrated XGBoost",
)
plt.plot(
    [0, 1],
    [0, 1],
    linestyle="--",
    label="Perfectly calibrated",
)
plt.xlabel("Predicted probability")
plt.ylabel("True probability in each bin")
plt.title("Calibration plot (reliability curve)")
plt.legend()
plt.show()
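
As an optional numeric check (not part of the original notebook), the Brier score summarizes how closely predicted probabilities track observed outcomes; lower is better:

from sklearn.metrics import brier_score_loss

# Compare probability quality before and after calibration
print("Uncalibrated:", brier_score_loss(y_test, y_prob_uncalibrated))
print("Calibrated:  ", brier_score_loss(y_test, y_prob_calibrated))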
In [20]:
print(model_tuner.classification_report)
              precision    recall  f1-score   support

           0       0.95      0.94      0.94       324
           1       0.81      0.85      0.83       104

    accuracy                           0.92       428
   macro avg       0.88      0.89      0.89       428
weighted avg       0.92      0.92      0.92       428

Reference

El-Sadr, W., & Abrams, D. (1998). AIDS Clinical Trials Group Study 175. UCI Machine Learning Repository.
https://doi.org/10.24432/C5G896.