UCI Machine Learning Repository Dataset - AIDS Clinical Trials Group Study 175

The UCI Machine Learning Repository is a well-known resource for accessing a wide range of datasets used for machine learning research and practice. One such dataset is the AIDS Clinical Trials Group Study dataset, which can be used to build and evaluate predictive models.

In this notebook, you can easily fetch this dataset using the ucimlrepo package. If you haven't installed it yet, you can do so by running pip install ucimlrepo.

Model Tuner Library Instructions

This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.

Model Tuner Description

The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.

Key Features

Automatic Hyperparameter Tuning

The library can automatically tune hyperparameters for a variety of machine learning models using advanced optimization techniques.

Cross-Validation

Integrated cross-validation ensures that models are evaluated robustly, guarding against overfitting to any single train/test split.

Documentation

For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.

By following these steps, you should be able to install and use the model_tuner library effectively in your notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.

Installation

To install the model_tuner library, use the following command:

In [1]:
! pip install model_tuner
Collecting model_tuner
  Downloading model_tuner-0.0.22a0-py3-none-any.whl.metadata (5.7 kB)
Collecting joblib==1.3.2 (from model_tuner)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting tqdm==4.66.4 (from model_tuner)
  Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 2.1 MB/s eta 0:00:00
Collecting catboost==1.2.7 (from model_tuner)
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Collecting pip==24.2 (from model_tuner)
  Downloading pip-24.2-py3-none-any.whl.metadata (3.6 kB)
Requirement already satisfied: setuptools==75.1.0 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (75.1.0)
Collecting wheel==0.44.0 (from model_tuner)
  Downloading wheel-0.44.0-py3-none-any.whl.metadata (2.3 kB)
Requirement already satisfied: numpy<2.0.0,>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.26.4)
Requirement already satisfied: pandas<2.2.3,>=1.3.5 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (2.2.2)
Collecting scikit-learn<1.4.0,>=1.0.2 (from model_tuner)
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting scipy<1.11,>=1.6.3 (from model_tuner)
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.9/58.9 kB 3.7 MB/s eta 0:00:00
Collecting scikit-optimize==0.10.2 (from model_tuner)
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl.metadata (9.7 kB)
Requirement already satisfied: imbalanced-learn==0.12.4 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (0.12.4)
Requirement already satisfied: xgboost==2.1.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (2.1.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.10/dist-packages (from catboost==1.2.7->model_tuner) (0.20.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from catboost==1.2.7->model_tuner) (3.8.0)
Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages (from catboost==1.2.7->model_tuner) (5.24.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from catboost==1.2.7->model_tuner) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn==0.12.4->model_tuner) (3.5.0)
Collecting pyaml>=16.9 (from scikit-optimize==0.10.2->model_tuner)
  Downloading pyaml-24.9.0-py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from scikit-optimize==0.10.2->model_tuner) (24.2)
Requirement already satisfied: nvidia-nccl-cu12 in /usr/local/lib/python3.10/dist-packages (from xgboost==2.1.2->model_tuner) (2.23.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas<2.2.3,>=1.3.5->model_tuner) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas<2.2.3,>=1.3.5->model_tuner) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas<2.2.3,>=1.3.5->model_tuner) (2024.2)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.10/dist-packages (from pyaml>=16.9->scikit-optimize==0.10.2->model_tuner) (6.0.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost==1.2.7->model_tuner) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost==1.2.7->model_tuner) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost==1.2.7->model_tuner) (4.55.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost==1.2.7->model_tuner) (1.4.7)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost==1.2.7->model_tuner) (11.0.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost==1.2.7->model_tuner) (3.2.0)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly->catboost==1.2.7->model_tuner) (9.0.0)
Downloading model_tuner-0.0.22a0-py3-none-any.whl (23 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 MB 8.3 MB/s eta 0:00:00
Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.2/302.2 kB 17.8 MB/s eta 0:00:00
Downloading pip-24.2-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 20.3 MB/s eta 0:00:00
Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 107.8/107.8 kB 8.4 MB/s eta 0:00:00
Downloading tqdm-4.66.4-py3-none-any.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 4.2 MB/s eta 0:00:00
Downloading wheel-0.44.0-py3-none-any.whl (67 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.1/67.1 kB 4.2 MB/s eta 0:00:00
Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.8/10.8 MB 45.9 MB/s eta 0:00:00
Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 29.8 MB/s eta 0:00:00
Downloading pyaml-24.9.0-py3-none-any.whl (24 kB)
Installing collected packages: wheel, tqdm, scipy, pyaml, pip, joblib, scikit-learn, scikit-optimize, catboost, model_tuner
  Attempting uninstall: wheel
    Found existing installation: wheel 0.45.0
    Uninstalling wheel-0.45.0:
      Successfully uninstalled wheel-0.45.0
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.66.6
    Uninstalling tqdm-4.66.6:
      Successfully uninstalled tqdm-4.66.6
  Attempting uninstall: scipy
    Found existing installation: scipy 1.13.1
    Uninstalling scipy-1.13.1:
      Successfully uninstalled scipy-1.13.1
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
  Attempting uninstall: joblib
    Found existing installation: joblib 1.4.2
    Uninstalling joblib-1.4.2:
      Successfully uninstalled joblib-1.4.2
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.5.2
    Uninstalling scikit-learn-1.5.2:
      Successfully uninstalled scikit-learn-1.5.2
Successfully installed catboost-1.2.7 joblib-1.3.2 model_tuner-0.0.22a0 pip-24.2 pyaml-24.9.0 scikit-learn-1.3.2 scikit-optimize-0.10.2 scipy-1.10.1 tqdm-4.66.4 wheel-0.44.0

Importing the Library

After installation, you can import the necessary components from the model_tuner library as shown below:

In [2]:
import model_tuner  # import model_tuner to show version info.
from model_tuner import Model  # Model class from model_tuner lib.

Checking the Version

To ensure that the model_tuner library is installed correctly, you can check its version:

In [3]:
help(model_tuner)
Help on package model_tuner:

NAME
    model_tuner

DESCRIPTION
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    $      __  __           _      _   _____                          $ 
    $     |  \/  | ___   __| | ___| | |_   _|   _ _ __   ___ _ __     $
    $     | |\/| |/ _ \ / _` |/ _ \ |   | || | | | '_ \ / _ \ '__|    $
    $     | |  | | (_) | (_| |  __/ |   | || |_| | | | |  __/ |       $
    $     |_|  |_|\___/ \__,_|\___|_|   |_| \__,_|_| |_|\___|_|       $
    $                                                                 $
    $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
                                                                   
    The `model_tuner` library is a versatile and powerful tool designed to 
    facilitate the training, evaluation, and tuning of machine learning models. 
    It supports various functionalities such as handling imbalanced data, applying 
    different scaling and imputation techniques, calibrating models, and conducting 
    cross-validation. This library is particularly useful for model selection, 
    hyperparameter tuning, and ensuring optimal performance across different metrics.
    
    Version: 0.0.22a

PACKAGE CONTENTS
    bootstrapper
    main
    model_tuner_utils
    pickleObjects

DATA
    __email__ = 'lshpaner@ucla.edu; alafunnell@gmail.com; pp89@ucla.edu'

VERSION
    0.0.22a

AUTHOR
    Arthur Funnell, Leonid Shpaner, Panayiotis Petousis

FILE
    /usr/local/lib/python3.10/dist-packages/model_tuner/__init__.py


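Since the help output includes a VERSION entry (which pydoc reads from the package's __version__ attribute), the version string can also be printed directly:

print(model_tuner.__version__)  # e.g., 0.0.22a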

Binary Classification Via The AIDS Clinical Trials Group Study 175 Dataset

The AIDS Clinical Trials Group Study 175 Dataset is a healthcare dataset that contains statistical and categorical information about patients who have been diagnosed with AIDS. This dataset, which was initially published in 1996, is often used to predict whether or not a patient will respond to different AIDS treatments.

Key Features of the Dataset

  • Number of Instances: 2,139
  • Number of Features: 23
  • Feature Type: Categorical, Integer
  • Subject Area: Health and Medicine
  • Associated Tasks: Classification, Regression

Dataset Information

  • Purpose of the Dataset: The dataset was created to examine the performance of two different types of AIDS treatments.
  • Funding: The creation of this dataset was funded by the AIDS Clinical Trials Group of the National Institute of Allergy and Infectious Diseases and General Research Center units funded by the National Center for Research Resources.
  • Instances Represent: The dataset includes health records of AIDS patients from the US only.
  • Sensitive Data: The dataset includes sensitive information such as ethnicity (race) and gender.
  • Data Preprocessing: No preprocessing was performed on the data.
  • Missing Values: The dataset does not have missing values.

Example Usage in Machine Learning

  • Predictive Modeling: The dataset can be used to train models that predict patient outcomes based on demographic and clinical features.
  • Treatment Efficacy Analysis: Researchers can use the dataset to compare the effectiveness of different AIDS treatments.
  • Health Data Analytics: This dataset is valuable for analyzing trends in the progression and treatment of AIDS among patients in the United States.

Accessing the Dataset

To work with the AIDS Clinical Trials Group Study 175 Dataset, you can load it using the ucimlrepo package. If you haven't installed it yet, install it with:

pip install ucimlrepo
In [4]:
! pip install ucimlrepo
Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2.2.2)
Requirement already satisfied: certifi>=2020.12.5 in /usr/local/lib/python3.10/dist-packages (from ucimlrepo) (2024.8.30)
Requirement already satisfied: numpy>=1.22.4 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.0->ucimlrepo) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.0.0->ucimlrepo) (1.16.0)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7

Load the dataset, define X, y

Once installed, you can quickly load the AIDS Clinical Trials Group Study dataset with a few simple commands:

In [5]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
aids_clinical_trials_group_study_175 = fetch_ucirepo(id=890)

# data (as pandas dataframes)
X = aids_clinical_trials_group_study_175.data.features
y = aids_clinical_trials_group_study_175.data.targets
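
Optionally, the fetched object also exposes dataset metadata and per-variable descriptions (standard ucimlrepo attributes), which are worth a quick look:

# Optional: inspect dataset metadata and variable descriptions
print(aids_clinical_trials_group_study_175.metadata.name)
print(aids_clinical_trials_group_study_175.variables.head())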

Import Requisite Libraries

In [6]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import ElasticNet
In [7]:
X.head() # inspect the first 5 rows of data
Out[7]:
   time  trt  age     wtkg  hemo  homo  drugs  karnof  oprior  z30  ...  gender  str2  strat  symptom  treat  offtrt  cd40  cd420  cd80  cd820
0   948    2   48  89.8128     0     0      0     100       0    0  ...       0     0      1        0      1       0   422    477   566    324
1  1002    3   61  49.4424     0     0      0      90       0    1  ...       0     1      3        0      1       0   162    218   392    564
2   961    3   45  88.4520     0     1      1      90       0    1  ...       1     1      3        0      1       1   326    274  2063   1893
3  1166    3   47  85.2768     0     1      0     100       0    1  ...       1     1      3        0      1       0   287    394  1590    966
4  1090    0   43  66.6792     0     1      0     100       0    1  ...       1     1      3        0      0       0   504    353   870    782

5 rows × 23 columns

In [8]:
if isinstance(y, pd.DataFrame):
    y = y.squeeze()

Check for zero-variance columns and drop accordingly

In [9]:
# Check for zero-variance columns and drop them
zero_variance_columns = X.columns[X.var() == 0]
if not zero_variance_columns.empty:
    X = X.drop(columns=zero_variance_columns)

Define Hyperparameters for XGBoost

In [10]:
xgb_name = "xgb"
xgb = XGBClassifier(
    objective="binary:logistic",
    random_state=222,
)
xgbearly = True
tuned_parameters_xgb = {
    f"{xgb_name}__max_depth": [3, 10, 20, 200, 500],
    f"{xgb_name}__learning_rate": [1e-4],
    f"{xgb_name}__n_estimators": [1000],
    f"{xgb_name}__early_stopping_rounds": [100],
    f"{xgb_name}__verbose": [0],
    f"{xgb_name}__eval_metric": ["logloss"],
}

xgb_definition = {
    "clc": xgb,
    "estimator_name": xgb_name,
    "tuned_parameters": tuned_parameters_xgb,
    "randomized_grid": False,
    "n_iter": 5,
    "early": xgbearly,
}

Define The Model Object

In [11]:
model_type = "xgb"
clc = xgb_definition["clc"]
estimator_name = xgb_definition["estimator_name"]

tuned_parameters = xgb_definition["tuned_parameters"]
n_iter = xgb_definition["n_iter"]
rand_grid = xgb_definition["randomized_grid"]
early_stop = xgb_definition["early"]
kfold = False
calibrate = True

Using Imputation and Scaling in Pipeline Steps for Model Preprocessing

The pipeline_steps parameter accepts a list of tuples, where each tuple specifies a transformation step to be applied to the data. For example, the code block below performs imputation followed by standardization on the dataset before training the model.

pipeline_steps=[
    ("Imputer", SimpleImputer()),
    ("StandardScaler", StandardScaler()),
]

When Are Imputation and Feature Scaling in pipeline_steps Beneficial?

  • Logistic Regression: Highly sensitive to feature scaling and missing data. Preprocessing steps like imputation and standardization can improve model performance significantly (see the sketch after this list).
  • Linear Models (e.g., Ridge, Lasso): Like Logistic Regression, these models require feature scaling for optimal performance.
  • SVMs: Sensitive to the scale of the features, requiring preprocessing such as standardization.
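
For instance, a scale-sensitive learner such as Logistic Regression pairs naturally with these steps. A minimal sketch, reusing the Model constructor arguments shown later in this notebook (the estimator and grid values here are illustrative, not a prescribed configuration):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from model_tuner import Model

lr = LogisticRegression(max_iter=1000)

model_lr = Model(
    name="AIDS_Clinical_lr",
    estimator_name="lr",
    estimator=lr,
    model_type="classification",
    calibrate=False,
    kfold=False,
    stratify_y=True,
    stratify_cols=False,
    grid={"lr__C": [0.01, 0.1, 1, 10]},  # illustrative grid
    randomized_grid=False,
    pipeline_steps=[
        ("Imputer", SimpleImputer()),
        ("StandardScaler", StandardScaler()),
    ],
    scoring=["roc_auc"],
    random_state=222,
    n_jobs=2,
)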

Models Not Benefiting From Imputation and Scaling in pipeline_steps:

  • Tree-Based Models (e.g., XGBoost, Random Forests, Decision Trees): These models are invariant to feature scaling and can handle missing values natively. Passing preprocessing steps like StandardScaler or SimpleImputer is typically redundant and adds unnecessary complexity.

Why Doesn't XGBoost Require Imputation and Scaling in pipeline_steps?

XGBoost and similar tree-based models split on feature thresholds rather than operating on feature magnitudes directly. This makes them robust to unscaled data and able to handle missing values through built-in mechanisms such as XGBoost's missing parameter and learned default split directions. Thus, adding scaling or imputation steps often does not improve performance and may needlessly complicate the training process.

To this end, it is best to reserve pipeline_steps for algorithms that rely on numerical properties (e.g., Logistic Regression). For XGBoost, focus instead on other optimization techniques like hyperparameter tuning and feature engineering. The short demo below illustrates XGBoost's native missing-value handling.
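
As a quick, self-contained illustration (synthetic data, purely illustrative):

import numpy as np
from xgboost import XGBClassifier

# XGBoost trains directly on data containing NaNs, learning a default
# split direction for missing values -- no Imputer step required.
X_demo = np.array(
    [[1.0, np.nan], [2.0, 3.0], [np.nan, 1.0],
     [4.0, 2.0], [5.0, np.nan], [6.0, 0.5]]
)
y_demo = np.array([0, 1, 0, 1, 0, 1])

clf = XGBClassifier(objective="binary:logistic", n_estimators=10)
clf.fit(X_demo, y_demo)  # no imputation or scaling needed
print(clf.predict(X_demo))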

Initialize and Configure the Model

In [12]:
model_xgb = Model(
    name=f"AIDS_Clinical_{model_type}",
    estimator_name=estimator_name,
    calibrate=calibrate,
    estimator=clc,
    model_type="classification",
    kfold=kfold,
    stratify_y=True,
    stratify_cols=False,
    grid=tuned_parameters,
    randomized_grid=rand_grid,
    boost_early=early_stop,
    scoring=["roc_auc"],
    random_state=222,
    n_jobs=2,
)

Tuning f1_beta_tune and optimal_threshold Parameters for Improved Performance on Imbalanced Datasets

When working with imbalanced datasets, standard metrics like precision and F-score can be misleading, especially for classes with few samples. To address this, we provide the f1_beta_tune and optimal_threshold parameters, which allow for more reliable metric calculation and improved model performance.

  • f1_beta_tune: Setting f1_beta_tune=True enables the model to adjust the F-beta threshold during parameter tuning. This adjustment lets you control the balance between precision and recall based on the needs of your application. For instance, a higher beta value prioritizes recall, which might be crucial in applications where false negatives are costly (see the small helper sketched below). This parameter is particularly valuable for fine-tuning models on imbalanced datasets where a single threshold may not optimize F1 performance across classes.

  • optimal_threshold: By setting optimal_threshold=True in return_metrics(), the model will automatically find the threshold that maximizes the F1 score. This dynamic threshold adjustment helps avoid situations where precision or F-score are undefined (set to 0.0) due to a lack of predicted samples for certain classes. This prevents UndefinedMetricWarning and ensures that metrics are calculated more consistently across all classes, even in cases of severe class imbalance.

These parameters enable better handling of imbalanced datasets, providing more reliable metrics and improving overall model interpretability.
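
To make the precision/recall trade-off controlled by beta concrete, here is a small, self-contained helper (illustrative only; model_tuner performs this tuning internally):

def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily,
    beta < 1 favors precision; beta = 1 is the usual F1."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# With precision 0.55 and recall 0.91 (close to the validation metrics below):
print(round(f_beta(0.55, 0.91, beta=1.0), 3))  # 0.686 -- balanced F1
print(round(f_beta(0.55, 0.91, beta=2.0), 3))  # 0.805 -- recall-weighted F2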

Perform Grid Search Parameter Tuning and Retrieve Split Data

In [13]:
model_xgb.grid_search_param_tuning(X, y, f1_beta_tune=True)

X_train, y_train = model_xgb.get_train_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)
Pipeline Steps:

┌─────────────────┐
│ Step 1: xgb     │
│ XGBClassifier   │
└─────────────────┘

100%|██████████| 5/5 [00:44<00:00,  8.80s/it]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  3.23it/s]
Best score/param set found on validation set:
{'params': {'xgb__early_stopping_rounds': 100,
            'xgb__eval_metric': 'logloss',
            'xgb__learning_rate': 0.0001,
            'xgb__max_depth': 3,
            'xgb__n_estimators': 999},
 'score': 0.9280033238366572}
Best roc_auc: 0.928 


Fit the Model

Model Training with Evaluation Set and Scoring Options

In XGBoost, specifying an eval_set (such as validation_data=[X_valid, y_valid]) within the fit method is recommended for monitoring model performance on unseen data during training. This helps in early stopping and in selecting optimal parameters by evaluating model performance metrics at each boosting round.

Here, ROC AUC serves as the scoring metric for binary classification (configured via scoring=["roc_auc"] above); it reflects the trade-off between true positive and false positive rates. However, users can adjust this by setting score="average_precision" when optimizing for precision-recall, which is particularly useful on imbalanced datasets; a hedged sketch of this variant follows the fit call below. This flexibility allows the model to be tailored to specific performance needs based on the application.

In [14]:
model_xgb.fit(
    X_train,
    y_train,
    validation_data=[X_valid, y_valid],
)
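
As noted above, the scoring target can be switched to average precision. A minimal sketch of that variant (not executed here, and assuming fit() accepts a score argument as described above):

model_xgb.fit(
    X_train,
    y_train,
    validation_data=[X_valid, y_valid],
    score="average_precision",  # assumed parameter; see note above
)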

Return Metrics (Optional)

In [15]:
# ------------------------- VALID AND TEST METRICS -----------------------------

print("Validation Metrics")
class_report_val, cm_val = model_xgb.return_metrics(
    X_valid,
    y_valid,
    optimal_threshold=True,
)
print()
print("Test Metrics")
class_report_test, cm_test = model_xgb.return_metrics(
    X_test,
    y_test,
    optimal_threshold=True,
)
Validation Metrics
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  95 (tp)    9 (fn)
        Neg  79 (fp)  245 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9280033238366572,
 'Average Precision': 0.7992275185850191,
 'Brier Score': 0.16713189436073958,
 'Precision/PPV': 0.5459770114942529,
 'Sensitivity': 0.9134615384615384,
 'Specificity': 0.7561728395061729}
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.96      0.76      0.85       324
           1       0.55      0.91      0.68       104

    accuracy                           0.79       428
   macro avg       0.76      0.83      0.77       428
weighted avg       0.86      0.79      0.81       428

--------------------------------------------------------------------------------

Test Metrics
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  95 (tp)    9 (fn)
        Neg  78 (fp)  246 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.934576804368471,
 'Average Precision': 0.8023014087345259,
 'Brier Score': 0.16628708993634742,
 'Precision/PPV': 0.5491329479768786,
 'Sensitivity': 0.9134615384615384,
 'Specificity': 0.7592592592592593}
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.96      0.76      0.85       324
           1       0.55      0.91      0.69       104

    accuracy                           0.80       428
   macro avg       0.76      0.84      0.77       428
weighted avg       0.86      0.80      0.81       428

--------------------------------------------------------------------------------

Calibrate the Model

In [16]:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Get the predicted probabilities for the test data from the uncalibrated model
y_prob_uncalibrated = model_xgb.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the uncalibrated model
prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(
    y_test,
    y_prob_uncalibrated,
    n_bins=10,
)

# Calibrate the model
if model_xgb.calibrate:
    model_xgb.calibrateModel(X, y, score="roc_auc")

# Get the predicted probabilities for the test data from the calibrated model
y_prob_calibrated = model_xgb.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the calibrated model
prob_true_calibrated, prob_pred_calibrated = calibration_curve(
    y_test,
    y_prob_calibrated,
    n_bins=10,
)


# Plot the calibration curves
plt.figure(figsize=(5, 5))
plt.plot(
    prob_pred_uncalibrated,
    prob_true_uncalibrated,
    marker="o",
    label="Uncalibrated XGBoost",
)
plt.plot(
    prob_pred_calibrated,
    prob_true_calibrated,
    marker="o",
    label="Calibrated XGBoost",
)
plt.plot(
    [0, 1],
    [0, 1],
    linestyle="--",
    label="Perfectly calibrated",
)
plt.xlabel("Predicted probability")
plt.ylabel("True probability in each bin")
plt.title("Calibration plot (reliability curve)")
plt.legend()
plt.show()
Change back to CPU
Confusion matrix on validation set for roc_auc
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  70 (tp)   34 (fn)
        Neg   9 (fp)  315 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.90      0.97      0.94       324
           1       0.89      0.67      0.77       104

    accuracy                           0.90       428
   macro avg       0.89      0.82      0.85       428
weighted avg       0.90      0.90      0.89       428

--------------------------------------------------------------------------------
roc_auc after calibration: 0.9280033238366572
In [17]:
print(model_xgb.classification_report)
              precision    recall  f1-score   support

           0       0.90      0.97      0.94       324
           1       0.89      0.67      0.77       104

    accuracy                           0.90       428
   macro avg       0.89      0.82      0.85       428
weighted avg       0.90      0.90      0.89       428

Recursive Feature Elimination (RFE)

Now that we've trained the model, we can also refine it by identifying which features contribute most to its performance. One effective method for this is Recursive Feature Elimination (RFE). This technique systematically removes the least important features, retraining the model at each step to evaluate how performance is affected. By focusing only on the most impactful variables, RFE helps streamline the dataset, reduce noise, and improve both the accuracy and interpretability of the final model.

It works by recursively training a model, ranking the importance of features based on the model's outputs (such as coefficients in linear models or importance scores in tree-based models), and then removing the least important features one by one. This process continues until a specified number of features remains or the desired performance criteria are met.

The primary advantage of RFE is its ability to streamline datasets, improving model performance and interpretability by focusing on features that contribute the most to the predictive power. However, it can be computationally expensive since it involves repeated model training, and its effectiveness depends on the underlying model’s ability to evaluate feature importance. RFE is commonly used with cross-validation to ensure that the selected features generalize well across datasets, making it a robust choice for model optimization and dimensionality reduction.
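
Before wiring RFE into model_tuner below, a minimal standalone scikit-learn sketch (on synthetic data, purely illustrative) shows the mechanics:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 are informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=222
)

# Recursively drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; larger = eliminated earlier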

As an illustrative example, we will retrain the above model using RFE.

We will begin by appending the feature selection technique to our tuned_parameters dictionary.

In [18]:
xgb_definition["tuned_parameters"][f"feature_selection_rfe__n_features_to_select"] = [
    5,
    10,
]

We will use ElasticNet because it strikes a balance between two widely used regularization techniques: Lasso (L1) and Ridge (L2). ElasticNet is particularly effective in scenarios where we expect the dataset to have a mix of strongly and weakly correlated features. Lasso alone tends to select only one feature from a group of highly correlated ones, ignoring the others, while Ridge includes all features but may not perform well when some are entirely irrelevant. ElasticNet addresses this limitation by combining both penalties, allowing it to handle multicollinearity more effectively while still performing feature selection.

Additionally, ElasticNet provides flexibility by controlling the ratio between L1 and L2 penalties, enabling fine-tuning to suit the specific needs of our dataset. This makes it a robust choice for datasets with many features, some of which may be irrelevant or redundant, as it can reduce overfitting while retaining a manageable subset of predictors.
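
For reference, scikit-learn's ElasticNet exposes this ratio through the l1_ratio parameter; the cell below uses the defaults, which correspond to an even blend (values here are illustrative):

from sklearn.linear_model import ElasticNet

# l1_ratio=1.0 is pure Lasso (L1), l1_ratio=0.0 is pure Ridge (L2);
# intermediate values blend the two penalties.
ElasticNet(alpha=1.0, l1_ratio=0.5)  # equivalent to the defaults used below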

In [19]:
rfe_estimator = ElasticNet()

rfe = RFE(rfe_estimator)
In [20]:
model_xgb = Model(
    name=f"AIDS_Clinical_{model_type}",
    estimator_name=estimator_name,
    calibrate=calibrate,
    estimator=clc,
    model_type="classification",
    kfold=kfold,
    pipeline_steps=[
        ("rfe", rfe),
    ],
    stratify_y=True,
    stratify_cols=False,
    grid=tuned_parameters,
    randomized_grid=rand_grid,
    feature_selection=True,
    boost_early=early_stop,
    scoring=["roc_auc"],
    random_state=222,
    n_jobs=2,
)

model_xgb.grid_search_param_tuning(X, y, f1_beta_tune=True)

X_train, y_train = model_xgb.get_train_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)

model_xgb.fit(
    X_train,
    y_train,
    validation_data=[X_valid, y_valid],
)


# ------------------------- VALID AND TEST METRICS -----------------------------

print("Validation Metrics")
model_xgb.return_metrics(
    X_valid,
    y_valid,
    optimal_threshold=True,
)
print()
print("Test Metrics")
model_xgb.return_metrics(
    X_test,
    y_test,
    optimal_threshold=True,
)

print()
Pipeline Steps:

┌─────────────────────────────────┐
│ Step 1: feature_selection_rfe   │
│ RFE                             │
└─────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Step 2: xgb                     │
│ XGBClassifier                   │
└─────────────────────────────────┘

100%|██████████| 10/10 [00:25<00:00,  2.59s/it]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  3.21it/s]
Best score/param set found on validation set:
{'params': {'feature_selection_rfe__n_features_to_select': 10,
            'xgb__early_stopping_rounds': 100,
            'xgb__eval_metric': 'logloss',
            'xgb__learning_rate': 0.0001,
            'xgb__max_depth': 10,
            'xgb__n_estimators': 999},
 'score': 0.9316684472934472}
Best roc_auc: 0.932 

Validation Metrics
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  95 (tp)    9 (fn)
        Neg  70 (fp)  254 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9316981244064577,
 'Average Precision': 0.8206553111036822,
 'Brier Score': 0.16608154668556174,
 'Precision/PPV': 0.5757575757575758,
 'Sensitivity': 0.9134615384615384,
 'Specificity': 0.7839506172839507}
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.97      0.78      0.87       324
           1       0.58      0.91      0.71       104

    accuracy                           0.82       428
   macro avg       0.77      0.85      0.79       428
weighted avg       0.87      0.82      0.83       428

--------------------------------------------------------------------------------

Feature names selected:
['time', 'preanti', 'strat', 'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80', 'cd820']


Test Metrics
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos  91 (tp)   13 (fn)
        Neg  70 (fp)  254 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9278104226020893,
 'Average Precision': 0.8133787683637559,
 'Brier Score': 0.1658272032260468,
 'Precision/PPV': 0.5652173913043478,
 'Sensitivity': 0.875,
 'Specificity': 0.7839506172839507}
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.95      0.78      0.86       324
           1       0.57      0.88      0.69       104

    accuracy                           0.81       428
   macro avg       0.76      0.83      0.77       428
weighted avg       0.86      0.81      0.82       428

--------------------------------------------------------------------------------

Feature names selected:
['time', 'preanti', 'strat', 'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80', 'cd820']


Using SHAP to Interpret Model Predictions with a Pipeline and Feature Selection

This example demonstrates how to compute and visualize SHAP (SHapley Additive exPlanations) values for a machine learning model with a pipeline that includes feature selection. SHAP values provide insights into how individual features contribute to the predictions of a model.

Steps

  1. The dataset is transformed through the model's feature selection pipeline to ensure only the selected features are used for SHAP analysis.

  2. The final model (e.g., XGBoost classifier) is retrieved from the custom Model object. This is required because SHAP operates on the underlying model, not the pipeline.

  3. SHAP's TreeExplainer is used to explain the predictions of the XGBoost classifier.

  4. SHAP values are calculated for the transformed dataset to quantify the contribution of each feature to the predictions.

  5. A summary plot is generated to visualize the impact of each feature across all data points.

Step 1: Transform the test data using the feature selection pipeline

In [21]:
## The feature selection pipeline applies RFE to X_test so that only the
## selected features are passed to the SHAP analysis
X_test_transformed = model_xgb.get_feature_selection_pipeline().transform(X_test)

Step 2: Retrieve the trained XGBoost classifier from the pipeline

In [22]:
## The last estimator in the pipeline is the XGBoost model
xgb_classifier = model_xgb.estimator[-1]

Step 3: Extract the selected feature names, and initialize the SHAP explainer for the XGBoost classifier

In [23]:
## Import SHAP for model explainability
import shap

## Feature names are required for interpretability in SHAP plots. Use the
## names that survive feature selection so labels line up with the transformed
## data (X_train holds all columns, while RFE keeps only a subset); this
## assumes the pipeline follows scikit-learn's get_feature_names_out API
feature_names = list(
    model_xgb.get_feature_selection_pipeline().get_feature_names_out()
)

## Initialize the SHAP explainer with the model
explainer = shap.TreeExplainer(xgb_classifier)

Step 4: Compute SHAP values for the transformed test dataset and generate a summary plot of SHAP values

In [24]:
## Compute SHAP values for the transformed dataset
shap_values = explainer.shap_values(X_test_transformed)

Step 5: Generate a summary plot of SHAP values

In [25]:
## Plot SHAP values
## Summary plot of SHAP values for all features across all data points
shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names)

Reference

El-Sadr, W., & Abrams, D. (1998). AIDS Clinical Trials Group Study 175. UCI Machine Learning Repository.
https://doi.org/10.24432/C5G896.