IPython Notebooks
- Binary Classification Examples
- Regression Example
- Google Colab Notebook
- HTML File
Key Methods and Functionalities
__init__(...)
Initializes the model tuner with configurations, including estimator, cross-validation settings, scoring metrics, pipeline steps, feature selection, imbalance sampler, Bayesian search, and model calibration options.
reset_estimator()
Resets the estimator and pipeline configuration.
process_imbalance_sampler(X_train, y_train)
Processes the imbalance sampler, applying it to resample the training data.
calibrateModel(X, y, score=None)
Calibrates the model with cross-validation support and configurable calibration methods, improving probability estimates.
get_train_data(X, y), get_valid_data(X, y), get_test_data(X, y)
Retrieves train, validation, and test data based on specified indices.
calibrate_report(X, y, score=None)
Generates a calibration report, including a confusion matrix and classification report.
fit(X, y, validation_data=None, score=None)
Fits the model to training data and, if applicable, tunes threshold and performs early stopping. Allows feature selection and processing steps as part of the pipeline.
return_metrics(X_test, y_test, optimal_threshold=False)
Returns evaluation metrics with confusion matrix and classification report, optionally using optimized classification thresholds.
predict(X, y=None, optimal_threshold=False), predict_proba(X, y=None)
Makes predictions and predicts probabilities, allowing threshold tuning.
grid_search_param_tuning(X, y, f1_beta_tune=False, betas=[1, 2])
Performs grid or Bayesian search parameter tuning, optionally tuning F-beta score thresholds for classification.
print_selected_best_features(X)
Prints and returns the selected top K best features based on the feature selection step.
tune_threshold_Fbeta(score, y_valid, betas, y_valid_proba, kfold=False)
Tunes classification threshold for optimal F-beta score, balancing precision and recall across various thresholds.
train_val_test_split(X, y, stratify_y, train_size, validation_size, test_size, random_state, stratify_cols)
Splits data into train, validation, and test sets, supporting stratification by specific columns or the target variable.
get_best_score_params(X, y)
Retrieves the best hyperparameters for the model based on cross-validation scores for specified metrics.
conf_mat_class_kfold(X, y, test_model, score=None)
Generates and averages confusion matrices across k-folds, producing a combined classification report.
regression_report_kfold(X, y, test_model, score=None)
Generates averaged regression metrics across k-folds.
regression_report(y_true, y_pred, print_results=True)
Generates a regression report with metrics like Mean Absolute Error, R-squared, and Root Mean Squared Error.
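Taken together, these methods form a short, repeatable workflow. The snippet below is a minimal, illustrative sketch that mirrors the classification examples later on this page; the estimator, grid, and data (X, y) are placeholders to be replaced with your own.
from xgboost import XGBClassifier
from model_tuner import Model

## Placeholder estimator and grid; X and y are your feature matrix and target
model = Model(
    name="example_model",
    estimator_name="xgb",
    estimator=XGBClassifier(),
    model_type="classification",
    grid={"xgb__max_depth": [3, 5]},
    scoring=["roc_auc"],
)

model.grid_search_param_tuning(X, y)            ## tune hyperparameters (and thresholds)
X_train, y_train = model.get_train_data(X, y)   ## retrieve the internal splits
X_valid, y_valid = model.get_valid_data(X, y)
X_test, y_test = model.get_test_data(X, y)
model.fit(X_train, y_train, validation_data=[X_valid, y_valid])
metrics = model.return_metrics(X_test, y_test)  ## confusion matrix + classification report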
Helper Functions
kfold_split(classifier, X, y, stratify=False, scoring=["roc_auc"], n_splits=10, random_state=3)
Splits data using k-fold or stratified k-fold cross-validation.
get_cross_validate(classifier, X, y, kf, scoring=["roc_auc"])
Performs cross-validation and returns training scores and estimator instances.
_confusion_matrix_print(conf_matrix, labels)
Prints the formatted confusion matrix for binary classification.
print_pipeline(pipeline)
Displays an ASCII representation of the pipeline steps for visual clarity.
report_model_metrics(model, X_valid=None, y_valid=None, threshold=0.5)
Generates a DataFrame of key model performance metrics, including Precision, Sensitivity, Specificity, and AUC-ROC.
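As a brief illustration of how these helpers might be combined (the import path below is an assumption and may differ in your installed version; clf denotes any scikit-learn estimator and fitted_model a model that has already been fit):
## Assumed import path; adjust to wherever these helpers are exposed in your installation
from model_tuner import kfold_split, get_cross_validate, report_model_metrics

kf = kfold_split(clf, X, y, stratify=True, scoring=["roc_auc"], n_splits=5, random_state=3)
cv_results = get_cross_validate(clf, X, y, kf, scoring=["roc_auc"])
metrics_df = report_model_metrics(fitted_model, X_valid=X_valid, y_valid=y_valid, threshold=0.5)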
Note
This class is designed to be flexible and can be extended to include additional functionalities or custom metrics.
It is essential to properly configure the parameters during initialization to suit the specific requirements of your machine learning task.
Ensure that all dependencies are installed and properly imported before using the Model class from the model_tuner library.
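For example (assuming the PyPI package name matches the import name):
pip install model_tuner
and then, in your script:
from model_tuner import Model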
Input Parameters
- class Model(name, estimator_name, estimator, model_type, calibrate=False, kfold=False, imbalance_sampler=None, train_size=0.6, validation_size=0.2, test_size=0.2, stratify_y=False, stratify_cols=None, grid=None, scoring=['roc_auc'], n_splits=10, random_state=3, n_jobs=1, display=True, randomized_grid=False, n_iter=100, pipeline_steps=[], boost_early=False, feature_selection=False, class_labels=None, multi_label=False, calibration_method='sigmoid', custom_scorer=[], bayesian=False)
A class for building, tuning, and evaluating machine learning models, supporting both classification and regression tasks, as well as multi-label classification.
- Parameters:
name (str) – A unique name for the model, helpful for tracking outputs and logs.
estimator_name (str) – Prefix for the estimator in the pipeline, used for setting parameters in tuning (e.g., estimator_name + __param_name).
estimator (object) – The machine learning model to be trained and tuned.
model_type (str) – Specifies the type of model; must be either classification or regression.
calibrate (bool, optional) – Whether to calibrate the model’s probability estimates. Default is False.
kfold (bool, optional) – Whether to perform k-fold cross-validation. Default is False.
imbalance_sampler (object, optional) – An imbalanced data sampler from the imblearn library, e.g., RandomUnderSampler or RandomOverSampler.
train_size (float, optional) – Proportion of the data to be used for training. Default is 0.6.
validation_size (float, optional) – Proportion of the data to be used for validation. Default is 0.2.
test_size (float, optional) – Proportion of the data to be used for testing. Default is 0.2.
stratify_y (bool, optional) – Whether to stratify by the target variable during data splitting. Default is False.
stratify_cols (str, list, or pandas.DataFrame, optional) – Columns to use for stratification during data splitting. Can be a single column name (as a string), a list of column names (as strings), or a DataFrame containing the columns for stratification. Default is None.
grid (list of dict) – Hyperparameter grid for model tuning, supporting both regular and Bayesian search.
scoring (list of str) – List of scoring metrics for evaluation, e.g., ["roc_auc", "accuracy"].
n_splits (int, optional) – Number of splits for k-fold cross-validation. Default is 10.
random_state (int, optional) – Seed for random number generation to ensure reproducibility. Default is 3.
n_jobs (int, optional) – Number of parallel jobs to run for model fitting. Default is 1.
display (bool, optional) – Whether to print messages during the tuning and training process. Default is True.
randomized_grid (bool, optional) – Whether to use randomized grid search. Default is False.
n_iter (int, optional) – Number of iterations for randomized grid search. Default is 100.
pipeline_steps (list, optional) – List of steps for the pipeline, e.g., preprocessing and feature selection steps. Default is [].
boost_early (bool, optional) – Whether to enable early stopping for boosting algorithms like XGBoost. Default is False.
feature_selection (bool, optional) – Whether to enable feature selection. Default is False.
class_labels (list, optional) – List of labels for multi-class classification. Default is None.
multi_label (bool, optional) – Whether the task is a multi-label classification problem. Default is False.
calibration_method (str, optional) – Method for calibration; options include sigmoid and isotonic. Default is sigmoid.
custom_scorer (dict, optional) – Dictionary of custom scoring functions, allowing additional metrics to be evaluated. Default is [].
bayesian (bool, optional) – Whether to perform Bayesian hyperparameter tuning using BayesSearchCV. Default is False.
- Raises:
ImportError – If the bootstrapper module is not found or not installed.
ValueError – Raised for various issues, such as:
- Invalid model_type value. The model_type must be explicitly specified as either classification or regression.
- Invalid hyperparameter configurations or mismatched X and y shapes.
AttributeError – Raised if an expected pipeline step is missing, or if self.estimator is improperly initialized.
TypeError – Raised when an incorrect parameter type is provided, such as passing None instead of a valid object.
IndexError – Raised for indexing issues, particularly in confusion matrix formatting functions.
KeyError – Raised when accessing dictionary keys that are not available, such as missing scores in self.best_params_per_score.
RuntimeError – Raised for unexpected issues during model fitting or transformations that do not fit into the other exception categories.
Pipeline Management
The pipeline in the model tuner class is designed to automatically organize steps into three categories: preprocessing, feature selection, and imbalanced sampling. The steps are ordered in the following sequence:
Preprocessing:
Imputation
Scaling
Other preprocessing steps
Imbalanced Sampling
Feature Selection
Classifier
The pipeline_assembly method automatically sorts the steps into this order.
Specifying Pipeline Steps
Pipeline steps can be specified in multiple ways. For example, a named pipeline step is specified as a (name, transformer) tuple:
pipeline_steps = [("imputer", SimpleImputer())]
Naming each step is optional; steps can also be specified without names, like so:
pipeline_steps = [SimpleImputer(), StandardScaler(), RFE(ElasticNet())]
If no name is assigned, the step will be renamed automatically to follow the convention step_0, step_1, etc.
Column transformers can also be included in the pipeline and are automatically categorized under the preprocessing section.
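For instance, named steps, unnamed steps, and a ColumnTransformer can be mixed freely; the column names below are hypothetical, and the tuner sorts everything into the preprocessing section automatically.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## Hypothetical column groups; replace with your own feature names
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "cd40"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "race"]),
    ]
)

pipeline_steps = [
    ("imputer", SimpleImputer()),  ## named step
    preprocessor,                  ## unnamed step; renamed automatically (e.g., step_1)
]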
Helper Methods for Pipeline Extraction
To support advanced use cases, the model tuner provides helper methods to extract parts of the pipeline for later use. For example, when generating SHAP plots, users might only need the preprocessing section of the pipeline.
Here are some of the available methods:
- get_preprocessing_and_feature_selection_pipeline()
Extracts both the preprocessing and feature selection parts of the pipeline.
Example:
def get_preprocessing_and_feature_selection_pipeline(self):
    steps = [
        (name, transformer)
        for name, transformer in self.estimator.steps
        if name.startswith("preprocess_") or name.startswith("feature_selection_")
    ]
    return self.PipelineClass(steps)
- get_feature_selection_pipeline()
Extracts only the feature selection part of the pipeline.
Example:
def get_feature_selection_pipeline(self):
    steps = [
        (name, transformer)
        for name, transformer in self.estimator.steps
        if name.startswith("feature_selection_")
    ]
    return self.PipelineClass(steps)
- get_preprocessing_pipeline()
Extracts only the preprocessing part of the pipeline.
Example:
def get_preprocessing_pipeline(self):
    preprocessing_steps = [
        (name, transformer)
        for name, transformer in self.estimator.steps
        if name.startswith("preprocess_")
    ]
    return self.PipelineClass(preprocessing_steps)
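For instance, after fitting, the extracted steps can be applied to held-out data before running SHAP. The sketch below assumes a fitted Model instance (model_xgb) and a test split (X_test), as in the SHAP example later on this page.
## Transform raw test data with everything except the final classifier
preproc_fs = model_xgb.get_preprocessing_and_feature_selection_pipeline()
X_test_transformed = preproc_fs.transform(X_test)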
Summary
By organizing pipeline steps automatically and providing helper methods for extraction, the model tuner class offers flexibility and ease of use for building and managing complex pipelines. Users can focus on specifying the steps, and the tuner handles naming, sorting, and category assignments seamlessly.
Binary Classification
Binary classification is a type of supervised learning where a model is trained to distinguish between two distinct classes or categories. In essence, the model learns to classify input data into one of two possible outcomes, typically labeled as 0 and 1, or negative and positive. This is commonly used in scenarios such as spam detection, disease diagnosis, or fraud detection.
In our library, binary classification is handled seamlessly through the Model
class. Users can specify a binary classifier as the estimator, and the library
takes care of essential tasks like data preprocessing, model calibration, and
cross-validation. The library also provides robust support for evaluating the
model’s performance using a variety of metrics, such as accuracy, precision,
recall, and ROC-AUC, ensuring that the model’s ability to distinguish between the
two classes is thoroughly assessed. Additionally, the library supports advanced
techniques like imbalanced data handling and model calibration to fine-tune
decision thresholds, making it easier to deploy effective binary classifiers in
real-world applications.
AIDS Clinical Trials Group Study
The UCI Machine Learning Repository is a well-known resource for accessing a wide range of datasets used for machine learning research and practice. One such dataset is the AIDS Clinical Trials Group Study dataset, which can be used to build and evaluate predictive models.
You can easily fetch this dataset using the ucimlrepo package. If you haven’t installed it yet, you can do so by running the following command:
pip install ucimlrepo
Once installed, import the fetch utility; the AIDS Clinical Trials Group Study dataset itself is loaded in Step 2 below:
from ucimlrepo import fetch_ucirepo
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from model_tuner import Model
Step 2: Load the dataset, define X, y
# fetch dataset
aids_clinical_trials_group_study_175 = fetch_ucirepo(id=890)
# data (as pandas dataframes)
X = aids_clinical_trials_group_study_175.data.features
y = aids_clinical_trials_group_study_175.data.targets
y = y.squeeze() # convert a DataFrame to Series when single column
Step 3: Check for zero-variance columns and drop accordingly
# Check for zero-variance columns and drop them
zero_variance_columns = X.columns[X.var() == 0]
if not zero_variance_columns.empty:
X = X.drop(columns=zero_variance_columns)
Step 4: Create an Instance of the XGBClassifier
# Creating an instance of the XGBClassifier
xgb_name = "xgb"
xgb = XGBClassifier(
objective="binary:logistic",
random_state=222,
)
Step 5: Define Hyperparameters for XGBoost
xgbearly = True
tuned_parameters_xgb = {
f"{xgb_name}__max_depth": [3, 10, 20, 200, 500],
f"{xgb_name}__learning_rate": [1e-4],
f"{xgb_name}__n_estimators": [1000],
f"{xgb_name}__early_stopping_rounds": [100],
f"{xgb_name}__verbose": [0],
f"{xgb_name}__eval_metric": ["logloss"],
}
xgb_definition = {
"clc": xgb,
"estimator_name": xgb_name,
"tuned_parameters": tuned_parameters_xgb,
"randomized_grid": False,
"n_iter": 5,
"early": xgbearly,
}
Note
The verbose parameter in XGBoost allows you to control the level of output during training:
Set to 0 or False: Suppresses all training output (silent mode).
Set to 1 or True: Displays progress and evaluation metrics during training.
This can be particularly useful for monitoring model performance when early stopping is enabled.
Step 6: Initialize and Configure the Model
model_type = "xgb"
clc = xgb_definition["clc"]
estimator_name = xgb_definition["estimator_name"]
tuned_parameters = xgb_definition["tuned_parameters"]
n_iter = xgb_definition["n_iter"]
rand_grid = xgb_definition["randomized_grid"]
early_stop = xgb_definition["early"]
kfold = False
calibrate = True
# Initialize model_tuner
model_xgb = Model(
name=f"AIDS_Clinical_{model_type}",
estimator_name=estimator_name,
calibrate=calibrate,
estimator=clc,
model_type="classification",
kfold=kfold,
stratify_y=True,
stratify_cols=["gender", "race"],
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["roc_auc"],
random_state=222,
n_jobs=2,
)
Step 7: Perform Grid Search Parameter Tuning
# Perform grid search parameter tuning
model_xgb.grid_search_param_tuning(X, y, f1_beta_tune=True)
Pipeline Steps:
┌─────────────────┐
│ Step 1: xgb │
│ XGBClassifier │
└─────────────────┘
100%|██████████| 5/5 [00:19<00:00, 3.98s/it]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00, 3.42it/s]
Best score/param set found on validation set:
{'params': {'xgb__early_stopping_rounds': 100,
'xgb__eval_metric': 'logloss',
'xgb__learning_rate': 0.0001,
'xgb__max_depth': 3,
'xgb__n_estimators': 999},
'score': 0.9280033238366572}
Best roc_auc: 0.928
Step 8: Fit the Model
## Get the training and validation data
X_train, y_train = model_xgb.get_train_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
model_xgb.fit(
X_train,
y_train,
validation_data=[X_valid, y_valid],
)
Step 9: Return Metrics (Optional)
You can call the return_metrics() method at this stage to evaluate the model; printing its output produces the validation and test reports shown below.
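For example, mirroring the calls used in the RFE example further below (optimal_threshold=True applies the F-beta-tuned threshold; the exact arguments used to produce the reports shown here are an assumption):
print("Validation Metrics")
model_xgb.return_metrics(
    X_valid,
    y_valid,
    optimal_threshold=True,
)
print("Test Metrics")
model_xgb.return_metrics(
    X_test,
    y_test,
    optimal_threshold=True,
)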
Validation Metrics
Confusion matrix on set provided:
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 95 (tp) 9 (fn)
Neg 79 (fp) 245 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9280033238366572,
'Average Precision': 0.7992275185850191,
'Brier Score': 0.16713189436073958,
'Precision/PPV': 0.5459770114942529,
'Sensitivity': 0.9134615384615384,
'Specificity': 0.7561728395061729}
--------------------------------------------------------------------------------
precision recall f1-score support
0 0.96 0.76 0.85 324
1 0.55 0.91 0.68 104
accuracy 0.79 428
macro avg 0.76 0.83 0.77 428
weighted avg 0.86 0.79 0.81 428
--------------------------------------------------------------------------------
Test Metrics
Confusion matrix on set provided:
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 95 (tp) 9 (fn)
Neg 78 (fp) 246 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.934576804368471,
'Average Precision': 0.8023014087345259,
'Brier Score': 0.16628708993634742,
'Precision/PPV': 0.5491329479768786,
'Sensitivity': 0.9134615384615384,
'Specificity': 0.7592592592592593}
--------------------------------------------------------------------------------
precision recall f1-score support
0 0.96 0.76 0.85 324
1 0.55 0.91 0.69 104
accuracy 0.80 428
macro avg 0.76 0.84 0.77 428
weighted avg 0.86 0.80 0.81 428
--------------------------------------------------------------------------------
Step 10: Calibrate the Model (if needed)
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
## Get the predicted probabilities for the validation data from uncalibrated model
y_prob_uncalibrated = model_xgb.predict_proba(X_test)[:, 1]
## Compute the calibration curve for the uncalibrated model
prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(
y_test,
y_prob_uncalibrated,
n_bins=10,
)
## Calibrate the model
if model_xgb.calibrate:
model_xgb.calibrateModel(X, y, score="roc_auc")
## Predict on the validation set
y_test_pred = model_xgb.predict_proba(X_test)[:, 1]
Change back to CPU
Confusion matrix on validation set for roc_auc
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 70 (tp) 34 (fn)
Neg 9 (fp) 315 (tn)
--------------------------------------------------------------------------------
precision recall f1-score support
0 0.90 0.97 0.94 324
1 0.89 0.67 0.77 104
accuracy 0.90 428
macro avg 0.89 0.82 0.85 428
weighted avg 0.90 0.90 0.89 428
--------------------------------------------------------------------------------
roc_auc after calibration: 0.9280033238366572
## Get the predicted probabilities for the validation data from calibrated model
y_prob_calibrated = model_xgb.predict_proba(X_test)[:, 1]
## Compute the calibration curve for the calibrated model
prob_true_calibrated, prob_pred_calibrated = calibration_curve(
y_test,
y_prob_calibrated,
n_bins=10,
)
## Plot the calibration curves
plt.figure(figsize=(5, 5))
plt.plot(
prob_pred_uncalibrated,
prob_true_uncalibrated,
marker="o",
label="Uncalibrated XGBoost",
)
plt.plot(
prob_pred_calibrated,
prob_true_calibrated,
marker="o",
label="Calibrated XGBoost",
)
plt.plot(
[0, 1],
[0, 1],
linestyle="--",
label="Perfectly calibrated",
)
plt.xlabel("Predicted probability")
plt.ylabel("True probability in each bin")
plt.title("Calibration plot (reliability curve)")
plt.legend()
plt.show()
Classification Report (Optional)
A classification report is readily available at this stage, should you wish to print and examine it. A call to print(model_xgb.classification_report) will output it as follows:
print(model_xgb.classification_report)
precision recall f1-score support
0 0.90 0.97 0.94 324
1 0.89 0.67 0.77 104
accuracy 0.90 428
macro avg 0.89 0.82 0.85 428
weighted avg 0.90 0.90 0.89 428
Recursive Feature Elimination (RFE)
Now that we’ve trained the models, we can also refine them by identifying which features contribute most to their performance. One effective method for this is Recursive Feature Elimination (RFE). This technique allows us to systematically remove the least important features, retraining the model at each step to evaluate how performance is affected. By focusing only on the most impactful variables, RFE helps streamline the dataset, reduce noise, and improve both the accuracy and interpretability of the final model.
It works by recursively training a model, ranking the importance of features based on the model’s output (such as coefficients in linear models or importance scores in tree-based models), and then removing the least important features one by one. This process continues until a specified number of features remains or the desired performance criteria are met.
The primary advantage of RFE is its ability to streamline datasets, improving model performance and interpretability by focusing on features that contribute the most to the predictive power. However, it can be computationally expensive since it involves repeated model training, and its effectiveness depends on the underlying model’s ability to evaluate feature importance. RFE is commonly used with cross-validation to ensure that the selected features generalize well across datasets, making it a robust choice for model optimization and dimensionality reduction.
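As a minimal, standalone sketch of the mechanism using scikit-learn directly (independent of the tuner; the toy data and estimator below are purely illustrative):
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

## Toy data purely for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

## Recursively drop the weakest features until 5 remain
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)   ## boolean mask of retained features
print(rfe_demo.ranking_)   ## 1 = selected; larger ranks were eliminated earlier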
As an illustrative example, we will retrain the above model using RFE.
We will begin by appending the feature selection technique to our tuned parameters dictionary.
xgb_definition["tuned_parameters"][f"feature_selection_rfe__n_features_to_select"] = [
5,
10,
]
Elastic Net for Feature Selection with RFE
Note
You may wish to explore this section for the rationale behind applying this technique.
We will use elastic net because it strikes a balance between two widely used regularization techniques: Lasso (\(L1\)) and Ridge (\(L2\)). Elastic net is particularly effective in scenarios where we expect the dataset to have a mix of strongly and weakly correlated features. Lasso alone tends to select only one feature from a group of highly correlated ones, ignoring the others, while Ridge includes all features but may not perform well when some are entirely irrelevant. Elastic net addresses this limitation by combining both penalties, allowing it to handle multicollinearity more effectively while still performing feature selection.
Additionally, elastic net provides flexibility by controlling the ratio between \(L1\) and \(L2\) penalties, enabling fine-tuning to suit the specific needs of our dataset. This makes it a robust choice for datasets with many features, some of which may be irrelevant or redundant, as it can reduce overfitting while retaining a manageable subset of predictors.
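For reference, scikit-learn's ElasticNet (used as the RFE estimator below) minimizes an objective of the following form, where \(\alpha\) scales the overall penalty and \(\rho\) (the l1_ratio) controls the mix of the \(L1\) and \(L2\) terms:
\[
\min_{w}\; \frac{1}{2\,n_{\text{samples}}}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
\]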
from sklearn.feature_selection import RFE
from sklearn.linear_model import ElasticNet

from model_tuner import Model

rfe_estimator = ElasticNet()
rfe = RFE(rfe_estimator)
model_xgb = Model(
name=f"AIDS_Clinical_{model_type}",
estimator_name=estimator_name,
calibrate=calibrate,
estimator=clc,
model_type="classification",
kfold=kfold,
pipeline_steps=[
("rfe", rfe),
],
stratify_y=True,
stratify_cols=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
feature_selection=True,
boost_early=early_stop,
scoring=["roc_auc"],
random_state=222,
n_jobs=2,
)
model_xgb.grid_search_param_tuning(X, y, f1_beta_tune=True)
X_train, y_train = model_xgb.get_train_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)
model_xgb.fit(
X_train,
y_train,
validation_data=[X_valid, y_valid],
)
# ------------------------- VALID AND TEST METRICS -----------------------------
print("Validation Metrics")
model_xgb.return_metrics(
X_valid,
y_valid,
optimal_threshold=True,
)
print()
print("Test Metrics")
model_xgb.return_metrics(
X_test,
y_test,
optimal_threshold=True,
)
print()
┌─────────────────────────────────┐
│ Step 1: feature_selection_rfe │
│ RFE │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Step 2: xgb │
│ XGBClassifier │
└─────────────────────────────────┘
100%|██████████| 10/10 [00:25<00:00, 2.52s/it]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00, 3.53it/s]
Best score/param set found on validation set:
{'params': {'feature_selection_rfe__n_features_to_select': 10,
'xgb__early_stopping_rounds': 100,
'xgb__eval_metric': 'logloss',
'xgb__learning_rate': 0.0001,
'xgb__max_depth': 10,
'xgb__n_estimators': 999},
'score': 0.9316684472934472}
Best roc_auc: 0.932
Validation Metrics
Confusion matrix on set provided:
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 95 (tp) 9 (fn)
Neg 70 (fp) 254 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9316981244064577,
'Average Precision': 0.8206553111036822,
'Brier Score': 0.16608154668556174,
'Precision/PPV': 0.5757575757575758,
'Sensitivity': 0.9134615384615384,
'Specificity': 0.7839506172839507}
--------------------------------------------------------------------------------
precision recall f1-score support
0 0.97 0.78 0.87 324
1 0.58 0.91 0.71 104
accuracy 0.82 428
macro avg 0.77 0.85 0.79 428
weighted avg 0.87 0.82 0.83 428
--------------------------------------------------------------------------------
Feature names selected:
['time', 'preanti', 'strat', 'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80', 'cd820']
Test Metrics
Confusion matrix on set provided:
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 91 (tp) 13 (fn)
Neg 70 (fp) 254 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9278104226020893,
'Average Precision': 0.8133787683637559,
'Brier Score': 0.1658272032260468,
'Precision/PPV': 0.5652173913043478,
'Sensitivity': 0.875,
'Specificity': 0.7839506172839507}
--------------------------------------------------------------------------------
precision recall f1-score support
0 0.95 0.78 0.86 324
1 0.57 0.88 0.69 104
accuracy 0.81 428
macro avg 0.76 0.83 0.77 428
weighted avg 0.86 0.81 0.82 428
--------------------------------------------------------------------------------
Feature names selected:
['time', 'preanti', 'strat', 'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80', 'cd820']
Important
Passing feature_selection=True in conjunction with including rfe in the pipeline_steps of the Model class above is necessary to print the output of the feature names selected, thus yielding:
Feature names selected:
['offtrt', 'cd40', 'cd420', 'cd80', 'cd820']
Imbalanced Learning
In machine learning, imbalanced datasets are a frequent challenge, especially in real-world scenarios. These datasets have an unequal distribution of target classes, with one class (e.g., fraudulent transactions, rare diseases, or other low-frequency events) being underrepresented compared to the majority class. Models trained on imbalanced data often struggle to generalize, as they tend to favor the majority class, leading to poor performance on the minority class.
To mitigate these issues, it is crucial to:
Understand the nature of the imbalance in the dataset.
Apply appropriate resampling techniques (oversampling, undersampling, or hybrid methods).
Use metrics beyond accuracy, such as precision, recall, and F1-score, to evaluate model performance fairly.
Generating an Imbalanced Dataset
Demonstrated below are the steps to generate an imbalanced dataset using make_classification from the sklearn.datasets module. The following parameters are specified:
n_samples=1000: The dataset contains 1,000 samples.
n_features=20: Each sample has 20 features.
n_informative=2: Two features are informative for predicting the target.
n_redundant=2: Two features are linear combinations of the informative features.
weights=[0.9, 0.1]: The target class distribution is 90% for the majority class and 10% for the minority class, creating an imbalance.
flip_y=0: No label noise is added to the target variable.
random_state=42: Ensures reproducibility by using a fixed random seed.
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=2,
n_redundant=2,
n_clusters_per_class=1,
weights=[0.9, 0.1],
flip_y=0,
random_state=42,
)
## Convert to a pandas DataFrame for better visualization
data = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(1, 21)])
data['target'] = y
X = data[[col for col in data.columns if "target" not in col]]
y = pd.Series(data["target"])
Below, you will see that the dataset we have generated is severely imbalanced with 900 observations allocated to the majority class (0) and 100 observations to the minority class (1).
import matplotlib.pyplot as plt
## Create a bar plot
value_counts = pd.Series(y).value_counts()
ax = value_counts.plot(
kind="bar",
rot=0,
width=0.9,
)
## Add labels inside the bars
for index, count in enumerate(value_counts):
plt.text(
index,
count / 2,
str(count),
ha="center",
va="center",
color="yellow",
)
## Customize labels and title
plt.xlabel("Class")
plt.ylabel("Count")
plt.title("Class Distribution")
plt.show() ## Show the plot
Define Hyperparameters for XGBoost
Below, we will use an XGBoost classifier with the following hyperparameters:
from xgboost import XGBClassifier
xgb_name = "xgb"
xgb = XGBClassifier(
random_state=222,
)
xgbearly = True
tuned_parameters_xgb = {
f"{xgb_name}__max_depth": [3, 10, 20, 200, 500],
f"{xgb_name}__learning_rate": [1e-4],
f"{xgb_name}__n_estimators": [1000],
f"{xgb_name}__early_stopping_rounds": [100],
f"{xgb_name}__verbose": [0],
f"{xgb_name}__eval_metric": ["logloss"],
}
xgb_definition = {
"clc": xgb,
"estimator_name": xgb_name,
"tuned_parameters": tuned_parameters_xgb,
"randomized_grid": False,
"n_iter": 5,
"early": xgbearly,
}
Define The Model object
model_type = "xgb"
clc = xgb_definition["clc"]
estimator_name = xgb_definition["estimator_name"]
tuned_parameters = xgb_definition["tuned_parameters"]
n_iter = xgb_definition["n_iter"]
rand_grid = xgb_definition["randomized_grid"]
early_stop = xgb_definition["early"]
kfold = False
calibrate = True
Addressing Class Imbalance in Machine Learning
Class imbalance occurs when one class significantly outweighs another in the dataset, leading to biased models that perform well on the majority class but poorly on the minority class. Techniques like SMOTE and others aim to address this issue by improving the representation of the minority class, ensuring balanced learning and better generalization.
Techniques to Address Class Imbalance
Resampling Techniques
SMOTE (Synthetic Minority Oversampling Technique): SMOTE generates synthetic samples for the minority class by interpolating between existing minority class data points and their nearest neighbors. This helps create a more balanced class distribution without merely duplicating data, thus avoiding overfitting.
Oversampling: Randomly duplicates examples from the minority class to balance the dataset. While simple, it risks overfitting to the duplicated examples.
Undersampling: Reduces the majority class by randomly removing samples. While effective, it can lead to loss of important information.
Purpose of Using These Techniques
The goal of using these techniques is to improve model performance on imbalanced datasets, specifically by:
Ensuring the model captures meaningful patterns in the minority class.
Reducing bias toward the majority class, which often dominates predictions in imbalanced datasets.
Improving metrics like recall, F1-score, and AUC-ROC for the minority class, which are critical in applications like fraud detection, healthcare, and rare event prediction.
Note
While we provide comprehensive examples for SMOTE, ADASYN, and
RandomUnderSampler in the accompanying notebook,
this documentation section demonstrates the implementation of SMOTE. The other
examples follow a similar workflow and can be executed by simply setting the
imbalance_sampler input to ADASYN() or RandomUnderSampler(), as needed. For
detailed examples of all methods, please refer to the linked notebook. A brief
sketch of this swap is shown below.
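For instance, an ADASYN-based run might be configured as follows (a sketch only: every argument other than imbalance_sampler is identical to the SMOTE configuration shown below, and RandomUnderSampler() can be substituted in the same way):
from imblearn.over_sampling import ADASYN
from model_tuner import Model

xgb_adasyn = Model(
    name=f"Make_Classification_{model_type}",
    estimator_name=estimator_name,
    calibrate=calibrate,
    model_type="classification",
    estimator=clc,
    kfold=kfold,
    stratify_y=True,
    stratify_cols=False,
    grid=tuned_parameters,
    randomized_grid=rand_grid,
    boost_early=early_stop,
    scoring=["roc_auc"],
    random_state=222,
    n_jobs=2,
    imbalance_sampler=ADASYN(),  ## or RandomUnderSampler() from imblearn.under_sampling
)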
Synthetic Minority Oversampling Technique (SMOTE)
SMOTE (Synthetic Minority Oversampling Technique) is a method used to address class imbalance in datasets. It generates synthetic samples for the minority class by interpolating between existing minority samples and their nearest neighbors, effectively increasing the size of the minority class without duplicating data. This helps models better learn patterns from the minority class, improving classification performance on imbalanced datasets.
Initialize and Configure the Model
Important
In the code block below, we initialize and configure the model by calling the Model class and assign it to a new variable called xgb_smote. Notice that we pass imbalance_sampler=SMOTE() as a necessary step to activate this imbalanced sampler.
from model_tuner import Model
from imblearn.over_sampling import SMOTE
xgb_smote = Model(
name=f"Make_Classification_{model_type}",
estimator_name=estimator_name,
calibrate=calibrate,
model_type="classification",
estimator=clc,
kfold=kfold,
stratify_y=True,
stratify_cols=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["roc_auc"],
random_state=222,
n_jobs=2,
imbalance_sampler=SMOTE(),
)
Perform Grid Search Parameter Tuning and Retrieve Split Data
xgb_smote.grid_search_param_tuning(
X,
y,
f1_beta_tune=True,
)
X_train, y_train = xgb_smote.get_train_data(X, y)
X_test, y_test = xgb_smote.get_test_data(X, y)
X_valid, y_valid = xgb_smote.get_valid_data(X, y)
Pipeline Steps:
┌─────────────────────┐
│ Step 1: resampler │
│ SMOTE │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Step 2: xgb │
│ XGBClassifier │
└─────────────────────┘
Distribution of y values after resampling: target
0 540
1 540
Name: count, dtype: int64
100%|██████████| 5/5 [00:34<00:00, 6.87s/it]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00, 4.37it/s]
Best score/param set found on validation set:
{'params': {'xgb__early_stopping_rounds': 100,
'xgb__eval_metric': 'logloss',
'xgb__learning_rate': 0.0001,
'xgb__max_depth': 10,
'xgb__n_estimators': 999},
'score': 0.9990277777777777}
Best roc_auc: 0.999
SMOTE: Distribution of y values after resampling
Notice that the target has been redistributed after SMOTE to 540 observations for the minority class and 540 observations for the majority class.
Fit The Model
xgb_smote.fit(
X_train,
y_train,
validation_data=[X_valid, y_valid],
)
Return Metrics (Optional)
Validation Metrics
Confusion matrix on set provided:
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 20 (tp) 0 (fn)
Neg 6 (fp) 174 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9955555555555555,
'Average Precision': 0.9378696741854636,
'Brier Score': 0.20835571676988004,
'Precision/PPV': 0.7692307692307693,
'Sensitivity': 1.0,
'Specificity': 0.9666666666666667}
--------------------------------------------------------------------------------
precision recall f1-score support
0 1.00 0.97 0.98 180
1 0.77 1.00 0.87 20
accuracy 0.97 200
macro avg 0.88 0.98 0.93 200
weighted avg 0.98 0.97 0.97 200
--------------------------------------------------------------------------------
Test Metrics
Confusion matrix on set provided:
--------------------------------------------------------------------------------
Predicted:
Pos Neg
--------------------------------------------------------------------------------
Actual: Pos 19 (tp) 1 (fn)
Neg 3 (fp) 177 (tn)
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
{'AUC ROC': 0.9945833333333333,
'Average Precision': 0.9334649122807017,
'Brier Score': 0.20820269480995568,
'Precision/PPV': 0.8636363636363636,
'Sensitivity': 0.95,
'Specificity': 0.9833333333333333}
--------------------------------------------------------------------------------
precision recall f1-score support
0 0.99 0.98 0.99 180
1 0.86 0.95 0.90 20
accuracy 0.98 200
macro avg 0.93 0.97 0.95 200
weighted avg 0.98 0.98 0.98 200
--------------------------------------------------------------------------------
SHAP (SHapley Additive exPlanations)
This example demonstrates how to compute and visualize SHAP (SHapley Additive exPlanations) values for a machine learning model with a pipeline that includes feature selection. SHAP values provide insights into how individual features contribute to the predictions of a model.
Steps
The dataset is transformed through the model’s feature selection pipeline to ensure only the selected features are used for SHAP analysis.
The final model (e.g., the XGBoost classifier) is retrieved from the custom Model object. This is required because SHAP operates on the underlying model, not the pipeline.
SHAP’s TreeExplainer is used to explain the predictions of the XGBoost classifier.
SHAP values are calculated for the transformed dataset to quantify the contribution of each feature to the predictions.
A summary plot is generated to visualize the impact of each feature across all data points.
Step 1: Transform the test data using the feature selection pipeline
## The pipeline applies preprocessing (e.g., imputation, scaling) and feature
## selection (RFE) to X_test
X_test_transformed = model_xgb.get_feature_selection_pipeline().transform(X_test)
Step 2: Retrieve the trained XGBoost classifier from the pipeline
## The last estimator in the pipeline is the XGBoost model
xgb_classifier = model_xgb.estimator[-1]
Step 3: Extract feature names from the training data, and initialize the SHAP explainer for the XGBoost classifier
## Import SHAP for model explainability
import shap
## Feature names are required for interpretability in SHAP plots
feature_names = X_train.columns.to_list()
## Initialize the SHAP explainer with the model
explainer = shap.TreeExplainer(xgb_classifier)
Step 4: Compute SHAP values for the transformed test dataset
## Compute SHAP values for the transformed dataset
shap_values = explainer.shap_values(X_test_transformed)
Step 5: Generate a summary plot of SHAP values
## Plot SHAP values
## Summary plot of SHAP values for all features across all data points
shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names,)
Feature Importance and Impact
This SHAP summary plot provides a detailed visualization of how each feature
contributes to the model’s predictions, offering insight into feature importance
and their directional effects. The X-axis represents SHAP values, which quantify
the magnitude and direction of a feature’s influence. Positive SHAP values
indicate that the feature increases the predicted output, while negative values
suggest a decrease. Along the Y-axis, features are ranked by their overall importance,
with the most influential features, such as time, positioned at the top.
Each point on the plot corresponds to an individual observation, where the color gradient reflects the feature value. Blue points represent lower feature values, while pink points indicate higher values, allowing us to observe how varying feature values affect the prediction. For example, the time feature shows a wide range of SHAP values, with higher values (pink) strongly increasing the prediction and lower values (blue) reducing it, demonstrating its critical role in driving the model’s output.
In contrast, features like hemo and age exhibit SHAP values closer to zero, signifying a lower overall impact on predictions. Features such as homo, karnof, and trt show more variability in their influence, indicating that their effect is context-dependent and can significantly shift predictions in certain cases. This plot provides a holistic view of feature behavior, enabling a deeper understanding of the model’s decision-making process.
Regression
Here is an example of using the Model class for regression with XGBoost on the California Housing dataset.
California Housing with XGBoost
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.datasets import fetch_california_housing
from model_tuner import Model
Step 2: Load the Dataset
# Load the California Housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
Step 3: Create an Instance of the XGBRegressor
xgb_name = "xgb"
xgb = XGBRegressor(random_state=222)
Step 4: Define Hyperparameters for XGBoost
tuned_parameters_xgb = [
{
f"{xgb_name}__learning_rate": [0.1, 0.01, 0.05],
f"{xgb_name}__n_estimators": [100, 200, 300], # Number of trees.
f"{xgb_name}__max_depth": [3, 5, 7][:1], # Maximum depth of the trees
f"{xgb_name}__subsample": [0.8, 1.0][:1], # Subsample ratio of the
# training instances
f"{xgb_name}__colsample_bytree": [0.8, 1.0][:1],
f"{xgb_name}__eval_metric": ["logloss"],
f"{xgb_name}__early_stopping_rounds": [10],
f"{xgb_name}__tree_method": ["hist"],
f"{xgb_name}__verbose": [False],
}
]
xgb_definition = {
"clc": xgb,
"estimator_name": xgb_name,
"tuned_parameters": tuned_parameters_xgb,
"randomized_grid": False,
"early": True,
}
model_definition = {xgb_name: xgb_definition}
Step 5: Initialize and Configure the Model
XGBRegressor inherently handles missing values (NaN) without requiring explicit imputation strategies. During training, XGBoost treats missing values as a separate category and learns how to route them within its decision trees. Therefore, passing a SimpleImputer or using an imputation strategy is unnecessary when using XGBRegressor.
kfold = False
calibrate = False
# Define model object
model_type = "xgb"
clc = model_definition[model_type]["clc"]
estimator_name = model_definition[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definition[model_type]["tuned_parameters"]
rand_grid = model_definition[model_type]["randomized_grid"]
early_stop = model_definition[model_type]["early"]
model_xgb = Model(
name=f"xgb_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=222,
n_jobs=2,
)
Step 6: Perform Grid Search Parameter Tuning and Retrieve Split Data
model_xgb.grid_search_param_tuning(X, y,)
X_train, y_train = model_xgb.get_train_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)
Pipeline Steps:
┌────────────────┐
│ Step 1: xgb │
│ XGBRegressor │
└────────────────┘
100%|██████████| 9/9 [00:22<00:00, 2.45s/it]
Best score/param set found on validation set:
{'params': {'xgb__colsample_bytree': 0.8,
'xgb__early_stopping_rounds': 10,
'xgb__eval_metric': 'logloss',
'xgb__learning_rate': 0.1,
'xgb__max_depth': 3,
'xgb__n_estimators': 67,
'xgb__subsample': 0.8,
'xgb__tree_method': 'hist'},
'score': 0.7651490279157868}
Best r2: 0.765
Step 7: Fit the Model
model_xgb.fit(
X_train,
y_train,
validation_data=[X_valid, y_valid],
)
Step 8: Return Metrics (Optional)
Validation Metrics
********************************************************************************
{'Explained Variance': 0.7647451659057567,
'Mean Absolute Error': 0.3830825326824073,
'Mean Squared Error': 0.3066172248224347,
'Median Absolute Error': 0.2672762813568116,
'R2': 0.7647433075624044,
'RMSE': 0.5537302816556403}
********************************************************************************
Test Metrics
********************************************************************************
{'Explained Variance': 0.7888942913974833,
'Mean Absolute Error': 0.3743548199982513,
'Mean Squared Error': 0.28411432705731066,
'Median Absolute Error': 0.26315186452865597,
'R2': 0.7888925135381788,
'RMSE': 0.533023758436067}
********************************************************************************
{'Explained Variance': 0.7888942913974833,
'R2': 0.7888925135381788,
'Mean Absolute Error': 0.3743548199982513,
'Median Absolute Error': 0.26315186452865597,
'Mean Squared Error': 0.28411432705731066,
'RMSE': 0.533023758436067}
Bootstrap Metrics
The bootstrapper.py
module provides utility functions for input type checking, data resampling, and evaluating bootstrap metrics.
- check_input_type(x)
Validates and normalizes the input type for data processing. Converts NumPy arrays, Pandas Series, and DataFrames into a standard Pandas DataFrame with a reset index.
- Parameters:
x (array-like) – Input data (NumPy array, Pandas Series, or DataFrame).
- Returns:
Normalized input as a Pandas DataFrame.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If the input type is not supported.
- sampling_method(y, n_samples, stratify=False, balance=False, class_proportions=None)
Resamples a dataset based on specified options for balancing, stratification, or custom class proportions.
- Parameters:
y (pandas.Series) – Target variable to resample.
n_samples (int) – Number of samples to draw.
stratify (bool, optional) – Whether to stratify based on the provided target variable.
balance (bool, optional) – Whether to balance class distributions equally.
class_proportions (dict, optional) – Custom proportions for each class. Must sum to 1.
- Returns:
Resampled target variable.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If class proportions do not sum to 1.
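As an illustrative sketch (the import path below is an assumption; adjust it to wherever the bootstrapper module is exposed in your installation):
## Assumed import path for the bootstrapper module
from model_tuner.bootstrapper import sampling_method

## Draw a balanced bootstrap of the target
y_balanced = sampling_method(y, n_samples=200, balance=True)

## Or draw with explicit class proportions (must sum to 1)
y_custom = sampling_method(y, n_samples=200, class_proportions={0: 0.7, 1: 0.3})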
- evaluate_bootstrap_metrics(model=None, X=None, y=None, y_pred_prob=None, n_samples=500, num_resamples=1000, metrics=['roc_auc', 'f1_weighted', 'average_precision'], random_state=42, threshold=0.5, model_type='classification', stratify=None, balance=False, class_proportions=None)
Evaluates classification or regression metrics on bootstrap samples using a pre-trained model or pre-computed predictions.
- Parameters:
model (object, optional) – Pre-trained model with a predict_proba method. Required if y_pred_prob is not provided.
X (array-like, optional) – Input features. Not required if y_pred_prob is provided.
y (array-like) – Ground truth labels.
y_pred_prob (array-like, optional) – Pre-computed predicted probabilities.
n_samples (int, optional) – Number of samples per bootstrap iteration. Default is 500.
num_resamples (int, optional) – Number of bootstrap iterations. Default is 1000.
metrics (list of str) – List of metrics to calculate (e.g., "roc_auc", "f1_weighted").
random_state (int, optional) – Random seed for reproducibility. Default is 42.
threshold (float, optional) – Classification threshold for probability predictions. Default is 0.5.
model_type (str) – Specifies the task type, either "classification" or "regression".
stratify (pandas.Series, optional) – Variable for stratified sampling.
balance (bool, optional) – Whether to balance class distributions.
class_proportions (dict, optional) – Custom class proportions for sampling.
- Returns:
DataFrame with mean and confidence intervals for each metric.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If invalid parameters or metrics are provided.
RuntimeError – If sample size is insufficient for metric calculation.
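A minimal sketch using pre-computed predicted probabilities, so no model object is required (the import path below is an assumption):
## Assumed import path for the bootstrapper module
from model_tuner.bootstrapper import evaluate_bootstrap_metrics

ci_df = evaluate_bootstrap_metrics(
    y=y_test,                        ## ground-truth labels
    y_pred_prob=y_prob_calibrated,   ## probabilities from an earlier prediction step
    n_samples=300,
    num_resamples=500,
    metrics=["roc_auc", "average_precision"],
    threshold=0.5,
    model_type="classification",
)
print(ci_df)  ## mean and confidence interval per metric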
Note
The model_tuner_utils.py
module includes utility functions for evaluating bootstrap metrics in the context of model tuning.
- return_bootstrap_metrics(X_test, y_test, metrics, threshold=0.5, num_resamples=500, n_samples=500, balance=False)
Evaluates bootstrap metrics for a trained model using the test dataset. This function supports both classification and regression tasks by leveraging evaluate_bootstrap_metrics to compute confidence intervals for the specified metrics.
- Parameters:
X_test (pandas.DataFrame) – Test dataset features.
y_test (pandas.Series or pandas.DataFrame) – Test dataset labels.
metrics (list of str) – List of metric names to calculate (e.g., "roc_auc", "f1_weighted").
threshold (float, optional) – Threshold for converting predicted probabilities into class predictions. Default is 0.5.
num_resamples (int, optional) – Number of bootstrap iterations. Default is 500.
n_samples (int, optional) – Number of samples per bootstrap iteration. Default is 500.
balance (bool, optional) – Whether to balance the class distribution during resampling. Default is False.
- Returns:
DataFrame containing mean and confidence intervals for the specified metrics.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If X_test or y_test are not provided as Pandas DataFrames, or if unsupported input types are specified.
Bootstrap Metrics Example
Continuing from the model output object (model_xgb) from the regression example above, we leverage the return_bootstrap_metrics method from model_tuner_utils.py to print bootstrap performance metrics (\(R^2\) and \(\text{explained variance}\)) at 95% confidence levels, as shown below:
print("Bootstrap Metrics")
model_xgb.return_bootstrap_metrics(
X_test=X_test,
y_test=y_test,
metrics=["r2", "explained_variance"],
n_samples=30,
num_resamples=300,
)
Bootstrap Metrics
100%|██████████| 300/300 [00:00<00:00, 358.05it/s]
Metric Mean 95% CI Lower 95% CI Upper
0 r2 0.781523 0.770853 0.792193
1 explained_variance 0.788341 0.777898 0.798785