Model Tuner Library Instructions

This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.

Model Tuner Description

The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.

Documentation

For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.

By following these steps, you should be able to install and use the model_tuner library effectively in your notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.

Installation

To install the model_tuner library, use the following command:

In [1]:
! pip install model_tuner
! pip install seaborn
Collecting model_tuner
  Downloading model_tuner-0.0.15a0-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: joblib>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.4.2)
Requirement already satisfied: numpy>=1.21.6 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.26.4)
Requirement already satisfied: pandas>=1.3.5 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.3.2)
Requirement already satisfied: scipy>=1.7.3 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.13.1)
Requirement already satisfied: tqdm>=4.66.4 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (4.66.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0.2->model_tuner) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.3.5->model_tuner) (1.16.0)
Downloading model_tuner-0.0.15a0-py3-none-any.whl (20 kB)
Installing collected packages: model_tuner
Successfully installed model_tuner-0.0.15a0
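
To confirm which versions ended up in the environment, the standard library can be queried directly. This is a minimal sketch; the helper name installed_version is illustrative and not part of model_tuner.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Packages this notebook relies on; None means "not installed".
for name in ("model_tuner", "seaborn", "scikit-learn"):
    print(name, installed_version(name))
```

A None for any of these suggests the install cell should be re-run before continuing.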

Importing the Library

After installation, you can import the necessary components from the model_tuner library as shown below:

In [2]:
from model_tuner import Model
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

Binary Classification with the Titanic Dataset and a Pipeline

In [3]:
titanic = sns.load_dataset('titanic')
titanic.head()
Out[3]:
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
4         0       3    male  35.0      0      0   8.0500         S  Third    man        True   NaN  Southampton     no   True
In [4]:
X = titanic[[col for col in titanic.columns if col != "survived"]]
# Drop columns that repeat other columns:
# alive mirrors survived, class mirrors pclass, embarked mirrors embark_town
X = X.drop(columns=['alive', 'class', 'embarked'])
y = titanic['survived']
In [5]:
rf = RandomForestClassifier(class_weight="balanced")

estimator_name = "rf"

rf_pipeline_hyperparams_grid = {
    f"{estimator_name}__max_depth": [3, 5, 10, None],
    f"{estimator_name}__n_estimators": [10, 100, 200],
    f"{estimator_name}__max_features": [1, 3, 5, 7],
    f"{estimator_name}__min_samples_leaf": [1, 2, 3],
}
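
To get a feel for the size of this search space, the combinations can be enumerated with the standard library. The sketch below repeats the grid so that it runs on its own. The sampling uses Python's random module purely for illustration and is not expected to reproduce the candidates that a randomized search with n_iter=5 would pick.

```python
import random
from itertools import product

# Same grid as above, repeated so the snippet is self-contained.
estimator_name = "rf"
grid = {
    f"{estimator_name}__max_depth": [3, 5, 10, None],
    f"{estimator_name}__n_estimators": [10, 100, 200],
    f"{estimator_name}__max_features": [1, 3, 5, 7],
    f"{estimator_name}__min_samples_leaf": [1, 2, 3],
}

# Every combination an exhaustive grid search would have to fit.
keys = list(grid)
all_candidates = [dict(zip(keys, values)) for values in product(*grid.values())]
print(len(all_candidates))  # 4 * 3 * 4 * 3 = 144

# A randomized search evaluates only a small sample of them.
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=5)
for candidate in sampled:
    print(candidate)
```

With 144 combinations in total, sampling 5 keeps the run short at the cost of possibly missing the best setting.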

Defining pipeline steps

Here we look at the columns of the data and work out which features need which kind of preprocessing. For example, we may want to scale the continuous input data. The ordinal data needs converting to appropriate numbers, e.g. A -> 0, B -> 1, C -> 2, or the other way around. The other categorical data needs one-hot encoding.
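
As a small illustration of the ordinal mapping, the sketch below passes an explicit category order to scikit-learn's OrdinalEncoder. The three-letter category list is a toy example and not the full set of deck values in the dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy ordered categories; the explicit list fixes the order A < B < C.
encoder = OrdinalEncoder(categories=[["A", "B", "C"]])

decks = np.array([["A"], ["C"], ["B"], ["A"]], dtype=object)
encoded = encoder.fit_transform(decks)
print(encoded.ravel())  # [0. 2. 1. 0.]
```

Passing the list in reverse order, ["C", "B", "A"], gives the mapping the other way around.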

This can be done easily through the pipeline so that we can ensure there is no data leakage.

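This also allows us to handle unseen categories at prediction time. A OneHotEncoder with handle_unknown set to "ignore" encodes any category it did not see during fitting as all zeros across that feature's one-hot columns instead of raising an error. Missing values in these columns are filled with the constant "missing" by the imputer placed before the encoder, so they receive a one-hot column of their own.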
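
The behaviour of handle_unknown="ignore" can be checked on a toy column. This is a minimal sketch with made-up training values, where "Queenstown" plays the role of a category that was not present during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")

# Fit on two known towns only.
train = np.array([["Southampton"], ["Cherbourg"], ["Southampton"]], dtype=object)
encoder.fit(train)
print(encoder.categories_[0])  # ['Cherbourg' 'Southampton']

# "Queenstown" was never seen during fit, so its row is all zeros.
new = np.array([["Cherbourg"], ["Queenstown"]], dtype=object)
encoded = encoder.transform(new).toarray()
print(encoded)
```

The number of output columns stays fixed at the number of categories learned during fitting.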

Setting impute to True handles missing data by automatically imputing it with the mean. This step can be left out and a custom imputer supplied through pipeline_steps instead, as the SimpleImputer steps in the pipelines below do.
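
Mean imputation itself is easy to inspect in isolation. The sketch below uses scikit-learn's SimpleImputer on made-up ages. It shows that the fill value is learned from the data passed to fit and then reused on new data, which is the property that keeps test information out of training.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training ages with one missing value.
ages_train = np.array([[22.0], [38.0], [np.nan], [30.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(ages_train)
print(imputer.statistics_)  # [30.]

# New data is filled with the mean learned from the training data.
ages_new = np.array([[np.nan], [50.0]])
filled = imputer.transform(ages_new)
print(filled.ravel())  # [30. 50.]
```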

In [6]:
X.head()
Out[6]:
   pclass     sex   age  sibsp  parch     fare    who  adult_male  deck  embark_town  alone
0       3    male  22.0      1      0   7.2500    man        True   NaN  Southampton  False
1       1  female  38.0      1      0  71.2833  woman       False     C    Cherbourg  False
2       3  female  26.0      0      0   7.9250  woman       False   NaN  Southampton   True
3       1  female  35.0      1      0  53.1000  woman       False     C  Southampton  False
4       3    male  35.0      0      0   8.0500    man        True   NaN  Southampton   True
In [8]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define columns
ohcols = [
    "embark_town",
    "who",
    "sex",
    "adult_male"
]

ordcols = [
    "deck"
]

scalercols = [
    "parch",
    "fare",
    "age",
    "pclass"
]

# Create the pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Create the pipeline for ordinal features
ordinal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ordinal", OrdinalEncoder())
    ]
)

# Create the pipeline for numeric features (imputation followed by scaling)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler())
    ]
)

# Define the ColumnTransformer
ct = ColumnTransformer(
    transformers=[
        ("OneHotEncoder", categorical_transformer, ohcols),
        ("OrdinalEncoder", ordinal_transformer, ordcols),
        ("Numeric", numeric_transformer, scalercols),
    ],
    remainder='passthrough'  # Keep other columns unchanged
)
In [9]:
# Initialize titanic_model
titanic_model_rf = Model(
    name="RandomForest_Titanic",
    estimator_name=estimator_name,
    calibrate=True,
    estimator=rf,
    kfold=False,
    pipeline_steps=[("Preprocessor", ct)],
    stratify_y=True,
    grid=rf_pipeline_hyperparams_grid,
    randomized_grid=True,
    n_iter=5,
    scoring=["roc_auc"],
    random_state=42,
    n_jobs=-1,
)
In [10]:
titanic_model_rf.grid_search_param_tuning(X, y, f1_beta_tune=True)
100%|██████████| 5/5 [00:02<00:00,  2.31it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:01<00:00,  1.43it/s]
Best score/param set found on validation set:
{'params': {'rf__max_depth': 5,
            'rf__max_features': 5,
            'rf__min_samples_leaf': 1,
            'rf__n_estimators': 200},
 'score': 0.8776069518716578}
Best roc_auc: 0.878 


In [11]:
X_train, y_train = titanic_model_rf.get_train_data(X, y)
X_valid, y_valid = titanic_model_rf.get_valid_data(X, y)
X_test, y_test = titanic_model_rf.get_test_data(X, y)

titanic_model_rf.fit(X_train, y_train)
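
The validation and test reports further down have supports of 178 and 179 rows, which is consistent with a roughly 60/20/20 split of the 891 rows. How model_tuner performs the split internally is not shown in this notebook. The sketch below only reproduces those sizes with two calls to scikit-learn's train_test_split on placeholder data, and the 0.2 and 0.25 fractions are assumptions chosen to match.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the Titanic class balance (549 zeros, 342 ones).
y_demo = np.array([0] * 549 + [1] * 342)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# First carve off 20% as the test set, stratified on the label.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Then take 25% of the remainder (20% of the total) for validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(y_tr), len(y_va), len(y_te))  # 534 178 179
```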
In [12]:
prob_uncalibrated = titanic_model_rf.predict_proba(X_test)[:, 1]

if titanic_model_rf.calibrate:
    titanic_model_rf.calibrateModel(X, y)
Confusion matrix on validation set:
--------------------------------------------------------------------------------
          Predicted:
            Pos  Neg
--------------------------------------------------------------------------------
Actual: Pos 51 (tp)  17 (fn)
        Neg 12 (fp)  98 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.85      0.89      0.87       110
           1       0.81      0.75      0.78        68

    accuracy                           0.84       178
   macro avg       0.83      0.82      0.82       178
weighted avg       0.84      0.84      0.84       178

--------------------------------------------------------------------------------
In [13]:
metrics = titanic_model_rf.return_metrics(X_test, y_test)
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
            Pos  Neg
--------------------------------------------------------------------------------
Actual: Pos 50 (tp)  19 (fn)
        Neg 11 (fp)  99 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       110
           1       0.82      0.72      0.77        69

    accuracy                           0.83       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.83      0.83      0.83       179

--------------------------------------------------------------------------------
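
The figures for class 1 in the report above follow directly from the four confusion-matrix counts. A short check in plain Python, with the counts copied from the matrix:

```python
# Counts copied from the confusion matrix above.
tp, fn, fp, tn = 50, 19, 11, 99

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(
    f"precision={precision:.2f} recall={recall:.2f} "
    f"f1={f1:.2f} accuracy={accuracy:.2f}"
)
# precision=0.82 recall=0.72 f1=0.77 accuracy=0.83
```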
In [14]:
titanic_model_rf.threshold
Out[14]:
{'roc_auc': 0.36}
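
The tuned threshold replaces the usual 0.5 cut-off when probabilities are turned into class labels. The sketch below applies it to made-up probabilities with NumPy. Whether model_tuner compares with >= or > at the boundary is not confirmed here, so the comparison operator is an assumption.

```python
import numpy as np

threshold = 0.36  # value reported above for roc_auc

# Made-up positive-class probabilities, e.g. from predict_proba(X)[:, 1].
proba = np.array([0.10, 0.35, 0.36, 0.40, 0.90])

default_labels = (proba >= 0.5).astype(int)
tuned_labels = (proba >= threshold).astype(int)

print(default_labels)  # [0 0 0 0 1]
print(tuned_labels)    # [0 0 1 1 1]
```

Lowering the threshold labels more samples as positive, which trades precision for recall.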

Calibrating Model

In [15]:
from matplotlib import pyplot as plt
from sklearn.calibration import calibration_curve
In [16]:
# Get the predicted probabilities for the test data from the calibrated model
y_prob_calibrated = titanic_model_rf.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the calibrated model
prob_true_calibrated, prob_pred_calibrated = calibration_curve(y_test, y_prob_calibrated, n_bins=4)
prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(y_test, prob_uncalibrated, n_bins=4)

# Plot the calibration curves
plt.figure(figsize=(5, 5))
plt.plot(prob_pred_uncalibrated, prob_true_uncalibrated, marker='o', label='Uncalibrated random forest')
plt.plot(prob_pred_calibrated, prob_true_calibrated, marker='o', label='Calibrated random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Predicted probability')
plt.ylabel('True probability in each bin')
plt.title('Calibration plot (reliability curve)')
plt.legend()
plt.show()
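
Each point on the plot pairs the mean predicted probability in a bin with the fraction of positives observed in that bin. The sketch below is a simplified NumPy re-implementation of that idea for equal-width bins. It is not scikit-learn's code, and the function name reliability_bins and the toy data are illustrative.

```python
import numpy as np


def reliability_bins(y_true, y_prob, n_bins=4):
    """Observed positive rate and mean predicted probability per equal-width bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each probability to a bin using the interior edges only.
    bin_ids = np.searchsorted(edges[1:-1], y_prob)
    counts = np.bincount(bin_ids, minlength=n_bins)
    prob_sum = np.bincount(bin_ids, weights=y_prob, minlength=n_bins)
    true_sum = np.bincount(bin_ids, weights=y_true, minlength=n_bins)
    keep = counts > 0  # empty bins are dropped
    return true_sum[keep] / counts[keep], prob_sum[keep] / counts[keep]


y_true_demo = [0, 0, 1, 0, 1, 1]
y_prob_demo = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]
prob_true, prob_pred = reliability_bins(y_true_demo, y_prob_demo, n_bins=2)
print(prob_true, prob_pred)
```

A well-calibrated model gives points close to the diagonal, where the two values in each bin are roughly equal.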

KFold?

To use KFold cross-validation, set the kfold parameter to True. The data is then split into folds automatically.
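
For intuition about what a 10-fold split of the 891 rows looks like, the sketch below uses scikit-learn's KFold on placeholder data. The shuffle and random_state settings are assumptions and may differ from what model_tuner configures internally.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder rows, one per passenger.
X_demo = np.arange(891).reshape(-1, 1)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
folds = [valid_idx for _, valid_idx in kf.split(X_demo)]

fold_sizes = [len(valid_idx) for valid_idx in folds]
print(fold_sizes)  # [90, 89, 89, 89, 89, 89, 89, 89, 89, 89]
```

Every row lands in exactly one validation fold, so each model is scored on data it was not trained on.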

In [17]:
# Initialize titanic_model

titanic_model_kf = Model(
    name="RandomForest_Titanic",
    estimator_name=estimator_name,
    calibrate=True,
    estimator=rf,
    kfold=True,
    pipeline_steps=[("ColumnTransformer", ct)],
    stratify_y=False,
    n_splits=10,
    grid=rf_pipeline_hyperparams_grid,
    randomized_grid=True,
    n_iter=5,
    scoring=["roc_auc"],
    random_state=42,
    n_jobs=-1,
)
In [18]:
#### When using KFold, X and y are passed as a whole because they are
#### split into the separate folds internally.
#### The metrics are assessed over each fold and averaged.

titanic_model_kf.grid_search_param_tuning(X, y, f1_beta_tune=True)
# Tuning hyper-parameters for roc_auc
Fitting 10 folds for each of 5 candidates, totalling 50 fits

Best score/param set found on development set:
{0.8738231613937606: {'rf__max_depth': 10,
                      'rf__max_features': 5,
                      'rf__min_samples_leaf': 3,
                      'rf__n_estimators': 100}}

Grid scores on development set:
0.848 (+/-0.084) for {'rf__n_estimators': 10, 'rf__min_samples_leaf': 1, 'rf__max_features': 3, 'rf__max_depth': None}
0.864 (+/-0.093) for {'rf__n_estimators': 100, 'rf__min_samples_leaf': 1, 'rf__max_features': 5, 'rf__max_depth': 3}
0.865 (+/-0.099) for {'rf__n_estimators': 100, 'rf__min_samples_leaf': 1, 'rf__max_features': 3, 'rf__max_depth': 10}
0.874 (+/-0.095) for {'rf__n_estimators': 100, 'rf__min_samples_leaf': 3, 'rf__max_features': 5, 'rf__max_depth': 10}
0.871 (+/-0.095) for {'rf__n_estimators': 200, 'rf__min_samples_leaf': 1, 'rf__max_features': 5, 'rf__max_depth': 5}
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:01<00:00,  1.32it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:01<00:00,  1.61it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:01<00:00,  1.21it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:01<00:00,  1.32it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:02<00:00,  1.18s/it]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  2.63it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  3.13it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  3.41it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  3.26it/s]
Fitting model with best params and tuning for best threshold ...
100%|██████████| 2/2 [00:00<00:00,  3.36it/s]
In [19]:
#### When using KFold, X and y are passed as a whole to the fit method because
#### they are split into the separate folds internally.
#### The metrics are assessed over each fold and averaged.

titanic_model_kf.fit(X, y)
In [20]:
titanic_model_kf.threshold
Out[20]:
{'roc_auc': 0.327}
In [21]:
titanic_model_kf.return_metrics(X, y)
Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos 292 (tp)   50 (fn)
        Neg  40 (fp)  509 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.91      0.93      0.92       549
           1       0.88      0.85      0.87       342

    accuracy                           0.90       891
   macro avg       0.90      0.89      0.89       891
weighted avg       0.90      0.90      0.90       891

--------------------------------------------------------------------------------
Out[21]:
{'Classification Report': {'0': {'precision': 0.9105545617173524,
   'recall': 0.9271402550091075,
   'f1-score': 0.9187725631768954,
   'support': 549.0},
  '1': {'precision': 0.8795180722891566,
   'recall': 0.8538011695906432,
   'f1-score': 0.8664688427299703,
   'support': 342.0},
  'accuracy': 0.898989898989899,
  'macro avg': {'precision': 0.8950363170032545,
   'recall': 0.8904707122998754,
   'f1-score': 0.8926207029534328,
   'support': 891.0},
  'weighted avg': {'precision': 0.8986415657752166,
   'recall': 0.898989898989899,
   'f1-score': 0.8986963876518129,
   'support': 891.0}},
 'Confusion Matrix': array([[509,  40],
        [ 50, 292]])}