Model Tuner Library Instructions¶

This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.

Model Tuner Description¶

The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.

Documentation¶

For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.

By following the steps below, you should be able to install and use the model_tuner library effectively in your notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.

Installation¶

To install the model_tuner library, use the following command:

! pip install model_tuner
! pip install seaborn

Collecting model_tuner
  Downloading model_tuner-0.0.15a0-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: joblib>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.4.2)
Requirement already satisfied: numpy>=1.21.6 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.26.4)
Requirement already satisfied: pandas>=1.3.5 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (2.1.4)
Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.3.2)
Requirement already satisfied: scipy>=1.7.3 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (1.13.1)
Requirement already satisfied: tqdm>=4.66.4 in /usr/local/lib/python3.10/dist-packages (from model_tuner) (4.66.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.3.5->model_tuner) (2024.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.0.2->model_tuner) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.3.5->model_tuner) (1.16.0)
Downloading model_tuner-0.0.15a0-py3-none-any.whl (20 kB)
Installing collected packages: model_tuner
Successfully installed model_tuner-0.0.15a0

Importing the Library¶

After installation, you can import the necessary components from the model_tuner library as shown below:

from model_tuner import Model
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

Binary Classification with the Titanic dataset and Pipeline¶

titanic = sns.load_dataset('titanic')
titanic.head()

X = titanic[[col for col in titanic.columns if col != "survived"]]
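# Drop columns that repeat other columns: 'alive' mirrors the target 'survived',
# 'class' mirrors 'pclass', and 'embarked' mirrors 'embark_town'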
X = X.drop(columns=['alive', 'class', 'embarked'])
y = titanic['survived']

rf = RandomForestClassifier(class_weight="balanced")

estimator_name = "rf"

rf_pipeline_hyperparams_grid = {
    f"{estimator_name}__max_depth": [3, 5, 10, None],
    f"{estimator_name}__n_estimators": [10, 100, 200],
    f"{estimator_name}__max_features": [1, 3, 5, 7],
    f"{estimator_name}__min_samples_leaf": [1, 2, 3],
}
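To get a feel for the size of this search space, the sketch below counts the combinations in the same grid and draws five of them at random. It uses scikit-learn's ParameterGrid and ParameterSampler purely for illustration; whether model_tuner relies on these helpers internally is an assumption.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

estimator_name = "rf"

# Same grid as in the cell above
grid = {
    f"{estimator_name}__max_depth": [3, 5, 10, None],
    f"{estimator_name}__n_estimators": [10, 100, 200],
    f"{estimator_name}__max_features": [1, 3, 5, 7],
    f"{estimator_name}__min_samples_leaf": [1, 2, 3],
}

# An exhaustive search would try every combination: 4 * 3 * 4 * 3
n_combinations = len(ParameterGrid(grid))
print(n_combinations)

# A randomized search only evaluates n_iter sampled combinations
sampled = list(ParameterSampler(grid, n_iter=5, random_state=42))
for params in sampled:
    print(params)
```

With 144 combinations in total, sampling a handful keeps the tuning step fast at the cost of possibly missing the best setting.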

Defining pipeline steps¶

Here we look at the columns of the data and work out which ones need which kind of preprocessing. For example, we may want to scale the continuous inputs. Ordinal data needs converting to ordered integers, e.g. A -> 0, B -> 1, C -> 2, or the other way around. The remaining categorical data needs one-hot encoding.
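As a minimal sketch of the ordinal mapping described above, the snippet below encodes a made-up deck column with scikit-learn's OrdinalEncoder. The toy values and the explicit category order are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal column (hypothetical values, not the Titanic data)
decks = np.array([["A"], ["B"], ["C"], ["B"]])

# The order of the categories list defines the integer assigned to each value
encoder = OrdinalEncoder(categories=[["A", "B", "C"]])
codes = encoder.fit_transform(decks)
print(codes.ravel())  # [0. 1. 2. 1.]

# Reversing the list gives the mapping "the other way around"
reversed_encoder = OrdinalEncoder(categories=[["C", "B", "A"]])
reversed_codes = reversed_encoder.fit_transform(decks)
print(reversed_codes.ravel())  # [2. 1. 0. 1.]
```

If no categories are passed, OrdinalEncoder sorts the values it sees during fitting, which for single letters gives the same A, B, C order.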

This can be done easily through the pipeline, which fits the preprocessing on the training data only and so ensures there is no data leakage.
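The leakage point can be illustrated with a small sketch: a scaler inside a Pipeline learns its minimum and maximum from the training rows only, so held-out rows are scaled with those training statistics. The arrays here are made up for the example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and held-out values
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[20.0]])

pipe = Pipeline(steps=[("scaler", MinMaxScaler())])
pipe.fit(X_train)  # min and max are learned from the training data only

scaled_test = pipe.transform(X_test)
print(scaled_test)  # the test value falls outside [0, 1] because it was never seen in fit
```

Had the scaler been fitted on all rows before splitting, the held-out value would have influenced the scaling, which is exactly the leakage the pipeline avoids.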

This also allows us to handle missing or unseen data when it comes to predicting. With handle_unknown set to "ignore", the OneHotEncoder encodes a category it did not see during fitting as all zeros instead of raising an error.
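A short sketch of what handle_unknown="ignore" does in scikit-learn's OneHotEncoder, using made-up town names: a category that was absent during fitting is encoded as a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fitting data and new data containing an unseen category
train = np.array([["Southampton"], ["Cherbourg"]])
new = np.array([["Cherbourg"], ["Queenstown"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Columns follow the sorted categories: Cherbourg, Southampton
encoded = encoder.transform(new).toarray()
print(encoder.categories_)
print(encoded)
```

The second row is all zeros because "Queenstown" was not present when the encoder was fitted; no new column is created.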

Setting impute to True handles missing data by automatically imputing it with the mean. In this example the imputation is instead done by custom imputers passed through pipeline_steps.
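For reference, mean imputation with scikit-learn's SimpleImputer looks roughly like the sketch below. The ages are made up; the point is only that the missing entry is replaced by the mean of the observed ones.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age column with one missing value
ages = np.array([[22.0], [38.0], [np.nan], [30.0]])

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(ages)

# The NaN is replaced by the mean of 22, 38 and 30
print(filled.ravel())
```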

X.head()

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define columns
ohcols = [
    "embark_town",
    "who",
    "sex",
    "adult_male"
]

ordcols = [
    "deck"
]

scalercols = [
    "parch",
    "fare",
    "age",
    "pclass"
]

# Create the pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Create the pipeline for ordinal features
ordinal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ordinal", OrdinalEncoder())
    ]
)

# Create the pipeline for numeric features (imputation followed by scaling)
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler())
    ]
)

# Define the ColumnTransformer
ct = ColumnTransformer(
    transformers=[
        ("OneHotEncoder", categorical_transformer, ohcols),
        ("OrdinalEncoder", ordinal_transformer, ordcols),
        ("Numeric", numeric_transformer, scalercols),
    ],
    remainder='passthrough'  # Keep other columns unchanged
)
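To see what a ColumnTransformer like this produces, the sketch below applies a cut-down version to a small hypothetical frame (not the Titanic data) and inspects the output shape. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame with one missing value per preprocessed column
toy = pd.DataFrame(
    {
        "sex": pd.Series(["male", "female", "female", np.nan], dtype=object),
        "fare": [7.25, 71.28, np.nan, 53.10],
        "alone": [False, False, True, False],
    }
)

categorical = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
numeric = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ]
)

toy_ct = ColumnTransformer(
    transformers=[("cat", categorical, ["sex"]), ("num", numeric, ["fare"])],
    remainder="passthrough",  # "alone" is passed through unchanged
)

out = toy_ct.fit_transform(toy)
# The result may be sparse or dense depending on its density
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out.shape)
```

The five output columns are three one-hot columns for sex (female, male, and the imputed "missing" category), one scaled fare column, and the passed-through alone column.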

# Initialize titanic_model
titanic_model_rf = Model(
    name="RandomForest_Titanic",
    estimator_name=estimator_name,
    calibrate=True,
    estimator=rf,
    kfold=False,
    pipeline_steps=[("Preprocessor", ct)],
    stratify_y=True,
    grid=rf_pipeline_hyperparams_grid,
    randomized_grid=True,
    n_iter=5,
    scoring=["roc_auc"],
    random_state=42,
    n_jobs=-1,
)

titanic_model_rf.grid_search_param_tuning(X, y, f1_beta_tune=True)

100%|██████████| 5/5 [00:02<00:00,  2.31it/s]

Fitting model with best params and tuning for best threshold ...

100%|██████████| 2/2 [00:01<00:00,  1.43it/s]

Best score/param set found on validation set:
{'params': {'rf__max_depth': 5,
            'rf__max_features': 5,
            'rf__min_samples_leaf': 1,
            'rf__n_estimators': 200},
 'score': 0.8776069518716578}
Best roc_auc: 0.878

X_train, y_train = titanic_model_rf.get_train_data(X, y)
X_valid, y_valid = titanic_model_rf.get_valid_data(X, y)
X_test, y_test = titanic_model_rf.get_test_data(X, y)

titanic_model_rf.fit(X_train, y_train)

prob_uncalibrated = titanic_model_rf.predict_proba(X_test)[:, 1]

if titanic_model_rf.calibrate:
    titanic_model_rf.calibrateModel(X, y)

Confusion matrix on validation set:
--------------------------------------------------------------------------------
          Predicted:
            Pos  Neg
--------------------------------------------------------------------------------
Actual: Pos 51 (tp)  17 (fn)
        Neg 12 (fp)  98 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.85      0.89      0.87       110
           1       0.81      0.75      0.78        68

    accuracy                           0.84       178
   macro avg       0.83      0.82      0.82       178
weighted avg       0.84      0.84      0.84       178

--------------------------------------------------------------------------------
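The figures in the report for class 1 follow directly from the confusion matrix counts. As a worked check, the sketch below recomputes them from the validation-set values printed above.

```python
# Counts taken from the validation-set confusion matrix above
tp, fn = 51, 17
fp, tn = 12, 98

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)  # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.81 0.75 0.78 0.84
```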

metrics = titanic_model_rf.return_metrics(X_test, y_test)

Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
            Pos  Neg
--------------------------------------------------------------------------------
Actual: Pos 50 (tp)  19 (fn)
        Neg 11 (fp)  99 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       110
           1       0.82      0.72      0.77        69

    accuracy                           0.83       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.83      0.83      0.83       179

--------------------------------------------------------------------------------

titanic_model_rf.threshold

{'roc_auc': 0.36}
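A decision threshold of this kind turns predicted probabilities into class labels. The sketch below applies 0.36 to a few made-up probabilities and compares the result with the usual 0.5 cut-off; whether model_tuner compares with >= or > internally is an assumption here.

```python
import numpy as np

threshold = 0.36  # value reported by titanic_model_rf.threshold above

# Hypothetical predicted probabilities for the positive class
proba = np.array([0.10, 0.35, 0.40, 0.55, 0.90])

labels_tuned = (proba >= threshold).astype(int)
labels_default = (proba >= 0.5).astype(int)

print(labels_tuned)  # [0 0 1 1 1]
print(labels_default)  # [0 0 0 1 1]
```

Lowering the threshold below 0.5 labels more samples as positive, which trades some precision for higher recall.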

Calibrating Model¶

from matplotlib import pyplot as plt
from sklearn.calibration import calibration_curve

# Get the predicted probabilities for the test data from the calibrated model
y_prob_calibrated = titanic_model_rf.predict_proba(X_test)[:, 1]

# Compute the calibration curve for the calibrated model
prob_true_calibrated, prob_pred_calibrated = calibration_curve(y_test, y_prob_calibrated, n_bins=4)
prob_true_uncalibrated, prob_pred_uncalibrated = calibration_curve(y_test, prob_uncalibrated, n_bins=4)

# Plot the calibration curves
plt.figure(figsize=(5, 5))
plt.plot(prob_pred_uncalibrated, prob_true_uncalibrated, marker='o', label='Uncalibrated random forest')
plt.plot(prob_pred_calibrated, prob_true_calibrated, marker='o', label='Calibrated random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Predicted probability')
plt.ylabel('True probability in each bin')
plt.title('Calibration plot (reliability curve)')
plt.legend()
plt.show()
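To make the reliability curve easier to read, here is a small sketch on made-up labels and probabilities showing what calibration_curve returns, along with the Brier score as a single-number summary. The data is hypothetical and chosen so that the predictions are perfectly calibrated within each bin.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

# Each bin reports the observed positive rate and the mean predicted probability
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=2)
print(prob_true, prob_pred)

# Mean squared difference between probabilities and outcomes (lower is better)
brier = brier_score_loss(y_true, y_prob)
print(brier)
```

Points where prob_true equals prob_pred lie on the dashed diagonal of the plot; the further a curve strays from that line, the worse the calibration.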

KFold?¶

If we want to use KFold, we can simply set the kfold parameter to True. This will automatically split the data accordingly.
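As a rough illustration of what a 10-fold split looks like on a dataset of this size, the sketch below uses scikit-learn's KFold on placeholder data with 891 rows. The shuffle and random_state settings are assumptions; how model_tuner configures its splitter internally is not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 891  # number of rows in the Titanic dataset
X_dummy = np.zeros((n_rows, 1))  # placeholder features, only the row count matters

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [len(valid_idx) for _, valid_idx in kf.split(X_dummy)]

# Every row lands in exactly one validation fold
print(fold_sizes, sum(fold_sizes))
```

Each model is trained on nine folds and scored on the remaining one, so every row is used for validation exactly once.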

# Initialize titanic_model

titanic_model_kf = Model(
    name="RandomForest_Titanic",
    estimator_name=estimator_name,
    calibrate=True,
    estimator=rf,
    kfold=True,
    pipeline_steps=[("ColumnTransformer", ct)],
    stratify_y=False,
    n_splits=10,
    grid=rf_pipeline_hyperparams_grid,
    randomized_grid=True,
    n_iter=5,
    scoring=["roc_auc"],
    random_state=42,
    n_jobs=-1,
)

#### When using KFold, X and y are passed as a whole to grid_search_param_tuning,
#### as they are split into the separate folds internally.
#### The metrics are assessed over each fold and averaged.

titanic_model_kf.grid_search_param_tuning(X, y, f1_beta_tune=True)

# Tuning hyper-parameters for roc_auc
Fitting 10 folds for each of 5 candidates, totalling 50 fits

Best score/param set found on development set:
{0.8738231613937606: {'rf__max_depth': 10,
                      'rf__max_features': 5,
                      'rf__min_samples_leaf': 3,
                      'rf__n_estimators': 100}}

Grid scores on development set:
0.848 (+/-0.084) for {'rf__n_estimators': 10, 'rf__min_samples_leaf': 1, 'rf__max_features': 3, 'rf__max_depth': None}
0.864 (+/-0.093) for {'rf__n_estimators': 100, 'rf__min_samples_leaf': 1, 'rf__max_features': 5, 'rf__max_depth': 3}
0.865 (+/-0.099) for {'rf__n_estimators': 100, 'rf__min_samples_leaf': 1, 'rf__max_features': 3, 'rf__max_depth': 10}
0.874 (+/-0.095) for {'rf__n_estimators': 100, 'rf__min_samples_leaf': 3, 'rf__max_features': 5, 'rf__max_depth': 10}
0.871 (+/-0.095) for {'rf__n_estimators': 200, 'rf__min_samples_leaf': 1, 'rf__max_features': 5, 'rf__max_depth': 5}
Fitting model with best params and tuning for best threshold ...

100%|██████████| 2/2 [00:01<00:00,  1.32it/s]

Fitting model with best params and tuning for best threshold ...

100%|██████████| 2/2 [00:01<00:00,  1.61it/s]

Fitting model with best params and tuning for best threshold ...

100%|██████████| 2/2 [00:01<00:00,  1.21it/s]

Fitting model with best params and tuning for best threshold ...

100%|██████████| 2/2 [00:01<00:00,  1.32it/s]

Fitting model with best params and tuning for best threshold ...

100%|██████████| 2/2 [00:02<00:00,  1.18s/it]

#### When using KFold, X and y are passed as a whole to the fit method,
#### as they are split into the separate folds internally.
#### The metrics are assessed over each fold and averaged.

titanic_model_kf.fit(X, y)

titanic_model_kf.threshold

{'roc_auc': 0.327}

titanic_model_kf.return_metrics(X, y)

Confusion matrix on set provided: 
--------------------------------------------------------------------------------
          Predicted:
             Pos   Neg
--------------------------------------------------------------------------------
Actual: Pos 292 (tp)   50 (fn)
        Neg  40 (fp)  509 (tn)
--------------------------------------------------------------------------------

              precision    recall  f1-score   support

           0       0.91      0.93      0.92       549
           1       0.88      0.85      0.87       342

    accuracy                           0.90       891
   macro avg       0.90      0.89      0.89       891
weighted avg       0.90      0.90      0.90       891

--------------------------------------------------------------------------------

{'Classification Report': {'0': {'precision': 0.9105545617173524,
   'recall': 0.9271402550091075,
   'f1-score': 0.9187725631768954,
   'support': 549.0},
  '1': {'precision': 0.8795180722891566,
   'recall': 0.8538011695906432,
   'f1-score': 0.8664688427299703,
   'support': 342.0},
  'accuracy': 0.898989898989899,
  'macro avg': {'precision': 0.8950363170032545,
   'recall': 0.8904707122998754,
   'f1-score': 0.8926207029534328,
   'support': 891.0},
  'weighted avg': {'precision': 0.8986415657752166,
   'recall': 0.898989898989899,
   'f1-score': 0.8986963876518129,
   'support': 891.0}},
 'Confusion Matrix': array([[509,  40],
        [ 50, 292]])}

Output of titanic.head():
	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Output of X.head():
	pclass	sex	age	sibsp	fare	who	adult_male	deck	embark_town	alone
0	3	male	22.0	1	7.2500	man	True	NaN	Southampton	False
1	1	female	38.0	1	71.2833	woman	False	C	Cherbourg	False
2	3	female	26.0	0	7.9250	woman	False	NaN	Southampton	True
3	1	female	35.0	1	53.1000	woman	False	C	Southampton	False
4	3	male	35.0	0	8.0500	man	True	NaN	Southampton	True