This Google Colab notebook utilizes a dataset on the Los Angeles real estate market sourced from Redfin, consisting of 200 rows and 27 columns. The dataset captures a snapshot in time of various property listings, providing detailed information about each property.
The dataset includes the following columns:
SALE TYPE
SOLD DATE
PROPERTY TYPE
ADDRESS
CITY
STATE OR PROVINCE
ZIP OR POSTAL CODE
PRICE
BEDS
BATHS
LOCATION
SQUARE FEET
LOT SIZE
YEAR BUILT
DAYS ON MARKET
$/SQUARE FEET
HOA/MONTH
STATUS
NEXT OPEN HOUSE START TIME
NEXT OPEN HOUSE END TIME
URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)
SOURCE
MLS#
FAVORITE
INTERESTED
LATITUDE
LONGITUDE
The primary purpose of this notebook is to demonstrate various regression examples using the Redfin dataset. Specifically, the following variables are used for modeling in this notebook:
Features (X):
BEDS
BATHS
SQUARE FEET
LOT SIZE
Target (y):
PRICE
The dataset is sourced from Redfin; see Redfin's Comparative Market Analysis page (https://www.redfin.com/buy-a-home/comparative-market-analysis) for more information on pricing.
This dataset represents a single snapshot in time and may not reflect current market conditions.
This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.
The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.
Automatic Hyperparameter Tuning
The library can automatically tune hyperparameters for a variety of machine learning models using advanced optimization techniques.
Cross-Validation
Integrated cross-validation ensures that the models are evaluated robustly, helping to guard against overfitting.
For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.
By following these steps, you should be able to install and use the model_tuner library effectively in your notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.
To install the model_tuner library, use the following command:
! pip install model_tuner
After installation, you can import the necessary components from the model_tuner library as shown below:
import model_tuner
from model_tuner import Model
To confirm that the model_tuner library is installed correctly, you can inspect its built-in help documentation:
help(model_tuner)
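If you specifically want the installed version number, the standard-library importlib.metadata can report it; a minimal sketch, assuming the distribution is named model_tuner (matching the pip command above):
from importlib.metadata import version

# Distribution name assumed to match the pip install command above
print(version("model_tuner"))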
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso, Ridge, SGDRegressor
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action="ignore", category=DataConversionWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Direct download link to the Excel file
url = (
"https://github.com/uclamii/model_tuner/raw/main/public_data/"
"redfin_2024-04-16-15-59-17.xlsx"
)
# Read the Excel file
df = pd.read_excel(url)
df.head() # inspect first 5 rows of data
df["PROPERTY TYPE"].unique()
df.columns # inspect the list of cols in the dataset
print(f"This dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
df = df.drop(df.index[0]) # remove first row of dataframe which is not used
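Because the pipeline defined later relies on imputation, it can also help to glance at how many values are missing in the columns we are about to model; a quick, optional check:
# Count missing values in the modeling columns
cols_of_interest = ["BEDS", "BATHS", "SQUARE FEET", "LOT SIZE", "PROPERTY TYPE", "PRICE"]
print(df[cols_of_interest].isna().sum())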
# >2 categories
categorical_features = [
"PROPERTY TYPE",
]
# continuous or binary
numerical_features = [
"BEDS",
"BATHS",
"SQUARE FEET",
"LOT SIZE"
]
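Depending on how the spreadsheet is parsed, the numeric feature columns may arrive as strings or contain blanks. The optional sketch below coerces them to numeric; entries that cannot be parsed become NaN, which the imputer defined later will handle:
# Optional: ensure the numeric feature and target columns are truly numeric
for col in numerical_features + ["PRICE"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")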
################################ Lasso Regression ##############################
lasso_name = "lasso"
lasso = Lasso(random_state=3)
tuned_parameters_lasso = [
{
f"{lasso_name}__fit_intercept": [True, False],
f"{lasso_name}__precompute": [True, False],
f"{lasso_name}__copy_X": [True, False],
f"{lasso_name}__max_iter": [100, 500, 1000, 2000],
f"{lasso_name}__tol": [1e-4, 1e-3],
f"{lasso_name}__warm_start": [True, False],
f"{lasso_name}__positive": [True, False],
}
]
lasso_definition = {
"clc": lasso,
"estimator_name": lasso_name,
"tuned_parameters": tuned_parameters_lasso,
"randomized_grid": False,
"early": False,
}
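Note that the grid above varies solver-related options but not alpha, Lasso's regularization strength. If you also want to search over it, a minimal extension would be the following (the alpha values are illustrative, not part of the original example):
# Hypothetical extension: also tune the regularization strength alpha
tuned_parameters_lasso[0][f"{lasso_name}__alpha"] = [0.1, 1.0, 10.0]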
################################ Ridge Regression ##############################
ridge_name = "ridge"
ridge = Ridge(random_state=3)
tuned_parameters_ridge = [
{
f"{ridge_name}__max_iter": [100, 200, 500],
f"{ridge_name}__alpha": [0.1, 1, 0.5],
}
]
ridge_definition = {
"clc": ridge,
"estimator_name": ridge_name,
"tuned_parameters": tuned_parameters_ridge,
"randomized_grid": False,
"early": False,
}
################################# SGD Regression ###############################
sgd_name = "sgd"
sgd = SGDRegressor(random_state=3)
tuned_parameters_sgd = [
{
f"{sgd_name}__loss": [
"squared_error",
"huber",
"epsilon_insensitive",
"squared_epsilon_insensitive",
],
f"{sgd_name}__penalty": [None, "l2", "l1", "elasticnet"][:1],
f"{sgd_name}__alpha": [0.0001, 0.001, 0.01, 0.1][:1],
f"{sgd_name}__l1_ratio": [
0.15,
0.25,
0.5,
0.75,
][
:1
], # Only used if penalty is 'elasticnet'
f"{sgd_name}__fit_intercept": [True, False][:1],
f"{sgd_name}__max_iter": [1000, 2000, 3000][:1],
f"{sgd_name}__tol": [1e-3, 1e-4][:1],
f"{sgd_name}__epsilon": [
0.1,
0.2,
], # Only used for 'huber' and 'epsilon_insensitive'
f"{sgd_name}__learning_rate": [
"constant",
"optimal",
"invscaling",
"adaptive",
][:1],
f"{sgd_name}__eta0": [
0.01,
0.1,
][:1],
f"{sgd_name}__power_t": [
0.25,
0.5,
][:1],
f"{sgd_name}__early_stopping": [True, False][:1],
f"{sgd_name}__validation_fraction": [
0.1,
0.2,
][:1],
f"{sgd_name}__n_iter_no_change": [
5,
10,
][:1],
f"{sgd_name}__warm_start": [True, False][:1],
f"{sgd_name}__average": [
False,
True,
10,
][:1],
}
]
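The [:1] slices above keep only the first value of each list, collapsing most of the SGD grid to a single candidate so the example search stays small; removing a slice restores the full list of options for that parameter. For example:
# [:1] keeps only the first candidate value from each list
print([None, "l2", "l1", "elasticnet"][:1])  # [None]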
sgd_definition = {
"clc": sgd,
"estimator_name": sgd_name,
"tuned_parameters": tuned_parameters_sgd,
"randomized_grid": False,
"early": False,
}
################################# XGB Regression ###############################
xgb_name = "xgb"
xgb = XGBRegressor(random_state=3)
tuned_parameters_xgb = [
{
f"{xgb_name}__learning_rates": [0.1, 0.01, 0.05][:1],
f"{xgb_name}__n_estimators": [100, 200, 300][
:1
], # Number of trees. Equivalent to n_estimators in GB
f"{xgb_name}__max_depths": [3, 5, 7][:1], # Maximum depth of the trees
f"{xgb_name}__subsamples": [0.8, 1.0][
:1
], # Subsample ratio of the training instances
f"{xgb_name}__colsample_bytree": [0.8, 1.0][:1],
f"{xgb_name}__eval_metric": ["logloss"],
f"{xgb_name}__early_stopping_rounds": [10],
f"{xgb_name}__tree_method": ["hist"],
f"{xgb_name}__stopping_mode": ["min"],
f"{xgb_name}__stopping_patience": [5],
f"{xgb_name}__verbose": [False],
}
]
xgb_definition = {
"clc": xgb,
"estimator_name": xgb_name,
"tuned_parameters": tuned_parameters_xgb,
"randomized_grid": False,
"early": True,
}
outcome = "PRICE"
features = numerical_features + categorical_features
X, y = df[features], df[outcome]
model_definitions = {
lasso_name: lasso_definition,
ridge_name: ridge_definition,
sgd_name: sgd_definition,
xgb_name: xgb_definition,
}
# Define transformers for different column types
numerical_transformer = Pipeline(
steps=[
("scaler", StandardScaler()),
("imputer", SimpleImputer(strategy="mean")),
]
)
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
("encoder", OneHotEncoder(handle_unknown="ignore")),
]
)
# Create the ColumnTransformer with passthrough
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
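Before handing the preprocessor to model_tuner through pipeline_steps, it can be useful to sanity-check it on its own; a minimal sketch (the number of output columns depends on how many PROPERTY TYPE categories are present):
# Fit the ColumnTransformer directly to confirm it runs and inspect the output width
X_preview = preprocessor.fit_transform(X)
print(X_preview.shape)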
For these examples, we will not perform a K-Fold split, nor will we calibrate our models, so we set both of the following flags to False:
kfold = False
calibrate = False
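If you instead wanted the integrated cross-validation described earlier, this is where the flag would be flipped; it is left commented out because these examples do not use it:
# Alternative (not used in these examples): enable integrated K-Fold cross-validation
# kfold = True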
# Step 4: define model object
model_type = "lasso"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_lasso = Model(
name=f"lasso_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_lasso.grid_search_param_tuning(X, y)
X_train, y_train = model_lasso.get_train_data(X, y)
X_test, y_test = model_lasso.get_test_data(X, y)
X_valid, y_valid = model_lasso.get_valid_data(X, y)
model_lasso.fit(X_train, y_train)
print("Validation Metrics")
model_lasso.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_lasso.return_metrics(X_test, y_test)
# Step 4: define model object
model_type = "ridge"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_ridge = Model(
name=f"ridge_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_ridge.grid_search_param_tuning(X, y)
X_train, y_train = model_ridge.get_train_data(X, y)
X_test, y_test = model_ridge.get_test_data(X, y)
X_valid, y_valid = model_ridge.get_valid_data(X, y)
model_ridge.fit(X_train, y_train)
print("Validation Metrics")
model_ridge.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_ridge.return_metrics(X_test, y_test)
# Step 4: define model object
model_type = "sgd"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_sgd = Model(
name=f"sgd_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_sgd.grid_search_param_tuning(X, y)
X_train, y_train = model_sgd.get_train_data(X, y)
X_test, y_test = model_sgd.get_test_data(X, y)
X_valid, y_valid = model_sgd.get_valid_data(X, y)
model_sgd.fit(X_train, y_train)
print("Validation Metrics")
model_sgd.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_sgd.return_metrics(X_test, y_test)
# Step 4: define model object
model_type = "xgb"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_xgb = Model(
name=f"xgb_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_xgb.grid_search_param_tuning(X, y,)
X_train, y_train = model_xgb.get_train_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)
model_xgb.fit(
X_train,
y_train,
validation_data=[X_valid, y_valid],
)
print("Validation Metrics")
model_xgb.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_xgb.return_metrics(X_test, y_test)
In the example below, we use the Ridge model to make predictions on the test set.
model_ridge.predict(X_test)
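These predictions can also be scored directly with standard scikit-learn metrics, independently of return_metrics; a small sketch, assuming the splits produced by each model object are identical because they share the same random_state:
from sklearn.metrics import mean_absolute_error, r2_score

# Score the Ridge predictions on the held-out test set
y_pred_ridge = model_ridge.predict(X_test)
print("R^2:", r2_score(y_test, y_pred_ridge))
print("MAE:", mean_absolute_error(y_test, y_pred_ridge))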
In the example below, we use the Lasso model to produce bootstrapped metrics.
print("Bootstrap Metrics")
model_lasso.return_bootstrap_metrics(
X_test=X_test,
y_test=y_test,
metrics=["r2", "explained_variance"],
n_samples=30,
num_resamples=300,
)
Redfin. (n.d.). Redfin: Real Estate, Homes for Sale, MLS Listings, Agents. Retrieved from https://www.redfin.com