This Google Colab notebook utilizes a dataset on the Los Angeles real estate market sourced from Redfin, consisting of 200 rows and 27 columns. The dataset captures a snapshot in time of various property listings, providing detailed information about each property.
The dataset includes the following columns:
SALE TYPE
SOLD DATE
PROPERTY TYPE
ADDRESS
CITY
STATE OR PROVINCE
ZIP OR POSTAL CODE
PRICE
BEDS
BATHS
LOCATION
SQUARE FEET
LOT SIZE
YEAR BUILT
DAYS ON MARKET
$/SQUARE FEET
HOA/MONTH
STATUS
NEXT OPEN HOUSE START TIME
NEXT OPEN HOUSE END TIME
URL (SEE https://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)
SOURCE
MLS#
FAVORITE
INTERESTED
LATITUDE
LONGITUDE
The primary purpose of this notebook is to demonstrate various regression examples using the Redfin dataset. Specifically, the following variables are used for modeling in this notebook:
Features (X):
BEDS
BATHS
SQUARE FEET
LOT SIZE
Target (y):
PRICE
The dataset is sourced from Redfin; see Redfin's Comparative Market Analysis page (https://www.redfin.com/buy-a-home/comparative-market-analysis) for more information on pricing.
This dataset represents a single snapshot in time and may not reflect current market conditions.
This notebook provides a guide on how to install and use the model_tuner library in a notebook environment like Google Colab.
The model_tuner library is designed to streamline the process of hyperparameter tuning and model optimization for machine learning algorithms. It provides an easy-to-use interface for defining, tuning, and evaluating models.
Automatic Hyperparameter Tuning
The library can automatically tune hyperparameters for a variety of machine learning models using advanced optimization techniques.
Cross-Validation
Integrated cross-validation ensures that the models are evaluated robustly, helping to guard against overfitting.
For detailed documentation and advanced usage of the model_tuner library, please refer to the model_tuner documentation.
By following these steps, you should be able to install and use the model_tuner library effectively in your notebook environment. If you encounter any issues or have further questions, feel free to reach out for support.
To install the model_tuner library, use the following command:
! pip install model_tuner
After installation, you can import the necessary components from the model_tuner library as shown below:
import model_tuner
from model_tuner import Model
To confirm that the model_tuner library is installed correctly, you can inspect its built-in help documentation:
help(model_tuner)
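If you specifically want the installed version number, the standard-library importlib.metadata can report it; a minimal sketch, assuming the distribution is named model_tuner (matching the pip command above):
from importlib.metadata import version

# Distribution name assumed to match the pip install command above
print(version("model_tuner"))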
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso, Ridge, SGDRegressor
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action="ignore", category=DataConversionWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Direct download link to the Excel file
url = (
"https://github.com/uclamii/model_tuner/raw/main/public_data/"
"redfin_2024-04-16-15-59-17.xlsx"
)
# Read the Excel file
df = pd.read_excel(url)
df.head() # inspect first 5 rows of data
df["PROPERTY TYPE"].unique()
df.columns # inspect the list of cols in the dataset
print(f"This dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
df = df.drop(df.index[0]) # remove first row of dataframe which is not used
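Because the pipeline defined later relies on imputation, it can also help to glance at how many values are missing in the columns we are about to model; a quick, optional check:
# Count missing values in the modeling columns
cols_of_interest = ["BEDS", "BATHS", "SQUARE FEET", "LOT SIZE", "PROPERTY TYPE", "PRICE"]
print(df[cols_of_interest].isna().sum())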
# >2 categories
categorical_features = [
"PROPERTY TYPE",
]
# continuous or binary
numerical_features = [
"BEDS",
"BATHS",
"SQUARE FEET",
"LOT SIZE"
]
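Depending on how the spreadsheet is parsed, the numeric feature columns may arrive as strings or contain blanks. The optional sketch below coerces them to numeric; entries that cannot be parsed become NaN, which the imputer defined later will handle:
# Optional: ensure the numeric feature and target columns are truly numeric
for col in numerical_features + ["PRICE"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")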
################################ Lasso Regression ##############################
lasso_name = "lasso"
lasso = Lasso(random_state=3)
tuned_parameters_lasso = [
{
f"{lasso_name}__fit_intercept": [True, False],
f"{lasso_name}__precompute": [True, False],
f"{lasso_name}__copy_X": [True, False],
f"{lasso_name}__max_iter": [100, 500, 1000, 2000],
f"{lasso_name}__tol": [1e-4, 1e-3],
f"{lasso_name}__warm_start": [True, False],
f"{lasso_name}__positive": [True, False],
}
]
lasso_definition = {
"clc": lasso,
"estimator_name": lasso_name,
"tuned_parameters": tuned_parameters_lasso,
"randomized_grid": False,
"early": False,
}
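Note that the grid above varies solver-related options but not alpha, Lasso's regularization strength. If you also want to search over it, a minimal extension would be the following (the alpha values are illustrative, not part of the original example):
# Hypothetical extension: also tune the regularization strength alpha
tuned_parameters_lasso[0][f"{lasso_name}__alpha"] = [0.1, 1.0, 10.0]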
################################ Ridge Regression ##############################
ridge_name = "ridge"
ridge = Ridge(random_state=3)
tuned_parameters_ridge = [
{
f"{ridge_name}__max_iter": [100, 200, 500],
f"{ridge_name}__alpha": [0.1, 1, 0.5],
}
]
ridge_definition = {
"clc": ridge,
"estimator_name": ridge_name,
"tuned_parameters": tuned_parameters_ridge,
"randomized_grid": False,
"early": False,
}
################################# SGD Regression ###############################
sgd_name = "sgd"
sgd = SGDRegressor(random_state=3)
tuned_parameters_sgd = [
{
f"{sgd_name}__loss": [
"squared_error",
"huber",
"epsilon_insensitive",
"squared_epsilon_insensitive",
],
f"{sgd_name}__penalty": [None, "l2", "l1", "elasticnet"][:1],
f"{sgd_name}__alpha": [0.0001, 0.001, 0.01, 0.1][:1],
f"{sgd_name}__l1_ratio": [
0.15,
0.25,
0.5,
0.75,
][
:1
], # Only used if penalty is 'elasticnet'
f"{sgd_name}__fit_intercept": [True, False][:1],
f"{sgd_name}__max_iter": [1000, 2000, 3000][:1],
f"{sgd_name}__tol": [1e-3, 1e-4][:1],
f"{sgd_name}__epsilon": [
0.1,
0.2,
], # Only used for 'huber' and 'epsilon_insensitive'
f"{sgd_name}__learning_rate": [
"constant",
"optimal",
"invscaling",
"adaptive",
][:1],
f"{sgd_name}__eta0": [
0.01,
0.1,
][:1],
f"{sgd_name}__power_t": [
0.25,
0.5,
][:1],
f"{sgd_name}__early_stopping": [True, False][:1],
f"{sgd_name}__validation_fraction": [
0.1,
0.2,
][:1],
f"{sgd_name}__n_iter_no_change": [
5,
10,
][:1],
f"{sgd_name}__warm_start": [True, False][:1],
f"{sgd_name}__average": [
False,
True,
10,
][:1],
}
]
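The [:1] slices above keep only the first value of each list, collapsing most of the SGD grid to a single candidate so the example search stays small; removing a slice restores the full list of options for that parameter. For example:
# [:1] keeps only the first candidate value from each list
print([None, "l2", "l1", "elasticnet"][:1])  # [None]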
sgd_definition = {
"clc": sgd,
"estimator_name": sgd_name,
"tuned_parameters": tuned_parameters_sgd,
"randomized_grid": False,
"early": False,
}
################################# XGB Regression ###############################
xgb_name = "xgb"
xgb = XGBRegressor(random_state=3)
tuned_parameters_xgb = [
{
f"{xgb_name}__learning_rates": [0.1, 0.01, 0.05][:1],
f"{xgb_name}__n_estimators": [100, 200, 300][
:1
], # Number of trees. Equivalent to n_estimators in GB
f"{xgb_name}__max_depths": [3, 5, 7][:1], # Maximum depth of the trees
f"{xgb_name}__subsamples": [0.8, 1.0][
:1
], # Subsample ratio of the training instances
f"{xgb_name}__colsample_bytree": [0.8, 1.0][:1],
f"{xgb_name}__eval_metric": ["logloss"],
f"{xgb_name}__early_stopping_rounds": [10],
f"{xgb_name}__tree_method": ["hist"],
f"{xgb_name}__stopping_mode": ["min"],
f"{xgb_name}__stopping_patience": [5],
f"{xgb_name}__verbose": [False],
}
]
xgb_definition = {
"clc": xgb,
"estimator_name": xgb_name,
"tuned_parameters": tuned_parameters_xgb,
"randomized_grid": False,
"early": True,
}
outcome = "PRICE"
features = numerical_features + categorical_features
X, y = df[features], df[outcome]
model_definitions = {
lasso_name: lasso_definition,
ridge_name: ridge_definition,
sgd_name: sgd_definition,
xgb_name: xgb_definition,
}
# Define transformers for different column types
numerical_transformer = Pipeline(
steps=[
("scaler", StandardScaler()),
("imputer", SimpleImputer(strategy="mean")),
]
)
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
("encoder", OneHotEncoder(handle_unknown="ignore")),
]
)
# Create the ColumnTransformer with passthrough
preprocessor = ColumnTransformer(
transformers=[
("num", numerical_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
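Before handing the preprocessor to model_tuner through pipeline_steps, it can be useful to sanity-check it on its own; a minimal sketch (the number of output columns depends on how many PROPERTY TYPE categories are present):
# Fit the ColumnTransformer directly to confirm it runs and inspect the output width
X_preview = preprocessor.fit_transform(X)
print(X_preview.shape)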
For these examples, we will not perform a K-Fold split, nor will we calibrate our models, so we set both of the following flags to False:
kfold = False
calibrate = False
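If you instead wanted the integrated cross-validation described earlier, this is where the flag would be flipped; it is left commented out because these examples do not use it:
# Alternative (not used in these examples): enable integrated K-Fold cross-validation
# kfold = True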
# Step 4: define model object
model_type = "lasso"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_lasso = Model(
name=f"lasso_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_lasso.grid_search_param_tuning(X, y)
X_train, y_train = model_lasso.get_train_data(X, y)
X_test, y_test = model_lasso.get_test_data(X, y)
X_valid, y_valid = model_lasso.get_valid_data(X, y)
model_lasso.fit(X_train, y_train)
print("Validation Metrics")
model_lasso.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_lasso.return_metrics(X_test, y_test)
# Step 4: define model object
model_type = "ridge"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_ridge = Model(
name=f"ridge_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_ridge.grid_search_param_tuning(X, y)
X_train, y_train = model_ridge.get_train_data(X, y)
X_test, y_test = model_ridge.get_test_data(X, y)
X_valid, y_valid = model_ridge.get_valid_data(X, y)
model_ridge.fit(X_train, y_train)
print("Validation Metrics")
model_ridge.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_ridge.return_metrics(X_test, y_test)
# Step 4: define model object
model_type = "sgd"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_sgd = Model(
name=f"sgd_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_sgd.grid_search_param_tuning(X, y)
X_train, y_train = model_sgd.get_train_data(X, y)
X_test, y_test = model_sgd.get_test_data(X, y)
X_valid, y_valid = model_sgd.get_valid_data(X, y)
model_sgd.fit(X_train, y_train)
print("Validation Metrics")
model_sgd.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_sgd.return_metrics(X_test, y_test)
# Step 4: define model object
model_type = "xgb"
clc = model_definitions[model_type]["clc"]
estimator_name = model_definitions[model_type]["estimator_name"]
# Set the parameters by cross-validation
tuned_parameters = model_definitions[model_type]["tuned_parameters"]
rand_grid = model_definitions[model_type]["randomized_grid"]
early_stop = model_definitions[model_type]["early"]
model_xgb = Model(
name=f"xgb_{model_type}",
estimator_name=estimator_name,
model_type="regression",
calibrate=calibrate,
estimator=clc,
kfold=kfold,
pipeline_steps=[("ColumnTransformer", preprocessor)],
stratify_y=False,
grid=tuned_parameters,
randomized_grid=rand_grid,
boost_early=early_stop,
scoring=["r2"],
random_state=3,
n_jobs=2,
)
model_xgb.grid_search_param_tuning(X, y,)
X_train, y_train = model_xgb.get_train_data(X, y)
X_test, y_test = model_xgb.get_test_data(X, y)
X_valid, y_valid = model_xgb.get_valid_data(X, y)
model_xgb.fit(
X_train,
y_train,
validation_data=[X_valid, y_valid],
)
print("Validation Metrics")
model_xgb.return_metrics(X_valid, y_valid)
print("Test Metrics")
model_xgb.return_metrics(X_test, y_test)
In the example below, we use the Ridge model to make predictions on the test set.
model_ridge.predict(X_test)
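These predictions can also be scored directly with standard scikit-learn metrics, independently of return_metrics; a small sketch, assuming the splits produced by each model object are identical because they share the same random_state:
from sklearn.metrics import mean_absolute_error, r2_score

# Score the Ridge predictions on the held-out test set
y_pred_ridge = model_ridge.predict(X_test)
print("R^2:", r2_score(y_test, y_pred_ridge))
print("MAE:", mean_absolute_error(y_test, y_pred_ridge))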
In the example below, we use the Lasso model to produce bootstrapped metrics.
print("Bootstrap Metrics")
model_lasso.return_bootstrap_metrics(
X_test=X_test,
y_test=y_test,
metrics=["r2", "explained_variance"],
n_samples=30,
num_resamples=300,
)
Redfin. (n.d.). Redfin: Real Estate, Homes for Sale, MLS Listings, Agents. Retrieved from https://www.redfin.com