This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.


Diabetes

In this activity, you’ll train several regression models to predict the disease progression one year after.

The Diabetes dataset contains ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Environment setup

# Import base packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Import ML packages
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

Step 1: Loading the data

dataset = load_diabetes()

# Put data in a pandas DataFrame
df_diab = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_diab["target"] = dataset.target
# Show 10 random samples
df_diab.sample(n=10)

Step 2: Preparing the data

Question

Split the dataset into training (variables x_train, y_train) and test sets (variables x_test, y_test) with a 20% ratio.

# YOUR CODE HERE
print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")

assert x_train.shape == (353, 10)
assert y_train.shape == (353,)
assert x_test.shape == (89, 10)
assert y_test.shape == (89,)

Step 3: Training several models

def eval_model(model):
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Train and test MSE
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    print(f"Training MSE: {train_mse:.2f}. Test MSE: {test_mse:.2f}")
    
    return train_mse, test_mse

Question

Create and train a Decision Tree, a MultiLayer Perceptron and a Random Forest on the training data.

Compute their MSE on the training and test data.

# Import the needed scikit-learn packages
# YOUR CODE HERE
# Create and train a Decision Tree
# YOUR CODE HERE

_ = eval_model(dt_model)
# Create and train a MLP
# YOUR CODE HERE

_ = eval_model(mlp_model)
# Create and train a Random Forest
# YOUR CODE HERE

_ = eval_model(rf_model)

Step 4: Tuning the most promising model

Question

Choose the most promising model and tune it, using a GridSearchCV instance stored in the grid_search_cv variable.

Your test MSE should be less than 3500.

# YOUR CODE HERE
# Search for the best parameters with the specified classifier on training data
grid_search_cv.fit(x_train, y_train)

# Print the best combination of hyperparameters found
print(grid_search_cv.best_params_)
# Evaluate best estimator
train_mse, test_mse = eval_model(grid_search_cv.best_estimator_)

assert train_mse < 1000
assert test_mse < 3500