This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.


Boston housing prices

The goal of this activity is to predict the median price (in $1,000s) of homes given their characteristics.

The dataset used here has ethical problems: scikit-learn deprecated load_boston in version 1.0 and removed it in version 1.2, so this activity requires an earlier version. The dataset is kept here as an example of possible issues with Machine Learning.

The Boston Housing Prices dataset is frequently used to test regression algorithms.

Boston suburb

The dataset contains information gathered in the 1970s concerning housing in the Boston suburban area. Each sample describes a census tract (not an individual house) through the following features.

Feature  Description
0        Per capita crime rate by town
1        Proportion of residential land zoned for lots over 25,000 sq. ft.
2        Proportion of non-retail business acres per town
3        Charles River dummy variable (1 if tract bounds river; 0 otherwise)
4        Nitric oxides concentration (parts per 10 million)
5        Average number of rooms per dwelling
6        Proportion of owner-occupied units built prior to 1940
7        Weighted distances to five Boston employment centres
8        Index of accessibility to radial highways
9        Full-value property-tax rate per $10,000
10       Pupil-teacher ratio by town
11       1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
12       Percentage of lower status of the population

Environment setup

# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor, LinearRegression

Step 1: Loading the data

dataset = load_boston()

# Describe the dataset
print(dataset.DESCR)
# Show a sample of raw training data
df_boston = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_boston['MEDV'] = dataset.target
# Show 10 random samples
df_boston.sample(n=10)

Step 2: Preparing the data

Question

Store input data and labels into the x and y variables respectively.

# YOUR CODE HERE
print(f'x: {x.shape}. y: {y.shape}')

assert x.shape == (506, 13)
assert y.shape == (506,)

Question

Prepare data for training. Split the data into a training set and a test set, holding out 20% of the samples for testing. Store the subsets in variables named x_train/y_train and x_test/y_test.
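As a hint, here is a minimal sketch of a 20% hold-out split. It runs on a synthetic stand-in with the same shape as the Boston data; the names x_demo, y_demo, x_tr, x_te, y_tr, y_te and the random_state value are illustrative assumptions, not part of the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data (assumption)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# test_size=0.2 holds out 20% of the samples for testing;
# random_state is an arbitrary seed that makes the split reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)
print(f"train: {x_tr.shape}, test: {x_te.shape}")
```

With 506 samples, the test subset is rounded up to 102 samples, which leaves 404 for training.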

# YOUR CODE HERE
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (404, 13)
assert y_train.shape == (404,)
assert x_test.shape == (102, 13)
assert y_test.shape == (102,)

Question

Scale features by standardization while preventing information leakage from the test set. This means standardization values (mean and standard deviation) should be computed on the training set only.
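The sketch below illustrates the idea on synthetic data, assuming StandardScaler is the tool used: the scaler is fitted on the training subset only, and the same statistics are then applied to the test subset. The array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled train/test subsets (assumption, for illustration only)
rng = np.random.default_rng(0)
x_tr = rng.normal(loc=5.0, scale=3.0, size=(404, 13))
x_te = rng.normal(loc=5.0, scale=3.0, size=(102, 13))

scaler = StandardScaler()
# fit_transform learns each feature's mean and standard deviation
# from the training data, then standardizes it
x_tr_scaled = scaler.fit_transform(x_tr)
# transform reuses the training statistics: nothing is learned from the test set
x_te_scaled = scaler.transform(x_te)

print(f"train mean: {x_tr_scaled.mean():.6f}, train std: {x_tr_scaled.std():.6f}")
print(f"test mean: {x_te_scaled.mean():.6f}, test std: {x_te_scaled.std():.6f}")
```

The scaled training data has zero mean and unit standard deviation by construction, whereas the test statistics are only close to those values because they were not used for fitting.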

# YOUR CODE HERE
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')

mean_test = x_test.mean()
std_test = x_test.std()
print(f'mean_test: {mean_test}. std_test: {std_test}')

assert np.max(np.abs(mean_train)) < 10**-6
assert np.max(np.abs(std_train - 1)) < 10**-6

Step 3: Training a model

Question

Create a SGDRegressor instance and store it into the model variable. Fit this model on the training data.
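For reference, a minimal sketch of fitting an SGDRegressor with its default settings on synthetic, already standardized data. The data, the name sgd_demo and the random_state seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic linear data with a little noise (assumption, for illustration only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(404, 13))
true_w = rng.normal(size=13)
y_demo = x_demo @ true_w + 2.0 + rng.normal(scale=0.1, size=404)

# Default hyperparameters; random_state only makes the run reproducible
sgd_demo = SGDRegressor(random_state=0)
sgd_demo.fit(x_demo, y_demo)

# One weight per feature, plus an intercept
print(sgd_demo.coef_.shape, sgd_demo.intercept_)
print(f"R^2 on the training data: {sgd_demo.score(x_demo, y_demo):.3f}")
```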

# YOUR CODE HERE

Step 4: Evaluating the model

Question

Compute the training and test MSE and store them in the mse_train and mse_test variables respectively. Store the test set predictions in y_test_pred, which the plotting code uses.
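As a reminder, the MSE is the mean of the squared differences between targets and predictions. The sketch below computes it by hand and with scikit-learn's mean_squared_error on a few made-up values (the numbers are illustrative, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (assumption, for illustration only)
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 33.4])

# Manual computation: average of the squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)
# Equivalent library call
mse_sklearn = mean_squared_error(y_true, y_pred)

print(f"manual: {mse_manual:.2f}, scikit-learn: {mse_sklearn:.2f}")
```

Both values agree: the errors are -1, 1, 2 and 0, so the MSE is about 1.5.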

# YOUR CODE HERE
print(f'Training MSE: {mse_train:.2f}. Test MSE: {mse_test:.2f}')

plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")

Question

Go back to step 3 and try to obtain the best possible test MSE by tweaking the SGDRegressor parameters.
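One possible way to explore the parameters is a small loop over candidate values. The sketch below varies only the initial learning rate eta0 on synthetic data; the candidate values, max_iter=2000 and all variable names are arbitrary choices for illustration, and other SGDRegressor parameters (such as alpha, penalty or learning_rate) could be explored the same way.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Synthetic standardized data, split 404/102 (assumption, for illustration only)
rng = np.random.default_rng(0)
x_all = rng.normal(size=(506, 13))
w = rng.normal(size=13)
y_all = x_all @ w + rng.normal(scale=0.5, size=506)
x_tr, x_te = x_all[:404], x_all[404:]
y_tr, y_te = y_all[:404], y_all[404:]

# Train one model per candidate learning rate and record its test MSE
results = {}
for eta0 in (0.0001, 0.001, 0.01):
    sgd = SGDRegressor(eta0=eta0, max_iter=2000, random_state=0)
    sgd.fit(x_tr, y_tr)
    results[eta0] = mean_squared_error(y_te, sgd.predict(x_te))

# Keep the learning rate with the lowest test MSE
best_eta0 = min(results, key=results.get)
for eta0, mse in results.items():
    print(f"eta0={eta0}: test MSE={mse:.3f}")
print(f"best eta0: {best_eta0}")
```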

Step 5: Use another regression algorithm

Question

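  • Create and fit a LinearRegression instance, which computes the least-squares solution analytically (the result given by the normal equation) instead of using gradient descent.

  • Compute the training and test MSE for this instance (variables mse_train_n and mse_test_n), storing the test set predictions in y_test_pred_n for the plot. How does it compare to the SGDRegressor in this case?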
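To make the link with the normal equation concrete, the sketch below solves theta = (X^T X)^-1 X^T y with NumPy on synthetic data and checks that LinearRegression finds the same coefficients. The data and variable names are assumptions; scikit-learn itself relies on a least-squares solver rather than this literal formula, but the solution is the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data with a little noise (assumption, for illustration only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(404, 13))
y_demo = x_demo @ rng.normal(size=13) + 3.0 + rng.normal(scale=0.1, size=404)

# Add a column of ones so that theta[0] plays the role of the intercept
x_b = np.c_[np.ones(len(x_demo)), x_demo]
# Normal equation, solved as a linear system instead of inverting X^T X
theta = np.linalg.solve(x_b.T @ x_b, x_b.T @ y_demo)

lin_demo = LinearRegression().fit(x_demo, y_demo)

print(np.allclose(theta[0], lin_demo.intercept_))
print(np.allclose(theta[1:], lin_demo.coef_))
```

Unlike SGDRegressor, this approach has no learning rate or iteration count to tune, which is why it makes a convenient baseline on a small dataset.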

# YOUR CODE HERE
# YOUR CODE HERE
print(f'Training MSE: {mse_train_n:.2f}. Test MSE: {mse_test_n:.2f}')

plt.scatter(y_test, y_test_pred_n)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")