Boston housing prices

The goal of this activity is to predict the median price of homes (in $1,000s) given their characteristics.

The dataset used here has ethical problems: scikit-learn deprecated load_boston in version 1.0 and removed it in version 1.2, so the code below requires an earlier release. It is kept here as an example of the issues that can arise in Machine Learning.

The Boston Housing Prices dataset is frequently used to test regression algorithms.

[Image: Boston suburb]

The dataset contains information gathered in the 1970s concerning housing in the Boston suburban area. Each sample describes a census tract (not an individual house) through the following features.

Feature	Description
0	Per capita crime rate by town
1	Proportion of residential land zoned for lots over 25,000 sq.ft.
2	Proportion of non-retail business acres per town.
3	Charles River dummy variable (1 if tract bounds river; 0 otherwise)
4	Nitric oxides concentration (parts per 10 million)
5	Average number of rooms per dwelling
6	Proportion of owner-occupied units built prior to 1940
7	Weighted distances to five Boston employment centres
8	Index of accessibility to radial highways
9	Full-value property-tax rate per $10,000
10	Pupil-teacher ratio by town
11	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
12	Percentage of lower status of the population

Environment setup

# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor, LinearRegression

Step 1: Loading the data

dataset = load_boston()

# Describe the dataset
print(dataset.DESCR)

# Show a sample of raw training data
df_boston = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_boston['MEDV'] = dataset.target
# Show 10 random samples
df_boston.sample(n=10)

Step 2: Preparing the data

Question

Store input data and labels into the x and y variables respectively.
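As a hedged illustration of separating inputs from labels, the sketch below works on a tiny made-up DataFrame rather than the real dataset. The names df_demo, x_demo and y_demo and the values in the table are assumptions for illustration only. With the object returned by the loader, dataset.data and dataset.target already hold arrays of this kind.

```python
import pandas as pd

# Toy DataFrame (assumption): two feature columns and a 'MEDV' target column
df_demo = pd.DataFrame({
    'RM': [6.5, 5.9, 7.1],
    'LSTAT': [4.9, 12.3, 3.1],
    'MEDV': [24.0, 19.5, 34.7],
})

# Inputs: every column except the target, as a 2D NumPy array
x_demo = df_demo.drop(columns='MEDV').to_numpy()
# Labels: the target column, as a 1D NumPy array
y_demo = df_demo['MEDV'].to_numpy()

print(f'x_demo: {x_demo.shape}. y_demo: {y_demo.shape}')
# prints: x_demo: (3, 2). y_demo: (3,)
```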

# YOUR CODE HERE

print(f'x: {x.shape}. y: {y.shape}')

assert x.shape == (506, 13)
assert y.shape == (506,)

Question

Prepare data for training. Split it into a training set (variables x_train/y_train) and a test set (variables x_test/y_test), keeping 20% of the samples for the test set.
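A minimal sketch of such a hold-out split, run on synthetic arrays with the same shape as the dataset rather than on the real data. The names x_demo/y_demo and the random_state value are assumptions chosen for illustration. With a fractional test_size, train_test_split rounds the test-set size up, which is why 506 samples yield 102 test samples and 404 training samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 506 samples, 13 features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# Hold out 20% of the samples for testing; random_state makes the split reproducible
x_demo_train, x_demo_test, y_demo_train, y_demo_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print(x_demo_train.shape, x_demo_test.shape)
# prints: (404, 13) (102, 13)
```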

# YOUR CODE HERE

print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (404, 13)
assert y_train.shape == (404,)
assert x_test.shape == (102, 13)
assert y_test.shape == (102,)

Question

Scale features by standardization while preventing information leakage from the test set. This means standardization values (mean and standard deviation) should be computed on the training set only.
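One possible way to do this is sketched below on synthetic data: the scaler is fitted on the training subset only, then the same transformation is applied to both subsets. The array names and the distribution parameters are assumptions for illustration. The last lines check that the result matches a manual standardization using the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in data (assumption)
rng = np.random.default_rng(0)
x_demo_train = rng.normal(loc=5.0, scale=3.0, size=(80, 4))
x_demo_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

# Compute per-feature mean and standard deviation on the training set only
scaler = StandardScaler().fit(x_demo_train)

# Apply the same training statistics to both subsets
x_demo_train_s = scaler.transform(x_demo_train)
x_demo_test_s = scaler.transform(x_demo_test)

# Equivalent manual computation for the test set
manual_test = (x_demo_test - x_demo_train.mean(axis=0)) / x_demo_train.std(axis=0)
print(np.allclose(x_demo_test_s, manual_test))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, since its statistics were not used.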

# YOUR CODE HERE

mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')

mean_test = x_test.mean()
std_test = x_test.std()
print(f'mean_test: {mean_test}. std_test: {std_test}')

assert np.abs(np.max(mean_train)) < 10**-6
assert np.abs(np.max(std_train - 1)) < 10**-6

Step 3: Training a model

Question

Create an SGDRegressor instance and store it into the model variable. Fit this model on the training data.
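The following sketch shows the create-then-fit pattern on a synthetic linear problem instead of the housing data. The names sgd_demo and true_w, the data, and the parameter values are assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic linear problem (assumption): 3 standardized features plus noise
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = x_demo @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

# Create the model, then fit it with stochastic gradient descent
sgd_demo = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_demo.fit(x_demo, y_demo)

# Learned weights and intercept should be close to true_w and 3.0
print(sgd_demo.coef_, sgd_demo.intercept_)
```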

# YOUR CODE HERE

Step 4: Evaluating the model

Question

Compute the training and test MSE into the mse_train and mse_test variables respectively. Store the predictions on the test set in y_test_pred, which the plot below uses.
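As a reminder of what the MSE measures, here is a small worked sketch on made-up values. It computes the mean of the squared differences by hand and compares it with mean_squared_error from sklearn.metrics, which is not imported in the environment setup above. The arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions (assumption)
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# MSE by hand: squared errors are 0.25, 0.25, 0 and 1
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# Same metric through scikit-learn
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)
# prints: 0.375 0.375
```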

# YOUR CODE HERE

print(f'Training MSE: {mse_train:.2f}. Test MSE: {mse_test:.2f}')

plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")

Question

Go back to step 3 and try to obtain the best possible test MSE by tweaking the SGDRegressor parameters.
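One way to organize such tweaking is a small loop over candidate values, sketched here on synthetic data. The chosen parameters (eta0 for the initial learning rate, alpha for the regularization strength), their candidate values, and all variable names are assumptions for illustration. Other SGDRegressor parameters can be explored the same way.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 5))
y_demo = x_demo @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + rng.normal(scale=0.5, size=300)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

# Fit one model per parameter combination and record its test MSE
results = {}
for eta0 in (0.001, 0.01, 0.05):
    for alpha in (1e-5, 1e-3, 1e-1):
        candidate = SGDRegressor(eta0=eta0, alpha=alpha, max_iter=2000, random_state=0)
        candidate.fit(x_tr, y_tr)
        results[(eta0, alpha)] = mean_squared_error(y_te, candidate.predict(x_te))

# Keep the combination with the lowest test MSE
best_params = min(results, key=results.get)
print(f'Best (eta0, alpha): {best_params}. Test MSE: {results[best_params]:.3f}')
```

Selecting parameters on the test set makes the reported test MSE optimistic. Outside of this exercise, a separate validation set or cross-validation would usually be preferred.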

Step 5: Use another regression algorithm

Question

Create and fit a LinearRegression instance, which computes a closed-form least-squares solution (equivalent to the normal equation) instead of using gradient descent.
Compute the training and test MSE for this instance (variables mse_train_n and mse_test_n) and store its test predictions in y_test_pred_n. How does it compare to the SGDRegressor in this case?
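To make the link with the normal equation concrete, the sketch below solves it directly with NumPy on synthetic data and compares the result with a fitted LinearRegression. The data and the names x_b, theta and lin_demo are assumptions for illustration. scikit-learn itself relies on a least-squares solver rather than this explicit formula, but both should give the same solution on a well-conditioned problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear problem (assumption)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Normal equation: theta = (X^T X)^-1 X^T y, with a column of ones for the intercept
x_b = np.c_[np.ones(len(x_demo)), x_demo]
theta = np.linalg.solve(x_b.T @ x_b, x_b.T @ y_demo)

# Same problem solved by scikit-learn
lin_demo = LinearRegression().fit(x_demo, y_demo)

print(theta)
print(lin_demo.intercept_, lin_demo.coef_)
```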

# YOUR CODE HERE

# YOUR CODE HERE

print(f'Training MSE: {mse_train_n:.2f}. Test MSE: {mse_test_n:.2f}')

plt.scatter(y_test, y_test_pred_n)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")

Machine Learning Katas
