This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.


Breast cancer

In this activity, you’ll use a K-Nearest Neighbors classifier to help diagnose breast tumors.

The Breast Cancer dataset is used for multivariate binary classification between benign and malignant tumors. There are 569 samples in total, with 30 features each. The features were computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image.

Environment setup

# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

Step 1: Loading the data

dataset = load_breast_cancer()

# Put data in a pandas DataFrame
df_breast_cancer = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_breast_cancer['target'] = dataset.target
df_breast_cancer['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_breast_cancer.sample(n=10)

Step 2: Preparing the data

Question

Compute the number of features in the dataset and store it in the num_features variable.

# YOUR CODE HERE
print(f'Number of features: {num_features}')

assert num_features == 30

Question

To evaluate the class distribution, compute the number of benign and malignant tumors and store them in the num_benign and num_malignant variables respectively.

# YOUR CODE HERE
print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')

assert num_benign == 357
assert num_malignant == 212

# Store input and labels
x = dataset.data
y = dataset.target

print(f'x: {x.shape}. y: {y.shape}')
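
As a hint for the two questions above, here is a minimal sketch of how a feature count and per-class counts can be read from NumPy arrays. It uses a small made-up array rather than the real dataset; the names toy_x, toy_y, toy_num_features and class_counts are hypothetical and not part of the activity. In scikit-learn's copy of this dataset, dataset.target_names is ['malignant', 'benign'], so label 0 means malignant and label 1 means benign.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 features, binary labels
toy_x = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.5, 2.5, 3.5],
    [4.5, 5.5, 6.5],
    [7.5, 8.5, 9.5],
])
toy_y = np.array([0, 1, 1, 0, 1, 1])

# The second dimension of the input array is the number of features
toy_num_features = toy_x.shape[1]

# np.bincount counts occurrences of each non-negative integer label;
# index i of the result holds the number of samples with label i
class_counts = np.bincount(toy_y)

print(f'Features: {toy_num_features}. Count per label: {class_counts}')
```

The same counts could also be obtained from the DataFrame with df_breast_cancer['class'].value_counts().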

Question

Split the dataset into training and test sets, holding out 25% of the samples for the test set. Use variables x_train, y_train, x_test and y_test.

# YOUR CODE HERE
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (426, 30)
assert y_train.shape == (426,)
assert x_test.shape == (143, 30)
assert y_test.shape == (143,)

Question

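Standardize the features so that each has zero mean and unit variance. To prevent information leakage from the test set, compute the scaling statistics on the training set only, then apply them to both sets. Store the scaled arrays back into x_train and x_test.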

# YOUR CODE HERE
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')

assert np.max(np.abs(mean_train)) < 10**-6
assert np.max(np.abs(std_train - 1)) < 10**-6
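
A minimal sketch of leakage-free standardization, shown on made-up arrays: the scaler learns its mean and standard deviation from the training data only, and the test data is transformed with those same statistics. The toy_ names and values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with features on very different scales
toy_x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_x_test = np.array([[2.5, 250.0], [5.0, 500.0]])

scaler = StandardScaler()

# fit_transform learns per-feature mean and std from the training set
# and applies them in one step
toy_x_train_scaled = scaler.fit_transform(toy_x_train)

# transform reuses the training statistics; the test set is never
# used to compute them
toy_x_test_scaled = scaler.transform(toy_x_test)

print(f'Train mean per feature: {toy_x_train_scaled.mean(axis=0)}')
print(f'Train std per feature: {toy_x_train_scaled.std(axis=0)}')
```

Note that the scaled test set will generally not have exactly zero mean and unit variance, since it is scaled with the training statistics. That is expected.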

Step 3: Creating a classifier

Question

Create a KNeighborsClassifier instance that uses a single nearest neighbor, store it in the model variable, and fit it on the training data.

# YOUR CODE HERE
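
For reference, here is a sketch of the general create-then-fit pattern on a tiny made-up training set. The toy_ names and the data points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy training set: two well-separated clusters
toy_x_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
toy_y_train = np.array([0, 0, 1, 1])

# n_neighbors=1: each prediction copies the label of the single
# closest training sample
toy_model = KNeighborsClassifier(n_neighbors=1)
toy_model.fit(toy_x_train, toy_y_train)

# Predict labels for two new points, one near each cluster
toy_pred = toy_model.predict(np.array([[0.2, 0.1], [0.8, 0.9]]))

# Accuracy on the data the model was fitted on
toy_train_acc = toy_model.score(toy_x_train, toy_y_train)

print(f'Predictions: {toy_pred}. Training accuracy: {toy_train_acc:.2f}')
```

With a single neighbor, every training sample is its own nearest neighbor, so training accuracy is perfect whenever no two identical points carry different labels. Training accuracy alone is therefore not informative for this setting.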

Step 4: Evaluating the classifier

# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Question

Display precision, recall and f1-score for the classifier on test data. Interpret the results.

# YOUR CODE HERE
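
The sketch below shows how these metrics can be obtained from sklearn.metrics, using made-up label vectors in place of real predictions. The toy_ names are hypothetical. It assumes the label encoding of scikit-learn's copy of the dataset (0 for malignant, 1 for benign), which is why pos_label=0 is passed to score the malignant class.

```python
from sklearn import metrics

# Hypothetical true labels and predictions (0 = malignant, 1 = benign)
toy_y_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

# Treat malignant (label 0) as the positive class
precision_malignant = metrics.precision_score(toy_y_true, toy_y_pred, pos_label=0)
recall_malignant = metrics.recall_score(toy_y_true, toy_y_pred, pos_label=0)
f1_malignant = metrics.f1_score(toy_y_true, toy_y_pred, pos_label=0)

print(f'Malignant precision: {precision_malignant:.2f}')
print(f'Malignant recall: {recall_malignant:.2f}')
print(f'Malignant f1-score: {f1_malignant:.2f}')

# Per-class summary of all three metrics
print(metrics.classification_report(
    toy_y_true, toy_y_pred, target_names=['malignant', 'benign']
))
```

In this toy case, every tumor predicted malignant really is malignant (precision 1.0), but one of the three malignant tumors is missed (recall 2/3). In a diagnostic setting, a missed malignant tumor is usually the costlier error, so recall on the malignant class deserves particular attention.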

Question

Go back to step 3 and try to find the best value of k, the number of nearest neighbors.
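
One possible way to organize this search, sketched on synthetic data, is to score each candidate k with cross-validation on the training set instead of repeatedly checking the test set. This is a swapped-in technique, not something the activity prescribes. The demo_ names, the candidate list and the make_classification settings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic stand-in for a training set
demo_x, demo_y = make_classification(n_samples=200, n_features=10, random_state=0)

# Odd values avoid tied votes in binary classification
candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = []

for k in candidate_ks:
    # The pipeline refits the scaler inside each fold, so the held-out
    # fold never influences the scaling statistics
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, demo_x, demo_y, cv=5)
    mean_scores.append(scores.mean())
    print(f'k={k}: mean CV accuracy {scores.mean():.3f}')

# Pick the k with the highest mean cross-validated accuracy
best_k = candidate_ks[int(np.argmax(mean_scores))]
print(f'Best k: {best_k}')
```

Choosing k by test accuracy alone tends to give an optimistic estimate, because the test set then takes part in model selection. Keeping it untouched until the end gives a fairer final score.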