Breast cancer

In this activity, you’ll use a K-Nearest Neighbors classifier to help diagnose breast tumors.

The Breast Cancer dataset is used for multivariate binary classification between benign and malignant tumors. It contains 569 samples with 30 features each. The features were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and they describe characteristics of the cell nuclei present in the image.

Environment setup

# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set up plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')

from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

Step 1: Loading the data

dataset = load_breast_cancer()

# Put data in a pandas DataFrame
df_breast_cancer = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_breast_cancer['target'] = dataset.target
df_breast_cancer['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_breast_cancer.sample(n=10)

Step 2: Preparing the data

Question

Store the number of features in the dataset in the num_features variable.

# YOUR CODE HERE
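
One possible approach, offered as a sketch rather than the expected answer: the feature matrix returned by load_breast_cancer has one row per sample and one column per feature, so its second dimension gives the count. The block reloads the dataset only so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# dataset.data is a 2-D array of shape (n_samples, n_features),
# so the second dimension is the number of features.
num_features = dataset.data.shape[1]
print(f'Number of features: {num_features}')
```

Reading the length of dataset.feature_names would give the same value.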

print(f'Number of features: {num_features}')

assert num_features == 30

Question

To evaluate the class distribution, store the number of benign and malignant tumors in the num_benign and num_malignant variables, respectively.

# YOUR CODE HERE
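
A minimal sketch of one way to count the classes. It looks up each class's integer label in dataset.target_names instead of assuming which label means what. The helper names class_names, benign_label and malignant_label are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Look up each class's integer label by name rather than hard-coding it
class_names = list(dataset.target_names)
benign_label = class_names.index('benign')
malignant_label = class_names.index('malignant')

# Count the samples carrying each label
num_benign = int(np.sum(dataset.target == benign_label))
num_malignant = int(np.sum(dataset.target == malignant_label))
print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')
```

The classes are imbalanced (roughly 63% benign), which is worth keeping in mind when reading accuracy figures later on.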

print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')

assert num_benign == 357
assert num_malignant == 212

# Store input and labels
x = dataset.data
y = dataset.target

print(f'x: {x.shape}. y: {y.shape}')

Question

Split the dataset into training and test sets, holding out 25% of the samples for the test set. Use the variables x_train, y_train, x_test and y_test.

# YOUR CODE HERE
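
A sketch of the split using train_test_split with test_size=0.25. The stratify=y and random_state=42 arguments are assumptions added here, not requirements of the exercise: stratification keeps the benign/malignant proportions similar in both sets, and the fixed seed makes the split reproducible.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
x, y = dataset.data, dataset.target

# Hold out 25% of the samples for testing.
# stratify and random_state are optional choices made for this sketch.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=42
)
print(f'x_train: {x_train.shape}. x_test: {x_test.shape}')
```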

print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (426, 30)
assert y_train.shape == (426,)
assert x_test.shape == (143, 30)
assert y_test.shape == (143,)

Question

Standardize the features while preventing information leakage from the test set (the scaling statistics must come from the training set only).

# YOUR CODE HERE
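
One way this could look, assuming StandardScaler is the intended tool (it is imported in the setup cell). The scaler learns each feature's mean and standard deviation from the training set only, and the same transformation is then applied to the test set, so no test-set statistics influence the preprocessing. The random_state value is an arbitrary choice for this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
# Fit on the training set only, then transform it
x_train = scaler.fit_transform(x_train)
# Reuse the training-set statistics for the test set (no refit)
x_test = scaler.transform(x_test)
```

After this step the scaled test set will generally not have exactly zero mean and unit variance, which is expected.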

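# Check the statistics feature by feature, since standardization is per feature
mean_train = x_train.mean(axis=0)
std_train = x_train.std(axis=0)
print(f'max |mean_train|: {np.max(np.abs(mean_train))}. max |std_train - 1|: {np.max(np.abs(std_train - 1))}')

assert np.max(np.abs(mean_train)) < 10**-6
assert np.max(np.abs(std_train - 1)) < 10**-6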

Step 3: Creating a classifier

Question

Create a KNeighborsClassifier instance that uses a single nearest neighbor, store it in the model variable, and fit it to the training data.

# YOUR CODE HERE
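
A self-contained sketch of this step. It repeats the loading, splitting and scaling so that it runs on its own; in the notebook only the last two statements would be needed. The random_state value is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# k = 1: each prediction copies the label of the single closest training sample
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)
```

With k = 1, every training sample is its own nearest neighbor, so a very high training accuracy says little about how well the model generalizes.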

Step 4: Evaluating the classifier

# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')

Question

Display the precision, recall and F1-score of the classifier on the test data. Interpret the results.

# YOUR CODE HERE
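
A sketch using metrics.classification_report, which prints per-class precision, recall and F1-score in one table. The preceding steps are repeated so the block runs on its own, and the random_state value is an assumption.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=42
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
model = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)

# Per-class precision, recall and F1-score on the held-out test set
y_pred = model.predict(x_test)
report = metrics.classification_report(
    y_test, y_pred, target_names=list(dataset.target_names)
)
print(report)
```

When interpreting the table, recall on the malignant class is arguably the figure to watch for a diagnostic task, since a false negative there corresponds to a missed cancer.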

Question

Go back to step 3 and try to find the best value for k, the number of nearest neighbors.
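
As an alternative to changing k by hand, the search can be automated. The sketch below swaps in a technique the exercise does not prescribe: 5-fold cross-validation on the training set with GridSearchCV, so the test set is not used to choose k. The range 1 to 20, the fold count and random_state are all assumptions. Scaling sits inside a pipeline so that it is refit within each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# make_pipeline names each step after its lowercased class name
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 21))}

# Evaluate each candidate k with 5-fold cross-validation on the training set
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(x_train, y_train)

best_k = search.best_params_['kneighborsclassifier__n_neighbors']
print(f'Best k: {best_k} (mean CV accuracy: {search.best_score_ * 100:.2f}%)')
print(f'Test accuracy: {search.score(x_test, y_test) * 100:.2f}%')
```

The selected k depends on the split and the folds, so it may differ from one seed to another.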
