This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.


Heart disease

In this activity, you’ll use a small dataset provided by the Cleveland Clinic Foundation for Heart Disease.

Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease.

Below is a description of each column.

| Column | Description | Feature Type | Data Type |
|---|---|---|---|
| Age | Age in years | Numerical | integer |
| Sex | (1 = male; 0 = female) | Categorical | integer |
| CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer |
| Trestbpd | Resting blood pressure (in mm Hg on admission to the hospital) | Numerical | integer |
| Chol | Serum cholesterol in mg/dl | Numerical | integer |
| FBS | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) | Categorical | integer |
| RestECG | Resting electrocardiographic results (0, 1, 2) | Categorical | integer |
| Thalach | Maximum heart rate achieved | Numerical | integer |
| Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical | integer |
| Oldpeak | ST depression induced by exercise relative to rest | Numerical | float |
| Slope | The slope of the peak exercise ST segment | Numerical | integer |
| CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer |
| Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | string |
| Target | Diagnosis of heart disease (1 = true; 0 = false) | Classification | integer |

Environment setup

import platform

print(f"Python version: {platform.python_version()}")

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Setup plots
%matplotlib inline
plt.rcParams["figure.figsize"] = 10, 8
%config InlineBackend.figure_format = "retina"
sns.set()
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")

# You may add other imports here as needed
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    RocCurveDisplay,
)

Step 1: load the data

Question

Load the dataset into a pandas DataFrame named df_heart.
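
As a minimal sketch of the loading step, the example below reads CSV text with `pandas.read_csv`. It uses a tiny in-memory CSV with made-up values so that it runs offline; `toy_csv` and `df_toy` are hypothetical names and the rows are not taken from the heart dataset. `read_csv` also accepts a URL string, which is how `csv_url` would presumably be used.

```python
import io

import pandas as pd

# Hypothetical, made-up rows that only mimic the layout of a CSV file
toy_csv = io.StringIO(
    "age,sex,thal,target\n"
    "63,1,fixed,0\n"
    "67,1,normal,1\n"
    "41,0,reversible,0\n"
)

# read_csv accepts a file-like object, a local path or a URL
df_toy = pd.read_csv(toy_csv)
print(f"df_toy: {df_toy.shape}")
```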

csv_url = "https://raw.githubusercontent.com/bpesquet/mlkatas/master/_datasets/heart.csv"

# YOUR CODE HERE
print(f"df_heart: {df_heart.shape}")

assert df_heart.shape == (301, 14)

Step 2: prepare the data

Question

Use the following cells to explore the data.
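
The pandas calls typically used for this kind of first look can be sketched on a toy DataFrame. The data below is made up and `df_toy` and `target_counts` are hypothetical names; the point is only to show what each call reports.

```python
import pandas as pd

# Hypothetical toy data (made-up values), not the heart dataset
df_toy = pd.DataFrame(
    {
        "age": [63, 67, 41, 56, 52],
        "chol": [233, 286, 204, 236, 199],
        "thal": ["fixed", "normal", "reversible", "normal", "normal"],
        "target": [0, 1, 0, 0, 1],
    }
)

df_toy.info()  # column names, dtypes and non-null counts
print(df_toy.head(3))  # first 3 rows
print(df_toy.describe())  # summary statistics for the numerical columns

# Number of samples per class, useful to spot class imbalance
target_counts = df_toy["target"].value_counts()
print(target_counts)
```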

# Print info about the dataset

# YOUR CODE HERE
# Print the first 10 data samples

# YOUR CODE HERE
# Print descriptive statistics for all numerical attributes

# YOUR CODE HERE
# Print distribution of target values

# YOUR CODE HERE

Question

Use the following cells to prepare data for training:

  • Split the data into training and test sets, holding out 20% of the samples for testing.

  • Store inputs and labels in the x_train and y_train variables.

  • Preprocess training input data as needed.
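
The three steps above can be sketched on toy data as follows. This is one possible approach, not necessarily the intended solution: it assumes that numerical columns are standardized and categorical columns are one-hot encoded through a `ColumnTransformer`. All names (`df_toy`, `x_toy`, `preprocessor`, ...) and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (made-up values), not the heart dataset
df_toy = pd.DataFrame(
    {
        "age": [63, 67, 41, 56, 52, 48, 59, 44, 61, 39],
        "chol": [233, 286, 204, 236, 199, 245, 270, 210, 260, 190],
        "thal": ["fixed", "normal", "reversible", "normal", "normal",
                 "reversible", "fixed", "normal", "reversible", "normal"],
        "target": [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    }
)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
df_toy_train, df_toy_test = train_test_split(df_toy, test_size=0.2, random_state=42)

# Separate the inputs from the target column
df_x_toy = df_toy_train.drop(columns="target")
y_toy = df_toy_train["target"].to_numpy()

num_cols = ["age", "chol"]
cat_cols = ["thal"]

# Standardize numerical columns and one-hot encode categorical ones
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
x_toy = preprocessor.fit_transform(df_x_toy)
print(f"x_toy: {x_toy.shape}")
```

One-hot encoding adds one column per category seen in the training data, so the preprocessed array is usually wider than the original DataFrame.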

# Split dataset between training and test

# YOUR CODE HERE
print(f"Training dataset: {df_train.shape}")
print(f"Test dataset: {df_test.shape}")

assert df_train.shape == (240, 14)
assert df_test.shape == (61, 14)
# Split training dataset between inputs and target

# YOUR CODE HERE
print(f"Training data: {df_x_train.shape}")
print(f"Training labels: {y_train.shape}")

assert df_x_train.shape == (240, 13)
assert y_train.shape == (240,)
# Print numerical and categorical features

num_features = df_x_train.select_dtypes(include=[np.number]).columns
print(num_features)

cat_features = df_x_train.select_dtypes(include=[object]).columns
print(cat_features)
# Print distribution for the "thal" feature

# YOUR CODE HERE
# Preprocess data to have similar scales and only numerical values

# YOUR CODE HERE
# Print preprocessed data shape and first sample
print(f"x_train: {x_train.shape}")
print(x_train[0])

assert x_train.shape == (240, 15)

Step 3: train and evaluate a model

Question

Use the following cells to:

  • Train an SGD classifier on the training data.

  • Evaluate its accuracy using K-fold cross-validation.

  • Compute the precision, recall and f1-score metrics.

  • Plot its confusion matrix and ROC curve.
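
The training and scoring steps can be sketched on synthetic data as shown below. The data comes from `make_classification` and stands in for already preprocessed inputs; `x_demo`, `y_demo`, `sgd_demo`, `cv_demo` and `y_pred_demo` are hypothetical names. Computing the report from out-of-fold predictions (`cross_val_predict`) is an assumption made here, chosen so that the metrics are not measured on data the model was fitted on.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic, already-numerical data standing in for preprocessed inputs
x_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Linear classifier trained with stochastic gradient descent
sgd_demo = SGDClassifier(random_state=0)
sgd_demo.fit(x_demo, y_demo)

# 3-fold cross-validated accuracy: one score per fold
cv_demo = cross_val_score(sgd_demo, x_demo, y_demo, cv=3, scoring="accuracy")
print(f"CV accuracy: {cv_demo}")

# Out-of-fold predictions: each sample is predicted by a model that never saw it
y_pred_demo = cross_val_predict(sgd_demo, x_demo, y_demo, cv=3)

# Precision, recall and f1-score for each class
print(classification_report(y_demo, y_pred_demo))
```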

# Fit an SGD classifier to the training set

# YOUR CODE HERE
# Use cross-validation to evaluate accuracy, using 3 folds
# Store the result in the cv_acc variable

# YOUR CODE HERE
print(f"CV accuracy: {cv_acc}")

assert np.mean(cv_acc) > 0.70
# Plot the confusion matrix for a model and a dataset
def plot_conf_mat(model, x, y):
    with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
        display = ConfusionMatrixDisplay.from_estimator(
            model, x, y, values_format="d", cmap=plt.cm.Blues
        )
# Plot confusion matrix for the SGD classifier

# YOUR CODE HERE
# Compute precision, recall and f1-score for the SGD classifier

# YOUR CODE HERE
# Plot ROC curve for the SGD classifier

# YOUR CODE HERE

Bonus

Train another classifier and plot confusion matrices and ROC curves for both.