This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.


Titanic

The goal of this activity is to predict which passengers survived the Titanic shipwreck. It uses the famous Kaggle Titanic dataset, a staple of ML challenges.

Here is a description of this dataset:

Variable      Definition                                        Key
PassengerId   Passenger ID
Survived      Survival                                          0 = No, 1 = Yes
pclass        Ticket class                                      1 = 1st, 2 = 2nd, 3 = 3rd
Name          Last and first names
sex           Sex
Age           Age in years
sibsp         Number of siblings / spouses aboard the Titanic
parch         Number of parents / children aboard the Titanic
ticket        Ticket number
fare          Passenger fare
cabin         Cabin number
embarked      Port of embarkation                               C = Cherbourg, Q = Queenstown, S = Southampton

Environment setup

Question

Import the necessary packages.

# YOUR CODE HERE
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
# YOUR CODE HERE

Data loading and analysis

Question

Load the dataset into a pandas DataFrame.

# YOUR CODE HERE
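Tabular data is usually loaded with the pandas `read_csv()` function, which accepts a file path, a URL or a file-like object. The sketch below is only an illustration: it reads a tiny in-memory CSV (the `csv_text` content and the `toy` name are made up, not the Titanic data) so that it runs without any external file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file or URL
csv_text = "id,name,score\n1,Ann,3.5\n2,Bob,\n"

# read_csv accepts any file-like object; empty fields become NaN
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

With a real dataset, the `io.StringIO(...)` argument would simply be replaced by the path or URL of the CSV file.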

Question

Display 10 random samples from the dataset.

# YOUR CODE HERE
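As a rough illustration of random sampling with pandas, the sketch below uses a small made-up DataFrame (the `toy` name and its values are assumptions, not the Titanic data). `DataFrame.sample()` draws rows without replacement by default, and `random_state` makes the draw reproducible.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"id": range(100), "value": [i * 0.5 for i in range(100)]})

# Draw 10 distinct rows at random; random_state fixes the draw
picked = toy.sample(n=10, random_state=0)
print(picked)
```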

Question

Print a concise summary of this dataset.

# YOUR CODE HERE
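One possible way to get a concise summary is `DataFrame.info()`, which prints column names, non-null counts and dtypes. The sketch below applies it to a made-up DataFrame (the `toy` data is an assumption) and adds a per-column count of missing values as a complement.

```python
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"age": [22.0, None, 35.0], "name": ["a", "b", "c"]})

# info() prints column names, non-null counts, dtypes and memory usage
toy.info()

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)
```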

Question

Plot histograms of the numerical features.

# YOUR CODE HERE
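A minimal sketch of histogram plotting, assuming a made-up DataFrame named `toy`: `DataFrame.hist()` draws one histogram per numerical column and silently skips non-numerical ones. The `matplotlib.use("Agg")` line is only there so the snippet runs as a standalone script; it is not needed inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, unnecessary in a notebook

import pandas as pd

# Hypothetical toy data: two numerical columns and one text column
toy = pd.DataFrame(
    {
        "age": [22, 38, 26, 35, 54],
        "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
        "name": list("abcde"),
    }
)

# One histogram per numerical column; "name" is ignored
axes = toy.hist(bins=5)
```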

Data preprocessing

Question

Remove from the dataset the columns that seem uninformative for Machine Learning.

Hint: there are 4 of them.

# YOUR CODE HERE
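Column removal can be sketched as follows on made-up data (the `toy` DataFrame and its column names are hypothetical and deliberately unrelated to the Titanic columns). `DataFrame.drop()` returns a new DataFrame, so the result has to be assigned to a variable.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"row_id": [1, 2, 3], "label": ["x", "y", "z"], "score": [0.1, 0.5, 0.9]}
)

# drop() returns a new DataFrame; the original is left unchanged
cleaned = toy.drop(columns=["row_id", "label"])
print(cleaned.columns.tolist())
```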

Question

Display the first 10 samples of the cleaned dataset.

# YOUR CODE HERE

Question

The Age feature is likely to be informative for predicting survival. However, several of its values are missing.

Use the pandas fillna() function to replace all NaN values with -1 for the Age feature.

# YOUR CODE HERE
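To illustrate how `fillna()` behaves, here is a small sketch on a made-up column (the `toy` DataFrame and the `height` column are assumptions). The filled column is assigned back to the DataFrame rather than modified in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy = pd.DataFrame({"height": [1.6, np.nan, 1.8, np.nan]})

# Replace every NaN with a sentinel value and assign the result back
toy["height"] = toy["height"].fillna(-1)
print(toy)
```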

Question

Use the pandas cut() function to segment the Age feature into categories, according to the provided labels and intervals.

age_labels = ['Missing', 'Child', 'Teenager', 'Young adult', 'Adult', 'Senior']
age_intervals = [-2, 0, 12, 18, 35, 60, 100]

# YOUR CODE HERE
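As an illustration of how `pd.cut()` maps values to labels, the sketch below bins a made-up Series (the `scores`, `labels` and `bins` names and values are assumptions). By default each interval is open on the left and closed on the right, so n + 1 bin edges are needed for n labels and a value equal to an edge falls into the lower interval.

```python
import pandas as pd

# Hypothetical values, with -1 playing the role of a "missing" sentinel
scores = pd.Series([-1, 0, 5, 10, 11, 20])

labels = ["Missing", "Low", "High"]
bins = [-2, 0, 10, 20]  # intervals (-2, 0], (0, 10], (10, 20]

# Each value receives the label of the interval it falls into
binned = pd.cut(scores, bins=bins, labels=labels)
print(binned.tolist())
```

Note that the value 0 lands in the first interval because the right edge is inclusive.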

Question

Display the first 10 samples of the dataset.

# YOUR CODE HERE

Question

Apply the following function to one-hot encode the categorical features “Age”, “Sex”, “Embarked”, “SibSp” and “Pclass”.

def apply_dummies(df, column_name):

    # One-hot encode the column into a new DataFrame
    dummies_features = pd.get_dummies(df[column_name], prefix=column_name)
    # Concatenate the DataFrame with the new columns
    df = pd.concat([df, dummies_features], axis=1)
    # Drop the original column
    df = df.drop(columns=[column_name])

    return df


# YOUR CODE HERE
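The helper returns a new DataFrame instead of modifying its argument, so its result must be reassigned at each call. The sketch below repeats the helper so that it runs on its own and applies it in a loop to a made-up DataFrame (the `toy` data and its column names are assumptions).

```python
import pandas as pd


def apply_dummies(df, column_name):
    # Same logic as the helper above, repeated to keep this sketch standalone
    dummies_features = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies_features], axis=1)
    return df.drop(columns=[column_name])


# Hypothetical toy data with two categorical columns
toy = pd.DataFrame(
    {"color": ["red", "blue", "red"], "size": ["S", "M", "S"], "price": [1.0, 2.0, 3.0]}
)

# Reassign the result each time, since the helper does not work in place
for column in ["color", "size"]:
    toy = apply_dummies(toy, column)

print(toy.columns.tolist())
```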

Question

Display the first 10 samples of the dataset.

# YOUR CODE HERE

Model training

Question

Separate the dataset into inputs and targets. Print their shapes.

# YOUR CODE HERE
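A minimal sketch of the inputs/target separation, assuming a made-up DataFrame whose label column is called `target` (both the data and the column names are hypothetical): the inputs are every column except the label, and the target is the label column alone.

```python
import pandas as pd

# Hypothetical toy data: two features and one label column
toy = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [0, 1, 0, 1], "target": [0, 1, 1, 0]})

x = toy.drop(columns=["target"])  # inputs: all columns except the label
y = toy["target"]  # target: the label column only

print(f"x: {x.shape}, y: {y.shape}")
```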

Question

Split the dataset into training and test sets, keeping 20% of the samples for the test set. Print the shapes of all sets.

# YOUR CODE HERE
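The split can be sketched with scikit-learn's `train_test_split()` on synthetic arrays (the data below is made up). The `random_state` and `stratify` arguments are optional additions assumed here for reproducibility and to keep class proportions similar in both sets; the activity does not require them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 2 features, balanced binary labels
x = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# test_size=0.2 keeps 20% of the samples for the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print(f"x_train: {x_train.shape}, y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}, y_test: {y_test.shape}")
```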

Question

Train several Machine Learning models:

  • a Logistic Regression classifier;

  • a Decision Tree;

  • a MultiLayer Perceptron.

# YOUR CODE HERE
# YOUR CODE HERE
# YOUR CODE HERE
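As a rough sketch of how the three model families can be trained with scikit-learn, the example below fits them on synthetic data produced by `make_classification()`. The hyperparameters (`max_iter`, `max_depth`, `hidden_layer_sizes`) are arbitrary assumptions chosen for illustration, not recommended settings for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for real data
x, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Arbitrary hyperparameters, chosen only for illustration
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

# All scikit-learn classifiers share the same fit/score interface
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```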

Model evaluation

Question

Print a classification report for your models.

# YOUR CODE HERE
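To show what a classification report contains, the sketch below calls scikit-learn's `classification_report()` on hand-written labels (the `y_true` and `y_pred` lists and the class names are made up). With a trained model, `y_pred` would come from `model.predict()` on the test inputs.

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Precision, recall and F1-score per class, plus overall accuracy
report = classification_report(y_true, y_pred, target_names=["Died", "Survived"])
print(report)
```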

Question

Print the confusion matrix for your models.

# plot_confusion_matrix() was removed in scikit-learn 1.2
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for a model and a dataset
def plot_conf_mat(model, x, y):
    with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
        ConfusionMatrixDisplay.from_estimator(
            model, x, y, values_format="d", cmap=plt.cm.Blues
        )

# YOUR CODE HERE
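The numbers shown in such a plot can also be obtained directly with `confusion_matrix()`. The sketch below uses hand-written labels (made up for illustration): rows correspond to true classes and columns to predicted classes, so for a binary problem the flattened matrix reads as true negatives, false positives, false negatives and true positives.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```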