Titanic¶

The goal of this activity is to predict which passengers survived the Titanic shipwreck. It uses the famous Kaggle Titanic dataset, a staple of ML challenges.

Here is a description of this dataset:

Variable	Definition	Key
PassengerId	Passenger ID
Survived	Survival	0 = No, 1 = Yes
Pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
Name	Last and first names
Sex	Sex
Age	Age in years
SibSp	# of siblings / spouses aboard the Titanic
Parch	# of parents / children aboard the Titanic
Ticket	Ticket number
Fare	Passenger fare
Cabin	Cabin number
Embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Environment setup¶

Question¶

Import the necessary packages.

# YOUR CODE HERE

# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()

# YOUR CODE HERE

Data loading and analysis¶

Question¶

Use pandas to load the dataset as CSV data from the following URL: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Print the dataset shape.

# YOUR CODE HERE
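
As a minimal sketch of the loading step: to stay runnable offline, the example below reads a tiny in-memory CSV excerpt (hypothetical, with illustrative values) instead of the real file. Passing the URL above to pd.read_csv() should work the same way. The names csv_text and df_sample are assumptions made for this illustration.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for the real file (values are illustrative)
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# pd.read_csv accepts a URL, a local path or a file-like object
df_sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
print(f"Dataset shape: {df_sample.shape}")
```

Note that the empty Age field in the last row is read as NaN, which is how missing values appear in the real dataset as well.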

Question¶

Display 10 random samples from the dataset.

# YOUR CODE HERE
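
One possible approach relies on DataFrame.sample(). The sketch below uses a hypothetical 15-row frame (df_demo) in place of the loaded dataset.

```python
import pandas as pd

# Hypothetical 15-row frame standing in for the loaded dataset
df_demo = pd.DataFrame(
    {"Pclass": [1, 2, 3] * 5, "Fare": [float(i) for i in range(15)]}
)

# random_state makes the draw reproducible; omit it for a different sample each run
samples = df_demo.sample(n=10, random_state=42)
print(samples)
```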

Question¶

Print a concise summary of this dataset.

# YOUR CODE HERE

Question¶

Plot histograms of the numerical features.

# YOUR CODE HERE

Data preprocessing¶

Question¶

Remove from the dataset the columns that seem non-informative for Machine Learning.

Hint: there are 4 of them.

# YOUR CODE HERE
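
A possible sketch, assuming the four columns in question are the identifier and free-text ones (PassengerId, Name, Ticket and Cabin). This choice is an assumption, not the only defensible answer. The one-row frame df_demo is hypothetical and only mirrors the dataset's column names.

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as the dataset
df_demo = pd.DataFrame([{
    "PassengerId": 1, "Survived": 0, "Pclass": 3, "Name": "Doe, Mr. John",
    "Sex": "male", "Age": 22.0, "SibSp": 1, "Parch": 0,
    "Ticket": "A/5 0000", "Fare": 7.25, "Cabin": None, "Embarked": "S",
}])

# Assumption: identifiers and free-text columns carry little predictive signal
columns_to_drop = ["PassengerId", "Name", "Ticket", "Cabin"]

# drop() returns a new DataFrame; the original is left untouched
df_clean = df_demo.drop(columns=columns_to_drop)
print(df_clean.columns.tolist())
```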

Question¶

Display the first 10 samples of the cleaned dataset.

# YOUR CODE HERE

Question¶

The Age feature should be very interesting for predicting survival. However, several values are missing.

Use the pandas fillna() function to replace all NaN values with -1 for the Age feature.

# YOUR CODE HERE
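
A minimal sketch of this replacement on a hypothetical four-row Age column (df_demo is an illustrative name):

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with two missing values
df_demo = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan]})

# fillna returns a new Series, so the result is assigned back to the column
df_demo["Age"] = df_demo["Age"].fillna(-1)
print(df_demo["Age"].tolist())
```

Using -1 as a sentinel keeps the missing ages distinguishable from real ones, which are never negative.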

Question¶

Use the pandas cut() function to segment the Age feature into categories, according to the provided labels and intervals.

age_labels = ['Missing', 'Child', 'Teenager', 'Young adult', 'Adult', 'Senior']
age_intervals = [-2, 0, 12, 18, 35, 60, 100]

# YOUR CODE HERE
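
To illustrate how the labels and intervals fit together, here is a sketch applying pd.cut() to a hypothetical Series of six ages, one per category. The names ages and age_categories are assumptions made for this example.

```python
import pandas as pd

age_labels = ['Missing', 'Child', 'Teenager', 'Young adult', 'Adult', 'Senior']
age_intervals = [-2, 0, 12, 18, 35, 60, 100]

# Hypothetical ages, one per category; -1 is the sentinel for a missing value
ages = pd.Series([-1, 5, 15, 30, 45, 70])

# 7 bin edges define 6 intervals, matched one-to-one with the 6 labels
age_categories = pd.cut(ages, bins=age_intervals, labels=age_labels)
print(age_categories.tolist())
```

By default the intervals are closed on the right, so an age of exactly 12 falls into 'Child' and the sentinel -1 falls into (-2, 0], that is 'Missing'.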

Question¶

Display the first 10 samples of the dataset.

# YOUR CODE HERE

Question¶

Apply the following function to one-hot encode categorical features “Age”, “Sex”, “Embarked”, “SibSp” and “Pclass”.

def apply_dummies(df, column_name):
    # One-hot encode the column into a new DataFrame
    dummies_features = pd.get_dummies(df[column_name], prefix=column_name)
    # Concatenate the DataFrame with the new columns
    df = pd.concat([df, dummies_features], axis=1)
    # Drop the original column
    df = df.drop(columns=[column_name])

    return df


# YOUR CODE HERE
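
One way to apply the function is a simple loop over the column names. The sketch below repeats apply_dummies so that it runs on its own, and uses a hypothetical three-row frame with only two categorical columns. The same loop should extend to the five features listed above.

```python
import pandas as pd


def apply_dummies(df, column_name):
    # One-hot encode the column, append the new columns, drop the original
    dummies_features = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies_features], axis=1)
    return df.drop(columns=[column_name])


# Hypothetical three-row frame (values are illustrative)
df_demo = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.28, 7.92],
})

# Each call replaces one categorical column with its indicator columns
for column in ["Sex", "Embarked"]:
    df_demo = apply_dummies(df_demo, column)

print(df_demo.columns.tolist())
```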

Question¶

Display the first 10 samples of the dataset.

# YOUR CODE HERE

Model training¶

Question¶

Separate the dataset into inputs and target data. Print their shapes.

# YOUR CODE HERE
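
A minimal sketch of this separation, assuming Survived is the target and every other column is an input. The four-row frame and the names x and y are illustrative.

```python
import pandas as pd

# Hypothetical preprocessed frame (values are illustrative)
df_demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Sex_male": [1, 0, 0, 1],
})

# Inputs: every column except the target
x = df_demo.drop(columns=["Survived"])
# Target: the column to predict
y = df_demo["Survived"]

print(f"x: {x.shape}, y: {y.shape}")
```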

Question¶

Split the dataset into training and test sets, using 20% of the data for testing. Print the shapes of all sets.

# YOUR CODE HERE
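
This can be sketched with scikit-learn's train_test_split(). The arrays below are hypothetical placeholders for the real inputs and targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and binary targets
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 keeps 20% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(f"x_train: {x_train.shape}, y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}, y_test: {y_test.shape}")
```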

Question¶

Train several Machine Learning models:

a Logistic Regression classifier;
a Decision Tree;
a MultiLayer Perceptron.

# YOUR CODE HERE

# YOUR CODE HERE

# YOUR CODE HERE
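
As a rough sketch, the three model families can be trained through the same fit() interface. The data below is synthetic (generated with make_classification), and the hyperparameters are arbitrary assumptions rather than tuned or recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training set
x_train, y_train = make_classification(n_samples=200, n_features=8, random_state=42)

# Hyperparameters are illustrative assumptions
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "MultiLayer Perceptron": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=42
    ),
}

for name, model in models.items():
    model.fit(x_train, y_train)
    print(f"{name}: training accuracy = {model.score(x_train, y_train):.2f}")
```

Training accuracy alone says little about generalization, hence the evaluation on test data in the next part.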

Models evaluation¶

Question¶

Print a classification report for your models.

# YOUR CODE HERE
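
A possible sketch using scikit-learn's classification_report(), shown for a single model trained on synthetic data. The same call should apply to each of your trained models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset
x, y = make_classification(n_samples=200, n_features=8, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# The report compares true labels with predictions on unseen data
report = classification_report(y_test, model.predict(x_test))
print(report)
```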

Question¶

Print the confusion matrix for your models.

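from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for a model and a dataset.
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement.
def plot_conf_mat(model, x, y):
    with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
        display = ConfusionMatrixDisplay.from_estimator(
            model, x, y, values_format="d", cmap=plt.cm.Blues
        )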

# YOUR CODE HERE
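
To help read the resulting plot, here is a small sketch computing the underlying matrix with confusion_matrix() on hypothetical labels and predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (0 = did not survive, 1 = survived)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Here the first row reads as 2 true negatives and 1 false positive, and the second row as 1 false negative and 2 true positives.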
