The well-known passenger ship Titanic, rumored to be “unsinkable,” sank in 1912 after hitting an iceberg on its maiden voyage. An estimated 2,224 passengers and crew were on board, and more than 1,500 people died, which made this one of the deadliest maritime disasters in modern history.

### Data analysis behind a disaster

While luck played some part in surviving the shipwreck, analysis of the data shows that certain groups of people were more likely to survive than others. This blog answers the question “Which characteristics of Titanic passengers increased their chances of survival?” using the personal data available about the passengers.

For this purpose, we’ll use a dataset from Kaggle to train a model that predicts passenger survival.

“No data is clean, but most is useful.” – Dean Abbott

Raw data is often hard to understand, and preprocessing it is usually the most important step in any analysis. Good preprocessing of the Titanic data sample makes it possible to build a successful model for predicting passenger survival. All of this is easily done with the Python programming language and its libraries (e.g. pandas, NumPy, scikit-learn). In this blog, we used Python, along with Google Colab, to analyze the Titanic dataset.

The downloaded data sample contained 891 records in total, and basic information about its attributes was accessed with the Python command `data.info()`. The following table represents the results. As seen at first sight, three attributes are incomplete. A useful code snippet for detecting missing values is the following:

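```python
import pandas as pd  # `data` is the Titanic DataFrame loaded earlier

# Number of missing values per attribute, largest first
total = data.isnull().sum().sort_values(ascending=False)
# Share of missing values per attribute, in percent
percent = data.isnull().sum() / data.isnull().count() * 100
percent_rounded = percent.round(1).sort_values(ascending=False)
missing_data = pd.concat([total, percent_rounded], axis=1, keys=['Total', '%'])
```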

One of the steps when building a prediction model is detecting attributes with missing values and preprocessing them further if necessary. Filling in missing values isn’t always simple; it depends on the dataset itself and on the number and type of missing values. We need to be extra careful that the values we add do not distort the final model’s predictions.

In our case, the attribute ‘cabin’, which had the most missing values, was judged irrelevant for further analysis and dropped from the dataset. A few missing values in the ‘embarked’ attribute were filled with the most frequent value – “Southampton”. Finally, missing values in the ‘age’ attribute were filled with random ages drawn from within one standard deviation of the attribute’s mean value.
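The three steps described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, assuming the Kaggle column names `Cabin`, `Embarked`, and `Age`; the helper name `fill_missing` and the fixed seed are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper sketching the three steps described in the text."""
    # Drop the sparsest attribute; drop() returns a new frame, so df is untouched.
    out = df.drop(columns=["Cabin"])

    # Fill missing ports with the most frequent value ("S" is Southampton in the Kaggle data).
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Fill missing ages with random integers drawn from [mean - std, mean + std).
    mean, std = out["Age"].mean(), out["Age"].std()
    missing = out["Age"].isnull()
    rng = np.random.default_rng(seed)  # the seed is an assumption, for reproducibility
    out.loc[missing, "Age"] = rng.integers(int(mean - std), int(mean + std), size=missing.sum())
    return out


# Tiny made-up sample, not real passenger records.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Embarked": ["S", "C", "S", None, "S", "Q"],
    "Cabin": [None, "C85", None, "C123", None, "E46"],
})
cleaned = fill_missing(sample)
print(cleaned)
```

Drawing the ages at random keeps the spread of the attribute closer to the original than filling every gap with the mean would, at the cost of results that change from seed to seed.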

The next preprocessing step concerns the ‘name’ attribute. By itself, the name is not relevant for predicting survival. However, names in this dataset contain titles, which were extracted and mostly mapped to standard titles (‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’). The resulting title information, in contrast to the name, could be helpful in the final model. The code snippet for this preprocessing is the following:

```python
# Default titles for mapping (most frequent titles and other "Rare" titles)
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# Extract existing titles from names (raw string so that \. is a literal dot)
data['Title'] = data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Map existing titles to default titles
data['Title'] = data['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr','Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
data['Title'] = data['Title'].replace('Mlle', 'Miss')
data['Title'] = data['Title'].replace('Ms', 'Miss')
data['Title'] = data['Title'].replace('Mme', 'Mrs')
```
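The `titles` dictionary is defined in the snippet but never applied. One plausible way to use it, assuming the intent was a numeric encoding for the model, is sketched below on a few names written in the dataset's "Surname, Title. Given names" format. The `TitleCode` column name and the `0` fallback for unmatched titles are assumptions.

```python
import pandas as pd

titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# A handful of names in the dataset's format, for illustration only.
names = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
    "Minahan, Dr. William Edward",
    "Sagesser, Mlle. Emma",
]})

# The title is the word that follows a space and ends with a dot.
names["Title"] = names["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Collapse variants into the standard titles (shortened list for this sketch).
names["Title"] = names["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", "Dr": "Rare"})

# Encode titles as integers; anything outside the dictionary becomes 0 (assumed fallback).
names["TitleCode"] = names["Title"].map(titles).fillna(0).astype(int)
print(names[["Title", "TitleCode"]])
```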

“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran

It’s up to us to read the data and extract useful information. Before building a prediction model, we analyzed the data further to find relationships between attributes. The results were quite interesting and helpful, and a few examples are shown in this blog.

In the following picture, passenger survival is shown as a function of age and sex. The code snippet used to plot these graphs is the following:

```python
import matplotlib.pyplot as plt
import seaborn as sns

survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
women = data[data['Sex'] == 'female']
men = data[data['Sex'] == 'male']

# Left panel: women. histplot is used here in place of the deprecated distplot(kde=False).
ax = sns.histplot(women[women['Survived'] == 1].Age.dropna(), bins=18, label=survived, ax=axes[0], kde=False)
ax = sns.histplot(women[women['Survived'] == 0].Age.dropna(), bins=40, label=not_survived, ax=axes[0], kde=False)
ax.set_xlabel('Age')
ax.legend()
ax.set_title('Women')

# Right panel: men
ax = sns.histplot(men[men['Survived'] == 1].Age.dropna(), bins=18, label=survived, ax=axes[1], kde=False)
ax = sns.histplot(men[men['Survived'] == 0].Age.dropna(), bins=40, label=not_survived, ax=axes[1], kde=False)
ax.set_xlabel('Age')
ax.legend()
_ = ax.set_title('Men')
```

From the picture above, the following conclusions can be drawn:

• Women were more likely to survive than men
• Little boys were more likely to survive than adult men
• Infants had a higher chance of survival

As a result of these observations, passenger age groups were used later in the predictive modeling.
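Grouping passengers into age bands could look like the sketch below. The band edges are illustrative assumptions, since the boundaries used in the analysis are not given here, and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [0.8, 4, 9, 17, 25, 33, 41, 58, 70],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Hypothetical band edges; right=False makes each band [low, high).
edges = [0, 5, 12, 18, 40, 60, 120]

# labels=False returns the integer index of the band instead of an interval label.
df["AgeGroup"] = pd.cut(df["Age"], bins=edges, labels=False, right=False)

# Share of survivors in each band.
survival_by_group = df.groupby("AgeGroup")["Survived"].mean()
print(survival_by_group)
```

An ordinal code like this keeps the ordering of the bands, which is what later makes it possible to combine the age group with other numeric attributes.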

Another relationship, the number of survivors by passenger class, is shown in the picture below. Age is plotted on the vertical axis and the survivor count on the horizontal one. As seen from the graphs, the lowest survival chance was for third-class passengers aged roughly 18 to 40, and the highest survival chance was for middle-aged first-class passengers. This led to a new attribute: age group multiplied by passenger class, where passengers with lower values of this attribute seem to have had a better chance of survival.

After final data processing based on these and several other correlations found in the data, it was possible to build a model to predict the survival outcome of Titanic passengers.
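The combined attribute is a plain element-wise product. A minimal sketch follows, assuming an ordinal `AgeGroup` code and the Kaggle `Pclass` column (1 = first class, 3 = third class); the name `Age_Class` and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; AgeGroup is assumed to be an ordinal code such as 0..5.
df = pd.DataFrame({
    "AgeGroup": [0, 3, 3, 4, 2],
    "Pclass": [1, 3, 1, 2, 3],
})

# Interaction attribute described in the text: age group times passenger class.
df["Age_Class"] = df["AgeGroup"] * df["Pclass"]
print(df)
```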

“The best way to predict the future is to study the past, or prognosticate.” – Robert Kiyosaki

The final step in this analysis was to build a machine learning model for prediction. First, the dataset was divided into training and test data using a 70:30 ratio. After that, two example models were built by applying two different algorithms to the training data: Gaussian Naive Bayes and Random Forest. The following code snippet does this for the Random Forest model:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the dataset into training and test data (70:30)
X = data.drop("Survived", axis=1)
Y = data["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=11, stratify=Y)

# Random Forest model
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
Y_prediction = random_forest.predict(X_test)

# Model accuracy on the held-out test data, in percent
acc_random_forest = round(random_forest.score(X_test, y_test) * 100, 2)
print(acc_random_forest)
```
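The Gaussian Naive Bayes model can be built the same way. The sketch below runs on synthetic data from `make_classification` as a stand-in for the preprocessed Titanic attributes, so the accuracy it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed attributes (not real passenger data).
X, y = make_classification(n_samples=300, n_features=6, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=11, stratify=y
)

# Gaussian Naive Bayes model
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)

# Accuracy on the held-out 30%, in percent
acc_gaussian = round(gaussian.score(X_test, y_test) * 100, 2)
print(acc_gaussian)
```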

Using Naive Bayes, we got a model with 78% accuracy, while the Random Forest algorithm produced a model with 92% accuracy. This means that when the Random Forest model was tested, it correctly predicted the survival outcome for 92% of the passengers in the test data.

The following picture shows the six most important attributes for survival prediction using Random Forest. All of these attributes were expected to have high importance based on the analysis made throughout this blog.

This short example shows how data analysis and machine learning can be used in many ways to extract useful information and build models that predict outcomes.
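For readers who want to produce an importance ranking like the one mentioned above, a fitted scikit-learn random forest exposes it through `feature_importances_`. The sketch below uses synthetic data and placeholder column names, so the ranking it prints is not the one from the Titanic analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the column names are placeholders, not the real attributes.
X, y = make_classification(n_samples=300, n_features=8, random_state=11)
columns = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=11)
forest.fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = (
    pd.Series(forest.feature_importances_, index=columns)
    .sort_values(ascending=False)
)
top_six = importances.head(6)
print(top_six.round(3))
```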