Introduction
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this project, I will analyze what sorts of people were likely to survive. In particular, I will apply the tools of machine learning to predict which passengers survived the tragedy.
Questions
All good data analysis projects begin with trying to answer questions. Once we know what columns and categories of data we have, we can think of some questions or insights we would like to obtain from the data. So here’s a list of questions we’ll try to answer.
First some basic questions:
1.) Who were the passengers on the Titanic? (Ages, Gender, Class, etc.)
2.) What deck were the passengers on and how does that relate to their class?
3.) Where did the passengers come from?
4.) Who was alone and who was with family?
Then we’ll dig deeper, with a broader question:
5.) What factors helped someone survive the sinking?
So let’s start with the first question…
Question 1: Who were the passengers on the Titanic? (Ages, Gender, Class, etc.)
In [1]:
# Check out the Kaggle Titanic Challenge at the following link: https://www.kaggle.com/c/titanic-gettingStarted
# Opening data with pandas
import csv
import pandas as pd
from pandas import Series, DataFrame
%matplotlib inline
In [2]:
# Set up the Titanic csv file as a DataFrame
titanic_df = pd.read_csv('train.csv')
In [3]:
# Let's see a preview of the data
titanic_df.head()
Out [3]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [4]:
# We could also get overall info for the dataset
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
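The non-null counts above imply that Age, Cabin, and Embarked have gaps (714, 204, and 889 of 891 entries present). A minimal sketch of how the missing entries could be counted directly, assuming a toy frame `sample_df` built from the first ten preview rows in this post rather than the full `train.csv`:

```python
import numpy as np
import pandas as pd

# Toy frame: Age, Cabin and Embarked for the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan, np.nan, 'E46', np.nan, np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C'],
})

# isnull() marks missing cells; summing the booleans counts them per column
missing_counts = sample_df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

The same two calls should work unchanged on `titanic_df`.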
In [5]:
# Let's import what we need for the analysis and visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [6]:
# Let's first check gender
# seaborn's bundled copy of the dataset uses lower-case column names ('sex', 'pclass')
titanic = sns.load_dataset("titanic")
sns.set_style("white")
sns.countplot(x='sex', data=titanic)
Out [6]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f4d17bd30>

In [7]:
# Now we can separate the genders by classes
sns.countplot(x='sex', data=titanic, hue='pclass')
Out [7]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f4e442be0>

In [8]:
sns.countplot(x='pclass', data=titanic, hue='sex')
Out [8]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f4e44c6d8>

Interesting: there were quite a few more males than females in 3rd class. However, it might be useful to know the split between males, females, and children. How can we go about this?
In [9]:
# We'll treat anyone under 16 as a child, and then use the apply technique with a function to create a new column
In [10]:
# First let's make a function to sort through the sex
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex
In [11]:
# We will define a new column called 'person', and specify axis=1 so the function is applied to each row
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis=1)
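The same labelling can also be done without a row-wise `apply`. A minimal sketch, assuming a toy frame `sample_df` holding the Age and Sex values of the first ten rows shown in this post; `np.where` is a swapped-in alternative, not the approach used in the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame: Age and Sex of the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female'],
})

# np.where picks 'child' where Age < 16 and the Sex value otherwise.
# A missing Age compares as False, so that passenger keeps the Sex label,
# which matches what male_female_child does.
sample_df['person'] = np.where(sample_df['Age'] < 16, 'child', sample_df['Sex'])
print(sample_df)
```

On a frame of this size the difference is cosmetic; the vectorized form mainly avoids calling a Python function once per row.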
In [12]:
# Let's see if this worked, check out the first ten rows
titanic_df[0:10]
Out [12]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | male |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | male |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | child |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S | female |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C | child |
Now we have separated the passengers into male, female, and child. This will be important later on because of the famous “women and children first” policy!
In [13]:
# Let's try the countplot again!
sns.countplot(x='Pclass', data=titanic_df, hue='person')
Out [13]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f2e4755f8>

Interesting, quite a few children in 3rd class and not so many in 1st! How about we create a distribution of the ages to get a more precise picture of who the passengers were?
In [14]:
# Quick way to create a histogram using pandas
sns.set_style()
titanic_df['Age'].hist(bins=70)
Out [14]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f4e56cdd8>

In [15]:
# We could also get a quick overall summary, starting with the mean age
titanic_df['Age'].mean()
Out [15]:
29.69911764705882
In [16]:
titanic_df['person'].value_counts()
Out [16]:
male      537
female    271
child      83
Name: person, dtype: int64
In [17]:
# Another way to visualize the data is to use FacetGrid to plot multiple kdeplots on one plot
# Set the figure equal to a FacetGrid with the pandas DataFrame as its data source, set the hue, and change the aspect ratio.
fig = sns.FacetGrid(titanic_df, hue="Sex", aspect=4)
# Next use map to plot all the possible kdeplots for the 'Age' column by the hue choice
fig.map(sns.kdeplot, 'Age', shade=True)
# Set the x max limit by the oldest passenger
oldest = titanic_df['Age'].max()
# Since we know no one can be negative years old, set the x lower limit at 0
fig.set(xlim=(0, oldest))
# Finally add a legend
fig.add_legend()
Out [17]:
<seaborn.axisgrid.FacetGrid at 0x13f4e692978>

In [18]:
fig = sns.FacetGrid(titanic_df, hue='Sex', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()
Out [18]:
<seaborn.axisgrid.FacetGrid at 0x13f4e586208>

We now have a pretty good picture of who the passengers were based on Sex, Age, and Class. So let’s move on to our 2nd question…
Question 2: What deck were the passengers on and how does that relate to their class?
In [19]:
# We could have done the same thing for the 'person' column to include children:
fig = sns.FacetGrid(titanic_df, hue='person', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()
Out [19]:
<seaborn.axisgrid.FacetGrid at 0x13f507b6898>

In [20]:
# Let's do the same for class by changing the hue argument:
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()
Out [20]:
<seaborn.axisgrid.FacetGrid at 0x13f4e5cf0f0>

In [21]:
# Let's get a quick look at our dataset again
titanic_df.head()
Out [21]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male |
From this data we can see that the Cabin column has information on the deck, but it has several NaN values, so we’ll have to drop them.
In [22]:
# First we'll drop the NaN values and create a new object, deck
deck = titanic_df['Cabin'].dropna()
In [23]:
# Quick preview of the decks
deck.head()
Out [23]:
1      C85
3     C123
6      E46
10      G6
11    C103
Name: Cabin, dtype: object
Luckily we only need the first letter of the deck to classify its level (e.g. A,B,C,D,E,F,G)
In [24]:
# So let's grab that letter for the deck level with a simple for loop
# Set empty list
levels = []
# Loop to grab first letter
for level in deck:
    levels.append(level[0])
# Build a new DataFrame from the letters and plot the counts per deck
cabin_df = DataFrame(levels)
cabin_df.columns = ['Cabin']
order = ["A","B","C","D","E","F","G","T"]
sns.countplot(x='Cabin', data=cabin_df, palette='winter_d', order=order)
Out [24]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f50840ef0>
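The loop above can also be written with the pandas string accessor. A minimal sketch, assuming a toy Series `deck` that copies the five values and index labels shown in the `deck.head()` preview:

```python
import pandas as pd

# Cabin values and row labels copied from the deck.head() preview above
deck = pd.Series(['C85', 'C123', 'E46', 'G6', 'C103'],
                 index=[1, 3, 6, 10, 11], name='Cabin')

# .str[0] takes the first character of every string and keeps the original index
deck_letter = deck.str[0]
print(deck_letter.value_counts())
```

Unlike a plain list, the result keeps the row labels of the passengers it came from, which matters if the deck letters are later joined back to other columns.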

Interesting to note we have a ‘T’ deck value there, which doesn’t make sense. We can drop it with the following code:
In [25]:
# Redefine cabin_df as everything but where the row was equal to 'T'
cabin_df = cabin_df[cabin_df.Cabin != 'T']
# Replot
sns.countplot(x='Cabin', data=cabin_df, palette='summer', order=["A","B","C","D","E","F","G"])
Out [25]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f508f8518>

Quick note: I changed my palettes to keep things interesting. Great, now that we’ve analyzed the distribution by decks, let’s go ahead and answer our third question…
Question 3: Where did the passengers come from?
In [26]:
# Let's take another look at our original data
titanic_df.head()
Out [26]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male |
Note here that the Embarked column has C, Q, and S values. Reading about the project on Kaggle, you’ll note that these stand for Cherbourg, Queenstown, and Southampton.
In [27]:
# Now we can make a quick countplot to check out the results; note the order argument, which leaves out the NaN values
sns.countplot(x='Embarked', data=titanic_df, hue='Pclass', order=["C","Q","S"])
Out [27]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f509496d8>

An interesting find here is that almost all the passengers who boarded at Queenstown were 3rd class. It would be interesting to look at the economics of that town in that time period for further investigation.
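The counts behind a plot like this can be tabulated with `pd.crosstab`. A minimal sketch, assuming a toy frame `sample_df` with the Embarked and Pclass values of the first ten preview rows, so the numbers it prints are illustrative only:

```python
import pandas as pd

# Toy frame: Embarked and Pclass of the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C'],
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

# Rows are ports, columns are classes, cells are passenger counts
port_by_class = pd.crosstab(sample_df['Embarked'], sample_df['Pclass'])

# normalize='index' turns each row into shares that sum to 1
port_share = pd.crosstab(sample_df['Embarked'], sample_df['Pclass'], normalize='index')

print(port_by_class)
print(port_share.round(2))
```

Run against `titanic_df`, the row-wise shares would put a number on the "almost all 3rd class" impression for Queenstown.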
Now let’s take a look at the 4th question…
Question 4: Who was alone and who was with family?
In [28]:
# Let's start by adding a new column to define alone
# Then we'll add the parent/child column with the sibsp column
titanic_df['Alone'] = titanic_df.SibSp + titanic_df.Parch
In [29]:
titanic_df['Alone'] = titanic_df.Parch + titanic_df.SibSp
titanic_df['Alone']
Out [29]:
0       1
1       1
2       0
3       1
4       0
5       0
6       0
7       4
8       2
9       1
10      2
11      0
12      0
13      6
14      0
15      0
16      5
17      0
18      1
19      0
20      0
21      0
22      0
23      0
24      4
25      6
26      0
27      5
28      0
29      0
       ..
861     1
862     0
863    10
864     0
865     0
866     1
867     0
868     0
869     2
870     0
871     2
872     0
873     0
874     1
875     0
876     0
877     0
878     0
879     1
880     1
881     0
882     0
883     0
884     0
885     5
886     0
887     0
888     3
889     0
890     0
Name: Alone, Length: 891, dtype: int64
We know that if the Alone column is anything but 0, then the passenger had family aboard and wasn’t alone. So let’s change the column now so that if the value is greater than 0, we know the passenger was with his/her family, otherwise they were alone.
In [30]:
# Look for >0 or ==0 to set alone status
# np.where builds both labels in one step and avoids chained assignment on a column slice
titanic_df['Alone'] = np.where(titanic_df['Alone'] > 0, 'With Family', 'Alone')
In [31]:
# Let's check to make sure it worked
titanic_df.head()
Out [31]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | person | Alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male | With Family |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female | With Family |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female | Alone |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female | With Family |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male | Alone |
In [32]:
# Now here's a simple visualization!
sns.countplot(x='Alone', data=titanic_df, palette='Blues', order=["Alone", "With Family"])
Out [32]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f509d3668>

Great! Now that we’ve thoroughly analyzed the data, let’s go ahead and take a look at the most interesting (and open-ended) question: What factors helped someone survive the sinking?
Question 5: What factors helped someone survive the sinking?
In [33]:
# Let's start by creating a new column for legibility purposes through mapping
titanic_df['Survivor'] = titanic_df.Survived.map({0:'no', 1:'yes'})
# Let's just get a quick overall view of survived vs died.
sns.countplot(x='Survivor', data=titanic_df, palette='Set1')
Out [33]:
<matplotlib.axes._subplots.AxesSubplot at 0x13f50a69390>

So quite a few more people died than those who survived. Let’s see if the class of the passengers had an effect on their survival rate, since the movie Titanic popularized the notion that the 3rd class passengers did not do as well as their 1st and 2nd class counterparts.
In [34]:
# Let's use a catplot again, but now considering class
sns.catplot(x='Pclass', y='Survived', data=titanic_df, kind='point')
Out [34]:
<seaborn.axisgrid.FacetGrid at 0x13f50aa6e48>

Looks like survival rates for the 3rd class are substantially lower! But maybe this effect is being caused by the large number of men in the 3rd class in combination with the women and children first policy. Let’s use ‘hue’ to get a clearer picture on this.
In [35]:
# Let's use a catplot again, but now considering class and gender
sns.catplot(x='Pclass', y='Survived', hue='person', data=titanic_df, kind='point')
Out [35]:
<seaborn.axisgrid.FacetGrid at 0x13f50abff28>

From this data it looks like being a male or being in 3rd class were both unfavourable for survival. Regardless of class, being a male dramatically decreases your chances of survival.
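The point plots above show the mean of Survived per group, and the same numbers can be read off with a `groupby`. A minimal sketch, assuming a toy frame `sample_df` built from the first ten preview rows; with so few rows the rates it prints are not representative of the full data:

```python
import pandas as pd

# Toy frame: Pclass, person and Survived of the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'person': ['male', 'female', 'female', 'female', 'male',
               'male', 'male', 'child', 'female', 'child'],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate
survival_rate = sample_df.groupby(['Pclass', 'person'])['Survived'].mean()

# unstack moves 'person' into the columns, giving a class-by-person table
print(survival_rate.unstack())
```

Combinations that never occur in the data show up as NaN in the unstacked table.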
But what about age? Did being younger or older have an effect on survival rate?
In [36]:
# Let's use a linear plot on age versus survival
sns.lmplot(x='Age', y='Survived', data=titanic_df)
Out [36]:
<seaborn.axisgrid.FacetGrid at 0x13f508886a0>

Looks like there is a general trend that the older the passenger was, the less likely they survived. Let’s go ahead and use hue to take a look at the effect of class and age.
In [37]:
# Let's use a linear plot on age versus survival using hue for class separation
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, palette='winter')
Out [37]:
<seaborn.axisgrid.FacetGrid at 0x13f51ba5f28>

We can also use the x_bins argument to clean up this figure: it groups the data into age bins and plots each bin’s mean with a confidence interval attached!
In [38]:
# Same plot as before, but with the ages grouped into bins through x_bins
generations = [10,20,40,60,80]
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, palette='winter', x_bins=generations)
Out [38]:
<seaborn.axisgrid.FacetGrid at 0x13f51c51278>
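A binned survival rate can also be computed as a table with `pd.cut`. A minimal sketch, assuming a toy frame `sample_df` from the first ten preview rows and treating the values 0, 10, 20, 40, 60, 80 as bin edges; that is an assumption made here for illustration and is not necessarily how seaborn interprets `x_bins`:

```python
import numpy as np
import pandas as pd

# Toy frame: Age and Survived of the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Assumed bin edges; each bin includes its right edge, e.g. (20, 40]
edges = [0, 10, 20, 40, 60, 80]
labels = ['0-10', '10-20', '20-40', '40-60', '60-80']
sample_df['age_group'] = pd.cut(sample_df['Age'], bins=edges, labels=labels)

# observed=True drops empty bins; rows with a missing Age fall out of the grouping
rate_by_age = sample_df.groupby('age_group', observed=True)['Survived'].agg(['mean', 'count'])
print(rate_by_age)
```

Printing the count next to the mean makes it easy to see which bins rest on only a handful of passengers.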

Interesting find on the older 1st class passengers! What about if we relate gender and age with the survival set?
In [39]:
sns.lmplot(x='Age', y='Survived', hue='Sex', data=titanic_df, palette='winter', x_bins=generations)
Out [39]:
<seaborn.axisgrid.FacetGrid at 0x13f51cd66d8>

Conclusion
Done! We now have some great insights on how gender, age, and class all related to a passenger’s chance of survival. With this data we can answer two more questions.
6.) Did the deck have an effect on the passengers’ survival rate? Did this answer match up with your intuition?
7.) Did having a family member increase the odds of surviving the crash?
Question 6: Did the deck have an effect on the passengers’ survival rate? Did this answer match up with your intuition?
In [40]:
titanic_df.head()
Out [40]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | person | Alone | Survivor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | male | With Family | no |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | female | With Family | yes |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | female | Alone | yes |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | female | With Family | yes |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | male | Alone | no |
In [41]:
cabin_df = cabin_df[cabin_df.Cabin != 'T']
sns.catplot(x='Cabin', data=cabin_df, palette='winter_d', kind='count', order=["A","B","C","D","E","F","G"])
Out [41]:
<seaborn.axisgrid.FacetGrid at 0x13f51cd6d68>

From this chart we can see that, among passengers with a recorded cabin, the largest group was on deck C.
In [42]:
cabin_df.head()
Out [42]:
| | Cabin |
---|---|
0 | C |
1 | C |
2 | E |
3 | G |
4 | C |
In [43]:
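# Note: pd.concat with axis=1 aligns rows on index labels. cabin_df is indexed 0, 1, 2, ...
# while the cabins came from rows 1, 3, 6, ... of titanic_df, so check that each deck
# letter ends up next to the intended passenger.
cabin_df = pd.concat([cabin_df, titanic_df['Sex']], axis=1)
cabin_df = pd.concat([cabin_df, titanic_df['Survived']], axis=1)
cabin_df.head()
Out [43]:
| | Cabin | Sex | Survived |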
---|---|---|---|
0 | C | male | 0 |
1 | C | female | 1 |
2 | E | female | 1 |
3 | G | female | 1 |
4 | C | male | 0 |
In [44]:
sns.catplot(x='Cabin', y='Survived', data=cabin_df, palette='winter_d', kind='bar',
            order=["A","B","C","D","E","F","G"])
Out [44]:
<seaborn.axisgrid.FacetGrid at 0x13f51d74780>

It seems the survival rate was highest on deck A, so the deck did appear to have a small effect on a passenger’s chance of survival.
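One thing worth double-checking in this step is row alignment. `pd.concat(..., axis=1)` matches rows by index label, and `cabin_df` was built from a plain list (labels 0, 1, 2, ...) while the cabins in `deck` came from rows 1, 3, 6, 10, 11, ... of `titanic_df`. A minimal sketch of a variant that keeps the deck letter and the survival flag on the same passenger row, assuming a toy frame `sample_df` built from the first ten preview rows; the names `deck_df` and `Deck` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame: Survived, Sex and Cabin of the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    'Sex': ['male', 'female', 'female', 'female', 'male',
            'male', 'male', 'male', 'female', 'female'],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan,
              np.nan, 'E46', np.nan, np.nan, np.nan],
})

# Keep only passengers with a recorded cabin; their row labels (1, 3, 6) are preserved,
# so Sex and Survived stay attached to the right passenger
deck_df = sample_df.dropna(subset=['Cabin']).copy()
deck_df['Deck'] = deck_df['Cabin'].str[0]

# Mean of the 0/1 Survived flag per deck is the survival rate for that deck
survival_by_deck = deck_df.groupby('Deck')['Survived'].mean()
print(deck_df)
print(survival_by_deck)
```

Because nothing is re-indexed, no concat is needed, and the same filter-then-group pattern should carry over to `titanic_df`.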
Question 7: Did having a family member increase the odds of surviving the crash?
In [45]:
# Plot the survival rate for each SibSp value (number of siblings/spouses aboard)
sns.catplot(y='Survived', x='SibSp', data=titanic_df, palette='winter_d', kind='bar')
Out [45]:
<seaborn.axisgrid.FacetGrid at 0x13f51dc0f98>

We can see that passengers travelling without a sibling or spouse had lower odds of surviving the crash than those travelling with one. This is probably understandable, since women and children had priority over single adults without families. However, it’s interesting to note that the chances of survival fall again when more family members are on board.
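To judge how much weight each bar deserves, the rate can be tabulated next to the number of passengers behind it. A minimal sketch, assuming a toy frame `sample_df` with the SibSp and Survived values of the first ten preview rows, so the printed rates are illustrative only:

```python
import pandas as pd

# Toy frame: SibSp and Survived of the first ten preview rows (not the full data)
sample_df = pd.DataFrame({
    'SibSp': [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# 'mean' is the survival rate, 'count' is how many passengers it is based on
rate_by_sibsp = sample_df.groupby('SibSp')['Survived'].agg(['mean', 'count'])
print(rate_by_sibsp)
```

Groups with very few passengers, such as the larger families, give rates that can swing widely, which is worth keeping in mind when reading the bar chart.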