The dataset under study contains demographic and passenger information for 891 of the 2224 passengers and crew on board the Titanic. The variables included are:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES: Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. Some passengers also travelled with very close friends or neighbors from their village; however, the definitions do not cover such relations.
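The age convention in the notes above (fractional below one year, xx.5 when estimated) can be checked directly. The snippet below is only a sketch on made-up ages, not values from the dataset; the names `ages`, `is_infant` and `is_estimated` are introduced here for illustration.

```python
import pandas as pd

# Toy ages (hypothetical values, not taken from the dataset).
ages = pd.Series([0.42, 22.0, 28.5, 40.5, 35.0, None])

# Infants: fractional age below one year.
is_infant = ages < 1

# Estimated ages: at least one year old and ending in .5.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(pd.DataFrame({"age": ages, "infant": is_infant, "estimated": is_estimated}))
```

Missing ages compare as False in both masks, so they are flagged neither as infants nor as estimates.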
In this analysis we will try to answer some questions about how the survival rate relates to:
1. Fare category
2. A person being Male or Female
3. Age of the person, i.e. Child, Adult, Senior Citizen
4. Male Child or Female Child
5. Socio-economic status: Upper Class (1st), Middle Class (2nd), Lower Class (3rd)
6. Comparison of survival with respect to embarkment station
7. Chances of survival of men with a child (Father), with a spouse (Husband), or Single
8. Age group of people with a higher probability of survival
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
titanic_data = pd.read_csv('titanic_data.csv')
titanic_data.head()
This dataset has some NaN values that would get in the way of a proper analysis. In this phase we will detect those records and clean our dataset.
# Test for the presence of null values in the dataset
total_null_values = titanic_data.isnull().sum().sum()  # Total null values in titanic_data
null_values_survived = titanic_data.Survived.isnull().sum()  # Null values in the Survived column
null_values_pclass = titanic_data.Pclass.isnull().sum()  # Null values in the Pclass column
null_values_name = titanic_data.Name.isnull().sum()  # Null values in the Name column
null_values_sex = titanic_data.Sex.isnull().sum()  # Null values in the Sex column
null_values_Age = titanic_data.Age.isnull().sum()  # Null values in the Age column
null_values_SibSp = titanic_data.SibSp.isnull().sum()  # Null values in the SibSp column
null_values_parch = titanic_data.Parch.isnull().sum()  # Null values in the Parch column
null_values_ticket = titanic_data.Ticket.isnull().sum()  # Null values in the Ticket column
null_values_fare = titanic_data.Fare.isnull().sum()  # Null values in the Fare column
null_values_cabin = titanic_data.Cabin.isnull().sum()  # Null values in the Cabin column
null_values_embarked = titanic_data.Embarked.isnull().sum()  # Null values in the Embarked column
print('Total null values in titanic_dataset : {}'.format(total_null_values))
print('Total null values in Survived col titanic_dataset : {}'.format(null_values_survived))
print('Total null values in Pclass col titanic_dataset : {}'.format(null_values_pclass))
print('Total null values in Name col titanic_dataset : {}'.format(null_values_name))
print('Total null values in Sex col titanic_dataset : {}'.format(null_values_sex))
print('Total null values in Age col titanic_dataset : {}'.format(null_values_Age))
print('Total null values in Sibsp col titanic_dataset : {}'.format(null_values_SibSp))
print('Total null values in Parch col titanic_dataset : {}'.format(null_values_parch))
print('Total null values in Ticket col titanic_dataset : {}'.format(null_values_ticket))
print('Total null values in Fare col titanic_dataset : {}'.format(null_values_fare))
print('Total null values in Cabin col titanic_dataset : {}'.format(null_values_cabin))
print('Total null values in Embarked col titanic_dataset : {}'.format(null_values_embarked))
float(2)/891*100
This means 3 columns contain NaN values: Age (177, 19.9%), Cabin (687, 77.1%) and Embarked (2, 0.22%) of 891 values. For the analysis we can therefore ignore the NaNs in Embarked, because they will not affect the results much. Let's consider the statistics of the Age and Cabin columns.
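The per-column counts and percentages above can also be produced in one pass. This is a minimal sketch on a toy frame standing in for titanic_data (the values are made up); `df` and `summary` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for titanic_data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

null_counts = df.isnull().sum()               # NaNs per column
null_percent = 100.0 * null_counts / len(df)  # share of rows that are NaN

summary = pd.DataFrame({"nulls": null_counts, "percent": null_percent})
print(summary[summary["nulls"] > 0])          # only columns with missing data
```

Applied to the real dataframe, the same two lines would replace the one-variable-per-column bookkeeping.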
Age_data = titanic_data['Age']
no_NaN_Age = Age_data.dropna()  # keep only the known ages
ax = sns.distplot(no_NaN_Age)
ax.set(xlabel='Age', ylabel='no. of people',title="Distribution between Frequency Vs Age of People")
plt.show()
no_NaN_Age = np.array(no_NaN_Age)
print('Mean of Age data with No NaNs = {}'.format(np.mean(no_NaN_Age)))
Cabin_data = titanic_data['Cabin']
no_NaN_Cabin = Cabin_data.dropna()
Cabin_set = set(no_NaN_Cabin)  # distinct cabin strings
print('\n')
print('Number of distinct cabins are {}+'.format(len(Cabin_set)))
print('List of all distinct cabins :')
print(' '.join('"{}"'.format(x) for x in Cabin_set))
print('\n')
The Age data without NaNs is approximately normally distributed, so the mean gives a reasonable picture of its central tendency. For the analysis we can therefore clean the data by replacing all NaNs with the mean (29.699).
The Cabin data has 147+ distinct string values, so we cannot calculate a mean for it. We cannot even use a classification model to categorize the data, as there would be 147+ different categories for 891 samples. Some values, such as "B51 B53 B55", appear to contain 3 cabin numbers. So we will ignore the Cabin data in this analysis.
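To see how often one Cabin entry lists several cabin numbers, the strings can be split on whitespace. The sketch below uses a hypothetical sample in the same format as the Cabin column; `cabins` and the derived names are assumptions made for illustration.

```python
import pandas as pd

# Toy cabin values (hypothetical sample in the same format as the Cabin column).
cabins = pd.Series(["C85", None, "B51 B53 B55", "E46", None, "C23 C25 C27"])

# Number of whitespace-separated cabin numbers per entry (missing stays missing).
cabins_per_entry = cabins.str.split().str.len()

n_distinct = cabins.nunique()                # distinct non-null strings
n_multi = int((cabins_per_entry > 1).sum())  # entries listing several cabins

print("distinct cabin strings:", n_distinct)
print("entries with more than one cabin:", n_multi)
```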
* Let's clean Age data and make it NaN free and remove Cabin data from titanic_data dataframe.
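# Assign the result back instead of calling fillna(inplace=True) on an attribute-accessed column
titanic_data['Age'] = titanic_data['Age'].fillna(np.mean(no_NaN_Age))
del titanic_data['Cabin']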
#No. of different fares in Titanic Ship
#Fare Variation
fare_list = titanic_data.Fare.unique()
fare_list = pd.DataFrame(fare_list)
#fare_list.describe()
no_survived = titanic_data['Survived'].value_counts()[1] #no of people survived
#print no_survived
no_died = len(titanic_data) - no_survived #no of people died
#print no_died
print('No of people Survived : {} , {:.2f}% of total'.format(no_survived, float(no_survived*100 )/len(titanic_data)))
print('No of people Died : {} , {:.2f}% of total'.format(no_died, float(no_died*100 )/len(titanic_data)))
fare_list.sort_values([0],inplace =True)
fares = pd.DataFrame(titanic_data.Fare)
fares.sort_values(['Fare'],inplace =True)
top_90_fare = fares[800:801]['Fare'] # Fare at sorted position 800 of 891, i.e. roughly the cutoff for the top 10% of fares
top_90_fare
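The cutoff above is read off by position in the sorted fares. As an alternative sketch (on made-up fares, not the dataset), pandas can compute the same kind of threshold with `Series.quantile`; the two approaches agree on this toy series, but in general they can differ slightly because `quantile` interpolates between neighbouring values by default.

```python
import pandas as pd

# Toy fares (made-up values); in the notebook this would be titanic_data['Fare'].
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 10.5, 13.0, 26.0, 30.0, 52.0, 80.0, 512.33])

# Positional approach: sort, then read the value 90% of the way in.
sorted_fares = fares.sort_values().reset_index(drop=True)
cutoff_by_position = sorted_fares.iloc[int(0.9 * len(sorted_fares))]

# Built-in approach: the 0.9 quantile (linearly interpolated by default).
cutoff_by_quantile = fares.quantile(0.9)

print(cutoff_by_position, cutoff_by_quantile)
```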
def isVIP(x):
    if x == 0:
        return "LowerClass" #Probably a Staff's relative/friend travelling with passes
    elif x >= 77.2875:
        return "VIP" # One of Top 10% guys travelling in Ship
    else:
        return "Gen" #Normal People travelling in Ship
titanic_data["Is_VIP"] = pd.Series(titanic_data["Fare"].apply(isVIP), index=titanic_data.index)
no_Gen = titanic_data['Is_VIP'].value_counts()['Gen']
no_Lower = titanic_data['Is_VIP'].value_counts()['LowerClass']
no_VIP = titanic_data['Is_VIP'].value_counts()['VIP']
no_Gen_survived = titanic_data.groupby(['Is_VIP' , 'Survived']).size()['Gen'][1]
no_Gen_died = no_Gen - no_Gen_survived
no_Lower_survived = titanic_data.groupby(['Is_VIP' , 'Survived']).size()['LowerClass'][1]
no_Lower_died = no_Lower - no_Lower_survived
no_VIP_survived = titanic_data.groupby(['Is_VIP' , 'Survived']).size()['VIP'][1]
no_VIP_died = no_VIP - no_VIP_survived
print('No. of General People with $0< fare < 77.28 : {} , {:.2f}% of total'.format(no_Gen, float(no_Gen*100 )/len(titanic_data)))
print('No. of General People Survived : {} , {:.2f}%'.format(no_Gen_survived , float(no_Gen_survived)*100/no_Gen))
print('No. of General People Died : {}, {:.2f}% '.format( no_Gen_died, float(no_Gen_died)*100/no_Gen))
print('\n')
print('No. of Lower Class People / Employees who were travelling for free : {} , {:.2f}% of total'.format(no_Lower, float(no_Lower*100 )/len(titanic_data)))
print('No. of Lower Class People/ Employees Survived : {} , {:.2f}%'.format(no_Lower_survived , float(no_Lower_survived)*100/no_Lower))
print('No. of Lower Class People/ Employees Died : {}, {:.2f}% '.format( no_Lower_died, float(no_Lower_died)*100/no_Lower))
print('\n')
print('No. of VIPs who were travelling : {} , {:.2f}% of total'.format(no_VIP, float(no_VIP*100 )/len(titanic_data)))
print('No. of VIPs Survived : {} , {:.2f}%'.format(no_VIP_survived , float(no_VIP_survived)*100/no_VIP))
print('No. of VIPs Died : {}, {:.2f}% '.format( no_VIP_died, float(no_VIP_died)*100/no_VIP))
sns.set_style("whitegrid")
sns.barplot(data = titanic_data , y = "Survived" , x ="Is_VIP")
plt.xlabel('Categorization according to Ticket Fare')
plt.ylabel('Survival Rate')
plt.title("Distribution of Survival rate vs Categorization according to ticket fare" , fontsize = 13)
plt.show()
#This graph shows the basic fare analysis
matplotlib.style.use('ggplot')
ax = fare_list.plot(kind="hist")
ax.set_xlabel("Fare --->")
ax.set_title("Fare Distribution")
The above analysis shows that passengers travelling on a high-priced ticket (VIPs) had a much higher survival rate than General and Lower Class / Employee passengers.
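The count / survived / died / percentage figures printed above follow the same pattern for every grouping in this notebook. One possible way to condense that pattern is a small helper built on `groupby`; `survival_summary` is a hypothetical name introduced here, and the data is a made-up toy frame with the notebook's column names.

```python
import pandas as pd

def survival_summary(df, by):
    """Return count, survivors, deaths and survival rate (%) for each group in `by`."""
    grouped = df.groupby(by)["Survived"]
    summary = grouped.agg(total="count", survived="sum")
    summary["died"] = summary["total"] - summary["survived"]
    summary["survival_rate_pct"] = 100.0 * summary["survived"] / summary["total"]
    return summary

# Toy data (made up) with the same column names as the notebook.
toy = pd.DataFrame({
    "Is_VIP":   ["Gen", "Gen", "Gen", "VIP", "VIP", "LowerClass"],
    "Survived": [0, 1, 0, 1, 1, 0],
})
print(survival_summary(toy, "Is_VIP"))
```

Because the survivors are obtained by summing the 0/1 column, a group with no survivors simply shows 0 instead of raising a lookup error.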
#No. of males and females
no_male = titanic_data['Sex'].value_counts()['male']
no_female = titanic_data['Sex'].value_counts()['female']
#Males who survived / died (select by label rather than by position)
no_male_survived = titanic_data.groupby(['Sex' , 'Survived']).size()['male'][1]
no_male_died = no_male - no_male_survived
#Females who survived / died
no_female_survived = titanic_data.groupby(['Sex' , 'Survived']).size()['female'][1]
no_female_died = no_female - no_female_survived
print('No. of Males : {} , {:.2f}% of total'.format(no_male, float(no_male*100 )/len(titanic_data)))
print('No. of Male Survived : {} , {:.2f}%'.format(no_male_survived , float(no_male_survived)*100/no_male))
print('No. of Male Died : {}, {:.2f}% '.format( no_male_died, float(no_male_died)*100/no_male ))
print('\n')
print('No. of Females : {} , {:.2f}% of total'.format(no_female, float(no_female)*100 /len(titanic_data)))
print('No. of Female Survived : {} , {:.2f}%'.format(no_female_survived , float(no_female_survived)*100/no_female))
print('No. of Female Died : {}, {:.2f}% '.format( no_female_died, float(no_female_died)*100/no_female ))
sns.set_style("whitegrid")
sns.barplot(data = titanic_data , x = "Survived" , y ="Sex",capsize=14)
plt.ylabel('Sex (male /female)', fontsize=16)
plt.xlabel('Survival Rate', fontsize=16)
plt.title("Distribution of Survival rate vs Sex Ratio" , fontsize = 13)
plt.show()
The above analysis suggests that females were given preference in being saved, i.e. their survival rate was higher than that of males.
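The same male/female comparison can be laid out as a contingency table. The sketch below uses `pd.crosstab` on made-up rows (not the dataset) to show raw counts and row percentages side by side.

```python
import pandas as pd

# Toy rows (made up) with the notebook's column names.
toy = pd.DataFrame({
    "Sex":      ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

counts = pd.crosstab(toy["Sex"], toy["Survived"])                          # raw counts
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index") * 100  # row percentages

print(counts)
print(rates.round(2))
```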
def isAge(x):
    if x < 18.0:
        return "Child"
    elif x >60.0:
        return "Senior Citizen"
    else:
        return "Adult"
titanic_data["IsChild"] = pd.Series(titanic_data["Age"].apply(isAge), index=titanic_data.index)
no_Child = titanic_data['IsChild'].value_counts()['Child']
no_SrCz = titanic_data['IsChild'].value_counts()['Senior Citizen']
no_Adult = titanic_data['IsChild'].value_counts()['Adult']
no_Child_survived = titanic_data.groupby(['IsChild' , 'Survived']).size()['Child'][1]
no_Child_died = no_Child - no_Child_survived
#Senior Citizens who survived / died
no_SrCz_survived = titanic_data.groupby(['IsChild' , 'Survived']).size()['Senior Citizen'][1]
no_SrCz_died = no_SrCz - no_SrCz_survived
no_Adult_survived = titanic_data.groupby(['IsChild' , 'Survived']).size()['Adult'][1]
no_Adult_died = no_Adult - no_Adult_survived
print('No. of Children : {} , {:.2f}% of total'.format(no_Child, float(no_Child*100 )/len(titanic_data)))
print('No. of Child Survived : {} , {:.2f}%'.format(no_Child_survived , float(no_Child_survived)*100/no_Child))
print('No. of Child Died : {}, {:.2f}% '.format( no_Child_died, float(no_Child_died)*100/no_Child ))
print('\n')
print('No. of Senior Citizen : {} , {:.2f}% of total'.format(no_SrCz, float(no_SrCz*100 )/len(titanic_data)))
print('No. of Senior Citizen Survived : {} , {:.2f}%'.format(no_SrCz_survived , float(no_SrCz_survived)*100/no_SrCz))
print('No. of Senior Citizen Died : {}, {:.2f}% '.format( no_SrCz_died, float(no_SrCz_died)*100/no_SrCz ))
print('\n')
print('No. of Adults : {} , {:.2f}% of total'.format(no_Adult, float(no_Adult*100 )/len(titanic_data)))
print('No. of Adults Survived : {} , {:.2f}%'.format(no_Adult_survived , float(no_Adult_survived)*100/no_Adult))
print('No. of Adults Died : {}, {:.2f}% '.format( no_Adult_died, float(no_Adult_died)*100/no_Adult ))
sns.set_style("whitegrid")
sns.barplot(data = titanic_data , y = "Survived" , x ="IsChild")
plt.ylabel('Survival Rate', fontsize=16)
plt.xlabel('Categorization according to age', fontsize=16)
plt.title("Distribution of Survival rate vs Categorization according to age" , fontsize = 13)
plt.show()
The above analysis shows that:
54% of children were saved, so the survival rate of children was higher than that of Adults and Senior Citizens.
The survival rate of Adults (36.5%) was higher than that of Senior Citizens (22.7%).
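The Child / Adult / Senior Citizen split used above (under 18, over 60, everything else) can also be written without a row-by-row function. This is a sketch using `np.select` on made-up ages; it is an alternative formulation, not the code that produced the numbers above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([4.0, 17.9, 18.0, 29.7, 60.0, 60.5, 71.0])  # made-up ages

# Conditions are checked in order; anything matching neither falls back to "Adult".
conditions = [ages < 18.0, ages > 60.0]
labels = ["Child", "Senior Citizen"]
age_group = pd.Series(np.select(conditions, labels, default="Adult"), index=ages.index)

print(age_group.value_counts())
```

Note the boundaries: exactly 18 and exactly 60 both count as Adult, matching the thresholds in the function above.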
no_Female_Child = titanic_data.groupby(['IsChild' , 'Sex']).size()['Child']['female']
no_male_Child = titanic_data.groupby(['IsChild' , 'Sex']).size()['Child']['male']
no_CFemale_Survived = titanic_data.groupby(['IsChild' ,'Survived', 'Sex']).size()['Child'][1]['female']
no_CFemale_Died = no_Female_Child - no_CFemale_Survived
no_CMale_Survived = titanic_data.groupby(['IsChild' ,'Survived', 'Sex']).size()['Child'][1]['male']
no_CMale_Died = no_male_Child -no_CMale_Survived
print('No. of Female Child : {} , {:.2f}% of total'.format(no_Female_Child, float(no_Female_Child*100 )/len(titanic_data)))
print('No. of Female Child Survived : {} , {:.2f}%'.format(no_CFemale_Survived , float(no_CFemale_Survived)*100/no_Female_Child))
print('No. of Female Child Died : {}, {:.2f}% '.format( no_CFemale_Died, float(no_CFemale_Died)*100/no_Female_Child ))
print('\n')
print('No. of Male Child : {} , {:.2f}% of total'.format(no_male_Child, float(no_male_Child*100 )/len(titanic_data)))
print('No. of Male Child Survived : {} , {:.2f}%'.format(no_CMale_Survived , float(no_CMale_Survived)*100/no_male_Child))
print('No. of Male Child Died : {}, {:.2f}% '.format( no_CMale_Died, float(no_CMale_Died)*100/no_male_Child ))
print('\n')
sns.set_style("darkgrid")
sns.barplot(data = titanic_data , y = "Survived" , x ="Sex" , hue="IsChild")
plt.ylabel('Survival Rate', fontsize=16)
plt.title("Distribution of Survival rate vs Sex Ratio vs Age Categorization" , fontsize = 13)
plt.show()
This shows that the survival rate of female children (69%) is higher than that of male children (40%).
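When two categorical variables are involved, as with age category and sex here, the survival rates can be arranged in a small table. The sketch below groups made-up rows by both keys; since Survived is 0/1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# Toy rows (made up) with the notebook's column names.
toy = pd.DataFrame({
    "IsChild":  ["Child", "Child", "Child", "Child", "Adult", "Adult", "Adult", "Adult"],
    "Sex":      ["female", "female", "male", "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the survival rate; unstack turns Sex into columns.
rate_table = toy.groupby(["IsChild", "Sex"])["Survived"].mean().unstack("Sex") * 100
print(rate_table)
```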
no_class_1 = titanic_data['Pclass'].value_counts()[1]
no_class_2 = titanic_data['Pclass'].value_counts()[2]
no_class_3 = titanic_data['Pclass'].value_counts()[3]
#print titanic_data.groupby(['Pclass' , 'Survived']).size()
no_class_1_survived = titanic_data.groupby(['Pclass' , 'Survived']).size()[1][1]
no_class_1_died = no_class_1 - no_class_1_survived
no_class_2_survived = titanic_data.groupby(['Pclass' , 'Survived']).size()[2][1]
no_class_2_died = no_class_2 - no_class_2_survived
no_class_3_survived = titanic_data.groupby(['Pclass' , 'Survived']).size()[3][1]
no_class_3_died = no_class_3 - no_class_3_survived
print('No. of Class 1 people : {} , {:.2f}% of total'.format(no_class_1, float(no_class_1*100 )/len(titanic_data)))
print('No. of Class 1 people Survived : {} , {:.2f}%'.format(no_class_1_survived , float(no_class_1_survived)*100/no_class_1))
print('No. of Class 1 people Died : {}, {:.2f}% '.format( no_class_1_died, float(no_class_1_died)*100/no_class_1 ))
print('\n')
print('No. of Class 2 people : {} , {:.2f}% of total'.format(no_class_2, float(no_class_2)*100 /len(titanic_data)))
print('No. of Class 2 people Survived : {} , {:.2f}%'.format(no_class_2_survived , float(no_class_2_survived)*100/no_class_2))
print('No. of Class 2 people Died : {}, {:.2f}% '.format( no_class_2_died, float(no_class_2_died)*100/no_class_2 ))
print('\n')
print('No. of Class 3 people : {} , {:.2f}% of total'.format(no_class_3, float(no_class_3)*100 /len(titanic_data)))
print('No. of Class 3 people Survived : {} , {:.2f}%'.format(no_class_3_survived , float(no_class_3_survived)*100/no_class_3))
print('No. of Class 3 people Died : {}, {:.2f}% '.format( no_class_3_died, float(no_class_3_died)*100/no_class_3 ))
sns.set_style("darkgrid")
sns.barplot(data = titanic_data , y = "Survived" , x ="Pclass" )
plt.ylabel('Survival Rate', fontsize=16)
plt.xlabel('Passenger Class', fontsize=16)
plt.title("Distribution of Survival rate vs Passenger Class" , fontsize = 13)
plt.show()
This shows that Upper Class passengers (63%) were preferred over Middle Class (47%) and Lower Class (24%) passengers.
no_boarded_C = titanic_data['Embarked'].value_counts()['C']
no_boarded_Q = titanic_data['Embarked'].value_counts()['Q']
no_boarded_S = titanic_data['Embarked'].value_counts()['S']
#print titanic_data.groupby(['Embarked' , 'Survived']).size()
no_boarded_C_survived = titanic_data.groupby(['Embarked' , 'Survived']).size()['C'][1]
no_boarded_C_died = no_boarded_C - no_boarded_C_survived
no_boarded_Q_survived = titanic_data.groupby(['Embarked' , 'Survived']).size()['Q'][1]
no_boarded_Q_died = no_boarded_Q - no_boarded_Q_survived
no_boarded_S_survived = titanic_data.groupby(['Embarked' , 'Survived']).size()['S'][1]
no_boarded_S_died = no_boarded_S - no_boarded_S_survived
print('No. of People boarded from Cherbourg : {} , {:.2f}% of total'.format(no_boarded_C, float(no_boarded_C*100 )/len(titanic_data)))
print('No. of People boarded from Cherbourg who Survived: {} , {:.2f}%'.format(no_boarded_C_survived , float(no_boarded_C_survived)*100/no_boarded_C))
print('No. of People boarded from Cherbourg who Died : {}, {:.2f}% '.format( no_boarded_C_died, float(no_boarded_C_died)*100/no_boarded_C ))
print('\n')
print('No. of People boarded from Queenstown : {} , {:.2f}% of total'.format(no_boarded_Q, float(no_boarded_Q)*100 /len(titanic_data)))
print('No. of People boarded from Queenstown who Survived : {} , {:.2f}%'.format(no_boarded_Q_survived , float(no_boarded_Q_survived)*100/no_boarded_Q))
print('No. of People boarded from Queenstown who Died : {}, {:.2f}% '.format( no_boarded_Q_died, float(no_boarded_Q_died)*100/no_boarded_Q ))
print('\n')
print('No. of People boarded from Southampton : {} , {:.2f}% of total'.format(no_boarded_S, float(no_boarded_S)*100 /len(titanic_data)))
print('No. of People boarded from Southampton who Survived : {} , {:.2f}%'.format(no_boarded_S_survived , float(no_boarded_S_survived)*100/no_boarded_S))
print('No. of People boarded from Southampton who Died : {}, {:.2f}% '.format( no_boarded_S_died, float(no_boarded_S_died)*100/no_boarded_S ))
sns.set_style("darkgrid")
sns.barplot(data = titanic_data , y = "Survived" , x ="Embarked" )
plt.ylabel('Survival Rate', fontsize=16)
plt.xlabel('Embarkment Station', fontsize=16)
plt.title("Distribution of Survival rate vs Embarkment Station" , fontsize = 13)
plt.show()
This shows that people who boarded at:
- Cherbourg had the highest probability of survival (55.36%)
- Southampton had the lowest probability of survival (33.7%)
def isAdultMan(x):
    return (x["IsChild"] =="Senior Citizen" or x["IsChild"] =="Adult") and x["Sex"] == "male"
# .copy() so that adding a column below does not write into a slice of titanic_data
adult_man_titanic_data = titanic_data[titanic_data.apply(isAdultMan, axis=1)].copy()
def isFamilyMan(x):
    # SibSp > 0 is treated as travelling with a spouse; Parch > 0 as travelling with children
    if x["SibSp"] > 0:
        if x["Parch"] > 0:
            return "Father"
        else:
            return "Husband"
    else:
        return "Single"
adult_man_titanic_data["FamilyMan"] = adult_man_titanic_data.apply(isFamilyMan, axis=1)
# print adult_man_titanic_data["FamilyMan"].value_counts()
no_Adult_Fathers_survived = adult_man_titanic_data.groupby(['FamilyMan' , 'Survived']).size()['Father'][1]
no_Adult_Fathers_died = adult_man_titanic_data.groupby(['FamilyMan' , 'Survived']).size()['Father'][0]
no_Fathers = adult_man_titanic_data["FamilyMan"].value_counts()["Father"]
print('No. of Adult Fathers : {} , {:.2f}% of total'.format(no_Fathers , float(no_Fathers *100 )/len(titanic_data)))
print('No. of Adult Fathers Survived : {} , {:.2f}%'.format(no_Adult_Fathers_survived , float(no_Adult_Fathers_survived)*100/no_Fathers))
print('No. of Adult Fathers Died : {}, {:.2f}% '.format( no_Adult_Fathers_died, float(no_Adult_Fathers_died)*100/no_Fathers ))
print('\n')
no_Husband_survived = adult_man_titanic_data.groupby(['FamilyMan' , 'Survived']).size()['Husband'][1]
no_Husband_died = adult_man_titanic_data.groupby(['FamilyMan' , 'Survived']).size()['Husband'][0]
no_Husband = adult_man_titanic_data["FamilyMan"].value_counts()["Husband"]
print('No. of Adult Husband : {} , {:.2f}% of total'.format(no_Husband , float(no_Husband *100 )/len(titanic_data)))
print('No. of Adult Husband Survived : {} , {:.2f}%'.format(no_Husband_survived , float(no_Husband_survived)*100/no_Husband))
print('No. of Adult Husband Died : {}, {:.2f}% '.format( no_Husband_died, float(no_Husband_died)*100/no_Husband ))
print('\n')
no_Single_survived = adult_man_titanic_data.groupby(['FamilyMan' , 'Survived']).size()['Single'][1]
no_Single_died = adult_man_titanic_data.groupby(['FamilyMan' , 'Survived']).size()['Single'][0]
no_Single = adult_man_titanic_data["FamilyMan"].value_counts()["Single"]
print('No. of Adult Single : {} , {:.2f}% of total'.format(no_Single , float(no_Single *100 )/len(titanic_data)))
print('No. of Adult Single Survived : {} , {:.2f}%'.format(no_Single_survived , float(no_Single_survived)*100/no_Single))
print('No. of Adult Single Died : {}, {:.2f}% '.format( no_Single_died, float(no_Single_died)*100/no_Single ))
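print('\n')
sns.set_style("darkgrid")
# sns.catplot is the name of sns.factorplot in seaborn >= 0.9
g= sns.catplot(data=adult_man_titanic_data,x="Survived", col="FamilyMan", kind="count" )
# kind="count" plots counts, so the y axis is a number of men rather than a rate
g.set_axis_labels("", "No. of men").set_xticklabels(["Died", "Survived"])
plt.show()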
This shows that the survival rate of men travelling with their wife is higher than that of men travelling with their kids.
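It helps to see exactly how the Father / Husband / Single rule labels a row. The function below is a sketch that mirrors the rule on plain integers; `family_role` is a name introduced here. Because sibsp counts siblings as well as spouses (see the variable definitions at the top), and a man with parch > 0 but sibsp = 0 lands in "Single", the three labels are approximations.

```python
def family_role(sibsp, parch):
    """Mirror of the notebook's isFamilyMan rule, on plain integers."""
    if sibsp > 0:
        return "Father" if parch > 0 else "Husband"
    return "Single"

# (SibSp, Parch) pairs, made up to cover each branch of the rule.
examples = [(1, 0), (1, 2), (0, 0), (0, 2)]
for sibsp, parch in examples:
    print(sibsp, parch, family_role(sibsp, parch))
```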
Best_Age = titanic_data[titanic_data['Survived']==1]['Age']
ax = sns.distplot(Best_Age,bins=30)
ax.set(xlabel='Age', ylabel='no. of people',title="Distribution between Frequency Vs Age of People who Survived")
plt.show()
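This shows that the largest number of survivors falls in the 29 - 32 age group. Note that the missing ages were replaced with the mean (29.699) earlier, so part of this peak comes from those filled-in values.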
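The peak can also be located numerically rather than read off the plot. The sketch below bins made-up ages with `np.histogram` and reports the fullest bin; the value 29.7 is repeated on purpose to mimic mean-filled entries, and the bin width of 10 years is an arbitrary choice for the example.

```python
import numpy as np

# Made-up survivor ages; 29.7 is repeated to mimic mean-filled entries.
ages = np.array([2, 5, 19, 22, 24, 27, 29.7, 29.7, 29.7, 30, 31, 35, 38, 45, 52, 63])

counts, edges = np.histogram(ages, bins=7, range=(0, 70))  # 10-year-wide bins
peak = int(np.argmax(counts))                              # index of the fullest bin

print("counts per bin:", counts)
print("fullest bin: [{:.0f}, {:.0f})".format(edges[peak], edges[peak + 1]))
```

Running the same count on the known ages only (before filling NaNs) would show how much of the peak depends on the filled-in values.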
I have drawn 3 interactive graphs to summarize our analysis. Graphs can be seen here.
While working with the Titanic dataset I found several limitations that made deeper analysis more difficult and in some cases unreliable. The dataset has many missing values; in particular, the Age and Cabin columns contain many NaN values.