{"id":318,"date":"2017-07-28T17:11:41","date_gmt":"2017-07-28T17:11:41","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=318"},"modified":"2017-07-28T17:11:41","modified_gmt":"2017-07-28T17:11:41","slug":"data-wrangling","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/data-wrangling\/","title":{"rendered":"Titanic Survivors Dataset and Data Wrangling"},"content":{"rendered":"
\"Data
Data Wrangling, Yee Ha!<\/figcaption><\/figure>\n

We have learnt how to select a machine learning model, so it is time to study another topic from the Data Science Life Cycle: Data Collection.

Yes, we do need to know how to collect data. Unlike our Iris Classification project, which came with a well-prepared data set, sometimes we need to prepare our own data for the machine to learn.


Data Wrangling

That brings us to another topic to learn: Data Wrangling. Data Wrangling is the process of transforming raw data into machine-readable data. This time, we use a well-known data set as our subject: the Titanic survivors data set.

First of all, let's get the data sets from the Titanic Machine Learning competition at Kaggle.com. Although it is called a "competition", it is actually an entry-level data science exercise.

You can download a train.csv file as the training data set and a test.csv file for result prediction. Then we use Pandas to take a look at their data structures:

import pandas as pd
import numpy as np

#replace the file paths with your own csv file locations
train_df = pd.read_csv("Titanic/train.csv")
test_df = pd.read_csv("Titanic/test.csv")

#show the number of rows and columns
print(train_df.shape)
#show the number of missing values per column
print(train_df.apply(lambda x: sum(x.isnull()), axis=0))
print(test_df.shape)
print(test_df.apply(lambda x: sum(x.isnull()), axis=0))
(891, 12)
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
(418, 11)
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Our mission in this data science project is to find out who survives in the testing data set; that is why the "Survived" field is missing from the test.csv file. We also find missing values in the "Age", "Fare", "Cabin" and "Embarked" features, e.g.:

\"\"<\/p>\n

Other than those missing values, "Name", "Sex", "Ticket", "Cabin" and "Embarked" are all non-numeric features. Do not feel frustrated; think positive! This is a chance for us to learn data wrangling techniques.

Let the machine read the data

Our objective in data wrangling is to transform raw data into machine-readable data. Let's start with the easiest step, the "Sex" feature. We open our machine's "eyes" by mapping male to 1 and female to 0:

train_df[\"Sex\"] = train_df[\"Sex\"].map({\"male\": 1, \"female\":0})\r\n<\/pre>\n

Then we have changed the original "male" and "female" values to machine-readable 1/0 values:

\"\"<\/p>\n

We can apply the same technique to the "Embarked" feature, but we know there are 2 missing values in the training data set. So we first check the current content of the "Embarked" feature:

print(train_df['Embarked'].value_counts(ascending=True))
print(train_df['Embarked'].value_counts(normalize=True, ascending=True))

And we get:

Q     77
C    168
S    644
Name: Embarked, dtype: int64
Q    0.086614
C    0.188976
S    0.724409
Name: Embarked, dtype: float64

"S" (Southampton, from the data dictionary) is the majority of the available values, so we can fill the 2 missing values with "S" and map all values to numbers.

train_df['Embarked'] = train_df['Embarked'].fillna('S')
train_df['Embarked'] = train_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

For the "Cabin" feature, since 687 out of 891 records are missing, we can simply skip this feature.

There are no null values in the "Ticket" feature, but I can hardly find a real-life relationship between survival rate and ticket number, so we skip this feature as well. (A quick check of both claims is sketched below.)
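To back these two decisions with numbers, here is a minimal check (a sketch, not from the original post) of the "Cabin" missing-value ratio and the "Ticket" cardinality:

#share of missing values in "Cabin" (687/891, about 77%)
print(train_df["Cabin"].isnull().mean())
#"Ticket" is close to a unique identifier, so it carries little grouping signal
print(train_df["Ticket"].nunique(), "distinct tickets in", len(train_df), "rows")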

That's Not My Name

Now we only have one non-numeric feature left to solve: "Name". We find that, besides the first name and last name, a title is stored in the "Name" column as well. Let's take a closer look at the title values.

#get title from name: the part between the comma and the period
train_df['Title'] = train_df['Name'].apply(lambda x: x.split(",")[1].split(".")[0].strip())
print(train_df['Title'].value_counts(ascending=True, dropna=False))
Ms                1
Mme               1
Capt              1
Sir               1
Jonkheer          1
the Countess      1
Lady              1
Don               1
Major             2
Col               2
Mlle              2
Rev               6
Dr                7
Master           40
Mrs             125
Miss            182
Mr              517

We group different titles according to their social status and similarity.

train_df[\"Title\"] = train_df[\"Title\"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\r\ntrain_df['Title'] = train_df['Title'].replace('Mlle', 'Miss')\r\ntrain_df['Title'] = train_df['Title'].replace('Ms', 'Miss')\r\ntrain_df['Title'] = train_df['Title'].replace('Mme', 'Mrs')\r\n<\/pre>\n

And plot their relationship with the survival rate.

import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(x="Title", y="Survived", data=train_df)

plt.show()

Please note that we are using the seaborn library, which is built on top of matplotlib. It provides further enhancements in both functionality and presentation for plotting. In short: an upgrade.

\"\"<\/p>\n

Obviously, females (the "Mrs" and "Miss" groups) had a much better chance to survive. But the "Rare" and "Master" title groups also had better survival rates than the "Mr" group. Sometimes a better title does not only mean better social status; it also pays off in the game of survival.

We map the title groups to numeric values and drop the unused features.

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
train_df['Title'] = train_df['Title'].map(title_mapping)

train_df = train_df.drop(["Name", "Ticket", "Cabin"], axis=1)

Then we have an all-numeric data set:

[Figure: the all-numeric data set]

The Number Game

Now we focus on the numeric features. "SibSp" and "Parch" represent siblings/spouses and parents/children, i.e. family members. We make a new feature, "FamilySize", combining these 2 features, and check its relationship with the survival rate using a kernel density estimate plot.

train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1

facet = sns.FacetGrid(train_df, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'FamilySize', shade=True)
facet.set(xlim=(0, train_df['FamilySize'].max()))
facet.add_legend()

plt.show()

\"\"<\/p>\n

It is quite clear that passengers with 2 or 3 family members had a better chance to survive than individual travelers. We then separate "FamilySize" into 4 different groups (a quick sanity check of the grouping follows the code).

bins = (-1, 1, 2, 3, 12)
group_names = [1, 2, 3, 4]
categories = pd.cut(train_df['FamilySize'], bins, labels=group_names)
train_df['FamilyGroup'] = categories
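As a quick sanity check of the grouping (a sketch, not in the original post), we can compare the mean survival rate of each family-size group; groups 2 and 3 (family sizes 2 and 3) should stand out:

#mean survival rate per FamilyGroup
print(train_df.groupby('FamilyGroup')['Survived'].mean())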

The family size feature raises another question: would passengers with parents and children have a better chance to survive than those with siblings and spouses? Thus I create the following 2 features, "withP" (with parents/children) and "withS" (with siblings/spouses):

train_df['withP'] = 0
train_df['withS'] = 0

train_df.loc[train_df['SibSp'] > 0, 'withS'] = 1
train_df.loc[train_df['Parch'] > 0, 'withP'] = 1

sns.barplot(x="withP", y="Survived", hue="withS", data=train_df)
plt.show()

And compare their relationship with survival rates:

[Figure: survival rate by "withP", grouped by "withS"]

When parents/children are a passenger's only relatives on the ship, they had a better chance to survive than others.

Money, Money, Money

It is time to put passengers' fares into our model learning routine, but:

fare_dist = sns.distplot(train_df["Fare"], color="m", label="Skewness : %.2f" % (train_df["Fare"].skew()))
fare_dist = fare_dist.legend(loc="best")
plt.show()

\"\"<\/p>\n

There are extreme values in the "Fare" feature. In this case, we can use the logarithm to reduce the impact of these extreme values.

train_df[\"Fare\"] = train_df[\"Fare\"].fillna(train_df[\"Fare\"].median())\r\ntrain_df['Fare_log'] = train_df[\"Fare\"].map(lambda i: np.log(i) if i > 0 else 0)\r\n\r\nfare_log_dist = sns.distplot(train_df[\"Fare_log\"], color=\"m\", label=\"Skewness : %.2f\"%(train_df[\"Fare_log\"].skew()))\r\nfare_log_dist = fare_log_dist.legend(loc=\"best\")\r\nplt.show()\r\n<\/pre>\n

\"\"<\/p>\n

See? The skewness has changed from 4.79 to 0.44!

facet = sns.FacetGrid(train_df, hue="Survived", aspect=4)
facet.map(sns.kdeplot, 'Fare_log', shade=True)
facet.set(xlim=(0, train_df['Fare_log'].max()))
facet.add_legend()

plt.show()

bins = (-1, 2, 2.68, 3.44, 10)
group_names = [1, 2, 3, 4]
categories = pd.cut(train_df['Fare_log'], bins, labels=group_names)
train_df['FareGroup'] = categories

According to the Fare_log facet grid graph, we cut Fare_log into 4 groups.

\"\"<\/p>\n

We are almost done; let's finish with our last feature, the age.

For the "Age" feature in our training data set, 177 out of 891 values are missing. First, we would like to know the correlation of age with the other features by using a heat map.

age_heat = sns.heatmap(train_df[["Age", "Sex", "SibSp", "Parch", "Pclass", "Embarked"]].corr(), annot=True)
plt.show()

\"\"<\/p>\n

We find that "SibSp", "Parch" and "Pclass" are relevant to "Age". Thus, instead of filling the missing ages with the overall mean, we compare the ages of passengers with similar family sizes and classes. If no passenger with a similar background is found, we fill the missing age with a random value between the mean minus one standard deviation and the mean plus one standard deviation.

index_NaN_age = list(train_df["Age"][train_df["Age"].isnull()].index)

age_mean = train_df["Age"].mean()
age_std = train_df["Age"].std()

for i in index_NaN_age:
    #mean age of passengers with the same SibSp, Parch and Pclass
    age_pred_w_spc = train_df["Age"][((train_df['SibSp'] == train_df.iloc[i]["SibSp"]) & (train_df['Parch'] == train_df.iloc[i]["Parch"]) & (train_df['Pclass'] == train_df.iloc[i]["Pclass"]))].mean()
    #fallback: random age within one standard deviation of the mean
    age_pred_wo_spc = np.random.randint(age_mean - age_std, age_mean + age_std)

    if not np.isnan(age_pred_w_spc):
        #use .loc to avoid chained-assignment issues
        train_df.loc[i, 'Age'] = age_pred_w_spc
    else:
        train_df.loc[i, 'Age'] = age_pred_wo_spc
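For reference, a more idiomatic pandas version of the same idea (a sketch, not from the original post) fills each missing age with the mean of its (SibSp, Parch, Pclass) group first, then falls back to random values around the overall mean:

#group mean per (SibSp, Parch, Pclass), aligned back to each row
group_mean_age = train_df.groupby(['SibSp', 'Parch', 'Pclass'])['Age'].transform('mean')
train_df['Age'] = train_df['Age'].fillna(group_mean_age)

#rows whose group had no known age still need a fallback value
still_missing = train_df['Age'].isnull()
train_df.loc[still_missing, 'Age'] = np.random.randint(
    age_mean - age_std, age_mean + age_std, size=still_missing.sum())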

Now that we have handled all the non-numeric and missing values, we can drop the unused features.

#"Name", "Ticket" and "Cabin" were already dropped in the title-mapping step above
X_learning = train_df.drop(['SibSp', 'Parch', 'Fare', 'Survived', 'Fare_log', 'FamilySize', 'PassengerId'], axis=1)
Y_learning = train_df['Survived']

K-Fold Cross-Validation Time

Do you remember the K-Fold Cross-Validation process from the last post? Yes, it is the process we use to choose a suitable learning model. Now that we have a well-formatted training data set, it is time to use it for validation.

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

random_state = 33
models = []
models.append(("RFC", RandomForestClassifier(random_state=random_state)))
models.append(("ETC", ExtraTreesClassifier(random_state=random_state)))
models.append(("ADA", AdaBoostClassifier(random_state=random_state)))
models.append(("GBC", GradientBoostingClassifier(random_state=random_state)))
models.append(("SVC", SVC(random_state=random_state)))
models.append(("LoR", LogisticRegression(random_state=random_state)))
models.append(("LDA", LinearDiscriminantAnalysis()))
models.append(("QDA", QuadraticDiscriminantAnalysis()))
models.append(("DTC", DecisionTreeClassifier(random_state=random_state)))
models.append(("XGB", xgb.XGBClassifier()))

Please note that, in addition to the popular classifier models from the Scikit-Learn library, I have added the XGBoost Classifier (XGBoost) to the model list. XGBoost is a powerful boosting algorithm and is often chosen as a winning tool in data analysis competitions.

from sklearn import model_selection

kfold = model_selection.KFold(n_splits=10)

for name, model in models:
    #cross validation among models, scored on accuracy
    cv_results = model_selection.cross_val_score(model, X_learning, Y_learning, scoring='accuracy', cv=kfold)
    print("\n[%s] Mean: %.8f Std. Dev.: %8f" % (name, cv_results.mean(), cv_results.std()))

And the results are:

[RFC] Mean: 0.80365793 Std. Dev.: 0.033661
[ETC] Mean: 0.78902622 Std. Dev.: 0.030693
[ADA] Mean: 0.80585518 Std. Dev.: 0.032839
[GBC] Mean: 0.82720350 Std. Dev.: 0.033126
[SVC] Mean: 0.79242197 Std. Dev.: 0.047439
[LoR] Mean: 0.80810237 Std. Dev.: 0.029757
[LDA] Mean: 0.79574282 Std. Dev.: 0.034687
[QDA] Mean: 0.79466916 Std. Dev.: 0.042005
[DTC] Mean: 0.77669164 Std. Dev.: 0.026440
[XGB] Mean: 0.83053683 Std. Dev.: 0.031099

In a bar chart:

[Figure: mean cross-validation accuracy by model]

Obviously, XGBoost tops the K-Fold Cross-Validation, followed by the Gradient Boosting Classifier. (Oh, boosting algorithms rule the game this time.)

Since we are focusing on data wrangling this time, not model tuning, I just use a plain XGBoost to predict the testing data set and submit the result to Kaggle. It gives me a 0.78469 score. (A sketch of this final step is shown below.)
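The post does not show this final step inline; a minimal sketch, assuming test_df has already gone through the same wrangling steps as train_df (so X_test below is a hypothetical frame with the same columns as X_learning), could look like this:

#fit a plain XGBoost classifier on the wrangled training set
model = xgb.XGBClassifier()
model.fit(X_learning, Y_learning)

#X_test is assumed to be test_df after the same wrangling as above
predictions = model.predict(X_test)

#build the Kaggle submission file
submission = pd.DataFrame({"PassengerId": test_df["PassengerId"], "Survived": predictions})
submission.to_csv("titanic_submission.csv", index=False)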

There is still room for improvement; I hope all of you can learn from this post and make a better model. Just remember: practice makes perfect. Enjoy!


The complete source can be found at https://github.com/codeastar/kaggle_Titanic.