{"id":1240,"date":"2018-07-30T21:14:51","date_gmt":"2018-07-30T21:14:51","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=1240"},"modified":"2018-10-25T09:59:59","modified_gmt":"2018-10-25T09:59:59","slug":"blending-data-science-competition","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/blending-data-science-competition\/","title":{"rendered":"Blending, the Dark Side in Data Science Competition"},"content":{"rendered":"
In our past machine learning topic, “Ensemble Modeling<\/a>“, we mentioned how blending helps improve our predictions. Then in another topic, “Why are people frustrated on Kaggle\u2019s challenge?<\/a>“, we mentioned how blending ruins a data science competition. That raises a question: is blending good or bad?<\/p>\n <\/p>\n Technically, blending is good, and we proved it by improving our Iowa House Price<\/a> prediction. The technique itself is not the issue; the way people use it is. In the recent TalkingData Click Fraud Detection<\/a> challenge, people spent days and nights on feature engineering and model research. They posted and shared their results as public kernels. Then some people, let’s call them “blenders”, gathered other people’s hard-won results, applied the blending technique in 5 minutes and got a better result, plus a higher ranking on the leaderboard. If you were one of those hard-working developers and got out-ranked by blenders, you might well be frustrated.<\/p>\n We should not abuse other people’s hard work, but we should understand how blenders do what they do. So let’s run an experiment on the House Price prediction challenge. First of all, we collect output files from the 7 best RMSD<\/a> public kernels. So we have:<\/p>\n Then we can download the output files from the above kernels, open our own kernel (or\u00a0Jupyter Notebook) and import those output files as our input (much like what we did in the CNN image recognizer<\/a> project: outputs of the previous layer are inputs of the next layer).<\/p>\n We now have 7 dataframes, each containing “Id” and “SalePrice” fields. Let’s pick 2 of them, “df_base_0” and “df_base_5”, as examples:<\/p>\n All of our dataframes share the same Ids but have different SalePrices. 
So we can merge them into a single dataframe, “df_base”, using the Id as the key.<\/p>\n Here it comes:<\/p>\n Instead of simply blending all the SalePrices by taking their mean, we can go one step further to get a better result.<\/p>\n To get that better result, we should blend predictions from different sources. That is why we intentionally used output files from different models. We can also visualize how different those output files are from one another with an interactive<\/a> heatmap.<\/p>\nBlending and Frustration<\/h3>\n
Get a better result by taking advantage of others<\/del><\/h3>\n\n
import pandas as pd\r\n\r\ndf_base_0 = pd.read_csv('..\/input\/stacking-mice-and-brutal-force-10985\/House_Prices_submit.csv',names=[\"Id\",\"SalePrice_0\"], skiprows=[0],header=None)\r\ndf_base_1 = pd.read_csv('..\/input\/lasso-11365\/lasso_sol22_Median.csv',names=[\"Id\",\"SalePrice_1\"], skiprows=[0],header=None)\r\ndf_base_2 = pd.read_csv('..\/input\/bangladesh-stack-11416\/submission (1).csv',names=[\"Id\",\"SalePrice_2\"], skiprows=[0],header=None)\r\ndf_base_3 = pd.read_csv('..\/input\/pca-11421\/submission (2).csv',names=[\"Id\",\"SalePrice_3\"], skiprows=[0],header=None)\r\ndf_base_4 = pd.read_csv('..\/input\/xgb-lasso-11439\/output.csv',names=[\"Id\",\"SalePrice_4\"], skiprows=[0],header=None)\r\ndf_base_5 = pd.read_csv('..\/input\/nn-1148\/submission (3).csv',names=[\"Id\",\"SalePrice_5\"], skiprows=[0],header=None)\r\ndf_base_6 = pd.read_csv('..\/input\/stack-xgb-lgb-11533\/submission_stacked.csv',names=[\"Id\",\"SalePrice_6\"], skiprows=[0],header=None)\r\n<\/pre>\n
<\/p>\n
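To see that the two example dataframes really do share Ids while disagreeing on SalePrice, a quick check like the following works; the values here are made-up stand-ins for the real submission files loaded above:

```python
import pandas as pd

# Made-up stand-ins for two of the seven submission files;
# in the kernel these come from the read_csv calls above
df_base_0 = pd.DataFrame({"Id": [1461, 1462, 1463],
                          "SalePrice_0": [119000.0, 158000.0, 186000.0]})
df_base_5 = pd.DataFrame({"Id": [1461, 1462, 1463],
                          "SalePrice_5": [123500.0, 161200.0, 183900.0]})

# Same Ids, different SalePrice predictions
print(df_base_0.head())
print(df_base_5.head())
```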
df_base = pd.merge(df_base_0,df_base_1,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_2,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_3,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_4,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_5,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_6,how='inner',on='Id')\r\n<\/pre>\n
<\/p>\n
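For reference, the plain blend, just averaging every kernel's SalePrice, can be sketched like this; `df_base` below is a made-up three-column stand-in for the real seven-column merged dataframe, and the output filename is only an example:

```python
import pandas as pd

# Made-up stand-in for the merged df_base (the real one has 7 SalePrice columns)
df_base = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice_0": [120000.0, 160000.0, 185000.0],
    "SalePrice_1": [122000.0, 158000.0, 190000.0],
    "SalePrice_2": [118000.0, 162000.0, 188000.0],
})

# A plain blend: average every SalePrice_* column row by row
price_cols = [c for c in df_base.columns if c.startswith("SalePrice")]
df_blend = pd.DataFrame({
    "Id": df_base["Id"],
    "SalePrice": df_base[price_cols].mean(axis=1),
})
df_blend.to_csv("blend_mean.csv", index=False)  # ready-to-submit file
```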
Blend by Correlation<\/h3>\n
import plotly.graph_objs as go\r\nimport plotly.offline as py\r\npy.init_notebook_mode(connected=True)\r\n\r\ndata = [\r\n go.Heatmap(\r\n z = df_base.iloc[:,1:].corr().values,\r\n x = df_base.iloc[:,1:].columns.values,\r\n y = df_base.iloc[:,1:].columns.values,\r\n colorscale='Earth')\r\n]\r\n\r\nlayout = go.Layout(\r\n title ='Correlation of SalePrice',\r\n xaxis = dict(ticks='outside', nticks=36),\r\n yaxis = dict(ticks='outside' ),\r\n width = 800, height = 700)\r\n\r\nfig = go.Figure(data=data, layout=layout)\r\npy.iplot(fig)\r\n<\/pre>\n
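One way to act on that heatmap (a sketch of the idea, not necessarily the exact weighting any particular kernel used) is to give each prediction a weight based on how uncorrelated it is with the rest; `df_base` below is again a made-up stand-in for the real merged dataframe:

```python
import pandas as pd

# Made-up stand-in for df_base; the real one has 7 SalePrice columns
df_base = pd.DataFrame({
    "Id": [1461, 1462, 1463, 1464],
    "SalePrice_0": [120000.0, 160000.0, 185000.0, 90000.0],
    "SalePrice_1": [121000.0, 161000.0, 186000.0, 91000.0],   # nearly identical to _0
    "SalePrice_2": [110000.0, 170000.0, 175000.0, 105000.0],  # a more different model
})

price_cols = [c for c in df_base.columns if c.startswith("SalePrice")]
corr = df_base[price_cols].corr()

# Weight each column by how *uncorrelated* it is with the others on average
# (subtracting 1 drops the self-correlation), then normalise to sum to 1
avg_corr = (corr.sum() - 1) / (len(price_cols) - 1)
weights = 1 - avg_corr
weights = weights / weights.sum()

# Weighted blend: multiplication aligns the weights with the matching columns
df_blend = pd.DataFrame({
    "Id": df_base["Id"],
    "SalePrice": (df_base[price_cols] * weights).sum(axis=1),
})
```

Highly correlated submissions carry little new information, so down-weighting them is usually where the extra gain over a plain mean comes from.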