{"id":1240,"date":"2018-07-30T21:14:51","date_gmt":"2018-07-30T21:14:51","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=1240"},"modified":"2018-10-25T09:59:59","modified_gmt":"2018-10-25T09:59:59","slug":"blending-data-science-competition","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/blending-data-science-competition\/","title":{"rendered":"Blending, the Dark Side in Data Science Competition"},"content":{"rendered":"

In our past machine learning topic, “Ensemble Modeling<\/a>“, we mentioned how blending helps improve our predictions. Then in another topic, “Why are people frustrated on Kaggle\u2019s challenge?<\/a>“, we mentioned how blending can ruin a data science competition. So here is the question: is blending good or bad?<\/p>\n

<\/p>\n

Blending and Frustration<\/h3>\n

Technically, blending is good, and we proved it by improving our Iowa House Price<\/a> prediction. The technique itself is not the issue; the way people use it is. In the last TalkingData Click Fraud Detection<\/a> challenge, people spent days and nights on feature engineering and model research. They posted and shared their results as public kernels. Then some people, whom we call “blenders”, gathered other people’s hard-won results, applied the blending technique in 5 minutes, and got a better score. That earned them a higher ranking on the leaderboard too. If you were one of those hard-working developers and got out-ranked by blenders, you might well be frustrated.<\/p>\n

Get a better result by taking advantage of others<\/del><\/h3>\n

We should not abuse other people’s hard work, but we should understand how blenders do it. So we start our experiment in the House Price prediction challenge. First of all, we collect output files from the 7 best-scoring RMSD<\/a> public kernels. So we have:<\/p>\n

    \n
  1. stacking, MICE and brutal force<\/a>\u00a0 – 0.10985<\/li>\n
  2. Lasso model for regression problem<\/a> – 0.11365<\/li>\n
  3. House Price Prediction From Bangladesh<\/a> – 0.11416<\/li>\n
  4. All You Need is PCA<\/a> – 0.11421<\/li>\n
  5. Amit Choudhary’s Kernel Notebook-ified<\/a> – 0.11439<\/li>\n
  6. just NN use gluon<\/a> – 0.1148<\/li>\n
  7. My submission to predict sale price<\/a>\u00a0– 0.11533\n
    <\/div>\n
    Please note that besides selecting kernels by score, we tend to select kernels that use different models.<\/div>\n<\/li>\n<\/ol>\n

    Then we can download the output files from the above kernels, open our own kernel (or\u00a0Jupyter Notebook) and import those output files as our input (much like what we did in the CNN image recognizer<\/a> project: outputs from the previous layer become inputs of the next layer).<\/p>\n

    import pandas as pd\r\n\r\ndf_base_0 = pd.read_csv('..\/input\/stacking-mice-and-brutal-force-10985\/House_Prices_submit.csv',names=[\"Id\",\"SalePrice_0\"], skiprows=[0],header=None)\r\ndf_base_1 = pd.read_csv('..\/input\/lasso-11365\/lasso_sol22_Median.csv',names=[\"Id\",\"SalePrice_1\"], skiprows=[0],header=None)\r\ndf_base_2 = pd.read_csv('..\/input\/bangladesh-stack-11416\/submission (1).csv',names=[\"Id\",\"SalePrice_2\"], skiprows=[0],header=None)\r\ndf_base_3 = pd.read_csv('..\/input\/pca-11421\/submission (2).csv',names=[\"Id\",\"SalePrice_3\"], skiprows=[0],header=None)\r\ndf_base_4 = pd.read_csv('..\/input\/xgb-lasso-11439\/output.csv',names=[\"Id\",\"SalePrice_4\"], skiprows=[0],header=None)\r\ndf_base_5 = pd.read_csv('..\/input\/nn-1148\/submission (3).csv',names=[\"Id\",\"SalePrice_5\"], skiprows=[0],header=None)\r\ndf_base_6 = pd.read_csv('..\/input\/stack-xgb-lgb-11533\/submission_stacked.csv',names=[\"Id\",\"SalePrice_6\"], skiprows=[0],header=None)\r\n<\/pre>\n

    We now have 7 dataframes, each containing “Id” and “SalePrice” fields. Let's pick 2 of them, “df_base_0” and “df_base_5”, as examples:<\/p>\n

    \"dataframe<\/p>\n

    All of our dataframes share the same Id values but carry different SalePrice predictions, so we can merge them into a single dataframe, “df_base”, using Id as the key.<\/p>\n

    df_base = pd.merge(df_base_0,df_base_1,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_2,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_3,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_4,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_5,how='inner',on='Id')\r\ndf_base = pd.merge(df_base,df_base_6,how='inner',on='Id')\r\n<\/pre>\n

    Here it comes:<\/p>\n

    \"df<\/p>\n

    Instead of simply blending all the SalePrice columns by taking their mean, we can go one step further to get a better result.<\/p>\n
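For reference, that straightforward mean blend looks like this. A minimal sketch with made-up prices; `df_base` stands in for the merged dataframe built above:

```python
import pandas as pd

# Toy merged dataframe: an Id column plus one SalePrice column per kernel
df_base = pd.DataFrame({
    "Id": [1461, 1462],
    "SalePrice_0": [120000.0, 150000.0],
    "SalePrice_1": [122000.0, 148000.0],
    "SalePrice_2": [118000.0, 152000.0],
})

# A plain mean blend: average every SalePrice_* column row by row
df_mean = pd.DataFrame({
    "Id": df_base["Id"],
    "SalePrice": df_base.iloc[:, 1:].mean(axis=1),
})
# df_mean["SalePrice"] -> [120000.0, 150000.0]
```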

    Blend by Correlation<\/h3>\n

    In order to get a better result, we should blend outputs from different sources; that is why we intentionally picked output files from different models. We can also visualize how those output files differ from one another using an interactive<\/a> heatmap.<\/p>\n

    import plotly.graph_objs as go\r\nimport plotly.offline as py\r\npy.init_notebook_mode(connected=True)\r\n\r\ndata = [\r\n    go.Heatmap(\r\n        z = df_base.iloc[:,1:].corr().values,\r\n        x = df_base.iloc[:,1:].columns.values,\r\n        y = df_base.iloc[:,1:].columns.values,\r\n        colorscale='Earth')\r\n]\r\n\r\nlayout = go.Layout(\r\n    title ='Correlation of SalePrice',\r\n    xaxis = dict(ticks='outside', nticks=36),\r\n    yaxis = dict(ticks='outside' ),\r\n    width = 800, height = 700)\r\n\r\nfig = go.Figure(data=data, layout=layout)\r\npy.iplot(fig)\r\n<\/pre>\n
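The post does not show how the correlation matrix feeds into the final blend, but one common heuristic (an assumption here, not the author's code) is to give less-correlated submissions a larger weight, since they contribute more diverse information. A hedged sketch with toy data standing in for `df_base`:

```python
import pandas as pd

# Toy merged dataframe; in the kernel this is df_base from the merges above
df_base = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SalePrice_0": [100.0, 200.0, 300.0, 400.0],
    "SalePrice_1": [110.0, 190.0, 310.0, 390.0],
    "SalePrice_2": [400.0, 100.0, 250.0, 500.0],
})

preds = df_base.iloc[:, 1:]          # the SalePrice_* columns only
corr = preds.corr()                  # pairwise Pearson correlations

# Weight each submission by how little it correlates with the others:
# a lower average correlation earns a higher weight (a common heuristic,
# not taken from the original kernel)
avg_corr = corr.mean()
weights = 1.0 - avg_corr
weights = weights / weights.sum()    # normalize weights to sum to 1

# Weighted blend of the predictions
blended = preds.mul(weights, axis=1).sum(axis=1)
df_out = pd.DataFrame({"Id": df_base["Id"], "SalePrice": blended})
```

Under this scheme the two near-duplicate submissions split their influence while the more independent one keeps a larger share, which is the intuition behind blending diverse models in the first place.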