{"id":1191,"date":"2018-07-17T18:17:33","date_gmt":"2018-07-17T18:17:33","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=1191"},"modified":"2018-07-17T19:51:32","modified_gmt":"2018-07-17T19:51:32","slug":"tfidf-predict-deal-probability","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/","title":{"rendered":"TFIDF technique and Deal Probability Prediction (in Russian)"},"content":{"rendered":"

Our topic this time is so Russian that my water just turned into vodka :]] . Actually, this is all about a Kaggle competition: Avito Demand Prediction Challenge<\/a>. Avito is the biggest classified ads site in Russia, much like\u00a0Craigslist. Our mission this time is to predict the probability that a seller makes a deal on Avito, handling the text features with TF-IDF along the way.<\/p>\n

<\/p>\n

Let’s get started – Load & Know<\/h3>\n

As we did in previous machine learning challenges (Titanic Survivors<\/a>, Iowa House Pricing<\/a> and TalkingData Click Fraud<\/a>), the first thing to do after getting the dataset is to take a look at it. It may sound dumb, but it is the right thing to do: it confirms that we loaded the right dataset and tells us what the dataset is about.<\/p>\n

import pandas as pd\r\nimport gc\r\n\r\n# Parse activation_date as a datetime so the .dt accessors work later\r\ntrain_df = pd.read_csv(\"..\/input\/train.csv\", parse_dates=[\"activation_date\"])\r\ntrain_df.head()\r\n<\/pre>\n

\"\"<\/p>\n

Russian, Russian,\u00a0\u0440\u0443\u0301\u0441\u0441\u043a\u0438\u0439 \u044f\u0437\u044b\u0301\u043a! We also load “activation_date” as a date column, so we can apply date functions to it.<\/p>\n

The second thing we should do is find the missing data inside the dataset.<\/p>\n

import matplotlib.pyplot as plt\r\nimport seaborn as sns\r\n\r\n# Count nulls per column and keep only the columns that have any\r\ndf_missing = train_df.isnull().sum()\r\ndf_missing = df_missing[df_missing > 0]\r\n# Add the total row count as a reference bar (Series.append was removed in pandas 2.0)\r\ndf_missing = pd.concat([df_missing, pd.Series([train_df.shape[0]], index=['total_ads'])])\r\n\r\nplt.figure(figsize=(10,6))\r\nsns.set_style(\"darkgrid\")\r\nsns.barplot(x=df_missing.values, y=df_missing.index)\r\nplt.title(\"Features with Null value and Number of Total Ads\")\r\nplt.show()\r\n<\/pre>\n

\"Null<\/p>\n

This tells us whether we need to take action on features with null values. The chart shows several features with nulls. Null values in image, price and the optional parameters are understandable, but what about ads without a description? Let’s find out.<\/p>\n

# Mean deal probability: with description, without description, all ads\r\ndesc_dp = [train_df[train_df['description'].notna()]['deal_probability'].mean(), \r\n           train_df[train_df['description'].isna()]['deal_probability'].mean(), \r\n           train_df['deal_probability'].mean()]\r\nplt.title(\"Deal Probability with or without description\")\r\nax = sns.barplot(x=[\"With Description\", \"Without Description\", \"All Items\"], y=desc_dp)\r\n# Print each bar's value just above the bar\r\nfor p in ax.patches:\r\n    ax.annotate('{:.5f}'.format(p.get_height()), (p.get_x()+p.get_width()\/4, p.get_height()+.001))\r\nplt.show()\r\n<\/pre>\n

\"Deal<\/p>\n

It is clear that ads without a description get a lower deal probability on average. This also suggests that the description is a major feature in this challenge.<\/p>\n
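As a rough sketch of the two checks above (null counts per feature, and mean deal probability with and without a description), here is the same idea on a tiny toy data frame. All values are made up for illustration, and the `has_description` column is a hypothetical helper that is not part of the Avito dataset.

```python
import pandas as pd

# Toy stand-in for the training data; every value here is invented.
toy_df = pd.DataFrame({
    "description": ["almost new", None, "works fine", None],
    "price": [100.0, None, 250.0, 40.0],
    "deal_probability": [0.4, 0.1, 0.2, 0.0],
})

# Share of rows with a null in each column (only columns that have nulls).
null_share = toy_df.isnull().mean()
null_share = null_share[null_share > 0]

# Hypothetical helper column: True when the ad has a description.
toy_df["has_description"] = toy_df["description"].notna()

# Mean deal probability for ads with and without a description.
mean_by_flag = toy_df.groupby("has_description")["deal_probability"].mean()

print(null_share)
print(mean_by_flag)
```

Grouping by a boolean flag gives the same two averages as filtering with `notna()` and `isna()` separately, and the flag could later be kept as a model feature if it proves useful.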

Interact with charts<\/h3>\n

Since we loaded “activation_date” as a date column, we can create new date features from it.<\/p>\n

train_df['weekday'] = train_df.activation_date.dt.weekday\r\ntrain_df['day'] = train_df.activation_date.dt.day\r\n# dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number\r\ntrain_df['week'] = train_df.activation_date.dt.isocalendar().week\r\n<\/pre>\n
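To see what these date features look like, here is a minimal sketch on three made-up dates. It assumes a recent pandas version and uses `dt.isocalendar().week` for the week number, which is what newer pandas releases offer in place of `dt.week`.

```python
import pandas as pd

# Three invented activation dates: a Wednesday, a Sunday and a Monday.
toy_df = pd.DataFrame({
    "activation_date": pd.to_datetime(["2017-03-15", "2017-03-19", "2017-03-20"]),
})

toy_df["weekday"] = toy_df["activation_date"].dt.weekday  # Monday=0 ... Sunday=6
toy_df["day"] = toy_df["activation_date"].dt.day          # day of the month
# ISO week number, cast to a plain integer column.
toy_df["week"] = toy_df["activation_date"].dt.isocalendar().week.astype(int)

print(toy_df)
```

Note that the Sunday and the following Monday land in different ISO weeks, because ISO weeks start on Monday.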

We could create a static bar chart to show the number of ads by weekday, but this time we will make a chart that we can interact with.<\/p>\n
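The numbers behind such a chart are simply group sizes. A minimal sketch on invented weekday values, assuming the `weekday` column holds integers from 0 (Monday) to 6 (Sunday):

```python
import pandas as pd

# Invented weekday values for six ads.
toy_df = pd.DataFrame({"weekday": [0, 0, 2, 6, 2, 0]})

# Number of ads per weekday: the index becomes the x-axis, the values the bar heights.
ads_per_weekday = toy_df.groupby("weekday").size()

print(ads_per_weekday)
```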

Besides our usual way of creating charts with Matplotlib<\/a> and\u00a0Seaborn<\/a>, we can now create interactive charts with Plotly<\/a>.<\/p>\n

From our\u00a0Jupyter notebooks (or Kaggle’s kernels), let’s import Plotly’s libraries:<\/p>\n

import plotly.graph_objs as go\r\nimport plotly.offline as py\r\npy.init_notebook_mode(connected=True)\r\n<\/pre>\n

Please note that we initialize the “plotly.offline” module, so we can plot and save our charts locally. Then we fill in the data (number of ads by weekday) and define the chart’s layout:<\/p>\n

def generateBarChart(df, group_by, title, \r\n                     x_axis, y_axis, color=\"royalblue\", \r\n                     width=700, height=400, record_size=100): \r\n    df2 = df.groupby([group_by]).size()[:record_size]\r\n    \r\n    trace = go.Bar(\r\n            x = df2.index,\r\n            y = df2.values,\r\n            marker=dict(color = color)\r\n        )\r\n\r\n    layout = go.Layout(\r\n                title = title,\r\n                xaxis=dict(\r\n                  title=x_axis\r\n                ),\r\n                yaxis=dict(\r\n                  title=y_axis\r\n                ), \r\n                width = width, \r\n                height = height\r\n             )\r\n    data = [trace]\r\n    fig = go.Figure(data=data, layout=layout)\r\n    py.iplot(fig)\r\n    del df2;gc.collect()\r\n\r\ngenerateBarChart(train_df, \"weekday\", \"Number of Ads by Weekdays\", \r\n\"Weekdays\", \"Number of Ads\")\r\n<\/pre>\n

When we run the code above in a Jupyter notebook, an interactive chart will appear:
\n