{"id":1191,"date":"2018-07-17T18:17:33","date_gmt":"2018-07-17T18:17:33","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=1191"},"modified":"2018-07-17T19:51:32","modified_gmt":"2018-07-17T19:51:32","slug":"tfidf-predict-deal-probability","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/","title":{"rendered":"TFIDF technique and Deal Probability Prediction (in Russian)"},"content":{"rendered":"<p>Our topic this time is so Russian that my water just turned into vodka :]] . Actually, this is all about a Kaggle competition: <a href=\"https:\/\/www.kaggle.com\/c\/avito-demand-prediction\" target=\"_blank\" rel=\"noopener\">Avito Demand Prediction Challenge<\/a>. Avito is the biggest classified ads site in Mother Russia, much like Craigslist. Our mission this time is to predict the probability that a seller makes a deal on Avito, and to handle text features with TFIDF.<\/p>\n<p><!--more--><\/p>\n<h3>Let&#8217;s get started &#8211; Load &amp; Know<\/h3>\n<p>As we did in previous machine learning challenges (<a href=\"https:\/\/www.codeastar.com\/data-wrangling\/\">Titanic Survivors<\/a>, <a href=\"https:\/\/www.codeastar.com\/win-big-real-estate-market-data-science\/\">Iowa House Pricing<\/a> and <a href=\"https:\/\/www.codeastar.com\/click-fraud-detection\/\">TalkingData Click Fraud<\/a>), the first thing we should do after getting the dataset is take a look at it. Although it sounds trivial, it is the right thing to do. 
That way we can make sure we have loaded the right dataset and know what the dataset is about.<\/p>\n<pre lang=\"python\" line=\"1\">import pandas as pd\r\nimport gc\r\n\r\n#parse activation_date as a datetime column while loading\r\ntrain_df = pd.read_csv(\"..\/input\/train.csv\", parse_dates=[\"activation_date\"])\r\ntrain_df.head()\r\n<\/pre>\n<p><img data-attachment-id=\"1198\" data-permalink=\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/avito1\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?fit=841%2C346&amp;ssl=1\" data-orig-size=\"841,346\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"avito1\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?fit=300%2C123&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?fit=841%2C346&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-1198 size-full\" src=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?resize=841%2C346&#038;ssl=1\" alt=\"\" width=\"841\" height=\"346\" srcset=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?w=841&amp;ssl=1 841w, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?resize=300%2C123&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/avito1.png?resize=768%2C316&amp;ssl=1 768w\" sizes=\"(max-width: 841px) 100vw, 841px\" data-recalc-dims=\"1\" \/><\/p>\n<p>Russian, 
Russian,\u00a0\u0440\u0443\u0301\u0441\u0441\u043a\u0438\u0439 \u044f\u0437\u044b\u0301\u043a! We also load &#8220;activation_date&#8221; as a date column, so we can apply date functions to it.<\/p>\n<p>The second thing we should do is look for missing data in the dataset.<\/p>\n<pre lang=\"python\" line=\"1\">import matplotlib.pyplot as plt\r\nimport seaborn as sns\r\n\r\n#count the null values of each feature and keep only the features that have any\r\ndf_missing = train_df.isnull().sum()\r\ndf_missing = df_missing[df_missing &gt; 0]\r\n#Series.append was removed in pandas 2.0, so we use pd.concat to add the total row\r\ndf_missing = pd.concat([df_missing, pd.Series([train_df.shape[0]], index=['total_ads'])])\r\n\r\nplt.figure(figsize=(10,6))\r\nsns.set_style(\"darkgrid\")\r\nsns.barplot(x=df_missing.values, y=df_missing.index)\r\nplt.title(\"Features with Null value and Number of Total Ads\")\r\nplt.show()\r\n<\/pre>\n<p><img data-attachment-id=\"1202\" data-permalink=\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/null\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/null.png?fit=639%2C370&amp;ssl=1\" data-orig-size=\"639,370\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"null\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/null.png?fit=300%2C174&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/null.png?fit=639%2C370&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-1202 size-full\" src=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/null.png?resize=639%2C370&#038;ssl=1\" alt=\"Null counts in training dataset\" width=\"639\" height=\"370\" srcset=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/null.png?w=639&amp;ssl=1 639w, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/null.png?resize=300%2C174&amp;ssl=1 300w\" sizes=\"(max-width: 639px) 100vw, 639px\" data-recalc-dims=\"1\" \/><\/p>\n<p>Now we can decide whether we need to take action on the features with null values. From the chart above, we find several features with nulls. Null values in image, price and the optional parameters are understandable, but what about ads without a description? Let&#8217;s find out.<\/p>\n<pre lang=\"python\" line=\"1\">#mean deal probability of ads with a description, without one, and overall\r\ndesc_dp = [train_df[train_df['description'].notna()]['deal_probability'].mean(), \r\n           train_df[train_df['description'].isna()]['deal_probability'].mean(), \r\n           train_df['deal_probability'].mean()]\r\nplt.title(\"Deal Probability with or without description\")\r\nax = sns.barplot(x=[\"With Description\", \"Without Description\", \"All Items\"], y=desc_dp)\r\n#print the value on top of each bar\r\nfor p in ax.patches:\r\n   ax.annotate('{:.5f}'.format(p.get_height()), (p.get_x()+p.get_width()\/4, p.get_height()+.001))\r\nplt.show()\r\n<\/pre>\n<p><img data-attachment-id=\"1203\" data-permalink=\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/dp_desc\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/dp_desc.png?fit=374%2C261&amp;ssl=1\" data-orig-size=\"374,261\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"dp_desc\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/dp_desc.png?fit=300%2C209&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/dp_desc.png?fit=374%2C261&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-1203 size-full\" src=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/dp_desc.png?resize=374%2C261&#038;ssl=1\" alt=\"Deal probability with or without description\" width=\"374\" height=\"261\" srcset=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/dp_desc.png?w=374&amp;ssl=1 374w, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/dp_desc.png?resize=300%2C209&amp;ssl=1 300w\" sizes=\"(max-width: 374px) 100vw, 374px\" data-recalc-dims=\"1\" \/><\/p>\n<p>The chart shows that ads without a description have a lower average deal probability. So we can note that the description is a major feature in this challenge.<\/p>\n<h3>Interact with charts<\/h3>\n<p>Since we loaded &#8220;activation_date&#8221; as a date object, we can create new date features from it.<\/p>\n<pre lang=\"python\" line=\"1\">train_df['weekday'] = train_df.activation_date.dt.weekday    #Monday = 0, Sunday = 6\r\ntrain_df['day'] = train_df.activation_date.dt.day\r\n#dt.week was removed in pandas 2.0, isocalendar().week is its replacement\r\ntrain_df['week'] = train_df.activation_date.dt.isocalendar().week\r\n<\/pre>\n<p>We can create a bar chart to show the number of ads by weekday, but this time we make a chart that we can interact with.<\/p>\n<p>Besides our usual way of creating charts with <a href=\"https:\/\/matplotlib.org\/\" target=\"_blank\" rel=\"noopener\">Matplotlib<\/a> and <a href=\"https:\/\/seaborn.pydata.org\/\" target=\"_blank\" rel=\"noopener\">Seaborn<\/a>, we can now create interactive charts with <a href=\"https:\/\/plot.ly\/\" target=\"_blank\" rel=\"noopener\">Plotly<\/a>.<\/p>\n<p>From our Jupyter notebooks (or Kaggle&#8217;s kernels), let&#8217;s import Plotly&#8217;s libraries:<\/p>\n<pre lang=\"python\" line=\"1\">import plotly.graph_objs as go\r\nimport plotly.offline as py\r\npy.init_notebook_mode(connected=True)\r\n<\/pre>\n<p>Please note that we initialize the &#8220;plotly.offline&#8221; module, so we can plot and save our charts locally. Then we fill in the data (number of ads by weekday) and define the chart&#8217;s layout:<\/p>\n<pre lang=\"python\" line=\"1\">def generateBarChart(df, group_by, title, \r\n                     x_axis, y_axis, color=\"royalblue\", \r\n                     width=700, height=400, record_size=100): \r\n    #count the records of each group and keep the first record_size groups\r\n    df2 = df.groupby([group_by]).size()[:record_size]\r\n    \r\n    trace = go.Bar(\r\n            x = df2.index,\r\n            y = df2.values,\r\n            marker=dict(color = color)\r\n        )\r\n\r\n    layout = go.Layout(\r\n                title = title,\r\n                xaxis=dict(\r\n                  title=x_axis\r\n                ),\r\n                yaxis=dict(\r\n                  title=y_axis\r\n                ), \r\n                width = width, \r\n                height = height\r\n             )\r\n    data = [trace]\r\n    fig = go.Figure(data=data, layout=layout)\r\n    py.iplot(fig)\r\n    #free the memory used by the temporary series\r\n    del df2;gc.collect()\r\n\r\ngenerateBarChart(train_df, \"weekday\", \"Number of Ads by Weekdays\", \r\n\"Weekdays\", \"Number of Ads\")\r\n<\/pre>\n<p>When we run the code above in a Jupyter notebook, an interactive chart appears:<br \/>\n<iframe loading=\"lazy\" src=\"\/\/plot.ly\/~codeastar\/3.embed\" width=\"700\" height=\"400\" frameborder=\"0\" scrolling=\"no\"><\/iframe><\/p>\n<p>We can get more detail from the bars by hovering over any one of them. 
We find that Sunday (&#8220;6&#8221; in the bar chart, as pandas numbers weekdays from Monday = 0) has the most ads posted, while Saturday (&#8220;5&#8221; in the chart) has the fewest.<\/p>\n<h3>Handle the Text Features<\/h3>\n<p>Since we know that the description is an important feature affecting the deal probability, we would like to deal with it. Unlike simple text features such as location and category, which we can encode into numeric values, the description has a complex structure and a lot of variety. Although we can&#8217;t handle the description feature that simply, it doesn&#8217;t mean we can&#8217;t handle it at all.<\/p>\n<p>First, we can count the words in the description of each ad.<\/p>\n<pre lang=\"python\" line=\"1\">train_df['description'] = train_df['description'].fillna(\" \")\r\ntrain_df['description_len'] = train_df['description'].apply(lambda x : len(x.split()))\r\n<\/pre>\n<p>Then we can have a look at ads with 0 to 50 words in the description.<\/p>\n<pre lang=\"python\" line=\"1\">generateBarChart(train_df, \"description_len\", \"Number of Ads for Description with (0 - 50) words\", \r\n                    \"Description length\", \"Number of Ads\", color=\"lightgreen\", \r\n                width=800, height=500, record_size=50)\r\n<\/pre>\n<p><iframe loading=\"lazy\" src=\"\/\/plot.ly\/~codeastar\/5.embed\" width=\"800\" height=\"500\" frameborder=\"0\" scrolling=\"no\"><\/iframe><\/p>\n<p>Apart from ads with no description, ads tend to have 3 to 11 words in their description.<\/p>\n<p>Besides the word count, we can also create new textual features, like character count, punctuation count and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stop_words\" target=\"_blank\" rel=\"noopener\">stop words<\/a> count.<\/p>\n<pre lang=\"python\" line=\"1\">import string\r\nfrom nltk.corpus import stopwords    #needs nltk.download('stopwords') once\r\n\r\npunctuation = string.punctuation\r\nstop_words =  stopwords.words('russian')    #Avito is a Russian classified site\r\n#count the punctuation characters in each description\r\ntrain_df['punctuation_count'] = train_df['description'].apply(lambda x: len(\"\".join(_ for _ in x if _ in punctuation))) \r\n#count the words of each description that are Russian stop words\r\ntrain_df['stopword_count'] = train_df['description'].apply(lambda x: len([wrd for wrd in x.split() if wrd.lower() in stop_words]))\r\n<\/pre>\n<p>We can apply the same techniques to the &#8220;title&#8221;, &#8220;param_1&#8221;, &#8220;param_2&#8221; and &#8220;param_3&#8221; features. Then we can encode other categorical features like &#8220;region&#8221;, &#8220;city&#8221;, &#8220;parent_category_name&#8221; and &#8220;category_name&#8221; into numerical features.<\/p>\n<pre lang=\"python\" line=\"1\">from sklearn import preprocessing\r\n\r\n#test_df is assumed to be loaded from test.csv in the same way as train_df\r\ncat_vars = [\"user_id\", \"region\", \"city\", \"parent_category_name\", \"category_name\", \"user_type\", \"param_1\", \"param_2\", \"param_3\"]\r\n\r\nfor col in cat_vars:\r\n    lb = preprocessing.LabelEncoder()\r\n    #fit on train and test values together, so both share the same codes\r\n    lb.fit(list(train_df[col].values.astype('str')) + list(test_df[col].values.astype('str')))\r\n    train_df[col+'_code'] = lb.transform(list(train_df[col].values.astype('str')))\r\n    test_df[col+'_code'] = lb.transform(list(test_df[col].values.astype('str')))\r\n<\/pre>\n<p>Once all categorical features have been converted to numerical ones, it is time to remove the original categorical columns. Before doing this, let&#8217;s keep the &#8220;description&#8221; feature in a separate variable, &#8220;train_desc&#8221;. 
(We will explain why later.)<\/p>\n<pre lang=\"python\" line=\"1\">train_desc = train_df[\"description\"]\r\n\r\ncols_to_drop = [\"item_id\", \"user_id\", \"region\", \"city\", \"parent_category_name\",  \r\n                \"category_name\", \"param_1\", \"param_2\", \"param_3\", \"title\",\r\n                \"description\", \"activation_date\", \"user_type\", \"image\", \"param_combined\"]            \r\n#'param_combined' is not created in the code shown in this post, so errors='ignore' skips it when it is absent\r\ntrain_df = train_df.drop(cols_to_drop, axis=1, errors=\"ignore\")\r\n<\/pre>\n<p>Now that all features are numerical, the next thing we want to know is which features matter for the deal probability. We can look for the answer with a correlation heatmap.<\/p>\n<pre lang=\"python\" line=\"1\">data = [\r\n    go.Heatmap(\r\n        z = train_df.corr().values,\r\n        x = train_df.columns.values,\r\n        y = train_df.columns.values,\r\n        colorscale='Blackbody')\r\n]\r\n\r\nlayout = go.Layout(\r\n    title ='Correlation of Features',\r\n    xaxis = dict(ticks='outside'),\r\n    yaxis = dict(ticks='outside' ),\r\n    width = 800, height = 700)\r\n\r\nfig = go.Figure(data=data, layout=layout)\r\npy.iplot(fig)\r\n<\/pre>\n<p>Here we go:<br \/>\n<iframe loading=\"lazy\" src=\"\/\/plot.ly\/~codeastar\/7.embed\" width=\"800\" height=\"700\" frameborder=\"0\" scrolling=\"no\"><\/iframe><br \/>\nWe know that &#8220;description&#8221; is an important feature, but its length and stop word features show little correlation with the deal probability. 
On the other hand, the encoded textual features, &#8220;param_1_code&#8221;, &#8220;param_2_code&#8221;, &#8220;param_3_code&#8221; and their related features have a stronger correlation with the deal probability. It would be good to have a way to encode the content of &#8220;description&#8221;, so we have&#8230;<\/p>\n<h3>TF;IDF<\/h3>\n<p>No, it is not tl;dr. TFIDF stands for <em>Term Frequency\u2013Inverse Document Frequency<\/em>. It is a weighting technique commonly used in information retrieval and text mining.<\/p>\n<ul>\n<li>TF (Term Frequency) is simply the relative frequency of a term in a document, i.e. TF = (number of times term T appears in a document) \/ (total number of terms in the document)<\/li>\n<li>IDF (Inverse Document Frequency) measures a term&#8217;s specificity across all documents, i.e. IDF = log(total number of documents \/ number of documents containing term T)<\/li>\n<\/ul>\n<p>For example, we pick a post from this web site and find the term &#8220;code&#8221; appearing 8 times in a post with 1000 terms, i.e. we have TF = 8 \/ 1000 = 0.008. Then we have 50 posts, 25 of which contain the term &#8220;code&#8221;, so with a base-10 logarithm we have IDF = log(50 \/ 25) = 0.301. In the end, the weighting of &#8220;code&#8221; is TF x IDF = 0.008 x 0.301 = 0.002408.<\/p>\n<p>So our mission is to find the TFIDF values of the words inside the textual features (in our case, the &#8220;description&#8221;, &#8220;title&#8221; and &#8220;param&#8221; features).<\/p>\n<h3>TFIDF Transformer<\/h3>\n<p>First, as usual, we import the required modules for TFIDF.<\/p>\n<pre lang=\"python\" line=\"1\">from sklearn.feature_extraction.text import TfidfVectorizer\r\nfrom scipy.sparse import hstack, csr_matrix\r\n<\/pre>\n<p>We have &#8220;TfidfVectorizer&#8221;, the transformer we need to get TFIDF values from textual features. 
Then we can assign TFIDF parameters to it and start training it.<\/p>\n<pre lang=\"python\" line=\"1\">import numpy as np\r\n\r\ntfidf_para = {\r\n    \"stop_words\": stop_words,\r\n    \"analyzer\": 'word',   #analyzer can be 'word', 'char' or 'char_wb' \r\n    \"token_pattern\": r'\\w{1,}',    #match any word of 1 or more characters \r\n    \"sublinear_tf\": True,    #apply sublinear tf scaling, i.e. replace tf with 1 + log(tf)\r\n    \"dtype\": np.float32,   #return data type \r\n    \"norm\": 'l2',     #apply l2 normalization\r\n    \"smooth_idf\":False,   #do not add one to document frequencies (smoothing is meant to avoid zero divisions)\r\n    \"ngram_range\" : (1, 2),   #the min and max size of tokenized terms\r\n    \"max_features\": 17000    #keep the 17000 most frequent terms across the corpus\r\n}\r\n\r\ntfidf_vect = TfidfVectorizer(**tfidf_para)\r\n<\/pre>\n<p>Do you remember the &#8220;train_desc&#8221; series that we kept a few moments ago? It is time to transform it into TFIDF values.<\/p>\n<pre lang=\"python\" line=\"1\">tfidf_vect.fit(train_desc)\r\n\r\ntransformed_txt = tfidf_vect.transform(train_desc)\r\n<\/pre>\n<p>We have &#8220;transformed_txt&#8221;, but what is it actually? It is a matrix storing the TFIDF values of the 17000 terms (the &#8220;max_features&#8221; we set previously) that we gathered from the &#8220;description&#8221; feature. 
Let&#8217;s have a sneak peek at the TFIDF features.<\/p>\n<p><img data-attachment-id=\"1224\" data-permalink=\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/tfidf_feat\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?fit=816%2C400&amp;ssl=1\" data-orig-size=\"816,400\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"tfidf_feat\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?fit=300%2C147&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?fit=816%2C400&amp;ssl=1\" decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-1224 size-full\" src=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?resize=816%2C400&#038;ssl=1\" alt=\"tfidf features\" width=\"816\" height=\"400\" srcset=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?w=816&amp;ssl=1 816w, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?resize=300%2C147&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/tfidf_feat.png?resize=768%2C376&amp;ssl=1 768w\" sizes=\"(max-width: 816px) 100vw, 816px\" data-recalc-dims=\"1\" \/><\/p>\n<p>Each of those 17000 features is made of 1 or 2 words (from the &#8220;ngram_range&#8221; setting) and has a different TFIDF value in each record. 
We can now integrate the training dataframe &#8220;train_df&#8221; with the TFIDF values in &#8220;transformed_txt&#8221;. Since &#8220;transformed_txt&#8221; is stored as a compressed sparse row matrix (CSR matrix), we have to convert &#8220;train_df&#8221; to a CSR matrix and then merge it with &#8220;transformed_txt&#8221;.<\/p>\n<pre lang=\"python\" line=\"1\">from scipy.sparse import hstack, csr_matrix\r\n\r\n#the terms kept by the vectorizer; get_feature_names_out() needs scikit-learn 1.0+, older versions use get_feature_names()\r\ntfidf_features = tfidf_vect.get_feature_names_out().tolist()\r\n\r\ncombined_train = hstack([csr_matrix(train_df.values),transformed_txt])\r\ncombined_feat = train_df.columns.tolist() + tfidf_features\r\n\r\nprint(combined_train.shape)<\/pre>\n<p>After the integration, we have a sparse matrix with 17025 features (the original 25 features plus the 17000 TFIDF features).<\/p>\n<pre>(1503424, 17025)\r\n<\/pre>\n<p>Then we can use <a href=\"https:\/\/www.codeastar.com\/lgb-winning-gradient-boosting-model\/\">LGB<\/a> to fit and predict the deal probability.<\/p>\n<pre lang=\"python\" line=\"1\">import lightgbm as lgb\r\n\r\nlgb_train = lgb.Dataset(combined_train, train_df.deal_probability,\r\n                    feature_name=combined_feat)\r\n\r\n#lgb_para is a dict of LightGBM parameters, it is not shown in this post\r\n#verbose_eval works on LightGBM versions before 4.0, newer versions use callbacks instead\r\nlgb_classifier = lgb.train(\r\n        lgb_para,\r\n        lgb_train,\r\n        num_boost_round=2500,\r\n        verbose_eval=100\r\n    )\r\n\r\n#test_df has to go through the same feature and TFIDF steps as the training data first\r\nlgb_prediction = lgb_classifier.predict(test_df)\r\n<\/pre>\n<p>Please note that we only deal with one of the textual features (&#8220;description&#8221;) and limit the number of TFIDF features to 17000. That is just barely enough to run on a Kaggle kernel with 17 GB of RAM using LGB. So if we want more accurate results, we need to invest more in our platform.<\/p>\n<h3>What have we learnt in this post?<\/h3>\n<ol>\n<li>Usage of interactive Plotly charts<\/li>\n<li>TFIDF handling of textual features<\/li>\n<li>The limitations of the free Kaggle kernel<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Our topic this time is too Russia, so my water just turned into Vodka :]] . Actually, this is all about Kaggle&#8217;s competition: Avito Demand Prediction Challenge. 
Avito is the biggest classified site in the Mother Russia, just likes the\u00a0Craigslist. Our mission for this time is, predicting the rate that a seller can make a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1237,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_newsletter_tier_id":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[18],"tags":[91,90,93,30,22,92,89],"jetpack_publicize_connections":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.8.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>TFIDF technique and Deal Probability Prediction (in Russian) &#8902; Code A Star<\/title>\n<meta name=\"description\" content=\"Our topic this time is predicting the probability that a seller can make a deal on Avito 
(Russia&#039;s Craigslist) and handle text features with TFIDF.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"TFIDF technique and Deal Probability Prediction (in Russian) &#8902; Code A Star\" \/>\n<meta property=\"og:description\" content=\"Our topic this time is predicting the probability that a seller can make a deal on Avito (Russia&#039;s Craigslist) and handle text features with TFIDF.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\" \/>\n<meta property=\"og:site_name\" content=\"Code A Star\" \/>\n<meta property=\"article:publisher\" content=\"codeastar\" \/>\n<meta property=\"article:author\" content=\"codeastar\" \/>\n<meta property=\"article:published_time\" content=\"2018-07-17T18:17:33+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-07-17T19:51:32+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/russia-1.png?fit=1000%2C406&ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"406\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Raven Hon\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@codeastar\" \/>\n<meta name=\"twitter:site\" content=\"@codeastar\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Raven Hon\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\"},\"author\":{\"name\":\"Raven Hon\",\"@id\":\"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd\"},\"headline\":\"TFIDF technique and Deal Probability Prediction (in Russian)\",\"datePublished\":\"2018-07-17T18:17:33+00:00\",\"dateModified\":\"2018-07-17T19:51:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\"},\"wordCount\":1142,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd\"},\"keywords\":[\"avito\",\"deal probability\",\"heatmap\",\"Kaggle\",\"Machine Learning\",\"plotly\",\"TFIDF\"],\"articleSection\":[\"Learn Machine Learning\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\",\"url\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\",\"name\":\"TFIDF technique and Deal Probability Prediction (in Russian) &#8902; Code A Star\",\"isPartOf\":{\"@id\":\"https:\/\/www.codeastar.com\/#website\"},\"datePublished\":\"2018-07-17T18:17:33+00:00\",\"dateModified\":\"2018-07-17T19:51:32+00:00\",\"description\":\"Our topic this time is predicting the probability that a seller can make a deal on Avito (Russia's Craigslist) and handle text features with 
TFIDF.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.codeastar.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"TFIDF technique and Deal Probability Prediction (in Russian)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.codeastar.com\/#website\",\"url\":\"https:\/\/www.codeastar.com\/\",\"name\":\"Code A Star\",\"description\":\"We don&#039;t wish upon a star, we code a star\",\"publisher\":{\"@id\":\"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.codeastar.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd\",\"name\":\"Raven Hon\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.codeastar.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/08\/logo70.png?fit=70%2C70&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/08\/logo70.png?fit=70%2C70&ssl=1\",\"width\":70,\"height\":70,\"caption\":\"Raven Hon\"},\"logo\":{\"@id\":\"https:\/\/www.codeastar.com\/#\/schema\/person\/image\/\"},\"description\":\"Raven Hon is\u00a0a 20 years+ veteran in information technology industry who has worked on various projects from console, web, game, banking and mobile 
applications in different sized companies.\",\"sameAs\":[\"https:\/\/www.codeastar.com\",\"codeastar\",\"https:\/\/twitter.com\/codeastar\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"TFIDF technique and Deal Probability Prediction (in Russian) &#8902; Code A Star","description":"Our topic this time is predicting the probability that a seller can make a deal on Avito (Russia's Craigslist) and handle text features with TFIDF.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/","og_locale":"en_US","og_type":"article","og_title":"TFIDF technique and Deal Probability Prediction (in Russian) &#8902; Code A Star","og_description":"Our topic this time is predicting the probability that a seller can make a deal on Avito (Russia's Craigslist) and handle text features with TFIDF.","og_url":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/","og_site_name":"Code A Star","article_publisher":"codeastar","article_author":"codeastar","article_published_time":"2018-07-17T18:17:33+00:00","article_modified_time":"2018-07-17T19:51:32+00:00","og_image":[{"width":1000,"height":406,"url":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/russia-1.png?fit=1000%2C406&ssl=1","type":"image\/png"}],"author":"Raven Hon","twitter_card":"summary_large_image","twitter_creator":"@codeastar","twitter_site":"@codeastar","twitter_misc":{"Written by":"Raven Hon","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#article","isPartOf":{"@id":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/"},"author":{"name":"Raven Hon","@id":"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd"},"headline":"TFIDF technique and Deal Probability Prediction (in Russian)","datePublished":"2018-07-17T18:17:33+00:00","dateModified":"2018-07-17T19:51:32+00:00","mainEntityOfPage":{"@id":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/"},"wordCount":1142,"commentCount":1,"publisher":{"@id":"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd"},"keywords":["avito","deal probability","heatmap","Kaggle","Machine Learning","plotly","TFIDF"],"articleSection":["Learn Machine Learning"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/","url":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/","name":"TFIDF technique and Deal Probability Prediction (in Russian) &#8902; Code A Star","isPartOf":{"@id":"https:\/\/www.codeastar.com\/#website"},"datePublished":"2018-07-17T18:17:33+00:00","dateModified":"2018-07-17T19:51:32+00:00","description":"Our topic this time is predicting the probability that a seller can make a deal on Avito (Russia's Craigslist) and handle text features with 
TFIDF.","breadcrumb":{"@id":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.codeastar.com\/tfidf-predict-deal-probability\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.codeastar.com\/"},{"@type":"ListItem","position":2,"name":"TFIDF technique and Deal Probability Prediction (in Russian)"}]},{"@type":"WebSite","@id":"https:\/\/www.codeastar.com\/#website","url":"https:\/\/www.codeastar.com\/","name":"Code A Star","description":"We don&#039;t wish upon a star, we code a star","publisher":{"@id":"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.codeastar.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.codeastar.com\/#\/schema\/person\/832d202eb92a3d430097e88c6d0550bd","name":"Raven Hon","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.codeastar.com\/#\/schema\/person\/image\/","url":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/08\/logo70.png?fit=70%2C70&ssl=1","contentUrl":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/08\/logo70.png?fit=70%2C70&ssl=1","width":70,"height":70,"caption":"Raven Hon"},"logo":{"@id":"https:\/\/www.codeastar.com\/#\/schema\/person\/image\/"},"description":"Raven Hon is\u00a0a 20 years+ veteran in information technology industry who has worked on various projects from console, web, game, banking and mobile applications in different sized 
companies.","sameAs":["https:\/\/www.codeastar.com","codeastar","https:\/\/twitter.com\/codeastar"]}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/07\/russia-1.png?fit=1000%2C406&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8PcRO-jd","jetpack-related-posts":[{"id":1487,"url":"https:\/\/www.codeastar.com\/revenue-prediction-google-store\/","url_meta":{"origin":1191,"position":0},"title":"Revenue Prediction in Google Store","author":"Raven Hon","date":"November 21, 2018","format":false,"excerpt":"Every business owner wants to make revenue prediction, so he or she can have better marketing decisions. On Kaggle, the data science community site, there is a challenge on making a store's revenue prediction. And that is the topic we are looking for. The store in this challenge is none\u2026","rel":"","context":"In &quot;Learn Machine Learning&quot;","block_context":{"text":"Learn Machine Learning","link":"https:\/\/www.codeastar.com\/category\/machine-learning\/"},"img":{"alt_text":"GStore Revenue Prediction","src":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/11\/gstore.png?fit=800%2C429&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/11\/gstore.png?fit=800%2C429&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/11\/gstore.png?fit=800%2C429&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/11\/gstore.png?fit=800%2C429&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":1066,"url":"https:\/\/www.codeastar.com\/bartener-machine-learning\/","url_meta":{"origin":1191,"position":1},"title":"&#8220;Do you have a dog?&#8221; explained in Machine Learning","author":"Raven Hon","date":"May 19, 2018","format":false,"excerpt":"You have probably read the above comic in 9gag or imgur before. 
It is a funny joke, but on the other hand, it is also a material for our Machine Learning topic. It sounds weird? Oh yeah, sometimes knowledge comes from strange ideas. The Comic Here is the comic, for\u2026","rel":"","context":"In &quot;Learn Machine Learning&quot;","block_context":{"text":"Learn Machine Learning","link":"https:\/\/www.codeastar.com\/category\/machine-learning\/"},"img":{"alt_text":"\"Do you have a dog?\" in Machine Learning","src":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/05\/dyhad.png?fit=377%2C221&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1631,"url":"https:\/\/www.codeastar.com\/get-rich-stock-trading-machine-learning\/","url_meta":{"origin":1191,"position":2},"title":"Stock Trading with Machine Learning and Get Rich","author":"Raven Hon","date":"January 16, 2019","format":false,"excerpt":"Okay, I admit it, it looks like a clickbait headline :]] (yes, we did the similar thing before :]] ). But this is not a clickbait at all, as we are actually discussing this topic this time. 
There is a Kaggle's challenge on predicting stock trading trend, which is a\u2026","rel":"","context":"In &quot;Learn Machine Learning&quot;","block_context":{"text":"Learn Machine Learning","link":"https:\/\/www.codeastar.com\/category\/machine-learning\/"},"img":{"alt_text":"Stock Trading with Machine Learning","src":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/01\/stock1.png?fit=800%2C432&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/01\/stock1.png?fit=800%2C432&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/01\/stock1.png?fit=800%2C432&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/01\/stock1.png?fit=800%2C432&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":1040,"url":"https:\/\/www.codeastar.com\/lgb-winning-gradient-boosting-model\/","url_meta":{"origin":1191,"position":3},"title":"LGB, the winning Gradient Boosting model","author":"Raven Hon","date":"June 1, 2018","format":false,"excerpt":"Last time, we tried the Kaggle's TalkingData Click Fraud Detection challenge. And we used limited resources to handle a 200 million records sized\u00a0dataset. 
Although we can make our classification with Random Forest model, we still want a better scoring result.\u00a0 Inside the Click Fraud Detection challenge's leaderboard, I find that\u2026","rel":"","context":"In &quot;Learn Machine Learning&quot;","block_context":{"text":"Learn Machine Learning","link":"https:\/\/www.codeastar.com\/category\/machine-learning\/"},"img":{"alt_text":"Gradient Boosting","src":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/gradient.png?fit=1033%2C608&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/gradient.png?fit=1033%2C608&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/gradient.png?fit=1033%2C608&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/06\/gradient.png?fit=1033%2C608&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":2027,"url":"https:\/\/www.codeastar.com\/save-and-load-your-rnn-model\/","url_meta":{"origin":1191,"position":4},"title":"Save and Load your RNN model","author":"Raven Hon","date":"June 25, 2019","format":false,"excerpt":"In this blog, we tasted different kinds of machine learning projects so far. Our projects included prediction on stock price, image recognizer on hand writing, NLP on comment classification and others. There was one thing in common --- we used long time to train a model. 
It is okay to\u2026","rel":"","context":"In &quot;Learn Machine Learning&quot;","block_context":{"text":"Learn Machine Learning","link":"https:\/\/www.codeastar.com\/category\/machine-learning\/"},"img":{"alt_text":"Save and Load Model","src":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/06\/save_n_load.png?fit=1100%2C400&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/06\/save_n_load.png?fit=1100%2C400&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/06\/save_n_load.png?fit=1100%2C400&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/06\/save_n_load.png?fit=1100%2C400&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2019\/06\/save_n_load.png?fit=1100%2C400&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":764,"url":"https:\/\/www.codeastar.com\/convolutional-neural-network-python\/","url_meta":{"origin":1191,"position":5},"title":"Python Image Recognizer with Convolutional Neural Network","author":"Raven Hon","date":"February 11, 2018","format":false,"excerpt":"On our data science journey, we have solved classification and regression problems. What's next? There is one popular machine learning territory we have not set feet on yet --- the image recognition. 
But now the wait is over, in this post we are going to teach our machine to recognize\u2026","rel":"","context":"In &quot;Learn Machine Learning&quot;","block_context":{"text":"Learn Machine Learning","link":"https:\/\/www.codeastar.com\/category\/machine-learning\/"},"img":{"alt_text":"Teach our machine with Convolutional Neural Network","src":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/02\/learning.png?fit=1052%2C744&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/02\/learning.png?fit=1052%2C744&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/02\/learning.png?fit=1052%2C744&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/02\/learning.png?fit=1052%2C744&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.codeastar.com\/wp-content\/uploads\/2018\/02\/learning.png?fit=1052%2C744&ssl=1&resize=1050%2C600 
3x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/posts\/1191"}],"collection":[{"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/comments?post=1191"}],"version-history":[{"count":35,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/posts\/1191\/revisions"}],"predecessor-version":[{"id":1236,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/posts\/1191\/revisions\/1236"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/media\/1237"}],"wp:attachment":[{"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/media?parent=1191"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/categories?post=1191"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.codeastar.com\/wp-json\/wp\/v2\/tags?post=1191"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}