{"id":990,"date":"2018-05-04T21:46:15","date_gmt":"2018-05-04T21:46:15","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=990"},"modified":"2018-05-04T21:55:17","modified_gmt":"2018-05-04T21:55:17","slug":"click-fraud-detection","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/click-fraud-detection\/","title":{"rendered":"Click Fraud Detection with Machine Learning"},"content":{"rendered":"
Up to now, we have taken 3 different Kaggle journeys: the Titanic Survivors<\/a>, the Iowa House Prices<\/a> and the handwritten digits recognition<\/a>. Those journeys covered popular Machine Learning topics such as classification, regression and deep learning. I would suggest fans of Machine Learning start with those journeys to pick up the basics. After that, we can move one step further and try a real Kaggle competition: the TalkingData AdTracking<\/a> Fraud Detection Challenge.<\/p>\n <\/p>\n TalkingData is China’s largest big data service platform, covering 70% of active mobile devices nationwide (yeah, Big Brother is watching you :]] ). It handles 3 billion clicks a day, and 90% of them are potentially fraudulent. In case you don’t know, there are “click farms” in China which produce fake ratings and fake download numbers.<\/p>\n <\/p>\n (Click farm in China, image source: English Russia<\/a>)<\/p>\n This Kaggle challenge aims to tackle the click fraud issue, so our objective is clear: build a model that determines whether a click is fake or not.<\/p>\n Following our usual Data Science project routine, we read the data description, then load the training data and take a look at its content. Oh wait, we should know that we are handling “big data” here. Really big data, as the training dataset contains about 200 million records! You probably need a machine with at least 32GB of RAM to load the entire training csv file. But what if we don’t have such a machine? You can pay for Amazon Web Services cloud computing, or use a free Kaggle kernel. Data Science should belong to everyone, so we go for the free solution from Kaggle, which offers 17GB of RAM per kernel. That is still not enough to load the complete training dataset, but hey, we have other workarounds.<\/p>\n Although we cannot load the full training dataset, we can still load part of it. Assuming we are going to load the training data into “df_train<\/em>”, we can use pandas with the following options:<\/p>\n This loads 50 million rows [\u00a0nrows=50000000\u00a0<\/em>] out of the 200 million records, starting from the 125 millionth record [\u00a0skiprows=range(1,125000000)\u00a0<\/em>]. It takes around 3 minutes and 3GB of RAM to load the training data.<\/p>\n Let’s tell pandas which columns and data types to load:<\/p>\n And here is what we get:<\/p>\n It now takes only 1GB of RAM to load the data.<\/p>\n Another tip for a memory-tight environment: delete AND garbage collect every unused object. Although Python has automatic garbage collection, it only runs when the ratio of allocations to deallocations hits a threshold, so we can run garbage collection manually to free memory blocks right away.<\/p>\n <\/p>\n Now that we have the training data, it is time to run our EDA (exploratory data analysis) on the click industry. The challenge is about spotting fraudulent clicks, so the first thing I would like to know is: how “fake” are the clicks in the training data? In other words, what percentage of clicks lead to an app download?<\/p>\n <\/p>\n About 99.75% of clicks are fraudulent. (For lazy data scientists: just fill your output file with 0.0025 and the work is done :]] )<\/p>\n Let’s check out the number of unique values per feature in our training data:<\/p>\n <\/p>\n As expected, ip is the feature with the most unique values, while channel is one of the features with the fewest. 
So we can expect ‘channel’ to play a big part in our machine learning model.<\/p>\n Next, let’s take a look at the distribution of our features, starting with ‘channel’.<\/p>\n <\/p>\n After that, we apply the same function to the ‘ip’, ‘app’, ‘os’ and ‘device’ features.<\/p>\n We know certain values dominate a feature, but how do they contribute to the download rate? Let’s find out. This time, we look at the top 20 most-clicked values in each feature and see how they perform in terms of download rate, starting with the ‘device’ feature.<\/p>\n <\/p>\n Devices “1” and “2” account for over 99% of all devices, but their download rates are only around 0.0017 (0.17%) and 0.00029 (0.029%). Some uncommon devices reach 15%+ download rates, but they make up less than 1% of the data. (Lazy scientists part 2: let’s fill 95% of the output file with 0.0018 )<\/p>\n Again, we apply the same function to the ‘os’, ‘ip’, ‘channel’ and ‘app’ features.<\/p>\n <\/p>\n <\/p>\n <\/p>\n There are some findings from the charts above: Android OSes hardly make an impact on the download rate, which matches our observation in the top device click\/download rate chart. The top 20 ip clickers did download! Although their download rates are tiny, a download is a download; we can’t write them off just for being big clickers. For channel and app, the download rate does not correlate much with the number of clicks. Certain apps and channels simply out-download others, regardless of click volume.<\/p>\n Since there are massive numbers of clicks and the download rate is low in general, we make 2 assumptions: most of the fraudulent clicks come from click farms and click flooding rather than bots, and those operations run during regular office hours.<\/p>\n As of November 2017 (when the training data was recorded), click farms and click flooding were still the major players in click fraud (reference: The State of Mobile Fraud: Q1 2018<\/a>). Click farm companies would hire click farmers and run click flooding during regular working hours to keep costs down, i.e. around 0900 to 1800. We ignore the use of click bots in non-office hours, as bot usage in mobile fraud was still low in 2017. So let’s take a look at the click time distribution. But first, since the click time records in the training data are in UTC, we have to convert them to China time, i.e. GMT+8. After that, we round the click times to the nearest hour.<\/p>\n Then we get the click counts and download rates across the 24 hours of the day (a short sketch of this aggregation follows the time-conversion code below).<\/p>\n <\/p>\n According to an eMarketer report<\/a>, 2100 to 2359 should be the peak period for Chinese mobile users. Yet we see even more clicks than in those peak hours during 1200 to 1500, which matches our assumption that click farm companies operate during office hours. We may want to add a new feature, “farm_hour”, to flag time periods with higher click farming activity.<\/p>\n There is more than one way to do feature engineering for the TalkingData challenge. You may create a new feature that combines ip and device, or app and channel. Or create a feature based on a user’s next or previous click. After that, you can pick a learning model and predict your results (a rough sketch of such engineered features appears under “The Next Steps” below).<\/p>\n For starters, you can take a look at my Random Forest kernel here<\/a>. It won’t get you a high score on the leaderboard (use LightGBM if you are after a higher score), but it is easy to understand and straightforward. In case you don’t know, I am always a fan of the Random Forest model<\/a> :]] .<\/p>\n Just use what you have learnt to produce your results, and don’t be afraid of failing. Every time we fail, we get up and see what we have done wrong. 
Then we can learn from mistakes and become better and better.<\/p>\n <\/p>\n <\/p>\n <\/p>\n","protected":false},"excerpt":{"rendered":" Up to now, we have tried 3 different Kaggle journeys, the Titanic Survivors, the Iowa House Prices and the hand written digits recognition. Those journeys covered popular Machine Learning topics, such as classification, regression, deep learning, and so on. I would suggest fans of Machine Learning to start with those journeys. So we can learn […]<\/p>\n","protected":false},"author":1,"featured_media":1039,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_newsletter_tier_id":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[18],"tags":[21,19,74,75,30,22],"jetpack_publicize_connections":[],"yoast_head":"\nThe Happy “Farm”<\/h3>\n
Click Fraud Big Data<\/h3>\n
Take Big Part from Big Data<\/h3>\n
# skip the first 125 million data rows (the header row is kept), then read the next 50 million\r\ndf_train = pd.read_csv('..\/input\/train.csv', skiprows=range(1,125000000), nrows=50000000)\r\n<\/pre>\n
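Another possible workaround (just a sketch, not the approach used in this post) is to read the csv in chunks and keep a small random fraction of each chunk, so the sample spans the whole file instead of one contiguous block:<\/p>\n
# sketch: keep ~5% of every 10-million-row chunk; combine it with the dtype trick below to shrink each chunk further\r\nimport pandas as pd\r\n\r\nchunks = []\r\nfor chunk in pd.read_csv('..\/input\/train.csv', chunksize=10000000):\r\n    chunks.append(chunk.sample(frac=0.05, random_state=42))\r\ndf_train = pd.concat(chunks, ignore_index=True)\r\n<\/pre>\n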
df_train.info()<\/pre>\n
RangeIndex: 50000000 entries, 0 to 49999999\r\nData columns (total 8 columns):\r\nip int64\r\napp int64\r\ndevice int64\r\nos int64\r\nchannel int64\r\nclick_time object\r\nattributed_time object\r\nis_attributed int64\r\ndtypes: int64(6), object(2)\r\nmemory usage: 3.0+ GB\r\n<\/pre>\n
#columns and their data types to load\r\ndtypes = {\r\n 'ip' : 'uint32',\r\n 'app' : 'uint16',\r\n 'device' : 'uint16',\r\n 'os' : 'uint16',\r\n 'channel' : 'uint16',\r\n 'is_attributed' : 'uint8',\r\n 'click_id' : 'uint32',\r\n }\r\n\r\ncolumns = ['ip','app','device','os', 'channel', 'click_time', 'is_attributed']\r\n\r\ndf_train = pd.read_csv('..\/input\/train.csv', skiprows=range(1,125000000), nrows=50000000, dtype=dtypes, usecols=columns)<\/pre>\n
RangeIndex: 50000000 entries, 0 to 49999999\r\nData columns (total 7 columns):\r\nip uint32\r\napp uint16\r\ndevice uint16\r\nos uint16\r\nchannel uint16\r\nclick_time object\r\nis_attributed uint8\r\ndtypes: object(1), uint16(4), uint32(1), uint8(1)\r\nmemory usage: 1001.4+ MB<\/pre>\n
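If you want to verify the savings yourself, a quick check (not in the original kernel) of the per-column memory footprint looks like this:<\/p>\n
# per-column memory usage in megabytes, counting object (string) columns at their real size\r\nprint(df_train.memory_usage(deep=True) \/ 1024**2)\r\n<\/pre>\n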
“I will DELETE you!”<\/h3>\n
# drop the reference to the dataframe, then trigger a manual collection pass to release the memory\r\nimport gc\r\ndel df_train\r\ngc.collect()\r\n<\/pre>\n
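For reference, you can inspect those collection thresholds yourself; this snippet is only an illustration and is not part of the original kernel:<\/p>\n
import gc\r\n\r\n# (700, 10, 10) by default: a generation-0 collection only runs once allocations minus deallocations exceed 700 objects\r\nprint(gc.get_threshold())\r\n<\/pre>\n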
EDA on the Click Fraud data set<\/h3>\n
import seaborn as sns\r\n\r\n# share of clicks without \/ with an app download, in percent\r\ndownload_rate = df_train['is_attributed'].value_counts(normalize=True)*100\r\nax = sns.barplot(x=['Click without download', 'Click with download'], y=download_rate)\r\nfor p in ax.patches:\r\n    ax.annotate('{:.2f}%'.format(p.get_height()), (p.get_x()+p.get_width()\/3, p.get_height()+0.1))\r\n<\/pre>\n
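As a side note on the lazy-scientist joke above, a constant-prediction baseline would look roughly like this, assuming the competition’s usual click_id \/ is_attributed submission format:<\/p>\n
# the \"lazy\" baseline: predict the overall download rate for every click in the test set\r\ndf_test = pd.read_csv('..\/input\/test.csv', usecols=['click_id'])\r\ndf_test['is_attributed'] = 0.0025\r\ndf_test.to_csv('lazy_submission.csv', index=False)\r\n<\/pre>\n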
# number of unique values per feature (log scale, since ip dwarfs the others)\r\ncols = ['ip', 'app', 'device', 'os', 'channel']\r\nuniques = [len(df_train[col].unique()) for col in cols]\r\nax = sns.barplot(x=cols, y=uniques, log=True)\r\nfor p in ax.patches:\r\n    ax.annotate(p.get_height(), (p.get_x()+p.get_width()\/3, p.get_height()+0.1))\r\n<\/pre>\n
# plot the top 10 most-clicked values of a feature, annotated with their share of all clicks\r\ndef displayCountAndPercentage(df, groupby, countby):\r\n    counts = df[[groupby]+[countby]].groupby(groupby, as_index=False).count().sort_values(countby, ascending=False)\r\n    percentage = df[groupby].value_counts(normalize=True)*100\r\n    ax = sns.barplot(x=groupby, y=countby, data=counts[:10], order=counts[groupby][:10])\r\n    ax.set(ylabel='Number of clicks', title='Top 10 Clicks and Percentage of Feature: [{}]'.format(groupby))\r\n\r\n    i = 0\r\n    for p in ax.patches:\r\n        ax.annotate('{:.2f}%'.format(percentage.iloc[i]), (p.get_x()+p.get_width()\/3, p.get_height()+0.5))\r\n        i = i + 1\r\n\r\n    del counts, percentage\r\n    gc.collect()\r\n\r\ndisplayCountAndPercentage(df_train, 'channel', 'is_attributed')<\/pre>\n
\nThe plots for the ‘os’ and ‘device’ features are interesting, as those features are dominated by a few values. The ‘device’ feature is dominated by the encoded value ‘1’, which accounts for 94.56% of clicks. We are pretty sure that device ‘1’ is Android and device ‘2’ is iPhone, so we can also assume os ’19’, ’13’, ’17’, ’18’ and ’22’ are different Android OS versions.<\/p>\nRelationship between Number of Clicks and Download Rate<\/h3>\n
# plot the download rate (bars, left axis) and click count (line, right axis) for the `top` most-clicked values of a feature\r\ndef displayCountAndDownloadRate(df, groupby, countby, top):\r\n    counts = df[[groupby]+[countby]].groupby(groupby, as_index=False).count().sort_values(countby, ascending=False)\r\n    download_rates = df[[groupby]+[countby]].groupby(groupby, as_index=False).mean().sort_values(countby, ascending=False)\r\n    df_merge = counts.merge(download_rates, on=groupby, how='left')\r\n    del counts, download_rates\r\n    gc.collect()\r\n    df_merge.columns = [groupby, 'click_count', 'download_rate']\r\n    df_merge[groupby] = df_merge[groupby].astype('category')\r\n    ax = df_merge[:top].plot(x=groupby, y=\"download_rate\", legend=False, kind=\"bar\", color=\"orange\", label=\"download rate (left)\")\r\n    ax2 = ax.twinx()\r\n    df_merge[:top].plot(x=groupby, y=\"click_count\", ax=ax2, legend=False, kind=\"line\", color=\"blue\", label=\"click count (right)\")\r\n    ax.set_xticklabels(df_merge[groupby][:top])\r\n    ax.figure.legend(loc='upper left')\r\n    ax.set_title(\"Top {} Click Counts and Download Rates of [{}]\".format(top, groupby))\r\n    del df_merge\r\n    gc.collect()\r\n\r\ndisplayCountAndDownloadRate(df_train, 'device', 'is_attributed', 20)\r\n<\/pre>\n
Does Time matter?<\/h3>\n
\n
# click_time is recorded in UTC; shift it by 8 hours to get China time (GMT+8)\r\ndf_train['click_time_china'] = pd.to_datetime(df_train.click_time)\r\ndf_train['click_time_china'] += pd.to_timedelta(8, unit='h')\r\n# round each click to the nearest hour\r\ndf_train['click_time_hour'] = df_train['click_time_china'].dt.round('H')\r\n<\/pre>\n
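The aggregation itself is not shown in the post, but a minimal sketch of getting click counts and download rates per hour of day could look like this (hour_of_day is a column name introduced here for illustration):<\/p>\n
# group clicks by hour of day: the count gives click volume, the mean of is_attributed gives the download rate\r\ndf_train['hour_of_day'] = df_train['click_time_hour'].dt.hour\r\nhourly = df_train.groupby('hour_of_day')['is_attributed'].agg(['count', 'mean'])\r\nhourly.columns = ['click_count', 'download_rate']\r\nprint(hourly)\r\n<\/pre>\n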
The Next Steps<\/h3>\n
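To make the feature engineering ideas above concrete, here is a rough sketch; the column names (farm_hour, ip_device_clicks, next_click_gap) are made up for illustration and this is not the code behind my kernel:<\/p>\n
# flag clicks made during likely click-farm office hours (0900 to 1800 China time)\r\ndf_train['farm_hour'] = df_train['click_time_china'].dt.hour.between(9, 18).astype('uint8')\r\n\r\n# how many clicks each ip + device combination produced\r\ndf_train['ip_device_clicks'] = df_train.groupby(['ip', 'device'])['channel'].transform('count')\r\n\r\n# seconds until the next click from the same ip\r\ndf_train = df_train.sort_values(['ip', 'click_time_china'])\r\ngap = df_train.groupby('ip')['click_time_china'].shift(-1) - df_train['click_time_china']\r\ndf_train['next_click_gap'] = gap.dt.total_seconds()\r\n<\/pre>\n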
What have we learnt in this post?<\/h3>\n
\n