{"id":990,"date":"2018-05-04T21:46:15","date_gmt":"2018-05-04T21:46:15","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=990"},"modified":"2018-05-04T21:55:17","modified_gmt":"2018-05-04T21:55:17","slug":"click-fraud-detection","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/click-fraud-detection\/","title":{"rendered":"Click Fraud Detection with Machine Learning"},"content":{"rendered":"
Up to now, we have taken 3 different Kaggle journeys: the Titanic Survivors<\/a>, the Iowa House Prices<\/a> and the handwritten digits recognition<\/a>. Those journeys covered popular Machine Learning topics such as classification, regression and deep learning. I would suggest fans of Machine Learning start with those journeys to pick up the basics. After that, we can move one step further and try a real Kaggle competition: the TalkingData AdTracking<\/a> Fraud Detection Challenge.<\/p>\n <\/p>\n TalkingData is China’s largest big data service platform, covering 70% of active mobile devices nationwide (yeah, Big Brother is watching you :]] ). It handles 3 billion clicks a day, and 90% of them are potentially fraudulent. In case you don’t know, there are “click farms” in China which produce fake ratings and fake download numbers.<\/p>\n <\/p>\n (Click farm in China, image source: English Russia<\/a>)<\/p>\n This Kaggle challenge aims to tackle the click fraud issue, so our objective is clear: build a model that determines whether a click is fake or not.<\/p>\n Following our usual Data Science project routine, we read the data description, then load the training data and take a look at its content. Oh wait, we should know that we are handling “big data” here. Really big data, as the training dataset contains about 200 million records! You probably need a machine with at least 32GB of RAM to load the entire training csv file. But what if we don’t have such a machine? You can pay for Amazon Web Services cloud computing, or use a free Kaggle kernel. Data Science should belong to everyone, so we go for the free solution from Kaggle, which offers 17GB of RAM per kernel. That is still not enough to load the complete training dataset, but hey, we have other workarounds.<\/p>\n Although we cannot load the full training dataset, we can still load part of it. Assuming we are going to load the training data into “df_train<\/em>”, we can use pandas with the following options:<\/p>\n This loads 50 million rows [\u00a0nrows=50000000\u00a0<\/em>] out of the 200 million records, starting from the 125 millionth record [\u00a0skiprows=range(1,125000000)\u00a0<\/em>]. It takes around 3 minutes and 3GB of RAM to load the training data.<\/p>\n Let’s tell pandas which columns and data types to load:<\/p>\n And here is what we get:<\/p>\n It now takes only 1GB of RAM to load the data.<\/p>\n Another tip for a memory-tight environment: delete AND garbage collect every unused object. Although Python has automatic garbage collection, it only runs when the ratio of allocations to deallocations hits a threshold, so we can run garbage collection manually to free memory blocks right away.<\/p>\n <\/p>\n Now that we have the training data, it is time to run our EDA (exploratory data analysis) on the click industry. The challenge is about spotting fraudulent clicks, so the first thing I would like to know is: how “fake” are the clicks in the training data? In other words, what percentage of clicks lead to an app download?<\/p>\n <\/p>\n About 99.75% of clicks are fraudulent. (For lazy data scientists: just fill your output file with 0.0025 and the work is done :]] )<\/p>\n Let’s check out the number of unique values per feature in our training data:<\/p>\n <\/p>\n As expected, ip is the feature with the most unique values, while channel is one of the features with the fewest. 
So we can expect ‘channel’ to play a big part in our machine learning model.<\/p>\n Next, let’s take a look at the distribution of our features, starting with ‘channel’.<\/p>\n <\/p>\n After that, we apply the same function to the ‘ip’, ‘app’, ‘os’ and ‘device’ features.<\/p>\n We know certain values dominate a feature, but how do they contribute to the download rate? Let’s find out. This time, we look at the top 20 most-clicked values in each feature and see how they perform in terms of download rate, starting with the ‘device’ feature.<\/p>\n <\/p>\n Devices “1” and “2” account for over 99% of all devices, but their download rates are only around 0.0017 (0.17%) and 0.00029 (0.029%). Some uncommon devices reach 15%+ download rates, but they make up less than 1% of the data. (Lazy scientists part 2: let’s fill 95% of the output file with 0.0018 )<\/p>\n Again, we apply the same function to the ‘os’, ‘ip’, ‘channel’ and ‘app’ features.<\/p>\n <\/p>\n <\/p>\n <\/p>\n There are some findings from the charts above: Android OSes hardly make an impact on the download rate, which matches our observation in the top device click\/download rate chart. The top 20 ip clickers did download! Although their download rates are tiny, a download is a download; we can’t write them off just for being big clickers. For channel and app, the download rate does not correlate much with the number of clicks. Certain apps and channels simply out-download others, regardless of click volume.<\/p>\n Since there are massive numbers of clicks and the download rate is low in general, we make 2 assumptions: most of the fraudulent clicks come from click farms and click flooding rather than bots, and those operations run during regular office hours.<\/p>\n As of November 2017 (when the training data was recorded), click farms and click flooding were still the major players in click fraud (reference: The State of Mobile Fraud: Q1 2018<\/a>). Click farm companies would hire click farmers and run click flooding during regular working hours to keep costs down, i.e. around 0900 to 1800. We ignore the use of click bots in non-office hours, as bot usage in mobile fraud was still low in 2017. So let’s take a look at the click time distribution. But first, since the click time records in the training data are in UTC, we have to convert them to China time, i.e. GMT+8. After that, we round the click times to the nearest hour.<\/p>\n Then we get the click counts and download rates across the 24 hours of the day (a short sketch of this aggregation follows the time-conversion code below).<\/p>\n <\/p>\n According to an eMarketer report<\/a>, 2100 to 2359 should be the peak period for Chinese mobile users. Yet we see even more clicks than in those peak hours during 1200 to 1500, which matches our assumption that click farm companies operate during office hours. We may want to add a new feature, “farm_hour”, to flag time periods with higher click farming activity.<\/p>\n There is more than one way to do feature engineering for the TalkingData challenge. You may create a new feature that combines ip and device, or app and channel. Or create a feature based on a user’s next or previous click. After that, you can pick a learning model and predict your results (a rough sketch of such engineered features appears under “The Next Steps” below).<\/p>\n For starters, you can take a look at my Random Forest kernel here<\/a>. It won’t get you a high score on the leaderboard (use LightGBM if you are after a higher score), but it is easy to understand and straightforward. In case you don’t know, I am always a fan of the Random Forest model<\/a> :]] .<\/p>\n Just use what you have learnt to produce your results, and don’t be afraid of failing. Every time we fail, we get up and see what we have done wrong. 
Then we can learn from mistakes and become better and better.<\/p>\n <\/p>\n <\/p>\n <\/p>\n","protected":false},"excerpt":{"rendered":" Up to now, we have tried 3 different Kaggle journeys, the Titanic Survivors, the Iowa House Prices and the hand written digits recognition. Those journeys covered popular Machine Learning topics, such as classification, regression, deep learning, and so on. I would suggest fans of Machine Learning to start with those journeys. So we can learn […]<\/p>\n","protected":false},"author":1,"featured_media":1039,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_newsletter_tier_id":0,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[18],"tags":[21,19,74,75,30,22],"jetpack_publicize_connections":[],"yoast_head":"\nThe Happy “Farm”<\/h3>\n
Click Fraud Big Data<\/h3>\n
Take Big Part from Big Data<\/h3>\n
# skip the first 125 million data rows (the header row is kept), then read the next 50 million\r\ndf_train = pd.read_csv('..\/input\/train.csv', skiprows=range(1,125000000), nrows=50000000)\r\n<\/pre>\n
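Another possible workaround (just a sketch, not the approach used in this post) is to read the csv in chunks and keep a small random fraction of each chunk, so the sample spans the whole file instead of one contiguous block:<\/p>\n
# sketch: keep ~5% of every 10-million-row chunk; combine it with the dtype trick below to shrink each chunk further\r\nimport pandas as pd\r\n\r\nchunks = []\r\nfor chunk in pd.read_csv('..\/input\/train.csv', chunksize=10000000):\r\n    chunks.append(chunk.sample(frac=0.05, random_state=42))\r\ndf_train = pd.concat(chunks, ignore_index=True)\r\n<\/pre>\n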
df_train.info()<\/pre>\n
RangeIndex: 50000000 entries, 0 to 49999999\r\nData columns (total 8 columns):\r\nip int64\r\napp int64\r\ndevice int64\r\nos int64\r\nchannel int64\r\nclick_time object\r\nattributed_time object\r\nis_attributed int64\r\ndtypes: int64(6), object(2)\r\nmemory usage: 3.0+ GB\r\n<\/pre>\n
#columns and their data types to load\r\ndtypes = {\r\n 'ip' : 'uint32',\r\n 'app' : 'uint16',\r\n 'device' : 'uint16',\r\n 'os' : 'uint16',\r\n 'channel' : 'uint16',\r\n 'is_attributed' : 'uint8',\r\n 'click_id' : 'uint32',\r\n }\r\n\r\ncolumns = ['ip','app','device','os', 'channel', 'click_time', 'is_attributed']\r\n\r\ndf_train = pd.read_csv('..\/input\/train.csv', skiprows=range(1,125000000), nrows=50000000, dtype=dtypes, usecols=columns)<\/pre>\n
RangeIndex: 50000000 entries, 0 to 49999999\r\nData columns (total 7 columns):\r\nip uint32\r\napp uint16\r\ndevice uint16\r\nos uint16\r\nchannel uint16\r\nclick_time object\r\nis_attributed uint8\r\ndtypes: object(1), uint16(4), uint32(1), uint8(1)\r\nmemory usage: 1001.4+ MB<\/pre>\n
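If you want to verify the savings yourself, a quick check (not in the original kernel) of the per-column memory footprint looks like this:<\/p>\n
# per-column memory usage in megabytes, counting object (string) columns at their real size\r\nprint(df_train.memory_usage(deep=True) \/ 1024**2)\r\n<\/pre>\n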
“I will DELETE you!”<\/h3>\n
# drop the reference to the dataframe, then trigger a manual collection pass to release the memory\r\nimport gc\r\ndel df_train\r\ngc.collect()\r\n<\/pre>\n
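For reference, you can inspect those collection thresholds yourself; this snippet is only an illustration and is not part of the original kernel:<\/p>\n
import gc\r\n\r\n# (700, 10, 10) by default: a generation-0 collection only runs once allocations minus deallocations exceed 700 objects\r\nprint(gc.get_threshold())\r\n<\/pre>\n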
EDA on the Click Fraud data set<\/h3>\n
import seaborn as sns\r\n\r\n# share of clicks without \/ with an app download, in percent\r\ndownload_rate = df_train['is_attributed'].value_counts(normalize=True)*100\r\nax = sns.barplot(x=['Click without download', 'Click with download'], y=download_rate)\r\nfor p in ax.patches:\r\n    ax.annotate('{:.2f}%'.format(p.get_height()), (p.get_x()+p.get_width()\/3, p.get_height()+0.1))\r\n<\/pre>\n
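As a side note on the lazy-scientist joke above, a constant-prediction baseline would look roughly like this, assuming the competition’s usual click_id \/ is_attributed submission format:<\/p>\n
# the \"lazy\" baseline: predict the overall download rate for every click in the test set\r\ndf_test = pd.read_csv('..\/input\/test.csv', usecols=['click_id'])\r\ndf_test['is_attributed'] = 0.0025\r\ndf_test.to_csv('lazy_submission.csv', index=False)\r\n<\/pre>\n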
# number of unique values per feature (log scale, since ip dwarfs the others)\r\ncols = ['ip', 'app', 'device', 'os', 'channel']\r\nuniques = [len(df_train[col].unique()) for col in cols]\r\nax = sns.barplot(x=cols, y=uniques, log=True)\r\nfor p in ax.patches:\r\n    ax.annotate(p.get_height(), (p.get_x()+p.get_width()\/3, p.get_height()+0.1))\r\n<\/pre>\n
# plot the top 10 most-clicked values of a feature, annotated with their share of all clicks\r\ndef displayCountAndPercentage(df, groupby, countby):\r\n    counts = df[[groupby]+[countby]].groupby(groupby, as_index=False).count().sort_values(countby, ascending=False)\r\n    percentage = df[groupby].value_counts(normalize=True)*100\r\n    ax = sns.barplot(x=groupby, y=countby, data=counts[:10], order=counts[groupby][:10])\r\n    ax.set(ylabel='Number of clicks', title='Top 10 Clicks and Percentage of Feature: [{}]'.format(groupby))\r\n\r\n    i = 0\r\n    for p in ax.patches:\r\n        ax.annotate('{:.2f}%'.format(percentage.iloc[i]), (p.get_x()+p.get_width()\/3, p.get_height()+0.5))\r\n        i = i + 1\r\n\r\n    del counts, percentage\r\n    gc.collect()\r\n\r\ndisplayCountAndPercentage(df_train, 'channel', 'is_attributed')<\/pre>\n
\nThe plots for the ‘os’ and ‘device’ features are interesting, as those features are dominated by a few values. The ‘device’ feature is dominated by the encoded value ‘1’, which accounts for 94.56% of clicks. We are pretty sure that device ‘1’ is Android and device ‘2’ is iPhone, so we can also assume os ’19’, ’13’, ’17’, ’18’ and ’22’ are different Android OS versions.<\/p>\nRelationship between Number of Clicks and Download Rate<\/h3>\n
# plot the download rate (bars, left axis) and click count (line, right axis) for the `top` most-clicked values of a feature\r\ndef displayCountAndDownloadRate(df, groupby, countby, top):\r\n    counts = df[[groupby]+[countby]].groupby(groupby, as_index=False).count().sort_values(countby, ascending=False)\r\n    download_rates = df[[groupby]+[countby]].groupby(groupby, as_index=False).mean().sort_values(countby, ascending=False)\r\n    df_merge = counts.merge(download_rates, on=groupby, how='left')\r\n    del counts, download_rates\r\n    gc.collect()\r\n    df_merge.columns = [groupby, 'click_count', 'download_rate']\r\n    df_merge[groupby] = df_merge[groupby].astype('category')\r\n    ax = df_merge[:top].plot(x=groupby, y=\"download_rate\", legend=False, kind=\"bar\", color=\"orange\", label=\"download rate (left)\")\r\n    ax2 = ax.twinx()\r\n    df_merge[:top].plot(x=groupby, y=\"click_count\", ax=ax2, legend=False, kind=\"line\", color=\"blue\", label=\"click count (right)\")\r\n    ax.set_xticklabels(df_merge[groupby][:top])\r\n    ax.figure.legend(loc='upper left')\r\n    ax.set_title(\"Top {} Click Counts and Download Rates of [{}]\".format(top, groupby))\r\n    del df_merge\r\n    gc.collect()\r\n\r\ndisplayCountAndDownloadRate(df_train, 'device', 'is_attributed', 20)\r\n<\/pre>\n
Does Time matter?<\/h3>\n
\n
# click_time is recorded in UTC; shift it by 8 hours to get China time (GMT+8)\r\ndf_train['click_time_china'] = pd.to_datetime(df_train.click_time)\r\ndf_train['click_time_china'] += pd.to_timedelta(8, unit='h')\r\n# round each click to the nearest hour\r\ndf_train['click_time_hour'] = df_train['click_time_china'].dt.round('H')\r\n<\/pre>\n
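The aggregation itself is not shown in the post, but a minimal sketch of getting click counts and download rates per hour of day could look like this (hour_of_day is a column name introduced here for illustration):<\/p>\n
# group clicks by hour of day: the count gives click volume, the mean of is_attributed gives the download rate\r\ndf_train['hour_of_day'] = df_train['click_time_hour'].dt.hour\r\nhourly = df_train.groupby('hour_of_day')['is_attributed'].agg(['count', 'mean'])\r\nhourly.columns = ['click_count', 'download_rate']\r\nprint(hourly)\r\n<\/pre>\n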
The Next Steps<\/h3>\n
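To make the feature engineering ideas above concrete, here is a rough sketch; the column names (farm_hour, ip_device_clicks, next_click_gap) are made up for illustration and this is not the code behind my kernel:<\/p>\n
# flag clicks made during likely click-farm office hours (0900 to 1800 China time)\r\ndf_train['farm_hour'] = df_train['click_time_china'].dt.hour.between(9, 18).astype('uint8')\r\n\r\n# how many clicks each ip + device combination produced\r\ndf_train['ip_device_clicks'] = df_train.groupby(['ip', 'device'])['channel'].transform('count')\r\n\r\n# seconds until the next click from the same ip\r\ndf_train = df_train.sort_values(['ip', 'click_time_china'])\r\ngap = df_train.groupby('ip')['click_time_china'].shift(-1) - df_train['click_time_china']\r\ndf_train['next_click_gap'] = gap.dt.total_seconds()\r\n<\/pre>\n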
What have we learnt in this post?<\/h3>\n
\n