{"id":1487,"date":"2018-11-21T20:35:55","date_gmt":"2018-11-21T20:35:55","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=1487"},"modified":"2020-06-02T01:28:06","modified_gmt":"2020-06-02T01:28:06","slug":"revenue-prediction-google-store","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/revenue-prediction-google-store\/","title":{"rendered":"Revenue Prediction in Google Store"},"content":{"rendered":"

Every business owner wants to make revenue predictions, so that he or she can make better marketing decisions. On Kaggle<\/a>, the data science community site, there is a challenge on making a store’s revenue prediction<\/a>. And that is the topic we are looking at this time. The store in this challenge is none other than the Google Merchandise Store. (It seems Google did not spend enough on Google Plus’s revenue prediction; in the end they just lost direction and decided to close it<\/a>. :]] )<\/p>\n


When Big Data is really Big<\/h3>\n

We handled the TalkingData Click Fraud<\/a> challenge with a big training dataset in the past. That was a dataset with 200 million records in a 1.2GB file. This time, we handle a dataset with only 1.7 million records, but, well, in a 23.7GB<\/strong> file. Once again, it is impossible to load the whole file directly inside Kaggle's 17GB kernel, and it would be just as impossible on a machine with 32GB of RAM. But by using the nrows<\/em> trick we learnt from the TalkingData challenge, we can load just a small part (20 rows) of the dataset file first.<\/p>\n

import pandas as pd\n\ndf = pd.read_csv('..\/input\/train_v2.csv', nrows=20)\ndf.head()\n<\/pre>\n

Then we find the reason why the dataset file is so big:<\/p>\n

(Screenshot: the first rows of the raw dataset, showing several JSON columns)<\/p>\n

There are several JSON columns stored inside the file; each row carries multiple nested objects in those columns, and that is what inflates the file size.<\/p>\n

So now we have to face two issues with the training dataset file:<\/p>\n

    \n
  1. Loading a huge file<\/li>\n
  2. Handling records with JSON objects<\/li>\n<\/ol>\n

    For the first issue, we can load the 1.7 million plus records in 18 rounds, handling 100 thousand records at a time. For the second issue, it is good that pandas has a json_normalize<\/a> API for us. We can simply use it to normalize JSON objects into a flat table structure, i.e. it transforms nested JSON objects into separate columns in a dataframe.<\/p>\n
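
    To get a feel for what json_normalize does, here is a minimal sketch with a made-up nested record (the field names are illustrative only and not taken from the actual dataset):<\/p>\n

    from pandas.io.json import json_normalize\n\n# a made-up nested record, similar in spirit to the JSON columns in the dataset\nsample = [{'device': {'browser': 'Chrome', 'isMobile': False},\n           'totals': {'hits': 4}}]\n\nflat_df = json_normalize(sample)\nprint(flat_df.columns.tolist())\n# ['device.browser', 'device.isMobile', 'totals.hits']\n# the load_df helper shown later renames these to underscore-joined names like device_browser\n<\/pre>\n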

    Does everything look good?<\/p>\n

    Not really. By using json_normalize, we can retrieve normalized columns from the dataset. However, it turns out that we get far more columns than we actually need.<\/p>\n

    (Screenshot: the flattened dataframe after json_normalize, showing the large number of columns)<\/p>\n

    We find that several columns contain the same content in every row.<\/p>\n

    cols_w_same_content = [col for col in df.columns if df[col].nunique() <= 1]\nprint('Columns with same content: ', cols_w_same_content)\n<\/pre>\n

    And some of them duplicate each other’s content, like “customDimensions” and “geoNetwork_continent”. So we have to filter out those problematic columns and combine that filtering with the json_normalize method:<\/p>\n

    import json,time, gc\nfrom pandas.io.json import json_normalize\n\ndef load_df(csv_path, chunksize=100000):\n    #use only reasonable columns \n    features = ['channelGrouping', 'date', 'fullVisitorId', 'visitId',\n                'visitNumber', 'visitStartTime', 'device_browser',\n                'device_deviceCategory', 'device_isMobile', 'device_operatingSystem',\n                'geoNetwork_city', 'geoNetwork_continent', 'geoNetwork_country',\n                'geoNetwork_metro', 'geoNetwork_networkDomain', 'geoNetwork_region',\n                'geoNetwork_subContinent', 'totals_bounces', 'totals_hits',\n                'totals_newVisits', 'totals_pageviews', 'totals_transactionRevenue',\n                'trafficSource_adContent', 'trafficSource_campaign',\n                'trafficSource_isTrueDirect', 'trafficSource_keyword',\n                'trafficSource_medium', 'trafficSource_referralPath',\n                'trafficSource_source']\n\n    #columns with JSON objects to normalize\n    JSON_COLS = ['device', 'geoNetwork', 'totals', 'trafficSource']\n    print('Load {}'.format(csv_path))\n    df_reader = pd.read_csv(csv_path,\n                            converters={ column: json.loads for column in JSON_COLS },\n                            dtype={ 'date': str, 'fullVisitorId': str, 'sessionId': str, \n                                  'totals_transactionRevenue' : 'uint64', 'visitId': 'uint64', 'visitNumber': 'uint8', \n                                  'visitStartTime': 'uint64', 'totals_hits': 'uint8'},\n                            chunksize=chunksize)\n    res = pd.DataFrame()\n    for cidx, df in enumerate(df_reader):\n        df.reset_index(drop=True, inplace=True)\n        for col in JSON_COLS:\n            col_as_df = json_normalize(df[col])\n            col_as_df.columns = ['{}_{}'.format(col, subcol) for subcol in col_as_df.columns]\n            df = df.drop(col, axis=1).merge(col_as_df, right_index=True, left_index=True)\n        res = pd.concat([res, df[features]], axis=0).reset_index(drop=True)\n        del df\n        gc.collect()\n        print('Round {}: DF shape {}'.format(cidx + 1, res.shape))\n    return res\n\nstart_time = time.time()\ntrain_df = load_df('..\/input\/train_v2.csv')\nprint (\"Time used: {} sec\".format(time.time()-start_time))\n<\/pre>\n

    Then we can load all 1.7 million records inside Kaggle’s 17GB kernel in 10 minutes.<\/p>\n
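
    Since the JSON parsing takes a while, one option is to cache the flattened dataframe once it is built, so later runs can skip the 10-minute parse. A small sketch (the pickle file name is just an assumption, not part of the original notebook):<\/p>\n

    # cache the flattened dataframe once it is built (assumed file name)\ntrain_df.to_pickle('train_flat.pkl')\n\n# on a later run, reload the cached copy instead of re-parsing the 23.7GB CSV\n# train_df = pd.read_pickle('train_flat.pkl')\n<\/pre>\n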

    Revenue Prediction for the Future<\/h3>\n

    In this challenge, we are predicting each customer’s total transaction revenue for December 1st, 2018 to January 31st, 2019<\/strong>, a future date range. First, let’s see what we have in our training and testing datasets. We can use the interactive charts that we learnt from the Avito Demand Prediction Challenge<\/a> for data analysis.<\/p>\n
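
    The chart code below also refers to a test_df dataframe, which is assumed to be built with the same load_df helper on the competition’s test file:<\/p>\n

    # load the test set the same way as the training set (assumed path)\ntest_df = load_df('..\/input\/test_v2.csv')\n<\/pre>\n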

    import plotly.graph_objs as go\nimport plotly.offline as py\npy.init_notebook_mode(connected=True)\n\ndf2 = train_df.groupby('date')['totals_transactionRevenue'].sum().reset_index()\ndf3 = test_df.groupby('date')['totals_transactionRevenue'].sum().reset_index()\n\ntrace = go.Scatter(\n            x = pd.to_datetime(df2.date),\n            y = df2.totals_transactionRevenue,\n            name=\"Train df\"\n        )\n\ntrace2 = go.Scatter(\n            x = pd.to_datetime(df3.date),\n            y = df3.totals_transactionRevenue,\n            name=\"Test df\"\n        )\n\nlayout = go.Layout(\n             title = \"Volume of Transaction Revenue among Train and Test datasets\",\n                xaxis=dict(\n                  title='Date',\n                  rangeslider=dict(visible=True),\n                  type='date'\n                ),\n                yaxis=dict(\n                  title=\"Volume of Transaction Revenue\",\n                  type='log',\n                  autorange=True\n                )\n             )\ndata = [trace, trace2]\nfig = go.Figure(data=data, layout=layout)\npy.iplot(fig)\ndel df2, df3;gc.collect()\n<\/pre>\n