{"id":469,"date":"2017-11-07T19:35:26","date_gmt":"2017-11-07T19:35:26","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=469"},"modified":"2018-12-21T04:19:08","modified_gmt":"2018-12-21T04:19:08","slug":"win-big-real-estate-market-data-science","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/win-big-real-estate-market-data-science\/","title":{"rendered":"To win big in real estate market using data science – Part 1"},"content":{"rendered":"
Okay, yes, once again, it is a catchy topic. **But** this post really is about helping people (including me) gain an upper hand in the real estate market using data science.

In our last post (two months ago; I will try to update this blog more often :]] ), we learned that we can use regression in data science to predict the trend of restaurant tips. Now we can apply the same idea to a Kaggle competition, House Prices: Advanced Regression Techniques, and predict house prices.

### To Win in the Real Estate Market

Don't you feel excited seeing the heading above? Yes! In trading, pricing is the most critical component of maximizing your profit. You can make better decisions when you know what the right price is.

In the Kaggle housing competition, our goal is to predict the right prices of 1,450+ properties based on 75+ features.

First things first: since there are 75+ features in the data sets, it is a good idea to download the data description file from Kaggle and find out what each of them means.

Then we get the training and testing data sets, load them into data frames and check their sizes.
```python
import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

print("Size of training dataset:", df_train.shape)
print("Size of testing dataset:", df_test.shape)
```
```
Size of training dataset: (1460, 81)
Size of testing dataset: (1459, 80)
```
Last time, we predicted about 400 passengers' survival statuses. This time, we raise the bar: we are going to predict 1,459 records using just 1,460 training records.

Let's get a feel for what the training data set looks like:

```python
df_train.head(5)
```
Next, we check whether there are null values inside the data set.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# count the missing values per feature and keep only the features that have any
df_missing = df_train.isnull().sum()
df_missing = df_missing[df_missing > 0]

sns.barplot(x=df_missing.values, y=df_missing.index)
plt.show()
```

Oh, there are many actually.
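If you prefer exact numbers over a bar chart, the same information can be printed directly. This is just a convenience check built on the `df_missing` series above, not part of the original walkthrough:

```python
# exact counts and percentages of missing values for the affected features
missing_pct = (df_missing / len(df_train) * 100).round(1)
print(pd.concat([df_missing, missing_pct], axis=1, keys=["count", "percent"])
        .sort_values("count", ascending=False))
```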
### Enter the Void

Don't worry: according to the data description, null values simply mean "None" or 0 for the related features. For example, a null value in "Alley" means "No alley access", and a null in "MasVnrArea" means 0 square feet of masonry veneer area. That is why I suggest taking a look at the data description file first.

Here comes the solution. We add a function that handles the null values in the data set by filling in 0, "None" or the most common value of each feature.
```python
def fillNAonDF(df):
    # fill with the most common value of each feature
    for feat in ('MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'BsmtFinSF1', 'BsmtFinSF2', 'Electrical'):
        df.loc[:, feat] = df.loc[:, feat].fillna(df[feat].mode()[0])
    for feat in ('BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'KitchenQual', 'Functional', 'SaleType'):
        df.loc[:, feat] = df.loc[:, feat].fillna(df[feat].mode()[0])
    # fill with "None", i.e. the feature is absent for this property
    for feat in ('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
        df.loc[:, feat] = df.loc[:, feat].fillna("None")
    for feat in ('MasVnrType', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
        df.loc[:, feat] = df.loc[:, feat].fillna("None")
    # fill with 0 for numerical features where null means "not applicable"
    for feat in ('MasVnrArea', 'GarageYrBlt', 'GarageArea', 'GarageCars'):
        df.loc[:, feat] = df.loc[:, feat].fillna(0)
    for feat in ('PoolQC', 'Fence', 'MiscFeature'):
        df.loc[:, feat] = df.loc[:, feat].fillna("None")
    # fill LotFrontage with the median value of its neighborhood
    df.loc[:, 'LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
```
And apply it to both the training and testing data sets.

```python
fillNAonDF(df_train)
fillNAonDF(df_test)
```

Poof! Now the void issue is gone.
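To make sure the void really is gone, a quick sanity check can be run on both data frames:

```python
# count how many null values remain after the clean-up
print("Nulls left in training set:", df_train.isnull().sum().sum())
print("Nulls left in testing set:", df_test.isnull().sum().sum())
```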
### Size Does Matter

There are two major factors affecting a property's price: location and size. For size, we can get the information from the "GrLivArea" (above ground living area) feature and plot a chart to show the relationship between size and sale price:
```python
sns.regplot(x="GrLivArea", y="SalePrice", data=df_train)
plt.show()
```
Well, it looks linear, but there are two outliers on the bottom right. Those two properties are sized 4,000+ square feet yet were sold at unreasonably low prices. The author of the data set, Dr. Dean De Cock, recommends removing any houses with more than 4,000 square feet in order to eliminate unusual observations.

So we remove the records with living area larger than 4,000 square feet and plot the regression chart again.

```python
# drop the oversized outliers from the training set
df_train = df_train.loc[df_train.GrLivArea < 4000]

sns.regplot(x="GrLivArea", y="SalePrice", data=df_train)
plt.show()
```

It looks more linear now.
### Money, Money, Money, Again

Sale price is the target we are going to predict. First, let's see how it is distributed in the training data set:
```python
price_dist = sns.distplot(df_train["SalePrice"], color="m", label="Skewness : %.2f" % (df_train["SalePrice"].skew()))
price_dist = price_dist.legend(loc="best")
plt.show()
```
We saw a similar chart in our past data science exercise, the Titanic Project, when we analyzed the passengers' fare variable. The distribution does not look like a normal distribution, as there are a few very high sale price records. From our experience with the fare feature, we can apply the same logarithm treatment to remove the impact of those extreme values.

```python
import numpy as np

# log1p (log of 1 + x) keeps zero values valid and pulls in the extreme prices
df_train.loc[:, 'SalePrice_log'] = df_train["SalePrice"].map(lambda i: np.log1p(i) if i > 0 else 0)

price_log_dist = sns.distplot(df_train["SalePrice_log"], color="m", label="Skewness : %.2f" % (df_train["SalePrice_log"].skew()))
price_log_dist = price_log_dist.legend(loc="best")
plt.show()
```

After that, the skewness of the sale price is reduced and we have a better distribution to work with.
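One thing to keep in mind: since the models will now learn and predict log1p-transformed prices, any prediction has to be mapped back to dollars with the inverse transform, `np.expm1`. A minimal illustration using the first training record:

```python
# np.expm1 is the inverse of np.log1p, so a log-scale value maps back to dollars
sample_log_price = df_train["SalePrice_log"].iloc[0]
print(np.expm1(sample_log_price))      # essentially the original sale price (up to rounding)
print(df_train["SalePrice"].iloc[0])   # the untouched SalePrice for comparison
```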
### Feature Engineering

Since we are doing machine learning, we must transform the features into something a machine can read and learn from. In short, we change everything into numerical features.

Before we do that, we convert the "MoSold" and "MSSubClass" features, whose numerical values are really just category labels, back into categorical features (that is why I say we have to read the data description file first).
```python
def trxNumericToCategory(df):
    # these numbers are category labels, not quantities, so store them as strings
    df['MSSubClass'] = df['MSSubClass'].apply(str)
    df['MoSold'] = df['MoSold'].apply(str)

trxNumericToCategory(df_train)
trxNumericToCategory(df_test)
```
Next, we take the categorical features and turn them into numerical ones by using the mean sale price of each category.

```python
def quantifier(df, feature, df2):
    # order the categories of a feature by their mean log sale price
    new_order = pd.DataFrame()
    new_order['value'] = df[feature].unique()
    new_order.index = new_order.value
    new_order['price_mean'] = df[[feature, 'SalePrice_log']].groupby(feature).mean()['SalePrice_log']
    new_order = new_order.sort_values('price_mean')
    new_order = new_order['price_mean'].to_dict()

    # create a numerical "<feature>_Q" column in both data frames
    for categorical_value, price_mean in new_order.items():
        df.loc[df[feature] == categorical_value, feature + '_Q'] = price_mean
        df2.loc[df2[feature] == categorical_value, feature + '_Q'] = price_mean

categorical_features = df_train.select_dtypes(include=["object"])
for f in categorical_features:
    quantifier(df_train, f, df_test)
```

After the transformation, we can drop those categorical features, and both the training and testing data sets become machine-readable.
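The drop itself is a one-liner. The exact line is not shown above, so treat this as one reasonable way to do it, simply removing the original categorical columns now that each has a numerical "_Q" counterpart:

```python
# drop the original categorical columns, keeping their numerical *_Q versions
cat_cols = list(categorical_features.columns)
df_train = df_train.drop(columns=cat_cols)
df_test = df_test.drop(columns=cat_cols, errors="ignore")
```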
### Skew Them All

We have already transformed the skewed sale price to obtain a better distribution; we can do the same thing for all the other skewed features as well.
First of all, we combine the training and testing data sets into one "all data" data set.

```python
# stack training and testing rows together, remembering where the training part ends
df_all_data = pd.concat((df_train, df_test)).reset_index(drop=True)
train_index = df_train.shape[0]
```
Then we find the features with a skewness greater than 0.75 and transform them.

```python
def skewFeatures(df):
    # rank the numerical features by skewness
    skewness = df.skew().sort_values(ascending=False)
    df_skewness = pd.DataFrame({'Skew': skewness})

    # keep only the features whose absolute skewness is above 0.75
    df_skewness = df_skewness[abs(df_skewness) > 0.75]
    df_skewness = df_skewness.dropna(axis=0, how='any')
    skewed_features = df_skewness.index

    # apply the same log1p treatment we used on the sale price
    for feat in skewed_features:
        df[feat] = np.log1p(df[feat])

skewFeatures(df_all_data)
```
Done! And it is time to get our finalized training and testing data sets.

```python
X_learning = df_all_data[:train_index]
X_test = df_all_data[train_index:]
Y_learning = df_train['SalePrice_log']
```
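One caveat: at this point df_all_data still carries the "Id" column plus the "SalePrice" and "SalePrice_log" columns (the latter two are empty for the test rows), and we do not want the models to peek at the answer. The sketch below is an assumed clean-up step, not code taken from the original walkthrough:

```python
# keep the identifier and the target out of the feature matrices
leak_cols = ['Id', 'SalePrice', 'SalePrice_log']
X_learning = X_learning.drop(columns=[c for c in leak_cols if c in X_learning.columns])
X_test = X_test.drop(columns=[c for c in leak_cols if c in X_test.columns])
```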
### Modeling Time

X data, check. Y data, check. Test data, check. What time is it? It's clobberin'… err… it's modeling time!

Do you remember how we choose a model for machine learning? Yes, we use k-fold cross validation to pick our model.
We have chosen several common regression models, plus the people's favorite XGBoost model, for this house price project:

```python
from sklearn.linear_model import (LinearRegression, RidgeCV, LarsCV, LassoCV,
                                  ElasticNetCV, LassoLarsCV)
import xgboost as xgb

models = []
models.append(("LrE",    LinearRegression()))
models.append(("RidCV",  RidgeCV()))
models.append(("LarCV",  LarsCV()))
models.append(("LasCV",  LassoCV()))
models.append(("ElNCV",  ElasticNetCV()))
models.append(("LaLaCV", LassoLarsCV()))
models.append(("XGB",    xgb.XGBRegressor()))
```
Then we apply 10-fold cross validation to our models.

```python
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10)

def getCVResult(models, X_learning, Y_learning):
    # report each model's cross-validated RMSE on the log-price scale
    for name, model in models:
        cv_results = cross_val_score(model, X_learning, Y_learning,
                                     scoring='neg_mean_squared_error', cv=kfold)
        rmsd_scores = np.sqrt(-cv_results)
        print("\n[%s] Mean: %.8f Std. Dev.: %.8f" % (name, rmsd_scores.mean(), rmsd_scores.std()))

getCVResult(models, X_learning, Y_learning)
```
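Once the scores point to a favorite, the chosen model can be fitted on the full training data and its log-scale predictions converted back to dollar prices with `np.expm1`. A minimal sketch, simply picking the XGBoost regressor from the list above as an example:

```python
# fit one candidate on all the training data and predict prices for the test set
chosen_model = xgb.XGBRegressor()
chosen_model.fit(X_learning, Y_learning)

predicted_prices = np.expm1(chosen_model.predict(X_test))   # back from log scale to dollars
print(predicted_prices[:5])
```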