Executive Summary

Our business problem is to predict total bicycle demand for the next day using Day, Weather and past Demand information.  While Day information for the next day is known beforehand, Weather information is known only up to the current day.  Furthermore, past demand information is available only up to the previous day, since total demand for the current day is not known before day end.  Additionally, our business goal revolves around profit maximization, where the loss due to over- or under-prediction is asymmetric ($2 for each unit of over-prediction vs. $1 per unit of under-prediction).
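
Because each over-predicted unit costs $2 while each under-predicted unit costs only $1, the cost-minimizing forecast is not the expected demand but roughly its 1/(1+2) = 1/3 quantile, which is one motivation for using quantile regression among the models below.  A minimal numeric illustration (the demand outcomes are made up):

```python
import numpy as np

# hypothetical equally likely demand outcomes for tomorrow
demand_scenarios = np.array([10, 20, 30, 40, 50, 60])

def expected_cost(pred, outcomes):
    over = np.maximum(pred - outcomes, 0)    # $2 per unit over-predicted
    under = np.maximum(outcomes - pred, 0)   # $1 per unit under-predicted
    return (2 * over + 1 * under).mean()

costs = [expected_cost(p, demand_scenarios) for p in demand_scenarios]
best = int(demand_scenarios[int(np.argmin(costs))])
# best lands near the 1/3 quantile of the outcomes, well below the mean of 35
```

Shading the forecast below the mean in this way trades a little lost revenue for a larger saving on the costlier over-prediction side.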

We use a combination of ARIMA-based time series forecasting along with five different single-model approaches.  These methods are:

  • Robust Linear Regression by Huber’s M-estimation using bisquare weights
  • Quantile Regression using smoothing algorithm
  • Boosted Generalized Linear Model (GLM)
  • Multiple Linear Regression
  • Support Vector Regression using polynomial kernel
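
The first of these, robust regression with bisquare weights, can be sketched as iteratively re-weighted least squares.  This is a simplified numpy version for illustration only; the report used standard SAS/R routines, so treat the mechanics here as an assumption, not the actual code:

```python
import numpy as np

def bisquare_weights(u, c=4.685):
    # Tukey's bisquare: weights fall to zero for large standardized residuals
    w = np.zeros_like(u)
    inside = np.abs(u) < c
    w[inside] = (1.0 - (u[inside] / c) ** 2) ** 2
    return w

def robust_linear_fit(x, y, iters=25):
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]               # OLS start
    for _ in range(iters):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # MAD scale
        w = bisquare_weights(r / scale)
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)            # weighted LS step
    return beta

# demo: a noisy line with one gross outlier that OLS would chase
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + 0.1 * np.random.default_rng(0).standard_normal(50)
y[10] += 100.0
beta = robust_linear_fit(x, y)   # slope stays close to the true value of 2
```

The appeal for demand data is that one freak day (a festival, a storm) gets near-zero weight instead of dragging the fitted trend.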

Our best performing single model turns out to be the Boosted GLM.  We discard tree-based techniques because of the strong increasing trend in demand, which these approaches would miss if actual demand were modeled directly.  Some exploration using C&RT is done with absolute and percentage differences from past actuals as target variables.  The results, however, are not as good as those of the selected models listed above.  Neural networks and other boosting-based approaches (GBM, GBRT) are not used because of their tendency to overfit.  Since static models are being built and no updates are allowed on a dynamic basis, we feel it is better to ensure the robustness of our solution than the accuracy of short-term results.

All model predictions provide higher profit than the Naïve benchmark (yesterday’s actual) model for the test period (2012).  While the benchmark model gives $1.443mn total profit (35.2% RoI), our best performing single model, Boosted GLM, results in a total profit of $1.612mn (40.0% RoI).  The final Ensemble, an Expert Classifier combining predictions from the 3 best approaches, increases the total profit further to $1.625mn (41.3% RoI).  No strong systematic pattern is observed in the error of any model by season, weather type or other variables.  The predictions, however, always show significant over-estimation for natural disaster days (e.g. tornado, hurricane), which cannot be predicted beforehand with the available data.
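
An Expert Classifier ensemble of this kind can be thought of as a gate that routes each day to one component model.  The routing feature and the numbers below are hypothetical and only illustrate the mechanism:

```python
import numpy as np

# hypothetical per-day predictions from the three best single approaches
pred_boosted_glm = np.array([4100, 4250, 3900, 4400])
pred_quantile = np.array([4000, 4300, 3800, 4500])
pred_robust = np.array([4200, 4150, 3950, 4350])

# hypothetical gating feature: 0 = weekday, 1 = weekend, 2 = bad weather
regime = np.array([0, 1, 0, 2])

# expert classifier: route each day to the model that wins in that regime
experts = {0: pred_boosted_glm, 1: pred_quantile, 2: pred_robust}
ensemble = np.array([experts[r][i] for i, r in enumerate(regime)])
```

In practice the gate itself would be learned from validation-period errors rather than hand-coded as above.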

Comparing the performance of our models at different unit revenue levels ($2-$10) shows that our recommended Ensemble consistently outperforms the benchmark model, both by total profit and by RoI.  Our models and final Ensemble show a stable error pattern for 6-9 months into the test period (2012).  All models show some deterioration in Q4, 2012.  Surprisingly, prediction performance goes down after re-weighting the parameters based on 1.5 years of data.  We attribute this to imbalance in the data: the models learn more about Jan-Jun features than Jul-Dec, while we test the models on the Jul-Dec period only.
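
A revenue-sensitivity sweep of this kind can be sketched as follows.  The profit rule here is an assumption (revenue earned on units actually served, minus the $2 penalty per over-predicted unit noted earlier), and the demand and prediction numbers are made up:

```python
import numpy as np

# hypothetical actual demand and two prediction series for a short window
actual = np.array([100, 120, 110, 130])
naive = np.array([90, 100, 120, 110])    # yesterday's actual as prediction
model = np.array([105, 118, 112, 128])   # a better-calibrated model

def profit(pred, actual, unit_revenue):
    # assumed rule: earn revenue on units actually served,
    # lose $2 per over-predicted unit
    served = np.minimum(pred, actual)
    over = np.maximum(pred - actual, 0)
    return unit_revenue * served.sum() - 2 * over.sum()

# sweep unit revenue from $2 to $10 and compare the two series
results = {r: (profit(naive, actual, r), profit(model, actual, r))
           for r in range(2, 11)}
```

A sweep like this shows whether the model's edge over the benchmark survives as the revenue-to-penalty ratio changes.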

Data balancing helps when we re-weight each model using a weight function (1 for Jan-Jun days and 2 for Jul-Dec days).  We feel, however, that the models should be rebuilt (not simply have their parameters re-weighted using the same predictors) at least on a half-yearly basis to cater to the growing and dynamic business environment of this industry.
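
The half-yearly weight function can be applied as ordinary weighted least squares.  This numpy sketch uses a crude month index and synthetic demand purely for illustration:

```python
import numpy as np

# hypothetical 18 months of daily data with a steady growth trend
days = np.arange(540)
demand = 100.0 + 0.5 * days

# the report's weight function: 1 for Jan-Jun days, 2 for Jul-Dec days
month = (days // 30) % 12            # crude month index, for illustration
w = np.where(month < 6, 1.0, 2.0)

# weighted least squares: solve (X'WX) beta = X'W y
X = np.column_stack([np.ones(540), days.astype(float)])
beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * demand))
```

Doubling the Jul-Dec weights makes the fit pay twice as much attention to the half-year that matches the test period.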

[Figure: Univariate analysis of key numeric variables]

The detailed analysis report can be accessed on Google Drive (File Name: DA1 – BikeShare Demand Prediction.pdf)

Additionally, the following FAQ section should address some of the common queries that a reader may have.  For additional information or clarification, do leave a reply and I will respond.

Question-Answer Mapping with Assignment

  • Build a prediction model that can predict bicycle demand for tomorrow – use a single model. Compare different modeling algorithms to obtain the best performing single model
    • We used a combination of ARIMA-based time series forecasting, along with five different single-model approaches.
    • These methods were
      • Robust Linear Regression by Huber’s M-estimation using bisquare weights
      • Quantile Regression using smoothing algorithm
      • Boosted Generalized Linear Model (GLM)
      • Multiple Linear Regression
      • Support Vector Regression using polynomial kernel
    • Our best performing single model turned out to be the Boosted GLM model
  • What is the profit (total and percentage of expenditure) for the default prediction?
    • Naïve benchmark (yesterday’s actual) model gave $1.443mn total profit (35.2% RoI) for 2012
  • Demonstrate that Capital would make more profit (over the test period) using the prediction model than they would if they did not use the model
    • All model predictions provided higher profit than the Naïve benchmark (yesterday’s actual) model, which gave $1.443mn total profit (35.2% RoI) for the test period (2012)
    • Our best performing single model, Boosted GLM, resulted in a total profit of $1.612mn for 2012 (40.0% RoI)
  • Repeat using an Ensemble model instead of a single model – try to increase model performance and hence $ profit
    • Our final Ensemble, an Expert Classifier System combining predictions from the top 3 approaches resulted in a total profit of $1.625mn (41.3% RoI) for 2012
  • Under what revenue considerations is your prediction model better than the default model?
    • Comparing performance of our models at different unit revenue levels ($2-$10) showed that our recommended Ensemble consistently outperformed the benchmark model, both by total profit and RoI
  • Did you find any evidence that your model performance correlates with season or other similar factors?
    • No strong systematic pattern was observed in error by season, weather type or other variables. The predictions, however, always showed significant over-estimation for natural disaster days (e.g. tornado, hurricane) which cannot be predicted with precision beforehand using the available data.
  • How well will the model extrapolate into the future?
    • Our models and final Ensemble showed a stable error performance for 6-9 months into the test period (2012). Model performances deteriorated in Q4, 2012.
  • Did the model generated from the 18 months’ training set give higher profit than the model built from the 12 months’ training set, when tested on the test period (Jul-Dec’12)?
    • Model performance went down after re-weighting the parameters based on 1.5 years of data. We attribute this to imbalanced data: the models learnt more about Jan-Jun features than Jul-Dec, while we tested the models on the Jul-Dec period only.
  • Did you find any evidence that balancing the data in the 18 month training set gave higher performance?
    • Data balancing helped when we re-weighted each model using a weight function (1 for Jan-Jun days and 2 for Jul-Dec days). However, it was still felt that the models should be rebuilt (not simply have their parameters re-weighted using the same predictors) to cater to the growing and dynamic business environment of this industry.
  • Is predicting absolute demand the best target? Perhaps percentage increase in demand is better.
    • We explored both (and more) possibilities, such as actual demand, absolute and relative (%) difference from the underlying trend, and the likelihood of tomorrow being a high-demand or low-demand day
    • Each of our 5 constructs used a different target variable. We did this to ensure that we perceived and analyzed the prediction problem from different perspectives and obtained somewhat disparate predictions. This ensured robust and superior performance of our Ensemble model across business scenarios
  • Creating Lag and Trend variables
    • Short-term lags (up to 3 days, including today) were used for Weather variables, while longer lags (up to 7 days, ending yesterday) were used for Demand variables due to strong weekly patterns
    • Original variables were dropped after lag creation as they were not usable by the design of our business problem
    • Trend variables for demand were introduced in the form of the slope of movement over the last 7 days, through multiplicative decomposition, and using an ARIMA model on each demand series
  • Which software packages were used for this project?
    • MS Excel was used for initial data pre-processing and time series analysis (up to decomposition)
    • R was used for ARIMA modeling. R was also used for validating some of the models originally developed in SAS
    • SAS was used for the first 4 modeling constructs
    • SPSS-Modeler was used for k-means clustering
    • Weka was used for Support Vector Regression
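
The lag and trend construction described under “Creating Lag and Trend variables” can be sketched in pandas.  Column names and the toy numbers are hypothetical, and the slope here is a plain least-squares slope; the report’s actual trend variables also came from multiplicative decomposition and ARIMA:

```python
import numpy as np
import pandas as pd

# hypothetical daily history; the numbers are made up
df = pd.DataFrame({
    "temp":   [20, 21, 19, 22, 23, 24, 25, 26],
    "demand": [100, 110, 105, 120, 130, 125, 140, 150],
})

# short lags for weather: today back to 2 days ago (known at prediction time)
for k in range(0, 3):
    df[f"temp_lag{k}"] = df["temp"].shift(k)

# longer lags for demand: yesterday back to 7 days, to capture the weekly cycle
for k in range(1, 8):
    df[f"demand_lag{k}"] = df["demand"].shift(k)

# trend variable: least-squares slope of the last 7 known demand values
df["demand_slope7"] = (
    df["demand"].shift(1).rolling(7)
    .apply(lambda win: np.polyfit(np.arange(7), win, 1)[0], raw=True)
)
```

Only lagged columns are then fed to the models, since today’s demand and tomorrow’s weather are unknown at prediction time.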

Team Details

Theingi Zaw, Vadym Kulish, Gomathypriya Dhanapal, Sougata Deb
