Analytics And Intelligent Systems

NUS ISS AIS Practice Group




Source: Click here for the paper

Submitted by:

Team Incognito

Pankaj Mitra (A0163319E), Deepthi Suresh (A0163328E), Anand Rajan (A0163200B) and Neeraja Lalitha Muralidharan (A0163327H)

Spatial-Temporal Analytics with Students Data to recommend optimum regions to stay — May 7, 2017


Objectives & Motivation:

The objective is to find convenient places for students to stay, based on location data collected in Singapore.

Any international student arriving in Singapore needs reliable information about convenient places to rent. To this end, we analysed movement data collected by students of the NUS ISS school and performed exploratory spatial data analysis on it to find patterns and insights. We considered the basic amenities a student would value: economical housing, cycling paths, libraries, proximity to MRT stations, parks and hawker centres.

Data Sources:

Dataset description                    Data Source URL
Data points of all NUS-ISS students    IVLE (app used: OpenPaths)
Dwelling data
Hawker Centres
Cycling Path
Park Connector Loop




OpenPaths data containing students' personal location information was collected for the month of April. Data cleaning was performed to refine it: because the data was recorded on multiple mobile devices, the date and time formats were inconsistent, so all date and time values were converted to a single standard format. Separate date and time columns were then created for further analysis. Outliers were treated in R, and data points outside Singapore were discarded. The Name and MailID columns were modified in Excel to fill in missing details.

Exploratory Analysis of Students Data (with derived fields):

Mean Centre:


From the above exploration, it is evident that most students spend their time at the university, travel to their residences, and use the MRT most of the time. Since students would prefer to stay near the university, close to an MRT station, and with the option of using cycling paths, our aim is to find and suggest better zones for student life. Thus, we need to add these layers to the student data and carry out further analysis.


To get more insights into suitable dwellings, we analysed several datasets separately: HDB dwelling population data, hawker centre locations, MRT stations and cycling paths.

Geographically Weighted Regression:

To find insights from the student data using the HDB and population data layer, geographically weighted regression (GWR) was performed using the variables shown below, under the assumption that data points with late-night timestamps represent a student's home. A spatial join was performed on the hawker centre, dwelling and OpenPaths student data layers to obtain the final GWR model.

Explanatory Variables:

The Count_ variable is the number of student data points per polygon, Count_1 is the number of hawker centres per polygon, SHAPE_Area is the area of the dwelling-layer polygon, and HDB is the total HDB count.

GWR Model

The GWR results show that the model performs with moderate accuracy, with an adjusted R² of 0.31.

The Spatial Autocorrelation (Moran's I) tool was run on the regression residuals to check that they are spatially random. Statistically significant clustering of high and/or low residuals (model under- and over-prediction) indicates that our GWR model is not accurate enough for prediction.
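The Moran's I check above was run in ArcGIS. As an illustration of the statistic itself, here is a minimal NumPy sketch using synthetic residuals and a toy distance-band weight matrix, not the project's data:

```python
import numpy as np

def morans_i(values, coords, band=1.5):
    """Global Moran's I with binary distance-band weights (w_ij = 1 if
    points i and j are within `band` of each other, 0 otherwise)."""
    x = np.asarray(values, dtype=float)
    pts = np.asarray(coords, dtype=float)
    n = len(x)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    w = ((d > 0) & (d <= band)).astype(float)   # no self-neighbours
    z = x - x.mean()
    num = (w * np.outer(z, z)).sum()
    return n / w.sum() * num / (z ** 2).sum()

# Residuals clustered in space: low values on the left half of a 4x4 grid,
# high values on the right -> strongly positive Moran's I.
grid = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
resid = np.where(grid[:, 0] < 2, -1.0, 1.0)
i_clustered = morans_i(resid, grid)

# The same residuals shuffled at random should sit much closer to zero,
# which is what a well-behaved regression's residuals should look like.
rng = np.random.default_rng(0)
i_random = morans_i(rng.permutation(resid), grid)
```

A strongly positive Moran's I on the residuals, as in the clustered case here, is the signal that led the team to reject the GWR model for prediction.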

GWR Fail

The map below shows the localised R² for predicting the HDB population from student data. The dark red regions are places where the model predicts with higher accuracy, and the blue regions are places where its prediction is poor.


As the model residuals are not spatially random based on the spatial autocorrelation test (Moran's I), the model cannot be used for prediction purposes. It was built only to study the relationship between the student data and the additional data layer.

Thus, to find convenient places for students, zones were ranked using the student data points and factors such as MRT access, cycling paths and hawker centre availability.


Ranking zones based on Hawker Centres:

To obtain the density of hawker centres in each zone, we calculated a new field that was then used for ranking:

The Reclassify tool was used to rank regions based on hawker centre density, producing zones ranked by food affordability.
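The reclassification step amounts to binning a continuous density field into ordinal ranks. A sketch with NumPy; the densities and class breaks below are hypothetical:

```python
import numpy as np

# Hypothetical hawker-centre densities (centres per km^2) for six zones.
density = np.array([0.2, 1.8, 0.9, 3.5, 0.0, 2.4])

# The class breaks play the role of the ArcGIS Reclassify table:
# [0, 0.5) -> rank 1 (poor), [0.5, 1.5) -> 2, [1.5, 3) -> 3, >= 3 -> 4 (best).
breaks = [0.5, 1.5, 3.0]
rank = np.digitize(density, breaks) + 1
```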


Ranking zones Based on proximity to cycling paths:

Areas are ranked by their proximity to cycling paths: the data is converted to raster form and then ranked.


Ranking zones based on MRT by reclassify:

Easy access to public transport is one of the major considerations when choosing a place to stay, so we used proximity to MRT stations as a ranking factor. MRT location data was converted to raster form, and areas were ranked by their proximity to MRT stations.


 Final Ranking:

The final rank for each zone in Singapore was calculated as the average of the other three ranks (MRT, hawker centre and cycling path). Highly ranked areas are the most favourable for staying, and this final ranking can help a student choose a place based on individual priorities.
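The final overlay reduces to averaging the three factor ranks per zone. A toy sketch with made-up ranks (higher = better):

```python
import numpy as np

# Hypothetical per-zone ranks from the three factor layers.
mrt_rank    = np.array([4, 2, 3, 1])
hawker_rank = np.array([3, 3, 2, 1])
cycle_rank  = np.array([4, 1, 3, 2])

# Final rank = average of the three factor ranks, as in the raster overlay.
final_rank = (mrt_rank + hawker_rank + cycle_rank) / 3

# The zone with the highest average is the most favourable place to stay.
best_zone = int(np.argmax(final_rank))
```

Replacing the simple average with a weighted one would let a student express individual priorities, as suggested in the conclusion.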


 Final Rank




Using the final ranking, we can recommend convenient places to stay to a new student coming to study at NUS, taking MRT access, cycling paths and food places into account.

As expected, the better-ranked zones cluster near NUS itself, though the ranking also suggests other areas.

As future work, we could add a configurable element that replaces the hawker centre layer with other layers such as libraries, parks, HDB rental prices or bus stops, forming an ideal tool for incoming students.

For detailed analysis, please read the entire report: – Spatial-Temporal Analytics with Students Data

Explored and submitted by: 





LI MEIYAO (A0163379U)




To find most accessible study areas for students in NUS

Problem Space:


Analysis Strategy:

  • What? – Availability of study areas for students
  • Where? – Inside NUS campus
  • Why? – The various reasons for the preferred locations
  • How? – Finding the clusters around various study locations

Data Exploration and Feature Addition:

  • Data source is OpenPaths
  • The class data was cleansed, and the records within the geographical coordinates of Singapore/NUS were retained
  • The initial analysis of the sampled data was performed using tools such as R, Carto and ArcGIS to study the geographic spread
  • A new dataset was created to represent the various study centres in NUS
  • Transformations were performed to get the variables into the format required for geo-visualisation
  • Reverse geocoding was performed on the dataset using the ‘ggmap’ package in R, and the corresponding locations were obtained

Preliminary Analysis:


  • Carto was used to analyse the spread of the data points during and after class hours
  • The results showed that after class hours the population spread is greater at University Town, owing to the availability of more study areas and facilities



Base Map Creation and Polygon Generation:


Addition of Layers:


Step 3: Loaded the class data (namely the master class dataset) and converted the coordinates for visibility

Density Analysis:



Step 6: The high-density area was in and around the National University of Singapore. We can conclude that the data points belong to people who either work or study at NUS.

Assumption and Addition of new Layer:


Hot Spot Analysis:

Step 8: Hot Spot Analysis was performed, Arc Tool Box->Spatial Statistics Tools->Mapping Clusters->Optimized Hotspot Analysis


Model Diagram – Proximity Analysis:


Step 9: Proximity Analysis was performed to find the most accessible study areas for students in NUS, via Arc Tool Box->Analysis Tools->Proximity->Near
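The core computation of the Near tool, attaching each point to its closest study-area feature together with the distance, can be sketched as follows. The coordinates below are hypothetical projected values in metres:

```python
import numpy as np

# Hypothetical student data points and candidate study areas.
points = np.array([[0.0, 0.0], [95.0, 10.0], [40.0, 60.0]])
study_areas = np.array([[10.0, 0.0], [100.0, 0.0], [50.0, 50.0]])

# Pairwise Euclidean distances: rows = points, columns = study areas.
dist = np.linalg.norm(points[:, None, :] - study_areas[None, :, :], axis=2)

nearest_area = dist.argmin(axis=1)   # index of the closest study area
near_dist = dist.min(axis=1)         # distance to it, in metres
```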

Proximity Analysis Results:


Inferences/Solution Outline:

  • Comparing the inference obtained from Carto with the model built in ArcGIS, we find that students focus mainly on University Town
  • From the model it is evident that other study areas could also be preferred, as data points lie close to these study areas
  • Students can explore other areas such as FoE and SoC for holding discussion sessions
  • We could propose a new study area at an optimal location based on the geographic distribution of student data, should the number of enrolled students increase over time


Assumptions and Limitations:

  • The data is students’ location data
  • The sample size is limited to ISS students only
  • Accurate shuttle bus timing data was not available
  • Enrolment numbers for each faculty were not available

Team Name: Incognito

Team Members: Pankaj Mitra, Deepthi Suresh, Anand Rajan, Neeraja Lalitha Muralidharan, Nandini Paila Reddy

Delay Estimation in Pedestrian crossing — April 29, 2017


Team Name: Incognito

Team Members: Deepthi Suresh, Neeraja Lalitha Muralidharan, Pankaj Mitra, Sindhu Rumesh Kumar


This study uses multiple linear regression:

1. To provide theoretical support for traffic management and control

2. To increase efficiency at intersections and improve safety

One-page journal summary of “Estimates of pedestrian crossing delay based on multiple linear regression and application”, authored by Li Dan and Xiaofa Shi.

Journal_Pedestrian Crossing

Click here for Journal.

Analysis on Workplace Injuries — March 17, 2017



The objective of this dashboard is to demonstrate different levels and types of injuries caused in the workplaces of Singapore.




  1. In which industry do the majority of injuries occur?

The majority of injuries occur consistently in the Construction and Manufacturing industries over the years, but in 2016 injuries in “Accommodation and Food Services” also rose.

  2. With the current workplace scenario, which type of injury occurs most frequently?

The number of minor injuries (94.9%) greatly exceeds the number of major (4.5%) and fatal (0.5%) injuries.

  3. Is there a trend in the number of injuries across industries through the years 2014-2016?

Fatal and major injuries are consistent through the years, while the number of minor injuries dropped in 2015 and increased drastically in 2016.

  4. What are the common types of minor injuries across different industries?

Injuries caused by slips, trips and falls are the most common across industries. Injuries caused by cuts or stabs from objects are more common in the Accommodation and Food Services industries, while workers in the Construction and Manufacturing industries are more often struck by moving objects.

  5. Why have minor injuries increased drastically from 2015 to 2016 while fatal and major injuries stayed consistent?

Good measures have been taken to avoid fatal injuries, yet minor injuries have not decreased. This could be because injuries that would have been fatal are now minor.


Final Analysis

We can clearly see that fatal and major injuries are very low, and steps have been taken to reduce them further; still, minor injuries are rising. This may be because injuries that would have been fatal have been converted into minor ones.

Being struck by moving or falling objects is one of the most common causes of minor injury in both the Construction and Manufacturing industries. As a common solution, wireless (heat and motion) sensors could be mounted on objects to sound an alert when they come within 50 m of a person. This way, people working on site would be aware of any moving objects in their vicinity and could move out of harm’s way.

The injuries may be minor, but if the injuries occur simultaneously to multiple people, it may affect the overall productivity of the company.


Tableau Public link:

Workplace Injuries in Singapore(2011-2016)

Submitted by:

Pankaj Mitra (A0163319E)

Deepthi Suresh (A0163328E)

Neeraja Lalitha Muralidharan (A0163327H)

Kriti Srivastasa (A0163206N)

Sindhu Rumesh Kumar (A0163342M)


Examining relationship between Airtel and Singtel Stock prices via Transfer Function — October 18, 2018


Team Ensemblers: Ankita Avadhani, Bethapudi Keerthnana, Rajan Dhingra, Sabrish Gopalakrishnan

1.     Background

Singtel and Airtel, on 14th June 2016, announced a strategic alliance.

The development of this combined network increased FDI from US$16 billion to US$22 billion in the first half of 2015 compared to the second half of 2014. Through this partnership, Singtel also strengthened its lead as the largest IP VPN provider in Asia Pacific with strong domestic data network in Australia, India and Singapore.

Singtel currently has a significant stake in Bharti Airtel in India and considers it one of its chief associates in the Asia Pacific region. Since 2018, owing to competition from Reliance, Bharti Airtel has seen a considerable reduction in subscriptions, chiefly due to users migrating from Airtel to Reliance Jio, and this has reduced the average share price of Bharti Airtel stock on the BSE.

Coincidentally, Singtel stock has also dropped over this period. Several expert articles have cited the declining performance of Airtel as one of the reasons.

Thus, it becomes an interesting exercise to determine the cause and effect between these entities and their stock prices. We investigate whether any relationship exists between the daily closing prices of Airtel and Singtel using transfer functions.

2.     Data Understanding

The data has been pulled from Yahoo finance.

The Airtel stock prices and Singtel stock prices have been pulled from 10 January 2018 – 10 September 2018. We consider the daily closing price as the variable of interest.

3.       Bi-Variate Plot

From the plot, we can see that the Airtel and Singtel stock prices move roughly together.


4.         Independent Variable (Airtel_Close):

We begin by choosing Airtel_Close as the independent variable.

 4.1       Tests for stationarity:

The ADF tau statistics for the causal covariate are -2.87 for the intercept-only model and -3.17 for the model with trend and intercept:


Checking for significance, under first differencing both the with-trend and without-trend tests indicate stationarity, which is sufficient for our purpose. Thus, we select d = 1 for pre-whitening; that is, the process is stationary under d = 1.



4.2         Fitting a stochastic model

The autocorrelation shows a non-linear decay, and the partial autocorrelation is insignificant after lag 1. We therefore start by fitting an ARI(1,1) model; its parameters are not significant, although the residuals are white noise.



Thus, we run the ARIMA Model Group to determine the stochastic model with the lowest AIC and significant parameters. The lowest AIC was achieved by ARIMA(2,2,3):



The residuals after fitting show no significant autocorrelation or partial autocorrelation, so they are white noise.


4.3         Pre-Whitening:

After pre-whitening the time series by fitting ARIMA(2,2,3), we see significant cross-correlation between the covariates, but only at negative lags: from lag -9 to lag -21, with a peak at -15.


This indicates that our initial choice of cause and effect needs to be reversed. Thus, we proceed further with reversing the dependent and independent variable roles.
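The delay parameter is read off the peak of the cross-correlation function between the pre-whitened series. A self-contained sketch on a synthetic pair of series built to have a 9-step delay (matching the b = 9 read off in the next section):

```python
import numpy as np

def cross_corr(x, y, max_lag):
    """Sample cross-correlation r_xy(k) for k = -max_lag..max_lag.
    r_xy(k) correlates x[t] with y[t + k]."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    out = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            out[k] = np.mean(x[: n - k] * y[k:])
        else:
            out[k] = np.mean(x[-k:] * y[: n + k])
    return out

rng = np.random.default_rng(1)
# "Singtel" is white noise; "Airtel" responds to it 9 steps later plus noise.
singtel = rng.normal(size=400)
airtel = 0.8 * np.roll(singtel, 9) + 0.2 * rng.normal(size=400)
airtel[:9] = rng.normal(size=9)  # discard the wrapped-around start

ccf = cross_corr(singtel, airtel, max_lag=15)
peak_lag = max(ccf, key=lambda k: abs(ccf[k]))
```

Here both series are already white noise, so no AR filter is needed; with real stock prices the same CCF is computed only after pre-whitening, exactly as in the post.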

5.         Independent Variable (Singtel_Close):

5.1         Fitting the Transfer Function

Based on the cross-correlation plot above, we find that:

b = 9

s = 15 − 9 = 6

r = 2 or 1 (begin with 2). Fitting the transfer function with these parameters yields:


However, we find that not all variables are significant. We now try r = 1, which gives a better model.



Thus, we choose our parameters as b = 9, s = 6 and r = 1.


We checked the residual plot and found that the residuals are white noise.

6          Results & Conclusion

Thus, in our search for a relationship between Singtel and Airtel stock prices, we found that Airtel stock prices may be modelled as a response to changes in Singtel stock prices. More noticeably, we see a delay in response of approximately 9 timestamp units after a change in the Singtel stock price. Thus, roughly speaking, the Airtel stock price reacts proportionately to a change in the Singtel stock price after about 9 days.

To Predict the Concentration of Atmospheric Particulate Matter in Northern Taiwan by Multiple Linear Regression — October 15, 2018


1. Introduction

An air quality monitoring dataset for Northern Taiwan was obtained from the Environmental Protection Administration of Taiwan. The dataset contains measurements from 25 observation stations located in Northern Taiwan and comprises 21 meteorological and environmental parameters measured in 2015.

2. Objective

The aim of this report is to build a predictive model using Multiple Linear Regression to predict the concentration of atmospheric particulate matter with diameter less than 2.5 micrometer (PM2.5) based on other measured environmental and meteorological data.

3. Data Preparation

Prior to using the dataset for analysis, a series of data preparation activities were carried out. This includes a sanity check of the data reported for each variable to ensure that the range of values reported for each variable were logical. The steps involved in data preparation are detailed in the following sub sections.

3.1 Invalid values

As identified in the data dictionary (included in the Appendix), invalid values due to equipment inspection, program inspection and human inspection were flagged with “#”, “*” and “x” respectively. As these values could not be imputed reliably, listwise deletion was used to remove all rows containing such invalid values.

3.2 Data Imputation

As indicated in the data dictionary, three variables (PH_RAIN, RAINFALL and RAIN_COND) had “NR” (No Rainfall) values. While “NR” under RAINFALL could be imputed as 0, since no rainfall means 0 mm of rainfall was measured, observations with “NR” under PH_RAIN and RAIN_COND had to be removed, as these values cannot be expressed numerically.

3.3 Removal of Date/Time and Location

Considering that the 21 environmental variables contain information such as humidity and ambient temperature, it was determined that they sufficiently describe the conditions of the location and the time at which the data were collected. As such, Date/Time and Location were excluded from our analysis.

3.4 Missing Values

The remaining observations which had missing values after the above steps were removed due to the lack of domain knowledge. It was decided that imputation without the necessary domain knowledge may skew the dataset, and make the modelling less accurate.

4. Multiple Linear Regression Model Building

4.1 Visual Inspection of the Scatterplot of PM2.5

Figure 1: Scatterplot of PM2.5 versus all other variables

After data preparation was completed, a scatterplot of PM2.5 against all other variables was generated to investigate whether there is any evidence of linearity with the other predictors. From Figure 1, it is noted that there is very strong evidence of linearity between PM10 and PM2.5. This is logical, because the definition of PM10 (particulate matter less than 10 micrometres in diameter) also covers PM2.5. Apart from this, PH_RAIN and NO2 also appear to follow the trend of the fitted line. As such, this dataset was deemed suitable for multiple linear regression.

4.2 Approach

To investigate if the 20 variables are sufficiently significant in explaining the variance of the dataset, these variables would be used in model fitting with their p values evaluated against a level of significance of 0.05, and the adjusted R² reviewed. Once the predictors have been finalized, their Variance Inflationary Factor (VIF) would be checked to ensure that no multi-collinearity exists between themselves.

4.3 Training and Validation Data Split

After the data preparation stage, the final dataset contains a total of 583 samples with 1 target variable and 20 predictors. Considering that the ratio of observations to predictors exceeds 10:1, the size of the dataset was deemed sufficient for multiple linear regression. The dataset was then split into training and test data in a 70:30 ratio, so that sufficient data was available to train and validate the model.

4.4 Training Data Model

4.4.1 Model Building

From the training dataset obtained in section 4.3, a linear model was fitted with PM2.5 as the dependent variable. Based on the results generated in R, the least significant variable was iteratively dropped, with the model re-evaluated after each removal. This process was complete once no insignificant variables remained.

Table 1: Removal of Variables and Resulting Adjusted R²

Sequence of Removal Variable Dropped Adjusted R²
initial model 0.8106
1 NO 0.8111
2 NMHC 0.8116
3 RAINFALL 0.812
4 RAIN_COND 0.8125
5 WIND_SPEED 0.8128
6 WIND_DIREC 0.813
7 CH4 0.8129
8 NO2 0.8122
9 THC 0.8117
10 WD_HR 0.8112
11 NOx 0.8105

The significant variables remaining are listed in Table 2. At this stage, the adjusted R² of the model is 0.8105. In addition, by inspecting the values of Estimate ± Standard Error, it’s verified that the coefficients of these variables are non-zero. As such, the finalised variables are considered sufficient for further analysis.

Table 2: List of Variables, Estimate, Standard Error and P Value

Variable Interpretation Estimate Standard Error P value
AMB_TEMP Ambient Air Temperature -0.24038 0.04669 0.000000414
CO Carbon Monoxide 4.93266 0.93706 0.000000231
O3 Ozone 0.06573 0.02042 0.001388
PH_RAIN pH of Rain -1.35644 0.34374 0.0000938
PM10 Particulate Matter ≤ 10µm 0.47179 0.01566 < 2e-16
RH Relative Humidity 0.18825 0.04953 0.000167
SO2 Sulphur Dioxide 0.40099 0.18984 0.035289
UVB Ultraviolet B 0.66109 0.29199 0.024104
WS_HR Average Wind Speed per Hour -0.59653 0.17656 0.0008

4.4.2 Residual Analysis

Figure 2: Diagnostic Plots for Linear Regression Analysis

Non-Linearity between Dependent and Predictor Variables

From Figure 2, the Residuals versus Fitted plot shows no non-linear relationship between the predictors and the dependent variable. This means that the multiple linear regression model is capable of capturing the data in the training dataset.

Normality of Residuals

From the Normal Q-Q plot, a straight line is observed. This satisfies the assumption that the residuals are normally distributed.

Homoscedasticity of Residuals

From the Scale-Location plot, there appears to be evidence of heteroscedasticity, as the red line moves upwards as the fitted values increase beyond 30. Further verification through the Breusch-Pagan test in R confirms that heteroscedasticity is present, with a p-value of less than 2.2e-16. This violates the linear regression assumption that the variance of the residuals is constant.

Effect of Outliers on Model

Lastly, the Residuals versus Leverage plot indicates that there are no outliers with a high Cook’s distance. This means that any outliers in our dataset are not significant in influencing the fitted linear regression model.

4.4.3 Correcting the model for Heteroscedasticity

Since heteroscedasticity was observed in our model as indicated in section 4.4.2, two approaches were explored in an attempt to correct it: transformation of the dependent variable, and a weighted least squares model.

Dependent Variable Transformation

Various approaches (e.g. applying a natural logarithmic function to PM2.5) were explored to transform the dependent variable before rebuilding a new model. However, the approach that yielded the best result was a Box-Cox transformation, whose intent is to find a value of λ such that the dependent variable is transformed to y(λ) = (y^λ − 1) / λ for λ ≠ 0 (and ln(y) for λ = 0). Owing to the limitation that the transformation requires positive values, 13 instances of the training dataset where PM2.5 = 0 were removed. Based on the transformation, λ = 0.4 was obtained. Building a new model on the transformed dependent variable yielded the results in Table 3, with an adjusted R² of 0.7577.
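The Box-Cox fit can be sketched with SciPy. The data below is synthetic positive data, so the λ found here will differ from the report's 0.4:

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(5)

# Positive, right-skewed stand-in for PM2.5 readings; Box-Cox requires y > 0,
# which is why the zero PM2.5 rows had to be dropped in the report.
pm25 = rng.lognormal(mean=3, sigma=0.5, size=300)

# boxcox returns the transformed data and the lambda that maximises
# the log-likelihood.
pm25_t, lam = boxcox(pm25)

# The transform itself: (y**lam - 1) / lam  (ln y in the limit lam -> 0).
manual = (pm25 ** lam - 1) / lam
```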

Table 3: List of Variables, Estimate, Standard Error and P Value (Transformed Dependent Variable)

Variable Interpretation Estimate Standard Error P value
AMB_TEMP Ambient Air Temperature -0.05957 0.009197 2.82E-10
CO Carbon Monoxide 0.76531 0.3363 0.02341
NMHC Non-Methane Hydrocarbon 1.38527 0.65308 0.03454
O3 Ozone 0.01689 0.00424 8.2E-05
PH_RAIN pH of Rain -0.3407 0.07574 9.1E-06
PM10 Particulate Matter ≤ 10µm 0.08149 0.00337 < 2E-16
SO2 Sulphur Dioxide 0.08564 0.04279 0.04606
WS_HR Average Wind Speed per Hour -0.1588 0.04859 0.00118
Figure 3: Diagnostic Plots for Linear Regression Analysis (Transformed Dependent Variable)

On inspecting the residual diagnostic plots of the newly built model, it is noted that there is an improvement in the Scale-Location plot, while the characteristics of the other three plots remain unchanged. However, on verifying the model with the Breusch-Pagan test, the residuals are still heteroscedastic: a p-value of 4.766e-08 rejects the null hypothesis that the residuals are homoscedastic.

Weighted Least Squares

Since the initial model was built with ordinary least squares estimators, heteroscedasticity would make the computed standard errors unreliable. Accordingly, the initial model was transformed into one with homoscedastic errors using a generalised least squares estimator.

It is noted that this transformation requires a logarithmic function to be applied on all predictors. As such, observations wherein the variables contained zero value were removed (2 occurrences of SO2 and 1 occurrence of WS_HR), while UVB was removed from the dataset altogether as there were 233 occurrences of zero values for this variable, amounting to more than half of the training dataset. This resulted in a training dataset of 407 observations with 19 variables used to build a new model applying weighted least squares.

The fitted models are as follows:

Ordinary Least Squares:


Generalised Least Squares:


However, the adjusted R² of the Generalised Least Squares model dropped from 0.8105 to 0.8047, while the p-value of the Breusch-Pagan Test did not improve. Hence, this approach was not considered to be appropriate, and the dependent variable transformation approach was adopted instead. 

4.5 Multi-Collinearity

Based on the model built in section 4.4.3, a VIF test was done to inspect the finalised dataset for multi-collinearity.

Table 4: VIF Results

VIF values (one per variable in Table 3):

1.200704 5.088495 5.868661 1.933892 1.282449 1.304998 1.247557 1.276881

From the results, it was noted that CO and NMHC exhibit multi-collinearity against each other. NMHC was dropped due to its lower significance compared to CO, resulting in a model without multi-collinearity. At this point, the adjusted R² is 0.7556, which is a negligible drop from the earlier model. In addition, the residual diagnostic plots did not exhibit any difference due to the removal of NMHC from the model.

4.6 Model Accuracy

With the model finalised, the validation dataset was used to evaluate the prediction accuracy of the model. Predicted values of the transformed dependent variable were generated with λ = 0.4 as per section 4.4.3, and these were compared with the actual values of the transformed dependent variable using the Mean Absolute Percentage Error (MAPE).

The formula of MAPE is given by:

MAPE = (100% / n) × Σ |A_t − F_t| / A_t

where n is the number of observations, A_t is the actual value, and F_t is the forecasted value.

Since a limitation of MAPE is that actual values cannot be zero (division by zero), two such occurrences were removed from the test dataset first. This gives a MAPE of 25.8%. While this result is not ideal, it is a marked improvement in accuracy over the original model built in section 4.4.1 before the dependent variable transformation, whose MAPE comes to 34.3%.
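A minimal implementation matching the formula, with the zero-actual exclusion applied first:

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error; zero actuals are excluded first,
    as done with the two zero-valued test observations in the report."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    nonzero = actual != 0
    a, f = actual[nonzero], forecast[nonzero]
    return 100.0 * np.mean(np.abs(a - f) / np.abs(a))

# Toy example: errors of 10% and 20%, plus one zero actual that is excluded.
value = mape([10.0, 20.0, 0.0], [9.0, 24.0, 1.0])
```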

5. Final Model

The formula of the final model is given by:

Eqn 5

6. Conclusion

In this report, an air quality monitoring dataset for Northern Taiwan was used to build a Multiple Linear Regression model to predict the concentration of PM2.5 based on other measured environmental and meteorological data. While the dataset was assessed to be suitable for Linear Regression, heteroscedasticity of the residuals was observed during the initial residual analysis. The dependent variable was transformed using Box Cox Transformation to correct for heteroscedasticity, resulting in an approximately 10% improvement in model accuracy as compared to the original approach, with the final model having a MAPE of 25.8% based on the validation dataset. Moving forward, an alternative predictive model such as logistic regression through categorizing PM2.5 into dichotomous levels (e.g. Healthy, Unhealthy) could be explored to improve the accuracy of the predictive model, as homoscedasticity is not required for error terms in logistic regression.

Appendix A

The following metadata is provided by the Environmental Protection Administration of Taiwan together with the raw data.

The columns in csv file are:

  • time – The first column is the observation time in 2015
  • station – The second column is the station name; there are 25 observation stations
    • [Banqiao, Cailiao, Datong, Dayuan, Guanyin, Guting, Keelung, Linkou, Longtan, Pingzhen, Sanchong, Shilin, Songshan, Tamsui, Taoyuan, Tucheng, Wanhua, Wanli, Xindian, Xinzhuang, Xizhi, Yangming, Yonghe, Zhongli, Zhongshan]
  • items – From the third column to the last one
    • item – unit – description
    • SO2 – ppb – Sulfur dioxide
    • CO – ppm – Carbon monoxide
    • O3 – ppb – ozone
    • PM10 – μg/m3 – Particulate matter
    • PM2.5 – μg/m3 – Fine particulate matter
    • NOx – ppb – Nitrogen oxides
    • NO – ppb – Nitric oxide
    • NO2 – ppb – Nitrogen dioxide
    • THC – ppm – Total Hydrocarbons
    • NMHC – ppm – Non-Methane Hydrocarbon
    • CH4 – ppm – Methane
    • UVB – UVI – Ultraviolet index
    • AMB_TEMP – Celsius – Ambient air temperature
    • RAINFALL – mm
    • RH – % – Relative humidity
    • WIND_SPEED – m/sec – The average of the last ten minutes per hour
    • WIND_DIREC – degrees – The average of the last ten minutes per hour
    • WS_HR – m/sec – The hourly average
    • WD_HR – degrees – The hourly average
    • PH_RAIN – PH – Acid rain
    • RAIN_COND – μS/cm – Conductivity of acid rain

Data mark

  • # indicates invalid value by equipment inspection
  • * indicates invalid value by program inspection
  • x indicates invalid value by human inspection
  • NR indicates no rainfall
  • blank indicates no data


Team members:

Student ID Name
A0178551X Choo Ming Hui Raymond
A0178431A Huang Qingyi
A0178415Y Jiang Zhiyuan
A0178365R Wang Jingli
A0178500J Yang Chia Lieh
Relationship between Household Electricity Consumption and Surface Air Temperature in Singapore —


1. Introduction

According to the National Environment Agency (NEA), Singapore’s household electricity consumption has increased by about 17 per cent over the past decade [1]. A 2017 NEA survey of 550 households found that air-conditioners were largely to blame, accounting for about 24 per cent of the consumption of a typical home [1]. However, it remains unknown if households currently use their air-conditioners out of habit (e.g. turning on the air-conditioner every day), or when the need arises. In addition, when referenced against the surface air temperature, there might be a delay in turning on the air-conditioner as our sensation of heat depends not only on temperature, but other factors such as humidity, cloud cover, sun intensity and wind [2]. Hence, the intent of this study is to investigate if there is a relationship between household electricity consumption and the surface air temperature in Singapore.

2. Hypothesis

This study hypothesises that:

  1. Household Electricity Consumption in Singapore is influenced by the Surface Air Temperature of Singapore
  2. There are no appreciable lags between the two variables

3. Data Sources

Table 1: Data Sources

Variable Frequency Source
Mean Surface Air Temperature Monthly National Environment Agency (NEA)
Household Electricity Consumption Monthly Energy Market Authority (EMA)

To perform analysis on these variables, data sources with the same time resolution were first obtained. The Mean Surface Air Temperature was taken from the “Surface Air Temperature – Monthly Mean” dataset published by NEA, while the Household Electricity Consumption came from the “Monthly Electricity Consumption by Sector” dataset published by the Energy Market Authority (EMA) of Singapore.

It was noted that “Monthly Household Electricity Consumption” records are only available up to September 2015. Hence, this study utilised data from October 2010 to September 2015 (5 years), providing a total of 60 monthly data points for analysis.

4. Analysis of Input Series

4.1 Pre-whitening of Input Series

Based on our hypothesis in section 2, the input series was taken to be the Mean Surface Air Temperature. Initial inspection suggested that the data was stationary; however, the time series and Autocorrelation Function (ACF) plots revealed a 12-month seasonality. This is consistent with the knowledge that weather in Singapore follows a 12-month cycle, largely characterised by two monsoon seasons – the Northeast Monsoon (December to early March) and the Southwest Monsoon (June to September) [3]. As such, the first step was to apply a first-order seasonal differencing of 12 months to the input series.
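The seasonal differencing step can be sketched in Python. This is a minimal illustration on a synthetic monthly series standing in for the NEA data, not the project's actual dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the monthly Mean Surface Air Temperature series:
# a 12-month seasonal cycle plus noise (the real data comes from NEA).
rng = np.random.default_rng(42)
months = pd.date_range("2010-10-01", periods=60, freq="MS")
seasonal = 27.5 + 1.5 * np.sin(2 * np.pi * np.arange(60) / 12)
temp = pd.Series(seasonal + rng.normal(0, 0.3, 60), index=months)

# First-order seasonal differencing with season = 12 months: z_t = x_t - x_{t-12}
z = temp.diff(12).dropna()

print(len(z))  # 60 - 12 = 48 points remain after differencing
```

Because the synthetic seasonal cycle repeats exactly every 12 months, the differenced series `z` contains only the noise component, which is what the ACF/PACF plots in Figure 2 would then be inspected for.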

Figure 1
Figure 1: Basis for first order seasonal differencing, season = 12 months
Figure 2
Figure 2: ACF and PACF plots after first order seasonal differencing, season = 12 months

The resulting Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots (Figure 2) indicated that the resulting residuals are white noise. Hence, the Seasonal Difference model built for the input series is taken to be (0,0,0),(0,1,0)12 at this stage.

5. Analysis of Output Series

5.1 Pre-whitening of Output Series

With the Seasonal Difference Model for the input series identified in section 4.1, the output series (Household Electricity Consumption) was pre-whitened with the same model. Based on the ACF and PACF plots in Figure 3, it is observed that the PACF plot dies down while the ACF plot cuts off at the first lag. As such, an MA(1) term was added to the Seasonal Difference model in an attempt to achieve white noise residuals.

Figure 3
Figure 3: (0,0,0),(0,1,0)12 I model applied on output series

From Figure 4, it was observed that the added MA(1) term was significant. Akaike's Information Criterion (AIC) was deemed acceptable, and the model fitted the 60 points closely, projecting a forecast that not only repeats the trend but also has a tight confidence interval. The residuals of this (0,0,1),(0,1,0)12 IMA model have also achieved white noise, as evidenced in the ACF and PACF plots in Figure 5. Hence, the IMA model built for the output series is taken to be (0,0,1),(0,1,0)12 at this stage.

Figure 4
Figure 4: (0,0,1),(0,1,0)12 IMA Model (Output Series)
Figure 5
Figure 5: ACF & PACF Plots for (0,0,1),(0,1,0)12 IMA Model (Output Series)

5.2 Verification of white noise residuals on Input Series

Since the input series was previously pre-whitened with a (0,0,0),(0,1,0)12 Seasonal Difference model, the IMA model (0,0,1),(0,1,0)12 developed for the output series was applied to the input series to verify that pre-whitening the input series with the output IMA model would still result in white noise residuals.

Based on Figure 6, it was observed that the residuals remained white noise. In addition, the added MA(1) term was not significant. This was assessed to be reasonable, as the original input series did not require an MA(1) term to achieve white noise during the previous pre-whitening stage. Hence, the final model chosen is a (0,0,1),(0,1,0)12 IMA model.

Figure 6
Figure 6: ACF & PACF Plots for (0,0,1),(0,1,0)12 IMA Model (Input Series)
Figure 7
Figure 7: (0,0,1),(0,1,0)12 IMA Model (Input Series)

6. Transfer Function Modelling

6.1 Cross-correlation between Pre-Whitened Series

With the (0,0,1),(0,1,0)12 IMA model finalised in section 5.2, the cross-correlation between the pre-whitened input and output series was computed. From the results shown in Figure 8, significant correlation was observed at lags 0 to 2. Hence, the transfer function model will resemble the following equation:

eqn 6.1

where v0, v1 and v2 are the transfer function weights to be determined.

Figure 8
Figure 8: Pre-whitening Plot (0,0,1),(0,1,0)12
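The cross-correlation computation behind Figure 8 can be sketched with NumPy. The residual arrays here are hypothetical stand-ins; in the study they come from applying the (0,0,1),(0,1,0)12 model to both series:

```python
import numpy as np

def cross_correlation(x, y, max_lag=12):
    """Sample cross-correlation r_xy(k) between input residuals x and
    output residuals y for lags k = 0..max_lag, where lag k means
    x leads y by k periods."""
    x = (np.asarray(x, dtype=float) - np.mean(x)) / np.std(x)
    y = (np.asarray(y, dtype=float) - np.mean(y)) / np.std(y)
    n = len(x)
    return {k: float(np.sum(x[: n - k] * y[k:]) / n) for k in range(max_lag + 1)}

# Hypothetical pre-whitened residuals: the output responds to the input at lag 0.
rng = np.random.default_rng(0)
x = rng.normal(size=48)
y = 0.8 * x + rng.normal(scale=0.5, size=48)

ccf = cross_correlation(x, y)
# Against an approximate 5% significance band of +/- 2/sqrt(n),
# only lag 0 stands out for this synthetic pair.
```

Lags whose correlation exceeds the significance band are the ones that enter the transfer function, which is how lags 0 to 2 were read off Figure 8.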

6.2 Transfer Function Model Parameters

An alternate method to represent the model would be as follows:

eqn 6.2.1


eqn 6.2.2

In order to determine the transfer function model, the parameters b, s and r must first be identified. From Figure 8, the first non-zero correlation occurs at lag 0, which means that the input and output series are in sync; hence b = 0. Meanwhile, the cross-correlation peak was observed at lag 1, hence s = 1. Lastly, the cross-correlation appeared to oscillate about 0 in the pre-whitening plot, hence r = 2 was selected.
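The read-off of b (the delay to the first significant cross-correlation) can be made mechanical with a small helper. This is a sketch of the visual procedure described above, not a JMP feature, and the significance band is the usual approximation:

```python
import math

def identify_b(ccf, n):
    """Return b: the first lag whose cross-correlation exceeds the
    approximate 5% significance band +/- 2/sqrt(n), or None if no
    lag is significant."""
    band = 2.0 / math.sqrt(n)
    for lag, r in enumerate(ccf):
        if abs(r) > band:
            return lag
    return None

# Values shaped like Figure 8: significant correlations starting at lag 0.
ccf = [0.45, 0.60, 0.35, 0.10, -0.05]
print(identify_b(ccf, n=48))  # 0 -> input and output series are in sync
```

With b fixed, s is read from the distance between the first significant lag and the peak, and r from whether the tail decays exponentially (r = 1) or oscillates (r = 2), as done in the text.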

6.3 Transfer Function Model Evaluation

With the transfer function model parameters selected in the previous section, the model was fitted and evaluated. From Figure 9, it was observed that both the AIC and Schwarz's Bayesian Criterion (SBC) are reasonably small, and the parameters used in fitting the model were all statistically significant. Figure 10 also shows that the residuals of this transfer function model were white noise. Compared with an alternate transfer function model where r = 1 (b = 0, s = 1, r = 1), this model was assessed to be better in terms of AIC and SBC. Hence, the final transfer function model parameters selected were (b = 0, s = 1, r = 2).

Figure 9
Figure 9: Transfer Function Model Summary (b = 0, s = 1, r = 2)
Figure 10
Figure 10: ACF & PACF Plot of Transfer Function Model (b = 0, s = 1, r = 2)
Figure 11
Figure 11: Comparison between Transfer Function Models (b = 0, s = 1, r = 2) and (b = 0, s = 1, r = 1)

7. Relationship between Household Electricity Consumption and Mean Surface Air Temperature

Based on the pre-whitening plot in Figure 8, it was initially concluded that there was no lag between Household Electricity Consumption (y) and Mean Surface Air Temperature (x). However, expanding the transfer function model shown in Figure 9 yields the following relationship:

eqn 7.1

This shows that while there is no lag between x and y, past values of both variables also impact y. More importantly, it was observed that the coefficients of the y terms match those of the corresponding terms 12 months earlier. Rearranging the above formula gives the following:

eqn 7.2

This means that while yt-1, yt-2, xt, xt-1, et, et-1, et-2 and et-3 may explain the variability in y, an offset from last year's set of y terms is required to fully explain yt. This indicates an upward trend in household electricity consumption, assuming the mean surface air temperature remains the same every season.

8. Conclusion

This study hypothesised that there is an instantaneous, unidirectional relationship between the mean monthly surface air temperature of Singapore and the monthly household electricity consumption in Singapore. While some believe that human behaviour, such as being cost-conscious or simply a lack of awareness of rising temperatures, may influence and possibly delay the decision to switch on the air-conditioner, time series transfer function modelling of these two variables confirms that (1) household electricity consumption is influenced by mean surface air temperature and (2) the two series move in phase. It was further observed that there is a rising trend in household electricity consumption in Singapore, as consumption is autocorrelated with its previous season.

9. References

[1]      A. Zhaki, “Singapore’s household electricity consumption up 17 per cent over past decade,” The Straits Times, Singapore, 05-May-2018.

[2]      AccuWeather, “The AccuWeather RealFeel Temperature,” 2011. [Online]. Available: [Accessed: 18-Sep-2018].

[3]      Meteorological Service Singapore, “Climate of Singapore.” [Online]. Available: [Accessed: 18-Sep-2018].

Team Name Zero
Student ID Name
A0178551X Choo Ming Hui Raymond
A0178431A Huang Qingyi
A0178415Y Jiang Zhiyuan
A0178365R Wang Jingli
A0178329R Wong Yeng Fai, Edric
A0178371X Yang Shuting
The relationship of Bitcoin with price of GPU — Transfer function —

The relationship of Bitcoin with price of GPU — Transfer function

Executive Summary

Blockchain and cryptocurrency were introduced to the public in 2009 with the creation of Bitcoin by Satoshi Nakamoto. Over the last 9 years, the price of bitcoin has gone through a roller-coaster journey: it reached nearly $20,000 per bitcoin at its peak but has since plunged to $6,500, although this has not affected its position as the cryptocurrency with the highest market cap. Inspired by Satoshi, the Russian-Canadian programmer Vitalik Buterin initiated his Ethereum project in late 2013, and it soon became the second largest cryptocurrency globally. Just like natural gold and silver, bitcoin and ethereum require a process of “mining” to be generated. Mining is a computational process that consumes a great deal of processor power, and manufacturers have built dedicated mining machines for it. One key component of such miners is the GPU, and it is believed that demand from cryptocurrency mining has led to a change in GPU prices.

1.Data Description

A dataset was downloaded from Kaggle recording the historical prices of GPU chips and cryptocurrencies, scraped from price comparison websites. We selected two time series, GPU price and Bitcoin price, from January 1st to March 16th 2018 to study whether there is any relationship between them. There is a total of 75 records over 2.5 months of data.

Data link:

  1. Transfer Function


JMP was used to build the transfer function; the process is shown below:



Firstly, we selected the GPU price in USD as Y, Time Series, and the bitcoin price in USD (open) as the Input List.


We then observed that the PACF of the input cuts off at lag 1, and filled in these parameters using the ARIMA model group.



We found ARIMA(0,1,0)(0,1,1) has the best performance.


As shown above, we fitted ARIMA(0,1,0)(0,1,1) to the output GPU price and the result is also good.



We then used an ARIMA(0,1,0)(0,1,1) pre-whitening model as shown above. Because the first significantly non-zero cross-correlation occurs at lag -5, b is taken to be 3. Also, the values exhibit exponential decay after -4, which suggests that s = -4 - (-5) = 1. Usually, r is either 1 or 2, and we started with r = 1.


From the final result, we can conclude that the GPU price is correlated with the opening price of bitcoin. As the formula shows, the GPU price can be predicted from the previous day's bitcoin price.



Ni Haojun          A0178177N

Wang Ziyang     A0077924U

Xue Zehao        A0178283U

Zhang Guoxiao  A0178480W

Motor vehicle retail index and Motor vehicle deregistration —

Motor vehicle retail index and Motor vehicle deregistration


Problem statement

Does motor vehicle deregistration affect the retail index of motor vehicles in Singapore?

Data source

We use monthly data sourced from the Singapore Department of Statistics (DOS) for the period from Jan 2009 to Jul 2018 (115 data points). The dataset consists of 2 columns. For the output, we use the retail index (at constant prices) for motor vehicles per month in Singapore. For the input, we use the total motor vehicles deregistered under the vehicle quota system. The vehicle categories include cars, goods vehicles, buses and taxis.

Preparatory Analysis

Data Exploration

We observed that both the motor vehicle retail index and the motor vehicle deregistration data display a similar trend. Both series are non-stationary and show a rising trend which flattens over the recent period.


Using total deregistered vehicles as the input, we started with ARIMA and pre-whitening on the input set. Based on the PACF graph, which showed that the significance of the lags drops after lag 2, we chose AR(2) with a yearly seasonality of period 12.



We proceeded to apply ARIMA to the input data series. The combination selected for the motor vehicle deregistration series was (0,2,2)(0,2,2) period 12. We tested various combinations and noted that with (0,2,2)(0,2,2) period 12, the AIC is the lowest and all the coefficients are significant.




Based on the pre-whitening chart, we identified that the significant cross-correlation occurred at lag 3 (b = 3). However, the series decayed immediately, therefore s = 0.


Transfer function

Using the result from the pre-whitened series, where b = 3 and s = 0, we computed the transfer function using the combination (0,2,2)(0,2,2) period 12.

Iteration 1

Using r = 2 and s = 0, the numerator for motor vehicle deregistration is set to 0, indicating no correlation between the retail index and deregistered motor vehicles. The result shows that not all coefficients in the formula are significant, with AIC = 683.8 and SBC = 700.9.



Iteration 2

Using s = 1 and r = 2, we noted a negative correlation between the motor vehicle retail index and motor vehicle deregistration, with AIC at 678.9 and SBC at 700.8. Both measurements improved versus the model where s = 0. However, not all coefficients are significant.


Iteration 3

We repeated the steps with ARIMA (0,2,2)(0,2,2)12 with constant, using s = 1 and r = 1. We noted that the AIC at 674.8 and SBC at 694.2 are smaller compared to r = 2. We decided to select this model as our final one, even though not all coefficients are significant.


Transfer function model formula in JMP:


Solving the equation:


In the equation above, yt refers to the retail index at time period t, xt refers to the total motor vehicles deregistered at time t, and et is the error term.

Based on the equation solved, we noted that the impact of motor vehicle deregistration is very small at lags 3 and 4. On the other hand, the autocorrelation of the retail index is highest at lags 13 and 14. The largest error terms occur at lags 25 and 26, which highlights that the time series is not suitable for long-term prediction.


In conclusion, there is a very weak unidirectional relationship between motor vehicle deregistration and the motor vehicle retail index, where an increase in deregistered motor vehicles leads to an increase in the retail index. In addition, there is strong autocorrelation in the motor vehicle retail index.

Team Members:
Adarsh Chandrashekar, Khine Zin Win, Lim Si Yi, Rajneesh Singh, Sua Jo Nie
Singapore Personal Crime Statistics — March 19, 2018

Singapore Personal Crime Statistics

GQM Framework Storytelling


  • The Police department of Singapore
  • Law ministry
  • Crime Statistics wing


  • Find insights in the crime data using visualization capabilities which are otherwise difficult to see directly in the data
  • Identify and define key performance indicators and suggest action items

Key Questions

  • How is Singapore placed against other major cities?
    • Overall, Singapore is ranked 1st in the world for personal security and 2nd for overall security according to the 2017 Safety Index
  • What are the current crime statistics of Singapore?
    • Overall crime has been brought under control in the past 6 years, with the conviction rate down by as much as 46%
  • What is the demography of people in penal admissions?
    • Locals' involvement is found to be higher than that of foreigners coming to Singapore for work or other purposes
  • What are the major offenses in Singapore?
    • Drug-related and property-related crimes are the highest
  • Where are the moneylending case hotspots?
    • Woodlands tops the chart in Moneylending and Moneylending Harassment cases

KPI Metrics

  • Crime rate
  • Offense type distribution
  • Over the year change in crime
  • Over the year change in Remand
  • Penal admission by age
  • Penal admission by gender
  • Penal admission by education
  • Total Moneylending cases

Key Findings

  • Woodlands is the hotspot for Money Lending cases
  • Drug offenses and property crime top the list
  • Offenders with secondary-level education are much more numerous than those with primary or no education
  • Over the past 10 years, convictions have decreased while remand has increased



The dashboard can be accessed on Tableau Public.

Data Sources:

Team Neo:

Anusuya Manickavasagam (A0163300Y)

Kesavan Sridharan (A0163207M)

Muni Ranjan (A0163382E)

Pradeep Kumar (A0163453H)

Supply Chain Analytics — Optimization for Small Retail Firm — December 19, 2017

Supply Chain Analytics — Optimization for Small Retail Firm


Supply chain optimization is critical to ensure the optimal operation of a manufacturing and distribution supply chain; it refers to minimizing operating cost while meeting the demand of downstream customers. Nowadays, data analytics, the science of examining raw data to draw conclusions, is widely used in supply chain management and optimization to make better business decisions. The supply chain is a great place to apply analytics in search of a competitive advantage, because of its complexity and because of the prominent role the supply chain plays in a company's cost structure and profitability.

In this research, the data set is simulated according to a real business situation. It represents the historical data of a small fictitious retailer, including files of items, orders, purchase orders, suppliers, BOM, etc. Below is a dashboard of the data set.


The optimization is conducted based on the pull strategy and agile theory. Agile is defined as a strategy that is more responsive in a volatile marketplace and is totally demand driven; as consumers' buying patterns change at a rapid pace, so must the whole supply chain. The fundamental drivers of an agile supply chain are speed, cost and efficiency. A pull strategy is one where customer demand drives the entire production process, rather than a long-term demand forecast. Its advantages are higher service levels, lower carrying costs, decreased inventory levels, and fewer markdowns. We further conduct inventory optimization, including ABC analysis and stock reward function calculation, as well as route optimization for optimal route design.

Demand Forecasting

Demand forecasting is important for the other parts of supply chain management. As the retailer, we need to do operational short-term demand forecasting. For this project, we mainly predicted the quantity of each item. As our data records cover the period from January 2013 to September 2016, we carried out demand forecasting for October, November and December 2016. The data structure is shown below.


The data includes the order records of each product item in three different warehouses on different dates; each warehouse carries the same 106 items. As each item has very small order quantities, the data was rearranged into monthly orders. The three warehouse locations are Chicago, New York and Los Angeles. To simplify the presentation of our forecasting model, New York is chosen as the example for the forecasting and the later inventory optimization. A quantitative model is built to predict the future demand of each item. The following figure shows one item in one location over roughly three years of records.


As can be seen, the trend of a single product is not obvious. Some traditional machine learning methods were applied, but their performance was not satisfactory; we also tried prediction with a time series model, but the accuracy was not as good as expected.

The reason might be that the demand for each item in each warehouse is small, so the models are sensitive to tiny variations, and the seasonality is not obvious. We therefore moved to another forecasting method. Long short-term memory (LSTM) is an excellent deep learning method for time series forecasting. The partial structure of an LSTM is shown below.


The LSTM is a special kind of recurrent neural network (RNN) capable of learning long-term dependencies from previous data inputs. In our case, each neuron learns from the previous one, using the current month's demand together with the preceding months' demand. We set the time step to 6, which means each prediction depends on the previous 6 months of actual demand. After training our model on 45 months of data, we obtain October's demand; combining the predicted October demand with the history, we predict November, and then December. This realises information persistence and does not depend strictly on finding seasonality, so the results show better performance.
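The windowing and recursive three-month forecast described above can be sketched without a deep learning framework. The `MeanModel` below is a hypothetical stand-in for the trained LSTM; any fitted regressor with a `.predict` method slots in its place:

```python
import numpy as np

TIME_STEP = 6  # one prediction depends on the previous 6 months' demand

def make_windows(series, step=TIME_STEP):
    """Build supervised pairs: each row of X is 6 consecutive monthly
    demands, y is the following month's demand."""
    X = np.array([series[i : i + step] for i in range(len(series) - step)])
    y = np.array(series[step:])
    return X, y

def recursive_forecast(model, history, horizon=3, step=TIME_STEP):
    """Oct/Nov/Dec-style forecast: each prediction is appended to the
    history and fed back as input for the next month."""
    history = list(history)
    out = []
    for _ in range(horizon):
        window = np.array(history[-step:]).reshape(1, step)
        pred = float(model.predict(window)[0])
        out.append(pred)
        history.append(pred)
    return out

# Stand-in "model" that predicts the mean of the window; the project
# trained an LSTM on 45 months of per-item demand instead.
class MeanModel:
    def predict(self, X):
        return X.mean(axis=1)

demand = [12, 9, 14, 11, 10, 13, 12, 11, 10, 15, 9, 12]  # hypothetical monthly demand
X, y = make_windows(demand)
forecast = recursive_forecast(MeanModel(), demand)
print(len(X), len(forecast))  # 6 training windows, 3 forecast months
```

The recursive loop is what "combined with the predicted Oct demand, do the prediction for Nov" amounts to in code: each forecast month becomes part of the input window for the next.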

As we can see, the model fits the original demand well over the 45 months of training, and we also obtain the predicted demand for the last three months of 2016.


Then we repeated the same work for each item in each warehouse.

ABC Analysis

Used for: Inventory management and optimization

Based on: Pareto principle, 80/20 rule

Logic: vital few and trivial many

Objective: Maximize service AND Minimize inventory cost

Data preparation, Analysis and Results:

Inventory management is a challenging problem area in supply chain management. Companies need to hold inventories in warehouses to fulfil customer demand, but these inventories carry holding costs and tie up funds that can be lost. To support inventory management and optimization, ABC analysis is conducted to categorise inventory items into A, B and C classes based on an importance rating.


A data set of sales and inventory with over 100 types of products in 2015 is used for the analysis. We first derive fields such as the unit cost and the usage of each product over the year, then calculate the 2015 turnover, rank the products by net value, and obtain the cumulative cost and item counts. With the pre-processed data table in the figure above, we finally apply the ABC rules listed below to plot the ABC classification results as a line chart.



As Figure 3 reveals, just over 20% of the products are classified as type A, yet they make up over 70% of the net value. Type B follows with over 30% of the products accounting for almost 20% of the revenue. Finally, products in class C earn less than 10% of the net value despite making up 40% of the products, a large share. Overall this confirms the rule of thumb: vital few, trivial many.
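The ranking-and-cumulative-share logic behind this classification can be sketched in pandas. The 70%/90% cut-offs and the sample items are illustrative assumptions; the write-up's own thresholds may differ:

```python
import pandas as pd

def abc_classify(df, a_cut=0.70, b_cut=0.90):
    """Rank items by annual net value (unit cost x yearly usage), then
    assign A/B/C classes by cumulative share of total net value.
    The 70%/90% cut-offs are common defaults, not necessarily the
    project's exact rules."""
    out = df.copy()
    out["net_value"] = out["unit_cost"] * out["annual_usage"]
    out = out.sort_values("net_value", ascending=False)
    cum_share = out["net_value"].cumsum() / out["net_value"].sum()
    out["class"] = cum_share.apply(
        lambda s: "A" if s <= a_cut else ("B" if s <= b_cut else "C")
    )
    return out

# Hypothetical items standing in for the ~100-product data set.
items = pd.DataFrame({
    "item": ["i1", "i2", "i3", "i4", "i5"],
    "unit_cost": [50.0, 20.0, 5.0, 2.0, 1.0],
    "annual_usage": [100, 120, 200, 150, 80],
})
ranked = abc_classify(items)
print(ranked[["item", "net_value", "class"]])
```

Plotting `cum_share` against the cumulative item count yields the Pareto-style line chart described above.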

Inventory management Implication and Strategies


  • A-items should have tight inventory control, more secured storage areas and better sales forecasts; re-orders should be frequent, with weekly or even daily reorder; avoiding stock-outs on A-items is a priority.
  • B-items benefit from an intermediate status between A and C; an important aspect of class B is monitoring potential evolution toward class A or, on the contrary, toward class C.
  • Reordering C-items is done less frequently; a typical inventory policy for C-items consists of having only 1 unit on hand. This approach leads to a stock-out after each purchase, which can be acceptable, as C-items present both low demand and a higher risk of excessive inventory costs. A make-to-order production strategy is also applicable here.

Warehouse Insights and Optimization


ABC analysis can also be used in warehouse design for the storage and retrieval process. Since the majority of the picking activity is performed in a rather small area, the warehouse layout should be optimized to reduce the time spent looking for products in the back of the warehouse. As demonstrated in the figure above, the fastest-moving products in the inventory should be located closest to the shipping, staging, and receiving area in the lower right of the diagram. The areas marked A, B and C indicate the ABC items.

Application of ABC analysis in warehouse


Previously, the shelves were set up in a U-shaped layout, providing a logical path for picking. However, this layout is not efficient: for each preparation of partial products, operators have to walk the whole U. If the stored items are bulky or the U cell is widespread, the time spent walking and transporting becomes significant.

Time spent moving or transporting material or parts is considered a non-value-adding operation, or waste, in lean thinking. ABC analysis is deployed here focusing on the stock-turns index, or frequency of picking. As class A items are picked most often, it is only common sense to place them close to the entry-exit (green zone) to reduce reaching distance and picking time. Class B items, statistically in less demand, are placed behind class A (orange zone), and class C items (the least picked) are stored even further away (red zone). So in most cases, the journey in the U cell is restricted to the front (green zone); only in the statistically less frequent cases will the operator have to walk the whole U cell.

On top of time and distance reduction, we can imagine the lights of the green zone being on constantly, while the orange and red zones could be equipped with sensors or switches to light them up only when somebody is in. In the same way, for the comfort of workers, heating or air conditioning is not necessary in zones where they seldom go. These are opportunities to reduce energy expenses.

Implication and Recommendation


Besides these implications and strategies, a further step would be to develop a customized inventory management agent system by consolidating the technologies proposed so far, as demonstrated above.

  • Purpose — To determine the amount of inventory to keep in stock, which is how much to order and when to order.
  • Functions — Automation of inventory management; Timely react to demand deviation from the forecast demand; Making corrections based on stock reward function & replenishment policies
  • Working Logic — ERP data is first employed as input for the ABC analysis. The LSTM algorithm is then used to forecast demand and the related forecast error is calculated; other deep learning algorithms such as RNNs, or time series models like ARIMA, can also be evaluated to choose the best. The stock reward algorithm (SRA) is used to obtain the recommended stock level from the forecast results; replenishment policies and rules from the company's domain experts can supplement the SRA to achieve an optimal stock level. Finally, real demand is compared with the forecast, and the SRA's performance is evaluated so that the parameters in the previous steps can be tuned for better performance.

Stock Reward Function


Since our customer is a retail company, it should pay close attention to stock management. Besides demand prediction, optimizing stock levels is also important. Here our team tries an emerging method called the stock reward function.

The stock reward function is defined as below:


In which:

  • M is the gross margin for selling 1 unit
  • S is the stock-out penalty (negative) for not serving 1 unit
  • C is the carrying cost penalty (negative) for not selling 1 unit in stock
  • K is the number of units held in stock
  • yt is the demand for the period t
  • α is a discount factor
  • R∗ is identical to R but with S = 0

In our implementation, we suppose that R∗ is equal to R (i.e. S = 0), which gives the stock reward function below:



Since we wanted to try a different approach from the other teams, we implemented this function not only in Lingo but also in Python.

We used a Python package called DEAP, a genetic algorithm package, and fixed some parameters as below:

  • C = – 0.3
  • α = 0.98

Our chromosome and gene design is thus quite simple, consisting of K. For the demand yt, we use the 3-month demand.

Our target for the GA is to maximize the reward for every product. We set the initial population to 1000, ran 40 generations, and used mutation to generate different input combinations.
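Since the full DEAP setup is not shown, the optimization can be illustrated with a simplified pure-Python sketch: a brute-force search over K standing in for the genetic algorithm. The discounted reward form and the margin M are assumptions consistent with the definitions above (S = 0, C = -0.3, α = 0.98):

```python
# Assumed reward form with S = 0:
#   R(K) = sum over t of alpha^t * (M * min(y_t, K) + C * max(0, K - y_t))
# M is a hypothetical per-unit gross margin; C and alpha are the fixed
# parameters stated above.
M = 1.0       # hypothetical gross margin per unit sold
C = -0.3      # carrying-cost penalty per unsold unit in stock
ALPHA = 0.98  # discount factor

def stock_reward(K, demand):
    return sum(
        ALPHA ** t * (M * min(y, K) + C * max(0, K - y))
        for t, y in enumerate(demand)
    )

def best_stock(demand, k_max=200):
    """Brute-force stand-in for the GA: try every stock level K."""
    return max(range(k_max + 1), key=lambda K: stock_reward(K, demand))

demand_3m = [30, 45, 25]  # hypothetical 3-month demand for one item
K_opt = best_stock(demand_3m)
print(K_opt)
```

For a one-dimensional integer chromosome like K, exhaustive search is feasible; the GA approach becomes worthwhile when optimizing many items or parameters jointly.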

Finally, we obtain the optimal stock for every product, as shown below:


Order Routing


Retail order routing is the efficient allocation and fulfillment of customer orders. It means knowing where all the inventory is located, understanding the costs of shipping from each location, and having the flexibility to change priorities on the fly to better manage the business. Order routing helps manage, optimize and improve the profitability of a retail business. As the business grows, routing the orders becomes a complex problem that grows exponentially; running a real-time business without a solid retail order routing system is simply not sustainable. Efficient and timely routing helps increase customer satisfaction.

The problem discussed here is about:

  • A small retailer company hosts warehouses in three major cities in the United States: New York, Chicago and Los Angeles.
  • A survey of customers indicated that they are quite happy with the quality but not satisfied with on-time delivery.
  • To analyse this issue, customer order data was collected from the various warehouse locations to suggest improvements to the existing routing procedures.
  • The team chose the warehouse located in New York and analysed the various outbound routes to the customer locations in the city.
  • The main parameters considered while optimizing routes were distance, delivery time window, and service time.

Routing use My Route Online

MyRouteOnline is online software that optimizes multiple routes using the Google Maps API. We can import up to 350 addresses from Excel, or manually add multiple locations in various formats.


Steps of using My Route Online

1) Enter the addresses manually in the prompt that appears, or import the addresses from Excel files, and click the Next button.


2) Set the required parameters, such as the departure time, number of routes and vehicle size, and click the Plan My Route button at the bottom corner.


3)Visualization of the route


  • Print out the best routes



Manual changes to the route can be made using the Manual Changes button, and the best route map can be obtained by clicking the Print Map button.

Route with Delivery Time Constraints

For deliveries adhering to time windows, i.e. when the customer wants the delivery to be made within a particular time window during the day, follow the instructions below:

1) Add two additional fields to your Excel spreadsheet. Label them “Delivery From” and “Delivery To”. The labels must be exactly as written here, without the quotes. These labels will be recognized as “Window From” and “Window To” in the software.


2)In the “Delivery From” field, put in the earliest delivery time for the stop and in “Delivery To”, put the latest. Make sure that the times are 24 hour time (military time). For example, 10:00 AM is 10:00, and 2:00 PM is 14:00.

3)Import your spreadsheet into MyRouteOnline.

4) Set the parameters you need and click Plan My Route.
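Steps 1 and 2 can be scripted; the pandas sketch below writes the two required columns in 24-hour format. The addresses, times and file name are hypothetical, and the file is saved as CSV here for simplicity, though MyRouteOnline also imports Excel:

```python
import pandas as pd

# Build the stop list with the exact column labels the software expects;
# times must be 24-hour format (10:00 AM -> "10:00", 2:00 PM -> "14:00").
stops = pd.DataFrame({
    "Address": ["123 Example St, New York, NY", "456 Sample Ave, New York, NY"],
    "Delivery From": ["10:00", "09:30"],
    "Delivery To": ["14:00", "12:00"],
})

# Hypothetical output path; import this file in step 3.
stops.to_csv("stops_with_windows.csv", index=False)
```

Keeping the labels exactly "Delivery From" and "Delivery To" is what lets the import step map them to "Window From" and "Window To".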


Team Member

A0163200B – Anand Rajan

A0163364E – Gao Yuxi

A0163265E – Li Hangxing

A0163312U – Yao Meng

A0163313R – Shi Haiyuan

Divorce Trend in Singapore — August 31, 2017

Divorce Trend in Singapore


Singapore is a unique city-state with cultural diversity that has gone through tremendous changes in the past 40 years, growing from a sleepy fishing village into one of the most vibrant commercial hubs in the world. As the economy and infrastructure were transformed from pre-industrial to industrial status to position the nation-state as a competitive global market player, its divorce rate has also been rising, quite alarmingly, ranking 4th among Asian countries. The risk factors that trigger marriage dissolution are plentiful, gradually making it a grave social concern; they involve children, employment, household income, marriage duration, education, the changing social status of Singaporean women, and diverse racial and ethnic characteristics (Straughan, 2009).

GQM Analysis:


Based on data availability and a literature review of the factors influencing divorce (Quah, 2016), this dashboard uses 5 crucial factors for visualization and analysis: household income, ethnic group, age group of the couple, marriage duration and employment. Here we identify the business goals and set up KPIs.

  • Conclude the divorce trends in Singapore from 35 years of data and make reasonable predictions
  • Identify the crucial factors that trigger marriage dissolution and analyze their impact on the divorce rate.


  • How long do marriages last before divorce?
  • What is the age distribution of divorces?
  • How does unemployment influence divorce?
  • How does income influence divorce?
  • How does divorce differ by race?


  • Duration of marriage before divorce, per year
  • Number of divorces by age group
  • Annual long-term unemployment rate
  • Monthly household income per year
  • Yearly number of divorces by race



Access here: Dashboard – Divorce Trend in Singapore


Overall Divorce Trends


According to the annual report released by the Department of Statistics Singapore, a total of 7,614 marriages ended in divorce or annulment in 2016, the highest annual figure in 36 years. Both civil and Muslim divorces contributed to this peak, a 1.2% rise from the 7,522 marital dissolutions in the previous year.

Compared with other Asian countries that are also suffering from increasing divorces, Singapore ranks 4th, with nearly 7 annual divorces per 1,000 married males and per 1,000 married females.

Hence, these record-high figures, together with the chart (shown on the left) used for exploratory analysis of divorces over the years, point to the next peak of this socioeconomic phenomenon in the coming years.


Q1: What is the age-specific distribution of the divorce rate in Singapore? Should relevant campaigns or social trends be created to address this situation?


Overall, the number of divorced people has been increasing yearly across all age groups in Singapore. Specifically, people aged 25-49 make up the majority of the divorced population, meaning that middle-aged and late-middle-aged people are more likely to divorce than any other age group. This is plausible because:

  • For young people, since the average age of marriage keeps rising, fewer and fewer young people are getting married. Fewer marriages, fewer divorces.
  • For middle-aged and late-middle-aged people, stress levels are highest. Children’s education, caring for parents, career challenges and so forth make people anxious, easily causing couples to quarrel and ultimately divorce.
  • As for older people, who have generally been married for over 20 years, it seems that after all these years couples have adjusted to each other well. Their marriages are more stable, and they probably become less and less willing to divorce as time goes by.

Q2: How long do marriages last before divorce?


As we can see, marriages lasting under 5 years accounted for the most divorces, followed by the 5-9 year bracket, until a cross-over in 2010; since then, the 5-9 year bracket has had more divorces than the under-5 bracket. All other durations have significantly fewer divorces. In addition, there are sudden peaks in 1999 and 2003 across all durations. Some possible reasons:

  • If a couple can ‘tolerate’ each other for 10 years, they are less likely to divorce thereafter.
  • The ‘seven-year itch’ phenomenon is becoming more significant in modern times.
  • Events around or before 1999 and 2003 could have caused the sudden peaks. These events could have been socio-economic in nature, and widespread and significant enough to affect marriages across all durations.

Q3: How does unemployment influence divorce?


Here we can see sudden peaks in the long-term unemployment rate from 1997 to 1999 and from 2001 to 2003. These correlate with the sudden peaks in divorces around 1999 and 2003 shown in the previous two charts. This is plausible because:

  • The 1997 Asian Financial Crisis affected Singapore’s economy. This in turn caused long-term unemployment and could have strained marriages. However, Singapore’s well-managed currency helped the economy recover quickly, so it was not affected for long.
  • The 2001 terrorist attack on the World Trade Center in the US affected the global economy and had a bigger effect on Singapore’s economy than the 1997 crisis. The effect lasted longer, the unemployment rate took longer to recover, and divorces rose more sharply than in 1997.

Q4: How does median monthly household income relate to the total number of divorces?


As we can see from the graph, from 2000 to 2003 household income decreased while the number of divorces rose dramatically. Meanwhile, our analysis of those years shows that the average price level was still increasing. The conflict between income and expenditure increased the family burden, which affected the harmony of family relationships. After 2005, as income increased, the growth in the total number of divorces slowed.

By 2011, when household income exceeded S$7,000 per month, the total number of divorces had stabilized. After 2011, although household income continued to increase, the divorce total remained stable. From this, we could conclude that once monthly household income is above S$7,000, income may no longer influence divorce.

Q5: How does divorce differ by race?


Generally speaking, the divorce trends for Indians and Chinese are quite similar. Before 1998, the divorce rate of the Chinese was slightly higher than that of the Indians. After 1998, the Chinese divorce rate kept climbing, while the Indian divorce rate began to fluctuate between 2.0 and 2.5. From 2010, the divorce rates for both races have fluctuated around a relatively constant level.

There was a peak around 1998 for both Indian and Chinese divorces. The 1997 Asian Financial Crisis caused many Singaporeans to lose their jobs, household income to fall, and financial conflicts within families to worsen. Hence, in the few years after the crisis the divorce rate increased, and then decreased after the crisis eased.

From the graph, the divorce rate of the other races, consisting mainly of Malays, is significantly lower than that of Indians and Chinese. This could be due to their different overall economic status, social structure and outlook on life in general.

Further, the divorce trend of the other races (mainly Malays) is similar to that of the Chinese, with a delayed effect of about 5 to 6 years. Hence, we predict that after 2015 the divorce rate of the other races will begin to fluctuate around a relatively constant level, like the trend for the Chinese after 2010.


The general divorce rate in 2016 was unchanged from 2015. However, there was a “prominent shift” in the age profile of divorces towards the older age groups over the last decade. From the discussion above, we find these main factors affecting divorce:

  • Distribution of Ethnic Groups – There is a tendency for different ethnic groups to have particular reasons for divorce. Each ethnic group has experienced different patterns of divorce, degrees of proneness to divorce, and causes for divorce. Although ethnic groups are, to some extent, differentiated by their different socio-economic status in the society, they experience the same processes of social and economic changes and live in the same society. This points to the fact that cultural elements play a crucial role in determining the pattern of divorce and the divorce rate in a society.
  • Age Distribution of Divorced Couples – Our earlier analysis suggests that the age range most at risk of divorce has shifted upwards, moving from 25-40 to 30-45 years old over the past three decades.
  • The Time for Divorce – The mean duration of marriage for couples divorcing in 2016 was 10 years. The distribution indicated that more divorces took place around the fifth to ninth year of marriage, accounting for 29.9 per cent of total divorces. Generally speaking, after five to nine years of marriage the husband has usually established himself and the family is better off; at this stage, if the wife does not keep up with her husband’s status, a number of marriages may break down.
  • Household Income & Unemployment Rate – Money issues can be another main reason for divorce: financial stress can cause marital problems, and marital problems can result in divorce. The correlation between income and divorce isn’t quite that clear-cut, however. While financial problems are directly linked to marital problems, other factors help determine whether those problems ultimately bring about divorce.

Other Crucial Impact Factors on Divorce & Suggestions

The dashboard was limited to the information available; the actual results involve more variables related to divorce proneness. The following are some other variables that could affect marriages:

  • Number of Children of the Divorced Couples – Divorce proneness is correlated with childlessness. Childless couples apparently separated more frequently and earlier than couples with children. Childlessness may be a cause of divorce, but it may just as well be a consequence of marital instability or disharmony that leads to divorce.
  • Divorce and Occupation – In Singapore, correlations were found between divorce proneness and occupational group. There was a positive correlation between occupational status and proneness to divorce: those of higher occupational status were more prone to divorce.

Divorce is a form of marital instability and a type of family dissolution. It is a complicated process involving multi-dimensional factors, both intra-familial personal factors and extra-familial socio-cultural factors. Some suggestions for addressing the rising divorce rate:

  • Having marriage counselling programmes for couples in difficult marriages could be a good idea, since couples attending them may be able to work out their differences and eventually decide against divorce. These marriage-related programmes enhance the overall quality of marriages and improve relationship-building skills among married couples.
  • Taking into account the current divorce trends, perhaps more comprehensive marriage preparation programmes and seminars for couples preparing for their wedding can be conducted by MSF, community-centered organizations and relevant religious groups to help couples better understand topics such as communication, conflict management, commitment and problem-solving to build stronger and more lasting marriage unions.

Data Sources & References

Data Sources


Team Members(Team 18)

Li Hangxing        A0163265E

Li Ruiqi                A0163293A

Lim Chong Wui  A0163460L

Shi Haiyuan        A0163313R

Yao Meng             A0163312U

Feasibility Analysis of Developing Singapore into Smart City — August 17, 2017

Feasibility Analysis of Developing Singapore into Smart City


Advances in digital technology have opened up new possibilities to enhance the way we live, work, play, and interact. And “Smart City” is not a measure of how advanced or complex the technology being adopted is, but how well a society uses technology to solve its problems and address existential challenges. Singapore strives to become a Smart City to support better living, stronger communities, and create more opportunities, for all.

Smartness penetrates the everyday life of each citizen and is significantly transforming the conventional way of living. Building a smart city includes providing:

  • efficient, safe, reliable, and enhanced transportation
  • integrated and seamless healthcare
  • sustainable and livable residence environment
  • close and intimate public sector services

The analysis is conducted on the foundational construction supporting the aims above, involving nationwide Internet coverage, citizen acceptance, manpower and financial funding, etc., to discuss whether it is the right time for this transformation in Singapore.

Target Audience

Minister Lawrence Wong, Ministry of National Development

GQM Methodology

Goal — Analyze the current status of infrastructure construction to assess whether Singapore can be built into the first Smart City in the world.


  • Is the network coverage broad enough?
  • What is the R&D (research & development) expenditure currently?
  • Do citizens get used to digital life?
  • What is the current status of domestic enterprises (acceptability)?
  • Is the IT manpower sufficient to support smart city development?


  • Wi-Fi hotspot density
  • web presence percentage of domestic enterprises
  • mobile penetration
  • online-shoppers distribution
  • resource requirement from IT industry
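As a sketch of how such KPIs might be computed for the dashboard, the figures below are illustrative placeholders; only Singapore's land area (roughly 719 km² in the mid-2010s) is a real value:

```python
# Illustrative KPI calculations; all counts are made-up placeholders.
LAND_AREA_KM2 = 719.0  # Singapore's land area, mid-2010s

hotspots = 20_000               # assumed nationwide Wi-Fi hotspot count
enterprises = 200_000           # assumed number of domestic enterprises
enterprises_with_web = 120_000  # assumed enterprises with a web presence
mobile_subscriptions = 8_000_000
population = 5_500_000

kpis = {
    "wifi_hotspot_density_per_km2": hotspots / LAND_AREA_KM2,
    "web_presence_pct": 100 * enterprises_with_web / enterprises,
    "mobile_penetration_pct": 100 * mobile_subscriptions / population,
}
for name, value in kpis.items():
    print(f"{name}: {value:.1f}")
```

A mobile penetration above 100% simply means more subscriptions than residents, which is common in highly connected markets.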


Data Modeling



Dashboard Design


Insight Discovery

Successful transformation of living habits for citizen

Citizens are adapting to the “smartness” of the living environment. The “Online shoppers by age” data reveals the distribution of active online users across age ranges.

  • The 25-34 age group is the major consumer group in online shopping; 75% of them prefer to shop online.
  • The 15-24 age group saw a sharp increase from 2007 to 2014; 55% of them use online shopping for daily needs.

Full network coverage enables the launch of smart solutions

Singapore has a high density of Wi-Fi hotspot coverage (excluding the westernmost industrial areas and Changi Airport), which facilitates the implementation of smart solutions in almost every domain. Citizens will be able to stay connected to the entire smart ecosystem via Wi-Fi or 4G. For instance, smart living solutions aim to let residents seamlessly and remotely manage connected homes from anywhere, at any time, on any screen, in real time.

This intensive nationwide network coverage is a solid foundation for launching smart solutions and enables society to be fully connected.

Steadily increasing financial investment in R&D provides strong support for innovation and development.

Financial investment in R&D is continuously increasing, with a clear rising trend from 1990 to 2014 in both the private and public sectors:
Private — comprises all business enterprises in the private sector.
Public — comprises all entities in the Government sector, the Higher Education sector and Public Research Institutes.

The government tends to sustain R&D in innovation, smart-solution research, technology development, etc., to guarantee a stable research environment for designing, building and implementing the smart city.


Team members

Cui Leyang    (A0163218J)

Duan Han     (A0163238E)

Gao Yuxi       (A0163364E)

Li Yue            (A0163373E)

Que Qiwen   (A0163391E)