Source: Click here for the paper
Pankaj Mitra (A0163319E), Deepthi Suresh (A0163328E), Anand Rajan (A0163200B) and Neeraja Lalitha Muralidharan (A0163327H)
Objectives & Motivation:
Our objective is to find convenient places for students to stay, based on data collected in Singapore.
Any international student coming to Singapore needs reliable information about convenient places to rent. To this end, we analysed student movement data collected by NUS-ISS students and performed exploratory spatial data analysis on it to find patterns and insights. We considered the basic amenities a student would value: economical accommodation, cycling paths, libraries, proximity to MRT stations, parks, and hawker centres.
|Dataset description||Data Source URL|
|Data points of all NUS-ISS students||IVLE (App used: OpenPaths)|
|Park Connector Loop||https://data.gov.sg/dataset/park-connector-loop|
OpenPaths data containing students' personal location information was collected for the month of April, then cleaned and refined. As the data was recorded on multiple mobile devices, the date and time formats were inconsistent, so all date and time values were converted to a single standard format. Separate columns were created for date and time to support further analysis. Outliers were treated in R, data points outside Singapore were discarded, and the Name and MailID columns were modified in Excel to fill in missing details.
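The date/time standardisation step can be sketched as follows. This is a minimal stdlib illustration, not the team's actual cleaning script; the input formats shown are assumptions about what mixed-device timestamps might look like.

```python
from datetime import datetime

# Hypothetical mixed-format timestamps from different devices (assumed examples).
RAW = ["2017-04-03 08:15:00", "03/04/2017 20:45", "Apr 3 2017 11:05PM"]
FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%b %d %Y %I:%M%p"]

def standardise(ts):
    """Try each known format; return separate standardized date and time strings."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(ts, fmt)
            return dt.strftime("%Y-%m-%d"), dt.strftime("%H:%M:%S")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised timestamp: {ts}")

# Produces separate date and time columns, as described above.
cleaned = [standardise(t) for t in RAW]
```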
Exploratory Analysis of Students Data (with derived fields):
From the above exploration, it is evident that most students stay at the university, travel to their residences, and use the MRT most of the time. Since students would prefer to stay near the university, close to an MRT station, and with access to cycling paths, our aim is to find and suggest better zones for student life. Thus, we add these layers to the student data and carry out further analysis.
To get more insights into a suitable dwelling, we tried to separately analyze various datasets such as HDB dwelling population data, Hawker center location, MRT station, and cycling paths.
Geographically Weighted Regression:
To draw insights from the student data together with the HDB and population layers, geographically weighted regression was performed using the variables shown below, under the assumption that data points with late-night timestamps represent students' homes. A spatial join was done on the hawker centre, dwelling, and OpenPaths student data layers to obtain the final GWR model.
The Count_ variable is the number of student data points per polygon, Count_1 is the number of hawker centres per polygon, SHAPE_Area is the area of the dwelling-layer polygon, and HDB is the total HDB count per person.
The GWR results show that the model performs with moderate accuracy, with an adjusted R² of 0.31.
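The core idea of geographically weighted regression is that each location gets its own locally weighted fit, with nearby observations weighted more heavily. A minimal numpy sketch of that idea (illustrative only; the report used the ArcGIS GWR tool, and the coordinates, variables, and bandwidth below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
coords = rng.uniform(0, 10, size=(100, 2))      # polygon centroids (assumed)
x = rng.normal(size=100)                        # e.g. hawker-centre count per polygon
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # e.g. student data points per polygon

def gwr_slope(at, bandwidth=2.0):
    """Locally weighted slope estimate at one query location."""
    d = np.linalg.norm(coords - at, axis=1)
    w = np.exp(-(d / bandwidth) ** 2)           # Gaussian kernel weights by distance
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[1]

# In real GWR the slope varies across space; here the synthetic data has a
# single true slope of 3, so every local estimate should be close to it.
slope_here = gwr_slope(np.array([5.0, 5.0]))
```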
The Spatial Autocorrelation (Moran's I) tool was run on the regression residuals to check that they are spatially random. Statistically significant clustering of high and/or low residuals (model under- and over-prediction) would indicate that the GWR model is not accurate enough for prediction.
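Moran's I can be computed by hand for intuition. The toy example below (not the ArcGIS tool used in the report) evaluates a perfect checkerboard of residuals on a 2x2 grid with rook-contiguity weights, which yields the most negative possible autocorrelation:

```python
import numpy as np

resid = np.array([1.0, -1.0, -1.0, 1.0])   # checkerboard residual pattern
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)  # rook neighbours on a 2x2 grid

n = resid.size
z = resid - resid.mean()
# Moran's I = (n / sum of weights) * (z' W z) / (z' z)
I = (n / W.sum()) * (z @ W @ z) / (z @ z)
# Values near -1 or +1 signal strong spatial autocorrelation;
# values near 0 indicate the spatially random residuals we want.
```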
The map below shows the localized R² for predicting HDB population from student data. The dark red regions are places where the model predicts with higher accuracy, and the blue regions are places where its prediction is poor.
As the model residuals are not spatially random according to the spatial autocorrelation (Moran's I) test, the model cannot be used for prediction. It was built only to study the student data alongside another data layer.
Thus, to find convenient places for students, zones were ranked using the student data points and factors such as MRT access, cycling paths, and hawker centre availability.
Ranking zones based on Hawker Centres:
To obtain the density of hawker centres in each zone, we calculated a new field, which was then used for ranking:
The classified field was ranked using the Reclassify tool, ordering regions by hawker centre density to show zones ranked by economical food options.
Ranking zones Based on proximity to cycling paths:
Areas are ranked by their proximity to cycling paths: the data is converted to raster form and then ranked.
Ranking zones based on MRT by reclassify:
Easy access to public transport is one of the major considerations when choosing a place to stay, so we used proximity to MRT stations as a ranking factor. The MRT location data is converted to raster form to rank areas by their proximity to MRT stations.
The final rank for each zone in Singapore was calculated as the average of the three other ranks (MRT, hawker centre, and cycling path). Zones with good ranks are the most favourable for staying, and the final ranking can help students choose a place based on their individual priorities.
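The averaging step is simple to sketch. The rank values below are made-up illustrations (rank 1 = best), not the report's actual zone ranks:

```python
import numpy as np

# Illustrative per-zone ranks for three hypothetical zones (1 is best).
mrt_rank     = np.array([1, 3, 2])
hawker_rank  = np.array([2, 3, 1])
cycling_rank = np.array([3, 3, 1])

# Final rank = average of the three factor ranks, as described above.
final_rank = (mrt_rank + hawker_rank + cycling_rank) / 3.0
best_zone = int(np.argmin(final_rank))   # lowest average rank = most favourable
```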
Using the final ranking, we can recommend to a new student coming to study at NUS a convenient place to stay, considering MRT, Cycling and Food places in the ranking.
As expected, the best-ranked zones cluster near NUS itself, but the ranking also suggests other areas.
As future work, a configurable element could replace the hawker centre layer with other layers, such as libraries, parks, HDB rental prices, or bus stops, to form an ideal tool for incoming students.
For a detailed analysis, please read the full report: Spatial-Temporal Analytics with Students Data
Explored and submitted by:
ARUN KUMAR BALASUBRAMANIAN (A0163264H)
DEVI VIJAYAKUMAR (A0163403R)
RAGHU ADITYA (A0163260N)
SHARVINA PAWASKAR (A0163302W)
LI MEIYAO (A0163379U)
To find the most accessible study areas for students in NUS
Data Exploration and Feature Addition:
Base Map Creation and Polygon Generation:
Addition of Layers:
Step 3: Loaded the class data (namely the master class) and converted the coordinates for display
Step 6: The high-density area was in and around the National University of Singapore, so we can conclude that the data points belong to people who either work at NUS or are NUS students.
Assumption and Addition of new Layer:
Hot Spot Analysis:
Step 8: Hot Spot Analysis was performed via ArcToolbox -> Spatial Statistics Tools -> Mapping Clusters -> Optimized Hot Spot Analysis
Model Diagram – Proximity Analysis:
Step 9: Proximity Analysis was performed to find the most accessible study areas for students in NUS, via ArcToolbox -> Analysis Tools -> Proximity -> Near
Proximity Analysis Results:
Team Name: Incognito
Team Members: Pankaj Mitra, Deepthi Suresh, Anand Rajan, Neeraja Lalitha Muralidharan, Nandini Paila Reddy
Team Name: Incognito
Team Members: Deepthi Suresh, Neeraja Lalitha Muralidharan, Pankaj Mitra, Sindhu Rumesh Kumar
This study uses multiple linear regression:
1. To provide theoretical support for traffic management and control
2. To increase efficiency at intersections and improve security
One-page journal summary of "Estimates of pedestrian crossing delay based on multiple linear regression and application", authored by Li Dan and Xiaofa Shi.
Click here for Journal.
The objective of this dashboard is to demonstrate different levels and types of injuries caused in the workplaces of Singapore.
Analysis on Workplace Injuries – Analytics And Intelligent Systems
The majority of injuries consistently occur in the Construction and Manufacturing industries over the years, but in 2016 injuries in Accommodation and Food Services also rose.
The number of minor injuries (94.9%) greatly exceeds the number of major (4.5%) and fatal (0.5%).
The fatal and major injuries are consistent through the years, while the number of minor injuries dropped in 2015 and increased drastically in 2016.
Injuries caused by slips, trips, and falls are the most common across industries. Injuries caused by cuts or stabs from objects are more frequent in Accommodation and Food Services, while workers in Construction and Manufacturing are more often affected by moving objects.
Good measures have been taken to reduce fatal and major injuries, and steps are being taken to reduce them further, yet minor injuries have not decreased. This may be because injuries that would previously have been fatal now end up as minor injuries.
Being struck by moving or falling objects is one of the most common causes of minor injury in both Construction and Manufacturing. A common solution could be wireless heat and motion sensors on objects that sound an alert whenever a person comes within 50 m, so that people working on site are aware of moving objects in their vicinity and can move out of harm's way.
The injuries may be minor, but if they occur to multiple people at once, they can affect the overall productivity of the company.
Tableau Public link:
Pankaj Mitra (A0163319E)
Deepthi Suresh (A0163328E)
Neeraja Lalitha Muralidharan (A0163327H)
Kriti Srivastasa (A0163206N)
Sindhu Rumesh Kumar (A0163342M)
Team Ensemblers: Ankita Avadhani, Bethapudi Keerthnana, Rajan Dhingra, Sabrish Gopalakrishnan
Singtel and Airtel, on 14th June 2016, announced a strategic alliance.
The development of this combined network increased FDI from US$16 billion to US$22 billion in the first half of 2015 compared to the second half of 2014. Through this partnership, Singtel also strengthened its lead as the largest IP VPN provider in Asia Pacific with strong domestic data network in Australia, India and Singapore.
Singtel currently holds a significant stake in Bharti Airtel in India and considers it one of its chief associates in the Asia-Pacific region. Since 2018, Bharti Airtel has seen a considerable drop in subscribers, chiefly due to migration of users from Airtel to Reliance Jio, and this has reduced the average share price of Bharti Airtel stock on the BSE.
Coincidentally, Singtel's stock has also dropped in this period, and several expert articles have cited Airtel's declining performance as one of the reasons.
Thus, it becomes an interesting exercise to determine the cause and effect between these entities and their stock prices. We try to see if any correlation is to be found with decrease in daily closing price between Airtel and Singtel using Transfer Functions.
The data has been pulled from Yahoo finance.
From the plot, we can see that the Airtel and Singtel stock prices move roughly together.
We begin by choosing Airtel_Close as independent variable.
The ADF tau statistics for the causal covariate are −2.87 without trend and −3.17 with trend (both models including an intercept):
Under the first difference, both the trend and no-trend tests show insignificance, which we deem sufficient for our purpose. Thus, we select d = 1 for pre-whitening; that is, the process is stationary at d = 1.
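The tau statistic behind this stationarity check comes from a Dickey-Fuller regression of the differenced series on its lagged level. The sketch below is a simplified version (no augmented lag terms, synthetic data) to illustrate how the statistic is formed; it is not the full ADF test applied in the report:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = np.zeros(n)
for t in range(1, n):                       # clearly stationary AR(1), phi = 0.5
    y[t] = 0.5 * y[t - 1] + rng.normal(scale=0.01)

dy = np.diff(y)                             # delta y_t
X = np.column_stack([np.ones(n - 1), y[:-1]])   # intercept + lagged level
beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
resid = dy - X @ beta
s2 = resid @ resid / (len(dy) - 2)
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
tau = beta[1] / se        # strongly negative tau => reject the unit root
```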
The autocovariance shows a non-linear decay, and the partial autocorrelation is insignificant after one lag. We therefore start by fitting an ARI(1,1) model, but its parameters are not significant, even though its residuals are white noise.
We then run the ARIMA Model Group to find the stochastic model with the lowest AIC and significant parameters; the lowest AIC is achieved by ARIMA(2,2,3).
The post-fit residuals show no significant autocorrelation or partial autocorrelation, so the residuals are white noise.
After pre-whitening the time series by fitting ARIMA(2,2,3), we see significant cross-correlation between the covariates, but from lag −9 to lag −21, with a peak at −15.
This indicates that our initial choice of cause and effect needs to be reversed, so we proceed with the dependent and independent variable roles swapped.
Based on the cross-correlation plot above, we find that:
b = 9
s = 15 − 9 = 6
r = 2 or 1 (beginning with 2). Fitting the transfer function with these parameters yields:
However, not all variables are significant, so we try r = 1, which gives a better model.
Thus, we choose our parameters as b = 9, s = 6 and r = 1.
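The delay parameter b is read off as the first (and here, peak) significant lag of the cross-correlation between the pre-whitened series. The sketch below demonstrates this on synthetic data with a known 9-step delay, mirroring the b = 9 found above; the data and noise levels are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, delay = 300, 9
x = rng.normal(size=n)                       # pre-whitened input series
y = np.zeros(n)
y[delay:] = 0.8 * x[:-delay]                 # output responds 9 steps later
y += rng.normal(scale=0.05, size=n)

def ccf(a, b, lag):
    """Correlation between a_t and b_{t+lag}."""
    return np.corrcoef(a[:n - lag], b[lag:])[0, 1]

# The lag with the largest |cross-correlation| identifies the delay b.
best_lag = max(range(0, 20), key=lambda k: abs(ccf(x, y, k)))
```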
We checked the residual plot and found white noise in the residuals.
Thus, in our search for a relationship between Singtel and Airtel stock prices, we found that Airtel's stock price can be modelled as a response to changes in Singtel's stock price, with a delay of approximately nine time steps. Roughly speaking, Airtel's stock price reacts proportionately to a change in Singtel's stock price after nine days.
An air quality monitoring dataset for Northern Taiwan was obtained from the Environmental Protection Administration of Taiwan. It contains measurements from 25 observation stations in Northern Taiwan and comprises 21 meteorological and environmental parameters measured in 2015.
The aim of this report is to build a predictive model using Multiple Linear Regression to predict the concentration of atmospheric particulate matter with diameter less than 2.5 micrometer (PM2.5) based on other measured environmental and meteorological data.
Prior to using the dataset for analysis, a series of data preparation activities was carried out. This included a sanity check of the data reported for each variable to ensure that the range of values was logical. The steps involved in data preparation are detailed in the following subsections.
As identified in the data dictionary (included in the Appendix), invalid values due to equipment inspection, program inspection, and human inspection were appended with "#", "*", and "x" respectively. As these values could not be imputed reliably, listwise deletion was used to remove all rows containing them.
As indicated in the data dictionary, three variables (PH_RAIN, RAINFALL and RAIN_COND) had “NR” (No Rainfall) indicated. While those under RAINFALL could be imputed as “0”, as no rainfall means that 0mm of rainfall was measured, such observations under PH_RAIN and RAIN_COND had to be removed, as it’s not possible to reflect these values numerically.
Considering that the 21 environmental variables contains information such as humidity, ambient temperature etc, it was determined that these variables were sufficient to describe the conditions of the location and the time at which these data were collected. As such, Date/Time and Location were excluded in our analysis.
The remaining observations which had missing values after the above steps were removed due to the lack of domain knowledge. It was decided that imputation without the necessary domain knowledge may skew the dataset, and make the modelling less accurate.
After data preparation was completed, a scatterplot of PM2.5 against all other variables was generated to investigate whether there is any evidence of linearity with the other predictors. From Figure 1, it is noted that there is very strong evidence of linearity between PM10 and PM2.5. This is logical, because the definition of PM10 (particulate matter less than 10 micrometres in diameter) also overlaps that of PM2.5. Apart from this, PH_RAIN and NO2 also appear to follow the trend of the fitted line. As such, the dataset was deemed suitable for Multiple Linear Regression.
To investigate if the 20 variables are sufficiently significant in explaining the variance of the dataset, these variables would be used in model fitting with their p values evaluated against a level of significance of 0.05, and the adjusted R² reviewed. Once the predictors have been finalized, their Variance Inflationary Factor (VIF) would be checked to ensure that no multi-collinearity exists between themselves.
After the data preparation stage, the final dataset contained a total of 583 samples with 1 target variable and 20 predictors. Considering that the ratio of observations to predictors exceeds 10:1, the size of the dataset was deemed sufficient for Multiple Linear Regression. The dataset was then split into training data and test data on a 70:30 ratio, so that sufficient data was available to train and validate the model.
From the training dataset obtained in section 4.3, a linear model was fitted with PM2.5 set as the dependent variable. Based on the results generated in R, the least significant variable was iteratively dropped, with the model evaluated after each removal of variables. This process was completed once there were no insignificant variables remaining.
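This backward-elimination loop can be sketched as follows. The original work was done in R on the real dataset; the version below is an illustrative numpy implementation on synthetic data, using |t| > 1.96 as an approximation of the 0.05 significance level, with the noise constructed to be exactly uncorrelated with the predictors so the outcome is deterministic:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)                 # truly related to y
x2 = rng.normal(size=n)                 # irrelevant predictor, should be dropped
e = rng.normal(scale=0.5, size=n)
Z = np.column_stack([np.ones(n), x1, x2])
e = e - Z @ np.linalg.lstsq(Z, e, rcond=None)[0]  # make noise orthogonal to predictors
y = 2.0 * x1 + e

names = ["x1", "x2"]
cols = [x1, x2]
while names:
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    tvals = np.abs(beta[1:] / se[1:])   # skip the intercept
    worst = int(np.argmin(tvals))
    if tvals[worst] >= 1.96:            # everything remaining is significant
        break
    names.pop(worst)                    # drop the least significant predictor
    cols.pop(worst)
```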
Table 1: Removal of Variables and Resulting Adjusted R²
|Sequence of Removal||Variable Dropped||Adjusted R²|
The significant variables remaining are listed in Table 2. At this stage, the adjusted R² of the model is 0.8105. In addition, by inspecting the values of Estimate ± Standard Error, it’s verified that the coefficients of these variables are non-zero. As such, the finalised variables are considered sufficient for further analysis.
Table 2: List of Variables, Estimate, Standard Error and P Value
|Variable||Interpretation||Estimate||Standard Error||P value|
|AMB_TEMP||Ambient Air Temperature||-0.24038||0.04669||0.000000414|
|PH_RAIN||pH of Rain||-1.35644||0.34374||0.0000938|
|PM10||Particulate Matter ≤ 10µm||0.47179||0.01566||< 2e-16|
|WS_HR||Average Wind Speed per Hour||-0.59653||0.17656||0.0008|
From Figure 2, the Residuals versus Fitted Plot shows no non-linear relationship between the predictors and the dependent variable. This means the multiple linear regression model is capable of capturing the trends in the training dataset.
From the Normal Q-Q Plot, a straight line is observed. This satisfies the assumption that the residuals are normally distributed.
From the Scale-Location Plot, there appears to be evidence of heteroscedasticity, as the red line moves upwards as the fitted values increase beyond 30. Further verification through the Breusch-Pagan Test in R confirms that heteroscedasticity is present, with a p-value of less than 2.2e-16. This violates the linear regression assumption that the variance of the residuals is constant.
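The Breusch-Pagan statistic can be computed by hand for intuition: regress the squared OLS residuals on the predictors and compare the LM statistic n·R² against the chi-square critical value. The sketch below (synthetic data with noise that grows with x, not the report's R `bptest`) shows the statistic far exceeding the 5% critical value of 3.84 for one degree of freedom:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(1, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * x    # error variance grows with x
X = np.column_stack([np.ones(n), x])

# Main OLS fit, then square the residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid2 = (y - X @ beta) ** 2

# Auxiliary regression of squared residuals on the predictors.
g, *_ = np.linalg.lstsq(X, resid2, rcond=None)
fitted = X @ g
r2 = 1 - ((resid2 - fitted) ** 2).sum() / ((resid2 - resid2.mean()) ** 2).sum()
lm = n * r2   # compare with the chi2(1) critical value of 3.84 at the 0.05 level
```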
Lastly, the Residuals versus Leverage Plot indicates that there are no outliers that have a high Cook’s distance. This means that any outliers in our dataset, if any, are not significant in influencing the linear regression model built.
Since heteroscedasticity was observed in our model as described in the residual analysis above, transformation of the dependent variable and application of a weighted least squares model were the two approaches explored to correct it.
Various approaches (e.g. applying a natural logarithm to PM2.5) were explored to transform the dependent variable before rebuilding the model, but the approach that yielded the best result was a Box Cox Transformation. The intent of the Box Cox Transformation is to find a value of λ such that the dependent variable is transformed to y(λ) = (y^λ − 1)/λ for λ ≠ 0 (and ln y for λ = 0). Due to the limitation that the dependent variable must be positive, 13 instances of the training dataset where PM2.5 = 0 were removed. The transformation gave λ = 0.4, and building a new model on the transformed dependent variable yielded the results in Table 3, with an adjusted R² of 0.7577.
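The transformation with the λ = 0.4 found above is straightforward to apply (a small sketch; note the requirement that y be strictly positive, hence the removal of PM2.5 = 0 rows):

```python
import numpy as np

def box_cox(y, lam=0.4):
    """Box Cox transform: (y^lam - 1)/lam for lam != 0, log(y) for lam == 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

# Illustrative values: box_cox(1) = 0, and box_cox(32) = (32**0.4 - 1)/0.4 = 7.5.
transformed = box_cox(np.array([1.0, 32.0]))
```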
Table 3: List of Variables, Estimate, Standard Error and P Value (Transformed Dependent Variable)
|Variable||Interpretation||Estimate||Standard Error||P value|
|AMB_TEMP||Ambient Air Temperature||-0.05957||0.009197||2.82E-10|
|PH_RAIN||pH of Rain||-0.3407||0.07574||9.1E-06|
|PM10||Particulate Matter ≤ 10µm||0.08149||0.00337||< 2E-16|
|WS_HR||Average Wind Speed per Hour||-0.1588||0.04859||0.00118|
On inspecting the residual diagnostic plots of the newly built model, an improvement is noted in the Scale-Location Plot, while the characteristics of the other three plots remain unchanged. However, verifying the model with the Breusch-Pagan Test shows that the residuals are still heteroscedastic, with a p-value of 4.766e-08 rejecting the null hypothesis of homoscedasticity.
Since the initial model was built with ordinary least squares estimators, heteroscedasticity would lead to incorrectly computed standard errors. Accordingly, the initial model was transformed into one with homoscedastic errors using a generalized least squares estimator.
It is noted that this transformation requires a logarithmic function to be applied on all predictors. As such, observations wherein the variables contained zero value were removed (2 occurrences of SO2 and 1 occurrence of WS_HR), while UVB was removed from the dataset altogether as there were 233 occurrences of zero values for this variable, amounting to more than half of the training dataset. This resulted in a training dataset of 407 observations with 19 variables used to build a new model applying weighted least squares.
The fitted models are as follows:
Ordinary Least Squares:
Generalised Least Squares:
However, the adjusted R² of the Generalised Least Squares model dropped from 0.8105 to 0.8047, while the p-value of the Breusch-Pagan Test did not improve. Hence, this approach was not considered to be appropriate, and the dependent variable transformation approach was adopted instead.
Based on the model built with the Box Cox transformed dependent variable, a VIF test was done to inspect the finalized dataset for multi-collinearity.
Table 4: VIF Results
From the results, it was noted that CO and NMHC exhibit multi-collinearity against each other. NMHC was dropped due to its lower significance compared to CO, resulting in a model without multi-collinearity. At this point, the adjusted R² is 0.7556, which is a negligible drop from the earlier model. In addition, the residual diagnostic plots did not exhibit any difference due to the removal of NMHC from the model.
With the model finalized, the validation dataset was used to evaluate its prediction accuracy. Predicted values of the transformed dependent variable were generated with the λ = 0.4 obtained from the Box Cox Transformation, and compared with the actual values of the transformed dependent variable using the Mean Absolute Percentage Error (MAPE).
The formula for MAPE is MAPE = (100%/n) Σ |A_t − F_t| / |A_t|, where n is the number of observations, A_t is the actual value, and F_t is the forecasted value.
Since a limitation of MAPE is that actual values cannot be zero (division by zero), the two such occurrences were first removed from the test dataset. This gives a MAPE of 25.8%. While this result is not ideal, it is a marked improvement over the original model built in section 4.1, before the dependent variable transformation, whose MAPE comes to 34.3%.
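The MAPE computation can be sketched as follows (illustrative values, not the report's data; zero actuals must be removed first, as noted above):

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent. Actuals must be non-zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# Errors of 10% on each of two points give a MAPE of 10%.
score = mape([10.0, 20.0], [9.0, 22.0])
```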
The formula of the final model is given by:
In this report, an air quality monitoring dataset for Northern Taiwan was used to build a Multiple Linear Regression model to predict the concentration of PM2.5 based on other measured environmental and meteorological data. While the dataset was assessed to be suitable for Linear Regression, heteroscedasticity of the residuals was observed during the initial residual analysis. The dependent variable was transformed using Box Cox Transformation to correct for heteroscedasticity, resulting in an approximately 10% improvement in model accuracy as compared to the original approach, with the final model having a MAPE of 25.8% based on the validation dataset. Moving forward, an alternative predictive model such as logistic regression through categorizing PM2.5 into dichotomous levels (e.g. Healthy, Unhealthy) could be explored to improve the accuracy of the predictive model, as homoscedasticity is not required for error terms in logistic regression.
The following metadata is provided by the Environmental Protection Administration of Taiwan together with the raw data.
The columns in csv file are:
|A0178551X||Choo Ming Hui Raymond|
|A0178500J||Yang Chia Lieh|
According to the National Environment Agency (NEA), Singapore's household electricity consumption has increased by about 17 per cent over the past decade. A 2017 NEA survey of 550 households found that air-conditioners were largely to blame, accounting for about 24 per cent of a typical home's consumption. However, it remains unknown whether households currently use their air-conditioners out of habit (e.g. turning on the air-conditioner every day) or only when the need arises. In addition, when referenced against the surface air temperature, there might be a delay in turning on the air-conditioner, as our sensation of heat depends not only on temperature but also on factors such as humidity, cloud cover, sun intensity, and wind. Hence, the intent of this study is to investigate whether there is a relationship between household electricity consumption and surface air temperature in Singapore.
This study hypothesises that:
Table 1: Data Sources
|Mean Surface Air Temperature||Monthly||National Environment Authority||https://data.gov.sg/dataset/surface-air-temperature-monthly-mean|
|Household Electricity Consumption||Monthly||Energy Market Authority||https://data.gov.sg/dataset/monthly-electricity-consumption-by-sector-total|
To perform analysis on these variables, data sources with the same time resolution were first obtained from data.gov.sg. The Mean Surface Air Temperature was obtained from the "Surface Air Temperature – Monthly Mean" dataset published by NEA, whereas the Household Electricity Consumption was obtained from the "Monthly Electricity Consumption by Sector" dataset published by the Energy Market Authority (EMA) of Singapore.
It was noted that “Monthly Household Electricity Consumption” records are only available till September 2015. Hence, this study utilised data from October 2010 to September 2015 (5 years), providing a total of 60 data points for analysis.
Based on our hypothesis indicated in section 2, the input series was taken to be Mean Surface Air Temperature. Initial inspection suggests that the data is stationary. However, the time series and Autocorrelation Function (ACF) plots show a 12-month seasonality. This is consistent with the knowledge that weather in Singapore follows a 12-month cycle, largely characterised by two monsoon seasons: the Northeast Monsoon (December to early March) and the Southwest Monsoon (June to September). As such, the first step is to apply first-order seasonal differencing of 12 months to the input series.
The resulting Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots (Figure 2) indicated that the resulting residuals are white noise. Hence, the Seasonal Difference model built for the input series is taken to be (0,0,0),(0,1,0)12 at this stage.
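Seasonal differencing with period 12 simply subtracts the value 12 months earlier. The sketch below (a stand-in sinusoidal series, not the actual temperature data) shows that a perfectly 12-period-periodic signal differences to zero, which is why this step removes the seasonal pattern:

```python
import numpy as np

t = np.arange(60)
series = np.sin(2 * np.pi * t / 12)          # stand-in for a 12-month seasonal series

# First-order seasonal difference: y_t - y_{t-12}
seasonal_diff = series[12:] - series[:-12]
```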
With the Seasonal Difference Model for the input series identified in section 4.1, the output series (Household Electricity Consumption) was pre-whitened with the same model. Based on the ACF and PACF plots in Figure 3, it’s observed that the PACF plot is dying down, while the ACF plot cuts off at the first lag. As such, a MA(1) term was added to the Seasonal Difference model in an attempt to achieve white noise residuals.
From Figure 4, it was observed that the MA(1) term added was significant. The Akaike’s ‘A’ Information Criterion (AIC) was deemed acceptable, and the model managed to accurately model the 60 points to project a forecast that not only repeats the trend, but also has a tight confidence interval. The residuals of this (0,0,1),(0,1,0)12 IMA Model has also achieved white noise as evidenced in the ACF and PACF plots in Figure 5. Hence, the IMA model built for the output series is taken to be (0,0,1),(0,1,0)12 at this stage.
Since the input series was previously pre-whitened with a (0,0,0),(0,1,0)12 Seasonal Difference model, the IMA model (0,0,1),(0,1,0)12 developed for the output series was applied to the input series to verify that pre-whitening the input series with the output IMA model would still result in white noise residuals.
Based on Figure 6, it was observed that the residuals remained as white noise. In addition, it was observed that the additional MA(1) term added is not significant. This was assessed to be reasonable, as the original input series didn’t require this MA(1) term during the previous pre-whitening stage to achieve white noise. Hence, the final model chosen is a (0,0,1),(0,1,0)12 IMA Model.
With the (0,0,1),(0,1,0)12 IMA model finalised in section 5.2, the cross-correlation between the pre-whitened input and output series was computed. From the results shown in Figure 8, significant correlation was observed between lag 0 and lag 2. Hence, the transfer function model will resemble yt = ν0·xt + ν1·xt−1 + ν2·xt−2 + nt,
where ν0, ν1, ν2 are the transfer function weights to be determined.
An alternate method to represent the model would be as follows:
In order to determine the transfer function model, the parameters b, s and r must first be identified. From Figure 8, it was shown that the first non-zero correlation starts at lag 0, which means that the input and output series are in sync. As such, b = 0. Meanwhile, the cross correlation peak was observed at lag 1. Hence, s = 1. Lastly, it was noted that the cross correlation seemed to be oscillating about 0 in the pre-whitening plot. Hence, r = 2 was selected.
With the transfer function model parameters selected in the previous section, the model was fitted and evaluated. From Figure 9, it was observed that both the AIC and Schwarz’s Bayesian Criterion (SBC) are reasonably small, and the parameters used in fitting the models were all statistically significant. Figure 10 also shows that the residuals of this transfer function model was white noise. Compared with an alternate transfer function model where r = 1 (b = 0, s = 1, r = 1), this model was assessed to be better in terms of AIC and SBC. Hence, the final transfer function model parameters selected was (b = 0, s = 1, r = 2).
Based on the pre-whitening plot in Figure 8, it was initially concluded that there was no lag between Household Electricity Consumption (y) and Mean Surface Air Temperature (x). However, expansion of the transfer function model shown in figure 9 yields the following relationship:
This shows that while there is no lag between x and y, past values of both variables also impact y. More importantly, it was observed that the coefficients of each y term are the same with itself 12 months ago. Rearranging the above formula gives the following:
This means that while yt-1, yt-2, xt, xt-1, et, et-1, et-2 and et-3 may explain the variability in y, an offset from last year’s set of y terms is required for it to fully explain yt. This indicates an upward trend in household electricity consumption, assuming mean surface air temperature remains the same every season.
This study hypothesizes that there is an instantaneous unidirectional relationship between the mean monthly surface air temperature of Singapore, and the monthly household electricity consumption in Singapore. While some believe that human behaviour such as being cost conscious or simply the lack of awareness of rising temperatures may influence and possibly delay the decision to switch on the air-conditioner, time series transfer function modelling of these two variables confirms that (1) household electricity consumption is influenced by mean surface air temperature and (2) they are moving in phase. It was further observed that there is a rising trend in the household electricity consumption in Singapore, as household electricity consumption is autocorrelated with its previous season.
Choo Ming Hui Raymond (A0178551X)
Wong Yeng Fai, Edric (A0178329R)
Blockchain and cryptocurrency were introduced to the public in 2009 with the creation of Bitcoin by Satoshi Nakamoto. Over the last 9 years, the price of Bitcoin has gone through a roller-coaster journey: it reached nearly $20,000 per bitcoin at the peak but has plunged to $6,500 at the moment; however, this did not affect its position as the cryptocurrency with the highest market cap. Inspired by Satoshi, the Russian-Canadian programmer Vitalik Buterin initiated the Ethereum project in late 2013, and it soon became the second largest cryptocurrency globally. Just like natural gold and silver, Bitcoin and Ethereum require a process of “mining” to be generated. Mining is a computational process that consumes a great deal of processor power, and manufacturers build mining machines for it. One key component of such miners is the GPU, and it is believed that the demand from cryptocurrency mining has driven changes in GPU prices.
A dataset was downloaded from Kaggle that records the historical prices of GPU chips and cryptocurrencies, scraped from a price-comparison website and CoinMarketCap.com. We selected two time series, GPU and Bitcoin prices from January 1st to March 16th 2018, to study whether there is any relationship between them. There is a total of 75 records over the 2.5 months of data.
JMP was used to build the transfer function, and the process is shown below:
Firstly, we selected GPU price in USD as the Y, Time Series variable and Bitcoin price in USD (open) as the Input List.
We then observed that the PACF of the input cuts off at lag 1 and filled in these parameters in the ARIMA Model Group.
We found that ARIMA(0,1,0)(0,1,1) had the best performance.
As shown above, we also fit ARIMA(0,1,0)(0,1,1) to the output GPU price, and the fit is likewise good.
We then used this ARIMA(0,1,0)(0,1,1) model for pre-whitening, as shown above. Because the first significantly non-zero cross-correlation occurs at lag -5, b is equal to 3. Also, the values exhibit exponential decay after -4, which suggests that s = -4 - (-5) = 1. Usually, r is either 1 or 2, and we start with r = 1.
From the final result, we conclude that the GPU price is correlated with the opening price of Bitcoin. As the formula shows, the GPU price can be predicted from the previous day’s Bitcoin price.
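The pre-whitening logic can be sketched outside JMP as well. The example below uses synthetic prices, not the Kaggle data: the GPU series is constructed to follow Bitcoin with a one-day delay, both series are first-differenced (the I(1) part of the ARIMA(0,1,0) pre-whitening), and the cross-correlation function then peaks at the built-in lag.

```python
import numpy as np

def cross_corr(x, y, max_lag=8):
    """Sample cross-correlation at lags -max_lag..max_lag;
    lag k > 0 means x leads y by k steps."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    out = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            out[k] = float(np.mean(x[:n - k] * y[k:])) if k < n else 0.0
        else:
            out[k] = float(np.mean(x[-k:] * y[:n + k]))
    return out

rng = np.random.default_rng(1)
btc = 8000 + np.cumsum(rng.normal(0, 50, 75))   # 75 daily records, as in the data
gpu = np.empty_like(btc)
gpu[0] = 400.0
gpu[1:] = 400 + 0.05 * (btc[:-1] - 8000)        # GPU follows BTC one day later
gpu += rng.normal(0, 1, 75)

ccf = cross_corr(np.diff(btc), np.diff(gpu))
best_lag = max(ccf, key=lambda k: abs(ccf[k]))  # recovers the one-day lead
```

The 0.05 sensitivity and price levels are hypothetical; the point is that differencing followed by cross-correlation exposes the lead-lag structure.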
Ni Haojun A0178177N
Wang Ziyang A0077924U
Xue Zehao A0178283U
Zhang Guoxiao A0178480W
Does motor vehicle deregistration affect the retail index of motor vehicles in Singapore?
We used monthly data sourced from the Singapore Department of Statistics (DOS) for the period Jan 2009 to Jul 2018 (115 data points). The dataset consists of 2 columns. For the output, we use the retail index (at constant prices) for motor vehicles per month in Singapore. For the input, we use the total motor vehicles deregistered under the vehicle quota system. The vehicle categories include cars, goods vehicles, buses and taxis.
We observed that both the motor vehicle retail index and the motor vehicle deregistration data display a similar trend. Both series are non-stationary and show a rising trend that flattens over the recent period.
Using total deregistered vehicles as the input, we started by fitting an ARIMA model and pre-whitening the input series. Based on the PACF graph, which showed that the significance of the lags drops after lag 2, we chose AR(2) with yearly seasonality of period 12.
We proceeded to apply ARIMA to the input data series. The combination selected for the motor vehicle deregistration series was (0,2,2)(0,2,2) period 12. We tested various combinations and noted that with the (0,2,2)(0,2,2) period 12 combination, the AIC is the lowest and all the coefficients are significant.
Based on the pre-whitening chart, we identified that the significant cross-correlation occurred at lag 3 (b = 3). However, the series immediately decayed, therefore s = 0.
Using the result from the pre-whitened series, where b = 3 and s = 0, we computed the transfer function using the combination (0,2,2)(0,2,2) period 12.
Using r = 2 and s = 0, the numerator for motor vehicle deregistration turns out to be 0, indicating no correlation between the retail index and deregistered motor vehicles. The result shows that not all coefficients in the formula are significant, with AIC = 683.8 and SBC = 700.9.
Using s = 1 and r = 2, we noted a negative correlation between the motor vehicle retail index and motor vehicle deregistration, with AIC at 678.9 and SBC at 700.8. Both measurements improved versus the model where s = 0. However, not all coefficients are significant.
We repeated the steps with ARIMA (0,2,2)(0,2,2)12 with constant, using s = 1 and r = 1. We noted that the AIC at 674.8 and SBC at 694.2 are smaller compared to r = 2. We decided to select this model as our final one, even though not all coefficients are significant.
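The selection above rests on comparing AIC across candidate models. As a reminder of the mechanics, the sketch below computes AIC from a least-squares fit; the residual sums of squares are hypothetical placeholders (the actual values 683.8, 678.9 and 674.8 came from JMP), but the comparison logic is the same: a smaller residual and fewer parameters give a smaller AIC.

```python
import math

def aic(rss, n, k):
    """AIC for a least-squares fit: n * ln(RSS / n) + 2k,
    where k is the number of estimated parameters."""
    return n * math.log(rss / n) + 2 * k

n = 115  # monthly observations, Jan 2009 - Jul 2018

# Hypothetical residual sums of squares for the three candidate models;
# the real values would come from the JMP transfer function reports.
candidates = {
    "s=0, r=2": aic(5200.0, n, 6),
    "s=1, r=2": aic(4900.0, n, 7),
    "s=1, r=1": aic(4800.0, n, 6),
}
best = min(candidates, key=candidates.get)
```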
Transfer function model formula in JMP:
Solving the equation:
In the equation above, yt refers to the retail index at time period t, et is the error term, and xt refers to the total motor vehicles deregistered at time t.
Based on the solved equation, we noted that the impact of motor vehicle deregistration is very small, at lag 3 and lag 4. On the other hand, the auto-correlation of the retail index is highest at lag 13 and lag 14. The largest error terms occur at lag 25 and lag 26, which highlights that the predictive power of this time series model is not suitable for the long term.
In conclusion, there is a very weak unidirectional relationship between deregistered motor vehicles and the motor vehicle retail index, where an increase in deregistered motor vehicles leads to an increase in the motor vehicle retail index. In addition, there is strong auto-correlation in the motor vehicle retail index.
The dashboard can be accessed on Tableau Public.
Anusuya Manickavasagam (A0163300Y)
Kesavan Sridharan (A0163207M)
Muni Ranjan (A0163382E)
Pradeep Kumar (A0163453H)
Supply chain optimization is critical to ensure the optimal operation of a manufacturing and distribution supply chain; it refers to minimizing operating cost while meeting the demand of downstream customers. Nowadays, data analytics, the science of examining raw data to draw conclusions, is widely used in supply chain management and optimization to make better business decisions. The supply chain is a great place to apply analytics in search of a competitive advantage, both because of its complexity and because of the prominent role the supply chain plays in a company’s cost structure and profitability.
In this research, the data set is simulated according to a real business situation. It represents the historical data of a small fictitious retailer, including files of items, orders, purchase orders, suppliers, BOM, etc. Below is the dashboard of the data set.
The optimization is conducted based on the pull strategy and agile theory. Agile is defined as a strategy that is more responsive in a volatile marketplace; it is totally demand driven. As consumer buying patterns change at a very rapid pace, so must the whole of supply chain management. The fundamental drivers of an agile supply chain are speed, cost and efficiency. A pull strategy is one where customer demand drives the entire production process, rather than long-term customer demand forecasts. The advantages of a pull strategy are higher service levels, lower carrying costs, decreased inventory levels, and fewer markdowns. We then further conduct inventory optimization, including ABC analysis and stock reward function calculation, as well as route optimization for optimal route design.
Demand forecasting is important for the other parts of supply chain management. As the retailer, we need to do operational short-term demand forecasting. For this project, we mainly predicted the quantity of each item. As our data covers the period from January 2013 to September 2016, we carried out demand forecasting for October, November and December 2016. The data structure is shown below.
The data includes the order records of each product item in three different warehouses on different dates; each warehouse carries the same 106 items. Since each item has very little order quantity per record, the data was rearranged into monthly orders. The three warehouse locations are Chicago, New York and Los Angeles. To simplify the introduction of our forecasting model, New York is chosen as the example for the forecasting and the later inventory optimization. A quantitative model is built to predict the future demand of each item. The following figure shows one item in one location over the roughly 3 years of records.
As can be seen, the trend for a single product is not obvious. We applied some traditional machine learning methods, but their performance was unsatisfactory; we also tried prediction with time series models, but the accuracy was not as good as expected.
The reason may be that the demand for each item in each warehouse is small, so the models are sensitive to tiny variations, and the seasonality is not obvious. We therefore moved on to another method. Long Short-Term Memory (LSTM) is an excellent deep learning method for time series forecasting. The partial structure of an LSTM is shown below.
The LSTM is a special kind of recurrent neural network (RNN) that is capable of learning long-term dependencies, learning from previous data inputs; in our case, each neuron learns from the previous neuron. We set the time step to 6, which means each prediction depends on the previous 6 months of actual demand. After training the model on our 45 months of data, we obtain October’s demand; combining the predicted October demand with the history, we predict November, and then combine again to forecast December. This realizes information persistence and is less dependent on finding seasonality, so the results show better performance.
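The windowing and recursive three-month forecast described above can be sketched in a few lines. The demand numbers below are hypothetical, and a simple moving average stands in for the trained LSTM; the point is the data framing (time step 6) and feeding each prediction back in as input for the next step.

```python
import numpy as np

def make_windows(series, step=6):
    """Frame a 1-D series into supervised (X, y) pairs:
    each row of X holds `step` consecutive months, y is the next month."""
    X = np.array([series[i:i + step] for i in range(len(series) - step)])
    y = np.array(series[step:])
    return X, y

def recursive_forecast(series, one_step_model, horizon=3, step=6):
    """Forecast `horizon` months ahead, feeding each prediction
    back in as input for the next step (Oct -> Nov -> Dec)."""
    history = list(series[-step:])
    preds = []
    for _ in range(horizon):
        preds.append(one_step_model(np.array(history[-step:])))
        history.append(preds[-1])
    return preds

# Hypothetical monthly demand; a moving average stands in for the trained LSTM.
demand = [12, 15, 11, 14, 16, 13, 15, 17, 14, 16, 18, 15]
X, y = make_windows(demand)          # training pairs with 6 lags each
oct_nov_dec = recursive_forecast(demand, lambda w: float(w.mean()))
```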
As we can see, this is the result of training on the 45 months of data; the fitted values match our original demand well, and we also obtain the predicted demand for the last three months of 2016.
Then we repeated the same work for each item in each warehouse.
Used for: Inventory management and optimization
Based on: Pareto principle, 80/20 rule
Logic: vital few and trivial many
Objective: Maximize service AND Minimize inventory cost
Inventory management is a challenging problem area in supply chain management. Companies need to hold inventories in warehouses in order to fulfill customer demand, but these inventories incur holding costs and tie up funds that can be lost. To manage and optimize inventory, ABC analysis is conducted to categorize inventory items into A, B and C classes based on an importance rating.
A data set of sales and inventory covering over 100 types of products in 2015 is used for the analysis. To carry out the analysis, we first derive fields such as the unit cost and the usage of each product over the year. Then we calculate the 2015 turnover, rank the products by net value, and obtain the cumulative cost and item shares. With the pre-processed data table in the figure above, we finally apply the ABC rules listed below and plot the classification results as a line chart.
As Figure 3 reveals, just over 20% of the products are classified as type A, and they make up over 70% of the net value. Type B follows, with over 30% of the products accounting for almost 20% of the revenue. Finally, the products in class C earn less than 10% of the net value yet make up 40% of the total products, a great number indeed. Overall, this confirms the rule of thumb: vital few, trivial many.
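The classification step can be sketched as follows. The item names and net values are hypothetical, and the 70%/90% cumulative-value cut-offs are the common rule-of-thumb thresholds rather than the exact rules used in the figure.

```python
def abc_classify(net_values, a_cut=0.70, b_cut=0.90):
    """Rank items by annual net value, then assign classes by cumulative
    share of total value: A up to a_cut, B up to b_cut, C for the rest."""
    total = sum(net_values.values())
    ranked = sorted(net_values, key=net_values.get, reverse=True)
    classes, running = {}, 0.0
    for item in ranked:
        running += net_values[item] / total
        classes[item] = "A" if running <= a_cut else ("B" if running <= b_cut else "C")
    return classes

# Hypothetical annual net values for five products
values = {"item1": 7000, "item2": 1500, "item3": 800, "item4": 400, "item5": 300}
classes = abc_classify(values)
```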
ABC analysis can also be used in warehouse design for the storage and retrieval process. Since the majority of the picking activity is performed in a rather small area, the warehouse layout should be optimized to reduce the time spent looking for products in the back of the warehouse. As the figure demonstrates, the fastest-moving products in the inventory should be located closest to the shipping, staging, and receiving area in the lower right of the diagram. The areas marked A, B and C indicate the ABC items.
Previously, the shelves were set up in a U-shaped layout, providing a logical path for picking. However, this layout is not efficient: for each partial order, operators have to walk the whole U. If the stored items are bulky or the U cell is widespread, the time spent walking and transporting becomes significant.
The time spent moving or transporting material or parts is considered a non-value-adding operation, or waste, in lean thinking. ABC analysis is deployed here focusing on the “stock turns” index, or frequency of picking. Since class A items are picked so often, it is only common sense to place them close to the entry-exit (green zone) in order to reduce reaching distance and picking time. Class B items, statistically in less demand, are placed behind class A (orange zone), and class C items (the least picked) are stored even further away (red zone). In most cases, the journey in the U cell will then be restricted to the front (green zone); only in the statistically less frequent cases will the operator have to walk the whole U cell.
On top of the time and distance reduction, the lighting in the green zone could stay on constantly, while the orange and red zones could be equipped with sensors or switches to light them up only when somebody is in them. In the same way, for the comfort of workers, heating or air conditioning is not necessary in zones where they seldom go. These are opportunities to reduce energy expenses.
Beyond these implications and strategies, a further step would be to develop a customized inventory management agent system by consolidating the techniques proposed so far, as demonstrated above.
Since our customer is a retail company, it should pay close attention to stock-related management. Besides demand prediction, stock optimization is also an important point. Here our team tries an emerging method called the stock reward function.
The stock reward function is defined as below:
In our implementation, we suppose that R∗ is equal to R and S = 0, which gives the stock reward function below:
Since we wanted to take a different route from the other teams, we not only tried Lingo to implement this function, but also used Python.
We used a Python package called DEAP, which is a genetic algorithm package. We fixed some parameters as below:
Our chromosome and gene design becomes quite simple: it consists of K. For the demand yt, we choose the 3-month demand.
The target of this GA is to maximize the reward for every product. We set the initial GA population to 1000, ran 40 generations, and used mutation to obtain different input combinations.
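Since the report’s reward formula and DEAP setup appear only in the figures, here is a DEAP-free minimal sketch of the same idea: a single-gene GA searching for the stock level K that maximizes a hypothetical reward penalizing both holding and stockouts. The population size is scaled down from the 1000 used in the project, and the cost weights are illustrative assumptions.

```python
import random

def reward(k, demand, hold_cost=1.0, stockout_cost=5.0):
    """Hypothetical stock reward for stock level k: penalize leftover
    stock (holding) and unmet demand (stockouts) over the demand samples.
    The report's actual reward formula is in the figures and not reproduced."""
    return -sum(hold_cost * max(k - d, 0) + stockout_cost * max(d - k, 0)
                for d in demand)

def ga_optimal_stock(demand, pop_size=100, generations=40, mut_rate=0.2, seed=42):
    """Minimal genetic algorithm over a single gene K (the stock level)."""
    rng = random.Random(seed)
    lo, hi = 0, 2 * max(demand)
    pop = [rng.randint(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda k: reward(k, demand), reverse=True)
        parents = pop[: pop_size // 2]              # truncation selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2                    # arithmetic crossover
            if rng.random() < mut_rate:             # small random mutation
                child = min(hi, max(lo, child + rng.randint(-3, 3)))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda k: reward(k, demand))

best_k = ga_optimal_stock([14, 16, 15])   # 3-month demand for one product
```

Keeping the top half of each generation makes the search elitist, so the best stock level found never regresses between generations.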
Finally, we obtain the optimal stock for every product, as shown below:
Retail order routing is the efficient allocation and fulfillment of customer orders. It means knowing where all the inventory is located, understanding the costs of shipping from each location, and having the flexibility to change priorities on the fly to better manage the business. Order routing helps manage, optimize and improve the profitability of the retail business. As the business grows, routing the orders becomes a complex problem that grows exponentially; running a real-time business without a solid retail order routing system is simply not sustainable. Efficient and timely routing helps increase customer satisfaction.
The problem discussed here is about:
MyRouteOnline is online software that aids in the optimization of multiple routes using the Google Maps API. We can import as many as 350 addresses from Excel, or manually add multiple locations in various formats.
1) Enter the addresses manually in the prompt that appears, or import the addresses from Excel files, and click the Next button.
2) Set the required parameters, such as the departure time, number of routes and vehicle size, and click the Plan My Route button at the bottom corner.
3) Visualization of the route.
Manual changes in the route can be performed using the Manual Changes button and the best route map can be obtained by clicking the Print Map button.
For deliveries adhering to time windows, i.e. when the customer wants the delivery to be done within a particular time window during the day, the instructions below need to be followed:
1) Add two additional fields to your Excel spreadsheet. Label them “Delivery From” and “Delivery To”. The labels must be exactly as written here, without the quotes. These labels will be recognized as “Window From” and “Window To” in the software.
2) In the “Delivery From” field, put the earliest delivery time for the stop, and in “Delivery To”, the latest. Make sure that the times are in 24-hour (military) time. For example, 10:00 AM is 10:00, and 2:00 PM is 14:00.
3)Import your spreadsheet into MyRouteOnline.
4) Set the parameters you need and click Plan My Route.
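As a quick illustration of the spreadsheet preparation in steps 1) and 2), the snippet below writes a stop list with the two extra columns in 24-hour format. The addresses are hypothetical, and a CSV is used here for simplicity, although MyRouteOnline also accepts Excel files.

```python
import csv

# Hypothetical stops; times must be in 24-hour format, as the software expects.
stops = [
    {"Address": "1 Example Ave", "Delivery From": "10:00", "Delivery To": "14:00"},
    {"Address": "2 Sample Rd",   "Delivery From": "09:30", "Delivery To": "11:00"},
]

with open("route_stops.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Address", "Delivery From", "Delivery To"])
    writer.writeheader()
    writer.writerows(stops)
```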
A0163200B – Anand Rajan
A0163364E – Gao Yuxi
A0163265E – Li Hangxing
A0163312U – Yao Meng
A0163313R – Shi Haiyuan
Singapore is a unique city-state with cultural diversity that has gone through tremendous changes in the past 40 years, growing from a sleepy fishing village into one of the most vibrant commercial hubs in the world. As the economy and infrastructure were transformed from pre-industrial to industrial status to position the nation-state as a competitive global market player, its divorce rate has also been on the rise, alarmingly ranking 4th among Asian countries. The risk factors that trigger marriage dissolution are plentiful, gradually making it a grave social concern; they involve children, employment, household income, marriage duration, education, the changing social status of Singaporean women, and diverse racial and ethnic characteristics (Straughan, 2009).
Based on data availability and a literature review of the factors impacting divorce (Quah, 2016), this dashboard applies 5 crucial factors for visualization and analysis: household income, ethnic group, age group of the couple, marriage duration and employment. Here we identified the business goals and set up KPIs.
Access here: Dashboard – Divorce Trend in Singapore
According to the annual report released by the Department of Statistics Singapore, a total of 7,614 marriages ended in a divorce or an annulment in 2016, the highest annual figure in 36 years. Both civil and Muslim divorces contributed to this peak, a rise of 1.5% from the 7,522 marital dissolutions in the previous year.
Compared to other Asian countries that also suffer from increasing divorces, Singapore ranks 4th, with nearly 7 annual divorces per 1,000 married males or females.
Hence, these record-high figures, together with the SmartArt (shown on the left) used for exploratory analysis of divorces over the years, point to the next peak of this socioeconomic phenomenon in the near future.
Overall, the number of divorced people has been increasing yearly across all age groups in Singapore. Specifically, people aged 25-49 make up the majority of the divorced population, meaning that middle-aged and late-middle-aged people are more likely to divorce than people in any other age group. This is plausible because:
As we can see, marriages of below 5 years’ duration have the largest number of divorces, followed by the 5-9 year duration; all other durations have significantly lower numbers. There is a cross-over in 2010, since when the 5-9 year duration has had more divorces than the below-5-year duration. There were also sudden peaks in 1999 and 2003 across all durations. Some possible reasons:
Here we can see sudden peaks in the long-term unemployment rate from 1997 to 1999 and from 2001 to 2003. These correlate with the sudden peaks in divorces around 1999 and 2003 shown in the previous 2 charts. This is plausible because:
As we can see from the graph, from 2000 to 2003 household income was decreasing while the number of divorces was increasing dramatically. Meanwhile, according to our analysis of those years, the average price level was still increasing. The conflict between income and expenditure increased the family burden, which affected the harmony of family relationships. After 2005, as income rose, the growth of the total divorce number slowed down.
By 2011, when household income passed S$7,000 per month, the total divorce number tended to stabilize. After 2011, although household income kept increasing, the total divorce number remained stable. From this, we could draw the conclusion that once household income exceeds S$7,000 per month, income may no longer influence the decision to divorce.
Generally speaking, the divorce trends are quite similar between Indians and Chinese. Before 1998, the divorce rate of the Chinese was slightly higher than that of the Indians. After 1998, the divorce rate of the Chinese kept climbing, while the divorce rate of the Indians began to fluctuate between 2.0 and 2.5. From 2010, the divorce rates for both races have fluctuated at a relatively constant rate.
There was a peak around 1998 for both Indian and Chinese divorces. The Asian Financial Crisis in 1997 caused many Singaporeans to lose their jobs; household incomes decreased and financial conflicts in families worsened. Hence, in the few years after the Asian Financial Crisis the divorce rate increased, then decreased after the crisis eased.
From the graph, the divorce rate of the other races, which consist mainly of Malays, is significantly lower than that of the Indians and Chinese. This could be due to their overall economic status, social structure and outlook on life, which differ from those of the Chinese and Indians.
Further, it is observed that the divorce trend of the other races (mainly Malays) is similar to that of the Chinese, with a delayed effect of about 5 to 6 years. Hence, we could predict that after 2015, the divorce rate of the other races will begin to fluctuate at a relatively constant rate, like the trend for the Chinese after 2010.
The general divorce rate in 2016 was unchanged from 2015. However, there was a “prominent shift” in the age profile of divorces towards the older age groups over the last decade. From the above discussion, we identified these main factors affecting divorce:
The dashboard was limited to the kinds of information available, whereas actual results reveal more variables related to divorce proneness. The following are some other variables which could affect marriages:
Divorce is a form of marital instability and a type of family dissolution. It is a complicated process involving multi-dimensional factors, both intra-familial personal factors and extra-familial socio-cultural factors. There are some suggestions regarding the rising divorce rate:
Li Hangxing A0163265E
Li Ruiqi A0163293A
Lim Chong Wui A0163460L
Shi Haiyuan A0163313R
Yao Meng A0163312U
Advances in digital technology have opened up new possibilities to enhance the way we live, work, play, and interact. A “Smart City” is not a measure of how advanced or complex the adopted technology is, but of how well a society uses technology to solve its problems and address existential challenges. Singapore strives to become a Smart City to support better living, stronger communities, and more opportunities for all.
Smartness penetrates the everyday life of every citizen and is significantly transforming the conventional way of living. Building a smart city includes providing:
The analysis is conducted on the foundational construction behind the above aims, involving nationwide Internet coverage, citizen acceptance, manpower and financial funding input, etc., to discuss whether it is the right time for this transformation in Singapore.
Minister Lawrence Wong, Ministry of National Development
Goal — Analyze current status of infrastructure construction to suggest building Singapore into the first Smart City around the globe.
Successful transformation of living habits for citizens
Citizens are adapting to the “smartness” of the living environment. The “Online shoppers by age” data reveals the distribution of active online users across different age ranges.
Full network coverage enables the launch of smart solutions
Singapore has a high density of Wi-Fi hotspot coverage (excluding the westernmost industrial areas and Changi Airport), which stimulates the facilitation and implementation of smart solutions in almost every aspect of life. Citizens are able to stay connected to the entire smart ecosystem via Wi-Fi or the 4G network. For instance, smart living solutions aim to let residents seamlessly and remotely manage connected homes from anywhere, at any time, on any screen, in real time.
The intensive nationwide network coverage is a solid foundation for launching smart solutions and enables society to be fully connected.
Steadily increasing financial investment in R&D provides strong support for innovation and development.
Financial investment in R&D is continuously increasing, with an obvious rising trend from 1990 to 2014 in both the private and public sectors:
Private — comprises all business enterprises in the private sector.
Public — comprises all entities in the Government Sector, Higher Education Sector and Public Research Institutes.
The government tends to sustain R&D in innovation, smart solution research, technology development, etc., to guarantee a stable research environment for designing, building and implementing a smart city.
Cui Leyang (A0163218J)
Duan Han (A0163238E)
Gao Yuxi (A0163364E)
Li Yue (A0163373E)
Que Qiwen (A0163391E)