Source: Click here for the paper
Submitted by:
Team Incognito
Pankaj Mitra (A0163319E), Deepthi Suresh (A0163328E), Anand Rajan (A0163200B) and Neeraja Lalitha Muralidharan (A0163327H)
Source: Click here for the paper
Submitted by:
Team Incognito
Pankaj Mitra (A0163319E), Deepthi Suresh (A0163328E), Anand Rajan (A0163200B) and Neeraja Lalitha Muralidharan (A0163327H)
Objectives & Motivation:
Objective is to find a convenient place for students to stay based on data collected in Singapore
To any international student coming into Singapore needs robust information about convenient places to rent. To proceed with this we analyzed student’s movement data which was collected by students of NUS belonging to ISS school, exploratory spatial data analysis was done on that to find the pattern and insights. We considered that as a student basic amenities would be economical stay, cycling path, library, MRT closeness, parks, hawker center.
Data Sources:
Dataset description | Data Source URL |
Data points of all NUS-ISS students | IVLE (Apps used: Openpaths) |
Dwelling data | https://data.gov.sg/dataset/master-plan-2014-planning-area-boundary |
Hawker Centres | https://data.gov.sg/dataset/hawker-centres |
Cycling Path | https://data.gov.sg/dataset/cycling-path-network |
Park Connector Loop | https://data.gov.sg/dataset/park-connector-loop |
Methodology:
INPUT:
OpenPath data with student’s personal location information was collected for the month of April. Data Cleaning was done to refine the data. As the data was stored on multiple mobile devices the date and time formats were inconsistent, thus, all the date and time field values were converted to a single standard format. To do more analysis, separate columns were created for date and time. The outliers were treated using R and the data points outside Singapore were ignored. Modified Name and MailID columns to obtain missing details using EXCEL.
Exploratory Analysis of Students Data (with derived fields):
Mean Centre:
From the above exploration, it is evident that most student stay at the university, travel to the residence and they use MRT most of the time. As students would prefer to stay near university, place near to MRT station and have the potential to use Cycling path, our aim is to find and suggest better zones for student life. Thus, we need to add these layers to the student data and carry out analysis.
Process:
To get more insights into a suitable dwelling, we tried to separately analyze various datasets such as HDB dwelling population data, Hawker center location, MRT station, and cycling paths.
Geographically Weighted Regression:
To find insights from student data using HDB and population data layer, geographically weighted regression is performed using variables as shown below with an assumption that data points having the timestamp of late night represent student’s home. Spatial join was done on hawker center, dwelling and open path student data layers to obtain final GWR model.
Explanatory Variables:
The count_ variable is the number of student data points per polygon, Count_1 is the number of Hawker centers per polygon, showsSHAPE_Area of the dwelling layer and HDB the total count of HDB per person.
The GWR results show that model can perform with moderate accuracy having adjusted R2 of 0.31.
Spatial Autocorrelation (Moran’s I) tool on the regression residuals was run to ensure that the model residuals are spatially random. Statistically significant clustering of high and/or low residuals (model under- and overprediction) indicates that our GWR model is not accurate enough to predict.
The below map indicates regions with localized R square to find HDB population using students data. The Dark Red regions are the places where the model predicts with higher accuracy and the blue regions are the place where its prediction is poor.
As the model residuals are not random based on the spatial autocorrelation ( Morons I ), it cannot be used for prediction purposes. This model was just build to study the insights of students data with another layer of data.
Thus, to find convenient places for students, the places were ranked using student data points and factors like MRT, cycling path and hawker centers availability.
Output:
Ranking zones based on Hawker Centres:
To get a density of hawker centers in each we calculated a new field which was used for ranking:
Ranking of the classified field by using reclassify tool to rank the regions based on the hawker center density which shows places ranked according to economic zones for food.
Ranking zones Based on proximity to cycling paths:
The areas are ranked as per its proximity of cycling paths and data is converted to raster data and the ranked.
Ranking zones based on MRT by reclassify:
The easy access to public transport is considered one of the major consideration while choosing a place to stay and have used proximity to MRT as a factor to rank areas. MRT location data is converted to raster data to rank areas based on its proximity to MRT stations
Final Ranking:
The final rank for each zone in Singapore was calculated based on the average of other three ranks (MRT, Hawker center, and Cycling path). Areas which are ranked good can be most favorable for staying. This final ranking can help to choose a place for staying based on individual priority.
Recommendations:
Using the final ranking, we can recommend to a new student coming to study at NUS a convenient place to stay, considering MRT, Cycling and Food places in the ranking.
As expected the better ranking zones are crowded near NUS itself and there are other places also being suggested by the ranking.
As a future scope of this story, we can add a configurable element which can replace Hawker center layer with many another layer like Libraries, Parks, HDB rental prices, Bus stop layer to form an ideal tool for the upcoming student to use it.
For detailed analysis, please read the entire report: – Spatial-Temporal Analytics with Students Data
Explored and submitted by:
ARUN KUMAR BALASUBRAMANIAN (A0163264H)
DEVI VIJAYAKUMAR (A0163403R)
RAGHU ADITYA (A0163260N)
SHARVINA PAWASKAR (A0163302W)
LI MEIYAO (A0163379U)
Objective:
To find most accessible study areas for students in NUS
Problem Space:
Analysis Strategy:
Data Exploration and Feature Addition:
Preliminary Analysis:
Modelling:
Base Map Creation and Polygon Generation:
Addition of Layers:
Step 3: Loaded class data (master class namely) and converted coordinates for visibility
Density Analysis:
Step 6: The high-density area was in and around National University of Singapore. We can conclude that the data points are either working in NUS or students of NUS.
Assumption and Addition of new Layer:
Hot Spot Analysis:
Step 8: Hot Spot Analysis was performed, Arc Tool Box->Spatial Statistics Tools->Mapping Clusters->Optimized Hotspot Analysis
Model Diagram – Proximity Analysis:
Step 9: Proximity Analysis was performed to find most accessible study areas for students in NUS, Arc Tool Box->Analysis Tools->Proximity->Near
Proximity Analysis Results:
Inferences/Solution Outline:
Limitations:
Team Name: Incognito
Team Members: Pankaj Mitra, Deepthi Suresh, Anand Rajan, Neeraja Lalitha Muralidharan, Nandini Paila Reddy
Team Name: Incognito
Team Members: Deepthi Suresh, Neeraja Lalitha Muralidharan, Pankaj Mitra, Sindhu Rumesh Kumar
Journal:
This study uses multiple linear regression:
1.To provide theoretical support for traffic management and control
2.To increase efficiency at intersections and improve security
One Page Journal Summary “Estimates of pedestrian crossing delay based on multiple linear regression and application” authored by Li Dan and Xiaofa Shi.
Click here for Journal.
The objective of this dashboard is to demonstrate different levels and types of injuries caused in the workplaces of Singapore.
Analysis on Workplace Injuries – Analytics And Intelligent Systems
Questions:
Majority of injuries occur in the Construction and Manufacturing industry consistently over the years but in the year 2016, the injuries pertaining to “Accommodation and Food Services” has also risen.
The number of minor injuries (94.9%) greatly exceeds the number of major (4.5%) and fatal (0.5%).
The fatal and major injuries are consistent through the years, while the number of minor injuries dropped in 2015 and increased drastically in 2016.
Injuries caused by slips, trips and falls are the maximum across industries. Injuries caused by cuts or stabs by objects is more at the Accommodation and Food services industries, while workmen at Construction and Manufacturing industries are more affected by moving objects.
A good measure has been taken to avoid the fatal injuries yet minor injuries have not decreased. This could be because of the conversion of fatal injuries to minor injuries.
Final Analysis
We can clearly see that fatal and major injuries are very low and steps have also been taken to reduce this further, there is still a rise in the minor injuries. This may be because the fatal injuries have been converted to minor injuries.
Struck by moving/falling objects is one of the common cause of minor injury in Construction as well as Manufacturing industry, so to suggest a common solution for this, we could make use of Wireless sensors (heat and motion) on the objects which make an alert sound when it reaches within 50 m of any human being. This way the people working at the site will be aware of any moving objects in their vicinity and move out of harm.
The injuries may be minor, but if the injuries occur simultaneously to multiple people, it may affect the overall productivity of the company.
Tableau Public link:
Workplace Injuries in Singapore(2011-2016)
Submitted by:
Pankaj Mitra (A0163319E)
Deepthi Suresh (A0163328E)
Neeraja Lalitha Muralidharan (A0163327H)
Kriti Srivastasa (A0163206N)
Sindhu Rumesh Kumar (A0163342M)
Prepared by:
Team Data Insighters
Balacoumarane Vetrivel – A0178301L
Mohammed Ismail Khan – A0178366N
Meghna Vinay Amin – A0178307Y
Sreekar Bethu – A0178220L
Vaishnavi Renganathan – A0178229U
With each passing year, oil seems to play an even greater role in the world-wide process of people making, selling, and buying things. In the early days, finding oil during a drill was carefully thought believed somewhat of an annoyance as the meant treasures were usually in a common and regular way water or salt. It wasn’t until 1857 that the first commercial oil well was drilled in Romania. The U.S. petroleum industry was born two years later with a drilling in Titusville, Pa.
Oil has replaced coal as the world’s primary fuel source. Oil’s use in fuels continues to be the primary factor in making it a high-demand commodity around the globe.
The Determinants of Oil Prices:
With oil’s stature as a high-demand global commodity comes the possibility that major fluctuations in price can have a significant economic impact. The two primary factors that impact the price of oil are:
Supply and demand:
The price of oil as known is actually set in the oil futures market. An oil futures contract is a binding agreement that gives one the right to purchase oil by the barrel at a predefined price on a predefined date in the future. Under a futures contract, both the buyer and the seller are obligated to fulfil their side of the transaction on the specified date.
Market sentiment:
The mere belief that oil demand will increase dramatically at some point in the future can result in a dramatic increase in oil prices in the present as speculators and hedgers alike snap up oil futures contracts. The opposite is also true that means prices can hinge on little more than market psychology at times.
Other variables:
For seven straight years, the US has pumped more oil and gas out of the ground than any other country and this lead will only widen. Production of crude topped 10.7 million barrels per day with production of natural gas hitting 4 million barrels per day. The surge in U.S. output is due in large part to the wide use of horizontal hydraulic fracturing, or fracking, as new technologies give drillers access to some of the largest oil deposits in the world that were once too tight to exploit.
Data source:
The data is obtained from the link
https://data.oecd.org/energy/crude-oil-production.htm#indicator-chart
3.1 Crude oil production (input series):
Crude oil production is defined as the quantities of oil extracted from the ground after the removal of inert matter or impurities. It includes crude oil, natural gas liquids (NGLs) and additives. This indicator is measured in thousand tonne of oil equivalent (toe). The data which has been considered is Total Crude oil production in US.
3.2 Crude Oil Import price (output series):
Crude oil import prices come from the IEA’s Crude Oil Import Register. Average prices are obtained by dividing value by volume as recorded by customs administrations for each tariff position. Values are recorded at the time of import and include cost, insurance and freight, but exclude import duties. The price measured in USD per barrel of oil.
Graph:
The graph depicts the time series of both the crude oil production in US and import price rate. To some extent, there is an inverse relationship between these two.
Total crude oil production is a major factor in determining the price of a barrel is tacit. To challenge this common belief, import price of oil is considered as Input series(X) and oil production is considered to be the Output series(Y).
Hypothesis – “There is an underlying relationship between the total crude oil production in US and crude oil import price.”
Equation – Production(Y(t)) ~ Import Price(X(t)).
The following steps show how various parameters were considered while building the ARIMA model and ultimately the team arrived at the Transfer Function.
Step 1: Fit ARIMA model to the input production series X_{t. }Input data is now loaded and an ARIMA model is tried to fit with different parameters. The model IMA (2,1) shows good overall summary.
ACF and PACF were less than significant level and model coefficient are significant. The below plot shows the ACF and PACF obtained for the IMA model tried above. It concurs what we have observed. The residuals plot is also attached for the model.
Arima residual check:
Step 2: By fitting a preliminary model, we can get rid of the autocorrelation, if there exists any in the data. This process is called pre-whitening.
Pre-whiten the input production-output price series and check for cross correlation.
The cross-correlation plot between input and output indicate there is significant correlation in lag 3 to 5 and lag 8.
This suggests that our transfer function equation will have terms related to the input series only for lags 3,4,5 and 8.
The model equation:
Y_{t}=V_{t-3}X_{t-3} +V_{t-4} X_{t-4}+V_{t-5}X_{t-5}+V_{t-6} X_{t-6}+V_{t-7}X_{t-7}+V_{t-8}X_{t-8}+n_{t}.
Step 3: Compute the transfer function.
Parameters identified are: b=3, s=8-3=5, and r=2, these values were identified from the above pre-whitening plot and used to build transfer function.
Model residual diagnostics:
The residuals plot for the transfer function is obtained through the software. We do a diagnostic check for the fitted transfer function model – noise. The parameter significance is also checked from the plot and it is found that the ACF and PACF value ranges are in the significant range that we are looking for with varied significance.
6.1 Final Equation:
Y_{t }– 0.03 Y_{(t-1) }-1.03 Y_{(t-2) }-1.85 Y_{(t-3)} +1.91 Y_{(t-4)} = -290.3574+197.96X _{(t-3)}-249.74X _{(t-4) }+19.84X _{(t-5) }+ 93.04 X _{(t-6)} + 230.5484 X _{(t-7) }-763.4634 X _{(t-8)} + e_{(t)} +2.56 e_{(t-1)} +3.08 e_{(t-2)} +1.14 e_{(t-3)}
Form the transfer function it is evident that import price depends on lag variables (3,4,5,8) of the crude oil production. As mentioned in our hypothesis it has been observed and re-iterated that there indeed exists a relationship, inverse in nature, between the two variables considered under our analysis, namely, Crude oil production and the import prices. Rising production in crude oil shall forecast a diminishment in the import prices.
1. Background
In rapidly ageing Singapore, the demographic trends are worrying. While the birth-rate fell to a seven-year low, the number of deaths recorded was the highest in at least two decades. The death-rate rose by 4 per cent from 20,017 deaths in 2016 to 20,905 deaths last year, the report on Registration of Births and Deaths-2017 showed. The report was released by the Immigration and Checkpoints Authority (ICA). Therefore, this study aims to analyze factors related to deaths in order to help government have more insights beyond the current situation. According to our analysis, Admissions to Accident & Emergency Departments is one of factors that has correlation with death.
2. Hypothesis
We assume that the relationship between number of deaths aggregated by month and number of admissions to Accident & Emergency Departments per month are correlated. In other words, increase in the number of admissions will raise death count. In this case, the dependent variable is number of deaths aggregated monthly, while the independent variable is admissions to Accident & Emergency Departments aggregated by month.
3. Dataset
3.1 Data Source
The two datasets were collected from Department of Statistics Singapore. The admissions to Accident & Emergency Departments data was extracted from the dataset called ‘Admissions to Public Sector Hospitals, Monthly’, whereas the monthly death data was extracted from ‘Deaths by Ethnic Group and Sex, Monthly’. （Link:http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=15193 http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=15167）
3.2 Data Transformation
Originally, the data structure was a multi-dimension table which was not suitable for analytics. Therefore, we transformed the data structure in order to make it more suitable for analytics. Besides, we selected the data from Jan 2012 to Dec 2017 only as we wanted to mainly focus on its recent trend.
3.2 Data Records
The data was extracted from Jan 2012 to Dec 2017. It has 72 records in total which are enough for transfer function, because generally speaking, the threshold of transfer function is 60 records.
4. ARIMA Model
4.1 Observing plots
As shown in Figure 1 and 2, the monthly deaths are non-stationary with a trend, while the admissions to Accidents & Emergency Department are stationary. In order to test their stationary, KPSS test was employed. The null hypothesis of KPSS test is that the data is level or trend stationary. Therefore, if the data is level or trend stationary, its p-value should be larger than 0.05.
According to the outputs of KPSS test shown in Figure 3, it is obvious that the monthly deaths are non-stationary, while the admissions to Accidents & Emergency Department are stationary.
Figure 1 Line chart of monthly death
Figure 2 Line chart of admissions to Accidents & Emergency Department
Figure 3 Output of KPSS test
4.2 Determination of parameters
Based on the ACF and PACF graphs in Figure 4, the parameters of ARIMA model can be determined.
According to the PACF graph, there is a spike at lag 2. Therefore, p is equal to 2.
Because the original data is stationary. Therefore, the differencing order is equal to 0.
According to the ACF graph, there is a spike at lag 2. Therefore, p is equal to 2.
According to the PACF graph, there is a spike at lags 12 and 24. Therefore, P is equal to 2.
Because the original admissions data is stationary. Therefore, D is equal to 0.
According to the ACF graph, there is a spike at lag 12. Therefore, p is equal to 1.
Figure 4 Time Series Basic Diagnostics
4.3 Result of Seasonal ARIMA (2,0,2) (2,0,1)
The result of ARIMA is shown below. It is obvious that some coefficients are not significant, so the insignificant parameters have to be removed one by one to find out the parameters of best ARIMA model.
Figure 5 Output of Seasonal ARIMA (2,0,2) (2,0,1)
4.4 Output of ARIMA Model Group
In order to find out the best parameters, ARIMA Model Group was employed. The output of ARIMA Model Group is shown as below. According to AIC criteria, the best solution is (2,0,2) (1,0,1).
Figure 6 Output of ARIMA Model Group
4.5 Output of Seasonal ARIMA (2,0,2) (1,0,1)
The output of Season ARIMA (2,0,2) (1,0,1) is shown as below. The model is regarded as the best model for the following reasons:
Figure 7 Model Summary
Figure 8 Forecast
Figure 9 Parameter Estimates
Figure 10 Residuals
5. Transfer Function
5.1 Prewhitening
After finding the suitable ARIMA Model, the parameters of X’s ARIMA was used to pre-whiten the input and output series in order to get their cross-correlation graph which is shown as below.
Figure 11 Prewhitening Plot
5.2 Identifying Parameters and Fitting Transfer Function Noise Model
According to Figure 11, there are two spikes at lags 10 and 11. It means the non-zero autocorrelation occurs at lag10 and the values decay after lag 11. Therefore, b=10, s=11-10=1. Besides, r is equal to2. Finally, the parameters we used are shown in Figure below:
Figure 12 Transfer Function Model
5.3 Diagnostic Checks
According to Figure 13, the residuals are randomly distributed.
Figure 13 Residuals of Transfer Function
According to the result, all parameters are significant.
Figure 14 Parameter Estimates of Transfer Function
5.4 Model Comparison
Although the above solution is good enough, other parameters were tried for comparison. Finally, according to the output of model comparison in Figure 15, the solution mentioned above is the best one.
Figure 15 Model Comparison
5.5 Expanded Formula
Figure 16 Formula of Transfer Function
The formula of transfer function is shown above. However, in order to understand the formula better, we expanded the model in full with backshift operator as shown below. The expanded formula includes Y-Deaths, X-Admissions to Accident & Emergency Departments and e-Error Term. The number at the bottom right corner means the lags of its corresponding item.
Figure 17 Expanded Formula
5.6 Model Summary
As shown in Model Summary in Figure 18, the MAPE is equal to 2.77, while the MAE is equal to 46.92. The performance is acceptable. The forecasting points and confidence interval are shown in Figure 19.
Figure 18 Model Summary
Figure 19 Forecasting graph
5.7 Output of test data
As shown in Figure 20, the MAE of test data is equal to 100.157, while the MAPE is equal to 5.55. Both evaluation metrics are larger than that of train data. However, the result still can be a reference for government.
Time | Predictive Value | Actual Value |
Jan 2018 | 1696.946 | 1924 |
Feb 2018 | 1565.284 | 1662 |
Mar 2018 | 1666.980 | 1776 |
Apr 2018 | 1654.818 | 1624 |
May 2018 | 1684.973 | 1803 |
Jun 2018 | 1710.541 | 1729 |
MAE | 100.157 | |
MAPE | 5.545054604 |
Figure 20 Output of test data
This transfer function indicates the relationship between Deaths and Admissions to Accident & Emergency Departments. It validates the initial hypothesis that the Admissions to Accident & Emergency Departments leads number of deaths aggregated by month in Singapore. Besides, the performance of model is acceptable. Therefore, government can use this model to predict deaths in advance and then take actions to lower death count. For example, if the government sees an increase in the number of admissions, it can put more effort to provide medical assistance and perform researches to understand the underlying reasons.
The relationship between weekly data of crude oil (WTI) price and Dow Jones Commodity Index is examined in this report. The 2 time-series begins from 6 Jul 2014 and consists of 221 observations. The WTI (West Texas Intermediate) is a grade of crude oil used as a benchmark in oil pricing. Here the downloaded data’s unit of measure is USD/barrel. The Dow Jones Commodity Index was launched on 2 Jul 2014 by S&P. The DJCI includes 23 of the commodities included in the world production-weighted S&P GSCI, and it weighs each of the 3 major sectors – energy, metals, agriculture and livestock – equally. In this report we look at only the energy section DJCI, which consists of 6 commodities: WTI Crude Oil, Heating Oil, Brent Crude Oil, RBOB Gasoline, Gasoil and Natural Gas.
The data are downloaded from the following website: https://www.investing.com/indices/dj-commodity-energy-historical-data
https://www.investing.com/commodities/crude-oil-historical-data
Figure 1. Plot of WTI Price and DJCIEN
From the plot it can be seen that essentially the two time-series move together. To discover the relationship between these two, we do pre-whitening in JMP to see whether we can apply the transfer function technique.
Using WTI Oil Price as Input series, the ACF and PACF plots are shown below:Figure 2. ACF and PACF of WTI Oil Price Series
It can be easily seen that the time series is not stationary. Therefore, we apply 1^{st} order differencing to the series:Figure 3. ACF and PACF of WTI Oil Price differenced Series
ARIMA(1,1,1) is applied to the input series:Figure 4. ARIMA(1,1,1) applied to input series
Apply Ljung-Box test to the residual after ARIMA, p-values are all more than 0.05. Therefore, we don’t reject the null hypothesis that the residuals are independently distributed. The model fits fine.Figure 5. Input Series residual after ARIMA(1,1,1)
ARIMA(1,1,1) is applied to both input and output series and the following pre-whitening plot is obtained:Figure 6. Pre-whitening Plot
From this plot, there is a significant peak at the time lag 0. The other correlations on positive and negative sides of the lag are small enough to be neglectable. This implies that Y_{t }(Output Series) reacts immediately to the change of input X_{t}. It’s not possible to identify the transfer function model’s b, s, r here. Gretl will be used to discover the cointegration relationship between these 2 time-series.
Before the Cointegration test, Augmented Dickey-Fuller test are performed:Figure 7. ADF test for WTI Oil Price Figure 8. ADF test for DJCIEN
Both of them having p-value > 0.05, therefore, we do not reject the null hypothesis that unit root presents in the time-series.
The theoretical background for Engle-Granger Cointegration test is layed down as follows:
The estimation of regression equation of Y on X or X on Y in level form OLS method is referred to as cointegration regression as shown below:
Y_{t} = intercept + b1 X_{t }+ error
X_{t} = intercept + b1 Y_{t} + error
Where b1 = long run regression coefficient
If X_{t} and Y_{t} are non-stationary and cointegrated
then a linear combination of them must be stationary. In other words:
Y_{t} – βX_{t} = U_{t}, where U_{t} is stationary.
ADF test is carried out for U_{t} to confirm the stationarity.
Results from the Engle-Granger Cointegration test is shown as below:Figure 9. Engle-Granger cointegration test result
The p-value for ADF test for the residual is small, therefore we reject the null hypothesis, U_{t} is stationary.
We conclude that the two time-series are cointegrated.
From Y_{t} – βX_{t }= U_{t}, we know that U_{t }will capture the error correction relationship by capturing the degree to which Y and X are out of equilibrium. Therefore, ECM can be built as: ΔY_{t }= C + Φ ΔX_{t} + α U_{t-1}, where U_{t-1 }= Y_{t-1 }– X_{t-1}, ΔX_{t} is any short-term effects.
Use Ordinary Least Square firstly to obtain U_{t}, and then use U_{t-1 }to build ECM:
Figure 10. Setting used for Error Correction Model
Result of the model shown as below:Figure 11. Error Correction Model
The coefficient of the error correction term presents statistical significance, indicating that short-term imbalances between the two series should disappear in the exact condition of the long-run equilibrium. Durbin Watson test shows residuals are normally distributed. R^{2 }value is 0.881811. Therefore, we conclude that the ECM is reasonably good.
If two time-series, X and Y, are cointegrated, there must exist Granger causality either from X to Y, or from Y to X, or in both directions. However, the reverse is not true. Since we already tested the cointegration relationship, Granger Causality Test is then carried out to discover the causality relationship.
In Gretl, Granger Causality Test is done in Vector Autoregression. Before doing the VAR, it’s important to find the maximum lag to specify in VAR(p), since this p value will affect the VAR result a lot. We use VAR lag selection function in Gretl to determine the max lag. From the following result, lag 1 is chosen because of the smallest AIC & BIC.Figure 12. VAR lag selection
Using Lag 1 to run VAR, the result is shown as follows:
Figure 13. Result of VAR
From the equations:
Equation 1: DJ Energy, the null hypothesis of all lags of WTI coefficients are zero cannot be rejected at 5% significance level. Therefore, WTI Oil Price does not Granger cause Dow Jones Commodity Index Energy.
Equation 2: WTI, the null hypothesis of all lags of WTI coefficients are zero is rejected at 10% significance level, and essentially at 5% significance level also.
Therefore, we conclude that Dow Jones Commodity Index Energy Granger cause WTI Oil Price.
From the above analysis, the following conclusions are deduced:
Team: AngelWorm
Members:
Name | Student ID | Email Address |
Ang Wei Hao Max | A0178278L | e0267589@u.nus.edu |
Li Xiaoguang | A0178450B | li.x@u.nus.edu |
Laura Long Xiaoxing | A0055448Y | laura.long@u.nus.edu |
Ma Wenjing | A0073595U | e0267970@u.nus.edu |
About the data
The data describes the change in the frequency of television viewing from October 2008 to March 2009 by the average Indian household. We are not given the details of the TV channel hence we will not know who the target audience. As seen in the chart below, the weekly rating of the TV channel hovers between the range of 330 to 170 (60 points) and it is trending downwards. The TV channel rating has been staying below 300 after 27 Jan 2008, suggesting that the programs were not able to appeal to its audience in India.
We tested autocorrelation of the data and confirms that there is positive autocorrelation in this dataset (refer to appendix for more information). And the ACF didn’t dampen out within 15 to 20 lags, hence we confirm that the autocorrelation is not likely to be stationary (nonstationary process)[1].
Software
We used Jmp, SPSS Modeler as well as R to run our models.
Data
We tried to log normal predicted data before we use Jmp to runtime series diagnostic and we found that the results are the same using the original data set are similar so we used the original data set on all our models.
Referring to the equation below, the trend component T reflects the long-term progression of the rating. It’s a function of how effective your marketing campaign is, and it indicates how well the TV channel is doing. This trend is perturbed by several other effects which contribute to the buying behaviours of the audience. The seasonal factor, TV rating peaked during festive seasons (more people staying at home to watch TV or there’s a big event organised by the TV channel); all those events have a quantifiable, cyclic effect on the data and therefore make the seasonal component, S, of the time series. Finally, what’s left is assumed to be caused by random, non-periodic events and is called irregular (or residual) component, I. The cyclic pattern in television rating can be explained by environmental factors, such as the weather or the amount of daylight. People watch less television when the weather permits them to be outside and engage in alternative activities (Gould, Johnson, & Chapman, 1984). In this data set, we observed a downward trend in the data but no seasonality and cycle in the data. We didn’t observe any annual cycle in the data nor did we see any seasonality in the data. Referring to the equation below, we be using linear regression method for our forecasting (refer to excel spreadsheet below for details), error in our training dataset 6.49% (MAPE) while testing data set (20 data points) is about 8.43%(MAPE).
Intercept | 302.8228365 |
Slope | -1.400229918 |
Few key assumptions of regression include:
However, we tested positive autocorrelation in the data hence we can conclude that the decomposition method is not a good model for prediction as it violates the third assumption, there should not be multicollinearity in the independent variables.
Figure 1
The time series shows strong positive autocorrelations (ACF), it dies off slowly (we still see some autocorrelation at lag 25). In the PACF plot, there’s strong autocorrelation at lag 1 and it gets off after lag 1 (see figure 1)
Next we try 1 and 2 differencing to convert it stationary time series data before we test for P and Q (ARIMA).
In figure 2a and 2b – 1 Differencing
The intercept becomes insignificant and we also see signs of over-differencing (pattern of changes of sign from observation to the next in the ACF plot, lag 1 has a negative spike). Hence, we decide to remove differencing and try for P & Q instead.
ACF plot indicates no seasonality in the data (only lag 1 and lag 7 shows strong autocorrelation).
Figure 2a
We tried ARMA (1,0,0), we achieved adjusted Rsquare of 0.76 but lag 2, 5 and 5 become significant and there is a negative spike in lag 1 (refer to appendix for more information).
Next, we try ARIMA (0.0,1), adjusted Rsquare was only at 0.50 but ACF plot still shows high autocorrelation. (refer to appendix for more information). ARIMA (1,1,1) also giving us insignificant variable in AR and intercept (see figure 3)
Figure 2b
Figure 3
Our final model – ARIMA(1,0,1) Figure 4 & 5
We achieved an adjusted R square of 0.79, MAE of 5.38. We didn’t split the data into training and testing dataset (there is only 20 data points for validation, this will not be sufficient in ARIMA models. We will need at least 50 data points to run ARIMA model).
Final Model Checking
The estimation summary below indicates that both autoregressive parameter and the SMA parameter are significant. The Ljung-Box statistics indicates that all the lag there are showing no signs of significant autocorrelation in the residuals and we also see randomness in the residual plot. Hence, we confirm that the model is a good fit.
Figure 4
Figure 5
Feature Engineering
The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm (regression) to model. By creating dummy variables, trend variable and weekly fluctuation variable, we aim to learn the underlying inherent functional relationship between inputs and outputs in order to improve the accuracy of the model.
First, we created 5 dummy variables to represent week 1, week 2, week 3, week 4 and week 5 of the month.
And we also created another variable to capture the weekly fluctuation rate of ratings by:
(This week’s rating – Previous Week’s rating) / Previous Week’s ratings
We also created 3 trend variables to capture downward/upward trend of the data – 5 weeks moving average, 6 week moving average and 7 weeks moving average.
We split the data into training and testing July 2007 – 26 Oct 2008 for training and Nov 2008 – Mar 2009 for testing. Using backward regression method to run our regression in SPSS Modeler. In backward regression, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
Referring to figure 6 below, the significant variables in our time series regression model are the time period, 5 weekly trend, 7 weekly trend variable and differencing variable. Model 6 gives us the best adjusted RSquare of 0.8490. The best Rsquare compared to decomposition method and ARIMA but MAE between our training and testing data is more than 10%. The difference may be due to insufficient data points in our testing data set. We only have 20 data points for validation and there may be a few observations in the data set which are similar to the training data set (10 out of 20 data points are below 200 rating).
Testing Data Set Training Data Set
This method predicts the one next period value based on the past and current value. It involves averaging of data such that the non-systematic components of each individual case or observation cancel out each other. The exponential smoothing method is used to predict the short-term predication. Alpha, Gamma, Phi, and Delta are the parameters that estimate the effect of the time series data. Alpha is used when seasonality is not present in data. Gamma is used when a series has a trend in data. Delta is used when seasonality cycles are present in data.
Our Final Model – Simple Exponential Smoothing
Our final model is exponential smoothing technique using SPSS modeler, giving us coeffcient of 0.624.
Simple exponential smoothing is suitable for forecasting data with no trend and no seasonal component.The smoothing equation is given by
St+1= St + α(Xt – St)
Also can be writen in St+1= αXt +(1- α)St
Where(Xt – St) is the difference between the observed and forecasted value and St+1 is the smoothing statistic and α is the smoothing constant.It can be seen that the new smoothed value is the weighted sum of the current observation and the previous smoothed value.The weight of the most recent observation is α and the weight of the most recent smoothed value is (1-α).
As a result, St+1 is the weighted average of all past observations and the initial value of S0.The weights are decreasing exponentially depending on the value of α (0.624). More weights given to most recent obeservations and weights decrease geometrically with week. MAE (error) between the training and testing dataset is more than 10%. Since this technique is good for making short predictions and smoothing parameters are based on current and previous observation, we do not recommend splitting data for prediction since the TV channel has been trending downwards after the second quarter of 2008. We attempted exponontial smoothing technique on the data set but we still spikes in the ACF and PACF plots hence we conclude that the model is not a good fit.
Conclusion
Above is a summary of error terms and R square (goodness of fit) of our models (split between training and testing data). We would recommend using time series regression model for prediction if we have more data points for validation. Though both ARIMA and exponential smoothing techniques are useful techniques in forecasting a time series. Exponential smoothing technique depends upon the assumption of exponential decrease in weights for past data but it doesn’t take care of autoregressive component (refer to the exponontial smoothing model summary above). ARIMA is employed by transforming a time series to stationary series and studying the the nature of the stationary series through ACF and PACF and then accounting auto-regressive and moving average effects in a time series and it’s the best model the four models.
Appendix
ACF
Model Description | ||
Model Name | MOD_1 | |
Series Name | 1 | normalized data |
Transformation | None | |
Non-Seasonal Differencing | 0 | |
Seasonal Differencing | 0 | |
Length of Seasonal Period | No periodicity | |
Maximum Number of Lags | 16 | |
Process Assumed for Calculating the Standard Errors of the Autocorrelations | Independence(white noise)^{a} | |
Display and Plot | All lags | |
Applying the model specifications from MOD_1 |
a. Not applicable for calculating the standard errors of the partial autocorrelations. |
Case Processing Summary | ||
normalized data | ||
Series Length | 92 | |
Number of Missing Values | User-Missing | 0 |
System-Missing | 0 | |
Number of Valid Values | 92 | |
Number of Computable First Lags | 91 |
normalized data
Autocorrelations | |||||
Series: normalized data | |||||
Lag | Autocorrelation | Std. Error^{a} | Box-Ljung Statistic | ||
Value | df | Sig.^{b} | |||
1 | .876 | .103 | 72.847 | 1 | .000 |
2 | .829 | .102 | 138.832 | 2 | .000 |
3 | .821 | .101 | 204.294 | 3 | .000 |
4 | .774 | .101 | 263.139 | 4 | .000 |
5 | .741 | .100 | 317.682 | 5 | .000 |
6 | .704 | .100 | 367.480 | 6 | .000 |
7 | .644 | .099 | 409.611 | 7 | .000 |
8 | .652 | .099 | 453.323 | 8 | .000 |
9 | .617 | .098 | 493.003 | 9 | .000 |
10 | .579 | .097 | 528.344 | 10 | .000 |
11 | .573 | .097 | 563.380 | 11 | .000 |
12 | .519 | .096 | 592.461 | 12 | .000 |
13 | .489 | .096 | 618.633 | 13 | .000 |
14 | .479 | .095 | 644.117 | 14 | .000 |
15 | .428 | .094 | 664.700 | 15 | .000 |
16 | .412 | .094 | 684.041 | 16 | .000 |
a. The underlying process assumed is independence (white noise). |
b. Based on the asymptotic chi-square approximation. |
Partial Autocorrelations | ||
Series: normalized data | ||
Lag | Partial Autocorrelation | Std. Error |
1 | .876 | .104 |
2 | .266 | .104 |
3 | .255 | .104 |
4 | -.029 | .104 |
5 | .021 | .104 |
6 | -.051 | .104 |
7 | -.130 | .104 |
8 | .209 | .104 |
9 | -.048 | .104 |
10 | .013 | .104 |
11 | .043 | .104 |
12 | -.164 | .104 |
13 | .005 | .104 |
14 | .011 | .104 |
15 | -.046 | .104 |
16 | .055 | .104 |
Reference:
[1] Time Series Analysis and forecasting by Example. Soren Bisgaard and Murat Kulahci
[2] Applied Regression Analysis and Generalized Linear Models
Team Members : Gopesh Dwivedi (A0178338R), Indu Arya (A0178360B), Rubi Saini (A0178255W)
The Singapore Zoo, formerly known as the Singapore Zoological Gardens and commonly known locally as the Mandai Zoo, occupies 28 hectares (69 acres) on the margins of Upper Seletar Reservoir within Singapore’s heavily forested central catchment area. The zoo was built at a cost of $9 million granted by the government of Singapore and opened on 27 June 1973. It is operated by Wildlife Reserves Singapore, who also manage the neighboring Night Safari, River Safari and the Jurong Bird Park. There are about 315 species of animal in the zoo, of which some 16 percent are considered to be threatened species. The zoo attracts 1.7 million visitors each year.
Statistical time series models are used to analyze the short-run impact of weather and the long-run impact of climate upon visits to Singapore Zoo. The relationship between key climate parameters and tourism has been researched for over 30 years. Whilst tourists respond to the integrated effects of the atmospheric environment, including thermal, physical and aesthetic aspects, many studies conclude that temperature is the dominant climate variable for tourism. Several studies have attempted to identify the ideal or optimum temperatures for tourism. We conclude that’ globally, tourists prefer an average daily temperature of 21°C, and ideal temperatures for urban sightseeing in Singapore have been found at 20- 26°C. Higher daily maximum temperatures around 30°C are preferred for beach-based recreation.
Our hypothesis regarding the Singapore’s Zoo’s Visitation is that better are the weather conditions which in our case means bright sunny day rather than raining would increase visitors at the Zoo. Thus, some cross-correlation can be attributed to weather conditions and Zoo’s visitation.
A “tourist experience” encompasses three stages: planning, the actual experience, and assessments. Weather and climate affect the tourist experience. The distinction between weather (shorter term) and climate (longer term) is apparent when these tourist experience stages are used. Climate information and forecasts are assessed during tourism planning; weather information supports the actual experience; assessments are combinations of weather and climate information—reconciling discrepancies between expectations and reality of the tourist experience.
At the tourist destination, activity choices are impacted by local weather events. Weather events influence on-site activity choices of tourists. Social factors and different activity choices diversify options for tourists and recreationalists. However, influences of weather on the availability and performance of an activity are still present. For example, if availability of an intended activity is reduced due to the influence of the weather (e.g. lack of snow for skiing), alternative choices are created by businesses to mitigate revenue losses and provide social outlets.
For this Study, we collected temporal monthly data from SingStat. The data collected contains monthly visitors at Singapore’s Zoo from April 1990 onwards. Also, weather conditions like number of bright sunny days, rainfall etc is also acquired from the SingStat website. Primarily, we are considering the number of sunny days as our input series and Singapore’s Zoo monthly visitors as output series. We want to see if no rain scenarios affect the tourist decision to go for outdoor activities such as visiting Singapore’s Zoo.
Test of cointegration – We ran co integrated test for input and output series and achieved significant p value so we can conclude that these series are not co integrated and we will develop transfer function for these series.
Relation plot of input and output series:
Test of stationarity (ADF test )– We want to identify at first if series is stationary or not so we ran Augmented Dicky-Fuller test and p value is insignificant so our series is stationary.
As out input series is stationary so there is no need of differencing. We ran models for different values of p and q.
We ran multiple models and found out that Seasonal ARIMA(1,0,1)(3,0,1) is giving the best results.
For this model we have white noise, all auto correlations are insignificant. all the ACF and PACF values are within confidence interval.
Residual plot is random, and the selected parameters are significant.
MAPE for this model is insignificant.
Then applied same model on Y series and getting good results with white noise, random residual plot and significant ACF& PACF values.
We then apply pre-whitening on this model and proceed to check correlation between input and output series.
We ran cross correlation test for both input and output series and can see that output series is correlated with input series lag. So we can say that past data of input series can be used to predict output(Y) series.
We are considering that Correlations exist only in 4 lags, 2,3,4 and 5, starting from lags 2 and peak at 4 for development of transfer function. So our transfer function equation will have terms related to the input series only for lags 2,3,4 and 5.
So, we have d = 2, s=4-2 =>2.
We developed 2 transfer functions with d = 2, s=4-2 and r=2 and second d = 2, s=4-2 and r=1.
Transfer function with d = 2, s=4-2 and r=2 is giving best results so we will proceed with Transfer function model 1.
*standardized(t)= No of visitors in zoo
standardized_Bright_sunshine_daily_mean_hour(t-1) + et = Average Temperature
We can now infer from the above analysis, there is a conclusive “analytical evidence” that bright sunny weather has a strong relation with number of people visiting in Singapore zoological park. In this report, we applied “Transfer function analysis” and observed that relationship of the two factors, i.e., bright sunshine daily mean hours (Hour) as the “input variable” and number of people visiting Singapore zoological park as the “Output variable”, are related unidirectionally.
It is believed that visitors from countries with good economy and high standard of living, such as those in Americas (USA and Canada in particular) are more likely to have the economic capability to travel in a more expensive fashion. Hence, it is reasonable to hypothesize that the influx of American visitors may relate to a higher occupancy rate in luxury hotel.
This hypothesis further leads to the objective of the study: to answer and provide evidence to the question of whether there exists a relationship between visitors from the Americas and the luxury hotel booking in Singapore.
To analyze the relationship between visitors from Americas and the luxury hotel booking in Singapore, data was collected from Department of Statistics Singapore, covering the most recent 5.5 years of data (66 data points for each time series from Jan 2013 to June 2018) under the following two topics:
Data Source: http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1991
Selected Attribute: AMERICAS – Number of visitors from AMERICAS, including USA, Canada, and other parts of AMERICAS
Data Source: http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1991
Selected Attribute: AOR of Luxury Hotels – Includes hotels in the luxury segment, predominantly in prime locations and/or in historical buildings.
(* AOR = [Gross lettings (Room Nights) / Available room nights] x 100)
The initial assumption made for the two time series is that the number of American visitors (AMERICAS) leads luxury hotel bookings (LUXURY). Hence, AMERICAS is the independent series (input) while LUXURY is the dependent series (output). Several ARIMA model are tested to get the best performance model. The final ARIMA model is Seasonal ARIMA (1,1,1)(0,1,0)_{12}.
From the prewhitening plot, it can be observed that the relationship between input and output series is not unidirectional, with peaks ranging from the negative lags to positives. This indicates that the two series affect each other in a bi-directional way. Since transfer function is only applicable to situations where the relationship is unidirectional, other techniques will be explored for the further analysis.
Since a bi-directional correlation is found between Luxury and Americas, the Vector Autoregression (VAR) model is a useful starting point in the analysis of the inter-relationships between the different time series. Stationarity must exist in the series in order to estimate VAR model, so ADF test will be performed at first to test whether these two series are stationary and suitable for VAR(p) model. If they are not stationary, then differencing or cointegration test will be performed to estimate VAR(p) model or VEC(p) model.
VAR lag selection is performed with a value of 10. According to all the criteria, AIC, BIC and HQC, the best lag value to use for the analysis is 1.
The null hypothesis of the Unit Root Test is that the time series variable is non-stationary(a=1). According to the unit root test results of the two time series variables as shown above, both p-values are significant, which indicates that the null hypothesis can be rejected and the two series are stationary with constant and trend. Hence, a VAR(1) model can be estimated.
A VAR(1) model can be estimated with constant and trend. Seasonal dummy variables are also incorporated in the model as the time series plot shows a clear seasonal cycle. The two essential features of the VAR model: Causality Analysis and Impulse Response Analysis are performed to analyze the two series.
In the below output displays, S1 to S11 refer to the dummy variables, Month. Both equations are significant, and the majority of the variables are significant as well. The F-tests of zero restrictions under each model provide information for Causality Analysis.
The Granger Causality Test can provide evidence for answering the question of whether variable x causes variable y or vice versa, which is the interest of bi-directional time series analysis. The results of F-tests of zero restrictions under each model show that:
Equation 1 F-test results of “all lags of Luxury” and “all lags of AMERICAS” are both significant. This indicates a bidirectional causality between Luxury and Americas in equation 1.
Equation 2 F-tests result of “all lags of Luxury” is not significant and that of “all lags of AMERICAS” is significant. This indicates a one-way causality in equation 2.
With this information, a conclusion can be drawn that Americas and Luxury affect each other in a bidirectional way.
An exogenous shock to one variable not only directly affects this specific variable but is also transmitted to the other endogenous variables through the dynamic (lag) structure of the VAR. The Impulse Response Analysis in VAR model can help to trace the effect of a one standard deviation shock to one of the innovations on current and future values of the endogenous variables. The 4 charts below show the impact of a shock on Luxury and AMERICA to Luxury or AMERICA, respectively.
The analysis result shows that visitors from AMERICAS and the luxury hotel booking in Singapore have an interactive effect on each other, and the individual reaction of Luxury and AMERICAS to shocks on itself or the other series are very different.
To investigate and understand the relationship between visitors from Americas and the luxury hotel booking in Singapore, several analytical approaches and methods were conducted on the acquired dataset. Transfer function in JMP gives the indication of a bidirectional relationship between the two series, while Granger Causality Test further proves the inference from JMP. Finally, the Impulse Response Analysis helps to trace the effect of various shocks to Luxury and AMERICAS. This analytical approach can be applied to similar studies on hotel type versus visitors from other regions to identify the reaction pattern for different regions. Decision makers in the tourism industry can apply the insight and knowledge gained from studying these patterns in market segment planning and other related issues.
Submitted by: Red
Gao Yuan (A0178404A); Wang Yijing (A0178354W);
Wang Yuhui(A0179193U); Zhang Xiaoman(A0178489A);
Hu Mengxi (A0178561W)
Submitted by: Aastha Arora (A0178188L), Aishwarya Bose (A0178277M), Chetna Gupta (A0178260A), Madeleine Dy (A0178427U), Misha Singh (A0178309W), Zaira Hossain (A0178331E)
Is there a relationship between Air Passenger Arrivals and value of Food & Beverage Sales in Singapore?
SERIES 1: AIR PASSENGER ARRIVALS IN SINGAPORE
Notes:
SERIES 2: FOOD & BEVERAGE SALES IN SINGAPORE
Source: https://data.gov.sg/dataset/value-of-food-beverage-sales-based-on-2014-100-index-estimated-monthly
Notes:
Number of Records:
F&B sales value and overall Air passenger arrivals were combined to form a dataset containing 200 monthly records from Jan 1997 to Aug 2013 (~16 years & 8 months of data) for both Singapore Food & Beverage Prices, and Number of Air Passenger arrivals, which was used for analysis.
Outliers:
A dip in arrival and F&B sales value was seen between April, May and June 2003, which could be attributed to the fact that there was SARS virus outbreak in Singapore. So, this was considered as an outlier, and we replaced the value by taking the moving average of 3 months.
We have plotted both the variables to visually inspect these variables. With this, we can clearly see that both the time series are trending which is a clear sign that both the time series have a unit root.
We confirmed whether they have unit roots by using the augmented Dickey Fuller (ADF) test.
The Dickey Fuller Test produced the following output for Arrival:
(1.5485691015907959, 0.9976947744107255, 15, 184,
{'1%': -3.466398230774071, '5%': -2.8773796387256514,
'10%': -2.575213838610586}, 4548.903850689798)
The Dickey Fuller Test produced the following output for Food Value:
(1.5508415283231438, 0.9977023393139994, 13, 186,
{'1%': -3.466005071659723, '5%': -2.8772078537639385,
'10%': -2.5751221620996647}, 1478.034269903061)
Clearly, we cannot reject the null-hypothesis that these series have a unit root. So we should difference both series as a first step.
To test whether the ‘Arrival’ is caused by the ‘Food Value’, we applied the Granger Causality Test.
Granger Causality
number of lags (no zero) 1
ssr based F test: F=50.6180 , p=0.0000 , df_denom=196, df_num=1
ssr based chi2 test: chi2=51.3927 , p=0.0000 , df=1
likelihood ratio test: chi2=45.7154 , p=0.0000 , df=1
parameter F test: F=50.6180 , p=0.0000 , df_denom=196, df_num=1
{1: ({'ssr_ftest': (50.61796959284111, 2.059081651775858e-11, 196.0, 1),
'ssr_chi2test': (51.39273443354787, 7.56178806704204e-13, 1),
'lrtest': (45.71543384976394, 1.3674112550498519e-11, 1),
'params_ftest': (50.617969592841, 2.0590816517759726e-11, 196.0, 1)},
[<statsmodels.regression.linear_model.RegressionResultsWrapper object at
0x000001A53F406F28>,
<statsmodels.regression.linear_model.RegressionResultsWrapper object at
0x000001A540F8D518>, array([[0., 1., 0.]])])}
We can say that Arrival is caused by Food Value because p value is significant and less than 0.05. Then, we reverse the test to see if the ‘Food Value’ is caused by the ‘Arrival’. As a result, we found that Food Value was not caused by Arrival.
Granger Causality
number of lags (no zero) 1
ssr based F test: F=1.9754 , p=0.1615 , df_denom=196, df_num=1
ssr based chi2 test: chi2=2.0056 , p=0.1567 , df=1
likelihood ratio test: chi2=1.9956 , p=0.1578 , df=1
parameter F test: F=1.9754 , p=0.1615 , df_denom=196, df_num=1
{1:({'ssr_ftest': (1.9754074370715489, 0.16145817984806168, 196.0, 1),
'ssr_chi2test': (2.005643265189991, 0.15671480384086686, 1),
'lrtest': (1.9956036184489676, 0.15775620297932386, 1),
'params_ftest': (1.9754074371779515, 0.16145817983682556, 196.0, 1)},
[<statsmodels.regression.linear_model.RegressionResultsWrapper object at
0x000001A545511BE0>,
<statsmodels.regression.linear_model.RegressionResultsWrapper object at
0x000001A545489CC0>, array([[0., 1., 0.]])])}
Firstly, we checked the PACF graph in the time series inputs (Food Value). It’s clear that the data needs to be differenced to run the input series ARIMA for pre-whitening:
We then use the ARIMA Group Model Function in JMP to discover the best combination. This turns out to be Seasonal ARIMA (2,1,0) (0,1,1)12 for the input series. Then pre-whitened the input series and we got the following result:
Since, we did not get any significant lags. We decided to go with log transformation of both the series. After the log transformation, we fit an ARIMA model using the above steps and pre-whitened the series and got the following output:
We found significant lags at both sides of the plot. In this case transfer function won’t give the right estimates of coefficients. So, we reached the conclusion that it is a cointegration problem. Hence, we moved on to GRETL.
We have plotted both the variables to visually inspect these variables. With this, we can clearly see that both the time series are trending which is a clear sign that both the time series have a unit root.
Both series are trending upward. It is possible that both the series follow long run equilibrium relationship that they tend to return to over time. We performed the Engle Granger Test for cointegration to find out.
Engle Granger Test:
Given two sets of time series data, x and y, granger-causality is a method which attempts to determine whether one series is likely to influence change in the other. This is accomplished by taking different lags of one series and using that to model the change in the second series. We create two models which predict y, one with only past values of y (Ω), and the other with past values of y and x (π). The models are given below where k is the number of lags in the time series:
Let Ω = yt = β0 + β1yt-1 +…+ βkyt-k + e
And π = yt = β0 + β1yt-1 +…+ βkyt-k + α1xt-1 +…+ αkxt-k + e
The residual sums of squared errors are then compared, and a test is used to determine whether the nested model (Ω) is adequate to explain the future values of y or if the full model (π) is better. The F-test, t-test or Wald test (used in R) are calculated to test the following null and alternate hypotheses:
H0: αi = 0 for each i of the element [1,k]
H1: αi ≠ 0 for at least 1 i of the element [1,k]
Essentially, we are trying to determine whether we can say that statistically, x provides more information about future values of y than past values of y alone. Under this definition we are not trying to prove actual causation, only that the two values are related by some phenomenon. Along those lines, we must also run this model in reverse to verify that that y does not provide information about future values of x. If we find that this is the case, it is likely that there is some exogenous variable, z, which needs to be controlled or could be a better candidate for granger causation.
Steps in Engle Granger Test:
Step 1: Determine ‘d’ in I(d) for ‘Log of Arrival’ using ADF Unit Root Test.
H0: Level series contains a unit root.
HA: Level series does not contain a unit root.
We have taken a maximum lag order of 6 by taking the cube root of the number of data points (200).
We select ‘constant and trend’ because we have seen from the plot that the series has an upward trend.
The p-value is large, so we fail to reject the NULL Hypothesis. This means series needs to be differenced to make it stationary. So, d=1.
Step 2: Determine ‘d’ in I(d) for ‘Log of Food Value’ using ADF Unit Root Test.
H0: Level series contains a unit root.
HA: Level series does not contain a unit root.
We have taken a maximum lag order of 6 by taking the cube root of the number of data points (200).
We select ‘constant and trend’ because we have seen from the plot that the series has an upward trend.
The p-value is large, so we fail to reject the NULL Hypothesis. This means series needs to be differenced to make it stationary. So, d=1.
Step 3: Estimate cointegrating regression: Yt = β1+ β2Xt+Ɛt
We estimated the cointegrating regression by using ‘Log of Arrival’ as depend Variable. Both variables are integrated of the same order. We provided lag order to be 6 as mentioned in Step 1 and Step 2.
Step 4: Determine ‘d’ in I(d) for Ɛt
H0: Unit root (i.e., not cointegrated)
HA: No unit root (i.e., cointegrated)
Since our p-value is small so we reject the NULL hypothesis at 5% level of significance and conclude that the Food Value is cointegrated with Arrival rate. This in turn means that the series can be written as an error correction model.
Equation:
∆Y_t=φ_1 ∆X_(t-1)+φ_2 Y_(t-1)-γ{Y_(t-1)-β ̂_1-β ̂_2 X_(t-1) }+ω_t
The lag residual from cointegrating regression is found within the curly braces above. The coefficient γ is the speed of adjustment. If it is not statistically significant, the variable is weakly exogenous.
Before estimating the error correction model, we estimated the cointegrating regression and saved the residuals with name e.
Then, we created two new series which is the difference of ‘log of arrival’ and ‘log of food value’.
Then we estimated the Error Correction Model. Select following lags:
The variable that we are interested in is e. Atleast one of the variables must not be weakly exogenous if the series are cointegrated. We can see that Log of Arrival is not weakly exogenous. This means that The arrival value moves to restore the equilibirium when the system is out of balance but Log of Food Value doesn’t move to equilibirium when system is out of balance.
We can conclude that arrivals of air passengers is caused by food and beverage sales values as opposed to our initial assumption that food and beverage sales value is dependent on arrivals of air passengers.We can say the Food Value Granger-Cause Arrival rate!
Increasing in the average deal price of Shanghai license plate will increase the number of bidders.
Over the last two decades, there is an increase in automobile ownership and use in China. This increases energy consumption, worsens air pollution and exacerbates congestion. Therefore, an auction system has been adopted by Shanghai government to limit the number of license plates issued for every month.
The dataset contains monthly auction data from Jan 2002 to Oct 2017. Feb 2008 data is missing, and the number of records is 189 [1].
Field | Description | Data Type |
Date | Jan 2002 to Oct 2017 (Feb 2008 is missing) | Date |
avg_deal_price | The average deal price of Shanghai license plate | Numeric |
lowest_deal_price | The lowest deal price of Shanghai license plate | Numeric |
num_bidder | Number of citizens who participate the auction for the month | Numeric |
num_plates | number of plates that will be issued by the government for the month | Numeric |
Table 1 Metadata
Only variables “avg_deal_price” and “num_bidder” will be analysed in this report.
Assuming that the relationship between “avg_deal_price” and “num_bidder” is unidirectional, JMP is used for the analysis. Since the hypothesis is increasing in the average deal price of Shanghai license plate will increase the number of bidders, so taking “avg_deal_price” as the input and “num_bidder” as the output.
Figure 1 Time series plot of input and output variables
Figure 1 shows that the trends of input and output are both not stationary, so an order 1 differencing is applied to both input and output.
Figure 2 Plot of Input with order 1 differencing
Figure 3 Plot of output with order 1 differencing
Then the trends of input and output become stationary.
Figure 4 Autocorrelation and partial autocorrelation graphs
Since there is no seasonality observed, so ARIMA instead of seasonal ARIMA is applied next to both input and output. Figure 4 shows that there is a distinct drop in the Partial autocorrelation function (PACF) value after lag 3, an AR(3) model is plotted.
Figure 5 AR(3) Model for input
However, Figure 5 shows that the values for AR2 and AR3 in Column Prob>|t| are big. Therefore, AR(1) model is adopted for both input and output instead of AR(3).
Figure 6 AR(1) Model for input
Figure 7 Residuals of AR(1) model for input
Figure 8 AR(1) Model for output
Figure 9 Residuals of AR(1) model for output
Figure 7 and Figure 9 show that the residuals of AR(1) models for both input and output are random.
Next, in order to remove the autocorrelation, pre-whitening is performed to the input variable.
Figure 10 Pre-whitening plot for dataset from Jan 2002 to Oct 2017
Figure 10 shows that there is an increase in the values at both negative lag (lag -14 and -13) and positive lag (lag 1 and 2). This means the cross-correlation is bi-directional. Therefore, the cointegration technique needs to be applied.
Since the cross-correlation is bi-directional, cointegration analysis with GRETL is performed.
Firstly, a time series plot of both variables is plotted as shown in Figure 11. It shows that both variables values are increasing gradually from the Year 2002 to 2014. Then from the Year 2014 to 2016, there is a big increase in “num_bidder”, but “avg_deal_price” increases in a similar rate as previous years.
Figure 11 Time series plot of both variables from Jan 2002 to Oct 2017
Then the Engle-Graner test for cointegration is performed and the steps are,
Step 1: Determine ‘d’ in I(d) for the first series, where d represents the order of integration and is abbreviated I(d). d is the number of times the series has to be different to be made stationary. ADF unit root test is used to determine d.
Step 2: Determine ‘d’ in I(d) for the second series
Step 3: Estimate cointegration regression: Y_{t} = β_{1} + β_{2}X_{t} + ε_{t}
Step 4: Determine ‘d’ in I(d) for ε_{t}
_{ } H_{0}: Unit root (i.e., not cointegrated)
H_{A}: No unit root (i.e., cointegrated)
To perform the ADF unit root test on the residuals from the cointegrating regression to determine the order of integration [2].
Augmented Dickey-Fuller (ADF) test is performed in Step 1 for series “avg_deal_price”. Taking 6 as the lag order for ADF test which is the cube root of the data size which is 189.
Figure 12 ADF test for series “avg_deal_price”
Figure 13 ADF test result for series “avg_deal_price”
Figure 13 shows that the p-value is 0.5498 which is large, hence fail to reject the null hypothesis. Therefore, the series needs to be different to make it stationary. d=1 for “avg_deal_price”.
The same test is done for series “num_bidder”. Figure 14 shows the p-value is 0.9175 which is large, hence fail to reject the null hypothesis. Therefore, the series needs to be different to make it stationary. d=1 for “num_bidder”.
Figure 14 ADF test for series “num_bidder”
The Engle-Graner cointegration test is performed, taking “avg_deal_price” as the independent variable and “num_bidder” as the dependent variable. Figure 15 shows the result that the p-value is 0.2498 which is not small, hence fail to reject the null hypothesis at 5% level of significance. This means there is no cointegration relation between the two variables.
Figure 15 Engle-Graner cointegration test result for dataset from Jan 2002 to Oct 2017
Since Figure 11 shows that there is a big increase in “num_bidder”, but “avg_deal_price” still increase gradually, this may be the reason that the relationship between two variables is neither unidirectional nor bi-directional. Therefore, data from Jan 2002 to Dec 2013 is going to be analysed instead.
Figure 16 Time series plot of both variables from Jan 2002 to Dec 2013
Firstly, use JMP to perform the unidirectional relationship analysis between two variables as mentioned in Section 3. The result as shown in Figure 17 shows that the cross-correlation is bi-directional. Therefore, GRETL is used to perform bi-directional cross-correlation analysis.
Figure 17 Pre-whitening plot for dataset from Jan 2002 to Dec 2013
Since the number of records is 143, so taking 6 as the lag order for ADF test which is the cube root of the data size which is 143. Following the same steps mentioned in Section 4, the Engle-Graner cointegration test result is shown in Figure 18.
Figure 18 Engle-Graner cointegration test result for dataset from Jan 2002 to Dec 2013
The result shows the p-value is 4.709e^{-13} which is small, so reject the null hypothesis at 5% level of significance. This means there is cointegration relation between two variables and the series can be written as an error correction model.
Equation 1 Error correction model equation
Equation 1 shows the equation of the error correction model. The residuals from the cointegrating regression are found within the brackets and capture the departure from the attractor from last period. The coefficient gamma is the speed of adjustment, if it is not statistically significant, the variable is weakly exogenous. Before estimating the error correction model, the cointegrating regression is estimated and the residuals are saved.
Figure 19 Ordinary least squares method result
Ordinary least squares method is performed to save the residuals. After the residuals are saved, two new series “d_avg_deal_price” and “d_num_bidder” are created that are the difference of the “avg_deal_price” and “num_bidder” as shown in Figure 20. Next step is to estimate the error correction model.
Figure 20 New variables are created
Figure 21 Adding lags
Figure 22 Ordinary least squares method with new variables lagged by 1
To estimate the error correction model, ordinary least squares is applied as shown in Figure 22. The variables are lagged by 1 and remove the “const” is removed.
Figure 23 Ordinary least squares method result with new variables
The p-value of variable e_1 is 8.81e^{-10} which is statistically significant. This means the dependent variable “num_bidder” is not weakly exogenous and moves to restore the equilibrium when the system is out of balance.
From the Year 2014 to 2016, there is a big increase in the number of bidders, but the average deal price of Shanghai license plate increases at a similar rate as previous years. This may be the reason that the relationship between the average deal price of Shanghai license plate and the number of bidders is neither unidirectional nor bi-directional from Jan 2002 to Oct 2017.
However, from Jan 2002 to Dec 2013, there is a bi-directional cross-correlation between the average deal price of Shanghai license plate and the number of bidders. The number of bidders is not weakly exogenous and moves to restore the equilibrium when the system is out of balance.
[1] “Shanghai license plate bidding price prediction,” 2017. [Online]. Available: https://www.kaggle.com/bazingasu/shanghai-license-plate-bidding-price-prediction.
[2] Janelle, “Cointegration (Video 7 of 7 in the gretl Instructional Video Series),” 17 Feb 2015. [Online]. Available: https://www.youtube.com/watch?v=bJgx3JLb7fI&t=311s.
Transfer function model is a unidirectional relationship between input and output.
Problem/Hypothesis of the time series data:
To analyze the relationship between Consumer Price Index (CPI) the Gross Domestic Product (GDP) of the Indian market.
Number of records in the dataset:
The dataset contains the history of CPI and the GDP for the past 60 years of the Indian market (one entry for each year).
Data source: https://www.data.gov.in/
Transfer Function:
On our initial visual analysis, we could see that both the input and output parameters are of different scales, the GDP was in lakhs of rupees and the CPI is in percentage, so we have transformed the dependent variable (GPD) with log_{10}.
The relation between the input and output is mostly linear with some discrepancy due to white noise and the auto correlation between the output variable. On business terms, we understand that the Consumer Price Index and GDP of a India were related to each other with positive correlation and having a good interaction. This can be confirmed by the given scatterplot matrix below:
With the above understanding we continue to the transfer function analysis for the above dataset.
With initial Time Series Basic Diagnostics, we see that the for the output variable the ACF and PACF both are tailing off; hence this is an ARIMA model. But, we see that the data is already stationary so we have not done any differencing. Similar pattern is followed by the input series (CPI) also.
For Pre-whitening, we could see that the auto correlation is significant after lag 14, i.e nearly after 15 years. By Indian Government standards, the budget is revised every 5 years and due to the slow growth of economy the impact of CPI on GDP is seen after 15 years. But it is very prominent during the next 5-7 years as seen in the below ACF graph.
For the transfer function, we expect that B^{15} and other variables are to be prominent. The above analysis resulted in the below transfer function wherein the relation between the output variables and the input is seen to be prominent. Thereby confirming the delayed impact of CPI after 15 years on the GDP of India.
Since, the pre-whitening process gave us a model of (2,0,2) and we could also see that the peak in pre-whitening lag plot is at 19 and the significant values start from 15. So s1 is given a value of 4. Below is the transfer function model summary.
The actual transfer function which depicts the current GDP is impacted by the 5 CPI observations that are nearly 15 years old. This also gives us an idea that the 5-year budgeting is having a prominent impact on the GDP.
Conclusion
From the above analysis we have concluded that the GDP estimate on the input variable (CPI) is as given below:
This gives us a clear understanding that the current GDP is dependent on the 3 previous years’ GDP and 6 years of CPI (independent variable) which is 15 years apart. We also notice that as the year grows, the older years are having minimal impact.
Team Members: Apurv Garg, Bhabesh Senapati, Daksh Gupta, Dibyajyoti Panda, Rajiv Hemanth, Tran Minh Duc
The Auto industry uses a tremendous number of materials to build cars, among which most of the weight comes from steel. Most of the modern cars weigh around 3,000 pounds out of which 2,400 pounds is steel. In cars, steel is used to build the chassis, door beams, roof, body panels, exhaust etc.
TATA Steel is the largest producer of steel in India and on the other hand Maruti Suzuki is one of the most extensively purchased brand in automobile industry. So, there should be a relation existing between the stocks of both the big giants.
.
Figure 1: Time Series Plot
Log transformation was applied to the stock prices of both the companies as they helped in better identification of time series patterns by reducing the variance.
For the 1st iteration the Y (dependent) variable is selected as the TATA steel stock prices and the X(independent) variable is MARUTI stock price.
Before starting with the model building of a time series sequence it is very important for the series to be stationary, and if not stationary then it should be converted to a stationary sequence by taking a difference with its lag. Augmented Dicky Fuller test is used to test the stationarity of a series.
Augmented Dicky Fuller test has the following hypothesis:
Ho –> Level series contains a unit root
H1 –> Level series does not contain a unit root
Depending on the Significance of the p value we reject or fail to reject the null hypothesis. If we fail to reject the null hypothesis the series is non-stationary and if we reject the null hypothesis the series is a stationary series.
Both the series were tested for stationarity and both failed to reject the null hypothesis due to insignificant value of p. Hence both were non-stationary and were differenced with a lag of 1 to make the series stationary.
The input series i.e. Maruti stock price series is made to fit the best combination of ARIMA series. The value of I = 1 was determined in the above step.
Illustration 1: The Seasonal ARIMA (2,1,2)(0,0,2) model was finally selected as the best fit with all variables significant and a MAPE of 1.07
Illustration 2: Parameter Estimates
Both the series were then pre whitened on the following parameter values:
p = 2 | I = 1 | q = 2 | P = 0 | Is = 0 | Q = 2
Illustration 3: Cross-Correlation plot between tata steel and Maruti Suzuki stock prices.
After the pre whitening of the series the cross-correlation plot was analysed to determine the transfer function parameters. But the cross-correlation plot showed hikes on both the negative as well as the positive lags. Hence indicating a case of cointegration where both the series are correlated to each other. This can be seen in the illustration shown below.
The p value for Engle – Granger Test is much closer to being significant for the case
Y = Log_TISC and X= Log_MRTI (p = 0.09) as compared to
Y = Log_MRTI and X = Log_TISC.
Hence, the dependent variable was finally selected as Tata Steel price and the independent variable as Maruti Suzuki price.
An error correction model as it is now a problem of Cointegration Regression, with a difference [D=1] (to make the series stationary). The illustration below shows the result for the Ordinary least Square Model made:
Model: OLS, using observations 2003:10-2018:09 (T = 180)
Dependent variable: d_Log_TISC
coefficient std. error t-ratio p-value
—————————————————————————-
d_Log_MRTI_1 −0.134663 0.115003 −1.171 0.2432
e_1 −0.0925064 0.0280208 −3.301 0.0012 ***
d_Log_TISC_1 0.194640 0.0854004 2.279 0.0239 **
——————————————————————————
Mean dependent var 0.007540 S.D. dependent var 0.140167
Sum squared resid 3.274077 S.E. of regression 0.136006
Uncentered R-squared 0.071713 Centered R-squared 0.069012
F(3, 177) 4.557955 P-value(F) 0.004207
Log-likelihood 105.2140 Akaike criterion −204.4279
Schwarz criterion −194.8490 Hannan-Quinn −200.5441
rho −0.017859 Durbin-Watson 2.022787
P-value was highest for variable 10 (d_Log_MRTI_1)
Where:
Y(t) –> Log value of Tata Steel stock price at time t.
Y(t-1) –> Log value of Tata Steel stock price at time t-1
X(t-1) –> Log value of Maruti Suzuki stock price at time t-1
E –> Random error