Analytics And Intelligent Systems

NUS ISS AIS Practice Group

CUSTOMER SEGMENTATION AND STRATEGY DEVELOPMENT BASED ON CUSTOMER LIFETIME VALUE: CASE STUDY ON A WIRELESS COMPANY — June 6, 2017

CUSTOMER SEGMENTATION AND STRATEGY DEVELOPMENT BASED ON CUSTOMER LIFETIME VALUE: CASE STUDY ON A WIRELESS COMPANY

Capture

Source: Click here for the paper

Submitted by:

Team Incognito

Pankaj Mitra (A0163319E), Deepthi Suresh (A0163328E), Anand Rajan (A0163200B) and Neeraja Lalitha Muralidharan (A0163327H)

Spatial-Temporal Analytics with Students Data to recommend optimum regions to stay — May 7, 2017

Spatial-Temporal Analytics with Students Data to recommend optimum regions to stay

Objectives & Motivation:

The objective is to find convenient places for students to stay, based on data collected in Singapore.

Any international student coming to Singapore needs reliable information about convenient places to rent. To address this, we analysed student movement data collected by NUS-ISS students and carried out exploratory spatial data analysis on it to find patterns and insights. We considered the basic amenities a student would look for: economical accommodation, cycling paths, libraries, proximity to MRT stations, parks, and hawker centres.

Data Sources:

  • Data points of all NUS-ISS students – IVLE (app used: OpenPaths)
  • Dwelling data – https://data.gov.sg/dataset/master-plan-2014-planning-area-boundary
  • Hawker Centres – https://data.gov.sg/dataset/hawker-centres
  • Cycling Path – https://data.gov.sg/dataset/cycling-path-network
  • Park Connector Loop – https://data.gov.sg/dataset/park-connector-loop

Methodology:

blog

INPUT:

OpenPaths data containing students' personal location information was collected for the month of April and then cleaned. Because the data was recorded on multiple mobile devices, the date and time formats were inconsistent, so all date and time values were converted to a single standard format. To support further analysis, separate columns were created for date and time. Outliers were treated using R, and data points outside Singapore were discarded. The Name and MailID columns were modified in Excel to fill in missing details.
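For illustration, here is a minimal R sketch of this cleaning step, assuming the OpenPaths export has been combined into a single CSV with `datetime`, `lat` and `lon` columns (the file and column names are hypothetical):

```r
library(lubridate)

# Hypothetical combined export of the April OpenPaths data
paths <- read.csv("openpaths_april.csv", stringsAsFactors = FALSE)

# Parse the mixed date/time formats from different devices into one standard format
paths$datetime <- parse_date_time(paths$datetime,
                                  orders = c("Ymd HMS", "mdY HM", "dmY HMS"))

# Separate date and time columns for further analysis
paths$date <- as.Date(paths$datetime)
paths$time <- format(paths$datetime, "%H:%M:%S")

# Drop points outside an approximate bounding box for Singapore
paths <- subset(paths, lat >= 1.15 & lat <= 1.48 & lon >= 103.6 & lon <= 104.1)
```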

Exploratory Analysis of Students Data (with derived fields):

Mean Centre:

Picture1

From the above exploration, it is evident that most students spend their time at the university, travel to their residence, and use the MRT most of the time. Since students would prefer to stay near the university, close to an MRT station, and with access to cycling paths, our aim is to find and suggest better zones for student life. Thus, we need to add these layers to the student data and carry out further analysis.

Process:

To get more insights into a suitable dwelling, we tried to separately analyze various datasets such as HDB dwelling population data, Hawker center location, MRT station, and cycling paths.

Geographically Weighted Regression:

To relate the student data to the HDB and population data layer, geographically weighted regression (GWR) was performed using the variables shown below, under the assumption that data points with late-night timestamps represent students' homes. A spatial join was performed on the hawker centre, dwelling, and OpenPaths student data layers to obtain the final GWR model.

Explanatory Variables:

The Count_ variable is the number of student data points per polygon, Count_1 is the number of hawker centres per polygon, SHAPE_Area is the area of the dwelling-layer polygon, and HDB is the total HDB count for the polygon.

GWR Model

The GWR results show that the model performs with moderate accuracy, with an adjusted R² of 0.31.

The Spatial Autocorrelation (Moran's I) tool was run on the regression residuals to check that the model residuals are spatially random. Statistically significant clustering of high and/or low residuals (model under- and over-prediction) indicates that our GWR model is not accurate enough for prediction.
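The same kind of residual check could be reproduced in R with the spdep package. This is a minimal sketch only, assuming `zones` is a polygon layer with a column `gwr_resid` holding the GWR residuals per polygon (layer and column names are hypothetical):

```r
library(sf)
library(spdep)

zones <- st_read("zones_with_gwr_residuals.shp")   # hypothetical layer name

# Build a neighbourhood structure and spatial weights from the polygons
nb <- poly2nb(zones)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Moran's I on the GWR residuals: a significant result means the residuals
# are spatially clustered, i.e. not random
moran.test(zones$gwr_resid, lw, zero.policy = TRUE)
```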

GWR Fail

The map below shows the localised R² values for predicting the HDB population from the student data. The dark red regions are places where the model predicts with higher accuracy, and the blue regions are places where its predictions are poor.

gwr-local-e1494152769715.png

Since the model residuals are not random according to the spatial autocorrelation test (Moran's I), the model cannot be used for prediction. It was built only to study the student data in combination with another data layer.

Thus, to find convenient places for students, the places were ranked using student data points and factors like MRT, cycling path and hawker centers availability.

Output:

Ranking zones based on Hawker Centres:

To obtain the density of hawker centres in each zone, we calculated a new field which was used for ranking.

The classified field was then ranked using the Reclassify tool, ranking the regions by hawker centre density; the result shows places ranked as economical zones for food.

Rank1

Ranking zones Based on proximity to cycling paths:

The areas are ranked by their proximity to cycling paths; the data is converted to raster and then ranked.

Rank3

Ranking zones based on MRT by reclassify:

Easy access to public transport is one of the major considerations when choosing a place to stay, so we used proximity to MRT as a ranking factor. The MRT location data is converted to raster to rank areas based on their proximity to MRT stations.

Rank2

 Final Ranking:

The final rank for each zone in Singapore was calculated as the average of the other three ranks (MRT, hawker centre, and cycling path). Zones with a good final rank are the most favourable for staying, and this final ranking can help a student choose a place to stay based on individual priorities. A small sketch of this averaging step follows below.
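A minimal R sketch of the averaging step, assuming a data frame `zones` that already holds the three component ranks per planning zone (the column names are hypothetical):

```r
# 1 = best rank for each individual factor
zones$final_rank <- rank(
  rowMeans(zones[, c("rank_mrt", "rank_hawker", "rank_cycling")]),
  ties.method = "min"
)

# Most favourable zones first
head(zones[order(zones$final_rank), c("zone_name", "final_rank")])
```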

 

 Final Rank

 

 

Recommendations:

Using the final ranking, we can recommend convenient places to stay to a new student coming to NUS, with MRT access, cycling paths, and food places all considered in the ranking.

As expected, the better-ranked zones cluster around NUS itself, but the ranking also suggests other areas.

As future work, we could add a configurable element that lets the hawker centre layer be replaced with other layers such as libraries, parks, HDB rental prices, or bus stops, forming an ideal tool for incoming students.

For detailed analysis, please read the entire report: – Spatial-Temporal Analytics with Students Data

Explored and submitted by: 

ARUN KUMAR BALASUBRAMANIAN (A0163264H)

DEVI VIJAYAKUMAR (A0163403R)

RAGHU ADITYA (A0163260N)

SHARVINA PAWASKAR (A0163302W)

LI MEIYAO (A0163379U)

STUDY AREA DETERMINATION — May 6, 2017

STUDY AREA DETERMINATION

Objective:

To find most accessible study areas for students in NUS

Problem Space:

Capture

Analysis Strategy:

  • What? – Availability of study areas for students
  • Where? – Inside NUS campus
  • Why? – The various reasons for the preferred locations
  • How? – Finding the clusters around various study locations

Data Exploration and Feature Addition:

  • The data source is OpenPaths
  • The class data was cleansed and the records pertaining to geographical coordinates within Singapore/NUS were retained
  • The initial analysis of the sampled data was performed using tools such as R, Carto and ArcGIS to study the geographic spread
  • A new dataset was formulated to represent the various study centres in NUS
  • Transformation was performed to bring the variables into the format needed for geo-visualisation
  • Reverse geocoding was performed on the dataset using the ‘ggmap’ package in R to obtain the corresponding locations (a minimal sketch follows below)
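A minimal sketch of the reverse-geocoding step with ggmap, assuming a data frame `pts` with `lon` and `lat` columns from the cleansed class data (a Google Maps API key is required; the key shown is a placeholder):

```r
library(ggmap)
register_google(key = "YOUR_API_KEY")   # placeholder key

# Reverse geocode each point to a human-readable address
pts$address <- apply(pts[, c("lon", "lat")], 1, function(p) {
  revgeocode(as.numeric(p), output = "address")
})
```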

Preliminary Analysis:

Picture1

  • Carto was used to analyse the spread of the data points during class hours and after class hours
  • The analysis showed that after class hours the population spread is greater at University Town, owing to the availability of more study areas and facilities

Modelling:

Capture

Base Map Creation and Polygon Generation:

Capture

Addition of Layers:

Picture3

Step 3: Loaded class data (master class namely) and converted coordinates for visibility

Density Analysis:

Capture

Picture4

Step 6: The high-density area was in and around the National University of Singapore. We can conclude that the data points belong to people who either work or study at NUS.

Assumption and Addition of new Layer:

Capture

Hot Spot Analysis:

Step 8: Hot Spot Analysis was performed, Arc Tool Box->Spatial Statistics Tools->Mapping Clusters->Optimized Hotspot Analysis

Picture5

Model Diagram – Proximity Analysis:

Picture6

Step 9: Proximity Analysis was performed to find the most accessible study areas for students in NUS: Arc Tool Box -> Analysis Tools -> Proximity -> Near
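A comparable near analysis could be sketched in R with the sf package. This is illustrative only, assuming point layers `students` and `study_areas` with a `name` attribute (the file names are hypothetical):

```r
library(sf)

students    <- st_read("student_points.shp")
study_areas <- st_read("nus_study_areas.shp")

# Index of the nearest study area for every student point
nearest_idx <- st_nearest_feature(students, study_areas)

# Distance to, and name of, that nearest study area
students$near_dist <- st_distance(students, study_areas[nearest_idx, ], by_element = TRUE)
students$near_area <- study_areas$name[nearest_idx]
```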

Proximity Analysis Results:

Capture

Inferences/Solution Outline:

  • Comparing the inference obtained from Carto with the model built in ArcGIS, we find that the students focus only on University Town
  • From the model it is evident that there were other study areas that could be preferred, as the data points were close to these study areas
  • The students can explore other areas like FoE and SoC for holding discussion sessions
  • We can propose a new study area at an optimal location based on the geographic distribution of student data, in case the number of students enrolled increases over time

Limitations:

  • The data covers only students’ locations
  • The sample size is limited to ISS students only
  • Accurate shuttle bus timing data was not available
  • Enrolment numbers for each faculty were not available

Team Name: Incognito

Team Members: Pankaj Mitra, Deepthi Suresh, Anand Rajan, Neeraja Lalitha Muralidharan, Nandini Paila Reddy

Delay Estimation in Pedestrian crossing — April 29, 2017

Delay Estimation in Pedestrian crossing

Team Name: Incognito

Team Members: Deepthi Suresh, Neeraja Lalitha Muralidharan, Pankaj Mitra, Sindhu Rumesh Kumar

Journal:

This study uses multiple linear regression:

1. To provide theoretical support for traffic management and control

2. To increase efficiency at intersections and improve security

A one-page summary of the journal paper “Estimates of pedestrian crossing delay based on multiple linear regression and application”, authored by Li Dan and Xiaofa Shi.

Journal_Pedestrian Crossing

Click here for Journal.

Analysis on Workplace Injuries — March 17, 2017

Analysis on Workplace Injuries

Untitled

The objective of this dashboard is to demonstrate different levels and types of injuries caused in the workplaces of Singapore.

 


Questions:

  1. In which industries do the majority of injuries occur?

The majority of injuries occur in the Construction and Manufacturing industries consistently over the years, but in 2016 injuries in Accommodation and Food Services also rose.

  2. With the current scenario in workplaces, which type of injury occurs most frequently?

The number of minor injuries (94.9%) greatly exceeds the number of major (4.5%) and fatal (0.5%) injuries.

  3. Is there a trend in the number of injuries across industries over the years 2014-2016?

Fatal and major injuries are consistent through the years, while the number of minor injuries dropped in 2015 and increased drastically in 2016.

  4. What are the common types of minor injuries across different industries?

Injuries caused by slips, trips and falls are the most common across industries. Injuries caused by cuts or stabs by objects are more common in the Accommodation and Food Services industry, while workers in the Construction and Manufacturing industries are more often struck by moving objects.

  5. Why have minor injuries increased drastically from 2015 to 2016 while fatal and major injuries remain consistent?

Measures taken to prevent fatal injuries appear effective, yet minor injuries have not decreased. This could be because incidents that would previously have been fatal now result in minor injuries instead.

 

Final Analysis

We can clearly see that fatal and major injuries are very low, and steps have been taken to reduce them further, yet there is still a rise in minor injuries. This may be because injuries that would previously have been fatal now end up as minor injuries.

Being struck by moving or falling objects is one of the common causes of minor injury in both the Construction and Manufacturing industries. As a common solution, wireless sensors (heat and motion) could be fitted to the objects so that they emit an alert sound whenever they come within 50 m of a person; this way, people working on site would be aware of any moving objects in their vicinity and could move out of harm's way.

The injuries may be minor, but if they occur to multiple people at the same time, they can affect the overall productivity of the company.

 

Tableau Public link:

Workplace Injuries in Singapore(2011-2016)

Submitted by:

Pankaj Mitra (A0163319E)

Deepthi Suresh (A0163328E)

Neeraja Lalitha Muralidharan (A0163327H)

Kriti Srivastasa (A0163206N)

Sindhu Rumesh Kumar (A0163342M)

 

Analysis of Crude Oil Production And Import Prices — December 2, 2018

Analysis of Crude Oil Production And Import Prices

Prepared by:

Team Data Insighters

Balacoumarane Vetrivel – A0178301L

Mohammed Ismail Khan – A0178366N

Meghna Vinay Amin – A0178307Y

Sreekar Bethu – A0178220L

Vaishnavi Renganathan – A0178229U

Introduction

With each passing year, oil seems to play an even greater role in the global economy. In the early days, striking oil during a drill was often considered something of an annoyance, as the intended targets were usually water or salt. It wasn’t until 1857 that the first commercial oil well was drilled in Romania. The U.S. petroleum industry was born two years later with a drilling in Titusville, Pa.

Oil has replaced coal as the world’s primary fuel source. Oil’s use in fuels continues to be the primary factor in making it a high-demand commodity around the globe.

The Determinants of Oil Prices:

​With oil’s stature as a high-demand global commodity comes the possibility that major fluctuations in price can have a significant economic impact. The two primary factors that impact the price of oil are:

  • Supply and demand
  • Market sentiment

Supply and demand:

The price of oil as known is actually set in the oil futures market. An oil futures contract is a binding agreement that gives one the right to purchase oil by the barrel at a predefined price on a predefined date in the future. Under a futures contract, both the buyer and the seller are obligated to fulfil their side of the transaction on the specified date.

Market sentiment:

The mere belief that oil demand will increase dramatically at some point in the future can result in a dramatic increase in oil prices in the present as speculators and hedgers alike snap up oil futures contracts. The opposite is also true that means prices can hinge on little more than market psychology at times.

Other variables:

  1. News on new supply
  2. Changes in consumer habits
  3. Terrorist attacks and disturbance
  4. Alternative energy sources
  5. Economic growth

Crude oil production in US

For seven straight years, the US has pumped more oil and gas out of the ground than any other country and this lead will only widen. Production of crude topped 10.7 million barrels per day with production of natural gas hitting 4 million barrels per day. The surge in U.S. output is due in large part to the wide use of horizontal hydraulic fracturing, or fracking, as new technologies give drillers access to some of the largest oil deposits in the world that were once too tight to exploit.

Data source and Understanding

Data source:

The data is obtained from the link

https://data.oecd.org/energy/crude-oil-production.htm#indicator-chart

image-1

3.1 Crude oil production (input series):

Crude oil production is defined as the quantities of oil extracted from the ground after the removal of inert matter or impurities. It includes crude oil, natural gas liquids (NGLs) and additives. This indicator is measured in thousand tonne of oil equivalent (toe). The data which has been considered is Total Crude oil production in US.

3.2 Crude Oil Import price (output series):

Crude oil import prices come from the IEA’s Crude Oil Import Register. Average prices are obtained by dividing value by volume as recorded by customs administrations for each tariff position. Values are recorded at the time of import and include cost, insurance and freight, but exclude import duties. The price measured in USD per barrel of oil.

Graph:

image-2

The graph depicts the time series of both the crude oil production in US and import price rate. To some extent, there is an inverse relationship between these two.

Hypothesis framework

That total crude oil production is a major factor in determining the price of a barrel is a tacit assumption. To examine this common belief, the import price of oil is considered as the input series (X) and oil production as the output series (Y).

Hypothesis – “There is an underlying relationship between the total crude oil production in US and crude oil import price.”

Equation – Production(Y(t)) ~ Import Price(X(t)).

 

Procedure

The following steps show how various parameters were considered while building the ARIMA model and ultimately the team arrived at the Transfer Function.

Step 1: Fit an ARIMA model to the input production series Xt. The input data is loaded and ARIMA models with different parameters are fitted. An IMA(2,1) model shows a good overall summary.

ACF and PACF were less than significant level and model coefficient are significant. The below plot shows the ACF and PACF obtained for the IMA model tried above. It concurs what we have observed. The residuals plot is also attached for the model.

image-3

image-4

Arima residual check:

image-5

image-6

Step 2: By fitting a preliminary model, we can remove any autocorrelation present in the data. This process is called pre-whitening.

Pre-whiten the input production-output price series and check for cross correlation.
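A minimal R sketch of this pre-whitening step, assuming numeric series `production` (input) and `import_price` (output); the IMA(2,1) filter from Step 1 is applied to both series before cross-correlating the residuals:

```r
library(forecast)

# Step 1 model on the input series: IMA(2,1), i.e. ARIMA(0,2,1)
fit_x <- Arima(production, order = c(0, 2, 1))

# Apply the same (fixed) model to the output series, then cross-correlate the residuals
fit_y <- Arima(import_price, model = fit_x)
ccf(residuals(fit_x), residuals(fit_y), lag.max = 12,
    main = "Pre-whitened cross-correlation")
```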

image-7

The cross-correlation plot between input and output indicates significant correlation at lags 3 to 5 and at lag 8.

This suggests that our transfer function equation will have terms related to the input series only for lags 3, 4, 5 and 8.

The model equation:

Yt = v3·X(t-3) + v4·X(t-4) + v5·X(t-5) + v6·X(t-6) + v7·X(t-7) + v8·X(t-8) + nt

Step 3: Compute the transfer function.

The parameters identified are b = 3, s = 8 - 3 = 5, and r = 2. These values were read off the pre-whitening plot above and used to build the transfer function.

image-8

image-9

Model residual diagnostics:

image-10

image-11

Transfer Function Output

image-12

The residuals plot for the transfer function is obtained through the software. We do a diagnostic check for the fitted transfer function model – noise. The parameter significance is also checked from the plot and it is found that the ACF and PACF value ranges are in the significant range that we are looking for with varied significance.

6.1 Final Equation:

Yt - 0.03·Y(t-1) - 1.03·Y(t-2) - 1.85·Y(t-3) + 1.91·Y(t-4) = -290.3574 + 197.96·X(t-3) - 249.74·X(t-4) + 19.84·X(t-5) + 93.04·X(t-6) + 230.5484·X(t-7) - 763.4634·X(t-8) + e(t) + 2.56·e(t-1) + 3.08·e(t-2) + 1.14·e(t-3)

Inference

From the transfer function it is evident that the import price depends on lagged values (lags 3, 4, 5 and 8) of crude oil production. As stated in our hypothesis, there indeed exists a relationship, inverse in nature, between the two variables considered in our analysis, namely crude oil production and the import price. Rising crude oil production forecasts a fall in the import price.

 

 

Transfer Function Analysis: Relationship between number of deaths aggregated by month and number of admissions to Accident & Emergency Departments per month — November 9, 2018

Transfer Function Analysis: Relationship between number of deaths aggregated by month and number of admissions to Accident & Emergency Departments per month

  • Liang Shize (A0178178M)
  • Nandana Murthy(A0178563R)
  • Tan Zhi Wen David (A0056418A)
  • Zhao Ziyuan(A0178184U)
  • Zhu Huiying(A0178222H)

 

1. Background

In rapidly ageing Singapore, the demographic trends are worrying. While the birth rate fell to a seven-year low, the number of deaths recorded was the highest in at least two decades. Deaths rose by 4 per cent, from 20,017 in 2016 to 20,905 last year, according to the Report on Registration of Births and Deaths 2017 released by the Immigration and Checkpoints Authority (ICA). This study therefore aims to analyse factors related to deaths in order to give the government more insight into the current situation. According to our analysis, admissions to Accident & Emergency Departments is one of the factors correlated with deaths.

 

2. Hypothesis

We assume that the number of deaths aggregated by month and the number of admissions to Accident & Emergency Departments per month are correlated; in other words, an increase in the number of admissions will raise the death count. The dependent variable is the number of deaths aggregated monthly, while the independent variable is admissions to Accident & Emergency Departments aggregated by month.

 

3. Dataset

3.1 Data Source

The two datasets were collected from the Department of Statistics Singapore. The admissions to Accident & Emergency Departments data was extracted from the dataset ‘Admissions to Public Sector Hospitals, Monthly’, whereas the monthly death data was extracted from ‘Deaths by Ethnic Group and Sex, Monthly’. (Links: http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=15193 and http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=15167)

 

3.2 Data Transformation

Originally, the data structure was a multi-dimension table which was not suitable for analytics. Therefore, we transformed the data structure in order to make it more suitable for analytics. Besides, we selected the data from Jan 2012 to Dec 2017 only as we wanted to mainly focus on its recent trend.

 

3.3 Data Records

The data was extracted from Jan 2012 to Dec 2017, giving 72 records in total, which is sufficient for transfer function modelling; a commonly used rule of thumb is a minimum of about 60 observations.

 

4. ARIMA Model

4.1 Observing plots

As shown in Figure 1 and 2, the monthly deaths are non-stationary with a trend, while the admissions to Accidents & Emergency Department are stationary. In order to test their stationary, KPSS test was employed. The null hypothesis of KPSS test is that the data is level or trend stationary. Therefore, if the data is level or trend stationary, its p-value should be larger than 0.05.

According to the KPSS test output shown in Figure 3, it is clear that the monthly deaths are non-stationary, while the admissions to Accident & Emergency Departments are stationary.
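These tests can be reproduced in R with the tseries package; a minimal sketch, assuming `deaths` and `admissions` are the two monthly series:

```r
library(tseries)

# Null hypothesis: the series is (level/trend) stationary,
# so a small p-value indicates non-stationarity
kpss.test(deaths, null = "Trend")       # expected: small p-value (non-stationary)
kpss.test(admissions, null = "Level")   # expected: p-value > 0.05 (stationary)
```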

1

Figure 1 Line chart of monthly death

 

2

Figure 2 Line chart of admissions to Accidents & Emergency Department

 

3.13.2

Figure 3 Output of KPSS test

 

4.2 Determination of parameters

Based on the ACF and PACF graphs in Figure 4, the parameters of ARIMA model can be determined.

  • p – Non-seasonal Autoregression Order

According to the PACF graph, there is a spike at lag 2. Therefore, p is equal to 2.

  • d – Non-seasonal Differencing Order

Because the original data is stationary, the differencing order d is equal to 0.

  • q – Non-seasonal Moving Average Order

According to the ACF graph, there is a spike at lag 2. Therefore, q is equal to 2.

  • P – Seasonal Autoregression Order

According to the PACF graph, there is a spike at lags 12 and 24. Therefore, P is equal to 2.

  • D – Seasonal Differencing Order

Because the original admissions data is stationary, D is equal to 0.

  • Q – Seasonal Moving Average Order

According to the ACF graph, there is a spike at lag 12. Therefore, Q is equal to 1. A minimal sketch of fitting the resulting seasonal ARIMA model follows below.
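A minimal R sketch of fitting the seasonal ARIMA with these orders using the forecast package, assuming `admissions` is the monthly admissions series starting in January 2012:

```r
library(forecast)

adm <- ts(admissions, start = c(2012, 1), frequency = 12)

# Seasonal ARIMA (2,0,2)(2,0,1)[12] as identified above
fit <- Arima(adm, order = c(2, 0, 2), seasonal = c(2, 0, 1))
summary(fit)

# Residual diagnostics: ACF of residuals and Ljung-Box test
checkresiduals(fit)
```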

4

Figure 4 Time Series Basic Diagnostics

 

4.3 Result of Seasonal ARIMA (2,0,2) (2,0,1)

The result of ARIMA is shown below. It is obvious that some coefficients are not significant, so the insignificant parameters have to be removed one by one to find out the parameters of best ARIMA model.

5

Figure 5 Output of Seasonal ARIMA (2,0,2) (2,0,1)

 

4.4 Output of ARIMA Model Group

In order to find out the best parameters, ARIMA Model Group was employed. The output of ARIMA Model Group is shown as below. According to AIC criteria, the best solution is (2,0,2) (1,0,1).

6

Figure 6 Output of ARIMA Model Group

 

4.5 Output of Seasonal ARIMA (2,0,2) (1,0,1)

The output of Season ARIMA (2,0,2) (1,0,1) is shown as below. The model is regarded as the best model for the following reasons:

  1. Firstly, according to Forecast in Figure 8, the predictive trend and range are reasonable.
  2. Secondly, according to Parameter Estimates in Figure 9, all parameters are significant.
  3. Thirdly, according to Residuals in Figure 10, the residuals are randomly distributed, while there are no spikes in ACF and PACF graphs. Therefore, this model can be used for further analysis.

7

Figure 7 Model Summary

 

8

Figure 8 Forecast

 

9

Figure 9 Parameter Estimates

 

10

Figure 10 Residuals

 

5. Transfer Function

5.1 Prewhitening

After finding a suitable ARIMA model, the parameters of X's ARIMA model were used to pre-whiten the input and output series in order to obtain their cross-correlation plot, shown below.

11

Figure 11 Prewhitening Plot

 

5.2 Identifying Parameters and Fitting Transfer Function Noise Model

According to Figure 11, there are two spikes at lags 10 and 11: the first non-zero correlation occurs at lag 10 and the values decay after lag 11. Therefore b = 10 and s = 11 - 10 = 1; r is set to 2. The parameters used are shown in the figure below.
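The JMP transfer function has a rational-lag form, but a rough R approximation under these assumptions is a regression with seasonal ARIMA errors on the input lagged by 10 and 11 months (a sketch only, not the exact model fitted above):

```r
library(forecast)

lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

# Input (admissions) lagged by b = 10 and b + s = 11 months
X <- cbind(adm_lag10 = lag_k(admissions, 10),
           adm_lag11 = lag_k(admissions, 11))

keep <- 12:length(deaths)                 # drop the rows lost to lagging
y    <- ts(deaths[keep], frequency = 12)

# Regression with seasonal ARIMA errors as an approximation of the transfer function
fit_tf <- Arima(y, order = c(2, 0, 2), seasonal = c(1, 0, 1), xreg = X[keep, ])
summary(fit_tf)
```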

12

Figure 12 Transfer Function Model

 

5.3 Diagnostic Checks

  • Check Residuals

According to Figure 13, the residuals are randomly distributed.

13

Figure 13 Residuals of Transfer Function

 

  • Check Significance of Parameters

According to the result, all parameters are significant.

14

Figure 14 Parameter Estimates of Transfer Function

 

5.4 Model Comparison

Although the above solution is good enough, other parameters were tried for comparison. Finally, according to the output of model comparison in Figure 15, the solution mentioned above is the best one.

15

Figure 15 Model Comparison

 

5.5 Expanded Formula

16

Figure 16 Formula of Transfer Function

The formula of transfer function is shown above. However, in order to understand the formula better, we expanded the model in full with backshift operator as shown below. The expanded formula includes Y-Deaths, X-Admissions to Accident & Emergency Departments and e-Error Term. The number at the bottom right corner means the lags of its corresponding item.

17

Figure 17 Expanded Formula

 

5.6 Model Summary

As shown in Model Summary in Figure 18, the MAPE is equal to 2.77, while the MAE is equal to 46.92. The performance is acceptable. The forecasting points and confidence interval are shown in Figure 19.

18

Figure 18 Model Summary

 

19

Figure 19 Forecasting graph

 

5.7 Output of test data

As shown in Figure 20, the MAE on the test data is 100.157, while the MAPE is 5.55. Both evaluation metrics are larger than those on the training data. Nevertheless, the result can still serve as a reference for the government. (A sketch of computing these metrics follows the table.)

Time Predictive Value Actual Value
Jan 2018 1696.946 1924
Feb 2018 1565.284 1662
Mar 2018 1666.980 1776
Apr 2018 1654.818 1624
May 2018 1684.973 1803
Jun 2018 1710.541 1729
 
MAE 100.157
MAPE 5.545054604

Figure 20 Output of test data
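A minimal sketch of how the two metrics are computed, using the predicted and actual values from the table above:

```r
actual    <- c(1924, 1662, 1776, 1624, 1803, 1729)
predicted <- c(1696.946, 1565.284, 1666.980, 1654.818, 1684.973, 1710.541)

mae  <- mean(abs(actual - predicted))                 # close to the reported 100.157 (table values are rounded)
mape <- mean(abs(actual - predicted) / actual) * 100  # approximately 5.545
c(MAE = mae, MAPE = mape)
```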

 

6. Conclusion

This transfer function captures the relationship between deaths and admissions to Accident & Emergency Departments. It supports the initial hypothesis that admissions to Accident & Emergency Departments lead the number of deaths aggregated by month in Singapore, and the model's performance is acceptable. The government could therefore use this model to predict deaths in advance and then take action to lower the death count. For example, if the government sees an increase in the number of admissions, it can put more effort into providing medical assistance and carry out research to understand the underlying reasons.

Cointegration and Causality analysis of Crude Oil WTI and Dow Jones Commodity Index Energy — November 8, 2018

Cointegration and Causality analysis of Crude Oil WTI and Dow Jones Commodity Index Energy

Data Source and EDA

Data Source and definition

The relationship between weekly data of the crude oil (WTI) price and the Dow Jones Commodity Index is examined in this report. The two time series begin on 6 Jul 2014 and consist of 221 observations. WTI (West Texas Intermediate) is a grade of crude oil used as a benchmark in oil pricing; the downloaded data is measured in USD per barrel. The Dow Jones Commodity Index was launched on 2 Jul 2014 by S&P. The DJCI includes 23 of the commodities in the world production-weighted S&P GSCI, and it weights each of the three major sectors (energy, metals, and agriculture and livestock) equally. In this report we look only at the energy sector of the DJCI (DJCIEN), which consists of 6 commodities: WTI Crude Oil, Heating Oil, Brent Crude Oil, RBOB Gasoline, Gasoil and Natural Gas.

The data are downloaded from the following website: https://www.investing.com/indices/dj-commodity-energy-historical-data

https://www.investing.com/commodities/crude-oil-historical-data

EDA1

Figure 1. Plot of WTI Price and DJCIEN

From the plot it can be seen that essentially the two time-series move together. To discover the relationship between these two, we do pre-whitening in JMP to see whether we can apply the transfer function technique.

Pre-whitening using JMP

Using WTI Oil Price as the input series, the ACF and PACF plots are shown below.

Figure 2. ACF and PACF of WTI Oil Price Series

It can easily be seen that the time series is not stationary. Therefore, we apply first-order differencing to the series.

Figure 3. ACF and PACF of the differenced WTI Oil Price Series

ARIMA(1,1,1) is applied to the input series.

Figure 4. ARIMA(1,1,1) applied to the input series

Applying the Ljung-Box test to the residuals after ARIMA, the p-values are all above 0.05; therefore, we do not reject the null hypothesis that the residuals are independently distributed. The model fits fine.

Figure 5. Input series residuals after ARIMA(1,1,1)

ARIMA(1,1,1) is applied to both the input and output series and the following pre-whitening plot is obtained.

Figure 6. Pre-whitening Plot

From this plot, there is a significant peak at lag 0. The other correlations on the positive and negative sides are small enough to be negligible. This implies that Yt (the output series) reacts immediately to changes in the input Xt, so it is not possible to identify the transfer function parameters b, s and r here. Gretl will instead be used to investigate the cointegration relationship between the two time series.

Cointegration test and modeling using Gretl

Cointegration test

Before the cointegration test, Augmented Dickey-Fuller tests are performed.

Figure 7. ADF test for WTI Oil Price
Figure 8. ADF test for DJCIEN

Both have p-values > 0.05; therefore, we do not reject the null hypothesis that a unit root is present in each time series.

The theoretical background for Engle-Granger Cointegration test is layed down as follows:

The estimation of the regression equation of Y on X (or X on Y) in level form by OLS is referred to as the cointegrating regression, as shown below:

Yt = intercept + b1·Xt + error

Xt = intercept + b1·Yt + error

where b1 is the long-run regression coefficient.

If Xt and Yt are non-stationary and cointegrated, then a linear combination of them must be stationary. In other words:

Yt - βXt = Ut, where Ut is stationary.

The ADF test is carried out on Ut to confirm its stationarity.
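A minimal R sketch of the Engle-Granger two-step procedure under these definitions, assuming series `djcien` (Y) and `wti` (X). Note that tseries::adf.test uses standard Dickey-Fuller critical values, so tseries::po.test (Phillips-Ouliaris) is a stricter cross-check for cointegration:

```r
library(tseries)

# Step 1: cointegrating regression of Y on X in levels
coint_fit <- lm(djcien ~ wti)
u <- residuals(coint_fit)

# Step 2: test the residuals for a unit root
adf.test(u)                      # small p-value suggests Ut is stationary
po.test(cbind(djcien, wti))      # Phillips-Ouliaris cointegration test as a cross-check
```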

The results from the Engle-Granger cointegration test are shown below.

Figure 9. Engle-Granger cointegration test result

The p-value of the ADF test on the residual is small; therefore we reject the null hypothesis of a unit root, and conclude that Ut is stationary.

We conclude that the two time-series are cointegrated.

Error Correction Model

From Yt - βXt = Ut, we know that Ut captures the error-correction relationship by measuring the degree to which Y and X are out of equilibrium. Therefore, the ECM can be built as ΔYt = C + Φ·ΔXt + α·Ut-1, where Ut-1 = Yt-1 - βXt-1 and ΔXt captures any short-term effects.
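Continuing the sketch above, the ECM can be estimated by OLS using the lagged residual from the cointegrating regression as the error-correction term:

```r
n <- length(u)

# Delta Y_t = C + Phi * Delta X_t + alpha * U_{t-1} + error
ecm <- lm(diff(djcien) ~ diff(wti) + u[-n])
summary(ecm)   # a significant, negative coefficient on u[-n] indicates error correction
```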

Ordinary Least Squares is used first to obtain Ut, and Ut-1 is then used to build the ECM:

Figure 10. Setting used for Error Correction Model

The result of the model is shown below.

Figure 11. Error Correction Model

The coefficient of the error-correction term is statistically significant, indicating that short-term imbalances between the two series are corrected back towards the long-run equilibrium. The Durbin-Watson statistic indicates no serious autocorrelation in the residuals, and the R² value is 0.881811. Therefore, we conclude that the ECM is reasonably good.

Granger Causality Test

If two time-series, X and Y, are cointegrated, there must exist Granger causality either from X to Y, or from Y to X, or in both directions. However, the reverse is not true. Since we already tested the cointegration relationship, Granger Causality Test is then carried out to discover the causality relationship.

In Gretl, the Granger causality test is done via Vector Autoregression. Before running the VAR, it is important to choose the maximum lag p for VAR(p), since this value strongly affects the VAR results. We use the VAR lag selection function in Gretl to determine the maximum lag. From the following result, lag 1 is chosen because it gives the smallest AIC and BIC.

Figure 12. VAR lag selection
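The same lag selection, VAR estimation and Granger test can be sketched in R with the vars package (illustrative only, assuming the series are named `wti` and `djcien`):

```r
library(vars)

dat <- cbind(djcien = djcien, wti = wti)

VARselect(dat, lag.max = 10, type = "const")   # inspect AIC/BIC to choose p

var1 <- VAR(dat, p = 1, type = "const")

# H0: the named variable does not Granger-cause the remaining variables
causality(var1, cause = "wti")$Granger
causality(var1, cause = "djcien")$Granger
```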

Using lag 1 to run the VAR, the result is shown as follows:

Figure 13. Result of VAR

From the equations:

Equation 1 (DJ Energy): the null hypothesis that all lags of WTI have zero coefficients cannot be rejected at the 5% significance level. Therefore, WTI Oil Price does not Granger-cause the Dow Jones Commodity Index Energy.

Equation 2 (WTI): the null hypothesis that all lags of DJ Energy have zero coefficients is rejected at the 10% significance level, and essentially at the 5% significance level as well.

Therefore, we conclude that Dow Jones Commodity Index Energy Granger cause WTI Oil Price.

Conclusion

From the above analysis, the following conclusions are deduced:

  1. A long-run equilibrium relationship exists between the WTI Crude Oil Price and the Dow Jones Commodity Index Energy.
  2. Unidirectional causality exists: the Dow Jones Commodity Index Energy Granger-causes the WTI Oil Price, while the WTI Oil Price does not Granger-cause the Dow Jones Commodity Index Energy.

    Team: AngelWorm

    Members:

    Name Student ID Email Address
    Ang Wei Hao Max A0178278L e0267589@u.nus.edu
    Li Xiaoguang A0178450B li.x@u.nus.edu
    Laura Long Xiaoxing A0055448Y laura.long@u.nus.edu
    Ma Wenjing A0073595U e0267970@u.nus.edu
Transfer function assignment of Analytics_1_Group1 — November 3, 2018

Transfer function assignment of Analytics_1_Group1

20

About the data

The data describes the change in the frequency of television viewing by the average Indian household from October 2008 to March 2009. We are not given any details of the TV channel, so we do not know its target audience. As seen in the chart below, the weekly rating of the TV channel hovers between about 170 and 330 points and is trending downwards. The rating has stayed below 300 since 27 Jan 2008, suggesting that the programmes were not able to appeal to the channel's audience in India.

We tested the data for autocorrelation and confirmed that there is positive autocorrelation in this dataset (refer to the appendix for more information). Since the ACF does not damp out within 15 to 20 lags, we conclude that the series is likely nonstationary [1].

1

Software

We used Jmp, SPSS Modeler as well as R to run our models.

Data

We tried log-transforming the data before running the time series diagnostics in JMP, but the results were similar to those from the original dataset, so we used the original dataset for all our models.

 

  • Decomposition Method – Multiplicative method

Referring to the equation below, the trend component T reflects the long-term progression of the rating. It is a function of how effective the marketing campaign is, and it indicates how well the TV channel is doing. This trend is perturbed by several other effects which contribute to the viewing behaviour of the audience. For the seasonal factor, TV ratings peak during festive seasons (more people stay at home to watch TV, or there is a big event organised by the TV channel); such events have a quantifiable, cyclic effect on the data and therefore make up the seasonal component, S, of the time series. Finally, what is left is assumed to be caused by random, non-periodic events and is called the irregular (or residual) component, I. Cyclic patterns in television ratings can be explained by environmental factors, such as the weather or the amount of daylight; people watch less television when the weather permits them to be outside and engage in alternative activities (Gould, Johnson, & Chapman, 1984). In this dataset, we observed a downward trend but neither an annual cycle nor any seasonality. Referring to the equation below, we therefore use the linear regression method for our forecasting (refer to the spreadsheet output below for details); the error on our training dataset is 6.49% (MAPE), while on the testing dataset (20 data points) it is about 8.43% (MAPE). A small sketch of this trend fit follows the coefficient table below.

2

 

Intercept 302.8228365
Slope -1.400229918
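A minimal R sketch of this trend regression, assuming `rating` is the vector of weekly ratings; the fitted coefficients should be close to the intercept and slope reported above:

```r
dat <- data.frame(t = seq_along(rating), rating = rating)

trend_fit <- lm(rating ~ t, data = dat)
coef(trend_fit)        # intercept ~ 302.8, slope ~ -1.40

mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100
mape(dat$rating, fitted(trend_fit))
```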

 

Few key assumptions of regression include:

  • There must be a linear relationship between the outcome variable and the independent variables.
  • Multivariate Normality–Multiple regression assumes that the residuals are normally distributed.
  • No Multicollinearity—Multiple regression assumes that the independent variables are not highly correlated with each other.
  • Homoscedasticity–This assumption states that the variance of error terms is similar across the values of the independent variables. [2]

However, we detected positive autocorrelation in the data, so we conclude that the decomposition method is not a good model for prediction, since autocorrelated observations violate the regression assumptions (the residuals are not independent of one another).

 

  • ARIMA

Figure 1

3

The time series shows strong positive autocorrelation: the ACF dies off slowly (we still see some autocorrelation at lag 25). In the PACF plot, there is strong autocorrelation at lag 1, and it cuts off after lag 1 (see Figure 1).

Next we try first- and second-order differencing to convert the data into a stationary time series before testing for p and q (ARIMA).

 

Figures 2a and 2b – first differencing

The intercept becomes insignificant and we also see signs of over-differencing (the ACF plot changes sign from one observation to the next, and lag 1 has a negative spike). Hence, we decide to remove the differencing and try values of p and q instead.

The ACF plot indicates no seasonality in the data (only lags 1 and 7 show strong autocorrelation).

 

Figure 2a

4

 

We tried ARIMA(1,0,0) and achieved an adjusted R-square of 0.76, but lags 2 and 5 become significant and there is a negative spike at lag 1 (refer to the appendix for more information).

Next, we tried ARIMA(0,0,1): the adjusted R-square was only 0.50 and the ACF plot still shows high autocorrelation (refer to the appendix for more information). ARIMA(1,1,1) also gives an insignificant AR term and intercept (see Figure 3).

Figure 2b

5

 

Figure 3

6

 

Our final model – ARIMA(1,0,1) (Figures 4 & 5)

We achieved an adjusted R-square of 0.79 and an MAE of 5.38. We did not split the data into training and testing sets: there are only 20 data points available for validation, which is not sufficient for ARIMA models (at least about 50 data points are needed to fit an ARIMA model).

Final Model Checking

The estimation summary below indicates that both the autoregressive parameter and the moving average parameter are significant. The Ljung-Box statistics indicate no significant autocorrelation in the residuals at any lag, and the residual plot also looks random. Hence, we conclude that the model is a good fit.
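A minimal sketch of this residual check in R, assuming `fit_101` is the ARIMA(1,0,1) model fitted to the ratings series:

```r
res <- residuals(fit_101)

# Ljung-Box test on the residuals; fitdf = p + q = 2 for ARIMA(1,0,1)
Box.test(res, lag = 12, type = "Ljung-Box", fitdf = 2)

# Visual checks: residuals should look random, with ACF spikes inside the bounds
plot(res)
acf(res)
```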

 

Figure 4

7

 

Figure 5

8

  • Time Series Regression

Feature Engineering

The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm (regression) to model. By creating dummy variables, trend variable and weekly fluctuation variable, we aim to learn the underlying inherent functional relationship between inputs and outputs in order to improve the accuracy of the model.

First, we created 5 dummy variables to represent week 1, week 2, week 3, week 4 and week 5 of the month.

And we also created another variable to capture the weekly fluctuation rate of ratings by:

(This week’s rating – Previous Week’s rating) / Previous Week’s ratings

We also created 3 trend variables to capture the downward/upward trend of the data: 5-week, 6-week and 7-week moving averages.
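A minimal R sketch of these engineered features, assuming `rating` is the weekly ratings vector and `week_of_month` gives the week (1-5) of each observation:

```r
library(zoo)

# Week-of-month dummy variables
dummies <- model.matrix(~ factor(week_of_month) - 1)

# Weekly fluctuation rate: (this week - previous week) / previous week
wk_change <- c(NA, diff(rating) / head(rating, -1))

# 5-, 6- and 7-week moving averages as trend variables
ma5 <- rollmean(rating, k = 5, fill = NA, align = "right")
ma6 <- rollmean(rating, k = 6, fill = NA, align = "right")
ma7 <- rollmean(rating, k = 7, fill = NA, align = "right")

features <- data.frame(dummies, wk_change, ma5, ma6, ma7)
```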

10

We split the data into training and testing sets: July 2007 to 26 Oct 2008 for training and Nov 2008 to Mar 2009 for testing. We used the backward regression method in SPSS Modeler: backward regression begins with the full least-squares model containing all p predictors and then iteratively removes the least useful predictor, one at a time.

Referring to Figure 6 below, the significant variables in our time series regression model are the time period, the 5-week trend, the 7-week trend and the differencing variable. Model 6 gives us the best adjusted R-square of 0.8490, the best R-square compared with the decomposition method and ARIMA, but the MAE difference between our training and testing data is more than 10%. The difference may be due to insufficient data points in the testing set: we only have 20 data points for validation, and some of them are similar to the training data (10 out of 20 data points are below a rating of 200).

11121314

 

Testing Data Set                                 Training Data Set

 

  • Exponential smoothing

This method predicts the one next period value based on the past and current value.  It involves averaging of data such that the non-systematic components of each individual case or observation cancel out each other.  The exponential smoothing method is used to predict the short-term predication.  Alpha, Gamma, Phi, and Delta are the parameters that estimate the effect of the time series data.  Alpha is used when seasonality is not present in data.  Gamma is used when a series has a trend in data.  Delta is used when seasonality cycles are present in data.

 

Our Final Model – Simple Exponential Smoothing

Our final model uses the simple exponential smoothing technique in SPSS Modeler, giving us a coefficient of 0.624.

15

Simple exponential smoothing is suitable for forecasting data with no trend and no seasonal component. The smoothing equation is given by

St+1 = St + α(Xt - St)

which can also be written as St+1 = αXt + (1 - α)St

16

where (Xt - St) is the difference between the observed and forecast values, St+1 is the smoothing statistic, and α is the smoothing constant. It can be seen that the new smoothed value is the weighted sum of the current observation and the previous smoothed value: the weight of the most recent observation is α and the weight of the most recent smoothed value is (1 - α).

As a result, St+1 is a weighted average of all past observations and the initial value S0. The weights decrease exponentially depending on the value of α (0.624): more weight is given to the most recent observations and the weights decrease geometrically with each week. The MAE difference between the training and testing datasets is more than 10%. Since this technique is good for short-term predictions and the smoothing parameters are based on the current and previous observations, we do not recommend splitting the data for prediction, given that the channel's rating has been trending downwards since the second quarter of 2008. We applied the exponential smoothing technique to the dataset but still see spikes in the ACF and PACF plots, so we conclude that the model is not a good fit.
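A minimal R sketch of simple exponential smoothing with the reported smoothing constant, assuming `rating_ts` is the ratings series as a ts object:

```r
library(forecast)

# Simple exponential smoothing with the reported smoothing constant alpha = 0.624
ses_fit <- ses(rating_ts, alpha = 0.624, h = 4)   # forecast the next four weeks
summary(ses_fit)

acf(residuals(ses_fit))   # spikes here indicate structure the model has not captured
```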

Conclusion

17

Above is a summary of the error terms and R-square (goodness of fit) of our models, split between training and testing data. We would recommend using the time series regression model for prediction if we had more data points for validation. Both ARIMA and exponential smoothing are useful techniques for forecasting a time series: exponential smoothing relies on the assumption of exponentially decreasing weights for past data but does not account for the autoregressive component (refer to the exponential smoothing model summary above), whereas ARIMA transforms the time series into a stationary series, studies its nature through the ACF and PACF, and then accounts for autoregressive and moving average effects. ARIMA is the best of the four models.

 

Appendix

ACF

 

 

Model Description
Model Name MOD_1
Series Name 1 normalized data
Transformation None
Non-Seasonal Differencing 0
Seasonal Differencing 0
Length of Seasonal Period No periodicity
Maximum Number of Lags 16
Process Assumed for Calculating the Standard Errors of the Autocorrelations Independence(white noise)a
Display and Plot All lags

 

Applying the model specifications from MOD_1
a. Not applicable for calculating the standard errors of the partial autocorrelations.

 

 

Case Processing Summary
  normalized data
Series Length 92
Number of Missing Values User-Missing 0
System-Missing 0
Number of Valid Values 92
Number of Computable First Lags 91

 

 

 

normalized data

 

 

 

Autocorrelations
Series:   normalized data
Lag Autocorrelation Std. Errora Box-Ljung Statistic
Value df Sig.b
1 .876 .103 72.847 1 .000
2 .829 .102 138.832 2 .000
3 .821 .101 204.294 3 .000
4 .774 .101 263.139 4 .000
5 .741 .100 317.682 5 .000
6 .704 .100 367.480 6 .000
7 .644 .099 409.611 7 .000
8 .652 .099 453.323 8 .000
9 .617 .098 493.003 9 .000
10 .579 .097 528.344 10 .000
11 .573 .097 563.380 11 .000
12 .519 .096 592.461 12 .000
13 .489 .096 618.633 13 .000
14 .479 .095 644.117 14 .000
15 .428 .094 664.700 15 .000
16 .412 .094 684.041 16 .000

 

a. The underlying process assumed is independence (white noise).
b. Based on the asymptotic chi-square approximation.

 

18.png

 

 

Partial Autocorrelations
Series:   normalized data
Lag Partial Autocorrelation Std. Error
1 .876 .104
2 .266 .104
3 .255 .104
4 -.029 .104
5 .021 .104
6 -.051 .104
7 -.130 .104
8 .209 .104
9 -.048 .104
10 .013 .104
11 .043 .104
12 -.164 .104
13 .005 .104
14 .011 .104
15 -.046 .104
16 .055 .104

19.png

Reference:

[1] Time Series Analysis and forecasting by Example. Soren Bisgaard and Murat Kulahci

[2] Applied Regression Analysis and Generalized Linear Models

Transfer Function Analysis: Relation between the number of visitors in Singapore’s Zoological Gardens with the weather conditions — October 26, 2018

Transfer Function Analysis: Relation between the number of visitors in Singapore’s Zoological Gardens with the weather conditions

Team Members : Gopesh Dwivedi (A0178338R), Indu Arya (A0178360B), Rubi Saini (A0178255W)

Introduction

The Singapore Zoo, formerly known as the Singapore Zoological Gardens and commonly known locally as the Mandai Zoo, occupies 28 hectares (69 acres) on the margins of Upper Seletar Reservoir within Singapore’s heavily forested central catchment area. The zoo was built at a cost of $9 million granted by the government of Singapore and opened on 27 June 1973. It is operated by Wildlife Reserves Singapore, who also manage the neighboring Night Safari, River Safari and the Jurong Bird Park. There are about 315 species of animal in the zoo, of which some 16 percent are considered to be threatened species. The zoo attracts 1.7 million visitors each year.

Statistical time series models are used to analyze the short-run impact of weather and the long-run impact of climate on visits to the Singapore Zoo. The relationship between key climate parameters and tourism has been researched for over 30 years. While tourists respond to the integrated effects of the atmospheric environment, including thermal, physical and aesthetic aspects, many studies conclude that temperature is the dominant climate variable for tourism. Several studies have attempted to identify the ideal or optimum temperatures for tourism: globally, tourists prefer an average daily temperature of 21°C, ideal temperatures for urban sightseeing in Singapore have been found to be 20-26°C, and higher daily maximum temperatures around 30°C are preferred for beach-based recreation.

Hypothesis

Our hypothesis regarding the Singapore Zoo's visitation is that better weather conditions, which in our case means bright sunny days rather than rainy ones, increase the number of visitors at the Zoo. Thus, some cross-correlation between weather conditions and the Zoo's visitation can be expected.

 A “tourist experience” encompasses three stages: planning, the actual experience, and assessments. Weather and climate affect the tourist experience. The distinction between weather (shorter term) and climate (longer term) is apparent when these tourist experience stages are used. Climate information and forecasts are assessed during tourism planning; weather information supports the actual experience; assessments are combinations of weather and climate information—reconciling discrepancies between expectations and reality of the tourist experience.

At the tourist destination, activity choices are impacted by local weather events. Weather events influence on-site activity choices of tourists. Social factors and different activity choices diversify options for tourists and recreationalists. However, influences of weather on the availability and performance of an activity are still present. For example, if availability of an intended activity is reduced due to the influence of the weather (e.g. lack of snow for skiing), alternative choices are created by businesses to mitigate revenue losses and provide social outlets.

Data and Preprocessing

For this study, we collected temporal monthly data from SingStat. The data contains monthly visitor numbers for the Singapore Zoo from April 1990 onwards. Weather conditions such as the number of bright sunny days and rainfall were also acquired from the SingStat website. Primarily, we consider the number of sunny days as our input series and the Singapore Zoo's monthly visitors as the output series. We want to see whether no-rain scenarios affect tourists' decisions to go for outdoor activities such as visiting the Singapore Zoo.

Modelling

Test of cointegration – We ran a cointegration test for the input and output series and obtained a significant p-value, so we conclude that the series are not cointegrated and proceed to develop a transfer function for them.

g1

Relation plot of input and output series:

g2

 

Test of stationarity (ADF test) – We first want to check whether the series is stationary, so we ran the Augmented Dickey-Fuller test; the result indicates that the series is stationary.

g3

ARIMA:

As our input series is stationary, there is no need for differencing. We ran models for different values of p and q.

g4

We ran multiple models and found that the Seasonal ARIMA(1,0,1)(3,0,1) model gives the best results.

For this model the residuals are white noise: all autocorrelations are insignificant, and all the ACF and PACF values are within the confidence interval.

g5

Residual plot is random, and the selected parameters are significant.

g6

g7

MAPE for this model is insignificant.

g8

We then applied the same model to the Y series and obtained good results: white noise, a random residual plot, and ACF and PACF values within the confidence bounds.

We then apply pre-whitening on this model and proceed to check correlation between input and output series.

g9

Cross Correlation:

We ran the cross-correlation between the input and output series and can see that the output series is correlated with lags of the input series. So we can say that past data of the input series can be used to predict the output (Y) series.

g10

g11

For developing the transfer function, we consider that correlations exist only at 4 lags, 2, 3, 4 and 5, starting from lag 2 and peaking at lag 4. So our transfer function equation will have terms related to the input series only for lags 2, 3, 4 and 5.

So we have d = 2 and s = 4 - 2 = 2.

Transfer Function

We developed two transfer functions: the first with d = 2, s = 2 and r = 2, and the second with d = 2, s = 2 and r = 1.

The transfer function with d = 2, s = 2 and r = 2 gives the best results, so we proceed with transfer function model 1.

g1

g2

g3

 

*standardized(t)= No of visitors in zoo

standardized_Bright_sunshine_daily_mean_hour(t-1) + et = Average Temperature

 

Results and Insights

From the above analysis we can now infer that there is conclusive analytical evidence that bright sunny weather has a strong relationship with the number of people visiting the Singapore Zoological Gardens. In this report, we applied transfer function analysis and observed that the two factors, bright sunshine daily mean hours (the input variable) and the number of people visiting the Singapore Zoological Gardens (the output variable), are related unidirectionally.

References

  • Forecasting Tourist Decisions Regarding Zoo Attendance Using Weather and Climate References David Richard Perkins IV
  • The impact of weather and climate on tourist demand: the case of Chester Zoo
  • The Effect of Seasonal Climatic Anomalies on Zoo Visitation in Toronto (Canada) and the Implications for Projected Climate Change Micah J. Hewer , and William A. Gough
Transfer Function Analysis: Relationship between visitors from the Americas and the luxury hotel booking in Singapore — October 24, 2018

Transfer Function Analysis: Relationship between visitors from the Americas and the luxury hotel booking in Singapore

1. Introduction & Problem Definition

It is believed that visitors from countries with good economy and high standard of living, such as those in Americas (USA and Canada in particular) are more likely to have the economic capability to travel in a more expensive fashion. Hence, it is reasonable to hypothesize that the influx of American visitors may relate to a higher occupancy rate in luxury hotel.

This hypothesis further leads to the objective of the study: to answer and provide evidence to the question of whether there exists a relationship between visitors from the Americas and the luxury hotel booking in Singapore.

2. Data Description

To analyze the relationship between visitors from Americas and the luxury hotel booking in Singapore, data was collected from Department of Statistics Singapore, covering the most recent 5.5 years of data (66 data points for each time series from Jan 2013 to June 2018) under the following two topics:

  1. a) Monthly data of International Visitor Arrivals

Data Source: http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1991

Selected Attribute: AMERICAS – Number of visitors from AMERICAS, including USA, Canada, and other parts of AMERICAS

  1. b) Monthly data of Hotel Statistics

Data Source: http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1991

Selected Attribute: AOR of Luxury Hotels – Includes hotels in the luxury segment, predominantly in prime locations and/or in historical buildings.

(* AOR = [Gross lettings (Room Nights) / Available room nights] x 100)

3. Transfer Function Analysis in JMP

The initial assumption for the two time series is that the number of American visitors (AMERICAS) leads luxury hotel bookings (LUXURY). Hence, AMERICAS is the independent (input) series while LUXURY is the dependent (output) series. Several ARIMA models were tested to find the best-performing one; the final model is a Seasonal ARIMA (1,1,1)(0,1,0)[12].

1

From the prewhitening plot, it can be observed that the relationship between input and output series is not unidirectional, with peaks ranging from the negative lags to positives. This indicates that the two series affect each other in a bi-directional way. Since transfer function is only applicable to situations where the relationship is unidirectional, other techniques will be explored for the further analysis.

4. Bi-directional Correlation Analysis in GRETL

Since a bi-directional correlation is found between LUXURY and AMERICAS, the vector autoregression (VAR) model is a useful starting point for analysing the inter-relationships between the two time series. The series must be stationary in order to estimate a VAR model, so an ADF test is performed first to check whether the two series are stationary and suitable for a VAR(p) model. If they are not stationary, differencing or a cointegration test is performed and a VAR(p) or VEC(p) model is estimated instead.
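The report runs these tests in GRETL. As a rough Python equivalent (a sketch only, reusing the hypothetical visitors_hotels.csv file and column names from the earlier sketch), the ADF test with constant and trend could be run as:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("visitors_hotels.csv", parse_dates=["month"], index_col="month")  # hypothetical file

# ADF test with constant and trend ("ct"), mirroring the GRETL setting described below
for name in ["LUXURY", "AMERICAS"]:
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(df[name], regression="ct")
    print(name, "ADF statistic:", round(stat, 3), "p-value:", round(pvalue, 4))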

26-e1540391255997.png

VAR lag selection is performed with a maximum lag of 10. According to all three criteria (AIC, BIC and HQC), the best lag order for the analysis is 1.

Test unit root

The null hypothesis of the unit root test is that the time series is non-stationary (i.e., it contains a unit root, a = 1). According to the unit root test results shown above, both p-values are significant, which indicates that the null hypothesis can be rejected and that the two series are stationary around a constant and trend. Hence, a VAR(1) model can be estimated.

Estimate VAR model

A VAR(1) model is estimated with a constant and trend. Seasonal dummy variables are also incorporated in the model, as the time series plot shows a clear seasonal cycle. Two essential features of the VAR framework, causality analysis and impulse response analysis, are then used to analyse the two series.
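The VAR itself is estimated in GRETL. A hedged sketch of an equivalent model in Python, using the same hypothetical file and column names as above and with January as the omitted base month for the seasonal dummies:

import pandas as pd
from statsmodels.tsa.api import VAR

df = pd.read_csv("visitors_hotels.csv", parse_dates=["month"], index_col="month")  # hypothetical file
endog = df[["LUXURY", "AMERICAS"]]

# 11 monthly seasonal dummies (analogous to S1 to S11 in the GRETL output below)
seasonal = pd.get_dummies(endog.index.month, prefix="S", drop_first=True).astype(float)
seasonal.index = endog.index

# VAR(1) with constant and trend; lag order 1 as chosen by AIC/BIC/HQC above
results = VAR(endog, exog=seasonal).fit(1, trend="ct")
print(results.summary())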

53.png

VAR model

In the output below, S1 to S11 are the monthly seasonal dummy variables. Both equations are significant, and the majority of the variables are significant as well. The F-tests of zero restrictions under each equation provide the information needed for the causality analysis.

Granger Causality Test

The Granger causality test provides evidence on whether variable x causes variable y or vice versa, which is the question of interest in a bi-directional time series analysis. The F-tests of zero restrictions under each equation show that:

For equation 1, the F-test results for "all lags of Luxury" and "all lags of AMERICAS" are both significant, which indicates a bidirectional causality between Luxury and AMERICAS in equation 1.

For equation 2, the F-test result for "all lags of Luxury" is not significant while that for "all lags of AMERICAS" is significant, indicating a one-way causality in equation 2.

With this information, a conclusion can be drawn that Americas and Luxury affect each other in a bidirectional way.
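For completeness, the same Granger causality checks could be sketched directly on the fitted VAR from the hypothetical Python sketch above (results is the VARResults object assumed there):

# Does AMERICAS Granger-cause LUXURY, and vice versa?
print(results.test_causality("LUXURY", ["AMERICAS"], kind="f").summary())
print(results.test_causality("AMERICAS", ["LUXURY"], kind="f").summary())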

Impulse Response Analysis

An exogenous shock to one variable not only directly affects that variable but is also transmitted to the other endogenous variable through the dynamic (lag) structure of the VAR. Impulse response analysis traces the effect of a one-standard-deviation shock to one of the innovations on the current and future values of the endogenous variables. The four charts below show the impact of a shock to Luxury or AMERICAS on Luxury and AMERICAS, respectively.
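A sketch of the corresponding step in Python, again using the hypothetical results object from the VAR sketch above:

import matplotlib.pyplot as plt

# Orthogonalised impulse responses over a 12-month horizon
irf = results.irf(12)
irf.plot(orth=True)
plt.show()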

93-e1540391212181.png

  • A shock to Luxury immediately moves Luxury to a higher level; Luxury then returns monotonically to its initial level after about 4 months.
  • A shock to Luxury gradually pushes AMERICAS down, reaching its lowest point in the second month; AMERICAS then slowly and monotonically returns to its initial level after about 10 months.
  • A shock to AMERICAS immediately pushes Luxury up to a moderate level; Luxury then slowly returns to its normal level after about 10 months.
  • Similar to the effect of Luxury on itself, a shock to AMERICAS immediately moves AMERICAS to a higher level, after which AMERICAS returns to its initial level.

The analysis shows that visitors from the Americas and luxury hotel bookings in Singapore interact with each other, and that the individual reactions of Luxury and AMERICAS to shocks, whether to themselves or to the other series, are quite different.

5. Conclusion

To investigate the relationship between visitors from the Americas and luxury hotel bookings in Singapore, several analytical approaches were applied to the acquired dataset. The transfer function analysis in JMP indicates a bidirectional relationship between the two series, and the Granger causality test supports this inference. Finally, the impulse response analysis traces the effect of various shocks to Luxury and AMERICAS. The same approach can be applied to similar studies of other hotel types versus visitors from other regions to identify region-specific reaction patterns. Decision makers in the tourism industry can apply the insights gained from these patterns to market segment planning and other related issues.

 

 

Submitted by: Red

Gao Yuan (A0178404A); Wang Yijing (A0178354W);

Wang Yuhui (A0179193U); Zhang Xiaoman (A0178489A);

Hu Mengxi (A0178561W)

RELATIONSHIP BETWEEN AIR PASSENGER ARRIVALS AND THE VALUE OF FOOD AND BEVERAGE SALES IN SINGAPORE —

RELATIONSHIP BETWEEN AIR PASSENGER ARRIVALS AND THE VALUE OF FOOD AND BEVERAGE SALES IN SINGAPORE

Submitted by: Aastha Arora (A0178188L), Aishwarya Bose (A0178277M), Chetna Gupta (A0178260A), Madeleine Dy (A0178427U), Misha Singh (A0178309W), Zaira Hossain (A0178331E)

 

Problem/Hypothesis of the Two Time Series:

Is there a relationship between Air Passenger Arrivals and value of Food & Beverage Sales in Singapore?

Data Exploration and Pre-Processing:

SERIES 1: AIR PASSENGER ARRIVALS IN SINGAPORE

Picture1

Source: https://data.gov.sg/dataset/air-passenger-arrivals-total-by-region-and-selected-country-of-embarkation

Notes:

  • The raw dataset contains Air passenger arrivals by country and regions estimated monthly.
  • Information has been obtained from third party sources.
  • Refers to Changi Airport only.
  • Data excludes transit passengers who continued their journey on the same flight.

 

SERIES 2: FOOD & BEVERAGE SALES IN SINGAPORE

Picture2

Source: https://data.gov.sg/dataset/value-of-food-beverage-sales-based-on-2014-100-index-estimated-monthly

Notes:

  • The raw dataset contains value of food and beverage sales index estimated monthly.
  • The Food & Beverage Sales Value is compiled from the results of the Monthly Survey of Food & Beverage Services.
  • The objective of the survey is to measure the short-term performance of food & beverage (F&B) services industries based on the sales records of F&B services establishments.
  • The Food & Beverage Sales Value is based on the Food & Beverage Services Index (FSI).

Number of Records:

Overall air passenger arrivals and F&B sales value were combined into a single dataset of 200 monthly records from Jan 1997 to Aug 2013 (about 16 years and 8 months of data), which was used for the analysis.

Outliers:

A dip in arrivals and F&B sales value was seen in April, May and June 2003, which can be attributed to the SARS outbreak in Singapore. These months were therefore treated as outliers, and the values were replaced with a 3-month moving average.

 

Initial Analysis using Python:

We plotted both variables for visual inspection. Both time series are clearly trending, which suggests that each may have a unit root.

Picture4

We confirmed this using the augmented Dickey-Fuller (ADF) test.

The Dickey Fuller Test produced the following output for Arrival:

(1.5485691015907959, 0.9976947744107255, 15, 184, 
{'1%': -3.466398230774071, '5%': -2.8773796387256514, 
'10%': -2.575213838610586}, 4548.903850689798)

The Dickey Fuller Test produced the following output for Food Value:

(1.5508415283231438, 0.9977023393139994, 13, 186, 
{'1%': -3.466005071659723, '5%': -2.8772078537639385, 
'10%': -2.5751221620996647}, 1478.034269903061)
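The tuples above are the raw return value of statsmodels' adfuller. A sketch of how they were presumably produced and how to read them (the file name and the column names Arrival and FoodValue are hypothetical, and the same df is reused in the later sketches):

import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("arrivals_fnb.csv", parse_dates=["month"], index_col="month")  # hypothetical file

for col in ["Arrival", "FoodValue"]:
    stat, pvalue, usedlag, nobs, crit, icbest = adfuller(df[col])
    # stat is well above every critical value and pvalue is close to 1,
    # so the unit-root null cannot be rejected for either series
    print(col, "ADF stat:", round(stat, 2), "p-value:", round(pvalue, 4), "critical values:", crit)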

The p-values are close to 1, so we cannot reject the null hypothesis that these series have a unit root; both series should therefore be differenced as a first step.

To test whether the ‘Arrival’ is caused by the ‘Food Value’, we applied the Granger Causality Test.
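The output below is the raw result of statsmodels' grangercausalitytests. A hedged sketch of the call, reusing the hypothetical df from the sketch above (note that the function tests whether the series in the second column Granger-causes the series in the first column):

from statsmodels.tsa.stattools import grangercausalitytests

# Does FoodValue Granger-cause Arrival?
grangercausalitytests(df[["Arrival", "FoodValue"]], maxlag=1)

# Reverse direction: does Arrival Granger-cause FoodValue?
grangercausalitytests(df[["FoodValue", "Arrival"]], maxlag=1)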

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=50.6180 , p=0.0000  , df_denom=196, df_num=1
ssr based chi2 test:   chi2=51.3927 , p=0.0000  , df=1
likelihood ratio test: chi2=45.7154 , p=0.0000  , df=1
parameter F test:         F=50.6180 , p=0.0000  , df_denom=196, df_num=1
{1: ({'ssr_ftest': (50.61796959284111, 2.059081651775858e-11, 196.0, 1), 
'ssr_chi2test': (51.39273443354787, 7.56178806704204e-13, 1), 
'lrtest': (45.71543384976394, 1.3674112550498519e-11, 1), 
'params_ftest': (50.617969592841, 2.0590816517759726e-11, 196.0, 1)}, 
[<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
0x000001A53F406F28>, 
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
0x000001A540F8D518>, array([[0., 1., 0.]])])}

We can say that Arrival is Granger-caused by Food Value because the p-value is significant (less than 0.05). We then reversed the test to see whether Food Value is caused by Arrival, and found that it is not, as shown in the output below.

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.9754  , p=0.1615  , df_denom=196, df_num=1
ssr based chi2 test:   chi2=2.0056  , p=0.1567  , df=1
likelihood ratio test: chi2=1.9956  , p=0.1578  , df=1
parameter F test:         F=1.9754  , p=0.1615  , df_denom=196, df_num=1
{1:({'ssr_ftest': (1.9754074370715489, 0.16145817984806168, 196.0, 1),
'ssr_chi2test': (2.005643265189991, 0.15671480384086686, 1), 
'lrtest': (1.9956036184489676, 0.15775620297932386, 1), 
'params_ftest': (1.9754074371779515, 0.16145817983682556, 196.0, 1)}, 
[<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
0x000001A545511BE0>, 
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
0x000001A545489CC0>, array([[0., 1., 0.]])])}

 

Analysis using JMP:

First, we checked the PACF graph of the input series (Food Value). It is clear that the data needs to be differenced before fitting the input-series ARIMA for pre-whitening:

Picture5

We then used the ARIMA model group function in JMP to find the best combination, which turned out to be a seasonal ARIMA (2,1,0)(0,1,1)12 for the input series. We then pre-whitened the input series and obtained the following result:

Picture6

Since we did not get any significant lags, we decided to log-transform both series. After the log transformation, we fitted an ARIMA model using the steps above, pre-whitened the series, and obtained the following output:

Picture8

We found significant lags on both sides of the plot. In this case a transfer function will not give reliable coefficient estimates, so we concluded that this is a cointegration problem and moved the analysis to GRETL.

 

Analysis using GRETL:

Plotting both variables again in GRETL, we can clearly see that both time series are trending, consistent with each having a unit root.

Picture8

Both series trend upward. It is possible that they follow a long-run equilibrium relationship that they tend to return to over time. We performed the Engle-Granger test for cointegration to find out.

Granger Causality:
Given two time series, x and y, Granger causality is a method that attempts to determine whether one series is likely to influence change in the other. This is accomplished by taking lags of one series and using them to model the change in the second series. We build two models that predict y: one using only past values of y (Ω), and one using past values of both y and x (π). The models are given below, where k is the number of lags in the time series:

Ω:  y_t = β_0 + β_1·y_(t-1) + … + β_k·y_(t-k) + e_t
π:  y_t = β_0 + β_1·y_(t-1) + … + β_k·y_(t-k) + α_1·x_(t-1) + … + α_k·x_(t-k) + e_t

The residual sums of squares are then compared, and a test is used to determine whether the nested model (Ω) is adequate to explain the future values of y or whether the full model (π) is better. An F-test, t-test or Wald test (as used in R) is calculated to test the following null and alternative hypotheses:

H0: α_i = 0 for every i in [1, k]
H1: α_i ≠ 0 for at least one i in [1, k]

Essentially, we are trying to determine whether x provides statistically more information about future values of y than past values of y alone. Under this definition we are not trying to prove actual causation, only that the two series are related by some phenomenon. Along those lines, we must also run the model in reverse to verify that y does not provide information about future values of x. If it does, it is likely that there is some exogenous variable, z, which needs to be controlled for or which could be a better candidate for Granger causation.

Steps in Engle Granger Test:

Step 1: Determine ‘d’ in I(d) for ‘Log of Arrival’ using ADF Unit Root Test.
H0: Level series contains a unit root.
HA: Level series does not contain a unit root.

We have taken a maximum lag order of 6, the cube root of the number of data points (200^(1/3) ≈ 5.8, rounded to 6).

We select ‘constant and trend’ because we have seen from the plot that the series has an upward trend.

Picture9

The p-value is large, so we fail to reject the null hypothesis. This means the series needs to be differenced to make it stationary, so d = 1.

Step 2: Determine ‘d’ in I(d) for ‘Log of Food Value’ using ADF Unit Root Test.
H0: Level series contains a unit root.
HA: Level series does not contain a unit root.

We have taken a maximum lag order of 6 by taking the cube root of the number of data points (200).

We select ‘constant and trend’ because we have seen from the plot that the series has an upward trend.

Picture10

The p-value is large, so we fail to reject the null hypothesis. This means the series needs to be differenced to make it stationary, so d = 1.

Step 3: Estimate the cointegrating regression: Y_t = β_1 + β_2·X_t + ε_t
We estimated the cointegrating regression using 'Log of Arrival' as the dependent variable. Both variables are integrated of the same order. We set the lag order to 6, as in Step 1 and Step 2.

Picture11

Step 4: Determine ‘d’ in I(d) for Ɛt
H0: Unit root (i.e., not cointegrated)
HA: No unit root (i.e., cointegrated)

Since our p-value is small, we reject the null hypothesis at the 5% level of significance and conclude that Food Value is cointegrated with Arrival. This in turn means that the series can be written as an error correction model.
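GRETL automates these four steps. Purely as a sketch of an equivalent check in Python, statsmodels provides an augmented Engle-Granger test (coint); the file and column names below are hypothetical, as in the earlier sketches:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import coint

df = pd.read_csv("arrivals_fnb.csv", parse_dates=["month"], index_col="month")  # hypothetical file
log_arrival = np.log(df["Arrival"])
log_food = np.log(df["FoodValue"])

# Engle-Granger test: regress log_arrival on log_food with constant and trend,
# then test the residuals for a unit root (lag order up to 6, as above)
t_stat, pvalue, crit = coint(log_arrival, log_food, trend="ct", maxlag=6)
print("Engle-Granger statistic:", round(t_stat, 3), "p-value:", round(pvalue, 4))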

 

Error Correction model using GRETL:

Equation:
ΔY_t = φ_1·ΔX_(t-1) + φ_2·ΔY_(t-1) − γ·{Y_(t-1) − β̂_1 − β̂_2·X_(t-1)} + ω_t

The lagged residual from the cointegrating regression is the term within the curly braces above. The coefficient γ is the speed of adjustment; if it is not statistically significant, the dependent variable is weakly exogenous.

Before estimating the error correction model, we estimated the cointegrating regression and saved the residuals with name e.

Picture12

Then we created two new series, the first differences of 'log of arrival' and 'log of food value'.

We then estimated the error correction model, selecting the following lags:

Picture13

The variable we are interested in is e. At least one of the variables must not be weakly exogenous if the series are cointegrated. We can see that Log of Arrival is not weakly exogenous. This means that the arrival series moves to restore the equilibrium when the system is out of balance, whereas Log of Food Value does not.
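The same error correction regression can be sketched in Python with ordinary least squares, continuing the hypothetical log series from the cointegration sketch above:

import pandas as pd
import statsmodels.api as sm

# Residuals e_t from the cointegrating regression of log_arrival on log_food
coint_reg = sm.OLS(log_arrival, sm.add_constant(log_food)).fit()
e = coint_reg.resid

# Error correction model: regress d(log_arrival) on lagged differences and the lagged residual
ecm_data = pd.DataFrame({
    "d_log_arrival": log_arrival.diff(),
    "d_log_food_1": log_food.diff().shift(1),
    "d_log_arrival_1": log_arrival.diff().shift(1),
    "e_1": e.shift(1),
}).dropna()

ecm = sm.OLS(ecm_data["d_log_arrival"],
             ecm_data[["d_log_food_1", "d_log_arrival_1", "e_1"]]).fit()
print(ecm.summary())  # a significant e_1 coefficient means log_arrival adjusts back towards equilibrium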

Picture14

Conclusion:

We can conclude that air passenger arrivals are driven by food and beverage sales value, as opposed to our initial assumption that food and beverage sales value depends on air passenger arrivals. In other words, Food Value Granger-causes Arrivals.

The relationship between the Average Deal Price of Shanghai License Plate and the Number of Bidders —

The relationship between the Average Deal Price of Shanghai License Plate and the Number of Bidders

1 Hypothesis

An increase in the average deal price of Shanghai license plates will increase the number of bidders.

2 Data Exploration

2.1        Background

Over the last two decades, automobile ownership and use in China have increased, which raises energy consumption, worsens air pollution and exacerbates congestion. The Shanghai government has therefore adopted an auction system to limit the number of license plates issued each month.

The dataset contains monthly auction data from Jan 2002 to Oct 2017. Feb 2008 data is missing, and the number of records is 189 [1].

2.2        Metadata

Field | Description | Data Type
Date | Jan 2002 to Oct 2017 (Feb 2008 is missing) | Date
avg_deal_price | The average deal price of a Shanghai license plate | Numeric
lowest_deal_price | The lowest deal price of a Shanghai license plate | Numeric
num_bidder | Number of citizens who participated in the auction that month | Numeric
num_plates | Number of plates issued by the government that month | Numeric

Table 1 Metadata

Only variables “avg_deal_price” and “num_bidder” will be analysed in this report.

 

3          Transfer Function Modelling

Assuming that the relationship between "avg_deal_price" and "num_bidder" is unidirectional, JMP is used for the analysis. Since the hypothesis is that an increase in the average deal price of Shanghai license plates will increase the number of bidders, "avg_deal_price" is taken as the input and "num_bidder" as the output.

1

Figure 1 Time series plot of input and output variables

Figure 1 shows that neither the input nor the output series is stationary, so order-1 differencing is applied to both.

2

Figure 2 Plot of Input with order 1 differencing

3

Figure 3 Plot of output with order 1 differencing

 

After differencing, both the input and output series become stationary.

4

Figure 4 Autocorrelation and partial autocorrelation graphs

Since no seasonality is observed, ARIMA rather than seasonal ARIMA is applied to both input and output. Figure 4 shows a distinct drop in the partial autocorrelation function (PACF) after lag 3, so an AR(3) model is fitted first.

5

Figure 5 AR(3) Model for input

However, Figure 5 shows that the Prob>|t| values for the AR2 and AR3 terms are large (not significant). Therefore, an AR(1) model is adopted for both input and output instead of AR(3).
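The AR order choice can also be checked outside JMP; a sketch with statsmodels, using a hypothetical CSV file name and the column name avg_deal_price from the metadata table:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("shanghai_plates.csv", parse_dates=["Date"], index_col="Date")  # hypothetical file

# Compare ARIMA(3,1,0) and ARIMA(1,1,0) for the input series avg_deal_price
for order in [(3, 1, 0), (1, 1, 0)]:
    fit = ARIMA(df["avg_deal_price"], order=order).fit()
    print(order, "AIC:", round(fit.aic, 1))
    print(fit.summary().tables[1])  # parameter estimates and p-values, cf. Figures 5 and 6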

6

Figure 6 AR(1) Model for input

7

Figure 7 Residuals of AR(1) model for input

8

Figure 8 AR(1) Model for output

9

Figure 9 Residuals of AR(1) model for output

Figure 7 and Figure 9 show that the residuals of AR(1) models for both input and output are random.

Next, in order to remove the autocorrelation, pre-whitening is performed on the input variable.

10

Figure 10 Pre-whitening plot for dataset from Jan 2002 to Oct 2017

Figure 10 shows spikes at both negative lags (lags -14 and -13) and positive lags (lags 1 and 2). This means the cross-correlation is bi-directional, so a cointegration approach needs to be applied.

 

4          Bi-Directional Cross-Correlation Analysis

Since the cross-correlation is bi-directional, cointegration analysis with GRETL is performed.

First, a time series plot of both variables is shown in Figure 11. Both variables increase gradually from 2002 to 2014. From 2014 to 2016, however, there is a large increase in "num_bidder", while "avg_deal_price" continues to increase at a rate similar to previous years.

11

Figure 11 Time series plot of both variables from Jan 2002 to Oct 2017

Then the Engle-Granger test for cointegration is performed; the steps are:

Step 1: Determine 'd' in I(d) for the first series, where d is the order of integration, i.e. the number of times the series has to be differenced to make it stationary. The ADF unit root test is used to determine d.

Step 2: Determine ‘d’ in I(d) for the second series

Step 3: Estimate the cointegrating regression: Y_t = β_1 + β_2·X_t + ε_t

Step 4: Determine ‘d’ in I(d) for εt

                                H0: Unit root (i.e., not cointegrated)

HA: No unit root (i.e., cointegrated)

The ADF unit root test is then performed on the residuals from the cointegrating regression to determine their order of integration [2].

 

4.1        Step 1: Determine ‘d’ in I(d) for the first series

The augmented Dickey-Fuller (ADF) test is performed in Step 1 for the series "avg_deal_price". A lag order of 6 is used, the rounded-up cube root of the data size of 189.

12

Figure 12 ADF test for series “avg_deal_price”

13

Figure 13 ADF test result for series “avg_deal_price”

Figure 13 shows that the p-value is 0.5498, which is large, hence we fail to reject the null hypothesis. Therefore, the series needs to be differenced to make it stationary: d = 1 for "avg_deal_price".

 

4.2        Step 2: Determine ‘d’ in I(d) for the second series

The same test is done for the series "num_bidder". Figure 14 shows that the p-value is 0.9175, which is large, hence we fail to reject the null hypothesis. Therefore, the series needs to be differenced to make it stationary: d = 1 for "num_bidder".

14

Figure 14 ADF test for series “num_bidder”

 

4.3        Step 3 and Step 4

The Engle-Granger cointegration test is performed, taking "avg_deal_price" as the independent variable and "num_bidder" as the dependent variable. Figure 15 shows that the p-value is 0.2498, which is not small, hence we fail to reject the null hypothesis at the 5% level of significance. This means there is no cointegrating relationship between the two variables.

15

Figure 15 Engle-Granger cointegration test result for the dataset from Jan 2002 to Oct 2017

 

5          Further Exploration

Figure 11 shows a large increase in "num_bidder" after 2014 while "avg_deal_price" continues to increase only gradually; this may be why the relationship between the two variables over the full period is neither unidirectional nor bi-directional. Therefore, the data from Jan 2002 to Dec 2013 is analysed instead.

16

Figure 16 Time series plot of both variables from Jan 2002 to Dec 2013

First, JMP is used to perform the unidirectional analysis between the two variables as described in Section 3. The result in Figure 17 shows that the cross-correlation is bi-directional, so GRETL is used for the bi-directional cross-correlation analysis.

17

Figure 17 Pre-whitening plot for dataset from Jan 2002 to Dec 2013

Since the number of records is 143, a lag order of 6 (the rounded-up cube root of 143) is again used for the ADF test. Following the same steps as in Section 4, the Engle-Granger cointegration test result is shown in Figure 18.

18

Figure 18 Engle-Granger cointegration test result for the dataset from Jan 2002 to Dec 2013

The result shows that the p-value is 4.709e-13, which is very small, so we reject the null hypothesis at the 5% level of significance. This means there is a cointegrating relationship between the two variables and the series can be written as an error correction model.

Eqn1

Equation 1 Error correction model equation

Equation 1 shows the error correction model. The lagged residual from the cointegrating regression appears within the brackets and captures last period's departure from the long-run attractor. The coefficient γ is the speed of adjustment; if it is not statistically significant, the dependent variable is weakly exogenous. Before estimating the error correction model, the cointegrating regression is estimated and its residuals are saved.

19

Figure 19 Ordinary least squares method result

Ordinary least squares is used for the cointegrating regression and the residuals are saved. Two new series, "d_avg_deal_price" and "d_num_bidder", are then created as the first differences of "avg_deal_price" and "num_bidder", as shown in Figure 20. The next step is to estimate the error correction model.

20

Figure 20 New variables are created

21

Figure 21 Adding lags

22

Figure 22 Ordinary least squares method with new variables lagged by 1

To estimate the error correction model, ordinary least squares is applied as shown in Figure 22. The variables are lagged by 1 and the constant ("const") is removed.

23

Figure 23 Ordinary least squares method result with new variables

The p-value of the variable e_1 is 8.81e-10, which is statistically significant. This means the dependent variable "num_bidder" is not weakly exogenous and moves to restore the equilibrium when the system is out of balance.

 

6          Conclusion

From 2014 to 2016 there is a large increase in the number of bidders, but the average deal price of Shanghai license plates increases at a rate similar to previous years. This may be why the relationship between the average deal price and the number of bidders is neither unidirectional nor bi-directional over the full period from Jan 2002 to Oct 2017.

However, from Jan 2002 to Dec 2013 there is a bi-directional cross-correlation between the average deal price of Shanghai license plates and the number of bidders. The number of bidders is not weakly exogenous and moves to restore the equilibrium when the system is out of balance.

 

7          References

[1] “Shanghai license plate bidding price prediction,” 2017. [Online]. Available: https://www.kaggle.com/bazingasu/shanghai-license-plate-bidding-price-prediction.
[2] Janelle, “Cointegration (Video 7 of 7 in the gretl Instructional Video Series),” 17 Feb 2015. [Online]. Available: https://www.youtube.com/watch?v=bJgx3JLb7fI&t=311s.

 

Relationship between the GDP and CPI in India – by Diksha Jha, Dinesh Nagarajan, Pradeep Rajagopalan, Praveen Ponnuchamy, Sravani Satpathy —

Relationship between the GDP and CPI in India – by Diksha Jha, Dinesh Nagarajan, Pradeep Rajagopalan, Praveen Ponnuchamy, Sravani Satpathy

Introduction

A transfer function model describes a unidirectional relationship between an input and an output series.

Problem/Hypothesis of the time series data:

To analyze the relationship between the Consumer Price Index (CPI) and the Gross Domestic Product (GDP) of the Indian economy.

Number of records in the dataset:

The dataset contains the CPI and GDP history of India for the past 60 years (one entry per year).

Data source: https://www.data.gov.in/

Transfer Function:

On initial visual analysis, we could see that the input and output variables are on different scales: GDP is in lakhs of rupees while CPI is a percentage index. We therefore transformed the dependent variable (GDP) with log10.

Capture1

The relationship between the input and output is mostly linear, with some discrepancy due to white noise and autocorrelation in the output variable. In business terms, the Consumer Price Index and GDP of India are positively correlated and interact strongly, which can be confirmed by the scatterplot matrix below:

Capture2.PNG

With the above understanding we continue to the transfer function analysis for the above dataset.

With the initial time series basic diagnostics, we see that for the output variable the ACF and PACF are both tailing off, so an ARIMA-type model is appropriate. The data already appears stationary, so no differencing was applied. A similar pattern is seen for the input series (CPI).

Capture3

In the pre-whitening step, we can see that the correlation becomes significant after lag 14, i.e. after nearly 15 years. In the Indian context, the budget is revised every 5 years, and due to the slow growth of the economy the impact of CPI on GDP appears only after about 15 years, remaining prominent over the following 5 to 7 years, as seen in the graph below.

Capture4

For the transfer function, we expect B15 and nearby lag terms to be prominent. The above analysis resulted in the transfer function below, in which the relationship between the output and the lagged input is clearly visible, confirming the delayed impact of CPI on India's GDP after about 15 years.

Capture5

The pre-whitening process gave us a (2,0,2) model; the peak in the pre-whitening lag plot is at lag 19 while the significant values start from lag 15, so s1 is set to 4. The transfer function model summary is shown below.

Capture6

The fitted transfer function shows that current GDP is impacted by five CPI observations that are nearly 15 years old. This also suggests that the 5-year budgeting cycle has a prominent impact on GDP.
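As a rough, hedged illustration of this delayed dependence (a simple distributed-lag regression, not the JMP transfer function itself), the idea could be sketched in Python; the file and column names below are hypothetical:

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("india_gdp_cpi.csv", index_col="Year")  # hypothetical file with columns GDP and CPI
y = np.log10(df["GDP"])  # log10 transform, as used above

# Regress log10(GDP) on its own recent lags and on CPI lagged 15 to 19 years,
# the delay suggested by the pre-whitening plot
X = pd.DataFrame({f"GDP_lag{k}": y.shift(k) for k in (1, 2, 3)})
for k in range(15, 20):
    X[f"CPI_lag{k}"] = df["CPI"].shift(k)

model = sm.OLS(y, sm.add_constant(X), missing="drop").fit()
print(model.summary())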

Conclusion

From the above analysis we conclude that the GDP estimate as a function of the input variable (CPI) is as given below:

Capture7

This shows that current GDP depends on the previous 3 years of GDP and on 6 years of CPI values (the independent variable) from about 15 years earlier. We also notice that, as the lag increases, the older years have progressively smaller impact.

 

Relationship between Maruti Suzuki & Tata Steel Stock Prices on National Stock Exchange using Transfer Function — October 22, 2018

Relationship between Maruti Suzuki & Tata Steel Stock Prices on National Stock Exchange using Transfer Function

Team Members: Apurv Garg, Bhabesh Senapati, Daksh Gupta, Dibyajyoti Panda, Rajiv Hemanth, Tran Minh Duc

Reason for Study

The auto industry uses a tremendous variety of materials to build cars, and most of the weight comes from steel. A typical modern car weighs around 3,000 pounds, of which roughly 2,400 pounds is steel, used in the chassis, door beams, roof, body panels, exhaust, and so on.

Tata Steel is the largest steel producer in India, while Maruti Suzuki is one of the most widely purchased automobile brands. Hence, there may be a relationship between the stock prices of these two giants.


Time Series Analysis

1. Data Source

Time series plot

Figure 1: Time Series Plot

2. Data Transformation

A log transformation was applied to the stock prices of both companies, as it reduces variance and helps in identifying time series patterns.

3. Model Building

For the first iteration, the Y (dependent) variable is the Tata Steel stock price and the X (independent) variable is the Maruti Suzuki stock price.

3.1 Test for Stationarity

Before building a time series model it is important for the series to be stationary; if it is not, it should be made stationary by differencing. The augmented Dickey-Fuller test is used to test the stationarity of a series.

The augmented Dickey-Fuller test has the following hypotheses:

H0: Level series contains a unit root

H1: Level series does not contain a unit root

Depending on the significance of the p-value we reject or fail to reject the null hypothesis: if we fail to reject it, the series is non-stationary; if we reject it, the series is stationary.

Both series were tested for stationarity and in both cases we failed to reject the null hypothesis (insignificant p-values). Hence both series were non-stationary and were differenced once to make them stationary.

3.2 Fitting ARIMA Model

The input series, i.e. the Maruti stock price series, is fitted with the best combination of ARIMA parameters. The value of I = 1 was determined in the step above.

ARIMA plot

Illustration 1: The seasonal ARIMA (2,1,2)(0,0,2) model was finally selected as the best fit, with all variables significant and a MAPE of 1.07.

arima par

Illustration 2: Parameter Estimates

3.3 Pre-whitening of Input & Output Series

Both series were then pre-whitened using the following parameter values:

p = 2  |  I = 1  |  q = 2  | P = 0  |  Is = 0  |  Q = 2

Cross correlation plot

Illustration 3: Cross-correlation plot between Tata Steel and Maruti Suzuki stock prices.

After pre-whitening, the cross-correlation plot was analysed to determine the transfer function parameters. However, the plot shows spikes at both negative and positive lags, indicating a case of cointegration in which the two series are correlated with each other, as seen in Illustration 3.

 

3.4 Engle-Granger Test for Cointegration Regression

The p-value for the Engle-Granger test is much closer to significance for the case Y = Log_TISC and X = Log_MRTI (p = 0.09) than for the case Y = Log_MRTI and X = Log_TISC.

Hence, the dependent variable was finally selected as the Tata Steel price and the independent variable as the Maruti Suzuki price.

 

3.5 Error Correction Model

Since this is now a cointegration problem, an error correction model is estimated on the differenced series (D = 1, to make the series stationary). The output below shows the resulting ordinary least squares model:

Model: OLS, using observations 2003:10-2018:09 (T = 180)

Dependent variable: d_Log_TISC 

coefficient   std. error   t-ratio   p-value

—————————————————————————-

d_Log_MRTI_1   −0.134663     0.115003     −1.171    0.2432

e_1            −0.0925064    0.0280208    −3.301    0.0012  ***

d_Log_TISC_1    0.194640     0.0854004     2.279    0.0239  **

——————————————————————————

 

 Mean dependent var   0.007540   S.D. dependent var   0.140167

Sum squared resid    3.274077   S.E. of regression   0.136006

Uncentered R-squared 0.071713   Centered R-squared   0.069012

F(3, 177)            4.557955   P-value(F)           0.004207

Log-likelihood       105.2140   Akaike criterion    −204.4279

Schwarz criterion   −194.8490   Hannan-Quinn        −200.5441

rho                 −0.017859   Durbin-Watson        2.022787

 

The p-value was highest for variable 10 (d_Log_MRTI_1).

3.6 Final Equation

ΔY(t) = 0.194·ΔY(t-1) − 0.134·ΔX(t-1) − 0.092·E (a small worked example follows the definitions below)

Where:

Y(t) –>  Log value of Tata Steel stock price at time t.

Y(t-1) –> Log value of Tata Steel stock price at time t-1

X(t-1) –> Log value of Maruti Suzuki stock price at time t-1

E –> Lagged residual (error correction term, e_1) from the cointegrating regression
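As an illustration only (made-up numbers, not from the data): if last month's changes were ΔY(t-1) = 0.02 and ΔX(t-1) = 0.01, and the lagged equilibrium error was E = 0.05, the model would predict ΔY(t) = 0.194 × 0.02 − 0.134 × 0.01 − 0.092 × 0.05 ≈ −0.0021, i.e. a small downward correction in the log of the Tata Steel price.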