Analytics And Intelligent Systems

NUS ISS AIS Practice Group

Analysis of Crude Oil Production And Import Prices — December 2, 2018


Prepared by:

Team Data Insighters

Balacoumarane Vetrivel – A0178301L

Mohammed Ismail Khan – A0178366N

Meghna Vinay Amin – A0178307Y

Sreekar Bethu – A0178220L

Vaishnavi Renganathan – A0178229U


With each passing year, oil plays an ever greater role in the global economy. In the early days, striking oil during a drill was considered something of a nuisance, as the intended targets were usually water or salt. It wasn’t until 1857 that the first commercial oil well was drilled, in Romania. The U.S. petroleum industry was born two years later with a drilling in Titusville, Pa.

Oil has replaced coal as the world’s primary fuel source. Oil’s use in fuels continues to be the primary factor in making it a high-demand commodity around the globe.

The Determinants of Oil Prices:

​With oil’s stature as a high-demand global commodity comes the possibility that major fluctuations in price can have a significant economic impact. The two primary factors that impact the price of oil are:

  • Supply and demand
  • Market sentiment

Supply and demand:

The price of oil as we know it is actually set in the oil futures market. An oil futures contract is a binding agreement that gives one the right to purchase oil by the barrel at a predefined price on a predefined date in the future. Under a futures contract, both the buyer and the seller are obligated to fulfil their side of the transaction on the specified date.

Market sentiment:

The mere belief that oil demand will increase dramatically at some point in the future can result in a dramatic increase in oil prices in the present, as speculators and hedgers alike snap up oil futures contracts. The opposite is also true: prices can hinge on little more than market psychology at times.

Other variables:

  1. News on new supply
  2. Changes in consumer habits
  3. Terrorist attacks and disturbance
  4. Alternative energy sources
  5. Economic growth

Crude oil production in US

For seven straight years, the US has pumped more oil and gas out of the ground than any other country and this lead will only widen. Production of crude topped 10.7 million barrels per day with production of natural gas hitting 4 million barrels per day. The surge in U.S. output is due in large part to the wide use of horizontal hydraulic fracturing, or fracking, as new technologies give drillers access to some of the largest oil deposits in the world that were once too tight to exploit.

Data source and Understanding

Data source:

The data is obtained from the link


3.1 Crude oil production (input series):

Crude oil production is defined as the quantities of oil extracted from the ground after the removal of inert matter or impurities. It includes crude oil, natural gas liquids (NGLs) and additives. This indicator is measured in thousand tonnes of oil equivalent (toe). The series considered here is total crude oil production in the US.

3.2 Crude Oil Import price (output series):

Crude oil import prices come from the IEA’s Crude Oil Import Register. Average prices are obtained by dividing value by volume as recorded by customs administrations for each tariff position. Values are recorded at the time of import and include cost, insurance and freight, but exclude import duties. The price is measured in USD per barrel.



The graph depicts the time series of both US crude oil production and the import price. To some extent, there is an inverse relationship between the two.

Hypothesis framework

That total crude oil production is a major factor in determining the price of a barrel is taken as tacit knowledge. To challenge this common belief, the import price of oil is considered as the input series (X) and oil production is considered to be the output series (Y).

Hypothesis – “There is an underlying relationship between the total crude oil production in US and crude oil import price.”

Equation – Production(Y(t)) ~ Import Price(X(t)).



The following steps show how various parameters were considered while building the ARIMA model, and how the team ultimately arrived at the transfer function.

Step 1: Fit an ARIMA model to the input series Xt. The input data is loaded and ARIMA models with different parameters are tried. The model IMA(2,1) shows a good overall summary.

The ACF and PACF values fall below the significance level, and the model coefficients are significant. The plot below shows the ACF and PACF obtained for the IMA model tried above; it concurs with what we observed. The residuals plot for the model is also attached.



ARIMA residual check:



Step 2: By fitting a preliminary model, we remove any autocorrelation that exists in the data. This process is called pre-whitening.

Pre-whiten the input and output series and check for cross-correlation.
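Pre-whitening filters both series through the input's ARIMA model; the cross-correlations of the whitened series then reveal the lag structure. A minimal numpy sketch of the cross-correlation step, on synthetic white series where the output reacts to the input with a 3-period delay (all data illustrative):

```python
import numpy as np

def ccf(x, y, max_lag):
    """Sample cross-correlation r_xy(k) = corr(x[t], y[t+k]) for k = 0..max_lag."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    return np.array([np.dot(x[: n - k], y[k:]) / n for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
x = rng.normal(size=300)                            # already white noise
y = np.roll(x, 3) + 0.3 * rng.normal(size=300)      # y reacts to x at lag 3
y[:3] = rng.normal(size=3)                          # discard wrapped-around values

r = ccf(x, y, max_lag=10)
print(r.argmax())   # the peak should sit at lag 3
```

Spikes in this cross-correlation function play the same role as the spikes in the pre-whitening plot below.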


The cross-correlation plot between input and output indicates significant correlation at lags 3 to 5 and lag 8.

This suggests that our transfer function equation will have input terms at lags 3 through 8.

The model equation:

Yt = v3 Xt-3 + v4 Xt-4 + v5 Xt-5 + v6 Xt-6 + v7 Xt-7 + v8 Xt-8 + nt

Step 3: Compute the transfer function.

The parameters identified are b = 3, s = 8 − 3 = 5, and r = 2. These values were read off the pre-whitening plot above and used to build the transfer function.
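As a rough illustration of what the fit does with these lags (not the JMP procedure used above), the impulse-response weights at lags 3 to 8 can be estimated by regressing the output on the corresponding input lags; the data and weights below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
lags = range(3, 9)                          # lags 3..8, i.e. b = 3, s = 5
true_v = [0.8, -0.5, 0.3, 0.2, 0.1, -0.4]   # hypothetical impulse weights

x = rng.normal(size=n)
y = sum(v * np.roll(x, k) for v, k in zip(true_v, lags))
y = y + 0.1 * rng.normal(size=n)

# design matrix of lagged inputs; drop the first 8 rows lost to lagging
X = np.column_stack([np.roll(x, k) for k in lags])[8:]
v_hat, *_ = np.linalg.lstsq(X, y[8:], rcond=None)
print(dict(zip(lags, v_hat.round(2))))      # should recover true_v closely
```

A full transfer-function model additionally fits the rational lag structure (r = 2) and an ARMA noise term, which software such as JMP handles internally.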



Model residual diagnostics:



Transfer Function Output


The residuals plot for the transfer function is obtained through the software. We run diagnostic checks on the fitted transfer function–noise model. Parameter significance is also checked from the plot, and the ACF and PACF values are found to lie within the significance bounds we are looking for.

6.1 Final Equation:

Yt − 0.03 Yt-1 − 1.03 Yt-2 − 1.85 Yt-3 + 1.91 Yt-4 = −290.3574 + 197.96 Xt-3 − 249.74 Xt-4 + 19.84 Xt-5 + 93.04 Xt-6 + 230.5484 Xt-7 − 763.4634 Xt-8 + et + 2.56 et-1 + 3.08 et-2 + 1.14 et-3


From the transfer function it is evident that crude oil production depends on lagged values (lags 3 to 8) of the import price. As stated in our hypothesis, there indeed exists a relationship, inverse in nature, between the two variables considered under our analysis, namely crude oil production and the import price, with movements in the import price preceding opposite movements in production.



Transfer Function Analysis: Relationship between number of deaths aggregated by month and number of admissions to Accident & Emergency Departments per month — November 9, 2018


  • Liang Shize (A0178178M)
  • Nandana Murthy(A0178563R)
  • Tan Zhi Wen David (A0056418A)
  • Zhao Ziyuan(A0178184U)
  • Zhu Huiying(A0178222H)


1. Background

In rapidly ageing Singapore, the demographic trends are worrying. While the birth rate fell to a seven-year low, the number of deaths recorded was the highest in at least two decades. Deaths rose by 4 per cent, from 20,017 in 2016 to 20,905 last year, the report on Registration of Births and Deaths 2017 showed. The report was released by the Immigration and Checkpoints Authority (ICA). This study therefore aims to analyse factors related to deaths, in order to give the government more insight into the situation. According to our analysis, admissions to Accident & Emergency Departments is one of the factors correlated with deaths.


2. Hypothesis

We hypothesise that the number of deaths aggregated by month and the number of admissions to Accident & Emergency Departments per month are correlated. In other words, an increase in the number of admissions will raise the death count. Here the dependent variable is the number of deaths aggregated monthly, while the independent variable is admissions to Accident & Emergency Departments aggregated by month.


3. Dataset

3.1 Data Source

The two datasets were collected from the Department of Statistics Singapore. The admissions to Accident & Emergency Departments data was extracted from the dataset ‘Admissions to Public Sector Hospitals, Monthly’, whereas the monthly deaths data was extracted from ‘Deaths by Ethnic Group and Sex, Monthly’. (Link:


3.2 Data Transformation

Originally, the data structure was a multi-dimensional table which was not suitable for analytics, so we transformed it into a more suitable structure. We also selected data from Jan 2012 to Dec 2017 only, as we wanted to focus mainly on the recent trend.


3.3 Data Records

The data was extracted from Jan 2012 to Dec 2017. It has 72 records in total, which is enough for a transfer function; a commonly used minimum is about 60 records.


4. ARIMA Model

4.1 Observing plots

As shown in Figures 1 and 2, the monthly deaths are non-stationary with a trend, while the admissions to Accident & Emergency Departments are stationary. In order to test their stationarity, the KPSS test was employed. The null hypothesis of the KPSS test is that the data is level- or trend-stationary; therefore, if the data is level- or trend-stationary, its p-value should be larger than 0.05.

According to the outputs of KPSS test shown in Figure 3, it is obvious that the monthly deaths are non-stationary, while the admissions to Accidents & Emergency Department are stationary.


Figure 1 Line chart of monthly death



Figure 2 Line chart of admissions to Accidents & Emergency Department



Figure 3 Output of KPSS test


4.2 Determination of parameters

Based on the ACF and PACF graphs in Figure 4, the parameters of ARIMA model can be determined.

  • p – Non-seasonal Autoregression Order

According to the PACF graph, there is a spike at lag 2. Therefore, p is equal to 2.

  • d – Non-seasonal Differencing Order

Because the original data is stationary, the differencing order is equal to 0.

  • q – Non-seasonal Moving Average Order

According to the ACF graph, there is a spike at lag 2. Therefore, q is equal to 2.

  • P – Seasonal Autoregression Order

According to the PACF graph, there is a spike at lags 12 and 24. Therefore, P is equal to 2.

  • D – Seasonal Differencing Order

Because the original admissions data is stationary, D is equal to 0.

  • Q – Seasonal Moving Average Order

According to the ACF graph, there is a spike at lag 12. Therefore, Q is equal to 1.


Figure 4 Time Series Basic Diagnostics


4.3 Result of Seasonal ARIMA (2,0,2) (2,0,1)

The result of the ARIMA is shown below. It is obvious that some coefficients are not significant, so the insignificant parameters have to be removed one by one to find the parameters of the best ARIMA model.


Figure 5 Output of Seasonal ARIMA (2,0,2) (2,0,1)


4.4 Output of ARIMA Model Group

In order to find the best parameters, the ARIMA Model Group was employed. The output of the ARIMA Model Group is shown below. According to the AIC criterion, the best solution is (2,0,2) (1,0,1).


Figure 6 Output of ARIMA Model Group


4.5 Output of Seasonal ARIMA (2,0,2) (1,0,1)

The output of Seasonal ARIMA (2,0,2) (1,0,1) is shown below. The model is regarded as the best model for the following reasons:

  1. Firstly, according to Forecast in Figure 8, the predictive trend and range are reasonable.
  2. Secondly, according to Parameter Estimates in Figure 9, all parameters are significant.
  3. Thirdly, according to Residuals in Figure 10, the residuals are randomly distributed, while there are no spikes in ACF and PACF graphs. Therefore, this model can be used for further analysis.


Figure 7 Model Summary



Figure 8 Forecast



Figure 9 Parameter Estimates



Figure 10 Residuals


5. Transfer Function

5.1 Prewhitening

After finding the suitable ARIMA model, the parameters of X’s ARIMA were used to pre-whiten the input and output series in order to obtain their cross-correlation graph, shown below.


Figure 11 Prewhitening Plot


5.2 Identifying Parameters and Fitting Transfer Function Noise Model

According to Figure 11, there are two spikes at lags 10 and 11: the non-zero correlation starts at lag 10 and the values decay after lag 11. Therefore b = 10 and s = 11 − 10 = 1. Besides, r is equal to 2. Finally, the parameters we used are shown in the figure below:


Figure 12 Transfer Function Model


5.3 Diagnostic Checks

  • Check Residuals

According to Figure 13, the residuals are randomly distributed.


Figure 13 Residuals of Transfer Function


  • Check Significance of Parameters

According to the result, all parameters are significant.


Figure 14 Parameter Estimates of Transfer Function


5.4 Model Comparison

Although the above solution is good enough, other parameters were tried for comparison. Finally, according to the output of model comparison in Figure 15, the solution mentioned above is the best one.


Figure 15 Model Comparison


5.5 Expanded Formula


Figure 16 Formula of Transfer Function

The formula of transfer function is shown above. However, in order to understand the formula better, we expanded the model in full with backshift operator as shown below. The expanded formula includes Y-Deaths, X-Admissions to Accident & Emergency Departments and e-Error Term. The number at the bottom right corner means the lags of its corresponding item.


Figure 17 Expanded Formula


5.6 Model Summary

As shown in Model Summary in Figure 18, the MAPE is equal to 2.77, while the MAE is equal to 46.92. The performance is acceptable. The forecasting points and confidence interval are shown in Figure 19.


Figure 18 Model Summary



Figure 19 Forecasting graph


5.7 Output of test data

As shown in Figure 20, the MAE on the test data is equal to 100.157, while the MAPE is equal to 5.55. Both evaluation metrics are larger than those of the training data. However, the result can still serve as a reference for the government.

Time       Predicted Value   Actual Value
Jan 2018   1696.946          1924
Feb 2018   1565.284          1662
Mar 2018   1666.980          1776
Apr 2018   1654.818          1624
May 2018   1684.973          1803
Jun 2018   1710.541          1729
MAE        100.157
MAPE       5.545054604

Figure 20 Output of test data
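As a check, the metrics in Figure 20 can be recomputed directly from the table (the MAPE reproduces the reported 5.545; the MAE comes to about 100.0 by this calculation):

```python
predicted = [1696.946, 1565.284, 1666.980, 1654.818, 1684.973, 1710.541]
actual = [1924, 1662, 1776, 1624, 1803, 1729]

errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)
print(round(mae, 3), round(mape, 3))
```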


6. Conclusion

This transfer function indicates the relationship between deaths and admissions to Accident & Emergency Departments. It validates the initial hypothesis that admissions to Accident & Emergency Departments lead the number of deaths aggregated by month in Singapore. The performance of the model is also acceptable. Therefore, the government can use this model to predict deaths in advance and then take action to lower the death count. For example, if the government sees an increase in the number of admissions, it can put more effort into providing medical assistance and conduct research to understand the underlying reasons.

Cointegration and Causality analysis of Crude Oil WTI and Dow Jones Commodity Index Energy — November 8, 2018


Data Source and EDA

Data Source and definition

The relationship between weekly data of the crude oil (WTI) price and the Dow Jones Commodity Index is examined in this report. The two time-series begin on 6 Jul 2014 and consist of 221 observations. WTI (West Texas Intermediate) is a grade of crude oil used as a benchmark in oil pricing; the downloaded data’s unit of measure is USD/barrel. The Dow Jones Commodity Index was launched on 2 Jul 2014 by S&P. The DJCI includes 23 of the commodities in the world production-weighted S&P GSCI, and it weights each of the 3 major sectors (energy, metals, and agriculture and livestock) equally. In this report we look only at the energy sector of the DJCI (DJCIEN), which consists of 6 commodities: WTI Crude Oil, Heating Oil, Brent Crude Oil, RBOB Gasoline, Gasoil and Natural Gas.

The data are downloaded from the following website:


Figure 1. Plot of WTI Price and DJCIEN

From the plot it can be seen that essentially the two time-series move together. To discover the relationship between these two, we do pre-whitening in JMP to see whether we can apply the transfer function technique.

Pre-whitening using JMP

Using WTI Oil Price as the input series, the ACF and PACF plots are shown below:

Figure 2. ACF and PACF of WTI Oil Price Series

It can easily be seen that the time series is not stationary. Therefore, we apply 1st-order differencing to the series:

Figure 3. ACF and PACF of WTI Oil Price differenced Series

ARIMA(1,1,1) is applied to the input series:

Figure 4. ARIMA(1,1,1) applied to input series

Applying the Ljung-Box test to the residuals after ARIMA, the p-values are all more than 0.05. Therefore, we do not reject the null hypothesis that the residuals are independently distributed. The model fits fine.

Figure 5. Input Series residual after ARIMA(1,1,1)

ARIMA(1,1,1) is applied to both input and output series and the following pre-whitening plot is obtained:

Figure 6. Pre-whitening Plot

From this plot, there is a significant peak at lag 0. The other correlations on the positive and negative sides of the lag axis are small enough to be negligible. This implies that Yt (the output series) reacts immediately to changes in the input Xt, so it is not possible to identify the transfer function model’s b, s, r here. Gretl will instead be used to discover the cointegration relationship between these two time-series.

Cointegration test and modeling using Gretl

Cointegration test

Before the cointegration test, Augmented Dickey-Fuller tests are performed:

Figure 7. ADF test for WTI Oil Price
Figure 8. ADF test for DJCIEN

Both of them have p-value > 0.05; therefore, we do not reject the null hypothesis that a unit root is present in the time-series.

The theoretical background for the Engle-Granger cointegration test is laid out as follows:

The estimation of the regression equation of Y on X (or X on Y) in level form by the OLS method is referred to as the cointegrating regression, as shown below:

Yt = intercept + b1 Xt + error

Xt = intercept + b1 Yt + error

where b1 = long-run regression coefficient.

If Xt and Yt are non-stationary and cointegrated, then a linear combination of them must be stationary. In other words:

Yt – βXt = Ut, where Ut is stationary.

An ADF test is carried out on Ut to confirm its stationarity.

Results from the Engle-Granger cointegration test are shown below:

Figure 9. Engle-Granger cointegration test result

The p-value of the ADF test on the residual is small; therefore we reject the null hypothesis of a unit root, and Ut is stationary.

We conclude that the two time-series are cointegrated.

Error Correction Model

From Yt – βXt = Ut, we know that Ut captures the error-correction relationship by measuring the degree to which Y and X are out of equilibrium. Therefore, an ECM can be built as: ΔYt = C + Φ ΔXt + α Ut-1, where Ut-1 = Yt-1 – βXt-1 and ΔXt captures any short-term effects.

Ordinary Least Squares is used first to obtain Ut, and then Ut-1 is used to build the ECM:

Figure 10. Setting used for Error Correction Model

The result of the model is shown below:

Figure 11. Error Correction Model

The coefficient of the error-correction term is statistically significant, indicating that short-term imbalances between the two series disappear as the system returns to its long-run equilibrium. The Durbin-Watson statistic shows no significant autocorrelation in the residuals. The R² value is 0.881811. Therefore, we conclude that the ECM is reasonably good.
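The two-step procedure described above can be sketched in plain numpy (synthetic stand-in data; this illustrates the method, not the Gretl output):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(0, 1, 300))      # I(1) input series
y = 1.5 * x + rng.normal(0, 0.5, 300)     # cointegrated with x

# Step 1: long-run regression  Y_t = c + beta*X_t + U_t
A = np.column_stack([np.ones_like(x), x])
c, beta = np.linalg.lstsq(A, y, rcond=None)[0]
u = y - (c + beta * x)                    # equilibrium error U_t

# Step 2: ECM  dY_t = C + phi*dX_t + alpha*U_{t-1}
dy, dx, u_lag = np.diff(y), np.diff(x), u[:-1]
B = np.column_stack([np.ones_like(dx), dx, u_lag])
C, phi, alpha = np.linalg.lstsq(B, dy, rcond=None)[0]
print(round(beta, 2), round(alpha, 2))    # alpha < 0: deviations get corrected
```

A negative, significant α is the error-correction evidence: the larger last period's deviation from equilibrium, the larger the offsetting move this period.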

Granger Causality Test

If two time-series, X and Y, are cointegrated, there must exist Granger causality either from X to Y, or from Y to X, or in both directions. However, the reverse is not true. Since we already tested the cointegration relationship, Granger Causality Test is then carried out to discover the causality relationship.

In Gretl, the Granger Causality Test is done via Vector Autoregression. Before running the VAR, it is important to find the maximum lag to specify in VAR(p), since this p value strongly affects the VAR result. We use the VAR lag selection function in Gretl to determine the max lag. From the following result, lag 1 is chosen because of the smallest AIC & BIC.

Figure 12. VAR lag selection

Using lag 1 to run the VAR, the result is shown as follows:

Figure 13. Result of VAR

From the equations:

Equation 1 (DJ Energy): the null hypothesis that all lags of the WTI coefficients are zero cannot be rejected at the 5% significance level. Therefore, WTI Oil Price does not Granger-cause Dow Jones Commodity Index Energy.

Equation 2 (WTI): the null hypothesis that all lags of the DJ Energy coefficients are zero is rejected at the 10% significance level, and essentially at the 5% significance level as well.

Therefore, we conclude that Dow Jones Commodity Index Energy Granger-causes WTI Oil Price.


From the above analysis, the following conclusions are deduced:

  1. A long-run equilibrium relationship between WTI Crude Oil Price and Dow Jones Commodity Index Energy exists.
  2. Unidirectional causality exists: Dow Jones Commodity Index Energy Granger-causes WTI Oil Price, while WTI Oil Price does not Granger-cause Dow Jones Commodity Index Energy.

    Team: AngelWorm


    Name – Student ID
    Ang Wei Hao Max – A0178278L
    Li Xiaoguang – A0178450B
    Laura Long Xiaoxing – A0055448Y
    Ma Wenjing – A0073595U
Transfer function assignment of Analytics_1_Group1 — November 3, 2018



About the data

The data describes the change in the frequency of television viewing by the average Indian household. We are not given details of the TV channel, hence we do not know who the target audience is. As seen in the chart below, the weekly rating of the TV channel hovers between 170 and 330 (a 160-point range) and is trending downwards. The rating has stayed below 300 after 27 Jan 2008, suggesting that the programmes were not able to appeal to the audience in India.

We tested the data for autocorrelation and confirmed that there is positive autocorrelation in this dataset (refer to the appendix for more information). The ACF does not dampen out within 15 to 20 lags, hence we conclude that the series is likely nonstationary [1].



We used JMP, SPSS Modeler as well as R to run our models.


We tried a log transformation of the data before running the time-series diagnostics in JMP, found the results similar to those from the original data, and therefore used the original data set for all our models.


  • Decomposition Method – Multiplicative method

Referring to the equation below, the trend component T reflects the long-term progression of the rating. It is a function of how effective the marketing campaign is, and it indicates how well the TV channel is doing. This trend is perturbed by several other effects which contribute to the viewing behaviour of the audience. As for the seasonal factor, TV ratings peak during festive seasons (more people stay at home to watch TV, or there is a big event organised by the channel); such events have a quantifiable, cyclic effect on the data and therefore make up the seasonal component, S, of the time series. Finally, what is left is assumed to be caused by random, non-periodic events and is called the irregular (or residual) component, I.

Cyclic patterns in television ratings can be explained by environmental factors, such as the weather or the amount of daylight: people watch less television when the weather permits them to be outside and engage in alternative activities (Gould, Johnson, & Chapman, 1984). In this data set, however, we observed a downward trend but no seasonality, nor any annual cycle.

Referring to the equation below, we will be using the linear regression method for our forecasting (refer to the spreadsheet for details). The error on our training dataset is 6.49% (MAPE), while on the testing data set (20 data points) it is about 8.43% (MAPE).



Intercept: 302.8228365
Slope: −1.400229918


Few key assumptions of regression include:

  • Linearity – there must be a linear relationship between the outcome variable and the independent variables.
  • Multivariate normality – multiple regression assumes that the residuals are normally distributed.
  • No multicollinearity – multiple regression assumes that the independent variables are not highly correlated with each other.
  • Homoscedasticity – the variance of the error terms is similar across the values of the independent variables. [2]

However, we found positive autocorrelation in the data, hence we conclude that the decomposition method is not a good model for prediction, as autocorrelated observations violate the assumption of independent, uncorrelated errors underlying the regression.



Figure 1


The time series shows strong positive autocorrelations (ACF) that die off slowly (we still see some autocorrelation at lag 25). In the PACF plot, there is strong autocorrelation at lag 1 and it cuts off after lag 1 (see Figure 1).

Next we try first- and second-order differencing to convert the data into a stationary time series before we test for p and q (ARIMA).


Figures 2a and 2b – first differencing

The intercept becomes insignificant and we also see signs of over-differencing (a pattern of sign changes from one observation to the next in the ACF plot; lag 1 has a negative spike). Hence, we decide to remove the differencing and try p & q instead.

The ACF plot indicates no seasonality in the data (only lags 1 and 7 show strong autocorrelation).


Figure 2a



We tried ARIMA(1,0,0) and achieved an adjusted R-square of 0.76, but lags 2 and 5 became significant and there is a negative spike at lag 1 (refer to the appendix for more information).

Next, we tried ARIMA(0,0,1); the adjusted R-square was only 0.50 and the ACF plot still shows high autocorrelation (refer to the appendix for more information). ARIMA(1,1,1) also gives us an insignificant AR term and intercept (see Figure 3).

Figure 2b



Figure 3



Our final model – ARIMA(1,0,1) (Figures 4 & 5)

We achieved an adjusted R-square of 0.79 and an MAE of 5.38. We did not split the data into training and testing datasets: there are only 20 data points available for validation, which is not sufficient for ARIMA models (at least 50 data points are needed to run an ARIMA model).

Final Model Checking

The estimation summary below indicates that both the autoregressive parameter and the moving average parameter are significant. The Ljung-Box statistics indicate no signs of significant autocorrelation in the residuals at any lag, and we also see randomness in the residual plot. Hence, we confirm that the model is a good fit.


Figure 4



Figure 5


  • Time Series Regression

Feature Engineering

The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm (regression) to model. By creating dummy variables, trend variable and weekly fluctuation variable, we aim to learn the underlying inherent functional relationship between inputs and outputs in order to improve the accuracy of the model.

First, we created 5 dummy variables to represent weeks 1 to 5 of the month.

And we also created another variable to capture the weekly fluctuation rate of ratings by:

(This week’s rating – Previous week’s rating) / Previous week’s rating

We also created 3 trend variables to capture the downward/upward trend of the data: 5-week, 6-week and 7-week moving averages.
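The three kinds of engineered features described above can be sketched with pandas (the column names and the weekly ratings series below are illustrative, not the actual data):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2008-10-05", periods=26, freq="W")
df = pd.DataFrame({"rating": 300 - 1.4 * np.arange(26)}, index=dates)

# dummy variables for week-of-month (weeks 1..5)
week_of_month = pd.Series((df.index.day - 1) // 7 + 1, index=df.index)
df = df.join(pd.get_dummies(week_of_month, prefix="week"))

# weekly fluctuation rate: (this week - previous week) / previous week
df["fluct"] = df["rating"].pct_change()

# 5-, 6- and 7-week moving averages as trend variables
for w in (5, 6, 7):
    df[f"ma_{w}"] = df["rating"].rolling(w).mean()
print(df.tail(3))
```

Note that the rolling averages leave the first w − 1 rows empty, so those rows must be dropped (or the window shortened) before fitting the regression.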


We split the data into training (July 2007 – 26 Oct 2008) and testing (Nov 2008 – Mar 2009) sets, and used the backward regression method in SPSS Modeler. Backward regression begins with the full least-squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

Referring to Figure 6 below, the significant variables in our time series regression model are the time period, the 5-week trend, the 7-week trend and the differencing variable. Model 6 gives us the best adjusted R-square of 0.8490, the best compared to the decomposition method and ARIMA, but the MAE difference between our training and testing data is more than 10%. The difference may be due to insufficient data points in our testing data set: we only have 20 data points for validation, and a few observations may resemble the training data set (10 out of 20 data points are below a 200 rating).



(Error tables: testing data set vs training data set)


  • Exponential smoothing

This method predicts the next period’s value based on past and current values. It involves averaging the data such that the non-systematic components of individual observations cancel each other out, and it is used for short-term prediction. Alpha, Gamma, Phi and Delta are the parameters that estimate the effects in the time series data: Alpha is used when seasonality is not present, Gamma when the series has a trend, and Delta when seasonal cycles are present.


Our Final Model – Simple Exponential Smoothing

Our final model uses the simple exponential smoothing technique in SPSS Modeler, giving us a smoothing coefficient of 0.624.


Simple exponential smoothing is suitable for forecasting data with no trend and no seasonal component. The smoothing equation is given by

St+1 = St + α(Xt – St)

which can also be written as

St+1 = αXt + (1 – α)St

where (Xt – St) is the difference between the observed and forecasted value, St+1 is the smoothing statistic and α is the smoothing constant. The new smoothed value is thus the weighted sum of the current observation and the previous smoothed value: the weight of the most recent observation is α and the weight of the most recent smoothed value is (1 – α).

As a result, St+1 is a weighted average of all past observations and the initial value S0. The weights decrease exponentially depending on the value of α (0.624): more weight is given to the most recent observations, and the weights decrease geometrically with each week. The MAE between the training and testing datasets is more than 10%. Since this technique is good for short-term predictions and the smoothing parameters depend only on current and previous observations, we do not recommend splitting the data for prediction, as the TV channel rating has been trending downwards since the second quarter of 2008. We attempted the exponential smoothing technique on the data set but still see spikes in the ACF and PACF plots, hence we conclude that the model is not a good fit.
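The smoothing recursion above is easy to sketch directly (α = 0.624 as reported by SPSS; the ratings values below are illustrative):

```python
def ses(series, alpha):
    """One-step-ahead forecasts via S[t+1] = alpha*X[t] + (1-alpha)*S[t]."""
    s = [series[0]]                 # initialise S0 with the first observation
    for x in series:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s[1:]                    # forecast for periods 1..n

ratings = [320, 310, 305, 298, 290, 284, 279]
forecasts = ses(ratings, alpha=0.624)
print([round(f, 1) for f in forecasts])
```

Because each forecast mixes only the latest observation with the previous smoothed value, the weight on an observation k periods back is α(1 − α)^k, which is the geometric decay described above.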



Above is a summary of the error terms and R-square (goodness of fit) of our models, split between training and testing data. We would recommend the time series regression model for prediction if we had more data points for validation. Both ARIMA and exponential smoothing are useful techniques for forecasting a time series, but exponential smoothing depends on the assumption of exponentially decreasing weights for past data and does not account for the autoregressive component (refer to the exponential smoothing model summary above). ARIMA works by transforming a time series into a stationary series, studying the nature of that stationary series through its ACF and PACF, and then accounting for autoregressive and moving average effects; it is the best of the four models.






Model Description
Model Name: MOD_1
Series Name 1: normalized data
Transformation: None
Non-Seasonal Differencing: 0
Seasonal Differencing: 0
Length of Seasonal Period: No periodicity
Maximum Number of Lags: 16
Process Assumed for Calculating the Standard Errors of the Autocorrelations: Independence (white noise)(a)
Display and Plot: All lags


Applying the model specifications from MOD_1
a. Not applicable for calculating the standard errors of the partial autocorrelations.



Case Processing Summary (normalized data)
Series Length: 92
Number of Missing Values (User-Missing): 0
Number of Missing Values (System-Missing): 0
Number of Valid Values: 92
Number of Computable First Lags: 91




normalized data




Autocorrelations
Series: normalized data
Lag  Autocorrelation  Std. Errora  Box-Ljung Value  df  Sig.b
1 .876 .103 72.847 1 .000
2 .829 .102 138.832 2 .000
3 .821 .101 204.294 3 .000
4 .774 .101 263.139 4 .000
5 .741 .100 317.682 5 .000
6 .704 .100 367.480 6 .000
7 .644 .099 409.611 7 .000
8 .652 .099 453.323 8 .000
9 .617 .098 493.003 9 .000
10 .579 .097 528.344 10 .000
11 .573 .097 563.380 11 .000
12 .519 .096 592.461 12 .000
13 .489 .096 618.633 13 .000
14 .479 .095 644.117 14 .000
15 .428 .094 664.700 15 .000
16 .412 .094 684.041 16 .000


a. The underlying process assumed is independence (white noise).
b. Based on the asymptotic chi-square approximation.





Partial Autocorrelations
Series:   normalized data
Lag Partial Autocorrelation Std. Error
1 .876 .104
2 .266 .104
3 .255 .104
4 -.029 .104
5 .021 .104
6 -.051 .104
7 -.130 .104
8 .209 .104
9 -.048 .104
10 .013 .104
11 .043 .104
12 -.164 .104
13 .005 .104
14 .011 .104
15 -.046 .104
16 .055 .104




Transfer Function Analysis: Relation between the number of visitors in Singapore’s Zoological Gardens with the weather conditions — October 26, 2018

Transfer Function Analysis: Relation between the number of visitors in Singapore’s Zoological Gardens with the weather conditions

Team Members : Gopesh Dwivedi (A0178338R), Indu Arya (A0178360B), Rubi Saini (A0178255W)


The Singapore Zoo, formerly known as the Singapore Zoological Gardens and commonly known locally as the Mandai Zoo, occupies 28 hectares (69 acres) on the margins of the Upper Seletar Reservoir within Singapore’s heavily forested central catchment area. The zoo was built at a cost of $9 million granted by the government of Singapore and opened on 27 June 1973. It is operated by Wildlife Reserves Singapore, which also manages the neighbouring Night Safari, River Safari and Jurong Bird Park. There are about 315 species of animals in the zoo, of which some 16 percent are considered threatened. The zoo attracts 1.7 million visitors each year.

Statistical time series models are used to analyze the short-run impact of weather and the long-run impact of climate upon visits to Singapore Zoo. The relationship between key climate parameters and tourism has been researched for over 30 years. While tourists respond to the integrated effects of the atmospheric environment, including thermal, physical and aesthetic aspects, many studies conclude that temperature is the dominant climate variable for tourism. Several studies have attempted to identify the ideal or optimum temperatures for tourism: globally, tourists prefer an average daily temperature of 21°C, and ideal temperatures for urban sightseeing in Singapore have been found to be 20-26°C. Higher daily maximum temperatures of around 30°C are preferred for beach-based recreation.


Our hypothesis regarding Singapore Zoo’s visitation is that better weather conditions, which in our case means bright sunny days rather than rain, would increase visitor numbers at the Zoo. Thus, some cross-correlation between weather conditions and the Zoo’s visitation should exist.

 A “tourist experience” encompasses three stages: planning, the actual experience, and assessments. Weather and climate affect the tourist experience. The distinction between weather (shorter term) and climate (longer term) is apparent when these tourist experience stages are used. Climate information and forecasts are assessed during tourism planning; weather information supports the actual experience; assessments are combinations of weather and climate information—reconciling discrepancies between expectations and reality of the tourist experience.

At the tourist destination, activity choices are impacted by local weather events. Weather events influence on-site activity choices of tourists. Social factors and different activity choices diversify options for tourists and recreationalists. However, influences of weather on the availability and performance of an activity are still present. For example, if availability of an intended activity is reduced due to the influence of the weather (e.g. lack of snow for skiing), alternative choices are created by businesses to mitigate revenue losses and provide social outlets.

Data and Preprocessing

For this study, we collected temporal monthly data from SingStat. The data contains monthly visitor counts at Singapore Zoo from April 1990 onwards. Weather conditions such as the number of bright sunny days and rainfall were also acquired from the SingStat website. Primarily, we consider the number of sunny days as our input series and the Zoo’s monthly visitors as the output series. We want to see whether no-rain scenarios affect tourists’ decisions to undertake outdoor activities such as visiting Singapore Zoo.


Test of cointegration – We ran a cointegration test for the input and output series and, based on the p-value, conclude that the series are not cointegrated, so we will develop a transfer function for them.


Relation plot of input and output series:



Test of stationarity (ADF test) – We first want to identify whether the series is stationary, so we ran the Augmented Dickey-Fuller test; the p-value is significant, so we reject the null hypothesis of a unit root and conclude that the series is stationary.



As our input series is stationary, there is no need for differencing. We ran models for different values of p and q.


We ran multiple models and found that Seasonal ARIMA(1,0,1)(3,0,1) gives the best results.

For this model the residuals are white noise: all autocorrelations are insignificant, and all the ACF and PACF values fall within the confidence interval.


Residual plot is random, and the selected parameters are significant.



The MAPE for this model is small.


We then applied the same model to the Y series and obtained good results: white noise residuals, a random residual plot, and ACF and PACF values within the confidence interval.

We then apply pre-whitening on this model and proceed to check correlation between input and output series.


Cross Correlation:

We ran a cross-correlation test on the input and output series and can see that the output series is correlated with lags of the input series. So we can say that past data of the input series can be used to predict the output (Y) series.
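As an illustration of what the cross-correlation check computes, the sketch below evaluates corr(x at t−lag, y at t) on synthetic prewhitened-style series; the names and data are stand-ins (not the zoo series), with the output built to depend on the input at lag 2.

```python
# Illustrative cross-correlation check: corr(x_{t-lag}, y_t) for several lags.
# x stands in for the prewhitened input and y for the prewhitened output;
# here y is deliberately built to depend on x at lag 2.
import numpy as np

def cross_corr(x, y, lag):
    """Correlation between x lagged by `lag` periods and y."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.roll(x, 2) + 0.5 * rng.normal(size=300)   # output depends on input at lag 2
y[:2] = rng.normal(size=2)                        # discard the wrap-around values

ccf = [cross_corr(x, y, k) for k in range(6)]     # the peak appears at lag 2
```

A spike at a positive lag k like this is what justifies using past input values to predict the output.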



We consider that correlations exist only at four lags (2, 3, 4 and 5), starting from lag 2 and peaking at lag 4, for the development of the transfer function. So our transfer function equation will include input-series terms only for lags 2, 3, 4 and 5.

So, we have d = 2 and s = 4 - 2 = 2.

Transfer Function

We developed two transfer functions: the first with d = 2, s = 2 and r = 2, and the second with d = 2, s = 2 and r = 1.

The transfer function with d = 2, s = 2 and r = 2 gives the best results, so we proceed with transfer function model 1.





*standardized(t) = number of visitors in the zoo; standardized_Bright_sunshine_daily_mean_hour(t-1) = bright sunshine daily mean hours at lag 1; e(t) = error term


Results and Insights

We can now infer from the above analysis that there is conclusive analytical evidence that bright sunny weather has a strong relationship with the number of people visiting Singapore Zoological Park. In this report, we applied transfer function analysis and observed that the two factors, bright sunshine daily mean hours (in hours) as the input variable and the number of visitors to Singapore Zoological Park as the output variable, are related unidirectionally.


Transfer Function Analysis: Relationship between visitors from the Americas and the luxury hotel booking in Singapore — October 24, 2018

Transfer Function Analysis: Relationship between visitors from the Americas and the luxury hotel booking in Singapore

1. Introduction & Problem Definition

It is believed that visitors from countries with strong economies and a high standard of living, such as those in the Americas (the USA and Canada in particular), are more likely to have the economic capability to travel in a more expensive fashion. Hence, it is reasonable to hypothesize that the influx of American visitors may relate to a higher occupancy rate in luxury hotels.

This hypothesis further leads to the objective of the study: to answer and provide evidence to the question of whether there exists a relationship between visitors from the Americas and the luxury hotel booking in Singapore.

2. Data Description

To analyze the relationship between visitors from Americas and the luxury hotel booking in Singapore, data was collected from Department of Statistics Singapore, covering the most recent 5.5 years of data (66 data points for each time series from Jan 2013 to June 2018) under the following two topics:

a) Monthly data of International Visitor Arrivals

Data Source:

Selected Attribute: AMERICAS – Number of visitors from AMERICAS, including USA, Canada, and other parts of AMERICAS

b) Monthly data of Hotel Statistics

Data Source:

Selected Attribute: AOR of Luxury Hotels – Includes hotels in the luxury segment, predominantly in prime locations and/or in historical buildings.

(* AOR = [Gross lettings (Room Nights) / Available room nights] x 100)

3. Transfer Function Analysis in JMP

The initial assumption made for the two time series is that the number of American visitors (AMERICAS) leads luxury hotel bookings (LUXURY). Hence, AMERICAS is the independent series (input) while LUXURY is the dependent series (output). Several ARIMA models were tested to find the best-performing one. The final model is Seasonal ARIMA (1,1,1)(0,1,0)12.


From the prewhitening plot, it can be observed that the relationship between the input and output series is not unidirectional, with peaks ranging from negative to positive lags. This indicates that the two series affect each other bi-directionally. Since the transfer function is only applicable where the relationship is unidirectional, other techniques will be explored for the further analysis.

4. Bi-directional Correlation Analysis in GRETL

Since a bi-directional correlation is found between Luxury and Americas, the Vector Autoregression (VAR) model is a useful starting point in the analysis of the inter-relationships between the different time series. Stationarity must exist in the series in order to estimate VAR model, so ADF test will be performed at first to test whether these two series are stationary and suitable for VAR(p) model. If they are not stationary, then differencing or cointegration test will be performed to estimate VAR(p) model or VEC(p) model.


VAR lag selection is performed with a value of 10. According to all the criteria, AIC, BIC and HQC, the best lag value to use for the analysis is 1.

Test unit root

The null hypothesis of the Unit Root Test is that the time series variable is non-stationary(a=1). According to the unit root test results of the two time series variables as shown above, both p-values are significant, which indicates that the null hypothesis can be rejected and the two series are stationary with constant and trend.  Hence, a VAR(1) model can be estimated.

Estimate VAR model

A VAR(1) model can be estimated with constant and trend. Seasonal dummy variables are also incorporated in the model as the time series plot shows a clear seasonal cycle. The two essential features of the VAR model: Causality Analysis and Impulse Response Analysis are performed to analyze the two series.


VAR model

In the output displays below, S1 to S11 refer to the monthly dummy variables. Both equations are significant, and the majority of the variables are significant as well. The F-tests of zero restrictions under each model provide information for the Causality Analysis.

Granger Causality Test

The Granger Causality Test can provide evidence for answering the question of whether variable x causes variable y or vice versa, which is the interest of bi-directional time series analysis. The results of F-tests of zero restrictions under each model show that:

Equation 1 F-test results of “all lags of Luxury” and “all lags of AMERICAS” are both significant. This indicates a bidirectional causality between Luxury and Americas in equation 1.

Equation 2 F-tests result of “all lags of Luxury” is not significant and that of “all lags of AMERICAS” is significant. This indicates a one-way causality in equation 2.

With this information, a conclusion can be drawn that Americas and Luxury affect each other in a bidirectional way.

Impulse Response Analysis

An exogenous shock to one variable not only directly affects this specific variable but is also transmitted to the other endogenous variables through the dynamic (lag) structure of the VAR. The Impulse Response Analysis in VAR model can help to trace the effect of a one standard deviation shock to one of the innovations on current and future values of the endogenous variables. The 4 charts below show the impact of a shock on Luxury and AMERICA to Luxury or AMERICA, respectively.


  • The shock on Luxury directly moves Luxury to a higher level; Luxury then returns to its initial status monotonically after 4 months.
  • The shock on Luxury gradually reduces AMERICAS to its lowest point at the second month; AMERICAS then slowly returns to its initial status monotonically after 10 months.
  • The shock on AMERICAS pushes Luxury immediately to a medium level; Luxury then slowly reduces to its normal status after 10 months.
  • As with the shock of Luxury on Luxury, the shock on AMERICAS directly moves AMERICAS to a higher level, and AMERICAS then returns to its initial status.

The analysis result shows that visitors from AMERICAS and the luxury hotel booking in Singapore have an interactive effect on each other, and the individual reaction of Luxury and AMERICAS to shocks on itself or the other series are very different.

5. Conclusion

To investigate and understand the relationship between visitors from Americas and the luxury hotel booking in Singapore, several analytical approaches and methods were conducted on the acquired dataset. Transfer function in JMP gives the indication of a bidirectional relationship between the two series, while Granger Causality Test further proves the inference from JMP. Finally, the Impulse Response Analysis helps to trace the effect of various shocks to Luxury and AMERICAS. This analytical approach can be applied to similar studies on hotel type versus visitors from other regions to identify the reaction pattern for different regions. Decision makers in the tourism industry can apply the insight and knowledge gained from studying these patterns in market segment planning and other related issues.



Submitted by: Red

Gao Yuan (A0178404A); Wang Yijing (A0178354W);

Wang Yuhui(A0179193U); Zhang Xiaoman(A0178489A);

Hu Mengxi (A0178561W)



Submitted by: Aastha Arora (A0178188L), Aishwarya Bose (A0178277M), Chetna Gupta (A0178260A), Madeleine Dy (A0178427U), Misha Singh (A0178309W), Zaira Hossain (A0178331E)


Problem/Hypothesis of the Two Time Series:

Is there a relationship between Air Passenger Arrivals and value of Food & Beverage Sales in Singapore?

Data Exploration and Pre-Processing:





  • The raw dataset contains Air passenger arrivals by country and regions estimated monthly.
  • Information has been obtained from third party sources.
  • Refers to Changi Airport only.
  • Data excludes transit passengers who continued their journey on the same flight.






  • The raw dataset contains value of food and beverage sales index estimated monthly.
  • The Food & Beverage Sales Value is compiled from the results of the Monthly Survey of Food & Beverage Services.
  • The objective of the survey is to measure the short-term performance of food & beverage (F&B) services industries based on the sales records of F&B services establishments.
  • The Food & Beverage Sales Value is based on the Food & Beverage Services Index (FSI).

Number of Records:

F&B sales value and overall Air passenger arrivals were combined to form a dataset containing 200 monthly records from Jan 1997 to Aug 2013 (~16 years & 8 months of data) for both Singapore Food & Beverage Prices, and Number of Air Passenger arrivals, which was used for analysis.


A dip in arrivals and F&B sales value was seen in April, May and June 2003, which could be attributed to the SARS outbreak in Singapore. These points were therefore treated as outliers, and we replaced the values with a 3-month moving average.


Initial Analysis using Python:

We plotted both variables for visual inspection. We can clearly see that both time series are trending, which is a clear sign that both have a unit root.


We confirmed whether they have unit roots by using the augmented Dickey Fuller (ADF) test.

The Dickey Fuller Test produced the following output for Arrival:

(1.5485691015907959, 0.9976947744107255, 15, 184, 
{'1%': -3.466398230774071, '5%': -2.8773796387256514, 
'10%': -2.575213838610586}, 4548.903850689798)

The Dickey Fuller Test produced the following output for Food Value:

(1.5508415283231438, 0.9977023393139994, 13, 186, 
{'1%': -3.466005071659723, '5%': -2.8772078537639385, 
'10%': -2.5751221620996647}, 1478.034269903061)

Clearly, we cannot reject the null-hypothesis that these series have a unit root. So we should difference both series as a first step.

To test whether the ‘Arrival’ is caused by the ‘Food Value’, we applied the Granger Causality Test.

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=50.6180 , p=0.0000  , df_denom=196, df_num=1
ssr based chi2 test:   chi2=51.3927 , p=0.0000  , df=1
likelihood ratio test: chi2=45.7154 , p=0.0000  , df=1
parameter F test:         F=50.6180 , p=0.0000  , df_denom=196, df_num=1
{1: ({'ssr_ftest': (50.61796959284111, 2.059081651775858e-11, 196.0, 1), 
'ssr_chi2test': (51.39273443354787, 7.56178806704204e-13, 1), 
'lrtest': (45.71543384976394, 1.3674112550498519e-11, 1), 
'params_ftest': (50.617969592841, 2.0590816517759726e-11, 196.0, 1)}, 
[<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
0x000001A540F8D518>, array([[0., 1., 0.]])])}

We can say that Arrival is caused by Food Value because p value is significant and less than 0.05. Then, we reverse the test to see if the ‘Food Value’ is caused by the ‘Arrival’. As a result, we found that Food Value was not caused by Arrival.

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=1.9754  , p=0.1615  , df_denom=196, df_num=1
ssr based chi2 test:   chi2=2.0056  , p=0.1567  , df=1
likelihood ratio test: chi2=1.9956  , p=0.1578  , df=1
parameter F test:         F=1.9754  , p=0.1615  , df_denom=196, df_num=1
{1:({'ssr_ftest': (1.9754074370715489, 0.16145817984806168, 196.0, 1),
'ssr_chi2test': (2.005643265189991, 0.15671480384086686, 1), 
'lrtest': (1.9956036184489676, 0.15775620297932386, 1), 
'params_ftest': (1.9754074371779515, 0.16145817983682556, 196.0, 1)}, 
[<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 
0x000001A545489CC0>, array([[0., 1., 0.]])])}


Analysis using JMP:

Firstly, we checked the PACF graph in the time series inputs (Food Value). It’s clear that the data needs to be differenced to run the input series ARIMA for pre-whitening:


We then use the ARIMA Group Model Function in JMP to discover the best combination. This turns out to be Seasonal ARIMA (2,1,0) (0,1,1)12 for the input series. Then pre-whitened the input series and we got the following result:


Since we did not get any significant lags, we decided to go with a log transformation of both series. After the log transformation, we fit an ARIMA model using the above steps, pre-whitened the series, and got the following output:


We found significant lags on both sides of the plot. In this case a transfer function will not give the right coefficient estimates, so we concluded that this is a cointegration problem. Hence, we moved on to GRETL.


Analysis using GRETL:

Plotting both variables again in GRETL, we can clearly see that both time series are trending, which is a clear sign that both have a unit root.


Both series trend upward. It is possible that they follow a long-run equilibrium relationship to which they tend to return over time. We performed the Engle-Granger test for cointegration to find out.

Engle Granger Test:
Given two sets of time series data, x and y, Granger causality is a method which attempts to determine whether one series is likely to influence change in the other. This is accomplished by taking different lags of one series and using them to model the change in the second series. We create two models which predict y, one using only past values of y (Ω), and the other using past values of both y and x (π). The models are given below, where k is the number of lags in the time series:

Ω: yt = β0 + β1yt-1 + … + βkyt-k + e
π: yt = β0 + β1yt-1 + … + βkyt-k + α1xt-1 + … + αkxt-k + e

The residual sums of squared errors are then compared, and a test is used to determine whether the nested model (Ω) is adequate to explain the future values of y or if the full model (π) is better. The F-test, t-test or Wald test (used in R) are calculated to test the following null and alternate hypotheses:

H0: αi = 0 for each i of the element [1,k]
H1: αi ≠ 0 for at least 1 i of the element [1,k]

Essentially, we are trying to determine whether, statistically, x provides more information about future values of y than past values of y alone. Under this definition we are not trying to prove actual causation, only that the two values are related by some phenomenon. Along those lines, we must also run the model in reverse to verify that y does not provide information about future values of x. If we find that it does, it is likely that there is some exogenous variable, z, which needs to be controlled for or could be a better candidate for Granger causation.

Steps in Engle Granger Test:

Step 1: Determine ‘d’ in I(d) for ‘Log of Arrival’ using ADF Unit Root Test.
H0: Level series contains a unit root.
HA: Level series does not contain a unit root.

We have taken a maximum lag order of 6 by taking the cube root of the number of data points (200).

We select ‘constant and trend’ because we have seen from the plot that the series has an upward trend.


The p-value is large, so we fail to reject the NULL Hypothesis. This means series needs to be differenced to make it stationary. So, d=1.

Step 2: Determine ‘d’ in I(d) for ‘Log of Food Value’ using ADF Unit Root Test.
H0: Level series contains a unit root.
HA: Level series does not contain a unit root.

We have taken a maximum lag order of 6 by taking the cube root of the number of data points (200).

We select ‘constant and trend’ because we have seen from the plot that the series has an upward trend.


The p-value is large, so we fail to reject the NULL Hypothesis. This means series needs to be differenced to make it stationary. So, d=1.

Step 3: Estimate the cointegrating regression: Yt = β1 + β2Xt + Ɛt
We estimated the cointegrating regression using ‘Log of Arrival’ as the dependent variable. Both variables are integrated of the same order. We set the lag order to 6, as in Steps 1 and 2.


Step 4: Determine ‘d’ in I(d) for Ɛt
H0: Unit root (i.e., not cointegrated)
HA: No unit root (i.e., cointegrated)

Since our p-value is small, we reject the null hypothesis at the 5% level of significance and conclude that Food Value is cointegrated with the Arrival rate. This in turn means that the series can be written as an error correction model.


Error Correction model using GRETL:

∆Y_t = φ_1 ∆X_(t-1) + φ_2 ∆Y_(t-1) − γ{Y_(t-1) − β̂_1 − β̂_2 X_(t-1)} + ω_t

The lag residual from cointegrating regression is found within the curly braces above. The coefficient γ is the speed of adjustment. If it is not statistically significant, the variable is weakly exogenous.

Before estimating the error correction model, we estimated the cointegrating regression and saved the residuals with name e.


Then, we created two new series which is the difference of ‘log of arrival’ and ‘log of food value’.

Then we estimated the Error Correction Model. Select following lags:


The variable that we are interested in is e. At least one of the variables must not be weakly exogenous if the series are cointegrated. We can see that Log of Arrival is not weakly exogenous. This means that the arrival value moves to restore the equilibrium when the system is out of balance, but Log of Food Value does not move towards equilibrium when the system is out of balance.



We can conclude that arrivals of air passengers are caused by food and beverage sales values, as opposed to our initial assumption that food and beverage sales values depend on arrivals of air passengers. We can say that Food Value Granger-causes the Arrival rate!

Relationship between the GDP and CPI in India – by Diksha Jha, Dinesh Nagarajan ,Pradeep Rajagopalan ,Praveen Ponnuchamy , Sravani Satpathy —

Relationship between the GDP and CPI in India – by Diksha Jha, Dinesh Nagarajan ,Pradeep Rajagopalan ,Praveen Ponnuchamy , Sravani Satpathy


A transfer function model describes a unidirectional relationship between an input and an output series.

Problem/Hypothesis of the time series data:

To analyze the relationship between the Consumer Price Index (CPI) and the Gross Domestic Product (GDP) of the Indian market.

Number of records in the dataset:

The dataset contains the history of CPI and the GDP for the past 60 years of the Indian market (one entry for each year).

Data source:

Transfer Function:

In our initial visual analysis, we could see that the input and output parameters are on different scales: the GDP is in lakhs of rupees while the CPI is a percentage. We have therefore transformed the dependent variable (GDP) with log10.

The relation between the input and output is mostly linear, with some discrepancy due to white noise and the autocorrelation of the output variable. In business terms, we understand that the Consumer Price Index and the GDP of India are positively correlated and interact strongly. This can be confirmed by the scatterplot matrix below:


With the above understanding we continue to the transfer function analysis for the above dataset.

From the initial Time Series Basic Diagnostics, we see that for the output variable both the ACF and PACF are tailing off; hence this is an ARIMA model. Since the data is already stationary, we have not done any differencing. A similar pattern is followed by the input series (CPI).


For pre-whitening, we could see that the autocorrelation is significant after lag 14, i.e. nearly after 15 years. By Indian government standards, the budget is revised every 5 years, and due to the slow growth of the economy the impact of CPI on GDP is seen after 15 years. It is most prominent during the following 5-7 years, as seen in the ACF graph below.


For the transfer function, we expect B15 and the surrounding terms to be prominent. The above analysis resulted in the transfer function below, in which the relation between the output variable and the input is prominent, confirming the delayed impact of CPI on the GDP of India after 15 years.


The pre-whitening process gave us a model of (2,0,2); we could also see that the peak in the pre-whitening lag plot is at 19 while the significant values start from 15, so s1 is given a value of 4. Below is the transfer function model summary.


The actual transfer function shows that the current GDP is impacted by the 5 CPI observations that are nearly 15 years old. This also suggests that the 5-year budgeting cycle has a prominent impact on the GDP.


From the above analysis, we conclude that the GDP estimate based on the input variable (CPI) is as given below:


This gives us a clear understanding that the current GDP depends on the 3 previous years’ GDP and on 6 years of CPI (the independent variable) 15 years apart. We also notice that, as the years pass, older years have minimal impact.


Relationship between Maruti Suzuki & Tata Steel Stock Prices on National Stock Exchange using Transfer Function — October 22, 2018

Relationship between Maruti Suzuki & Tata Steel Stock Prices on National Stock Exchange using Transfer Function

Team Members: Apurv Garg, Bhabesh Senapati, Daksh Gupta, Dibyajyoti Panda, Rajiv Hemanth, Tran Minh Duc

Reason for Study

The auto industry uses a tremendous amount of material to build cars, most of the weight coming from steel. Most modern cars weigh around 3,000 pounds, of which 2,400 pounds is steel. In cars, steel is used to build the chassis, door beams, roof, body panels, exhaust, etc.

TATA Steel is the largest producer of steel in India, while Maruti Suzuki is one of the most extensively purchased automobile brands. So, there should be a relationship between the stocks of these two giants.


Time Series Analysis

1.Data Source


Figure 1: Time Series Plot

2. Data Transformation

A log transformation was applied to the stock prices of both companies, as it helped in better identification of time series patterns by reducing the variance.

3. Model Building

For the 1st iteration, the Y (dependent) variable is the TATA Steel stock price and the X (independent) variable is the MARUTI stock price.

3.1 Test for Stationarity

Before starting with model building, it is very important for a time series to be stationary; if it is not, it should be converted to a stationary sequence by taking a difference with its lag. The Augmented Dickey-Fuller test is used to test the stationarity of a series.

The Augmented Dickey-Fuller test has the following hypotheses:

 Ho –> Level series contains a unit root

 H1 –> Level series does not contain a unit root

Depending on the significance of the p-value, we either reject or fail to reject the null hypothesis. If we fail to reject it, the series is non-stationary; if we reject it, the series is stationary.

Both series were tested for stationarity, and both failed to reject the null hypothesis due to insignificant p-values. Hence both were non-stationary and were differenced with a lag of 1 to make them stationary.

3.2 Fitting ARIMA Model

The input series, i.e. the Maruti stock price series, is fitted with the best ARIMA combination. The value of I = 1 was determined in the step above.

Illustration 1: The Seasonal ARIMA (2,1,2)(0,0,2) model was finally selected as the best fit, with all variables significant and a MAPE of 1.07


Illustration 2: Parameter Estimates

3.3 Pre-whitening of Input & Output Series

Both series were then pre-whitened using the following parameter values:

p = 2  |  I = 1  |  q = 2  | P = 0  |  Is = 0  |  Q = 2

Illustration 3: Cross-correlation plot between Tata Steel and Maruti Suzuki stock prices.

After pre-whitening, the cross-correlation plot was analysed to determine the transfer function parameters. However, the plot showed spikes at both negative and positive lags, indicating a case of cointegration in which the two series are correlated with each other. This can be seen in Illustration 3.


3.4 Engle-Granger Test for Cointegration

The p-value for the Engle-Granger test is much closer to significance for the case Y = Log_TISC and X = Log_MRTI (p = 0.09) than for the reverse case, Y = Log_MRTI and X = Log_TISC.

Hence, the Tata Steel price was finally selected as the dependent variable and the Maruti Suzuki price as the independent variable.


3.5 Error Correction Model

Since this is now a cointegrating regression problem, an error correction model was built, with a difference [D = 1] to make the series stationary. The output below shows the results of the ordinary least squares (OLS) model:

Model: OLS, using observations 2003:10-2018:09 (T = 180)

Dependent variable: d_Log_TISC 

coefficient   std. error   t-ratio   p-value


d_Log_MRTI_1   −0.134663     0.115003     −1.171    0.2432

e_1            −0.0925064    0.0280208    −3.301    0.0012  ***

d_Log_TISC_1    0.194640     0.0854004     2.279    0.0239  **



 Mean dependent var   0.007540   S.D. dependent var   0.140167

Sum squared resid    3.274077   S.E. of regression   0.136006

Uncentered R-squared 0.071713   Centered R-squared   0.069012

F(3, 177)            4.557955   P-value(F)           0.004207

Log-likelihood       105.2140   Akaike criterion    −204.4279

Schwarz criterion   −194.8490   Hannan-Quinn        −200.5441

rho                 −0.017859   Durbin-Watson        2.022787


The p-value was highest for the variable d_Log_MRTI_1.

3.6 Final Equation

 Delta Y(t) = 0.194 * Delta Y(t-1) – 0.134 * Delta X(t-1) – 0.092 * E


Y(t) –> Log value of the Tata Steel stock price at time t

Y(t-1) –> Log value of the Tata Steel stock price at time t-1

X(t-1) –> Log value of the Maruti Suzuki stock price at time t-1

E –> Error-correction term (lagged residual from the cointegrating regression)
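A worked one-step illustration of this equation may help. The coefficients below are the ones estimated in the OLS table; the example inputs (last period's log-returns and the disequilibrium) are hypothetical values chosen only for arithmetic.

```python
# Worked illustration of the final ECM equation. Coefficients come from
# the OLS table above; the inputs are hypothetical.
def ecm_delta_y(dy_lag1, dx_lag1, ec_lag1):
    """Predicted change in the log Tata Steel price for the next period."""
    return 0.194 * dy_lag1 - 0.134 * dx_lag1 - 0.092 * ec_lag1

# Suppose last period Tata Steel's log price rose 0.02, Maruti's rose 0.01,
# and Tata Steel sat 0.05 (in log terms) above its long-run equilibrium:
pred = ecm_delta_y(0.02, 0.01, 0.05)
print(f"{pred:.5f}")  # -0.00206
```

The negative −0.092 coefficient on E is the key feature: roughly 9.2% of any deviation from the long-run equilibrium is corrected each period.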

Is there a relationship between employment re-entry rate and job vacancy to unemployed person ratio?

  • Group Name: Insights
  • Group Members:
  1. Arn Joseph Napinas (A0178485L)
  2. Akwila Jeremiah Suria (A0137701M)
  3. Jayanthi D/O Duraisamy (A0178542X)
  4. Shi Peigong (A0178286M)
  5. Yang Chia Lieh (A0178500J)
  6. Lee Geok Joo (A0178392R)


This study analyzes the dynamic relationship between an input and output time series using a conventional transfer function model. The following sections of this report describe the data source and data records, identify the leading indicator (by pre-whitening), and perform the necessary analysis and diagnostic checks to draw an appropriate conclusion about the relationship between the input and output time series.

Data Preparation

  • Quarter: YYYY-Quarter
  • Employment Re-entry Rate (reentry_rate)
– Rate of re-entry into employment in a quarter for retrenched residents
– Statistics on re-entry into employment allow the government to gauge how well retrenched workers are able to secure a new job
– Percentages are expressed as a value over 100, i.e. “100” represents 100%
  • Job Vacancy to Unemployed Person Ratio (jb_to_ue)
– Calculated by taking the ratio of the estimated total number of job vacancies for the whole economy to the total number of unemployed persons
– Expressed as a ratio

A total of 65 records are used in this study, spanning the first quarter of 2002 to the first quarter of 2018.

Data Source:

Develop Transfer Function Model

1. Time series plot

In Figure 1, 65 pairs of simultaneous observations of reentry_rate and jb_to_ue are plotted as time series against the quarters of each year, with the input denoted by x and the output by y. Since it is difficult to assess the delayed dynamic relationship between reentry_rate and jb_to_ue at this early stage, it is assumed that reentry_rate is the output variable y and jb_to_ue is the input variable x. This assumed relationship will be verified and confirmed during the pre-whitening stage.

figure 1

2. Pre-Whitening

If the input is autocorrelated, the effect of any change in the input takes some time to play out, so its subsequent effects on the output tend to linger. As a result, spurious effects can appear in which changes in y seem to have led to changes in x. To alleviate these spurious effects in the estimated cross-correlation function caused by autocorrelation in the input, pre-whitening is applied to make the input look like white noise. The pre-whitening steps are summarised below.

Step 1:
The first step in pre-whitening is to identify and fit a time series model to the input data x. In the time series plot section, jb_to_ue was assumed to be the input variable x. The autocorrelation and partial autocorrelation of the input are shown in Figure 2. The ACF dies down quickly and the PACF cuts off after lag 1; together, these plots indicate that an appropriate model may be a first-order autoregressive AR(1) model.

Step 2:
The residuals after fitting the AR(1) model should look like “white noise”, as shown in Figure 3. Figure 4 shows the ACF and PACF plots for the residuals; there is no significant autocorrelation at higher lags, so the model is deemed appropriate.
Step 3:
After pre-whitening the input, the next step is to pre-whiten the output y. Figure 5 and Figure 6 show the residuals plot and the ACF and PACF plots after applying the AR(1) model to the output.

Step 4:
The next step in the pre-whitening process is to compute the cross-correlation function between the pre-whitened input and output, shown in Figure 7.
Figure 7 provides a picture of the input-output relationship. The cross-correlations at lags 2 and 3 are significant, indicating a time delay of about 2 time units in the system and a dynamic relationship stretching over 2 time periods that initially builds up and then fades out. The leading indicator is therefore jb_to_ue, which confirms the assumption made in Step 1.

Step 5:
Figure 7 also gives the approximate transfer function weights. Clearly, the first non-zero weight appears at lag 2, indicating that the delay is 2 time units, i.e. b = 2. For the non-zero weights, one plausible interpretation is that they decay exponentially after lag 2, which suggests s = 2 - 2 = 0. The exponential decay after lag 2 could be due to a single exponential decay term or a sum of more than one, suggesting r = 1 or 2. To build the transfer function model, both values of r are fitted to find the optimal model.

3. Transfer function Model

With all parameters obtained from the above process, the transfer function models are generated as shown in Figure 8 and Figure 9:

a) Transfer Function Model with r=1:

b) Transfer Function Model with r=2:

figure 8

4. Diagnostic Check

In the diagnostic check, residual checks and parameter checks are performed.
For the residual check, the ACF and PACF plots of the residuals for transfer function models 1 and 2 (Figure 10 and Figure 11) show that no autocorrelation is left in the residuals.

figure 9.png

For the parameter check, transfer function model 1 (Figure 8) has insignificant first and second parameters, and transfer function model 2 (Figure 9) has an insignificant second parameter. As both models have insignificant terms, the transfer function model built is not optimal. It is recommended that alternative regression strategies be considered, e.g. a linear transfer function modelling strategy for a multiple-input model, where total_paid_hours from the same dataset can be included as an additional input variable.


In this study, the relationship between reentry_rate (rate of re-entry into employment in a quarter for retrenched residents) and jb_to_ue (job vacancy to unemployed person ratio) was modelled using the Box-Jenkins approach. An AR(1) pre-whitening process was used to “de-correlate” the input and output series, and a transfer function was built on the pre-whitened series. However, the diagnostic check found that not all parameters are significant, which led to refitting the data with various ARIMA models for the pre-whitening process. After several iterations, no better ARIMA model was found to fit the input data and generate the transfer function. This result may suggest that there is no dynamic delayed relationship between reentry_rate and jb_to_ue. Alternatively, the failure to build an appropriate transfer function could be due to an insufficient data set, since there are only 65 observations in the study data set.