I’m excited to present this report, which offers a detailed time series analysis of the Economic Indicators data from Analyze Boston. Through careful examination and the use of statistical techniques, we have uncovered significant trends, correlations, and forecast insights. The research examines the historical pattern of international flights at Logan International Airport as well as the interactions between several economic variables, including the unemployment rate. Differencing, modelling, and forecasting techniques have provided critical understanding of the dynamic nature of the data, and the sections below reflect both the breadth of our investigation and the depth of our conclusions.
Report_2_MTH522
Author: athakur1
Trend Analysis
In this blog we are going to analyze the trend in the data and discuss seasonality.
The graph below plots the historical data of international flights at Boston Logan airport.
There is an initial upward trend between 2013 and 2015, suggesting a gradual increase in the number of international flights.
The plot steepens from 2015 to 2016, suggesting a quicker rise in international flights during this time frame. This might point to a time of explosive expansion or a change in the driving forces.
From 2016 to 2018 the plot returns to roughly the same slope it had from 2013 to 2015. Unlike the 2015 to 2016 stretch, when growth accelerated markedly, this later period shows growth at a relatively constant rate.
We could also compute and plot a moving average or apply more sophisticated time series decomposition techniques to uncover underlying trends. These methods can help uncover any long-term trends or variations in the number of international flights, offering insightful information about the long-term dynamics of the airport.
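As a concrete illustration, here is a minimal sketch of computing and plotting a 12-month moving average. The CSV file name and the Year/Month column names are assumptions; only the ‘logan_intl_flights’ column name comes from the dataset as used in this report.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the Economic Indicators data (file name is an assumption)
df = pd.read_csv("economic-indicators.csv")

# Build a monthly datetime index from the Year/Month columns (column names assumed)
df["date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str) + "-01")
flights = df.set_index("date")["logan_intl_flights"].sort_index()

# 12-month centered moving average to smooth out month-to-month noise
window = 12
rolling_mean = flights.rolling(window=window, center=True).mean()
rolling_std = flights.rolling(window=window, center=True).std()

plt.figure(figsize=(10, 4))
plt.plot(flights, alpha=0.5, label="Monthly international flights")
plt.plot(rolling_mean, color="orange", label=f"{window}-month moving average")
plt.fill_between(flights.index,
                 rolling_mean - 1.96 * rolling_std,
                 rolling_mean + 1.96 * rolling_std,
                 color="orange", alpha=0.2, label="Approximate confidence band")
plt.title("Logan international flights with moving average")
plt.legend()
plt.show()
```

The later snippets in this report reuse the `flights` series built here.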
The Orange Line in Context: The data trends over the designated window size are represented by the peaks and valleys of the orange line. An increasing trend over the chosen time period is indicated if the orange line rises. A downward trend is indicated if it is falling.
The variations are less sharp than in the original data, which facilitates the identification of broad trends.
Usually, the light orange area indicates a confidence interval around the moving average. The data points within each window are more erratic or uncertain when the confidence interval is wider.
We can apply seasonal-trend decomposition (STL) to the international flights series to identify which specific periods (months or years) have the highest number of flights.
For instance, if we explicitly set the frequency of the time series to monthly instead of yearly, we can see the pattern of flights in each month.
The trend, seasonal, and residual graphs will be:
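A minimal sketch of how such a decomposition could be produced with statsmodels, reusing the monthly `flights` series from the moving-average snippet above (the 12-month seasonal period is an assumption based on monthly data):

```python
from statsmodels.tsa.seasonal import STL

# Seasonal-trend decomposition with an annual (12-month) cycle
stl = STL(flights, period=12)
result = stl.fit()

# Plots the observed series plus its trend, seasonal and residual components
fig = result.plot()
fig.set_size_inches(10, 8)
```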
Using the seasonal component derived from STL decomposition instead of the original data allows us to highlight and isolate the recurrent patterns present in international flight numbers. The seasonal component allows us to focus on the recurring variations associated with different months by representing the regular, periodic fluctuations.
Through an analysis of this component, we can pinpoint particular months that exhibit a consistent increase or decrease in the number of international flights. This helps us to better understand the seasonal patterns and trends present in the dataset. With the use of this technique, recurrent behaviours that could be obscured or diluted in raw, unprocessed data can be found.
The tallest bar shows the average seasonal component for the month with the highest value. This shows that month has the highest average number of flights during the season. Conversely, shorter bars indicate periods with fewer international flights during months with lower average seasonal effects. By examining the heights of these bars, one can gain insight into seasonal variations and determine which months exhibit a consistent increase or decrease in the number of international flights, based on the seasonal patterns that were extracted from the data.
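One way to produce such a bar chart, continuing from the STL `result` above, is to average the seasonal component by calendar month (a sketch, not the report’s exact code):

```python
import matplotlib.pyplot as plt

# Average seasonal effect for each calendar month, taken from the STL seasonal component
seasonal = result.seasonal
monthly_effect = seasonal.groupby(seasonal.index.month).mean()

monthly_effect.plot(kind="bar", figsize=(8, 4),
                    title="Average seasonal component by month")
plt.xlabel("Month")
plt.ylabel("Average seasonal effect (flights)")
plt.show()
```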
Ljung-Box Test Procedure
Ljung-Box Test Hypotheses:
For the Ljung-Box test, the null hypothesis (H0) posits the absence of autocorrelation in a time series at lags up to a specified maximum lag, while the alternative hypothesis (H1) asserts significant autocorrelation at at least one lag within that range.
Test Statistic and Critical Values:
The test statistic is calculated from the sum of squared autocorrelations at the various lags. It is then compared against critical values from the chi-square distribution. If the test statistic exceeds the critical value, the null hypothesis is rejected and the autocorrelation is deemed significant.
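In statsmodels the test is available as `acorr_ljungbox`; a minimal sketch, assuming `residuals` holds the residuals of a previously fitted model (the 12-lag maximum is an arbitrary choice):

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the residuals for lags 1 through 12
lb_results = acorr_ljungbox(residuals, lags=12, return_df=True)
print(lb_results)

# Lags with lb_pvalue < 0.05 reject the null hypothesis of no autocorrelation
significant = lb_results[lb_results["lb_pvalue"] < 0.05]
print(f"{len(significant)} of {len(lb_results)} tested lags show significant autocorrelation")
```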
Interpretation of Significant Autocorrelation:
Statistically significant autocorrelation in 25% of the tested lags indicates that the residuals are not adequately explained by the model. It suggests that there are lags where the residuals are not independent or random, exposing a temporal structure that the model is unable to capture.
Implications and Model Enhancement:
Considerable autocorrelation indicates patterns or subtleties that have not yet been identified in the time series, which may be the result of factors that have been missed or underlying complexity. This emphasises how important it is to improve the model, which leads to investigating various specifications, modifying parameters, or adding new features.
Analysis of Individual Lags:
In order to identify trends and inform iterative model improvements, it becomes imperative to examine individual lags with notable autocorrelations. This in-depth analysis helps to reveal details about the data structure that the first model was unable to sufficiently represent.
Residual Analysis in Time Series
Residual analysis stands as a pivotal stage in time series modeling, serving to assess the model’s goodness of fit and ensure the satisfaction of underlying assumptions. Residuals, representing the differences between predicted and observed values, undergo careful examination in the following steps:
- Compute Residuals: Calculate residuals by subtracting predicted values from observed values.
- Plot Residuals: Visual inspection of residuals over time reveals trends, patterns, or seasonality. Ideally, well-fitted model residuals appear random and centered around zero.
- Autocorrelation Function (ACF) of Residuals: Plotting the ACF of residuals helps identify any lingering autocorrelation. Significant spikes in the ACF plot suggest unaccounted temporal dependencies.
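A compact sketch of these steps, assuming `fitted_model` is a statsmodels results object (for example the MA(1) fit shown later in this report):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

residuals = fitted_model.resid

fig, axes = plt.subplots(2, 1, figsize=(10, 6))

# Residuals over time: should look like random noise centred around zero
axes[0].plot(residuals)
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set_title("Residuals over time")

# ACF of residuals: spikes outside the bands point to leftover autocorrelation
plot_acf(residuals, lags=24, ax=axes[1])

plt.tight_layout()
plt.show()
```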
Significance of Normality:
The normality assumption is foundational for statistical techniques like confidence interval estimation and hypothesis testing. Deviations from normality can lead to biased estimations and inaccurate conclusions. Time series models, including ARIMA and SARIMA, often assume residual normality, and if not met, the model may fail to accurately capture data patterns.
Implications of Deviations from Normality:
- Validity of Confidence Intervals: Constructing valid confidence intervals relies on the normality assumption. Non-normally distributed residuals may compromise the reliability of these intervals, leading to inaccurate uncertainty assessments.
- Outliers and Skewness: Histogram deviations from normality may signal outliers or residual skewness. Identifying and addressing these issues is crucial for enhancing overall model performance.
In essence, ensuring normality in residuals is fundamental for robust time series modeling, aligning with the foundational assumptions of various statistical techniques. Violations of this assumption warrant attention to maintain the model’s accuracy and reliability.
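A quick way to check the normality assumption, assuming the same `residuals` series as above (the Shapiro-Wilk test is one of several options):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: roughly bell-shaped and centred on zero if residuals are normal
axes[0].hist(residuals, bins=20)
axes[0].set_title("Histogram of residuals")

# Q-Q plot: points close to the reference line indicate approximate normality
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q plot of residuals")

plt.tight_layout()
plt.show()

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
```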
On running residual analysis over the Analyze Boston data, we get:
Residual Over Time:
This plot shows differences between observed and predicted values over the course of the prediction period, illuminating the behaviour of the model residuals. It is essential to analyse residuals over time in order to evaluate the model’s performance and spot any possible systematic trends or oversights. Important factors to think about are:
Random Patterns: The residuals should ideally show randomness in the absence of recurring patterns, demonstrating how well the model captured the underlying data structures.
Centred around Zero: The residuals should be centred around zero; any discernible drift points to potential bias or incompleteness in the model.
Heteroscedasticity: Variability in the residuals that changes over time may be a sign of heteroscedasticity, a signal that the model does not sufficiently account for the inherent variability in the data.
Outliers: Finding extreme values or outliers in the residuals can help identify data points or events that the model missed.
The lack of consistent trends indicates that the variance in the ‘logan_intl_flights’ data has been effectively accounted for. Accurate model predictions are indicated by residuals that are mainly centred around the mean; random noise is usually responsible for any deviations. The models’ ability to consistently manage variability is implied by the lack of heteroscedasticity, which strengthens their dependability over time.
ACF Residual Analysis:
The residuals’ Autocorrelation Function (ACF), which illustrates the relationship between different lags, helps with the evaluation of residual temporal structures after model fitting.
Among the interpretations are:
- No Notable Spikes: If the ACF of the residuals decays rapidly to zero without noticeable spikes, the residuals are effectively independent, showing that the model has successfully captured the temporal dependencies.
- Significant Spikes: The existence of notable spikes at particular lags raises the possibility of residual patterns or autocorrelation, which calls for further investigation into alternative model structures or refinements.
The fact that our ACF shows no notable spikes suggests that the temporal dependencies in the data have been successfully eliminated by the model.
In the next blog we will look at which statistical test is commonly used to assess the existence of significant autocorrelations in a time series at different lags.
The Moving Average Model
In this blog we are going to talk about the moving average model and analyse the ‘logan_intl_flights’ time series from Analyze Boston.
The Moving Average Model, or MA(q) for short, is a part of the larger class of time series analysis models known as ARIMA (Autoregressive Integrated Moving Average) models. By taking into account the impact of random or “white noise” terms from the past, this model forecasts the current observation. The number of historical white noise terms taken into consideration is indicated by the model’s order, denoted by “q” in MA(q). For example, the latest white noise term is considered in MA(1).
Features of the Model: The current observation is expressed by the model’s mathematical equation as a linear combination of the current white noise term and the most recent q white noise terms. A series of independent, identically distributed random variables with a constant variance and zero mean is known as white noise. In order for the MA(q) model to be applicable, the time series must display a constant mean (μ). Furthermore, stationarity is required, and if needed, differencing can be used to attain it. Model identification, or order q, is determined using methods such as autocorrelation function (ACF) plots and statistical criteria. An essential part of applying MA(q) to time series analysis is parameter estimation, followed by forecasting and model validation using techniques like residual analysis.
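In symbols, a standard textbook form of the MA(q) model (with $\mu$ the series mean, $\varepsilon$ white noise, and $\theta$ the moving-average coefficients) is:

$$y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$

so MA(1) reduces to $y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1}$.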
Now we are going to use an MA(1) model on our ‘logan_intl_flights’ time series.
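A minimal sketch of fitting MA(1) with statsmodels, reusing the `flights` series from earlier; MA(1) is simply an ARIMA model with order (p, d, q) = (0, 0, 1):

```python
from statsmodels.tsa.arima.model import ARIMA

# MA(1): no autoregressive terms, no differencing, one moving-average term
ma1_model = ARIMA(flights, order=(0, 0, 1))
ma1_fit = ma1_model.fit()

print(ma1_fit.summary())

# These residuals can feed the residual and Ljung-Box diagnostics discussed earlier
residuals = ma1_fit.resid
```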
The Autocorrelation Function (ACF) plot shows correlation coefficients on the y-axis and the number of lags, or the interval of time between the current observation and its lagged values, on the x-axis. An ACF plot attempts to identify significant spikes that decrease with increasing lags. A notable spike at a particular lag indicates a strong correlation with observations at that lag. Comparably, the Partial Autocorrelation Function (PACF) plot uses the same idea for the x-axis but uses the y-axis to show partial correlation coefficients. By eliminating the impact of intermediate lags, PACF isolates the distinct correlation between the current observation and its lagged values. Notable peaks in the PACF plot reveal robust partial correlations at those lags, shedding light on how each lag directly affects the current observation. In essence, the ACF provides information about how each lag affects times that come after, whereas the PACF shows the direct effect of each lag on the current time.
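Both plots can be generated directly from statsmodels; a sketch using the same `flights` series (the 24-lag horizon is an arbitrary choice):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# ACF: correlation of the series with its own lagged values
plot_acf(flights, lags=24, ax=axes[0])

# PACF: direct correlation at each lag, with intermediate lags removed
plot_pacf(flights, lags=24, ax=axes[1])

plt.tight_layout()
plt.show()
```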
Forecasting Random Walk
In this blog we are going to recall the random walk and then apply it to the Economic Indicators data to see how well it predicts incoming international flights.
The fundamental idea behind the concept is that future values in a time series depend on the value that was most recently observed, and that any deviation from this value is essentially random. Despite its simplicity, the random walk is often used as a benchmark for evaluating forecasting models, especially when complex patterns are difficult to predict.
- Initialization: To start the forecasting process, use the most recent observed value from the historical data.
- Iterative Prediction: For every subsequent time interval, the forecast for the next observation is simply the most recent observed value. The model assumes that any changes or deviations are random and unpredictable.
- Evaluation: Metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE) are frequently used to assess the performance of the random walk model by comparing the predicted values to the actual observations.
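A minimal sketch of this procedure on the `flights` series, with an assumed 12-month hold-out window (the split point and the mean benchmark are illustrative, not the report’s exact setup):

```python
# Hold out the last 12 months for evaluation
train, test = flights.iloc[:-12], flights.iloc[-12:]

# Random-walk (naive) forecast: every future value equals the last observed value
rw_forecast = train.iloc[-1]
rw_mae = (test - rw_forecast).abs().mean()

# Mean benchmark: predict the historical average everywhere
mean_forecast = train.mean()
mean_mae = (test - mean_forecast).abs().mean()

print(f"Random walk MAE:   {rw_mae:.2f}")
print(f"Mean baseline MAE: {mean_mae:.2f}")
```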
Using Mean Absolute Error (MAE), the evaluation reveals an average prediction deviation from actual international flight counts of about 280.16 units. When more complex models are compared to this MAE, their precision improves over the simplistic random walk.
When scaled, the random walk’s average prediction error is approximately 5.33% of the maximum flight count. The random walk also outperforms a basic mean benchmark model, which predicts the historical average of around 3940.51 international flights at Logan Airport and yields an MAE of approximately 578.06. This demonstrates that the random walk captures more information than a mean-predicting model.
A relative measure is given by the scale-adjusted MAE of 5.33%, which shows that, on average, the random walk’s prediction errors are low in relation to the maximum flight count. The way you interpret MAE should be in line with the particular context of your data and the needs you have for forecasting.
AutoCorrelation Function(ACF)
In this blog we are going to talk about AutoCorrelation Function (ACF) and implement it over Economic Indicators for the analysis.
The Autocorrelation Function (ACF) is a statistical tool that is used to determine the correlation between a time series and its own lagged values. It makes it easier to find patterns, trends, and seasonality in the data. The ACF plot displays correlation coefficients for different lags, allowing one to pinpoint significant lags and potential autocorrelation patterns in the time series. The ACF at lag k is the correlation between the time series and itself at lag k.
ACF values can be positive or negative:
- Positive ACF suggests a positive correlation, indicating that high values at one time point may relate to high values at another time point.
- Negative ACF implies a negative correlation, indicating an inverse relationship between values at different times.
On analyzing the Economic Indicators dataset:
This analysis sheds light on the temporal relationships embedded in the ‘logan_intl_flights’ time series. The ACF plot reveals a robust positive correlation at a lag of one month, suggesting a tendency for the current month’s international flight count to positively associate with the count in the preceding month. This finding holds significance for further exploration and modeling, especially considering the application of techniques like autoregressive models designed to capture such temporal dependencies.
Notable peaks in the plot indicate substantial autocorrelation at specific lags, emphasizing the strong correlation of the time series with its past values at these particular points. The y-axis reflects both the direction and strength of autocorrelation, with a value of 1 denoting perfect positive correlation, -1 representing perfect negative correlation, and 0 indicating no correlation.
TSF with Python – Differencing and Stationarity
I have been working with the Economic Indicators dataset, which I retrieved from Analyze Boston. In my previous blog, I talked about various methods of time series forecasting. Here I apply those methods to check stationarity and perform differencing on the Logan international flights time series.
Stationarity:
Stationarity is an important concept in time series analysis. A stationary time series is one whose statistical characteristics remain constant over time. The lack of seasonality or trends simplifies the modelling process.
To achieve stationarity and stabilise statistical properties, transformations such as differencing or logarithmic transformations are frequently required.
By visualizing the above graph, the series doesn’t look stationary, but we can check stationarity formally using the ADF test. The Augmented Dickey-Fuller (ADF) test is a well-known solution to this issue: it determines whether a unit root is present, which would indicate non-stationarity. If the p-value for the null hypothesis of a unit root falls below the conventional 0.05 threshold, the null is rejected and stationarity is supported. Combining this statistical rigor with domain expertise enhances our understanding of the dataset’s temporal dynamics.
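A minimal sketch of the ADF test on the `flights` series with statsmodels:

```python
from statsmodels.tsa.stattools import adfuller

# Null hypothesis: the series has a unit root (is non-stationary)
adf_stat, p_value, *rest = adfuller(flights)

print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis of a unit root: the series appears stationary")
else:
    print("Fail to reject the null hypothesis: the series appears non-stationary")
```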
Differencing:
It is a method for achieving stationarity in a time series. It involves calculating the differences between successive observations. By removing seasonality and trends, this process makes the time series easier to analyse. The first-order difference is Y(t) − Y(t−1).
This is how the first-order differenced series looks relative to the original time series. Visually, differencing appears to have done its job and the series looks stationary.
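A sketch of the differencing step and a re-check of stationarity, continuing with the same series:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

# First-order differencing: Y(t) - Y(t-1)
flights_diff = flights.diff().dropna()

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
axes[0].plot(flights)
axes[0].set_title("Original series")
axes[1].plot(flights_diff)
axes[1].set_title("First-order differenced series")
plt.tight_layout()
plt.show()

# The differenced series should now give a much smaller ADF p-value
print("ADF p-value after differencing:", adfuller(flights_diff)[1])
```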
TSF with Python: Unveiling Key Concepts and Analyze Boston
I’ve gained an understanding of the topics below through Marco Peixeiro’s book on time series forecasting with Python.
Understanding Time Series:
A time series is a chronological collection of measurements that are regularly spaced and are important in many different fields of study. In order to produce reliable future estimates, time series analysis entails identifying patterns, trends, and behaviors that have been ingrained in the data throughout time.
Decomposing Time Series:
Data is broken down into its most basic components using time series decomposition: trend, seasonality, and noise. Long-term movement is represented by the trend, recurrent patterns are identified by seasonality, and unexplained deviations are represented by noise. This breakdown improves understanding of data structures for more accurate analysis and predictions.
Forecasting Project Lifecycle:
A forecasting project lifecycle spans everything from gathering data to deploying models. Important steps include exploratory data analysis, choosing a model (such as ARIMA or exponential smoothing), training the model on historical data, validating and testing it, deploying it, and continuing to monitor and maintain it. Iterative processes guarantee ongoing updates for precise and up-to-date projections.
Baseline Models:
Baseline models provide minimum forecasts that more complicated models should exceed, acting as benchmarks for more advanced approaches. Examples include the mean (average) baseline, the naive baseline, and the seasonal baseline. They establish performance standards and help assess the importance of the improvements provided by more sophisticated models.
Random Walk Model:
A strong foundation for time series forecasting is the random walk model. It asserts that the most recent observation alone determines future values, accounting for short-term swings with random error. Its persistence, ability to capture noise, and usefulness as a starting point for evaluating the performance of sophisticated models are among its features. Underlying trends in the data may be difficult to identify if a sophisticated model is unable to outperform the random walk.
Using the “Economic Indicators” dataset from Analyze Boston, let’s attempt to determine the baseline (mean) for the total number of international flights at Logan Airport.
The baseline model is based on a computation of the historical average of international flights at Logan Airport. For this basic metric, the mean of international flights is computed for every time unit, like months or years, based on the temporal granularity of the dataset. The computed historical average serves as the foundation for predicting future international flights.
The alignment of the blue and red lines in the visualisation provides a quick indicator of how well the baseline model is performing. The assumption here is that future values will reflect the historical average.
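A sketch of this baseline, again with an assumed 12-month test window (the blue/red colour choice mirrors the visualisation described above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Baseline: predict the historical mean for every point in the test window
train, test = flights.iloc[:-12], flights.iloc[-12:]
baseline_forecast = pd.Series(train.mean(), index=test.index)

plt.figure(figsize=(10, 4))
plt.plot(test, color="blue", label="Actual international flights")
plt.plot(baseline_forecast, color="red", label="Mean baseline forecast")
plt.legend()
plt.show()
```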
Decoding Temporal Patterns: Navigating Time Series Analysis and Conventional Predictive Modeling in Data Exploration
In the vast world of data analysis, understanding the subtle differences between various methods is crucial. Time series analysis, designed for data collected over time, brings unique advantages and disadvantages when compared to conventional predictive modeling.
Time series analysis, represented by models like ARIMA and Prophet, is essential for tasks involving temporal dependencies. ARIMA uses techniques like moving averages, differencing, and autoregression to capture trends and seasonality. Prophet, developed by Facebook, excels at handling missing data and unusual events.
On the other hand, conventional predictive models like random forests, decision trees, and linear regression offer flexibility but may struggle with dynamic temporal patterns. Linear regression’s assumption of linearity can miss complex trends, while decision trees and random forests might not capture subtle long-term dependencies.
Machine learning heavyweights like neural networks and support vector machines, although versatile, may lack a nuanced understanding of temporal intricacies. Even simpler methods like K-Nearest Neighbours may struggle with time-related nuances.
Choosing between time series analysis and conventional predictive modeling depends on the data characteristics. Time series methods excel in unraveling temporal complexities, providing a tailored approach for identifying and forecasting patterns over time that generic models might overlook. Understanding the strengths and weaknesses of each technique helps in selecting the right tool for the specific data landscape.
Now, let’s delve into some essential concepts:
Stationarity: A crucial idea in time series analysis is stationarity. A time series exhibiting constant statistical attributes like mean, variance, and autocorrelation is called stationary. It becomes non-stationary if there’s seasonality or a trend.
Types of Stationarity:
- Strict Stationarity: The full joint distribution of the series, and hence all its moments, is unchanged by shifts in time.
- Trend Stationarity: The series is stationary around a deterministic trend; removing that trend leaves a stationary series.
- Difference Stationarity: Stationarity is achieved by differencing; the first-order difference is stationary.
How to Check for Stationarity:
- Visual Inspection: Plot the time series data and observe trends or seasonality.
- Summary Statistics: Compare mean and variance across different time periods.
- Statistical Tests:
  - KPSS Test: Checks for stationarity around a deterministic trend. Null hypothesis: the series is stationary around a trend.
  - ADF Test (Augmented Dickey-Fuller): Tests for a unit root. Null hypothesis: the time series has a unit root and is non-stationary. If the p-value < 0.05, reject the null hypothesis, suggesting stationarity.
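As a closing sketch, both tests can be run side by side on the `flights` series, remembering that their null hypotheses point in opposite directions:

```python
from statsmodels.tsa.stattools import adfuller, kpss

# ADF: null hypothesis is a unit root (non-stationary)
adf_p = adfuller(flights)[1]

# KPSS: null hypothesis is stationarity (around a level here; use regression="ct" for a trend)
kpss_p = kpss(flights, regression="c", nlags="auto")[1]

print(f"ADF p-value:  {adf_p:.4f}  (< 0.05 suggests stationarity)")
print(f"KPSS p-value: {kpss_p:.4f}  (< 0.05 suggests non-stationarity)")
```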