Forecasting Air Passengers count using Time series Analysis

Hello readers, In today’s tutorial we will be learning to forecast Air Passengers count by Time Series Analysis using the ARIMA model with the help of Python programming.

Introduction to Forecasting Air Passengers count using Time series Analysis

Air travel is one of the quickest ways of public transportation for crossing international borders. An airline company collects the data of the number of passengers that have traveled with them on a particular route for the past few years. This data is used to gather insights on passengers travel patterns. With the help of this data they can see and forecast the number of passengers for the next twelve months.

Making this projection could be advantageous to the Airline companies because it will assist them in making important decisions such as –

  • What kind of plane should they fly?
  • When shall they take off?
  • How many flight attendants and pilots do they require?
  • How much food should they have on hand at their warehouse?

The forecasting is done on the Time Series data. Training has been done with an ARIMA model. To convert the available time series into a stationary time series, techniques such as moving average and differencing were applied. The suitable values of p, d, and q for ARIMA were established by plotting ACF and PACF.

Steps involved in Forecasting Air Passengers count using Time series Analysis

  1. Data Cleaning and Analysis
  2. Checking Stationarity (ADF Test)
  3. Transformation
  4. Differencing
  5. Time Series Components
  6. Finding ACF and PACF
  7. ARIMA Modeling
  8. Forecast

Dataset used for Forecasting

The dataset is been downloaded from Kaggle. It includes total 144 unique Values and 2 attributes all together.

Link for Dataset  : Click here for the Dataset.

Now, Lets procced with the code.

Importing all the required Python libraries for the analysis

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

 Reading the dataset file and checking the size of file

data=pd.read_csv(r'AirPassengers.csv')
data.shape

OUTPUT: (144,2)

1) Data Cleaning and Analysis

Now that we have our dataset loaded, let’s start with the data-preprocess

Creating the ‘Date’ as Index for data and viewing the dataset.

data['Month']=pd.to_datetime(data['Month'], infer_datetime_format=True)
data=data.set_index(['Month'])
print(data.head())
print(data.tail())

OUTPUT:

Dataset (First & last 5 Passenger detail count )

Dataset (First & last 5 Passenger detail count )

 

In the Image, we can see the datasets 2 attributes with first 5  and last 5 unique values.

Now, moving forward let’s Visualize the Time Series plot for the number of Air Passengers.

plt.figure(figsize=(20,10))
plt.xlabel("Month")
plt.ylabel("Number of Air Passengers")
plt.plot(data)

OUTPUT:

Time Series plot for the number of Air Passengers.

Time Series plot for the number of Air Passengers.

 

Now, we will be checking the stationarity. We already know that a stationary time series is one in which the mean and variance remain constant across time. Data preparation is done for rolling mean and standard deviation analysis to determine whether a particular time series is stationary

2) Checking Stationarity

rolmean=data.rolling(window=12).mean()
rolstd=data.rolling(window=12).std()
print(rolmean.head(15))
print(rolstd.head(15))

OUTPUT:

Rolling Mean and Standard Deviation

Calculated Rolling Mean and Standard Deviation.

 

In the above image, we see the calculated Rolling mean and standard deviation. Now, let’s Plot the Rolling Mean and Standard Deviation.

plt.figure(figsize=(20,10))
actual=plt.plot(data, color='red', label='Actual')
mean_6=plt.plot(rolmean, color='green', label='Rolling Mean') 
std_6=plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

OUTPUT:

Plotting the Rolling Mean and Standard Deviation along with actual values.

Plotting the Rolling Mean and Standard Deviation along with actual values.

By looking at the above plot, we conclude that it is non-stationary because mean and variance is not constant.

2) Checking Stationarity (ADF Test)

Now, we will be using another approach, the ADF (Augmented Dickey-Fuller Test) to check stationarity.

Assume ADF has : Null hypothesis – Time Series is non-stationary

from statsmodels.tsa.stattools import adfuller
print('Dickey-Fuller Test: ')
dftest=adfuller(data['Passengers'], autolag='AIC')
dfoutput=pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Obs'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)

OUTPUT:

ADF test results.

ADF test results.

  • From above ADF test, we fail to reject the null hypothesis, since p-value is greater than 0.05
  • Now we take log transformation to make our Time series stationary and plot visual for it.
plt.figure(figsize=(20,10))
data_log=np.log(data)
plt.plot(data_log)

OUTPUT:

Visualizing the Results to check the stationarity.

Visualizing the Results to check the stationarity.

 

We discovered a graph with an ascending trend over time and seasonality.

Now, we  will be testing the Rolling Mean with window 12 on above log transformation.

plt.figure(figsize=(20,10))
MAvg=data_log.rolling(window=12).mean()
MStd=data_log.rolling(window=12).std()
plt.plot(data_log)
plt.plot(MAvg, color='blue')
Non-Stationary Result for the time series data

Non-Stationary Result from the time series data

As a result,  It has concluded non-stationary. We use another method differencing, to make our time series stationary.

data_log_diff=data_log-MAvg
data_log_diff.head(12)

OUTPUT:

Dataset with missing values

Dataset with missing values

We can observe a lot of missing values, so now we will be dropping all the missing values to avoid inefficiency of the model.

data_log_diff=data_log_diff.dropna()
data_log_diff.head()

OUTPUT:

Output of differencing method, after dropping null values

Output of differencing method, after dropping null values

Now we will be defining function for Rolling Mean and Standard Deviation & ADF test

def stationarity(timeseries):
    
    rolmean=timeseries.rolling(window=12).mean()
    rolstd=timeseries.rolling(window=12).std()
    
    plt.figure(figsize=(20,10))
    actual=plt.plot(timeseries, color='red', label='Actual')
    mean_6=plt.plot(rolmean, color='green', label='Rolling Mean') 
    std_6=plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    
    print('Dickey-Fuller Test: ')
    dftest=adfuller(timeseries['Passengers'], autolag='AIC')
    dfoutput=pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Obs'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
stationarity(data_log_diff)

OUTPUT:

Visualization of Stationary data along with ADF Test results.

Visualization of Stationary data along with ADF Test results.

  • The differenced data’s stationarity is being examined
  • We can see from the rolling approach that the mean and standard deviation are constant.
  • We reject the null hypothesis based on the ADF since the p-value is less than 0.05. (significance level)
  • Our differenced data is now stationary after implementing all the transformations and techniques.

3) Transformation

Moving forward towards the transformation step. Firstly, we will be checking for Trend stationarity and then perform exponential transformation on our data

plt.figure(figsize=(20,10))
exp_data=data_log.ewm(halflife=12, min_periods=0, adjust=True).mean()
plt.plot(data_log)
plt.plot(exp_data, color='black')

OUTPUT:

.

.

4) Differencing

A technique for changing a time series dataset is differencing. It can be used to get rid of the series’ so-called temporal reliance on time.

  • Applying differencing to our data since log transformation is non-stationary.
exp_data_diff=data_log-exp_data
stationarity(exp_data_diff)

OUTPUT:

Visualization for Trend stationarity and ADF test results.

Visualization and ADF test after applying differencing.

plt.figure(figsize=(20,10))
data_shift=data_log-data_log.shift()
plt.plot(data_shift)

OUTPUT:

Trend Stationary.

Trend Stationary.

data_shift=data_shift.dropna()
stationarity(data_shift)

OUTPUT:

5) Time Series Components

Now we will be moving towards decomposing the Time Series into its components : Trend, Seasonality and Residual

from statsmodels.tsa.seasonal import seasonal_decompose
decomp=seasonal_decompose(data_log)

trend=decomp.trend
seasonal=decomp.seasonal
residual=decomp.resid

plt.subplot(411)
plt.plot(data_log, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

OUTPUT:

Time Series into its components : Trend, Seasonality and Residual

Visualizing the Time Series into its components : Trend, Seasonality and Residual

Lets now check the stationarity of Time Series components.

decomp_data=residual
decomp_data=decomp_data.dropna()
stationarity(decomp_data)

OUTPUT:

6) Finding ACF and PACF

Moving forward we will be plotting the ACF and PACF to find q and p value.  We get q and p both 2 from the graph

from statsmodels.tsa.stattools import acf, pacf

lag_acf=acf(data_shift, nlags=20)
lag_pacf=pacf(data_shift, nlags=20, method='ols')

plt.figure(figsize=(20,10))
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='green')
plt.axhline(y=-1.96/np.sqrt(len(data_shift)),linestyle='--',color='green')
plt.axhline(y=1.96/np.sqrt(len(data_shift)),linestyle='--',color='green')
plt.title('Autocorrelation Function')

plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='green')
plt.axhline(y=-1.96/np.sqrt(len(data_shift)),linestyle='--',color='green')
plt.axhline(y=1.96/np.sqrt(len(data_shift)),linestyle='--',color='green')
plt.title('Partial Autocorrelation Function')

OUTPUT:

plotting the ACF and PACF

Plotting the ACF and PACF

7) Building the ARIMA Model

Lets Finally Build the ARIMA model with p=2, q=2 and I=1.

from statsmodels.tsa.arima_model import ARIMA

plt.figure(figsize=(20,10))
model=ARIMA(data_log, order=(2,1,2))
results=model.fit(disp=-1)
plt.plot(data_shift)
plt.plot(results.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results.fittedvalues-data_shift['Passengers'])**2))
print('plotting ARIMA model')
Plotted ARIMA model

Plotted ARIMA model

  • Predicted values from ARIMA model in difference form.
predictions=pd.Series(results.fittedvalues, copy=True)
print(predictions.head())

OUTPUT:

Predicted values from ARIMA model

Predicted values from ARIMA model

  • Performing Inverse Transformation for differencing, by doing cumulative sum
predictions_cum_sum=predictions.cumsum()
print(predictions_cum_sum.head())

Displaying the predicted value in log scale

predictions_log=pd.Series(data_log['Passengers'].ix[0], index=data_log.index)
predictions_log=predictions_log.add(predictions_cum_sum,fill_value=0)
predictions_log.head()
Predicted value in log Scale

Predicted value in log Scale

Now we will be taking out log transformation and finally visualizing the actual vs predicted value graph

predictions_ARIMA=np.exp(predictions_log)
plt.figure(figsize=(20,10))
plt.plot(data)
plt.plot(predictions_ARIMA)
Actual vs Predicted value graph after removing log transformation.

Actual vs Predicted value graph after removing log transformation.

8) Forecasting

Now we will be plotting the visual for forecast of next 10 years with 95% confidence interval.

rcParams['figure.figsize']=20,10
results.plot_predict(1,204)
x=results.forecast(steps=120)

Next 10 years predicted value on log scale

In the above Image ,we can see the Forecasted line of the passenger with 95% confidence interval. Further, lets print the predicted values on log scale.

x[0]
Next 10 years Predicted value on log scale

Next 10 years Predicted value on log scale

  • Actual predicted value for next 10 years, after taking out log transformation
np.exp(x[0])
Actual predicted value for next 10 years, after taking out log transformation

Actual predicted value for next 10 years, after taking out log transformation

 

Conclusion of Forecasting Air Passengers count using Time series Analysis

In this tutorial, we have successfully learned and forecasted Air passenger count using Time Series Analysis. Training has been done with an ARIMA model. We have successfully achieved a good accuracy rate after forecasting. The predicted values are approximately similar to the actual values. In the future, This model can be modified and can be utilized for forecasting in various scenarios.

I hope you all enjoyed this tutorial. Happy Learning 🙂

Want to add your thoughts? Need any further help? Leave a comment below and I will get back to you ASAP 🙂

For further reading:

Leave a Reply

Your email address will not be published. Required fields are marked *