# Forecasting Air Passengers count using Time series Analysis

Hello readers, In today’s tutorial we will be learning to forecast Air Passengers count by Time Series Analysis using the ARIMA model with the help of Python programming.

## Introduction to Forecasting Air Passengers count using Time series Analysis

Air travel is one of the quickest ways of public transportation for crossing international borders. An airline company collects the data of the number of passengers that have traveled with them on a particular route for the past few years. This data is used to gather insights on passengers travel patterns. With the help of this data they can see and forecast the number of passengers for the next twelve months.

Making this projection could be advantageous to the Airline companies because it will assist them in making important decisions such as –

- What kind of plane should they fly?
- When shall they take off?
- How many flight attendants and pilots do they require?
- How much food should they have on hand at their warehouse?

The forecasting is done on the Time Series data. Training has been done with an ARIMA model. To convert the available time series into a stationary time series, techniques such as moving average and differencing were applied. The suitable values of p, d, and q for ARIMA were established by plotting ACF and PACF.

## Steps involved in Forecasting Air Passengers count using Time series Analysis

- Data Cleaning and Analysis
- Checking Stationarity (ADF Test)
- Transformation
- Differencing
- Time Series Components
- Finding ACF and PACF
- ARIMA Modeling
- Forecast

## Dataset used for Forecasting

The dataset is been downloaded from Kaggle. It includes total 144 unique Values and 2 attributes all together.

Link for Dataset : Click here for the Dataset.

Now, Lets procced with the code.

#### Importing all the required Python libraries for the analysis

import pandas as pd import numpy as np import matplotlib.pylab as plt %matplotlib inline from matplotlib.pylab import rcParams from datetime import datetime import warnings warnings.filterwarnings('ignore')

#### Reading the dataset file and checking the size of file

data=pd.read_csv(r'AirPassengers.csv') data.shape

OUTPUT: (144,2)

### 1) Data Cleaning and Analysis

Now that we have our dataset loaded, let’s start with the data-preprocess

Creating the ‘Date’ as Index for data and viewing the dataset.

data['Month']=pd.to_datetime(data['Month'], infer_datetime_format=True) data=data.set_index(['Month']) print(data.head()) print(data.tail())

OUTPUT:

In the Image, we can see the datasets 2 attributes with first 5 and last 5 unique values.

Now, moving forward let’s Visualize the Time Series plot for the number of Air Passengers.

plt.figure(figsize=(20,10)) plt.xlabel("Month") plt.ylabel("Number of Air Passengers") plt.plot(data)

OUTPUT:

Now, we will be checking the stationarity. We already know that a stationary time series is one in which the mean and variance remain constant across time. Data preparation is done for rolling mean and standard deviation analysis to determine whether a particular time series is stationary

### 2) Checking Stationarity

rolmean=data.rolling(window=12).mean() rolstd=data.rolling(window=12).std() print(rolmean.head(15)) print(rolstd.head(15))

OUTPUT:

In the above image, we see the calculated Rolling mean and standard deviation. Now, let’s Plot the Rolling Mean and Standard Deviation.

plt.figure(figsize=(20,10)) actual=plt.plot(data, color='red', label='Actual') mean_6=plt.plot(rolmean, color='green', label='Rolling Mean') std_6=plt.plot(rolstd, color='black', label='Rolling Std') plt.legend(loc='best') plt.title('Rolling Mean & Standard Deviation') plt.show(block=False)

OUTPUT:

By looking at the above plot, we conclude that it is non-stationary because mean and variance is not constant.

### 2) Checking Stationarity (ADF Test)

Now, we will be using another approach, the ADF (Augmented Dickey-Fuller Test) to check stationarity.

Assume ADF has : **Null hypothesis – Time Series is non-stationary**

from statsmodels.tsa.stattools import adfuller print('Dickey-Fuller Test: ') dftest=adfuller(data['Passengers'], autolag='AIC') dfoutput=pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Obs']) for key,value in dftest[4].items(): dfoutput['Critical Value (%s)'%key] = value print(dfoutput)

OUTPUT:

- From above ADF test, we fail to reject the null hypothesis, since p-value is greater than 0.05
- Now we take log transformation to make our Time series stationary and plot visual for it.

plt.figure(figsize=(20,10)) data_log=np.log(data) plt.plot(data_log)

OUTPUT:

We discovered a graph with an ascending trend over time and seasonality.

Now, we will be testing the Rolling Mean with window 12 on above log transformation.

plt.figure(figsize=(20,10)) MAvg=data_log.rolling(window=12).mean() MStd=data_log.rolling(window=12).std() plt.plot(data_log) plt.plot(MAvg, color='blue')

As a result, It has concluded non-stationary. We use another method differencing, to make our time series stationary.

data_log_diff=data_log-MAvg data_log_diff.head(12)

OUTPUT:

We can observe a lot of missing values, so now we will be dropping all the missing values to avoid inefficiency of the model.

data_log_diff=data_log_diff.dropna() data_log_diff.head()

OUTPUT:

Now we will be defining function for Rolling Mean and Standard Deviation & ADF test

def stationarity(timeseries): rolmean=timeseries.rolling(window=12).mean() rolstd=timeseries.rolling(window=12).std() plt.figure(figsize=(20,10)) actual=plt.plot(timeseries, color='red', label='Actual') mean_6=plt.plot(rolmean, color='green', label='Rolling Mean') std_6=plt.plot(rolstd, color='black', label='Rolling Std') plt.legend(loc='best') plt.title('Rolling Mean & Standard Deviation') plt.show(block=False) print('Dickey-Fuller Test: ') dftest=adfuller(timeseries['Passengers'], autolag='AIC') dfoutput=pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Obs']) for key,value in dftest[4].items(): dfoutput['Critical Value (%s)'%key] = value print(dfoutput)

stationarity(data_log_diff)

OUTPUT:

- The differenced data’s stationarity is being examined
- We can see from the rolling approach that the mean and standard deviation are constant.
- We reject the null hypothesis based on the ADF since the p-value is less than 0.05. (significance level)
- Our differenced data is now stationary after implementing all the transformations and techniques.

### 3) Transformation

Moving forward towards the transformation step. Firstly, we will be checking for Trend stationarity and then perform exponential transformation on our data

plt.figure(figsize=(20,10)) exp_data=data_log.ewm(halflife=12, min_periods=0, adjust=True).mean() plt.plot(data_log) plt.plot(exp_data, color='black')

OUTPUT:

### 4) Differencing

A technique for changing a time series dataset is differencing. It can be used to get rid of the series’ so-called temporal reliance on time.

- Applying differencing to our data since log transformation is non-stationary.

exp_data_diff=data_log-exp_data stationarity(exp_data_diff)

OUTPUT:

- From above graph and ADF test, we conclude that it is Trend stationary

plt.figure(figsize=(20,10)) data_shift=data_log-data_log.shift() plt.plot(data_shift)

OUTPUT:

data_shift=data_shift.dropna() stationarity(data_shift)

OUTPUT:

### 5) Time Series Components

Now we will be moving towards decomposing the Time Series into its components : Trend, Seasonality and Residual

from statsmodels.tsa.seasonal import seasonal_decompose decomp=seasonal_decompose(data_log) trend=decomp.trend seasonal=decomp.seasonal residual=decomp.resid plt.subplot(411) plt.plot(data_log, label='Original') plt.legend(loc='best') plt.subplot(412) plt.plot(trend, label='Trend') plt.legend(loc='best') plt.subplot(413) plt.plot(seasonal, label='Seasonality') plt.legend(loc='best') plt.subplot(414) plt.plot(residual, label='Residuals') plt.legend(loc='best') plt.tight_layout()

OUTPUT:

Lets now check the stationarity of Time Series components.

decomp_data=residual decomp_data=decomp_data.dropna() stationarity(decomp_data)

OUTPUT:

### 6) Finding ACF and PACF

Moving forward we will be plotting the ACF and PACF to find q and p value. We get q and p both 2 from the graph

from statsmodels.tsa.stattools import acf, pacf lag_acf=acf(data_shift, nlags=20) lag_pacf=pacf(data_shift, nlags=20, method='ols') plt.figure(figsize=(20,10)) plt.subplot(121) plt.plot(lag_acf) plt.axhline(y=0,linestyle='--',color='green') plt.axhline(y=-1.96/np.sqrt(len(data_shift)),linestyle='--',color='green') plt.axhline(y=1.96/np.sqrt(len(data_shift)),linestyle='--',color='green') plt.title('Autocorrelation Function') plt.subplot(122) plt.plot(lag_pacf) plt.axhline(y=0,linestyle='--',color='green') plt.axhline(y=-1.96/np.sqrt(len(data_shift)),linestyle='--',color='green') plt.axhline(y=1.96/np.sqrt(len(data_shift)),linestyle='--',color='green') plt.title('Partial Autocorrelation Function')

OUTPUT:

### 7) Building the ARIMA Model

Lets Finally Build the ARIMA model with p=2, q=2 and I=1.

from statsmodels.tsa.arima_model import ARIMA plt.figure(figsize=(20,10)) model=ARIMA(data_log, order=(2,1,2)) results=model.fit(disp=-1) plt.plot(data_shift) plt.plot(results.fittedvalues, color='red') plt.title('RSS: %.4f'% sum((results.fittedvalues-data_shift['Passengers'])**2)) print('plotting ARIMA model')

- Predicted values from ARIMA model in difference form.

predictions=pd.Series(results.fittedvalues, copy=True) print(predictions.head())

OUTPUT:

- Performing Inverse Transformation for differencing, by doing cumulative sum

predictions_cum_sum=predictions.cumsum() print(predictions_cum_sum.head())

Displaying the predicted value in log scale

predictions_log=pd.Series(data_log['Passengers'].ix[0], index=data_log.index) predictions_log=predictions_log.add(predictions_cum_sum,fill_value=0) predictions_log.head()

Now we will be taking out log transformation and finally visualizing the actual vs predicted value graph

predictions_ARIMA=np.exp(predictions_log) plt.figure(figsize=(20,10)) plt.plot(data) plt.plot(predictions_ARIMA)

### 8) Forecasting

Now we will be plotting the visual for forecast of next 10 years with 95% confidence interval.

rcParams['figure.figsize']=20,10 results.plot_predict(1,204) x=results.forecast(steps=120)

In the above Image ,we can see the Forecasted line of the passenger with 95% confidence interval. Further, lets print the predicted values on log scale.

x[0]

- Actual predicted value for next 10 years, after taking out log transformation

np.exp(x[0])

## Conclusion of Forecasting Air Passengers count using Time series Analysis

In this tutorial, we have successfully learned and forecasted Air passenger count using Time Series Analysis. Training has been done with an ARIMA model. We have successfully achieved a good accuracy rate after forecasting. The predicted values are approximately similar to the actual values. In the future, This model can be modified and can be utilized for forecasting in various scenarios.

I hope you all enjoyed this tutorial. Happy Learning

Want to add your thoughts? Need any further help? Leave a comment below and I will get back to you ASAP

For further reading:

## Leave a Reply