
Term Project Report

American Airlines Flight Arrival Delay Analysis

Submitted by: Hyder Murtaza

Abstract

The purpose of this project is to apply advanced multivariate data analysis techniques, using SIMCA software, to determine the major factors that delay the arrival of American Airlines flights flown over the United States. The data were provided by the Research and Innovative Technology Administration (RITA) and the Bureau of Transportation Statistics (BTS) for the year 2008. Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression were applied to the data to examine the relationships between the various factors.

Problem and Objective

The purpose of this analysis was to determine the critical factors that play a major role in the delay of flights for a major carrier, American Airlines (AA). Along with the daily reported data for flight arrivals and departures, the origin and destination airports, the day of the month and other miscellaneous factors were included to examine the minimum and maximum flight arrival delays.

Some of the problems we are going to address are:

Are some days in the month more prone to delays than the others?
Relationships between delays (PCA Analysis)
Weather factors: blizzards and severe weather
Are some airports more prone to delays than others? (PLS Analysis)
Are there differences between flying into an airport and flying out?

Data Description
The data were provided by the Research and Innovative Technology Administration (RITA) and the Bureau of Transportation Statistics (BTS). Arrival and departure details were recorded for American Airlines flights flown during June 2008. The hourly weather details for each airport were obtained from Weather Underground at http://www.wunderground.com

Total number of observations: 12410

Variables for the data: Day of Month, Departure Time, Arrival Time, Arrival Delay, Departure

Delay, Origin, Destination, Distance, Taxi In, Taxi Out, Carrier Delay, Weather Delay, NAS

Delay, Security Delay, Late Aircraft Delay

Omitted Variable from Data: Cancelled Flight

The data set is divided into two sets.

Set                  Observations   Period
Data Set             12410          Whole month
Training Data Set    9517           First 3 weeks
Test Data Set        2893           Last week
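The split above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in SIMCA: the record layout (a dict with a `DayofMonth` key) is hypothetical, and the cutoff at day 21 is an assumption, since the report only says "first 3 weeks" and "last week".

```python
# Assumption: "first 3 weeks" means days 1-21 of the month.
TRAIN_LAST_DAY = 21

def split_by_day(flights, train_last_day=TRAIN_LAST_DAY):
    """Return (training, test) lists split on the day of the month."""
    training = [f for f in flights if f["DayofMonth"] <= train_last_day]
    test = [f for f in flights if f["DayofMonth"] > train_last_day]
    return training, test

# Tiny made-up sample, used only to show the call.
sample = [{"DayofMonth": d, "ArrDelay": 0} for d in (1, 14, 21, 22, 30)]
train, test = split_by_day(sample)
print(len(train), len(test))
```

Splitting by date rather than at random keeps the test week strictly after the training weeks, which matches how the model would be used for forecasting.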

Preprocessing and Analysis methods

PLS Analysis:

The data were centered and scaled to unit variance (UV scaling) by the SIMCA software before the regression techniques were applied. The training data set was first analyzed with PLS regression. The PLS model was built with 19 X variables (inputs) and 1 Y variable (the response, Arrival Delay), using a total of 9517 observations.
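As a rough illustration of the preprocessing step, the sketch below centers each column and scales it to unit variance with NumPy. The function names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption about how the scaling is defined.

```python
import numpy as np

def uv_scale(X):
    """Center each column and scale it to unit variance (sample std, ddof=1)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std, mean, std

def apply_uv_scale(X_new, mean, std):
    """Scale new data (for example the test set) with the training parameters."""
    return (np.asarray(X_new, dtype=float) - mean) / std
```

The mean and standard deviation are returned so that the test set can be scaled with the training-set parameters rather than its own.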

R2(Y) was found to be 0.969, which indicates that the model is a good fit with good reproducibility. Q2 was found to be 0.966, which indicates that the model also has good predictability. The first score plot was then analyzed along with the DModX plot. Some observations fell outside the ellipse of the score plot, and these corresponded to the highest peaks on the DModX plot. Further analysis showed that the outliers were caused by extreme values, data-entry errors and missing values of the response variable. The outliers outside the ellipse were therefore excluded from the data set, and the model was refit with 9262 observations. After omission of the outliers, the final R2(Y) was 0.96 and Q2 was 0.958.
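To make the two statistics concrete, here is a minimal sketch of a single-response PLS fit (the NIPALS PLS1 algorithm) with R2(Y) computed on the fitted data and Q2 computed by cross-validation. It is not SIMCA's implementation: the function names are hypothetical, the 7 cross-validation groups are an assumption, and X is assumed to have no constant columns.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """NIPALS PLS1 on centered X (n x p) and centered y (n,); y_hat = X @ b."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = (yk @ t) / tt               # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def fit_predict(X_train, y_train, X_new, n_components):
    """UV scale X with training parameters, fit PLS1, predict y for X_new."""
    x_mean, x_std = X_train.mean(axis=0), X_train.std(axis=0, ddof=1)
    y_mean = y_train.mean()
    b = pls1_coefficients((X_train - x_mean) / x_std, y_train - y_mean, n_components)
    return ((X_new - x_mean) / x_std) @ b + y_mean

def r2y(X, y, n_components):
    """Fraction of Y variation explained on the data used for fitting."""
    y_hat = fit_predict(X, y, X, n_components)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def q2(X, y, n_components, n_folds=7):
    """Cross-validated counterpart of R2(Y): 1 - PRESS / SS_total."""
    folds = np.arange(len(y)) % n_folds
    press = 0.0
    for k in range(n_folds):
        held_out = folds == k
        y_hat = fit_predict(X[~held_out], y[~held_out], X[held_out], n_components)
        press += np.sum((y[held_out] - y_hat) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)
```

R2(Y) is computed on the same observations the model was fitted to, while Q2 only scores observations that were held out, so a Q2 close to R2(Y) suggests the model is not overfitted.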

The second plot we analyze is the loadings plot. Variables situated near the response variable, in the 1st quadrant, are positively correlated with it, while variables far from it, in the 3rd quadrant, are negatively correlated. Departure Delay, Late Aircraft Delay, NAS Delay and Carrier Delay are positively correlated with the Arrival Delay of a flight, while Actual Arrival Time and flights departing from LAX and SFO are negatively correlated with Arrival Delay. This can also be confirmed from the coefficients plot. The 10 most positively and 10 most negatively correlated variables are listed below.

10 positively          M4.CoeffCS[3]    10 negatively          M4.CoeffCS[3]
correlated variables   (ArrDelay)       correlated variables   (ArrDelay)
DepDelay               0.563626         ActualArrT             -0.0458861
LateAircra             0.287079         Origin(LAX             -0.0277715
NASDelay               0.25478          Origin(SFO             -0.0274042
CarrierDel             0.240652         ScheduledD             -0.0260928
TaxiOut                0.195736         ScheduledE             -0.0198144
WeatherDel             0.109635         Origin(SAN             -0.01905
TaxiIn                 0.0875963        Origin(LAS             -0.0183761
ActualElap             0.0668079        Distance               -0.017635
AirTime                0.0202531        ScheduledA             -0.0175892
Time of the day        0.0194661        Origin(LGA             -0.015081

Another plot we analyze is the VIP plot. The VIP (Variable Importance for the Projection) plot

summarizes the importance of the variables both to explain X and to correlate to Y. VIP-values

larger than 1 indicate important X-variables, and values lower than 0.5 indicate unimportant
X-variables. The interval between 1 and 0.5 is a gray area, where the importance level depends

on the size of the data set.
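For readers who want to see how such values can be obtained, the sketch below uses one commonly quoted VIP formula for a single-response PLS model. Whether SIMCA uses exactly this definition is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def vip_scores(W, T, q):
    """VIP per X variable.

    W: (p x A) PLS weights, T: (n x A) scores, q: (A,) y loadings,
    for p variables, n observations and A components.
    """
    W = np.asarray(W, dtype=float)
    T = np.asarray(T, dtype=float)
    q = np.asarray(q, dtype=float)
    p = W.shape[0]
    ssy = q ** 2 * np.sum(T ** 2, axis=0)      # Y sum of squares explained per component
    W_unit = W / np.linalg.norm(W, axis=0)     # make each weight column unit length
    return np.sqrt(p * ((W_unit ** 2) @ ssy) / ssy.sum())
```

Under this definition the squared VIP values average to 1 across the variables, which is one way to motivate 1 as the threshold for an important variable.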

VIP (Variable Importance for the Projection)

Var ID (Primary)     M4.VIP[3]
DepDelay             8.49164
LateAircraftDelay    5.12093
NASDelay             3.93501
TaxiOut              3.48781
CarrierDelay         2.96009
Time of the Day      2.72736
ScheduledDepTime     2.24678
ScheduledArrTime     1.88697
WeatherDelay         1.56686
TaxiIn               1.19935
ActualElapsedTime    1.11323
Dest(ORD)            1.04323
Origin(ORD)          0.992761
Origin(DFW)          0.966236
Dest(DFW)            0.907042
Day(Saturday)        0.902471
ActualArrTime        0.866032
Origin(LGA)          0.825574
Origin(JFK)          0.74186

Model Validation

To assess model validity, we use either Q2 or the permutation test. For the permutation test, the model is considered valid if either of the following criteria is met:

All blue Q2 values to the left are lower than the original points to the right, or

The blue regression line of the Q2 points intersects the vertical axis (on the left) at or below zero.

From the permutation test graph (in the Graphs section), it can be clearly seen that our model is valid.
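The logic of the permutation test can be sketched as follows. To keep the example short, an ordinary least-squares model is swapped in for the PLS model, so this illustrates the test itself rather than the SIMCA procedure. The function names, the 7 cross-validation groups and the 20 permutations are all assumptions.

```python
import numpy as np

def cv_q2(X, y, n_folds=7):
    """Cross-validated Q2 for a least-squares model with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    folds = np.arange(n) % n_folds
    press = 0.0
    for k in range(n_folds):
        out = folds == k
        coef, *_ = np.linalg.lstsq(A[~out], y[~out], rcond=None)
        press += np.sum((y[out] - A[out] @ coef) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def permutation_test(X, y, n_permutations=20, seed=0):
    """Refit on shuffled y and compare the resulting Q2 values with the original."""
    rng = np.random.default_rng(seed)
    q2_original = cv_q2(X, y)
    corr, q2_vals = [1.0], [q2_original]       # the original model sits at correlation 1
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)
        corr.append(abs(np.corrcoef(y, y_perm)[0, 1]))
        q2_vals.append(cv_q2(X, y_perm))
    # Regression line of Q2 against correlation; the intercept is its value at 0.
    slope, intercept = np.polyfit(corr, q2_vals, 1)
    all_below = all(v < q2_original for v in q2_vals[1:])
    return q2_original, intercept, all_below
```

The two returned checks correspond to the two criteria: `all_below` says whether every permuted Q2 is lower than the original, and `intercept` is where the Q2 regression line meets the vertical axis.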

Prediction of the Test set using Training set


After model validation, we use the model built on the training set (first three weeks) to predict the values of the test set (last week) and compare the predicted results with the actual (already known) results. The results for some of the observations are as follows. The Y-predicted plot is also provided.

Obs ID    Original Value    Predicted Value
67        10                20.2979
79        1                 1.87349
101       25                22.9994
149       15                9.33487
235       10                6.2575
272       45                48.1548
291       34                28.8161
302       75                65.4218
322       46                25.17
375       -2                3.84978
416       39                55.1378
449       44                39.8593
558       26                18.6861
582       120               112.27
594       79                87.6572
609       139               120.599
649       54                40.4522
672       82                73.6369
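One way to summarize the agreement between the original and predicted values is to compute the mean absolute error and the root mean square error over the listed observations. The sketch below does this for the 18 rows shown above. These summary statistics are not part of the original analysis, and they describe only this sample, not the full test set.

```python
import math

# (Obs ID, original value, predicted value) copied from the table above.
rows = [
    (67, 10, 20.2979), (79, 1, 1.87349), (101, 25, 22.9994),
    (149, 15, 9.33487), (235, 10, 6.2575), (272, 45, 48.1548),
    (291, 34, 28.8161), (302, 75, 65.4218), (322, 46, 25.17),
    (375, -2, 3.84978), (416, 39, 55.1378), (449, 44, 39.8593),
    (558, 26, 18.6861), (582, 120, 112.27), (594, 79, 87.6572),
    (609, 139, 120.599), (649, 54, 40.4522), (672, 82, 73.6369),
]

errors = [predicted - original for _, original, predicted in rows]
mae = sum(abs(e) for e in errors) / len(errors)              # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))   # root mean square error
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```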

PCA Analysis:

A PCA model was created from the training set. All factors other than the delays were eliminated, leaving only 7 input variables to analyze. The scores plot was first examined along with the DModX plot. The outliers were removed and the model was autofitted again. R2 was 0.573, which means the model has good reproducibility, while Q2 was 0.277, which means the model has poor predictability. The loadings plot was then examined to find the relationships between the delays.

Arrival delay and departure delay lie close together and form one group, which means they are closely related. Weather delay and NAS delay form another group, which is also understandable. Security delay does not show any relationship with the other delay variables. Per the XY variables, departure delay, arrival delay and carrier delay are well-modeled parameters, while all other parameters are poorly modeled.
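The PCA steps described above can be sketched with a singular value decomposition. This is a simplified illustration under stated assumptions: X is taken to be already centered and UV scaled, the function names are hypothetical, and the residual distance is only a stand-in for SIMCA's DModX, which applies additional normalization.

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA of a centered, scaled matrix X (n x p) via SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]         # one row per observation
    loadings = Vt[:n_components].T                          # one row per variable
    r2x = np.sum(s[:n_components] ** 2) / np.sum(s ** 2)    # fraction of X variation explained
    return scores, loadings, r2x

def residual_distance(X, scores, loadings):
    """Per-observation residual standard deviation (simplified stand-in for DModX)."""
    E = X - scores @ loadings.T
    dof = X.shape[1] - loadings.shape[1]
    return np.sqrt(np.sum(E ** 2, axis=1) / dof)
```

Variables whose rows in `loadings` lie close together vary together in the model, which is how groups such as arrival delay with departure delay are read from a loadings plot. Observations with a large residual distance are candidates for the outlier check.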

Results:

Delay by:

1. Day of the week
Saturday and Tuesday are the best days to travel to avoid delays. Thursday is a bad day for delays.

2. Airport
IAD (Washington Dulles International Airport) is the worst. ORD (Chicago) is also not good, but it has high volume. LAX (Los Angeles) is good: it experienced the fewest delayed arrivals. SFO is relatively good, with relatively small delays.

3. Weather
Weather plays a significant role in delays, through any kind of high precipitation, high winds or reduced visibility.

4. Time of the day
Delays increase as the day progresses. Flights later in the day face more delays than flights in the early morning or by the afternoon.

5. Arrival and departure
Arrival delay is positively related to departure delay. Arrival delay is affected more by departure delay than by any other factor, which causes a ripple effect.
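Comparisons such as delay by day of the week, airport or time of day can be cross-checked outside SIMCA by averaging the arrival delay within each group. The sketch below shows one way to do that. The record layout and function name are hypothetical, and the sample uses made-up placeholder airport codes and values rather than the project data.

```python
from collections import defaultdict

def mean_delay_by(flights, key):
    """Average ArrDelay for each distinct value of flights[i][key]."""
    totals = defaultdict(lambda: [0.0, 0])      # group -> [sum of delays, count]
    for f in flights:
        totals[f[key]][0] += f["ArrDelay"]
        totals[f[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Made-up records, used only to show the call.
sample = [
    {"Origin": "AAA", "ArrDelay": 10},
    {"Origin": "AAA", "ArrDelay": 30},
    {"Origin": "BBB", "ArrDelay": -5},
]
result = mean_delay_by(sample, "Origin")
print(result)
```

The same function can be called with a day-of-week or hour-of-day key to produce the other comparisons.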

Conclusion:

The model fits the training data set well and has good predictability. The combination of the training and test data sets, after the elimination of outliers, can be used to predict future flight delays for American Airlines.

References:
1) http://stat-computing.org/dataexpo/2009/the-data.html

Graphs:

PLS: Score plot with outliers highlighted


PLS: DModX plot with outliers marked

PLS: Loadings plot after outlier omission

PLS: Coefficient plot:


Permutation Test Graph:

PLS: Y original v/s Y predicted for test set

PCA: Scores plot with highlighted outliers


PCA: DModX

PCA: Loadings plots after omission of outliers


PCA: X contribution
