Abstract
The purpose of this project is to apply advanced multivariate data analysis techniques using
SIMCA software to determine the major factors that delay the arrival of American Airlines
flights flown over the United States. The data was provided by the Research and Innovative
Technology Administration (RITA) and the Bureau of Transportation Statistics (BTS) for the year
2008. Principal Component Analysis and Partial Least Squares Regression techniques were applied.
The purpose of analyzing this data was to determine the critical factors that play a
major role in the delays of a major carrier, American Airlines (AA). Along with the daily reported
data for flight arrivals and departures, the origin and destination airports, the day of the month,
and other miscellaneous factors were included to examine the minimum and maximum flight arrival
delays.
Are some days in the month more prone to delays than the others?
Relationships between delays (PCA Analysis)
Weather factors: blizzards and severe weather
Are some airports more prone to delays than others? (PLS Analysis)
Are there differences between flying into an airport and flying out?
Data Description
The data was provided by the Research and Innovative Technology Administration (RITA) and the
Bureau of Transportation Statistics (BTS). Arrival and departure details for American Airlines
flights flown in June 2008 were recorded. The hourly weather details for each airport were
Variables for the data: Day of Month, Departure Time, Arrival Time, Arrival Delay, Departure
Delay, Origin, Destination, Distance, Taxi In, Taxi Out, Carrier Delay, Weather Delay, NAS
PLS Analysis:
The data was UV scaled and centered by the SIMCA software before the regression
techniques were applied. The training dataset was first analyzed with the PLS regression technique.
The PLS model was built with 19 X variables (inputs) and 1 Y variable (the response, Arrival Delay).
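As a rough illustration of what UV (unit variance) scaling and centering do to a single variable, here is a minimal Python sketch. It is not SIMCA's implementation: the function name `uv_scale`, the use of the sample standard deviation, and the delay values are all assumptions made for illustration.

```python
from statistics import mean, stdev

def uv_scale(column):
    """Center a column to mean 0 and scale it to unit variance.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    m = mean(column)
    s = stdev(column)
    if s == 0:
        # A constant column carries no variation; return zeros instead of dividing by 0
        return [0.0 for _ in column]
    return [(value - m) / s for value in column]

# Hypothetical departure delays in minutes (illustrative values, not from the data set)
dep_delay = [5.0, 12.0, 0.0, 47.0, 3.0]
scaled = uv_scale(dep_delay)
```

After scaling, every variable has mean 0 and standard deviation 1, so a variable measured in large units (such as Distance) does not dominate the model simply because of its scale.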
The R2(Y) was found to be 0.969, which indicates that the model is a good fit with good
reproducibility. The Q2 was found to be 0.966, which indicates that the model also has good
predictive ability. The first score plot was analyzed along with the DModX plot. We found some
observations outside the ellipse of the score plot, and these correspond to the highest
peaks on the DModX plot. On further analysis, the outliers were traced to
extreme values, data entry errors, and missing values of the response variable. After this
analysis, the outliers outside the ellipse were excluded from the data set and the model
was refit with 9262 observations. The final R2(Y) after omission of the outliers was 0.96 and the
Q2 was 0.958.
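To make the difference between R2(Y) and Q2 concrete, the sketch below computes both for a one-predictor least-squares line instead of a full PLS model. R2 uses the residuals of the model fitted to all observations, while Q2 uses predictions for observations held out during cross-validation. The helper names, the 7-fold split, and the delay values are assumptions for illustration only.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2_and_q2(x, y, n_folds=7):
    """Return (R2, Q2): goodness of fit and cross-validated goodness of prediction."""
    my = mean(y)
    ss_total = sum((b - my) ** 2 for b in y)

    # R2: residual sum of squares of the model fitted on all observations
    slope, icpt = fit_line(x, y)
    rss = sum((b - (slope * a + icpt)) ** 2 for a, b in zip(x, y))

    # Q2: prediction error sum of squares (PRESS) over held-out folds
    press = 0.0
    for k in range(n_folds):
        train = [i for i in range(len(x)) if i % n_folds != k]
        held_out = [i for i in range(len(x)) if i % n_folds == k]
        s, c = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (s * x[i] + c)) ** 2 for i in held_out)

    return 1 - rss / ss_total, 1 - press / ss_total

# Hypothetical departure and arrival delays in minutes (not from the data set)
dep = [0, 5, 10, 15, 20, 30, 45, 60, 75, 90, 120, 150, 180, 200]
arr = [3, 3, 14, 10, 21, 27, 51, 56, 77, 89, 125, 144, 183, 198]
r2, q2 = r2_and_q2(dep, arr)
```

A Q2 close to R2, as reported above, suggests the model is not overfitted; a Q2 far below R2 would suggest the opposite.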
The second plot we analyzed is the loadings plot. In this plot, variables situated near the
response variable, in the 1st quadrant, are positively correlated with it, while
variables far from it, in the 3rd quadrant, are negatively correlated. The variables Departure Delay,
Late Aircraft, NAS Delay, and Carrier Delay are positively correlated with the Arrival Delay
of a flight, while Actual Arrival Time and flights flying from LAX and SFO are
negatively correlated with Arrival Delay. This is also confirmed by the coefficients plot.
Variable           Coefficient    Variable      Coefficient
DepDelay           0.563626       ActualArrT    -0.0458861
LateAircra         0.287079       Origin(LAX    -0.0277715
NASDelay           0.25478        Origin(SFO    -0.0274042
CarrierDel         0.240652       ScheduledD    -0.0260928
TaxiOut            0.195736       ScheduledE    -0.0198144
WeatherDel         0.109635       Origin(SAN    -0.01905
TaxiIn             0.0875963      Origin(LAS    -0.0183761
ActualElap         0.0668079      Distance      -0.017635
AirTime            0.0202531      ScheduledA    -0.0175892
Time of the day    0.0194661      Origin(LGA    -0.015081
Another plot we analyze is the VIP plot. The VIP (Variable Importance for the Projection) plot
summarizes the importance of the variables both to explain X and to correlate to Y. VIP-values
larger than 1 indicate important X-variables, and values lower than 0.5 indicate unimportant
X-variables. The interval between 1 and 0.5 is a gray area, where the importance level depends
Model Validation
To assess model validity, we use either the Q2 or the permutation test. To prove the
All blue Q2 values to the left are lower than the original points to the right, or
the blue regression line of the Q2 points intersects the vertical axis to the left at or below zero.
From the permutation test graph (in the graphs section), it can be clearly seen that our model is
valid.
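The two permutation-test criteria can be sketched in Python as follows. This is a simplified stand-in, not the SIMCA procedure: it uses a one-predictor least-squares line instead of PLS, leave-one-out cross-validation for Q2, and the absolute correlation between the original and permuted responses as the horizontal axis. The function names and the delay values are assumptions for illustration.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def q2_loo(x, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    my = mean(y)
    ss_total = sum((b - my) ** 2 for b in y)
    press = 0.0
    for i in range(len(x)):
        slope, icpt = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + icpt)) ** 2
    return 1 - press / ss_total

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5
    return num / den

def permutation_check(x, y, n_perm=20, seed=0):
    """Refit on shuffled responses and evaluate both validity criteria."""
    rng = random.Random(seed)
    q2_original = q2_loo(x, y)
    corrs, q2s = [], []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break the true X-Y relationship
        corrs.append(abs(pearson(y, y_perm)))
        q2s.append(q2_loo(x, y_perm))
    # Criterion 1: every permuted Q2 is lower than the original Q2
    all_lower = all(q < q2_original for q in q2s)
    # Criterion 2: the line through all Q2 points (the original sits at
    # correlation 1) crosses the vertical axis at or below zero
    _, intercept = fit_line(corrs + [1.0], q2s + [q2_original])
    return q2_original, all_lower, intercept

# Hypothetical departure and arrival delays in minutes (not from the data set)
dep = [0, 5, 10, 15, 20, 30, 45, 60, 75, 90, 120, 150, 180, 200]
arr = [3, 3, 14, 10, 21, 27, 51, 56, 77, 89, 125, 144, 183, 198]
q2_original, all_lower, intercept = permutation_check(dep, arr)
model_is_valid = all_lower or intercept <= 0
```

The idea is that a model fitted to randomly shuffled responses should predict poorly; if it predicts as well as the original model, the original Q2 cannot be trusted.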
(last week) and compare the actual results (already known) with the predicted results. The results
for some of the observations are as follows. The Y predicted plot has also been provided.
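One simple way to summarize how closely predicted arrival delays match the known values is the root mean square error of prediction (RMSEP). The sketch below is an illustration only; the function name and the delay values are assumptions and do not come from the report's test set.

```python
from math import sqrt

def rmsep(actual, predicted):
    """Root mean square error of prediction, in the units of the response."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

# Hypothetical actual and predicted arrival delays in minutes
actual = [12.0, -3.0, 45.0, 0.0, 130.0]
predicted = [10.0, -1.0, 50.0, 2.0, 125.0]

residuals = [a - p for a, p in zip(actual, predicted)]
error = rmsep(actual, predicted)
```

A small RMSEP relative to the spread of the actual delays indicates that the model generalizes to observations it was not trained on.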
PCA Analysis:
A PCA model was created from the training set. All factors other than
the delays were eliminated, leaving only 7 input variables to analyze. The scores plot was
examined first, along with the DModX plot. The outliers were removed and the model was autofit
again. The R2 was 0.573, which means the model has good reproducibility, while the Q2 was 0.277,
which means the model has poor predictability. The loadings plot was observed to find out the
Arrival delay and departure delay lie close together and form one group, which means they are
closely related, while weather delay and NAS delay form another group, which is also
understandable. Security delay does not show any relationship with the other delay variables.
Per the XY variables, departure delay, arrival delay and carrier delay are well modelled
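The grouping of variables in a loadings plot can be illustrated with a small PCA sketch in Python. It extracts only the first principal component of the correlation matrix by power iteration, which is a simplification of what SIMCA does, and the four delay columns hold hypothetical values rather than the project data. All function names are assumptions.

```python
from statistics import mean, stdev

def uv_scale(column):
    """Center to mean 0 and scale to unit (sample) variance."""
    m, s = mean(column), stdev(column)
    return [(value - m) / s for value in column]

def correlation_matrix(columns):
    """Pairwise Pearson correlations between the given columns."""
    scaled = [uv_scale(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in scaled]
            for ci in scaled]

def first_component(matrix, n_iter=200):
    """Leading eigenvector (loadings) and eigenvalue via power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(n_iter):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return v, eigenvalue

# Hypothetical delays in minutes for eight flights (not from the data set)
dep_delay = [0, 10, 25, 5, 60, 90, 15, 40]
arr_delay = [2, 8, 30, 3, 65, 85, 20, 38]
weather_delay = [0, 0, 10, 0, 30, 45, 0, 5]
security_delay = [0, 3, 0, 0, 1, 0, 2, 0]

corr = correlation_matrix([dep_delay, arr_delay, weather_delay, security_delay])
loadings, eigenvalue = first_component(corr)
explained = eigenvalue / len(corr)  # fraction of total variance on component 1
```

Variables with similar loadings, such as the departure and arrival delays in this made-up example, would plot close together, while a variable with a loading near zero is weakly related to the others.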
Results:
Delay by:
The best days to travel and avoid delays are Saturday and Tuesday. Thursday is a bad day for
delays.
2. Airport
IAD (Washington Dulles International Airport) is the worst. ORD (Chicago) is also not good
but has high volume. LAX (Los Angeles) is good. It has experienced the minimum number of
Weather plays a significant role in delays, including any kind of heavy precipitation, high winds,
or reduced visibility.
by departure delay than any other factor and causes a ripple effect.
Conclusion:
The training data set is well fit and the model has good predictability. The combination of
training and test data sets, after the elimination of outliers, can be used to predict future
flight delays for American Airlines.
References:
1) http://stat-computing.org/dataexpo/2009/the-data.html
Graphs: