Data Analysis of U.S. Airlines On-time
Performance
Yanxiang Zhu, Nilesh Padwal, Mingxuan Li
Finished by June 27th, 2014
Contents
1 Introduction 2
1.1 Background and Problem Description . . . . . . . . . . . . . . . 2
1.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Collecting data 3
3 Preprocessing Data 3
4 Variables Description 4
5 Association Rule 9
6 Cluster Analysis 15
6.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6.2 Determine K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.3.1 Pam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.3.2 Kmeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7 Decision Tree 24
7.1 Categorize Variable . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.2 Rpart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.3 Ctree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
8 Random Forest 31
9 Classification 35
9.1 knn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
9.2 Processing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
9.3 SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
INFO7374 Data Science Final Project
10 Processing Data 36
11 Conclusion 40
12 Limitation 41
13 Future Work 42
1 Introduction
1.1 Background and Problem Description
In the airline industry, carriers routinely struggle to get planes to the gate on time. The shared challenge is to improve the quality of airline on-time performance. Beyond carriers' services and baggage policies, saving passengers' time in the air is arguably even more important. Our goal is therefore to uncover meaningful and valuable relationships in the dataset using data mining techniques such as cluster analysis, association rules, and decision trees.
1.2 Dataset Description
The dataset is a collection of airline data from the Research and Innovative Technology Administration (RITA), containing detailed information on every U.S. flight from 1987 to 2008. It is a huge resource of 29 variables, including Destination, Origin, Arrival time, Departure time, and so on; any flight can be tracked through these statistical records. We must note that, owing to the limited processing power of our computers, we can fetch only part of the whole data (all U.S. flights over 22 years) to process and analyze. Our selected dataset still has millions of observations, which is certainly enough to obtain satisfying results. Here is a descriptive list of the useful variables.
1. DayofMonth December 1st to December 31st.
2. DayOfWeek 1 refers to Monday and in a similar way, 7 refers to Sunday.
3. DepTime Actual departure time
4. ArrTime Actual arrival time
5. CRSDepTime Scheduled departure time
6. CRSArrTime Scheduled arrival time
7. UniqueCarrier Unique carrier code
8. FlightNum Flight number
9. ActualElapsedTime In minutes
10. CRSElapsedTime In minutes
11. AirTime In minutes
12. ArrDelay Arrival delay, in minutes
13. DepDelay Departure delay, in minutes
14. Origin Origin IATA airport code
15. Dest Destination IATA airport code
16. Distance In miles
According to historical records of U.S. air carriers' on-time operations, 2008 appears to be an interesting and unusual period for the airline industry: the on-time percentage was 76.0%, and it then rose to 79.5% in 2009. That is why we chose this turning point, to find out what lies behind the headline numbers and should not be ignored.
2 Collecting data
The dataset we use contains all commercial flights within the USA in 2008. The
dataset is downloaded from http://stat-computing.org/dataexpo/2009.
The dataset contains nearly 10 million records and takes about 700 MB of disk space.
file.name <- paste(2008, "csv.bz2", sep = ".")
if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/", 2008, ".csv.bz2",
        sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name)
}
To import the data into our workspace, we use the read.csv function and store the dataset in d.
d <- read.csv("2008.csv")
3 Preprocessing Data
Since the analysis requires a well-structured dataset, we omit the NA values. Due to the limited processing capability of our computers, we also decide to work only with data from December 2008. That subset still has 1,524,735 observations of 29 variables, which we consider enough to obtain good analysis results from such a large-scale dataset.
d = subset(d, Month == "12")
d = na.omit(d)
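A toy illustration (not the project data) of why na.omit shrinks the dataset so sharply: it drops every row that contains at least one NA, in any column.

```r
df <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
clean <- na.omit(df)   # drops rows 2 and 3, each of which has an NA
nrow(clean)            # 1
```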
After that, we also remove some columns that we consider not useful for our study, dropping them directly from the original dataset.
d = d[, -20:-29]
On the other hand, since we have already decided to use only the December 2008 data, the Year and Month columns become useless.
d = d[, -1]
d = d[, -1]
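As an aside, dropping columns by name instead of position is less fragile if the column order ever changes; a toy sketch with a few assumed column names (not the real 29-column dataset):

```r
df <- data.frame(Year = 2008, Month = 12, DayofMonth = 3, ArrDelay = 41)
df <- df[, !(names(df) %in% c("Year", "Month"))]   # drop by name, not position
names(df)  # "DayofMonth" "ArrDelay"
```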
So far, our dataset contains 168,647 records with 17 variables.
str(d)
## 'data.frame': 168647 obs. of 17 variables:
## $ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...
## $ DayOfWeek : int 3 3 3 3 3 3 3 3 3 3 ...
## $ DepTime : int 1126 1859 1256 1925 2002 1716 1620 1807 1930 1004 ...
## $ CRSDepTime : int 1045 1825 1240 1900 1940 1610 1555 1725 1905 1005 ...
## $ ArrTime : int 1241 1925 1458 2120 2249 2054 1826 1910 2041 1130 ...
## $ CRSArrTime : int 1200 1900 1435 2100 2230 1950 1800 1845 2020 1115 ...
## $ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 1
## $ FlightNum : int 2717 1712 294 2776 623 586 1259 548 619 1152 ...
## $ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3796 2127 3943 3316
## $ ActualElapsedTime: int 75 86 62 55 107 158 186 63 71 86 ...
## $ CRSElapsedTime : int 75 95 55 60 110 160 185 80 75 70 ...
## $ AirTime : int 55 73 45 46 93 140 177 50 56 51 ...
## $ ArrDelay : int 41 25 23 20 19 64 26 25 21 15 ...
## $ DepDelay : int 41 34 16 25 22 66 25 42 25 -1 ...
## $ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 82 157 160 175 177 181 2
## $ Distance : int 349 487 289 332 718 1121 1111 328 328 321 ...
4 Variables Description
After importing the dataset, the variables associated with each observation were
explored further. The names of variables were listed and described.
1. DayofMonth December 1st to December 31st.
summary(d$DayofMonth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 11.0 18.0 17.2 23.0 31.0
2. DayOfWeek 1 refers to Monday and, in a similar way, 7 refers to Sunday.
summary(d$DayOfWeek)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 4.00 3.74 5.00 7.00
3. DepTime Actual departure time
summary(d$DepTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1120 1510 1470 1840 2400
Departure time is another key factor we examine: we want to know which time of day is best for flying.
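Note that DepTime is stored as an hhmm integer (1126 means 11:26), so arithmetic summaries such as the mean above are distorted. A hypothetical helper (not part of the original analysis) converts these codes to minutes after midnight, which makes them comparable:

```r
hhmm_to_min <- function(t) (t %/% 100) * 60 + t %% 100   # hypothetical helper
hhmm_to_min(1126)  # 686 minutes after midnight (11:26)
hhmm_to_min(5)     # 5 (CRS times like 0005 print as 5)
```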
4. ArrTime Actual arrival time
summary(d$ArrTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1230 1640 1560 2010 2400
5. CRSDepTime Scheduled departure time
summary(d$CRSDepTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 1040 1420 1400 1750 2360
6. CRSArrTime Scheduled arrival time
summary(d$CRSArrTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1230 1620 1580 1950 2360
7. UniqueCarrier Unique carrier code
carrier = data.frame(d$UniqueCarrier)
qplot(x = d$UniqueCarrier, data = carrier, fill = d$UniqueCarrier)
[Bar chart: number of flights by UniqueCarrier (counts up to about 30,000), colored by carrier code: 9E, AA, AS, B6, CO, DL, EV, F9, FL, HA, MQ, NW, OH, OO, UA, US, WN, XE, YV.]
Southwest Airlines (WN) operated the most flights in the U.S. in 2008; its flight count even exceeds the sum of SkyWest and American Airlines. We will also help you find out which airline to choose if you want to avoid delays.
8. FlightNum Flight number
summary(d$FlightNum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 658 1680 2360 3590 9740
9. ActualElapsedTime In minutes
summary(d$ActualElapsedTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18 88 126 144 177 790
10. CRSElapsedTime In minutes
summary(d$CRSElapsedTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26 82 116 135 165 660
11. AirTime In minutes
summary(d$AirTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 60 93 112 141 647
12. ArrDelay Arrival delay, in minutes
summary(d$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 24.0 41.0 62.6 77.0 1660.0
The arrival delay is our target variable. The median is 41 minutes, which means the delay problem is severe. We are going to find out which factors cause the delay.
13. DepDelay Departure delay, in minutes
summary(d$DepDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -34.0 15.0 35.0 53.5 71.0 1600.0
14. Origin Origin IATA airport code
summary(d$Origin)
## ATL ORD DEN DFW DTW PHX EWR IAH LAS
## 12232 11020 7004 6208 4984 4353 4333 4269 4020
## MSP LAX JFK SLC SFO SEA CLT BOS PHL
## 4004 3837 3409 3375 3370 2992 2977 2841 2676
## MDW MCO CVG BWI LGA SAN DCA IAD MEM
## 2546 2481 2457 2383 2272 1832 1768 1714 1665
## MIA FLL STL MKE TPA MCI BNA CLE PDX
## 1601 1590 1504 1491 1489 1381 1358 1352 1329
## HOU RDU DAL OAK HNL SMF PIT SJC IND
## 1324 1297 1218 1211 1164 1131 1061 1003 989
## SNA ABQ AUS SAT MSY CMH PBI BUF OMA
## 944 905 872 809 806 757 733 727 704
## JAX BDL BUR RSW BHM ONT PVD GRR SDF
## 643 629 627 607 546 521 517 514 504
## TUL OKC RNO DSM SJU RIC MHT DAY MSN
## 493 488 485 478 471 452 426 419 408
## GEG LIT BOI ELP TUS ANC ICT LGB TYS
## 404 404 394 394 389 384 371 367 358
## ALB ROC XNA SYR OGG ORF HPN COS CID
## 356 356 344 343 332 322 317 311 292
## CHS FAT LEX GSO MLI CAE HSV SAV JAN
## 287 285 282 278 271 268 260 259 251
## (Other)
## 12768
The busiest airport in the U.S. is Atlanta (ATL); Chicago O'Hare and Denver rank second and third.
15. Dest Destination IATA airport code
summary(d$Dest)
## ATL ORD DEN DFW LAX PHX LAS EWR SFO
## 11791 9506 6338 5159 5013 4663 4357 4335 4277
## IAH DTW MSP JFK SLC SEA MCO LGA PHL
## 4271 3575 3280 3238 3216 3131 3015 2818 2669
## BOS CLT SAN BWI FLL MDW CVG TPA MEM
## 2541 2422 2265 2142 1959 1912 1831 1748 1721
## DCA MIA IAD PDX RDU MCI STL SMF OAK
## 1710 1702 1494 1483 1455 1384 1374 1361 1351
## BNA CLE MKE SJC HOU SNA SAT DAL AUS
## 1346 1314 1290 1210 1196 1120 1096 1085 1049
## ABQ HNL PIT PBI MSY IND CMH RSW OMA
## 1014 981 935 917 889 841 831 758 757
## JAX BUR ONT BUF TUL OKC SJU BHM TUS
## 737 717 679 664 615 608 604 585 585
## BDL RNO ANC SDF PVD DSM GRR RIC ELP
## 578 567 560 526 505 501 498 485 480
## BOI LIT GEG MSN TYS ICT DAY XNA LGB
## 472 442 424 415 412 408 406 386 381
## MHT COS ORF GSO CHS ROC CAE JAN HPN
## 373 366 362 336 334 327 301 296 295
## SAV CID OGG FAT ALB SYR LEX HSV MLI
## 293 292 292 291 285 280 276 261 255
## (Other)
## 13756
The result is very similar to Origin. We also need to check whether the busiest airports suffer delays the most.
16. Distance In miles
summary(d$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31 338 599 753 984 4960
The vast majority of flights cover a distance of under 1,000 miles. The relationship between distance and delay time is another important question we need to examine.
5 Association Rule
Reflecting on flight performance, we believe there are special and important patterns that, once addressed, could substantially improve the quality of on-time performance. In this section we look for hidden relationships among the different facets of the flight dataset. We raise specific questions, for instance, how factors such as Distance and DayOfWeek influence on-time performance, and we answer them using association rules. Three standard measures guide the mining:
• Support The probability that the antecedent and the consequent hold simultaneously in the dataset.
• Confidence The conditional probability that the consequent holds when the antecedent is satisfied.
• Lift The ratio of confidence to expected confidence. Values greater than 1 indicate that the rule has predictive potential.
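These three measures can be computed by hand on a toy set of transactions (not the flight data) for the rule {A} => {B}:

```r
trans <- list(c("A", "B"), c("A", "B"), c("A"), c("B"), c("C"))
n <- length(trans)
has <- function(items) sapply(trans, function(t) all(items %in% t))
supp <- sum(has(c("A", "B"))) / n      # 2/5 = 0.4
conf <- supp / (sum(has("A")) / n)     # 0.4 / 0.6 = 0.667
lift <- conf / (sum(has("B")) / n)     # 0.667 / 0.6 = 1.111 > 1
```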
First, we load the libraries that association rule mining requires.
library(arules)
library(arulesViz)
We copy the original dataset, since we will do some further data transformations on it, and store the copy as data.
data = d
In our experience it is effective to divide each numeric variable into ordered, reasonable ranges for this analysis. So we discretize Distance, ArrDelay, AirTime, CRSDepTime, and CRSArrTime, making sure every observation falls into one of the given ranges.
data$Distance = ordered(cut(data$Distance, c(0, 300, 600, 1000, Inf)), labels = c("Short",
"Medium", "Long", "Too long"))
data$ArrDelay = ordered(cut(data$ArrDelay, c(0, 25, 50, 80, Inf)), labels = c("On-Time",
"Delayed", "Intermediate-Delayed", "Much-Delayed"))
data$AirTime = ordered(cut(data$AirTime, c(-1, 50, 100, 200, 300, Inf)), labels = c("Too-Short",
    "Short", "Intermediate", "Long", "Too-Long"))
data$CRSDepTime = ordered(cut(data$CRSDepTime, c(-1, 600, 1200, 1800, Inf)),
labels = c("Overnight", "Morning", "Afternoon", "Evening"))
data$CRSArrTime = ordered(cut(data$CRSArrTime, c(-1, 600, 1200, 1800, 2359)),
labels = c("Overnight", "Morning", "Afternoon", "Evening"))
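As a quick sanity check (on toy values, not the flight data), cut with ordered labels assigns each value to its bracket, with values on a break falling into the lower bin:

```r
x <- c(150, 450, 800, 2500)                       # toy distances in miles
cats <- ordered(cut(x, c(0, 300, 600, 1000, Inf)),
                labels = c("Short", "Medium", "Long", "Too long"))
as.character(cats)  # "Short" "Medium" "Long" "Too long"
```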
DayOfWeek contains the numbers 1 through 7 representing days of the week. We convert it to character, replace each number with a day name, and then turn the result into a factor.
data$DayOfWeek = as.character(data$DayOfWeek)
data$DayOfWeek = gsub("^1", "Sunday", data$DayOfWeek)
data$DayOfWeek = gsub("^2", "Monday", data$DayOfWeek)
data$DayOfWeek = gsub("^3", "Tuesday", data$DayOfWeek)
data$DayOfWeek = gsub("^4", "Wednesday", data$DayOfWeek)
data$DayOfWeek = gsub("^5", "Thursday", data$DayOfWeek)
data$DayOfWeek = gsub("^6", "Friday", data$DayOfWeek)
data$DayOfWeek = gsub("^7", "Saturday", data$DayOfWeek)
data$DayOfWeek = factor(data$DayOfWeek)
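The seven gsub calls above can be collapsed into a single vector lookup; a minimal equivalent sketch on toy codes, keeping the report's 1 = Sunday mapping:

```r
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
dow <- c(1, 3, 7)                       # toy day codes
wk <- factor(days[dow], levels = days)  # vector lookup replaces seven gsub calls
as.character(wk)                        # "Sunday" "Tuesday" "Saturday"
```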
Six variables, such as FlightNum and ActualElapsedTime, are not useful here, so they need to be removed.
logNdx = !(names(data) %in% c("DayofMonth", "FlightNum", "Cancelled", "ActualElapsedTime",
"DepDelay", "UniqueCarrier"))
data.AR = data[, logNdx]
With the preprocessing above finished, the dataset for analysis, called data.AR, contains 10 variables, as shown below.
summary(data.AR)
## DayOfWeek CRSDepTime CRSArrTime TailNum
## Friday :20526 Overnight: 3009 Overnight: 2050 N986CA : 129
## Monday :27345 Morning :54579 Morning :35361 N87353 : 126
## Saturday :21397 Afternoon:72468 Afternoon:67079 N77302 : 122
## Sunday :30021 Evening :38591 Evening :64157 N507CA : 112
## Thursday :22742 N472CA : 107
## Tuesday :26882 N471CA : 106
## Wednesday:19734 (Other):167945
## AirTime ArrDelay Origin
## Too-Short :28126 On-Time :46402 ATL : 12232
## Short :64977 Delayed :53601 ORD : 11020
## Intermediate:56146 Intermediate-Delayed:29107 DEN : 7004
## Long :14165 Much-Delayed :39537 DFW : 6208
## Too-Long : 5233 DTW : 4984
## PHX : 4353
## (Other):122846
## Dest Distance
## ATL : 11791 Short :33685
## ORD : 9506 Medium :50906
## DEN : 6338 Long :43644
## DFW : 5159 Too long:40412
## LAX : 5013
## PHX : 4663
## (Other):126177
We now apply association rule mining to the dataset. First, we want to identify the main factors that could make a flight on time or not. We set thresholds for support and confidence, and restrict the right-hand side to the four delay levels: On-Time, Delayed, Intermediate-Delayed, Much-Delayed.
apriori.appearance1 = list(rhs = c("ArrDelay=On-Time", "ArrDelay=Delayed", "ArrDelay=Intermediate-Delayed",
    "ArrDelay=Much-Delayed"), default = "lhs")
apriori.parameter1 = list(support = 0.01, confidence = 0.1)
rules1 = apriori(data.AR, parameter = apriori.parameter1, appearance = apriori.appearance1)
##
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.1 0.1 1 none FALSE TRUE 0.01 1 10
## target ext
## rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[4 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.07s].
## writing ... [743 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
We keep the rules with lift above 1.2 and confidence above 0.1, and sort the resulting subset by lift.
rules1.subset = subset(rules1, subset = lift > 1.2 & confidence > 0.1)
rules1.subset.conf = sort(rules1.subset, by = "lift")
summary(rules1.subset)
## set of 50 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 4 26 18 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 3.00 3.36 4.00 5.00
##
## summary of quality measures:
## support confidence lift
## Min. :0.0101 Min. :0.284 Min. :1.20
## 1st Qu.:0.0118 1st Qu.:0.314 1st Qu.:1.24
## Median :0.0147 Median :0.342 Median :1.27
## Mean :0.0182 Mean :0.337 Mean :1.31
## 3rd Qu.:0.0200 3rd Qu.:0.354 3rd Qu.:1.32
## Max. :0.0725 Max. :0.428 Max. :1.83
##
## mining info:
## data ntransactions support confidence
## data.AR 168647 0.01 0.1
The list below displays the top five rules sorted by lift.
inspect(rules1.subset.conf[1:5])
## lhs rhs support confidence lift
## 1 {Dest=EWR} => {ArrDelay=Much-Delayed} 0.01101 0.4284 1.827
## 2 {CRSDepTime=Afternoon,
## Dest=ORD} => {ArrDelay=Much-Delayed} 0.01040 0.4048 1.727
## 3 {CRSArrTime=Evening,
## Origin=ORD} => {ArrDelay=Much-Delayed} 0.01032 0.3786 1.615
## 4 {Dest=ORD} => {ArrDelay=Much-Delayed} 0.02001 0.3549 1.514
## 5 {Origin=ORD} => {ArrDelay=Much-Delayed} 0.02199 0.3365 1.435
Flights departing from or landing at ORD are very likely to be delayed. Searching the weather history of Chicago O'Hare International Airport (ORD), we found it did suffer a very severe snowstorm during that period. So even though weather conditions are not included in the flight data, weather is still a key driver of on-time performance.
rules1.subset.delay = subset(rules1.subset, subset = lhs %in% "DayOfWeek=Monday")
plot(rules1.subset.delay, method = "graph", control = list(type = "items"))
[Graph of 5 rules linking DayOfWeek=Monday with CRSDepTime/CRSArrTime Morning and Evening to ArrDelay=On-Time and ArrDelay=Much-Delayed; node size encodes support (0.011-0.019), color encodes lift (1.213-1.421).]
apriori.appearance3 = list(rhs = c("ArrDelay=On-Time"), default = "lhs")
apriori.parameter3 = list(support = 0.01, confidence = 0.1)
rules3 = apriori(data.AR, parameter = apriori.parameter3, appearance = apriori.appearance3)
##
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.1 0.1 1 none FALSE TRUE 0.01 1 10
## target ext
## rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [211 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
rules3.subset = subset(rules3, subset = lift > 1.2 & confidence > 0.1)
rules3.subset.conf = sort(rules3.subset, by = "lift")
rules3.subset.ontime = subset(rules3.subset.conf, subset = lhs %in% c("DayOfWeek=Friday",
"DayOfWeek=Saturday", "DayOfWeek=Sunday", "DayOfWeek=Monday", "DayOfWeek=Tuesday",
"DayOfWeek=Wednesday", "DayOfWeek=Thursday"))
inspect(rules3.subset.ontime[1:5])
## lhs rhs support confidence lift
## 1 {DayOfWeek=Monday,
## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01326 0.3909 1.421
## 2 {DayOfWeek=Monday,
## CRSDepTime=Morning,
## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01170 0.3826 1.390
## 3 {DayOfWeek=Monday,
## CRSDepTime=Morning} => {ArrDelay=On-Time} 0.01874 0.3624 1.317
## 4 {DayOfWeek=Wednesday,
## CRSDepTime=Morning} => {ArrDelay=On-Time} 0.01283 0.3503 1.273
## 5 {DayOfWeek=Sunday,
## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01280 0.3349 1.217
Clearly, the flights with the highest on-time performance are almost always in the morning. This suggests that in December 2008 air traffic control was in good shape in the morning but comparatively congested in the afternoon and evening.
rules1.subset.delay1 = subset(rules1.subset.conf, subset = lhs %in% c("Distance=Short",
"Distance=Medium", "Distance=Long", "Distance=Too long"))
inspect(rules1.subset.delay1[1:10])
## lhs rhs support confidence lift
## 1 {CRSArrTime=Morning,
## AirTime=Short,
## Distance=Medium} => {ArrDelay=On-Time} 0.02186 0.3679 1.337
## 2 {CRSArrTime=Morning,
## AirTime=Intermediate,
## Distance=Long} => {ArrDelay=On-Time} 0.01494 0.3610 1.312
## 3 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## AirTime=Short,
## Distance=Medium} => {ArrDelay=On-Time} 0.01996 0.3605 1.310
## 4 {CRSArrTime=Morning,
## Distance=Medium} => {ArrDelay=On-Time} 0.02433 0.3596 1.307
## 5 {CRSArrTime=Morning,
## Distance=Long} => {ArrDelay=On-Time} 0.01849 0.3556 1.292
## 6 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## AirTime=Intermediate,
## Distance=Long} => {ArrDelay=On-Time} 0.01310 0.3547 1.289
## 7 {AirTime=Short,
## Origin=ATL,
## Distance=Medium} => {ArrDelay=On-Time} 0.01010 0.3527 1.282
## 8 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## Distance=Medium} => {ArrDelay=On-Time} 0.02219 0.3526 1.281
## 9 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## Distance=Long} => {ArrDelay=On-Time} 0.01622 0.3493 1.270
## 10 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## Distance=Too long} => {ArrDelay=On-Time} 0.01120 0.3480 1.265
We find that the flights most likely to be on time tend to fly long-distance routes. This suggests that in the air traffic control system, small regions and short routes are much busier: people may have five or more flight choices from Boston to New York City, while only two flights run from Washington D.C. to beautiful San Diego. That is why long-distance routes put less pressure on air traffic control, and why congestion is more likely on shorter routes.
6 Cluster Analysis
We are going to research the airline dataset using cluster analysis. Clustering generally sorts observations into k groups (k is the number of groups to create) so as to maximize the similarity of observations within the same group and minimize the similarity of observations across different groups. Cluster analysis falls into two broad approaches, hierarchical and non-hierarchical; we will run non-hierarchical clustering with K-Means and PAM. For the airline dataset, we will continue finding
associations within the dataset and the factors that influence ArrDelay.
6.1 Setup
First we need to load libraries we are going to use.
library(cluster) # cluster library
library(proxy) # hcluster function
library(fpc) # cluster.stats function
library(pamr) # pam function
library(clValid) # clValid function
library(ggplot2) # plot diagram
Because the original dataset is too large to compute a full distance matrix, we randomly choose 1,000 records and run the analysis on this sample.
cd = d[, !(names(d) %in% c("UniqueCarrier", "TailNum", "FlightNum", "Origin", "Dest"))]
cd = cd[sample(nrow(cd), 1000), ]
m = as.matrix(cd)
str(m)
## int [1:1000, 1:12] 31 10 28 18 14 23 7 20 11 20 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "6926813" "6946821" "6545814" "6680323" ...
## ..$ : chr [1:12] "DayofMonth" "DayOfWeek" "DepTime" "CRSDepTime" ...
mDist = dist(m)
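One caveat: without a seed, the 1,000-row sample differs on every run. A toy sketch (not the flight data) showing seeded, reproducible sampling and the size of the resulting distance object:

```r
set.seed(1)                           # fixed seed makes the sample reproducible
toy <- matrix(rnorm(40), ncol = 4)    # 10 toy rows stand in for the full sample
samp <- toy[sample(nrow(toy), 5), ]
length(dist(samp))                    # 10 = 5 * 4 / 2 pairwise distances
```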
6.2 Determine K
The most important step in cluster analysis is to determine the best K. To find it, we apply the clValid function, which directly reports the optimal choice.
# clValid
hvalid <- clValid(m, 2:10, clMethods = c("hierarchical"), validation = "internal",
maxitems = 1e+06)
pamvalid <- clValid(m, 2:10, clMethods = c("pam"), validation = "internal",
maxitems = 1e+06)
kvalid <- clValid(m, 2:10, clMethods = c("kmeans"), validation = "internal",
maxitems = 1e+06)
Now we can use the summary() function to see the result of each method,
where the Optimal Scores section will directly give us the best clusters number.
summary(kvalid)
##
## Clustering Methods:
## kmeans
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9
##
## kmeans Connectivity 2.878 79.275 85.293 86.702 104.193 131.675 105.027 128.007 136.
## Dunn 0.098 0.012 0.012 0.014 0.027 0.023 0.027 0.023 0.
## Silhouette 0.471 0.454 0.468 0.470 0.485 0.388 0.486 0.390 0.
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 2.878 kmeans 2
## Dunn 0.098 kmeans 2
## Silhouette 0.486 kmeans 8
summary(pamvalid)
##
## Clustering Methods:
## pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## pam Connectivity 91.199 139.967 145.223 171.271 173.816 237.760 221.203 212.537 246.705
## Dunn 0.019 0.013 0.014 0.018 0.009 0.012 0.013 0.013 0.013
## Silhouette 0.404 0.288 0.318 0.360 0.305 0.298 0.320 0.320 0.333
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 91.199 pam 2
## Dunn 0.019 pam 2
## Silhouette 0.404 pam 2
summary(hvalid)
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 5.287 5.287 7.005 11.240 32.673 33.506 35.006 38.148 44.491
## Dunn 0.420 0.420 0.417 0.381 0.073 0.073 0.073 0.074 0.074
## Silhouette 0.485 0.460 0.443 0.426 0.407 0.374 0.368 0.317 0.318
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 5.287 hierarchical 2
## Dunn 0.420 hierarchical 2
## Silhouette 0.485 hierarchical 2
Because the 1,000 sample records are randomly chosen, the results are not always the same, but in most cases K = 2 is selected. We can also use other measures to validate this result.
We use a function, foreachcluster3, to report six quality measures for cluster numbers from 2 to 10.
foreachcluster3 = function(k) {
    pamC = pam(x = m, k)
    p.stats = cluster.stats(mDist, pamC$clustering)
    c(max.dia = p.stats$max.diameter, min.sep = p.stats$min.separation, avg.wi = p.stats$average.within,
        avg.bw = p.stats$average.between, silwidth = p.stats$avg.silwidth, dunn = p.stats$dunn)
}
We apply this function to cluster numbers from 2 to 10 and use rbind to make
a table.
t3 = rbind(foreachcluster3(2), foreachcluster3(3), foreachcluster3(4), foreachcluster3(5),
foreachcluster3(6), foreachcluster3(7), foreachcluster3(8), foreachcluster3(9),
foreachcluster3(10))
rownames(t3) = 2:10
t3
## max.dia min.sep avg.wi avg.bw silwidth dunn
## 2 3899 75.14 1062.9 1811 0.4041 0.019271
## 3 3899 52.03 963.4 1666 0.2884 0.013344
## 4 3802 52.03 797.9 1698 0.3184 0.013685
## 5 3366 59.92 653.1 1689 0.3599 0.017800
## 6 3366 30.82 605.6 1625 0.3055 0.009157
## 7 3265 39.76 579.1 1602 0.2979 0.012177
## 8 2959 39.76 552.8 1604 0.3199 0.013437
## 9 2959 38.39 521.4 1583 0.3196 0.012974
## 10 2959 39.76 480.1 1562 0.3330 0.013437
The result also shows we should use K = 2 for cluster analysis.
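As a further cross-check of K = 2, the classic elbow heuristic looks for the largest relative drop in total within-cluster sum of squares; a sketch on synthetic two-cluster data (not our sample):

```r
set.seed(7)
m2 <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # cluster around (0, 0)
            matrix(rnorm(100, mean = 5), ncol = 2))   # cluster around (5, 5)
wss <- sapply(1:6, function(k) kmeans(m2, k, nstart = 10)$tot.withinss)
which.max(-diff(wss) / wss[-length(wss)])  # 1: biggest drop is going from k = 1 to 2
```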
6.3 Cluster Analysis
After determining the best K = 2, we can compare the resulting clusters to see whether we find interesting patterns.
From previous tests we found that hierarchical clustering does not perform well in this analysis: one cluster holds only a few elements while the other holds over 99% of them. Since hclust is not working well in this case, we will use the pam and kmeans functions.
6.3.1 Pam
We apply the pam function to the matrix, setting k = 2.
pamC = pam(x = m, 2)
pamC$clusinfo
## size max_diss av_diss diameter separation
## [1,] 561 2814 741.6 3505 75.14
## [2,] 439 3004 717.2 3899 75.14
pamcluster = data.frame(pamC$clustering)
We bind the cluster assignments back onto our sample dataset.
total = cbind(cd, pamcluster)
After that, we can obtain two subsets according to their cluster numbers.
d1 = subset(total, pamC.clustering == 1)
d2 = subset(total, pamC.clustering == 2)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
## 1st Qu.:10.0 1st Qu.:2.00 1st Qu.:1631 1st Qu.:1525
## Median :18.0 Median :3.00 Median :1809 Median :1715
## Mean :16.9 Mean :3.59 Mean :1802 Mean :1699
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2004 3rd Qu.:1855
## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 2 Min. : 640 Min. : 33 Min. : 35
## 1st Qu.:1745 1st Qu.:1718 1st Qu.: 84 1st Qu.: 80
## Median :1941 Median :1914 Median :123 Median :115
## Mean :1826 Mean :1904 Mean :137 Mean :130
## 3rd Qu.:2136 3rd Qu.:2105 3rd Qu.:168 3rd Qu.:160
## Max. :2357 Max. :2359 Max. :441 Max. :407
## AirTime ArrDelay DepDelay Distance
## Min. : 14 Min. : 15.0 Min. :-10.0 Min. : 56
## 1st Qu.: 57 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334
## Median : 88 Median : 47.0 Median : 45.0 Median : 590
## Mean :108 Mean : 69.3 Mean : 62.4 Mean : 720
## 3rd Qu.:137 3rd Qu.: 97.0 3rd Qu.: 89.0 3rd Qu.: 948
## Max. :382 Max. :395.0 Max. :377.0 Max. :2640
## pamC.clustering
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
summary(d2)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45
## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 835 1st Qu.: 810
## Median :18.0 Median :3.00 Median :1021 Median : 955
## Mean :17.2 Mean :3.54 Mean :1041 Mean :1001
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1205 3rd Qu.:1130
## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 10 Min. : 1 Min. : 35 Min. : 34.0
## 1st Qu.:1022 1st Qu.: 942 1st Qu.: 92 1st Qu.: 83.5
## Median :1217 Median :1130 Median :129 Median :116.0
## Mean :1178 Mean :1111 Mean :146 Mean :134.6
## 3rd Qu.:1408 3rd Qu.:1322 3rd Qu.:176 3rd Qu.:165.0
## Max. :1810 Max. :2345 Max. :432 Max. :405.0
## AirTime ArrDelay DepDelay Distance
## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74
## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 5.0 1st Qu.: 344
## Median : 92 Median : 33.0 Median : 27.0 Median : 594
## Mean :113 Mean : 55.9 Mean : 44.2 Mean : 750
## 3rd Qu.:140 3rd Qu.: 65.5 3rd Qu.: 56.0 3rd Qu.: 966
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## pamC.clustering
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
We can see from the summary that the two clusters are similar on most columns except
Departure Time and our target variable Arrival Delay. We can conclude that when the
departure time falls around midnight or in the morning, the flight is likely to have a
relatively lower delay, which matches the conclusion we drew from the association rules.
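The cluster-wise comparison above can also be read directly off the data rather than from two separate summary() tables. A minimal sketch (the `total_toy` data frame below is a hypothetical stand-in for `total`, the sampled flights with the `pamC.clustering` column):

```r
# Toy stand-in for `total`, the sampled flights plus the PAM cluster label.
total_toy <- data.frame(
  DepTime = c(1800, 2100, 1950, 830, 900, 1015),
  ArrDelay = c(70, 95, 60, 25, 30, 40),
  pamC.clustering = c(1, 1, 1, 2, 2, 2)
)
# Per-cluster means in one call, condensing the two summary() tables.
clus_means <- aggregate(cbind(DepTime, ArrDelay) ~ pamC.clustering,
                        data = total_toy, FUN = mean)
print(clus_means)
```

On the toy data, cluster 1 departs later and is delayed more, mirroring the pattern in the summaries above.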
totaldf = data.frame(total)
totaldf$pamC.clustering = as.factor(totaldf$pamC.clustering)
qplot(data = totaldf, x = totaldf$pamC.clustering, y = totaldf$DepTime, colour = totaldf$pamC.clustering,
    geom = "boxplot")
[Figure: boxplots of DepTime by PAM cluster (1 vs. 2), drawn with qplot]
From the result, we can see that pam has done quite a good job of clustering.
Next, we try kmeans to compare the results.
6.3.2 Kmeans
We apply similar work to Kmeans to see if Kmeans works better than pam
function.
kmeans.results = kmeans(m, 2)
clusterdf = data.frame(kmeans.results$cluster)
total = cbind(cd, clusterdf)
d1 = subset(total, kmeans.results.cluster == 1)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
## 1st Qu.:10.5 1st Qu.:2.00 1st Qu.:1627 1st Qu.:1520
## Median :18.0 Median :3.00 Median :1804 Median :1710
## Mean :17.0 Mean :3.58 Mean :1797 Mean :1693
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2002 3rd Qu.:1855
## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 2 Min. : 640 Min. : 33 Min. : 35
## 1st Qu.:1739 1st Qu.:1714 1st Qu.: 84 1st Qu.: 80
## Median :1939 Median :1910 Median :123 Median :115
## Mean :1817 Mean :1899 Mean :137 Mean :130
## 3rd Qu.:2134 3rd Qu.:2104 3rd Qu.:168 3rd Qu.:160
## Max. :2357 Max. :2359 Max. :441 Max. :407
## AirTime ArrDelay DepDelay Distance
## Min. : 14.0 Min. : 15.0 Min. :-10.0 Min. : 56
## 1st Qu.: 57.5 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334
## Median : 88.0 Median : 47.0 Median : 45.0 Median : 588
## Mean :107.6 Mean : 69.8 Mean : 62.8 Mean : 720
## 3rd Qu.:136.5 3rd Qu.: 97.0 3rd Qu.: 88.5 3rd Qu.: 947
## Max. :382.0 Max. :425.0 Max. :392.0 Max. :2640
## kmeans.results.cluster
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
d2 = subset(total, kmeans.results.cluster == 2)
summary(d2)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45
## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 834 1st Qu.: 805
## Median :18.0 Median :3.00 Median :1017 Median : 950
## Mean :17.1 Mean :3.54 Mean :1030 Mean : 992
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1202 3rd Qu.:1125
## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 24 Min. : 1 Min. : 35 Min. : 34
## 1st Qu.:1021 1st Qu.: 940 1st Qu.: 92 1st Qu.: 83
## Median :1215 Median :1123 Median :129 Median :117
## Mean :1175 Mean :1099 Mean :146 Mean :135
## 3rd Qu.:1402 3rd Qu.:1318 3rd Qu.:177 3rd Qu.:166
## Max. :1810 Max. :2305 Max. :432 Max. :405
## AirTime ArrDelay DepDelay Distance
## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74
## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 4.0 1st Qu.: 344
## Median : 92 Median : 33.0 Median : 26.0 Median : 595
## Mean :113 Mean : 54.9 Mean : 43.2 Mean : 750
## 3rd Qu.:140 3rd Qu.: 64.0 3rd Qu.: 55.0 3rd Qu.: 967
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## kmeans.results.cluster
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
The two methods generate very similar results. Both show a strong relationship between
Departure Time and Arrival Delay, which matches our findings from the association rules
and decision trees.
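The similarity of the two partitions can be quantified instead of eyeballed. A minimal sketch, where `pam_labels` and `km_labels` are hypothetical stand-ins for `pamC$clustering` and `kmeans.results$cluster` on the same rows: cross-tabulating the two label vectors gives a table that is heavily diagonal (or anti-diagonal, since cluster numbers are arbitrary) when the clusterings agree.

```r
# Hypothetical label vectors standing in for pamC$clustering and
# kmeans.results$cluster on the same rows of the sample.
pam_labels <- c(1, 1, 2, 2, 2, 1, 2, 1)
km_labels  <- c(1, 1, 2, 2, 1, 1, 2, 1)

# Cross-tabulation of the two clusterings.
agreement <- table(pam = pam_labels, kmeans = km_labels)
print(agreement)

# Fraction of rows on which the two methods agree
# (here the label numbering happens to match).
agree_rate <- mean(pam_labels == km_labels)
print(agree_rate)
```

A high agreement rate would confirm numerically what the side-by-side summaries suggest.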
7 Decision Tree
In this section, we use decision trees to analyze the factors that affect the target
variables.
First, we need to load the libraries required.
library(rpart)
library(rpart.plot)
library(rattle)
library(maptree)
library(party)
library(partykit)
7.1 Categorize Variable
We categorize our variables into groups.
Distance: divided into three bands: up to 750 miles, 750 to 1000 miles, and greater
than 1000 miles.
d$Distance = ordered(cut(d$Distance, c(0, 750, 1000, Inf)), labels = c("upto750",
"750to1000", ">1000"))
DayOfWeek: weekday numbers are replaced with abbreviations (1 = MON, 2 = TUE, etc.)
with the help of gsub.
d$DayOfWeek = gsub("1", "MON", d$DayOfWeek)
d$DayOfWeek = gsub("2", "TUE", d$DayOfWeek)
d$DayOfWeek = gsub("3", "WED", d$DayOfWeek)
d$DayOfWeek = gsub("4", "THU", d$DayOfWeek)
d$DayOfWeek = gsub("5", "FRI", d$DayOfWeek)
d$DayOfWeek = gsub("6", "SAT", d$DayOfWeek)
d$DayOfWeek = gsub("7", "SUN", d$DayOfWeek)
Origin: origin airports are grouped into five regions (SW, SE, NE, MW, W) with the help
of the gsub function.
d$Origin = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS
"SW", d$Origin)
d$Origin = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT
"SE", d$Origin)
d$Origin = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT
"NE", d$Origin)
d$Origin = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID
"MW", d$Origin)
d$Origin = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT
"W", d$Origin)
Dest: destination airports are grouped into the same five regions (SW, SE, NE, MW, W)
with the help of the gsub function.
d$Dest = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS|G
"SW", d$Dest)
d$Dest = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT|D
"SE", d$Dest)
d$Dest = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT|C
"NE", d$Dest)
d$Dest = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID|C
"MW", d$Dest)
d$Dest = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT|B
"W", d$Dest)
DayofMonth: the December days are split into regular days and the Christmas week.
d$DayofMonth = ordered(cut(d$DayofMonth, c(0, 23, 32)), labels = c("R.Days",
"CH.Days"))
DepDelay: the departure delay is split into two levels, low and high.
d$DepDelay = ordered(cut(d$DepDelay, c(-Inf, 60, Inf)), labels = c("low", "high"))
7.2 Rpart
rpart performs recursive partitioning for classification, regression and survival trees.
We use it to model our two response variables, DepDelay and ArrDelay.
Departure Delay: DepDelay is the response variable; DayofMonth, DayOfWeek, DepTime and
Distance are the predictors.
ss.formula = DepDelay ~ DayofMonth + DayOfWeek + DepTime + Distance
# formula for tree
ss.rpart = rpart(data = d, formula = ss.formula)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Figure: draw.tree output for the DepDelay rpart model; all splits are on DepTime
(447.5, 1406.5, 2229.5). Total classified correct = 27.5%]
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 168647 50820 low (0.6987 0.3013)
## 2) DepTime< 1406 72390 14340 low (0.8019 0.1981)
## 4) DepTime>=447.5 70766 13030 low (0.8158 0.1842) *
## 5) DepTime< 447.5 1624 314 high (0.1933 0.8067) *
## 3) DepTime>=1406 96257 36480 low (0.6210 0.3790)
## 6) DepTime< 2230 91550 33200 low (0.6374 0.3626) *
## 7) DepTime>=2230 4707 1427 high (0.3032 0.6968) *
From this tree we conclude that flights departing at night, roughly between 10:30 PM and
5:00 AM, are delayed more often than daytime flights. The tree depends mainly on the
departure and arrival times, so we removed AirTime from the next decision tree.
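Which variables the tree "mainly depends on" can also be checked directly: a fitted rpart object carries a `variable.importance` vector. A sketch on toy data (in the report, the real call would use `ss.rpart` from above; the `toy` data frame and its delay rule are assumptions for illustration):

```r
library(rpart)
set.seed(1)
# Toy data in which delay is driven mostly by departure time,
# echoing the pattern the report's tree found.
toy <- data.frame(
  DepTime = sample(0:2359, 500, replace = TRUE),
  Distance = sample(100:2500, 500, replace = TRUE)
)
toy$DepDelay <- factor(ifelse(toy$DepTime > 2230 | toy$DepTime < 445, "high", "low"))

fit <- rpart(DepDelay ~ DepTime + Distance, data = toy)
# Named vector; the largest entry is the most influential split variable.
print(fit$variable.importance)
```

On data like this, DepTime dominates the importance ranking, which is the same signal that motivated dropping AirTime above.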
Arrival Delay: ArrDelay is the response variable; DayofMonth, DayOfWeek, Origin and
Distance are the predictors.
ss.formula = ArrDelay ~ DayofMonth + DayOfWeek + Distance + Origin
# formula for tree
R.control = rpart.control(cp = 0.001) # to control tree
ss.rpart = rpart(data = d, formula = ss.formula, control = R.control)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Figure: draw.tree output for the ArrDelay rpart model; splits on Origin, DayOfWeek and
DayofMonth. Total deviance explained = 1.5%]
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 168647 675500000 62.55
## 2) Origin=SE,SW,W 110460 394400000 59.06
## 4) DayOfWeek=MON,SUN,THU,TUE,WED 81657 273300000 57.44
## 8) Origin=SE,SW 50291 142800000 54.80 *
## 9) Origin=W 31366 129600000 61.67 *
## 5) DayOfWeek=FRI,SAT 28803 120300000 63.64
## 10) DayofMonth=R.Days 17922 68280000 58.88 *
## 11) DayofMonth=CH.Days 10881 50950000 71.48 *
## 3) Origin=MW,NE 58187 277100000 69.20
## 6) DayOfWeek=MON,SUN,THU,WED 33946 124100000 63.62
## 12) DayOfWeek=THU 5849 17040000 52.35 *
## 13) DayOfWeek=MON,SUN,WED 28097 106100000 65.97 *
## 7) DayOfWeek=FRI,SAT,TUE 24241 150500000 77.01 *
From this decision tree we can see that the data is split first by origin region (SE, SW, W
versus MW, NE) and then by day of week (MON, SUN, THU, TUE, WED versus FRI, SAT).
7.3 Ctree
ctree builds conditional inference trees, which embed tree-structured regression models
into a well-defined theory of conditional inference procedures.
Departure Delay:
ss.formula1 = DepDelay ~ Distance + DepTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) #height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Figure: ctree for DepDelay, depth 2; all splits on DepTime (447, 1406, 2229), with
low/high bar charts at the leaf nodes]
As we explained for rpart, ctree gives the same result: delays are higher at night, roughly
between 10:30 PM and 5:00 AM, than during the day.
Arrival Delay:
ss.formula1 = ArrDelay ~ Distance + ArrTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) # height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Figure: ctree for ArrDelay, depth 2; all splits on ArrTime (134, 518, 1438), with delay
boxplots at the leaf nodes]
From this tree we can conclude that around midnight like before 5:18 AM, delays
are higher compared to day time.
8 Random Forest
Now we will use random forest analysis to learn more about predictions. In the
random forest, the following libraries will be used.
library(randomForest) # for randomForest
library(rpart)
library(caret) # for confusionMatrix
Because the original data is too large, we again randomly select 1,000 rows.
rfd = rd[sample(nrow(rd), 1000), ]
We separate our dataset into a training set and a test set.
ndxTrain = sample(x = nrow(rfd), size = 0.7 * nrow(rfd))
rfd.train = rfd[ndxTrain, ]
rfd.test = rfd[-ndxTrain, ]
We set all the other variables to be predictors and see how they will affect our
target variable.
rfd.predictors = c("DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime", "ArrTime",
"CRSArrTime", "AirTime", "ActualElapsedTime", "Distance")
rfd.rf = randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
print(rfd.rf)
##
## Call:
## randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.14%
## Confusion matrix:
## Low High class.error
## Low 272 80 0.2273
## High 89 259 0.2557
plot(rfd.rf)
[Figure: OOB error rate of rfd.rf versus number of trees (0 to 500)]
From the diagram, the error rate stabilizes as the number of trees grows, so we keep the
default number of trees, 500.
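Beyond the error curve, randomForest also reports which predictors matter most, via `importance()` (or graphically with `varImpPlot()`). A sketch on toy data (the real call would be `importance(rfd.rf)`; the `toy` data frame is an assumption for illustration):

```r
library(randomForest)
set.seed(1)
# Toy data where ArrDelay depends mainly on departure time.
toy <- data.frame(
  DepTime = sample(0:2359, 300, replace = TRUE),
  Distance = sample(100:2500, 300, replace = TRUE)
)
toy$ArrDelay <- factor(ifelse(toy$DepTime > 1800, "High", "Low"))

rf <- randomForest(ArrDelay ~ DepTime + Distance, data = toy, ntree = 200)
# Mean decrease in Gini impurity per predictor; larger = more important.
imp <- importance(rf)
print(imp)
```

On data like this, DepTime comes out far more important than Distance, which is consistent with the splits the decision trees found.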
rfd.train.pred = predict(object = rfd.rf, newdata = rfd.train, type = "class")
rfd.test.pred = predict(object = rfd.rf, newdata = rfd.test, type = "class")
confusionMatrix(data = rfd.train.pred, reference = rfd.train$ArrDelay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 352 0
## High 0 348
##
## Accuracy : 1
## 95% CI : (0.995, 1)
## No Information Rate : 0.503
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.503
## Detection Rate : 0.503
## Detection Prevalence : 0.503
## Balanced Accuracy : 1.000
##
## 'Positive' Class : Low
##
confusionMatrix(data = rfd.test.pred, reference = rfd.test$ArrDelay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 133 32
## High 29 106
##
## Accuracy : 0.797
## 95% CI : (0.747, 0.841)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.59
## Mcnemar's Test P-Value : 0.798
##
## Sensitivity : 0.821
## Specificity : 0.768
## Pos Pred Value : 0.806
## Neg Pred Value : 0.785
## Prevalence : 0.540
## Detection Rate : 0.443
## Detection Prevalence : 0.550
## Balanced Accuracy : 0.795
##
## 'Positive' Class : Low
##
Although the sample is randomly chosen, we consistently get an accuracy above 70 percent,
which is higher than a single decision tree.
9 Classification
Classification techniques predict group membership for data instances. For classification
we use the knn and SVM algorithms.
9.1 knn
K-Nearest Neighbors (kNN) is a supervised machine learning algorithm for object
classification.
library(class) #for knn
library(RWeka) #for IBk function
## Error: package or namespace load failed for ’RWeka’
9.2 Processing Data
We remove the columns that are not useful.
kd = kd[, -20:-29]
kd = kd[, -1:-2]
kd = kd[, -3:-12]
kd = kd[, -4]
kd = kd[, -5]
We keep ArrDelay as our response variable and categorize it into two levels, low and high
delay.
kd$ArrDelay = ordered(cut(kd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine cannot handle the full dataset, so we use 1,000 random records.
kdd = kd[sample(nrow(kd), 1000), ] # sample dataset
The IBk function implements the kNN technique to predict the arrival delay variable from
the remaining four variables of the kdd data frame, so we use this function and store the
result in classifier.
classifier = IBk(ArrDelay ~ DayOfWeek + DayofMonth + Distance + Origin, data = kdd,
control = Weka_control(K = 4)) # k=4 because 4 other variable
## Error: could not find function "IBk"
summary(classifier) # detail eplanation with confusion matrix
## Error: error in evaluating the argument ’object’ in selecting a
method for function ’summary’: Error: object ’classifier’ not found
With the k-nearest neighbour technique we found that around 70% of the data are correctly
classified and only 30% misclassified. The confusion matrix shows that the high-delay
class in particular is not classified well.
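Since RWeka failed to load in this run, the same k-NN idea can be run with `knn` from the `class` package that was loaded above. A minimal sketch on toy numeric data (knn needs numeric predictors, so a factor such as Origin would have to be encoded first; the toy vectors and the distance-based delay rule are assumptions):

```r
library(class)
set.seed(1)
# Toy training data: two numeric predictors and a delay label.
train_x <- data.frame(
  DayOfWeek = sample(1:7, 100, replace = TRUE),
  Distance = sample(100:2500, 100, replace = TRUE)
)
train_y <- factor(ifelse(train_x$Distance > 1200, "High", "Low"))

# Two unlabeled flights to classify.
test_x <- data.frame(DayOfWeek = c(2, 6), Distance = c(300, 2400))

# k = 4 neighbours, matching the Weka_control(K = 4) in the IBk call above.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 4)
print(pred)
```

Unlike IBk, `class::knn` has no separate train/predict steps: it classifies the test rows in one call.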
9.3 SVM
Our second classification method is SVM. A Support Vector Machine analyzes data and
recognizes patterns, and is used for classification and regression analysis.
10 Processing Data
We keep ArrDelay as our response variable and categorize it into two levels, low and high.
sd$ArrDelay = ordered(cut(sd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine cannot handle the full dataset, so we take a sample of it.
sdd = sd[sample(nrow(sd), 1000), ] # sample dataset
We divide our sample into two parts, a training set (train1) and a test set (test1).
sd1 = nrow(sdd)
nxd.train = sample(1:sd1, 0.7 * sd1)
sd.train1 = sdd[nxd.train, ]
sd.test1 = sdd[-nxd.train, ]
For SVM we are using these two libraries.
library(e1071)
library(caret)
The response variable is ArrDelay, predicted from two variables, DayOfWeek and Distance.
sd.formula = ArrDelay ~ DayOfWeek + Distance
plot.formula = DayOfWeek ~ Distance #For plot X and Y axis
sd.model = svm(formula = sd.formula, data = sd.train1) # for actual model creation.
summary(sd.model) # Detail description of a model.
##
## Call:
## svm(formula = sd.formula, data = sd.train1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 639
##
## ( 322 317 )
##
##
## Number of Classes: 2
##
## Levels:
## Low High
sd.predict = predict(sd.model, sd.test1) # prediction on testing data set
# confusionMatrix(data = sd.predict, reference = sdd$ArrDelay)
plot(x = sd.model, data = sd.train1, formula = plot.formula) #default: cost=1, gamma=0.5
[Figure: SVM classification plot of DayOfWeek versus Distance with the default
parameters, cost = 1 and gamma = 0.5]
For a clearer result, we change the cost and gamma parameters.
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
kernel = "radial", cost = 1, gamma = 5)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[Figure: SVM classification plot of DayOfWeek versus Distance with cost = 1, gamma = 5]
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
kernel = "radial", cost = 1, gamma = 0.1)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[Figure: SVM classification plot of DayOfWeek versus Distance with cost = 1, gamma = 0.1]
From this graph we can see that days 6 and 7, Saturday and Sunday, have more delays than
the rest of the week. Beyond a certain distance, arrival delays tend to get lower.
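Instead of trying cost and gamma values by hand as above, e1071 can grid-search them with `tune`, which picks the pair with the lowest cross-validated error. A sketch on toy data (the real call would pass `sd.train1`; the `toy` data frame and its delay rule are assumptions):

```r
library(e1071)
set.seed(1)
# Toy training data: weekends (days 6-7) tend to be delayed, as in the report.
toy <- data.frame(
  DayOfWeek = sample(1:7, 200, replace = TRUE),
  Distance = sample(100:2500, 200, replace = TRUE)
)
toy$ArrDelay <- factor(ifelse(toy$DayOfWeek >= 6, "High", "Low"))

# Cross-validated grid search over the cost/gamma values tried manually above.
tuned <- tune(svm, ArrDelay ~ DayOfWeek + Distance, data = toy,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 0.5, 5)))
print(tuned$best.parameters)
```

The `best.model` component of the result can then be plotted or used for prediction directly, replacing the three hand-tuned models above.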
11 Conclusion
Overall, we did find some useful results in our analysis. In the logistic regression,
several variables were statistically significant, such as Distance, Day of Week, Origin,
Destination, Departure Time, Arrival Time and Day of Month. We converted them from
numeric to categorical variables and found some of the reasons behind U.S. flight
arrival delays.
To find relationships between different variables, we applied association rules with good
results. We found that Monday flights are more likely to be on time, and that Chicago
flights had high delays in December 2008; checking the weather records showed that the
weather in Chicago was very bad that month, and many flights were affected by it.
The fastest and easiest way to make decisions about our dataset is the decision tree,
which organizes the data hierarchically, so we used the rpart and ctree algorithms. From
the diagrams, we conclude that delays are relatively higher at night, between roughly
10:30 PM and 5:00 AM, and on weekends (Saturday and Sunday) compared with weekdays,
which makes sense.
On the other hand, we also used cluster analysis. After applying the kmeans, hclust and
pam clustering methods to our dataset, we found that two clusters work best, and we
validated this with the clValid function and other measurements. After separating the
dataset into the two subsets, we also found relationships that confirm what we had found
with the association rules.
For classification we used the knn and SVM techniques. With k-nearest neighbours we found
that approximately 70 percent of the data are correctly classified, which the confusion
matrix confirms. We also did some pattern recognition with the Support Vector Machine
(SVM), where we found that the longer the distance, the longer the delay.
Based on our analysis, we suggest travelling during the daytime on weekdays so that you
can arrive at your destination on time.
12 Limitation
There are still a few limitations during our analysis for this dataset.
First, we are limited by our computers' processing capability. The original dataset is
huge, containing 7,009,728 observations, so we selected a part of it (all U.S. airline
data for December 2008) to reduce the file size loaded into R. In addition, the PAM
algorithm in the cluster analysis and the random forest often got stuck or even crashed,
so we had to apply some functions, such as computing the distance matrix, to a random
sample. However, we have not verified how the random sample affects our results.
Another limitation is that so far we have only focused on delays. There might be other
interesting relationships among the other variables, which we may work on in the future.
13 Future Work
While we have already obtained some analysis outcomes, there are still a few
works we can do in the future.
First, due to the limitations of our computers, we cannot process large-scale data, so
some functions could not be applied to the full dataset. Because our analysis uses a
random sample, the results are not exactly reproducible from run to run.
Furthermore, since our target is analyzing whether flights perform on time or are
delayed, there are more relationships worth finding; for instance, we could identify
the busiest carrier in the air.
Last but not least, from previous work we found that DBSCAN does not work well unless the
dataset is very large, so we did not apply it here. We would like to see how DBSCAN
performs and compare its results with the other clustering methods.
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Recently uploaded (16)

Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 

Data Mining & Analytics for U.S. Airlines On-Time Performance

Data Analysis of U.S. Airlines On-time Performance

Yanxiang Zhu, Nilesh Padwal, Mingxuan Li
Finished by June 27th, 2014

Contents

1 Introduction
  1.1 Background and Problem Description
  1.2 Dataset Description
2 Collecting Data
3 Preprocessing Data
4 Variables Description
5 Association Rule
6 Cluster Analysis
  6.1 Setup
  6.2 Determine K
  6.3 Cluster Analysis
      6.3.1 Pam
      6.3.2 Kmeans
7 Decision Tree
  7.1 Categorize Variable
  7.2 Rpart
  7.3 Ctree
8 Random Forest
9 Classification
  9.1 knn
  9.2 Processing Data
  9.3 SVM
10 Processing Data
11 Conclusion
12 Limitation
13 Future Work
1 Introduction

1.1 Background and Problem Description

In the airline industry, it is common for carriers to struggle to get planes to the gate on time. The challenge is to improve the quality of airline on-time performance. Beyond carriers' services and baggage policies, saving passengers' time in the air is arguably even more important. Our goal is therefore to find noteworthy and valuable relationships in the data using data mining techniques such as cluster analysis, association rules, and decision trees.

1.2 Dataset Description

The dataset is a collection of airline data from the Research and Innovative Technology Administration (RITA); it contains detailed facets of every flight between 1987 and 2008. It includes 29 variables such as destination, origin, arrival time, and departure time, and any flight can be tracked through these features. We should mention that, due to the limited performance of our computers, we fetched only part of the whole collection (all U.S. flights over those 22 years) to process and analyze. Our selected dataset still has millions of observations, which is certainly enough to obtain satisfying outcomes. Here is a descriptive list of the useful variables:

1. DayofMonth: December 1st to December 31st.
2. DayOfWeek: 1 refers to Monday and, in a similar way, 7 refers to Sunday.
3. DepTime: actual departure time.
4. ArrTime: actual arrival time.
5. CRSDepTime: scheduled departure time.
6. CRSArrTime: scheduled arrival time.
7. UniqueCarrier: unique carrier code.
8. FlightNum: flight number.
9. ActualElapsedTime: in minutes.
10. CRSElapsedTime: in minutes.
11. AirTime: in minutes.
12. ArrDelay: arrival delay, in minutes.
13. DepDelay: departure delay, in minutes.
14. Origin: origin IATA airport code.
15. Dest: destination IATA airport code.
16. Distance: in miles.

According to the historical record of on-time flight operation by U.S. air carriers, 2008 was an interesting and special period for the airline industry: the on-time percentage was 76.0%, and it then rose to 79.5% in 2009. That is why we chose this breaking point, to find out what lies behind the headline numbers.
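For context on those percentages: the usual U.S. DOT convention (an assumption we note here, it is not stated in the dataset itself) counts a flight as on time if it arrives less than 15 minutes after schedule. A minimal sketch with toy ArrDelay values shows how such an on-time percentage is computed:

```r
# Toy ArrDelay values in minutes (negative = early). Under the DOT
# convention assumed here, "on time" means arriving < 15 minutes late.
arr.delay <- c(-5, 0, 10, 14, 15, 45)
on.time.pct <- 100 * mean(arr.delay < 15)
stopifnot(abs(on.time.pct - 66.67) < 0.01)  # 4 of 6 flights are on time
```

On the raw 2008 data, `100 * mean(d$ArrDelay < 15, na.rm = TRUE)` computes the corresponding figure.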
2 Collecting Data

The dataset we use contains all commercial flights within the USA in 2008. It was downloaded from http://stat-computing.org/dataexpo/2009; it contains nearly 10 million records and takes about 700 MB of space.

```r
file.name <- paste(2008, "csv.bz2", sep = ".")
if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/", 2008,
                      ".csv.bz2", sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name)
}
```

To import the data into our workspace, we use the read.csv function and store the dataset in d.

```r
d <- read.csv("2008.csv")
```

3 Preprocessing Data

Since the analysis needs a well-structured dataset, we omit the NA values. And due to the limits of our computers' processing capability, we decided to work only with data from December 2008. That subset still has 1,524,735 observations of 29 variables, which we think is enough to obtain good analysis results from such a large-scale dataset.

```r
d = subset(d, Month == "12")
d = na.omit(d)
```

After that, we also remove some columns that we think are not useful in our study, directly from the original dataset.
```r
d = d[, -20:-29]
```

On the other hand, since we have already decided to use only the December 2008 data, the Year and Month columns become useless.

```r
d = d[, -1]
d = d[, -1]
```

So far, our dataset contains 168,647 records with 17 variables.

```r
str(d)
```

```
## 'data.frame': 168647 obs. of 17 variables:
##  $ DayofMonth       : int 3 3 3 3 3 3 3 3 3 3 ...
##  $ DayOfWeek        : int 3 3 3 3 3 3 3 3 3 3 ...
##  $ DepTime          : int 1126 1859 1256 1925 2002 1716 1620 1807 1930 1004 ...
##  $ CRSDepTime       : int 1045 1825 1240 1900 1940 1610 1555 1725 1905 1005 ...
##  $ ArrTime          : int 1241 1925 1458 2120 2249 2054 1826 1910 2041 1130 ...
##  $ CRSArrTime       : int 1200 1900 1435 2100 2230 1950 1800 1845 2020 1115 ...
##  $ UniqueCarrier    : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 1 ...
##  $ FlightNum        : int 2717 1712 294 2776 623 586 1259 548 619 1152 ...
##  $ TailNum          : Factor w/ 5374 levels "","80009E","80019E",..: 3796 2127 3943 3316 ...
##  $ ActualElapsedTime: int 75 86 62 55 107 158 186 63 71 86 ...
##  $ CRSElapsedTime   : int 75 95 55 60 110 160 185 80 75 70 ...
##  $ AirTime          : int 55 73 45 46 93 140 177 50 56 51 ...
##  $ ArrDelay         : int 41 25 23 20 19 64 26 25 21 15 ...
##  $ DepDelay         : int 41 34 16 25 22 66 25 42 25 -1 ...
##  $ Origin           : Factor w/ 303 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Dest             : Factor w/ 304 levels "ABE","ABI","ABQ",..: 82 157 160 175 177 181 ...
##  $ Distance         : int 349 487 289 332 718 1121 1111 328 328 321 ...
```

4 Variables Description

After importing the dataset, the variables associated with each observation were explored further. They are listed and described below.

1. DayofMonth: December 1st to December 31st.

```r
summary(d$DayofMonth)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##  1.0    11.0   18.0 17.2    23.0 31.0
```

2. DayOfWeek: 1 refers to Sunday and, in a similar way, 7 refers to Saturday.

```r
summary(d$DayOfWeek)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00    2.00   4.00 3.74    5.00 7.00
```

3. DepTime: actual departure time.

```r
summary(d$DepTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1    1120   1510 1470    1840 2400
```

Departure time is another key factor we examine: we want to know which time of day is best for flying.

4. ArrTime: actual arrival time.

```r
summary(d$ArrTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1    1230   1640 1560    2010 2400
```

5. CRSDepTime: scheduled departure time.

```r
summary(d$CRSDepTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    5    1040   1420 1400    1750 2360
```

6. CRSArrTime: scheduled arrival time.

```r
summary(d$CRSArrTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1    1230   1620 1580    1950 2360
```

7. UniqueCarrier: unique carrier code.

```r
carrier = data.frame(d$UniqueCarrier)
qplot(x = d$UniqueCarrier, data = carrier, fill = d$UniqueCarrier)
```
[Figure: bar chart of December 2008 flight counts by carrier (UniqueCarrier).]

Southwest Airlines (WN) ran the most flights in the U.S. in 2008; its flight count is even greater than SkyWest Airlines and American Airlines combined. We will also help you find out which airline to choose if you want to avoid delays.

8. FlightNum: flight number.

```r
summary(d$FlightNum)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1     658   1680 2360    3590 9740
```

9. ActualElapsedTime: in minutes.

```r
summary(d$ActualElapsedTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##   18      88    126  144     177  790
```

10. CRSElapsedTime: in minutes.
```r
summary(d$CRSElapsedTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##   26      82    116  135     165  660
```

11. AirTime: in minutes.

```r
summary(d$AirTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    6      60     93  112     141  647
```

12. ArrDelay: arrival delay, in minutes.

```r
summary(d$ArrDelay)
```

```
## Min. 1st Qu. Median Mean 3rd Qu.   Max.
## 15.0    24.0   41.0 62.6    77.0 1660.0
```

Arrival delay is our target variable. The median is 41 minutes, which means the delay problem is severe. We are going to find out which factors cause the delay.

13. DepDelay: departure delay, in minutes.

```r
summary(d$DepDelay)
```

```
##  Min. 1st Qu. Median Mean 3rd Qu.   Max.
## -34.0    15.0   35.0 53.5    71.0 1600.0
```

14. Origin: origin IATA airport code.

```r
summary(d$Origin)
```

```
## ATL ORD DEN DFW DTW PHX EWR IAH LAS
## 12232 11020 7004 6208 4984 4353 4333 4269 4020
## MSP LAX JFK SLC SFO SEA CLT BOS PHL
## 4004 3837 3409 3375 3370 2992 2977 2841 2676
## MDW MCO CVG BWI LGA SAN DCA IAD MEM
## 2546 2481 2457 2383 2272 1832 1768 1714 1665
## MIA FLL STL MKE TPA MCI BNA CLE PDX
## 1601 1590 1504 1491 1489 1381 1358 1352 1329
## HOU RDU DAL OAK HNL SMF PIT SJC IND
## 1324 1297 1218 1211 1164 1131 1061 1003 989
## SNA ABQ AUS SAT MSY CMH PBI BUF OMA
## 944 905 872 809 806 757 733 727 704
## JAX BDL BUR RSW BHM ONT PVD GRR SDF
## 643 629 627 607 546 521 517 514 504
## TUL OKC RNO DSM SJU RIC MHT DAY MSN
## 493 488 485 478 471 452 426 419 408
## GEG LIT BOI ELP TUS ANC ICT LGB TYS
## 404 404 394 394 389 384 371 367 358
```
```
## ALB ROC XNA SYR OGG ORF HPN COS CID
## 356 356 344 343 332 322 317 311 292
## CHS FAT LEX GSO MLI CAE HSV SAV JAN
## 287 285 282 278 271 268 260 259 251
## (Other)
## 12768
```

The busiest airport in the U.S. is Atlanta (ATL); Chicago O'Hare and Denver rank second and third.

15. Dest: destination IATA airport code.

```r
summary(d$Dest)
```

```
## ATL ORD DEN DFW LAX PHX LAS EWR SFO
## 11791 9506 6338 5159 5013 4663 4357 4335 4277
## IAH DTW MSP JFK SLC SEA MCO LGA PHL
## 4271 3575 3280 3238 3216 3131 3015 2818 2669
## BOS CLT SAN BWI FLL MDW CVG TPA MEM
## 2541 2422 2265 2142 1959 1912 1831 1748 1721
## DCA MIA IAD PDX RDU MCI STL SMF OAK
## 1710 1702 1494 1483 1455 1384 1374 1361 1351
## BNA CLE MKE SJC HOU SNA SAT DAL AUS
## 1346 1314 1290 1210 1196 1120 1096 1085 1049
## ABQ HNL PIT PBI MSY IND CMH RSW OMA
## 1014 981 935 917 889 841 831 758 757
## JAX BUR ONT BUF TUL OKC SJU BHM TUS
## 737 717 679 664 615 608 604 585 585
## BDL RNO ANC SDF PVD DSM GRR RIC ELP
## 578 567 560 526 505 501 498 485 480
## BOI LIT GEG MSN TYS ICT DAY XNA LGB
## 472 442 424 415 412 408 406 386 381
## MHT COS ORF GSO CHS ROC CAE JAN HPN
## 373 366 362 336 334 327 301 296 295
## SAV CID OGG FAT ALB SYR LEX HSV MLI
## 293 292 292 291 285 280 276 261 255
## (Other)
## 13756
```

The result is very similar to Origin. We also need to check whether the busiest airports suffer delays the most.

16. Distance: in miles.

```r
summary(d$Distance)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##   31     338    599  753     984 4960
```
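One caveat about the time variables above: DepTime, ArrTime, CRSDepTime, and CRSArrTime are HHMM-encoded integers (e.g. 1126 means 11:26), so means and quartiles computed on them mix hours and minutes. A small helper (hhmm2min is our hypothetical name, not part of the dataset) converts them to minutes after midnight before doing arithmetic:

```r
# HHMM integers encode clock time, e.g. 1126 = 11:26. Convert to minutes
# after midnight so that averages and differences become meaningful.
hhmm2min <- function(t) (t %/% 100) * 60 + t %% 100

stopifnot(hhmm2min(1126) == 686)   # 11 * 60 + 26
stopifnot(hhmm2min(2400) == 1440)  # this dataset codes midnight as 2400
```

For example, `summary(hhmm2min(d$DepTime))` would give statistics in true minutes.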
The vast majority of flights cover a distance of under 1,000 miles. The relationship between distance and delay time is another important question we need to examine.

5 Association Rule

Thinking about flight performance, we believe there are specific, important patterns that, once identified, could help improve the quality of on-time performance. In this section we look for hidden relationships among the different facets of the flight dataset. We raise concrete questions, for instance: how do factors such as Distance and DayOfWeek influence on-time performance? Our goal is to answer these questions using association rules.

• Support: the probability that antecedent and consequent hold simultaneously in the dataset.
• Confidence: the conditional probability that the consequent holds given that the antecedent is satisfied.
• Lift: the ratio of confidence to expected confidence. Values greater than 1 indicate that the rule has predictive potential.

First, we load the libraries that association rule mining requires.

```r
library(arules)
library(arulesViz)
```

We copy the original dataset, because we will do some further data transformation on it, and store the copy as data.

```r
data = d
```

In our experience it is effective to divide the numeric variables into ordered, reasonable ranges for this analysis. So we split Distance, ArrDelay, AirTime, CRSDepTime, and CRSArrTime respectively, making sure that every observation falls into one of the given ranges. (Note that cut() uses left-open, right-closed intervals by default, so, for example, an arrival delay of exactly 25 minutes falls into the first bin.)

```r
data$Distance = ordered(cut(data$Distance, c(0, 300, 600, 1000, Inf)),
    labels = c("Short", "Medium", "Long", "Too long"))
data$ArrDelay = ordered(cut(data$ArrDelay, c(0, 25, 50, 80, Inf)),
    labels = c("On-Time", "Delayed", "Intermediate-Delayed", "Much-Delayed"))
data$AirTime = ordered(cut(data$AirTime, c(-1, 50, 100, 200, 300, Inf)),
    labels = c("Too-Short", "Short", "Intermediate", "Long", "Too-Long"))
data$CRSDepTime = ordered(cut(data$CRSDepTime, c(-1, 600, 1200, 1800, Inf)),
    labels = c("Overnight", "Morning", "Afternoon", "Evening"))
data$CRSArrTime = ordered(cut(data$CRSArrTime, c(-1, 600, 1200, 1800, 2359)),
    labels = c("Overnight", "Morning", "Afternoon", "Evening"))
```

DayOfWeek contains the numbers 1 to 7 to represent the days of the week, so we convert it to character, replace each number with the day's name, and turn the result into a factor.

```r
data$DayOfWeek = as.character(data$DayOfWeek)
data$DayOfWeek = gsub("^1", "Sunday", data$DayOfWeek)
data$DayOfWeek = gsub("^2", "Monday", data$DayOfWeek)
data$DayOfWeek = gsub("^3", "Tuesday", data$DayOfWeek)
data$DayOfWeek = gsub("^4", "Wednesday", data$DayOfWeek)
data$DayOfWeek = gsub("^5", "Thursday", data$DayOfWeek)
data$DayOfWeek = gsub("^6", "Friday", data$DayOfWeek)
data$DayOfWeek = gsub("^7", "Saturday", data$DayOfWeek)
data$DayOfWeek = factor(data$DayOfWeek)
```

Several variables, such as FlightNum and ActualElapsedTime, are not useful here, so we remove them.

```r
logNdx = !(names(data) %in% c("DayofMonth", "FlightNum", "Cancelled",
    "ActualElapsedTime", "DepDelay", "UniqueCarrier"))
data.AR = data[, logNdx]
```

Having finished the processing above, the dataset to analyze is called data.AR. It contains 10 variables, shown below.
```r
summary(data.AR)
```

```
##     DayOfWeek       CRSDepTime       CRSArrTime        TailNum
##  Friday   :20526   Overnight: 3009   Overnight: 2050   N986CA :   129
##  Monday   :27345   Morning  :54579   Morning  :35361   N87353 :   126
##  Saturday :21397   Afternoon:72468   Afternoon:67079   N77302 :   122
##  Sunday   :30021   Evening  :38591   Evening  :64157   N507CA :   112
##  Thursday :22742                                       N472CA :   107
##  Tuesday  :26882                                       N471CA :   106
##  Wednesday:19734                                       (Other):167945
##         AirTime                      ArrDelay          Origin
##  Too-Short   :28126   On-Time             :46402   ATL    : 12232
##  Short       :64977   Delayed             :53601   ORD    : 11020
##  Intermediate:56146   Intermediate-Delayed:29107   DEN    :  7004
##  Long        :14165   Much-Delayed        :39537   DFW    :  6208
##  Too-Long    : 5233                                DTW    :  4984
##                                                    PHX    :  4353
##                                                    (Other):122846
```
```
##       Dest            Distance
##  ATL    : 11791   Short   :33685
##  ORD    :  9506   Medium  :50906
##  DEN    :  6338   Long    :43644
##  DFW    :  5159   Too long:40412
##  LAX    :  5013
##  PHX    :  4663
##  (Other):126177
```

We now apply association rule mining to the dataset. First, we want to find the main factors that could cause a flight to be on time or not. We set the support and confidence thresholds and restrict the right-hand side to the four delay levels: On-Time, Delayed, Intermediate-Delayed, and Much-Delayed.

```r
apriori.appearance1 = list(rhs = c("ArrDelay=On-Time", "ArrDelay=Delayed",
    "ArrDelay=Intermediate-Delayed", "ArrDelay=Much-Delayed"), default = "lhs")
apriori.parameter1 = list(support = 0.01, confidence = 0.1)
rules1 = apriori(data.AR, parameter = apriori.parameter1,
    appearance = apriori.appearance1)
```

```
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
##        0.1    0.1    1 none FALSE            TRUE    0.01      1     10
## target ext
##  rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
##    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[4 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.07s].
## writing ... [743 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
```

Requiring lift greater than 1.2, we create a subset of the rules ordered by lift.

```r
rules1.subset = subset(rules1, subset = lift > 1.2 & confidence > 0.1)
rules1.subset.conf = sort(rules1.subset, by = "lift")
```
```r
summary(rules1.subset)
```

```
## set of 50 rules
##
## rule length distribution (lhs + rhs): sizes
##  2  3  4  5
##  4 26 18  2
##
##  Min. 1st Qu. Median Mean 3rd Qu. Max.
##  2.00    3.00   3.00 3.36    4.00 5.00
##
## summary of quality measures:
##     support          confidence         lift
##  Min.   :0.0101   Min.   :0.284   Min.   :1.20
##  1st Qu.:0.0118   1st Qu.:0.314   1st Qu.:1.24
##  Median :0.0147   Median :0.342   Median :1.27
##  Mean   :0.0182   Mean   :0.337   Mean   :1.31
##  3rd Qu.:0.0200   3rd Qu.:0.354   3rd Qu.:1.32
##  Max.   :0.0725   Max.   :0.428   Max.   :1.83
##
## mining info:
##    data ntransactions support confidence
## data.AR        168647    0.01        0.1
```

The list below displays the top five rules sorted by lift.

```r
inspect(rules1.subset.conf[1:5])
```

```
##   lhs                       rhs                      support confidence  lift
## 1 {Dest=EWR}             => {ArrDelay=Much-Delayed}  0.01101     0.4284 1.827
## 2 {CRSDepTime=Afternoon,
##    Dest=ORD}             => {ArrDelay=Much-Delayed}  0.01040     0.4048 1.727
## 3 {CRSArrTime=Evening,
##    Origin=ORD}           => {ArrDelay=Much-Delayed}  0.01032     0.3786 1.615
## 4 {Dest=ORD}             => {ArrDelay=Much-Delayed}  0.02001     0.3549 1.514
## 5 {Origin=ORD}           => {ArrDelay=Much-Delayed}  0.02199     0.3365 1.435
```

Flights originating from or landing at ORD are very likely to be delayed. We searched the weather history of Chicago O'Hare International Airport (ORD): it did suffer a very severe snowstorm during that period. So even though weather conditions are not included in the flight data, weather is still a key driver of on-time performance.

```r
rules1.subset.delay = subset(rules1.subset, subset = lhs %in% "DayOfWeek=Monday")
plot(rules1.subset.delay, method = "graph", control = list(type = "items"))
```
[Figure: graph of 5 rules connecting DayOfWeek=Monday, CRSDepTime=Morning/Evening, and CRSArrTime=Morning/Evening to ArrDelay=On-Time and ArrDelay=Much-Delayed; node size encodes support (0.011-0.019), color encodes lift (1.213-1.421).]

Next, we mine rules whose consequent is ArrDelay=On-Time.

```r
apriori.appearance3 = list(rhs = c("ArrDelay=On-Time"), default = "lhs")
apriori.parameter3 = list(support = 0.01, confidence = 0.1)
rules3 = apriori(data.AR, parameter = apriori.parameter3,
    appearance = apriori.appearance3)
```

```
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
##        0.1    0.1    1 none FALSE            TRUE    0.01      1     10
## target ext
##  rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
##    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
```
```
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [211 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
```

```r
rules3.subset = subset(rules3, subset = lift > 1.2 & confidence > 0.1)
rules3.subset.conf = sort(rules3.subset, by = "lift")
rules3.subset.ontime = subset(rules3.subset.conf, subset = lhs %in%
    c("DayOfWeek=Friday", "DayOfWeek=Saturday", "DayOfWeek=Sunday",
      "DayOfWeek=Monday", "DayOfWeek=Tuesday", "DayOfWeek=Wednesday",
      "DayOfWeek=Thursday"))
inspect(rules3.subset.ontime[1:5])
```

```
##   lhs                      rhs                 support confidence  lift
## 1 {DayOfWeek=Monday,
##    CRSArrTime=Morning}  => {ArrDelay=On-Time}  0.01326     0.3909 1.421
## 2 {DayOfWeek=Monday,
##    CRSDepTime=Morning,
##    CRSArrTime=Morning}  => {ArrDelay=On-Time}  0.01170     0.3826 1.390
## 3 {DayOfWeek=Monday,
##    CRSDepTime=Morning}  => {ArrDelay=On-Time}  0.01874     0.3624 1.317
## 4 {DayOfWeek=Wednesday,
##    CRSDepTime=Morning}  => {ArrDelay=On-Time}  0.01283     0.3503 1.273
## 5 {DayOfWeek=Sunday,
##    CRSArrTime=Morning}  => {ArrDelay=On-Time}  0.01280     0.3349 1.217
```

Clearly, the flights with the highest on-time performance are in the morning. This suggests that in December 2008 air traffic control ran smoothly in the morning but was comparatively heavier in the afternoon and evening.
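As a sanity check on these numbers, lift is just confidence divided by the consequent's baseline support, and we can recompute it by hand from the counts reported earlier by summary(data.AR) (46,402 on-time flights out of 168,647) for the top rule above:

```r
# Recompute lift = confidence / support(rhs) for the rule
# {DayOfWeek=Monday, CRSArrTime=Morning} => {ArrDelay=On-Time},
# using the counts reported by summary(data.AR).
conf  <- 0.3909            # confidence reported by inspect()
p.rhs <- 46402 / 168647    # baseline P(ArrDelay = "On-Time")
lift  <- conf / p.rhs
stopifnot(abs(lift - 1.421) < 0.005)
```

In other words, being scheduled to arrive on a Monday morning raises the on-time probability roughly 42% above the December baseline.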
```r
rules1.subset.delay1 = subset(rules1.subset.conf, subset = lhs %in%
    c("Distance=Short", "Distance=Medium", "Distance=Long", "Distance=Too long"))
inspect(rules1.subset.delay1[1:10])
```

```
##    lhs                       rhs                 support confidence  lift
## 1  {CRSArrTime=Morning,
##     AirTime=Short,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.02186     0.3679 1.337
## 2  {CRSArrTime=Morning,
##     AirTime=Intermediate,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01494     0.3610 1.312
```
```
## 3  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     AirTime=Short,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.01996     0.3605 1.310
## 4  {CRSArrTime=Morning,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.02433     0.3596 1.307
## 5  {CRSArrTime=Morning,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01849     0.3556 1.292
## 6  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     AirTime=Intermediate,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01310     0.3547 1.289
## 7  {AirTime=Short,
##     Origin=ATL,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.01010     0.3527 1.282
## 8  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.02219     0.3526 1.281
## 9  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01622     0.3493 1.270
## 10 {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     Distance=Too long}    => {ArrDelay=On-Time}  0.01120     0.3480 1.265
```

We find that the flights most likely to be on time are often those on long-distance routes. Short routes between nearby regions are much busier in the air traffic control system: you may have five or more flight choices from Boston to New York City, while there may be only two if your family wants to travel from Washington, D.C. to beautiful San Diego. Long-distance routes therefore put less pressure on the system, and air traffic jams are more likely on shorter routes.

6 Cluster Analysis

We are going to study the airline dataset using cluster analysis. Clustering generally refers to sorting observations into k groups (k indicates how many groups will be created) so as to maximize the similarity of observations within the same group and minimize the similarity of observations across different groups. Broadly, cluster analysis can be separated into two approaches, hierarchical and non-hierarchical; we will run non-hierarchical clustering with K-Means and PAM.
For the Airline dataset, we will continue looking for associations within the data and for the factors that influence ArrDelay.

6.1 Setup
First we load the libraries we are going to use.
library(cluster) # cluster library
library(proxy) # hcluster function
library(fpc) # cluster.stats function
library(pamr) # pam function
library(clValid) # clValid function
library(ggplot2) # plot diagram
Because the original dataset is too large to compute a distance matrix on, we randomly choose 1000 records and run the analysis on this sample.
cd = cd[sample(nrow(cd), 1000), ]
m = as.matrix(cd)
str(m)
## int [1:1000, 1:12] 31 10 28 18 14 23 7 20 11 20 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "6926813" "6946821" "6545814" "6680323" ...
## ..$ : chr [1:12] "DayofMonth" "DayOfWeek" "DepTime" "CRSDepTime" ...
mDist = dist(m)

6.2 Determine K
The most important step in cluster analysis is choosing the best K. To find it, we apply the clValid function, which directly reports the optimal result.
# clValid
hvalid <- clValid(m, 2:10, clMethods = c("hierarchical"), validation = "internal", maxitems = 1e+06)
pamvalid <- clValid(m, 2:10, clMethods = c("pam"), validation = "internal", maxitems = 1e+06)
kvalid <- clValid(m, 2:10, clMethods = c("kmeans"), validation = "internal", maxitems = 1e+06)
Now we can use the summary() function to inspect each method's result; the Optimal Scores section directly gives the best number of clusters.
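One of clValid's internal measures is the average silhouette width, whose definition is simple enough to compute by hand. A stdlib-Python sketch (toy points, not the airline sample):

```python
def silhouette_width(points, labels):
    """Average silhouette width: the mean over all points of (b - a) / max(a, b),
    where a is the mean distance to the point's own cluster and b is the
    smallest mean distance to any other cluster."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    n = len(points)
    widths = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:
            widths.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in same) / len(same)
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for other in ([j for j in range(n) if labels[j] == lab]
                          for lab in set(labels) if lab != labels[i])
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / n

# Two well-separated toy clusters give a width close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]
```

Values near 1 mean tight, well-separated clusters; values near 0 or below mean overlapping ones, which is why the silhouette criterion favours the K that maximizes this score.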
  • 18. INFO7374 Data Science Final Project summary(kvalid) ## ## Clustering Methods: ## kmeans ## ## Cluster sizes: ## 2 3 4 5 6 7 8 9 10 ## ## Validation Measures: ## 2 3 4 5 6 7 8 9 ## ## kmeans Connectivity 2.878 79.275 85.293 86.702 104.193 131.675 105.027 128.007 136. ## Dunn 0.098 0.012 0.012 0.014 0.027 0.023 0.027 0.023 0. ## Silhouette 0.471 0.454 0.468 0.470 0.485 0.388 0.486 0.390 0. ## ## Optimal Scores: ## ## Score Method Clusters ## Connectivity 2.878 kmeans 2 ## Dunn 0.098 kmeans 2 ## Silhouette 0.486 kmeans 8 summary(pamvalid) ## ## Clustering Methods: ## pam ## ## Cluster sizes: ## 2 3 4 5 6 7 8 9 10 ## ## Validation Measures: ## 2 3 4 5 6 7 8 9 10 ## ## pam Connectivity 91.199 139.967 145.223 171.271 173.816 237.760 221.203 212.537 246.705 ## Dunn 0.019 0.013 0.014 0.018 0.009 0.012 0.013 0.013 0.013 ## Silhouette 0.404 0.288 0.318 0.360 0.305 0.298 0.320 0.320 0.333 ## ## Optimal Scores: ## ## Score Method Clusters ## Connectivity 91.199 pam 2 ## Dunn 0.019 pam 2 ## Silhouette 0.404 pam 2 summary(hvalid) 18
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 5.287 5.287 7.005 11.240 32.673 33.506 35.006 38.148 44.491
## Dunn 0.420 0.420 0.417 0.381 0.073 0.073 0.073 0.074 0.074
## Silhouette 0.485 0.460 0.443 0.426 0.407 0.374 0.368 0.317 0.318
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 5.287 hierarchical 2
## Dunn 0.420 hierarchical 2
## Silhouette 0.485 hierarchical 2
Because the 1000 sample records are chosen randomly, the results are not always identical, but in most cases K = 2 is suggested. We can also use other measurements to validate this result. The foreachcluster3 function below reports six measurements for each cluster number from 2 to 10.
foreachcluster3 = function(k) {
    pamC = pam(x = m, k)
    p.stats = cluster.stats(mDist, pamC$clustering)
    c(max.dia = p.stats$max.diameter, min.sep = p.stats$min.separation,
      avg.wi = p.stats$average.within, avg.bw = p.stats$average.between,
      silwidth = p.stats$avg.silwidth, dunn = p.stats$dunn)
}
We apply this function to cluster numbers from 2 to 10 and use rbind to build a table.
t3 = rbind(foreachcluster3(2), foreachcluster3(3), foreachcluster3(4), foreachcluster3(5),
    foreachcluster3(6), foreachcluster3(7), foreachcluster3(8), foreachcluster3(9),
    foreachcluster3(10))
rownames(t3) = 2:10
t3
## max.dia min.sep avg.wi avg.bw silwidth dunn
## 2 3899 75.14 1062.9 1811 0.4041 0.019271
## 3 3899 52.03 963.4 1666 0.2884 0.013344
## 4 3802 52.03 797.9 1698 0.3184 0.013685
## 5 3366 59.92 653.1 1689 0.3599 0.017800
## 6 3366 30.82 605.6 1625 0.3055 0.009157
## 7 3265 39.76 579.1 1602 0.2979 0.012177
## 8 2959 39.76 552.8 1604 0.3199 0.013437
## 9 2959 38.39 521.4 1583 0.3196 0.012974
## 10 2959 39.76 480.1 1562 0.3330 0.013437
This result also suggests K = 2 for the cluster analysis.

6.3 Cluster Analysis
Having settled on K = 2, we can compare the clusters to see whether they reveal anything interesting. From the previous tests, we found that hclust does not perform well on this data: one cluster holds only a few elements while the other holds over 99% of them. We will therefore use the pam and kmeans functions.

6.3.1 Pam
We apply the pam function to the matrix with k = 2.
pamC = pam(x = m, 2)
pamC$clusinfo
## size max_diss av_diss diameter separation
## [1,] 561 2814 741.6 3505 75.14
## [2,] 439 3004 717.2 3899 75.14
pamcluster = data.frame(pamC$clustering)
We paste the cluster result back onto our original dataset.
total = cbind(cd, pamcluster)
After that, we can obtain the two subsets according to their cluster numbers.
d1 = subset(total, pamC.clustering == 1)
d2 = subset(total, pamC.clustering == 2)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
## 1st Qu.:10.0 1st Qu.:2.00 1st Qu.:1631 1st Qu.:1525
## Median :18.0 Median :3.00 Median :1809 Median :1715
  • 21. INFO7374 Data Science Final Project ## Mean :16.9 Mean :3.59 Mean :1802 Mean :1699 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2004 3rd Qu.:1855 ## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 2 Min. : 640 Min. : 33 Min. : 35 ## 1st Qu.:1745 1st Qu.:1718 1st Qu.: 84 1st Qu.: 80 ## Median :1941 Median :1914 Median :123 Median :115 ## Mean :1826 Mean :1904 Mean :137 Mean :130 ## 3rd Qu.:2136 3rd Qu.:2105 3rd Qu.:168 3rd Qu.:160 ## Max. :2357 Max. :2359 Max. :441 Max. :407 ## AirTime ArrDelay DepDelay Distance ## Min. : 14 Min. : 15.0 Min. :-10.0 Min. : 56 ## 1st Qu.: 57 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334 ## Median : 88 Median : 47.0 Median : 45.0 Median : 590 ## Mean :108 Mean : 69.3 Mean : 62.4 Mean : 720 ## 3rd Qu.:137 3rd Qu.: 97.0 3rd Qu.: 89.0 3rd Qu.: 948 ## Max. :382 Max. :395.0 Max. :377.0 Max. :2640 ## pamC.clustering ## Min. :1 ## 1st Qu.:1 ## Median :1 ## Mean :1 ## 3rd Qu.:1 ## Max. :1 summary(d2) ## DayofMonth DayOfWeek DepTime CRSDepTime ## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45 ## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 835 1st Qu.: 810 ## Median :18.0 Median :3.00 Median :1021 Median : 955 ## Mean :17.2 Mean :3.54 Mean :1041 Mean :1001 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1205 3rd Qu.:1130 ## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 10 Min. : 1 Min. : 35 Min. : 34.0 ## 1st Qu.:1022 1st Qu.: 942 1st Qu.: 92 1st Qu.: 83.5 ## Median :1217 Median :1130 Median :129 Median :116.0 ## Mean :1178 Mean :1111 Mean :146 Mean :134.6 ## 3rd Qu.:1408 3rd Qu.:1322 3rd Qu.:176 3rd Qu.:165.0 ## Max. :1810 Max. :2345 Max. :432 Max. :405.0 ## AirTime ArrDelay DepDelay Distance ## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74 ## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 5.0 1st Qu.: 344 ## Median : 92 Median : 33.0 Median : 27.0 Median : 594 ## Mean :113 Mean : 55.9 Mean : 44.2 Mean : 750 21
## 3rd Qu.:140 3rd Qu.: 65.5 3rd Qu.: 56.0 3rd Qu.: 966
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## pamC.clustering
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
We can see from the summaries that the two clusters are similar in every column except Departure Time and our target variable, Arrival Delay. When the departure time is around midnight or in the morning, a flight is more likely to have a relatively low delay, which matches the conclusion we drew from the association rules.
totaldf = data.frame(total)
totaldf$pamC.clustering = as.factor(totaldf$pamC.clustering)
qplot(data = totaldf, x = totaldf$pamC.clustering, y = totaldf$DepTime,
    colour = totaldf$pamC.clustering, geom = "boxplot")
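pam is a k-medoids method: like k-means, but every cluster centre must be one of the observations. A toy stdlib-Python sketch of that idea (deterministic start and no swap phase, so only a rough cousin of R's pam()):

```python
def pam_like(points, k, iters=10):
    """Toy k-medoids: assign each point to the nearest medoid, then move each
    medoid to the cluster member minimising total within-cluster distance.
    A sketch of the idea only -- R's pam() also runs a build and swap phase."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    medoids = list(points[:k])              # deterministic start for the sketch
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        new = [min(c, key=lambda m: sum(dist(m, q) for q in c)) if c else medoids[i]
               for i, c in clusters.items()]
        if new == medoids:                  # converged
            break
        medoids = new
    labels = [min(range(k), key=lambda i: dist(p, medoids[i])) for p in points]
    return medoids, labels

# Two obvious groups; pam_like should put the first three points together.
pts = [(0, 0), (1, 0), (0, 1), (9, 9), (9, 10), (10, 9)]
```

Because medoids are real observations, the returned centres are interpretable rows of the data, which is one reason to prefer pam over kmeans on noisy data.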
[Boxplot of DepTime by PAM cluster: cluster 1 departures concentrate in the afternoon and evening, cluster 2 in the morning.]
From the plot, we can see that pam has done a good job of separating the clusters by departure time. Next, we try kmeans and compare the results.

6.3.2 Kmeans
We apply the same steps with kmeans to see whether it works better than pam.
kmeans.results = kmeans(m, 2)
clusterdf = data.frame(kmeans.results$cluster)
total = cbind(cd, clusterdf)
d1 = subset(total, kmeans.results.cluster == 1)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
  • 24. INFO7374 Data Science Final Project ## 1st Qu.:10.5 1st Qu.:2.00 1st Qu.:1627 1st Qu.:1520 ## Median :18.0 Median :3.00 Median :1804 Median :1710 ## Mean :17.0 Mean :3.58 Mean :1797 Mean :1693 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2002 3rd Qu.:1855 ## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 2 Min. : 640 Min. : 33 Min. : 35 ## 1st Qu.:1739 1st Qu.:1714 1st Qu.: 84 1st Qu.: 80 ## Median :1939 Median :1910 Median :123 Median :115 ## Mean :1817 Mean :1899 Mean :137 Mean :130 ## 3rd Qu.:2134 3rd Qu.:2104 3rd Qu.:168 3rd Qu.:160 ## Max. :2357 Max. :2359 Max. :441 Max. :407 ## AirTime ArrDelay DepDelay Distance ## Min. : 14.0 Min. : 15.0 Min. :-10.0 Min. : 56 ## 1st Qu.: 57.5 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334 ## Median : 88.0 Median : 47.0 Median : 45.0 Median : 588 ## Mean :107.6 Mean : 69.8 Mean : 62.8 Mean : 720 ## 3rd Qu.:136.5 3rd Qu.: 97.0 3rd Qu.: 88.5 3rd Qu.: 947 ## Max. :382.0 Max. :425.0 Max. :392.0 Max. :2640 ## kmeans.results.cluster ## Min. :1 ## 1st Qu.:1 ## Median :1 ## Mean :1 ## 3rd Qu.:1 ## Max. :1 d2 = subset(total, kmeans.results.cluster == 2) summary(d2) ## DayofMonth DayOfWeek DepTime CRSDepTime ## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45 ## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 834 1st Qu.: 805 ## Median :18.0 Median :3.00 Median :1017 Median : 950 ## Mean :17.1 Mean :3.54 Mean :1030 Mean : 992 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1202 3rd Qu.:1125 ## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 24 Min. : 1 Min. : 35 Min. : 34 ## 1st Qu.:1021 1st Qu.: 940 1st Qu.: 92 1st Qu.: 83 ## Median :1215 Median :1123 Median :129 Median :117 ## Mean :1175 Mean :1099 Mean :146 Mean :135 ## 3rd Qu.:1402 3rd Qu.:1318 3rd Qu.:177 3rd Qu.:166 ## Max. :1810 Max. :2305 Max. :432 Max. :405 ## AirTime ArrDelay DepDelay Distance ## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74 24
## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 4.0 1st Qu.: 344
## Median : 92 Median : 33.0 Median : 26.0 Median : 595
## Mean :113 Mean : 54.9 Mean : 43.2 Mean : 750
## 3rd Qu.:140 3rd Qu.: 64.0 3rd Qu.: 55.0 3rd Qu.: 967
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## kmeans.results.cluster
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
The two methods generate very similar results. Both also show a strong relationship between Departure Time and Arrival Delay, which matches our findings from the association rules and decision trees.

7 Decision Tree
In this section, we use decision trees to analyze the factors that affect the target variables. First, we load the required libraries.
library(rpart)
library(rpart.plot)
library(rattle)
library(maptree)
library(party)
library(partykit)

7.1 Categorize Variable
We categorize our variables into different groups.
Distance: We divided the distances into three parts: up to 750, 750 to 1000, and greater than 1000.
d$Distance = ordered(cut(d$Distance, c(0, 750, 1000, Inf)), labels = c("upto750", "750to1000", ">1000"))
DayOfWeek: Replace the weekday numbers with labels (1 = MON, 2 = TUE, etc.) with the help of gsub.
d$DayOfWeek = gsub("1", "MON", d$DayOfWeek)
d$DayOfWeek = gsub("2", "TUE", d$DayOfWeek)
d$DayOfWeek = gsub("3", "WED", d$DayOfWeek)
d$DayOfWeek = gsub("4", "THU", d$DayOfWeek)
d$DayOfWeek = gsub("5", "FRI", d$DayOfWeek)
d$DayOfWeek = gsub("6", "SAT", d$DayOfWeek)
d$DayOfWeek = gsub("7", "SUN", d$DayOfWeek)
Origin: Origin airports are grouped into five regions (SW, SE, NE, MW, W) with the help of the gsub function.
d$Origin = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS "SW", d$Origin)
d$Origin = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT "SE", d$Origin)
d$Origin = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT "NE", d$Origin)
d$Origin = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID "MW", d$Origin)
d$Origin = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT "W", d$Origin)
Dest: Destination airports are grouped into the same five regions (SW, SE, NE, MW, W) with the help of the gsub function.
d$Dest = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS|G "SW", d$Dest)
d$Dest = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT|D "SE", d$Dest)
d$Dest = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT|C "NE", d$Dest)
d$Dest = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID|C "MW", d$Dest)
d$Dest = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT|B "W", d$Dest)
DayofMonth: We divided the December days of the month into regular days and Christmas week.
d$DayofMonth = ordered(cut(d$DayofMonth, c(0, 23, 32)), labels = c("R.Days", "CH.Days"))
DepDelay: We divided the departure delay into two parts, low and high delay.
d$DepDelay = ordered(cut(d$DepDelay, c(-Inf, 60, Inf)), labels = c("low", "high"))

7.2 Rpart
Rpart performs recursive partitioning for classification, regression and survival trees. We are going to classify the two target variables, ArrDelay and DepDelay, using rpart.
Departure Delay: DepDelay is the response variable; DayofMonth, DayOfWeek, DepTime and Distance are the predictor variables.
ss.formula = DepDelay ~ DayofMonth + DayOfWeek + DepTime + Distance # formula for tree
ss.rpart = rpart(data = d, formula = ss.formula)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Tree diagram: the root (168647 obs) splits on DepTime at 1406.5; the left branch splits again at 447.5 into "low" (70766 obs) and "high" (1624 obs), and the right branch splits at 2229.5 into "low" (91550 obs) and "high" (4707 obs). Total classified correct = 27.5%.]
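The fitted tree reduces to two cut-offs on the scheduled departure time. Encoding the splits shown in the tree above by hand gives a tiny rule-based classifier (a stdlib-Python sketch for illustration, not something rpart produces itself):

```python
def dep_delay_class(dep_time):
    """Apply the rpart splits from the tree above (times in hhmm form):
    departures between about 04:48 and 22:30 are predicted 'low' delay,
    very early or very late ones 'high'."""
    if dep_time < 1406.5:
        return "high" if dep_time < 447.5 else "low"
    return "high" if dep_time >= 2229.5 else "low"
```

For example, a 23:00 departure falls into the small late-night node and is predicted "high", while a 09:00 departure is predicted "low".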
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 168647 50820 low (0.6987 0.3013)
## 2) DepTime< 1406 72390 14340 low (0.8019 0.1981)
## 4) DepTime>=447.5 70766 13030 low (0.8158 0.1842) *
## 5) DepTime< 447.5 1624 314 high (0.1933 0.8067) *
## 3) DepTime>=1406 96257 36480 low (0.6210 0.3790)
## 6) DepTime< 2230 91550 33200 low (0.6374 0.3626) *
## 7) DepTime>=2230 4707 1427 high (0.3032 0.6968) *
From this tree we conclude that delays are more common at night, between about 10:30 PM and 5:00 AM, than during the day. The trees depend mainly on arrival and departure times, so we removed AirTime from the next decision tree.
Arrival Delay: ArrDelay is the response variable; DayofMonth, DayOfWeek, Origin and Distance are the predictor variables.
ss.formula = ArrDelay ~ DayofMonth + DayOfWeek + Distance + Origin # formula for tree
R.control = rpart.control(cp = 0.001) # to control tree
ss.rpart = rpart(data = d, formula = ss.formula, control = R.control)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Tree diagram for ArrDelay (mean delay per node): the root (168647 obs, mean 62.55) splits on Origin into SE/SW/W (mean 59.06) and MW/NE (mean 69.20); further splits use DayOfWeek and DayofMonth, with the highest mean delay (77.0, 24241 obs) for FRI/SAT/TUE flights from MW/NE origins. Total deviance explained = 1.5%.]
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 168647 675500000 62.55
## 2) Origin=SE,SW,W 110460 394400000 59.06
## 4) DayOfWeek=MON,SUN,THU,TUE,WED 81657 273300000 57.44
## 8) Origin=SE,SW 50291 142800000 54.80 *
## 9) Origin=W 31366 129600000 61.67 *
## 5) DayOfWeek=FRI,SAT 28803 120300000 63.64
## 10) DayofMonth=R.Days 17922 68280000 58.88 *
## 11) DayofMonth=CH.Days 10881 50950000 71.48 *
## 3) Origin=MW,NE 58187 277100000 69.20
## 6) DayOfWeek=MON,SUN,THU,WED 33946 124100000 63.62
## 12) DayOfWeek=THU 5849 17040000 52.35 *
## 13) DayOfWeek=MON,SUN,WED 28097 106100000 65.97 *
## 7) DayOfWeek=FRI,SAT,TUE 24241 150500000 77.01 *
From this decision tree we can see how the dataset is partitioned: origins into the SE/SW/W and MW/NE regions, and days of the week into MON/SUN/THU/TUE/WED versus FRI/SAT.

7.3 Ctree
Ctree builds conditional inference trees, which embed tree-structured regression models into a well-defined theory of conditional inference procedures.
Departure Delay:
ss.formula1 = DepDelay ~ Distance + DepTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) # height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Ctree plot for DepDelay: the root splits on DepTime at 1406 (p < 0.001), with further splits at 447 and 2229; the bar charts show a much higher share of "high" delay in the very early node (n = 1624) and the very late node (n = 4707).]
As we explained for rpart, ctree gives the same result: delays are more common at night, between about 10:30 PM and 5:00 AM, than during the day.
Arrival Delay:
ss.formula1 = ArrDelay ~ Distance + ArrTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) # height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Ctree plot for ArrDelay: the root splits on ArrTime at 518 (p < 0.001), with further splits at 134 and 1438; the boxplots show larger arrival delays in the early-morning nodes.]
From this tree we can conclude that delays for arrivals around midnight and before about 5:18 AM are higher than during the day.

8 Random Forest
Now we will use random forest analysis to learn more about predictions. The following libraries will be used.
library(randomForest) # for randomForest
library(rpart)
library(caret) # for confusionMatrix
Because the original data is too large, we again randomly select 1000 rows.
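A random forest is, at its core, bagging: each tree is grown on a bootstrap resample of the training data and the trees vote. The stdlib-Python sketch below illustrates that idea with single-split "stumps" standing in for full trees (toy data, nothing from the airline frame):

```python
import random

def train_stump(xs, ys):
    """Best single-threshold classifier: pick the threshold and orientation
    with the highest training accuracy."""
    best = None
    for thr in sorted(set(xs)):
        for left, right in (("low", "high"), ("high", "low")):
            acc = sum((left if x <= thr else right) == y
                      for x, y in zip(xs, ys)) / len(ys)
            if best is None or acc > best[0]:
                best = (acc, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagged_predict(xs, ys, x_new, n_trees=25, seed=0):
    """Bootstrap-resample the training data, fit one stump per resample and
    return the majority vote -- the bagging idea underlying random forests."""
    rng = random.Random(seed)
    n = len(xs)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        stump = train_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes.append(stump(x_new))
    return max(set(votes), key=votes.count)

# Toy DepTime (hhmm) data: late-night departures tend to be delayed.
dep = [600, 800, 900, 1100, 2240, 2300, 2350]
delay = ["low", "low", "low", "low", "high", "high", "high"]
```

randomForest() additionally samples a random subset of predictors at each split, which decorrelates the trees further; the voting step is the same.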
rfd = rd[sample(nrow(rd), 1000), ]
We separate our dataset into a train set and a test set.
ndxTrain = sample(x = nrow(rfd), size = 0.7 * nrow(rfd))
rfd.train = rfd[ndxTrain, ]
rfd.test = rfd[-ndxTrain, ]
We set all the other variables as predictors and see how they affect our target variable.
rfd.predictors = c("DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime", "ArrTime",
    "CRSArrTime", "AirTime", "ActualElapsedTime", "Distance")
rfd.rf = randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
print(rfd.rf)
##
## Call:
## randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.14%
## Confusion matrix:
## Low High class.error
## Low 272 80 0.2273
## High 89 259 0.2557
plot(rfd.rf)
[Plot of rfd.rf: error rate against the number of trees, flattening out well before 500 trees.]
From the diagram, we find that the error rate becomes stable as the number of trees grows, so we keep the default number of trees, which is 500.
rfd.train.pred = predict(object = rfd.rf, newdata = rfd.train, type = "class")
rfd.test.pred = predict(object = rfd.rf, newdata = rfd.test, type = "class")
confusionMatrix(data = rfd.train.pred, reference = rfd.train$ArrDelay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 352 0
## High 0 348
##
## Accuracy : 1
## 95% CI : (0.995, 1)
  • 35. INFO7374 Data Science Final Project ## No Information Rate : 0.503 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 1 ## Mcnemar's Test P-Value : NA ## ## Sensitivity : 1.000 ## Specificity : 1.000 ## Pos Pred Value : 1.000 ## Neg Pred Value : 1.000 ## Prevalence : 0.503 ## Detection Rate : 0.503 ## Detection Prevalence : 0.503 ## Balanced Accuracy : 1.000 ## ## 'Positive' Class : Low ## confusionMatrix(data = rfd.test.pred, reference = rfd.test$ArrDelay) ## Confusion Matrix and Statistics ## ## Reference ## Prediction Low High ## Low 133 32 ## High 29 106 ## ## Accuracy : 0.797 ## 95% CI : (0.747, 0.841) ## No Information Rate : 0.54 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.59 ## Mcnemar's Test P-Value : 0.798 ## ## Sensitivity : 0.821 ## Specificity : 0.768 ## Pos Pred Value : 0.806 ## Neg Pred Value : 0.785 ## Prevalence : 0.540 ## Detection Rate : 0.443 ## Detection Prevalence : 0.550 ## Balanced Accuracy : 0.795 ## ## 'Positive' Class : Low ## 35
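Every statistic in the caret summaries above follows from the four cells of the confusion matrix. A stdlib-Python sketch, fed with the test-set counts printed above, reproduces them:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and Cohen's kappa from a 2x2
    confusion matrix (positive class in the first row/column)."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)        # recall of the positive class
    specificity = tn / (fp + tn)
    # Expected agreement by chance, from the row/column marginals.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    pe = p_yes + p_no
    kappa = (accuracy - pe) / (1 - pe)
    return accuracy, sensitivity, specificity, kappa

# Test-set matrix from the randomForest output above:
#                 reference Low  High
# predicted Low             133    32
# predicted High             29   106
acc, sens, spec, kappa = binary_metrics(tp=133, fn=29, fp=32, tn=106)
```

These four counts give back exactly the 0.797 accuracy, 0.821 sensitivity, 0.768 specificity and 0.59 kappa that caret reports.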
Although the sample is randomly chosen, we always get an accuracy rate of over 70 percent on the test set, which is higher than a single decision tree.

9 Classification
Classification techniques predict group membership for data instances. For classification we are using the knn and svm algorithms.

9.1 knn
K-Nearest Neighbors (knn) is a supervised machine learning algorithm for object classification.
library(class) #for knn
library(RWeka) #for IBk function
## Error: package or namespace load failed for ’RWeka’

9.2 Processing Data
We remove the columns that are not useful.
kd = kd[, -20:-29]
kd = kd[, -1:-2]
kd = kd[, -3:-12]
kd = kd[, -4]
kd = kd[, -5]
We keep ArrDelay as our response variable, so we categorize it into two parts, low and high delay.
kd$ArrDelay = ordered(cut(kd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine is not able to handle the full dataset, so we use 1000 random records.
kdd = kd[sample(nrow(kd), 1000), ] # sample dataset
The IBk function implements the K-NN technique to predict the Arrival Delay variable from the remaining four variables of the kdd dataframe, so we use this function and store the result in classifier.
classifier = IBk(ArrDelay ~ DayOfWeek + DayofMonth + Distance + Origin,
    data = kdd, control = Weka_control(K = 4)) # k=4 because 4 other variable
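As the errors around this call show, RWeka did not load in this session, but the k-nearest-neighbour idea itself is easy to sketch in stdlib Python (toy points, not the kdd frame):

```python
def knn_predict(train, query, k=3):
    """Classify query by majority label among its k nearest training points
    (Euclidean distance). train is a list of (features, label) pairs."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

# Toy (DayOfWeek, Distance/100) points with a hypothetical delay class.
train_pts = [((6, 3), "High"), ((7, 2), "High"), ((6, 4), "High"),
             ((2, 9), "Low"), ((3, 10), "Low"), ((2, 11), "Low")]
```

A new point is simply given the label most common among its k closest neighbours, which is all IBk does (plus optional distance weighting).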
## Error: could not find function "IBk"
summary(classifier) # detail explanation with confusion matrix
## Error: error in evaluating the argument ’object’ in selecting a method for function ’summary’: Error: object ’classifier’ not found
With the k-nearest-neighbour technique we found that around 70% of the data was correctly classified and only 30% incorrectly. In the confusion matrix we can see that the high-delay part in particular is not classified properly.

9.3 SVM
For classification we also use another method, SVM. A Support Vector Machine can analyze data and recognize patterns, and is used for classification and regression analysis.

10 Processing Data
We keep ArrDelay as our response variable, so we categorize it into two parts, low and high.
sd$ArrDelay = ordered(cut(sd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine is not able to handle the full dataset, so we took part of it.
sdd = sd[sample(nrow(sd), 1000), ] # sample dataset
We divided our dataset into two parts, the train1 and test1 datasets.
sd1 = nrow(sdd)
nxd.train = sample(1:sd1, 0.7 * sd1)
sd.train1 = sdd[nxd.train, ]
sd.test1 = sdd[-nxd.train, ]
For SVM we use these two libraries.
library(e1071)
library(caret)
The predicted variable is ArrDelay, based on the two variables DayOfWeek and Distance.
sd.formula = ArrDelay ~ DayOfWeek + Distance
plot.formula = DayOfWeek ~ Distance #For plot X and Y axis
sd.model = svm(formula = sd.formula, data = sd.train1) # for actual model creation.
summary(sd.model) # Detail description of a model.
##
## Call:
## svm(formula = sd.formula, data = sd.train1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 639
##
## ( 322 317 )
##
##
## Number of Classes: 2
##
## Levels:
## Low High
sd.predict = predict(sd.model, sd.test1) # prediction on testing data set
# confusionMatrix(data = sd.predict, reference = sdd$ArrDelay)
plot(x = sd.model, data = sd.train1, formula = plot.formula) #default: cost=1, gamma=0.5
[SVM classification plot (cost = 1, gamma = 0.5): Distance against DayOfWeek, with support vectors marked and the predicted Low/High regions shaded.]
For a clearer result we change the cost and gamma parameters.
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
    kernel = "radial", cost = 1, gamma = 5)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[SVM classification plot with gamma = 5: Distance against DayOfWeek.]
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
    kernel = "radial", cost = 1, gamma = 0.1)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[SVM classification plot with gamma = 0.1: Distance against DayOfWeek.]
From this graph we can see that days 6 and 7, Saturday and Sunday, have more delays than the rest of the week, and that beyond a certain distance the arrival delays get lower.

11 Conclusion
Overall, we found some useful results in our analysis. Within the logistic regression, we found several variables that were statistically significant, such as Distance, Day Of Week, Origin, Destination, Departure Time, Arrival Time and Day Of Month. We converted them from numeric variables to categorical variables.
We found some of the reasons behind the arrival delays of U.S. air flights. To find relationships between the different variables, we applied association rules and got good results. We found that Monday flights are more likely to be on time. We also found that Chicago flight delays were high in December 2008; checking the weather records showed that the weather in Chicago was very bad that month, and many flights were affected for that reason.
The fastest and easiest way to make decisions about our dataset is to apply the decision tree mechanism, where we organized our data hierarchically, using the Rpart and Ctree algorithms. From the diagrams, we conclude that delays are relatively higher at night, between 10:30 PM and 5:00 AM, and that delays on weekends (Saturday and Sunday) tend to be higher than on weekdays, which makes sense.
We also used clustering analysis. After applying the Kmeans, Hclust and Pam clustering methods to our dataset, we found that K = 2 is the best number of clusters, and we validated that with the clValid function and other measurements. After separating the dataset into two subsets, we also found relationships that confirm what we had found with the association rules.
In classification we used the knn and svm techniques. With the k-nearest neighbor technique we found that approximately 70 percent of the data was correctly classified, which the confusion matrix confirms. We did some pattern recognition with the help of a Support Vector Machine (SVM), where we examined how delay relates to distance and day of week.
According to our analysis, we suggest that it is better to travel during the daytime on weekdays, so that you can arrive at your destination on time.

12 Limitation
There are still a few limitations in our analysis of this dataset. First, we are limited by our computers' processing capability. The original dataset is huge, containing 7,009,728 observations, so we selected a part of it (all U.S. airline data for December 2008) to reduce the file size loaded into R.
In addition, when we ran the PAM algorithm in the cluster analysis and RandomForest, R often got stuck or even crashed, so we had to apply those functions, such as computing the distance matrix, to a random sample. However, we have not verified how the random sample affects our results.
Another limitation is that so far we only focus on delay. There might be other interesting relationships among the other variables; we may work on them in the future.

13 Future Work
While we have already obtained some analysis outcomes, there is still work we can do in the future.
First, due to the limitations of our computers, we are not able to process large-scale data, so we cannot apply some of the functions to the full dataset. Our analysis uses only a random sample, so the results are not identical every time.
What's more, we may find more relationships, because our target is analyzing whether air flights perform on time or are delayed. Something valuable is still waiting for us; for instance, we may find the busiest carrier in the air.
Last but not least, from previous work we found that DBSCAN does not work well unless the dataset is very large, so we did not apply it to our dataset. We would like to see how DBSCAN performs and compare its results to the other clustering methods.