Data Analysis of U.S. Airlines On-time
Performance
Yanxiang Zhu, Nilesh Padwal, Mingxuan Li
Finished by June 27th, 2014
Contents
1 Introduction 2
1.1 Background and Problem Description . . . . . . . . . . . . . . . 2
1.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Collecting data 3
3 Preprocessing Data 3
4 Variables Description 4
5 Association Rule 9
6 Cluster Analysis 15
6.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6.2 Determine K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6.3 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.3.1 Pam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.3.2 Kmeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7 Decision Tree 24
7.1 Categorize Variable . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7.2 Rpart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.3 Ctree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
8 Random Forest 31
9 Classification 35
9.1 knn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
9.2 Processing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
9.3 SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
INFO7374 Data Science Final Project
10 Processing Data 36
11 Conclusion 40
12 Limitation 41
13 Future Work 42
1 Introduction
1.1 Background and Problem Description
In the airline industry, carriers routinely struggle to get planes to the gate on time. The shared challenge is to improve the quality of airline on-time performance. Beyond carriers' services and baggage policies, saving passengers' time in the air is arguably even more important. Our goal is therefore to uncover meaningful and valuable relationships in the dataset using data mining techniques such as cluster analysis, association rules, and decision trees.
1.2 Dataset Description
The dataset is a collection of airline data from the Research and Innovative Technology Administration (RITA), containing detailed information on every U.S. flight from 1987 to 2008. It is a huge resource of 29 variables, including Destination, Origin, Arrival time, Departure time, and so on; any flight can be tracked through these statistical records. We must note that, owing to the limited processing power of our computers, we can fetch only part of the whole data (all U.S. flights over 22 years) to process and analyze. Our selected dataset still has millions of observations, which is certainly enough to obtain satisfying results. Here is a descriptive list of the useful variables.
1. DayofMonth December 1st to December 31st.
2. DayOfWeek 1 refers to Monday and in a similar way, 7 refers to Sunday.
3. DepTime Actual departure time
4. ArrTime Actual arrival time
5. CRSDepTime Scheduled departure time
6. CRSArrTime Scheduled arrival time
7. UniqueCarrier Unique carrier code
8. FlightNum Flight number
9. ActualElapsedTime In minutes
10. CRSElapsedTime In minutes
11. AirTime In minutes
12. ArrDelay Arrival delay, in minutes
13. DepDelay Departure delay, in minutes
14. Origin Origin IATA airport code
15. Dest Destination IATA airport code
16. Distance In miles
According to historical records of U.S. air carriers' on-time operations, 2008 appears to be an interesting and unusual period for the airline industry: the on-time percentage was 76.0%, and it then rose to 79.5% in 2009. That is why we chose this turning point, to find out what lies behind the headline numbers and should not be ignored.
2 Collecting data
The dataset we use contains all commercial flights within the USA in 2008. The
dataset is downloaded from http://stat-computing.org/dataexpo/2009.
The dataset contains nearly 10 million records and takes about 700 MB of disk space.
file.name <- paste(2008, "csv.bz2", sep = ".")
if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/", 2008, ".csv.bz2",
        sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name)
}
To import the data into our workspace, we use the read.csv function and store the dataset in d.
d <- read.csv("2008.csv")
3 Preprocessing Data
Since the analysis requires a well-structured dataset, we omit the NA values. Due to the limited processing capability of our computers, we also decide to work only with data from December 2008. That subset still has 1,524,735 observations of 29 variables, which we consider enough to obtain good analysis results from such a large-scale dataset.
d = subset(d, Month == "12")
d = na.omit(d)
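A toy illustration (not the project data) of why na.omit shrinks the dataset so sharply: it drops every row that contains at least one NA, in any column.

```r
df <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
clean <- na.omit(df)   # drops rows 2 and 3, each of which has an NA
nrow(clean)            # 1
```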
After that, we also remove some columns that we consider not useful for our study, dropping them directly from the original dataset.
d = d[, -20:-29]
On the other hand, since we have already decided to use only the December 2008 data, the Year and Month columns become useless.
d = d[, -1]
d = d[, -1]
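As an aside, dropping columns by name instead of position is less fragile if the column order ever changes; a toy sketch with a few assumed column names (not the real 29-column dataset):

```r
df <- data.frame(Year = 2008, Month = 12, DayofMonth = 3, ArrDelay = 41)
df <- df[, !(names(df) %in% c("Year", "Month"))]   # drop by name, not position
names(df)  # "DayofMonth" "ArrDelay"
```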
So far, our dataset contains 168,647 records with 17 variables.
str(d)
## 'data.frame': 168647 obs. of 17 variables:
## $ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...
## $ DayOfWeek : int 3 3 3 3 3 3 3 3 3 3 ...
## $ DepTime : int 1126 1859 1256 1925 2002 1716 1620 1807 1930 1004 ...
## $ CRSDepTime : int 1045 1825 1240 1900 1940 1610 1555 1725 1905 1005 ...
## $ ArrTime : int 1241 1925 1458 2120 2249 2054 1826 1910 2041 1130 ...
## $ CRSArrTime : int 1200 1900 1435 2100 2230 1950 1800 1845 2020 1115 ...
## $ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 1
## $ FlightNum : int 2717 1712 294 2776 623 586 1259 548 619 1152 ...
## $ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3796 2127 3943 3316
## $ ActualElapsedTime: int 75 86 62 55 107 158 186 63 71 86 ...
## $ CRSElapsedTime : int 75 95 55 60 110 160 185 80 75 70 ...
## $ AirTime : int 55 73 45 46 93 140 177 50 56 51 ...
## $ ArrDelay : int 41 25 23 20 19 64 26 25 21 15 ...
## $ DepDelay : int 41 34 16 25 22 66 25 42 25 -1 ...
## $ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 82 157 160 175 177 181 2
## $ Distance : int 349 487 289 332 718 1121 1111 328 328 321 ...
4 Variables Description
After importing the dataset, the variables associated with each observation were
explored further. The names of variables were listed and described.
1. DayofMonth December 1st to December 31st.
summary(d$DayofMonth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 11.0 18.0 17.2 23.0 31.0
2. DayOfWeek 1 refers to Monday and, in a similar way, 7 refers to Sunday.
summary(d$DayOfWeek)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 4.00 3.74 5.00 7.00
3. DepTime Actual departure time
summary(d$DepTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1120 1510 1470 1840 2400
Departure time is another key factor we examine: we want to know which time of day is best for flying.
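Note that DepTime is stored as an hhmm integer (1126 means 11:26), so arithmetic summaries such as the mean above are distorted. A hypothetical helper (not part of the original analysis) converts these codes to minutes after midnight, which makes them comparable:

```r
hhmm_to_min <- function(t) (t %/% 100) * 60 + t %% 100   # hypothetical helper
hhmm_to_min(1126)  # 686 minutes after midnight (11:26)
hhmm_to_min(5)     # 5 (CRS times like 0005 print as 5)
```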
4. ArrTime Actual arrival time
summary(d$ArrTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1230 1640 1560 2010 2400
5. CRSDepTime Scheduled departure time
summary(d$CRSDepTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 1040 1420 1400 1750 2360
6. CRSArrTime Scheduled arrival time
summary(d$CRSArrTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1230 1620 1580 1950 2360
7. UniqueCarrier Unique carrier code
carrier = data.frame(d$UniqueCarrier)
qplot(x = d$UniqueCarrier, data = carrier, fill = d$UniqueCarrier)
[Bar chart: number of flights by UniqueCarrier (counts up to about 30,000), colored by carrier code: 9E, AA, AS, B6, CO, DL, EV, F9, FL, HA, MQ, NW, OH, OO, UA, US, WN, XE, YV.]
Southwest Airlines (WN) operated the most flights in the U.S. in 2008; its flight count even exceeds the sum of SkyWest and American Airlines. We will also help you find out which airline to choose if you want to avoid delays.
8. FlightNum Flight number
summary(d$FlightNum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 658 1680 2360 3590 9740
9. ActualElapsedTime In minutes
summary(d$ActualElapsedTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18 88 126 144 177 790
10. CRSElapsedTime In minutes
summary(d$CRSElapsedTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26 82 116 135 165 660
11. AirTime In minutes
summary(d$AirTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 60 93 112 141 647
12. ArrDelay Arrival delay, in minutes
summary(d$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 24.0 41.0 62.6 77.0 1660.0
The arrival delay is our target variable. The median is 41 minutes, which means the delay problem is severe. We are going to find out which factors cause the delay.
13. DepDelay Departure delay, in minutes
summary(d$DepDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -34.0 15.0 35.0 53.5 71.0 1600.0
14. Origin Origin IATA airport code
summary(d$Origin)
## ATL ORD DEN DFW DTW PHX EWR IAH LAS
## 12232 11020 7004 6208 4984 4353 4333 4269 4020
## MSP LAX JFK SLC SFO SEA CLT BOS PHL
## 4004 3837 3409 3375 3370 2992 2977 2841 2676
## MDW MCO CVG BWI LGA SAN DCA IAD MEM
## 2546 2481 2457 2383 2272 1832 1768 1714 1665
## MIA FLL STL MKE TPA MCI BNA CLE PDX
## 1601 1590 1504 1491 1489 1381 1358 1352 1329
## HOU RDU DAL OAK HNL SMF PIT SJC IND
## 1324 1297 1218 1211 1164 1131 1061 1003 989
## SNA ABQ AUS SAT MSY CMH PBI BUF OMA
## 944 905 872 809 806 757 733 727 704
## JAX BDL BUR RSW BHM ONT PVD GRR SDF
## 643 629 627 607 546 521 517 514 504
## TUL OKC RNO DSM SJU RIC MHT DAY MSN
## 493 488 485 478 471 452 426 419 408
## GEG LIT BOI ELP TUS ANC ICT LGB TYS
## 404 404 394 394 389 384 371 367 358
## ALB ROC XNA SYR OGG ORF HPN COS CID
## 356 356 344 343 332 322 317 311 292
## CHS FAT LEX GSO MLI CAE HSV SAV JAN
## 287 285 282 278 271 268 260 259 251
## (Other)
## 12768
The busiest airport in the U.S. is Atlanta (ATL); Chicago O'Hare and Denver rank second and third.
15. Dest Destination IATA airport code
summary(d$Dest)
## ATL ORD DEN DFW LAX PHX LAS EWR SFO
## 11791 9506 6338 5159 5013 4663 4357 4335 4277
## IAH DTW MSP JFK SLC SEA MCO LGA PHL
## 4271 3575 3280 3238 3216 3131 3015 2818 2669
## BOS CLT SAN BWI FLL MDW CVG TPA MEM
## 2541 2422 2265 2142 1959 1912 1831 1748 1721
## DCA MIA IAD PDX RDU MCI STL SMF OAK
## 1710 1702 1494 1483 1455 1384 1374 1361 1351
## BNA CLE MKE SJC HOU SNA SAT DAL AUS
## 1346 1314 1290 1210 1196 1120 1096 1085 1049
## ABQ HNL PIT PBI MSY IND CMH RSW OMA
## 1014 981 935 917 889 841 831 758 757
## JAX BUR ONT BUF TUL OKC SJU BHM TUS
## 737 717 679 664 615 608 604 585 585
## BDL RNO ANC SDF PVD DSM GRR RIC ELP
## 578 567 560 526 505 501 498 485 480
## BOI LIT GEG MSN TYS ICT DAY XNA LGB
## 472 442 424 415 412 408 406 386 381
## MHT COS ORF GSO CHS ROC CAE JAN HPN
## 373 366 362 336 334 327 301 296 295
## SAV CID OGG FAT ALB SYR LEX HSV MLI
## 293 292 292 291 285 280 276 261 255
## (Other)
## 13756
The result is very similar to Origin. We also need to check whether the busiest airports suffer delays the most.
16. Distance In miles
summary(d$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31 338 599 753 984 4960
The vast majority of flights cover a distance of under 1,000 miles. The relationship between distance and delay time is another important question we need to examine.
5 Association Rule
Reflecting on flight performance, we believe there are special and important patterns that, once addressed, could substantially improve the quality of on-time performance. In this section we look for hidden relationships among the different facets of the flight dataset. We raise specific questions, for instance, how factors such as Distance and DayOfWeek influence on-time performance, and we answer them using association rules. Three standard measures guide the mining:
• Support The probability that the antecedent and the consequent hold simultaneously in the dataset.
• Confidence The conditional probability that the consequent holds when the antecedent is satisfied.
• Lift The ratio of confidence to expected confidence. Values greater than 1 indicate that the rule has predictive potential.
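These three measures can be computed by hand on a toy set of transactions (not the flight data) for the rule {A} => {B}:

```r
trans <- list(c("A", "B"), c("A", "B"), c("A"), c("B"), c("C"))
n <- length(trans)
has <- function(items) sapply(trans, function(t) all(items %in% t))
supp <- sum(has(c("A", "B"))) / n      # 2/5 = 0.4
conf <- supp / (sum(has("A")) / n)     # 0.4 / 0.6 = 0.667
lift <- conf / (sum(has("B")) / n)     # 0.667 / 0.6 = 1.111 > 1
```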
First, we load the libraries that association rule mining requires.
library(arules)
library(arulesViz)
We copy the original dataset, since we will do some further data transformations on it, and store the copy as data.
data = d
In our experience it is effective to divide each numeric variable into ordered, reasonable ranges for this analysis. So we discretize Distance, ArrDelay, AirTime, CRSDepTime, and CRSArrTime, making sure every observation falls into one of the given ranges.
data$Distance = ordered(cut(data$Distance, c(0, 300, 600, 1000, Inf)), labels = c("Short",
"Medium", "Long", "Too long"))
data$ArrDelay = ordered(cut(data$ArrDelay, c(0, 25, 50, 80, Inf)), labels = c("On-Time",
"Delayed", "Intermediate-Delayed", "Much-Delayed"))
data$AirTime = ordered(cut(data$AirTime, c(-1, 50, 100, 200, 300, Inf)), labels = c("Too-Short",
    "Short", "Intermediate", "Long", "Too-Long"))
data$CRSDepTime = ordered(cut(data$CRSDepTime, c(-1, 600, 1200, 1800, Inf)),
labels = c("Overnight", "Morning", "Afternoon", "Evening"))
data$CRSArrTime = ordered(cut(data$CRSArrTime, c(-1, 600, 1200, 1800, 2359)),
labels = c("Overnight", "Morning", "Afternoon", "Evening"))
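As a quick sanity check (on toy values, not the flight data), cut with ordered labels assigns each value to its bracket, with values on a break falling into the lower bin:

```r
x <- c(150, 450, 800, 2500)                       # toy distances in miles
cats <- ordered(cut(x, c(0, 300, 600, 1000, Inf)),
                labels = c("Short", "Medium", "Long", "Too long"))
as.character(cats)  # "Short" "Medium" "Long" "Too long"
```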
DayOfWeek contains the numbers 1 through 7 representing days of the week. We convert it to character, replace each number with a day name, and then turn the result into a factor.
data$DayOfWeek = as.character(data$DayOfWeek)
data$DayOfWeek = gsub("^1", "Sunday", data$DayOfWeek)
data$DayOfWeek = gsub("^2", "Monday", data$DayOfWeek)
data$DayOfWeek = gsub("^3", "Tuesday", data$DayOfWeek)
data$DayOfWeek = gsub("^4", "Wednesday", data$DayOfWeek)
data$DayOfWeek = gsub("^5", "Thursday", data$DayOfWeek)
data$DayOfWeek = gsub("^6", "Friday", data$DayOfWeek)
data$DayOfWeek = gsub("^7", "Saturday", data$DayOfWeek)
data$DayOfWeek = factor(data$DayOfWeek)
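The seven gsub calls above can be collapsed into a single vector lookup; a minimal equivalent sketch on toy codes, keeping the report's 1 = Sunday mapping:

```r
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
dow <- c(1, 3, 7)                       # toy day codes
wk <- factor(days[dow], levels = days)  # vector lookup replaces seven gsub calls
as.character(wk)                        # "Sunday" "Tuesday" "Saturday"
```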
Six variables, such as FlightNum and ActualElapsedTime, are not useful here, so they need to be removed.
logNdx = !(names(data) %in% c("DayofMonth", "FlightNum", "Cancelled", "ActualElapsedTime",
"DepDelay", "UniqueCarrier"))
data.AR = data[, logNdx]
With the preprocessing above finished, the dataset for analysis, called data.AR, contains 10 variables, as shown below.
summary(data.AR)
## DayOfWeek CRSDepTime CRSArrTime TailNum
## Friday :20526 Overnight: 3009 Overnight: 2050 N986CA : 129
## Monday :27345 Morning :54579 Morning :35361 N87353 : 126
## Saturday :21397 Afternoon:72468 Afternoon:67079 N77302 : 122
## Sunday :30021 Evening :38591 Evening :64157 N507CA : 112
## Thursday :22742 N472CA : 107
## Tuesday :26882 N471CA : 106
## Wednesday:19734 (Other):167945
## AirTime ArrDelay Origin
## Too-Short :28126 On-Time :46402 ATL : 12232
## Short :64977 Delayed :53601 ORD : 11020
## Intermediate:56146 Intermediate-Delayed:29107 DEN : 7004
## Long :14165 Much-Delayed :39537 DFW : 6208
## Too-Long : 5233 DTW : 4984
## PHX : 4353
## (Other):122846
## Dest Distance
## ATL : 11791 Short :33685
## ORD : 9506 Medium :50906
## DEN : 6338 Long :43644
## DFW : 5159 Too long:40412
## LAX : 5013
## PHX : 4663
## (Other):126177
We now apply association rule mining to the dataset. First, we want to identify the main factors that could make a flight on time or not. We set thresholds for support and confidence, and restrict the right-hand side to the four delay levels: On-Time, Delayed, Intermediate-Delayed, Much-Delayed.
apriori.appearance1 = list(rhs = c("ArrDelay=On-Time", "ArrDelay=Delayed", "ArrDelay=Intermediate-Delayed",
    "ArrDelay=Much-Delayed"), default = "lhs")
apriori.parameter1 = list(support = 0.01, confidence = 0.1)
rules1 = apriori(data.AR, parameter = apriori.parameter1, appearance = apriori.appearance1)
##
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.1 0.1 1 none FALSE TRUE 0.01 1 10
## target ext
## rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[4 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.07s].
## writing ... [743 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
We keep the rules with lift above 1.2 and confidence above 0.1, and sort the resulting subset by lift.
rules1.subset = subset(rules1, subset = lift > 1.2 & confidence > 0.1)
rules1.subset.conf = sort(rules1.subset, by = "lift")
summary(rules1.subset)
## set of 50 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 4 26 18 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 3.00 3.36 4.00 5.00
##
## summary of quality measures:
## support confidence lift
## Min. :0.0101 Min. :0.284 Min. :1.20
## 1st Qu.:0.0118 1st Qu.:0.314 1st Qu.:1.24
## Median :0.0147 Median :0.342 Median :1.27
## Mean :0.0182 Mean :0.337 Mean :1.31
## 3rd Qu.:0.0200 3rd Qu.:0.354 3rd Qu.:1.32
## Max. :0.0725 Max. :0.428 Max. :1.83
##
## mining info:
## data ntransactions support confidence
## data.AR 168647 0.01 0.1
The list below displays the top five rules sorted by lift.
inspect(rules1.subset.conf[1:5])
## lhs rhs support confidence lift
## 1 {Dest=EWR} => {ArrDelay=Much-Delayed} 0.01101 0.4284 1.827
## 2 {CRSDepTime=Afternoon,
## Dest=ORD} => {ArrDelay=Much-Delayed} 0.01040 0.4048 1.727
## 3 {CRSArrTime=Evening,
## Origin=ORD} => {ArrDelay=Much-Delayed} 0.01032 0.3786 1.615
## 4 {Dest=ORD} => {ArrDelay=Much-Delayed} 0.02001 0.3549 1.514
## 5 {Origin=ORD} => {ArrDelay=Much-Delayed} 0.02199 0.3365 1.435
Flights departing from or landing at ORD are very likely to be delayed. Searching the weather history of Chicago O'Hare International Airport (ORD), we found it did suffer a very severe snowstorm during that period. So even though weather conditions are not included in the flight data, weather is still a key driver of on-time performance.
rules1.subset.delay = subset(rules1.subset, subset = lhs %in% "DayOfWeek=Monday")
plot(rules1.subset.delay, method = "graph", control = list(type = "items"))
[Graph of 5 rules linking DayOfWeek=Monday with CRSDepTime/CRSArrTime Morning and Evening to ArrDelay=On-Time and ArrDelay=Much-Delayed; node size encodes support (0.011-0.019), color encodes lift (1.213-1.421).]
apriori.appearance3 = list(rhs = c("ArrDelay=On-Time"), default = "lhs")
apriori.parameter3 = list(support = 0.01, confidence = 0.1)
rules3 = apriori(data.AR, parameter = apriori.parameter3, appearance = apriori.appearance3)
##
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.1 0.1 1 none FALSE TRUE 0.01 1 10
## target ext
## rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [211 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
rules3.subset = subset(rules3, subset = lift > 1.2 & confidence > 0.1)
rules3.subset.conf = sort(rules3.subset, by = "lift")
rules3.subset.ontime = subset(rules3.subset.conf, subset = lhs %in% c("DayOfWeek=Friday",
"DayOfWeek=Saturday", "DayOfWeek=Sunday", "DayOfWeek=Monday", "DayOfWeek=Tuesday",
"DayOfWeek=Wednesday", "DayOfWeek=Thursday"))
inspect(rules3.subset.ontime[1:5])
## lhs rhs support confidence lift
## 1 {DayOfWeek=Monday,
## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01326 0.3909 1.421
## 2 {DayOfWeek=Monday,
## CRSDepTime=Morning,
## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01170 0.3826 1.390
## 3 {DayOfWeek=Monday,
## CRSDepTime=Morning} => {ArrDelay=On-Time} 0.01874 0.3624 1.317
## 4 {DayOfWeek=Wednesday,
## CRSDepTime=Morning} => {ArrDelay=On-Time} 0.01283 0.3503 1.273
## 5 {DayOfWeek=Sunday,
## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01280 0.3349 1.217
Clearly, the flights with the highest on-time performance are almost always in the morning. This suggests that in December 2008 air traffic control was in good shape in the morning but comparatively congested in the afternoon and evening.
rules1.subset.delay1 = subset(rules1.subset.conf, subset = lhs %in% c("Distance=Short",
"Distance=Medium", "Distance=Long", "Distance=Too long"))
inspect(rules1.subset.delay1[1:10])
## lhs rhs support confidence lift
## 1 {CRSArrTime=Morning,
## AirTime=Short,
## Distance=Medium} => {ArrDelay=On-Time} 0.02186 0.3679 1.337
## 2 {CRSArrTime=Morning,
## AirTime=Intermediate,
## Distance=Long} => {ArrDelay=On-Time} 0.01494 0.3610 1.312
## 3 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## AirTime=Short,
## Distance=Medium} => {ArrDelay=On-Time} 0.01996 0.3605 1.310
## 4 {CRSArrTime=Morning,
## Distance=Medium} => {ArrDelay=On-Time} 0.02433 0.3596 1.307
## 5 {CRSArrTime=Morning,
## Distance=Long} => {ArrDelay=On-Time} 0.01849 0.3556 1.292
## 6 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## AirTime=Intermediate,
## Distance=Long} => {ArrDelay=On-Time} 0.01310 0.3547 1.289
## 7 {AirTime=Short,
## Origin=ATL,
## Distance=Medium} => {ArrDelay=On-Time} 0.01010 0.3527 1.282
## 8 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## Distance=Medium} => {ArrDelay=On-Time} 0.02219 0.3526 1.281
## 9 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## Distance=Long} => {ArrDelay=On-Time} 0.01622 0.3493 1.270
## 10 {CRSDepTime=Morning,
## CRSArrTime=Morning,
## Distance=Too long} => {ArrDelay=On-Time} 0.01120 0.3480 1.265
We find that the flights most likely to be on time tend to fly long-distance routes. This suggests that in the air traffic control system, small regions and short routes are much busier: people may have five or more flight choices from Boston to New York City, while only two flights run from Washington D.C. to beautiful San Diego. That is why long-distance routes put less pressure on air traffic control, and why congestion is more likely on shorter routes.
6 Cluster Analysis
We are going to research the airline dataset using cluster analysis. Clustering generally sorts observations into k groups (k is the number of groups to create) so as to maximize the similarity of observations within the same group and minimize the similarity of observations across different groups. Cluster analysis falls into two broad approaches, hierarchical and non-hierarchical; we will run non-hierarchical clustering with K-Means and PAM. For the airline dataset, we will continue finding
associations within the dataset and the factors that influence ArrDelay.
6.1 Setup
First we need to load libraries we are going to use.
library(cluster) # cluster library
library(proxy) # hcluster function
library(fpc) # cluster.stats function
library(pamr) # pam function
library(clValid) # clValid function
library(ggplot2) # plot diagram
Because the original dataset is too large to compute a full distance matrix, we randomly choose 1,000 records and run the analysis on this sample.
cd = d[, !(names(d) %in% c("UniqueCarrier", "TailNum", "FlightNum", "Origin", "Dest"))]
cd = cd[sample(nrow(cd), 1000), ]
m = as.matrix(cd)
str(m)
## int [1:1000, 1:12] 31 10 28 18 14 23 7 20 11 20 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "6926813" "6946821" "6545814" "6680323" ...
## ..$ : chr [1:12] "DayofMonth" "DayOfWeek" "DepTime" "CRSDepTime" ...
mDist = dist(m)
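One caveat: without a seed, the 1,000-row sample differs on every run. A toy sketch (not the flight data) showing seeded, reproducible sampling and the size of the resulting distance object:

```r
set.seed(1)                           # fixed seed makes the sample reproducible
toy <- matrix(rnorm(40), ncol = 4)    # 10 toy rows stand in for the full sample
samp <- toy[sample(nrow(toy), 5), ]
length(dist(samp))                    # 10 = 5 * 4 / 2 pairwise distances
```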
6.2 Determine K
The most important step in cluster analysis is to determine the best K. To find it, we apply the clValid function, which directly reports the optimal choice.
# clValid
hvalid <- clValid(m, 2:10, clMethods = c("hierarchical"), validation = "internal",
maxitems = 1e+06)
pamvalid <- clValid(m, 2:10, clMethods = c("pam"), validation = "internal",
maxitems = 1e+06)
kvalid <- clValid(m, 2:10, clMethods = c("kmeans"), validation = "internal",
maxitems = 1e+06)
Now we can use the summary() function to see the result of each method,
where the Optimal Scores section will directly give us the best clusters number.
summary(kvalid)
##
## Clustering Methods:
## kmeans
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9
##
## kmeans Connectivity 2.878 79.275 85.293 86.702 104.193 131.675 105.027 128.007 136.
## Dunn 0.098 0.012 0.012 0.014 0.027 0.023 0.027 0.023 0.
## Silhouette 0.471 0.454 0.468 0.470 0.485 0.388 0.486 0.390 0.
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 2.878 kmeans 2
## Dunn 0.098 kmeans 2
## Silhouette 0.486 kmeans 8
summary(pamvalid)
##
## Clustering Methods:
## pam
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## pam Connectivity 91.199 139.967 145.223 171.271 173.816 237.760 221.203 212.537 246.705
## Dunn 0.019 0.013 0.014 0.018 0.009 0.012 0.013 0.013 0.013
## Silhouette 0.404 0.288 0.318 0.360 0.305 0.298 0.320 0.320 0.333
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 91.199 pam 2
## Dunn 0.019 pam 2
## Silhouette 0.404 pam 2
summary(hvalid)
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 5.287 5.287 7.005 11.240 32.673 33.506 35.006 38.148 44.491
## Dunn 0.420 0.420 0.417 0.381 0.073 0.073 0.073 0.074 0.074
## Silhouette 0.485 0.460 0.443 0.426 0.407 0.374 0.368 0.317 0.318
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 5.287 hierarchical 2
## Dunn 0.420 hierarchical 2
## Silhouette 0.485 hierarchical 2
Because the 1,000 sample records are randomly chosen, the results are not always the same, but in most cases K = 2 is selected. We can also use other measures to validate this result.
We use a function, foreachcluster3, to report six quality measures for cluster numbers from 2 to 10.
foreachcluster3 = function(k) {
    pamC = pam(x = m, k)
    p.stats = cluster.stats(mDist, pamC$clustering)
    c(max.dia = p.stats$max.diameter, min.sep = p.stats$min.separation, avg.wi = p.stats$average.within,
        avg.bw = p.stats$average.between, silwidth = p.stats$avg.silwidth, dunn = p.stats$dunn)
}
We apply this function to cluster numbers from 2 to 10 and use rbind to make
a table.
t3 = rbind(foreachcluster3(2), foreachcluster3(3), foreachcluster3(4), foreachcluster3(5),
foreachcluster3(6), foreachcluster3(7), foreachcluster3(8), foreachcluster3(9),
foreachcluster3(10))
rownames(t3) = 2:10
t3
## max.dia min.sep avg.wi avg.bw silwidth dunn
## 2 3899 75.14 1062.9 1811 0.4041 0.019271
## 3 3899 52.03 963.4 1666 0.2884 0.013344
## 4 3802 52.03 797.9 1698 0.3184 0.013685
## 5 3366 59.92 653.1 1689 0.3599 0.017800
## 6 3366 30.82 605.6 1625 0.3055 0.009157
## 7 3265 39.76 579.1 1602 0.2979 0.012177
## 8 2959 39.76 552.8 1604 0.3199 0.013437
## 9 2959 38.39 521.4 1583 0.3196 0.012974
## 10 2959 39.76 480.1 1562 0.3330 0.013437
The result also shows we should use K = 2 for cluster analysis.
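As a further cross-check of K = 2, the classic elbow heuristic looks for the largest relative drop in total within-cluster sum of squares; a sketch on synthetic two-cluster data (not our sample):

```r
set.seed(7)
m2 <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # cluster around (0, 0)
            matrix(rnorm(100, mean = 5), ncol = 2))   # cluster around (5, 5)
wss <- sapply(1:6, function(k) kmeans(m2, k, nstart = 10)$tot.withinss)
which.max(-diff(wss) / wss[-length(wss)])  # 1: biggest drop is going from k = 1 to 2
```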
6.3 Cluster Analysis
After determining the best K = 2, we can compare the resulting clusters to see whether we find interesting patterns.
From previous tests we found that hierarchical clustering does not perform well in this analysis: one cluster holds only a few elements while the other holds over 99% of them. Since hclust is not working well in this case, we will use the pam and kmeans functions.
6.3.1 Pam
We apply the pam function to the matrix, setting k = 2.
pamC = pam(x = m, 2)
pamC$clusinfo
## size max_diss av_diss diameter separation
## [1,] 561 2814 741.6 3505 75.14
## [2,] 439 3004 717.2 3899 75.14
pamcluster = data.frame(pamC$clustering)
We bind the cluster assignments back onto our sample dataset.
total = cbind(cd, pamcluster)
After that, we can obtain two subsets according to their cluster numbers.
d1 = subset(total, pamC.clustering == 1)
d2 = subset(total, pamC.clustering == 2)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
## 1st Qu.:10.0 1st Qu.:2.00 1st Qu.:1631 1st Qu.:1525
## Median :18.0 Median :3.00 Median :1809 Median :1715
## Mean :16.9 Mean :3.59 Mean :1802 Mean :1699
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2004 3rd Qu.:1855
## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 2 Min. : 640 Min. : 33 Min. : 35
## 1st Qu.:1745 1st Qu.:1718 1st Qu.: 84 1st Qu.: 80
## Median :1941 Median :1914 Median :123 Median :115
## Mean :1826 Mean :1904 Mean :137 Mean :130
## 3rd Qu.:2136 3rd Qu.:2105 3rd Qu.:168 3rd Qu.:160
## Max. :2357 Max. :2359 Max. :441 Max. :407
## AirTime ArrDelay DepDelay Distance
## Min. : 14 Min. : 15.0 Min. :-10.0 Min. : 56
## 1st Qu.: 57 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334
## Median : 88 Median : 47.0 Median : 45.0 Median : 590
## Mean :108 Mean : 69.3 Mean : 62.4 Mean : 720
## 3rd Qu.:137 3rd Qu.: 97.0 3rd Qu.: 89.0 3rd Qu.: 948
## Max. :382 Max. :395.0 Max. :377.0 Max. :2640
## pamC.clustering
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
summary(d2)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45
## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 835 1st Qu.: 810
## Median :18.0 Median :3.00 Median :1021 Median : 955
## Mean :17.2 Mean :3.54 Mean :1041 Mean :1001
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1205 3rd Qu.:1130
## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 10 Min. : 1 Min. : 35 Min. : 34.0
## 1st Qu.:1022 1st Qu.: 942 1st Qu.: 92 1st Qu.: 83.5
## Median :1217 Median :1130 Median :129 Median :116.0
## Mean :1178 Mean :1111 Mean :146 Mean :134.6
## 3rd Qu.:1408 3rd Qu.:1322 3rd Qu.:176 3rd Qu.:165.0
## Max. :1810 Max. :2345 Max. :432 Max. :405.0
## AirTime ArrDelay DepDelay Distance
## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74
## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 5.0 1st Qu.: 344
## Median : 92 Median : 33.0 Median : 27.0 Median : 594
## Mean :113 Mean : 55.9 Mean : 44.2 Mean : 750
## 3rd Qu.:140 3rd Qu.: 65.5 3rd Qu.: 56.0 3rd Qu.: 966
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## pamC.clustering
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
We can see from the summary that the two clusters are similar on most columns except
Departure Time and our target variable Arrival Delay. We can conclude that when the
departure time falls around midnight or in the morning, the flight is likely to have a
relatively lower delay, which matches the conclusion we drew from the association rules.
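The cluster-wise comparison above can also be read directly off the data rather than from two separate summary() tables. A minimal sketch (the `total_toy` data frame below is a hypothetical stand-in for `total`, the sampled flights with the `pamC.clustering` column):

```r
# Toy stand-in for `total`, the sampled flights plus the PAM cluster label.
total_toy <- data.frame(
  DepTime = c(1800, 2100, 1950, 830, 900, 1015),
  ArrDelay = c(70, 95, 60, 25, 30, 40),
  pamC.clustering = c(1, 1, 1, 2, 2, 2)
)
# Per-cluster means in one call, condensing the two summary() tables.
clus_means <- aggregate(cbind(DepTime, ArrDelay) ~ pamC.clustering,
                        data = total_toy, FUN = mean)
print(clus_means)
```

On the toy data, cluster 1 departs later and is delayed more, mirroring the pattern in the summaries above.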
totaldf = data.frame(total)
totaldf$pamC.clustering = as.factor(totaldf$pamC.clustering)
qplot(data = totaldf, x = totaldf$pamC.clustering, y = totaldf$DepTime, colour = totaldf$pamC.clustering,
    geom = "boxplot")
[Figure: boxplots of DepTime by PAM cluster (1 vs. 2), drawn with qplot]
From the result, we can see that pam has done quite a good job of clustering.
Next, we try kmeans to compare the results.
6.3.2 Kmeans
We apply similar work to Kmeans to see if Kmeans works better than pam
function.
kmeans.results = kmeans(m, 2)
clusterdf = data.frame(kmeans.results$cluster)
total = cbind(cd, clusterdf)
d1 = subset(total, kmeans.results.cluster == 1)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
## 1st Qu.:10.5 1st Qu.:2.00 1st Qu.:1627 1st Qu.:1520
## Median :18.0 Median :3.00 Median :1804 Median :1710
## Mean :17.0 Mean :3.58 Mean :1797 Mean :1693
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2002 3rd Qu.:1855
## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 2 Min. : 640 Min. : 33 Min. : 35
## 1st Qu.:1739 1st Qu.:1714 1st Qu.: 84 1st Qu.: 80
## Median :1939 Median :1910 Median :123 Median :115
## Mean :1817 Mean :1899 Mean :137 Mean :130
## 3rd Qu.:2134 3rd Qu.:2104 3rd Qu.:168 3rd Qu.:160
## Max. :2357 Max. :2359 Max. :441 Max. :407
## AirTime ArrDelay DepDelay Distance
## Min. : 14.0 Min. : 15.0 Min. :-10.0 Min. : 56
## 1st Qu.: 57.5 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334
## Median : 88.0 Median : 47.0 Median : 45.0 Median : 588
## Mean :107.6 Mean : 69.8 Mean : 62.8 Mean : 720
## 3rd Qu.:136.5 3rd Qu.: 97.0 3rd Qu.: 88.5 3rd Qu.: 947
## Max. :382.0 Max. :425.0 Max. :392.0 Max. :2640
## kmeans.results.cluster
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
d2 = subset(total, kmeans.results.cluster == 2)
summary(d2)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45
## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 834 1st Qu.: 805
## Median :18.0 Median :3.00 Median :1017 Median : 950
## Mean :17.1 Mean :3.54 Mean :1030 Mean : 992
## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1202 3rd Qu.:1125
## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359
## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime
## Min. : 24 Min. : 1 Min. : 35 Min. : 34
## 1st Qu.:1021 1st Qu.: 940 1st Qu.: 92 1st Qu.: 83
## Median :1215 Median :1123 Median :129 Median :117
## Mean :1175 Mean :1099 Mean :146 Mean :135
## 3rd Qu.:1402 3rd Qu.:1318 3rd Qu.:177 3rd Qu.:166
## Max. :1810 Max. :2305 Max. :432 Max. :405
## AirTime ArrDelay DepDelay Distance
## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74
## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 4.0 1st Qu.: 344
## Median : 92 Median : 33.0 Median : 26.0 Median : 595
## Mean :113 Mean : 54.9 Mean : 43.2 Mean : 750
## 3rd Qu.:140 3rd Qu.: 64.0 3rd Qu.: 55.0 3rd Qu.: 967
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## kmeans.results.cluster
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
The two methods generate very similar results. Both show a strong relationship between
Departure Time and Arrival Delay, which matches our findings from the association rules
and decision trees.
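The similarity of the two partitions can be quantified instead of eyeballed. A minimal sketch, where `pam_labels` and `km_labels` are hypothetical stand-ins for `pamC$clustering` and `kmeans.results$cluster` on the same rows: cross-tabulating the two label vectors gives a table that is heavily diagonal (or anti-diagonal, since cluster numbers are arbitrary) when the clusterings agree.

```r
# Hypothetical label vectors standing in for pamC$clustering and
# kmeans.results$cluster on the same rows of the sample.
pam_labels <- c(1, 1, 2, 2, 2, 1, 2, 1)
km_labels  <- c(1, 1, 2, 2, 1, 1, 2, 1)

# Cross-tabulation of the two clusterings.
agreement <- table(pam = pam_labels, kmeans = km_labels)
print(agreement)

# Fraction of rows on which the two methods agree
# (here the label numbering happens to match).
agree_rate <- mean(pam_labels == km_labels)
print(agree_rate)
```

A high agreement rate would confirm numerically what the side-by-side summaries suggest.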
7 Decision Tree
In this section, we use decision trees to analyze the factors that affect the target
variables.
First, we need to load the libraries required.
library(rpart)
library(rpart.plot)
library(rattle)
library(maptree)
library(party)
library(partykit)
7.1 Categorize Variable
We categorize our variables into groups.
Distance: divided into three bands: up to 750 miles, 750 to 1000 miles, and greater
than 1000 miles.
d$Distance = ordered(cut(d$Distance, c(0, 750, 1000, Inf)), labels = c("upto750",
"750to1000", ">1000"))
DayOfWeek: weekday numbers are replaced with abbreviations (1 = MON, 2 = TUE, etc.)
with the help of gsub.
d$DayOfWeek = gsub("1", "MON", d$DayOfWeek)
d$DayOfWeek = gsub("2", "TUE", d$DayOfWeek)
d$DayOfWeek = gsub("3", "WED", d$DayOfWeek)
d$DayOfWeek = gsub("4", "THU", d$DayOfWeek)
d$DayOfWeek = gsub("5", "FRI", d$DayOfWeek)
d$DayOfWeek = gsub("6", "SAT", d$DayOfWeek)
d$DayOfWeek = gsub("7", "SUN", d$DayOfWeek)
Origin: origin airports are grouped into five regions (SW, SE, NE, MW, W) with the help
of the gsub function.
d$Origin = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS
"SW", d$Origin)
d$Origin = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT
"SE", d$Origin)
d$Origin = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT
"NE", d$Origin)
d$Origin = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID
"MW", d$Origin)
d$Origin = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT
"W", d$Origin)
Dest: destination airports are grouped into the same five regions (SW, SE, NE, MW, W)
with the help of the gsub function.
d$Dest = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS|G
"SW", d$Dest)
d$Dest = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT|D
"SE", d$Dest)
d$Dest = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT|C
"NE", d$Dest)
d$Dest = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID|C
"MW", d$Dest)
d$Dest = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT|B
"W", d$Dest)
DayofMonth: the December days are split into regular days and the Christmas week.
d$DayofMonth = ordered(cut(d$DayofMonth, c(0, 23, 32)), labels = c("R.Days",
"CH.Days"))
DepDelay: the departure delay is split into two levels, low and high.
d$DepDelay = ordered(cut(d$DepDelay, c(-Inf, 60, Inf)), labels = c("low", "high"))
7.2 Rpart
rpart performs recursive partitioning for classification, regression and survival trees.
We use it to model our two response variables, DepDelay and ArrDelay.
Departure Delay: DepDelay is the response variable; DayofMonth, DayOfWeek, DepTime and
Distance are the predictors.
ss.formula = DepDelay ~ DayofMonth + DayOfWeek + DepTime + Distance
# formula for tree
ss.rpart = rpart(data = d, formula = ss.formula)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Figure: draw.tree output for the DepDelay rpart model; all splits are on DepTime
(447.5, 1406.5, 2229.5). Total classified correct = 27.5%]
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 168647 50820 low (0.6987 0.3013)
## 2) DepTime< 1406 72390 14340 low (0.8019 0.1981)
## 4) DepTime>=447.5 70766 13030 low (0.8158 0.1842) *
## 5) DepTime< 447.5 1624 314 high (0.1933 0.8067) *
## 3) DepTime>=1406 96257 36480 low (0.6210 0.3790)
## 6) DepTime< 2230 91550 33200 low (0.6374 0.3626) *
## 7) DepTime>=2230 4707 1427 high (0.3032 0.6968) *
From this tree we conclude that flights departing at night, roughly between 10:30 PM and
5:00 AM, are delayed more often than daytime flights. The tree depends mainly on the
departure and arrival times, so we removed AirTime from the next decision tree.
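Which variables the tree "mainly depends on" can also be checked directly: a fitted rpart object carries a `variable.importance` vector. A sketch on toy data (in the report, the real call would use `ss.rpart` from above; the `toy` data frame and its delay rule are assumptions for illustration):

```r
library(rpart)
set.seed(1)
# Toy data in which delay is driven mostly by departure time,
# echoing the pattern the report's tree found.
toy <- data.frame(
  DepTime = sample(0:2359, 500, replace = TRUE),
  Distance = sample(100:2500, 500, replace = TRUE)
)
toy$DepDelay <- factor(ifelse(toy$DepTime > 2230 | toy$DepTime < 445, "high", "low"))

fit <- rpart(DepDelay ~ DepTime + Distance, data = toy)
# Named vector; the largest entry is the most influential split variable.
print(fit$variable.importance)
```

On data like this, DepTime dominates the importance ranking, which is the same signal that motivated dropping AirTime above.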
Arrival Delay: ArrDelay is the response variable; DayofMonth, DayOfWeek, Origin and
Distance are the predictors.
ss.formula = ArrDelay ~ DayofMonth + DayOfWeek + Distance + Origin
# formula for tree
R.control = rpart.control(cp = 0.001) # to control tree
ss.rpart = rpart(data = d, formula = ss.formula, control = R.control)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Figure: draw.tree output for the ArrDelay rpart model; splits on Origin, DayOfWeek and
DayofMonth. Total deviance explained = 1.5%]
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 168647 675500000 62.55
## 2) Origin=SE,SW,W 110460 394400000 59.06
## 4) DayOfWeek=MON,SUN,THU,TUE,WED 81657 273300000 57.44
## 8) Origin=SE,SW 50291 142800000 54.80 *
## 9) Origin=W 31366 129600000 61.67 *
## 5) DayOfWeek=FRI,SAT 28803 120300000 63.64
## 10) DayofMonth=R.Days 17922 68280000 58.88 *
## 11) DayofMonth=CH.Days 10881 50950000 71.48 *
## 3) Origin=MW,NE 58187 277100000 69.20
## 6) DayOfWeek=MON,SUN,THU,WED 33946 124100000 63.62
## 12) DayOfWeek=THU 5849 17040000 52.35 *
## 13) DayOfWeek=MON,SUN,WED 28097 106100000 65.97 *
## 7) DayOfWeek=FRI,SAT,TUE 24241 150500000 77.01 *
From this decision tree we can see that the data is split first by origin region (SE, SW, W
versus MW, NE) and then by day of week (MON, SUN, THU, TUE, WED versus FRI, SAT).
7.3 Ctree
ctree builds conditional inference trees, which embed tree-structured regression models
into a well-defined theory of conditional inference procedures.
Departure Delay:
ss.formula1 = DepDelay ~ Distance + DepTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) #height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Figure: ctree for DepDelay, depth 2; all splits on DepTime (447, 1406, 2229), with
low/high bar charts at the leaf nodes]
As we explained for rpart, ctree gives the same result: delays are higher at night, roughly
between 10:30 PM and 5:00 AM, than during the day.
Arrival Delay:
ss.formula1 = ArrDelay ~ Distance + ArrTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) # height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Figure: ctree for ArrDelay, depth 2; all splits on ArrTime (134, 518, 1438), with delay
boxplots at the leaf nodes]
From this tree we can conclude that around midnight like before 5:18 AM, delays
are higher compared to day time.
8 Random Forest
Now we will use random forest analysis to learn more about predictions. In the
random forest, the following libraries will be used.
library(randomForest) # for randomForest
library(rpart)
library(caret) # for confusionMatrix
Because the original data is too large, we again randomly select 1,000 rows.
rfd = rd[sample(nrow(rd), 1000), ]
We separate our dataset into a training set and a test set.
ndxTrain = sample(x = nrow(rfd), size = 0.7 * nrow(rfd))
rfd.train = rfd[ndxTrain, ]
rfd.test = rfd[-ndxTrain, ]
We set all the other variables to be predictors and see how they will affect our
target variable.
rfd.predictors = c("DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime", "ArrTime",
"CRSArrTime", "AirTime", "ActualElapsedTime", "Distance")
rfd.rf = randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
print(rfd.rf)
##
## Call:
## randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.14%
## Confusion matrix:
## Low High class.error
## Low 272 80 0.2273
## High 89 259 0.2557
plot(rfd.rf)
[Figure: OOB error rate of rfd.rf versus number of trees (0 to 500)]
From the diagram, the error rate stabilizes as the number of trees grows, so we keep the
default number of trees, 500.
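Beyond the error curve, randomForest also reports which predictors matter most, via `importance()` (or graphically with `varImpPlot()`). A sketch on toy data (the real call would be `importance(rfd.rf)`; the `toy` data frame is an assumption for illustration):

```r
library(randomForest)
set.seed(1)
# Toy data where ArrDelay depends mainly on departure time.
toy <- data.frame(
  DepTime = sample(0:2359, 300, replace = TRUE),
  Distance = sample(100:2500, 300, replace = TRUE)
)
toy$ArrDelay <- factor(ifelse(toy$DepTime > 1800, "High", "Low"))

rf <- randomForest(ArrDelay ~ DepTime + Distance, data = toy, ntree = 200)
# Mean decrease in Gini impurity per predictor; larger = more important.
imp <- importance(rf)
print(imp)
```

On data like this, DepTime comes out far more important than Distance, which is consistent with the splits the decision trees found.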
rfd.train.pred = predict(object = rfd.rf, newdata = rfd.train, type = "class")
rfd.test.pred = predict(object = rfd.rf, newdata = rfd.test, type = "class")
confusionMatrix(data = rfd.train.pred, reference = rfd.train$ArrDelay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 352 0
## High 0 348
##
## Accuracy : 1
## 95% CI : (0.995, 1)
## No Information Rate : 0.503
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.503
## Detection Rate : 0.503
## Detection Prevalence : 0.503
## Balanced Accuracy : 1.000
##
## 'Positive' Class : Low
##
confusionMatrix(data = rfd.test.pred, reference = rfd.test$ArrDelay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 133 32
## High 29 106
##
## Accuracy : 0.797
## 95% CI : (0.747, 0.841)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.59
## Mcnemar's Test P-Value : 0.798
##
## Sensitivity : 0.821
## Specificity : 0.768
## Pos Pred Value : 0.806
## Neg Pred Value : 0.785
## Prevalence : 0.540
## Detection Rate : 0.443
## Detection Prevalence : 0.550
## Balanced Accuracy : 0.795
##
## 'Positive' Class : Low
##
Although the sample is randomly chosen, we consistently get an accuracy above 70 percent,
which is higher than a single decision tree.
9 Classification
Classification techniques predict group membership for data instances. For classification
we use the knn and SVM algorithms.
9.1 knn
K-Nearest Neighbors (kNN) is a supervised machine learning algorithm for object
classification.
library(class) #for knn
library(RWeka) #for IBk function
## Error: package or namespace load failed for ’RWeka’
9.2 Processing Data
We remove the columns that are not useful.
kd = kd[, -20:-29]
kd = kd[, -1:-2]
kd = kd[, -3:-12]
kd = kd[, -4]
kd = kd[, -5]
We keep ArrDelay as our response variable and categorize it into two levels, low and high
delay.
kd$ArrDelay = ordered(cut(kd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine cannot handle the full dataset, so we use 1,000 random records.
kdd = kd[sample(nrow(kd), 1000), ] # sample dataset
The IBk function implements the kNN technique to predict the arrival delay variable from
the remaining four variables of the kdd data frame, so we use this function and store the
result in classifier.
classifier = IBk(ArrDelay ~ DayOfWeek + DayofMonth + Distance + Origin, data = kdd,
control = Weka_control(K = 4)) # k=4 because 4 other variable
## Error: could not find function "IBk"
summary(classifier) # detail eplanation with confusion matrix
## Error: error in evaluating the argument ’object’ in selecting a
method for function ’summary’: Error: object ’classifier’ not found
With the k-nearest neighbour technique we found that around 70% of the data are correctly
classified and only 30% misclassified. The confusion matrix shows that the high-delay
class in particular is not classified well.
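Since RWeka failed to load in this run, the same k-NN idea can be run with `knn` from the `class` package that was loaded above. A minimal sketch on toy numeric data (knn needs numeric predictors, so a factor such as Origin would have to be encoded first; the toy vectors and the distance-based delay rule are assumptions):

```r
library(class)
set.seed(1)
# Toy training data: two numeric predictors and a delay label.
train_x <- data.frame(
  DayOfWeek = sample(1:7, 100, replace = TRUE),
  Distance = sample(100:2500, 100, replace = TRUE)
)
train_y <- factor(ifelse(train_x$Distance > 1200, "High", "Low"))

# Two unlabeled flights to classify.
test_x <- data.frame(DayOfWeek = c(2, 6), Distance = c(300, 2400))

# k = 4 neighbours, matching the Weka_control(K = 4) in the IBk call above.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 4)
print(pred)
```

Unlike IBk, `class::knn` has no separate train/predict steps: it classifies the test rows in one call.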
9.3 SVM
Our second classification method is SVM. A Support Vector Machine analyzes data and
recognizes patterns, and is used for classification and regression analysis.
10 Processing Data
We keep ArrDelay as our response variable and categorize it into two levels, low and high.
sd$ArrDelay = ordered(cut(sd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine cannot handle the full dataset, so we take a sample of it.
sdd = sd[sample(nrow(sd), 1000), ] # sample dataset
We divide our sample into two parts, a training set (train1) and a test set (test1).
sd1 = nrow(sdd)
nxd.train = sample(1:sd1, 0.7 * sd1)
sd.train1 = sdd[nxd.train, ]
sd.test1 = sdd[-nxd.train, ]
For SVM we are using these two libraries.
library(e1071)
library(caret)
The response variable is ArrDelay, predicted from two variables, DayOfWeek and Distance.
sd.formula = ArrDelay ~ DayOfWeek + Distance
plot.formula = DayOfWeek ~ Distance #For plot X and Y axis
sd.model = svm(formula = sd.formula, data = sd.train1) # for actual model creation.
summary(sd.model) # Detail description of a model.
##
## Call:
## svm(formula = sd.formula, data = sd.train1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 639
##
## ( 322 317 )
##
##
## Number of Classes: 2
##
## Levels:
## Low High
sd.predict = predict(sd.model, sd.test1) # prediction on testing data set
# confusionMatrix(data = sd.predict, reference = sdd$ArrDelay)
plot(x = sd.model, data = sd.train1, formula = plot.formula) #default: cost=1, gamma=0.5
[Figure: SVM classification plot of DayOfWeek versus Distance with the default
parameters, cost = 1 and gamma = 0.5]
For a clearer result, we change the cost and gamma parameters.
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
kernel = "radial", cost = 1, gamma = 5)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[Figure: SVM classification plot of DayOfWeek versus Distance with cost = 1, gamma = 5]
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
kernel = "radial", cost = 1, gamma = 0.1)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[Figure: SVM classification plot of DayOfWeek versus Distance with cost = 1, gamma = 0.1]
From this graph we can see that days 6 and 7, Saturday and Sunday, have more delays than
the rest of the week. Beyond a certain distance, arrival delays tend to get lower.
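Instead of trying cost and gamma values by hand as above, e1071 can grid-search them with `tune`, which picks the pair with the lowest cross-validated error. A sketch on toy data (the real call would pass `sd.train1`; the `toy` data frame and its delay rule are assumptions):

```r
library(e1071)
set.seed(1)
# Toy training data: weekends (days 6-7) tend to be delayed, as in the report.
toy <- data.frame(
  DayOfWeek = sample(1:7, 200, replace = TRUE),
  Distance = sample(100:2500, 200, replace = TRUE)
)
toy$ArrDelay <- factor(ifelse(toy$DayOfWeek >= 6, "High", "Low"))

# Cross-validated grid search over the cost/gamma values tried manually above.
tuned <- tune(svm, ArrDelay ~ DayOfWeek + Distance, data = toy,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 0.5, 5)))
print(tuned$best.parameters)
```

The `best.model` component of the result can then be plotted or used for prediction directly, replacing the three hand-tuned models above.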
11 Conclusion
Overall, we did find some useful results in our analysis. In the logistic regression,
several variables were statistically significant, such as Distance, Day of Week, Origin,
Destination, Departure Time, Arrival Time and Day of Month. We converted them from
numeric to categorical variables and found some of the reasons behind U.S. flight
arrival delays.
To find relationships between different variables, we applied association rules with good
results. We found that Monday flights are more likely to be on time, and that Chicago
flights had high delays in December 2008; checking the weather records showed that the
weather in Chicago was very bad that month, and many flights were affected by it.
The fastest and easiest way to make decisions about our dataset is the decision tree,
which organizes the data hierarchically, so we used the rpart and ctree algorithms. From
the diagrams, we conclude that delays are relatively higher at night, between roughly
10:30 PM and 5:00 AM, and on weekends (Saturday and Sunday) compared with weekdays,
which makes sense.
On the other hand, we also used cluster analysis. After applying the kmeans, hclust and
pam clustering methods to our dataset, we found that two clusters work best, and we
validated this with the clValid function and other measurements. After separating the
dataset into the two subsets, we also found relationships that confirm what we had found
with the association rules.
For classification we used the knn and SVM techniques. With k-nearest neighbours we found
that approximately 70 percent of the data are correctly classified, which the confusion
matrix confirms. We also did some pattern recognition with the Support Vector Machine
(SVM), where we found that the longer the distance, the longer the delay.
Based on our analysis, we suggest travelling during the daytime on weekdays so that you
can arrive at your destination on time.
12 Limitation
There are still a few limitations during our analysis for this dataset.
First, we are limited by our computers' processing capability. The original dataset is
huge, containing 7,009,728 observations, so we selected a part of it (all U.S. airline
data for December 2008) to reduce the file size loaded into R. In addition, the PAM
algorithm in the cluster analysis and the random forest often got stuck or even crashed,
so we had to apply some functions, such as computing the distance matrix, to a random
sample. However, we have not verified how the random sample affects our results.
Another limitation is that so far we have only focused on delays. There might be other
interesting relationships among the other variables, which we may work on in the future.
13 Future Work
While we have already obtained some analysis outcomes, there are still a few
works we can do in the future.
First, due to the limitations of our computers, we cannot process large-scale data, so
some functions could not be applied to the full dataset. Because our analysis uses a
random sample, the results are not exactly reproducible from run to run.
Furthermore, since our target is analyzing whether flights perform on time or are
delayed, there are more relationships worth finding; for instance, we could identify
the busiest carrier in the air.
Last but not least, from previous work we found that DBSCAN does not work well unless the
dataset is very large, so we did not apply it here. We would like to see how DBSCAN
performs and compare its results with the other clustering methods.
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 

Recently uploaded (16)

Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 

Data Mining & Analytics for U.S. Airlines On-Time Performance

Data Analysis of U.S. Airlines On-time Performance

Yanxiang Zhu, Nilesh Padwal, Mingxuan Li
Finished by June 27th, 2014

Contents

1 Introduction
  1.1 Background and Problem Description
  1.2 Dataset Description
2 Collecting Data
3 Preprocessing Data
4 Variables Description
5 Association Rule
6 Cluster Analysis
  6.1 Setup
  6.2 Determine K
  6.3 Cluster Analysis
      6.3.1 Pam
      6.3.2 Kmeans
7 Decision Tree
  7.1 Categorize Variable
  7.2 Rpart
  7.3 Ctree
8 Random Forest
9 Classification
  9.1 knn
  9.2 Processing Data
  9.3 SVM
10 Processing Data
11 Conclusion
12 Limitation
13 Future Work
1 Introduction

1.1 Background and Problem Description

In the airline industry, it is common for carriers to struggle to get planes to the gate on time. The challenge is to improve the quality of airline on-time performance. Beyond carriers' services and baggage policies, saving passengers' time in the air is arguably even more important. Our goal is therefore to find noteworthy and valuable relationships in the data using data mining techniques such as cluster analysis, association rules, and decision trees.

1.2 Dataset Description

The dataset is a collection of airline data from the Research and Innovative Technology Administration (RITA); it contains detailed facets of every flight between 1987 and 2008. It includes 29 variables such as destination, origin, arrival time, and departure time, and any flight can be tracked through these features. We should mention that, due to the limited performance of our computers, we fetched only part of the whole collection (all U.S. flights over those 22 years) to process and analyze. Our selected dataset still has millions of observations, which is certainly enough to obtain satisfying outcomes. Here is a descriptive list of the useful variables:

1. DayofMonth: December 1st to December 31st.
2. DayOfWeek: 1 refers to Monday and, in a similar way, 7 refers to Sunday.
3. DepTime: actual departure time.
4. ArrTime: actual arrival time.
5. CRSDepTime: scheduled departure time.
6. CRSArrTime: scheduled arrival time.
7. UniqueCarrier: unique carrier code.
8. FlightNum: flight number.
9. ActualElapsedTime: in minutes.
10. CRSElapsedTime: in minutes.
11. AirTime: in minutes.
12. ArrDelay: arrival delay, in minutes.
13. DepDelay: departure delay, in minutes.
14. Origin: origin IATA airport code.
15. Dest: destination IATA airport code.
16. Distance: in miles.

According to the historical record of on-time flight operation by U.S. air carriers, 2008 was an interesting and special period for the airline industry: the on-time percentage was 76.0%, and it then rose to 79.5% in 2009. That is why we chose this breaking point, to find out what lies behind the headline numbers.
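For context on those percentages: the usual U.S. DOT convention (an assumption we note here, it is not stated in the dataset itself) counts a flight as on time if it arrives less than 15 minutes after schedule. A minimal sketch with toy ArrDelay values shows how such an on-time percentage is computed:

```r
# Toy ArrDelay values in minutes (negative = early). Under the DOT
# convention assumed here, "on time" means arriving < 15 minutes late.
arr.delay <- c(-5, 0, 10, 14, 15, 45)
on.time.pct <- 100 * mean(arr.delay < 15)
stopifnot(abs(on.time.pct - 66.67) < 0.01)  # 4 of 6 flights are on time
```

On the raw 2008 data, `100 * mean(d$ArrDelay < 15, na.rm = TRUE)` computes the corresponding figure.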
2 Collecting Data

The dataset we use contains all commercial flights within the USA in 2008. It was downloaded from http://stat-computing.org/dataexpo/2009; it contains nearly 10 million records and takes about 700 MB of space.

```r
file.name <- paste(2008, "csv.bz2", sep = ".")
if (!file.exists(file.name)) {
    url.text <- paste("http://stat-computing.org/dataexpo/2009/", 2008,
                      ".csv.bz2", sep = "")
    cat("Downloading missing data file ", file.name, "\n", sep = "")
    download.file(url.text, file.name)
}
```

To import the data into our workspace, we use the read.csv function and store the dataset in d.

```r
d <- read.csv("2008.csv")
```

3 Preprocessing Data

Since the analysis needs a well-structured dataset, we omit the NA values. And due to the limits of our computers' processing capability, we decided to work only with data from December 2008. That subset still has 1,524,735 observations of 29 variables, which we think is enough to obtain good analysis results from such a large-scale dataset.

```r
d = subset(d, Month == "12")
d = na.omit(d)
```

After that, we also remove some columns that we think are not useful in our study, directly from the original dataset.
```r
d = d[, -20:-29]
```

On the other hand, since we have already decided to use only the December 2008 data, the Year and Month columns become useless.

```r
d = d[, -1]
d = d[, -1]
```

So far, our dataset contains 168,647 records with 17 variables.

```r
str(d)
```

```
## 'data.frame': 168647 obs. of 17 variables:
##  $ DayofMonth       : int 3 3 3 3 3 3 3 3 3 3 ...
##  $ DayOfWeek        : int 3 3 3 3 3 3 3 3 3 3 ...
##  $ DepTime          : int 1126 1859 1256 1925 2002 1716 1620 1807 1930 1004 ...
##  $ CRSDepTime       : int 1045 1825 1240 1900 1940 1610 1555 1725 1905 1005 ...
##  $ ArrTime          : int 1241 1925 1458 2120 2249 2054 1826 1910 2041 1130 ...
##  $ CRSArrTime       : int 1200 1900 1435 2100 2230 1950 1800 1845 2020 1115 ...
##  $ UniqueCarrier    : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 1 ...
##  $ FlightNum        : int 2717 1712 294 2776 623 586 1259 548 619 1152 ...
##  $ TailNum          : Factor w/ 5374 levels "","80009E","80019E",..: 3796 2127 3943 3316 ...
##  $ ActualElapsedTime: int 75 86 62 55 107 158 186 63 71 86 ...
##  $ CRSElapsedTime   : int 75 95 55 60 110 160 185 80 75 70 ...
##  $ AirTime          : int 55 73 45 46 93 140 177 50 56 51 ...
##  $ ArrDelay         : int 41 25 23 20 19 64 26 25 21 15 ...
##  $ DepDelay         : int 41 34 16 25 22 66 25 42 25 -1 ...
##  $ Origin           : Factor w/ 303 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Dest             : Factor w/ 304 levels "ABE","ABI","ABQ",..: 82 157 160 175 177 181 ...
##  $ Distance         : int 349 487 289 332 718 1121 1111 328 328 321 ...
```

4 Variables Description

After importing the dataset, the variables associated with each observation were explored further. They are listed and described below.

1. DayofMonth: December 1st to December 31st.

```r
summary(d$DayofMonth)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##  1.0    11.0   18.0 17.2    23.0 31.0
```

2. DayOfWeek: 1 refers to Sunday and, in a similar way, 7 refers to Saturday.

```r
summary(d$DayOfWeek)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00    2.00   4.00 3.74    5.00 7.00
```

3. DepTime: actual departure time.

```r
summary(d$DepTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1    1120   1510 1470    1840 2400
```

Departure time is another key factor we examine: we want to know which time of day is best for flying.

4. ArrTime: actual arrival time.

```r
summary(d$ArrTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1    1230   1640 1560    2010 2400
```

5. CRSDepTime: scheduled departure time.

```r
summary(d$CRSDepTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    5    1040   1420 1400    1750 2360
```

6. CRSArrTime: scheduled arrival time.

```r
summary(d$CRSArrTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1    1230   1620 1580    1950 2360
```

7. UniqueCarrier: unique carrier code.

```r
carrier = data.frame(d$UniqueCarrier)
qplot(x = d$UniqueCarrier, data = carrier, fill = d$UniqueCarrier)
```
[Figure: bar chart of December 2008 flight counts by carrier (UniqueCarrier).]

Southwest Airlines (WN) ran the most flights in the U.S. in 2008; its flight count is even greater than SkyWest Airlines and American Airlines combined. We will also help you find out which airline to choose if you want to avoid delays.

8. FlightNum: flight number.

```r
summary(d$FlightNum)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    1     658   1680 2360    3590 9740
```

9. ActualElapsedTime: in minutes.

```r
summary(d$ActualElapsedTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##   18      88    126  144     177  790
```

10. CRSElapsedTime: in minutes.
```r
summary(d$CRSElapsedTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##   26      82    116  135     165  660
```

11. AirTime: in minutes.

```r
summary(d$AirTime)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##    6      60     93  112     141  647
```

12. ArrDelay: arrival delay, in minutes.

```r
summary(d$ArrDelay)
```

```
## Min. 1st Qu. Median Mean 3rd Qu.   Max.
## 15.0    24.0   41.0 62.6    77.0 1660.0
```

Arrival delay is our target variable. The median is 41 minutes, which means the delay problem is severe. We are going to find out which factors cause the delay.

13. DepDelay: departure delay, in minutes.

```r
summary(d$DepDelay)
```

```
##  Min. 1st Qu. Median Mean 3rd Qu.   Max.
## -34.0    15.0   35.0 53.5    71.0 1600.0
```

14. Origin: origin IATA airport code.

```r
summary(d$Origin)
```

```
## ATL ORD DEN DFW DTW PHX EWR IAH LAS
## 12232 11020 7004 6208 4984 4353 4333 4269 4020
## MSP LAX JFK SLC SFO SEA CLT BOS PHL
## 4004 3837 3409 3375 3370 2992 2977 2841 2676
## MDW MCO CVG BWI LGA SAN DCA IAD MEM
## 2546 2481 2457 2383 2272 1832 1768 1714 1665
## MIA FLL STL MKE TPA MCI BNA CLE PDX
## 1601 1590 1504 1491 1489 1381 1358 1352 1329
## HOU RDU DAL OAK HNL SMF PIT SJC IND
## 1324 1297 1218 1211 1164 1131 1061 1003 989
## SNA ABQ AUS SAT MSY CMH PBI BUF OMA
## 944 905 872 809 806 757 733 727 704
## JAX BDL BUR RSW BHM ONT PVD GRR SDF
## 643 629 627 607 546 521 517 514 504
## TUL OKC RNO DSM SJU RIC MHT DAY MSN
## 493 488 485 478 471 452 426 419 408
## GEG LIT BOI ELP TUS ANC ICT LGB TYS
## 404 404 394 394 389 384 371 367 358
```
```
## ALB ROC XNA SYR OGG ORF HPN COS CID
## 356 356 344 343 332 322 317 311 292
## CHS FAT LEX GSO MLI CAE HSV SAV JAN
## 287 285 282 278 271 268 260 259 251
## (Other)
## 12768
```

The busiest airport in the U.S. is Atlanta (ATL); Chicago O'Hare and Denver rank second and third.

15. Dest: destination IATA airport code.

```r
summary(d$Dest)
```

```
## ATL ORD DEN DFW LAX PHX LAS EWR SFO
## 11791 9506 6338 5159 5013 4663 4357 4335 4277
## IAH DTW MSP JFK SLC SEA MCO LGA PHL
## 4271 3575 3280 3238 3216 3131 3015 2818 2669
## BOS CLT SAN BWI FLL MDW CVG TPA MEM
## 2541 2422 2265 2142 1959 1912 1831 1748 1721
## DCA MIA IAD PDX RDU MCI STL SMF OAK
## 1710 1702 1494 1483 1455 1384 1374 1361 1351
## BNA CLE MKE SJC HOU SNA SAT DAL AUS
## 1346 1314 1290 1210 1196 1120 1096 1085 1049
## ABQ HNL PIT PBI MSY IND CMH RSW OMA
## 1014 981 935 917 889 841 831 758 757
## JAX BUR ONT BUF TUL OKC SJU BHM TUS
## 737 717 679 664 615 608 604 585 585
## BDL RNO ANC SDF PVD DSM GRR RIC ELP
## 578 567 560 526 505 501 498 485 480
## BOI LIT GEG MSN TYS ICT DAY XNA LGB
## 472 442 424 415 412 408 406 386 381
## MHT COS ORF GSO CHS ROC CAE JAN HPN
## 373 366 362 336 334 327 301 296 295
## SAV CID OGG FAT ALB SYR LEX HSV MLI
## 293 292 292 291 285 280 276 261 255
## (Other)
## 13756
```

The result is very similar to Origin. We also need to check whether the busiest airports suffer delays the most.

16. Distance: in miles.

```r
summary(d$Distance)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##   31     338    599  753     984 4960
```
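One caveat about the time variables above: DepTime, ArrTime, CRSDepTime, and CRSArrTime are HHMM-encoded integers (e.g. 1126 means 11:26), so means and quartiles computed on them mix hours and minutes. A small helper (hhmm2min is our hypothetical name, not part of the dataset) converts them to minutes after midnight before doing arithmetic:

```r
# HHMM integers encode clock time, e.g. 1126 = 11:26. Convert to minutes
# after midnight so that averages and differences become meaningful.
hhmm2min <- function(t) (t %/% 100) * 60 + t %% 100

stopifnot(hhmm2min(1126) == 686)   # 11 * 60 + 26
stopifnot(hhmm2min(2400) == 1440)  # this dataset codes midnight as 2400
```

For example, `summary(hhmm2min(d$DepTime))` would give statistics in true minutes.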
The vast majority of flights cover a distance of under 1,000 miles. The relationship between distance and delay time is another important question we need to examine.

5 Association Rule

Thinking about flight performance, we believe there are specific, important patterns that, once identified, could help improve the quality of on-time performance. In this section we look for hidden relationships among the different facets of the flight dataset. We raise concrete questions, for instance: how do factors such as Distance and DayOfWeek influence on-time performance? Our goal is to answer these questions using association rules.

• Support: the probability that antecedent and consequent hold simultaneously in the dataset.
• Confidence: the conditional probability that the consequent holds given that the antecedent is satisfied.
• Lift: the ratio of confidence to expected confidence. Values greater than 1 indicate that the rule has predictive potential.

First, we load the libraries that association rule mining requires.

```r
library(arules)
library(arulesViz)
```

We copy the original dataset, because we will do some further data transformation on it, and store the copy as data.

```r
data = d
```

In our experience it is effective to divide the numeric variables into ordered, reasonable ranges for this analysis. So we split Distance, ArrDelay, AirTime, CRSDepTime, and CRSArrTime respectively, making sure that every observation falls into one of the given ranges. (Note that cut() uses left-open, right-closed intervals by default, so, for example, an arrival delay of exactly 25 minutes falls into the first bin.)

```r
data$Distance = ordered(cut(data$Distance, c(0, 300, 600, 1000, Inf)),
    labels = c("Short", "Medium", "Long", "Too long"))
data$ArrDelay = ordered(cut(data$ArrDelay, c(0, 25, 50, 80, Inf)),
    labels = c("On-Time", "Delayed", "Intermediate-Delayed", "Much-Delayed"))
data$AirTime = ordered(cut(data$AirTime, c(-1, 50, 100, 200, 300, Inf)),
    labels = c("Too-Short", "Short", "Intermediate", "Long", "Too-Long"))
data$CRSDepTime = ordered(cut(data$CRSDepTime, c(-1, 600, 1200, 1800, Inf)),
    labels = c("Overnight", "Morning", "Afternoon", "Evening"))
data$CRSArrTime = ordered(cut(data$CRSArrTime, c(-1, 600, 1200, 1800, 2359)),
    labels = c("Overnight", "Morning", "Afternoon", "Evening"))
```

DayOfWeek contains the numbers 1 to 7 to represent the days of the week, so we convert it to character, replace each number with the day's name, and turn the result into a factor.

```r
data$DayOfWeek = as.character(data$DayOfWeek)
data$DayOfWeek = gsub("^1", "Sunday", data$DayOfWeek)
data$DayOfWeek = gsub("^2", "Monday", data$DayOfWeek)
data$DayOfWeek = gsub("^3", "Tuesday", data$DayOfWeek)
data$DayOfWeek = gsub("^4", "Wednesday", data$DayOfWeek)
data$DayOfWeek = gsub("^5", "Thursday", data$DayOfWeek)
data$DayOfWeek = gsub("^6", "Friday", data$DayOfWeek)
data$DayOfWeek = gsub("^7", "Saturday", data$DayOfWeek)
data$DayOfWeek = factor(data$DayOfWeek)
```

Several variables, such as FlightNum and ActualElapsedTime, are not useful here, so we remove them.

```r
logNdx = !(names(data) %in% c("DayofMonth", "FlightNum", "Cancelled",
    "ActualElapsedTime", "DepDelay", "UniqueCarrier"))
data.AR = data[, logNdx]
```

Having finished the processing above, the dataset to analyze is called data.AR. It contains 10 variables, shown below.
```r
summary(data.AR)
```

```
##     DayOfWeek       CRSDepTime       CRSArrTime        TailNum
##  Friday   :20526   Overnight: 3009   Overnight: 2050   N986CA :   129
##  Monday   :27345   Morning  :54579   Morning  :35361   N87353 :   126
##  Saturday :21397   Afternoon:72468   Afternoon:67079   N77302 :   122
##  Sunday   :30021   Evening  :38591   Evening  :64157   N507CA :   112
##  Thursday :22742                                       N472CA :   107
##  Tuesday  :26882                                       N471CA :   106
##  Wednesday:19734                                       (Other):167945
##         AirTime                      ArrDelay          Origin
##  Too-Short   :28126   On-Time             :46402   ATL    : 12232
##  Short       :64977   Delayed             :53601   ORD    : 11020
##  Intermediate:56146   Intermediate-Delayed:29107   DEN    :  7004
##  Long        :14165   Much-Delayed        :39537   DFW    :  6208
##  Too-Long    : 5233                                DTW    :  4984
##                                                    PHX    :  4353
##                                                    (Other):122846
```
```
##       Dest            Distance
##  ATL    : 11791   Short   :33685
##  ORD    :  9506   Medium  :50906
##  DEN    :  6338   Long    :43644
##  DFW    :  5159   Too long:40412
##  LAX    :  5013
##  PHX    :  4663
##  (Other):126177
```

We now apply association rule mining to the dataset. First, we want to find the main factors that could cause a flight to be on time or not. We set the support and confidence thresholds and restrict the right-hand side to the four delay levels: On-Time, Delayed, Intermediate-Delayed, and Much-Delayed.

```r
apriori.appearance1 = list(rhs = c("ArrDelay=On-Time", "ArrDelay=Delayed",
    "ArrDelay=Intermediate-Delayed", "ArrDelay=Much-Delayed"), default = "lhs")
apriori.parameter1 = list(support = 0.01, confidence = 0.1)
rules1 = apriori(data.AR, parameter = apriori.parameter1,
    appearance = apriori.appearance1)
```

```
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
##        0.1    0.1    1 none FALSE            TRUE    0.01      1     10
## target ext
##  rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
##    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[4 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.07s].
## writing ... [743 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
```

Requiring lift greater than 1.2, we create a subset of the rules ordered by lift.

```r
rules1.subset = subset(rules1, subset = lift > 1.2 & confidence > 0.1)
rules1.subset.conf = sort(rules1.subset, by = "lift")
```
```r
summary(rules1.subset)
```

```
## set of 50 rules
##
## rule length distribution (lhs + rhs): sizes
##  2  3  4  5
##  4 26 18  2
##
##  Min. 1st Qu. Median Mean 3rd Qu. Max.
##  2.00    3.00   3.00 3.36    4.00 5.00
##
## summary of quality measures:
##     support          confidence         lift
##  Min.   :0.0101   Min.   :0.284   Min.   :1.20
##  1st Qu.:0.0118   1st Qu.:0.314   1st Qu.:1.24
##  Median :0.0147   Median :0.342   Median :1.27
##  Mean   :0.0182   Mean   :0.337   Mean   :1.31
##  3rd Qu.:0.0200   3rd Qu.:0.354   3rd Qu.:1.32
##  Max.   :0.0725   Max.   :0.428   Max.   :1.83
##
## mining info:
##    data ntransactions support confidence
## data.AR        168647    0.01        0.1
```

The list below displays the top five rules sorted by lift.

```r
inspect(rules1.subset.conf[1:5])
```

```
##   lhs                       rhs                      support confidence  lift
## 1 {Dest=EWR}             => {ArrDelay=Much-Delayed}  0.01101     0.4284 1.827
## 2 {CRSDepTime=Afternoon,
##    Dest=ORD}             => {ArrDelay=Much-Delayed}  0.01040     0.4048 1.727
## 3 {CRSArrTime=Evening,
##    Origin=ORD}           => {ArrDelay=Much-Delayed}  0.01032     0.3786 1.615
## 4 {Dest=ORD}             => {ArrDelay=Much-Delayed}  0.02001     0.3549 1.514
## 5 {Origin=ORD}           => {ArrDelay=Much-Delayed}  0.02199     0.3365 1.435
```

Flights originating from or landing at ORD are very likely to be delayed. We searched the weather history of Chicago O'Hare International Airport (ORD): it did suffer a very severe snowstorm during that period. So even though weather conditions are not included in the flight data, weather is still a key driver of on-time performance.

```r
rules1.subset.delay = subset(rules1.subset, subset = lhs %in% "DayOfWeek=Monday")
plot(rules1.subset.delay, method = "graph", control = list(type = "items"))
```
[Figure: graph of 5 rules connecting DayOfWeek=Monday, CRSDepTime=Morning/Evening, and CRSArrTime=Morning/Evening to ArrDelay=On-Time and ArrDelay=Much-Delayed; node size encodes support (0.011-0.019), color encodes lift (1.213-1.421).]

Next, we mine rules whose consequent is ArrDelay=On-Time.

```r
apriori.appearance3 = list(rhs = c("ArrDelay=On-Time"), default = "lhs")
apriori.parameter3 = list(support = 0.01, confidence = 0.1)
rules3 = apriori(data.AR, parameter = apriori.parameter3,
    appearance = apriori.appearance3)
```

```
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
##        0.1    0.1    1 none FALSE            TRUE    0.01      1     10
## target ext
##  rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
##    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
```
```
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].
## sorting and recoding items ... [83 item(s)] done [0.01s].
## creating transaction tree ... done [0.09s].
## checking subsets of size 1 2 3 4 5 done [0.06s].
## writing ... [211 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
```

```r
rules3.subset = subset(rules3, subset = lift > 1.2 & confidence > 0.1)
rules3.subset.conf = sort(rules3.subset, by = "lift")
rules3.subset.ontime = subset(rules3.subset.conf, subset = lhs %in%
    c("DayOfWeek=Friday", "DayOfWeek=Saturday", "DayOfWeek=Sunday",
      "DayOfWeek=Monday", "DayOfWeek=Tuesday", "DayOfWeek=Wednesday",
      "DayOfWeek=Thursday"))
inspect(rules3.subset.ontime[1:5])
```

```
##   lhs                      rhs                 support confidence  lift
## 1 {DayOfWeek=Monday,
##    CRSArrTime=Morning}  => {ArrDelay=On-Time}  0.01326     0.3909 1.421
## 2 {DayOfWeek=Monday,
##    CRSDepTime=Morning,
##    CRSArrTime=Morning}  => {ArrDelay=On-Time}  0.01170     0.3826 1.390
## 3 {DayOfWeek=Monday,
##    CRSDepTime=Morning}  => {ArrDelay=On-Time}  0.01874     0.3624 1.317
## 4 {DayOfWeek=Wednesday,
##    CRSDepTime=Morning}  => {ArrDelay=On-Time}  0.01283     0.3503 1.273
## 5 {DayOfWeek=Sunday,
##    CRSArrTime=Morning}  => {ArrDelay=On-Time}  0.01280     0.3349 1.217
```

Clearly, the flights with the highest on-time performance are in the morning. This suggests that in December 2008 air traffic control ran smoothly in the morning but was comparatively heavier in the afternoon and evening.
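As a sanity check on these numbers, lift is just confidence divided by the consequent's baseline support, and we can recompute it by hand from the counts reported earlier by summary(data.AR) (46,402 on-time flights out of 168,647) for the top rule above:

```r
# Recompute lift = confidence / support(rhs) for the rule
# {DayOfWeek=Monday, CRSArrTime=Morning} => {ArrDelay=On-Time},
# using the counts reported by summary(data.AR).
conf  <- 0.3909            # confidence reported by inspect()
p.rhs <- 46402 / 168647    # baseline P(ArrDelay = "On-Time")
lift  <- conf / p.rhs
stopifnot(abs(lift - 1.421) < 0.005)
```

In other words, being scheduled to arrive on a Monday morning raises the on-time probability roughly 42% above the December baseline.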
```r
rules1.subset.delay1 = subset(rules1.subset.conf, subset = lhs %in%
    c("Distance=Short", "Distance=Medium", "Distance=Long", "Distance=Too long"))
inspect(rules1.subset.delay1[1:10])
```

```
##    lhs                       rhs                 support confidence  lift
## 1  {CRSArrTime=Morning,
##     AirTime=Short,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.02186     0.3679 1.337
## 2  {CRSArrTime=Morning,
##     AirTime=Intermediate,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01494     0.3610 1.312
```
```
## 3  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     AirTime=Short,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.01996     0.3605 1.310
## 4  {CRSArrTime=Morning,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.02433     0.3596 1.307
## 5  {CRSArrTime=Morning,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01849     0.3556 1.292
## 6  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     AirTime=Intermediate,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01310     0.3547 1.289
## 7  {AirTime=Short,
##     Origin=ATL,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.01010     0.3527 1.282
## 8  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     Distance=Medium}      => {ArrDelay=On-Time}  0.02219     0.3526 1.281
## 9  {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     Distance=Long}        => {ArrDelay=On-Time}  0.01622     0.3493 1.270
## 10 {CRSDepTime=Morning,
##     CRSArrTime=Morning,
##     Distance=Too long}    => {ArrDelay=On-Time}  0.01120     0.3480 1.265
```

We find that the flights most likely to be on time are often those on long-distance routes. Short routes between nearby regions are much busier in the air traffic control system: you may have five or more flight choices from Boston to New York City, while there may be only two if your family wants to travel from Washington, D.C. to beautiful San Diego. Long-distance routes therefore put less pressure on the system, and air traffic jams are more likely on shorter routes.

6 Cluster Analysis

We are going to study the airline dataset using cluster analysis. Clustering generally refers to sorting observations into k groups (k indicates how many groups will be created) so as to maximize the similarity of observations within the same group and minimize the similarity of observations across different groups. Broadly, cluster analysis can be separated into two approaches, hierarchical and non-hierarchical; we will run non-hierarchical clustering with K-Means and PAM.
For the Airline dataset, we will continue looking for associations within the data and for the factors that influence ArrDelay.

6.1 Setup
First we load the libraries we are going to use.
library(cluster) # cluster library
library(proxy) # hcluster function
library(fpc) # cluster.stats function
library(pamr) # pam function
library(clValid) # clValid function
library(ggplot2) # plot diagram
Because the original dataset is too large to compute a distance matrix on, we randomly choose 1000 records and run the analysis on this sample.
cd = cd[sample(nrow(cd), 1000), ]
m = as.matrix(cd)
str(m)
## int [1:1000, 1:12] 31 10 28 18 14 23 7 20 11 20 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "6926813" "6946821" "6545814" "6680323" ...
## ..$ : chr [1:12] "DayofMonth" "DayOfWeek" "DepTime" "CRSDepTime" ...
mDist = dist(m)

6.2 Determine K
The most important step in cluster analysis is choosing the best K. To find it, we apply the clValid function, which directly reports the optimal result.
# clValid
hvalid <- clValid(m, 2:10, clMethods = c("hierarchical"), validation = "internal", maxitems = 1e+06)
pamvalid <- clValid(m, 2:10, clMethods = c("pam"), validation = "internal", maxitems = 1e+06)
kvalid <- clValid(m, 2:10, clMethods = c("kmeans"), validation = "internal", maxitems = 1e+06)
Now we can use the summary() function to inspect each method's result; the Optimal Scores section directly gives the best number of clusters.
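One of clValid's internal measures is the average silhouette width, whose definition is simple enough to compute by hand. A stdlib-Python sketch (toy points, not the airline sample):

```python
def silhouette_width(points, labels):
    """Average silhouette width: the mean over all points of (b - a) / max(a, b),
    where a is the mean distance to the point's own cluster and b is the
    smallest mean distance to any other cluster."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    n = len(points)
    widths = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:
            widths.append(0.0)  # convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in same) / len(same)
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for other in ([j for j in range(n) if labels[j] == lab]
                          for lab in set(labels) if lab != labels[i])
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / n

# Two well-separated toy clusters give a width close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]
```

Values near 1 mean tight, well-separated clusters; values near 0 or below mean overlapping ones, which is why the silhouette criterion favours the K that maximizes this score.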
  • 18. INFO7374 Data Science Final Project summary(kvalid) ## ## Clustering Methods: ## kmeans ## ## Cluster sizes: ## 2 3 4 5 6 7 8 9 10 ## ## Validation Measures: ## 2 3 4 5 6 7 8 9 ## ## kmeans Connectivity 2.878 79.275 85.293 86.702 104.193 131.675 105.027 128.007 136. ## Dunn 0.098 0.012 0.012 0.014 0.027 0.023 0.027 0.023 0. ## Silhouette 0.471 0.454 0.468 0.470 0.485 0.388 0.486 0.390 0. ## ## Optimal Scores: ## ## Score Method Clusters ## Connectivity 2.878 kmeans 2 ## Dunn 0.098 kmeans 2 ## Silhouette 0.486 kmeans 8 summary(pamvalid) ## ## Clustering Methods: ## pam ## ## Cluster sizes: ## 2 3 4 5 6 7 8 9 10 ## ## Validation Measures: ## 2 3 4 5 6 7 8 9 10 ## ## pam Connectivity 91.199 139.967 145.223 171.271 173.816 237.760 221.203 212.537 246.705 ## Dunn 0.019 0.013 0.014 0.018 0.009 0.012 0.013 0.013 0.013 ## Silhouette 0.404 0.288 0.318 0.360 0.305 0.298 0.320 0.320 0.333 ## ## Optimal Scores: ## ## Score Method Clusters ## Connectivity 91.199 pam 2 ## Dunn 0.019 pam 2 ## Silhouette 0.404 pam 2 summary(hvalid) 18
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 5.287 5.287 7.005 11.240 32.673 33.506 35.006 38.148 44.491
## Dunn 0.420 0.420 0.417 0.381 0.073 0.073 0.073 0.074 0.074
## Silhouette 0.485 0.460 0.443 0.426 0.407 0.374 0.368 0.317 0.318
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 5.287 hierarchical 2
## Dunn 0.420 hierarchical 2
## Silhouette 0.485 hierarchical 2
Because the 1000 sample records are chosen randomly, the results are not always identical, but in most cases K = 2 is suggested. We can also use other measurements to validate this result. The foreachcluster3 function below reports six measurements for each cluster number from 2 to 10.
foreachcluster3 = function(k) {
    pamC = pam(x = m, k)
    p.stats = cluster.stats(mDist, pamC$clustering)
    c(max.dia = p.stats$max.diameter, min.sep = p.stats$min.separation,
      avg.wi = p.stats$average.within, avg.bw = p.stats$average.between,
      silwidth = p.stats$avg.silwidth, dunn = p.stats$dunn)
}
We apply this function to cluster numbers from 2 to 10 and use rbind to build a table.
t3 = rbind(foreachcluster3(2), foreachcluster3(3), foreachcluster3(4), foreachcluster3(5),
    foreachcluster3(6), foreachcluster3(7), foreachcluster3(8), foreachcluster3(9),
    foreachcluster3(10))
rownames(t3) = 2:10
t3
## max.dia min.sep avg.wi avg.bw silwidth dunn
## 2 3899 75.14 1062.9 1811 0.4041 0.019271
## 3 3899 52.03 963.4 1666 0.2884 0.013344
## 4 3802 52.03 797.9 1698 0.3184 0.013685
## 5 3366 59.92 653.1 1689 0.3599 0.017800
## 6 3366 30.82 605.6 1625 0.3055 0.009157
## 7 3265 39.76 579.1 1602 0.2979 0.012177
## 8 2959 39.76 552.8 1604 0.3199 0.013437
## 9 2959 38.39 521.4 1583 0.3196 0.012974
## 10 2959 39.76 480.1 1562 0.3330 0.013437
This result also suggests K = 2 for the cluster analysis.

6.3 Cluster Analysis
Having settled on K = 2, we can compare the clusters to see whether they reveal anything interesting. From the previous tests, we found that hclust does not perform well on this data: one cluster holds only a few elements while the other holds over 99% of them. We will therefore use the pam and kmeans functions.

6.3.1 Pam
We apply the pam function to the matrix with k = 2.
pamC = pam(x = m, 2)
pamC$clusinfo
## size max_diss av_diss diameter separation
## [1,] 561 2814 741.6 3505 75.14
## [2,] 439 3004 717.2 3899 75.14
pamcluster = data.frame(pamC$clustering)
We paste the cluster result back onto our original dataset.
total = cbind(cd, pamcluster)
After that, we can obtain the two subsets according to their cluster numbers.
d1 = subset(total, pamC.clustering == 1)
d2 = subset(total, pamC.clustering == 2)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
## 1st Qu.:10.0 1st Qu.:2.00 1st Qu.:1631 1st Qu.:1525
## Median :18.0 Median :3.00 Median :1809 Median :1715
  • 21. INFO7374 Data Science Final Project ## Mean :16.9 Mean :3.59 Mean :1802 Mean :1699 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2004 3rd Qu.:1855 ## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 2 Min. : 640 Min. : 33 Min. : 35 ## 1st Qu.:1745 1st Qu.:1718 1st Qu.: 84 1st Qu.: 80 ## Median :1941 Median :1914 Median :123 Median :115 ## Mean :1826 Mean :1904 Mean :137 Mean :130 ## 3rd Qu.:2136 3rd Qu.:2105 3rd Qu.:168 3rd Qu.:160 ## Max. :2357 Max. :2359 Max. :441 Max. :407 ## AirTime ArrDelay DepDelay Distance ## Min. : 14 Min. : 15.0 Min. :-10.0 Min. : 56 ## 1st Qu.: 57 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334 ## Median : 88 Median : 47.0 Median : 45.0 Median : 590 ## Mean :108 Mean : 69.3 Mean : 62.4 Mean : 720 ## 3rd Qu.:137 3rd Qu.: 97.0 3rd Qu.: 89.0 3rd Qu.: 948 ## Max. :382 Max. :395.0 Max. :377.0 Max. :2640 ## pamC.clustering ## Min. :1 ## 1st Qu.:1 ## Median :1 ## Mean :1 ## 3rd Qu.:1 ## Max. :1 summary(d2) ## DayofMonth DayOfWeek DepTime CRSDepTime ## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45 ## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 835 1st Qu.: 810 ## Median :18.0 Median :3.00 Median :1021 Median : 955 ## Mean :17.2 Mean :3.54 Mean :1041 Mean :1001 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1205 3rd Qu.:1130 ## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 10 Min. : 1 Min. : 35 Min. : 34.0 ## 1st Qu.:1022 1st Qu.: 942 1st Qu.: 92 1st Qu.: 83.5 ## Median :1217 Median :1130 Median :129 Median :116.0 ## Mean :1178 Mean :1111 Mean :146 Mean :134.6 ## 3rd Qu.:1408 3rd Qu.:1322 3rd Qu.:176 3rd Qu.:165.0 ## Max. :1810 Max. :2345 Max. :432 Max. :405.0 ## AirTime ArrDelay DepDelay Distance ## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74 ## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 5.0 1st Qu.: 344 ## Median : 92 Median : 33.0 Median : 27.0 Median : 594 ## Mean :113 Mean : 55.9 Mean : 44.2 Mean : 750 21
## 3rd Qu.:140 3rd Qu.: 65.5 3rd Qu.: 56.0 3rd Qu.: 966
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## pamC.clustering
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
We can see from the summaries that the two clusters are similar in every column except Departure Time and our target variable, Arrival Delay. When the departure time is around midnight or in the morning, a flight is more likely to have a relatively low delay, which matches the conclusion we drew from the association rules.
totaldf = data.frame(total)
totaldf$pamC.clustering = as.factor(totaldf$pamC.clustering)
qplot(data = totaldf, x = totaldf$pamC.clustering, y = totaldf$DepTime,
    colour = totaldf$pamC.clustering, geom = "boxplot")
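pam is a k-medoids method: like k-means, but every cluster centre must be one of the observations. A toy stdlib-Python sketch of that idea (deterministic start and no swap phase, so only a rough cousin of R's pam()):

```python
def pam_like(points, k, iters=10):
    """Toy k-medoids: assign each point to the nearest medoid, then move each
    medoid to the cluster member minimising total within-cluster distance.
    A sketch of the idea only -- R's pam() also runs a build and swap phase."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    medoids = list(points[:k])              # deterministic start for the sketch
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        new = [min(c, key=lambda m: sum(dist(m, q) for q in c)) if c else medoids[i]
               for i, c in clusters.items()]
        if new == medoids:                  # converged
            break
        medoids = new
    labels = [min(range(k), key=lambda i: dist(p, medoids[i])) for p in points]
    return medoids, labels

# Two obvious groups; pam_like should put the first three points together.
pts = [(0, 0), (1, 0), (0, 1), (9, 9), (9, 10), (10, 9)]
```

Because medoids are real observations, the returned centres are interpretable rows of the data, which is one reason to prefer pam over kmeans on noisy data.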
[Boxplot of DepTime by PAM cluster: cluster 1 departures concentrate in the afternoon and evening, cluster 2 in the morning.]
From the plot, we can see that pam has done a good job of separating the clusters by departure time. Next, we try kmeans and compare the results.

6.3.2 Kmeans
We apply the same steps with kmeans to see whether it works better than pam.
kmeans.results = kmeans(m, 2)
clusterdf = data.frame(kmeans.results$cluster)
total = cbind(cd, clusterdf)
d1 = subset(total, kmeans.results.cluster == 1)
summary(d1)
## DayofMonth DayOfWeek DepTime CRSDepTime
## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045
  • 24. INFO7374 Data Science Final Project ## 1st Qu.:10.5 1st Qu.:2.00 1st Qu.:1627 1st Qu.:1520 ## Median :18.0 Median :3.00 Median :1804 Median :1710 ## Mean :17.0 Mean :3.58 Mean :1797 Mean :1693 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2002 3rd Qu.:1855 ## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 2 Min. : 640 Min. : 33 Min. : 35 ## 1st Qu.:1739 1st Qu.:1714 1st Qu.: 84 1st Qu.: 80 ## Median :1939 Median :1910 Median :123 Median :115 ## Mean :1817 Mean :1899 Mean :137 Mean :130 ## 3rd Qu.:2134 3rd Qu.:2104 3rd Qu.:168 3rd Qu.:160 ## Max. :2357 Max. :2359 Max. :441 Max. :407 ## AirTime ArrDelay DepDelay Distance ## Min. : 14.0 Min. : 15.0 Min. :-10.0 Min. : 56 ## 1st Qu.: 57.5 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334 ## Median : 88.0 Median : 47.0 Median : 45.0 Median : 588 ## Mean :107.6 Mean : 69.8 Mean : 62.8 Mean : 720 ## 3rd Qu.:136.5 3rd Qu.: 97.0 3rd Qu.: 88.5 3rd Qu.: 947 ## Max. :382.0 Max. :425.0 Max. :392.0 Max. :2640 ## kmeans.results.cluster ## Min. :1 ## 1st Qu.:1 ## Median :1 ## Mean :1 ## 3rd Qu.:1 ## Max. :1 d2 = subset(total, kmeans.results.cluster == 2) summary(d2) ## DayofMonth DayOfWeek DepTime CRSDepTime ## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45 ## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 834 1st Qu.: 805 ## Median :18.0 Median :3.00 Median :1017 Median : 950 ## Mean :17.1 Mean :3.54 Mean :1030 Mean : 992 ## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:1202 3rd Qu.:1125 ## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359 ## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime ## Min. : 24 Min. : 1 Min. : 35 Min. : 34 ## 1st Qu.:1021 1st Qu.: 940 1st Qu.: 92 1st Qu.: 83 ## Median :1215 Median :1123 Median :129 Median :117 ## Mean :1175 Mean :1099 Mean :146 Mean :135 ## 3rd Qu.:1402 3rd Qu.:1318 3rd Qu.:177 3rd Qu.:166 ## Max. :1810 Max. :2305 Max. :432 Max. :405 ## AirTime ArrDelay DepDelay Distance ## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74 24
## 1st Qu.: 61 1st Qu.: 21.0 1st Qu.: 4.0 1st Qu.: 344
## Median : 92 Median : 33.0 Median : 26.0 Median : 595
## Mean :113 Mean : 54.9 Mean : 43.2 Mean : 750
## 3rd Qu.:140 3rd Qu.: 64.0 3rd Qu.: 55.0 3rd Qu.: 967
## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777
## kmeans.results.cluster
## Min. :2
## 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
The two methods generate very similar results. Both also show a strong relationship between Departure Time and Arrival Delay, which matches our findings from the association rules and decision trees.

7 Decision Tree
In this section, we use decision trees to analyze the factors that affect the target variables. First, we load the required libraries.
library(rpart)
library(rpart.plot)
library(rattle)
library(maptree)
library(party)
library(partykit)

7.1 Categorize Variable
We categorize our variables into different groups.
Distance: We divided the distances into three parts: up to 750, 750 to 1000, and greater than 1000.
d$Distance = ordered(cut(d$Distance, c(0, 750, 1000, Inf)), labels = c("upto750", "750to1000", ">1000"))
DayOfWeek: Replace the weekday numbers with labels (1 = MON, 2 = TUE, etc.) with the help of gsub.
d$DayOfWeek = gsub("1", "MON", d$DayOfWeek)
d$DayOfWeek = gsub("2", "TUE", d$DayOfWeek)
d$DayOfWeek = gsub("3", "WED", d$DayOfWeek)
d$DayOfWeek = gsub("4", "THU", d$DayOfWeek)
d$DayOfWeek = gsub("5", "FRI", d$DayOfWeek)
d$DayOfWeek = gsub("6", "SAT", d$DayOfWeek)
d$DayOfWeek = gsub("7", "SUN", d$DayOfWeek)
Origin: Origin airports are grouped into five regions (SW, SE, NE, MW, W) with the help of the gsub function.
d$Origin = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS "SW", d$Origin)
d$Origin = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT "SE", d$Origin)
d$Origin = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT "NE", d$Origin)
d$Origin = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID "MW", d$Origin)
d$Origin = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT "W", d$Origin)
Dest: Destination airports are grouped into the same five regions (SW, SE, NE, MW, W) with the help of the gsub function.
d$Dest = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS|G "SW", d$Dest)
d$Dest = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT|D "SE", d$Dest)
d$Dest = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT|C "NE", d$Dest)
d$Dest = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID|C "MW", d$Dest)
d$Dest = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT|B "W", d$Dest)
DayofMonth: We divided the December days of the month into regular days and Christmas week.
d$DayofMonth = ordered(cut(d$DayofMonth, c(0, 23, 32)), labels = c("R.Days", "CH.Days"))
DepDelay: We divided the departure delay into two parts, low and high delay.
d$DepDelay = ordered(cut(d$DepDelay, c(-Inf, 60, Inf)), labels = c("low", "high"))

7.2 Rpart
Rpart performs recursive partitioning for classification, regression and survival trees. We are going to classify the two target variables, ArrDelay and DepDelay, using rpart.
Departure Delay: DepDelay is the response variable; DayofMonth, DayOfWeek, DepTime and Distance are the predictor variables.
ss.formula = DepDelay ~ DayofMonth + DayOfWeek + DepTime + Distance # formula for tree
ss.rpart = rpart(data = d, formula = ss.formula)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Tree diagram: the root (168647 obs) splits on DepTime at 1406.5; the left branch splits again at 447.5 into "low" (70766 obs) and "high" (1624 obs), and the right branch splits at 2229.5 into "low" (91550 obs) and "high" (4707 obs). Total classified correct = 27.5%.]
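The fitted tree reduces to two cut-offs on the scheduled departure time. Encoding the splits shown in the tree above by hand gives a tiny rule-based classifier (a stdlib-Python sketch for illustration, not something rpart produces itself):

```python
def dep_delay_class(dep_time):
    """Apply the rpart splits from the tree above (times in hhmm form):
    departures between about 04:48 and 22:30 are predicted 'low' delay,
    very early or very late ones 'high'."""
    if dep_time < 1406.5:
        return "high" if dep_time < 447.5 else "low"
    return "high" if dep_time >= 2229.5 else "low"
```

For example, a 23:00 departure falls into the small late-night node and is predicted "high", while a 09:00 departure is predicted "low".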
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 168647 50820 low (0.6987 0.3013)
## 2) DepTime< 1406 72390 14340 low (0.8019 0.1981)
## 4) DepTime>=447.5 70766 13030 low (0.8158 0.1842) *
## 5) DepTime< 447.5 1624 314 high (0.1933 0.8067) *
## 3) DepTime>=1406 96257 36480 low (0.6210 0.3790)
## 6) DepTime< 2230 91550 33200 low (0.6374 0.3626) *
## 7) DepTime>=2230 4707 1427 high (0.3032 0.6968) *
From this tree we conclude that delays are more common at night, between about 10:30 PM and 5:00 AM, than during the day. The trees depend mainly on arrival and departure times, so we removed AirTime from the next decision tree.
Arrival Delay: ArrDelay is the response variable; DayofMonth, DayOfWeek, Origin and Distance are the predictor variables.
ss.formula = ArrDelay ~ DayofMonth + DayOfWeek + Distance + Origin # formula for tree
R.control = rpart.control(cp = 0.001) # to control tree
ss.rpart = rpart(data = d, formula = ss.formula, control = R.control)
draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw
[Tree diagram for ArrDelay (mean delay per node): the root (168647 obs, mean 62.55) splits on Origin into SE/SW/W (mean 59.06) and MW/NE (mean 69.20); further splits use DayOfWeek and DayofMonth, with the highest mean delay (77.0, 24241 obs) for FRI/SAT/TUE flights from MW/NE origins. Total deviance explained = 1.5%.]
print(ss.rpart) # for printing tree rules
## n= 168647
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 168647 675500000 62.55
## 2) Origin=SE,SW,W 110460 394400000 59.06
## 4) DayOfWeek=MON,SUN,THU,TUE,WED 81657 273300000 57.44
## 8) Origin=SE,SW 50291 142800000 54.80 *
## 9) Origin=W 31366 129600000 61.67 *
## 5) DayOfWeek=FRI,SAT 28803 120300000 63.64
## 10) DayofMonth=R.Days 17922 68280000 58.88 *
## 11) DayofMonth=CH.Days 10881 50950000 71.48 *
## 3) Origin=MW,NE 58187 277100000 69.20
## 6) DayOfWeek=MON,SUN,THU,WED 33946 124100000 63.62
## 12) DayOfWeek=THU 5849 17040000 52.35 *
## 13) DayOfWeek=MON,SUN,WED 28097 106100000 65.97 *
## 7) DayOfWeek=FRI,SAT,TUE 24241 150500000 77.01 *
From this decision tree we can see how the dataset is partitioned: origins into the SE/SW/W and MW/NE regions, and days of the week into MON/SUN/THU/TUE/WED versus FRI/SAT.

7.3 Ctree
Ctree builds conditional inference trees, which embed tree-structured regression models into a well-defined theory of conditional inference procedures.
Departure Delay:
ss.formula1 = DepDelay ~ Distance + DepTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) # height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Ctree plot for DepDelay: the root splits on DepTime at 1406 (p < 0.001), with further splits at 447 and 2229; the bar charts show a much higher share of "high" delay in the very early node (n = 1624) and the very late node (n = 4707).]
As we explained for rpart, ctree gives the same result: delays are more common at night, between about 10:30 PM and 5:00 AM, than during the day.
Arrival Delay:
ss.formula1 = ArrDelay ~ Distance + ArrTime # formula for Ctree
ss.control = ctree_control(maxdepth = 2) # height is 2
ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation
## Loading required package: Formula
## Warning: there is no package called ’Formula’
plot(ss.ctree) # plotting of tree
[Ctree plot for ArrDelay: the root splits on ArrTime at 518 (p < 0.001), with further splits at 134 and 1438; the boxplots show larger arrival delays in the early-morning nodes.]
From this tree we can conclude that delays for arrivals around midnight and before about 5:18 AM are higher than during the day.

8 Random Forest
Now we will use random forest analysis to learn more about predictions. The following libraries will be used.
library(randomForest) # for randomForest
library(rpart)
library(caret) # for confusionMatrix
Because the original data is too large, we again randomly select 1000 rows.
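A random forest is, at its core, bagging: each tree is grown on a bootstrap resample of the training data and the trees vote. The stdlib-Python sketch below illustrates that idea with single-split "stumps" standing in for full trees (toy data, nothing from the airline frame):

```python
import random

def train_stump(xs, ys):
    """Best single-threshold classifier: pick the threshold and orientation
    with the highest training accuracy."""
    best = None
    for thr in sorted(set(xs)):
        for left, right in (("low", "high"), ("high", "low")):
            acc = sum((left if x <= thr else right) == y
                      for x, y in zip(xs, ys)) / len(ys)
            if best is None or acc > best[0]:
                best = (acc, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagged_predict(xs, ys, x_new, n_trees=25, seed=0):
    """Bootstrap-resample the training data, fit one stump per resample and
    return the majority vote -- the bagging idea underlying random forests."""
    rng = random.Random(seed)
    n = len(xs)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        stump = train_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes.append(stump(x_new))
    return max(set(votes), key=votes.count)

# Toy DepTime (hhmm) data: late-night departures tend to be delayed.
dep = [600, 800, 900, 1100, 2240, 2300, 2350]
delay = ["low", "low", "low", "low", "high", "high", "high"]
```

randomForest() additionally samples a random subset of predictors at each split, which decorrelates the trees further; the voting step is the same.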
rfd = rd[sample(nrow(rd), 1000), ]
We separate our dataset into a train set and a test set.
ndxTrain = sample(x = nrow(rfd), size = 0.7 * nrow(rfd))
rfd.train = rfd[ndxTrain, ]
rfd.test = rfd[-ndxTrain, ]
We set all the other variables as predictors and see how they affect our target variable.
rfd.predictors = c("DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime", "ArrTime",
    "CRSArrTime", "AirTime", "ActualElapsedTime", "Distance")
rfd.rf = randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
print(rfd.rf)
##
## Call:
## randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 24.14%
## Confusion matrix:
## Low High class.error
## Low 272 80 0.2273
## High 89 259 0.2557
plot(rfd.rf)
[Plot of rfd.rf: error rate against the number of trees, flattening out well before 500 trees.]
From the diagram, we find that the error rate becomes stable as the number of trees grows, so we keep the default number of trees, which is 500.
rfd.train.pred = predict(object = rfd.rf, newdata = rfd.train, type = "class")
rfd.test.pred = predict(object = rfd.rf, newdata = rfd.test, type = "class")
confusionMatrix(data = rfd.train.pred, reference = rfd.train$ArrDelay)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 352 0
## High 0 348
##
## Accuracy : 1
## 95% CI : (0.995, 1)
  • 35. INFO7374 Data Science Final Project ## No Information Rate : 0.503 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 1 ## Mcnemar's Test P-Value : NA ## ## Sensitivity : 1.000 ## Specificity : 1.000 ## Pos Pred Value : 1.000 ## Neg Pred Value : 1.000 ## Prevalence : 0.503 ## Detection Rate : 0.503 ## Detection Prevalence : 0.503 ## Balanced Accuracy : 1.000 ## ## 'Positive' Class : Low ## confusionMatrix(data = rfd.test.pred, reference = rfd.test$ArrDelay) ## Confusion Matrix and Statistics ## ## Reference ## Prediction Low High ## Low 133 32 ## High 29 106 ## ## Accuracy : 0.797 ## 95% CI : (0.747, 0.841) ## No Information Rate : 0.54 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.59 ## Mcnemar's Test P-Value : 0.798 ## ## Sensitivity : 0.821 ## Specificity : 0.768 ## Pos Pred Value : 0.806 ## Neg Pred Value : 0.785 ## Prevalence : 0.540 ## Detection Rate : 0.443 ## Detection Prevalence : 0.550 ## Balanced Accuracy : 0.795 ## ## 'Positive' Class : Low ## 35
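Every statistic in the caret summaries above follows from the four cells of the confusion matrix. A stdlib-Python sketch, fed with the test-set counts printed above, reproduces them:

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and Cohen's kappa from a 2x2
    confusion matrix (positive class in the first row/column)."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)        # recall of the positive class
    specificity = tn / (fp + tn)
    # Expected agreement by chance, from the row/column marginals.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    pe = p_yes + p_no
    kappa = (accuracy - pe) / (1 - pe)
    return accuracy, sensitivity, specificity, kappa

# Test-set matrix from the randomForest output above:
#                 reference Low  High
# predicted Low             133    32
# predicted High             29   106
acc, sens, spec, kappa = binary_metrics(tp=133, fn=29, fp=32, tn=106)
```

These four counts give back exactly the 0.797 accuracy, 0.821 sensitivity, 0.768 specificity and 0.59 kappa that caret reports.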
Although the sample is randomly chosen, we always get an accuracy rate of over 70 percent on the test set, which is higher than a single decision tree.

9 Classification
Classification techniques predict group membership for data instances. For classification we are using the knn and svm algorithms.

9.1 knn
K-Nearest Neighbors (knn) is a supervised machine learning algorithm for object classification.
library(class) #for knn
library(RWeka) #for IBk function
## Error: package or namespace load failed for ’RWeka’

9.2 Processing Data
We remove the columns that are not useful.
kd = kd[, -20:-29]
kd = kd[, -1:-2]
kd = kd[, -3:-12]
kd = kd[, -4]
kd = kd[, -5]
We keep ArrDelay as our response variable, so we categorize it into two parts, low and high delay.
kd$ArrDelay = ordered(cut(kd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine is not able to handle the full dataset, so we use 1000 random records.
kdd = kd[sample(nrow(kd), 1000), ] # sample dataset
The IBk function implements the K-NN technique to predict the Arrival Delay variable from the remaining four variables of the kdd dataframe, so we use this function and store the result in classifier.
classifier = IBk(ArrDelay ~ DayOfWeek + DayofMonth + Distance + Origin,
    data = kdd, control = Weka_control(K = 4)) # k=4 because 4 other variable
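As the errors around this call show, RWeka did not load in this session, but the k-nearest-neighbour idea itself is easy to sketch in stdlib Python (toy points, not the kdd frame):

```python
def knn_predict(train, query, k=3):
    """Classify query by majority label among its k nearest training points
    (Euclidean distance). train is a list of (features, label) pairs."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

# Toy (DayOfWeek, Distance/100) points with a hypothetical delay class.
train_pts = [((6, 3), "High"), ((7, 2), "High"), ((6, 4), "High"),
             ((2, 9), "Low"), ((3, 10), "Low"), ((2, 11), "Low")]
```

A new point is simply given the label most common among its k closest neighbours, which is all IBk does (plus optional distance weighting).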
## Error: could not find function "IBk"
summary(classifier) # detail explanation with confusion matrix
## Error: error in evaluating the argument ’object’ in selecting a method for function ’summary’: Error: object ’classifier’ not found
With the k-nearest-neighbour technique we found that around 70% of the data was correctly classified and only 30% incorrectly. In the confusion matrix we can see that the high-delay part in particular is not classified properly.

9.3 SVM
For classification we also use another method, SVM. A Support Vector Machine can analyze data and recognize patterns, and is used for classification and regression analysis.

10 Processing Data
We keep ArrDelay as our response variable, so we categorize it into two parts, low and high.
sd$ArrDelay = ordered(cut(sd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))
Our machine is not able to handle the full dataset, so we took part of it.
sdd = sd[sample(nrow(sd), 1000), ] # sample dataset
We divided our dataset into two parts, the train1 and test1 datasets.
sd1 = nrow(sdd)
nxd.train = sample(1:sd1, 0.7 * sd1)
sd.train1 = sdd[nxd.train, ]
sd.test1 = sdd[-nxd.train, ]
For SVM we use these two libraries.
library(e1071)
library(caret)
The predicted variable is ArrDelay, based on the two variables DayOfWeek and Distance.
sd.formula = ArrDelay ~ DayOfWeek + Distance
plot.formula = DayOfWeek ~ Distance #For plot X and Y axis
sd.model = svm(formula = sd.formula, data = sd.train1) # for actual model creation.
summary(sd.model) # Detail description of a model.
##
## Call:
## svm(formula = sd.formula, data = sd.train1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 639
##
## ( 322 317 )
##
##
## Number of Classes: 2
##
## Levels:
## Low High
sd.predict = predict(sd.model, sd.test1) # prediction on testing data set
# confusionMatrix(data = sd.predict, reference = sdd$ArrDelay)
plot(x = sd.model, data = sd.train1, formula = plot.formula) #default: cost=1, gamma=0.5
[SVM classification plot (cost = 1, gamma = 0.5): Distance against DayOfWeek, with support vectors marked and the predicted Low/High regions shaded.]
For a clearer result we change the cost and gamma parameters.
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
    kernel = "radial", cost = 1, gamma = 5)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[SVM classification plot with gamma = 5: Distance against DayOfWeek.]
sd.model = svm(formula = sd.formula, data = sd.train1, method = "C-classification",
    kernel = "radial", cost = 1, gamma = 0.1)
plot(x = sd.model, data = sd.train1, formula = plot.formula)
[SVM classification plot with gamma = 0.1: Distance against DayOfWeek.]
From this graph we can see that days 6 and 7, Saturday and Sunday, have more delays than the rest of the week, and that beyond a certain distance the arrival delays get lower.

11 Conclusion
Overall, we found some useful results in our analysis. Within the logistic regression, we found several variables that were statistically significant, such as Distance, Day Of Week, Origin, Destination, Departure Time, Arrival Time and Day Of Month. We converted them from numeric variables to categorical variables.
We found some of the reasons behind the arrival delays of U.S. air flights. To find relationships between the different variables, we applied association rules and got good results. We found that Monday flights are more likely to be on time. We also found that Chicago flight delays were high in December 2008; checking the weather records showed that the weather in Chicago was very bad that month, and many flights were affected for that reason.
The fastest and easiest way to make decisions about our dataset is to apply the decision tree mechanism, where we organized our data hierarchically, using the Rpart and Ctree algorithms. From the diagrams, we conclude that delays are relatively higher at night, between 10:30 PM and 5:00 AM, and that delays on weekends (Saturday and Sunday) tend to be higher than on weekdays, which makes sense.
We also used clustering analysis. After applying the Kmeans, Hclust and Pam clustering methods to our dataset, we found that K = 2 is the best number of clusters, and we validated that with the clValid function and other measurements. After separating the dataset into two subsets, we also found relationships that confirm what we had found with the association rules.
In classification we used the knn and svm techniques. With the k-nearest neighbor technique we found that approximately 70 percent of the data was correctly classified, which the confusion matrix confirms. We did some pattern recognition with the help of a Support Vector Machine (SVM), where we examined how delay relates to distance and day of week.
According to our analysis, we suggest that it is better to travel during the daytime on weekdays, so that you can arrive at your destination on time.

12 Limitation
There are still a few limitations in our analysis of this dataset. First, we are limited by our computers' processing capability. The original dataset is huge, containing 7,009,728 observations, so we selected a part of it (all U.S. airline data for December 2008) to reduce the file size loaded into R.
In addition, when we ran the PAM algorithm in the cluster analysis and RandomForest, R often got stuck or even crashed, so we had to apply those functions, such as computing the distance matrix, to a random sample. However, we have not verified how the random sample affects our results.
Another limitation is that so far we only focus on delay. There might be other interesting relationships among the other variables; we may work on them in the future.

13 Future Work
While we have already obtained some analysis outcomes, there is still work we can do in the future.
First, due to the limitations of our computers, we are not able to process large-scale data, so we cannot apply some of the functions to the full dataset. Our analysis uses only a random sample, so the results are not identical every time.
What's more, we may find more relationships, because our target is analyzing whether air flights perform on time or are delayed. Something valuable is still waiting for us; for instance, we may find the busiest carrier in the air.
Last but not least, from previous work we found that DBSCAN does not work well unless the dataset is very large, so we did not apply it to our dataset. We would like to see how DBSCAN performs and compare its results to the other clustering methods.