Spring Semester 2017
Group Project report
Instructor: Iva Stricevic
OPIM 5604- Predictive Modeling
Section B12-1173
Team 2
NTSB Aviation Accident Data Analysis
April 25, 2017
SUMMARY
Team Members:
Surya Adavi
Ashish Doke
William Pratt
Mark Strout
Jinwei Wang
The objective of this case analysis is to assess NTSB aviation accident data and develop a
classification model that predicts the injury severity of an aircraft crash, following the
SEMMA approach.
Objectives:
Following the guidelines of the SEMMA approach:
● Sampling of the data (S): Data sampling
● Exploration of the data (E): Data visualization and pattern discovery
● Modification of the data (M): Data preprocessing
● Modeling the data (M): Predictive/classification modeling
● Assessing the data (A): Model implementation
● Plan for future upgrades
I. Sampling (S):
The dataset we chose contains information about aviation accidents worldwide and was obtained
from the Kaggle website. The National Transportation Safety Board (NTSB) has collected this
data since 1962; the data set we used contains accident and incident records from 1982 to
January 2017. Since the NTSB is a government agency, we assume that the data is genuine and
that the collection methods are standardized. The data set has 79,293 observations and 32
variables. Eight variables are numeric and cover the event date, latitude and longitude,
registration number and the injury counts. The other 24 variables are categorical: Amateur
Built is the one ordinal variable and the rest are nominal. The nominal variables describe the
weather conditions, the location of the airport and crash site, the country, the type, make
and model of the aircraft, and the purpose of the flight.
Based on the aviation accidents data, we chose to explore which factors play an important role
contributing to these accidents and determine new patterns and possible causal relationships if
any. Some of the variables we intend to study are:
1. Weather conditions
2. Flight conditions influencing the accident
3. Aircraft Category & Engine Type
4. Make, Model & Built of Aircraft
5. Event dates and Location details
6. Broad phase of flight
7. Aircraft damage and severity of injuries
After studying these variables and the patterns among them, we will provide actionable
recommendations that help prevent aviation accidents and make air travel safer.
General thoughts:
Most accidents are in General Aviation (GA). Two factors should be kept in mind:
● GA accounts for a significantly higher share of flight activity (81% at any given time)
than commercial, military, agricultural and medevac flying combined (19%). In addition, GA
pilots generally require less training, which leads to less experience and less exposure to
potentially hazardous situations in a controlled environment.
● The regularity and accuracy of the reporting process may skew the numbers. 93% of the
accidents are reported in the United States, compared to 7% in all other countries, so the
location numbers may be skewed.
Both factors are taken into account in the pre-processing of the data.
II. Data visualization and pattern discovery (E)
We produced several visualizations to detect patterns in the data and to gain insights for our
analysis.
i) Injury severity vs total fatal injuries:
Injury severity is derived from the total fatal injuries column: the investigation report
uses the fatal injuries count to categorize the injury severity. We therefore cannot use total
fatal injuries as a predictor. Instead, we use it to check for patterns with other variables
and derive insights.
ii) Total fatal injuries vs year:
From the graph, we can see that total fatal injuries have been decreasing over time. Advances
in aviation technology and aircraft manufacturing have made aircraft more reliable and air
travel much safer. In addition, training requirements have increased over the years, and
better electronic and virtual simulation training devices have improved the access to and
quality of pilot training as well as situational awareness.
iii) Fatal injuries vs season:
Summer has the highest number of fatal injuries and winter the lowest, while the numbers for
autumn and spring are close. This mirrors the higher-than-average number of GA accidents
relative to commercial accidents reported annually: the GA community is much more active
during the nicer summer months in the Northern Hemisphere.
iv) Total fatal injuries vs day of week 2:
There were more accidents on weekends than on weekdays. Among the weekdays, Friday has the
highest number of fatal injuries. This is, again, consistent with the higher-than-average
number of GA accidents relative to commercial accidents: the GA community is much more
active on weekends and during time away from their normal occupations, including Fridays.
v) Aircraft damage vs amateur built:
This graphic shows two factors at work. First, the ease of access to home-built aircraft
increases the likelihood that a home-built aircraft is involved in an accident. These
aircraft are much cheaper, and there is very little regulation of their type and construction
standards. Essentially the only limits are the safety precautions for interacting with
commercial aircraft, such as radio communication and navigation requirements.
The second factor is the survivability of home-built aircraft. They are usually constructed
from less rugged and more poorly designed components, and the built-in safety of commercially
produced aircraft is not required. As a result, accidents in home-built aircraft are both
more likely to occur and less survivable.
vi) Region in the United States vs total fatal injuries:
The weather and the accessibility of GA flying are reflected in the higher numbers of fatal
accidents here. Where the weather is generally much better, flying is much easier, and pilot
training in these areas is therefore very limited. California combines the best weather with
the most dangerous terrain from a survivability standpoint: its clear, sunny days increase GA
traffic, but its rugged mountains and dense population make surviving a crash far less
likely. Thus, California has the highest number of accidents with fatal injuries.
vii) Weather vs injury severity:
The mosaic plot above shows the relationship between Injury.Severity and weather condition.
Blue means non-fatal and red means fatal. Under VMC conditions, the share of non-fatal
accidents is higher and the share of fatal accidents is lower than under IMC conditions. The
plot also shows that far more flights take place in nice VMC weather, which raises the total
number of accidents and therefore the number of fatal accidents. The interesting fact is that
accidents in IMC conditions tend to be more serious and less survivable. This is due to the
extreme flight conditions encountered and to the fact that these accidents tend to be very
sudden or extreme in nature, with pilots losing control of their aircraft for one reason or
another.
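The distinction between the number of fatal accidents and the share of fatal accidents can be
checked with a small tabulation. The sketch below is only an illustration in Python (the
analysis itself was done in JMP); the function name, the column names and the sample rows are
assumptions made up for this example and are not figures from the data set.

```python
from collections import defaultdict


def fatal_share_by_group(rows, group_key, target_key="Injury.Severity"):
    """Return {group: (number of accidents, share of them that were fatal)}."""
    totals = defaultdict(int)
    fatals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[target_key] == "fatal":
            fatals[row[group_key]] += 1
    return {group: (totals[group], fatals[group] / totals[group]) for group in totals}


# Made-up rows for illustration only; these are not figures from the report.
rows = (
    [{"Weather.Condition": "VMC", "Injury.Severity": "fatal"}] * 2
    + [{"Weather.Condition": "VMC", "Injury.Severity": "nonfatal"}] * 8
    + [{"Weather.Condition": "IMC", "Injury.Severity": "fatal"}] * 1
    + [{"Weather.Condition": "IMC", "Injury.Severity": "nonfatal"}] * 1
)
summary = fatal_share_by_group(rows, "Weather.Condition")
```

In this toy sample VMC has more fatal accidents in absolute terms (2 versus 1), while IMC has
the higher fatal share (0.5 versus 0.2), which is the pattern the mosaic plot is read for.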
viii) Injury severity vs aircraft damage:
The chart above shows the relationship between injury severity and aircraft damage. The
damage level is strongly related to injury severity: the number of fatal accidents is highest
when the aircraft was destroyed and low when the damage was only minor or substantial.
ix) Number of engines vs injury severity:
Again, there were more accidents in total where GA is most likely to be involved. GA is
centered on inexpensive and simple aircraft, and most planes in this category have only one
engine. Some GA aircraft have two engines, but pilot training for them is generally more
intensive and a spare engine is available if one quits, so there are fewer accidents.
x) Purpose of the flight vs injury severity:
Personal use as a purpose of flight is almost a mirror image of the FAR type depicting GA.
This variable largely duplicates the FAR type; if included, it would skew the data and
provide no real value.
xi) Injury severity vs FAR description:
The original data set has too many similar categories. Examples are the “Weight Shift” and
“Glider” categories, which are truly GA and not commercial use. In these cases, we recoded
the data so that it could be usefully included in the predictive analysis.
xii) Injury severity vs broad phase of flight:
The mosaic plot above shows the relationship between injury severity and broad phase of
flight. From the graph, we can see that there are fewer fatal accidents when the airplane is
landing, while cruising aircraft have the highest fatal rate. This variable reflects the
location and flight profile more than a cause of the accident: it helps predict severity but
not the cause. In aviation accident analysis this is labeled an aggravating factor,
contributing but not causal.
III. Data preprocessing (M):
Our dataset contains a large number of categorical variables. Many of them are disorganized
and many contain missing values, which makes them unsuitable for data visualization and model
building as they stand. We first recoded these categorical variables to make them concise and
reliable. The columns that were changed are listed below. We expect these transformations to
improve model performance and make the visualizations easier to interpret.
1) Injury.Severity: Recoded all fatal values, which carry numeric values, to “fatal” and all
non-fatal values to “nonfatal”.
2) Event.Date: The day, month and year information was extracted from this variable to create
the following variables.
(a) Day of Week: New column with the values “Monday, Tuesday, Wednesday, Thursday, Friday,
Saturday, Sunday” representing the day of the week of each event date.
(b) Month: New column with the values “January, February, March, April, May, June, July,
August, September, October, November, December” representing the month of each event date.
(c) Season: New column derived from the event date with the values “Spring, Summer, Fall,
Winter.”
(d) Day of Week 2 (Weekend?): The day of the week was recoded into a new variable indicating
whether the event fell on a weekend, with the values “Weekend” and “Non-Weekend.”
3) Weather.Condition: Coded missing values as “UNK”, which means unknown. VMC (visual
meteorological conditions) are meteorological conditions expressed in terms of visibility,
distance from cloud, and ceiling equal to or better than specified minima; this is generally
associated with a pilot's ability to see clearly. IMC (instrument meteorological conditions)
are meteorological conditions expressed in terms of visibility, distance from cloud, and
ceiling less than the minima specified for visual meteorological conditions; this generally
means that the flight, or some portion of it, is actually in the clouds with reference only
to the aircraft instruments.
4) Broad.Phase.of.Flight: Coded missing values as “unk”, recoded the descent and approach
classes into landing based on their definitions, recoded climb and taxi into takeoff because
they are part of the same phase, and left the other classes as they are.
5) Number.of.Engines: Missing values were imputed with the average number of engines, rounded
to the nearest whole number.
6) Engine.Type: Coded all engine types indicating turbo/turbine as “Turbo” and all types
indicating reciprocating as “Reciprocating”; missing values are coded as unknown.
7) State: Combined the original “Country” and “Region” columns into one column. If the
accident happened within the U.S., we coded the state code; if it happened outside the U.S.,
we coded the country name.
8) Region: If the accident happened within the U.S., we coded the part of the country
(“Midwest, Northeast, West, South”); if it happened outside the U.S., we coded “Other.”
9) Purpose of Flight 2: We are interested in predicting the injury severity for aircraft
carrying passengers, so we recoded the original purpose of flight column into classes for
personal, business and public aircraft purposes, and grouped the remaining purposes, such as
aerial observation, into an “other observations” class.
10) FAR Description 2: For the same reason, we recoded the FAR description column into
classes for aviation purposes, coded missing values as unknown, and grouped other
applications of aircraft into an “Air other application” class.
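The date-derived variables in item 2 and the phase recoding in item 4 can be sketched in a few
lines of Python. This is only an illustration of the recoding logic, not the JMP steps the
team used. The function names, the exact raw phase labels, the month-to-season boundaries and
the choice of Saturday and Sunday as the weekend are all assumptions.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Assumption: meteorological seasons for the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Recoding described in item 4; the raw label spellings are assumptions.
PHASE_RECODE = {
    "DESCENT": "LANDING",
    "APPROACH": "LANDING",
    "CLIMB": "TAKEOFF",
    "TAXI": "TAKEOFF",
}


def date_features(event_date):
    """Derive the four date-based variables from one event date."""
    weekday = event_date.weekday()  # Monday is 0, Sunday is 6
    return {
        "Day of Week": DAY_NAMES[weekday],
        "Month": MONTH_NAMES[event_date.month - 1],
        "Season": SEASON_BY_MONTH[event_date.month],
        # Assumption: only Saturday and Sunday count as the weekend.
        "Day of Week 2": "Weekend" if weekday >= 5 else "Non-Weekend",
    }


def recode_phase(raw):
    """Map a raw phase label to its recoded class; missing values become 'unk'."""
    if raw is None or raw.strip() == "":
        return "unk"
    label = raw.strip().upper()
    return PHASE_RECODE.get(label, label)  # other classes are left as they are
```

For example, `date_features(date(2017, 4, 25))` yields Tuesday, April, Spring and Non-Weekend,
and `recode_phase("Approach")` yields `"LANDING"`.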
The columns we decided to drop, with explanations, are:
1) Investigation Type: Not practically useful because it is only available once an
investigation has been carried out on the aircraft, so we decided not to use it.
2) Accident Number: A unique ID given to each accident. No significance.
3) Location and Country: Not used directly because they contain too many distinct values, but
we extracted information from these columns to create a new variable for our model.
4) Latitude and Longitude: Not used because too many values are missing and we could not
geocode them, as the data set has no zip codes.
5) Airport Name and Airport Code: Unique names and codes given to airports, with no practical
significance. They contain too many distinct classes because the observations come from all
around the world.
6) Aircraft Damage: Statistically significant, but not usable for practical reasons because
it is only available after the investigation report on the aircraft has been made.
7) Aircraft Category: Dropped because it has too many missing values, which would affect the
prediction.
8) Registration Number: A unique number given to each airplane. No significance.
9) Make and Model: These variables have too many missing values that we cannot impute, and we
could not group the non-missing observations because each class was different.
10) Schedule, Air Carrier: Dropped because they have too many missing values.
11) Injuries columns (fatal, serious, minor and uninjured): Statistically significant, but
not usable for practical reasons because they are only available after the investigation
report on the aircraft has been made.
12) Report Status and Publication Date: Related to the investigation and of no practical
significance, as this data is recorded after the accident has already happened.
DATA PREPROCESSING: MISSING VALUE CHECK
The project team also analyzed the data set for missing values. Using the JMP missing value
exploration, we found that the following columns have missing values.
1. Latitude and Longitude: 62,614 of 79,293 rows have missing values because the exact
location of the event was not available. We decided not to use these variables as predictors
and deleted the columns.
2. Number of Engines: 4,374 of 79,293 rows have missing values. We treated these rows with a
missing value imputation method.
3. Total Fatal Injuries: 7,626 of 79,293 rows have missing values. This variable has
statistical significance but no practical significance, since we are predicting the severity
of the crash before the investigation report is published. Hence, we dropped it as a
predictor.
4. Total Serious Injuries: 17,442 of 79,293 rows have missing values. We dropped this
variable as a predictor for the same reason.
5. Total Minor Injuries: 18,089 of 79,293 rows have missing values. We dropped this variable
as a predictor for the same reason.
6. Total Uninjured: 14,218 of 79,293 rows have missing values. We dropped this variable as a
predictor for the same reason.
Fig1​:
Fig2:
Missing data pattern:
Based on the missing data pattern above and our exploration of the missing values, the only
variable of significance with missing values is Number of Engines.
We imputed the missing observations in this column with the impute missing column option in
JMP and rounded the imputed values to the nearest integer.
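As a rough illustration of this step, the sketch below fills missing engine counts with the
rounded mean of the observed values. It is an assumption that a simple column mean stands in
for the JMP imputation; the function name and the sample values are hypothetical.

```python
def impute_rounded_mean(values):
    """Replace None entries with the mean of the observed values, rounded to an integer."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Note: Python's round() rounds exact halves to the nearest even integer.
    fill_value = round(sum(observed) / len(observed))
    return [fill_value if v is None else v for v in values]


# Hypothetical engine counts with two missing entries.
engines = [1, None, 2, 1, 1, None]
imputed = impute_rounded_mean(engines)
```

Here the observed mean is 1.25, so both missing entries are filled with 1.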
Distribution and outlier analysis:
The only numeric variable we use in this model is Number of Engines, after imputing its
missing values. The distribution of the variable is shown above.
The robust fit outliers analysis flags 12,642 outliers; the distribution of the variable is
as follows.
Most observations have 1 or 2 engines, and there are aircraft with 3 and 4 engines. We
consider these values legitimate, as aircraft can have more than one engine, and decided not
to change the data.
IV. Modeling (M)
The objective of this project is to predict the probability of a fatal outcome in an aircraft
accident. We believe that, after studying these variables and the patterns among them, we can
build models to predict that probability and then provide reasonable suggestions that help
prevent aviation accidents and make air travel safer.
We implemented the following models:
1. Logistic Regression Model
2. Decision Tree Model
3. Neural Nets Model
For our models, we split the dataset into a 65% training set, a 25% validation set and a 15%
test set (as written, these percentages total 105%, so one of them is misstated). We used the
large training set to build our models, the validation set to draw our conclusions and
suggestions, and the test set for model comparison.
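A three-way split of this kind can be sketched as follows. This is a minimal illustration, not
the JMP validation column the team used. The function name is hypothetical, and the fractions
in the usage line are placeholders rather than the project's actual split, since the three
percentages as written sum to 105%.

```python
import random


def split_rows(rows, train, validation, test, seed=0):
    """Shuffle rows and cut them into training, validation and test sets."""
    if abs(train + validation + test - 1.0) > 1e-9:
        raise ValueError("the three fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # the remainder is the test set
    )


# Placeholder fractions for illustration only.
train_set, validation_set, test_set = split_rows(range(100), 0.60, 0.25, 0.15)
```

Checking that the fractions sum to 1 catches a misstated split early: calling the function
with 0.65, 0.25 and 0.15 raises a ValueError.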
The following variables were selected for modeling. They were chosen based on the patterns
discovered in the visualizations and by running candidate models and comparing which gave the
better prediction rate.
1. Number of engines
2. Broad phase of the flight
3. Season
4. Weather condition
5. Amateur built
6. FAR description
7. Region (location)
8. Weekend (is it a weekend or not)
Challenges:
1. The dataset available to us was not balanced: there were fewer fatal crashes than
non-fatal crashes.
2. Latitude and longitude values, which could be important predictors of our target variable,
were largely missing.
3. Our predictors were categorical in nature, which prevented us from using several
techniques such as principal component analysis, clustering or correlation analysis.
4. Having the severity (dollar value) of each incident would help assign a cost to
misclassification and would help in future predictions.
Multiple Logistic Regression:
Since our target variable, Injury.Severity, has a binary outcome (either fatal or non-fatal),
we can fit a logistic regression model. The significance of each variable in our model is
given below.
Parameter Estimates:
● The parameter estimate for Weather Condition in the IMC state is positive in this multiple
logistic regression, which indicates that, holding all other independent variables constant,
the injury severity is more likely to be fatal when the plane is flying in IMC weather
conditions.
● Similarly, the negative coefficient for aircraft that are not amateur built indicates that,
holding all other independent variables constant, the injury severity is less likely to be
fatal when the plane is not amateur built.
ROC Curves:
● The ROC curve shows how well the model separates the injury severity variable into fatal
and non-fatal. The area under the ROC curve (AUC) measures this discriminating ability; in
our case it is 0.7884 for the test data and 0.7959 for the validation data.
Odds Ratios:
● The odds ratio represents the constant effect of a predictor X on the odds that one outcome
will occur. The odds ratios are largest for Maneuvering versus Landing in Broad Phase of
Flight, IMC versus VMC in Weather Condition and Yes versus No in Amateur Built, compared with
the other class pairs in the same variables.
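The link between a logistic regression coefficient and its odds ratio is the exponential
function: a coefficient b multiplies the odds of the outcome by exp(b). The sketch below
illustrates this relationship; the function names and the numbers in the usage lines are
hypothetical and are not estimates from our model.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


def shifted_probability(base_probability, coefficient):
    """Probability after multiplying the base odds by exp(coefficient)."""
    base_odds = base_probability / (1.0 - base_probability)
    new_odds = base_odds * math.exp(coefficient)
    return new_odds / (1.0 + new_odds)


# Hypothetical example: a coefficient of ln(2) doubles the odds.
ratio = odds_ratio(math.log(2))
probability = shifted_probability(0.20, math.log(2))
```

With a base probability of 0.20 (odds of 0.25), doubling the odds gives odds of 0.5, which
corresponds to a probability of one third. A positive coefficient therefore gives an odds
ratio above 1 and a negative coefficient gives one below 1.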
Decision Tree Model:
Fit Details:
No. of Splits: 84
Tree Split of the model:
● From the picture, we can see that the decision tree model finds the variable that best
splits the outcomes, which is Broad Phase of Flight in this case, and divides the data into
two groups (leaves) on that split. This variable has the highest information gain, that is,
the greatest power to reduce the impurity of the target in the resulting groups.
● The tree is therefore constructed first on Broad Phase of Flight.
● Number of Engines is another variable that is split on many times. Splitting continues
until no region contains more than 5 observations.
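Information gain for a candidate split can be computed from the entropy of the target before
and after the split. The sketch below uses entropy-based information gain as an illustration;
it is an assumption that this matches the idea described above, and it is not necessarily the
exact split criterion JMP applies. The function names and toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after grouping by feature."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Toy rows for illustration only; not taken from the data set.
toy_rows = [
    {"phase": "CRUISE", "weekend": "Weekend", "severity": "fatal"},
    {"phase": "CRUISE", "weekend": "Non-Weekend", "severity": "fatal"},
    {"phase": "LANDING", "weekend": "Weekend", "severity": "nonfatal"},
    {"phase": "LANDING", "weekend": "Non-Weekend", "severity": "nonfatal"},
]
gain_phase = information_gain(toy_rows, "phase", "severity")
gain_weekend = information_gain(toy_rows, "weekend", "severity")
```

In this toy example the phase separates the two classes perfectly (gain of 1 bit) while the
weekend flag carries no information (gain of 0), so a tree would split on the phase first.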
Column Contribution:
● Broad Phase of Flight has the highest information gain. Together with Number of Engines, it
also has the highest number of splits.
Leaf Report:
● The leaf report gives the classification rules of the model and is a compact representation
of most of the information in the decision tree.
ROC Curve for Decision tree model:
● The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false
positive rate) and shows how well the model separates the injury severity variable into fatal
and non-fatal. The area under the ROC curve (AUC) summarizes this discriminating ability; in
our case it is 0.7999 for the test data and 0.8067 for the validation data.
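The area under the ROC curve has a direct interpretation: it is the probability that a
randomly chosen fatal case receives a higher predicted score than a randomly chosen non-fatal
case. The sketch below computes it by comparing every such pair; it is an illustration with
made-up scores, not the JMP calculation, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = fatal, 0 = non-fatal) and predicted scores.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Three of the four fatal/non-fatal pairs are ranked correctly here, so the AUC is 0.75. A value
of 0.5 corresponds to random ranking and 1.0 to perfect separation.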
Neural Networks Model:
Overall information:
● We took the best predictors identified by the decision tree model and used them as inputs
for the neural network model.
Diagram representation of the model.
ROC Curve for Neural Networks model:
● The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false
positive rate) and shows how well the model separates the injury severity variable into fatal
and non-fatal. The area under the ROC curve (AUC) summarizes this discriminating ability; in
our case it is 0.7969 for the test data and 0.8032 for the validation data.
V. Analysis (A): Accuracy Reports & Model Comparison
We calculated the accuracy of our models from their confusion matrices:
Logistic regression Confusion Matrix:
Decision Tree Confusion matrix:
Neural Network Confusion matrix:
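Accuracy, sensitivity and specificity all follow from the four cells of a confusion matrix.
The sketch below shows the arithmetic; the function name is hypothetical and the counts in
the usage line are made up for illustration, not read from the matrices above.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Summary rates from confusion matrix counts (fatal is the positive class)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),   # share of fatal cases detected
        "specificity": tn / (tn + fp),   # share of non-fatal cases detected
    }


# Made-up counts for illustration only.
metrics = confusion_metrics(tp=30, fp=20, fn=10, tn=40)
```

With these counts the accuracy is 0.70, the sensitivity is 0.75 and the specificity is about
0.67. Reporting sensitivity and specificity next to accuracy is useful when the classes are
unbalanced.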
We compare the accuracy of the logistic regression, decision tree and neural network models.
Accuracy of the model
Data set         Baseline   Logistic Regression   Decision Tree   Neural Nets
Validation Set   31%        70.11%                71.66%          71.17%
Test Set         31%        70.62%                71.76%          71.17%
● The decision tree model has a better accuracy rate than the multiple logistic regression
model and the neural network model.
ROC Curve comparison:
● The decision tree model has the best area under the curve of the models run on the data
set, with AUC = 0.8038.
Lift Curve Comparison:
● The lift ratios of the neural network and decision tree models are close to 2.50, and the
logistic regression model has a lower lift ratio than the other two models.
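A lift ratio compares the fatal rate among the highest-scored cases with the fatal rate in
the whole sample. The sketch below is an illustration under the assumption that lift is
measured on a top portion of cases ranked by predicted score; the function name, the portion
and the data are hypothetical.

```python
def lift_at(labels, scores, portion):
    """Fatal rate in the top-scored portion divided by the overall fatal rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(len(ranked) * portion))
    top_rate = sum(ranked[:k]) / k
    overall_rate = sum(ranked) / len(ranked)
    return top_rate / overall_rate


# Made-up labels (1 = fatal) with scores already in descending order.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
top_lift = lift_at(labels, scores, 0.20)
```

In this toy sample 30% of all cases are fatal, but both cases in the top 20% are fatal, so the
lift is about 3.33. A lift of 2.5 means the top-ranked cases are fatal 2.5 times as often as
cases picked at random.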
Model selection is based on the following factors:
Factor                           Logistic Regression   Decision Tree   Neural Networks
Accuracy of model (Validation)   70.11%                71.66%          71.17%
ROC Curve (AUC)                  0.7906                0.8038          0.8017
Lift Ratio                       2.4508                2.502           2.495
All the models perform better than the baseline, and the decision tree and neural network
models are close in prediction accuracy.
We choose the Decision tree model because:
1) It has better prediction accuracy on the validation and test data than the logistic
regression model. The decision tree did a better job of modeling nonlinear relationships
between the variables and was able to handle interactions between them. It also has a better
area under the curve than the other models.
2) Compared with the neural network model, the decision tree gives us an understandable
model. It is relatively easy to explain how the data was analyzed and how decisions are made
with a decision tree, whereas a neural network model is difficult to incorporate in a
business environment.
● Model Implementation
Based on our observation of the dataset and the results of all the prediction models, we find
that the decision tree model is the better model for predicting injury severity. After
running the models and establishing their accuracy, we evaluated one of the classification
formulas used to predict injury severity:
This is just a part of the classification formula from the decision tree.
This analysis raises a few questions:
1. In this model, the Broad Phase of Flight and the manufacture of the aircraft seem to be
the strongest indicators of the risk of fatalities. But it is not possible to restrict the
phase of flight, and in order to preserve the pioneering nature of aviation we wonder whether
it is realistic to change the type of aircraft allowed to be built.
2. The number of engines also seems to be tied to the severity of accidents and fatalities.
But changing this would be essentially impossible: requiring more engines to make flight
safer would severely limit the accessibility of GA for most people due to cost. The design of
aircraft would also have to change significantly, and the industry could not survive such a
change.
3. Is the relationship between the GA population and the type of engine significant in
predicting the severity of accidents? The data seems to show that accidents are more severe
with reciprocating engines. But we believe that the majority of turbine engines are in
commercial, military or other non-GA aircraft. Thus the data is more indicative of the
aircraft category than of the impact of the engine type alone on accident severity.
VI. Plan for future upgrades
In order to improve aviation safety and drive the fatality rate down across the community, we
must use the data in the model we chose and then analyze the true meaning behind each data
point. When we look at the phase of flight, there is little to improve. Aircraft in the
landing phase are already committed to landing in a safe location, which drives down the
severity of any accident. The pilot's focus on the required actions is also heightened, so
they catch emergencies and can react faster. To improve safety in each phase of flight, we
should attempt to carry these positive attributes forward to the other phases. Obviously, the
location of the landing area cannot be carried forward, so we must focus on pilot readiness
and situational awareness.
Like the phase of flight, the build of the aircraft itself can exacerbate the severity of a
crash. But in order to change this variable we would have to change the nature of aviation
itself. Aviation has always been a sport of pioneering and development, where experimentation
and bravery are valued. Thus, attempting to stop the production of home-built aircraft would
be impossible. However, we can attempt to improve the quality of these aircraft through
education. It would be possible to impose a mandatory evaluation and analysis of the aircraft
design by the builder before manufacture, enforced by requiring a certificate or permit to
build an aircraft. To obtain it, the builder would have to sit with a licensed FAA aircraft
designer and go through an analysis of the inherent weaknesses and strengths of the aircraft
design. In this way, the designer/builder would gain the knowledge to improve the safety and
survivability of their aircraft.
Finally, the next best improvement would be to require turbine engines in all GA aircraft.
There are two sides to this problem. First, there must be a concerted effort throughout the
industry to drive down the prohibitive cost of turbine engines, which would remove the cost
barrier and facilitate a shift to this more reliable powerplant. Second, we must analyze the
whole picture behind the lower accident rate with turbine engines. Looking closely, we come
back to the fact that most turbine engines are in commercial and other non-GA aircraft, so
pilot training is again a factor: pilots flying turbine-powered aircraft are generally better
trained. This brings us back to the improvements in training that must be made across the
board in aviation.
As a closing thought, we should look back to the chart in data visualization (ii) at the
beginning of this paper. This chart shows the improvement in aviation safety over the years.
It shows that 3296 fatalities were recorded in 1980. Over the next 36 years the number of
fatalities has been driven down by over 71%. We must continue this trend by concentrating on
both pilot and aircraft safety. This can be done by leveraging the accelerating pace of
technological advancement in both pilot training and the construction of safer aircraft
designs.
References:
Dataset source: https://www.kaggle.com/khsamaha/aviation-accident-database-synopses
SEMMA: http://faculty.smu.edu/tfomby/eco5385_eco6380/data/SPSS/SAS%20_%20SEMMA.pdf
NTSB: https://www.ntsb.gov/investigations/AccidentReports/Pages/aviation.aspx

More Related Content

What's hot

2 database system concepts and architecture
2 database system concepts and architecture2 database system concepts and architecture
2 database system concepts and architecture
Kumar
 
Introductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, QueueIntroductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, Queue
Ghaffar Khan
 

What's hot (20)

Database management systems cs403 power point slides lecture 05
Database management systems   cs403 power point slides lecture 05Database management systems   cs403 power point slides lecture 05
Database management systems cs403 power point slides lecture 05
 
2 database system concepts and architecture
2 database system concepts and architecture2 database system concepts and architecture
2 database system concepts and architecture
 
dbms notes.ppt
dbms notes.pptdbms notes.ppt
dbms notes.ppt
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
NoSql
NoSqlNoSql
NoSql
 
Introductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, QueueIntroductiont To Aray,Tree,Stack, Queue
Introductiont To Aray,Tree,Stack, Queue
 
Basic DBMS ppt
Basic DBMS pptBasic DBMS ppt
Basic DBMS ppt
 
Abstract data types (adt) intro to data structure part 2
Abstract data types (adt)   intro to data structure part 2Abstract data types (adt)   intro to data structure part 2
Abstract data types (adt) intro to data structure part 2
 
Chapter2
Chapter2Chapter2
Chapter2
 
Database Design
Database DesignDatabase Design
Database Design
 
Factores humanos
Factores humanosFactores humanos
Factores humanos
 
Adbms 17 object query language
Adbms 17 object query languageAdbms 17 object query language
Adbms 17 object query language
 
DATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEMDATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEM
 
Database Systems - Introduction (Chapter 1)
Database Systems - Introduction (Chapter 1)Database Systems - Introduction (Chapter 1)
Database Systems - Introduction (Chapter 1)
 
23246406 dbms-unit-1
23246406 dbms-unit-123246406 dbms-unit-1
23246406 dbms-unit-1
 
Schema
SchemaSchema
Schema
 
Database Management System Introduction
Database Management System IntroductionDatabase Management System Introduction
Database Management System Introduction
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Database Chapter 2
Database Chapter 2Database Chapter 2
Database Chapter 2
 
R data-import, data-export
R data-import, data-exportR data-import, data-export
R data-import, data-export
 

Similar to Predictive Analytics - NTSB Aviation accidents data

HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSHOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
Sunil Kakade
 
Federal Aviation Administration (FAA) NextGeneration Air Tra.docx
Federal Aviation Administration (FAA) NextGeneration Air Tra.docxFederal Aviation Administration (FAA) NextGeneration Air Tra.docx
Federal Aviation Administration (FAA) NextGeneration Air Tra.docx
lmelaine
 
Towards Improving Crash Data Management System in Gulf Countries
Towards Improving Crash Data Management System in Gulf CountriesTowards Improving Crash Data Management System in Gulf Countries
Towards Improving Crash Data Management System in Gulf Countries
IJERA Editor
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docx
charisellington63520
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docx
todd521
 
Available online at httpdocs.lib.purdue.edujateJournal.docx
Available online at httpdocs.lib.purdue.edujateJournal.docxAvailable online at httpdocs.lib.purdue.edujateJournal.docx
Available online at httpdocs.lib.purdue.edujateJournal.docx
celenarouzie
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
lorainedeserre
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
jesusamckone
 
BlackSwan AviationOutlook WP_061014
BlackSwan AviationOutlook WP_061014BlackSwan AviationOutlook WP_061014
BlackSwan AviationOutlook WP_061014
Paul Moser
 

Similar to Predictive Analytics - NTSB Aviation accidents data (20)

HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSHOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
 
AVSS & The Institute for Drone Technology™ joint report government regulation...
AVSS & The Institute for Drone Technology™ joint report government regulation...AVSS & The Institute for Drone Technology™ joint report government regulation...
AVSS & The Institute for Drone Technology™ joint report government regulation...
 
Federal Aviation Administration (FAA) NextGeneration Air Tra.docx
Federal Aviation Administration (FAA) NextGeneration Air Tra.docxFederal Aviation Administration (FAA) NextGeneration Air Tra.docx
Federal Aviation Administration (FAA) NextGeneration Air Tra.docx
 
Chittoor.Sandeep
Chittoor.SandeepChittoor.Sandeep
Chittoor.Sandeep
 
AVIATION RISK 2020
AVIATION RISK  2020AVIATION RISK  2020
AVIATION RISK 2020
 
!Carroll_Capstone_Final
!Carroll_Capstone_Final!Carroll_Capstone_Final
!Carroll_Capstone_Final
 
Towards Improving Crash Data Management System in Gulf Countries
Towards Improving Crash Data Management System in Gulf CountriesTowards Improving Crash Data Management System in Gulf Countries
Towards Improving Crash Data Management System in Gulf Countries
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docx
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docx
 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining project
 
Available online at httpdocs.lib.purdue.edujateJournal.docx
Available online at httpdocs.lib.purdue.edujateJournal.docxAvailable online at httpdocs.lib.purdue.edujateJournal.docx
Available online at httpdocs.lib.purdue.edujateJournal.docx
 
Aviation Analysis.pptx
Aviation Analysis.pptxAviation Analysis.pptx
Aviation Analysis.pptx
 
ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...
ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...
ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...
 
ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...
ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...
ANALYZING AIRCRAFT LANDING DECISIONMAKING THROUGH FUZZY LOGIC APPROACH: A COM...
 
8AA SMS221 Analysis
8AA SMS221 Analysis8AA SMS221 Analysis
8AA SMS221 Analysis
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
 
BlackSwan AviationOutlook WP_061014
BlackSwan AviationOutlook WP_061014BlackSwan AviationOutlook WP_061014
BlackSwan AviationOutlook WP_061014
 
IRJET - Road Accident and Emergency Management: A Data Analytics Approach
IRJET - Road Accident and Emergency Management: A Data Analytics ApproachIRJET - Road Accident and Emergency Management: A Data Analytics Approach
IRJET - Road Accident and Emergency Management: A Data Analytics Approach
 
Foqa good one
Foqa good oneFoqa good one
Foqa good one
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Recently uploaded (20)

Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 

Predictive Analytics - NTSB Aviation accidents data

  • 1. Spring Semester 2017 Group Project report Instructor: Iva Stricevic OPIM 5604- Predictive Modeling Section B12-1173 Team 2 NTSB Aviation Accident Data Analysis April 25, 2017
  • 2. SUMMARY Team Members: Surya Adavi Ashish Doke William Pratt Mark Strout Jinwei Wang The objective of this case analysis is to assess NTSB aviation Accident data and develop a classification model forecasting likelihood of injury severity of an aircraft crash using the SEMMA approach. Objectives: Following the guidelines of the SEMMA approach ● Sampling of the data. (S) : Data sampling ● Exploration of the data (E) : Data visualization and pattern discovery: ● Modification of data (M) : Data preprocessing ● Modeling the data (M) : Predictive/Classification Modeling ● Assessing the data (A) : Model Implementation ● Plan for future upgrade Sampling (S)​: ​The dataset we chose contains information about global aviation accidents. This dataset has been picked from Kaggle website. This data has been collected by National Transportation Safety Board (NTSB) since the year 1962. This data set contains accident/ incident data from 1982 to January 2017. Since it’s a government agency (NTSB), we assume that the data is genuine and collection methods used are standardized. The data set has 79,293 observations and 32 variables. There are 8 variables of Numeric datatype dealing with the Event Date, Latitudes and longitudes, Registration Number and the Count of Injuries. There are 24 Categorical variables, Amateur built being the one ordinal variable and the rest being nominal. The nominal 1
  • 3. variables give us details about the weather conditions, Location of airport, crash site, country, type of Aircraft its model, make and purpose of the flight. Based on the aviation accidents data, we chose to explore which factors play an important role contributing to these accidents and determine new patterns and possible causal relationships if any. Some of the variables we intend to study are: 1. Weather conditions 2. Flight conditions influencing the accident 3. Aircraft Category & Engine Type 4. Make, Model & Built of Aircraft 5. Event dates and Location details 6. Broad phase of flight 7. Aircraft damage and severity of injuries After studying all these variables and drawing patterns between them, we will provide actionable recommendations, that will help prevent aviation accidents and make air-travel safer. General thoughts: Most accidents are in General Aviation(GA) due to two factors: ● The flight activity volume is significantly higher 81% at any given time in GA aircraft., than in commercial, military, agricultural or medevac 19%. In addition, there is generally less training required for GA pilots. This leads to less experience and less exposure to potentially hazardous situations in a controlled environment. ● An additional factor that may skew numbers leading to errors is collection regularity and accuracy of the reporting process. Most accidents are reported in the United States, 93%, so the location numbers may be skewed, compared to 7% in all others. ● These factors come into play and are taken into account in the pre-processing of the data. 2
  • 4. ​II. Data visualization and pattern discovery (E) We decide to do some visualizations and analysis to detect the patterns inside the data and get some insights for our analysis. i) ​Injury severity vs The total fatal injuries: Total fatal injuries are actually derived from this injury severity column. The investigation report uses the fatal injuries count to categories the injury severity. So, we cannot use this column. instead we can use this variable to check for patterns with other variables to derive insights. ii) ​Total fatal injuries vs year​: From the graph, we can see that the total fatal injuries has been decreasing with time. The development of aviation technology and aircraft manufacturing industry makes aircraft more 3
  • 5. reliable and air travel much safer. In addition, the training requirements have increased over the years. The improvement of electronic and virtual simulation training devices has improved the access and quality of pilot training and improved situational awareness. iii) ​Fatal Injuries vs season: We can get that summer has the highest number of fatal while winter has the lowest number of fatal. The number of fatal between autumn and spring are close. This is a similar comparison to the higher than average number of GA to commercial accidents reported annually. The GA community is much more active during the nicer summer months in the Northern Hemisphere. iv)​Total fatal injuries vs day of week 2​: There were more accidents on the weekend than that on weekdays. During weekdays, Friday has 4
  • 6. the highest number of fatal injuries. This is again, a similar comparison to the higher than average number of GA to commercial accidents reported. The GA community is much more active during the weekends and time away from their normal occupations including Fridays. v) ​Aircraft damage vs amateur built​: From this graphic it is clear that two factors are at work here. First. The ease of access to home-built aircraft increases the likelihood a home built aircraft is involved in an accident. These aircraft are much cheaper and there is very little regulation on the type and construction standards of these aircraft. essentially the only limits to these are the safety precautions when interacting with commercial aircraft, such a radio communication and navigations requirements. The second factor affecting this variable's impact on aviation accident data is the survivability of these home built aircraft. These aircraft are usually constructed out of less rugged and poorly designed components. The built-in safety of commercially produced aircraft is not required in these home built and thus the accidents are less survivable and more likely to occur. 5
  • 7. vi) ​Region in the United states vs Total fatal injuries​: The weather and accessibility of GA flights is reflective of the increased rates of fatal accidents here. There is much better weather, in general which makes flight much easier and thus there is very limited pilot training in these areas. The best weather and the most dangerous terrain, from a survivability standpoint is California. It’s clear sunny days increases GA traffic, but its rugged mountains and dense population make the prospects of surviving a crash far less likely. Thus, we see that California has highest no. of accidents with fatal injuries. 6
  • 8. vii)​ weather vs injury severity​: The Mosaic plot above shows the relation between injury.severity and weather condition. The blue color means non-fatal and red means fatal. We can get that under the VMC condition, there are more non-fatal and less fatal than that under IMC condition. This also identifies that there are far more flights occurring during the nice VMC weather. This makes total number of accidents and thus the fatal accidents increase. The interesting fact is that in IMC conditions the accidents tend to more serious and less survivable. This is due to the extreme flight conditions encountered and the fact accidents tend to be very sudden or of an extreme nature where pilots have lost control of their aircraft for one reason or another. 7
  • 9. viii)​ injury severity vs aircraft damage​: The chart above represents the relationship between injury severity and aircraft damage. The damage level has a strong relationship with injury severity. When the aircraft was destroyed, the highest number of fatal happened. When there are just minor and substantial damage happened, the number of fatal was low. 8
  • 10. ix) ​Number of engines vs injury severity​: There were more accidents total again where the highest likelihood of GA is present. In general GA is much more centered on inexpensive and simple aircraft. The number of engines in most planes is only one in this category. There are instances of two engines in GA but in general the pilot training in these cases is more intensive and the fact they have a spare engine if one quits there are fewer accidents. x)​ purpose of the flight vs injury severity​: Personal use as one of the purposes of the flight is almost a mirror image of the FAR type 9
  • 11. depicting GA. This data set only provides a duplicate value to the FAR type and if included would skew the data and provide no real value. xi)​ Injury severity vs FAR description​: In this original data set there are too many similar categories/ variables. Examples are the “Weight Shift” and “Glider” categories, which are truly GA and not commercial use. In these cases, we recoded the data to allow its useful inclusion in the predictive analysis of the data set. 10
  • 12. xiii) ​Injury severity vs Broad phase of flight The Mosaic plot above shows the relation between injury severity and broad phase of flight. From the graph, we can easily see that there are less fatal when the airplane is landing. Cruising aircraft have the highest fatal rate. This data is reflective of the location and profile more than a causal of an accident. this helps predict severity but not the cause of the accident. In aviation accident analysis terms this is labeled an aggravating factor. contributing but not causal. 11
  • 13. III. Data preprocessing (M): Our dataset contains a large number of categorical variables. Many of them are disordered and many of them contain missing values. These data values are unacceptable for data visualization and model building. First we recode these categorical variables to make them concise and reliable. Below are the columns that were changed. We believe that these transformations would greatly increase our model performance and make the data visualization more reasonable. 1. Injury.Severity​: Recode all the fatal with numeric values to “fatal” and non-fatal to”nonfatal”. 2. Event.date​: The information of year,day,month from this variable is extracted to create the following variables. (a) Day of Week​: New column, used “Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday” to represents the days of the specific dates. (b) Month​: New column, used “January, February, March, April, May, June, July, August, September, October, November, December” to represents the specific months. (c) Season​: The event date column is used to create the “Season” variable “ Spring, Summer, Fall, Winter.” (d) Day of Week​ 2 2(Weekend?): The day of the week is recoded to create new variable is it a weekend or not with “Weekend” and “Non-Weekend.” 3) ​Weather.Condition​: Code the missing value to “UNK”, which means unknown. VMC is a meteorological condition expressed in terms of visibility, distance from cloud, and ceiling equal to or better than specified minima. This is generally associated with a pilot’s ability to see clearly. IMC is meteorological conditions expressed in terms of visibility, distance from clouds, and ceiling less than the minima specified for visual meteorological conditions. This is generally meaning that the flight or some portion thereof is actually in the clouds with only reference to the aircraft instruments. 
4) ​Broad.Phase.of.Fligh​t: Coded “unk” into the missing values, coded descent and approach classes in the variable into landing based on the definitions , recoded climb and taxi into takeoff as it is a part of the same phase and other classes are left as is. 5)​ Number. of engine​s: We use informative missing to put the average number of engines into the missing values and the values were rounded to the nearest number. 6) ​Engine.Type​: Coded all the engines that indicated turbo/ Turbine, into “ Turbo”, coded all the engines that indicates Reciprocating into “Reciprocating” and the missing values are coded as unknown. 7) ​State​: Combined the original “Country” and “Region” columns into one column, if the 12
  • 14. regions where the accident happened are within U.S., we coded the state code, if the regions are outside U.S., we coded the country name. 8) ​Region​: if the regions where the accident happened are within U.S., we coded the position (“Midwest, Northeast, West, South”) that the region is in U.S., if the regions are outside U.S., we coded “Other.” 9) ​Purpose of Flight​.2: We are interested in the prediction of the injury severity of the aircrafts with passengers.so we have recoded the original purpose of the flight column into classes which belong to personal /business/public aircraft purposes and others which are for aerial observations into other observations class. 10)​FAR Description​ 2: We are interested in the prediction of the injury severity of the aircrafts with passengers.so we have recoded the FAR description column into classes which belong to aviation purpose, missing values as unknown and other applications of aircraft into Air other application. The columns which we have decide to drop and the explanations are: 1) Investigation type​: This variable is not practically significant to us as this is available to us only the investigation is done on the aircraft. So we decided to not use this variable 2) Accident Number​: This is just a unique ID given to the accident number. No significance. 3) Location and Country​: We have not used variable as they were too many different observations in this class but we extracted information from this column to create a new variable to use in our model. 4) Latitude and longitude​: We couldn't use this variable as they were too many missing values in these columns and geocode them as the data set has no zip codes available. 5) Airport Name and Airport Code​:This is just a unique code and name given to the airport and no practical significance. We have too many different classes in this variable as they are observations from all round the world. 
6) Aircraft Damage​: This variable is not practically significant to us as this is available to us only after the investigation report is made on the aircraft.This is statistically significant but we can't use practical reasons. 7) Aircraft category:​ We have dropped this variable as it has too many missing values which will affect the prediction. 8) Registration Number:​ This is just a unique number given to the Airplane. No significance. 9) Make and Model​: These variables have too many missing values and we cannot impute them so we haven't used them in our model and also in the non missing observations we 13
  • 15. couldnt group the classes as each class was different. 10) Schedule , Air carrie​r : We had to drop this variable as it had too many missing values and we didn't want to use this variable to predict due to this reason. 11) I​njuries columns​(​fatal,serious,minor and uninjured)​:These variable is not practically significant to us as this is available to us only after the investigation report is made on the aircraft.This is statistically significant but we can't use practical reasons. 12) Report status and publication date​:This variable is related to the investigation purpose and has no practical significance as this data are recorded after a fatal accident was already happened. DATA PREPROCESSING: MISSING VALUE CHECK ​Project team also analyzed the data-set to check if there are any missing values available or not ​Based on the analysis, in JMP missing value exploration, we encountered the following columns have missing values. ​1. ​Latitude and Longitude columns: 62,614 rows of 79,293 have missing values in this column as the exact location of the event was not available in the model we have decided not to use this variable as a predictor and delete the column. ​2. ​Number of Engines​: 4374 rows of 79,293 have missing values we have treated these missing rows with missing value imputation method. ​3.​Total Fatal injuries: 7626 rows of 79,293 have missing values as we cannot use this variable as it has statistical significance but no practical significance since we are predicting the severity of the crash before the investigation report is published. Hence, we are dropping this variable as a predictor. ​4​.Total Serious injuries​: 17,442 rows of 79,293 have missing values as we cannot use this variable as it has statistical significance but no practical significance since we are predicting the severity of the crash before the investigation report is published. Hence, we are dropping this variable as a predictor. ​5. 
​Total Minor injuries​: 18089 rows of 79,293 have missing values as we cannot use this variable as it has statistical significance but no practical significance since we are predicting the severity of the crash before the investigation report is published. Hence, we are dropping this variable as a predictor. ​6. ​Total Uninjured​: 14218 rows of 79,293 have missing values as we cannot use this variable as it has statistical significance but no practical significance since we are predicting the severity of 14
  • 16. the crash before the investigation report is published. we are dropping this variable as a predictor. Fig1​: Fig2: Missing data pattern: Based on the above missing data pattern and exploring the missing value our only variable of significance is Number of engines. We have used the imputation of the missing observations in this column from the impute missing column option from jmp and rounded of the imputed values to the nearest integer. 15
  • 17. Distribution and outlier analysis: The only Numeric variable were using in this model is ​Number of engines,​ after imputing the column with informative missing. the above is the distribution of the variable. There are 12642 outliers as per the robust fit outliers analysis, the distribution of the variable is as follows. The most of the number of engines observations are around 1 and 2 . They’re engines with 16
  • 18. number 3 and 4. We have considered them to be legitimate as aircrafts can have more than 1 engine practically. We’ve decide not to change the data. IV​. ​Modeling (M) Our objective of this project is to predict the possibility of the occurrence of fatal in an aircraft accident. We believe after studying all these variables and drawing patterns between them, we would be able to build models to predict the possibility and then provide reasonable suggestions, that will help prevent aviation accidents and make air-travel safer. We implemented the following models: 1. Logistic Regression Model 2. Decision Tree Model 3. Neural Nets Model For our model, we split the dataset into 65% training set, 25% validation set and 15% test set. We used the large training set to build our model, then we used validation set to draw our conclusion and suggestions, and used test set for model comparison. The following are the variables Selection for the modeling. These were selected based on the pattern discovery which we have come up with in the Visualization and also by running models and comparing them which gave the better prediction rate. 1.Number of engines 2.Broad phase of the flight 3.Season 4.Weather condition 5.Amateur built 6.FAR description 7.Region (Location) 8.Weekend (is it a weekend or not) Challenges​: 1.​ ​The dataset which was available to us was not balanced the number of fatal crashes were less compared to the Non-fatal crashes. 2. Latitude and Longitude values were missing which are the important predictors of our target variable. 17
  • 19. 3. Our predictors were categorical in nature, which prevented us from using several techniques such as Principal component analysis, Clustering or correlation analysis. 4. Having the severity (dollar value) of each incident would help assign a cost to the misclassification and would help in future predictions. Logistic Multiple Regression: Since our target variable, injury_severity, has a binary outcome (either fatal or non-fatal), we can fit a Logistic Regression model. The significance of each variable in our model is given below. 18
  • 20. Parameter Estimates:
● The parameter estimate for Weather condition = IMC is positive in this multiple logistic regression. This indicates that, holding all other independent variables constant, the injury severity is more likely to be fatal when the plane is flying in IMC weather conditions.
● Similarly, the negative coefficient for an aircraft that is not amateur built indicates that, holding all other independent variables constant, the injury severity is less likely to be fatal when the plane is not amateur built.
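To make the sign interpretation above concrete, the following minimal sketch converts logistic regression coefficients into odds ratios and predicted probabilities of a fatal outcome. The intercept and coefficient values are hypothetical placeholders, not the estimates from our fitted model, and the function and variable names are illustrative only.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient: exp(beta)."""
    return math.exp(coefficient)

def fatal_probability(intercept, coefficients, indicators):
    """P(fatal) from the logistic function applied to the linear predictor.

    coefficients and indicators are dicts keyed by dummy-variable name;
    an indicator is 1 when that level applies to the accident, else 0.
    """
    log_odds = intercept + sum(
        beta * indicators.get(name, 0) for name, beta in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values for illustration only (not the fitted estimates).
INTERCEPT = -1.0
COEFFICIENTS = {
    "weather_imc": 0.9,         # positive: fatal outcome more likely in IMC
    "not_amateur_built": -0.4,  # negative: fatal outcome less likely
}

# Same aircraft type, with and without IMC weather.
p_vmc = fatal_probability(INTERCEPT, COEFFICIENTS, {"not_amateur_built": 1})
p_imc = fatal_probability(
    INTERCEPT, COEFFICIENTS, {"weather_imc": 1, "not_amateur_built": 1}
)

print(f"Odds ratio for IMC: {odds_ratio(COEFFICIENTS['weather_imc']):.2f}")
print(f"P(fatal) without IMC: {p_vmc:.3f}; with IMC: {p_imc:.3f}")
```

Because the other indicator is held fixed, the ratio of the two odds equals exp(0.9) regardless of the intercept, which is what "holding all other independent variables constant" means in practice.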
  • 21. ROC Curves:
* How well our model performs depends on how well it differentiates the injury severity variable into fatal and non-fatal. The area under the ROC curve (AUC) summarizes this ability to discriminate; in our case it is 0.7884 for the test data and 0.7959 for the validation data.
Odds Ratios:
  • 22. ● The odds ratio represents the constant effect of a predictor X on the odds that one outcome will occur. The odds ratios are highest for Maneuvering versus Landing in Broad phase of flight, IMC versus VMC in Weather condition, and Yes versus No in Amateur built, when compared with the other level pairs of the same variables.
Decision Tree Model:
Fit Details:
  • 23. Number of splits: 84
  • 24. Tree Split of the model:
● From the picture, we can see that the decision tree model finds the variable that best splits the outcomes, which is Broad phase of the flight in this case, and divides the data into two groups (leaves) on that split. This variable has the highest information gain and the greatest power in reducing the residual sum of squares.
● The tree is therefore constructed first on Broad phase of the flight.
● Number of engines is another variable that is split on many times. Splitting continues until no region contains more than 5 observations.
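The split selection described above can be sketched as follows. This is a minimal illustration that assumes entropy-based information gain; the software we used may apply a different splitting criterion, and the toy outcome labels are made up rather than taken from the NTSB data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent_labels, child_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    if total == 0:
        return 0.0
    weighted_child_entropy = sum(
        (len(group) / total) * entropy(group) for group in child_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy data (made up): outcome labels grouped by a candidate split
# on phase of flight.
maneuvering = ["fatal", "fatal", "fatal", "non-fatal"]
landing = ["non-fatal", "non-fatal", "non-fatal", "fatal"]
all_accidents = maneuvering + landing

gain = information_gain(all_accidents, [maneuvering, landing])
print(f"Information gain of the candidate split: {gain:.4f} bits")
```

A tree builder would evaluate this quantity for every candidate split and pick the largest, then repeat inside each resulting group.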
  • 25. Column Contribution:
● Broad phase of the flight has the highest information gain. Along with Number of engines, it also has the highest number of splits.
Leaf Report:
● The leaf report gives the classification rules of the model and is a compact representation of most of the information in the decision tree.
  • 26. ROC Curve for Decision tree model:
● A ROC curve shows how good a model is by plotting Sensitivity (true positive rate) against 1 − Specificity (false positive rate). How well our model performs depends on how well it differentiates the injury severity variable into fatal and non-fatal. The area under the ROC curve (AUC) summarizes this ability to discriminate; in our case it is 0.7999 for the test data and 0.8067 for the validation data.
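As an illustration of what the ROC axes and the area under the curve measure, the sketch below computes one ROC point at a chosen threshold and then the AUC as the probability that a randomly chosen fatal accident receives a higher predicted score than a randomly chosen non-fatal one (ties count as half). The labels and scores are made-up values, not model output, and this pairwise calculation is an assumption for illustration rather than the method our software used to report the AUC.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold.

    y_true holds 1 for fatal and 0 for non-fatal; a case is predicted
    fatal when its score is at or above the threshold.
    """
    true_pos = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    false_pos = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return false_pos / negatives, true_pos / positives

def roc_auc(y_true, scores):
    """AUC as the share of (fatal, non-fatal) pairs ranked correctly."""
    pos_scores = [s for y, s in zip(y_true, scores) if y == 1]
    neg_scores = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos_scores or not neg_scores:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up labels (1 = fatal) and predicted probabilities.
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_point(labels, predicted, threshold=0.5)
print(f"At threshold 0.5: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
print(f"AUC = {roc_auc(labels, predicted):.2f}")
```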
  • 27. Neural Networks Model:
Overall information:
● We took the best predictors identified by the decision tree model and used them as the inputs to the neural net model.
  • 28. Diagram representation of the model.
  • 29. ROC Curve for Neural Networks model:
● A ROC curve shows how good a model is by plotting Sensitivity (true positive rate) against 1 − Specificity (false positive rate). How well our model performs depends on how well it differentiates the injury severity variable into fatal and non-fatal. The area under the ROC curve (AUC) summarizes this ability to discriminate; in our case it is 0.7969 for the test data and 0.8032 for the validation data.
  • 30. V. Analysis (A): Accuracy Reports & Model Comparison
We calculate the accuracy of each model from its confusion matrix:
Logistic regression confusion matrix:
Decision tree confusion matrix:
Neural network confusion matrix:
  • 31. We compare the accuracy of our logistic regression, decision tree and neural net models.
Accuracy of the model:
Model Type      Baseline  Logistic Regression  Decision Tree  Neural Nets
Validation Set  31%       70.11%               71.66%         71.17%
Test Set        31%       70.62%               71.76%         71.17%
● We can see that the decision tree model has a better accuracy rate than the logistic multiple regression model and the neural network model.
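The accuracy figures above are derived from confusion matrices; a minimal sketch of that calculation follows. The counts are hypothetical (the actual matrices appear as figures in this report), fatal is treated as the positive class by assumption, and the function names are illustrative.

```python
def accuracy(true_pos, false_pos, false_neg, true_neg):
    """Share of all cases that the model classified correctly."""
    total = true_pos + false_pos + false_neg + true_neg
    return (true_pos + true_neg) / total

def sensitivity(true_pos, false_neg):
    """True positive rate: share of fatal accidents predicted as fatal."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True negative rate: share of non-fatal accidents predicted as non-fatal."""
    return true_neg / (true_neg + false_pos)

# Hypothetical confusion-matrix counts (not the report's actual values).
counts = {"true_pos": 200, "false_pos": 150, "false_neg": 110, "true_neg": 540}

model_accuracy = accuracy(**counts)
print(f"Accuracy:    {model_accuracy:.2%}")
print(f"Sensitivity: {sensitivity(counts['true_pos'], counts['false_neg']):.2%}")
print(f"Specificity: {specificity(counts['true_neg'], counts['false_pos']):.2%}")
```

Reporting sensitivity and specificity next to accuracy is useful on an unbalanced dataset such as ours, where a high accuracy can hide poor detection of the rarer fatal class.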
  • 32. ROC Curve comparison:
● We can see that the decision tree model has the best area under the curve of the models run on the data set, with AUC = 0.8038.
Lift Curve Comparison:
  • 33. ● We can see that the lift ratios of the neural network and decision tree models are close to 2.50, while the logistic regression model has a lower lift ratio than the other two models.
Model selection based on the following factors:
                                Logistic Regression  Decision Tree  Neural Networks
Accuracy of model (Validation)  70.11%               71.66%         71.17%
ROC Curve (AUC)                 0.7906               0.8038         0.8017
Lift Ratio                      2.4508               2.502          2.495
All the models perform better than the baseline, and the decision tree and neural network models are close in prediction accuracy. We choose the Decision tree model because:
1) It has better prediction accuracy on the validation and test data than the logistic regression model. The decision tree did a better job of modeling nonlinear relationships between the variables and was able to handle the interactions between them. It also has a better area under the curve than the other models.
  • 34. 2) Compared with the neural network model, the decision tree gives us an understandable model. It is relatively easy to explain how the data was analyzed and how decisions are made with a decision tree, whereas a neural network model is harder to incorporate into a business environment.
● Model Implementation
After running all the models and analyzing the results alongside our observations of the dataset, we find that the decision tree is the better model for predicting injury severity. After running the models and establishing their accuracy, we evaluated one of the formulas used to predict injury severity:
  • 35. Just a part of the classification formula from the decision tree.
This analysis raises a few questions:
1. In this model, Broad Phase of Flight and the manufacture of the aircraft appear to be the strongest indicators of the risk of fatalities. But in order to preserve the pioneering nature of aviation, we wonder whether it is realistic to change the type of aircraft allowed to be built. It is not possible to restrict the phase of flight.
  • 36. 2. Looking at the number of engines, the severity of accidents and fatalities again appears to be tied to this variable. But making changes here would be essentially impossible, since increasing the number of engines to make flight safer would, because of cost, severely limit the accessibility of general aviation (GA) for most people. It would also mean that aircraft design would have to change significantly, and the industry could not survive such a change.
3. Is the relationship between the GA population and the type of engine significant in predicting the severity of accidents? The data seem to show that the severity of accidents is greater with reciprocating engines. But we believe that the majority of turbine engines are on commercial, military or other non-GA aircraft. Thus the data is more indicative of the aircraft category than of the impact of the engine type alone on accident severity.
VI. Plan for future upgrades
In order to improve aviation safety and drive the fatality rate down across the community, we must use the data in the model we chose, but then analyze the true meaning behind each data point. When we look at the phase of flight, there is little to improve. Aircraft in the landing phase are already committed to landing in a safe location, which drives down the severity of any accident. The pilot's focus on the required actions is also heightened, to the point that they catch emergencies and can react faster. So in order to improve safety in each phase of flight, we should attempt to carry these positive attributes forward to the other phases.
  • 37. Obviously, the location of the landing area cannot be carried forward, so we must focus on pilot readiness and situational awareness.
Like the phase of flight, the build of the aircraft itself can exacerbate the severity of a crash. But in order to change this variable we would have to change the nature of aviation itself. Aviation has always been a sport of pioneering and development, where experimentation and bravery are valued. Thus, attempting to stop the production of home-built aircraft would be impossible. As with the phase of flight, however, we can attempt to improve the quality of the aircraft through education. It would be possible to impose a mandatory evaluation and analysis of the aircraft design by the builder before manufacture. This could be enforced by requiring a certificate or permit to build an aircraft. To obtain it, the builder would have to sit with a licensed FAA aircraft designer and go through an analysis of the inherent weaknesses and strengths of the design. In this way, the designer/builder would gain the knowledge to improve the safety and survivability of their aircraft.
Finally, the next best improvement would be to require turbine engines in all GA aircraft. There are two ways to approach this problem. First, there must be a concerted effort throughout the industry to drive down the prohibitive cost of turbine engines. Removing the cost barrier would facilitate a shift to this more reliable powerplant. Second, we must analyze the whole picture of the lower accident rate with turbine engines. If we look closely, we come back to the fact that most turbine engines are in commercial and other non-GA aircraft. Thus the training of the pilots is again a factor: pilots flying turbine-powered aircraft are generally better trained.
This brings us back to the improvements in training that must be made across the board in aviation.
As a closing thought, we should look back to the chart in data visualization (ii) at the beginning of this paper. This chart shows the improvement in aviation safety over the years: the fatalities recorded in 1980 were 3296, and over the next 36 years the fatality rate has been driven down by over 71%. We must continue this trend by concentrating on both pilot and aircraft safety. This can be done by leveraging the accelerating rate of technology advancement in both pilot training and the construction of safer aircraft designs.
References:
Dataset source: https://www.kaggle.com/khsamaha/aviation-accident-database-synopses
SEMMA: http://faculty.smu.edu/tfomby/eco5385_eco6380/data/SPSS/SAS%20_%20SEMMA.pdf
NTSB: https://www.ntsb.gov/investigations/AccidentReports/Pages/aviation.aspx