Predictive Analytics - NTSB Aviation accidents data

Spring Semester 2017
Group Project report
Instructor: Iva Stricevic
OPIM 5604- Predictive Modeling
Section B12-1173
Team 2
NTSB Aviation Accident Data Analysis
April 25, 2017

Team Members:
Surya Adavi
Ashish Doke
William Pratt
Mark Strout
Jinwei Wang

SUMMARY
The objective of this case analysis is to assess NTSB aviation accident data and develop a
classification model that forecasts the injury severity of an aircraft crash, using the
SEMMA approach.
Objectives:
Following the guidelines of the SEMMA approach:
● Sampling of the data (S): Data sampling
● Exploration of the data (E): Data visualization and pattern discovery
● Modification of the data (M): Data preprocessing
● Modeling the data (M): Predictive/classification modeling
● Assessing the data (A): Model implementation
● Plan for future upgrades
Sampling (S):
The dataset we chose contains information about aviation accidents worldwide and was obtained
from Kaggle. The National Transportation Safety Board (NTSB) has collected this data since
1962; the dataset we used contains accident and incident records from 1982 to January 2017.
Because the NTSB is a government agency, we assume that the data is genuine and that the
collection methods are standardized. The dataset has 79,293 observations and 32 variables.
Eight variables are numeric and cover the event date, latitude and longitude, registration
number and the injury counts. The other 24 variables are categorical: Amateur Built is the one
ordinal variable and the rest are nominal. The nominal variables describe the weather
conditions, the location of the airport and crash site, the country, the aircraft type, make
and model, and the purpose of the flight.
Based on the aviation accident data, we chose to explore which factors contribute most to
these accidents and to look for new patterns and possible causal relationships, if any. Some
of the variables we intend to study are:
1. Weather conditions
2. Flight conditions influencing the accident
3. Aircraft Category & Engine Type
4. Make, Model & Build of Aircraft
5. Event dates and Location details
6. Broad phase of flight
7. Aircraft damage and severity of injuries
After studying these variables and the patterns among them, we will provide actionable
recommendations that will help prevent aviation accidents and make air travel safer.
General thoughts:
Most accidents occur in General Aviation (GA), and two factors shape the numbers:
● Flight activity is significantly higher in GA: at any given time, GA aircraft account for
81% of activity, against 19% for commercial, military, agricultural and medevac flights
combined. In addition, GA pilots generally need less training, which means less experience and
less exposure to potentially hazardous situations in a controlled environment.
● The regularity and accuracy of the reporting process may also skew the numbers. 93% of the
accidents in the data were reported in the United States and only 7% elsewhere, so the
location figures may be skewed.
These factors are taken into account in the pre-processing of the data.

II. Data visualization and pattern discovery (E)
We produced several visualizations to detect patterns in the data and gain insights for our
analysis.
i) Injury severity vs. total fatal injuries:
Injury severity and total fatal injuries carry the same information: the investigation report
uses the fatal injuries count to categorize the injury severity. We therefore cannot use the
fatal injuries column as a predictor. Instead, we use it to check for patterns with other
variables and derive insights.
ii) Total fatal injuries vs. year:
From the graph, we can see that total fatal injuries have decreased over time. Advances in
aviation technology and aircraft manufacturing have made aircraft more reliable and air travel
much safer. In addition, training requirements have increased over the years, and better
electronic and virtual simulation devices have improved the access to and quality of pilot
training as well as pilots' situational awareness.
iii) Fatal injuries vs. season:
Summer has the highest number of fatalities and winter the lowest, while autumn and spring are
close to each other. This mirrors the higher share of GA accidents relative to commercial
accidents reported annually: the GA community is much more active during the pleasant summer
months in the Northern Hemisphere.
iv) Total fatal injuries vs. day of week 2:
There were more accidents on weekends than on weekdays. Among the weekdays, Friday has the
highest number of fatal injuries. This again mirrors the higher share of GA accidents relative
to commercial accidents: the GA community is much more active on weekends and during time away
from normal occupations, which often includes Fridays.
v) Aircraft damage vs. amateur built:
This graphic suggests that two factors are at work. First, the ease of access to home-built
aircraft increases the likelihood that a home-built aircraft is involved in an accident. These
aircraft are much cheaper, and there is very little regulation of their type and construction
standards. Essentially the only limits are the safety precautions that apply when interacting
with commercial aircraft, such as radio communication and navigation requirements.
The second factor is the survivability of these home-built aircraft. They are usually
constructed from less rugged and less carefully designed components, and the built-in safety
of commercially produced aircraft is not required of them. Accidents are thus both more likely
to occur and less survivable.

vi) Region in the United States vs. total fatal injuries:
Weather and the accessibility of GA flying help explain the higher numbers of fatal accidents
in some regions. Where the weather is generally better, flying is easier, and pilot training
in these areas is correspondingly limited. California combines the best weather with the most
dangerous terrain from a survivability standpoint: its clear, sunny days increase GA traffic,
but its rugged mountains and dense population make surviving a crash far less likely.
Accordingly, California has the highest number of accidents with fatal injuries.

vii) Weather vs. injury severity:
The mosaic plot above shows the relation between Injury.Severity and weather condition. Blue
means non-fatal and red means fatal. Under VMC, the share of non-fatal accidents is higher and
the share of fatal accidents lower than under IMC. The plot also shows that far more flights
take place in good VMC weather, which raises the total number of accidents and therefore the
number of fatal accidents. Notably, accidents in IMC tend to be more serious and less
survivable. This is due to the extreme flight conditions encountered and to the fact that
these accidents tend to be sudden or extreme in nature, with pilots having lost control of
their aircraft for one reason or another.

viii) Injury severity vs. aircraft damage:
The chart above shows the relationship between injury severity and aircraft damage. The damage
level is strongly related to injury severity: the highest number of fatalities occurred when
the aircraft was destroyed, and the number was low when the damage was only minor or
substantial.

ix) Number of engines vs. injury severity:
Again, most accidents occur where GA is most heavily represented. GA is centered on
inexpensive, simple aircraft, and most aircraft in this category have only one engine. There
are two-engine aircraft in GA, but their pilots generally receive more intensive training and
have a spare engine if one fails, so there are fewer accidents.
x) Purpose of the flight vs. injury severity:
Personal use, as a purpose of flight, is almost a mirror image of the FAR type that denotes
GA. This variable largely duplicates the FAR type, so including it would skew the results and
add no real value.
xi) Injury severity vs. FAR description:
The original dataset has too many similar categories in this variable. Examples are the
“Weight Shift” and “Glider” categories, which are really GA rather than commercial use. In
these cases, we recoded the data so that it could be included usefully in the predictive
analysis.

xiii) Injury severity vs. broad phase of flight:
The mosaic plot above shows the relation between injury severity and broad phase of flight.
Fatalities are least frequent when the airplane is landing, and cruising aircraft have the
highest fatal rate. This variable reflects the location and flight profile rather than the
cause of an accident: it helps predict severity but not the cause. In aviation accident
analysis this is called an aggravating factor, contributing but not causal.

III. Data preprocessing (M):
Our dataset contains a large number of categorical variables. Many of them are inconsistently
coded and many contain missing values, which makes them unsuitable for data visualization and
model building. We first recoded these categorical variables to make them concise and reliable.
The columns that were changed are listed below. We believe these transformations will improve
model performance and make the visualizations easier to interpret.
1) Injury.Severity: Recoded all fatal values (which carry a numeric count) to “fatal” and all
non-fatal values to “nonfatal”.
2) Event.Date: The year, month and day in this variable were used to create the following
variables.
(a) Day of Week: New column with the values “Monday, Tuesday, Wednesday, Thursday, Friday,
Saturday, Sunday” for the day of each event date.
(b) Month: New column with the values “January, February, March, April, May, June, July,
August, September, October, November, December” for the month of each event date.
(c) Season: New column derived from the event date, with the values “Spring, Summer, Fall,
Winter.”
(d) Day of Week 2 (Weekend?): The day of the week recoded into a new variable that indicates
whether the day is a weekend, with the values “Weekend” and “Non-Weekend.”
3) Weather.Condition: Coded missing values as “UNK”, meaning unknown. VMC (visual
meteorological conditions) are meteorological conditions expressed in terms of visibility,
distance from cloud, and ceiling equal to or better than specified minima; this is generally
associated with a pilot’s ability to see clearly. IMC (instrument meteorological conditions)
are meteorological conditions expressed in terms of visibility, distance from cloud, and
ceiling less than the minima specified for visual meteorological conditions; this generally
means that the flight, or some portion of it, is actually in cloud and flown solely by
reference to the aircraft instruments.
4) Broad.Phase.of.Flight: Coded missing values as “unk”; recoded the descent and approach
classes to landing, based on their definitions; recoded climb and taxi to takeoff, as they are
part of the same phase; left the other classes as they are.
5) Number.of.Engines: We used informative missing to replace the missing values with the
average number of engines, rounded to the nearest integer.
6) Engine.Type: Coded all engine types that indicate turbo/turbine as “Turbo”, all types that
indicate reciprocating as “Reciprocating”, and missing values as unknown.
7) State: Combined the original “Country” and “Region” columns into one column. If the accident
happened within the U.S., we coded the state code; if it happened outside the U.S., we coded
the country name.
8) Region: If the accident happened within the U.S., we coded the U.S. region (“Midwest,
Northeast, West, South”); if it happened outside the U.S., we coded “Other.”
9) Purpose of Flight.2: We are interested in predicting the injury severity for aircraft
carrying passengers, so we recoded the original purpose-of-flight column into classes for
personal, business and public aircraft purposes, and grouped the others, which are for aerial
observations, into an “other observations” class.
10) FAR Description 2: For the same reason, we recoded the FAR description column into classes
for aviation purposes, coded missing values as unknown, and grouped other applications of
aircraft into an “Air other application” class.
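The date-derived variables of item 2 and the phase recoding of item 4 can be sketched in a few lines of Python. This is a minimal sketch rather than the team's JMP recoding: the month-to-season mapping, the definition of the weekend as Saturday and Sunday, the upper-case category spellings and the function names `date_features` and `recode_phase` are all assumptions made for illustration.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Assumption: meteorological seasons for the Northern Hemisphere.
SEASON_BY_MONTH = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Fall", 10: "Fall", 11: "Fall"}

# Recoding from item 4: descent/approach -> landing, climb/taxi -> takeoff.
# The upper-case spellings are assumed, not taken from the dataset.
PHASE_RECODE = {"DESCENT": "LANDING", "APPROACH": "LANDING",
                "CLIMB": "TAKEOFF", "TAXI": "TAKEOFF"}


def date_features(event_date: date) -> dict:
    """Derive the four calendar variables of item 2 from one event date."""
    day = DAY_NAMES[event_date.weekday()]  # weekday(): Monday is 0
    return {
        "Day of Week": day,
        "Month": MONTH_NAMES[event_date.month - 1],
        "Season": SEASON_BY_MONTH[event_date.month],
        # Assumption: the weekend is Saturday and Sunday.
        "Day of Week 2": "Weekend" if day in ("Saturday", "Sunday") else "Non-Weekend",
    }


def recode_phase(raw):
    """Apply the item 4 recoding; missing values become "UNK"."""
    if raw is None or not raw.strip():
        return "UNK"
    value = raw.strip().upper()
    return PHASE_RECODE.get(value, value)  # other classes are left as is


print(date_features(date(2017, 4, 25)))
print(recode_phase("Approach"), recode_phase(""), recode_phase("CRUISE"))
```

Keeping the recoding in small lookup tables makes it easy to audit which original classes were merged into which new class.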
The columns we decided to drop, with explanations, are:
1) Investigation Type: Not practically useful to us, as it is available only once an
investigation has been carried out on the aircraft.
2) Accident Number: Just a unique ID given to the accident. No significance.
3) Location and Country: Not used directly, as they contain too many distinct values, but we
extracted information from these columns to create a new variable for our model.
4) Latitude and Longitude: Not usable, as these columns have too many missing values and we
could not geocode the records because the dataset has no zip codes.
5) Airport Name and Airport Code: Just a unique name and code given to the airport, with no
practical significance. These variables also have too many distinct classes, as the
observations come from all around the world.
6) Aircraft Damage: Statistically significant, but not usable for practical reasons, as it is
available only after the investigation report on the aircraft has been made.
7) Aircraft Category: Dropped because it has too many missing values, which would affect the
prediction.
8) Registration Number: Just a unique number given to the airplane. No significance.
9) Make and Model: These variables have too many missing values that we cannot impute, and
among the non-missing observations we could not group the classes, as each class was
different.
10) Schedule, Air Carrier: Dropped because they have too many missing values.
11) Injury columns (fatal, serious, minor and uninjured): Statistically significant, but not
usable for practical reasons, as they are available only after the investigation report on the
aircraft has been made.
12) Report Status and Publication Date: Related to the investigation and of no practical
significance, as they are recorded after the accident has already happened.
DATA PREPROCESSING: MISSING VALUE CHECK
The project team also analyzed the dataset for missing values. Using JMP's missing value
exploration, we found that the following columns have missing values.
1. Latitude and Longitude: 62,614 rows of 79,293 have missing values, as the exact location of
the event was not available. We decided not to use these variables as predictors and deleted
the columns.
2. Number of Engines: 4,374 rows of 79,293 have missing values. We treated these rows with
missing value imputation.
3. Total Fatal Injuries: 7,626 rows of 79,293 have missing values.
4. Total Serious Injuries: 17,442 rows of 79,293 have missing values.
5. Total Minor Injuries: 18,089 rows of 79,293 have missing values.
6. Total Uninjured: 14,218 rows of 79,293 have missing values.
We dropped the four injury-count variables (items 3 to 6) as predictors in any case: they have
statistical significance but no practical significance, since we are predicting the severity
of the crash before the investigation report is published.
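To put the counts above in proportion, the sketch below converts them to percentages and shows how such counts could be tallied from raw rows. It is illustrative only: `missing_counts`, `missing_share` and the three toy rows are hypothetical and stand in for JMP's missing value exploration, while the totals are the ones quoted in the list above.

```python
def missing_counts(rows, columns):
    """Count, per column, the rows whose value is None or a blank string."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or (isinstance(value, str) and not value.strip()):
                counts[col] += 1
    return counts


def missing_share(n_missing, n_rows):
    """Percentage of rows that are missing."""
    return 100.0 * n_missing / n_rows


TOTAL_ROWS = 79293
REPORTED_MISSING = {  # counts as quoted in the list above
    "Latitude and Longitude": 62614,
    "Number of Engines": 4374,
    "Total Fatal Injuries": 7626,
    "Total Serious Injuries": 17442,
    "Total Minor Injuries": 18089,
    "Total Uninjured": 14218,
}
for name, n_missing in REPORTED_MISSING.items():
    print(f"{name}: {missing_share(n_missing, TOTAL_ROWS):.1f}% missing")

# Hypothetical toy rows, only to show how the counting works.
toy_rows = [
    {"Number.of.Engines": 1, "Latitude": None},
    {"Number.of.Engines": None, "Latitude": ""},
    {"Number.of.Engines": 2, "Latitude": "41.8"},
]
print(missing_counts(toy_rows, ["Number.of.Engines", "Latitude"]))
```

Roughly four out of five rows lack coordinates, while only about one in twenty lacks an engine count, which is consistent with dropping the former and imputing the latter.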
Fig1:
Fig2:
Missing data pattern:
Based on the missing data pattern above and our exploration of the missing values, the only
variable of significance with missing values is Number of Engines.
We imputed the missing observations in this column with JMP's impute-missing option and
rounded the imputed values to the nearest integer.
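As a rough illustration of this step, the sketch below fills missing engine counts with the rounded mean of the observed values, which is how the recoding list describes the treatment of Number of Engines. JMP's own imputation routine may work differently; the function name `impute_mean_rounded` and the sample values are hypothetical.

```python
import math


def impute_mean_rounded(values):
    """Replace None entries with the mean of the observed values,
    rounded half up to the nearest integer."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    mean = sum(observed) / len(observed)
    fill = math.floor(mean + 0.5)  # avoids Python's round-half-to-even
    return [fill if v is None else v for v in values]


# Hypothetical engine counts with two missing entries.
engines = [1, 1, 2, None, 1, 4, None]
print(impute_mean_rounded(engines))
```

Because most aircraft in the data have a single engine, a rounded mean will typically land on 1 or 2, so the imputed values stay within the range of real engine counts.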

Distribution and outlier analysis:
The only numeric variable we use in this model is Number of Engines, after imputing the column
with informative missing. Its distribution is shown above.
The robust fit outliers analysis flags 12,642 outliers; the distribution of the variable is as
follows.
Most observations have 1 or 2 engines, and some have 3 or 4. We consider these values
legitimate, as aircraft can in practice have more than one engine, and we decided not to
change the data.
IV. Modeling (M)
The objective of this project is to predict the probability of a fatality in an aircraft
accident. We believe that by studying these variables and the patterns among them, we can build
models that predict this probability and then provide reasonable suggestions that will help
prevent aviation accidents and make air travel safer.
We implemented the following models:
1. Logistic Regression Model
2. Decision Tree Model
3. Neural Nets Model
For our models, we split the dataset into a 65% training set, a 25% validation set and a 15%
test set. We used the large training set to build the models, the validation set to draw our
conclusions and suggestions, and the test set for model comparison.
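A three-way random split of this kind can be sketched as follows. The percentages quoted above add up to 105%, so the exact split cannot be reproduced from the text; the default fractions below (60/25/15) are placeholders chosen only so that they sum to 1, and the function name `split_rows` and the seed are likewise assumptions.

```python
import random


def split_rows(rows, fractions=(0.60, 0.25, 0.15), seed=2017):
    """Shuffle the rows and cut them into training, validation and test sets.

    `fractions` are placeholder proportions and must sum to 1.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


train, valid, test = split_rows(range(100))
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three sets identical across runs, so the models are compared on the same rows.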
The following variables were selected for modeling, based on the patterns discovered during
visualization and on comparing the prediction rates of models run with different variables.
1. Number of engines
2. Broad phase of the flight
3. Season
4. Weather condition
5. Amateur built
6. FAR description
7. Region (Location)
8. Weekend (is it a weekend or not)
Challenges:
1. The dataset available to us was not balanced: there were fewer fatal crashes than non-fatal
crashes.
2. Latitude and longitude values, which could be important predictors of our target variable,
were largely missing.
3. Our predictors were categorical in nature, which prevented us from using several techniques
such as principal component analysis, clustering or correlation analysis.
4. The data has no severity (dollar value) for each incident; having it would help assign a
cost to misclassification and would help in future predictions.
Logistic Multiple Regression:
Since our target variable, injury_severity, has a binary outcome (either fatal or non-fatal),
we can fit a logistic regression model. The significance of each variable in our model is given
below.
Parameter Estimates:
● The parameter estimate for the IMC level of weather condition is positive in this multiple
logistic regression, which indicates that, holding all other independent variables constant,
the injury severity is more likely to be fatal when the plane is flying in IMC weather.
● Similarly, the negative coefficient for aircraft that are not amateur built indicates that,
holding all other independent variables constant, the injury severity is less likely to be
fatal when the plane is not amateur built.

ROC Curves:
How well the model performs depends on how well it separates the injury severity variable into
fatal and non-fatal. The area under the ROC curve (AUC) measures this ability to discriminate
between the two classes; in our case it is 0.7884 for the test data and 0.7959 for the
validation data.
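The area under the ROC curve can be read as the probability that a randomly chosen fatal case receives a higher predicted score than a randomly chosen non-fatal case, with ties counting one half. The sketch below computes it that way in plain Python; the labels and scores are made up for illustration and `roc_auc` is a hypothetical helper, not the JMP calculation.

```python
def roc_auc(labels, scores):
    """AUC as the share of (fatal, non-fatal) pairs in which the fatal
    case has the higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = fatal, 0 = non-fatal) and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.4]
print(round(roc_auc(labels, scores), 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the AUC figures reported here should be read.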
Odds Ratios:
● The odds ratio represents the constant effect of a predictor X on the odds that one outcome
will occur. The odds ratio is larger for Maneuvering vs. Landing in broad phase of flight, IMC
vs. VMC in weather condition, and Yes vs. No in amateur built than for the other class pairs
in the same variables.
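As a worked illustration of what an odds ratio measures, the sketch below computes one from a 2x2 table of counts and from a logistic regression coefficient. The counts are hypothetical, and the coefficient conversion assumes a standard 0/1 indicator coding of the predictor; JMP's parameterization of nominal effects may differ, so this is not a reproduction of the report's figures.

```python
import math


def odds_ratio_from_counts(fatal_a, nonfatal_a, fatal_b, nonfatal_b):
    """Odds of a fatal outcome in group A divided by the odds in group B."""
    odds_a = fatal_a / nonfatal_a
    odds_b = fatal_b / nonfatal_b
    return odds_a / odds_b


def odds_ratio_from_coefficient(beta):
    """With 0/1 indicator coding, exp(beta) is the odds ratio of the
    indicated level against the reference level."""
    return math.exp(beta)


# Hypothetical counts: 30 fatal and 70 non-fatal accidents in IMC,
# 10 fatal and 90 non-fatal accidents in VMC.
ratio = odds_ratio_from_counts(30, 70, 10, 90)
print(round(ratio, 3))
print(round(odds_ratio_from_coefficient(math.log(ratio)), 3))
```

An odds ratio above 1 means the first level raises the odds of a fatal outcome relative to the second, which matches the reading of a positive coefficient in the parameter estimates.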
Decision Tree Model:
Fit Details:

Tree Split of the model:
● From the picture, we can see that the decision tree model finds the variable that best splits
the outcomes, which is broad phase of flight in this case, and divides the data into two groups
(leaves) on that split. This variable has the highest information gain, that is, the greatest
power to reduce the impurity of the resulting groups.
● The tree is therefore built first on broad phase of flight.
● Number of engines is another variable that is split many times. Splitting continues until no
region contains more than 5 observations.
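To make the notion of information gain concrete, the sketch below computes the entropy-based gain of a candidate split on a categorical variable. It illustrates the general technique named in the text, not the exact criterion JMP's tree platform optimizes; the function names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(labels, branches):
    """Entropy of the parent node minus the size-weighted entropy of the
    child nodes; `branches` gives the child each row is sent to."""
    total = len(labels)
    children = {}
    for label, branch in zip(labels, branches):
        children.setdefault(branch, []).append(label)
    weighted = sum(len(child) / total * entropy(child)
                   for child in children.values())
    return entropy(labels) - weighted


# Hypothetical rows: four fatal and four non-fatal accidents.
severity = ["fatal"] * 4 + ["nonfatal"] * 4
phase_split = ["cruise"] * 4 + ["landing"] * 4   # separates the classes
season_split = ["summer", "winter"] * 4          # does not separate them
print(information_gain(severity, phase_split))
print(information_gain(severity, season_split))
```

A tree builder evaluates every candidate split this way and keeps the one with the largest gain, which is why the most informative variable ends up at the top of the tree.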

Column Contribution:
● Broad phase of flight has the highest information gain. Together with number of engines, it
also accounts for the highest number of splits.
Leaf Report:
● The leaf report gives the classification rules of the model and is a compact representation
of most of the information in the decision tree.

ROC Curve for Decision tree model:
● An ROC curve shows how good a model is by plotting sensitivity (true positive rate) against
1 − specificity (false positive rate). How well the model performs depends on how well it
separates the injury severity variable into fatal and non-fatal. The area under the ROC curve
(AUC) measures this ability; in our case it is 0.7999 for the test data and 0.8067 for the
validation data.

Neural Networks Model:
Overall information:
● We used the best predictors identified by the decision tree model as the predictors for the
neural net model.

Diagram representation of the model.

ROC Curve for Neural Networks model:
● An ROC curve shows how good a model is by plotting sensitivity (true positive rate) against
1 − specificity (false positive rate). How well the model performs depends on how well it
separates the injury severity variable into fatal and non-fatal. The area under the ROC curve
(AUC) measures this ability; in our case it is 0.7969 for the test data and 0.8032 for the
validation data.

V. Analysis (A): Accuracy Reports & Model Comparison
We calculate the accuracy of each model from its confusion matrix:
Logistic regression Confusion Matrix:
Decision Tree Confusion matrix:
Neural Network Confusion matrix:

We compare the accuracy of our models: logistic regression, decision tree and neural nets.

Accuracy of the model, by model type:
                  Baseline    Logistic Regression    Decision Tree    Neural Nets
Validation Set    31%         70.11%                 71.66%           71.17%
Test Set          31%         70.62%                 71.76%           71.17%

● The decision tree model has a better accuracy rate than the logistic multiple regression
model and the neural network model.
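Accuracy is read off a confusion matrix as the share of correctly classified cases. The sketch below shows the arithmetic, together with sensitivity and specificity, using hypothetical counts; the actual confusion matrices of the three models are not reproduced here, and `classification_metrics` is an illustrative helper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: fatal predicted as fatal       fn: fatal predicted as non-fatal
    tn: non-fatal predicted correctly  fp: non-fatal predicted as fatal
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts for 200 validation rows.
metrics = classification_metrics(tp=40, fp=20, tn=120, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On an unbalanced target such as this one, accuracy alone can hide poor detection of the rarer fatal class, so sensitivity is worth reporting alongside it.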

ROC Curve comparison:
● The decision tree model has the best area under the curve of the models run on the data
set, with AUC = 0.8038.
Lift Curve Comparison:
● The lift ratios of the neural network and decision tree models are close to 2.50, and the
logistic regression model has a lower lift ratio than the other two models.
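Lift compares the fatal rate among the cases the model scores highest with the fatal rate in the whole sample. The sketch below computes it for a chosen top fraction; the portion of the data at which the lift ratios above were read is not stated, so the fraction, the function name `lift_at` and the toy data are assumptions.

```python
def lift_at(labels, scores, fraction=0.10):
    """Fatal rate in the top-scored `fraction` of cases divided by the
    overall fatal rate (labels: 1 = fatal, 0 = non-fatal)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical scores for ten accidents, two of them fatal.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.3, 0.2, 0.8, 0.4, 0.1, 0.5, 0.2, 0.3, 0.1]
print(lift_at(labels, scores, fraction=0.2))
```

A lift of about 2.5 therefore means that the highest-scored cases are roughly two and a half times as likely to be fatal as a case picked at random.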
Model selection based on the following factors:

                                  Logistic Regression    Decision Tree    Neural Networks
Accuracy of model (Validation)    70.11%                 71.66%           71.17%
ROC Curve (AUC)                   0.7906                 0.8038           0.8017
Lift Ratio                        2.4508                 2.502            2.495

All the models perform better than the baseline. The decision tree and neural network models
are close in prediction accuracy.
We choose the decision tree model because:
1) It has better prediction accuracy on the validation and test data than the logistic
regression model. The decision tree did a better job of modeling the nonlinear relationships
in the data and was able to handle interactions between the variables. It also has a better
area under the curve than the other models.
2) Compared with the neural network model, the decision tree gives us an understandable model.
It is relatively easy to explain how the data was analyzed and how decisions are made with a
decision tree, whereas a neural network model is harder to incorporate in a business
environment.
● Model Implementation
Based on our observation of the dataset and the results of all the prediction models, we find
that the decision tree model is the better model for predicting injury severity. After running
the models and establishing their accuracy, we examined one of the classification formulas
used to predict injury severity:
(Only part of the classification formula from the decision tree is shown.)
This analysis raises a few questions:
1. In this model, the broad phase of flight and the manufacture of the aircraft appear to be
the strongest indicators of the risk of fatalities. But to preserve the pioneering nature of
aviation, is it realistic to change the type of aircraft that may be built? And it is not
possible to restrict the phase of flight.
2. The number of engines also appears to be tied to the severity of accidents and fatalities.
But changing this would be practically impossible: requiring more engines to make flight safer
would severely limit the accessibility of GA for most people because of cost. Aircraft design
would also have to change significantly, and the industry could not survive such a change.
3. Is the relationship between the GA population and the type of engine significant in
predicting the severity of accidents? The data seems to show that accidents are more severe
with reciprocating engines. But we believe that the majority of turbine engines are in
commercial, military or other non-GA aircraft, so the data is more indicative of the aircraft
category than of the impact of engine type alone on accident severity.
VI. Plan for future upgrades
In order to improve aviation safety and drive the fatality rate down across the community, we
must use the data in the model we chose and then analyze the true meaning behind each data
point. When we look at the phase of flight, there is little to improve directly. Aircraft in
the landing phase are already committed to landing at a safe location, which reduces the
severity of any accident. The pilot's focus on the required actions is also heightened, so
emergencies are caught and reacted to faster. To improve safety in each phase of flight, we
should try to carry these positive attributes forward to the other phases. Obviously, the
location of the landing area cannot be carried forward, so we must focus on pilot readiness
and situational awareness.
As with the phase of flight, the build of the aircraft itself can exacerbate the severity of a
crash. But changing this variable would mean changing the nature of aviation itself. Aviation
has always been a field of pioneering and development, where experimentation and bravery are
valued, so attempting to stop the production of home-built aircraft would be impossible. We
can, however, attempt to improve the quality of the aircraft through education. It would be
possible to impose a mandatory evaluation and analysis of the aircraft design by the builder
before manufacture, enforced by requiring a certificate or permit to build an aircraft. To
obtain it, the builder would have to sit with a licensed FAA aircraft designer and go through
an analysis of the inherent weaknesses and strengths of the design. In this way, the
designer/builder would gain the knowledge to improve the safety and survivability of the
aircraft.
Finally, the next best improvement would be to require turbine engines in all GA aircraft.
There are two sides to this problem. First, there must be a concerted effort throughout the
industry to drive down the prohibitive cost of turbine engines, which would remove the cost
barrier and facilitate a shift to this more reliable powerplant. Second, we must analyze the
whole picture behind the lower accident rate with turbine engines. Looking closely, we come
back to the fact that most turbine engines are in commercial and other non-GA aircraft. The
training of the pilots is therefore again a factor: pilots flying turbine-powered aircraft are
generally better trained. This brings us back to the improvements in training that must be
made across the board in aviation.
As a closing thought, we should look back to the chart in data visualization (ii) at the
beginning of this paper, which shows the improvement in aviation safety over the years. The
chart shows that 3,296 fatalities were recorded in 1980. Over the next 36 years the number of
fatalities has been driven down by over 71%. We must continue this trend by concentrating on
both pilot and aircraft safety, leveraging the accelerating rate of technological advancement
in both pilot training and the construction of safer aircraft designs.
References:
Dataset source: https://www.kaggle.com/khsamaha/aviation-accident-database-synopses
SEMMA: http://faculty.smu.edu/tfomby/eco5385_eco6380/data/SPSS/SAS%20_%20SEMMA.pdf
NTSB: https://www.ntsb.gov/investigations/AccidentReports/Pages/aviation.aspx