The goal of Spark’s machine learning (ML) library is to make practical machine learning scalable and easy.
MLlib provides the machine learning algorithms and utilities listed here. In this talk, we will cover classification using decision trees.
In general, machine learning may be broken down into two classes of algorithms: supervised and unsupervised.
Supervised algorithms use labeled data in which both the input and output are provided to the algorithm.
Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without any hints.
Next we will briefly describe three common categories of machine learning techniques, starting with Classification.
Gmail uses a machine learning technique called classification to determine whether an email is spam or not, based on the data of an email: the sender, recipients, subject, and message body.
Classification takes a set of data with known labels and learns how to label new records based on that information. The algorithm identifies which category an item belongs to, based on labeled examples of known items. For example, it identifies whether an email is spam or not based on emails known to be spam or not.
Classification is a supervised algorithm, meaning it uses labeled data (for example, data labeled as spam/non-spam or fraud/non-fraud) to build a model. The model is then used to predict the label or class for new data.
Some common use cases for classification include credit card fraud detection and email spam detection.
You can classify something based on pre-determined features. Features are the “if questions” that you ask. The label is the answer to those questions. In this example, if it walks, swims, and quacks like a duck, then the label is "duck".
To build a classifier model, you first extract the features that most contribute to the classification. In our email example, we find features that define an email as spam, or not spam.
The features are transformed and put into feature vectors, which are arrays of numbers representing the value for each feature.
An algorithm “trains” a model by making associations between the input features and the labeled output associated with those features.
Then, at runtime, we deploy the best model, which can be used to make predictions on new data points.
Google News uses a technique called clustering to group news articles into different categories, based on title and content.
Clustering algorithms discover groupings that occur in collections of data.
In clustering, an algorithm classifies inputs into categories by analyzing similarities between input examples.
Clustering uses unsupervised algorithms, which do not have the outputs in advance. No known classes are used as a reference, as with a supervised algorithm like classification.
Clustering can be used for many purposes, for example:
grouping similar customers
anomaly detection, such as fraud detection
and text categorization, such as sorting books into genres
K-means is one of the most commonly used clustering algorithms. The objective of the K-means algorithm is, given a set of data points, to create K clusters that group the most similar (closest) points.
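As a rough sketch of the idea, here is a minimal K-means example using Spark MLlib’s RDD-based API; the toy points and the choice of K = 2 are made up for illustration, and sc is assumed to be the SparkContext.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy two-dimensional points; real input would be feature vectors built from data
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))

// Group the points into K = 2 clusters, running at most 20 iterations
val clusters = KMeans.train(points, 2, 20)

// Each new point is assigned to the cluster with the closest center
clusters.clusterCenters.foreach(println)
println(clusters.predict(Vectors.dense(0.5, 0.5)))
```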
Amazon uses a machine learning technique called collaborative filtering or recommendation, to determine products users will like based on their history and similarity to other users.
Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part).
The collaborative filtering approach is based on similarity; people who liked similar items in the past will like similar items in the future.
In the example shown, Ted likes movies A, B, and C. Carol likes movies B and C. Bob likes movie B. To recommend a movie to Bob, we calculate that users who liked B also liked C, so C is a possible recommendation for Bob.
The goal of a collaborative filtering algorithm is to take preference data from users and create a model that can be used for recommendations or predictions.
Ted likes movies A, B, and C. Carol likes movies B and C. We take this data and run it through an algorithm to build a model.
Then when we have new data such as Bob likes movie B, we use the model to predict that C is a possible recommendation for Bob.
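A minimal sketch of how this could be done with MLlib’s ALS (alternating least squares) implementation of collaborative filtering; the user/movie IDs and rating values below are invented to mirror the Ted/Carol/Bob example.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy ratings: user 1 = Ted, 2 = Carol, 3 = Bob; movie 1 = A, 2 = B, 3 = C
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 5.0), Rating(1, 3, 5.0), // Ted likes A, B, C
  Rating(2, 2, 5.0), Rating(2, 3, 5.0),                    // Carol likes B, C
  Rating(3, 2, 5.0)                                        // Bob likes B
))

// Build a matrix-factorization model (rank 10 and 10 iterations are arbitrary choices)
val model = ALS.train(ratings, 10, 10)

// Predict how Bob (user 3) would rate movie C (product 3)
println(model.predict(3, 3))
```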
Next we will look at using a decision tree to predict whether a flight is going to be delayed.
Decision trees create a model that predicts the class or label based on several input features.
Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. The feature questions are the nodes, and the answers “yes” or “no” are the branches in the tree to the child nodes.
We are using flight information for January 2014. For each flight, we have the following information:
We use a Scala case class to define the Flight schema corresponding to a line in the CSV data file.
The parseFlight function parses a line from the data file into a Flight object.
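A sketch of what that might look like; the field names and CSV column order below are assumptions, chosen to match the features used later in the talk.

```scala
// Assumed schema: field names and column order are illustrative
case class Flight(dofM: String, dofW: String, carrier: String,
                  origin: String, dest: String,
                  crsDepTime: Double, crsArrTime: Double, depDelayMins: Double)

// Parse one CSV line into a Flight
def parseFlight(line: String): Flight = {
  val f = line.split(",")
  Flight(f(0), f(1), f(2), f(3), f(4),
         f(5).toDouble, f(6).toDouble, f(7).toDouble)
}
```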
We load the flight data into an RDD, then use the map transformation on the RDD. This applies the parseFlight function to each element of the RDD and returns a new RDD of Flight objects.
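For example (the file path is hypothetical, and sc is the SparkContext):

```scala
// Load the raw CSV lines and parse each one into a Flight
val textRDD = sc.textFile("flights_jan2014.csv")
val flightsRDD = textRDD.map(parseFlight)
flightsRDD.cache()
```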
Next the features are transformed and put into a Feature Vector, which is a vector of numbers representing the value for each feature.
We want to extract the features that will contribute to the classification the most.
In our example, we will use the features day of the month, day of the week, the carrier, departure time, arrival time, the origin airport and destination airport.
The label is delayed or not delayed. A flight is considered to be delayed if it is more than 40 minutes late.
Here we transform the non-numeric features into numeric values. For example, the carrier AA becomes the number 6, and the airport ATL becomes 273.
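One way to build such a mapping is to index the distinct values; a sketch, assuming the flightsRDD from above (the exact numbers, such as AA = 6, depend on the data):

```scala
// Index each distinct categorical value, e.g. carrierMap("AA") might be 6
val carrierMap = flightsRDD.map(_.carrier).distinct().collect().zipWithIndex.toMap
val originMap  = flightsRDD.map(_.origin).distinct().collect().zipWithIndex.toMap
val destMap    = flightsRDD.map(_.dest).distinct().collect().zipWithIndex.toMap
```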
Next we put the label and the features vector into a LabeledPoint, which holds both the features and the label.
Here, the features for each flight are put into an RDD of arrays of numbers, where each number represents the value of a feature.
Next, we create an RDD of LabeledPoints consisting of the label and the features in numeric format for each flight.
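Putting the pieces together, a sketch of building the LabeledPoints, assuming the Flight fields and lookup maps sketched above; the 40-minute cutoff comes from the slides, while the feature order is an assumption:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val mldata = flightsRDD.map { flight =>
  // Label: 1.0 = delayed more than 40 minutes, 0.0 = not delayed
  val delayed = if (flight.depDelayMins > 40) 1.0 else 0.0
  // Features in numeric form; day-of-month/week are shifted to start at 0
  val features = Vectors.dense(
    flight.dofM.toDouble - 1,
    flight.dofW.toDouble - 1,
    carrierMap(flight.carrier).toDouble,
    flight.crsDepTime,
    flight.crsArrTime,
    originMap(flight.origin).toDouble,
    destMap(flight.dest).toDouble
  )
  LabeledPoint(delayed, features)
}
```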
A typical machine learning workflow is shown here. To make our predictions, we will perform the following steps:
Split the data into two parts, one for building the model and one for testing the model.
We then run the algorithm to build and train a model.
We make predictions with the training data, and observe the results.
Then, test the model with the test data.
Next we split the data into two parts, one for building the model and one for testing the model.
The top line in the code shown here applies the 80-20 split to our data.
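That line might look like the following, assuming mldata is the RDD of LabeledPoints sketched earlier (the seed is arbitrary):

```scala
// 80% of the data for training, 20% for testing; the seed makes the split reproducible
val splits = mldata.randomSplit(Array(0.8, 0.2), seed = 1234L)
val (trainingData, testData) = (splits(0), splits(1))
```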
Next we build the model with the training data set, which has labels.
As a reminder, we will use the decision tree algorithm to build the model.
categoricalFeaturesInfo specifies which features are categorical and how many categorical values each of those features can take. This is given as a map from feature index to the number of categories for that feature. The first entry, 0 -> 31, specifies that feature index 0 (which represents the day of the month) has 31 categories (values {1, ..., 31}).
The second entry represents the day of the week, which can take values from 1 through 7. The carrier value can go from 1 to the number of distinct carriers, and so on.
maxDepth: Maximum depth of a tree.
maxBins: Number of bins used when discretizing continuous features.
impurity: Impurity measure of the homogeneity of the labels at the node.
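A sketch of the training call with those parameters; the specific maxDepth and maxBins values shown here are illustrative, not necessarily the ones used in the talk, and the category counts come from the lookup maps built earlier:

```scala
import org.apache.spark.mllib.tree.DecisionTree

val numClasses = 2  // delayed or not delayed
// Feature index -> number of categories, following the feature order above
val categoricalFeaturesInfo = Map(
  0 -> 31,               // day of the month
  1 -> 7,                // day of the week
  2 -> carrierMap.size,  // carrier
  5 -> originMap.size,   // origin airport
  6 -> destMap.size      // destination airport
)
val impurity = "gini"
val maxDepth = 9
val maxBins = 7000       // must be at least as large as the biggest category count

val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)
```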
model.toDebugString prints out the decision tree, showing the questions it asks to determine whether a flight was delayed or not.
Now that we have trained our model, we want to get predictions for the test data without the label, in order to compare the predictions to the actual labels.
Next we use the test data to get predictions.
Here we create an RDD with the test labels and test predictions in order to compare the predictions to the actual flight delay label.
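A sketch of that step, assuming the model and testData from above:

```scala
// For each test record, predict from the features alone, then pair the
// actual label with the prediction
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
```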
Here we compare the predictions to the label.
The wrong prediction ratio is the count of wrong predictions divided by the count of test data values, which in this case is 31%.
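A sketch of that calculation, assuming the labelAndPreds pairs from above:

```scala
// Count the test records where the prediction disagrees with the actual label
val wrongPrediction = labelAndPreds.filter { case (label, prediction) =>
  label != prediction
}.count()

// Wrong prediction ratio: wrong predictions / total test records (~31% in the talk)
val ratioWrong = wrongPrediction.toDouble / testData.count()
```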