Presenter: Victor Herrera, PhD Candidate and Research Assistant, Florida Atlantic University
NOTE: This is the 2nd presentation in this session and also the 2nd presentation shown in the accompanying YouTube video
HPCC Systems Engineering Summit Presentation - Collaborative Research with FAU and LexisNexis Deck 2
1. Developing Machine Learning Algorithms on HPCC/ECL Platform
Industry/University Cooperative Research
LexisNexis/Florida Atlantic University
Victor Herrera – FAU – Fall 2014
2. Developing ML Algorithms on the HPCC/ECL Platform
LexisNexis/Florida Atlantic University Cooperative Research
HPCC ECL ML Library:
• High-level, data-centric declarative language
• Scalability: dictionary approach
• Open source
LexisNexis HPCC Platform:
• Big Data management
• Optimized code
• Parallel processing
Machine Learning Algorithms
3. Project Team
FAU
• Principal Investigator: Taghi Khoshgoftaar (Data Mining and Machine Learning)
• Researchers: Victor Herrera, Maryam Najafabadi
LexisNexis
• Flavio Villanustre, VP Technology & Product
• Edin Muharemagic, Data Scientist and Architect
4. Project Scope
Implement Machine Learning Algorithms in the HPCC ECL-ML library, focusing on:
• Supervised Learning for Classification
• Big Data
• Parallel Computing
[Diagram: training data is fed to a Machine Learning Algorithm (learning phase) to produce a Classification Model, which then assigns new instances (classification phase) to classes such as Constructor, Businessman, and Doctor]
5. ECL-ML Learning Library
Project Starting Point: April 2012
ML
• Supervised Learning
  – Classifiers: Naïve Bayes, Perceptron, Logistic Regression
  – Regression: Linear Regression
• Unsupervised Learning
  – Clustering: KMeans
  – Association Analysis: APrioriN, EclatN
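Naïve Bayes, the first classifier in the library's list, builds a probabilistic model from class-conditional counts. As a rough illustration of the idea only, here is a minimal plain-Python sketch for categorical features with add-one smoothing; all function names and the toy data are invented for this sketch and are not the ECL-ML API:

```python
from collections import Counter, defaultdict
import math

def nb_train(X, y):
    """Train a categorical Naive Bayes model: per-class priors plus
    per-(feature, class) value counts."""
    priors = Counter(y)
    cond = defaultdict(Counter)          # (feature index, class) -> value counts
    for row, label in zip(X, y):
        for i, v in enumerate(row):
            cond[(i, label)][v] += 1
    return priors, cond

def nb_classify(model, row):
    """Pick the class maximizing log P(class) + sum_i log P(value_i | class),
    with Laplace (add-one) smoothing for unseen values."""
    priors, cond = model
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for c, n in priors.items():
        lp = math.log(n / total)
        for i, v in enumerate(row):
            counts = cond[(i, c)]
            lp += math.log((counts[v] + 1) / (n + len(counts) + 1))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy data: (outlook, temperature) -> play decision
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y = ["no", "no", "yes", "yes"]
model = nb_train(X, y)
print(nb_classify(model, ("rainy", "mild")))   # -> yes
```

The same learn-then-classify split (a model built once, then applied to new rows) is the shape the ECL-ML classifiers expose through LearnC/ClassifyC.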
6. ECL-ML Open Source Blueprint
[Diagram: open-source contribution workflow for the ECL-ML library]
Roles: Community (use, contribute), ML-ECL Library Contributor, ML-ECL Library Administrator
Code review flow: comment, merge, commit, pull request; process comments
Resources: Machine Learning documentation, HPCC-ECL training, HPCC-ECL documentation forums
7. Experience Progress
Algorithm complexity:
• Naïve Bayes
• Decision Trees
• KNN
• Random Forest
Language knowledge:
• Transformations
• Aggregations
• Data distribution
• Parallel optimization
Implementation complexity:
• Understand existing code
• Fix existing modules
• Add functionality
• Design modules
Data:
• Categorical (discrete)
• Continuous
• Mixed (future work)
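Decision-tree induction treats the two data types above differently: a categorical attribute splits into one branch per value, while a continuous attribute requires a search for the best binary threshold (C4.5 tries midpoints between consecutive sorted values). A minimal Python sketch of both information-gain computations, with invented names and toy data, purely to illustrate the difference:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_categorical(values, labels):
    """Gain of splitting on a discrete attribute: one branch per value."""
    n = len(labels)
    split = 0.0
    for v in set(values):
        branch = [l for x, l in zip(values, labels) if x == v]
        split += len(branch) / n * entropy(branch)
    return entropy(labels) - split

def best_threshold_continuous(values, labels):
    """Best binary split (x <= t) on a numeric attribute: try midpoints
    between consecutive distinct sorted values, C4.5-style."""
    n, base = len(labels), entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (0.0, None)                       # (gain, threshold)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best[0]:
            best = (gain, t)
    return best

# Perfectly separable numeric attribute -> gain 1.0 at midpoint 0.7
gain, t = best_threshold_continuous([0.2, 0.4, 1.0, 1.4], ["a", "a", "b", "b"])
```

The threshold search is what makes continuous attributes costlier than categorical ones, which is part of the implementation-complexity jump from Naïve Bayes to Decision Trees noted above.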
9. // Learning Phase
minNumObj := 2; maxLevel := 5;                          // DecTree parameters
learner1 := ML.Classify.DecisionTree.C45Binary(minNumObj, maxLevel);
tmodel := learner1.LearnC(indepData, depData);          // DecTree model
// Classification Phase
results1 := learner1.ClassifyC(indepData, tmodel);
// Comparing to Original
compare1 := Classify.Compare(depData, results1);
OUTPUT(compare1.CrossAssignments, ALL);                 // CrossAssignment results
// Learning Phase
treeNum := 25; fsNum := 3; Pure := 1.0; maxLevel := 5;  // RndForest parameters
learner2 := ML.Classify.RandomForest(treeNum, fsNum, Pure, maxLevel);
rfmodel := learner2.LearnC(indepData, depData);         // RndForest model
// Classification Phase
results2 := learner2.ClassifyC(indepData, rfmodel);
// Comparing to Original
compare2 := Classify.Compare(depData, results2);
OUTPUT(compare2.CrossAssignments, ALL);                 // CrossAssignment results
[Decision tree diagram with node split thresholds 0.6, 4.7, 1.5, 1.8, and 1.7]
Example: Classifying Iris-Setosa, Decision Tree vs. Random Forest
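The Compare/CrossAssignments step in the ECL code tallies actual versus predicted classes, i.e. a confusion matrix. A minimal Python analogue of that evaluation step, with invented names and a toy Iris-style label set (this is not the ECL-ML API, just the same idea):

```python
from collections import Counter

def cross_assignments(actual, predicted):
    """Tally (actual class, predicted class) pairs -- a confusion matrix
    in the same spirit as CrossAssignments in the ECL code above."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of instances whose predicted class matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual    = ["setosa", "setosa", "versicolor", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]
cm = cross_assignments(actual, predicted)
print(cm[("setosa", "setosa")])     # correctly classified setosa: 1
print(accuracy(actual, predicted))  # 0.75
```

Off-diagonal entries such as `("setosa", "versicolor")` show exactly where the two classifiers disagree with the ground truth, which is what the slide's side-by-side comparison is inspecting.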
11. Project Contribution
Supervised Learning:
• Naïve Bayes: maximum likelihood, probabilistic model
• Decision Tree: top-down induction, recursive data partitioning; decision tree model
• Case Based: KD-tree partitioning, nearest-neighbor search; no model (lazy KNN algorithm)
• Ensemble: data sampling (tree bagging), feature sampling (random subspace); ensemble tree model
Classification:
• Class probability
• Missing-values classification
• Area under ROC
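The ensemble contribution combines two kinds of sampling: bootstrap resampling of the rows (tree bagging) and random subspaces of the features, with members combined by majority vote. A toy Python sketch of that combination, using an invented nearest-centroid stand-in for the tree base learner (all names and data here are hypothetical, not the ECL-ML implementation):

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Data sampling (tree bagging): draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_subspace(n_features, k, rng):
    """Feature sampling (random subspace): pick k of the feature indices."""
    return rng.sample(range(n_features), k)

def centroid_learn(rows, feats):
    """Stand-in base learner: per-class mean over the chosen features."""
    sums, counts = {}, Counter()
    for x, label in rows:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += x[f]
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}, feats

def centroid_classify(model, x):
    centroids, feats = model
    return min(centroids,
               key=lambda c: sum((x[f] - centroids[c][j]) ** 2
                                 for j, f in enumerate(feats)))

def ensemble_learn(rows, n_members, k, rng):
    """Each member sees its own bootstrap sample and feature subspace."""
    return [centroid_learn(bootstrap(rows, rng),
                           random_subspace(len(rows[0][0]), k, rng))
            for _ in range(n_members)]

def ensemble_classify(models, x):
    """Majority vote over the members' individual classifications."""
    votes = Counter(centroid_classify(m, x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
rows = [((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
        ((0.9, 1.0), "b"), ((1.0, 0.8), "b")]
models = ensemble_learn(rows, n_members=25, k=1, rng=rng)
print(ensemble_classify(models, (0.0, 0.1)))   # -> a
```

A real Random Forest replaces the centroid learner with a decision tree and re-draws the feature subspace at every split, but the two sampling steps shown here are the ones named on the slide.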
12. Overall Summary
• Development of Machine Learning Algorithms in the ECL language, focused on the implementation of classification algorithms, Big Data, and parallel processing.
• Added functionality to the Naïve Bayes classifier and implemented Decision Tree, KNN, and Random Forest classifiers in the ECL-ML library.
• Currently working on Big Data learning & evaluation:
  – Experiment design and implementation of a Big Data case study
  – Analysis of results
  – Classifier enhancements based upon analysis of results