SlideShare a Scribd company logo
1 of 34
Download to read offline
GBM:
Distributed Tree Algorithms
on H2O
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
0xdata.com 2
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
0xdata.com 3
Agenda
● Building Blocks For Big Data:
● Vecs & Frames & Chunks
● Distributed Tree Algorithms
● Access Patterns & Execution
● GBM on H2O
● Performance
0xdata.com 4
A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2billion elements
class Vec {
long length(); // more than an int's worth
// fast random access
double at(long idx); // Get the idx'th elem
boolean isNA(long idx);
void set(long idx, double d); // writable
void append(double d); // variable sized
}
0xdata.com 5
JVM 4
Heap
JVM 1
Heap
JVM 2
Heap
JVM 3
Heap
Frames
A Frame: Vec[]
age sex zip ID car
●Vecs aligned
in heaps
●Optimized for
concurrent access
●Random access
any row, any JVM
●But faster if local...
more on that later
0xdata.com 6
JVM 4
Heap
JVM 1
Heap
JVM 2
Heap
JVM 3
Heap
Distributed Data Taxonomy
A Chunk, Unit of Parallel Access
Vec Vec Vec Vec Vec
●Typically 1e3 to
1e6 elements
●Stored compressed
●In byte arrays
●Get/put is a few
clock cycles
including
compression
0xdata.com 7
JVM 4
Heap
JVM 1
Heap
JVM 2
Heap
JVM 3
Heap
Distributed Parallel Execution
Vec Vec Vec Vec Vec
●All CPUs grab
Chunks in parallel
●F/J load balances
●Code moves to Data
●Map/Reduce & F/J
handles all sync
●H2O handles all
comm, data manage
0xdata.com 8
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
0xdata.com 9
Distributed Coding Taxonomy
● No Distribution Coding:
● Whole Algorithms, Whole Vector-Math
● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
● Per-Row (or neighbor row) Math
● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding
● K/V Store, Graph Algo's, e.g. PageRank
0xdata.com 10
Distributed Coding Taxonomy
● No Distribution Coding:
● Whole Algorithms, Whole Vector-Math
● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
● Per-Row (or neighbor row) Math
● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding
● K/V Store, Graph Algo's, e.g. PageRank
Read the docs!
This talk!
Join our GIT!
0xdata.com 11
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
●
Example from Linear Regression, Σ y2
● Auto-parallel, auto-distributed
● Near Fortran speed, Java Ease
double sumY2 = new MRTask() {
double map( double d ) { return d*d; }
double reduce( double d1, double d2 ) {
return d1+d2;
}
}.doAll( vecY );
0xdata.com 12
Simple Data-Parallel Coding
● Map/Reduce Per-Row: State-full
●
Linear Regression Pass1: Σ x, Σ y, Σ y2
class LRPass1 extends MRTask {
double sumX, sumY, sumY2; // I Can Haz State?
void map( double X, double Y ) {
sumX += X; sumY += Y; sumY2 += Y*Y;
}
void reduce( LRPass1 that ) {
sumX += that.sumX ;
sumY += that.sumY ;
sumY2 += that.sumY2;
}
}
0xdata.com 13
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Batch State-full
class LRPass1 extends MRTask {
double sumX, sumY, sumY2;
void map( Chunk CX, Chunk CY ) {// Whole Chunks
for( int i=0; i<CX.len; i++ ){// Batch!
double X = CX.at(i), Y = CY.at(i);
sumX += X; sumY += Y; sumY2 += Y*Y;
}
}
void reduce( LRPass1 that ) {
sumX += that.sumX ;
sumY += that.sumY ;
sumY2 += that.sumY2;
}
}
0xdata.com 14
Distributed Trees
● Overlay a Tree over the data
● Really: Assign a Tree Node to each Row
● Number the Nodes
● Store "Node_ID" per row in a temp Vec
● Make a pass over all Rows
● Nodes not visited in order...
● but all rows, all Nodes efficiently visited
● Do work (e.g. histogram) per Row/Node
Vec nids = v.makeZero();
… nids.set(row,nid)...
0xdata.com 15
Distributed Trees
● An initial Tree
● All rows at nid==0
● MRTask: compute stats
● Use the stats to make a decision...
● (varies by algorithm)!
nid=0
X Y nids
A 1.2 0
B 3.1 0
C -2. 0
D 1.1 0
nid=0nid=0
Tree
MRTask.sum=3.4
0xdata.com 16
Distributed Trees
● Next layer in the Tree (and MRTask across rows)
● Each row: decide!
– If "1<Y<1.5" go left else right
● Compute stats per new leaf
● Each pass across all
rows builds entire layer
nid=0
X Y nids
A 1.2 1
B 3.1 2
C -2. 2
D 1.1 1
nid=01 < Y < 1.5
Tree
sum=1.1
nid=1 nid=2
sum=2.3
0xdata.com 17
Distributed Trees
● Another MRTask, another layer...
● i.e., a 5-deep tree
takes 5 passes
●
nid=0nid=01 < Y < 1.5
Tree
sum=1.1
Y==1.1 leaf
nid=3 nid=4
X Y nids
A 1.2 3
B 3.1 2
C -2. 2
D 1.1 4 sum= -2. sum=3.1
0xdata.com 18
Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histogram in map+reduce calls
class Pass extends MRTask2<Pass> {
void map( Chunk chks[] ) {
Chunk nids = chks[...]; // Node-IDs per row
for( int r=0; r<nids.len; r++ ){// All rows
int nid = nids.at80(i); // Node-ID THIS row
// Lazy: not all Chunks see all Nodes
if( dHisto[nid]==null ) dHisto[nid]=...
// Accumulate histogram stats per node
dHisto[nid].accum(chks,r);
}
}
}.doAll(myDataFrame,nids);
0xdata.com 19
Distributed Trees
● Each pass analyzes one Tree level
● Then decide how to build next level
● Reassign Rows to new levels in another pass
– (actually merge the two passes)
● Builds a Histogram-per-Node
● Which requires a reduce() call to roll up
● All Histograms for one level done in parallel
0xdata.com 20
Distributed Trees: utilities
● “score+build” in one pass:
● Test each row against decision from prior pass
● Assign to a new leaf
● Build histogram on that leaf
● “score”: just walk the tree, and get results
● “compress”: Tree from POJO to byte[]
● Easily 10x smaller, can still walk, score, print
● Plus utilities to walk, print, display
0xdata.com 21
GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
● We run the entire level in parallel & distributed
● Built breadth-first because it's "free"
● More data offset by more CPUs
● Classic GBM otherwise
● Build residuals tree-by-tree
● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java
0xdata.com 22
GBM on Distributed Trees
● Limiting factor: latency in turning over a level
● About 4x faster than R single-node on covtype
● Does the per-level compute in parallel
● Requires sending histograms over network
– Can get big for very deep trees
●
0xdata.com 23
Summary: Write (parallel) Java
● Most simple Java “just works”
● Fast: parallel distributed reads, writes, appends
● Reads same speed as plain Java array loads
● Writes, appends: slightly slower (compression)
● Typically memory bandwidth limited
– (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows strict JMM)
● Also supports transactional updates
0xdata.com 24
Summary: Writing Analytics
● We're writing Big Data Analytics
● Generalized Linear Modeling (ADMM, GLMNET)
– Logistic Regression, Poisson, Gamma
● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!!!
0xdata.com 25
Cool Systems Stuff...
● … that I ran out of space for
● Reliable UDP, integrated w/RPC
● TCP is reliably UNReliable
● Already have a reliable UDP framework, so no prob
● Fork/Join Goodies:
● Priority Queues
● Distributed F/J
● Surviving fork bombs & lost threads
● K/V does JMM via hardware-like MESI protocol
0xdata.com 26
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
0xdata.com 27
The Platform
NFS
HDFS
byte[]
extends Iced
extends DTask
AutoBuffer
RPC
extends DRemoteTask D/F/J
extends MRTask User code?
JVM 1
NFS
HDFS
byte[]
extends Iced
extends DTask
AutoBuffer
RPC
extends DRemoteTask D/F/J
extends MRTask User code?
JVM 2
K/V get/put
UDP / TCP
0xdata.com 28
Other Simple Examples
● Filter & Count (underage males):
● (can pass in any number of Vecs or a Frame)
long sumY2 = new MRTask() {
long map( long age, long sex ) {
return (age<=17 && sex==MALE) ? 1 : 0;
}
long reduce( long d1, long d2 ) {
return d1+d2;
}
}.doAll( vecAge, vecSex );
0xdata.com 29
Other Simple Examples
● Filter into new set (underage males):
● Can write or append subset of rows
– (append order is preserved)
class Filter extends MRTask {
void map(Chunk CRisk, Chunk CAge, Chunk CSex){
for( int i=0; i<CAge.len; i++ )
if( CAge.at(i)<=17 && CSex.at(i)==MALE )
CRisk.append(CAge.at(i)); // build a set
}
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
0xdata.com 30
Other Simple Examples
● Filter into new set (underage males):
● Can write or append subset of rows
– (append order is preserved)
class Filter extends MRTask {
void map(Chunk CRisk, Chunk CAge, Chunk CSex){
for( int i=0; i<CAge.len; i++ )
if( CAge.at(i)<=17 && CSex.at(i)==MALE )
CRisk.append(CAge.at(i)); // build a set
}
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
0xdata.com 31
Other Simple Examples
● Group-by: count of car-types by age
class AgeHisto extends MRTask {
long carAges[][]; // count of cars by age
void map( Chunk CAge, Chunk CCar ) {
carAges = new long[numAges][numCars];
for( int i=0; i<CAge.len; i++ )
carAges[CAge.at(i)][CCar.at(i)]++;
}
void reduce( AgeHisto that ) {
for( int i=0; i<carAges.length; i++ )
for( int j=0; i<carAges[j].length; j++ )
carAges[i][j] += that.carAges[i][j];
}
}
0xdata.com 32
class AgeHisto extends MRTask {
long carAges[][]; // count of cars by age
void map( Chunk CAge, Chunk CCar ) {
carAges = new long[numAges][numCars];
for( int i=0; i<CAge.len; i++ )
carAges[CAge.at(i)][CCar.at(i)]++;
}
void reduce( AgeHisto that ) {
for( int i=0; i<carAges.length; i++ )
for( int j=0; i<carAges[j].length; j++ )
carAges[i][j] += that.carAges[i][j];
}
}
Other Simple Examples
● Group-by: count of car-types by age
Setting carAges in map() makes it an output field.
Private per-map call, single-threaded write access.
Must be rolled-up in the reduce call.
Setting carAges in map makes it an output field.
Private per-map call, single-threaded write access.
Must be rolled-up in the reduce call.
0xdata.com 33
Other Simple Examples
● Uniques
● Uses distributed hash set
class Uniques extends MRTask {
DNonBlockingHashSet<Long> dnbhs = new ...;
void map( long id ) { dnbhs.add(id); }
void reduce( Uniques that ) {
dnbhs.putAll(that.dnbhs);
}
};
long uniques = new Uniques().
doAll( vecVistors ).dnbhs.size();
0xdata.com 34
Other Simple Examples
● Uniques
● Uses distributed hash set
class Uniques extends MRTask {
DNonBlockingHashSet<Long> dnbhs = new ...;
void map( long id ) { dnbhs.add(id); }
void reduce( Uniques that ) {
dnbhs.putAll(that.dnbhs);
}
};
long uniques = new Uniques().
doAll( vecVistors ).dnbhs.size();
Setting dnbhs in <init> makes it an input field.
Shared across all maps(). Often read-only.
This one is written, so needs a reduce.

More Related Content

What's hot

Advanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big DataAdvanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big DataVictor Smirnov
 
Predicate-Preserving Collision-Resistant Hashing
Predicate-Preserving  Collision-Resistant HashingPredicate-Preserving  Collision-Resistant Hashing
Predicate-Preserving Collision-Resistant HashingPhilippe Camacho, Ph.D.
 
AES effecitve software implementation
AES effecitve software implementationAES effecitve software implementation
AES effecitve software implementationRoman Oliynykov
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelJavaDayUA
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
The No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelThe No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelRishikese MR
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Parallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр ПавлишакParallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр ПавлишакIgor Bronovskyy
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data eraBill GU
 
IR-ranking
IR-rankingIR-ranking
IR-rankingFELIX75
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldJihoon Son
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jamsBol.com Techlab
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Tathagata Das
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 

What's hot (20)

Advanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big DataAdvanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big Data
 
User biglm
User biglmUser biglm
User biglm
 
Predicate-Preserving Collision-Resistant Hashing
Predicate-Preserving  Collision-Resistant HashingPredicate-Preserving  Collision-Resistant Hashing
Predicate-Preserving Collision-Resistant Hashing
 
RealmDB for Android
RealmDB for AndroidRealmDB for Android
RealmDB for Android
 
Database Sizing
Database SizingDatabase Sizing
Database Sizing
 
AES effecitve software implementation
AES effecitve software implementationAES effecitve software implementation
AES effecitve software implementation
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next level
 
Nicety of Java 8 Multithreading
Nicety of Java 8 MultithreadingNicety of Java 8 Multithreading
Nicety of Java 8 Multithreading
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
icwet1097
icwet1097icwet1097
icwet1097
 
The No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelThe No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra Model
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Parallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр ПавлишакParallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр Павлишак
 
Modern software design in Big data era
Modern software design in Big data eraModern software design in Big data era
Modern software design in Big data era
 
IR-ranking
IR-rankingIR-ranking
IR-ranking
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Secure Hash Algorithm
Secure Hash AlgorithmSecure Hash Algorithm
Secure Hash Algorithm
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jams
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 

Viewers also liked

21st Century University feasibility study
21st Century University feasibility study 21st Century University feasibility study
21st Century University feasibility study Jouni Eho
 
Building Random Forest at Scale
Building Random Forest at ScaleBuilding Random Forest at Scale
Building Random Forest at ScaleSri Ambati
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
 
PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014Sri Ambati
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through ExamplesSri Ambati
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...Sri Ambati
 
Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati,
Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati, Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati,
Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati, Sri Ambati
 
H2O Big Data Environments
H2O Big Data EnvironmentsH2O Big Data Environments
H2O Big Data EnvironmentsSri Ambati
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling WaterSri Ambati
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckSri Ambati
 
Designing chatbot personalities
Designing chatbot personalitiesDesigning chatbot personalities
Designing chatbot personalitiesEva Lettner
 
Chatbot - new opportunities and insights
Chatbot - new opportunities and insightsChatbot - new opportunities and insights
Chatbot - new opportunities and insightsPo-Cheng Chu
 
Artificially Intelligent chatbot Implementation
Artificially Intelligent chatbot ImplementationArtificially Intelligent chatbot Implementation
Artificially Intelligent chatbot ImplementationRakesh Chintha
 
Transform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine LearningTransform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine LearningSri Ambati
 
Chatbot Artificial Intelligence
Chatbot Artificial IntelligenceChatbot Artificial Intelligence
Chatbot Artificial IntelligenceMd. Mahedi Mahfuj
 

Viewers also liked (20)

21st Century University feasibility study
21st Century University feasibility study 21st Century University feasibility study
21st Century University feasibility study
 
Building Random Forest at Scale
Building Random Forest at ScaleBuilding Random Forest at Scale
Building Random Forest at Scale
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014PayPal's Fraud Detection with Deep Learning in H2O World 2014
PayPal's Fraud Detection with Deep Learning in H2O World 2014
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...MLconf - Distributed Deep Learning for Classification and Regression Problems...
MLconf - Distributed Deep Learning for Classification and Regression Problems...
 
Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati,
Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati, Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati,
Transformation, H2O Open Dallas 2016, Keynote by Sri Ambati,
 
H2O Big Data Environments
H2O Big Data EnvironmentsH2O Big Data Environments
H2O Big Data Environments
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
H2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal MalohlavaH2O World - Sparkling Water - Michal Malohlava
H2O World - Sparkling Water - Michal Malohlava
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray Peck
 
Designing chatbot personalities
Designing chatbot personalitiesDesigning chatbot personalities
Designing chatbot personalities
 
La rivoluzione dei chatbot
La rivoluzione dei chatbotLa rivoluzione dei chatbot
La rivoluzione dei chatbot
 
Chatbot - new opportunities and insights
Chatbot - new opportunities and insightsChatbot - new opportunities and insights
Chatbot - new opportunities and insights
 
Artificially Intelligent chatbot Implementation
Artificially Intelligent chatbot ImplementationArtificially Intelligent chatbot Implementation
Artificially Intelligent chatbot Implementation
 
Transform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine LearningTransform your Business with AI, Deep Learning and Machine Learning
Transform your Business with AI, Deep Learning and Machine Learning
 
Chatbot Artificial Intelligence
Chatbot Artificial IntelligenceChatbot Artificial Intelligence
Chatbot Artificial Intelligence
 

Similar to GBM in H2O with Cliff Click: H2O API

Building a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformBuilding a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformCliff Click
 
Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sri Ambati
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleSri Ambati
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerPlatonov Sergey
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
Simd programming introduction
Simd programming introductionSimd programming introduction
Simd programming introductionChamp Yen
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14AMD Developer Central
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Satalia
 
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)PROIDEA
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systemsVsevolod Stakhov
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra ExplainedEric Evans
 
Stevens 3rd Annual Conference Hfc2011
Stevens 3rd Annual Conference Hfc2011Stevens 3rd Annual Conference Hfc2011
Stevens 3rd Annual Conference Hfc2011jzw200
 
C# as a System Language
C# as a System LanguageC# as a System Language
C# as a System LanguageScyllaDB
 

Similar to GBM in H2O with Cliff Click: H2O API (20)

2013 05 ny
2013 05 ny2013 05 ny
2013 05 ny
 
Building a Big Data Machine Learning Platform
Building a Big Data Machine Learning PlatformBuilding a Big Data Machine Learning Platform
Building a Big Data Machine Learning Platform
 
Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt Dowle
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory Sanitizer
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
HPC Essentials 0
HPC Essentials 0HPC Essentials 0
HPC Essentials 0
 
Simd programming introduction
Simd programming introductionSimd programming introduction
Simd programming introduction
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
 
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
4Developers 2018: Ile (nie) wiesz o strukturach w .NET (Łukasz Pyrzyk)
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Stevens 3rd Annual Conference Hfc2011
Stevens 3rd Annual Conference Hfc2011Stevens 3rd Annual Conference Hfc2011
Stevens 3rd Annual Conference Hfc2011
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
C# as a System Language
C# as a System LanguageC# as a System Language
C# as a System Language
 

More from Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

More from Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Recently uploaded

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

GBM in H2O with Cliff Click: H2O API

  • 1. GBM: Distributed Tree Algorithms on H2O Cliff Click, CTO 0xdata cliffc@0xdata.com http://0xdata.com http://cliffc.org/blog
  • 2. 0xdata.com 2 H2O is... ● Pure Java, Open Source: 0xdata.com ● https://github.com/0xdata/h2o/ ● A Platform for doing Math ● Parallel Distributed Math ● In-memory analytics: GLM, GBM, RF, Logistic Reg ● Accessible via REST & JSON ● A K/V Store: ~150ns per get or put ● Distributed Fork/Join + Map/Reduce + K/V
  • 3. 0xdata.com 3 Agenda ● Building Blocks For Big Data: ● Vecs & Frames & Chunks ● Distributed Tree Algorithms ● Access Patterns & Execution ● GBM on H2O ● Performance
  • 4. 0xdata.com 4 A Collection of Distributed Vectors // A Distributed Vector // much more than 2billion elements class Vec { long length(); // more than an int's worth // fast random access double at(long idx); // Get the idx'th elem boolean isNA(long idx); void set(long idx, double d); // writable void append(double d); // variable sized }
  • 5. 0xdata.com 5 JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap Frames A Frame: Vec[] age sex zip ID car ●Vecs aligned in heaps ●Optimized for concurrent access ●Random access any row, any JVM ●But faster if local... more on that later
  • 6. 0xdata.com 6 JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap Distributed Data Taxonomy A Chunk, Unit of Parallel Access Vec Vec Vec Vec Vec ●Typically 1e3 to 1e6 elements ●Stored compressed ●In byte arrays ●Get/put is a few clock cycles including compression
  • 7. 0xdata.com 7 JVM 4 Heap JVM 1 Heap JVM 2 Heap JVM 3 Heap Distributed Parallel Execution Vec Vec Vec Vec Vec ●All CPUs grab Chunks in parallel ●F/J load balances ●Code moves to Data ●Map/Reduce & F/J handles all sync ●H2O handles all comm, data manage
  • 8. 0xdata.com 8 Distributed Data Taxonomy Frame – a collection of Vecs Vec – a collection of Chunks Chunk – a collection of 1e3 to 1e6 elems elem – a java double Row i – i'th elements of all the Vecs in a Frame
  • 9. 0xdata.com 9 Distributed Coding Taxonomy ● No Distribution Coding: ● Whole Algorithms, Whole Vector-Math ● REST + JSON: e.g. load data, GLM, get results ● Simple Data-Parallel Coding: ● Per-Row (or neighbor row) Math ● Map/Reduce-style: e.g. Any dense linear algebra ● Complex Data-Parallel Coding ● K/V Store, Graph Algo's, e.g. PageRank
  • 10. 0xdata.com 10 Distributed Coding Taxonomy ● No Distribution Coding: ● Whole Algorithms, Whole Vector-Math ● REST + JSON: e.g. load data, GLM, get results ● Simple Data-Parallel Coding: ● Per-Row (or neighbor row) Math ● Map/Reduce-style: e.g. Any dense linear algebra ● Complex Data-Parallel Coding ● K/V Store, Graph Algo's, e.g. PageRank Read the docs! This talk! Join our GIT!
  • 11. 0xdata.com 11 Simple Data-Parallel Coding ● Map/Reduce Per-Row: Stateless ● Example from Linear Regression, Σ y2 ● Auto-parallel, auto-distributed ● Near Fortran speed, Java Ease double sumY2 = new MRTask() { double map( double d ) { return d*d; } double reduce( double d1, double d2 ) { return d1+d2; } }.doAll( vecY );
  • 12. 0xdata.com 12 Simple Data-Parallel Coding ● Map/Reduce Per-Row: State-full ● Linear Regression Pass1: Σ x, Σ y, Σ y2 class LRPass1 extends MRTask { double sumX, sumY, sumY2; // I Can Haz State? void map( double X, double Y ) { sumX += X; sumY += Y; sumY2 += Y*Y; } void reduce( LRPass1 that ) { sumX += that.sumX ; sumY += that.sumY ; sumY2 += that.sumY2; } }
  • 13. 0xdata.com 13 Simple Data-Parallel Coding ● Map/Reduce Per-Row: Batch State-full class LRPass1 extends MRTask { double sumX, sumY, sumY2; void map( Chunk CX, Chunk CY ) {// Whole Chunks for( int i=0; i<CX.len; i++ ){// Batch! double X = CX.at(i), Y = CY.at(i); sumX += X; sumY += Y; sumY2 += Y*Y; } } void reduce( LRPass1 that ) { sumX += that.sumX ; sumY += that.sumY ; sumY2 += that.sumY2; } }
  • 14. 0xdata.com 14 Distributed Trees ● Overlay a Tree over the data ● Really: Assign a Tree Node to each Row ● Number the Nodes ● Store "Node_ID" per row in a temp Vec ● Make a pass over all Rows ● Nodes not visited in order... ● but all rows, all Nodes efficiently visited ● Do work (e.g. histogram) per Row/Node Vec nids = v.makeZero(); … nids.set(row,nid)...
  • 15. 0xdata.com 15 Distributed Trees ● An initial Tree ● All rows at nid==0 ● MRTask: compute stats ● Use the stats to make a decision... ● (varies by algorithm)! nid=0 X Y nids A 1.2 0 B 3.1 0 C -2. 0 D 1.1 0 nid=0nid=0 Tree MRTask.sum=3.4
  • 16. 0xdata.com 16 Distributed Trees ● Next layer in the Tree (and MRTask across rows) ● Each row: decide! – If "1<Y<1.5" go left else right ● Compute stats per new leaf ● Each pass across all rows builds entire layer nid=0 X Y nids A 1.2 1 B 3.1 2 C -2. 2 D 1.1 1 nid=01 < Y < 1.5 Tree sum=1.1 nid=1 nid=2 sum=2.3
  • 17. 0xdata.com 17 Distributed Trees ● Another MRTask, another layer... ● i.e., a 5-deep tree takes 5 passes ● nid=0nid=01 < Y < 1.5 Tree sum=1.1 Y==1.1 leaf nid=3 nid=4 X Y nids A 1.2 3 B 3.1 2 C -2. 2 D 1.1 4 sum= -2. sum=3.1
  • 18. 0xdata.com 18 Distributed Trees ● Each pass is over one layer in the tree ● Builds per-node histogram in map+reduce calls class Pass extends MRTask2<Pass> { void map( Chunk chks[] ) { Chunk nids = chks[...]; // Node-IDs per row for( int r=0; r<nids.len; r++ ){// All rows int nid = nids.at80(i); // Node-ID THIS row // Lazy: not all Chunks see all Nodes if( dHisto[nid]==null ) dHisto[nid]=... // Accumulate histogram stats per node dHisto[nid].accum(chks,r); } } }.doAll(myDataFrame,nids);
  • 19. 0xdata.com 19 Distributed Trees ● Each pass analyzes one Tree level ● Then decide how to build next level ● Reassign Rows to new levels in another pass – (actually merge the two passes) ● Builds a Histogram-per-Node ● Which requires a reduce() call to roll up ● All Histograms for one level done in parallel
  • 20. 0xdata.com 20 Distributed Trees: utilities ● “score+build” in one pass: ● Test each row against decision from prior pass ● Assign to a new leaf ● Build histogram on that leaf ● “score”: just walk the tree, and get results ● “compress”: Tree from POJO to byte[] ● Easily 10x smaller, can still walk, score, print ● Plus utilities to walk, print, display
  • 21. 0xdata.com 21 GBM on Distributed Trees ● GBM builds 1 Tree, 1 level at a time, but... ● We run the entire level in parallel & distributed ● Built breadth-first because it's "free" ● More data offset by more CPUs ● Classic GBM otherwise ● Build residuals tree-by-tree ● Tuning knobs: trees, depth, shrinkage, min_rows ● Pure Java
  • 22. 0xdata.com 22 GBM on Distributed Trees ● Limiting factor: latency in turning over a level ● About 4x faster than R single-node on covtype ● Does the per-level compute in parallel ● Requires sending histograms over network – Can get big for very deep trees ●
  • 23. 0xdata.com 23 Summary: Write (parallel) Java ● Most simple Java “just works” ● Fast: parallel distributed reads, writes, appends ● Reads same speed as plain Java array loads ● Writes, appends: slightly slower (compression) ● Typically memory bandwidth limited – (may be CPU limited in a few cases) ● Slower: conflicting writes (but follows strict JMM) ● Also supports transactional updates
  • 24. 0xdata.com 24 Summary: Writing Analytics ● We're writing Big Data Analytics ● Generalized Linear Modeling (ADMM, GLMNET) – Logistic Regression, Poisson, Gamma ● Random Forest, GBM, KMeans++, KNN ● State-of-the-art Algorithms, running Distributed ● Solidly working on 100G datasets ● Heading for Tera Scale ● Paying customers (in production!) ● Come write your own (distributed) algorithm!!!
  • 25. 0xdata.com 25 Cool Systems Stuff... ● … that I ran out of space for ● Reliable UDP, integrated w/RPC ● TCP is reliably UNReliable ● Already have a reliable UDP framework, so no prob ● Fork/Join Goodies: ● Priority Queues ● Distributed F/J ● Surviving fork bombs & lost threads ● K/V does JMM via hardware-like MESI protocol
  • 26. 0xdata.com 26 H2O is... ● Pure Java, Open Source: 0xdata.com ● https://github.com/0xdata/h2o/ ● A Platform for doing Math ● Parallel Distributed Math ● In-memory analytics: GLM, GBM, RF, Logistic Reg ● Accessible via REST & JSON ● A K/V Store: ~150ns per get or put ● Distributed Fork/Join + Map/Reduce + K/V
  • 27. 0xdata.com 27 The Platform NFS HDFS byte[] extends Iced extends DTask AutoBuffer RPC extends DRemoteTask D/F/J extends MRTask User code? JVM 1 NFS HDFS byte[] extends Iced extends DTask AutoBuffer RPC extends DRemoteTask D/F/J extends MRTask User code? JVM 2 K/V get/put UDP / TCP
  • 28. 0xdata.com 28 Other Simple Examples ● Filter & Count (underage males): ● (can pass in any number of Vecs or a Frame) long sumY2 = new MRTask() { long map( long age, long sex ) { return (age<=17 && sex==MALE) ? 1 : 0; } long reduce( long d1, long d2 ) { return d1+d2; } }.doAll( vecAge, vecSex );
  • 29. 0xdata.com 29 Other Simple Examples ● Filter into new set (underage males): ● Can write or append subset of rows – (append order is preserved) class Filter extends MRTask { void map(Chunk CRisk, Chunk CAge, Chunk CSex){ for( int i=0; i<CAge.len; i++ ) if( CAge.at(i)<=17 && CSex.at(i)==MALE ) CRisk.append(CAge.at(i)); // build a set } }; Vec risk = new AppendableVec(); new Filter().doAll( risk, vecAge, vecSex ); ...risk... // all the underage males
  • 30. 0xdata.com 30 Other Simple Examples ● Filter into new set (underage males): ● Can write or append subset of rows – (append order is preserved) class Filter extends MRTask { void map(Chunk CRisk, Chunk CAge, Chunk CSex){ for( int i=0; i<CAge.len; i++ ) if( CAge.at(i)<=17 && CSex.at(i)==MALE ) CRisk.append(CAge.at(i)); // build a set } }; Vec risk = new AppendableVec(); new Filter().doAll( risk, vecAge, vecSex ); ...risk... // all the underage males
  • 31. 0xdata.com 31 Other Simple Examples ● Group-by: count of car-types by age class AgeHisto extends MRTask { long carAges[][]; // count of cars by age void map( Chunk CAge, Chunk CCar ) { carAges = new long[numAges][numCars]; for( int i=0; i<CAge.len; i++ ) carAges[CAge.at(i)][CCar.at(i)]++; } void reduce( AgeHisto that ) { for( int i=0; i<carAges.length; i++ ) for( int j=0; i<carAges[j].length; j++ ) carAges[i][j] += that.carAges[i][j]; } }
  • 32. 0xdata.com 32 class AgeHisto extends MRTask { long carAges[][]; // count of cars by age void map( Chunk CAge, Chunk CCar ) { carAges = new long[numAges][numCars]; for( int i=0; i<CAge.len; i++ ) carAges[CAge.at(i)][CCar.at(i)]++; } void reduce( AgeHisto that ) { for( int i=0; i<carAges.length; i++ ) for( int j=0; i<carAges[j].length; j++ ) carAges[i][j] += that.carAges[i][j]; } } Other Simple Examples ● Group-by: count of car-types by age Setting carAges in map() makes it an output field. Private per-map call, single-threaded write access. Must be rolled-up in the reduce call. Setting carAges in map makes it an output field. Private per-map call, single-threaded write access. Must be rolled-up in the reduce call.
  • 33. 0xdata.com 33 Other Simple Examples ● Uniques ● Uses distributed hash set class Uniques extends MRTask { DNonBlockingHashSet<Long> dnbhs = new ...; void map( long id ) { dnbhs.add(id); } void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); } }; long uniques = new Uniques(). doAll( vecVistors ).dnbhs.size();
  • 34. 0xdata.com 34 Other Simple Examples ● Uniques ● Uses distributed hash set class Uniques extends MRTask { DNonBlockingHashSet<Long> dnbhs = new ...; void map( long id ) { dnbhs.add(id); } void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); } }; long uniques = new Uniques(). doAll( vecVistors ).dnbhs.size(); Setting dnbhs in <init> makes it an input field. Shared across all maps(). Often read-only. This one is written, so needs a reduce.