- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. 0xdata.com 2
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
3. 0xdata.com 3
Agenda
● Building Blocks For Big Data:
● Vecs & Frames & Chunks
● Distributed Tree Algorithms
● Access Patterns & Execution
● GBM on H2O
● Performance
4. 0xdata.com 4
A Collection of Distributed Vectors
// A Distributed Vector
// much more than 2billion elements
class Vec {
long length(); // more than an int's worth
// fast random access
double at(long idx); // Get the idx'th elem
boolean isNA(long idx);
void set(long idx, double d); // writable
void append(double d); // variable sized
}
5. 0xdata.com 5
JVM 4
Heap
JVM 1
Heap
JVM 2
Heap
JVM 3
Heap
Frames
A Frame: Vec[]
age sex zip ID car
●Vecs aligned
in heaps
●Optimized for
concurrent access
●Random access
any row, any JVM
●But faster if local...
more on that later
6. 0xdata.com 6
JVM 4
Heap
JVM 1
Heap
JVM 2
Heap
JVM 3
Heap
Distributed Data Taxonomy
A Chunk, Unit of Parallel Access
Vec Vec Vec Vec Vec
●Typically 1e3 to
1e6 elements
●Stored compressed
●In byte arrays
●Get/put is a few
clock cycles
including
compression
7. 0xdata.com 7
JVM 4
Heap
JVM 1
Heap
JVM 2
Heap
JVM 3
Heap
Distributed Parallel Execution
Vec Vec Vec Vec Vec
●All CPUs grab
Chunks in parallel
●F/J load balances
●Code moves to Data
●Map/Reduce & F/J
handles all sync
●H2O handles all
comm, data manage
8. 0xdata.com 8
Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a java double
Row i – i'th elements of all the Vecs in a Frame
9. 0xdata.com 9
Distributed Coding Taxonomy
● No Distribution Coding:
● Whole Algorithms, Whole Vector-Math
● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
● Per-Row (or neighbor row) Math
● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding
● K/V Store, Graph Algo's, e.g. PageRank
10. 0xdata.com 10
Distributed Coding Taxonomy
● No Distribution Coding:
● Whole Algorithms, Whole Vector-Math
● REST + JSON: e.g. load data, GLM, get results
● Simple Data-Parallel Coding:
● Per-Row (or neighbor row) Math
● Map/Reduce-style: e.g. Any dense linear algebra
● Complex Data-Parallel Coding
● K/V Store, Graph Algo's, e.g. PageRank
Read the docs!
This talk!
Join our GIT!
11. 0xdata.com 11
Simple Data-Parallel Coding
● Map/Reduce Per-Row: Stateless
●
Example from Linear Regression, Σ y2
● Auto-parallel, auto-distributed
● Near Fortran speed, Java Ease
double sumY2 = new MRTask() {
double map( double d ) { return d*d; }
double reduce( double d1, double d2 ) {
return d1+d2;
}
}.doAll( vecY );
14. 0xdata.com 14
Distributed Trees
● Overlay a Tree over the data
● Really: Assign a Tree Node to each Row
● Number the Nodes
● Store "Node_ID" per row in a temp Vec
● Make a pass over all Rows
● Nodes not visited in order...
● but all rows, all Nodes efficiently visited
● Do work (e.g. histogram) per Row/Node
Vec nids = v.makeZero();
… nids.set(row,nid)...
15. 0xdata.com 15
Distributed Trees
● An initial Tree
● All rows at nid==0
● MRTask: compute stats
● Use the stats to make a decision...
● (varies by algorithm)!
nid=0
X Y nids
A 1.2 0
B 3.1 0
C -2. 0
D 1.1 0
nid=0nid=0
Tree
MRTask.sum=3.4
16. 0xdata.com 16
Distributed Trees
● Next layer in the Tree (and MRTask across rows)
● Each row: decide!
– If "1<Y<1.5" go left else right
● Compute stats per new leaf
● Each pass across all
rows builds entire layer
nid=0
X Y nids
A 1.2 1
B 3.1 2
C -2. 2
D 1.1 1
nid=01 < Y < 1.5
Tree
sum=1.1
nid=1 nid=2
sum=2.3
17. 0xdata.com 17
Distributed Trees
● Another MRTask, another layer...
● i.e., a 5-deep tree
takes 5 passes
●
nid=0nid=01 < Y < 1.5
Tree
sum=1.1
Y==1.1 leaf
nid=3 nid=4
X Y nids
A 1.2 3
B 3.1 2
C -2. 2
D 1.1 4 sum= -2. sum=3.1
18. 0xdata.com 18
Distributed Trees
● Each pass is over one layer in the tree
● Builds per-node histogram in map+reduce calls
class Pass extends MRTask2<Pass> {
void map( Chunk chks[] ) {
Chunk nids = chks[...]; // Node-IDs per row
for( int r=0; r<nids.len; r++ ){// All rows
int nid = nids.at80(i); // Node-ID THIS row
// Lazy: not all Chunks see all Nodes
if( dHisto[nid]==null ) dHisto[nid]=...
// Accumulate histogram stats per node
dHisto[nid].accum(chks,r);
}
}
}.doAll(myDataFrame,nids);
19. 0xdata.com 19
Distributed Trees
● Each pass analyzes one Tree level
● Then decide how to build next level
● Reassign Rows to new levels in another pass
– (actually merge the two passes)
● Builds a Histogram-per-Node
● Which requires a reduce() call to roll up
● All Histograms for one level done in parallel
20. 0xdata.com 20
Distributed Trees: utilities
● “score+build” in one pass:
● Test each row against decision from prior pass
● Assign to a new leaf
● Build histogram on that leaf
● “score”: just walk the tree, and get results
● “compress”: Tree from POJO to byte[]
● Easily 10x smaller, can still walk, score, print
● Plus utilities to walk, print, display
21. 0xdata.com 21
GBM on Distributed Trees
● GBM builds 1 Tree, 1 level at a time, but...
● We run the entire level in parallel & distributed
● Built breadth-first because it's "free"
● More data offset by more CPUs
● Classic GBM otherwise
● Build residuals tree-by-tree
● Tuning knobs: trees, depth, shrinkage, min_rows
● Pure Java
22. 0xdata.com 22
GBM on Distributed Trees
● Limiting factor: latency in turning over a level
● About 4x faster than R single-node on covtype
● Does the per-level compute in parallel
● Requires sending histograms over network
– Can get big for very deep trees
●
23. 0xdata.com 23
Summary: Write (parallel) Java
● Most simple Java “just works”
● Fast: parallel distributed reads, writes, appends
● Reads same speed as plain Java array loads
● Writes, appends: slightly slower (compression)
● Typically memory bandwidth limited
– (may be CPU limited in a few cases)
● Slower: conflicting writes (but follows strict JMM)
● Also supports transactional updates
24. 0xdata.com 24
Summary: Writing Analytics
● We're writing Big Data Analytics
● Generalized Linear Modeling (ADMM, GLMNET)
– Logistic Regression, Poisson, Gamma
● Random Forest, GBM, KMeans++, KNN
● State-of-the-art Algorithms, running Distributed
● Solidly working on 100G datasets
● Heading for Tera Scale
● Paying customers (in production!)
● Come write your own (distributed) algorithm!!!
25. 0xdata.com 25
Cool Systems Stuff...
● … that I ran out of space for
● Reliable UDP, integrated w/RPC
● TCP is reliably UNReliable
● Already have a reliable UDP framework, so no prob
● Fork/Join Goodies:
● Priority Queues
● Distributed F/J
● Surviving fork bombs & lost threads
● K/V does JMM via hardware-like MESI protocol
26. 0xdata.com 26
H2O is...
● Pure Java, Open Source: 0xdata.com
● https://github.com/0xdata/h2o/
● A Platform for doing Math
● Parallel Distributed Math
● In-memory analytics: GLM, GBM, RF, Logistic Reg
● Accessible via REST & JSON
● A K/V Store: ~150ns per get or put
● Distributed Fork/Join + Map/Reduce + K/V
28. 0xdata.com 28
Other Simple Examples
● Filter & Count (underage males):
● (can pass in any number of Vecs or a Frame)
long sumY2 = new MRTask() {
long map( long age, long sex ) {
return (age<=17 && sex==MALE) ? 1 : 0;
}
long reduce( long d1, long d2 ) {
return d1+d2;
}
}.doAll( vecAge, vecSex );
29. 0xdata.com 29
Other Simple Examples
● Filter into new set (underage males):
● Can write or append subset of rows
– (append order is preserved)
class Filter extends MRTask {
void map(Chunk CRisk, Chunk CAge, Chunk CSex){
for( int i=0; i<CAge.len; i++ )
if( CAge.at(i)<=17 && CSex.at(i)==MALE )
CRisk.append(CAge.at(i)); // build a set
}
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
30. 0xdata.com 30
Other Simple Examples
● Filter into new set (underage males):
● Can write or append subset of rows
– (append order is preserved)
class Filter extends MRTask {
void map(Chunk CRisk, Chunk CAge, Chunk CSex){
for( int i=0; i<CAge.len; i++ )
if( CAge.at(i)<=17 && CSex.at(i)==MALE )
CRisk.append(CAge.at(i)); // build a set
}
};
Vec risk = new AppendableVec();
new Filter().doAll( risk, vecAge, vecSex );
...risk... // all the underage males
31. 0xdata.com 31
Other Simple Examples
● Group-by: count of car-types by age
class AgeHisto extends MRTask {
long carAges[][]; // count of cars by age
void map( Chunk CAge, Chunk CCar ) {
carAges = new long[numAges][numCars];
for( int i=0; i<CAge.len; i++ )
carAges[CAge.at(i)][CCar.at(i)]++;
}
void reduce( AgeHisto that ) {
for( int i=0; i<carAges.length; i++ )
for( int j=0; i<carAges[j].length; j++ )
carAges[i][j] += that.carAges[i][j];
}
}
32. 0xdata.com 32
class AgeHisto extends MRTask {
long carAges[][]; // count of cars by age
void map( Chunk CAge, Chunk CCar ) {
carAges = new long[numAges][numCars];
for( int i=0; i<CAge.len; i++ )
carAges[CAge.at(i)][CCar.at(i)]++;
}
void reduce( AgeHisto that ) {
for( int i=0; i<carAges.length; i++ )
for( int j=0; i<carAges[j].length; j++ )
carAges[i][j] += that.carAges[i][j];
}
}
Other Simple Examples
● Group-by: count of car-types by age
Setting carAges in map() makes it an output field.
Private per-map call, single-threaded write access.
Must be rolled-up in the reduce call.
Setting carAges in map makes it an output field.
Private per-map call, single-threaded write access.
Must be rolled-up in the reduce call.
33. 0xdata.com 33
Other Simple Examples
● Uniques
● Uses distributed hash set
class Uniques extends MRTask {
DNonBlockingHashSet<Long> dnbhs = new ...;
void map( long id ) { dnbhs.add(id); }
void reduce( Uniques that ) {
dnbhs.putAll(that.dnbhs);
}
};
long uniques = new Uniques().
doAll( vecVistors ).dnbhs.size();
34. 0xdata.com 34
Other Simple Examples
● Uniques
● Uses distributed hash set
class Uniques extends MRTask {
DNonBlockingHashSet<Long> dnbhs = new ...;
void map( long id ) { dnbhs.add(id); }
void reduce( Uniques that ) {
dnbhs.putAll(that.dnbhs);
}
};
long uniques = new Uniques().
doAll( vecVistors ).dnbhs.size();
Setting dnbhs in <init> makes it an input field.
Shared across all maps(). Often read-only.
This one is written, so needs a reduce.