This document provides an overview of SFrame, a scalable dataframe for machine learning developed by Dato. SFrame was created to handle large datasets and enable fast machine learning: it uses a columnar storage format and lazy evaluation to optimize performance, and its out-of-core design lets it handle datasets with billions of rows efficiently. An SGraph extension supports graph analytics on very large graphs with billions of edges, and a variety of machine learning algorithms are built on SFrame to leverage its scalability.
Dato Confidential
Who am I?
• PhD Machine Learning, Carnegie Mellon University
- Advised by Prof. Carlos Guestrin (CEO Dato)
- Thesis work on Distributed Machine Learning
• PowerGraph
World’s Fastest Distributed Graph Processing Framework
• Co-founder and now Chief Architect at Dato
SFrame Python API Example
Make a little SFrame of 1 column and 5 values:
sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
Normalize the column x by its sum:
sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
sf['x-squared'] = sf['x'].apply(lambda x: x*x)
Create a new column using a vectorized operator:
sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
sf2 = sf[['x', 'x-squared']]
SFrame Design Principles
• Graceful Degradation as 1st principle
- Always Works
• Rich Datatypes
- Strong schema types: int, double, string, image, ...
- Weak schema types: list, dictionary (can contain arbitrary JSON)
• Columnar Architecture
- Easy feature engineering + vectorized feature operations
- Immutable columns + lazy evaluation
- Statistics + visualization + sketches
Columns are first class: like Pandas, SFrame is essentially an arbitrary collection of columns.
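To make the "collection of columns" idea concrete, here is a minimal sketch of a column-first table in plain Python. The names (`MiniFrame`, `select`) are illustrative, not the real SFrame API; the point is that column operations and column subselection never need a row-wise copy.

```python
# Minimal sketch of a column-first table (names are illustrative,
# not the real SFrame implementation): a frame is just a named
# collection of columns.
class MiniFrame:
    def __init__(self, columns):
        self.columns = dict(columns)      # column name -> list of values

    def __getitem__(self, name):
        return self.columns[name]

    def __setitem__(self, name, values):
        self.columns[name] = list(values)

    def select(self, names):
        # Columns are shared, not copied, so this is essentially free
        return MiniFrame({n: self.columns[n] for n in names})

f = MiniFrame({'x': [1, 2, 3, 4, 5]})
total = sum(f['x'])
f['x'] = [v / total for v in f['x']]          # normalize one column
f['x-squared'] = [v * v for v in f['x']]      # derive a new column
f2 = f.select(['x', 'x-squared'])             # cheap column subselection
```

Because columns are immutable in the real system, the shared-reference trick in `select` is safe at any scale, which is why subselecting columns is essentially free.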
Internal Representation

sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']

[Diagram: netflix_tr.frame and netflix_norm.frame each store columns user, movie, rating on disk; sf and sf2 reference them. After the assignment, sf's nrating column is shared with sf2's rating column rather than copied.]
Internal Representation

sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']

[Diagram: both frames store columns user, movie, rating on disk; sf additionally references nrating, shared from sf2; diff is an anonymous column attached to no frame.]

A column is an SArray, our equivalent of an external-memory NumPy array.
Internal Representation

sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']
sf['diff'] = diff

[Diagram: assigning the anonymous SArray diff into sf simply attaches the existing column to the frame; no data is copied.]
Physical Representation

Type-aware compression:
• Variable bit-length encoding
• Frame-of-reference encoding
• ZigZag encoding
• Delta / delta-ZigZag encoding
• Dictionary encoding
• General-purpose LZ4

[Diagram: an SFrame file storing the user, movie, rating columns.]
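As a sketch of what two of these encodings do, here is the ZigZag and delta-ZigZag arithmetic in plain Python. The real SFrame codecs are C++; this only illustrates the transform, and the function names are ours.

```python
def zigzag(n):
    # Interleave signed values onto unsigned codes so small magnitudes
    # get small codes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def unzigzag(z):
    return (z >> 1) ^ -(z & 1)

def delta_zigzag_encode(values):
    # Store the first value, then zigzag-encoded differences; sorted or
    # slowly varying columns become runs of tiny numbers.
    out, prev = [zigzag(values[0])], values[0]
    for v in values[1:]:
        out.append(zigzag(v - prev))
        prev = v
    return out

def delta_zigzag_decode(codes):
    vals = [unzigzag(codes[0])]
    for c in codes[1:]:
        vals.append(vals[-1] + unzigzag(c))
    return vals

data = [1000, 1001, 1003, 1002, 1010]
codes = delta_zigzag_encode(data)     # [2000, 2, 4, 1, 16]
assert delta_zigzag_decode(codes) == data
```

The small codes produced here are what makes a downstream variable bit-length or general-purpose (LZ4) pass effective.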
Lazy Evaluation + Query Optimization
• Avoid writing to disk/SSD whenever possible.
• Common model for many other systems
(Spark, Blaze, etc.)
• Extensions from classical “SQL” optimization
methods needed to maintain both “tables”
(SFrames) and “columns” (SArrays)
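The deferred-execution idea can be sketched in a few lines (illustrative only, not the real engine): operations build a pipeline of closures, and nothing runs until `materialize()` is called, so a chain of operations executes in one pass with no intermediate column written to disk.

```python
# Tiny lazy-evaluation sketch: each op wraps the previous producer
# instead of computing anything.
class LazyColumn:
    def __init__(self, producer):
        self.producer = producer            # zero-arg fn -> list of values

    def map(self, f):
        return LazyColumn(lambda: [f(v) for v in self.producer()])

    def mul(self, other):
        return LazyColumn(lambda: [a * b
                                   for a, b in zip(self.producer(),
                                                   other.producer())])

    def materialize(self):
        return self.producer()

x = LazyColumn(lambda: [1, 2, 3, 4, 5])
pipeline = x.map(lambda v: v * v).mul(x)    # deferred: nothing computed yet
result = pipeline.materialize()             # [1, 8, 27, 64, 125]
```

A query optimizer sits one level above this sketch: because the pipeline is a data structure rather than eagerly computed results, it can be rewritten (fused, reordered, pruned) before execution.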
Execution time Optimizations
• C++ coroutine incremental decoder.
- Only compressed blocks need to be resident in memory; no block is ever fully decompressed.
• Runtime Input Skipping
- Significant performance optimization for small filters. Avoid
reading data that will be thrown away.
• Etc.
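Runtime input skipping can be sketched with per-block min/max metadata (a hypothetical layout, not the actual SFrame file format): for a selective filter, whole blocks are skipped or taken wholesale without decoding each row.

```python
# Sketch of block skipping for a "greater than" filter.
def filter_greater(blocks, threshold):
    """blocks: list of (values, block_min, block_max) tuples."""
    out = []
    for values, lo, hi in blocks:
        if hi <= threshold:                  # no row can pass: skip block
            continue
        if lo > threshold:                   # every row passes: take it all
            out.extend(values)
        else:                                # mixed block: scan row by row
            out.extend(v for v in values if v > threshold)
    return out

blocks = [([1, 5, 9], 1, 9), ([20, 30], 20, 30), ([2, 3], 2, 3)]
print(filter_greater(blocks, 10))            # [20, 30]
```

For a highly selective predicate most blocks fall into the "skip" branch, which is why small filters see a significant speedup: the data that would be thrown away is never read or decompressed.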
Filter Performance

SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100

[Chart: Query 1b (X = 100) runtimes on 5 machines, 0-75 s scale, for Redshift, Impala (1.2.3), Spark (1.3), Hive 0.12, Tez 0.2.0, and GraphLab Create (1.4).]
Filter Performance

SELECT pageURL, pageRank FROM rankings WHERE pageRank > 10

[Chart: Query 1c (X = 10) runtimes on 5 machines, 0-50 s scale, for the same six systems.]
Scalability
• Largest synthetic test:
- 950 columns, 1B rows, dense numeric
- 1 trillion numeric values
• No real limit on the number of rows; practically 100s-1000s of columns
PageRank on Twitter Graph

[Chart: runtime in seconds (0-700) for GraphLab Create on 1 machine versus GraphX and Giraph on 16 machines.]

* Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica. GraphX: A Resilient Distributed Graph System on Spark. GRADES 2013.
** Aapo Kyrola, Guy Blelloch, Carlos Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. OSDI 2012.
Connected Components on Twitter Graph

[Chart: runtime in seconds (0-300) for GraphLab Create on 1 machine versus GraphX and Giraph on 16 machines.]
Scalability
1x r3.8xlarge using 1x SSD.
Common Crawl graph: 3.5 billion nodes and 128 billion edges, the largest publicly available graph.
PageRank: 9 min per iteration.
Connected components: ~1 hr.
Represented naively, this graph would occupy 2 TB of memory.
ML Algorithms
• Uses SFrames as the data source, but models must fit in memory. Implementations are in C++ and use whatever it takes to go fast:
- Parallel execution
- Controlled memory-access races (some algorithms are robust to races if you are careful)
- Multiple solvers for different data regimes
- GPUs
ML Algorithms
• Robustness and ease of use, not just performance, are key objectives: sparse/dense regimes, etc.
- Probably the most robust linear solvers out there.
"Garbage In, Guidance Out."
- VP Eng. Sethu Raman
ML Algorithms
• Linear Models
• Boosted Trees
• Recommender Algorithms
- (with features, ranking regime, etc.)
• Deep Learning (GPU support)
• Clustering
• Fuzzy Matching
• Text models
• Etc.
Logistic Regression

Runtime (s):
- GraphLab Create: 291
- MLlib (5 nodes): 651
- MLlib (1 node): 2,092
- Scikit-Learn: 2,361

Dataset source: LIBLinear binary classification datasets.
KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed.
Task: predict student performance on math problems based on interactions with a tutoring system.
Logistic regression on a 100GB dataset on commodity hardware: 1.5 hrs.
Learn a collaborative filtering model on 4 billion user-item ratings in 1 hour (scales to 20B).
Good Reasons To Distribute
• IO Bound tasks
- N times more Disks
- N times more Network
- N times more Memory bandwidth
• Compute Bound Embarrassingly Parallel Tasks
SFrames: good to distribute. Working on it.
SGraphs: harder to distribute, but it can be done and we are experts at it. Working on it.
Common Performance Fallacy
• Distributed algorithm for ML model X:

Examples per sec:
- 1 machine: 1,000
- 16 machines: 15,000
- 64 machines: 62,000

Neural Networks? Collaborative Filtering?
Have you solved the distributed ML problem for model X?
Algorithm A
• Single machine
• Very high convergence rate: requires only 10 passes through the dataset
• Each pass is slow

Algorithm B
• Easy to distribute
• Low convergence rate: requires 100 passes through the dataset
• Each pass is fast

Depending on system and network characteristics:
if you have fewer than X machines, A is faster; otherwise, B is faster.
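The crossover can be worked out on the back of an envelope. All numbers below are made up for illustration: algorithm A converges in 10 slow passes on one machine, while algorithm B needs 100 cheap passes but parallelizes with imperfect (assumed 80%) efficiency.

```python
# Hypothetical cost model for the A-vs-B trade-off above.
def time_a(pass_seconds=100.0, passes=10):
    # A: slow passes, fast convergence, no parallel speedup
    return pass_seconds * passes

def time_b(machines, pass_seconds=30.0, passes=100, efficiency=0.8):
    # B: cheap passes that scale with imperfect parallel efficiency
    return pass_seconds * passes / (machines * efficiency)

# Smallest cluster size at which B actually beats A
crossover = next(m for m in range(1, 1000) if time_b(m) < time_a())
```

With these illustrative constants, A wins on small clusters and B only pulls ahead past the crossover point; shifting any constant (pass cost, pass count, efficiency) moves X, which is exactly why the answer depends on system and network characteristics.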
Common Performance Fallacy
You should be comparing convergence rates against the best known single-machine algorithms.
This is from a recent ML paper co-authored with Jeff Dean, so we can be sure the system is solid. It took 600 machines to decrease training time by 3x.
Distributed ML
• Interesting ML is notoriously hard to distribute effectively
- Some special cases have been worked on (myself included)
- "Beyond MapReduce": think HPC
• Distribute only when it makes sense
- Balancing convergence rates vs. throughput
• We are working on it too
Data structures designed to make a data scientist's work with data manipulation as easy as possible.
We have worked with datasets from many customers, ranging from rather nice and structured to "what the hell were you thinking when you did this". The worst was JSON encoded as a string… inside a JSON string.
In a way we built these for ourselves: taking into account all our experience with semi-structured data and the lessons we have learned, we want to make it as easy as possible for everyone to work with data, in either tabular or graph form.
The API looks somewhat Pandas-like and carries very similar ideas.
For instance: It is as easy to load an SFrame with a billion rows as it is to construct a tiny SFrame here of 1 column and 5 values.
2) We can easily reassign the value of an entire column: here we normalize the column by the sum of the values.
3) We can easily create new columns by applying a Python lambda to each entry. This lambda operation is parallelized behind the scenes.
4) We can also create new columns by using the vectorized operators.
5) We can easily subselect columns to create new SFrames, And due to the immutable nature of columns, this operation is essentially free.
So The SFrame can be used like a general purpose table, but having a carefully curated set of scalable operations.
This leads to the SFrame use of a columnar architecture.
- By representing the data column-wise, we can support easy feature engineering.
- By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations.
- easily visualize or sketch statistics about single features
Not all optimizations are implemented yet, only some basic ones. This is a rich space we are expanding on; we just rearchitected the entire SFrame execution engine in the last month.
A randomly generated example we use for testing: we fuzz-test our query optimizer.
Some fun optimizations.
The layout works this way. First, we partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary; we use a simple hash function. Each vertex SFrame then contains the vertex ID and all the properties associated with its vertices. Note that this is an SFrame, so the vertex attributes are stored column-wise.
We next partition the edges into 16 SFrames. The layout is based on the adjacency matrix: for instance, edge partition (2,4) contains all the edges that connect vertices in partition 2 to vertices in partition 4. This means that if computation is to be performed on edge partition (2,4), only vertex sets 2 and 4 need to be in memory.
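The layout above can be sketched in a few lines of Python. The partition count and function names here are ours, not the on-disk format: vertices are hashed into p partitions, and edges are bucketed into a p x p grid keyed by the partitions of their endpoints.

```python
# Illustrative sketch of the SGraph vertex/edge partitioning scheme.
def vpart(v, p):
    # Arbitrary vertex -> partition assignment via a simple hash
    return hash(v) % p

def build_layout(vertices, edges, p=4):
    vparts = [[] for _ in range(p)]
    eparts = {(i, j): [] for i in range(p) for j in range(p)}
    for v in vertices:
        vparts[vpart(v, p)].append(v)
    for src, dst in edges:
        # Edge bucket (i, j) holds edges from partition i to partition j
        eparts[(vpart(src, p), vpart(dst, p))].append((src, dst))
    return vparts, eparts

# To run computation on edge bucket (2, 4), only vertex partitions 2 and
# 4 need to be resident in memory.
vparts, eparts = build_layout(range(100),
                              [(i, (i * 7) % 100) for i in range(100)],
                              p=4)
```

The adjacency-grid bucketing is what bounds the working set: each edge bucket touches at most two vertex partitions, which is the property that makes out-of-core graph computation feasible.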
If you were to try to represent this in memory, it is a minimum of a TB or so, excluding overheads. If you were to run this distributed, the maintenance overhead is insane. But.
Handle all data regimes. Sparse, Dense, Ill conditioned, etc. Our VP Eng says “…”
This really is a tall order for many ML applications. There is only that amount of diagnosis one can achieve. But is something we try hard to do.
A single machine is much cheaper, easier to maintain, faster to get working, and can scale much bigger than you think.
SGraphs: I spent 4-5 of my PhD years working on distributed ML and on scaling graphs. And we still have the world's fastest general-purpose distributed graph computation architecture.
There have been claims of distributed deep learning, but as far as I am concerned, no one has solved distributed deep learning.
Throughput is not the objective. It's not how many "images you can do per second"; convergence is the objective, and your speedup has to be measured against the best-tuned single-machine solver.
Here is the gap relative to the best-tuned single-machine solver.
When some of the most state-of-the-art methods required 600 machines to get a 3x improvement over a single machine, you know there is still a long way to go.
Distributed ML is nifty except in the simplest cases of "Map-Reduce-able" algorithms. Almost all other cases are hilariously difficult.
Another good reason to distribute: you love trying to make things fast.