This document provides an overview of SFrame, a scalable dataframe for machine learning developed by Dato. SFrame was created to handle large datasets and enable fast machine learning: it uses a columnar storage format and lazy evaluation to optimize performance, and its out-of-core design lets it handle datasets with billions of rows efficiently. An SGraph extension supports graph analytics on very large graphs with billions of edges, and a variety of machine learning algorithms are built on SFrame to leverage its scalability.
Dato Confidential
Who am I?
• PhD Machine Learning, Carnegie Mellon University
- Advised by Prof. Carlos Guestrin (CEO Dato)
- Thesis work on Distributed Machine Learning
• PowerGraph
World’s Fastest Distributed Graph Processing Framework
• Co-founder and now Chief Architect at Dato
SFrame Python API Example
Make a little SFrame of 1 column and 5 values:
sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
Normalize the column x by its sum:
sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
sf['x-squared'] = sf['x'].apply(lambda x: x*x)
Create a new column using a vectorized operator:
sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
sf2 = sf[['x', 'x-squared']]
SFrame Design Principles
• Graceful Degradation as 1st principle
- Always Works
• Rich Datatypes
- Strong schema types: int, double, string, image, ...
- Weak schema types: list, dictionary (can contain arbitrary JSON)
• Columnar Architecture
- Easy feature engineering + vectorized feature operations
- Immutable columns + lazy evaluation
- Statistics + visualization + sketches
Columns are first class: like Pandas, SFrame is essentially an arbitrary collection of columns.
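To make the "collection of columns" idea concrete, here is a minimal sketch of a column-first table in plain Python. The names (`MiniFrame`, `select`) are illustrative, not the real SFrame API; the point is that column operations and column subselection never need a row-wise copy.

```python
# Minimal sketch of a column-first table (names are illustrative,
# not the real SFrame implementation): a frame is just a named
# collection of columns.
class MiniFrame:
    def __init__(self, columns):
        self.columns = dict(columns)      # column name -> list of values

    def __getitem__(self, name):
        return self.columns[name]

    def __setitem__(self, name, values):
        self.columns[name] = list(values)

    def select(self, names):
        # Columns are shared, not copied, so this is essentially free
        return MiniFrame({n: self.columns[n] for n in names})

f = MiniFrame({'x': [1, 2, 3, 4, 5]})
total = sum(f['x'])
f['x'] = [v / total for v in f['x']]          # normalize one column
f['x-squared'] = [v * v for v in f['x']]      # derive a new column
f2 = f.select(['x', 'x-squared'])             # cheap column subselection
```

Because columns are immutable in the real system, the shared-reference trick in `select` is safe at any scale, which is why subselecting columns is essentially free.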
Internal Representation

sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']

[Diagram: netflix_tr.frame and netflix_norm.frame each store columns user, movie, rating on disk; sf and sf2 reference them. After the assignment, sf's nrating column is shared with sf2's rating column rather than copied.]
Internal Representation

sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']

[Diagram: both frames store columns user, movie, rating on disk; sf additionally references nrating, shared from sf2; diff is an anonymous column attached to no frame.]

A column is an SArray, our equivalent of an external-memory NumPy array.
Internal Representation

sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']
sf['diff'] = diff

[Diagram: assigning the anonymous SArray diff into sf simply attaches the existing column to the frame; no data is copied.]
Physical Representation

Type-aware compression:
• Variable bit-length encoding
• Frame-of-reference encoding
• ZigZag encoding
• Delta / delta-ZigZag encoding
• Dictionary encoding
• General-purpose LZ4

[Diagram: an SFrame file storing the user, movie, rating columns.]
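As a sketch of what two of these encodings do, here is the ZigZag and delta-ZigZag arithmetic in plain Python. The real SFrame codecs are C++; this only illustrates the transform, and the function names are ours.

```python
def zigzag(n):
    # Interleave signed values onto unsigned codes so small magnitudes
    # get small codes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def unzigzag(z):
    return (z >> 1) ^ -(z & 1)

def delta_zigzag_encode(values):
    # Store the first value, then zigzag-encoded differences; sorted or
    # slowly varying columns become runs of tiny numbers.
    out, prev = [zigzag(values[0])], values[0]
    for v in values[1:]:
        out.append(zigzag(v - prev))
        prev = v
    return out

def delta_zigzag_decode(codes):
    vals = [unzigzag(codes[0])]
    for c in codes[1:]:
        vals.append(vals[-1] + unzigzag(c))
    return vals

data = [1000, 1001, 1003, 1002, 1010]
codes = delta_zigzag_encode(data)     # [2000, 2, 4, 1, 16]
assert delta_zigzag_decode(codes) == data
```

The small codes produced here are what makes a downstream variable bit-length or general-purpose (LZ4) pass effective.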
Lazy Evaluation + Query Optimization
• Avoid writing to disk/SSD whenever possible.
• Common model for many other systems
(Spark, Blaze, etc.)
• Extensions from classical “SQL” optimization
methods needed to maintain both “tables”
(SFrames) and “columns” (SArrays)
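The deferred-execution idea can be sketched in a few lines (illustrative only, not the real engine): operations build a pipeline of closures, and nothing runs until `materialize()` is called, so a chain of operations executes in one pass with no intermediate column written to disk.

```python
# Tiny lazy-evaluation sketch: each op wraps the previous producer
# instead of computing anything.
class LazyColumn:
    def __init__(self, producer):
        self.producer = producer            # zero-arg fn -> list of values

    def map(self, f):
        return LazyColumn(lambda: [f(v) for v in self.producer()])

    def mul(self, other):
        return LazyColumn(lambda: [a * b
                                   for a, b in zip(self.producer(),
                                                   other.producer())])

    def materialize(self):
        return self.producer()

x = LazyColumn(lambda: [1, 2, 3, 4, 5])
pipeline = x.map(lambda v: v * v).mul(x)    # deferred: nothing computed yet
result = pipeline.materialize()             # [1, 8, 27, 64, 125]
```

A query optimizer sits one level above this sketch: because the pipeline is a data structure rather than eagerly computed results, it can be rewritten (fused, reordered, pruned) before execution.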
Execution time Optimizations
• C++ coroutine incremental decoder.
- Only compressed blocks need to be resident in memory; no block is ever fully decompressed.
• Runtime Input Skipping
- Significant performance optimization for small filters. Avoid
reading data that will be thrown away.
• Etc.
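Runtime input skipping can be sketched with per-block min/max metadata (a hypothetical layout, not the actual SFrame file format): for a selective filter, whole blocks are skipped or taken wholesale without decoding each row.

```python
# Sketch of block skipping for a "greater than" filter.
def filter_greater(blocks, threshold):
    """blocks: list of (values, block_min, block_max) tuples."""
    out = []
    for values, lo, hi in blocks:
        if hi <= threshold:                  # no row can pass: skip block
            continue
        if lo > threshold:                   # every row passes: take it all
            out.extend(values)
        else:                                # mixed block: scan row by row
            out.extend(v for v in values if v > threshold)
    return out

blocks = [([1, 5, 9], 1, 9), ([20, 30], 20, 30), ([2, 3], 2, 3)]
print(filter_greater(blocks, 10))            # [20, 30]
```

For a highly selective predicate most blocks fall into the "skip" branch, which is why small filters see a significant speedup: the data that would be thrown away is never read or decompressed.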
Filter Performance

SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100

[Chart: Query 1b (X = 100) runtimes on 5 machines, 0-75 s scale, for Redshift, Impala (1.2.3), Spark (1.3), Hive 0.12, Tez 0.2.0, and GraphLab Create (1.4).]
Filter Performance

SELECT pageURL, pageRank FROM rankings WHERE pageRank > 10

[Chart: Query 1c (X = 10) runtimes on 5 machines, 0-50 s scale, for the same six systems.]
Scalability
• Largest synthetic test:
- 950 columns, 1B rows, dense numeric
- 1 trillion numeric values
• No real limit on the number of rows; practically 100s-1000s of columns
PageRank on Twitter Graph

[Chart: runtime in seconds (0-700) for GraphLab Create on 1 machine versus GraphX and Giraph on 16 machines.]

* Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica. GraphX: A Resilient Distributed Graph System on Spark. GRADES 2013.
** Aapo Kyrola, Guy Blelloch, Carlos Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. OSDI 2012.
Connected Components on Twitter Graph

[Chart: runtime in seconds (0-300) for GraphLab Create on 1 machine versus GraphX and Giraph on 16 machines.]
Scalability
1x r3.8xlarge using 1x SSD.
Common Crawl graph: 3.5 billion nodes and 128 billion edges, the largest publicly available graph.
PageRank: 9 min per iteration.
Connected components: ~1 hr.
Represented naively, this graph would occupy 2 TB of memory.
ML Algorithms
• Uses SFrames as the data source, but models must fit in memory. Implementations are in C++ and use whatever it takes to go fast:
- Parallel execution
- Controlled memory-access races (some algorithms are robust to races if you are careful)
- Multiple solvers for different data regimes
- GPUs
ML Algorithms
• Robustness and ease of use, not just performance, are key objectives: sparse/dense regimes, etc.
- Probably the most robust linear solvers out there.
"Garbage In, Guidance Out."
- VP Eng. Sethu Raman
ML Algorithms
• Linear Models
• Boosted Trees
• Recommender Algorithms
- (with features, ranking regime, etc.)
• Deep Learning (GPU support)
• Clustering
• Fuzzy Matching
• Text models
• Etc.
Logistic Regression

Runtime (s):
- GraphLab Create: 291
- MLlib (5 nodes): 651
- MLlib (1 node): 2,092
- Scikit-Learn: 2,361

Dataset source: LIBLinear binary classification datasets.
KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed.
Task: predict student performance on math problems based on interactions with a tutoring system.
Logistic regression on a 100GB dataset on commodity hardware: 1.5 hrs.
Learn a collaborative filtering model on 4 billion user-item ratings in 1 hour (scales to 20B).
Good Reasons To Distribute
• IO Bound tasks
- N times more Disks
- N times more Network
- N times more Memory bandwidth
• Compute Bound Embarrassingly Parallel Tasks
SFrames: good to distribute. Working on it.
SGraphs: harder to distribute, but it can be done and we are experts at it. Working on it.
Common Performance Fallacy
• Distributed algorithm for ML model X:

Examples per sec:
- 1 machine: 1,000
- 16 machines: 15,000
- 64 machines: 62,000

Neural Networks? Collaborative Filtering?
Have you solved the distributed ML problem for model X?
Algorithm A
• Single machine
• Very high convergence rate: requires only 10 passes through the dataset
• Each pass is slow

Algorithm B
• Easy to distribute
• Low convergence rate: requires 100 passes through the dataset
• Each pass is fast

Depending on system and network characteristics:
if you have fewer than X machines, A is faster; otherwise, B is faster.
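The crossover can be worked out on the back of an envelope. All numbers below are made up for illustration: algorithm A converges in 10 slow passes on one machine, while algorithm B needs 100 cheap passes but parallelizes with imperfect (assumed 80%) efficiency.

```python
# Hypothetical cost model for the A-vs-B trade-off above.
def time_a(pass_seconds=100.0, passes=10):
    # A: slow passes, fast convergence, no parallel speedup
    return pass_seconds * passes

def time_b(machines, pass_seconds=30.0, passes=100, efficiency=0.8):
    # B: cheap passes that scale with imperfect parallel efficiency
    return pass_seconds * passes / (machines * efficiency)

# Smallest cluster size at which B actually beats A
crossover = next(m for m in range(1, 1000) if time_b(m) < time_a())
```

With these illustrative constants, A wins on small clusters and B only pulls ahead past the crossover point; shifting any constant (pass cost, pass count, efficiency) moves X, which is exactly why the answer depends on system and network characteristics.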
Common Performance Fallacy
You should be comparing convergence rates against the best known single-machine algorithms.
This is from a recent ML paper co-authored with Jeff Dean, so we can be sure the system is solid. It took 600 machines to decrease training time by 3x.
Distributed ML
• Interesting ML is notoriously hard to distribute effectively
- Some special cases have been worked on (myself included)
- "Beyond MapReduce": think HPC
• Distribute only when it makes sense
- Balancing convergence rates vs. throughput
• We are working on it too
Data structures designed to make a data scientist's work with data manipulation as easy as possible.
We have worked with datasets from many customers, ranging from rather nice and structured to "what the hell were you thinking when you did this". The worst was JSON encoded as a string… inside a JSON string.
In a way we built these for ourselves: taking into account all our experience with semi-structured data and the lessons we have learned, we want to make it as easy as possible for everyone to work with data, in either tabular or graph form.
The API looks somewhat Pandas-like and carries very similar ideas.
For instance: It is as easy to load an SFrame with a billion rows as it is to construct a tiny SFrame here of 1 column and 5 values.
2) We can easily reassign the value of an entire column: here we normalize the column by the sum of the values.
3) We can easily create new columns by applying a Python lambda to each entry. This lambda operation is parallelized behind the scenes.
4) We can also create new columns by using the vectorized operators.
5) We can easily subselect columns to create new SFrames, And due to the immutable nature of columns, this operation is essentially free.
So The SFrame can be used like a general purpose table, but having a carefully curated set of scalable operations.
This leads to the SFrame use of a columnar architecture.
- By representing the data column-wise, we can support easy feature engineering.
- By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations.
- easily visualize or sketch statistics about single features
Not all optimizations are implemented yet, only some basic ones. This is a rich space we are expanding on; we just rearchitected the entire SFrame execution engine in the last month.
A randomly generated example we use for testing: we fuzz-test our query optimizer.
Some fun optimizations.
The layout works this way. First, we partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary; we use a simple hash function. Each vertex SFrame then contains the vertex ID and all the properties associated with its vertices. Note that this is an SFrame, so the vertex attributes are stored column-wise.
We next partition the edges into 16 SFrames. The layout is based on the adjacency matrix: for instance, edge partition (2,4) contains all the edges that connect vertices in partition 2 to vertices in partition 4. This means that if computation is to be performed on edge partition (2,4), only vertex sets 2 and 4 need to be in memory.
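The layout above can be sketched in a few lines of Python. The partition count and function names here are ours, not the on-disk format: vertices are hashed into p partitions, and edges are bucketed into a p x p grid keyed by the partitions of their endpoints.

```python
# Illustrative sketch of the SGraph vertex/edge partitioning scheme.
def vpart(v, p):
    # Arbitrary vertex -> partition assignment via a simple hash
    return hash(v) % p

def build_layout(vertices, edges, p=4):
    vparts = [[] for _ in range(p)]
    eparts = {(i, j): [] for i in range(p) for j in range(p)}
    for v in vertices:
        vparts[vpart(v, p)].append(v)
    for src, dst in edges:
        # Edge bucket (i, j) holds edges from partition i to partition j
        eparts[(vpart(src, p), vpart(dst, p))].append((src, dst))
    return vparts, eparts

# To run computation on edge bucket (2, 4), only vertex partitions 2 and
# 4 need to be resident in memory.
vparts, eparts = build_layout(range(100),
                              [(i, (i * 7) % 100) for i in range(100)],
                              p=4)
```

The adjacency-grid bucketing is what bounds the working set: each edge bucket touches at most two vertex partitions, which is the property that makes out-of-core graph computation feasible.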
If you were to try to represent this in memory, it is a minimum of a TB or so, excluding overheads. If you were to run this distributed, the maintenance overhead is insane. But.
Handle all data regimes. Sparse, Dense, Ill conditioned, etc. Our VP Eng says “…”
This really is a tall order for many ML applications. There is only that amount of diagnosis one can achieve. But is something we try hard to do.
A single machine is much cheaper, easier to maintain, faster to get working, and can scale much bigger than you think.
SGraphs: I spent 4-5 of my PhD years working on distributed ML and on scaling graphs. And we still have the world's fastest general-purpose distributed graph computation architecture.
There have been claims of distributed deep learning, but as far as I am concerned, no one has solved distributed deep learning.
Throughput is not the objective. It's not how many "images you can do per second"; convergence is the objective, and your speedup has to be measured against the best-tuned single-machine solver.
Here is the gap relative to the best-tuned single-machine solver.
When some of the most state-of-the-art methods required 600 machines to get a 3x improvement over a single machine, you know there is still a long way to go.
Distributed ML is nifty except in the simplest cases of "Map-Reduce-able" algorithms. Almost all other cases are hilariously difficult.
Another good reason to distribute: you love trying to make things fast.