12. SFrame Design
Pain Point #1: Resource Limits
Jobs fail because:
• The machine runs out of memory
• The Java heap size was not set correctly
• Resource configuration X needs to be bigger
13. SFrame Design
• Graceful Degradation as 1st principle
• Always Works
Pain Point #2: Too Strict or Too Weak Schemas
We want strong schema types.
We also want weak schema types.
Missing Values
15. SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
Pain Point #3: Feature Manipulation
Difficult or costly to inspect existing features and
create new features.
Hard to perform data exploration.
17. SFrame Python API Example
Make a small SFrame of 1 column and 5 values:
sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
Normalize the column x:
sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
sf['x-squared'] = sf['x'].apply(lambda x: x * x)
Create a new column using a vectorized operator:
sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
sf2 = sf[['x', 'x-squared']]
18. SFrame Querying
Supports most typical SQL SELECT operations using a
Pythonic syntax.
SQL
SELECT Book.title AS title, COUNT(*) AS authors
FROM Book
JOIN Book_author ON Book.isbn = Book_author.isbn
GROUP BY Book.title;
SFrame Python
Book.join(Book_author, on='isbn')
.groupby('title', {'authors': gl.aggregate.COUNT})
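To make the conversion concrete, here is a minimal runnable sketch of the same query in GraphLab Create. The toy Book and Book_author data are invented purely for illustration:

import graphlab as gl

Book = gl.SFrame({'isbn':  ['111', '222'],
                  'title': ['SFrames', 'SGraphs']})
Book_author = gl.SFrame({'isbn':   ['111', '111', '222'],
                         'author': ['Alice', 'Bob', 'Carol']})

# Inner join on isbn, then count rows per title -- the Pythonic
# equivalent of the SQL SELECT ... JOIN ... GROUP BY above.
result = (Book.join(Book_author, on='isbn')
              .groupby('title', {'authors': gl.aggregate.COUNT}))
print(result)  # one row per title, with its author count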
19. SFrame Columnar Encoding
Netflix dataset: 99M rows, 3 integer columns (user, movie, rating)
1.4GB raw
289MB gzip compressed
20. SFrame Columnar Encoding
Type-aware compression:
• Variable bit-length encode
• Frame-of-reference encode
• ZigZag encode
• Delta / delta-ZigZag encode
• Dictionary encode
• General-purpose LZ4
21. SFrame Columnar Encoding
Resulting SFrame file (99M rows):

Column   Size      Bits/int
User     176 MB    14.2
Movie    257 KB    0.02   (sorted)
Rating   47 MB     3.8
-------------------------------
Total    223 MB    (vs. 289 MB gzip)

22. SFrame Columnar Encoding
Time to read the CSV and encode it as an SFrame: 10s on a 4-core desktop (i7-3770K).
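As a quick sanity check, the bits-per-integer figures follow directly from the reported column sizes. A small Python sketch (treating MB and KB as 10^6 and 10^3 bytes) reproduces the numbers above:

rows = 99000000  # 99M ratings
sizes_bytes = {'user': 176e6, 'movie': 257e3, 'rating': 47e6}
for col, size in sizes_bytes.items():
    # bits per integer = total bits / number of rows
    print('%7s: %5.2f bits/int' % (col, size * 8 / rows))
print('  total: %.0f MB' % (sum(sizes_bytes.values()) / 1e6))  # ~223 MB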
23. SFrames Distributed
• Distributed dataflow
• Columnar query optimizations
• Communicate columnar compressed blocks rather than row tuples
The choice of distributed or local execution is a question of query optimization.
37. Deep Integration of SFrames and SGraphs
• Seamless interaction between graph data and table data
• Queries can be performed easily across graphs and tables
39. SFrame: Scalable Tabular Data Manipulation
SGraph: Scalable Graph Manipulation
User-first architecture.
Built by data scientists,
for data scientists.
Editor's Notes
My name is ... I am one of the co-founders and currently the chief architect at GraphLab.
Talk about
- architecture philosophy @ graphlab
- and what we have built into GraphLab Create.
The basic architecture philosophy at GraphLab is what I like to call “Users-First” architecture. What do I mean by that?
The objective of a systems architecture is to connect systems to users: to provide users with a means to access data, compute resources, and so on.
And of course, there are many ways to achieve this.
Some architectures are designed bottom-up:
- Given particular systems constraints, what can we perform most efficiently?
- Users can then try to develop whatever applications they need around the system.
Other architectures are designed top down:
- We begin by defining an interaction model with users. SQL for instance is a great example.
- Then we figure out how to design an architecture that supports all these capabilities efficiently.
The question we are trying to answer here at GraphLab is: “what is the Users-First Architecture for Data Science? For Machine Learning?”
We think we have an answer.
We designed two core datastructures we call...
Now what are SFrames and SGraphs?
- SFrame is our scalable datastructure for table manipulation
- SGraph is our scalable datastructure for graph manipulation.
And one design objective is to enable ....
I am going to first talk about the design of SFrames, our datastructure for table manipulation
The SFrame design is governed by a series of common pain points we have observed.
The first is fundamental: it is *extremely* annoying to start a job or some task, have it run for a while, and then have it fail because...
- Graceful degradation is the first core principle in the SFrame design.
- Instead of demanding more resources, we try to ensure that things always work even when memory is insufficient. We put a lot of work into developing bounded-memory algorithms so that when there is not enough memory we may run slower, but we will always work.
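To illustrate the idea (this is not SFrame's actual implementation), a bounded-memory aggregation streams fixed-size chunks so memory use stays constant no matter how long the column is; read_chunk here is a hypothetical reader interface:

def chunked_sum(read_chunk, chunk_rows=1000000):
    # read_chunk(start, n) is assumed to return up to n values starting
    # at row `start`, and an empty list once past the end.
    total, start = 0, 0
    while True:
        chunk = read_chunk(start, chunk_rows)
        if not chunk:
            return total
        total += sum(chunk)   # only one chunk is resident at a time
        start += len(chunk)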
- We want strong types because it is *very* useful to have a priori guarantees that my “rating” column contains an integer even in the 1-billionth row.
- We want weak types because sometimes data is unstructured and I need some way to manage it.
Our SFrames support a rich family of datatypes, from strong types like int, double, and string to weak types like arbitrary lists and dictionaries. Our lists and dictionaries are self-recursive and hence expressive enough to hold arbitrary JSON.
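For example, strong and weak column types can coexist in a single SFrame (the column names here are invented for illustration):

import graphlab as gl

sf = gl.SFrame({
    'rating': [5, 3, 4],                        # strong type: int
    'title':  ['a', 'b', 'c'],                  # strong type: str
    'meta':   [{'tags': ['new'], 'views': 10},  # weak type: dict;
               {'tags': []},                    # recursive, so it can
               {'views': 3}]})                  # hold arbitrary JSON
print(sf.column_types())  # e.g. [int, str, dict]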
Key aspects of doing ML: I need to efficiently create new features or delete existing ones, and I need to be able to deeply explore a single feature or a small set of features at a time.
This leads to the SFrame use of a columnar architecture.
- By representing the data column-wise, we can support easy feature engineering.
- By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations.
- easily visualize or sketch statistics about single features
The API looks somewhat Pandas-like and carries very similar ideas.
For instance: It is as easy to load an SFrame with a billion rows as it is to construct a tiny SFrame here of 1 column and 5 values.
2) We can easily reassign the value of an entire column: here we normalize the column by the sum of the values.
3) We can easily create new columns by applying a python lambda operation to each entry. This lambda operation is parallelized behind the scenes
4) We can also create new columns by using the vectorized operators.
5) We can easily subselect columns to create new SFrames. And due to the immutable nature of columns, this operation is essentially free.
So the SFrame can be used like a general-purpose table, but with a carefully curated set of scalable operations.
Furthermore, SFrames support most of the important query operators like groupby, join, etc. Most typical SQL SELECT operations have a natural conversion to a Pythonic syntax using SFrames.
I mentioned columnar architecture briefly in an earlier slide but I did not mention one of the fundamental benefits of columnar encoding:
- aggressive compression is possible.
But it is hard to appreciate how much compression can achieve without an actual demonstration.
The data is 99M rows, 3 integer columns: user, movie, and rating. That is 1.4GB raw, or 289MB gzip-compressed.
----- Meeting Notes (7/21/14 03:22) -----
7 min
- The SFrame encoder adaptively slices the table into blocks while ingesting the data, targeting a fixed block size after compression.
- For compression, a collection of very high-throughput columnar compression techniques reduces the size of the data. Integers in particular are heavily compressed with a family of encoding algorithms.
- The on-disk SFrame representation ends up as shown (the movie column is sorted, hence the tiny 0.02 bits/int).
- The total size is smaller than the gzip compression.
- This allows us to do more on one machine: aggressive compression lets us keep more in the application cache / file system cache, so we get faster, higher-throughput access even though it is external memory.
- How long did it take to read the CSV and encode the entire dataset into an external-memory, efficiently queryable SFrame? 10s on a 4-core desktop (i7-3770K).
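In code, that ingest path looks roughly like this with GraphLab Create's public API (the file names are placeholders):

import graphlab as gl

# Parse the CSV and build the compressed, columnar representation.
ratings = gl.SFrame.read_csv('netflix_ratings.csv', column_type_hints=int)
ratings.save('netflix_ratings.sframe')  # external-memory, efficiently queryable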
Now a natural question to bring up is: what about going distributed?
We take a somewhat novel perspective on this: the user-facing Python API never changes.
- In other words, the decision to go distributed is simply a question of query optimization.
- If you are doing something small, it is going to be more efficient to run locally than to pay the distributed cost.
- We are building a distributed dataflow system that takes advantage of columnar query optimizations. And when communication is necessary, we can reduce it by communicating columnar compressed blocks rather than row tuples.
Next I will talk about SGraphs, our scalable datastructure for graph manipulation.
The layout works this way. First, we partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary; we use a simple hash function. Each vertex SFrame then contains the vertex ID and all the properties associated with those vertices. Note that this is an SFrame, and hence the vertex attributes are stored column-wise.
We next partition the edges into 16 SFrames. The layout is based on the adjacency matrix: for instance, edge partition (2, 4) contains all the edges that connect vertices in partition 2 to vertices in partition 4. This means that if computation is to be performed on edge partition (2, 4), only vertex sets 2 and 4 need to be in memory.
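A toy sketch of this partitioning scheme (not the actual SGraph implementation), with p = 4 vertex partitions and p * p = 16 edge blocks keyed by (source partition, target partition):

p = 4
def vertex_partition(v):
    return hash(v) % p          # any simple hash function will do

def edge_partition(src, dst):
    # Block (i, j) holds the edges from vertex partition i to vertex
    # partition j, so processing one block needs only two vertex sets.
    return (vertex_partition(src), vertex_partition(dst))

edges = [('alice', 'matrix'), ('bob', 'matrix'), ('alice', 'up')]
blocks = {}
for src, dst in edges:
    blocks.setdefault(edge_partition(src, dst), []).append((src, dst))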
The magic behind this layout is that, ultimately, the SGraph is backed by SFrames, which are stored in columnar fashion.
This representation hence allows us to easily separate structure and data.
For instance, we can, without any copying, create new graphs that share the same structure but with none of the data.
Or introduce new features without rewriting anything.
Finally, since the vertex and edge attributes are all simply SFrames, through a trick of lazy evaluation we can make both the vertices and the edges appear as a single SFrame to the user.
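For example (with invented edge data), both views are exposed directly as SFrames:

import graphlab as gl

edges = gl.SFrame({'src': ['alice', 'bob'], 'dst': ['matrix', 'matrix'],
                   'rating': [5, 3]})
g = gl.SGraph().add_edges(edges, src_field='src', dst_field='dst')

print(g.vertices)  # all vertices as one SFrame (__id column)
print(g.edges)     # all edges as one SFrame (__src_id, __dst_id, rating)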
This deep integration of SFrames and SGraphs allows seamless interaction between graph data and table data, and queries can be performed easily across both graphs and tables.
Now, this is not so easy to understand just from slides, so I will give a quick demo.
Now let's briefly recap what we just did in the demo. We took tabular data, converted it to a graph, ran a basic graph algorithm on it, and asked graph questions about it. Then we took the graph and directly interpreted it as a table, joining it with other tables to get some intelligence with very little code and very little friction, all within a few lines of Python. This is what we have achieved by trying to understand data science from a user-first perspective: a datastructure that is easy to use, powerful, and enables the machine learning algorithms in GraphLab Create to scale easily to terabyte datasets on a single machine. Thank you.