SlideShare a Scribd company logo
1 of 39
The
Technology
Behind
Yucheng Low, PhD
Chief Architect
GraphLab
Create
GraphLab Philosophy
Users-First Architecture
Architecture
Systems
User
Architecture
Systems
User
Systems-First
Architectures
Systems define constraints.
Optimize for performance.
PowerGraph
Architecture
Systems
User
Users-First
Architectures
Users define constraints.
Optimize for user interaction.
What is a Users-First
Architecture for Data Science?
SFrame and SGraph
Building on decades of database
and systems research.
Built by data scientists,
for data scientists.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
Enabling users to easily and efficiently
translate between both representations to
get the best of both worlds.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SFrame Design
Jobs fail because:
• Machine run out of memory
• Did not set Java Heap Size correctly
• Resource Configuration X needs to be
bigger.
Pain Point #1: Resource Limits
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
Pain Point #2: Too Strict or Too Weak Schemas
We want strong schema types.
We also want weak schema types.
Missing Values
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
Pain Point #3: Feature Manipulation
Difficult or costly to inspect existing features and
create new features.
Hard to perform data exploration.
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
• Columnar Architecture
• Easy feature engineering + Vectorized feature operations.
• Immutable columns + Lazy evaluation
• Statistics + visualization + sketches
SFrame Python API Example
Make a little SFrame of 1 column and 5 values:
sf = gl.SFrame({‘x’:[1,2,3,4,5]})
Normalizes the column x:
sf[‘x’] = sf[‘x’] / sf[‘x’].sum()
Uses a python lambda to create a new column:
sf[‘x-squared’] = sf[‘x’].apply(lambda x: x*x)
Create a new column using a vectorized operator:
sf[‘x-cubed’] = sf[‘x-squared’] * sf[‘x’]
Create a new SFrame taking only 2 of the columns:
sf2 = sf[[‘x’,’x-squared’]]
SFrame Querying
Supports most typical SQL SELECT operations using a
Pythonic syntax.
SQL
SELECT Book.title AS title, COUNT(*) AS authors
FROM Book
JOIN Book_author ON Book.isbn = Book_author.isbn
GROUP BY Book.title;
SFrame Python
Book.join(Book_author, on=‘isbn’)
.groupby(‘title’, {‘authors’:gl.aggregate.COUNT})
SFrame Columnar Encoding
user movie rating
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
SFrame File
SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
User  176 MB 14.2 bits/int
SFrame File
0.02 bits/intMovie  257 KB
3.8 bits/intRating  47 MB
-------------------------------
Total  223MB
SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
User  176 MB 14.2 bits/int
SFrame File
0.02 bits/intMovie  257 KB
3.8 bits/intRating  47 MB
-------------------------------
Total  223MB
10s
SFrames Distributed
• Distributed Dataflow
• Columnar Query Optimizations
• Communicate columnar compressed blocks
rather than row tuples.
The choice of distributed or local execution is a
question of query optimization.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SGraph
• SFrame backed graph representation.
Inherits SFrame properties.
• Data types, External Memory, Columnar,
compression, etc.
• Layout optimized for batch external
memory computation.
SGraph Layout
1
2
3
4
Vertex
SFrames
Vertices Partitioned
into p = 4 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
__id Name Address ZipCode
1011 John … 98105
2131 Jack … 98102
Vertices Partitioned
into p = 4 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge
SFrames
Edges partitioned into
p^2 = 16 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge
SFrames
Edges partitioned into
p^2 = 16 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge
SFrames
Edges partitioned into
p^2 = 16 SFrames.
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
Deep Integration of SFrames and
SGraphs
• Seamless interaction between graph data
and table data.
• Queries can be performed easily across
graph and tables.
Demo
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
User-first architecture.
Built by data scientists,
for data scientists.

More Related Content

What's hot

Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Turi, Inc.
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Turi, Inc.
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...Chris Fregly
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWSAmazon Web Services
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
 
SciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talkSciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talkWes McKinney
 

What's hot (20)

Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
 
Cascalog
CascalogCascalog
Cascalog
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
SciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talkSciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talk
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 

Viewers also liked

Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakPyData
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305Amazon Web Services
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Grigory Sapunov
 

Viewers also liked (6)

Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos Guestrin
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
 

Similar to GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph

On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Databricks
 
Performance & Scalability Improvements in Perforce
Performance & Scalability Improvements in PerforcePerformance & Scalability Improvements in Perforce
Performance & Scalability Improvements in PerforcePerforce
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraDave Gardner
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesTanel Poder
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDBMongoDB
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka MeetupCliff Gilmore
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at TwitterAlex Payne
 
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016Metosin Oy
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 

Similar to GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph (20)

On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
 
Pydata talk
Pydata talkPydata talk
Pydata talk
 
Performance & Scalability Improvements in Perforce
Performance & Scalability Improvements in PerforcePerformance & Scalability Improvements in Perforce
Performance & Scalability Improvements in Perforce
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache Cassandra
 
Serial-War
Serial-WarSerial-War
Serial-War
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 

More from Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing VideoTuri, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission RiskTuri, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataTuri, Inc.
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsTuri, Inc.
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine LearningTuri, Inc.
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender SystemsTuri, Inc.
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringTuri, Inc.
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with DatoTuri, Inc.
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Turi, Inc.
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 

More from Turi, Inc. (20)

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning Toolkits
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine Learning
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender Systems
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
 

Recently uploaded

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 

Recently uploaded (20)

Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 

GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph

Editor's Notes

  1. My name is ... I am one of the co-founders and currently the chief architect at GraphLab. Talk about - architecture philosophy @ graphlab - and what we have built into GraphLab Create.
  2. The basic architecture philosophy at GraphLab is what I like to call “Users-First” architecture. What do I mean by that?
  3. The objective of a systems Architecture is to connect systems [] to users []. To provide users with a means to access data, compute resources, and so on.
  4. And of course, there are many ways to achieve this. [] Some architectures are designed with a bottom up nature: - Given particular systems constraints, what can we perform most efficiently? - Users can then try to develop whatever applications they need around the system.
  5. Other architectures are designed top down: - We begin by defining an interaction model with users. SQL for instance is a great example. - Then we figure out how to design an architecture that supports all these capabilities efficiently.
  6. The question we are trying to answer here at GraphLab is: “what is the Users-First Architecture for Data Science? For Machine Learning?” We think we have an answer. ----- Meeting Notes (7/21/14 03:15) ----- 1 min
  7. We designed two core datastructures we call... <T> Now what are SFrames and SGraphs?
  8. - SFrame is our scalable datastructure for table manipulation - SGraph is our scalable datastructure for graph manipulation.
  9. And one design objective is to enable ....
  10. I am going to first talk about the design of SFrames, our datastructure for table manipulation
  11. The SFrame design is governed by a series of common pain points we have observed. The first is fundamental. It is *extremely* annoying when we start a job / some task, have it run for a while, then have it fail. [] because... ----- Meeting Notes (7/21/14 03:15) ----- 2 min
  12. - graceful degradation as the first core principle in the SFrame design - instead of demanding more resources, we try to ensure that things always work even if there is insufficient memory. So we spent a lot of work on developing bounded memory algorithms to ensure that when there is insufficient memory, we might run slower, but we will always work. - We want strong types, because sometimes... It is *very* useful to have a priori guarantees that my “rating column” always contains an integer even in the 1billionth row. - We want weak types because sometime data is unstructured and I need some way to manage it.
  13. Our SFrames support a rich family of datatypes. From some strong types like int, double, string to weak types like arbitrary lists and dictionaries. Our lists and dictionaries are self-recursive and hence are expressive enough to hold arbitrary JSON. ----- Meeting Notes (7/21/14 03:15) ----- 4 min
  14. Key aspects of doing ML. I need to create new features, or delete new features. I need to be able to deeply explore a single feature or small sets of features at a time. ----- Meeting Notes (7/21/14 03:22) ----- efficiently create.
  15. This leads to the SFrame use of a columnar architecture. - By representing the data column-wise, we can support easy feature engineering. - By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations. - easily visualize or sketch statistics about single features
  16. The API looks somewhat Pandas like, and carry very similar ideas. For instance: It is as easy to load an SFrame with a billion rows as it is to construct a tiny SFrame here of 1 column and 5 values. 2) We can easily reassign the value of an entire column: here we normalize the column by the sum of the values. 3) We can easily create new columns by applying a python lambda operation to each entry. This lambda operation is parallelized behind the scenes 4) We can also create new columns by using the vectorized operators. 5) We can easily subselect columns to create new SFrames, And due to the immutable nature of columns, this operation is essentially free. So The SFrame can be used like a general purpose table, but having a carefully curated set of scalable operations.
  17. Further more, SFrames supports most of the key important query operators like groupby, join, etc. Most typical SQL Select operations will have a natural conversion to a pythonic syntax using SFrames
  18. I mentioned columnar architecture briefly in an earlier slide but I did not mention one of the fundamental benefits of columnar encoding: - aggressive compression is possible. But it is hard to understand how much compression can do without an actual demonstration. Data is 99M rows, 3 columns of integers of user, movie and ratings. This is ... Size ... ----- Meeting Notes (7/21/14 03:22) ----- 7 min
  19. - The SFrame encoder while ingressing the data, adaptively slicing the table up into blocks, targeted at a fixed block size after compression. - For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms. - The on disk SFrame representation ends up as ... (movie is sorted) - The total size is smaller than the gzip compression. - This allows us to do more on one machine. - Aggressive compression allows us to keep more in application cache / file system cache - faster, more throughput even though its external memory. - Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame? - 10s on a 4 core desktop. (i73770K )
  20. - The SFrame encoder while ingressing the data, adaptively slicing the table up into blocks, targeted at a fixed block size after compression. - For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms. - The on disk SFrame representation ends up as ... (movie is sorted) - The total size is smaller than the gzip compression. - This allows us to do more on one machine. - Aggressive compression allows us to keep more in application cache / file system cache - faster, more throughput even though its external memory. - Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame? - 10s on a 4 core desktop. (i73770K )
  21. - The SFrame encoder while ingressing the data, adaptively slicing the table up into blocks, targeted at a fixed block size after compression. - For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms. - The on disk SFrame representation ends up as ... (movie is sorted) - The total size is smaller than the gzip compression. - This allows us to do more on one machine. - Aggressive compression allows us to keep more in application cache / file system cache - faster, more throughput even though its external memory. - Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame? - 10s on a 4 core desktop. (i73770K )
  22. Now a natural question to bring up then is how about going Distributed? We take a somewhat novel perspective on this. The user facing Python API never changes. Instead, - In other words, the decision to go distributed is simply a question of query optimization. - If you are doing something small. It is going to be more efficient locally than to pay the distributed cost. - We are building a distributed dataflow system, taking advantage of columnar query optimizations. And when communication is necessary ,we can get reduce comms by communicating columnar compressed blocks rather than row tuples. ----- Meeting Notes (7/21/14 03:22) ----- 9 min – 10 min
  23. Next I will talk about SGraphs 10 min
  24. Our scalable datastructure for graph manipulation
  25. The layout works this way. Firstly, We partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary, we use a simple hash function. Each vertex Sframe then contains the vertex ID and all the properties associated with the vertices. Note that this is an Sframe and hence the vertex attributes are stored column-wise.
  26. The layout works this way. Firstly, We partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary, we use a simple hash function. Each vertex Sframe then contains the vertex ID and all the properties associated with the vertices. Note that this is an Sframe and hence the vertex attributes are stored column-wise.
  27. We next partition the edges into 16 Sframes, The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect between vertices in partition 2 and vertices in partition 4. This allows for instance, if computation is to be performed on edge partition (2,4), I only need vertex set 2 and vertex 4 in memory.
  28. We next partition the edges into 16 Sframes, The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect between vertices in partition 2 and vertices in partition 4. This allows for instance, if computation is to be performed on edge partition (2,4), I only need vertex set 2 and vertex 4 in memory.
  29. We next partition the edges into 16 Sframes, The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect between vertices in partition 2 and vertices in partition 4. This allows for instance, if computation is to be performed on edge partition (2,4), I only need vertex set 2 and vertex 4 in memory.
  30. The magic behind this layout is that ultimately, underlying the Sgraph is Sframes, which is stored in columnar fashion. This representation hence allows us to easily separate structure and data. For instance, we can without any copying, create new Graphs which shares the same structure, but with none of the data.
  31. The magic behind this layout is that ultimately, underlying the Sgraph is Sframes, which is stored in columnar fashion. This representation hence allows us to easily separate structure and data. For instance, we can without any copying, create new Graphs which shares the same structure, but with none of the data.
  32. The magic behind this layout is that ultimately, underlying the Sgraph is Sframes, which is stored in columnar fashion. This representation hence allows us to easily separate structure and data. For instance, we can without any copying, create new Graphs which shares the same structure, but with none of the data.
  33. Or introduce new features without rewriting anything. Finally, since the vertex and edge attributes are all simply Sframes, through a trick of lazy evaluation, we can make the both the vertices and edges appear as a single Sframe to the user.
  34. Or introduce new features without rewriting anything. Finally, since the vertex and edge attributes are all simply Sframes, through a trick of lazy evaluation, we can make the both the vertices and edges appear as a single Sframe to the user.
  35. This deep integration of SFrames and SGraphs allow seamless interaction between graph data and table data, and queries can be performed easily across both graphs and tables. Now, this is not so easy to understand just from slides, so I will give a quick demo.
  36. 13 min
  37. Now Lets briefly recap what we have just done during the demo. We have taken tabular data, converted it to graph and run a basic graph algorithm on it, and ask graph questions about it. And next we take the graph, and directly interpret it as a table: joining it with other tables to get some intelligence with very little code and very little friction.. All within a few lines of Python. This is what we have achieved by trying to understand data science from a user-first perspective. Ending up with a datastructure that is easy to use, powerful and enables our machine learning algorithms in GraphLab Create to scale easily to terabyte datasets on a single machine. Thank you.