SFrame: A Scalable, Out-of-Core
Dataframe for Machine Learning
Yucheng Low
Chief Architect @ Dato
Dato Confidential
• Founded in mid-2013 as GraphLab by Prof. Carlos
Guestrin
• Name change to Dato in Jan 2015
Who am I?
• PhD Machine Learning, Carnegie Mellon University
- Advised by Prof. Carlos Guestrin (CEO Dato)
- Thesis work on Distributed Machine Learning
• PowerGraph
- World's Fastest Distributed Graph Processing Framework
• CoFounder and now Chief Architect at Dato
Data Engineering
• Fast & scalable
• Rich data type support
• Visualization
Data Intelligence
• App-oriented ML
• Supporting utils
• Extensibility
Deployment
• Batch & always-on
• RESTful interface
• Elastic & robust
Inspiration to Production
A Quick Demo
Yes, we run on a single
machine.
What can you do with a
single machine?
8
Build a Collaborative Filtering Model on 20 Billion
User-Item Ratings
Do PageRank on a 128 Billion edge graph.
How?
10
SFrames
SGraphs
[Diagram labels: User, Com., Title, Body, User, Disc.]
11
SFrame Python API Example
Import GraphLab Create:
import graphlab as gl
Make a little SFrame of 1 column and 5 values:
sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
Normalize the column x:
sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
sf['x-squared'] = sf['x'].apply(lambda x: x * x)
Create a new column using a vectorized operator:
sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
sf2 = sf[['x', 'x-squared']]
12
SFrame Design Principles
• Graceful Degradation as 1st principle
- Always Works
• Rich Datatypes
• Strong schema types: int, double, string, image, ...
• Weak schema types: list, dictionary (Can contain arbitrary JSON)
• Columnar Architecture
• Easy feature engineering + Vectorized feature operations.
• Immutable columns + Lazy evaluation
• Statistics + visualization + sketches
Columns are first class:
Like Pandas, an SFrame is essentially an arbitrary collection of columns.
13
Internal Representation
sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
[Diagram: files netflix_tr.frame and netflix_norm.frame, each with columns user, movie, rating; sf2 with columns user, item, rating; sf with columns user, item, rating, nrating]
14
Internal Representation
sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']
[Diagram: files netflix_tr.frame and netflix_norm.frame, each with columns user, movie, rating; sf2 with columns user, item, rating; sf with columns user, item, rating, nrating; diff shown as an anonymous column]
A column is an SArray: our equivalent of an external-memory NumPy array.
15
Internal Representation
sf = gl.SFrame('netflix_tr.frame')
sf2 = gl.SFrame('netflix_norm.frame')
sf['nrating'] = sf2['rating']
diff = sf['rating'] - sf2['rating']
sf['diff'] = diff
[Diagram: files netflix_tr.frame and netflix_norm.frame, each with columns user, movie, rating; sf2 with columns user, item, rating; sf with columns user, item, rating, nrating, diff; the formerly anonymous column is now named diff]
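One way to read the diagrams above is that a frame holds references to columns, so assigning a column from another frame copies no data. The following is a minimal sketch of that reading in plain Python. `Column` and `Frame` are hypothetical names made up for illustration; this is not the SFrame implementation.

```python
class Column:
    """Hypothetical immutable column: values are fixed at construction."""

    def __init__(self, values):
        self._values = tuple(values)

    def __sub__(self, other):
        # Element-wise operations produce a new column; inputs are untouched.
        return Column(a - b for a, b in zip(self._values, other._values))

    def to_list(self):
        return list(self._values)


class Frame:
    """Hypothetical frame: a named collection of column references."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return self._columns[name]

    def __setitem__(self, name, column):
        # Only the reference is stored; no values are copied.
        self._columns[name] = column

    def column_names(self):
        return list(self._columns)


sf = Frame({"user": Column([1, 2]), "item": Column([10, 20]),
            "rating": Column([4, 5])})
sf2 = Frame({"user": Column([1, 2]), "item": Column([10, 20]),
             "rating": Column([0.8, 1.0])})

sf["nrating"] = sf2["rating"]        # shares sf2's column object
diff = sf["rating"] - sf2["rating"]  # anonymous column, not in any frame yet
sf["diff"] = diff                    # now attached to sf under a name
```

Because columns are immutable in this sketch, sharing one column between two frames is safe: neither frame can change what the other sees.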
16
Physical Representation
SFrame File with columns: user, movie, rating
Type-aware compression:
• Variable bit-length encoding
• Frame-of-reference encoding
• ZigZag encoding
• Delta / delta-ZigZag encoding
• Dictionary encoding
• General-purpose LZ4
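To make a few of the encodings listed above concrete, here is a small sketch of delta, ZigZag and variable-length integer coding in plain Python. It is illustrative only: the function names are made up, and a byte-oriented varint stands in for the slide's variable bit-length encoding. The actual SFrame file format is not described in the slides.

```python
def zigzag_encode(n):
    """Map signed ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def zigzag_decode(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def varint_encode(u):
    """Encode an unsigned int using 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def delta_zigzag_encode(values):
    """Store successive differences, each ZigZag- and varint-coded."""
    out = bytearray()
    previous = 0
    for v in values:
        out += varint_encode(zigzag_encode(v - previous))
        previous = v
    return bytes(out)


def delta_zigzag_decode(data):
    values = []
    pos = 0
    previous = 0
    while pos < len(data):
        z, pos = varint_decode(data, pos)
        previous += zigzag_decode(z)
        values.append(previous)
    return values


ids = [1000000, 1000003, 1000001, 1000010]
encoded = delta_zigzag_encode(ids)
```

Large but slowly changing integers, such as sorted or clustered ids, become small deltas, and small deltas need few bytes. ZigZag keeps negative deltas small as well.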
17
Lazy Evaluation + Query Optimization
• Avoid writing to disk/SSD whenever possible.
• Common model for many other systems
(Spark, Blaze, etc.)
• Extensions from classical “SQL” optimization
methods needed to maintain both “tables”
(SFrames) and “columns” (SArrays)
18
Lazy Evaluation
sf = gl.SFrame('netflix_norm.frame')
[Query plan nodes: SFrame Source (sf)]
19
Lazy Evaluation
sf = gl.SFrame('netflix_norm.frame')
ratings = sf['ratings']
[Query plan nodes: SFrame Source (sf); Project ratings (ratings)]
20
Lazy Evaluation
sf = gl.SFrame('netflix_norm.frame')
ratings = sf['ratings']
ratings = ratings * 2
[Query plan nodes: SFrame Source (sf); Project ratings; Apply * 2 (ratings)]
21
Lazy Evaluation
sf = gl.SFrame('netflix_norm.frame')
ratings = sf['ratings']
ratings = ratings * 2
sf['ratings'] = ratings
[Query plan nodes: SFrame Source; Project ratings; Apply * 2; Project user, item; Union Columns (sf)]
22
Lazy Evaluation
sf = gl.SFrame('netflix_norm.frame')
ratings = sf['ratings']
ratings = ratings * 2
sf['ratings'] = ratings
sf['ratings'].sum()
[Query plan nodes: SFrame Source; Project ratings; Apply * 2; Project user, item; Union Columns; Project ratings; Reduce]
23
Query Optimization
sf = gl.SFrame('netflix_norm.frame')
ratings = sf['ratings']
ratings = ratings * 2
sf['ratings'] = ratings
sf['ratings'].sum()
[Query plan nodes: SFrame Source; Project ratings; Apply * 2; Project user, item; Union Columns; Project ratings; Reduce]
24
Physical Plan
sf = gl.SFrame('netflix_norm.frame')
ratings = sf['ratings']
ratings = ratings * 2
sf['ratings'] = ratings
sf['ratings'].sum()
[Physical plan nodes: SArray Source; Apply * 2; Reduce]
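The lazy-evaluation slides can be approximated with a tiny operator graph in plain Python. This is a sketch under assumptions: the class names are hypothetical, a list of dicts stands in for the file on disk, and the Union Columns step is omitted so that the plan matches the simplified Source, Apply, Reduce shape of the physical plan.

```python
class Node:
    """Hypothetical lazy plan node; nothing runs until rows() is consumed."""

    def rows(self):
        raise NotImplementedError


class Source(Node):
    def __init__(self, table):
        self.table = table  # list of dicts, standing in for a file on disk
        self.reads = 0      # counts how many scans of the data were started

    def rows(self):
        self.reads += 1
        for row in self.table:
            yield row


class Project(Node):
    def __init__(self, child, names):
        self.child, self.names = child, names

    def rows(self):
        for row in self.child.rows():
            yield {name: row[name] for name in self.names}


class Apply(Node):
    def __init__(self, child, name, fn):
        self.child, self.name, self.fn = child, name, fn

    def rows(self):
        for row in self.child.rows():
            yield {self.name: self.fn(row[self.name])}


def reduce_sum(node, name):
    """Reduce forces evaluation: the first point where data is read."""
    return sum(row[name] for row in node.rows())


source = Source([
    {"user": 1, "item": 10, "ratings": 3},
    {"user": 2, "item": 20, "ratings": 5},
])
ratings = Project(source, ["ratings"])                # ratings = sf['ratings']
doubled = Apply(ratings, "ratings", lambda r: r * 2)  # ratings = ratings * 2
reads_before = source.reads                           # still 0: only a plan
total = reduce_sum(doubled, "ratings")                # sf['ratings'].sum()
```

Building `ratings` and `doubled` only records operators. The data is scanned once, when the reduction asks for rows, and rows stream through the operators without any intermediate result being stored.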
25
Query Optimization
• Very rich source of performance
optimizations
- Pushing filter operations upwards
- Pushing project operations upwards
- Pushing complex joins downwards
- Join ordering
- Etc.
26
[Figure: example of a large query plan graph made up of many union, PR(...), Tr->{C0,...,C16} and SF(S1,S2,S3,S4,S5) nodes]
27
Execution time Optimizations
• C++ Coroutine incremental decoder.
- Only require compressed blocks to be resident in memory. Never
require complete decompression of any block.
• Runtime Input Skipping
- Significant performance optimization for small filters. Avoid
reading data that will be thrown away.
• Etc.
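The slide does not say how runtime input skipping is implemented. One plausible mechanism, sketched here as an assumption, is to evaluate the filter on its own column first and decode a block of another column only if at least one row in that block passes. All names below are hypothetical.

```python
def filter_with_skipping(filter_blocks, payload_blocks, predicate, decode):
    """Return payload values whose matching filter value passes predicate.

    filter_blocks:  list of lists, the already-decoded filter column
    payload_blocks: list of encoded blocks, aligned with filter_blocks
    decode:         function turning one encoded payload block into a list
    """
    results = []
    for filter_values, encoded in zip(filter_blocks, payload_blocks):
        mask = [predicate(v) for v in filter_values]
        if not any(mask):
            continue  # skip: this payload block is never decoded
        values = decode(encoded)
        results.extend(v for v, keep in zip(values, mask) if keep)
    return results


decoded_blocks = []


def decode(block):
    decoded_blocks.append(block)  # record which blocks were actually touched
    return block.split(",")


# Toy data: three blocks of three rows each.
page_rank = [[5, 12, 7], [150, 3, 220], [1, 2, 3]]
page_url = ["a,b,c", "d,e,f", "g,h,i"]
hits = filter_with_skipping(page_rank, page_url, lambda r: r > 100, decode)
```

With a selective filter most blocks have no passing row, so most of the second column is never decoded. That is consistent with the slide's remark that the gain is largest for small filters.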
28
Filter Performance
Query 1b: X = 100
SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100
[Bar chart, axis 0 to 75: Redshift, Impala (1.2.3), Spark (1.3), Hive - 0.12, Tez - 0.2.0, GraphLab Create (1.4); annotation: 5 Machines]
29
Filter Performance
Query 1c: X = 10
SELECT pageURL, pageRank FROM rankings WHERE pageRank > 10
[Bar chart, axis 0 to 50: Redshift, Impala (1.2.3), Spark (1.3), Hive - 0.12, Tez - 0.2.0, GraphLab Create (1.4); annotation: 5 Machines]
30
Scalability
• Largest synthetic:
- 950 columns, 1B rows, dense numeric
- About 1 trillion numeric values
• No real limit on the number of rows; practically 100s-1000s of columns
31
SGraph Extension
[Diagram labels: User, Com., Title, Body, User, Disc.]
32
SGraph Layout
Vertex SFrames: 1, 2, 3, 4
__id   Name   Address   ZipCode
1011   John   …         98105
2131   Jack   …         98102
Vertices partitioned into p = 4 SFrames.
33
SGraph Layout
Vertex SFrames: 1, 2, 3, 4
Edge SFrames:
(1,1) (1,2) (1,3) (1,4)
(2,1) (2,2) (2,3) (2,4)
(3,1) (3,2) (3,3) (3,4)
(4,1) (4,2) (4,3) (4,4)
Edges partitioned into p^2 = 16 SFrames.
Edge Traversal Ordering
[Diagram: the 16 edge SFrames (1,1) through (4,4), annotated with the order in which they are traversed]
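The layout above can be sketched in a few lines of Python. The slides give only the counts (p vertex partitions, p^2 edge partitions) and the (i,j) labels. The rules used here are assumptions: vertices are placed by id modulo p, and an edge goes to the cell named by the partitions of its source and destination.

```python
P = 4  # number of vertex partitions, as on the slide


def vertex_partition(vertex_id, p=P):
    """Assumed rule: partition by id modulo p, numbered 1..p."""
    return vertex_id % p + 1


def edge_partition(src, dst, p=P):
    """Assumed rule: cell (partition of src, partition of dst)."""
    return (vertex_partition(src, p), vertex_partition(dst, p))


def partition_edges(edges, p=P):
    """Distribute edges over a p x p grid of edge partitions."""
    grid = {(i, j): [] for i in range(1, p + 1) for j in range(1, p + 1)}
    for src, dst in edges:
        grid[edge_partition(src, dst, p)].append((src, dst))
    return grid


edges = [(1, 2), (2, 5), (4, 1)]
grid = partition_edges(edges)
```

Under these assumptions, processing edge partition (i,j) touches only vertex partitions i and j, so at most two vertex partitions need to be in memory at a time. The order in which the 16 cells are visited then determines how often vertex partitions have to be swapped.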
35
PageRank on Twitter Graph
[Bar chart of Runtime (s), axis 0 to 700: GraphLab Create, GraphX, Giraph; annotations: 16 Machines, 1 Machine]
* Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica. GraphX: A Resilient Distributed Graph System on Spark. GRADES 2013.
** Aapo Kyrola, Guy Blelloch, Carlos Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. OSDI 2012.
36
Connected Components on Twitter Graph
[Bar chart of Runtime (s), axis 0 to 300: GraphLab Create, GraphX, Giraph; annotations: 16 Machines, 1 Machine]
* Reynold Xin, Joseph Gonzalez, Michael Franklin, Ion Stoica. GraphX: A Resilient Distributed Graph System on Spark. GRADES 2013.
** Aapo Kyrola, Guy Blelloch, Carlos Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. OSDI 2012.
37
Scalability
1x r3.8xlarge using 1x SSD.
Common Crawl Graph:
3.5 billion nodes and 128 billion edges.
Largest available public graph.
PageRank: 9 min per iteration.
Connected Components: ~ 1 hr.
Represented naively: 2 TB of memory.
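The 2 TB figure is consistent with a simple back-of-the-envelope calculation. The slide does not show how it was derived, so the assumption here is that a naive representation stores each edge as a pair of 64-bit vertex ids.

```python
EDGES = 128 * 10**9      # 128 billion edges, from the slide
BYTES_PER_VERTEX_ID = 8  # assumption: 64-bit vertex ids
IDS_PER_EDGE = 2         # source and destination

naive_bytes = EDGES * IDS_PER_EDGE * BYTES_PER_VERTEX_ID
naive_terabytes = naive_bytes / 10**12
```

Under this assumption the edge list alone comes to about 2 TB, before counting any vertex or edge data.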
38
SFrame SGraph
Core Underlying Datastructures
39
ML Algorithms
• Uses SFrames as data source, but models must fit in memory. Implementations are in C++ and use whatever it takes to go fast.
- Parallel execution
- Controlled memory access races (some algorithms are robust to this if you are careful)
- Multiple solvers for different data regimes
- GPUs
40
ML Algorithms
• Robustness and ease of use of the algorithms are key objectives, not just performance. Sparse/dense regimes, etc.
- Probably the most robust linear solvers out there.
"Garbage In, Guidance Out."
- VP Eng. Sethu Raman
41
ML Algorithms
• Linear Models
• Boosted Trees
• Recommender Algorithms
- (with features, ranking regime, etc.)
• Deep-learning – GPU-support.
• Clustering
• Fuzzy Matching
• Text models
• Etc.
42
Logistic Regression
[Bar chart of Runtime (s), axis 0 to 2500; bar values as listed: 651, 2092, 2361, 291; bar labels as listed: MLLib 5 nodes, MLLib 1 node, Scikit-Learn, GraphLab Create]
Dataset Source: LIBLinear binary classification datasets.
KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed.
Task: Predict student performance on math problems based on interactions with a tutoring system.
43
Logistic Regression on a 100GB dataset on commodity hardware: 1.5 hrs.
Learn a collaborative filtering model on 4 billion user-item ratings in 1 hour. (Scales to 20B.)
44
Small Data
Big Data
Time to Rethink Big Data
Distributed
46
Good Reasons To Distribute
• IO Bound tasks
- N times more Disks
- N times more Network
- N times more Memory bandwidth
• Compute Bound Embarrassingly Parallel Tasks
47
SFrames:
Good to distribute. Working on it.
SGraphs:
Harder to distribute, but it can be done and we are experts at it. Working on it.
Distributed Machine Learning?
49
Common Performance Fallacy
• Distributed Algorithm for ML Model X
              Examples Per Sec
1 Machine     1,000
16 Machines   15,000
64 Machines   62,000
Neural Networks?
Collaborative Filtering?
Have you solved the Distributed ML problem for Model X?
50
Algorithm A
• Single machine
• Very high convergence rate
- Requires 10 passes through the dataset
• Each pass is slow
Algorithm B
• Easy to distribute
• Low convergence rate
- Requires 100 passes through the dataset
• Each pass is fast
Depending on system and network characteristics:
If you have fewer than X machines: A is faster
Else: B is faster
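The break-even point X can be illustrated with a toy cost model. Only the pass counts (10 and 100) come from the slide. The per-pass times, the fixed per-pass overhead for the distributed algorithm, and the assumption that B speeds up linearly with machine count are made up for illustration.

```python
def total_time_a(pass_seconds=100.0, passes=10):
    """Algorithm A: single machine, few but slow passes."""
    return passes * pass_seconds


def total_time_b(machines, single_machine_pass_seconds=100.0, passes=100,
                 overhead_seconds=1.0):
    """Algorithm B: each pass speeds up with machines, plus a fixed cost."""
    return passes * (single_machine_pass_seconds / machines + overhead_seconds)


def break_even_machines(max_machines=1024):
    """Smallest machine count at which B beats A, or None if it never does."""
    for n in range(1, max_machines + 1):
        if total_time_b(n) < total_time_a():
            return n
    return None


x = break_even_machines()
```

With these made-up numbers B needs 12 machines just to match one machine running A, even though B's throughput grows with every machine added. Time to a converged model is the quantity to compare, not examples per second.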
51
Common Performance Fallacy
You should be comparing convergence rates
against the best known single machine algorithms.
This is from a recent ML paper co-authored with Jeff Dean, so we can be sure the system is solid.
It took 600 machines to decrease training time by 3x.
52
Distributed ML
• Interesting ML is notoriously hard to distribute effectively
- Some special cases people have worked on (me included).
- "Beyond MapReduce": think HPC.
• Distribute only when it makes sense
- Balancing Convergence Rates vs Throughput
• We are working on it too
53
jobs@dato.com
We are Hiring!

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 

Recently uploaded (20)

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 

SFrame: Scalable Out-of-Core Dataframe for Machine Learning

Editor's Notes

  1. Mention unstructured data.
  2. Data structures designed to ease a data scientist's work with data manipulation as much as possible. We have worked with datasets from many customers, ranging from rather nice and structured to "what the hell were you thinking when you did this". The worst was JSON encoded as a string... inside a JSON string. In a way we built these for ourselves. Taking into consideration all our experience working on semi-structured data and the lessons we have learnt, we want to make it as easy as possible for everyone to work with data, in either tabular or graph form.
  3. The API looks somewhat Pandas-like and carries very similar ideas. For instance: 1) It is as easy to load an SFrame with a billion rows as it is to construct the tiny SFrame here of 1 column and 5 values. 2) We can easily reassign the value of an entire column: here we normalize the column by the sum of its values. 3) We can easily create new columns by applying a Python lambda to each entry. This lambda operation is parallelized behind the scenes. 4) We can also create new columns by using the vectorized operators. 5) We can easily subselect columns to create new SFrames, and because columns are immutable, this operation is essentially free. So the SFrame can be used like a general-purpose table, but with a carefully curated set of scalable operations.
  4. This leads to the SFrame's use of a columnar architecture. - By representing the data column-wise, we can support easy feature engineering. - By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations. - We can easily visualize or sketch statistics about single features.
  5. Not all optimizations are implemented yet, only some basic ones. This is a rich space we are expanding on. We just rearchitected the entire SFrame execution engine in the last month.
  6. A randomly generated example we use for testing: fuzz testing our query optimizer.
  7. Some fun optimizations.
  8. The layout works this way. First, we partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary; we use a simple hash function. Each vertex SFrame then contains the vertex ID and all the properties associated with the vertices. Note that this is an SFrame, and hence the vertex attributes are stored column-wise.
  9. We next partition the edges into 16 SFrames. The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect vertices in partition 2 to vertices in partition 4. This means that if computation is to be performed on edge partition (2,4), I only need vertex partitions 2 and 4 in memory.
  10. If you were to try to represent this in memory, it would take a minimum of a TB of memory or so, excluding overheads. If you were to run this distributed, the maintenance overhead is insane. But.
  11. Handle all data regimes: sparse, dense, ill-conditioned, etc. Our VP Eng says “…” This really is a tall order for many ML applications. There is only so much diagnosis one can do, but it is something we try hard to do.
  12. Handle all data regimes: sparse, dense, ill-conditioned, etc. Our VP Eng says “…” This really is a tall order for many ML applications. There is only so much diagnosis one can do, but it is something we try hard to do.
  13. A single machine is much cheaper, easier to maintain, faster to get working, and can scale much bigger than you think.
  14. SGraphs: I spent 4-5 of my PhD years working on distributed ML and on scaling graphs. And we still have the world's fastest general-purpose distributed graph computation architecture.
  15. There have been claims of distributed deep learning. But as far as I am concerned, no one has solved distributed deep learning. Throughput is not the objective. It's not how many "images you can do per second". Convergence is the objective. And your speedup has to be measured against the best-tuned single-machine solver. Here is the gap between the best tuned single machine sol
  16. When some of the most state-of-the-art methods required 600 machines to get a 3x improvement over a single machine, you know there is still a way to go.
  17. Distributed ML is nifty except in the simplest cases of "Map-Reduceable algorithms". Almost all other cases are hilariously difficult.
  18. Another good reason to distribute is that you love trying to make things fast.
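
The vertex and edge layout described in notes 8 and 9 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the actual SGraph implementation: the names `vertex_partition`, `partition_graph` and `vertex_partitions_needed` are hypothetical, plain dicts and lists stand in for SFrames, a modulo on integer IDs stands in for the "simple hash function", and 4 vertex partitions are assumed because a 4 x 4 adjacency grid gives the 16 edge partitions mentioned. Partitions here are numbered from 0, whereas the notes count from 1.

```python
from collections import defaultdict

# Assumption: 4 vertex partitions, giving a 4 x 4 grid of 16 edge partitions.
NUM_PARTITIONS = 4


def vertex_partition(vertex_id, num_partitions=NUM_PARTITIONS):
    """Map a vertex ID to a partition (illustrative stand-in for a hash)."""
    return vertex_id % num_partitions


def partition_graph(vertices, edges, num_partitions=NUM_PARTITIONS):
    """Split vertices and edges into partitions.

    vertices: dict of vertex_id -> properties
    edges: iterable of (src_id, dst_id) pairs
    """
    # One table of vertices per partition.
    vertex_parts = defaultdict(dict)
    for vid, props in vertices.items():
        vertex_parts[vertex_partition(vid, num_partitions)][vid] = props

    # Edge partition (i, j) holds edges from vertex partition i to partition j.
    edge_parts = defaultdict(list)
    for src, dst in edges:
        key = (vertex_partition(src, num_partitions),
               vertex_partition(dst, num_partitions))
        edge_parts[key].append((src, dst))
    return vertex_parts, edge_parts


def vertex_partitions_needed(edge_key):
    """Vertex partitions that must be in memory to process one edge partition."""
    return set(edge_key)


vertices = {i: {"name": f"v{i}"} for i in range(8)}
edges = [(0, 1), (2, 4), (6, 0), (5, 7), (3, 3)]
vertex_parts, edge_parts = partition_graph(vertices, edges)

for key in sorted(edge_parts):
    print(key, edge_parts[key], "needs vertex partitions",
          sorted(vertex_partitions_needed(key)))
```

The point of the grid layout is visible in `vertex_partitions_needed`: processing any one edge partition touches at most two vertex partitions, so the rest of the vertex data can stay on disk.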