SFrame

Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.

The SFrame package provides the complete implementation of:

SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)

  1. SFrames. Yucheng Low, Chief Architect @ Dato
  2. Stack diagram:
     • GraphLab Create: scalable machine learning (recommenders, other task-oriented ML, boosted decision trees, deep learning, pattern mining, many others)
     • SGraph, SFrame: compressed in-core or out-of-core scalable data structures (C++11)
     • Local, HDFS, S3
  3. Stack diagram:
     • SGraph, SFrame: compressed in-core or out-of-core scalable data structures
     • Local, HDFS, S3
     https://github.com/dato-code/sframe
  4. Python API
         sf = gl.SFrame.read_csv('netflix.csv')
         sf2 = gl.SFrame('netflix_norm.frame')
         sf['nrating'] = sf2['rating']
     Diagram labels: netflix_tr.frame (user, movie, rating); netflix_norm.frame (user, movie, rating); sf2 (user, item, rating); sf (user, item, rating, nrating).
  5. Python API
         sf = gl.SFrame.read_csv('netflix.csv')
         sf2 = gl.SFrame('netflix_norm.frame')
         sf['nrating'] = sf2['rating']
         diff = sf['rating'] - sf2['rating']
     Diagram labels: netflix_tr.frame (user, movie, rating); netflix_norm.frame (user, movie, rating); sf2 (user, item, rating); sf (user, item, rating, nrating); diff (anonymous column).
  6. Python API
         sf = gl.SFrame.read_csv('netflix.csv')
         sf2 = gl.SFrame('netflix_norm.frame')
         sf['nrating'] = sf2['rating']
         diff = sf['rating'] - sf2['rating']
         sf['diff'] = diff
     Diagram labels: netflix_tr.frame (user, movie, rating); netflix_norm.frame (user, movie, rating); sf2 (user, item, rating); sf (user, item, rating, nrating, diff); diff (anonymous column until assigned).
     Not a SQL frontend:
     • Filtering: sf[sf['rating'] >= 3]
     • Joins: sf.join(user_table, on='user_id')
     • Random/array indexing: row10 = sf[10]; table_with_every_other_row = sf[::2]
     • Rather fast parallelized UDFs (interprocess shared memory): sf['rating'].apply(lambda x: x*x)
  7. Column Types Supported
     • Boring scalar types: int64, double, string
     • Interesting scalar types: datetime.datetime, image
     • For the mathematician: array('d')
     • For the "all real data is ugly" types: list, dict (arbitrary union types; e.g. a list can contain anything, including other lists and dicts)
  8. What Are SFrames
     • Physical Storage Layer: compressed column store (with some interesting properties)
     • Lazy Query Optimization / Execution: C++ coroutine execution pipeline
     • Python API: heavily Pandas inspired (+ immutable data considerations)
     • File System Abstraction: Local, HDFS, S3, Cache
     Type-aware compression methods. Very aggressive numeric compression.
     Netflix dataset (99M rows, 3 columns, ints): 1.4GB raw, 289MB gzip compressed, 160MB.
  9. Query Planning
     • Physical Storage Layer: compressed column store (with some interesting properties)
     • Lazy Query Optimization / Execution: C++ coroutine execution pipeline
     • Python API: heavily Pandas inspired (+ immutable data considerations)
     • File System Abstraction: Local, HDFS, S3, Cache
         p['X4'] = p['X3'] + p['X2']
         g = p[p['X1'] < 10]
  10. Language Binding
      • Python bindings: our oldest binding. Via Cython + interprocess communication to a C++ binary.
      • R bindings: via our Rcpp C++11 bindings (exported in SDK)
      • C++11 bindings:
            auto g = gl_sframe();
            g["hello"] = gl_sarray::from_sequence(0,1000);
            g["world"] = 2;
            g["hello"] = (g["hello"] / 2).astype(flex_type_enum::INTEGER);
            auto ret = g.groupby({"hello"}, {{"sum of world", aggregate::SUM("world")}});
            ret = ret.sort({"hello"});
            cout << ret;
        Output:
            Columns:
                hello           integer
                sum of world    integer
            Rows: 500
            Data:
            +----------------+----------------+
            | hello          | sum of world   |
            +----------------+----------------+
            | 0              | 4              |
            | 1              | 4              |
            | 2              | 4              |
            | 3              | 4              |
            | 4              | 4              |
            | 5              | 4              |
            | 6              | 4              |
            | 7              | 4              |
            | 8              | 4              |
            | 9              | 4              |
            +----------------+----------------+
            [500 rows x 2 columns]
  11. Common Crawl Graph: 3.5 billion nodes and 128 billion edges, on 1x r3.8xlarge using 1x SSD.
      • PageRank: 9 min per iteration
      • Connected Components: ~1 hr
      There isn't any general-purpose library out there capable of this.
  12. https://github.com/dato-code/sframe
      pip install sframe
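
The column operations on the Python API slides (slides 4 to 6) can be read as ordinary element-wise operations over whole columns. The following is a plain-Python stand-in for those semantics, assuming columns behave like equal-length sequences; it does not use the SFrame package, and the sample values are made up for illustration.

```python
# Plain-Python stand-in for the column operations on slides 4 to 6.
# Not the SFrame API: columns are ordinary lists and the data is invented.

ratings = [5, 2, 4, 3, 1, 4]                # plays the role of sf['rating']
nratings = [0.9, 0.1, 0.7, 0.5, 0.0, 0.7]   # plays the role of sf2['rating']

# diff = sf['rating'] - sf2['rating']  (element-wise subtraction)
diff = [r - n for r, n in zip(ratings, nratings)]

# sf[sf['rating'] >= 3]  (a boolean mask selects rows)
mask = [r >= 3 for r in ratings]
kept = [r for r, keep in zip(ratings, mask) if keep]

# sf[::2]  (every other row)
every_other = ratings[::2]

# sf['rating'].apply(lambda x: x*x)  (per-element user-defined function)
squared = [x * x for x in ratings]
```

The slides describe the real operations as lazy and parallelized; this sketch is eager and single-threaded, and only shows what each expression computes.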
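
Slide 8 mentions type-aware, aggressive numeric compression but does not say which encodings are used. As an illustration of why knowing that a column holds integers helps, here is a minimal sketch of one common technique, delta encoding followed by zigzag and variable-length byte encoding. This is an assumed stand-in, not SFrame's actual codec, and all function names are hypothetical.

```python
# Sketch of type-aware integer compression: delta + zigzag + varint.
# An illustrative technique only; not taken from the SFrame source.

def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return n * 2 if n >= 0 else -n * 2 - 1

def unzigzag(z):
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2

def encode_varint(value, out):
    """Append an unsigned int to `out`, 7 bits per byte, low bits first."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)   # high bit set: more bytes follow
        value >>= 7
    out.append(value)

def compress_ints(values):
    """Store each value as the varint-encoded difference from its predecessor."""
    out = bytearray()
    prev = 0
    for v in values:
        encode_varint(zigzag(v - prev), out)
        prev = v
    return bytes(out)

def decompress_ints(data):
    """Rebuild the original integers from compress_ints() output."""
    values, prev, shift, acc = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                      # continuation byte
        else:
            prev += unzigzag(acc)           # last byte of this varint
            values.append(prev)
            shift, acc = 0, 0
    return values

# A sorted ID-like column: neighbouring values differ by 0 or 1.
user_ids = [1_000_000 + i // 3 for i in range(3000)]
packed = compress_ints(user_ids)
restored = decompress_ints(packed)
```

On slowly changing columns such as sorted IDs, most deltas fit in a single byte, so `len(packed)` is far below the 8 bytes per value of a raw int64 layout.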
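
The two statements on the Query Planning slide (slide 9) are recorded lazily rather than executed immediately. A rough sketch of that idea follows, using hypothetical `LazyColumn` and `LazyFrame` classes built on Python generators. It shows deferred execution only; the slide's query optimization and C++ coroutine pipeline are not modeled.

```python
# Sketch of deferred execution for the slide 9 statements.
# LazyColumn and LazyFrame are hypothetical names, not SFrame classes.

class LazyColumn:
    """A column described by a recipe; no rows are computed until iteration."""

    def __init__(self, make_iter):
        self._make_iter = make_iter

    @classmethod
    def from_list(cls, values):
        return cls(lambda: iter(values))

    def __iter__(self):
        return self._make_iter()

    def __add__(self, other):
        return LazyColumn(lambda: (a + b for a, b in zip(self, other)))

    def __lt__(self, scalar):
        return LazyColumn(lambda: (a < scalar for a in self))


class LazyFrame:
    """A set of named lazy columns."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.columns[key]
        mask = key  # otherwise the key is a lazy boolean column

        def filtered(col):
            return LazyColumn(lambda: (v for v, keep in zip(col, mask) if keep))

        return LazyFrame({name: filtered(col) for name, col in self.columns.items()})

    def __setitem__(self, name, column):
        self.columns[name] = column

    def materialize(self):
        """Run the recorded recipes and return concrete lists."""
        return {name: list(col) for name, col in self.columns.items()}


p = LazyFrame({
    'X1': LazyColumn.from_list([5, 20, 7, 12]),
    'X2': LazyColumn.from_list([1, 2, 3, 4]),
    'X3': LazyColumn.from_list([10, 20, 30, 40]),
})
p['X4'] = p['X3'] + p['X2']   # recorded, not computed
g = p[p['X1'] < 10]           # recorded, not computed
result = g.materialize()      # all work happens here
```

Because nothing runs until `materialize()`, a real planner has the whole expression available and could, for example, apply the filter before computing `X4`. This sketch makes no such reordering.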
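
The output table on the C++11 bindings slide (slide 10) can be checked by hand. The sketch below redoes the same computation in plain Python, assuming `from_sequence(0, 1000)` yields the half-open range 0 to 999 (which matches the 500 rows shown) and that the integer cast of `hello / 2` truncates. It does not call the C++ SDK.

```python
from collections import defaultdict

# Plain-Python re-computation of the slide 10 example; no SFrame calls.
hello = list(range(0, 1000))        # assumed meaning of from_sequence(0, 1000)
world = [2] * len(hello)            # g["world"] = 2
hello = [h // 2 for h in hello]     # (g["hello"] / 2) cast to integer

# groupby "hello", summing "world" within each group
sums = defaultdict(int)
for h, w in zip(hello, world):
    sums[h] += w

rows = sorted(sums.items())         # ret.sort({"hello"})
```

Each value of `hello` appears twice after the halving, so every group sums two 2s, giving the 4 in every row of the slide's output.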