6. Data model
Spark
• RDD for batch, DStream for streaming
• Explicit caching semantics
• Two sets of APIs
Dataflow
• PCollection for both batch and streaming
• Windowed and timestamped values
• One unified API
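The contrast above can be sketched in code. This is an illustrative fragment, not runnable on its own: it assumes Spark's `SparkContext`/`StreamingContext` and Scio's `ScioContext` are in scope, and the exact Scio method names are approximations.

```scala
// Spark: two distinct types, one per mode (sketch only)
val batch: RDD[String] = sparkContext.textFile("gs://bucket/logs/*")
val stream: DStream[String] = streamingContext.socketTextStream("localhost", 9999)

// Dataflow/Scio: one SCollection type for both modes; values carry timestamps,
// and windowing is what turns batch-style code into streaming code
val events: SCollection[String] = sc.pubsubTopic("projects/my-project/topics/events")
events
  .withFixedWindows(Duration.standardMinutes(1)) // windowed, timestamped values
  .countByValue
```

The same transform chain works unchanged on a bounded (batch) or unbounded (streaming) SCollection, which is the point of the unified API.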
7. Execution
Spark
• Driver and executors
• Dynamic execution from driver
• Transforms and actions
Dataflow
• No master
• Static execution planning
• Transforms only, no actions
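A small sketch of the transforms-vs-actions distinction, assuming Spark and Scio APIs (the `parse` function and bucket paths are hypothetical):

```scala
// Spark: an action such as take() runs a job immediately and returns results
// to the driver, which can then branch on them (dynamic execution)
val sample = rdd.map(parse).filter(_.valid).take(10)

// Dataflow: transforms only; the full pipeline is planned statically and
// submitted as one job, so there is no driver-side branching on results
sc.textFile("gs://bucket/input")
  .map(parse)
  .filter(_.valid)
  .saveAsTextFile("gs://bucket/output") // a sink transform, not an action
```

Because nothing executes until the whole pipeline is submitted, Dataflow can optimize the entire graph up front, at the cost of Spark-style interactive, result-driven control flow.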
9. Why not Scalding on GCE
Pros
• Community
Twitter, eBay, Etsy, Stripe, LinkedIn, …
• Stable and proven
10. Why not Scalding on GCE
Cons
• Hadoop cluster operations
• Multi-tenancy
resource contention and utilization
• No streaming mode (Summingbird?)
11. Why not Spark on GCE
Pros
• Batch, streaming, interactive and SQL
• MLlib, GraphX
• Scala, Python, and R support
• Zeppelin, spark-notebook, Hue
12. Why not Spark on GCE
Cons
• Hard to tune and scale
• Cluster lifecycle management
13. Why Dataflow with Scala
Dataflow
• Hosted solution, no operations
• Ecosystem
GCS, BigQuery, PubSub, Bigtable, …
• Unified batch and streaming model
14. Why Dataflow with Scala
Scala
• High level DSL
easy transition for developers
• Reusable and composable code via FP
• Numerical libraries: Breeze, Algebird
18. WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
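For comparison, a sketch of the equivalent Spark version (assuming a configured `SparkContext`). Note one real difference hiding behind the similarity: Spark's `countByValue` is an action that returns a local `Map` to the driver, so a distributed count uses `reduceByKey` instead.

```scala
val sc = new SparkContext(conf)
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .map((_, 1L))
  .reduceByKey(_ + _)  // distributed count; Spark's countByValue would collect to the driver
  .saveAsTextFile("wordcount.txt")
```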
19. PageRank
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
20. Spotify Running
60 million tracks
30M users × 10 tempo buckets × 25 tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories, …
Latent vectors from collaborative filtering
25. Personalized new releases
• Pre-computed weekly on Hadoop
(on-premise cluster)
• 100GB recommendations
from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC
26. User conversion analysis
• For marketing and campaigning strategies
• Track user transitions through products
• Aggregated for simulation and projection
• 150GB BigQuery in and out