Sorry
How Bieber broke
Google Cloud at Spotify
Neville Li
@sinisa_lyh
Who am I?
● Spotify NYC since 2011
● Formerly Yahoo! Search
● Music recommendations
● Data & ML infrastructure
● Scala since 2013
About Us
● 100M+ active users
● 40M+ subscribers
● 30M+ songs, 20K new per day
● 2B+ playlists
● 1B+ plays per day
And We Have Data
Hadoop at Spotify
● On-premise → Amazon EMR → On-premise
● ~2,500 nodes, largest in EU
● 100PB+ Disk, 100TB+ RAM
● 60TB+ per day log ingestion
● 20K+ jobs per day
Data Processing
● Luigi, Python M/R, circa 2011
● Scalding, Spark, circa 2013
● 200+ Scala users, 1K+ unique jobs
● Storm for real-time
● Hive for ad-hoc analysis
Moving to Google Cloud
Early 2015
Hive → BigQuery
● Full Avro scans → Columnar storage
● M/R jobs → Optimized execution engine
● Batch → interactive query
● $$$ → Pay for bytes processed
● Beam/Dataflow integration
Dremel Paper, 2010
Top Tracks in Vancouver Jun 2017
● 30 date partitioned tables, 60TB
● 1 metadata table, 418GB
● 94.2s, 4.82TB processed
Track               Artist           Count
Despacito - Remix   Luis Fonsi       349188
I'm the One         DJ Khaled        214863
HUMBLE.             Kendrick Lamar   200690
Unforgettable       French Montana   192141
XO TOUR Llif3       Lil Uzi Vert     148546
Slide               Calvin Harris    134236
Attention           Charlie Puth     131514
Strip That Down     Liam Payne       125191
2U                  David Guetta     124998
Top Bieber Tracks Jun 2017
● 30 date partitioned tables, 60TB
● 1 metadata table, 418GB
● 81.8s, 2.92TB processed
Track                    Count
Love Yourself            28400153
Sorry                    27269157
What Do You Mean?        20047452
Company                  11794409
Purpose                  8314381
I'll Show You            6627706
Baby                     6161099
Boyfriend                5854955
The Feeling              5415162
As Long As You Love Me   4738529
Adoption
● ~500 unique users
● ~640K queries per month
● ~500PB queried per month
● Cluster management, multi-tenancy & resource utilization
● Tuning and scaling
● 3 sets of separate APIs
● Not cloud native & missing Google Cloud connectors
What is Beam and
Cloud Dataflow?
The Evolution of Apache Beam
MapReduce
BigTable
Dremel
Colossus
Flume
Megastore
Spanner
PubSub
Millwheel
Apache
Beam
Google Cloud
Dataflow
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines -- Java & Python
3. Runners for existing distributed processing backends
○ Apache Flink (thanks to data Artisans)
○ Apache Spark (thanks to Cloudera and PayPal)
○ Google Cloud Dataflow (fully managed service)
○ Local runner for testing
The Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
Customizing What Where When How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
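The "What" and "Where" questions can be illustrated without any Beam dependency. The sketch below is plain Scala collections, not the Beam or Scio API: `Play`, `WindowedCounts` and the window size are hypothetical names chosen for illustration. It assigns each element to a fixed event-time window ("Where") and counts elements per window and key ("What"). The "When" and "How" questions (triggers and accumulation) only matter for unbounded input and are not modelled here.

```scala
// Hypothetical event type: one play of `track` at event time `tsMillis`.
final case class Play(track: String, tsMillis: Long)

object WindowedCounts {
  // "Where": map an event timestamp to the start of its fixed window.
  // floorMod keeps the result correct for negative timestamps too.
  def windowStart(tsMillis: Long, sizeMillis: Long): Long =
    tsMillis - Math.floorMod(tsMillis, sizeMillis)

  // "What": count plays per (window start, track).
  def countPerWindow(plays: Seq[Play], sizeMillis: Long): Map[(Long, String), Long] =
    plays
      .groupBy(p => (windowStart(p.tsMillis, sizeMillis), p.track))
      .map { case (key, ps) => key -> ps.size.toLong }
}
```

With a window size of 60000 ms, plays at 0 ms and 59999 ms fall into the same window and a play at 60000 ms starts the next one.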
Why Beam and
Cloud Dataflow?
● Unified batch and streaming model
● Hosted, fully managed, no ops
● Auto-scaling, dynamic work re-balance
● GCP ecosystem - BigQuery, Bigtable, Datastore, Pubsub
Beam + Cloud Dataflow
No More
PagerDuty!
Why Beam and
Scala?
● High level DSL
● Familiarity with Scalding, Spark and Flink
● Functional programming natural fit for data
● Numerical libraries - Breeze, Algebird
● Macros & shapeless for code generation
Beam and Scala
Scio
A Scala API for Apache Beam and
Google Cloud Dataflow
Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
github.com/spotify/scio
Apache License 2.0
Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery
Batch, Streaming, Interactive REPL
Scio Scala API
Dataflow Java SDK, Scala Libraries
Extra features
WordCount
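import com.spotify.scio._

val sc = ScioContext()
sc.textFile("shakespeare.txt")
  // split each line into words and drop empty tokens
  .flatMap { _
    .split("[^a-zA-Z']+")
    .filter(_.nonEmpty)
  }
  .countByValue // (word, count) pairs
  .saveAsTextFile("wordcount.txt")
sc.close()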
PageRank
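import com.spotify.scio.values.SCollection

def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()           // page -> outgoing URLs
  var ranks = links.mapValues(_ => 1.0) // every page starts with rank 1.0
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))      // split rank evenly across outgoing links
      }
    // damping factor 0.85
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}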
Why Scio?
Type safe BigQuery
Macro generated case classes, schemas and converters
@BigQueryType.fromQuery("SELECT id, name FROM [users] WHERE ...")
class User // look mom no code!
sc.typedBigQuery[User]().map(u => (u.id, u.name))

@BigQueryType.toTable
case class Score(id: String, score: Double)
data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
BigQuery + Scio
● Best of both worlds
● BigQuery for slicing and dicing huge datasets
● Scala for custom logic
● Seamless integration
REPL
$ scio-repl
Welcome to
                 _____
    ________________(_)_____
    __  ___/  ___/_  /_  __ \
    _(__  )/ /__ _  / / /_/ /
    /____/ \___/ /_/  \____/   version 0.3.4
Using Scala version 2.11.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
Using 'scio-test' as your BigQuery project.
BigQuery client available as 'bq'
Scio context available as 'sc'
scio> _
Available in github.com/spotify/homebrew-public
Other goodies
● DAG and source code visualization
● BigQuery caching, legacy & SQL 2011 support
● HDFS, Protobuf, TensorFlow, Bigtable, Elasticsearch & Cassandra I/O
● Join optimizations - hash, skewed, sparse
● Job metrics
● DistCache, job orchestration
Demo Time!
Adoption
● At Spotify
○ 200+ users, 700+ unique production pipelines (from ~70 10 months ago)
○ Most new to Scala and Scio
○ Both batch and streaming jobs
● Externally
○ ~10 companies, several fairly large ones
Use Cases
Fan Insights
● Listener stats [artist|track] × [context|geography|demography] × [day|week|month]
● BigQuery, GCS, Datastore
● TBs daily
● 150+ Java jobs → ~10 Scio jobs
Master Metadata
● n1-standard-1 workers
● 1 core 3.75GB RAM
● Autoscaling 2-35 workers
● 26 Avro sources - artist, album, track, disc, cover art, ...
● 120GB out, 70M records
● 200 LOC vs original Java 600 LOC
And we broke Google
Listening history
● user × track × [day/week/month/all time]
● 300B elements
● 800 n1-highmem-32 workers
● 32 core 208GB RAM
● 240TB in - Bigtable
● 90TB out - GCS
BigDiffy
● Pairwise field-level statistical diff
● Diff 2 SCollection[T] given keyFn: T => String
● T: Avro, BigQuery JSON, Protobuf
● Leaf field Δ - numeric, string (Levenshtein), vector (Cosine)
● Δ statistics - min, max, μ, σ, etc.
● Non-deterministic fields
○ ignore field
○ treat "repeated" field (List) as unordered list (Set)
Part of github.com/spotify/ratatool
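The three kinds of leaf field Δ listed above can be sketched in plain Scala. This is a hand-written illustration, not the ratatool implementation: `LeafDelta` and its method names are hypothetical, and reading the numeric delta as a signed difference and the vector delta as cosine distance (1 minus cosine similarity) are assumptions.

```scala
object LeafDelta {
  // Numeric delta, assumed here to be the signed difference.
  def numeric(l: Double, r: Double): Double = r - l

  // String delta: Levenshtein edit distance, two-row dynamic program.
  def levenshtein(a: String, b: String): Int = {
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      prev = curr
    }
    prev(b.length)
  }

  // Vector delta, assumed here to be cosine distance.
  // Returns NaN if either vector has zero norm.
  def cosine(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length, "vectors must have the same length")
    val dot = x.zip(y).map { case (a, b) => a * b }.sum
    val norms = math.sqrt(x.map(v => v * v).sum) * math.sqrt(y.map(v => v * v).sum)
    1.0 - dot / norms
  }
}
```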
Dataset Diff
● Diff stats
○ Global: # of SAME, DIFF, MISSING LHS/RHS
○ Key: key → SAME, DIFF, MISSING LHS/RHS
○ Field: field → min, max, μ, σ, etc.
● Use cases
○ Validating pipeline migration
○ Sanity checking ML models
Pairwise field-level deltas
val lKeyed = lhs.keyBy(keyFn)
val rKeyed = rhs.keyBy(keyFn)
val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
  (lOpt, rOpt) match {
    case (Some(l), Some(r)) =>
      val ds = diffy(l, r) // Seq[Delta]
      val dt = if (ds.isEmpty) SAME else DIFFERENT
      (k, (ds, dt))
    case (_, _) =>
      val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
      (k, (Nil, dt))
  }
}
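The same key-then-outer-join classification can be tried locally on in-memory collections. The sketch below uses only the Scala standard library. `LocalDiff` and its `DeltaType` cases are hypothetical stand-ins for the names in the slide, and plain equality replaces the field-level `diffy` comparison.

```scala
object LocalDiff {
  sealed trait DeltaType
  case object Same extends DeltaType
  case object Different extends DeltaType
  case object MissingLhs extends DeltaType
  case object MissingRhs extends DeltaType

  // Key both sides, then classify every key seen on either side.
  // If a key repeats within one side, the last record wins.
  def diff[T](lhs: Seq[T], rhs: Seq[T])(keyFn: T => String): Map[String, DeltaType] = {
    val l = lhs.map(x => keyFn(x) -> x).toMap
    val r = rhs.map(x => keyFn(x) -> x).toMap
    (l.keySet ++ r.keySet).map { k =>
      val dt: DeltaType = (l.get(k), r.get(k)) match {
        case (Some(a), Some(b)) => if (a == b) Same else Different
        case (Some(_), None)    => MissingRhs
        case _                  => MissingLhs
      }
      k -> dt
    }.toMap
  }
}
```

For example, diffing `Seq("a1", "b2", "c3")` against `Seq("a1", "b9", "d4")` with `keyFn = _.take(1)` classifies `a` as Same, `b` as Different, `c` as MissingRhs and `d` as MissingLhs.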
Summing deltas
import com.twitter.algebird._

// convert deltas to map of (field → summable stats)
def deltasToMap(ds: Seq[Delta], dt: DeltaType)
  : Map[String,
        (Long, Option[(DeltaType, Min[Double], Max[Double], Moments)])] = {
  // ...
}

deltas
  .map { case (_, (ds, dt)) => deltasToMap(ds, dt) }
  .sum // Semigroup!
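Why `.sum` works on a map of statistics can be shown with a hand-rolled stand-in. The sketch below does not use Algebird: `Stats` is a hypothetical replacement for the `Min`, `Max` and `Moments` tuple, with an associative `+`, and `sumMaps` merges per-record maps field by field. The variance uses a naive sum-of-squares formula chosen for brevity.

```scala
// Hypothetical summable statistics for one field.
final case class Stats(count: Long, min: Double, max: Double, sum: Double, sumSq: Double) {
  // Associative combine: this is what makes Stats a semigroup.
  def +(that: Stats): Stats = Stats(
    count + that.count,
    math.min(min, that.min),
    math.max(max, that.max),
    sum + that.sum,
    sumSq + that.sumSq
  )
  def mean: Double = sum / count
  def variance: Double = sumSq / count - mean * mean
}

object Stats {
  // Statistics of a single observation.
  def of(x: Double): Stats = Stats(1L, x, x, x, x * x)

  // Merge maps of (field -> stats), combining values that share a key.
  def sumMaps(maps: Seq[Map[String, Stats]]): Map[String, Stats] =
    maps.foldLeft(Map.empty[String, Stats]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (field, s)) =>
        a.updated(field, a.get(field).fold(s)(_ + s))
      }
    }
}
```

Because the combine is associative, partial results can be merged in any grouping, which is what lets a distributed runner aggregate them across workers.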
1. Copy & paste from legacy codebase to Scio
2. Verify with BigDiffy
3. Profit!
Featran
● a.k.a. Featran77 or F77
● Type safe feature transformer for machine learning
● Column-wise aggregation and transformation
● 1 shuffle stage (Algebird aggregation)
● In memory, Flink, Scalding, Scio & Spark runner
● Scala collection, array, Breeze, TensorFlow & NumPy output formats
● Property-based testing, 100% coverage
https://github.com/spotify/featran
Transforming features
case class Record(d1: Double, d2: Option[Double], d3: Double,
                  s: Seq[String])

val spec = FeatureSpec.of[Record]
  .required(_.d1)(Binarizer("bin", threshold = 0.5))
  .optional(_.d2)(Bucketizer("bucket", Array(0.0, 10.0, 20.0)))
  .required(_.d3)(StandardScaling("std"))
  .required(_.d3)(QuantileDiscretizer("q", 4))
  .required(_.s)(NHotEncoder("nhot"))

val f = spec.extract(records)
f.featureNames // human readable names
f.featureValues[DenseVector[Double]] // output format
f.featureSettings // settings can be reloaded
shapeless-datatype
● Case class ↔ Avro, BigQuery, Datastore & TensorFlow types
● Compile time and extendable type mapping
● Generic mapper between case class types
● Type and lens based record matcher
https://github.com/nevillelyh/shapeless-datatype
protobuf-generic
● Generic Protobuf manipulation without compiled classes
● Similar to Avro GenericRecord
● Protobuf type T → JSON schema
● Bytes ↔ JSON with JSON schema
● Command line tool to inspect binary files
https://github.com/nevillelyh/protobuf-generic
Other uses
● AB testing
○ Statistical analysis with bootstrap and DimSum
● Monetization
○ Ads targeting
○ User conversion analysis
● User understanding
○ Diversity
○ Session analysis
○ Behavior analysis
● Home page ranking
● Audio fingerprint analysis
We're developing so
fast that Google's
having a hard time
keeping up with capacity
Google Cloud Capacity
● Almost too easy to write big complex jobs leveraging BigQuery & Dataflow
● BigQuery export limit 10TB/day/project
● Compute engine datacenter capacity
● Shuffle disk size & throughput
● Bigtable throttling
And We Have Data
Scala Challenges
Serialization
● Data ser/de
○ Scalding, Spark and Storm use Kryo and Twitter Chill
○ Dataflow/Beam requires explicit Coder[T], sometimes inferable via Guava TypeToken
○ ClassTag to the rescue, fallback to Kryo/Chill
● Lambda ser/de
○ Custom ClosureCleaner, Serializable and @transient lazy val
○ Scala 2.12, Java 8 lambda issues #10232 #10233
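The `Serializable` plus `@transient lazy val` pattern mentioned above can be demonstrated with plain Java serialization. This is a minimal sketch, not Scio's ClosureCleaner: `Tokenizer`, `SplitFn` and `roundTrip` are hypothetical names. The non-serializable helper is excluded from the serialized form and rebuilt lazily on first use after deserialization, which is roughly what happens when a function is shipped to a worker.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.regex.Pattern

object SerializationSketch {
  // Hypothetical helper that is deliberately NOT Serializable.
  class Tokenizer(pattern: String) {
    private val compiled = Pattern.compile(pattern)
    def split(s: String): List[String] = compiled.split(s).toList.filter(_.nonEmpty)
  }

  // Only `pattern` is serialized; the tokenizer is transient and
  // re-created on first access in each JVM that deserializes SplitFn.
  class SplitFn(pattern: String) extends Serializable {
    @transient lazy val tokenizer: Tokenizer = new Tokenizer(pattern)
    def apply(s: String): List[String] = tokenizer.split(s)
  }

  // Serialize to bytes and back, as a stand-in for shipping to a worker.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Without `@transient`, serializing a `SplitFn` whose tokenizer has already been initialized would fail with a `NotSerializableException`.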
REPL
● Spark REPL transports lambda bytecode via HTTP
● Dataflow requires job jar for execution (masterless)
● Custom class loader and ILoop
● Interpreted classes → job jar → job submission
● SCollection[T]#closeAndCollect(): Iterator[T] to mimic Spark actions
Macros and IntelliJ IDEA
● IntelliJ IDEA does not see macro expanded classes
https://youtrack.jetbrains.com/issue/SCL-8834
● @BigQueryType.{fromTable, fromQuery}
class MyRecord
● Scio IDEA plugin
https://github.com/spotify/scio-idea-plugin
Make IntelliJ
Intelligent Again!
Scio in Apache Zeppelin
Local Zeppelin server, remote managed Dataflow cluster, NO OPS
The End
Thank You
Neville Li
@sinisa_lyh

 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 

Sorry - How Bieber broke Google Cloud at Spotify

  • 1. Sorry How Bieber broke Google Cloud at Spotify Neville Li @sinisa_lyh
  • 2. Who am I? ● Spotify NYC since 2011 ● Formerly Yahoo! Search ● Music recommendations ● Data & ML infrastructure ● Scala since 2013
  • 3. About Us ● 100M+ active users ● 40M+ subscribers ● 30M+ songs, 20K new per day ● 2B+ playlists ● 1B+ plays per day
  • 5. And We Have Data
  • 6. Hadoop at Spotify ● On-premise → Amazon EMR → On-premise ● ~2,500 nodes, largest in EU ● 100PB+ Disk, 100TB+ RAM ● 60TB+ per day log ingestion ● 20K+ jobs per day
  • 7. Data Processing ● Luigi, Python M/R, circa 2011 ● Scalding, Spark, circa 2013 ● 200+ Scala users, 1K+ unique jobs ● Storm for real-time ● Hive for ad-hoc analysis
  • 9. Hive → BigQuery ● Full Avro scans → Columnar storage ● M/R jobs → Optimized execution engine ● Batch → interactive query ● $$$ → Pay for bytes processed ● Beam/Dataflow integration Dremel Paper, 2010
  • 10. Top Tracks in Vancouver Jun 2017 ● 30 date partitioned tables, 60TB ● 1 metadata table, 418GB ● 94.2s, 4.82TB processed
        Track | Artist | Count
        Despacito - Remix | Luis Fonsi | 349188
        I'm the One | DJ Khaled | 214863
        HUMBLE. | Kendrick Lamar | 200690
        Unforgettable | French Montana | 192141
        XO TOUR Llif3 | Lil Uzi Vert | 148546
        Slide | Calvin Harris | 134236
        Attention | Charlie Puth | 131514
        Strip That Down | Liam Payne | 125191
        2U | David Guetta | 124998
  • 11. Top Bieber Tracks Jun 2017 ● 30 date partitioned tables, 60TB ● 1 metadata table, 418GB ● 81.8s, 2.92TB processed
        Track | Count
        Love Yourself | 28400153
        Sorry | 27269157
        What Do You Mean? | 20047452
        Company | 11794409
        Purpose | 8314381
        I'll Show You | 6627706
        Baby | 6161099
        Boyfriend | 5854955
        The Feeling | 5415162
        As Long As You Love Me | 4738529
  • 12. Adoption ● ~500 unique users ● ~640K queries per month ● ~500PB queried per month
  • 13. ● Cluster management, multi-tenancy & resource utilization ● Tuning and scaling ● 3 sets of separate APIs ● Not cloud native & missing Google Cloud connectors
  • 14. What is Beam and Cloud Dataflow?
  • 15. The Evolution of Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel; Apache Beam, Google Cloud Dataflow
  • 16. What is Apache Beam? 1. The Beam Programming Model 2. SDKs for writing Beam pipelines -- Java & Python 3. Runners for existing distributed processing backends ○ Apache Flink (thanks to data Artisans) ○ Apache Spark (thanks to Cloudera and PayPal) ○ Google Cloud Dataflow (fully managed service) ○ Local runner for testing
  • 17. The Beam Model: Asking the Right Questions ● What results are calculated? ● Where in event time are results calculated? ● When in processing time are results materialized? ● How do refinements of results relate?
  • 18. Customizing What Where When How ● 1 Classic Batch ● 2 Windowed Batch ● 3 Streaming ● 4 Streaming + Accumulation
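The "what" and "where" questions on this slide can be illustrated without Beam at all. The following is a minimal sketch on in-memory Scala collections, not Beam or Scio code; `Play`, `windowStart` and `countPerWindow` are hypothetical names chosen for illustration. It assigns each event to a fixed window by its own event timestamp and counts plays per window and track. The "when" and "how" questions (triggers and accumulation) only matter once data arrives late or out of order, so they are left out here.

```scala
object WindowSketch {
  // Hypothetical event type: a play carrying its event-time timestamp in seconds.
  final case class Play(track: String, eventTimeSec: Long)

  // "Where in event time": the start of the fixed window an event falls into.
  // Assumes non-negative timestamps.
  def windowStart(eventTimeSec: Long, sizeSec: Long): Long =
    eventTimeSec - (eventTimeSec % sizeSec)

  // "What is calculated": number of plays per (window start, track).
  def countPerWindow(plays: Seq[Play], sizeSec: Long): Map[(Long, String), Int] =
    plays
      .groupBy(p => (windowStart(p.eventTimeSec, sizeSec), p.track))
      .map { case (key, ps) => (key, ps.size) }
}
```

Grouping by event time rather than arrival order is what lets the same logic serve both batch and streaming inputs.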
  • 19. Why Beam and Cloud Dataflow?
  • 20. Beam + Cloud Dataflow ● Unified batch and streaming model ● Hosted, fully managed, no ops ● Auto-scaling, dynamic work re-balance ● GCP ecosystem - BigQuery, Bigtable, Datastore, Pubsub
  • 23. Beam and Scala ● High level DSL ● Familiarity with Scalding, Spark and Flink ● Functional programming natural fit for data ● Numerical libraries - Breeze, Algebird ● Macros & shapeless for code generation
  • 24. Scio A Scala API for Apache Beam and Google Cloud Dataflow
  • 25. Scio Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] Verb: I can, know, understand, have knowledge.
  • 27. Diagram labels: Cloud Storage, Pub/Sub, Datastore, Bigtable, BigQuery; Batch, Streaming, Interactive REPL; Scio Scala API; Dataflow Java SDK, Scala Libraries, Extra features
  • 28. WordCount
        val sc = ScioContext()
        sc.textFile("shakespeare.txt")
          .flatMap { _
            .split("[^a-zA-Z']+")
            .filter(_.nonEmpty)
          }
          .countByValue
          .saveAsTextFile("wordcount.txt")
        sc.close()
  • 29. PageRank
        def pageRank(in: SCollection[(String, String)]) = {
          val links = in.groupByKey()
          var ranks = links.mapValues(_ => 1.0)
          for (i <- 1 to 10) {
            val contribs = links.join(ranks).values
              .flatMap { case (urls, rank) =>
                val size = urls.size
                urls.map((_, rank / size))
              }
            ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
          }
          ranks
        }
  • 31. Type safe BigQuery: macro generated case classes, schemas and converters
        @BigQueryType.fromQuery("SELECT id, name FROM [users] WHERE ...")
        class User // look mom no code!
        sc.typedBigQuery[User]().map(u => (u.id, u.name))

        @BigQueryType.toTable
        case class Score(id: String, score: Double)
        data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
  • 32. BigQuery + Scio ● Best of both worlds ● BigQuery for slicing and dicing huge datasets ● Scala for custom logic ● Seamless integration
  • 33. REPL
        $ scio-repl
        Welcome to [Scio ASCII-art banner] version 0.3.4
        Using Scala version 2.11.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
        Type in expressions to have them evaluated.
        Type :help for more information.
        Using 'scio-test' as your BigQuery project.
        BigQuery client available as 'bq'
        Scio context available as 'sc'
        scio> _
        Available in github.com/spotify/homebrew-public
  • 34. Other goodies ● DAG and source code visualization ● BigQuery caching, legacy & SQL 2011 support ● HDFS, Protobuf, TensorFlow, Bigtable, Elasticsearch & Cassandra I/O ● Join optimizations - hash, skewed, sparse ● Job metrics ● DistCache, job orchestration
  • 36. Adoption ● At Spotify ○ 200+ users, 700+ unique production pipelines (from ~70 10 months ago) ○ Most new to Scala and Scio ○ Both batch and streaming jobs ● Externally ○ ~10 companies, several fairly large ones
  • 38. Fan Insights ● Listener stats [artist|track] × [context|geography|demography] × [day|week|month] ● BigQuery, GCS, Datastore ● TBs daily ● 150+ Java jobs → ~10 Scio jobs
  • 39. Master Metadata ● n1-standard-1 workers ● 1 core 3.75GB RAM ● Autoscaling 2-35 workers ● 26 Avro sources - artist, album, track, disc, cover art, ... ● 120GB out, 70M records ● 200 LOC vs original Java 600 LOC
  • 40. And we broke Google
  • 41. Listening history ● user x track x [day/week/month/all time] ● 300B elements ● 800 n1-highmem-32 workers ● 32 core 208GB RAM ● 240TB in - Bigtable ● 90TB out - GCS
  • 42. BigDiffy ● Pairwise field-level statistical diff ● Diff 2 SCollection[T] given keyFn: T => String ● T: Avro, BigQuery JSON, Protobuf ● Leaf field Δ - numeric, string (Levenshtein), vector (Cosine) ● Δ statistics - min, max, μ, σ, etc. ● Non-deterministic fields ○ ignore field ○ treat "repeated" field (List) as unordered list (Set) Part of github.com/spotify/ratatool
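The three kinds of leaf-field delta named on this slide can be sketched in plain Scala. This is an illustration under stated assumptions, not the code in ratatool: the object and method names are hypothetical, the numeric delta is taken to be the absolute difference, and the vector delta is written as cosine distance (one minus cosine similarity), which the slide does not specify.

```scala
object LeafDelta {
  // Numeric delta: absolute difference (an assumption; the slide only says "numeric").
  def numeric(a: Double, b: Double): Double = math.abs(a - b)

  // String delta: Levenshtein edit distance using the classic two-row dynamic program.
  def levenshtein(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Vector delta: cosine distance. Assumes equal-length, non-zero vectors.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    1.0 - dot / norm
  }
}
```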
  • 43. Dataset Diff ● Diff stats ○ Global: # of SAME, DIFF, MISSING LHS/RHS ○ Key: key → SAME, DIFF, MISSING LHS/RHS ○ Field: field → min, max, μ, σ, etc. ● Use cases ○ Validating pipeline migration ○ Sanity checking ML models
  • 44. Pairwise field-level deltas
        val lKeyed = lhs.keyBy(keyFn)
        val rKeyed = rhs.keyBy(keyFn)
        val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
          (lOpt, rOpt) match {
            case (Some(l), Some(r)) =>
              val ds = diffy(l, r) // Seq[Delta]
              val dt = if (ds.isEmpty) SAME else DIFFERENT
              (k, (ds, dt))
            case (_, _) =>
              val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
              (k, (Nil, dt))
          }
        }
  • 45. Summing deltas
        import com.twitter.algebird._
        // convert deltas to map of (field → summable stats)
        def deltasToMap(ds: Seq[Delta], dt: DeltaType)
            : Map[String, (Long, Option[(DeltaType, Min[Double], Max[Double], Moments)])] = {
          // ...
        }
        deltas
          .map { case (_, (ds, dt)) => deltasToMap(ds, dt) }
          .sum // Semigroup!
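The same key-level classification can be tried on small in-memory data. The sketch below swaps the distributed outer join for a union of key sets over two plain `Map`s and replaces the field-level `diffy` call with simple equality, so it only shows the SAME / DIFFERENT / MISSING logic. `PairwiseDiff` and `classify` are hypothetical names.

```scala
object PairwiseDiff {
  sealed trait DeltaType
  case object SAME extends DeltaType
  case object DIFFERENT extends DeltaType
  case object MISSING_LHS extends DeltaType
  case object MISSING_RHS extends DeltaType

  // Full outer join of two keyed datasets via the union of their keys,
  // then one DeltaType per key. Equality stands in for the field-level diff.
  def classify[T](lhs: Map[String, T], rhs: Map[String, T]): Map[String, DeltaType] =
    (lhs.keySet ++ rhs.keySet).map { k =>
      val dt: DeltaType = (lhs.get(k), rhs.get(k)) match {
        case (Some(l), Some(r)) => if (l == r) SAME else DIFFERENT
        case (Some(_), None)    => MISSING_RHS
        case _                  => MISSING_LHS
      }
      k -> dt
    }.toMap
}
```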
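The "Semigroup!" remark is the key point of this slide: if per-field stats have an associative combine, then maps of those stats can be merged key by key, and a whole dataset of maps reduces to one summary in any grouping order. A hand-rolled sketch of that idea follows. It does not use Algebird, and `Stats`, `plus`, `plusMaps` and `sumAll` are hypothetical stand-ins for the library's `Min`, `Max`, `Moments` and `sum`.

```scala
object StatsSum {
  // Hypothetical summable stats for one field: count, min and max of its deltas.
  final case class Stats(count: Long, min: Double, max: Double)

  // Associative combine, which is all a semigroup requires.
  def plus(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, math.min(a.min, b.min), math.max(a.max, b.max))

  // Maps whose values form a semigroup form one too: merge key by key.
  def plusMaps(a: Map[String, Stats], b: Map[String, Stats]): Map[String, Stats] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(plus(_, v)).getOrElse(v))
    }

  // Reduce any number of per-record maps into one summary map.
  def sumAll(ms: Seq[Map[String, Stats]]): Map[String, Stats] =
    ms.reduceOption(plusMaps).getOrElse(Map.empty)
}
```

Because the combine is associative, a runner is free to pre-aggregate on each worker before shuffling, which is why this pattern scales.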
  • 46. 1. Copy & paste from legacy codebase to Scio 2. Verify with BigDiffy 3. Profit!
  • 47. Featran ● a.k.a. Featran77 or F77 ● Type safe feature transformer for machine learning ● Column-wise aggregation and transformation ● 1 shuffle stage (Algebird aggregation) ● In memory, Flink, Scalding, Scio & Spark runner ● Scala collection, array, Breeze, TensorFlow & NumPy output formats ● Property-based testing, 100% coverage https://github.com/spotify/featran
  • 48. Transforming features
        case class Record(d1: Double, d2: Option[Double], d3: Double, s: Seq[String])
        val spec = FeatureSpec.of[Record]
          .required(_.d1)(Binarizer("bin", threshold = 0.5))
          .optional(_.d2)(Bucketizer("bucket", Array(0.0, 10.0, 20.0)))
          .required(_.d3)(StandardScaling("std"))
          .required(_.d3)(QuantileDiscretizer("q", 4))
          .required(_.s)(NHotEncoder("nhot"))
        val f = spec.extract(records)
        f.featureNames                        // human readable names
        f.featureValues[DenseVector[Double]]  // output format
        f.featureSettings                     // settings can be reloaded
  • 49. shapeless-datatype ● Case class ↔ Avro, BigQuery, Datastore & TensorFlow types ● Compile time and extendable type mapping ● Generic mapper between case class types ● Type and lens based record matcher https://github.com/nevillelyh/shapeless-datatype
  • 50. protobuf-generic ● Generic Protobuf manipulation without compiled classes ● Similar to Avro GenericRecord ● Protobuf type T → JSON schema ● Bytes ↔ JSON with JSON schema ● Command line tool to inspect binary files https://github.com/nevillelyh/protobuf-generic
  • 51. Other uses ● AB testing ○ Statistical analysis with bootstrap and DimSum ● Monetization ○ Ads targeting ○ User conversion analysis ● User understanding ○ Diversity ○ Session analysis ○ Behavior analysis ● Home page ranking ● Audio fingerprint analysis
  • 52. We're developing so fast that Google's having a hard time keeping up with capacity
  • 53. Google Cloud Capacity ● Almost too easy to write big complex jobs leveraging BigQuery & Dataflow ● BigQuery export limit 10TB/day/project ● Compute engine datacenter capacity ● Shuffle disk size & throughput ● Bigtable throttling
  • 54. And We Have Data
  • 56. Serialization ● Data ser/de ○ Scalding, Spark and Storm use Kryo and Twitter Chill ○ Dataflow/Beam requires an explicit Coder[T], sometimes inferable via Guava TypeToken ○ ClassTag to the rescue, fallback to Kryo/Chill ● Lambda ser/de ○ Custom ClosureCleaner, Serializable and @transient lazy val ○ Scala 2.12, Java 8 lambda issues #10232 #10233
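The `Serializable` plus `@transient lazy val` combination mentioned under lambda ser/de can be demonstrated with plain Java serialization. This is a sketch of the general pattern, not Scio's ClosureCleaner; `Expensive`, `Normalize` and `roundTrip` are hypothetical names. The function object is serializable, while its non-serializable helper is skipped during serialization and rebuilt lazily on first use after deserialization.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientSketch {
  // Stand-in for a resource that cannot be serialized (a client, a parser, ...).
  class Expensive { def normalize(s: String): String = s.trim.toLowerCase }

  // The function object is Serializable; the helper field is transient and lazy,
  // so it is never written out and is re-created on the receiving side.
  class Normalize extends (String => String) with Serializable {
    @transient private lazy val helper = new Expensive
    def apply(s: String): String = helper.normalize(s)
  }

  // Java-serialization round trip, similar in spirit to shipping a function to a worker.
  def roundTrip[A <: AnyRef](a: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[A]
  }
}
```

Without `@transient`, serializing a `Normalize` whose helper had already been initialized would fail with `NotSerializableException`, because `Expensive` is not serializable.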
  • 57. REPL ● Spark REPL transports lambda bytecode via HTTP ● Dataflow requires job jar for execution (masterless) ● Custom class loader and ILoop ● Interpreted classes → job jar → job submission ● SCollection[T]#closeAndCollect(): Iterator[T] to mimic Spark actions
  • 58. Macros and IntelliJ IDEA ● IntelliJ IDEA does not see macro expanded classes https://youtrack.jetbrains.com/issue/SCL-8834 ● @BigQueryType.{fromTable, fromQuery} class MyRecord ● Scio IDEA plugin https://github.com/spotify/scio-idea-plugin
  • 60. Scio in Apache Zeppelin Local Zeppelin server, remote managed Dataflow cluster, NO OPS
  • 61. The End Thank You Neville Li @sinisa_lyh