This document discusses lessons learned from building and operating massive (300TB+) Apache Spark pipelines in production. It covers why Spark was chosen for its performance, testability, and modularity benefits, and how to manage pipelines at that scale: automating operations, keeping interfaces simple, planning for growth by persisting intermediate data to HDFS, using efficient serialization and data structures, and testing on sampled data while accounting for data skew.
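As an illustration of the last point, the sketch below (a hypothetical example; the function name, dataset shape, and thresholds are assumptions, not from the document) shows why testing on a uniform random sample can still surface skew: a hot key's share of the records in the sample approximates its share in the full dataset, so a cheap per-key count on the sample can flag skewed keys before a full-scale run.

```python
import random
from collections import Counter

def top_key_share(keys):
    """Fraction of records belonging to the single most frequent key.
    (Hypothetical helper for illustration, not from the document.)"""
    counts = Counter(keys)
    return max(counts.values()) / len(keys)

random.seed(0)
# Skewed synthetic dataset: key 0 holds 90% of the 100,000 records.
full = [0] * 90_000 + list(range(1, 10_001))
# A 1% uniform random sample, standing in for a sampled test dataset.
sample = random.sample(full, 1_000)

# The hot key dominates the full data...
print(round(top_key_share(full), 2))  # → 0.9
# ...and its share in the sample is close to the same value,
# so the skew is visible even at test scale.
print(round(top_key_share(sample), 2))
```

In a real pipeline the same idea applies to the join or group-by keys of the sampled input: if one key's share is far above the rest, that stage is a likely skew hotspot at full scale.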