SlideShare a Scribd company logo
1 of 33
Download to read offline
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Sr. Product Manager, Amazon EMR
Yekesa Kosuru, VP of Engineering, DataXu
Dong Jiang, Sr. Principal Software Engineer, DataXu
Saket Mengle, Sr. Principal Data Scientist, DataXu
AWS re:Invent 2016
BDM301
Best Practices for Apache
Spark on Amazon EMR
What to Expect from the Session
• Spark on EMR architecture
• Spark performance and using S3
• Spark security
• Spark for ETL and data science at DataXu
• Demo of DataXu’s Spark workflow
• Run Spark Driver in
Client or Cluster mode
• Spark application runs
as a YARN application
• SparkContext runs as a
library in your program,
one instance per Spark
application.
• Spark Executors run in
YARN Containers on
NodeManagers in your
cluster
• Access Spark UI through
the Resource Manager
or Spark History Server
Spark on YARN
Spark UI
Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors
based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores
settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and
calculates the default executor size to use based on the
instance family of your Core Group
• Or, use maximizeResourceAllocation for single-tenancy
Options to submit jobs – on cluster
Notebooks and IDEs: Apache Zeppelin,
Jupyter, RStudio, and more!
Connect with ODBC / JDBC to
Spark Thriftserver
Use Spark Actions in your Apache Oozie
workflow to create DAGs of jobs.
(start using
start-thriftserver.sh)
Or, use Spark Submit
or the Spark Shell
Develop fast using notebooks and IDEs
Install using Bootstrap ActionsIncluded in EMR releases Connect with
ODBC / JDBC
(start using
start-thriftserver.sh)
Thrift Server
Options to submit jobs – off cluster
Amazon EMR
Step API
Submit a Spark
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Spark on your cluster
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Decouple compute and storage by using S3
as your data layer
HDFS
S3 is designed for 11
9’s of durability and is
massively scalable
EC2 Instance
Memory
Amazon S3
Amazon EMR
Amazon EMR
Intermediates
stored on local
disk or HDFS
Local
Amazon Aurora
Metadata for
external tables
on S3
S3 tips: Partitions, compression, and file formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to
EC2
• Make sure you use splittable compression or have each file
be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased
performance on reads
Encryption at-rest and in-flight
Auto Scaling for data science on-demand
YARN metrics
Coming Soon: Advanced Spot provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal AZ based on capacity/price
• Spot Block support
Using Spark at DataXu - Overview
• Who is DataXu
• Why Spark
• DataXu Processing Flows
• Benchmarks
• Tips & Lessons Learned
• Demo
• DataXu Science
• Benchmarks
• Tips and Lessons Learned
• Demo
DataXu
• Who
• Spun out of MIT Labs
• A petabyte scale digital
marketing platform
• One of the fastest growing
companies in Inc. 5000
• What
• Help world’s most
valuable brands
understand and engage
with their consumer
• Maximize ROI
Quick Statistics
• 2M+ bid requests per second
• Billions of impressions,
Petabytes of data
• ~10ms round trip response time
• 180+TB logs per day
• 2PB data analyzed
• 3000+ servers powering the
platform
• 13 regions, 24x7
Real Time Bidding
Why Spark?
• Performance
• Speed of execution: Spark sorted 100 TB of data (1 trillion records) 3X faster
using 10X fewer machines than traditional MapReduce
• Massively parallel
• In-memory vs disk IO
• Partition aware shuffles
• Speaks your language – Java, Scala, Python, R
• Streaming Use Cases – Streaming API
• Spark on EMR
• EMR as Common Foundation – mature platform
• Elasticity, Security, Spot Instances, Decoupled Storage and Compute
• Low TCO: Spot Instances, Low OPEX
DataXu Flows
CDN
Real Time
Bidding
Retargeting
Platform
Amazon
Kinesis
Streams
Advanced Analytics
(Third Party)
Reporting Tools
(Third Party)Amazon
Machine
Learning
S3All Data
(S3)
ETL
Attribution
Ecosystem of tools and services
DataXu Spark Pipeline
ETL Attribution
Machine
Learning
S3Amazon
Kinesis
Parquet, DataSets, DataFrames
Common Backend
Amazon EMR
Spark
MLib
Spark
SQL
Streaming
DataXu Flows – New Generation
CDN
Real Time
Bidding
Retargeting
Platform
Amazon
Kinesis
Attribution & ML
S3
Reporting
Data Visualization
Data
Pipeline
ETL(Spark SQL)
Ecosystem of tools and services
ETL Pipeline
Event Data
• Impressions
• Activities
• Attributions
• (Facts)
Reference Data
(Dimensions)
Application Logs
Exceptions Data
Reporting Data
Spark on EMR
(ETL)
Cost/Performance Benchmarks
ETL Workload: Spark on EMR vs MPP Database
Cluster Instance Type,
Count
Execution
Time (mins)
Monthly Costs
Shared MPP*
(COLO)
48 x nodes
(24 cores, 64GB)
20, 30, 54
$20,000
(Excluding SW license)
Single-node MPP (AWS) 1 i2.8xlarge ~40
$4,992
(Excluding SW license)
Spark on EMR
(AWS)
7 m3.xlarge
7 m4.xlarge
20, 20, 20
18, 18, 18
$875
$766
Spark on EMR
(AWS)
3 m3.xlarge
3 m4.xlarge
50, 51, 51
44, 45, 46
$375
$328
Scalability
0
5
10
15
20
25
30
1x 2x 3x 4x 5x 6x 7x 8x 9x 10x
Scalability w/ Data Volume
Minutes (Actual) Minutes (Ideal)
Data Volume
0
100
200
300
400
500
600
700
800
900
0
5
10
15
20
25
30
35
40
45
50
2 3 4 5 6
Execution Time and Cost vs
Number of Core Nodes
m4.xlarge Minutes m4.xlarge Monthly Cost
Number of Core Nodes
Tips
Spark SQL
BroadcastHashJoin, Easy in 2.0
UDF for Readability, Modularity and Testability, no performance penalty
Complex SQL queries may need to be materialized
Watch out for bottlenecks, like Spark driver operations
Lessons Learned
Big Picture
Re-think infrastructure (Cloud vs Host Solution)
• Think cloud native instead of fork lifting
Re-think ETL process (Spark vs MPP database vs Hadoop MR)
• Spark Core and SQL are rock solid, MR as fallback
• Streamline processing
Re-think in-memory processing
• Minimize incoming data footprints
• Control unnecessary joins
EMR is an excellent common backend
• Default setting just works
• Consistent performance
Demo
DataXu Machine Learning
Learn
Models
ModelsImpressions
Clicks
Activities
Calibrate
Evaluate
Real
Time
Bidding
S3
ETL Attribution
Machine
Learning
S3Amazon
Kinesis
Why is this hard?
Huge Scale • 2 Petabytes Processed Daily
• 2 Million Bid Decisions Per Second
• Runs 24 X 7 on 5 Continents
• Thousands of ML Models Trained per Day
Unattended Operation • Model training and deployment runs
automatically every day
Changing Industry • Need ability to adapt quickly to new customer
requirements
Benchmark – Training Time
• Training times are linear using
Spark AWS
• Three different algorithms
tested at scale
• Decision Tree
• Random Forest
• Logistic Regression
• Linear Scalability is important
For training/evaluating we used a 6 node r3.2xlarge
cluster - 150 GB Memory, 20 Cores.
0
50
100
150
200
250
300
350
400
450
500
0 5 10 15
TrainingTime(insec)
Number of Training Records
Millions
Training Time Comparison
Logistic Regression Decision Tree Random Forest
Benchmarks – Bidding time
0.8
1.7
CURRENT DATAXU
MODEL
SPARK RANDOM
FOREST
Avg. Latency
(milliseconds)
123
5
36 37
7
Random
Forests
Logistic
Regression
Naive
Bayes
Decision
Trees
DataXu
Model
Model Size in Memory (KB)
• Out-of-the-box Spark performance is impressive
• Latency
• Model Size
• DataXu models are faster and smaller but they were optimized for years
Takeaways
Big Picture
One common big data platform
Spark on Amazon EMR works well out of the box
Very little code to train and evaluate models
Automated & unattended ML at scale
Thank you!
ykosuru@dataxu.com
djiang@dataxu.com
smengle@dataxu.com
We are hiring!
www.dataxu.com/careers
jonfritz@amazon.com
aws.amazon.com/emr/
aws.amazon.com/blogs/big-data/
Remember to complete
your evaluations!

More Related Content

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

AWS re:Invent 2016: Best Practices for Apache Spark on Amazon EMR (BDM301)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Sr. Product Manager, Amazon EMR Yekesa Kosuru, VP of Engineering, DataXu Dong Jiang, Sr. Principal Software Engineer, DataXu Saket Mengle, Sr. Principal Data Scientist, DataXu AWS re:Invent 2016 BDM301 Best Practices for Apache Spark on Amazon EMR
  • 2. What to Expect from the Session • Spark on EMR architecture • Spark performance and using S3 • Spark security • Spark for ETL and data science at DataXu • Demo of DataXu’s Spark workflow
  • 3. • Run Spark Driver in Client or Cluster mode • Spark application runs as a YARN application • SparkContext runs as a library in your program, one instance per Spark application. • Spark Executors run in YARN Containers on NodeManagers in your cluster • Access Spark UI through the Resource Manager or Spark History Server Spark on YARN Spark UI
  • 4. Configuring Executors – Dynamic Allocation • Optimal resource utilization • YARN dynamically creates and shuts down executors based on the resource needs of the Spark application • Spark uses the executor memory and executor cores settings in the configuration for each executor • Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group • Or, use maximizeResourceAllocation for single-tenancy
  • 5. Options to submit jobs – on cluster Notebooks and IDEs: Apache Zeppelin, Jupyter, RStudio, and more! Connect with ODBC / JDBC to Spark Thriftserver Use Spark Actions in your Apache Oozie workflow to create DAGs of jobs. (start using start-thriftserver.sh) Or, use Spark Submit or the Spark Shell
  • 6. Develop fast using notebooks and IDEs Install using Bootstrap ActionsIncluded in EMR releases Connect with ODBC / JDBC (start using start-thriftserver.sh) Thrift Server
  • 7. Options to submit jobs – off cluster Amazon EMR Step API Submit a Spark application Amazon EMR AWS Data Pipeline Airflow, Luigi, or other schedulers on EC2 Create a pipeline to schedule job submission or create complex workflows AWS Lambda Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
  • 8. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR
  • 9. Decouple compute and storage by using S3 as your data layer HDFS S3 is designed for 11 9’s of durability and is massively scalable EC2 Instance Memory Amazon S3 Amazon EMR Amazon EMR Intermediates stored on local disk or HDFS Local Amazon Aurora Metadata for external tables on S3
  • 10. S3 tips: Partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data set to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
  • 12. Auto Scaling for data science on-demand YARN metrics
  • 13. Coming Soon: Advanced Spot provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal AZ based on capacity/price • Spot Block support
  • 14. Using Spark at DataXu - Overview • Who is DataXu • Why Spark • DataXu Processing Flows • Benchmarks • Tips & Lessons Learned • Demo • DataXu Science • Benchmarks • Tips and Lessons Learned • Demo
  • 15. DataXu • Who • Spun out of MIT Labs • A petabyte scale digital marketing platform • One of the fastest growing companies in Inc. 5000 • What • Help world’s most valuable brands understand and engage with their consumer • Maximize ROI Quick Statistics • 2M+ bid requests per second • Billions of impressions, Petabytes of data • ~10ms round trip response time • 180+TB logs per day • 2PB data analyzed • 3000+ servers powering the platform • 13 regions, 24x7
  • 17. Why Spark? • Performance • Speed of execution: Spark sorted 100 TB of data (1 trillion records) 3X faster using 10X fewer machines than traditional MapReduce • Massively parallel • In-memory vs disk IO • Partition aware shuffles • Speaks your language – Java, Scala, Python, R • Streaming Use Cases – Streaming API • Spark on EMR • EMR as Common Foundation – mature platform • Elasticity, Security, Spot Instances, Decoupled Storage and Compute • Low TCO: Spot Instances, Low OPEX
  • 18. DataXu Flows CDN Real Time Bidding Retargeting Platform Amazon Kinesis Streams Advanced Analytics (Third Party) Reporting Tools (Third Party)Amazon Machine Learning S3All Data (S3) ETL Attribution Ecosystem of tools and services
  • 19. DataXu Spark Pipeline ETL Attribution Machine Learning S3Amazon Kinesis Parquet, DataSets, DataFrames Common Backend Amazon EMR Spark MLib Spark SQL Streaming
  • 20. DataXu Flows – New Generation CDN Real Time Bidding Retargeting Platform Amazon Kinesis Attribution & ML S3 Reporting Data Visualization Data Pipeline ETL(Spark SQL) Ecosystem of tools and services
  • 21. ETL Pipeline Event Data • Impressions • Activities • Attributions • (Facts) Reference Data (Dimensions) Application Logs Exceptions Data Reporting Data Spark on EMR (ETL)
  • 22. Cost/Performance Benchmarks ETL Workload: Spark on EMR vs MPP Database Cluster Instance Type, Count Execution Time (mins) Monthly Costs Shared MPP* (COLO) 48 x nodes (24 cores, 64GB) 20, 30, 54 $20,000 (Excluding SW license) Single-node MPP (AWS) 1 i2.8xlarge ~40 $4,992 (Excluding SW license) Spark on EMR (AWS) 7 m3.xlarge 7 m4.xlarge 20, 20, 20 18, 18, 18 $875 $766 Spark on EMR (AWS) 3 m3.xlarge 3 m4.xlarge 50, 51, 51 44, 45, 46 $375 $328
  • 23. Scalability 0 5 10 15 20 25 30 1x 2x 3x 4x 5x 6x 7x 8x 9x 10x Scalability w/ Data Volume Minutes (Actual) Minutes (Ideal) Data Volume 0 100 200 300 400 500 600 700 800 900 0 5 10 15 20 25 30 35 40 45 50 2 3 4 5 6 Execution Time and Cost vs Number of Core Nodes m4.xlarge Minutes m4.xlarge Monthly Cost Number of Core Nodes
  • 24. Tips Spark SQL BroadcastHashJoin, Easy in 2.0 UDF for Readability, Modularity and Testability, no performance penalty Complex SQL queries may need to be materialized Watch out for bottlenecks, like Spark driver operations
  • 25. Lessons Learned Big Picture Re-think infrastructure (Cloud vs Host Solution) • Think cloud native instead of fork lifting Re-think ETL process (Spark vs MPP database vs Hadoop MR) • Spark Core and SQL are rock solid, MR as fallback • Streamline processing Re-think in-memory processing • Minimize incoming data footprints • Control unnecessary joins EMR is an excellent common backend • Default setting just works • Consistent performance
  • 26. Demo
  • 28. Why is this hard? Huge Scale • 2 Petabytes Processed Daily • 2 Million Bid Decisions Per Second • Runs 24 X 7 on 5 Continents • Thousands of ML Models Trained per Day Unattended Operation • Model training and deployment runs automatically every day Changing Industry • Need ability to adapt quickly to new customer requirements
  • 29. Benchmark – Training Time • Training times are linear using Spark AWS • Three different algorithms tested at scale • Decision Tree • Random Forest • Logistic Regression • Linear Scalability is important For training/evaluating we used a 6 node r3.2xlarge cluster - 150 GB Memory, 20 Cores. 0 50 100 150 200 250 300 350 400 450 500 0 5 10 15 TrainingTime(insec) Number of Training Records Millions Training Time Comparison Logistic Regression Decision Tree Random Forest
  • 30. Benchmarks – Bidding time 0.8 1.7 CURRENT DATAXU MODEL SPARK RANDOM FOREST Avg. Latency (milliseconds) 123 5 36 37 7 Random Forests Logistic Regression Naive Bayes Decision Trees DataXu Model Model Size in Memory (KB) • Out-of-the-box Spark performance is impressive • Latency • Model Size • DataXu models are faster and smaller but they were optimized for years
  • 31. Takeaways Big Picture One common big data platform Spark on Amazon EMR works well out of the box Very little code to train and evaluate models Automated & unattended ML at scale
  • 32. Thank you! ykosuru@dataxu.com djiang@dataxu.com smengle@dataxu.com We are hiring! www.dataxu.com/careers jonfritz@amazon.com aws.amazon.com/emr/ aws.amazon.com/blogs/big-data/