Operational Tips for Deploying Spark

Spark makes big data applications easier to work with, but understanding how to deploy the platform can be confusing. This talk presents operational tips and best practices drawn from supporting Databricks customers who run Spark in production.

Operational Tips for Deploying Spark
Miklos Christine
Solutions Engineer
Databricks

$ whoami
• Previously @ Cloudera
• Deep Knowledge of Big Data Stack
• Apache Spark Expert
• Solutions Engineer @ Databricks!

Agenda
• Quick Apache Spark Overview
• Configuration Systems
• Pipeline Design Best Practices
• Debugging Techniques

Spark Core Configuration
• Command Line:
– spark-defaults.conf
– spark-env.sh
• Programmatically:
– SparkConf()
• Hadoop Configs:
– core-site.xml
– hdfs-site.xml
// Print SparkConf
println(sc.getConf.toDebugString)
// Print Hadoop Config
val hdConf = sc.hadoopConfiguration.iterator()
while (hdConf.hasNext) {
  println(hdConf.next().toString())
}
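
The configuration sources on this slide overlap, so it helps to know which one wins. Spark's documentation gives the order: values set on SparkConf take precedence over flags passed to spark-submit, which take precedence over spark-defaults.conf. A minimal pure-Python sketch of that merge follows. The helper names `parse_defaults` and `effective_conf`, the whitespace-only file parser, and the sample values are illustrative assumptions, not Spark APIs.

```python
def parse_defaults(text):
    """Parse spark-defaults.conf style text: 'key value' per line, '#' starts a comment.

    Simplified on purpose: only whitespace separators are handled here.
    """
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        conf[key] = value
    return conf


def effective_conf(defaults_text, submit_flags, spark_conf):
    """Merge the three sources; later updates override earlier ones."""
    merged = parse_defaults(defaults_text)   # lowest precedence
    merged.update(submit_flags)              # spark-submit --conf key=value
    merged.update(spark_conf)                # set on SparkConf in code: highest
    return merged


# Hypothetical sample values, for illustration only.
DEFAULTS = """
# cluster-wide defaults
spark.executor.memory   4g
spark.serializer        org.apache.spark.serializer.KryoSerializer
"""

conf = effective_conf(
    DEFAULTS,
    submit_flags={"spark.executor.memory": "8g"},
    spark_conf={"spark.app.name": "etl-job"},
)
for key in sorted(conf):
    print(key, "=", conf[key])
```

Comparing this merged view with the output of toDebugString is a quick way to spot a setting that is being overridden somewhere unexpected.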

Spark SQL Configuration
• Set SQL Configs Through SQL Interface
SET key=value;
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
• Tools to see current configurations
// View SparkSQL Config Properties
val sqlConf = sqlContext.getAllConfs
sqlConf.foreach(x => println(x._1 + " : " + x._2))

Spark Pipeline Design
• File Formats
• Compression Codecs
• Spark APIs
• Job Profiles

File Formats
• Text File Formats
– CSV
– JSON
• Avro Row Format
• Parquet Columnar Format

Compression Codecs
• Choose and Analyze Compression Codecs
– Snappy, Gzip, LZO
• Configuration Parameters
– io.compression.codecs
– spark.sql.parquet.compression.codec
– spark.io.compression.codec
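
Choosing a codec is a trade-off between compression ratio and CPU time, so it is worth measuring on a sample of your own data. The sketch below does that with codecs from the Python standard library. Gzip is on the slide; bz2 and lzma are stand-ins here because Snappy and LZO need third-party packages. The sample payload and the `compare_codecs` helper are assumptions for illustration.

```python
import bz2
import gzip
import lzma
import time

CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compare_codecs(payload):
    """Return {codec: (compression_ratio, seconds_to_compress)} for payload bytes."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(payload)
        elapsed = time.perf_counter() - start
        # Round-trip check: a codec is only useful if the data survives it.
        if decompress(packed) != payload:
            raise ValueError(f"{name} round trip failed")
        results[name] = (len(payload) / len(packed), elapsed)
    return results


# Hypothetical CSV-like sample; substitute a slice of real data when measuring.
sample = b"".join(
    f"{i},user_{i % 100},2016-02-{(i % 28) + 1:02d}\n".encode() for i in range(20000)
)

report = compare_codecs(sample)
for name, (ratio, seconds) in report.items():
    print(f"{name:5s} ratio={ratio:5.1f}x time={seconds * 1000:7.1f} ms")
```

The numbers depend heavily on the data, which is why the measurement should be repeated on a representative sample before settling on a codec.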

Small Files Problem
• Small files problem still exists
• Metadata loading
• Use coalesce()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
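
One way to pick the argument to coalesce() is to work back from a target output file size. The sketch below is plain arithmetic, not a Spark API. The 128 MiB target and the helper name `coalesce_target` are assumptions to tune for your storage layer.

```python
import math

MIB = 1024 * 1024


def coalesce_target(file_sizes_bytes, target_file_bytes=128 * MIB):
    """Suggest a partition count so each output file is roughly target_file_bytes."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    total = sum(file_sizes_bytes)
    # At least one partition, even for an empty or tiny dataset.
    return max(1, math.ceil(total / target_file_bytes))


# Hypothetical input: 1,000 files of 1 MiB each.
small_files = [1 * MIB] * 1000
n = coalesce_target(small_files)
print(n)  # 1000 MiB / 128 MiB = 7.81..., rounded up to 8
# In PySpark this value would be passed as df.coalesce(n) before writing.
```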

Partitioning
• 2 Types of Partitioning
– File level and Spark
# Get number of Spark partitions
df.rdd.getNumPartitions()
40
# Write with file-level partitioning
df.write.partitionBy("colName").saveAsTable("tableName")
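
File-level partitioning with partitionBy() writes one directory per distinct column value, named colName=value. The pure-Python sketch below mimics that layout for a handful of rows so the resulting paths are visible. The column and rows are made up, and real Spark output also contains part files and metadata that are not modelled here.

```python
from collections import defaultdict


def partition_layout(rows, col_name, table_path):
    """Group rows under Hive-style 'col=value' directories, as partitionBy does."""
    layout = defaultdict(list)
    for row in rows:
        directory = f"{table_path}/{col_name}={row[col_name]}"
        # The partition column is encoded in the path, not stored in the file.
        layout[directory].append({k: v for k, v in row.items() if k != col_name})
    return dict(layout)


# Hypothetical rows for illustration.
rows = [
    {"country": "US", "user": "a", "amount": 10},
    {"country": "DE", "user": "b", "amount": 7},
    {"country": "US", "user": "c", "amount": 3},
]

layout = partition_layout(rows, "country", "tableName")
for directory in sorted(layout):
    print(directory, layout[directory])
```

Queries that filter on the partition column can then skip whole directories, which is the main reason to choose a column that appears often in filters and has a modest number of distinct values.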

Spark Job Profiles
• Leverage Spark UI
– SQL
– Streaming

Job Profiles: Monitoring
• Monitoring & Metrics
– Spark
– Servers
• Toolset
– Ganglia
– Graphite
Ref:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

Debugging Spark
● Analyze the driver's stack trace.
● Analyze the executors' stack traces.
– Find the first executor failure.
● Review metrics
– Memory
– Disk
– Networking

Top Support Issues
● OutOfMemoryErrors
– Driver
– Executors
● Out of Disk Space Issues
● Long GC Pauses
● API Usage

Top Support Issues
● Use built-in functions instead of custom UDFs
– import pyspark.sql.functions
– import org.apache.spark.sql.functions
● Examples:
– to_date()
– get_json_object()
– regexp_extract()
Ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

Top Support Issues
● SQL Joins
– df_users.join(df_orders).explain()
– set spark.sql.autoBroadcastJoinThreshold
● Exported Parquet from External Systems
– spark.sql.parquet.binaryAsString
● Tune number of Shuffle Partitions
– spark.sql.shuffle.partitions
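
Two of these settings reduce to simple size comparisons. The sketch below models them in plain Python: a table is a broadcast candidate when its estimated size is at or below spark.sql.autoBroadcastJoinThreshold (10 MB by default, with -1 disabling it), and a shuffle partition count can be derived from the shuffle size. The helper names and the 128 MiB per-partition target are assumptions; Spark's planner relies on its own size estimates, so treat this as a rough model only.

```python
import math

MIB = 1024 * 1024
DEFAULT_BROADCAST_THRESHOLD = 10 * MIB  # Spark's documented default (10485760 bytes)
DEFAULT_SHUFFLE_PARTITIONS = 200        # Spark's documented default


def is_broadcast_candidate(table_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Simplified model: is the table small enough to ship to every executor?"""
    if threshold < 0:  # -1 turns automatic broadcast joins off
        return False
    return 0 <= table_bytes <= threshold


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB):
    """Size the shuffle so each partition is about target_partition_bytes."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


# Hypothetical sizes for illustration.
print(is_broadcast_candidate(8 * MIB))              # small dimension table
print(is_broadcast_candidate(500 * MIB))            # large fact table
print(suggest_shuffle_partitions(50 * 1024 * MIB))  # 50 GiB of shuffle data
```

If the suggested count differs a lot from the default of 200, that is a hint to set spark.sql.shuffle.partitions explicitly and confirm the effect in the explain() output and the Spark UI.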

Thank You!
mwc@databricks.com
https://www.linkedin.com/in/mrchristine
