Good afternoon everybody. I'm Ashwin and this is Nezih. We work on the big data platform team at Netflix. Today we are going to talk about our experience productionizing Spark for running ETL workloads on YARN.
I'm sure all of you have seen this Netflix page before. But did you also know that Spark contributes to the pipeline that does the personalized ordering of the rows on this page? We will talk about this application in the second part of the presentation. Not just that: Spark is also used to train some of the machine learning models that power our recommendation algorithms.
Netflix is a data-driven company. We collect a lot of data from user interactions, platforms, and services. One example of where we use this data is A/B testing. Let's say we have a UI change and we want to know whether it actually improves the experience. The biggest challenge that we face…
Just to give you an idea of the scale at which we operate, I'm going to share some numbers with you.
All of this generates a lot of data. But how much?
Today, we'll walk you through three sections.
We need to answer two questions. The first question: how do we get data into the platform? It looks like this.
There are two pipelines. At the top is the event data pipeline: this is all the data from devices and from the microservices running out in the cloud. It goes into Kafka clusters and is then moved into the warehouse by Ursula. This pipeline runs in under a minute. At the bottom is the dimension data pipeline: the live data sits in Cassandra. Every day this data is backed up, and a tool called Aegisthus processes these backups and writes them into our data warehouse. In the end, we have all of the event and dimension data on S3, ready for processing.
At this point, our data is on S3, which is our source of truth, stored in Parquet. At the very top is our interface layer. We have hundreds of internal users on the platform. The Big Data Portal is the one-stop shop for accessing our big data services. From this portal, users pick a compute engine and run their jobs, and it has cool features like schema browsing and access to S3. The portal uses the Big Data API under the hood.
Here is what our Spark on YARN deployment looks like. Thanks to Genie, we can run multiple Spark versions, which helps us experiment with new features and bug fixes. To make the best use of our resources, Spark runs alongside MapReduce on the same clusters.
Now let's move on to the technical challenges. We are going to walk through the lifecycle of a Spark job, and as we do that we'll talk about the challenges we hit.
The resource manager (RM) is the master process that manages the cluster resources. The driver, via the application master (AM), requests an initial set of executors from the RM, and the node managers launch them. In a typical Spark job you would probably use the RDD interface. We faced two technical challenges here: the first is a limitation of the RDD interface, and the second is a performance optimization. T: Here is the first one.
Recently we had a use case where we wanted to merge small files. coalesce() merges partitions according to a given number of partitions, but we wanted to merge by file size. We added support for passing a custom coalescer to this interface, and we implemented a size-based coalescing strategy.
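To make the idea concrete, here is a minimal sketch of what the hook added in SPARK-14042 enables (Spark 2.0+). The SizeBasedCoalescer below and its size estimate are illustrative only, not our internal implementation.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.Partition
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Illustrative size-based coalescer: pack parent partitions into groups until
// an (estimated) byte budget is reached. `sizeOf` is a caller-supplied
// estimate; a real implementation would look at the underlying file splits.
class SizeBasedCoalescer(targetBytes: Long, sizeOf: Partition => Long)
    extends PartitionCoalescer with Serializable {

  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    val groups = ArrayBuffer[PartitionGroup]()
    var current = new PartitionGroup()
    var currentBytes = 0L
    for (p <- parent.partitions) {
      // Start a new group once adding this partition would exceed the budget.
      if (current.partitions.nonEmpty && currentBytes + sizeOf(p) > targetBytes) {
        groups += current
        current = new PartitionGroup()
        currentBytes = 0L
      }
      current.partitions += p
      currentBytes += sizeOf(p)
    }
    if (current.partitions.nonEmpty) groups += current
    groups.toArray
  }
}

// Usage: pass the coalescer through the optional parameter added to coalesce().
// rdd.coalesce(1, shuffle = false,
//   partitionCoalescer = Some(new SizeBasedCoalescer(512L << 20, _ => 1L << 20)))
```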
T: Now let's look at the optimization in UnionRDD. We noticed that select * from table limit 10 was slow against a partitioned Hive table. We found that Spark creates a HadoopRDD for each Hive partition and wraps them in a UnionRDD. The reason for the poor performance was the sequential listing of the parent HadoopRDDs' partitions. We resolved this by computing the partitions of the UnionRDD in parallel.
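The essence of the fix, sketched below for illustration only (this is not the actual Spark patch): compute the parents' partitions on a thread pool instead of one at a time.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// Each rdd.partitions call may trigger an expensive S3 listing, so we overlap
// the listings with a small thread pool instead of doing them sequentially.
def listPartitionsInParallel(parents: Seq[RDD[_]], threads: Int = 8): Seq[Array[Partition]] = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val futures = parents.map(rdd => Future(rdd.partitions))
    futures.map(f => Await.result(f, Duration.Inf))
  } finally {
    pool.shutdown()
  }
}
```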
T: This fix was good, but we found another important problem in the S3 filesystem that was affecting listing performance. In Hadoop we found the FileSystem making an unnecessary RPC call to S3, which is expensive. We fixed this by removing the redundant RPC call. This, along with the UnionRDD fix, improved split calculation performance by 20x. T: That problem was on the read path; we hit another issue on the write path, with output committers.
The output committer writes the output of a single task in an all-or-nothing fashion. Each task writes its output to a temp directory, and the first successful attempt's directory is renamed to the final destination. But this doesn't work well with S3, where a rename is a copy plus a delete and listings are eventually consistent. To solve this we use our internal S3 output committer, which writes task output to local disk and copies only the first successful attempt to S3; we plan to open source it in the future.
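Our committer is internal, so the sketch below only illustrates the approach on top of the Hadoop OutputCommitter API; the class name, scratch directory, and single-winner coordination are assumptions, not the real implementation.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}

// Illustrative committer: tasks write under a local scratch directory and the
// committed attempt's files are uploaded to the final S3 location, so there is
// no S3 rename (copy + delete) and no dependence on list-after-write consistency.
class LocalThenS3Committer(finalOutput: Path, localScratch: Path) extends OutputCommitter {

  private def attemptDir(context: TaskAttemptContext): Path =
    new Path(localScratch, context.getTaskAttemptID.toString)

  override def setupJob(context: JobContext): Unit = ()
  override def setupTask(context: TaskAttemptContext): Unit = ()
  override def needsTaskCommit(context: TaskAttemptContext): Boolean = true

  // Upload this attempt's local output; in a real committer some coordination
  // (e.g. driver-side bookkeeping) guarantees only one attempt per task wins.
  override def commitTask(context: TaskAttemptContext): Unit = {
    val conf = context.getConfiguration
    val localFs = FileSystem.getLocal(conf)
    val s3Fs = finalOutput.getFileSystem(conf)
    val dir = attemptDir(context)
    if (localFs.exists(dir)) {
      localFs.listStatus(dir).foreach { st =>
        s3Fs.copyFromLocalFile(st.getPath, new Path(finalOutput, st.getPath.getName))
      }
    }
  }

  override def abortTask(context: TaskAttemptContext): Unit = {
    val localFs = FileSystem.getLocal(context.getConfiguration)
    localFs.delete(attemptDir(context), true)
  }
}
```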
At this point, the driver has finished listing and starts running tasks on the executors. Dynamic allocation will kick in and scale the job's executors up and down based on the workload. We had several major problems with dynamic allocation, and I'll talk about two of them here.
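For context, this is roughly how dynamic allocation is turned on (a minimal sketch; the min/max values are placeholders, not our production settings, and the external shuffle service must be running on the node managers):

```scala
import org.apache.spark.SparkConf

// Typical dynamic allocation settings for Spark on YARN.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // required for dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "500")
```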
T: Here is the first one. When dynamic allocation is enabled, we found that reading broadcast data can take a long time. The reason is that executors retrieve the block locations from the driver only once during the read. But with dynamic allocation, replicas can be removed, resulting in stale entries in the retrieved block locations. You can see this in the log from a production job. We fixed this problem on the executor side: the executors now refresh the block locations from the driver after multiple failed attempts.
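As a reminder of what a broadcast read is, here is a tiny illustrative example (the data and names are made up): the first access to the broadcast value on an executor is what fetches the blocks, and that fetch is what was timing out above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

    // The lookup table is shipped to executors as broadcast blocks.
    val countryNames = sc.broadcast(Map("US" -> "United States", "FR" -> "France"))

    // The first call to countryNames.value on each executor fetches the blocks.
    val expanded = sc.parallelize(Seq("US", "FR", "US"))
      .map(code => countryNames.value.getOrElse(code, "unknown"))

    expanded.collect().foreach(println)
    sc.stop()
  }
}
```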
T: The second problem we had is with a locality optimization. We noticed a Spark application failing due to timeouts. On debugging, we found that the AM would cancel and resend container requests if, one, the locality preference is no longer needed, or two, no locality preference is set. The problem is that we run on S3, and S3 doesn't provide locality information, which is why we saw this thrashing. We fixed this by not cancelling container requests that don't have a locality preference set.
See also https://issues.apache.org/jira/browse/SPARK-8890
Inserting a repartition before the write minimizes the number of output files and the memory consumption. This work is in progress and not yet merged.
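A rough sketch of the manual version of that repartition with the plain DataFrame API (the column name and path are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Repartition by the output partition column so each task writes to only a few
// partitions at once, which caps both the number of output files and the
// Parquet row-group buffers held in memory during the write.
def writeByDate(df: DataFrame, outputPath: String): Unit = {
  df.repartition(col("dateint"))
    .write
    .mode("overwrite")
    .partitionBy("dateint")
    .parquet(outputPath)
}
```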
Due to sequential processing, one large application event log can block new applications from showing up.
Other configs which helped fix OOM errors: move from the CMS to the G1 garbage collector, and set -XX:NewRatio=1 (give more space to the young generation).
Available in Spark 2.0
Productionizing Spark on YARN for ETL
Spark on YARN
~ 1 min
[Platform diagram: Big Data Portal and Big Data API on top of supporting tools (transport, visualization, quality, Pig workflow vis, job/cluster vis)]
• 3000 EC2 nodes on two clusters (d2.4xlarge)
• Multiple Spark versions
• Share the same infrastructure with MapReduce jobs
Spark on YARN at Netflix
Custom Coalescer Support [SPARK-14042]
• coalesce() can only “merge” using the given number of partitions
– how to merge by size?
• CombineFileInputFormat with Hive
• Support custom partition coalescing strategies
• Parent RDD partitions are listed sequentially
• Slow for tables with lots of partitions
• Parallelize listing of parent RDD partitions
UnionRDD Parallel Listing [SPARK-9926]
• Each task writes output to a temp directory
• Rename first successful task’s temp directory to final destination
• Problems with S3
• S3 rename => copy + delete
• S3 is eventually consistent
Hadoop Output Committer
• Each task writes output to local disk
• Copy first successful task’s output to S3
• avoid redundant S3 copy
• avoid eventual consistency
S3 Output Committer
• Broadcast joins/variables
• Replicas can be removed with dynamic allocation
Poor Broadcast Read Performance [SPARK-13328]
16/02/13 01:02:27 WARN BlockManager:
Failed to fetch remote block broadcast_18_piece0 (failed attempt 70)
16/02/13 01:02:27 INFO TorrentBroadcast:
Reading broadcast variable 18 took 1051049 ms
• Refresh replica locations from the driver on multiple failures
• Cancel & resend pending container requests
• if the locality preference is no longer needed
• if no locality preference is set
• No locality information with S3
• Do not cancel requests without locality preference
Incorrect Locality Optimization [SPARK-13779]
Property | Value | Description
spark.sql.hive.convertMetastoreParquet | true | enable native Parquet read path
parquet.filter.statistics.enabled | true | enable stats filtering
parquet.filter.dictionary.enabled | true | enable dictionary filtering
spark.sql.parquet.filterPushdown | true | enable Parquet filter pushdown optimization
spark.sql.parquet.mergeSchema | false | disable schema merging
spark.sql.hive.convertMetastoreParquet.mergeSchema | false | use Hive SerDe instead of built-in Parquet support
How to Enable Dictionary Filtering?
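One way these settings can be applied (a sketch for Spark 2.x; the application name is arbitrary): the spark.sql.* options go on the session conf, while the parquet.* reader options are Hadoop configuration entries.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-filtering")                                          // name is illustrative
  .config("spark.sql.hive.convertMetastoreParquet", "true")              // native Parquet read path
  .config("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
  .config("spark.sql.parquet.filterPushdown", "true")                    // filter pushdown
  .config("spark.sql.parquet.mergeSchema", "false")                      // no schema merging
  .enableHiveSupport()
  .getOrCreate()

// Parquet-level filters (stats and dictionary) are plain Hadoop conf settings.
spark.sparkContext.hadoopConfiguration.setBoolean("parquet.filter.statistics.enabled", true)
spark.sparkContext.hadoopConfiguration.setBoolean("parquet.filter.dictionary.enabled", true)
```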
Efficient Dynamic Partition Inserts [SPARK-15420*]
• Parquet buffers row group data for each file during writes
• Spark already sorts before writes, but has some limitations
• Detect if the data is already sorted
• Expose the ability to repartition data before write
• A large application can prevent new applications from showing up
• not uncommon to see event logs of GBs
• SPARK-13988 makes the processing multi-threaded
• GC tuning helps further
• move from CMS to G1 GC
• allocate more space to young generation
Spark History Server – Where is My Job?
Pig vs. Spark (Scala) vs. PySpark
[Diagram: the same job's stages (load + filter, foreach, order by + store) compared side by side for Pig, Spark, and PySpark]
Pig vs. Spark: Job #2
[Diagram: job #2's stages compared side by side for Pig, Spark (Scala), and PySpark]
[Diagram: Prototype, Build, Deploy, Run]
• A rapid innovation platform for targeting algorithms
• 5 hours (vs. 10s of hours) to compute similarity for all Netflix profiles for a 30-day window of new arrival titles
• 10 minutes to score 4M profiles for a 14-day window of new arrival titles
Production Spark Application #1: Yogen
• Personalized ordering of rows of titles
• Enrich page/row/title features with play history
• 14 stages, ~10Ks of tasks, several TBs
Production Spark Application #2: ARO
• Improved Parquet support
• Better visibility
• Explore new use cases