
Optimization and Internals of Apache Spark

Meetup talk

  1. Optimization and Internals of Apache Spark. Arpit Tak, Spark Developer, Quotient Technology Inc. http://in.linkedin.com/in/arpittak/
  2. Agenda
     • Execution engine (how it works)
     • The different APIs of Spark
     • Optimization (how Spark optimizes joins and filters)
     • The shuffle mechanism in a distributed system
     • Whole-stage code generation (fusing operators together by identifying stages)
     • Spark internals and what makes Spark fast
  3. Partitioning and Parallelism in Spark. What is a partition in Spark?
     • Resilient Distributed Datasets are collections of data items that are too large to fit on a single node and therefore have to be partitioned across several nodes.
     • A partition in Spark is a logical division of the data stored on a node in the cluster.
     • Partitions are the basic units of parallelism in Apache Spark.
     • RDDs in Apache Spark are collections of partitions.
     • Apache Spark can run one concurrent task for every partition of an RDD, up to the total number of cores in the cluster. (A sketch of inspecting and changing partitions follows below.)
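     As a minimal sketch (assuming an existing SparkContext named sc), the partition count of an RDD can be inspected and changed:

       // Sketch: inspecting and changing the number of partitions of an RDD.
       // Assumes an existing SparkContext `sc`.
       val data = sc.parallelize(1 to 1000000, numSlices = 8)  // ask for 8 partitions
       println(data.getNumPartitions)                          // => 8

       // One task runs per partition, so the partition count bounds parallelism
       // (up to the number of available cores).
       val wider    = data.repartition(16)  // full shuffle into 16 partitions
       val narrower = data.coalesce(4)      // shrink to 4 partitions without a full shuffle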
  4. How your Spark application runs on a Hadoop cluster
  5. Lazy Evaluation & Lineage Graph
     • Lazy evaluation:

       val lines = sc.textFile("words.txt")                         // transformation (1)
       val filtered = lines.filter(line => line.contains("word1"))  // transformation
       filtered.first()                                             // action (2)

       The benefit of lazy evaluation here is that Spark only needs to read from the file until it finds the first matching line, rather than reading the whole file, and it never has to hold the complete file content in memory.
     • Caching: rdd.cache()
     • Spark lineage: the record of which transformations need to be executed once an action has been called. (A sketch of printing the lineage follows below.)
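     The lineage Spark has recorded for an RDD can be printed with toDebugString; a minimal sketch, reusing the example above:

       // Sketch: inspecting the lineage (dependency graph) Spark has recorded.
       // Nothing is read from disk until the action is called.
       val lines = sc.textFile("words.txt")                         // transformation
       val filtered = lines.filter(line => line.contains("word1"))  // transformation
       println(filtered.toDebugString)  // prints the chain of parent RDDs back to the HadoopRDD that reads the file
       filtered.first()                 // action: triggers evaluation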
  6. Spark APIs. Write less code: compute an average.
     Using RDDs:

       data = sc.textFile(...).map(lambda line: line.split("\t"))
       data.map(lambda x: (x[0], [int(x[1]), 1])) \
           .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
           .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
           .collect()

     Using DataFrames:

       from pyspark.sql.functions import avg
       sqlCtx.table("people") \
             .groupBy("name") \
             .agg(avg("age")) \
             .collect()

     Using SQL:

       SELECT name, avg(age) FROM people GROUP BY name
  7. Not Just Less Code: Faster Implementations. [Chart: time to aggregate 10 million int pairs (seconds) for RDD (Python), RDD (Scala), DataFrame (Python), DataFrame (R), DataFrame (Scala), and DataFrame (SQL).]
  8. How Catalyst Works: An Overview. [Diagram: SQL AST, DataFrame, and Dataset programs become a query plan; Catalyst transformations produce an optimized query plan, which is executed as RDDs. The abstractions of users' programs are trees.] (A sketch of viewing these plans follows below.)
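     As a hedged sketch (assuming a SparkSession named spark and a registered people view), each plan Catalyst produces can be printed with explain(true); the exact output varies by Spark version:

       // Sketch: asking Catalyst to show each plan it produces for a query.
       val df = spark.sql("SELECT name, avg(age) FROM people GROUP BY name")
       df.explain(true)
       // == Parsed Logical Plan ==     the tree built from the SQL AST / DataFrame calls
       // == Analyzed Logical Plan ==   columns and types resolved against the catalog
       // == Optimized Logical Plan ==  after Catalyst rules (e.g. predicate pushdown)
       // == Physical Plan ==           the operators that will actually run over RDDs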
  9. Trees: Abstractions of Users' Programs. Example:

       SELECT sum(v)
       FROM (
         SELECT t1.id, 1 + 2 + t1.value AS v
         FROM t1 JOIN t2
         WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
  10. Trees: Abstractions of Users' Programs. Query plan for the same query: Scan (t1) and Scan (t2) feed a Join, followed by Filter (t1.id = t2.id, t2.id > 50 * 1000), Project (t1.id, 1 + 2 + t1.value AS v), and Aggregate (sum(v)).
  11. Logical Plan
     • A logical plan describes computation on datasets without defining how to conduct the computation.
     • output: the list of attributes generated by the logical plan, e.g. [id, v].
     [Tree: Scan (t1), Scan (t2) -> Join -> Filter (t1.id = t2.id, t2.id > 50 * 1000) -> Project (t1.id, 1 + 2 + t1.value AS v) -> Aggregate (sum(v))]
  12. Physical Plan
     • A physical plan describes computation on datasets with specific definitions of how to conduct the computation.
     [Tree: Parquet Scan (t1), JSON Scan (t2) -> Sort-Merge Join -> Filter (t1.id = t2.id, t2.id > 50 * 1000) -> Project (t1.id, 1 + 2 + t1.value AS v) -> HashAggregate (sum(v))]
  13. Combining Multiple Rules: Predicate Pushdown. [Diagram: the same plan before and after the rule. Before: Scan (t1), Scan (t2) -> Join -> Filter (t1.id = t2.id, t2.id > 50 * 1000) -> Project (t1.id, 1 + 2 + t1.value AS v) -> Aggregate (sum(v)). After pushdown, the predicate on t2.id is evaluated below the join, next to the scan, so far fewer rows reach the join.] (A sketch of pushdown in action follows below.)
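     A hedged sketch of what pushdown looks like at the data source level, assuming a Parquet dataset at a hypothetical path /data/t2: the pushed predicate shows up in the scan node of the physical plan.

       // Sketch: predicate pushdown visible in the physical plan.
       import spark.implicits._                 // for the $"col" syntax
       val t2 = spark.read.parquet("/data/t2")  // hypothetical path
       t2.filter($"id" > 50 * 1000).explain()
       // The Parquet scan in the physical plan carries the pushed predicate, e.g.:
       //   ... PushedFilters: [IsNotNull(id), GreaterThan(id,50000)] ...
       // so rows can be skipped at the source instead of being filtered after a full scan.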
  14. Whole-Stage Code Generation
     • Fuses multiple operators together into a single Java function, with the aim of improving execution performance.
     • Identifies chains of operators (stages).
     • Collapses a query into a single optimized function that eliminates virtual function calls and keeps intermediate data in CPU registers.
     Performance optimizations:
     • No virtual function dispatches
     • Intermediate data kept in CPU registers instead of memory
     • Loop unrolling
     (A sketch of spotting fused stages in a plan follows below.)
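     A minimal sketch (Spark 2.x, assuming a SparkSession named spark): operators fused by whole-stage code generation are marked with '*' in the physical plan printed by explain(); the exact output varies by version.

       // Sketch: everything inside one '*(n)' block is compiled into a single Java function.
       val q = spark.range(0, 1000000).selectExpr("id % 10 AS k").groupBy("k").count()
       q.explain()
       // *(2) HashAggregate(keys=[k], functions=[count(1)])
       // +- Exchange hashpartitioning(k, 200)
       //    +- *(1) HashAggregate(keys=[k], functions=[partial_count(1)])
       //       +- *(1) Project [(id % 10) AS k]
       //          +- *(1) Range (0, 1000000, step=1, splits=...)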
  15. Tree Transformations. Developers express tree transformations as PartialFunction[TreeType, TreeType]:
     1. If the function applies to an operator, that operator is replaced with the result.
     2. When the function does not apply to an operator, that operator is left unchanged.
     3. The transformation is applied recursively to all children.
  16. An example query:

       SELECT name
       FROM (
         SELECT id, name
         FROM People) p
       WHERE p.id = 1

     [Logical plan, top to bottom: Project (name), Filter (id = 1), Project (id, name), People]
  17. Naive Query Planning. The same query, planned without any optimization.
     [Logical plan, top to bottom: Project (name), Filter (id = 1), Project (id, name), People]
     [Physical plan, top to bottom: Project (name), Filter (id = 1), Project (id, name), TableScan (People)]
  18. Writing Rules as Tree Transformations
     1. Find filters on top of projections.
     2. Check that the filter can be evaluated without the result of the project.
     3. If so, switch the operators.
     [Diagram: the original plan (Project name, Project id,name, Filter id = 1, People) next to the plan after filter push-down, where the Filter now sits below the Project.]
  19. Filter Push-Down Transformation

       val newPlan = queryPlan transform {
         case f @ Filter(_, p @ Project(_, grandChild))
             if f.references subsetOf grandChild.output =>
           p.copy(child = f.copy(child = grandChild))
       }
  20. Filter Push-Down Transformation. The argument to transform is a partial function applied over the tree:

       val newPlan = queryPlan transform {
         case f @ Filter(_, p @ Project(_, grandChild))
             if f.references subsetOf grandChild.output =>
           p.copy(child = f.copy(child = grandChild))
       }
  21. Filter Push-Down Transformation. Step 1: find a Filter on top of a Project:

       val newPlan = queryPlan transform {
         case f @ Filter(_, p @ Project(_, grandChild))
             if f.references subsetOf grandChild.output =>
           p.copy(child = f.copy(child = grandChild))
       }
  22. Filter Push-Down Transformation. Step 2: check that the filter can be evaluated without the result of the project:

       val newPlan = queryPlan transform {
         case f @ Filter(_, p @ Project(_, grandChild))
             if f.references subsetOf grandChild.output =>
           p.copy(child = f.copy(child = grandChild))
       }
  23. Filter Push-Down Transformation. Step 3: if so, switch the order of the two operators:

       val newPlan = queryPlan transform {
         case f @ Filter(_, p @ Project(_, grandChild))
             if f.references subsetOf grandChild.output =>
           p.copy(child = f.copy(child = grandChild))
       }
  24. Optimizing with Rules. [Diagram: the plan evolves rule by rule. Original plan: Project (name), Project (id, name), Filter (id = 1), People. After filter push-down: the Filter (id = 1) moves below the projections. After combining projections: Project (name), Filter (id = 1), People. Final physical plan: IndexLookup (id = 1, return: name).]
  25. Shuffle Hash Join. A shuffle hash join is the most basic type of join and goes back to MapReduce fundamentals:
     • Map over the two data frames/tables.
     • Use the fields in the join condition as the output key.
     • Shuffle both datasets by the output key.
     • In the reduce phase, join the two datasets: all rows of both tables with the same key are now on the same machine and are sorted.
  26. Shuffle Hash Join. [Diagram: Table 1 and Table 2 go through MAP, SHUFFLE, and REDUCE phases to produce the joined output partitions.]
  27. Shuffle Hash Join Performance. Works best when the DataFrames:
     • distribute evenly on the key you are joining on, and
     • have an adequate number of keys for parallelism.

       join_rdd = sqlContext.sql("""
         SELECT *
         FROM people_in_the_us
         JOIN states
         ON people_in_the_us.state = states.name""")
  28. Uneven Sharding & Limited Parallelism. All the data for the US will be shuffled into only 50 keys, one per state.
     Problems:
     • Uneven sharding: all the data for CA lands in one partition, all the data for RI in another.
     • Limited parallelism with only 50 output partitions.
     A larger Spark cluster will not solve these problems! [Diagram: many US DF partitions joined against a small state DF.] (A sketch for spotting key skew follows below.)
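     A minimal sketch for checking whether a join key is skewed before shuffling, assuming a hypothetical DataFrame peopleInTheUS with a state column:

       // Sketch: count rows per join key to spot skew.
       import org.apache.spark.sql.functions.desc

       peopleInTheUS
         .groupBy("state")
         .count()                 // adds a `count` column with the rows per key
         .orderBy(desc("count"))
         .show(5)
       // If a handful of keys (e.g. CA) hold most of the rows, the shuffle join will
       // funnel most of the data through a few tasks, regardless of cluster size.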
  29. More Performance Considerations.

       join_rdd = sqlContext.sql("""
         SELECT *
         FROM people_in_california
         LEFT JOIN all_the_people_in_the_world
         ON people_in_california.id = all_the_people_in_the_world.id""")

     The final output keys are only the people in CA, so we don't need a huge Spark cluster, right?
  30. Left Join: the Shuffle Step. The size of the Spark cluster needed to run this job is limited by the large table rather than the medium-sized table.
     Not a problem:
     • Even sharding
     • Good parallelism
     The problem: everything from both tables is shuffled before the unmatched keys are dropped. [Diagram: all CA DF and all World DF are shuffled in full, then reduced to the final joined output.]
  31. A Better Solution: filter the World DF down to only the entries that match a CA id before the shuffle.
     Benefits:
     • Less data is shuffled over the network and less shuffle space is needed.
     • More transforms, but still faster.
     [Diagram: All CA DF and All World DF -> filter transform -> Partial World DF -> shuffle -> final joined output.] (A sketch of this follows below.)
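     A hedged sketch of the "filter first" idea, using hypothetical DataFrames peopleInCalifornia (small) and allPeopleInTheWorld (large):

       import org.apache.spark.sql.functions.broadcast

       // Keep only the world rows whose id also appears in the CA table
       // (a broadcast left-semi join, so the large table is not shuffled for this step).
       val partialWorld = allPeopleInTheWorld
         .join(broadcast(peopleInCalifornia.select("id")), Seq("id"), "left_semi")

       // The real join now only has to shuffle the much smaller partialWorld.
       val joined = peopleInCalifornia.join(partialWorld, Seq("id"), "left")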
  32. Broadcast Hash Join. Optimization for when one of the DFs is small enough to fit in memory on a single machine: broadcast the small DF to every partition of the large DF. The parallelism of the large DF is maintained (n output partitions), and no shuffle is needed at all. [Diagram: the small DF is copied alongside each large DF partition.] (A sketch follows below.)
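     A minimal sketch of requesting a broadcast hash join with the broadcast() hint, using hypothetical DataFrames largeDF and smallDF:

       import org.apache.spark.sql.functions.broadcast

       val joined = largeDF.join(broadcast(smallDF), Seq("id"))
       joined.explain()  // the physical plan should show BroadcastHashJoin and no shuffle of largeDF

       // Spark can also broadcast automatically when a table's estimated size is below
       // spark.sql.autoBroadcastJoinThreshold (10 MB by default).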
  33. Thank You
