Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer

Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames and Datasets to streaming. At its core, Catalyst is a general library for manipulating trees. Based on this library, we have built a modular compiler frontend for Spark, including a query analyzer, optimizer, and an execution planner. In this talk, I will first introduce the concepts of Catalyst trees, followed by the major features that were added to support Spark's powerful API abstractions. The audience will walk away with a deeper understanding of how Spark 2.0 works under the hood.


  1. 1. Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer Yin Huai Spark Summit 2016
  2. 2. Write Programs Using RDD API: SELECT count(*) FROM (SELECT t1.id FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
  3. 3. Solution 1: val count = t1.cartesian(t2).filter { case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2 }.filter { case (id1FromT1, id2FromT2) => id2FromT2 > 50 * 1000 }.map { case (id1FromT1, id2FromT2) => id1FromT1 }.count println("Count: " + count) (query tree: t1 join t2, t1.id = t2.id, t2.id > 50 * 1000)
  4. 4. Solution 2: val filteredT2 = t2.filter(id1FromT2 => id1FromT2 > 50 * 1000) val preparedT1 = t1.map(id1FromT1 => (id1FromT1, id1FromT1)) val preparedT2 = filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2)) val count = preparedT1.join(preparedT2).map { case (id1FromT1, _) => id1FromT1 }.count println("Count: " + count) (query tree: filter t2.id > 50 * 1000 first, then t1 join t2 on t1.id = t2.id)
  5. 5. Solution 1 vs. Solution 2: execution time (s): Solution 1 ≈ 53.92, Solution 2 ≈ 0.26, roughly a 200x difference
  6. 6. Solution 1: val count = t1.cartesian(t2).filter { case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2 }.filter { case (id1FromT1, id2FromT2) => id2FromT2 > 50 * 1000 }.map { case (id1FromT1, id2FromT2) => id1FromT1 }.count println("Count: " + count) (query tree: t1 join t2, t1.id = t2.id, t2.id > 50 * 1000)
  7. 7. Solution 2: val filteredT2 = t2.filter(id1FromT2 => id1FromT2 > 50 * 1000) val preparedT1 = t1.map(id1FromT1 => (id1FromT1, id1FromT1)) val preparedT2 = filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2)) val count = preparedT1.join(preparedT2).map { case (id1FromT1, _) => id1FromT1 }.count println("Count: " + count) (query tree: filter t2.id > 50 * 1000 first, then t1 join t2 on t1.id = t2.id)
  8. 8. Write Programs Using RDD API • Users' functions are black boxes • Opaque computation • Opaque data types • Programs built using the RDD API have total control over how to execute every data operation • Developers have to write efficient programs for different kinds of workloads
  9. 9. 9 Is there an easy way to write efficient programs?
  10. 10. 10 The easiest way to write efficient programs is to not worry about it and get your programs automatically optimized
  11. 11. How? • Write programs using high-level programming interfaces • Programs are used to describe what data operations are needed, without specifying how to execute those operations • High-level programming interfaces: SQL, DataFrame, and Dataset • Get an optimizer that automatically finds out the most efficient plan to execute the data operations specified in the user's program (a DataFrame sketch of such a program follows below)
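For illustration, here is a minimal sketch of the query from slide 2 written against the high-level DataFrame API (Spark 2.0). The SparkSession setup and the generated t1/t2 data are assumptions made only to keep the sketch self-contained; Catalyst decides how the join and filter are actually executed.

    // Minimal sketch: the query from slide 2 in the DataFrame API (Spark 2.0).
    // The SparkSession and the spark.range(...) data are placeholders, not the talk's actual setup.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()

    val t1 = spark.range(0, 100 * 1000).toDF("id")   // stands in for table t1
    val t2 = spark.range(0, 100 * 1000).toDF("id")   // stands in for table t2

    val count = t1.join(t2, t1("id") === t2("id"))   // declare the join ...
      .filter(t2("id") > 50 * 1000)                  // ... and the predicate
      .select(t1("id"))
      .count()                                       // Catalyst plans how this is executed

    println("Count: " + count)

Unlike Solutions 1 and 2, this program only states what is wanted; predicate pushdown and the join strategy are left to the optimizer.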
  12. 12. 12 Catalyst: Apache Spark’s Optimizer
  13. 13. How Catalyst Works: An Overview (SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs; Catalyst performs Transformations and Code Generation over Trees, the abstractions of users' programs)
  14. 14. How Catalyst Works: An Overview (SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs; Catalyst performs Transformations and Code Generation over Trees, the abstractions of users' programs)
  15. 15. 15 Trees: Abstractions of Users’ Programs SELECT sum(v) FROM ( SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
  16. 16. Trees: Abstractions of Users' Programs SELECT sum(v) FROM ( SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp • Expression: an expression represents a new value, computed based on input values • e.g. 1 + 2 + t1.value • Attribute: a column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v) (a sketch of this expression as a Catalyst tree follows below)
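As a concrete, hedged illustration, the expression 1 + 2 + t1.value can be built directly from Catalyst's internal expression classes. The value column's IntegerType is an assumption about the example schema, and these are internal APIs shown only to make the tree shape visible.

    // Sketch: the expression 1 + 2 + t1.value as a Catalyst expression tree (Spark 2.0 internals).
    import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Literal}
    import org.apache.spark.sql.types.IntegerType

    val value = AttributeReference("value", IntegerType)()   // the attribute t1.value (type assumed)
    val expr  = Add(Add(Literal(1), Literal(2)), value)      // (1 + 2) + t1.value as a tree of nodes

    println(expr)   // prints the nested tree, e.g. ((1 + 2) + value#0)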
  17. 17. Trees: Abstractions of Users' Programs SELECT sum(v) FROM ( SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp (Query Plan tree: Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id, t2.id > 50*1000] → Project [t1.id, 1+2+t1.value AS v] → Aggregate [sum(v)])
  18. 18. Logical Plan • A Logical Plan describes computation on datasets without defining how to conduct the computation • output: a list of attributes generated by this Logical Plan, e.g. [id, v] • constraints: a set of invariants about the rows generated by this plan, e.g. t2.id > 50 * 1000 (plan tree: Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id, t2.id > 50*1000] → Project [t1.id, 1+2+t1.value AS v] → Aggregate [sum(v)]; a sketch of inspecting output and constraints follows below)
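A small hedged sketch of how a Logical Plan's output and constraints can be inspected from user code; it assumes a SparkSession named spark and that t1 and t2 are registered as temporary views with id and value columns.

    // Sketch: looking at the output attributes and constraints of an optimized logical plan.
    val df = spark.sql(
      "SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000")

    val plan = df.queryExecution.optimizedPlan
    println(plan.output)        // attributes produced by the plan, e.g. [id, v]
    println(plan.constraints)   // invariants inferred about its rows, e.g. isnotnull(t2.id), t2.id > 50000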
  19. 19. Physical Plan • A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation • A Physical Plan is executable (plan tree: Parquet Scan(t1), JSON Scan(t2) → Sort-Merge Join → Filter [t1.id = t2.id, t2.id > 50*1000] → Project [t1.id, 1+2+t1.value AS v] → Hash Aggregate [sum(v)])
  20. 20. How Catalyst Works: An Overview (SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs; Catalyst performs Transformations and Code Generation over Trees, the abstractions of users' programs)
  21. 21. Transformations • Transformations that do not change the tree type (Transform and Rule Executor) • Expression => Expression • Logical Plan => Logical Plan • Physical Plan => Physical Plan • Transforming a tree into another kind of tree • Logical Plan => Physical Plan
  22. 22. Transform • A function associated with every tree, used to implement a single rule • Example: the tree for 1 + 2 + t1.value, i.e. Add(Add(Literal(1), Literal(2)), Attribute(t1.value)), which evaluates 1 + 2 for every row, is transformed into the tree for 3 + t1.value, i.e. Add(Literal(3), Attribute(t1.value)), which evaluates 1 + 2 only once
  23. 23. Transform • A transformation is defined as a Partial Function • Partial Function: a function that is defined for a subset of its possible arguments val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } (the case statement determines whether the partial function is defined for a given input)
  24. 24. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 24 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value
  25. 25. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 25 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value
  26. 26. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 26 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value
  27. 27. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 27 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value Attribute (t1.value) Add Literal(3) 3+ t1.value
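The rewrite walked through on slides 24 to 27 can be modelled outside Spark with a few toy case classes. This is a self-contained sketch of the idea, not Catalyst's actual implementation; Lit, Attr, and this Add are illustrative stand-ins.

    // Toy model of transform: rewrite children bottom-up, then apply the rule where it is defined.
    object TransformSketch {
      sealed trait Expr {
        def transform(rule: PartialFunction[Expr, Expr]): Expr = {
          val withNewChildren = this match {
            case Add(l, r) => Add(l.transform(rule), r.transform(rule))   // recurse into children
            case leaf      => leaf                                        // Lit and Attr have no children
          }
          if (rule.isDefinedAt(withNewChildren)) rule(withNewChildren) else withNewChildren
        }
      }
      case class Lit(value: Int) extends Expr
      case class Attr(name: String) extends Expr
      case class Add(left: Expr, right: Expr) extends Expr

      def main(args: Array[String]): Unit = {
        val expr = Add(Add(Lit(1), Lit(2)), Attr("t1.value"))   // 1 + 2 + t1.value
        val folded = expr.transform {
          case Add(Lit(x), Lit(y)) => Lit(x + y)                // the constant-folding rule
        }
        println(folded)   // Add(Lit(3),Attr(t1.value)), i.e. 3 + t1.value
      }
    }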
  28. 28. Combining Multiple Rules: Predicate Pushdown (before: Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id, t2.id > 50*1000] → Project [t1.id, 1+2+t1.value AS v] → Aggregate [sum(v)]; after: the predicate t2.id > 50*1000 is pushed below the Join, next to Scan(t2), and the Join keeps t1.id = t2.id)
  29. 29. Combining Multiple Rules: Constant Folding (1+2+t1.value AS v is folded to 3+t1.value AS v, and t2.id > 50 * 1000 is folded to t2.id > 50000)
  30. 30. Combining Multiple Rules: Column Pruning (Project nodes are inserted directly above the scans so only the needed columns are read: t1.id and t1.value from t1, t2.id from t2)
  31. 31. Combining Multiple Rules: before transformations, the plan is Scan(t1), Scan(t2) → Join → Filter [t1.id = t2.id, t2.id > 50*1000] → Project [t1.id, 1+2+t1.value AS v] → Aggregate [sum(v)]; after transformations, column-pruning Projects and the filter t2.id > 50000 sit directly above the scans, the Join carries t1.id = t2.id, and the Project computes 3+t1.value AS v
  32. 32. Combining Multiple Rules: Rule Executor • A Rule Executor transforms a Tree into another Tree of the same type by applying many rules organized in batches (Batch 1, Batch 2, …, Batch n, each containing Rule 1, Rule 2, …) • Every rule is implemented based on Transform • Approaches of applying rules: 1. Fixed point 2. Once (a toy sketch of this loop follows below)
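For intuition, here is a toy sketch of the batch and fixed-point loop a rule executor performs. It is not Catalyst's RuleExecutor (which lives under org.apache.spark.sql.catalyst.rules); all names here are illustrative.

    // Toy sketch of a rule executor: batches of rules, each applied Once or until a fixed point.
    object RuleExecutorSketch {
      type Rule[T] = T => T   // a rule rewrites a tree into another tree of the same type

      sealed trait ApplyStrategy
      case object Once extends ApplyStrategy
      case class FixedPoint(maxIterations: Int) extends ApplyStrategy

      case class Batch[T](name: String, strategy: ApplyStrategy, rules: Seq[Rule[T]])

      def execute[T](plan: T, batches: Seq[Batch[T]]): T =
        batches.foldLeft(plan) { (current, batch) =>
          batch.strategy match {
            case Once => batch.rules.foldLeft(current)((p, rule) => rule(p))
            case FixedPoint(max) =>
              // Keep re-applying the batch until the tree stops changing (or the cap is hit).
              var p = current
              var changed = true
              var i = 0
              while (changed && i < max) {
                val next = batch.rules.foldLeft(p)((q, rule) => rule(q))
                changed = next != p
                p = next
                i += 1
              }
              p
          }
        }
    }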
  33. 33. Transformations • Transformations that do not change the tree type (Transform and Rule Executor) • Expression => Expression • Logical Plan => Logical Plan • Physical Plan => Physical Plan • Transforming a tree into another kind of tree • Logical Plan => Physical Plan
  34. 34. From Logical Plan to Physical Plan • A Logical Plan is transformed to a Physical Plan by applying a set of Strategies • Every Strategy uses pattern matching to convert a Tree to another kind of Tree object BasicOperators extends Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { … case logical.Project(projectList, child) => execution.ProjectExec(projectList, planLater(child)) :: Nil case logical.Filter(condition, child) => execution.FilterExec(condition, planLater(child)) :: Nil … } } (planLater triggers other Strategies)
  35. 35. (Catalyst pipeline: SQL AST / DataFrame / Dataset (Java/Scala) → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → RDDs)
  36. 36. • Analysis (Rule Executor): transforms an Unresolved Logical Plan into a Resolved Logical Plan • Unresolved => Resolved: uses the Catalog to find where datasets and columns come from and the types of the columns • Logical Optimization (Rule Executor): transforms a Resolved Logical Plan into an Optimized Logical Plan • Physical Planning (Strategies + Rule Executor): transforms an Optimized Logical Plan into a Physical Plan (these phases can be observed with explain, as sketched below)
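These phases can be watched from user code with the public explain/queryExecution API (Spark 2.0). This sketch assumes a SparkSession named spark and that t1 and t2 are registered as temporary views with id and value columns.

    // Watching Catalyst's phases on the example query.
    val df = spark.sql(
      """SELECT sum(v)
        |FROM (SELECT t1.id, 1 + 2 + t1.value AS v
        |      FROM t1 JOIN t2
        |      WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp""".stripMargin)

    df.explain(true)   // prints the parsed, analyzed, and optimized logical plans plus the physical plan

    // The same plans are available programmatically:
    df.queryExecution.logical         // unresolved logical plan
    df.queryExecution.analyzed        // resolved logical plan (after Analysis, using the Catalog)
    df.queryExecution.optimizedPlan   // after Logical Optimization
    df.queryExecution.sparkPlan       // the selected physical plan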
  37. 37. Spark’s Planner • 1st phase: transforms the Logical Plan to the Physical Plan using Strategies • 2nd phase: uses a Rule Executor to make the Physical Plan ready for execution • Prepare scalar sub-queries • Ensure requirements on input rows • Apply physical optimizations
  38. 38. Ensure Requirements on Input Rows: a Sort-Merge Join [t1.id = t2.id] requires its inputs to be sorted: rows from t1 sorted by t1.id and rows from t2 sorted by t2.id
  39. 39. Ensure Requirements on Input Rows: Sort [t1.id] and Sort [t2.id] operators are added below the Sort-Merge Join to satisfy those requirements
  40. 40. Ensure Requirements on Input Rows: what if t1 has already been sorted by t1.id? The plan above still sorts both inputs
  41. 41. Ensure Requirements on Input Rows: if t1 is already sorted by t1.id, the Sort [t1.id] is removed and only Sort [t2.id] remains below the Sort-Merge Join
  42. 42. 42 Catalyst in Apache Spark
  43. 43. (Spark stack: Spark Core (RDD) at the bottom, Catalyst and Spark SQL above it exposing the SQL and DataFrame/Dataset APIs, with ML Pipelines, Structured Streaming, and GraphFrames built on top, and data sources such as { JSON }, JDBC, and more) With Spark 2.0, we expect most users to migrate to the high-level APIs (SQL, DataFrame, and Dataset)
  44. 44. Where to Start • Source code: • Trees: TreeNode, Expression, Logical Plan, and Physical Plan • Transformations: Analyzer, Optimizer, and Planner • Check out previous pull requests • Start to write code using Catalyst
  45. 45. SPARK-12032: ~200 lines of changes (95 lines are for tests); certain workloads can run hundreds of times faster
  46. 46. SPARK-8992: pivot table support; ~250 lines of changes (99 lines are for tests)
  47. 47. Try Apache Spark with Databricks • Try the latest version of Apache Spark and a preview of Spark 2.0 • http://databricks.com/try
  48. 48. Thank you. Office hour: 2:45pm–3:30pm @ Expo Hall
