Big Data is like...
“Big Data is like teenage sex: everyone
talks about it, nobody really knows how
to do it, everyone thinks everyone else
is doing it, so everyone claims they are
doing it”
Big Data is all about...
● well, the data :)
● It is said that 2.5 exabytes (2.5×10^18 bytes) of data are created around the world every single day
● It is a volume at which standard tools and methods of processing can no longer be used
To the rescue
MAP REDUCE
“'MapReduce' is a framework for processing
parallelizable problems across huge datasets
using a cluster, taking into consideration
scalability and fault-tolerance”
Word Count
● The “Hello World” of the Big Data world.
● For an initial input of multiple lines, extract all words together with their number of occurrences.
Input:
To be or not to be
Let it be
Be me
It must be
Let it be
Result:
be 6
to 2
let 2
it 3
or 1
not 1
must 1
me 1
Input → Splitting → Mapping

Input:
To be or not to be
Let it be
Be me
It must be
Let it be

Splitting: the input is divided into splits, each handled by a separate mapper.

Mapping: every mapper emits a (word, 1) pair for each word in its split:
(to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
(let,1) (it,1) (be,1)
(be,1) (me,1)
(it,1) (must,1) (be,1)
(let,1) (it,1) (be,1)
Input → Splitting → Mapping → Shuffling

Shuffling: the emitted pairs are grouped by key, so all occurrences of the same word end up together:
(be,1) (be,1) (be,1) (be,1) (be,1) (be,1)
(to,1) (to,1)
(let,1) (let,1)
(it,1) (it,1) (it,1)
(or,1)
(not,1)
(must,1)
(me,1)

Moving data between nodes makes the shuffle the EXPENSIVE part of the job.
Input → Splitting → Mapping → Shuffling → Reducing → Final result

Reducing: for every key the partial counts are summed, and the merged reducer outputs form the final result:
be 6
to 2
let 2
it 3
or 1
not 1
must 1
me 1
Word count - pseudo-code
function map(String name, String document):
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
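To make the pseudo-code concrete, here is a minimal single-machine sketch of the same two functions in Scala, run over plain collections rather than a cluster; the names and the tokenization are illustrative only, not part of any framework.

// map phase: emit (word, 1) for every word of every line
def map(document: Seq[String]): Seq[(String, Int)] =
  for {
    line <- document
    word <- line.toLowerCase.split("\\s+") if word.nonEmpty
  } yield (word, 1)

// reduce phase: sum the partial counts collected for one word
def reduce(word: String, partialCounts: Iterable[Int]): (String, Int) =
  (word, partialCounts.sum)

val input = Seq("To be or not to be", "Let it be", "Be me", "It must be", "Let it be")
val mapped   = map(input)                            // the mapping step
val shuffled = mapped.groupBy { case (w, _) => w }   // the shuffling step: group pairs by key
val counts   = shuffled.map { case (w, pairs) => reduce(w, pairs.map(_._2)) }
// counts("be") == 6, counts("to") == 2, ...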
Word count - revisited
function map(String name, String document):
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
Word count: Hadoop implementation
public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    // remaining standard job wiring to complete the example: output format, paths, submission
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Performance issues
● Everything must be expressed as a Map-Reduce pair
● Output of every job is saved to files
● Iterative algorithms go through the IO path again and again
● Poor (key, value) API; even a basic join requires expensive code
Problems with MapReduce
1. MapReduce provides a difficult programming model for developers
2. It suffers from a number of performance issues
3. While batch-mode analysis is still important, reacting to events as they arrive has become more important (MapReduce lacks support for near-real-time processing)
The Big Picture
Driver Program
Cluster (Standalone, Yarn, Mesos)
Master
Executor 1
Executor 2
Executor 3

val master = "spark://host:pt"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
RDD - the definition
RDD stands for resilient distributed dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from a file
or can be created programmatically
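As a minimal sketch of both kinds of initial data (assuming a SparkContext called sc is already available; the file name is just an example):

val fromFile = sc.textFile("logs.txt")    // RDD[String], one element per line of the file
val fromCode = sc.parallelize(1 to 1000)  // RDD[Int] created programmatically from a local collection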
Getting started with Spark
https://github.com/rabbitonweb/spark-workshop
● Make it a love story: print out all lines that have both Juliet & Romeo in them
● Scala API docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.
● Would be nice to have a REPL
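One possible solution sketch for the exercise (the file path mirrors the one used later in the deck, and treating the match as case-insensitive is an assumption):

sc.textFile("src/main/resources/all-shakespeare.txt")
  .filter { line =>
    val l = line.toLowerCase
    l.contains("romeo") && l.contains("juliet") // keep only lines mentioning both characters
  }
  .collect()
  .foreach(println)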
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
Each call creates yet another RDD - a performance alert?!?!
RDD - Operations
1. Transformations
a. map
b. filter
c. flatMap
d. sample
e. union
f. intersection
g. distinct
h. groupByKey
i. ...
2. Actions
a. reduce
b. collect
c. count
d. first
e. take(n)
f. takeSample
g. saveAsTextFile
h. ...
RDD - example
val logs = sc.textFile("logs.txt")
val lcLogs = logs.map(_.toLowerCase)
val errors = lcLogs.filter(_.contains("error"))
val numberOfErrors = errors.count
The call to count is an action: it triggers the computation, and numberOfErrors holds the calculated value (a Long).
Directed Acyclic Graph
val startings = allShakespeare
  .filter(_.trim != "")
  .map(line => (line.charAt(0), line))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }
println(startings.toDebugString)

(2) ShuffledRDD[5] at reduceByKey at Ex3.scala:18 []
 +-(2) MapPartitionsRDD[4] at mapValues at Ex3.scala:17 []
    |  MapPartitionsRDD[3] at map at Ex3.scala:16 []
    |  MapPartitionsRDD[2] at filter at Ex3.scala:15 []
    |  src/main/resources/all-shakespeare.txt MapPartitionsRDD[1]
    |  src/main/resources/all-shakespeare.txt HadoopRDD[0] at textFile
Directed Acyclic Graph
val startings = allShakespeare
  .filter(_.trim != "")
  .groupBy(_.charAt(0))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }
println(startings.toDebugString)

(2) MapPartitionsRDD[6] at reduceByKey at Ex3.scala:42
 |  MapPartitionsRDD[5] at mapValues at Ex3.scala:41
 |  ShuffledRDD[4] at groupBy at Ex3.scala:40
 +-(2) MapPartitionsRDD[3] at groupBy at Ex3.scala:40
    |  MapPartitionsRDD[2] at filter at Ex3.scala:39
    |  src/main/resources/all-shakespeare.txt MapPartitionsRDD[1]
    |  src/main/resources/all-shakespeare.txt HadoopRDD[0]

Compare where the shuffle sits: in the previous DAG the only shuffle (ShuffledRDD) happens at the final reduceByKey, after the values have already been reduced to small (char, number) pairs, while here groupBy shuffles the full lines before any counting is done.
What about Resilience?
RDD stands for resilient distributed dataset
Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from a file or can be created programmatically
Resilience
Driver Program
Cluster (Standalone, Yarn, Mesos)
Master
Executor 1
Executor 2
Executor 3

val master = "spark://host:pt"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val logs = sc.textFile("logs.txt")
println(logs.count())

(A sequence of animated slides follows the three tasks T1, T2 and T3 of the count job: the master distributes them across Executors 1-3 and the partial results flow back to the driver; when one executor dies mid-job, its task is rescheduled on the surviving executors and the job still completes.)
Exercise 5 - The Big Data problem
● Write a Word Count program using Spark
● Use all-shakespeare.txt as input
Example input:
To be or not to be
Let it be
Be me
It must be
Let it be
Expected output:
be 6
to 2
let 2
it 3
or 1
not 1
must 1
me 1
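A sketch of one possible solution using the operations shown so far (the file path and the tokenization regex are assumptions); it deliberately uses groupBy, whose drawback is discussed right after:

val counts = sc.textFile("src/main/resources/all-shakespeare.txt")
  .flatMap(_.toLowerCase.split("\\W+")) // split every line into words
  .filter(_.nonEmpty)
  .groupBy(identity)                    // gather all occurrences of each word
  .mapValues(_.size)                    // and count them

counts.collect().foreach { case (word, n) => println(s"$word $n") }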
But that solution has a major flaw
● Flaw: groupBy
● But before we can understand it, we have to:
○ instantiate a Standalone cluster
○ understand how the cluster works
○ talk about serialization (& uber jar!)
○ see the Spark UI
○ talk about Spark configuration
● http://spark.apache.org/docs/latest/configuration.html
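For later reference, a minimal sketch of the contrast (words is assumed to be an RDD[String] of individual words): groupBy ships every single occurrence across the network before counting, while reduceByKey sums partial counts on each partition first and only shuffles the already-reduced pairs.

// groupBy variant: all occurrences are shuffled, then counted
val countsWithGroupBy = words.groupBy(identity).mapValues(_.size)

// reduceByKey variant: per-partition partial sums are combined map-side, so far less data is shuffled
val countsWithReduceByKey = words.map(w => (w, 1)).reduceByKey(_ + _)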
Mid-term exam
● Given all-shakespeare.txt
● Given name popularity in male-names.txt & female-names.txt
● Show how popular a given name is nowadays & how many times it occurred in Shakespeare
● Result: a key-value pair (key: name, value: pair)
E.g. Romeo is mentioned 340 times in Shakespeare and is nowadays the 688th most popular name,
so the result will be: (romeo,(688,340))
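A rough solution sketch; the layout of the name files ("<rank> <name>" per line) and the parseNameAndRank helper are assumptions made here for illustration, since the slides do not show the file format.

// assumed layout of male-names.txt / female-names.txt: "<rank> <name>" per line
def parseNameAndRank(line: String): (String, Int) = {
  val Array(rank, name) = line.trim.split("\\s+", 2)
  (name.toLowerCase, rank.toInt)
}

val occurrences = sc.textFile("src/main/resources/all-shakespeare.txt")
  .flatMap(_.toLowerCase.split("\\W+"))
  .filter(_.nonEmpty)
  .map(w => (w, 1))
  .reduceByKey(_ + _)                       // (name, number of mentions in Shakespeare)

val popularity = (sc.textFile("male-names.txt") ++ sc.textFile("female-names.txt"))
  .map(parseNameAndRank)                    // (name, present-day popularity rank)

val result = popularity.join(occurrences)   // e.g. (romeo,(688,340))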
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
What is a partition?
A partition represents a subset of the data within your distributed collection.
override def getPartitions: Array[Partition] = ???
How this subset is defined depends on the type of the RDD.
Example: HadoopRDD
val journal = sc.textFile("hdfs://journal/*")
How is a HadoopRDD partitioned?
In a HadoopRDD a partition corresponds exactly to a file chunk in HDFS.
Example: HadoopRDD
class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) { array(i) = new HadoopPartition(id, i, inputSplits(i)) }
    array
  }
Example: MapPartitionsRDD
val journal = sc.textFile("hdfs://journal/*")
val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
}
How is a MapPartitionsRDD partitioned?
MapPartitionsRDD inherits partition information from its parent RDD.
What is an RDD?
An RDD needs to hold 3 chunks of information in order to do its work:
1. pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
Running Job aka materializing DAG
sc.textFile() .groupBy() .map { } .filter { } .collect()
The chain is divided into two stages at the shuffle introduced by groupBy (Stage 1 before it, Stage 2 after it). collect() is the action, and actions are implemented using the sc.runJob method.
Running Job aka materializing DAG
/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
    rdd: RDD[T],
    partitions: Seq[Int],
    func: Iterator[T] => U,
): Array[U]
Running Job aka materializing DAG
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Multiple jobs for a single action
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and uses the results from that
 * partition to estimate the number of additional partitions needed to satisfy the limit.
 */
def take(num: Int): Array[T] = {
  (….)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (….)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (….)
  buf.toArray
}
Towards efficiency
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1)
  }

scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []

events.count
Stage 1
val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015,3,1) }
A bit more about partitions
val events = sc.textFile("hdfs://journal/*") // here a small number of partitions, let's say 4
  .repartition(256)                          // note, this will cause a shuffle
  .map(e => (extractDate(e), e))

val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015,3,1) }
  .coalesce(64)                              // this will NOT shuffle
  .map(e => (extractDate(e), e))
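A small sketch (illustrative path and numbers) of how the partition counts can be inspected around those two calls:

val raw = sc.textFile("hdfs://journal/*")
println(raw.partitions.size)        // e.g. 4, driven by the number of input splits

val widened = raw.repartition(256)  // full shuffle, spreads the data over 256 partitions
println(widened.partitions.size)    // 256

val narrowed = widened.coalesce(64) // merges existing partitions, no shuffle
println(narrowed.partitions.size)   // 64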
What is an RDD?
An RDD needs to hold 3 + 2 chunks of information in order to do its work:
1. pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner
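Those five chunks correspond roughly to the following members of Spark's RDD class (signatures abbreviated; an outline, not a complete listing):

abstract class RDD[T](...) {
  // 1. pointer to its parent(s)
  protected def getDependencies: Seq[Dependency[_]]
  // 2. how its internal data is partitioned
  protected def getPartitions: Array[Partition]
  // 3. how to evaluate its internal data
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // 4. data locality: preferred locations for a given partition
  protected def getPreferredLocations(split: Partition): Seq[String]
  // 5. optional partitioner for key-value RDDs
  val partitioner: Option[Partitioner] = None
}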
Spark performance - vs Hadoop (3)
“(...) we decided to participate in the Sort Benchmark (...), an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by (...) Hadoop (...) cluster of 2100 nodes. Using Spark on 206 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All (...) without using Spark’s in-memory cache.”