
11. From Hadoop to Spark 1/2


In these slides we analyze why aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open-source implementation Hadoop, and we consider how MapReduce jobs are written and executed by Hadoop.
Finally, we introduce Spark using a Docker image and show how to use anonymous functions in Spark.
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning


  1. 1. From Hadoop to Spark 1/2 Dr. Fabio Fumarola
  2. 2. Contents • Aggregate and Cluster • Scatter Gather and MapReduce • MapReduce • Why Spark? • Spark: – Example, task and stages – Docker Example – Scala and Anonymous Functions • Next Topics in 2/2 2
  3. 3. Aggregates and Clusters • Aggregate-oriented databases change the rules for data storage (CAP) • But running on a cluster also changes the computation model • When you store data on a cluster you can process it in parallel 3
  4. 4. Database vs Client Processing • With a centralized database data can be processed on the database server or on the client machine • Running on the client: – Pros: flexibility and programming languages – Cons: data transfer from the server to the client • Running on the server: – Pros: Data locality – Cons: programming languages and debugging 4
  5. 5. Cluster and Computation • We can spread the computation across the cluster • However, we have to reduce the amount of data transferred over the network • We need computation locality • That is, we process the data on the same node where it is stored 5
  6. 6. Scatter-Gather 6 Use a Scatter-Gather that broadcasts a message to multiple recipients and re-aggregates the responses back into a single message. 2003
  7. 7. Map-Reduce • It is a way to organize processing by taking advantage of clusters • It gained prominence with Google’s MapReduce framework (Dean and Ghemawat 2004) • It was then implemented in Hadoop Framework 7 http://research.google.com/archive/mapreduce.html https://hadoop.apache.org/
  8. 8. Programming Model: MapReduce • We have a huge text document • Count the number of times each distinct word appears in the file • Sample applications – Analyze web server logs to find popular URLs – Term statistics for search 8
  9. 9. Word Count • Assumption: the input file is too large for memory, but all <word, count> pairs fit in memory • We can compute the pairs by – wc file.txt | sort | uniq -c • Encyclopedia Britannica, 11th Edition, Volume 4, Part 3 (http://www.gutenberg.org/files/19699/19699.zip) 9
  10. 10. Word Count Steps wc file.txt | sort | uniq -c • Map: scan the input file record-at-a-time and extract keys from each record • Group by key: sort and shuffle • Reduce: aggregate, summarize, filter or transform, then write the results 10
  11. 11. MapReduce: Logical Steps • Map • Group by Key • Reduce 11
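
A minimal sketch in plain Scala (local collections only, no Hadoop, with an invented two-line input) makes the three logical steps above concrete:

  // Word count expressed as the three logical MapReduce steps.
  val lines = Seq("to be or not to be", "to do or not to do")

  // Map: emit a (word, 1) pair for every word in every record
  val mapped = lines.flatMap(_.split("\\s+").map(word => (word, 1)))

  // Group by key: the single-machine equivalent of sort and shuffle
  val grouped = mapped.groupBy(_._1)

  // Reduce: aggregate the counts for each word
  val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  counts.foreach(println)   // (to,4), (be,2), (or,2), (not,2), (do,2)
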
  12. 12. Map Phase 12
  13. 13. Group and Reduce Phase 13 Partition and shuffling
  14. 14. MapReduce: Word Counting 14
  15. 15. Word Count with MapReduce 15
  16. 16. Example: Language Model • Count the number of times each 5-word sequence occurs in a large corpus of documents • Map – Extract <5-word sequence, count> pairs from each document • Reduce – Combine the counts 16
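
A rough local sketch of the same job in plain Scala (the two documents below are invented for illustration); in MapReduce the groupBy corresponds to the shuffle and the summing happens in the reducers:

  // Count 5-word sequences (5-grams) across a tiny corpus.
  val documents = Seq(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps again"
  )

  // Map: emit every 5-word window of every document
  val fiveGrams = documents.flatMap(_.split("\\s+").sliding(5).map(_.mkString(" ")))

  // Reduce: combine the counts per 5-gram
  val counts = fiveGrams.groupBy(identity).map { case (g, occ) => (g, occ.size) }
  counts.foreach(println)   // ("the quick brown fox jumps", 2), ...
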
  17. 17. MapReduce: Physical Execution 17
  18. 18. Physical Execution: Concerns • Mapper intermediate results are sent to a single reducer: – This is the only step that requires communication over the network • Thus the Partition and Shuffle phases are critical 18
  19. 19. Partition and Shuffle Reducing the cost of these steps dramatically reduces the computation time: • The Partitioner determines which partition a given (key, value) pair will go to • The default partitioner computes a hash value for the key and assigns the partition based on this result • The shuffle phase moves the map outputs to the reducers 19
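
The default hash partitioning can be sketched in a few lines of Scala; this only illustrates the idea behind the framework's default partitioner, it is not the actual Hadoop class:

  // A (key, value) pair goes to the partition given by a non-negative hash of
  // the key modulo the number of reducers, so equal keys always meet in the
  // same reducer.
  def defaultPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  defaultPartition("spark", 3)   // always the same partition for the key "spark"
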
  20. 20. MapReduce: Features • Partitioning the input data • Scheduling the program's execution across a set of machines • Performing the group-by-key step • Handling node failures • Managing the required inter-machine communication 20
  21. 21. Hadoop MapReduce 21
  22. 22. Hadoop • Open-source software for distributed storage of large datasets on commodity hardware • Provides a programming model/framework for processing large datasets in parallel 22 (Diagram: Input → Map tasks → Reduce tasks → Output)
  23. 23. Hadoop: Architecture 23
  24. 24. Distributed File System • Data is kept in "chunks" spread across machines • Each chunk is replicated on different machines (persistence and availability) 24
  25. 25. Distributed File System • Chunk Servers – File is split into contiguous chunks (16-64 MB) – Each chunk is replicated 3 times – Try to keep replicas on different machines • Master Node – Name Node in Hadoop’s HDFS – Stores metadata about where files are stored – Might be replicated 25
  26. 26. Hadoop’s Limitations 26
  27. 27. Limitations of Map Reduce • Slow due to replication, serialization, and disk IO • Inefficient for: – Iterative algorithms (Machine Learning, Graphs & Network Analysis) – Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching) 27 (Diagram: each iteration reads its input from HDFS and writes its output back to HDFS between Map/Reduce passes)
  28. 28. Solutions? • Leverage memory: – Load data into memory – Replace disks with SSDs 28
  29. 29. Apache Spark • A big data analytics cluster-computing framework written in Scala • Originally open-sourced by the AMPLab at UC Berkeley • Provides in-memory analytics based on RDDs • Highly compatible with the Hadoop Storage API – Can run on top of a Hadoop cluster • Developers can write programs using multiple programming languages 29
  30. 30. Spark architecture 30 (Diagram: the Spark Driver (Master) and the Cluster Manager coordinate Spark Workers, each with its own cache, co-located with the HDFS DataNodes that hold the data blocks)
  31. 31. Hadoop Data Flow 31 (Diagram: every iteration reads its input from HDFS and writes its output back to HDFS)
  32. 32. Spark Data Flow 32 Not tied to the two-stage MapReduce paradigm: 1. Extract a working set 2. Cache it 3. Query it repeatedly (Diagram: only the first iteration reads from HDFS; later iterations reuse cached data. Benchmark shown: logistic regression in Hadoop vs Spark)
  33. 33. Spark Programming Model 33 The user (developer) writes a Driver Program: sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.filter(…); rDD.cache; rDD.count; rDD.map. The Driver Program's SparkContext talks to the Cluster Manager, which assigns work to Worker Nodes; each Worker Node runs an Executor with a Cache and Tasks, co-located with the HDFS DataNodes
  34. 34. Spark Programming Model 34 The user (developer) writes a Driver Program: sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.filter(…); rDD.cache; rDD.count; rDD.map. RDD (Resilient Distributed Dataset): • Immutable data structure • In-memory (explicitly) • Fault tolerant • Parallel data structure • Controlled partitioning to optimize data placement • Can be manipulated using a rich set of operators
  35. 35. RDD • Programming Interface: the programmer can perform 3 types of operations 35 Transformations: • Create a new dataset from an existing one • Lazy in nature: they are executed only when some action is performed • Examples: map(func), filter(func), distinct() Actions: • Return a value to the driver program or export data to a storage system after performing a computation • Examples: count(), reduce(func), collect(), take() Persistence: • Caches datasets in memory for future operations • Option to store on disk, in RAM, or mixed (Storage Level) • Examples: persist(), cache()
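
A short spark-shell sketch (sc is the shell's SparkContext; the HDFS path is an invented example) shows the three kinds of operations and the laziness of transformations:

  // Transformations are lazy: nothing is computed until an action runs.
  val logs   = sc.textFile("hdfs:///data/events.log")        // hypothetical input path
  val errors = logs.filter(_.contains("ERROR"))              // transformation (lazy)
  val words  = errors.flatMap(_.split("\\s+")).distinct()    // transformations (lazy)

  words.persist()               // persistence: keep the RDD in memory once computed
  val total  = words.count()    // action: triggers the whole pipeline
  val sample = words.take(10)   // action: now served from the cached RDD
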
  36. 36. How Spark works • RDD: parallel collection with partitions • User applications create RDDs, transform them, and run actions • This results in a DAG (Directed Acyclic Graph) of operators • The DAG is compiled into stages • Each stage is executed as a series of Tasks (one Task for each Partition) 36
  37. 37. Example 37 sc.textFile("/wiki/pagecounts") RDD[String] textFile
  38. 38. Example 38 sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) RDD[String] textFile map RDD[Array[String]]
  39. 39. Example 39 sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) RDD[String] textFile map RDD[Array[String]] RDD[(String, Int)] map
  40. 40. Example 40 sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) .reduceByKey(_ + _) RDD[String] textFile map RDD[Array[String]] RDD[(String, Int)] map RDD[(String, Int)] reduceByKey
  41. 41. Example 41 sc.textFile("/wiki/pagecounts") .map(line => line.split("\t")) .map(r => (r(0), r(1).toInt)) .reduceByKey(_ + _, 3) .collect() RDD[String] RDD[Array[String]] RDD[(String, Int)] RDD[(String, Int)] reduceByKey Array[(String, Int)] collect
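
Outside the shell, the same pipeline can be packaged as a small standalone application. The sketch below is one plausible layout (the input path and the choice of 3 reduce partitions simply mirror the example above):

  import org.apache.spark.{SparkConf, SparkContext}

  // Standalone version of the pagecounts example.
  object PageCounts {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("PageCounts")
      val sc   = new SparkContext(conf)

      val counts = sc.textFile("/wiki/pagecounts")
        .map(_.split("\t"))
        .map(r => (r(0), r(1).toInt))
        .reduceByKey(_ + _, 3)
        .collect()                       // Array[(String, Int)] back in the driver

      counts.take(10).foreach(println)
      sc.stop()
    }
  }
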
  42. 42. Execution Plan Stages are sequences of RDD operations that don't have a shuffle in between 42 textFile map map reduceByKey collect Stage 1 Stage 2
  43. 43. Execution Plan 43 textFile map map reduceByKey collect Stage 1: 1. Read HDFS split 2. Apply both the maps 3. Start partial reduce 4. Write shuffle data Stage 2: 1. Read shuffle data 2. Final reduce 3. Send result to driver program
  44. 44. Stage Execution • Create a task for each Partition in the new RDD • Serialize the Tasks • Schedule and ship Tasks to Slaves All of this happens internally (you don't need to do anything) 44
  45. 45. Spark Executor (Slaves) 45 (Diagram: each executor core repeatedly runs Fetch Input, Execute Task, Write Output; Core 1, Core 2 and Core 3 work in parallel)
  46. 46. Summary of Components • Task: the fundamental unit of execution in Spark • Stage: a set of Tasks that run in parallel • DAG: logical graph of RDD operations • RDD: parallel dataset with partitions 46
  47. 47. Start the docker container From • https://github.com/sequenceiq/docker-spark docker pull sequenceiq/spark:1.3.0 docker run -i -t -h sandbox sequenceiq/spark:1.3.0-ubuntu /etc/bootstrap.sh bash • Run the spark shell using yarn or local spark-shell --master yarn-client --driver-memory 1g --executor-memory 1g --executor-cores 2 47
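
Once the shell is up, a quick sanity check (a trivial computation, nothing specific to this Docker image) confirms that the SparkContext works:

  // Inside spark-shell: sc is already created for you.
  val numbers = sc.parallelize(1 to 1000)
  numbers.filter(_ % 2 == 0).count()   // should return 500
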
  48. 48. Separate Container Master/Worker $ docker pull snufkin/spark-master $ docker pull snufkin/spark-worker •These images are based on snufkin/spark-base $ docker run … master $ docker run … worker 48
  49. 49. Running the example and Shell • To run an example: $ run-example SparkPi 10 • We can start a spark shell via – spark-shell --master local[n] • The --master option specifies the master URL for a distributed cluster • Example applications are also provided in Python: – spark-submit examples/src/main/python/pi.py 10 49
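
SparkPi estimates Pi by Monte Carlo sampling; a rough Scala sketch of the same idea (not the exact code of the bundled example), typed into the shell, looks like this:

  // Throw random points into the unit square and count how many fall
  // inside the quarter circle; 4 * inside / total approximates Pi.
  val n = 100000 * 10   // 10 "slices", as in: run-example SparkPi 10
  val inside = sc.parallelize(1 to n).filter { _ =>
    val x = Math.random()
    val y = Math.random()
    x * x + y * y <= 1.0
  }.count()
  println(s"Pi is roughly ${4.0 * inside / n}")
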
  50. 50. Scala Base Course - Start 50
  51. 51. Scala vs Java vs Python • Spark was originally written in Scala, which allows concise function syntax and interactive use • Java API added for standalone applications • Python API added more recently along with an interactive shell. 51
  52. 52. Why Scala? • High-level language for the JVM – Object-oriented + functional programming • Statically typed – Type inference • Interoperates with Java – Can use any Java class – Can be called from Java code 52
  53. 53. Quick Tour of Scala 53
  54. 54. Laziness • For variables we can define a lazy val, which is evaluated only when it is first accessed lazy val x = 10 * 10 * 10 * 10 // long computation • For methods we can choose call by value or call by name for the parameters def square(x: Double) // call by value def square(x: => Double) // call by name • This changes when (and how often) the parameters are evaluated 54
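
A small example (the expensive() helper is just for illustration) makes the difference visible: a by-value parameter is evaluated once before the call, a by-name parameter is re-evaluated at every use:

  def expensive(): Int = { println("computing..."); 21 }

  def twiceByValue(x: Int): Int   = x + x   // x already evaluated: one "computing..."
  def twiceByName(x: => Int): Int = x + x   // x evaluated at each use: two "computing..."

  twiceByValue(expensive())   // prints "computing..." once, returns 42
  twiceByName(expensive())    // prints "computing..." twice, returns 42
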
  55. 55. Anonymous functions 55 scala> val square = (x: Int) => x * x square: Int => Int = <function1> We define an anonymous function from Int to Int. Here square is a val of type Function1[Int, Int], which is equivalent to scala> def square(x: Int) = x * x square: (x: Int)Int
  56. 56. Anonymous Functions (x: Int) => x * x This is syntactic sugar for new Function1[Int, Int] { def apply(x: Int): Int = x * x } 56
  57. 57. Currying Converting a function with multiple arguments into a function with a single argument that returns another function. def gen(f: Int => Int)(x: Int) = f(x) def identity(x: Int) = gen(i => i)(x) def square(x: Int) = gen(i => i * i)(x) def cube(x: Int) = gen(i => i * i * i)(x) 57
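
Because gen is curried, it can also be applied to its first argument list only, returning a new function (a minimal sketch repeating the definition above):

  def gen(f: Int => Int)(x: Int) = f(x)

  // Partial application: supply only the first argument list
  val square = gen(i => i * i) _        // Int => Int
  val cube   = gen(i => i * i * i) _    // Int => Int

  square(4)   // 16
  cube(3)     // 27
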
  58. 58. Anonymous Functions //Explicit type declaration val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y) //The compiler expects 2 ints so x and y types are inferred val call2 = doWithOneAndTwo((x, y) => x + y) //Even more concise syntax val call3 = doWithOneAndTwo(_ + _) 58
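
The slide assumes a helper doWithOneAndTwo that is not shown; a plausible definition (an assumption, chosen so that the three calls above compile and return 3) is:

  // Hypothetical helper: applies the given two-argument function to 1 and 2.
  def doWithOneAndTwo(f: (Int, Int) => Int): Int = f(1, 2)

  val call1 = doWithOneAndTwo((x: Int, y: Int) => x + y)   // 3
  val call2 = doWithOneAndTwo((x, y) => x + y)             // 3
  val call3 = doWithOneAndTwo(_ + _)                       // 3
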
  59. 59. Returning multiple variables def swap(x:String, y:String) = (y, x) val (a,b) = swap("hello","world") println(a, b) 59
  60. 60. Higher-Order Functions Methods that take functions as parameters val list = (1 to 4).toList list.foreach( x => println(x)) list.foreach(println) list.map(x => x + 2) list.map(_ + 2) list.filter(x => x % 2 == 1) list.filter(_ % 2 == 1) list.reduce((x,y) => x + y) list.reduce(_ + _) 60
  61. 61. Function Methods on Collections 61 http://www.scala-lang.org/api/2.11.6/index.html#scala.collection.Seq
  62. 62. Scala Base Course - End http://scalatutorials.com/ 62
  63. 63. Next Topics • Spark Shell – Scala – Python • Shark Shell • Data Frames • Spark Streaming • Code Examples: Processing and Machine Learning 63
