
Introduction to Apache Spark

First given on the 29th of March, 2016.



  1. INTRODUCTION TO APACHE SPARK BY SAMY DINDANE
  2. OUTLINE History of "Big Data" engines Apache Spark: What is it and what's special about it? Apache Spark: What is it used for? Apache Spark: API Tools and software usually used with Apache Spark Demo
  3. HISTORY OF "BIG DATA" ENGINES
  4. BATCH VS STREAMING
  5. HISTORY OF "BIG DATA" ENGINES 2011 - Hadoop MapReduce: Batch, on-disk processing 2011 - Apache Storm: Realtime 2014 - Apache Tez 2014 - Apache Spark: Batch and near-realtime, in-memory processing 2015 - Apache Flink: Realtime, in-memory processing
  6. APACHE SPARK: WHAT IS IT AND WHAT'S SPECIAL ABOUT IT?
  7. WHY SPARK? Most machine learning algorithms are iterative; each iteration can improve the results. With a disk-based approach, each iteration's output is written to disk, making the processing slow.
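     Why that matters in code: a minimal sketch of an iterative loop over a cached RDD, assuming the Spark 1.6-era Scala API; the input path and the toy update rule are made up for illustration.
         import org.apache.spark.{SparkConf, SparkContext}

         object IterativeSketch {
           def main(args: Array[String]) {
             val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))
             // Hypothetical input: one number per line; cache() keeps the parsed RDD in memory.
             val values = sc.textFile("/tmp/numbers.txt").map(_.toDouble).cache()
             var estimate = values.first()
             for (i <- 1 to 10) {
               val current = estimate
               // Each pass reads the cached RDD from RAM instead of re-reading the file from disk.
               estimate = values.map(v => (v + current) / 2).mean()
             }
             println(estimate)
             sc.stop()
           }
         }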
  8. HADOOP MAPREDUCE EXECUTION FLOW / SPARK EXECUTION FLOW (diagrams)
  9. Spark is a distributed data processing engine Started in 2009 Open source & written in Scala Compatible with Hadoop's data
  10. It runs in memory and on disk Runs 10 to 100 times faster than Hadoop MapReduce Can be written in Java, Scala, Python & R Supports batch and near-realtime workflows (micro-batches)
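     A short sketch of how that memory-and-disk behavior is exposed through RDD storage levels (the dataset path is a placeholder, and sc is the SparkContext provided by a shell session):
         import org.apache.spark.storage.StorageLevel

         val events = sc.textFile("/tmp/events.log")
         // MEMORY_AND_DISK keeps partitions in RAM and spills to disk when memory runs out.
         events.persist(StorageLevel.MEMORY_AND_DISK)
         events.count()   // the first action materializes and caches the RDD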
  11. Spark has four modules: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
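     Each module ships as its own artifact; a hedged build sketch in sbt (the 1.6.1 version is an assumption, chosen to match the Spark docs linked in the sources):
         libraryDependencies ++= Seq(
           "org.apache.spark" %% "spark-core"      % "1.6.1",
           "org.apache.spark" %% "spark-sql"       % "1.6.1",  // Spark SQL / DataFrames
           "org.apache.spark" %% "spark-streaming" % "1.6.1",  // micro-batch streaming
           "org.apache.spark" %% "spark-mllib"     % "1.6.1",  // machine learning
           "org.apache.spark" %% "spark-graphx"    % "1.6.1"   // graph processing
         )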
  12. APACHE SPARK: WHAT IS IT USED FOR?
  13. CAPTURE AND EXTRACT DATA Data can come from several sources: Databases Flat files Web and mobile applications' logs Data feeds from social media IoT devices
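     A hedged sketch of ingesting two of those sources with the Spark 1.6-era API (paths are placeholders; sc is a SparkContext, e.g. the one provided by spark-shell):
         import org.apache.spark.sql.SQLContext

         val sqlContext = new SQLContext(sc)
         // Flat files: one record per line.
         val rawLogs = sc.textFile("/data/logs/*.log")
         // Application logs exported as JSON, read with schema inference.
         val appEvents = sqlContext.read.json("/data/app-events/*.json")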
  14. TRANSFORM DATA Data in an analytics pipeline needs transformation: Check and correct quality issues Handle missing values Cast fields into specific data types Compute derived fields Split or merge records for more granularity Join with other datasets Restructure data
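     A sketch of a few of those steps with the DataFrame API, continuing from the appEvents DataFrame above (the column names and the reference dataset are invented for illustration):
         import org.apache.spark.sql.functions._

         val countries = sqlContext.read.json("/data/ref/countries.json")   // hypothetical lookup dataset
         val cleaned = appEvents
           .na.fill(Map("country" -> "unknown"))                  // handle missing values
           .withColumn("amount", col("amount").cast("double"))    // cast a field into a specific type
           .withColumn("year", year(col("timestamp")))            // compute a derived field
           .join(countries, Seq("country"))                       // join with another dataset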
  15. STORE DATA Data can then be stored in several ways: As self-describing files (Parquet, JSON, XML) SQL databases Search databases (Elasticsearch, Solr) Key-value stores (HBase, Cassandra)
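     Continuing the sketch, persisting the result in two of those formats (output paths are placeholders):
         cleaned.write.parquet("/data/warehouse/events.parquet")   // self-describing columnar files
         cleaned.write.json("/data/export/events.json")            // plain JSON, one object per line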
  16. QUERY, ANALYZE, VISUALIZE With Spark Shell, notebooks, Kibana, etc.
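     A sketch of the query step from a shell or notebook, using a temporary table (table and column names are illustrative, not from the slides):
         cleaned.registerTempTable("events")
         val topCountries = sqlContext.sql(
           "SELECT country, COUNT(*) AS n FROM events GROUP BY country ORDER BY n DESC LIMIT 10")
         topCountries.show()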
  17. APACHE SPARK: API
  18. EXECUTION FLOW (diagram)
  19. RESILIENT DISTRIBUTED DATASETS RDDs are the fundamental data unit in Spark Resilient: If data in memory is lost, it can be recreated Distributed: Stored in memory across the cluster Dataset: The initial data can come from a file or be created programmatically
  20. RDDs Immutable and partitioned collections of elements Basic operations: map, filter, reduce, persist Several implementations: PairRDD, DoubleRDD, SequenceFileRDD
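     A sketch of both ways of building an RDD and of the basic operations listed above (the file path is a placeholder):
         val fromFile = sc.textFile("/tmp/words.txt")         // from a file...
         val fromMemory = sc.parallelize(Seq(1, 2, 3, 4, 5))  // ...or created programmatically

         val doubled = fromMemory.map(_ * 2)      // transformation
         val evens = doubled.filter(_ % 4 == 0)   // transformation
         val total = doubled.reduce(_ + _)        // action: 30
         doubled.persist()                        // keep the RDD around for reuse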
  21. HISTORY 2011 (Spark release) - RDD API 2013 - Introduction of the DataFrame API: Adds the concept of a schema and allows Spark to manage it for more efficient serialization and deserialization 2015 - Introduction of the Dataset API
  22. OPERATIONS ON RDDs Transformations Actions
  23. TRANSFORMATIONS Create a new dataset from an RDD, like filter, map, reduceByKey
  24. ACTIONS Return a value to the driver program after running a computation on the dataset, like collect, count, reduce
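     A small sketch of the distinction: transformations only describe a new RDD, while actions run the computation and bring a plain value back to the driver.
         val numbers = sc.parallelize(1 to 100)
         // Transformation: nothing is computed yet, Spark only records the lineage.
         val squares = numbers.map(n => n * n)
         // Actions: these trigger the computation and return values to the driver.
         val howMany = squares.count()   // 100
         val biggest = squares.max()     // 10000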
  25. EXAMPLE OF MAP AND FILTER TRANSFORMATIONS
  26. EXAMPLE OF MAP AND FILTER TRANSFORMATIONS
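     The two slides above show these examples as diagrams; a plain-code sketch of the same idea (the sample data is invented here):
         val words = sc.parallelize(Seq("spark", "hadoop", "flink", "storm"))
         val lengths = words.map(w => w.length)             // map: produces RDD(5, 6, 5, 5)
         val sWords = words.filter(w => w.startsWith("s"))  // filter: keeps "spark" and "storm"
         sWords.collect()                                   // Array(spark, storm)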
  27. HOW TO RUN SPARK PROGRAMS? Inside the Spark Shell Using a notebook As a Spark application By submitting a Spark application to spark-submit
  28. INSIDE SPARK SHELL Run ./bin/spark-shell
      val textFile = sc.textFile("README.md")
      val lines = textFile.filter(line => line contains "Spark")
      lines.collect()
  29. USING A NOTEBOOK There are many Spark notebooks; we are going to use http://spark-notebook.io/ Run spark-notebook, then open http://localhost:9000/
  30. AS A SPARK APPLICATION By adding spark-core and other Spark modules as project dependencies and using the Spark API inside the application code:
      import org.apache.spark.{SparkConf, SparkContext}

      def main(args: Array[String]) {
          val conf = new SparkConf()
              .setAppName("Sample Application")
              .setMaster("local")
          val sc = new SparkContext(conf)
          val logData = sc.textFile("/tmp/spark/README.md")
          val lines = logData.filter(line => line contains "Spark")
          lines.collect()
          sc.stop()
      }
  31. BY SUBMITTING A SPARK APPLICATION TO SPARK-SUBMIT
      ./bin/spark-submit \
        --class <main-class> \
        --master <master-url> \
        --deploy-mode <deploy-mode> \
        --conf <key>=<value> \
        ... # other options
        <application-jar> \
        [application-arguments]
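     For instance, the sample application from slide 30 might be submitted like this (the class name, jar path, and master URL are made-up placeholders):
         ./bin/spark-submit \
           --class SampleApplication \
           --master local[4] \
           /path/to/sample-application.jar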
  32. TERMINOLOGY SparkContext: A connection to a Spark cluster Worker node: Node that runs the program in a cluster Task: A unit of work Job: Consists of multiple tasks Executor: Process on a worker node that runs the tasks
  33. TOOLS AND SOFTWARE USUALLY USED WITH APACHE SPARK
  34. HDFS: HADOOP DISTRIBUTED FILE SYSTEM
  35. Simple: Uses many servers as one big computer Reliable: Detects failures, has redundant storage Fault-tolerant: Auto-retry, self-healing Scalable: Scales (almost) linearly with disks and CPUs
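     Spark reads HDFS data through the same textFile API used for local files; a quick sketch (the namenode host, port, and path are placeholders):
         val hdfsLines = sc.textFile("hdfs://namenode:8020/data/README.md")
         hdfsLines.count()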
  36. APACHE KAFKA A DISTRIBUTED AND REPLICATED MESSAGING SYSTEM
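     A hedged sketch of how Spark's micro-batches are commonly fed from Kafka via the spark-streaming-kafka integration (the broker address and topic name are assumptions):
         import kafka.serializer.StringDecoder
         import org.apache.spark.streaming.{Seconds, StreamingContext}
         import org.apache.spark.streaming.kafka.KafkaUtils

         val ssc = new StreamingContext(sc, Seconds(5))   // 5-second micro-batches
         val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
           ssc,
           Map("metadata.broker.list" -> "localhost:9092"),   // Kafka brokers (placeholder)
           Set("events"))                                     // topic name (placeholder)
         stream.map(_._2).count().print()   // number of messages in each batch
         ssc.start()
         ssc.awaitTermination()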
  37. APACHE ZOOKEEPER ZOOKEEPER IS A DISTRIBUTED, OPEN-SOURCE COORDINATION SERVICE FOR DISTRIBUTED APPLICATIONS
  38. Coordination: Needed when multiple nodes need to work together Examples: Group membership Locking Leader election Synchronization Publisher/subscriber
  39. APACHE MESOS Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
  40. A cluster manager that: Runs distributed applications Abstracts CPU, memory, storage, and other resources Handles resource allocation Handles application isolation Has a web UI for viewing the cluster's state
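     In practice this means a Spark application can simply be pointed at a Mesos master; a hedged example (host and port are placeholders, 5050 being the usual Mesos default):
         ./bin/spark-submit \
           --class SampleApplication \
           --master mesos://mesos-master.example.com:5050 \
           /path/to/sample-application.jar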
  41. NOTEBOOKS Spark Notebook: Allows performing reproducible analysis with Scala, Apache Spark and more Apache Zeppelin: A web-based notebook that enables interactive data analytics
  42. THE END APACHE SPARK Is a fast distributed data processing engine Runs in memory Can be used with Java, Scala, Python & R Its main data structure is the Resilient Distributed Dataset
  43. SOURCES
      http://www.slideshare.net/jhols1/kafka-atlmeetuppublicv2?qid=8627acbf-f89d-4ada-
      http://www.slideshare.net/Clogeny/an-introduction-to-zookeeper?qid=ac974e3b-c935
      http://www.slideshare.net/rahuldausa/introduction-to-apache-spark-39638645?qid=4
      http://www.slideshare.net/junjun1/apache-spark-its-place-within-a-big-data-stack
      http://www.slideshare.net/cloudera/spark-devwebinarslides-final?qid=4cd97031-912
      http://www.slideshare.net/pacoid/aus-mesos
      https://spark.apache.org/docs/latest/submitting-applications.html
      https://spark.apache.org/docs/1.6.1/quick-start.html
