Spark 101 - First steps to distributed computing

A lecture on the basics of Apache Spark and distributed computing, and on the development tools needed to build a functional environment.

  1. Demi Ben-Ari 10/2015
  2. About me Demi Ben-Ari • Senior Software Engineer at Windward Ltd. • B.Sc. in Computer Science – Academic College of Tel Aviv-Yaffo • In the past: Software Team Leader & Senior Java Software Engineer, missile defense and alert system – “Ofek” unit, IAF
  3. Agenda • What is Spark? • Spark Infrastructure and Basics • Spark Features and Suite • Development with Spark • Conclusion
  4. What is Spark? A fast and expressive cluster computing engine, compatible with Apache Hadoop • Efficient: general execution graphs, in-memory storage • Usable: rich APIs in Java, Scala, Python; interactive shell
  5. What is Spark? • Apache Spark is a general-purpose cluster computing framework • Spark does computation in memory & on disk • Apache Spark has low-level and high-level APIs
  6. About the Spark project • Spark was founded at UC Berkeley, and the main contributor is Databricks • Interactive Spark shells in Scala and Python ◦ (spark-shell, pyspark) • Currently stable at version 1.5
  7. Spark Philosophy • Make life easy and productive for data scientists • Well-documented, expressive APIs • Powerful domain-specific libraries • Easy integration with storage systems • … and caching to avoid data movement • Predictable releases, stable APIs • Stable release every 3 months
  8. Unified Tools Platform
  9. Unified Tools Platform • Spark Core at the base, with Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX on top
  10. Spark Core Features • Distributed in-memory computation • Standalone and local modes • History server for the Spark UI • Resource management integration • Unified job submission tool
  11. Spark Contributors • Highly active open source community (09/2015) ◦ https://github.com/apache/spark/ • https://www.openhub.net/p/apache-spark
  12. Spark Petabyte Sort
  13. Basic Terms • Cluster (Master, Slaves) • Driver • Executors • Spark Context • RDD – Resilient Distributed Dataset (see the sketch below)
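A minimal sketch tying these terms together (Scala, Spark 1.x API; the app name, master URL, and partition count are placeholder values):

    import org.apache.spark.{SparkConf, SparkContext}

    // The driver program configures and creates the SparkContext,
    // which is its entry point to the cluster.
    val conf = new SparkConf()
      .setAppName("spark-101-demo")   // shown in the Spark UI
      .setMaster("local[4]")          // local mode: 4 worker threads, no cluster needed
    val sc = new SparkContext(conf)

    // An RDD: an immutable, partitioned collection whose pieces live on the executors.
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)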
  14. Resilient Distributed Datasets
  15. Resilient Distributed Datasets
  16. Spark execution engine • Spark uses lazy evaluation • It runs the code only when it encounters an action operation • There is no need to design and write a single complex map-reduce job • In Spark we can write smaller, more manageable operations (see the sketch below) ◦ Spark will group operations together
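A small sketch of lazy evaluation (Scala; the file path and filter predicate are placeholders):

    // Transformations only build the lineage graph - no work happens yet.
    val lines  = sc.textFile("data/input.txt")
    val errors = lines.filter(_.contains("ERROR"))
    val firsts = errors.map(_.take(80))

    // The action triggers the whole grouped chain in a single pass.
    val count = firsts.count()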
  17. Spark execution engine • Serializes your code to the executors ◦ You can choose your serialization method (Java serialization, Kryo) • In Java, functions are specified as objects that implement one of Spark’s Function interfaces ◦ The same style of implementation can be used in Scala and Python as well
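A sketch of switching to Kryo serialization (the configuration keys are real Spark settings; MyRecord is a hypothetical application class):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, name: String)   // hypothetical class shipped to executors

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))  // registering avoids writing full class names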
  18. Spark Execution - UI
  19. Persistence layers for Spark • Distributed systems ◦ Hadoop (HDFS) ◦ Local file system ◦ Amazon S3 ◦ Cassandra ◦ Hive ◦ HBase • File formats ◦ Text file (CSV, TSV, plain text) ◦ Sequence file ◦ Avro ◦ Parquet
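For example, reading and writing a couple of these layers and formats (a sketch; all paths are placeholders, and the s3n:// scheme matches the Spark 1.x era):

    val logs   = sc.textFile("hdfs://namenode:8020/logs/*.log")   // plain text from HDFS
    val events = sc.textFile("s3n://my-bucket/events/2015/")      // plain text from S3

    val pairs = logs.map(line => (line.length, line))
    pairs.saveAsSequenceFile("hdfs://namenode:8020/out/seq")      // (key, value) sequence file
    logs.saveAsTextFile("hdfs://namenode:8020/out/text")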
  20. History Server • Can be run on all Spark deployments ◦ Standalone, YARN, Mesos • Integrates with both YARN and Mesos • On YARN / Mesos, run the history server as a daemon
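A sketch of wiring it up (the configuration keys are real Spark settings; the log directory is a placeholder):

    # conf/spark-defaults.conf of every application:
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs://namenode:8020/spark-events

    # on the history-server host, point it at the same directory and start the daemon:
    spark.history.fs.logDirectory    hdfs://namenode:8020/spark-events
    ./sbin/start-history-server.sh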
  21. Job Submission Tool • ./bin/spark-submit --class my.main.Class --name myAppName --master local[4] <app-jar> ◦ For a cluster, swap the master, e.g. --master spark://some-cluster ◦ Note that options must come before the application jar
  22. Multi Language API Support • Scala • Java • Python • Clojure
  23. Spark Shell • YouTube – Word Count Example (live demo; a sketch of the same example follows below)
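The classic word count, as typically typed into spark-shell, where sc already exists (the input path is a placeholder):

    val counts = sc.textFile("hdfs://namenode:8020/books/input.txt")
      .flatMap(line => line.split("\\s+"))   // split each line into words
      .map(word => (word, 1))                // pair every word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word

    counts.take(10).foreach(println)         // action: runs the job and prints a sample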
  24. Cassandra & Spark • Cassandra cluster ◦ Bare metal vs. on the cloud • DSE – DataStax Enterprise ◦ Cassandra & Spark on each node • vs. ◦ Separate Cassandra and Spark clusters
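A sketch of reading a table through the DataStax spark-cassandra-connector (the import and cassandraTable call are the connector's API; the host, keyspace, and table names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // adds cassandraTable to SparkContext

    val conf = new SparkConf()
      .setAppName("cassandra-demo")
      .set("spark.cassandra.connection.host", "10.0.0.1")

    val sc   = new SparkContext(conf)
    val rows = sc.cassandraTable("my_keyspace", "ship_positions")
    println(rows.count())                    // action: counts rows across the cluster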
  25. Development with Spark
  26. Where do I start from?! • Download Spark as a package ◦ Run it in “local” mode (no need for a real cluster) ◦ “spark-ec2” scripts to spin up a Standalone-mode cluster ◦ Amazon Elastic MapReduce (EMR) • YARN vs. Mesos vs. Standalone
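The quickest start looks something like this (a sketch; the exact package name depends on the Hadoop build you pick, and 1.5.0 matches the stable version mentioned earlier):

    wget http://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
    tar -xzf spark-1.5.0-bin-hadoop2.6.tgz && cd spark-1.5.0-bin-hadoop2.6
    ./bin/spark-shell --master local[4]    # local mode: 4 threads, no cluster required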
  27. Running Environments • Development – Testing – Production ◦ Don’t you need more? ◦ Be as flexible as you can • Cluster utilization ◦ Unified cluster for all environments • vs. ◦ Cluster per environment (cluster per data center) • Configuration ◦ Local files vs. distributed
  28. Saving and Maintaining the Data • Local file system ◦ Not effective in a distributed environment • HDFS ◦ Might be very expensive ◦ Locality rules – Spark + HDFS node on the same machine • S3 ◦ High latency and pretty slow, but low cost • Cassandra ◦ Rigid data model ◦ Very fast, and depending on the volume of the data can be…
  29. DevOps – Keep It Simple, Stupid • Linux ◦ Bash scripts ◦ Crontab • Automation via Jenkins • Continuous deployment – with every Git push • Pipeline: Dev → Testing → Live Staging → Production (automatic promotion through staging; a manual, daily push to production)
  30. Build Automation • Maven ◦ Sonatype Nexus artifact management • Deployment and script-generation scripts ◦ Per-environment testing ◦ Data validation ◦ Scheduled tasks
  31. Workflow Management • Oozie – very hard to integrate with Spark ◦ XML-configuration-based and not that convenient • Azkaban (haven’t tried it) • Chosen: ◦ Luigi ◦ Crontab + Jenkins (KISS again)
  32. Testing • Across the pipeline: Dev → Testing → Live Staging → Production
  33. Testing • Unit ◦ JUnit tests that run on the Spark “Functions” (see the sketch below) • End to end ◦ Simulate the full execution of an application on a single JVM (local mode) – real input, real output • Functional ◦ Standalone application ◦ Running on the cluster ◦ Minimal coverage – shows a working data flow
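A minimal sketch of such a unit test, running Spark in local mode inside JUnit (the class and test names are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.junit.{After, Before, Test}
    import org.junit.Assert.assertEquals

    class WordCountTest {
      private var sc: SparkContext = _

      @Before def setUp(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
      }

      @After def tearDown(): Unit = sc.stop()   // always release the local context

      @Test def countsWords(): Unit = {
        val counts = sc.parallelize(Seq("a b", "a"))
          .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
          .collectAsMap()
        assertEquals(2L, counts("a").toLong)    // "a" appears twice in the input
      }
    }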
  34. Logging • Uses log4j by default (via slf4j) • How to log correctly: ◦ Separate logs for different applications ◦ Driver and executors log to different locations ◦ YARN logging also exists (you might find problems there too) • ELK stack (Logstash – Elasticsearch – Kibana) ◦ Ship via Logstash shippers (intrusive) or a UDP socket appender (Log4j2) ◦ DO NOT use the regular TCP Log4j appender
  35. Reporting and Monitoring • Graphite ◦ Online application metrics • Grafana ◦ Good Graphite visualization • Jenkins – monitoring ◦ Scheduled tests ◦ Validate the applications’ result sets ◦ Hung or stuck applications ◦ Failed applications
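Spark can push its metrics straight into Graphite through its metrics system; a sketch of conf/metrics.properties (the sink class and keys are real Spark settings; the host and prefix are placeholders):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.internal
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark.myapp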
  36. Reporting and Monitoring • Grafana + Graphite – example
  37. Summary • Diagram: a single cluster serving the Dev, Testing, Live Staging, and Production environments, with an ELK stack alongside
  38. Data Flow • External Data Sources → Analytics Layers → Data Output
  39. Conclusion • Spark is a popular and very powerful distributed in-memory computation framework • Broadly used, with lots of contributors • A leading tool for the new world of petabytes of unexplored data
  40. Questions?
  41. Thanks, Resources and Contact • Demi Ben-Ari ◦ LinkedIn ◦ Twitter: @demibenari ◦ Blog: http://progexc.blogspot.com/ ◦ Email: demi.benari@gmail.com ◦ “Big Things” Community • Meetup, YouTube, Facebook, Twitter
