2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
4. Agenda
What is Spark?
Spark Infrastructure and Basics
Spark Features and Suite
Development with Spark
Conclusion
5. What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop
Efficient:
◦ General execution graphs
◦ In-memory storage
Usable:
◦ Rich APIs in Java, Scala, Python
◦ Interactive shell
6. What is Spark?
Apache Spark is a general-purpose cluster computing framework
Spark does computation in memory & on disk
Apache Spark has low-level and high-level APIs
7. About Spark project
Spark was created at UC Berkeley, and its main contributor is “Databricks”
Interactive shells for Spark in Scala and Python
◦ (spark-shell, pyspark)
Currently stable at version 1.5
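Both shells can be launched straight from a Spark download, no cluster required; a minimal sketch (the `local[2]` master setting is an illustrative choice, not from the slides):

```shell
# Scala shell, using 2 local worker threads
./bin/spark-shell --master "local[2]"

# Python shell
./bin/pyspark --master "local[2]"
```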
8. Spark Philosophy
Make life easy and productive for data scientists
Well-documented, expressive APIs
Powerful domain-specific libraries
Easy integration with storage systems
… and caching to avoid data movement
Predictable releases, stable APIs
◦ A stable release every 3 months
11. Spark Core Features
Distributed in-memory computation
Standalone and local modes
History server for the Spark UI
Resource-manager integration
Unified job submission tool
13. Spark Contributors
Highly active open source community
(09/2015)
◦ https://github.com/apache/spark/
https://www.openhub.net/p/apache-spark
18. Spark execution engine
Spark uses lazy evaluation
◦ Code runs only when it encounters an action operation
There is no need to design and write a single complex map-reduce job
In Spark we can write smaller, more manageable operations
◦ Spark will group operations together
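A minimal spark-shell sketch of this behavior (the input path is hypothetical; `sc` is the shell’s predefined SparkContext):

```scala
// Transformations only build the execution graph – nothing runs yet
val lines  = sc.textFile("hdfs:///logs/events.log")
val errors = lines.filter(line => line.contains("ERROR"))

// The action is what triggers the actual (grouped) computation
val numErrors = errors.count()
```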
19. Spark execution engine
Spark serializes your code and ships it to the executors
◦ You can choose the serialization method
(Java serialization, Kryo)
In Java, functions are specified as objects that implement one of Spark’s Function interfaces
◦ The same method of implementation works in Scala and Python as well
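As a rough sketch, switching to Kryo is a single configuration setting, and the closure passed to a transformation is what gets serialized out to the executors (the app name is illustrative; in Java, the equivalent closure would implement one of the `org.apache.spark.api.java.function` interfaces):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Opt in to Kryo instead of the default Java serialization
val conf = new SparkConf()
  .setAppName("serialization-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// The closure passed to map() is serialized and shipped to the executors
val doubled = sc.parallelize(1 to 4).map(_ * 2).collect()
```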
21. Persistence layers for Spark
Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
File formats
◦ Text file
CSV, TSV, Plain Text
◦ Sequence File
◦ Avro
◦ Parquet
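A spark-shell sketch of reading a few of these formats (paths are hypothetical; the Parquet line assumes the Spark 1.x `sqlContext`):

```scala
// Plain text – also covers CSV/TSV after splitting each line
val rows = sc.textFile("data/input.csv").map(_.split(","))

// SequenceFile of (Text, IntWritable) key/value pairs
val counts = sc.sequenceFile[String, Int]("data/counts.seq")

// Parquet, through the SQL context
val events = sqlContext.read.parquet("data/events.parquet")
```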
24. History Server
Can be run on all Spark deployments
◦ Standalone, YARN, Mesos
Integrates with both YARN and Mesos
In YARN / Mesos, run the history server as a daemon
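The configuration this typically involves is event logging to a shared directory that the history server replays; a sketch, with an illustrative HDFS path:

```
# conf/spark-defaults.conf – applications write replayable event logs
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```

The daemon itself is then started with `./sbin/start-history-server.sh`.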
28. Cassandra & Spark
Cassandra cluster
◦ Bare metal vs. on the cloud
DSE – DataStax Enterprise
◦ Cassandra & Spark on each node
vs.
◦ Separate Cassandra and Spark clusters
30. Where do I start from?!
Download Spark as a package
◦ Run it in “local” mode (no need for a real
cluster)
◦ “spark-ec2” scripts to ramp up a Standalone-mode
cluster
◦ Amazon Elastic MapReduce (EMR)
YARN vs. Mesos vs. Standalone
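A minimal local-mode ramp-up sketch, assuming the Spark 1.5 prebuilt package (download URL and Hadoop build are illustrative):

```shell
# Download and unpack a prebuilt package
wget https://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
tar xzf spark-1.5.0-bin-hadoop2.6.tgz && cd spark-1.5.0-bin-hadoop2.6

# Run a bundled example locally – no cluster required
./bin/spark-submit --master "local[2]" \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples-*.jar 10
```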
31. Running Environments
Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
Cluster Utilization
◦ Unified Cluster for all environments
Vs.
◦ Cluster per Environment
(Cluster per Data Center)
Configuration
◦ Local Files vs. Distributed
32. Saving and Maintaining the Data
Local file system – not effective in a distributed
environment
HDFS
◦ Might be very expensive
◦ Locality rules – Spark + HDFS node on the same machine
S3
◦ High latency and pretty slow, but low costs
Cassandra
◦ Rigid data model
◦ Very fast, though that depends on the volume of the data
33. DevOps – Keep It Simple, Stupid
Linux
◦ Bash scripts
◦ Crontab
Automation via Jenkins
Continuous Deployment – with every Git push
[Pipeline diagram: Dev Testing (automatic) → Live Staging (automatic) → Production (daily, manual)]
34. Build Automation
Maven
◦ Sonatype Nexus artifact management
◦ Deployment and script-generation scripts
◦ Per-environment testing
◦ Data validation
◦ Scheduled tasks
35. Workflow Management
Oozie – very hard to integrate with Spark
◦ XML-configuration based and not that convenient
Azkaban (Haven’t tried it)
Chosen:
◦ Luigi
◦ Crontab + Jenkins (KISS again)
37. Testing
Unit
◦ JUnit tests that run on the Spark “Functions”
End to End
◦ Simulate the full execution of an application on a
single JVM (local mode) – Real input, Real output
Functional
◦ Standalone application
◦ Running on the cluster
◦ Minimal coverage – Shows working data flow
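One way to sketch the unit and end-to-end levels is a test that spins up a local-mode SparkContext in a single JVM (names and data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A local-mode context gives a real Spark execution in one JVM
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("unit-test"))
try {
  // Exercise the same function the production job uses
  val result = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect().sorted
  assert(result.sameElements(Array(2, 4, 6)))
} finally {
  sc.stop() // always release the context, even on test failure
}
```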
38. Logging
Uses log4j by default (via slf4j)
How to log correctly:
◦ Separate logs for different applications
◦ Driver and executors log to different locations
◦ YARN logging also exists (you might find problems there too)
ELK stack (Logstash – Elasticsearch – Kibana)
◦ Via Logstash shippers (intrusive) or a UDP socket appender (Log4j2)
◦ DO NOT use the regular TCP Log4j appender
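A Log4j2 configuration sketch for the UDP socket appender option (host, port, and pattern are illustrative):

```xml
<!-- log4j2.xml: ship log events to a Logstash UDP input -->
<Configuration>
  <Appenders>
    <Socket name="Logstash" host="logstash.example.com" port="5000"
            protocol="UDP">
      <PatternLayout pattern="%d{ISO8601} %-5p [%t] %c - %m%n"/>
    </Socket>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Logstash"/>
    </Root>
  </Loggers>
</Configuration>
```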
39. Reporting and Monitoring
Graphite
◦ Online application metrics
Grafana
◦ Good Graphite visualization
Jenkins – monitoring
◦ Scheduled tests
◦ Validate the applications’ result sets
◦ Hung or stuck applications
◦ Failed applications
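Spark can push application metrics to Graphite itself through its built-in Graphite sink; a `conf/metrics.properties` sketch (the host is illustrative):

```
# Report all instances' metrics to Graphite every 10 seconds
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```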
43. Conclusion
Spark is a popular and very powerful
distributed in-memory computation framework
Broadly used, with lots of contributors
A leading tool for the new world of petabytes
of unexplored data