Glint with Apache Spark

Spark Overview, Cluster Architecture, Elements, Spark Stack, Spark Streaming

Meetup Details of my presentation:

http://www.meetup.com/lspe-in/events/212250542/
http://www.meetup.com/devops-bangalore/events/222155834/


Glint with Apache Spark

  1. 1. June 13
  2. 2. Agenda • Big Data Overview • Spark Overview • RDD Features • Spark Stack • Spark Streaming
  3. 3. BIG DATA OVERVIEW
  4. 4. Big Data -- Digital Data growth…
  5. 5. V-V-V
  6. 6. Use Cases
  7. 7. Real Time Feedback
  8. 8. Quick Recap: Hadoop Ecosystem
  9. 9. Legacy Architecture Pain Points • Report arrival latency is quite high: joins and aggregations take hours • Existing frameworks cannot do both: either stream processing of 100s of MB/s with low latency, or batch processing of TBs of data with high latency • Expressing business logic in Hadoop MR is challenging • Lack of interactive SQL
  10. 10. SPARK OVERVIEW
  11. 11. Why Spark • Separate, fast, MapReduce-like engine • In-memory data storage for very fast iterative queries • Better fault tolerance • Combines SQL, streaming, and complex analytics • Runs on Hadoop, Mesos, standalone, or in the cloud • Data sources: HDFS, Cassandra, HBase, and S3
  12. 12. Consumed Apps…
  13. 13. In Memory: Spark vs Hadoop • Improves efficiency over MapReduce: up to 100x in memory, 2-10x on disk • Up to 40x faster than Hadoop
  14. 14. Spark Stack
  15. 15. Spark Ecosystem (diagram labels) • Components: Streaming, SQL, GraphX, BlinkDB, MLlib, Tachyon • Inputs: RDBMS, Hadoop Input Format, Apps • Distributions: CDH, HDP, MapR, DSE
  16. 16. Benchmarking
  17. 17. RESILIENT DISTRIBUTED DATASETS (RDD)
  18. 18. Resilient Distributed Datasets (RDD) • Immutable + Distributed + Cacheable + Lazily evaluated • Distributed collections of objects • Can be cached in memory across cluster nodes • Manipulated through various parallel operations
  19. 19. QUICK DEMO
  20. 20. RDD Types
  21. 21. RDD Operation
  22. 22. Memory and Persistence
  23. 23. Dependency Types
  24. 24. Task Scheduler, DAG • Pipelines functions within a stage • Cache-aware data reuse & locality • Partitioning-aware to avoid shuffles • Example (pseudocode): rdd1.map(splitlines).filter("ERROR"); rdd2.map(splitlines).groupBy(key); rdd2.join(rdd1, key).take(10)
  25. 25. Fault Recovery & Checkpoints • Efficient fault recovery using lineage • Log one operation to apply to many elements (lineage) • Recompute lost partitions on failure • Checkpoint RDDs to prevent long lineage chains during fault recovery
  26. 26. SPARK CLUSTER
  27. 27. Cluster Support • Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications • Hadoop YARN – the resource manager in Hadoop 2
  28. 28. Spark Cluster Overview o Application o Driver program o Cluster manager o Worker node o Executor o Task o Job o Stage
  29. 29. Spark On Mesos
  30. 30. Spark on YARN
  31. 31. Job Flow
  32. 32. SPARK STACK DETAILS
  33. 33. Spark SQL • Seamlessly mix SQL queries with Spark programs • Load and query data from a variety of sources • Standard connectivity through JDBC or ODBC • Hive compatibility
  34. 34. Streaming • Scalable, high-throughput stream processing of live data • Integrates with many sources • Fault-tolerant: stateful exactly-once semantics out of the box • Combine streaming with batch and interactive queries
  35. 35. MLlib • Scalable machine learning library • Iterative computation: high-quality algorithms, up to 100x faster than Hadoop MapReduce • Algorithms (MLlib 1.3): • linear SVM and logistic regression • classification and regression trees • random forest and gradient-boosted trees • recommendation via alternating least squares • clustering via k-means, Gaussian mixtures, and power iteration clustering • topic modeling via latent Dirichlet allocation • singular value decomposition • linear regression with L1- and L2-regularization • isotonic regression • multinomial naive Bayes • frequent itemset mining via FP-growth • basic statistics • feature transformations
  36. 36. GraphX: Unifying Graphs and Tables • Spark's API for graphs and graph-parallel computation • Graph abstraction: a directed multigraph with properties attached to each vertex and edge • Works seamlessly with both graphs and collections
  37. 37. SPARK STREAMING
  38. 38. Spark Streaming
  39. 39. Batches… • Chop up the live stream into batches of X seconds • Spark treats each batch of data as RDDs and processes them using RDD operations • Finally, the processed results of the RDD operations are returned in batches
  40. 40. DStream (Discretized Stream) • A DStream is represented by a continuous series of RDDs
  41. 41. Micro Batch (Near Real Time)
  42. 42. Window Operation & Checkpoint
  43. 43. Streaming Fault Tolerance
  44. 44. Spark Streaming + SQL
  45. 45. Quick Run Spark UI
  46. 46. Spark with Storm
  47. 47. Quick Recap • Why Spark? Spark features? • What is an RDD? • Fault-tolerance model • Spark extensions/stack? • Micro-batch?
  48. 48. Clients…
  49. 49. Thank You….
  50. 50. Additional Slides
  51. 51. BDAS - Berkeley Data Analytics Stack https://amplab.cs.berkeley.edu/software/ BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
  52. 52. Optimization • groupBy is costlier: use mapr() or reduceByKey() • RDD storage level MEMORY_ONLY is better
  53. 53. Big Data Landscape
  54. 54. RDDs vs Distributed Shared Memory
  55. 55. SQL Optimization
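
Slides 18 and 25 describe RDDs as lazily evaluated and recoverable through lineage: each RDD logs the one operation that derived it from its parent, so a lost partition can be recomputed instead of restored from a replica. The sketch below is a toy, single-process model of that idea in plain Python. It is not the Spark API; the class name `ToyRDD` and its methods are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self,
                 source: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 transform: Optional[Callable[[list], list]] = None):
        self.source = source        # partitions of the base dataset (root RDD only)
        self.parent = parent        # lineage: the RDD this one was derived from
        self.transform = transform  # the one operation applied to every partition
        self.cache = {}             # partition index -> materialized data

    def map(self, f):
        # Lazy: records the operation, computes nothing yet.
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        # Lazy, like map.
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, i):
        # Return partition i, rebuilding it from lineage when it is not cached.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = self.transform(self.parent.compute(i))
        self.cache[i] = data
        return data

    def collect(self):
        # Action: triggers evaluation of every partition.
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


lines = ToyRDD(source=[["INFO ok", "ERROR disk"], ["ERROR net", "INFO done"]])
errors = lines.filter(lambda s: s.startswith("ERROR")).map(str.lower)

first = errors.collect()
del errors.cache[1]           # simulate losing one cached partition
recovered = errors.collect()  # partition 1 is recomputed from its lineage
```

In this toy model only the lost partition is rebuilt, which mirrors the "recompute lost partitions on failure" bullet. Real Spark additionally tracks wide dependencies and stage boundaries, which this sketch ignores.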
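
Slides 39 to 42 describe Spark Streaming as chopping a live stream into batches of X seconds and then applying window operations over those batches. The following is a minimal plain-Python sketch of that micro-batch idea, assuming events arrive as `(timestamp_seconds, value)` pairs. The helper names `micro_batches` and `windowed_counts` are hypothetical and are not part of Spark Streaming.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive batches of batch_seconds."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // batch_seconds)].append(value)
    if not buckets:
        return []
    # Emit every interval between the first and last event, including empty ones,
    # because a micro-batch engine produces a batch per interval regardless of load.
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


def windowed_counts(batches, window, slide):
    """Count events in a sliding window; window and slide are measured in batches."""
    counts = []
    for end in range(window, len(batches) + 1, slide):
        counts.append(sum(len(b) for b in batches[end - window:end]))
    return counts


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d"), (3.7, "e")]
batches = micro_batches(events, batch_seconds=1)
counts = windowed_counts(batches, window=2, slide=1)
```

Here each inner list plays the role of one RDD in the DStream, and the window function combines the most recent `window` batches each time it slides.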
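
Slide 52 states that groupBy is costlier than reduceByKey. One common explanation is that a reduce-by-key style operation can combine values inside each partition before the shuffle, so fewer records cross the network. The plain-Python sketch below illustrates that difference by counting shuffled records. It is a toy model under that assumption, not Spark code, and the function names are hypothetical.

```python
from collections import defaultdict


def group_then_sum(partitions):
    """groupBy-style: every (key, value) record is shuffled, then summed per key."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)


def reduce_by_key(partitions):
    """reduceByKey-style: sum per key inside each partition, then shuffle partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
grouped_totals, grouped_shuffled = group_then_sum(partitions)
reduced_totals, reduced_shuffled = reduce_by_key(partitions)
```

Both approaches produce the same totals; the second shuffles four partial sums instead of six raw records. The gap grows with the number of repeated keys per partition.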
