

Spark Streaming As Near Realtime ETL: talk @ Paris Data Geek


  1. Spark Streaming As Near Realtime ETL. Paris Data Geek, 18/09/2014. Djamel Zouaoui, @DjamelOnLine
  2. Who am I? Djamel Zouaoui, Director of Engineering, @DjamelOnLine. #Data #Scala #RecSys #Tech #MachineLearning #NoSql #BigData #Spark #Dev #R #Architecture
  3. What is Spark? A fast and expressive cluster computing engine, compatible with Apache Hadoop. Efficient: general execution graphs, in-memory storage. Usable: rich APIs in Java, Scala, Python, and an interactive shell.
  4. RDD in Spark: Resilient Distributed Dataset, Spark's storage abstraction for datasets. RDDs are immutable, with fault recovery: each RDD remembers how it was created and can recover if any part of the data is lost. Three kinds of operations: transformations (lazy in nature, create a new dataset from an existing one), actions (return a value or export data after performing a computation), and persistence (caching a dataset on disk, in RAM, or mixed, for future operations).
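A minimal sketch of the three kinds of operations, assuming a Spark shell context named sc and a hypothetical HDFS path:

    import org.apache.spark.storage.StorageLevel

    // Transformations: lazy, they only build a new RDD from an existing one
    val events = sc.textFile("hdfs://...")                    // hypothetical path
    val errors = events.filter(line => line.contains("ERROR"))

    // Persistence: keep the filtered dataset in RAM for future operations
    errors.persist(StorageLevel.MEMORY_ONLY)

    // Actions: trigger the actual computation and return a value
    val total  = errors.count()
    val sample = errors.take(10)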
  5. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect()
  6. The lineage graph of the word count: textFile → flatMap → map → reduceByKey → collect. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect()
  7. The same lineage graph split into stages at the shuffle boundary: Stage 1 = textFile → flatMap → map; Stage 2 = reduceByKey → collect. sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect()
  8. Each stage runs as parallel tasks, one pipeline per partition: every partition executes Stage 1 (textFile → flatMap → map), then, after the shuffle, Stage 2 (reduceByKey → collect). sparkContext.textFile("hdfs://…") .flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey((a, b) => a + b) .collect()
  9. Ecosystem: the Spark RDD API at the core, with Spark SQL (RDD-based tables), Spark Streaming (DStreams: streams of RDDs), MLlib (RDD-based matrices) and GraphX (RDD-based graphs) on top. Storage: HDFS, S3, Cassandra. Cluster managers: YARN, Mesos, Standalone.
  10. What is Spark Streaming? A project started in early 2012 that extends Spark for big data stream processing. It scales to hundreds of nodes, achieves second-scale latencies, recovers efficiently from failures, and integrates with batch and interactive processing.
  11. How it works? (diagram slide)
  12. How it works? (diagram slide)
  13. How it works? The input source definition produces the input DStream. DStream computations: at window level, with a stateful option, etc. Then classic RDD manipulation: transformations and actions (sketched below).
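The window-level and stateful computations can be sketched as follows, assuming a DStream of (word, count) pairs; names and durations are illustrative, and updateStateByKey additionally requires checkpointing to be enabled on the StreamingContext:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Window level: totals over the last 30 seconds, recomputed every 10 seconds
    def windowedCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    // Stateful option: running total per key across all micro-batches
    // (requires ssc.checkpoint(...) to have been called beforehand)
    def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum)
      }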
  14. Code: TOPOLOGY FREE. The skeleton of a streaming job: //StreamingContext & input source creation //Standard transformations //Window usage //Start the streaming and put it in the background
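A sketch fleshing out that skeleton; the socket source, host, port and batch durations are illustrative placeholders, not the job from the talk:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // StreamingContext & input source creation
    val conf  = new SparkConf().setAppName("NearRealtimeETL").setMaster("local[2]")
    val ssc   = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)      // hypothetical source

    // Standard transformations
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Window usage: counts over the last 60 seconds, sliding every 10 seconds
    val counts = pairs.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    counts.print()

    // Start the streaming and put it in the background
    ssc.start()               // returns immediately; processing runs in background threads
    ssc.awaitTermination()    // block the main thread until the job is stopped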
  15. Internals: two main processes. Receivers are in charge of DStream creation; workers are in charge of data processing. These processes are autonomous and independent: no cores or resources shared, no information shared.
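One practical consequence of this no-sharing design: each receiver permanently occupies a core, so the application must be given strictly more cores than it has receivers. A minimal sizing sketch (the master URL and counts are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // "local[4]" is illustrative: e.g. 2 receivers + 2 cores left for the workers.
    // With fewer cores than receivers, no processing would ever run.
    val conf = new SparkConf().setAppName("ReceiverSizing").setMaster("local[4]")
    val ssc  = new StreamingContext(conf, Seconds(5))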
  16. Execution Model, Receiving Data: on the driver (Spark Streaming + Spark), StreamingContext.start() launches the receivers, tracked by the Network Input Tracker. On the Spark workers, a receiver takes the data received, pushes it as blocks into the Block Manager, and the blocks are replicated to another worker's Block Manager; the Block Manager Master on the driver keeps track of the blocks.
  17. Execution Model, Job Scheduling: on the driver, the Network Input Tracker reports the block IDs, from which the DStream graph generates RDDs and jobs; the Job Scheduler and Job Manager push the jobs through the job queue to Spark's schedulers, and the jobs are executed on the worker nodes (receiver and Block Managers).
  18. Use Case: Find True Love! Build a recommender system based on implicit and explicit data to find the best match for you. Based on machine learning models, processed offline (in batch), on big (bunches of) data. Main goals of the streaming platform: store a lot of data, clean it, transform it.
  19. Overview. Pipeline on the Spark cluster: Kafka topics → data receiver → data cleaning job → HDFS storage → data modeling job → HDFS storage. Spark in standalone mode, 120 cores available on Spark, 4.5 GB RAM per core. Based on a Hadoop cluster for HDFS storage (10 TB), HDP 2.0, 8 machines (2 masters, 6 slaves).
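A hypothetical configuration matching that sizing (standalone master URL and per-executor split are assumptions, not from the talk):

    import org.apache.spark.SparkConf

    // 120 cores total, 4.5 GB of RAM per core, Spark standalone mode.
    val conf = new SparkConf()
      .setAppName("StreamingETL")
      .setMaster("spark://spark-master:7077")   // hypothetical standalone master
      .set("spark.cores.max", "120")            // cap the application at 120 cores
      .set("spark.executor.memory", "27g")      // e.g. 6 cores per executor x 4.5 GB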
  20. Job details (data receiver → data cleaning job → HDFS storage → data modeling job → HDFS storage). Receiver: use of the provided Kafka source; a naive implementation based on autocommit and automatic offset management. Cleaning: classic RDD transformations, then persist the new RDDs, in HDFS for other Spark (batch) jobs and in RAM to speed up the next step. Modeling: a binary matrix, with scoring based on current events and history; the history is loaded from RDDs stored on HDFS.
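A minimal sketch of the receiver, cleaning and persist steps, assuming the spark-streaming-kafka artifact is on the classpath; the ZooKeeper quorum, group id, topic name, output path and cleaning rule are all hypothetical. This uses the receiver-based Kafka source with autocommit offset management, as described in the talk:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("DataCleaning"), Seconds(10))

    // Receiver: provided Kafka source (zkQuorum, groupId, topic -> receiver threads)
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "etl-group", Map("events" -> 1))

    stream.map(_._2)                            // keep the message payload
      .filter(_.nonEmpty)                       // "cleaning": placeholder rule
      .foreachRDD { (rdd, time) =>
        rdd.persist(StorageLevel.MEMORY_ONLY)   // in RAM to speed up the next step
        // in HDFS for the other (batch) Spark jobs; path is hypothetical
        rdd.saveAsTextFile(s"hdfs://namenode/clean/${time.milliseconds}")
      }

    ssc.start(); ssc.awaitTermination()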
  21. Issues. Data loss: in the receiver phase, due to the naive Kafka consumer; a more robust client with manual offset management (vs. autocommit) is needed. The delights of (de)serialization: Kryo / Avro / Parquet… not directly due to Spark, but not easy either; the major issues are during the import/export steps.
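For the Kryo side of the (de)serialization story, the usual setup is to switch Spark's serializer and register the application classes; CleanEvent below is a hypothetical class standing in for whatever is shipped between the import/export steps:

    import org.apache.spark.SparkConf

    // Hypothetical record type moved between jobs
    case class CleanEvent(userId: Long, itemId: Long, score: Double)

    val conf = new SparkConf()
      .setAppName("StreamingETL")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[CleanEvent]))  // avoids writing full class names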
  22. And Beyond… @Viadeo: more than ETL, an analytics backend. RabbitMQ → data receiver → data modeling jobs (Spark cluster) → generic index (ElasticSearch cluster) → D3.js webapp.
  23. Join the Viadeo adventure. Wanted: Software Engineers. We use Node.js, Spark, ElasticSearch, CQRS, AWS and many more… We love full-stack engineers and a flat organization. We work in autonomous product teams. We lunch for free ;-)
  24. QUESTIONS?
