20160524 ibm fast data meetup

Talk about why Scala is dominating the fast data landscape


  1. Scala: Lingua Franca of Fast Data
     Jamie Allen, Sr. Director of Global Solutions Architects
  2. Agenda
     • Why Scala?
     • Who is doing this?
     • What is Fast Data?
     • Architecting for Fast Data
  3. Tradeoffs
     • Cloud portability versus native control
     • Application correctness versus speed of development
     • Modularity versus global namespace
     • Concise syntax versus boilerplate
     • Multi-threaded simplicity via abstractions versus low-level control
  4. Scala is the local optimum
     • REPL
     • Type safety
     • Modularity
     • Concise syntax
     • Multi-threaded simplicity
     • Data-centric semantics
     • Managed runtime for cloud portability
     • Ecosystem
  5. Scala is the local optimum
  6. The JVM is a primary reason for Scala’s success
  7. Why not Java?
     • No REPL or Notebook
     • Not a data-centric language, particularly collections semantics
  8. Why not Python?
     • Data-centric language, has all of the wonderful collections semantics we want
     • No type safety
     • No modularity
  9. Why not Go or C++?
     • Weak type safety
     • Collections are too elemental
     • Native execution is a non-starter, so Go is the only option
     • Garbage collection is not generational
  10. Scala is NOT the end of the road
      • Scala just so happened to fit well in this space
      • Performance
      • Correctness
      • Conciseness
      • Scala will evolve
      • Other languages will come in time
  11. Who is doing this?
  12. One Caveat: Apache Beam and TensorFlow
  13. Why Scala?
      "At the time we started, I really wanted a PL that supports a language-integrated interface (where people write functions inline, etc)… However, I also wanted to be on the JVM in order to easily interact with the Hadoop filesystem and data formats for that. Scala was the only somewhat popular JVM language that offered this kind of functional syntax and was also statically typed (letting us have some control over performance), so we chose that. Today there might be an argument to make the first version of the API in Java with Java 8, but we also benefitted from other aspects of Scala in Spark, like type inference, pattern matching, actor libraries, etc."
      Matei Zaharia, creator of Spark
  14. What is Fast Data?
  15. A bit of history: Hadoop
  16. [Diagram: Hadoop cluster. Master: Resource Manager, Name Node. Slave/worker nodes: Node Manager, Data Node, disks. MapReduce jobs #1 and #2 run on YARN over HDFS; Flume and Sqoop ingest data, including from databases.]
  17. Hadoop strengths
      • Lowest capital expenditure for big data
      • Excellent for ingesting and integrating diverse datasets
      • Flexible
      • Classic analytics (aggregations and data warehousing)
      • Machine learning
  18. Hadoop weaknesses
      • Complex administration
      • YARN requires dedicated cluster
      • MapReduce foibles
      • Poor performance
      • Imperative programming model
      • No stream processing support
  19. Fast Data with Spark
  20. Spark
      • 100x faster as a replacement for Hadoop MapReduce
      • Uses far fewer machines and resources
      • Incredible support from the community and enterprise
  21. Spark use cases
      • Primarily anomaly detection
      • Risk management
      • Fraud detection
      • Odds recalculation
      • Spam filters
      • Update search engine results quickly
  22. Type safety
      • Spark had it with RDDs
      • They removed it with the DataFrames API
      • Brought it back with DataSets, but not as comprehensively as RDDs
  23. Why not Flink?
      • Flink has much better stream handling for low-latency systems, which Spark currently lacks
      • Event timing
      • Watermarks
      • Triggers
      • Exactly-once semantics (assuming connections hold)
      • Pipeline portability via Apache Beam integration
  24. Why not Flink?
  25. Architecting for Fast Data
  26. This isn’t enough
  27. Old and busted
  28. "Traditional application architectures and platforms are obsolete."
      Gartner
  29. How do we avoid messing this up?
  30. We want isolation
      • At the API
      • In our source
      • For our data
      Image: Wikipedia, Creative Commons, created by DFoerster
  31. We want realistic data management
      • Use CQRS and Event Sourcing, not CRUD
      • Transactions, especially distributed, will not work
      • Consistency is an anti-pattern at scale
      • Distributed locks and shared data will limit you
      • Data fabrics break all of these conventions
      "Think in terms of compensation, not prevention."
      Kevin Webber, Lightbend
  32. We want to ACID v2
      • Associativity, not Atomicity
      • Commutativity, not Consistency
      • Idempotent, not Isolation
      • Distributed, not Durable
      Image: Wikipedia, Creative Commons, created by Weston.pace
  33. New hotness
  34. Fast Data Architecture
      [Diagram: only the label "HTTP/REST" survives]
  35. Learning Spark
      • Go to http://bigdatauniversity.com, built by IBM
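
The "ACID v2" properties on slide 32 can be made concrete with a small Scala sketch. It is not from the talk: the object `AcidV2Sketch`, the functions `mergeCounts` and `mergeSeen`, and the sample data are all hypothetical, chosen only to illustrate how associative, commutative, and idempotent merges let distributed workers combine partial results without coordination.

```scala
// Hypothetical sketch: all names and data are illustrative, not from the talk.
object AcidV2Sketch {
  // Per-partition event counts, e.g. produced by independent workers.
  type Counts = Map[String, Long]

  // Merge by summing counts per key. Addition is associative and commutative,
  // so partial results can be combined in any order and any grouping.
  def mergeCounts(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (key, n)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + n)
    }

  // Track processed event IDs in a set. Set union is associative, commutative,
  // and also idempotent, so a redelivered batch does not change the result.
  def mergeSeen(a: Set[String], b: Set[String]): Set[String] = a union b

  def main(args: Array[String]): Unit = {
    val p1: Counts = Map("login" -> 3L, "click" -> 10L)
    val p2: Counts = Map("click" -> 5L)
    val p3: Counts = Map("login" -> 1L, "purchase" -> 2L)

    // Different groupings and orderings of the same partial results.
    val leftFirst  = mergeCounts(mergeCounts(p1, p2), p3)
    val rightFirst = mergeCounts(p1, mergeCounts(p2, p3))
    val reordered  = mergeCounts(p3, mergeCounts(p2, p1))
    println(leftFirst == rightFirst && rightFirst == reordered)

    // Applying the same batch of event IDs twice.
    val once  = mergeSeen(Set("evt-1", "evt-2"), Set("evt-2", "evt-3"))
    val twice = mergeSeen(once, Set("evt-2", "evt-3"))
    println(once == twice)
  }
}
```

Note that the summing merge is associative and commutative but not idempotent: replaying a batch of counts would double-count it. One way to address that, under the assumptions of this sketch, is to pair it with an idempotent structure such as the set of already-seen event IDs.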