2016-05-24 IBM Fast Data Meetup
1. Scala: Lingua Franca of Fast Data
Jamie Allen
Sr. Director of Global Solutions Architects
2. • Why Scala?
• Who is doing this?
• What is Fast Data?
• Architecting for Fast Data
Agenda
3. • Cloud portability versus native control
• Application correctness versus speed of development
• Modularity versus global namespace
• Concise syntax versus boilerplate
• Multi-threaded simplicity via abstractions versus low-level control
Tradeoffs
4. • REPL
• Type safety
• Modularity
• Concise syntax
• Multi-threaded simplicity
• Data-centric semantics
• Managed runtime for cloud portability
• Ecosystem
Scala is the local optimum
6. The JVM is a primary reason for Scala’s success
7. • No REPL or Notebook
• Not a data-centric language, particularly collections semantics
Why not Java?
8. • Data-centric language with all of the wonderful collections semantics we want
• No type safety
• No modularity
Why not Python?
9. • Weak type safety
• Collections are too elemental
• Native execution (C++) is a non-starter for cloud portability, so Go is the only option
• Go's garbage collection is not generational
Why not Go or C++?
10. • Scala just so happened to fit well in this space
• Performance
• Correctness
• Conciseness
• Scala will evolve
• Other languages will come in time
Scala is NOT the end of the road
14. Why Scala?
At the time we started, I really wanted a PL that supports a
language-integrated interface (where people write
functions inline, etc)… However, I also wanted to be on the
JVM in order to easily interact with the Hadoop filesystem
and data formats for that. Scala was the only somewhat
popular JVM language that offered this kind of
functional syntax and was also statically typed (letting
us have some control over performance), so we chose
that. Today there might be an argument to make the first
version of the API in Java with Java 8, but we also
benefitted from other aspects of Scala in Spark, like type
inference, pattern matching, actor libraries, etc.
Matei Zaharia, creator of Spark
18. Hadoop strengths
• Lowest capital expenditure for big data
• Excellent for ingesting and integrating diverse datasets
• Flexible
• Classic analytics (aggregations and data warehousing)
• Machine learning
19. Hadoop weaknesses
• Complex administration
• YARN requires dedicated cluster
• MapReduce foibles
• Poor performance
• Imperative programming model
• No stream processing support
22. Spark
• Up to 100x faster than Hadoop MapReduce
• Uses far fewer machines and resources
• Incredible support from the community and enterprise
24. • Spark had it with RDDs
• They removed it with the DataFrames API
• Brought it back with Datasets, but not as comprehensively as RDDs
Type safety
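A sketch of the contrast (assuming a Spark 2.x local session; the record type and values are illustrative): a Dataset checks field access at compile time, while a DataFrame addresses columns by string name, so mistakes surface only at runtime.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration
case class Trade(symbol: String, price: Double)

val spark = SparkSession.builder
  .appName("type-safety-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Trade("IBM", 151.2), Trade("AAPL", 98.4)).toDS()

// Dataset: a typo in `_.price` or a wrong type is a compile error
val pricey = ds.filter(_.price > 100.0)

// DataFrame: columns are strings, so a typo such as
//   df.filter("prise > 100")
// compiles fine and only fails when the job runs
val df = ds.toDF()
```

RDDs gave the same compile-time guarantees over arbitrary types; Datasets restore much of that, but operations that drop to `Row` (as in the DataFrame view above) still lose it.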
25. Why not Flink?
• Flink has much better stream handling for low-latency systems than Spark currently offers
• Event timing
• Watermarks
• Triggers
• Exactly-once semantics (assuming connections hold)
• Pipeline portability via Apache Beam integration
32. • At the API
• In our source
• For our data
We want isolation
Wikipedia, Creative Commons, created by DFoerster
33. We want realistic data management
• Use CQRS and Event Sourcing, not CRUD
• Transactions, especially distributed ones, will not work
• Consistency is an anti-pattern at scale
• Distributed locks and shared data will limit you
• Data fabrics break all of these conventions
Think in terms of compensation, not prevention.
Kevin Webber, Lightbend
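A minimal sketch of "compensation, not prevention" via Event Sourcing (the event and account names here are illustrative, not from the talk): instead of preventing a bad write with a lock or transaction, append a compensating event to an immutable log and derive state as a fold over it.

```scala
sealed trait Event
final case class Deposited(amount: BigDecimal) extends Event
final case class Withdrawn(amount: BigDecimal) extends Event
// Compensation, not prevention: a mistaken withdrawal is corrected by
// appending a compensating event; the log itself is never mutated.
final case class WithdrawalReversed(amount: BigDecimal) extends Event

// Current state is a pure fold over the immutable event log
def balance(log: Vector[Event]): BigDecimal =
  log.foldLeft(BigDecimal(0)) {
    case (b, Deposited(a))          => b + a
    case (b, Withdrawn(a))          => b - a
    case (b, WithdrawalReversed(a)) => b + a
  }

val log = Vector(Deposited(100), Withdrawn(30), WithdrawalReversed(30))
// balance(log) == 100
```

Because the log is append-only, it doubles as the audit trail and can be replayed to rebuild any read-side view, which is the essence of CQRS.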
34. We want ACID v2
• Associativity, not Atomicity
• Commutativity, not Consistency
• Idempotence, not Isolation
• Distributed, not Durable
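These properties are what make state mergeable across replicas without coordination. A tiny grow-only set (a CRDT-style sketch; the type is illustrative) shows why: its merge is associative, commutative, and idempotent, so updates can arrive in any order, any number of times.

```scala
// A grow-only set: merge is associative, commutative, and idempotent,
// so replicas can exchange state in any order and converge.
final case class GSet[A](elems: Set[A]) {
  def add(a: A): GSet[A] = GSet(elems + a)
  def merge(other: GSet[A]): GSet[A] = GSet(elems ++ other.elems)
}

val a = GSet(Set(1, 2))
val b = GSet(Set(2, 3))

assert(a.merge(b) == b.merge(a))          // commutative
assert(a.merge(a) == a)                   // idempotent: redelivery is safe
assert(a.merge(b).merge(b) == a.merge(b)) // duplicate merges change nothing
```

No distributed lock is needed: every replica eventually holds the union, which is exactly the "compensate, don't prevent" mindset applied to data structures.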
Wikipedia, Creative Commons, created by Weston.pace
One version of the emerging Fast Data Architecture. For today's talk, I won't go through it in detail, but this reflects some industry trends among open-source tools (the popularity of Spark Streaming, Kafka, and Cassandra), plus our view of the Typesafe Reactive Platform as the glue that integrates it all, lets you implement the rest of the microservices you need, and provides the low-latency streaming through Akka Streams that Spark doesn't provide. The last slide has a link to a white paper I wrote that goes through this diagram in detail, along with several example use cases.
It’s very common to combine Spark Streaming, Kafka, and Cassandra (or another distributed, NoSQL database). In fact, this “troika” has become a de facto standard component of modern, stream-oriented architectures.
While Spark, Kafka, and Cassandra are de facto standard core components, you also need a platform for resource management and other needs, plus “glue” to tie everything together. Hence, the “SMACK stack”:
S - Spark (and Scala?)
M - Mesos
A - Akka
C - Cassandra
K - Kafka
Discuss all the components in the context of data flow. First, you will have streaming data from many sources: web requests and other external Internet traffic, plus services communicating with you. If they follow the Reactive Streams standard, they can be ingested by Akka Streams. Finally, lots of internal data, such as logs and files FTPed into your environment, will be ingested.
Reactive Streams follow a standard for an out-of-band, resilient backpressure mechanism that provides flow control. It only describes the behavior of a single stream with 1+ producers and 1+ consumers, but these streams are composable; a graph of reactive streams effectively provides end-to-end flow control.
Use the Lightbend Reactive Platform as "glue" for the other large components, implemented as microservices. Web requests (e.g., REST) can be handled by Play, which you can also use to implement user interfaces. Reactive Streams-compliant sources can be ingested by Akka Streams, which supports a low-latency, per-event processing model.
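A minimal Akka Streams sketch of this per-event model (assuming Akka 2.4-era APIs; the source and stages are illustrative). The key point is that backpressure is built in: the downstream sink signals demand upstream, so a fast producer is throttled rather than flooding memory.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system = ActorSystem("ingest")
implicit val materializer = ActorMaterializer() // required by Akka 2.4-era APIs

// The sink pulls elements one demand signal at a time; the source
// emits only as fast as downstream demand allows (backpressure).
Source(1 to 1000000)
  .map(n => n * 2)
  .runWith(Sink.foreach(println))
```

Any Reactive Streams-compliant publisher (a TCP source, a Kafka connector, another service) can be plugged in as the `Source` and inherits the same flow control.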
Use Kafka to ingest "ephemeral" stream data (which can't be replayed if lost downstream). Kafka has enormous scalability and durability. It's a great place to capture your streams, which can then be processed with Akka Streams or with Spark Streaming.
One use of Kafka is to solve the problem of N*M direct links between producers and consumers. Those links are hard to manage, and they couple services too directly, which is fragile when a given service needs to be scaled up through replication, replaced, or changed in the protocol that both ends must speak.
So Kafka can function as a central hub, yet it’s distributed and scalable so it isn’t a bottleneck or single point of failure.
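A sketch of the hub pattern using the plain Kafka producer client (the broker address, topic, and payload are assumptions for illustration). The producer knows only the topic, never the consumers, which is what breaks the N*M coupling.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumption: local broker
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

// Publish to a topic; any number of consumer groups can read it
// independently, each at its own pace.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("clickstream", "user-42", """{"page":"/home"}"""))
producer.close()
```

Because each consumer group tracks its own offset in the partitioned log, adding a new downstream service is just subscribing a new group; no producer changes.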
For sophisticated stream processing, where richer analytics (like "online" machine learning) are required and higher latencies can be tolerated, use Spark Streaming, which implements a mini-batch model, where data is processed in "chunks" defined by time windows as small as one second.
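The mini-batch model looks like this in a classic Spark Streaming word count (a sketch assuming a local socket text source; host, port, and app name are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("mini-batch-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second mini batches

// Each batch interval produces an RDD of the lines received in that window,
// processed with the same collection-style operations as batch Spark.
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```

The same code shape works for a Kafka-backed DStream; only the input source changes, which is why Kafka pairs so naturally with Spark Streaming in this architecture.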
Processing results can be written back to Kafka for downstream consumption. For permanent storage, write to a database like Cassandra when fast record-level CRUD is required, or write to a distributed file system when cheaper storage is desired and "table-scan" access patterns are most important.
This architecture is agnostic to the platform. We like Mesos, the next-generation, general-purpose infrastructure for cluster resource and application management, but it works with YARN, too. You can deploy on premises (e.g., on "bare metal" hardware) or in a cloud environment.