2016-05-24 IBM Fast Data Meetup
1. Scala: Lingua Franca of Fast Data
Jamie Allen
Sr. Director of Global Solutions Architects
2. • Why Scala?
• Who is doing this?
• What is Fast Data?
• Architecting for Fast Data
Agenda
3. • Cloud portability versus native control
• Application correctness versus speed of development
• Modularity versus global namespace
• Concise syntax versus boilerplate
• Multi-threaded simplicity via abstractions versus low-level control
Tradeoffs
4. • REPL
• Type safety
• Modularity
• Concise syntax
• Multi-threaded simplicity
• Data-centric semantics
• Managed runtime for cloud portability
• Ecosystem
Scala is the local optimum
6. The JVM is a primary reason for Scala’s success
7. • No REPL or Notebook
• Not a data-centric language, particularly collections semantics
Why not Java?
8. • Data-centric language with all of the wonderful collections semantics we want
• No type safety
• No modularity
Why not Python?
9. • Weak type safety
• Collections are too elemental
• Native execution (C++) is a non-starter for cloud portability, so Go is the only option
• Go's garbage collection is not generational
Why not Go or C++?
10. • Scala just so happened to fit well in this space
• Performance
• Correctness
• Conciseness
• Scala will evolve
• Other languages will come in time
Scala is NOT the end of the road
14. Why Scala?
At the time we started, I really wanted a PL that supports a
language-integrated interface (where people write
functions inline, etc)… However, I also wanted to be on the
JVM in order to easily interact with the Hadoop filesystem
and data formats for that. Scala was the only somewhat
popular JVM language that offered this kind of
functional syntax and was also statically typed (letting
us have some control over performance), so we chose
that. Today there might be an argument to make the first
version of the API in Java with Java 8, but we also
benefitted from other aspects of Scala in Spark, like type
inference, pattern matching, actor libraries, etc.
Matei Zaharia, creator of Spark
18. Hadoop strengths
• Lowest capital expenditure for big data
• Excellent for ingesting and integrating diverse datasets
• Flexible
• Classic analytics (aggregations and data warehousing)
• Machine learning
19. Hadoop weaknesses
• Complex administration
• YARN requires dedicated cluster
• MapReduce foibles
• Poor performance
• Imperative programming model
• No stream processing support
22. Spark
• Up to 100x faster than Hadoop MapReduce
• Uses far fewer machines and resources
• Incredible support from the community and enterprise
24. • Spark had it with RDDs
• They removed it with the DataFrames API
• Brought it back with Datasets, but not as comprehensively as RDDs
Type safety
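A sketch of the contrast (assuming a Spark 2.x local session; the record type and values are illustrative): a Dataset checks field access at compile time, while a DataFrame addresses columns by string name, so mistakes surface only at runtime.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration
case class Trade(symbol: String, price: Double)

val spark = SparkSession.builder
  .appName("type-safety-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Trade("IBM", 151.2), Trade("AAPL", 98.4)).toDS()

// Dataset: a typo in `_.price` or a wrong type is a compile error
val pricey = ds.filter(_.price > 100.0)

// DataFrame: columns are strings, so a typo such as
//   df.filter("prise > 100")
// compiles fine and only fails when the job runs
val df = ds.toDF()
```

RDDs gave the same compile-time guarantees over arbitrary types; Datasets restore much of that, but operations that drop to `Row` (as in the DataFrame view above) still lose it.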
25. Why not Flink?
• Flink has much better stream handling for low-latency systems than Spark currently offers
• Event timing
• Watermarks
• Triggers
• Exactly-once semantics (assuming connections hold)
• Pipeline portability via Apache Beam integration
32. • At the API
• In our source
• For our data
We want isolation
Wikipedia, Creative Commons, created by DFoerster
33. We want realistic data management
• Use CQRS and Event Sourcing, not CRUD
• Transactions, especially distributed ones, will not work
• Consistency is an anti-pattern at scale
• Distributed locks and shared data will limit you
• Data fabrics break all of these conventions
Think in terms of compensation, not prevention.
Kevin Webber, Lightbend
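A minimal sketch of "compensation, not prevention" via Event Sourcing (the event and account names here are illustrative, not from the talk): instead of preventing a bad write with a lock or transaction, append a compensating event to an immutable log and derive state as a fold over it.

```scala
sealed trait Event
final case class Deposited(amount: BigDecimal) extends Event
final case class Withdrawn(amount: BigDecimal) extends Event
// Compensation, not prevention: a mistaken withdrawal is corrected by
// appending a compensating event; the log itself is never mutated.
final case class WithdrawalReversed(amount: BigDecimal) extends Event

// Current state is a pure fold over the immutable event log
def balance(log: Vector[Event]): BigDecimal =
  log.foldLeft(BigDecimal(0)) {
    case (b, Deposited(a))          => b + a
    case (b, Withdrawn(a))          => b - a
    case (b, WithdrawalReversed(a)) => b + a
  }

val log = Vector(Deposited(100), Withdrawn(30), WithdrawalReversed(30))
// balance(log) == 100
```

Because the log is append-only, it doubles as the audit trail and can be replayed to rebuild any read-side view, which is the essence of CQRS.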
34. We want ACID v2
• Associativity, not Atomicity
• Commutativity, not Consistency
• Idempotence, not Isolation
• Distributed, not Durable
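These properties are what make state mergeable across replicas without coordination. A tiny grow-only set (a CRDT-style sketch; the type is illustrative) shows why: its merge is associative, commutative, and idempotent, so updates can arrive in any order, any number of times.

```scala
// A grow-only set: merge is associative, commutative, and idempotent,
// so replicas can exchange state in any order and converge.
final case class GSet[A](elems: Set[A]) {
  def add(a: A): GSet[A] = GSet(elems + a)
  def merge(other: GSet[A]): GSet[A] = GSet(elems ++ other.elems)
}

val a = GSet(Set(1, 2))
val b = GSet(Set(2, 3))

assert(a.merge(b) == b.merge(a))          // commutative
assert(a.merge(a) == a)                   // idempotent: redelivery is safe
assert(a.merge(b).merge(b) == a.merge(b)) // duplicate merges change nothing
```

No distributed lock is needed: every replica eventually holds the union, which is exactly the "compensate, don't prevent" mindset applied to data structures.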
Wikipedia, Creative Commons, created by Weston.pace
One version of the emerging Fast Data Architecture. For today's talk, I won't go through it in detail, but this reflects some industry trends among open-source tools (the popularity of Spark Streaming, Kafka, and Cassandra), plus our view of the Typesafe Reactive Platform as the glue that integrates it all, lets you implement the rest of the microservices you need, and provides the low-latency streaming through Akka Streams that Spark doesn't provide. The last slide has a link to a white paper I wrote that goes through this diagram in detail, along with several example use cases.
It’s very common to combine Spark Streaming, Kafka, and Cassandra (or another distributed, NoSQL database). In fact, this “troika” has become a de facto standard component of modern, stream-oriented architectures.
While Spark, Kafka, and Cassandra are de facto standard core components, you also need a platform for resource management and other needs, plus “glue” to tie everything together. Hence, the “SMACK stack”:
S - Spark (and Scala?)
M - Mesos
A - Akka
C - Cassandra
K - Kafka
Discuss all the components in the context of data flow. First, you will have streaming data from many sources: web requests and other external Internet traffic, plus services communicating with you. If they follow the Reactive Streams standard, they can be ingested by Akka Streams. Finally, lots of internal data, such as logs and files FTPed into your environment, will be ingested.
Reactive Streams follow a standard for an out-of-band, resilient backpressure mechanism that provides flow control. It only describes the behavior of a single stream with 1+ producers and 1+ consumers, but these streams are composable; a graph of reactive streams effectively provides end-to-end flow control.
Use the Lightbend Reactive Platform as "glue" for the other large components, implemented as microservices. Web requests (e.g., REST) can be handled by Play, which you can also use to implement user interfaces. Reactive Streams-compliant sources can be ingested by Akka Streams, which supports a low-latency, per-event processing model.
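A minimal Akka Streams sketch of this per-event model (assuming Akka 2.4-era APIs; the source and stages are illustrative). The key point is that backpressure is built in: the downstream sink signals demand upstream, so a fast producer is throttled rather than flooding memory.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system = ActorSystem("ingest")
implicit val materializer = ActorMaterializer() // required by Akka 2.4-era APIs

// The sink pulls elements one demand signal at a time; the source
// emits only as fast as downstream demand allows (backpressure).
Source(1 to 1000000)
  .map(n => n * 2)
  .runWith(Sink.foreach(println))
```

Any Reactive Streams-compliant publisher (a TCP source, a Kafka connector, another service) can be plugged in as the `Source` and inherits the same flow control.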
Use Kafka to ingest "ephemeral" stream data (which can't be replayed if lost downstream). Kafka has enormous scalability and durability. It's a great place to capture your streams, which can then be processed with Akka Streams or with Spark Streaming.
One use of Kafka is to solve the problem of N*M direct links between producers and consumers. Those links are hard to manage, and they couple services too directly, which is fragile when a given service needs to be scaled up through replication, replaced, or changed in the protocol that both ends must speak.
So Kafka can function as a central hub, yet it’s distributed and scalable so it isn’t a bottleneck or single point of failure.
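A sketch of the hub pattern using the plain Kafka producer client (the broker address, topic, and payload are assumptions for illustration). The producer knows only the topic, never the consumers, which is what breaks the N*M coupling.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumption: local broker
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

// Publish to a topic; any number of consumer groups can read it
// independently, each at its own pace.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("clickstream", "user-42", """{"page":"/home"}"""))
producer.close()
```

Because each consumer group tracks its own offset in the partitioned log, adding a new downstream service is just subscribing a new group; no producer changes.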
For sophisticated stream processing, where richer analytics (like "online" machine learning) are required and higher latencies can be tolerated, use Spark Streaming, which implements a mini-batch model, where data is processed in "chunks" defined by time windows as small as one second.
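The mini-batch model looks like this in a classic Spark Streaming word count (a sketch assuming a local socket text source; host, port, and app name are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("mini-batch-demo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // 1-second mini batches

// Each batch interval produces an RDD of the lines received in that window,
// processed with the same collection-style operations as batch Spark.
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```

The same code shape works for a Kafka-backed DStream; only the input source changes, which is why Kafka pairs so naturally with Spark Streaming in this architecture.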
Processing results can be written back to Kafka for downstream consumption. For permanent storage, write to a database like Cassandra when fast record-level CRUD is required, or write to a distributed file system when cheaper storage is desired and "table-scan" access patterns are most important.
This architecture is agnostic to the platform. We like Mesos, the next-generation, general-purpose infrastructure for cluster resource and application management, but it works with YARN, too. You can deploy on premises (e.g., on "bare metal" hardware) or in a cloud environment.