At Tuplejump we have built a big data platform powered by Scala everywhere: Akka for message and event processing, Spark for streaming and batch processing, Shark for ad hoc querying, and Play to power our web-based IDE. This talk walks through the various components of the platform and how Scala, concepts like reactive programming and event-driven architecture, and ecosystem components like the Akka actor framework, Spark RDDs and the Play web framework guided and inspired our decisions, and gave us the kickstart to attempt the huge challenge of building a complete, integrated, end-to-end Big Data Application Framework.
2. A data engineering startup, with a vision to simplify data engineering and empower the next generation of data powered miracles!
tuplejump
3. Who Am I?
• Founder and CEO @ Tuplejump
• Earlier worked at Pramati, Cordys and a couple of startups.
• A polyglot developer
• Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell
• Love data hacking in R and Python
• Java and JavaScript fed me for a long, long time
• Committed to Scala
• Believe in choosing the best tool for the task
• Open Source fanatic
@milliondreams | mytechrantings.blogspot.com
4. The big data pipeline
Stages: COLLECT, TRANSFORM, STORE, PREDICT, EXPLORE, VISUALIZE
5. The Tuplejump Platform
• Hydra (COLLECT): The tentacled framework to gather high volume and velocity data from push and pull sources, powered by Akka, reacting on demand to events and streaming to Spark for batch processing.
• Spark + Calliope (TRANSFORM): Using the friendly Spark API with added features to easily consume or load data from and to Cassandra-powered storage.
• Cassandra++ (STORE): Cassandra provides a single storage mechanism for files, (un)structured data and generic data.
• MinerBot (PREDICT): Building on Spark's ML framework and going towards machine assisted insights, we are building our own EA and ANN/DL frameworks to take ML to the next level.
• Shark + Calliope (EXPLORE): Ad hoc querying with Shark on your data in Dstore.
• UberCube (EXPLORE): An OLAP cube engine.
• Pissaro (VISUALIZE): A modern, game-changing data frontend that is “not just dashboards”, providing a highly interactive and reactive visualization frontend.
6. Advantages
• All the advantages of Spark + All the advantages of Cassandra + Much more!
• Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions
• Shark + C* provide for superfast ad hoc querying.
• UberCube empowers sub-millisecond responses on very large cubes
• MinerBot provides ready to use ML algos, plus a possibility of much more complex algos and mechanisms than just map reduce.
• Ready to use, no integration required
• Easy to develop, deploy, monitor and scale
7. Why Scala?
• Object oriented and functional
• Runs on the JVM
• 100% compatible with Java
• Modern, evolving, scalable
• Concise, flexible and high performance
• Excellent support for DSL development
• Spark and Play use Scala as their primary language
• We have used it for a long time and we love it!!!
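A minimal sketch of a few of the points above: object oriented and functional style side by side, direct use of a Java class, and a tiny DSL-style extension. The names `Event`, `totalBytesBySource` and `DurationOps` are hypothetical illustrations and are not part of the Tuplejump platform.

```scala
import java.util.concurrent.TimeUnit // a plain Java class, used directly from Scala

object WhyScalaSketch {
  // Object oriented: a concise immutable data class (hypothetical domain type).
  final case class Event(source: String, bytes: Long)

  // Functional: transform collections with higher-order functions.
  def totalBytesBySource(events: Seq[Event]): Map[String, Long] =
    events.groupBy(_.source).map { case (src, es) => src -> es.map(_.bytes).sum }

  // DSL support: an implicit class adds a readable method to Int.
  implicit class DurationOps(val n: Int) extends AnyVal {
    def minutes: Long = TimeUnit.MINUTES.toMillis(n.toLong)
  }

  // Reads almost like prose, yet is ordinary type-checked Scala.
  val windowMillis: Long = 5.minutes
}
```

The implicit class is only one of several ways to build DSLs in Scala; it is used here because it needs nothing beyond the standard library.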
8. The ultimate gyan!
You can flirt with other languages,
you can have short affairs with a few,
You will fall in love with Scala at first sight,
You have to marry her to know her!
9. Let’s call in some friends
• Akka - Actors to build concurrent and distributed applications
• Spark - The blue eyed whiz kid of the Big Data class
• Play - The web development champion
• SBT - The best builder in town
• ScalaTest - The story teller
• Shapeless and Scalaz - Masters of the Dark Arts
10. Concurrency With Akka
• Inspired by Erlang’s Actor Model
• Runs on the JVM
• Actors define behavior to handle typed messages
• Actors process one message at a time
• Can use Group/Pool of actors behind routers for concurrency
• Can run thousands of actors on a modern server
• Location transparency for clustering
• Supervision and state recovery for HA
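The "one message at a time" idea can be sketched without Akka itself. The toy below is not Akka's API: it is a standard-library stand-in where a single-threaded executor plays the role of the mailbox, and `CounterActor`, `tell` and `stopAndGet` are hypothetical names chosen for illustration.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object ToyActorSketch {
  // Messages the hypothetical counter understands.
  sealed trait Msg
  case object Increment extends Msg
  final case class Add(n: Int) extends Msg

  // A toy "actor": private state plus a single-threaded mailbox.
  final class CounterActor {
    private val mailbox = Executors.newSingleThreadExecutor()
    @volatile private var count = 0 // only ever written by the mailbox thread

    // Fire-and-forget send; messages are queued and handled one at a time.
    def tell(msg: Msg): Unit = mailbox.execute(new Runnable {
      def run(): Unit = msg match {
        case Increment => count += 1
        case Add(n)    => count += n
      }
    })

    // Stop accepting messages, drain the queue, and return the final state.
    def stopAndGet(): Int = {
      mailbox.shutdown()
      mailbox.awaitTermination(5, TimeUnit.SECONDS)
      count
    }
  }

  def demo(): Int = {
    val counter = new CounterActor
    // Four threads send concurrently; the mailbox serializes the processing.
    val senders = (1 to 4).map { _ =>
      new Thread(() => (1 to 1000).foreach(_ => counter.tell(Increment)))
    }
    senders.foreach(_.start())
    senders.foreach(_.join())
    counter.tell(Add(10))
    counter.stopAndGet()
  }
}
```

Because the state is touched by only one thread, no locks are needed around `count` even though many threads send messages. Routers, location transparency and supervision are what a real actor framework adds on top of this basic shape.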
11. Batch Processing with Spark
• Resilient Distributed Datasets
• Fast in-memory big data
• Map/Reduce on steroids
• Iterative and interactive
• Code in Scala, Java, Python and now R
• Streaming (DStreams - Batch processing on streams)
• MLlib, Shark, Spark SQL, GraphX and more
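The map/reduce style can be sketched on an ordinary local collection, since Spark's RDD API deliberately resembles Scala collections. The example below is an assumption-laden stand-in that runs on a plain `Seq` with no Spark dependency; `wordCount` is a hypothetical helper.

```scala
object MapReduceSketch {
  // Word count expressed as map and reduce steps over a local collection.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // map phase: lines to words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map(word => (word, 1))               // emit (key, 1) pairs
      .groupBy(_._1)                        // group pairs by key
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // reduce per key
}
```

On an RDD the same pipeline would typically keep the `flatMap` and `map` steps and replace the last two with `reduceByKey(_ + _)`, with the data partitioned across a cluster rather than held in one process.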
12. Web development with Play
• Modern high velocity, highly scalable
• Built on Akka and Netty (NIO)
• Reactive in design (reactive I/O)
• Async HTTP, streaming HTTP, Comet, Websockets, build your own protocol
• Feature rich yet flexible
13. Build with SBT
• I hate writing XML
• Very easy to get started
• All the power of Scala in the build
• Maven dependency management + more
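As an illustration of a build with no XML, a `build.sbt` might look like the sketch below. The project name, the version numbers and the `isCi` flag are illustrative assumptions, and the slash syntax assumes sbt 1.x.

```scala
// build.sbt: versions and names below are illustrative assumptions
name := "bigdata-sketch"
scalaVersion := "2.13.12"

// Maven-style coordinates; %% appends the Scala binary version to the artifact name.
libraryDependencies ++= Seq(
  "org.scalatest" %% "scalatest" % "3.2.17" % Test
)

// Plain Scala is available inside the build definition.
val isCi = sys.env.contains("CI")
Test / parallelExecution := !isCi
```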
14. Testing with ScalaTest
• Write specs, not tests; an excellent tool for BDD
• Specs DSL very close to English
• Many testing styles
• Powerful matchers (“should be”)
• Fixtures
• Mock objects with ScalaMock
15. Taking functional further
• Shapeless
• Scrap your boilerplate
• Generic programming
• Existential types
• Scalaz
• Bringing Haskell to Scala
• Monads, Functors and all the theory!
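To give a flavor of the theory, here is a hand-rolled functor type class in plain Scala. It is a sketch of the idea only, not Scalaz's or Shapeless's actual definitions; `Functor`, `double` and the instances are written from scratch, and Scala 2.13 or later is assumed.

```scala
object FunctorSketch {
  // A type class describing "containers that can be mapped over".
  trait Functor[F[_]] {
    def map[A, B](fa: F[A])(f: A => B): F[B]
  }

  // Instances for two standard-library types.
  implicit val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }
  implicit val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }

  // Generic code written once against the type class, usable for any instance.
  def double[F[_]](fa: F[Int])(implicit F: Functor[F]): F[Int] =
    F.map(fa)(_ * 2)
}
```

The same `double` function works for `List(1, 2, 3)` and `Option(21)` without knowing which container it receives, which is the kind of abstraction these libraries provide ready-made.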