SlideShare a Scribd company logo
1 of 16
Download to read offline
`
Big Data Pipeline
@tuplejump
A data engineering startup, with
a vision to simplify data
engineering and empower the
next generation of data powered
miracles!
tuplejump
Who Am I?
• Founder and CEO @ Tuplejump
• Earlier worked at Pramati, Cordys and couple of startups.
• A polyglot developer
• Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell
• Love data hacking in R and Python
• Java and Javascript fed me for a long long time
• Committed to Scala
• Believe in choosing the best tool for the task
• Open Source fanatic
@milliondreams | mytechrantings.blogspot.com
The big data pipeline
COLLECT TRANSFORM
PREDICT
STORE
EXPLORE VISUALIZE
The Tuplejump Platform
COLLECT TRANSFORM
PREDICT
STORE
EXPLORE VISUALIZE
Hydra
The tentacled
framework to
gather high
volume and
velocity data
from push and
pull powered
by Akka,
reacting on
demands to
events and
streaming to
Spark to batch
process.
COLLECT
Spark +
Calliope
Using the
friendly Spark
API with added
features to
easily consume
or load data
from and to
Cassandra
powered
storage.
TRANSFORM
Cassandra++
Cassandra provides a single storage
mechanism for Files, (un)structured
data, Generic data.
STORE
MinerBot
Building on Spark's ML framework,
going towards machine assisted
insights, we are in building our own
EA and ANN/DL frameworks to take
ML to the next level.
PREDICT
Shark + Calliope
Ad Hoc querying with shark on your
data in Dstore.
UberCube
A OLAP cube engine
EXPLORE
Pissaro
A modern, game changing data frontend, which
is “not just dashboards”, providing highly
interactive and reactive visualization frontend.
VIZUALIZE
Advantages
• All the advantages of Spark + All the advantages of Cassandra + Much more!
• Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions
• Shark + C* provide for superfast ad hoc querying.
• UberCube empowers sub-millisecond responses on very large cubes
• MinerBot provides ready to use ML Algos, plus a possibility of much more complex
algos and mechanisms than just map reduce.
• Ready to use, no integration required
• Easy to develop, deploy, monitor and scale
Why Scala?
• Object oriented and functional
• Runs on the JVM
• 100% compatible with Java
• Modern, evolving, scalable
• Concise, flexible and high performance
• Excellent support for DSL development
• Spark and Play use Scala as their primary language
• We used it for long and we love it!!!
The ultimate gyan!
You can flirt with other languages,
you can have short affairs with few,
You will fall in love with Scala at the first sight,
You have to marry her to know her!
Let’s call in some friends
• Akka - Actors to build concurrent and distributed applications
• Spark - The blue eyed whiz kid of the Big Data class
• Play - The web development champion
• SBT - The best builder in town
• ScalaTest - The story teller
• Shapeless and Scalaz - Masters of the Dark Arts
Concurrency With Akka
• Inspired by Erlang’s Actor Model
• Runs on the JVM
• Actors define behavior to handle typed messages
• Actors process one message at a time
• Can use Group/Pool of actors behind routers for concurrency
• Can run thousands of actos on a modern server
• Location transparency for clustering
• Supervision and state recovery for HA
Batch Processing with Spark
• Resilient Distributed Datasets
• Fast in-memory big data
• Map/Reduce on steroids
• Iterative and interactive
• Code in scala, java, python and now R
• Streaming (DStreams - Batch processing on streams)
• MLLib, Shark, Spark SQL, GraphX and more
Web development with Play
• Modern high velocity, highly scalable
• Built on Akka and Netty (NIO)
• Reactive in design (reactive I/O)
• Async HTTP, streaming HTTP, Comet, Websockets, build your own protocol
• Feature rich yet flexible
Build with SBT
• I hate writing XML
• Very easy to get started
• All the power of Scala in the build
• Maven dependency management + more
Testing with ScalaTest
• Write specs not tests and excellent tool for BDD
• Specs DSL very close to english
• Many testing styles
• Powerful matchers (“should be”)
• Fixtures
• Mock objects with ScalaMock
Taking functional further
• Shapeless
• Scrap your boilerplate
• Generic programming
• Existential types
• ScalaZ
• Bringing Haskel to Scala
• Monads, Functors and all the theory!
Thank you!
• http://www.tuplejump.com/
• http://github.com/tuplejump/
• http://tuplejump.github.com/calliope/
• http://tuplejump.github.com/stargate/
• @tuplejump on twitter

More Related Content

What's hot

E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyRikin Tanna
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logAlexander Dean
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluDataWorks Summit
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingJen Aman
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsYelp Engineering
 
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...HostedbyConfluent
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platformLi Gao
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019confluent
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricAlexander Dean
 
When the Cloud is a Rockin: High Availability in Apache CloudStack
When the Cloud is a Rockin: High Availability in Apache CloudStackWhen the Cloud is a Rockin: High Availability in Apache CloudStack
When the Cloud is a Rockin: High Availability in Apache CloudStackJohn Burwell
 
Your Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's PerspectiveYour Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's PerspectiveIlya Ganelin
 
Dev309 from asgard to zuul - netflix oss-final
Dev309  from asgard to zuul - netflix oss-finalDev309  from asgard to zuul - netflix oss-final
Dev309 from asgard to zuul - netflix oss-finalRuslan Meshenberg
 
JOSA TechTalk - Lambda architecture and real-time processing
JOSA TechTalk - Lambda architecture and real-time processingJOSA TechTalk - Lambda architecture and real-time processing
JOSA TechTalk - Lambda architecture and real-time processingMahmoud Jalajel
 
Cloudsolutionday 2016: Getting Started with Severless Architecture
Cloudsolutionday 2016: Getting Started with Severless ArchitectureCloudsolutionday 2016: Getting Started with Severless Architecture
Cloudsolutionday 2016: Getting Started with Severless ArchitectureAWS Vietnam Community
 

What's hot (20)

E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/Livy
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified log
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique Visitors
 
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...
Event & Data Mesh as a Service: Industrializing Microservices in the Enterpri...
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Sas 2015 event_driven
Sas 2015 event_drivenSas 2015 event_driven
Sas 2015 event_driven
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
Docker in the Cloud
Docker in the CloudDocker in the Cloud
Docker in the Cloud
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 
Introducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabricIntroducing Tupilak, Snowplow's unified log fabric
Introducing Tupilak, Snowplow's unified log fabric
 
When the Cloud is a Rockin: High Availability in Apache CloudStack
When the Cloud is a Rockin: High Availability in Apache CloudStackWhen the Cloud is a Rockin: High Availability in Apache CloudStack
When the Cloud is a Rockin: High Availability in Apache CloudStack
 
Your Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's PerspectiveYour Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's Perspective
 
Dev309 from asgard to zuul - netflix oss-final
Dev309  from asgard to zuul - netflix oss-finalDev309  from asgard to zuul - netflix oss-final
Dev309 from asgard to zuul - netflix oss-final
 
JOSA TechTalk - Lambda architecture and real-time processing
JOSA TechTalk - Lambda architecture and real-time processingJOSA TechTalk - Lambda architecture and real-time processing
JOSA TechTalk - Lambda architecture and real-time processing
 
Cloudsolutionday 2016: Getting Started with Severless Architecture
Cloudsolutionday 2016: Getting Started with Severless ArchitectureCloudsolutionday 2016: Getting Started with Severless Architecture
Cloudsolutionday 2016: Getting Started with Severless Architecture
 

Viewers also liked

Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data PipelinesChristian Gügi
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearchFoo Sounds
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Amazon Web Services
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Visualization of big time series data
Visualization of big time series dataVisualization of big time series data
Visualization of big time series dataRob Hyndman
 
Gelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area MeetupGelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area MeetupVasia Kalavri
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafkaconfluent
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Setup 3 Node Kafka Cluster on AWS - Hands On
Setup 3 Node Kafka Cluster on AWS - Hands OnSetup 3 Node Kafka Cluster on AWS - Hands On
Setup 3 Node Kafka Cluster on AWS - Hands Onhkbhadraa
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 

Viewers also liked (20)

Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearch
 
Kafka aws
Kafka awsKafka aws
Kafka aws
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Visualization of big time series data
Visualization of big time series dataVisualization of big time series data
Visualization of big time series data
 
Gelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area MeetupGelly in Apache Flink Bay Area Meetup
Gelly in Apache Flink Bay Area Meetup
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Setup 3 Node Kafka Cluster on AWS - Hands On
Setup 3 Node Kafka Cluster on AWS - Hands OnSetup 3 Node Kafka Cluster on AWS - Hands On
Setup 3 Node Kafka Cluster on AWS - Hands On
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 

Similar to Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium 2014, ThoughtWorks

Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
20160524 ibm fast data meetup
20160524 ibm fast data meetup20160524 ibm fast data meetup
20160524 ibm fast data meetupshinolajla
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...Lucas Jellema
 
Building Asynchronous Applications
Building Asynchronous ApplicationsBuilding Asynchronous Applications
Building Asynchronous ApplicationsJohan Edstrom
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Adam Doyle
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Uwe Korn
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
 
Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Gravy Analytics
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 
Chicago Microservices Integration Talk
Chicago Microservices Integration TalkChicago Microservices Integration Talk
Chicago Microservices Integration TalkChristian Posta
 

Similar to Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium 2014, ThoughtWorks (20)

Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
20160524 ibm fast data meetup
20160524 ibm fast data meetup20160524 ibm fast data meetup
20160524 ibm fast data meetup
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
Introducing Apache Kafka and why it is important to Oracle, Java and IT profe...
 
Building Asynchronous Applications
Building Asynchronous ApplicationsBuilding Asynchronous Applications
Building Asynchronous Applications
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019Transitioning from Java to Scala for Spark - March 13, 2019
Transitioning from Java to Scala for Spark - March 13, 2019
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
 
Chicago Microservices Integration Talk
Chicago Microservices Integration TalkChicago Microservices Integration Talk
Chicago Microservices Integration Talk
 
Machine Learning at Scale
Machine Learning at ScaleMachine Learning at Scale
Machine Learning at Scale
 

More from Thoughtworks

Design System as a Product
Design System as a ProductDesign System as a Product
Design System as a ProductThoughtworks
 
Designers, Developers & Dogs
Designers, Developers & DogsDesigners, Developers & Dogs
Designers, Developers & DogsThoughtworks
 
Cloud-first for fast innovation
Cloud-first for fast innovationCloud-first for fast innovation
Cloud-first for fast innovationThoughtworks
 
More impact with flexible teams
More impact with flexible teamsMore impact with flexible teams
More impact with flexible teamsThoughtworks
 
Culture of Innovation
Culture of InnovationCulture of Innovation
Culture of InnovationThoughtworks
 
Developer Experience
Developer ExperienceDeveloper Experience
Developer ExperienceThoughtworks
 
When we design together
When we design togetherWhen we design together
When we design togetherThoughtworks
 
Hardware is hard(er)
Hardware is hard(er)Hardware is hard(er)
Hardware is hard(er)Thoughtworks
 
Customer-centric innovation enabled by cloud
 Customer-centric innovation enabled by cloud Customer-centric innovation enabled by cloud
Customer-centric innovation enabled by cloudThoughtworks
 
Amazon's Culture of Innovation
Amazon's Culture of InnovationAmazon's Culture of Innovation
Amazon's Culture of InnovationThoughtworks
 
When in doubt, go live
When in doubt, go liveWhen in doubt, go live
When in doubt, go liveThoughtworks
 
Don't cross the Rubicon
Don't cross the RubiconDon't cross the Rubicon
Don't cross the RubiconThoughtworks
 
Your test coverage is a lie!
Your test coverage is a lie!Your test coverage is a lie!
Your test coverage is a lie!Thoughtworks
 
Docker container security
Docker container securityDocker container security
Docker container securityThoughtworks
 
Redefining the unit
Redefining the unitRedefining the unit
Redefining the unitThoughtworks
 
Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22Thoughtworks
 
A Tribute to Turing
A Tribute to TuringA Tribute to Turing
A Tribute to TuringThoughtworks
 
Rsa maths worked out
Rsa maths worked outRsa maths worked out
Rsa maths worked outThoughtworks
 

More from Thoughtworks (20)

Design System as a Product
Design System as a ProductDesign System as a Product
Design System as a Product
 
Designers, Developers & Dogs
Designers, Developers & DogsDesigners, Developers & Dogs
Designers, Developers & Dogs
 
Cloud-first for fast innovation
Cloud-first for fast innovationCloud-first for fast innovation
Cloud-first for fast innovation
 
More impact with flexible teams
More impact with flexible teamsMore impact with flexible teams
More impact with flexible teams
 
Culture of Innovation
Culture of InnovationCulture of Innovation
Culture of Innovation
 
Dual-Track Agile
Dual-Track AgileDual-Track Agile
Dual-Track Agile
 
Developer Experience
Developer ExperienceDeveloper Experience
Developer Experience
 
When we design together
When we design togetherWhen we design together
When we design together
 
Hardware is hard(er)
Hardware is hard(er)Hardware is hard(er)
Hardware is hard(er)
 
Customer-centric innovation enabled by cloud
 Customer-centric innovation enabled by cloud Customer-centric innovation enabled by cloud
Customer-centric innovation enabled by cloud
 
Amazon's Culture of Innovation
Amazon's Culture of InnovationAmazon's Culture of Innovation
Amazon's Culture of Innovation
 
When in doubt, go live
When in doubt, go liveWhen in doubt, go live
When in doubt, go live
 
Don't cross the Rubicon
Don't cross the RubiconDon't cross the Rubicon
Don't cross the Rubicon
 
Error handling
Error handlingError handling
Error handling
 
Your test coverage is a lie!
Your test coverage is a lie!Your test coverage is a lie!
Your test coverage is a lie!
 
Docker container security
Docker container securityDocker container security
Docker container security
 
Redefining the unit
Redefining the unitRedefining the unit
Redefining the unit
 
Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22Technology Radar Webinar UK - Vol. 22
Technology Radar Webinar UK - Vol. 22
 
A Tribute to Turing
A Tribute to TuringA Tribute to Turing
A Tribute to Turing
 
Rsa maths worked out
Rsa maths worked outRsa maths worked out
Rsa maths worked out
 

Recently uploaded

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Recently uploaded (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scala Symposium 2014, ThoughtWorks

  • 2. A data engineering startup, with a vision to simplify data engineering and empower the next generation of data powered miracles! tuplejump
  • 3. Who Am I? • Founder and CEO @ Tuplejump • Earlier worked at Pramati, Cordys and couple of startups. • A polyglot developer • Started with Perl and PHP, have worked with VB.Net, C#, VC++, Erlang and Haskell • Love data hacking in R and Python • Java and Javascript fed me for a long long time • Committed to Scala • Believe in choosing the best tool for the task • Open Source fanatic @milliondreams | mytechrantings.blogspot.com
  • 4. The big data pipeline COLLECT TRANSFORM PREDICT STORE EXPLORE VISUALIZE
  • 5. The Tuplejump Platform COLLECT TRANSFORM PREDICT STORE EXPLORE VISUALIZE Hydra The tentacled framework to gather high volume and velocity data from push and pull powered by Akka, reacting on demands to events and streaming to Spark to batch process. COLLECT Spark + Calliope Using the friendly Spark API with added features to easily consume or load data from and to Cassandra powered storage. TRANSFORM Cassandra++ Cassandra provides a single storage mechanism for Files, (un)structured data, Generic data. STORE MinerBot Building on Spark's ML framework, going towards machine assisted insights, we are in building our own EA and ANN/DL frameworks to take ML to the next level. PREDICT Shark + Calliope Ad Hoc querying with shark on your data in Dstore. UberCube A OLAP cube engine EXPLORE Pissaro A modern, game changing data frontend, which is “not just dashboards”, providing highly interactive and reactive visualization frontend. VIZUALIZE
  • 6. Advantages • All the advantages of Spark + All the advantages of Cassandra + Much more! • Over 500x (100x in case of filtered data) faster than traditional Hadoop solutions • Shark + C* provide for superfast ad hoc querying. • UberCube empowers sub-millisecond responses on very large cubes • MinerBot provides ready to use ML Algos, plus a possibility of much more complex algos and mechanisms than just map reduce. • Ready to use, no integration required • Easy to develop, deploy, monitor and scale
  • 7. Why Scala? • Object oriented and functional • Runs on the JVM • 100% compatible with Java • Modern, evolving, scalable • Concise, flexible and high performance • Excellent support for DSL development • Spark and Play use Scala as their primary language • We used it for long and we love it!!!
  • 8. The ultimate gyan! You can flirt with other languages, you can have short affairs with few, You will fall in love with Scala at the first sight, You have to marry her to know her!
  • 9. Let’s call in some friends • Akka - Actors to build concurrent and distributed applications • Spark - The blue eyed whiz kid of the Big Data class • Play - The web development champion • SBT - The best builder in town • ScalaTest - The story teller • Shapeless and Scalaz - Masters of the Dark Arts
  • 10. Concurrency With Akka • Inspired by Erlang’s Actor Model • Runs on the JVM • Actors define behavior to handle typed messages • Actors process one message at a time • Can use Group/Pool of actors behind routers for concurrency • Can run thousands of actos on a modern server • Location transparency for clustering • Supervision and state recovery for HA
  • 11. Batch Processing with Spark • Resilient Distributed Datasets • Fast in-memory big data • Map/Reduce on steroids • Iterative and interactive • Code in scala, java, python and now R • Streaming (DStreams - Batch processing on streams) • MLLib, Shark, Spark SQL, GraphX and more
  • 12. Web development with Play • Modern high velocity, highly scalable • Built on Akka and Netty (NIO) • Reactive in design (reactive I/O) • Async HTTP, streaming HTTP, Comet, Websockets, build your own protocol • Feature rich yet flexible
  • 13. Build with SBT • I hate writing XML • Very easy to get started • All the power of Scala in the build • Maven dependency management + more
  • 14. Testing with ScalaTest • Write specs not tests and excellent tool for BDD • Specs DSL very close to english • Many testing styles • Powerful matchers (“should be”) • Fixtures • Mock objects with ScalaMock
  • 15. Taking functional further • Shapeless • Scrap your boilerplate • Generic programming • Existential types • ScalaZ • Bringing Haskel to Scala • Monads, Functors and all the theory!
  • 16. Thank you! • http://www.tuplejump.com/ • http://github.com/tuplejump/ • http://tuplejump.github.com/calliope/ • http://tuplejump.github.com/stargate/ • @tuplejump on twitter