SlideShare a Scribd company logo
1 of 30
Download to read offline
SMACK Architectures
Building data processing platforms with
Spark, Mesos, Akka, Cassandra and Kafka
Anton Kirillov Big Data AW Meetup
Sep 2015
Who is this guy?
● Scala programmer
● Focused on distributed systems
● Building data platforms with SMACK/Hadoop
● Ph.D. in Computer Science
● Big Data engineer/consultant at Big Data AB
● Currently at Ooyala Stockholm (Videoplaza AB)
● Working with startups
2
Roadmap
● SMACK stack overview
● Storage layer layout
● Fixing NoSQL limitations
● Cluster resource management
● Reliable scheduling and execution
● Data ingestion options
● Preparing for failures
3
SMACK Stack
● Spark - fast and general engine for distributed, large-scale data
processing
● Mesos - cluster resource management system that provides efficient
resource isolation and sharing across distributed applications
● Akka - a toolkit and runtime for building highly concurrent, distributed,
and resilient message-driven applications on the JVM
● Cassandra - distributed, highly available database designed to handle
large amounts of data across multiple datacenters
● Kafka - a high-throughput, low-latency distributed messaging system
designed for handling real-time data feeds
4
Storage Layer: Cassandra
● optimized for heavy write
loads
● configurable CA (CAP)
● linearly scalable
● XDCR support
● easy cluster resizing and
inter-DC data migration
5
Cassandra Data Model
● nested sorted map
● should be optimized for
read queries
● data is distributed across
nodes by partition key
CREATE TABLE campaign(
id uuid,
year int,
month int,
day int,
views bigint,
clicks bigint,
PRIMARY KEY (id, year, month, day)
);
INSERT INTO campaign(id, year, month, day, views, clicks)
VALUES(40b08953-a…,2015, 9, 10, 1000, 42);
SELECT views, clicks FROM campaign
WHERE id=40b08953-a… and year=2015 and month>8; 6
Spark/Cassandra Example
● calculate total views per
campaign for given month
for all campaigns
CREATE TABLE event(
id uuid,
ad_id uuid,
campaign uuid,
ts bigint,
type text,
PRIMARY KEY(id)
);
val sc = new SparkContext(conf)
case class Event(id: UUID, ad_id: UUID, campaign: UUID, ts: Long, `type`: String)
sc.cassandraTable[Event]("keyspace", "event")
.filter(e => e.`type` == "view" && checkMonth(e.ts))
.map(e => (e.campaign, 1))
.reduceByKey(_ + _)
.collect() 7
Naive Lambda example with Spark SQL
case class CampaignReport(id: String, views: Long, clicks: Long)
sql("""SELECT campaign.id as id, campaign.views as views,
campaign.clicks as clicks, event.type as type
FROM campaign
JOIN event ON campaign.id = event.campaign
""").rdd
.groupBy(row => row.getAs[String]("id"))
.map{ case (id, rows) =>
val views = rows.head.getAs[Long]("views")
val clicks = rows.head.getAs[Long]("clicks")
val res = rows.groupBy(row => row.getAs[String]("type")).mapValues(_.size)
CampaignReport(id, views = views + res("view"), clicks = clicks + res("click"))
}.saveToCassandra(“keyspace”, “campaign_report”)
8
Let’s take a step back: Spark Basics
● RDD operations(transformations and actions) form DAG
● DAG is split into stages of tasks which are then submitted to cluster manager
● stages combine tasks which don’t require shuffling/repartitioning
● tasks run on workers and results then return to client 9
Architecture of Spark/Cassandra Clusters
Separate Write & Analytics:
● clusters can be scaled
independently
● data is replicated by
Cassandra asynchronously
● Analytics has different
Read/Write load patterns
● Analytics contains additional
data and processing results
● Spark resource impact
limited to only one DC
To fully facilitate Spark-C* connector data locality awareness,
Spark workers should be collocated with Cassandra nodes 10
Spark Applications Deployment Revisited
Cluster Manager:
● Spark Standalone
● YARN
● Mesos
11
Managing Cluster Resources: Mesos
● heterogenous workloads
● full cluster utilization
● static vs. dynamic resource
allocation
● fault tolerance and disaster
recovery
● single resource view at
datacenter levelimage source: http://www.slideshare.net/caniszczyk/apache-mesos-at-twitter-texas-linuxfest-2014
12
Mesos Architecture Overview
● leader election and
service discovery via
ZooKeeper
● slaves publish available
resources to master
● master sends resource
offers to frameworks
● scheduler replies with
tasks and resources
needed per task
● master sends tasks to
slaves
13
Bringing Spark, Mesos and Cassandra Together
Deployment example
● Mesos Masters and
ZooKeepers collocated
● Mesos Slaves and Cassandra
nodes collocated to enforce
better data locality for Spark
● Spark binaries deployed to all
worker nodes and spark-env is
configured
● Spark Executor JAR uploaded
to S3
Invocation example
spark-submit --class io.datastrophic.SparkJob /etc/jobs/spark-jobs.jar
14
Marathon
● long running tasks
execution
● HA mode with ZooKeeper
● Docker executor
● REST API
15
Chronos
● distributed cron
● HA mode with ZooKeeper
● supports graphs of jobs
● sensitive to network failures
16
More Mesos frameworks
● Hadoop
● Cassandra
● Kafka
● Myriad: YARN on Mesos
● Storm
● Samza
17
Data ingestion: endpoints to consume the data
Endpoint requirements:
● high throughput
● resiliency
● easy scalability
● back pressure 18
Akka features
class JsonParserActor extends Actor {
def receive = {
case s: String => Try(Json.parse(s).as[Event]) match {
case Failure(ex) => log.error(ex)
case Success(event) => sender ! event
}
}
}
class HttpActor extends Actor {
def receive = {
case req: HttpRequest =>
system.actorOf(Props[JsonParserActor]) ! req.body
case e: Event =>
system.actorOf(Props[CassandraWriterActor]) ! e
}
}
● actor model
implementation for JVM
● message-based and
asynchronous
● no shared mutable state
● easy scalability from one
process to cluster of
machines
● actor hierarchies with
parental supervision
● not only concurrency
framework:
○ akka-http
○ akka-streams
○ akka-persistence
19
Writing to Cassandra with Akka
class CassandraWriterActor extends Actor with ActorLogging {
//for demo purposes, session initialized here
val session = Cluster.builder()
.addContactPoint("cassandra.host")
.build()
.connect()
override def receive: Receive = {
case event: Event =>
val statement = new SimpleStatement(event.createQuery)
.setConsistencyLevel(ConsistencyLevel.QUORUM)
Try(session.execute(statement)) match {
case Failure(ex) => //error handling code
case Success => sender ! WriteSuccessfull
}
}
} 20
Cassandra meets Batch Processing
● writing raw data (events) to Cassandra with Akka is easy
● but computation time of aggregations/rollups will grow with
amount of data
● Cassandra is still designed for fast serving but not batch
processing, so pre-aggregation of incoming data is needed
● actors are not suitable for performing aggregation due to
stateless design model
● micro-batches partially solve the problem
● reliable storage for raw data is still needed
21
Kafka: distributed commit log
● pre-aggregation of incoming data
● consumers read data in batches
● available as Kinesis on AWS
22
Publishing to Kafka with Akka Http
val config = new ProducerConfig(KafkaConfig())
lazy val producer = new KafkaProducer[A, A](config)
val topic = “raw_events”
val routes: Route = {
post{
decodeRequest{
entity(as[String]){ str =>
JsonParser.parse(str).validate[Event] match {
case s: JsSuccess[String] => producer.send(new KeyedMessage(topic, str))
case e: JsError => BadRequest -> JsError.toFlatJson(e).toString()
}
}
}
}
}
object AkkaHttpMicroservice extends App with Service {
Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.
port"))
}
23
Spark Streaming
24
● variety of data sources
● at-least-once semantics
● exactly-once semantics
available with Kafka Direct
and idempotent storage
Spark Streaming: Kinesis example
val ssc = new StreamingContext(conf, Seconds(10))
val kinesisStream = KinesisUtils.createStream(ssc,appName,streamName,
endpointURL,regionName, InitialPositionInStream.LATEST,
Duration(checkpointInterval), StorageLevel.MEMORY_ONLY)
}
//transforming given stream to Event and saving to C*
kinesisStream.map(JsonUtils.byteArrayToEvent)
.saveToCassandra(keyspace, table)
ssc.start()
ssc.awaitTermination()
25
Designing for Failure: Backups and Patching
● be prepared for failures and broken data
● design backup and patching strategies upfront
● idempotece should be enforced 26
Restoring backup from S3
val sc = new SparkContext(conf)
sc.textFile(s"s3n://bucket/2015/*/*.gz")
.map(s => Try(JsonUtils.stringToEvent(s)))
.filter(_.isSuccess).map(_.get)
.saveToCassandra(config.keyspace, config.table)
27
The big picture
28
So what SMACK is
● concise toolbox for wide variety of data processing scenarios
● battle-tested and widely used software with large communities
● easy scalability and replication of data while preserving low latencies
● unified cluster management for heterogeneous loads
● single platform for any kind of applications
● implementation platform for different architecture designs
● really short time-to-market (e.g. for MVP verification)
29
Questions
@antonkirillov datastrophic.io
30

More Related Content

What's hot

Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Helena Edelson
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosPaco Nathan
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Building a fully-automated Fast Data Platform
Building a fully-automated Fast Data PlatformBuilding a fully-automated Fast Data Platform
Building a fully-automated Fast Data PlatformManuel Sehlinger
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkLegacy Typesafe (now Lightbend)
 
Lambda Architecture Using SQL
Lambda Architecture Using SQLLambda Architecture Using SQL
Lambda Architecture Using SQLSATOSHI TAGOMORI
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK StackKnoldus Inc.
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbotyieldbot
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Summit
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Tugdual Grall
 

What's hot (20)

Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Lambda architecture
Lambda architectureLambda architecture
Lambda architecture
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Building a fully-automated Fast Data Platform
Building a fully-automated Fast Data PlatformBuilding a fully-automated Fast Data Platform
Building a fully-automated Fast Data Platform
 
The How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache SparkThe How and Why of Fast Data Analytics with Apache Spark
The How and Why of Fast Data Analytics with Apache Spark
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Lambda Architecture Using SQL
Lambda Architecture Using SQLLambda Architecture Using SQL
Lambda Architecture Using SQL
 
Kick-Start with SMACK Stack
Kick-Start with SMACK StackKick-Start with SMACK Stack
Kick-Start with SMACK Stack
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 

Viewers also liked

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSpark Summit
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelinesPatrick McFadin
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internalsAnton Kirillov
 
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...DataStax
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP serverKazuho Oku
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration WarsKarl Isenberg
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersBrendan Gregg
 

Viewers also liked (19)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
 
Laying down the smack on your data pipelines
Laying down the smack on your data pipelinesLaying down the smack on your data pipelines
Laying down the smack on your data pipelines
 
Data processing platforms with SMACK: Spark and Mesos internals
Data processing platforms with SMACK:  Spark and Mesos internalsData processing platforms with SMACK:  Spark and Mesos internals
Data processing platforms with SMACK: Spark and Mesos internals
 
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
 
Digitizing Europe
Digitizing EuropeDigitizing Europe
Digitizing Europe
 
H2O - the optimized HTTP server
H2O - the optimized HTTP serverH2O - the optimized HTTP server
H2O - the optimized HTTP server
 
Container Orchestration Wars
Container Orchestration WarsContainer Orchestration Wars
Container Orchestration Wars
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 

Similar to Data processing platforms architectures with Spark, Mesos, Akka, Cassandra and Kafka

Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraDenis Dus
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!Guido Schmutz
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developersChristopher Batey
 
Event Sourcing - what could go wrong - Jfokus 2022
Event Sourcing - what could go wrong - Jfokus 2022Event Sourcing - what could go wrong - Jfokus 2022
Event Sourcing - what could go wrong - Jfokus 2022Andrzej Ludwikowski
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservicesmarius_bogoevici
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra EnvironmentJim Hatcher
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingSantosh Sahoo
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Event sourcing - what could possibly go wrong ? Devoxx PL 2021
Event sourcing  - what could possibly go wrong ? Devoxx PL 2021Event sourcing  - what could possibly go wrong ? Devoxx PL 2021
Event sourcing - what could possibly go wrong ? Devoxx PL 2021Andrzej Ludwikowski
 

Similar to Data processing platforms architectures with Spark, Mesos, Akka, Cassandra and Kafka (20)

Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
Event Sourcing - what could go wrong - Jfokus 2022
Event Sourcing - what could go wrong - Jfokus 2022Event Sourcing - what could go wrong - Jfokus 2022
Event Sourcing - what could go wrong - Jfokus 2022
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservices
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Event sourcing - what could possibly go wrong ? Devoxx PL 2021
Event sourcing  - what could possibly go wrong ? Devoxx PL 2021Event sourcing  - what could possibly go wrong ? Devoxx PL 2021
Event sourcing - what could possibly go wrong ? Devoxx PL 2021
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 

Recently uploaded

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...Akihiro Suda
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 

Recently uploaded (20)

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 

Data processing platforms architectures with Spark, Mesos, Akka, Cassandra and Kafka

  • 1. SMACK Architectures Building data processing platforms with Spark, Mesos, Akka, Cassandra and Kafka Anton Kirillov Big Data AW Meetup Sep 2015
  • 2. Who is this guy? ● Scala programmer ● Focused on distributed systems ● Building data platforms with SMACK/Hadoop ● Ph.D. in Computer Science ● Big Data engineer/consultant at Big Data AB ● Currently at Ooyala Stockholm (Videoplaza AB) ● Working with startups 2
  • 3. Roadmap ● SMACK stack overview ● Storage layer layout ● Fixing NoSQL limitations ● Cluster resource management ● Reliable scheduling and execution ● Data ingestion options ● Preparing for failures 3
  • 4. SMACK Stack ● Spark - fast and general engine for distributed, large-scale data processing ● Mesos - cluster resource management system that provides efficient resource isolation and sharing across distributed applications ● Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM ● Cassandra - distributed, highly available database designed to handle large amounts of data across multiple datacenters ● Kafka - a high-throughput, low-latency distributed messaging system designed for handling real-time data feeds 4
  • 5. Storage Layer: Cassandra ● optimized for heavy write loads ● configurable CA (CAP) ● linearly scalable ● XDCR support ● easy cluster resizing and inter-DC data migration 5
  • 6. Cassandra Data Model ● nested sorted map ● should be optimized for read queries ● data is distributed across nodes by partition key CREATE TABLE campaign( id uuid, year int, month int, day int, views bigint, clicks bigint, PRIMARY KEY (id, year, month, day) ); INSERT INTO campaign(id, year, month, day, views, clicks) VALUES(40b08953-a…,2015, 9, 10, 1000, 42); SELECT views, clicks FROM campaign WHERE id=40b08953-a… and year=2015 and month>8; 6
  • 7. Spark/Cassandra Example ● calculate total views per campaign for given month for all campaigns CREATE TABLE event( id uuid, ad_id uuid, campaign uuid, ts bigint, type text, PRIMARY KEY(id) ); val sc = new SparkContext(conf) case class Event(id: UUID, ad_id: UUID, campaign: UUID, ts: Long, `type`: String) sc.cassandraTable[Event]("keyspace", "event") .filter(e => e.`type` == "view" && checkMonth(e.ts)) .map(e => (e.campaign, 1)) .reduceByKey(_ + _) .collect() 7
  • 8. Naive Lambda example with Spark SQL case class CampaignReport(id: String, views: Long, clicks: Long) sql("""SELECT campaign.id as id, campaign.views as views, campaign.clicks as clicks, event.type as type FROM campaign JOIN event ON campaign.id = event.campaign """).rdd .groupBy(row => row.getAs[String]("id")) .map{ case (id, rows) => val views = rows.head.getAs[Long]("views") val clicks = rows.head.getAs[Long]("clicks") val res = rows.groupBy(row => row.getAs[String]("type")).mapValues(_.size) CampaignReport(id, views = views + res("view"), clicks = clicks + res("click")) }.saveToCassandra(“keyspace”, “campaign_report”) 8
  • 9. Let’s take a step back: Spark Basics ● RDD operations(transformations and actions) form DAG ● DAG is split into stages of tasks which are then submitted to cluster manager ● stages combine tasks which don’t require shuffling/repartitioning ● tasks run on workers and results then return to client 9
  • 10. Architecture of Spark/Cassandra Clusters Separate Write & Analytics: ● clusters can be scaled independently ● data is replicated by Cassandra asynchronously ● Analytics has different Read/Write load patterns ● Analytics contains additional data and processing results ● Spark resource impact limited to only one DC To fully facilitate Spark-C* connector data locality awareness, Spark workers should be collocated with Cassandra nodes 10
  • 11. Spark Applications Deployment Revisited Cluster Manager: ● Spark Standalone ● YARN ● Mesos 11
  • 12. Managing Cluster Resources: Mesos ● heterogenous workloads ● full cluster utilization ● static vs. dynamic resource allocation ● fault tolerance and disaster recovery ● single resource view at datacenter levelimage source: http://www.slideshare.net/caniszczyk/apache-mesos-at-twitter-texas-linuxfest-2014 12
  • 13. Mesos Architecture Overview ● leader election and service discovery via ZooKeeper ● slaves publish available resources to master ● master sends resource offers to frameworks ● scheduler replies with tasks and resources needed per task ● master sends tasks to slaves 13
  • 14. Bringing Spark, Mesos and Cassandra Together Deployment example ● Mesos Masters and ZooKeepers collocated ● Mesos Slaves and Cassandra nodes collocated to enforce better data locality for Spark ● Spark binaries deployed to all worker nodes and spark-env is configured ● Spark Executor JAR uploaded to S3 Invocation example spark-submit --class io.datastrophic.SparkJob /etc/jobs/spark-jobs.jar 14
  • 15. Marathon ● long running tasks execution ● HA mode with ZooKeeper ● Docker executor ● REST API 15
  • 16. Chronos ● distributed cron ● HA mode with ZooKeeper ● supports graphs of jobs ● sensitive to network failures 16
  • 17. More Mesos frameworks ● Hadoop ● Cassandra ● Kafka ● Myriad: YARN on Mesos ● Storm ● Samza 17
  • 18. Data ingestion: endpoints to consume the data Endpoint requirements: ● high throughput ● resiliency ● easy scalability ● back pressure 18
  • 19. Akka features class JsonParserActor extends Actor { def receive = { case s: String => Try(Json.parse(s).as[Event]) match { case Failure(ex) => log.error(ex) case Success(event) => sender ! event } } } class HttpActor extends Actor { def receive = { case req: HttpRequest => system.actorOf(Props[JsonParserActor]) ! req.body case e: Event => system.actorOf(Props[CassandraWriterActor]) ! e } } ● actor model implementation for JVM ● message-based and asynchronous ● no shared mutable state ● easy scalability from one process to cluster of machines ● actor hierarchies with parental supervision ● not only concurrency framework: ○ akka-http ○ akka-streams ○ akka-persistence 19
  • 20. Writing to Cassandra with Akka class CassandraWriterActor extends Actor with ActorLogging { //for demo purposes, session initialized here val session = Cluster.builder() .addContactPoint("cassandra.host") .build() .connect() override def receive: Receive = { case event: Event => val statement = new SimpleStatement(event.createQuery) .setConsistencyLevel(ConsistencyLevel.QUORUM) Try(session.execute(statement)) match { case Failure(ex) => //error handling code case Success => sender ! WriteSuccessfull } } } 20
  • 21. Cassandra meets Batch Processing ● writing raw data (events) to Cassandra with Akka is easy ● but computation time of aggregations/rollups will grow with amount of data ● Cassandra is still designed for fast serving but not batch processing, so pre-aggregation of incoming data is needed ● actors are not suitable for performing aggregation due to stateless design model ● micro-batches partially solve the problem ● reliable storage for raw data is still needed 21
  • 22. Kafka: distributed commit log ● pre-aggregation of incoming data ● consumers read data in batches ● available as Kinesis on AWS 22
  • 23. Publishing to Kafka with Akka Http val config = new ProducerConfig(KafkaConfig()) lazy val producer = new KafkaProducer[A, A](config) val topic = “raw_events” val routes: Route = { post{ decodeRequest{ entity(as[String]){ str => JsonParser.parse(str).validate[Event] match { case s: JsSuccess[String] => producer.send(new KeyedMessage(topic, str)) case e: JsError => BadRequest -> JsError.toFlatJson(e).toString() } } } } } object AkkaHttpMicroservice extends App with Service { Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http. port")) } 23
  • 24. Spark Streaming 24 ● variety of data sources ● at-least-once semantics ● exactly-once semantics available with Kafka Direct and idempotent storage
  • 25. Spark Streaming: Kinesis example val ssc = new StreamingContext(conf, Seconds(10)) val kinesisStream = KinesisUtils.createStream(ssc,appName,streamName, endpointURL,regionName, InitialPositionInStream.LATEST, Duration(checkpointInterval), StorageLevel.MEMORY_ONLY) } //transforming given stream to Event and saving to C* kinesisStream.map(JsonUtils.byteArrayToEvent) .saveToCassandra(keyspace, table) ssc.start() ssc.awaitTermination() 25
  • 26. Designing for Failure: Backups and Patching ● be prepared for failures and broken data ● design backup and patching strategies upfront ● idempotece should be enforced 26
  • 27. Restoring backup from S3 val sc = new SparkContext(conf) sc.textFile(s"s3n://bucket/2015/*/*.gz") .map(s => Try(JsonUtils.stringToEvent(s))) .filter(_.isSuccess).map(_.get) .saveToCassandra(config.keyspace, config.table) 27
  • 29. So what SMACK is ● concise toolbox for wide variety of data processing scenarios ● battle-tested and widely used software with large communities ● easy scalability and replication of data while preserving low latencies ● unified cluster management for heterogeneous loads ● single platform for any kind of applications ● implementation platform for different architecture designs ● really short time-to-market (e.g. for MVP verification) 29