SlideShare a Scribd company logo
1 of 57
Download to read offline
Martin Zapletal @zapletal_martin
Cake Solutions @cakesolutions
#CassandraSummit
Presented by Anirvan Chakraborty @anirvan_c
● Introduction
● Event sourcing and CQRS
● An emerging technology stack to handle data
● A reference application and it’s architecture
● A few use cases of the reference application
● Conclusion
● Increasing importance of data analytics
● Current state
○ Destructive updates
○ Analytics tools with poor scalability and integration
○ Manual processes
○ Slow iterations
○ Not suitable for large amounts of data
● Whole lifecycle of data
● Data processing
● Data stores
● Integration and messaging
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning
● Spark, Mesos, Akka, Cassandra, Kafka (SMACK, Infinity)
ACID Mutable State
● Create, Read, Update, Delete
● Exposes mutable internal state
● Many read methods on repositories
● Mapping of data model and objects (impedance mismatch)
● No auditing
● No separation of concerns (read / write, command / event)
● Strongly consistent
● Difficult optimizations of reads / writes
● Difficult to scale
● Intent, behaviour, history, is lost
Balance = 5
Balance = 10
Update
Account
Balance = 10
Account
[1]
CQRS
Client
QueryCommand
DBDB
Denormalise
/Precompute
Kappa architecture
Batch-Pipeline
Kafka
Allyourdata
NoSQL
SQL
Spark
Client
Client
Client Views
Stream
processor
Flume
Scoop
Hive
Impala
Oozie
HDFS
Lambda Architecture
Batch Layer Serving
Layer
Stream layer (fast)
Query
Query
Allyourdata Serving DB
[2, 3]
● Append only data store
● No updates or deletes (rewriting history)
● Immutable data model
● Decouples data model of the application and storage
● Current state not persisted, but derived. A sequence of updates that led to it.
● History, state known at any point in time
● Replayable
● Source of truth
● Optimisations possible
● Works well in distributed environment - easy partitioning, conflicts
● Helps avoiding transactions
● Works well with DDD
userId date change
1
1
1
10/10/2015
11/10/2015
23/10/2015
+300
-100
-200
1 24/10/2015 +100
balanceChanged
event
balanceChanged
balanceChanged
balanceChanged
Event journal
● Command Query Responsibility Segregation
● Read and write logically and physically separated
● Reasoning about the application
● Clear separation of concerns (business logic)
● Often different technology, scalability
● Often lower consistency - eventual, causal
Command
● Write side
● Messages, requests to mutate state
● Behaviour, serialized method call essentially
● Don’t expose state
● Validated and may be rejected or emit one or more events (e.g. submitting a form)
Event
● Write side
● Immutable
● Indicating something that has happened
● Atomic record of state change
● Audit log
Query
● Read side
● Precomputed
userId = 1
updateBalance
(+100)
Write
Command
Event
userId date change
1
1
1
10/10/2015
11/10/2015
23/10/2015
+300
-100
-200
1 24/10/2015 +100
balance
Changed
event
balance
Changed
balance
Changed
balance
Changed
Event journal
Command
handler
Read
balance
1 100
userId = 1
balance = 100
Query
userId
● Partial order of events for each entity
● Operation semantics, CRDTs
UserNameUpdated(B)
UserNameUpdated(B)
UserNameUpdated(A)
UserNameUpdated(A)
● Localization
● Conflicting concurrent histories
○ Resubmission
○ Deduplication
○ Replication
● Identifier
● Version
● Timestamp
● Vector clock
● Actor framework for truly concurrent and distributed systems
● Thread safe mutable state - consistency boundary
● Domain modelling, distributed state
● Simple programming model - asynchronously send messages, create
new actors, change behaviour
● Supports CQRS/ES
● Fully distributed - asynchronous, delivery guarantees, failures, time
and order, consistency, availability, communication patterns, data
locality, persistence, durability, concurrent updates, conflicts,
divergence, invariants, ...
?
?
? + 1
? + 1
? + 2
UserId = 1
Name = Bob
BankAccountId = 1
Balance = 1000
UserId = 1
Name = Alice
● Distributed domain modelling
● In memory
● Ordering, consistency
id = 1
● Actor backed by data store
● Immutable event sourced journal
● Supports CQRS (write and read side)
● Persistence, replay on failure, rebalance, at least once delivery
user1, event 2
user1, event 3
user1, event 4
user1, event 1
class UserActor extends PersistentActor {
override def persistenceId: String = UserPersistenceId(self.path.name).persistenceId
override def receiveCommand: Receive = notRegistered(DistributedData(context.system).replicator)
def notRegistered(distributedData: ActorRef): Receive = {
case cmd: AccountCommand =>
persist(AccountEvent(cmd.account)){ acc =>
context.become(registered(acc))
sender() ! /-()
}
}
def registered(account: Account): Receive = {
case eres @ EntireResistanceExerciseSession(id, session, sets, examples, deviations) =>
persist(eres)(data => sender() ! /-(id))
}
override def receiveRecover: Receive = {
...
}
}
● Akka Persistence Cassandra journal
○ Globally distributed journal
○ Scalable, resilient, highly available
○ Performant, operational database
● Community plugins
akka {
persistence {
journal.plugin = "cassandra-journal"
snapshot-store.plugin = "cassandra-snapshot-store"
}
}
● Partition-size
● Events in each cluster partition ordered (persistenceId - partition pair)
CREATE TABLE IF NOT EXISTS ${tableName} (
processor_id text,
partition_nr bigint,
sequence_nr bigint,
marker text,
message blob,
PRIMARY KEY ((processor_id, partition_nr),
sequence_nr, marker))
WITH COMPACT STORAGE
AND gc_grace_seconds = ${config.
gc_grace_seconds}
processor_id partition_nr sequence_nr marker message
user-1 0 0 H 0x0a6643b334...
user-1 0 1 A 0x0ab2020801...
user-1 0 2 A 0x0a98020801...
● Internal state, moment in time
● Read optimization
CREATE TABLE IF NOT EXISTS ${tableName} (
processor_id text,
sequence_nr bigint,
timestamp bigint,
snapshot blob,
PRIMARY KEY (processor_id, sequence_nr))
WITH CLUSTERING ORDER BY (sequence_nr DESC)
processor_id sequence_nr snapshot timestamp
user-1 16 0x0400000001... 1441696908210
user-1 20 0x0400000001... 1441697587765
● Uses Akka serialization
0x0a6643b334 …
PersistentRepr
Akka.Serialization
Payload: T
Protobuff
actor {
serialization-bindings {
"io.muvr.exercise.ExercisePlanDeviation" = kryo,
"io.muvr.exercise.ResistanceExercise" = kryo,
}
serializers {
java = "akka.serialization.JavaSerializer"
kryo = "com.twitter.chill.akka.AkkaSerializer"
}
}
class UserActorView(userId: String) extends PersistentView {
override def persistenceId: String = UserPersistenceId(userId).persistenceId
override def viewId: String = UserPersistenceId(userId).persistentViewId
override def autoUpdateInterval: FiniteDuration = FiniteDuration(100, TimeUnit.MILLISECONDS)
def receive: Receive = viewState(List.empty)
def viewState(processedDeviations: List[ExercisePlanProcessedDeviation]): Receive = {
case EntireResistanceExerciseSession(_, _, _, _, deviations) if isPersistent =>
context.become(viewState(deviations.filter(condition).map(process) ::: processedDeviations))
case GetProcessedDeviations => sender() ! processedDeviations
}
}
● Akka 2.4
● Potentially infinite stream of data
● Ordered, replayable, resumable
● Aggregation, transformation, moving data
● EventsByPersistenceId
● AllPersistenceids
● EventsByTag
val readJournal =
PersistenceQuery(system).readJournalFor(CassandraJournal.Identifier)
val source = readJournal.query(
EventsByPersistenceId(UserPersistenceId(name).persistenceId, 0, Long.MaxValue), NoRefresh)
.map(_.event)
.collect{ case s: EntireResistanceExerciseSession => s }
.mapConcat(_.deviations)
.filter(condition)
.map(process)
implicit val mat = ActorMaterializer()
val result = source.runFold(List.empty[ExercisePlanDeviation])((x, y) => y :: x)
● Potentially infinite stream of events
Source[Any].map(process).filter(condition)
Publisher Subscriber
process
condition
backpressure
● In Akka we have the read and write sides separated,
in Cassandra we don’t
● Different data model
● Avoid using operational datastore
● Eventual consistency
● Streaming transformations to different format
● Unify journalled and other data
● Computations and analytics queries on the data
● Often iterative, complex, expensive computations
● Prepared and interactive queries
● Data from multiple sources, joins and transformations
● Often directly on a stream of data
● Whole history of events
● Historical behaviour
● Works retrospectively, can answer questions in the future that we don’t
know exist yet
● Various data types from various sources
● Large amounts of fast data
● Automated analytics
● Cassandra 3.0 - user defined functions, functional indexes, aggregation
functions, materialized views
● Server side denormalization
● Eventual consistency
● Copy of data with different partitioning
userId
performance
● In memory dataflow distributed data processing framework, streaming
and batch
● Distributes computation using a higher level API
● Load balancing
● Moves computation to data
● Fault tolerant
● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations
○ Lazy, form the DAG
○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
○ Execute DAG, retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators
● Broadcast Variables
● Integration
● Streaming
● Machine Learning
● Graph Processing
textFile mapmap
reduceByKey
collect
sc.textFile("counts")
.map(line => line.split("t"))
.map(word => (word(0), word(1).toInt))
.reduceByKey(_ + _)
.collect()
[4]
Spark master
Spark worker
Cassandra
● Cassandra can store
● Spark can process
● Gathering large amounts of heterogeneous data
● Queries
● Transformations
● Complex computations
● Machine learning, data mining, analytics
● Now possible
● Prepared and interactive queries
lazy val sparkConf: SparkConf =
new SparkConf()
.setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)
val data = sc.cassandraTable[T]("keyspace", "table").select("columns")
val processedData = data.flatMap(...)...
processedData.saveToCassandra("keyspace", "table")
● Akka Analytics project
● Handles custom Akka serialization
case class JournalKey(persistenceId: String, partition: Long, sequenceNr: Long)
lazy val sparkConf: SparkConf =
new SparkConf()
.setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)
val events: RDD[(JournalKey, Any)] = sc.eventTable()
events.sortByKey().map(...).filter(...).collect().foreach(println)
● Spark streaming
● Precomputing using spark or replication often aiming for different data
model
Operational cluster Analytics cluster
Precomputation /
replication
Integration with
other data sources
val events: RDD[(JournalKey, Any)] = sc.eventTable().cache().filterClass[EntireResistanceExerciseSession].flatMap(_.deviations)
val deviationsFrequency = sqlContext.sql(
"""SELECT planned.exercise, hour(time), COUNT(1)
FROM exerciseDeviations
WHERE planned.exercise = 'bench press'
GROUP BY planned.exercise, hour(time)""")
val deviationsFrequency2 = exerciseDeviationsDF
.where(exerciseDeviationsDF("planned.exercise") === "bench press")
.groupBy(
exerciseDeviationsDF("planned.exercise"),
exerciseDeviationsDF("time”))
.count()
val deviationsFrequency3 = exerciseDeviations
.filter(_.planned.exercise == "bench press")
.groupBy(d => (d.planned.exercise, d.time.getHours))
.map(d => (d._1, d._2.size))
def toVector(user: User): mllib.linalg.Vector =
Vectors.dense(
user.frequency, user.performanceIndex, user.improvementIndex)
val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val users: RDD[User] = events.filterClass[User]
val kmeans = new KMeans()
.setK(5)
.set...
val clusters = kmeans.run(users.map(_.toVector))
val weight: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val exerciseDeviations = events
.filterClass[EntireResistanceExerciseSession]
.flatMap(session =>
session.sets.flatMap(set =>
set.sets.map(exercise => (session.id.id, exercise.exercise))))
.groupBy(e => e)
.map(g =>
Rating(normalize(g._1._1), normalize(g._1._2),
normalize(g._2.size)))
val model = new ALS().run(ratings)
val predictions = model.predict(recommend)
bench
press
bicep
curl
dead
lift
user 1 5 2
user 2 4 3
user 3 5 2
user 4 3 1
val events = sc.eventTable().cache().toDF()
val lr = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(new UserFilter(), new ZScoreNormalizer(),
new IntensityFeatureExtractor(), lr))
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.addGrid(lr.fitIntercept, Array(true, false))
getEligibleUsers(events, sessionEndedBefore)
.map { user =>
val trainValidationSplit = new TrainValidationSplit()
.setEstimator(pipeline)
.setEvaluator(new RegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
val model = trainValidationSplit.fit(
events,
ParamMap(ParamPair(userIdParam, user)))
val testData = // Prepare test data.
val predictions = model.transform(testData)
submitResult(userId, predictions, config)
}
val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
val connections = events.filterClass[Connections]
val vertices: RDD[(VertexId, Long)] =
connections.map(c => (c.id, 1l))
val edges: RDD[Edge[Long]] = connections
.flatMap(c => c.connections
.map(Edge(c.id, _, 1l)))
val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.0001).vertices
7 * Dumbbell
Alternating Curl
Data
Data
Preprocessing
Preprocessing
Features
Features
Training
Testing
Error %
● Exercise domain as an example
● Analytics of both batch (offline) and streaming (online) data
● Analytics important in other areas (banking, stock market, network,
cluster monitoring, business intelligence, commerce, internet of things, ...)
● Enabling value of data
● Event sourcing
● CQRS
● Technologies to handle the data
○ Spark
○ Mesos
○ Akka
○ Cassandra
○ Kafka
● Handling data
● Insights and analytics enable value in data
● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Martin Zapletal @zapletal_martin
● Anirvan Chakraborty @anirvan_c
[1] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
[2] http://malteschwarzkopf.de/research/assets/google-stack.pdf
[3] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
[4] http://www.slideshare.net/LisaHua/spark-overview-37479609

More Related Content

What's hot

AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + ElkVasil Remeniuk
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacApache Apex
 
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleZeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleScyllaDB
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flinkRenato Guimaraes
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Databricks
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceHBaseCon
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...DataStax
 
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...DataStax Academy
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...Data Con LA
 

What's hot (20)

AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Cassandra + Spark + Elk
Cassandra + Spark + ElkCassandra + Spark + Elk
Cassandra + Spark + Elk
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
 
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions ScaleZeotap: Moving to ScyllaDB - A Graph of Billions Scale
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
 
Stream processing - Apache flink
Stream processing - Apache flinkStream processing - Apache flink
Stream processing - Apache flink
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
 
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap...
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque...
 

Viewers also liked

Deep dive into event store using Apache Cassandra
Deep dive into event store using Apache CassandraDeep dive into event store using Apache Cassandra
Deep dive into event store using Apache CassandraAhmedabadJavaMeetup
 
CQRS and Event Sourcing with Akka, Cassandra and RabbitMQ
CQRS and Event Sourcing with Akka, Cassandra and RabbitMQCQRS and Event Sourcing with Akka, Cassandra and RabbitMQ
CQRS and Event Sourcing with Akka, Cassandra and RabbitMQMiel Donkers
 
Akka persistence == event sourcing in 30 minutes
Akka persistence == event sourcing in 30 minutesAkka persistence == event sourcing in 30 minutes
Akka persistence == event sourcing in 30 minutesKonrad Malawski
 
Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)Chris Richardson
 
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...Chris Richardson
 
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraAvoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraLuke Tillman
 
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase Create
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase CreateWebinar: Overcoming the Storage Challenges Cassandra and Couchbase Create
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase CreateStorage Switzerland
 
Scalable PHP Applications With Cassandra
Scalable PHP Applications With CassandraScalable PHP Applications With Cassandra
Scalable PHP Applications With CassandraAndrea De Pirro
 
Using Time Window Compaction Strategy For Time Series Workloads
Using Time Window Compaction Strategy For Time Series WorkloadsUsing Time Window Compaction Strategy For Time Series Workloads
Using Time Window Compaction Strategy For Time Series WorkloadsJeff Jirsa
 
Developing a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayDeveloping a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayJacob Park
 
Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Martin Zapletal
 
AppSync.org: open-source patterns and code for data synchronization in mobile...
AppSync.org: open-source patterns and code for data synchronization in mobile...AppSync.org: open-source patterns and code for data synchronization in mobile...
AppSync.org: open-source patterns and code for data synchronization in mobile...Niko Nelissen
 
Spreadshirt Platform - An Architectural Overview
Spreadshirt Platform - An Architectural OverviewSpreadshirt Platform - An Architectural Overview
Spreadshirt Platform - An Architectural OverviewJens Hadlich
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWSMigrating enterprise workloads to AWS
Migrating enterprise workloads to AWSTom Laszewski
 
Cassandra Summit 2015: Real World DTCS For Operators
Cassandra Summit 2015: Real World DTCS For OperatorsCassandra Summit 2015: Real World DTCS For Operators
Cassandra Summit 2015: Real World DTCS For OperatorsJeff Jirsa
 
Patterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservicesPatterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservicesRachel Reese
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormJohn Georgiadis
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthdaveconnors
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemTrey Grainger
 

Viewers also liked (20)

Deep dive into event store using Apache Cassandra
Deep dive into event store using Apache CassandraDeep dive into event store using Apache Cassandra
Deep dive into event store using Apache Cassandra
 
CQRS and Event Sourcing with Akka, Cassandra and RabbitMQ
CQRS and Event Sourcing with Akka, Cassandra and RabbitMQCQRS and Event Sourcing with Akka, Cassandra and RabbitMQ
CQRS and Event Sourcing with Akka, Cassandra and RabbitMQ
 
Akka persistence == event sourcing in 30 minutes
Akka persistence == event sourcing in 30 minutesAkka persistence == event sourcing in 30 minutes
Akka persistence == event sourcing in 30 minutes
 
Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)Developing functional domain models with event sourcing (sbtb, sbtb2015)
Developing functional domain models with event sourcing (sbtb, sbtb2015)
 
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...Developing event-driven microservices with event sourcing and CQRS  (svcc, sv...
Developing event-driven microservices with event sourcing and CQRS (svcc, sv...
 
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and CassandraAvoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
 
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase Create
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase CreateWebinar: Overcoming the Storage Challenges Cassandra and Couchbase Create
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase Create
 
Scalable PHP Applications With Cassandra
Scalable PHP Applications With CassandraScalable PHP Applications With Cassandra
Scalable PHP Applications With Cassandra
 
Using Time Window Compaction Strategy For Time Series Workloads
Using Time Window Compaction Strategy For Time Series WorkloadsUsing Time Window Compaction Strategy For Time Series Workloads
Using Time Window Compaction Strategy For Time Series Workloads
 
Developing a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayDeveloping a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and Spray
 
Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2Data in Motion: Streaming Static Data Efficiently 2
Data in Motion: Streaming Static Data Efficiently 2
 
AppSync.org: open-source patterns and code for data synchronization in mobile...
AppSync.org: open-source patterns and code for data synchronization in mobile...AppSync.org: open-source patterns and code for data synchronization in mobile...
AppSync.org: open-source patterns and code for data synchronization in mobile...
 
Spreadshirt Platform - An Architectural Overview
Spreadshirt Platform - An Architectural OverviewSpreadshirt Platform - An Architectural Overview
Spreadshirt Platform - An Architectural Overview
 
CQRS + Event Sourcing
CQRS + Event SourcingCQRS + Event Sourcing
CQRS + Event Sourcing
 
Migrating enterprise workloads to AWS
Migrating enterprise workloads to AWSMigrating enterprise workloads to AWS
Migrating enterprise workloads to AWS
 
Cassandra Summit 2015: Real World DTCS For Operators
Cassandra Summit 2015: Real World DTCS For OperatorsCassandra Summit 2015: Real World DTCS For Operators
Cassandra Summit 2015: Real World DTCS For Operators
 
Patterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservicesPatterns and practices for real-world event-driven microservices
Patterns and practices for real-world event-driven microservices
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per month
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 

Similar to Cassandra as an event sourced journal for big data analytics Cassandra Summit 2015

Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseBig Data Spain
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architecturesRegunath B
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
Challenges in building a Data Pipeline
Challenges in building a Data PipelineChallenges in building a Data Pipeline
Challenges in building a Data PipelineHevo Data Inc.
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data PipelineManish Kumar
 
An adaptive and eventually self healing framework for geo-distributed real-ti...
An adaptive and eventually self healing framework for geo-distributed real-ti...An adaptive and eventually self healing framework for geo-distributed real-ti...
An adaptive and eventually self healing framework for geo-distributed real-ti...Angad Singh
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsStreamNative
 
M|18 Analytics as a Service
M|18 Analytics as a ServiceM|18 Analytics as a Service
M|18 Analytics as a ServiceMariaDB plc
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsNavina Ramesh
 
Fluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconFluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconN Masahiro
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Databricks
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 

Similar to Cassandra as an event sourced journal for big data analytics Cassandra Summit 2015 (20)

Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Scalability truths and serverless architectures
Scalability truths and serverless architecturesScalability truths and serverless architectures
Scalability truths and serverless architectures
 
AmazonRedshift
AmazonRedshiftAmazonRedshift
AmazonRedshift
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Challenges in building a Data Pipeline
Challenges in building a Data PipelineChallenges in building a Data Pipeline
Challenges in building a Data Pipeline
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
An adaptive and eventually self healing framework for geo-distributed real-ti...
An adaptive and eventually self healing framework for geo-distributed real-ti...An adaptive and eventually self healing framework for geo-distributed real-ti...
An adaptive and eventually self healing framework for geo-distributed real-ti...
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
M|18 Analytics as a Service
M|18 Analytics as a ServiceM|18 Analytics as a Service
M|18 Analytics as a Service
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
 
Siddhi - cloud-native stream processor
Siddhi - cloud-native stream processorSiddhi - cloud-native stream processor
Siddhi - cloud-native stream processor
 
Fluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconFluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at Kubecon
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 

More from Martin Zapletal

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience Martin Zapletal
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveMartin Zapletal
 
Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System OptimizationsMartin Zapletal
 
Intelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsIntelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsMartin Zapletal
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyMartin Zapletal
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformMartin Zapletal
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 

More from Martin Zapletal (9)

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
 
Customer experience at disney+ through data perspective
Customer experience at disney+ through data perspectiveCustomer experience at disney+ through data perspective
Customer experience at disney+ through data perspective
 
Intelligent System Optimizations
Intelligent System OptimizationsIntelligent System Optimizations
Intelligent System Optimizations
 
Intelligent Distributed Systems Optimizations
Intelligent Distributed Systems OptimizationsIntelligent Distributed Systems Optimizations
Intelligent Distributed Systems Optimizations
 
Data in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data EfficientlyData in Motion: Streaming Static Data Efficiently
Data in Motion: Streaming Static Data Efficiently
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 

Recently uploaded

SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 

Recently uploaded (20)

SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 

Cassandra as an event sourced journal for big data analytics Cassandra Summit 2015

  • 1. Martin Zapletal @zapletal_martin Cake Solutions @cakesolutions #CassandraSummit Presented by Anirvan Chakraborty @anirvan_c
  • 2. ● Introduction ● Event sourcing and CQRS ● An emerging technology stack to handle data ● A reference application and it’s architecture ● A few use cases of the reference application ● Conclusion
  • 3. ● Increasing importance of data analytics ● Current state ○ Destructive updates ○ Analytics tools with poor scalability and integration ○ Manual processes ○ Slow iterations ○ Not suitable for large amounts of data
  • 4. ● Whole lifecycle of data ● Data processing ● Data stores ● Integration and messaging ● Distributed computing primitives ● Cluster managers and task schedulers ● Deployment, configuration management and DevOps ● Data analytics and machine learning ● Spark, Mesos, Akka, Cassandra, Kafka (SMACK, Infinity)
  • 6. ● Create, Read, Update, Delete ● Exposes mutable internal state ● Many read methods on repositories ● Mapping of data model and objects (impedance mismatch) ● No auditing ● No separation of concerns (read / write, command / event) ● Strongly consistent ● Difficult optimizations of reads / writes ● Difficult to scale ● Intent, behaviour, history, is lost Balance = 5 Balance = 10 Update Account Balance = 10 Account
  • 9. ● Append only data store ● No updates or deletes (rewriting history) ● Immutable data model ● Decouples data model of the application and storage ● Current state not persisted, but derived. A sequence of updates that led to it. ● History, state known at any point in time ● Replayable ● Source of truth ● Optimisations possible ● Works well in distributed environment - easy partitioning, conflicts ● Helps avoiding transactions ● Works well with DDD
  • 10. userId date change 1 1 1 10/10/2015 11/10/2015 23/10/2015 +300 -100 -200 1 24/10/2015 +100 balanceChanged event balanceChanged balanceChanged balanceChanged Event journal
  • 11. ● Command Query Responsibility Segregation ● Read and write logically and physically separated ● Reasoning about the application ● Clear separation of concerns (business logic) ● Often different technology, scalability ● Often lower consistency - eventual, causal
  • 12. Command ● Write side ● Messages, requests to mutate state ● Behaviour, serialized method call essentially ● Don’t expose state ● Validated and may be rejected or emit one or more events (e.g. submitting a form) Event ● Write side ● Immutable ● Indicating something that has happened ● Atomic record of state change ● Audit log Query ● Read side ● Precomputed
  • 13. userId = 1 updateBalance (+100) Write Command Event userId date change 1 1 1 10/10/2015 11/10/2015 23/10/2015 +300 -100 -200 1 24/10/2015 +100 balance Changed event balance Changed balance Changed balance Changed Event journal Command handler Read balance 1 100 userId = 1 balance = 100 Query userId
  • 14. ● Partial order of events for each entity ● Operation semantics, CRDTs UserNameUpdated(B) UserNameUpdated(B) UserNameUpdated(A) UserNameUpdated(A)
  • 15. ● Localization ● Conflicting concurrent histories ○ Resubmission ○ Deduplication ○ Replication ● Identifier ● Version ● Timestamp ● Vector clock
  • 16.
  • 17.
  • 18. ● Actor framework for truly concurrent and distributed systems ● Thread safe mutable state - consistency boundary ● Domain modelling, distributed state ● Simple programming model - asynchronously send messages, create new actors, change behaviour ● Supports CQRS/ES ● Fully distributed - asynchronous, delivery guarantees, failures, time and order, consistency, availability, communication patterns, data locality, persistence, durability, concurrent updates, conflicts, divergence, invariants, ...
  • 19. ? ? ? + 1 ? + 1 ? + 2 UserId = 1 Name = Bob BankAccountId = 1 Balance = 1000 UserId = 1 Name = Alice
  • 20. ● Distributed domain modelling ● In memory ● Ordering, consistency id = 1
  • 21. ● Actor backed by data store ● Immutable event sourced journal ● Supports CQRS (write and read side) ● Persistence, replay on failure, rebalance, at least once delivery
  • 22. user1, event 2 user1, event 3 user1, event 4 user1, event 1
  • 23. class UserActor extends PersistentActor { override def persistenceId: String = UserPersistenceId(self.path.name).persistenceId override def receiveCommand: Receive = notRegistered(DistributedData(context.system).replicator) def notRegistered(distributedData: ActorRef): Receive = { case cmd: AccountCommand => persist(AccountEvent(cmd.account)){ acc => context.become(registered(acc)) sender() ! /-() } } def registered(account: Account): Receive = { case eres @ EntireResistanceExerciseSession(id, session, sets, examples, deviations) => persist(eres)(data => sender() ! /-(id)) } override def receiveRecover: Receive = { ... } }
  • 24. ● Akka Persistence Cassandra journal ○ Globally distributed journal ○ Scalable, resilient, highly available ○ Performant, operational database ● Community plugins akka { persistence { journal.plugin = "cassandra-journal" snapshot-store.plugin = "cassandra-snapshot-store" } }
  • 25. ● Partition-size ● Events in each cluster partition ordered (persistenceId - partition pair) CREATE TABLE IF NOT EXISTS ${tableName} ( processor_id text, partition_nr bigint, sequence_nr bigint, marker text, message blob, PRIMARY KEY ((processor_id, partition_nr), sequence_nr, marker)) WITH COMPACT STORAGE AND gc_grace_seconds = ${config. gc_grace_seconds} processor_id partition_nr sequence_nr marker message user-1 0 0 H 0x0a6643b334... user-1 0 1 A 0x0ab2020801... user-1 0 2 A 0x0a98020801...
  • 26. ● Internal state, moment in time ● Read optimization CREATE TABLE IF NOT EXISTS ${tableName} ( processor_id text, sequence_nr bigint, timestamp bigint, snapshot blob, PRIMARY KEY (processor_id, sequence_nr)) WITH CLUSTERING ORDER BY (sequence_nr DESC) processor_id sequence_nr snapshot timestamp user-1 16 0x0400000001... 1441696908210 user-1 20 0x0400000001... 1441697587765
  • 27. ● Uses Akka serialization 0x0a6643b334 … PersistentRepr Akka.Serialization Payload: T Protobuff actor { serialization-bindings { "io.muvr.exercise.ExercisePlanDeviation" = kryo, "io.muvr.exercise.ResistanceExercise" = kryo, } serializers { java = "akka.serialization.JavaSerializer" kryo = "com.twitter.chill.akka.AkkaSerializer" } }
  • 28. class UserActorView(userId: String) extends PersistentView { override def persistenceId: String = UserPersistenceId(userId).persistenceId override def viewId: String = UserPersistenceId(userId).persistentViewId override def autoUpdateInterval: FiniteDuration = FiniteDuration(100, TimeUnit.MILLISECONDS) def receive: Receive = viewState(List.empty) def viewState(processedDeviations: List[ExercisePlanProcessedDeviation]): Receive = { case EntireResistanceExerciseSession(_, _, _, _, deviations) if isPersistent => context.become(viewState(deviations.filter(condition).map(process) ::: processedDeviations)) case GetProcessedDeviations => sender() ! processedDeviations } }
  • 29. ● Akka 2.4 ● Potentially infinite stream of data ● Ordered, replayable, resumable ● Aggregation, transformation, moving data ● EventsByPersistenceId ● AllPersistenceids ● EventsByTag
  • 30. val readJournal = PersistenceQuery(system).readJournalFor(CassandraJournal.Identifier) val source = readJournal.query( EventsByPersistenceId(UserPersistenceId(name).persistenceId, 0, Long.MaxValue), NoRefresh) .map(_.event) .collect{ case s: EntireResistanceExerciseSession => s } .mapConcat(_.deviations) .filter(condition) .map(process) implicit val mat = ActorMaterializer() val result = source.runFold(List.empty[ExercisePlanDeviation])((x, y) => y :: x)
  • 31. ● Potentially infinite stream of events Source[Any].map(process).filter(condition) Publisher Subscriber process condition backpressure
  • 32. ● In Akka we have the read and write sides separated, in Cassandra we don’t ● Different data model ● Avoid using operational datastore ● Eventual consistency ● Streaming transformations to different format ● Unify journalled and other data
  • 33. ● Computations and analytics queries on the data ● Often iterative, complex, expensive computations ● Prepared and interactive queries ● Data from multiple sources, joins and transformations ● Often directly on a stream of data ● Whole history of events ● Historical behaviour ● Works retrospectively, can answer questions in the future that we don’t know exist yet ● Various data types from various sources ● Large amounts of fast data ● Automated analytics
  • 34. ● Cassandra 3.0 - user defined functions, functional indexes, aggregation functions, materialized views ● Server side denormalization ● Eventual consistency ● Copy of data with different partitioning userId performance
  • 35. ● In memory dataflow distributed data processing framework, streaming and batch ● Distributes computation using a higher level API ● Load balancing ● Moves computation to data ● Fault tolerant
  • 36. ● Resilient Distributed Datasets ● Fault tolerance ● Caching ● Serialization ● Transformations ○ Lazy, form the DAG ○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ... ● Actions ○ Execute DAG, retrieve result ○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ... ● Accumulators ● Broadcast Variables ● Integration ● Streaming ● Machine Learning ● Graph Processing
  • 37. textFile mapmap reduceByKey collect sc.textFile("counts") .map(line => line.split("t")) .map(word => (word(0), word(1).toInt)) .reduceByKey(_ + _) .collect() [4]
  • 39. ● Cassandra can store ● Spark can process ● Gathering large amounts of heterogeneous data ● Queries ● Transformations ● Complex computations ● Machine learning, data mining, analytics ● Now possible ● Prepared and interactive queries
  • 40. lazy val sparkConf: SparkConf = new SparkConf() .setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1") val sc = new SparkContext(sparkConf) val data = sc.cassandraTable[T]("keyspace", "table").select("columns") val processedData = data.flatMap(...)... processedData.saveToCassandra("keyspace", "table")
  • 41. ● Akka Analytics project ● Handles custom Akka serialization case class JournalKey(persistenceId: String, partition: Long, sequenceNr: Long) lazy val sparkConf: SparkConf = new SparkConf() .setAppName(...).setMaster(...).set("spark.cassandra.connection.host", "127.0.0.1") val sc = new SparkContext(sparkConf) val events: RDD[(JournalKey, Any)] = sc.eventTable() events.sortByKey().map(...).filter(...).collect().foreach(println)
  • 42. ● Spark streaming ● Precomputing using spark or replication often aiming for different data model Operational cluster Analytics cluster Precomputation / replication Integration with other data sources
  • 43. val events: RDD[(JournalKey, Any)] = sc.eventTable().cache().filterClass[EntireResistanceExerciseSession].flatMap(_.deviations) val deviationsFrequency = sqlContext.sql( """SELECT planned.exercise, hour(time), COUNT(1) FROM exerciseDeviations WHERE planned.exercise = 'bench press' GROUP BY planned.exercise, hour(time)""") val deviationsFrequency2 = exerciseDeviationsDF .where(exerciseDeviationsDF("planned.exercise") === "bench press") .groupBy( exerciseDeviationsDF("planned.exercise"), exerciseDeviationsDF("time”)) .count() val deviationsFrequency3 = exerciseDeviations .filter(_.planned.exercise == "bench press") .groupBy(d => (d.planned.exercise, d.time.getHours)) .map(d => (d._1, d._2.size))
  • 44. def toVector(user: User): mllib.linalg.Vector = Vectors.dense( user.frequency, user.performanceIndex, user.improvementIndex) val events: RDD[(JournalKey, Any)] = sc.eventTable().cache() val users: RDD[User] = events.filterClass[User] val kmeans = new KMeans() .setK(5) .set... val clusters = kmeans.run(users.map(_.toVector))
  • 45. val weight: RDD[(JournalKey, Any)] = sc.eventTable().cache() val exerciseDeviations = events .filterClass[EntireResistanceExerciseSession] .flatMap(session => session.sets.flatMap(set => set.sets.map(exercise => (session.id.id, exercise.exercise)))) .groupBy(e => e) .map(g => Rating(normalize(g._1._1), normalize(g._1._2), normalize(g._2.size))) val model = new ALS().run(ratings) val predictions = model.predict(recommend) bench press bicep curl dead lift user 1 5 2 user 2 4 3 user 3 5 2 user 4 3 1
  • 46. val events = sc.eventTable().cache().toDF() val lr = new LinearRegression() val pipeline = new Pipeline().setStages(Array(new UserFilter(), new ZScoreNormalizer(), new IntensityFeatureExtractor(), lr)) val paramGrid = new ParamGridBuilder() .addGrid(lr.regParam, Array(0.1, 0.01)) .addGrid(lr.fitIntercept, Array(true, false)) getEligibleUsers(events, sessionEndedBefore) .map { user => val trainValidationSplit = new TrainValidationSplit() .setEstimator(pipeline) .setEvaluator(new RegressionEvaluator) .setEstimatorParamMaps(paramGrid) val model = trainValidationSplit.fit( events, ParamMap(ParamPair(userIdParam, user))) val testData = // Prepare test data. val predictions = model.transform(testData) submitResult(userId, predictions, config) }
  • 47. val events: RDD[(JournalKey, Any)] = sc.eventTable().cache() val connections = events.filterClass[Connections] val vertices: RDD[(VertexId, Long)] = connections.map(c => (c.id, 1l)) val edges: RDD[Edge[Long]] = connections .flatMap(c => c.connections .map(Edge(c.id, _, 1l))) val graph = Graph(vertices, edges) val ranks = graph.pageRank(0.0001).vertices
  • 48.
  • 49.
  • 50.
  • 53. ● Exercise domain as an example ● Analytics of both batch (offline) and streaming (online) data ● Analytics important in other areas (banking, stock market, network, cluster monitoring, business intelligence, commerce, internet of things, ...) ● Enabling value of data
  • 54. ● Event sourcing ● CQRS ● Technologies to handle the data ○ Spark ○ Mesos ○ Akka ○ Cassandra ○ Kafka ● Handling data ● Insights and analytics enable value in data
  • 55.
  • 56. ● Jobs at www.cakesolutions.net/careers ● Code at https://github.com/muvr ● Martin Zapletal @zapletal_martin ● Anirvan Chakraborty @anirvan_c
  • 57. [1] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/ [2] http://malteschwarzkopf.de/research/assets/google-stack.pdf [3] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf [4] http://www.slideshare.net/LisaHua/spark-overview-37479609