Cassandra synergy

Agenda
 What do we mean by synergy?
 Storm

 Shark / Spark
 Redis
 ElasticSearch
 Hadoop

What do we mean by Synergy?
 synergy
 1. The interaction of two or more agents or forces so
that their combined effect is greater than the sum of
their individual effects.

What do we mean by Synergy?
 Cassandra excellent for:
 Fast read or write performance
 Scalable, runs on commodity hardware
 Reliable cross-DC replication
 Robust persistence for high volume data

 Needs some special sauce for:
 Real-time calculations for high volume streams
 Complex search functions (free-text etc.)
 Map Reduce on RDDs

Storm
 Open Sourced by Twitter in 2011
 Distributed event processor
 Operates on unbounded streams of tuples
 Getting started in Apache Incubator
 Can persist to and read from C*

 Great for high volume, real time (complex) calculations on
streamed data

Storm
 Is a CEP architecture
 Spout – Collects & submits tuples for processing
 Bolt – processes tuples and emits new tuples
 Tuple – a collection of data passed through Storm
 Stream – identifies outputs from a spout / bolt and enforces
tuple structure

 Uses ZooKeeper for coordination and ZeroMQ for message
passing
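
To make the spout / bolt / tuple vocabulary concrete, here is a minimal sketch in plain Python. It does not use the Storm API: the function names (sentence_spout, split_bolt, count_bolt, run_topology) are hypothetical, and generators stand in for streams. It only illustrates how tuples flow from a spout through chained bolts.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def sentence_spout(messages: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout stand-in: wraps each raw message in a one-field tuple."""
    for message in messages:
        yield (message,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt stand-in: consumes sentence tuples, emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt stand-in: emits (word, running_count) for every word seen."""
    counts: Dict[str, int] = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(messages: Iterable[str]) -> List[Tuple[str, int]]:
    """Wires spout -> split bolt -> count bolt and collects the output."""
    return list(count_bolt(split_bolt(sentence_spout(messages))))
```

In a real topology each stage would run in parallel across workers, and the final bolt would write to Cassandra rather than return a list.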

Synergy?
 Can use Cassandra as the input data source
 Can write tuples into Cassandra

 Example project here…
 https://github.com/tjake/stormscraper/

 See CassandraWriterBolt.java for simple example of a Java
Driver CQL based bolt that writes to Cassandra.
 Good as an example application, but not production ready

Use Case
 Top N words for popularity tracking
 Input: a constant stream of messages into the system
 Count occurrences of each word in a message
 Store raw messages in Cassandra
 Use a bolt to break up messages and maintain sorted list of
top N words

 Persist the Top N words and their counts periodically in
Cassandra
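
The counting and ranking steps above can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the storm-starter implementation: the TopWords class, its method names and the tokenising regex are all hypothetical, and ties are broken alphabetically for determinism.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumption: a "word" is a run of lowercase letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


class TopWords:
    """Keeps running word counts and reports the current top N."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.counts: Counter = Counter()

    def add_message(self, message: str) -> None:
        """Breaks a message into words and updates the counts."""
        self.counts.update(WORD_RE.findall(message.lower()))

    def snapshot(self) -> List[Tuple[int, str, int]]:
        """Returns (position, word, count) rows, highest count first."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [
            (position, word, count)
            for position, (word, count) in enumerate(ranked[: self.n], start=1)
        ]
```

A periodic task would call snapshot() and write the resulting rows to Cassandra, keyed by the current time bucket.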

Use Case
CREATE TABLE messages (
    date_hour TIMESTAMP, message_id TIMEUUID, message VARCHAR,
    PRIMARY KEY (date_hour, message_id));
CREATE TABLE top_words (
    date_hour TIMESTAMP, position INT, word VARCHAR,
    PRIMARY KEY (date_hour, position));
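
Both tables are partitioned by date_hour, so every write needs a timestamp truncated to the hour. A minimal sketch of that bucketing in plain Python follows. The helper names (date_hour_bucket, top_words_rows) are hypothetical, the input is assumed to be a timezone-aware datetime, and buckets are assumed to be in UTC.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def date_hour_bucket(ts: datetime) -> datetime:
    """Truncates a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def top_words_rows(ts: datetime, words: List[str]) -> List[Tuple[datetime, int, str]]:
    """Builds (date_hour, position, word) rows matching the top_words table."""
    bucket = date_hour_bucket(ts)
    return [(bucket, position, word) for position, word in enumerate(words, start=1)]
```

All messages and top-word rows from the same hour then share one partition, which keeps an hourly read to a single partition lookup.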

Use Case
 https://github.com/nathanmarz/storm-starter/
 Use RollingTopWords.java as base
 Integrate CassandraWriterBolt into use case
 Add spout for input messages
 Add bolt for persisting messages & writing Top N words
 Reference : http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

Storm: Conclusion
 Powerful Architecture
 Lots of potential as an Apache project
 Nice abstractions to simplify development (Trident)
 Great for operating on high velocity, high volume streams
 Not prohibitively difficult to integrate with other systems for
input and output
 Lots of people experimenting with it!

Spark & Shark
Lightning fast cluster computing

Apache Spark
 Up to 100x faster than Hadoop MapReduce for in-memory workloads
 Faster in-memory MapReduce-style operations

 Integration with Cassandra either via:
 https://github.com/tuplejump/calliope-release
 Or via Cassandra’s Hadoop support

 Combines SQL, Streaming and Complex Analytics

Apache Spark
 Can read and write to Cassandra…
 Reading from CF / Table into RDD via Calliope (Scala)
val cas = CasBuilder.cql3.withColumnFamily("casDemo", "Words")
  .where("book = 'The Three Musketeers'")
val rdd = sc.cql3Cassandra[Map[String, String], Map[String, String]](cas)
* The where clause can use the partition key or a secondary index; CasBuilder
also supports paging

Shark
 With Spark we can achieve super fast in-memory queries on
subsets of data in Cassandra
 Effectively all the features of Hive running on RDD not HDFS

 Uses HiveQL queries
 Includes machine learning algorithms out of the box
 CqlStorageHandler provided to read RDD from Cassandra or
read SSTables directly
 https://github.com/richardalow/cassowary

Spark / Shark: Conclusion
 Need resource isolation if running directly on
Cassandra nodes
 Running on separate nodes means higher latency, but no
contention for Cassandra cluster resources
 Impressive possibilities for machine learning
algorithms as well as more basic Hive queries
 Introduces possibilities for JOINs on hot data!

What is it?

“Redis is an open source, BSD licensed, advanced key-value store.
It is often referred to as a data structure server since keys can
contain strings, hashes, lists, sets and sorted sets.”

Synergy?
 Good for…
 Sorting sets & lists
 Pubsub messaging
 (more) Accurate counters
 Merging sets
 Transactions!

 Works in memory, can serve data fast based on key
 Good for runtime storage of aggregate data
 Could share resources on Cassandra nodes, with the most recent
data populated via triggers (naughty)

ElasticSearch
Distributed real-time search engine based on Apache Lucene

What is it?
 Distributed real-time search engine
 Built from the ground up for reliability and scalability

 Supports lots of other features as well as free-text search
 Spatial
 Query by arbitrary fields
 Facets

 Multi-lingual query support

Synergy?
 Although external to Cassandra it can provide rich query
capabilities over the same data
 Simplify Data Models in Cassandra to maximise storage
 Separate read and write workloads (read from ES, write to
Cassandra)
 Some integration with Storm for writing records to ElasticSearch
and Cassandra as data enters the system
 Again… Spatial!

What is it?
 Open Source under Apache License 2.0
 Top Level Apache project

 Runs on commodity hardware
 Used for storage and large scale processing of data-sets
 Lots of complementary tools… Impala, Mahout etc.

Some terms…
 HDFS - a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the
cluster.
 Hadoop MapReduce - a programming model for large scale
data processing.
 Hive - a SQL-like abstraction for MapReduce jobs
 Pig - a procedural-style language for expressing MapReduce
jobs
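
As a rough illustration of the MapReduce programming model, here is a single-process word count in plain Python. It is a sketch of the map / shuffle / reduce phases only, not Hadoop code: the function names are hypothetical, and a real job would distribute each phase across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emits a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: groups all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapses the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Runs map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Hive and Pig generate jobs of this shape from higher-level queries and scripts.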

Synergy?
 Multiple ways to use it with Cassandra
 DataStax Enterprise supports Hadoop on top of a
Cassandra File System
 Replication managed in-cluster (efficient)
 Full Hadoop toolset available

 Some Hadoop support in vanilla distribution.
 Limited support for efficient querying
