Presented to the Dublin Cassandra User Group by Niall Milton of DigBigData. This presentation is on Cassandra and its use with other technologies such as Storm, Spark, Hadoop, ElasticSearch and Redis. This presentation should act as a solid foundation to explore some of the mentioned technologies in more depth.
Cassandra synergy
2. Agenda
What do we mean by synergy?
Storm
Shark / Spark
Redis
ElasticSearch
Hadoop
3. What do we mean by Synergy?
synergy: the interaction of two or more agents or forces so that their combined effect is greater than the sum of their individual effects.
4. What do we mean by Synergy?
Cassandra is excellent for:
Fast read or write performance
Scalable, runs on commodity hardware
Reliable cross-DC replication
Robust persistence for high volume data
Needs some special sauce for:
Real-time calculations for high volume streams
Complex search functions (free-text etc.)
Map Reduce on RDDs
6. Storm
Open Sourced by Twitter in 2011
Distributed event processor
Operates on unbounded streams of tuples
Getting started in Apache Incubator
Can persist to and read from C*
Great for high volume, real time (complex) calculations on
streamed data
8. Storm
Is a CEP architecture
Spout – Collects & submits tuples for processing
Bolt – processes tuples and emits new tuples
Tuple – a collection of data passed between components in Storm
Stream – identifies outputs from a spout / bolt and enforces
tuple structure
Uses ZooKeeper for coordination and ZeroMQ for message passing
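The spout, bolt and tuple vocabulary above can be sketched in plain Java. This is not the Storm API: the `Spout` and `Bolt` interfaces, the `PipelineSketch` class and its methods are hypothetical names chosen for illustration, and everything runs in one thread with no cluster coordination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustration only: hypothetical interfaces, NOT the Storm API.
class PipelineSketch {

    /** A spout is a source of tuples; returns null when nothing is left. */
    interface Spout {
        List<Object> nextTuple();
    }

    /** A bolt consumes one tuple and emits zero or more new tuples. */
    interface Bolt {
        List<List<Object>> execute(List<Object> tuple);
    }

    /** Spout that replays a fixed list of messages, one single-field tuple each. */
    static Spout messageSpout(List<String> messages) {
        Iterator<String> it = messages.iterator();
        return () -> it.hasNext() ? Arrays.<Object>asList(it.next()) : null;
    }

    /** Bolt that splits the message field into one lower-cased tuple per word. */
    static Bolt splitBolt() {
        return tuple -> {
            List<List<Object>> out = new ArrayList<>();
            for (String word : ((String) tuple.get(0)).toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(Arrays.<Object>asList(word));
                }
            }
            return out;
        };
    }

    /** Drains the spout through the bolt and collects every emitted tuple. */
    static List<List<Object>> run(Spout spout, Bolt bolt) {
        List<List<Object>> emitted = new ArrayList<>();
        for (List<Object> t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            emitted.addAll(bolt.execute(t));
        }
        return emitted;
    }

    public static void main(String[] args) {
        Spout spout = messageSpout(Arrays.asList("Storm emits tuples", "bolts process tuples"));
        System.out.println(run(spout, splitBolt()));
    }
}
```

In a real topology the spout and bolt would run as parallel tasks on different workers; the sketch only shows the direction of data flow.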
10. Synergy?
Can use Cassandra as the input data source
Can write tuples into Cassandra
Example project here…
https://github.com/tjake/stormscraper/
See CassandraWriterBolt.java for simple example of a Java
Driver CQL based bolt that writes to Cassandra.
Good as an example application, but not production ready
11. Use Case
Top N words for popularity tracking
Input: a constant stream of messages into the system
Count occurrences of each word in a message
Store raw messages in Cassandra
Use a bolt to break up messages and maintain sorted list of
top N words
Persist the Top N words and their counts periodically in
Cassandra
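The counting and ranking steps of this use case can be sketched in plain Java without any Storm or Cassandra dependency. `TopWordsSketch`, `accept`, `topN` and `count` are hypothetical names, and the sketch keeps all-time counts rather than the rolling window a production topology would probably want.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: plain Java, no Storm or Cassandra dependencies.
class TopWordsSketch {
    private final Map<String, Long> counts = new HashMap<>();

    /** Splits a message on non-letter characters and counts each word. */
    void accept(String message) {
        for (String word : message.toLowerCase().split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    /** Returns the count seen so far for a word, or 0 if it never appeared. */
    long count(String word) {
        return counts.getOrDefault(word, 0L);
    }

    /** Returns up to n words, most frequent first, ties broken alphabetically. */
    List<String> topN(int n) {
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Long.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }

    public static void main(String[] args) {
        TopWordsSketch top = new TopWordsSketch();
        top.accept("the cat and the hat");
        top.accept("The cat sat");
        System.out.println(top.topN(2));
    }
}
```

The index of each word in the returned list, plus one, would serve as its position when the list is persisted.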
12. Use Case
CREATE TABLE messages (date_hour TIMESTAMP,
message_id TIMEUUID, message VARCHAR,
PRIMARY KEY (date_hour, message_id));
CREATE TABLE top_words (date_hour TIMESTAMP,
position INT, word VARCHAR,
PRIMARY KEY (date_hour, position));
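Both tables use date_hour as the partition key, so a writer has to truncate each event time to the start of its hour before inserting. A minimal Java sketch of that bucketing follows, assuming UTC buckets (the slides do not say which time zone is used); `HourBucket` and `bucketOf` are hypothetical names.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustration only: computes the value a writer might bind to date_hour.
class HourBucket {

    /** Truncates an event time to the start of its UTC hour. */
    static Instant bucketOf(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }

    public static void main(String[] args) {
        System.out.println(bucketOf(Instant.now()));
    }
}
```

All messages that arrive within the same hour then land in one partition, ordered by their message_id.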
13. Use Case
https://github.com/nathanmarz/storm-starter/
Use RollingTopWords.java as base
Integrate CassandraWriterBolt into use case
Add spout for input messages
Add bolt for persisting messages & writing Top N words
Reference: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
14. Storm: Conclusion
Powerful Architecture
Lots of potential as an Apache project
Nice abstractions to simplify development (Trident)
Great for operating on high velocity, high volume streams
Not prohibitively difficult to integrate with other systems for
input and output
Lots of people experimenting with it!
16. Apache Spark
Claims up to 100x faster than Hadoop MapReduce (in memory)
Faster in-memory MapReduce operations
Integration with Cassandra either via:
https://github.com/tuplejump/calliope-release
Or via Cassandra’s Hadoop support
Combines SQL, Streaming and Complex Analytics
17. Apache Spark
Can read and write to Cassandra…
Reading from CF / Table into RDD via Calliope (Scala)
val cas = CasBuilder.cql3.withColumnFamily("casDemo", "Words")
  .where("book = 'The Three Musketeers'")
val rdd = sc.cql3Cassandra[Map[String, String], Map[String, String]](cas)
* where clause can use partition key or secondary index, CasBuilder
also supports paging
18. Shark
With Spark we can achieve super fast in-memory queries on
subsets of data in Cassandra
Effectively all the features of Hive, running on RDDs rather than HDFS
Uses HiveQL queries
Includes machine learning algorithms out of the box
CqlStorageHandler provided to read RDD from Cassandra or
read SSTables directly
https://github.com/richardalow/cassowary
19. Spark / Shark: Conclusion
Need resource isolation if running directly on Cassandra nodes
Running on separate nodes means higher latency but does not affect cluster resources
Impressive possibilities for machine learning
algorithms as well as more basic Hive queries
Introduces possibilities for JOINs on hot data!
21. What is it?
“Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.”
22. Synergy?
Good for…
Sorting sets & lists
Pubsub messaging
(more) Accurate counters
Merging sets
Transactions!
Works in memory, can serve data fast based on key
Good for runtime storage of aggregate data
Could use shared resources on Cassandra nodes (could populate
most recent data via triggers (naughty))
24. What is it?
Distributed real-time search engine
Built from the ground up for reliability and scalability
Supports lots of other features as well as free-text search:
Spatial
Query by arbitrary fields
Facets
Multi-lingual query support
25. Synergy?
Although external to Cassandra it can provide rich query
capabilities over the same data
Simplify Data Models in Cassandra to maximise storage
Separate read and write workloads (read from ES, write to
Cassandra)
Some integration with Storm for writing records to ElasticSearch and Cassandra as data enters the system
Again… Spatial!
27. What is it?
Open Source under Apache License 2.0
Top Level Apache project
Runs on commodity hardware
Used for storage and large scale processing of data-sets
Lots of complementary tools… Impala, Mahout, etc.
28. Some terms…
HDFS - a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the
cluster.
Hadoop MapReduce - a programming model for large scale
data processing.
Hive - An SQL-like abstraction for MapReduce jobs
Pig - A procedural-style language for expressing MapReduce
jobs
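The MapReduce programming model named above has three phases: map, shuffle and reduce. They can be sketched in plain Java with the classic word count. This is an in-memory, single-threaded illustration and not the Hadoop API; `MapReduceSketch` and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: in-memory and single-threaded, NOT the Hadoop API.
class MapReduceSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: collapse the values of one key into a single total. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three phases in order over all input lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

On a cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network; Hive and Pig generate jobs of this shape from higher-level queries.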
29. Synergy?
Multiple ways to use it with Cassandra
DataStax Enterprise supports Hadoop on top of a
Cassandra File System
Replication managed in-cluster (efficient)
Full Hadoop toolset available
Some Hadoop support in vanilla distribution.
Limited support for efficient querying