Wait! Back away from the Cassandra secondary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data, and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze the data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
1. We know stuff. And get people excited about it.
Rachel Pedreschi @RachelPedreschi
Patrick McFadin @PatrickMcFadin
2. Adhociness: C* → Search → Analytics
Not ad hoc: all queries preplanned. Very fast.
Very ad hoc: ask anything, anytime. Slowest.
13. Multi-Data Center Replication
hash(key) => token(43)
Data Center 1: replication factor = 3, node tokens 10, 20, 30, 40, 50, 60, 70, 80
Data Center 2: replication factor = 3, node tokens 11, 21, 31, 41, 51, 61, 71, 81
Application
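To make the ring picture concrete, here is a minimal sketch of how a key's token could map to replicas. It assumes the node tokens shown on the slide (10 through 80 for Data Center 1) and a simplified placement rule: the first node whose token is greater than or equal to the key's token owns the key, and the next nodes clockwise hold the remaining copies. The names `RingSketch` and `replicasFor` are hypothetical, and real clusters also account for racks and virtual nodes, which this sketch ignores.

```scala
object RingSketch {
  // Node tokens for Data Center 1, as read from the slide (assumption).
  val ring: Vector[Int] = Vector(10, 20, 30, 40, 50, 60, 70, 80)

  /** Start at the first node whose token is >= keyToken, then take
    * `rf` consecutive nodes clockwise, wrapping around the ring. */
  def replicasFor(keyToken: Int, ring: Vector[Int], rf: Int): Vector[Int] = {
    val sorted = ring.sorted
    val idx = sorted.indexWhere(_ >= keyToken)
    // No token is large enough: wrap around to the first node.
    val start = if (idx == -1) 0 else idx
    Vector.tabulate(math.min(rf, sorted.size))(i => sorted((start + i) % sorted.size))
  }

  def main(args: Array[String]): Unit = {
    // token(43) with replication factor 3
    println(replicasFor(43, ring, 3)) // prints Vector(50, 60, 70)
  }
}
```

Running the same function against the second ring (11 through 81) picks a second, independent set of three replicas, which is the idea behind setting a replication factor per data center.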
14. How does DSE integrate Solr?
C* nodes: Transactional
C*/Solr nodes: Search
15.
16. SELECT *
FROM killrvideo.videos
WHERE solr_query='{
  "q": "{!edismax qf=\"name^2 tags^1 description\"}datastax"
}';
SELECT id, value
FROM keyspace.table
WHERE token(id) >= -3074457345618258601
AND token(id) <= 3074457345618258603
AND solr_query='id:*';
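The second query limits a Solr search to one slice of the token ring. As a rough sketch of where such slices could come from, the code below splits the full signed 64-bit token range into contiguous pieces and builds one query per piece. It assumes the Murmur3 partitioner's range (Long.MinValue to Long.MaxValue); `TokenRangeSketch`, `split`, and `query` are hypothetical names, and the boundaries it computes are close to, but not identical to, the ones on the slide.

```scala
object TokenRangeSketch {
  // Assumed token range of the Murmur3 partitioner.
  private val MinToken = BigInt(Long.MinValue)
  private val MaxToken = BigInt(Long.MaxValue)

  /** Split the full token range into n contiguous, inclusive (low, high) ranges. */
  def split(n: Int): Seq[(Long, Long)] = {
    require(n > 0, "n must be positive")
    val width = (MaxToken - MinToken + 1) / n
    (0 until n).map { i =>
      val lo = MinToken + width * i
      // The last range absorbs the remainder so the ring is fully covered.
      val hi = if (i == n - 1) MaxToken else lo + width - 1
      (lo.toLong, hi.toLong)
    }
  }

  /** Build a token-restricted Solr query in the shape shown on the slide. */
  def query(keyspace: String, table: String, range: (Long, Long)): String =
    s"SELECT id, value FROM $keyspace.$table " +
      s"WHERE token(id) >= ${range._1} AND token(id) <= ${range._2} " +
      "AND solr_query='id:*';"

  def main(args: Array[String]): Unit =
    split(3).foreach(r => println(query("keyspace", "table", r)))
}
```

With three slices, the middle one comes out as roughly -3.07e18 to 3.07e18, the same neighborhood as the range in the slide's query.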
40. Behind the scenes…
// Videos by id
CREATE TABLE videos (
videoid uuid,
userid uuid,
name text,
description text,
location text,
location_type int,
preview_image_location text,
tags set<text>,
added_date timestamp,
PRIMARY KEY (videoid)
);
// Index for tag keywords
CREATE TABLE videos_by_tag (
tag text,
videoid uuid,
added_date timestamp,
userid uuid,
name text,
preview_image_location text,
tagged_date timestamp,
PRIMARY KEY (tag, videoid)
);
Not a great idea
Possible Index
41. // Videos by id
CREATE TABLE videos (
videoid uuid,
userid uuid,
name text,
description text,
location text,
location_type int,
preview_image_location text,
tags set<text>,
added_date timestamp,
PRIMARY KEY (videoid)
);
And this? This? This?
42.
43. 1) Spin up a new C* cluster with search enabled using the DSE installer.
$ dse cassandra -s
2) Run your schema DDL to create the C* keyspace and tables.
3) Run dsetool on the videos table.
$ dsetool create_core killrvideo.videos generateResources=true
4) Use the Solr Admin to check sanity and make sure you have a core.
5) Write a CQL query with a Solr search in it.
SELECT * FROM killrvideo.videos
WHERE solr_query='{ "q": "{!edismax qf=\"name^2 tags^1 description\"}?" }';
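The quoting in step 5 is easy to get wrong, since the qf list sits inside a JSON string that sits inside a CQL string. A minimal sketch of building that string programmatically, assuming you assemble the query in application code; `SolrQuerySketch`, `solrQueryJson`, and `cql` are hypothetical helper names, not part of any DSE API.

```scala
object SolrQuerySketch {
  /** Wrap an edismax query in the JSON form expected by solr_query.
    * Backslashes and double quotes are escaped so the JSON stays valid. */
  def solrQueryJson(qf: String, term: String): String = {
    val q = s"""{!edismax qf="$qf"}$term"""
    val escaped = q.replace("\\", "\\\\").replace("\"", "\\\"")
    s"""{ "q": "$escaped" }"""
  }

  /** Embed the JSON in a CQL statement; single quotes are doubled for CQL. */
  def cql(table: String, json: String): String =
    s"SELECT * FROM $table WHERE solr_query='${json.replace("'", "''")}';"

  def main(args: Array[String]): Unit =
    println(cql("killrvideo.videos", solrQueryJson("name^2 tags^1 description", "datastax")))
}
```

Here name^2 weights matches in the name field twice as heavily as matches in tags or description.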
45. Now you get this!
SELECT name
FROM videos
WHERE solr_query = 'tags:crime*';
46. Attaching to Spark and Cassandra
import org.apache.spark.{SparkConf, SparkContext}
// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.spark.connector._
/** setMaster("local[*]") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]")
  .setAppName(getClass.getName)
  // Optionally
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
val sc = new SparkContext(conf)
47. Comment table example
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
48. Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("killrvideo", "comments_by_video")
/** get a simple count of all the rows in the comments_by_video table */
val rowCount = tableRDD.count()
println(s"Total Rows in Comments Table: $rowCount")
sc.stop()
49. Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("killrvideo", "comments_by_video")
/** get a simple count of all the rows in the comments_by_video table */
val rowCount = tableRDD.count()
println(s"Total Rows in Comments Table: $rowCount")
sc.stop()
Executor
SELECT *
FROM killrvideo.comments_by_video
Spark RDD
Spark Partition
Spark Connector
50. Using CQL
SELECT userid
FROM comments_by_video
WHERE videoid = 01860584-de45-018f-12be-5f81704e8033;
val cqlRDD = sc.cassandraTable("killrvideo", "comments_by_video")
  .select("userid")
  .where("videoid = ?",
    java.util.UUID.fromString("01860584-de45-018f-12be-5f81704e8033"))
51. Even SQL!
spark-sql> SELECT cast(videoid as String) videoid, count(*) c
FROM comments_by_video
GROUP BY cast(videoid as String)
ORDER BY c DESC limit 10;
52. Saving back to Cassandra
// Create insert data (commentid is part of the primary key, so every row needs one)
import com.datastax.driver.core.utils.UUIDs
val collection = sc.parallelize(Seq(
  ("01860584-de45-018f-12be-5f81704e8033", UUIDs.timeBased(), "Great video", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777"),
  ("01860584-de45-018f-12be-5f81704e8033", UUIDs.timeBased(), "Hated it", "cdaf6bd5-8914-29e0-f0b6-8b0bc6156777")))
// Insert data into table
collection.saveToCassandra("killrvideo", "comments_by_video",
  SomeColumns("videoid", "commentid", "comment", "userid"))
53.
val solrQueryRDD = sc.cassandraTable("killrvideo", "videos")
  .select("name").where("solr_query='tags:crime*'")
solrQueryRDD.collect().foreach(row => println(row.getString("name")))