2. @doanduyhai
Who Am I?
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
3. @doanduyhai
DataStax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• DataStax Enterprise = OSS Cassandra + extra features
8. @doanduyhai
Normal token ranges
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
[Figure: ring of eight nodes, n1 to n8, with the token ranges A to H placed around it]
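The half-open ranges ]lower, upper] on this slide can be sketched in a few lines of plain Python. This is only an illustration: the value chosen for X, the helper names `range_of` and `owner`, and the assumption that node n1 owns range A, n2 owns B, and so on, are not stated on the slide.

```python
from bisect import bisect_left

X = 2 ** 64  # hypothetical size of the token space; the slide only calls it X
LABELS = "ABCDEFGH"

# Upper bound of each range ]k*X/8, (k+1)*X/8], for k = 0..7.
UPPER_BOUNDS = [(k + 1) * X // 8 for k in range(8)]

# Assumption: node n1 owns range A, n2 owns B, ..., n8 owns H.
OWNER = {label: f"n{i + 1}" for i, label in enumerate(LABELS)}


def range_of(token: int) -> str:
    """Return the label of the range ]lower, upper] that contains the token."""
    if not 0 < token <= X:
        raise ValueError("token must lie in ]0, X]")
    # bisect_left finds the first upper bound >= token, so a token equal
    # to an upper bound stays in that range (the upper bound is inclusive).
    return LABELS[bisect_left(UPPER_BOUNDS, token)]


def owner(token: int) -> str:
    """Return the node assumed to own the token's range."""
    return OWNER[range_of(token)]
```

A token equal to X/8 still falls in range A, while X/8 + 1 falls in range B, which is what the open lower bound and closed upper bound express.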
9. @doanduyhai
Cassandra Query Language (CQL)
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
UPDATE users SET age = 34 WHERE login = 'jdoe';
DELETE age FROM users WHERE login = 'jdoe';
SELECT age FROM users WHERE login = 'jdoe';
10. @doanduyhai
Why Spark on Cassandra?
For Spark
• Fast disk access
• Structured data (columnar format)
• Multi data-center !!!
For Cassandra
• Cross-table operations (JOIN, UNION, etc.)
• Real-time/batch processing
• Complex analytics (e.g. machine learning)
14. @doanduyhai
Connector architecture – Core API
Cassandra tables exposed as Spark RDDs
Read from and write to Cassandra
Mapping of C* tables and rows to Scala objects
• CassandraRow
• case class (object mapper)
• Scala tuples
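The three mapping targets listed above can be illustrated with a plain-Python analogy. This is not the connector's Scala API: the `User` class, the `raw_row` dictionary, and the helpers `to_object` and `to_tuple` are hypothetical names, and the columns are borrowed from the `users` table of the CQL slide.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Sequence, Tuple


@dataclass(frozen=True)
class User:
    """Typed view of one row, playing the role of a Scala case class."""
    login: str
    name: str
    age: int


# Generic view: column name -> value, similar in spirit to a CassandraRow.
raw_row: Dict[str, Any] = {"login": "jdoe", "name": "John DOE", "age": 33}


def to_object(row: Dict[str, Any]) -> User:
    """Object-mapper style: match column names to field names."""
    return User(**{f.name: row[f.name] for f in fields(User)})


def to_tuple(row: Dict[str, Any], columns: Sequence[str]) -> Tuple[Any, ...]:
    """Tuple style: keep only the selected columns, in the given order."""
    return tuple(row[c] for c in columns)
```

The generic row is the most flexible of the three, the typed object gives named and checked fields, and the tuple is convenient when only a couple of columns are needed.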
15. @doanduyhai
Connector architecture – Spark SQL
Mapping of C* table to SchemaRDD
• custom query plan
• CassandraRDD → SchemaRDD
• push predicates to CQL
16. @doanduyhai
Connector architecture – Spark Streaming
Streaming data INTO Cassandra table
• trivial setup
• be careful about your Cassandra data model !!!
Streaming data OUT of Cassandra table
• fetch all data from table
• send each row as an element of a DStream
22. @doanduyhai
Data Locality
getPartitions:
1. fetch all token ranges and their corresponding nodes from C* (describe_ring method)
2. group token ranges together so that 1 Spark partition = n token ranges belonging to the same node
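The grouping in step 2 can be sketched in plain Python. This is a simplified model under stated assumptions, not the connector's implementation: the ring is passed in as a list of (token range, node) pairs, each range is attributed to a single node (replicas are ignored), and the name `group_by_node` and the parameter `ranges_per_partition` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TokenRange = Tuple[int, int]  # (start, end), read as ]start, end]


def group_by_node(
    ring: List[Tuple[TokenRange, str]],
    ranges_per_partition: int,
) -> List[Tuple[str, List[TokenRange]]]:
    """Build partitions of at most n token ranges that all live on one node."""
    if ranges_per_partition < 1:
        raise ValueError("ranges_per_partition must be >= 1")

    # Step 1 stand-in: collect the ranges owned by each node.
    by_node: Dict[str, List[TokenRange]] = defaultdict(list)
    for token_range, node in ring:
        by_node[node].append(token_range)

    # Step 2: cut each node's ranges into chunks of n. Every chunk becomes
    # one partition whose preferred location is that node.
    partitions: List[Tuple[str, List[TokenRange]]] = []
    for node, ranges in by_node.items():
        for i in range(0, len(ranges), ranges_per_partition):
            partitions.append((node, ranges[i:i + ranges_per_partition]))
    return partitions


# Toy ring: two nodes with interleaved ranges.
ring = [((0, 10), "n1"), ((10, 20), "n2"), ((20, 30), "n1"),
        ((30, 40), "n2"), ((40, 50), "n1")]
partitions = group_by_node(ring, ranges_per_partition=2)
```

Because every partition only contains ranges from one node, a scheduler can place the corresponding task on that node and read the data locally.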
25. Connector API & Usage
Resources handling
Connector API
Live demo
26. @doanduyhai
Resources Handling
Open connections to C* cluster
Connections pooled (using reference counting) on each executor
Scala Loan Pattern
connector.withSessionDo {
  session => session.execute("SELECT xxx FROM yyy").all()
}
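How reference counting and the loan pattern fit together can be sketched in plain Python. This is an illustrative model only, not the connector's code: `FakeSession`, `Connector`, and `with_session_do` are hypothetical names, and closing the session as soon as the count reaches zero is an assumption made to keep the sketch short.

```python
import threading


class FakeSession:
    """Hypothetical stand-in for a real driver session."""

    def __init__(self):
        self.closed = False

    def execute(self, query):
        return [f"result of: {query}"]

    def close(self):
        self.closed = True


class Connector:
    """Shares one session among callers and tracks how many are using it."""

    def __init__(self, session_factory=FakeSession):
        self._factory = session_factory
        self._lock = threading.Lock()
        self._session = None
        self._ref_count = 0

    def with_session_do(self, body):
        """Loan pattern: lend the session to body, then always release it."""
        with self._lock:
            if self._session is None:
                self._session = self._factory()  # opened on first use
            self._ref_count += 1
            session = self._session
        try:
            return body(session)
        finally:
            with self._lock:
                self._ref_count -= 1
                if self._ref_count == 0:  # last borrower closes the session
                    session.close()
                    self._session = None


connector = Connector()
rows = connector.with_session_do(lambda s: s.execute("SELECT xxx FROM yyy"))
```

The caller never opens or closes the session itself, so a forgotten `close()` cannot leak a connection, and nested or concurrent borrowers share the same session until the last one returns.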
34. @doanduyhai
Use Cases
Load data from various sources
Analytics (join, aggregate, transform, …)
Sanitize, validate, normalize data
Schema migration, data conversion
38. @doanduyhai
Multi-DC with Spark
Workload segregation with virtual DC
[Figure: two virtual data centers in the same physical DC, a Production (Live) ring with nodes n1 to n8 and an Analytics with Spark ring with nodes n1 to n5, linked by async replication]