Making sure your data model will work on the production cluster after six months as well as it does on your laptop is an important skill. It's one we use every day with our clients at The Last Pickle, and one that relies on tools like cassandra-stress. Knowing how the data model will perform under stress once it has been loaded with data can prevent expensive rewrites late in the project.
In this talk Christopher Batey, Consultant at The Last Pickle, will shed some light on how to use the cassandra-stress tool to test your own schema, graph the results, and even extend the tool for your own use cases. While this might be called premature optimisation for an RDBMS, a successful Cassandra project depends on its data model.
About the Speaker
Christopher Batey Consultant / Software Engineer, The Last Pickle
Christopher (@chbatey) is a part-time consultant at The Last Pickle, where he works with clients to help them succeed with Apache Cassandra, as well as a freelance software engineer working in London. Likes: Scala, Haskell, Java, the JVM, Akka, distributed databases, XP, TDD, Pairing. Hates: Untested software, code ownership. You can check out his blog at: http://www.batey.info
6. How Cassandra projects roll
● New shiny project
● Cassandra is a good fit
● Build the application against a single-node Cassandra cluster
● Go to production
● Fail
7. Cassandra is not an agile database
● Schema migrations :(
● Data migrations :(
● Must know queries up front
8. You must iterate on your schema early
● Know your queries
● Know your data
● Test them against a realistic cluster
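One way to get a realistic multi-node cluster on a development machine is ccm (Cassandra Cluster Manager). A minimal sketch, assuming ccm and a JDK are installed; the cluster name and Cassandra version are illustrative:

```shell
# Create and start a local 3-node Cassandra cluster
ccm create stress-test -v 3.11.4 -n 3 -s

# Confirm all three nodes are up before running cassandra-stress against them
ccm status
```

A three-node local cluster exercises replication and coordination paths that a single node never will, which is the point of slide 8.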
12. Commands
cassandra-stress help
---Commands---
read : Multiple concurrent reads - the cluster must first be populated by a write test
write : Multiple concurrent writes against the cluster
mixed : Interleaving of any basic commands
user : Interleaving of user provided queries, with configurable ratio and distribution
help : Print help for a command or option
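Putting the basic commands together: as the help text above notes, `read` requires the cluster to have been populated by a prior `write` test. A sketch, assuming a node reachable at 127.0.0.1:

```shell
# Populate the default keyspace/table with a million rows
cassandra-stress write n=1000000 -node 127.0.0.1

# Read the same keys back; this only works after the write test above
cassandra-stress read n=1000000 -node 127.0.0.1
```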
13. Command help
cassandra-stress help write
Usage: write n=? [no-warmup] [truncate=?] [cl=?]
OR
Usage: write duration=? [no-warmup] [truncate=?] [cl=?]
err<? (default=0.02) Run until the standard error of the mean is below this fraction
n>? (default=30) Run at least this many iterations before accepting uncertainty convergence
n<? (default=200) Run at most this many iterations before accepting uncertainty convergence
no-warmup Do not warmup the process
truncate=? (default=never) Truncate the table: never, before performing any work, or before each iteration
cl=? (default=LOCAL_ONE) Consistency level to use
n=? Number of operations to perform
duration=? Time to run in (in seconds, minutes or hours)
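The two usage forms above can be combined with the other flags. For example, a duration-based run rather than a fixed operation count (node address is a placeholder):

```shell
# Write for 10 minutes at LOCAL_QUORUM, skipping the warmup phase
cassandra-stress write duration=10m cl=LOCAL_QUORUM no-warmup -node 127.0.0.1
```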
15. Output
Op rate : 654 op/s [WRITE: 654 op/s]
Partition rate : 654 pk/s [WRITE: 654 pk/s]
Row rate : 654 row/s [WRITE: 654 row/s]
Latency mean : 59.5 ms [WRITE: 59.5 ms]
Latency median : 64.0 ms [WRITE: 64.0 ms]
Latency 95th percentile : 79.3 ms [WRITE: 79.3 ms]
Latency 99th percentile : 84.0 ms [WRITE: 84.0 ms]
Latency 99.9th percentile : 84.9 ms [WRITE: 84.9 ms]
Latency max : 84.9 ms [WRITE: 84.9 ms]
Total partitions : 100 [WRITE: 100]
Total errors : 0 [WRITE: 0]
Total GC count : 0
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:00:00
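The talk mentions graphing the results, and the summary block above is plain text, so a small parser is handy for feeding the numbers into a plotting tool. A minimal sketch (the helper name and the exact-whitespace tolerance are my own, not part of the tool):

```python
import re

def parse_stress_summary(text):
    """Parse a cassandra-stress summary block into {metric name: first number}.

    Takes the metric name before the first ':' and the first numeric value
    after it; lines with no number (e.g. 'Avg GC time : NaN ms') are skipped.
    """
    results = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        match = re.search(r"[-+]?\d+(?:\.\d+)?", rest)
        if match:
            results[name.strip()] = float(match.group())
    return results

summary = """\
Op rate                   : 654 op/s  [WRITE: 654 op/s]
Latency mean              : 59.5 ms [WRITE: 59.5 ms]
Latency 99th percentile   : 84.0 ms [WRITE: 84.0 ms]
"""
metrics = parse_stress_summary(summary)
print(metrics["Op rate"])        # 654.0
print(metrics["Latency mean"])   # 59.5
```

Newer cassandra-stress releases also ship a `-graph` option that writes an HTML report directly, which may remove the need for hand-rolled parsing.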
16. Options
---Options---
-pop : Population distribution and intra-partition visit order
-insert : Insert specific options relating to various methods for batching and splitting partition
-rate : Thread count, rate limit or automatic mode (default is auto)
-mode : Thrift or CQL with options
-errors : How to handle errors when encountered during stress
-sample : Specify the number of samples to collect for measuring latency
-schema : Replication settings, compression, compaction, etc.
-node : Nodes to connect to
-log : Where to log progress to, and the interval at which to do it
-transport : Custom transport factories
-port : The port to connect to cassandra nodes on
-sendto : Specify a stress server to send this command to
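Several of these options are commonly combined in one run. A sketch pinning the thread count, replication factor, and target nodes (the IP addresses are placeholders):

```shell
# 200 client threads, RF=3 for the stress keyspace, three target nodes
cassandra-stress write n=500000 \
  -rate threads=200 \
  -schema "replication(factor=3)" \
  -node 10.0.0.1,10.0.0.2,10.0.0.3
```

Quoting `replication(factor=3)` keeps the shell from interpreting the parentheses.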
17. Options help
cassandra-stress help -mode
Usage: -mode native [unprepared] cql3 [compression=?] [port=?] [user=?] [password=?] [auth-provider=?] [maxPending=?]
[connectionsPerHost=?] [protocolVersion=?]
user=? username
password=? password
unprepared force use of unprepared statements
compression=? (default=none)
port=? (default=9042)
auth-provider=? Fully qualified implementation of com.datastax.driver.core.AuthProvider
maxPending=? (default=) Maximum pending requests per connection
connectionsPerHost=? (default=) Number of connections per host
protocolVersion=? (default=NEWEST_SUPPORTED) CQL Protocol Version
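For a cluster with authentication enabled, the `-mode` sub-options above come into play. A sketch (the credentials are placeholders, not defaults you should ship):

```shell
# Authenticate and enable LZ4 transport compression on the native protocol
cassandra-stress write n=100000 \
  -mode native cql3 user=cassandra password=cassandra compression=lz4 \
  -node 127.0.0.1
```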
19. Scenario: tracking your staff
● Get all the activities for a staff member
● Get the latest event for a staff member
● Is Cassandra a good fit?
20. Defining your schema
table: staff_activities
table_definition: |
CREATE TABLE staff_activities (
name text,
when timeuuid,
what text,
PRIMARY KEY(name, when)
)
21. Column metadata
columnspec:
- name: name
size: uniform(5..10) # The names of the staff members are between 5-10 characters
population: uniform(1..10) # 10 possible staff members to pick from
- name: when
cluster: uniform(20..500) # Staff members do between 20 and 500 activities
- name: what
size: normal(10..100,50)
31. Insertion of data
insert:
# we only update a single partition in any given insert
partitions: fixed(1)
# we want to insert a single row per partition and we have between 20 and 500
# rows per partition
select: fixed(1)/500
batchtype: UNLOGGED
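A profile also needs a `queries` section for the `user` command to interleave with inserts. A hedged sketch matching the two scenario queries from slide 19 (the query names and exact CQL are illustrative, not taken from the deck):

```yaml
queries:
  # Get all the activities for a staff member
  activities:
    cql: select * from staff_activities where name = ?
    fields: samerow
  # Get the latest event for a staff member (clustering order assumed descending)
  latest:
    cql: select * from staff_activities where name = ? limit 1
    fields: samerow
```

The profile is then driven with the `user` command, with a configurable ratio between inserts and each query, e.g. `cassandra-stress user profile=staff.yaml "ops(insert=1,activities=10,latest=1)" duration=5m`.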