Jeff Jirsa
Using TimeWindowCompactionStrategy for Time Series Workloads
1 Who Am I?
2 LSM DBs
3 TWCS
4 The 1%
5 Things Nobody Else Told You About Compaction
6 Q&A
Who Am I?
(Or: Why You Should Believe Me)
We’ve Spent Some Time With Time Series
• We keep some data from sensors for a fixed time period
• Processes
• DNS queries
• Executables created
• It's a LOT of data
• 2015 Talk: One million writes per second with 60 nodes
• Multiple Petabytes Per Cluster
We’ve Spent Some Time With Time Series
• TWCS was written to solve problems CrowdStrike faced in production
• It wasn't meant to be clever; it was meant to be efficient and easy to reason about
• I'm on the pager rotation; this directly impacts my quality of life
We’ve Spent Some Time With Time Series
• I have better things to do on my off time
Log Structured – Database, Not Cabins
If You’re Going To Use Cassandra, Let’s Make Sure We Know How It Works
Log Structured Merge Trees
• Cassandra write path:
1. First the Commitlog
2. Then the Memtable
3. Eventually flushed to an SSTable
• Each SSTable is written exactly once
• Over time, Cassandra combines those data files
• Duplicate cells are merged
• Obsolete data is purged
• On reads, Cassandra searches for data in each SSTable, merging any existing records and returning the result
Real World, Real Problems
• If you can’t get compaction happy, your cluster will never be happy
• The write path relies on efficient flushing
• If your compaction strategy falls behind, you can block flushes (CASSANDRA-9882)
• The read path relies on efficient merging
• If your compaction strategy falls behind, each read may touch hundreds or thousands of sstables
• IO bound clusters are common, even with SSDs
• Dynamic Snitch - latency + “severity”
What We Hope For
• We accept that we need to compact sstables sometimes, but we want to do it when we have a good reason
• Good reasons:
• Data has been deleted and we want to reclaim space
• Data has been overwritten and we want to avoid merges on reads
• Our queries span multiple sstables, and we’re having to touch a lot of sstables on each read
• Bad Reasons:
• We hit some magic size threshold and we want to join two non-overlapping files together
• We’re aiming for a situation where the merge on read is tolerable
• Bloom filter is your friend – let’s read from as few sstables as possible
• We want as few tombstones as possible (this includes expired data)
• Tombstones create garbage, garbage creates sadness
Use The Defaults?
It’s Not Just Naïve, It’s Also Expensive
The Basics: SizeTieredCompactionStrategy
• Each time min_threshold (4) files of the same size appear, combine them into a new file
• Over time, you’ll naturally end up with a distribution of old data in large files, new data in small files
• Deleted data in large files stays on disk longer than desired because those files are very rarely compacted
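For reference, this is roughly what those STCS knobs look like in CQL. It's a sketch only; the keyspace and table names are made up, and the values shown are the defaults:

    ALTER TABLE metrics.sensor_readings
      WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'min_threshold': '4',    -- compact once 4 similarly sized sstables exist
        'max_threshold': '32'    -- never compact more than 32 files at once
      };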
SizeTieredCompactionStrategy
If each of the smallest blocks represents 1 day of data, and each write had a 90-day TTL, when do you actually delete files and reclaim disk space?
SizeTieredCompactionStrategy
• Expensive IO:
• Far more writes than necessary, you’ll recompact old data weeks after it was written
• Reads may touch a ton of sstables – we have no control over how data will be arranged on disk
• Expensive Operationally:
• Expired data doesn’t get dropped until you happen to re-compact the table it’s in
• You have to keep up to 50% spare disk
TWCS
Because Everything Else Made Me Sad
Kübler-Ross Stages of Grief
• Denial
• Anger
• Bargaining
• Depression
• Acceptance
Sad Operator: Stages of Grief
• Denial
• STCS and LCS aren’t gonna work, but DTCS will fix it
• Anger
• DTCS seemed to be the fix, and it didn’t work, either
• Bargaining
• What if we tweak all these sub-properties? What if we just fix things one at a time?
• Depression
• Still SOL at ~hundred node scale
• Can we get through this? Is it time for a therapist’s couch?
Sad Operator: Stages of Grief
• Acceptance
• Compaction is pluggable, we’ll write it ourselves
• Designed to be simple and efficient
• Group sstables into logical buckets
• STCS in the newest time window
• No more confusing options, just Window Size + Window Unit
• Base time seconds? Max age days? Overloading min_threshold for grouping? Not today.
• “12 Hours”, “3 Days”, “6 Minutes”
• Configure buckets so you have 20-30 buckets on disk
That’s It.
• 90 day TTL
• Unit = Days, # = 3
• Each file on disk spans 3 days of data (except the first window), expect ~30 + first window
• Expect to have at least 3 days of extra data on disk*
• 2 hour TTL
• Unit = Minutes, # = 10
• Each file on disk represents 10 minutes of data, expect 12-13 + first window
• Expect to have at least 10 minutes of extra data on disk*
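As a concrete sketch of the 90-day case above (keyspace and table names are hypothetical; the option names are the ones TWCS uses in mainline Cassandra 3.0.8+):

    ALTER TABLE metrics.sensor_events
      WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '3'
      }
      AND default_time_to_live = 7776000;  -- 90 days, in seconds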
Example: IO (Real Cluster)
Example: Load (Real Cluster)
The Only Real Optimization You Need
• Align your partition keys to your TWCS windows
• Bloom filter reads will only touch a single sstable
• Deletion gets much easier because you get rid of overlapping ranges
• Bucketing partitions keeps partition sizes reasonable ( < 100MB ), which saves you a ton of GC pressure
• If you’re using 30 day TTL and 1 day TWCS windows, put a “day_of_year” field into the partition key
• Use parallel async reads to read more than one day at a time
• Spread reads across multiple nodes
• Each node should touch exactly 1 sstable on disk (watch timezones)
• That sstable is probably hot for all partitions, so it’ll be in page cache
• Extrapolate for other windows (you may have to chunk things up into 3 day buckets or 30 minute buckets, but it'll be worth it)
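A minimal schema sketch of that idea, assuming a 30 day TTL and 1 day windows (all names and values here are hypothetical):

    CREATE TABLE metrics.sensor_readings_by_day (
        sensor_id   uuid,
        day_of_year int,        -- e.g. 2016-09-07 -> 251
        reading_ts  timestamp,
        value       double,
        PRIMARY KEY ((sensor_id, day_of_year), reading_ts)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
      }
      AND default_time_to_live = 2592000;  -- 30 days

    -- Issue one query per day bucket, in parallel, from the client:
    SELECT * FROM metrics.sensor_readings_by_day
     WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000 AND day_of_year = 250;
    SELECT * FROM metrics.sensor_readings_by_day
     WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000 AND day_of_year = 251;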
What We’ve Discussed Is Good Enough For 99%
Of Time Series Use Cases
But let’s make sure the 1% knows what’s up
Out Of Order Writes
• If we mix write timestamps with "USING TIMESTAMP"…
• Life isn’t over, it just potentially blocks expiration
• Goal:
• Avoid mixing timestamps within any given sstable
• Options:
• Don’t mix in the memtable
• Don’t use the memtable
Out Of Order Writes
• Don’t comingle in the memtable
• If we have a queue-like workflow, consider the following option:
• Pause kafka consumer / celery worker / etc
• “nodetool flush”
• Write old data with “USING TIMESTAMP”
• “nodetool flush”
• Resume consumer/workers for new data
• Positives: No comingled data
• Negatives: Have to pause ingestion
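A sketch of the backfill step, reusing the hypothetical table from the partition-key example above; the nodetool steps are shown as comments:

    -- pause consumers, then: nodetool flush
    INSERT INTO metrics.sensor_readings_by_day (sensor_id, day_of_year, reading_ts, value)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 251, '2016-09-07 12:00:00+0000', 42.0)
    USING TIMESTAMP 1473249600000000;  -- microseconds, matching when the data was originally generated
    -- nodetool flush again, then resume consumers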
Out Of Order Writes
• Don’t use the memtable
• CQLSSTableWriter
• Yuki has a great blog at: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
• Write sstables offline
• Stream them in with sstableloader
• Positives: No comingled data, no pausing ingestion, incredibly fast, easy to parallelize
• Negatives: Requires code (but it’s not difficult code, your ops team should be able to do it)
Per-Window Major Compaction
• At the end of each window, you’re going to see a major compaction for all sstables in that window
• Expect a jump in CPU usage, disk usage, and disk IO
• The DURATION of these increases depends on your write rate and window size
• Larger windows will take longer to compact because you’ll have more data on disk
• If this is a problem for you, you’re under provisioned
Per-Window Major Compaction
CPU Usage
During the end-of-window major, CPU usage on ALL OF THE NODES (in all DCs) will increase at the same time. This will likely impact your read latency.
When you validate TWCS, make sure your application behaves well at this transition.
We can surely fix this; we just need to find a way to avoid cluttering the options.
Per-Window Major Compaction
Disk Usage
During the daily major, disk usage on ALL OF THE NODES will increase at the same time.
Per-Window Major Compaction
Disk Usage
In some cases, you'll see the window major compaction run twice because of the timing of flush. You can manually flush (cron) to work around it if it bothers you.
This is on my list of things to fix. There's no reason to do two majors; it would be better to either delay the first major until we're sure it's time, or keep a history of having already done a window major compaction and skip it the second time.
There Are Things Nobody Told You About Compaction
The More You Know…
Things Nobody Told You About Compaction
• Compaction Impacts Read Performance More Than Write Performance
• Typical advice is use LCS if you need fast reads, STCS if you need fast writes
• LCS optimizes reads by limiting the # of potential SSTables you’ll need to touch on the read path
• The goal of LCS (fast reads/low latency) and the act of keeping levels are in competition with each other
• It takes a LOT of IO for LCS to keep up, and it’s generally not a great fit for most time series use cases
• LCS will negatively impact your read latencies in any sufficiently busy cluster
Things Nobody Told You About Compaction
• You can change the compaction strategy on a single node using JMX
• The change won't persist through restarts, but it's often a great way to test / canary before rolling it out to the full cluster
• You can change other useful things in JMX, too. No need to restart to change:
• Compaction threads
• Compaction throughput
• If you see an IO impact of changing compaction strategies, you can slow-roll it out to the cluster using JMX.
Things Nobody Told You About Compaction
• Compaction Task Prioritization
• Just kidding, stuff’s going to run in an order you don’t like.
• There’s nothing you can do about it (yet)
• If you run Cassandra long enough, you'll eventually OOM or run a box out of disk doing cleanup or bootstrap or validation compactions or similar
• We run watchdog daemons that watch for low disk/RAM conditions and interrupt cleanups/compactions
• Not provided, but it’s a 5 line shell script
• 2.0 -> 2.1 was a huge change
• Cleanup / Scrub used to be single threaded
• Someone thought it was a good idea to make it parallel (CASSANDRA-5547)
• Now cleanup/scrub blocks normal sstable compactions
• If you run parallel operations, be prepared to interrupt and restart them if you run out of disk or RAM, or if your sstable count gets too high (CASSANDRA-11179). Consider using -seq or userDefinedCleanup (JMX)
• CASSANDRA-11218 (priority queue)
Things Nobody Told You About Compaction
• "Fully Expired"
• Cassandra is super conservative
• Find global minTimestamp of any overlapping sstable, compacting sstable, and memtables
• This is the oldest “live” data
• Build a list of “candidates” that we think are fully expired
• See if the candidates are completely older than that global minTimestamp
• Operators are not as conservative
• CASSANDRA-7019 / Philip Thompson’s talk from yesterday
• When you're running out of disk space, Cassandra's definition may seem silly
• Any out of order write can “block” a lot of data from being deleted
• Read repair, hints, whatever
• It used to be so hard to figure out that Cassandra now has `sstableexpiredblockers`
Things Nobody Told You About Compaction
• Tombstone compaction sub-properties
• Show of hands if you’ve ever set these on a real cluster
• Cassandra has logic to try to eliminate mostly-expired sstables
• Three basic knobs:
1. What % of the table must be tombstones for it to be worth compacting?
• tombstone_threshold (0.2 -> 0.8)
2. How long has it been since that file has been created?
• tombstone_compaction_interval (how much IO do you have?)
3. Should we try to compact the tombstones away even if we suspect it’s not going to be successful?
• unchecked_tombstone_compaction (false -> true)
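A sketch of what setting all three looks like; the values shown are illustrative, not recommendations, and the table name is hypothetical:

    ALTER TABLE metrics.sensor_readings
      WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'tombstone_threshold': '0.4',               -- default 0.2
        'tombstone_compaction_interval': '86400',   -- seconds; default is one day
        'unchecked_tombstone_compaction': 'true'    -- default false
      };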
Q&A
Spoilers
TWCS is available in mainline Cassandra in 3.0.8 and newer.
If you’re running 2.0, 2.1, or 2.2, you can build a JAR from source on
github.com/jeffjirsa/twcs
You PROBABLY don’t need to do anything special to change from DTCS -> TWCS
Thanks!
CrowdStrike Is Hiring
Talk to me about TWCS on Twitter: @jjirsa
Find me on IRC: jeffj on Freenode (#cassandra)
If you’re running 2.0, 2.1, or 2.2, you can build a JAR from source on
https://github.com/jeffjirsa/twcs
