If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice now you can learn how to do it. We'll look at possible data models and the the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
6. Dynamo Paper(2007)
• How do we build a data store that is:
• Reliable
• Performant
• “Always On”
• Nothing new and shiny
Evolutionary. Real. Computer Science
Also the basis for Riak and Voldemort
17. Token
Server
•Each partition is a 128 bit value
•Consistent hash between 2-63
and 264
•Each node owns a range of those
values
•The token is the beginning of that
range to the next node’s token value
•Virtual Nodes break these down
further
Data
Token Range
0 …
25. Consistency level
Consistency Level Number of Nodes Acknowledged
One One - Read repair triggered
Local One One - Read repair in local DC
Quorum 51%
Local Quorum 51% in local DC
36. Select
id | call_sign | country_code | elevation | lat | long | name | state_code
--------------+-----------+--------------+-----------+--------+---------+-------------------------------+------------
727930:24233 | KSEA | US | 121.9 | 47.467 | -122.32 | SEATTLE SEATTLE-TACOMA INTL A | WA
SELECT id, call_sign, country_code, elevation, lat, long, name, state_code
FROM weather_station
WHERE id = '727930:24233';
Fields
Table Name
Primary Key: Partition Key Required
37. Update
UPDATE weather_station
SET name = 'SeaTac International Airport'
WHERE id = '727930:24233';
id | call_sign | country_code | elevation | lat | long | name | state_code
--------------+-----------+--------------+-----------+--------+---------+------------------------------+------------
727930:24233 | KSEA | US | 121.9 | 47.467 | -122.32 | SeaTac International Airport | WA
Table Name
Fields to Update: Not in Primary Key
Primary Key
39. Collections
Set
CREATE TABLE weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
equipment set<text>
PRIMARY KEY(id)
);
equipment set<text>
CQL Type: For Ordering
Column Name
40. Collections
Set
List
CREATE TABLE weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
equipment set<text>,
service_dates list<timestamp>,
PRIMARY KEY(id)
);
equipment set<text>
service_dates list<timestamp>
CQL Type
Column Name
CQL Type: For Ordering
Column Name
41. Collections
Set
List
Map
CREATE TABLE weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
equipment set<text>,
service_dates list<timestamp>,
service_notes map<timestamp,text>,
PRIMARY KEY(id)
);
equipment set<text>
service_dates list<timestamp>
service_notes map<timestamp,text>
CQL Type
Column Name
Column Name
CQL Key Type CQL Value Type
CQL Type: For Ordering
Column Name
42. UDF and UDA
User Defined Function
CREATE OR REPLACE AGGREGATE group_and_count(text)
SFUNC state_group_and_count
STYPE map<text, int>
INITCOND {};
CREATE FUNCTION state_group_and_count( state map<text, int>, type text )
CALLED ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java AS '
Integer count = (Integer) state.get(type);
if (count == null)
count = 1;
else count++;
state.put(type, count);
return state; ' ;
User Defined Aggregate
As of Cassandra 2.2
43. Example: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
44. Queries supported
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
dewpoint double,
pressure double,
wind_direction int,
wind_speed double,
sky_condition int,
sky_condition_text text,
one_hour_precip double,
six_hour_precip double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Get weather data given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
45. Primary Key
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
dewpoint double,
pressure double,
wind_direction int,
wind_speed double,
sky_condition int,
sky_condition_text text,
one_hour_precip double,
six_hour_precip double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
51. Partition keys
10010:99999 Murmur3 Hash Token = 7224631062609997448
722266:13850 Murmur3 Hash Token = -6804302034103043898
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.6);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘722266:13850’,2005,12,1,7,-5.6);
Consistent hash. 128 bit number
between 2-63
and 264
52. Partition keys
10010:99999 Murmur3 Hash Token = 15
722266:13850 Murmur3 Hash Token = 77
For this example, let’s make it a
reasonable number
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.6);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘722266:13850’,2005,12,1,7,-5.6);
58. Date Tiered Compaction Strategy
•Group similar time blocks
•Never compact again
•Used for high density
SSTable
SSTable
SSTable
T=2015-01-01 -> 2015-01-5
T=2015-01-06 -> 2015-01-10
T=2015-01-11 -> 2015-01-15
59. Storage Model - Logical View
2005:12:1:10
-5.6
2005:12:1:9
-5.1
2005:12:1:8
-4.9
10010:99999
10010:99999
10010:99999
wsid hour temperature
2005:12:1:7
-5.3
10010:99999
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
60. 2005:12:1:10
-5.6 -5.3-4.9-5.1
Storage Model - Disk Layout
2005:12:1:9 2005:12:1:8
10010:99999
2005:12:1:7
Merged, Sorted and Stored Sequentially
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
61. 2005:12:1:10
-5.6
2005:12:1:11
-4.9 -5.3-4.9-5.1
Storage Model - Disk Layout
2005:12:1:9 2005:12:1:8
10010:99999
2005:12:1:7
Merged, Sorted and Stored Sequentially
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
62. 2005:12:1:10
-5.6
2005:12:1:11
-4.9 -5.3-4.9-5.1
Storage Model - Disk Layout
2005:12:1:9 2005:12:1:8
10010:99999
2005:12:1:7
Merged, Sorted and Stored Sequentially
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
2005:12:1:12
-5.4
64. Query patterns
• Range queries
• “Slice” operation on disk
Single seek on disk
10010:99999
Partition key for locality
SELECT wsid,hour,temperature
FROM raw_weather_data
WHERE wsid='10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
2005:12:1:10
-5.6 -5.3-4.9-5.1
2005:12:1:9 2005:12:1:8 2005:12:1:7
65. Query patterns
• Range queries
• “Slice” operation on disk
Programmers like this
Sorted by event_time
2005:12:1:10
-5.6
2005:12:1:9
-5.1
2005:12:1:8
-4.9
10010:99999
10010:99999
10010:99999
weather_station hour temperature
2005:12:1:7
-5.3
10010:99999
SELECT weatherstation,hour,temperature
FROM temperature
WHERE weatherstation_id=‘10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;