Cassandra Data Modeling Workshop
  Matthew F. Dennis // @mdennis
Overview
●   Hopefully interactive
●   Use cases submitted via Google Moderator, email, IRC, etc.
●   Interesting and/or common requests in the slides to get us started
●   Bring up others if you have them!
Data Modeling Goals
●   Keep data that is queried together stored together on disk
●   More generally, think about the efficiency of querying your data and work backward from there to a model in Cassandra
●   Don't try to normalize your data (contrary to many use cases in relational databases)
●   It is usually better to keep a record that something happened than to change a value (not always advisable or possible)
ClickStream Data
                     (use case #1)

●   A ClickStream (in this context) is the sequence of actions a user of an application performs
●   Usually this refers to clicking links in a WebApp
●   Useful for ad selection, error recording, UI/UX improvement, A/B testing, debugging, et cetera
●   The Google Moderator request gave little detail on the purpose of collecting the ClickStream data – so I made some up
ClickStream Data Defined
●   Record the actions a user takes within a session, for debugging purposes if the app/browser/page/server crashes
Recording Sessions
●   CF for the sessions a user has had
    ●   Row Key is the user name/id
    ●   Column Name is the session id (TimeUUID)
    ●   Column Value is empty (or the length of the session, or some aggregated details about the session after it ended)
●   CF for the actual sessions
    ●   Row Key is the TimeUUID session id
    ●   Column Name is the timestamp/TimeUUID of each click
    ●   Column Value is details about that click (serialized)
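To make the two CFs concrete, here is a minimal sketch of the write path using pycassa (a Python Thrift client from this era). The keyspace name ClickStream, the CF names, and the assumption that both CFs use a TimeUUID comparator are mine, not from the slides:

```python
import json
import uuid

import pycassa

# Assumed keyspace/CF names; both CFs assumed to use a TimeUUID comparator
pool = pycassa.ConnectionPool('ClickStream', ['localhost:9160'])
user_sessions = pycassa.ColumnFamily(pool, 'UserSessions')
sessions = pycassa.ColumnFamily(pool, 'Sessions')

def start_session(user_id):
    """Register a new session id (TimeUUID) under the user's row."""
    session_id = uuid.uuid1()
    user_sessions.insert(user_id, {session_id: ''})  # empty value; aggregates can come later
    return session_id

def record_click(session_id, click_details):
    """Append one click to the session's row, keyed by a TimeUUID."""
    sessions.insert(session_id, {uuid.uuid1(): json.dumps(click_details)})

sid = start_session('user42')
record_click(sid, {'url': '/home', 'clicked': 'signup'})
```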
UserSessions Column Family
              Session_01    Session_02    Session_03
              (TimeUUID)    (TimeUUID)    (TimeUUID)
  userId
              (empty/agg)   (empty/agg)   (empty/agg)


●   Most recent session
●   All sessions for a given time period
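Both queries above are single column slices on the user's row; a sketch, reusing the assumed pycassa setup from the previous example:

```python
import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('ClickStream', ['localhost:9160'])
user_sessions = pycassa.ColumnFamily(pool, 'UserSessions')

# Most recent session: reverse the slice and take one column.
latest = user_sessions.get('user42', column_reversed=True, column_count=1)

# Sessions for a time period: TimeUUID column names sort chronologically,
# so slice between TimeUUIDs built from the period boundaries (epoch seconds).
start = convert_time_to_uuid(1293840000, lowest_val=True)   # 2011-01-01 UTC
end = convert_time_to_uuid(1296518400, lowest_val=False)    # 2011-02-01 UTC
january = user_sessions.get('user42', column_start=start, column_finish=end)
```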
Sessions Column Family
               timestamp_01     timestamp_02     timestamp_03
  SessionId
 (TimeUUID)    ClickData        ClickData        ClickData
               (json/xml/etc)   (json/xml/etc)   (json/xml/etc)



●   Retrieve entire session's ClickStream (row)
●   Order of clicks/events preserved
●   Retrieve ClickStream for a slice of time within the session
●   First action taken in a session
●   Most recent action taken in a session
●   Why JSON/XML/etc?
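Each of those queries maps onto a slice of the single session row; a hedged sketch (function names and the assumed pycassa setup are mine):

```python
import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('ClickStream', ['localhost:9160'])
sessions = pycassa.ColumnFamily(pool, 'Sessions')

def full_clickstream(session_id):
    """Entire session, clicks returned in chronological order (one row read)."""
    return sessions.get(session_id, column_count=100000)

def clicks_between(session_id, start_ts, end_ts):
    """Clicks within a time slice of the session (epoch seconds)."""
    return sessions.get(session_id,
                        column_start=convert_time_to_uuid(start_ts, lowest_val=True),
                        column_finish=convert_time_to_uuid(end_ts, lowest_val=False))

def first_click(session_id):
    return sessions.get(session_id, column_count=1)

def last_click(session_id):
    return sessions.get(session_id, column_reversed=True, column_count=1)
```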
Alternatives?
Of Course
         (depends on what you want to do)
●   Secondary Indexes
●   All Sessions in one row
●   Track by time of activity instead of session
Secondary Indexes Applied
●   Drop the UserSessions CF and use secondary indexes instead
●   Use a “well known” column in each session row to record the user; the secondary index is created on that column
●   Doesn't work so well when storing aggregates about sessions in the UserSessions CF
●   Better when you want to retrieve all sessions a user has had
All Sessions In One Row Applied
●   Row Key is userId
●   Column Name is a composite of timestamp and sessionId
●   Can efficiently request the activity of a user across all sessions within a specific time range
●   Rows could potentially grow quite large, so be careful
●   Reads will almost always require at least two seeks on disk
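A sketch of this variant, assuming a CF (here called UserActivity) created with comparator CompositeType(LongType, TimeUUIDType), so that pycassa column names are (timestamp, sessionId) tuples:

```python
import time
import uuid

import pycassa

pool = pycassa.ConnectionPool('ClickStream', ['localhost:9160'])
# Assumed created with comparator CompositeType(LongType(), TimeUUIDType())
activity = pycassa.ColumnFamily(pool, 'UserActivity')

def record_click(user_id, session_id, serialized_click):
    ts_ms = int(time.time() * 1000)
    activity.insert(user_id, {(ts_ms, session_id): serialized_click})

def activity_between(user_id, start_ms, end_ms):
    # Slicing on the first composite component spans all sessions in the range.
    return activity.get(user_id,
                        column_start=(start_ms,),
                        column_finish=(end_ms,),
                        column_count=100000)
```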
Time Period Partitioning Applied
●   Row Key is a composite of userId and a time “bucket”
    ●   e.g. jan_2011 or jan_01_2011 for month or day buckets respectively
●   Column Name is the TimeUUID of the click
●   Column Value is the serialized click data
●   Avoids always requiring multiple seeks when the user has old data but only recent data is requested
●   Easy to lazily aggregate old activity
●   Can still efficiently request the activity of a user across all sessions within a specific time range
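A sketch of the bucketed layout with month buckets; the CF name, bucket-key format, and helper names are assumptions. multiget fetches all the bucket rows a range touches in one round trip:

```python
import json
import uuid
from datetime import datetime

import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('ClickStream', ['localhost:9160'])
clicks = pycassa.ColumnFamily(pool, 'UserClicks')  # assumed CF, TimeUUID comparator

def bucket_key(user_id, dt):
    return '%s:%s' % (user_id, dt.strftime('%b_%Y').lower())  # e.g. user42:jan_2011

def record_click(user_id, click_details):
    now = datetime.utcnow()
    clicks.insert(bucket_key(user_id, now), {uuid.uuid1(): json.dumps(click_details)})

def activity(user_id, bucket_months, start_ts, end_ts):
    """bucket_months: a datetime for every month bucket the range touches."""
    keys = [bucket_key(user_id, m) for m in bucket_months]
    return clicks.multiget(keys,
                           column_start=convert_time_to_uuid(start_ts, lowest_val=True),
                           column_finish=convert_time_to_uuid(end_ts, lowest_val=False))
```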
Rolling Time Window Of Data Points
                    (use case #2)
●   “Similar to RRDTool” was the example given
●   Essentially: store a series of data points within a rolling window
●   This (or something similar) is a common request from Cassandra users
Data Points Defined
●   Each data point has a value (or multiple values)
●   Each data point corresponds to a specific point in time or an interval/bucket (e.g. the 5th minute of the 17th hour on some date)
Time Window Model
                     System7:RenderTime

                TimeUUID0    TimeUUID1    TimeUUID2
     s7:rt        0.051        0.014        0.173

     (some request took 0.014 seconds to render)


●   Row Key is the id of the time window data you are tracking (e.g. server7:render_time)
●   Column Name is the timestamp (or TimeUUID) the event occurred at
●   Column Value is the value of the event (e.g. 0.051)
The Details
●   Cassandra TTL values are key here
    ●   When you insert each data point, set the TTL to the max time range you will ever request; there is very little overhead to expiring columns
●   When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call
●   Consider partitioning the rows by a known time period (e.g. “year”) if you plan on keeping a long history of data (NB: requires slightly more complex logic in the app if a time range spans such a period)
●   Very efficient queries for any window of time
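A sketch of the insert-with-TTL and min/max-TimeUUID slice described above; the keyspace/CF names and the 7-day maximum window are assumptions:

```python
import time
import uuid

import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
points = pycassa.ColumnFamily(pool, 'TimeWindowData')  # TimeUUID comparator assumed

MAX_WINDOW = 7 * 24 * 3600  # largest range we will ever request (assumption)

def record_point(series, value):
    # The TTL bounds the row's size; expiring columns has very little overhead.
    points.insert(series, {uuid.uuid1(): str(value)}, ttl=MAX_WINDOW)

def window(series, start_ts, end_ts):
    # Construct TimeUUIDs for the min/max of the range and use them as slice ends.
    return points.get(series,
                      column_start=convert_time_to_uuid(start_ts, lowest_val=True),
                      column_finish=convert_time_to_uuid(end_ts, lowest_val=False),
                      column_count=100000)

record_point('server7:render_time', 0.051)
last_hour = window('server7:render_time', time.time() - 3600, time.time())
```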
Rolling Window Of Counters
                (use case #3)
●   “How to model rolling time window that contains counters with time buckets of monthly (12 months), weekly (4 weeks), daily (7 days), hourly (24 hours)? Example would be; how many times user logged into a system in last 24 hours, last 7 days ...”
●   Timezones and the “rolling window” are what make this interesting
Rolling Time Window Details
●   One row for every granularity you want to track (e.g. day, hour)
●   Row Key consists of the granularity, metric, user and system
●   Column Name is a “fixed” time bucket on UTC time
●   Column Values are counts of the logins in that bucket
●   get_slice calls return multiple counters, which are then summed up
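A sketch of this layout, assuming a counter CF (here called LoginCounters) and the row-key convention from the slides; requires the counter support introduced in Cassandra 0.8:

```python
import time

import pycassa

pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
counters = pycassa.ColumnFamily(pool, 'LoginCounters')  # assumed counter CF

def record_login(user, system, when=None):
    t = time.gmtime(when if when is not None else time.time())  # fixed UTC buckets
    base = '%s:%s:logins' % (user, system)
    counters.add(base + ':by_day', time.strftime('%Y%m%d', t))     # e.g. 20110107
    counters.add(base + ':by_hour', time.strftime('%Y%m%d%H', t))  # e.g. 2011010710

def sum_buckets(key, start_bucket, end_bucket):
    """One get_slice over bucket names; the returned counters get summed."""
    cols = counters.get(key, column_start=start_bucket, column_finish=end_bucket)
    return sum(cols.values())
```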
Rolling Time Window Counter Model
                 user3:system5:logins:by_day

                 20110107    ...    20110523
    U3:S5:L:D        2       ...        7

    (2 logins on Jan 7th 2011 for user 3 on system 5;
     7 logins on May 23rd 2011 for user 3 on system 5)


                 user3:system5:logins:by_hour

                 2011010710    ...    2011052316
    U3:S5:L:H        1         ...        2

    (1 login for user 3 on system 5 in the 10th hour of Jan 7th 2011;
     2 logins for user 3 on system 5 in the 16th hour of May 23rd 2011)
Rolling Time Window Queries
●   The time window is rolling and there are timezones other than UTC, so a query takes three slices:
    ●   one get_slice for the “middle” counts
    ●   one get_slice for the “left end”
    ●   one get_slice for the “right end”
Example: logins for the past 7 days
●   Determine the date/time boundaries
●   Determine the UTC days that are wholly contained within your boundaries; select and sum those day counters
●   Select and sum the counters for the remaining hours on either side of those UTC days
●   O(1) queries (3 in this case), which can be requested from C* in parallel
●   NB: some timezones are annoying (e.g. 15- or 30-minute offsets); I try to ignore them
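A sketch of the three-slice query for a UTC-aligned window, reusing the hypothetical LoginCounters layout from earlier; boundary alignment is simplified, and a non-UTC timezone would shift the midnight calculations by its offset:

```python
import time

import pycassa

pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
counters = pycassa.ColumnFamily(pool, 'LoginCounters')  # assumed counter CF

def _sum(key, start, finish):
    try:
        return sum(counters.get(key, column_start=start, column_finish=finish).values())
    except pycassa.NotFoundException:
        return 0  # no buckets in that range

def logins_last_7_days(user, system, now=None):
    now = now if now is not None else time.time()
    start = now - 7 * 86400
    first_midnight = (int(start) // 86400 + 1) * 86400  # start of first whole UTC day
    last_midnight = (int(now) // 86400) * 86400         # end of last whole UTC day
    base = '%s:%s:logins' % (user, system)
    day = lambda ts: time.strftime('%Y%m%d', time.gmtime(ts))
    hour = lambda ts: time.strftime('%Y%m%d%H', time.gmtime(ts))
    return (_sum(base + ':by_hour', hour(start), hour(first_midnight - 3600))          # left end
            + _sum(base + ':by_day', day(first_midnight), day(last_midnight - 86400))  # middle
            + _sum(base + ':by_hour', hour(last_midnight), hour(now)))                 # right end
```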
Alternatives?
                         (of course)
●   If you're counting logins and each user doesn't log in hundreds of times a day, just have one row per user with a TimeUUID column name for the time the login occurred
●   Supports any timezone/range/granularity easily
●   More expensive for large ranges (e.g. a year) regardless of granularity, so cache results (in C*) lazily
●   NB: caching results for rolling windows is not usually helpful (because, well, it's rolling and always changing)
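A sketch of this one-row-per-user alternative (CF name assumed): get_count counts the columns in a slice server-side, so any window at any granularity is a single counted slice:

```python
import uuid

import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('Metrics', ['localhost:9160'])
logins = pycassa.ColumnFamily(pool, 'UserLogins')  # TimeUUID comparator assumed

def record_login(user_id):
    logins.insert(user_id, {uuid.uuid1(): ''})  # the column name IS the login time

def login_count(user_id, start_ts, end_ts):
    # Any timezone/range/granularity: just count the columns in the slice.
    return logins.get_count(user_id,
                            column_start=convert_time_to_uuid(start_ts, lowest_val=True),
                            column_finish=convert_time_to_uuid(end_ts, lowest_val=False))
```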
Eventually Atomic
                            (use case #4)
●   “When there are many to many or one to many relations involved how to model that and also keep it atomic? for eg: one user can upload many pictures and those pictures can somehow be related to other users as well.”
●   Attempting full ACID compliance in distributed systems is a bad idea (and impossible in the general sense)
●   However, consistency is important and can certainly be achieved in C*
●   Many approaches / alternatives exist
●   I like the transaction log approach, especially in the context of C*
Transaction Logs
                   (in this context)
●   Records what is going to be performed before it is actually performed
●   Performs the actions that need to be atomic (in the indivisible sense, not the all-at-once sense)
●   Marks that the actions were performed
In Cassandra
●   Serialize all actions that need to be performed into a single column: JSON, XML, YAML (yuck!), cpickle, JSO, et cetera
    ●   Row Key = randomly chosen C* node token
    ●   Column Name = TimeUUID
●   Perform the actions
●   Delete the column
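A sketch of the log/perform/delete cycle; the token list, the JSON action format, and the apply_action helper are all hypothetical, and the logged actions must be idempotent writes:

```python
import json
import random
import uuid

import pycassa

pool = pycassa.ConnectionPool('App', ['localhost:9160'])
xact_log = pycassa.ColumnFamily(pool, 'XACT_LOG')

# Hypothetical: the ring's node tokens, known to the application
NODE_TOKENS = ['0', '567382630219313600213280781225881375']

def apply_action(action):
    """Hypothetical helper: apply one serialized (idempotent) mutation."""
    cf = pycassa.ColumnFamily(pool, action['cf'])
    cf.insert(action['key'], {action['col']: action['val']})

def eventually_atomic(actions):
    key = random.choice(NODE_TOKENS)
    col = uuid.uuid1()
    # 1) durably record what WILL be done before doing any of it
    xact_log.insert(key, {col: json.dumps(actions)},
                    write_consistency_level=pycassa.ConsistencyLevel.QUORUM)
    # 2) perform the actions that need to be (eventually) atomic
    for action in actions:
        apply_action(action)
    # 3) mark completion by deleting the log column
    xact_log.remove(key, columns=[col])
```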
Configuration Details
●   Short GC_Grace on the XACT_LOG Column Family (e.g. 1 hour)
●   Write to the XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability (if the write fails with an unavailable exception, pick a different node token and/or node and try again; same semantics as a traditional relational DB)
●   1M memtable ops, 1 hour memtable flush time
Failures
●   Before insert into the XACT_LOG
●   After insert, before actions
●   After insert, in middle of actions
●   After insert, after actions, before delete
●   After insert, after actions, after delete
Recovery
●   Each C* node has a cron job, offset from every other node's by some time period
●   Each job runs the same code: multiget_slice for all node tokens, for all columns older than some time period
●   Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns, because normally things are working normally)
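A sketch of that recovery job, reusing the hypothetical token list and apply_action helper from the previous example; the replay-age cutoff is an assumption:

```python
import json
import time

import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('App', ['localhost:9160'])
xact_log = pycassa.ColumnFamily(pool, 'XACT_LOG')

NODE_TOKENS = ['0', '567382630219313600213280781225881375']  # hypothetical
REPLAY_AGE = 600  # only touch entries older than 10 minutes (assumption)

def recover(apply_action):
    cutoff = convert_time_to_uuid(time.time() - REPLAY_AGE, lowest_val=False)
    # multiget_slice across every token row for columns older than the cutoff;
    # rows with no such columns simply don't come back (the normal case).
    for key, cols in xact_log.multiget(NODE_TOKENS, column_finish=cutoff).items():
        for col, serialized in cols.items():
            for action in json.loads(serialized):
                apply_action(action)  # idempotent, so replay is safe
            xact_log.remove(key, columns=[col])
```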
XACT_LOG Comments
●   Idempotent writes are awesome (that's why this works so well)
●   Doesn't work so well for counters (they're not idempotent)
●   Clients must be able to deal with temporarily inconsistent data (they have to do this anyway)
●   Could use a reliable queuing service (e.g. SQS) instead of polling: push to SQS first, then to the XACT log
Q?
Cassandra Data Modeling Workshop
  Matthew F. Dennis // @mdennis
