Apache Cassandra in Action




 Jonathan Ellis
 jbellis@datastax.com / @spyced
Why Cassandra?
•
    Relational databases are not designed to
    scale
•
    B-trees are slow
    –
        and require read-before-write
(“The eBay Architecture,” Randy Shoup and Dan Pritchett)
[Diagram: Reader, Writer, Memtable, Commitlog]
The Log-Structured Merge-Tree,
Bigtable: A Distributed Storage
System for Structured Data
Bigtable, 2006
Dynamo, 2007
OSS, 2008
Incubator, 2009
TLP, 2010
Cassandra in production
•
    Digital Reasoning: NLP + entity analytics
•
    OpenWave: enterprise messaging
•
    OpenX: largest publisher-side ad network in the
    world
•
    Cloudkick: performance data & aggregation
•
    SimpleGEO: location-as-API
•
    Ooyala: video analytics and business intelligence
•
    ngmoco: massively multiplayer game worlds
FUD?
•
    “Cassandra is only appropriate for
    unimportant data.”
Durability
•
    Write to commitlog
    –
        fsync is cheap since it’s append-only
•
    Write to memtable
•
    [amortized] flush memtable to sstable
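The three steps above can be sketched as a toy in-memory model. This is only an illustration in Python 3: the class name TinyStore, the flush_threshold parameter, and the use of lists in place of files are assumptions, not Cassandra's implementation.

```python
import json


class TinyStore:
    """Toy write path: commitlog append, memtable update, flush to a sorted run."""

    def __init__(self, flush_threshold=3):
        self.commitlog = []      # stands in for an append-only file on disk
        self.memtable = {}       # in-memory rows, keyed by row key
        self.sstables = []       # each flush adds one immutable, sorted list
        self.flush_threshold = flush_threshold

    def write(self, key, columns):
        # 1. Append to the commitlog first so the write survives a crash.
        self.commitlog.append(json.dumps({"key": key, "columns": columns}))
        # 2. Apply to the memtable; nothing is read back from disk.
        self.memtable.setdefault(key, {}).update(columns)
        # 3. Flush once enough rows have accumulated (the amortized step).
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Rows are written out sorted by key and never modified afterwards.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commitlog = []      # flushed rows no longer need the log


store = TinyStore(flush_threshold=2)
store.write("jbellis", {"password": "*"})
store.write("driftx", {"password": "*"})    # second row triggers a flush
store.write("zznate", {"name": "Nate"})
print(store.sstables)
print(store.memtable)
```

The point of the sketch is that a write only appends and updates memory; the sorted on-disk structure is produced later, in bulk.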
SSTable format, briefly


       <key 127>
       <key 255>             <row data 0>
       ...                   <row data 1>
                             ...
                             <row data 127>
                             ...
                             <row data 255>
                             ...


                   Sorted [clustered] by row key
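One way to read the diagram: a small index holds every 128th key, and a lookup bisects the index and then scans a single block of rows. The sketch below assumes that reading; the interval of 128 is taken from the <key 127> and <key 255> labels, and the function names are made up for illustration.

```python
from bisect import bisect_left

INDEX_INTERVAL = 128  # assumption based on the <key 127>, <key 255> labels


def build_index(rows, interval=INDEX_INTERVAL):
    """rows is a list of (key, data) sorted by key; sample every interval-th key."""
    return [(rows[i][0], i) for i in range(interval - 1, len(rows), interval)]


def lookup(rows, index, key, interval=INDEX_INTERVAL):
    """Bisect the sparse index, then scan only the block that can hold the key."""
    index_keys = [k for k, _ in index]
    slot = bisect_left(index_keys, key)      # first sampled key >= key
    if slot < len(index):
        end = index[slot][1] + 1
        start = end - interval
    else:                                    # key sorts after the last sample
        start = len(index) * interval
        end = len(rows)
    for k, data in rows[start:end]:
        if k == key:
            return data
    return None


rows = [("key%04d" % i, "row data %d" % i) for i in range(300)]
index = build_index(rows)
print(index)
print(lookup(rows, index, "key0200"))
```

Because rows are sorted by key, the index stays small and each lookup touches at most one block.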
Scaling
[Diagram sequence: a ring of nodes A, L, T, W. Node F joins between A and L, and the range (A-L] splits into (A-F] and (F-L]. The last frame places key “C” on the ring.]
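The range notation (A-L] can be made concrete with a few lines of Python. This is a sketch that assumes single-letter tokens compared as plain strings, as in the diagram; the function name owner is hypothetical.

```python
from bisect import bisect_left


def owner(ring, key):
    """Return the node whose range (previous token, token] contains key."""
    tokens = sorted(ring)
    slot = bisect_left(tokens, key)        # first token >= key
    return tokens[slot % len(tokens)]      # wrap around past the last token


ring = ["A", "L", "T", "W"]
before = owner(ring, "C")    # C falls in (A-L]
ring.append("F")             # F joins between A and L
after = owner(ring, "C")     # C now falls in (A-F]
print(before, after, owner(ring, "H"))
```

Adding F only changes ownership for keys in (A-F]; every other range on the ring is untouched.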
Reliability
•
    No single points of failure
•
    Multiple datacenters
•
    Monitorable
Some headlines
•
    “Resyncing Broken MySQL Replication”
•
    “How To Repair MySQL Replication”
•
    “Fixing Broken MySQL Database Replication”
•
    “Replication on Linux broken after db restore”
•
    “MySQL :: Repairing broken replication”
Good architecture solves multiple
problems at once
•
    Availability in single datacenter
•
    Availability in multiple datacenters
[Diagram sequence: a ring of nodes A, F, L, P, T, U, W, Y handling a write for key “C”. In one frame a node is marked X and a hint is recorded next to T.]
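The "hint" frame can be sketched as follows. The choice of T as coordinator and F, L, P as the replicas for key “C” is an assumption made for the example (the diagram does not say which nodes hold the key), and the Node class is a toy, not Cassandra code.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []      # (target node, key, value) kept for down replicas


def write(coordinator, replicas, key, value):
    """Send the write to every replica; keep a hint for each one that is down."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica, key, value))
    return acks


def replay_hints(coordinator):
    """Deliver stored hints to replicas that have come back up."""
    pending = []
    for replica, key, value in coordinator.hints:
        if replica.up:
            replica.data[key] = value
        else:
            pending.append((replica, key, value))
    coordinator.hints = pending


node_t, node_f, node_l, node_p = (Node(n) for n in "TFLP")
node_l.up = False                                   # the node marked X
acks = write(node_t, [node_f, node_l, node_p], "C", "some value")
node_l.up = True                                    # the node recovers
replay_hints(node_t)
print(acks, node_l.data)
```

The write succeeds with two acknowledgements while one replica is down, and the missed replica catches up once the hint is replayed.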
Tuneable consistency
•
    ONE, QUORUM, ALL
•
    R+W>N
•
    Choose availability vs consistency (and latency)
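The R+W>N rule is easy to check by hand. A minimal sketch, assuming QUORUM means a majority of the N replicas; the function names are made up for the example.

```python
def replicas_required(level, n):
    """Number of replicas that must respond at a given consistency level."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]


def read_sees_latest_write(read_level, write_level, n):
    """True when the read set and write set must overlap: R + W > N."""
    r = replicas_required(read_level, n)
    w = replicas_required(write_level, n)
    return r + w > n


N = 3
for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(read_level, write_level, read_sees_latest_write(read_level, write_level, N))
```

With N=3, QUORUM reads and writes overlap (2+2>3), while ONE/ONE trades that guarantee for availability and latency.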
Monitorable
JMX
OpsCenter
When do you need Cassandra?
•
    Ian Eure: “If you’re deploying memcache on top of your
    database, you’re inventing your own ad-hoc, difficult to
    maintain NoSQL data store”
Not Only SQL
•
    Curt Monash: “ACID-compliant transaction integrity
    commonly costs more in terms of DBMS licenses and many other
    components of TCO (Total Cost of Ownership) than [scalable
    NoSQL]. Worse, it can actually hurt application uptime,
    by forcing your system to pull in its horns and stop functioning in the
    face of failures that a non-transactional system might smoothly work
    around. Other flavors of “complexity can be a bad thing” apply as
    well. Thus, transaction integrity can be more trouble
    than it’s worth.” [Curt’s emphasis]
Keyspaces & ColumnFamilies
•
    Conceptually, like “schemas” and “tables”
Inside CFs, columns are dynamic
•
    Twitter: “Fifteen months ago, it took two
    weeks to perform ALTER TABLE on the
    statuses [tweets] table.”
ColumnFamilies
•
    Static
    –
        Object data
•
    Dynamic
    –
        Precalculated query results
“static” columnfamilies

                      Users
   zznate    Password: *    Name: Nate

   driftx    Password: *   Name: Brandon

   thobbs    Password: *    Name: Tyler

   jbellis   Password: *   Name: Jonathan   Site: riptano.com
“dynamic” columnfamilies

                     Following
zznate    driftx:   thobbs:

driftx

thobbs    zznate:

jbellis   driftx:   mdennis:   pcmanus:   thobbs:   xedin:   zznate:
Inserting
•
    Really “insert or update”
•
    Not a key/value store – update as much of
    the row as you want
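"Insert or update" can be pictured as a per-column merge in which the newer timestamp wins. The sketch below is a simplified model: the function name insert and the (value, timestamp) tuples are assumptions, and ties are resolved in favor of the incoming write for brevity.

```python
def insert(row, columns, timestamp):
    """Merge columns into row; for each column the newer timestamp wins."""
    for name, value in columns.items():
        current = row.get(name)
        if current is None or timestamp >= current[1]:
            row[name] = (value, timestamp)
    return row


row = {}
insert(row, {"password": "old", "name": "Jonathan"}, timestamp=1)
insert(row, {"password": "new"}, timestamp=2)        # behaves like an update
insert(row, {"site": "riptano.com"}, timestamp=3)    # adds a column to the row
insert(row, {"name": "stale"}, timestamp=0)          # older write is ignored
print(row)
```

Only the columns named in each call are touched, which is what separates this from a plain key/value put.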
Example: twissandra
•
    http://twissandra.com
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(64),
    password VARCHAR(64)
);

CREATE TABLE following (
    user INTEGER REFERENCES users(id),
    followed INTEGER REFERENCES users(id)
);

CREATE TABLE tweets (
    id INTEGER,
    user INTEGER REFERENCES users(id),
    body VARCHAR(140),
    timestamp TIMESTAMP
);
Cassandrified
create column family users with comparator = UTF8Type
and column_metadata = [{column_name: password,
validation_class: UTF8Type}]

create column family tweets with comparator = UTF8Type
and column_metadata = [{column_name: body, validation_class:
UTF8Type}, {column_name: username, validation_class:
UTF8Type}]

create column family friends with comparator = UTF8Type
create column family followers with comparator = UTF8Type

create column family userline with comparator = LongType and
default_validation_class = UUIDType
create column family timeline with comparator = LongType and
default_validation_class = UUIDType
Connecting
CLIENT = pycassa.connect_thread_local('Twissandra')

USER = pycassa.ColumnFamily(CLIENT, 'User')
User
RowKey: ericflo
=> (column=password, value=****,
timestamp=1289446382541473)

-------------------
RowKey: jbellis
=> (column=password, value=****,
timestamp=1289446438490709)


uname = 'jericevans'
password = '**********'

columns = {'password': password}

USER.insert(uname, columns)
Natural keys vs surrogate
Friends and Followers
RowKey: ericflo

=> (column=jbellis, value=1289446467611029,
timestamp=1289446467611064)

=> (column=b6n, value=1289446467611031,
timestamp=1289446467611080)

to_uname = 'ericflo'

FRIENDS.insert(uname, {to_uname: time.time()})
FOLLOWERS.insert(to_uname, {uname: time.time()})
zznate    driftx:   thobbs:

driftx

thobbs    zznate:

jbellis   driftx:   mdennis:   pcmanus:   thobbs:   xedin:   zznate:
Tweets
RowKey: 92dbeb50-ed45-11df-a6d0-000c29864c4f

=> (column=body, value=Four score and seven years ago,
timestamp=1289446891681799)

=> (column=username, value=alincoln,
timestamp=1289446891681799)

-------------------
RowKey: d418a66e-edc5-11df-ae6c-000c29864c4f

=> (column=body, value=Do geese see God?,
timestamp=1289501976713199)

=> (column=username, value=pdrome,
timestamp=1289501976713199)
Userline
RowKey: ericflo

=> (column=1289446393708810, value=6a0b4834-ed44-11df-
bc31-000c29864c4f, timestamp=1289446393710212)

=> (column=1289446397693831, value=6c6b5916-ed44-11df-
bc31-000c29864c4f, timestamp=1289446397694646)

=> (column=1289446891681780, value=92dbeb50-ed45-11df-
a6d0-000c29864c4f, timestamp=1289446891685065)

=> (column=1289446897315887, value=96379f92-ed45-11df-
a6d0-000c29864c4f, timestamp=1289446897317676)
Userline


zznate    1289847840615: 3f19757a-c89d...   1289847887086: a20fcf52-595c...


driftx

thobbs    1289847887086: a20fcf52-595c...


jbellis   1289847840615: 3f19757a-c89d...   1289847844275: 844e75e2-b546...
Timeline
RowKey: ericflo

=> (column=1289446393708810, value=6a0b4834-ed44-11df-
bc31-000c29864c4f, timestamp=1289446393710212)

=> (column=1289446397693831, value=6c6b5916-ed44-11df-
bc31-000c29864c4f, timestamp=1289446397694646)

=> (column=1289446891681780, value=92dbeb50-ed45-11df-
a6d0-000c29864c4f, timestamp=1289446891685065)

=> (column=1289446897315887, value=96379f92-ed45-11df-
a6d0-000c29864c4f, timestamp=1289446897317676)
Adding a tweet
tweet_id = str(uuid())
body = '@ericflo thanks for Twissandra, it helps!'
ts = long(time.time() * 1e6)  # microseconds; used as the column name below

columns = {'username': uname, 'body': body}
TWEET.insert(tweet_id, columns)

columns = {ts: tweet_id}
USERLINE.insert(uname, columns)

TIMELINE.insert(uname, columns)
for follower_uname in FOLLOWERS.get(uname, 5000):
    TIMELINE.insert(follower_uname, columns)
Reads
timeline = USERLINE.get(uname, column_reversed=True)
tweets = TWEET.multiget(timeline.values())


start = request.GET.get('start')
limit = NUM_PER_PAGE

timeline = TIMELINE.get(uname, column_start=start,
column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())
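The paging read above can be simulated without a cluster. In this sketch get_slice is a stand-in written for the example (it is not pycassa), and fetching limit + 1 columns to find the next page's start is an addition that the slide does not show.

```python
def get_slice(row, column_start=None, column_count=100, column_reversed=False):
    """Toy stand-in for a column slice over one wide row (a dict of columns)."""
    names = sorted(row, reverse=column_reversed)
    if column_start is not None:
        if column_reversed:
            names = [n for n in names if n <= column_start]
        else:
            names = [n for n in names if n >= column_start]
    return [(n, row[n]) for n in names[:column_count]]


def page(row, start=None, limit=2):
    """Return one page, newest first, plus the column name that starts the next."""
    cols = get_slice(row, column_start=start, column_count=limit + 1,
                     column_reversed=True)
    next_start = cols[limit][0] if len(cols) > limit else None
    return cols[:limit], next_start


timeline = {1: "tweet-a", 2: "tweet-b", 3: "tweet-c", 4: "tweet-d", 5: "tweet-e"}
first, cursor = page(timeline)
second, cursor2 = page(timeline, start=cursor)
print(first, cursor)
print(second, cursor2)
```

Each page is one slice of a single row; the tweet bodies still need a separate multiget, as in the code above.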
Programmatically
•
    Don't use thrift directly
•
    Higher level clients have a lot of features you
    want
    –
        Knowledge about data types
    –
        Connection pooling
    –
        Automatic retries
    –
        Logging
Raw thrift API: Connecting
def get_client(host='127.0.0.1', port=9170):
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client
Raw thrift API: Inserting
data = {'id': useruuid, ...}
columns = [Column(k, v, time.time())
           for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c))
             for c in columns]
rows = {useruuid: {'User': mutations}}

client.batch_mutate('Twissandra', rows,
ConsistencyLevel.ONE)
API layers
    libpq    Thrift
    JDBC     Hector
    JPA      Hector object-mapper
Running twissandra
•
    Login: notroot/notroot
    –
        (root/riptano)


•
    cd twissandra
•
    python manage.py runserver &
•
    Navigate to http://127.0.0.1:8000
•
    Login as jim/jim, tom/tom, or create your own
One more thing
•
    !PUBLIC! userline
Exercise 1
•
    $ cassandra-cli --host localhost
•
    ] use twissandra;
    ] help;
    ] help list;
    ] help get;
    ] help del;
•
    Delete the most recent tweet
    –
        How would you find this w/o looking at the UI?
Exercise 2
•
    User jim is following user tom, but
    twissandra doesn't populate Timeline with
    tweets from before the follow action.
•
    Insert a tweet from tom before the follow
    action into jim's timeline
Secondary (column) indexes
Exercise 3
•
    Add a state column to the Tweet column
    family definition, with an index (index_type
    KEYS).
    –
        Hint: a no-op update column family on Tweet would be
        update column family Tweet with
        column_metadata=[{column_name:body,
        validation_class:UTF8Type}, {column_name:username,
        validation_class:UTF8Type}]
•
    Set the state column on several tweets to TX.
    Select them using get … where.
Language support
•
    Python
    –
        pycassa
    –
        telephus
•
    Ruby
    –
        Speed is a negative
•
    Java
    –
        Hector
•
    PHP
    –
        phpcassa
Done yet?
•
    Still doing 1+N queries per page
•
    Solution: Supercolumns
Applying SuperColumns to Twissandra

jbellis   1289847840615              1289847844275              1289847887086
          Id: 3f19757a-c89d...       Id: 844e75e2-b546...       Id: a20fcf52-595c...
          uname: zznate              uname: driftx              uname: zznate
          body: O stone be not so    body: Rise to vote sir     body: I prefer pi
          ...                        ...                        ...
Supercolumns: limitations
•
    Requires reading an entire SC (not the entire
    row) from disk even if you just want one
    subcolumn
UUIDs
•
    Column names should be uuids, not longs,
    to avoid collisions
•
    Version 1 UUIDs can be sorted by time
    (“TimeUUID”)
•
    Any UUID can be sorted by its raw bytes
    (“LexicalUUID”)
    –
        Usually Version 4
    –
        Slightly less overhead
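Python's standard uuid module shows the difference between the two orderings. In this sketch the node and clock_seq values are arbitrary assumptions, passed explicitly so that only the timestamp varies between the generated IDs.

```python
import uuid

# Version 1 UUIDs embed a timestamp, so they can be ordered by creation time.
ids = [uuid.uuid1(node=0x000C29864C4F, clock_seq=0x26D0) for _ in range(5)]

by_time = sorted(ids, key=lambda u: u.time)     # TimeUUID-style ordering
by_bytes = sorted(ids, key=lambda u: u.bytes)   # LexicalUUID-style ordering

# Version 4 UUIDs are random; raw byte order is the only ordering they have.
random_id = uuid.uuid4()

print([u.version for u in ids], random_id.version)
print(by_time == ids)
```

Byte order is not time order in general, because the low bits of the timestamp come first in the 16-byte layout.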
Lucandra
•
    What documents contain term X?
    –
        … and term Y?
    –
        … or start with Z?
Fields and Terms

<doc>
  <field name="title">apache talk</field>
  <field name="date">20110201</field>
</doc>


   field     term       freq     position
   title    apache       1          0
   title      talk       1          1
   date    20110201      1          0
Lucandra ColumnFamilies
create column family documents with comparator = BytesType;

create column family terminfo with column_type = Super and
comparator = BytesType and subcomparator = BytesType;
Lucandra data
Document Key      col name         value
"documentId" => { fieldName , value }

Term Key          col name         value
"field/term" => { documentId , position vector }
Lucandra queries
•
    get_slice
•
    get_range_slices
•
    No silver bullet
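The two Lucandra layouts and the three question types can be modeled with plain dictionaries. This is a sketch under stated assumptions: whitespace tokenization, the helper names, and the second sample document are all made up for illustration.

```python
from collections import defaultdict

documents = {}                   # documentId -> {fieldName: value}
terminfo = defaultdict(dict)     # "field/term" -> {documentId: [positions]}


def index_document(doc_id, fields):
    """Store the document and add one term row entry per (field, term)."""
    documents[doc_id] = dict(fields)
    for field, text in fields.items():
        for position, term in enumerate(text.split()):
            terminfo[f"{field}/{term}"].setdefault(doc_id, []).append(position)


def docs_with(field, term):
    """Document ids in one term row (roughly a get_slice on that row)."""
    return set(terminfo.get(f"{field}/{term}", {}))


index_document("doc1", {"title": "apache talk", "date": "20110201"})
index_document("doc2", {"title": "apache cassandra", "date": "20110202"})

# "term X and term Y": intersect two term rows on the client.
both = docs_with("title", "apache") & docs_with("title", "talk")

# "starts with Z": a scan over sorted row keys (roughly get_range_slices).
prefix = sorted(k for k in terminfo if k.startswith("title/a"))

print(both, prefix)
```

The intersection happens client-side, which is one reason the slide ends on "no silver bullet".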
FAQ: counting
•
    UUIDs + batch process
•
    column-per-app-server
•
    counter API (after 1.0 is out)
Locking
•
    Zookeeper
•
    Cages: http://code.google.com/p/cages/
•
    Not suitable for multi-DC
UUIDs

counter1   672e34a2-ba33...   b681a0b1-58f2...


counter2   3f19757a-c89d...   844e75e2-b546...   a20fcf52-595c...




counter1    aggregated: 27


counter2    aggregated: 42
Column per appserver

counter1   672e34a2-ba33: 12    b681a0b1-58f2: 4   1872c1c2-38f1: 9


counter2   3f19757a-c89d: 7    844e75e2-b546: 11
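The column-per-app-server layout can be sketched with a dict per counter row. The function names are hypothetical; the starting values are the ones shown for counter1 above.

```python
def increment(row, server_id, delta=1):
    """Each app server only ever updates its own column, so writers never collide."""
    row[server_id] = row.get(server_id, 0) + delta


def total(row):
    """Reading the counter means summing every server's column."""
    return sum(row.values())


counter1 = {"672e34a2-ba33": 12, "b681a0b1-58f2": 4, "1872c1c2-38f1": 9}
increment(counter1, "b681a0b1-58f2")
print(total(counter1))
```

The cost moves to the read side: every read fetches and sums one column per app server.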
Counter API

 key   counter1: (14, 13, 9)   counter2: (11, 15, 17)
General Tips
●
    Start with queries, work backwards
●
    Avoid storing extra “timestamp” columns
●
    Insert instead of check-then-insert
●
    Use client-side clock to your advantage
●
    Use TTL
●
    Learn to love wide rows
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Cassandra Tutorial

  • 1. Apache Cassandra in Action Jonathan Ellis jbellis@datastax.com / @spyced
  • 2. Why Cassandra? • Relational databases are not designed to scale • B-trees are slow – and require read-before-write
  • 9. (“The eBay Architecture,” Randy Shoup and Dan Pritchett)
  • 14. Reader Memtable Writer Commitlog The Log-Structured Merge-Tree, Bigtable: A Distributed Storage System for Structured Data
  • 15. Dynamo, 2007 Bigtable, 2006 OSS, 2008 Incubator, 2009 TLP, 2010
  • 16. Cassandra in production • Digital Reasoning: NLP + entity analytics • OpenWave: enterprise messaging • OpenX: largest publisher-side ad network in the world • Cloudkick: performance data & aggregation • SimpleGEO: location-as-API • Ooyala: video analytics and business intelligence • ngmoco: massively multiplayer game worlds
  • 17. FUD? • “Cassandra is only appropriate for unimportant data.”
  • 18. Durability • Write to commitlog – fsync is cheap since it’s append-only • Write to memtable • [amortized] flush memtable to sstable
  • 19. SSTable format, briefly <key 127> <key 255> <row data 0> ... <row data 1> ... <row data 127> ... <row data 255> ... Sorted [clustered] by row key
  • 21. W A T L
  • 22. W A F T L
  • 23. W A F (A-L] T L
  • 24. W A (A-F] F T (F-L] L
  • 25. Key “C” W A F T L
  • 26. Reliability • No single points of failure • Multiple datacenters • Monitorable
  • 27. Some headlines • “Resyncing Broken MySQL Replication” • “How To Repair MySQL Replication” • “Fixing Broken MySQL Database Replication” • “Replication on Linux broken after db restore” • “MySQL :: Repairing broken replication”
  • 30. Good architecture solves multiple problems at once • Availability in single datacenter • Availability in multiple datacenters
  • 31. Y Key “C” A W U F T L P
  • 32. Y Key “C” A W U F X T hint L P
  • 33. Y A W U F T L P
  • 35. Y Key “C” A W U F T L P
  • 36. Y Key “C” A W U F T L P
  • 37. Tuneable consistency • ONE, QUORUM, ALL • R+W>N • Choose availability vs consistency (and latency)
  • 39. JMX
  • 41. When do you need Cassandra? • Ian Eure: “If you’re deploying memcache on top of your database, you’re inventing your own ad-hoc, difficult to maintain NoSQL data store”
  • 42. Not Only SQL • Curt Monash: “ACID-compliant transaction integrity commonly costs more in terms of DBMS licenses and many other components of TCO (Total Cost of Ownership) than [scalable NoSQL]. Worse, it can actually hurt application uptime, by forcing your system to pull in its horns and stop functioning in the face of failures that a non-transactional system might smoothly work around. Other flavors of “complexity can be a bad thing” apply as well. Thus, transaction integrity can be more trouble than it’s worth.” [Curt’s emphasis]
  • 44. Keyspaces & ColumnFamilies • Conceptually, like “schemas” and “tables”
  • 45. Inside CFs, columns are dynamic • Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”
  • 46. ColumnFamilies • Static – Object data • Dynamic – Precalculated query results
  • 47. “static” columnfamilies
        Users
          zznate    Password: *   Name: Nate
          driftx    Password: *   Name: Brandon
          thobbs    Password: *   Name: Tyler
          jbellis   Password: *   Name: Jonathan   Site: riptano.com
  • 48. “dynamic” columnfamilies
        Following
          zznate    driftx:  thobbs:
          driftx
          thobbs    zznate:
          jbellis   driftx:  mdennis:  pcmanus:  thobbs:  xedin:  zznate:
  • 49. Inserting • Really “insert or update” • Not a key/value store – update as much of the row as you want
  • 50. Example: twissandra • http://twissandra.com
  • 51.
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            username VARCHAR(64),
            password VARCHAR(64)
        );
        CREATE TABLE following (
            user INTEGER REFERENCES users(id),
            followed INTEGER REFERENCES users(id)
        );
        CREATE TABLE tweets (
            id INTEGER,
            user INTEGER REFERENCES users(id),
            body VARCHAR(140),
            timestamp TIMESTAMP
        );
  • 52. Cassandrified
        create column family users
          with comparator = UTF8Type
          and column_metadata = [{column_name: password, validation_class: UTF8Type}]
        create column family tweets
          with comparator = UTF8Type
          and column_metadata = [{column_name: body, validation_class: UTF8Type},
                                 {column_name: username, validation_class: UTF8Type}]
        create column family friends
          with comparator = UTF8Type
        create column family followers
          with comparator = UTF8Type
        create column family userline
          with comparator = LongType
          and default_validation_class = UUIDType
        create column family timeline
          with comparator = LongType
          and default_validation_class = UUIDType
  • 54. User
        RowKey: ericflo
        => (column=password, value=****, timestamp=1289446382541473)
        -------------------
        RowKey: jbellis
        => (column=password, value=****, timestamp=1289446438490709)

        uname = 'jericevans'
        password = '**********'
        columns = {'password': password}
        USER.insert(uname, columns)
  • 55. Natural keys vs surrogate
  • 56. Friends and Followers
        RowKey: ericflo
        => (column=jbellis, value=1289446467611029, timestamp=1289446467611064)
        => (column=b6n, value=1289446467611031, timestamp=1289446467611080)

        to_uname = 'ericflo'
        FRIENDS.insert(uname, {to_uname: time.time()})
        FOLLOWERS.insert(to_uname, {uname: time.time()})
  • 57.
        zznate    driftx:  thobbs:
        driftx
        thobbs    zznate:
        jbellis   driftx:  mdennis:  pcmanus:  thobbs:  xedin:  zznate:
  • 58. Tweets
        RowKey: 92dbeb50-ed45-11df-a6d0-000c29864c4f
        => (column=body, value=Four score and seven years ago, timestamp=1289446891681799)
        => (column=username, value=alincoln, timestamp=1289446891681799)
        -------------------
        RowKey: d418a66e-edc5-11df-ae6c-000c29864c4f
        => (column=body, value=Do geese see God?, timestamp=1289501976713199)
        => (column=username, value=pdrome, timestamp=1289501976713199)
  • 59. Userline
        RowKey: ericflo
        => (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)
        => (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)
        => (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)
        => (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)
  • 60. Userline zznate 1289847840615: 3f19757a-c89d... 1289847887086: a20fcf52-595c... driftx thobbs 1289847887086: a20fcf52-595c... jbellis 1289847840615: 3f19757a-c89d... 128984784425: 844e75e2-b546...
  • 62. Timeline
        RowKey: ericflo
        => (column=1289446393708810, value=6a0b4834-ed44-11df-bc31-000c29864c4f, timestamp=1289446393710212)
        => (column=1289446397693831, value=6c6b5916-ed44-11df-bc31-000c29864c4f, timestamp=1289446397694646)
        => (column=1289446891681780, value=92dbeb50-ed45-11df-a6d0-000c29864c4f, timestamp=1289446891685065)
        => (column=1289446897315887, value=96379f92-ed45-11df-a6d0-000c29864c4f, timestamp=1289446897317676)
  • 63. Adding a tweet
        tweet_id = str(uuid())
        body = '@ericflo thanks for Twissandra, it helps!'
        timestamp = long(time.time() * 1e6)
        columns = {'uname': useruuid, 'body': body}
        TWEET.insert(tweet_id, columns)
        # userline/timeline columns map the tweet's timestamp to its id
        columns = {timestamp: tweet_id}
        USERLINE.insert(uname, columns)
        TIMELINE.insert(uname, columns)
        for follower_uname in FOLLOWERS.get(uname, 5000):
            TIMELINE.insert(follower_uname, columns)
  • 64. Reads
        timeline = USERLINE.get(uname, column_reversed=True)
        tweets = TWEET.multiget(timeline.values())

        start = request.GET.get('start')
        limit = NUM_PER_PAGE
        timeline = TIMELINE.get(uname, column_start=start,
                                column_count=limit, column_reversed=True)
        tweets = TWEET.multiget(timeline.values())
  • 65. Programmatically • Don't use thrift directly • Higher-level clients have a lot of features you want – Knowledge about data types – Connection pooling – Automatic retries – Logging
  • 66. Raw thrift API: Connecting
        def get_client(host='127.0.0.1', port=9170):
            socket = TSocket.TSocket(host, port)
            transport = TTransport.TBufferedTransport(socket)
            transport.open()
            protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
            client = Cassandra.Client(protocol)
            return client
  • 67. Raw thrift API: Inserting
        data = {'id': useruuid, ...}
        columns = [Column(k, v, time.time())
                   for (k, v) in data.items()]
        mutations = [Mutation(ColumnOrSuperColumn(column=c))
                     for c in columns]
        rows = {useruuid: {'User': mutations}}
        client.batch_mutate('Twissandra', rows, ConsistencyLevel.ONE)
  • 68. API layers • libpq • Thrift • JDBC • Hector • JPA • Hector object-mapper
  • 69. Running twissandra • Login: notroot/notroot – (root/riptano) • cd twissandra • python manage.py runserver & • Navigate to http://127.0.0.1:8000 • Login as jim/jim, tom/tom, or create your own
  • 70. One more thing • !PUBLIC! userline
  • 71. Exercise 1 • $ cassandra-cli --host localhost • ] use twissandra; ] help; ] help list; ] help get; ] help del; • Delete the most recent tweet – How would you find this w/o looking at the UI?
  • 72. Exercise 2 • User jim is following user tom, but twissandra doesn't populate Timeline with tweets from before the follow action. • Insert a tweet from tom before the follow action into jim's timeline
  • 74. Exercise 3 • Add a state column to the Tweet column family definition, with an index (index_type KEYS). – Hint: a no-op update column family on Tweet would be update column family Tweet with column_metadata=[{column_name:body, validation_class:UTF8Type}, {column_name:username, validation_class:UTF8Type}] • Set the state column on several tweets to TX. Select them using get … where.
  • 75. Language support • Python – pycassa – telephus • Ruby – Speed is a negative • Java – Hector • PHP – phpcassa
  • 76. Done yet? • Still doing 1+N queries per page • Solution: Supercolumns
  • 77. Applying SuperColumns to Twissandra [Diagram: row jbellis holding one supercolumn per tweet, named by timestamp (1289847840615, 1289847844275, 1289847887086), each with subcolumns Id (3f19757a-c89d..., 844e75e2-b546..., a20fcf52-595c...), uname, and body]
  • 78. Supercolumns: limitations • Requires reading an entire SC (not the entire row) from disk even if you just want one subcolumn
  • 79. UUIDs • Column names should be uuids, not longs, to avoid collisions • Version 1 UUIDs can be sorted by time (“TimeUUID”) • Any UUID can be sorted by its raw bytes (“LexicalUUID”) – Usually Version 4 – Slightly less overhead
  • 80. Lucandra • What documents contain term X? – … and term Y? – … or start with Z?
  • 81. Fields and Terms
        <doc>
          <field name="title">apache talk</field>
          <field name="date">20110201</field>
        </doc>

        field   term       freq   position
        title   apache     1      0
        title   talk       1      1
        date    20110201   1      0
  • 82. Lucandra ColumnFamilies
        create column family documents
          with comparator = BytesType;
        create column family terminfo
          with column_type = Super
          and comparator = BytesType
          and subcomparator = BytesType;
  • 83. Lucandra data
        Document
          Key                col name       value
          "documentId" => {  fieldName   ,  value  }
        Term
          Key                col name       value
          "field/term" => {  documentId  ,  position vector  }
  • 84. Lucandra queries • get_slice • get_range_slices • No silver bullet
  • 85. FAQ: counting • UUIDs + batch process • column-per-app-server • counter API (after 1.0 is out)
  • 86. Locking • Zookeeper • Cages: http://code.google.com/p/cages/ • Not suitable for multi-DC
  • 87. UUIDs counter1 672e34a2-ba33... b681a0b1-58f2... counter2 3f19757a-c89d... 844e75e2-b546... a20fcf52-595c... counter1 aggregated: 27 counter2 aggregated: 42
  • 88. Column per appserver counter1 672e34a2-ba33: 12 b681a0b1-58f2: 4 1872c1c2-38f1: 9 counter2 3f19757a-c89d: 7 844e75e2-b546: 11
  • 89. Counter API key counter1: (14, 13, 9) counter2: (11, 15, 17)
  • 90. General Tips ● Start with queries, work backwards ● Avoid storing extra “timestamp” columns ● Insert instead of check-then-insert ● Use client-side clock to your advantage ● Use TTL ● Learn to love wide rows
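
The "column per appserver" counting approach from slide 88 can be sketched in plain Python. This is a toy model only: a nested dict stands in for a Cassandra row, and the class and method names (`ShardedCounter`, `increment`, `value`) are hypothetical, not part of pycassa or any other client. The idea is that each app server writes only its own column, and a reader sums the columns to get the total.

```python
class ShardedCounter:
    """Toy model of column-per-app-server counting (no Cassandra client involved)."""

    def __init__(self):
        # row key (counter name) -> {app server id: that server's running count}
        self.rows = {}

    def increment(self, counter, server_id, amount=1):
        # Each app server updates only its own column, so two servers
        # never overwrite each other's value.
        columns = self.rows.setdefault(counter, {})
        columns[server_id] = columns.get(server_id, 0) + amount

    def value(self, counter):
        # The counter's total is the sum over all per-server columns.
        return sum(self.rows.get(counter, {}).values())


# Reproduce the per-server values shown on slide 88.
counters = ShardedCounter()
counters.increment("counter1", "672e34a2-ba33", 12)
counters.increment("counter1", "b681a0b1-58f2", 4)
counters.increment("counter1", "1872c1c2-38f1", 9)
counters.increment("counter2", "3f19757a-c89d", 7)
counters.increment("counter2", "844e75e2-b546", 11)

print(counters.value("counter1"))  # 12 + 4 + 9
print(counters.value("counter2"))  # 7 + 11
```

In a real deployment each server would keep its own count in memory and overwrite its column with a plain insert, which keeps writes free of read-before-write; the read cost grows with the number of app servers rather than the number of increments.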