HBase In Action - Chapter 04: HBase table design
1. CHAPTER 04: HBASE TABLE DESIGN
HBase IN ACTION
by Nick Dimiduk et al.
2. Overview: HBase table design
HBase schema design concepts
Mapping relational modeling knowledge to the
HBase world
Advanced table definition parameters
HBase Filters to optimize read performance
3. 4.1 How to approach schema design
When we say schema, we include the following
considerations:
How many column families should the table have?
What data goes into what column family?
How many columns should be in each column family?
What should the column names be?
What information should go into the cells?
How many versions should be stored for each cell?
What should the rowkey structure be, and what should it
contain?
4. HBase Course
Data Manipulation at Scale: Systems and
Algorithms
Using HBase for Real-time Access to Your Big
Data
5. 4.1.1 Modeling for the questions
A table stores data about which users a particular
user follows. It must support
reading the entire list of followed users,
and querying for the presence of a specific user in that list
7. 4.1.1 Modeling for the questions (cont.)
Thinking further along those lines, you can come up
with the following questions:
1. Whom does TheFakeMT follow?
2. Does TheFakeMT follow TheRealMT?
3. Who follows TheFakeMT?
4. Does TheRealMT follow TheFakeMT?
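The four questions above map naturally onto a tall-table rowkey. A minimal sketch, assuming a `+` delimiter and a one-row-per-relationship layout (both illustrative choices, not the book's final design): concatenating follower and followed IDs makes question 2 a single Get and question 1 a short prefix Scan.

```java
import java.nio.charset.StandardCharsets;

// Tall-table rowkey sketch for a "follows" table: one row per
// (follower, followed) pair. "Does A follow B?" becomes a single Get
// on rowkey(A, B); "whom does A follow?" becomes a Scan over scanPrefix(A).
public class FollowsRowkey {

    // Build the rowkey for one relationship; '+' is an assumed delimiter.
    public static byte[] rowkey(String follower, String followed) {
        return (follower + "+" + followed).getBytes(StandardCharsets.UTF_8);
    }

    // Prefix covering every user that `follower` follows.
    public static String scanPrefix(String follower) {
        return follower + "+";
    }
}
```

Answering question 3 ("Who follows TheFakeMT?") with this key needs a full table scan, which is why the chapter later introduces a second table for the reverse relationship.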
8. 4.1.2 Defining requirements: more work up front
always pays
From the perspective of TwitBase, you expect data to
be written to HBase when the following things
happen:
A user follows someone
A user unfollows someone they were following
10. 4.1.2 Defining requirements: more work up front
always pays (cont.)
How does designing tables in HBase differ from
designing tables in relational systems?
13. 4.1.4 Targeted data access
Only the keys are indexed in HBase tables.
There are two ways to retrieve data from a table: Get and
Scan.
HBase tables are flexible, and you can store anything in
the form of byte[].
Store everything with similar access patterns in the same
column family.
Indexing is done on the Key portion of the KeyValue
objects, consisting of the rowkey, qualifier, and
timestamp in that order.
Tall tables can potentially allow you to move toward O(1)
operations, but you trade away atomicity.
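The key-ordering point above can be sketched in plain Java. This is an illustrative stand-in, not the real HBase comparator: the Key portion of a KeyValue sorts by rowkey, then column qualifier, then timestamp in descending order, which is why only these key-side fields are effectively indexed.

```java
import java.util.Comparator;

// Simplified model of a KeyValue's Key part and its sort order:
// rowkey ascending, qualifier ascending, timestamp descending
// (newest version first).
public class KeySketch {
    public final String rowkey, qualifier;
    public final long timestamp;

    public KeySketch(String rowkey, String qualifier, long timestamp) {
        this.rowkey = rowkey;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    public static final Comparator<KeySketch> ORDER =
        Comparator.comparing((KeySketch k) -> k.rowkey)
                  .thenComparing(k -> k.qualifier)
                  .thenComparing(Comparator
                      .comparingLong((KeySketch k) -> k.timestamp)
                      .reversed());
}
```

Sorting a list of these keys shows why a Get on a rowkey, or a Scan over a rowkey range, is cheap: matching entries are physically adjacent.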
14. 4.1.4 Targeted data access (cont.)
De-normalizing is the way to go when designing HBase
schemas.
Think how you can accomplish your access patterns in
single API calls rather than multiple API calls.
Hashing allows for fixed-length keys and better
distribution but takes away ordering.
Column qualifiers can be used to store data, just like
cells.
The length of column qualifiers impacts the storage
footprint, because the qualifier name is stored in every
KeyValue.
The length of the column family name impacts the size of
data sent over the wire to the client (in KeyValue
objects).
15. 4.2 De-normalization is the word in HBase land
One of the key concepts when designing HBase
tables is de-normalization.
16. 4.3 Heterogeneous data in the same table
HBase schemas are flexible, and you’ll use that
flexibility now to avoid doing scans every time you
want a list of followers for a given user.
Isolate different access patterns as much as possible.
The way to improve the load distribution in this case
is to have separate tables for the two types of
relationships you want to store.
17. 4.4 Rowkey design strategies
In designing HBase tables, the rowkey is the single
most important thing.
Your rowkeys determine the performance you get
while interacting with HBase tables.
Unlike relational databases, where you can index on
multiple columns, HBase indexes only on the key.
18. 4.5 I/O considerations
The sorted nature of HBase tables can turn out to be
a great thing for your application—or not
Optimized for writes
HASHING
SALTING
Optimized for reads
Cardinality and rowkey structure
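The HASHING and SALTING bullets above can be sketched as two small helpers. Both are illustrative assumptions (the delimiter, bucket count, and use of MD5 are choices for this sketch): hashing yields fixed-length, well-distributed keys but destroys meaningful scan ordering; salting prefixes a small bucket number so that sequential keys spread across regions while the original key stays readable.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Two write-optimized rowkey strategies: full hashing vs. salting.
public class RowkeyStrategies {

    // Hashing: fixed-length (32 hex chars), uniformly distributed,
    // but you lose the ability to scan keys in their natural order.
    public static String md5Key(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(key.getBytes(StandardCharsets.UTF_8)))
                sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    // Salting: deterministic bucket prefix spreads hot sequential keys
    // across `buckets` ranges; readers must fan out one scan per bucket.
    public static String saltedKey(String key, int buckets) {
        int salt = Math.floorMod(key.hashCode(), buckets);
        return salt + "|" + key; // e.g. "3|TheRealMT"
    }
}
```

The trade-off for reads: a salted scan must issue one scan per bucket and merge the results, and a hashed key supports only point Gets.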
19. 4.6 From relational to non-relational
There is no simple way to map your relational
database knowledge to HBase. It’s a different
paradigm of thinking
Things don’t necessarily map 1:1, and these concepts
are evolving and being defined as the adoption of
NoSQL systems increases.
20. 4.6.1 Some basic concepts
ENTITIES
These map to tables.
In both relational databases and HBase, the default container
for an entity is a table, and each row in the table should
represent one instance of that entity.
ATTRIBUTES
These map to columns.
Identifying attribute: This is the attribute that uniquely
identifies exactly one instance of an entity (that is, one row).
Non-identifying attribute: Non-identifying attributes are
easier to map.
21. 4.6.1 Some basic concepts (cont.)
RELATIONSHIPS
These map to foreign-key relationships.
There is no direct mapping of these in HBase, and often it
comes down to denormalizing the data.
HBase, not having any built-in joins or constraints, has little
use for explicit relationships.
22. 4.6.2 Nested entities
In HBase, the columns (also known as column
qualifiers) aren’t predefined at design time.
23. 4.6.2 Nested entities (cont.)
It’s possible to model the nested entity in HBase
as a single row.
There are some limitations to
this:
The technique only works one
level deep: your nested entities can’t
themselves have nested entities.
It’s not as efficient to access an
individual value stored as a nested
column qualifier inside a row.
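A pure-Java stand-in can make the wide-row idea concrete. Here a `TreeMap` mimics the sorted qualifier layout of one HBase row (an assumption of this sketch, not an HBase API): each followed user's name is a column qualifier, and the cell value carries the nested entity's data.

```java
import java.util.TreeMap;

// Stand-in for a single wide HBase row holding nested entities:
// qualifier = followed user, value = nested-entity payload.
// Like HBase, this only nests one level deep.
public class WideRowSketch {
    private final TreeMap<String, String> qualifiers = new TreeMap<>();

    public void follow(String followed, String value) {
        qualifiers.put(followed, value);
    }

    // Point lookup: "does this user appear in the row?"
    public boolean follows(String followed) {
        return qualifiers.containsKey(followed);
    }

    // Read the whole list, in sorted qualifier order.
    public Iterable<String> allFollowed() {
        return qualifiers.keySet();
    }
}
```

Both access patterns from section 4.1.1 (read the whole list, check one member) then hit a single row, at the cost of per-row atomicity boundaries and less efficient access to one nested value.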
24. 4.6.3 Some things don’t map
COLUMN FAMILIES
(LACK OF) INDEXES
VERSIONING
25. 4.7 Advanced column family configurations
HBase has a few advanced features that you can use
when designing your tables.
Configurable block size
hbase(main):002:0> create 'mytable', {NAME => 'colfam1',
BLOCKSIZE => '65536'}
Block cache
hbase(main):002:0> create 'mytable', {NAME => 'colfam1',
BLOCKCACHE => 'false'}
Aggressive caching
hbase(main):002:0> create 'mytable', {NAME => 'colfam1',
IN_MEMORY => 'true'}
27. 4.8 Filtering data
Filters are a powerful feature that can come in handy
when you want predicates evaluated on the server side
instead of shipping every row to the client.
HBase provides an API you can use to implement
custom filters.
28. 4.8.1 Implementing a filter
Implement a custom filter by extending the FilterBase
abstract class
The filtering logic goes in the filterKeyValue(..) method
To install custom filters
have to compile them into a JAR and put them in the HBase
classpath so they get picked up by the RegionServers at startup
time.
To compile the JAR, in the top-level directory of the project, do
the following:
mvn install
cp target/twitbase-1.0.0.jar /my/folder/
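The logic that would live inside `filterKeyValue(..)` can be sketched as a plain predicate. This is a hypothetical example, not the book's TwitBase filter: it keeps only cells whose value meets a minimum length. In a real filter the class would extend `FilterBase` and translate this boolean into `ReturnCode.INCLUDE` or `ReturnCode.SKIP`.

```java
import java.nio.charset.StandardCharsets;

// Core predicate for a hypothetical custom filter:
// true  -> keep the cell (INCLUDE)
// false -> drop the cell (SKIP)
public class ValueLengthPredicate {
    private final int minLength;

    public ValueLengthPredicate(int minLength) {
        this.minLength = minLength;
    }

    public boolean include(byte[] value) {
        return value != null && value.length >= minLength;
    }

    public boolean include(String value) {
        return include(value.getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping the predicate free of HBase types like this also makes it easy to unit-test before packaging it into the JAR that the RegionServers load.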
31. 4.9 Summary
It’s about the questions, not the relationships.
Design is never finished.
Scale is a first-class entity.
Every dimension is an opportunity.