Prashant Gupta
Introduction to NoSQL
What is NoSQL?
• It's a whole new way of thinking about a database, and it encompasses a
wide variety of different database technologies.
• A NoSQL database is a non-relational and largely distributed database
system that enables rapid analysis of extremely high-volume, disparate
(different) data types.
Why NoSQL?
• The benefits of NoSQL: a very flexible, schema-less data model,
horizontal scalability, and distributed architectures.
• Dynamic Schemas
• Auto-sharding
• Replication
• Integrated Caching
NoSQL Database Types
• Document databases pair each key with a complex data structure
known as a document. Documents can contain many different key-
value pairs, or key-array pairs, or even nested documents.
• Graph stores are used to store information about networks, such as
social connections. Graph stores include Neo4j and HyperGraphDB.
• Key-value stores are the simplest NoSQL databases. Every single item
in the database is stored as an attribute name (or "key"), together with
its value. Examples of key-value stores are Riak and Voldemort. Some
key-value stores, such as Redis, allow each value to have a type, such as
"integer", which adds functionality.
• Wide-column stores such as Cassandra and HBase are optimized for
queries over large datasets, and store columns of data together instead
of rows.
NoSQL vs. SQL Summary
• Types
– SQL database: One type (SQL database) with minor variations
– NoSQL database: Many different types including key-value stores,
document databases, wide-column stores, and graph databases
• Development History
– SQL database: Developed in the 1970s to deal with the first wave of
data storage applications
– NoSQL database: Developed in the 2000s to deal with limitations of SQL
databases, particularly concerning scale, replication and unstructured
data storage
• Examples
– SQL database: MySQL, Postgres, Oracle Database
– NoSQL database: MongoDB, Cassandra, HBase, Neo4j, Riak, Voldemort,
CouchDB, DynamoDB
• Schemas
– SQL database: Structure and data types are fixed in advance
– NoSQL database: Typically dynamic. Records can add new information on
the fly, unlike rows in a SQL table
• Scaling
– SQL database: Vertically
– NoSQL database: Horizontally
• Data Manipulation
– SQL database: Specific language using Select, Insert, and Update
statements, e.g. SELECT fields FROM table WHERE…
– NoSQL database: Through object-oriented APIs
NoSQL's More Flexible Data Model
• Relational and NoSQL data models are very different. The relational
model takes data and separates it into many interrelated tables that
contain rows and columns. Tables reference each other through
foreign keys that are stored in columns as well.
• NoSQL databases have a very different model. For example, a
document-oriented NoSQL database takes the data you want to store
and aggregates it into documents using the JSON format.
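As a rough illustration of the two models, the same data can be held as two related tables linked by a foreign key, or as one aggregated JSON document. This is only a sketch: the customer and order records, their field names, and the helper functions are made up for the example and do not come from any particular database.

```python
import json

# Relational style: two "tables" (lists of rows) linked by a foreign key.
customers = [{"id": 1, "name": "jsmith", "city": "nashville"}]
orders = [
    {"id": 100, "customer_id": 1, "item": "book"},
    {"id": 101, "customer_id": 1, "item": "pen"},
]


def orders_for(customer_id):
    """Emulate a join: follow the foreign key from orders to a customer."""
    return [o for o in orders if o["customer_id"] == customer_id]


def to_document(customer):
    """Aggregate a customer and its orders into one nested document."""
    doc = dict(customer)
    doc["orders"] = [
        {"id": o["id"], "item": o["item"]} for o in orders_for(customer["id"])
    ]
    return doc


document = to_document(customers[0])
print(json.dumps(document, indent=2))
```

In the document form the orders travel with the customer, so reading one customer needs no join; the price is that the same order data is harder to query on its own.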
CAP Theorem
• In theoretical computer science, the CAP theorem is also known as
Brewer's theorem.
• The CAP theorem was first proposed by Eric Brewer of the
University of California, Berkeley, in 2000, and then proven by Seth
Gilbert and Nancy Lynch of MIT in 2002.
• It's important to keep three core requirements in mind:
1. Consistency
2. Availability
3. Partition Tolerance
• Consistency (C) requires that all reads initiated after a successful
write return the same, latest value at any given logical time.
• Availability (A) requires that every node that is not in a failed state
always executes queries, such as write operations.
• Partition Tolerance (P) requires that a system be able to re-route
communication when there are temporary breaks or failures in the
network. The goal is to maintain synchronization among the involved
nodes.
• Brewer's theorem states that it's typically not possible
for a distributed system to provide all three
requirements simultaneously, because one
of them will always be compromised.
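The trade-off can be sketched with a toy two-replica cluster. This is an illustration under simplifying assumptions, not a real replication protocol: the Replica and Cluster classes and the "CP" and "AP" mode labels are invented for the example. During a partition, the "CP" cluster refuses writes to stay consistent, while the "AP" cluster accepts them and lets the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.value = None


class Cluster:
    """Two replicas; `partitioned` means they cannot reach each other."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (labels used only in this sketch)
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False       # refuse: stay consistent, lose availability
            replica.value = value  # accept locally: stay available, diverge
            return True
        replica.value = value
        other.value = value        # no partition: replicate synchronously
        return True


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(cluster.a, "v1")  # replicated to both nodes
    cluster.partitioned = True      # the network then splits

cp_accepted = cp.write(cp.a, "v2")
ap_accepted = ap.write(ap.a, "v2")
print("CP accepted write:", cp_accepted, "| replicas:", cp.a.value, cp.b.value)
print("AP accepted write:", ap_accepted, "| replicas:", ap.a.value, ap.b.value)
```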
HBase
• Apache HBase is an open-source, distributed, versioned,
fault-tolerant, scalable, column-oriented store modeled after
Google's Bigtable, with random real-time read/write access to data.
• HBase is a column-oriented database.
• Use it when you need random, real-time read/write access to your Big
Data.
• Use it when the project's goal is the hosting of very large
tables: billions of rows by millions of columns.
Features of HBase
• Scalable
• Consistent reads and writes
• Sharding of tables
• Automatic failover support
• Classes for backing Hadoop MapReduce jobs
• Java API for client access
• Thrift gateway and a REST-ful Web service
• Shell support
Backdrop
• Started by Chad Walters and Jim
• 2006.NOV
– Google releases paper on BigTable
• 2007.FEB
– Initial HBase prototype created as Hadoop contrib.
• 2007.OCT
– First usable HBase
• 2008.JAN
– Hadoop becomes Apache top-level project and HBase becomes
subproject
• 2008.OCT
– HBase 0.18, 0.19 released
• 2013.JUNE
– Latest version is HBase 0.97
Difference Between HDFS and HBase
• HDFS is a distributed file system that is well suited for the storage of
large files. Its documentation states that it is not, however, a general
purpose file system, and does not provide fast individual record
lookups in files. HBase, on the other hand, is built on top of HDFS
and provides fast record lookups (and updates) for large tables.
What it is not
• No SQL
• No relations
• No joins
• Not a replacement for an RDBMS
What Is The Difference Between HBase and RDBMS?
Row-Oriented vs. Column-Oriented data stores.
HBase Data Model
HBase Architectural Components
• HBase is composed of three types of servers in a master-slave type
of architecture.
• Region assignment and DDL (create, delete tables) operations are
handled by the HBase Master process.
• Region servers serve data for reads and writes. When accessing
data, clients communicate with HBase RegionServers directly.
• ZooKeeper, a separate coordination service rather than a part of
HDFS, maintains a live cluster state.
• The Hadoop DataNode stores the data that the Region Server is
managing.
• All HBase data is stored in HDFS files.
Components
Other way around
HMaster
• The Master server is responsible for
monitoring all RegionServer
instances in the cluster, and is the
interface for all metadata changes. It
typically runs on the server that hosts
the NameNode.
• The Master controls critical functions
such as RegionServer failover and
completing region splits. So while
the cluster can still run for a time
without the Master, the Master
should be restarted as soon as
possible.
RegionServers
• A RegionServer is responsible for serving and managing
regions; it is like a DataNode for HBase.
• It serves client requests for data and handles the
actual data storage and requests.
• Each RegionServer hosts a set of regions, and each region
holds part of a table.
• RegionServers are usually configured to run
on the same servers as the HDFS DataNodes, which
also gives the advantage of data locality.
ZooKeeper
• ZooKeeper is open-source
software providing a highly reliable,
distributed coordination service.
• It is the entry point for an HBase system.
• It tracks the region servers and
where the root region is hosted.
API
Interface to HBase: REST, Thrift, and Avro
Using these, we can access HBase
and perform read/write and other operations
on HBase.
The Thrift API framework, for scalable
cross-language services development, combines
a software stack with a code generation
engine to build services that work efficiently
and seamlessly between C++, Java,
Python, PHP, Ruby, Erlang, Perl, Haskell,
C#, Cocoa, JavaScript, Node.js, Smalltalk,
OCaml, Delphi, and other languages.
HBASE ARCHITECTURE
HBase First Read or Write
• There is a special HBase catalog table
called the META table, which holds the
location of the regions in the cluster.
ZooKeeper stores the location of the
META table.
• These three steps happen the
first time a client reads or writes to
HBase:
1. The client gets the Region server
that hosts the META table from
ZooKeeper.
2. The client queries the .META.
server to get the region server
corresponding to the row key it
wants to access. The client
caches this information along with
the META table location.
3. It gets the row from the
corresponding Region Server.
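The three lookup steps (ZooKeeper, then META, then the region server) can be sketched with plain dictionaries. This is not the HBase client API: the server names, the start keys, and the Client class are assumptions made up for the illustration, and the cache here is per row key, whereas a real client would cache region boundaries.

```python
import bisect

# Hypothetical cluster state for this sketch.
ZOOKEEPER = {"meta_location": "rs1"}  # ZooKeeper knows which server hosts META
# META maps each region's start key to the server hosting that region.
META = {"rs1": [("", "rs1"), ("h", "rs2"), ("p", "rs3")]}
REGION_DATA = {
    "rs1": {"adams": "row-adams"},
    "rs2": {"jsmith": "row-jsmith"},
    "rs3": {"raju": "row-raju"},
}


class Client:
    def __init__(self):
        self.meta_server = None  # cached META table location
        self.region_cache = {}   # cached row key -> region server
        self.zk_lookups = 0

    def locate(self, row_key):
        if row_key in self.region_cache:
            return self.region_cache[row_key]
        if self.meta_server is None:
            # Step 1: ask ZooKeeper which server hosts the META table.
            self.meta_server = ZOOKEEPER["meta_location"]
            self.zk_lookups += 1
        # Step 2: ask META which region (by start key) covers the row key.
        regions = META[self.meta_server]
        start_keys = [start for start, _ in regions]
        index = bisect.bisect_right(start_keys, row_key) - 1
        server = regions[index][1]
        self.region_cache[row_key] = server
        return server

    def get(self, row_key):
        # Step 3: read the row from the region server that owns it.
        return REGION_DATA[self.locate(row_key)].get(row_key)


client = Client()
print(client.get("jsmith"))
print(client.get("raju"))
print("ZooKeeper lookups:", client.zk_lookups)
```

Because the META location is cached after the first request, the second read goes to META and the region server without contacting ZooKeeper again.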
• WAL: The Write Ahead Log is a file on the distributed file system. The
WAL is used to store new data that hasn't yet been persisted to
permanent storage; it is used for recovery in the case of failure.
• BlockCache: This is the read cache. It stores frequently read data in
memory. Least Recently Used data is evicted when full.
• MemStore: This is the write cache. It stores new data which has not yet
been written to disk. It is sorted before writing to disk.
There is one MemStore per column family per
region.
• HFiles store the rows as sorted KeyValues on disk.
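The Least Recently Used eviction mentioned for the BlockCache can be sketched as follows. This is a generic LRU cache written for illustration, not HBase's own implementation; the class name, the block names, and the capacity (counted in entries rather than bytes) are assumptions.

```python
from collections import OrderedDict


class SimpleLruCache:
    """Keeps at most `capacity` blocks; evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used


cache = SimpleLruCache(capacity=2)
cache.put("block-1", b"a")
cache.put("block-2", b"b")
cache.get("block-1")        # block-1 is now the most recently used
cache.put("block-3", b"c")  # cache is full, so block-2 is evicted
print(list(cache.blocks))   # ['block-1', 'block-3']
```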
HBase Write Steps (1)
• When the client issues a Put request, the first step is to write the data to the
write-ahead log, the WAL:
- Edits are appended to the end of the WAL file that is stored on disk
- The WAL is used to recover not-yet-persisted data in case a server
crashes.
HBase Write Steps (2)
Once the data is written to the WAL, it is placed in the MemStore.
Then, the put request acknowledgement returns to the client.
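The write steps above can be sketched in a few lines. This is a simplified model, not HBase code: the RegionSketch class, the flush threshold counted in entries, and the recover method are assumptions for illustration (a real MemStore flushes based on size, and WAL handling is more involved).

```python
class RegionSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only list standing in for the WAL file
        self.memstore = {}       # in-memory write cache
        self.hfiles = []         # each flush produces one sorted, immutable file
        self.flushed_upto = 0    # WAL position already covered by flushed files
        self.flush_threshold = flush_threshold

    def put(self, row, column, value):
        self.wal.append((row, column, value))  # 1. append the edit to the WAL
        self.memstore[(row, column)] = value   # 2. place it in the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()
        return "ack"                           # 3. acknowledge to the client

    def flush(self):
        # Sort the MemStore contents before writing them out as a new file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.flushed_upto = len(self.wal)

    def recover(self):
        # After a crash, replay WAL edits that were never flushed to a file.
        self.memstore = {}
        for row, column, value in self.wal[self.flushed_upto:]:
            self.memstore[(row, column)] = value


region = RegionSketch(flush_threshold=3)
region.put("row2", "cf:name", "b")
region.put("row1", "cf:name", "a")
region.memstore = {}   # simulate a crash that loses the in-memory data
region.recover()       # the two edits come back from the WAL
region.put("row3", "cf:name", "c")  # third entry triggers a flush
print(region.hfiles[0])
```

The acknowledgement is returned only after both the WAL append and the MemStore update, which is why an acknowledged write survives the simulated crash.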
HBase MemStore
HBase Major Compaction
• Major compaction merges and
rewrites all the HFiles in a region
to one HFile per column family,
and in the process, drops deleted
or expired cells.
• Since major compaction rewrites
all of the files, lots of disk I/O and
network traffic might occur during
the process. This is called write
amplification.
• Major compactions can be
scheduled to run automatically.
Due to write amplification, major
compactions are usually
scheduled for weekends or
evenings.
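The merge step of a major compaction can be sketched as a merge of sorted files. This is a simplification under stated assumptions: each file is a list of (key, timestamp, value) tuples sorted by key and then newest first, TOMBSTONE is a marker invented for this sketch, and only the newest version of each key is kept, whereas HBase can retain several versions.

```python
import heapq

TOMBSTONE = object()  # marker for a deleted cell (sketch-only convention)


def major_compact(hfiles, now, ttl):
    """Merge sorted files into one, dropping deleted and expired cells."""
    # Order by key ascending, then by timestamp descending (newest first).
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result, last_key = [], None
    for key, timestamp, value in merged:
        if key == last_key:
            continue  # an older version of a key that was already handled
        last_key = key
        if value is TOMBSTONE or now - timestamp > ttl:
            continue  # drop deleted or expired cells
        result.append((key, timestamp, value))
    return result


hfile_a = [("row1", 10, "old"), ("row2", 12, "x")]
hfile_b = [("row1", 20, "new"), ("row3", 1, "stale")]
hfile_c = [("row2", 30, TOMBSTONE)]

compacted = major_compact([hfile_a, hfile_b, hfile_c], now=40, ttl=35)
print(compacted)  # [('row1', 20, 'new')]
```

Every input file is read in full and one new file is written, which is the source of the disk I/O described above.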
Region Split
HDFS Data Replication
• The WAL file and the HFiles are persisted on disk and replicated.
Start HBase Server and Shell
• To start the HBase server, go to the HBase folder
and run the following command in a terminal:
$ ./bin/start-hbase.sh
Note: HMaster, ZooKeeper, and RegionServer are started when we use the
above command.
• Stop your HBase instance by running the stop script:
$ ./bin/stop-hbase.sh
stopping HBase...............
Web Interface
Master Web Interface
The Master starts a web interface on port 60010 by default (16010 in
HBase 1.0 and later).
EX: http://localhost:60010/
RegionServer Web Interface
The RegionServer starts a web interface on port 60030 by default (16030
in HBase 1.0 and later).
EX: http://localhost:60030/
HBase Shell
• What is the HBase shell?
The HBase shell is a command line interface for reading and writing data
to an HBase database.
• To start the shell, run the following:
$ ./bin/hbase shell
This will connect the HBase shell to the HBase server.
Commands and descriptions
– create: create a table in the database
– put: add a record to a table
– get: retrieve a record from a table
– scan: retrieve a set of records from a table
– delete, deleteall: delete an entire row, a column, or a cell from a
table
– alter: alter a table (add or delete a column family)
– describe: describe the named table
– list: list all tables in the database
– disable/enable: disable/enable the named table
– drop: drop the named table
Creating a Table using HBase Shell
• You can create a table using the create command. You must specify
the table name and the column family name. The syntax to create a table
in the HBase shell is shown below.
• create '<table name>','<column family>'
• Given below is a sample schema of a table named emp. It has two column
families: "personal data" and "professional data".
• create 'emp', 'personal data', 'professional data'
• You can verify that the table was created with the list command:
hbase(main):001:0> list
Inserting the First Row
• Using the put command, you can insert rows into a table.
• put '<table name>','<row key>','<colfamily:colname>','<value>'
• put 'emp','1','personal data:name','raju'
• put 'emp','1','personal data:city','Pune'
• put 'emp','1','professional data:designation','manager'
• put 'emp','1','professional data:salary','50000'
Logical and physical layout of rows within regions
• Compression
HBase has pluggable compression algorithm support that allows you to choose
the best compression, or none, for the data stored in a particular column family.
Supported compression algorithms (value: description)
– NONE: Disables compression (default)
– GZ: Uses the Java-supplied or native GZip compression
– LZO: Enables LZO compression; must be installed separately
– SNAPPY: Enables Snappy compression; binaries must be installed
separately
Updating Data using HBase Shell
• You can update an existing cell value using the put command. To do so,
just follow the same syntax and mention your new value as shown
below.
• put '<table name>','<row>','<column family:column name>','<new value>'
• put 'emp','1','personal data:city','Delhi'
Reading Data
• get '<table name>','<row>'
• get 'emp', '1'
Reading a Specific Column
• get '<table name>', '<row>', {COLUMN => '<column family:column name>'}
• get 'emp', '1', {COLUMN => 'personal data:name'}
Deleting a Specific Cell in a Table
• delete '<table name>', '<row>', '<column name>', '<time stamp>'
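To make the semantics of put, get, and delete concrete, here is a small in-memory model. It is a sketch for illustration only, not the HBase API: the TableSketch class is hypothetical, timestamps are a simple counter rather than wall-clock milliseconds, and delete here removes every version of a cell.

```python
from collections import defaultdict


class TableSketch:
    """Stand-in for a table: row -> 'family:qualifier' -> {timestamp: value}."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))
        self._clock = 0  # logical timestamp, used only in this sketch

    def put(self, row, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._clock += 1
        self.rows[row][column][self._clock] = value  # new version of the cell

    def get(self, row, column=None):
        cells = self.rows.get(row, {})
        # For each column, return the value with the highest timestamp.
        latest = {c: versions[max(versions)] for c, versions in cells.items()}
        return latest if column is None else latest.get(column)

    def delete(self, row, column):
        self.rows.get(row, {}).pop(column, None)


emp = TableSketch("personal data", "professional data")
emp.put("1", "personal data:name", "raju")
emp.put("1", "personal data:city", "Pune")
emp.put("1", "personal data:city", "Delhi")  # an update is a newer put
print(emp.get("1", "personal data:city"))    # Delhi
emp.delete("1", "personal data:city")
print(emp.get("1"))                          # {'personal data:name': 'raju'}
```

The update does not overwrite "Pune" in place; it adds a newer version, and reads return the newest one.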
HIVE HBase Integration
Connect from JAVA
HANDS-ON document
Create a table:
create 'customer', {NAME=>'addr'}, {NAME=>'order'}
Use ‘describe’ to get the description of the table.
describe 'customer'
Insert and get the records
Put some data into the table
put 'customer', 'jsmith', 'addr:city', 'nashville'
Use ‘get’ to retrieve the data for ‘jsmith’
get 'customer', 'jsmith'
Apache HBase™

Apache HBase™

  • 2. Introduction to NoSQL What is NoSQL?  It’s a whole new way of thinking about a database and encompasses a wide variety of different database technologies.  A non-relational and largely distributed database system that enables rapid and analysis of extremely high-volume, disparate (different) data types. Why NoSQL? • The Benefits of NoSQL : A very flexible and schema-less data model, horizontal scalability, distributed architectures. • Dynamic Schemas • Auto-sharding • Replication • Integrated Caching
  • 3. NoSQL Database Types
      • Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, key-array pairs, or even nested documents.
      • Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4j and HyperGraphDB.
      • Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key") together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
      • Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together instead of rows.
  • 4. NoSQL vs. SQL Summary
      • Types. SQL database: one type (SQL database) with minor variations. NoSQL database: many different types, including key-value stores, document databases, wide-column stores, and graph databases.
      • Development history. SQL database: developed in the 1970s to deal with the first wave of data storage applications. NoSQL database: developed in the 2000s to deal with limitations of SQL databases, particularly concerning scale, replication, and unstructured data storage.
      • Examples. SQL database: MySQL, Postgres, Oracle Database. NoSQL database: MongoDB, Cassandra, HBase, Neo4j, Riak, Voldemort, CouchDB, DynamoDB.
      • Schemas. SQL database: structure and data types are fixed in advance. NoSQL database: typically dynamic. Records can add new information on the fly, and unlike SQL table
      • Scaling. SQL database: vertically. NoSQL database: horizontally.
      • Data manipulation. SQL database: a specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE… NoSQL database: through object-oriented APIs.
  • 5. NoSQL's More Flexible Data Model
      • Relational and NoSQL data models are very different. The relational model takes data and separates it into many interrelated tables that contain rows and columns. Tables reference each other through foreign keys, which are stored in columns as well.
      • NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format.
  • 6. CAP Theorem
      • In theoretical computer science, the CAP theorem is also known as Brewer's theorem.
      • The CAP theorem was first proposed by Eric Brewer of the University of California, Berkeley, in 2000, and then proven by Seth Gilbert and Nancy Lynch of MIT in 2002.
      • It's important to keep three core requirements in mind: 1. Consistency 2. Availability 3. Partition tolerance
  • 7.
      • Consistency (C) requires that all reads initiated after a successful write return the same, latest value at any given logical time.
      • Availability (A) requires that every node that is not in a failed state always executes queries, such as write operations.
      • Partition tolerance (P) requires that a system be able to re-route communication when there are temporary breaks or failures in the network. The goal is to maintain synchronization among the involved nodes.
      Brewer's theorem states that it's typically not possible for a distributed system to provide all three requirements simultaneously, because one of them will always be compromised.
  • 8.
  • 9. HBase
      • Apache HBase is an open-source, distributed, versioned, fault-tolerant, scalable, column-oriented store modeled after Google's Bigtable, with random, real-time read/write access to data.
      • HBase is a column-oriented database.
      • Use it when you need random, real-time read/write access to your Big Data.
      • Use it when the project's goal is the hosting of very large tables: billions of rows by millions of columns.
  • 10. Features of HBase
      • Scalable
      • Automatic failover support
      • Consistent reads and writes
      • Sharding of tables
      • Classes for backing Hadoop MapReduce jobs
      • Java API for client access
      • Thrift gateway and a RESTful web service
      • Shell support
  • 11. Backdrop
      • Started by Chad Walters and Jim
      • 2006.NOV – Google releases paper on BigTable
      • 2007.FEB – Initial HBase prototype created as Hadoop contrib.
      • 2007.OCT – First usable HBase
      • 2008.JAN – Hadoop becomes an Apache top-level project and HBase becomes a subproject
      • 2008.OCT – HBase 0.18, 0.19 released
      • 2013.JUNE – Latest version is HBase 0.97
  • 12. Difference Between HDFS and HBase
      • HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general-purpose file system, and it does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
      What HBase is not
      • No SQL
      • No relations
      • No joins
      • Not a replacement for an RDBMS
  • 13. What Is the Difference Between HBase and RDBMS?
  • 15.
  • 17. HBase Architectural Components
      • HBase is composed of three types of servers in a master-slave type of architecture.
      • Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process.
      • Region servers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly.
      • ZooKeeper, a separate distributed coordination service, maintains live cluster state.
      • The Hadoop DataNode stores the data that the Region Server is managing.
      • All HBase data is stored in HDFS files.
  • 20. HMaster
      • The Master server is responsible for monitoring all RegionServer instances in the cluster and is the interface for all metadata changes. It typically runs on the server that hosts the NameNode.
      • The Master controls critical functions such as RegionServer failover and completing region splits. So while the cluster can still run for a time without the Master, the Master should be restarted as soon as possible.
  • 21.
  • 22. RegionServers
      • A RegionServer is responsible for serving and managing regions; it is like a data node for HBase.
      • RegionServers can be thought of as the HBase counterpart of the DataNodes in a Hadoop cluster. They serve client requests for data and handle the actual data storage and requests.
      • Each RegionServer hosts a set of regions, which are row-key ranges of tables.
      • RegionServers are usually configured to run on the same servers as the HDFS DataNodes, which has the added advantage of data locality.
  • 23.
  • 24. ZooKeeper
      • ZooKeeper is open-source software providing a highly reliable, distributed coordination service.
      • It is the entry point for an HBase system.
      • It tracks the region servers and where the root region is hosted.
  • 25.
  • 26. API Interface to HBase
      • HBase can be accessed through REST, Thrift, and Avro. Using these interfaces we can perform reads, writes, and other operations on HBase.
      • Thrift is an API framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.
  • 28.
  • 29. HBase First Read or Write
      There is a special HBase catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table. These three steps happen the first time a client reads or writes to HBase:
      1. The client gets from ZooKeeper the Region Server that hosts the META table.
      2. The client queries the .META. server to get the Region Server corresponding to the row key it wants to access. The client caches this information along with the META table location.
      3. The client gets the row from the corresponding Region Server.
  • 30.
      • WAL: the Write-Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
      • BlockCache: the read cache. It stores frequently read data in memory. Least recently used data is evicted when the cache is full.
      • MemStore: the write cache. It stores new data that has not yet been written to disk, and it is sorted before being written to disk. There is one MemStore per column family per region.
      • HFiles store the rows as sorted KeyValues on disk.
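The three lookup steps on this slide can be illustrated with a small Python sketch. It is a toy model, not the HBase client: `MetaTable`, `Client`, the `"meta-location"` key, and the server names are all hypothetical, and a plain dictionary stands in for ZooKeeper. It only shows how a row key is mapped to a region by start key and how the client-side cache avoids repeated META lookups.

```python
import bisect


class MetaTable:
    """Toy stand-in for the META catalog table (hypothetical, not the HBase API)."""

    def __init__(self, regions):
        # regions: list of (start_key, server); the first start key is "".
        self.regions = sorted(regions)
        self.start_keys = [start for start, _ in self.regions]
        self.lookups = 0  # counts how often META is actually queried

    def locate(self, row_key):
        self.lookups += 1
        # The owning region is the one with the largest start key <= row_key.
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        start, server = self.regions[idx]
        has_next = idx + 1 < len(self.regions)
        end = self.start_keys[idx + 1] if has_next else None
        return start, end, server


class Client:
    def __init__(self, zookeeper, meta_servers):
        self.zookeeper = zookeeper        # dict standing in for ZooKeeper
        self.meta_servers = meta_servers  # server name -> MetaTable
        self.meta_location = None         # cached META location
        self.region_cache = []            # cached (start, end, server) entries

    def find_region_server(self, row_key):
        # Later requests are answered from the cache when possible.
        for start, end, server in self.region_cache:
            if start <= row_key and (end is None or row_key < end):
                return server
        # Step 1: ask ZooKeeper which server hosts META (cached afterwards).
        if self.meta_location is None:
            self.meta_location = self.zookeeper["meta-location"]
        # Step 2: ask META which region server owns the row key, and cache it.
        meta = self.meta_servers[self.meta_location]
        start, end, server = meta.locate(row_key)
        self.region_cache.append((start, end, server))
        # Step 3 would be the actual read or write against this server.
        return server


meta = MetaTable([("", "rs1"), ("m", "rs2")])
client = Client({"meta-location": "rs-meta"}, {"rs-meta": meta})
print(client.find_region_server("apple"))
print(client.find_region_server("banana"))
print(client.find_region_server("zebra"))
```

In this sketch the second call is served from the client cache, so META is queried only twice for the three requests. A real client would also invalidate cache entries when a region moves, which is omitted here.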
  • 31.
  • 32. HBase Write Steps (1)
      • When the client issues a Put request, the first step is to write the data to the write-ahead log, the WAL:
        - Edits are appended to the end of the WAL file that is stored on disk.
        - The WAL is used to recover not-yet-persisted data in case a server crashes.
  • 33. HBase Write Steps (2)
      • Once the data is written to the WAL, it is placed in the MemStore. Then the put request acknowledgement returns to the client.
  • 35.
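The write steps on the last two slides (append to the WAL, place the edit in the MemStore, acknowledge, and later flush the sorted MemStore to a new HFile) can be sketched as follows. This is a simplified in-memory model under stated assumptions: the `Region` class, its `flush_threshold` parameter, and the use of Python lists for the WAL and HFiles are hypothetical, there is a single MemStore rather than one per column family, and the flush runs inline instead of in the background.

```python
class Region:
    """Toy model of the HBase write path (hypothetical, single column family)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only list standing in for the WAL file
        self.memstore = {}   # in-memory write cache
        self.hfiles = []     # each flush adds one immutable, sorted list
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        # Step 1: append the edit to the end of the WAL.
        self.wal.append((row_key, value))
        # Step 2: place the edit in the MemStore.
        self.memstore[row_key] = value
        # Simplification: flush inline once the MemStore holds enough rows.
        if len(self.memstore) >= self.flush_threshold:
            self.flush()
        # Step 3: acknowledge the put to the client.
        return "ack"

    def flush(self):
        # The MemStore content is sorted by key and written as one new HFile.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()

    def get(self, row_key):
        # Check the MemStore first, then the HFiles from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


region = Region(flush_threshold=2)
region.put("b", "1")
region.put("a", "2")
print(region.hfiles)
```

After the second put the MemStore reaches the threshold, so the two rows are written out as one HFile sorted by key, while the WAL still holds the edits in arrival order.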
  • 36.
  • 37. HBase Major Compaction
      • Major compaction merges and rewrites all the HFiles in a region to one HFile per column family, and in the process drops deleted or expired cells.
      • Since major compaction rewrites all of the files, lots of disk I/O and network traffic might occur during the process. This is called write amplification.
      • Major compactions can be scheduled to run automatically. Due to write amplification, major compactions are usually scheduled for weekends or evenings.
  • 39. HDFS Data Replication
      • The WAL file and the HFiles are persisted on disk and replicated.
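A minimal sketch of the merge that major compaction performs, assuming a toy representation in which each HFile is a sorted list of (key, value) pairs and deletes are recorded with a marker value. The `TOMBSTONE` marker and the `major_compact` function are hypothetical names for illustration, and expired cells (TTL) are not modeled.

```python
# Hypothetical delete marker; real HBase stores delete markers as special cells.
TOMBSTONE = object()


def major_compact(hfiles):
    """Merge all HFiles (oldest first) into one sorted file, dropping deletes."""
    merged = {}
    for hfile in hfiles:
        for key, value in hfile:
            # A later file holds newer data, so it overwrites older values.
            merged[key] = value
    # Rewrite everything as a single sorted file without deleted cells.
    return [(key, merged[key]) for key in sorted(merged)
            if merged[key] is not TOMBSTONE]


older = [("a", "1"), ("b", "2")]
newer = [("a", "9"), ("b", TOMBSTONE), ("c", "3")]
print(major_compact([older, newer]))
```

Every cell from every input file is read and the survivors are written again, which is where the write amplification mentioned on the slide comes from.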
  • 40. Start HBase Server and Shell
      • To start the HBase server, go to the HBase folder and run the following command in a terminal:
          $ ./bin/start-hbase.sh
        Note: the HMaster, ZooKeeper, and RegionServer are all started by the command above.
      • Stop your HBase instance by running the stop script:
          $ ./bin/stop-hbase.sh
          stopping HBase...............
  • 41. Web Interface
      • Master web interface: in the HBase versions covered here, the Master starts a web interface on port 60010 by default, e.g. http://localhost:60010/ (HBase 1.0 and later default to port 16010).
      • RegionServer web interface: the RegionServer starts a web interface on port 60030 by default, e.g. http://localhost:60030/ (HBase 1.0 and later default to port 16030).
  • 42. HBase Shell
      • What is the HBase shell? It is a command-line interface for reading and writing data to an HBase database.
      • To start the shell, run the following command, which connects the HBase shell to the HBase server:
          $ ./bin/hbase shell
  • 43. Commands and descriptions
      • create: create a table in the database
      • put: add a record to a table
      • get: retrieve a record from a table
      • scan: retrieve a set of records from a table
      • delete, deleteall: delete an entire row, a column, or a cell from a table
      • alter: alter a table (add or delete a column family)
      • describe: describe the named table
      • list: list all tables in the database
      • disable/enable: disable or enable the named table
      • drop: drop the named table
  • 44. Creating a Table using HBase Shell
      • You can create a table using the create command. You must specify the table name and the column family name. The syntax to create a table in the HBase shell is:
          create '<table name>','<column family>'
      • The following example creates a table named emp with two column families, "personal data" and "professional data":
          create 'emp', 'personal data', 'professional data'
      • To verify that the table was created, list the tables:
          hbase(main):001:0> list
  • 45. Inserting the First Row
      • Using the put command, you can insert rows into a table:
          put '<table name>','<row key>','<colfamily:colname>','<value>'
      • Example:
          put 'emp','1','personal data:name','raju'
          put 'emp','1','personal data:city','Pune'
          put 'emp','1','professional data:designation','manager'
          put 'emp','1','professional data:salary','50000'
  • 46.
  • 47. Logical and physical layout of rows within regions
  • 48. Compression
      • HBase has pluggable compression algorithm support that allows you to choose the best compression, or none, for the data stored in a particular column family.
      • Supported compression algorithms (value: description):
          NONE: disables compression (default)
          GZ: uses the Java-supplied or native GZip compression
          LZO: enables LZO compression; must be installed separately
          SNAPPY: enables Snappy compression; binaries must be installed separately
  • 49. Updating Data using HBase Shell
      • You can update an existing cell value using the put command. To do so, follow the same syntax and supply the new value:
          put '<table name>','<row>','<column family:column name>','<new value>'
      • Example, using the emp table created earlier:
          put 'emp','1','personal data:city','Delhi'
      Reading Data
      • Syntax:
          get '<table name>','<row>'
      • Example:
          get 'emp', '1'
  • 50. Reading a Specific Column
      • Syntax:
          get '<table name>', '<row>', {COLUMN => '<column family:column name>'}
      • Example:
          get 'emp', '1', {COLUMN => 'personal data:name'}
      Deleting a Specific Cell in a Table
      • Syntax:
          delete '<table name>', '<row>', '<column name>', <time stamp>
  • 52.
  • 53.
  • 54.
  • 56.
  • 57. Hands-on
      • Create a table:
          create 'customer', {NAME=>'addr'}, {NAME=>'order'}
      • Use describe to get the description of the table:
          describe 'customer'
      • Put some data into the table:
          put 'customer', 'jsmith', 'addr:city', 'nashville'
      • Use get to retrieve the data for jsmith:
          get 'customer', 'jsmith'

Editor's Notes

  1. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
  2. Voldemort is a distributed key-value storage system. It is used at LinkedIn by numerous critical services powering a large portion of the site. Riak uses a simple key/value model for object storage. You can store anything you want in Riak: text, images, JSON/XML/HTML documents, user and session data, backups, and log files. MongoDB uses a document model.
  3. Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1. Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations. Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly? Further Reading - http://robertgreiner.com/2014/08/cap-theorem-revisited/
  4. The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.
  5. Region Servers are collocated with the HDFS DataNodes, which enable data locality (putting the data close to where it is needed) for the data served by the RegionServers.  HBase data is local when it is written, but when a region is moved, it is not local until compaction. The NameNode maintains metadata information for all the physical data blocks that comprise the files.
  6. A master is responsible for:
      - Coordinating the region servers: assigning regions on startup, re-assigning regions for recovery or load balancing, and monitoring all RegionServer instances in the cluster (it listens for notifications from ZooKeeper).
      - Admin functions: providing the interface for creating, deleting, and updating tables.
  7. HFile: the on-disk storage file format for HBase. MemStore: the in-memory store for HBase. Write-Ahead Log (WAL): edits are written to the WAL (HLog) immediately.
  8. HBase Tables are divided horizontally by row key range into “Regions.” A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes. A region server can serve about 1,000 regions.
  9. HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster. Zookeeper maintains which servers are alive and available, and provides server failure notification. Zookeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.
  10. For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table unless there is a miss because a region has moved; then it will re-query and update the cache.
      HBase META table: this META table is an HBase table that keeps a list of all regions in the system. The .META. table is like a B-tree. The .META. table structure is as follows:
      - Key: region start key, region id
      - Values: RegionServer
  11. Data is stored in an HFile, which contains sorted key/values. When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the disk drive head.
      An HFile contains a multi-layered index that allows HBase to seek to the data without having to read the whole file. The multi-level index is like a B+ tree:
      - Key-value pairs are stored in increasing order.
      - Indexes point by row key to the key-value data in 64 KB "blocks".
      - Each block has its own leaf index.
      - The last key of each block is put in the intermediate index.
      - The root index points to the intermediate index.
      The trailer points to the meta blocks and is written at the end of persisting the data to the file. The trailer also has information like bloom filters and time range info. Bloom filters help to skip files that do not contain a certain row key. The time range info is useful for skipping the file if it is not in the time range the read is looking for.
  12. Minor compactions will usually pick up a couple of the smaller adjacent StoreFiles and rewrite them as one. Minors do not drop deletes or expired cells, only major compactions do this. Sometimes a minor compaction will pick up all the StoreFiles in the Store and in this case it actually promotes itself to being a major compaction.
  13. HBase crash recovery: how does HBase recover the MemStore updates that were not persisted to HFiles?
      When a RegionServer fails, the crashed regions are unavailable until detection and recovery steps have happened. ZooKeeper determines node failure when it loses the region server's heartbeats. The HMaster is then notified that the Region Server has failed.
      When the HMaster detects that a region server has crashed, it reassigns the regions from the crashed server to active Region Servers. To recover the crashed region server's MemStore edits that were not flushed to disk, the HMaster splits the WAL belonging to the crashed region server into separate files and stores these files in the new region servers' data nodes. Each Region Server then replays its respective split WAL to rebuild the MemStore for that region.
  14. Verification You can verify whether the table is created using the list command as shown below.
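The crash-recovery sequence described in note 13 (split the failed server's WAL by region, then replay each piece to rebuild that region's MemStore) can be sketched in Python. This is an illustrative toy under stated assumptions: the function names `split_wal` and `replay`, the region names, and the (region, row key, value) tuple layout of a WAL edit are hypothetical, and failure detection through ZooKeeper is not modeled.

```python
from collections import defaultdict


def split_wal(wal):
    """Split one server's WAL into per-region edit lists, keeping write order.

    wal: list of (region, row_key, value) edits (hypothetical layout).
    """
    per_region = defaultdict(list)
    for region, row_key, value in wal:
        per_region[region].append((row_key, value))
    return dict(per_region)


def replay(edits):
    """Rebuild a region's MemStore by re-applying its edits in order."""
    memstore = {}
    for row_key, value in edits:
        memstore[row_key] = value  # a later edit overwrites an earlier one
    return memstore


crashed_wal = [("r1", "a", "1"), ("r2", "x", "7"), ("r1", "a", "2")]
split = split_wal(crashed_wal)
for region, edits in split.items():
    # In HBase, the server that now hosts the region would perform the replay.
    print(region, replay(edits))
```

Replaying in the original write order matters: region r1 ends up with the later value for row a, just as it held before the crash.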