SlideShare a Scribd company logo
1 of 31
S
Apache HBase:
Introduction & Use Cases
Subash D’Souza
What is HBase?
S HBase is an open source, distributed, sorted map modeled
after Google’s Big Table
S NoSQL solution built atop Apache Hadoop
S Top level Apache Project
CAP Theorem
S Consistency (all nodes see the same data at the same time)
S Availability (a guarantee that every request receives a
response about whether it was successful or failed)
S Partition tolerance (the system continues to operate despite
arbitrary message loss or failure of part of the system)
According to the theorem, a distributed system can satisfy any
two of these guarantees at the same time, but not all three.
Ref: http://blog.nahurst.com/visual-guide-to-nosql-systems
Usage Scenarios
S Lots of Data - 100s of Gigs to Petabytes
S High Throughput – 1000’s of records/sec
S Scalable cache capacity – Adding several nodes adds to
available cache
S Data Layout – Excels at key lookup and no penalty for sparse
columns
Column Oriented Databases
S HBase belongs to family of databases called as Column-oriented
S Column-oriented databases save their data grouped by columns.
S The reason to store values on a per-column basis instead is based on
the assumption that, for specific queries, not all of the values are
needed.
S Reduced I/O is one of the primary reasons for this new layout
S Specialized algorithms—for example, delta and/or prefix
compression—selected based on the type of the column (i.e., on the
data stored) can yield huge improvements in compression ratios.
Better ratios result in more efficient bandwidth usage.
HBase as a Column Oriented
Database
S HBase is not a column-oriented database in the typical RDBMS
sense, but utilizes an on-disk column storage format.
S This is also where the majority of similarities end, because although
HBase stores data on disk in a column-oriented format, it is distinctly
different from traditional columnar databases.
S Whereas columnar databases excel at providing real-time analytical
access to data, HBase excels at providing key-based access to a
specific cell of data, or a sequential range of cells.
HBase and Hadoop
S Hadoop excels at storing data of arbitrary, semi-, or even
unstructured formats, since it lets you decide how to interpret
the data at analysis time, allowing you to change the way you
classify the data at any time: once you have updated the
algorithms, you simply run the analysis again.
S HBase sits atop Hadoop using all the best features of HDFS
such as scalability and data replication
HBase UnUsage
S When data access patterns are unknown – HBase follows a
data centric model rather than relationship centric, Hence it
does not make sense doing an ERD model for HBase
S Small amounts of data – Just use an RDBMS
S Limited/No random reads and writes – Just use HDFS directly
HBase Use Cases - Facebook
S One of the earliest and largest users of HBase
S Facebook messaging platform built atop HBase in 2010
S Chosen because of the high write throughput and low latency
random reads
S Other features such as Horizontal Scalability, Strong
Consistency and High Availability via Automatic Failover.
HBase Use Cases - Facebook
S In addition to online transaction processing workloads like
messages, it is also used for online analytic processing
workloads where large data scans are prevalent.
S Also used in production by other Facebook services, including
the internal monitoring system, the recently launched Nearby
Friends feature, search indexing, streaming data analysis, and
data scraping for their internal data warehouses.
Seek vs. Transfer
S One of the fundamental differences
between typical RDBMS and nosql
ones is the use of B or B+ trees and
Log Structured Merge Trees(LSM)
which was the basis of Google’s
BigTable
B+ Trees
S B+ Trees allow for efficient insertion, lookup and deletion of
records that are identified by keys.
S Represent dynamic, multilevel indexes with lower and upper
bounds per segment or page
S This allows for higher fanout compared to binary trees
resulting in lower number of I/O operations
S Range scans are also very efficient
LSM Trees
S Incoming data first stored in logfile, completely sequentially
S Once the log has modification saved, updates an in-memory
store.
S Once enough updates are accrued in the in-memory store, it
flushes a sorted list of key->record pairs to disks creating store
files.
S At this point all updates to log can be deleted since
modifications have been persisted.
Fundamental Difference
S Disk Drives
S Too Many Modifications force costly optimizations.
S More Data at random locations cause faster fragmentation
S Updates and deletes are done at disk seek rates rather than
disk transfer rates
Fundamental Difference(Contd)
S Works at disk transfer rates
S Scales better to handle large amounts of data.
S Guarantees consistent insert rate
S Transform random writes into sequential writes using logfiles
plus in-memory store
S Reads independent from writes so no contention between the
two
HBase Basics
S When data is added to HBase, it is first written to the WAL(Write
ahead log) called HLog.
S Once the write is done, it is then written to an in memory called
MemStore
S Once the memory exceeds a certain threshold, it flushes to disk as
an HFile
S Over time HBase merges smaller HFiles into larger ones. This process
is called compaction
Ref: https://www.altamiracorp.com/blog/employee-posts/handling-big-data-with-
hbase-part-3-architecture-overview
Facebook-Hydrabase
S In HBase, when a regionserver fails, all regions hosted by that
regionserver are moved to another regionserver.
S Depending on HBase has been setup, this typically entails
splitting and replaying the WAL files which could take time and
lengthens the failover
S Hydrabase differs from HBase in this. Instead of having a
region by a single region server, it is hosted by a set of
regionservers.
S When a regionserver fails, there are standby regionservers
ready to take over
Facebook-Hydrabase
S The standby region servers can be spread across different
racks or even data centers, providing availability.
S The set of region servers serving each region form a quorum.
Each quorum has a leader that services read and write
requests from the client.
S HydraBase uses the RAFT consensus protocol to ensure
consistency across the quorum.
S With a quorum of 2F+1, HydraBase can tolerate up to F
failures.
S Increases reliability from 99.99% to 99.999% ~ 5 mins
downtime/year.
HBase Users - Flurry
S Mobile analytics, monetization and advertising company
founded in 2005
S Recently acquired by Yahoo
S 2 data centers with 2 clusters each, bi directional replication
S 1000 slave nodes per cluster – 32 GB RAM, 4 drives(1 or 2 TB),
1 Gig E, Dual Quad Core processors *2 HT = 16 procs
S ~30 tables, 250k regions, 430TB(after LZO compression)
S 2 big tables are approx 90% of that, 1 wide table with 3 CF, 4
billion rows with 1 MM cells per row. The other tall table with
1 CF, 1 trillion rows and 1 cell per row
HBase Security – 0.98
S Cell Tags – All values in HBase are now written in cells, can also
carry arbitrary no. of tags such as metadata
S Cell ACLs – enables the checking of (R)ead, (W)rite, E(X)excute,
(A)dmin & (C)reate
S Cell Labels – Visibility expression support via new security
coprocessor
S Transparent Encryption – data is encrypted on disk – HFiles are
encrypted when written and decrypted when read
S RBAC – Uses Hadoop Group Mapping Service and ACL’s to
implement
Apache Phoenix
S SQL layer atop HBase – Has a query engine, metadata
repository & embedded JDBC driver, top level apache project,
currently only for HBase
S Fastest way to access HBase data – HBase specific push down,
compiles queries into native, direct HBase calls(no map-
reduce), executes scans in parallel
S Integrates with Pig, Flume & Sqoop
S Phoenix maps HBase data model to relational world
Ref: Taming HBase with Apache Phoenix and SQL, HBaseCon 2014
Open TSDB 2.0
S Distributed, Scalable Time Series Database on top of HBase
S Time Series – data points for identity over time.
S Stores trillions of data points, never loses precision, scales
using HBase
S Good for system monitoring & measurement – servers &
networks, Sensor data – The internet of things, SCADA,
Financial data, Results of Scientific experiments, etc.
Open TSDB 2.0
S Users – OVH(3rd largest cloud/hosting provider) to monitor
everything from networking, temperature, voltage to resource
utilization, etc.
S Yahoo uses it to monitor application performance & statistics
S Arista networking uses it for high performance networking
S Other users such as Pinterest, Ebay, Box, etc.
Apache Slider(Incubator)
S YARN application to deploy existing distributed applications on
YARN, monitor them and make them larger or smaller as
desired -even while the application is running.
S Incubator Apache Project; Similar to Tez for Hive/Pig
S Applications can be stopped, "frozen" and restarted, "thawed"
later; It allows users to create and run multiple instances of
applications, even with different application versions if needed
S Applications such as HBase, Accumulo & Storm can run atop it
Thanks!!
S Credits – Apache, Cloudera, Hortonworks, MapR, Facebook,
Flurry & HBaseCon
S @sawjd22
S www.linkedin.com/in/sawjd/
S Q & A

More Related Content

What's hot

Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
Data Evolution in HBase
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBaseHBaseCon
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the BasicsHBaseCon
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Apache HBase Application Archetypes
Apache HBase Application ArchetypesApache HBase Application Archetypes
Apache HBase Application ArchetypesCloudera, Inc.
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 

What's hot (20)

Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Data Evolution in HBase
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBase
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
6.hive
6.hive6.hive
6.hive
 
Apache HBase Application Archetypes
Apache HBase Application ArchetypesApache HBase Application Archetypes
Apache HBase Application Archetypes
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Apache sqoop
Apache sqoopApache sqoop
Apache sqoop
 
Apache phoenix
Apache phoenixApache phoenix
Apache phoenix
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 

Viewers also liked

Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaselarsgeorge
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini Cloudera, Inc.
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)tatsuya6502
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case StudiesEvan Liu
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)Ryan Tabora
 
Introduction to Apache HBase
Introduction to Apache HBaseIntroduction to Apache HBase
Introduction to Apache HBaseGokuldas Pillai
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConamarsri
 
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.Data Con LA
 
Case Study (Paper): How Cisco Spark is used at ZOOM International
Case Study (Paper): How Cisco Spark is used at ZOOM InternationalCase Study (Paper): How Cisco Spark is used at ZOOM International
Case Study (Paper): How Cisco Spark is used at ZOOM InternationalZOOM International
 

Viewers also liked (20)

Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)Searching Billions of Product Logs in Real Time (Use Case)
Searching Billions of Product Logs in Real Time (Use Case)
 
Introduction to Apache HBase
Introduction to Apache HBaseIntroduction to Apache HBase
Introduction to Apache HBase
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheCon
 
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
 
Case Study (Paper): How Cisco Spark is used at ZOOM International
Case Study (Paper): How Cisco Spark is used at ZOOM InternationalCase Study (Paper): How Cisco Spark is used at ZOOM International
Case Study (Paper): How Cisco Spark is used at ZOOM International
 

Similar to Apache HBase - Introduction & Use Cases

Similar to Apache HBase - Introduction & Use Cases (20)

HBASE Overview
HBASE OverviewHBASE Overview
HBASE Overview
 
Dsm project-h base-cassandra
Dsm project-h base-cassandraDsm project-h base-cassandra
Dsm project-h base-cassandra
 
4. hbase overview
4. hbase overview4. hbase overview
4. hbase overview
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptx
 
Hbase
HbaseHbase
Hbase
 
No SQL introduction
No SQL introductionNo SQL introduction
No SQL introduction
 
Data Storage Management
Data Storage ManagementData Storage Management
Data Storage Management
 
Uint-5 Big data Frameworks.pdf
Uint-5 Big data Frameworks.pdfUint-5 Big data Frameworks.pdf
Uint-5 Big data Frameworks.pdf
 
Uint-5 Big data Frameworks.pdf
Uint-5 Big data Frameworks.pdfUint-5 Big data Frameworks.pdf
Uint-5 Big data Frameworks.pdf
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql database
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Big data hbase
Big data hbase Big data hbase
Big data hbase
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
HBase.pptx
HBase.pptxHBase.pptx
HBase.pptx
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project Report
 
Hbase
HbaseHbase
Hbase
 
Big data and tools
Big data and tools Big data and tools
Big data and tools
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Recently uploaded (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Apache HBase - Introduction & Use Cases

  • 1. S Apache HBase: Introduction & Use Cases Subash D’Souza
  • 2. What is HBase? S HBase is an open source, distributed, sorted map modeled after Google’s Big Table S NoSQL solution built atop Apache Hadoop S Top level Apache Project
  • 3. CAP Theorem S Consistency (all nodes see the same data at the same time) S Availability (a guarantee that every request receives a response about whether it was successful or failed) S Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system) According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.
  • 5. Usage Scenarios S Lots of Data - 100s of Gigs to Petabytes S High Throughput – 1000’s of records/sec S Scalable cache capacity – Adding several nodes adds to available cache S Data Layout – Excels at key lookup and no penalty for sparse columns
  • 6. Column Oriented Databases S HBase belongs to family of databases called as Column-oriented S Column-oriented databases save their data grouped by columns. S The reason to store values on a per-column basis instead is based on the assumption that, for specific queries, not all of the values are needed. S Reduced I/O is one of the primary reasons for this new layout S Specialized algorithms—for example, delta and/or prefix compression—selected based on the type of the column (i.e., on the data stored) can yield huge improvements in compression ratios. Better ratios result in more efficient bandwidth usage.
  • 7. HBase as a Column Oriented Database S HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format. S This is also where the majority of similarities end, because although HBase stores data on disk in a column-oriented format, it is distinctly different from traditional columnar databases. S Whereas columnar databases excel at providing real-time analytical access to data, HBase excels at providing key-based access to a specific cell of data, or a sequential range of cells.
  • 8. HBase and Hadoop S Hadoop excels at storing data of arbitrary, semi-, or even unstructured formats, since it lets you decide how to interpret the data at analysis time, allowing you to change the way you classify the data at any time: once you have updated the algorithms, you simply run the analysis again. S HBase sits atop Hadoop using all the best features of HDFS such as scalability and data replication
  • 9. HBase UnUsage S When data access patterns are unknown – HBase follows a data centric model rather than relationship centric, Hence it does not make sense doing an ERD model for HBase S Small amounts of data – Just use an RDBMS S Limited/No random reads and writes – Just use HDFS directly
  • 10. HBase Use Cases - Facebook S One of the earliest and largest users of HBase S Facebook messaging platform built atop HBase in 2010 S Chosen because of the high write throughput and low latency random reads S Other features such as Horizontal Scalability, Strong Consistency and High Availability via Automatic Failover.
  • 11. HBase Use Cases - Facebook S In addition to online transaction processing workloads like messages, it is also used for online analytic processing workloads where large data scans are prevalent. S Also used in production by other Facebook services, including the internal monitoring system, the recently launched Nearby Friends feature, search indexing, streaming data analysis, and data scraping for their internal data warehouses.
  • 12. Seek vs. Transfer S One of the fundamental differences between typical RDBMS and nosql ones is the use of B or B+ trees and Log Structured Merge Trees(LSM) which was the basis of Google’s BigTable
  • 13. B+ Trees S B+ Trees allow for efficient insertion, lookup and deletion of records that are identified by keys. S Represent dynamic, multilevel indexes with lower and upper bounds per segment or page S This allows for higher fanout compared to binary trees resulting in lower number of I/O operations S Range scans are also very efficient
  • 14. LSM Trees S Incoming data first stored in logfile, completely sequentially S Once the log has modification saved, updates an in-memory store. S Once enough updates are accrued in the in-memory store, it flushes a sorted list of key->record pairs to disks creating store files. S At this point all updates to log can be deleted since modifications have been persisted.
  • 15. Fundamental Difference S Disk Drives S Too Many Modifications force costly optimizations. S More Data at random locations cause faster fragmentation S Updates and deletes are done at disk seek rates rather than disk transfer rates
  • 16. Fundamental Difference(Contd) S Works at disk transfer rates S Scales better to handle large amounts of data. S Guarantees consistent insert rate S Transform random writes into sequential writes using logfiles plus in-memory store S Reads independent from writes so no contention between the two
  • 17. HBase Basics S When data is added to HBase, it is first written to the WAL(Write ahead log) called HLog. S Once the write is done, it is then written to an in memory called MemStore S Once the memory exceeds a certain threshold, it flushes to disk as an HFile S Over time HBase merges smaller HFiles into larger ones. This process is called compaction
  • 19.
  • 20.
  • 21.
  • 22. Facebook-Hydrabase S In HBase, when a regionserver fails, all regions hosted by that regionserver are moved to another regionserver. S Depending on HBase has been setup, this typically entails splitting and replaying the WAL files which could take time and lengthens the failover S Hydrabase differs from HBase in this. Instead of having a region by a single region server, it is hosted by a set of regionservers. S When a regionserver fails, there are standby regionservers ready to take over
  • 23. Facebook-Hydrabase S The standby region servers can be spread across different racks or even data centers, providing availability. S The set of region servers serving each region form a quorum. Each quorum has a leader that services read and write requests from the client. S HydraBase uses the RAFT consensus protocol to ensure consistency across the quorum. S With a quorum of 2F+1, HydraBase can tolerate up to F failures. S Increases reliability from 99.99% to 99.999% ~ 5 mins downtime/year.
  • 24. HBase Users - Flurry S Mobile analytics, monetization and advertising company founded in 2005 S Recently acquired by Yahoo S 2 data centers with 2 clusters each, bi directional replication S 1000 slave nodes per cluster – 32 GB RAM, 4 drives(1 or 2 TB), 1 Gig E, Dual Quad Core processors *2 HT = 16 procs S ~30 tables, 250k regions, 430TB(after LZO compression) S 2 big tables are approx 90% of that, 1 wide table with 3 CF, 4 billion rows with 1 MM cells per row. The other tall table with 1 CF, 1 trillion rows and 1 cell per row
  • 25. HBase Security – 0.98 S Cell Tags – All values in HBase are now written in cells, can also carry arbitrary no. of tags such as metadata S Cell ACLs – enables the checking of (R)ead, (W)rite, E(X)excute, (A)dmin & (C)reate S Cell Labels – Visibility expression support via new security coprocessor S Transparent Encryption – data is encrypted on disk – HFiles are encrypted when written and decrypted when read S RBAC – Uses Hadoop Group Mapping Service and ACL’s to implement
  • 26. Apache Phoenix S SQL layer atop HBase – Has a query engine, metadata repository & embedded JDBC driver, top level apache project, currently only for HBase S Fastest way to access HBase data – HBase specific push down, compiles queries into native, direct HBase calls(no map- reduce), executes scans in parallel S Integrates with Pig, Flume & Sqoop S Phoenix maps HBase data model to relational world
  • 27. Ref: Taming HBase with Apache Phoenix and SQL, HBaseCon 2014
  • 28. Open TSDB 2.0 S Distributed, Scalable Time Series Database on top of HBase S Time Series – data points for identity over time. S Stores trillions of data points, never loses precision, scales using HBase S Good for system monitoring & measurement – servers & networks, Sensor data – The internet of things, SCADA, Financial data, Results of Scientific experiments, etc.
  • 29. Open TSDB 2.0 S Users – OVH(3rd largest cloud/hosting provider) to monitor everything from networking, temperature, voltage to resource utilization, etc. S Yahoo uses it to monitor application performance & statistics S Arista networking uses it for high performance networking S Other users such as Pinterest, Ebay, Box, etc.
  • 30. Apache Slider(Incubator) S YARN application to deploy existing distributed applications on YARN, monitor them and make them larger or smaller as desired -even while the application is running. S Incubator Apache Project; Similar to Tez for Hive/Pig S Applications can be stopped, "frozen" and restarted, "thawed" later; It allows users to create and run multiple instances of applications, even with different application versions if needed S Applications such as HBase, Accumulo & Storm can run atop it
  • 31. Thanks!! S Credits – Apache, Cloudera, Hortonworks, MapR, Facebook, Flurry & HBaseCon S @sawjd22 S www.linkedin.com/in/sawjd/ S Q & A

Editor's Notes

  1. How their architecture makes use of modern hardware especially disk drives. B+ Trees work well until there are too many modifications, because they force you to perform costly optimizations to retain that advantage for a limited amount of time. The more and faster you add data at random locations, the faster the pages become fragmented again. Eventually, you may take in data at a higher rate than the optimization process takes to rewrite the existing files. The updates and deletes are done at disk seek rates, rather than disk transfer rates.
  2. LSM-trees work at disk transfer rates and scale much better to handle large amounts of data. They also guarantee a very consistent insert rate, as they transform random writes into sequential writes using the logfile plus in-memory store. The reads are independent from the writes, so you also get no contention between these two operations.