SlideShare a Scribd company logo
1 of 44
Apache Hadoop
and HBase
Todd Lipcon
todd@cloudera.com
@tlipcon @cloudera
Nov 2, 2010
Software Engineer at
Hadoop contributor, HBase committer
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
今日は!
Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
Hadoop MapReduce
The Hadoop Ecosystem
Questions
Data is
everywhere.
Data is important.
“I keep saying that the sexy
job in the next 10 years will be
statisticians, and I‟m not
kidding.”
Hal Varian
(Google‟s chief economist)
Are you throwing
away data?
Data comes in many shapes and
sizes: relational tuples, log files,
semistructured textual data (e.g., e-
mail), … .
Are you throwing it away because it
doesn‟t „fit‟?
So, what‟s
Hadoop?
Apache Hadoop is an
open-source system
to reliably store and process
A LOT of information
across many commodity computers.
Two Core
Components
HDFS Map/Reduce
Self-healing
high-bandwidth
clustered storage.
Fault-tolerant
distributed
processing.
Store Process
What makes
Hadoop special?
Hadoop separates
distributed system fault-
tolerance code from
application logic.
Systems
Programmers
Statisticians
Unicorns
Hadoop lets you interact
with a cluster, not a
bunch of machines.
Image:Yahoo! Hadoop cluster [
OSCON ‟07 ]
Hadoop scales linearly
with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image
data
Simple SQL-style queries on >100TB of
clickstream data
Hadoop works for both applications!
A Typical Look...
5-4000 commodity servers
(8-core, 24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
Hadoop sounds like
magic.
How is it possible?
Cluster nodes
NameNode (metadata server and database)
JobTracker (scheduler)
DataNodes
(block storage)
TaskTrackers
(task execution)
Master nodes (1 each)
Slave nodes (1-4000 each)
NameNode
HDFS Data Storage
/logs/weblog.txt
blk_29232
blk_19231
blk_329432
158MB
DN 1
DN 2
DN 3
DN 4
64MB64MB30MB
HDFS Write Path
• HDFS has split the file into
64MB blocks and stored it on
the DataNodes.
• Now, we want to process that
data.
The MapReduce
Programming
Model
You specify map()
and reduce()
functions.
The framework
does the rest.
map()
map: K₁,V₁→list K₂,V₂
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
Key: userimage
Value: 2326 bytes
The map function runs on the same node as the data
was stored!
Input Format
• Wait! HDFS is not a Key-Value store!
• InputFormat interprets bytes as a Key
and Value
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]
"GET /userimage/123 HTTP/1.0" 200 2326
Key: log offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
The Shuffle
Each map output is assigned to a
“reducer” based on its key
map output is grouped and
sorted by key
reduce()
K₂, iter(V₂)→list(K₃,V₃)
Key: userimage
Value: 2326 bytes (from map task 0001)
Value: 1000 bytes (from map task 0008)
Value: 3020 bytes (from map task 0120)
Key: userimage
Value: 6346 bytes
userimage t 6346
TextOutputFormat
Reducer function
Putting it together...
Hadoop is not NoSQL
(sorry!)
Hive project adds SQL
support to Hadoop
HiveQL (SQL dialect)
compiles to a query plan
Query plan executes as
MapReduce jobs
Hive Example
CREATE TABLE movie_rating_data (
userid INT, movieid INT, rating INT, unixtime STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY 't„
STORED AS TEXTFILE;
LOAD DATA INPATH „/datasets/movielens‟ INTO TABLE
movie_rating_data;
CREATE TABLE average_ratings AS
SELECT AVG(rating) FROM movie_rating_data
GROUP BY movieid;
The Hadoop
Ecosystem
(Column DB)
Hadoop in the Wild
(yes, it‟s used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC ‟09)
Facebook: 15TB new data per day;
1200 machines, 21PB in one cluster
Twitter: ~1TB per day, ~80 nodes
Lots of 5-40 node clusters at companies without
petabytes of data (web, retail, finance, telecom,
research)
What about real time
access?
• MapReduce is a batch system
• The fastest MR job takes 24 seconds
• HDFS just stores bytes, and is append-
only
• Not about to serve data for your next
web site.
Apache HBase
HBase is an
open source, distributed,
sorted map
modeled after Google‟s BigTable
HBase is built on
Hadoop
• Hadoop provides:
• Fault tolerance
• Scalability
• Batch processing with MapReduce
HDFS + HBase
= HDFS + random read/write
• HBase uses HDFS for storage
• “Log structured merge trees”
• Similar to “log structured file systems”
• Same storage pattern as Cassandra!
A Big Sorted Map
Row key Column key Timestamp Cell
Row1 info:aaa 1273516197868 valueA
Row1 info:bbb 1273871824184 valueB
Row1 info:bbb 1273871823022 oldValueB
Row1 info:ccc 1273746289103 valueC
Row2 info:hello 1273878447049 i_am_a_value
Row3 info: 1273616297446 another_value
Sorted by
Row key
and Column
Timestamp is a long value
2 Versions
of this cell
HBase API
• get(row)
• put(row, map<column, value>)
• scan(key range, filter)
• increment(row, columns)
• … (checkAndPut, delete, etc…)
• MapReduce/Hive
HBase Architecture
HBase in Numbers
• Largest cluster: 600 nodes, ~600TB
• Most clusters: 5-20 nodes, 100GB-4TB
• Writes: 1-3ms, 1k-10k writes/sec per node
• Reads: 0-3ms cached, 10-30ms disk
• 10-40k reads / second / node from cache
• Cell size: 0-3MB preferred
HBase compared
• Favors Consistency over Availability (but
availability is good in practice!)
• Great Hadoop integration (very efficient
bulk loads, MapReduce analysis)
• Ordered range partitions (not hash)
• Automatically shards/scales (just turn on
more servers)
• Sparse column storage (not key-value)
HBase in Production
• Facebook (product release soon)
• StumbleUpon / su.pr
• Mozilla (receives crash reports)
• … many others
Ok, fine, what next?
Get Hadoop!
Cloudera‟s Distribution for Hadoop
http://cloudera.com/
http://hadoop.apache.org/
Try it out! (Locally, VM, or EC2)
Watch free training videos on
http://cloudera.com/
Available in
Japanese!
Questions?
• todd@cloudera.com
• (feedback? yes!)
• (hiring? yes!)

More Related Content

What's hot

Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPTAnand Pandey
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Databasenehabsairam
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use caseDavin Abraham
 

What's hot (20)

PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop Presentation - PPT
Hadoop Presentation - PPTHadoop Presentation - PPT
Hadoop Presentation - PPT
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Apache hive
Apache hiveApache hive
Apache hive
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Hadoop
HadoopHadoop
Hadoop
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Hive
HiveHive
Hive
 

Viewers also liked

HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the BasicsHBaseCon
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General SessionCloudera, Inc.
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars GeorgeJAX London
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept Miha Ahronovitz
 
A successful Git branching model
A successful Git branching model A successful Git branching model
A successful Git branching model abodeltae
 
The MovieLens Datasets: History and Context
The MovieLens Datasets: History and ContextThe MovieLens Datasets: History and Context
The MovieLens Datasets: History and ContextMax Harper
 
NoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNGDATA
 

Viewers also liked (20)

HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General Session
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hbase hivepig
Hbase hivepigHbase hivepig
Hbase hivepig
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept
 
A successful Git branching model
A successful Git branching model A successful Git branching model
A successful Git branching model
 
The MovieLens Datasets: History and Context
The MovieLens Datasets: History and ContextThe MovieLens Datasets: History and Context
The MovieLens Datasets: History and Context
 
NoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNoSQL with Hadoop and HBase
NoSQL with Hadoop and HBase
 

Similar to Apache Hadoop and HBase

EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopSvetlin Nakov
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveSharjeel Imtiaz
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 

Similar to Apache Hadoop and HBase (20)

EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
HADOOP
HADOOPHADOOP
HADOOP
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 

Recently uploaded (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 

Apache Hadoop and HBase

  • 1. Apache Hadoop and HBase Todd Lipcon todd@cloudera.com @tlipcon @cloudera Nov 2, 2010
  • 2. Software Engineer at Hadoop contributor, HBase committer Previously: systems programming, operations, large scale data analysis I love data and data systems 今日は!
  • 3. Outline Why should you care? (Intro) What is Hadoop? How does it work? Hadoop MapReduce The Hadoop Ecosystem Questions
  • 5.
  • 6.
  • 7. “I keep saying that the sexy job in the next 10 years will be statisticians, and I‟m not kidding.” Hal Varian (Google‟s chief economist)
  • 8. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e- mail), … . Are you throwing it away because it doesn‟t „fit‟?
  • 10. Apache Hadoop is an open-source system to reliably store and process A LOT of information across many commodity computers.
  • 11. Two Core Components HDFS Map/Reduce Self-healing high-bandwidth clustered storage. Fault-tolerant distributed processing. Store Process
  • 13. Hadoop separates distributed system fault- tolerance code from application logic. Systems Programmers Statisticians Unicorns
  • 14. Hadoop lets you interact with a cluster, not a bunch of machines. Image:Yahoo! Hadoop cluster [ OSCON ‟07 ]
  • 15. Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: Extensive machine learning on <100GB of image data Simple SQL-style queries on >100TB of clickstream data Hadoop works for both applications!
  • 16. A Typical Look... 5-4000 commodity servers (8-core, 24GB RAM, 4-12 TB, gig-E) 2-level network architecture 20-40 nodes per rack
  • 17. Hadoop sounds like magic. How is it possible?
  • 18. Cluster nodes NameNode (metadata server and database) JobTracker (scheduler) DataNodes (block storage) TaskTrackers (task execution) Master nodes (1 each) Slave nodes (1-4000 each)
  • 21. • HDFS has split the file into 64MB blocks and stored it on the DataNodes. • Now, we want to process that data.
  • 23. You specify map() and reduce() functions. The framework does the rest.
  • 24. map() map: K₁,V₁→list K₂,V₂ Key: byte offset 193284 Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326” Key: userimage Value: 2326 bytes The map function runs on the same node as the data was stored!
  • 25. Input Format • Wait! HDFS is not a Key-Value store! • InputFormat interprets bytes as a Key and Value 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326 Key: log offset 193284 Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”
  • 26. The Shuffle Each map output is assigned to a “reducer” based on its key map output is grouped and sorted by key
  • 27. reduce() K₂, iter(V₂)→list(K₃,V₃) Key: userimage Value: 2326 bytes (from map task 0001) Value: 1000 bytes (from map task 0008) Value: 3020 bytes (from map task 0120) Key: userimage Value: 6346 bytes userimage t 6346 TextOutputFormat Reducer function
  • 29. Hadoop is not NoSQL (sorry!) Hive project adds SQL support to Hadoop HiveQL (SQL dialect) compiles to a query plan Query plan executes as MapReduce jobs
  • 30. Hive Example CREATE TABLE movie_rating_data ( userid INT, movieid INT, rating INT, unixtime STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't„ STORED AS TEXTFILE; LOAD DATA INPATH „/datasets/movielens‟ INTO TABLE movie_rating_data; CREATE TABLE average_ratings AS SELECT AVG(rating) FROM movie_rating_data GROUP BY movieid;
  • 32. Hadoop in the Wild (yes, it‟s used in production) Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ‟09) Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster Twitter: ~1TB per day, ~80 nodes Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research)
  • 33. What about real time access? • MapReduce is a batch system • The fastest MR job takes 24 seconds • HDFS just stores bytes, and is append- only • Not about to serve data for your next web site.
  • 34. Apache HBase HBase is an open source, distributed, sorted map modeled after Google‟s BigTable
  • 35. HBase is built on Hadoop • Hadoop provides: • Fault tolerance • Scalability • Batch processing with MapReduce
  • 36. HDFS + HBase = HDFS + random read/write • HBase uses HDFS for storage • “Log structured merge trees” • Similar to “log structured file systems” • Same storage pattern as Cassandra!
  • 37. A Big Sorted Map Row key Column key Timestamp Cell Row1 info:aaa 1273516197868 valueA Row1 info:bbb 1273871824184 valueB Row1 info:bbb 1273871823022 oldValueB Row1 info:ccc 1273746289103 valueC Row2 info:hello 1273878447049 i_am_a_value Row3 info: 1273616297446 another_value Sorted by Row key and Column Timestamp is a long value 2 Versions of this cell
  • 38. HBase API • get(row) • put(row, map<column, value>) • scan(key range, filter) • increment(row, columns) • … (checkAndPut, delete, etc…) • MapReduce/Hive
  • 40. HBase in Numbers • Largest cluster: 600 nodes, ~600TB • Most clusters: 5-20 nodes, 100GB-4TB • Writes: 1-3ms, 1k-10k writes/sec per node • Reads: 0-3ms cached, 10-30ms disk • 10-40k reads / second / node from cache • Cell size: 0-3MB preferred
  • 41. HBase compared • Favors Consistency over Availability (but availability is good in practice!) • Great Hadoop integration (very efficient bulk loads, MapReduce analysis) • Ordered range partitions (not hash) • Automatically shards/scales (just turn on more servers) • Sparse column storage (not key-value)
  • 42. HBase in Production • Facebook (product release soon) • StumbleUpon / su.pr • Mozilla (receives crash reports) • … many others
  • 43. Ok, fine, what next? Get Hadoop! Cloudera‟s Distribution for Hadoop http://cloudera.com/ http://hadoop.apache.org/ Try it out! (Locally, VM, or EC2) Watch free training videos on http://cloudera.com/ Available in Japanese!