Introduction to Hadoop

Object Partners Inc.

Demo by:
Presented by:
Nick Adelman
Joel Crabb

Agenda

Ø Terminology
Ø Why does Hadoop Exist?
Ø HDFS and HBase
Ø Examples
Ø Getting Started
Ø Demo

Terminology

Ø Hadoop
  – Core set of technologies hosted by the Apache Software Foundation
    for storing and searching data sets in the terabyte and petabyte
    range
Ø HDFS
  – Hadoop Distributed File System, used as the basis for all Hadoop
    technologies
Ø HBase
  – Distributed map-based database that uses HDFS as its underlying
    data store
Ø Map Reduce
  – A framework for programming distributed parallel processing
    algorithms

Terminology

Ø Distributed Computing
  – A computing paradigm that parallelizes computations over multiple
    compute nodes in order to decrease overall processing time
Ø NoSQL
  – Programming paradigm that does not use a relational database as
    the back-end data store
Ø Big Data
  – Generic term used when working with large data sets
Ø Name Node
  – Server that knows the location of all files in the cluster

Enterprise Architecture 101

[Diagram with labeled components: HDFS, Map Reduce, Data, HBase, RDBMS]

The New System Constraint

Ø Hard disk seek time is the new constraint when working with a
  petabyte data set
  – Spread the seek time among multiple servers
  – Isolate the data to a single read per disk
  – It is faster to read too much data sequentially from disk and
    discard the excess

Ø Working under this paradigm requires new tools
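
The trade-off between seeking and sequential reading can be made concrete with rough arithmetic. The sketch below is illustrative only: the 10 ms seek time and 100 MB/s transfer rate are assumed round numbers, not figures from this presentation, and the function names are hypothetical.

```python
# Assumed round numbers for a commodity hard disk (not from the slides).
SEEK_MS = 10.0             # time to position the head for one random read
TRANSFER_MB_PER_S = 100.0  # sustained sequential read rate

PETABYTE = 10**15
GIGABYTE = 10**9


def sequential_scan_seconds(total_bytes, disks=1):
    """Seconds to read total_bytes sequentially, split evenly across disks."""
    bytes_per_disk = total_bytes / disks
    return bytes_per_disk / (TRANSFER_MB_PER_S * 1_000_000)


def random_read_seconds(reads):
    """Seconds spent only on seeking for a number of scattered reads on one disk."""
    return reads * SEEK_MS / 1000.0


# One disk versus a thousand disks scanning the same petabyte.
print("1 disk, 1 PB:     ", sequential_scan_seconds(PETABYTE) / 86400, "days")
print("1000 disks, 1 PB: ", sequential_scan_seconds(PETABYTE, disks=1000) / 3600, "hours")

# Reading a whole gigabyte and discarding most of it versus 10,000 seeks.
print("scan 1 GB:        ", sequential_scan_seconds(GIGABYTE), "seconds")
print("10,000 seeks:     ", random_read_seconds(10_000), "seconds")
```

Under these assumptions, spreading the scan over many disks divides the elapsed time by the number of disks, and a full sequential pass over a gigabyte finishes sooner than ten thousand individual seeks.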

New Tools: Why does Hadoop exist?

Ø In the early 2000s Google had problems:

Ø Problem 1: Store terabytes and petabytes of data
  – Inexpensive, Reliable, Accessible
  Ø Answer: distributed file system

Ø Problem 2: Distributed Computing is Hard
  Ø Answer: make distributed computing easier

Ø Problem 3: Datasets too large for RDBMS
  Ø Answer: make a new way to store application data

Google’s Solution: Tool 1

Ø Google File System (GFS)
  – A file system specifically built to manage large files and
    support distributed computing
Ø Inexpensive:
  – Store files distributed across a cluster of cheap servers
Ø Reliable:
  – Plan for server failure: if you have 1000 servers, one will
    fail every day
  – Always maintain three copies of each file (configurable)
Ø Accessible:
  – File chunk size is 64 MB = fewer file handles to manage
  – Master table keeps track of the location of each file copy

Problem 1: Store Tera and Petabytes of data
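
The ideas on this slide (64 MB chunks, three copies, a master table of locations) can be sketched in a few lines. This is a toy model, not GFS code: the round-robin placement policy, the server names, and the function names are all assumptions made for illustration, and copies are tracked per chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as on the slide
REPLICAS = 3                   # three copies by default (configurable)


def place_chunks(file_size, servers, chunk_size=CHUNK_SIZE, replicas=REPLICAS):
    """Return a master table mapping chunk index -> servers holding a copy.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    chunk_count = (file_size + chunk_size - 1) // chunk_size  # round up
    return {
        chunk: [servers[(chunk + r) % len(servers)] for r in range(replicas)]
        for chunk in range(chunk_count)
    }


def chunks_to_recopy(master_table, failed_server):
    """Chunks that lost one copy when failed_server went down."""
    return [chunk for chunk, holders in master_table.items()
            if failed_server in holders]


# A 200 MB file on four hypothetical servers needs four chunks.
table = place_chunks(200 * 1024 * 1024, ["s1", "s2", "s3", "s4"])
for chunk, holders in table.items():
    print(chunk, holders)
print("re-copy after s4 fails:", chunks_to_recopy(table, "s4"))
```

Because every chunk lives on three different servers, losing one server leaves two copies of each affected chunk from which a replacement copy can be made.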

Ø Map Reduce – abstracts away the hard parts of distributed computing

Ø Programmers no longer need to manage:
  – Where is the data?
  – What piece of data am I working on?
  – How do I move data and result sets?
  – How do I combine results?
Ø Leverages the GFS
  – Send processing to the data
  – Multiple file copies mean a higher chance of using more nodes
    for each process
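
"Send processing to the data" can be illustrated with a small scheduling sketch. It is an assumption-laden toy, not how any real scheduler is implemented: the greedy policy, the node and chunk names, and the `assign_tasks` function are all hypothetical.

```python
def assign_tasks(chunk_locations, idle_nodes):
    """Greedily assign each chunk to an idle node, preferring one with a copy.

    chunk_locations maps a chunk name to the nodes that store a copy of it.
    Returns chunk -> (node, "local" or "remote"); chunks that find no idle
    node are left out and would have to wait.
    """
    free = set(idle_nodes)
    plan = {}
    for chunk, holders in chunk_locations.items():
        local = [node for node in holders if node in free]
        if local:
            node, kind = local[0], "local"        # no data has to move
        elif free:
            node, kind = sorted(free)[0], "remote"  # data must be shipped
        else:
            continue
        free.remove(node)
        plan[chunk] = (node, kind)
    return plan


locations = {
    "c0": ["n1", "n2", "n3"],
    "c1": ["n1", "n2", "n4"],
    "c2": ["n1"],  # only one copy, so far fewer chances to run locally
}
print(assign_tasks(locations, ["n1", "n2", "n3", "n4"]))
```

In this example the chunks with three copies both find a node that already holds their data, while the chunk with a single copy has to be processed remotely once its only holder is busy.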


Tool 2: Map Reduce

Ø Distributed parallel processing framework
Ø Map - done N times on N servers
  – Perform an operation (search) on a chunk (GBs) of data
Ø Search 100 GB
  – Process Map on 25 servers with 4 GB of memory each
  – 100 GB processed in parallel, in memory (25 × 4 GB)
  – Create Maps storing results (key-value pairs)
Ø Reduce
  – Take Maps from N nodes
  – Merge (reduce) maps to a single sorted map (result set)
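
The map and reduce steps can be sketched on a single machine with a word count. This is a minimal illustration of the pattern, not the Hadoop API: the function names are hypothetical, and a thread pool stands in for the N servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map: count the words in one chunk and return a key-value map."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_maps(maps):
    """Reduce: merge the per-chunk maps into a single map sorted by key."""
    total = Counter()
    for partial in maps:
        total.update(partial)
    return dict(sorted(total.items()))


def map_reduce(chunks):
    """Run the map step once per chunk, then reduce the partial results."""
    # Threads stand in for servers here; on a cluster each map call would
    # run on a node that stores its chunk.
    with ThreadPoolExecutor() as pool:
        partial_maps = list(pool.map(map_chunk, chunks))
    return reduce_maps(partial_maps)


chunks = [
    ["Hadoop stores data", "data moves"],
    ["hadoop hadoop"],
]
print(map_reduce(chunks))
```

Each map call sees only its own chunk, so the calls are independent and can run anywhere; only the small per-chunk maps have to travel to the reduce step.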


Ø Bigtable: new paradigm in storing large data sets
  – “a sparse, distributed, persistent multi-dimensional sorted
    map”*
    *Bigtable: A Distributed Storage System for Structured Data
Ø Sparse: Few entries in the map are populated
Ø Distributed: Data spread across multiple logical machines in
  multiple copies
Ø Multi-dimensional: Maps within maps organize and store data
Ø Sorted: Sorted by lexicographic keys
  – Lexicographic = alphabetical, with digits compared as characters

Problem 3: Data sets too large for RDBMS
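
Lexicographic ordering has one practical consequence worth seeing: numbers inside keys sort as text, not by value. The sketch below shows the effect and one common workaround (zero-padding); the key format and helper names are assumptions chosen for illustration.

```python
def lexicographic(keys):
    """Sort keys character by character, as a byte-ordered map would."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))


def padded_key(prefix, number, width=6):
    """Zero-pad the numeric part so text order matches numeric order."""
    return f"{prefix}{number:0{width}d}"


# "row10" sorts before "row9" because "1" comes before "9".
print(lexicographic(["row10", "row9", "row1"]))

# With padding, the same rows come back in numeric order.
print(lexicographic([padded_key("row", n) for n in (10, 9, 1)]))
```

Designing keys so that their text order matches the order in which the data will be read is a large part of structuring data for a sorted map.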

Google’s Architecture

[Diagram, top to bottom: Map Reduce / Direct Access / Map Reduce; Bigtable; GFS]

Hadoop – If Something Works…

Ø Hadoop was started to recreate these technologies in the Open
  Source community

GFS → HDFS
Bigtable → HBase
Map Reduce → Map Reduce

A Little More on HDFS

Ø Plan for Failure
  – In a thousand-node cluster, machines will fail often
  – HDFS is built to detect failure and redistribute files
Ø Fast Data Access
  – Generally a batch processing system
Ø Large Files – typically GB to TB files
Ø Simple Coherency
  – Once a file is closed, it cannot be updated or appended
Ø Cloud Ready
  – Can be set up on Amazon EC2 / S3

Summarized from: http://hadoop.apache.org/common/docs/current/hdfs_design.html

A Little More on HBase

Ø Multi-dimensional Map
  Map<byte[], Map<byte[], Map<byte[], Map<Long, byte[]>>>>

Ø First Map: Row Key to Column Family
Ø Second Map: Column Family to Column Label
Ø Third Map: Column Label to Timestamp
Ø Fourth Map: Timestamp to Value

A Column Family is a grouping of columns of the same data type.
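
The four nested maps can be mimicked with plain dictionaries to show how a cell is addressed. This is a toy in-memory stand-in, not the HBase client API: the `TinyTable` class, its method names, and the sample row, family, and label values are all hypothetical.

```python
class TinyTable:
    """Toy model of the four nested maps; not the HBase API."""

    def __init__(self):
        # row key -> column family -> column label -> timestamp -> value
        self.rows = {}

    def put(self, row, family, label, timestamp, value):
        """Store one versioned cell, creating the inner maps on demand."""
        labels = self.rows.setdefault(row, {}).setdefault(family, {})
        labels.setdefault(label, {})[timestamp] = value

    def get(self, row, family, label, timestamp=None):
        """Return the value at timestamp, or the newest version if omitted."""
        versions = self.rows[row][family][label]
        if timestamp is None:
            timestamp = max(versions)
        return versions[timestamp]

    def row_keys(self):
        """Row keys in sorted byte order, as a scan would visit them."""
        return sorted(self.rows)


table = TinyTable()
table.put(b"row2", b"info", b"name", 1, b"old")
table.put(b"row2", b"info", b"name", 2, b"new")
table.put(b"row1", b"info", b"name", 1, b"x")

print(table.get(b"row2", b"info", b"name"))               # newest version
print(table.get(b"row2", b"info", b"name", timestamp=1))  # older version
print(table.row_keys())
```

Keeping the timestamp as the innermost key is what lets several versions of the same cell coexist, and a row that has no value for a label simply has no entry, which is where the sparseness comes from.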

HBase Storage Model

HBase Access

Ø REST interface
  – http://wiki.apache.org/hadoop/Hbase/Stargate
Ø Groovy
  – http://wiki.apache.org/hadoop/Hbase/Groovy
Ø Scala
  – http://wiki.apache.org/hadoop/Hbase/Scala

Industry Examples

Ø Web/File Search (Yahoo!)
  – Yahoo! is the main sponsor of and contributor to Hadoop
  – Has over 25,000 servers running Hadoop
Ø Log aggregation (Amazon, Facebook, Baidu)
Ø RDBMS replacement (Google Analytics)
Ø Image store (Google Earth)
Ø Email store (Gmail)
Ø Natural Language Search (Microsoft)
Ø Many more…

* Information from http://wiki.apache.org/hadoop/PoweredBy

Use Case #1: Yahoo! Search

Ø Problem circa 2006

Ø Yahoo! search is seen as inferior to Google’s
Ø Google is better at:
  – Storing terabytes and petabytes of unstructured data
  – Searching the data set efficiently
  – Applying custom analytics to the data set
  – Presenting a more relevant result set

Use Case #1: Yahoo! Search

Ø Solution – Emulate Google with Hadoop’s HDFS, Pig and Map Reduce
  – HDFS
    • Stores petabytes of web page data distributed over a
      cluster of compute nodes (1000s)
    • Runs on commodity hardware
    • Average server – 2 × 4 cores, 4–32 GB RAM *
  – Pig (Hadoop sub-project)
    • Analytics processing platform
  – Map Reduce
    • Build indexes from raw web data

* http://wiki.apache.org/hadoop/PoweredBy

Use Case #2: RDBMS Replacement

Ø Google Analytics circa 2006
Ø Problem
  – Store terabytes of analytics data about website usage
  – GBs of data added per hour
  – Data added in small increments
  – Access and display data in < 3 seconds per request

Use Case #2: RDBMS Replacement

Ø Solution – Bigtable, Map Reduce on GFS
Ø Bigtable sits on top of GFS and takes in data in small increments
Ø In 2006, the Google Analytics cluster supported ~220 TB*
Ø Raw Click Table (200 TB)
  – Rows keyed by WebsiteName + Session Time
  – All data for a website is stored contiguously on disk
Ø Summary Table (20 TB)
  – Map Reduce of Raw Click Table for customer web views

Pattern: Collect data in one Bigtable instance, then Map Reduce to a
View Bigtable instance

*Bigtable: A Distributed Storage System for Structured Data
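
Keying rows by website name plus session time is what keeps one site's data together in a sorted table. The sketch below shows the effect with an ordinary sorted list; the `#` separator, the zero-padded time format, the sample site names, and the helper functions are assumptions made for illustration, not the actual Google Analytics schema.

```python
import bisect


def row_key(website, session_time):
    """Build a key that sorts first by website, then by session time."""
    # Zero-padding keeps the text order of times equal to their numeric order.
    return f"{website}#{session_time:010d}"


def scan_prefix(sorted_keys, prefix):
    """Return the contiguous slice of keys that start with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


visits = [("b.example", 30), ("a.example", 200), ("a.example", 5), ("b.example", 7)]
keys = sorted(row_key(site, when) for site, when in visits)

print(keys)
print(scan_prefix(keys, "a.example#"))
```

Because all keys for one website are adjacent, reading that site's clicks is a single range scan rather than many scattered lookups, which matches the sequential-read preference described earlier in the deck.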

Can You Use Hadoop?

Ø IF…
  – You have a large amount of data (terabytes+)
  – You can separate the data store used for collection from the
    data store used for online access or analytics
  – You can order your data lexicographically
  – You can run analytics as batches
  – You cannot afford a large enough RDBMS
  – You need dynamic column additions
  – You need near-linear performance as the data set grows

Other Hadoop Technologies

Ø Hive – SQL-like query language for using Hadoop as a data
  warehouse
Ø Pig – parallel data analysis framework
Ø ZooKeeper – distributed application coordination framework
Ø Chukwa – data collection system for distributed computing
Ø Avro – data serialization framework

New Skills for IT

Ø Learning to restructure data
Ø Learning to write Map Reduce programs
Ø Learning to maintain a Hadoop cluster
Ø Forgetting RDBMS/SQL-dominated design principles

It takes a new style of creativity to both structure data
in Hadoop and write useful Map Reduce programs.

Getting Started

Ø You can install a test system on a single Unix box
Ø A full system needs a minimum of 3 servers
  – 10 to 20 servers is a small cluster
Ø Expect to spend a day to a week getting a multi-node cluster
  configured
Ø A book like Pro Hadoop, by Jason Venner, may save you time, but it
  is based on the 0.19 Hadoop release (currently at 0.20)

Optional Quickstart

Ø Cloudera has a preconfigured single-node Hadoop instance
  available for download at:
  http://www.cloudera.com/hadoop-training-virtual-mach
Ø Yahoo! has a Hadoop distribution as well at:
  http://developer.yahoo.com/hadoop/distribution/

Alternatives to HBase

Ø Project Voldemort
  – http://project-voldemort.com/
  – Used by LinkedIn
Ø Hypertable
  – http://www.hypertable.org/
  – Used by Baidu (search leader in China)
Ø Cassandra
  – http://cassandra.apache.org/
  – Apache-sponsored distributed database
  – Used by Facebook

Helpful Information

Ø http://hadoop.apache.org
Ø http://hbase.apache.org
Ø http://wiki.apache.org/hadoop/HadoopPresentations
Ø http://labs.google.com/papers/bigtable.html
Ø http://labs.google.com/papers/gfs.html
Ø http://labs.google.com/papers/mapreduce.html
Ø Twitter: @hbase
Ø Two articles on Map Reduce in the 01/2010 Communications of the ACM
