The Hadoop Ecosystem

J Singh, DataThinks.org

March 12, 2012

• Introduction
– What Hadoop is, and what it’s not
– Origins and History
– Hello Hadoop
• The Hadoop Bestiary
• The Hadoop Providers
• Hosted Hadoop Frameworks

© J Singh, 2011 2

What Hadoop is, and what it’s not
• A Framework for Map Reduce

• A Top-level Apache Project

• Hadoop is
– A framework, not a “solution”
• Think Linux or J2EE
– Scalable
– Great for pipelining massive amounts of data to achieve the end result
– Sometimes the only option

• Hadoop is not
– A painless replacement for SQL
– Uniformly fast or efficient
– Great for ad hoc analysis


You are ready for Hadoop when…
• You no longer get enthused by the prospect of more data
– Rate of data accumulation is increasing
– The idea of moving data from hither to yon is positively scary
– A hit man threatens to delete your data in the middle of the night
• And you want to pay him to do it

• Seriously, you are ready for Hadoop when analysis is the bottleneck
– Could be because of data size
– Could be because of the complexity of the data
– Could be because of the level of analysis required
– Could be because the analysis requirements are fluid


MapReduce Conceptual Underpinnings
• Based on Functional Programming model
– From Lisp
• (map square '(1 2 3 4)) → (1 4 9 16)
• (reduce plus '(1 4 9 16)) → 30
– From APL
• +/ N, where N ← 1 2 3 4

• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
– Hundreds and thousands of low-end servers are running at the same time
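
The Lisp and APL one-liners above can be sketched in Python with the built-in `map` and `functools.reduce`. This is only an illustration of the functional model; the helper name `square` mirrors the slide, and nothing here is Hadoop-specific.

```python
from functools import reduce
from operator import add


def square(x):
    """Return x squared; stands in for the Lisp `square` on the slide."""
    return x * x


numbers = [1, 2, 3, 4]

# map applies the function to each element independently, which is why
# the work can be split across machines one element (or chunk) at a time.
squares = list(map(square, numbers))

# reduce folds the mapped values into a single result.
total = reduce(add, squares)

# The APL expression +/ N sums the vector N directly.
vector_sum = reduce(add, numbers)

print(squares, total, vector_sum)
```

Because each call to `square` depends on one element only, a failed piece of the map step can be retried on another machine without touching the rest.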


MapReduce Flow

Word Count Example

• Lines
– foo bar
– quux foo
– foo labs
– quux
• MapOut
– foo 1, bar 1
– quux 1, foo 1
– foo 1, labs 1
– quux 1
• Result
– foo 3
– labs 1
– quux 2
– bar 1
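
The word count flow can be simulated in a few lines of Python. This is a single-process sketch, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and the input is the four lines read from the slide (foo bar, quux foo, foo labs, quux).

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


lines = ["foo bar", "quux foo", "foo labs", "quux"]

# Map: every line is processed independently of the others.
map_out = [pair for line in lines for pair in map_phase(line)]

# Shuffle + reduce: one reduce call per distinct word.
result = dict(reduce_phase(k, v) for k, v in shuffle(map_out).items())

print(map_out)
print(result)
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network; the logic per record stays this small.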


Hello Hadoop
• Word Count
– Example with Unstructured Data
– Load 5 books from Gutenberg.org into /tmp/gutenberg
– Load them into HDFS
– Run Hadoop
• Results are put into HDFS
– Copy results into file system

– What could be simpler?

– DIY instructions for Amazon EC2 available on DataThinks.org blog
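
The Hello Hadoop steps map onto three `hadoop` commands, sketched here as a small Python helper that only builds and prints them. The HDFS paths, the local result file and the examples jar name (`hadoop-examples.jar`) are assumptions; the jar name in particular varies by Hadoop release, so check your installation.

```python
def wordcount_commands(local_dir, hdfs_in, hdfs_out, examples_jar, local_result):
    """Return the shell commands for the Hello Hadoop steps, in order."""
    return [
        # Load the downloaded books into HDFS.
        ["hadoop", "fs", "-copyFromLocal", local_dir, hdfs_in],
        # Run the word count job from the Hadoop examples jar.
        ["hadoop", "jar", examples_jar, "wordcount", hdfs_in, hdfs_out],
        # Merge the result files from HDFS into one local file.
        ["hadoop", "fs", "-getmerge", hdfs_out, local_result],
    ]


commands = wordcount_commands(
    local_dir="/tmp/gutenberg",
    hdfs_in="gutenberg",                       # hypothetical HDFS path
    hdfs_out="gutenberg-output",               # hypothetical HDFS path
    examples_jar="hadoop-examples.jar",        # assumed name; varies by release
    local_result="/tmp/gutenberg-output.txt",  # hypothetical local file
)

for command in commands:
    print(" ".join(command))
```

The helper prints the commands rather than running them, so it can be tried on a machine without Hadoop installed.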


• Introduction
• The Hadoop Bestiary
– Core: Hadoop Map Reduce and Hadoop Distributed File System
– Data Access: HBase, Pig, Hive
– Algorithms: Mahout
– Data Import: Flume, Sqoop and Nutch


The Core: Hadoop and HDFS
• Hadoop
– One master, n slaves
– Master
• Schedules mappers & reducers
• Connects pipeline stages
• Handles failure semantics

• Hadoop Distributed File System
– Robust Data Storage across machines, insulating against failure
– Keeps n copies of each file
• Configurable number of copies
• Distributes copies across racks and locations


Hadoop Bestiary (p1a): HBase, Pig
• Database Primitives
– HBase
• Wide column data structure built on HDFS

• Processing
– Pig
• A high(-ish) level data-flow language and execution framework for parallel computation
• Accesses HDFS and HBase
• Batch as well as Interactive
• Integrates UDFs written in Java, Python, JavaScript
• Compiles to map & reduce functions, though not 100% efficiently


In Pig (Latin)

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
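
For readers who do not know Pig Latin, the same pipeline can be sketched in plain Python on in-memory data. The sample `users` and `pages` records are hypothetical, and the join shortcut (a set of names) assumes user names are unique; Pig itself would run these steps as distributed jobs.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("amy", 22), ("bob", 17), ("cat", 25), ("dan", 40)]
pages = [
    ("amy", "a.com"), ("amy", "b.com"), ("cat", "a.com"),
    ("bob", "a.com"), ("dan", "c.com"), ("cat", "c.com"), ("amy", "a.com"),
]

# Filtered = filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# Joined = join Filtered by name, Pages by user
joined = [(user, url) for user, url in pages if user in filtered]

# Grouped / Summed = group the joined rows by url and count them
clicks = Counter(url for _, url in joined)

# Sorted / Top5 = order by clicks descending, then keep the first five
top5 = clicks.most_common(5)

print(top5)
```

Each Python statement corresponds to one Pig relation, which is the appeal of the data-flow style: the script reads as a sequence of named intermediate results.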

Pig Translation into Map Reduce

• Job 1
– Load Users: Users = load …
– Filter by age: Fltrd = filter …
– Load Pages: Pages = load …
– Join on name: Joined = join …
• Job 2
– Group on url: Grouped = group …
– Count clicks: Summed = … count()…
• Job 3
– Order by clicks: Sorted = order …
– Take top 5: Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hadoop Bestiary (p1b): HBase, Hive
• Database Primitives
– HBase
• Wide column data structure built on HDFS

• Processing
– Hive
• Data Warehouse Infrastructure
• QL, a subset of SQL that supports primitives supportable by Map Reduce
• Support for custom mappers and reducers for more sophisticated analysis
• Compiles to map & reduce functions, though not 100% efficiently

Hive Example
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
:: ::
STORED AS SEQUENCEFILE;


Hadoop Bestiary (p2): Mahout
• Algorithms
– Mahout
• Scalable machine learning and data mining
• Runs on top of Hadoop
• Written in Java
• In active development; algorithms being added

• Examples
– Clustering Algorithms
• Canopy Clustering
• K-Means Clustering
• …
– Recommenders / Collaborative Filtering Algorithms
– Other
• Regression Algorithms
• Neural Networks
• Hidden Markov Models


Hadoop Bestiary (p3): Data Import
• Data Import Mechanisms
– Sqoop: Structured Data
– Flume: Streams

• Data Import
– Sqoop
• Import from RDBMS to HDFS
• Export too
– Flume
• Import streams: text files, system logs
– Nutch
• Import from Web
• Note: Nutch is built on Lucene, and Hadoop began as part of Nutch


Hadoop Bestiary (p4): Complete Picture


• Introduction
• The Hadoop Providers
– Apache
– Cloudera
– Options when your data lives in a Database


Apache Distribution
• The Definitive Repository
– The hub for Code, Documentation, Tutorials

– Many contributors, for example
• Pig was a Yahoo! Contribution
• Hive came from Facebook
• Sqoop came from Cloudera

• Bare metal install option:
– Download to your machine(s) from Apache
– Install and Operate
• Modify to fit your business better


Cloudera
• Cloudera : Hadoop :: Red Hat : Linux

• Cloudera’s Distribution Including Apache Hadoop (CDH)
– A packaged set of Hadoop modules that work together
– Now at CDH3
– Largest contributor of code to Apache Hadoop

• $76M in Venture funding so far


When the data lives in a Database…

• Objective: keeping Analytics and Data as close as possible

• Options for RDBMS
– Sqoop data to/from HDFS
• Need to move the data
• Can utilize all parts of Hadoop
– In-database analytics
• Available for Teradata, Greenplum, etc.
• If you have the need and the $$$

• Options for NoSQL Databases
– Sqoop-like connectors
• Need to move the data
– Built-in Map Reduce available for most NoSQL databases
• Knows about and tuned to the storage mechanism
• But typically only offers map and reduce (no Pig, Hive, …)


• Introduction
• Hadoop Platforms as a Service
– Amazon Elastic MapReduce
– Hadoop in Windows Azure
– Google App Engine
– Other
• Infochimps
• IBM SmartCloud


Amazon Elastic Map Reduce (EMR)
• Hosted Map Reduce
– CLI on your laptop
• Control over size of cluster
• Automatic spin-up/down instances

– Map & Reduce programs on S3
• Pig, Hive or
• Custom in Java, Ruby, Python, Perl, PHP, R, C++, Cascading

– Data In/Out on S3 or
– Data In/Out on DynamoDB

• Keep in mind:
– Hadoop on EC2 is also an option


Hadoop in Windows Azure
• Basic Level
– Hive Add-in for Excel
– Hive ODBC Driver

• Hadoop-based Distribution for Windows Server and Azure
– Strategic Partnership with Hortonworks
– Windows-based CLI on your laptop

• Broadest Level
– JavaScript framework for Hadoop
– Hadoop connectors for SQL Server and Parallel Data Warehouse


Google App Engine MapReduce
• Map Reduce as a Service
– Distinct from Google’s internal Map Reduce
– Part of Google App Engine

• Works with Google Datastore
– A Wide Column Store

• A “purely programmatic” environment
– Write Map and Reduce functions in Python / Java


Take Aways
• There are many flavors of Hadoop.
– The important part is Functional Programming and Map Reduce

– Don’t let the proliferation of choices stump you.

– Experiment with it!


Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups

• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions

