45. Need server proxy access
• Facilitate remote client access
  o Server process to support concurrent clients
• Standards-compliant connectors
  o JDBC, ODBC
• Security, auditing
48. Protecting Hadoop data and services
• Kerberos-based authentication
• POSIX-style file permissions
• Access control for job submission
• Encryption over the wire
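HDFS mirrors the familiar POSIX rwx permission bits for its files and directories. A minimal local sketch of how those bits work, using the local filesystem as a stand-in (this is not the HDFS API):

```python
import os
import stat
import tempfile

# Create a scratch file and give it POSIX-style permissions:
# owner: rw-, group: r--, others: ---
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o640)

# Read back just the permission bits and test individual flags.
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))                  # 0o640
print(bool(mode & stat.S_IRGRP))  # True  (group can read)
print(bool(mode & stat.S_IWGRP))  # False (group cannot write)

os.remove(path)
```

In HDFS the same rwx model is applied per file and directory via `hdfs dfs -chmod`, with the limitations noted on the next slides: it is file-granular only.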
50. Need for authorization
• Secure authorization
  o Enforce policy to control access to data for authenticated users
• Fine-grained authorization
  o Ability to control access to a subset of data
• Role-based authorization
  o Ability to associate privileges with roles
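The role-based model above can be sketched in a few lines: privileges attach to roles, users hold roles, and an access check walks that indirection. This is an illustrative sketch only (the user, role, and database names are hypothetical, and this is not a Sentry API):

```python
# Privileges are (database, action) pairs attached to roles, never
# directly to users.
ROLE_PRIVS = {
    "analyst": {("sales_db", "SELECT")},
    "etl":     {("sales_db", "SELECT"), ("sales_db", "INSERT")},
}

# Users acquire privileges only through the roles they hold.
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}

def is_authorized(user, db, action):
    """True if any of the user's roles carries the requested privilege."""
    return any((db, action) in ROLE_PRIVS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_authorized("alice", "sales_db", "SELECT"))  # True
print(is_authorized("alice", "sales_db", "INSERT"))  # False
```

The indirection through roles is what makes administration scale: granting a new privilege to a role updates every user holding that role at once.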
51. Current state of authorization
• File-based authorization
  o Control at the file level
  o Insufficient for collaboration
  o No fine-grained access control
• Sub-optimal built-in authorization
  o Intended for preventing accidental changes
  o Not for preventing malicious users from hacking in
52. Apache Sentry
• Policy engine for authorization
• Fine-grained, role-based
• Pluggable modules for Hadoop components
  o Works out of the box with Hive
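Sentry's original file-based policy provider expresses the group-to-role and role-to-privilege mappings in an INI file. A minimal sketch (the group, role, database, and table names here are hypothetical examples, not defaults):

```ini
[groups]
# Users in the 'analysts' group (from the OS or LDAP) receive analyst_role
analysts = analyst_role

[roles]
# analyst_role may only SELECT from one table in one database
analyst_role = server=server1->db=sales->table=orders->action=select
```

This shows the fine-grained, role-based model from the previous slides: privileges scope down from server to database to table, and users get them only through group membership.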
65. Impala Architecture
• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP, distributed
[Diagram: SQL apps connect over ODBC; the Hive Metastore, YARN, HDFS NameNode, and State Store provide unified metadata and scheduling; each worker node runs a Query Planner, Query Coordinator, and Query Exec Engine co-located with an HDFS DataNode and HBase]
Editor's Notes
Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
Current Architecture (Build): In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high-speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse (you may also hear about data marts or operational data stores, ODS) that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it (such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating) and then Loaded it into the data warehouse. For short we call these ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
Issues (Build): As data volumes and business complexity grow, ETL and ELT processing is unable to keep up. Critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
Solution (Build): Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high-value cleansed, conformed data. Reclaim valuable EDW capacity for the high-value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%.
Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned.
Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads and the bank's business-critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata.
Solution: The bank deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics.
Results: By offloading data processing and storage onto Cloudera, which runs on industry-standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
This is a very quick overview and glosses over many of the capabilities and features offered by Flume. This describes Flume 1.3, or "Flume NG".
Client executes a Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off a MapReduce job. This table class can be used for processing of extracted records. By default, Sqoop will guess at a column for splitting the data for distribution across the cluster; this can also be specified by the client.
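The splitting step can be sketched simply: given the min and max of the split column, divide the key range into one contiguous sub-range per mapper. This is an illustrative sketch of that idea only, not Sqoop's actual splitter code:

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer key range [lo, hi] into contiguous,
    roughly equal sub-ranges, one per mapper."""
    size = (hi - lo + 1) / num_mappers  # ideal rows per mapper
    splits = []
    start = lo
    for i in range(1, num_mappers + 1):
        # Last mapper always ends at hi so no rows are dropped.
        end = hi if i == num_mappers else lo + round(size * i) - 1
        splits.append((start, end))
        start = end + 1
    return splits

# e.g. primary-key values 1..100 spread across 4 mappers
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each mapper then issues a query bounded by its own range (a WHERE clause on the split column), which is why a skewed or non-uniform split column leads to unbalanced mappers.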
It should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
Most of these tools integrate to existing data stores using the ODBC standard.
MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC based tools should also work, and more integrations will be supported soon.
JDBC/ODBC support: HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. It is also the system I'm familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so it is well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
Each impalad is composed of 3 components: planner, coordinator, and execution engine. The State Store daemon isn't shown here, but it maintains information on the impala daemons running in the system.