Hive in Enterprises

Mark Grover – Software Engineer, Cloudera (@mark_grover)
Prasad Mujumdar – Apache Hive Committer, Software Engineer, Cloudera

November 25th, 2013
What We Will Be Talking About

• Integration of Hive and Hadoop in enterprises
  • Current challenges
  • How is Hadoop being leveraged with existing data infrastructures?
• Other tools and features in and around Hive
  • Authentication and authorization
  • BI tools
  • User interface
What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:
• Scalable
• Fault tolerant
• Distributed

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework

Has the flexibility to store and mine any type of data:
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data and scales economically:
• Scale-out architecture divides workloads across multiple nodes
• Can be deployed on commodity hardware
• Flexible file system eliminates ETL bottlenecks
• Open source platform guards against vendor lock-in
Current Challenges
Limitations of Existing Data Management Systems
The Transforming of Transformation

[Diagram: Enterprise Applications, OLTP, and ODS feed an Extract-Transform-Load pipeline into the Data Warehouse, where further transformation happens before queries serve Business Intelligence.]
Volume, Velocity, Variety Cause Capacity Problems

[Diagram: the same ETL-to-data-warehouse pipeline, with two pain points called out.]
1. Slow data transformations = missed ETL SLAs.
2. Slow queries = frustrated business users.
Economics: Return on Byte

Return on Byte (ROB) = Value of Data / Cost of Storing Data

[Chart contrasting data with high ROB against data with low ROB (but still a ton of aggregate value).]
Data Warehouse Optimization

[Diagram: Enterprise Applications, OLTP, and ODS feed ETL into Hadoop, which stores the data and runs the transformations; the Data Warehouse (high $/byte) focuses on queries, and both it and Hadoop serve Business Intelligence.]
Data Warehouse Optimization (continued)

[Diagram: the same flow, with Hadoop taking on the store and transform steps between the source systems (Enterprise Applications, OLTP, ODS) and the queries that serve Business Intelligence.]
The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
• Prescriptive data modeling
• Create static DB schema
• Transform data into the RDBMS
• Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system
• Good for Known Unknowns (repetition)

Schema-on-Read (Hadoop) – a minimal HiveQL sketch follows:
• Descriptive data modeling
• Copy data in its native format
• Create schema + parser
• Query data in its native format
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it
• Good for Unknown Unknowns (exploration)
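To make the schema-on-read column concrete, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical, not from the talk): an external table is laid over files that already sit in HDFS, and later schema changes are metadata-only.

-- Lay a schema over data already in HDFS (hypothetical path); nothing is moved or converted.
CREATE EXTERNAL TABLE web_logs (
  event_time STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/web_logs';

-- Describing newly arrived fields later is a metadata change, not a reload.
ALTER TABLE web_logs ADD COLUMNS (referrer STRING);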
Not Just Transformation
Other Ways Hadoop is Being Leveraged
Data Archiving Before Hadoop

[Diagram: data ages out of the Data Warehouse to a tape archive.]
Active Archiving with Hadoop

[Diagram: the Data Warehouse archives data to Hadoop instead of tape, where it stays accessible.]
Offloading Analysis

[Diagram: some analysis workloads move from the Data Warehouse to Hadoop; both feed Business Intelligence.]
Exploratory Analysis

[Diagram: developers, analysts, and business users explore data directly in Hadoop alongside the Data Warehouse.]
Use Case: A Major Financial Institution

The Challenge:
• Current EDW at capacity; cannot support growing data depth and width
• Performance issues in business-critical apps; little room for innovation

Data warehouse before: Operational (44%), ELT Processing (42%), Analytics (11%)
Data warehouse after: Operational (50%), Analytics (50%), with Hadoop handling analytics, processing, and storage

The Solution:
• Hadoop offloads data storage (S), processing (T), and some analytics (Q) from the EDW
• EDW resources can now be focused on repeatable operational analytics
• A month of data scanned in 4 seconds vs. 4 hours
Hadoop Integration
The Big Picture

[Diagram: Hadoop at the center, connected to the Data Warehouse/RDBMS, NoSQL stores, and streaming data through data import/export and data integration tools, and serving BI/analytics tools on top.]
Data Import/Export Tools

[Diagram: the same picture, highlighting the import/export paths that move data between Hadoop, the Data Warehouse/RDBMS, and streaming data sources.]
Flume in 2 Minutes
Or, why you shouldn’t be using scripts for data movement.

• Reliable, distributed, and available system for efficient collection, aggregation, and movement of streaming data, e.g. logs
• Open-source Apache project
Flume in 2 Minutes

A Flume agent is a JVM process hosting three kinds of components:
• Source – consumes events from an external source (web server, Twitter, JMS, system logs, …) and forwards them to channels
• Channel – stores events until they are consumed by sinks; can be backed by file, memory, or JDBC
• Sink – removes events from the channel and puts them into an external destination
Sqoop Overview

• Apache project designed to ease import and export of data between Hadoop and relational databases
• Provides functionality to do bulk imports and exports of data with HDFS, Hive, and HBase
• Java based; leverages MapReduce to transfer data in parallel
Sqoop Overview

• Uses a “connector” abstraction
• Two types of connectors:
  • Standard connectors are JDBC based
  • Direct connectors use native database interfaces to improve performance
• Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc.
Sqoop Import Flow

1. Client runs the import
2. Sqoop collects metadata from the database
3. Sqoop generates code and executes a MapReduce job
4. Map tasks pull data from the database in parallel
5. Map tasks write the data to Hadoop

(An example import command follows.)
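As an illustration of that flow, a typical import might look like the following (the connection string, credentials, table, and Hive database names are hypothetical); Sqoop gathers the table’s metadata, generates the transfer code, and launches parallel map tasks to pull and write the rows.

$ sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --username etl_user -P \
    --table orders \
    --num-mappers 4 \
    --hive-import --hive-table sales.orders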
Transformation/Processing

• Standard interface is Java MapReduce
• Higher-level interfaces are commonly used:
  • Apache Hive – provides an SQL-like interface to data in Hadoop
  • Apache Pig – declarative language providing functionality to declare a sequence of transformations
  • Cloudera Impala – real-time SQL query engine on Hadoop
• Both Hive and Pig convert queries into MapReduce jobs and submit them to Hadoop for execution (see the HiveQL sketch below)
• Impala has its own execution engine
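For example, a minimal HiveQL query against the hypothetical web_logs table sketched earlier; Hive compiles it into MapReduce jobs, while Impala would run the same statement on its own engine.

-- Hive turns this into one or more MapReduce jobs; Impala runs it natively.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;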
Orchestration

Schedulers for Hadoop jobs:
• Oozie (see the workflow sketch below)
• Azkaban
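As a rough illustration of what an Oozie definition looks like (the workflow name, script, and schema versions here are assumptions, not taken from the talk), a minimal workflow that runs a single Hive script might be sketched like this:

<workflow-app name="nightly-etl" xmlns="uri:oozie:workflow:0.4">
  <start to="transform"/>
  <action name="transform">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive transform failed</message>
  </kill>
  <end name="end"/>
</workflow-app>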
Data Flow with OSS Tools

[Diagram: web servers produce raw logs, which Flume (etc.) loads into Hadoop; Hadoop processes and transforms the data; Sqoop (etc.) loads results out; Oozie (etc.) orchestrates the whole flow.]
Hadoop Integration
Data Integration Tools

[Diagram: the big-picture view again, highlighting the data integration tools connecting Hadoop with the Data Warehouse/RDBMS, NoSQL stores, streaming data, and BI/analytics tools.]
Pentaho

• Existing BI tools extended to support Hadoop
• Provides data import/export, transformation, job orchestration, reporting, and analysis functionality
• Supports integration with HDFS, Hive, and HBase
• Community and Enterprise Editions offered
Informatica

• Data import/export
• Metadata services
• Data lineage
• Transformation
• …
Hadoop Integration
Business Intelligence/Analytics Tools

[Diagram: the big-picture view again, highlighting the BI/analytics tools that sit on top of Hadoop, the Data Warehouse/RDBMS, and the data integration layer.]
Business Intelligence/Analytics Tools

[Diagram: BI/analytics tools connecting to relational databases, data warehouses, and more.]
ODBC Driver

• Most of these tools use the ODBC standard
• Since Hive is an SQL-like system, it’s a good fit for ODBC
• Several vendors, including Cloudera, make ODBC drivers available for Hadoop
• JDBC is also used by some products for Hive integration

[Diagram: BI/analytics tools talk ODBC to a driver, which sends HiveQL to the Hive Server and on to Hive.]
Hadoop Integration
Next Generation BI/Analytics Tools

New “Hadoop Native” Tools
You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools…

[Diagram: BI/analytics tools and new query engines sitting alongside MapReduce on top of Hadoop.]
Hadoop Native Tools – Advantages

New data analysis tools:
• Designed and optimized for working with Hadoop data and large data sets
• Remove reliance on Hive for accessing data – can work with any data in Hadoop

New query engines:
• Provide the ability to do low-latency queries against Hadoop data
• Make it possible to do ad-hoc, exploratory analysis of data in Hadoop
Datameer

[Two image-only slides showing the Datameer product.]
What the Hive Community Expected

Embedded Hive engine for batch or ad-hoc queries…

[Diagram: Hive – compiler, executor, and metastore – running on top of Hadoop and HDFS.]
What Industry Users Expect…

Integration is the key requirement – need server/proxy access:
• Facilitate remote clients
  o Server process to support concurrent clients
• Standards-compliant connectors
  o JDBC, ODBC
• Security, auditing
HiveServer2: Hive Integration

HiveServer1:
• No support for concurrent queries; requires running multiple HiveServers for multiple users
• No support for security
• The Thrift API in the Hive Server doesn’t support common JDBC/ODBC calls

HiveServer2:
• Adds support for concurrent queries; can support multiple users
• Adds security support with Kerberos
• Better support for JDBC and ODBC (see the Beeline example below)
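For instance, connecting to HiveServer2 from the Beeline CLI over its JDBC interface might look like the following (the host, port, and Kerberos principal are hypothetical):

$ beeline -u "jdbc:hive2://hs2.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
    -e "SHOW TABLES;"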
Protecting Hadoop Data and Services

• Kerberos-based authentication
• POSIX-style file permissions (example below)
• Access control for job submission
• Encryption over the wire
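As a small illustration of the POSIX-style permissions on HDFS (the paths, user, and group here are hypothetical):

# Hand the raw log directory to the etl user and analysts group, then lock out everyone else
$ hadoop fs -chown etl:analysts /data/raw/web_logs
$ hadoop fs -chmod 750 /data/raw/web_logs
$ hadoop fs -ls /data/raw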
Securing Hive Access

• Restrict access to the service
• Supports Kerberos and LDAP authentication
• Encryption over the wire
Need for Authorization

• Secure authorization
  o Enforce policy to control access to data for authenticated users
• Fine-grained authorization
  o Ability to control access to a subset of the data
• Role-based authorization
  o Ability to associate privileges with roles
Current State of Authorization

• File-based authorization
  o Control at the file level
  o Insufficient for collaboration
  o No fine-grained access control
• Sub-optimal built-in authorization
  o Intended for preventing accidental changes
  o Not for stopping malicious users from hacking…
Apache Sentry

• Policy engine for authorization
• Fine-grained, role-based (policy sketch below)
• Pluggable modules for Hadoop components
  o Works out of the box with Hive
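A minimal sketch of what a file-based Sentry policy can look like (the group, role, server, and database names are hypothetical, and the exact syntax varies by Sentry version):

[groups]
# Members of the "analysts" group (from the OS or LDAP) get the analyst_role
analysts = analyst_role

[roles]
# analyst_role may only SELECT from tables in the sales database
analyst_role = server=server1->db=sales->table=*->action=select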
Hue - Hadoop User Experience
Recap

Data Warehouse Optimization (recap)

[Diagram: as before – Enterprise Applications, OLTP, and ODS feed ETL into Hadoop, which stores and transforms the data; the Data Warehouse (high $/byte) and Hadoop both serve queries for Business Intelligence.]
[Diagram: the big-picture integration view again – Hadoop connected to the Data Warehouse/RDBMS, NoSQL stores, and streaming data through import/export and data integration tools, serving BI/analytics tools.]
Questions?

Slides at github.com/markgrover/hive-sjsu

Prasad:
http://www.linkedin.com/pub/prasad-mujumdar/29/147/88b
prasadm@cloudera.com

Mark:
www.linkedin.com/in/grovermark
mgrover@cloudera.com
Flume in 2 Minutes

• Reliable – events are stored in the channel until delivered to the next stage
• Recoverable – events can be persisted to disk and recovered in the event of failure

[Diagram: Flume agent – source → channel → sink → destination.]
Flume in 2 Minutes

• Supports multi-hop flows for more complex processing
• Also fan-out, fan-in

[Diagram: two Flume agents chained – the first agent’s sink feeds the second agent’s source, whose channel and sink deliver to the destination.]
Flume in 2 Minutes

• Declarative – no coding required
• Configuration specifies how components are wired together (see the sketch below)
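For example, a minimal Flume agent that tails a web server log into HDFS can be wired up entirely in a properties file like this (the agent, component, and path names are hypothetical):

agent1.sources  = weblog
agent1.channels = mem
agent1.sinks    = hdfsSink

# Source: tail the access log and hand events to the channel
agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /var/log/httpd/access_log
agent1.sources.weblog.channels = mem

# Channel: buffer events in memory until the sink consumes them
agent1.channels.mem.type = memory
agent1.channels.mem.capacity = 10000

# Sink: write the events out to HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/weblogs
agent1.sinks.hdfsSink.channel = mem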
Flume in 2 Minutes

Similar systems:
• Scribe
• Chukwa
Sqoop Limitations

Sqoop has some limitations, including:
• Poor support for security (work-arounds below):
  $ sqoop import --username scott --password tiger …
  Sqoop can read command-line options from an options file, but this still has holes.
• Error-prone syntax
• Tight coupling to the JDBC model – not a good fit for non-RDBMS systems
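Two partial work-arounds in Sqoop 1, still imperfect as noted above (connection details here are hypothetical): prompt for the password interactively, or keep shared options in a permission-restricted file.

# Prompt for the password instead of exposing it on the command line
$ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders -P

# Or read shared options (including credentials) from a protected options file
$ sqoop import --options-file /home/etl/sqoop-sales.opts --table orders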
Fortunately…

Sqoop 2 (incubating) will address many of these limitations:
• Adds a web-based GUI
• Centralized configuration
• More flexible model
• Improved security model
New Query Engines – Impala

• Fast, interactive queries on data stored in Hadoop (HDFS and HBase)
  • But also designed to support long-running queries
• Uses the familiar Hive Query Language and shares the metastore (example below)
• Tight integration with Hadoop
  • Reads common Hadoop file formats
  • Runs on Hadoop DataNodes
• High performance
  • C++, not Java
  • Runtime code generation
  • Entirely redesigned execution engine bypasses MapReduce
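For a sense of how it is used, a query can be issued from the impala-shell CLI against a specific Impala daemon (the host name and table are hypothetical, reusing the earlier web_logs sketch):

$ impala-shell -i impalad01.example.com \
    -q "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url ORDER BY hits DESC LIMIT 10;"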
Impala Architecture

• Common Hive SQL and interface: SQL applications and ODBC clients connect the same way they do to Hive
• Unified metadata and scheduling: Hive Metastore, YARN, HDFS NameNode, and the Impala State Store
• Fully MPP and distributed: every node runs a query planner, query coordinator, and query execution engine alongside the HDFS DataNode and HBase
More Related Content

What's hot

Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseGwen (Chen) Shapira
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructuredatastack
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeRick van den Bosch
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Hortonworks
 

What's hot (19)

Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 

Viewers also liked

Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadThink Big, a Teradata Company
 
Lessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of DataLessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of DataDataWorks Summit
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesDataWorks Summit
 
HBase Introduction
HBase IntroductionHBase Introduction
HBase IntroductionHanborq Inc.
 
MetaScale Case Study: Hadoop Extends DataStage ETL Capacity
MetaScale Case Study: Hadoop Extends DataStage ETL CapacityMetaScale Case Study: Hadoop Extends DataStage ETL Capacity
MetaScale Case Study: Hadoop Extends DataStage ETL CapacityMetaScale
 
Data Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowData Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowMapR Technologies
 
Scalability using Node.js
Scalability using Node.jsScalability using Node.js
Scalability using Node.jsratankadam
 
Managing Social Content with MongoDB
Managing Social Content with MongoDBManaging Social Content with MongoDB
Managing Social Content with MongoDBMongoDB
 
Date time java 8 (jsr 310)
Date time java 8 (jsr 310)Date time java 8 (jsr 310)
Date time java 8 (jsr 310)Eyal Golan
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014StampedeCon
 
Hadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHortonworks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use caseDavin Abraham
 
HDFS NameNode High Availability
HDFS NameNode High AvailabilityHDFS NameNode High Availability
HDFS NameNode High AvailabilityDataWorks Summit
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High AvailabilityHortonworks
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
 
Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...
Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...
Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
 
Lessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of DataLessons Learned on How to Secure Petabytes of Data
Lessons Learned on How to Secure Petabytes of Data
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
HBase Introduction
HBase IntroductionHBase Introduction
HBase Introduction
 
MetaScale Case Study: Hadoop Extends DataStage ETL Capacity
MetaScale Case Study: Hadoop Extends DataStage ETL CapacityMetaScale Case Study: Hadoop Extends DataStage ETL Capacity
MetaScale Case Study: Hadoop Extends DataStage ETL Capacity
 
Data Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowData Warehouse Evolution Roadshow
Data Warehouse Evolution Roadshow
 
Scalability using Node.js
Scalability using Node.jsScalability using Node.js
Scalability using Node.js
 
Managing Social Content with MongoDB
Managing Social Content with MongoDBManaging Social Content with MongoDB
Managing Social Content with MongoDB
 
Date time java 8 (jsr 310)
Date time java 8 (jsr 310)Date time java 8 (jsr 310)
Date time java 8 (jsr 310)
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
 
Hadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data Processing
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
HDFS NameNode High Availability
HDFS NameNode High AvailabilityHDFS NameNode High Availability
HDFS NameNode High Availability
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 
Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...
Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...
Large Scale Health Telemetry and Analytics with MQTT, Hadoop and Machine Lear...
 

Similar to Hadoop and Hive in Enterprises

Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the OrganizationSeeling Cheung
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Stefan Lipp
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
2013 05 Oracle big_dataapplianceoverview
2013 05 Oracle big_dataapplianceoverview2013 05 Oracle big_dataapplianceoverview
2013 05 Oracle big_dataapplianceoverviewjdijcks
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Cloudera, Inc.
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformHortonworks
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Vantara
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoopmarkgrover
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanJim Kaskade
 
Integrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseIntegrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseDataWorks Summit
 
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the EnterpriseHadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the EnterpriseCloudera, Inc.
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 

Similar to Hadoop and Hive in Enterprises (20)

Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
2013 05 Oracle big_dataapplianceoverview
2013 05 Oracle big_dataapplianceoverview2013 05 Oracle big_dataapplianceoverview
2013 05 Oracle big_dataapplianceoverview
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
 
Hitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop SolutionHitachi Data Systems Hadoop Solution
Hitachi Data Systems Hadoop Solution
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps Ironfan
 
Integrating Hadoop Into the Enterprise
Integrating Hadoop Into the EnterpriseIntegrating Hadoop Into the Enterprise
Integrating Hadoop Into the Enterprise
 
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the EnterpriseHadoop Summit 2012 | Integrating Hadoop Into the Enterprise
Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 

Recently uploaded

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Hadoop and Hive in Enterprises

  • 1. Hive in enterprises Headline Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Mark Grover - Software Engineer, Cloudera (@mark_grover) Speaker Name or Subhead Goes Here Prasad Majumdar – Apache Hive Committer, Software Engineer, Cloudera November 25th, 2013 1 ©2013 Cloudera, Inc. All Rights Reserved.
  • 2. What we will be Talking About • Integration of Hive and Hadoop in enterprises Current challenges • How is Hadoop being leveraged with existing data infrastructures? • • Other tools and features in and around Hive Authentication and Authorization • BI Tools • User Interface • 2 ©2013 Cloudera, Inc. All Rights Reserved.
  • 3. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is…  Scalable  Fault tolerant  Distributed Has the Flexibility to Store and Mine Any Type of Data  Ask questions across structured and unstructured data that were previously impossible to ask or solve  Not bound by a single schema 3 CORE HADOOP SYSTEM COMPONENTS Hadoop Distributed File System (HDFS) MapReduce Self-Healing, High Bandwidth Clustered Storage Excels at Processing Complex Data Distributed Computing Framework Scales Economically  Scale-out architecture divides workloads across multiple nodes  Can be deployed on commodity hardware  Flexible file system eliminates ETL bottlenecks  Open source platform guards against vendor lock ©2013 Cloudera, Inc. All Rights Reserved.
  • 4. Current Challenges Limitations of Existing Data Management Systems 4 ©2013 Cloudera, Inc. All Rights Reserved.
  • 5. The Transforming of Transformation Enterprise Applications OLTP Extract Transform Load Query Data Warehouse Transform ODS 5 ©2013 Cloudera, Inc. All Rights Reserved. Business Intelligence
  • 6. Volume, Velocity, Variety Cause Capacity Problems Enterprise Applications OLTP 6 1 2 1 Slow Data Transformations = Missed ETL SLAs. Slow Queries = Frustrated Business Users. Extract Transform Load 2 1 Query Data Warehouse Transform ©2013 Cloudera, Inc. All Rights Reserved. Business Intelligence
  • 7. Economics: Return on Byte Return on Byte (ROB) = Value of Data Cost of Storing Data High ROB Low ROB (but still a ton of aggregate value) 7 ©2013 Cloudera, Inc. All Rights Reserved.
  • 8. Data Warehouse Optimization Enterprise Applications Data Warehouse Query (High $/Byte) OLTP ETL Hadoop Transform Query ODS 8 Store ©2013 Cloudera, Inc. All Rights Reserved. Business Intelligence
  • 10. The Key Benefit: Agility/Flexibility Schema-on-Read (Hadoop): Schema-on-Write (RDBMS): • Prescriptive Data Modeling: • Descriptive Data Modeling: • Create static DB schema • Copy data in its native format • Transform data into RDBMS • Create schema + parser • Query data in RDBMS format • Query Data in its native format • New columns must be added explicitly before new data can propagate into the system. • New data can start flowing any time and will appear retroactively once the schema/parser properly describes it. • Good for Known Unknowns (Repetition) • Good for Unknown Unknowns (Exploration) 10 ©2013 Cloudera, Inc. All Rights Reserved.
  • 11. Not Just Transformation Other Ways Hadoop is Being Leveraged 11 ©2013 Cloudera, Inc. All Rights Reserved.
  • 12. Data Archiving Before Hadoop Data Warehouse 12 Tape Archive ©2013 Cloudera, Inc. All Rights Reserved.
  • 13. Active Archiving with Hadoop Data Warehouse 13 Hadoop ©2013 Cloudera, Inc. All Rights Reserved.
  • 14. Offloading Analysis Data Warehouse Hadoop 14 ©2013 Cloudera, Inc. All Rights Reserved. Business Intelligence
  • 16. Use Case: A Major Financial Institution The Challenge: • Current EDW at capacity; cannot support growing data depth and width • Performance issues in business critical apps; little room for innovation. DATA WAREHOUSE Operational DATA WAREHOUSE Operational (50%) (44%) Analytics (50%) ELT Processing (42%) Analytics (11%) 16 HADOOP Analytics Processing Storage The Solution: • Hadoop offloads data storage (S), processing (T) & some analytics (Q) from the EDW. • EDW resources can now be focused on repeatable operational analytics. • Month data scan in 4 secs vs. 4 hours ©2013 Cloudera, Inc. All Rights Reserved.
  • 17. Hadoop Integration The Big Picture 17 ©2013 Cloudera, Inc. All Rights Reserved.
  • 18. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL 18 ©2013 Cloudera, Inc. All Rights Reserved.
  • 19. Data Import/Export Tools Data Warehouse /RDBMS Streaming Data 19 Data Import/Export ©2013 Cloudera, Inc. All Rights Reserved.
  • 20. Flume in 2 Minutes Or, why you shouldn’t be using scripts for data movement. Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Open-source, Apache project. • 20 ©2013 Cloudera, Inc. All Rights Reserved.
  • 21. Flume in 2 Minutes. A Flume Agent is a JVM process hosting components. External sources (web server, Twitter, JMS, system logs, …) feed a Source, which consumes events and forwards them to channels. A Channel (file, memory, JDBC) stores events until they are consumed by sinks. A Sink removes events from the channel and puts them into an external destination. 21 ©2013 Cloudera, Inc. All Rights Reserved.
  • 22. Sqoop Overview Apache project designed to ease import and export of data between Hadoop and relational databases. • Provides functionality to do bulk imports and exports of data with HDFS, Hive and HBase. • Java based. Leverages MapReduce to transfer data in parallel. • 22 ©2012 Cloudera, Inc. All Rights Reserved.
  • 23. Sqoop Overview. Uses a "connector" abstraction, with two types of connectors: standard connectors are JDBC based, while direct connectors use native database interfaces to improve performance. Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc. 23 ©2012 Cloudera, Inc. All Rights Reserved.
  • 24. Sqoop Import Flow (diagram): the client runs the import; Sqoop collects metadata from the database, generates code, and executes a MapReduce job; the map tasks pull the data in parallel and write it to Hadoop. 24 ©2012 Cloudera, Inc. All Rights Reserved.
  • 25. Transformation/Processing. The standard interface is Java MapReduce, but higher-level interfaces are commonly used: Apache Hive – provides a SQL-like interface to data in Hadoop; Apache Pig – a data flow language for declaring a sequence of transformations; Cloudera Impala – a real-time SQL query engine on Hadoop. Both Hive and Pig convert queries into MapReduce jobs and submit them to Hadoop for execution; Impala has its own execution engine. 25 ©2013 Cloudera, Inc. All Rights Reserved.
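  As a rough illustration of the Hive path (reusing the hypothetical web_logs table sketched earlier; the summary table is likewise made up), a statement like the following is compiled by Hive into one or more MapReduce jobs:

      -- Aggregate raw page views into a (pre-existing) daily summary table.
      -- Hive plans the GROUP BY as MapReduce and submits the job to the cluster.
      INSERT OVERWRITE TABLE daily_page_views
      SELECT to_date(event_time) AS view_date,
             url,
             COUNT(*)            AS views
      FROM web_logs
      GROUP BY to_date(event_time), url;

  Pig could express the same pipeline as a sequence of load/group/aggregate steps, and Impala (covered later) runs similar SQL without going through MapReduce at all.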
  • 26. Orchestration Schedulers for Hadoop jobs Oozie • Azkaban • 26 ©2013 Cloudera, Inc. All Rights Reserved.
  • 27. Data Flow with OSS Tools (diagram): web servers produce raw logs; Flume (etc.) loads them into Hadoop; Hadoop transforms and processes the data; Sqoop (etc.) moves results out; and Oozie (etc.) orchestrates the whole flow. 27 ©2013 Cloudera, Inc. All Rights Reserved.
  • 28. Hadoop Integration Data Integration Tools 28 ©2013 Cloudera, Inc. All Rights Reserved.
  • 29. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL 29 ©2013 Cloudera, Inc. All Rights Reserved.
  • 30. Data Integration Tools 30 ©2013 Cloudera, Inc. All Rights Reserved.
  • 31. Pentaho Existing BI tools extended to support Hadoop. • Provides data import/export, transformation, job orchestration, reporting, and analysis functionality. • Supports integration with HDFS, Hive and HBase. • Community and Enterprise Editions offered. • 31 ©2012 Cloudera, Inc. All Rights Reserved.
  • 32. Informatica. Informatica provides data import/export, metadata services, data lineage, transformation, and more. 32 ©2013 Cloudera, Inc. All Rights Reserved.
  • 33. Hadoop Integration Business Intelligence/Analytic Tools 33 ©2013 Cloudera, Inc. All Rights Reserved.
  • 34. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL 34 ©2013 Cloudera, Inc. All Rights Reserved.
  • 35. Business Intelligence/Analytics Tools 35 ©2013 Cloudera, Inc. All Rights Reserved.
  • 37. ODBC Driver. Most of these tools use the ODBC standard. Since Hive is an SQL-like system, it's a good fit for ODBC. Several vendors, including Cloudera, make ODBC drivers available for Hadoop. JDBC is also used by some products for Hive integration. 37 ©2013 Cloudera, Inc. All Rights Reserved. (Diagram: BI/Analytics Tools → ODBC Driver → HiveQL → Hive Server → Hive.)
  • 38. Hadoop Integration Next Generation BI/Analytics Tools 38 ©2013 Cloudera, Inc. All Rights Reserved.
  • 39. New "Hadoop Native" Tools. You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools: BI/analytics tools and new query engines sitting alongside MapReduce. 39 ©2013 Cloudera, Inc. All Rights Reserved.
  • 40. Hadoop Native Tools – Advantages. New data analysis tools: designed and optimized for working with Hadoop data and large data sets; remove reliance on Hive for accessing data – can work with any data in Hadoop. New query engines: provide the ability to do low-latency queries against Hadoop data; make it possible to do ad-hoc, exploratory analysis of data in Hadoop. 40 ©2013 Cloudera, Inc. All Rights Reserved.
  • 41. Datameer 41 ©2013 Cloudera, Inc. All Rights Reserved.
  • 42. Datameer 42 ©2013 Cloudera, Inc. All Rights Reserved.
  • 43. What did the Hive community expect? Hive (compiler + executor): an embedded Hive engine for batch or ad-hoc queries, running against Hadoop, the Meta Store, and HDFS.
  • 44. What industry users expect ... Integration is the key requirement
  • 45. Need for server/proxy access: facilitate remote clients (a server process to support concurrent clients); standards-compliant connectors (JDBC, ODBC); security and auditing.
  • 47. Hive Integration. HiveServer1: no support for concurrent queries – running multiple HiveServers is required to serve multiple users; no support for security; the Thrift API in the Hive Server doesn't support common JDBC/ODBC calls. HiveServer2: adds support for concurrent queries and multiple users; adds security support with Kerberos; better support for JDBC and ODBC. 47 ©2013 Cloudera, Inc. All Rights Reserved.
  • 48. Protecting Hadoop data and services: Kerberos-based authentication; POSIX-style file permissions; access control for job submission; encryption over the wire.
  • 49. Securing Hive access: restrict access to the service; supports Kerberos and LDAP authentication; encryption over the wire.
  • 50. Need for authorization: secure authorization – enforce policies that control authenticated users' access to data; fine-grained authorization – the ability to control access to a subset of the data; role-based authorization – the ability to associate privileges with roles.
  • 51. Current state of authorization: file-based authorization – control at the file level, insufficient for collaboration, no fine-grained access control; sub-optimal built-in authorization – intended to prevent accidental changes, not to stop malicious users.
  • 52. Apache Sentry: a policy engine for authorization; fine-grained and role-based; pluggable modules for Hadoop components; works out of the box with Hive.
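  As a hedged sketch of what fine-grained, role-based access looks like in practice: Sentry deployments that manage policy through SQL accept Hive-style GRANT statements from a Sentry-enabled HiveServer2 (early Sentry releases used a policy file instead). The role, group, database, and table names below are made up for illustration:

      -- Create a role and tie it to an OS/LDAP group of analysts.
      CREATE ROLE analyst;
      GRANT ROLE analyst TO GROUP analysts;

      -- Fine-grained: the role gets read-only access to a single table.
      USE sales;
      GRANT SELECT ON TABLE transactions TO ROLE analyst;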
  • 53. Hue - Hadoop User Experience
  • 54. Recap 54 ©2013 Cloudera, Inc. All Rights Reserved.
  • 55. Data Warehouse Optimization (recap of the earlier diagram): Enterprise Applications, OLTP, and the ODS feed ETL into Hadoop for storage and transformation, while the high-$/byte Data Warehouse and Hadoop both serve Business Intelligence queries. 55 ©2013 Cloudera, Inc. All Rights Reserved.
  • 56. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL 56 ©2013 Cloudera, Inc. All Rights Reserved.
  • 58. Flume in 2 Minutes. Reliable – events are stored in the channel until delivered to the next stage. Recoverable – events can be persisted to disk and recovered in the event of failure. (Flume Agent: Source → Channel → Sink → Destination.) 58 ©2013 Cloudera, Inc. All Rights Reserved.
  • 59. Flume in 2 Minutes. Supports multi-hop flows for more complex processing; also fan-out and fan-in. (Diagram: Flume Agent [Source → Channel → Sink] chained into a second Flume Agent [Source → Channel → Sink].) 59 ©2013 Cloudera, Inc. All Rights Reserved.
  • 60. Flume in 2 Minutes. Declarative – no coding required; configuration specifies how components are wired together. 60 ©2013 Cloudera, Inc. All Rights Reserved.
  • 61. Flume in 2 Minutes • Similar systems: Scribe • Chukwa • 61 ©2013 Cloudera, Inc. All Rights Reserved.
  • 62. Sqoop Limitations. Sqoop has some limitations, including: poor support for security – e.g. $ sqoop import --username scott --password tiger … puts credentials on the command line; Sqoop can read command-line options from an options file, but this still has holes. Error-prone syntax. Tight coupling to the JDBC model – not a good fit for non-RDBMS systems. 62 ©2012 Cloudera, Inc. All Rights Reserved.
  • 63. Fortunately… Sqoop 2 (incubating) will address many of these limitations: Adds a web-based GUI. • Centralized configuration. • More flexible model. • Improved security model. • 63 ©2012 Cloudera, Inc. All Rights Reserved.
  • 64. New Query Engines – Impala. Fast, interactive queries on data stored in Hadoop (HDFS and HBase), but also designed to support long-running queries. Uses the familiar Hive Query Language and shares the metastore. Tight integration with Hadoop: reads common Hadoop file formats and runs on Hadoop DataNodes. High performance: C++, not Java; runtime code generation; an entirely re-designed execution engine that bypasses MapReduce. 64 Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
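  Because Impala shares the Hive metastore and HiveQL syntax, a query written for Hive can typically be issued to Impala unchanged; continuing the hypothetical web_logs table used earlier, something like the following would be answered interactively by Impala's own engine rather than by MapReduce:

      -- Top pages for one day; table and columns come from the illustrative web_logs sketch.
      SELECT url, COUNT(*) AS hits
      FROM web_logs
      WHERE event_time LIKE '2013-11-25%'
      GROUP BY url
      ORDER BY hits DESC
      LIMIT 10;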
  • 65. Impala Architecture (diagram): a common Hive SQL and ODBC interface for SQL apps; unified metadata and scheduling via the Hive Metastore, YARN, the HDFS NameNode, and the State Store; fully MPP and distributed – every node runs a Query Planner, Query Coordinator, and Query Exec Engine alongside its HDFS DataNode and HBase. 65

Editor's Notes

  1. Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components, HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
  2. Current Architecture Build: In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse – you may also hear about data marts or operational data stores (ODS) – that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it – such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating – and then Loaded it into the data warehouse. For short we call these ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
  3. Issues Build: As data volumes and business complexity grow, ETL and ELT processing is unable to keep up. Critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
  4. Solution Build: Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high value cleansed, conformed data. Reclaim valuable EDW capacity for the high value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  5. Solution Build: Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high value cleansed, conformed data. Reclaim valuable EDW capacity for the high value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  6. Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
  7. Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%. Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned. Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads and the bank's business critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata. Solution: The bank deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics. Results: By offloading data processing and storage onto Cloudera, which runs on industry standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
  8. This is a very quick overview and glosses over many of the capabilities and functionality offered by Flume. This describes Flume 1.3, or “Flume NG”.
  9. The client executes the Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off the MR job. This table class can be used for processing of extracted records. Sqoop by default will guess at a column for splitting data for distribution across the cluster. This can also be specified by the client.
  10. Should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
  11. Most of these tools integrate to existing data stores using the ODBC standard.
  12. MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC based tools should also work, and more integrations will be supported soon.
  13. JDBC/ODBC support: HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
  14. Solution Build: Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high value cleansed, conformed data. Reclaim valuable EDW capacity for the high value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  15. Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. Also the system that I'm familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
  16. Impalad is composed of 3 components – planner, coordinator, and execution engine. The State Store Daemon isn't shown here, but it maintains information on the Impala daemons running in the system.