The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and is projected to reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data, using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
So, where does Hadoop fit in the data center? This picture here is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc.
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have had some challenges with this architecture all along; however, we are now seeing increased pressure to modify and improve this basic blueprint because:
A) this approach creates silos of data, making it difficult to share the data or get a holistic view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult, if not impossible. This limits flexibility and insight.
Finally, the NEW types of data emerging as we digitize the world around us, such as clickstream and machine sensor data, are growing at exponential rates. We are all becoming data-driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
When you distill down all the “new” types of big data that are being managed by Hadoop, they generally fall into six categories, represented as columns on the left side of this slide:
- sentiment & web,
- clickstream,
- machine & sensor,
- geographic data,
- server logs and
- general unstructured content, the stuff we find in docs and PDFs throughout our organization.
Within various verticals, best-practice architectures have emerged to surface the value of Hadoop and HDP. Some representative examples appear here:
Advertisers target ads to their best customer segment and also analyze point-of-sale data to determine the effectiveness of campaigns.
Banks detect fraud and money laundering while also improving customer service.
Hospitals respond to patients in real time and then analyze historical data to reduce readmission rates.
Manufacturers control quality on the production line and then diagnose product defects in the aggregate.
Oil companies predict and repair equipment proactively and also analyze equipment durability under varied circumstances.
Telecoms allocate bandwidth in real time, and later discover unforeseen patterns after analyzing billions of historical call records.
Retailers make sure the shelves are stocked today and also plan their product mix for next year.
**** WHAT IS YOUR USE CASE???
YARN enables the modern data architecture: it turns Hadoop into a truly multi-purpose data platform, with batch, interactive and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster in which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop:
- It provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point for all data processing engines, both from the open source community and from the commercial vendor ecosystem
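To make the “single cluster, many workloads” idea concrete: one common way to carve a YARN cluster into batch, interactive and real-time tenants is the CapacityScheduler. The property names below are standard CapacityScheduler settings, but the queue names and percentages are illustrative assumptions, not values from this deck — a minimal capacity-scheduler.xml sketch:

```xml
<configuration>
  <!-- Three illustrative queues sharing one YARN cluster
       (queue names "batch", "interactive", "realtime" are hypothetical) -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>batch,interactive,realtime</value>
  </property>
  <!-- Guaranteed share of cluster resources per queue, in percent -->
  <property>
    <name>yarn.scheduler.capacity.root.batch.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.interactive.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.realtime.capacity</name>
    <value>20</value>
  </property>
</configuration>
```

Each engine then submits its applications to the appropriate queue, and YARN arbitrates the shared compute and storage underneath.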
Hadoop has evolved over the years to provide not only linearly scalable compute and storage, but also the explicit functions needed to make it a complete data platform. New projects spun up around Hadoop to meet the complex requirements of the modern enterprise.
A good way to look at the evolution of Hadoop is through this picture.
- When Hadoop began it was simply a data management layer (HDFS) and a single data access engine (MapReduce). Over the past several years the range of components in the Hadoop ecosystem has exploded:
- Data Access - The emergence of multiple access engines spanning SQL, NoSQL, Scripting, Streaming and more. YARN ensures that they all can be part of Hadoop seamlessly.
- Security - To address the key requirements of authorization, access, audit/accounting and data protection
- Operations - Tools to manage the platform
- Governance and integration - Tools to load and manage data according to policy
These are all core requirements of any data platform, and over time the Hadoop community has expanded to include all of these capabilities. Why five categories? Because each addresses the requirements of a different persona that engages with a data platform:
Developers (Data Access)
Administrators (Security, Operations)
Data Architects (Governance and Integration)
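To make the original access engine concrete: the MapReduce model that Hadoop began with can be sketched in plain Python. This is a toy illustration of the map, shuffle and reduce phases on a word-count problem — not the Hadoop Java API, and the function names are my own.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # "the" appears once in each of the 3 documents -> 3
```

In Hadoop the map and reduce steps run in parallel across the cluster and the shuffle moves data between nodes, but the dataflow is exactly this.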