Hadoop Crash Course

Rafael Coss
Community Team
Developer Advocate
@racoss
rafael@hortonworks.com

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Evolve with Data Trends by Using Open Source
• Keep up with rapidly evolving Data Trends
• Differentiate your biz with Insights from Data
• Speed up your adaptation to constant change by using open source
• A Modern Data Architecture is a Journey

Agenda
Hadoop Use Cases
Traditional Data Architectures
What’s Apache Hadoop?
Data Access with Hadoop
Lab Intro

[Diagram: Hadoop use cases, grouped under RENOVATE and INNOVATE along an Explore → Optimize → Transform path]
Patterns: Active Archive, ETL Onboard, Data Enrichment, Data Discovery, Single View, Predictive Analytics
Use cases: Payment Tracking, Due Diligence, Social Mapping, Product Design, M & A, Call Analysis, Machine Data, Defect Detecting, Factory Yields, Customer Support, Basket Analysis, Segments, Customer Retention, Sentiment Analysis, Optimize Inventories, Supply Chain, Cross-Sell, Vendor Scorecards, Ad Placement, OPEX Reduction, Historical Records, Mainframe Offloads, Device Data Ingest, Rapid Reporting, Digital Protection, Data as a Service, Fraud Prevention, Public Data Capture, Cyber Security, Disaster Mitigation, Investment Planning, Risk Modeling, Proactive Repair, Inventory Predictions, Next Product Recs
Customers are building Modern Data
Applications to transform their industries,
renovating their IT architectures and
innovating their Data in Motion and Data at
Rest platforms to power actionable
intelligence.

Must do to make modern data applications possible

Powerful means to
optimize current business

Pathway to transform for
strategic advantage and
new revenue streams


INTERNET OF ANYTHING
The Future of Data is about actionable intelligence derived from all your data coming from the Internet of Anything

What are all the dimensions of all your data?
Structured ---------------- Unstructured
At Rest --------------------- In-motion
KPI ------------------- Data Exhaust
Core ------------------ Jagged Edge
Within Your Firewall ----------- External Data
On-prem ----------------- Cloud

The Future of Data: Actionable Intelligence
Connected Data Architecture across your Data Plane is powering Actionable Intelligence
[Diagram: INTERNET OF ANYTHING feeding DATA IN MOTION and DATA AT REST (storage, groups 1 to 4)]
• Any and all data from sensors, machines, geolocation, clicks, files, social.
• Secure point-to-point and bi-directional data flows
• Collect and curate all data.

New Data Paradigm Opens Up New Opportunity
• 2.8 zettabytes in 2012
• 44 zettabytes in 2020
1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research
NEW: Clickstream, Web & social, Geolocation, Internet of Things, Server logs, Files, emails
TRADITIONAL: ERP, CRM, SCM
Opportunity: Transform every industry via full fidelity of data and analytics
[Chart: Ability to Consume Data, LAGGARDS vs LEADERS; the gap is labeled Enterprise Blind Spot]
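The two data-volume figures on this slide imply a growth rate that is easy to work out. The sketch below is plain arithmetic on the quoted numbers; the compound-growth framing is an assumption added for illustration, not something the slide states.

```python
# Figures quoted on the slide: global data volume in zettabytes (ZB).
ZB_2012 = 2.8
ZB_2020 = 44.0
PB_PER_ZB = 1_000_000  # 1 ZB = 1 million PB, as stated on the slide

years = 2020 - 2012
growth_factor = ZB_2020 / ZB_2012
# Assumption: growth compounds evenly year over year.
cagr = growth_factor ** (1 / years) - 1

print(f"2020 volume: {ZB_2020 * PB_PER_ZB:,.0f} PB")
print(f"growth factor: {growth_factor:.1f}x over {years} years")
print(f"implied compound annual growth: {cagr:.1%}")
```

Under that assumption the volume grows roughly 16-fold in eight years, or a little over 40% per year.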

What disrupted the data center?
?
Data?

Traditional Architecture and its gaps

Observation
Interaction
Intelligence

Systems of Intelligence
Systems of Engagements
Systems of Interactions
Data Systems
Systems of Record
Events
In
Gray
Actionable
Intelligence
Operators, Developers
Products
Analytics
In
Green
Systems of Insight

Modern Data Applications Data Scope
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Capacity constrained down sampling of available information
• Carefully cleanse all information before any analysis
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Whole population analytics connects the dots
• Analyze information as is & cleanse as needed

[Diagram: traditional data architecture]
Data sources: RDBMS, Sales, POS, CRM, ERP, ECM, NoSQL, File Server, App Server, Message Bus
Unstructured: Clickstream Logs, Web & Social Logs, Audio, Video, Logs, Geolocation, JSON, Documents
Processing and storage: Filter, ETL, EDW, Data Marts, Archive
Analytics: Business Analytics, Visualization & Dashboards, Statistics, OLAP

Traditional Data Architecture Challenges with Big Data
• Too expensive and slow as data growth keeps accelerating
• Too slow to get the data prepared for analytics
• Analytics is only leveraging a limited data set
• Cold data becomes archived and is no longer usable for analytics
• Data ingest is rigid and slow for new IoAT data types
• Limited real time insights


Hadoop Architecture
• Applications
• Data Access Engines: Batch, Interactive, Real-time
• Data Operating System: Resource Management, Data Locality
• Distributed Compute Framework
• Distributed Reliable Storage
• Governance & Integration
• Security
• Deploy Anywhere

Hadoop Ecosystem
[Diagram: project logos labeled by role; the logos did not survive extraction]
Roles shown: Distributed Storage & Processing Framework, ETL, RDBMS Import/Export, NoSQL DB, Secure NoSQL DB, SQL, SQL on HBase, Workflow Management, Streaming Data Ingestion, Cluster System Operations, Secure Gateway, Distributed Registry, Search & Indexing, Even Faster Data Processing, Data Management, Machine Learning

Open Enterprise Hadoop Capabilities
Hortonworks Data Platform
GOVERNANCE & INTEGRATION
• Data Lifecycle & Governance: Falcon, Atlas
• Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
• Batch: MapReduce
• Script: Pig
• Search: Solr
• SQL: Hive
• NoSQL: HBase, Accumulo, Phoenix
• Stream: Storm
• In-memory: Spark
• Others: ISV Engines
• Tez, Slider
DATA MANAGEMENT
• YARN: Data Operating System
• HDFS: Hadoop Distributed File System (nodes 1 to N)
SECURITY
• Administration, Authentication, Authorization, Auditing, Data Protection
• Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
• Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
• Scheduling: Oozie
Deployment Choice: Linux, Windows, On-Premise, Cloud

HORTONWORKS DATA PLATFORM: DATA MGMT
Ongoing Innovation in Apache
Hadoop Core: HDFS, YARN, MapReduce
HDP 2.0 (Oct 2013): Hadoop 2.2.0
HDP 2.1 (April 2014): Hadoop 2.4.0
HDP 2.2 (Dec 2014): Hadoop 2.6.0
What is Apache Hadoop?
Yahoo!
2006
Hortonworks
Oct 2011
Yahoo! starts to focus on multiple Hadoop apps & clusters
Contributes Hadoop to Apache
2008
HDP 1.0
Oct 2012
Apache Hadoop v2 YARN
Google publishes GFS & MapReduce papers
2004-2005
HDP 2.3 (July 2015): Hadoop 2.7.1
HDP 2.4 (March 2016): Hadoop 2.7.1
HDP 2.5 (Aug 2016): Hadoop 2.7.3

Apache Hadoop = Storage + Compute
• Storage: Hadoop Distributed File System (HDFS)
• Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)

HDFS + YARN
• HDFS: /directory/structure/in/memory.txt
• YARN: Resource management + scheduling (Disk, CPU, Memory)
• Core: NameNode (NN) for HDFS, ResourceManager (RM) for YARN
• Worker Node: DataNode for HDFS, NodeManager for YARN
Legend: Hadoop daemon, User application

Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not Just storage but computation
[Diagram: a logical file is split into blocks 1 to 4; each block is stored three times on different nodes of the cluster]
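The block-and-replica idea above can be sketched in a few lines. This is a toy model, not HDFS code: the node names are made up, the 128 MB block size is only a common default, and purely random placement follows the slide's wording, whereas a real cluster also takes rack topology into account.

```python
import math
import random

BLOCK_SIZE = 128 * 1024 * 1024  # assumption: 128 MB blocks; configurable per cluster


def place_blocks(file_size, nodes, replication=3, block_size=BLOCK_SIZE, seed=0):
    """Return {block_index: [node, ...]} with `replication` distinct nodes per block."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    rng = random.Random(seed)
    n_blocks = math.ceil(file_size / block_size)
    # rng.sample picks distinct nodes, so no node holds two copies of a block.
    return {i: rng.sample(nodes, replication) for i in range(n_blocks)}


def readable_after_failure(placement, failed_node):
    """True if every block still has a replica on some surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = [f"datanode{i}" for i in range(1, 6)]  # hypothetical 5-node cluster
placement = place_blocks(500 * 1024 * 1024, nodes)  # a 500 MB file
for block, replicas in placement.items():
    print(block, replicas)
print(readable_after_failure(placement, "datanode1"))
```

Because each block lives on three distinct nodes, losing any single node leaves the file fully readable, which is the fault-tolerance claim on the slide.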

HDFS Storage Architecture - Before
• DataNode is a single storage
• Storage is uniform: the only storage type is Disk
• Storage types are hidden from the file system
• All disks are treated as a single storage

HDFS Storage Architecture - Now
New Architecture
• DataNode is a collection of storages
• Supports different types of storage
– Disk, SSDs, Memory
• Block Storage Policies
– Describe how to store data blocks in HDFS
• Collection of tiered storage
• Cloud Storage
It Looks Like a File System

Batch Processing in Hadoop
MapReduce: Batch Access to Data
Original data access mechanism for Hadoop
• Framework
Made for developing distributed applications to process vast amounts of data in parallel on large clusters
• Proven
Reliable interface to Hadoop which works from GB to PB. But it is batch oriented: speed is not its strong point.
• Ecosystem
Ported to Hadoop 2 to run on YARN. Supports original investments in Hadoop by customers and partner ecosystem.
MapReduce Job Lifecycle
[Diagram: Map Phase (a Mapper on each of DataNode1, DataNode2, DataNode3) → Shuffle/Sort (data is shuffled across the network & sorted) → Reduce Phase (a Reducer on each of DataNode1, DataNode2, DataNode3)]
Saying that MapReduce is dead is preposterous
- Would limit us to only new workloads
- ALL Hadoop clusters use MapReduce
- Proven at Enterprise Scale
YARN: Data Operating System (Batch, Interactive, Real-Time)

What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large # of records
• Extract something of interest from
each record
Shuffle
• Sort Intermediate results
Reduce
• Aggregate, summarize, filter or
transform intermediate results
• Generate final output
[Diagram: input data → Map processes (Read & ETL) → Shuffle & Sort → Reduce processes (Aggregation) → output data]
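The three phases above can be mimicked on a single machine with a word count, the customary first MapReduce example. This is a sketch of the programming model in plain Python, not the Hadoop API; the function names and the sample records are invented for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: extract something of interest from each record, here (word, 1) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: aggregate the intermediate values for one key."""
    return key, sum(values)


def run_job(records):
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


records = ["the truck stopped", "the truck started", "the driver stopped"]
counts = run_job(records)
print(counts)
```

On a cluster the map calls run in parallel next to the data blocks and the shuffle moves data across the network, but the contract between the phases is the same as in this sketch.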

1st Gen Hadoop: Cost Effective Batch at Scale
HADOOP 1.0: Built for Web-Scale Batch Apps
Silos created for distinct use cases
[Diagram: separate single-app silos: BATCH on HDFS (three times), INTERACTIVE, ONLINE]

Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large
volumes of high velocity and variety of data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by large
web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
✓ Manages new data paradigm
✓ Handles data at scale
✓ Cost effective
✓ Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce

YARN extends Hadoop into data center leaders
YARN
The Architectural
Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools
(ex. SAS, Syncsort, Actian, etc.)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via
ecosystem of new and existing “YARN Ready” solutions
YARN: Data Operating System
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV Engines), with Tez and Slider
DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N

What do iOS 6 and Windows 3.1 have in common?

Hadoop Beyond Batch with YARN
A shift from the old to the new…
HADOOP 1: Single Use System (Batch Apps), MapReduce as the Base
• MapReduce (cluster resource management & data processing)
• On top of MapReduce: Data Flow (Pig), SQL (Hive), Others (API, Engine, and System)
• HDFS (redundant, reliable storage), nodes 1 to N
HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …), Apache YARN as a Base
• YARN (Data Operating System: resource management, etc.)
• On top of YARN: Batch (MapReduce), Tez, Data Flow (Pig), SQL (Hive), Other ISV
• Layers: System, Engine, APIs
• HDFS, nodes 1 to N

Hadoop Workload Evolution
A shift from the old to the new…
HADOOP 1: Single Use System (Batch Apps)
• MapReduce on HDFS, nodes 1 to N
HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
• YARN on HDFS, running MR2, Multiple (Script, SQL, NoSQL, …) and Others (ISV Engines)
HADOOP.Next: Multi Use Platform (Data & Beyond)
• YARN on HDFS, running DATA ACCESS APPS: MR2, Multiple (Script, SQL, NoSQL, …), Others (ISV Engines)
• Docker containers alongside: MySQL, Tomcat, Other

What’s new in Apache Hadoop 3.0?
Storage Optimization
• HDFS: Erasure codes
Improved Utilization
• YARN: Long Running Services
• YARN: Schedule Enhancements
Additional Workloads
• YARN: Docker & Isolation
Easier to Use
• New User Interface
Refactor Base
• Lots of Trunk content
• JDK8 and newer dependent libraries
3.0 Release Timeline
- 3.0.0-alpha1 - Sep/3/2016
- Alpha2 - Jan/25/2017
- Alpha3 - Q2 2017 (Estimated)
- Beta/GA - Q3/Q4 2017 (Estimated)
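The storage saving from erasure codes comes down to simple arithmetic. The sketch below compares 3-way replication with a Reed-Solomon layout of 6 data units plus 3 parity units; that particular layout is an assumption chosen for illustration, since the slide does not name a scheme.

```python
def storage_overhead(data_units, parity_units):
    """Extra raw storage needed per unit of user data."""
    return parity_units / data_units


def raw_bytes(user_bytes, data_units, parity_units):
    """Total raw storage consumed for a given amount of user data."""
    return user_bytes * (data_units + parity_units) / data_units


# 3-way replication: 1 original + 2 extra copies, survives the loss of 2 copies.
replication = storage_overhead(1, 2)
# Assumed RS(6,3) layout: 6 data + 3 parity units, survives the loss of any 3 units.
rs_6_3 = storage_overhead(6, 3)

one_tb = 1024 ** 4
print(f"replication overhead: {replication:.0%}, raw for 1 TB: {raw_bytes(one_tb, 1, 2) / one_tb} TB")
print(f"RS(6,3) overhead:     {rs_6_3:.0%}, raw for 1 TB: {raw_bytes(one_tb, 6, 3) / one_tb} TB")
```

The trade-off is that reading erasure-coded data after a failure requires reconstruction work, so it tends to suit colder data.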

Gartner: What is Hadoop?
• Common Apache Projects
– ALL = 7 (6)
– Except for 1 = 3 (5)
– Except for 2 = 4 (4)
→ About 14 Common Projects
• Uncommon Projects
– Only 1 = 9 (1)
– Only 2 = 7 (2)
– Only 3 = 6 (3)
→ About 22 Uncommon Projects
http://blogs.gartner.com/merv-adrian/2015/07/02/now-what-is-hadoop/
ODPi
Hortonworks.com/tutorials

HORTONWORKS DATA PLATFORM: Ongoing Innovation in Apache
[Table: Apache component versions shipped in each HDP release; the per-component version numbers are not recoverable from the extracted text]
Categories: DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY
Components: Hadoop & YARN, Flume, Oozie, Pig, Hive, Tez, Sqoop, Ambari, Slider, Kafka, Knox, Solr, Zookeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm, Phoenix, Zeppelin, Druid
Releases: HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), HDP 2.3 (Oct 2015), HDP 2.4 (Mar 2016), HDP 2.5 (Aug 2016), HDP 2.6* (1H2017)
Hadoop versions: 2.2.0 (HDP 2.0), 2.4.0 (HDP 2.1), 2.6.0 (HDP 2.2), 2.7.1 (HDP 2.3 and 2.4), 2.7.3 (HDP 2.5 and 2.6)
* HDP 2.6 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.
** Spark 1.6.3+ Spark 2.1 – HDP 2.6 supports both Spark 1.6.3 and Spark 2.1 as GA
*** Hive 2.1 is GA within HDP 2.6.

Next Generation Data Vendors Investment for the Enterprise
Vertical Integration with YARN and HDFS: Ensure engines can run reliably and respectfully in a YARN based cluster
Horizontal Integration for Enterprise Services: Ensure consistent enterprise services are applied across the Hadoop stack
• GOVERNANCE: Load data and manage according to policy
• SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• OPERATIONS: Deploy and effectively manage the platform
– Provision, Manage & Monitor: Ambari, Zookeeper
– Scheduling: Oozie
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java / Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines), with Tez and Slider
Cluster Resource Management on HDFS (Hadoop Distributed File System), nodes 1 to N

What do distributions do?
• Define a stack of components
– Rich and latest set of Apache Projects (open source & open community) without lock in
• Vertical and Horizontal integration of components
– Vertical: Best Speed and Scale
– Horizontal: Open Enterprise Ready
• Provision and Upgrade stack
– Robust, Easy and Anywhere
• Accelerate time to value (ease of use)
– New Face of Hadoop with UIs from Ambari, Ambari Views, Ranger, Falcon, Atlas
• Partner Ecosystem
– Rich and Deep
• Support
– Industry’s best, SmartSense and influence community

How Do You Operate a Hadoop Cluster?
Apache™ Ambari is a platform
to provision, manage and
monitor Hadoop clusters

Ambari Core Features and Extensibility
Ambari provides core services for operations and development, and extension points for both
How?
• Install & Configure: core features Install Wizard & Web; extensibility via Stacks, Blueprints & REST APIs
• Operate, Manage & Administer: core features Web, Operator Views, Metrics & Alerts; extensibility via Views Framework & REST APIs
• Develop: core feature User Views; extensibility via Views Framework
• Optimize & Tune: core feature User Views; extensibility via Views Framework
Personas: Cluster Admin, Developer, Data Architect
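Since the slide lists REST APIs among Ambari's extension points, a small client sketch may help. It assumes the Ambari server listens on port 8080 and serves `/api/v1/clusters` with HTTP basic auth; the host name and the admin/admin credentials below are placeholders, and the response layout (`items[].Clusters.cluster_name`) should be checked against your Ambari version.

```python
import base64
import json
import urllib.request


def clusters_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for the list of clusters."""
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


def list_clusters(request, timeout=10):
    """Send the request and return the cluster names from the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return [item["Clusters"]["cluster_name"] for item in body.get("items", [])]


# Hypothetical host and credentials; replace with your own before sending.
req = clusters_request("sandbox.example", "admin", "admin")
print(req.full_url)
# names = list_clusters(req)  # uncomment when an Ambari server is reachable
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running cluster.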

New user interface enables fast &
easy SQL definition and execution.

Data Worker and DevOps Tooling
Ambari Views
• A pluggable way to provide a common user experience across multiple user personas.
• Single point of entry for all users.
• Open Community
[Diagram: System Admin/operators, Data Workers and Application Developers reach HDP through AMBARI]

New User Views for DevOps
Capacity Scheduler View
Browse and manage YARN queues
Tez View
View information related to Tez jobs that
are executing on the cluster

New User Views for Development
Pig View
Author and execute Pig Scripts.
Hive View
Author, execute and debug Hive
queries.
Files View
Browse HDFS file system.

Apache Zeppelin
• Web-based notebook for data engineers, data
analysts and data scientists
• Brings interactive data ingestion, data
exploration, visualization, sharing and
collaboration features to Hadoop and
Spark
• Modern data science studio
• Scala with Spark
• Python with Spark
• SparkSQL
• Apache Hive, and more.

Access patterns enabled by YARN
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
[Diagram: applications with these access patterns sharing HDFS (Hadoop Distributed File System), nodes 1 to N]

Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop
• Quickly find value in raw data files
• Proven at petabyte scale
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy,
Business Objects, etc…
[Diagram: SQL queries over Sensor, Mobile, Weblog, and Operational / MPP data]

Hive Architecture
1. User issues SQL query (Web UI, JDBC / ODBC, or CLI, through HiveServer2)
2. Hive parses and plans query (Compiler, Optimizer, Executor; Hive MetaStore on MySQL, PostgreSQL, Oracle)
3. Query converted to MapReduce and executed on Hadoop (MapReduce, Tez or Spark job; data-local processing)

Hive and the Stinger Initiative
Performance Optimizations: 100x+ faster time to insight, deeper analytical capabilities
• Base Optimizations: Generate simplified DAGs, In-memory Hash Joins
• Vector Query Engine: Optimized for modern processor architectures
• Query Planner: Intelligent Cost-Based Optimizer
• Tez: Express tasks more simply, Eliminate disk writes, Pre-warmed Containers
• ORCFile: Column Store, High Compression, Predicate / Filter Pushdowns
• YARN: Next-gen Hadoop data processing framework

Stinger.next and Sub-Second SQL
Emergence of LLAP brings Sub-Second SQL response times within reach with Hive.
Columns: STINGER (delivered, final) | PROGRESS (delivered) | STINGER NEXT (in development)
DELIVERY: HDP 2.1, version 0.13 | HDP 2.3, version 1.2.1 | Future
SPEED: Batch & Interactive | Batch & Interactive | Batch, Interactive & Sub-Second
SQL: SQL:2003+ | SQL:2011 subset | Comprehensive SQL:2011 based analytics
UPDATES: Read-only | SQL INSERT/UPDATE/DELETE | Complete ACID support including MERGE
ENGINES: MR, Tez | MR, Tez | MR, Tez, LLAP
Also shown: Tiered Data Storage, Stinger.next Phase 3, YARN: Containerized Applications

Apache Hive: Journey to SQL:2011 Analytics
Data Types
• Numeric: FLOAT/DOUBLE, DECIMAL, INT/TINYINT/SMALLINT/BIGINT, BOOLEAN
• String: CHAR / VARCHAR, STRING, BINARY
• Date, Time: DATE, TIMESTAMP, Interval Types
• Complex Types: ARRAY, MAP, STRUCT, UNION
SQL Features
• Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
• Advanced Analytics: OLAP and Windowing Functions; CUBE and Grouping Sets
• Nested Data Analytics: Nested Data Traversal; Lateral Views
• ACID Transactions: INSERT / UPDATE / DELETE
File Formats
• Columnar: ORCFile, Parquet
• Text: CSV, Logfile
• Nested / Complex: Avro, JSON, XML
• Custom Formats
• Other Features: XPath Analytics
Latest Additions…
• Scalable Cross Product; Primary Key / Foreign Key; Non-Equijoin
• Tech Preview: Proc. Extensions (PL/SQL)
• Future: ACID MERGE; Multi Subquery; Comparison to sub-select; INTERSECT and EXCEPT
Legend: Existing, Future, New with Hive 2.0 (color coding not recoverable)

Apache Hive: Modern Architecture
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache (Linux Cache), Vector Cache
• Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark); LLAP (Persistent Server)
Legend: Historical, Current, In-development (color coding not recoverable)

Why is Tez Important?
Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading…
• More efficient processing than MapReduce
• Reduces operations and complexity of back-end processing
• Allows Map-Reduce-Reduce pipelines, which saves hard disk operations
• Implements a “service” which is always on, decreasing start times of jobs
• Allows caching of data in memory
[Diagram: Applications (Scripting: Pig; SQL: Hive; Dev: Cascading/Scalding) run on Tez, on YARN, on HDFS (Hadoop Distributed File System)]

Apache Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: Hive – MapReduce runs the query as a chain of map (M) and reduce (R) jobs that write to HDFS between jobs; Hive – Tez runs it as a single DAG of M and R tasks. Stages shown: SELECT a.state, c.itemId; SELECT c.price; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state with COUNT(*) and AVG(c.price)]
• Tez avoids unneeded writes to HDFS
• Tez allows Reducer-only jobs within the DAG

Sub-second Queries in Hive: LLAP (Live Long and Process)
• Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two
• Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible, work is scheduled on a node with the data cached; if not, the work runs on another node
• Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security
• Working on interface to allow other data engines to read securely in parallel
• Beta in 2.0

Hive 2 with LLAP: Architecture Overview
[Diagram: ODBC / JDBC SQL Queries → HiveServer2 (Query Endpoint) → Query Coordinators → LLAP Daemons on the YARN Cluster, each with Query Executors and an In-Memory Cache → Deep Storage (HDFS, S3 + Other HDFS Compatible Filesystems)]

Hive With LLAP Execution Options
[Diagram: three ways to run the same task graph: Tez Only, LLAP + Tez, LLAP only]

Scripting Data Pipeline & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDFs to help adoption
• There are a large number of existing shops using Pig

Why use Pig?
• Maybe we want to join two datasets, from different sources, on a
common value, and want to filter, and sort, and get top 5 sites
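The pipeline described above (join, filter, sort, top 5) is short in Pig Latin and would be much longer as hand-written MapReduce. As a rough picture of what the data flow computes, here is the same logic on two tiny in-memory datasets; the data, the age filter and the function name are all invented for illustration.

```python
from collections import Counter

# Hypothetical dataset 1: (user name, age)
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 24)]
# Hypothetical dataset 2 from another source: (user name, site visited)
visits = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("dave", "a.com"),
]


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    # Filter: keep only users in the age range of interest.
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # Join on the common value (user name) and keep the site.
    joined = [site for name, site in visits if name in eligible]
    # Group, count, sort by count, and keep the top n.
    return Counter(joined).most_common(n)


print(top_sites(users, visits))
```

Each step in the function corresponds to one Pig relational operator, which is why such scripts stay close to the way the problem is stated.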

Apache Spark enthusiasm
• Elegant Developer APIs: DataFrames, Machine Learning, and SQL
• Made for Data Science: All apps need to get predictive at scale and fine granularity
• Democratize Machine Learning: Spark is doing to ML on Hadoop what Hive did for SQL on Hadoop
• Community: Broad developer, customer and partner interest
• Realize Value of Data Operating System: A key tool in the Hadoop toolbox
[Diagram: Applications in Scala, Java and Python, with libraries MLlib (Machine learning), Spark SQL* and Spark Streaming*, on the Spark Core Engine, over Resource Management and Storage]

Apache Spark & Apache Hadoop Perfect Together
General Purpose Data Access Engine
for fast, large-scale data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python and R
Built-in Libraries
Enable data workers to rapidly iterate over data for:
ETL, Machine Learning, SQL and Stream processing
[Diagram: APIs (Scala, Java, Python, R) and libraries (Spark SQL, Spark Streaming, MLlib, GraphX) on the Spark Core Engine, running on YARN over HDFS, nodes 1 to N]

Apache Projects Enable Access Patterns
Various open source projects have incubated in order to meet these access pattern needs
Today, they can all run on a single cluster on a single set of data because of YARN
All powered by a broad open community
Applications:
• Batch: MapReduce
• Interactive: Solr, Spark, Hive, Pig
• Real-Time: HBase, Accumulo, Storm
• Also shown: Kafka
All on HDFS (Hadoop Distributed File System), nodes 1 to N

Evolve with Data Trends by Using Open Source
• Keep up with rapidly evolving Data Trends
• Differentiate your biz with Insights from Data
• Speed up your adaptation to constant change by using open source
• A Modern Data Architecture is a Journey

HDP 2.5 Sandbox
• Provides a free, preconfigured HDP
– Runs in a Virtual Machine, Docker or Azure
– Hortonworks.com/sandbox
• Easy to Use
– Operations: Ambari
– Dev and DevOps: Ambari User Views
– Web Notebook: Zeppelin
• Works with 60+ free tutorials
– Hortonworks.com/tutorials

Sandbox Start
• Import Sandbox Image
• Takes about 4 min
• Splash Screen
– Virtual Box Sample Splash Screen
– Save the host IP

Data Discovery Lab
• Elefante Wine Company has a fleet of over 100 trucks.
• The geolocation data collected from the trucks contains events generated while the truck drivers are driving.
• The company’s goal with Hadoop is to mitigate risk:
– Understand correlations between miles driven and events
– Compute the risk factor for each driver based on mileage & events
• Lab Env: Sandbox 2.5
• Lab Doc: http://tinyurl.com/hello-hdp
• Lab steps: Load Data, Query Data, Process Data

Elefante Wine Current Challenges
The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine
in a highly-regulated industry with stringent transportation requirements.
The Situation
Recently a number of driver violations led to fines and increased insurance rates
The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization

Elefante Wine Company has a large fleet of trucks in the USA
A truck generates millions of events for a given route; an event could be:
• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
Company uses an application that monitors truck locations and violations from the truck/driver in real-time to calculate risk
Route? Truck? Driver?
Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers

Elefante Wine Risk and Driver Safety Challenges
Trucks outfitted with new sensors generating large
volumes of new data:
• Location
• Speed
• Driver Violations
Need to integrate real-time & historical data
Increase safety and reduce liabilities
Anticipate driver violations BEFORE they
happen and take precautionary actions
Find predictive correlations in driver behavior over
large volumes of real-time data
Difficult to deliver timely insights to the right
people and systems to take action
Data Discovery
Uncover new
findings
Predictive Analytics
Identify your next best
action
Better Understanding
of the Past
Better Prediction
of the Future

What’s our goal?
• Solution:
– Collect additional data via sensors in trucks to better understand Risk Factors
• How:
– Quickly store new sensor data in a common repository
– Prepare the data for analysis
– Explore the data
– Calculate Risk
– Generate a report

[Diagram: loading the lab data]
geolocation.csv → LOAD → temporary table (csv) → SQL → Geolocation (ORC)
trucks.csv → LOAD → temporary table (csv) → SQL → Trucks (ORC)
Temporary Tables: Created, Loaded, Deleted
Trucking Risk Analysis – Hadoop ELT
[Diagram: Geolocation (ORC) and Trucks (ORC) are transformed with SQL into Truck_mileage (ORC), Avg_mileage (ORC), DriverMileage (ORC) and Events (ORC); a Risk Calculation in PIG or Spark produces RiskFactor (ORC)]
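To make the risk calculation step concrete, here is a toy version in plain Python. The formula (violation events per 1,000 miles driven) is an assumption chosen for illustration and may differ from the one used in the lab document; the driver ids, events and mileage figures are made up.

```python
from collections import defaultdict

# Hypothetical geolocation events: (driver id, event type)
events = [
    ("A1", "normal"), ("A1", "speeding"), ("A1", "unsafe tail distance"),
    ("B2", "normal"), ("B2", "normal"), ("B2", "speeding"),
]
# Hypothetical total miles driven per driver
mileage = {"A1": 10_000, "B2": 40_000}


def risk_factors(events, mileage, per_miles=1_000):
    """Assumed metric: violation events per `per_miles` miles driven."""
    violations = defaultdict(int)
    for driver, event in events:
        if event != "normal":  # count only violation events
            violations[driver] += 1
    return {
        driver: violations[driver] * per_miles / miles
        for driver, miles in mileage.items()
        if miles > 0
    }


risk = risk_factors(events, mileage)
for driver, factor in sorted(risk.items(), key=lambda kv: kv[1], reverse=True):
    print(driver, factor)
```

In the lab the same aggregation runs over the Geolocation and DriverMileage tables with Pig or Spark rather than over Python lists.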

Connected Data Architecture with HDC for AWS Market Place
CLOUD: Cloud Data Processing (HDC for AWS)
Ideal Use Cases:
• Data Science and Exploration (Spark, Zeppelin)
• ETL and Data Preparation (Hive, Spark)
• Analytics and Reporting (Hive2 w/LLAP, Zeppelin)

Big Data Tutorials
• Get Started
– hortonworks.com/tutorials
– Apache Hadoop & Ecosystem: hor.tn/hello-hdp
– Apache Spark: hor.tn/spark-zep-intro
– Apache NiFi: hor.tn/nifi-intro
– Use Case: IoT, Social Media

developer.hortonworks.com

Hortonworks Nourishes the Community
HORTONWORKS COMMUNITY CONNECTION
HORTONWORKS PARTNERWORKS
https://community.hortonworks.com

Want to continue the technical Introduction?
• Hadoop Summit Crash Courses
– Replays
– Free
• hadoopsummit.org/san-jose/agenda
– Apache Hadoop
– Apache Spark
– Apache NiFi
– IoT & Streaming
– Data Science

Thank you!
rafael@hortonworks.com
@racoss

Protecting the Elephant in the Castle…..
Kerberos,
Wire Encryption
HDFS Encryption
Apache Ranger
Network Segmentation,
Firewalls
LDAP/AD
Apache Knox
