More Related Content Similar to Hadoop Crash Course (20) More from DataWorks Summit/Hadoop Summit (20) Hadoop Crash Course4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Payment
Tracking
Due
Diligence
Social
Mapping
Product
Design M & A
Call
Analysis
Machine
Data
Defect
Detecting
Factory
Yields
Customer
Support
Basket
Analysis Segments
Customer
Retention
Sentiment
Analysis
Optimize
Inventories
Supply
Chain
Cross-
Sell
Vendor
Scorecards
Ad
Placement
OPEX
Reduction
Historical
Records
Mainframe
Offloads
Device
Data
Ingest
Rapid
Reporting
Digital
Protection
Data
as a
Service
Fraud
Prevention
Public
Data
Capture
INNOVATE
RENOVATE
EXP LO R E O PT IM IZ E T R A NS FO R M
ACTIVE
ARCHIVE
ETL
ONBOARD
DATA
ENRICHMENT
DATA
DISCOVERY
SINGLE
VIEW
Cyber
Security
Disaster
Mitigation
Investment
Planning
Ad
Placement
Risk
Modeling
Proactive
Repair
Inventory
Predictions
Next
Product Recs
PREDICTIVE
ANALYTICS
Customers are building Modern Data
Applications to transform their industries,
renovating their IT architectures and
innovating their Data in Motion and Data at
Rest platforms to power actionable
intelligence.
5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Payment
Tracking
Due
Diligence
Social
Mapping
Product
Design M & A
Call
Analysis
Machine
Data
Defect
Detecting
Factory
Yields
Customer
Support
Basket
Analysis Segments
Customer
Retention
Sentiment
Analysis
Optimize
Inventories
Supply
Chain
Cross-
Sell
Vendor
Scorecards
Ad
Placement
OPEX
Reduction
Historical
Records
Mainframe
Offloads
Device
Data
Ingest
Rapid
Reporting
Digital
Protection
Data
as a
Service
Fraud
Prevention
Public
Data
Capture
INNOVATE
RENOVATE
EXP LO R E O PT IM IZ E T R A NS FO R M
ACTIVE
ARCHIVE
ETL
ONBOARD
DATA
ENRICHMENT
DATA
DISCOVERY
SINGLE
VIEW
Cyber
Security
Disaster
Mitigation
Investment
Planning
Ad
Placement
Risk
Modeling
Proactive
Repair
Inventory
Predictions
Next
Product Recs
PREDICTIVE
ANALYTICS
6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Payment
Tracking
Due
Diligence
Social
Mapping
Product
Design M & A
Call
Analysis
Machine
Data
Defect
Detecting
Factory
Yields
Customer
Support
Basket
Analysis Segments
Customer
Retention
Sentiment
Analysis
Optimize
Inventories
Supply
Chain
Cross-
Sell
Vendor
Scorecards
Ad
Placement
OPEX
Reduction
Historical
Records
Mainframe
Offloads
Device
Data
Ingest
Rapid
Reporting
Digital
Protection
Data
as a
Service
Fraud
Prevention
Public
Data
Capture
INNOVATE
RENOVATE
EXP LO R E O PT IM IZ E T R A NS FO R M
ACTIVE
ARCHIVE
ETL
ONBOARD
DATA
ENRICHMENT
DATA
DISCOVERY
SINGLE
VIEW
Cyber
Security
Disaster
Mitigation
Investment
Planning
Ad
Placement
Risk
Modeling
Proactive
Repair
Inventory
Predictions
Next
Product Recs
PREDICTIVE
ANALYTICS
Must do to make modern data applications possible
7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Payment
Tracking
Due
Diligence
Social
Mapping
Product
Design M & A
Call
Analysis
Machine
Data
Defect
Detecting
Factory
Yields
Customer
Support
Basket
Analysis Segments
Customer
Retention
Sentiment
Analysis
Optimize
Inventories
Supply
Chain
Cross-
Sell
Vendor
Scorecards
Ad
Placement
OPEX
Reduction
Historical
Records
Mainframe
Offloads
Device
Data
Ingest
Rapid
Reporting
Digital
Protection
Data
as a
Service
Fraud
Prevention
Public
Data
Capture
INNOVATE
RENOVATE
EXP LO R E O PT IM IZ E T R A NS FO R M
ACTIVE
ARCHIVE
ETL
ONBOARD
DATA
ENRICHMENT
DATA
DISCOVERY
SINGLE
VIEW
Cyber
Security
Disaster
Mitigation
Investment
Planning
Ad
Placement
Risk
Modeling
Proactive
Repair
Inventory
Predictions
Next
Product Recs
PREDICTIVE
ANALYTICS
Must do to make modern data applications possible
Powerful means to
optimize current business
8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Payment
Tracking
Due
Diligence
Social
Mapping
Product
Design M & A
Call
Analysis
Machine
Data
Defect
Detecting
Factory
Yields
Customer
Support
Basket
Analysis Segments
Customer
Retention
Sentiment
Analysis
Optimize
Inventories
Supply
Chain
Cross-
Sell
Vendor
Scorecards
Ad
Placement
OPEX
Reduction
Historical
Records
Mainframe
Offloads
Device
Data
Ingest
Rapid
Reporting
Digital
Protection
Data
as a
Service
Fraud
Prevention
Public
Data
Capture
INNOVATE
RENOVATE
EXP LO R E O PT IM IZ E T R A NS FO R M
ACTIVE
ARCHIVE
ETL
ONBOARD
DATA
ENRICHMENT
DATA
DISCOVERY
SINGLE
VIEW
Cyber
Security
Disaster
Mitigation
Investment
Planning
Ad
Placement
Risk
Modeling
Proactive
Repair
Inventory
Predictions
Next
Product Recs
PREDICTIVE
ANALYTICS
Must do to make modern data applications possible
Powerful means to
optimize current business
Pathway to transform for
strategic advantage and
new revenue streams
26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Open Enterprise Hadoop Capabilities
YARN : Data Operating System
DATA ACCESS SECURITY
GOVERNANCE &
INTEGRATION
OPERATIONS
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
N
Data Lifecycle &
Governance
Falcon
Atlas
Administration
Authentication
Authorization
Auditing
Data Protection
Ranger
Knox
Atlas
HDFS EncryptionData Workflow
Sqoop
Flume
Kafka
NFS
WebHDFS
Provisioning,
Managing, &
Monitoring
Ambari
Cloudbreak
Zookeeper
Scheduling
Oozie
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory
Spark
Others
ISV Engines
Tez Tez Slider Slider
DATA MANAGEMENT
Hortonworks Data Platform
Deployment ChoiceLinux Windows On-Premise Cloud
HDFS Hadoop Distributed File System
30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not Just storage but computation
10110100101
00100111001
11111001010
01110100101
00101100100
10101001100
01010010111
01011101011
11011011010
10110100101
01001010101
01011100100
11010111010
0
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
2
2
3
3
34
4
4
34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Batch Processing in Hadoop
MapReduce
Batch Access to Data
Original data access mechanism for Hadoop
• Framework
Made for developing distributed applications to
process vast amounts of data in-parallel on large
clusters
• Proven
Reliable interface to Hadoop which works from GB to
PB. But, batch oriented – Speed is not it’s strong point.
• Ecosystem
Ported to Hadoop 2 to run on YARN. Supports original
investments in Hadoop by customers and partner
ecosystem.
DataNode1
Mapper
Data is shuffled
across the network
& sorted
Map Phase Shuffle/Sort Reduce Phase
MapReduce Job Lifecycle
Saying that MapReduce is dead is
preposterous
- Would limits us to only new workloads
- ALL Hadoop clusters use map reduce
- Proven at Enterprise Scale
DataNode2
Mapper
DataNode3
Mapper
DataNode1
Reducer
DataNode2
Reducer
DataNode3
Reducer
YARN: Data Operating System
Interactive Real-TimeBatch
35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large # of records
• Extract something of interest from
each record
Shuffle
• Sort Intermediate results
Reduce
• Aggregate, summarize, filter or
transform intermediate results
• Generate final output
Map Process
Map Process
Map Process
Map Process
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data Map Process
Reduce
Process
Reduce
Process
Data
Read & ETL
Shuffle &
Sort Aggregation
Data
Data
Data
Data
Data
Data
Data
Data
38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
YARN extends Hadoop into data center leaders
YARN
The Architectural
Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools
(ex. SAS, Syncsort, Actian, etc.)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via
ecosystem of new and existing “YARN Ready” solutions
YARN : Data Operating System
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
N
HDFS Hadoop Distributed File System
DATA MANAGEMENT
Batch
MapReduce
Script
Pig
Search
Solr
SQL
Hive
NoSQL
HBase
Accumulo
Phoenix
Stream
Storm
In-memory
Spark
Others
ISV Engines
Tez Tez Slider Slider
41. 41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hadoop Workload Evolution
Single Use System
Batch Apps
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
A shift from the old to the new… Multi Use Platform
Data & Beyond
HADOOP 1
YARN
HADOOP 2
1 ° ° ° °
° ° ° ° N
HDFS
(redundant, reliable storage)
1 ° ° °
° ° ° N
HDFS
MapReduce
HADOOP.Next
YARN ‘
1 ° ° ° ° ° °
° ° ° ° ° ° N
HDFS
(redundant, reliable storage)
DATA ACCESS APPS
Docker
MySQLMR2 Others
(ISV Engines)
Multiple
(Script, SQL, NoSQL, …)
MR2 Others
(ISV Engines)
Multiple
(Script, SQL, NoSQL, …)
Docker
Tomcat
Docker
Other
43. 43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Gartner: What is Hadoop?
à Common Apache Projects
– ALL = 7 (6)
– Except for 1 = 3 (5)
– Except for 2 = 4 (4)
² About 14 Common Projects
à Uncommon Projects
– Only 1 = 9 (1)
– Only 2 = 7 (2)
– Only 3 = 6 (3)
² About 22 Uncommon Projects
http://blogs.gartner.com/merv-adrian/2015/07/02/now-what-is-hadoop/
ODPi
ODPi
ODPi
ODPi
ODPi ODPi ODPi
Hortonworks.com/tutorials
44. 44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1446 Market Street San Francisco, CA 94102
HORTONWORKS DATA PLATFORM
Hadoop
& YARN
Flume
Oozie
Pig
Hive
Tez
Sqoop
Ambari
Slider
Kafka
Knox
Solr
Zookeeper
Spark
Falcon
Ranger
HBase
Atlas
Accumulo
Storm
Phoenix
4.10.2
DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY
HDP 2.2
Dec 2014
HDP 2.1
April 2014
HDP 2.0
Oct 2013
HDP 2.2
Dec 2014
HDP 2.1
April 2014
HDP 2.0
Oct 2013
0.12.0 0.12.0
0.12.1 0.13.0 0.4.0
1.4.4 1.4.4 3.3.23.4.5
0.4.00.5.0
0.14.0 0.14.0 3.4.6 0.5.0 0.4.00.9.30.5.2
4.0.04.7.2
1.2.1 0.60.0 0.98.4 4.2.0 1.6.1 0.6.0 1.5.21.4.5 4.1.02.0.0
1.4.0 1.5.1 4.0.0
1.3.1
1.5.1 1.4.4 3.4.5
2.2.0
2.4.0
2.6.0
2.7.1 1.4.6 0.6.0 0.5.02.1.00.8.2 3.4.61.5.25.2.1 0.80.0 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0
HDP 2.3
Oct 2015
4.2.0
0.96.1
0.98.0 0.9.1
0.8.1
1.4.1 1.1.2
2.7.3 1.4.6 0.11.0 0.7.02.5.00.10.1.0 3.4.61.5.25.5.1 0.91.0 0.8.01.7.04.7.0 1.1.0 0.10.00.7.0
1.2.1+
2.1***
0.16.0
HDP 2.6*
1H2017
4.2.0
1.6.3+
2.1**
1.1.2
2.7.1 1.4.6 0.6.0 0.5.02.2.10.9.0 3.4.61.5.25.2.1 0.80.0 0.5.01.7.04.4.0 0.10.0 0.6.10.7.01.2.10.15.0
HDP 2.4
Mar 2016
4.2.01.6.0 1.1.2
Zeppelin
Ongoing Innovation in Apache
0.7.0
* HDP 2.6 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.
** Spark 1.6.3+ Spark 2.1 – HDP 2.6 supports both Spark 1.6.3 and Spark 2.1 as GA
*** Hive 2.1 is GA within HDP 2.6.
2.7.3 1.4.6 0.9.0 0.6.02.4.00.10.0 3.4.61.5.25.5.1 0.91.0 0.7.01.7.04.7.0 1.0.1 0.10.00.7.0
1.2.1+
2.1***
0.16.0
HDP 2.5
Aug 2016
4.2.0
1.6.2+
2.0**
1.1.20.6.0
Druid
0.9.2
45. 45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Generation Data Vendors Investment for the Enterprise
Vertical
Integration with
YARN and HDFS
Ensure engines can run
reliably and respectfully
in a YARN based
cluster
Provision,
Manage &
Monitor
Ambari
Zookeeper
Scheduling
Oozie
Load data and
manage
according
to policy
Provide layered
approach to
security through
Authentication,
Authorization,
Accounting, and
Data Protection
SECURITYGOVERNANCE
Deploy and
effectively
manage the
platform
° ° ° ° ° ° ° ° ° ° ° ° ° ° °
Script
Pig
SQL
Hive
Java
Scala
Cascading
Stream
Storm
Search
Solr
NoSQL
HBase
Accumulo
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
In-Memory
Spark
Others
ISV
Engines
1 ° ° ° ° ° ° ° ° ° ° ° ° ° °
YARN: Data Operating System
(Cluster Resource Management)
HDFS
(Hadoop Distributed File System)
Tez Slider SliderTez Tez
OPERATIONS
Horizontal Integration for Enterprise Services
Ensure consistent enterprise services are applied across the Hadoop stack
46. 46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What do distributions do?
à Define a stack of components
• Rich and latest set of Apache Projects (open source & open community) without lock in
à Vertical and Horizontal integration of components
• Vertical: Best Speed and Scale
• Horizontal: Open Enterprise Ready
à Provision and Upgrade stack
• Robust, Easy and Anywhere
à Accelerate time to value (easy of use)
• New Face of Hadoop with Uis from Ambari, Ambari Views, Ranger, Falcon, Atlas
à Partner Ecosystem
• Rich and Deep
à Support
• Industry’s best, SmartSense and influence community
49. 49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Ambari Core Features and Extensibility
Install & Configure
Operate, Manage &
Administer
Develop
Optimize & Tune
Developer
Data Architect
Ambari provides core services for operations, development and
extensions points for both
Extensibility Features
Stacks, Blueprints & REST APIs
Core Features
Install Wizard & Web
Web, Operator Views,
Metrics & Alerts
User Views
User Views
Views Framework & REST APIs
Views Framework
Views Framework
How?
Cluster Admin
54. 54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin
• Web-based notebook for data engineers, data
analysts and data scientists
• Brings interactive data ingestion, data
exploration, visualization, sharing and
collaboration features to Hadoop and
Spark
• Modern data science studio
• Scala with Spark
• Python with Spark
• SparkSQL
• Apache Hive, and more.
56. 56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Access patterns enabled by YARN
YARN: Data Operating System
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° °
°
°N
HDFS
Hadoop Distributed File System
Interactive Real-TimeBatch
Applications Batch
Needs to happen but, no
timeframe limitations
Interactive
Needs to happen at
Human time
Real-Time
Needs to happen at
Machine Execution time.
57. 57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive: SQL in Hadoop
• Created by a team at Facebook
• Provides a standard SQL interface to data stored in Hadoop
• Quickly find value in raw data files
• Proven at petabyte scale
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy,
Business Objects, etc…
SensorMobile
Weblog
Operational
/ MPP
SQL Queries
58. Page58 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive Architecture
User issues SQL query
Hive parses and plans query
Query converted to
MapReduce and executed on
Hadoop
2
3
Web UI
JDBC /
ODBC
CLI
Hive
SQL
1
1
HiveServer2 Hive
MR/Tez Compiler
Optimizer
Executor
2
Hive
MetaStore
(MySQL, Postgresql,
Oracle)
MapReduce, Tez or Spark Job
Data DataData
Hadoop 3
Data-local processing
59. 59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hive and the Stinger Initiative
Base Optimizations
Generate simplified DAGs
In-memory Hash Joins
Vector Query Engine
Optimized for modern processor
architectures
Tez
Express tasks more simply
Eliminate disk writes
Pre-warmed Containers
ORCFile
Column Store
High Compression
Predicate / Filter Pushdowns
YARN
Next-gen Hadoop data processing
framework
+ +
Query Planner
Intelligent Cost-Based Optimizer
Performance Optimizations
100x+ faster time to insight
Deeper analytical capabilities
60. 60 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Stinger.next and Sub-Second SQL
Emergence of LLAP brings Sub-Second SQL response times within reach with Hive.
BATCH & INTERACTIVE BATCH & INTERACTIVE BATCH, INTERACTIVE & SUB-SECONDSPEED
DELIVERY
SQL
UPDATES
ENGINES
STINGER
DELIVERED
PROGRESS
DELIVERED
FINAL
VERSION
HDP 2.1
VERSION
0.13
VERSION
HDP 2.3
VERSION
1.2.1
SQL:2003+ SQL:2011 SUBSET
READ-ONLY SQL INSERT/UPDATE/DELETE
MR, TEZ MR, TEZ
FUTURE
STINGER NEXT
COMPLETE ACID SUPPORT INCLUDING MERGE
COMPREHENSIVE SQL:2011 BASED ANALYTICS
MR, TEZ, LLAP
DELIVERED IN DEVELOPMENT
Tiered Data Storage
Stinger.next Phase 3
YARN: Containerized
Applications
61. 61 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Types SQL Features File Formats Latest Additions…
Numeric Core SQL Features Columnar Scalable Cross Product
FLOAT/DOUBLE Date, Time and Arithmetical Functions ORCFile Primary Key / Foreign Key
DECIMAL INNER, OUTER, CROSS and SEMI Joins Parquet Non-Equijoin
INT/TINYINT/SMALLINT/BIGINT Derived Table Subqueries
Text
Tech Preview:
Proc. Extensions (PL/SQL)
BOOLEAN Correlated + Uncorrelated Subqueries CSV Future
String UNION ALL Logfile ACID MERGE
CHAR / VARCHAR UDFs, UDAFs, UDTFs Nested / Complex Multi Subquery
STRING Common Table Expressions Avro Comparison to sub-select
BINARY UNION DISTINCT JSON INTERSECT and EXCEPT
Date, Time Advanced Analytics XML
DATE OLAP and Windowing Functions Custom Formats
TIMESTAMP CUBE and Grouping Sets Other Features
Interval Types Nested Data Analytics XPath Analytics
Complex Types Nested Data Traversal
ARRAY Lateral Views
MAP ACID Transactions
STRUCT INSERT / UPDATE / DELETE
UNION
Apache Hive: Journey to SQL:2011 Analytics
Legend
Existing
Future
New with Hive 2.0
62. 62 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Storage
Columnar Storage
ORCFile Parquet
Unstructured Data
JSON CSV
Text Avro
Custom
Weblog
Engine
SQL Engines
Row Engine Vector Engine
SQL
SQL Support
SQL:2011 Optimizer HCatalog HiveServer2
Cache
Block Cache
Linux Cache
Distributed
Execution
Hadoop 1
MapReduce
Hadoop 2
Tez Spark
Vector Cache
LLAP
Persistent Server
Historical
Current
In-development
Legend
Apache Hive: Modern Architecture
63. 63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves
Hive, but improves all things batch and interactive
for Hadoop; Pig, Cascading…
• More Efficient Processing than MapReduce
• Reduce operations and complexity of back end processing
• Allows for Map Reduce Reduce which saves hard disk operations
• Implements a “service” which is always on, decreasing start times of jobs
• Allows Caching of Data in Memory
YARN
Dev
Cascading/
Scalding
Why is Tez Important?
°1 ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° °
°
°°
° ° ° ° ° ° °
° ° ° ° ° ° N
HDFS
(Hadoop Distributed File
System)
Scripting
Pig
SQL
Hive
Tez Tez
Applications
Tez
YARN: Data Operating System
Interactive Real-TimeBatch
64. 64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Tez
Hive – MapReduce Hive – Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
SELECT a.state
JOIN (a, c)
SELECT c.price
SELECT b.id
JOIN(a, b)
GROUP BY a.state
COUNT(*)
AVG(c.price)
M M M
R R
M M
R
M M
R
M M
R
HDFS
HDFS
HDFS
M M M
R R
R
M M
R
R
SELECT a.state,
c.itemId
JOIN (a, c)
JOIN(a, b)
GROUP BY a.state
COUNT(*)
AVG(c.price)
SELECT b.id
Tez avoids unneeded writes to
HDFS
Tez allows Reducer-only jobs
within the DAG
65. 65 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sub-second Queries in Hive: LLAP (Live Long and Process)
à Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two
à Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible work scheduled on node with data cached, if not work will be run in other node
à Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security
à Working on interface to allow other data engines to read securely in parallel
à Beta in 2.0
68. 68 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scripting Data Pipeline & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDF’s to help adoption
• There are a large number of existing shops using PIG
YARN: Data Operating System
Interactive Real-TimeBatch
70. 70 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Resource Management
Storage
Elegant Developer APIs
DataFrames, Machine Learning, and SQL
Made for Data Science
All apps need to get predictive at scale and fine granularity
Democratize Machine Learning
Spark is doing to ML on Hadoop what Hive did for SQL on
Hadoop
Community
Broad developer, customer and partner interest
Realize Value of Data Operating System
A key tool in the Hadoop toolbox
Apache Spark enthusiasm
Applications
Spark Core Engine
Scala
Java
Python
libraries
MLlib
(Machine
learning)
Spark
SQL*
Spark
Streaming*
Spark Core Engine
71. 71 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Spark & Apache Hadoop Perfect Together
General Purpose Data Access Engine
for fast, large-scale data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python and R
Built-in Libraries
Enable data workers to rapidly iterate over data for:
ETL, Machine Learning, SQL and Stream processing
YARN
Scala
Java
Python
R
APIs
Spark Core Engine
Spark
SQL
Spark
Streaming
MLlib GraphX
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° ° °
°
N
HDFS
72. 72 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Projects Enable Access Patterns
Various open source projects have
incubated in order to meet these access
pattern needs
Today, they can all run on a single cluster
on a single set of data because of YARN
All powered by a broad open community
YARN: Data Operating System
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° °
°
°N
HDFS
Hadoop Distributed File System
Interactive
Solr
Spark
Hive
Pig
Real-Time
HBase
Accumulo
Storm
Batch
MapReduce
Applications
Kafka
78. Data Discovery Lab
• Elefante Wine Company has a fleet of over 100 trucks.
• The geolocation data collected from the trucks contains events generated while the truck drivers are
driving.
• The company’s goal with Hadoop is to Mitigate Risk:
o Understand correlations between miles driven and events
o Compute the risk factor for each driver based on mileage & events
o Lab Env
o Sandbox 2.5
o Lab Doc
o URL: http://tinyurl.com/hello-hdp
o Load Data
o Query Data
o Process Data
79. Elefante Wine Current Challenges
The Company
Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine
in a highly-regulated industry with stringent transportation requirements.
The Situation
Recently a number of driver violations led to fines and increased insurance rates
The Challenges
• Rising Operational Costs
• Driver Safety
• Risk Management
• Logistics Optimization
80. © Hortonworks Inc. 2012
Professional Services
Elefante Wine Company has a large fleet of trucks in USA
A truck generates millions of events for a
given route; an event could be:
§ 'Normal' events: starting / stopping of the
vehicle
§ ‘Violation’ events: speeding, excessive
acceleration and breaking, unsafe tail distance
Company uses an application that monitors
truck locations and violations from the
truck/driver in real-time to calculate risk
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers
81. Elefante Wine Risk and Driver Safety Challenges
Trucks outfitted with new sensors generating large
volumes of new data:
• Location
• Speed
• Driver Violations
Need to be integrate real-time & historical data
Increase safety and reduce liabilities
Anticipate driver violations BEFORE they
happen and take precautionary actions
Find predictive correlations in driver behavior over
large volumes of real-time data
Difficult to deliver timely insights to the right
people and systems to take action
Data Discovery
Uncover new
findings
Predictive Analytics
Identify your next best
action
Better Understanding
of the Past
Better Prediction
of the Future