Apache Hadoop Now Next and Beyond

•

12 likes•3,999 views

With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business intelligence. Hadoop is uniquely capable of storing, aggregating, and refining multi-structured data sources into formats that fuel new business insights. Apache Hadoop is fast becoming the defacto platform for processing Big Data. Hadoop started from a relatively humble beginning as a point solution for small search systems. Its growth into an important technology to the broader enterprise community dates back to Yahoo’s 2006 decision to evolve Hadoop into a system for solving its internet scale big data problems. Eric will discuss the current state of Hadoop and what is coming from a development standpoint as Hadoop evolves to meet more workloads.

Hadoop Now, Next & Beyond
Community Driven Enterprise Apache Hadoop
Eric Baldeschwieler, “Eric14”
Hortonworks CTO
@jeric14

© Hortonworks Inc. 2013

Quick History: Hadoop at Yahoo!

Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/

© Hortonworks Inc. 2013

Hortonworks Approach to Enterprise Hadoop

Community Driven Enterprise Apache Hadoop
Identify and introduce enterprise
requirements into the pubic domain

Work with the community to advance and
incubate open source projects

Apply Enterprise Rigor to provide the most
stable and reliable distribution

© Hortonworks Inc. 2013

Making Hadoop Enterprise-Ready
OPERATIONAL DATA
SERVICES SERVICES
Manage &
AMBARI FLUME Store, HIVE
PIG
Operate at Process and HBASE
Scale SQOOP Access Data
OOZIE HCATALOG

MAP REDUCE
Distributed
HADOOP CORE Storage & Processing
HDFS

Enterprise Readiness: HA,
PLATFORM SERVICES DR, Snapshots, Security, …

HORTONWORKS
DATA PLATFORM (HDP)

OS / VM Cloud Appliance

© Hortonworks Inc. 2013

HCatalog: Table-level Abstractions

• Consistency of data models across tools (MapReduce, Pig, HBase and Hive)
• Accessibility: share data as tables inside and out of Hadoop

HCatalog Shared table
and schema
management
• Raw Hadoop data Table access opens the
• Inconsistent, unknown Consistent schema platform
• Tool specific access REST API

© Hortonworks Inc. 2013

Ambari: Provision > Manage > Monitor
A framework for operating Hadoop…with APIs for integration

Manage Monitor

Provision Ambari Integrate

Hadoop Cluster

© Hortonworks Inc. 2013

Ambari: Latest Highlights
• Job Diagnostics

• Cluster History

• Instant Insight

• Cluster Navigation

• REST interface
Apache Ambari Dashboard

© Hortonworks Inc. 2013

See Hadoop > Learn Hadoop > Do Hadoop

Hands on
Full environment
step-by- step
to evaluate
tutorials to learn
Hadoop

© Hortonworks Inc. 2013

Hadoop 2.0 Innovations - YARN

• Focus on scale and innovation
– Support 10,000+ computer clusters
– Extensible to encourage innovation

Graph Processing
• Next generation execution

MapReduce
– Improves MapReduce performance

Other
Tez
• Supports new frameworks beyond
MapReduce YARN: Cluster Resource Management
– Low latency, Streaming, Services
– Do more with a single Hadoop cluster
HDFS Redundant, Reliable Storage

© Hortonworks Inc. 2013

Tez on YARN: Going Beyond Batch

Tez Task

Tez Optimizes Execution Always-On Tez Service
New runtime engine for Low latency processing for
more efficient data processing all Hadoop data processing

© Hortonworks Inc. 2013

Stinger Initiative
• Community initiative around Hive
• Enables Hive to support interactive workloads
• Enhances Hive’s standard SQL interface for Hadoop
• Improves existing tools & preserves investments

Execution Query File
Engine Planner Format

+ + = 100X
Tez Hive ORC file

© Hortonworks Inc. 2013

Stinger: Make Hive Fly For All BI Needs
Parameterized Reports Enterprise Reports

Dashboard / Scorecard

Visualization Data Mining

More SQL
+

100X Faster

Interactive Batch

© Hortonworks Inc. 2013

Knox: Make Hadoop Security Simple

Authentication &
Verification
User Store Hadoop Cluster
KDC, AD,
LDAP

{REST} Knox
Client Gateway

© Hortonworks Inc. 2013

Hortonworks Data Platform 2.0 Alpha 2

Key New Features Business Value
– Hive performance
– First distribution to include Tez
Single Platform Multiple Use
From HDP 2.0 Alpha BATCH INTERACTIVE ONLINE
– Yarn
– Full Stack HA
– Snapshots
– Disaster Recovery Big Data
Transactions, Interactions, Observations
– Rolling Upgrades

Available today http://Hortonworks.com/products

Page 14
© Hortonworks Inc. 2013

Falcon: Data Lifecycle Management
• New Apache Incubator Project

• Introduced by InMobi, Hortonworks and Yahoo!

• Data Lifecycle Management Framework for Hadoop

• Configure and Manage Workflows & Policies for:

– Data Movement
– Disaster Recovery
– Data Retention

© Hortonworks Inc. 2013

Join the Community & Get Involved!

• INNOVATE
Open Source

Vendors
• INTEGRATE

End Users • COMMUNICATE

© Hortonworks Inc. 2013

What's hot

Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.

Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents

201305 hadoop jpl-v3Eric Baldeschwieler

Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner

Stinger.Next by Alan Gates of HortonworksData Con LA

Sgi hadoop clusters_ds_us_4381Michael Moretti

Cloud Austin Meetup - Hadoop like a championAmeet Paranjape

Big data on virtualized infrastuctureDataWorks Summit

Delivering Apache Hadoop for the Modern Data Architecture Hortonworks

Webinar: Productionizing Hadoop: Lessons Learned - 20101208Cloudera, Inc.

Virtualized HadoopJoey Jablonski

Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.

YARN - Strata 2014Hortonworks

Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopHortonworks

Processing Big Datacwensel

Enabling Diverse Workload Scheduling in YARNDataWorks Summit

Hadoop summit cloudera keynote_v5Cloudera, Inc.

Hadoop crashcourse v3Hortonworks

Deploying Docker applications on YARN via SliderHortonworks

Cloud computing eraTrendProgContest13

What's hot (20)

Hadoop World 2011: Mike Olson Keynote Presentation

Контроль зверей: инструменты для управления и мониторинга распределенных сист...

201305 hadoop jpl-v3

Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera

Stinger.Next by Alan Gates of Hortonworks

Sgi hadoop clusters_ds_us_4381

Cloud Austin Meetup - Hadoop like a champion

Big data on virtualized infrastucture

Delivering Apache Hadoop for the Modern Data Architecture

Webinar: Productionizing Hadoop: Lessons Learned - 20101208

Virtualized Hadoop

Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar

YARN - Strata 2014

Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Processing Big Data

Enabling Diverse Workload Scheduling in YARN

Hadoop summit cloudera keynote_v5

Hadoop crashcourse v3

Deploying Docker applications on YARN via Slider

Cloud computing era

Similar to Apache Hadoop Now Next and Beyond

Big Data Analytics - Is Your Elephant Enterprise Ready?Hortonworks

Discover.hdp2.2.ambari.final[1]Hortonworks

Introduction to the Hadoop EcoSystemShivaji Dutta

Introduction to Hortonworks Data PlatformHortonworks

Hadoop - Now, Next and BeyondTeradata Aster

Introduction to Microsoft HDInsight and BI ToolsDataWorks Summit

Why hadoop for data science?Hortonworks

Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSHortonworks

Discover HDP 2.1: Apache Solr for Hadoop SearchHortonworks

Discover hdp 2.2 hdfs - finalHortonworks

Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...Hortonworks

Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks

Strata feb2013alanfgates

Discover.hdp2.2.h base.final[2]Hortonworks

Mrinal devadas, Hortonworks Making Sense Of Big DataPatrickCrompton

Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache HiveHortonworks

Discover.hdp2.2.storm and kafka.finalHortonworks

Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks

Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextHortonworks

Supporting Financial Services with a More Flexible Approach to Big DataHortonworks

Similar to Apache Hadoop Now Next and Beyond (20)

Big Data Analytics - Is Your Elephant Enterprise Ready?

Discover.hdp2.2.ambari.final[1]

Introduction to the Hadoop EcoSystem

Introduction to Hortonworks Data Platform

Hadoop - Now, Next and Beyond

Introduction to Microsoft HDInsight and BI Tools

Why hadoop for data science?

Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS

Discover HDP 2.1: Apache Solr for Hadoop Search

Discover hdp 2.2 hdfs - final

Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...

Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop

Strata feb2013

Discover.hdp2.2.h base.final[2]

Mrinal devadas, Hortonworks Making Sense Of Big Data

Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive

Discover.hdp2.2.storm and kafka.final

Discover HDP 2.2: Apache Falcon for Hadoop Data Governance

Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next

Supporting Financial Services with a More Flexible Approach to Big Data

More from DataWorks Summit

Data Science Crash CourseDataWorks Summit

Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit

HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit

Managing the Dewey Decimal SystemDataWorks Summit

Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit

HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit

Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit

Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit

Security Framework for Multitenant ArchitectureDataWorks Summit

Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit

Extending Twitter's Data Platform to Google CloudDataWorks Summit

Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit

Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit

Computer Vision: Coming to a Store Near YouDataWorks Summit

Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit

More from DataWorks Summit (20)

Data Science Crash Course

Floating on a RAFT: HBase Durability with Apache Ratis

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

HBase Tales From the Trenches - Short stories about most common HBase operati...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Managing the Dewey Decimal System

Practical NoSQL: Accumulo's dirlist Example

HBase Global Indexing to support large-scale data ingestion at Uber

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Security Framework for Multitenant Architecture

Presto: Optimizing Performance of SQL-on-Anything Engine

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Extending Twitter's Data Platform to Google Cloud

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Computer Vision: Coming to a Store Near You

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Apache Hadoop Now Next and Beyond

1. Hadoop Now, Next & Beyond Community Driven Enterprise Apache Hadoop Eric Baldeschwieler, “Eric14” Hortonworks CTO @jeric14 © Hortonworks Inc. 2013

2. Quick History: Hadoop at Yahoo! Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/ © Hortonworks Inc. 2013

3. Hortonworks Approach to Enterprise Hadoop Community Driven Enterprise Apache Hadoop Identify and introduce enterprise requirements into the pubic domain Work with the community to advance and incubate open source projects Apply Enterprise Rigor to provide the most stable and reliable distribution © Hortonworks Inc. 2013

4. Making Hadoop Enterprise-Ready OPERATIONAL DATA SERVICES SERVICES Manage & AMBARI FLUME Store, HIVE PIG Operate at Process and HBASE Scale SQOOP Access Data OOZIE HCATALOG MAP REDUCE Distributed HADOOP CORE Storage & Processing HDFS Enterprise Readiness: HA, PLATFORM SERVICES DR, Snapshots, Security, … HORTONWORKS DATA PLATFORM (HDP) OS / VM Cloud Appliance © Hortonworks Inc. 2013

5. HCatalog: Table-level Abstractions • Consistency of data models across tools (MapReduce, Pig, HBase and Hive) • Accessibility: share data as tables inside and out of Hadoop HCatalog Shared table and schema management • Raw Hadoop data Table access opens the • Inconsistent, unknown Consistent schema platform • Tool specific access REST API © Hortonworks Inc. 2013

6. Ambari: Provision > Manage > Monitor A framework for operating Hadoop…with APIs for integration Manage Monitor Provision Ambari Integrate Hadoop Cluster © Hortonworks Inc. 2013

7. Ambari: Latest Highlights • Job Diagnostics • Cluster History • Instant Insight • Cluster Navigation • REST interface Apache Ambari Dashboard © Hortonworks Inc. 2013

8. See Hadoop > Learn Hadoop > Do Hadoop Hands on Full environment step-by- step to evaluate tutorials to learn Hadoop © Hortonworks Inc. 2013

9. Hadoop 2.0 Innovations - YARN • Focus on scale and innovation – Support 10,000+ computer clusters – Extensible to encourage innovation Graph Processing • Next generation execution MapReduce – Improves MapReduce performance Other Tez • Supports new frameworks beyond MapReduce YARN: Cluster Resource Management – Low latency, Streaming, Services – Do more with a single Hadoop cluster HDFS Redundant, Reliable Storage © Hortonworks Inc. 2013

10. Tez on YARN: Going Beyond Batch Tez Task Tez Optimizes Execution Always-On Tez Service New runtime engine for Low latency processing for more efficient data processing all Hadoop data processing © Hortonworks Inc. 2013

11. Stinger Initiative • Community initiative around Hive • Enables Hive to support interactive workloads • Enhances Hive’s standard SQL interface for Hadoop • Improves existing tools & preserves investments Execution Query File Engine Planner Format + + = 100X Tez Hive ORC file © Hortonworks Inc. 2013

12. Stinger: Make Hive Fly For All BI Needs Parameterized Reports Enterprise Reports Dashboard / Scorecard Visualization Data Mining More SQL + 100X Faster Interactive Batch © Hortonworks Inc. 2013

13. Knox: Make Hadoop Security Simple Authentication & Verification User Store Hadoop Cluster KDC, AD, LDAP {REST} Knox Client Gateway © Hortonworks Inc. 2013

14. Hortonworks Data Platform 2.0 Alpha 2 Key New Features Business Value – Hive performance – First distribution to include Tez Single Platform Multiple Use From HDP 2.0 Alpha BATCH INTERACTIVE ONLINE – Yarn – Full Stack HA – Snapshots – Disaster Recovery Big Data Transactions, Interactions, Observations – Rolling Upgrades Available today http://Hortonworks.com/products Page 14 © Hortonworks Inc. 2013

15. Falcon: Data Lifecycle Management • New Apache Incubator Project • Introduced by InMobi, Hortonworks and Yahoo! • Data Lifecycle Management Framework for Hadoop • Configure and Manage Workflows & Policies for: – Data Movement – Disaster Recovery – Data Retention © Hortonworks Inc. 2013

16. Join the Community & Get Involved! • INNOVATE Open Source Vendors • INTEGRATE End Users • COMMUNICATE © Hortonworks Inc. 2013

Editor's Notes

HCatalog – metadata shared across whole platformFile locations become abstract (not hard-coded)Data types become shared (not redefined per tool)Partitioning and HDFS-optimized
Job DiagnosticsVisualize and troubleshoot Hadoop job execution and performanceCluster History View historical job execution & performanceInstant InsightView health of Core Hadoop (HDFS, MapReduce) and related projectsCluster Navigation “Quick link” buttons jump into namenode web UI for a serverREST interface provides external access to Ambari for existing tools. Facilitates integration with Microsoft System Center and Teradata Viewpoint
Hortonworks SandboxHortonworks accelerates Hadoop skills development with an easy-to-use, flexible and extensible platform to learn, evaluate and use Apache HadoopWhat is it: virtualized single-node implementation of the enterprise-ready Hortonworks Data PlatformProvides demos, videos and step-by-step hands-on tutorialsPre-built partner integrations and access to datasetsWhat it does: Dramatically accelerates the process of learning Apache HadoopSee It -- demos and videos to illustrate use casesLearn It -- multi level step by step tutorials Do It -- hands on exercises for faster skills developmentHow it helps: Accelerate and validates the use of Hadoop within your unique data architectureUse your data to explore and investigate your use casesZERO to big data in 15 minutes
Community developed frameworksMachine learning / Analytics (MPI, GraphLab, Giraph, Hama, Spark, …)Services inside Hadoop (memcache, HBase, Storm…)Low latency computing (CEP or stream processing)
Tez Approved as New Apache Incubator ProjectHortonworks Introduces Next-Generation Runtime for Improving Latency and Throughput of Hadoop Apps
Buzz about low latency access in Hadoop
Hortonworks Unveils Stinger Initiative to Make Apache Hive 100X Faster for Interactive QueriesHortonworks leading effort with a group of community contributors focusing on enhancing Apache Hive, the defacto standard for SQL access to HadoopEnterprise Reports – Your cell phone bill is an exampleDashboard – KPI trackingParameterized Reports – What are the hot prospects in my region?Visualization – Visual exploration of dataData Mining – Large scale data processing and extraction usually fed to other toolsHow?Improve Latency & ThroughputQuery engine improvementsNew “Optimized RCFile” column storeNext-gen runtime (elim’s M/R latency)Extend Deep Analytical AbilityAnalytics functionsImproved SQL coverageContinued focus on core Hive use cases
Operators can firewall cluster without end user access to “gateway node”Users see one cluster end-point that aggregates capabilities for data access, metadata and job controlProvide perimeter security to make Hadoop security setup easierEnable integration enterprise and cloud identity management environmentsVerificationVerify identity tokenSAML, propagation of identityAuthenticationEstablish identity at Gateway to Authenticate with LDAP + AD
Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, It is a more mature version of Hadoop that has been architected for broader use by more generic enterprise. The main focus for this nest generation has been the broader enterprise. They have very explicit requirements that are a little bit different than the typical web properties who first adopted hadoop. Some of the requirements required the community to rethink the approach. Plus, our experience running hadoop at yahoo provided much insight into how we could architect things to make them better.Some of the critical features are listed here. Go through them.Highlight workloads and explain how 2.0 is engineered to meet these exacting demands. There is a graphic to help illustrate. We have moved beyond just batch…

Apache Hadoop Now Next and Beyond

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Apache Hadoop Now Next and Beyond

Similar to Apache Hadoop Now Next and Beyond (20)

More from DataWorks Summit

More from DataWorks Summit (20)

Apache Hadoop Now Next and Beyond

Editor's Notes