Apache Hadoop 2.0
Migration from 1.0 to 2.0
Vinod Kumar Vavilapalli
Hortonworks Inc
vinodkv [at] apache.org
@tshooter

© Hortonworks Inc. 2014

Hello!
• 6.5 Hadoop-years old
• Previously at Yahoo!, @Hortonworks now.
• Last thing at school – a two-node Tomcat cluster. Three months later,
first thing on the job, brought down an 800-node cluster ;)
• Two hats
– Hortonworks: Hadoop MapReduce and YARN
– Apache: Apache Hadoop YARN lead. Apache Hadoop PMC, Apache Member

• Worked/working on
– YARN, Hadoop MapReduce, HadoopOnDemand, CapacityScheduler, Hadoop
security
– Apache Ambari: Kickstarted the project and its first release
– Stinger: High performance data processing with Hadoop/Hive

• Lots of random trouble shooting on clusters
• 99%+ of my code is in Apache, Hadoop

Architecting the Future of Big Data

Agenda
• Apache Hadoop 2
• Migration Guide for Administrators
• Migration Guide for Users
• Summary

Apache Hadoop 2
Next Generation Architecture

Hadoop 1 vs Hadoop 2
• HADOOP 1.0: Single Use System (Batch Apps)
– MapReduce (cluster resource management & data processing)
– HDFS (redundant, reliable storage)

• HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
– MapReduce and Others (data processing)
– YARN (cluster resource management)
– HDFS2 (redundant, highly-available & reliable storage)

Why Migrate?
• 2.0 > 2 * 1.0
– HDFS: Lots of ground-breaking features
– YARN: Next generation architecture
– Beyond MapReduce with Tez, Storm, Spark; in Hadoop!
– Did I mention Services like HBase, Accumulo on YARN with HoYA?

• Return on Investment: 2x throughput on same hardware!

Yahoo!
• On YARN (0.23.x)
• Moving fast to 2.x

http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html

Twitter

HDFS
• High Availability – NameNode HA
• Scale further – Federation
• Time-machine – HDFS Snapshots
• NFSv3 access to data in HDFS

HDFS Contd.
• Support for multiple storage tiers – Disk, Memory, SSD
• Finer grained access – ACLs
• Faster access to data – DataNode Caching
• Operability – Rolling upgrades

YARN: Taking Hadoop Beyond Batch
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS

with Predictable Performance and Quality of Service
Applications Run Natively in Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search, Weave, …)

YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)

5 Key Benefits of YARN
1. Scale
2. New Programming Models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java

Any catch?
• I could go on and on about the benefits, but what’s the catch?
• Nothing major!
• Major architectural changes
• But the impact on user applications and APIs is kept to a minimum
– Feature parity
– Administrators
– End-users

Administrators
Guide to migrating your clusters to Hadoop-2.x

New Environment
• Hadoop Common, HDFS and MR can be installed separately, but doing so is optional
• Env
– HADOOP_HOME deprecated, but works
– New environment variables: HADOOP_COMMON_HOME, HADOOP_HDFS_HOME,
HADOOP_MAPRED_HOME, HADOOP_YARN_HOME

• Commands
– bin/hadoop works as usual but some sub-commands are deprecated
– Separate commands for mapred and hdfs
– hdfs dfs -ls
– mapred job -kill <job_id>

– sbin/yarn-daemon.sh etc. for starting YARN daemons

Wire compatibility
• Not RPC wire compatible with prior versions of Hadoop
• Admins cannot mix and match versions
• Clients must be updated to use the same version of Hadoop client
library as the one installed on the cluster.

Capacity management
• Slots -> Dynamic memory based Resources
• Total memory on each node
– yarn.nodemanager.resource.memory-mb

• Minimum and maximum sizes
– yarn.scheduler.minimum-allocation-mb
– yarn.scheduler.maximum-allocation-mb

• MapReduce configs don’t change
– mapreduce.map.memory.mb
– mapreduce.map.java.opts
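
The move from slots to memory-based resources can be illustrated with a little arithmetic. The sketch below is not YARN code; it assumes the scheduler rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb and caps it at yarn.scheduler.maximum-allocation-mb, which is how the CapacityScheduler is usually described. The function names and all numeric values are hypothetical.

```python
import math


def normalize_request_mb(requested_mb, min_alloc_mb, max_alloc_mb):
    """Round a container request up to a multiple of the minimum allocation.

    Assumption: mirrors the commonly described scheduler behaviour;
    it is an illustration, not the scheduler's actual code.
    """
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
    units = max(1, math.ceil(requested_mb / min_alloc_mb))
    return min(units * min_alloc_mb, max_alloc_mb)


def containers_per_node(node_memory_mb, container_mb):
    """How many containers of a given size fit in yarn.nodemanager.resource.memory-mb."""
    return node_memory_mb // container_mb


# Hypothetical cluster settings, chosen only for the example.
node_memory_mb = 49152        # yarn.nodemanager.resource.memory-mb
min_alloc_mb = 1024           # yarn.scheduler.minimum-allocation-mb
max_alloc_mb = 8192           # yarn.scheduler.maximum-allocation-mb
map_memory_mb = 1536          # mapreduce.map.memory.mb

granted = normalize_request_mb(map_memory_mb, min_alloc_mb, max_alloc_mb)
print(granted, containers_per_node(node_memory_mb, granted))
```

With these made-up numbers a 1536 MB map request occupies a 2048 MB container, so the node holds 24 such containers rather than a fixed number of map slots.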

Cluster Schedulers
• Concepts stay the same
– CapacityScheduler: Queues, User-limits
– FairScheduler: Pools
– Warning: Configuration names now have YARN-isms

• Key enhancements
– Hierarchical Queues for fine-grained control
– Multi-resource scheduling (CPU, Memory etc.)
– Online administration (add queues, ACLs etc.)
– Support for long-lived services (HBase, Accumulo, Storm) (In progress)
– Node Labels for fine-grained administrative controls (Future)

Configuration
• Watch those damn knobs!
• Should work if you are using the previous configs in Common, HDFS
and client side MapReduce configs
• MapReduce server side is toast
– No migration
– Just use new configs

• Past sins
– From 0.21.x
– Configuration names changed for better separation: client and server config names
– Cleaning up naming: mapred.job.queue.name → mapreduce.job.queuename

• Old user-facing, job related configs work as before but deprecated
• Configuration mappings exist
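
To get a feel for what the configuration mappings do, here is a minimal sketch of translating deprecated job config names to their new names. It is not Hadoop's own mechanism. The first mapping comes from the slide above; the other two are entries I believe appear in Hadoop's deprecated-properties table and should be checked against your release. The helper name migrate_config is hypothetical.

```python
# Old name -> new name. Only the first entry is taken from the slide;
# the others are assumptions to verify against the deprecated-properties table.
DEPRECATED_KEYS = {
    "mapred.job.queue.name": "mapreduce.job.queuename",
    "mapred.job.name": "mapreduce.job.name",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}


def migrate_config(conf):
    """Return (migrated_conf, renamed_pairs) for a dict of job settings.

    If both the old and the new name are present, the new name's value wins
    and the old entry is dropped.
    """
    migrated = {}
    renamed = []
    for key, value in conf.items():
        new_key = DEPRECATED_KEYS.get(key, key)
        if new_key != key:
            if new_key in conf:
                continue  # explicit new-style setting takes precedence
            renamed.append((key, new_key))
        migrated[new_key] = value
    return migrated, renamed


job_conf = {"mapred.job.queue.name": "etl", "mapreduce.map.memory.mb": "1536"}
new_conf, renamed = migrate_config(job_conf)
for old, new in renamed:
    print("deprecated:", old, "->", new)
```

A report like this is mainly useful for spotting old names in job submission scripts before users notice deprecation warnings.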

Installation/Upgrade
• Fresh install
• Upgrading from an existing version
• Fresh Install
– Apache Ambari : Fully automated!
– Traditional manual install of RPMs/Tarballs

• Upgrade
– Apache Ambari
– Semi automated
– Supplies scripts which take care of most things

– Manual upgrade

HDFS Pre-upgrade
• Back up configuration files
• Stop users!
• Run fsck and fix any errors
– hadoop fsck / -files -blocks -locations > /tmp/dfs-old-fsck-1.log

• Capture the complete namespace
– hadoop dfs -lsr / > dfs-old-lsr-1.log

• Create a list of DataNodes in the cluster
– hadoop dfsadmin -report > dfs-old-report-1.log

• Save the namespace
– hadoop dfsadmin -safemode enter
– hadoop dfsadmin -saveNamespace

• Back up NameNode meta-data
– dfs.name.dir/edits
– dfs.name.dir/image/fsimage
– dfs.name.dir/current/fsimage
– dfs.name.dir/current/VERSION

• Finalize the state of the filesystem
– hadoop namenode -finalize

• Other meta-data backup
– Hive Metastore, HCat, Oozie
– mysqldump

HDFS Upgrade
• Stop all services
• Tarballs/RPMs

HDFS Post-upgrade
• Process liveness
• Verify that all is well
– Namenode goes out of safe mode: hdfs dfsadmin -safemode wait

• File-System health
• Compare from before
– Node list
– Full Namespace

• You can start HDFS without finalizing the upgrade. When you are ready
to discard your backup, you can finalize the upgrade.
– hadoop dfsadmin -finalizeUpgrade
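
The "compare from before" step for the full namespace can be sketched as a diff of two recursive listings, one captured before the upgrade and one after. This is a rough illustration, not a Hortonworks tool: it assumes each listing line has the usual eight columns (permissions, replication, owner, group, size, date, time, path), and the function names are hypothetical.

```python
def parse_lsr(text):
    """Map path -> (permissions, owner, group, size) from a recursive listing.

    Assumption: eight whitespace-separated columns per entry; lines with
    fewer fields (for example "Found N items") are skipped.
    """
    entries = {}
    for line in text.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        perms, _repl, owner, group, size, _date, _time, path = fields
        entries[path] = (perms, owner, group, int(size))
    return entries


def compare_namespaces(before_text, after_text):
    """Return (missing, added, changed) path lists between two listings."""
    old, new = parse_lsr(before_text), parse_lsr(after_text)
    missing = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return missing, added, changed


# Made-up sample listings, standing in for the saved pre/post-upgrade logs.
before = (
    "drwxr-xr-x   - hdfs hdfs          0 2014-02-01 10:00 /data\n"
    "-rw-r--r--   3 alice hdfs       1024 2014-02-01 10:05 /data/a.txt\n"
    "-rw-r--r--   3 alice hdfs       2048 2014-02-01 10:06 /data/b.txt\n"
)
after = (
    "drwxr-xr-x   - hdfs hdfs          0 2014-02-01 10:00 /data\n"
    "-rw-r--r--   3 alice hdfs       1000 2014-02-01 10:05 /data/a.txt\n"
    "-rw-r--r--   3 alice hdfs        512 2014-02-02 09:00 /data/c.txt\n"
)
print(compare_namespaces(before, after))
```

Timestamps and replication are deliberately ignored here; whether they should be compared is a judgment call for your cluster. Anything reported as missing or changed is worth investigating before finalizing the upgrade.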

MapReduce upgrade
• Ask users to stop their thing
• Stop the MR sub-system
• Replace everything 

HBase Upgrade
• Tarballs/RPMs
• HBase 0.95 removed support for HFile V1
– Before the actual upgrade, check if there are HFiles in V1 format using
HFileV1Detector

• /usr/lib/hbase/bin/hbase upgrade -execute

Users
Guide to migrating your applications to Hadoop-2.x

Migrating the Hadoop Stack
• MapReduce
• MR Streaming
• Pipes
• Pig
• Hive
• Oozie

MapReduce Applications
• Binary Compatibility of org.apache.hadoop.mapred APIs
– Full binary compatibility for vast majority of users and applications
– Nothing to do!

• Submit your existing MR application jars directly to YARN via
bin/hadoop, with the framework set to yarn:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

MapReduce Applications contd.
• Source Compatibility of org.apache.hadoop.mapreduce API
– Minority of users
– Proved to be difficult to ensure full binary compatibility for existing applications

• Existing applications using mapreduce APIs are source compatible
• Can run on YARN with no changes, need recompilation only

MapReduce Applications contd.
• MR Streaming applications
– work without any changes

• Pipes applications
– will need recompilation

MapReduce Applications contd.
• Examples
– Can run with minor tricks

• Benchmarks
– To compare 1.x vs 2.x

• Things to do
– Play with YARN
– Compare performance

http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/

MapReduce feature parity
• Setup, cleanup tasks are no longer separate tasks,
– And we dropped the optionality (which was a hack anyways).

• JobHistory
– JobHistory file format changed to Avro/JSON based.
– Rumen automatically recognizes the new format.
– Parsing history files yourself? Need to move to new parsers.

User logs
• Putting user-logs on DFS.
– AM logs too!
– While the job is running, logs are on the individual nodes
– After that on DFS

• Provide pretty printers and parsers for various log files – syslog, stdout,
stderr
• User logs directory with quotas beyond their current user directories
• Logs expire after a month by default and get GCed.

Application recovery
• No more lost applications on the master restart!
– Applications do not lose previously completed work
– If AM crashes, RM will restart it from where it stopped

– Applications can (WIP) continue to run while RM is down
– No need to resubmit if RM restarts

• Specifically for MR jobs
– Changes to semantics of OutputCommitter
– We fixed FileOutputCommitter, but if you have your own OutputCommitter, you
need to care about application-recoverability

JARs
• No single hadoop-core jar
• Common, hdfs and mapred jars separated
• Projects completely mavenized and YARN has separate jars for API,
client and server code
– Good. You don’t link to server side code anymore

• Some jars like avro, jackson etc are upgraded to their later versions
– If they have compatibility problems, you will have them too
– You can override that behavior by putting your jars first in the Classpath

More features
• Uber AM
– Run small jobs inside the AM itself
– No need for launching tasks.
– Is seamless – JobClient will automatically determine if this is a small job.

• Speculative tasks
– Was not enabled by default in 1.x
– Much better in 2.x, supported

• No JVM-Reuse: Feature dropped
• Netty based zero-copy shuffle
• MiniMRCluster → MiniMRYarnCluster
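
The "small job" decision behind the Uber AM can be pictured as a threshold check. The sketch below is only an illustration: the property names are the mapreduce.job.ubertask.* settings as I recall them from mapred-default.xml, the defaults used here are assumptions (in particular the 128 MB byte limit is made up), and the real check also considers whether the work fits in the AM's own resources.

```python
# Assumed property names and illustrative defaults; verify against your release.
UBER_ENABLE = "mapreduce.job.ubertask.enable"
UBER_MAX_MAPS = "mapreduce.job.ubertask.maxmaps"
UBER_MAX_REDUCES = "mapreduce.job.ubertask.maxreduces"
UBER_MAX_BYTES = "mapreduce.job.ubertask.maxbytes"


def is_uber_candidate(num_maps, num_reduces, input_bytes, conf):
    """Return True if a job is small enough to run inside the AM itself.

    Simplified: only task counts and input size are checked.
    """
    if not conf.get(UBER_ENABLE, False):
        return False
    return (
        num_maps <= conf.get(UBER_MAX_MAPS, 9)
        and num_reduces <= conf.get(UBER_MAX_REDUCES, 1)
        and input_bytes <= conf.get(UBER_MAX_BYTES, 128 * 1024 * 1024)
    )


conf = {UBER_ENABLE: True}
print(is_uber_candidate(2, 1, 10 * 1024 * 1024, conf))    # small job
print(is_uber_candidate(200, 10, 50 * 1024 ** 3, conf))   # large job
```

The point of the feature is that a job passing such a check skips task container launches entirely, which is where most of the latency of tiny jobs goes.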

Web UI
• Web UIs completely overhauled.
– Rave reviews ;)
– And some rotten tomatoes too

• Functional improvements
– capability to sort tables by one or more columns
– filter rows incrementally in "real time".

• Any user applications or tools that depend on the Web UI and extract data
using screen-scraping will cease to function
– Web services!

• AM web UI, History server UI, RM UI work together

Apache Pig
• One of the two major data processing applications in the Hadoop
ecosystem
• Existing Pig scripts that work with Pig 0.10.1 and beyond will work just
fine on top of YARN!
• Versions prior to pig-0.10.1 may not run directly on YARN
– Please accept my sincere condolences!

Apache Hive
• Queries on Hive 0.10.0 and beyond will work without changes on top of
YARN!
• Hive 0.13 & beyond: Apache TEZ!!
– Interactive SQL queries at scale!
– Hive + Stinger: Petabyte Scale SQL, in Hadoop – Alan Gates & Owen O’Malley
1.30pm Thu (2/13) at Ballroom F

Apache Oozie
• Existing Oozie workflows can start taking advantage of YARN in 0.23 and
2.x with Oozie 3.2.0 and above!

Cascading & Scalding
• Cascading 2.5 - Just works, certified!
• Scalding too!

Beyond upgrade
Where do I go from here?

YARN Eco-system
There's an app for that... YARN App Marketplace!

Applications Powered by YARN
Apache Giraph – Graph Processing
Apache Hama – BSP
Apache Hadoop MapReduce – Batch
Apache Tez – Batch/Interactive
Apache S4 – Stream Processing
Apache Samza – Stream Processing
Apache Storm – Stream Processing
Apache Spark – Iterative applications
Elastic Search – Scalable Search
Cloudera Llama – Impala on YARN
DataTorrent – Data Analysis
HOYA – HBase on YARN

Frameworks Powered By YARN
Apache Twill
REEF by Microsoft
Spring support for Hadoop 2

Summary
• Apache Hadoop 2 is, at least, twice as good!
– No, seriously!

• Exciting journey with Hadoop for this decade…
– Hadoop is no longer just HDFS & MapReduce

• Architecture for the future
– Centralized data and a wide variety of applications
– Possibility of exciting new applications and types of workloads

• Admins
– A bit of work

• End-user
– Mostly should just work as is

YARN Book coming soon!

Thank you!

http://hortonworks.com/products/hortonworks-sandbox/

Download Sandbox: Experience Apache Hadoop
Both 2.x and 1.x Versions Available!

Questions?

Apache Hadoop 2.0: Migration from 1.0 to 2.0

  • 1. Apache Hadoop 2.0 Migration from 1.0 to 2.0 Vinod Kumar Vavilapalli Hortonworks Inc vinodkv [at] apache.org @tshooter © Hortonworks Inc. 2014 Page 1
  • 2. Hello!
      • 6.5 Hadoop-years old
      • Previously at Yahoo!, @Hortonworks now.
      • Last thing at School – a two-node Tomcat cluster. Three months later, first thing at job, brought down an 800-node cluster ;)
      • Two hats
          – Hortonworks: Hadoop MapReduce and YARN
          – Apache: Apache Hadoop YARN lead. Apache Hadoop PMC, Apache Member
      • Worked/working on
          – YARN, Hadoop MapReduce, HadoopOnDemand, CapacityScheduler, Hadoop security
          – Apache Ambari: Kickstarted the project and its first release
          – Stinger: High performance data processing with Hadoop/Hive
      • Lots of random troubleshooting on clusters
      • 99% + code in Apache, Hadoop
  • 3. Agenda
      • Apache Hadoop 2
      • Migration Guide for Administrators
      • Migration Guide for Users
      • Summary
  • 4. Apache Hadoop 2: Next Generation Architecture
  • 5. Hadoop 1 vs Hadoop 2
      • HADOOP 1.0: Single Use System (Batch Apps)
          – MapReduce (cluster resource management & data processing)
          – HDFS (redundant, reliable storage)
      • HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
          – MapReduce and Others (data processing)
          – YARN (cluster resource management)
          – HDFS2 (redundant, highly-available & reliable storage)
  • 6. Why Migrate?
      • 2.0 > 2 * 1.0
          – HDFS: Lots of ground-breaking features
          – YARN: Next generation architecture
          – Beyond MapReduce with Tez, Storm, Spark; in Hadoop!
          – Did I mention Services like HBase, Accumulo on YARN with HoYA?
      • Return on Investment: 2x throughput on same hardware!
  • 7. Yahoo!
      • On YARN (0.23.x)
      • Moving fast to 2.x
      • http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html
  • 8. Twitter
  • 9. HDFS
      • High Availability – NameNode HA
      • Scale further – Federation
      • Time-machine – HDFS Snapshots
      • NFSv3 access to data in HDFS
  • 10. HDFS Contd.
      • Support for multiple storage tiers – Disk, Memory, SSD
      • Finer grained access – ACLs
      • Faster access to data – DataNode Caching
      • Operability – Rolling upgrades
  • 11. YARN: Taking Hadoop Beyond Batch
      • Store ALL DATA in one place…
      • Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
      • Applications Run Natively in Hadoop
          – BATCH (MapReduce)
          – INTERACTIVE (Tez)
          – ONLINE (HBase)
          – STREAMING (Storm, S4, …)
          – GRAPH (Giraph)
          – IN-MEMORY (Spark)
          – HPC MPI (OpenMPI)
          – OTHER (Search) (Weave…)
      • YARN (Cluster Resource Management)
      • HDFS2 (Redundant, Reliable Storage)
  • 12. 5 Key Benefits of YARN
      1. Scale
      2. New Programming Models & Services
      3. Improved cluster utilization
      4. Agility
      5. Beyond Java
  • 13. Any catch?
      • I could go on and on about the benefits, but what’s the catch?
      • Nothing major!
      • Major architectural changes
      • But the impact on user applications and APIs is kept to a minimum
          – Feature parity
          – Administrators
          – End-users
  • 14. Administrators: Guide to migrating your clusters to Hadoop-2.x
  • 15. New Environment
      • Hadoop Common, HDFS and MR are installable separately, but optional
      • Env
          – HADOOP_HOME deprecated, but works
          – New environment variables: HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME, HADOOP_YARN_HOME
      • Commands
          – bin/hadoop works as usual but some sub-commands are deprecated
          – Separate commands for mapred and hdfs
          – hdfs dfs -ls
          – mapred job -kill <job_id>
          – bin/yarn-daemon.sh etc. for starting YARN daemons
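
The environment changes on this slide can be checked before an upgrade with a small script. The following is a minimal sketch, not part of the original deck: the function name `check_hadoop_env` and the wording of the notes are hypothetical, and only the variable names come from the slide.

```python
import os

# Variable names taken from the slide; everything else here is illustrative.
NEW_VARS = (
    "HADOOP_COMMON_HOME",
    "HADOOP_HDFS_HOME",
    "HADOOP_MAPRED_HOME",
    "HADOOP_YARN_HOME",
)


def check_hadoop_env(environ):
    """Return human-readable notes about Hadoop 1 vs Hadoop 2 variables."""
    notes = []
    if "HADOOP_HOME" in environ:
        notes.append("HADOOP_HOME is set: deprecated in Hadoop 2, but still works")
    for var in NEW_VARS:
        if var not in environ:
            notes.append(f"{var} is not set")
    return notes


# Example with a Hadoop 1 style environment (hypothetical path).
example_notes = check_hadoop_env({"HADOOP_HOME": "/opt/hadoop"})
```

To inspect the real shell environment, pass `os.environ` instead of the example dictionary.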
  • 16. Wire compatibility
      • Not RPC wire compatible with prior versions of Hadoop
      • Admins cannot mix and match versions
      • Clients must be updated to use the same version of Hadoop client library as the one installed on the cluster.
  • 17. Capacity management
      • Slots -> Dynamic memory based Resources
      • Total memory on each node
          – yarn.nodemanager.resource.memory-mb
      • Minimum and maximum sizes
          – yarn.scheduler.minimum-allocation-mb
          – yarn.scheduler.maximum-allocation-mb
      • MapReduce configs don’t change
          – mapreduce.map.memory.mb
          – mapreduce.map.java.opts
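
To make the move from slots to memory-based resources concrete, here is a rough sketch of how the settings above interact. It assumes the scheduler rounds each request up to a multiple of `yarn.scheduler.minimum-allocation-mb` and rejects requests above `yarn.scheduler.maximum-allocation-mb`; exact behaviour depends on the scheduler and release. The function names and the sample numbers are illustrative only.

```python
import math


def normalize_request(requested_mb, min_alloc_mb, max_alloc_mb):
    """Round a container request up to a multiple of the minimum allocation.

    Assumption: requests above the maximum allocation are rejected.
    """
    if requested_mb <= 0:
        raise ValueError("requested memory must be positive")
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds yarn.scheduler.maximum-allocation-mb")
    return math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb


def containers_per_node(node_mb, requested_mb, min_alloc_mb, max_alloc_mb):
    """How many containers of this size fit in yarn.nodemanager.resource.memory-mb."""
    granted_mb = normalize_request(requested_mb, min_alloc_mb, max_alloc_mb)
    return node_mb // granted_mb


# Illustrative values: 8 GB per node, 1 GB minimum, 8 GB maximum,
# and mapreduce.map.memory.mb set to 1536.
granted = normalize_request(1536, 1024, 8192)
fits = containers_per_node(8192, 1536, 1024, 8192)
```

With these sample numbers a 1536 MB map request is granted 2048 MB, so four such containers fit on the node, where Hadoop 1 would have used a fixed slot count instead.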
  • 18. Cluster Schedulers
      • Concepts stay the same
          – CapacityScheduler: Queues, User-limits
          – FairScheduler: Pools
          – Warning: Configuration names now have YARN-isms
      • Key enhancements
          – Hierarchical Queues for fine-grained control
          – Multi-resource scheduling (CPU, Memory etc.)
          – Online administration (add queues, ACLs etc.)
          – Support for long-lived services (HBase, Accumulo, Storm) (In progress)
          – Node Labels for fine-grained administrative controls (Future)
  • 19. Configuration
      • Watch those damn knobs!
      • Should work if you are using the previous configs in Common, HDFS and client side MapReduce configs
      • MapReduce server side is toast
          – No migration
          – Just use new configs
      • Past sins
          – From 0.21.x
          – Configuration names changed for better separation: client and server config names
          – Cleaning up naming: mapred.job.queue.name → mapreduce.job.queuename
      • Old user-facing, job related configs work as before but are deprecated
      • Configuration mappings exist
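
The configuration mappings mentioned above can be illustrated with a small translation table. This is a sketch only: the first entry comes from the slide, the other three are recalled from Hadoop's deprecated-properties list and should be verified against your release, and `migrate_config` is a hypothetical helper, not a Hadoop API.

```python
# Old name -> new name. Only the first pair appears on the slide; the rest
# are assumptions to be checked against the deprecated-properties list.
DEPRECATED_TO_NEW = {
    "mapred.job.queue.name": "mapreduce.job.queuename",
    "mapred.job.name": "mapreduce.job.name",
    "mapred.map.tasks": "mapreduce.job.maps",
    "mapred.reduce.tasks": "mapreduce.job.reduces",
}


def migrate_config(old_config):
    """Return (new_config, renamed) where renamed lists (old, new) key pairs."""
    migrated = {}
    renamed = []
    for key, value in old_config.items():
        new_key = DEPRECATED_TO_NEW.get(key, key)
        if new_key != key:
            renamed.append((key, new_key))
        migrated[new_key] = value
    return migrated, renamed


new_config, renamed_keys = migrate_config(
    {"mapred.job.queue.name": "etl", "dfs.replication": "3"}
)
```

Keys without a known mapping pass through unchanged, which matches the slide's point that old user-facing configs keep working.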
  • 20. Installation/Upgrade
      • Fresh install
      • Upgrading from an existing version
      • Fresh Install
          – Apache Ambari: Fully automated!
          – Traditional manual install of RPMs/Tarballs
      • Upgrade
          – Apache Ambari
          – Semi automated
          – Supplies scripts which take care of most things
          – Manual upgrade
  • 21. HDFS Pre-upgrade
      • Back up configuration files
      • Stop users!
      • Run fsck and fix any errors
          – hadoop fsck / -files -blocks -locations > /tmp/dfs-old-fsck-1.log
      • Capture the complete namespace
          – hadoop dfs -lsr / > dfs-old-lsr-1.log
      • Create a list of DataNodes in the cluster
          – hadoop dfsadmin -report > dfs-oldreport-1.log
      • Back up NameNode meta-data
          – dfs.name.dir/edits
          – dfs.name.dir/image/fsimage
          – dfs.name.dir/current/fsimage
          – dfs.name.dir/current/VERSION
      • Finalize the state of the filesystem
          – hadoop namenode -finalize
      • Other meta-data backup
          – Hive Metastore, HCat, Oozie
          – mysqldump
      • Save the namespace
          – hadoop dfsadmin -safemode enter
          – hadoop dfsadmin -saveNamespace
  • 22. HDFS Upgrade
      • Stop all services
      • Tarballs/RPMs
  • 23. HDFS Post-upgrade
      • Process liveness
      • Verify that all is well
          – NameNode goes out of safe mode: hdfs dfsadmin -safemode wait
      • File-System health
      • Compare from before
          – Node list
          – Full Namespace
      • You can start HDFS without finalizing the upgrade. When you are ready to discard your backup, you can finalize the upgrade.
          – hadoop dfsadmin -finalizeUpgrade
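
The "compare from before" step can be partly automated by diffing the namespace listing captured before the upgrade against one taken afterwards. The sketch below assumes each listing line ends with an absolute path preceded by a space, as in typical `-lsr` output; the function names are hypothetical, and the post-upgrade listing file is something you would have to capture yourself.

```python
def paths_from_lsr(listing_text):
    """Extract the set of absolute paths from recursive listing output.

    Assumption: the path is everything after the first " /" on each line,
    so paths containing spaces are kept whole. Other lines are skipped.
    """
    paths = set()
    for line in listing_text.splitlines():
        index = line.find(" /")
        if index != -1:
            paths.add(line[index + 1:])
    return paths


def compare_namespaces(before_text, after_text):
    """Return (missing, added): paths lost and paths gained after the upgrade."""
    before = paths_from_lsr(before_text)
    after = paths_from_lsr(after_text)
    return sorted(before - after), sorted(after - before)


# Small inline sample standing in for the two log files.
sample_before = (
    "drwxr-xr-x   - hdfs hdfs          0 2014-02-10 12:00 /user/a\n"
    "-rw-r--r--   3 hdfs hdfs       1024 2014-02-10 12:01 /user/a/part 0\n"
)
sample_after = (
    "drwxr-xr-x   - hdfs hdfs          0 2014-02-10 12:00 /user/a\n"
    "-rw-r--r--   3 hdfs hdfs         10 2014-02-11 09:00 /user/b\n"
)
missing, added = compare_namespaces(sample_before, sample_after)
```

In practice the two strings would be read from `dfs-old-lsr-1.log` and the new listing; an empty `missing` list is the result you hope to see before finalizing the upgrade.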
  • 24. MapReduce upgrade
      • Ask users to stop their thing
      • Stop the MR sub-system
      • Replace everything
  • 25. HBase Upgrade
      • Tarballs/RPMs
      • HBase 0.95 removed support for HFile V1
          – Before the actual upgrade, check if there are HFiles in V1 format using HFileV1Detector
      • /usr/lib/hbase/bin/hbase upgrade -execute
  • 26. Users: Guide to migrating your applications to Hadoop-2.x
  • 27. Migrating the Hadoop Stack
      • MapReduce
      • MR Streaming
      • Pipes
      • Pig
      • Hive
      • Oozie
  • 28. MapReduce Applications
      • Binary Compatibility of org.apache.hadoop.mapred APIs
          – Full binary compatibility for vast majority of users and applications
          – Nothing to do!
      • Submit your existing MR application jars directly to YARN via bin/hadoop
          <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
          </property>
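
The property on this slide normally lives in a site file under a `<configuration>` root element. As a hedged sketch, the snippet below generates that XML from a dictionary so the setting can be templated across clusters; `build_site_xml` is a hypothetical helper, and which file you write the result to (commonly `mapred-site.xml`) depends on your installation.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of name/value pairs as Hadoop-style configuration XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# The single setting from the slide that routes MapReduce jobs to YARN.
site_xml = build_site_xml({"mapreduce.framework.name": "yarn"})
```

The output is unindented XML on one line, which Hadoop's configuration loader does not care about; pretty-printing is left out to keep the sketch short.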
  • 29. MapReduce Applications contd.
      • Source Compatibility of org.apache.hadoop.mapreduce API
          – Minority of users
          – Proved to be difficult to ensure full binary compatibility with existing applications
      • Existing applications using mapreduce APIs are source compatible
      • Can run on YARN with no changes, need recompilation only
  • 30. MapReduce Applications contd.
      • MR Streaming applications
          – work without any changes
      • Pipes applications
          – will need recompilation
  • 31. MapReduce Applications contd.
      • Examples
          – Can run with minor tricks
      • Benchmarks
          – To compare 1.x vs 2.x
      • Things to do
          – Play with YARN
          – Compare performance
      • http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/
  • 32. MapReduce feature parity
      • Setup, cleanup tasks are no longer separate tasks
          – And we dropped the optionality (which was a hack anyways).
      • JobHistory
          – JobHistory file format changed to Avro/JSON based.
          – Rumen automatically recognizes the new format.
          – Parsing history files yourselves? Need to move to new parsers.
  • 33. User logs
      • Putting user-logs on DFS.
          – AM logs too!
          – While the job is running, logs are on the individual nodes
          – After that on DFS
      • Provide pretty printers and parsers for various log files
          – syslog, stdout, stderr
      • User logs directory with quotas beyond their current user directories
      • Logs expire after a month by default and get GCed.
  • 34. Application recovery
      • No more lost applications on the master restart!
          – Applications do not lose previously completed work
          – If AM crashes, RM will restart it from where it stopped
          – Applications can (WIP) continue to run while RM is down
          – No need to resubmit if RM restarts
      • Specifically for MR jobs
          – Changes to semantics of OutputCommitter
          – We fixed FileOutputCommitter, but if you have your own OutputCommitter, you need to care about application-recoverability
  • 35. JARs
      • No single hadoop-core jar
      • Common, hdfs and mapred jars separated
      • Projects completely mavenized and YARN has separate jars for API, client and server code
          – Good. You don’t link to server side code anymore
      • Some jars like avro, jackson etc. are upgraded to their later versions
          – If they have compatibility problems, you will have them too
          – You can override that behavior by putting your jars first in the classpath
  • 36. More features
      • Uber AM
          – Run small jobs inside the AM itself
          – No need for launching tasks.
          – Is seamless
          – JobClient will automatically determine if this is a small job.
      • Speculative tasks
          – Was not enabled by default in 1.x
          – Much better in 2.x, supported
      • No JVM-Reuse: Feature dropped
      • Netty based zero-copy shuffle
      • MiniMRCluster → MiniMRYarnCluster
  • 37. Web UI
      • Web UIs completely overhauled.
          – Rave reviews ;)
          – And some rotten tomatoes too
      • Functional improvements
          – capability to sort tables by one or more columns
          – filter rows incrementally in "real time".
      • Any user applications or tools that depend on the Web UI and extract data using screen-scraping will cease to function
          – Web services!
      • AM web UI, History server UI, RM UI work together
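
For tools that used to scrape the web UI, the web services route looks roughly like the sketch below. It assumes the ResourceManager REST endpoint `/ws/v1/cluster/apps` on the default web port 8088 and a JSON body shaped as `{"apps": {"app": [...]}}`; check the REST documentation for your release. The helper names and the sample payload are illustrative.

```python
import json
from urllib.request import Request, urlopen


def apps_url(rm_host, port=8088):
    """Build the cluster applications URL (host and port are assumptions)."""
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps"


def summarize_apps(payload):
    """Reduce the JSON payload to (id, name, state) tuples.

    Handles the case where the cluster has no applications and the
    "apps" field is null.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    return [(app.get("id"), app.get("name"), app.get("state")) for app in apps]


def fetch_apps(rm_host, port=8088, timeout=10):
    """Query the ResourceManager and return summarized applications."""
    request = Request(apps_url(rm_host, port), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return summarize_apps(json.load(response))


# Offline example using a hand-written payload instead of a live cluster.
sample_payload = {
    "apps": {"app": [{"id": "application_1_0001", "name": "wordcount", "state": "FINISHED"}]}
}
sample_summary = summarize_apps(sample_payload)
```

Parsing structured JSON this way avoids the breakage the slide warns about, since it does not depend on the HTML layout of the overhauled UI.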
  • 38. Apache Pig
      • One of the two major data processing applications in the Hadoop ecosystem
      • Existing Pig scripts that work with Pig 0.10.1 and beyond will work just fine on top of YARN!
      • Versions prior to pig-0.10.1 may not run directly on YARN
          – Please accept my sincere condolences!
  • 39. Apache Hive
      • Queries on Hive 0.10.0 and beyond will work without changes on top of YARN!
      • Hive 0.13 & beyond: Apache Tez!!
          – Interactive SQL queries at scale!
          – Hive + Stinger: Petabyte Scale SQL, in Hadoop – Alan Gates & Owen O’Malley 1.30pm Thu (2/13) at Ballroom F
  • 40. Apache Oozie
      • Existing Oozie workflows can start taking advantage of YARN in 0.23 and 2.x with Oozie 3.2.0 and above!
  • 41. Cascading & Scalding
      • Cascading 2.5 - Just works, certified!
      • Scalding too!
  • 42. Beyond upgrade: Where do I go from here?
  • 43. YARN Eco-system
      • Applications Powered by YARN
          – Apache Giraph – Graph Processing
          – Apache Hama – BSP
          – Apache Hadoop MapReduce – Batch
          – Apache Tez – Batch/Interactive
          – Apache S4 – Stream Processing
          – Apache Samza – Stream Processing
          – Apache Storm – Stream Processing
          – Apache Spark – Iterative applications
          – Elastic Search – Scalable Search
          – Cloudera Llama – Impala on YARN
          – DataTorrent – Data Analysis
          – HOYA – HBase on YARN
      • Frameworks Powered By YARN
          – Apache Twill
          – REEF by Microsoft
          – Spring support for Hadoop 2
      • There's an app for that... YARN App Marketplace!
  • 44. Summary
      • Apache Hadoop 2 is, at least, twice as good!
          – No, seriously!
      • Exciting journey with Hadoop for this decade…
          – Hadoop is no longer just HDFS & MapReduce
      • Architecture for the future
          – Centralized data and multi-variate applications
          – Possibility of exciting new applications and types of workloads
      • Admins
          – A bit of work
      • End-user
          – Mostly should just work as is
  • 45. YARN Book coming soon!
  • 46. Thank you!
      • Download Sandbox: Experience Apache Hadoop. Both 2.x and 1.x versions available!
        http://hortonworks.com/products/hortonworks-sandbox/
      • Questions?

Editor's Notes

  1. Graph processing – Giraph, Hama
     Stream processing – Samza, Storm, Spark, DataTorrent
     MapReduce
     Tez – fast query execution
     Weave/REEF – frameworks to help with writing applications
     List of some of the applications which already support YARN, in some form.
     Samza, Storm, S4 and DataTorrent are streaming frameworks.
     Various types of graph processing frameworks – Giraph and Hama are graph processing systems.
     There are some GitHub projects – caching systems, on-demand web-server spin up.
     Weave and REEF are frameworks on top of YARN to make writing applications easier.