MANAGING GROWTH IN PRODUCTION HADOOP
DEPLOYMENTS
Soam Acharya
@soamwork
Charles Wimmer
@cwimmer
Altiscale
@altiscale
HADOOP SUMMIT 2015
SAN JOSE
ALTISCALE : INFRASTRUCTURE NERDS
• Soam Acharya - Head of Application Engineering
• Formerly Chief Scientist @ Limelight OVP, Yahoo Research Engineer
• Charles Wimmer, Head of Operations
• Former Yahoo! & LinkedIn SRE
• Managed 40000 nodes in Hadoop clusters at Yahoo!
• Hadoop as a Service, built and managed by Big Data, SaaS, and
enterprise software veterans
• Yahoo!, Google, LinkedIn, VMware, Oracle, ...
SO, YOU’VE PUT TOGETHER YOUR FIRST HADOOP
DEPLOYMENT
● It’s now running production ETLs
CONGRATULATIONS!
BUT THEN ...
• Your data scientists get on the cluster and start building models
• Your BI team starts running interactive SQL on Hadoop queries ..
• Your mobile team starts sending RT events into the cluster ..
• You sign up more clients
• And the input data for your initial use case doubles ..
SOON, YOUR CLUSTER ...
AND YOU …
THE “SUCCESS DISASTER” SCENARIO
● Initial success
● Many subsequent use cases on cluster
● Cluster gets bogged down
WHY DO CLUSTERS FAIL?
• Failure categories:
1. Too much data
2. Too many jobs
3. Too many users
HOW TO EXTRICATE YOURSELF?
• Short term strategy:
• Get more resources for your cluster
• Expand cluster size!
• More headroom for longer term strategy
• Longer term strategy
LONGER TERM STRATEGY
• Can’t cover every scenario
• Per failure category:
• Selected pressure points (PPs)
• Can occur at different levels of Hadoop stack
• Identify and shore up pressure points
• Squeeze more capacity from cluster
HADOOP 2 STACK REMINDER
[Diagram: the Hadoop 2 stack, top to bottom]
• Application layer: Hive, SparkSQL, Cascading, Pig
• Execution framework: MR, Spark, Tez
• Core Hadoop layer: YARN (RM, NM) and HDFS (NN, DN)
• Machine level: OS and h/w (cpu, RAM, disk, network)
FAILURE CATEGORY 1 - TOO MUCH DATA
PP: HDFS at capacity
PP: Too many objects
[Diagram: Hadoop 2 stack]
PRESSURE POINT - HDFS AT CAPACITY
• Unpredictable cluster behavior
• Transient errors
• Hadoop daemons can’t save logs to HDFS
• Execution framework errors:
• Hive unable to run queries that create temp tables
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/tmp/hive-user/hive_2014-07-23_08-43-40_408_2604848084853512498-1/_task_tmp.-ext-10001/_tmp.000121_0
could only be replicated to 1 nodes instead of minReplication (=2). There are xx datanode(s)
running and no node(s) are excluded in this operation.
HDFS AT CAPACITY MITIGATION
● Use HDFS quotas!
hdfs dfsadmin -setSpaceQuota 113367670 /
● Quotas can be set per directory
● Cannot be set per user
● Protection against accidental cluster destabilization
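One detail worth keeping in mind when sizing a space quota: HDFS charges every replica of a block against it, so the number passed to -setSpaceQuota is raw bytes, not logical bytes. A minimal Python sketch of that arithmetic follows. The directory name, the 10 TiB target, and the replication factor of 3 are illustrative assumptions, not values from the talk, and the script only builds the command string rather than running it.

```python
def space_quota_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw-byte quota needed to hold `logical_bytes` of data at the given replication."""
    if logical_bytes < 0 or replication < 1:
        raise ValueError("logical_bytes must be >= 0 and replication >= 1")
    return logical_bytes * replication


def set_space_quota_command(directory: str, logical_bytes: int, replication: int = 3) -> str:
    """Build (but do not run) the dfsadmin command that sets the quota."""
    quota = space_quota_bytes(logical_bytes, replication)
    return f"hdfs dfsadmin -setSpaceQuota {quota} {directory}"


TIB = 1024 ** 4

# Hypothetical example: allow 10 TiB of user data under /data/research.
command = set_space_quota_command("/data/research", 10 * TIB)
print(command)
```

Keeping the replication factor explicit makes it harder to hand a team a quota that silently holds only a third of what they expected.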
TOO MANY OBJECTS
“Elephants are afraid of mice. Hadoop is afraid of small files.”
# of dirs + files
# of blocks
TOO MANY OBJECTS
● Memory pressure:
o Namenode heap: too many files + directories + objects in HDFS
o Datanode heap: too many blocks allocated per node
● Performance overhead
o Too much time spent on container creation and teardown
o More time spent in execution framework than actual application
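To get a feel for the NameNode memory pressure, the sketch below compares the number of namespace objects (directories, files, blocks) needed to store the same data as small files versus large files. The figure of roughly 150 bytes of heap per object is a commonly quoted rule of thumb and is used here purely as an assumption; the 128 MiB block size and the file sizes are also illustrative, and real heap usage depends on the Hadoop version and path lengths.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value
BLOCK_BYTES = 128 * 1024 ** 2   # assumed HDFS block size (128 MiB)


def namespace_objects(num_files: int, avg_file_bytes: int,
                      block_bytes: int = BLOCK_BYTES, num_dirs: int = 0) -> int:
    """Count directories + files + blocks the NameNode has to track."""
    blocks_per_file = max(1, math.ceil(avg_file_bytes / block_bytes))
    return num_dirs + num_files + num_files * blocks_per_file


def estimated_heap_bytes(objects: int) -> int:
    """Very rough NameNode heap estimate for the given object count."""
    return objects * BYTES_PER_OBJECT


MIB, GIB, TIB = 1024 ** 2, 1024 ** 3, 1024 ** 4

# The same 1 TiB of data, stored two different ways.
small = namespace_objects(num_files=TIB // MIB, avg_file_bytes=MIB)
large = namespace_objects(num_files=TIB // GIB, avg_file_bytes=GIB)

print(f"1 MiB files: {small} objects, ~{estimated_heap_bytes(small) // MIB} MiB heap")
print(f"1 GiB files: {large} objects, ~{estimated_heap_bytes(large) // MIB} MiB heap")
```

Under these assumptions the small-file layout needs a few hundred times more namespace objects for the same bytes, which is why compaction is the longer-term fix.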
WHERE ARE THE OBJECTS?
Use HDFS count:
hdfs dfs -count -q <directory name>
● Number of directories, files and bytes
● On per directory basis
Use fsimage files:
● Can be produced by NN
hdfs oiv -i <fsimage file> -o <output file>
● Detailed breakdown of the HDFS file system
● Hard!
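The output of hdfs dfs -count -q is easy to post-process. With -q, each line carries eight columns: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME. The Python sketch below ranks directories by file count and flags those whose average file size is small. The 16 MiB threshold and the sample lines are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CountRow:
    dir_count: int
    file_count: int
    content_size: int
    path: str

    @property
    def avg_file_bytes(self) -> float:
        """Average bytes per file; 0.0 for directories with no files."""
        return self.content_size / self.file_count if self.file_count else 0.0


def parse_count_q(line: str) -> CountRow:
    """Parse one line of `hdfs dfs -count -q` output (8 whitespace-separated columns)."""
    fields = line.split(None, 7)
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    # The first four columns are quota fields ("none"/"inf" when unset); skip them.
    return CountRow(int(fields[4]), int(fields[5]), int(fields[6]), fields[7].strip())


def small_file_dirs(lines, threshold_bytes=16 * 1024 ** 2):
    """Directories whose average file size is below the threshold, most files first."""
    rows = [parse_count_q(line) for line in lines if line.strip()]
    flagged = [r for r in rows if r.file_count and r.avg_file_bytes < threshold_bytes]
    return sorted(flagged, key=lambda r: r.file_count, reverse=True)


# Hypothetical sample output, not taken from a real cluster.
sample = [
    "        none             inf            none             inf           12        48000         2400000000 /data/events",
    "        none             inf            none             inf            3           20         5368709120 /data/warehouse",
]

for row in small_file_dirs(sample):
    print(f"{row.path}: {row.file_count} files, avg {row.avg_file_bytes:.0f} bytes")
```

Running the count per top-level directory and feeding the lines through a filter like this gives a first list of compaction candidates without touching the fsimage.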
TOO MANY OBJECTS - MITIGATION
• Short term:
• Increase NN/DN heapsizes
• Node physical limits
• Increase cluster node count
• Longer term:
• Find and compact
• Coalesce multiple files
• Use HAR
COALESCE MULTIPLE FILES I
• Hadoop streaming job
• Whatever Hadoop can read on cluster
• LZO output
hadoop \
  jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=40 \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
  -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
  -D mapreduce.reduce.memory.mb=8192 \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -input "$IN_DIR" \
  -output "$DIR"
COALESCE MULTIPLE FILES II
● Build index for LZO output
● Tell hadoop where the splits are
hadoop \
  jar /opt/hadoop/share/hadoop/common/lib/hadoop-lzo-*.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer \
  "$DIR"
COMBINE FILES INTO HAR
• HAR: Hadoop Archive
hadoop archive -archiveName <archive name>.har -p <HDFS parent path> <dir1> <dir2> ... <outputDir>
• MR job to produce archive
• Watch out for replication factor
• On versions 2.4 and earlier, source files are set to a default replication factor of 10
• Not good for small clusters
• -r <replication factor> option added in 2.6
COMBINE FILES INTO HAR
• HAR archives are useful if you want to preserve the file/directory structure of input
[alti_soam@desktop ~]$ hdfs dfs -ls har:///tmp/alti_soam_test.har
Found 3 items
drwxr-xr-x - alti_soam hdfs 0 2013-09-03 22:44 har:/tmp/alti_soam_test.har/examples
drwxr-xr-x - alti_soam hdfs 0 2013-11-16 03:53 har:/tmp/alti_soam_test.har/test-pig-avro-dir
drwxr-xr-x - alti_soam hdfs 0 2013-11-12 22:23 har:/tmp/alti_soam_test.har/test-camus
FAILURE CATEGORY 2 - TOO MANY JOBS
“Help! My job is stuck!”
[Diagram: Hadoop 2 stack, annotated with the symptoms below]
Jobs don’t make progress
Jobs don’t start
“Right” jobs finish last
Mixed profile job issues
TOO MANY JOBS REMEDIATION
• Need to quantify job processing on cluster
• Hadoop job usage analysis:
• Resource Manager logs
• History Server logs, job history files
• APIs
• Analysis goals:
• Queue usage => cluster utilization
• Time spent by jobs/containers in waiting state
• Job level stats
• # of jobs, type of jobs …
• Queue tuning
HADOOP LOGS - RESOURCE MANAGER
• job stats (outcome, duration, startdate)
• queue used
• container:
• number allocated
• Memory, vCPU allocation
• state transition times
• outcome
HADOOP LOGS - JOBHISTORY FILES
• Configure history server to produce files
• Created for every MR job
• HDFS data volume processed
• for mappers/reducers:
• CPU time
• memory used
• start/end time
• max parallel maps, reduces
• GC time
• not available for Tez/Spark:
• Use timeline server for better logging
• Timeline server dependencies
HADOOP LOG ANALYSIS
• Analysis goals:
• Queue usage => cluster utilization
• Time spent by jobs/containers in waiting state
• Job level stats:
• # of jobs
• Failed/killed vs successful
• Type of jobs
• Container level stats
• How to analyze logs?
• Custom scripts
• Parse job history files, hadoop logs
• Data warehouse
• Visualization
• Not much by the way of publicly available tools
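As one example of a custom script, the ResourceManager REST endpoint /ws/v1/cluster/apps returns a JSON document whose application entries include fields such as user, queue, finalStatus, and applicationType. The Python sketch below summarizes an already-fetched payload into the job-level stats listed above. The sample payload is invented for illustration, and fetching it from a live ResourceManager is left out.

```python
from collections import Counter


def summarize_apps(payload: dict) -> dict:
    """Count applications by queue, final status, type, and user."""
    # The ResourceManager returns {"apps": null} when there are no applications.
    apps = (payload.get("apps") or {}).get("app", [])
    return {
        "total": len(apps),
        "by_queue": Counter(a["queue"] for a in apps),
        "by_status": Counter(a["finalStatus"] for a in apps),
        "by_type": Counter(a["applicationType"] for a in apps),
        "by_user": Counter(a["user"] for a in apps),
    }


# Hypothetical payload shaped like a /ws/v1/cluster/apps response.
sample_payload = {
    "apps": {
        "app": [
            {"id": "application_1_0001", "user": "etl", "queue": "production",
             "finalStatus": "SUCCEEDED", "applicationType": "MAPREDUCE"},
            {"id": "application_1_0002", "user": "alice", "queue": "research",
             "finalStatus": "FAILED", "applicationType": "SPARK"},
            {"id": "application_1_0003", "user": "alice", "queue": "research",
             "finalStatus": "SUCCEEDED", "applicationType": "SPARK"},
        ]
    }
}

summary = summarize_apps(sample_payload)
print(summary["total"], dict(summary["by_queue"]), dict(summary["by_status"]))
```

Collecting one such summary per day and loading it into a warehouse table is enough to draw per-queue and per-user trend plots.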
SAMPLE PLOT: CONTAINER WAIT TIME AND UTILIZATION PER QUEUE
[Plot panels: container wait times, queue utilization, vCore usage]
SAMPLE PLOT: DAILY JOB TYPE AND STATUS
SAMPLE PLOT: DAILY JOB BREAKDOWN BY USER
QUEUE TUNING STRATEGY
• Determine how you want your cluster to behave
• Pick scheduler depending on behavior
• Real world examples:
• Production jobs must get resources
• Dedicate a certain portion of the cluster regardless of cluster state (idle, at capacity)
• Data loading jobs
• Constrain to a small portion of cluster to preserve network bandwidth
• Research jobs:
• Small portion of cluster at peak
• Large portion of cluster when idle
• Divide up cluster amongst business units
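The real-world examples above can be written down as a queue plan before touching any scheduler configuration. The Python sketch below assumes the Capacity Scheduler, where the capacities of sibling queues must add up to 100 and a queue's maximum-capacity caps how far it can grow. The queue names and percentages are hypothetical, and the script only produces property names and values; it does not write capacity-scheduler.xml.

```python
def capacity_properties(queues: dict) -> dict:
    """Validate a queue plan and turn it into Capacity Scheduler properties."""
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, q in queues.items():
        if q["max"] < q["capacity"]:
            raise ValueError(f"{name}: maximum-capacity is below capacity")
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(q["capacity"])
        props[f"{prefix}.maximum-capacity"] = str(q["max"])
    return props


# Hypothetical plan following the examples on the slide.
plan = {
    "production": {"capacity": 70, "max": 100},  # must always get resources
    "dataload":   {"capacity": 10, "max": 10},   # constrained to protect the network
    "research":   {"capacity": 20, "max": 90},   # small at peak, large when idle
}

props = capacity_properties(plan)
for key in sorted(props):
    print(f"{key} = {props[key]}")
```

Checking the plan in code catches the most common mistake, capacities that do not add up, before the ResourceManager rejects the configuration.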
QUEUE TUNING - SCHEDULER BASICS
Fair Scheduler vs. Capacity Scheduler
• Resource allocation
  • Fair Scheduler: queues get an approximately equal share of resources over time
  • Capacity Scheduler: queues are allocated a fraction of the cluster
• Which resource?
  • Fair Scheduler: memory, CPU (optional)
  • Capacity Scheduler: memory, CPU (optional)
• Inter-queue constraints
  • Fair Scheduler: max and min shares (X MB, Y vcores)
  • Capacity Scheduler:
    - Min capacity => guaranteed fraction of the entire cluster when busy
    - Max capacity => upper limit on the fraction of the entire cluster a queue can use when idle
• Intra-queue resource sharing
  • Fair Scheduler: pick a policy: FIFO, FAIR, Dominant Resource Fairness
  • Capacity Scheduler: tunable policy:
    - Many users: 1st x users get 1/x
    - Many jobs, single user: FIFO
MORE ON EACH SCHEDULER
• Fair Scheduler:
• Hadoop Summit 2009
• Job Scheduling With the Fair and Capacity Schedulers - Matei Zaharia
• Capacity Scheduler:
• Hadoop Summit 2015 (5/9, 12:05pm)
• Towards SLA-based Scheduling on YARN - Sumeet Singh, Nathan Roberts
TOO MANY JOBS - MIXED PROFILE JOBS
• Jobs may have different memory profiles
• Standard MR jobs: small container sizes
• Newer execution frameworks (Spark, H2O):
• Large container sizes
• All or nothing scheduling
• A job with many little tasks
• Can starve jobs that require large containers
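The starvation effect comes from the fact that a YARN container has to fit on a single node; free memory on different nodes cannot be pooled. The Python sketch below shows a cluster with plenty of free memory in aggregate that still cannot place one large container. The node size, task size, and container size are illustrative numbers only.

```python
def aggregate_free_mb(free_mb_per_node):
    """Total free memory across the cluster."""
    return sum(free_mb_per_node)


def can_place(free_mb_per_node, container_mb):
    """A container must fit entirely on one node."""
    return any(free >= container_mb for free in free_mb_per_node)


NODE_MB = 8192    # hypothetical memory per NodeManager
SMALL_MB = 1536   # hypothetical MR task container
LARGE_MB = 6144   # hypothetical Spark executor container

# Four nodes, each already running three small tasks.
free = [NODE_MB - 3 * SMALL_MB for _ in range(4)]

print("free per node:", free)
print("aggregate free:", aggregate_free_mb(free), "MB")
print("can place large container:", can_place(free, LARGE_MB))
print("can place small container:", can_place(free, SMALL_MB))
```

As long as small tasks keep arriving and filling the gaps, no node ever frees enough memory in one piece, so the large-container job waits even though the cluster looks half empty.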
TOO MANY JOBS - MIXED PROFILE JOBS MITIGATION
• Reduce container sizes if possible
• Always start with the lowest container sizes
• Node labels (YARN-2492) and gang scheduling (YARN-624)
• More details:
• Running Spark and MapReduce Together In Production - David Chaiken
• Hadoop Summit 2015, 06/09, 2:35pm
TOO MANY JOBS - HARDENING YOUR CLUSTER
• Cluster configuration audit
• Container vs heap size
• Appropriate kernel level configuration
• Turn on Linux Container Executor
• Enable Hadoop Security
• Use operating system cgroups
• Protect Hadoop daemons
• Cage user processes:
• Impala
• Limits on what Hadoop can control:
• CPU
• But not memory, network & disk BW
mapreduce.map.memory.mb = 1536
mapreduce.map.java.opts = -Xmx2560m
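Taken together, the two settings above ask for a 2560 MB JVM heap inside a 1536 MB container, which looks like the kind of container vs heap mismatch a configuration audit should catch. A minimal Python sketch of such a check follows. The 0.8 heap-to-container ratio is an assumed guideline for leaving non-heap headroom, not a value from the talk, and the function names are hypothetical.

```python
import re

# Conversion factors from the -Xmx unit suffix to megabytes.
_TO_MB = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}


def xmx_mb(java_opts: str):
    """Extract the -Xmx value from a java.opts string, in MB (None if absent)."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if not match:
        return None
    return int(match.group(1)) * _TO_MB[match.group(2).lower()]


def audit_heap(container_mb: int, java_opts: str, max_ratio: float = 0.8) -> str:
    """Classify a container size / JVM heap pair."""
    heap = xmx_mb(java_opts)
    if heap is None:
        return "no -Xmx set"
    if heap > container_mb:
        return "heap exceeds container"
    if heap > container_mb * max_ratio:
        return "heap leaves little non-heap headroom"
    return "ok"


# The pair shown on the slide.
print(audit_heap(1536, "-Xmx2560m"))
```

Running the same check over the reduce-side settings and any per-job overrides covers the rest of the audit.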
FAILURE CATEGORY 3 - TOO MANY USERS
Data access control
Inter-departmental resource contention (too many jobs)
[Diagram: Hadoop 2 stack]
TOO MANY USERS - QUEUE ACCESS
● Use queue ACLs
o restrict which users can submit jobs to a queue
o per queue administrator roles:
- submit job
- administer job
o restrict whether users can view applications in another queue
DATA ACCESS CONTROL
• By default, Hadoop supports UNIX style file permissions
• Easy to circumvent
HADOOP_USER_NAME=hdfs hdfs dfs -rm /priv/data
• Use Kerberos
DATA ACCESS CONTROL - ACCOUNTABILITY
• HDFS Audit logs
• Produced by NameNode
015-02-24 20:59:45,382 INFO FSNamesystem.audit:
allowed=true ugi=soam (auth:SIMPLE) ip=/10.251.255.181
cmd=delete src=/hive/what_a_con.db
dst=/user/soam/.Trash/Current/hive/what_a_con.db
perm=soam:hiveusers:rwxrwxr-x
“Who deleted that file?”
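Answering that question from the audit log mostly means splitting each entry into its key=value fields. The Python sketch below parses one entry and filters for deletes. The sample line reuses the fields of the entry shown above with the timestamp left out, and the regular expression assumes that values never contain whitespace followed by another key=; that holds for this entry but is an assumption in general.

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the line.
_FIELD = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_audit(line: str) -> dict:
    """Split one HDFS audit log entry into a dict of its fields."""
    _, _, payload = line.partition("FSNamesystem.audit:")
    return {key: value.strip() for key, value in _FIELD.findall(payload)}


def deletions(lines):
    """Yield (user, path) for every delete command in the given log lines."""
    for line in lines:
        record = parse_audit(line)
        if record.get("cmd") == "delete":
            yield record["ugi"], record["src"]


sample = (
    "INFO FSNamesystem.audit: allowed=true ugi=soam (auth:SIMPLE) "
    "ip=/10.251.255.181 cmd=delete src=/hive/what_a_con.db "
    "dst=/user/soam/.Trash/Current/hive/what_a_con.db "
    "perm=soam:hiveusers:rwxrwxr-x"
)

record = parse_audit(sample)
print(record["ugi"], record["cmd"], record["src"])
print(list(deletions([sample])))
```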
SQUEEZE MORE CAPACITY FROM CLUSTER
[Diagram: Hadoop 2 stack (application layer, execution framework, core Hadoop layer)]
• Targeted upgrades, optimizations
SQUEEZE MORE CAPACITY FROM CLUSTER
• Optimizations:
• Application layer:
• Query optimizations, algorithmic level optimizations
• Upgrading:
• Execution Framework:
• Tremendous performance improvements in Hive/Tez, Spark over the past two years
• Pig, Cascading all continue to improve
• Hadoop layer:
• Recent focus on security, stability
• Recommendation:
• Focus on upgrading execution framework
QUESTIONS? COMMENTS?
47

More Related Content

What's hot

Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drilltshiran
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 

What's hot (20)

Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Similar to Managing growth in Production Hadoop Deployments

Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleData Con LA
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleBig Data Joe™ Rossi
 
Facing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopFacing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopfann wu
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudRogue Wave Software
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 

Similar to Managing growth in Production Hadoop Deployments (20)

Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
 
Facing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoopFacing enterprise specific challenges – utility programming in hadoop
Facing enterprise specific challenges – utility programming in hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Top 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloudTop 10 lessons learned from deploying hadoop in a private cloud
Top 10 lessons learned from deploying hadoop in a private cloud
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Managing growth in Production Hadoop Deployments

  • 1. MANAGING GROWTH IN PRODUCTION HADOOP DEPLOYMENTS Soam Acharya @soamwork Charles Wimmer @cwimmer Altiscale @altiscale HADOOP SUMMIT 2015 SAN JOSE
  • 2. ALTISCALE : INFRASTRUCTURE NERDS • Soam Acharya - Head of Application Engineering • Formerly Chief Scientist @ Limelight OVP, Yahoo Research Engineer • Charles Wimmer, Head of Operations • Former Yahoo! & LinkedIn SRE • Managed 40000 nodes in Hadoop clusters at Yahoo! • Hadoop as a Service, built and managed by Big Data, SaaS, and enterprise software veterans • Yahoo!, Google, LinkedIn, VMWare, Oracle, ...
  • 3. SO, YOU’VE PUT TOGETHER YOUR FIRST HADOOP DEPLOYMENT ● It’s now running production ETLs
  • 5. BUT THEN ... • Your data scientists get on the cluster and start building models
  • 6. BUT THEN ... • Your data scientists get on the cluster and start building models • Your BI team starts running interactive SQL on Hadoop queries ..
  • 7. BUT THEN ... • Your data scientists get on the cluster and start building models • Your BI team starts running interactive SQL on Hadoop queries .. • Your mobile team starts sending RT events into the cluster ..
  • 8. BUT THEN ... • Your data scientists get on the cluster and start building models • Your BI team starts running interactive SQL on Hadoop queries .. • Your mobile team starts sending RT events into the cluster .. • You sign up more clients • And the input data for your initial use case doubles ..
  • 11. THE “SUCCESS DISASTER” SCENARIO ● Initial success ● Many subsequent use cases on cluster ● Cluster gets bogged down
  • 12. WHY DO CLUSTERS FAIL? • Failure categories: 1. Too much data 2. Too many jobs 3. Too many users
  • 13. HOW TO EXTRICATE YOURSELF? • Short term strategy: • Get more resources for your cluster • Expand cluster size! • More headroom for longer term strategy • Longer term strategy
  • 14. LONGER TERM STRATEGY • Can’t cover every scenario • Per failure category: • Selected pressure points (PPs) • Can occur at different levels of Hadoop stack • Identify and shore up pressure points • Squeeze more capacity from cluster
  • 15. HADOOP 2 STACK REMINDER [Stack diagram, top to bottom. Application Layer: Hive, SparkSQL, Cascading, Pig. Execution Framework: MR, Spark, Tez. Core Hadoop Layer: YARN (RM, NM), HDFS (NN, DN). Machine Level: OS, h/w (cpu, RAM, disk, network).]
  • 16. FAILURE CATEGORY 1 - TOO MUCH DATA • PP: HDFS at capacity • PP: Too many objects [Hadoop 2 stack diagram annotated with these pressure points]
  • 17. PRESSURE POINT - HDFS AT CAPACITY
      • Unpredictable cluster behavior
      • Transient errors
      • Hadoop daemons can’t save logs to HDFS
      • Execution framework errors:
        • Hive unable to run queries that create temp tables
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hive-user/hive_2014-07-23_08-43-40_408_2604848084853512498-1/_task_tmp.-ext-10001/_tmp.000121_0 could only be replicated to 1 nodes instead of minReplication (=2). There are xx datanode(s) running and no node(s) are excluded in this operation.
  • 18. HDFS AT CAPACITY MITIGATION
      ● Use HDFS quotas!
          hdfs dfsadmin -setSpaceQuota 113367670 /
      ● Quotas can be set per directory
      ● Cannot be set per user
      ● Protection against accidental cluster destabilization
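One detail worth keeping in mind when picking a quota value: HDFS charges a space quota against replicated bytes, so every replica of a block counts. The sketch below is illustrative only. The helper names, the `/user/example` directory and the default replication of 3 are assumptions, not part of the talk; only the `hdfs dfsadmin -setSpaceQuota` command itself comes from the slide.

```python
def space_quota_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw quota needed to store `logical_bytes` of data at a given replication.

    The default of 3 is an assumption; use your cluster's actual setting.
    """
    if logical_bytes < 0:
        raise ValueError("logical_bytes must be non-negative")
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return logical_bytes * replication


def set_space_quota_command(directory: str, logical_bytes: int,
                            replication: int = 3) -> str:
    """Build the dfsadmin command line for a directory (hypothetical helper)."""
    quota = space_quota_bytes(logical_bytes, replication)
    return f"hdfs dfsadmin -setSpaceQuota {quota} {directory}"


if __name__ == "__main__":
    # Allow roughly 10 GiB of user data under a hypothetical directory.
    print(set_space_quota_command("/user/example", 10 * 1024**3))
```

Generating the command rather than typing the number by hand makes it harder to forget the replication multiplier.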
  • 19. TOO MANY OBJECTS “Elephants are afraid of mice. Hadoop is afraid of small files.” [Plots: # of dirs + files; # of blocks]
  • 20. TOO MANY OBJECTS ● Memory pressure: o Namenode heap: too many files + directories + objects in HDFS o Datanode heap: too many blocks allocated per node ● Performance overhead o Too much time spent on container creation and teardown o More time spent in execution framework than actual application
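To get a feel for the NameNode memory pressure, a back-of-the-envelope estimate can help. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file, directory or block. That figure is an assumption, not something stated in the talk, and real usage varies by version and configuration.

```python
# Assumption: ~150 bytes of NameNode heap per namespace object.
# This is a widely quoted rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(files: int, dirs: int, blocks: int,
                                 bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Rough heap needed to hold the namespace in memory."""
    return (files + dirs + blocks) * bytes_per_object


if __name__ == "__main__":
    # Hypothetical namespace: 10 million single-block files in 100k directories.
    estimate = estimate_namenode_heap_bytes(10_000_000, 100_000, 10_000_000)
    print(f"approx. {estimate / 1024**3:.1f} GiB of heap")
```

The estimate scales with object count rather than data volume, which is why many small files hurt more than a few large ones.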
  • 21. WHERE ARE THE OBJECTS?
      Use HDFS count:
          hdfs dfs -count -q <directory name>
      ● Number of directories, files and bytes
      ● On per directory basis
      Use fsimage files:
      ● Can be produced by NN
          hdfs oiv -i <fsimage file> -o <output file>
      ● Detailed breakdown of the HDFS file system
      ● Hard!
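The output of `hdfs dfs -count -q` is easy to post-process. The sketch below parses one output line, assuming the usual eight-column layout (name quota, remaining name quota, space quota, remaining space quota, directory count, file count, content size, path). The sample line and the `/data/events` path are made up for illustration.

```python
from typing import NamedTuple, Optional


class DirCount(NamedTuple):
    name_quota: Optional[int]   # None when no quota is set
    space_quota: Optional[int]  # None when no quota is set
    dir_count: int
    file_count: int
    content_size: int           # bytes, before replication
    path: str


def _quota(field: str) -> Optional[int]:
    """Unset quotas are printed as 'none' or 'inf'."""
    return None if field in ("none", "inf") else int(field)


def parse_count_line(line: str) -> DirCount:
    """Parse one line of `hdfs dfs -count -q` output."""
    parts = line.split(None, 7)  # the path is last and may contain spaces
    if len(parts) != 8:
        raise ValueError(f"expected 8 columns, got {len(parts)}")
    quota, _rem_quota, space_quota, _rem_space, dirs, files, size, path = parts
    return DirCount(_quota(quota), _quota(space_quota),
                    int(dirs), int(files), int(size), path.strip())


# Hypothetical sample line, not taken from a real cluster.
SAMPLE_LINE = "        none             inf            none             inf           12         3400        52428800 /data/events"

if __name__ == "__main__":
    count = parse_count_line(SAMPLE_LINE)
    avg = count.content_size / max(count.file_count, 1)
    print(f"{count.path}: {count.file_count} files, avg {avg:.0f} bytes/file")
```

A low average file size relative to the HDFS block size is a quick signal that a directory is a candidate for compaction.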
  • 22. TOO MANY OBJECTS - MITIGATION • Short term: • Increase NN/DN heapsizes • Node physical limits • Increase cluster node count • Longer term: • Find and compact • Coalesce multiple files • Use HAR
  • 23. COALESCE MULTIPLE FILES I
      • Hadoop streaming job
      • Whatever Hadoop can read on cluster
      • LZO output
          hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
            -D mapreduce.job.reduces=40 \
            -D mapred.output.compress=true \
            -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
            -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
            -D mapreduce.reduce.memory.mb=8192 \
            -mapper /bin/cat \
            -reducer /bin/cat \
            -input $IN_DIR \
            -output $DIR
  • 24. COALESCE MULTIPLE FILES II
      ● Build index for LZO output
      ● Tell hadoop where the splits are
          hadoop jar /opt/hadoop/share/hadoop/common/lib/hadoop-lzo-*.jar com.hadoop.compression.lzo.DistributedLzoIndexer $DIR
  • 25. COMBINE FILES INTO HAR
      • HAR: Hadoop Archive
          hadoop archive -archiveName <archive name>.har -p <HDFS parent path> <dir1> <dir2> ... <outputDir>
      • MR job to produce archive
      • Watch out for replication factor
      • On versions 2.4 and earlier, source files are set to a default replication factor of 10
      • Not good for small clusters
      • -r <replication factor> option added in 2.6
  • 26. COMBINE FILES INTO HAR
      • HAR archives are useful if you want to preserve the file/directory structure of input
          [alti_soam@desktop ~]$ hdfs dfs -ls har:///tmp/alti_soam_test.har
          Found 3 items
          drwxr-xr-x - alti_soam hdfs 0 2013-09-03 22:44 har:/tmp/alti_soam_test.har/examples
          drwxr-xr-x - alti_soam hdfs 0 2013-11-16 03:53 har:/tmp/alti_soam_test.har/test-pig-avro-dir
          drwxr-xr-x - alti_soam hdfs 0 2013-11-12 22:23 har:/tmp/alti_soam_test.har/test-camus
  • 27. FAILURE CATEGORY 2 - TOO MANY JOBS “Help! My job is stuck!” • Jobs don’t make progress • Jobs don’t start • “Right” jobs finish last • Mixed profile job issues [Hadoop 2 stack diagram annotated with these pressure points]
  • 28. TOO MANY JOBS REMEDIATION • Need to quantify job processing on cluster • Hadoop job usage analysis: • Resource Manager logs • History Server logs, job history files • APIs • Analysis goals: • Queue usage => cluster utilization • Time spent by jobs/containers in waiting state • Job level stats • # of jobs, type of jobs … • Queue tuning
  • 29. HADOOP LOGS - RESOURCE MANAGER • job stats (outcome, duration, startdate) • queue used • container: • number allocated • Memory, vCPU allocation • state transition times • outcome
  • 30. HADOOP LOGS - JOBHISTORY FILES • Configure history server to produce files • Created for every MR job • HDFS data volume processed • for mappers/reducers: • CPU time • memory used • start/end time • max parallel maps, reduces • GC time • not available for Tez/Spark: • Use timeline server for better logging • Timeline server dependencies
  • 31. HADOOP LOG ANALYSIS • Analysis goals: • Queue usage => cluster utilization • Time spent by jobs/containers in waiting state • Job level stats: • # of jobs • Failed/killed vs successful • Type of jobs • Container level stats • How to analyze logs? • Custom scripts • Parse job history files, hadoop logs • Data warehouse • Visualization • Not much by way of publicly available tools
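Once container state transitions have been extracted from the Resource Manager logs, the per-queue wait time reduces to a small aggregation. The sketch below assumes the parsing has already been done and that each record carries a queue name plus request and allocation timestamps in seconds. The record layout and field names are hypothetical, not a Hadoop log format.

```python
from collections import defaultdict
from statistics import mean


def mean_wait_by_queue(records):
    """Average seconds a container waited between request and allocation, per queue.

    Each record is a dict with the hypothetical keys
    'queue', 'requested_at' and 'allocated_at'.
    """
    waits = defaultdict(list)
    for record in records:
        wait = record["allocated_at"] - record["requested_at"]
        waits[record["queue"]].append(wait)
    return {queue: mean(values) for queue, values in waits.items()}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        {"queue": "prod", "requested_at": 0, "allocated_at": 4},
        {"queue": "prod", "requested_at": 10, "allocated_at": 12},
        {"queue": "research", "requested_at": 0, "allocated_at": 30},
    ]
    for queue, wait in sorted(mean_wait_by_queue(sample).items()):
        print(f"{queue}: mean wait {wait} s")
```

Plotting this value over time per queue gives a view similar in spirit to a container wait time chart, and a rising trend is an early hint that queue limits need tuning.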
  • 32. SAMPLE PLOT: CONTAINER WAIT TIME AND UTILIZATION PER QUEUE [Plot panels: container wait times; queue utilization; vCore usage]
  • 33. SAMPLE PLOT: DAILY JOB TYPE AND STATUS
  • 35. QUEUE TUNING STRATEGY • Determine how you want your cluster to behave • Pick scheduler depending on behavior • Real world examples: • Production jobs must get resources • Dedicate a certain portion of the cluster regardless of cluster state (idle, at capacity) • Data loading jobs • Constrain to a small portion of cluster to preserve network bandwidth • Research jobs: • Small portion of cluster at peak • Large portion of cluster when idle • Divide up cluster amongst business units
  • 36. QUEUE TUNING - SCHEDULER BASICS
      Resource Allocation
        Fair Scheduler: Get approx. equal share of resources over time
        Capacity Scheduler: Queues are allocated a fraction of the cluster
      Which Resource?
        Fair Scheduler: Memory, CPU (optional)
        Capacity Scheduler: Memory, CPU (optional)
      Inter-queue Constraints
        Fair Scheduler: Max & Min Shares: X MB, Y vcores
        Capacity Scheduler: Min capacity => guaranteed fraction of entire cluster when busy; Max capacity => guaranteed fraction of entire cluster when idle
      Intra-queue Resource Sharing
        Fair Scheduler: Pick policy: FIFO, FAIR, Dominant Resource Fairness
        Capacity Scheduler: Tunable policy: many users: 1st x users get 1/x; many jobs, single user: FIFO
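Of the intra-queue policies listed, Dominant Resource Fairness is the least self-explanatory. The idea is that each user's "dominant share" is the largest fraction of any single resource they hold, and the next allocation goes to whoever has the smallest dominant share. The sketch below illustrates only that idea. It is not the YARN implementation, and the resource names and numbers are made up.

```python
def dominant_share(usage: dict, capacity: dict) -> float:
    """Largest fraction of any one cluster resource held by this user."""
    return max(usage.get(resource, 0) / total
               for resource, total in capacity.items())


def next_to_schedule(usages: dict, capacity: dict) -> str:
    """Pick the user with the lowest dominant share (ties broken arbitrarily)."""
    return min(usages, key=lambda user: dominant_share(usages[user], capacity))


if __name__ == "__main__":
    # Hypothetical cluster and usage figures.
    capacity = {"memory_mb": 100_000, "vcores": 100}
    usages = {
        "etl": {"memory_mb": 40_000, "vcores": 10},  # memory-dominant
        "ml": {"memory_mb": 10_000, "vcores": 30},   # CPU-dominant
    }
    for user, usage in usages.items():
        print(user, round(dominant_share(usage, capacity), 2))
    print("next allocation goes to:", next_to_schedule(usages, capacity))
```

Comparing dominant shares instead of memory alone keeps a CPU-heavy workload from looking "small" just because it uses little RAM.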
  • 37. MORE ON EACH SCHEDULER • Fair Scheduler: • Hadoop Summit 2009 • Job Scheduling With the Fair and Capacity Schedulers - Matei Zaharia • Capacity Scheduler: • Hadoop Summit 2015 (5/9, 12:05pm) • Towards SLA-based Scheduling on YARN - Sumeet Singh, Nathan Roberts
  • 38. TOO MANY JOBS - MIXED PROFILE JOBS • Jobs may have different memory profiles • Standard MR jobs: small container sizes • Newer execution frameworks (Spark, H2O): • Large container sizes • All or nothing scheduling • A job with many little tasks • Can starve jobs that require large containers
  • 39. TOO MANY JOBS - MIXED PROFILE JOBS MITIGATION • Reduce container sizes if possible • Always start with the lowest container sizes • Node labels (YARN-2492) and gang scheduling (YARN-624) • More details: • Running Spark and MapReduce Together In Production - David Chaiken • Hadoop Summit 2015, 06/09, 2:35pm
  • 40. TOO MANY JOBS - HARDENING YOUR CLUSTER
      • Cluster configuration audit
      • Container vs heap size
          mapreduce.map.memory.mb = 1536
          mapreduce.map.java.opts = -Xmx2560m
      • Appropriate kernel level configuration
      • Turn on Linux Container Executor
      • Enable Hadoop Security
      • Use operating system cgroups
      • Protect Hadoop daemons
      • Cage user processes:
      • Impala
      • Limits on what Hadoop can control:
      • CPU
      • But not memory, network & disk BW
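The container vs heap size item lends itself to an automated check. In the pair of settings shown on the slide, the JVM heap (`-Xmx2560m`) is larger than the container (`1536` MB), so the task can exceed its container limit. The sketch below flags such pairs. The 0.8 headroom factor is an assumption (a common rule of thumb that leaves room for non-heap JVM memory), and the function names are hypothetical.

```python
import re

_XMX = re.compile(r"-Xmx(\d+)([kKmMgG]?)")
# Multipliers to convert a -Xmx value to megabytes; no suffix means bytes.
_UNIT_TO_MB = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}


def xmx_mb(java_opts: str) -> float:
    """Extract the maximum heap size in MB from a JVM options string."""
    match = _XMX.search(java_opts)
    if match is None:
        raise ValueError("no -Xmx setting found")
    return int(match.group(1)) * _UNIT_TO_MB[match.group(2).lower()]


def heap_fits_container(container_mb: int, java_opts: str,
                        headroom: float = 0.8) -> bool:
    """True if the heap stays within `headroom` of the container size.

    The 0.8 default is an assumption, not a Hadoop setting.
    """
    return xmx_mb(java_opts) <= container_mb * headroom


if __name__ == "__main__":
    # Values from the slide: heap larger than the container.
    print(heap_fits_container(1536, "-Xmx2560m"))
    # A pair that leaves headroom for non-heap memory.
    print(heap_fits_container(1536, "-Xmx1024m"))
```

Running a check like this across all job and cluster configuration files is one inexpensive way to carry out the configuration audit.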
  • 41. FAILURE CATEGORY 3 - TOO MANY USERS • Data access control • Inter-departmental resource contention (too many jobs) [Hadoop 2 stack diagram annotated with these pressure points]
  • 42. TOO MANY USERS - QUEUE ACCESS ● Use queue ACLs o restrict which users can submit jobs to a queue o per queue administrator roles: ▪ submit job ▪ administer job o restrict whether users can view applications in another queue
  • 43. DATA ACCESS CONTROL
      • By default, Hadoop supports UNIX style file permissions
      • Easy to circumvent
          HADOOP_USER_NAME=hdfs hdfs dfs -rm /priv/data
      • Use Kerberos
  • 44. DATA ACCESS CONTROL - ACCOUNTABILITY
      “Who deleted that file?”
      • HDFS Audit logs
      • Produced by NameNode
          015-02-24 20:59:45,382 INFO FSNamesystem.audit: allowed=true ugi=soam (auth:SIMPLE) ip=/10.251.255.181 cmd=delete src=/hive/what_a_con.db dst=/user/soam/.Trash/Current/hive/what_a_con.db perm=soam:hiveusers:rwxrwxr-x
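Answering "who deleted that file?" from the audit log is mostly a matter of splitting each entry into its `key=value` fields. The sketch below does that for the entry shown on the slide (copied as it appears there, including the clipped timestamp). The function names are hypothetical, and the simple regex assumes field values contain no spaces, so paths with spaces would need a stricter parser.

```python
import re

_FIELD = re.compile(r"(\w+)=(\S+)")

# The audit entry from the slide, reproduced as shown there.
SAMPLE = (
    "015-02-24 20:59:45,382 INFO FSNamesystem.audit: allowed=true "
    "ugi=soam (auth:SIMPLE) ip=/10.251.255.181 cmd=delete "
    "src=/hive/what_a_con.db "
    "dst=/user/soam/.Trash/Current/hive/what_a_con.db "
    "perm=soam:hiveusers:rwxrwxr-x"
)


def parse_audit_line(line: str) -> dict:
    """Split an HDFS audit entry into its key=value fields."""
    return dict(_FIELD.findall(line))


def who_deleted(lines, path: str) -> list:
    """Users whose audit entries record a delete of exactly `path`."""
    users = []
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("cmd") == "delete" and fields.get("src") == path:
            users.append(fields.get("ugi"))
    return users


if __name__ == "__main__":
    print(who_deleted([SAMPLE], "/hive/what_a_con.db"))
```

Note the `auth:SIMPLE` marker in the entry: without Kerberos the recorded user is whatever the client claimed to be, so the answer is only as trustworthy as the authentication behind it.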
  • 45. SQUEEZE MORE CAPACITY FROM CLUSTER • Targeted upgrades, optimizations [Hadoop 2 stack diagram: Application Layer, Execution Framework, Core Hadoop Layer]
  • 46. SQUEEZE MORE CAPACITY FROM CLUSTER • Optimizations: • Application layer: • Query optimizations, algorithmic level optimizations • Upgrading: • Execution Framework: • Tremendous performance improvements in Hive/Tez, Spark over the past two years • Pig, Cascading all continue to improve • Hadoop layer: • Recent focus on security, stability • Recommendation: • Focus on upgrading execution framework