Managing growth in Production Hadoop Deployments
1. MANAGING GROWTH IN PRODUCTION HADOOP DEPLOYMENTS
Soam Acharya
@soamwork
Charles Wimmer
@cwimmer
Altiscale
@altiscale
HADOOP SUMMIT 2015
SAN JOSE
2. ALTISCALE : INFRASTRUCTURE NERDS
• Soam Acharya - Head of Application Engineering
• Formerly Chief Scientist @ Limelight OVP, Yahoo Research Engineer
• Charles Wimmer, Head of Operations
• Former Yahoo! & LinkedIn SRE
• Managed 40000 nodes in Hadoop clusters at Yahoo!
• Hadoop as a Service, built and managed by Big Data, SaaS, and
enterprise software veterans
• Yahoo!, Google, LinkedIn, VMWare, Oracle, ...
3. SO, YOU’VE PUT TOGETHER YOUR FIRST HADOOP DEPLOYMENT
● It’s now running production ETLs
5. BUT THEN ...
• Your data scientists get on the cluster and start building models
• Your BI team starts running interactive SQL-on-Hadoop queries ..
• Your mobile team starts sending real-time events into the cluster ..
• You sign up more clients
• And the input data for your initial use case doubles ..
11. THE “SUCCESS DISASTER” SCENARIO
● Initial success
● Many subsequent use cases on cluster
● Cluster gets bogged down
12. WHY DO CLUSTERS FAIL?
• Failure categories:
1. Too much data
2. Too many jobs
3. Too many users
13. HOW TO EXTRICATE YOURSELF?
• Short term strategy:
• Get more resources for your cluster
• Expand cluster size!
• More headroom for longer term strategy
• Longer term strategy
14. LONGER TERM STRATEGY
• Can’t cover every scenario
• For each failure category:
• Select pressure points (PPs)
• PPs can occur at different levels of the Hadoop stack
• Identify and shore up pressure points
• Squeeze more capacity from the cluster
15. HADOOP 2 STACK REMINDER
[Stack diagram, bottom to top: Machine Level (h/w: cpu, RAM, disk, network; OS) → Core Hadoop Layer (HDFS: NN + DNs; YARN: RM + NMs) → Execution Framework (MR, Spark, Tez) → Application Layer (Hive, SparkSQL, Cascading, Pig)]
16. FAILURE CATEGORY 1 - TOO MUCH DATA
PP: HDFS at capacity
PP: Too many objects
17. PRESSURE POINT - HDFS AT CAPACITY
• Unpredictable cluster behavior
• Transient errors
• Hadoop daemons can’t save logs to HDFS
• Execution framework errors:
• Hive unable to run queries that create temp tables
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hive-user/hive_2014-07-23_08-43-40_408_2604848084853512498-1/_task_tmp.-ext-10001/_tmp.000121_0 could only be replicated to 1 nodes instead of minReplication (=2). There are xx datanode(s) running and no node(s) are excluded in this operation.
18. HDFS AT CAPACITY MITIGATION
● Use HDFS quotas!
hdfs dfsadmin -setSpaceQuota 113367670 /
● Quotas can be set per directory
● Cannot be set per user
● Protection against accidental cluster destabilization
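When sizing a space quota, remember that HDFS charges quotas against raw bytes across all replicas, not against logical file size. A quick sanity check of the arithmetic (the dataset size and replication factor below are illustrative):

```shell
# Space quotas count *replicated* bytes: a dataset of logical_tb TB at
# replication factor repl consumes logical_tb * repl TB of quota.
# Both values here are made up for illustration.
logical_tb=2
repl=3
quota_tb=$((logical_tb * repl))
echo "space quota needed: ${quota_tb} TB"
```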
19. TOO MANY OBJECTS
“Elephants are afraid of mice. Hadoop is afraid of small files.”
[Diagram: NN heap tracks # of dirs + files; DN heap tracks # of blocks]
20. TOO MANY OBJECTS
● Memory pressure:
o Namenode heap: too many files + directories + objects in HDFS
o Datanode heap: too many blocks allocated per node
● Performance overhead
o Too much time spent on container creation and teardown
o More time spent in execution framework than actual application
21. WHERE ARE THE OBJECTS?
Use HDFS count:
hdfs dfs -count -q <directory name>
● Number of directories, files and bytes
● On a per-directory basis
Use fsimage files:
● Can be produced by the NN
hdfs oiv -i <fsimage file> -o <output file>
● Detailed breakdown of the HDFS file system
● Hard!
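The per-directory output of `hdfs dfs -count -q` can be post-processed to rank small-file hotspots. A sketch with made-up counts standing in for real command output (columns are QUOTA, REM_QUOTA, SPACE_QUOTA, REM_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME; in practice you would pipe the real command output into awk instead):

```shell
# Rank directories by object count (dirs + files), descending.
# The two sample lines below are fabricated for illustration.
hotspots=$(printf '%s\n' \
  'none inf none inf 4 120 1048576 /data' \
  'none inf none inf 12 54000 524288 /tmp' |
  awk '{print $5 + $6, $NF}' | sort -rn)
echo "$hotspots"
```

Directories that float to the top are the first candidates for coalescing or archiving.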
22. TOO MANY OBJECTS - MITIGATION
• Short term:
• Increase NN/DN heapsizes
• Node physical limits
• Increase cluster node count
• Longer term:
• Find and compact
• Coalesce multiple files
• Use HAR
24. COALESCE MULTIPLE FILES II
● Build an index for LZO output
● Tells Hadoop where the splits are
hadoop jar /opt/hadoop/share/hadoop/common/lib/hadoop-lzo-*.jar com.hadoop.compression.lzo.DistributedLzoIndexer $DIR
25. COMBINE FILES INTO HAR
• HAR: Hadoop Archive
hadoop archive -archiveName <archive name>.har -p <HDFS parent path> <dir1> <dir2> ... <outputDir>
• Runs an MR job to produce the archive
• Watch out for replication factor
• On versions 2.4 and earlier, source files are set to a default replication factor of 10
• Not good for small clusters
• -r <replication factor> option added in 2.6
26. COMBINE FILES INTO HAR
• HAR archives are useful if you want to preserve the file/directory structure of input
[alti_soam@desktop ~]$ hdfs dfs -ls har:///tmp/alti_soam_test.har
Found 3 items
drwxr-xr-x - alti_soam hdfs 0 2013-09-03 22:44 har:/tmp/alti_soam_test.har/examples
drwxr-xr-x - alti_soam hdfs 0 2013-11-16 03:53 har:/tmp/alti_soam_test.har/test-pig-avro-dir
drwxr-xr-x - alti_soam hdfs 0 2013-11-12 22:23 har:/tmp/alti_soam_test.har/test-camus
27. FAILURE CATEGORY 2 - TOO MANY JOBS
“Help! My job is stuck!”
[Stack diagram annotated with job-level pressure points:]
• Jobs don’t make progress
• Jobs don’t start
• “Right” jobs finish last
• Mixed profile job issues
28. TOO MANY JOBS REMEDIATION
• Need to quantify job processing on cluster
• Hadoop job usage analysis:
• Resource Manager logs
• History Server logs, job history files
• APIs
• Analysis goals:
• Queue usage => cluster utilization
• Time spent by jobs/containers in waiting state
• Job level stats
• # of jobs, type of jobs …
• Queue tuning
29. HADOOP LOGS - RESOURCE MANAGER
• job stats (outcome, duration, startdate)
• queue used
• container:
• number allocated
• Memory, vCPU allocation
• state transition times
• outcome
30. HADOOP LOGS - JOBHISTORY FILES
• Configure history server to produce files
• Created for every MR job
• HDFS data volume processed
• for mappers/reducers:
• CPU time
• memory used
• start/end time
• max parallel maps, reduces
• GC time
• not available for Tez/Spark:
• Use timeline server for better logging
• Timeline server dependencies
31. HADOOP LOG ANALYSIS
• Analysis goals:
• Queue usage => cluster utilization
• Time spent by jobs/containers in waiting state
• Job level stats:
• # of jobs
• Failed/killed vs successful
• Type of jobs
• Container level stats
• How to analyze logs?
• Custom scripts
• Parse job history files, hadoop logs
• Data warehouse
• Visualization
• Not much by the way of publicly available tools
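As a sketch of the custom-script approach: once job history files have been distilled to one line per job (job id, queue, final status — a hypothetical intermediate format, not a real Hadoop output), standard Unix tools can produce per-queue success stats:

```shell
# Tally successful vs. total jobs per queue from a (made-up)
# one-line-per-job summary extracted from job history files.
summary=$(printf '%s\n' \
  'job_001 production SUCCEEDED' \
  'job_002 research FAILED' \
  'job_003 production SUCCEEDED' \
  'job_004 research SUCCEEDED' |
  awk '{total[$2]++; if ($3 == "SUCCEEDED") ok[$2]++}
       END {for (q in total) printf "%s %d/%d\n", q, ok[q], total[q]}' |
  sort)
echo "$summary"
```

The same tallies, loaded into a data warehouse, feed the visualization step.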
35. QUEUE TUNING STRATEGY
• Determine how you want your cluster to behave
• Pick scheduler depending on behavior
• Real world examples:
• Production jobs must get resources
• Dedicate a certain portion of the cluster regardless of cluster state (idle, at capacity)
• Data loading jobs
• Constrain to a small portion of cluster to preserve network bandwidth
• Research jobs:
• Small portion of cluster at peak
• Large portion of cluster when idle
• Divide up cluster amongst business units
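With the Capacity Scheduler, the production / data-loading / research split above might be expressed in capacity-scheduler.xml roughly as follows. The property names are the scheduler's own; the queue names and percentages are illustrative:

```xml
<!-- Illustrative queue split; tune percentages to your workload. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>production,dataload,research</value>
</property>
<property>
  <!-- Guaranteed share for production jobs, even when the cluster is busy -->
  <name>yarn.scheduler.capacity.root.production.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- Constrain data loading to preserve network bandwidth -->
  <name>yarn.scheduler.capacity.root.dataload.capacity</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dataload.maximum-capacity</name>
  <value>10</value>
</property>
<property>
  <!-- Research: small guaranteed share, large ceiling when the cluster is idle -->
  <name>yarn.scheduler.capacity.root.research.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.research.maximum-capacity</name>
  <value>90</value>
</property>
```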
36. QUEUE TUNING - SCHEDULER BASICS
• Resource allocation
• Fair Scheduler: jobs get an approximately equal share of resources over time
• Capacity Scheduler: queues are allocated a fraction of the cluster
• Which resource? Both schedule memory, with CPU optional
• Inter-queue constraints
• Fair Scheduler: max & min shares (X MB, Y vcores)
• Capacity Scheduler: min capacity => guaranteed fraction of the cluster when busy; max capacity => fraction a queue can expand to when the rest of the cluster is idle
• Intra-queue resource sharing
• Fair Scheduler: pick a policy - FIFO, FAIR, Dominant Resource Fairness
• Capacity Scheduler: tunable policy - many users: first x users get 1/x each; many jobs from a single user: FIFO
37. MORE ON EACH SCHEDULER
• Fair Scheduler:
• Hadoop Summit 2009
• Job Scheduling With the Fair and Capacity Schedulers - Matei Zaharia
• Capacity Scheduler:
• Hadoop Summit 2015 (5/9, 12:05pm)
• Towards SLA-based Scheduling on YARN - Sumeet Singh, Nathan Roberts
38. TOO MANY JOBS - MIXED PROFILE JOBS
• Jobs may have different memory profiles
• Standard MR jobs: small container sizes
• Newer execution frameworks (Spark, H2O):
• Large container sizes
• All or nothing scheduling
• A job with many little tasks
• Can starve jobs that require large containers
39. TOO MANY JOBS - MIXED PROFILE JOBS MITIGATION
• Reduce container sizes if possible
• Always start with the lowest container sizes
• Node labels (YARN-2492) and gang scheduling (YARN-624)
• More details:
• Running Spark and MapReduce Together In Production - David Chaiken
• Hadoop Summit 2015, 06/09, 2:35pm
40. TOO MANY JOBS - HARDENING YOUR CLUSTER
• Cluster configuration audit
• Container vs heap size
• Appropriate kernel level configuration
• Turn on Linux Container Executor
• Enable Hadoop Security
• Use operating system cgroups
• Protect Hadoop daemons
• Cage user processes:
• Impala
• Limits on what Hadoop can control:
• CPU
• But not memory, network & disk BW
• Example of a config an audit should catch (map heap exceeds the container size):
mapreduce.map.memory.mb = 1536
mapreduce.map.java.opts = -Xmx2560m
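For contrast, a consistent pairing keeps the JVM heap comfortably inside the container; a common rule of thumb is roughly 75-80% of the container size, leaving headroom for JVM overhead. The exact values below are illustrative:

```
mapreduce.map.memory.mb = 1536
mapreduce.map.java.opts = -Xmx1228m
```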
41. FAILURE CATEGORY 3 - TOO MANY USERS
Data access control
[Stack diagram annotated: inter-departmental resource contention (too many jobs)]
42. TOO MANY USERS - QUEUE ACCESS
● Use queue ACLs
o restrict which users can submit jobs to a queue
o per queue administrator roles:
submit job
administer job
o restrict whether users can view applications in another queue
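Queue ACLs live in the scheduler configuration. With the Capacity Scheduler they might look like the sketch below; the user and group names are placeholders, and the ACL value format is "users groups" (space-separated lists, with `*` meaning everyone):

```xml
<property>
  <!-- Only user etl and group prodeng may submit to the production queue -->
  <name>yarn.scheduler.capacity.root.production.acl_submit_applications</name>
  <value>etl prodeng</value>
</property>
<property>
  <!-- Queue administrators may view and kill any application in the queue -->
  <name>yarn.scheduler.capacity.root.production.acl_administer_queue</name>
  <value>hadoopadmin ops</value>
</property>
```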
43. DATA ACCESS CONTROL
• By default, Hadoop supports UNIX style file permissions
• Easy to circumvent
HADOOP_USER_NAME=hdfs hdfs dfs -rm /priv/data
• Use Kerberos
44. DATA ACCESS CONTROL - ACCOUNTABILITY
• HDFS Audit logs
• Produced by NameNode
“Who deleted that file?”
2015-02-24 20:59:45,382 INFO FSNamesystem.audit: allowed=true ugi=soam (auth:SIMPLE) ip=/10.251.255.181 cmd=delete src=/hive/what_a_con.db dst=/user/soam/.Trash/Current/hive/what_a_con.db perm=soam:hiveusers:rwxrwxr-x
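Because audit entries are single-line key=value records, answering that question is a grep away. A sketch over sample lines (the first is adapted from the slide, the second is made up):

```shell
# Find who issued delete commands: filter audit lines on cmd=delete,
# then extract the user from the ugi= field.
deleter=$(printf '%s\n' \
  '2015-02-24 20:59:45,382 INFO FSNamesystem.audit: allowed=true ugi=soam (auth:SIMPLE) ip=/10.251.255.181 cmd=delete src=/hive/what_a_con.db' \
  '2015-02-24 21:01:02,114 INFO FSNamesystem.audit: allowed=true ugi=analyst1 (auth:SIMPLE) ip=/10.251.255.182 cmd=open src=/data/events' |
  grep 'cmd=delete' | sed -n 's/.*ugi=\([^ ]*\).*/\1/p')
echo "deleted by: $deleter"
```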
45. SQUEEZE MORE CAPACITY FROM CLUSTER
[Stack diagram: Application Layer / Execution Framework / Core Hadoop Layer]
• Targeted upgrades, optimizations
46. SQUEEZE MORE CAPACITY FROM CLUSTER
• Optimizations:
• Application layer:
• Query optimizations, algorithmic level optimizations
• Upgrading:
• Execution Framework:
• Tremendous performance improvements in Hive/Tez and Spark over the past two years
• Pig and Cascading continue to improve
• Hadoop layer:
• Recent focus on security, stability
• Recommendation:
• Focus on upgrading the execution framework