© Hortonworks Inc. 2013
Top 10 things to get the most
out of your Hadoop Cluster
Suresh Srinivas | @suresh_m_s
Sanjay Radia | @srr
About Me
• Architect & Founder at Hortonworks
• Long-time Apache Hadoop committer and PMC member
• Designed and developed many key Hadoop features
• Experience from supporting many clusters
–Including some of the world’s largest Hadoop clusters
Agenda
Best Practices, Tips and Tricks for
• Building cluster
• Configuration
• Monitoring
• Reliability
• Multi-tenancy
Hardware and Cluster Sizing
• Considerations
–Larger clusters heal faster after node or disk failures
–Machines with huge storage take longer to recover
–More racks give more failure domains
• Recommendations
– Get good-quality commodity hardware
– Buy at the pricing sweet spot: 3TB disks, 96GB RAM, 8-12 cores
– More memory is better – real-time workloads are memory hungry!
– Get to 30-40 machines or 3-4 racks before considering fatter machines (1U with 6 disks vs. 2U with 12 disks)
–Use pilot cluster to learn about load patterns
– Balanced hardware for I/O, compute or memory bound
–Rule of thumb – network to compute cost of 20%
–More details - http://tinyurl.com/hwx-hadoop-hw
Configuration is Key
• Avoid JVM issues
–Use 64 bit JVM for all daemons
– Compressed OOPs are enabled by default (Java 6u23 and later)
–Java heap size
– Set the same maximum and starting heap size: -Xmx == -Xms
– Avoid Java defaults – configure NewSize and MaxNewSize
– Use 1/8 to 1/6 of the max heap size for heaps larger than 4G
–Use a low-latency garbage collector
– -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
– High <N> on Namenode and JobTracker
–Important JVM configs to help debugging
– -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
– -XX:ErrorFile=<file>
– -XX:+HeapDumpOnOutOfMemoryError
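A minimal sketch of how the flags on this slide could be assembled for one daemon. The helper name `jvm_opts`, the log directory, and the choice of 1/8 of the heap for the young generation are assumptions for illustration. The flags themselves are the ones listed above and target JVMs of that era; newer JVMs have dropped CMS.

```python
def jvm_opts(heap_gb: int, gc_threads: int, log_dir: str = "/var/log/hadoop") -> str:
    """Build a JVM options string following the recommendations on this slide."""
    heap_mb = heap_gb * 1024
    # Young generation: 1/8 of the heap (the slide suggests 1/8 to 1/6
    # for heaps larger than 4G). The exact fraction is an assumption.
    new_mb = heap_mb // 8
    opts = [
        # Same starting and maximum heap size (Xms == Xmx).
        f"-Xms{heap_mb}m",
        f"-Xmx{heap_mb}m",
        # Explicit young-generation size instead of the JVM defaults.
        f"-XX:NewSize={new_mb}m",
        f"-XX:MaxNewSize={new_mb}m",
        # Low-latency collector with an explicit GC thread count.
        "-XX:+UseConcMarkSweepGC",
        f"-XX:ParallelGCThreads={gc_threads}",
        # Debugging aids: GC log, crash file, heap dump on OOM.
        "-verbose:gc",
        f"-Xloggc:{log_dir}/gc.log",
        "-XX:+PrintGCDetails",
        f"-XX:ErrorFile={log_dir}/hs_err.log",
        "-XX:+HeapDumpOnOutOfMemoryError",
    ]
    return " ".join(opts)


if __name__ == "__main__":
    # Hypothetical namenode with a 16 GB heap and 8 GC threads.
    print(jvm_opts(heap_gb=16, gc_threads=8))
```

The resulting string would typically be placed in the daemon-specific options variable of the cluster's environment script, which is also a good candidate for version control.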
Configuration is Key…
• Multiple redundant dirs for namenode metadata
–One of the dfs.name.dir directories should be on NFS
–NFS softmount - tcp,soft,intr,timeo=20,retrans=5
• Configure open fd ulimit
–Default 1024 is too low
–16K for datanodes, 64K for Master nodes
• Set up cluster nodes with time synchronization
• Use version control for configuration!
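One way to verify the open file descriptor limit from inside a process, sketched with Python's standard `resource` module (Unix only). The 16K and 64K thresholds are the values from this slide; the function and constant names are illustrative.

```python
import resource

DATANODE_MIN_FDS = 16 * 1024   # 16K for datanodes (from the slide)
MASTER_MIN_FDS = 64 * 1024     # 64K for master nodes (from the slide)


def fd_limit_ok(required: int) -> bool:
    """Return True if the soft open-file limit of this process is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited soft limit always satisfies the requirement.
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard}")
    print("enough for a datanode:", fd_limit_ok(DATANODE_MIN_FDS))
    print("enough for a master node:", fd_limit_ok(MASTER_MIN_FDS))
```

The check reports the limit of the process that runs it, so it is only meaningful when run as the same user and through the same startup path as the Hadoop daemon.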
Configuration is Key…
• Use disk fail-in-place for datanodes
–A disk failure is no longer a datanode failure
–Especially important for high-density nodes
• Set dfs.namenode.name.dir.restore to true
–Restores NN storage directory during checkpointing
• Take periodic backups of namenode metadata
–Make copies of the entire storage directory
• Master node OS device should be highly available
–RAID-1 (mirrored pair)
• Set aside a lot of disk space for NN logs
–It is verbose – set aside multiple GBs
–Many installs configure this too small
– NN logs roll within minutes – hard to debug issues
Checkpointing
• Secondary Namenode is a confusing name: it performs checkpoints and is not a standby namenode
Checkpointing…
• Set up a single secondary namenode
–Periodically merges file system image with journal
–Two secondary namenodes not supported
– Many instances of accidental two secondary namenodes
– Known to cause metadata corruption!
• In HA setup standby replaces secondary
• Ensure periodic checkpoints are happening
–Checkpoint time can be queried in scripts
– Shown in NN webUI as well
–Real incident
– A cluster was run for more than a year with no checkpoint!
– Namenode stopped when it ran out of disk space
– NN was running for more than a year – no restart!!!
– Restoring the cluster was not fun!
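A scripted check for missing checkpoints could look roughly like the sketch below. It assumes the time of the last checkpoint has already been obtained as a Unix timestamp (for example from the NN web UI or the metadata directory; how it is fetched is left out here). The function names and the 6-hour threshold are hypothetical.

```python
import time
from typing import Optional


def checkpoint_age_hours(last_checkpoint_epoch_s: float,
                         now_epoch_s: Optional[float] = None) -> float:
    """Hours elapsed since the last successful checkpoint."""
    now = time.time() if now_epoch_s is None else now_epoch_s
    return (now - last_checkpoint_epoch_s) / 3600.0


def checkpoint_is_stale(last_checkpoint_epoch_s: float,
                        max_age_hours: float = 6.0,
                        now_epoch_s: Optional[float] = None) -> bool:
    """True if no checkpoint has happened within `max_age_hours` (assumed threshold)."""
    return checkpoint_age_hours(last_checkpoint_epoch_s, now_epoch_s) > max_age_hours


if __name__ == "__main__":
    # Hypothetical example: the last checkpoint finished 30 hours ago.
    last = time.time() - 30 * 3600
    if checkpoint_is_stale(last):
        print(f"ALERT: last checkpoint was {checkpoint_age_hours(last):.1f} hours ago")
```

Wired into an alerting system, a check like this would have flagged the year-long gap in the incident above within the first day.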
Don’t edit the metadata files!
• Editing can corrupt the cluster state
–Might result in loss of data
• Real incident
–NN misconfigured to point to another NN’s metadata
–DNs can’t register due to namespace ID mismatch
– System detected the problem correctly
– Safety net ignored by the admin!
–Admin edits the namenode VERSION file to match ids
What Happens Next?
Guard Against Accidental Deletion
• rm -r deletes the data at the speed of Hadoop!
–Pressing Ctrl-C does not stop the deletion!
–Undeleting files on datanodes is hard & time-consuming
– Immediately shut down the NN, unmount disks on datanodes
– Recover deleted files
– Start namenode without the delete operation in edits
• Enable Trash
• Real Incident
–Customer is running a distro of Hadoop with trash not enabled
–Deletes a large dir (100 TB) and shuts down NN immediately
–Support person asks NN to be restarted to see if trash is enabled!
What happens next?
• Now HDFS has Snapshots!
Monitor Usage
• As cluster storage, nodes, files, and blocks grow
– Update NN heap, handler count, number of DN xceivers
– Tweak other related config periodically
• Monitor the hardware usage for your work load
– Disk I/O, network I/O, CPU and memory usage
– Use this information when expanding cluster capacity
• Monitor the usage with Hadoop metrics
– JVM metrics – GC times, Memory used, Thread Status
– RPC metrics – especially latency to track slowdowns
– HDFS metrics
– Used storage, # of files and blocks, total load on the cluster
– File System operations
– MapReduce Metrics
– Slot utilization and Job status
• Tweak configurations during upgrades/maintenance
Monitoring Simplified With Ambari
[Screenshots: Cluster Metrics Summary, HDFS Metrics Summary, MapReduce Metrics Summary]
Monitor Failures
• If a large percentage of datanodes fail, put the NN into safemode
–Avoids unnecessary replication
–Bring back the datanodes or rack
• Track dead datanodes
–Bring back datanodes when the number grows
• Ensure cluster storage utilization is < 85%
–When the cluster is nearly full things slow down
• Monitor for corrupt blocks
–Delete tmp files with replication factor = 1 and missing blocks
• Have a portfolio of cluster validation tests/jobs
–Run them on restart, upgrade & config changes
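The utilization and dead-node checks above can be expressed as a small alerting rule. This is a sketch under assumptions: the 85% limit comes from the slide, while the 10% dead-node fraction, the function name, and the message wording are hypothetical and should be tuned per cluster.

```python
from typing import List


def cluster_alerts(used_bytes: int, total_bytes: int,
                   dead_nodes: int, live_nodes: int,
                   max_utilization: float = 0.85,
                   max_dead_fraction: float = 0.10) -> List[str]:
    """Return human-readable alerts for storage utilization and dead datanodes."""
    alerts = []

    # Storage utilization should stay below 85% (from the slide).
    utilization = used_bytes / total_bytes
    if utilization >= max_utilization:
        alerts.append(f"storage utilization {utilization:.0%} is at or above {max_utilization:.0%}")

    # Flag a large share of dead datanodes (threshold is an assumption).
    total_nodes = dead_nodes + live_nodes
    if total_nodes and dead_nodes / total_nodes > max_dead_fraction:
        alerts.append(f"{dead_nodes} of {total_nodes} datanodes are dead")

    return alerts


if __name__ == "__main__":
    # Hypothetical cluster: 90% full, 3 of 100 datanodes dead.
    for alert in cluster_alerts(90, 100, dead_nodes=3, live_nodes=97):
        print("ALERT:", alert)
```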
Tools To Manage Clusters
• Use Balancer periodically
–Distributes data and hence processing
–Important to run after expanding the cluster
–Use appropriate balancer bandwidth – does not need restart
– dfsadmin -setBalancerBandwidth <bandwidth>
• Decommissioning
–Before removing/replacing DNs from the cluster
• Distcp for copying data to another cluster
–Backup, Disaster recovery
–More enhancements to come in the near future
• Tooling can be done around JMX/JMX http
–See the list - http://<nn>/jmx?get=Hadoop:service=NameNode
–All information equivalent to NN WebUI
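As an illustration of tooling around the JMX HTTP endpoint, the sketch below extracts one attribute from a JMX JSON document. The endpoint returns a JSON object with a `beans` list. The sample payload, its values, and the helper name are invented for illustration, and attribute names should be checked against your Hadoop version.

```python
import json
from typing import Any, Optional

# Invented sample of what the NN /jmx endpoint might return (values are made up).
SAMPLE_JMX = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystem",
   "CapacityTotal": 1000, "CapacityUsed": 400, "FilesTotal": 12345}
]}
"""


def bean_attribute(jmx_json: str, bean_name: str, attribute: str) -> Optional[Any]:
    """Return `attribute` of the bean called `bean_name`, or None if absent."""
    for bean in json.loads(jmx_json).get("beans", []):
        if bean.get("name") == bean_name:
            return bean.get(attribute)
    return None


if __name__ == "__main__":
    bean = "Hadoop:service=NameNode,name=FSNamesystem"
    used = bean_attribute(SAMPLE_JMX, bean, "CapacityUsed")
    total = bean_attribute(SAMPLE_JMX, bean, "CapacityTotal")
    print(f"utilization: {used / total:.0%}")
```

In a real script the JSON text would come from an HTTP request to the namenode rather than from a constant.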
Further Simplify Management
• HDFS uses JBODs with replication, not RAID
–Monitors nodes, disks, block checksums
–Automatic Recovery - parallel – very fast
– Recovers entire 12TB node in 10s of minutes in a 100 node cluster
Compare with the cost & urgency of repairing a RAID 5!
• Spare cluster capacity further simplifies management
–Nodes/clusters continue to run on failures, with lower capacity
– Nodes and disks can be fixed when convenient (unlike RAID)
– Configure how many disk failures => node failure
–1 operator can manage 3-4K nodes
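A back-of-the-envelope version of the recovery claim above, as a sketch. It assumes the failed node's data is re-replicated evenly by all surviving nodes in parallel at a sustained per-node rate; the 50 MB/s figure is an assumption, not a number from the slides, and the model ignores the cost of reading and writing each block on different nodes.

```python
def recovery_minutes(node_tb: float, cluster_nodes: int, per_node_mb_per_s: float) -> float:
    """Estimate minutes to re-replicate one failed node's data across the survivors."""
    surviving_nodes = cluster_nodes - 1
    data_mb = node_tb * 1_000_000          # decimal TB to MB
    mb_per_survivor = data_mb / surviving_nodes
    return mb_per_survivor / per_node_mb_per_s / 60


if __name__ == "__main__":
    # 12 TB node in a 100-node cluster, assumed 50 MB/s of recovery traffic per node.
    parallel = recovery_minutes(12, 100, 50)
    # The same data rebuilt by a single device at the same rate, for comparison.
    serial = recovery_minutes(12, 2, 50)
    print(f"parallel recovery: about {parallel:.0f} minutes")
    print(f"single-device rebuild: about {serial:.0f} minutes")
```

Under these assumptions the parallel case lands in the tens of minutes, which is consistent with the slide, while a rebuild through a single device takes days.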
Design For Multi-tenancy
• Share compute capacity with Capacity Scheduler
– Queue(s) and sub-queues with a guaranteed capacity per tenant
–Almost like dedicated hardware
–Better than a private cluster: access to unused capacity
–Resource limits for tasks
– Memory limits are monitored
– cgroups support was just added to YARN
– Resource isolation without VM overhead!
• Share HDFS Storage
–Set quotas on per-user and per-project data directories
–Federation - Isolate categories of uses to separate namespaces
– Production vs. experimental, HBase etc.
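To make the per-tenant guaranteed capacity concrete, here is a sketch that turns a tenant-to-percentage mapping into Capacity Scheduler properties. It assumes the YARN property naming (`yarn.scheduler.capacity.root...`) and top-level queues only; the tenant names and percentages are hypothetical.

```python
from typing import Dict


def capacity_scheduler_props(queues: Dict[str, int]) -> Dict[str, str]:
    """Map tenant queues to YARN Capacity Scheduler properties.

    `queues` maps a queue name to its guaranteed share in percent.
    Shares of sibling queues are expected to add up to 100.
    """
    total = sum(queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, percent in queues.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)
    return props


if __name__ == "__main__":
    # Hypothetical tenants and guaranteed shares.
    tenants = {"production": 60, "analytics": 30, "experimental": 10}
    for key, value in capacity_scheduler_props(tenants).items():
        print(f"{key} = {value}")
```

Validating the sum before writing the configuration catches a common mistake early, since a queue layout whose shares do not add up is rejected by the scheduler.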
Train Users
• Train users on best practices on writing apps
• Reduce storage use
–Delete unnecessary data periodically
–Move cold data into Hadoop archive
• Encourage using replication >= 3 for important data
–Hot data also needs higher replication
• Set up a small test cluster
–Users test their code before moving to production
–Avoid debugging in the production cluster
• Set up a user mailing list for information exchange
• Encourage creating JIRAs in Apache
–Helps community identify issues, fix bugs, stabilize quickly
Thank You – Q&A
Summary
1. Choose suitable server hardware and cluster sizes
2. Configuration is key
3. Checkpointing
4. Don’t edit metadata files
5. Guard against accidental deletions
6. Monitor usage and failures
7. Use available tools for managing the cluster
8. Simplify management with spare capacity
9. Design for multi-tenancy
10. Train your users on best practices
