SlideShare a Scribd company logo
1 of 42
Hadoop Troubleshooting 101
Kate Ting, Cloudera
Ariel Rabkin, Cloudera/UC Berkeley
How to Destroy Your Cluster

  with 7 Misconfigurations
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
Ticket Breakdown

• No one issue was more than 2% of tickets

• First symptom ≠ root cause

• Bad configuration goes undetected
Breakdown of Tickets by Component
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
What are Misconfigurations?

• Any diagnostic ticket requiring a change to Hadoop or to
  OS config files

• Comprise 35% of tickets

• e.g. resource-allocation: memory, file-handles, disk-space
Problem Tickets by Category and Time
Why Care About Misconfigurations?
The life of an over-subscribed MR/HBase
   cluster is nasty, brutish, and short.
            (with apologies to Thomas Hobbes)
Oversubscription of MR heap caused swap.




      Swap caused RegionServer to time out and die.



Dead RegionServer caused MR tasks to fail until MR job died.
Whodunit?



  A Faulty MR Config Killed HBase
Who Are We?

• Kate Ting
  o Customer Operations Engineer, Cloudera
  o 9 months troubleshooting 100+ Hadoop clusters
  o kate@cloudera.com

• Ariel Rabkin
  o Customer Operations Tools Team, Intern, Cloudera
  o UC Berkeley CS PhD candidate
  o Dissertation on diagnosing configuration bugs
  o asrabkin@eecs.berkeley.edu
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
1. Task Out Of Memory Error

FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
     at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask
.java:781)
     at
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350
)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
     at org.apache.hadoop.mapred.Child.main(Child.java:170)
1. Task Out Of Memory Error
• What does it mean?
  o Memory leak in task code

• What causes this?
  o MR task heap sizes will not fit

• How can it be resolved?
   o Set io.sort.mb < mapred.child.java.opts
       o e.g. 512M for io.sort.mb and 1G for mapred.child.java.opts
       io.sort.mb = size of map buffer; smaller buffer => more spills
   o Use fewer mappers & reducers
       mappers = # of cores on node
           if not using HBase
           if # disks = # cores
       4:3 mapper:reducer ratio
(Mappers + Reducers)*
               Child Task Heap
                       +
                   DN heap
                       +
Total RAM          TT heap
                       +
                     3GB
                       +
                   RS heap
                       +
             Other Services' heap
2. JobTracker Out of Memory Error
ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:122)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:653)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3965)
at
org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskIniti
alizationListener.java:79)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:8
86)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2. JobTracker Out of Memory Error

• What does it mean?
  o Total JT memory usage > allocated RAM

• What causes this?
  o Tasks too small
  o Too much job history
2. JobTracker Out of Memory Error

• How can it be resolved?
  o Verify by summary output: sudo -u mapred jmap -J-d64
    -histo:live <PID of JT>
  o Increase JT's heap space
  o Separate JT and NN
  o Decrease
    mapred.jobtracker.completeuserjobs.maximum to 5
      JT cleans up mem more often, reducing needed
       RAM
3. Unable to Create New Native Thread

ERROR mapred.JvmManager: Caught Throwable in JVMRunner. Aborting
TaskTracker.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:234)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at
org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.j
ava:127)
at
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild
(JvmManager.java:472)
at
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(Jvm
Manager.java:446)
3. Unable to Create New Native Thread
• What does it mean?
  o DN show up as dead even though processes are still
    running on those machines

• What causes this?
  o nproc default of 4200 processes is too low

• How can it be resolved?
  o Adjust /etc/security/limits.conf:
      hbase soft/hard nproc 50000
      hdfs soft/hard nproc 50000
      mapred soft/hard nproc 50000
      Restart DN, TT, JT, NN
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagment
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
4. Too Many Fetch-Failures



INFO org.apache.hadoop.mapred.JobInProgress: Too many
fetch-failures for output of task:
4. Too Many Fetch-Failures

• What does it mean?
  o Reducer fetch operations fail to retrieve mapper outputs
  o Too many fetch failures occur on a particular TT
     o => blacklisted TT

• What causes this?
  o DNS issues
  o Not enough http threads on the mapper side for the
    reducers
  o JVM bug
4. Too Many Fetch-Failures
• How can it be resolved?
   o mapred.reduce.slowstart.completed.maps = 0.80
       Allows reducers from other jobs to run while a big job waits on
        mappers
   o tasktracker.http.threads = 80
       Specifies # threads used by the TT to serve map output to
        reducers
   o mapred.reduce.parallel.copies = SQRT(NodeCount) with a floor
     of 10
       Specifies # parallel copies used by reducers to fetch map
        output
   o Stop using 6.1.26 Jetty, which is fetch-failure prone and upgrade
     to CDH3u2 (MAPREDUCE-2980), MAPREDUCE-3184
5. Not Able to Place Enough Replicas



WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not
able to place enough replicas
5. Not Able to Place Enough Replicas



• What does it mean?
  o NN is unable to choose some DNs
5. Not Able to Place Enough Replicas
•   What causes this?
    o dfs replication > # avail DNs
        o # avail DNs is low due to low disk space
        o mapred.submit.replication default of 10 is too high
    o NN is unable to satisfy block replacement policy
        o If # racks > 2, a block has to exist in at least 2 racks
    o DN being decommissioned
    o DN has too much load on it
    o Block-size unreasonably large
    o Not enough xcievers threads
        Default 256 threads that DN can manage is too low
        Note the accidental misspelling
5. Not Able to Place Enough Replicas
• How can it be resolved?
  o Set dfs.datanode.max.xcievers = 4096 then restart DN
  o Look for nodes down (or rack down)
  o Check disk space
     o Log dir may be filling up or a runaway task may fill
       up an entire disk
  o Rebalance
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Exhaustion
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Exhaustion
     • Fetch Failures
     • Replicas
  • Data Space Exhaustion
     • No File
     • Too Many Files
• Cloudera Manager
6. No Such File or Directory

ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:
496)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapred.TaskTracker.initializeDirectories(TaskTracker.java:666)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:734)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1431)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3521)
Total Storage




                MR space
                DFS space
6. No Such File or Directory
• What does it mean?
  o TT failing to start or jobs are failing
• What causes this?
  o Diskspace filling up on the TT disk
  o Permissions on the directories that TT writes to
• How can it be resolved?
   Job logging dir dfs.datanode.du.reserved = 10% total
    vol
   Permissions should be 755 and owner should be
    mapred for userlogs and mapred.local.dir
7. Too Many Open Files

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(<ip address>:50010,
storageID=<storage id>, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Too many open files
      at sun.nio.ch.IOUtil.initPipe(Native Method)
      at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
      at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
      at java.io.DataInputStream.readShort(DataInputStream.java:295)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:97)
7. Too Many Open Files
• What does it mean?
  o Hitting open file handles limit of user account running
    Hadoop

• What causes this?
  o Nofile default of 1024 files is too low

• How can it be resolved?
  o Adjust /etc/security/limits.conf:
      hdfs - nofile 32768
      mapred - nofile 32768
      hbase - nofile 32768
      Restart DN, TT, JT, NN
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
Additional Resources
• Avoiding Common Hadoop Administration Issues:
  http://www.cloudera.com/blog/2010/08/avoiding-common-hadoop-
  administration-issues/

• Tips for Improving MapReduce Performance:
  http://www.cloudera.com/blog/2009/12/7-tips-for-improving-
  mapreduce-performance/

• Basic Hardware Recommendations:
  http://www.cloudera.com/blog/2010/03/clouderas-support-team-
  shares-some-basic-hardware-recommendations/

• Cloudera Knowledge
  Base: http://ccp.cloudera.com/display/KB/Knowledge+Base
Acknowledgements

Adam Warrington
Amr Awadallah
Angus Klein
Brock Noland
Clint Heath
Eric Sammer
Esteban Gutierrez
Jeff Bean
Joey Echeverria
Laura Gonzalez
Linden Hillenbrand
Omer Trajman
Patrick Angeles
Takeaways

• Configuration is up to you. Bugs are up to
  the community.

• Misconfigurations are hard to diagnose.

• Get it right the first time with Cloudera
  Manager.

More Related Content

What's hot

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseenissoz
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceCloudera, Inc.
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive TutorialSandeep Patil
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureDataWorks Summit
 

What's hot (20)

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job PerformanceHadoop Summit 2012 | Optimizing MapReduce Job Performance
Hadoop Summit 2012 | Optimizing MapReduce Job Performance
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 

Viewers also liked

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
A day in the life of hadoop administrator!
A day in the life of hadoop administrator!A day in the life of hadoop administrator!
A day in the life of hadoop administrator!Edureka!
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop AdministrationEdureka!
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
10 Common Hadoop-able Problems Webinar
10 Common Hadoop-able Problems Webinar10 Common Hadoop-able Problems Webinar
10 Common Hadoop-able Problems WebinarCloudera, Inc.
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniquesijsrd.com
 
Why Migrate from MySQL to Cassandra
Why Migrate from MySQL to CassandraWhy Migrate from MySQL to Cassandra
Why Migrate from MySQL to CassandraDATAVERSITY
 
Apache Ambari: Simplified Hadoop Cluster Operation & Troubleshooting
Apache Ambari: Simplified Hadoop Cluster Operation & TroubleshootingApache Ambari: Simplified Hadoop Cluster Operation & Troubleshooting
Apache Ambari: Simplified Hadoop Cluster Operation & TroubleshootingJayush Luniya
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshootingmapr-academy
 
How to solve the mismanagement crisis
How to solve the mismanagement crisisHow to solve the mismanagement crisis
How to solve the mismanagement crisisBalamurugan Ka
 
How Hadoop Exploits Data Locality
How Hadoop Exploits Data LocalityHow Hadoop Exploits Data Locality
How Hadoop Exploits Data LocalityUday Vakalapudi
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopMohit Tare
 
A short introduction to Vertica
A short introduction to VerticaA short introduction to Vertica
A short introduction to VerticaTommi Siivola
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveTapan Avasthi
 
Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012jbellis
 

Viewers also liked (20)

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
A day in the life of hadoop administrator!
A day in the life of hadoop administrator!A day in the life of hadoop administrator!
A day in the life of hadoop administrator!
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache Hadoop
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
10 Common Hadoop-able Problems Webinar
10 Common Hadoop-able Problems Webinar10 Common Hadoop-able Problems Webinar
10 Common Hadoop-able Problems Webinar
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Why Migrate from MySQL to Cassandra
Why Migrate from MySQL to CassandraWhy Migrate from MySQL to Cassandra
Why Migrate from MySQL to Cassandra
 
Apache Ambari: Simplified Hadoop Cluster Operation & Troubleshooting
Apache Ambari: Simplified Hadoop Cluster Operation & TroubleshootingApache Ambari: Simplified Hadoop Cluster Operation & Troubleshooting
Apache Ambari: Simplified Hadoop Cluster Operation & Troubleshooting
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
 
How to solve the mismanagement crisis
How to solve the mismanagement crisisHow to solve the mismanagement crisis
How to solve the mismanagement crisis
 
How Hadoop Exploits Data Locality
How Hadoop Exploits Data LocalityHow Hadoop Exploits Data Locality
How Hadoop Exploits Data Locality
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
A short introduction to Vertica
A short introduction to VerticaA short introduction to Vertica
A short introduction to Vertica
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012Cassandra at NoSql Matters 2012
Cassandra at NoSql Matters 2012
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 

Similar to Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera

Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingData Con LA
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Tingjaxconf
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaSpark Summit
 
HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trencheswchevreuil
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteDataWorks Summit
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)srisatish ambati
 
Monitor some of the things
Monitor some of the thingsMonitor some of the things
Monitor some of the thingsandertech
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 

Similar to Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera (20)

Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-ting
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Ting
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
 
HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
 
In-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great TasteIn-memory Caching in HDFS: Lower Latency, Same Great Taste
In-memory Caching in HDFS: Lower Latency, Same Great Taste
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)
 
Top ten-list
Top ten-listTop ten-list
Top ten-list
 
Monitor some of the things
Monitor some of the thingsMonitor some of the things
Monitor some of the things
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera

  • 1. Hadoop Troubleshooting 101 Kate Ting, Cloudera Ariel Rabkin, Cloudera/UC Berkeley
  • 2. How to Destroy Your Cluster with 7 Misconfigurations
  • 3. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Mismanagement • TT OOME • JT OOME • Native Threads • Thread Mismanagement • Fetch Failures • Replicas • Disk Mismanagement • No File • Too Many Files • Cloudera Manager
  • 4. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Mismanagement • TT OOME • JT OOME • Native Threads • Thread Mismanagement • Fetch Failures • Replicas • Disk Mismanagement • No File • Too Many Files • Cloudera Manager
  • 5. Ticket Breakdown • No one issue was more than 2% of tickets • First symptom ≠ root cause • Bad configuration goes undetected
  • 6. Breakdown of Tickets by Component
  • 7. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Mismanagement • TT OOME • JT OOME • Native Threads • Thread Mismanagement • Fetch Failures • Replicas • Disk Mismanagement • No File • Too Many Files • Cloudera Manager
  • 8. What are Misconfigurations? • Any diagnostic ticket requiring a change to Hadoop or to OS config files • Comprise 35% of tickets • e.g. resource-allocation: memory, file-handles, disk-space
  • 9. Problem Tickets by Category and Time
  • 10. Why Care About Misconfigurations?
  • 11. The life of an over-subscribed MR/HBase cluster is nasty, brutish, and short. (with apologies to Thomas Hobbes)
  • 12. Oversubscription of MR heap caused swap. Swap caused RegionServer to time out and die. Dead RegionServer caused MR tasks to fail until MR job died.
  • 13. Whodunit? A Faulty MR Config Killed HBase
  • 14. Who Are We? • Kate Ting o Customer Operations Engineer, Cloudera o 9 months troubleshooting 100+ Hadoop clusters o kate@cloudera.com • Ariel Rabkin o Customer Operations Tools Team, Intern, Cloudera o UC Berkeley CS PhD candidate o Dissertation on diagnosing configuration bugs o asrabkin@eecs.berkeley.edu
  • 15. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Mismanagement • TT OOME • JT OOME • Native Threads • Thread Mismanagement • Fetch Failures • Replicas • Disk Mismanagement • No File • Too Many Files • Cloudera Manager
  • 16. 1. Task Out Of Memory Error FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask .java:781) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350 ) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170)
  • 17. 1. Task Out Of Memory Error • What does it mean? o Memory leak in task code • What causes this? o MR task heap sizes will not fit • How can it be resolved? o Set io.sort.mb < mapred.child.java.opts o e.g. 512M for io.sort.mb and 1G for mapred.child.java.opts  io.sort.mb = size of map buffer; smaller buffer => more spills o Use fewer mappers & reducers  mappers = # of cores on node  if not using HBase  if # disks = # cores  4:3 mapper:reducer ratio
  • 18. (Mappers + Reducers)* Child Task Heap + DN heap + Total RAM TT heap + 3GB + RS heap + Other Services' heap
  • 19. 2. JobTracker Out of Memory Error ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:122) at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:653) at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3965) at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskIniti alizationListener.java:79) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:8 86) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
  • 20. 2. JobTracker Out of Memory Error • What does it mean? o Total JT memory usage > allocated RAM • What causes this? o Tasks too small o Too much job history
  • 21. 2. JobTracker Out of Memory Error • How can it be resolved? o Verify by summary output: sudo -u mapred jmap -J-d64 -histo:live <PID of JT> o Increase JT's heap space o Separate JT and NN o Decrease mapred.jobtracker.completeuserjobs.maximum to 5  JT cleans up mem more often, reducing needed RAM
  • 22. 3. Unable to Create New Native Thread ERROR mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker. java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:640) at org.apache.hadoop.util.Shell.runCommand(Shell.java:234) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375) at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.j ava:127) at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild (JvmManager.java:472) at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(Jvm Manager.java:446)
  • 23. 3. Unable to Create New Native Thread • What does it mean? o DN show up as dead even though processes are still running on those machines • What causes this? o nproc default of 4200 processes is too low • How can it be resolved? o Adjust /etc/security/limits.conf:  hbase soft/hard nproc 50000  hdfs soft/hard nproc 50000  mapred soft/hard nproc 50000  Restart DN, TT, JT, NN
  • 24. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Mismanagement • TT OOME • JT OOME • Native Threads • Thread Mismanagment • Fetch Failures • Replicas • Disk Mismanagement • No File • Too Many Files • Cloudera Manager
  • 25. 4. Too Many Fetch-Failures INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task:
  • 26. 4. Too Many Fetch-Failures • What does it mean? o Reducer fetch operations fail to retrieve mapper outputs o Too many fetch failures occur on a particular TT o => blacklisted TT • What causes this? o DNS issues o Not enough http threads on the mapper side for the reducers o JVM bug
  • 27. 4. Too Many Fetch-Failures • How can it be resolved? o mapred.reduce.slowstart.completed.maps = 0.80  Allows reducers from other jobs to run while a big job waits on mappers o tasktracker.http.threads = 80  Specifies # threads used by the TT to serve map output to reducers o mapred.reduce.parallel.copies = SQRT(NodeCount) with a floor of 10  Specifies # parallel copies used by reducers to fetch map output o Stop using 6.1.26 Jetty, which is fetch-failure prone and upgrade to CDH3u2 (MAPREDUCE-2980), MAPREDUCE-3184
  • 28. 5. Not Able to Place Enough Replicas WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas
  • 29. 5. Not Able to Place Enough Replicas • What does it mean? o NN is unable to choose some DNs
  • 30. 5. Not Able to Place Enough Replicas • What causes this? o dfs replication > # avail DNs o # avail DNs is low due to low disk space o mapred.submit.replication default of 10 is too high o NN is unable to satisfy block replacement policy o If # racks > 2, a block has to exist in at least 2 racks o DN being decommissioned o DN has too much load on it o Block-size unreasonably large o Not enough xcievers threads  Default 256 threads that DN can manage is too low  Note the accidental misspelling
  • 31. 5. Not Able to Place Enough Replicas • How can it be resolved? o Set dfs.datanode.max.xcievers = 4096 then restart DN o Look for nodes down (or rack down) o Check disk space o Log dir may be filling up or a runaway task may fill up an entire disk o Rebalance
  • 32. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Exhaustion • TT OOME • JT OOME • Native Threads • Thread Exhaustion • Fetch Failures • Replicas • Data Space Exhaustion • No File • Too Many Files • Cloudera Manager
  • 33. 6. No Such File or Directory ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because ENOENT: No such file or directory at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java: 496) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at org.apache.hadoop.mapred.TaskTracker.initializeDirectories(TaskTracker.java:666) at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:734) at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1431) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3521)
  • 34. Total Storage MR space DFS space
  • 35. 6. No Such File or Directory • What does it mean? o TT failing to start or jobs are failing • What causes this? o Diskspace filling up on the TT disk o Permissions on the directories that TT writes to • How can it be resolved?  Job logging dir dfs.datanode.du.reserved = 10% total vol  Permissions should be 755 and owner should be mapred for userlogs and mapred.local.dir
  • 36. 7. Too Many Open Files ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(<ip address>:50010, storageID=<storage id>, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Too many open files at sun.nio.ch.IOUtil.initPipe(Native Method) at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49) at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readShort(DataInputStream.java:295) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:97)
  • 37. 7. Too Many Open Files • What does it mean? o Hitting open file handles limit of user account running Hadoop • What causes this? o Nofile default of 1024 files is too low • How can it be resolved? o Adjust /etc/security/limits.conf:  hdfs - nofile 32768  mapred - nofile 32768  hbase - nofile 32768  Restart DN, TT, JT, NN
  • 38. Agenda • Ticket Breakdown • What are Misconfigurations? • Memory Mismanagement • TT OOME • JT OOME • Native Threads • Thread Mismanagement • Fetch Failures • Replicas • Disk Mismanagement • No File • Too Many Files • Cloudera Manager
  • 39.
  • 40. Additional Resources • Avoiding Common Hadoop Administration Issues: http://www.cloudera.com/blog/2010/08/avoiding-common-hadoop- administration-issues/ • Tips for Improving MapReduce Performance: http://www.cloudera.com/blog/2009/12/7-tips-for-improving- mapreduce-performance/ • Basic Hardware Recommendations: http://www.cloudera.com/blog/2010/03/clouderas-support-team- shares-some-basic-hardware-recommendations/ • Cloudera Knowledge Base: http://ccp.cloudera.com/display/KB/Knowledge+Base
  • 41. Acknowledgements Adam Warrington Amr Awadallah Angus Klein Brock Noland Clint Heath Eric Sammer Esteban Gutierrez Jeff Bean Joey Echeverria Laura Gonzalez Linden Hillenbrand Omer Trajman Patrick Angeles
  • 42. Takeaways • Configuration is up to you. Bugs are up to the community. • Misconfigurations are hard to diagnose. • Get it right the first time with Cloudera Manager.

Editor's Notes

  1. You are sitting here because you know that misconfigurations and bugs break the most clusters. Fixing bugs is up to the community. Fixing misconfigurations is up to you. I guarantee if you wait an hour before walking out that you will leave with the know-how to get your configuration right the first time.
  2. hadoop errors show up far from where you configured so don&apos;t know what log files to analyze
  3. if your cluster is broken, probably a misconfiguration. this is a hard prob b/c the err msgs are not tightly tied to the root cause
  4. @tlipcon if sum(max heap size) &gt; physical RAM - 3GB, go directly to jail. do not pass go. do not collect $200.
  5. why not fix the defaults? b/c they are not static defaults, they are not all hadoop defaults but also OS defaults
  6. resource allocation is hard because the place where you are misallocating is far from where you ran out e.g. fetch failures could be due to system, config, or bug but not sure what from the error msg and you don&apos;t know which misconfig from err msg also ff doesn&apos;t show up on the node where it&apos;s configured
  7. one large cluster doesn&apos;t necessarily show all the different dependencies;cloudera has seen a lot of diverse operation clusters.part of what cloudera does is provide tools to help diagnose and understand how hadoop operates
  8. Not biased or anything but you can get it right the first time with ClouderaMgr