Hadoop Operations –
Best Practices from the Field
June 11, 2015
Chris Nauroth
email: cnauroth@hortonworks.com
twitter: @cnauroth
Suresh Srinivas
email: suresh@hortonworks.com
twitter: @suresh_m_s
© Hortonworks Inc. 2011
About Me
Chris Nauroth
• Member of Technical Staff, Hortonworks
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
• Hadoop user since 2010
– Prior employment experience deploying, maintaining and using Hadoop clusters
Architecting the Future of Big Data
Agenda
• Analysis of Hadoop Support Cases
– Support case trends
– Configuration
– Software Improvements
• Key Learnings and Best Practices
– HDFS ACLs
– HDFS Snapshots
– Reporting DataNode Volume Failures
Support Case Trends – Proportional Cases per Month
[Chart: proportion of support cases (0 to 1) per month over months 1 to 43, broken down by HDFS, MapReduce, YARN, and Other (37 components)]
Support Case Trends – Root Cause
[Chart: number of support cases (0 to 1200) by root cause category (Customer Environment (Non HDP), Documentation Defect, Documentation Gap, Documentation Not Utilized, Education - Configuration, Needs Training, Product Defect), broken down by YARN, MapReduce, and HDFS]
Support Case Trends
• Core Hadoop components (HDFS, YARN and MapReduce) are used across all deployments, and therefore receive proportionally more support cases than other ecosystem components.
• Misconfiguration is the dominant root cause.
• Documentation is a close second.
• We are constantly improving the code to eliminate operational issues, help with diagnosis and provide increased visibility.
• Best practices get incorporated into Apache Ambari for improved defaults, simplified configuration and deeper monitoring.
Configuration
Configuration - Hardware and Cluster Sizing
• Considerations
– Larger clusters heal faster after node or disk failure
– Machines with huge storage take longer to recover
– More racks give more failure domains
• Recommendations
– Get good-quality commodity hardware
– Buy the sweet spot in pricing: 3TB disks, 96GB RAM, 8-12 cores
– More memory is better – real-time processing is memory hungry!
– Before considering fatter machines (1U 6 disks vs. 2U 12 disks), get to 30-40 machines or 3-4 racks
– Use a pilot cluster to learn about load patterns
– Balance hardware for I/O-, compute- or memory-bound workloads
– More details - http://tinyurl.com/hwx-hadoop-hw
Configuration – JVM Tuning
• Avoid JVM issues
– Use a 64-bit JVM for all daemons
– Compressed OOPs are enabled by default (Java 6u23 and later)
– Java heap size
– Set the same max and starting heap size: Xmx == Xms
– Avoid Java defaults: configure NewSize and MaxNewSize
– Use 1/8 to 1/6 of max heap size for JVMs with heaps larger than 4 GB
– Configure -XX:PermSize=128m, -XX:MaxPermSize=256m
– Use a low-latency GC collector
– -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
– High <N> on NameNode and JobTracker or ResourceManager
– Important JVM configs to help debugging
– -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
– -XX:ErrorFile=<file>
– -XX:+HeapDumpOnOutOfMemoryError
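The heap and GC guidance above can be sketched as a small helper that assembles a JVM option string. This is an illustrative sketch, not an official tool: the function name `jvm_opts`, the 1/8 young-generation fraction (the low end of the 1/8 to 1/6 range), the GC log path and the example heap size are all assumptions; the flags themselves are the ones listed on the slide.

```python
def jvm_opts(heap_mb, gc_threads, gc_log="/var/log/hadoop/gc.log", young_fraction=1 / 8):
    """Assemble JVM flags following the slide's guidance (hypothetical helper)."""
    # Young generation sized as a fraction of the max heap (1/8 to 1/6 per the slide).
    young_mb = int(heap_mb * young_fraction)
    flags = [
        f"-Xms{heap_mb}m",              # starting heap equals max heap
        f"-Xmx{heap_mb}m",
        f"-XX:NewSize={young_mb}m",     # avoid JVM defaults for the young generation
        f"-XX:MaxNewSize={young_mb}m",
        "-XX:+UseConcMarkSweepGC",      # low-latency collector named on the slide
        f"-XX:ParallelGCThreads={gc_threads}",
        "-verbose:gc",                  # GC logging to help debugging
        f"-Xloggc:{gc_log}",
        "-XX:+PrintGCDetails",
        "-XX:+HeapDumpOnOutOfMemoryError",
    ]
    return " ".join(flags)


if __name__ == "__main__":
    # Assumed example values: an 8 GB heap with 8 parallel GC threads.
    print(jvm_opts(8192, 8))
```

The resulting string could be placed in a daemon's options variable in hadoop-env.sh; which variable applies depends on the daemon and the Hadoop version in use.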
Configuration
• Deploy with QuorumJournalManager for high availability
• Configure open fd ulimit
– Default 1024 is too low
– 16K for datanodes, 64K for Master nodes
• Use version control for configuration!
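The open file descriptor recommendation can be checked with a short script. The sketch below is an assumption-laden illustration: the role names `datanode` and `master` and the helper `nofile_ok` are hypothetical, and it inspects the limit of the current process, so it would need to run as the same user (and under the same login settings) as the Hadoop daemon to be meaningful.

```python
import resource

# Recommended minimums from the slide: 16K for DataNodes, 64K for master nodes.
RECOMMENDED_NOFILE = {"datanode": 16 * 1024, "master": 64 * 1024}


def nofile_ok(soft_limit, role):
    """Return True if the soft open-file limit meets the slide's recommendation."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= RECOMMENDED_NOFILE[role]


if __name__ == "__main__":
    # Soft limit of the current process (Unix only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    for role in RECOMMENDED_NOFILE:
        status = "ok" if nofile_ok(soft, role) else f"too low ({soft})"
        print(f"{role}: {status}")
```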
Configuration
• Use disk fail in place for datanodes: dfs.datanode.failed.volumes.tolerated
– Disk failure is no longer datanode failure
– Especially important for large density nodes
• Set dfs.namenode.name.dir.restore to true
– Restores NN storage directory during checkpointing
• Take periodic backups of namenode metadata
– Make copies of the entire storage directory
• Set aside a lot of disk space for NN logs
– NN logging is verbose: set aside multiple GBs
– Many installs configure this too small
– NN logs then roll within minutes, which makes issues hard to debug
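The two properties named above live in hdfs-site.xml. As a rough sketch, the snippet below renders them in Hadoop's `<configuration>` XML layout. The helper name `hdfs_site_fragment` is hypothetical, and tolerating 1 failed volume is an assumed example value; the right number depends on how many disks each DataNode has.

```python
import xml.etree.ElementTree as ET


def hdfs_site_fragment(props):
    """Render a dict of properties as a Hadoop <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hdfs_site_fragment({
        # Assumed value: keep the DataNode running after one failed volume.
        "dfs.datanode.failed.volumes.tolerated": 1,
        # Restore failed NameNode storage directories during checkpointing.
        "dfs.namenode.name.dir.restore": "true",
    }))
```

Generating the fragment from a script keeps the values in one reviewable place, which fits the earlier advice to keep configuration under version control.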
Configuration – Monitoring Usage
• Cluster storage, nodes, files and blocks grow over time
– Update NN heap, handler count, number of DN xceivers
– Tweak other related config periodically
• Monitor the hardware usage for your work load
– Disk I/O, network I/O, CPU and memory usage
– Use this information when expanding cluster capacity
• Monitor usage with Hadoop metrics
– JVM metrics – GC times, Memory used, Thread Status
– RPC metrics – especially latency to track slowdowns
– HDFS metrics
– Used storage, # of files and blocks, total load on the cluster
– File System operations
– MapReduce Metrics
– Slot utilization and Job status
• Tweak configurations during upgrades/maintenance on an ongoing basis
Install & Configure: Ambari Guided Configuration
Guide configuration and provide recommendations for the most common settings.
(HBase Example Shown here)
Software Improvements
Real Incidents and Software Improvements to Address Them
Don’t edit the metadata files!
• Editing can corrupt the cluster state
– Might result in loss of data
• Real incident
– NN misconfigured to point to another NN’s metadata
– DNs can’t register due to namespace ID mismatch
– System detected the problem correctly
– Safety net ignored by the admin!
– Admin edits the namenode VERSION file to match ids
Mass deletion of unknown blocks that do not exist in that namespace
Improvement
• Pause deletion of blocks when the namenode starts up
– https://issues.apache.org/jira/browse/HDFS-6186
– Supports configurable delay of block deletions after NameNode startup
– Gives an admin extra time to diagnose before deletions begin
• Show when block deletion will start after NameNode startup in WebUI
– https://issues.apache.org/jira/browse/HDFS-6385
– The web UI already displayed the number of pending block deletions
– This enhanced the display to indicate when actual deletion will begin
Block Deletion Start Time
[Screenshot: NameNode web UI, with the block deletion start time marked as new]
Guard Against Accidental Deletion
• rm -r deletes the data at the speed of Hadoop!
– Ctrl-C of the command does not stop deletion!
– Undeleting files on datanodes is hard and time consuming
– Immediately shut down the NN and unmount disks on datanodes
– Recover deleted files
– Start namenode without the delete operation in edits
• Enable Trash
• Real Incident
– Customer is running a distro of Hadoop with trash not enabled
– Deletes a large dir (100 TB) and shuts down NN immediately
– Support person asks NN to be restarted to see if trash is enabled!
Blocks start deleting
Improvement
• HDFS Snapshots
– https://issues.apache.org/jira/browse/HDFS-2802
– A snapshot is a read-only point-in-time image of part of the file system
– A snapshot created before a deletion can be used to restore deleted data
– More coverage of snapshots later in the presentation
• HDFS ACLs
– https://issues.apache.org/jira/browse/HDFS-4685
– Finer-grained control of file permissions can help prevent an accidental deletion
– More coverage of ACLs later in the presentation
Unexpected error during HA HDFS upgrade
• Background: HDFS HA Architecture
– http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
• Real Incident
– During upgrade, NameNode calls every JournalNode to request backup of metadata directory, which renames “current” directory to “previous.tmp”.
– Permissions incorrect on metadata directory for 1 out of 3 JournalNodes.
– The hdfs user is not authorized to rename. Backup fails for that JournalNode, so upgrade process aborts with error.
Root cause not easily identifiable, long time to recover
Improvement
• Improve diagnostics on storage directory rename operations by using native code.
– https://issues.apache.org/jira/browse/HDFS-7118
– Logs additional root cause information for rename failure, for example EACCES
• Split error checks into separate conditions to improve diagnostics.
– https://issues.apache.org/jira/browse/HDFS-7119
– Splits a log message about failure to delete or rename into separate log messages to clarify which specific action failed
• When aborting NameNode or JournalNode, write the contents of the metadata directories and permissions to logs.
– https://issues.apache.org/jira/browse/HDFS-7120
– This is usually the first information asked of the user, so we can automate it
• For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed.
– https://issues.apache.org/jira/browse/HDFS-7121
– Prevents need for manual cleanup on 2 out of 3 JournalNodes where backup succeeded
Key Learnings and Best Practices
Features that Help Improve Production Operations
HDFS ACLs
• Existing HDFS POSIX permissions are good, but not flexible enough
– Permission requirements may differ from the natural organizational hierarchy of users and groups.
• HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model.
– An ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file’s owner and file’s group.
HDFS File Permissions Example
• Authorization requirements:
– The sales department wants a single user, Maya (Department Manager), to control all modifications to sales data.
– Other members of the sales department need to view the data, but can’t modify it.
– Everyone else in the company must not be allowed to view the data.
• Can be implemented via the following:
[Diagram: file with sales data. User: read/write permission for user maya. Group: read permission for group sales.]
HDFS ACLs
• Problem
–No longer feasible for Maya to control all modifications to the file
– New Requirement: Maya, Diane and Clark are allowed to make modifications
– New Requirement: New group called executives should be able to read the sales data
–Current permissions model only allows permissions for 1 user and 1 group
• Solution: HDFS ACLs
–Now assign different permissions to different users and groups
[Diagram: an HDFS directory with rwx permission entries for Owner, Group and Others, plus additional ACL entries for Group D, Group F and User Y]
HDFS ACLs
New Tools for ACL Management (setfacl, getfacl)
– hdfs dfs -setfacl -m group:execs:r-- /sales-data
– hdfs dfs -getfacl /sales-data
# file: /sales-data
# owner: maya
# group: sales
user::rw-
group::r--
group:execs:r--
mask::r--
other::---
– How do you know if a directory has ACLs set?
– hdfs dfs -ls /sales-data
Found 1 items
-rw-r-----+ 3 maya sales 0 2014-03-04 16:31 /sales-data
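The new requirements from the earlier slide (Maya, Diane and Clark may modify; the executives group may read) could be expressed as one setfacl call. The sketch below only builds the argument list and does not run it. The helper names and the lowercase account names `diane` and `clark` are assumptions for illustration; `execs` and `/sales-data` come from the slide.

```python
def acl_spec(users=None, groups=None):
    """Build a setfacl ACL spec such as 'user:diane:rw-,group:execs:r--'."""
    entries = []
    for name, perms in (users or {}).items():
        entries.append(f"user:{name}:{perms}")
    for name, perms in (groups or {}).items():
        entries.append(f"group:{name}:{perms}")
    return ",".join(entries)


def setfacl_command(path, users=None, groups=None):
    """Return the argument list for 'hdfs dfs -setfacl -m <spec> <path>'."""
    return ["hdfs", "dfs", "-setfacl", "-m", acl_spec(users, groups), path]


def has_acl(ls_line):
    """True if an 'hdfs dfs -ls' line carries the '+' ACL marker after the permission bits."""
    return ls_line.split()[0].endswith("+")


if __name__ == "__main__":
    # Hypothetical account names; maya already has rw- as the file owner.
    cmd = setfacl_command(
        "/sales-data",
        users={"diane": "rw-", "clark": "rw-"},
        groups={"execs": "r--"},
    )
    print(" ".join(cmd))
```

The `has_acl` check mirrors the slide's tip: a trailing `+` on the permission string in `-ls` output indicates that an ACL is set.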
HDFS ACLs Best Practices
• Start with traditional HDFS permissions to implement most permission requirements.
• Define a smaller number of ACLs to handle exceptional cases.
• A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only traditional permissions.
HDFS Snapshots
• HDFS Snapshots
– A snapshot is a read-only point-in-time image of part of the file system
– Performance: snapshot creation is instantaneous, regardless of data size or subtree depth
– Reliability: snapshot creation is atomic
– Scalability: snapshots do not create extra copies of data blocks
– Useful for protecting against accidental deletion of data
• Example: Daily Feeds
hdfs dfs -ls /daily-feeds
Found 5 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-16
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
HDFS Snapshots
• Create a snapshot after each daily load
hdfs dfsadmin -allowSnapshot /daily-feeds
Allowing snaphot on /daily-feeds succeeded
hdfs dfs -createSnapshot /daily-feeds snapshot-to-2014-10-17
Created snapshot /daily-feeds/.snapshot/snapshot-to-2014-10-17
• User accidentally deletes data for 2014-10-16
hdfs dfs -ls /daily-feeds
Found 4 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
HDFS Snapshots
• Snapshots to the rescue: the data is still in the snapshot
hdfs dfs -ls /daily-feeds/.snapshot/snapshot-to-2014-10-17
Found 5 items
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-13
drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-14
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-15
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16
drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-17
• Restore data from 2014-10-16
hdfs dfs -cp /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16 /daily-feeds
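The daily-feeds walk-through could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the `snapshot-to-<date>` naming simply follows the example on the slides, and the functions only build argument lists (for use with something like `subprocess.run`) without executing anything.

```python
from datetime import date


def snapshot_name(day):
    """Name a snapshot after the last day it covers, e.g. snapshot-to-2014-10-17."""
    return f"snapshot-to-{day.isoformat()}"


def create_snapshot_command(directory, day):
    """Argument list for 'hdfs dfs -createSnapshot <dir> <name>'."""
    return ["hdfs", "dfs", "-createSnapshot", directory, snapshot_name(day)]


def restore_command(directory, day, lost_entry):
    """Argument list that copies one entry out of a snapshot back into the live directory."""
    source = f"{directory}/.snapshot/{snapshot_name(day)}/{lost_entry}"
    return ["hdfs", "dfs", "-cp", source, directory]


if __name__ == "__main__":
    load_day = date(2014, 10, 17)
    print(" ".join(create_snapshot_command("/daily-feeds", load_day)))
    print(" ".join(restore_command("/daily-feeds", load_day, "2014-10-16")))
```

The directory must already be snapshottable (the `hdfs dfsadmin -allowSnapshot` step shown on the earlier slide) before the create command can succeed.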
Reporting DataNode Volume Failures
• Configuring dfs.datanode.failed.volumes.tolerated > 0 enables a DataNode to keep running after volume failures
• DataNode is still running, but capacity is degraded
• HDFS already provided a count of failed volumes for each DataNode, but no further details
• Apache Hadoop 2.7.0 provides more information: failed path, estimated lost capacity and failure date/time
• An administrator can use this information to prioritize cluster maintenance work
[Screenshots: three slides of web UI views reporting DataNode volume failures, one marked as new]
Reporting DataNode Volume Failures
• Everything in the web UI is sourced from standardized Hadoop metrics
– Each DataNode publishes its own metrics
– NameNode publishes aggregate information from every DataNode
• Metrics accessible through JMX or the HTTP /jmx URI
• Integrated in Ambari
• Can be integrated into your preferred management tools and ops dashboards
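Because the metrics are served as JSON from the /jmx URI, a dashboard integration can be a few lines of parsing. The sketch below works on an already-fetched JSON string. The bean name `Hadoop:service=NameNode,name=FSNamesystemState` and the attribute names `VolumeFailuresTotal` and `EstimatedCapacityLostTotal` are assumptions about what the NameNode publishes; confirm them against the /jmx output of your own cluster. The helper name and the sample values are hypothetical.

```python
import json

# Assumed bean name; check your NameNode's /jmx output.
FSN_STATE_BEAN = "Hadoop:service=NameNode,name=FSNamesystemState"


def volume_failure_summary(jmx_json):
    """Extract volume-failure attributes from a NameNode /jmx JSON document."""
    beans = json.loads(jmx_json).get("beans", [])
    for bean in beans:
        if bean.get("name") == FSN_STATE_BEAN:
            return {
                # Attribute names are assumptions, not confirmed by the slides.
                "failed_volumes": bean.get("VolumeFailuresTotal"),
                "estimated_capacity_lost": bean.get("EstimatedCapacityLostTotal"),
            }
    return None


if __name__ == "__main__":
    # Made-up sample document in the {"beans": [...]} layout that /jmx returns.
    sample = json.dumps({"beans": [{
        "name": FSN_STATE_BEAN,
        "VolumeFailuresTotal": 2,
        "EstimatedCapacityLostTotal": 1000,
    }]})
    print(volume_failure_summary(sample))
```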
New System to Manage the Health of Hadoop Clusters
• Ambari Alerts are installed and configured by default
• Health Alerts and Metrics managed via Ambari Web
Summary
• Configuration
– Prevent garbage collection issues
– Configure for redundancy
– Retune configuration in response to metrics
• HDFS ACLs
– Implement fine-grained authorization rules on files
– Can protect against accidental file manipulations
• HDFS Snapshots
– Point-in-time image of part of the filesystem
– Useful for restoring to a prior state after accidental file manipulation
• Reporting DataNode Volume Failures
– Metrics and web UI exposing information about volume failures on DataNodes
– Useful for planning cluster maintenance work
• Use Ambari
– Helps install, configure, monitor and manage Hadoop clusters
– Incorporates the latest best practices
Thank you, Q&A
Resource – Location
• Hardware Recommendations for Apache Hadoop – http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
• HDFS operational and debuggability improvements – https://issues.apache.org/jira/browse/HDFS-6185
• HDFS ACLs Blog Post – http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/
• HDFS Snapshots Blog Post – http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/
Learn more
Contact me with your operations questions and suggestions
Chris Nauroth – cnauroth@hortonworks.com

 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Establishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan LawEstablishing Environment Best Practices T12 Brendan Law
Establishing Environment Best Practices T12 Brendan Law
 
Best And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM ConnectionsBest And Worst Practices Deploying IBM Connections
Best And Worst Practices Deploying IBM Connections
 
Backup netezza-tsm-v1403c-140330170451-phpapp01
Backup netezza-tsm-v1403c-140330170451-phpapp01Backup netezza-tsm-v1403c-140330170451-phpapp01
Backup netezza-tsm-v1403c-140330170451-phpapp01
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Hadoop Operations - Best Practices from the Field

  • 1. Hadoop Operations – Best Practices from the Field
      June 11, 2015
      Chris Nauroth (email: cnauroth@hortonworks.com, twitter: @cnauroth)
      Suresh Srinivas (email: suresh@hortonworks.com, twitter: @suresh_m_s)
  • 2. About Me: Chris Nauroth
      • Member of Technical Staff, Hortonworks
          – Apache Hadoop committer, PMC member, and Apache Software Foundation member
          – Major contributor to HDFS ACLs, Windows compatibility, and operability improvements
      • Hadoop user since 2010
          – Prior employment experience deploying, maintaining, and using Hadoop clusters
  • 3. Agenda
      • Analysis of Hadoop Support Cases
          – Support case trends
          – Configuration
          – Software Improvements
      • Key Learnings and Best Practices
          – HDFS ACLs
          – HDFS Snapshots
          – Reporting DataNode Volume Failures
  • 4. Support Case Trends – Proportional Cases per Month
      [Chart: proportion of support cases (y-axis, 0 to 1) for periods 1 to 43 (x-axis), broken down by HDFS, MapReduce, YARN, and Other (37 components)]
  • 5. Support Case Trends – Root Cause
      [Chart: number of support cases (y-axis, 0 to 1200) by root cause, broken down by YARN, MapReduce, and HDFS. Root-cause categories: Customer Environment (Non HDP), Documentation Defect, Documentation Gap, Documentation Not Utilized, Education - Configuration, Needs Training, Product Defect]
  • 6. Support Case Trends
      • Core Hadoop components (HDFS, YARN, and MapReduce) are used across all deployments, and therefore receive proportionally more support cases than other ecosystem components.
      • Misconfiguration is the dominant root cause.
      • Documentation is a close second.
      • We are constantly improving the code to eliminate operational issues, help with diagnosis, and provide increased visibility.
      • Best practices get incorporated into Apache Ambari for improved defaults, simplified configuration, and deeper monitoring.
  • 8. Configuration – Hardware and Cluster Sizing
      • Considerations
          – Larger clusters heal faster after a node or disk failure
          – Machines with huge storage take longer to recover
          – More racks give more failure domains
      • Recommendations
          – Get good-quality commodity hardware
          – Buy at the pricing sweet spot: 3 TB disks, 96 GB memory, 8-12 cores
          – More memory is better: real-time workloads are memory hungry!
          – Before considering fatter machines (1U with 6 disks vs. 2U with 12 disks), get to 30-40 machines or 3-4 racks
          – Use a pilot cluster to learn about load patterns
          – Balance hardware for I/O-, compute-, or memory-bound workloads
          – More details: http://tinyurl.com/hwx-hadoop-hw
  • 9. Configuration – JVM Tuning
      • Avoid JVM issues
          – Use a 64-bit JVM for all daemons
              – Compressed OOPs are enabled by default (Java 6u23 and later)
          – Java heap size
              – Set the same maximum and starting heap size: Xmx == Xms
              – Avoid the Java defaults: configure NewSize and MaxNewSize
              – Use 1/8 to 1/6 of the maximum heap size for JVMs larger than 4 GB
              – Configure -XX:PermSize=128m and -XX:MaxPermSize=256m
          – Use a low-latency garbage collector
              – -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads=<N>
              – Use a high <N> on the NameNode and the JobTracker or ResourceManager
      • Important JVM configs to help debugging
          – -verbose:gc -Xloggc:<file> -XX:+PrintGCDetails
          – -XX:ErrorFile=<file>
          – -XX:+HeapDumpOnOutOfMemoryError
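The flags on this slide could be assembled in hadoop-env.sh roughly as follows. This is a minimal sketch, not a recommended configuration: the 8 GB heap, the GC thread count of 8, the log directory, and the log file names are all assumptions chosen for illustration. The CMS and PermGen flags belong to the older JVMs this 2015 talk targets; PermGen was removed in Java 8 and CMS in later JDKs.

```shell
#!/usr/bin/env bash
# Sketch: NameNode JVM options following the slide's guidance.
# HEAP_MB, the thread count, and LOG_DIR are illustrative assumptions.
HEAP_MB=8192
NEW_MB=$((HEAP_MB / 8))   # young generation sized at 1/8 of the max heap
LOG_DIR="${HADOOP_LOG_DIR:-/var/log/hadoop}"

# Same starting and maximum heap size (Xms == Xmx).
NN_OPTS="-Xms${HEAP_MB}m -Xmx${HEAP_MB}m"
# Explicit young-generation size instead of the JVM default.
NN_OPTS="$NN_OPTS -XX:NewSize=${NEW_MB}m -XX:MaxNewSize=${NEW_MB}m"
# Low-latency collector with an explicit GC thread count.
NN_OPTS="$NN_OPTS -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8"
# Debugging aids: GC log, fatal error file, heap dump on OOM.
NN_OPTS="$NN_OPTS -verbose:gc -Xloggc:${LOG_DIR}/nn-gc.log -XX:+PrintGCDetails"
NN_OPTS="$NN_OPTS -XX:ErrorFile=${LOG_DIR}/nn-hs_err_%p.log"
NN_OPTS="$NN_OPTS -XX:+HeapDumpOnOutOfMemoryError"

export HADOOP_NAMENODE_OPTS="$NN_OPTS ${HADOOP_NAMENODE_OPTS:-}"
echo "$HADOOP_NAMENODE_OPTS"
```

Deriving NewSize from the heap variable keeps the two in step when the heap is resized later.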
  • 10. Configuration
      • Deploy with QuorumJournalManager for high availability
      • Configure the open file descriptor ulimit
          – The default of 1024 is too low
          – 16K for DataNodes, 64K for master nodes
      • Use version control for configuration!
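A quick way to compare a host's open-file limit against the targets on this slide is sketched below. The helper name `check_fd_limit` is hypothetical, and the check only covers the shell that runs it, which may differ from the limit the Hadoop daemons actually inherit.

```shell
#!/usr/bin/env bash
# Sketch: compare an open-file limit against a required minimum.
# 16384 is the slide's DataNode target; use 65536 for master nodes.
required=16384

# check_fd_limit HAVE WANT prints "ok" or "too low".
check_fd_limit() {
  local have="$1" want="$2"
  if [ "$have" = "unlimited" ] || [ "$have" -ge "$want" ]; then
    echo "ok"
  else
    echo "too low"
  fi
}

current=$(ulimit -n)   # soft limit on open files for this shell
echo "open files: $current ($(check_fd_limit "$current" "$required"))"
```

On Linux, a persistent limit is usually set outside the shell, for example through a `nofile` entry in /etc/security/limits.conf for the account that runs the daemons.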
  • 11. Configuration
      • Use disk fail-in-place for DataNodes: dfs.datanode.failed.volumes.tolerated
          – Disk failure is no longer DataNode failure
          – Especially important for high-density nodes
      • Set dfs.namenode.name.dir.restore to true
          – Restores the NN storage directory during checkpointing
      • Take periodic backups of NameNode metadata
          – Make copies of the entire storage directory
      • Set aside a lot of disk space for NN logs
          – The NN is verbose: set aside multiple GBs
          – Many installs configure this too small
          – NN logs then roll within minutes, which makes issues hard to debug
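To see what a node's configuration files resolve these two properties to, something like the following sketch could be used. It relies on `hdfs getconf -confKey`, which reads the local configuration, and falls back to a placeholder when the `hdfs` CLI is not on the PATH. The function name `show_conf` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the effective values of the properties named on the slide.
keys="dfs.datanode.failed.volumes.tolerated dfs.namenode.name.dir.restore"

show_conf() {
  local key
  for key in $keys; do
    if command -v hdfs >/dev/null 2>&1; then
      # Reads the value from the configuration files on this host.
      echo "$key=$(hdfs getconf -confKey "$key" 2>/dev/null)"
    else
      echo "$key=<hdfs CLI not found>"
    fi
  done
}

show_conf
```

Running this on each DataNode and NameNode host is one way to spot nodes whose configuration has drifted from the version-controlled copy.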
  • 12. Configuration – Monitoring Usage
      • As cluster storage, nodes, files, and blocks grow
          – Update the NN heap, handler count, and number of DN xceivers
          – Tweak other related configuration periodically
      • Monitor the hardware usage for your workload
          – Disk I/O, network I/O, CPU, and memory usage
          – Use this information when expanding cluster capacity
      • Monitor usage with Hadoop metrics
          – JVM metrics: GC times, memory used, thread status
          – RPC metrics: especially latency, to track slowdowns
          – HDFS metrics: used storage, number of files and blocks, total load on the cluster
          – File system operations
          – MapReduce metrics: slot utilization and job status
      • Tweak configurations during upgrades and maintenance on an ongoing basis
  • 13. Install & Configure: Ambari Guided Configuration
      Guide configuration and provide recommendations for the most common settings. (HBase example shown here)
  • 14. Software Improvements: Real Incidents and Software Improvements to Address Them
  • 15. Don’t edit the metadata files!
      • Editing can corrupt the cluster state
          – Might result in loss of data
      • Real incident
          – NN misconfigured to point to another NN’s metadata
          – DNs can’t register due to namespace ID mismatch
          – The system detected the problem correctly
          – Safety net ignored by the admin!
          – Admin edits the NameNode VERSION file to match the IDs
      • Result: mass deletion of unknown blocks that do not exist in that namespace
  • 16. Improvement
      • Pause deletion of blocks when the NameNode starts up
          – https://issues.apache.org/jira/browse/HDFS-6186
          – Supports a configurable delay of block deletions after NameNode startup
          – Gives an admin extra time to diagnose before deletions begin
      • Show in the web UI when block deletion will start after NameNode startup
          – https://issues.apache.org/jira/browse/HDFS-6385
          – The web UI already displayed the number of pending block deletions
          – This enhanced the display to indicate when actual deletion will begin
  • 17. Block Deletion Start Time
      [Screenshot: web UI view with the new block deletion start time field highlighted]
  • 18. Guard Against Accidental Deletion
      • rm -r deletes the data at the speed of Hadoop!
          – Ctrl-C of the command does not stop deletion!
          – Undeleting files on DataNodes is hard and time consuming
              – Immediately shut down the NN and unmount disks on DataNodes
              – Recover deleted files
              – Start the NameNode without the delete operation in edits
      • Enable Trash
      • Real incident
          – Customer is running a distro of Hadoop with trash not enabled
          – Deletes a large directory (100 TB) and shuts down the NN immediately
          – Support person asks for the NN to be restarted to see if trash is enabled!
      • Result: blocks start deleting
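Whether trash is enabled can be checked from configuration, without restarting anything, by reading `fs.trash.interval` (a value in minutes, where 0 means trash is disabled). The sketch below assumes the `hdfs` CLI is available on the host and that the local configuration matches what the cluster uses; the helper name `trash_status` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: interpret fs.trash.interval (minutes; 0 disables trash).
trash_status() {
  local minutes="$1"
  # A non-numeric or empty value falls through to the "disabled" branch.
  if [ "$minutes" -gt 0 ] 2>/dev/null; then
    echo "trash enabled: deleted files kept for $minutes minutes"
  else
    echo "trash DISABLED: rm -r deletes immediately"
  fi
}

if command -v hdfs >/dev/null 2>&1; then
  trash_status "$(hdfs getconf -confKey fs.trash.interval 2>/dev/null)"
else
  echo "hdfs CLI not found; skipping live check"
fi
```

Even with trash enabled, `hdfs dfs -rm -r -skipTrash` bypasses it, so the check describes the default behavior only.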
  • 19. Improvement
      • HDFS Snapshots
          – https://issues.apache.org/jira/browse/HDFS-2802
          – A snapshot is a read-only point-in-time image of part of the file system
          – A snapshot created before a deletion can be used to restore deleted data
          – More coverage of snapshots later in the presentation
      • HDFS ACLs
          – https://issues.apache.org/jira/browse/HDFS-4685
          – Finer-grained control of file permissions can help prevent an accidental deletion
          – More coverage of ACLs later in the presentation
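The snapshot-then-restore flow described on this slide might look like the following sketch. The directory `/data/sales`, the snapshot name `before-cleanup`, and the file `q1.csv` are hypothetical. By default the script only prints the commands (dry run); setting `DRY_RUN=0` would execute them against a real cluster.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the hdfs commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# snapshot_and_restore DIR SNAPSHOT FILE
snapshot_and_restore() {
  local dir="$1" snap="$2" file="$3"
  # An administrator must first mark the directory as snapshottable.
  run hdfs dfsadmin -allowSnapshot "$dir"
  # Take a read-only point-in-time image before risky operations.
  run hdfs dfs -createSnapshot "$dir" "$snap"
  # After an accidental deletion, copy the file back out of the snapshot.
  run hdfs dfs -cp "$dir/.snapshot/$snap/$file" "$dir/$file"
}

snapshot_and_restore /data/sales before-cleanup q1.csv
```

Snapshots are exposed under the reserved `.snapshot` path of the snapshottable directory, which is why restoring is an ordinary copy.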
  • 20. Unexpected error during HA HDFS upgrade
      • Background: HDFS HA Architecture
          – http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
      • Real incident
          – During upgrade, the NameNode calls every JournalNode to request a backup of the metadata directory, which renames the “current” directory to “previous.tmp”.
          – Permissions were incorrect on the metadata directory for 1 out of 3 JournalNodes.
          – The hdfs user is not authorized to rename. The backup fails for that JournalNode, so the upgrade process aborts with an error.
      • Result: root cause not easily identifiable, long time to recover
  • 21. Improvement
      • Improve diagnostics on storage directory rename operations by using native code.
          – https://issues.apache.org/jira/browse/HDFS-7118
          – Logs additional root cause information for a rename failure, for example EACCES
      • Split error checks into separate conditions to improve diagnostics.
          – https://issues.apache.org/jira/browse/HDFS-7119
          – Splits a log message about failure to delete or rename into separate log messages to clarify which specific action failed
      • When aborting the NameNode or JournalNode, write the contents of the metadata directories and their permissions to the logs.
          – https://issues.apache.org/jira/browse/HDFS-7120
          – This is usually the first information asked of the user, so we can automate it
      • For JournalNode operations that must succeed on all nodes, execute a pre-check to verify that the operation can succeed.
          – https://issues.apache.org/jira/browse/HDFS-7121
          – Prevents the need for manual cleanup on the 2 out of 3 JournalNodes where the backup succeeded
  • 22. Key Learnings and Best Practices: Features that Help Improve Production Operations
  • 23. HDFS ACLs
    • Existing HDFS POSIX permissions are good, but not flexible enough
      – Permission requirements may differ from the natural organizational hierarchy of users and groups.
    • HDFS ACLs augment the existing HDFS POSIX permissions model by implementing the POSIX ACL model.
      – An ACL (Access Control List) provides a way to set different permissions for specific named users or named groups, not only the file’s owner and file’s group.
  • 24. HDFS File Permissions Example
    • Authorization requirements:
      – In a sales department, they would like a single user Maya (Department Manager) to control all modifications to sales data
      – Other members of sales department need to view the data, but can’t modify it.
      – Everyone else in the company must not be allowed to view the data.
    • Can be implemented via the following permissions on the file with sales data:
      – User: read/write permission for user maya
      – Group: read permission for group sales
  • 25. HDFS ACLs
    • Problem
      – No longer feasible for Maya to control all modifications to the file
      – New Requirement: Maya, Diane and Clark are allowed to make modifications
      – New Requirement: New group called executives should be able to read the sales data
      – Current permissions model only allows permissions for 1 user and 1 group
    • Solution: HDFS ACLs
      – Now assign different permissions to different users and groups
    • [Diagram: an HDFS directory with rwx permission entries for Owner, Group and Others, plus additional ACL entries for Group D, Group F and User Y]
  • 26. HDFS ACLs
    • New Tools for ACL Management (setfacl, getfacl)
      – hdfs dfs -setfacl -m group:execs:r-- /sales-data
      – hdfs dfs -getfacl /sales-data
          # file: /sales-data
          # owner: maya
          # group: sales
          user::rw-
          group::r--
          group:execs:r--
          mask::r--
          other::---
      – How do you know if a file or directory has ACLs set? The + appended to the permission string in the listing indicates an ACL.
      – hdfs dfs -ls /sales-data
          Found 1 items
          -rw-r-----+ 3 maya sales 0 2014-03-04 16:31 /sales-data
  • 27. HDFS ACLs Best Practices
    • Start with traditional HDFS permissions to implement most permission requirements.
    • Define a smaller number of ACLs to handle exceptional cases.
    • A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only traditional permissions.
  • 28. HDFS Snapshots
    • HDFS Snapshots
      – A snapshot is a read-only point-in-time image of part of the file system
      – Performance: snapshot creation is instantaneous, regardless of data size or subtree depth
      – Reliability: snapshot creation is atomic
      – Scalability: snapshots do not create extra copies of data blocks
      – Useful for protecting against accidental deletion of data
    • Example: Daily Feeds
        hdfs dfs -ls /daily-feeds
        Found 5 items
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-16
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
  • 29. HDFS Snapshots
    • Create a snapshot after each daily load
        hdfs dfsadmin -allowSnapshot /daily-feeds
        Allowing snaphot on /daily-feeds succeeded
        hdfs dfs -createSnapshot /daily-feeds snapshot-to-2014-10-17
        Created snapshot /daily-feeds/.snapshot/snapshot-to-2014-10-17
    • User accidentally deletes data for 2014-10-16
        hdfs dfs -ls /daily-feeds
        Found 4 items
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-13
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/2014-10-14
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-15
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/2014-10-17
  • 30. HDFS Snapshots
    • Snapshots to the rescue: the data is still in the snapshot
        hdfs dfs -ls /daily-feeds/.snapshot/snapshot-to-2014-10-17
        Found 5 items
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-13
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:36 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-14
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-15
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16
        drwxr-xr-x - chris supergroup 0 2014-10-13 14:37 /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-17
    • Restore data from 2014-10-16
        hdfs dfs -cp /daily-feeds/.snapshot/snapshot-to-2014-10-17/2014-10-16 /daily-feeds
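The restore on the slides was done by eye: compare the live listing with the snapshot listing, then copy back what is missing. A minimal sketch of that comparison follows, assuming the two listings have already been collected as lists of paths. The function names are hypothetical, and the sketch only builds the `hdfs dfs -cp` command strings in the form shown on the slide; it does not execute anything against a cluster.

```python
import posixpath


def missing_from_live(snapshot_paths, live_paths, snapshot_root, live_root):
    """Names that exist under the snapshot root but not under the live directory."""
    in_snapshot = {posixpath.relpath(p, snapshot_root) for p in snapshot_paths}
    in_live = {posixpath.relpath(p, live_root) for p in live_paths}
    return sorted(in_snapshot - in_live)


def restore_commands(names, snapshot_root, live_root):
    """Build one copy command per missing entry, mirroring the slide's restore step."""
    return [f"hdfs dfs -cp {snapshot_root}/{name} {live_root}" for name in names]


LIVE_ROOT = "/daily-feeds"
SNAPSHOT_ROOT = "/daily-feeds/.snapshot/snapshot-to-2014-10-17"
DAYS = ["2014-10-13", "2014-10-14", "2014-10-15", "2014-10-16", "2014-10-17"]

# Listings as they appear on the slides: 2014-10-16 is gone from the live directory.
snapshot_listing = [f"{SNAPSHOT_ROOT}/{day}" for day in DAYS]
live_listing = [f"{LIVE_ROOT}/{day}" for day in DAYS if day != "2014-10-16"]

missing = missing_from_live(snapshot_listing, live_listing, SNAPSHOT_ROOT, LIVE_ROOT)
for command in restore_commands(missing, SNAPSHOT_ROOT, LIVE_ROOT):
    print(command)
```

Because the sketch compares top-level names only, a directory that still exists but has lost some of its files would not be reported; a real check would need to compare recursively.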
  • 31. Reporting DataNode Volume Failures
    • Configuring dfs.datanode.failed.volumes.tolerated > 0 enables a DataNode to keep running after volume failures
    • DataNode is still running, but capacity is degraded
    • HDFS already provided a count of failed volumes for each DataNode, but no further details
    • Apache Hadoop 2.7.0 provides more information: failed path, estimated lost capacity and failure date/time
    • An administrator can use this information to prioritize cluster maintenance work
  • 32. Reporting DataNode Volume Failures (screenshot; callout: New)
  • 33. Reporting DataNode Volume Failures (screenshot)
  • 34. Reporting DataNode Volume Failures (screenshot)
  • 35. Reporting DataNode Volume Failures
    • Everything in the web UI is sourced from standardized Hadoop metrics
      – Each DataNode publishes its own metrics
      – NameNode publishes aggregate information from every DataNode
    • Metrics accessible through JMX or the HTTP /jmx URI
    • Integrated in Ambari
    • Can be integrated into your preferred management tools and ops dashboards
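Since the metrics are available as JSON from the /jmx URI, an ops dashboard can read the volume failure figures with a few lines of parsing. The sketch below works on a hand-written sample payload rather than a live cluster. The bean name and the attribute names VolumeFailuresTotal and EstimatedCapacityLostTotal are assumptions modeled on the NameNode's FSNamesystemState bean, and the numbers are invented; check the /jmx output of your own cluster before relying on them.

```python
import json

# Hand-written sample in the {"beans": [...]} shape served by the /jmx URI.
# Bean and attribute names are assumptions; the values are made up.
SAMPLE_JMX = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "VolumeFailuresTotal": 2,
   "EstimatedCapacityLostTotal": 4000000000000},
  {"name": "Hadoop:service=NameNode,name=NameNodeInfo"}
]}
"""

BEAN_NAME = "Hadoop:service=NameNode,name=FSNamesystemState"


def volume_failure_summary(jmx_text, bean_name=BEAN_NAME):
    """Return failed-volume count and estimated lost bytes, or None if the bean is absent."""
    beans = json.loads(jmx_text).get("beans", [])
    for bean in beans:
        if bean.get("name") == bean_name:
            return {
                "failed_volumes": bean.get("VolumeFailuresTotal", 0),
                "lost_bytes": bean.get("EstimatedCapacityLostTotal", 0),
            }
    return None


summary = volume_failure_summary(SAMPLE_JMX)
print(summary)
```

In practice the payload would come from an HTTP GET of the /jmx URI on the NameNode web port, and the summary could feed an alert threshold in whichever monitoring tool is already in use.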
  • 36. New System to Manage the Health of Hadoop Clusters
    • Ambari Alerts are installed and configured by default
    • Health Alerts and Metrics managed via Ambari Web
  • 37. Summary
    • Configuration
      – Prevent garbage collection issues
      – Configure for redundancy
      – Retune configuration in response to metrics
    • HDFS ACLs
      – Implement fine-grained authorization rules on files
      – Can protect against accidental file manipulations
    • HDFS Snapshots
      – Point-in-time image of part of the filesystem
      – Useful for restoring to a prior state after accidental file manipulation
    • Reporting DataNode Volume Failures
      – Metrics and web UI exposing information about volume failures on DataNodes
      – Useful for planning cluster maintenance work
    • Use Ambari
      – Helps install, configure, monitor and manage Hadoop clusters
      – Incorporates the latest best practices
  • 38. Thank you, Q&A
    • Learn more
      – Hardware Recommendations for Apache Hadoop: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
      – HDFS operational and debuggability improvements: https://issues.apache.org/jira/browse/HDFS-6185
      – HDFS ACLs Blog Post: http://hortonworks.com/blog/hdfs-acls-fine-grained-permissions-hdfs-files-hadoop/
      – HDFS Snapshots Blog Post: http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/
    • Contact me with your operations questions and suggestions
      – Chris Nauroth – cnauroth@hortonworks.com

Editor's Notes

  1. First, a quick introduction. My name is Chris Nauroth. I’m a software engineer on the HDFS team at Hortonworks. I’m an Apache Hadoop committer and PMC member. I’m also an Apache Software Foundation member. Some of my major contributions include HDFS ACLs, Windows compatibility and various operability improvements. Prior to Hortonworks, I worked for Disney and did an initial deployment of Hadoop there. As part of that job, I worked very closely with the systems engineering team responsible for maintaining those Hadoop clusters, so I tend to think back to that team and get excited about things I can do now as a software engineer to help make that team’s job easier. I’m also here with Suresh Srinivas, one of the founders of Hortonworks, and a long-time Hadoop committer and PMC member. He has a lot of experience supporting some of the world’s largest clusters at Yahoo and elsewhere. Between the two of us, Suresh and I have experience supporting Hadoop clusters going back to 2008.
  2. For today’s agenda, I’d like to start by sharing some analysis that we’ve done of support case trends. In that analysis, we’re going to see that some common patterns emerge, and that’s going to lead into a discussion of configuration best practices and software improvements. In the second half of the talk, we’ll move into a discussion of key learnings and best practices around how recent HDFS features can help prevent problems or manage day-to-day maintenance.
  3. Let’s dive into the support case analysis. The data source for this chart is the entire history of support cases at Hortonworks. The x-axis is month and the y-axis is the proportion of support cases reported against a specific component. The chart focuses on 3 components that we define as the core of Hadoop: HDFS, YARN and MapReduce. All other components in the ecosystem are collapsed into a single line. Here we see a trend stabilizing around 30% of support cases driven from those core components. It also makes sense intuitively that a large proportion of support cases are driven from those core components, because every deployment uses them. As you rise up the stack, deployments start to vary in the components they choose to deploy. For example, a deployment may or may not deploy HBase depending on its use cases.
  4. The second chart shows an analysis of root cause category in each of those 3 core components. The source data contains many additional root cause categories. I’ve chosen to prune this down to the most significant ones to simplify the chart. The pattern that we see here is that a lot of support cases are driven by configuration issues or documentation problems. On an interesting side note, I gave a version of this presentation last year at Strata, and since then I’ve refreshed these charts with current data. Something I noticed is that documentation, configuration and software defects are proportionally a little bit smaller than last time. We’ve been investing a lot of energy in these areas, so it was satisfying to see the data showing that those efforts have been somewhat successful.
  5. Investment in operations at the core helps the most users.
  6. With that, let’s move into a discussion of common configuration issues that we continue to see.
  7. A cluster with fewer nodes is less resilient than one with many nodes. Failure of a DataNode that’s heavier on storage causes more re-replication activity. Map Reduce jobs may need to rerun more tasks. Commodity != poor quality.
  8. Compressed ordinary object pointers are a technique used in the JVM to represent managed pointers as 32-bit offsets from a 64-bit base heap address. This saves on the space taken by 64-bit native pointers. We used to have a recommendation to pass a JVM argument to turn this on. Recent JVM versions just use it by default. Setting Xmx to a different value from Xms can cause a big, expensive malloc when the heap grows, with surprising results when you run out of memory late in the process lifetime. N=8 typically. OOM killer.
  9. NameNode high availability was a very hot topic a few years ago. At this point, the recommended HA architecture is to use QuorumJournalManager, which sets up an active-standby pair of NameNodes and offloads edit logging to a separate set of daemons called the JournalNodes. On a side note, version control for configuration is a good thing. It can be helpful to look back on the history of changes or restore to a last known good state.
  10. The DataNode has a feature called disk-fail-in-place that allows it to keep running even if individual volumes have failed. This is off by default, but you can turn it on by editing hdfs-site.xml and setting property dfs.datanode.failed.volumes.tolerated to the number of volume failures that you are willing to tolerate before shutting down the entire DataNode. This is useful for high-density nodes, meaning nodes that have a lot of disks. If you have a node with 16 disks, and 2 disks fail, you’d probably prefer to keep that DataNode running with 14 disks available to serve clients instead of shutting down the whole thing. dfs.namenode.name.dir.restore is a property that controls whether or not the NameNode should attempt to bring back into service metadata storage directories that previously failed. By turning this on, you have the ability to repair a failed directory online and bring it back into service without restarting the NameNode process. We recommend taking periodic backups of the NameNode metadata. Copy the entire storage directory. Also plan on reserving a lot of disk for NameNode logs. A common pitfall is choosing too little space for logs, which then forces you to configure Log4J to roll logs very rapidly, and this can make debugging harder.
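The 16-disk example in the note above can be expressed as a small decision function. This is a toy model of the behaviour described for dfs.datanode.failed.volumes.tolerated, not Hadoop's implementation; the function name and return shape are hypothetical.

```python
def datanode_status(total_volumes, failed_volumes, tolerated=0):
    """Return (keeps_running, healthy_volumes) for a DataNode.

    `tolerated` stands in for dfs.datanode.failed.volumes.tolerated; the
    default of 0 means any single volume failure stops the DataNode.
    """
    if not 0 <= failed_volumes <= total_volumes:
        raise ValueError("failed_volumes must be between 0 and total_volumes")
    keeps_running = failed_volumes <= tolerated
    # A stopped DataNode serves nothing, so report zero usable volumes.
    healthy_volumes = total_volumes - failed_volumes if keeps_running else 0
    return keeps_running, healthy_volumes


print(datanode_status(16, 2, tolerated=2))   # (True, 14)
print(datanode_status(16, 2))                # (False, 0)
print(datanode_status(16, 3, tolerated=2))   # (False, 0)
```

The second call shows why the default matters on dense nodes: without raising the tolerated count, two failed disks take the remaining fourteen healthy disks out of service as well.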
  11. Something to keep in mind is that usage patterns on a cluster tend to change over time as use cases change. Configuration may need to change in reaction to changing usage patterns. If you have a major upgrade or maintenance planned, then that’s a good opportunity to review configurations and see if anything else needs to change.
  12. Increasingly, we’re pushing configuration best practices into the implementation of Ambari. This takes the burden off of administrators to remember these best practices during deployments. For those who don’t know, Apache Ambari is an open source cluster deployment and management tool. For a little variety, I chose to pull a screenshot related to HBase. Here we can see that Ambari starts by recommending some good defaults, but still gives administrators the option to tune settings to match their specific needs.
  13. Next, I’d like to discuss a few software improvements that were prompted by our experiences in support cases. We’ve found that often very small code changes can have a big impact on preventing problems or recovering from them. I’m going to discuss some real incidents that we’ve seen and how they led us to make those code changes.
  14. First, a public service announcement: don’t edit the metadata files. The NameNode metadata files are crucial for maintaining the state of the file system, so editing them can corrupt cluster state and result in loss of data. Don’t edit them. Now that I’ve said that, let’s talk about editing the metadata files. This is a real incident. A NameNode was misconfigured to point to the metadata from a different NameNode. An important note here is that part of the NameNode metadata is a namespace ID, which uniquely identifies that file system namespace. When DataNodes register with a NameNode for the first time, they also acquire that namespace ID and persist it locally. On subsequent DataNode restarts, the NameNode has a check that the DataNode attempting to register with it is presenting the same namespace ID. After NameNode restart, the DataNodes could not register with the NameNode because of the namespace ID mismatch. The system detected the problem correctly, and so far everything is working as designed. However, the admin thought an appropriate fix would be to manually edit the VERSION file, which is the part of the metadata containing the namespace ID, and change it to match what the DataNodes were reporting. “What happens next?” The problem is that the NameNode’s fsimage also persists the block IDs that are known for each file. When these DataNodes from a different cluster started sending their block reports, the NameNode replied by saying these blocks do not exist in my namespace, and therefore they should be deleted.
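The incident above hinges on two checks: the namespace ID comparison at registration, and the handling of reported blocks that the NameNode does not know. The sketch below is a toy model of both, with hypothetical function names, namespace IDs and block IDs; it is not how Hadoop implements them, but it shows why bypassing the first check exposes every block to the second.

```python
class NamespaceMismatch(Exception):
    """Raised when a DataNode presents a namespace ID from a different file system."""


def check_registration(namenode_ns_id, datanode_ns_id):
    """Return the namespace ID the DataNode should persist after registering."""
    if datanode_ns_id is None:
        # First registration: the DataNode adopts the NameNode's namespace ID.
        return namenode_ns_id
    if datanode_ns_id != namenode_ns_id:
        raise NamespaceMismatch(
            f"DataNode has {datanode_ns_id}, NameNode has {namenode_ns_id}")
    return datanode_ns_id


def blocks_marked_for_deletion(reported_blocks, known_blocks):
    """Blocks in a DataNode's report that the NameNode's metadata does not contain."""
    return sorted(set(reported_blocks) - set(known_blocks))


# Hypothetical IDs: the NameNode was pointed at another cluster's metadata.
try:
    check_registration(namenode_ns_id=1001, datanode_ns_id=2002)
except NamespaceMismatch as err:
    print("registration refused:", err)

# After the VERSION file is edited so the IDs match, the safety check passes,
# but none of the reported blocks exist in the loaded metadata.
print(check_registration(namenode_ns_id=2002, datanode_ns_id=2002))
print(blocks_marked_for_deletion(["blk_1", "blk_2", "blk_3"], ["blk_9"]))
```

The mismatch error is the system protecting the data. Editing the ID removes that protection without making the block metadata agree, so every reported block looks like garbage to be deleted.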
  15. This is the HDFS web UI, now with a small enhancement to show the time when block deletions will start.
  16. HDFS is known for being a scalable system. One of the things it’s really awesome at is scaling deletes! This can be a scary situation if someone deletes the wrong thing, because attempting to recover by undeleting block files is error-prone and time-consuming work across all DataNodes. We recommend enabling the HDFS trash feature as a safety net, which essentially changes deletes into renames, and the NameNode can then reap the trash files at a later time. However, I’m going to talk about a real incident in which trash was not enabled. There was a large directory deleted, and the admin realized this was a mistake and chose to shut down the NameNode immediately. The support engineer taking the case naturally figured we could restore from trash, so advised restarting the NameNode. “What happens next?”
  17. This incident really points out the importance of protecting data against accidental deletion. HDFS snapshots and HDFS ACLs are two features that I think help with this. I’ll have more coverage of these features later in the presentation.
  18. “What happens next?”
  19. If you’ve used POSIX ACLs on a Linux file system, then you already know how it works in HDFS too.
  20. By convention, snapshots can be referenced as a file system path under sub-directory “.snapshot”.
  21. Here is a screenshot pointing out a change in the HDFS web UI: Total Datanode Volume Failures is a hyperlink. Clicking that jumps to…
  22. …this new screen listing the volume failures in detail. We can see the path of each failed storage location, and an estimate of the capacity that was lost. I think of this screen as something a systems engineer would use as a to-do list during regular cluster maintenance.
  23. Here is what it looks like when there are no volume failures. I included this picture because this is what we all want it to look like. Of course, it won’t always be that way.