2. Background – initial use cases
Use case: Detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: the task failure ratio should be approximately equal for every node
Algorithm: node-by-node comparison (symmetry violation) and per-node trend
HADOOP EAGLE – EBAY INC
4. Scale Challenges @ eBay Hadoop Monitoring
• 10+ large Hadoop clusters
• 10,000+ data nodes
• 50,000+ jobs per day
• 50,000,000+ tasks per day
• 500+ types of Hadoop/HBase native metrics
• Billions of audit events and metrics per day
5. Use case challenges @ eBay Hadoop Monitoring
• Host
  • Task failure ratio based machine anomaly detection
• Job monitoring across its lifetime
  • Real-time running job performance analysis
  • Near real-time job history analytics
  • Data skew detection
• Hadoop native metrics
  • HDFS
  • HBase
  • M/R
• Logs
  • GC log
  • Hadoop daemon log
  • Audit log
• HDFS image file
• YARN framework
  • Queue
6. Engineering Challenges @ eBay Hadoop Monitoring
• Varieties of data sources
M/R job history, running jobs, GC log, NameNode log, Hadoop native metrics, YARN queue, audit log, HDFS image file, etc.
• Varieties of data collectors
pull from HDFS, pull from the YARN API, ship logs, …
• Complex business logic
join with outside data, pre-aggregations, memory windows, …
• Alert rules can’t be hot deployed
• Scalability issues with a single process
7. Job History Performance Analyzer
• Monitor job history files in near real time
• Crawl each job history file immediately after the job completes
• Apply expert rules to produce job performance suggestions
• Track the job history trend for jobs of the same type
[Diagram, labels in order: Job Start Event, Task Start Event, Task End Event, Task roll-up, Task2 Start Event, Task2 End Event, Task roll-up, Job End Event, Job Suggestion Rules]
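The roll-up and suggestion step in this slide can be illustrated with a minimal Java sketch. The slides do not show Eagle's job history schema or its actual rules, so everything here is an assumption made for illustration: the class names `TaskRecord` and `JobRollup`, the `skewFactor` parameter, and both rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task record; the real job history schema is not shown in the slides.
class TaskRecord {
    final String taskId;
    final long durationMs;
    final boolean failed;

    TaskRecord(String taskId, long durationMs, boolean failed) {
        this.taskId = taskId;
        this.durationMs = durationMs;
        this.failed = failed;
    }
}

// Job-level figures produced by rolling up the task records of one job.
class JobRollup {
    final int taskCount;
    final int failedTasks;
    final long maxTaskMs;
    final double avgTaskMs;

    JobRollup(List<TaskRecord> tasks) {
        int failed = 0;
        long max = 0;
        long total = 0;
        for (TaskRecord t : tasks) {
            total += t.durationMs;
            max = Math.max(max, t.durationMs);
            if (t.failed) {
                failed++;
            }
        }
        this.taskCount = tasks.size();
        this.failedTasks = failed;
        this.maxTaskMs = max;
        this.avgTaskMs = tasks.isEmpty() ? 0.0 : (double) total / tasks.size();
    }

    // Illustrative rules only (assumptions, not Eagle's rule set).
    List<String> suggestions(double skewFactor) {
        List<String> out = new ArrayList<>();
        if (taskCount > 0 && maxTaskMs > skewFactor * avgTaskMs) {
            out.add("possible data skew: slowest task far exceeds the average");
        }
        if (failedTasks > 0) {
            out.add("failed tasks present: " + failedTasks);
        }
        return out;
    }
}
```

A caller would build one `JobRollup` when the job end event arrives and pass its suggestions on to whatever stores or displays them.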
8. Job real-time monitoring
• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
  • CPU, HDFS I/O, disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Sliding-window-based alerts
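A sliding-window alert over minute-level snapshots could look roughly like the sketch below. The slides do not say how Eagle defines its windows, so the class name `SlidingWindowAlert`, the fixed window length, and the "average above threshold" condition are all assumptions chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Keeps the last `size` minute-level samples and reports when their average
// exceeds `threshold`. Both parameters are hypothetical tuning knobs.
class SlidingWindowAlert {
    private final int size;
    private final double threshold;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowAlert(int size, double threshold) {
        this.size = size;
        this.threshold = threshold;
    }

    // Adds one sample; returns true only when the window is full and its
    // average is above the threshold.
    boolean offer(double sample) {
        window.addLast(sample);
        sum += sample;
        if (window.size() > size) {
            sum -= window.removeFirst(); // evict the oldest sample
        }
        return window.size() == size && sum / size > threshold;
    }
}
```

Requiring a full window before alerting avoids firing on the first one or two samples of a newly started job.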
11. Programming Paradigm and Abstraction
• Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of the monitoring system
12. Eagle as a Framework
As a framework, Eagle does not assume:
• Data source (where, what)
• Business logic execution path (how)
• Policy engine implementation (how)
• Data sink (where, what)
As a framework, Eagle does the following:
• SQL-like service API
• High-performing query framework
• Lightweight streaming process Java API
• Extensible policy engine implementation
• Scalable and distributed rule evaluation
• Native HBase data storage support
• Metadata driven stream processing
• Data source extensibility
• Data sink extensibility
• Interactive dashboard
15. Lightweight Streaming Process Framework
Facts
• Computation is based on single events, which together constitute an endless continuous stream
• Computation can be aggregation, time window, length window, join with outside data, etc.
• The filter design pattern is used to modularize code at the beginning
Abstraction
Inspired by the Cascading framework, we abstract a lightweight streaming programming API that is independent of the execution environment
A streaming process is a directed acyclic graph
This layer of indirection is for code modularization, code reuse, and avoiding coupling with a specific execution environment
Runs in a single process, on Storm, or on other streaming technology such as Spark
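The idea of declaring a processing graph separately from how it is executed can be sketched as below. This is not Eagle's API, which the slides do not show. `StreamNode` and its `filter`, `map`, and `sink` methods are hypothetical names, and the only "execution environment" here is a direct in-process call chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// One node of the processing graph. Calling filter/map/sink more than once on
// the same node fans out, so the declared topology is a directed acyclic graph.
class StreamNode<T> {
    private final List<Consumer<T>> downstream = new ArrayList<>();

    // In-process execution: push one event to every downstream edge.
    void emit(T event) {
        for (Consumer<T> edge : downstream) {
            edge.accept(event);
        }
    }

    // Forwards only the events that satisfy the predicate.
    StreamNode<T> filter(Predicate<T> predicate) {
        StreamNode<T> next = new StreamNode<>();
        downstream.add(e -> {
            if (predicate.test(e)) {
                next.emit(e);
            }
        });
        return next;
    }

    // Transforms each event before forwarding it.
    <R> StreamNode<R> map(Function<T, R> fn) {
        StreamNode<R> next = new StreamNode<>();
        downstream.add(e -> next.emit(fn.apply(e)));
        return next;
    }

    // Terminal edge, for example an alert engine or a storage writer.
    void sink(Consumer<T> consumer) {
        downstream.add(consumer);
    }
}
```

Business logic written against such an interface only names the stages and their connections. A different runner could, in principle, map the same declaration onto a distributed engine, which is the decoupling the slide describes.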
33. Tuning for HBase Storage
• Uniform HBase rowkey design for all types of monitoring data sources
• Logically partition data by tags, which are defined in the annotation @Partition({"cluster", "datacenter"})
• Physically shard data using HBase native features: rowkey range and region mapping
• Write throughput optimized by using HBase multi-put
• Coprocessor to maximize query performance
• Push evaluation of numeric filters down to HBase
• Secondary index support
• Inspection of RESTful resources and entity metadata
• Numeric filters
• Expression evaluation in output fields
• Rowkey inspection
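To make the link between partition tags and rowkey ranges concrete, here is a sketch of one possible rowkey layout. The slides do not give Eagle's actual byte layout, so the field order, the use of `String.hashCode()`, and the name `RowkeySketch` are assumptions. The sketch only shows that rows with the same partition tag values share a key prefix and therefore fall in a contiguous rowkey range.

```java
import java.nio.ByteBuffer;
import java.util.Map;

class RowkeySketch {
    // Assumed layout: [hash(entity type)][hash(partition value)...][reversed timestamp]
    static byte[] rowkey(String entityType, String[] partitionKeys,
                         Map<String, String> tags, long timestampMs) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * partitionKeys.length + 8);
        buf.putInt(entityType.hashCode());
        for (String key : partitionKeys) {
            String value = tags.get(key);
            // A missing tag is encoded as 0 in this sketch.
            buf.putInt(value == null ? 0 : value.hashCode());
        }
        // Reversed timestamp so that newer rows sort first in byte order
        // (valid for non-negative timestamps).
        buf.putLong(Long.MAX_VALUE - timestampMs);
        return buf.array();
    }
}
```

With partition keys `{"cluster", "datacenter"}`, two rows for the same cluster and datacenter differ only in the trailing eight bytes, so a range scan over that prefix reads one logical partition.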
35. Generic Dashboard Analytics for Eagle Store
• Interactive: IPython notebook-like interactive visualization, analysis, and troubleshooting
• Dashboard: customizable dashboard layout and drill-down path; persist and share
36. Open Source Soon …
• First use case: Eagle to secure the Hadoop platform, built on the Eagle framework
• Work closely with Hortonworks, Dataguise, …
• Share with the community and get the community’s support
• Continue to open source job monitoring, GC monitoring, etc.
Anomaly detection algorithm
Continuously crawl each job history file immediately after the job completes
Calculate the minute-level job failure ratio for each node
A node is identified as anomalous when either of the following two conditions holds:
• The node continuously fails the tasks that run on it
• The node has a significantly higher failure ratio than the rest of the nodes in the cluster
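The two conditions can be written as simple checks, shown here as a sketch rather than the algorithm Eagle uses. The slides give no thresholds, so `minutes`, `failRatio`, and `margin` are hypothetical parameters. Comparing against the mean of the other nodes is just one way to read "significantly higher than the rest".

```java
import java.util.List;
import java.util.Map;

class NodeAnomalySketch {
    // Condition 1 (per-node trend): each of the last `minutes` failure ratios
    // of the node is at or above `failRatio`.
    static boolean continuouslyFailing(List<Double> ratios, int minutes, double failRatio) {
        if (ratios.size() < minutes) {
            return false;
        }
        for (int i = ratios.size() - minutes; i < ratios.size(); i++) {
            if (ratios.get(i) < failRatio) {
                return false;
            }
        }
        return true;
    }

    // Condition 2 (symmetry violation): the node's latest ratio exceeds the
    // mean ratio of all other nodes by more than `margin`.
    static boolean higherThanPeers(String node, Map<String, Double> latest, double margin) {
        if (!latest.containsKey(node)) {
            return false;
        }
        double sum = 0.0;
        int peers = 0;
        for (Map.Entry<String, Double> e : latest.entrySet()) {
            if (!e.getKey().equals(node)) {
                sum += e.getValue();
                peers++;
            }
        }
        if (peers == 0) {
            return false;
        }
        return latest.get(node) - sum / peers > margin;
    }

    // A node is anomalous when either condition holds.
    static boolean isAnomalous(String node, List<Double> ratios, Map<String, Double> latest,
                               int minutes, double failRatio, double margin) {
        return continuouslyFailing(ratios, minutes, failRatio)
                || higherThanPeers(node, latest, margin);
    }
}
```

Here `ratios` is the node's minute-by-minute history in chronological order, and `latest` maps every node in the cluster to its most recent ratio.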
Inspired by TSDB, Ganglia, Nagios, Zabbix, etc. Most of them focus on infrastructure-level data collection and alerting, but they do not address business logic complexity, that is, how to prepare the data.