HADOOP EAGLE
Full-stack realtime monitoring framework for eBay hadoop
Edward Zhang @yonzhang2012 | Hao Chen @ihaoch
HADOOP EAGLE – EBAY INC
Background – initial use cases
Use case: detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: the task failure ratio of every node should be approximately equal
Algorithm: node-by-node comparison (symmetry violation) and per-node trend
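The node-by-node comparison can be sketched in a few lines. This is only an illustration under stated assumptions: each node reports a failed and a total task count, the cluster baseline is the median ratio, and a host is flagged when its ratio exceeds both `factor` times the baseline and an absolute `floor`. The class name, the median baseline and both thresholds are hypothetical and are not taken from Eagle; the per-node trend check is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: flag hosts whose task failure ratio breaks the
// "all nodes should look alike" symmetry assumption.
final class TaskFailureAnomalySketch {

    // Failure ratio of one node; 0 when the node ran no tasks.
    static double ratio(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    // Median of the given values (the cluster-wide baseline in this sketch).
    static double median(double[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Map value layout (assumed): {failedTasks, totalTasks}.
    // Returns hosts whose ratio exceeds both the floor and factor * median.
    static List<String> anomalousHosts(Map<String, long[]> failedAndTotalByHost,
                                       double factor, double floor) {
        double[] ratios = failedAndTotalByHost.values().stream()
                .mapToDouble(c -> ratio(c[0], c[1]))
                .toArray();
        double baseline = median(ratios);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, long[]> e : failedAndTotalByHost.entrySet()) {
            double r = ratio(e.getValue()[0], e.getValue()[1]);
            if (r > floor && r > factor * baseline) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, long[]> counts = new LinkedHashMap<>();
        counts.put("node1", new long[] {2, 100});
        counts.put("node2", new long[] {3, 100});
        counts.put("node3", new long[] {40, 100});
        counts.put("node4", new long[] {1, 100});
        System.out.println(anomalousHosts(counts, 3.0, 0.05)); // prints [node3]
    }
}
```

The absolute floor keeps a nearly failure-free cluster from flagging a host over one or two failed tasks.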
Host: Task failure based anomaly host detection
[Figure: Anomaly Detection & Alerting, Analysis, Auto-Remediation]
Scale Challenges @ eBay Hadoop Monitoring
• 10+ large Hadoop clusters
• 10,000+ data nodes
• 50,000+ jobs per day
• 50,000,000+ tasks per day
• 500+ types of Hadoop/HBase native metrics
• Billions of audit events and metrics per day
Use Case Challenges @ eBay Hadoop Monitoring
• Host
  • Task failure ratio based machine anomaly detection
• Job monitoring across its lifetime
  • Real-time running job performance analysis
  • Near real-time job history analytics
  • Data skew detection
• Hadoop native metrics
  • HDFS
  • HBase
  • M/R
• Logs
  • GC log
  • Hadoop daemon log
  • Audit log
• HDFS image file
• YARN framework
  • Queue
Engineering Challenges @ eBay Hadoop Monitoring
• Varieties of data sources
  M/R job history, running jobs, GC log, NameNode log, Hadoop native metrics, YARN queue, audit log, HDFS image file, etc.
• Varieties of data collectors
  pull from HDFS, pull from the YARN API, ship logs, …
• Complex business logic
  join with outside data, pre-aggregations, memory window, …
• Alert rules can’t be hot deployed
• Scalability issue with a single process
Job History Performance Analyzer
• Monitor job history files in near real time
• Crawl each job history file immediately after the job completes
• Apply expert rules to produce job performance suggestions
• Track the job history trend for the same type of job
[Figure: job event stream, in order of appearance: Job Start Event, Task Start Event, Task End Event, Task roll-up, Task2 Start Event, Task2 End Event, Task roll-up, Job End Event, Job Suggestion Rules]
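To make the roll-up concrete, here is a minimal sketch that folds task end events into a per-job summary and evaluates one suggestion rule when the job ends. The event fields, the `JobSummary` shape and the skew rule (slowest task longer than `skewFactor` times the average) are assumptions made for illustration; the slides do not list Eagle's actual rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "task roll-up" followed by a job suggestion rule.
final class JobRollupSketch {

    // Assumed task end event: only the duration matters here.
    static final class TaskEndEvent {
        final String taskId;
        final long durationMs;

        TaskEndEvent(String taskId, long durationMs) {
            this.taskId = taskId;
            this.durationMs = durationMs;
        }
    }

    // Running per-job summary, updated as each task end event arrives.
    static final class JobSummary {
        int taskCount;
        long totalMs;
        long maxMs;

        void rollUp(TaskEndEvent e) {
            taskCount++;
            totalMs += e.durationMs;
            maxMs = Math.max(maxMs, e.durationMs);
        }

        double averageMs() {
            return taskCount == 0 ? 0.0 : (double) totalMs / taskCount;
        }
    }

    // Example rule, evaluated at the job end event.
    static List<String> suggest(JobSummary s, double skewFactor) {
        List<String> suggestions = new ArrayList<>();
        if (s.taskCount > 1 && s.maxMs > skewFactor * s.averageMs()) {
            suggestions.add("possible data skew: slowest task took " + s.maxMs + " ms");
        }
        return suggestions;
    }

    public static void main(String[] args) {
        JobSummary summary = new JobSummary();
        summary.rollUp(new TaskEndEvent("task1", 1_000));
        summary.rollUp(new TaskEndEvent("task2", 1_200));
        summary.rollUp(new TaskEndEvent("task3", 9_000));
        System.out.println(suggest(summary, 2.0));
    }
}
```

Because the summary is updated per event, nothing beyond three counters has to be kept in memory for a job, however many tasks it runs.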
Job real-time monitoring
• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
  • CPU, HDFS I/O, disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Sliding-window-based alerts
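A sliding-window alert over minute-level snapshots might look like the sketch below. It assumes the alert fires once the window is full and the average of the last `windowSize` samples exceeds a threshold; the class name, window length and threshold are hypothetical and not taken from Eagle.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: count-based sliding window with an average threshold.
final class SlidingWindowAlertSketch {
    private final int windowSize;
    private final double threshold;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    SlidingWindowAlertSketch(int windowSize, double threshold) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Adds one snapshot; returns true when the window is full and
    // its average exceeds the threshold.
    boolean offer(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // evict the oldest snapshot
        }
        return window.size() == windowSize && sum / windowSize > threshold;
    }

    public static void main(String[] args) {
        SlidingWindowAlertSketch cpuAlert = new SlidingWindowAlertSketch(3, 80.0);
        double[] cpuPercentPerMinute = {10, 20, 90, 95, 99};
        for (double sample : cpuPercentPerMinute) {
            System.out.println(sample + " -> alert=" + cpuAlert.offer(sample));
        }
    }
}
```

Averaging over the window means a single spike (the 90 in the sample data) does not fire the alert; only a sustained run does.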
Service: GC Log / Server Log
• GC event detection and prediction
• Log metrics statistics
• Real-time log anomaly detection
Why Eagle Monitoring Framework
Programming Paradigm and Abstraction
• Data collector -> data processing -> metric pre-aggregation/alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of a monitoring system
Eagle as a Framework
As a framework, Eagle does not assume:
• Data source (where, what)
• Business logic execution path (how)
• Policy engine implementation (how)
• Data sink (where, what)
As a framework, Eagle provides the following:
• SQL-like service API
• High-performance query framework
• Lightweight stream processing Java API
• Extensible policy engine implementation
• Scalable, distributed rule evaluation
• Native HBase data storage support
• Metadata-driven stream processing
• Data source extensibility
• Data sink extensibility
• Interactive dashboard
Eagle Overall Architecture
Eagle Monitoring Framework Internals
• Lightweight Streaming Process Framework
• Extensible & Scalable Policy Framework for Alert
• Eagle Query Framework
• Interactive Dashboards
Lightweight Streaming Process Framework
Facts
• Computation operates on single events, which together form an endless, continuous stream
• Computation can be aggregation, time window, length window, join with outside data, etc.
• The filter design pattern is used to modularize code at the beginning
Abstraction
• Inspired by the Cascading framework, we abstracted a lightweight streaming programming API that is independent of the execution environment
• A streaming process is a directed acyclic graph
• This layer of indirection serves code modularization, code reuse, and prevention of coupling with a specific execution environment
• Runs in a single process, on Storm, or on other streaming technology such as Spark
Eagle Stream Data Processing API
Step 1: Task DAG graph setup
@Override
protected void buildDependency(FlowDef def, DataProcessConfig config) {
    // Source task (named "wordgenerator"): the head of the graph
    Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
    // Consumes the output of the source task
    Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild();
    // Consumes the output of the uppercase task
    Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild();
    // Mark the last task as the end of the flow
    def.endBy(groupbyUppercaseTask);
}
Step 2: Inter-task data exchange protocol
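The slide text does not show how tasks hand data to one another. As a rough, single-process illustration of a task graph like this one, the sketch below wires three nodes together and passes each output downstream through a collector callback. It is not Eagle's implementation: `Executor`, `Collector` and `Node` are hypothetical stand-ins, and only the task names are borrowed from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process runner for a small task DAG.
final class SingleProcessDagSketch {

    // Downstream hand-off: a task emits values through this interface.
    interface Collector {
        void collect(String value);
    }

    // Per-task logic, independent of where the graph runs.
    interface Executor {
        void process(String input, Collector out);
    }

    // A DAG node: runs its executor and forwards output to every downstream node.
    static final class Node {
        final String name;
        final Executor executor;
        final List<Node> downstream = new ArrayList<>();

        Node(String name, Executor executor) {
            this.name = name;
            this.executor = executor;
        }

        Node connectFrom(Node upstream) {
            upstream.downstream.add(this);
            return this;
        }

        void accept(String input) {
            executor.process(input, value -> {
                for (Node next : downstream) {
                    next.accept(value);
                }
            });
        }
    }

    // Builds wordgenerator -> uppercase -> groupby_uppercase and feeds it the words.
    static Map<String, Integer> run(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        Node source = new Node("wordgenerator", (in, out) -> out.collect(in));
        Node upper = new Node("uppercase",
                (in, out) -> out.collect(in.toUpperCase(Locale.ROOT))).connectFrom(source);
        new Node("groupby_uppercase",
                (in, out) -> counts.merge(in, 1, Integer::sum)).connectFrom(upper);
        for (String word : words) {
            source.accept(word);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("eagle", "Eagle", "hadoop"))); // prints {EAGLE=2, HADOOP=1}
    }
}
```

Since the executors only see an input and a collector, the same logic could in principle be hosted by a different runtime without changes, which is the decoupling the abstraction aims for.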
Execution Graph development, compile and deploy
[Figure: Development / Compile Phase, Deployment / Runtime Phase]
Extensible & Scalable Policy framework
Scalability
• Dynamic policy partitioning across compute nodes based on a configurable partition class
• Dynamic policy deployment
• Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
Extensibility
• Support for new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
Features
• Policy CRUD
• Stream metadata (event attribute name, attribute type, attribute value resolver, …)
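One way to read "policy partitioning by Eagle" is that each compute node evaluates only the policies a partitioner assigns to it, so the N events * M policies workload is split along the policy dimension as well as the event dimension. A minimal sketch, assuming a hash-based default; the `PolicyPartitioner` interface and its method are hypothetical and stand in for the configurable partition class, whose real signature the slides do not show.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of pluggable policy-to-node assignment.
final class PolicyPartitionSketch {

    // Stand-in for the "configurable partition class".
    interface PolicyPartitioner {
        int partition(String policyId, int numPartitions);
    }

    // Default strategy: non-negative hash of the policy id modulo the partition count.
    static final PolicyPartitioner HASH =
            (policyId, numPartitions) -> Math.floorMod(policyId.hashCode(), numPartitions);

    // Policies that the node with the given index should evaluate.
    static List<String> policiesFor(int nodeIndex, int numNodes,
                                    List<String> allPolicyIds, PolicyPartitioner partitioner) {
        List<String> mine = new ArrayList<>();
        for (String id : allPolicyIds) {
            if (partitioner.partition(id, numNodes) == nodeIndex) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> policies = List.of("gcPause", "taskFailure", "queueUsage", "hdfsAudit", "jobSkew");
        int numNodes = 3;
        for (int node = 0; node < numNodes; node++) {
            System.out.println("node " + node + " evaluates " + policiesFor(node, numNodes, policies, HASH));
        }
    }
}
```

`Math.floorMod` is used instead of `%` because `hashCode()` can be negative. Every policy lands on exactly one node, so adding a policy only changes the load of a single evaluator.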
Dynamic Policy Partitioning
Scalability of Policy Evaluation
Extensibility of policy framework
// Imports are not shown on the slide: List is java.util.List; Module is presumably Guice's com.google.inject.Module.
public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();
    public Class<? extends PolicyEvaluator> getPolicyEvaluator();
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
    public List<Module> getBindingModules();
}
Policy evaluator providers use SPI to register policy engine implementations
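Java's standard SPI mechanism is `java.util.ServiceLoader`: implementations are named in a `META-INF/services/<fully qualified interface name>` file and discovered at runtime. The sketch below builds a policy-type registry that way. It assumes a simplified, hypothetical `EvaluatorProvider` interface (much smaller than `PolicyEvaluatorServiceProvider`) and does not claim to match how Eagle wires its providers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical sketch of an SPI-backed registry keyed by policy type.
final class PolicyProviderRegistrySketch {

    // Simplified provider interface used only in this sketch.
    interface EvaluatorProvider {
        String getPolicyType();
    }

    private final Map<String, EvaluatorProvider> byType = new HashMap<>();

    // Discovers providers listed in META-INF/services files on the classpath.
    // With no such file present, the loop simply finds nothing.
    void loadFromClasspath() {
        for (EvaluatorProvider provider : ServiceLoader.load(EvaluatorProvider.class)) {
            register(provider);
        }
    }

    void register(EvaluatorProvider provider) {
        byType.put(provider.getPolicyType(), provider);
    }

    Optional<EvaluatorProvider> lookup(String policyType) {
        return Optional.ofNullable(byType.get(policyType));
    }

    public static void main(String[] args) {
        PolicyProviderRegistrySketch registry = new PolicyProviderRegistrySketch();
        registry.loadFromClasspath();
        // Manual registration stands in for a provider that a jar would contribute.
        registry.register(() -> "siddhi");
        System.out.println("siddhi registered: " + registry.lookup("siddhi").isPresent());
        System.out.println("esper registered: " + registry.lookup("esper").isPresent());
    }
}
```

With this arrangement a new engine can be added by dropping a jar with its provider class and services file onto the classpath, without touching the framework code.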
Eagle Query Framework
A lightweight, metadata-driven store layer that serves the storage and query requirements shared by most monitoring systems
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized structure
• …
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• …
Features
• Simple API
• Powerful query
• High performance
• Scalability
• Pluggable
• …
Eagle Query Framework
• Metadata definition ORM
• High-performance RESTful API supporting CRUD
• SQL-like declarative query syntax
• Generic service client library
• Native support for HBase and RDBMS
• Interactive and customizable dashboard
Metadata definition ORM
• Annotations attach metadata to an entity
• Metadata-driven query compiling and response rendering
• Metadata-driven serialization/deserialization
• Columns are renamed to shorter strings (HBase)
• Entity metadata primitives
  • Table
  • ColumnFamily
  • Prefix (the very first partition key)
  • Service (entity identifier)
  • Partition
  • Tags
  • Indexes
  • Column
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@TimeSeries(false)
@Partition({"cluster", "datacenter"})
@Tags({"programId", "alertExecutorId", "policyId", "policyType"})
@Indexes({
    @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" })
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
    // Each field maps to a one-letter column qualifier
    @Column("a")
    private String desc;
    @Column("b")
    private String policyDef;
    @Column("c")
    private String dedupeDef;
    @Column("d")
    private String notificationDef;
    @Column("e")
    private String remediationDef;
    @Column("f")
    private boolean enabled;
    // Accessors are not shown on the slide
}
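How annotations can drive serialization is easier to see in a small, runnable example. The sketch below declares its own `@Table` and `@Column` annotations (hypothetical stand-ins, not Eagle's classes) and uses reflection to recover the table name and the mapping from field names to short column qualifiers.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: reading entity metadata from annotations at runtime.
final class EntityMetadataSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Table {
        String value();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Column {
        String value();
    }

    @Table("alertdef")
    static final class AlertDef {
        @Column("a") String desc;
        @Column("b") String policyDef;
        String notPersisted; // no @Column, so it is skipped
    }

    static String tableOf(Class<?> entity) {
        Table table = entity.getAnnotation(Table.class);
        if (table == null) {
            throw new IllegalArgumentException(entity.getName() + " has no @Table");
        }
        return table.value();
    }

    // Maps each annotated field name to its short column qualifier.
    static Map<String, String> columnsOf(Class<?> entity) {
        Map<String, String> columns = new TreeMap<>();
        for (Field field : entity.getDeclaredFields()) {
            Column column = field.getAnnotation(Column.class);
            if (column != null) {
                columns.put(field.getName(), column.value());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(tableOf(AlertDef.class) + " " + columnsOf(AlertDef.class));
        // prints alertdef {desc=a, policyDef=b}
    }
}
```

Both annotations need `RetentionPolicy.RUNTIME`; with the default retention they would not be visible to reflection and the maps would come back empty.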
Generic RESTful API & Query
eagle-service/rest/entities?query=<Query>
<Query> ::= <EntityName> "[" <FilterCondition> "]" "<" <GroupbyFields> ">" "{" <AggregatedFunctions> "}" [ "." "{" <SortbyOptions> "}" ]
Generic RESTful API Query Syntax
Search Query
query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]{@startTime,@numTotalMaps}&startTime=&endTime=&pageSize=100
Aggregate Query
Aggregation Query ::= <EntityName> [QueryCondition]<GroupbyFields>{AggregatedFunctions}.{SortbyOptions}
query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]<@user>{count, min(endTime-startTime)}&startTime=&endTime=&pageSize=100
TimeSeries Histogram Query
query=GenericMetricService[@cluster="ares" AND @datacenter="lvs"]<@user>{sum(value)}.{sum(value) desc}&timeSeries=true&intervmin=1440&pageSize=10000000&startTime=2014-07-01 00:00:00&endTime=2014-08-01 00:00:00&metricName=eagle.hdfs.spacesize.cluster
Operators
CONTAINS, IN, !=, =, <, <=, >, >=
Numeric Filters
query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100
Pagination
query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100&startRowkey=BgVz-9R…….
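The query expressions contain characters such as `[`, `@`, `"`, `<` and spaces, so a client would normally URL-encode the `query` parameter before sending it. A minimal sketch using only the JDK: the base URL and both helper names are assumptions, while the path `eagle-service/rest/entities` and the `query` and `pageSize` parameters come from the slides.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for composing an Eagle-style query URL.
final class EagleQueryUrlSketch {

    // Assembles <EntityName>[<filter>]<<groupBy>>{<functions>}.
    static String aggregateQuery(String entity, String filter, String groupBy, String functions) {
        return entity + "[" + filter + "]<" + groupBy + ">{" + functions + "}";
    }

    // URLEncoder applies form encoding, so a space becomes '+'.
    static String buildUrl(String baseUrl, String query, int pageSize) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return baseUrl + "eagle-service/rest/entities?query=" + encoded + "&pageSize=" + pageSize;
    }

    public static void main(String[] args) {
        String query = aggregateQuery(
                "JobExecutionService",
                "@cluster=\"xyz\" AND @datacenter=\"abc\"",
                "@user",
                "count");
        // The host and port below are placeholders.
        System.out.println(buildUrl("http://localhost:8080/", query, 100));
    }
}
```

Time range parameters such as `startTime` and `endTime` would be appended in the same way as `pageSize`.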
Generic Eagle Service Client Library
• Basic CRUD
• Fluent DSL
• Metric Builder API
• Parallel Client
• Asynchronous Client
client.metric("unit.test.metrics")
    .batch(5)
    .tags(tags)
    .send("unit.test.metrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
    .send(System.currentTimeMillis(), 0.1)
    .send(System.currentTimeMillis(), 0.1, 0.2)
    .send(System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
    .send("unit.test.anothermetrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
    .flush();
client.search("GenericMetricService[@cluster=\"cluster4ut\" AND @datacenter = \"datacenter4ut\"]<@cluster>{sum(value)}")
    .startTime(0)
    .endTime(System.currentTimeMillis() + 24 * 3600 * 1000)
    .metricName("unit.test.metrics")
    .pageSize(1000)
    .send();
HBase Storage Design
Uniform rowkey design
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Metric
  Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
• Entity
  Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Log
  Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
  Rowvalue ::= Log Content
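A rowkey in the order prefix | partition keys | timestamp | tags could be assembled as in the sketch below. The byte layout (4-byte hashes for strings, an 8-byte big-endian timestamp, tags sorted by name) is an assumption made for illustration; the slides give only the field order, not Eagle's actual encoding.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of a fixed-layout rowkey.
final class RowkeySketch {

    static byte[] rowkey(String prefix, List<String> partitionValues,
                         long timestamp, Map<String, String> tags) {
        // Sort tags by name so the same tag set always yields the same key.
        SortedMap<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4 + 4 * partitionValues.size() + 8 + 8 * sortedTags.size();
        ByteBuffer buffer = ByteBuffer.allocate(size); // big-endian by default
        buffer.putInt(prefix.hashCode());
        for (String partitionValue : partitionValues) {
            buffer.putInt(partitionValue.hashCode());
        }
        buffer.putLong(timestamp);
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buffer.putInt(tag.getKey().hashCode());
            buffer.putInt(tag.getValue().hashCode());
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        byte[] key = rowkey("alertdef", List.of("cluster1", "dc1"),
                System.currentTimeMillis(), Map.of("policyId", "p1"));
        System.out.println("rowkey length in bytes: " + key.length);
    }
}
```

Because the prefix and partition values lead the key, rows of one entity type in one cluster and datacenter are stored contiguously, and within that range the timestamp bytes group rows by time, so a time-bounded query becomes a rowkey range scan.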
HBase Coprocessor
com.ebay.eagle.coprocessor.AggregateProtocol
[Chart: avg, count, max, min and sum aggregations compared for three series: no coprocessor in single region, coprocessor in single region, estimated in cluster; axis from 0 to 20000, unit not given]
Tuning for HBase Storage
• Uniform HBase rowkey design for all types of monitoring data sources
• Data is logically partitioned by tags, which are defined in the annotation @Partition({"cluster", "datacenter"})
• Data is physically sharded by a native HBase feature: rowkey range to region mapping
• Write throughput is optimized by using HBase multi-put
• Coprocessor to maximize query performance
• Evaluation of numeric filters is pushed down to HBase
• Secondary index support
• Inspection of RESTful resources and entity metadata
• Numeric filters
• Expression evaluation in output fields
• Rowkey inspection
Generic Dashboard Analytics for Eagle Store
• Interactive: IPython notebook-like interactive visualization, analysis and troubleshooting.
• Dashboard: customizable dashboard layout and drill-down path; persist and share.
Open Source Soon …
• First use case: Eagle to secure the Hadoop platform, based on the Eagle framework
• Work closely with Hortonworks, Dataguise, …
• Share with the community and get the community’s support
• Continue to open source job monitoring, GC monitoring, etc.
Q & A

Hadoop Eagle: Full-stack realtime monitoring framework for eBay hadoop

  • 1. HADOOP EAGLE Full-stack realtime monitoring framework for eBay hadoop Edward Zhang @yonzhang2012 | Hao Chen @ihaoch
  • 2. Use case: Detect node anomaly by analyzing task failure ratio across all nodes Assumption : task failure ratio for every node should be approximately equal Algorithm : node by node compare (symmetry violation) and per node trend HADOOP EAGLE – EBAY INC 2 HADOOP EAGLE Background – initial use cases
  • 3. 3 Host: Task failure based anomaly host detection HADOOP EAGLE – EBAY INC HADOOP EAGLE Anomaly Detection & Alerting Analysis Auto-Remediation
  • 4. 4 Scale Challenges @ eBay Hadoop Monitoring HADOOP EAGLE – EBAY INC HADOOP EAGLE • 10+ large Hadoop clusters • 10,000+ data nodes • 50,000+ jobs per day • 50,000,000+ tasks per day • 500+ types of Hadoop/Hbase native metrics • Billions of audit events, metrics per day
  • 5. 5 Use cases challenges @ eBay Hadoop Monitoring HADOOP EAGLE – EBAY INC HADOOP EAGLE • Host • Task failure ratio based machine anomaly detection • Job monitoring across its lifetime • Real-time running job performance analysis • Near real-time job history analytics • Data skew detection • Hadoop native metrics • Hdfs • Hbase • M/R • Logs • GC log • Hadoop daemon log • Audit log • HDFS image file • Yarn Framework • Queue
  • 6. HADOOP EAGLE – EBAY INC 6 HADOOP EAGLE Engineering Challenges @ eBay Hadoop Monitoring • Varieties of data sources M/R history job, running, GC log, namenode log, hadoop native metrics, YARN queue, audit log, hdfs image file etc. • Varieties of data collectors pull form hdfs, pull YARN API, ship logs, … • Complex business logic join outside data, pre-aggregations, memory window … • Alert rules can’t be hot deployed • Scalability issue with single process
  • 7. 7 Job History Performance Analyzer HADOOP EAGLE – EBAY INC HADOOP EAGLE • Monitor job history files in near real-time • Crawl job history files immediately after it is completed • Apply expertise rules for job performance suggestions • Job history trend for the same type of job Job Start Event Task Start Event Task End Event Task roll-up Task2 Start Event Task2 End Event Task roll- up Job End Event Job Suggestion Rules
  • 8. 8 Job real-time monitoring HADOOP EAGLE – EBAY INC HADOOP EAGLE • Monitoring running job in real time • Minute-level job progress snapshots • Minute-level resource usage snapshots • CPU, HDFS I/O, Disk I/O, slot seconds • Roll up to user/queue/cluster level • Slide window based alert
  • 9. 9 Service: GC Log / Server Log HADOOP EAGLE – EBAY INC HADOOP EAGLE • GC event detection and prediction • Log metrics statistics • Real-time log anomaly detection
  • 10. Why Eagle Monitoring Framework HADOOP EAGLE – EBAY INC 10 HADOOP EAGLE
  • 11. 11 • Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards • We need create framework to cover full stack in monitoring system Programming Paradigm and Abstraction HADOOP EAGLE – EBAY INC HADOOP EAGLE
  • 12. 12 As a framework, Eagle does not assume : • Data source (where, what) • Business logic execution path (how) • Policy engine implementation (how) • Data sink (where, what) Eagle as a Framework HADOOP EAGLE – EBAY INC As a framework, Eagle does the following: • SQL-like service API • High-performing query framework • Lightweight streaming process java API • Extensible policy engine implementation • Scalable and distributed rule evaluation • Native HBase data storage support • Metadata driven stream processing • Data source extensibility • Data sink extensibility • Interactive dashboard HADOOP EAGLE
  • 13. Eagle Overall Architecture 13HADOOP EAGLE – EBAY INC HADOOP EAGLE
  • 14. Eagle Monitoring Framework Internals HADOOP EAGLE – EBAY INC 14 • Lightweight Streaming Process Framework • Extensible & Scalable Policy Framework for Alert • Eagle Query Framework • Interactive Dashboards HADOOP EAGLE
  • 15. 15 Facts • Computation is based on single event which constitutes endless continuous stream • Computation can be aggregation, time-window, length-window or join outside data etc. • Filter design pattern is used for modularizing code at the beginning Lightweight Streaming Process Framework HADOOP EAGLE – EBAY INC HADOOP EAGLE Abstraction  Inspired by cascading framework, we abstract a light-weight streaming programing API which is independent of execution environment  Streaming process is directed acyclic graph  This layer of indirection is for code modularization, code reuse and prevention of coupling with specific execution environment  Runs on single process, Storm or other streaming technology like Spark
  • 16. 16 Step 1: Task DAG graph setup Eagle Stream Data Processing API HADOOP EAGLE – EBAY INC HADOOP EAGLE @Override protected void buildDependency(FlowDef def, DataProcessConfig config) { Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild(); Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild(); Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild(); def.endBy(groupbyUppercaseTask); } Step 2: Inter-task data exchange protocol @Override protected void buildDependency(FlowDef def, DataProcessConfig config) { Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild(); Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild(); Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild(); def.endBy(groupbyUppercaseTask); }
  • 17. 17 Execution Graph development, compile and deploy Development / Compile Phase Deployment / Runtime Phase HADOOP EAGLE – EBAY INC HADOOP EAGLE
  • 18. Eagle Monitoring Framework Internals HADOOP EAGLE – EBAY INC 18 • Lightweight Streaming Process Framework • Extensible & Scalable Policy Framework for Alert • Eagle Query Framework • Interactive Dashboards HADOOP EAGLE
  • 19. 19 Extensible & Scalable Policy framework HADOOP EAGLE – EBAY INC HADOOP EAGLE Scalability • Dynamic policy partitioning across compute nodes based on configurable partition class • Dynamic policy deployment • Event partitioning by storm and policy partitioning by Eagle (N events * M policies) Extensibility • Support new policy evaluation engine, for example Siddhi, Esper, Machine learning etc. Features • Policy CRUD • Stream metadata (event attribute name, attribute type, attribute value resolver, …)
  • 20. Dynamic Policy Partitioning
  • 21. Scalability of Policy Evaluation
  • 22. Extensibility of policy framework
      The Policy Evaluator Provider uses SPI to register policy engine implementations.
          public interface PolicyEvaluatorServiceProvider {
              // Policy type handled by this provider
              public String getPolicyType();
              public Class<? extends PolicyEvaluator> getPolicyEvaluator();
              public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
              public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
              public List<Module> getBindingModules();
          }
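      Assuming that "SPI" here refers to the standard Java service-provider mechanism, provider discovery could look roughly like the following sketch based on `java.util.ServiceLoader`. `SketchPolicyProvider` is a hypothetical, trimmed-down interface used only for illustration, and `SketchPolicyRegistry` is not part of Eagle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical, trimmed-down provider contract used only for this sketch. */
interface SketchPolicyProvider {
    String getPolicyType();
}

final class SketchPolicyRegistry {
    /**
     * Discovers every provider declared in a META-INF/services file on the
     * classpath and indexes it by the policy type it reports.
     */
    static Map<String, SketchPolicyProvider> discover() {
        Map<String, SketchPolicyProvider> byType = new HashMap<>();
        for (SketchPolicyProvider provider : ServiceLoader.load(SketchPolicyProvider.class)) {
            byType.put(provider.getPolicyType(), provider);
        }
        return byType;
    }
}
```

      Under this mechanism a new engine is added by shipping a jar that contains an implementation class plus a `META-INF/services` file named after the provider interface; the framework code itself does not change.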
  • 23. Eagle Monitoring Framework Internals • Lightweight Streaming Process Framework • Extensible & Scalable Policy Framework for Alert • Eagle Query Framework • Interactive Dashboards
  • 24. Eagle Query Framework
      A lightweight, metadata-driven store layer that serves the storage and query requirements shared by most monitoring systems
      Persistence: • Metric • Event • Metadata • Alert • Log • Customized Structure • …
      Query: • Search • Filter • Aggregation • Sort • Expression • …
      Features: • Simple API • Powerful query • High performance • Scalability • Pluggable • …
  • 25. Eagle Query Framework
  • 26. Eagle Query Framework • Metadata definition ORM • High-performance RESTful API supporting CRUD • SQL-like declarative query syntax • Generic service client library • Native support for HBase and RDBMS • Interactive and customizable dashboard
  • 27. Metadata definition ORM
      • Annotations are metadata attached to an entity
      • Metadata-driven query compiling and response rendering
      • Metadata-driven ser/deser
      • Rename columns to shorter strings (HBase)
      • Entity metadata primitives: Table, ColumnFamily, Prefix (the very first partition key), Service (entity identifier), Partition, Tags, Indexes, Column
          @Table("alertdef")
          @ColumnFamily("f")
          @Prefix("alertdef")
          @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
          @TimeSeries(false)
          @Partition({"cluster", "datacenter"})
          @Tags({"programId", "alertExecutorId", "policyId", "policyType"})
          @Indexes({
              @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" })
          })
          public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
              // Short column qualifiers keep the stored HBase cells small
              @Column("a")
              private String desc;
              @Column("b")
              private String policyDef;
              @Column("c")
              private String dedupeDef;
              @Column("d")
              private String notificationDef;
              @Column("e")
              private String remediationDef;
              @Column("f")
              private boolean enabled;
              // ...
          }
  • 28. Generic RESTful API & Query
      eagle-service/rest/entities?query=
      query ::= <EntityName> "[" <FilterCondition> "]" "<" <GroupbyFields> ">" "{" <AggregatedFunctions> "}" [ "." "{" <SortbyOptions> "}" ]
  • 29. Generic RESTful API Query Syntax
      Search Query
          query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]{@startTime,@numTotalMaps}&startTime=&endTime=&pageSize=100
      Aggregate Query
          Aggregation Query ::= <EntityName> [QueryCondition]<GroupbyFields>{AggregatedFunctions}.{SortbyOptions}
          query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]<@user>{count, min(endTime-startTime)}&startTime=&endTime=&pageSize=100
      Numeric Filters
          query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100
      Operators
          CONTAINS, IN, !=, =, <, <=, >, >=
      Paginations
          query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100&startRowkey=BgVz-9R…….
      TimeSeries Histogram Query
          query=GenericMetricService[@cluster="ares" AND @datacenter="lvs"]<@user>{sum(value)}.{sum(value) desc}&timeSeries=true&intervmin=1440&pageSize=10000000&startTime=2014-07-01 00:00:00&endTime=2014-08-01 00:00:00&metricName=eagle.hdfs.spacesize.cluster
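      The query grammar can be read as a simple concatenation of bracketed parts. The sketch below assembles such a query string from its pieces; `SketchQueryBuilder` is a hypothetical helper written for this illustration, not part of the Eagle client, and it leaves out URL encoding, which a real HTTP client would presumably need to apply.

```java
import java.util.List;

/** Hypothetical helper that assembles a query string following the grammar on the slides. */
final class SketchQueryBuilder {
    /**
     * @param entity  entity (service) name, e.g. "GenericMetricService"
     * @param filter  filter condition placed inside [ ]
     * @param groupBy group-by fields placed inside < >; empty for a plain search query
     * @param outputs output fields or aggregate functions placed inside { }
     * @param sortBy  sort option placed inside .{ }; null when no sorting is requested
     */
    static String build(String entity, String filter, List<String> groupBy,
                        List<String> outputs, String sortBy) {
        StringBuilder query = new StringBuilder(entity);
        query.append('[').append(filter).append(']');
        if (!groupBy.isEmpty()) {
            query.append('<').append(String.join(",", groupBy)).append('>');
        }
        query.append('{').append(String.join(",", outputs)).append('}');
        if (sortBy != null) {
            query.append(".{").append(sortBy).append('}');
        }
        return query.toString();
    }
}
```

      Passing the entity `GenericMetricService`, the filter `@cluster="ares" AND @datacenter="lvs"`, the group-by field `@user`, the function `sum(value)` and the sort option `sum(value) desc` reproduces the query part of the time-series example on the slide.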
  • 30. Generic Eagle Service Client Library
      • Basic CRUD • Fluent DSL • Metric Builder API • Parallel Client • Asynchronous Client
          // Metric builder: batch several data points, then flush them
          client.metric("unit.test.metrics")
                .batch(5)
                .tags(tags)
                .send("unit.test.metrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
                .send(System.currentTimeMillis(), 0.1)
                .send(System.currentTimeMillis(), 0.1, 0.2)
                .send(System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
                .send("unit.test.anothermetrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
                .flush();
          // Fluent search: quotes inside the query string must be escaped
          client.search("GenericMetricService[@cluster=\"cluster4ut\" AND @datacenter = \"datacenter4ut\"]<@cluster>{sum(value)}")
                .startTime(0)
                .endTime(System.currentTimeMillis() + 24 * 3600 * 1000)
                .metricName("unit.test.metrics")
                .pageSize(1000)
                .send();
  • 31. HBase Storage Design
      Uniform rowkey design
          Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
      Metric
          Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
      Entity
          Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
      Log
          Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
          Rowvalue ::= Log Content
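      As a rough illustration of the uniform layout, the sketch below concatenates the rowkey fields in the order given on the slide. The byte encoding is an assumption made for this example only: each string is reduced to a 4-byte hash code, the timestamp is written as an 8-byte big-endian long, and tags are sorted by name so that the key does not depend on insertion order. The slide does not state how Eagle actually encodes these fields, and `SketchRowkey` is a hypothetical class.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical rowkey builder; the field order follows the slide, the encoding is assumed. */
final class SketchRowkey {
    static byte[] build(String prefix, List<String> partitionValues,
                        long timestamp, Map<String, String> tags) {
        // Sort tags by name so that equal tag sets always produce the same key.
        TreeMap<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4                              // prefix hash
                 + 4 * partitionValues.size()     // one hash per partition key value
                 + 8                              // timestamp
                 + 8 * sortedTags.size();         // tagName hash + tagValue hash per tag
        ByteBuffer buffer = ByteBuffer.allocate(size);
        buffer.putInt(prefix.hashCode());
        for (String value : partitionValues) {
            buffer.putInt(value.hashCode());
        }
        buffer.putLong(timestamp);
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buffer.putInt(tag.getKey().hashCode());
            buffer.putInt(tag.getValue().hashCode());
        }
        return buffer.array();
    }
}
```

      Placing the prefix and the partition key values first means that rows of the same entity type, cluster and datacenter sort next to each other, so a query restricted to one partition can be served by a contiguous rowkey range.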
  • 32. HBase Coprocessor: com.ebay.eagle.coprocessor.AggregateProtocol (chart comparing avg, count, max, min and sum for three cases: no coprocessor in a single region, coprocessor in a single region, and estimated in cluster)
  • 33. Tuning for HBase Storage
      • Uniform HBase rowkey design for all types of monitoring data sources
      • Logically partition data by tags, which are defined in the annotation @Partition({"cluster", "datacenter"})
      • Physically shard data with the native HBase feature: rowkey range and region mapping
      • Write throughput optimized by using HBase multi-put
      • Coprocessor to maximize query performance
      • Push evaluation of numeric filters down to HBase
      • Secondary index support
      • Inspection of RESTful resources and entity metadata
      • Numeric filters
      • Expression evaluation in output fields
      • Rowkey inspection
  • 34. Eagle Monitoring Framework Internals • Lightweight Streaming Process Framework • Extensible & Scalable Policy Framework for Alert • Eagle Query Framework • Interactive Dashboards
  • 35. Generic Dashboard Analytics for Eagle Store
      • Interactive: IPython notebook-like interactive visualization, analysis and troubleshooting
      • Dashboard: customizable dashboard layout and drill-down path, which can be persisted and shared
  • 36. Open Source Soon …
      • First use case: using Eagle to secure the Hadoop platform, built on the Eagle framework
      • Work closely with Hortonworks, Dataguise, …
      • Share with the community and get the community's support
      • Continue to open source job monitoring, GC monitoring, etc.
  • 37. Q & A

Editor's Notes

  1. Anomaly detection algorithm: Continuously crawl job history files immediately after each job completes. Calculate the minute-level job failure ratio for each node. A node is identified as anomalous when either of the following two conditions holds: (a) the node continuously fails the tasks that run on it; (b) the node has a significantly higher failure ratio than the rest of the nodes in the cluster.
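     The two conditions in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not Eagle's detector: the class name `SketchNodeAnomalyDetector` and all three thresholds are hypothetical, "continuously fails" is read as every minute in the window being at or above a high failure ratio, and "significantly higher" is read as a node's mean ratio exceeding a multiple of the mean of the other nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical detector for the two conditions in the note; all thresholds are assumed values. */
final class SketchNodeAnomalyDetector {
    static final double CONTINUOUS_FAILURE_RATIO = 0.9; // assumed: "fails continuously"
    static final double PEER_FACTOR = 3.0;              // assumed: "significantly higher"
    static final double MIN_RATIO = 0.05;               // assumed floor to ignore noise

    /** Input: per-node failure ratios, one value per minute of the observation window. */
    static List<String> detect(Map<String, double[]> minuteRatiosByNode) {
        double sumOfMeans = 0.0;
        for (double[] ratios : minuteRatiosByNode.values()) {
            sumOfMeans += mean(ratios);
        }
        int nodeCount = minuteRatiosByNode.size();
        List<String> anomalous = new ArrayList<>();
        for (Map.Entry<String, double[]> entry : minuteRatiosByNode.entrySet()) {
            double[] own = entry.getValue();
            // Condition 1 (per-node trend): every minute in the window is failing.
            boolean alwaysFailing = own.length > 0;
            for (double ratio : own) {
                if (ratio < CONTINUOUS_FAILURE_RATIO) {
                    alwaysFailing = false;
                }
            }
            // Condition 2 (symmetry violation): far above the mean of the other nodes.
            double ownMean = mean(own);
            boolean outlier = false;
            if (nodeCount > 1) {
                double peerMean = (sumOfMeans - ownMean) / (nodeCount - 1);
                outlier = ownMean > MIN_RATIO && ownMean > PEER_FACTOR * peerMean;
            }
            if (alwaysFailing || outlier) {
                anomalous.add(entry.getKey());
            }
        }
        return anomalous;
    }

    private static double mean(double[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }
}
```

     The comparison against peers relies on the assumption that the failure ratio should be approximately equal across nodes, so a single node far above the others is more likely a machine problem than a job problem.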
  2. Inspired by TSDB, Ganglia, Nagios, Zabbix, etc. Most of them focus on infrastructure-level data collection and alerting, but they do not address business-logic complexity, that is, how to prepare the data.