HADOOP
Full-stack real-time monitoring framework for eBay Hadoop
Hao Chen | 陈浩
eBay Cloud Service
$ whoami
Hao Chen | 陈浩
Software Engineer
Analytics Data Infrastructure, Cloud Services
eBay Inc.
hchen9@ebay.com
linkedin.com/in/haozch
twitter.com/haozch
weibo.com/haochencn
eBay’s Challenges in Monitoring
Large Scale in Real Time
• 10+ large Hadoop clusters
• 10,000+ nodes
• 50,000+ jobs per day
• 50,000,000+ tasks per day
• 500+ types of Hadoop/HBase metrics
• Billions of audit events per day
Various Business Logic
• Hadoop
• HBase
• Spark
• Data Security
• Hardware
• Cloud
• Database
Complex and Scalable Policy
• Join multiple data sources
• Threshold-based, window-based
• Multiple-metric correlation
• Metric pre-aggregation
• Machine-learning-based
Engineering Modularization
• Varieties of data sources
• Varieties of data collectors
• Complex business logic
• Alert rules can’t be hot-deployed
• Scalability issues with a single process
What’s Eagle
A unified monitoring and alerting framework for monitoring large-scale distributed systems such as Hadoop, Spark, and cloud in real time.
Eagle = Eagle Framework + Eagle Apps
Eagle Ecosystem
Apps
• DAM
• JPA
• HBase
• Spark
Interface
• Web Portal
• REST Services
• Ambari Plugin
Integration
• Kafka
• Storm
• HBase
• Druid
• Elasticsearch
Eagle Framework
Provides a full-stack monitoring framework for efficiently developing highly scalable real-time monitoring applications.
Eagle Apps
Provides built-in monitoring applications for domains such as Hadoop, Spark, HBase, Storm, and cloud.
Eagle Integration
Integrates with distributed real-time execution environments such as Storm, message buses such as Kafka, and storage layers such as HBase, and also supports extensions.
Eagle Interface
Allows Eagle to be accessed or managed through the REST service, web UI, or Ambari plugin.
Eagle App Highlights
JPA: Job Performance Analyzer
DAM: Security Data Activity Monitoring
JPA: Job Performance Analyzer
Historical job analysis
Running job analysis
Anomaly host detection
Job data skew detection
Job performance suggestion
Anomaly Prediction based on machine learning
Monitor and analyze job performance in real time
Historical Job Analyzer
• Job historical performance trend
• Task and attempt distribution
• Resource utilization at various levels (cluster/job/user/host)
• Anomaly detection on historical performance
  • TooLowBytesConsumedPerCPUSecond
  • JobStatisticLongDuration
  • TooLargeReduceNumAlert
  • TooLargeShuffleSizeAlert
Running Job Analyzer
• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
  • CPU, HDFS I/O, disk I/O, slot seconds
  • Roll up to user/queue/cluster level
• Anomaly detection on running status
  • TooLongJobDuration
  • NoProgressForLong
  • TooManyTaskFailure
Task Failure based Anomaly Host Detection
Use Case: Detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: The task failure ratio should be approximately equal on every node
Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
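The node-by-node comparison can be illustrated with a minimal sketch. It is not Eagle's implementation: the class name `AnomalyHostDetector`, the use of the cluster median as the reference value, and the `tolerance` parameter are all assumptions made for illustration, and the per-node trend part of the algorithm is not covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Eagle: flags hosts whose task failure
// ratio breaks the "all nodes fail at about the same rate" assumption.
class AnomalyHostDetector {

    // Failure ratio of one host; a host with no tasks counts as 0.0.
    static double failureRatio(long failedTasks, long totalTasks) {
        return totalTasks == 0 ? 0.0 : (double) failedTasks / totalTasks;
    }

    // Median of the given values (the list is copied before sorting).
    static double median(List<Double> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Returns the hosts whose ratio exceeds the cluster median by more than `tolerance`.
    static List<String> detect(Map<String, Double> ratioByHost, double tolerance) {
        double clusterMedian = median(new ArrayList<>(ratioByHost.values()));
        List<String> anomalies = new ArrayList<>();
        for (Map.Entry<String, Double> e : ratioByHost.entrySet()) {
            if (e.getValue() - clusterMedian > tolerance) {
                anomalies.add(e.getKey());
            }
        }
        return anomalies;
    }

    // Made-up sample data: host-c fails far more often than its peers.
    static Map<String, Double> sampleRatios() {
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("host-a", failureRatio(2, 100));
        ratios.put("host-b", failureRatio(3, 100));
        ratios.put("host-c", failureRatio(40, 100));
        ratios.put("host-d", failureRatio(1, 100));
        return ratios;
    }

    public static void main(String[] args) {
        System.out.println(detect(sampleRatios(), 0.10));
    }
}
```

The median is used instead of the mean so that a single badly failing host does not drag the reference value toward itself.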
Figure panels: Alerting (anomaly detection & alerting); Insight (task failure drill-down)
Real-time Data Skew Detection
Use Case: Detect data skew from the statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Counters
• Correlation > 0.9 & Max(Z-Score) > 90%
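A minimal sketch of the two statistics behind the detection rule, assuming per-attempt durations and input record counts are already available as arrays. The class `DataSkewDetector` and its method names are hypothetical. The slide does not say how the "90%" bound on the max z-score is computed, so the sketch takes a plain z-score threshold (`zThreshold`) as a parameter instead.

```java
// Hypothetical helper, not Eagle code: combines the correlation between
// attempt duration and input size with the largest z-score of the durations.
class DataSkewDetector {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return xs.length == 0 ? 0.0 : sum / xs.length;
    }

    // Population standard deviation.
    static double stdDev(double[] xs) {
        if (xs.length == 0) {
            return 0.0;
        }
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / xs.length);
    }

    // Largest absolute z-score; 0.0 when all values are identical.
    static double maxZScore(double[] xs) {
        double m = mean(xs);
        double sd = stdDev(xs);
        if (sd == 0.0) {
            return 0.0;
        }
        double max = 0.0;
        for (double x : xs) {
            max = Math.max(max, Math.abs(x - m) / sd);
        }
        return max;
    }

    // Pearson correlation coefficient; 0.0 when either series is constant.
    static double correlation(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double mx = mean(xs);
        double my = mean(ys);
        double sdx = stdDev(xs);
        double sdy = stdDev(ys);
        if (sdx == 0.0 || sdy == 0.0) {
            return 0.0;
        }
        double cov = 0.0;
        for (int i = 0; i < xs.length; i++) {
            cov += (xs[i] - mx) * (ys[i] - my);
        }
        return cov / xs.length / (sdx * sdy);
    }

    // Skew is reported only when both conditions hold, mirroring the "&" in the rule.
    static boolean isSkewed(double[] durations, double[] inputRecords,
                            double minCorrelation, double zThreshold) {
        return correlation(durations, inputRecords) > minCorrelation
                && maxZScore(durations) > zThreshold;
    }

    public static void main(String[] args) {
        // Made-up sample: the last attempt receives six times the input of the others.
        double[] durations = {10, 11, 9, 10, 10, 11, 9, 10, 10, 60};
        double[] records = {100, 110, 90, 100, 100, 110, 90, 100, 100, 600};
        System.out.println(isSkewed(durations, records, 0.9, 2.5));
    }
}
```

Requiring both conditions separates a slow attempt that simply got more data (skew) from one that is slow for another reason, such as a bad host.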
Anomaly Prediction based on Machine Learning
• Anomaly Metric Predictive Detection
  • Offline: Analyze and combine 500+ metrics for causal anomaly detection (IG -> PCA -> GMM -> MCC)
  • Online: Predictively alert on anomalous metrics
Figure panels: PCA (Principal Component Analysis); data and probability distribution with threshold selection; normal (green) and abnormal (red) points
DAM: Data Activity Monitoring
Secure Hadoop in real time
Security Use Cases
Security Architecture Overview
Security Components Highlights
Security Machine Learning Integration
Security Use Cases
Data Loss Prevention
Get alerted about, and stop, a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
Malicious Logins
Detect logins in which a malicious user tries to guess a password. Eagle creates user profiles with a machine learning algorithm to detect anomalies.
Unauthorized access
Detect and stop a malicious user trying to access classified data without privilege.
Malicious user operation
Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
Security Architecture Overview
Security Component Highlights
Policy Manager
Expressive language for creating and modifying policies for alerting and remediation on data activity monitoring events.
Data classification
Integrates with Dataguise & Apache Ranger.
Policy-based Remediation
Ability to detect and stop a threat, improve operational efficiency, and reduce regulatory compliance costs.
User Profiling
Uses machine learning to automatically generate anomaly detection policies.
User Activity Exploration
Ability to drill down into alert details to understand the data security threat.
Security Machine Learning Integration
• User Activity Profiling
  • Offline: Determine the kernel density function parameters (bandwidth) from the training dataset (KDE)
  • Online: If a test data point lies outside the trained bandwidth, it is an anomaly (policy)
Figure panels: PCs (Principal Components) in EVD (Eigenvalue Decomposition); Kernel Density Function
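A one-dimensional sketch of kernel-density-based profiling, under several assumptions that the slide does not confirm: a Gaussian kernel, a bandwidth picked with the normal-reference rule of thumb (1.06 · σ · n^(-1/5)), and an online check that treats a point as anomalous when its estimated density falls below a caller-supplied `minDensity`. The class `KdeUserProfile` is hypothetical, and the EVD step is not shown.

```java
// Hypothetical sketch, not Eagle code: a 1-D Gaussian kernel density
// estimate trained on one user's historical activity values.
class KdeUserProfile {
    private final double[] samples;
    private final double bandwidth;

    // "Offline" step: keep the training samples and derive the bandwidth.
    KdeUserProfile(double[] trainingSamples) {
        if (trainingSamples.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        this.samples = trainingSamples.clone();
        this.bandwidth = ruleOfThumbBandwidth(this.samples);
        if (this.bandwidth <= 0.0) {
            throw new IllegalArgumentException("samples must not all be equal");
        }
    }

    // Normal-reference rule of thumb: h = 1.06 * sampleStdDev * n^(-1/5).
    static double ruleOfThumbBandwidth(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        double mean = sum / xs.length;
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - mean) * (x - mean);
        }
        double sampleStdDev = Math.sqrt(squares / (xs.length - 1));
        return 1.06 * sampleStdDev * Math.pow(xs.length, -0.2);
    }

    double bandwidth() {
        return bandwidth;
    }

    // Estimated density at x: the average of Gaussian kernels centred on the samples.
    double density(double x) {
        double sum = 0.0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (samples.length * bandwidth);
    }

    // "Online" step: low density means the point is unlike the training data.
    boolean isAnomaly(double x, double minDensity) {
        return density(x) < minDensity;
    }

    public static void main(String[] args) {
        // Made-up training data, e.g. read operations per minute for one user.
        KdeUserProfile profile =
                new KdeUserProfile(new double[] {10, 12, 11, 9, 10, 11, 10, 12, 9, 11});
        System.out.println(profile.isAnomaly(10.5, 0.01));
        System.out.println(profile.isAnomaly(50.0, 0.01));
    }
}
```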
Security Machine Learning Integration
• User Activity Profiling on Spark
Offline training on Spark: Historical Audit Events (archived data on HDFS) -> Batch Preprocess -> User Profile Model Generation (KDE + EVD Algorithm) -> Persist model -> Eagle Storage
Online detection on Storm: Real-time Audit Events (real-time stream from Kafka) -> Stream Preprocess -> Policy Engine (dynamically load models & policies) -> Alert Consumer -> Persist alert -> Eagle Storage
Eagle Security Plugins
Eagle Monitoring Framework
Eagle = Eagle Framework + Eagle Apps
Full-stack real-time monitoring framework
• Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards
• We need a framework that covers the full stack of a monitoring system
Monitoring Programming Paradigm
Eagle Monitoring Framework Highlights
Eagle = Eagle Framework + Eagle Apps
Lightweight Streaming Process Framework
Extensible & Scalable Policy Framework
Eagle Query Framework
Customizable Dashboards
Eagle Stream Data Processing API
Step 1: Task DAG setup
@Override
protected void buildDependency(FlowDef def, DataProcessConfig config) {
    // Source task: the head of the DAG.
    Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
    // Upper-casing task, fed by the source task.
    Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor())
            .connectFrom(header).completeBuild();
    // Group-by-and-count task, fed by the upper-casing task.
    Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor())
            .connectFrom(uppertask).completeBuild();
    // The flow ends at the group-by task.
    def.endBy(groupbyUppercaseTask);
}
Step 2: Inter-task data exchange protocol
@Override
protected void buildDependency(FlowDef def, DataProcessConfig config) {
    Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
    // Each connectFrom(...) call defines an edge over which data flows between two tasks.
    Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor())
            .connectFrom(header).completeBuild();
    Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor())
            .connectFrom(uppertask).completeBuild();
    def.endBy(groupbyUppercaseTask);
}
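To make the idea of chained tasks concrete, here is a self-contained toy model. It does not use Eagle's `Task`, `FlowDef`, or executor classes, whose internals the slides do not show. `MiniTask` and `MiniFlowDemo` are hypothetical stand-ins that only mirror the builder style, and the model covers a linear chain rather than a general DAG.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical stand-in for a flow node: a named step that transforms
// the output of its upstream task.
class MiniTask {
    final String name;
    final Function<List<String>, List<String>> executor;
    MiniTask upstream; // null for the source task

    MiniTask(String name, Function<List<String>, List<String>> executor) {
        this.name = name;
        this.executor = executor;
    }

    // Declares the edge parent -> this and returns this for chaining.
    MiniTask connectFrom(MiniTask parent) {
        this.upstream = parent;
        return this;
    }

    // Runs the chain from the source down to this task and returns its output.
    List<String> run() {
        List<String> input = upstream == null ? new ArrayList<String>() : upstream.run();
        return executor.apply(input);
    }
}

class MiniFlowDemo {
    // Builds source -> uppercase -> group-by-count and returns "WORD=count" entries.
    static List<String> runWordCount(List<String> words) {
        MiniTask header = new MiniTask("wordgenerator", in -> words);
        MiniTask upper = new MiniTask("uppercase", in -> {
            List<String> out = new ArrayList<>();
            for (String w : in) {
                out.add(w.toUpperCase(Locale.ROOT));
            }
            return out;
        }).connectFrom(header);
        MiniTask groupBy = new MiniTask("groupby_uppercase", in -> {
            Map<String, Integer> counts = new TreeMap<>();
            for (String w : in) {
                counts.merge(w, 1, Integer::sum);
            }
            List<String> out = new ArrayList<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.add(e.getKey() + "=" + e.getValue());
            }
            return out;
        }).connectFrom(upper);
        return groupBy.run();
    }

    public static void main(String[] args) {
        System.out.println(runWordCount(java.util.Arrays.asList("eagle", "Eagle", "hadoop")));
    }
}
```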
Execution Graph development, compile and deploy
Development / Compile Phase
Deployment / Runtime Phase
Extensible & Scalable Policy Framework
Usability
• Declarative Policy Definition Syntax
• Stream Metadata (event attribute name, attribute type, attribute value resolver, …)
Scalability
• Dynamic policy partitioning across compute nodes based on configurable partition class
• Dynamic policy deployment
• Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
Extensibility
• Supports new policy evaluation engines, for example Siddhi, Esper, or machine learning
Usability of Policy Framework
Case: HBase region server high call queue length
Policy: In the past 30 minutes, the call queue length exceeded 2000 at least 20 times
from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min)
select host, value, avg(value) as avgValue, count(*) as count
group by host
having count >= 20
insert into HighRegionServerCallQueueLengthStream;
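What this policy evaluates can be sketched in plain Java, without a CEP engine. The class `CallQueueLengthPolicy` is hypothetical and is not how Eagle or Siddhi implement the query. It assumes events arrive in timestamp order and leaves out the `avg(value)` projection, keeping only the per-host count over a sliding 30-minute window.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Eagle code: counts, per host, how many readings
// above a threshold fall inside a sliding time window.
class CallQueueLengthPolicy {
    private final long windowMillis;
    private final double valueThreshold;
    private final int minCount;
    // Timestamps of matching readings per host, oldest first.
    private final Map<String, Deque<Long>> hitsByHost = new HashMap<>();

    CallQueueLengthPolicy(long windowMillis, double valueThreshold, int minCount) {
        this.windowMillis = windowMillis;
        this.valueThreshold = valueThreshold;
        this.minCount = minCount;
    }

    // Feeds one reading; returns true when the host currently meets the alert condition.
    boolean onEvent(String host, long timestampMillis, double value) {
        Deque<Long> hits = hitsByHost.computeIfAbsent(host, h -> new ArrayDeque<>());
        // Drop readings that have slid out of the window.
        while (!hits.isEmpty() && hits.peekFirst() <= timestampMillis - windowMillis) {
            hits.pollFirst();
        }
        // Filter step, like [value>2000] in the query.
        if (value > valueThreshold) {
            hits.addLast(timestampMillis);
        }
        // Having step, like "having count >= 20".
        return hits.size() >= minCount;
    }

    public static void main(String[] args) {
        CallQueueLengthPolicy policy = new CallQueueLengthPolicy(30 * 60_000L, 2000.0, 20);
        boolean alert = false;
        // Made-up readings: one high value per minute for 20 minutes.
        for (int minute = 0; minute < 20; minute++) {
            alert = policy.onEvent("regionserver-1", minute * 60_000L, 2500.0);
        }
        System.out.println(alert);
    }
}
```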
Scalability of Policy Evaluation
Dynamic Policy Partition
• N users with 3 partitions and M policies with 2 partitions give 3*2 physical tasks
• Physical partition + policy-level partition
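The 3*2 arithmetic can be sketched as a mapping from a (user, policy) pair to one physical task. This is an assumption-laden illustration: the class `PolicyPartitioner` is hypothetical, and hash-based assignment is just one possible partition strategy, whereas the slides mention a configurable partition class.

```java
// Hypothetical sketch, not Eagle code: maps each (user, policy) pair to one of
// eventPartitions * policyPartitions physical tasks.
class PolicyPartitioner {
    private final int eventPartitions;
    private final int policyPartitions;

    PolicyPartitioner(int eventPartitions, int policyPartitions) {
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    int totalTasks() {
        return eventPartitions * policyPartitions;
    }

    // Physical (event) partition of a user, by hash.
    int eventPartition(String user) {
        return Math.floorMod(user.hashCode(), eventPartitions);
    }

    // Policy-level partition of a policy, by hash.
    int policyPartition(String policyId) {
        return Math.floorMod(policyId.hashCode(), policyPartitions);
    }

    // The single task that evaluates this policy for this user's events.
    int taskFor(String user, String policyId) {
        return eventPartition(user) * policyPartitions + policyPartition(policyId);
    }

    // All tasks that must see a user's event: one per policy partition.
    int[] tasksForEvent(String user) {
        int[] tasks = new int[policyPartitions];
        for (int p = 0; p < policyPartitions; p++) {
            tasks[p] = eventPartition(user) * policyPartitions + p;
        }
        return tasks;
    }

    public static void main(String[] args) {
        PolicyPartitioner partitioner = new PolicyPartitioner(3, 2);
        System.out.println(partitioner.totalTasks());
        System.out.println(partitioner.taskFor("alice", "policy-1"));
    }
}
```

With this layout each task holds only about 1/2 of the policies and sees only about 1/3 of the events, which is the point of partitioning along both dimensions.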
Extensibility of Policy Framework
public interface PolicyEvaluatorServiceProvider {
    // Name of the policy type this provider handles.
    public String getPolicyType();
    public Class<? extends PolicyEvaluator> getPolicyEvaluator();
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
    public List<Module> getBindingModules();
}
The policy evaluator provider uses SPI (Service Provider Interface) to register policy engine implementations
Built-in Supported Policy Engine
• Siddhi Complex Event Processing Engine
• Machine Learning based Policy Engine
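SPI-based registration can be sketched with the JDK's `java.util.ServiceLoader`. Everything here except `ServiceLoader` is hypothetical: `PolicyEngineProvider` is a cut-down stand-in for the interface above, and `SiddhiLikeProvider`, `PolicyEngineRegistry`, and the type name "siddhi" are made up for illustration. In a real deployment the implementations would be listed in a `META-INF/services/` file named after the interface, and with no such file the loader simply finds nothing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical, simplified provider contract (the slide's interface has more methods).
interface PolicyEngineProvider {
    String getPolicyType();
}

// Hypothetical provider used only to show manual registration.
class SiddhiLikeProvider implements PolicyEngineProvider {
    @Override
    public String getPolicyType() {
        return "siddhi";
    }
}

// Hypothetical registry that indexes providers by policy type.
class PolicyEngineRegistry {
    private final Map<String, PolicyEngineProvider> providers = new HashMap<>();

    // Discovers providers declared on the classpath via the SPI mechanism.
    void loadFromClasspath() {
        for (PolicyEngineProvider provider : ServiceLoader.load(PolicyEngineProvider.class)) {
            register(provider);
        }
    }

    void register(PolicyEngineProvider provider) {
        providers.put(provider.getPolicyType(), provider);
    }

    // Returns the provider for the type, or null when none is registered.
    PolicyEngineProvider lookup(String policyType) {
        return providers.get(policyType);
    }

    public static void main(String[] args) {
        PolicyEngineRegistry registry = new PolicyEngineRegistry();
        registry.loadFromClasspath();
        registry.register(new SiddhiLikeProvider());
        System.out.println(registry.lookup("siddhi") != null);
    }
}
```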
Eagle Query Framework
A lightweight, metadata-driven storage layer that serves the storage and query requirements shared by most monitoring systems
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized Structure
• …
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• …
Customizable Dashboard
Provides real-time interactive visualization and analytics across a variety of data sources such as Eagle and Druid.
• Interactive: IPython notebook-like interactive visualization, analysis, and troubleshooting.
• Dashboard: Customizable dashboard layout and drill-down path that can be persisted and shared.
Eagle in Future
The general monitoring platform for eBay’s large-scale systems
Open Source
First Use Case
Eagle to secure Hadoop in real time, built on the Eagle framework
External Partners
Hortonworks, Dataguise, PayPal and Apache Ranger
Components to Be Open-Sourced Next
JPA (“Job Performance Analyzer”), HBase and GC monitoring, and others will be open-sourced soon
Reference
Eagle at Hadoop Summit 2015, San Jose
http://2015.hadoopsummit.org
Slides | Video
Eagle at Big Data Summit 2014, Shanghai
http://2014ebay.csdn.net/m/zone/ebay_en
Slides | Video
The End & Thanks
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Hao Chen
hchen9@ebay.com | @haozch
We are Hiring Now
https://careers.ebayinc.com
Or contact me: hchen9@ebay.com

 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Eagle from eBay at China Hadoop Summit 2015

  • 1. HADOOP Full-stack real-time monitoring framework for eBay Hadoop Hao Chen | 陈浩 eBay Cloud Service
  • 2. $ whoami: Hao Chen | 陈浩, Software Engineer, Analytics Data Infrastructure, Cloud Services, eBay Inc. hchen9@ebay.com, linkedin.com/in/haozch, twitter.com/haozch, weibo.com/haochencn
  • 3. eBay’s Challenges in Monitoring. Large Scale in Real Time: 10+ large Hadoop clusters; 10,000+ nodes; 50,000+ jobs per day; 50,000,000+ tasks per day; 500+ types of Hadoop/HBase metrics; billions of audit events per day. Various Business Logic: Hadoop, HBase, Spark, data security, hardware, cloud, database. Complex and Scalable Policy: join multiple data sources; threshold based, window based; multiple metrics correlation; metrics pre-aggregation; machine learning based. Engineering Modularization: varieties of data sources; varieties of data collectors; complex business logic; alert rules can’t be hot deployed; scalability issue with a single process.
  • 4. What’s Eagle: the uniform monitoring and alerting framework for monitoring large-scale distributed systems like Hadoop, Spark, cloud, etc. in real time. Eagle = Eagle Framework + Eagle Apps
  • 5. Eagle Ecosystem. Apps: DAM, JPA, HBase, Spark. Interface: Web Portal, REST Services, Ambari Plugin. Integration: Kafka, Storm, HBase, Druid, Elastic Search. Eagle Framework: provides a full-stack monitoring framework for efficiently developing highly scalable real-time monitoring applications. Eagle Apps: provide built-in monitoring applications for domains like Hadoop, Spark, HBase, Storm and cloud. Eagle Integration: integrates with distributed real-time execution environments like Storm, message buses like Kafka and storage layers like HBase, and also supports extensions. Eagle Interface: allows accessing or managing Eagle through the REST service, web UI or Ambari plugin.
  • 6. Eagle App Highlights. JPA: Job Performance Analyzer. DAM: Security Data Activity Monitoring.
  • 7. JPA: Job Performance Analyzer. Monitor and analyze job performance in real time. Historical job analysis; running job analysis; anomaly host detection; job data skew detection; job performance suggestion; anomaly prediction based on machine learning.
  • 8. Historical Job Analyzer • Job historical performance trend • Task and attempt distribution • Various levels (cluster/job/user/host) of resource utilization • Anomaly historical performance detection: TooLowBytesConsumedPerCPUSecond, JobStatisticLongDuration, TooLargeReduceNumAlert, TooLargeShuffleSizeAlert
  • 9. Running Job Analyzer • Monitoring running jobs in real time • Minute-level job progress snapshots • Minute-level resource usage snapshots: CPU, HDFS I/O, disk I/O, slot seconds • Roll up to user/queue/cluster level • Anomaly running status detection: TooLongJobDuration, NoProgressForLong, TooManyTaskFailure
  • 10. Task Failure based Anomaly Host Detection. Use Case: detect node anomalies by analyzing the task failure ratio across all nodes. Assumption: the task failure ratio for every node should be approximately equal. Algorithm: node-by-node comparison (symmetry violation) and per-node trend.
  • 11. Task Failure based Anomaly Host Detection. Alerting: anomaly detection & alerting. Insight: task failure drill-down.
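The node-by-node comparison described on slide 10 can be illustrated with a minimal sketch. This is not Eagle's implementation: the class `AnomalyHostSketch`, the `NodeStats` type, and the `factor` and `minRatio` parameters are all hypothetical, and the per-node trend check from the slide is left out. A node is flagged when its own task failure ratio is both above an absolute floor and several times higher than the ratio of the rest of the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of a "symmetry violation" check: each node against the rest of the cluster. */
final class AnomalyHostSketch {

    /** Task counts for one node in one time window (hypothetical type). */
    static final class NodeStats {
        final long failedTasks;
        final long totalTasks;

        NodeStats(long failedTasks, long totalTasks) {
            this.failedTasks = failedTasks;
            this.totalTasks = totalTasks;
        }
    }

    private static double ratio(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    /**
     * Returns the hosts whose failure ratio is at least minRatio and more than
     * factor times the failure ratio of all other hosts combined.
     * Both thresholds are illustrative assumptions, not values from the talk.
     */
    static List<String> findAnomalousHosts(Map<String, NodeStats> statsByHost,
                                           double factor, double minRatio) {
        long allFailed = 0;
        long allTotal = 0;
        for (NodeStats s : statsByHost.values()) {
            allFailed += s.failedTasks;
            allTotal += s.totalTasks;
        }
        List<String> anomalous = new ArrayList<>();
        for (Map.Entry<String, NodeStats> e : statsByHost.entrySet()) {
            NodeStats s = e.getValue();
            double own = ratio(s.failedTasks, s.totalTasks);
            // Ratio of the cluster with this node excluded.
            double rest = ratio(allFailed - s.failedTasks, allTotal - s.totalTasks);
            if (own >= minRatio && own > factor * rest) {
                anomalous.add(e.getKey());
            }
        }
        return anomalous;
    }
}
```

Excluding the node under test from the baseline keeps a single bad host from inflating the cluster-wide ratio it is compared against.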
  • 12. Real-time Data Skew Detection. Use Case: detect data skew by statistics and distributions for attempt execution durations and counters. Assumption: durations and counters should follow a normal distribution. Counters & Features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead. Modeling & Statistics: Avg, Min, Max, Distributions, Max z-score, Top-N, Correlation. Threshold & Detection: Counters Correlation > 0.9 & Max(Z-Score) > 90%
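The two statistics named in the detection rule on slide 12, counter correlation and maximum z-score, can be sketched as below. This is a minimal illustration under stated assumptions rather than Eagle's code: `DataSkewSketch` and its method names are hypothetical, inputs are assumed to be non-empty arrays of equal length (one entry per task attempt), the population standard deviation is used, and the slide's "> 90%" z-score threshold is replaced by a plain numeric parameter because the slide does not say how the percentage is derived.

```java
/** Sketch: flag skew when a counter tracks duration closely and one attempt is an outlier. */
final class DataSkewSketch {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Population standard deviation. */
    static double stdDev(double[] x) {
        double m = mean(x);
        double sq = 0.0;
        for (double v : x) sq += (v - m) * (v - m);
        return Math.sqrt(sq / x.length);
    }

    /** Largest z-score among the values; 0 when all values are identical. */
    static double maxZScore(double[] x) {
        double m = mean(x);
        double sd = stdDev(x);
        if (sd == 0.0) return 0.0;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) max = Math.max(max, (v - m) / sd);
        return max;
    }

    /** Pearson correlation coefficient; 0 when either series has no variance. */
    static double pearson(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double cov = 0.0, vx = 0.0, vy = 0.0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        if (vx == 0.0 || vy == 0.0) return 0.0;
        return cov / Math.sqrt(vx * vy);
    }

    /** Both thresholds are caller-supplied assumptions. */
    static boolean looksSkewed(double[] durations, double[] counter,
                               double minCorrelation, double minMaxZ) {
        return pearson(durations, counter) > minCorrelation
                && maxZScore(durations) > minMaxZ;
    }
}
```

For example, `durations` could hold reduceDuration per attempt and `counter` the matching reduceInputRecords values; a single attempt that is both slow and record-heavy drives both statistics up at once.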
  • 14. Anomaly Prediction based on Machine Learning • Anomaly Metric Predictive Detection • Offline: analyzing and combining 500+ metrics together for causal anomaly detection (IG -> PCA -> GMM -> MCC) • Online: predictively alert on anomalous metrics. Figures: normal (green) and abnormal (red) data; probability distribution and threshold selection; PCA (Principal Component Analysis).
  • 15. Anomaly Prediction based on Machine Learning • Anomaly Metric Predictive Detection
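The last stage of the offline pipeline on slide 14 is MCC, the Matthews correlation coefficient. The slides do not show how Eagle applies it, so the sketch below only computes the standard coefficient from a confusion matrix; `MccSketch` is a hypothetical name. One plausible use, stated here as an assumption, is to score candidate anomaly thresholds against labeled normal and abnormal data and keep the one with the highest MCC.

```java
/** Sketch: Matthews correlation coefficient from confusion-matrix counts. */
final class MccSketch {

    /**
     * MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
     * Returns 0 when any factor of the denominator is zero, a common convention.
     */
    static double mcc(long tp, long tn, long fp, long fn) {
        double numerator = (double) tp * tn - (double) fp * fn;
        double denominator = Math.sqrt(
                (double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }
}
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and it stays informative when anomalies are rare, which is the usual case for metric data.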
  • 16. DAM: Data Activity Monitoring. Secure Hadoop in real time. Security Use Cases; Security Architecture Overview; Security Component Highlights; Security Machine Learning Integration.
  • 17. Security Use Cases. Data Loss Prevention: get alerted and stop a malicious user trying to copy, delete or move sensitive data from the Hadoop cluster. Malicious Logins: detect logins when a malicious user tries to guess a password; Eagle creates user profiles using a machine learning algorithm to detect anomalies. Unauthorized Access: detect and stop a malicious user trying to access classified data without privilege. Malicious User Operation: detect and stop a malicious user trying to delete a large amount of data; operation type is one parameter of Eagle user profiles, and Eagle supports multiple native operation types.
  • 19. Security Component Highlights. Policy Manager: expressive language to create and modify policies for alerting and remediation on certain data activity monitoring events. Data Classification: integrates with Dataguise & Apache Ranger. Policy-based Remediation: ability to detect and stop a threat, improve operational efficiencies, and reduce regulatory compliance costs. User Profiling: based on machine learning, automatically generates anomaly detection policies. User Activity Exploration: ability to drill down into alert details to understand the data security threat.
  • 20. Security Machine Learning Integration • User Activity Profiling • Offline: determine the kernel density function parameters (the KDE bandwidth) from the training dataset • Online: if a test data point lies outside the trained bandwidth, it is an anomaly (Policy). Figures: PCs (principal components) in EVD (eigenvalue decomposition); kernel density function.
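The offline/online split on slide 20 can be illustrated with a one-dimensional kernel density estimate. This is a simplified sketch, not Eagle's user-profiling algorithm (which the slides describe as KDE combined with EVD over user activity features). Assumptions made here: the class `KdeProfileSketch` is hypothetical, a Gaussian kernel is used, the bandwidth comes from the normal-reference rule of thumb h = 1.06·σ·n^(-1/5), and "outside the trained bandwidth" is interpreted as "estimated density below the lowest density seen at any training point".

```java
/** Sketch: 1-D Gaussian KDE trained offline, used online as an anomaly test. */
final class KdeProfileSketch {
    private final double[] train;
    private final double bandwidth;
    private final double threshold;

    /** Offline step: fit the bandwidth and a density threshold from training values. */
    KdeProfileSketch(double[] trainingValues) {
        this.train = trainingValues.clone();
        this.bandwidth = ruleOfThumbBandwidth(this.train);
        double min = Double.POSITIVE_INFINITY;
        for (double x : this.train) min = Math.min(min, density(x));
        this.threshold = min; // assumption: lowest density among training points
    }

    /** Normal-reference rule of thumb; falls back to 1.0 when the data has no spread. */
    static double ruleOfThumbBandwidth(double[] x) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double sq = 0.0;
        for (double v : x) sq += (v - mean) * (v - mean);
        double sd = n > 1 ? Math.sqrt(sq / (n - 1)) : 0.0;
        double h = 1.06 * sd * Math.pow(n, -0.2);
        return h > 0.0 ? h : 1.0;
    }

    /** Gaussian kernel density estimate at x. */
    double density(double x) {
        double sum = 0.0;
        for (double t : train) {
            double u = (x - t) / bandwidth;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (train.length * bandwidth * Math.sqrt(2.0 * Math.PI));
    }

    /** Online step: a point is anomalous when its density falls below the threshold. */
    boolean isAnomaly(double x) {
        return density(x) < threshold;
    }

    double bandwidth() {
        return bandwidth;
    }
}
```

In a deployment shaped like the one in the talk, the fitted bandwidth and threshold would be the persisted model, and `isAnomaly` would run per event in the streaming job.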
  • 21. Security Machine Learning Integration • User Activity Profiling on Spark. Diagram labels: Offline training on Spark: Historical Audit Events (archived data, HDFS); Batch Preprocess; User Profile Model Generation (KDE + EVD Algorithm); Persist model; Eagle Storage. Online detection on Storm: Real-time Audit Events (real-time stream, Kafka); Stream Preprocess; Policy Engine; Dynamically load models & policies; Alert Consumer; Persist alert. Eagle Security Plugins.
  • 22. Eagle Monitoring Framework: full-stack real-time monitoring framework. Eagle = Eagle Framework + Eagle Apps
  • 23. Monitoring Programming Paradigm • Data collector -> data processing -> metric pre-aggregation/alert engine -> storage -> dashboards • We need to create a framework that covers the full stack of a monitoring system
  • 25. Eagle Monitoring Framework Highlights. Eagle = Eagle Framework + Eagle Apps. Lightweight Streaming Process Framework; Extensible & Scalable Policy Framework; Eagle Query Framework; Customizable Dashboards.
  • 26. Eagle Stream Data Processing API
    Step 1: Task DAG graph setup
      @Override
      protected void buildDependency(FlowDef def, DataProcessConfig config) {
          // Source task of the flow
          Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
          // Consumes the output of header
          Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild();
          // Consumes the output of uppertask
          Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild();
          // Mark the last task of the flow
          def.endBy(groupbyUppercaseTask);
      }
    Step 2: Inter-task data exchange protocol
      @Override
      protected void buildDependency(FlowDef def, DataProcessConfig config) {
          Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
          Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild();
          Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild();
          def.endBy(groupbyUppercaseTask);
      }
  • 27. Execution Graph: development, compile and deploy. Development / Compile Phase; Deployment / Runtime Phase.
  • 28. Extensible & Scalable Policy Framework. Usability • Declarative policy definition syntax • Stream metadata (event attribute name, attribute type, attribute value resolver, …). Scalability • Dynamic policy partitioning across compute nodes based on a configurable partition class • Dynamic policy deployment • Event partitioning by Storm and policy partitioning by Eagle (N events * M policies). Extensibility • Supports new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
  • 29. Usability of Policy Framework. Case: HBase region server high call queue length. Policy: in the past 30 minutes, call queue length > 2000 was reported at least 20 times by the same host.
      from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min)
      select host, value, avg(value) as avgValue, count(*) as count
      group by host
      having count >= 20
      insert into HighRegionServerCallQueueLengthStream;
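To make the semantics of the slide-29 policy concrete, here is a plain-Java sketch of the same rule: per host, count events with value > 2000 inside a 30-minute window and fire once the count reaches 20. It is an illustration only and does not use Siddhi or Eagle APIs; `CallQueuePolicySketch` and its constructor parameters are hypothetical, and events are assumed to arrive in timestamp order for each host.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sketch: sliding time-window count per host, mirroring "filter, window, group by, having". */
final class CallQueuePolicySketch {
    private final long windowMillis;
    private final double valueThreshold;
    private final int countThreshold;
    // Timestamps of matching events per host, oldest first.
    private final Map<String, Deque<Long>> hitsByHost = new HashMap<>();

    CallQueuePolicySketch(long windowMillis, double valueThreshold, int countThreshold) {
        this.windowMillis = windowMillis;
        this.valueThreshold = valueThreshold;
        this.countThreshold = countThreshold;
    }

    /** Returns true when this host currently satisfies the policy. */
    boolean onEvent(String host, long timestampMillis, double value) {
        Deque<Long> hits = hitsByHost.computeIfAbsent(host, h -> new ArrayDeque<>());
        // Drop matches that have fallen out of the time window.
        long cutoff = timestampMillis - windowMillis;
        while (!hits.isEmpty() && hits.peekFirst() <= cutoff) {
            hits.pollFirst();
        }
        // Filter step: only values above the threshold are counted.
        if (value > valueThreshold) {
            hits.addLast(timestampMillis);
        }
        // "having count >= N" step.
        return hits.size() >= countThreshold;
    }
}
```

Constructed as `new CallQueuePolicySketch(30L * 60 * 1000, 2000.0, 20)`, the object tracks the values from the slide. The declarative query expresses the same logic in five lines, which is the usability argument the slide is making.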
  • 30. Scalability of Policy Evaluation. Dynamic Policy Partition • N users with 3 partitions, M policies with 2 partitions, then 3*2 physical tasks • Physical partition + policy-level partition
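The 3*2 arithmetic on slide 30 can be sketched as a mapping from a (user, policy) pair to one physical task. The slides say the partition class is configurable but do not show it, so the hash-based partitioner below is an assumption, as are the names `PolicyPartitionSketch`, `partition`, `taskIndex` and `physicalTasks`.

```java
/** Sketch: combine an event partition and a policy partition into one physical task index. */
final class PolicyPartitionSketch {

    /** Hash partitioner (an assumption; Eagle's partition class is configurable). */
    static int partition(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Number of physical tasks: event partitions times policy partitions. */
    static int physicalTasks(int eventPartitions, int policyPartitions) {
        return eventPartitions * policyPartitions;
    }

    /** Index of the task that evaluates the given policy for the given user's events. */
    static int taskIndex(String user, String policyId,
                         int eventPartitions, int policyPartitions) {
        int eventPart = partition(user, eventPartitions);
        int policyPart = partition(policyId, policyPartitions);
        return eventPart * policyPartitions + policyPart;
    }
}
```

With 3 event partitions and 2 policy partitions, `taskIndex` always lands in the range 0 to 5, so each task sees roughly a third of the users and evaluates roughly half of the policies instead of all N * M combinations.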
  • 31. Extensibility of Policy Framework. The Policy Evaluator Provider uses SPI to register policy engine implementations:
      public interface PolicyEvaluatorServiceProvider {
          public String getPolicyType();
          public Class<? extends PolicyEvaluator> getPolicyEvaluator();
          public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
          public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
          public List<Module> getBindingModules();
      }
    Built-in Supported Policy Engines • Siddhi Complex Event Processing Engine • Machine Learning based Policy Engine
  • 32. Eagle Query Framework: the lightweight metadata-driven store layer that serves the commonly shared storage & query requirements of most monitoring systems. Persistence • Metric • Event • Metadata • Alert • Log • Customized Structure • … Query • Search • Filter • Aggregation • Sort • Expression • …
  • 33. Customizable Dashboard: provides real-time interactive visualization and analytics capability supporting a variety of data sources like Eagle, Druid and so on. • Interactive: IPython notebook-like interactive visualization, analysis and troubleshooting. • Dashboard: customizable dashboard layout and drill-down path, persist and share.
  • 34. Eagle in Future: the general monitoring platform for large-scale systems at eBay
  • 35. Open Source. First Use Case: Eagle to secure Hadoop in real time, based on the Eagle framework. External Partners: Hortonworks, Dataguise, PayPal and Apache Ranger. Following Components to Open Source: JPA (“Job Performance Analyzer”), HBase and GC monitoring and so on will be open sourced soon.
  • 36. Reference. Eagle at Hadoop Summit 2015, San Jose: http://2015.hadoopsummit.org (Slides | Video). Eagle at Big Data Summit 2014, Shanghai: http://2014ebay.csdn.net/m/zone/ebay_en (Slides | Video).
  • 37. The End & Thanks. If you want to go fast, go alone. If you want to go far, go together. -- African Proverb. Hao Chen, hchen9@ebay.com | @haozch
  • 38. We are Hiring Now: https://careers.ebayinc.com Or contact me: hchen9@ebay.com

Editor's Notes

  1. Anomaly detection algorithm: continuously crawl job history files immediately after each job completes, and calculate the minute-level job failure ratio for each node. A node is identified as anomalous when either of the following two conditions holds: (a) the node continuously fails the tasks that run on it; (b) the node has a significantly higher failure ratio than the rest of the nodes in the cluster.
  4. Inspired by TSDB, Ganglia, Nagios, Zabbix, etc. Most of them focus on infrastructure-level data collection and alerting, but they don’t consider business logic complexity, that is, how to prepare the data.
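  11. IG: Information Gain. A concept from probability theory and information theory; it is asymmetric and measures the difference between two probability distributions P and Q. Information gain describes the difference between encoding with Q and then encoding with P. P usually represents the distribution of samples or observations, or a precisely computed theoretical distribution; Q represents a theory, model, description or approximation of P. Purpose: feature selection. PCA: Principal Component Analysis. A multivariate statistical method that applies a linear transformation to many variables in order to select a smaller number of important variables. http://baike.baidu.com/view/45376.htm?fromtitle=principal+component+analysis&type=syn GMM: Gaussian Mixture Model, also abbreviated MOG (Mixture of Gaussians). It quantifies things precisely with Gaussian probability density functions (normal distribution curves), decomposing a thing into several models built on Gaussian probability density functions. http://baike.baidu.com/view/3767607.htm MCC: Matthews correlation coefficient, http://baike.baidu.com/view/3767607.html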
  13. Data loss prevention: get alerted and stop a malicious user trying to copy, delete or move sensitive data from the Hadoop cluster. Malicious logins: detect logins when a malicious user tries to guess a password. Eagle creates user profiles using a machine learning algorithm to detect anomalies. This anomaly detection, together with the policy for user logins, would trigger an alert and block the user from accessing sensitive datasets. Unauthorized access: detect and stop a malicious user trying to access classified data without privilege. Unauthorized access is one factor of Eagle user profiles, and the machine learning algorithm will detect the anomaly. This anomaly detection, together with the policy on user access to classified data without authorization, would trigger an alert to the user's manager. Malicious user operation: detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types. Dataguise delivers data privacy protection and risk assessment analytics that allow organizations to safely leverage and share enterprise data. Our solutions simplify governance as they proactively locate sensitive data, automatically protect it with appropriate remediation policies, and provide actionable compliance intelligence to decision makers, in real time. In Hadoop deployments, our solutions inspect incoming data and protect it before it is stored. These capabilities simplify risk management, improve operational efficiencies, and reduce regulatory compliance costs.
  16. Histogram density estimation. Kernel density estimation (KDE). EVD: eigenvalue decomposition, a matrix technique from linear algebra, http://www.stats.ox.ac.uk/~sejdinov/teaching/HT15_lecture2-nup.pdf Gaussian mixture models are used more for classification, while Parzen-window and other KDE methods are used more for probability density estimation. http://blog.sina.com.cn/s/blog_6923201d01010tjo.html
  19. As a framework, Eagle does not assume: the data source (where, what); the business logic execution path (how); the policy engine implementation (how); the data sink (where, what). As a framework, Eagle provides the following: a SQL-like service API; a high-performing query framework; a lightweight stream processing Java API; an extensible policy engine implementation; scalable and distributed rule evaluation; metadata-driven stream processing; data source extensibility; data sink extensibility; an interactive dashboard.
  21. Supports syntax: Search, Aggregate, Time Series Histogram, Expression, Filter, Pagination. Metadata definition ORM. High-performance RESTful API. SQL-like declarative query syntax. Supports HBase and RDBMS as storage. Logical partitioning by tags defined in annotations. Co-processor support. Secondary index support. Generic service client library.
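  22. Inside eBay, as more and more large distributed systems are deployed on the enterprise platform, the need for monitoring large-scale distributed systems is especially strong. Eagle will build on the Eagle framework as its core foundation, steadily growing the Eagle Apps ecosystem by incorporating the characteristics of each business logic, while continuing to optimize the core framework itself. We also believe that not only eBay but most enterprise platforms run into the same problems when deploying and maintaining these large distributed systems: the larger the cluster, the greater the monitoring challenges in every respect, and the more Eagle's strengths in monitoring large distributed systems stand out. We have always looked forward to exchanging ideas and discussing these topics with everyone, so, as a starting point, we will release Eagle's code as open source. On the one hand, eBay's work on monitoring large distributed systems can help, or serve as a reference for, companies that need to solve similar problems. On the other hand, we hope to receive feedback from the industry and to have in-depth exchanges about our approach, from which we can also benefit. Everyone could even work together to build an open source monitoring platform aimed at large distributed systems.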