SlideShare a Scribd company logo
1 of 38
Apache in Action
Secure Hadoop in Real-time
Hao Chen / 陈浩 / hao@apache.org / @haozch
Who is the Guy
2
Co-creator, Committer and PMC @ Apache Eagle
hao@apache.org
Hao Chen / 陈浩
Sr. Software Engineer @ eBay Cloud Service
hchen9@ebay.com
Speaker @ Hadoop Summit (SJC, SHA, BJ) ...
http://people.apache.org/~hao
Agenda
3
• About Eagle
• Architecture
• Ecosystem
• Q & A
What’s Apache Eagle
4
Apache Eagle is a distributed real-time monitoring and
alerting engine for hadoop from eBay
Open sourced as Apache Incubator Project on Oct 26th 2015
Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access
to sensitive data, recognize attacks/ malicious activity and block access in real time.
See http://eagle.incubator.apache.org or http://goeagle.io
Apache Eagle History
Donated to Apache Software Foundation (ASF) from eBay at Oct 26th, 2015
5
Dec 2013 Oct 23 2015 Oct 26 2015
Hadoop Eagle
Project Initiative
Apache Incubator
eagle.incubator.apache.org
Github Open Source
github.com/apache/incubator-eagle
Hadoop Eagle
Production Release
May 2014
Why build Apache Eagle
6
Eagle was initialized by end of 2013 for hadoop ecosystem monitoring as any
existing tool like zabbix, ganglia can not handle the huge volume of metrics/logs
generated by hadoop system in eBay.
2013/2014
10,000 nodes
150,000+ cores
170 PB
2000+ user
3000+ nodes
10,000+ cores
50+ PB
2012
2011
1000+ nodes
10,000+ cores
10+ PB
100+ nodes
1000 + cores
1 PB
2010
2009
50+ nodes
2007
1-10 nodes
Hadoop Data
• Security
• Activity
Hadoop Platform
• Heath
• Availability
• Performance
Hadoop @ eBay Inc
Apache Eagle @ eBay
7
7 CLUSTERS
7427 NODES
160 PB DATA
10 B+ EVENTS / DAY
500+ METRIC TYPES
50,000+ JOBS / DAY
50,000,000+ TASKS / DAY
MONITOR
PROCESS
Agenda
8
• About Eagle
• Architecture
• Ecosystem
• Q & A
Apache Eagle Architecture Overview
9
Scalable
Scales to monitor thousands of policies and
billions of access events
Machine Learning
Create dynamic user profiles based on
user behavior
Real-time
Generates alerts in real time and blocks
users with malicious intent
Extensible
Eagle can be easily extended to monitor
other data sources
Apache Eagle Architecture Overview
10
STREAM PROCESSING
ENGINE
DataCollector
Kafka
HDFS, Audit, Security
METADATA MANAGER
DATASTORES
REMEDIATION ENGINE
Apache
Ranger
MACHINE LEARNING
MODULE
Custom
module
Actionable Alerts
Activities
Actionable
Alerts
PolicyThresholdsUser properties
MLThresholds
Real Time Alert
Dashboard
Security Analyst
Admin Console
Security Engineer
Insights
Metadata
Management
MACHINE LEARNING TRAINING MODULE
Policy Engine
Apache Eagle Architecture Features
11
• Real-time Data Collection
• Distributed Policy Engine
• Stream Processing DSL
• Scalable Data Storage & Query
• Machine Learning Intergration
NOTE {NAME}-{NUMBER} like HDFS-6914 means open source project ticket id contributed by us
Apache Eagle - Data Collection
12
Decoupling with Message Bus
• Apache Kafka: high-throughput distributed
messaging
• Partition: balance between logic and throughput
Cross-Platform Integration
• Community Kafka Client (18+)
• Python/Go/C/C++/JAVA ..
• Enhanced Log4j-kafka
• KAFKA-2041: Extensible Partition Key
• KAFKA-2077: Advanced Topic Selector
Apache Eagle - Data Collection
13
Availability: Filebeat + Logstash
Logstash-1
Logstash-2
Logstash-…
Logstash-
Ligh-weight collector (golang) with daemon Logstash instances cluster
Shuffle Grouping
KafkaField Grouping
Distributed Message Bus
Resource consumption balance
Message throughput balance ( LOGSTASH-179)
Storm Spout: Distributed crawling for hadoop job, node jmx and service logs, etc.
Zookeeper: Centralized state management and distributed locking
Apache Eagle - Data Collection
14
Scalability: Distributed Real-time Ingestion
Zookeeper
METRIC
Centralized State Management
JOB EVENT LOG
Apache Eagle - Distributed Real-time Policy Engine
15
METADATA MANAGER
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Real Time
Alerts
Alerts
Policy
Management
Policy
Dynamical Policy Deployment
Real-time
Event Stream
Stream_{1}
Stream_{*}
Dynamical Stream Schema
Stream
Processing
Highlights
• Real-time
• Usability
• Scalability
• Extensibility
• Metadata-driven
Apache Eagle - Distributed Real-time Policy Engine
16
METADATA MANAGER
Real Time
Alerts
Alerts
Policy
Management
Policy
Event Stream
(Kafka)
Dynamical Stream Schema
Dynamical Policy Deployment
Real-time
• Kafka-based Distributed
Message Bus (Extensible)
• Storm-based Real-time
Execution Environment
(Extensible)
• Stream events are
processed and alerts are
evaluated during
streaming
Apache Eagle - Distributed Real-time Policy Engine
17
METADATA MANAGER
Distributed Streaming Cluster Environment
Real Time
Alerts
Alerts
Policy
Management
Policy
Dynamical Policy Deployment
Usability
• Powerful SQL-Like CEP CQL for
Policy Definition
• Dynamical Poilcy Metadata Lifecycle
Management (Deployment/Update)
• Easy-to-use Policy management and
Alert analytics UI
from metricStream[(name == 'ReplLag')
and (value > 1000)] select * insert into
outputStream;
Apache Eagle - Distributed Real-time Policy Engine
18
Full-function Streaming CEP CQL: Siddhi on Storm by default
hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min) select user,
count(timestamp) as aggValue group by user having aggValue >= 5 insert into outputStream;
• Filter
• Join
• Aggregation: Avg, Sum , Min, Max, etc
• Group by
• Having
• Stream handlers for window: TimeWindow, Batch Window, Length Window
• Conditions and Expressions: and, or, not, ==,!=, >=, >, <=, <, and arithmetic operations
• Pattern processing
• Sequence processing
• Event Tables: intergrate historical data in realtime processing
• SQL-Like Query: Query, Stream Definition and Query Plan compilation
Apache Eagle - Distributed Real-time Policy Engine
19
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Stream_{1}
Stream_{*}
Stream
Processing
Scalability: dynamic policy partition by {event} * {policy}
• N Users with 3 partitions, M policies with 2 partitions, then 3*2 physical tasks
• Physical partition + policy-level partition
Apache Eagle - Distributed Real-time Policy Engine
20
Distributed Streaming Partition Problem
https://en.wikipedia.org/wiki/Partition_problem
S = {3,1,1,2,2,1,1}
S1 = {1,1,1,1,1}
S2 = {2,2}
S3 = {3}
Apache Eagle - Distributed Real-time Policy Engine
21
Distributed Streaming Partition Strategy
groupBy[ GreedyStrategy ]((_.key1,_.key2 ))
HBase
Key Distribution Statistics
(Online/Offline)
Realtime Partition
Strategy
Key Statistics Cache
Async
Strategy
• Greedy (Online/Offline)
• PoTC
• PKG
• Hashing
Apache Eagle - Distributed Real-time Policy Engine
22
Distributed Real-time Policy Engine
Siddhi CEP
Policy
Evaluator
Machine
Learning Policy
Evaluator
Extensibility
• Support WSO2 Siddhi CEP as first class
• Extensible policy engine implementation
• Extensible policy lifecycle management
Extensible
Policy Evaluator
public interface PolicyEvaluatorServiceProvider {
public String getPolicyType(); // literal string to identify one type of policy
public Class getPolicyEvaluator(); // get policy evaluator implementation
public List getBindingModules(); // policy text with json format to object mapping
}
public interface PolicyEvaluator {
public void evaluate(ValuesArray input) throws Exception; // evaluate input event
public void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef);// policy update
public void onPolicyDelete(); // invoked when policy is deleted
}
METADATA MANAGER
Policy/Metadata
Apache Eagle - Distributed Real-time Policy Engine
23
Metadata-Driven
• Stream Schema: AlertStreamSchemaEntity
• Policy Definition: AlertDefinitionAPIEntity
• Central metadata management
• Dynamic metadata deployment
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
@Column("a")
private String desc;
@Column("b")
private String policyDef;
@Column("c")
private String dedupeDef;
METADATA MANAGER
Distributed Real-time Policy Engine
Dynamic Metadata Loading
.flatMap(AuditLogTransformer)
.groupBy(_.user)
.flatMap(UserProfileAggregator);
Apache Eagle - Fluent Stream Processing DSL
24
env.fromKafka (KafkaConfig)
.alert.persistAndEmail
val env = ExecutionEnvironment.getStorm()
env.execute();
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Alerts
Real-time
Event Stream
Stream_{1}
Stream_{*}
Stream
Processing
env.execute()
Apache Eagle - Fluent Stream Processing DSL
25
• Physical execution platform independent
• Easily assemble data transformation, filtering,
join and alerting DAG in fluent way
• DAG rewrite and optimization
• StreamUnionExpansion
• StreamGroupbyExpansion
• StreamNameExpansion
• StreamAlertExpansion
• StreamParallelismConfigExpansion
trait StreamProducer{
filter
flatMap
map{1,2,3,4}
groupBy
streamUnion // stream join is hard, not implemented for storm
alertWithConsumer
}
StormExecutionEnvironment env =
ExecutionEnvironmentFactory.getStorm(config);
env.newSource(new
KafkaSourcedSpoutProvider().getSpout(config)).renameOutputFields(1)
.flatMap(new AuditLogTransformer())
.groupBy(0)
.flatMap(new UserProfileAggregatorExecutor());
.alertWithConsumer(“userActivity“,”userProfileExecutor“)
env.execute();
Optimizer
1. Development 2. Optimization 3. Compile to native app
Apache Eagle - Scalable Data Storage and Query
26
• Entity Metadata on large-scale NoSQL
storage like HBase
• Full-function SQL-Like REST Query
• Optimized rowkey design for time-series
monitoring data
• HBase Coprocessor
• Secondary Index
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
@Column("a")
private String desc;
@Column("b")
private String policyDef;
@Column("c")
private String dedupeDef;
query=
AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
Uniform rowkey design
• Metric
• Entity
• Log
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
Rowvalue ::= Log Content
27
Apache Eagle – Uniform HBase Rowkey Design
Apache Eagle - Machine Learning Intergration
28
29
User Activity Profiling
Offline: Determine bandwidth from training dataset the kernel density function parameters (KDE)
Online: If a test data point lies outside the trained bandwidth, it is anomaly (Policy)
PCs(Principle Components) in EVD
(Eigenvalue Value Decomposition)
Kernel Density Function
Apache Eagle – User/System Activity Profiling
30
Anomaly Metric Predictive Detection
Offline: Analyzing and combining 500+ metrics together for causal anomaly detections (IG -> PCA ->
GMM -> MCC)
Online: Predictively alert for anomaly metrics
Normal (Green) and Abnormal (Red)
Data and Probability Distribution and Threshold Selection
PCA (Principal Component Analysis)
Apache Eagle - Anomaly Metric Predictive Detection
Anomaly Metric Predictive Detection Case Study
Agenda
• About Apache Eagle
• Architecture
• Ecosystem
• Q & A
31
32
Apps
 Security
 Hadoop
 Cloud
 Database
Interface
 Web Portal
 REST Services
 Analytics Visualization
Integration
 Ambari
 Docker
 Ranger
 Dataguise
Eagle Framework
Distributed real-time framework for efficiently
developing highly scalable monitoring applications
Eagle Apps
Securiy / Hadoop / Cloud / Database
Eagle Interface
REST Service / Management UI / Customizable Analytics
Visualization
Eagle Integration
Ambari / Docker / Ranger / Dataguise
Open Source
Community-driven and Cross-community cooperation
Eagle
Framework
Apache Eagle Ecosystem
33
Apache Eagle Ecosystem - Security
How to Secure Hadoop in Realtime?
• Apache Eagle
• Apache Ranger
• Apache Knox
• Dataguise
34
Apache Eagle Ecosystem - Hadoop
Eagle in Apache Amabri: natively be part of hadoop ecosystem
35
Apache Eagle Ecosystem - Docker
Eagle in Docker: natively fly on Cloud/Container
STORM
KAFKA
ZOOKEEPER
HBASE
HADOOP
…
Powered of
git clone apache/incubator-eagle
eagle-docker
15 + 1 = 1
docker pull apacheeagle
36
Apache Eagle Ecosystem - Open Source
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Learn more about Apache Eagle
37
• EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER
(IEEE)
• EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP
CLUSTER
Q & A
apache/incubator-eagle
@TheApacheEagle
@ApacheEagle
http://eagle.incubator.apache.org
The slide is licensed under Creative Commons Attribution 4.0 International license.

More Related Content

What's hot

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...DataWorks Summit
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillHenry Saputra
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieDataWorks Summit/Hadoop Summit
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataDataWorks Summit
 
Security From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveSecurity From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveAll Things Open
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesDataWorks Summit
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in RealtimeDataWorks Summit
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 

What's hot (20)

Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Future of Apache Storm
Future of Apache StormFuture of Apache Storm
Future of Apache Storm
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
 
Security From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveSecurity From The Big Data and Analytics Perspective
Security From The Big Data and Analytics Perspective
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 

Viewers also liked

Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceMing Ma
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesIsheeta Sanghi
 
あいにきて IoT
あいにきて IoTあいにきて IoT
あいにきて IoTYuki Higuchi
 
2do boletin emancipacion de la mujer
2do boletin   emancipacion de la mujer2do boletin   emancipacion de la mujer
2do boletin emancipacion de la mujerColectivo chamampi
 
Walden3 twin slideshare 01
Walden3 twin slideshare 01Walden3 twin slideshare 01
Walden3 twin slideshare 01Avi Dey
 
Science and Nature Portfolio
Science and Nature PortfolioScience and Nature Portfolio
Science and Nature Portfolioian cuming
 
9789740333616
97897403336169789740333616
9789740333616CUPress
 
Afl presentation
Afl presentationAfl presentation
Afl presentationannacb19
 

Viewers also liked (13)

Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
MSII service global
MSII service globalMSII service global
MSII service global
 
Leanforms folder panterra
Leanforms folder panterraLeanforms folder panterra
Leanforms folder panterra
 
あいにきて IoT
あいにきて IoTあいにきて IoT
あいにきて IoT
 
2do boletin emancipacion de la mujer
2do boletin   emancipacion de la mujer2do boletin   emancipacion de la mujer
2do boletin emancipacion de la mujer
 
Walden3 twin slideshare 01
Walden3 twin slideshare 01Walden3 twin slideshare 01
Walden3 twin slideshare 01
 
Science and Nature Portfolio
Science and Nature PortfolioScience and Nature Portfolio
Science and Nature Portfolio
 
9789740333616
97897403336169789740333616
9789740333616
 
Afl presentation
Afl presentationAfl presentation
Afl presentation
 

Similar to Apache Eagle in Action

Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Apache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementApache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementHao Chen
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 
MLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionMLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionFabian Hadiji
 
Security and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseSecurity and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseAmazon Web Services
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovSoftServe
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis PlatformValentin Kropov
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesThanigai Vellore
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementAndreas Schreiber
 
ThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptxThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptxGrace Jansen
 
MuleSoft Meetup Roma - Processi di Automazione su CloudHub
MuleSoft Meetup Roma - Processi di Automazione su CloudHubMuleSoft Meetup Roma - Processi di Automazione su CloudHub
MuleSoft Meetup Roma - Processi di Automazione su CloudHubAlfonso Martino
 
Build an AI/ML-driven image archive processing workflow: Image archive, analy...
Build an AI/ML-driven image archive processing workflow: Image archive, analy...Build an AI/ML-driven image archive processing workflow: Image archive, analy...
Build an AI/ML-driven image archive processing workflow: Image archive, analy...wesley chun
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsDamien Dallimore
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrTimothy Spann
 
Prototyping applications with heroku and elasticsearch
 Prototyping applications with heroku and elasticsearch Prototyping applications with heroku and elasticsearch
Prototyping applications with heroku and elasticsearchprotofy
 

Similar to Apache Eagle in Action (20)

Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Apache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementApache Eagle Architecture Evolvement
Apache Eagle Architecture Evolvement
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
MLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionMLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to production
 
Security and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseSecurity and Advanced Automation in the Enterprise
Security and Advanced Automation in the Enterprise
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin Kropov
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis Platform
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot Architectures
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
 
ThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptxThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptx
 
MuleSoft Meetup Roma - Processi di Automazione su CloudHub
MuleSoft Meetup Roma - Processi di Automazione su CloudHubMuleSoft Meetup Roma - Processi di Automazione su CloudHub
MuleSoft Meetup Roma - Processi di Automazione su CloudHub
 
Build an AI/ML-driven image archive processing workflow: Image archive, analy...
Build an AI/ML-driven image archive processing workflow: Image archive, analy...Build an AI/ML-driven image archive processing workflow: Image archive, analy...
Build an AI/ML-driven image archive processing workflow: Image archive, analy...
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Prototyping applications with heroku and elasticsearch
 Prototyping applications with heroku and elasticsearch Prototyping applications with heroku and elasticsearch
Prototyping applications with heroku and elasticsearch
 

Recently uploaded

RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRachelAnnTenibroAmaz
 
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...Henrik Hanke
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...漢銘 謝
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Escort Service
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
Early Modern Spain. All about this period
Early Modern Spain. All about this periodEarly Modern Spain. All about this period
Early Modern Spain. All about this periodSaraIsabelJimenez
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxaryanv1753
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSebastiano Panichella
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸mathanramanathan2005
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxnoorehahmad
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Krijn Poppe
 
miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxCarrieButtitta
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEMCharmi13
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.KathleenAnnCordero2
 
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC  - NANOTECHNOLOGYPHYSICS PROJECT BY MSC  - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC - NANOTECHNOLOGYpruthirajnayak525
 
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationNathan Young
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...marjmae69
 
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comsaastr
 

Recently uploaded (20)

RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
 
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
DGT @ CTAC 2024 Valencia: Most crucial invest to digitalisation_Sven Zoelle_v...
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
 
Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptx
 
Early Modern Spain. All about this period
Early Modern Spain. All about this periodEarly Modern Spain. All about this period
Early Modern Spain. All about this period
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptx
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation Track
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
 
miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptx
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEM
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
 
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC  - NANOTECHNOLOGYPHYSICS PROJECT BY MSC  - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
 
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism Presentation
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
 
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.comSaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
SaaStr Workshop Wednesday w/ Kyle Norton, Owner.com
 

Apache Eagle in Action

  • 1. Apache in Action Secure Hadoop in Real-time Hao Chen / 陈浩 / hao@apache.org / @haozch
  • 2. Who is the Guy 2 Co-creator, Committer and PMC @ Apache Eagle hao@apache.org Hao Chen / 陈浩 Sr. Software Engineer @ eBay Cloud Service hchen9@ebay.com Speaker @ Hadoop Summit (SJC, SHA, BJ) ... http://people.apache.org/~hao
  • 3. Agenda 3 • About Eagle • Architecture • Ecosystem • Q & A
  • 4. What’s Apache Eagle 4 Apache Eagle is a distributed real-time monitoring and alerting engine for hadoop from eBay Open sourced as Apache Incubator Project on Oct 26th 2015 Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks/ malicious activity and block access in real time. See http://eagle.incubator.apache.org or http://goeagle.io
  • 5. Apache Eagle History Donated to Apache Software Foundation (ASF) from eBay at Oct 26th, 2015 5 Dec 2013 Oct 23 2015 Oct 26 2015 Hadoop Eagle Project Initiative Apache Incubator eagle.incubator.apache.org Github Open Source github.com/apache/incubator-eagle Hadoop Eagle Production Release May 2014
  • 6. Why build Apache Eagle 6 Eagle was initialized by end of 2013 for hadoop ecosystem monitoring as any existing tool like zabbix, ganglia can not handle the huge volume of metrics/logs generated by hadoop system in eBay. 2013/2014 10,000 nodes 150,000+ cores 170 PB 2000+ user 3000+ nodes 10,000+ cores 50+ PB 2012 2011 1000+ nodes 10,000+ cores 10+ PB 100+ nodes 1000 + cores 1 PB 2010 2009 50+ nodes 2007 1-10 nodes Hadoop Data • Security • Activity Hadoop Platform • Heath • Availability • Performance Hadoop @ eBay Inc
  • 7. Apache Eagle @ eBay 7 7 CLUSTERS 7427 NODES 160 PB DATA 10 B+ EVENTS / DAY 500+ METRIC TYPES 50,000+ JOBS / DAY 50,000,000+ TASKS / DAY MONITOR PROCESS
  • 8. Agenda 8 • About Eagle • Architecture • Ecosystem • Q & A
  • 9. Apache Eagle Architecture Overview 9 Scalable Scales to monitor thousands of policies and billions of access events Machine Learning Create dynamic user profiles based on user behavior Real-time Generates alerts in real time and blocks users with malicious intent Extensible Eagle can be easily extended to monitor other data sources
  • 10. Apache Eagle Architecture Overview 10 STREAM PROCESSING ENGINE DataCollector Kafka HDFS, Audit, Security METADATA MANAGER DATASTORES REMEDIATION ENGINE Apache Ranger MACHINE LEARNING MODULE Custom module Actionable Alerts Activities Actionable Alerts PolicyThresholdsUser properties MLThresholds Real Time Alert Dashboard Security Analyst Admin Console Security Engineer Insights Metadata Management MACHINE LEARNING TRAINING MODULE Policy Engine
  • 11. Apache Eagle Architecture Features 11 • Real-time Data Collection • Distributed Policy Engine • Stream Processing DSL • Scalable Data Storage & Query • Machine Learning Intergration NOTE {NAME}-{NUMBER} like HDFS-6914 means open source project ticket id contributed by us
  • 12. Apache Eagle - Data Collection 12 Decoupling with Message Bus • Apache Kafka: high-throughput distributed messaging • Partition: balance between logic and throughput Cross-Platform Integration • Community Kafka Client (18+) • Python/Go/C/C++/JAVA .. • Enhanced Log4j-kafka • KAFKA-2041: Extensible Partition Key • KAFKA-2077: Advanced Topic Selector
  • 13. Apache Eagle - Data Collection 13 Availability: Filebeat + Logstash Logstash-1 Logstash-2 Logstash-… Logstash- Ligh-weight collector (golang) with daemon Logstash instances cluster Shuffle Grouping KafkaField Grouping Distributed Message Bus Resource consumption balance Message throughput balance ( LOGSTASH-179)
  • 14. Storm Spout: Distributed crawling for hadoop job, node jmx and service logs, etc. Zookeeper: Centralized state management and distributed locking Apache Eagle - Data Collection 14 Scalability: Distributed Real-time Ingestion Zookeeper METRIC Centralized State Management JOB EVENT LOG
  • 15. Apache Eagle - Distributed Real-time Policy Engine 15 METADATA MANAGER Distributed Streaming Cluster Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Real Time Alerts Alerts Policy Management Policy Dynamical Policy Deployment Real-time Event Stream Stream_{1} Stream_{*} Dynamical Stream Schema Stream Processing Highlights • Real-time • Usability • Scalability • Extensibility • Metadata-driven
  • 16. Apache Eagle - Distributed Real-time Policy Engine 16 METADATA MANAGER Real Time Alerts Alerts Policy Management Policy Event Stream (Kafka) Dynamical Stream Schema Dynamical Policy Deployment Real-time • Kafka-based Distributed Message Bus (Extensible) • Storm-based Real-time Execution Environment (Extensible) • Stream events are processed and alerts are evaluated during streaming
  • 17. Apache Eagle - Distributed Real-time Policy Engine 17 METADATA MANAGER Distributed Streaming Cluster Environment Real Time Alerts Alerts Policy Management Policy Dynamical Policy Deployment Usability • Powerful SQL-Like CEP CQL for Policy Definition • Dynamical Poilcy Metadata Lifecycle Management (Deployment/Update) • Easy-to-use Policy management and Alert analytics UI from metricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
  • 18. Apache Eagle - Distributed Real-time Policy Engine 18 Full-function Streaming CEP CQL: Siddhi on Storm by default hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min) select user, count(timestamp) as aggValue group by user having aggValue >= 5 insert into outputStream; • Filter • Join • Aggregation: Avg, Sum , Min, Max, etc • Group by • Having • Stream handlers for window: TimeWindow, Batch Window, Length Window • Conditions and Expressions: and, or, not, ==,!=, >=, >, <=, <, and arithmetic operations • Pattern processing • Sequence processing • Event Tables: intergrate historical data in realtime processing • SQL-Like Query: Query, Stream Definition and Query Plan compilation
  • 19. Apache Eagle - Distributed Real-time Policy Engine 19 Distributed Streaming Cluster Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Stream_{1} Stream_{*} Stream Processing Scalability: dynamic policy partition by {event} * {policy} • N Users with 3 partitions, M policies with 2 partitions, then 3*2 physical tasks • Physical partition + policy-level partition
  • 20. Apache Eagle - Distributed Real-time Policy Engine 20 Distributed Streaming Partition Problem https://en.wikipedia.org/wiki/Partition_problem S = {3,1,1,2,2,1,1} S1 = {1,1,1,1,1} S2 = {2,2} S3 = {3}
  • 21. Apache Eagle - Distributed Real-time Policy Engine 21 Distributed Streaming Partition Strategy groupBy[ GreedyStrategy ]((_.key1,_.key2 )) HBase Key Distribution Statistics (Online/Offline) Realtime Partition Strategy Key Statistics Cache Async Strategy • Greedy (Online/Offline) • PoTC • PKG • Hashing
  • 22. Apache Eagle - Distributed Real-time Policy Engine 22 Distributed Real-time Policy Engine Siddhi CEP Policy Evaluator Machine Learning Policy Evaluator Extensibility • Support WSO2 Siddhi CEP as first class • Extensible policy engine implementation • Extensible policy lifecycle management Extensible Policy Evaluator public interface PolicyEvaluatorServiceProvider { public String getPolicyType(); // literal string to identify one type of policy public Class getPolicyEvaluator(); // get policy evaluator implementation public List getBindingModules(); // policy text with json format to object mapping } public interface PolicyEvaluator { public void evaluate(ValuesArray input) throws Exception; // evaluate input event public void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef);// policy update public void onPolicyDelete(); // invoked when policy is deleted } METADATA MANAGER Policy/Metadata
  • 23. Apache Eagle - Distributed Real-time Policy Engine 23 Metadata-Driven • Stream Schema: AlertStreamSchemaEntity • Policy Definition: AlertDefinitionAPIEntity • Central metadata management • Dynamic metadata deployment @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; METADATA MANAGER Distributed Real-time Policy Engine Dynamic Metadata Loading
  • 24. .flatMap(AuditLogTransformer) .groupBy(_.user) .flatMap(UserProfileAggregator); Apache Eagle - Fluent Stream Processing DSL 24 env.fromKafka (KafkaConfig) .alert.persistAndEmail val env = ExecutionEnvironment.getStorm() env.execute(); Distributed Streaming Cluster Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Alerts Real-time Event Stream Stream_{1} Stream_{*} Stream Processing env.execute()
  • 25. Apache Eagle - Fluent Stream Processing DSL 25 • Physical execution platform independent • Easily assemble data transformation, filtering, join and alerting DAG in fluent way • DAG rewrite and optimization • StreamUnionExpansion • StreamGroupbyExpansion • StreamNameExpansion • StreamAlertExpansion • StreamParallelismConfigExpansion trait StreamProducer{ filter flatMap map{1,2,3,4} groupBy streamUnion // stream join is hard, not implemented for storm alertWithConsumer } StormExecutionEnvironment env = ExecutionEnvironmentFactory.getStorm(config); env.newSource(new KafkaSourcedSpoutProvider().getSpout(config)).renameOutputFields(1) .flatMap(new AuditLogTransformer()) .groupBy(0) .flatMap(new UserProfileAggregatorExecutor()); .alertWithConsumer(“userActivity“,”userProfileExecutor“) env.execute(); Optimizer 1. Development 2. Optimization 3. Compile to native app
  • 26. Apache Eagle - Scalable Data Storage and Query 26 • Entity Metadata on large-scale NoSQL storage like HBase • Full-function SQL-Like REST Query • Optimized rowkey design for time-series monitoring data • HBase Coprocessor • Secondary Index @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; query= AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
  • 27. Uniform rowkey design • Metric • Entity • Log Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | … Rowvalue ::= Log Content 27 Apache Eagle – Uniform HBase Rowkey Design
  • 28. Apache Eagle - Machine Learning Intergration 28
  • 29. 29 User Activity Profiling Offline: Determine bandwidth from training dataset the kernel density function parameters (KDE) Online: If a test data point lies outside the trained bandwidth, it is anomaly (Policy) PCs(Principle Components) in EVD (Eigenvalue Value Decomposition) Kernel Density Function Apache Eagle – User/System Activity Profiling
  • 30. 30 Anomaly Metric Predictive Detection Offline: Analyzing and combining 500+ metrics together for causal anomaly detections (IG -> PCA -> GMM -> MCC) Online: Predictively alert for anomaly metrics Normal (Green) and Abnormal (Red) Data and Probability Distribution and Threshold Selection PCA (Principal Component Analysis) Apache Eagle - Anomaly Metric Predictive Detection Anomaly Metric Predictive Detection Case Study
  • 31. Agenda • About Apache Eagle • Architecture • Ecosystem • Q & A 31
  • 32. 32 Apps  Security  Hadoop  Cloud  Database Interface  Web Portal  REST Services  Analytics Visualization Integration  Ambari  Docker  Ranger  Dataguise Eagle Framework Distributed real-time framework for efficiently developing highly scalable monitoring applications Eagle Apps Securiy / Hadoop / Cloud / Database Eagle Interface REST Service / Management UI / Customizable Analytics Visualization Eagle Integration Ambari / Docker / Ranger / Dataguise Open Source Community-driven and Cross-community cooperation Eagle Framework Apache Eagle Ecosystem
  • 33. 33 Apache Eagle Ecosystem - Security How to Secure Hadoop in Realtime? • Apache Eagle • Apache Ranger • Apache Knox • Dataguise
  • 34. 34 Apache Eagle Ecosystem - Hadoop Eagle in Apache Amabri: natively be part of hadoop ecosystem
  • 35. 35 Apache Eagle Ecosystem - Docker Eagle in Docker: natively fly on Cloud/Container STORM KAFKA ZOOKEEPER HBASE HADOOP … Powered of git clone apache/incubator-eagle eagle-docker 15 + 1 = 1 docker pull apacheeagle
  • 36. 36 Apache Eagle Ecosystem - Open Source If you want to go fast, go alone. If you want to go far, go together. -- African Proverb
  • 37. Learn more about Apache Eagle 37 • EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE) • EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER
  • 38. Q & A apache/incubator-eagle @TheApacheEagle @ApacheEagle http://eagle.incubator.apache.org The slide is licensed under Creative Commons Attribution 4.0 International license.

Editor's Notes

  1. 特色
  2. Tech stack decision about Message Bus
  3. Cross-Nodes Resource Balanace in production environment
  4. A good ecosystem will continiously keep the architecture renew and advanced no matter in technical stack iteration, human resource or business growth