Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
Kevin Mao
Senior Data Engineer, Capital One
Kevin.Mao@capitalone.com
@KevinJokaiMao
About Me
 B.S., Computer Science, University of Maryland, Baltimore County
 M.S., Computer Science, George Mason University
 Enterprise Data Services, Data Intelligence
 Purple Rain Project
 Huge Zelda fan!
Agenda
 Part 1: Motivation and Background
 Part 2: Approach and Architecture
 Part 3: Challenges
 Part 4: Future Work
 Part 5: Wrapping Up
Part 1: Motivation and Background
Capital One
 45,000 Employees
 45 Million Customers
 26,000 EC2 Instances
 Credit Cards
 Traditional Banking
 Home/Auto Loans
 Brokerage Services
The Problem
 The ways in which adversaries can attack your system are increasing
- DNC hacks involved convincing spear-phishing emails posing as Google
password reset notices
- Hollywood Presbyterian Medical Center pays $17,000 in Bitcoins to unlock
medical records system held hostage by ransomware
 Organizations have to keep up by employing a more numerous and more diverse
set of tools
 Finding a way to effectively use those tools is difficult
The Data
 HTTP Proxy logs
 Email Metadata
 VPN logs
 Firewall events
 DNS
 Syslogs (*nix, Windows)
 Security Endpoints
 Threat Intelligence
 IDS Events
 Wireless Access Points
 Mobile Device Management
 And more...
 ~ 40 distinct data feeds
 ~ 5 Billion events per day
 ~ 75,000 Peak events per second
 ~ 5 TB per day
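The volume figures above can be cross-checked with some quick arithmetic. This is only a sketch: it assumes decimal terabytes and treats the approximate numbers on the slide as exact.

```python
# Back-of-the-envelope check of the volume figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 5_000_000_000      # ~5 billion events per day
peak_events_per_second = 75_000     # ~75,000 peak events per second
bytes_per_day = 5 * 10**12          # ~5 TB per day (assumption: decimal TB)

average_events_per_second = events_per_day / SECONDS_PER_DAY
peak_to_average_ratio = peak_events_per_second / average_events_per_second
average_event_bytes = bytes_per_day / events_per_day

print(f"average events/sec: {average_events_per_second:,.0f}")
print(f"peak-to-average ratio: {peak_to_average_ratio:.2f}")
print(f"average event size: {average_event_bytes:,.0f} bytes")
```

Under these assumptions the average rate is roughly 58,000 events per second, so the quoted peak is about 1.3 times the average, and a typical event is about 1 KB.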
What We Started Out With
 Enterprise SIEM (Security Information and Event Management) platform
- Primary management tool for many years
- Encountered stability issues while scaling out to 13 months of data retention
 Splunk
- Great UI experience
- Scaling out to 13 months becomes prohibitively expensive
Where Does That Leave Us
 We need a solution for security event and telemetry data that is diverse,
voluminous, and fast-moving.
 Horizontally and linearly scalable
 Platform and interface built for:
- SOC Analysts to quickly respond to incidents
- Forensic Investigators to analyze historical data and compile reports
- Threat Hunters to efficiently find vulnerabilities and malicious behavior
 Affordable!
Purple Rain
Part 2: Approach and Architecture
NiFi
 Data routing, transformation, and distribution platform
 Easy to use Web UI
 On-Prem Cluster – Collects data from all local devices
- Flows into AWS Cluster
- 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
 AWS Cluster – Collects, preprocesses, and tags incoming data
- 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)
 Individual data flows defined for each feed
Kafka
 Distributed messaging platform
- Publish-Subscribe model
- Producer/Consumer implementations across many languages
- Support for stream processing and ingestion via Kafka Streams/Connect
 Serves as communication backbone for infrastructure
 20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
 Replication factor of 2
 Set partition count to multiple of aggregate disk count
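The partition-count rule above can be sketched with the broker figures from this slide (20 brokers, 6 volumes each). The helper names are hypothetical, and the rationale that partitions then spread evenly across disks is an assumption rather than something the slide states.

```python
def aggregate_disk_count(brokers: int, disks_per_broker: int) -> int:
    """Total number of data disks across the whole Kafka cluster."""
    if brokers < 1 or disks_per_broker < 1:
        raise ValueError("brokers and disks_per_broker must be positive")
    return brokers * disks_per_broker


def round_up_to_disk_multiple(desired: int, brokers: int, disks_per_broker: int) -> int:
    """Smallest multiple of the aggregate disk count that is >= desired."""
    if desired < 1:
        raise ValueError("desired must be positive")
    disks = aggregate_disk_count(brokers, disks_per_broker)
    return -(-desired // disks) * disks  # ceiling division, then scale back up


# 20 brokers x 6 EBS volumes each, as on the slide.
print(aggregate_disk_count(20, 6))
print(round_up_to_disk_multiple(200, 20, 6))
```

With these numbers the cluster has 120 disks, so a topic that would otherwise get 200 partitions is rounded up to 240.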
Storm
 Distributed realtime stream computation system
 Scales up by adding more worker nodes
 Fault tolerant – When a node dies, jobs that were on that node are moved to
another
 Support for topology isolation, microbatching, and custom routing
 Storm Nimbus/UI – m4.2xlarge
 45 Storm Worker Nodes – m4.2xlarge
 4 worker slots per node – 2 vCPU 8GB Mem
Metron
 Security analytics framework built on top of Storm
 Consists of two sets of Storm topologies:
- Parser topologies – Parse raw data into human readable JSON format
- Enrichment topologies – Enrich parsed data with contextual information, then
send to storage tier.
 Enrichment of incoming data streams with additional information
- Domain Generation Algorithm (DGA) scoring via machine learning model
- Active Directory user lookup
- Geolocation/ASN data for external IP addresses
- WHOIS lookup for unknown domain names
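The parse-then-enrich flow described above can be illustrated in plain Python. This is a conceptual sketch only, not Metron's or Storm's API: the log layout, the field names, and the `USERS_BY_IP` table (standing in for an Active Directory lookup) are all hypothetical.

```python
import json
from typing import Callable, Dict, List

Event = Dict[str, object]

# Hypothetical stand-in for an Active Directory lookup service.
USERS_BY_IP = {"10.0.0.5": "jdoe"}


def parse_proxy_line(raw: str) -> Event:
    """Parser stage: turn one space-delimited raw line into named fields."""
    timestamp, src_ip, domain = raw.strip().split(" ", 2)
    return {"timestamp": int(timestamp), "ip_src_addr": src_ip, "domain": domain}


def enrich_user(event: Event) -> Event:
    """Enrichment stage: attach the user mapped to the source IP, if known."""
    user = USERS_BY_IP.get(str(event["ip_src_addr"]))
    return {**event, "user": user} if user else event


def run_pipeline(raw: str, enrichments: List[Callable[[Event], Event]]) -> str:
    """Parse a raw line, apply each enrichment in order, and emit JSON."""
    event = parse_proxy_line(raw)
    for enrich in enrichments:
        event = enrich(event)
    return json.dumps(event, sort_keys=True)


print(run_pipeline("1489536000 10.0.0.5 example.com", [enrich_user]))
```

Keeping each enrichment as an independent function mirrors the idea that lookups such as geolocation or WHOIS can be added without touching the parser.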
ElasticSearch
 Distributed, RESTful search and analytics engine
- Each data feed has its own set of daily indices
- Each index is further subdivided into shards
 Linearly scalable
 Low latency full-text search
 3 Master Nodes – m4.2xlarge
 100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
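The per-feed daily index layout mentioned above might look like the following. The `<feed>-YYYY.MM.DD` naming pattern and the feed name are assumptions for illustration; the slide does not give the actual naming scheme.

```python
from datetime import date, timedelta
from typing import List


def daily_index_name(feed: str, day: date) -> str:
    """Name of the index holding one feed's events for one day (assumed pattern)."""
    return f"{feed}-{day:%Y.%m.%d}"


def indices_for_range(feed: str, start: date, end: date) -> List[str]:
    """All daily indices a query must touch for an inclusive date range."""
    days = (end - start).days
    return [daily_index_name(feed, start + timedelta(days=n)) for n in range(days + 1)]


print(indices_for_range("proxy", date(2017, 2, 27), date(2017, 3, 1)))
```

One practical consequence of daily indices is that retention can be enforced by dropping whole indices once they age out, and a time-bounded search only needs to touch the indices in its range.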
Kibana
 Data visualization frontend for ElasticSearch
 Alert management system
 Cyber Threat Intelligence (CTI) repository for storing, tagging, searching artifacts
 Multiple open source and custom plugins
• Timelion
• fermiumlabs/mathlion
• prelert/kibana-swimlane-vis
• sirensolutions/kibi
• sirensolutions/sentinl
• snuids/heatmap
• chenryn/kbn_sankey_vis
• And more...
S3
 Simple Storage Service – Object storage service in the cloud
 Compatible with processing engines like Spark, EMR
 Data stored in two formats:
- Raw data – Used for replaying data through the pipeline and meeting our
obligations as a system of record for some feeds
- Parsed data – Stored in columnar format (ORC) for batch processing
 Everything in S3 is encrypted
Monitoring
 Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
 Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm,
ElasticSearch
 Most data stored in a separate ElasticSearch cluster
 Grafana for visualization
 ElastAlert for platform alerting
Part 3: Challenges
Format Wars
 Ingested raw data comes in a variety of formats
- CSV, JSON, XML, CEF
 Sometimes the formats are poorly defined
- Windows Syslogs are pretty-printed with tab indentation, but have no delimiters
- Various subtypes come in different formats
 Upstream changes to raw data format often propagate through our entire pipeline,
eventually making the data in ElasticSearch unusable
 Takeaway: Format and serialize data as far upstream as possible.
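One way to read this takeaway is to convert every feed into a single canonical serialization at the edge, so downstream stages never see the source format. The sketch below assumes a hypothetical three-column CSV layout and uses sorted-key JSON as the canonical form; neither detail comes from the talk.

```python
import csv
import io
import json

# Hypothetical column order for a CSV feed.
CSV_FIELDS = ["timestamp", "ip_src_addr", "action"]


def normalize(raw: str, fmt: str) -> str:
    """Convert one raw record into canonical JSON, whatever its source format."""
    if fmt == "csv":
        row = next(csv.reader(io.StringIO(raw)))
        if len(row) != len(CSV_FIELDS):
            raise ValueError(f"expected {len(CSV_FIELDS)} columns, got {len(row)}")
        record = dict(zip(CSV_FIELDS, row))
    elif fmt == "json":
        record = json.loads(raw)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return json.dumps(record, sort_keys=True)


print(normalize("1489536000,10.0.0.5,allow", "csv"))
```

Because the column check fails loudly, an upstream change to the raw format is caught at ingestion instead of surfacing later as unusable data in ElasticSearch.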
Monitoring and Alerting
 Platform-level telemetry should be stored with all the other data
- Instead of a separate Zabbix subsystem
 Collect more granular application-level data
- Most components expose metrics via JMX
- Necessary to effectively troubleshoot performance bottlenecks
- Useful for capacity planning
 Logging data collection
 Common problem among many teams at Capital One
 Takeaway: Reduce duplication of work by offering common monitoring
infrastructure, or even Monitoring-as-a-Service
Rehydration
 EC2 Instances with AMIs older than 60 days must be terminated
- Internal Capital One policy
 Spent a lot of time developing automation and orchestration to spin up a full cluster
from scratch
 How do you rehydrate a newly provisioned platform with data?
 How do you avoid service interruption to the user?
 Blue/Green cluster deployment
 Rolling rehydration every 30 days
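The relationship between the 60-day AMI limit and the 30-day rolling rehydration can be sketched as a simple age check. The function names are hypothetical, and this does not call any AWS API; in practice the AMI creation time would come from instance metadata.

```python
from datetime import datetime, timedelta, timezone

MAX_AMI_AGE = timedelta(days=60)            # policy limit from the slide
REHYDRATION_INTERVAL = timedelta(days=30)   # rolling rehydration cadence


def must_terminate(ami_created: datetime, now: datetime) -> bool:
    """True once the AMI backing an instance is older than the policy limit."""
    return now - ami_created > MAX_AMI_AGE


def days_of_margin(ami_created: datetime, now: datetime) -> int:
    """Whole days left before the instance violates the policy."""
    return (MAX_AMI_AGE - (now - ami_created)).days


now = datetime(2017, 3, 15, tzinfo=timezone.utc)
fresh = now - REHYDRATION_INTERVAL
print(must_terminate(fresh, now), days_of_margin(fresh, now))
```

Rehydrating every 30 days leaves about 30 days of slack before the 60-day limit, which gives room to retry a failed blue/green cutover.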
Auditing
 Internal Audit
- 2 Internal Audits of NPI/PCI handling and storage processes
 OCC (Office of the Comptroller of the Currency)
- Audit of data sources, networking, and archival of data.
 FRB (Federal Reserve Board)
- IT Risk Management – Alerts considered as an authoritative source as part of first
line of defense
- Resiliency – Provide evidence of ability to fail over within an acceptable window of
time
Handling Sensitive Data
 Social Security Numbers
 Credit card info
 Home/Auto Loans
 Checking/Savings Account Data
 Trading data
 Automated process to scan for PII/PCI data and scrub it from the raw data stream
- Secure raw data topics via encryption and access control
- Streaming job to scrub raw feeds and produce into separate ‘clean’ topics
 Backwards remediation process for data stored in HDFS/S3
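A scrubbing step of the kind described above could be approximated with pattern matching. This is a minimal sketch, not the production process: the regular expressions, the redaction tokens, and the use of a Luhn checksum to reduce false positives on card numbers are all assumptions.

```python
import re

# US Social Security Number in the common NNN-NN-NNNN layout.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub(raw: str) -> str:
    """Replace SSNs and Luhn-valid card numbers with redaction tokens."""
    scrubbed = SSN_PATTERN.sub("[REDACTED-SSN]", raw)

    def mask_card(match: "re.Match[str]") -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED-PAN]" if luhn_valid(digits) else match.group()

    return CARD_PATTERN.sub(mask_card, scrubbed)


print(scrub("user=jdoe ssn=123-45-6789 card=4111 1111 1111 1111"))
```

In a streaming job, a function like `scrub` would sit between the access-controlled raw topic and the 'clean' topic, so only scrubbed records reach general consumers.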
Part 4: Future Work
Schema Management
 Authoritative service for clients to retrieve schemas applied to datasets.
 Implementation is protocol dependent.
- Avro – Confluent Schema Registry
- Protobuf – Central GH Repository
 Streaming job to parse and schema-fy raw data prior to processing it.
- Raw data that fails to fit schema diverted to alternate Kafka topic.
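The validate-or-divert idea above can be sketched without a Kafka client by returning the destination topic for each record. The topic names and the required fields are hypothetical, and a real deployment would validate against the registered Avro or Protobuf schema rather than this hand-written check.

```python
import json
from typing import Tuple

# Hypothetical minimal schema: field name -> required Python type.
REQUIRED_FIELDS = {"timestamp": int, "source_type": str, "ip_src_addr": str}

VALID_TOPIC = "proxy.parsed"     # hypothetical topic names
INVALID_TOPIC = "proxy.invalid"


def route(raw: bytes) -> Tuple[str, bytes]:
    """Return (topic, payload): valid records go on, the rest are diverted."""
    try:
        record = json.loads(raw)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return INVALID_TOPIC, raw
    if not isinstance(record, dict):
        return INVALID_TOPIC, raw
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            return INVALID_TOPIC, raw
    return VALID_TOPIC, json.dumps(record, sort_keys=True).encode("utf-8")


good = b'{"timestamp": 1489536000, "source_type": "proxy", "ip_src_addr": "10.0.0.5"}'
print(route(good)[0])
print(route(b"not json")[0])
```

Diverted records keep their original bytes, so they can be inspected and replayed once the schema or the upstream producer is fixed.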
Monitoring
 Consolidate monitoring stack.
- Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana and friends
- Separate stacks for Time-series numeric and logging:
- TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
- ELK stack
- Both have tradeoffs
Generalized Data Processing
 Metron is really good for working in the infosec space, but does not generalize well.
 Exploring options for building a data platform to address multiple use cases.
- Credit transactions
- Credit fraud
- Anti-Money Laundering
- Legal
 Focus on supporting machine learning.
Part 5: Wrapping Up
Retrospective
 Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
 Low query latency
 Working to address concerns around data integrity (duplicates, loss, malformed)
 They want more data!
- Bro
- Silvertail
- Phantom
Q&A
Thanks!
Kevin.Mao@capitalone.com
@KevinJokaiMao
linkedin.com/in/kevinjmao
We’re hiring in SF, Chicago, and DC!
Machine Learning Engineers
Software Engineers
Data Engineers
Data Scientists

A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Real-time Security Event Ingestion and Analysis through Kafka and Metron

  • 1. Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
      Kevin Mao, Senior Data Engineer, Capital One
      Kevin.Mao@capitalone.com
      @KevinJokaiMao
  • 2. About Me
      - B.S., Computer Science, University of Maryland, Baltimore County
      - M.S., Computer Science, George Mason University
      - Enterprise Data Services, Data Intelligence
      - Purple Rain Project
      - Huge Zelda fan!
  • 3. Agenda
      - Part 1: Motivation and Background
      - Part 2: Approach and Architecture
      - Part 3: Challenges
      - Part 4: Future Work
      - Part 5: Wrapping Up
  • 4. Part 1: Motivation and Background
  • 5. Capital One
      - 45,000 Employees
      - 45 Million Customers
      - 26,000 EC2 Instances
      - Credit Cards
      - Traditional Banking
      - Home/Auto Loans
      - Brokerage Services
  • 6. The Problem
      - The ways in which adversaries can attack your system are increasing
          - DNC hacks involved convincing spear phishing emails posing as Google Password Reset
          - Hollywood Presbyterian Medical Center pays $17,000 in Bitcoins to unlock medical records system held hostage by ransomware
      - Organizations have to keep up by employing a more numerous and more diverse set of tools
      - Finding a way to effectively use those tools is difficult
  • 7. The Data
      - HTTP Proxy logs
      - Email Metadata
      - VPN logs
      - Firewall events
      - DNS
      - Syslogs (*nix, Windows)
      - Security Endpoints
      - Threat Intelligence
      - IDS Events
      - Wireless Access Points
      - Mobile Device Management
      - And more...
      - ~ 40 distinct data feeds
      - ~ 5 Billion events per day
      - ~ 75,000 Peak events per second
      - ~ 5 TB per day
  • 8. What We Started Out With
      - Enterprise SIEM (Security Information and Event Management) platform
          - Primary management tool for many years
          - Encountered stability issues while scaling out to 13 months of data retention
      - Splunk
          - Great UI experience
          - Scaling out to 13 months becomes prohibitively expensive
  • 9. Where Does That Leave Us
      - We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving.
      - Horizontally and linearly scalable
      - Platform and interface built for:
          - SOC Analysts to quickly respond to incidents
          - Forensic Investigators to analyze historical data and compile reports
          - Threat Hunters to efficiently find vulnerabilities and malicious behavior
      - Affordable!
  • 11. Part 2: Approach and Architecture
  • 13. NiFi
      - Data routing, transformation, and distribution platform
      - Easy to use Web UI
      - On-Prem Cluster – Collects data from all local devices
          - Flows into AWS Cluster
          - 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
      - AWS Cluster – Collects, preprocesses, and tags incoming data
          - 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)
      - Individual data flows defined for each feed
  • 15. Kafka
      - Distributed messaging platform
          - Publish-Subscribe model
          - Producer/Consumer implementations across many languages
          - Support for stream processing and ingestion via Kafka Streams/Connect
      - Serves as communication backbone for infrastructure
      - 20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
      - Replication factor of 2
      - Set partition count to multiple of aggregate disk count
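
The partition-count rule on the Kafka slide can be worked through with the cluster figures given there (20 brokers with 6 EBS volumes each, so 120 disks in aggregate). The sketch below is only an illustration: the function name and the idea of rounding a desired minimum up to the next multiple are assumptions, not the team's actual tooling. The presumed motivation is that a multiple of the disk count lets partitions spread evenly across every volume.

```python
import math


def partition_count(brokers: int, disks_per_broker: int, min_partitions: int) -> int:
    """Round min_partitions up to the nearest multiple of the aggregate disk count.

    Hypothetical helper: the slide only states the rule of thumb,
    not how the partition count was actually chosen.
    """
    aggregate_disks = brokers * disks_per_broker
    multiples = max(1, math.ceil(min_partitions / aggregate_disks))
    return multiples * aggregate_disks


# Figures from the slide: 20 brokers x 6 volumes = 120 disks.
# Asking for at least 200 partitions rounds up to 2 x 120.
print(partition_count(20, 6, 200))  # 240
```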
  • 16. Storm
      - Distributed realtime stream computation system
      - Scales out by adding more worker nodes
      - Fault tolerant – When a node dies, jobs that were on that node are moved to another
      - Support for topology isolation, microbatching, and custom routing
      - Storm Nimbus/UI – m4.2xlarge
      - 45 Storm Worker Nodes – m4.2xlarge
      - 4 worker slots per node – 2 vCPU, 8GB Mem
  • 17. Metron
      - Security analytics framework built on top of Storm
      - Consists of two sets of Storm topologies:
          - Parser topologies – Parse raw data into human-readable JSON format
          - Enrichment topologies – Enrich parsed data with contextual information, then send to storage tier.
      - Enrichment of incoming data streams with additional information
          - Domain Generation Algorithm (DGA) scoring via machine learning model
          - Active Directory user lookup
          - Geolocation/ASN data for external IP addresses
          - WHOIS lookup for unknown domain names
  • 18. ElasticSearch
      - Distributed, RESTful search and analytics engine
          - Each data feed has its own set of daily indices
          - Each index is further subdivided into shards
      - Linearly scalable
      - Low latency full-text search
      - 3 Master Nodes – m4.2xlarge
      - 100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
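
To make the "daily indices per feed" layout concrete, here is a minimal sketch of how a query over a date range could be mapped onto the indices it needs to touch. The `<feed>-YYYY.MM.DD` naming pattern and both function names are assumptions for illustration; the slides do not say how the indices were actually named.

```python
from datetime import date, timedelta


def daily_index(feed: str, day: date) -> str:
    """Build an index name for one feed and one day (hypothetical naming pattern)."""
    return f"{feed}-{day:%Y.%m.%d}"


def indices_for_range(feed: str, start: date, end: date) -> list[str]:
    """List every daily index a query over [start, end] would need to search."""
    days = (end - start).days
    return [daily_index(feed, start + timedelta(days=i)) for i in range(days + 1)]


print(indices_for_range("dns", date(2017, 1, 30), date(2017, 2, 1)))
```

One consequence of this layout is that retention can be enforced by dropping whole indices older than the cutoff rather than deleting individual documents.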
  • 19. Kibana
      - Data visualization frontend for ElasticSearch
      - Alert management system
      - Cyber Threat Intelligence (CTI) repository for storing, tagging, searching artifacts
      - Multiple open source and custom plugins
          - Timelion
          - fermiumlabs/mathlion
          - prelert/kibana-swimlane-vis
          - sirensolutions/kibi
          - sirensolutions/sentinl
          - snuids/heatmap
          - chenryn/kbn_sankey_vis
          - And more...
  • 24. S3
      - Simple Storage Service – Object storage service in the cloud
      - Compatible with processing engines like Spark, EMR
      - Data stored in two formats:
          - Raw data – Used for replaying data through the pipeline and meeting our obligations as a system of record for some feeds
          - Parsed data – Stored in columnar format (ORC) for batch processing
      - Everything in S3 is encrypted
  • 25. Monitoring
      - Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
      - Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm, ElasticSearch
      - Most data stored in a separate ElasticSearch cluster
      - Grafana for visualization
      - ElastAlert for platform alerting
  • 28. Format Wars
      - Ingested raw data comes in a variety of formats
          - CSV, JSON, XML, CEF
      - Sometimes the formats are poorly defined
          - Windows syslogs are pretty-printed with tab indentation but have no delimiters
          - Various subtypes come in different formats
      - Upstream changes to raw data format often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable
      - Takeaway: Format and serialize data as far upstream as possible.
  • 29. Monitoring and Alerting
      - Platform-level telemetry should be stored with all the other data
          - Instead of a separate Zabbix subsystem
      - Collect more granular application-level data
          - Most components expose metrics via JMX
          - Necessary to effectively troubleshoot performance bottlenecks
          - Useful for capacity planning
      - Logging data collection
      - Common problem among many teams at Capital One
      - Takeaway: Reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service
  • 30. Rehydration
      - EC2 Instances with AMIs older than 60 days must be terminated
          - Internal Capital One policy
      - Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
      - How do you rehydrate a newly provisioned platform with data?
      - How do you avoid service interruption to the user?
      - Blue/Green cluster deployment
      - Rolling rehydration every 30 days
  • 31. Auditing
      - Internal Audit
          - 2 Internal Audits of NPI/PCI handling and storage processes
      - OCC (Office of the Comptroller of the Currency)
          - Audit of data sources, networking, and archival of data.
      - FRB (Federal Reserve Board)
          - IT Risk Management – Alerts considered as an authoritative source as part of first line of defense
          - Resiliency – Provide evidence of ability to fail over within an acceptable window of time
  • 32. Handling Sensitive Data
      - Social Security Numbers
      - Credit card info
      - Home/Auto Loans
      - Checking/Savings Account Data
      - Trading data
      - Automated process to scan for PII/PCI data and scrub it from the raw data stream
          - Secure raw data topics via encryption and access control
          - Streaming job to scrub raw feeds and produce into separate ‘clean’ topics
      - Backwards remediation process for data stored in HDFS/S3
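
The scrubbing step described on the Handling Sensitive Data slide could look roughly like the sketch below, which masks Social Security Numbers and card numbers in a raw log line before it would be produced to a 'clean' topic. Everything here is an assumption for illustration: the regular expressions, the `[SSN]` and `[CARD]` placeholders, and the use of a Luhn checksum to reduce false positives are not taken from the talk, and a production scanner would need far broader coverage.

```python
import re

# Hypothetical patterns: US SSNs written as 123-45-6789, and 13 to 16 digit
# card numbers optionally separated by spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub(line: str) -> str:
    """Mask SSNs and Luhn-valid card numbers in one raw log line."""
    line = SSN_RE.sub("[SSN]", line)

    def mask_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Leave digit runs that fail the checksum alone (e.g. IDs, counters).
        return "[CARD]" if luhn_valid(digits) else match.group()

    return CARD_RE.sub(mask_card, line)


print(scrub("user=bob ssn=123-45-6789 card=4111 1111 1111 1111 ok"))
```

In a streaming job, a function like `scrub` would sit between the consumer reading the secured raw topic and the producer writing to the clean topic.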
  • 34. Schema Management
      - Authoritative service for clients to retrieve schemas applied to datasets.
      - Implementation is protocol dependent.
          - Avro – Confluent Schema Registry
          - Protobuf – Central GH Repository
      - Streaming job to parse and schema-fy raw data prior to processing it.
          - Raw data that fails to fit schema diverted to alternate Kafka topic.
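
The "divert on schema failure" idea from the Schema Management slide can be sketched without any Kafka client at all: validate each raw message and decide which topic it belongs on. The schema, the field names, and the topic names `parsed` and `invalid` below are hypothetical, and the dictionary-of-types check is a stand-in for real Avro or Protobuf validation, which the slide names but does not detail.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"timestamp": int, "src_ip": str, "dst_ip": str}


def conforms(record: object, schema: dict) -> bool:
    """Check that record is a dict carrying every schema field with the right type."""
    return isinstance(record, dict) and all(
        name in record and isinstance(record[name], typ)
        for name, typ in schema.items()
    )


def route(raw: str, schema: dict = SCHEMA,
          good_topic: str = "parsed", bad_topic: str = "invalid"):
    """Return (topic, payload): parsed records go on, bad input is diverted as-is."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return bad_topic, raw
    if conforms(record, schema):
        return good_topic, record
    return bad_topic, raw


print(route('{"timestamp": 1, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2"}'))
print(route("not json at all"))
```

Keeping the original payload on the diverted topic means malformed events can be inspected and replayed once the upstream format problem is understood.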
  • 35. Monitoring
      - Consolidate monitoring stack.
          - Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana and friends
          - Separate stacks for time-series numeric data and logging:
              - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
              - ELK stack
          - Both have tradeoffs
  • 36. Generalized Data Processing
      - Metron is really good for working in the infosec space, but does not generalize well.
      - Exploring options for building a data platform to address multiple use cases.
          - Credit transactions
          - Credit fraud
          - Anti-Money Laundering
          - Legal
      - Focus on supporting machine learning.
  • 38. Retrospective
      - Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
      - Low query latency
      - Working to address concerns around data integrity (duplicates, loss, malformed)
      - They want more data!
          - Bro
          - Silvertail
          - Phantom
  • 39. Q&A
  • 40. Thanks!
      Kevin.Mao@capitalone.com
      @KevinJokaiMao
      linkedin.com/in/kevinjmao
      We’re hiring in SF, Chicago, and DC!
      - Machine Learning Engineers
      - Software Engineers
      - Data Engineers
      - Data Scientists