Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows, then written into an Apache Kafka cluster, which serves as the platform's primary communications backbone. From there, the data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved in architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
Real-time Security Event Ingestion and Analysis through Kafka and Metron
1. Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
Kevin Mao
Senior Data Engineer, Capital One
Kevin.Mao@capitalone.com
@KevinJokaiMao
2. About Me
B.S., Computer Science, University of Maryland, Baltimore County
M.S., Computer Science, George Mason University
Enterprise Data Services, Data Intelligence
Purple Rain Project
Huge Zelda fan!
3. Agenda
Part 1: Motivation and Background
Part 2: Approach and Architecture
Part 3: Challenges
Part 4: Future Work
Part 5: Wrapping Up
5. Capital One
45,000 Employees
45 Million Customers
26,000 EC2 Instances
Credit Cards
Traditional Banking
Home/Auto Loans
Brokerage Services
6. The Problem
The ways in which adversaries can attack your systems are increasing
- The DNC hacks involved convincing spear-phishing emails posing as Google password resets
- Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin to unlock a medical records system held hostage by ransomware
Organizations have to keep up by employing a more numerous and more diverse set of tools
Finding a way to use those tools effectively is difficult
7. The Data
HTTP Proxy logs
Email Metadata
VPN logs
Firewall events
DNS
Syslogs (*nix, Windows)
Security Endpoints
Threat Intelligence
IDS Events
Wireless Access Points
Mobile Device Management
And more...
~ 40 distinct data feeds
~ 5 Billion events per day
~ 75,000 Peak events per second
~ 5 TB per day
8. What We Started Out With
Enterprise SIEM (Security Information and Event Management) platform
- Primary management tool for many years
- Encountered stability issues while scaling out to 13 months of data retention
Splunk
- Great UI experience
- Scaling out to 13 months becomes prohibitively expensive
9. Where Does That Leave Us
We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving.
Horizontally and linearly scalable
Platform and interface built for:
- SOC Analysts to quickly respond to incidents
- Forensic Investigators to analyze historical data and compile reports
- Threat Hunters to efficiently find vulnerabilities and malicious behavior
Affordable!
13. NiFi
Data routing, transformation, and distribution platform
Easy-to-use web UI
On-Prem Cluster – Collects data from all local devices
- Flows into AWS Cluster
- 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
AWS Cluster – Collects, preprocesses, and tags incoming data
- 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)
Individual data flows defined for each feed
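Flows are built in NiFi's web UI, but the same state is reachable programmatically. A minimal sketch of checking flow and cluster health through NiFi's REST API, assuming a hypothetical internal hostname and unauthenticated access for illustration:

```python
import requests

# Hypothetical NiFi host; in practice this would sit behind auth/TLS.
NIFI = "http://nifi.example.internal:8080/nifi-api"

# Overall flow status: active threads plus queued flowfiles/bytes.
status = requests.get(f"{NIFI}/flow/status").json()
print(status["controllerStatus"]["queued"])

# Cluster summary: how many nodes are connected.
cluster = requests.get(f"{NIFI}/flow/cluster/summary").json()
print(cluster["clusterSummary"]["connectedNodes"])
```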
15. Kafka
Distributed messaging platform
- Publish-Subscribe model
- Producer/Consumer implementations across many languages
- Support for stream processing and ingestion via Kafka Streams/Connect
Serves as communication backbone for infrastructure
20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
Replication factor of 2
Set partition count to a multiple of the aggregate disk count (see the sketch below)
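As a concrete illustration of that sizing rule: 20 brokers with 6 volumes each gives an aggregate disk count of 120, so a topic would get 120 (or 240) partitions. A minimal sketch using the kafka-python client; topic and broker names are hypothetical:

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BROKERS = "kafka01.example.internal:9092"  # hypothetical bootstrap broker

# 20 brokers x 6 disks = 120 aggregate disks; size partitions as a multiple.
admin = KafkaAdminClient(bootstrap_servers=BROKERS)
admin.create_topics([
    NewTopic(name="raw.proxy", num_partitions=120, replication_factor=2)
])

# Producers publish raw events; NiFi plays this role in the real pipeline.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send("raw.proxy", b'{"src_ip": "10.0.0.1", "url": "..."}')
producer.flush()
```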
16. Storm
Distributed realtime stream computation system
Scales up by adding more worker nodes
Fault tolerant – When a node dies, jobs that were on that node are moved to another
Support for topology isolation, microbatching, and custom routing
Storm Nimbus/UI – m4.2xlarge
45 Storm Worker Nodes – m4.2xlarge
4 worker slots per node – 2 vCPU 8GB Mem
17. Metron
Security analytics framework built on top of Storm
Consists of two sets of Storm topologies:
- Parser topologies – Parse raw data into a human-readable JSON format
- Enrichment topologies – Enrich parsed data with contextual information, then send it to the storage tier
Enrichment of incoming data streams with additional information
- Domain Generation Algorithm (DGA) scoring via machine learning model
- Active Directory user lookup
- Geolocation/ASN data for external IP addresses
- WHOIS lookup for unknown domain names
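In Metron these enrichments run inside the Storm enrichment topologies; the sketch below shows the shape of one of them (the geolocation lookup) in plain Python using the geoip2 library against a local MaxMind database. The database path and event field names are illustrative:

```python
import geoip2.database
import geoip2.errors

# Local MaxMind database; path is illustrative.
reader = geoip2.database.Reader("/data/GeoLite2-City.mmdb")

def enrich_geo(event: dict) -> dict:
    """Attach geolocation fields for the event's external IP, if resolvable."""
    try:
        city = reader.city(event["dst_ip"])
        event["geo"] = {
            "country": city.country.iso_code,
            "city": city.city.name,
            "location": [city.location.longitude, city.location.latitude],
        }
    except geoip2.errors.AddressNotFoundError:
        pass  # internal or unmapped address; leave the event unchanged
    return event

enriched = enrich_geo({"dst_ip": "8.8.8.8", "url": "example.com"})
```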
18. ElasticSearch
Distributed, RESTful search and analytics engine
- Each data feed has its own set of daily indices
- Each index is further subdivided into shards
Linearly scalable
Low latency full-text search
3 Master Nodes – m4.2xlarge
100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
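A minimal sketch of what per-feed daily indices look like from the query side, using the official Python client; host, index naming, and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-master01.example.internal:9200"])  # hypothetical

# Daily indices per feed, e.g. proxy-2017.03.14; a wildcard spans the feed.
resp = es.search(
    index="proxy-*",
    body={
        "query": {"match": {"domain": "suspicious-domain.example"}},
        "size": 10,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```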
19. Kibana
Data visualization frontend for ElasticSearch
Alert management system
Cyber Threat Intelligence (CTI) repository for storing, tagging, searching artifacts
Multiple open source and custom plugins
• Timelion
• fermiumlabs/mathlion
• prelert/kibana-swimlane-vis
• sirensolutions/kibi
• sirensolutions/sentinl
• snuids/heatmap
• chenryn/kbn_sankey_vis
• And more...
24. S3
Simple Storage Service – Object storage service in the cloud
Compatible with processing engines like Spark, EMR
Data stored in two formats:
- Raw data – Used for replaying data through the pipeline and meeting our obligations as a system of record for some feeds
- Parsed data – Stored in columnar format (ORC) for batch processing
Everything in S3 is encrypted
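A minimal PySpark sketch of the batch side: reading parsed events and landing them in S3 as ORC. The bucket, prefixes, and the event_date partition column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parsed-to-orc").getOrCreate()

# Parsed JSON events staged by the streaming tier; path is hypothetical.
events = spark.read.json("s3a://security-data-lake/parsed/proxy/2017-03-14/")

# Columnar ORC layout for efficient long-term, batch-based analysis;
# assumes the parsed events carry an event_date column to partition by.
(events.write
    .mode("append")
    .partitionBy("event_date")
    .orc("s3a://security-data-lake/orc/proxy/"))
```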
25. Monitoring
Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm, and ElasticSearch
Most data stored in a separate ElasticSearch cluster
Grafana for visualization
ElastAlert for platform alerting
28. Format Wars
Ingested raw data comes in a variety of formats
- CSV, JSON, XML, CEF
Sometimes the formats are poorly defined
- Windows syslogs are pretty-printed with tab indentation, but have no field delimiters
- Various subtypes come in different formats
Upstream changes to raw data formats often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable
Takeaway: Format and serialize data as far upstream as possible.
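A minimal sketch of that takeaway: coerce whatever arrives (CSV, JSON, XML) into one JSON envelope at the first hop, so everything downstream sees a single format. The envelope fields and CSV column names are illustrative:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def normalize(raw: str, fmt: str, feed: str) -> str:
    """Coerce a raw record into a common JSON envelope as early as possible."""
    if fmt == "json":
        payload = json.loads(raw)
    elif fmt == "csv":
        payload = next(csv.DictReader(io.StringIO(raw),
                                      fieldnames=["ts", "src", "dst", "action"]))
    elif fmt == "xml":
        payload = {child.tag: child.text for child in ET.fromstring(raw)}
    else:
        raise ValueError(f"unknown format: {fmt}")
    return json.dumps({"feed": feed, "format": fmt, "payload": payload})

print(normalize("2017-03-14T00:00:00Z,10.0.0.1,8.8.8.8,ALLOW", "csv", "firewall"))
```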
29. Monitoring and Alerting
Platform-level telemetry should be stored with all the other data
- Instead of a separate Zabbix subsystem
Collect more granular application-level data
- Most components expose metrics via JMX (see the sketch below)
- Necessary to effectively troubleshoot performance bottlenecks
- Useful for capacity planning
Logging data collection
Common problem among many teams at Capital One
Takeaway: Reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service
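JMX isn't directly reachable from Python, so one common pattern is to expose it over HTTP with a Jolokia agent on each JVM. A minimal sketch, assuming a hypothetical Jolokia-instrumented Kafka broker, that pulls a throughput metric and indexes it into the monitoring ElasticSearch cluster:

```python
import datetime
import requests
from elasticsearch import Elasticsearch

# Assumes a Jolokia agent runs alongside the Kafka broker JVM.
JOLOKIA = "http://kafka01.example.internal:8778/jolokia"
MBEAN = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

resp = requests.get(f"{JOLOKIA}/read/{MBEAN}/OneMinuteRate").json()

es = Elasticsearch(["http://monitoring-es.example.internal:9200"])  # hypothetical
es.index(
    index="platform-metrics-2017.03.14",
    doc_type="metric",
    body={
        "@timestamp": datetime.datetime.utcnow().isoformat(),
        "host": "kafka01",
        "metric": "messages_in_per_sec",
        "value": resp["value"],
    },
)
```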
30. Rehydration
EC2 Instances with AMIs older than 60 days must be terminated
- Internal Capital One policy
Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
How do you rehydrate a newly provisioned platform with data?
How do you avoid service interruption to the user?
Blue/Green cluster deployment
Rolling rehydration every 30 days
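A minimal sketch of the blue/green cutover step, assuming the user-facing endpoint is a Route 53 record repointed from the old (blue) cluster to the freshly rehydrated (green) one. The zone ID and hostnames are hypothetical:

```python
import boto3

route53 = boto3.client("route53")

# Repoint the stable endpoint at the freshly rehydrated (green) cluster.
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={
        "Comment": "blue/green cutover after rehydration",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "kibana.security.example.internal.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "kibana-green.security.example.internal."}
                ],
            },
        }],
    },
)
```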
31. Auditing
Internal Audit
- 2 Internal Audits of NPI/PCI handling and storage processes
OCC (Office of the Comptroller of the Currency)
- Audit of data sources, networking, and archival of data.
FRB (Federal Reserve Board)
- IT Risk Management – Alerts considered an authoritative source as part of the first line of defense
- Resiliency – Provide evidence of the ability to fail over within an acceptable window of time
32. Handling Sensitive Data
Social Security Numbers
Credit card info
Home/Auto Loans
Checking/Savings Account Data
Trading data
Automated process to scan for PII/PCI data and scrub it from the raw data stream
- Secure raw data topics via encryption and access control
- Streaming job to scrub raw feeds and produce into separate ‘clean’ topics
Backwards remediation process for data stored in HDFS/S3
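A minimal sketch of the scrubbing job's core: regex-based detection of SSN and card-number patterns (with a Luhn check to cut false positives), masking them before the event is produced to the 'clean' topic. The patterns and masking strategy are simplified assumptions:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum; filters out numbers that merely look like PANs."""
    total, alt = 0, False
    for d in map(int, reversed(digits)):
        if alt:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
        alt = not alt
    return total % 10 == 0

def scrub(text: str) -> str:
    """Mask SSNs outright; mask card-like numbers only if they pass Luhn."""
    text = SSN.sub("[REDACTED-SSN]", text)
    return CARD.sub(
        lambda m: "[REDACTED-PAN]"
        if luhn_ok(re.sub(r"\D", "", m.group())) else m.group(),
        text,
    )

print(scrub("user 123-45-6789 paid with 4111 1111 1111 1111"))
```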
34. Schema Management
Authoritative service for clients to retrieve schemas applied to datasets.
Implementation is protocol-dependent:
- Avro – Confluent Schema Registry
- Protobuf – Central GitHub repository
Streaming job to parse and schema-fy raw data prior to processing it.
- Raw data that fails to fit schema diverted to alternate Kafka topic.
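A minimal sketch of that enforcement step using fastavro; records failing validation would be diverted to the alternate Kafka topic mentioned above. The schema, topic names, and record shape are illustrative:

```python
import fastavro
from fastavro.validation import validate

SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "ProxyEvent",
    "fields": [
        {"name": "ts", "type": "string"},
        {"name": "src_ip", "type": "string"},
        {"name": "url", "type": "string"},
    ],
})

def route(record: dict) -> str:
    """Return the Kafka topic this record should be produced to."""
    if validate(record, SCHEMA, raise_errors=False):
        return "parsed.proxy"
    return "parsed.proxy.failed"  # diverted for inspection and replay

print(route({"ts": "2017-03-14T00:00:00Z", "src_ip": "10.0.0.1", "url": "a.com"}))
print(route({"ts": "2017-03-14T00:00:00Z"}))  # missing fields -> failed topic
```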
35. Monitoring
Consolidate monitoring stack.
- Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana and friends
- Separate stacks for time-series numeric data and for logging:
- TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
- ELK stack
- Both have tradeoffs
36. Generalized Data Processing
Metron is really good for working in the infosec space, but does not generalize well.
Exploring options for building a data platform to address multiple use cases.
- Credit transactions
- Credit fraud
- Anti-Money Laundering
- Legal
Focus on supporting machine learning.
38. Retrospective
Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
Low query latency
Working to address concerns around data integrity (duplicates, loss, malformed)
They want more data!
- Bro
- Silvertail
- Phantom