Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows, then written into an Apache Kafka cluster, which serves as the platform's primary communications backbone. From there, the data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved in architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
Real-time Security Event Ingestion and Analysis through Kafka and Metron
1. Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
Kevin Mao
Senior Data Engineer, Capital One
Kevin.Mao@capitalone.com
@KevinJokaiMao
2. About Me
B.S., Computer Science, University of Maryland, Baltimore County
M.S., Computer Science, George Mason University
Enterprise Data Services, Data Intelligence
Purple Rain Project
Huge Zelda fan!
3. Agenda
Part 1: Motivation and Background
Part 2: Approach and Architecture
Part 3: Challenges
Part 4: Future Work
Part 5: Wrapping Up
5. Capital One
45,000 Employees
45 Million Customers
26,000 EC2 Instances
Credit Cards
Traditional Banking
Home/Auto Loans
Brokerage Services
6. The Problem
The ways in which adversaries can attack your systems are increasing
- The DNC hacks involved convincing spear-phishing emails posing as Google password resets
- Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin to unlock a medical records system held hostage by ransomware
Organizations have to keep up by employing a more numerous and more diverse set of tools
Finding a way to use those tools effectively is difficult
7. The Data
HTTP Proxy logs
Email Metadata
VPN logs
Firewall events
DNS
Syslogs (*nix, Windows)
Security Endpoints
Threat Intelligence
IDS Events
Wireless Access Points
Mobile Device Management
And more...
~ 40 distinct data feeds
~ 5 Billion events per day
~ 75,000 Peak events per second
~ 5 TB per day
8. What We Started Out With
Enterprise SIEM (Security Information and Event Management) platform
- Primary management tool for many years
- Encountered stability issues while scaling out to 13 months of data retention
Splunk
- Great UI experience
- Scaling out to 13 months becomes prohibitively expensive
9. Where Does That Leave Us
We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving.
Horizontally and linearly scalable
Platform and interface built for:
- SOC Analysts to quickly respond to incidents
- Forensic Investigators to analyze historical data and compile reports
- Threat Hunters to efficiently find vulnerabilities and malicious behavior
Affordable!
13. NiFi
Data routing, transformation, and distribution platform
Easy-to-use web UI
On-Prem Cluster – Collects data from all local devices
- Flows into AWS Cluster
- 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
AWS Cluster – Collects, preprocesses, and tags incoming data
- 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)
Individual data flows defined for each feed
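Flows are built in NiFi's web UI, but the same state is reachable programmatically. A minimal sketch of checking flow and cluster health through NiFi's REST API, assuming a hypothetical internal hostname and unauthenticated access for illustration:

```python
import requests

# Hypothetical NiFi host; in practice this would sit behind auth/TLS.
NIFI = "http://nifi.example.internal:8080/nifi-api"

# Overall flow status: active threads plus queued flowfiles/bytes.
status = requests.get(f"{NIFI}/flow/status").json()
print(status["controllerStatus"]["queued"])

# Cluster summary: how many nodes are connected.
cluster = requests.get(f"{NIFI}/flow/cluster/summary").json()
print(cluster["clusterSummary"]["connectedNodes"])
```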
15. Kafka
Distributed messaging platform
- Publish-Subscribe model
- Producer/Consumer implementations across many languages
- Support for stream processing and ingestion via Kafka Streams/Connect
Serves as communication backbone for infrastructure
20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
Replication factor of 2
Set partition count to a multiple of the aggregate disk count (see the sketch below)
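As a concrete illustration of that sizing rule: 20 brokers with 6 volumes each gives an aggregate disk count of 120, so a topic would get 120 (or 240) partitions. A minimal sketch using the kafka-python client; topic and broker names are hypothetical:

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BROKERS = "kafka01.example.internal:9092"  # hypothetical bootstrap broker

# 20 brokers x 6 disks = 120 aggregate disks; size partitions as a multiple.
admin = KafkaAdminClient(bootstrap_servers=BROKERS)
admin.create_topics([
    NewTopic(name="raw.proxy", num_partitions=120, replication_factor=2)
])

# Producers publish raw events; NiFi plays this role in the real pipeline.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send("raw.proxy", b'{"src_ip": "10.0.0.1", "url": "..."}')
producer.flush()
```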
16. Storm
Distributed realtime stream computation system
Scales up by adding more worker nodes
Fault tolerant – When a node dies, jobs that were on that node are moved to another
Support for topology isolation, microbatching, and custom routing
Storm Nimbus/UI – m4.2xlarge
45 Storm Worker Nodes – m4.2xlarge
4 worker slots per node – 2 vCPU 8GB Mem
17. Metron
Security analytics framework built on top of Storm
Consists of two sets of Storm topologies:
- Parser topologies – Parse raw data into a human-readable JSON format
- Enrichment topologies – Enrich parsed data with contextual information, then send it to the storage tier
Enrichment of incoming data streams with additional information
- Domain Generation Algorithm (DGA) scoring via machine learning model
- Active Directory user lookup
- Geolocation/ASN data for external IP addresses
- WHOIS lookup for unknown domain names
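In Metron these enrichments run inside the Storm enrichment topologies; the sketch below shows the shape of one of them (the geolocation lookup) in plain Python using the geoip2 library against a local MaxMind database. The database path and event field names are illustrative:

```python
import geoip2.database
import geoip2.errors

# Local MaxMind database; path is illustrative.
reader = geoip2.database.Reader("/data/GeoLite2-City.mmdb")

def enrich_geo(event: dict) -> dict:
    """Attach geolocation fields for the event's external IP, if resolvable."""
    try:
        city = reader.city(event["dst_ip"])
        event["geo"] = {
            "country": city.country.iso_code,
            "city": city.city.name,
            "location": [city.location.longitude, city.location.latitude],
        }
    except geoip2.errors.AddressNotFoundError:
        pass  # internal or unmapped address; leave the event unchanged
    return event

enriched = enrich_geo({"dst_ip": "8.8.8.8", "url": "example.com"})
```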
18. ElasticSearch
Distributed, RESTful search and analytics engine
- Each data feed has its own set of daily indices
- Each index is further subdivided into shards
Linearly scalable
Low latency full-text search
3 Master Nodes – m4.2xlarge
100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
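A minimal sketch of what per-feed daily indices look like from the query side, using the official Python client; host, index naming, and field names are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-master01.example.internal:9200"])  # hypothetical

# Daily indices per feed, e.g. proxy-2017.03.14; a wildcard spans the feed.
resp = es.search(
    index="proxy-*",
    body={
        "query": {"match": {"domain": "suspicious-domain.example"}},
        "size": 10,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```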
19. Kibana
Data visualization frontend for ElasticSearch
Alert management system
Cyber Threat Intelligence (CTI) repository for storing, tagging, searching artifacts
Multiple open source and custom plugins
• Timelion
• fermiumlabs/mathlion
• prelert/kibana-swimlane-vis
• sirensolutions/kibi
• sirensolutions/sentinl
• snuids/heatmap
• chenryn/kbn_sankey_vis
• And more...
24. S3
Simple Storage Service – Object storage service in the cloud
Compatible with processing engines like Spark, EMR
Data stored in two formats:
- Raw data – Used for replaying data through the pipeline and meeting our obligations as a system of record for some feeds
- Parsed data – Stored in columnar format (ORC) for batch processing
Everything in S3 is encrypted
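A minimal PySpark sketch of the batch side: reading parsed events and landing them in S3 as ORC. The bucket, prefixes, and the event_date partition column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parsed-to-orc").getOrCreate()

# Parsed JSON events staged by the streaming tier; path is hypothetical.
events = spark.read.json("s3a://security-data-lake/parsed/proxy/2017-03-14/")

# Columnar ORC layout for efficient long-term, batch-based analysis;
# assumes the parsed events carry an event_date column to partition by.
(events.write
    .mode("append")
    .partitionBy("event_date")
    .orc("s3a://security-data-lake/orc/proxy/"))
```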
25. Monitoring
Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm, and ElasticSearch
Most data stored in a separate ElasticSearch cluster
Grafana for visualization
ElastAlert for platform alerting
28. Format Wars
Ingested raw data comes in a variety of formats
- CSV, JSON, XML, CEF
Sometimes the formats are poorly defined
- Windows syslogs are pretty-printed with tab indentation, but have no field delimiters
- Various subtypes come in different formats
Upstream changes to raw data formats often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable
Takeaway: Format and serialize data as far upstream as possible.
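A minimal sketch of that takeaway: coerce whatever arrives (CSV, JSON, XML) into one JSON envelope at the first hop, so everything downstream sees a single format. The envelope fields and CSV column names are illustrative:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def normalize(raw: str, fmt: str, feed: str) -> str:
    """Coerce a raw record into a common JSON envelope as early as possible."""
    if fmt == "json":
        payload = json.loads(raw)
    elif fmt == "csv":
        payload = next(csv.DictReader(io.StringIO(raw),
                                      fieldnames=["ts", "src", "dst", "action"]))
    elif fmt == "xml":
        payload = {child.tag: child.text for child in ET.fromstring(raw)}
    else:
        raise ValueError(f"unknown format: {fmt}")
    return json.dumps({"feed": feed, "format": fmt, "payload": payload})

print(normalize("2017-03-14T00:00:00Z,10.0.0.1,8.8.8.8,ALLOW", "csv", "firewall"))
```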
29. Monitoring and Alerting
Platform-level telemetry should be stored with all the other data
- Instead of a separate Zabbix subsystem
Collect more granular application-level data
- Most components expose metrics via JMX (see the sketch below)
- Necessary to effectively troubleshoot performance bottlenecks
- Useful for capacity planning
Logging data collection
Common problem among many teams at Capital One
Takeaway: Reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service
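JMX isn't directly reachable from Python, so one common pattern is to expose it over HTTP with a Jolokia agent on each JVM. A minimal sketch, assuming a hypothetical Jolokia-instrumented Kafka broker, that pulls a throughput metric and indexes it into the monitoring ElasticSearch cluster:

```python
import datetime
import requests
from elasticsearch import Elasticsearch

# Assumes a Jolokia agent runs alongside the Kafka broker JVM.
JOLOKIA = "http://kafka01.example.internal:8778/jolokia"
MBEAN = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

resp = requests.get(f"{JOLOKIA}/read/{MBEAN}/OneMinuteRate").json()

es = Elasticsearch(["http://monitoring-es.example.internal:9200"])  # hypothetical
es.index(
    index="platform-metrics-2017.03.14",
    doc_type="metric",
    body={
        "@timestamp": datetime.datetime.utcnow().isoformat(),
        "host": "kafka01",
        "metric": "messages_in_per_sec",
        "value": resp["value"],
    },
)
```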
30. Rehydration
EC2 Instances with AMIs older than 60 days must be terminated
- Internal Capital One policy
Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
How do you rehydrate a newly provisioned platform with data?
How do you avoid service interruption to the user?
Blue/Green cluster deployment
Rolling rehydration every 30 days
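A minimal sketch of the blue/green cutover step, assuming the user-facing endpoint is a Route 53 record repointed from the old (blue) cluster to the freshly rehydrated (green) one. The zone ID and hostnames are hypothetical:

```python
import boto3

route53 = boto3.client("route53")

# Repoint the stable endpoint at the freshly rehydrated (green) cluster.
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={
        "Comment": "blue/green cutover after rehydration",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "kibana.security.example.internal.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "kibana-green.security.example.internal."}
                ],
            },
        }],
    },
)
```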
31. Auditing
Internal Audit
- 2 Internal Audits of NPI/PCI handling and storage processes
OCC (Office of the Comptroller of the Currency)
- Audit of data sources, networking, and archival of data.
FRB (Federal Reserve Board)
- IT Risk Management – Alerts considered an authoritative source as part of the first line of defense
- Resiliency – Provide evidence of the ability to fail over within an acceptable window of time
32. Handling Sensitive Data
Social Security Numbers
Credit card info
Home/Auto Loans
Checking/Savings Account Data
Trading data
Automated process to scan for PII/PCI data and scrub it from the raw data stream
- Secure raw data topics via encryption and access control
- Streaming job to scrub raw feeds and produce into separate ‘clean’ topics
Backwards remediation process for data stored in HDFS/S3
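A minimal sketch of the scrubbing job's core: regex-based detection of SSN and card-number patterns (with a Luhn check to cut false positives), masking them before the event is produced to the 'clean' topic. The patterns and masking strategy are simplified assumptions:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum; filters out numbers that merely look like PANs."""
    total, alt = 0, False
    for d in map(int, reversed(digits)):
        if alt:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
        alt = not alt
    return total % 10 == 0

def scrub(text: str) -> str:
    """Mask SSNs outright; mask card-like numbers only if they pass Luhn."""
    text = SSN.sub("[REDACTED-SSN]", text)
    return CARD.sub(
        lambda m: "[REDACTED-PAN]"
        if luhn_ok(re.sub(r"\D", "", m.group())) else m.group(),
        text,
    )

print(scrub("user 123-45-6789 paid with 4111 1111 1111 1111"))
```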
34. Schema Management
Authoritative service for clients to retrieve schemas applied to datasets.
Implementation is protocol-dependent:
- Avro – Confluent Schema Registry
- Protobuf – Central GitHub repository
Streaming job to parse and schema-fy raw data prior to processing it.
- Raw data that fails to fit schema diverted to alternate Kafka topic.
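A minimal sketch of that enforcement step using fastavro; records failing validation would be diverted to the alternate Kafka topic mentioned above. The schema, topic names, and record shape are illustrative:

```python
import fastavro
from fastavro.validation import validate

SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "ProxyEvent",
    "fields": [
        {"name": "ts", "type": "string"},
        {"name": "src_ip", "type": "string"},
        {"name": "url", "type": "string"},
    ],
})

def route(record: dict) -> str:
    """Return the Kafka topic this record should be produced to."""
    if validate(record, SCHEMA, raise_errors=False):
        return "parsed.proxy"
    return "parsed.proxy.failed"  # diverted for inspection and replay

print(route({"ts": "2017-03-14T00:00:00Z", "src_ip": "10.0.0.1", "url": "a.com"}))
print(route({"ts": "2017-03-14T00:00:00Z"}))  # missing fields -> failed topic
```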
35. Monitoring
Consolidate monitoring stack.
- Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana and friends
- Separate stacks for time-series numeric data and for logging:
- TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
- ELK stack
- Both have tradeoffs
36. Generalized Data Processing
Metron is really good for working in the infosec space, but does not generalize well.
Exploring options for building a data platform to address multiple use cases.
- Credit transactions
- Credit fraud
- Anti-Money Laundering
- Legal
Focus on supporting machine learning.
38. Retrospective
Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
Low query latency
Working to address concerns around data integrity (duplicates, loss, malformed)
They want more data!
- Bro
- Silvertail
- Phantom