Hadoop / Spark on Malware Expression
© 2014 MapR Technologies
Objective
• Advanced Persistent Threat (APT)
• Big Data + Threat Intelligence
• Hadoop + Spark Solution
• Example Detection Algorithm Development Scenarios (most of them are still open problems)
Topics covered in this talk
Advanced Persistent Threat
APT
• Advanced Persistent Threat (APT) is one of the biggest headaches for
IT departments
– Target compromise
– Countless DDoS attacks (thousands a day, according to Arbor Networks)
– These are only the known cases, which could be just the tip of the iceberg
• Why is APT so prevalent?
– No longer just a hobby for smart hackers
– Huge money is involved, even organized crime is behind it
– Political tool (the recent conflict between Ukraine and Russia sparked malware
warfare between them)
– Cyber warfare (Stuxnet)
Overview
APT
• Hard to detect
– More software stacks appear every day without thorough vulnerability testing
• Storm, Spark, YARN, Grails, Play, Spring, Flask, …
– The mobile area is even worse
• Particularly Android
• By some estimates, 30% or more of devices worldwide are already compromised
• Anti-virus is useful only up to a certain point
– It takes months to years to define a malware signature
– Zero-day attacks are still unpreventable
– It has become almost a placebo
• Firewalls are not much use anymore
– A device can be infected when the user takes it outside the firewall perimeter
• Botnets themselves are becoming more complex, with many levels of hierarchy
– Minimal binary delivery
– Surreptitious C&C connections with a complex hierarchy, or even headless peer-to-peer bots (Gameover Zeus botnet)
Status
APT
• Snort / Suricata
– Rule-based systems
– Community support, pre/post-compromise detection
– Need constant updates, cannot detect zero-day attacks
– Sourcefire provides a paid service
• Sandbox technology
– Firewall + on-premises detection
– FireEye
• Polymorphing technology
– Shape Security
• Log data mining based methods
– Splunk / Sumo Logic, Solutionary
Defense, state of the art
APT
• Many security labs worldwide run malware labs and generate
threat reports
• The analysis takes from 2 weeks to months
• It involves
– Decoding binary execution and decrypting load / config parameters
– Complete timeline analysis, from infection to exploit
– Which devices, IPs, and domain names are involved
• Sometimes analyzing IRC data, or even social network data
– Tracing botnet connections and verifying the command and control
• Can we automate this with Big Data?
Threat Report
APT
Example Annual Threat Report (from FireEye, 2013, Europe)
Top two industries in threat findings were healthcare and finance
APT
• Configuration (Decrypted)
• ID: F16 08-07-2013
Group:
DNS/Port: Direct: toornt.servegame.com:443,
Proxy DNS/Port:
Proxy Hijack: No
ActiveX Startup Key:
HKLM Startup Entry:
File Name:
Install Path: C:\Documents and Settings\Admin\Local Settings\Temp\morse.exe
Keylog Path: C:\Documents and Settings\Admin\Local Settings\Temp\morse
Inject: No
Process Mutex: gdfgdfgdg
Key Logger Mutex:
ActiveX Startup: No
HKLM Startup: No
Copy To: No
Melt: No
Persistence: No
Keylogger: No
Password: !@#GooD#@!
Example Threat Report (from FireEye)
C&C Servers
toornt.servegame.com
updateo.servegame.com
egypttv.sytes.net
skype.servemp3.com
natco2.no-ip.net
Why does it need a password?
APT
• CHAIN OF EVENTS
• ASSOCIATED DOMAINS
• 192.81.171.13 - www.toonzone.net - Compromised website
• 190.123.47.198 - ilinsting.com - Redirect
• 64.202.116.124 - bgbyhn.in.ua - Fiesta EK
• INFECTION CHAIN OF EVENTS
• 06:40:07 UTC - www.toonzone.net - GET /forums/adult-swim-toonami-forum/
• 06:40:08 UTC - ilinsting.com - GET /szjhmucw.js?3ad1359a5153d640
• 06:40:09 UTC - bgbyhn.in.ua - GET /hdjng94/?2
• 06:40:11 UTC - bgbyhn.in.ua - GET /hdjng94/?25b6d1b1cb76ec625b500e0d560a50040703520d5053520a0706510355090109
• 06:40:12 UTC - bgbyhn.in.ua - GET /hdjng94/?2d8a97d01a056fdd41084e5a0b0c56050752085a0d55540b07570b54080f0708;5110411
• 06:40:14 UTC - bgbyhn.in.ua - GET /hdjng94/?02bb88c62d7306c8534209590a035103050452590c5a530d0501515709000057;5
• 06:40:15 UTC - bgbyhn.in.ua - GET /hdjng94/?02bb88c62d7306c8534209590a035103050452590c5a530d0501515709000057;5;1
• 06:40:42 UTC - bgbyhn.in.ua - GET /hdjng94/?2ad5cdef3fc4ef9851110f0e515f57530757540e5706555d07525700525c065e;6
• 06:40:43 UTC - bgbyhn.in.ua - GET /hdjng94/?2ad5cdef3fc4ef9851110f0e515f57530757540e5706555d07525700525c065e;6;1
• 06:40:43 UTC - bgbyhn.in.ua - GET /hdjng94/?5998786b9c7a1ffe544b580305030457000f0903035a0659000a0a0d0600555a
• 06:40:49 UTC - bgbyhn.in.ua - GET /hdjng94/?59576b00f4cfd03e5641500c04590205000f050c0200000b000a0602075a5308;1;2
• 06:40:49 UTC - bgbyhn.in.ua - GET /hdjng94/?59576b00f4cfd03e5641500c04590205000f050c0200000b000a0602075a5308;1;2;1
Another Example, Fiesta EK, from malware-traffic-analysis.net
Big Data + Threat Intelligence
Big Data + Threat Intelligence
• Tom Brady + Gisele Bundchen
– An Ideal Marriage
• With all the advances in computing and data resources, why can’t we
automate malware detection?
• Big Data is an ideal platform for malware study
– Simple packet capture can easily produce petabytes of data, even from small offices
– Huge storage + fast processing is essential for malware study
• Various aspects of Big Data fit well with malware analysis
– Streaming analysis (Storm, Spark Streaming)
– Volumetric data analysis (Spark)
– Graph analysis
• View network devices as nodes and discover which ones play a command and control role
• Each URL can be a node and the basis of graph analysis
– Visualization for intuitive analysis
Pros
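The graph view of network devices can be sketched in a few lines. This is only an illustration under stated assumptions, not a production detector: each device is a node, each observed connection is a directed edge, and destinations are ranked by how many distinct hosts contact them. The sample addresses and the function names `fan_in` and `rank_by_fan_in` are made up for the example.

```python
from collections import defaultdict

# Hypothetical (source, destination) pairs extracted from flow records.
flows = [
    ("10.0.0.5", "203.0.113.9"),
    ("10.0.0.6", "203.0.113.9"),
    ("10.0.0.7", "203.0.113.9"),
    ("10.0.0.5", "198.51.100.2"),
    ("10.0.0.5", "203.0.113.9"),  # repeated edge, counted once
]

def fan_in(edges):
    """Map each destination node to the set of distinct sources contacting it."""
    sources = defaultdict(set)
    for src, dst in edges:
        sources[dst].add(src)
    return sources

def rank_by_fan_in(edges):
    """Return (destination, distinct source count) pairs, highest fan-in first."""
    counts = [(dst, len(srcs)) for dst, srcs in fan_in(edges).items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))

ranking = rank_by_fan_in(flows)
```

A high fan-in only makes a node a candidate for a command and control role; popular legitimate services rank high as well, so the ranking is a starting point for further analysis.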
Big Data + Threat Intelligence
• Anomaly detection
– Typical log analysis
– Routers / switches have built-in alarm settings
• Simple level-based detection
– Is this going to be useful?
• How much can you tell?
• Machine Learning
– Not very useful
• Not easy to get labeled data
• Even with labeled data, it is very hard to develop a feature set
– If the feature set is known, hackers will revise their code
• Zero-day attacks do not come with a label
– Modeling needs a complete understanding of criminal minds
Cons (e.g., Gwyneth Paltrow and Chris Martin)
Big Data + Threat Intelligence
An Example Architecture
[Diagram: Storm Spout (packet stream or binary downloads); Storm Bolt (packet analysis, alert and store packet data); Storm Bolt (metadata extraction); store to HDFS; Spark analysis]
• Compared to logs, the packet stream truly reveals malware expression
• Connect the dots with strong in-memory processing
Big Data + Threat Intelligence
• Reduce false positives
– The mantra of the malware detection business
• Big Data is a great resource for reducing false positives (Type 1
errors)
– As soon as an algorithm is updated, test it against the Big
Data test cases
– The test can even be applied to old cases, greatly reducing false
positives
• Previously, we typically had to sample the test data, weighting old data lower
False Positives
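The idea of re-testing every algorithm update against archived cases can be sketched as follows. This is a minimal illustration, assuming a detector is a function from features to a boolean; the archive contents, the request-rate feature, and both detector thresholds are hypothetical.

```python
def false_positive_rate(detector, cases):
    """Fraction of benign cases that the detector flags as malicious."""
    benign = [features for features, is_malware in cases if not is_malware]
    if not benign:
        return 0.0
    flagged = sum(1 for features in benign if detector(features))
    return flagged / len(benign)

# Hypothetical archived cases: (requests per minute, known to be malware?)
archive = [(5, False), (12, False), (40, False), (90, True), (120, True), (45, False)]

def old_detector(rpm):
    """Previous version of the rule (assumed threshold)."""
    return rpm > 30

def new_detector(rpm):
    """Updated rule to be validated against the whole archive."""
    return rpm > 60

old_fpr = false_positive_rate(old_detector, archive)
new_fpr = false_positive_rate(new_detector, archive)
```

With the full archive available, the comparison runs over every stored case rather than a weighted sample, so a regression on old cases shows up immediately.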
Big Data + Threat Intelligence
• Wireshark (tshark) is the go-to software for packet analysis
– But it is a huge memory hog
• Need to put packet data onto HDFS
• Packetpig (from Hortonworks) has been developed for this
– Much more has to be done before it comes anywhere near the strength of
Wireshark
• Need to design efficient metadata collection and storage
mechanisms
– Use Snort or a custom C library to extract essential flow data
• A flow is a 5-tuple: source IP, source port, destination IP, destination port, protocol
• Flow is the de facto unit of network malware expression analysis
Packet to HDFS
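As a rough sketch of flow metadata extraction, the snippet below groups packet records by the 5-tuple and keeps only per-flow totals. The record layout, the sample packets, and the names `FlowKey` and `aggregate_flows` are assumptions for illustration; a real pipeline would read these fields from captured packets.

```python
from collections import namedtuple, defaultdict

FlowKey = namedtuple("FlowKey", "src_ip src_port dst_ip dst_port protocol")

# Hypothetical packet metadata:
# (timestamp, src_ip, src_port, dst_ip, dst_port, protocol, size in bytes)
packets = [
    (0.00, "10.0.0.5", 51000, "203.0.113.9", 443, "TCP", 120),
    (0.05, "10.0.0.5", 51000, "203.0.113.9", 443, "TCP", 1400),
    (0.40, "10.0.0.5", 51000, "203.0.113.9", 443, "TCP", 60),
    (0.10, "10.0.0.6", 53211, "198.51.100.2", 53, "UDP", 80),
]

def aggregate_flows(records):
    """Collapse packets into per-flow packet count, byte count, first and last seen."""
    stats = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for ts, src_ip, src_port, dst_ip, dst_port, proto, size in records:
        entry = stats[FlowKey(src_ip, src_port, dst_ip, dst_port, proto)]
        entry["packets"] += 1
        entry["bytes"] += size
        entry["first"] = ts if entry["first"] is None else min(entry["first"], ts)
        entry["last"] = ts if entry["last"] is None else max(entry["last"], ts)
    return dict(stats)

flow_table = aggregate_flows(packets)
```

Storing one small record per flow instead of every packet is what keeps the metadata compact enough to query over long time ranges.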
Big Data + Threat Intelligence
• Big Data provides the opportunity to map out all the IP addresses
used on a particular network
• Through graph analysis, find rogue IP addresses
• Use geographical information with IPs to find abnormal
connection behavior
• DNS provides many insights into malware connections
– A static IP is impractical for malware control (easily blocked), so malware relies on DNS
– Fast flux
– Awkward names
IP based analysis
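One simple way to look for fast flux in DNS data is to count how many distinct addresses a domain resolves to while its TTLs stay short. The sketch below assumes DNS answers are already available as tuples; the sample records and the thresholds `min_ips` and `max_ttl` are illustrative guesses, not tuned values.

```python
from collections import defaultdict

# Hypothetical DNS answers: (domain, resolved IP, TTL in seconds)
dns_answers = [
    ("www.example.com", "192.0.2.10", 3600),
    ("www.example.com", "192.0.2.10", 3600),
    ("flux.example.net", "198.51.100.1", 60),
    ("flux.example.net", "198.51.100.77", 60),
    ("flux.example.net", "203.0.113.5", 120),
    ("flux.example.net", "203.0.113.200", 60),
    ("cdn.example.org", "192.0.2.20", 30),
    ("cdn.example.org", "192.0.2.21", 30),
]

def fast_flux_candidates(answers, min_ips=4, max_ttl=300):
    """Domains that resolve to many distinct IPs while every TTL stays short."""
    ips = defaultdict(set)
    ttls = defaultdict(list)
    for domain, ip, ttl in answers:
        ips[domain].add(ip)
        ttls[domain].append(ttl)
    return sorted(
        d for d in ips if len(ips[d]) >= min_ips and max(ttls[d]) <= max_ttl
    )

candidates = fast_flux_candidates(dns_answers)
```

Content delivery networks also rotate addresses with short TTLs, so a candidate list like this still needs to be cross-checked against other evidence.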
Big Data + Threat Intelligence
• Flow is an essential unit of malware analysis
• A flow identifies
– Who’s connecting to whom
• Frequency, duration, communication bandwidth
• The app can be identified from the flow
– Port, actual content
– Palo Alto Networks
• Normal flow vs. abnormal flow
– With enough data, we could potentially identify normal flows
• Use the first 16 bytes?
– Cluster analysis to detect anomalies
Flow to detect malware expression
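As a much simpler stand-in for cluster analysis, the sketch below scores each flow by how far its byte count sits from the median, measured in units of the median absolute deviation (MAD). The byte counts and the cutoff of 5.0 are assumptions chosen for the example; a real model would use more features than flow size.

```python
from statistics import median

def robust_scores(values):
    """Score each value by its distance from the median, in units of MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to compare against, so nothing is scored as unusual.
        return [0.0 for _ in values]
    return [abs(v - center) / mad for v in values]

# Hypothetical bytes transferred per flow for one application.
flow_bytes = [1200, 1350, 1100, 1280, 1190, 98000, 1230]

scores = robust_scores(flow_bytes)
anomalies = [b for b, s in zip(flow_bytes, scores) if s > 5.0]
```

The median and MAD are used instead of the mean and standard deviation because a single huge flow would otherwise distort the baseline it is being compared against.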
Spark on Hadoop
Apache Spark
• spark.apache.org
• github.com/apache/spark
• user@spark.apache.org
• Originally developed in 2009 at UC Berkeley’s AMPLab
• Fully open sourced in 2010 – now at the Apache Software Foundation
Easy: Example – IP Count
• Hadoop MapReduce
// Mapper: emit (token, 1) for every whitespace-separated token in the line
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
// Reducer: sum the counts emitted for each key
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
• Spark
// sparkHome and jars are optional constructor arguments
val spark = new SparkContext(master, appName)
val file = spark.textFile("hdfs://...")
// The IP address is the first comma-separated field of each line
val counts = file.map(line => line.split(",")(0))
  .map(ip => (ip, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Fast: Using RAM, Operator Graphs
• In-memory Caching
• Data partitions read from RAM instead of disk
• Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
[Diagram: operator graph of RDDs A–F connected by map, filter, groupBy, and join, divided into Stages 1–3; the legend marks RDDs and cached partitions]
SPARK RDD
• The Resilient Distributed Dataset (RDD) is the key data structure, (potentially) held in
memory
• An RDD is distributed over Hadoop nodes and typically resides in
memory
• Transform the RDD, then get data from it; evaluation is lazy
– Two sets of interfaces are provided: one for transformations, the other for
actions (e.g., count, save)
• Most of the interface is quite similar to Lisp operations and SQL
operations
• Use persist (cache) to keep the RDD in memory
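The split between transformations and actions can be illustrated with a toy class in plain Python. This is not Spark and not its API; `LazyDataset` is a hypothetical stand-in that only records `map` and `filter` steps and runs them when an action such as `count` or `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _evaluate(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def count(self):
        # Action: triggers evaluation of the recorded steps.
        return sum(1 for _ in self._evaluate())

    def collect(self):
        # Action: triggers evaluation and returns all results.
        return list(self._evaluate())

calls = []

def parse_ip(line):
    calls.append(line)  # record when the function actually runs
    return line.split(",")[0]

lines = ["10.0.0.5,GET /", "10.0.0.6,GET /a", "10.0.0.5,POST /b"]
ips = LazyDataset(lines).map(parse_ip).filter(lambda ip: ip.endswith(".5"))
calls_before_action = len(calls)  # still zero: nothing has been evaluated
matching = ips.count()
```

Because this toy keeps no cache, every action re-runs the whole chain of steps, which is the repeated work that persisting an RDD in memory avoids.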
Working With RDDs
[Diagram: RDD → transformations → RDD → action → value]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
Spark, Hadoop Malware Analysis
Why useful
[Diagram: packet stream feeding real-time event processing, then fast classification or anomaly detection]
• Construct a group of suspected flows in an RDD
– E.g., suspected DNS tunnels, IRC communications
• Analyze with Spark on the RDD, in memory
• Connect the dots: flows, syslogs, and events
• Huge advantage over Wireshark!
• Store in HDFS for easy access and use HBase for database support
SPARK and Hadoop
• Connecting the dots needs huge storage and fast access
– Potential need to go back in time to find correlating events
• A DDoS attack found today + spotty IRC chat 10 days ago + NXDomain
events 20 days ago, all from the suspected infected machine
– Sometimes it takes months to learn that a domain the machine contacted is suspicious (e.g.,
scored in VirusTotal)
– Then see if these patterns match known malware expressions
– Approximate matching technology is quite important here
» HMM and correlation modeling
– HDFS + HBase would be a good solution
• Store relevant temporal data
• Retrieve it fast according to the criteria
• Spark + Hadoop provides a fast development cycle
– From prototype to evaluation
Why Hadoop
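Going back in time to find correlating events can be sketched as a lookback query per host. This is an illustration only: the event names, dates, hosts, and the 30-day window are hypothetical, and an in-memory list stands in for data that would live in HDFS or HBase.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical archived events: (day, host, event type)
events = [
    (date(2014, 5, 1), "10.0.0.5", "nxdomain_burst"),
    (date(2014, 5, 11), "10.0.0.5", "irc_chat"),
    (date(2014, 5, 21), "10.0.0.5", "ddos"),
    (date(2014, 5, 21), "10.0.0.8", "ddos"),
    (date(2014, 3, 1), "10.0.0.8", "irc_chat"),
]

def correlated_history(all_events, trigger, lookback_days):
    """For each host that raised `trigger`, list its earlier events in the window.

    If a host raised the trigger more than once, the last occurrence in the
    input determines the window.
    """
    by_host = defaultdict(list)
    for day, host, kind in all_events:
        by_host[host].append((day, kind))
    history = {}
    for day, host, kind in all_events:
        if kind != trigger:
            continue
        start = day - timedelta(days=lookback_days)
        history[host] = sorted(
            (d, k) for d, k in by_host[host] if start <= d < day
        )
    return history

history = correlated_history(events, trigger="ddos", lookback_days=30)
```

The first host shows the pattern from the slide (NXDomain events, then IRC chat, then the attack), while the second host's only earlier event falls outside the window and is not reported.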
Example Detection Algorithm Development Scenarios
Introduction to Botnet (Terminology)
[Diagram labels: Bot Master, Bots, Code Server, IRC Server, IRC Channel, Victim, Attack, C&C Traffic, Updates]
Old-days botnet operation, just for reference
Companies are interested in finding these on their premises
(Malware Expression) Detection Phases
• Pre Infection Detection
– Intrusion Detection System
• Active Infection Detection
– Recruit and Reconnaissance in the internal network
• Post Infection Detection
– Exploit and Monetize
Pre Infection Detection
• Detect suspicious URLs
– When a device tries to contact or download from a suspicious URL, block it
• How it works
– If suspicious or unknown contents are detected, send them to the backend Big
Data deep analysis engine
– Update suspicious IPs / domain names / URLs
– Update the hash of the binary
– Regularly remove old hashes / suspicious URLs
CAMP
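The update and regular removal of suspicious indicators can be sketched as a blocklist whose entries expire. This is a hypothetical illustration, not how CAMP is implemented: the class name, the one-week lifetime, and the placeholder hash string are assumptions, and time is passed in explicitly to keep the example deterministic.

```python
class ExpiringBlocklist:
    """Suspicious URLs, domains, or binary hashes that age out after a lifetime."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._added = {}  # indicator -> time it was last reported

    def add(self, indicator, now):
        """Add an indicator, or refresh it if it was reported again."""
        self._added[indicator] = now

    def is_blocked(self, indicator, now):
        added = self._added.get(indicator)
        return added is not None and now - added <= self.ttl

    def purge(self, now):
        """Drop expired entries and return how many were removed."""
        expired = [k for k, t in self._added.items() if now - t > self.ttl]
        for key in expired:
            del self._added[key]
        return len(expired)

blocklist = ExpiringBlocklist(ttl_seconds=7 * 24 * 3600)
blocklist.add("bgbyhn.in.ua", now=0)
blocklist.add("sha256:deadbeef", now=500_000)  # placeholder, not a real hash
```

Expiring old entries keeps the list small enough for fast lookups and stops it from blocking domains or addresses that have since been cleaned up or reassigned.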
Ongoing Infection Detection
• How it works
– Detect suspicious internal behavior
– Develop a normal behavioral model for the target customer site
– Detect abnormal authentication behavior, e.g., Kerberos, LDAP
– Detect suspicious data movement
– Detect suspicious port usage
– Detect tunnels
• It is highly important to leverage Big Data to develop a sustainable
normal behavioral model and update it constantly, because network data and models are
constantly changing
• Consult with security experts to define the measurement points
In-network infection propagation
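A very small version of a normal behavioral model is a per-host record of which destination ports have been seen before. The sketch below is only an illustration of suspicious port usage detection; the hosts, ports, and function names are hypothetical, and a realistic model would also track volume, timing, and peers.

```python
from collections import defaultdict

def build_baseline(history):
    """Learn which destination ports each host normally uses."""
    baseline = defaultdict(set)
    for host, port in history:
        baseline[host].add(port)
    return baseline

def unusual_ports(baseline, observations):
    """Return (host, port) pairs that were not seen during the baseline period."""
    return [(h, p) for h, p in observations if p not in baseline.get(h, set())]

# Hypothetical training window and current traffic.
history = [("10.0.0.5", 80), ("10.0.0.5", 443), ("10.0.0.6", 443), ("10.0.0.6", 22)]
today = [("10.0.0.5", 443), ("10.0.0.5", 6667), ("10.0.0.6", 22), ("10.0.0.9", 445)]

baseline = build_baseline(history)
alerts = unusual_ports(baseline, today)
```

Since the network keeps changing, the baseline has to be rebuilt or extended regularly; otherwise every newly deployed service shows up as an alert.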
Post Infection Detection
• HTTP and DNS are the most frequently abused protocols
– Firewalls let traffic on these ports through
– If needed, play man in the middle for SSL data inspection
• Ill-formed HTTP header detection
– Abnormal location
– Abnormal referrer
– Abnormal User-Agent
– Abnormal size
• Abnormal HTTP POST detection (e.g., entropy analysis)
• Ill formed XML / HTML
• SQL Injection
– SELECT * FROM users WHERE name = '' OR '1'='1';
• LDAP Code Injection
Protocol Abnormality
Collect malware expression samples → Develop feature set with Hadoop and SME → Deploy and continually update the model
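Entropy analysis of an HTTP POST body can be sketched with Shannon entropy over byte values. The two sample bodies are made up: one is an ordinary form submission, and the other uses every byte value once as a stand-in for encrypted or compressed data.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

form_post = b"username=alice&comment=hello+world"
packed_post = bytes(range(256))  # stand-in for encrypted or compressed data

form_entropy = shannon_entropy(form_post)
packed_entropy = shannon_entropy(packed_post)
```

Plain form data has repeated characters and scores low, while encrypted or compressed payloads approach 8 bits per byte. Legitimate uploads of images or archives also score high, so entropy works as one feature among several rather than a verdict.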
Post Infection Detection
• Click Fraud
• Like Fraud
• DDoS
• SPAM
Volumetric Abnormality
Post Infection Detection
• Cadence
• Weird domain name resolution
• Fast Fluxing domain names
• Abnormal IRC traffic behavior
• Abnormal Twitter behavior
• Abnormal Facebook behavior
Command and Control Contact
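Cadence can be measured as how regular the gaps between contacts are. The sketch below uses the coefficient of variation of inter-arrival times, which is one possible choice rather than an established detection rule; the timestamps (in seconds) are hypothetical.

```python
from statistics import mean, pstdev

def cadence_score(timestamps):
    """Coefficient of variation of the gaps between contacts.

    Values near 0 mean clockwork-like timing. Returns None when there are
    fewer than two gaps or the mean gap is zero.
    """
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None
    return pstdev(gaps) / mean(gaps)

beacon = [0, 60, 120, 181, 240, 300]  # roughly one contact every 60 seconds
browsing = [0, 2, 3, 45, 46, 300]     # bursty, human-driven traffic

beacon_score = cadence_score(beacon)
browsing_score = cadence_score(browsing)
```

Bots that add random jitter to their check-in interval raise this score, so a low value is a strong hint while a high value proves little.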
DGA
ClickSecurity.com
What features would you use?
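As one possible starting point for the question, the sketch below computes a few lexical features that tend to differ between human-chosen names and algorithmically generated ones. The feature choice is an assumption, not a recommendation from the talk, and the code looks only at the leftmost label of the name. The two sample domains come from the Fiesta EK example earlier in the deck.

```python
VOWELS = "aeiou"

def domain_features(domain):
    """Lexical features of the leftmost label of a domain name."""
    label = domain.lower().split(".")[0]
    letters = [c for c in label if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    digits = sum(1 for c in label if c.isdigit())
    longest_run = run = 0
    for c in label:
        # Extend the current run of consonants, or reset it on any other character.
        run = run + 1 if c.isalpha() and c not in VOWELS else 0
        longest_run = max(longest_run, run)
    return {
        "length": len(label),
        "vowel_ratio": vowels / len(letters) if letters else 0.0,
        "digit_ratio": digits / len(label) if label else 0.0,
        "longest_consonant_run": longest_run,
    }

random_looking = domain_features("bgbyhn.in.ua")
readable = domain_features("toonzone.net")
```

Features like these would feed a classifier or clustering step together with DNS behavior such as NXDomain rates; on their own they misfire on abbreviations and non-English names.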
Conclusion
Conclusion
• Threat Intelligence and Big Data are very hot
• Big Data is the ideal analysis platform for malware expression analysis
– Caution: remember the cons
– Useful for efficiently connecting the dots
• Big Data enables
– Persistent model building and updating
– Reducing false positives through exhaustive data checks rather than spot checks
• Hadoop / Spark provide an ideal platform for malware expression analysis
– Spark provides strong in-memory processing power for complex malware data analysis
with simpler scripting-level coding
• Scala
– MapR provides the fastest data access on Hadoop nodes
• M7
• MapR is the better Hadoop
• Don’t underestimate NFS and volume convenience
• Questions are welcome; send them to syoon@maprtech.com,
mvasquez@maprtech.com, nestrada@maprtech.com