Wrangle 2016: Malware Tracking at Scale
© 2016 Cloudera, Inc. All rights reserved.
About me
• Michael Bentley
• Formerly Director of Research and Response @ Lookout
• Currently working on data mining projects
• KK6WCN
• michael@setnorth.com
Agenda
• What we are trying to accomplish
• How basic heuristics work
• Where basic heuristics don’t work
• Tracking with pairwise similarity and EMR
• Visualizations to help extract more information
• Mistakes and caveats
What are we trying to accomplish
• Searching for major versions of software (malware)
• Find ways to detect it with simple heuristics
• Find ways to track it
• Dataset discovery
Simple heuristics
• Detect on static data
• Detect on metadata created by the analysis stack
Diagram labels: applications, analysis, acquisition
• Hashes
• Strings
• Who signed it / certificate
Simple heuristics - hashes
APK file
Hashes
Icon
Dex File
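A minimal sketch of the hash heuristic, assuming the APK is read as a ZIP archive and SHA-1 is the digest. The helper names `apk_hashes` and `build_demo_apk`, and the demo file contents, are illustrative and not from the talk.

```python
import hashlib
import io
import zipfile


def apk_hashes(apk_bytes):
    """Return SHA-1 digests for the whole APK and for each file inside it."""
    hashes = {"<apk>": hashlib.sha1(apk_bytes).hexdigest()}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for name in archive.namelist():
            hashes[name] = hashlib.sha1(archive.read(name)).hexdigest()
    return hashes


def build_demo_apk(dex, icon):
    """Build a tiny in-memory ZIP standing in for an APK (hypothetical contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("classes.dex", dex)
        archive.writestr("res/icon.png", icon)
    return buffer.getvalue()


first = apk_hashes(build_demo_apk(b"dex-code-v1", b"icon-bytes"))
second = apk_hashes(build_demo_apk(b"dex-code-v2", b"icon-bytes"))

# Files whose digests match across the two packages (the whole-APK hash is skipped).
shared = sorted(
    name for name in first
    if name != "<apk>" and first[name] == second.get(name)
)
print(shared)
```

Hashing the parts separately means a repackaged app with a changed dex file can still be linked through an unchanged icon, even though the whole-file hash differs.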
Simple heuristics - string detection
• Nice ASCII string delimited by null bytes
• Malicious class path
• Byte code
• Exact match in one or both directions of string
• Ctrl + F
Figure label: Null byte
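A rough sketch of string detection, assuming the input is raw bytes and that a "nice" string is a run of printable ASCII sitting between null bytes. The function names, the minimum length of 4, and the sample class path are assumptions made for illustration.

```python
def null_delimited_strings(data, min_length=4):
    """Return printable-ASCII chunks found between null bytes."""
    strings = []
    for chunk in data.split(b"\x00"):
        printable = all(0x20 <= byte <= 0x7E for byte in chunk)
        if len(chunk) >= min_length and printable:
            strings.append(chunk.decode("ascii"))
    return strings


def has_exact_string(data, signature):
    """Exact-match heuristic: the signature must equal one extracted string."""
    return signature in null_delimited_strings(data)


# Hypothetical blob: two binary bytes, a class path, one control byte, a short word.
blob = b"\x01\x02\x00Lcom/example/evil/Payload;\x00\x07\x00hello\x00"
print(null_delimited_strings(blob))
print(has_exact_string(blob, "Lcom/example/evil/Payload;"))
```

An exact match like this is cheap, which is the appeal, but a single changed character in the class path defeats it.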
Simple heuristics - certificates
• Same malware
• Different certificates
Where simple heuristics are good
• Good for things that don’t change
• Computationally cheap
• About the same scenario for network (IDS) or application inspection (malware detection)
Where it’s problematic
• Anything with funding or making money
• Malware created in Eastern Europe, Asia, Italy (Hacking Team)
• Mass creation of certificates
• Code taken from Stack Overflow
• Anything with basic string obfuscation
• Hunting for new major versions
Enter pairwise similarity
You’re about to see a spreadsheet at a big data conference
http://gunshowcomic.com/648
Application pairwise similarity
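A minimal sketch of how a pairwise similarity table could be built. The talk does not state which similarity measure was used, so Jaccard similarity over per-app feature sets is a stand-in here; the app names and features are made up.

```python
from itertools import combinations


def jaccard(left, right):
    """Jaccard similarity of two sets: shared items over all items."""
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def pairwise_similarity(features):
    """Score every unordered pair of apps; keys are (name, name) tuples."""
    scores = {}
    for left, right in combinations(sorted(features), 2):
        scores[(left, right)] = jaccard(features[left], features[right])
    return scores


# Hypothetical feature sets, e.g. method names pulled from each app.
apps = {
    "app_a": {"init", "connect", "send_sms", "exfiltrate"},
    "app_b": {"init", "connect", "send_sms", "exfiltrate", "sleep"},
    "app_c": {"init", "draw", "play_sound"},
}

scores = pairwise_similarity(apps)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Each row of the resulting table is one pair of apps with a score between 0 and 1, which is the spreadsheet-style view this slide refers to.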
Go from “pick one app and rescan corpus”
Pick one application – Rescan corpus
• Examine one app
• Find heuristic
• Rescan corpus
• Rinse, repeat, ad infinitum
• Throw people at the problem
http://bit.ly/2a0zcZR
Decoding what you already have
• Pairwise similarity defines the relationships for us
• Dots represent unique (SHA1) applications
• Colors represent major versions of malware
• Each color groups applications within a ~85% match on code distance
Clustering and intelligence
Diagram: seven APK nodes grouped by similarity
• Nearest neighbor: 95% similar
• Cluster 1: 85% similar
• Cluster 2: 85% similar
• Cluster 0: < 85% similar
• APKs are nodes; the similarity between them forms the edges
• Clusters are neighborhoods
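One way to picture "clusters are neighborhoods": keep only edges at or above a similarity threshold and treat each connected group as a cluster. This is a dependency-free sketch using union-find, not necessarily the approach used in the talk; the 0.85 default mirrors the ~85% figure on the slides, and the node names and scores are invented.

```python
def cluster_by_threshold(nodes, scores, threshold=0.85):
    """Group nodes connected by edges whose score is >= threshold."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (left, right), score in scores.items():
        if score >= threshold:
            parent[find(left)] = find(right)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical similarity scores between five APKs.
edges = {
    ("a", "b"): 0.95,
    ("b", "c"): 0.86,
    ("c", "d"): 0.40,
    ("d", "e"): 0.90,
    ("a", "e"): 0.10,
}
clusters = cluster_by_threshold(["a", "b", "c", "d", "e"], edges)
print(clusters)
```

Note that connected components chain: "a" and "c" land in the same cluster through "b" even if they are not directly similar, so the threshold needs care.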
Evolution of malware over time
• By taking the clustering data and overlaying it with the packaged-at date, we can watch malware evolve over time
• Color represents major version
• Time is a 4-month sliding window
• Shows iterations from malware writers
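A small sketch of the sliding-window idea, assuming each app carries a packaged-at date and a cluster label and that the window advances one month at a time (the step size is an assumption). The function names and sample records are hypothetical.

```python
from collections import Counter
from datetime import date


def month_index(day):
    """Map a date to a running month number so windows are easy to compare."""
    return day.year * 12 + day.month - 1


def sliding_window_counts(apps, window_months=4):
    """Count apps per cluster label in each window, stepping one month at a time.

    `apps` is a list of (packaged_at, label) pairs. Returns a list of
    ((year, month) of window start, {label: count}) tuples.
    """
    if not apps:
        return []
    indexes = [month_index(day) for day, _ in apps]
    first, last = min(indexes), max(indexes)
    final_start = max(first, last - window_months + 1)
    windows = []
    for start in range(first, final_start + 1):
        counts = Counter(
            label
            for (_, label), index in zip(apps, indexes)
            if start <= index < start + window_months
        )
        windows.append(((start // 12, start % 12 + 1), dict(counts)))
    return windows


# Hypothetical records: (packaged-at date, major-version label).
records = [
    (date(2015, 1, 10), "v1"),
    (date(2015, 2, 3), "v1"),
    (date(2015, 4, 20), "v2"),
    (date(2015, 5, 2), "v2"),
    (date(2015, 6, 15), "v2"),
]
windows = sliding_window_counts(records)
for start, counts in windows:
    print(start, counts)
```

Reading the windows in order shows one label fading out while another takes over, which is the version-to-version iteration described above.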
Pairwise problems and options
• Comparing 3500 applications is 12,250,000 operations
• As you bring more applications in, expect to scale the EMR cluster or reduce n.
• You can overmatch on similarity – outlier issue
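The 12,250,000 figure is n² for n = 3500, that is, every app compared against every app. The sketch below shows how that count grows, and how many comparisons remain if the similarity measure is symmetric so that each unordered pair is scored once (an assumption; the talk does not say which was done).

```python
def full_matrix_comparisons(n):
    """Every app against every app, including itself: n squared."""
    return n * n


def unique_pair_comparisons(n):
    """Each unordered pair once, no self-comparisons: n(n - 1) / 2."""
    return n * (n - 1) // 2


for n in (3500, 7000):
    print(n, full_matrix_comparisons(n), unique_pair_comparisons(n))
```

Doubling the corpus roughly quadruples the work either way, which is why the choice is between a bigger cluster and a smaller n.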
Tripping over the bar
• Pairwise similarity for 7k apps is about 5 GB.
• So is S3's limit for a single upload
• Things go bad when you don’t respect the bucket size
• Troubleshooting CSV sizes is a thing
• Doesn’t work well on small applications
• Temporary files on your local machine that are 70 GB cause problems
Knowledge
• I had never used NetworkX before ~2014
• I had no idea how to go from what we had into a decent format for visualizing this (GraphML).
• Almost no experience in graph theory before ~2014
• Gilad Lotan had a great PyCon talk which got me started. I still reference his talks.
• Gephi is a great shortcut for visualizing in 2D if you aren’t familiar with D3
• Seth Hardy gave tons of amazing feedback while I was learning
• Jack Urban proved that it was possible to track applications as a network
• The Gensim library is a great way to get started in doing comparisons of applications
• Lots of inspiration from the Defcon 22 OpenDNS talk (theirs is better)