Talha Obaid, Email Security, Symantec at MLconf ATL 2017

Machine Learning for Detecting Malware
Talha Obaid Ling Zhou Timothy You Xinlei Cai
MLConf – Atlanta Sep 2017
Email Security
Scripting

Copyright © 2017 Symantec Corporation SYMANTEC PROPRIETARY– Limited Use Only
The Team!
Ling Zhou
Timothy You
Xinlei Cai
Talha Obaid

Machine Learning @ Symantec
• Early adopter of ML in industry
• SRL – Symantec Research Labs
• CAML – Centre for Advanced Machine Learning
• Malware detection, spam identification
• Helped achieve the compounded impact
• Malware polymorphism
https://www.symantec.com/connect/blogs/meet-symantec-labs-industrys-best-kept-secret

Reference:
https://www.symantec.com/connect/blogs/machine-learning-not-only-answer
How did I get infected?

Email is the weapon of choice!
• One in 131 emails contained a malicious link or attachment, the highest rate in five years
• The rate jumped from 1 in 220 emails in 2015 to 1 in 131 emails in 2016
• In 2016, small and medium-sized businesses were the most impacted by phishing attacks, with 1 in 95 emails containing malware
• Email sent daily in 2016 – 269 billion*
• The typical office worker receives an average of 600 emails per week*
• Blended attacks - Email as a carrier for malicious URLs
• Office document files are an effective weapon
• Lighter footprint and hiding in plain sight
Reference:
https://www.symantec.com/security-center/threat-report
* Email Statistics Report, 2017-2021, Radicati Group, February 2017

Worldwide Email Forecast, 2017–2021

                                                     2017     2018     2019     2020     2021
Worldwide Email Users* (M)                          3,718    3,823    3,930    4,037    4,147
% Growth                                                        3%       3%       3%       3%
Total Worldwide Emails Sent/Received Per Day (B)    269.0    281.1    293.6    306.4    319.6
% Growth                                                      4.5%     4.4%     4.4%     4.3%

* Includes both Business and Consumer Email users
Charts: Worldwide Email User Forecast (M), 2017–2021; Worldwide Daily Email Traffic (B), 2017-2021
Reference: https://www.radicati.com/wp/wp-content/uploads/2017/01/Email-Statistics-Report-2017-2021-Executive-Summary.pdf
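The growth percentages in the forecast can be reproduced from the yearly totals. The following is a minimal sketch; the variable and function names are illustrative and not part of the Radicati report.

```python
# Yearly totals copied from the forecast (2017 through 2021).
users_millions = [3718, 3823, 3930, 4037, 4147]
emails_per_day_billions = [269.0, 281.1, 293.6, 306.4, 319.6]


def yearly_growth_percent(values):
    """Return year-over-year growth in percent for consecutive values."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(values, values[1:])]


# Users are reported as whole percentages, traffic to one decimal place.
user_growth = [round(g) for g in yearly_growth_percent(users_millions)]
traffic_growth = [round(g, 1) for g in yearly_growth_percent(emails_per_day_billions)]

print(user_growth)
print(traffic_growth)
```

Each growth figure belongs to the later year of the pair, which is why there are four percentages for five years.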

Email: Locky malware delivery vector
Reference:
https://www.symantec.com/security-center/threat-report
http://www.latimes.com/business/technology/la-me-ln-hollywood-hospital-bitcoin-20160217-story.html
https://arstechnica.com/information-technology/2016/02/locky-crypto-ransomware-rides-in-on-malicious-word-document-macro/
• Released in 2016
• Still active in 2017
• “Enable macro if data encoding is incorrect”
• If the user does enable macros, the macros then save and run a
binary file that downloads the actual encryption Trojan
• Hospital in Hollywood paid $17,000 in bitcoin to hackers

Scripting Malware – real ones!

Exempli Gratia
AutoClose, Random variable, String split

Fake variable
Fake comment
Fake condition

Multiple Function
String split

String encryption
Random variable
Function Call hidden

String Encryption
Random variable
Multi function
Click event

String hidden
Fake condition

Machine Learning for
hand-written text!

Domain Differences
Programming Language
• Non-Ambiguous
• Deterministic language
• Clear distinction between syntax and semantics
• Semicolons, Tabs vs Spaces, Editor wars
• Identifier, sub routine calls, imports
• Comments, conventions, notations
• Design patterns
Natural Language
• Ambiguous
• Context-bound languages
• Less clear distinction between syntax and semantics
• Puns, Rants, Parodies, Imitations
• TF-IDF
• LSTM – Long short-term memory
• Bag of words

Machine Learning Applications – Code!
Automatic Patch Generation by Learning Correct Code, by Fan Long and Martin Rinard (POPL 2016)
Reference:
https://www.newscientist.com/article/mg23331144-500-ai-learns-to-write-its-own-code-by-stealing-from-other-programs/
http://people.csail.mit.edu/rinard/paper/popl16.pdf

https://www.forbes.com/sites/adrianbridgwater/2016/03/07/machine-learning-needs-a-human-in-the-loop
https://blogs.technet.microsoft.com/machinelearning/2016/10/17/the-power-of-human-in-the-loop-combine-human-intelligence-with-machine-learning/
Human-In-The-Loop?

[Pipeline diagram: Email → Inflation → Macro Extraction → Parsing → Feature Extraction → Analyze (Rule ^ ML)]

Feature Selection (Total 72 Features)
ML_1... ML_12…
ML_2... ML_13…
ML_3... ML_14…*
ML_4... ML_15…
ML_5... ML_16…
ML_6... ML_17…
ML_7… ML_18…
ML_8… ML_19…
ML_9… ML_20…
ML_10… ML_21…*
ML_11… …
Note: Features with (*) can be expanded to the count of each item.
ML_21_1… ML_14_1…
ML_21_2… ML_14_1…
ML_21_3… ML_14_1…
ML_21_4… ML_14_1…
ML_21_5… ML_14_1…
ML_21_1… ML_14_1…
ML_21_1… ML_14_1…
ML_21_1… ML_14_1…
ML_21_1… ML_14_1…
ML_21_1… ML_14_1…
… 29 features … 21 features
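The note about starred features, where one feature expands into a count per item, could look roughly like the sketch below. The slide does not name the real features or their items, so the `ML_X` prefix, the keyword list, and the function name are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical item list: the slide does not say which items ML_14 or ML_21 track.
SUSPICIOUS_KEYWORDS = ["CreateObject", "Shell", "Chr", "SaveToFile"]


def expand_count_feature(prefix, items, macro_source):
    """Expand one starred feature into per-item count features.

    Returns {"<prefix>_1": n1, "<prefix>_2": n2, ...}, where n_i is how often
    items[i - 1] occurs as a whole identifier. VBA is case-insensitive, so the
    comparison is too.
    """
    tokens = Counter(
        t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", macro_source)
    )
    return {f"{prefix}_{i}": tokens[item.lower()] for i, item in enumerate(items, start=1)}


sample = 'Set o = CreateObject("WScript.Shell")\no.Run Chr(99) & Chr(109) & Chr(100)'
features = expand_count_feature("ML_X", SUSPICIOUS_KEYWORDS, sample)
print(features)
```

Fixing the item order up front keeps the feature vector the same width for every sample, including items that never occur.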

Optimization
#       ML_1… (Composite)   ML_2…   ML_3…   ML_4…   ML_14_3…
1       31469               1245    35      211     0
2       44617               1264    14      171     0
3       33247               1045    14      158     0
…       …                   …       …       …       …
1234    18828               682     29      222     1
…       …                   …       …       …       …
40000   1273048             844     19      151     0
• Treat the ML_1… feature, since it is a composite that depends on other features.
• Treat features like ML_14_3…, since they are categorical.

Spam run – from Aug 21 to Aug 27
{
"desc": "Shell call",
"artifact": " Shell "Explorer.exe " & strCommande, vbNormalFocus, "
},

Just this morning … 15 Sep 2017

Recently captured…
{
"desc": "Small routine with string manipulation",
"artifact": " Chinook = (AscB(Sumatran_Rhinoceros))"
"artifact": " Tapir = Chinook(Mid(Sand_Lizard, Chipmunk, 1)) - Int(M..."
},
{
"desc": "Small routine with run & Obfuscated object concat & Obfuscated object creation arguments shell & Createobject run one-liner",
"artifact": " CreateObject(Pig + "Shell").Run Module1.Ibis(Sea_Dragon, ""
},

{
"desc": "Obfuscated object variable",
"artifact": "Set
miLxhuTjOMrpjvLQQNhstoiWlCkOdozYkasyizjweDRGlKRkgtkgxHZyAoLfJFFaMSFJDNiRekNpWbkbkzhjETbcA
tytnDmZxruTFIhTLSCM = CreateObject(ujcYEkvJXWWtqcIKOpdaxorehRVbSNYlQPiQQao"
},
{
"desc": "Obfuscated object creation arguments",
"artifact": "Set qvBvooYSTaFymchvnZIkLUSrhheHIwfYCSyrpgvjePoCKWbhMYoOBOJVcKO =
CreateObject(kbUBGIKqbHJyTmAmPbuHSBjqouVxfwCfSfEWfcNXxXYAhCJKXcegnoejsdNMnNKeFdfnieGnOXJv
cjJlkKZDSV"
},
{
"desc": "Long obfuscated variable assignment",
"artifact": "ZGwEiLSTkOsQSFcFzZVPMMuHalgKESzgWlohddzbmveToRIxzt"
},

{
"desc": "Macro with constant manipulation in function call",
"artifact": "dNDfJESUPztgDlcNnWNZLIPsGgXDVndgUDYaarDOIWeCVstlSACjSVcUyLZ =
CWvXJUNlxQcbDqNtnmQhCsifqGFBSHE$(327 - 240) & CWvXJUNlxQcbDqNtnmQhCsifqGFBSHE$(324 - 241) & CWvXJUNl…"
},
{
"desc": "Highly random long string found",
"artifact": "mRClEXzmRGxUqDPLJHcHeEMgjtqozQbuXXYIpdNJOtykVB"
},
{
"desc": "Object creation variable identifier",
"artifact": "qvBvooYSTaFymchvnZIkLUSrhheHIwfYCSyrpgvjePoCKWbhMYoOBOJVcKO"
},
{
"desc": "Random subroutine name",
"artifact": "dnHLjlClNBEYNnZihnFPOighaDbyTOUim"
},
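One plausible way to flag "highly random" strings and identifiers like those above is character-level Shannon entropy combined with a length check. This is a sketch under that assumption; the slides do not say which randomness measure is used, and both thresholds are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(identifier, min_length=30, min_entropy=4.0):
    """Hypothetical heuristic: long identifiers with high character entropy."""
    return len(identifier) >= min_length and shannon_entropy(identifier) >= min_entropy


print(looks_random("mRClEXzmRGxUqDPLJHcHeEMgjtqozQbuXXYIpdNJOtykVB"))
print(looks_random("strCommande"))
```

Entropy is bounded by log2 of the string length, so short names can never score high; the length check makes that explicit rather than leaving it implicit in the entropy threshold.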

{
"desc": "Random identifier with suspicious assignments",
"artifact":
"ujcYEkvJXWWtqcIKOpdaxorehRVbSNYlQPiQQaoCIdBbVAdczWFVpbOGsxrmOTqKykcaurtoAaRUmQJgntcvICwoBcYTiBopmrc
kXChHdQUOKtTcnKzV = Chr$(327 - 240) & Chr$(324 - 241) & Chr$(24…"
},
{
"desc": "Shell/SaveToFile string contains strange variable name",
"artifact":
"RhIzeRHLbzssvNwesaErYKfXuynMPZjWdUBgPAZZUnlhknaNjNAQERoHClFgeuvBPWPbMQPsAeXlYymHXZdCZTRMfteev"
},
{
"desc": "File with following name was created and run created",
"artifact": "XABNAGkAYwByAG8AcwBvAGYAdAA=XABxAGIASwBWAEsAdgBsAGgAdwBpAEoAUgBLAC4AZQB4AGUA"
},
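The `Chr$(327 - 240) & Chr$(324 - 241)` pattern in the artifacts above hides a string behind constant arithmetic. A static analyzer could fold such chains back into text, as in this minimal sketch. The function name and regular expression are illustrative, and only decimal literals with a single subtraction are handled.

```python
import re

# Matches Chr(n), Chr$(n), Chr(a - b) and Chr$(a - b) with decimal literals only.
CHR_CALL = re.compile(r"Chr\$?\(\s*(\d+)\s*(?:-\s*(\d+)\s*)?\)", re.IGNORECASE)


def fold_chr_concat(expression):
    """Decode a chain like 'Chr$(327 - 240) & Chr$(324 - 241)' into a string."""
    chars = []
    for match in CHR_CALL.finditer(expression):
        value = int(match.group(1)) - int(match.group(2) or 0)
        chars.append(chr(value))
    return "".join(chars)


# Only the visible prefix of the artifact is used; the rest is truncated on the slide.
print(fold_chr_concat("Chr$(327 - 240) & Chr$(324 - 241)"))  # → WS
```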
And… we capture a lot more!
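The last artifact on this slide appears to be two Base64 strings placed back to back. The sketch below splits them at the `=` padding and decodes each as UTF-16LE. The encoding is an assumption that happens to yield readable text, and the helper name is hypothetical.

```python
import base64
import re

ARTIFACT = "XABNAGkAYwByAG8AcwBvAGYAdAA=XABxAGIASwBWAEsAdgBsAGgAdwBpAEoAUgBLAC4AZQB4AGUA"


def decode_base64_chunks(blob, encoding="utf-16-le"):
    """Split a blob into Base64 chunks (a chunk ends at its '=' padding) and decode each."""
    chunks = re.findall(r"[A-Za-z0-9+/]+={0,2}", blob)
    return [base64.b64decode(chunk).decode(encoding) for chunk in chunks]


parts = decode_base64_chunks(ARTIFACT)
print(parts)
```

The two parts decode to a folder name and an executable file name. Splitting on padding only works when the first chunk actually ends in `=`; unpadded chunks that are concatenated cannot be separated this way.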

Findings & Going Forward …
• “If an artifact is missing” means a sample is missed – not anymore
• All features contribute to the verdict in unison
• Obfuscation is still a challenge and will remain one
• Identify why a variable of string type is assigned a byte array
• Identify why an assignment expression is longer than, say, 200 characters
• Keep transitioning inflating malware samples from sandbox to static analysis
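The question about overly long assignment expressions lends itself to a simple static check. The sketch below is hypothetical: the regular expression is illustrative, the 200-character threshold echoes the "say 200" on the slide, and VBA line continuations (`_`) are not handled.

```python
import re

# Optional "Set", an identifier, "=", then the right-hand side of the assignment.
ASSIGNMENT = re.compile(
    r"^\s*(?:Set\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.+)$", re.IGNORECASE
)


def long_assignments(macro_source, max_expr_len=200):
    """Yield (variable, expression_length) for assignments with a long right-hand side."""
    for line in macro_source.splitlines():
        match = ASSIGNMENT.match(line)
        if match and len(match.group(2)) > max_expr_len:
            yield match.group(1), len(match.group(2))


sample = "\n".join([
    'x = "short"',
    "payload = " + " & ".join(f"Chr$({300 + i} - 240)" for i in range(20)),
])
print(list(long_assignments(sample)))
```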

Thank You!
Talha Obaid Ling Zhou Timothy You Xinlei Cai
Email Security
Join us!
www.symantec.com/about/careers
