A Machine Learning approach for detecting Malware:
This project aims to improve the way we detect script-based malware using Machine Learning. Script-based malware has become one of the most active channels for delivering threats such as Banking Trojans and Ransomware. The talk is aimed at finding a new and effective way to detect this malware. We started by acquiring both malicious and clean samples. Next we performed feature identification, building on top of the existing knowledge base of malware, followed by automated feature extraction. Once a feature set was obtained, we teased out the features that are categorical, interdependent or composite. We applied a variety of machine learning models, producing both binary and categorical outcomes. We cross-validated our results and re-tuned our feature set and our model until we obtained satisfactory results with the fewest false positives. We concluded that not all the extracted features are significant; in fact, some features are detrimental to model performance. Once such features are factored out, the result is not only a better match but also a significant gain in performance.
32. Thank You!
Talha Obaid Ling Zhou Timothy You Xinlei Cai
Email Security
Join us!
www.symantec.com/about/careers
Editor's Notes
Domain experts – Malware classes
Malware engine experts
ML guy
Though we have chats and messengers
Locky malware
Dealt so far regarding NLP
Extracting features from samples.
Shuffling the samples and splitting them into two sets: training and testing.
Applying machine learning algorithms on those samples.
Validating the Machine Learning results.
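The shuffling and validation steps above can be sketched roughly as follows. This is a minimal illustration and not the pipeline used in the talk: the `obfuscated_calls` feature, the toy data, the 25% test fraction and the threshold rule standing in for a trained model are all assumptions made for the example.

```python
import random

def shuffle_split(samples, test_fraction=0.25, seed=0):
    """Shuffle labelled samples and split them into training and testing sets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(samples)           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def false_positive_rate(predict, test_set):
    """Fraction of clean samples (label 0) that the model flags as malicious (1)."""
    clean = [features for features, label in test_set if label == 0]
    if not clean:
        return 0.0
    flagged = sum(1 for features in clean if predict(features) == 1)
    return flagged / len(clean)

# Hypothetical toy data: (feature dict, label), where 1 = malicious, 0 = clean.
samples = [({"obfuscated_calls": i % 5}, 1 if i % 5 >= 3 else 0) for i in range(40)]
train, test = shuffle_split(samples)

# Placeholder rule standing in for a model trained on `train` (an assumption).
def predict(features):
    return 1 if features["obfuscated_calls"] >= 3 else 0

fpr = false_positive_rate(predict, test)
```

In practice the placeholder rule would be replaced by a model fitted on the training set only, with the false-positive rate measured on the held-out testing set.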
Machine learning is not a black box, and we have to be very deliberate about the features we use.
The features used should be independent of one another, and each should be related to the verdict, i.e. the left side.
We realized that two of these features were problematic: VBA size was a composite feature, whereas CByte was categorical, which forces sub-grouping within the model.
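Two simple checks illustrate the point about categorical and composite features. This is only a sketch under stated assumptions: the values given for CByte, the `macro_len` and `comment_len` columns, and the idea that VBA size is the sum of those two parts are all hypothetical and chosen for the example.

```python
import math

def one_hot(values):
    """Expand one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                      # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical categorical values for CByte (the real categories are not given).
cbyte = ["A", "B", "A", "C"]
encoded = one_hot(cbyte)                # one 0/1 column per category

# Hypothetical components; VBA size is ASSUMED here to be their sum.
macro_len = [100, 250, 400, 800]
comment_len = [10, 20, 30, 40]
vba_size = [m + c for m, c in zip(macro_len, comment_len)]

# A correlation close to 1 suggests the composite feature adds little
# beyond the component it is built from.
r = pearson(vba_size, macro_len)
```

One-hot encoding makes the sub-grouping explicit instead of leaving it implicit in a single column, and a high pairwise correlation is one quick signal that a feature is not independent of the others.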