SlideShare a Scribd company logo
1 of 8
Download to read offline
Application of Data Mining based Malicious Code
Detection Techniques for Detecting new Spyware
Cumhur Doruk Bozagac
Bilkent University, Computer Science and Engineering Department,
06532 Ankara, Turkey
dorukb@cs.bilkent.edu.tr
Abstract. Over the past few years, a new computer security problem
has arisen, Spyware. Anti-Virus techniques cannot deal with spyware,
due to their silent infection techniques and their differences from a
regular virus. Various Anti-Spyware programs have been implemented
as a counter-measure but most of these programs work using the
signature method, which is weak against new spyware. Schultz[1] has
presented a data-mining framework that detects new, previously unseen
malicious executables. However his paper was published in 2001 and
there was no spyware in their training or test dataset. This paper takes
Schultz[1]’s work as a candidate and applies its techniques against a
new spyware dataset collected in 2005, to see whether their techniques
can be used against this new threat.
1 Introduction
Spyware is any software installed on a computer without the user's
knowledge, that gathers information about that user for later retrieval by
whomever controls it. There are two types of spyware: malware and adware.
Malware is any program that gathers personal information from the user’s PC.
Key loggers, screen capture devices, and Trojans are in this category. Adware
is a program designed for showing a user advertisements, like homepage
hijackers, pop-up windows and search page hijackers.
Spyware poses several risks. The most conspicuous is compromising a user’s
privacy by transmitting information about that user’s behavior. However,
spyware can also detract from the usability and stability of a user’s computing
environment, and it has the potential to introduce new security vulnerabilities
to the infected host. Because spyware is widespread, such vulnerabilities
would put millions of computers at risk.
1
Spyware programs have a series of traits that makes them difficult to detect
and enables them to install themselves on numerous PCs for long periods of
time[2]:
- They use almost perfect camouflage systems. They are normally
installed on computers along with another kind of application: a P2P
client, a hard disk utility…
- File names don’t normally give any clues as to the real nature of the
file, and so they often go unnoticed along with other legitimate
application files.
- As they are not viruses, and they don’t use a routine that can be
associated with them, antivirus programs don’t detect them unless, they
are designed specifically to do so.
Panda Anti-Virus program has recently released a new online virus scanner,
which can also detect spyware. According to their reports[2], in the first 24
hours of the operation, 84 percent of malicious code detected was spyware
and the first 74 most detected malicious code were all spy programs.
Spyware can be detected through the effects it has on systems: use of system
resources resulting in a slowdown of the PC; high level of Internet connection
bandwidth consumption; complete loss of the connection or general system
instability. They also often change settings of certain applications, such as the
browser home page, or insert icons on the desktop. The main consequence of
spyware on PCs is the gathering of information, including confidential details,
making users feel uneasy about what is happening in their computer behind
their backs.
Current virus scanner technology has two parts[1]: a signature-based detector
and a heuristic classifier that detects new viruses. The classic signature-based
detection algorithm relies on signatures (unique telltale strings) of known
malicious executables to generate detection models. Signature-based methods
create a unique tag for each malicious program so that future examples of it
can be correctly classified with a small error rate. These methods do not
generalize well to detect new malicious binaries because they are created to
give a false positive rate as close to zero as possible. Whenever a detection
method generalizes to new instances, the tradeoff is for a higher false positive
rate. Heuristic classifiers are generated by a group of virus experts to detect
new malicious programs. This kind of analysis can be time-consuming and
oftentimes still fail to detect new malicious executables.
2
2 Methodology
Schultz[1] proposes using data mining methods for detecting new malicious
executables. They apply three different algorithms with each one having its
own feature extraction technique. The first one is RIPPER algorithm, which
they only applied on Portable Executable (PE) format data using the Portable
Executable header information extraction technique, so I skip this algorithm.
The second algorithm is Naïve Bayes algorithm using strings in the binaries as
features. This technique can be easily avoided by encoding or encrypting the
strings in a file, so this technique is weak against new malicious code. The
third algorithm is Multi-Naïve Bayes algorithm, which uses byte sequences in
a file as features. The idea is, malicious executables have common intentions
and they may have similar byte code. We can detect malicious executable by
looking at the frequency analysis of byte code in a file. Multi-Naïve Bayes
algorithm is basically a collection of Naïve Bayes algorithms for splitting the
data into sets. I will use the same feature extraction technique, but I will use
only one instance of Naïve Bayes algorithm. Source code for this method can
be found at http://www.cs.columbia.edu/ids/mef/software .
The dataset consists of 312 benign(non-malicious) executables and 614
spyware executable. The spyware collection is formed by using the virus
collection at http://vx.netlux.org and by crawling the internet using a
sandboxed operating system and manually collecting spywares. The benign
executables are collected from the system files in Windows XP operating
system and from programs of a stereotype user.
ue.
Byte sequences are extracted using the hexdump tool in Linux. For each file in
the dataset, using this tool a hexdump file is formed. When the algorithm is
run, user should specify a “window size”. Naïve Bayes algorithm can be
specified to use how many sequences of byte data. Default window size is
one, which means algorithm will look at 4 hexadecimals(2 bytes of data) for
frequency analysis. I run the algorithm for a window size of 2 and 4
separately using 5-fold cross validation techniq
3 Test Results
To evaluate our system we are interested in several
Quantities, just like Schultz[1] :
1. True Positives (TP), the number of malicious executable examples
classified as malicious executables
3
2. True Negatives (TN), the number of benign programs classified as
benign.
3. False Positives (FP), the number of benign programs classified as
malicious executables
4. False Negatives (FN), the number of malicious executables classified
as benign binaries.
The “detection rate” of the classifier is the percentage of the total malicious
programs labeled malicious. The “false positive rate” is the percentage of
benign programs which were labeled as malicious. The “overall accuracy” is
the percentage of true classifications over all data.
The first test is done using a window size of 2 and its results curve is shown in
Table 1.
Window Size = 2
0
20
40
60
80
100
0 1.5152 4.5455 14.516 22.222 27.419 48.438 51.613
False Positive Rate %
DetectionRate%
W2
Table 1. Results of the tests for Spyware dataset.
4
As you can see, high detection rate can only be gained for high false positive
rates and till 20% false positive rate the model has a detection rate lower than
80%. Then I run the algorithm with a window size of 4 and the results can be
seen in Table 2.
Window Size = 4
0
20
40
60
80
100
0 1.538 4.615 6.061 12.12 13.64 26.98 29.03 35.48 50
False Positive Rate %
DetectionRate%
W4
Table 2. Results of the tests for Spyware dataset.
Results were better for this case, but still the overall accuracy was rather low,
so I decided to investigate the reasons by looking at the classification test
results. I realized that a class of spyware called Trojans had a quite low
detection rate. These are different from ordinary spyware, since they are quite
complicated programs and their binary size were rather big compared to other
files in the dataset. I thought that these programs may be the reason for the
high false positive rate and low detection rate, so I run another test for a
window size of 2, after excluding the 59 Trojans from the dataset.
5
Window Size = 2 and without Trojans
0
20
40
60
80
100
0 3.278689 5 11.66667 22.0339 35
False Positive Rate %
DetectionRate%
W2
Table 3. Results of the tests for Spyware dataset.
As the Table 3 shows, the overall accuracy has improved for the window size
of 2 and we reach 80% detection rate before a false positive rate of 15%. Also
we score better for low false positive rates. These results encouraged for a
window size of 4 test without the Trojans in the dataset and the results of this
test are shown in Table 4.
Window Size = 4 and without Trojans
0
20
40
60
80
100
0 4.918 5 10 16.67 20.34 25 30 43.33
False Positive Rate %
DetectionRate%
W4
Table 4. Results of the tests for Spyware dataset.
6
We had 80% detection rate even before the false positive rate of 4% and
overall accuracy has been improved. Below you can see best overall accuracy
results for each run in Table 5.
TP TN FP FN
Detection
Rate
False
Positive
Rate
Overall
Accuracy
Window
Size=2
118 49 14 5 95.93 22.22 89.78
Window
Size=4
119 46 17 4 96.75 26.98 88.71
Window
Size=2
w/o
Trojan
96 46 13 15 86.49 22.03 83.53
Window
Size=4
w/o
Trojan
99 58 3 12 89.19 4.92 91.28
Table 5. Results of the tests for Spyware dataset.
The final run has the best overall accuracy and has the lowest false positive
rate, which is significantly lower than other three runs.
4 Conclusion
Spyware is a new threat and current anti-virus programs are weak against
spyware, because they use almost perfect camouflage systems and as they are
not viruses, and they don’t use a routine that can be associated with them,
antivirus programs don’t detect them unless, they are designed specifically to
do so. There are some anti-spyware programs, but they work with signature
scheme, so again they are weak against unseen spywares.
Before the tests, I had some doubt for using byte sequences as a feature. All
viruses have a common technique for infection, which is replicating
themselves onto new executables. One can argue that, the byte codes for
infection can ve a distinguishing feature for the viruses and these can be used
for detecting viruses. However, spyware does not use these techniques, so it is
difficult to separate them from normal executables in this sense. On the other
side, there is a common behaviour for all spyware, they watch the user’s
actions for reporting. Probably, this is why I get overall accuracy when I run
tests with trojans in the dataset. Trojans are not really spyware, tough they are
7
used for spying, indeed they are complete programs working as a toolbox for
hackers. As a result, after removing them I had a dataset with a common
programmatic behavour, which gives me better classifying results.
Another thing important about this technique is the window size. In the paper
Schultz[1] does not give an optimal window size, but in general we can say
that the larger window size we use, better results we get. The problem is, with
a window size of five the program needs more than one gigabyte of ram for
running and the run takes too much time and this is why I was only able to
test with a window size of four.
To conclude, applying Schultz[1]’s techniques on spyware, I was able to see
that a data mining based heuristic scheme has the potential to be used for de-
tecting new spyware.
5 Future Work
Future work involves using Multi-Naïve Bayes for using a window size of
five and six. I expect to have better overall accuracy for larger window sizes.
Furthermore Christodorescu[3] proposes a technique for testing malicious
code detector’s performance against obfuscation. Since we are using byte
sequences as our features, changes in the byte sequence will most probably
affect our results. These techniques can be used to test performance of this
spyware detection technique against obfuscated code.
6 References
1. M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, Data Mining Methods for
Detection of New Malicious Executables, Proceedings of IEEE Symposium on
Security and Privacy. Oakland, pp.38-49, 2001
2. “The silent epidemic of 2005: 84% of malware on computers worldwide is spy-
ware”. http://www.pandasoftware.com/about/press/viewnews.aspx?noticia=5968.
3. M. Christodorescu and S. Jha, Testing Malware Detectors, International Sympo-
sium on Software Testing and Analysis archive. Boston, Massachusetts, USA.
8

More Related Content

What's hot

Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques Akash Karwande
 
Monitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in InternetMonitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in InternetIOSR Journals
 
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSA FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSIJNSA Journal
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesArshadRaja786
 
IRJET- Android Malware Detection using Machine Learning
IRJET-  	  Android Malware Detection using Machine LearningIRJET-  	  Android Malware Detection using Machine Learning
IRJET- Android Malware Detection using Machine LearningIRJET Journal
 
Cyber Defense Forensic Analyst - Real World Hands-on Examples
Cyber Defense Forensic Analyst - Real World Hands-on ExamplesCyber Defense Forensic Analyst - Real World Hands-on Examples
Cyber Defense Forensic Analyst - Real World Hands-on ExamplesSandeep Kumar Seeram
 
Malware classification using Machine Learning
Malware classification using Machine LearningMalware classification using Machine Learning
Malware classification using Machine LearningJapneet Singh
 
APPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEM
APPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEMAPPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEM
APPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEMijcsit
 
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...Keith Jones, PhD
 
robust malware detection for iot devices using deep eigen space learning
robust malware detection for iot devices using deep eigen space learningrobust malware detection for iot devices using deep eigen space learning
robust malware detection for iot devices using deep eigen space learningVenkat Projects
 
Self Evolving Antivirus Based on Neuro-Fuzzy Inference System
Self Evolving Antivirus Based on Neuro-Fuzzy Inference SystemSelf Evolving Antivirus Based on Neuro-Fuzzy Inference System
Self Evolving Antivirus Based on Neuro-Fuzzy Inference SystemIJRES Journal
 
An effective architecture and algorithm for detecting worms with various scan...
An effective architecture and algorithm for detecting worms with various scan...An effective architecture and algorithm for detecting worms with various scan...
An effective architecture and algorithm for detecting worms with various scan...UltraUploader
 
Automated malware invariant generation
Automated malware invariant generationAutomated malware invariant generation
Automated malware invariant generationUltraUploader
 
DETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMS
DETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMSDETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMS
DETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMSAAKANKSHA JAIN
 
Acf.cw.la1.s01.1 example
Acf.cw.la1.s01.1 exampleAcf.cw.la1.s01.1 example
Acf.cw.la1.s01.1 exampleBazlin Ahmad
 

What's hot (20)

Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques
 
Monitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in InternetMonitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in Internet
 
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSA FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning Techniques
 
IRJET- Android Malware Detection using Machine Learning
IRJET-  	  Android Malware Detection using Machine LearningIRJET-  	  Android Malware Detection using Machine Learning
IRJET- Android Malware Detection using Machine Learning
 
Cyber Defense Forensic Analyst - Real World Hands-on Examples
Cyber Defense Forensic Analyst - Real World Hands-on ExamplesCyber Defense Forensic Analyst - Real World Hands-on Examples
Cyber Defense Forensic Analyst - Real World Hands-on Examples
 
Malware classification using Machine Learning
Malware classification using Machine LearningMalware classification using Machine Learning
Malware classification using Machine Learning
 
APPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEM
APPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEMAPPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEM
APPBACS: AN APPLICATION BEHAVIOR ANALYSIS AND CLASSIFICATION SYSTEM
 
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNIN...
 
Acf.cw.la1.s01.2
Acf.cw.la1.s01.2Acf.cw.la1.s01.2
Acf.cw.la1.s01.2
 
Toc.cw.la1.s01.2
Toc.cw.la1.s01.2Toc.cw.la1.s01.2
Toc.cw.la1.s01.2
 
robust malware detection for iot devices using deep eigen space learning
robust malware detection for iot devices using deep eigen space learningrobust malware detection for iot devices using deep eigen space learning
robust malware detection for iot devices using deep eigen space learning
 
Self Evolving Antivirus Based on Neuro-Fuzzy Inference System
Self Evolving Antivirus Based on Neuro-Fuzzy Inference SystemSelf Evolving Antivirus Based on Neuro-Fuzzy Inference System
Self Evolving Antivirus Based on Neuro-Fuzzy Inference System
 
Acf.cw.la1.s01.1
Acf.cw.la1.s01.1Acf.cw.la1.s01.1
Acf.cw.la1.s01.1
 
An effective architecture and algorithm for detecting worms with various scan...
An effective architecture and algorithm for detecting worms with various scan...An effective architecture and algorithm for detecting worms with various scan...
An effective architecture and algorithm for detecting worms with various scan...
 
Automated malware invariant generation
Automated malware invariant generationAutomated malware invariant generation
Automated malware invariant generation
 
DETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMS
DETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMSDETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMS
DETECTION OF MALICIOUS EXECUTABLES USING RULE BASED CLASSIFICATION ALGORITHMS
 
Windows 8 kasp1248
Windows 8 kasp1248Windows 8 kasp1248
Windows 8 kasp1248
 
proposal
proposalproposal
proposal
 
Acf.cw.la1.s01.1 example
Acf.cw.la1.s01.1 exampleAcf.cw.la1.s01.1 example
Acf.cw.la1.s01.1 example
 

Similar to Application of data mining based malicious code detection techniques for detecting new spyware

A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSA STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSijaia
 
Malware analysis and detection using reverse Engineering, Available at: www....
Malware analysis and detection using reverse Engineering,  Available at: www....Malware analysis and detection using reverse Engineering,  Available at: www....
Malware analysis and detection using reverse Engineering, Available at: www....Research Publish Journals (Publisher)
 
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSA FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSIJNSA Journal
 
IRJET - Survey on Malware Detection using Deep Learning Methods
IRJET -  	  Survey on Malware Detection using Deep Learning MethodsIRJET -  	  Survey on Malware Detection using Deep Learning Methods
IRJET - Survey on Malware Detection using Deep Learning MethodsIRJET Journal
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationUltraUploader
 
A generic virus detection agent on the internet
A generic virus detection agent on the internetA generic virus detection agent on the internet
A generic virus detection agent on the internetUltraUploader
 
Optimised Malware Detection in Digital Forensics
Optimised Malware Detection in Digital Forensics Optimised Malware Detection in Digital Forensics
Optimised Malware Detection in Digital Forensics IJNSA Journal
 
Automatically generated win32 heuristic virus detection
Automatically generated win32 heuristic virus detectionAutomatically generated win32 heuristic virus detection
Automatically generated win32 heuristic virus detectionUltraUploader
 
A trust system based on multi level virus detection
A trust system based on multi level virus detectionA trust system based on multi level virus detection
A trust system based on multi level virus detectionUltraUploader
 
Optimised malware detection in digital forensics
Optimised malware detection in digital forensicsOptimised malware detection in digital forensics
Optimised malware detection in digital forensicsIJNSA Journal
 
Identifying, Monitoring, and Reporting Malware
Identifying, Monitoring, and Reporting MalwareIdentifying, Monitoring, and Reporting Malware
Identifying, Monitoring, and Reporting MalwareTeodoro Cipresso
 
Basic survey on malware analysis, tools and techniques
Basic survey on malware analysis, tools and techniquesBasic survey on malware analysis, tools and techniques
Basic survey on malware analysis, tools and techniquesijcsa
 
Chapter 1 malware analysis primer
Chapter 1 malware analysis primerChapter 1 malware analysis primer
Chapter 1 malware analysis primerManjuA8
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptCHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptManjuAppukuttan2
 
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...CSCJournals
 

Similar to Application of data mining based malicious code detection techniques for detecting new spyware (20)

A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSA STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
 
[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...
[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...
[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...
 
Malware analysis and detection using reverse Engineering, Available at: www....
Malware analysis and detection using reverse Engineering,  Available at: www....Malware analysis and detection using reverse Engineering,  Available at: www....
Malware analysis and detection using reverse Engineering, Available at: www....
 
A44090104
A44090104A44090104
A44090104
 
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLSA FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
A FRAMEWORK FOR ANALYSIS AND COMPARISON OF DYNAMIC MALWARE ANALYSIS TOOLS
 
Antimalware
AntimalwareAntimalware
Antimalware
 
IRJET - Survey on Malware Detection using Deep Learning Methods
IRJET -  	  Survey on Malware Detection using Deep Learning MethodsIRJET -  	  Survey on Malware Detection using Deep Learning Methods
IRJET - Survey on Malware Detection using Deep Learning Methods
 
[IJET-V1I2P2] Authors :Karishma Pandey, Madhura Naik, Junaid Qamar,Mahendra P...
[IJET-V1I2P2] Authors :Karishma Pandey, Madhura Naik, Junaid Qamar,Mahendra P...[IJET-V1I2P2] Authors :Karishma Pandey, Madhura Naik, Junaid Qamar,Mahendra P...
[IJET-V1I2P2] Authors :Karishma Pandey, Madhura Naik, Junaid Qamar,Mahendra P...
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creation
 
A generic virus detection agent on the internet
A generic virus detection agent on the internetA generic virus detection agent on the internet
A generic virus detection agent on the internet
 
Optimised Malware Detection in Digital Forensics
Optimised Malware Detection in Digital Forensics Optimised Malware Detection in Digital Forensics
Optimised Malware Detection in Digital Forensics
 
Automatically generated win32 heuristic virus detection
Automatically generated win32 heuristic virus detectionAutomatically generated win32 heuristic virus detection
Automatically generated win32 heuristic virus detection
 
A trust system based on multi level virus detection
A trust system based on multi level virus detectionA trust system based on multi level virus detection
A trust system based on multi level virus detection
 
Optimised malware detection in digital forensics
Optimised malware detection in digital forensicsOptimised malware detection in digital forensics
Optimised malware detection in digital forensics
 
Identifying, Monitoring, and Reporting Malware
Identifying, Monitoring, and Reporting MalwareIdentifying, Monitoring, and Reporting Malware
Identifying, Monitoring, and Reporting Malware
 
Basic survey on malware analysis, tools and techniques
Basic survey on malware analysis, tools and techniquesBasic survey on malware analysis, tools and techniques
Basic survey on malware analysis, tools and techniques
 
Chapter 1 malware analysis primer
Chapter 1 malware analysis primerChapter 1 malware analysis primer
Chapter 1 malware analysis primer
 
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.pptCHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
CHAPTER 1 MALWARE ANALYSIS PRIMER.ppt
 
The modern-malware-review-march-2013
The modern-malware-review-march-2013 The modern-malware-review-march-2013
The modern-malware-review-march-2013
 
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
 

More from UltraUploader

01 le 10 regole dell'hacking
01   le 10 regole dell'hacking01   le 10 regole dell'hacking
01 le 10 regole dell'hackingUltraUploader
 
00 the big guide sz (by dr.to-d)
00   the big guide sz (by dr.to-d)00   the big guide sz (by dr.to-d)
00 the big guide sz (by dr.to-d)UltraUploader
 
[E book ita] php manual
[E book   ita] php manual[E book   ita] php manual
[E book ita] php manualUltraUploader
 
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...UltraUploader
 
[Ebook ita - database] access 2000 manuale
[Ebook   ita - database] access 2000 manuale[Ebook   ita - database] access 2000 manuale
[Ebook ita - database] access 2000 manualeUltraUploader
 
(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)UltraUploader
 
(Ebook ita - inform - access) guida al database access (doc)
(Ebook   ita - inform - access) guida al database access (doc)(Ebook   ita - inform - access) guida al database access (doc)
(Ebook ita - inform - access) guida al database access (doc)UltraUploader
 
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
(Ebook   computer - ita - pdf) fondamenti di informatica - teoria(Ebook   computer - ita - pdf) fondamenti di informatica - teoria
(Ebook computer - ita - pdf) fondamenti di informatica - teoriaUltraUploader
 
Broadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorBroadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorUltraUploader
 
Botnetsand applications
Botnetsand applicationsBotnetsand applications
Botnetsand applicationsUltraUploader
 
Bot software spreads, causes new worries
Bot software spreads, causes new worriesBot software spreads, causes new worries
Bot software spreads, causes new worriesUltraUploader
 
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...UltraUploader
 
Bird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyBird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyUltraUploader
 
Biologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesBiologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesUltraUploader
 
Biological versus computer viruses
Biological versus computer virusesBiological versus computer viruses
Biological versus computer virusesUltraUploader
 
Biological aspects of computer virology
Biological aspects of computer virologyBiological aspects of computer virology
Biological aspects of computer virologyUltraUploader
 
Biological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksBiological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksUltraUploader
 

More from UltraUploader (20)

1 (1)
1 (1)1 (1)
1 (1)
 
01 intro
01 intro01 intro
01 intro
 
01 le 10 regole dell'hacking
01   le 10 regole dell'hacking01   le 10 regole dell'hacking
01 le 10 regole dell'hacking
 
00 the big guide sz (by dr.to-d)
00   the big guide sz (by dr.to-d)00   the big guide sz (by dr.to-d)
00 the big guide sz (by dr.to-d)
 
[E book ita] php manual
[E book   ita] php manual[E book   ita] php manual
[E book ita] php manual
 
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...[Ebook   ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
[Ebook ita - security] introduzione alle tecniche di exploit - mori - ifoa ...
 
[Ebook ita - database] access 2000 manuale
[Ebook   ita - database] access 2000 manuale[Ebook   ita - database] access 2000 manuale
[Ebook ita - database] access 2000 manuale
 
(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)(E book) cracking & hacking tutorial 1000 pagine (ita)
(E book) cracking & hacking tutorial 1000 pagine (ita)
 
(Ebook ita - inform - access) guida al database access (doc)
(Ebook   ita - inform - access) guida al database access (doc)(Ebook   ita - inform - access) guida al database access (doc)
(Ebook ita - inform - access) guida al database access (doc)
 
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
(Ebook   computer - ita - pdf) fondamenti di informatica - teoria(Ebook   computer - ita - pdf) fondamenti di informatica - teoria
(Ebook computer - ita - pdf) fondamenti di informatica - teoria
 
Broadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitorBroadband network virus detection system based on bypass monitor
Broadband network virus detection system based on bypass monitor
 
Botnetsand applications
Botnetsand applicationsBotnetsand applications
Botnetsand applications
 
Bot software spreads, causes new worries
Bot software spreads, causes new worriesBot software spreads, causes new worries
Bot software spreads, causes new worries
 
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
Blended attacks exploits, vulnerabilities and buffer overflow techniques in c...
 
Blast off!
Blast off!Blast off!
Blast off!
 
Bird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassemblyBird binary interpretation using runtime disassembly
Bird binary interpretation using runtime disassembly
 
Biologically inspired defenses against computer viruses
Biologically inspired defenses against computer virusesBiologically inspired defenses against computer viruses
Biologically inspired defenses against computer viruses
 
Biological versus computer viruses
Biological versus computer virusesBiological versus computer viruses
Biological versus computer viruses
 
Biological aspects of computer virology
Biological aspects of computer virologyBiological aspects of computer virology
Biological aspects of computer virology
 
Biological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networksBiological models of security for virus propagation in computer networks
Biological models of security for virus propagation in computer networks
 

Application of data mining based malicious code detection techniques for detecting new spyware

  • 1. Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware Cumhur Doruk Bozagac Bilkent University, Computer Science and Engineering Department, 06532 Ankara, Turkey dorukb@cs.bilkent.edu.tr Abstract. Over the past few years, a new computer security problem has arisen, Spyware. Anti-Virus techniques cannot deal with spyware, due to their silent infection techniques and their differences from a regular virus. Various Anti-Spyware programs have been implemented as a counter-measure but most of these programs work using the signature method, which is weak against new spyware. Schultz[1] has presented a data-mining framework that detects new, previously unseen malicious executables. However his paper was published in 2001 and there was no spyware in their training or test dataset. This paper takes Schultz[1]’s work as a candidate and applies its techniques against a new spyware dataset collected in 2005, to see whether their techniques can be used against this new threat. 1 Introduction Spyware is any software installed on a computer without the user's knowledge, that gathers information about that user for later retrieval by whomever controls it. There are two types of spyware: malware and adware. Malware is any program that gathers personal information from the user’s PC. Key loggers, screen capture devices, and Trojans are in this category. Adware is a program designed for showing a user advertisements, like homepage hijackers, pop-up windows and search page hijackers. Spyware poses several risks. The most conspicuous is compromising a user’s privacy by transmitting information about that user’s behavior. However, spyware can also detract from the usability and stability of a user’s computing environment, and it has the potential to introduce new security vulnerabilities to the infected host. Because spyware is widespread, such vulnerabilities would put millions of computers at risk. 1
  • 2. Spyware programs have a series of traits that makes them difficult to detect and enables them to install themselves on numerous PCs for long periods of time[2]: - They use almost perfect camouflage systems. They are normally installed on computers along with another kind of application: a P2P client, a hard disk utility… - File names don’t normally give any clues as to the real nature of the file, and so they often go unnoticed along with other legitimate application files. - As they are not viruses, and they don’t use a routine that can be associated with them, antivirus programs don’t detect them unless, they are designed specifically to do so. Panda Anti-Virus program has recently released a new online virus scanner, which can also detect spyware. According to their reports[2], in the first 24 hours of the operation, 84 percent of malicious code detected was spyware and the first 74 most detected malicious code were all spy programs. Spyware can be detected through the effects it has on systems: use of system resources resulting in a slowdown of the PC; high level of Internet connection bandwidth consumption; complete loss of the connection or general system instability. They also often change settings of certain applications, such as the browser home page, or insert icons on the desktop. The main consequence of spyware on PCs is the gathering of information, including confidential details, making users feel uneasy about what is happening in their computer behind their backs. Current virus scanner technology has two parts[1]: a signature-based detector and a heuristic classifier that detects new viruses. The classic signature-based detection algorithm relies on signatures (unique telltale strings) of known malicious executables to generate detection models. Signature-based methods create a unique tag for each malicious program so that future examples of it can be correctly classified with a small error rate. These methods do not generalize well to detect new malicious binaries because they are created to give a false positive rate as close to zero as possible. Whenever a detection method generalizes to new instances, the tradeoff is for a higher false positive rate. Heuristic classifiers are generated by a group of virus experts to detect new malicious programs. This kind of analysis can be time-consuming and oftentimes still fail to detect new malicious executables. 2
  • 3. 2 Methodology Schultz[1] proposes using data mining methods for detecting new malicious executables. They apply three different algorithms with each one having its own feature extraction technique. The first one is RIPPER algorithm, which they only applied on Portable Executable (PE) format data using the Portable Executable header information extraction technique, so I skip this algorithm. The second algorithm is Naïve Bayes algorithm using strings in the binaries as features. This technique can be easily avoided by encoding or encrypting the strings in a file, so this technique is weak against new malicious code. The third algorithm is Multi-Naïve Bayes algorithm, which uses byte sequences in a file as features. The idea is, malicious executables have common intentions and they may have similar byte code. We can detect malicious executable by looking at the frequency analysis of byte code in a file. Multi-Naïve Bayes algorithm is basically a collection of Naïve Bayes algorithms for splitting the data into sets. I will use the same feature extraction technique, but I will use only one instance of Naïve Bayes algorithm. Source code for this method can be found at http://www.cs.columbia.edu/ids/mef/software . The dataset consists of 312 benign(non-malicious) executables and 614 spyware executable. The spyware collection is formed by using the virus collection at http://vx.netlux.org and by crawling the internet using a sandboxed operating system and manually collecting spywares. The benign executables are collected from the system files in Windows XP operating system and from programs of a stereotype user. ue. Byte sequences are extracted using the hexdump tool in Linux. For each file in the dataset, using this tool a hexdump file is formed. When the algorithm is run, user should specify a “window size”. Naïve Bayes algorithm can be specified to use how many sequences of byte data. Default window size is one, which means algorithm will look at 4 hexadecimals(2 bytes of data) for frequency analysis. I run the algorithm for a window size of 2 and 4 separately using 5-fold cross validation techniq 3 Test Results To evaluate our system we are interested in several Quantities, just like Schultz[1] : 1. True Positives (TP), the number of malicious executable examples classified as malicious executables 3
  • 4. 2. True Negatives (TN), the number of benign programs classified as benign. 3. False Positives (FP), the number of benign programs classified as malicious executables 4. False Negatives (FN), the number of malicious executables classified as benign binaries. The “detection rate” of the classifier is the percentage of the total malicious programs labeled malicious. The “false positive rate” is the percentage of benign programs which were labeled as malicious. The “overall accuracy” is the percentage of true classifications over all data. The first test is done using a window size of 2 and its results curve is shown in Table 1. Window Size = 2 0 20 40 60 80 100 0 1.5152 4.5455 14.516 22.222 27.419 48.438 51.613 False Positive Rate % DetectionRate% W2 Table 1. Results of the tests for Spyware dataset. 4
  • 5. As you can see, high detection rate can only be gained for high false positive rates and till 20% false positive rate the model has a detection rate lower than 80%. Then I run the algorithm with a window size of 4 and the results can be seen in Table 2. Window Size = 4 0 20 40 60 80 100 0 1.538 4.615 6.061 12.12 13.64 26.98 29.03 35.48 50 False Positive Rate % DetectionRate% W4 Table 2. Results of the tests for Spyware dataset. Results were better for this case, but still the overall accuracy was rather low, so I decided to investigate the reasons by looking at the classification test results. I realized that a class of spyware called Trojans had a quite low detection rate. These are different from ordinary spyware, since they are quite complicated programs and their binary size were rather big compared to other files in the dataset. I thought that these programs may be the reason for the high false positive rate and low detection rate, so I run another test for a window size of 2, after excluding the 59 Trojans from the dataset. 5
  • 6. Window Size = 2 and without Trojans 0 20 40 60 80 100 0 3.278689 5 11.66667 22.0339 35 False Positive Rate % DetectionRate% W2 Table 3. Results of the tests for Spyware dataset. As the Table 3 shows, the overall accuracy has improved for the window size of 2 and we reach 80% detection rate before a false positive rate of 15%. Also we score better for low false positive rates. These results encouraged for a window size of 4 test without the Trojans in the dataset and the results of this test are shown in Table 4. Window Size = 4 and without Trojans 0 20 40 60 80 100 0 4.918 5 10 16.67 20.34 25 30 43.33 False Positive Rate % DetectionRate% W4 Table 4. Results of the tests for Spyware dataset. 6
  • 7. We had 80% detection rate even before the false positive rate of 4% and overall accuracy has been improved. Below you can see best overall accuracy results for each run in Table 5. TP TN FP FN Detection Rate False Positive Rate Overall Accuracy Window Size=2 118 49 14 5 95.93 22.22 89.78 Window Size=4 119 46 17 4 96.75 26.98 88.71 Window Size=2 w/o Trojan 96 46 13 15 86.49 22.03 83.53 Window Size=4 w/o Trojan 99 58 3 12 89.19 4.92 91.28 Table 5. Results of the tests for Spyware dataset. The final run has the best overall accuracy and has the lowest false positive rate, which is significantly lower than other three runs. 4 Conclusion Spyware is a new threat and current anti-virus programs are weak against spyware, because they use almost perfect camouflage systems and as they are not viruses, and they don’t use a routine that can be associated with them, antivirus programs don’t detect them unless, they are designed specifically to do so. There are some anti-spyware programs, but they work with signature scheme, so again they are weak against unseen spywares. Before the tests, I had some doubt for using byte sequences as a feature. All viruses have a common technique for infection, which is replicating themselves onto new executables. One can argue that, the byte codes for infection can ve a distinguishing feature for the viruses and these can be used for detecting viruses. However, spyware does not use these techniques, so it is difficult to separate them from normal executables in this sense. On the other side, there is a common behaviour for all spyware, they watch the user’s actions for reporting. Probably, this is why I get overall accuracy when I run tests with trojans in the dataset. Trojans are not really spyware, tough they are 7
  • 8. used for spying, indeed they are complete programs working as a toolbox for hackers. As a result, after removing them I had a dataset with a common programmatic behavour, which gives me better classifying results. Another thing important about this technique is the window size. In the paper Schultz[1] does not give an optimal window size, but in general we can say that the larger window size we use, better results we get. The problem is, with a window size of five the program needs more than one gigabyte of ram for running and the run takes too much time and this is why I was only able to test with a window size of four. To conclude, applying Schultz[1]’s techniques on spyware, I was able to see that a data mining based heuristic scheme has the potential to be used for de- tecting new spyware. 5 Future Work Future work involves using Multi-Naïve Bayes for using a window size of five and six. I expect to have better overall accuracy for larger window sizes. Furthermore Christodorescu[3] proposes a technique for testing malicious code detector’s performance against obfuscation. Since we are using byte sequences as our features, changes in the byte sequence will most probably affect our results. These techniques can be used to test performance of this spyware detection technique against obfuscated code. 6 References 1. M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, Data Mining Methods for Detection of New Malicious Executables, Proceedings of IEEE Symposium on Security and Privacy. Oakland, pp.38-49, 2001 2. “The silent epidemic of 2005: 84% of malware on computers worldwide is spy- ware”. http://www.pandasoftware.com/about/press/viewnews.aspx?noticia=5968. 3. M. Christodorescu and S. Jha, Testing Malware Detectors, International Sympo- sium on Software Testing and Analysis archive. Boston, Massachusetts, USA. 8