In this talk we will present some of the techniques we use on a day-to-day basis in our research, where we combine our internet-wide data scanning and acquisition platform with machine learning and data science techniques that allow us to find things faster and extract results in a more automated way. We will focus on practical cases and examples that even our audience at home will be able to use if they want. A couple of the examples we will look at are how to classify images such as VNC screenshots, how to use machine learning to classify network scans, and how to use natural language processing to analyze CVEs. We will also talk a bit about a data analysis and classification pipeline architecture, looking at the different technologies, what they do, and how they can be used.
We will start with a very brief introduction to the data science world and talk about:
Technologies
Techniques
How these relate to infosec
Algorithms and how they can be used
How people can come into the world of data and machine learning
Data visualization techniques and what are the best choices for different types of data
The examples we will look at include how to classify images such as VNC or X11 screenshots, OCR, how to use machine learning to classify network scans, and how to use natural language processing to analyze CVEs. We will look at scoring and classification algorithms and how they can be used on IP addresses, and we will talk about the use of learning and how we are applying it in real life.
We will also talk a bit about a data analysis and classification pipeline architecture, looking at the different technologies, what they do, and how they can be used. Some specific examples of our research that should give you an idea of what we will talk about can be seen here:
https://blog.binaryedge.io/2015/11/10/ssh/
https://blog.binaryedge.io/2015/09/30/vnc-image-analysis-and-data-science/
https://blog.binaryedge.io/2015/08/10/data-technologies-and-security-part-1/
BSides Lisbon - Data science, machine learning and cybersecurity
1. I, for one, welcome our new Cyber Overlords!
An introduction to the use of data science in cybersecurity
By Tiago Henriques, Filipa Rodrigues, Florentino Bexiga, Ana Barbosa
2. Agenda
WHO ARE WE?
MACHINE LEARNING AND CYBERSECURITY
IMAGE WORKFLOW
IMAGE ANALYSIS IN DETAIL
DATA VISUALISATION
3. Tiago is the CEO and Data Necromancer at BinaryEdge; however, he still gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis.
Tiago Henriques
Presenter
4. Florentino is the Data MacGyver at BinaryEdge. On a daily basis he needs to deploy infrastructure used to analyse big and real-time data. When not doing that, he can be found creating models to analyse data. Give him an orange, he’ll give you a Skynet. Why an orange, you ask? He’s hungry and likes oranges, there!
Florentino Bexiga
Presenter
5. Filipa is the Data Diva at BinaryEdge. She dances the Macarena with numbers to get them to tell her all their dirty secrets.
Filipa Rodrigues
Presenter
6. Ana is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ASCII code to see and show data from the unique perspective of someone who can’t reach the box of cookies stored on top of the capital 'I'.
Ana Barbosa
Presenter
8. How we got here...
200-port scan of the entire internet / month
1,400,000,000 scanning events / month *
746,000 torrents monitored and increasing
1,362,225,600 torrent events / month
* at a minimum
9. Worldwide distribution of IPs running services
[Map legend, number of IPs found: <= 100; 100 < #found <= 1,000; 1,000 < #found <= 10,000; 10,000 < #found <= 100,000; 100,000 < #found < 1,000,000; >= 1,000,000]
11. Data Science & Machine Learning
How many IP addresses did job X have vs. job Y?
What is the average duration of the scans?
Can we extract more from all the screenshots we get?
Can we have a more optimized job distribution?
We can only identify X% of services because we’re using static signatures, can we do better?
Can we find similar images?
MULTIPLE WILD QUESTIONS APPEAR... ...ONE COMMON ANSWER: DATA SCIENCE & MACHINE LEARNING
12. Data Science & Machine Learning
DATA SCIENCE
INITIAL ANALYSIS AND CLEAN UP
EXPLORATORY DATA ANALYSIS
DATA VISUALISATION
KNOWLEDGE DISCOVERY
MACHINE LEARNING
CLASSIFICATION
CLUSTERING
SIMILARITY MATCHING
REGRESSION
IDENTIFICATION
13. Problems and Limitations of Machine Learning in CyberSecurity
Lots of adversarial scenarios – attacks on the classifiers go against the foundations of machine learning
Prediction – scenarios and data are too volatile, and there are not enough proper sources of data
Lack of data, in both quantity and quality, to train models
14. Good use cases
ANTIVIRUS: further work needs to be done, but machine learning will allow antivirus to move from a static, signature-based system to a much improved dynamic, learning-based system
PATTERN DETECTION/OUTLIER DETECTION (IDS/IPS): if a computer is hacked, certain behaviors will change; if data is constantly monitored and fed into a system, the hack could be detected
SOURCE CODE ANALYSIS: detection of vulnerable patterns during development
INTERNAL ATTACKERS: sentiment analysis applied to emails, tweets, and social networks of employees
ANTI-SPAM
SMARTER FUZZERS
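The outlier-detection use case can be illustrated with a minimal sketch. It assumes a single numeric metric per host (here, a hypothetical count of outbound connections per minute) and a simple z-score rule; the slides do not say which method is actually used, so both the metric and the threshold are illustrative assumptions.

```python
from statistics import mean, stdev


def find_outliers(baseline, observations, threshold=3.0):
    """Return observations that deviate strongly from the baseline.

    An observation is flagged when it lies more than `threshold`
    sample standard deviations away from the baseline mean.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different is unusual.
        return [x for x in observations if x != mu]
    return [x for x in observations if abs(x - mu) / sigma > threshold]


# Hypothetical metric: outbound connections per minute for one host.
baseline = [12, 15, 11, 14, 13, 12, 16, 14]
new_samples = [13, 15, 95]
print(find_outliers(baseline, new_samples))
```

A real IDS/IPS would track many features at once and retrain its baseline over time; the point here is only that a behavior change shows up as a measurable deviation.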
15. metadata
files people
photos
family&friends
behaviour
social
search
company
registration
ip address
url address
news
forums
sub-reddits
internal
external
phone
email
linked urls
likes
topics
BGP
AS
whois
AS membership
AS peer
list of IPs
shared infrastructure
co-hosted sites
contact
geolocation
office locations
social networks
phone
portscan
dns
torrents
binaryedge.io 2016
domains
AXFR
MX records
screenshots
web
services
http https
webserver
framework
headers
cookies
certificate
configuration
authorities
entities
SMB
VNC
RDP
users
apps files
peers torrent name
OCR
SW
banners
image
classifier
vulnerabilities
data points
27. Scan
DOES IT GENERATE A SCREENSHOT?
YES: STORE THE IMAGE FILE ON THE CLOUD, GENERATE A NOTIFICATION THAT A NEW IMAGE WAS UPLOADED
NO: FINISH
SCAN GENERATES EVENTS:
{
"origin": {
"type": "vnc",
...
},
"target": {
"ip": "XX.XXX.XX.XXX",
"port": 5900
},
"result": {
"data": {
"version": "3.7",
"width": "1366",
"height": "768",
"auth_enabled": false,
"link": "https://5723981752938cbafeefbcfab42342342.jpg"
}
},
"@timestamp": "2016-04-22T14:53:02.377Z"
}
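The screenshot decision in this flow can be sketched as a small event handler. This is an illustration only: the field names follow the VNC event on the slide, but the notification fields, the sample IP (from a documentation range), and the sample URL are hypothetical, and the actual upload to cloud storage is left out.

```python
import json


def handle_scan_event(raw_event):
    """Return a notification dict if the event carries a screenshot link, else None."""
    event = json.loads(raw_event)
    data = event.get("result", {}).get("data", {})
    link = data.get("link")
    if not link:
        return None  # NO branch: nothing to store, finish
    # YES branch: the image is assumed to be stored already at `link`;
    # all that is left is to announce it to the image workflow.
    return {
        "kind": "new_image",  # hypothetical notification field
        "image_url": link,
        "source": event.get("origin", {}).get("type"),
        "ip": event.get("target", {}).get("ip"),
        "port": event.get("target", {}).get("port"),
    }


# Sample event shaped like the one on the slide (values are placeholders).
sample = json.dumps({
    "origin": {"type": "vnc"},
    "target": {"ip": "192.0.2.10", "port": 5900},
    "result": {"data": {"version": "3.7", "auth_enabled": False,
                        "link": "https://example.invalid/screenshot.jpg"}},
    "@timestamp": "2016-04-22T14:53:02.377Z",
})
print(handle_scan_event(sample))
```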
28. Image Workflow
GET IMAGE
EXTRACT TARGET METADATA
DOES IT CONTAIN ANY CONTENT?
YES: CREATE IMAGE SIGNATURE, STORE DATA
NO: FINISH
ENHANCE IMAGE FOR LOGO AND FACE DETECTION AND OCR EXTRACTION
PERFORM LOGO AND FACE DETECTION AND OCR EXTRACTION
STORE RESULTS
PERFORM ADDITIONAL ACTIONS
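Two steps of this workflow, the content check and the image signature, can be sketched in a few lines. The slides do not name the signature algorithm, so the example below swaps in a simple average hash as an assumption. It works on a small grid of grayscale values (0 to 255); loading and downscaling the real screenshot with an image library is left out, and the function names and thresholds are hypothetical.

```python
def has_content(pixels, min_spread=10):
    """Treat an image as empty when its grayscale values barely vary."""
    flat = [p for row in pixels for p in row]
    return max(flat) - min(flat) >= min_spread


def average_hash(pixels):
    """Build a bit-string signature: '1' where a pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    return "".join("1" if p > avg else "0" for p in flat)


def hamming(sig_a, sig_b):
    """Count differing bits; a small distance suggests similar images."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


blank = [[0] * 4 for _ in range(4)]            # e.g. an all-black screenshot
screen_a = [[0, 0, 255, 255] for _ in range(4)]
screen_b = [[0, 0, 255, 250] for _ in range(4)]  # nearly identical to screen_a

if has_content(screen_a):
    sig_a = average_hash(screen_a)
    sig_b = average_hash(screen_b)
    print(sig_a, hamming(sig_a, sig_b))
```

Comparing signatures by Hamming distance is also one simple way to approach the "can we find similar images?" question raised earlier.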
32. Data Visualization
EXPLORATION REPRESENTATION DETAILS TOOLS FINISHING UP
“a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients.”
Andy Kirk, “Data Visualization: a successful design process”
33. Experimentation is important
A design can be used in the future
How many open ports does an IP have?
Number of IPs with X open ports
[Bar chart: number of IPs (y-axis) by number of open ports (x-axis). Bar values as shown on the slide: 69,543,915; 25,436,974; 7,008,108; 3,475,472; 1,287,446; 1,043,331; 951,629; 854,817; 789,515; 759,115; 490,290; 288,885; 266,827; 257,105; 219,025; 198,898; 186,286; 141,474]
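A distribution like "number of IPs with X open ports" can be derived from scan events with a short aggregation. The sketch below assumes each event carries `target.ip` and `target.port` as in the event samples on the slides; the sample IPs come from a documentation range and the function name is hypothetical.

```python
from collections import Counter


def open_port_histogram(events):
    """Map 'number of open ports' to 'number of IPs with that many open ports'."""
    ports_by_ip = {}
    for event in events:
        target = event["target"]
        # A set removes duplicates when the same port is seen in several scans.
        ports_by_ip.setdefault(target["ip"], set()).add(target["port"])
    return Counter(len(ports) for ports in ports_by_ip.values())


events = [
    {"target": {"ip": "192.0.2.1", "port": 80}},
    {"target": {"ip": "192.0.2.1", "port": 443}},
    {"target": {"ip": "192.0.2.1", "port": 80}},   # repeated observation
    {"target": {"ip": "192.0.2.2", "port": 22}},
    {"target": {"ip": "192.0.2.3", "port": 5900}},
]
histogram = open_port_histogram(events)
for n_ports in sorted(histogram):
    print(n_ports, histogram[n_ports])
```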
34. Data Visualization
Distribution of IP addresses running encrypted and unencrypted services
{
"origin": {
"type": "service-simple",
...
},
"target": {
"ip": "XX.XX.XXX.XXX",
"port": 80,
"protocol": "tcp"
},
"result": {
...
"service": {
"product": "Microsoft HTTPAPI httpd",
"name": "http",
"extrainfo": "SSDP/UPnP",
"cpe": [
"cpe:/o:microsoft:windows"
]
}
},
"@timestamp": "2016-04-22T04:07:18.161Z"
}
IPs running HTTP services on port 80: 51,467,779
IPs running HTTPS services on port 443: 28,671,263
IPs running both HTTP and HTTPS services: 16,519,503
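The three groups in this diagram (HTTP, HTTPS, and both) fall out of plain set operations on the IPs seen per port. The sketch below assumes the port number (80 or 443) is used as the marker for each service, as the slide labels suggest; the function name and sample IPs are hypothetical.

```python
def split_http_https(events):
    """Count IPs seen on port 80 only, on port 443 only, and on both."""
    http_ips = {e["target"]["ip"] for e in events if e["target"]["port"] == 80}
    https_ips = {e["target"]["ip"] for e in events if e["target"]["port"] == 443}
    return {
        "http_only": len(http_ips - https_ips),
        "https_only": len(https_ips - http_ips),
        "both": len(http_ips & https_ips),  # the overlap of the two circles
    }


events = [
    {"target": {"ip": "192.0.2.1", "port": 80}},
    {"target": {"ip": "192.0.2.1", "port": 443}},
    {"target": {"ip": "192.0.2.2", "port": 80}},
    {"target": {"ip": "192.0.2.3", "port": 443}},
    {"target": {"ip": "192.0.2.4", "port": 22}},   # ignored: neither service
]
groups = split_http_https(events)
print(groups)
```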
35. Data Visualization
Top 10 Web Servers for the Web
Most common web servers found on port 80
Apache httpd: 11,493,552
AkamaiGHost: 8,361,080
Microsoft IIS httpd: 4,843,769
nginx: 3,860,883
lighttpd: 2,031,741
Huawei HG532e ADSL modem http admin: 1,539,629
Microsoft HTTPAPI httpd: 952,300
Technicolor DSL modem http admin: 699,202
Mbedthis-Appweb: 694,393
micro_httpd: 678,657
{
...
"result": {
"data": {
"apps": [
{
"name": "Apache",
"confidence": 100,
"version": "2.2.26",
"categories": [
"web-servers" ]
...
}
]
}
}
}
36. Data Visualization
Email Protocols
Overview of protocols used for email, according to encryption used
SERVICE: COUNT
POP3: 4,572,161
POP3S: 3,742,289
SMTP: 3,531,071
SMTPS: 2,971,159
IMAP: 4,131,737
IMAPS: 3,703,364
UNENCRYPTED (POP3 + SMTP + IMAP): 12,234,969
ENCRYPTED (POP3S + SMTPS + IMAPS): 10,416,812
{
"origin": {
"type": "service-simple",
...
},
"target": {
"ip": "XX.XXX.XXX.XX",
"port": 143,
"protocol": "tcp"
},
"result": {
...
"service": {
"method": "probe_matching",
"product": "Dovecot imapd",
"name": "imap",
"cpe": [
"cpe:/a:dovecot:dovecot"
]
...
}
},
"@timestamp": "2016-04-22T01:56:54.583Z"
}
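The encrypted and unencrypted totals on this slide can be checked by summing the per-protocol counts. The sketch below uses the slide's six counts, paired with the protocols in the order they are listed (an assumption that the totals bear out), and treats the "S" variants as the encrypted group.

```python
# Per-protocol counts as listed on the slide.
counts = {
    "POP3": 4_572_161,
    "POP3S": 3_742_289,
    "SMTP": 3_531_071,
    "SMTPS": 2_971_159,
    "IMAP": 4_131_737,
    "IMAPS": 3_703_364,
}
ENCRYPTED = {"POP3S", "SMTPS", "IMAPS"}

totals = {"encrypted": 0, "unencrypted": 0}
for protocol, count in counts.items():
    group = "encrypted" if protocol in ENCRYPTED else "unencrypted"
    totals[group] += count

print(totals)  # {'encrypted': 10416812, 'unencrypted': 12234969}
```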
37. Data Visualization
Big Data Technologies
Changes in amount of data exposed without security
[Chart: data exposed by MongoDB, Memcached and Redis in Aug 2015, Jan 2016 and July 2016. Values as they appear on the slide, not matched to a technology or date: 2 TB, 644.3 TB, 724.7 TB, 627.7 TB, 13.2 TB, 11.3 TB, 710.9 TB, 12.0 TB, 598.7 TB, 27.5 TB, 1.5 TB, 1.8 TB, 619.8 TB]
{
"origin": {
"type": "redis",
...
},
"target": {
"ip": "XXX.XX.XX.XXX",
"port": 6379
},
"result": {
"data": {
"redis_version": "3.0.6",
...
"used_memory": 1374760,
"used_memory_human": "1.31M",
"used_memory_rss": 1839104,
"used_memory_peak": 25195656,
"used_memory_peak_human": "24.03M",
"used_memory_lua": 36864,
"mem_fragmentation_ratio": 1.34,
...
}
},
"@timestamp": "2016-04-22T15:37:10.913Z"
}
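One way an "amount of data exposed" figure could be derived from events like this one is to sum the `used_memory` field across all Redis instances found. Whether the talk's totals were computed this way is an assumption; the helper names below are hypothetical, and the byte values are the ones shown in the sample event.

```python
def total_exposed_bytes(events):
    """Sum the reported used_memory (in bytes) over all Redis events."""
    return sum(e["result"]["data"].get("used_memory", 0) for e in events)


def human(n_bytes):
    """Format a byte count with binary (1024-based) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


events = [
    {"result": {"data": {"used_memory": 1374760}}},
    {"result": {"data": {"used_memory": 25195656}}},
    {"result": {"data": {}}},  # instance that reported no memory figure
]
print(human(total_exposed_bytes(events)))
```

With 1024-based units, `human(1374760)` gives "1.31 MB", which lines up with the `used_memory_human` value of "1.31M" in the sample event.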
38. Data Visualization
Heartbleed
Countries with the highest number of IPs vulnerable to Heartbleed
Russia: 5,264
Republic of Korea: 4,564
China: 6,790
United States: 23,649
Italy: 2,508
Germany: 6,382
France: 5,622
Netherlands: 2,779
United Kingdom: 3,459
Japan: 2,484
{
"origin": {
"type": "ssl"
},
"target": {
"ip": "XXX.XX.X.XXX",
"port": 443
},
"result": {
"data": {
"vulnerabilities": {
"heartbleed": {
"is_vulnerable_to_heartbleed": true
},
"openssl_ccs": {
"is_vulnerable_to_ccs_injection": false
}
}
}
}
}
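A per-country count of Heartbleed-vulnerable IPs can be sketched as a filter followed by a group-by. The vulnerability flag follows the SSL event above, but the event itself carries no country, so the `country_of` lookup below is a hypothetical stand-in for whatever geolocation source is used; the sample IPs are from documentation ranges.

```python
from collections import Counter


def vulnerable_by_country(events, country_of):
    """Count distinct Heartbleed-vulnerable IPs per country."""
    vulnerable_ips = set()
    for event in events:
        vulns = event.get("result", {}).get("data", {}).get("vulnerabilities", {})
        if vulns.get("heartbleed", {}).get("is_vulnerable_to_heartbleed"):
            vulnerable_ips.add(event["target"]["ip"])
    return Counter(country_of.get(ip, "unknown") for ip in vulnerable_ips)


def ssl_event(ip, vulnerable):
    """Build a minimal event shaped like the one on the slide."""
    return {
        "target": {"ip": ip, "port": 443},
        "result": {"data": {"vulnerabilities": {
            "heartbleed": {"is_vulnerable_to_heartbleed": vulnerable}}}},
    }


# Hypothetical IP-to-country mapping standing in for a geolocation lookup.
country_of = {"192.0.2.1": "Germany", "192.0.2.2": "Germany", "198.51.100.7": "Japan"}
events = [
    ssl_event("192.0.2.1", True),
    ssl_event("192.0.2.2", True),
    ssl_event("198.51.100.7", False),
    ssl_event("203.0.113.9", True),   # not in the lookup table
]
by_country = vulnerable_by_country(events, country_of)
print(by_country.most_common())
```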
39. Data Visualization
VNC wordcloud
[Word cloud; words shown: login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu]
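The input to a word cloud like this is a table of word frequencies. The sketch below assumes the text has already been extracted from the screenshots by OCR; the sample strings, the stop-word list, and the minimum word length are illustrative assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "with"}  # illustrative, not exhaustive


def word_frequencies(texts, min_len=3):
    """Count lowercase words across OCR texts, skipping short words and stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if len(word) >= min_len and word not in STOPWORDS:
                counts[word] += 1
    return counts


# Hypothetical OCR output from three screenshots.
ocr_texts = [
    "Press Ctrl+Alt+Delete to login",
    "Welcome to Ubuntu",
    "Windows Server 2016: press ctrl alt delete",
]
frequencies = word_frequencies(ocr_texts)
print(frequencies.most_common(5))
```

The resulting counts can be handed to any plotting tool, with font size scaled by frequency.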
43. Tools
BALANCE
Automation
Programming language to create plots
Fine-tuning in Illustrator (make it better for the audience)
Hand-editing process
Human error
Originality
Automated analysis
Illustrator (or other tool) to create the visualization solution
Human error
44. Data Visualization
DOCUMENT EVERY STEP OF THE PROCESS
Calculations
Choices of visualisations
Choices of data points
REVIEW EVERYTHING
What could have been done differently?
What could be better?
TAKE CONSTRUCTIVE FEEDBACK
Even if it means starting over
A visualization can be used in the future