Cloudstone
Sharpening Your Weapons Through Big Data
Christopher Grayson
@_lavalamp
Introduction
WHOAMI
3
• ATL
• Web development
• Academic researcher
• Haxin’ all the things
• (but I rlllly like networks)
• Founder
• Red team
@_lavalamp
• Common Crawl
• MapReduce
• Hadoop
• Amazon Elastic MapReduce (EMR)
• Mining Common Crawl using Hadoop
on EMR
• Other "big" data sources
WHAT’S DIS
4
• Academic research =/= industry
research
• Tactics can (and should!) be cross-
applied
• Lots of power in big data, only
problem is how to extract it
• Largely untapped resource
• Content discovery (largely) sucks
WHY’S DIS
5
1. Background
2. Common Crawl
3. MapReduce & Hadoop
4. Elastic MapReduce
5. Mining Common Crawl
6. Data Mining Results
7. Big(ish) Data Sources
8. Conclusion
Agenda
6
Background
• DARPA CINDER program
• Continual authentication through
side channel data mining
• Penetration testing
• Web Sight
My Background
8
• Penetration testing scopes are
rarely adequate
• Faster, more accurate tools ==
better engagements
• It’s 2017 – application layer often
comprises the majority of attack
surface
• Expedite discovery of application-
layer attack surface
Time == $$$
9
• Many web applications map disk
contents to URLs
• Un-linked resources are commonly
less secure
• Older versions
• Debugging tools
• Backups with wrong extensions
• Find via brute force
• Current tools are quite lacking
Web App Content Discovery
10
Common Crawl
• California-based 501(c)(3) non-
profit organization
• Performing full web crawls on a
regular basis using different user
agents since 2008
• Data stored in AWS S3
• A single crawl contains many
terabytes of data
• Full crawl metadata can exceed 10TB
What is Common Crawl?
12
http://commoncrawl.org/
• Crawl data is stored in three data formats
• WARC (Web ARChive) – raw crawl data
• WAT – HTTP request and response
metadata
• WET – plain-text HTTP responses
• WAT files likely contain the juicy bits
you’re interested in
• Use existing libraries for parsing file
contents
CC Data Format
13
• Data is stored in AWS S3
http://commoncrawl.org/the-data/get-started/
• Can use the usual AWS S3
command line tools for debugging
• Newer crawls contain files listing
WAT and WET paths
CC HDFS Storage
14
• When running Hadoop jobs, HDFS
path is supplied to identify all files
to process
• Pulling down single files and
checking them out helps with
debugging code
• Use AWS S3 command line tool to
interact with CC data
Accessing HDFS in AWS
15
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2017-17/wat.paths.gz .
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017-17/
MapReduce & Hadoop
• Programming model for processing
large amounts of data
• Processing done in two phases:
• Map – take input data and extract
what you care about (key-value pairs)
• Reduce – apply a simple aggregation
function across the mapped data
(count, sum, etc)
• Easy concept, quirky to get what
you need out of it
What is MapReduce?
17
https://en.wikipedia.org/wiki/MapReduce
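
The canonical illustration of the two phases is word counting: the map phase emits a (token, 1) pair for every token it sees, and the reduce phase sums the counts for each token. A minimal sketch against the Hadoop Java API (class names here are illustrative, not from the talk's project):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: extract what you care about as key-value pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                // emit (token, 1)
        }
    }
}

// Reduce phase: apply a simple aggregation (here, a sum) across each key's values.
class TokenSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));    // emit (token, total count)
    }
}

Counting URL path segments per server type (later in the deck) is the same shape; only the key construction in the map phase changes.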
• Apache Hadoop
• De facto standard open source
implementation of MapReduce
• Written in Java
• Has an interface to process data in
other languages, but writing code in
Java comes with perks
How ‘bout Hadoop?
18
• Use the Hadoop library for the
version you’ll be deploying against
• Extend the Configured class and implement the Tool interface
• Implement mapper and reducer
classes
• Configure data types and
input/output paths
• ???
• Profit
Writing Hadoop Code
19
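
As a sketch of the steps above, a minimal driver that extends Configured, implements Tool, wires up the mapper/reducer sketched earlier, and points the job at S3 paths. The class name, bucket, and paths are placeholders (the talk's actual entry point is HadoopRunner.java in the linked project):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Extend Configured and implement Tool so ToolRunner handles generic options for you.
public class CrawlAnalysisRunner extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "cc-url-segment-counts");
        job.setJarByClass(CrawlAnalysisRunner.class);

        // Mapper and reducer classes (illustrative names from the earlier sketch).
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(TokenSumReducer.class);

        // Output key/value data types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input/output paths: Common Crawl data in S3 in, your own bucket out (placeholders).
        FileInputFormat.addInputPath(job, new Path("s3://commoncrawl/crawl-data/CC-MAIN-2017-17/"));
        FileOutputFormat.setOutputPath(job, new Path("s3://your-results-bucket/cc-output/"));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new CrawlAnalysisRunner(), args));
    }
}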
• MapReduce supports the map ->
reduce paradigm
• This is a fairly restrictive paradigm
• Have to be creative to determine
what to do during both the map
and reduce phases to extract and
aggregate the data you care about
Shoehorning into Hadoop
20
Elastic Map Reduce
• EMR
• Amazon’s cloud service for running
Hadoop jobs
• Works with all the standard AWS tools
• Spins up a cluster of EC2 instances to process your data
• Free access to data stored in S3
Elastic MapReduce?!
22
• Choose how much you want to pay
for EC2 instances
• EMR allows you to use spot pricing
for your instances
• Must have one or two master nodes
alive at all times (no spot pricing)
• Choose the right spot price and
your total cost for processing all of
Common Crawl can be <$100.00
Spot Pricing!!!
23
Mining Common Crawl
• We want to find the most common
URL paths for every server type
• We have access to HTTP request
and response headers
• We must find a way to map our
requirements into the map and
reduce phases
• Map – Collect/generate the data we
care about, fit into key-value pairs
• Reduce – Apply a mathematical
aggregation across the collected data
Here Comes the Shoehorn
25
MAP
• Create unique strings that contain
(1) a reference to the type of server
and (2) the URL path segment for
every URL path segment in every URL
found within the CC HTTP responses
REDUCE
• Count the number of instances of
each unique string
My Solution
26
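
A hedged sketch of that map phase. The real project parses server headers and URLs out of WAT records; here the input value is assumed to be a pre-extracted "serverType<TAB>urlPath" line, and the key delimiter and record-type prefix are copied from the example key a few slides later:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of the map phase: the input value is assumed to be a
// pre-parsed "serverType<TAB>urlPath" pair with the server type already coerced.
public class SegmentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final String DELIMITER = "_';)_";   // separator from the example key
    private static final String RECORD_TYPE = "02";    // record-type prefix from the example key
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2);
        if (fields.length != 2) {
            return;
        }
        String serverType = fields[0];                  // e.g. "apache_unix" (already coerced)
        String path = fields[1].split("\\?")[0];        // drop any query string
        String[] parts = path.split("/");
        for (int i = 0; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue;
            }
            // Directory segments keep surrounding slashes, the trailing file name does not,
            // mirroring the /foo/, /bar/, baz.html mapping example later in the deck.
            boolean isFile = (i == parts.length - 1) && parts[i].contains(".");
            String segment = isFile ? parts[i] : "/" + parts[i] + "/";
            String outKey = RECORD_TYPE + DELIMITER + serverType + DELIMITER + segment;
            context.write(new Text(outKey), ONE);       // the reduce phase counts each unique key
        }
    }
}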
• Working with big data requires
coercion of input data to expected
values
• Aggregating on random data ==
huge output files
• For processing CC data, I had to
coerce the following values to avoid
massive result files
• Server headers
• GUIDs in URL paths
• Integers in URL paths
Mapping URL Paths
27
• People put wonky stuff in server
headers
• Reviewed the contents of a few
WAT files and retrieved all server
headers
• Chose a list of server types to
support
• Coerce header values into list of
supported server types
• Not supported -> misc_server
• No server header -> null_server
Coercing Server Headers
28
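
A sketch of that coercion step, with an abbreviated supported-server list and hypothetical helper and value names (the real project's full list matches the server types in the results tables later):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: collapse free-form Server header values into a small,
// fixed set of server types so the reduce phase doesn't aggregate on noise.
public final class ServerHeaderCoercer {

    // Abbreviated list; the real project supports many more server types.
    private static final List<String> SUPPORTED = Arrays.asList(
            "nginx", "iis", "lighttpd", "litespeed", "jetty", "resin", "zope", "thin");

    static String coerce(String serverHeader) {
        if (serverHeader == null || serverHeader.trim().isEmpty()) {
            return "null_server";                        // no Server header at all
        }
        String normalized = serverHeader.toLowerCase(Locale.ROOT);
        if (normalized.contains("apache")) {             // split Apache by platform hint
            if (normalized.contains("win")) return "apache_windows";
            if (normalized.contains("unix")) return "apache_unix";
            return "apache_generic";
        }
        for (String supported : SUPPORTED) {
            if (normalized.contains(supported)) {
                return supported;
            }
        }
        return "misc_server";                            // anything we don't recognize
    }

    private ServerHeaderCoercer() { }
}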
• URL paths regularly contain randomized data
• Dates
• GUIDs
• Integers
• Replace URL path segments with default strings when:
• Length exceeds 16 characters
• Contents are all integers
• Contents are majority integers
Coercing URL Paths
29
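
A minimal sketch of those rules; the placeholder strings returned here are assumptions, not the literal values used in the talk's code:

// Hypothetical sketch: collapse path segments that are (mostly) random so they
// don't explode the number of unique keys produced by the map phase.
public final class UrlSegmentCoercer {

    static String coerce(String segment) {
        if (segment.length() > 16) {
            return "long_segment";                     // likely a GUID, hash, or date blob
        }
        int digits = 0;
        for (char c : segment.toCharArray()) {
            if (Character.isDigit(c)) {
                digits++;
            }
        }
        if (!segment.isEmpty() && digits == segment.length()) {
            return "all_integer_segment";              // e.g. numeric IDs like /12345/
        }
        if (digits * 2 > segment.length()) {
            return "mostly_integer_segment";           // majority of characters are digits
        }
        return segment;                                // looks like a real, reusable path segment
    }

    private UrlSegmentCoercer() { }
}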
Mapping process results in strings containing coerced
server header and URL path
1. Record type
2. Server type
3. URL path segment
Mapping Result Key
30
< 02_';)_apache_generic_';)_ AthenaCarey >
1 2 3
Mapping Example
31
GET /foo/bar/baz.html?asd=123 HTTP/1.1
Host: www.woot.com
User-Agent: Mozilla/5.0 (Macintosh; Intel
Mac OS X 10.12; rv:53.0) Gecko/20100101
Firefox/53.0
Accept: text/html
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Server: Apache/2.4.9 (Unix)
Connection: close
Upgrade-Insecure-Requests: 1
/foo/bar/baz.html on Apache (Unix)
< 02_';)_apache_unix_';)_/foo/>, 1
< 02_';)_apache_unix_';)_/bar/>, 1
< 02_';)_apache_unix_';)_baz.html>, 1
• Swap out the fileInputPath
and fileOutputPath values in
HadoopRunner.java
• Compile using ant (not Eclipse,
unless you really like tearing your
hair out)
• Upload Hadoop JAR file to AWS S3
• Create EMR cluster
• Add a “step” to EMR cluster
referencing the JAR file in AWS S3
Running in EMR
32
• Processing took about two days
using five medium-powered EC2
instances as task nodes
• 93,914,151 results (mapped string
combined with # of occurrences)
• ~3.6GB across 14 files
• Still fairly raw data – we need to
process it for it to be useful
Resulting Data
33
• We effectively have tuples of server
types, URL path segments, and the
number of occurrences for each
server type and segment pair
• Must process the results and order
by most common path segments
• Parsing code can be found here:
Parsing the Results
34
https://github.com/lavalamp-/lava-hadoop-processing
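
The linked repository holds the real post-processing code. As a rough illustration of the idea, the sketch below reads reducer output lines of the form "key<TAB>count" from a local directory of part-r-* files, regroups them by server type, and prints each group ordered by occurrence count (the file layout and key-format handling are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical sketch: turn "02_';)_<server>_';)_<segment>\t<count>" lines from the
// Hadoop output files into per-server-type lists ordered by occurrence count.
public class ResultSorter {
    private static final String DELIMITER = "_';)_";

    public static void main(String[] args) throws IOException {
        Map<String, Map<String, Long>> byServer = new HashMap<>();
        try (Stream<Path> files = Files.list(Paths.get(args[0]))) {    // directory of part-r-* files
            for (Path file : (Iterable<Path>) files::iterator) {
                for (String line : Files.readAllLines(file)) {
                    String[] keyAndCount = line.split("\t");
                    if (keyAndCount.length != 2) continue;
                    String[] keyParts = keyAndCount[0].split(java.util.regex.Pattern.quote(DELIMITER));
                    if (keyParts.length != 3) continue;                // record type, server type, segment
                    byServer.computeIfAbsent(keyParts[1], k -> new HashMap<>())
                            .merge(keyParts[2], Long.parseLong(keyAndCount[1]), Long::sum);
                }
            }
        }
        // Print each server type's segments, most common first.
        byServer.forEach((server, segments) -> {
            System.out.println("== " + server + " ==");
            segments.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                    .limit(10)
                    .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
        });
    }
}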
Data Mining Results
URL Segment Counts
36
[Bar chart: # of URL Segments by Server Type – number of discovered URL segments per server type on a logarithmic axis (500,000 to 500,000,000). Server types charted: Gunicorn, Thin, Openresty, Zope, Lotus Domino, Sun Web Server, Apache (Windows), Jetty, PWS, Lighttpd, IBM HTTP Server, Resin, Oracle Application Server, Litespeed, Miscellaneous, IIS, Nginx, Apache (Unix), Apache (Generic).]
Coverage – number of most-common URL segments required to reach each coverage level
Server Type 50% 75% 90% 95% 99% 99.70% 99.90%
Apache (Generic) 58 217 475 611 749 776 784
Apache (Unix) 53 189 395 502 604 624 629
Apache (Windows) 14 41 78 97 117 121 122
Gunicorn 2 4 5 6 6 6 6
IBM HTTP Server 6 15 21 24 26 26 26
IIS 103 330 610 738 859 882 889
Jetty 4 10 15 17 19 19 19
Lighttpd 20 76 178 240 306 320 324
Litespeed 16 43 73 90 109 113 114
Lotus Domino 3 5 6 7 7 7 7
Miscellaneous 93 329 687 907 1147 1196 1210
Nginx 87 341 760 1005 1284 1343 1360
Openresty 7 31 97 159 271 306 318
Oracle Application Server 1 4 6 6 7 7 7
PWS 6 15 22 25 28 29 29
Resin 1 5 9 10 12 12 12
Sun Web Server 6 11 14 16 17 17 17
Thin 3 6 10 11 12 13 13
Zope 12 25 37 42 47 47 48
Coverage by # of Requests
37
Most Common URL Segments
38
Apache (Unix): index.php, /forum/, /forums/, /news/, viewtopic.php, showthread.php, /tag/, /index.php/, newreply.php, /cgi-bin/
Apache (Windows): index.php, index.cfm, /uhtbin/, /cgisirsi.exe/, /NCLD/, /catalog/, modules.php, /events/, /forum/, /item/
Apache (Generic): /news/, index.php, /wiki/, /forums/, /forum/, /tag/, /search/, showthread.php, viewtopic.php, /en/
IIS: /article/, /news/, /page/, /id/, default.aspx, /products/, /NEWS/, /en/, /apps/, /search/
Nginx: /tag/, /news/, /forums/, /forum/, index.php, /tags/, showthread.php, /page/, /category/, /articles/
Comparison w/ Other Sources
39
FuzzDB (all) 850,425 +99.8%
FuzzDB (web & app server) 7,234 +81.2%
Dirs3arch 5,992 +77.3%
Dirbuster 105,847 +98.7%
Burp Suite 424,203 +99.7%
91.34% Average improvement upon existing technologies
*no other approaches provide coverage guarantees
• Common Crawl respects (I believe)
robots.txt
• Certainly has a number of blind
spots
• Results omit highly-repetitive URL
segments (integers, GUIDs)
• Crawling likely misses plenty of
JavaScript-based URLs
• Lots of juicy files are never linked,
therefore missed by Common Crawl
Caveats
40
Resulting hit list files can be found in the following repository:
https://goo.gl/lxdPDm
Getchu Some Data
41
Big(ish) Data Sources
• Public archive of research data
collected through active scans of
the Internet
• Lots of references to other projects
containing data about
• DNS
• Port scans
• Web crawls
• SSL certificates
Scans.io
43
https://scans.io/
• American Registry for Internet
Numbers
• WHOIS records for a significant
amount of the IPv4 address space
• Other regional registries have
similar services
• ARIN
• AFRINIC
• APNIC
• LACNIC
• RIPE NCC
ARIN
44
https://www.arin.net/
• Awesome open source tools for
performing Internet-scale data
collection
• Zmap – network scans
• Zgrab – banner grabbing & network
service interaction
• ZDNS – DNS lookups
Zmap
45
https://zmap.io/
• Use SQL syntax to search all sorts of
huge datasets
• One public dataset contains all
public GitHub data…
Google BigQuery
46
https://cloud.google.com/bigquery/
Google BigQuery Tastiness
47
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%server.pem' OR BQFILES.path like "%id_rsa"
OR BQFILES.path like "%id_dsa";
13,706
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%.aws/credentials';
42
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%.keystore';
14,558
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%robots.txt';
197,694
Conclusion
• MapReduce
• Hadoop
• Amazon Elastic MapReduce
• Common Crawl
• Shoehorning problem sets into
MapReduce
• Benefits from using big data
• Additional data sources
Recap
49
• Hone content discovery based on
already-found URL paths
• Generate content discovery hit lists
for specific user agents (mobile vs.
desktop)
• Hone network service scanning
based on already-found service
ports
Future Work
50
• Common Crawl Hadoop Project
https://github.com/lavalamp-/LavaHadoopCrawlAnalysis
• Common Crawl Results Processing Project
https://github.com/lavalamp-/lava-hadoop-processing
• Content Discovery Hit Lists
https://github.com/lavalamp-/content-discovery-hit-lists
• Lavalamp’s Blog
https://l.avala.mp/
References
51
THANK YOU!
@_lavalamp
chris [AT] websight [DOT] io
https://github.com/lavalamp-
https://l.avala.mp
