Building a distributed scanner can be challenging; building one using real browsers even more so.
Injecting JavaScript to extract JS libraries and their versions, and storing all HTML and JavaScript along with security headers, requires a unique architecture. Having scanned the top 1,000,000 sites, I will cover the challenges I overcame in designing a scalable system to fingerprint the current state of the web. I will also present some of the more interesting findings from the data that was analyzed.
--- Isaac Dawson
Isaac Dawson is a Principal Security Researcher at Veracode, Inc. where he leads the R&D efforts of Veracode's dynamic analysis offerings. Prior to Veracode, he was a consultant for @stake and then Symantec. In 2004 he moved to Japan to start their application security consulting team.
After leaving for Veracode, he decided Japan was just too comfortable and has stayed ever since.
An avid Go programmer, he has an interest in distributed systems and, in particular, scanning the web.
[CB16] Around the Web in 80 Hours: Scalable Fingerprinting with Chromium Automation by Isaac Dawson
1. ISAAC DAWSON,
AROUND THE WEB IN 80 HOURS: SCALABLE
FINGERPRINTING WITH CHROMIUM AUTOMATION
VERACODE
2. VERACODE
ABOUT ME:
▸ Previously at @stake, Symantec (10 years)
▸ Moved into research role at Veracode, Inc. (6 years)
▸ Living in Japan for 12 years
▸ I <3
3. VERACODE
IT ALL STARTED IN 2012…
4. VERACODE
SECURITY HEADER SCANNING HISTORY
▸ All scanners use the Alexa Top 1 Million URLs
▸ Galexa (November 2012 - March 2014)
▸ Golexa (March 2014 - February 2016)
▸ Creeper v0-v1 (February 2016 - July 2016)
▸ Creeper v2 (July 2016 - …)
6. VERACODE
SUMMARY OF SYSTEMS & COMPONENTS
▸ Admin (x1) - Manages jobs
▸ Agents (x50) - Analyzes URLs
▸ DB Writers (x4) - Feeds analysis data into the DB & S3
▸ Database (x1) - PostgreSQL 9.5 DB
▸ NSQ - A message queue for URLs, reports and responses
▸ S3 - Stores serialized DOM and HTML/JS
7. VERACODE
THE MESSAGE QUEUE - NSQD, NSQLOOKUPD
▸ NSQ is an easy to deploy message queue
▸ JSON messages between all systems
▸ All agents point to Admin service running NSQLookupd
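As a rough illustration of "JSON messages between all systems", the sketch below encodes and decodes a URL work item of the kind an admin might publish to agents. The `URLMessage` type and its field names are assumptions made for this example; the talk does not show Creeper's real message schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// URLMessage is a hypothetical work item sent from the admin to an agent.
// The field names are illustrative, not Creeper's actual schema.
type URLMessage struct {
	JobID   int    `json:"job_id"`
	InputID int    `json:"input_id"`
	URL     string `json:"url"`
}

// encodeURLMessage serializes a message into the bytes that would be
// published to a queue topic.
func encodeURLMessage(m URLMessage) ([]byte, error) {
	return json.Marshal(m)
}

// decodeURLMessage parses a message body received by a consumer.
func decodeURLMessage(body []byte) (URLMessage, error) {
	var m URLMessage
	err := json.Unmarshal(body, &m)
	return m, err
}

func main() {
	body, err := encodeURLMessage(URLMessage{JobID: 1, InputID: 2, URL: "http://example.com/"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	m, err := decodeURLMessage(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.URL)
}
```

Keeping every message a small, self-describing JSON document means any component (admin, agent, or DB writer) can be debugged by simply reading the queue.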
8. VERACODE
HELPFUL NSQ FEATURES
// Create consumer
c.urlConsumer, err = nsq.NewConsumer(job.Topics["url"],
	creeper_types.UrlChannel, cfg)

// Process numBrowsers messages concurrently (7)
c.urlConsumer.AddConcurrentHandlers(
	nsq.HandlerFunc(c.processUrls),
	numBrowsers)

// Job taking too long to handle/process a message?
msg.Touch() // notify we are still working on this message

// Need to requeue because chrome crashed?
msg.RequeueWithoutBackoff(-1)

// Need to change max # of inflight messages?
c.urlConsumer.ChangeMaxInFlight(c.getInflightCount())
9. VERACODE
DATAFLOW
[Diagram: dataflow between the Admin, the Agents, the Writers, the DB and S3 (data storage)]
12. VERACODE
CREEPER AGENTS: GETTING THE DATA
BROWSER AUTOMATION REQUIREMENTS
▸ Automatable
▸ Fast
▸ Capture network
▸ Capture various browser events (CSP violations)
▸ Inject JavaScript
13. VERACODE
CREEPER AGENTS: GETTING THE DATA
CHOSE CHROME, FOR OBVIOUS REASONS…
▸ Each agent runs 3-6 tabs concurrently
▸ Headless, uses Xvfb
▸ Can get full read access to network response data
▸ Easily inject JavaScript
▸ Can subscribe to console messages
17. VERACODE
CREEPER AGENTS: GETTING THE DATA
GCD
▸ GCD generates Go code using templates
▸ Remote access to debugger events, functions, types.
▸ Can be updated easily as the protocol files change
18. VERACODE
CREEPER AGENTS: GETTING THE DATA
GCD WAS GOOD BUT…
▸ Needed something better
▸ Built autogcd to automate:
▸ Trapping console messages
▸ Intercepting network data
▸ Injecting JS
▸ Took some inspiration from WebDriver
21. VERACODE
CREEPER AGENTS: GETTING THE DATA
INJECTING JAVASCRIPT
▸ Extract JS libraries and versions
▸ Retire.js and Wappalyzer have some good pointers
▸ Created a JSON file with 86 frameworks
▸ Must wait for the page to be fully loaded
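The injection loop shown later in the talk reads `library.Key` and `library.Statement` from `JsLibs.Libraries`, so the JSON file plausibly looks something like the sketch below. The JSON field names and the two sample detection statements are assumptions for illustration only; the real file covers 86 frameworks and its exact layout is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Library pairs a library name with a JavaScript statement that evaluates to
// the library's version, or to an empty string when the library is absent.
type Library struct {
	Key       string `json:"key"`
	Statement string `json:"statement"`
}

// LibraryFile is the top-level JSON document (hypothetical layout).
type LibraryFile struct {
	Libraries []Library `json:"libraries"`
}

// sampleJSON holds two illustrative entries, not the talk's actual definitions.
const sampleJSON = `{
  "libraries": [
    {"key": "jquery", "statement": "window.jQuery ? jQuery.fn.jquery : ''"},
    {"key": "underscore", "statement": "window._ && _.VERSION ? _.VERSION : ''"}
  ]
}`

// parseLibraries decodes a library definition file.
func parseLibraries(data []byte) (LibraryFile, error) {
	var f LibraryFile
	if err := json.Unmarshal(data, &f); err != nil {
		return LibraryFile{}, err
	}
	return f, nil
}

func main() {
	libs, err := parseLibraries([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	for _, l := range libs.Libraries {
		fmt.Printf("%s: %s\n", l.Key, l.Statement)
	}
}
```

Keeping the detection statements in data rather than code means a new framework can be added without rebuilding the agent.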
23. VERACODE
CREEPER AGENTS: GETTING THE DATA
INJECTING JAVASCRIPT - INJECTING
for _, library := range JsLibs.Libraries {
	// Run the library's detection statement inside the loaded page.
	res, err := b.ExecuteScript(library.Statement)
	if err == nil && string(res) != "" {
		log.Printf("%s library result was: %s\n",
			library.Key,
			string(res))
		report.JavaScriptLibraries[library.Key] = string(res)
	}
}
24. VERACODE
CREEPER AGENTS: GETTING THE DATA
INJECTING JAVASCRIPT - WHEN IS A PAGE DONE?
▸ DOMContentLoaded doesn’t handle dynamically loaded JS
▸ Listen for DOM change events
▸ Page loaded if no DOM change events occur for > 2 seconds
▸ Timeout after 5 seconds
28. VERACODE
CREEPER AGENTS: GETTING THE DATA
CHALLENGES - CHROME BUG #1
▸ Turns out opening tabs excessively can cause tabs to not respond to debugger protocol
29. VERACODE
CREEPER AGENTS: GETTING THE DATA
CHALLENGES - CHROME BUG #1 - SOLUTION
▸ Mark tabs as ‘dead’
▸ If max dead tab count is reached, drain active URLs and kill chrome
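One possible shape for the dead-tab bookkeeping is sketched below. The `deadTabs` type, its method names and the limit of 2 are all hypothetical; the talk only states that tabs are marked dead and that Chrome is killed once a maximum count is reached.

```go
package main

import (
	"fmt"
	"sync"
)

// deadTabs tracks tabs that stopped responding to the debugger protocol.
type deadTabs struct {
	mu    sync.Mutex
	ids   map[string]struct{}
	limit int
}

func newDeadTabs(limit int) *deadTabs {
	return &deadTabs{ids: make(map[string]struct{}), limit: limit}
}

// MarkDead records a tab as dead and reports whether the limit has been
// reached, meaning the caller should drain active URLs and restart Chrome.
func (d *deadTabs) MarkDead(tabID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.ids[tabID] = struct{}{}
	return len(d.ids) >= d.limit
}

// Reset forgets all dead tabs, for use after the browser has been restarted.
func (d *deadTabs) Reset() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.ids = make(map[string]struct{})
}

func main() {
	tabs := newDeadTabs(2)
	for _, id := range []string{"tab-1", "tab-1", "tab-2"} {
		if tabs.MarkDead(id) {
			fmt.Println("dead tab limit reached: drain active URLs and restart Chrome")
			tabs.Reset()
		}
	}
}
```

Using a set keyed by tab ID means a tab reported dead twice is only counted once.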
31. VERACODE
CREEPER AGENTS: GETTING THE DATA
CHALLENGES - CHROME BUG #2 - CRASHSAFARI.COM
▸ Would completely kill chrome *and* agent
▸ Lost all active tabs
▸ This site cost me about 2-3 weeks development time
32. VERACODE
CREEPER AGENTS: GETTING THE DATA
CHALLENGES - CRASHSAFARI.COM - SOLUTION
▸ Created killface package
▸ Sends a notification to stop active work
▸ Worker count dynamically adjusted to 1
▸ Pauses queue, runs all unfinished URLs again
▸ Once active count is 0, restart normally
33. VERACODE
CREEPER AGENTS: GETTING THE DATA
OTHER CHALLENGES
✘ NSQ messages too large, zipping ineffective
✓ Split response data/report data
✘ Sites block AWS IP ranges (craigslist.com, etc.)
☹ Timeout…
✘ Concurrency issues
✓ Very careful use of goroutines, channels and timers
✘ Site analysis failures/timeouts
✓ Try 3 times, keep track of retry state
✓ During retry, open a new browser and work on an additional URL
35. VERACODE
DB WRITERS: STORING THE DATA
PREVIOUSLY…
▸ Creeper v0 had many problems
▸ RDS did not support PostgreSQL 9.5
▸ Duplicate data
▸ For v1, wrote to disk, SHA1 of contents:
▸ /job/files/5/a/b/c/5abcfbe73e39e0572a939b09f1eb16d7.html
▸ v1 did not shard database tables
▸ Database tables were normalized
▸ Lock contention
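The v1 on-disk layout above nests files under directories named after the leading characters of the content hash. A minimal sketch of that idea follows; `shardedPath` is a hypothetical helper, and the choice of four single-character directory levels is read off the example path. Note that a SHA-1 digest is 40 hex characters, so the file names produced here are longer than the one in the example path.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path"
)

// shardedPath returns a storage path for body under root, using the first
// four hex characters of the content hash as nested directory names.
// Identical content always maps to the same path, which avoids duplicates.
func shardedPath(root string, body []byte, ext string) string {
	sum := sha1.Sum(body)
	h := hex.EncodeToString(sum[:])
	return path.Join(root, h[0:1], h[1:2], h[2:3], h[3:4], h+ext)
}

func main() {
	fmt.Println(shardedPath("/job/files", []byte("<html></html>"), ".html"))
}
```

Spreading files across nested directories keeps any single directory from accumulating millions of entries.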
37. VERACODE
DB WRITERS: STORING THE DATA
CHALLENGES - GETTING THE DATA IN QUICKLY
▸ Get the data out of the DB writers as soon as possible
▸ Careful to not overload the database with many connections
▸ Reduce lock contention for writing
38. VERACODE
DB WRITERS: STORING THE DATA
SOLUTION #1 - GETTING THE DATA IN QUICKLY
▸ DB Writers batch up reports and responses
▸ Inserted every 2.5-3.5 seconds
▸ Reduces number of required DB connections
39. VERACODE
DB WRITERS: STORING THE DATA
SOLUTION #1 BATCHER
// AddReport queues a report for the next batch; it blocks until the pool accepts it.
func (b *Batcher) AddReport(r *creeper_types.CreeperReport) {
	b.reportPool <- r
	atomic.AddInt32(&b.reportCount, 1)
}

// EmptyReports drains every report currently queued, without blocking.
func (b *Batcher) EmptyReports() []*creeper_types.CreeperReport {
	reports := make([]*creeper_types.CreeperReport, 0)
	for {
		select {
		case report := <-b.reportPool:
			reports = append(reports, report)
		default:
			// Pool is empty: hand back what was collected.
			return reports
		}
	}
}
40. VERACODE
DB WRITERS: STORING THE DATA
SOLUTION #2 - GETTING THE DATA IN QUICKLY
▸ Insert into temporary table using COPY FROM
▸ Extracted from temporary table and INSERTed into final table. This allows for UPSERTS:
INSERT INTO header_names (header_name)
	SELECT responses_tmp.header_name FROM responses_tmp
	ON CONFLICT DO NOTHING;
41. VERACODE
DB WRITERS: STORING THE DATA
CHALLENGES - LARGE TABLES
▸ INSERT INTO … FROM SELECT … on a table with 80,000,000 rows
▸ As tables got bigger, db writers slowed down
▸ This is not scalable
42. VERACODE
DB WRITERS: STORING THE DATA
SOLUTION - TABLE SHARDING
▸ Much like sharding for the file system
▸ Requires a key:
▸ URL ID (e.g. 1 = google.com, 2 = microsoft.com, etc.)
▸ Only large tables require sharding
44. VERACODE
DB WRITERS: STORING THE DATA
CREATING A SHARD KEY
▸ Choose the number of shards for your tables:
▸ shardKey = input_id % 32
▸ Created PL/pgSQL functions:
create unlogged table if not exists job_0_responses (
	response_id serial primary key,
	input_id integer not null,
	body_hash varchar(64) not null,
	resp_url bytea not null,
	resp_uuid varchar(64) unique not null,
	resp_type_id integer references resp_types (resp_type_id) not null,
	status_id integer references status_lines (status_id) not null,
	status_code integer,
	mime_type_id integer references mime_types (mime_type_id) not null,
	response_time bigint
);
EXECUTE merge_headers(job, shardKey)
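On the writer side, picking the shard for a row is a one-line modulo. The sketch below shows the calculation from the slide; the `shardTable` helper and its `job_<job>_<table>_<shard>` naming scheme are assumptions for illustration, since the slide only shows a single table name.

```go
package main

import "fmt"

// numShards matches the modulus used on the slide (input_id % 32).
const numShards = 32

// shardKey maps a URL's input ID to one of numShards buckets.
func shardKey(inputID int) int {
	return inputID % numShards
}

// shardTable builds a per-job, per-shard table name.
// The naming scheme here is hypothetical.
func shardTable(job int, table string, inputID int) string {
	return fmt.Sprintf("job_%d_%s_%d", job, table, shardKey(inputID))
}

func main() {
	for _, id := range []int{1, 2, 33, 64} {
		fmt.Printf("input_id %d -> shard %d (%s)\n", id, shardKey(id), shardTable(0, "responses", id))
	}
}
```

Because the key is derived from the URL ID, every row for a given site lands in the same shard, so per-site queries touch only one table.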
45. VERACODE
DB WRITERS: STORING THE DATA
CONS WITH SHARDING
▸ Added complexity for querying
▸ Best to create a new table with all data for reporting
▸ In the future, may use Citus for sharding across multiple databases
47. VERACODE
DB WRITERS: STORING THE DATA
MOVING TO S3
▸ S3 limits requests to 100/s, but we were pushing 200-2000/s
▸ Had to contact support
▸ Exponential backoff, retry 10 times
▸ Hash is stored in response table
▸ HeadObject first to check existence, then PutObject
▸ HeadObject requests are way cheaper
48. VERACODE
DB WRITERS: STORING THE DATA
LASTLY…
▸ Created unlogged tables
▸ Modified PostgreSQL configuration:
▸ Set checkpoints 5 minutes (max) instead of 1
▸ Enabled fsync
▸ Set max_wal_size 256
50. VERACODE
THE RESULTS: A LOOK AT DATA
SCAN STATISTICS
Responses               72,193,155
Headers                 525,385,900
JS Results              1,943,925
URLs w/Errors           67,315
Redirected to HTTPS     145,268
URLs w/CSP Violations   740
Scan Time               15 Hours
Cost                    $343 / ¥35,063
51. VERACODE
THE RESULTS: A LOOK AT DATA
CSP VIOLATIONS
▸ 722 out of 4965 sites using CSP had violations
▸ Security sites:
▸ https://www.globalsign.com/en/, http://secunia.com/,
▸ https://lastpass.com/, https://www.avant.com/, http://www.veracode.com/
▸ Well known organizations:
▸ http://www.alibaba.com, https://www.doubleclickbygoogle.com
▸ https://mozillians.org/en-US/
52. VERACODE
THE RESULTS: A LOOK AT DATA
SUM OF CSP VIOLATION TYPES
[Chart: sum of CSP violations per directive type, axis from 0 to 3000. Directives shown: SCRIPTSRC, IMGSRC, FRAMESRC, FONTSRC, STYLESRC, CONNECTSRC, MEDIASRC, CHILDSRC, OBJECTSRC, BASEURI, FORMACTION, MANIFESTSRC]
53. VERACODE
THE RESULTS: A LOOK AT DATA
TOP JAVASCRIPT LIBRARIES > 3000
[Chart: number of sites per JavaScript library (libraries with more than 3000 results), axis from 0 to 800000. Libraries shown: JQUERY, JQUERY-UI, MODERNIZR, JQUERY-UI-DIALOG, YEPNOPE, JQUERY-UI-AUTOCOMPLETE, JQUERY-UI-TOOLTIP, BOOTSTRAP, HTML5SHIV, UNDERSCORE, JQUERY.PRETTYPHOTO, PROTOTYPEJS, DRUPAL, MOOTOOLS, MEJS, BACKBONE.JS, ANGULARJS, FOUNDATION, JWPLAYER, REQUIREJS, HANDLEBARS.JS, HAMMERJS, JPLAYER, MUSTACHE.JS, SCRIPTACULOUS, SHADOWBOX, ZEROCLIPBOARD, YUI, RAPHAEL, DATATABLES, KNOCKOUT]
54. VERACODE
THE RESULTS: A LOOK AT DATA
JAVASCRIPT ‘NEXTGEN’ FRAMEWORKS > 100
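[Chart: number of sites per 'nextgen' framework (frameworks with more than 100 results), axis from 0 to 18000. Frameworks shown: BACKBONE.JS, ANGULARJS, FOUNDATION, YUI, KNOCKOUT, DOJO, REACTJS, MARIONETTEJS, VUEJS, EMBER, METEOR, MITHRIL, EXTJS, POLYMER]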
55. VERACODE
THE RESULTS: A LOOK AT DATA
VULNERABILITY COUNTS
[Chart: vulnerability counts per JavaScript library, axis from 0 to 80000. Libraries shown: JQUERY, JQUERY-UI-DIALOG, JQUERY.PRETTYPHOTO, ANGULARJS, JQUERY-UI-TOOLTIP, JPLAYER, HANDLEBARS.JS, ZEROCLIPBOARD, MUSTACHE.JS, YUI, PROTOTYPEJS, MEJS, JWPLAYER, DOJO, EMBER, TINYMCE, PLUPLOAD, JQUERY-MOBILE, CKEDITOR]
57. VERACODE
THE RESULTS: A LOOK AT DATA
SOME OF MY FAVORITE HTTP STATUS LINES
▸ HTTP 500 access denied ("java.io.FilePermission" "D:\home\XXXXXXXXX.com\ori\ModelGlue\unity\eventrequest\EventRequest.cfc" "read")
▸ HTTP 500 "Duplicate entry '1473335051' for key 'timestamp' SQL=INSERT INTO `#__zt_visitor_counter` (`id`,`timestamp`,`visits`,`guests`,`ipaddress`,`useragent`) VALUES (null, '1473335051', 1 , 1 , '54.208.81.16', 'chrome')"
▸ HTTP 500 "Server Made Big Boo"
59. VERACODE
THE RESULTS: A LOOK AT DATA
CONCLUSION
▸ Use NSQ, seriously.
▸ Concurrency can be difficult
▸ Batch data before inserting to DB
▸ If DB rows > a few million, consider sharding
▸ Test different types of table schema for performance
▸ Treat browsers like garbage and handle appropriately
60. VERACODE
THE RESULTS: A LOOK AT DATA
QUESTIONS?
▸ twitter: @_wirepair
▸ github: wirepair
▸ gcd: https://github.com/wirepair/gcd
▸ autogcd: https://github.com/wirepair/autogcd
▸ killface: https://github.com/wirepair/killface
▸ Thanks to all my coworkers supporting and listening to my daily rants!