[CB16] Around the Web in 80 Hours: Scalable Fingerprinting with Chromium Automation by Isaac Dawson
1. AROUND THE WEB IN 80 HOURS: SCALABLE FINGERPRINTING WITH CHROMIUM AUTOMATION
ISAAC DAWSON, VERACODE
2. ABOUT ME:
▸ Previously at @stake, Symantec (10 years)
▸ Moved into research role at Veracode, Inc. (6 years)
▸ Living in Japan for 12 years
▸ I <3
3. IT ALL STARTED IN 2012…
4. SECURITY HEADER SCANNING HISTORY
▸ All scanners use the Alexa Top 1 Million URLs
▸ Galexa (November 2012 - March 2014)
▸ Golexa (March 2014 - February 2016)
▸ Creeper v0-v1 (February 2016 - July 2016)
▸ Creeper v2 (July 2016 - …)
6. SUMMARY OF SYSTEMS & COMPONENTS
▸ Admin (x1) - Manages jobs
▸ Agents (x50) - Analyze URLs
▸ DB Writers (x4) - Feed analysis data into the DB & S3
▸ Database (x1) - PostgreSQL 9.5 DB
▸ NSQ - A message queue for URLs, reports and responses
▸ S3 - Stores serialized DOM and HTML/JS
7. THE MESSAGE QUEUE - NSQD, NSQLOOKUPD
▸ NSQ is an easy-to-deploy message queue
▸ JSON messages between all systems
▸ All agents point to the Admin service running nsqlookupd
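A minimal sketch of what one of those JSON messages might look like in Go; the `UrlMessage` fields here are illustrative assumptions, not the actual Creeper schema:

```go
package main

import "encoding/json"

// UrlMessage is a hypothetical shape for the JSON messages the
// slides describe flowing between Admin, Agents, and DB Writers.
// Field names are illustrative, not the real Creeper types.
type UrlMessage struct {
	JobID int    `json:"job_id"`
	URLID int64  `json:"url_id"`
	URL   string `json:"url"`
	Retry int    `json:"retry"`
}

// encodeURL serializes a message for publishing to an NSQ topic.
func encodeURL(m UrlMessage) ([]byte, error) {
	return json.Marshal(m)
}

// decodeURL parses a message body received from an NSQ channel.
func decodeURL(body []byte) (UrlMessage, error) {
	var m UrlMessage
	err := json.Unmarshal(body, &m)
	return m, err
}
```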
8. HELPFUL NSQ FEATURES
// Create a consumer for the job's URL topic
c.urlConsumer, err = nsq.NewConsumer(job.Topics["url"],
	creeper_types.UrlChannel, cfg)

// Process numBrowsers messages concurrently (7)
c.urlConsumer.AddConcurrentHandlers(
	nsq.HandlerFunc(c.processUrls),
	numBrowsers)

// Job taking too long to handle/process a message?
msg.Touch() // notify NSQ we are still working on this message

// Need to requeue because chrome crashed?
msg.RequeueWithoutBackoff(-1)

// Need to change the max # of in-flight messages?
c.urlConsumer.ChangeMaxInFlight(c.getInflightCount())
9. DATA STORAGE
DATAFLOW
[Diagram: the Admin service feeds URLs to the Agents; the Agents send reports to the DB Writers, which store data in the DB and S3]
12. CREEPER AGENTS: GETTING THE DATA
BROWSER AUTOMATION REQUIREMENTS
▸ Automatable
▸ Fast
▸ Capture network
▸ Capture various browser events (CSP violations)
▸ Inject JavaScript
13. CHOSE CHROME, FOR OBVIOUS REASONS…
▸ Each agent runs 3-6 tabs concurrently
▸ Runs headless via Xvfb
▸ Can get full read access to network response data
▸ Easily inject JavaScript
▸ Can subscribe to console messages
17. GCD
▸ GCD generates Go code using templates
▸ Remote access to debugger events, functions, types.
▸ Can be updated easily as the protocol files change
18. GCD WAS GOOD BUT…
▸ Needed something better
▸ Built autogcd to automate:
▸ Trapping console messages
▸ Intercepting network data
▸ Injecting JS
▸ Took some inspiration from WebDriver
21. INJECTING JAVASCRIPT
▸ Extract JS libraries and versions
▸ Retire.js and Wappalyzer have some good pointers
▸ Created a JSON file with 86 frameworks
▸ Must wait for the page to be fully loaded
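A hedged sketch of what one entry in such a framework file could look like; the `Library` struct mirrors the `Key`/`Statement` fields used by the injection loop on the next slide, but the JSON layout and the sample probe are assumptions, not the actual 86-framework file from the talk:

```go
package main

import "encoding/json"

// Library holds one framework fingerprint: a key naming the
// framework and a JS statement evaluated inside the page.
// (Field layout is an illustrative assumption.)
type Library struct {
	Key       string `json:"key"`       // e.g. "jquery"
	Statement string `json:"statement"` // JS returning the version, or ""
}

type JsLibraries struct {
	Libraries []Library `json:"libraries"`
}

// A sample entry: probe for jQuery and return its version string.
var sample = []byte(`{"libraries":[
  {"key":"jquery",
   "statement":"typeof jQuery !== 'undefined' ? jQuery.fn.jquery : ''"}
]}`)

// loadLibraries parses the framework-fingerprint file.
func loadLibraries(data []byte) (*JsLibraries, error) {
	libs := &JsLibraries{}
	err := json.Unmarshal(data, libs)
	return libs, err
}
```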
23. INJECTING JAVASCRIPT - INJECTING
for _, library := range JsLibs.Libraries {
	res, err := b.ExecuteScript(library.Statement)
	if err == nil && string(res) != "" {
		log.Printf("%s library result was: %s\n",
			library.Key,
			string(res))
		report.JavaScriptLibraries[library.Key] = string(res)
	}
}
24. INJECTING JAVASCRIPT - WHEN IS A PAGE DONE?
▸ DOMContentLoaded doesn’t handle dynamically loaded JS
▸ Listen for DOM change events
▸ Page loaded if no DOM change events occur for > 2 seconds
▸ Timeout after 5 seconds
28. CHALLENGES - CHROME BUG #1
▸ Turns out opening tabs excessively can cause tabs to stop responding to the debugger protocol
29. CHALLENGES - CHROME BUG #1 - SOLUTION
▸ Mark tabs as ‘dead’
▸ If the max dead tab count is reached, drain active URLs and kill chrome
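A minimal sketch of the dead-tab bookkeeping described above; `tabPool`, `maxDead`, and the `restart` hook are hypothetical names for illustration, with `restart` standing in for the drain-and-kill-chrome step:

```go
package main

// tabPool tracks tabs that stop answering the debugger protocol.
// Once the dead-tab budget is exhausted, the agent drains active
// URLs and restarts Chrome (represented here by the restart hook).
type tabPool struct {
	dead    int
	maxDead int
	restart func() // drains active URLs, kills and relaunches chrome
}

// markDead records one unresponsive tab and triggers a restart
// when the maximum dead-tab count is reached.
func (p *tabPool) markDead() {
	p.dead++
	if p.dead >= p.maxDead {
		p.restart()
		p.dead = 0
	}
}
```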
31. CHALLENGES - CHROME BUG #2 - CRASHSAFARI.COM
▸ Would completely kill chrome *and* agent
▸ Lost all active tabs
▸ This site cost me about 2-3 weeks development time
32. CHALLENGES - CRASHSAFARI.COM - SOLUTION
▸ Created killface package
▸ Sends a notification to stop active work
▸ Worker count dynamically adjusted to 1
▸ Pauses queue, runs all unfinished URLs again
▸ Once active count is 0, restart normally
33. OTHER CHALLENGES
✘ NSQ messages too large, zipping ineffective
✓ Split response data/report data
✘ Sites block AWS IP ranges (craigslist.com etc.)
☹ Timeout…
✘ Concurrency issues
✓ Very careful use of goroutines, channels and timers
✘ Site analysis failures/timeouts
✓ Try 3 times, keep track of retry state
✓ During retry, open a new browser and work on an additional URL
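The "try 3 times, keep track of retry state" point can be sketched as a small per-URL tracker; the names here are illustrative, not the actual Creeper code:

```go
package main

// maxAttempts matches the "try 3 times" rule from the slides.
const maxAttempts = 3

// retryState tracks how many times each URL's analysis has failed.
type retryState struct {
	attempts map[string]int
}

func newRetryState() *retryState {
	return &retryState{attempts: make(map[string]int)}
}

// shouldRetry records a failed attempt for url and reports whether
// it is still within the three-attempt budget.
func (r *retryState) shouldRetry(url string) bool {
	r.attempts[url]++
	return r.attempts[url] < maxAttempts
}
```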
35. DB WRITERS: STORING THE DATA
PREVIOUSLY…
▸ Creeper v0 had many problems
▸ RDS did not support PostgreSQL 9.5
▸ Duplicate data
▸ For v1, wrote to disk, SHA1 of contents:
▸ /job/files/5/a/b/c/5abcfbe73e39e0572a939b09f1eb16d7.html
▸ v1 did not shard database tables
▸ Database tables were normalized
▸ Lock contention
37. CHALLENGES - GETTING THE DATA IN QUICKLY
▸ Get the data out of the DB writers as soon as possible
▸ Careful to not overload the database with many connections
▸ Reduce lock contention for writing
38. SOLUTION #1 - GETTING THE DATA IN QUICKLY
▸ DB Writers batch up reports and responses
▸ Inserted every 2.5-3.5 seconds
▸ Reduces number of required DB connections
39. SOLUTION #1 - BATCHER
func (b *Batcher) AddReport(r *creeper_types.CreeperReport) {
	b.reportPool <- r
	atomic.AddInt32(&b.reportCount, 1)
}

func (b *Batcher) EmptyReports() []*creeper_types.CreeperReport {
	reports := make([]*creeper_types.CreeperReport, 0)
	for {
		select {
		case report := <-b.reportPool:
			reports = append(reports, report)
		default:
			// pool drained
			return reports
		}
	}
}
40. SOLUTION #2 - GETTING THE DATA IN QUICKLY
▸ Insert into a temporary table using COPY FROM
▸ Extracted from the temporary table and INSERTed into the final table. This allows for UPSERTs:
INSERT INTO header_names (header_name)
SELECT responses_tmp.header_name FROM responses_tmp
ON CONFLICT DO NOTHING;
41. CHALLENGES - LARGE TABLES
▸ INSERT INTO … FROM SELECT … on a table with 80,000,000 rows
▸ As tables got bigger, DB writers slowed down
▸ This is not scalable
42. SOLUTION - TABLE SHARDING
▸ Much like sharding for the file system
▸ Requires a key:
▸ URL ID. (Ex: 1,google.com 2,microsoft.com etc)
▸ Only large tables require sharding
44. CREATING A SHARD KEY
▸ Choose the number of times to shard your tables:
▸ shardKey = input_id % 32
▸ Created PL/pgSQL functions:
create unlogged table if not exists job_0_responses (
response_id serial primary key,
input_id integer not null,
body_hash varchar(64) not null,
resp_url bytea not null,
resp_uuid varchar(64) unique not null,
resp_type_id integer references resp_types (resp_type_id) not null,
status_id integer references status_lines (status_id) not null,
status_code integer,
mime_type_id integer references mime_types (mime_type_id) not null,
response_time bigint
);
EXECUTE merge_headers(job, shardKey)
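The `input_id % 32` shard key maps each URL to one of the `job_<shard>_responses` tables in the DDL above; a sketch of the lookup on the writer side (helper name assumed):

```go
package main

import "fmt"

// shardCount matches the slides: shardKey = input_id % 32.
const shardCount = 32

// shardTable returns the sharded responses table a given URL's
// input_id is written to, following the job_<shard>_responses
// naming used in the DDL.
func shardTable(inputID int) string {
	return fmt.Sprintf("job_%d_responses", inputID%shardCount)
}
```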
45. CONS WITH SHARDING
▸ Added complexity for querying
▸ Best to create a new table with all data for reporting
▸ In the future, may use Citus for sharding across multiple databases
47. MOVING TO S3
▸ S3 limits 100/rps, but pushing 200-2000/rps
▸ Had to contact support
▸ Exponential backoff, retry 10 times
▸ Hash is stored in response table
▸ HeadObject first to check existence, then PutObject
▸ HeadObjects are way cheaper
48. LASTLY…
▸ Created unlogged tables
▸ Modified PostgreSQL configuration:
▸ Set checkpoint interval to 5 minutes (max) instead of 1
▸ Enabled fsync
▸ Set max_wal_size to 256
50. THE RESULTS: A LOOK AT DATA
SCAN STATISTICS
Responses 72,193,155
Headers 525,385,900
JS Results 1,943,925
URLs w/Errors 67,315
Redirected to HTTPS 145,268
URLS w/CSP Violations 740
Scan Time 15 Hours
Cost $343 / ¥35,063
51. CSP VIOLATIONS
▸ 722 out of 4965 sites using CSP had violations
▸ Security sites:
▸ https://www.globalsign.com/en/, http://secunia.com/
▸ https://lastpass.com/, https://www.avant.com/, http://www.veracode.com/
▸ Well known organizations:
▸ http://www.alibaba.com, https://www.doubleclickbygoogle.com
▸ https://mozillians.org/en-US/
52. SUM OF CSP VIOLATION TYPES
[Bar chart, violations per CSP directive (y-axis 0-3000): SCRIPT-SRC, IMG-SRC, FRAME-SRC, FONT-SRC, STYLE-SRC, CONNECT-SRC, MEDIA-SRC, CHILD-SRC, OBJECT-SRC, BASE-URI, FORM-ACTION, MANIFEST-SRC]
53. TOP JAVASCRIPT LIBRARIES > 3000
[Bar chart, detection counts (y-axis 0-800,000): JQUERY, JQUERY-UI, MODERNIZR, JQUERY-UI-DIALOG, YEPNOPE, JQUERY-UI-AUTOCOMPLETE, JQUERY-UI-TOOLTIP, BOOTSTRAP, HTML5SHIV, UNDERSCORE, JQUERY.PRETTYPHOTO, PROTOTYPEJS, DRUPAL, MOOTOOLS, MEJS, BACKBONE.JS, ANGULARJS, FOUNDATION, JWPLAYER, REQUIREJS, HANDLEBARS.JS, HAMMERJS, JPLAYER, MUSTACHE.JS, SCRIPTACULOUS, SHADOWBOX, ZEROCLIPBOARD, YUI, RAPHAEL, DATATABLES, KNOCKOUT]
54. JAVASCRIPT ‘NEXTGEN’ FRAMEWORKS > 100
[Bar chart, detection counts (y-axis 0-18,000): BACKBONE.JS, ANGULARJS, FOUNDATION, YUI, KNOCKOUT, DOJO, REACTJS, MARIONETTEJS, VUEJS, EMBER, METEOR, MITHRIL, EXTJS, POLYMER]
55. VULNERABILITY COUNTS
[Bar chart, vulnerable library instances (y-axis 0-80,000): JQUERY, JQUERY-UI-DIALOG, JQUERY.PRETTYPHOTO, ANGULARJS, JQUERY-UI-TOOLTIP, JPLAYER, HANDLEBARS.JS, ZEROCLIPBOARD, MUSTACHE.JS, YUI, PROTOTYPEJS, MEJS, JWPLAYER, DOJO, EMBER, TINYMCE, PLUPLOAD, JQUERY-MOBILE, CKEDITOR]
57. SOME OF MY FAVORITE HTTP STATUS LINES
▸ HTTP 500 access denied ("java.io.FilePermission" "D:\home\XXXXXXXXX.com\ori\ModelGlue\unity\eventrequest\EventRequest.cfc" "read")
▸ HTTP 500 "Duplicate entry '1473335051' for key 'timestamp' SQL=INSERT INTO `#__zt_visitor_counter` (`id`,`timestamp`,`visits`,`guests`,`ipaddress`,`useragent`) VALUES (null, '1473335051', 1 , 1 , '54.208.81.16', 'chrome')"
▸ HTTP 500 "Server Made Big Boo"
59. CONCLUSION
▸ Use NSQ, seriously.
▸ Concurrency can be difficult
▸ Batch data before inserting to DB
▸ If DB rows > a few million, consider sharding
▸ Test different types of table schema for performance
▸ Treat browsers like garbage and handle appropriately
60. QUESTIONS?
▸ twitter: @_wirepair
▸ github: wirepair
▸ gcd: https://github.com/wirepair/gcd
▸ autogcd: https://github.com/wirepair/autogcd
▸ killface: https://github.com/wirepair/killface
▸ Thanks to all my coworkers for supporting and listening to my daily rants!