AOL - Ian Holsman - Hadoop World 2010
1. ‘because data has needs’
Hadoop World, October 2010
Ian Holsman
The Data Layer
2. 2
Who Am I?
• Ian Holsman
• CTO of Relegence
• Started in open source in 2000 on the Apache Web Server
• Joined AOL in 2007
• I work in the ‘content’ side of AOL
3. 3
AOL has
• 3 large (>100 boxes) Hadoop clusters
• 1 in advertising
• 1 in search
• 1 in content
• I am talking today about the ‘content’ side of the house
4. 4
Agenda
• The Opportunity
• How we addressed it
• Unexpected benefits and issues we had
• What we are doing today
5. It started with a question
can we do better than a ‘top stories’ link?
6. 6
The Opportunity - circa 2008
• Get more information about our customers
• Increase recirculation
• Increase RPM of our pages
7. 7
Which we translated into
• Build a better ‘related’ page module
• initially site-specific
• but the plan was to make it site-wide
9. 9
How we addressed it
• Custom JavaScript injected onto the page so we can start measuring things
• Custom web server modules to handle cookies over multiple
domains
• Custom log-processing infrastructure to push data onto HDFS every 15 minutes
• Map-Reduce jobs to provide reports & create MySQL databases
• Built a co-visitation algorithm to produce related pages
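The talk names a co-visitation algorithm but gives no details; a minimal sketch of the general idea in Python, where the session data and page names are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

def co_visitation(sessions, top_n=3):
    """Count how often two pages appear in the same visitor session,
    then rank each page's most co-visited neighbours as 'related'."""
    pair_counts = defaultdict(int)
    for pages in sessions:
        for a, b in combinations(sorted(set(pages)), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1  # relation is symmetric
    related = defaultdict(list)
    for (a, b), n in pair_counts.items():
        related[a].append((n, b))
    return {page: [b for n, b in sorted(pairs, reverse=True)[:top_n]]
            for page, pairs in related.items()}

# Toy sessions: each list is the pages one anonymous visitor saw.
sessions = [
    ["/homes", "/mortgages", "/rentals"],
    ["/homes", "/mortgages"],
    ["/homes", "/rentals"],
]
print(co_visitation(sessions)["/homes"])  # pages most often seen with /homes
```

In a Map-Reduce setting the pair counting becomes the map phase (emit each page pair per session) and the ranking becomes the reduce phase.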
10. 10
Privacy
• We have tried our best to keep things anonymous from the
start
• We don’t track IP-level data; we translate IPs to WOEIDs
• So we can’t tell you (or governments) what a particular IP did
• It’s not perfect
• We avoided putting it on ‘sensitive’ sites
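The IP-to-WOEID translation above can be sketched like this; the real system used a custom web-server module and a geo database, so the prefix table here is purely hypothetical:

```python
# Hypothetical prefix-to-WOEID lookup standing in for a real geo database.
TOY_GEO_TABLE = {
    "203.0.113": 1105779,   # e.g. a Sydney-area WOEID
    "198.51.100": 2459115,  # e.g. a New York-area WOEID
}

def ip_to_woeid(ip, table=TOY_GEO_TABLE, unknown=1):
    """Translate an IP address to a coarse WOEID and discard the address,
    so logs never retain what a particular IP did."""
    prefix = ip.rsplit(".", 1)[0]
    return table.get(prefix, unknown)  # WOEID 1 is the 'Earth' fallback

print(ip_to_woeid("203.0.113.42"))  # only the WOEID, never the IP, is logged
```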
12. 12
Did it on the cheap
• 2-3 person project
• Grabbed 50 ‘spare’ machines that were lying around
• Installed hadoop
• Put our ‘beacon’ on a site (AOL real-estate)
• and away we went
• a ‘skunkworks’ with the blessing of the CIO
• minimized red-tape.
13. 13
in 2-3 months
• we had infrastructure up
• we were processing page views & uniques
• we installed the beacon on other ‘small’ sites
• we had ‘data’ and a proof of concept that was meaningful for
business owners
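Counting page views and uniques is the classic first aggregation job; a minimal Python sketch of what such a job computes, assuming a toy "url user-id" log format rather than AOL's actual log schema:

```python
from collections import defaultdict

def count_views_and_uniques(log_lines):
    """Per-URL aggregation: page views = total hits,
    uniques = distinct anonymous user IDs."""
    views = defaultdict(int)
    uniques = defaultdict(set)
    for line in log_lines:
        url, user_id = line.split()  # toy format: "<url> <anon_user_id>"
        views[url] += 1
        uniques[url].add(user_id)
    return {url: (views[url], len(uniques[url])) for url in views}

log = ["/homes u1", "/homes u1", "/homes u2", "/rentals u1"]
print(count_views_and_uniques(log))  # {'/homes': (3, 2), '/rentals': (1, 1)}
```

As a Map-Reduce job, the map phase emits (url, user_id) pairs and the reduce phase does the counting and set-sizing per URL.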
14. 14
We got people’s attention
• Started doing basic reports for Bebo
• 300M page views a day
• needed to move from skunkworks to a ‘real’ project.
15. 15
Major issues
• Hadoop
• Map-Reduce jobs were slow to write and inflexible
• Hadoop kept hanging; both the NameNode and our custom push jobs would stall
• Operations
• how to move from 0.18 to 0.19?
• Failing jobs meant we were getting paged, and restartability was never really designed in
• Felt like we were building our house on quicksand.
• we were running off factory defaults
• network wasn’t optimized at ALL
• People
• zero experience going in
• people were learning by doing.
• lots of new things made fault detection ‘interesting’
• our group started becoming a bottleneck
• Map-Reduce was hard to learn
17. 17
Operational issues
• Got ‘real’ machines
• put onto same switches/racks
• built the filesystem to better match how we used Hadoop
• upgraded to 0.19 at same time
• took 48 hours to migrate
• Spent some time listening to experts
• tuned our cluster a bit better
• removed developer access to the ‘hadoop’ user
• Still not a 100% “production” system
• but close enough for my liking
26. 26
The URL viewer
• Get stats about any URL
• Page views
• Google Searches
• Referrers
• Exits
• Custom parameters
• Geographic regions
• Have a similar tool for anonymous user IDs
32. 32
The current deliverables
• Get more information about our customers
• Increase recirculation
• Increase RPM of our pages
• Build metrics into our platform
• What works on pages
• How are we performing
• Build intelligence on the page
• Collaborative filtering
• Product recommendations
• Top-K type lists
• Make it closer to real time
• not the focus of this talk
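The Top-K lists mentioned above can be sketched with a counter and a heap; the click stream and story names here are invented for illustration:

```python
import heapq
from collections import Counter

def top_k(events, k=2):
    """Produce a Top-K list (e.g. most-viewed stories) from an event stream."""
    counts = Counter(events)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

clicks = ["story-a", "story-b", "story-a", "story-c", "story-a", "story-b"]
print(top_k(clicks))  # [('story-a', 3), ('story-b', 2)]
```

Moving this closer to real time mostly means maintaining the counts incrementally instead of recomputing them in batch.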
33. 33
What data are we processing?
• Beacon Web servers
• Tracking beacon injected into the HTML page via custom javascript
• Tracks
• Page views
• Page clicks
• Custom events that the content developer wants
• Tracks standard things like referrers, and user agents, and Location
• Developer can add custom parameters to tell us about the page
• Needed to write a custom module to generate anonymous user IDs and do third-party domain tracking
• Custom module to map IP addresses to geographic WOEID-based locations
• Ad impressions
• User viewed a campaign
• Integrated with the campaign manager to determine actual revenue
• URL context (through Relegence)
• We can determine who and what an article is about
• through Relegence, similar to what OpenCalais does
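A beacon request carrying the standard fields plus developer-supplied custom parameters might look like this sketch; the endpoint, field names, and values are hypothetical, not AOL's actual beacon:

```python
from urllib.parse import urlencode

def beacon_url(page_url, user_id, custom_params=None):
    """Build a tracking-beacon request: standard fields (page URL,
    anonymous user ID) plus any custom parameters the content
    developer adds to describe the page."""
    params = {"url": page_url, "uid": user_id}
    params.update(custom_params or {})
    return "https://beacon.example.com/b.gif?" + urlencode(params)

print(beacon_url("/real-estate/listing/123", "anon-42",
                 {"listing_type": "rental"}))
```

Referrer, user agent, and location would normally come from request headers on the server side rather than from the beacon query string.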