‘because data has needs’
Hadoop World, October 2010
Ian Holsman
The Data Layer
Who Am I?
• Ian Holsman
• CTO of Relegence
• Started in open source in 2000 on the Apache Web Server
• Joined AOL in 2007
• I work in the ‘content’ side of AOL
AOL has
• 3 large (>100 boxes) Hadoop clusters
• 1 in advertising
• 1 in search
• 1 in content
• I am talking today about the ‘content’ side of the house
Agenda
• The Opportunity
• How we addressed it
• Unexpected benefits and issues we had
• What we are doing today
It started with a question
Can we do better than a ‘top stories’ link?
The Opportunity - circa 2008
• Get more information about our customers
• Increase recirculation
• Increase RPM of our pages
Which we translated into
• Build a better ‘related’ page module
• initially site-specific
• but the plan was to make it site-wide
How we addressed it
• Custom JavaScript injected onto the page so we can start measuring things
• Custom web server modules to handle cookies over multiple domains
• Custom log processing infrastructure to push data onto HDFS every 15 minutes
• MapReduce jobs to provide reports & create MySQL databases
• Built a co-visitation algorithm to produce related pages
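The slides do not show the co-visitation algorithm itself. A minimal sketch of the general idea, assuming the input is a stream of (anonymous user ID, URL) pairs, could look like the following. The function names, the sample data, and the tie-breaking rule are illustrative assumptions, not AOL's implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def covisitation(visits):
    """Count how often two pages were viewed by the same anonymous user.

    `visits` is an iterable of (user_id, url) pairs (hypothetical input shape).
    """
    pages_by_user = defaultdict(set)
    for user_id, url in visits:
        pages_by_user[user_id].add(url)

    pair_counts = defaultdict(Counter)
    for pages in pages_by_user.values():
        # Every unordered pair of pages seen by one user counts as one co-visit.
        for a, b in combinations(sorted(pages), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def related_pages(pair_counts, url, k=3):
    """Return up to k pages most often co-visited with `url`."""
    ranked = sorted(pair_counts.get(url, {}).items(),
                    key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:k]]


# Made-up sample data, for illustration only.
visits = [
    ("u1", "/a"), ("u1", "/b"), ("u1", "/c"),
    ("u2", "/a"), ("u2", "/b"),
    ("u3", "/a"), ("u3", "/c"),
    ("u4", "/b"),
]
counts = covisitation(visits)
print(related_pages(counts, "/a"))
```

At cluster scale the same counting would be split across map and reduce steps; the in-memory version only shows the logic.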
Privacy
• We have tried our best to keep things anonymous from the start
• We don’t track IP-level data; we translate IPs to WOEIDs (Where On Earth IDs)
• So we can’t tell you (or governments) what a particular IP did
• It’s not perfect
• We avoided putting it on ‘sensitive’ sites
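One way to picture the IP-to-location step is a lookup that replaces the address with a coarse location ID before the record is stored. This is only a sketch under stated assumptions: the prefix table, the numeric IDs (placeholders, not real WOEIDs), and the record fields are all hypothetical.

```python
import ipaddress

# Hypothetical prefix-to-location table; the IDs are placeholders, not real WOEIDs.
PREFIX_TO_LOCATION = {
    ipaddress.ip_network("192.0.2.0/24"): 1001,
    ipaddress.ip_network("198.51.100.0/24"): 1002,
}
UNKNOWN_LOCATION = 0


def anonymize(record):
    """Return a copy of the log record with the IP replaced by a location ID."""
    ip = ipaddress.ip_address(record["ip"])
    location = UNKNOWN_LOCATION
    for network, location_id in PREFIX_TO_LOCATION.items():
        if ip in network:
            location = location_id
            break
    # Drop the raw address so it never reaches downstream storage.
    cleaned = {key: value for key, value in record.items() if key != "ip"}
    cleaned["location_id"] = location
    return cleaned


raw = {"ip": "192.0.2.17", "url": "/home"}
print(anonymize(raw))
```

Because the raw address is discarded at this step, later jobs can only see the coarse location, which matches the point that per-IP activity cannot be reconstructed.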
Initial architecture (July 2008)
Did it on the cheap
• 2-3 person project
• Grabbed 50 ‘spare’ machines that were lying around
• Installed Hadoop
• Put our ‘beacon’ on a site (AOL real-estate)
• and away we went
• a ‘skunkworks’ with the blessing of the CIO
• minimized red-tape.
In 2-3 months
• we had infrastructure up
• we were processing page views & uniques
• we installed the beacon on other ‘small’ sites
• we had ‘data’ and a proof of concept that was meaningful for business owners
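Counting page views and uniques maps naturally onto a map step and a reduce step. The sketch below imitates that shape in plain Python; the tab-separated "user_id, url" log format is an assumption made for illustration, not the actual beacon log layout.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Emit (url, user_id) for each tab-separated 'user_id<TAB>url' line."""
    for line in log_lines:
        user_id, url = line.rstrip("\n").split("\t")
        yield url, user_id


def reduce_phase(pairs):
    """Group by url and return {url: (page_views, unique_users)}."""
    users_by_url = defaultdict(list)
    for url, user_id in pairs:
        users_by_url[url].append(user_id)
    return {url: (len(users), len(set(users)))
            for url, users in users_by_url.items()}


# Made-up log lines, for illustration only.
log = ["u1\t/a", "u1\t/a", "u2\t/a", "u2\t/b"]
stats = reduce_phase(map_phase(log))
print(stats)
```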
We got people’s attention
• Started doing basic reports for Bebo
• 300M page views a day
• Needed to move from skunkworks to a ‘real’ project
Major issues
• Hadoop
• MapReduce was slow to write and inflexible
• Hadoop kept hanging: both the NameNode and our custom push jobs would stall
• Operations
• How to move from 0.18 to 0.19?
• Failing jobs meant we were getting paged, and restartability was never really designed in
• Felt like we were building our house on quicksand
• We were running off factory defaults
• Network wasn’t optimized at all
• People
• Zero experience going in
• People were learning by doing
• Lots of new things made fault detection ‘interesting’
• Our group started becoming a bottleneck
• MapReduce was hard to learn
Operational Issues
• Got ‘real’ machines
• put onto same switches/racks
• built the filesystem to better match how we used Hadoop
• upgraded to 0.19 at the same time
• took 48 hours to migrate
• Spent some time listening to experts
• tuned our cluster a bit better
• removed developer access to the ‘hadoop’ user
• Still not a 100% “production” system
• but close enough for my liking
Then Yahoo open-sourced Pig
Pig fixed a lot of ‘people’ issues
• easy to use
• didn’t require much training
• enabled the system to be used by ‘regular’ channel developers
We felt like Alice going down the rabbit hole
http://www.flickr.com/photos/spam/3355824586/
The data unlocked innovations
• The channel developers knew their data
• They used it in ways we never expected
Built off the data
Some cool tools
The heatmap tool
Aol’s Traffic Exchange
The URL viewer
• Get stats about any URL
• Page views
• Google Searches
• Referrers
• Exits
• Custom parameters
• Geographic regions
• Have a similar tool for anonymous user IDs
Using simple aggregation techniques and Mahout
Some useful applications
Shopping recommendations
• Utilizes Mahout
• Utilizes custom parameters
• Better click-through rate than external vendors
• Can apply the technique to any product-based channel
User recommendations
{
"algo": "recoByPVPartDay",
"UnauthId": "007e3dc60bbe11dfadba39f9fdfe11b5#2",
"url-pv-Info": [
{
"pv": 54.0,
"url": "joystiq.com/2010/07/14/sengoku-basara-samurai-heroes-sticks-six-swords-into-north-amer"
},
{
"pv": 49.0,
"url": "joystiq.com/2010/07/14/maxis-hiring-development-director-for-online-simulation-game"
},
{
"pv": 35.0,
"url": "joystiq.com/2010/07/14/how-to-play-sin-and-punishment-star-successor"
},
{
"pv": 10.0,
"url": "news.bigdownload.com/2010/07/14/bioware-co-founder-were-working-on-smaller-mmo-type-games"
},
{
"pv": 3.0,
"url": "news.bigdownload.com/2010/07/14/natural-selection-2-alpha-test-coming-july-26-for-special-editio"
}
]
}
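A consumer of a payload like the one above might roll the per-URL scores up by site. The sketch below assumes that "pv" is a page-view-derived weight and that the site is everything before the first "/" in "url"; neither is stated on the slide. It uses a trimmed copy of the sample with three of the five entries.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample payload from the slide (three of five entries).
payload = json.loads("""
{
  "algo": "recoByPVPartDay",
  "url-pv-Info": [
    {"pv": 49.0, "url": "joystiq.com/2010/07/14/maxis-hiring-development-director-for-online-simulation-game"},
    {"pv": 35.0, "url": "joystiq.com/2010/07/14/how-to-play-sin-and-punishment-star-successor"},
    {"pv": 10.0, "url": "news.bigdownload.com/2010/07/14/bioware-co-founder-were-working-on-smaller-mmo-type-games"}
  ]
}
""")


def pv_by_site(reco):
    """Sum the 'pv' values per site (the part of 'url' before the first '/')."""
    totals = defaultdict(float)
    for item in reco["url-pv-Info"]:
        site = item["url"].split("/", 1)[0]
        totals[site] += item["pv"]
    return dict(totals)


print(pv_by_site(payload))
```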
User Interests
{
"tags": [
{
"tag": "Video games",
"score": 435.0
},
{
"tag": "Internet search engines",
"score": 96.0
},
{
"tag": "Internet",
"score": 84.0
}
],
"unauthId": "007e3dc60bbe11dfadba39f9fdfe11b5"
}
Where we are today
The current deliverables
• Get more information about our customers
• Increase recirculation
• Increase RPM of our pages
• Build metrics into our platform
• What works on pages
• How are we performing
• Build intelligence on the page
• Collaborative filtering
• Product recommendations
• Top-K type lists
• Make it closer to real time
• not the focus of this talk
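For the ‘Top-K type lists’ item above, a small sketch of the idea: count views per URL and keep only the K largest. The slides do not say how the lists were computed, so the use of heapq here and the sample data are assumptions for illustration.

```python
import heapq
from collections import Counter


def top_k(urls, k):
    """Return the k most viewed URLs as (url, views) pairs, largest first."""
    counts = Counter(urls)
    # nlargest avoids fully sorting every URL when k is small.
    return heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], kv[0]))


# Made-up page views, for illustration only.
views = ["/a", "/b", "/a", "/c", "/a", "/b"]
print(top_k(views, 2))
```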
What data are we processing?
• Beacon Web servers
• Tracking beacon injected into the HTML page via custom JavaScript
• Tracks
• Page views
• Page clicks
• Custom events that the content developer wants
• Tracks standard things like referrers, user agents, and location
• Developer can add custom parameters to tell us about the page
• Needed a custom module to generate anonymous user IDs and handle third-party domain tracking
• Custom module to map IP addresses to geographic WOEID-based locations
• Ad impressions
• User viewed a campaign
• Integrate it with the campaign manager to determine actual revenue
• URL context (through Relegence)
• We can determine who and what an article is about
• Through Relegence, similar to what OpenCalais does
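To make the beacon data concrete, here is a sketch of turning one beacon request into an event record that keeps standard fields apart from developer-supplied custom parameters. The parameter names (event, url, ref), the beacon host, and the record layout are hypothetical; the real beacon format is not shown in the slides.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlsplit

# Hypothetical names for the standard beacon parameters.
STANDARD_FIELDS = {"event", "url", "ref"}


@dataclass
class BeaconEvent:
    event: str
    url: str
    referrer: str = ""
    custom: dict = field(default_factory=dict)


def parse_beacon(request_url):
    """Split a beacon request's query string into standard and custom fields."""
    query = dict(parse_qsl(urlsplit(request_url).query))
    custom = {k: v for k, v in query.items() if k not in STANDARD_FIELDS}
    return BeaconEvent(
        event=query.get("event", "pageview"),
        url=query.get("url", ""),
        referrer=query.get("ref", ""),
        custom=custom,
    )


hit = parse_beacon(
    "http://beacon.example/b.gif?event=click&url=%2Fnews%2F1&ref=%2Fhome&channel=games"
)
print(hit)
```

Keeping custom parameters in their own dictionary means a channel developer can add new ones without changing the record layout.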
The Data Layer Infrastructure today

 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 


AOL - Ian Holsman - Hadoop World 2010

  • 1. ‘because data has needs’ Hadoop World, October 2010 Ian Holsman The Data Layer
  • 2. Who Am I? • Ian Holsman • CTO of Relegence • Started in open source in 2000 on the Apache Web Server • Joined AOL in 2007 • I work in the ‘content’ side of AOL
  • 3. AOL has • 3 large (>100 boxes) hadoop clusters • 1 in advertising • 1 in search • 1 in content • I am talking today about the ‘content’ side of the house
  • 4. Agenda • The Opportunity • How we addressed it • Unexpected benefits and issues we had • What we are doing today
  • 5. It started with a question: can we do better than a ‘top stories’ link?
  • 6. The Opportunity - circa 2008 • Get more information about our customers • Increase recirculation • Increase RPM of our pages
  • 7. Which we translated into • Build a better ‘related’ page module • initially site-specific • but the plan was to make it site-wide
  • 9. How we addressed it • Custom Javascript injected onto the page so we can start measuring things • Custom web server modules to handle cookies over multiple domains • Custom log processing infrastructure to push data onto HDFS every 15m • Map-Reduce jobs to provide reports & create MySQL databases • Built a co-visitation algorithm to produce related pages
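The talk does not describe the co-visitation algorithm itself. As a minimal sketch of the general idea, assuming page views have already been grouped by anonymous visitor, two pages can be treated as related when the same visitors tend to view both. The function names and the input shape here are illustrative assumptions, not the AOL implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def covisitation(sessions):
    """Count, for every pair of URLs, how many visitors viewed both.

    `sessions` maps an anonymous visitor id to the list of URLs viewed
    (a hypothetical input shape chosen for this sketch).
    """
    pair_counts = defaultdict(Counter)
    for urls in sessions.values():
        # De-duplicate so a visitor counts once per pair of pages.
        for a, b in combinations(sorted(set(urls)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return pair_counts


def related(pair_counts, url, k=3):
    """Return up to k URLs most often co-visited with `url`."""
    return [other for other, _ in pair_counts[url].most_common(k)]


sessions = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
}
counts = covisitation(sessions)
print(related(counts, "a", k=2))
```

At cluster scale the same counting would be split across a map step (emit URL pairs per visitor) and a reduce step (sum per pair); the in-memory version above only shows the logic.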
  • 10. Privacy • We have tried our best to keep things anonymous from the start • We don’t track IP-level data, we translate to WOEIDs • So we can’t tell you (or governments) what a particular IP did • It’s not perfect • We avoided putting it on ‘sensitive’ sites
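A rough sketch of the "translate to WOEIDs" step, assuming a lookup table from network prefix to WOEID is available. The table, the prefixes (reserved documentation ranges), the WOEID values and the record fields are all placeholders invented for illustration; the talk only says a custom module does the mapping.

```python
import ipaddress

# Hypothetical prefix-to-WOEID table. The prefixes are reserved
# documentation ranges and the WOEID values are placeholders.
PREFIX_TO_WOEID = {
    ipaddress.ip_network("192.0.2.0/24"): 2459115,
    ipaddress.ip_network("198.51.100.0/24"): 44418,
}


def anonymize(record):
    """Replace the raw IP in a log record with a coarse WOEID.

    The IP is dropped from the returned record, so nothing stored
    downstream can be tied back to a particular address.
    """
    ip = ipaddress.ip_address(record["ip"])
    woeid = next(
        (w for net, w in PREFIX_TO_WOEID.items() if ip in net), None
    )
    cleaned = {k: v for k, v in record.items() if k != "ip"}
    cleaned["woeid"] = woeid
    return cleaned


print(anonymize({"ip": "192.0.2.7", "url": "joystiq.com/"}))
```

Doing the translation before the record is written means the raw address never reaches HDFS, which is what makes the "we can't tell you what a particular IP did" claim hold.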
  • 12. Did it on the cheap • 2-3 person project • Grabbed 50 ‘spare’ machines that were lying around • Installed hadoop • Put our ‘beacon’ on a site (AOL real-estate) • and away we went • a ‘skunkworks’ with the blessing of the CIO • minimized red-tape.
  • 13. in 2-3 months • we had infrastructure up • we were processing page views & uniques • we installed the beacon on other ‘small’ sites • we had ‘data’ and a proof of concept that was meaningful for business owners
  • 14. We got people’s attention • Start doing basic reports for bebo • 300m PV’s a day • needed to move from skunkworks to a ‘real’ project.
  • 15. Major issues
      • Hadoop
          • Map-reduce was slow to write and inflexible
          • Hadoop kept on hanging, both the name server and our custom push jobs would stall
      • Operations
          • how to move from 0.18 to 0.19?
          • Jobs failing meant we were getting paged, and restart-ability was never really designed
          • Felt like we were building our house on quicksand.
          • we were running off factory-defaults
          • network wasn’t optimized at ALL
      • People
          • zero experience going in
          • people were learning by doing.
          • lots of new things made fault detection ‘interesting’
          • our group started becoming a bottleneck
          • Map reduce hard to learn
  • 17. Operational issues
      • Got ‘real’ machines
          • put onto same switches/racks
          • built the filesystem to better match how we used hadoop
          • upgraded to 0.19 at same time
          • took 48 hours to migrate
      • Spent some time listening to experts
          • tuned our cluster a bit better
          • removed developer access to the ‘hadoop’ user
      • Still not a 100% “production” system
          • but close enough for my liking
  • 18. then Yahoo open sourced ‘PIG’
  • 19. PIG fixed a lot of ‘people’ issues • easy to use • didn’t require much training • enabled the system to be used by ‘regular’ channel developers
  • 20. We felt like Alice going through the rabbit hole http://www.flickr.com/photos/spam/3355824586/
  • 21. The data unlocked innovations • The channel developers knew their data • They used it in ways we never expected
  • 22. Some cool tools: built off the data
  • 26. The URL viewer • Get stats about any URL • Page views • Google Searches • Referrers • Exits • Custom parameters • Geographic regions • Have similar tool for anonymous userIDs
  • 27. Some useful applications: using simple aggregation techniques and Mahout
  • 28. Shopping recommendations • Utilizes Mahout • Utilizes custom parameters • Better click through rate than external vendors • Can apply technique to any product-based channel
  • 29. User recommendations
      {
        "algo": "recoByPVPartDay",
        "UnauthId": "007e3dc60bbe11dfadba39f9fdfe11b5#2",
        "url-pv-Info": [
          { "pv": 54.0, "url": "joystiq.com/2010/07/14/sengoku-basara-samurai-heroes-sticks-six-swords-into-north-amer" },
          { "pv": 49.0, "url": "joystiq.com/2010/07/14/maxis-hiring-development-director-for-online-simulation-game" },
          { "pv": 35.0, "url": "joystiq.com/2010/07/14/how-to-play-sin-and-punishment-star-successor" },
          { "pv": 10.0, "url": "news.bigdownload.com/2010/07/14/bioware-co-founder-were-working-on-smaller-mmo-type-games" },
          { "pv": 3.0, "url": "news.bigdownload.com/2010/07/14/natural-selection-2-alpha-test-coming-july-26-for-special-editio" }
        ]
      }
  • 30. User Interests
      {
        "tags": [
          { "tag": "Video games", "score": 435.0 },
          { "tag": "Internet search engines", "score": 96.0 },
          { "tag": "Internet", "score": 84.0 }
        ],
        "unauthId": "007e3dc60bbe11dfadba39f9fdfe11b5"
      }
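The slides do not say how the interest scores are computed. One plausible reading of "simple aggregation", sketched here purely as an assumption, is to sum a visitor's page views over the tags attached to each URL they viewed. The function name and both input mappings are hypothetical; only the output shape follows the slide.

```python
from collections import Counter


def interest_profile(unauth_id, page_views, url_tags):
    """Build a tag-score profile shaped like the slide's JSON.

    `page_views` maps URL -> view count for one anonymous visitor and
    `url_tags` maps URL -> list of topic tags. Both are assumed inputs;
    summing views per tag is a guess at the scoring, not a documented rule.
    """
    scores = Counter()
    for url, pv in page_views.items():
        for tag in url_tags.get(url, []):
            scores[tag] += pv
    return {
        "tags": [
            {"tag": tag, "score": float(score)}
            for tag, score in scores.most_common()
        ],
        "unauthId": unauth_id,
    }


profile = interest_profile(
    "example-visitor",
    {"site.example/a": 3, "site.example/b": 2},
    {
        "site.example/a": ["Video games"],
        "site.example/b": ["Video games", "Internet"],
    },
)
print(profile)
```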
  • 31. Where we are today
  • 32. The current deliverables
      • Get more information about our customers
      • Increase recirculation
      • Increase RPM of our pages
      • Build metrics into our platform
          • What works on pages
          • How are we performing
      • Build intelligence on the page
          • Collaborative filtering
          • Product recommendations
          • Top-K type lists
      • Make it closer to real time
          • not the focus of this talk
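For the "Top-K type lists" item, a small sketch of the basic computation, assuming the input is simply a stream of viewed URLs. The function name is made up for this example, and a production job would compute the counts in a distributed way rather than in memory.

```python
import heapq
from collections import Counter


def top_k(urls, k):
    """Return the k most viewed URLs as (url, count) pairs, highest first."""
    counts = Counter(urls)
    # nlargest avoids sorting the full table when k is small.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


views = ["a", "b", "a", "c", "a", "b"]
print(top_k(views, 2))
```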
  • 33. What data are we processing?
      • Beacon Web servers
          • Tracking beacon injected into the HTML page via custom javascript
          • Tracks
              • Page views
              • Page clicks
              • Custom events that the content developer wants
          • Tracks standard things like referrers, user agents, and location
          • Developer can add custom parameters to tell us about the page
          • needed to write a custom module to generate anonymous user ids + 3rd party domain tracking
          • custom module to map IP addresses to geographic WOEID-based locations
      • Ad impressions
          • User viewed a campaign
          • Integrate it with the campaign manager to determine actual revenue
      • URL context (through relegence)
          • We can determine who & what an article is about
          • through relegence, similar to what OpenCalais does
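To make the beacon data concrete, here is a sketch of turning one beacon request into an event record, assuming the beacon arrives as a GET request with query parameters. The parameter names (`type`, `url`, `ref`), the host name and the record layout are invented for illustration; the actual beacon format is not given in the talk.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names the sketch treats as standard; everything else is
# kept as a developer-supplied custom parameter. All names are assumptions.
STANDARD_PARAMS = {"type", "url", "ref"}


def parse_beacon(request_url, user_agent):
    """Convert a beacon request URL into a flat event record."""
    params = dict(parse_qsl(urlsplit(request_url).query))
    return {
        "type": params.get("type", "pageview"),
        "url": params.get("url"),
        "referrer": params.get("ref"),
        "user_agent": user_agent,
        "custom": {
            k: v for k, v in params.items() if k not in STANDARD_PARAMS
        },
    }


event = parse_beacon(
    "http://beacon.example.com/b.gif"
    "?type=click&url=joystiq.com%2Fx&ref=google.com&section=news",
    "ExampleAgent/1.0",
)
print(event)
```

Keeping custom parameters in their own sub-record lets channel developers add fields without changing the shared schema, which matches how the deck describes developers using the data in unexpected ways.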
  • 34. The Data Layer Infrastructure today