MongoDB at fotopedia
         Timeline storage
   Our Swiss Army Knife Database
MongoDB at fotopedia

•   Context

•   Wikipedia data storage

•   Metacache
Fotopedia
•   Fotonauts, an American/French company

•   Photo — Encyclopedia

•   Heavily interconnected system: Flickr,
    Facebook, Wikipedia, Picasa, Twitter…

•   MongoDB in production since last October

•   main store lives in MySQL… for now
First contact


•   Wikipedia imported data
Wikipedia queries

•   wikilinks from one article

•   links to one article

•   geo coordinates

•   redirect

•   why not use the Wikipedia API ?
[Diagram: Download (~5.7GB gzip, XML) → Geo, Redirect, Backlink, Related: ~12GB tabular data]
Problem


Load ~12GB into a K/V store
CouchDB 0.9 attempt


•   CouchDB had no dedicated import tool

•   need to go through HTTP / Rest API
“DATA LOADING”



   LOADING!




   (obviously hijacked from xkcd.com)
Problem, rephrased


  Load ~12GB into any K/V store
        in hours, not days
Hadoop HBase ?

•   we were already using Hadoop Map/Reduce
    for preparation

•   bulk load was just emerging at that time,
    requiring us to code against HBase private APIs,
    generate the data in an ad-hoc binary
    format, ...
photo by neural.it on Flickr
Problem, rerephrased
      Load ~12GB into any K/V store
            in hours, not days
  without wasting a week on development
       and another week on setup
      and several months on tuning
                please ?
MongoDB attempt
•   Transforming the tabular data into a JSON
    form: about half an hour of code, 45 minutes
    of Hadoop parallel processing

•   setup mongo server : 15 minutes

•   mongoimport : 3 minutes to start it, 90 minutes
    to run

•   plug RoR app on mongo : minutes

•   prototype was done in a day
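
The transformation step can be sketched roughly as follows. This is a minimal illustration rather than the code used at fotopedia: the column names (`article`, `linked_from`) and the sample rows are hypothetical, and the real job ran on Hadoop. The only assumption about the target is that mongoimport accepts a file with one JSON document per line.

```python
import csv
import io
import json


def tsv_to_json_lines(tsv_text, fieldnames):
    """Yield one JSON document (as a string) per tab-separated row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Pair each column with its (hypothetical) field name.
        yield json.dumps(dict(zip(fieldnames, row)), sort_keys=True)


# Hypothetical backlink rows: article, then the article linking to it.
sample = "Paris\tFrance\nParis\tSeine\n"
lines = list(tsv_to_json_lines(sample, ["article", "linked_from"]))
for line in lines:
    print(line)
```

Each printed line is a standalone document, so the output file can be split across workers and concatenated again before the import.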
[Diagram: Batch side: Download (~5.7GB gzip) → Geo, Redirect, Backlink, Related (~12GB, 12M docs). Synchronous side: Ruby on Rails.]
Hot swap ?
•   Indexing was locking everything.

•   Just run two instances of MongoDB.
    •   One instance is servicing the web app

    •   One instance is asleep or loading data

•   A third instance knows the status of the other
    two instances.
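
The hot swap idea can be sketched as a small status tracker. This is an illustration under assumptions: the class name, the host:port strings and the in-memory state are all hypothetical, whereas the slide keeps that status in a third MongoDB instance.

```python
class HotSwap:
    """Tracks which of two database instances serves the web app."""

    def __init__(self, instances):
        if len(instances) != 2:
            raise ValueError("expected exactly two instances")
        self.instances = list(instances)
        self.active = 0  # index of the instance serving traffic

    def serving(self):
        """Instance the web app should query right now."""
        return self.instances[self.active]

    def standby(self):
        """Instance that is asleep or being loaded and indexed."""
        return self.instances[1 - self.active]

    def swap(self):
        """Call once the standby has finished loading and indexing."""
        self.active = 1 - self.active
        return self.serving()


# Hypothetical addresses for two instances on the same box.
status = HotSwap(["localhost:27017", "localhost:27018"])
```

A loader would import and index into `status.standby()`, then call `status.swap()` so the web app only ever sees a fully indexed instance.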
We loved:
•   JSON import format

•   efficiency of mongoimport

•   simple and flexible installation
    •   just one cumbersome dependency

    •   easy to start (we use runit)

    •   easy to have several instances on one box
Second contact

•   it's all about graphs, anyway.
    •   wikilinks

    •   people following people

    •   related community albums

    •   and soon, interlanguage links
all about graphs...

•   ... and it's also all about cache.

•   The application needs to “feel” faster, so let's
    cache more.

•   The application needs to “feel” right, so let's
    cache less.

•   or — big sigh — invalidate.
Page fragment caching

•   Nginx SSI

•   Varnish HTTP cache

•   RoR application

(photos by Mykl Roventine on Flickr, Aires Dos Santos, and Leslie Chatfield on Flickr)
Haiku ?

There are only two hard things
in Computer Science:
cache invalidation and naming things.



                                        Phil Karlton
Naming things


•   REST has been a strong design principle at
    fotopedia since the early days, and the effort
    is paying off.
/en/2nd_arrondissement_of_Paris




  /en/Paris/fragment/left_col

 /users/john/fragment/contrib




  /en/Paris/fragment/related
Invalidating


•   REST allows us to invalidate by URL prefix.

•   When the Paris album changes, we have to
    invalidate /en/Paris.*
Varnish invalidation

•   Varnish's built-in regexp-based invalidation is
    not designed for intensive, fine-grained
    invalidation.

•   We need to invalidate URLs individually.
/en/Paris.*

/en/Paris

/en/Paris/fragment/left_col

/en/Paris/photos.json?skip=0&number=20

/en/Paris/photos.json?skip=13&number=27
Metacache workflow
[Diagram: Nginx SSI, Varnish HTTP cache and RoR application, plus two metacache components]

•   metacache feeder: fed from the varnish log, e.g.
    /en/Paris/fragment/left_col

•   invalidation worker: receives /en/Paris.* and resolves it to
    /en/Paris
    /en/Paris/fragment/left_col
    /en/Paris/photos.json?skip=0&number=20
    /en/Paris/photos.json?skip=13&number=27
Wow.

•   This time we are actually using MongoDB as a
    BTree. Impressive.

•   The metacache has been running fine for
    several months, and we want to go further.
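
Using a store "as a BTree" for prefix invalidation can be sketched with a sorted list. This is a stand-in, not the production metacache: the `Metacache` class and its method names are hypothetical, and a sorted Python list plays the role of an indexed collection. The point is that all URLs sharing a prefix sit next to each other in sorted order, so one range scan finds them.

```python
import bisect


class Metacache:
    """Sorted set of cached URLs, supporting lookup by URL prefix."""

    def __init__(self):
        self._urls = []  # kept sorted, standing in for an indexed collection

    def record(self, url):
        """Feeder side: remember that this URL is now in the HTTP cache."""
        i = bisect.bisect_left(self._urls, url)
        if i == len(self._urls) or self._urls[i] != url:
            self._urls.insert(i, url)

    def pop_prefix(self, prefix):
        """Worker side: remove and return every URL starting with prefix."""
        lo = bisect.bisect_left(self._urls, prefix)
        hi = lo
        while hi < len(self._urls) and self._urls[hi].startswith(prefix):
            hi += 1
        matched = self._urls[lo:hi]
        del self._urls[lo:hi]
        return matched


cache = Metacache()
for url in ["/en/Paris",
            "/en/Paris/fragment/left_col",
            "/en/London",
            "/en/Paris/photos.json?skip=13&number=27",
            "/en/Paris/photos.json?skip=0&number=20"]:
    cache.record(url)

# Each returned URL would then be purged individually from the HTTP cache.
purged = cache.pop_prefix("/en/Paris")
```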
Invalidate less
•   We need to be more specific as to what we
    invalidate.

•   Today, if somebody votes on a photo in the
    Paris album, we invalidate the whole /en/Paris
    prefix, even though most of it is unchanged.

•   We will move towards a smarter
    metacache.
Metacache reloaded
•   Pub/Sub metacache

•   Have the backend send a specific header to
    be caught by the metacache-feeder, containing
    a “subscribe” message.

•   This header will be a JSON document, to be
    pushed to the metacache.

•   The purge commands will be mongo search
    queries.
/en/Paris

      /en/Paris/fragment/left_col

      /en/Paris/photos.json?skip=0&number=20

      /en/Paris/photos.json?skip=13&number=27



{url:/en/Paris, observe:[summary,links]}

{url:/en/Paris/fragment/left_col, observe: [cover]}

{url:/en/Paris/photos.json?skip=0&number=20,
observe:[photos]}

{url:/en/Paris/photos.json?skip=13&number=27,
observe:[photos]}
when somebody votes
       { url:/en/Paris.*, observe:photos }

               when the summary changes
       { url:/en/Paris.*, observe:summary }

              when a new link is created
        { url:/en/Paris.*, observe:links }


{url:/en/Paris, observe:[summary,links]}

{url:/en/Paris/fragment/left_col, observe: [cover]}

{url:/en/Paris/photos.json?skip=0&number=20,
observe:[photos]}

{url:/en/Paris/photos.json?skip=13&number=27,
observe:[photos]}
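
The matching of purge queries against subscription documents can be sketched in plain Python. This is an illustration only: the function name `urls_to_purge` is hypothetical, a list of dicts stands in for the metacache collection, and the fourth document is assumed to refer to the skip=13 URL listed earlier. A real purge would be expressed as a query against the store instead of a list comprehension.

```python
# Subscription documents as the feeder might store them (one per cached URL).
subscriptions = [
    {"url": "/en/Paris", "observe": ["summary", "links"]},
    {"url": "/en/Paris/fragment/left_col", "observe": ["cover"]},
    {"url": "/en/Paris/photos.json?skip=0&number=20", "observe": ["photos"]},
    {"url": "/en/Paris/photos.json?skip=13&number=27", "observe": ["photos"]},
]


def urls_to_purge(subs, prefix, event):
    """Return cached URLs under prefix that subscribed to this event."""
    return [s["url"] for s in subs
            if s["url"].startswith(prefix) and event in s["observe"]]


# When somebody votes on a photo in the Paris album:
voted = urls_to_purge(subscriptions, "/en/Paris", "photos")
# When the summary changes:
summary_changed = urls_to_purge(subscriptions, "/en/Paris", "summary")
```

With this scheme a vote purges only the two photo listings, and the album page and its left column stay cached.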
Other use cases

•   Timeline activities storage: just one more BTree
    usage.

•   Moderation workflow data: tiny dataset, but more
    complex queries, map/reduce.

•   Suspended experimentation around log
    collection and analysis
Current situation

•   MySQL: main data store

•   CouchDB: old timelines (+ chef)

•   MongoDB: metacache, wikipedia, moderation,
    new timelines

•   Redis: raw data cache for counters, recent
    activity (+ resque)
What about the main store ?


  •   albums are a good fit for documents

  •   votes and scores may be trickier

  •   recent introduction of resque
In short
•   Simple, fast.

•   Hackable: in a language most can read.

•   Clear roadmap.

•   Very helpful and efficient team.

•   Designed with application developer needs in
    mind.
