The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API, a rich interface to all of The Guardian's content and metadata back to 1991, covering more than one million documents. This talk starts with a brief overview of the latest iteration of the Content API. It then covers how we implemented this in Scala using Solr, addressing real-world problems in building an index of content:
how we represented a complex relational database model in Solr
how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement
how we update the schema as the API evolves, with zero downtime
how we scale in response to unpredictable demand, using cloud services
14. Implementation
• Traffic patterns are much less predictable than a web site's
• Need to easily scale on demand...
• ... and never take down guardian.co.uk due to API traffic
15. Core
[Architecture diagram: CMS → rdbms → App server with Memcached (20Gb) → Web servers]
16. Core
[Architecture diagram: CMS → rdbms → App server with Memcached (20Gb) → Web servers; the Content API is fed directly from the rdbms]
18. Why Solr?
• Database could not cope...
• ... and far too expensive to scale
• Solr ...
• ... was easy for developers to understand
• ... has a great replication model
• ... is simple to install
19. Core
[Architecture diagram as before, minus the rdbms: CMS → App server with Memcached (20Gb) → Web servers]
20. Core
[Architecture diagram: CMS → Indexer → Solr Master; the core stack of App server with Memcached (20Gb) → Web servers is unchanged]
21. Core
[Architecture diagram: CMS → Indexer → Solr Master, replicating out to a pool of "Solr & Api" instances in the cloud (EC2); the core web servers, app server and Memcached (20Gb) are unchanged]
36. ... factboxes ...
record-type: content
id: world/picture/2010/may/14/formula-one-monaco
factbox-data: [ 197544~|~~|~photography-tip~|~ ]
fact-data: [ 197544~|~pro-tip~|~The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
fact-value: [ The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
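The fields above pack several related values into one string with a "~|~" separator. A minimal sketch of splitting them back out on read; the exact field layout is inferred from the slide, not from Guardian source code:

```scala
// Sketch: split a "~|~"-delimited factbox field back into its parts.
// limit = -1 keeps empty trailing fields (note the empty slot in
// "197544~|~~|~photography-tip~|~" on the slide).
object FactData {
  def parse(raw: String): Array[String] = raw.split("~\\|~", -1)
}
```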
46. 1.1 million+ items of content in the database
Split into Batches
SELECT id FROM (
  SELECT id, ROWNUM rownumber FROM (
    SELECT id FROM content_live ORDER BY id
  )
) WHERE MOD(rownumber, 10000) = 0
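The boundary ids returned by the MOD query can then be paired up into id ranges for the indexing workers. A minimal Scala sketch (not the talk's actual code; `Batches` is a hypothetical helper):

```scala
// Sketch: turn every-10000th boundary ids into (fromId, toId] ranges.
// Sentinels 0 and Long.MaxValue cover the first and last partial batches.
object Batches {
  def ranges(boundaryIds: List[Long]): List[(Long, Long)] = {
    val bounds = 0L :: (boundaryIds ::: List(Long.MaxValue))
    bounds.zip(bounds.tail) // adjacent pairs: (0,b1), (b1,b2), ..., (bn,Max)
  }
}
```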
48. 1.1 million+ items of content in the database
[Diagram: the batch queue fans out to Actors 1–4 working in parallel]
Each actor:
1. reads data from the database
2. builds a Solr input document
3. submits to Solr
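The batch divisions become a work queue consumed by a fixed set of workers. The talk used Scala actors (since deprecated); to keep this sketch self-contained and runnable it substitutes a plain thread pool over a shared queue, which captures the same fan-out:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Sketch of the indexing pipeline: a shared queue of id-range batches
// drained by a fixed pool of workers (the talk found 8 worked best on
// their hardware). `index` stands in for "read rows, build the
// SolrInputDocument, submit to Solr".
object Indexer {
  def run(batches: Seq[(Long, Long)], workers: Int)(index: ((Long, Long)) => Unit): Unit = {
    val queue = new ConcurrentLinkedQueue[(Long, Long)]()
    batches.foreach(b => queue.add(b))
    val pool = Executors.newFixedThreadPool(workers)
    (1 to workers).foreach { _ =>
      pool.execute(() => {
        var batch = queue.poll()       // each worker pulls batches until
        while (batch != null) {        // the queue is drained
          index(batch)
          batch = queue.poll()
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```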
50. Summary
• Solr made free access to our content API possible
• Replication rocks for scaling
• Solr just works for us (thank you!)
• NoSQL really isn’t that scary
As Stephen said:
Very basic links to interesting content
Note the registration paywall
Broadcast, stories, basic community
Rebuild started in 2005
“Web 2.0”, community, (full fat) RSS, discoverability, tagging.
Where do we go from here? While other newspaper sites are looking to restrict access to content via paywalls, we’re looking to open up
We’ve spent the last 12 months experimenting around open distribution and open partnerships - 4 initiatives make up the open platform (right now)
(As Stephen said)
This talk focuses on the content API - provides a way for others to re-present our content in their applications
http://content.guardianapis.com
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance
(most users want most recent content, so default ordering is newest)
This is just a dismax search
Can also retrieve extra metadata, including tags
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
If you have an API key you can get full content. (You need to apply for this and agree to some T&Cs - mostly to ensure that we can take down content for legal reasons.) This example key is only valid for this conference and will be disabled afterwards :)
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
Refinements give the ability to narrow down your result set (ofc these are just solr facets)
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
Our current architecture - perhaps we could feed the content api off the database?
time to developer understanding: about 2 hours
currently rebuild every night, incrementals during the day
[next] expose solr master to EC2, create hosts in EC2 that replicate using solr replication - works fantastically. 6GB index size. Load-balancer config.
We use solr.war from 1.4 dist totally unchanged - run api webapp in same jetty container
Lots of talk nowadays on “no sql” solutions
No.
Designed a new logo that better reflects where we currently are
disclaimer: the next slides describe how *we* did it; not necessarily best practice!
We took the opportunity to simplify our domain model....
Content fields are just fields
But also need to map tags, media, and factboxes
Here’s how we model tags & content
Fact boxes associate arbitrary information with content
We need to search them, but 1-to-1 relationship with content
So no separate record
show-media allows access to the non-text assets of an item of content
Code mostly just takes input params, converts to solr query, and transforms result to json or xml
I’m not here to talk about scala, but here’s a quick couple of snippets
RichSolrDocument makes SolrDocument more “Scala-ish”
Scala can make writing understandable code much easier
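A minimal sketch of the enrichment pattern a wrapper like RichSolrDocument typically uses. A plain Map stands in for org.apache.solr.common.SolrDocument so the example is self-contained; the names here are illustrative, not the Guardian's actual code:

```scala
import scala.language.implicitConversions

// Sketch of the "enrich my library" pattern: wrap the raw document in a
// class that adds typed, Option-returning field access, plus an implicit
// conversion so the extra methods appear on the document itself.
class RichDoc(doc: Map[String, Any]) {
  def field[T](name: String): Option[T] = doc.get(name).map(_.asInstanceOf[T])
}
object RichDoc {
  implicit def enrich(doc: Map[String, Any]): RichDoc = new RichDoc(doc)
}
```

With the implicit in scope, `doc.field[String]("headline")` reads far more naturally than chained null checks and casts.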
Supporting auto scaling in EC2 - our base images all have empty index
(EC2 load balance is configured to check this url & add server to list on 200 response)
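The decision behind that healthcheck URL can be sketched in a few lines; `Healthcheck` and its input are hypothetical, but the rule matches the setup described above:

```scala
// Sketch of the load-balancer healthcheck logic: a freshly auto-scaled
// EC2 instance boots with an empty index, so it must answer non-200
// until Solr replication has pulled documents from the master.
// numDocs would come from a Solr stats request in practice.
object Healthcheck {
  def status(numDocs: Long): Int = if (numDocs > 0) 200 else 503
}
```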
Thanks to Grant Ingersoll from Lucid Imagination for guiding us down this route (we were planning to do something much more complicated). Also thanks to Francis Rhys-Jones for actually implementing this
This is game-changing: suddenly we’re prepared to change the index, and NoSQL solutions seem a whole lot less scary: we migrate our entire database every night!
Effectively the batch divisions become a work queue fed to a set of actors
(Actually, we found that 8 worked best with our hardware)
Each actor reads the data from the database; creates a SolrInputDocument; submits
All we wanted was a search engine... but actually we got an easy to work with, fast, scalable NoSQL solution!