The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API, a rich interface to all of The Guardian's content and metadata back to 1991, covering more than one million documents. This talk starts with a brief overview of the latest iteration of the Content API. It then covers how we implemented this in Scala using Solr, addressing real-world problems in building an index of content:
how we represented a complex relational database model in Solr
how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement
how we update the schema as the API evolves, with zero downtime
how we scale in response to unpredictable demand, using cloud services
14. Implementation
• Traffic patterns are much less predictable than a web site's
• Need to easily scale on demand...
• ... and never take down guardian.co.uk due to API traffic
15. Core
[Architecture diagram: CMS → rdbms → App server with Memcached (20Gb) → Web servers]
16. Core
[Architecture diagram: CMS → rdbms → App server with Memcached (20Gb) → Web servers; the Content API is fed directly from the rdbms]
18. Why Solr?
• Database could not cope...
• ... and far too expensive to scale
• Solr ...
• ... was easy for developers to understand
• ... has a great replication model
• ... is simple to install
19. Core
[Architecture diagram as before, minus the rdbms: CMS → App server with Memcached (20Gb) → Web servers]
20. Core
[Architecture diagram: CMS → Indexer → Solr Master; the core stack of App server with Memcached (20Gb) → Web servers is unchanged]
21. Core
[Architecture diagram: CMS → Indexer → Solr Master, replicating out to a pool of "Solr & Api" instances in the cloud (EC2); the core web servers, app server and Memcached (20Gb) are unchanged]
36. ... factboxes ...
record-type: content
id: world/picture/2010/may/14/formula-one-monaco
factbox-data: [ 197544~|~~|~photography-tip~|~ ]
fact-data: [ 197544~|~pro-tip~|~The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
fact-value: [ The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
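The fields above pack several related values into one string with a "~|~" separator. A minimal sketch of splitting them back out on read; the exact field layout is inferred from the slide, not from Guardian source code:

```scala
// Sketch: split a "~|~"-delimited factbox field back into its parts.
// limit = -1 keeps empty trailing fields (note the empty slot in
// "197544~|~~|~photography-tip~|~" on the slide).
object FactData {
  def parse(raw: String): Array[String] = raw.split("~\\|~", -1)
}
```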
46. 1.1 million+ items of content in the database
Split into Batches
SELECT id FROM (
  SELECT id, ROWNUM rownumber FROM (
    SELECT id FROM content_live ORDER BY id
  )
) WHERE MOD(rownumber, 10000) = 0
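The boundary ids returned by the MOD query can then be paired up into id ranges for the indexing workers. A minimal Scala sketch (not the talk's actual code; `Batches` is a hypothetical helper):

```scala
// Sketch: turn every-10000th boundary ids into (fromId, toId] ranges.
// Sentinels 0 and Long.MaxValue cover the first and last partial batches.
object Batches {
  def ranges(boundaryIds: List[Long]): List[(Long, Long)] = {
    val bounds = 0L :: (boundaryIds ::: List(Long.MaxValue))
    bounds.zip(bounds.tail) // adjacent pairs: (0,b1), (b1,b2), ..., (bn,Max)
  }
}
```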
48. 1.1 million+ items of content in the database
[Diagram: the batch queue fans out to Actors 1–4 working in parallel]
Each actor:
1. reads data from the database
2. builds a Solr input document
3. submits to Solr
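The batch divisions become a work queue consumed by a fixed set of workers. The talk used Scala actors (since deprecated); to keep this sketch self-contained and runnable it substitutes a plain thread pool over a shared queue, which captures the same fan-out:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Sketch of the indexing pipeline: a shared queue of id-range batches
// drained by a fixed pool of workers (the talk found 8 worked best on
// their hardware). `index` stands in for "read rows, build the
// SolrInputDocument, submit to Solr".
object Indexer {
  def run(batches: Seq[(Long, Long)], workers: Int)(index: ((Long, Long)) => Unit): Unit = {
    val queue = new ConcurrentLinkedQueue[(Long, Long)]()
    batches.foreach(b => queue.add(b))
    val pool = Executors.newFixedThreadPool(workers)
    (1 to workers).foreach { _ =>
      pool.execute(() => {
        var batch = queue.poll()       // each worker pulls batches until
        while (batch != null) {        // the queue is drained
          index(batch)
          batch = queue.poll()
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```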
50. Summary
• Solr made free access to our content API possible
• Replication rocks for scaling
• Solr just works for us (thank you!)
• NoSQL really isn’t that scary
As Stephen said:
Very basic links to interesting content
Note the registration paywall
Broadcast, stories, basic community
Rebuild started in 2005
“Web 2.0”, community, (full fat) RSS, discoverability, tagging.
Where do we go from here? While other newspaper sites are looking to restrict access to content via paywalls, we’re looking to open up
We’ve spent the last 12 months experimenting around open distribution and open partnerships - 4 initiatives make up the open platform (right now)
(As Stephen said)
This talk focuses on the content API - provides a way for others to re-present our content in their applications
http://content.guardianapis.com
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance
(most users want most recent content, so default ordering is newest)
This is just a dismax search
Can also retrieve extra metadata, including tags
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
If you have an API key you can get full content. (You need to apply for this and agree to some T&Cs - mostly to ensure that we can take down content for legal reasons.) This example key is only valid for this conference and will be disabled afterwards :)
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
Refinements give the ability to narrow down your result set (ofc these are just solr facets)
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
Our current architecture - perhaps we could feed the content api off the database?
time to developer understanding: about 2 hours
currently rebuild every night, incrementals during the day
[next] expose solr master to EC2, create hosts in EC2 that replicate using solr replication - works fantastically. 6GB index size. Load-balancer config.
We use solr.war from 1.4 dist totally unchanged - run api webapp in same jetty container
Lots of talk nowadays on “no sql” solutions
No.
Designed a new logo that better reflects where we currently are
disclaimer: the next slides describe how *we* did it; not necessarily best practice!
We took the opportunity to simplify our domain model....
Content fields are just fields
But also need to map tags, media, and factboxes
Here’s how we model tags & content
Fact boxes associate arbitrary information with content
We need to search them, but 1-to-1 relationship with content
So no separate record
show-media allows access to the non-text assets of an item of content
Code mostly just takes input params, converts to solr query, and transforms result to json or xml
I’m not here to talk about scala, but here’s a quick couple of snippets
RichSolrDocument makes SolrDocument more “Scala-ish”
Scala can make writing understandable code much easier
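A minimal sketch of the enrichment pattern a wrapper like RichSolrDocument typically uses. A plain Map stands in for org.apache.solr.common.SolrDocument so the example is self-contained; the names here are illustrative, not the Guardian's actual code:

```scala
import scala.language.implicitConversions

// Sketch of the "enrich my library" pattern: wrap the raw document in a
// class that adds typed, Option-returning field access, plus an implicit
// conversion so the extra methods appear on the document itself.
class RichDoc(doc: Map[String, Any]) {
  def field[T](name: String): Option[T] = doc.get(name).map(_.asInstanceOf[T])
}
object RichDoc {
  implicit def enrich(doc: Map[String, Any]): RichDoc = new RichDoc(doc)
}
```

With the implicit in scope, `doc.field[String]("headline")` reads far more naturally than chained null checks and casts.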
Supporting auto scaling in EC2 - our base images all have empty index
(EC2 load balance is configured to check this url & add server to list on 200 response)
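The decision behind that healthcheck URL can be sketched in a few lines; `Healthcheck` and its input are hypothetical, but the rule matches the setup described above:

```scala
// Sketch of the load-balancer healthcheck logic: a freshly auto-scaled
// EC2 instance boots with an empty index, so it must answer non-200
// until Solr replication has pulled documents from the master.
// numDocs would come from a Solr stats request in practice.
object Healthcheck {
  def status(numDocs: Long): Int = if (numDocs > 0) 200 else 503
}
```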
Thanks to Grant Ingersoll from Lucid Imagination for guiding us down this route (we were planning to do something much more complicated). Also thanks to Francis Rhys-Jones for actually implementing this
This is game-changing: suddenly we’re prepared to change the index, and NoSQL solutions seem a whole lot less scary: we migrate our entire database every night!
Effectively the batch divisions become a work queue fed to a set of actors
(Actually, we found that 8 worked best with our hardware)
Each actor reads the data from the database; creates a SolrInputDocument; submits
All we wanted was a search engine... but actually we got an easy to work with, fast, scalable NoSQL solution!