IIIF & Digital Humanities

Giving access to BnF data
Jean-Philippe Moreux
May 22nd, DIGITENS Workshop – Brest

GIVE ACCESS TO DATA
ENCOURAGE REUSE
BUILD NEW SERVICES
OBSERVE, DISSEMINATE, DOCUMENT
SERVICES, USE CASES, FEEDBACK

GIVING ACCESS TO THE DATA
Exhibit, publish, document: api.bnf.fr & gallicastudio.bnf.fr

What data?
• "Raw" data
• Data as datasets
• Derived data
• Data on data
• Data on use
• …

How?
• APIs (Application Programming Interfaces): let developers write programs that communicate with each other
• Datasets: collections of ready-to-use or on-demand data/documents
• Web services: let machines communicate over the web, using web protocols (HTTP)
• Temporality: synchronous/asynchronous

What protocols?
SPARQL
OAI-PMH

A (quick) map
[Diagram: BnF digital stores and catalogs with their access points. OAI-PMH: catalog, Gallica. IIIF Image API: digital images. Gallica API: digital documents. data.bnf.fr: Linked Data. SRU: catalog.]

NewsEye H2020 project
• Article Separation
• HTR (OCR++)
• Named Entity Recognition…
https://www.newseye.eu/
• French dataset: 60k issues delivered as
metadata+OCR only (no images)
• The partners can ingest images for processing
at page level or document level (manifest.json)
• The project DL can handle IIIF (Fedora)
Pros: no more hard drives!
Cons: downloads can be long, and painful for the digital library's servers…

GallicaPix: Image Search
• Proof of concept on image search for digital libraries (topic: WW1)
• Automatic extraction of content from BnF digital collections (IIIF, Gallica, SRU, OAI-PMH, SPARQL)
• Visual content enrichment thanks to deep learning approaches
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq

Image Search for DH
• Automatic face/genre recognition with deep learning (L’Excelsior, 1910-1920)
• Data analysis, data visualisation
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq

How to work with digital content
Detailed example: GallicaPix, a web app on Image Search (WW1)
OAI-PMH, SRU, Linked Data, IIIF

How to work with BnF digital content
I. Select the required metadata/documents
II. Identify the ways to access these data
III. Extract the resources (asynchronous mode) or use them in real time (synchronous mode)
IV. Build the application, analyse the data, …
GallicaPix: block diagram

1. How to find documents related to WW1?
1.1 With OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
A: Gallica OAI repository
B: BnF Catalog repository
C: Europeana repository
1: Machine/machine queries
2: Results: list of document metadata
3: GallicaPix (back-end)
4: GallicaPix (front-end)

1.1 With OAI-PMH
• List the Gallica "sets" in the OAI repository:
http://oai.bnf.fr/oai2/OAIHandler?verb=ListSets
• Harvest the WW1 set ("gallica:corpus:1418"):
http://oai.bnf.fr/oai2/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=gallica:corpus:1418
…
Drawbacks:
• No search criteria
• The sets must have been created by the owner of the OAI repository
Let’s do it!
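The harvesting step can be sketched in Python as follows. This is a minimal illustration, not the GallicaPix code: the function names (`list_records_url`, `parse_page`, `harvest`) and the `max_pages` limit are hypothetical, and only the endpoint, verb, metadata prefix and set name come from the slide. The OAI-PMH protocol returns long lists in pages, each page carrying a `resumptionToken` that must be sent alone (with the verb) to fetch the next page.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://oai.bnf.fr/oai2/OAIHandler"  # endpoint shown on the slide
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(set_spec=None, token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec}
    return OAI_ENDPOINT + "?" + urllib.parse.urlencode(params, safe=":")


def parse_page(xml_text):
    """Return (records, next_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [t.text for t in rec.iterfind(".//dc:title", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, token or None


def harvest(set_spec, max_pages=2):
    """Yield records page by page (performs HTTP requests; max_pages is a safety cap)."""
    token = None
    for _ in range(max_pages):
        with urllib.request.urlopen(list_records_url(set_spec, token)) as resp:
            records, token = parse_page(resp.read())
        yield from records
        if token is None:
            break
```

Calling `harvest("gallica:corpus:1418")` would iterate over the first pages of the WW1 set; raising `max_pages` harvests more, at the cost of more requests to the server.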

1. How to find documents related to WW1? (cont'd)
1.2 With the SRU protocol (Search/Retrieve via URL)
a) Search in Gallica (keyword search or advanced form)
b) Copy the query segment from the URL

1.2 With the SRU protocol (Search/Retrieve via URL)
c) Paste the Gallica query into the SRU query:
https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=
(dc.subject all "Guerre mondiale 1914-1918") and (dc.type all "image")
and (gallicapublication_date>="1914/01/01")
and (gallicapublication_date<="1918/01/01")&maximumRecords=100
d) Extract the metadata from the XML result list (this requires some coding)
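Steps c) and d) can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, `startRecord` is the standard SRU paging parameter (not shown on the slide), and the response namespace is assumed to be the usual SRU 1.2 one. The query string has to be percent-encoded before it is sent, which the slide URL omits for readability.

```python
import urllib.parse
import xml.etree.ElementTree as ET

SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"  # endpoint shown on the slide
NS = {
    "srw": "http://www.loc.gov/zing/srw/",  # assumed SRU 1.2 response namespace
    "dc": "http://purl.org/dc/elements/1.1/",
}


def sru_url(cql_query, start_record=1, maximum_records=100):
    """Build a searchRetrieve URL with a percent-encoded CQL query."""
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return SRU_ENDPOINT + "?" + urllib.parse.urlencode(params, quote_via=urllib.parse.quote)


def parse_response(xml_text):
    """Return (total number of hits, list of records) from an SRU response."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("srw:numberOfRecords", default="0", namespaces=NS))
    records = []
    for data in root.iterfind(".//srw:recordData", NS):
        records.append({
            "title": data.findtext(".//dc:title", default="", namespaces=NS),
            "identifiers": [i.text for i in data.iterfind(".//dc:identifier", NS)],
        })
    return total, records
```

The total returned by `parse_response` tells the caller how many further requests (with an increasing `start_record`) are needed to page through the whole result list.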

1.4 With data.bnf.fr: semantic search on Linked Data
• All the documents on the WW1 theme:
For humans (HTML format):
http://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/#documents
For machines (RDF/XML, N3, JSON-LD… formats):
https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.xml
https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.n3

1.4 With data.bnf.fr: semantic search on Linked Data
• Authors related to WW1:
https://data.bnf.fr/fr/linked-authors/11939093
• Documents on Verdun:
https://data.bnf.fr/fr/15265210/verdun__meuse__france_
https://data.bnf.fr/fr/15265210/verdun__meuse__france_/rdf.xml
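Moving from the human page to the machine-readable export follows a simple URL pattern, which can be sketched in Python. The function names are hypothetical, and the predicate to look for in the RDF is left to the caller, since the slides do not say which vocabulary terms data.bnf.fr uses. A dedicated RDF library would be more robust than this plain XML scan.

```python
import urllib.parse
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_url(page_url, serialization="xml"):
    """Turn a data.bnf.fr page URL into its RDF export URL (rdf.xml or rdf.n3)."""
    if serialization not in ("xml", "n3"):
        raise ValueError("only the two serializations shown on the slide are handled")
    base, _fragment = urllib.parse.urldefrag(page_url)
    return base.rstrip("/") + "/rdf." + serialization


def linked_resources(rdf_xml_text, predicate_local_name):
    """Collect the rdf:resource targets of every element with the given local name."""
    root = ET.fromstring(rdf_xml_text)
    targets = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]
        target = elem.get("{%s}resource" % RDF_NS)
        if local_name == predicate_local_name and target:
            targets.append(target)
    return targets
```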

2. How to work with the documents?
1: Human/machine queries
2: Results: metadata
3: GallicaPix (back-end)
4: GallicaPix (front-end)
From the results list (2):
a) Get the document metadata
b) Store the metadata locally (3)
c) Get the documents (if needed) and store them (3)
d) Build services on top of the local database (4)

Store the data?
In a document oriented database (NoSQL):
• XML databases: BaseX, eXist…
• JSON databases: MongoDB
• graph oriented databases
In any other place…
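As one illustration of the "any other place" option, the sketch below keeps each document's metadata as a JSON string in a SQLite table keyed by its ark. SQLite is an assumption chosen here because it ships with Python; it is not the store named on the slide, and the table and function names are hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny key/value store: one JSON document per ark."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (ark TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
    )
    return conn


def save_metadata(conn, ark, metadata):
    """Insert or overwrite the metadata of one document."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO documents (ark, metadata) VALUES (?, ?)",
            (ark, json.dumps(metadata, ensure_ascii=False)),
        )


def load_metadata(conn, ark):
    """Return the stored metadata as a dict, or None if the ark is unknown."""
    row = conn.execute("SELECT metadata FROM documents WHERE ark = ?", (ark,)).fetchone()
    return json.loads(row[0]) if row else None
```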

a) Get the documents metadata
a.1) With OAI-PMH:
http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=ark:/12148/bpt6k5738219s
a.2) With the Gallica Document API:
https://gallica.bnf.fr/services/OAIRecord?ark=bpt6k5738219s
a.3) With IIIF:
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k5738219s/manifest.json
Let’s do it!
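The three access routes above differ only in their URL pattern, and the IIIF manifest is plain JSON. The sketch below builds the three URLs from an ark name and summarizes a manifest; the function names are hypothetical, and the manifest is assumed to follow the IIIF Presentation 2.x layout (sequences, canvases, images).

```python
import json
import urllib.request


def metadata_urls(ark_name):
    """Return the three metadata URLs shown on the slide for one ark name."""
    return {
        "oai_pmh": "http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord"
                   "&metadataPrefix=oai_dc&identifier=ark:/12148/" + ark_name,
        "gallica_api": "https://gallica.bnf.fr/services/OAIRecord?ark=" + ark_name,
        "iiif_manifest": "https://gallica.bnf.fr/iiif/ark:/12148/" + ark_name + "/manifest.json",
    }


def fetch_manifest(ark_name):
    """Download and decode the IIIF manifest (performs an HTTP request)."""
    with urllib.request.urlopen(metadata_urls(ark_name)["iiif_manifest"]) as resp:
        return json.load(resp)


def summarize_manifest(manifest):
    """List the pages of a manifest, assuming the Presentation 2.x structure."""
    canvases = [c for seq in manifest.get("sequences", []) for c in seq.get("canvases", [])]
    pages = []
    for canvas in canvases:
        images = canvas.get("images", [])
        pages.append({
            "label": canvas.get("label"),
            "width": canvas.get("width"),
            "height": canvas.get("height"),
            "image": images[0]["resource"]["@id"] if images else None,
        })
    return {"label": manifest.get("label"), "pages": pages}
```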

c) Extract the document content with the Gallica API
• Text:
https://gallica.bnf.fr/ark:/12148/bpt6k6399988n.texteBrut
• OCR:
https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k6399988n&E=ALTO&Deb=10
• Table of contents:
https://gallica.bnf.fr/services/Toc?ark=ark:/12148/bpt6k6399988n
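The OCR is delivered as ALTO XML, where each word is a `String` element carrying its text (`CONTENT`) and its position on the page (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`). A minimal sketch of reading it is given below. The function names are hypothetical, and `Deb` is assumed to be the page number. The element's namespace is ignored because it differs between ALTO versions.

```python
import xml.etree.ElementTree as ET


def content_urls(ark_name, page=1):
    """Return the text, OCR and table-of-contents URLs shown on the slide."""
    return {
        "text": f"https://gallica.bnf.fr/ark:/12148/{ark_name}.texteBrut",
        # Deb is assumed here to select the page whose ALTO file is returned
        "alto": f"https://gallica.bnf.fr/RequestDigitalElement?O={ark_name}&E=ALTO&Deb={page}",
        "toc": f"https://gallica.bnf.fr/services/Toc?ark=ark:/12148/{ark_name}",
    }


def alto_words(alto_xml):
    """Extract every word with its bounding box from one ALTO page."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append({
            "text": elem.get("CONTENT", ""),
            "x": float(elem.get("HPOS", 0)),
            "y": float(elem.get("VPOS", 0)),
            "w": float(elem.get("WIDTH", 0)),
            "h": float(elem.get("HEIGHT", 0)),
        })
    return words
```

Joining the `text` values with spaces gives a rough plain-text version of the page; the boxes make it possible to locate a word on the page image.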

c) Extract the document content with the Gallica API
• Preprocessed images:
https://gallica.bnf.fr/ark:/12148/btv1b8593523f.thumbnail
https://gallica.bnf.fr/ark:/12148/btv1b8593523f.medres
https://gallica.bnf.fr/ark:/12148/btv1b8593523f.highres
• IIIF:
https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/full/pct:10/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/461,556,8453,6584/923,719/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/461,556,8453,6584/923,719/0/gray.jpg
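The IIIF URLs above all follow the same pattern: page, then region, size, rotation, and quality with a format extension. A small helper makes the pattern explicit; the function name and its defaults are hypothetical, and the path layout is simply read off the three example URLs.

```python
def iiif_image_url(ark_name, page, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Build a Gallica IIIF image URL: .../f{page}/{region}/{size}/{rotation}/{quality}.{fmt}"""
    return (
        f"https://gallica.bnf.fr/iiif/ark:/12148/{ark_name}/f{page}"
        f"/{region}/{size}/{rotation}/{quality}.{fmt}"
    )


# The three URLs of the slide, rebuilt from their parameters:
whole_page_at_10_percent = iiif_image_url("btv1b8593523f", 1, size="pct:10")
cropped = iiif_image_url("btv1b8593523f", 1,
                         region="461,556,8453,6584", size="923,719")
cropped_gray = iiif_image_url("btv1b8593523f", 1,
                              region="461,556,8453,6584", size="923,719", quality="gray")
```

The region is expressed as x,y,width,height in pixels of the full-resolution image, and the size as width,height of the delivered image, so the second URL returns the crop reduced to 923 by 719 pixels.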

Images (with IIIF, International Image Interoperability Framework)
Gallica, Wellcome Collection, Europeana, …

GallicaPix
…
Gallica
Wellcome Collection

d) Build services
Segmentation
x,y,w,h
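A segmentation box (x, y, w, h) maps directly onto an IIIF region. The sketch below assumes, as an illustration only, that segmentation ran on a reduced copy of the page, so the box is first scaled back to full-resolution coordinates; the function names and the `max_width` parameter are hypothetical.

```python
def scale_box(box, analysed_width, full_width):
    """Rescale an (x, y, w, h) box from the analysed image to the full-size image."""
    factor = full_width / analysed_width
    return tuple(round(value * factor) for value in box)


def crop_url(image_base, box, max_width=None):
    """Build the IIIF URL of a box; image_base ends with the page, e.g. '.../f1'."""
    x, y, w, h = box
    size = f"{max_width}," if max_width else "full"
    return f"{image_base}/{x},{y},{w},{h}/{size}/0/native.jpg"
```

With this, an illustration detected on a small working copy can be requested at full resolution only when it is actually needed.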

d) Build services
Visual Indexation:
• classification
• object detection
• instance detection
• semantic segmentation
…
Car
x1,y1,w1,h1
Person
x1,y1,w1,h1

d) Build services
Classification of genres:
• build a reference dataset
• train a model (CNN)
1. Leverage the metadata (SRU)
2. Download the images (IIIF)
3. Train the model
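Steps 1 and 2 can be sketched as follows: genre labels taken from the metadata are paired with small IIIF image URLs, then split into training and validation sets. This is an assumption-laden illustration, not the GallicaPix pipeline: the function names are hypothetical, the first page (f1) is assumed to be the image of interest, and the 10% size and 20% validation ratio are arbitrary. Training the CNN itself (step 3) is out of scope here.

```python
import random


def labelled_image_urls(records, size="pct:10"):
    """Pair each (ark_name, genre) record with a reduced-size IIIF image URL."""
    samples = []
    for ark_name, genre in records:
        url = f"https://gallica.bnf.fr/iiif/ark:/12148/{ark_name}/f1/full/{size}/0/native.jpg"
        samples.append((url, genre))
    return samples


def train_validation_split(samples, validation_ratio=0.2, seed=42):
    """Shuffle reproducibly, then hold out a share of the samples for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_ratio)
    return shuffled[n_validation:], shuffled[:n_validation]
```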

d) Build services
Visual Indexation example: IBM Watson
Pros: no need to handle image files (size, rotation, crop), no local storage
Cons: server-intensive, slow, timeouts
curl -X POST -u "apikey:****" \
  --form "url=https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9604090x/f1/22,781,4334,4751/,700/0/native.jpg" \
  "https://gateway.watsonplatform.net/visual-recognition/api/v3/classify?version=2018-03-19"

d) Build services
Aggregate, process, enrich
Request: maps of Verdun → results
Gallica, Europeana, GallicaPix, …

IIIF in the Visual Indexing Workflow
[Diagram: workflow steps Selection, Segmentation, Indexation, QA, Use (search; API, datasets); IIIF provides access at three of these steps.]
IIIF makes prototyping and model training easier, but it can be inefficient for processing large datasets
