Giving access to BnF data
Jean-Philippe Moreux
May 22nd, DIGITENS Workshop – Brest
GIVE
ACCESS
TO DATA
ENCOURAGE
REUSE
BUILD
NEW
SERVICES
OBSERVE,
DISSEMINATE,
DOCUMENT
SERVICES
USECASES
FEEDBACK
GIVING
ACCESS
TO THE
DATA
Exhibit, publish,
document:
api.bnf.fr &
gallicastudio.bnf.fr
What data?
• "Raw" data
• Data on data
• Data as datasets
• Data on use
• Derived data
…
How?
APIs (Application Programming Interfaces):
allow developers to write programs that
communicate with each other
Datasets: collections of ready-to-use
or on-demand data/documents
Web services: allow machines to
communicate on the web, using web
protocols (HTTP)
Temporality: synchronous/asynchronous
SPARQL
OAI-PMH
What protocols?
Too much data…
A (quick) map
Digital
Stores
OAI-PMH
Catalog, Gallica
Catalogs API IIIF Image
Digital Images
API Gallica
Digital Documents
data.bnf.fr
Linked Data
SRU
Catalog
api.bnf.fr
Examples
IIIF for R&D
NewsEye H2020 project
• Article Separation
• HTR (OCR++)
• Named Entities Recognition…
https://www.newseye.eu/
• French dataset: 60k issues delivered as
metadata+OCR only (no images)
• The partners can ingest images for processing
at page level or document level (manifest.json)
• The project DL can handle IIIF (Fedora)
Pros: no more HDs!
Cons: downloads can be long and
put a heavy load on DL servers…
• Proof of Concept on image search for digital libraries (topic : WW1)
• Automatic extraction of content from BnF digital collections (IIIF, Gallica, SRU,
OAI-PMH, SPARQL)
• Visual content enrichment thanks to deep learning approaches
Image Search
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
GallicaPix
• Automatic face/genre recognition with deep learning (L’Excelsior, 1910-1920)
• Data analysis, data visualisation
Image Search for DH
http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
How to work with digital content
Detailed example: GallicaPix web app on Image Search (WW1)
OAI-PMH
SRU
Linked Data
IIIF
I. Select the required metadata/documents
II. Identify the ways to get access to these data
III. Extract the resources (asynchronous mode) or use them
in real time (synchronous)
IV. Build the application/analyse the data/…
How to work with BnF digital content
GallicaPix : block diagram
1. How to find documents related to WW1?
1.1 With OAI-PMH (Open Archives Initiative - Protocol
for Metadata Harvesting)
A : Gallica OAI repository
B : BnF Catalog repository
C : Europeana repository
3 : GallicaPix (back-end)
4 : GallicaPix (front-end)
1 : Machine/machine queries
2 : Results: list of documents metadata
• List the Gallica « sets » in the OAI repository:
http://oai.bnf.fr/oai2/OAIHandler?verb=ListSets
• Harvest the WW1 set (« gallica:corpus:1418 »)
http://oai.bnf.fr/oai2/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=gallica:corpus:1418
…
Drawbacks:
• No search criteria
• The sets must have been created by the OAI owner
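The two requests above can be scripted. The following is a minimal sketch, assuming the endpoint quoted on the slide and the standard OAI-PMH 2.0 response namespace; the function names (`list_records_url`, `parse_page`, `harvest`) and the `max_pages` limit are illustrative choices, not part of any BnF tooling.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://oai.bnf.fr/oai2/OAIHandler"  # endpoint quoted on the slide
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}  # standard OAI-PMH 2.0 namespace


def list_records_url(set_spec=None, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces the other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return OAI_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(set_spec, max_pages=2):
    """Follow resumption tokens for a few pages (requires network access)."""
    token, found = None, []
    for _ in range(max_pages):
        url = list_records_url(set_spec, resumption_token=token)
        with urllib.request.urlopen(url) as response:
            identifiers, token = parse_page(response.read().decode("utf-8"))
        found.extend(identifiers)
        if token is None:
            break
    return found
```

Calling `harvest("gallica:corpus:1418")` would fetch the first pages of the WW1 set. Harvesting is page by page, which is why this protocol suits the asynchronous mode rather than real-time use.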
Let’s do it!
1.1 With OAI-PMH
a) Search in Gallica (keyword search or advanced form)
b) Copy the query segment in the URL
1. How to find documents related to WW1? (cont’)
1.2 With the SRU protocol (Search/Retrieve via URL)
c) Paste the Gallica query into the SRU query
https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=
(dc.subject all "Guerre mondiale 1914-1918") and (dc.type all "image")
and (gallicapublication_date>="1914/01/01")
and (gallicapublication_date<="1918/01/01")&maximumRecords=100
d) Extract the metadata from the XML result list (-> coding)
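Steps c) and d) can be sketched in a few lines of Python. This is an illustration only: it assumes the response uses the standard SRU 1.2 namespace (`http://www.loc.gov/zing/srw/`) and Dublin Core elements, and the names `sru_url` and `parse_sru` are hypothetical helpers. The `startRecord` parameter is the standard SRU paging argument and is not shown on the slide.

```python
import urllib.parse
import xml.etree.ElementTree as ET

SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"  # endpoint quoted on the slide
SRW = "http://www.loc.gov/zing/srw/"         # SRU 1.2 namespace (assumed for Gallica)
DC = "http://purl.org/dc/elements/1.1/"      # Dublin Core namespace


def sru_url(cql_query, maximum_records=100, start_record=1):
    """Percent-encode the CQL query copied from Gallica into an SRU request URL."""
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": maximum_records,
        "startRecord": start_record,
    }
    return SRU_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_sru(xml_text):
    """Return (total number of hits or None, list of per-record metadata dicts)."""
    root = ET.fromstring(xml_text)
    total_el = root.find("{%s}numberOfRecords" % SRW)
    total = int(total_el.text) if total_el is not None else None
    records = []
    for record in root.iter("{%s}record" % SRW):
        records.append({
            "titles": [e.text for e in record.iter("{%s}title" % DC)],
            "identifiers": [e.text for e in record.iter("{%s}identifier" % DC)],
        })
    return total, records
```

The total number of hits tells the client how many pages to request by increasing `start_record` in steps of `maximum_records`.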
1. How to find documents related to WW1? (cont’)
1.2 With the SRU protocol (Search/Retrieve via URL)
13483
first
• All documents on the WW1 theme:
For humans (HTML format):
http://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/#documents
For machines (RDF XML/n3…, JSON-LD formats):
https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.xml
https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.n3
1. How to find documents related to WW1? (cont’)
1.4 With data.bnf.fr: semantic search on Linked Data
• Authors related to WW1:
https://data.bnf.fr/fr/linked-authors/11939093
• Documents on Verdun:
https://data.bnf.fr/fr/15265210/verdun__meuse__france_
https://data.bnf.fr/fr/15265210/verdun__meuse__france_/rdf.xml
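The URL pattern above (human page, then `rdf.xml` or `rdf.n3` for machines) can be captured in a small helper. This is a sketch under the assumption that the pattern holds for any data.bnf.fr page; `rdf_urls` and `described_resources` are illustrative names, and the parser only lists the `rdf:about` URIs found in an RDF/XML document.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_urls(page_url):
    """Derive the machine-readable URLs from a data.bnf.fr page URL."""
    base = page_url.split("#", 1)[0].rstrip("/")  # drop any fragment such as #documents
    return {"rdfxml": base + "/rdf.xml", "n3": base + "/rdf.n3"}


def described_resources(rdf_xml_text):
    """List the URIs of the resources described in an RDF/XML document."""
    root = ET.fromstring(rdf_xml_text)
    about = "{%s}about" % RDF_NS
    return [el.get(about) for el in root.iter() if el.get(about)]
```

For larger jobs, a dedicated RDF library or the SPARQL endpoint mentioned earlier would likely be a better fit than parsing RDF/XML by hand.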
1. How to find documents related to WW1? (cont’)
1.4 With data.bnf.fr: semantic search on Linked Data
3 : GallicaPix (back-end)
4 : GallicaPix (front-end)
1 : Human/machine queries
2 : Results: metadata
2. How to work with the documents?
From the results list (2):
a) Get the documents' metadata
b) Store the metadata locally (3)
c) Get the documents (if needed) and store them (3)
d) Build services on top of the local database (4)
Store the data?
In a document oriented database (NoSQL):
• XML databases: BaseX, eXist…
• JSON databases: MongoDB
• graph oriented databases
In any other place…
a) Get the documents metadata
a.1) With OAI-PMH:
http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=ark:/12148/bpt6k5738219s
a.2) With the Gallica Document API:
https://gallica.bnf.fr/services/OAIRecord?ark=bpt6k5738219s
a.3) With IIIF:
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k5738219s/manifest.json
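The three access points differ only in how the ARK identifier is placed in the URL. The sketch below builds them from the ARK name and reads canvas labels out of a manifest. It assumes the manifest follows the IIIF Presentation 2 layout (`sequences` containing `canvases`); `metadata_urls` and `canvas_labels` are hypothetical helper names.

```python
import json

ARK_PREFIX = "ark:/12148/"  # BnF naming authority, as in the URLs above


def metadata_urls(ark_name):
    """Return the three metadata URLs for an ARK name such as 'bpt6k5738219s'."""
    ark = ARK_PREFIX + ark_name
    return {
        "oai_pmh": "http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord"
                   "&metadataPrefix=oai_dc&identifier=" + ark,
        "gallica_oairecord": "https://gallica.bnf.fr/services/OAIRecord?ark=" + ark_name,
        "iiif_manifest": "https://gallica.bnf.fr/iiif/" + ark + "/manifest.json",
    }


def canvas_labels(manifest_text):
    """List canvas (page) labels, assuming an IIIF Presentation 2 manifest."""
    manifest = json.loads(manifest_text)
    labels = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            labels.append(canvas.get("label"))
    return labels
```

The manifest is the only one of the three that also describes the pages, so it is the natural entry point when the images themselves are needed.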
Let’s do it!
• Text:
https://gallica.bnf.fr/ark:/12148/bpt6k6399988n.texteBrut
• OCR:
https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k6399988n&E=ALTO&Deb=10
• Table of contents:
https://gallica.bnf.fr/services/Toc?ark=ark:/12148/bpt6k6399988n
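As an illustration, the text, OCR and table-of-contents URLs can be generated from an ARK name, and the words can be pulled out of an ALTO file. This sketch assumes that `Deb` selects the page (the slide does not say so explicitly) and relies on the ALTO convention that each word is a `String` element with a `CONTENT` attribute; `content_urls` and `alto_words` are illustrative names.

```python
import xml.etree.ElementTree as ET


def content_urls(ark_name, page=1):
    """Build the content URLs shown above for an ARK name such as 'bpt6k6399988n'."""
    return {
        "plain_text": f"https://gallica.bnf.fr/ark:/12148/{ark_name}.texteBrut",
        # Assumption: 'Deb' is the page number of the requested ALTO file.
        "alto_ocr": f"https://gallica.bnf.fr/RequestDigitalElement?O={ark_name}&E=ALTO&Deb={page}",
        "toc": f"https://gallica.bnf.fr/services/Toc?ark=ark:/12148/{ark_name}",
    }


def alto_words(alto_xml_text):
    """Return the words of an ALTO page in document order.

    The ALTO namespace varies between schema versions, so elements are
    matched on their local name only.
    """
    root = ET.fromstring(alto_xml_text)
    words = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and element.get("CONTENT"):
            words.append(element.get("CONTENT"))
    return words
```

ALTO also carries word coordinates (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`), which is what makes it more useful than the plain text when text has to be located on the page image.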
c) Extract the documents content with the Gallica API
• Preprocessed images:
https://gallica.bnf.fr/ark:/12148/btv1b8593523f.thumbnail
https://gallica.bnf.fr/ark:/12148/btv1b8593523f.medres
https://gallica.bnf.fr/ark:/12148/btv1b8593523f.highres
• IIIF:
c) Extract the documents content with the Gallica API
https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/full/pct:10/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/461,556,8453,6584/923,719/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/461,556,8453,6584/923,719/0/gray.jpg
Images (with IIIF, International Image
Interoperability Framework)
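The IIIF URLs above all follow the Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, with Gallica adding the page as `f{n}`. A minimal sketch of a URL builder follows; `iiif_image_url` is a hypothetical helper, and the defaults simply mirror the examples on the slide.

```python
def iiif_image_url(ark_name, page=1, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Build a Gallica IIIF Image API URL.

    region: "full" or an (x, y, w, h) tuple in pixels, e.g. a segmentation box.
    size:   "full", "pct:10", "923,719", "256," (width only) or ",700" (height only).
    """
    if not isinstance(region, str):
        region = ",".join(str(int(value)) for value in region)
    return (
        f"https://gallica.bnf.fr/iiif/ark:/12148/{ark_name}/f{page}"
        f"/{region}/{size}/{rotation}/{quality}.{fmt}"
    )


# Whole page at 10% of its size, then a cropped region in colour and in grey.
print(iiif_image_url("btv1b8593523f", size="pct:10"))
print(iiif_image_url("btv1b8593523f", region=(461, 556, 8453, 6584), size="923,719"))
print(iiif_image_url("btv1b8593523f", region=(461, 556, 8453, 6584),
                     size="923,719", quality="gray"))
```

Because the region is part of the URL, a bounding box produced by a segmentation step can be turned into an image request without downloading or cropping the full page locally.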
Gallica Welcome
Collection
Europeana …
GallicaPix
…
Gallica
Welcome
Collection
d) Build services
Segmentation
x,y,w,h
d) Build services
Visual Indexing:
• classification
• object detection
• instance detection
• semantic segmentation
…
Car
x1,y1,w1,h1
Person
x1,y1,w1,h1
d) Build services
Classification of genres:
• build a reference dataset
• train a model (CNN)
1. Leverage the
metadata (SRU)
2. Download the
images (IIIF)
3. Train the model
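Steps 1 and 2 amount to turning labelled metadata into a list of image URLs split into training and validation sets. The sketch below is one possible way to do that, not the GallicaPix implementation: the genre labels, the 256-pixel width and the 20% validation share are all assumptions, and `thumbnail_url` and `split_dataset` are hypothetical helpers. Model training itself (step 3) is left out.

```python
import hashlib


def thumbnail_url(ark_name, page=1, width=256):
    """IIIF URL of a reduced-size image, large enough for CNN training (assumed width)."""
    return (
        f"https://gallica.bnf.fr/iiif/ark:/12148/{ark_name}/f{page}"
        f"/full/{width},/0/native.jpg"
    )


def split_dataset(labelled, validation_percent=20):
    """Split (ark_name, genre) pairs into training and validation samples.

    The split is derived from a hash of the ARK name, so a document always
    lands in the same subset, even when the list is rebuilt later.
    """
    train, validation = [], []
    for ark_name, genre in labelled:
        sample = {"url": thumbnail_url(ark_name), "label": genre}
        bucket = int(hashlib.md5(ark_name.encode("utf-8")).hexdigest(), 16) % 100
        (validation if bucket < validation_percent else train).append(sample)
    return train, validation


# Hypothetical labels; in practice they would come from the dc:type field of SRU results.
examples = [("btv1b8593523f", "map"), ("bpt6k5738219s", "photo")]
train_set, validation_set = split_dataset(examples)
```

Requesting reduced-size images through IIIF keeps downloads small, which matters when the reference dataset reaches thousands of images.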
d) Build services
Visual Indexing example: IBM Watson
Pros: no need to handle image files (size, rotation, crop),
no local storage
Cons: server intensive, speed, time out
curl -X POST -u "apikey:****" \
  --form "url=https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9604090x/f1/22,781,4334,4751/,700/0/native.jpg" \
  "https://gateway.watsonplatform.net/visual-recognition/api/v3/classify?version=2018-03-19"
d) Build services
Aggregate,
process, enrich
Request: maps
of Verdun
results
Gallica
Europeana
GallicaPix
…
Selection
Segmentation
Indexation QA Use
Search
API, datasets
Access
IIIF in the Visual Indexing Workflow
IIIF IIIF IIIF
IIIF makes prototyping and training of models easier,
but it can be inefficient for processing large datasets
Thanks for your attention!

IIIF & Digital Humanities

  • 1. Giving access to BnF data Jean-Philippe Moreux May 22th, DIGITENS Workshop – Brest
  • 4. What data? "Raw" Data Data on Data Data as Datasets Data on use Derived Data …
  • 5. How? APIs (Application Programming Interface): allows developers to write programs that communicate with each other Datasets: collection of ready to use or on-demand data/documents Web services: allows machines to communicate on the web, using web protocoles (HTTP) Temporality: synchronous/asynchronous
  • 8. A (quick) map Digital Stores OAI-PMH Catalog, Gallica Catalogs API IIIF Image Digital Images API Gallica Digital Documents data.bnf.fr Linked Data SRU Catalog
  • 11. NewsEye H2020 project • Article Separation • HTR (OCR++) • Named Entities Recognition… https://www.newseye.eu/ • French dataset: 60k issues delivered as metadata+OCR only (no images) • The partners can ingest images for processing at page level or document level (manifest.json) • The project DL can handle IIIF (Fedora) Pros: no more HDs! Cons: can be a long download and painful for DLs servers…
  • 12. • Proof of Concept on image search for digital libraries (topic : WW1) • Automatic extraction of content from BnF digital collections (IIIF, Gallica, SRU, OAI-PMH, SPARQL) • Visual content enrichment thanks to deep learning approaches Image Search http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq GallicaPix
  • 13. • Automatic face/genre recognition with deep learning (L’Excelsior, 1910-1920) • Data analysis, data visualisation Image Search for DH http://demo14-18.bnf.fr:8984/rest?run=findIllustrations-form.xq
  • 14. How to work with digital content Detailled example: GallicaPix web app on Image Search (WW1) OAI-PMH SRU Linked Data IIIF
  • 15. I. Select the required metadata/documents II. Identify the ways to get access to these data III. Extract the resources (asynchronous mode) or use them in real time (synchronous) IV. Build the application/analyse the data/… How to work with BnF digital content GallicaPix : block diagram
  • 16. 1. How to find documents related to WW1? 1.1 With OAI-PMH (Open Archives Initiative - Protocol for Metadata Harvesting) A : Gallica OAI repository B : BnF Catalog repository C : Europeana repository 3 : GallicaPix (back-end) 4 : GallicaPix (front-end) 1 : Machine/machine queries 2 : Results: list of documents metadata
  • 17. • List the Gallica « sets » in the OAI repository: http://oai.bnf.fr/oai2/OAIHandler?verb=ListSets • Harvest the WW1 set (« gallica:corpus:1418 ») http://oai.bnf.fr/oai2/OAIHandler?verb=ListRecords&metadata Prefix=oai_dc&set=gallica:corpus:1418 … Drawbacks: • No search criteria • The sets must have been created by the OAI owner Let’s do it! 1.1 With OAI-PMH
  • 18. a) Search in Gallica (keyword search or advanced form) b) Copy the query segment in the URL 1. How to find documents related to WW1? (cont’) 1.2 With the SRU protocol (Search/Retrieve via URL)
  • 19. c) Paste the Gallica query into the SRU query https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query= (dc.subject all "Guerre mondiale 1914-1918") and (dc.type all "image") and (gallicapublication_date>="1914/01/01") and (gallicapublication_date<="1918/01/01")&maximumRecords =100 d) Extract the metadata from the XML result list (-> coding) 1. How to find documents related to WW1? (cont’) 1.2 With the SRU protocol (Search/Retrieve via URL) 13483 first
  • 20. • All the documents about WW1 theme: For humans (HTML format): http://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/#documents For machines (RDF XML/n3…, JSON-LD formats): https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.xml https://data.bnf.fr/fr/11939093/guerre_mondiale__1914-1918_/rdf.n3 1. How to find documents related to WW1? (cont’) 1.4 With data.bnf.fr: semantic search on Linked Data
  • 21. • Authors related to WW1: https://data.bnf.fr/fr/linked-authors/11939093 • Documents on Verdun: https://data.bnf.fr/fr/15265210/verdun__meuse__france_ https://data.bnf.fr/fr/15265210/verdun__meuse__france_/rdf.xml 1. How to find documents related to WW1? (cont’) 1.4 With data.bnf.fr: semantic search on Linked Data
  • 22. 3 : GallicaPix (back-end) 4 : GallicaPix (front-end) 1 : Human/machine queries 2 : Results: metadata 2. How to work with the documents? From the results list (2): a) Get the document metadata b) Store the metadata locally (3) c) Get the documents (if needed) and store them (3) d) Build services on top of the local database (4)
  • 23. Store the data? In a document oriented database (NoSQL): • XML databases: BaseX, eXist… • JSON databases: MongoDB • graph oriented databases In any other place…
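As an illustration of the "any other place" option, the sketch below keeps one JSON document per ARK in SQLite, using only the Python standard library. It is a stand-in chosen for brevity, not one of the databases named on the slide, and the table and function names are hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny key/document store; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "ark TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
    )
    return conn


def save_metadata(conn, ark, metadata):
    """Insert or overwrite the metadata of one document, serialised as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO documents (ark, metadata) VALUES (?, ?)",
        (ark, json.dumps(metadata, ensure_ascii=False)),
    )
    conn.commit()


def load_metadata(conn, ark):
    """Return the stored metadata as a dict, or None if the ARK is unknown."""
    row = conn.execute(
        "SELECT metadata FROM documents WHERE ark = ?", (ark,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Storing the metadata as opaque JSON keeps the schema-free flavour of a document database; a real service would likely add indexed columns for the fields it searches on.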
  • 24. a) Get the documents metadata a.1) With OAI-PMH: http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=ark:/12148/bpt6k5738219s a.2) With the Gallica Document API: https://gallica.bnf.fr/services/OAIRecord?ark=bpt6k5738219s a.3) With IIIF: https://gallica.bnf.fr/iiif/ark:/12148/bpt6k5738219s/manifest.json Let’s do it!
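The three access routes above share the same ARK name, so they can be generated from it. The sketch below also walks a manifest to list its page images; it assumes the manifest follows the IIIF Presentation API 2.x layout (`sequences` > `canvases` > `images` > `resource`), which should be verified on a real Gallica manifest. Function names are hypothetical.

```python
def metadata_urls(ark_name):
    """Return the three metadata URLs shown on the slide for one ARK name."""
    return {
        "oai_pmh": ("http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord"
                    f"&metadataPrefix=oai_dc&identifier=ark:/12148/{ark_name}"),
        "gallica_api": f"https://gallica.bnf.fr/services/OAIRecord?ark={ark_name}",
        "iiif_manifest": f"https://gallica.bnf.fr/iiif/ark:/12148/{ark_name}/manifest.json",
    }


def canvas_images(manifest):
    """List (canvas label, image URL) pairs from an already-parsed manifest dict.

    Assumes the IIIF Presentation 2.x structure; other versions differ.
    """
    pairs = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                pairs.append((canvas.get("label"), image["resource"]["@id"]))
    return pairs
```

After downloading the `iiif_manifest` URL and decoding it with `json.loads`, `canvas_images` gives one entry per page, which is what a partner needs to ingest images at page level.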
  • 25. c) Extract the documents content with the Gallica API • Text: https://gallica.bnf.fr/ark:/12148/bpt6k6399988n.texteBrut • OCR: https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k6399988n&E=ALTO&Deb=10 • Table of contents: https://gallica.bnf.fr/services/Toc?ark=ark:/12148/bpt6k6399988n
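The OCR request above returns ALTO XML, in which each recognised word is a `String` element carrying a `CONTENT` attribute inside a `TextLine`. The sketch below rebuilds the text lines from such a file. Because the ALTO namespace varies between versions and producers, it matches on local element names only; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the OCR text of an ALTO document as a list of lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            # Each <String CONTENT="..."> child is one recognised word.
            words = [child.get("CONTENT", "")
                     for child in elem
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines
```

The same traversal can be extended to read the `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes when word coordinates are needed, for example to highlight search hits on the page image.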
  • 26. • Preprocessed images: https://gallica.bnf.fr/ark:/12148/btv1b8593523f.thumbnail https://gallica.bnf.fr/ark:/12148/btv1b8593523f.medres https://gallica.bnf.fr/ark:/12148/btv1b8593523f.highres • IIIF: c) Extract the documents content with the Gallica API https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/full/pct:10/0/native.jpg https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/461,556,8453,6584/923,719/0/native.jpg https://gallica.bnf.fr/iiif/ark:/12148/btv1b8593523f/f1/461,556,8453,6584/923,719/0/gray.jpg
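The IIIF addresses above all follow the Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, so they can be generated rather than typed. A minimal sketch, assuming the Gallica base URL and the `native` quality keyword seen on the slide; the function name and defaults are hypothetical.

```python
GALLICA_IIIF = "https://gallica.bnf.fr/iiif/ark:/12148"  # base taken from the slide


def iiif_image_url(ark_name, page=1, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Build a IIIF Image API URL for one page of a Gallica document.

    region: 'full' or 'x,y,w,h' in pixels; size: e.g. 'pct:10' or 'w,h'.
    """
    return (f"{GALLICA_IIIF}/{ark_name}/f{page}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


if __name__ == "__main__":
    # A 10% overview, then a grey-level crop of the same page.
    print(iiif_image_url("btv1b8593523f", size="pct:10"))
    print(iiif_image_url("btv1b8593523f", region="461,556,8453,6584",
                         size="923,719", quality="gray"))
```

Requesting a crop at reduced size lets the image server do the resizing, which is the "no more hard drives" benefit mentioned earlier, at the cost of extra load on that server.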
  • 27. Images (with IIIF, International Image Interoperability Framework) Gallica Wellcome Collection Europeana …
  • 30. d) Build services Visual Indexation: • classification • object detection • instance detection • semantic segmentation … Car x1,y1,w1,h1 Person x1,y1,w1,h1
  • 31. d) Build services Classification of genres: • build a reference dataset • train a model (CNN) 1. Leverage the metadata (SRU) 2. Download the images (IIIF) 3. Train the model
  • 32. d) Build services Visual Indexation example: IBM Watson Pros: no need to handle image file (size, rotation, crop), no local storage Cons: server intensive, speed, time out curl -X POST -u "apikey:****" --form "url=https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9604090x/f1/22,781,4334,4751/,700/0/native.jpg" "https://gateway.watsonplatform.net/visual-recognition/api/v3/classify?version=2018-03-19"
  • 33. d) Build services Aggregate, process, enrich Request: maps of Verdun results Gallica Europeana GallicaPix …
  • 34. Selection Segmentation Indexation QA Use Search API, datasets Access IIIF in the Visual Indexing Workflow IIIF IIIF IIIF IIIF makes prototyping and training of models easier, but it can be inefficient for processing large datasets
  • 35. Thanks for your attention!

Editor's Notes

  1. First step of the data lifecycle: making the data available. Which data are concerned: the data and metadata produced by the institution. Open data / opening up public data: metadata in 2014. Gallica reuse conditions. In what forms, and how, should they be made available? Technical provision. Synchronous querying: API: an application programming interface that lets two machines talk to each other through one or more standardised protocols. Web services: an API that uses web protocols. Example: smartphone applications that combine public and private data to create a service. Asynchronous querying: data; datasets whose very compilation adds value.
  2. Libraries have a long tradition of exchanging data. The situation in 2016: several APIs opened by the BnF: the historical Z39.50, a good example of an API that is not a web service; the SRU protocol on the general catalogue, the web version of Z39.50; heavily used APIs such as the OAI repositories; a service created specifically for data dissemination: data.bnf.fr and its SPARQL endpoint; APIs first created for internal use and sometimes accessible from outside: Gallica SRU, IIIF. But access and documentation (where it existed) were scattered, a reflection of history, and so were the uses. Recall the diversity of audiences: professional audiences, developers. The wake-up call: the 2016 hackathon > officially document and take ownership of the opening of the IIIF and Gallica SRU web services; creation of a wiki on the GitHub platform. On the occasion of the 2017 hackathon: bring together the existing documentation. Also the datasets, notably those produced within research projects: link with the CORPUS project. Image corpora. Metadata dumps. URL lists from the internet legal deposit. Statistics.
  6. Presentation of the API and datasets portal: the primary purpose of this portal is to centralise the documentation on the APIs and datasets. Each description is organised as follows: technical sheet; contact point (under discussion); sample queries, the classic way of presenting APIs; request and response formats; related APIs and datasets. Editorial line: the added value of presenting them on a portal lies in the uses opened up by cross-referencing datasets and APIs. It is not only a description: we have also laid the groundwork for editorial work around these datasets. Cross-cutting pages on essential notions such as identifiers. News about new datasets or web services, and service news (switch to HTTPS, disaster recovery plan). Links with other projects such as Gallica Studio. Links with other places where the data are described for other audiences, such as professional audiences.
  7. Literary webography: a literary watch site dedicated to fiction, connected to 200 blogs, Wikipedia, YouTube, podcasts, the BnF, VIAF and local booksellers. An interesting example because Bibliosurf uses the BnF web services in two ways: it uses BnF data directly (translator's name, original title, series), and BnF data also serve as a pivot, through identifiers (ISNI, VIAF, Wikidata), to retrieve information from other databases (Wikipedia). Enrichment of book descriptions through the general catalogue SRU: translator's name, original title and series, each time the record is displayed, with a 24-hour cache. In addition, as long as an author has no ISNI, whenever a user displays the author's record on Bibliosurf, an ISBN query is sent to the BnF SRU to retrieve the ISNI from the UNIMARC record. If the ISNI is found, it is added to the Bibliosurf database. Wikidata identifier: a SPARQL query on data.bnf.fr will be run periodically to retrieve these identifiers, which are then used to query Wikidata and retrieve the URLs of authors' websites. VIAF: a SPARQL query on data.bnf.fr will be run periodically to retrieve these identifiers, which are then used to query VIAF and retrieve Wikipedia links not yet referenced. Bibliosurf then uses the Wikipedia API to display authors' photos and biographies. Bibliosurf lists 3485 authors: 2954 have an ISNI, 1830 a Wikidata identifier, 2751 a VIAF identifier, 1955 a Wikipedia link and 412 a website.
  8. http://gallica.bnf.fr/ark:/12148/btv1b530027497