Archiving the French Web:
the BnF web archiving workflow
Sara Aubry
Web Archiving Project Manager, IT department
Bibliothèque nationale de France
International Conference on Web archives and e-LD
Biblioteca Nacional de España, Madrid, July 9th 2013
Let’s start with some figures
• Programme start in 2000, industrialisation in 2008-2012
• Collections:
– 1996 - now
– 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
– 18.8 billion URLs, 370 TB, growing by about 100 TB / year
• Resources:
– 9 Full Time Employees (5 librarians, 4 engineers)
– many partners inside and outside the Library, at both the national and
international level
– 70 robots (648GB RAM, 144 CPUs 2.4GHz)
Digital curation is not different!
• « Actions, tools and practices defined
and applied to collect, identify, select,
organize and preserve digital contents
(…) in order to use them and make them
available (…) »
Definition of Digital Archiving in Wikipedia
BnF workflow overview
Selecting
Collecting
Indexing
Accessing
Preserving
nas_preload
Selecting with BCWeb
• A form-based application, commonly called a
« curator tool »
– for content curators and researchers to nominate
websites to harvest
– giving basic information about them (content policies,
trends watch)
• Most important information for each website:
– Internet address/URL
– frequency (daily, monthly, yearly, once…)
– size/budget (small, medium, big)
– depth (entire domain, part of it)
Content curators
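The fields listed above could be modelled as a small record. This is only a sketch: the class and function names are hypothetical and are not taken from BCWeb, and the frequency list covers only the values named on the slide (the slide's list is open-ended).

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Frequency(Enum):
    # Only the values named on the slide; the real list is longer.
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART = "part of it"


@dataclass(frozen=True)
class Nomination:
    """One website nominated for harvesting (hypothetical structure)."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth


def is_valid(nomination: Nomination) -> bool:
    """Accept only absolute http(s) URLs that name a host."""
    parts = urlparse(nomination.url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Using enumerations for frequency, budget and depth keeps a nomination restricted to the choices a form would offer.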
The Web is made of HTML pages
1 HTML page, 48 URLs
• 1 HTML
• 1 text/css
• 4 javascript
• 17 image/png
• 5 image/jpeg
• 21 image/gif
all links and
inclusions are URL
references
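To illustrate how one page expands into many URL references, here is a minimal sketch using only the Python standard library. The attribute list (`href`, `src`) is an assumption and deliberately incomplete; a real harvester also looks inside CSS, JavaScript and other attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes assumed to carry a URL reference (illustrative, not exhaustive).
URL_ATTRS = {"href", "src"}


class ReferenceCollector(HTMLParser):
    """Collects links and inclusions found in the tags of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative references against the page address.
                self.references.append(urljoin(self.base_url, value))


def extract_references(base_url, html):
    """Return every referenced URL, in document order."""
    collector = ReferenceCollector(base_url)
    collector.feed(html)
    return collector.references
```

Both the stylesheet, script and image inclusions and the ordinary hyperlinks come back in the same list, which matches the slide's point that they are all URL references.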
Harvesting with Heritrix
• A harvester is a piece of
software (crawler,
spider, robot)
• Simulates what a
person would do with a
browser but repeatedly
and very fast
• Follows a looping
process
• Repeated as long as new
in-scope URLs are found
and limits (budget and
time) are not reached
Crawl loop: pick a location → make a request → receive a response → examine for references → save the content (WARC)
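The looping process can be sketched as follows. This is a simplified illustration, not Heritrix code: `fetch`, `extract` and `in_scope` are hypothetical callables supplied by the caller, the budget is a plain count of saved URLs, and the time limit is left out for brevity.

```python
from collections import deque


def crawl(seeds, fetch, extract, in_scope, max_urls):
    """Breadth-first crawl loop; returns a dict of url -> saved content."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid repeats
    saved = {}

    # Stop when no new in-scope URL is left or the budget is reached.
    while frontier and len(saved) < max_urls:
        url = frontier.popleft()            # pick a location
        content = fetch(url)                # make a request, receive a response
        if content is None:                 # nothing came back, move on
            continue
        for ref in extract(url, content):   # examine for references
            if ref not in seen and in_scope(ref):
                seen.add(ref)
                frontier.append(ref)
        saved[url] = content                # save the content
    return saved
```

Passing the fetching and scoping logic in as functions keeps the loop itself small and lets it be tried against an in-memory stand-in for the web.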
Assets:
- open source
- small and large scale
- textual or all-media formats
- data structures
Digital curators: legal
deposit department
Engineers: IT department
Challenges:
• rich media and ever-changing
environment
• social networks
• content behind paywalls
(news sites, ebooks)
Piloting the crawls with
NetarchiveSuite
• Prepare, schedule, run and monitor harvests
of websites, perform QA
Digital curators: legal
deposit department
Engineers: IT department
Offering access with Wayback
• Give readers the ability to
browse the web “as it
was” with:
– a regular web browser
– search and redisplay
software
• An application called
“Web archives”
– Wayback: for URL search,
display and browsing
– Nutch prototype for
keyword search
– Guided paths for collection
highlights
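Browsing the web "as it was" comes down to choosing, for a requested URL and date, the capture closest in time. A minimal sketch, assuming captures are identified by 14-digit `YYYYMMDDhhmmss` timestamps as Wayback does; the function name is hypothetical and this is not Wayback's own code.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


def closest_capture(captures, wanted):
    """Return the capture timestamp nearest to the wanted timestamp."""
    target = datetime.strptime(wanted, TIMESTAMP_FORMAT)
    return min(
        captures,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )
```

The same selection applies to every image, stylesheet and script a replayed page includes, so one displayed page may combine captures made at slightly different times.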
Challenges:
• links with our main Catalogue and
open data repository
• “smart” URL search
• full text search and indexing
• small-scale data mining projects with
researchers
Questions?
E-mail: sara.aubry@bnf.fr
Web site: http://www.bnf.fr
Twitter: http://twitter.com/DLWebBnF

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Archiving the French Web: the BnF web archiving workflow. Sara Aubry

  • 1. Archiving the French Web: the BnF web archiving workflow
    Sara Aubry, Web Archiving Project Manager, IT department, Bibliothèque nationale de France
    International Conference on Web archives and e-LD, Biblioteca Nacional de España, Madrid, July 9th 2013
  • 2. Let’s start with some figures
    • Programme started in 2000; industrialisation in 2008-2012
    • Collections:
      – 1996 to now
      – 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
      – 18.8 billion URLs, 370 TB, growing at +100 TB per year
    • Resources:
      – 9 full-time employees (5 librarians, 4 engineers)
      – many partners inside and outside the Library, at both the national and international level
      – 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)
  • 3. Digital curation is not different!
    « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) »
    (Definition of Digital Archiving in Wikipedia)
  • 6. Selecting with BCWeb
    • A form-based application, commonly called a « curator tool »
      – for content curators and researchers to nominate websites to harvest
      – giving basic information about them (content policies, trends watch)
    • Most important information for each website:
      – Internet address/URL
      – frequency (daily, monthly, yearly, once…)
      – size/budget (small, medium, big)
      – depth (entire domain, part of it)
    Content curators
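The per-website information listed on slide 6 could be modelled roughly as follows. This is only a sketch: the class, enum and field names are hypothetical and do not describe BCWeb's actual data model, and the frequency list on the slide is open-ended ("once…"), so the enum covers only the values the slide names.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    # Only the values named on the slide; the real list may be longer.
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of it"


@dataclass(frozen=True)
class Nomination:
    """One website nominated for harvesting (hypothetical structure)."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth

    def __post_init__(self):
        # Minimal sanity check: a crawler needs an absolute HTTP(S) address.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute HTTP(S) URL: {self.url!r}")


example = Nomination(
    url="http://www.bnf.fr",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
```

Using enums rather than free text keeps the form values restricted to the choices a curator can actually pick.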
  • 7. The Web is made of HTML pages
    1 HTML page, 48 URLs:
    • 1 HTML
    • 1 text/css
    • 4 javascript
    • 17 image/png
    • 5 image/jpeg
    • 21 image/gif
    All links and inclusions are URL references.
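To make the point of slide 7 concrete, the sketch below lists the URL references found in a small HTML page using only the Python standard library. The sample page and the `example.org` addresses are made up, and only the `href` and `src` attributes are inspected, which is an assumption; a production crawler looks at many more places (CSS, JavaScript, other attributes).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly carry URL references (deliberately not exhaustive).
URL_ATTRS = {"href", "src"}


class ReferenceCollector(HTMLParser):
    """Collects (tag, absolute URL) pairs for every link or inclusion."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative references against the page address.
                self.references.append((tag, urljoin(self.base_url, value)))


def collect_references(html, base_url):
    collector = ReferenceCollector(base_url)
    collector.feed(html)
    return collector.references


SAMPLE = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="/app.js"></script>
</head><body>
<img src="logo.png">
<a href="http://example.org/about">About</a>
</body></html>
"""

refs = collect_references(SAMPLE, "http://example.org/index.html")
by_tag = Counter(tag for tag, _ in refs)
```

Here `refs` holds four references (one stylesheet, one script, one image, one link), each of which a harvester would have to fetch or consider in turn.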
  • 8. Harvesting with Heritrix
    • A harvester is a piece of software (crawler, spider, robot)
    • Simulates what a person would do with a browser, but repeatedly and very fast
    • Follows a looping process: pick a location, make a request, receive a response, examine for references, save the content (WARC)
    • Repeated as long as new in-scope URLs are found and limits (budget and time) are not reached
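The looping process on slide 8 can be sketched as a simple breadth-first crawl. This is an illustration under stated assumptions, not Heritrix's implementation: the "web" is a made-up in-memory dictionary instead of real HTTP requests, scope is reduced to a single host name, only a budget limit is modelled (no time limit), and a Python list stands in for the WARC files.

```python
from collections import deque
from urllib.parse import urlparse

# A tiny in-memory "web": URL -> URLs it references (made-up data).
FAKE_WEB = {
    "http://example.fr/": ["http://example.fr/a", "http://example.fr/b", "http://other.com/"],
    "http://example.fr/a": ["http://example.fr/", "http://example.fr/c"],
    "http://example.fr/b": [],
    "http://example.fr/c": ["http://example.fr/d"],
    "http://example.fr/d": [],
}


def in_scope(url, allowed_host):
    """Scope rule for this sketch: stay on one host."""
    return urlparse(url).hostname == allowed_host


def crawl(seed, allowed_host, budget):
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid loops
    saved = []                 # stands in for the WARC output
    # Repeat as long as URLs remain and the budget is not reached.
    while frontier and len(saved) < budget:
        url = frontier.popleft()        # pick a location
        links = FAKE_WEB.get(url)       # make a request, receive a response
        if links is None:
            continue                    # nothing came back for this URL
        for link in links:              # examine for references
            if link not in seen and in_scope(link, allowed_host):
                seen.add(link)
                frontier.append(link)
        saved.append(url)               # save the content
    return saved


full = crawl("http://example.fr/", "example.fr", budget=10)
limited = crawl("http://example.fr/", "example.fr", budget=3)
```

With a generous budget the crawl stops because no new in-scope URL is left; with `budget=3` it stops early because the limit is reached, which are the two stopping conditions named on the slide.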
  • 9. Assets:
    – open source
    – small and large scale
    – textual or all-media formats
    – data structures
  • 11. Engineers: IT department
    Challenges:
    • rich media and ever-changing environment
    • social networks
    • content behind paywalls (news sites, ebooks)
  • 12. Piloting the crawls with NetarchiveSuite
    • Prepare, schedule, run and monitor harvests of websites, perform QA
    Digital curators: legal deposit department
    Engineers: IT department
  • 13. Offering access with Wayback
    • Give readers the ability to browse the web “as it was” with:
      – a regular web browser
      – search and redisplay software
    • An application called “Web archives”
      – Wayback: for URL search, display and browsing
      – Nutch prototype for keyword search
      – Guided paths for collection highlights
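URL search in an archive typically means: given an address and a date, find the capture closest to that date. The sketch below shows one way to do this lookup. It is not Wayback's code; the index contents are invented, and the only convention borrowed from web archiving practice is the 14-digit `YYYYMMDDhhmmss` timestamp, which sorts chronologically as plain text.

```python
from bisect import bisect_left
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp

# Hypothetical index: URL -> capture timestamps, sorted ascending.
INDEX = {
    "http://www.bnf.fr/": ["20050314120000", "20090601080000", "20130709100000"],
}


def closest_capture(url, wanted):
    """Return the capture timestamp nearest to `wanted`, or None if the URL is unknown."""
    captures = INDEX.get(url, [])
    if not captures:
        return None
    # Binary search gives the insertion point; the nearest capture is
    # either just before or just at that position.
    pos = bisect_left(captures, wanted)
    candidates = captures[max(pos - 1, 0):pos + 1]
    target = datetime.strptime(wanted, TS_FORMAT)
    return min(candidates, key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target))


nearest = closest_capture("http://www.bnf.fr/", "20100101000000")
```

A reader asking for the page as of 1 January 2010 would be served the June 2009 capture, the nearest one in this made-up index.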
  • 17. Challenges:
    • links with our main Catalogue and open data repository
    • “smart” URL search
    • full text search and indexing
    • small-scale data mining projects with researchers
  • 18. Questions?
    E-mail: sara.aubry@bnf.fr
    Web site: http://www.bnf.fr
    Twitter: http://twitter.com/DLWebBnF