W/ARC file
W/ARC record
Header (ex: record ID, capture date, record type, …)
Block (ex: HTTP response, JPEG file, …)
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2006-09-19T17:20:14Z
WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
Content-Type: application/warc-fields
Content-Length: 381
software: Heritrix 1.12.0 http://crawler.archive.org
hostname: crawling017.archive.org
ip: 207.241.227.234
isPartOf: testcrawl-20050708
description: testcrawl with WARC output
operator: IA_Admin
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
format: WARC file version 0.17
conformsTo: http://www.archive.org/documents/WarcFileFormat-0.17.html
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
Content-Length: 236
WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594>
Content-Type: application/http;msgtype=request
WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
GET /images/logoc.jpg HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
From: stack@example.org
Connection: close
Referer: http://www.archive.org/
Host: www.archive.org
Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 1902
HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg
[image/jpeg binary data here]
WARC/1.0
WARC-Type: resource
WARC-Target-URI: file://var/www/htdoc/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: image/jpeg
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
Content-Length: 1662
[image/jpeg binary data here]
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: application/warc-fields
WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
Content-Length: 59
via: http://www.archive.org/
hopsFromSeed: E
fetchTimeMs: 565
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2007-03-06T00:43:35Z
WARC-Profile: http://netpreserve.org/warc/0.17/server-not-modified
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
Content-Type: message/http
Content-Length: 226
HTTP/1.x 304 Not Modified
Date: Tue, 06 Mar 2007 00:43:35 GMT
Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
Connection: Keep-Alive
Keep-Alive: timeout=15, max=100
Etag: "3e45-67e-2ed02ec0"
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2016-09-19T19:00:40Z
WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>
WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
Content-Type: image/neoimg
Content-Length: 934
[image/neoimg binary data here]
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-IP-Address: 207.241.233.58
WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 1
Content-Type: application/http;msgtype=response
Content-Length: 1600
HTTP/1.1 200 OK
Date: Tue, 19 Sep 2006 17:18:40 GMT
Server: Apache/2.0.54 (Ubuntu)
Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
ETag: "3e45-67e-2ed02ec0"
Accept-Ranges: bytes
Content-Length: 1662
Connection: close
Content-Type: image/jpeg
[first 1360 bytes of image/jpeg binary data here]
WARC/1.0
WARC-Type: continuation
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>
WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 2
WARC-Segment-Total-Length: 1902
WARC-Identified-Payload-Type: image/jpeg
Content-Length: 302
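The record headers above all follow the same shape: a version line, then named fields. As a minimal sketch (not a full WARC reader), the Python below splits one such header into its fields. The function name `parse_warc_header` is hypothetical, and the sketch assumes one field per line; it does not handle folded header lines or case-insensitive field names.

```python
def parse_warc_header(text):
    """Split a WARC record header into its version line and named fields."""
    lines = text.strip().splitlines()
    version = lines[0].strip()
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record header: {version!r}")
    fields = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header; the record block follows
        # Split on the first colon only, so values such as dates and
        # <urn:uuid:...> identifiers keep their own colons.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        fields[name.strip()] = value.strip()
    return version, fields


# The continuation record shown above, used as sample input.
SAMPLE = """\
WARC/1.0
WARC-Type: continuation
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
WARC-Segment-Number: 2
WARC-Segment-Total-Length: 1902
Content-Length: 302
"""

version, fields = parse_warc_header(SAMPLE)
print(version, fields["WARC-Type"], int(fields["Content-Length"]))
```

The `Content-Length` field tells a reader how many bytes of block to consume after the header, which is what lets records be concatenated in one file.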
Digitization as a means of preservation and dissemination
Now
Color (24 bits) – 400dpi –
TIFF uncompressed
1 page ~ 80 MB
More than x500 !!!
Then
Black & white – 300dpi –
TIFF G4
1 page ~ 200 KB
SPAR - Infrastructure
SPAR - Realization
Ingest
SPAR
Storage Abstraction Service (SAS)
Administration
Data management
Storage
Access
Preservation planning
Production applications
Dissemination applications
Preservation
digitization
…
wayback
WEB Archiving
….
….
…
Audiovisual
http://public.ccsds.org/publications/archive/650x0m2.pdf
Pre-Ingest
Pre-Ingest
Pre-Ingest
Storage abstraction service
Ingest
Storage
Preservation
planning
Administration
Data management
Access
SIP
DIP
mets
rdf
rdf
Infrastructure
Preservation
digitization
Web archives
And so on
2004 2005 2006 2007 2008 2009 2010 2011 2012
Operations
May 2010
2013
Backup
storage
Backup
Servers
Backup site
Backup secondary storage
Primary
storage
Secondary storage
Lookup storage
Servers
Main site
Backup
Lookup storage
Online
storage
Oracle StorageTek SL8500
• up to 64 tape drives
• up to 8500 tapes
• up to 8 hand pickers
• up to 32 linked libraries
Primary storage
2 libraries
16 PB maximum
Backup storage
2 libraries
16 PB maximum
Capacity 1.5 TB
Transfer rate 140 MB/s
Primary storage
LTO5
Backup storage
T10000B
Capacity 1 TB
Transfer rate 120 MB/s
(previously: 9840C – 40GB) (previously: T10000A – 500GB)
Pre-Ingest
Storage abstraction service
Ingest
Storage
Preservation Administration
Data management
Access
SIP
AIP
DIP
mets
rdf
rdf
AIP
Which formats are allowed?
How many copies are needed, and on what kind of media?
What is the maximum size of a package?
Do we need to log each access?
SLA-I.xml, SLA-P.xml, SLA-A.xml
Mets.xml
Contract.pdf
03/07/1882 28/02/1883 01/03/1883
set
group
object
file
02/07/1882
Year 1883
Le Matin
Year 1882
01/07/1882
For this purpose, PDF/X was chosen as a good compromise between fidelity to the
original, wide usage and standardization
Mets.xml: manifest
T000001.tiff: sample
format.xml: machine readable
description
format.txt: human description
http://www.fao.org/oek/jhove2/digital-preservation-and-jhove2-home/jhove2-tutorial/en/
http://bibnum.bnf.fr/containerMD-v1
Ingest request
reception
Manifest
Validation
Package search
within SPAR
SIP characteristics
audit
SIP files audit
and characterization
ARK identifier
generation
SET processing
Ingest completion
SIP reception
Audit
ACT_01
ACT_02
ACT_03
ACT_04
ACT_05
ACT_06
ACT_07
ACT_08
ACT_09
Structural metadata: METS
Descriptive and source metadata:
qualified Dublin Core
Provenance metadata: PREMIS
Technical metadata:
depends on the data-objects
1996-2005 2002 & 2004 2004-2008 2006-2010 2010-now
70 Tb 0.5 Tb 45 Tb 22 Tb
operator
robot +
150 Tb
Pre Ingest
Digitized books
Digitized
audiovisual
documents
web archiving
Pre Ingest
Pre Ingest
HTML
ARC data
HTML + HTML
1996-2005 2002 & 2004 2004-2008 2006-2010
70 Tb 0.5 Tb 45 Tb 22 Tb
unknown
2010-now
+Alexa bot
150 Tb
ARC data
ARC metadata
HTML
harvested files
harvest 1 + harvest 2 + harvest 3 + …
This is a collection containing French election websites
Here are the
files we
harvested
They are
included in
web archives
specific files
This was done
with these tools
A three-layered model
in SPAR
Harvest Definition (curator collection)
Harvest Instance (“technical” harvest = job)
ARC file (data or metadata)
filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77
1 0 InternetArchive
URL IP-address Archive-date Content-type Archive-length
metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095814 text/xml 366
<?xml version="1.0" encoding="UTF-8"?> <harvestInfo>
<version>0.2</version> <jobId>32</jobId>
<priority>LOWPRIORITY</priority> <harvestNum>0</harvestNum>
<origHarvestDefinitionID>1</origHarvestDefinitionID>
<maxBytesPerDomain>-1</maxBytesPerDomain>
<maxObjectsPerDomain>1000</maxObjectsPerDomain>
<orderXMLName>default</orderXMLName> </harvestInfo>
metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095815 text/xml 44775
<?xml version="1.0" encoding="UTF-8"?> <crawl-order
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
…
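Each record in the ARC sample above starts with a one-line header in the field order announced by the file: URL, IP-address, Archive-date, Content-type, Archive-length. The sketch below parses such a line in Python. It assumes this five-field layout and a 14-digit date; the names `ArcHeader` and `parse_arc_header` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class ArcHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int  # number of bytes in the record body


def parse_arc_header(line):
    """Parse a space-separated, five-field ARC URL-record header line."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    url, ip_address, date, content_type, length = parts
    return ArcHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),
    )


# Header line of the harvestInfo.xml record shown above.
LINE = (
    "metadata://netarkivet.dk/crawl/setup/harvestInfo.xml"
    "?heritrixVersion=1.14.3&harvestid=1&jobid=32 "
    "172.20.16.214 20100414095814 text/xml 366"
)

header = parse_arc_header(LINE)
print(header.archive_date.isoformat(), header.content_type, header.archive_length)
```

As with WARC, the trailing length field is what allows a reader to skip from one record to the next without parsing the body.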
1996-2005 2002 & 2004 2004-2008 2006-2010
70 Tb 0.5 Tb 45 Tb 22 Tb
unknown
2010-now
+Alexa bot
150 Tb
Two layers:
- Collection
- ARC files
1996-2005 2010-now
Three layers:
- Harvest Definition
- Harvest instance
- ARC files
Two layers:
- Collection
- ARC files
2006-2010 2010-now
Four layers:
- Collection
- Harvest division
- Harvest instance
- ARC files
Three layers:
- Harvest Definition
- Harvest instance
- ARC files
03/07/1882 28/02/1883 01/03/1883
set
group
object
file
02/07/1882
Year 1883
Le Matin
Year 1882
Le Matin
01/07/1882 02/07/1882 03/07/1882 28/02/1883 01/03/1883
set
group
AIP
set
Contains nothing but metadata
Curator information; allows grouping of AIPs that share the same intellectual content
AIP
Must contain files to be preserved
Each AIP is an autonomous unit
<mets>
<dmdSec>
Intellectual metadata
<amdSec>
Administrative metadata
<fileSec>
List of the files
<structMap>
Structure of the package
<sourceMD>
Metadata about the source
used to produce this content
<techMD>
Technical metadata
<digiprovMD>
Provenance metadata
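The METS sections listed above can be sketched as an empty skeleton. The Python below builds one with the standard library. It assumes the usual METS layout, in which techMD, sourceMD and digiprovMD sit inside amdSec. The function `build_mets_skeleton` is hypothetical and the output is not schema-valid (required attributes such as IDs are omitted); it only illustrates the nesting.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Return the namespace-qualified name of a METS element."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton():
    """Build an empty METS tree with the sections named on the slide."""
    root = ET.Element(q("mets"))
    ET.SubElement(root, q("dmdSec"))          # intellectual metadata
    amd = ET.SubElement(root, q("amdSec"))    # administrative metadata
    ET.SubElement(amd, q("techMD"))           # technical metadata
    ET.SubElement(amd, q("sourceMD"))         # metadata about the source
    ET.SubElement(amd, q("digiprovMD"))       # provenance metadata
    ET.SubElement(root, q("fileSec"))         # list of the files
    ET.SubElement(root, q("structMap"))       # structure of the package
    return root


root = build_mets_skeleton()
print(ET.tostring(root, encoding="unicode"))
```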
harvestInstance
has harvest
instance
is documented in
Outcome extensions
persons: admins
software
organizations
Harvest event
ARC data
ARC metadata
HTML
harvest 1 + harvest 2 + harvest 3 + …
This is a collection containing French election websites
set
AIP
groups
ARC
ARC.GZ
?
?
HTML
?
HTML
?
version-block
header
metadata
object
First
ARC
record
data
object
containerMD
http://bibnum.bnf.fr/containerMD-v1
<mets>
<dmdSec>
Intellectual metadata
<amdSec>
Administrative metadata
<fileSec>
List of the files
<structMap>
Structure of the package
<sourceMD>
Metadata about the source
used to produce this content
<techMD>
Technical metadata
<digiprovMD>
Provenance metadata
containerMD root element
container
entries
entriesInformation: aggregated information about the entries (factorizing and sum)
entry
entry
entry
ARC-specific extensions
ARCContainer
ARCEntries
ARCRecord
ARCRecord
ARCRecord …
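The "aggregated information about the entries" idea can be illustrated with a small sketch: rather than describing every record individually, a container-level summary keeps a count and a total size per format. The Python below is only an illustration of that factorizing-and-sum step. The `Entry` class, the `summarize` function and the output layout are hypothetical and do not reproduce the containerMD schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    mime_type: str
    size: int  # bytes


def summarize(entries):
    """Aggregate entry count and total size per MIME type."""
    counts = Counter()
    sizes = Counter()
    for entry in entries:
        counts[entry.mime_type] += 1
        sizes[entry.mime_type] += entry.size
    return {
        "number": len(entries),
        "total_size": sum(sizes.values()),
        "formats": {
            mime: {"number": counts[mime], "size": sizes[mime]}
            for mime in sorted(counts)
        },
    }


# Made-up entries, for illustration only.
entries = [
    Entry("index.html", "text/html", 1200),
    Entry("about.html", "text/html", 800),
    Entry("logo.jpg", "image/jpeg", 1662),
]

summary = summarize(entries)
print(summary)
```

The summary stays the same size however many records the container holds, which matters when a single ARC file contains thousands of entries.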
Web archiving at the British Library
Helen Hockx-Yu
Head of Web Archiving
Overview
> Part 1: Background, history and organisation
> Part 2: Web Archiving Tools (including
demos)
> Part 3: Access
> Part 4: Non-print Legal Deposit and future
strategy
29th November 2012 Session 7 -Web archiving at the British Library 2
BL Structure
> BL Board and Executive Team
> e-Strategy and Information Systems (eIS)
> IT-based products and services
> Finance and Corporate Services (F&CS)
> Money
> Human Resources
> People
> Operations & Services (O&S)
> Front line services
> Scholarship and Collections (S&C)
> Content (Arts and humanities, Social Sciences, Science, Technology & Medicine)
> Strategic Marketing and Communications (SMC)
> Brand and reputation
Web archiving timeline
Current web archiving strategy
> Selective archiving of websites that
> reflect the diversity of lives, interests and activities throughout the UK
> contain research value or are of research interest
> feature political, cultural, social and economic events of national interest
> demonstrate innovative use of the web
(4 areas)
> Also prioritise websites at risk and web-only content
> Permission based
> Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance
> 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)
> Online access through UK Web Archive
> Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit
The current Web Archiving team
Skills Profile
> IT
> Collection management, digital curation
> Management
> Communications
> Web Archiving
(Internal Collaboration)
> The Web Archiving Team is involved in the end-to-end process but works
with other departments / teams in the library
Department / Team: Activity / Support
S&C (Subject specialist group, Curator’s Choice project): Selection, curation
eIS: Network, hardware and IT support
O&S, Resource Discovery & Research: Corporate level resource discovery http://explore.bl.uk/
CA&D, Digital Processing: Cataloguing (special collection level)
SMC: Publicity, press release, events
The Legal Deposit Programme: Domain crawl capability / process and policy
Curator’s Choice
> Pilot project with a small group of dedicated curators /
subject specialists
> Special Collections of curator’s choice. Curators take
responsibility for owning, maintaining and growing the
collections over time
> Evolving Role of Libraries in the UK
> Political Action and Communication
> Slavery and Abolition in the Caribbean
> UK relations with the Low Countries
> 19th Century English Literature
> Oral History in the UK
> Film in the UK
> Energy
Web Archiving Advisory Group
> Provide advice and support to the Web Archiving Team
> Act as a ‘critical friend’ to assist in the development of policy
and practice.
> Specific advice and support on:
> Purpose, vision and benefits.
> Strategic direction and planning.
> Synergy with internal teams and collaboration with
external stakeholders/partners.
> Policy changes and risk management
(External) Collaboration
> UK Web Archiving Consortium (2004-2007): centralised infrastructure
and development, distributed collections
> UK Web Archive partners, National Archives, Legal Deposit Libraries
(LDLs)
> External Collaborators
> Wellcome Library
> Live Art Development Agency
> The Cambridge Innovation Network
> The Women’s Library
> Institute of Historical Research, University of London
> Individual researchers, specialists
> General public – ca. 20 nominations / week
> National organisations: DPC, JISC
> International: IIPC
JISC UK Web Domain Dataset (1996-2010)
> Collaboration with JISC and the Internet Archive
> UK Web Domain Dataset (1996-2010) – UK websites
extracted from the Internet Archive's collection and
supported by funding from the JISC
> 35TB research dataset
> No local access to individual websites but access to
secondary dataset allowed
> BL has developed visualisations of the dataset
> JISC funded 2 further projects using this dataset
> Analytical Access to the Domain Dark Archive
> Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social
Science Research
Web Archiving Tools
> Support key processes: selection, harvesting, storage,
access, preservation
> Mostly open source tools, some developed in-house
> New tools / changes to current tools expected when business
processes change due to non-print Legal Deposit
Selection Tools
> Selection: decide what websites to archive and to include as
part of a web archive collection
> Selection and Permission Tool: https://wct.bl.uk/selection/
> Submit selection – real time checking of duplicates, fetching meta tags from live
sites
> Collect metadata
> Add contact details
> Suggest crawl frequency
> Permissions management – send emails, direct users to online licence form, store
the completed forms, pass details to WCT (create authorisation record and a
pending target)
> Reports
> Twittervane
Harvesting Tools
> Harvesting: automated downloading of selected websites
using crawler software; quality assurance regarded as an
element
> The Web Curator Tool (WCT): https://wct.bl.uk/wct/
> Job scheduling
> Metadata
> Access control
> Harvesting (uses Heritrix)
> QA
Quality Assurance
> Placing more emphasis on intellectual content than
appearance or behaviour of a website
> Use four aspects to define quality:
> Completeness of capture: whether the intended content has been captured as
part of the harvest.
> Intellectual content: whether the intellectual content (as opposed to styling and
layout) can be replayed in the Access Tool.
> Behaviour: whether the harvested copy can be replayed including the behaviour
present on the live site, such as the ability to browse between links interactively.
> Appearance: look and feel of a website.
> Rely on visual comparison, previous harvests & crawl logs
> Recent development of QA module to allow bulk operation,
reduce # of clicks and make QA recommendations
Supporting Long-term Preservation
> Storing data in WARCs and metadata in METS
> Migrate all legacy data into WARCs
> WCT output WARC files
> Submission Information Package (SIP) profiles for selective
and domain crawls
> Storing descriptive metadata (eg permission information) & technical metadata
(eg crawl log, crawl configurations, virus scan events)
> Ingest archived websites in the Digital Library System (DLS)
> Command line tool generates SIPs
> Providing access from the DLS (in future)
Demo (45 minutes)
> Selection and Permission Tool (https://wct.bl.uk/selection/)
> Web Curator Tool (https://wct.bl.uk/wct/)
Access
> Currently 3 ways to access the web archive
> Online through the UK Web Archive
> Catalogue records (of special collections)
> Keyword search through Primo (corporate resource discovery system)
> Conduct researcher survey to understand
requirements
> Analytical access
Catalogue Records
Keyword search through Primo
UK Web Archive
> Websites archived by BL and
partners since 2004 (65% by
BL)
> 122,99 websites, 50,866
instances, 13.6TB WARCs
> Over 100,000 unique visits
since 1st April 2012
> Key websites include videos
> Full-text, N-gram, title and
URL search
> Browse by subject / special
collection, visual browsing
http://www.webarchive.org.uk
Analytical Access
> Shift of focus from the level of single webpages or websites
to the entire web archive collection.
> Use web archives as datasets
> Support survey, annotation, contextualisation and
visualisation
> Allows discovery of patterns, trends and relationships in
inter-linked web pages
> Extracting value from the “haystacks”
> Helps address a number of challenging issues
> Scalability
> Accessibility of individual websites
> Components missed by crawlers
Visualising the UK Web
> http://www.webarchive.org.uk/ukwa/visualisation
> N-gram search
> Links analysis
> Format Analysis
> Geo-index
> http://www.webarchive.org.uk/bluebox/
> uses the Memento aggregate TimeGate hosted by lanl.gov
> “resource not in archive” – who else has it?
> Open data
> Dataset and APIs for general use
> Enable broader community to re-use, explore and visualise content of web archive
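The Memento protocol mentioned above works by datetime negotiation: a client asks a TimeGate for a target URL and states the desired moment in an `Accept-Datetime` request header. The Python sketch below only builds such a request (URL and headers) and sends nothing. The TimeGate base URL is a placeholder, and appending the target URL to it is an assumed convention, not a rule of the protocol; `timegate_request` is a hypothetical helper.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder TimeGate base URL (an assumption for illustration).
TIMEGATE = "http://timegate.example.org/timegate/"


def timegate_request(target_url, when):
    """Return (url, headers) for a Memento datetime-negotiation request."""
    if when.tzinfo is None:
        raise ValueError("'when' must be timezone-aware")
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + target_url, {"Accept-Datetime": http_date}


url, headers = timegate_request(
    "http://www.archive.org/images/logoc.jpg",
    datetime(2006, 9, 19, 17, 20, 24, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

An aggregating TimeGate can answer the "who else has it?" question by redirecting to whichever archive holds the capture closest to the requested datetime.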
Web Archiving Infrastructure
Non-print Legal Deposit: Time of change
> Expected to be in place in April 2013
> Access restricted to premises of Legal Deposit Libraries
> Library-wide Legal Deposit Programme to develop capability
and end-to-end process
> Web Archiving Team acts as “technical supplier” for a
number of projects
> Still need to work out how current (permission-based)
selective archiving relates to domain crawl under Legal
Deposit
> Will we request permissions for online access?
> Will we stop crawling some of the sites we are crawling now and include them in
the annual / bi-annual broad domain crawl?
> Who does what?
Web Archiving Strategy
Domain harvesting:
• Broad sweep of .uk domain
• Once or twice a year
Events & key sites:
• Events of national interest
• Sites need to be captured frequently
Special Collection:
• Focused, thematic collections
• Support priority subjects
Web Archiving Workshop
Leïla Medjkoune, Internet Memory
IIPC workshop, BNF, Paris, November 2012
Internet Memory
Internet Memory Foundation (European Archive)
• Established in 2004 in Amsterdam and then Paris
• Mission: Preserve Web content by building a shared WA platform
• Actions: Dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: Operate large scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
Internet Memory
Infrastructure
Green datacenters
Repository and data access for large-scale data management:
• HDFS (Hadoop File System): Distributed, fault-tolerant file system
• HBase: A distributed key-value index
– Convenient model for temporal archives
• MapReduce: A distributed execution framework
– Reliable mechanism to run an analysis job on very large datasets
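The MapReduce model named above can be sketched in a few lines of single-process Python: a map step emits key/value pairs per record, the pairs are grouped by key, and a reduce step combines each group. This illustrates the programming model only, not Hadoop's API. The record layout (URL, 14-digit capture timestamp) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_capture(record):
    """Map step: emit (capture year, 1) for one archived capture."""
    url, timestamp = record
    yield timestamp[:4], 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Group mapped pairs by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_capture(record):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


# Made-up capture index entries, for illustration only.
captures = [
    ("http://www.archive.org/images/logoc.jpg", "20060919172024"),
    ("http://www.archive.org/", "20060919172014"),
    ("http://www.archive.org/images/logoc.jpg", "20070306004335"),
]

captures_per_year = run_job(captures)
print(captures_per_year)
```

Because the map and reduce steps are independent per record and per key, a framework can spread them over many machines, which is what makes the model suit very large archives.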
  
	
  
Internet Memory
Focused crawling:
• Automated crawls
• Quality focused crawls:
– Video capture, Twitter crawls
– Execution tools to overcome crawling issues on specific content
Large scale crawling
• In-house developed distributed software
• Scalable crawler (10-50 Bn pages)
• Also designed for focused crawl and complex scoping
Research projects and focus
Web Archiving and Preservation
✓ Living Web Archives (2007-2010)
✓ Archives to Community MEMories (2010-2013)
✓ SCAlable Preservation Environment (2010-2013)
Webscale data Archiving and Extraction
✓ Living Knowledge (2009-2012)
✓ Longitudinal Analytics of Web Archive data (2010-2013)
✓ TrendMiner (2011-2014)
✓ DOPA (2012-2014)
✓ AnnoMarket (2012-2014)
Web Archiving project?
Organisational challenges:
• Selection/QA: Librarian / Archivist, Quality assurance team, Project manager
• Content capture/services development: Engineers, developers, technicians
• Infrastructure deployment and maintenance: Engineers, System administrators
➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure
IM Shared platform
Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects:
• To develop methods and tools improving web archiving quality
• To grow its expertise and technological taskforce
Archivethe.Net (1)
• To pool knowledge and skills between institutions
• To share internal developments with partner institutions
• To cut services and R&D costs
Archivethe.Net (2)
• Archivethe.net is a shared web archiving platform associated with a service.
• The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency
• For whom? Our current partners, our new partners and … for ourselves
Benefits?
• Integrated web archiving process: from selection to access
• Ongoing technological developments through specific or common R&D projects
• Dedicated and highly skilled team to follow partners’ projects
• Dedicated infrastructure
How does it work? (1)
• ATN is designed as a SaaS (Software as a Service)
• The platform offers a friendly user interface to record partners’ web archiving orders
• A pipeline organizes and manages the production
• A QA team ensures the quality of the archive to meet partners’ requirements
How does it work? (2)
Demo
ARCOMEM Archivist tool?
Set and follow web archive campaigns
• V1: A crawler cockpit and a search and retrieval application
Intelligent content acquisition:
• Seed URLs
• Keywords
• Social web sites APIs
• Social Media Categories (SMC)
SARA
Search and retrieval interface:
• Advanced search functionalities
• Filtering via faceting
• Sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.
Crawler Cockpit Interface
• Create/select a campaign
• Describe campaign (title, description, comments, etc.)
• Define scope: select criteria such as language, keyword, URL, organisation, etc.
• Select social media categories and APIs to explore
• Set precedence rules for some content type or source (images, videos, tweets, news, etc.)
Crawler cockpit interface
Demo
ARCOMEM Archivist Tool V2
• Refinement mode: Refine crawl parameters to improve crawls
• Improve access application (SARA): Preview function so that the users can review the results of the campaign set up
QA for Web Archives?
IM QA is based on:
• Tools internally developed
• Tools developed in the context of European projects
• Automated processes
• Knowledge and skills of our crawl engineer and QA teams
QA Methodology and tools?
Methodology
• Based upon crawler behaviour
• Based on institutions’ needs and policy
• Can be manual (visual) or “automated”
• Can be made at pre or post crawl time
Tools
• Open source tools such as plugins, proxies, etc.
• Internally developed tools (fetchers, automated checks, etc.)
• Bug trackers to record information and communicate with partner institutions
QA Methodology and tools?
SCAPE: Scalable Preservation Environments
• Automate visual QA to detect rendering issues:
• Improve archives quality and cut QA costs
• Feed “preservation watch and planning” tools
• First test made on over 400 pairs of URLs
• In-house “Execution platform” under deployment
• Results and processes to be disseminated to IIPC members for feedback!
Technical challenges
Capture
• Dynamically generated content, deep web, etc.
• Non-HTTP protocols (e.g. RTMP)
• Social media platforms, ...
Access
• Replicate live functionalities and look & feel
• Provide access to very large files
➥ Fast evolving technologies
➥ Ephemeral content
➥ Multiplication of production means:
➥ Increase of user generated content
Technical Solutions
• Execution based crawling (vs parsing)
• API crawling
• Application aware crawling
• Bespoke fetchers
➥ Orchestration of tools
ARCOMEM content acquisition
Technical Solutions
Access tool:
• Player replacement: reproduce players’ functionalities
• Adapt access solution to type of content/platforms (generic solutions)
Storage infrastructure / format:
• Enable access to large files
• Fast access to large amount of content to facilitate search & retrieval
Use cases
• Social media capture and access:
• YouTube
• Twitter
• Flickr, etc.
• Web Archiving related services:
• Redirection service
• Memento
• Legal issues with captured content
• Full text search
• etc.
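
For the Memento item in the list above, the core mechanism (as defined in RFC 7089) is datetime negotiation: a client asks a TimeGate for the capture closest to a given moment by sending an `Accept-Datetime` request header in HTTP date format. The sketch below only builds that header; the helper name is an assumption for illustration, and no request is sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when):
    """Build the Accept-Datetime header for a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Memento uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Ask for the capture closest to the WARC-Date used in the example records.
wanted = datetime(2006, 9, 19, 17, 20, 24, tzinfo=timezone.utc)
print(accept_datetime_header(wanted))
# {'Accept-Datetime': 'Tue, 19 Sep 2006 17:20:24 GMT'}
```

The resulting dictionary can be passed as request headers to whichever HTTP client is used to query a TimeGate.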
  
	
  
	
  


W/ARC file records

  • 6. W/ARC file > W/ARC record
      Header. Ex: record ID, capture date, record type, …
      Block. Ex: HTTP response, jpeg file…
  • 12. WARC/1.0 WARC-Type: warcinfo WARC-Date: 2006-09-19T17:20:14Z WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39> Content-Type: application/warc-fields Content-Length: 381 software: Heritrix 1.12.0 http://crawler.archive.org hostname: crawling017.archive.org ip: 207.241.227.234 isPartOf: testcrawl-20050708 description: testcrawl with WARC output operator: IA_Admin http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org) format: WARC file version 0.17 conformsTo: http://www.archive.org/documents/WarcFileFormat-0.17.html
  • 13. WARC/1.0 WARC-Type: request WARC-Target-URI: http://www.archive.org/images/logoc.jpg WARC-Date: 2006-09-19T17:20:24Z Content-Length: 236 WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594> Content-Type: application/http;msgtype=request WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224- f54c6ec90bb0> GET /images/logoc.jpg HTTP/1.0 User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0) From: stack@example.org Connection: close Referer: http://www.archive.org/ Host: www.archive.org Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824
  • 14. WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
      WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      WARC-IP-Address: 207.241.233.58
      WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: application/http;msgtype=response
      WARC-Identified-Payload-Type: image/jpeg
      Content-Length: 1902
      HTTP/1.1 200 OK
      Date: Tue, 19 Sep 2006 17:18:40 GMT
      Server: Apache/2.0.54 (Ubuntu)
      Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
      ETag: "3e45-67e-2ed02ec0"
      Accept-Ranges: bytes
      Content-Length: 1662
      Connection: close
      Content-Type: image/jpeg
      [image/jpeg binary data here]
  • 15. WARC/1.0
      WARC-Type: resource
      WARC-Target-URI: file://var/www/htdoc/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: image/jpeg
      WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      Content-Length: 1662
      [image/jpeg binary data here]
  • 16. WARC/1.0
      WARC-Type: metadata
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>
      WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: application/warc-fields
      WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
      Content-Length: 59
      via: http://www.archive.org/
      hopsFromSeed: E
      fetchTimeMs: 565
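The records on the preceding slides share one layout: a version line, named header fields, a blank line, then a block of exactly Content-Length bytes. A minimal Python sketch of reading that layout follows. It assumes an uncompressed record with CRLF line endings, no folded header lines, and no repeated header fields. `parse_warc_record` is an illustrative helper, not part of any WARC library, and the sample's Content-Length is recomputed from the block rather than copied from the slide.

```python
def parse_warc_record(data: bytes):
    """Split one uncompressed WARC record into (version, headers, block)."""
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after the WARC header")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only, so URI values keep their own colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # The block is exactly Content-Length bytes; what follows is the record trailer.
    length = int(headers["content-length"])
    return version, headers, rest[:length]


# Sample modelled on the metadata record (slide 16); the length is computed here.
sample_block = b"via: http://www.archive.org/\r\nhopsFromSeed: E\r\nfetchTimeMs: 565\r\n"
sample_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"WARC-Target-URI: http://www.archive.org/images/logoc.jpg\r\n"
    b"WARC-Date: 2006-09-19T17:20:24Z\r\n"
    b"Content-Type: application/warc-fields\r\n"
    b"Content-Length: " + str(len(sample_block)).encode("ascii") + b"\r\n"
    b"\r\n" + sample_block + b"\r\n\r\n"
)

version, headers, block = parse_warc_record(sample_record)
print(version, headers["warc-type"], len(block))
```

Header names are lower-cased on the way in because WARC field names are case-insensitive. A real reader would also have to handle per-record gzip compression, which this sketch leaves out.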
  • 17. WARC/1.0
      WARC-Type: revisit
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2007-03-06T00:43:35Z
      WARC-Profile: http://netpreserve.org/warc/0.17/server-not-modified
      WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
      WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: message/http
      Content-Length: 226
      HTTP/1.x 304 Not Modified
      Date: Tue, 06 Mar 2007 00:43:35 GMT
      Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
      Connection: Keep-Alive
      Keep-Alive: timeout=15, max=100
      Etag: "3e45-67e-2ed02ec0"
  • 18. WARC/1.0
      WARC-Type: conversion
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2016-09-19T19:00:40Z
      WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>
      WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
      Content-Type: image/neoimg
      Content-Length: 934
      [image/neoimg binary data here]
  • 19. WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2
      WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      WARC-IP-Address: 207.241.233.58
      WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
      WARC-Segment-Number: 1
      Content-Type: application/http;msgtype=response
      Content-Length: 1600
      HTTP/1.1 200 OK
      Date: Tue, 19 Sep 2006 17:18:40 GMT
      Server: Apache/2.0.54 (Ubuntu)
      Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
      ETag: "3e45-67e-2ed02ec0"
      Accept-Ranges: bytes
      Content-Length: 1662
      Connection: close
      Content-Type: image/jpeg
      [first 1360 bytes of image/jpeg binary data here]
  • 20. WARC/1.0
      WARC-Type: continuation
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
      WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>
      WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
      WARC-Segment-Number: 2
      WARC-Segment-Total-Length: 1902
      WARC-Identified-Payload-Type: image/jpeg
      Content-Length: 302
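Slides 19 and 20 split one response across two records: segment 1 carries 1600 bytes, the continuation carries 302, and WARC-Segment-Total-Length gives their sum, 1902, which is also the Content-Length of the unsegmented response on slide 14. The sketch below shows how a reader might stitch such segments back together. It assumes the records have already been parsed into dictionaries with a `block` entry; `reassemble` and the zero-filled sample blocks are illustrative, not taken from any WARC tool.

```python
def reassemble(records):
    """Concatenate the blocks of a segmented record after basic consistency checks."""
    ordered = sorted(records, key=lambda r: int(r["WARC-Segment-Number"]))
    first, rest = ordered[0], ordered[1:]
    if int(first["WARC-Segment-Number"]) != 1:
        raise ValueError("segment 1 is missing")
    if not rest:
        raise ValueError("need at least one continuation record")
    for expected, rec in enumerate(rest, start=2):
        if int(rec["WARC-Segment-Number"]) != expected:
            raise ValueError(f"segment {expected} is missing")
        # Every continuation must point back at the first segment's record ID.
        if rec["WARC-Segment-Origin-ID"] != first["WARC-Record-ID"]:
            raise ValueError("continuation does not refer to the first segment")
    payload = b"".join(r["block"] for r in ordered)
    # The last segment announces the length of all blocks put together.
    total = int(ordered[-1]["WARC-Segment-Total-Length"])
    if len(payload) != total:
        raise ValueError(f"expected {total} bytes, got {len(payload)}")
    return payload


origin_id = "<urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>"
segment_1 = {
    "WARC-Record-ID": origin_id,
    "WARC-Segment-Number": "1",
    "block": b"\x00" * 1600,  # placeholder bytes with the slide's Content-Length
}
segment_2 = {
    "WARC-Record-ID": "<urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>",
    "WARC-Segment-Origin-ID": origin_id,
    "WARC-Segment-Number": "2",
    "WARC-Segment-Total-Length": "1902",
    "block": b"\x00" * 302,
}

whole = reassemble([segment_2, segment_1])  # input order does not matter
print(len(whole))
```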
  • 21.
  • 22.
  • 23.
  • 24.
  • 25. Digitization as a means of preservation and dissemination
  • 26. Then: black & white, 300 dpi, TIFF G4, 1 page ~ 200 KB. Now: color (24 bits), 400 dpi, uncompressed TIFF, 1 page ~ 80 MB. More than x500!
  • 27.
  • 28. SPAR - Infrastructure; SPAR - Realization. Ingest, Storage Abstraction Service (SAS), Administration, Data management, Storage, Access, Preservation planning. Production applications; Dissemination applications. Preservation digitization, …, wayback, Web archiving, …, Audiovisual
  • 29.
  • 31. [Diagram] Pre-Ingest (x3: Preservation digitization, Web archives, and so on); SIP; Ingest; Storage abstraction service; Storage; Preservation planning; Administration; Data management; Access; DIP; mets, rdf; Infrastructure
  • 32.
  • 33.
  • 34. [Timeline 2004-2013] Operations: May 2010
  • 35.
  • 36. Main site: Servers, Primary storage, Secondary storage, Lookup storage, Online storage. Backup site: Backup servers, Backup storage, Backup secondary storage, Backup lookup storage.
  • 37. Oracle StorageTek SL8500: up to 64 tape drives; up to 8500 tapes; up to 8 hand pickers; up to 32 linked libraries. Primary storage: 2 libraries, 16 PB maximum. Backup storage: 2 libraries, 16 PB maximum.
  • 38.
  • 39. Primary storage: LTO5, capacity 1.5 TB, transfer rate 140 MB/s (previously: 9840C, 40 GB). Backup storage: T10000B, capacity 1 TB, transfer rate 120 MB/s (previously: T10000A, 500 GB).
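The capacity and transfer-rate figures on this slide give a rough feel for how long one cartridge takes to write. A small worked calculation in Python, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a drive that streams at its full rate with no mount, seek or verification time; `hours_to_fill` is a name introduced here for illustration.

```python
TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte


def hours_to_fill(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to stream a full cartridge at a constant transfer rate, in hours."""
    return capacity_bytes / rate_bytes_per_s / 3600


lto5_hours = hours_to_fill(1.5 * TB, 140 * MB)    # primary storage drive
t10000b_hours = hours_to_fill(1 * TB, 120 * MB)   # backup storage drive

print(f"LTO5: {lto5_hours:.2f} h, T10000B: {t10000b_hours:.2f} h")
```

Under those assumptions a full LTO5 cartridge takes about 3 hours to write and a T10000B cartridge about 2.3 hours.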
  • 40.
  • 41.
  • 42. [Diagram] Pre-Ingest; SIP; Ingest; Storage abstraction service; Storage; Preservation; Administration; Data management; Access; AIP; DIP; mets, rdf. Questions: Which formats are allowed? How many copies are needed, and on what kind of media? What is the maximum size of a package? Do we need to log each access?
  • 44.
  • 46.
  • 47. For this purpose, PDF/X was chosen as a good compromise between fidelity to the original, wide usage and standardization
  • 48.
  • 49. Mets.xml: manifest; T000001.tiff: sample; format.xml: machine-readable description; format.txt: human-readable description
  • 51.
  • 53. [Diagram: ingest workflow, activities ACT_01 to ACT_09] Ingest request reception; Manifest validation; Package search within SPAR; SIP characteristics audit; SIP files audit and characterization; ARK identifier generation; SET processing; Ingest completion; SIP reception; Audit
  • 54.
  • 55.
  • 56. Structural metadata: METS. Descriptive and source metadata: qualified Dublin Core. Provenance metadata: PREMIS. Technical metadata: depends on the data-objects.
  • 57.
  • 58. 58 1996-2005 2002 & 2004 2004-2008 2006-2010 2010-now 70 Tb 0.5 Tb 45 Tb 22 Tb operator robot + 150 Tb
  • 60.
  • 66.
  • 67. 1996-2005 2002 & 2004 2004-2008 2006-2010 70 Tb 0.5 Tb 45 Tb 22 Tb unknown 2010-now +Alexa bot 67 150 Tb
  • 68. ARC data ARC metadata HTML HTML HTML HTML harvested files ARC data ARC data ARC data ARC data + harvest 1 harvest 2 + harvest 3 + … … This is a collection containing French election websites Here are the files we harvested They are included in web archives specific files This was done with these tools
  • 69. A three-layered model in SPAR: Harvest Definition (curator collection) > Harvest Instance (“technical” harvest = job) > ARC file (data or metadata)
  • 70. filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77
      1 0 InternetArchive
      URL IP-address Archive-date Content-type Archive-length
      metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095814 text/xml 366
      <?xml version="1.0" encoding="UTF-8"?>
      <harvestInfo>
        <version>0.2</version>
        <jobId>32</jobId>
        <priority>LOWPRIORITY</priority>
        <harvestNum>0</harvestNum>
        <origHarvestDefinitionID>1</origHarvestDefinitionID>
        <maxBytesPerDomain>-1</maxBytesPerDomain>
        <maxObjectsPerDomain>1000</maxObjectsPerDomain>
        <orderXMLName>default</orderXMLName>
      </harvestInfo>
      metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095815 text/xml 44775
      <?xml version="1.0" encoding="UTF-8"?>
      <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
      …
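Each ARC record on this slide starts with a one-line header whose fields are named in the file description: URL, IP-address, Archive-date, Content-type, Archive-length. The sketch below parses such a line in Python. It assumes exactly five space-separated fields and a 14-digit date, which fits the lines shown but not every ARC file found in the wild; `ArcHeader` and `parse_arc_header` are illustrative names.

```python
from datetime import datetime
from typing import NamedTuple


class ArcHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    archive_length: int  # number of bytes in the record body that follows


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one ARC version-1 record header line (five space-separated fields)."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date, content_type, length = fields
    return ArcHeader(
        url,
        ip_address,
        datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type,
        int(length),
    )


filedesc = parse_arc_header(
    "filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77"
)
print(filedesc.archive_date.isoformat(), filedesc.archive_length)
```

A reader would then consume `archive_length` bytes as the record body before looking for the next header line.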
  • 71. 1996-2005 2002 & 2004 2004-2008 2006-2010 70 Tb 0.5 Tb 45 Tb 22 Tb unknown 2010-now +Alexa bot 71 150 Tb
  • 72. Two layers: - Collection - ARC files 1996-2005 2010-now Three layers: - Harvest Definition - Harvest instance - ARC files Two layers: - Collection - ARC files
  • 73. 2006-2010: four layers (Collection, Harvest division, Harvest instance, ARC files). 2010-now: three layers (Harvest Definition, Harvest instance, ARC files).
  • 74.
  • 75. 03/07/1882 28/02/1883 01/03/1883 set group object file 02/07/1882 Year 1883 Le Matin Year 1882 01/07/1882 03/07/1882 28/02/1883 01/03/188302/07/1882 Le Matin 01/07/1882
  • 76. [Diagram: set Le Matin; groups; AIPs for 01/07/1882, 02/07/1882, 03/07/1882, 28/02/1883, 01/03/1883] Set: contains nothing but metadata; curator information that groups AIPs sharing the same intellectual content. AIP: must contain files to be preserved; each AIP is an autonomous unit.
  • 77. <mets>: <dmdSec> Intellectual metadata; <amdSec> Administrative metadata (<sourceMD> Metadata about the source used to produce this content, <techMD> Technical metadata, <digiprovMD> Provenance metadata); <fileSec> List of the files; <structMap> Structure of the package
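The METS sections listed on this slide can be laid out as an empty skeleton. The Python sketch below builds one with the standard library's ElementTree. It is a structural illustration only: the IDs, the single file group and the choice of `T000001.tiff` as the listed file (borrowed from slide 49) are assumptions, the sections carry no actual metadata, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_skeleton(file_names):
    """Return a <mets> element with empty sections and one <file> per name."""
    root = ET.Element(q("mets"))
    ET.SubElement(root, q("dmdSec"), ID="DMD1")        # intellectual metadata
    amd = ET.SubElement(root, q("amdSec"), ID="AMD1")  # administrative metadata
    for tag, section_id in (("techMD", "TECH1"), ("sourceMD", "SRC1"), ("digiprovMD", "PROV1")):
        ET.SubElement(amd, q(tag), ID=section_id)
    file_grp = ET.SubElement(ET.SubElement(root, q("fileSec")), q("fileGrp"))
    div = ET.SubElement(ET.SubElement(root, q("structMap")), q("div"))
    for number, name in enumerate(file_names, start=1):
        file_id = f"FILE{number}"
        file_el = ET.SubElement(file_grp, q("file"), ID=file_id)
        ET.SubElement(file_el, q("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name})
        ET.SubElement(div, q("fptr"), FILEID=file_id)  # structMap points back at the file
    return root


mets = build_mets_skeleton(["T000001.tiff"])
print(ET.tostring(mets, encoding="unicode"))
```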
  • 78.
  • 79.
  • 80.
  • 81. harvestInstance has harvest instance is documented in Outcome extensions persons: admins software organizations Harvest event
  • 82. ARC data ARC metadata HTML HTML HTML HTML ARC data ARC data ARC data ARC data + harvest 1 harvest 2 + harvest 3 + … … This is a collection containing French election websites HTML HTML HTML HTML …
  • 83. [Diagram: a set containing groups of AIPs, each AIP holding ARC data or ARC metadata files] This is a collection containing French election websites
  • 84.
  • 88. <mets>: <dmdSec> Intellectual metadata; <amdSec> Administrative metadata (<sourceMD> Metadata about the source used to produce this content, <techMD> Technical metadata, <digiprovMD> Provenance metadata); <fileSec> List of the files; <structMap> Structure of the package
  • 91.
  • 92.
  • 93.
  • 94. Web archiving at the British Library Helen Hockx-Yu Head of Web Archiving
  • 95. Overview > Part 1: Background, history and organisation > Part 2: Web Archiving Tools (including demos) > Part 3: Access > Part 4: Non-print Legal Deposit and future strategy 29th November 2012 Session 7 -Web archiving at the British Library 2
  • 96. BL Structure > BL Board and Executive Team > e-Strategy and Information Systems (eIS) > IT-based products and services > Finance and Corporate Services (F&CS) > Money > Human Resources > People > Operations & Services (O&S) > Front line services > Scholarship and Collections (S&C) > Content (Arts and humanities, Social Sciences, Science, Technology & Medicine) > Strategic Marketing and Communications (SMC) > Brand and reputation
  • 97. Web archiving timeline
  • 98. Current web archiving strategy > Selective archiving of websites (4 areas) that > reflect the diversity of lives, interests and activities throughout the UK > contain research value or are of research interest > feature political, cultural, social and economic events of national interest > demonstrate innovative use of the web > Also prioritise websites at risk and web-only content > Permission based > Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance > 30% success rate, 5% explicit refusal (mostly due to 3rd party rights) > Online access through UK Web Archive > Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit
  • 99. The current Web Archiving team. Skills Profile > IT > Collection management, digital curation > Management > Communications > Web Archiving
  • 100. (Internal Collaboration) > The Web Archiving Team is involved in the end-to-end process but works with other departments / teams in the library. Department / Team: Activity / Support. S&C (Subject specialist group, Curator’s Choice project): selection, curation. eIS: network, hardware and IT support. O&S Resource Discovery & Research: corporate level resource discovery (http://explore.bl.uk/). CA&D Digital Processing: cataloguing (special collection level). SMC: publicity, press release, events. The Legal Deposit Programme: domain crawl capability / process and policy.
  • 101. Curator’s Choice > Pilot project with a small group of dedicated curators / subject specialists > Special Collections of curator’s choice. Curators take responsibility for owning, maintaining and growing the collections over time > Evolving Role of Libraries in the UK > Political Action and Communication > Slavery and Abolition in the Caribbean > UK relations with the Low Countries > 19th Century English Literature > Oral History in the UK > Film in the UK > Energy
  • 102. Web Archiving Advisory Group > Provide advice and support to the Web Archiving Team > Act as a ‘critical friend’ to assist in the development of policy and practice. > Specific advice and support on: > Purpose, vision and benefits. > Strategic direction and planning. > Synergy with internal teams and collaboration with external stakeholders/partners. > Policy changes and risk management
  • 103. (External) Collaboration > UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections > UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs) > External Collaborators > Wellcome Library > Live Art Development Agency > The Cambridge Innovation Network > The Women’s Library > Institute of Historical Research, University of London > Individual researchers, specialists > General public – ca. 20 nominations / week > National organisations: DPC, JISC > International: IIPC
  • 104. JISC UK Web Domain Dataset (1996-2010) > Collaboration with JISC and the Internet Archive > UK Web Domain Dataset (1996-2010) – UK websites extracted from the Internet Archive's collection and supported by funding from the JISC > 35TB research dataset > No local access to individual websites but access to secondary dataset allowed > BL has developed visualisations of the dataset > JISC funded 2 further projects using this dataset > Analytical Access to the Domain Dark Archive > Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
  • 105. Web Archiving Tools > Support key processes: selection, harvesting, storage, access, preservation > Mostly open source tools, some developed in-house > New tools / changes to current tools expected when business processes change due to non-print Legal Deposit
  • 106. Selection Tools > Selection: decide what websites to archive and to include as part of a web archive collection > Selection and Permission Tool: https://wct.bl.uk/selection/ > Submit selection – real time checking of duplicates, fetching meta tags from live sites > Collect metadata > Add contact details > Suggest crawl frequency > Permissions management – send emails, direct users to online licence form, store the completed forms, pass details to WCT (create authorisation record and a pending target) > Reports > Twittervane
  • 107. Harvesting Tools > Harvesting: automated downloading of selected websites using crawler software; quality assurance regarded as an element > The Web Curator Tool (WCT): https://wct.bl.uk/wct/ > Job scheduling > Metadata > Access control > Harvesting (uses Heritrix) > QA
  • 108. Quality Assurance > Placing more emphasis on intellectual content than appearance or behaviour of a website > Use four aspects to define quality: > Completeness of capture: whether the intended content has been captured as part of the harvest. > Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool. > Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively. > Appearance: look and feel of a website. > Rely on visual comparison, previous harvests & crawl logs > Recent development of QA module to allow bulk operation, reduce the number of clicks and make QA recommendations
  • 109. Supporting Long-term Preservation > Storing data in WARCs and metadata in METS > Migrate all legacy data into WARCs > WCT outputs WARC files > Submission Information Package (SIP) profiles for selective and domain crawls > Storing descriptive metadata (e.g. permission information) & technical metadata (e.g. crawl log, crawl configurations, virus scan events) > Ingest archived websites in the Digital Library System (DLS) > Command line tool generates SIPs > Providing access from the DLS (in future)
  • 110. Demo (45 minutes) > Selection and Permission Tool (https://wct.bl.uk/selection/) > Web Curator Tool (https://wct.bl.uk/wct/)
  • 111. Access > Currently 3 ways to access the web archive > Online through the UK Web Archive > Catalogue records (of special collections) > Keyword search through Primo (corporate resource discovery system) > Conduct researcher survey to understand requirements > Analytical access
  • 112. Catalogue Records
  • 113. Keyword search through Primo
  • 114. UK Web Archive > Websites archived by BL and partners since 2004 (65% by BL) > 122,99 websites, 50,866 instances, 13.6 TB of WARCs > Over 100,000 unique visits since 1st April 2012 > Key websites include videos > Full-text, N-gram, title and URL search > Browse by subject / special collection, visual browsing http://www.webarchive.org.uk
  • 115. Analytical Access > Shift of focus from the level of single webpages or websites to the entire web archive collection. > Use web archives as datasets > Support survey, annotation, contextualisation and visualisation > Allows discovery of patterns, trends and relationships in inter-linked web pages > Extracting value from the “haystacks” > Helps address a number of challenging issues > Scalability > Accessibility of individual websites > Components missed by crawlers
  • 116. Visualising the UK Web > http://www.webarchive.org.uk/ukwa/visualisation > N-gram search > Links analysis > Format Analysis > Geo-index > http://www.webarchive.org.uk/bluebox/ > uses the Memento aggregate TimeGate hosted by lanl.gov > “resource not in archive” – who else has it? > Open data > Dataset and APIs for general use > Enable broader community to re-use, explore and visualise content of web archive
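The “who else has it?” lookup relies on the Memento protocol (RFC 7089): a client asks a TimeGate for an original URL and states the moment it is interested in through an `Accept-Datetime` request header, and the TimeGate redirects to the closest archived copy. The Python sketch below only builds such a request and does not send it. The TimeGate base URL is a hypothetical placeholder, not the address of the lanl.gov aggregator, and appending the target URL to the base is an assumed URL pattern; `timegate_request` is an illustrative helper.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate base URL, for illustration only.
TIMEGATE_BASE = "http://timegate.example.org/timegate/"


def timegate_request(target_uri: str, when: datetime) -> Request:
    """Build (but do not send) a Memento TimeGate request for target_uri near `when`."""
    if when.tzinfo is None:
        raise ValueError("`when` must be timezone-aware")
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE_BASE + target_uri, headers={"Accept-Datetime": http_date})


request = timegate_request(
    "http://www.archive.org/images/logoc.jpg",
    datetime(2006, 9, 19, 17, 20, 24, tzinfo=timezone.utc),
)
print(request.full_url)
print(request.header_items())
```

Sending the request, for example with `urllib.request.urlopen`, and following the redirect would return the archived copy along with a `Memento-Datetime` response header stating when it was captured.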
  • 117. Web Archiving Infrastructure
  • 118. Non-print Legal Deposit: Time of change > Expected to be in place in April 2013 > Access restricted to premises of Legal Deposit Libraries > Library-wide Legal Deposit Programme to develop capability and end-to-end process > Web Archiving Team acts as “technical supplier” for a number of projects > Still need to work out how current (permission-based) selective archiving relates to domain crawl under Legal Deposit > Will we request permissions for online access? > Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl? > Who does what?
  • 119. Web Archiving Strategy [Diagram: domain crawl, events, key sites, special collections]. Domain harvesting: • Broad sweep of .uk domain • Once or twice a year. Events & key sites: • Events of national interest • Sites need to be captured frequently. Special Collection: • Focused, thematic collections • Support priority subjects
  • 120. Web Archiving Workshop. Leïla Medjkoune, Internet Memory. IIPC workshop, BNF, Paris, November 2012
  • 121. Internet Memory. Internet Memory Foundation (European Archive) • Established in 2004 in Amsterdam and then Paris • Mission: Preserve Web content by building a shared WA platform • Actions: Dissemination, R&D and partnerships with research groups and cultural institutions • Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland. Internet Memory Research • Spin-off of IM established in June 2011 in Paris • Missions: Operate large scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
  • 122. Internet Memory. Infrastructure: Green datacenters. Repository and data access for large-scale data management: • HDFS (Hadoop Distributed File System): Distributed, fault-tolerant file system • HBase: A distributed key-value index • Convenient model for temporal archives • MapReduce: A distributed execution framework • Reliable mechanism to run an analysis job on very large datasets
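To make the MapReduce idea concrete, here is a single-process Python sketch of the three phases (map, shuffle, reduce) applied to a toy analysis job: counting captures per content type. It is not Hadoop code and says nothing about how Internet Memory's jobs are actually written. The capture tuples, `map_capture`, `shuffle`, `reduce_counts` and `run_job` are all made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_capture(capture):
    """Map phase: emit one (content type, 1) pair per archived capture."""
    _url, _timestamp, content_type = capture
    yield content_type, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(captures):
    mapped = chain.from_iterable(map_capture(c) for c in captures)
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


# Toy input: (URL, capture timestamp, content type)
captures = [
    ("http://www.archive.org/", "20060919172014", "text/html"),
    ("http://www.archive.org/images/logoc.jpg", "20060919172024", "image/jpeg"),
    ("http://www.archive.org/", "20070306004335", "text/html"),
]
counts = run_job(captures)
print(counts)
```

In a distributed framework the map and reduce functions keep this shape, while the framework spreads the input, the shuffle and the retries for failed tasks across many machines.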
  • 123. Internet Memory. Focused crawling: • Automated crawls • Quality focused crawls: – Video capture, Twitter crawls – Execution tools to overcome crawling issues on specific content. Large scale crawling • In-house developed distributed software • Scalable crawler (10-50 Bn pages) • Also designed for focused crawl and complex scoping
  • 124. Research projects and focus. Web Archiving and Preservation ✓ Living Web Archives (2007-2010) ✓ Archives to Community MEMories (2010-2013) ✓ SCAlable Preservation Environment (2010-2013). Webscale data Archiving and Extraction ✓ Living Knowledge (2009-2012) ✓ Longitudinal Analytics of Web Archive data (2010-2013) ✓ TrendMiner (2011-2014) ✓ DOPA (2012-2014) ✓ AnnoMarket (2012-2014)
  • 125. Web Archiving project? Organisational challenges: • Selection/QA: Librarian / Archivist, Quality assurance team, Project manager • Content capture/services development: Engineers, developers, technicians • Infrastructure deployment and maintenance: Engineers, System administrators ➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure
  • 126. IM Shared platform. Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects: • To develop methods and tools improving web archiving quality • To grow its expertise and technological taskforce
  • 127. Archivethe.Net (1) • To mutualize knowledge and skills between institutions • To share internal developments with partner institutions • To cut services and R&D costs
  • 128. Archivethe.Net (2) • Archivethe.net is a shared web archiving platform associated with a service. • The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency • For whom? Our current partners, our new partners and … ourselves
  • 129. Benefits? • Integrated web archiving process: from selection to access • Ongoing technological developments through specific or common R&D projects • Dedicated and highly skilled team to follow partners’ projects • Dedicated infrastructure
  • 130. How does it work? (1) • ATN is designed as SaaS (Software as a Service) • The platform offers a friendly user interface to record partners’ web archiving orders • A pipeline organizes and manages the production • A QA team ensures the quality of the archive to meet partners’ requirements
  • 131. How does it work? (2) Demo
  • 132. ARCOMEM Archivist tool? Set and follow web archive campaigns • V1: A crawler cockpit and a search and retrieval application. Intelligent content acquisition: • Seed URLs • Keywords • Social web sites APIs • Social Media Categories (SMC)
  • 133. SARA. Search and retrieval interface: • Advanced search functionalities • Filtering via faceting • Sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.
  • 134. Crawler Cockpit Interface • Create/select a campaign • Describe campaign (title, description, comments, etc.) • Define scope: select criteria such as language, keyword, URL, organisation, etc. • Select social media categories and APIs to explore • Set precedence rules for some content type or source (images, videos, tweets, news, etc.)
  • 135. Crawler cockpit interface. Demo
  • 136. ARCOMEM Archivist Tool V2 • Refinement mode: Refine crawl parameters to improve crawls • Improve access application (SARA): Preview function so that the users can review the results of the campaign set up
  • 137. QA for Web Archives? IM QA is based on: • Tools internally developed • Tools developed in the context of European projects • Automated processes • Knowledge and skills of our crawl engineer and QA teams
  • 138. QA Methodology and tools? Methodology • Based upon crawler behaviour • Based on institutions’ needs and policy • Can be manual (visual) or “automated” • Can be made at pre or post crawl time. Tools • Open source tools such as plugins, proxies, etc. • Internally developed tools (fetchers, automated checks, etc.) • Bug trackers to record information and communicate with partner institutions
  • 139. QA Methodology and tools? SCAPE: Scalable Preservation Environments • Automate visual QA to detect rendering issues: • Improve archives quality and cut QA costs • Feed “preservation watch and planning” tools • First test made on over 400 pairs of URLs • In-house “Execution platform” under deployment • Results and processes to be disseminated to IIPC members for feedback!
  • 140. Technical challenges. Capture • Dynamically generated content, deep web, etc. • Non-HTTP protocols (e.g. RTMP) • Social media platforms, ... Access • Replicate live functionalities and look & feel • Provide access to very large files ➥ Fast evolving technologies ➥ Ephemeral content ➥ Multiplication of production means ➥ Increase of user generated content
  • 141. Technical Solutions • Execution based crawling (vs parsing) • API crawling • Application aware crawling • Bespoke fetchers ➥ Orchestration of tools. ARCOMEM content acquisition
  • 142. Technical Solutions. Access tool: • Player replacement: reproduce players’ functionalities • Adapt access solution to type of content/platforms (generic solutions). Storage infrastructure / format: • Enable access to large files • Fast access to large amount of content to facilitate search & retrieval
  • 143. Use cases • Social media capture and access: • YouTube • Twitter • Flickr, etc. • Web Archiving related services: • Redirection service • Memento • Legal issues with captured content • Full text search • etc.