1. Internet Content as
Research Data
Australian National University
August 2012, Canberra
Monica Omodei
2. Research Examples
• Social networking • Political Science
• Lexicography • Media Studies
• Linguistics • Contemporary history
• Network Science
Data-driven science is migrating from the natural sciences to the humanities and social sciences
3. Talk Structure
• Existing web archives
• Web archive use cases
• Bringing archives together
• Creating your own archive
• It's getting harder – challenges
• Web data mining & analysis
4. Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
6. Internet Archive’s Web Archive
Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by
fair use and fast take-down response
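The time-based URL search mentioned above can be sketched as a one-line URL builder. This is an illustrative sketch, assuming the `web.archive.org/web/<timestamp>/<url>` convention (a detail not spelled out on the slide):

```python
def wayback_url(url, when):
    """Build a time-based Wayback Machine lookup URL.

    `when` is a 14-digit YYYYMMDDhhmmss timestamp (or any prefix of one);
    the archive serves the capture closest to that moment.
    """
    return "https://web.archive.org/web/%s/%s" % (when, url)

# Example: the capture of nla.gov.au nearest to August 2012
print(wayback_url("http://nla.gov.au/", "20120801"))
```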
7. Internet Archive's Web Archive
Negatives
– Because of its size, it can't be searched by keyword
– Because of its size, crawling is fully automated – so QA is not possible
11. Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon's S3
• Accessible via MapReduce processing in Amazon's EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy
12. Common Crawl Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API
13. Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with a selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion – pandora.nla.gov.au/registration_form.html
15. Pandora Archive
• Negatives
– Labour intensive, thus quite small
– Significant content missed because permission to copy was refused
• Situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls
16. Pandora Archive Stats
• Size – 6.32 TB
• Number of files > 140 million
• Number of 'titles' > 30.5K
• Number of title instances > 73.5K
21. Which archived sites are popular?
• Measure: filtered, aggregated web access log data which counts accesses to titles
• Examined top 30 archived titles (by number of accesses) for each year 2009 to 2012
• Selected some to examine and speculated as to why they might be popular
• Selected those with consistently high ranking, and ones that were very variable between years
22. Reasons for popularity of the archived version
• Were once popular and are now decommissioned, particularly if the domain name continues to exist and redirects to the archive
• May not be that popular as live sites, but their live site links prominently to Pandora as an archive for their content
• Popular referencing sources cite the archive as well as the live site (if it still exists)
26. Improving visibility and usage of the Pandora archive
• Articles about interesting content on the Australia's Web Archives blog – http://blogs.nla.gov.au/australias-web-archives/
• More effort to identify archived sites that are no longer 'live'
• Market automatic redirect services to web site owners/managers
• Allow Google to index archive content for 'non-live' sites (problematic)
• Install Twittervane – draws site nominations for archiving based on trending Twitter topics
27. .au Domain Annual Snapshots
• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• Robots.txt respected, except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full-text search – suited to searching archives
• Separate .gov crawl publicly accessible soon
28. Australian web domain crawls

Year           2005         2006         2007         2008        2009         2011
Files          185 million  596 million  516 million  1 billion   765 million  660 million
Hosts crawled  811,523      1,046,038    1,247,614    3,038,658   1,074,645    1,346,549
Size (TBs)     6.69         19.04        18.47        34.55       24.29        30.71
29. Internet Memory Foundation
• A number of European partners
• LiWA – Living Web Archives: next-generation web archiving methods and tools
• LAWA – Longitudinal Analytics of Web Archive Data: experimental testbed for large-scale data analytics
• ARCOMEM (Collect-All ARchives to COmmunity MEMories): leveraging social media for intelligent preservation
• SCAPE – Scalable Preservation Environments
31. Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
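To make the standard format concrete, here is a minimal sketch of what a single WARC record looks like on disk, built with the Python standard library only. Real archives are written by tools like Heritrix; the field set here is the bare minimum, and record types and headers should be checked against the WARC (ISO 28500) specification before use:

```python
import uuid
from datetime import datetime, timezone

def make_warc_record(url, payload):
    """Serialize one minimal WARC 'resource' record as bytes."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: %s" % url,
        "Content-Length: %d" % len(payload),
    ]
    # Header block, blank line, payload, then the two CRLFs that
    # terminate every WARC record.
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + payload + b"\r\n\r\n"

record = make_warc_record("http://nla.gov.au/", b"<html>...</html>")
print(record.split(b"\r\n", 1)[0])
```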
32. Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas … and many more
• WebCITE – webcitation.org (citation service archive)
33. Example: Columbia University
• Member of the IIPC
• They use the Archive-It service
• A research library that sees web archiving as fundamental to their collecting
• They complement and coordinate with other web archives
• Their collecting focus is thematic – e.g. human rights, historic preservation, NY religious institutions
• They also archive web content as part of personal and organisational archives (c.f. manuscript collections)
• Archive their own web site regularly
35. Bringing Archives Together
• Common standards and APIs
• Memento project – adding time to the web
– Aggregates CDX files (URL index) from multiple archives
– Has a Firefox plug-in which allows time-based browsing
– Initiative of Los Alamos Laboratories
– See http://www.mementoweb.org/demo/
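Time-based browsing in Memento works by sending an Accept-Datetime header to a TimeGate, which redirects to the capture nearest that time. The sketch below covers only the header construction (the HTTP call itself is left out); the RFC 1123 date format is what the Memento protocol specifies:

```python
from email.utils import format_datetime
from datetime import datetime, timezone

def memento_headers(when):
    """Build the Accept-Datetime header for a Memento TimeGate request.

    `when` must be a timezone-aware UTC datetime; the value is the
    RFC 1123 date string the protocol expects.
    """
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}

h = memento_headers(datetime(2012, 8, 1, tzinfo=timezone.utc))
print(h["Accept-Datetime"])  # Wed, 01 Aug 2012 00:00:00 GMT
```

A client would send this header in a GET to a TimeGate URL and follow the redirect it returns.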
37. Common Use Cases for a web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Fall-back for link rot
• Prior-art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis, network analysis, linguistic analysis
38. Create your own Archive
• Use a subscription service
• Build your own web archiving infrastructure with open-source software (i.e. Heritrix and Wayback)
• Use web citation services that create archive copies as you bookmark pages
39. Subscription Services
• archive-it.org (service operated by the non-profit Internet Archive since 2006)
• archivethe.net (service operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service – cdlib.org/services/uc3/was.html
• OCLC Harvester Service – oclc.org/webharvester/overview/default.htm
41. Install a web archiving system locally
• An easy-to-deploy web archiving toolkit is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
42. Personal Web Archiving
• WARCreate – recently released free tool which creates Wayback-consumable WARC files from any web page
• Google Chrome extension
• Enables preservation by users from their desktop
• Can target content unreachable by crawlers
• Brings WARC to personal digital archiving
• What you do with the WARC files is up to you
• Install suite provided to set up a local Wayback instance and Memento TimeGate
43. Current challenges
• Database-driven features and functions
• Complex and varying URI formats and non-standard link implementations, e.g. Twitter
• Dynamically generated, ever-changing URIs
– for serving the same resources
• Rich media – e.g. streamed media with custom apps and anti-collection measures
• Scripted incremental display and page loading
44. … more …
• Scripted HTML forms
• Multi-sourced embedded material
• Dynamic authentication, e.g. captchas, cross-site authentication, user-sensitive embeds
• Alternate display based on browser, device, or other parameters
• Site architecture designed to inhibit crawling and indexing – but if poorly done, even 'polite' harvesters like Heritrix may crash their server
45. .. but wait, there's more …
• Server-side scripts and remote procedure calls – the full variety of paths through a site is now often hidden in remote/opaque server-side code – not a new problem, but it now affects 80+% of online resources
• HTML5 WebSockets – effectively codifies incremental updates without page reloads
• Mobile publishing
46. Transactional Web Archiving
• Useful for institutional archiving
– Best for record-keeping purposes – e.g. when challenged in court about content on a web site
– Can be used to ensure URL persistence, e.g. when a site has a make-over – can intercept 404s
– No 'gaps', c.f. the crawl approach – every change in accessed content is archived
– However, requires a code snippet to be installed on the web server
– Open-source software being developed by Los Alamos Labs
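The "code snippet installed on the web server" can be pictured as a small piece of middleware that records every response the server actually delivers. The sketch below is a hypothetical WSGI illustration, not the Los Alamos software; a real system would write WARC records to durable storage rather than keep an in-memory list:

```python
from datetime import datetime, timezone

class TransactionalArchiver:
    """WSGI middleware that captures each served response (sketch only)."""

    def __init__(self, app):
        self.app = app
        self.archive = []  # stand-in for durable WARC storage

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"], captured["headers"] = status, headers
            return start_response(status, headers, exc_info)

        # Every change in accessed content is archived as it is served.
        body = b"".join(self.app(environ, recording_start_response))
        self.archive.append({
            "url": environ.get("PATH_INFO", ""),
            "time": datetime.now(timezone.utc).isoformat(),
            "status": captured.get("status"),
            "body": body,
        })
        return [body]

# Usage with a trivial WSGI app:
def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

archiver = TransactionalArchiver(hello_app)
result = archiver({"PATH_INFO": "/index.html"}, lambda s, h, e=None: None)
print(archiver.archive[0]["url"], archiver.archive[0]["status"])
```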
47. Web Data Mining & Analysis – What is it? Why do it?
Innovation is increasingly driven by large-scale data analysis.
Need fast iteration to understand the right questions to ask.
More minds able to contribute = more value (perceived and real) placed on the importance of the data.
Increased demand for/value of the data = more funding to support it.
Need to surface the information amongst all that data…
51. File formats and data: CDX
• Index used to browse a WARC-based archive
• Space-delimited text file
• Only the essential metadata needed by Wayback
– URL
– Content digest
– Capture timestamp
– Content-Type
– HTTP response code
– etc.
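Because CDX is plain space-delimited text, reading it needs no special tooling. This is a minimal sketch; field order varies between CDX flavours, and the 11-field layout and field names below are an assumption to check against the header line of your own CDX files:

```python
# Assumed 11-field CDX layout (a common Wayback flavour):
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metaflags", "length", "offset", "filename",
]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field dict."""
    return dict(zip(CDX_FIELDS, line.split()))

record = parse_cdx_line(
    "au,gov,nla)/ 20120801000000 http://nla.gov.au/ text/html 200 "
    "AAAA - - 2048 1234 crawl.warc.gz"
)
print(record["timestamp"], record["statuscode"])
```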
52. File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work in progress: we want your feedback
53. File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
• Relative sizes for the same data: CDX 53 MB, WAT 443 MB, WARC 8,651 MB
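Since the WAT payload is JSON, pulling metadata out of a record is a few lines of standard-library code. A hedged sketch of extracting the outgoing links from one record's payload follows; the nested key names are an assumption based on the commonly documented WAT envelope layout, so inspect a real record before reusing them:

```python
def extract_links(wat_record):
    """Return the URLs of outgoing links found in one WAT record.

    Assumes the Envelope / Payload-Metadata / HTTP-Response-Metadata /
    HTML-Metadata nesting; returns [] when any level is absent.
    """
    meta = (wat_record.get("Envelope", {})
                      .get("Payload-Metadata", {})
                      .get("HTTP-Response-Metadata", {})
                      .get("HTML-Metadata", {}))
    return [link.get("url") for link in meta.get("Links", [])]

# A tiny hand-built record in the assumed shape:
sample = {
    "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
        "HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "http://example.org/"}]}}}}
}
print(extract_links(sample))
```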
54. Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) – http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
• http://matkelly.com/warcreate/
• Common Crawl: http://commoncrawl.org/data/accessing-the-data/
55. Contacts
• webarchive@nla.gov.au
• secretariat@internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us
• momodei@nla.gov.au (until 31 Aug 2012) or monica.omodei@gmail.com