1. Internet Content as
Research Data
Australian National University
August 2012, Canberra
Monica Omodei
2. Research Examples
• Social networking • Political Science
• Lexicography • Media Studies
• Linguistics • Contemporary history
• Network Science
Data-driven science is migrating from the natural sciences to the humanities and social sciences
3. Talk Structure
• Existing web archives
• Web archive use cases
• Bringing archives together
• Creating your own archive
• It's getting harder – challenges
• Web data mining & analysis
4. Existing web archives
• Internet Archive
• Common Crawl
• Pandora Archive
• Internet Memory Foundation Archive
• Other national archives
• Research and university library archives
6. Internet Archive’s Web Archive
Positives
– Very broad – 175+ billion web instances
– Historic – started 1996
– Publicly accessible
– Time-based URL search
– API access
– Not constrained by legislation – covered by
fair use and fast take-down response
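The time-based URL search mentioned above can be sketched as a one-line URL builder. This is an illustrative sketch, assuming the `web.archive.org/web/<timestamp>/<url>` convention (a detail not spelled out on the slide):

```python
def wayback_url(url, when):
    """Build a time-based Wayback Machine lookup URL.

    `when` is a 14-digit YYYYMMDDhhmmss timestamp (or any prefix of one);
    the archive serves the capture closest to that moment.
    """
    return "https://web.archive.org/web/%s/%s" % (when, url)

# Example: the capture of nla.gov.au nearest to August 2012
print(wayback_url("http://nla.gov.au/", "20120801"))
```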
7. Internet Archive's Web Archive
Negatives
– Because of its size, it can't be searched by keyword
– Because of its size, crawling is fully automated – so QA is not possible
11. Common Crawl
• Non-profit foundation building an open crawl of the web to seed research and innovation
• Currently 5 billion pages
• Stored on Amazon's S3
• Accessible via MapReduce processing in Amazon's EC2 compute cloud
• Makes wholesale extraction, transformation, and analysis of web data cheap and easy
12. Common Crawl Negatives
• Not designed for human browsing but for machine access
• Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
• Some costs are involved for direct extraction of data from S3 storage using the Requester-Pays API
13. Pandora Archive
• Positives
– Quality checked
– Targeted Australian content with a selection policy
– Historical – started 1996
– Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
– Keyword search
– Publicly accessible
– You can nominate Australian web sites for inclusion – pandora.nla.gov.au/registration_form.html
15. Pandora Archive
• Negatives
– Labour intensive, thus quite small
– Significant content missed because permission to copy was refused
• Situation will improve markedly if Legal Deposit provisions are extended to digital publications
• Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls
16. Pandora Archive Stats
• Size – 6.32 TB
• Number of files > 140 million
• Number of 'titles' > 30.5K
• Number of title instances > 73.5K
21. Which archived sites are popular?
• Measure: filtered, aggregated web access log data which counts accesses to titles
• Examined top 30 archived titles (by number of accesses) for each year 2009 to 2012
• Selected some to examine and speculated as to why they might be popular
• Selected those with consistently high ranking, and ones that were very variable between years
22. Reasons for popularity of the archived version
• Were once popular and are now decommissioned, particularly if the domain name continues to exist and redirects to the archive
• May not be that popular as live sites, but their live site links prominently to Pandora as an archive for their content
• Popular referencing sources cite the archive as well as the live site (if it still exists)
26. Improving visibility and usage of the Pandora archive
• Articles about interesting content on the Australia's Web Archives blog – http://blogs.nla.gov.au/australias-web-archives/
• More effort to identify archived sites that are no longer 'live'
• Market automatic redirect services to web site owners/managers
• Allow Google to index archive content for 'non-live' sites (problematic)
• Install Twittervane – draws site nominations for archiving based on trending Twitter topics
27. .au Domain Annual Snapshots
• Annual crawls since 2005, commissioned from the Internet Archive
• Includes sites on servers located in Australia as well as the .au domain
• Robots.txt respected, except for inline images and stylesheets
• No public access – researcher access protocols are being developed
• Full-text search – suited to searching archives
• Separate .gov crawl publicly accessible soon
28. Australian web domain crawls

Year           2005         2006         2007         2008        2009         2011
Files          185 million  596 million  516 million  1 billion   765 million  660 million
Hosts crawled  811,523      1,046,038    1,247,614    3,038,658   1,074,645    1,346,549
Size (TBs)     6.69         19.04        18.47        34.55       24.29        30.71
29. Internet Memory Foundation
• A number of European partners
• LiWA – Living Web Archives: next-generation web archiving methods and tools
• LAWA – Longitudinal Analytics of Web Archive Data: experimental testbed for large-scale data analytics
• ARCOMEM (Collect-All ARchives to COmmunity MEMories): leveraging social media for intelligent preservation
• SCAPE – Scalable Preservation Environments
31. Other National Archives
• List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
• Some are whole-domain archives, some are selective archives, many are both
• Some have public access; for others you will need to negotiate access for research
• Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
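To make the standard format concrete, here is a minimal sketch of what a single WARC record looks like on disk, built with the Python standard library only. Real archives are written by tools like Heritrix; the field set here is the bare minimum, and record types and headers should be checked against the WARC (ISO 28500) specification before use:

```python
import uuid
from datetime import datetime, timezone

def make_warc_record(url, payload):
    """Serialize one minimal WARC 'resource' record as bytes."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "WARC-Date: %s" % datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Target-URI: %s" % url,
        "Content-Length: %d" % len(payload),
    ]
    # Header block, blank line, payload, then the two CRLFs that
    # terminate every WARC record.
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + payload + b"\r\n\r\n"

record = make_warc_record("http://nla.gov.au/", b"<html>...</html>")
print(record.split(b"\r\n", 1)[0])
```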
32. Research Archives
• California Digital Library
• Harvard University Libraries
• Columbia University Libraries
• University of North Texas … and many more
• WebCITE – webcitation.org (citation service archive)
33. Example: Columbia University
• Member of the IIPC
• They use the Archive-It service
• A research library that sees web archiving as fundamental to their collecting
• They complement and coordinate with other web archives
• Their collecting focus is thematic – e.g. human rights, historic preservation, NY religious institutions
• They also archive web content as part of personal and organisational archives (c.f. manuscript collections)
• Archive their own web site regularly
35. Bringing Archives Together
• Common standards and APIs
• Memento project – adding time to the web
– Aggregates CDX files (URL index) from multiple archives
– Has a Firefox plug-in which allows time-based browsing
– Initiative of Los Alamos Laboratories
– See http://www.mementoweb.org/demo/
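Time-based browsing in Memento works by sending an Accept-Datetime header to a TimeGate, which redirects to the capture nearest that time. The sketch below covers only the header construction (the HTTP call itself is left out); the RFC 1123 date format is what the Memento protocol specifies:

```python
from email.utils import format_datetime
from datetime import datetime, timezone

def memento_headers(when):
    """Build the Accept-Datetime header for a Memento TimeGate request.

    `when` must be a timezone-aware UTC datetime; the value is the
    RFC 1123 date string the protocol expects.
    """
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}

h = memento_headers(datetime(2012, 8, 1, tzinfo=timezone.utc))
print(h["Accept-Datetime"])  # Wed, 01 Aug 2012 00:00:00 GMT
```

A client would send this header in a GET to a TimeGate URL and follow the redirect it returns.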
37. Common Use Cases for a web archive
• Content discovery
• Nostalgia queries
• Web site restoration and file recovery
• Domain name valuation
• Fall-back for link rot
• Prior-art analysis and patent/copyright infringement research
• Legal cases
• Topic analysis, web trends analysis, popularity analysis, network analysis, linguistic analysis
38. Create your own Archive
• Use a subscription service
• Build your own web archiving infrastructure with open-source software (i.e. Heritrix and Wayback)
• Use web citation services that create archive copies as you bookmark pages
39. Subscription Services
• archive-it.org (service operated by the non-profit Internet Archive since 2006)
• archivethe.net (service operated by the non-profit Internet Memory Foundation)
• California Digital Library Web Archiving Service – cdlib.org/services/uc3/was.html
• OCLC Harvester Service – oclc.org/webharvester/overview/default.htm
41. Install a web archiving system locally
• An easy-to-deploy web archiving toolkit is not yet available
• Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers – though it needs IT systems engineers to set up
• Archives can be deposited with the NLA for long-term preservation
42. Personal Web Archiving
• WARCreate – recently released free tool which creates Wayback-consumable WARC files from any web page
• Google Chrome extension
• Enables preservation by users from their desktop
• Can target content unreachable by crawlers
• Brings WARC to personal digital archiving
• What you do with the WARC files is up to you
• Install suite provided to set up a local Wayback instance and Memento TimeGate
43. Current challenges
• Database-driven features and functions
• Complex and varying URI formats and non-standard link implementations, e.g. Twitter
• Dynamically generated, ever-changing URIs
– for serving the same resources
• Rich media – e.g. streamed media with custom apps and anti-collection measures
• Scripted incremental display and page loading
44. … more …
• Scripted HTML forms
• Multi-sourced embedded material
• Dynamic authentication, e.g. captchas, cross-site authentication, user-sensitive embeds
• Alternate display based on browser, device, or other parameters
• Site architecture designed to inhibit crawling and indexing – but if poorly done, even 'polite' harvesters like Heritrix may crash their server
45. .. but wait, there's more …
• Server-side scripts and remote procedure calls – the full variety of paths through a site is now often hidden in remote/opaque server-side code – not a new problem, but it now affects 80+% of online resources
• HTML5 WebSockets – effectively codifies incremental updates without page reloads
• Mobile publishing
46. Transactional Web Archiving
• Useful for institutional archiving
– Best for record-keeping purposes – e.g. when challenged in court about content on a web site
– Can be used to ensure URL persistence, e.g. when a site has a make-over – can intercept 404s
– No 'gaps', c.f. the crawl approach – every change in accessed content is archived
– However, requires a code snippet to be installed on the web server
– Open-source software being developed by Los Alamos Labs
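The "code snippet installed on the web server" can be pictured as a small piece of middleware that records every response the server actually delivers. The sketch below is a hypothetical WSGI illustration, not the Los Alamos software; a real system would write WARC records to durable storage rather than keep an in-memory list:

```python
from datetime import datetime, timezone

class TransactionalArchiver:
    """WSGI middleware that captures each served response (sketch only)."""

    def __init__(self, app):
        self.app = app
        self.archive = []  # stand-in for durable WARC storage

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"], captured["headers"] = status, headers
            return start_response(status, headers, exc_info)

        # Every change in accessed content is archived as it is served.
        body = b"".join(self.app(environ, recording_start_response))
        self.archive.append({
            "url": environ.get("PATH_INFO", ""),
            "time": datetime.now(timezone.utc).isoformat(),
            "status": captured.get("status"),
            "body": body,
        })
        return [body]

# Usage with a trivial WSGI app:
def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

archiver = TransactionalArchiver(hello_app)
result = archiver({"PATH_INFO": "/index.html"}, lambda s, h, e=None: None)
print(archiver.archive[0]["url"], archiver.archive[0]["status"])
```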
47. Web Data Mining & Analysis – What is it? Why do it?
Innovation is increasingly driven by large-scale data analysis.
Need fast iteration to understand the right questions to ask.
More minds able to contribute = more value (perceived and real) placed on the importance of the data.
Increased demand for/value of the data = more funding to support it.
Need to surface the information amongst all that data…
51. File formats and data: CDX
• Index used to browse a WARC-based archive
• Space-delimited text file
• Only the essential metadata needed by Wayback
– URL
– Content digest
– Capture timestamp
– Content-Type
– HTTP response code
– etc.
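Because CDX is plain space-delimited text, reading it needs no special tooling. This is a minimal sketch; field order varies between CDX flavours, and the 11-field layout and field names below are an assumption to check against the header line of your own CDX files:

```python
# Assumed 11-field CDX layout (a common Wayback flavour):
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metaflags", "length", "offset", "filename",
]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a field dict."""
    return dict(zip(CDX_FIELDS, line.split()))

record = parse_cdx_line(
    "au,gov,nla)/ 20120801000000 http://nla.gov.au/ text/html 200 "
    "AAAA - - 2048 1234 crawl.warc.gz"
)
print(record["timestamp"], record["statuscode"])
```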
52. File formats and data: WAT
• Yet Another Metadata Format! ☺ ☹
• Not a preservation format
• For data exchange and analysis
• Less than full WARC, more than CDX
• Essential metadata for many types of analysis
• Avoids barriers to data exchange: copyright, privacy
• Work in progress: we want your feedback
53. File formats and data: WAT
• WAT is WARC ☺
– WAT records are WARC metadata records
– WARC-Refers-To header identifies the original WARC record
• WAT payload is JSON
– Compact
– Hierarchical
– Supported by every programming environment
• Relative sizes for the same data: CDX 53 MB, WAT 443 MB, WARC 8,651 MB
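Since the WAT payload is JSON, pulling metadata out of a record is a few lines of standard-library code. A hedged sketch of extracting the outgoing links from one record's payload follows; the nested key names are an assumption based on the commonly documented WAT envelope layout, so inspect a real record before reusing them:

```python
def extract_links(wat_record):
    """Return the URLs of outgoing links found in one WAT record.

    Assumes the Envelope / Payload-Metadata / HTTP-Response-Metadata /
    HTML-Metadata nesting; returns [] when any level is absent.
    """
    meta = (wat_record.get("Envelope", {})
                      .get("Payload-Metadata", {})
                      .get("HTTP-Response-Metadata", {})
                      .get("HTML-Metadata", {}))
    return [link.get("url") for link in meta.get("Links", [])]

# A tiny hand-built record in the assumed shape:
sample = {
    "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
        "HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "http://example.org/"}]}}}}
}
print(extract_links(sample))
```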
54. Some References
• http://en.wikipedia.org/wiki/Web_archiving
• http://netpreserve.org/about/archiveList.php
• Web Archives: The Future(s) – http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
• http://matkelly.com/warcreate/
• Common Crawl: http://commoncrawl.org/data/accessing-the-data/
55. Contacts
• webarchive@nla.gov.au
• secretariat@internetmemory.org
• Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/
• Queries about the Archive-It service: http://www.archive-it.org/contact-us
• momodei@nla.gov.au (until 31 Aug 2012) or monica.omodei@gmail.com