Internet Content as
   Research Data
Australian National University
  August 2012, Canberra
       Monica Omodei
Research Examples
•    Social networking   •  Political Science
•    Lexicography        •  Media Studies
•    Linguistics         •  Contemporary history
•    Network Science


Data-driven science is migrating from the
natural sciences to humanities and social
science
Talk Structure
•  Existing web archives
•  Web archive use cases
•  Bringing archives together
•  Creating your own archive
•  It’s getting harder – challenges
•  Web data mining & analysis

Existing web archives
•  Internet Archive
•  Common Crawl
•  Pandora Archive
•  Internet Memory Foundation Archive
•  Other national archives
•  Research, University Library archives

Common Collection Strategies
•  Crawl Scope & Focus
    1)  Thematic/Topical (elections, events, global warming…)
    2)  Resource-specific (video, pdf, etc.)
    3)  Broad survey (domain wide for .com/.net/.org/.edu/.gov)
    4)  Exhaustive (end of life, closure crawls, national domains)
    5)  Frequency-Based
•  Key Inputs: nominations from subject matter experts,
   prior crawl data, registry data, trusted directories,
   Wikipedia, Twitter

Internet Archive’s Web Archive

Positives
  –  Very broad – 175+ billion web instances
  –  Historic – started 1996
  –  Publicly accessible
  –  Time-based URL search
  –  API access
  –  Not constrained by legislation – covered by
     fair use and fast take-down response
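
As a rough illustration of time-based URL search through an API, the sketch below builds a query for the Wayback Machine availability endpoint and reads the closest capture out of a response. The talk does not say which API it has in mind, so treat this as one present-day example; the helper names and the sample response body are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Public endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(url, timestamp=None):
    """Return the request URL asking for the capture of `url` closest to `timestamp`.

    `timestamp` uses the Wayback form YYYYMMDDhhmmss; a prefix such as YYYYMMDD also works.
    """
    params = {"url": url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Extract (timestamp, snapshot_url) from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


# Hand-written sample body for illustration only, not real captured output.
SAMPLE_RESPONSE = """
{"archived_snapshots": {"closest": {"available": true,
  "url": "http://web.archive.org/web/20060101000000/http://example.com/",
  "timestamp": "20060101000000", "status": "200"}}}
"""
```

Fetching the query URL (for example with `urllib.request.urlopen`) and passing the body to `closest_snapshot` would give the capture nearest to the requested date, if one exists.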
Internet Archive’s Web Archive
Negatives
  –  Because of size can’t search by keyword
  –  Because of size crawling is fully automated –
     ergo QA is not possible

Common Crawl
•  Non-profit foundation building an open crawl
   of the web to seed research and innovation
•  Currently 5 billion pages
•  Stored on Amazon’s S3
•  Accessible via MapReduce processing in
   Amazon’s EC2 compute cloud
•  Makes wholesale extraction, transformation, and
   analysis of web data cheap and easy

Common Crawl
Negatives
•  Not designed for human browsing but for
   machine access
•  Objective is to support large-scale analysis and
   text mining/indexing – not long-term
   preservation
•  Some costs are involved for direct extraction
   of data from S3 storage using Requester-Pays
   API

Pandora Archive
•  Positives
   –  Quality checked
   –  Targeted Australian content with selection policy
   –  Historical – started 1996
   –  Bibliocentric approach – web sites/publications
      selected for archiving are catalogued (see Trove)
   –  Keyword search
   –  Publicly accessible
   –  You can nominate Australian web sites for
      inclusion - pandora.nla.gov.au/registration_form.html

Pandora Archive
•  Negatives
   –  labour intensive thus quite small
   –  significant content missed because permission to
      copy refused
•  Situation will improve markedly if Legal
   Deposit provisions are extended to digital
   publications
•  Broader coverage will be achieved when
   infrastructure is upgraded, hence reducing
   labour costs for checking/fixing crawls

Pandora Archive Stats
•  Size – 6.32 TB
•  Number of Files > 140 million
•  Number of ‘titles’ > 30.5K
•  Number of title instances > 73.5K

Which archived sites are popular?
   •  Measure: filtered, aggregated web access
      log data which counts access to titles
   •  Examined top 30 archived titles (# of
      accesses) for each year 2009 to 2012
   •  Selected some to examine and speculate
      as to why they might be popular
   •  Selected those with consistently high
      ranking, and ones that were very variable
      between years
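
The measure described above (access counts per title, ranked within each year) can be sketched roughly as follows. The record layout and the sample data are hypothetical; the talk does not describe how the Pandora logs were actually filtered or aggregated.

```python
from collections import Counter, defaultdict

# Hypothetical, already-filtered log records: (year, archived title).
SAMPLE_ACCESSES = [
    (2009, "Title A"), (2009, "Title A"), (2009, "Title B"),
    (2010, "Title B"), (2010, "Title B"), (2010, "Title C"),
]


def top_titles_by_year(accesses, n=30):
    """Count accesses per title within each year and return the n most accessed."""
    per_year = defaultdict(Counter)
    for year, title in accesses:
        per_year[year][title] += 1
    return {year: counts.most_common(n) for year, counts in per_year.items()}
```

Comparing the per-year lists then shows which titles rank consistently high and which vary a lot between years.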
  
Reasons for popularity of archived version
•  Were once popular and are now
   decommissioned, particularly if domain
   name continues to exist and redirects to
   the archive
•  May not be that popular as live sites but
   their live site links prominently to Pandora
   as an archive for their content
•  Popular referencing sources cite the
   archive as well as the live site (if it still
   exists)

Improving visibility and usage of Pandora archive
•  Articles about interesting content on the
   Australia Web Archives blog – http://blogs.nla.gov.au/australias-web-archives/
•  More effort to identify archived sites that are
   no longer ‘live’
•  Market automatic redirect services to web
   site owners/managers
•  Allow Google to index archive content for
   ‘non-live’ sites (problematic)
•  Install Twittervane - draws site nominations
   for archiving based on trending Twitter topics.

.au Domain Annual Snapshots
•  Annual crawls since 2005 commissioned from
   Internet Archive
•  Includes sites on servers located in Australia
   as well as .au domain
•  Robots.txt respected except for inline images
   and stylesheets
•  No public access – researcher access protocols
   are being developed
•  Full text search – suited to searching archives
•  Separate .gov crawl publicly accessible soon
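
The robots.txt policy in the list above (respected, except for inline images and stylesheets) might be expressed along these lines. This is only a sketch using Python's standard `urllib.robotparser`, not the crawler's real configuration; the extension list, the `archive-bot` user agent and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Extensions treated as inline page requisites in this sketch (an assumption).
REQUISITE_EXTENSIONS = (".css", ".png", ".gif", ".jpg", ".jpeg")


def may_capture(robots_lines, user_agent, url):
    """Return True if the crawler should fetch `url` under the stated policy."""
    if url.lower().endswith(REQUISITE_EXTENSIONS):
        return True  # inline images and stylesheets are fetched regardless
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as a list of text lines
    return parser.can_fetch(user_agent, url)


# Hand-written sample rules for illustration.
SAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]
```

With the sample rules, `may_capture(SAMPLE_ROBOTS, "archive-bot", "http://example.com.au/private/page.html")` is refused, while a stylesheet under the same path is still fetched so that archived pages render properly.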
  
Australian web domain crawls

Year            2005          2006          2007          2008          2009          2011
Files           185 million   596 million   516 million   1 billion     765 million   660 million
Hosts crawled   811,523       1,046,038     1,247,614     3,038,658     1,074,645     1,346,549
Size (TBs)      6.69          19.04         18.47         34.55         24.29         30.71

Internet Memory Foundation
•  Number of European partners
•  LiWA – Living Web Archives: next generation
   Web archiving methods and tools
•  LAWA – Longitudinal Analytics of Web Archive
   Data: experimental testbed for large-scale
   data analytics
•  ARCOMEM (Collect-All ARchives to
   COmmunity MEMories) leveraging social
   media for Intelligent Preservation
•  SCAPE – Scalable Preservation Environments

Other National Archives
•  List of International Internet Preservation
   Consortium member archives –
   netpreserve.org/about/archiveList.php
•  Some are whole domain archives, some are
   selective archives, many are both
•  Some have public access, others you will need
   to negotiate access for research
•  Most archives have been collected using the
   Heritrix open-source crawler and thus use the
   standard format (WARC ISO format)

Research Archives
•  California Digital Library
•  Harvard University Libraries
•  Columbia University Libraries
•  University of North Texas
…. and many more

•  WebCITE - webcitation.org (citation service
   archive)

Example: Columbia University
•  Member of the IIPC
•  They use the Archive-It service
•  A research library that sees web archiving as
   fundamental to their collecting
•  They complement and coordinate with other web
   archives
•  Their collecting focus is thematic – e.g. human rights,
   historic preservation, NY religious institutions
•  They also archive web content as part of personal
   and organisational archives (cf. manuscripts collections)
•  Archive their own web site regularly

Bringing Archives Together
•  Common standards and APIs
•  Memento project – adding time to the web
       –  Aggregates CDX files (URL index) from multiple
          archives
       –  Has a Firefox plug-in which allows time-based
          browsing
       –  Initiative of Los Alamos Laboratories
       –  See http://www.mementoweb.org/demo/
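
Time-based browsing in Memento works by content negotiation on date: the client sends an `Accept-Datetime` request header and is directed to the capture closest to that moment. The sketch below only builds that header; the function name is made up, and no particular TimeGate URL is assumed because the slides do not give one.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(moment):
    """Build the Accept-Datetime request header for a timezone-aware datetime.

    The value is an HTTP date expressed in GMT, so the moment is converted to UTC first.
    """
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


# Example: ask for the web as it looked at the start of 2006.
EXAMPLE_HEADER = accept_datetime_header(datetime(2006, 1, 1, tzinfo=timezone.utc))
```

Sending this header with an ordinary HTTP GET to a Memento-aware service is what lets a browser plug-in replay a page "as of" a chosen date across several archives.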
  
	
  
Common Use Cases for a web archive
•  Content discovery
•  Nostalgia queries
•  Web site restoration and file recovery
•  Domain name valuation
•  Fall-back for link-rot
•  Prior art analysis and patent/copyright infringement
   research
•  Legal cases
•  Topic analysis, web trends analysis, popularity
   analysis, network analysis, linguistic analysis

Create your own Archive
•  Use a subscription service
•  Build your own web archiving infrastructure
   with open source software (i.e. Heritrix and
   Wayback)
•  Use web citation services that create archive
   copies as you bookmark pages

Subscription Services
•  archive-it.org (service operated by non-profit
   Internet Archive since 2006)
•  archivethe.net (service operated by non-profit
   Internet Memory Foundation)
•  California Digital Library Web Archiving
   Service - cdlib.org/services/uc3/was.html
•  OCLC Harvester Service - oclc.org/webharvester/overview/default.htm

Install web archiving system locally
•  Easy-to-deploy web archiving toolkit not yet
   available
•  Institutional web archiving infrastructure is
   feasible and has been established at a number
   of universities for use by researchers – needs
   IT systems engineers to set up though
•  Archives can be deposited with the NLA for
   long-term preservation

Personal Web Archiving
•  WARCreate – recently released free tool which
   creates Wayback-consumable WARC files from any
   web page
•  Google Chrome extension
•  Enables preservation by users from their desktop
•  Can target content unreachable by crawlers
•  Brings WARC to personal digital archiving
•  What you do with the WARC files is up to you
•  Install suite provided to set up local Wayback
   instance and Memento timegate

Current challenges
•  Database-driven features and functions
•  Complex and varying URI formats and non-standard
   link implementations e.g. Twitter
•  Dynamically generated ever-changing URIs
   –  For serving the same resources
•  Rich Media – e.g. streamed media with custom
   apps and anti-collection measures
•  Scripted incremental display and page-loading

… more…
•  Scripted HTML forms
•  Multi-sourced embedded material
•  Dynamic authentication e.g. captchas, cross-site
   authentication, user-sensitive embeds
•  Alternate display based on browser or device,
   or other parameter
•  Site architecture designed to inhibit crawling
   and indexing – but if poorly done even ‘polite’
   harvesters like Heritrix may crash their server

.. but wait, there’s more …
•  Server-side scripts and remote procedure calls
   – the full variety of paths through a site is
   now often hidden in remote/opaque server-side
   code – not a new problem but it now
   affects 80+% of online resources
•  HTML 5 web sockets – effectively codifies
   incremental updates without page reloads
•  Mobile publishing

Transactional Web Archiving
•  Useful for institutional archiving
    –  Best for record-keeping purposes - when
       challenged in court about content on web site
    –  Can be used to ensure URL persistence e.g. when
       site has a make-over – can intercept 404s
    –  No ‘gaps’ cf. crawl approach – every change in
       accessed content is archived
    –  However requires code snippet to be installed on
       web server
    –  Open source software being developed by Los
       Alamos Labs

Web Data Mining & Analysis –
What is it? Why Do It?
Innovation is increasingly driven from Large scale
  Data Analysis

  Need fast iteration to understand the right
  questions to ask
  More minds able to contribute = more value
  (perceived and real) placed on the importance
  of the data
  Increased demand for/value of the data = more
  funding to support it
  Need to surface the Information amongst all
  that data…
Platform & Toolkit: Overview
•  Software
   –  Apache Hadoop
   –  Apache Pig
•  Data/File format
   –  WARC
   –  CDX
   –  WAT (new!)

Apache Hadoop
•  HDFS
   –  Distributed storage
   –  Durable, default 3x replication
   –  Scalable: Yahoo! 60+PB HDFS
•  MapReduce
   –  Distributed computation
   –  You write Java functions
   –  Hadoop distributes work across cluster
   –  Tolerates failures
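
To make the MapReduce programming model above concrete, here is a small single-process sketch that counts captured URLs per host. Real Hadoop jobs are written as Java functions and run across a cluster, as the slide says; this Python version only mimics the map, group and reduce steps, and every name in it is illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_host(url):
    """Map step: emit (host, 1) for one captured URL."""
    yield urlsplit(url).netloc, 1


def reduce_count(host, counts):
    """Reduce step: sum all counts emitted for one host."""
    return host, sum(counts)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the grouping Hadoop performs between map and reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())
```

For instance, `run_job(["http://a.example/1", "http://a.example/2"], map_host, reduce_count)` groups both URLs under the host `a.example`. On a cluster the same two functions would run in parallel over many input splits.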
File formats and data: WARC
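
As a rough illustration of the format, a WARC record is a version line, a block of named header fields, a blank line, and then a payload whose size is given by `Content-Length`. The sketch below parses one uncompressed record; the sample record is hand-written and simplified, and real files usually gzip each record and may repeat header names, which this parser does not handle.

```python
def parse_warc_record(raw):
    """Split one uncompressed WARC record into (version, headers, payload)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split at the first colon only
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Hand-written sample record for illustration only.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2012-08-01T00:00:00Z\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
```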
File formats and data: CDX
•  Index used to browse WARC-based archive
•  Space-delimited text file
•  Only the essential metadata needed
   by Wayback
  –  URL
  –  Content Digest
  –  Capture Timestamp
  –  Content-Type
  –  HTTP response code
  –  etc.
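
Because CDX is space-delimited text, one index line can be split into named fields with very little code. The field order below is an assumption for this sketch (a real CDX file declares its own column order in its first line), and the sample line is hand-written rather than taken from a real index.

```python
# Field names assumed for this sketch; check the header line of a real CDX file.
CDX_FIELDS = ("urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length")


def parse_cdx_line(line, fields=CDX_FIELDS):
    """Split one space-delimited CDX line into a dict keyed by field name."""
    values = line.strip().split(" ")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


# Hand-written sample line for illustration only.
SAMPLE_LINE = "com,example)/ 20120801000000 http://example.com/ text/html 200 EXAMPLEDIGEST 1234"
```

Filtering parsed lines by `timestamp` or `statuscode` is enough for many longitudinal questions without ever opening the much larger WARC files.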
File formats and data: WAT
•  Yet Another Metadata Format! ☺ ☹
•  Not a preservation format
•  Data exchange and analysis
•  Less than full WARC, more than CDX
•  Essential metadata for many types of analysis
•  Avoids barriers to data exchange: copyright,
   privacy
•  Work-in-progress: we want your feedback

File formats and data: WAT
•  WAT is WARC ☺
  –  WAT records are WARC metadata records
  –  WARC-Refers-To header identifies original WARC record
•  WAT payload is JSON
  –  Compact
  –  Hierarchical
  –  Supported by every programming environment

File formats & data:
•  CDX: 53 MB
•  WAT: 443 MB
•  WARC: 8,651 MB
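
Two things on this slide lend themselves to a quick sketch: the relative sizes of the three formats, and the fact that a JSON payload can be read with a standard library. The sizes are the ones quoted above; the payload layout is a hypothetical, heavily simplified stand-in, since the slides do not show the real WAT field names.

```python
import json

# Sizes quoted on the slide, in MB.
SIZES_MB = {"CDX": 53, "WAT": 443, "WARC": 8651}


def share_of_warc(sizes):
    """Express each format's size as a percentage of the full WARC size."""
    return {name: round(100 * size / sizes["WARC"], 1) for name, size in sizes.items()}


# Hypothetical, simplified WAT-style payload; real field names may differ.
SAMPLE_PAYLOAD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
        "Payload-Metadata": {"Links": [{"url": "http://example.org/a"},
                                       {"url": "http://example.org/b"}]},
    }
})


def outlinks(payload_text):
    """Return (source URI, [link URLs]) from the simplified payload above."""
    envelope = json.loads(payload_text)["Envelope"]
    source = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    links = [link["url"] for link in envelope["Payload-Metadata"]["Links"]]
    return source, links
```

On the quoted figures the WAT data is about 5% of the WARC size and the CDX index under 1%, which is why link and metadata analysis is far cheaper on WAT than on full WARC files.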
Some References
•  http://en.wikipedia.org/wiki/Web_archiving
•  http://netpreserve.org/about/archiveList.php
•  Web Archives: The Future(s) -
   http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf
•  http://matkelly.com/warcreate/
•  Common Crawl: http://commoncrawl.org/data/accessing-the-data/

Contacts
•  Webarchive @ nla.gov.au
•  Secretariat @ internetmemory.org
•  Queries about the Internet Archive web archive
   http://iawebarchiving.wordpress.com/
•  Queries about Archive-It service
   http://www.archive-it.org/contact-us

momodei @ nla.gov.au (until 31 Aug 2012)
or
monica.omodei @ gmail.com

More Related Content

What's hot

Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library servicesNikesh Narayanan
 
Deep Dive Into KBART
Deep Dive Into KBARTDeep Dive Into KBART
Deep Dive Into KBARTNASIG
 
NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)Christine Stohn
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniquesTola Odugbesan
 
Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...
Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...
Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...Samuel W. Shogren, MPA., LEAD assoc.
 
WorldCat Presentation
WorldCat PresentationWorldCat Presentation
WorldCat PresentationVal MacMillan
 
Applying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual PreservationApplying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual PreservationJon W. Dunn
 
Cambridge university library ess update for ucs
Cambridge university library  ess update for ucsCambridge university library  ess update for ucs
Cambridge university library ess update for ucsEdmund Chamberlain
 
ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019Helge Holzmann
 
E book acquisition discovery-delivery-support
E book acquisition discovery-delivery-supportE book acquisition discovery-delivery-support
E book acquisition discovery-delivery-supportJeff Siemon
 
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...locloud
 
Charper.lawdi.20120601
Charper.lawdi.20120601Charper.lawdi.20120601
Charper.lawdi.20120601charper
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Future Perfect 2012
 
Gil interconnected libraries cooperative cataloging (3)
Gil interconnected libraries cooperative cataloging (3)Gil interconnected libraries cooperative cataloging (3)
Gil interconnected libraries cooperative cataloging (3)Debra Skinner
 
Criteria for a trusted institutional repository
Criteria for a trusted institutional repositoryCriteria for a trusted institutional repository
Criteria for a trusted institutional repositoryIna Smith
 
NISO Standards update: KBart and Demand Driven Acquisitions Best Practices
NISO Standards update: KBart and Demand Driven Acquisitions Best PracticesNISO Standards update: KBart and Demand Driven Acquisitions Best Practices
NISO Standards update: KBart and Demand Driven Acquisitions Best PracticesJason Price, PhD
 
Text mining in CORE (OR2012)
Text mining in CORE (OR2012)Text mining in CORE (OR2012)
Text mining in CORE (OR2012)petrknoth
 

What's hot (20)

Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library services
 
Institutional Repositories and Open Access Movement
Institutional Repositories and Open Access MovementInstitutional Repositories and Open Access Movement
Institutional Repositories and Open Access Movement
 
The WSTIERIA Project – A Web of Services
The  WSTIERIA Project – A Web of ServicesThe  WSTIERIA Project – A Web of Services
The WSTIERIA Project – A Web of Services
 
Deep Dive Into KBART
Deep Dive Into KBARTDeep Dive Into KBART
Deep Dive Into KBART
 
NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)NISO access related projects (presented at the Charleston conference 2016)
NISO access related projects (presented at the Charleston conference 2016)
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
 
Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...
Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...
Community Collaboration in the Creation of Digital Collections - 2015 OR Heri...
 
WorldCat Presentation
WorldCat PresentationWorldCat Presentation
WorldCat Presentation
 
Digital Library Conferences
Digital Library ConferencesDigital Library Conferences
Digital Library Conferences
 
Applying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual PreservationApplying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual Preservation
 
Cambridge university library ess update for ucs
Cambridge university library  ess update for ucsCambridge university library  ess update for ucs
Cambridge university library ess update for ucs
 
ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019
 
E book acquisition discovery-delivery-support
E book acquisition discovery-delivery-supportE book acquisition discovery-delivery-support
E book acquisition discovery-delivery-support
 
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
LoCloud Vocabulary Services: Thesaurus management introduction, Walter Koch a...
 
Charper.lawdi.20120601
Charper.lawdi.20120601Charper.lawdi.20120601
Charper.lawdi.20120601
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
 
Gil interconnected libraries cooperative cataloging (3)
Gil interconnected libraries cooperative cataloging (3)Gil interconnected libraries cooperative cataloging (3)
Gil interconnected libraries cooperative cataloging (3)
 
Criteria for a trusted institutional repository
Criteria for a trusted institutional repositoryCriteria for a trusted institutional repository
Criteria for a trusted institutional repository
 
NISO Standards update: KBart and Demand Driven Acquisitions Best Practices
NISO Standards update: KBart and Demand Driven Acquisitions Best PracticesNISO Standards update: KBart and Demand Driven Acquisitions Best Practices
NISO Standards update: KBart and Demand Driven Acquisitions Best Practices
 
Text mining in CORE (OR2012)
Text mining in CORE (OR2012)Text mining in CORE (OR2012)
Text mining in CORE (OR2012)
 

Viewers also liked

The Law of Averages, Chapter 1
The Law of Averages, Chapter 1The Law of Averages, Chapter 1
The Law of Averages, Chapter 1Nerissaemerald
 
Lymphatic And Immune System
Lymphatic And Immune SystemLymphatic And Immune System
Lymphatic And Immune Systemguest866fdd0d
 
Susan Lannon Samples
Susan Lannon SamplesSusan Lannon Samples
Susan Lannon SamplesSuzinLannon
 
Jane Massey Portfolio
Jane Massey  PortfolioJane Massey  Portfolio
Jane Massey PortfolioJane Massey
 
Pakistan fights back
Pakistan fights backPakistan fights back
Pakistan fights backAndeel Ali
 
Чего хотят люди // What people wants
Чего хотят люди // What people wantsЧего хотят люди // What people wants
Чего хотят люди // What people wantsSegrey Nikishov - @n_grey
 
Mayas
MayasMayas
Mayasnone
 
Gum = Sticky Substance
Gum = Sticky SubstanceGum = Sticky Substance
Gum = Sticky Substanceguest033f1106
 
Основные направления деятельности «Конгресс-коллегии в 2014 году
Основные направления деятельности «Конгресс-коллегии в 2014 годуОсновные направления деятельности «Конгресс-коллегии в 2014 году
Основные направления деятельности «Конгресс-коллегии в 2014 годуFert
 
Getting stuff made
Getting stuff madeGetting stuff made
Getting stuff madeElaine Chen
 
Creating and configuring vnc sessions
Creating and configuring vnc sessionsCreating and configuring vnc sessions
Creating and configuring vnc sessionsRavi Kumar Lanke
 
Picking colors for your presentations
Picking colors for your presentationsPicking colors for your presentations
Picking colors for your presentationsRuben Rathnasingham
 

Viewers also liked (20)

The Law of Averages, Chapter 1
The Law of Averages, Chapter 1The Law of Averages, Chapter 1
The Law of Averages, Chapter 1
 
Lymphatic And Immune System
Lymphatic And Immune SystemLymphatic And Immune System
Lymphatic And Immune System
 
breeam basics
breeam basicsbreeam basics
breeam basics
 
1.1a rasional pengembangan k2013
1.1a    rasional pengembangan k20131.1a    rasional pengembangan k2013
1.1a rasional pengembangan k2013
 
Susan Lannon Samples
Susan Lannon SamplesSusan Lannon Samples
Susan Lannon Samples
 
Jane Massey Portfolio
Jane Massey  PortfolioJane Massey  Portfolio
Jane Massey Portfolio
 
MakerFaire Shenzhen 2014 presentation "How to make educational by technology ...
MakerFaire Shenzhen 2014 presentation "How to make educational by technology ...MakerFaire Shenzhen 2014 presentation "How to make educational by technology ...
MakerFaire Shenzhen 2014 presentation "How to make educational by technology ...
 
Pakistan fights back
Pakistan fights backPakistan fights back
Pakistan fights back
 
Schoon Licht geeft 4 keer winst
Schoon Licht geeft 4 keer winstSchoon Licht geeft 4 keer winst
Schoon Licht geeft 4 keer winst
 
Чего хотят люди // What people wants
Чего хотят люди // What people wantsЧего хотят люди // What people wants
Чего хотят люди // What people wants
 
Verduurzaam Uw Onderhoudsplan Agentschap Nl
Verduurzaam Uw Onderhoudsplan   Agentschap NlVerduurzaam Uw Onderhoudsplan   Agentschap Nl
Verduurzaam Uw Onderhoudsplan Agentschap Nl
 
Mayas
MayasMayas
Mayas
 
Spaun
SpaunSpaun
Spaun
 
12 agatha smmf
12 agatha smmf12 agatha smmf
12 agatha smmf
 
Gum = Sticky Substance
Gum = Sticky SubstanceGum = Sticky Substance
Gum = Sticky Substance
 
Основные направления деятельности «Конгресс-коллегии в 2014 году
Основные направления деятельности «Конгресс-коллегии в 2014 годуОсновные направления деятельности «Конгресс-коллегии в 2014 году
Основные направления деятельности «Конгресс-коллегии в 2014 году
 
Getting stuff made
Getting stuff madeGetting stuff made
Getting stuff made
 
Creating and configuring vnc sessions
Creating and configuring vnc sessionsCreating and configuring vnc sessions
Creating and configuring vnc sessions
 
Decentrale opwek in het Energieakkoord voor Duurzame Groei
Decentrale opwek in het Energieakkoord voor Duurzame GroeiDecentrale opwek in het Energieakkoord voor Duurzame Groei
Decentrale opwek in het Energieakkoord voor Duurzame Groei
 
Picking colors for your presentations
Picking colors for your presentationsPicking colors for your presentations
Picking colors for your presentations
 

Similar to Slides anu talkwebarchivingaug2012

Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and PotentialDaniel Gomes
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012lljohnston
 
Marc and beyond: 3 Linked Data Choices
 Marc and beyond: 3 Linked Data Choices  Marc and beyond: 3 Linked Data Choices
Marc and beyond: 3 Linked Data Choices Richard Wallis
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
Slides anu talkwebarchivingaug2012

  • 1. Internet Content as Research Data
      Australian National University, August 2012, Canberra
      Monica Omodei
  • 2. Research Examples
      • Social networking
      • Political Science
      • Lexicography
      • Media Studies
      • Linguistics
      • Contemporary history
      • Network Science
      Data-driven science is migrating from the natural sciences to humanities and social science
  • 3. Talk Structure
      • Existing web archives
      • Web archive use cases
      • Bringing archives together
      • Creating your own archive
      • It’s getting harder – challenges
      • Web data mining & analysis
  • 4. Existing web archives
      • Internet Archive
      • Common Crawl
      • Pandora Archive
      • Internet Memory Foundation Archive
      • Other national archives
      • Research, University Library archives
  • 5. Common Collection Strategies
      • Crawl Scope & Focus
          1) Thematic/Topical (elections, events, global warming…)
          2) Resource-specific (video, pdf, etc.)
          3) Broad survey (domain wide for .com/.net/.org/.edu/.gov)
          4) Exhaustive (end of life, closure crawls, national domains)
          5) Frequency-based
      • Key Inputs: nominations from subject matter experts, prior crawl data, registry data, trusted directories, Wikipedia, Twitter
  • 6. Internet Archive’s Web Archive
      Positives
          – Very broad – 175+ billion web instances
          – Historic – started 1996
          – Publicly accessible
          – Time-based URL search
          – API access
          – Not constrained by legislation – covered by fair use and fast take-down response
  • 7. Internet Archive’s Web Archive
      Negatives
          – Because of its size, it can’t be searched by keyword
          – Because of its size, crawling is fully automated, so QA is not possible
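The "time-based URL search" and "API access" points on slide 6 can be made concrete with a small sketch. It only builds request addresses and makes no network calls. The two endpoints shown are the Wayback Machine's present-day public addresses, which is an assumption on our part since the slides do not name them; `anu.edu.au` is just an illustrative target.

```python
from urllib.parse import urlencode

# Assumed endpoints (present-day Wayback Machine addresses, not taken from the slides).
WAYBACK_REPLAY = "https://web.archive.org/web"
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def replay_url(original_url: str, timestamp: str) -> str:
    """Address that replays the capture nearest to a YYYYMMDDhhmmss timestamp.

    Shorter prefixes such as "2012" or "20120815" are accepted as well.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 2012 or 20120815")
    return f"{WAYBACK_REPLAY}/{timestamp}/{original_url}"


def availability_query(original_url: str, timestamp: str) -> str:
    """Address of a JSON lookup asking which capture is closest to the timestamp."""
    query = urlencode({"url": original_url, "timestamp": timestamp})
    return f"{WAYBACK_AVAILABILITY}?{query}"


if __name__ == "__main__":
    print(replay_url("http://www.anu.edu.au/", "20120815"))
    print(availability_query("anu.edu.au", "20120815"))
```

Note that both lookups are keyed by URL and time, which matches the limitation on slide 7: you need to know the address you are looking for.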
  • 11. Common Crawl
      • Non-profit foundation building an open crawl of the web to seed research and innovation
      • Currently 5 billion pages
      • Stored on Amazon’s S3
      • Accessible via MapReduce processing in Amazon’s EC2 compute cloud
      • Wholesale extraction, transformation, and analysis of web data cheap and easy
  • 12. Common Crawl
      Negatives
      • Not designed for human browsing but for machine access
      • Objective is to support large-scale analysis and text mining/indexing – not long-term preservation
      • Some costs are involved for direct extraction of data from S3 storage using Requester-Pays API
  • 13. Pandora Archive
      • Positives
          – Quality checked
          – Targeted Australian content with selection policy
          – Historical – started 1996
          – Bibliocentric approach – web sites/publications selected for archiving are catalogued (see Trove)
          – Keyword search
          – Publicly accessible
          – You can nominate Australian web sites for inclusion - pandora.nla.gov.au/registration_form.html
  • 15. Pandora Archive
      • Negatives
          – Labour intensive, thus quite small
          – Significant content missed because permission to copy was refused
      • Situation will improve markedly if Legal Deposit provisions are extended to digital publications
      • Broader coverage will be achieved when infrastructure is upgraded, reducing labour costs for checking/fixing crawls
  • 16. Pandora Archive Stats
      • Size – 6.32 TB
      • Number of files > 140 million
      • Number of ‘titles’ > 30.5K
      • Number of title instances > 73.5K
  • 21. Which archived sites are popular?
      • Measure: filtered, aggregated web access log data which counts access to titles
      • Examined top 30 archived titles (# of accesses) for each year 2009 to 2012
      • Selected some to examine and speculate as to why they might be popular
      • Selected those with consistently high ranking, and ones that were very variable between years
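The measure on slide 21 (count accesses per title, take the top 30 for each year, then separate consistently high titles from variable ones) can be sketched as below. This is an illustration only: the input shape, a stream of `(year, title)` pairs already filtered from access logs, is an assumption, and the slides do not describe the actual log format or filtering rules.

```python
from collections import Counter, defaultdict


def top_titles_by_year(accesses, n=30):
    """Return {year: [titles]} holding the n most-accessed titles of each year.

    `accesses` is an iterable of (year, title) pairs, one pair per logged access.
    """
    counts = defaultdict(Counter)
    for year, title in accesses:
        counts[year][title] += 1
    return {year: [title for title, _ in counter.most_common(n)]
            for year, counter in counts.items()}


def split_consistent_and_variable(top_by_year):
    """Titles ranked in every year's top list versus titles ranked only in some years."""
    ranked_anywhere = set().union(*top_by_year.values())
    consistent = {title for title in ranked_anywhere
                  if all(title in titles for titles in top_by_year.values())}
    return consistent, ranked_anywhere - consistent


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    log = [(2009, "site-a"), (2009, "site-a"), (2009, "site-b"),
           (2010, "site-a"), (2010, "site-c")]
    top = top_titles_by_year(log, n=2)
    print(split_consistent_and_variable(top))
```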
  • 22. Reasons for popularity of archived version
      • Were once popular and are now decommissioned, particularly if domain name continues to exist and redirects to the archive
      • May not be that popular as live sites but their live site links prominently to Pandora as an archive for their content
      • Popular referencing sources cite the archive as well as the live site (if it still exists)
  • 26. Improving visibility and usage of Pandora archive
      • Articles about interesting content on the Australia Web Archives blog – http://blogs.nla.gov.au/australias-web-archives/
      • More effort to identify archived sites that are no longer ‘live’
      • Market automatic redirect services to web site owners/managers
      • Allow Google to index archive content for ‘non-live’ sites (problematic)
      • Install Twittervane - draws site nominations for archiving based on trending Twitter topics
  • 27. .au Domain Annual Snapshots
      • Annual crawls since 2005 commissioned from Internet Archive
      • Includes sites on servers located in Australia as well as .au domain
      • Robots.txt respected except for inline images and stylesheets
      • No public access – researcher access protocols are being developed
      • Full text search – suited to searching archives
      • Separate .gov crawl publicly accessible soon
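The crawl policy on slide 27 (respect robots.txt, except for inline images and stylesheets) can be sketched with Python's standard `urllib.robotparser`. The suffix list and the `example-crawler` user agent are assumptions made for illustration; the slides do not say how the real crawls recognise inline resources.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumed way of recognising inline resources: by file suffix.
INLINE_SUFFIXES = (".css", ".png", ".jpg", ".jpeg", ".gif")


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_capture(parser: RobotFileParser, url: str,
                user_agent: str = "example-crawler") -> bool:
    """Apply robots.txt rules, but always allow inline images and stylesheets."""
    path = urlsplit(url).path.lower()
    if path.endswith(INLINE_SUFFIXES):
        return True  # needed to render the page faithfully, so exempt from robots.txt
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    policy = load_policy("User-agent: *\nDisallow: /private/\n")
    for url in ("http://example.com.au/index.html",
                "http://example.com.au/private/report.html",
                "http://example.com.au/private/logo.png"):
        print(url, may_capture(policy, url))
```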
  • 28. Australian web domain crawls
      Year   Files          Hosts crawled   Size (TB)
      2005   185 million    811,523          6.69
      2006   596 million    1,046,038       19.04
      2007   516 million    1,247,614       18.47
      2008   1 billion      3,038,658       34.55
      2009   765 million    1,074,645       24.29
      2011   660 million    1,346,549       30.71
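One quick way to read the crawl figures on slide 28 is to derive an average file size per crawl. The sketch below does that arithmetic; it assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes), which the slide does not specify, and treats the rounded file counts as exact.

```python
# (number of files, size in TB) per crawl year, copied from slide 28.
CRAWLS = {
    2005: (185e6, 6.69),
    2006: (596e6, 19.04),
    2007: (516e6, 18.47),
    2008: (1e9, 34.55),
    2009: (765e6, 24.29),
    2011: (660e6, 30.71),
}


def average_file_kb(files: float, size_tb: float) -> float:
    """Average file size in KB, assuming decimal TB and KB."""
    return size_tb * 1e12 / files / 1e3


if __name__ == "__main__":
    for year, (files, size_tb) in sorted(CRAWLS.items()):
        print(year, round(average_file_kb(files, size_tb), 1), "KB per file")
```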
  • 29. Internet Memory Foundation
      • Number of European partners
      • LiWA – Living Web Archives: next generation Web archiving methods and tools
      • LAWA – Longitudinal Analytics of Web Archive Data: experimental testbed for large-scale data analytics
      • ARCOMEM (Collect-All ARchives to COmmunity MEMories) leveraging social media for Intelligent Preservation
      • SCAPE – Scalable Preservation Environments
  • 31. Other National Archives
      • List of International Internet Preservation Consortium member archives – netpreserve.org/about/archiveList.php
      • Some are whole domain archives, some are selective archives, many are both
      • Some have public access; for others you will need to negotiate access for research
      • Most archives have been collected using the Heritrix open-source crawler and thus use the standard format (WARC ISO format)
  • 32. Research Archives
      • California Digital Library
      • Harvard University Libraries
      • Columbia University Libraries
      • University of North Texas
      …. and many more
      • WebCITE - webcitation.org (citation service archive)
  • 33. Example: Columbia University
      • Member of the IIPC
      • They use the ArchiveIt service
      • A Research library that sees web archiving as fundamental to their collecting
      • They complement and coordinate with other web archives
      • Their collecting focus is thematic – e.g. human rights, historic preservation, NY religious institutions
      • They also archive web content as part of personal and organisational archives (cf. manuscripts collections)
      • Archive their own web site regularly
  • 35. Bringing Archives Together
      • Common standards and APIs
      • Memento project – adding time to the web
          – Aggregates CDX files (URL index) from multiple archives
          – Has a Firefox plug-in which allows time-based browsing
          – Initiative of Los Alamos Laboratories
          – See http://www.mementoweb.org/demo/
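Memento's idea of "adding time to the web" can be sketched in two parts: the client states the moment it wants in an `Accept-Datetime` request header, and the service answers with the capture closest to that moment across all archives it aggregates. The sketch below is a simplified illustration, not the Memento software. The capture lists are hypothetical, and the 14-digit timestamp format is the one commonly used in CDX indexes.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Request header asking for a resource as it was at `when` (HTTP date in GMT)."""
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


def closest_capture(timestamps, target: datetime) -> str:
    """Pick the YYYYMMDDhhmmss capture timestamp nearest to the target time."""
    def parse(stamp: str) -> datetime:
        return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return min(timestamps, key=lambda stamp: abs(parse(stamp) - target))


if __name__ == "__main__":
    wanted = datetime(2007, 5, 31, 20, 35, tzinfo=timezone.utc)
    # Hypothetical captures of one URL held by two different archives.
    archive_a = ["20050101000000", "20120101000000"]
    archive_b = ["20070601000000"]
    print(accept_datetime_header(wanted))
    print(closest_capture(archive_a + archive_b, wanted))
```

Merging the capture lists before choosing is what makes the aggregation useful: the best answer may sit in an archive the user would not have thought to search.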
  • 37. Common Use Cases for a web archive
      • Content discovery
      • Nostalgia queries
      • Web site restoration and file recovery
      • Domain name valuation
      • Fall-back for link-rot
      • Prior art analysis and patent/copyright infringement research
      • Legal cases
      • Topic analysis, web trends analysis, popularity analysis, network analysis, linguistic analysis
  • 38. Create your own Archive
      • Use a subscription service
      • Build your own web archiving infrastructure with open source software (i.e. Heritrix and Wayback)
      • Use web citation services that create archive copies as you bookmark pages
  • 39. Subscription Services
      • archive-it.org (service operated by non-profit Internet Archive since 2006)
      • archivethe.net (service operated by non-profit Internet Memory Foundation)
      • California Digital Library Web Archiving Service - cdlib.org/services/uc3/was.html
      • OCLC Harvester Service - oclc.org/webharvester/overview/default.htm
  • 41. Install web archiving system locally
      • Easy-to-deploy web archiving toolkit not yet available
      • Institutional web archiving infrastructure is feasible and has been established at a number of universities for use by researchers, though it needs IT systems engineers to set up
      • Archives can be deposited with the NLA for long-term preservation
  • 42. Personal Web Archiving
      • WARCreate – recently released free tool which creates Wayback-consumable WARC files from any web page
      • Google Chrome extension
      • Enables preservation by users from their desktop
      • Can target content unreachable by crawlers
      • Brings WARC to personal digital archiving
      • What you do with the WARC files is up to you
      • Install suite provided to set up local Wayback instance and Memento timegate
  • 43. Current challenges
      • Database-driven features and functions
      • Complex and varying URI formats and non-standard link implementations, e.g. Twitter
      • Dynamically generated ever-changing URIs
          – For serving the same resources
      • Rich Media – e.g. streamed media with custom apps and anti-collection measures
      • Scripted incremental display and page-loading
  • 44. … more…
      • Scripted HTML forms
      • Multi-sourced embedded material
      • Dynamic authentication e.g. captchas, cross-site authentication, user-sensitive embeds
      • Alternate display based on browser or device, or other parameter
      • Site architecture designed to inhibit crawling and indexing – but if poorly done even ‘polite’ harvesters like Heritrix may crash their server
  • 45. … but wait, there’s more … •  Server-side scripts and remote procedure calls – the full variety of paths through a site is now often hidden in remote/opaque server-side code – not a new problem but now affects 80+% of online resources •  HTML5 WebSockets – effectively codify incremental updates without page reloads •  Mobile publishing
  • 46. Transactional Web Archiving •  Useful for institutional archiving –  Best for record-keeping purposes, e.g. when challenged in court about content on a web site –  Can be used to ensure URL persistence, e.g. when a site has a make-over – can intercept 404s –  No ‘gaps’, cf. the crawl approach – every change in accessed content is archived –  However, requires a code snippet to be installed on the web server –  Open source software being developed by Los Alamos Labs
  • 47. Web Data Mining & Analysis – What is it? Why Do It? •  Innovation is increasingly driven by large-scale data analysis •  Need fast iteration to understand the right questions to ask •  More minds able to contribute = more value (perceived and real) placed on the importance of the data •  Increased demand for/value of the data = more funding to support it •  Need to surface the information amongst all that data…
  • 48. Platform & Toolkit: Overview •  Software –  Apache Hadoop –  Apache Pig •  Data/File format –  WARC –  CDX –  WAT (new!)
  • 49. Apache Hadoop •  HDFS –  Distributed storage –  Durable, default 3x replication –  Scalable: Yahoo! 60+PB HDFS •  MapReduce –  Distributed computation –  You write Java functions –  Hadoop distributes work across cluster –  Tolerates failures
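A minimal single-process sketch of the MapReduce model described on this slide, written in Python for brevity (on Hadoop itself the map and reduce functions would be written in Java, and the grouping step would run across the cluster). The capture records and the function names `map_capture`, `reduce_counts` and `run_mapreduce` are illustrative assumptions, not part of Hadoop.

```python
from collections import defaultdict


def map_capture(record):
    """Map step: emit one (key, value) pair for each input record."""
    yield record["mime"], 1


def reduce_counts(key, values):
    """Reduce step: combine every value emitted under the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key is reduced independently, which is what lets a real
    # cluster spread the reduce work over many machines.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Made-up capture records standing in for archive data.
captures = [
    {"url": "http://example.org/", "mime": "text/html"},
    {"url": "http://example.org/a.pdf", "mime": "application/pdf"},
    {"url": "http://example.org/about", "mime": "text/html"},
]

counts = run_mapreduce(captures, map_capture, reduce_counts)
print(counts)
```

The point of the split is that neither function needs to know how many machines are involved; only the framework does.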
  • 50. File formats and data: WARC
  • 51. File formats and data: CDX •  Index used to browse WARC-based archive •  Space-delimited text file •  Only the essential metadata needed by Wayback –  URL –  Content Digest –  Capture Timestamp –  Content-Type –  HTTP response code –  etc.
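A hedged sketch of reading a CDX file in Python. It assumes the commonly seen 11-field layout whose header line is ` CDX N b a m s k r M S V g`; other field orders exist, so the parser takes the order from the header rather than hard-coding it. The sample row, the friendly field names and the function name `parse_cdx` are invented for illustration.

```python
# Legend letters for the assumed 11-field layout, mapped to readable names.
CDX_FIELD_NAMES = {
    "N": "massaged_url",
    "b": "timestamp",
    "a": "original_url",
    "m": "mime_type",
    "s": "status",
    "k": "digest",
    "r": "redirect",
    "M": "meta_tags",
    "S": "compressed_size",
    "V": "offset",
    "g": "filename",
}


def parse_cdx(lines):
    """Return one dict per capture, using the header line for field order."""
    rows = [line.strip() for line in lines if line.strip()]
    header = rows[0].split()
    if header[0] != "CDX":
        raise ValueError("first line is not a CDX header")
    names = [CDX_FIELD_NAMES.get(letter, letter) for letter in header[1:]]
    return [dict(zip(names, row.split())) for row in rows[1:]]


# Made-up sample: a header plus a single capture row.
sample = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20120815103000 http://example.org/ text/html 200 "
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1043 5120 example-20120815.warc.gz",
]

records = parse_cdx(sample)
print(records[0]["timestamp"], records[0]["status"], records[0]["filename"])
```

The `offset` and `filename` fields are what let a replay tool seek straight to the matching record inside a WARC file.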
  • 52. File formats and data: WAT •  Yet Another Metadata Format! ☺ ☹ •  Not a preservation format •  Data exchange and analysis •  Less than full WARC, more than CDX •  Essential metadata for many types of analysis •  Avoids barriers to data exchange: copyright, privacy •  Work-in-progress: we want your feedback
  • 53. File formats and data: WAT •  WAT is WARC ☺ –  WAT records are WARC metadata records –  WARC-Refers-To header identifies original WARC record •  WAT payload is JSON –  Compact –  Hierarchical –  Supported by every programming environment •  File formats & data: –  CDX: 53 MB –  WAT: 443 MB –  WARC: 8,651 MB
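A rough sketch of what "WAT is WARC, payload is JSON" means in practice: a metadata record whose header block points back to the original capture and whose body can be read with any JSON library. The record below is hand-made and simplified (no Content-Length, for example), and the JSON keys `url` and `links` are hypothetical stand-ins; the slides do not show the actual WAT payload schema.

```python
import json

# Hand-made, simplified record; all values are invented.
SAMPLE_WAT_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"url": "http://example.org/", "links": ["http://example.org/about"]}'
)


def parse_record(text):
    """Split a single record into version line, header dict and JSON payload."""
    head, _, body = text.partition("\r\n\r\n")
    version, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, headers, json.loads(body)


version, headers, payload = parse_record(SAMPLE_WAT_RECORD)
print(headers["WARC-Refers-To"], len(payload["links"]))
```

Because the payload is plain JSON, link graphs or content-type statistics can be extracted without touching the much larger WARC files.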
  • 54. Some References •  http://en.wikipedia.org/wiki/Web_archiving •  http://netpreserve.org/about/archiveList.php •  Web Archives: The Future(s) - http://www.netpreserve.org/publications/2011_06_IIPC_WebArchives-TheFutures.pdf •  http://matkelly.com/warcreate/ •  Common Crawl: http://commoncrawl.org/data/accessing-the-data/
  • 55. Contacts •  Webarchive @ nla.gov.au •  Secretariat @ internetmemory.org •  Queries about the Internet Archive web archive: http://iawebarchiving.wordpress.com/ •  Queries about the Archive-It service: http://www.archive-it.org/contact-us •  momodei @ nla.gov.au (until 31 Aug 2012) or monica.omodei @ gmail.com