SlideShare a Scribd company logo
1 of 16
Web Archives are Big Data
• Processing requires computing clusters
• i.e., Hadoop, YARN, Spark, …
• Web archive data is heterogeneous, may include text, video, images, …
• Requires cleaning, filtering, selection, extraction and finally, processing
1
Source: Yahoo!
• Distributed computing
• MapReduce or variants
• Transform, aggregate, write
• Homogeneous data
• Well-defined format, clean
• Easy to split and handle
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Web Archives as Scholarly Source
• Web archive data is heterogeneous, may include text, video, images, …
• Requires cleaning, filtering, selection, extraction and finally, processing
• Data is temporal, time differences within a website, among websites
• A resource may have been crawled and be included multiple times
• With no or only slight changes, layout changes, content changes, …
• Users are often from various disciplines, not only computer scientists
• Social scientists, political scientists, historians, …
• Want to express their needs, define sub-sets, create research corpora
• Require a format they know, which is readable and reusable
• Technical persons can support, but need the tools
2
23/05/2016 Helge Holzmann (holzmann@L3S.de)
ArchiveSpark:
Efficient Web Archive
Access, Extraction and Derivation
Helge Holzmann, Vinay Goel, Avishek Anand
To appear at JCDL 2016 (please cite)
https://github.com/helgeho/ArchiveSpark
23/05/2016 Helge Holzmann (holzmann@L3S.de) 3
Overview
• Goal: Easy research corpus building from Web archives
• Framework based on Apache Spark [1]
• Supports efficient filter / selection operation
• Supports derivations by applying third-party libraries
• Output in a pretty and machine readable format
• Identified six objectives based on practical requirements
• Benchmarked against pure Spark and HBase
• Both using WarcBase helpers [2]
4
[1] http://spark.apache.org
[2] https://github.com/lintool/warcbase
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Example Use Case
• Political scientist wants to analyze sentiments and reactions on the
Web from a previous election cycle.
• Five decisions to be taken to narrow down the dataset:
1. Filter temporally to focus on the election period
2. Select text documents by filtering on MIME type
3. Only keep online captures with HTTP status code 200
4. Look for political signal terms in the content to get rid of unrelated pages
5. Choose a captured version, for instance the latest of each page
• Finally, extract relevant terms / snippets to analyze sentiments
• Document lineage, e.g., the title might have more value than the body text
5
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Objectives
1. A simple and expressive interface
2. Compliance to and reuse of standard formats
3. An efficient selection and filtering process
4. An easily extensible architecture
5. Lineage support to comprehend and reconstruct derivations
6. Output in a standard, readable and reusable format
6
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O1: Simple and expressive interface
• Based on Spark
• Expressing instructions in Scala
• Simple accessors provided
• Easy extraction / enrichment mechanisms
7
val rdd = ArchiveSpark.hdfs(cdxPath, warcPath)
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val entities = onlineHtml.enrich(Entities)
entities.saveAsJson("entities.gz")
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O2: Standard formats
• (W)ARC and CDX widely used among the big Web archives
• (Web) ARChive files (ISO 28500:2009)
• Crawl index
• No additional pre-processing / indexing required
8
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O3: Efficient selection and filtering
• Initial filtering purely based on meta data (CDX)
• without touching the archive
• Seamless integration of contents / extractions / enrichments
9
val Title = HtmlText.of(Html.first("title"))
val filtered = rdd.filter(r => r.status == 200 && r.mime == "text/html")
val corpus = filtered.enrich(Title)
.filter(r => r.value(Title).get.contains("science"))
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O4: Easily extensible architecture
• Enrich functions allow integrating any Java/Scala code/library
• with a unified interface
• Base classes provided
• Implementations need to define the following only:
1. Default dependency enrich function (e.g., StringContent)
2. Function body (Java/Scala code / library calls, e.g., Named Entity Extractor)
3. Result fields (e.g., persons, organizations, locations)
10
rdd.mapEnrich(StringContent, "length") {str => str.length}
Dependency Result field Body
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O5: Lineage documentation
• Nested JSON objects represent enrichments / derivations
11
Record (meta data)
Payload
String
Length
23/05/2016 Helge Holzmann (holzmann@L3S.de)
O6: Readable / reusable output
• Filtered, cleaned, enriched JSON vs. raw (W)ARC
• widely used, structured, pretty printable, reusable
12
vs.
23/05/2016 Helge Holzmann (holzmann@L3S.de)
The ArchiveSpark Approach
• Filter as much as possible on meta data before touching the archive
• Enrich instead of mapping / transforming data
13
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Benchmarks
• Three scenarios, from basic to more sophisticated:
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
• Benchmarks do not include derivations
• Those are applied on top of all three methods and involve third-party libraries
14
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Conclusion
• Expressive and efficient corpus creation from Web archives
• Easily extensible
• Please provide enrich functions of your own tools / third-party libraries
• Open source
• Fork us on GitHub: https://github.com/helgeho/ArchiveSpark
• Star, contribute, fix, spread, get involved!
• Learn more at our hands-on session, today at 3:30 pm.
15
23/05/2016 Helge Holzmann (holzmann@L3S.de)
Thank you!
23/05/2016 Helge Holzmann (holzmann@L3S.de)
16
https://github.com/helgeho/ArchiveSpark

More Related Content

What's hot

Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data ApplicationsEUCLID project
 
How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?Dirk Lewandowski
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked DataEUCLID project
 
Annotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelAnnotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelRobert Sanderson
 
Union catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapUnion catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapAAT Taiwan
 
File organization
File organizationFile organization
File organizationGokul017
 
File Organization
File OrganizationFile Organization
File OrganizationManyi Man
 
Fundamental File Processing Operations
Fundamental File Processing OperationsFundamental File Processing Operations
Fundamental File Processing OperationsRico
 

What's hot (13)

Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data Applications
 
Thompson 6-jun15-final
Thompson 6-jun15-finalThompson 6-jun15-final
Thompson 6-jun15-final
 
How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?How can library materials be ranked in the OPAC?
How can library materials be ranked in the OPAC?
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
File organisation
File organisationFile organisation
File organisation
 
Annotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation ModelAnnotating Scholarly Works - the W3C Open Annotation Model
Annotating Scholarly Works - the W3C Open Annotation Model
 
File organization
File organizationFile organization
File organization
 
Union catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapUnion catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldap
 
File organization
File organizationFile organization
File organization
 
File Organization
File OrganizationFile Organization
File Organization
 
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIsIntroduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
 
Fundamental File Processing Operations
Fundamental File Processing OperationsFundamental File Processing Operations
Fundamental File Processing Operations
 

Similar to ArchiveSpark Introduction @ WebSci' 2016 Hackathon

Medical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSparkMedical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSparkHelge Holzmann
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesHelge Holzmann
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesHelge Holzmann
 
ontology.ppt
ontology.pptontology.ppt
ontology.pptPrerak10
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019Helge Holzmann
 
Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?John F. Holliday
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyRobert Viseur
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAnkur Biswas
 
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPathStructured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPathJosh Anderson
 
Bringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersBringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersUniversity of Bologna
 
Reengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsReengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsMoutasm Tamimi
 
From Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource ManagerFrom Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource ManagerRindra Ramli
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM4Science
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and SharingC. Tobin Magle
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365DocFluix, LLC
 

Similar to ArchiveSpark Introduction @ WebSci' 2016 Hackathon (20)

Medical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSparkMedical Heritage Library (MHL) on ArchiveSpark
Medical Heritage Library (MHL) on ArchiveSpark
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web Archives
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Web Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web ArchivesWeb Data Engineering - A Technical Perspective on Web Archives
Web Data Engineering - A Technical Perspective on Web Archives
 
Ld4 l triannon
Ld4 l triannonLd4 l triannon
Ld4 l triannon
 
ontology.ppt
ontology.pptontology.ppt
ontology.ppt
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019ArchiveSpark at CEDWARC workshop 2019
ArchiveSpark at CEDWARC workshop 2019
 
Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?Who says you can't do records management in SharePoint?
Who says you can't do records management in SharePoint?
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 
An Introduction to Semantic Web Technology
An Introduction to Semantic Web TechnologyAn Introduction to Semantic Web Technology
An Introduction to Semantic Web Technology
 
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPathStructured Strategy: How to Supercharge Your Content Analysis with XML and XPath
Structured Strategy: How to Supercharge Your Content Analysis with XML and XPath
 
Bringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersBringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointers
 
Reengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsReengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software Specifications
 
From Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource ManagerFrom Millennium ERMS to Proquest 360 Resource Manager
From Millennium ERMS to Proquest 360 Resource Manager
 
Putting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAMPutting Historical Data in Context: how to use DSpace-GLAM
Putting Historical Data in Context: how to use DSpace-GLAM
 
Data Archiving and Sharing
Data Archiving and SharingData Archiving and Sharing
Data Archiving and Sharing
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365IA& Taxonomy Planning for SharePoint Online & Office 365
IA& Taxonomy Planning for SharePoint Online & Office 365
 

Recently uploaded

Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 

Recently uploaded (20)

Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 

ArchiveSpark Introduction @ WebSci' 2016 Hackathon

  • 1. Web Archives are Big Data • Processing requires computing clusters • i.e., Hadoop, YARN, Spark, … • Web archive data is heterogeneous, may include text, video, images, … • Requires cleaning, filtering, selection, extraction and finally, processing 1 Source: Yahoo! • Distributed computing • MapReduce or variants • Transform, aggregate, write • Homogeneous data • Well-defined format, clean • Easy to split and handle 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 2. Web Archives as Scholarly Source • Web archive data is heterogeneous, may include text, video, images, … • Requires cleaning, filtering, selection, extraction and finally, processing • Data is temporal, time differences within a website, among websites • A resource may have been crawled and be included multiple times • With no or only slight changes, layout changes, content changes, … • Users are often from various disciplines, not only computer scientists • Social scientists, political scientists, historians, … • Want to express their needs, define sub-sets, create research corpora • Require a format they know, which is readable and reusable • Technical persons can support, but need the tools 2 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 3. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation Helge Holzmann, Vinay Goel, Avishek Anand To appear at JCDL 2016 (please cite) https://github.com/helgeho/ArchiveSpark 23/05/2016 Helge Holzmann (holzmann@L3S.de) 3
  • 4. Overview • Goal: Easy research corpus building from Web archives • Framework based on Apache Spark [1] • Supports efficient filter / selection operation • Supports derivations by applying third-party libraries • Output in a pretty and machine readable format • Identified six objectives based on practical requirements • Benchmarked against pure Spark and HBase • Both using WarcBase helpers [2] 4 [1] http://spark.apache.org [2] https://github.com/lintool/warcbase 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 5. Example Use Case • Political scientist wants to analyze sentiments and reactions on the Web from a previous election cycle. • Five decisions to be taken to narrow down the dataset: 1. Filter temporally to focus on the election period 2. Select text documents by filtering on MIME type 3. Only keep online captures with HTTP status code 200 4. Look for political signal terms in the content to get rid of unrelated pages 5. Choose a captured version, for instance the latest of each page • Finally, extract relevant terms / snippets to analyze sentiments • Document lineage, e.g., the title might have more value than the body text 5 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 6. Objectives 1. A simple and expressive interface 2. Compliance to and reuse of standard formats 3. An efficient selection and filtering process 4. An easily extensible architecture 5. Lineage support to comprehend and reconstruct derivations 6. Output in a standard, readable and reusable format 6 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 7. O1: Simple and expressive interface • Based on Spark • Expressing instructions in Scala • Simple accessors provided • Easy extraction / enrichment mechanisms 7 val rdd = ArchiveSpark.hdfs(cdxPath, warcPath) val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html") val entities = onlineHtml.enrich(Entities) entities.saveAsJson("entities.gz") 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 8. O2: Standard formats • (W)ARC and CDX widely used among the big Web archives • (Web) ARChive files (ISO 28500:2009) • Crawl index • No additional pre-processing / indexing required 8 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 9. O3: Efficient selection and filtering • Initial filtering purely based on meta data (CDX) • without touching the archive • Seamless integration of contents / extractions / enrichments 9 val Title = HtmlText.of(Html.first("title")) val filtered = rdd.filter(r => r.status == 200 && r.mime == "text/html") val corpus = filtered.enrich(Title) .filter(r => r.value(Title).get.contains("science")) 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 10. O4: Easily extensible architecture • Enrich functions allow integrating any Java/Scala code/library • with a unified interface • Base classes provided • Implementations need to define the following only: 1. Default dependency enrich function (e.g., StringContent) 2. Function body (Java/Scala code / library calls, e.g., Named Entity Extractor) 3. Result fields (e.g., persons, organizations, locations) 10 rdd.mapEnrich(StringContent, "length") {str => str.length} Dependency Result field Body 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 11. O5: Lineage documentation • Nested JSON objects represent enrichments / derivations 11 Record (meta data) Payload String Length 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 12. O6: Readable / reusable output • Filtered, cleaned, enriched JSON vs. raw (W)ARC • widely used, structured, pretty printable, reusable 12 vs. 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 13. The ArchiveSpark Approach • Filter as much as possible on meta data before touching the archive • Enrich instead of mapping / transforming data 13 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 14. Benchmarks • Three scenarios, from basic to more sophisticated: a) Select one particular URL b) Select all pages (MIME type text/html) under a specific domain c) Select the latest successful capture (HTTP status 200) in a specific month • Benchmarks do not include derivations • Those are applied on top of all three methods and involve third-party libraries 14 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 15. Conclusion • Expressive and efficient corpus creation from Web archives • Easily extensible • Please provide enrich functions of your own tools / third-party libraries • Open source • Fork us on GitHub: https://github.com/helgeho/ArchiveSpark • Star, contribute, fix, spread, get involved! • Learn more at our hands-on session, today at 3:30 pm. 15 23/05/2016 Helge Holzmann (holzmann@L3S.de)
  • 16. Thank you! 23/05/2016 Helge Holzmann (holzmann@L3S.de) 16 https://github.com/helgeho/ArchiveSpark