Archiving the Internet Before it All Rots Away

•

1 like•76 views

Could you imagine an internet where all links stopped working after 4 years? All the old blogs from the 90’s… gone, all your hot takes on Twitter… gone, all the news and reporting… gone. Some of that decay is good, no one wants the entire internet to be preserved for eternity, but most of that decay leads to great content disappearing forever, and future generations being deprived of access to the most important medium for knowledge in the last half century. If no one worked on preserving that information, the human race would be facing a loss of knowledge many times greater than the burning of the Library of Alexandria. Luckily organizations like Archive.org and the Internet Preservation Consortium work tirelessly every day to save what they can. But archiving doesn’t have to be exclusive to big organizations, we can all play a part by archiving the stuff that matters to us locally. Learn about the internet archiving community, the tools of the trade, and how to save content you care about in this talk! Slides, talk recording, and more available here: https://github.com/pirate/internet-archiving-talk

Technology

Archiving the Internet
Before it All Rots Away
Nick Sweeting (@theSquashSH)
PyCon Colombia 2020
Monadical Inc.

Nick Swee)ng

twi-er.com/theSquashSH github.com/pirate
Co-Founder @ Monadical.com

We're a so)ware development consultancy!
- Python & JS full-stack work
- Lots of open-source + interes)ng projects
- Fully remote & ﬂexible hours
Disclaimer: I am not a digital preserva)on professional,
I don't work for any archiving organiza)ons. I just think it's neat.

Well... where I grew up...
• my connection was 1.5mbps on a good day, with ~200ms ping

• the internet was heavily censored (often unpredictably)

• Google something sketchy and your house goes oﬄine for 30min
Having reliable access to content

is a privilege that we often take for granted.

I just got tired of links I care about 404ing over time

So I built a tool to archive my Pocket stream automatically

Then Equifax happened...

https://docs.sweeting.me/s/equifax-security-incident

Only the 2nd mention of wget in NYTimes history.

https://docs.sweeting.me/s/equifax-security-incident

The power of wget is amazing!

DEMO TIME

But it's not perfect...
• How do we handle scripts like Javascript?

• How do we handle REST API requests?

• How do we handle dynamic content like video & games?

• Single Page Apps are the worst, plz stop building them

• How do we make sure it works long-term?

Enter... headless-browser-based archiving
• History of headless browsers

• Now we can run JS

• webrecorder.io vs heretrix vs archivebox

• Single Page Apps, games, dynamic content now works

• How do we make sure it works long-term?

Distributed archiving
• Central authority vs individual archivers?

• Distribute the archiving process or distribute the ﬁlesystem?

• Storage concerns, many hard drives are greater than a few

• Spreading content out makes it more resilient

• Can it replace the normal internet?

• Is it something to strive for?

Eﬀective archiving
• What happens when the whole internet becomes immutable?

• Is there such a thing as "too much data"?

• Should we be curating what we archive?

• Should people have a right to be forgotten?

• How will it change the culture of the internet?

• How will it impact the long-term durability of the internet?

How can you get started with archiving today?
• Run ArchiveBox or WebRecorder.io

• Help host / mirror content thats often targeted

• https://github.com/pirate/wikipedia-mirror

• BitTorrent / I2P

• DAT / IPFS

- Join the ArchiveTeam

- Donate

https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community

Archive Wikipedia, TED, project Gutenberg, etc.
https://github.com/pirate/wikipedia-mirror

https://kiwix.org

https://other-wiki.zervice.io

Theres a whole community of professionals and
volunteers who do this.
https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community

Thank You!
Q&A
Monadical.com is hiring remote full-stack devs!
Slides:
github.com/pirate/internet-archiving-talk

Video:
youtube.com/watch?v=7eoz_EU6-wQ

Recently uploaded

Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez

Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software

A Domino Admins Adventures (Engage 2024)Gabriella Davis

Real Time Object Detection Using Open CVKhem

HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays

MINDCTI Revenue Release Quarter One 2024MIND CTI

Scaling API-first – The story of a global engineering organizationRadu Cotescu

Artificial Intelligence: Facts and MythsJoaquim Jorge

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10

A Year of the Servo Reboot: Where Are We Now?Igalia

Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions

presentation ICT roal in 21st century educationjfdjdjcjdnsjd

TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc

TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc

GenAI Risks & Security Meetup 01052024.pdflior mazor

Boost PC performance: How more available memory can improve productivityPrincipled Technologies

Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Boost Fertility New Invention Ups Success Rates.pdf

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

A Domino Admins Adventures (Engage 2024)

Real Time Object Detection Using Open CV

HTML Injection Attacks: Impact and Mitigation Strategies

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

MINDCTI Revenue Release Quarter One 2024

Scaling API-first – The story of a global engineering organization

Artificial Intelligence: Facts and Myths

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

A Year of the Servo Reboot: Where Are We Now?

Top 10 Most Downloaded Games on Play Store in 2024

presentation ICT roal in 21st century education

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

GenAI Risks & Security Meetup 01052024.pdf

Boost PC performance: How more available memory can improve productivity

Data Cloud, More than a CDP by Matt Robison

Featured

Content Methodology: A Best Practices Report (Webinar)contently

How to Prepare For a Successful Job Search for 2024Albert Qian

Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)

Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal

5 Public speaking tips from TED - Visualized summarySpeakerHub

ChatGPT and the Future of Work - Clark Boyd Clark Boyd

Getting into the tech field. what next Tessa Mero

Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray

How to have difficult conversations Rajiv Jayarajah, MAppComm, ACC

Introduction to Data ScienceChristy Abraham Joy

Time Management & Productivity - Best PracticesVit Horky

The six step guide to practical project managementMindGenius

Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36

Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools

12 Ways to Increase Your Influence at WorkGetSmarter

ChatGPT webinar slidesAlireza Esmikhani

More than Just Lines on a Map: Best Practices for U.S Bike RoutesProject for Public Spaces & National Center for Biking and Walking

Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference

Barbie - Brand Strategy PresentationErica Santiago

Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software

Featured (20)

Content Methodology: A Best Practices Report (Webinar)

How to Prepare For a Successful Job Search for 2024

Social Media Marketing Trends 2024 // The Global Indie Insights

Trends In Paid Search: Navigating The Digital Landscape In 2024

5 Public speaking tips from TED - Visualized summary

ChatGPT and the Future of Work - Clark Boyd

Getting into the tech field. what next

Google's Just Not That Into You: Understanding Core Updates & Search Intent

How to have difficult conversations

Introduction to Data Science

Time Management & Productivity - Best Practices

The six step guide to practical project management

Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...

Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...

12 Ways to Increase Your Influence at Work

ChatGPT webinar slides

More than Just Lines on a Map: Best Practices for U.S Bike Routes

Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...

Barbie - Brand Strategy Presentation

Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well

Archiving the Internet Before it All Rots Away

1. Archiving the Internet Before it All Rots Away Nick Sweeting (@theSquashSH) PyCon Colombia 2020 Monadical Inc.

2. Nick Swee)ng twi-er.com/theSquashSH github.com/pirate Co-Founder @ Monadical.com We're a so)ware development consultancy! - Python & JS full-stack work - Lots of open-source + interes)ng projects - Fully remote & ﬂexible hours Disclaimer: I am not a digital preserva)on professional, I don't work for any archiving organiza)ons. I just think it's neat.

3. Why am I into internet archiving?

4. Well... where I grew up... • my connection was 1.5mbps on a good day, with ~200ms ping • the internet was heavily censored (often unpredictably) • Google something sketchy and your house goes oﬄine for 30min Having reliable access to content is a privilege that we often take for granted.

5. I just got tired of links I care about 404ing over time So I built a tool to archive my Pocket stream automatically

6. Then Equifax happened... https://docs.sweeting.me/s/equifax-security-incident

7. Then Equifax happened...

8. Then Equifax happened...

9. Only the 2nd mention of wget in NYTimes history. https://docs.sweeting.me/s/equifax-security-incident

10. The power of wget is amazing! DEMO TIME

11. But it's not perfect... • How do we handle scripts like Javascript? • How do we handle REST API requests? • How do we handle dynamic content like video & games? • Single Page Apps are the worst, plz stop building them • How do we make sure it works long-term?

12. Enter... headless-browser-based archiving • History of headless browsers • Now we can run JS • webrecorder.io vs heretrix vs archivebox • Single Page Apps, games, dynamic content now works • How do we make sure it works long-term?

13. Distributed archiving • Central authority vs individual archivers? • Distribute the archiving process or distribute the ﬁlesystem? • Storage concerns, many hard drives are greater than a few • Spreading content out makes it more resilient • Can it replace the normal internet? • Is it something to strive for?

14. Eﬀective archiving • What happens when the whole internet becomes immutable? • Is there such a thing as "too much data"? • Should we be curating what we archive? • Should people have a right to be forgotten? • How will it change the culture of the internet? • How will it impact the long-term durability of the internet?

15. How can you get started with archiving today? • Run ArchiveBox or WebRecorder.io • Help host / mirror content thats often targeted • https://github.com/pirate/wikipedia-mirror • BitTorrent / I2P • DAT / IPFS - Join the ArchiveTeam - Donate https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community

16. Archive Wikipedia, TED, project Gutenberg, etc. https://github.com/pirate/wikipedia-mirror https://kiwix.org https://other-wiki.zervice.io

17. Theres a whole community of professionals and volunteers who do this. https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community

18. Thank You! Q&A Monadical.com is hiring remote full-stack devs! Slides: github.com/pirate/internet-archiving-talk Video: youtube.com/watch?v=7eoz_EU6-wQ

Archiving the Internet Before it All Rots Away

Recommended

Recommended

More Related Content

Recently uploaded

Recently uploaded (20)

Featured

Featured (20)

Archiving the Internet Before it All Rots Away