Working with Wayback Machine Google Sheets Service and Internet Archive

Under the main SUCHO spreadsheet, you’ll see an Internet Archive (IA) tab. That tab lists the sites that need to be backed up to IA. You will often see the same site listed under the Browsertrix tab as well, and that’s not an error: we are trying to have backups of our backups. You can volunteer to archive a website to IA by putting your name in Claimed By and changing the status to In Progress.

There are instructions in the sheet for how to proceed, but generally you should:

  1. Copy the link into the Wayback Machine URL box to see if it has been archived recently, and check whether there are many broken links.
    • If the site looks like it is mostly functioning as you click around and you don’t see any 300- or 400-level errors, then you can mark it as done and start on the next one.
  2. If the site is either not listed or its last snapshots are more than a few months old, start copying links into a new Google spreadsheet. You can name the spreadsheet anything you like; all you need is one column where you paste each of the URLs. These should generally be top-level URLs (the ones you see in your browser’s address bar as you move around the site).
  3. Also be sure to include URLs for files that are downloadable, like PDFs, images, etc.
    • You do not need to download files. We previously suggested downloading these and then uploading them manually, but that is no longer our suggestion unless the file type is not supported by the Internet Archive. If you’re unsure, feel free to ask questions in either the #waybackmachine or #internetarchive Slack channels.
  4. Once you’re done, you can submit the spreadsheet to be processed by the Internet Archive (instructions below).
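
If you would like a quick first pass before clicking around by hand, the "has it been archived recently" check in step 1 can be roughly sketched in Python. This is only an illustration, not part of the official SUCHO workflow. It assumes the public Wayback Machine availability endpoint at https://archive.org/wayback/available and the JSON shape that endpoint returns. The function names and the 90-day cutoff (standing in for "a few months") are hypothetical choices made for this example.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def fetch_availability(url):
    """Ask the availability endpoint about one URL (requires network access)."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=30) as resp:
        return json.load(resp)


def needs_archiving(payload, now, max_age_days=90):
    """Return True if no snapshot is reported, or the closest one is too old.

    max_age_days=90 is an illustrative stand-in for "a few months".
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return True
    # Snapshot timestamps look like "20220301120000" (YYYYMMDDhhmmss, UTC).
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    taken = taken.replace(tzinfo=timezone.utc)
    return now - taken > timedelta(days=max_age_days)
```

For example, `needs_archiving(fetch_availability("https://museum.example/"), datetime.now(timezone.utc))` would report whether a (made-up) site looks due for a fresh capture. A recent snapshot does not guarantee that the archived pages work, so you should still click around and look for broken links as described in step 1.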

If you are seeing a lot of similarly patterned links and there are more links than you can easily capture in an hour or so, you might consider flagging the site as “Needs Scraping” so that we can try to scrape the links programmatically. This is mostly for exceptional cases, though, so do your best, and know that you can first submit a subset of links to the Wayback Machine and then go back to add additional links as you find them.
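
To give a sense of what scraping links programmatically can look like, here is a minimal sketch using only the Python standard library. It is not the tooling the SUCHO team actually uses. It assumes you already have a page's HTML as a string, and the names `LinkCollector` and `same_site_links` are hypothetical ones chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Make relative links absolute and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def same_site_links(html, base_url):
    """Return de-duplicated links on the same host as base_url, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    unique = []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique
```

The returned list can be pasted into the single URL column of your spreadsheet. Links to other hosts (and non-web links such as mailto:) are dropped here on the assumption that only the claimed site is being archived.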

Internet Archive Google Sheets Service

The Internet Archive has a Google Sheets service where you can submit a Google Sheet with the first column full of URLs (just the URLs), and it’ll work through archiving them all.
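
If you collect links in a text file or notes first, a small script can tidy them into the single-column shape described above before you paste or import them into Google Sheets. The sketch below is one possible approach, not something the service requires. The function names are hypothetical, and skipping any line that does not start with http:// or https:// is an assumption made for this example.

```python
import csv
import io


def clean_urls(raw_lines):
    """Strip whitespace, drop blanks and duplicates, and keep first-seen order."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        url = line.strip()
        if not url or url in seen:
            continue
        # Assumption: anything that is not an http(s) URL is a stray note.
        if not url.startswith(("http://", "https://")):
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


def to_single_column_csv(urls):
    """Render the URLs as CSV text with exactly one column and no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()
```

Saving the output of `to_single_column_csv(clean_urls(lines))` to a .csv file and importing it into a new Google Sheet should leave the URLs in the first column, with nothing else in the sheet.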

Some of the other individual task tutorials for SUCHO involve building spreadsheets to submit to the Internet Archive in this way.

Updates to Google Sheet after Submission

After you’ve submitted the Google Sheet to the Internet Archive, the sheet will automatically be updated after every 80 rows that are processed. The Internet Archive Google Sheets Service adds the following columns: