SUCHO workflow

How we’ve organized things (last updated March 15th).

Anyone can send us links, with or without volunteering.

Joining the team

Want to get more hands-on? Great!

Things to do

Low tech helping

We have several teams working on projects that mostly need a web browser and enthusiasm (you don’t even need to read Cyrillic for most of these).

High tech helping

Can you read Ukrainian?

In addition to the Link Collection and Metadata volunteer groups above, there are a couple of teams where reading Ukrainian is a vital prerequisite:

How it all fits together

The diagram below shows how the pieces of this project work together.

Anyone can submit a link to SUCHO via our web form (1a). If you already have a large list of links to submit, reach out to info@sucho.org and we can accept a spreadsheet (1b).

Links go into the SUCHO working spreadsheet (2), and from there are sent to both the Browsertrix workflow (3a) and the Internet Archive workflow (3b).

The Internet Archive workflow (3b) involves making sure that the sites – including sub- and sub-sub-pages – are well captured by the Wayback Machine. Sometimes we need to submit new URLs to the Wayback Machine to capture the whole site as part of this process.

The Browsertrix workflow (3a) involves volunteers running Browsertrix (automated Webrecorder software) to generate a web archive file (WACZ) that is stored on the SUCHO server hosted by Amazon Web Services. WACZ files that appear to have been created successfully go to quality control (4a). If there are problems with the WACZ file, it goes back to the Browsertrix workflow (3a); otherwise, it’s considered done.

Sometimes the Browsertrix workflow isn’t (fully) successful, because some materials aren’t captured well by the automated crawler. Pages that require user interaction (e.g. virtual tours, complex JavaScript) go to a Manual Webrecorder workflow (4b) for completion. Pages that don’t have an easy set of navigable links (e.g. DSpace sites, library catalogs) go to a custom Scraping workflow (4c).

The custom scraping workflow (4c) varies depending on the site; it may produce a set of links that go to the Wayback Machine (3b), or it may produce a set of files (images, PDFs, etc.) that get uploaded to the SUCHO collection at the Internet Archive (5). Volunteers then enhance the metadata for uploaded files (6).
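For readers who think in code, the routing described above can be sketched as a few small Python functions. This is only an illustration of the decision points, not software SUCHO actually runs: the `Step` enum, the function names, and the boolean flags are all hypothetical, and the order in which the two Browsertrix failure cases are checked is an assumption.

```python
from enum import Enum
from typing import List


class Step(Enum):
    """Workflow steps, labeled with the numbers used in the text above."""
    BROWSERTRIX = "3a"
    WAYBACK = "3b"
    QUALITY_CONTROL = "4a"
    MANUAL_WEBRECORDER = "4b"
    CUSTOM_SCRAPING = "4c"
    IA_UPLOAD = "5"
    METADATA = "6"
    DONE = "done"


def after_browsertrix(needs_interaction: bool, has_navigable_links: bool) -> Step:
    """Pick the step that follows a Browsertrix crawl (3a).

    Assumption: interaction-heavy pages are checked before link structure.
    """
    if needs_interaction:
        # e.g. virtual tours, complex JavaScript
        return Step.MANUAL_WEBRECORDER
    if not has_navigable_links:
        # e.g. DSpace sites, library catalogs
        return Step.CUSTOM_SCRAPING
    return Step.QUALITY_CONTROL


def after_quality_control(wacz_ok: bool) -> Step:
    """A WACZ with problems goes back to Browsertrix; otherwise it is done."""
    return Step.DONE if wacz_ok else Step.BROWSERTRIX


def after_scraping(produced_files: bool) -> List[Step]:
    """Scraped files are uploaded and then described; scraped links go to the Wayback Machine."""
    if produced_files:
        return [Step.IA_UPLOAD, Step.METADATA]
    return [Step.WAYBACK]
```

For example, `after_quality_control(False)` returns `Step.BROWSERTRIX`, which mirrors a problematic WACZ file being sent back for another crawl.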

The SUCHO workflow as of March 15