Mirroring the SUCHO web archives
The SUCHO Web Archives are publicly accessible as a dataset on the AWS Open Data program. You can help keep them safe by mirroring the S3 bucket. We recommend that you reserve 100TB of disk space for the mirror.
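Before starting, it can be useful to confirm that the target filesystem really has that much room. The following is a minimal sketch, not part of the official instructions: the helper name has_free_space and the variable REQUIRED_KIB are hypothetical, and 100TB is interpreted as 100 * 1024^3 KiB, which is an assumption since the text does not say whether TB or TiB is meant.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 KiB free.
# "df -Pk" prints POSIX-format output in 1024-byte units; the fourth
# column of the second line is the available space.
has_free_space() {
  avail_kib=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail_kib" -ge "$2" ]
}

# 100 TB taken as 100 * 1024^3 KiB (assumption, see above).
REQUIRED_KIB=$((100 * 1024 * 1024 * 1024))

# Example (the directory must already exist):
#   has_free_space /path/to/mirror "$REQUIRED_KIB" || echo "Not enough free space" >&2
```

The check only looks at free space at the moment it runs, so it is worth repeating if other data is written to the same filesystem.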
Step 1: Check the Open Data registry
Check the SUCHO entry of the AWS Open Data registry for the description of the dataset and the S3 bucket name, which is sucho-opendata.
Step 2: Download and install the AWS CLI
Install the aws CLI according to the install instructions. Alternatively, you can use rclone or s3cmd.
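To see which of these clients is already available on a machine, a small check like the one below may help. It is only a sketch: the function name find_s3_tool is hypothetical, and it merely looks for the three executables on the PATH without checking their versions.

```shell
# Print the first S3-capable client found on PATH (aws, rclone or s3cmd).
# Prints "none" and returns a non-zero status if none of them is installed.
find_s3_tool() {
  for tool in aws rclone s3cmd; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  echo "none"
  return 1
}

# Example:
#   find_s3_tool || echo "Please install the AWS CLI, rclone or s3cmd" >&2
```

If the aws client is found, running aws --version is a quick way to confirm that it starts correctly.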
Step 3: Sync the SUCHO bucket
Since the SUCHO Open Data bucket is publicly accessible, you do not need credentials to read from it. Use the CLI flag --no-sign-request to access the bucket fully anonymously.
The AWS region is
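Please use the sync command to mirror the bucket. On repeated runs, sync transfers only new or changed files, so updates made in the Open Data bucket are carried over to your destination. Note that the AWS CLI removes files from the destination that no longer exist in the bucket (for example after a deletion or rename) only if you also pass the --delete flag.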
Assuming your destination directory is a folder called /sucho_mirror/, the AWS sync command would look like this:
aws s3 sync --no-sign-request s3://sucho-opendata/ /sucho_mirror/
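Because the transfer is very large, it may be worth previewing it first. The sketch below wraps the command above in a small function so that extra flags can be passed through; the names mirror_sucho, SUCHO_BUCKET and SUCHO_DEST are hypothetical, while --dryrun is the AWS CLI flag that lists the planned operations without performing them.

```shell
# Bucket and destination as used in the sync command above.
SUCHO_BUCKET="s3://sucho-opendata/"
SUCHO_DEST="/sucho_mirror/"

# Run an anonymous sync. Any extra arguments (for example --dryrun)
# are passed straight through to "aws s3 sync".
mirror_sucho() {
  aws s3 sync --no-sign-request "$@" "$SUCHO_BUCKET" "$SUCHO_DEST"
}

# Preview what would be transferred, without downloading anything:
#   mirror_sucho --dryrun
# Then run the real transfer:
#   mirror_sucho
```

Re-running mirror_sucho later picks up only new or changed files, so the same function can be used for periodic updates of the mirror.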
To optimise transfer speeds, you can tweak S3 transfer settings such as max_concurrent_requests and the multipart chunk size. The average file size in the bucket is ca. 750MB, so you could set the chunk size to 512MB without problems.
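One way to apply such settings is through aws configure set, which writes them to the default profile of the AWS CLI configuration. This is a sketch under assumptions: the function name tune_s3_transfer is hypothetical, the chunk size is assumed to correspond to the AWS CLI setting multipart_chunksize, and the value 20 for concurrent requests is only an illustrative choice (the AWS CLI default is 10), not a recommendation from the SUCHO team.

```shell
# Store S3 transfer settings in the default AWS CLI profile.
#   $1: number of parallel requests
#   $2: multipart chunk size, e.g. 512MB
tune_s3_transfer() {
  aws configure set default.s3.max_concurrent_requests "$1"
  aws configure set default.s3.multipart_chunksize "$2"
}

# Example, using the 512MB chunk size mentioned above and an
# illustrative concurrency value:
#   tune_s3_transfer 20 512MB
```

Higher concurrency increases memory and network use, so it is sensible to raise the value gradually and watch the effect on throughput.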