S3Bundler takes a large number of objects and puts them into a small number of larger archives. It downloads all of the objects listed in a manifest, writes them to a tar archive, and uploads it to S3 with an index that includes metadata, tags, and an md5 checksum.
It can be run with a single manifest URI, or it can receive manifest URIs from an SQS queue.
```
usage: s3bundler.py [-h] [-q QUEUE] [-m s3://bucket/key] [-b BUCKET]
                    [-p PREFIX] [-f [F [F ...]]] [-s BYTES] [-c] [-v] [-d]

Bundle S3 objects from an inventory into an archive

optional arguments:
  -h, --help            show this help message and exit
  -q QUEUE, --queue QUEUE
                        SQS S3Bundler manifest queue.
  -m s3://bucket/key, --manifest s3://bucket/key
                        Manifest produced by s3grouper
  -b BUCKET, --bucket BUCKET
                        S3 bucket to write archives to
  -p PREFIX, --prefix PREFIX
                        Target S3 prefix
  -f [F [F ...]], --fieldnames [F [F ...]]
                        Field names in order used by s3grouper
  -s BYTES, --maxsize BYTES
                        Objects greater than maxsize will be copied directly
                        to the destination bucket. Metadata will be stored
                        alongside them. Checksums will not be calculated.
                        Default: 2GB
  -c, --compress        Compress archives with gzip
  -v, --verbose         Enable verbose messages
  -d, --debug           Enable debug messages
```
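For example, a run over a single manifest, or against an SQS queue of manifests, might look like this (the bucket, prefix, queue, and manifest names below are placeholders):

```
python s3bundler.py -m s3://inventory-bucket/manifests/manifest-0001.csv \
    -b archive-bucket -p archives/ -c -v

python s3bundler.py -q s3bundler-manifests -b archive-bucket -p archives/ -v
```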
In order to reduce throttling, S3Bundler writes objects using a hashed hex prefix. You may want to open a support case to request that S3 prepare your archive bucket for more PUT requests per second. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
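A minimal sketch of this idea (not necessarily the exact scheme used by s3bundler.py) is to prepend a few hex characters derived from a hash of the key, so writes are spread across many key-name partitions:

```python
import hashlib

def prefixed_key(prefix, key, hexchars=4):
    """Prepend a short hash-derived hex string to spread PUTs across
    S3 key-name partitions. Illustrative only; the tool's scheme may differ."""
    h = hashlib.md5(key.encode("utf-8")).hexdigest()[:hexchars]
    return "{}{}/{}".format(prefix, h, key)

print(prefixed_key("archives/", "manifest-0001.tar.gz"))
# archives/<4 hex chars>/manifest-0001.tar.gz
```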
Archives can optionally be compressed to save space on S3 in addition to reducing the number of objects.
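One way to do this with the standard library, shown only as an assumption about the approach, is to open the tar stream in gzip mode when compression is requested:

```python
import tarfile

def open_archive(fileobj, compress=False):
    """Open a streaming tar archive over a writable file object,
    gzip-compressed when requested. Sketch only; the tool's archive
    handling may differ."""
    mode = "w|gz" if compress else "w|"
    return tarfile.open(fileobj=fileobj, mode=mode)
```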
When dealing with a ton of small objects, S3 API costs can get pretty high. It is not possible to batch GET requests, so the best we can hope for is one call per object. If an object is under 8 MB, it is retrieved with a single request; if it has tags, one additional call retrieves all of the object's tags. Objects larger than 8 MB are retrieved with a multipart download.
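A rough illustration of that threshold, assuming boto3's transfer manager (the exact download logic in s3bundler.py may differ):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Objects below the 8 MB threshold are fetched in a single GET;
# larger ones fall back to a multipart (ranged) download.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024)

def fetch(bucket, key, dest):
    s3.download_file(bucket, key, dest, Config=config)
```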
If an object was uploaded without multipart, then the ETag header is the MD5 checksum of the object. S3bundler only calculates the checksum if the object was originally a multipart upload.
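Multipart uploads can be recognized because their ETag carries a part-count suffix; a hedged sketch of that check (not necessarily the exact code in s3bundler.py):

```python
import hashlib

def object_md5(etag, body_bytes):
    """Reuse the ETag as the MD5 when the object was a single-part upload
    (no '-<parts>' suffix in the ETag); otherwise compute it from the data.
    Sketch only."""
    etag = etag.strip('"')
    if "-" not in etag:
        return etag
    return hashlib.md5(body_bytes).hexdigest()
```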
When run using SQS, it expects to be managed as an ECS Service. After trying to get 100 messages from the queue, it quits. ECS will automatically restart it. If there are any leaks, this should prevent them from causing problems.
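The polling loop might look roughly like the following sketch, where the receive-attempt cap bounds the lifetime of each task (process_manifest is a placeholder for the bundling work):

```python
import boto3

sqs = boto3.resource("sqs")

def poll(queue_name, max_attempts=100):
    """Make up to max_attempts receive calls for manifest messages, then
    return so ECS can start a fresh task. Hypothetical sketch only."""
    queue = sqs.get_queue_by_name(QueueName=queue_name)
    for _ in range(max_attempts):
        for message in queue.receive_messages(WaitTimeSeconds=20,
                                              MaxNumberOfMessages=1):
            process_manifest(message.body)  # placeholder for the bundling work
            message.delete()
```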
If an object can't be downloaded after boto's internal retries, then the key is written to a DLQ index along with the reason for later investigation. The DLQ index is uploaded next to the original manifest when the archive and index are uploaded.
Other errors cause s3bundler to quit. SQS will make the message available a few more times until it finally sends it to an SQS DLQ for later investigation.
When run in ECS, logs are collected using CloudWatch Logs, which can filter log events and create metrics.
Normally, only job start, finish, and error summaries are logged to save on log ingestion costs.
Filters can look for the strings "ERROR", "begin processing", and "successfully processed".
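For example, a metric filter counting error lines could be created like this (the log group, filter, and metric names are placeholders):

```python
import boto3

logs = boto3.client("logs")

# Count ERROR lines emitted by s3bundler tasks; names are placeholders.
logs.put_metric_filter(
    logGroupName="/ecs/s3bundler",
    filterName="s3bundler-errors",
    filterPattern='"ERROR"',
    metricTransformations=[{
        "metricName": "S3BundlerErrors",
        "metricNamespace": "S3Bundler",
        "metricValue": "1",
    }],
)
```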