# S3Bundler

S3Bundler takes a large number of objects and puts them into a small number of larger archives. It downloads all of the objects listed in a manifest, writes them to a tar archive, and uploads the archive to S3 along with an index that includes each object's metadata, tags, and an MD5 checksum. It can be run with a single manifest URI, or it can read manifest URIs from SQS.

## Usage

```
usage: s3bundler.py [-h] [-q QUEUE] [-m s3://bucket/key] [-b BUCKET]
                    [-p PREFIX] [-f [F [F ...]]] [-s BYTES] [-c] [-v] [-d]

Bundle S3 objects from an inventory into an archive

optional arguments:
  -h, --help            show this help message and exit
  -q QUEUE, --queue QUEUE
                        SQS S3Bundler manifest queue.
  -m s3://bucket/key, --manifest s3://bucket/key
                        Manifest produced by s3grouper
  -b BUCKET, --bucket BUCKET
                        S3 bucket to write archives to
  -p PREFIX, --prefix PREFIX
                        Target S3 prefix
  -f [F [F ...]], --fieldnames [F [F ...]]
                        Field names in order used by s3grouper
  -s BYTES, --maxsize BYTES
                        Objects greater than maxsize will be copied directly
                        to the destination bucket. Metadata will be stored
                        alongside them. Checksums will not be calculated.
                        Default: 2GB
  -c, --compress        Compress archives with gzip
  -v, --verbose         Enable verbose messages
  -d, --debug           Enable debug messages
```

## Optimizations

### S3 Bucket Partitioning

To reduce throttling, S3Bundler writes objects using a hashed hex prefix (a sketch of the idea appears under "Examples" below). You may want to open a support case asking AWS to prepare your archive bucket for a higher rate of PUT requests per second. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

### Compression

Archives can optionally be gzip-compressed, saving space on S3 in addition to reducing the number of objects.

### S3 API calls

When dealing with very large numbers of small objects, S3 API costs can dominate. It is not possible to batch GET requests, so the best case is one call per object. If an object is under 8 MB, it is retrieved with a single request, plus one additional call if it has tags (a single call returns all of an object's tags). Larger objects are retrieved with a multipart download.

### MD5 Checksums

If an object was uploaded without multipart, its ETag header is the MD5 checksum of the object. S3Bundler only calculates the checksum itself when the object was originally a multipart upload (see "Examples" below).

### ECS Service

When run against SQS, S3Bundler expects to be managed as an ECS Service. After attempting to receive 100 messages from the queue, it quits, and ECS automatically restarts it. If there are any resource leaks, this should keep them from causing problems.

## Handling Errors

### Object Errors

If an object can't be downloaded after boto's internal retries, the key is written to a DLQ index along with the reason, for later investigation. The DLQ index is uploaded next to the original manifest when the archive and index are uploaded.

### Other Errors

Any other error causes S3Bundler to quit. SQS will make the message available a few more times until it finally sends it to an SQS dead-letter queue for later investigation.

### Logging

When run in ECS, logs are collected with CloudWatch Logs, which can filter log events and create metrics from them. Normally, only job start, job finish, and error summaries are logged, to keep log ingestion costs down. Filters can look for the strings "ERROR", "begin processing", and "successfully processed".
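
## Examples

The sketches below illustrate the techniques described above. They are not excerpts from s3bundler.py; every helper name and parameter in them is assumed for illustration.

A hashed hex prefix (see "S3 Bucket Partitioning") spreads keys across S3's keyspace so that writes don't pile up behind one lexical prefix. A minimal sketch of the idea, assuming a hypothetical `archive_key` helper and a 4-character prefix:

```python
import hashlib


def archive_key(prefix, manifest_name, hex_chars=4):
    """Build an archive key that starts with a short hex digest.

    Hypothetical helper: S3Bundler's real hash and prefix length may
    differ, but the effect is the same -- keys are spread uniformly
    rather than sharing one lexical prefix.
    """
    digest = hashlib.md5(manifest_name.encode("utf-8")).hexdigest()
    return "%s/%s/%s.tar" % (prefix, digest[:hex_chars], manifest_name)


print(archive_key("archives", "manifest-00042"))
# archives/<first 4 hex chars of the digest>/manifest-00042.tar
```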
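
Each object costs at least one API call (see "S3 API calls"). A hedged boto3 sketch of the small-object path; the 8 MB threshold comes from the text above, and `fetch_small` is a made-up name:

```python
import boto3

s3 = boto3.client("s3")


def fetch_small(bucket, key, has_tags=False):
    """Fetch an object under 8 MB with the fewest possible API calls.

    Sketch only: one GetObject call returns the body, and one
    GetObjectTagging call returns the entire tag set when the object
    has tags. Objects over the threshold would use a multipart
    download instead (omitted here).
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    tags = []
    if has_tags:
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
    return body, tags
```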
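
Whether an ETag can stand in for an MD5 checksum (see "MD5 Checksums") is visible in its shape: multipart ETags end in a dash and a part count. A sketch, assuming a file-like `body`:

```python
import hashlib


def object_md5(body, etag):
    """Return the object's MD5 hex digest, reusing the ETag when valid.

    Illustrative sketch: a multipart ETag looks like
    "9bb58f26192e4ba00f01e2e7b136bbd8-5" (digest, dash, part count),
    so it is not the object's MD5 and the digest must be computed
    from the bytes instead.
    """
    etag = etag.strip('"')
    if "-" not in etag:
        return etag  # single-part upload: the ETag is the MD5
    md5 = hashlib.md5()
    for chunk in iter(lambda: body.read(1024 * 1024), b""):
        md5.update(chunk)
    return md5.hexdigest()
```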
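
The ECS service pattern (see "ECS Service") amounts to a bounded polling loop: exit after a fixed number of receive attempts and let ECS restart the task. A rough sketch, with `process_manifest` standing in for the real bundling work:

```python
import boto3


def poll(queue_url, process_manifest, max_receives=100):
    """Poll SQS a fixed number of times, then return so ECS restarts us.

    Sketch only: visibility timeouts and error handling are simplified,
    and `process_manifest` is a hypothetical callback that bundles one
    manifest URI.
    """
    sqs = boto3.client("sqs")
    for _ in range(max_receives):
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            process_manifest(msg["Body"])  # e.g. s3://bucket/key
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```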