GH2S3

GH2S3 downloads the 2016 GitHub Archive data and uploads it to an Amazon S3 bucket.

Running it on an Amazon EC2 instance is preferred, since transfers to S3 get better bandwidth and lower latency from inside AWS.
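For orientation, here is a minimal sketch of the core idea, assuming GH Archive's hourly file naming (https://data.gharchive.org/YYYY-MM-DD-H.json.gz); the function name and structure below are illustrative and not taken from gh2s3.py:

import os

import boto3
import requests

bucket = os.environ["S3_BUCKET"]
s3 = boto3.client("s3")

def mirror_hour(date, hour):
    """Download one hourly archive and upload it to S3 unchanged."""
    key = f"{date}-{hour}.json.gz"  # e.g. 2016-01-01-0.json.gz
    resp = requests.get(f"https://data.gharchive.org/{key}", stream=True)
    resp.raise_for_status()
    # Stream the raw gzipped body straight into the bucket.
    s3.upload_fileobj(resp.raw, bucket, key)

mirror_hour("2016-01-01", 0)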

Run locally

With Python 3 and pip installed, install the dependencies:

pip install -r requirements.txt

You need to set up your AWS credentials, the same way it's done for the AWS CLI.
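For example, a minimal ~/.aws/credentials file looks like this (the values are placeholders):

[default]
aws_access_key_id = YOUR_ID
aws_secret_access_key = YOUR_KEY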

Then run:

export S3_BUCKET=YOUR_BUCKET
./gh2s3.py

With Docker

docker build -t gh2s3 .

docker run \
    --rm \
    -e "AWS_ACCESS_KEY_ID=YOUR_ID" \
    -e "AWS_SECRET_ACCESS_KEY=YOUR_KEY" \
    -e "S3_BUCKET=YOUR_BUCKET" \
    gh2s3