Cluster & Serverless Version 0.98
Amazon EC2 Auto Scaling group cluster and serverless AWS Lambda can be deployed together, or used separately in different scenarios.
Serverless Version Architecture
Amazon S3 new-object creation triggers transmission:
Jobsender scans Amazon S3 to send jobs:
A CloudWatch Events cron rule triggers the Jobsender Lambda every hour.
When an AWS Lambda invocation hits its 15-minute runtime limit, the SQS message's invisibility timeout expires as well. The message becomes visible again and triggers another Lambda invocation, which fetches the list of already-uploaded part numbers from the destination Amazon S3 bucket and continues uploading the remaining parts. Here is a snapshot of the Lambda logs:
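As a minimal sketch of this resume step (illustrative, not the project's exact code; bucket, key, and upload ID are placeholders), a fresh invocation can list the parts that already reached the destination and skip them:

```python
import boto3

s3 = boto3.client("s3")

def get_uploaded_part_numbers(bucket, key, upload_id):
    """Return the part numbers already uploaded for a multipart upload,
    so a new Lambda invocation can resume with the remaining parts."""
    part_numbers = []
    paginator = s3.get_paginator("list_parts")
    for page in paginator.paginate(Bucket=bucket, Key=key, UploadId=upload_id):
        part_numbers.extend(p["PartNumber"] for p in page.get("Parts", []))
    return part_numbers
```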
Please install the AWS CDK and follow its instructions to deploy:
https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html
Before deploying with the AWS CDK, manually configure the following credential in AWS Systems Manager Parameter Store:
Name: s3_migration_credentials
Type: SecureString
Tier: Standard
KMS key source: My current account/alias/aws/ssm, or another key you have access to
This s3_migration_credentials parameter is used to access the account that is different from the one the AWS Lambda functions run in. Configuration example:
{
    "aws_access_key_id": "your_aws_access_key_id",
    "aws_secret_access_key": "your_aws_secret_access_key",
    "region": "cn-northwest-1"
}
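If you prefer scripting over the console, a minimal boto3 sketch for creating this parameter (assuming the JSON above is saved as credentials.json; omitting KeyId uses the default alias/aws/ssm key):

```python
import boto3

ssm = boto3.client("ssm")

with open("credentials.json") as f:
    credentials = f.read()

# A SecureString with no KeyId is encrypted with the account's alias/aws/ssm key.
ssm.put_parameter(
    Name="s3_migration_credentials",
    Value=credentials,
    Type="SecureString",
    Tier="Standard",
    Overwrite=True,
)
```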
Edit app.py in the CDK project with your source and destination S3 bucket information, following the example below:
[{
    "src_bucket": "your_global_bucket_1",
    "src_prefix": "your_prefix",
    "des_bucket": "your_china_bucket_1",
    "des_prefix": "prefix_1"
}, {
    "src_bucket": "your_global_bucket_2",
    "src_prefix": "your_prefix",
    "des_bucket": "your_china_bucket_2",
    "des_prefix": "prefix_2"
}]
This information will be deployed to Systems Manager Parameter Store as s3_migration_bucket_para.
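For reference, a consumer such as Jobsender could read the mapping back at runtime like this (a sketch; only the parameter name comes from the project):

```python
import json
import boto3

ssm = boto3.client("ssm")

# Fetch the bucket mapping that CDK stored in Parameter Store.
response = ssm.get_parameter(Name="s3_migration_bucket_para")
for pair in json.loads(response["Parameter"]["Value"]):
    print(pair["src_bucket"] + "/" + pair["src_prefix"],
          "->", pair["des_bucket"] + "/" + pair["des_prefix"])
```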
Change your notification email in app.py
alarm_email = "alarm_your_email@email.com"
./cdk-serverless: this CDK stack is written in Python and deploys the following resources:
Ignore List: Jobsender supports ignoring some objects when comparing buckets. Edit s3_migration_ignore_list.txt in the CDK project, with one bucket/key per line; the wildcards "*" and "?" are supported. E.g.
your_src_bucket/your_exact_key.mp4
your_src_bucket/your_exact_key.*
your_src_bucket/your_*
*.jpg
*/readme.md
CDK deploys this list to SSM Parameter Store as s3_migrate_ignore_list; you can also change the parameter after deployment.
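The wildcard semantics resemble shell globbing, so the matching could be sketched with Python's fnmatch (illustrative only; the project's actual matcher may differ):

```python
from fnmatch import fnmatchcase

ignore_list = [
    "your_src_bucket/your_exact_key.mp4",
    "your_src_bucket/your_*",
    "*.jpg",
    "*/readme.md",
]

def is_ignored(bucket, key):
    """True if bucket/key matches any ignore pattern ('*' and '?' wildcards)."""
    target = f"{bucket}/{key}"
    return any(fnmatchcase(target, pattern) for pattern in ignore_list)

print(is_ignored("your_src_bucket", "photos/cat.jpg"))  # True, matches "*.jpg"
```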
Set the Amazon SQS access policy to allow the Amazon S3 bucket to send messages, as below; change the account and bucket in this JSON:
{
    "Version": "2012-10-17",
    "Id": "arn:aws:sqs:us-east-1:your_account:s3_migrate_file_list/SQSDefaultPolicy",
    "Statement": [
        {
            "Sid": "Sidx",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": "SQS:SendMessage",
            "Resource": "arn:aws:sqs:us-east-1:your_account:s3_migrate_file_list",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:::source_bucket"
                }
            }
        }
    ]
}
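With that policy attached, the source bucket's notification can be pointed at the queue, e.g. with this hedged boto3 sketch (account, queue, and bucket placeholders as in the policy above):

```python
import boto3

s3 = boto3.client("s3")

# Emit an SQS message for every newly created object in the source bucket.
s3.put_bucket_notification_configuration(
    Bucket="source_bucket",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": "arn:aws:sqs:us-east-1:your_account:s3_migrate_file_list",
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)
```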
Create the Lambda log group with three metric filters matching the patterns below:
Namespace: s3_migrate

Filter name: Downloading-bytes
Pattern: [info, date, sn, p="--->Downloading", bytes, key]
Value: $bytes
Default value: 0

Filter name: Uploading-bytes
Pattern: [info, date, sn, p="--->Uploading", bytes, key]
Value: $bytes
Default value: 0

Filter name: Complete-bytes
Pattern: [info, date, sn, p="--->Complete", bytes, key]
Value: $bytes
Default value: 0
This emits Lambda traffic in bytes/min to CloudWatch metrics for monitoring. Configure the monitor statistic as Sum over a 1-minute period.
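The same filters can be created with boto3 instead of the console; a sketch for the first one (the log group name is a placeholder, and the other two filters differ only in name and pattern):

```python
import boto3

logs = boto3.client("logs")

logs.put_metric_filter(
    logGroupName="/aws/lambda/your_worker_lambda",  # placeholder log group name
    filterName="Downloading-bytes",
    filterPattern='[info, date, sn, p="--->Downloading", bytes, key]',
    metricTransformations=[{
        "metricName": "Downloading-bytes",
        "metricNamespace": "s3_migrate",
        "metricValue": "$bytes",
        "defaultValue": 0,
    }],
)
```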
If S3 versioning is not enabled and a large file is replaced by a new file with the same name while it is being transferred, the copy at the destination ends up half old and half new, and integrity is broken. The usual workaround is to never change source files during transfer, or to only allow adding new files.
But in some cases you have to replace source files; enabling the S3 versioning feature then prevents broken file integrity. This project supports S3 versioning, which you configure as follows:
JobsenderCompareVersionId(True/False):
Default: False
True means: when Jobsender compares the S3 buckets, it gets the versionId from the source bucket. The destination versionIds are saved to DynamoDB when each object's transfer starts, and Jobsender fetches that list from DynamoDB for comparison. The SQS job message will contain the versionId.
UpdateVersionId(True/False):
Default: False
True means: before the Worker starts transferring an object, it updates the versionId in the job message. This feature is for the case where Jobsender did not send a versionId but you still need versionId-based integrity protection.
GetObjectWithVersionId(True/False):
Default: False
True means: when the Worker gets an object from the source bucket, it gets it with the specified versionId. If set to False, it gets the latest version of the object.
For the Cluster version, the above parameters are in s3_migration_config.ini.
For the Serverless version, they are in the Lambda environment variables of Jobsender and Worker.
UpdateVersionId = False
GetObjectWithVersionId = True
In this S3-triggers-SQS case, when S3 versioning is enabled, the SQS message contains the versionId, and the Worker simply follows that versionId when getting the object from the source S3 bucket. If the transfer breaks and recovers, it still gets the object with the same versionId.
If you don't have the GetObjectVersion permission, you need to set these parameters to False, otherwise getting the object will fail with AccessDenied.
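In code terms, the Worker's fetch could look like this sketch (function and variable names are illustrative; 'null' is the sentinel used when no real versionId was sent):

```python
import boto3

s3 = boto3.client("s3")

def get_source_object(bucket, key, version_id, get_with_version_id=True):
    """Fetch an object, pinning the version when a real versionId is known."""
    if get_with_version_id and version_id and version_id != "null":
        # Requires the s3:GetObjectVersion permission on the source bucket.
        return s3.get_object(Bucket=bucket, Key=key, VersionId=version_id)
    # Fall back to the latest version; needs only s3:GetObject.
    return s3.get_object(Bucket=bucket, Key=key)
```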
* Jobsender compares S3 buckets and sends SQS jobs. There are two modes:
1. Integrity: fully ensures that even if a file is replaced during transfer, the destination file is still a complete file, not half new and half old.
JobsenderCompareVersionId = False
UpdateVersionId = True
GetObjectWithVersionId = True
For better performance, Jobsender does not compare versionIds, so it sends the SQS message with versionId 'null'. Before the Worker gets the object with a versionId, it fetches the real versionId from the source S3 bucket and saves it to DynamoDB (see the sketch after this list). When the object transfer completes, it checks that the versionId still matches.
2. Integrity and consistency: compares the size of existing files in the source/destination S3 buckets and also compares versions, fully ensuring that even if a file is replaced during transfer, the file is still complete, not half new and half old.
JobsenderCompareVersionId = True
UpdateVersionId = False
GetObjectWithVersionId = True
Jobsender compares the versionId in the source bucket with the one recorded in DynamoDB and sends the SQS message with the real versionId. The Worker will follow this versionId to get the object.
Performance: getting the source versionId list takes much more time than just listing the objects. If you have lots of objects (e.g. more than 10K) and you are using Lambda as Jobsender, it might exceed the 15-minute limit.
Permission: not every bucket grants GetObjectVersion. A bucket might grant you GetObject but not GetObjectVersion, e.g. some open data sets, even with versioning enabled.
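As referenced in mode 1 above, a sketch of how the Worker could fill in the real versionId before transfer (the DynamoDB table name and key schema here are assumptions for illustration, not the project's actual ones):

```python
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("s3_migrate_file_list")  # assumed name

def update_version_id(job):
    """Replace versionId 'null' in a job with the real one from the source bucket."""
    if job["versionId"] == "null":
        head = s3.head_object(Bucket=job["src_bucket"], Key=job["key"])
        job["versionId"] = head.get("VersionId", "null")
        # Save it so the versionId can be re-checked when the transfer completes.
        table.update_item(
            Key={"Key": f"{job['src_bucket']}/{job['key']}"},  # assumed key schema
            UpdateExpression="SET #v = :v",
            ExpressionAttributeNames={"#v": "versionId"},
            ExpressionAttributeValues={":v": job["versionId"]},
        )
    return job
```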
Required memory = concurrency * ChunkSize. For files < 50GB, the ChunkSize is 5MB. For files > 50GB, the ChunkSize is automatically changed to Size/10000.
For example, if the average file size is 500GB, the ChunkSize is automatically changed to 50MB, and the default concurrency setting is 5 files x 10 concurrent threads per file = 50, so 50 x 50MB = 2.5GB of memory is required on EC2 or Lambda.
If you need to increase the concurrency, you can change it in the config, but remember to provision the corresponding memory for EC2 or Lambda.
Don't change the ChunkSize after data copying has started.
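The sizing rule as arithmetic, in a small sketch (constants mirror the text; the helper name is made up for illustration):

```python
MB = 1024 ** 2

def auto_chunk_size(file_size, base_chunk=5 * MB, max_parts=10000):
    """5MB chunks by default; grow the chunk so the part count stays within 10,000."""
    if file_size <= base_chunk * max_parts:  # roughly files up to 50GB
        return base_chunk
    return file_size // max_parts

# Example from the text: 500GB files -> ~50MB chunks, and with the default
# 5 files x 10 concurrent threads = 50, memory needed is 50 x 50MB = 2.5GB.
print(auto_chunk_size(500 * 1024 * MB) // MB, "MB")  # 51 MB (~Size/10000)
```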
To delete the deployed resources, only one command is needed: cdk destroy
But these resources must be deleted manually: DynamoDB tables, CloudWatch log groups, and newly created S3 buckets.
This library is licensed under the MIT-0 License. See the LICENSE file.