PROJECT LONGBOW - Amazon EC2 Autoscaling Cluster. Supports large-scale transmission of large objects between AWS Global Amazon S3 and China Amazon S3
Cluster & Serverless Version 0.98
Architecture:
New objects created in Amazon S3 trigger transmission:
Jobsender scans Amazon S3 and sends jobs:
If you need to run Jobsender on a schedule, you can log on to the EC2 instance and set up a cron job (crontab -e) to run s3_migration_cluster_jobsender.py.
Preparation (optional): Jobsender fetches the source and destination bucket listings and compares the delta into a Job List. It can be run regularly.
Jobsender sends Job messages to Amazon SQS; each Job message on Amazon SQS is one file object on Amazon S3. Job status is recorded in Amazon DynamoDB for statistics.
An Amazon S3 object-created event can also trigger Amazon SQS directly.
Amazon EC2 and/or AWS Lambda get Job messages from SQS. Each EC2 instance can process multiple Jobs concurrently.
Amazon EC2 and/or AWS Lambda create multiple threads for each job; each thread transmits a different part of the file object and uploads it to the destination S3 (a simplified sketch follows the trigger list below).
Jobsender gets listings from multiple Amazon S3 buckets and compares them to the destination.
A new object created on Amazon S3 triggers an Amazon SQS message.
After a batch of jobs completes, Jobsender is started again to list and compare the buckets.
A cron event triggers Jobsender to list buckets, compare, and send new jobs.
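To make the worker model above concrete, here is a simplified sketch, not the project's actual worker code; the bucket names, key, chunk size and thread count are placeholders. Each thread fetches one byte range from the source object and uploads it as one part of a multipart upload on the destination:

import concurrent.futures
import boto3

SRC_BUCKET, DES_BUCKET, KEY = "your_global_bucket_1", "your_china_bucket_1", "bigfile.bin"
CHUNK = 5 * 1024 * 1024                                      # 5 MB parts

src_s3 = boto3.client("s3")                                  # e.g. via the EC2 IAM role
des_s3 = boto3.Session(profile_name="china").client("s3")    # e.g. via credentials from SSM

size = src_s3.head_object(Bucket=SRC_BUCKET, Key=KEY)["ContentLength"]
upload_id = des_s3.create_multipart_upload(Bucket=DES_BUCKET, Key=KEY)["UploadId"]

def copy_part(part_no: int) -> dict:
    # Read one byte range from the source and upload it as one part to the destination
    start = (part_no - 1) * CHUNK
    end = min(start + CHUNK, size) - 1
    body = src_s3.get_object(Bucket=SRC_BUCKET, Key=KEY,
                             Range=f"bytes={start}-{end}")["Body"].read()
    etag = des_s3.upload_part(Bucket=DES_BUCKET, Key=KEY, UploadId=upload_id,
                              PartNumber=part_no, Body=body)["ETag"]
    return {"PartNumber": part_no, "ETag": etag}

part_count = (size + CHUNK - 1) // CHUNK
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    parts = list(pool.map(copy_part, range(1, part_count + 1)))

des_s3.complete_multipart_upload(Bucket=DES_BUCKET, Key=KEY, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})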
For large-scale transfers of GByte-level files, a single c5.large node (set to 5 files x 30 threads in this test) can reach 800 Mbps between Global and China Amazon S3. If most files are MByte-level, you can configure higher concurrency per Amazon EC2 node.
As the graph above shows, as the Auto Scaling Group adds more EC2 instances, the handled traffic scales up linearly: 9 EC2 instances reach 900+ MB/s (7.2 Gbps).
Transferring 1.2 TBytes (916 files) from us-east-1 to cn-northwest-1 takes only one hour. After all jobs finish, the cluster automatically shuts down the EC2 instances and keeps only one.
Test file size distribution is as below:
Before deploying with AWS CDK, please configure AWS Systems Manager Parameter Store manually with this credential:
Name: s3_migration_credentials
Type: SecureString
Tier: Standard
KMS key source: My current account/alias/aws/ssm, or another key you have access to
This s3_migration_credentials parameter is used to access the account that is different from the account running the EC2 instances. Config example:
{
"aws_access_key_id": "your_aws_access_key_id",
"aws_secret_access_key": "your_aws_secret_access_key",
"region": "cn-northwest-1"
}
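If you prefer scripting the parameter instead of using the console, a minimal boto3 sketch (the values are placeholders) would be:

import json
import boto3

ssm = boto3.client("ssm")

# Create (or overwrite) the SecureString parameter; with no KeyId it uses the default aws/ssm KMS key
ssm.put_parameter(
    Name="s3_migration_credentials",
    Type="SecureString",
    Tier="Standard",
    Overwrite=True,
    Value=json.dumps({
        "aws_access_key_id": "your_aws_access_key_id",
        "aws_secret_access_key": "your_aws_secret_access_key",
        "region": "cn-northwest-1",
    }),
)

# This is how an application can read it back later
cred = json.loads(ssm.get_parameter(
    Name="s3_migration_credentials", WithDecryption=True)["Parameter"]["Value"])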
Edit app.py in the CDK project with your source and destination S3 bucket information, for example:
[{
"src_bucket": "your_global_bucket_1",
"src_prefix": "your_prefix",
"des_bucket": "your_china_bucket_1",
"des_prefix": "prefix_1"
}, {
"src_bucket": "your_global_bucket_2",
"src_prefix": "your_prefix",
"des_bucket": "your_china_bucket_2",
"des_prefix": "prefix_2"
}]
This information will be deployed to Systems Manager Parameter Store as s3_migration_bucket_para.
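For reference, this is roughly how the Jobsender side can read that parameter back at runtime (a sketch, not the project's exact code):

import json
import boto3

ssm = boto3.client("ssm")
bucket_para = json.loads(ssm.get_parameter(
    Name="s3_migration_bucket_para")["Parameter"]["Value"])

for b in bucket_para:
    print(f'{b["src_bucket"]}/{b["src_prefix"]} -> {b["des_bucket"]}/{b["des_prefix"]}')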
Change your notification email in cdk_ec2_stack.py
If needed, you can change the default config file ./cdk-cluster/code/s3_migration_cluster_config.ini (e.g. JobType or Threads). CDK will deploy the code and config to a newly created S3 bucket: s3-migration-cluster-resourc-deploybucket. You can also change the config after deployment and upload it to the same S3 bucket; newly created EC2 instances will download the new config via User Data.
AWS CDK will deploy all the resources below, except the credentials you configured manually in step 1:
Amazon VPC (with 2 AZs, 2 public subnets) and S3 Endpoint,
Amazon SQS Queue: s3_migrate_file_list
Amazon SQS Queue DLQ: s3_migrate_file_list-DLQ,
Amazon DynamoDB Table: s3_migrate_file_list,
Amazon EC2 JobSender Autoscaling Group with only 1 instance: t3.micro,
Amazon EC2 Workers Autoscaling Group, capacity 1-10, instance type c5.large (can be changed in cdk_ec2_stack.py)
Amazon EC2 Autoscaling Policy
AWS SSM Parameter Store: s3_migration_bucket_para as bucket information for Jobsender to scan
All IAM Roles needed by Amazon EC2
Amazon CloudWatch Dashboard
Amazon CloudWatch Alarm with email notification
Amazon S3 Bucket - s3-migration-cluster-resource-newbucket - newly created objects in this bucket will trigger SQS and start transmission.
Amazon S3 Bucket - s3-migration-cluster-resourc-deploybucket - code deployment bucket: CDK uploads the application source code and config here, and every EC2 instance downloads them at startup. If you want to change config.ini, you can change it in this bucket.
Amazon EC2 User Data will install CloudWatch Logs Agent to collect EC2 initial logs and s3_migrate application logs
Amazon EC2 User Data enable TCP BBR
Amazon EC2 User Data automatically clones the application code and default config.ini from GitHub and runs it. It is recommended to put the code and your modified config in your own S3 bucket for User Data to pull, or to build your own AMI to start Amazon EC2.
If needed, you can change s3_migration_config.ini on the EC2 worker, as described below (a sketch of how these settings could be consumed follows the list):
* JobType = PUT or GET (default: PUT)
Determines whether the Worker uses its own IAM Role to access the source or the destination S3. PUT means the Worker is not in the same account as the destination S3; GET means the Worker is not in the same account as the source S3.
* Des_bucket_default/Des_prefix_default
This is only for the scenario where Amazon S3 triggers SQS directly, in which there is no Jobsender to supply the destination S3 Bucket/Prefix information.
It is not used in the scenario where Jobsender scans buckets and sends jobs to SQS.
* table_queue_name
The SQS queue and DynamoDB table name. Must be the same as table_queue_name configured in the CloudFormation/CDK template.
* ssm_parameter_bucket
The SSM Parameter Store name used to store bucket information. Must be the same as the CloudFormation/CDK ssm_parameter_bucket.
* ssm_parameter_credentials
The SSM Parameter Store name used to store the access key of the other account.
Must be the same as the CloudFormation/CDK ssm_parameter_credentials.
* StorageClass = STANDARD|REDUCED_REDUNDANCY|STANDARD_IA|ONEZONE_IA|INTELLIGENT_TIERING|GLACIER|DEEP_ARCHIVE
Select target Amazon S3 Storage Class, default STANDARD
* ResumableThreshold (default 5MB)
Unit: MBytes. Files equal to or smaller than this value do not go through the multipart upload procedure, i.e. no resumable upload, which saves overhead.
* MaxRetry (default 20)
MaxRetry for single API call on application level
* MaxThread (default 20)
Max thread concurrency for single file
* MaxParallelFile (default 5)
Max number of files transferred in parallel on a single node
* JobTimeout (default 3600)
Timeout for a single file job, in seconds
* LoggingLevel = WARNING | INFO | DEBUG (default: INFO)
* For debug: fVerifyMD5Twice, ChunkSize, CleanUnfinishedUpload, LocalProfileMode
* Hidden parameter: max_pool_connections=200 in s3_migration_lib.py
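As an illustration of how these settings fit together (not the actual s3_migration_lib.py code; the ini section name and exact option spelling here are assumptions), a worker could read the ini with configparser and pick which side uses its IAM role versus the stored credentials according to JobType:

import json
import configparser
import boto3
from botocore.config import Config

cfg = configparser.ConfigParser()
cfg.read("s3_migration_cluster_config.ini")
job_type = cfg.get("DEFAULT", "JobType", fallback="PUT")        # section name is an assumption
max_retry = cfg.getint("DEFAULT", "MaxRetry", fallback=20)
cred_para = cfg.get("DEFAULT", "ssm_parameter_credentials", fallback="s3_migration_credentials")

s3_config = Config(max_pool_connections=200, retries={"max_attempts": max_retry})

# Credentials of the "other" account, stored in SSM Parameter Store (step 1)
cred = json.loads(boto3.client("ssm").get_parameter(
    Name=cred_para, WithDecryption=True)["Parameter"]["Value"])

role_side_s3 = boto3.client("s3", config=s3_config)             # uses the EC2 instance IAM role
cred_side_s3 = boto3.Session(
    aws_access_key_id=cred["aws_access_key_id"],
    aws_secret_access_key=cred["aws_secret_access_key"],
    region_name=cred["region"],
).client("s3", config=s3_config)

if job_type == "PUT":          # destination is in the other account
    src_s3, des_s3 = role_side_s3, cred_side_s3
else:                          # GET: source is in the other account
    src_s3, des_s3 = cred_side_s3, role_side_s3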
When the Jobsender application starts, it checks whether the Amazon SQS queue is empty. If it is empty, it gets the bucket info from Parameter Store s3_migration_bucket_para, compares the buckets, and sends jobs to the Amazon SQS queue.
Notice: If you manually change the timeout, remember to keep SQS, EC2 JobTimeout and Lambda consistent. By default, CDK deploys the SQS message invisibility time and the EC2 JobTimeout as one hour.
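A hedged sketch of that start-up check (the queue name comes from the CDK stack above; the job message field names are illustrative, not the project's exact schema):

import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="s3_migrate_file_list")["QueueUrl"]

attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages",
                    "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]

queue_empty = (int(attrs["ApproximateNumberOfMessages"]) == 0
               and int(attrs["ApproximateNumberOfMessagesNotVisible"]) == 0)

if queue_empty:
    # ... list and compare the buckets from s3_migration_bucket_para, then enqueue one message per object
    job = {"Src_bucket": "your_global_bucket_1", "Src_key": "your_prefix/file.bin",
           "Des_bucket": "your_china_bucket_1", "Des_key": "prefix_1/file.bin",
           "Size": 1234567}   # illustrative field names
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))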
CDK has deployed a CloudWatch Dashboard:
Jobsender supports ignoring some objects while comparing buckets. Edit s3_migration_ignore_list.txt and add bucket/key entries, one per line; wildcards "*" and "?" are also supported (the matching logic is sketched after the examples). E.g.
your_src_bucket/your_exact_key.mp4
your_src_bucket/your_exact_key.*
your_src_bucket/your_*
*.jpg
*/readme.md
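The wildcard semantics above map naturally onto Python's fnmatch module; here is a minimal sketch of the matching logic (an illustration, not necessarily the project's exact implementation):

from fnmatch import fnmatchcase

with open("s3_migration_ignore_list.txt") as f:
    ignore_patterns = [line.strip() for line in f if line.strip()]

def is_ignored(bucket: str, key: str) -> bool:
    """Return True if bucket/key matches any line of the ignore list."""
    target = f"{bucket}/{key}"
    return any(fnmatchcase(target, pattern) for pattern in ignore_patterns)

# With the example list above:
# is_ignored("your_src_bucket", "your_exact_key.mp4")  -> True
# is_ignored("other_bucket", "photos/cat.jpg")         -> True ("*.jpg")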
CDK deploys and enables TCP BBR (Bottleneck Bandwidth and RTT, congestion-based congestion control) on EC2, which improves performance.
Amazon Linux AMI 2017.09.1 with kernel 4.9.51 or later supports TCP BBR.
Currently the default config setting is PUT mode. If you need to change to GET mode:
In cdk_ec2_stack.py, change src_bucket to des_bucket and grant read & write access to the existing S3 buckets, as in the commented block below:
# Allow EC2 access exist s3 bucket for GET mode: read and write access the destination buckets
# bucket_name = ''
# for b in bucket_para:
# if bucket_name != b['des_bucket']:
# bucket_name = b['des_bucket']
# s3exist_bucket = s3.Bucket.from_bucket_name(self,
# bucket_name,
# bucket_name=bucket_name)
# s3exist_bucket.grant_read_write(jobsender)
# s3exist_bucket.grant_read_write(worker_asg)
Don't forget to change s3_migration_cluster_config.ini to JobType = GET.
If S3 Versioning is not enabled and a large file is replaced with a new file of the same name while it is being transferred, the copy on the destination ends up half old and half new, and its integrity is broken. The usual practice is not to change source files while transferring, or to only allow adding new files.
But in some cases you have to replace source files; then you can enable the S3 Versioning feature so that file integrity cannot be broken. This project supports S3 Versioning; you need to configure it as follows:
Configure the application parameters below:
JobsenderCompareVersionId(True/False):
Default: False
True means: when Jobsender compares S3 buckets, it gets the versionId from the source buckets. The destination versionIds are saved in DynamoDB when each object transfer starts, and Jobsender gets the destination versionId list from DynamoDB for the comparison. The SQS Job message will contain the versionId.
UpdateVersionId(True/False):
Default: False
True means: before the Worker starts transferring an object, it updates the versionId in the Job message. This feature is for the case where Jobsender has not sent a versionId, but you still need the versionId for integrity protection.
GetObjectWithVersionId(True/False):
Default: False
True means: when the Worker gets an object from the source bucket, it gets it with the specified versionId. If this switch is set to False, it gets the latest version of the object.
For the Cluster version, the above parameters are in s3_migration_config.ini.
For the Serverless version, the above parameters are in the top lines of the lambda_function_jobsender and lambda_function_worker Python files.
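At the API level, GetObjectWithVersionId boils down to pinning one versionId and passing it on every ranged GET, so a mid-transfer overwrite of the key cannot mix old and new bytes. A hedged sketch (bucket and key are placeholders):

import boto3

s3_src = boto3.client("s3")
bucket, key = "your_global_bucket_1", "your_prefix/bigfile.bin"

# Pin one specific version before the transfer starts
head = s3_src.head_object(Bucket=bucket, Key=key)
version_id = head.get("VersionId")        # only present when bucket Versioning is enabled
size = head["ContentLength"]

# Every part of the multipart transfer is fetched against the same pinned version
get_args = {"Bucket": bucket, "Key": key, "Range": "bytes=0-5242879"}
if version_id:                            # requires s3:GetObjectVersion on the source
    get_args["VersionId"] = version_id
first_part = s3_src.get_object(**get_args)["Body"].read()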
S3 new object triggers SQS jobs:
UpdateVersionId = False
GetObjectWithVersionId = True
In this S3-trigger-SQS case, when S3 Versioning is enabled, the SQS message contains the versionId. The Worker simply follows this versionId when getting the object from the source S3 bucket. If the transfer is interrupted and then recovered, it still gets the object with the same versionId.
If you don't have the GetObjectVersion permission, you need to set these parameters to False, otherwise getting the object will fail with AccessDenied.
Jobsender compares S3 and sends SQS jobs. There are two purposes here:
Integrity: to fully ensure that even if a file is replaced while transferring, the copied file is still a complete file, not half new and half old.
JobsenderCompareVersionId = False
UpdateVersionId = True
GetObjectWithVersionId = True
For better performance, Jobsender does not compare versionIds, so it sends the SQS message with versionId 'null'. Before the Worker gets the object with a versionId, it gets the real versionId from the source S3 and saves it to DynamoDB. When the object transfer completes, it checks that the versionId still matches.
Integrity and consistency: to compare the sizes of existing files in the source/destination S3 buckets and also compare versions, and to fully ensure that even if a file is replaced while transferring, the copied file is still a complete file, not half new and half old.
JobsenderCompareVersionId = True
UpdateVersionId = False
GetObjectWithVersionId = True
Jobsender compares the versionIds in the source bucket and in DynamoDB, and sends the SQS message with the real versionId. The Worker follows this versionId to get the object.
For cases where there is no need to compare versions, the files will not be replaced while transferring, or you don't have GetObjectVersion permission on the source bucket, just set the three parameters to False. (Default)
Performance: getting the source versionId list takes much more time than just listing objects. If you have many objects (e.g. more than 10K) and you are using Lambda as Jobsender, it might exceed the 15-minute limit.
Permission: not every bucket permits GetObjectVersion. Some buckets may grant read access for GetObject but not GetObjectVersion, e.g. some Open Data Sets, even with Versioning enabled.
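The difference shows up directly in the listing API the Jobsender has to call: a plain listing returns keys and sizes, while a version-aware listing returns every version and requires the s3:ListBucketVersions permission. A rough comparison sketch (bucket and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
bucket, prefix = "your_global_bucket_1", "your_prefix/"

# Plain listing: key and size only (faster, needs s3:ListBucket)
sizes = {}
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        sizes[obj["Key"]] = obj["Size"]

# Version-aware listing: also returns the current VersionId
# (slower on big buckets, needs s3:ListBucketVersions)
current_version = {}
for page in s3.get_paginator("list_object_versions").paginate(Bucket=bucket, Prefix=prefix):
    for v in page.get("Versions", []):
        if v["IsLatest"]:
            current_version[v["Key"]] = v["VersionId"]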
Required memory = concurrency * ChunkSize. For files < 50GB, ChunkSize is 5MB. For files > 50GB, ChunkSize is automatically changed to Size/10000.
For example, if the average file size is 500GB, ChunkSize is automatically changed to 50MB, and with the default concurrency setting of 5 files x 10 threads/file = 50, the required memory is 50 x 50MB = 2.5GB on EC2 or Lambda.
If you need to increase the concurrency, you can change it in the config, but remember to provision the corresponding memory for EC2 or Lambda.
Don't change the ChunkSize after data copying has started.
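A back-of-the-envelope helper based on the sizing rule above (5 MB minimum chunk, roughly Size/10000 above ~50 GB, memory ≈ files in flight x threads per file x chunk size). Treat it as an estimate, not the project's exact allocator:

def estimate_chunk_size(file_size: int, min_chunk: int = 5 * 1024 ** 2) -> int:
    """Chunk grows so that one file never needs more than ~10,000 parts."""
    return max(min_chunk, -(-file_size // 10000))          # ceiling division

def estimate_memory(file_size: int, max_parallel_file: int = 5, max_thread: int = 10) -> int:
    """Rough peak memory (bytes) for one EC2 worker or Lambda invocation."""
    return max_parallel_file * max_thread * estimate_chunk_size(file_size)

# The example from the text: 500 GB files, 5 files x 10 threads each
print(round(estimate_memory(500 * 1024 ** 3) / 1024 ** 3, 1), "GB")   # ~2.5 GB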
Deleting the deployed resources needs only one command: cdk destroy
But you need to delete these resources manually: the DynamoDB table, CloudWatch Log Groups, and the newly created S3 buckets.
This library is licensed under the MIT-0 License. See the LICENSE file.