This section describes how to install, configure and use the Amazon S3 Find and Forget solution.
The Fargate tasks used by this solution to perform deletions must be able to access several AWS services, including Amazon S3, Amazon DynamoDB, Amazon SQS, Amazon ECR, Amazon CloudWatch and AWS STS, either via an Internet Gateway or via VPC Endpoints.
By default the CloudFormation template will create a new VPC that has been purpose-built for the solution. The VPC includes VPC endpoints for the aforementioned services, and does not provision internet connectivity.
You can use the provided VPC to operate the solution with no further customisations. However, if you have more complex requirements it is recommended to use an existing VPC as described in the following section.
Amazon S3 Find and Forget can also be used in an existing VPC. You may want to do this if you have requirements that aren't met by using the VPC provided with the solution.
To use an existing VPC, set the `DeployVpc` parameter to `false` when launching the solution CloudFormation stack. You must also specify the subnets and security groups that the Fargate tasks will use by setting the `VpcSubnets` and `VpcSecurityGroups` parameters respectively.

The subnets and security groups that you specify must allow the tasks to connect to the aforementioned AWS services.
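For example, when launching via the AWS CLI, these parameters might be supplied as follows (a minimal sketch; the stack name, template file and resource IDs are placeholder assumptions):

```bash
aws cloudformation deploy \
  --template-file template.yaml \
  --stack-name S3F2 \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --parameter-overrides \
    DeployVpc=false \
    "VpcSubnets=subnet-0123456789abcdef0,subnet-0fedcba9876543210" \
    "VpcSecurityGroups=sg-0123456789abcdef0"
```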
You can obtain your subnet and security group IDs from the AWS Console or by using the AWS CLI. If using the AWS CLI, you can use the following command to get a list of VPCs:
```bash
aws ec2 describe-vpcs \
  --query 'Vpcs[*].{ID:VpcId,Name:Tags[?Key==`Name`].Value | [0], IsDefault: IsDefault}'
```
Once you have found the VPC you wish to use, run the following commands to get a list of subnets and security groups in that VPC:
```bash
export VPC_ID=<chosen-vpc-id>

aws ec2 describe-subnets \
  --filter Name=vpc-id,Values="$VPC_ID" \
  --query 'Subnets[*].{ID:SubnetId,Name:Tags[?Key==`Name`].Value | [0],AZ:AvailabilityZone}'

aws ec2 describe-security-groups \
  --filter Name=vpc-id,Values="$VPC_ID" \
  --query 'SecurityGroups[*].{ID:GroupId,Name:GroupName}'
```
The Fargate tasks used by this solution to perform deletions require a specific IAM role to exist in each account that owns a bucket that you will use with the solution. The role must have the exact name `S3F2DataAccessRole` (no path). A CloudFormation template is available as part of this solution which can be deployed separately from the main stack in each account. A way to deploy this role to many accounts, for example across your organization, is to use AWS CloudFormation StackSets.
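A sketch of the StackSets approach from the CLI (the stack set name, template URL and organizational unit ID below are placeholder assumptions):

```bash
# Create a stack set for the data access role using service-managed permissions
aws cloudformation create-stack-set \
  --stack-set-name S3F2DataAccessRole \
  --template-url https://<bucket>.s3.amazonaws.com/<path>/role.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false

# Deploy the role into every account under an organizational unit.
# IAM is global, so a single region is sufficient.
aws cloudformation create-stack-instances \
  --stack-set-name S3F2DataAccessRole \
  --deployment-targets OrganizationalUnitIds=ou-examplerootid111-exampleouid111 \
  --regions us-east-1
```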
To deploy this template manually, use the IAM Role Template "Deploy to AWS" button in Deploying the Solution, then follow steps 5-9. The Outputs tab will contain the Role ARN, which you will need when adding data mappers.
You will need to grant this role read and write access to your data. We recommend you do this using a bucket policy. For more information, see Granting Access to Data.
The solution is deployed as an AWS CloudFormation template and should take about 20 to 40 minutes to deploy.
Your access to the AWS account must have IAM permissions to launch AWS CloudFormation templates that create IAM roles and to create the solution resources.
Note: You are responsible for the cost of the AWS services used while running this solution. For full details, see the pricing pages for each AWS service you will be using in this solution. Prices are subject to change.
Region | Launch Template | Template Link | Launch IAM Role Template | IAM Role Template Link |
---|---|---|---|---|
US East (N. Virginia) (us-east-1) | Launch | Link | Launch | Link |
US East (Ohio) (us-east-2) | Launch | Link | Launch | Link |
US West (Oregon) (us-west-2) | Launch | Link | Launch | Link |
Asia Pacific (Sydney) (ap-southeast-2) | Launch | Link | Launch | Link |
Asia Pacific (Tokyo) (ap-northeast-1) | Launch | Link | Launch | Link |
EU (Ireland) (eu-west-1) | Launch | Link | Launch | Link |
EU (London) (eu-west-2) | Launch | Link | Launch | Link |
EU (Frankfurt) (eu-central-1) | Launch | Link | Launch | Link |
EU (Stockholm) (eu-north-1) | Launch | Link | Launch | Link |
On the "Specify stack details" screen you should provide values for the following parameters of the CloudFormation stack:
The following parameters are optional and allow further customisation of the solution if required:
When completed, click Next.
On the review screen, you must check the boxes for:

- "I acknowledge that AWS CloudFormation might create IAM resources"
- "I acknowledge that AWS CloudFormation might create IAM resources with custom names"
These are required to allow CloudFormation to create a Role to allow access to resources needed by the stack and name the resources in a dynamic way.
The solution provides a web user interface and a REST API to allow you to integrate it in your own applications. If you have chosen not to deploy the Web UI you will need to use the API to interface with the solution.
To add more users to the application:
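One way to do this outside the console is with the Cognito CLI (a minimal sketch; the user pool ID variable and example email address are assumptions):

```bash
aws cognito-idp admin-create-user \
  --user-pool-id "$COGNITO_USER_POOL_ID" \
  --username user@example.com \
  --user-attributes Name=email,Value=user@example.com
```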
To use the API directly, you will need to authenticate requests using the Cognito User Pool or IAM. The method for authenticating differs depending on which authentication option was chosen:
After resetting the password via the UI, you can make authenticated requests using the AWS CLI:
```bash
aws cognito-idp admin-initiate-auth \
  --user-pool-id "$COGNITO_USER_POOL_ID" \
  --client-id "$COGNITO_USER_POOL_CLIENT_ID" \
  --auth-flow ADMIN_NO_SRP_AUTH \
  --auth-parameters USERNAME="$USER_EMAIL_ADDRESS",PASSWORD="$USER_PASSWORD"
```
Use the `IdToken` generated by the previous command to make an authenticated request to the API. For instance, the following command will show the matches in the deletion queue:

```bash
curl "$API_URL/v1/queue" -H "Authorization: Bearer $ID_TOKEN"
```
For more information, consult the Cognito REST API integration guide.
IAM authentication for API requests uses the Signature Version 4 signing process. Add the resulting signature to the Authorization header when making requests to the API.
Use the Sigv4 process linked above to generate the Authorization header value and then call the API as normal:

```bash
curl "$API_URL/v1/queue" -H "Authorization: $Sigv4Auth"
```
IAM authentication can be used anywhere you have AWS credentials with the correct permissions; this could be an IAM User or an assumed IAM Role.
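If you prefer not to construct the signature yourself, recent versions of curl can sign requests for you (a sketch assuming curl 7.75 or later, long-lived IAM user credentials in environment variables, and a deployment in us-east-1):

```bash
curl "$API_URL/v1/queue" \
  --aws-sigv4 "aws:amz:us-east-1:execute-api" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"
```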
Please refer to the documentation here to understand how to define the IAM policy to match your requirements. The ARN for the API can be found in the value of the `ApiArn` CloudFormation stack output.
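For example, the ARN can be retrieved with the CLI (assuming the stack is named S3F2):

```bash
aws cloudformation describe-stacks \
  --stack-name S3F2 \
  --query "Stacks[0].Outputs[?OutputKey=='ApiArn'].OutputValue" \
  --output text
```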
Applications deployed using AWS CloudFormation in the same AWS account and region can integrate with Find and Forget by using CloudFormation output values. You can use the solution stack as a nested stack to use its outputs (such as the API URL) as inputs for another application.
Some outputs are also available as exports. You can import these values to use in your own CloudFormation stacks that you deploy following the Find and Forget stack.
Note for using exports: After another stack imports an output value, you can't delete the stack that is exporting the output value or modify the exported output value. All of the imports must be removed before you can delete the exporting stack or modify the output value.
Consult the exporting stack output values guide to review the differences between importing exported values and using nested stacks.
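To see the names and values currently exported in your account and region, you can run:

```bash
aws cloudformation list-exports \
  --query 'Exports[].{Name:Name,Value:Value}' \
  --output table
```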
After Deploying the Solution, your first step should be to configure one or more data mappers, which connect your data to the solution. Identify the S3 bucket containing the data you wish to connect to the solution, then ensure you have defined a table for it in your data catalog and that all existing and future partitions (as they are created) are known to the Data Catalog. Currently AWS Glue is the only supported data catalog provider. For more information on defining your data in the Glue Data Catalog, see Defining Glue Tables. You must define your Table in the Glue Data Catalog in the same region and account as the S3 Find and Forget solution.
For data lakes registered with AWS Lake Formation, you must grant additional permissions in Lake Formation before you can use them with the solution. If you are not using Lake Formation, proceed directly to the Data Mapper creation section.
To grant these permissions in Lake Formation:
1. Grant the role or user that you use to operate the solution the `Describe` permission for all Glue Databases that you will want to use with the solution; then grant the `Describe` and `Select` permissions to the same principal for all Glue Tables that you will want to use with the solution. These permissions are necessary to create data mappers in the web interface.
2. Grant the IAM role that the solution uses to create Data Mappers the `Describe` and `Select` permissions for all Glue Tables that you will want to use with the solution. These permissions allow the solution to access Table metadata when creating a Data Mapper.
3. Locate the two IAM roles that the solution uses to plan and run Athena queries, then grant the `Describe` and `Select` permissions to both principals for all of the tables that you will want to use with the solution. These permissions allow the solution to plan and execute Athena queries during the Find Phase.

When creating a data mapper, you can choose which partition keys the solution will use during the Find phase. A separate Athena query is generated for each combination of partition values (the same query with an additional `WHERE` clause for each combination of partition values). If you have a lot of small partitions, it may be more efficient to choose none or a subset of partition keys from the list in order to increase speed of execution. If instead you have very big partitions, it may be more efficient to choose all the partition keys in order to reduce the probability of failure caused by query timeout. We recommend that the average query does not exceed a few hundred GBs in size or take more than 5 minutes to execute.

As an example, consider 10 years of daily data with partition keys of `year`, `month` and `day` and a total size of 10TB. By declaring `PartitionKeys=[]` (none), a single 10TB query would run during the Find phase, and that may be too much to complete within the 30 minute Athena execution time limit. On the other hand, using all the combinations of the partition keys would produce approximately 3652 queries, each probably very small; given the default Athena concurrency limit of 20, it may take very long to execute all of them. The best choice in this scenario is possibly the `['year','month']` combination, which would result in 120 queries.
Note that the solution will not delete old versions for these objects. This can cause data to be retained longer than intended. Make sure there is some mechanism to handle old versions. One option would be to configure S3 lifecycle policies on non-current versions.
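For example, a noncurrent version expiration rule could be applied with the AWS CLI (a sketch; the bucket name and the 30-day window are assumptions to adapt to your own retention requirements):

```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-data-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "ExpireNoncurrentVersions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
    }]
  }'
```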
You can also create Data Mappers directly via the API. For more information, see the API Documentation.
After configuring a data mapper you must ensure that the S3 Find and Forget solution has the required level of access to the S3 location the data mapper refers to. The recommended way to achieve this is through the use of S3 Bucket Policies.
Note: AWS IAM uses an eventual consistency model and therefore any change you make to IAM, Bucket or KMS Key policies may take time to become visible. Ensure you have allowed time for permissions changes to propagate to all endpoints before starting a job. If your job fails with a status of FIND_FAILED and the `QueryFailed` events indicate S3 permissions issues, you may need to wait for the permissions changes to propagate.
To update the S3 bucket policy to grant read access to the IAM role used by Amazon Athena, and write access to the Data Access IAM role used by AWS Fargate, follow these steps:
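A minimal sketch of the resulting policy applied via the CLI (the bucket name, account ID and Athena role name are placeholders; use the statements generated by the solution's Generate Access Policies feature as the authoritative version):

```bash
aws s3api put-bucket-policy --bucket my-data-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3F2AthenaRead",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:role/MyAthenaQueryRole"},
      "Action": ["s3:GetObject*", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/*"
      ]
    },
    {
      "Sid": "S3F2FargateWrite",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:role/S3F2DataAccessRole"},
      "Action": [
        "s3:GetObject*",
        "s3:PutObject*",
        "s3:DeleteObjectVersion",
        "s3:ListBucket",
        "s3:ListBucketVersions"
      ],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/*"
      ]
    }
  ]
}'
```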
Where the data you are connecting to the solution is encrypted with a Customer Managed CMK rather than an AWS Managed CMK, you must also grant the Athena and Data Access IAM roles access to use the key so that the data can be decrypted when reading and re-encrypted when writing.
Once you have updated the bucket policy as described in Updating the Bucket Policy, choose the KMS Access tab from the Generate Access Policies modal window and follow the instructions to update the key policy with the provided statements. The statements provided are for use when using the policy view in the AWS console or making updates to the key policy via the CLI, CloudFormation or the API. If you wish to use the default view in the AWS console, add the Principals in the provided statements as key users. For more information, see How to Change a Key Policy.
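If you are updating the key policy from the CLI, the flow looks like this (a sketch; the key ID is a placeholder and the edit step applies the statements from the KMS Access tab):

```bash
# Fetch the current key policy (the default policy name is "default")
aws kms get-key-policy \
  --key-id 1234abcd-12ab-34cd-56ef-1234567890ab \
  --policy-name default \
  --output text > key-policy.json

# Edit key-policy.json to append the statements provided by the solution,
# then write the updated policy back to the key
aws kms put-key-policy \
  --key-id 1234abcd-12ab-34cd-56ef-1234567890ab \
  --policy-name default \
  --policy file://key-policy.json
```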
Once your Data Mappers are configured, you can begin adding "Matches" to the Deletion Queue.
Matches can be Simple or Composite.
To add a simple match:
To add a composite match you need to have at least one data mapper with more than one column identifier. Then:
You can also add matches to the Deletion Queue directly via the API. For more information, see the API Documentation.
When the next deletion job runs, the solution will scan the configured columns of your data for any occurrences of the Matches present in the queue at the time the job starts and remove any items where one of the Matches is present.
If across all your data mappers you can find all items related to a single logical entity using the same value, you only need to add one Match value to the deletion queue to delete that logical entity from all data mappers.
If the value used to identify a single logical entity is not consistent across your data mappers, you should add an item to the deletion queue for each distinct value which identifies the logical entity, selecting the specific data mapper(s) to which that value is relevant.
If you make a mistake when adding a Match to the deletion queue, you can remove that match from the queue as long as there is no job running. Once a job has started no items can be removed from the deletion queue until the running job has completed. You may continue to add matches to the queue whilst a job is running, but only matches which were present when the job started will be processed by that job. Once a job completes, only the matches that job has processed will be removed from the queue.
In order to facilitate different teams using a single deployment within an organisation, the same match can be added to the deletion queue more than once. When the job executes, it will merge the lists of data mappers for duplicates in the queue.
Once you have configured your data mappers and added one or more items to the deletion queue, you can start a job.

Once a job has started, you can leave the page and return to view its progress at any point by choosing the job ID from the Deletion Jobs list. The job details page will automatically refresh to display the current status and statistics for the job. For more information on the possible statuses and their meaning, see Deletion Job Statuses.
You can also start jobs and check their status using the API. For more information, see the API Documentation.
Job events are continuously emitted whilst a job is running. These events are used to update the status and statistics for the job. You can view all the emitted events for a job in the Job Events table. Whilst a job is running, the Load More button will continue to be displayed even if no new events have been received. Once a job has finished, the Load More button will disappear once you have loaded all the emitted events. For more information on the events which can be emitted during a job, see Deletion Job Event Types.
To optimise costs, it is best practice when using the solution to start jobs on a regular schedule, rather than every time a single item is added to the Deletion Queue. This is because the marginal cost of the Find phase when deleting an additional item from the queue is far less than re-executing the Find phase (where the data mappers searched are the same). Similarly, the marginal cost of removing an additional match from an object is negligible when there is already at least one match present in the object contents.
Important: Ensure no external processes perform write or delete actions against existing objects whilst a job is running. For more information, consult the Limits guide.
The list of possible job statuses is as follows:

- `QUEUED`: The job has been accepted but has yet to start. Jobs are started asynchronously by a Lambda invoked by the DynamoDB event stream for the Jobs table.
- `RUNNING`: The job is still in progress.
- `FORGET_COMPLETED_CLEANUP_IN_PROGRESS`: The job is still in progress.
- `COMPLETED`: The job finished successfully.
- `COMPLETED_CLEANUP_FAILED`: The job finished successfully, however the deletion queue items could not be removed. You should manually remove these or leave them to be removed on the next job.
- `FORGET_PARTIALLY_FAILED`: The job finished but it was unable to successfully process one or more objects. The Deletion DLQ will contain a message per object that could not be updated.
- `FIND_FAILED`: The job failed during the Find phase as there was an issue querying one or more data mappers.
- `FORGET_FAILED`: The job failed during the Forget phase as there was an issue running the Fargate tasks.
- `FAILED`: An unknown error occurred during the Find and Forget workflow, for example, the Step Functions execution timed out or the execution was manually cancelled.

For more information on how to resolve statuses indicative of errors, consult the Troubleshooting guide.
The list of events is as follows:

- `JobStarted`: Emitted when the deletion job state machine first starts. Causes the status of the job to transition from `QUEUED` to `RUNNING`.
- `FindPhaseStarted`: Emitted when the deletion job has purged any messages from the query and object queues and is ready to begin searching for data.
- `FindPhaseEnded`: Emitted when all queries have executed and written their results to the object queue.
- `FindPhaseFailed`: Emitted when one or more queries fail. Causes the status to transition to `FIND_FAILED`.
- `ForgetPhaseStarted`: Emitted when the Find phase has completed successfully and the Forget phase is starting.
- `ForgetPhaseEnded`: Emitted when the Forget phase has completed. If the Forget phase completes with no errors, this event causes the status to transition to `FORGET_COMPLETED_CLEANUP_IN_PROGRESS`. If the Forget phase completes but there was an error updating one or more objects, this causes the status to transition to `FORGET_PARTIALLY_FAILED`.
- `ForgetPhaseFailed`: Emitted when there was an issue running the Fargate tasks. Causes the status to transition to `FORGET_FAILED`.
- `CleanupSucceeded`: The final event emitted when a job has executed successfully and the Deletion Queue has been cleaned up. Causes the status to transition to `COMPLETED`.
- `CleanupFailed`: The final event emitted when the job executed successfully but there was an error removing the processed matches from the Deletion Queue. Causes the status to transition to `COMPLETED_CLEANUP_FAILED`.
- `CleanupSkipped`: Emitted when the job is finalising and the job status is one of `FIND_FAILED`, `FORGET_FAILED` or `FAILED`.
- `QuerySucceeded`: Emitted whenever a single query executes successfully.
- `QueryFailed`: Emitted whenever a single query fails.
- `ObjectUpdated`: Emitted whenever an updated object is written to S3 and any associated deletions are complete.
- `ObjectUpdateFailed`: Emitted whenever an object cannot be updated, an object version integrity conflict is detected, or an associated deletion fails.
- `ObjectRollbackFailed`: Emitted whenever a rollback (triggered by a detected version integrity conflict) fails.
- `Exception`: Emitted whenever a generic error occurs during the job execution. Causes the status to transition to `FAILED`.

There are several parameters to set when Deploying the Solution which affect the behaviour of the solution in terms of data retention and performance:
- `AthenaConcurrencyLimit`: Increasing the number of concurrent queries that should be executed will decrease the total time spent performing the Find phase. You should not increase this value beyond your account Service Quota for concurrent DML queries, and should ensure that the value set takes into account any other Athena DML queries that may be executing whilst a job is running.
- `DeletionTasksMaxNumber`: Increasing the number of concurrent tasks that should consume messages from the object queue will decrease the total time spent performing the Forget phase.
- `QueryExecutionWaitSeconds`: Decreasing this value will decrease the length of time between each check to see whether a query has completed. You should aim to set this to the ceiling of your average query time; for example, if your average query takes 3.2 seconds, set this to 4.
- `QueryQueueWaitSeconds`: Decreasing this value will decrease the length of time between each check to see whether additional queries can be scheduled during the Find phase. If your jobs fail due to exceeding the Step Functions execution history quota, you may have set this value too low and should increase it to allow more queries to be scheduled after each check.
- `ForgetQueueWaitSeconds`: Decreasing this value will decrease the length of time between each check to see whether the Fargate object queue is empty. If your jobs fail due to exceeding the Step Functions execution history quota, you may have set this value too low.
- `JobDetailsRetentionDays`: Changing this value will change how long job details and events are retained. Set this to 0 to retain them indefinitely.

The values for these parameters are stored in an SSM Parameter Store String Parameter named `/s3f2/S3F2-Configuration` as a JSON object. The recommended approach for updating these values is to perform a Stack Update and change the relevant parameters for the stack.
It is possible to update the SSM Parameter directly; however, this is not a recommended approach. You should not alter the structure or data types of the configuration JSON object.
Once updated, the configuration will affect any future job executions. In progress and previous executions will not be affected. The current configuration values are displayed when confirming that you wish to start a job.
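A minimal sketch of such a stack update via the CLI (assuming the stack is named S3F2 and you only want to change one value):

```bash
aws cloudformation update-stack \
  --stack-name S3F2 \
  --use-previous-template \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --parameters \
    ParameterKey=AthenaConcurrencyLimit,ParameterValue=20 \
    ParameterKey=DeployVpc,UsePreviousValue=true
# Repeat ParameterKey=<name>,UsePreviousValue=true for every other stack parameter
```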
You can only update the vCPUs/memory allocated to Fargate tasks by performing a stack update. For more information, see Updating the Solution.
To benefit from the latest features and improvements, you should update the solution deployed to your account when a new version is published. To find out what the latest version is and what has changed since your currently deployed version, check the Changelog.
How you update the solution depends on the difference between versions. If the new version is a minor upgrade (for instance, from version 3.45 to 3.67) you should deploy using a CloudFormation Stack Update. If the new version is a major upgrade (for instance, from 2.34 to 3.0) you should perform a manual rolling deployment.
Major version releases are made in exceptional circumstances and may contain changes that prohibit backward compatibility. Minor version releases are backward-compatible.
You can find the version of the currently deployed solution by retrieving the `SolutionVersion` output for the solution stack. The solution version is also shown on the Dashboard of the Web UI.
After reviewing the Changelog, obtain the Template Link URL of the latest version from "Deploying the Solution" (it will be similar to `https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml`).

If you wish to deploy a specific version rather than the latest version, replace `latest` in the URL with the chosen version, for instance `https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/v0.2/template.yaml`.
To deploy via AWS Console:
On the Review stack screen, you must check the boxes for:

- "I acknowledge that AWS CloudFormation might create IAM resources"
- "I acknowledge that AWS CloudFormation might create IAM resources with custom names"
These are required to allow CloudFormation to create a Role to allow access to resources needed by the stack and name the resources in a dynamic way.
To deploy via the AWS CLI, consult the documentation.
The process for a manual rolling deployment is as follows:
The steps for performing this process are:
To delete a stack via AWS Console:
To delete a stack via the AWS CLI, consult the documentation.