
User Guide

This section describes how to install, configure and use the Amazon S3 Find and Forget solution.

Index

  • Pre-requisites
  • Deploying the Solution
  • Accessing the application
  • Configuring Data Mappers
  • Granting Access to Data
  • Adding to the Deletion Queue
  • Running a Deletion Job
  • Deletion Job Statuses
  • Deletion Job Event Types
  • Adjusting Configuration
  • Updating the Solution
  • Deleting the Solution

Pre-requisites

Configuring a VPC for the Solution

The Fargate tasks used by this solution to perform deletions must be able to access the following AWS services, either via an Internet Gateway or via VPC Endpoints:

  • Amazon S3 (gateway endpoint com.amazonaws.region.s3)
  • Amazon DynamoDB (gateway endpoint com.amazonaws.region.dynamodb)
  • Amazon CloudWatch Monitoring (interface endpoint com.amazonaws.region.monitoring) and Logs (interface endpoint com.amazonaws.region.logs)
  • AWS ECR API (interface endpoint com.amazonaws.region.ecr.api) and Docker (interface endpoint com.amazonaws.region.ecr.dkr)
  • Amazon SQS (interface endpoint com.amazonaws.region.sqs)
  • AWS STS (interface endpoint com.amazonaws.region.sts)
  • AWS KMS (interface endpoint com.amazonaws.region.kms) - required only if S3 Objects are encrypted using AWS KMS client-side encryption
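
If you are preparing an existing VPC manually, the required endpoints can be created with the AWS CLI. The following is a minimal sketch only: the VPC, route table, subnet and security group IDs are placeholders, and the interface endpoint command should be repeated for each interface endpoint listed above. The VPC created by the solution (see the next section) provisions these endpoints for you.

# Gateway endpoint (example: S3); repeat for DynamoDB
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.eu-west-1.s3 \
  --route-table-ids rtb-0123456789abcdef0

# Interface endpoint (example: SQS); repeat for monitoring, logs, ecr.api, ecr.dkr, sts and, if required, kms
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.eu-west-1.sqs \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled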

Creating a New VPC

By default the CloudFormation template will create a new VPC that has been purpose-built for the solution. The VPC includes VPC endpoints for the aforementioned services, and does not provision internet connectivity.

You can use the provided VPC to operate the solution with no further customisations. However, if you have more complex requirements it is recommended to use an existing VPC as described in the following section.

Using an Existing VPC

Amazon S3 Find and Forget can also be used in an existing VPC. You may want to do this if you have requirements that aren't met by using the VPC provided with the solution.

To use an existing VPC, set the DeployVpc parameter to false when launching the solution CloudFormation stack. You must also specify the subnet and security groups that the Fargate tasks will use by setting the VpcSubnets and VpcSecurityGroups parameters respectively.

The subnets and security groups that you specify must allow the Fargate tasks to connect to the aforementioned AWS services. When deploying the solution, ensure you select subnets and security groups which permit access to those services and that you set DeployVpc to false.

You can obtain your subnet and security group IDs from the AWS Console or by using the AWS CLI. If using the AWS CLI, you can use the following command to get a list of VPCs:

aws ec2 describe-vpcs \
  --query 'Vpcs[*].{ID:VpcId,Name:Tags[?Key==`Name`].Value | [0], IsDefault: IsDefault}'

Once you have found the VPC you wish to use, to get a list of subnets and security groups in that VPC:

export VPC_ID=<chosen-vpc-id>
aws ec2 describe-subnets \
  --filter Name=vpc-id,Values="$VPC_ID" \
  --query 'Subnets[*].{ID:SubnetId,Name:Tags[?Key==`Name`].Value | [0],AZ:AvailabilityZone}'
aws ec2 describe-security-groups \
  --filter Name=vpc-id,Values="$VPC_ID" \
  --query 'SecurityGroups[*].{ID:GroupId,Name:GroupName}'

Provisioning Data Access IAM Roles

The Fargate tasks used by this solution to perform deletions require a specific IAM role to exist in each account that owns a bucket that you will use with the solution. The role must have the exact name S3F2DataAccessRole (no path). A CloudFormation template is available as part of this solution which can be deployed separately to the main stack in each account. A way to deploy this role to many accounts, for example across your organization, is to use AWS CloudFormation StackSets.

To deploy this template manually, use the IAM Role Template "Deploy to AWS" button in Deploying the Solution, then follow steps 5-9. The Outputs tab will contain the Role ARN, which you will need when adding data mappers.

You will need to grant this role read and write access to your data. We recommend you do this using a bucket policy. For more information, see Granting Access to Data.
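
If you prefer to deploy the role template with the AWS CLI rather than the console, a minimal sketch is shown below. The stack name and local template file name are illustrative; use the IAM Role Template Link for your region from Deploying the Solution to obtain the template. CAPABILITY_NAMED_IAM is required because the role uses the fixed name S3F2DataAccessRole.

aws cloudformation create-stack \
  --stack-name S3F2DataAccessRole \
  --template-body file://s3f2-data-access-role.yaml \
  --capabilities CAPABILITY_NAMED_IAM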

Deploying the Solution

The solution is deployed as an AWS CloudFormation template and should take about 20 to 40 minutes to deploy.

Your access to the AWS account must have IAM permissions to launch AWS CloudFormation templates that create IAM roles and to create the solution resources.

Note You are responsible for the cost of the AWS services used while running this solution. For full details, see the pricing pages for each AWS service you will be using in this sample. Prices are subject to change.

  1. Deploy the latest CloudFormation template using the AWS Console by choosing the "Launch Template" button below for your preferred AWS region. If you wish to deploy using the AWS CLI instead, you can refer to the "Template Link" to download the template files.
Region | Launch Template | Template Link | Launch IAM Role Template | IAM Role Template Link
--- | --- | --- | --- | ---
US East (N. Virginia) (us-east-1) | Launch | Link | Launch | Link
US East (Ohio) (us-east-2) | Launch | Link | Launch | Link
US West (Oregon) (us-west-2) | Launch | Link | Launch | Link
Asia Pacific (Sydney) (ap-southeast-2) | Launch | Link | Launch | Link
Asia Pacific (Tokyo) (ap-northeast-1) | Launch | Link | Launch | Link
EU (Ireland) (eu-west-1) | Launch | Link | Launch | Link
EU (London) (eu-west-2) | Launch | Link | Launch | Link
EU (Frankfurt) (eu-central-1) | Launch | Link | Launch | Link
EU (Stockholm) (eu-north-1) | Launch | Link | Launch | Link
  2. If prompted, login using your AWS account credentials.
  3. You should see a screen titled "Create Stack" at the "Specify template" step. The fields specifying the CloudFormation template are pre-populated. Choose the Next button at the bottom of the page.
  4. On the "Specify stack details" screen you should provide values for the following parameters of the CloudFormation stack:

    • Stack Name: (Default: S3F2) This is the name that is used to refer to this stack in CloudFormation once deployed.
    • AdminEmail: The email address you wish to setup as the initial user of this Amazon S3 Find and Forget deployment.
    • DeployWebUI: (Default: true) Whether to deploy the Web UI as part of the solution. If set to true, the AuthMethod parameter must be set to Cognito. If set to false, interaction with the solution is performed via the API Gateway only.
    • AuthMethod: (Default: Cognito) The authentication method to be used for the solution. Must be set to Cognito if DeployWebUI is true.

The following parameters are optional and allow further customisation of the solution if required:

  • DeployVpc: (Default: true) Whether to deploy the solution provided VPC. If you wish to use your own VPC, set this value to false. The solution provided VPC uses VPC Endpoints to access the required services which will incur additional costs. For more details, see the VPC Endpoint Pricing page.
  • VpcSecurityGroups: (Default: "") List of security group IDs to apply to Fargate deletion tasks. For more information on how to obtain these IDs, see Configuring a VPC for the Solution. If DeployVpc is true, this parameter is ignored.
  • VpcSubnets: (Default: "") List of subnets to run Fargate deletion tasks in. For more information on how to obtain these IDs, see Configuring a VPC for the Solution. If DeployVpc is true, this parameter is ignored.
  • FlowLogsGroup: (Default: "") If using the solution provided VPC, defines the CloudWatch Log group which should be used for flow logs. If not set, flow logs will not be enabled. If DeployVpc is false, this parameter is ignored. Enabling flow logs will incur additional costs. See the CloudWatch Logs Pricing page for the associated costs.
  • FlowLogsRoleArn: (Default: "") If using the solution provided VPC, defines which IAM Role should be used to send flow logs to CloudWatch. If not set, flow logs will not be enabled. If DeployVpc is false, this parameter is ignored.
  • CreateCloudFrontDistribution: (Default: true) Creates a CloudFront distribution for accessing the web interface of the solution.
  • AccessControlAllowOriginOverride: (Default: false) Allows overriding the origin from which the API can be called. If 'false' is provided, the API will only accept requests from the Web UI origin.
  • AthenaConcurrencyLimit: (Default: 20) The number of concurrent Athena queries the solution will run when scanning your data lake.
  • AthenaQueryMaxRetries: (Default: 2) Max number of retries to each Athena query after a failure
  • DeletionTasksMaxNumber: (Default: 3) Max number of concurrent Fargate tasks to run when performing deletions.
  • DeletionTaskCPU: (Default: 4096) Fargate task CPU limit. For more info see Fargate Configuration
  • DeletionTaskMemory: (Default: 30720) Fargate task memory limit. For more info see Fargate Configuration
  • QueryExecutionWaitSeconds: (Default: 3) How long to wait when checking if an Athena Query has completed.
  • QueryQueueWaitSeconds: (Default: 3) How long to wait when checking whether the current number of executing queries is below the specified concurrency limit.
  • ForgetQueueWaitSeconds: (Default: 30) How long to wait when checking if the Forget phase is complete
  • AccessLogsBucket: (Default: "") The name of the bucket to use for storing the Web UI access logs. Leave blank to disable UI access logging. Ensure the provided bucket has the appropriate permissions configured. For more information see CloudFront Access Logging Permissions if CreateCloudFrontDistribution is set to true, or S3 Access Logging Permissions if not.
  • CognitoAdvancedSecurity: (Default: "OFF") The setting to use for Cognito advanced security. Allowed values for this parameter are: OFF, AUDIT and ENFORCED. For more information on this parameter, see Cognito Advanced Security.
  • EnableAPIAccessLogging: (Default: false) Whether to enable access logging via CloudWatch Logs for API Gateway. Enabling this feature will incur additional costs.
  • EnableContainerInsights: (Default: false) Whether to enable CloudWatch Container Insights.
  • JobDetailsRetentionDays: (Default: 0) How long job records should remain in the Job table and how long job manifests should remain in the S3 manifests bucket. Use 0 to retain data indefinitely. Note: if the retention setting is changed it will only apply to new deletion jobs in DynamoDB; existing deletion jobs will retain the TTL set at the time they ran. The policy will, however, apply immediately to new and existing job manifests in S3.
  • EnableDynamoDBBackups: (Default: false) Whether to enable DynamoDB Point-in-Time Recovery for the DynamoDB tables. Enabling this feature will incur additional costs. See the DynamoDB Pricing page for the associated costs.
  • RetainDynamoDBTables: (Default: true) Whether to retain the DynamoDB tables upon Stack Update and Stack Deletion.
  • AthenaWorkGroup: (Default: primary) The Athena work group that should be used for when the solution runs Athena queries.
  • PreBuiltArtefactsBucketOverride: (Default: false) Overrides the default Bucket containing Front-end and Back-end pre-built artefacts. Use this if you are using a customised version of these artefacts.
  • ResourcePrefix: (Default: S3F2) Resource prefix to apply to resource names when creating statically named resources.
  • KMSKeyArns (Default: "") Comma-delimited list of KMS Key Arns used for Client-side Encryption. Leave empty if data is not client-side encrypted with KMS.

When completed, click Next

  5. Configure stack options if desired, then click Next.
  6. On the review screen, you must check the boxes for:

    • "I acknowledge that AWS CloudFormation might create IAM resources"
    • "I acknowledge that AWS CloudFormation might create IAM resources with custom names"
    • "_I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTOEXPAND"

These are required to allow CloudFormation to create a Role to allow access to resources needed by the stack and name the resources in a dynamic way.

  7. Choose Create Stack
  8. Wait for the CloudFormation stack to launch. Completion is indicated when the "Stack status" is "CREATE_COMPLETE".
    • You can monitor the stack creation progress in the "Events" tab.
  9. Note the WebUIUrl displayed in the Outputs tab for the stack. This is used to access the application.
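
If you prefer to deploy using the AWS CLI (as mentioned in step 1), the following is a minimal sketch. The template URL shown is the example "latest" URL from Identify the Stack URL to deploy; the stack name and parameter values are illustrative, and further parameters can be added as needed.

aws cloudformation create-stack \
  --stack-name S3F2 \
  --template-url https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml \
  --parameters ParameterKey=AdminEmail,ParameterValue=admin@example.com \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND

aws cloudformation wait stack-create-complete --stack-name S3F2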

Accessing the application

The solution provides a web user interface and a REST API to allow you to integrate it in your own applications. If you have chosen not to deploy the Web UI you will need to use the API to interface with the solution.

Logging in for the first time (only relevant if the Web UI is deployed)

  1. Note the WebUIUrl displayed in the Outputs tab for the stack. This is used to access the application.
  2. When accessing the web user interface for the first time, you will be prompted to insert a username and a password. In the username field, enter the admin e-mail specified during stack creation. In the password field, enter the temporary password sent by the system to the admin e-mail. Then select "Sign In".
  3. Next, you will need to reset the password. Enter a new password and then select "Submit".
  4. Now you should be able to access all the functionalities.

Managing users (only relevant if Cognito is chosen for authentication)

To add more users to the application:

  1. Access the Cognito Console and choose "Manage User Pools".
  2. Select the solution's User Pool (its name is displayed as CognitoUserPoolName in the Outputs tab for the CloudFormation stack).
  3. Select "Users and Groups" from the menu on the right.
  4. Use this page to create or manage users. For more information, consult the Managing Users in User Pools Guide.

Making authenticated API requests

To use the API directly, you will need to authenticate requests using the Cognito User Pool or IAM. The method for authenticating differs depending on which authentication option was chosen:

Cognito

After resetting the password via the UI, you can make authenticated requests using the AWS CLI:

  1. Note the CognitoUserPoolId, CognitoUserPoolClientId and ApiUrl parameters displayed in the Outputs tab for the stack.
  2. Take note of the Cognito user email and password.
  3. Generate a token by running this command with the values you noted in the previous steps:
   aws cognito-idp admin-initiate-auth \
     --user-pool-id "$COGNITO_USER_POOL_ID" \
     --client-id "$COGNITO_USER_POOL_CLIENT_ID" \
     --auth-flow ADMIN_NO_SRP_AUTH \
     --auth-parameters USERNAME="$USER_EMAIL_ADDRESS",PASSWORD="$USER_PASSWORD"
  4. Use the IdToken generated by the previous command to make an authenticated request to the API. For instance, the following command will show the matches in the deletion queue:
   curl $API_URL/v1/queue -H "Authorization: Bearer $ID_TOKEN"
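
To script both steps together, you can capture the token using the CLI's --query option. This is a sketch which assumes the password has already been permanently set (so no challenge is returned); the variable names are the same placeholders as above.

ID_TOKEN=$(aws cognito-idp admin-initiate-auth \
  --user-pool-id "$COGNITO_USER_POOL_ID" \
  --client-id "$COGNITO_USER_POOL_CLIENT_ID" \
  --auth-flow ADMIN_NO_SRP_AUTH \
  --auth-parameters USERNAME="$USER_EMAIL_ADDRESS",PASSWORD="$USER_PASSWORD" \
  --query 'AuthenticationResult.IdToken' --output text)

curl "$API_URL/v1/queue" -H "Authorization: Bearer $ID_TOKEN"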

For more information, consult the Cognito REST API integration guide.

IAM

IAM authentication for API requests uses the Signature Version 4 signing process. Add the resulting signature to the Authorization header when making requests to the API.

Use the Sigv4 process linked above to generate the Authorization header value and then call the API as normal:

curl $API_URL/v1/queue -H "Authorization: $Sigv4Auth"

IAM authentication can be used anywhere you have AWS credentials with the correct permissions; this could be an IAM User or an assumed IAM Role.
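
As an alternative to constructing the header yourself, recent versions of curl (7.75 and later) can perform the SigV4 signing for you. This is a sketch only: it assumes long-lived credentials are available in the environment and that the API is deployed in eu-west-1; when using temporary credentials you must also pass the session token in an x-amz-security-token header.

curl "$API_URL/v1/queue" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  --aws-sigv4 "aws:amz:eu-west-1:execute-api"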

Please refer to the API Gateway documentation on controlling access with IAM to understand how to define an IAM policy that matches your requirements. The ARN for the API can be found in the value of the ApiArn CloudFormation Stack Output.

Integrating the solution with other applications using CloudFormation stack outputs

Applications deployed using AWS CloudFormation in the same AWS account and region can integrate with Find and Forget by using CloudFormation output values. You can use the solution stack as a nested stack to use its outputs (such as the API URL) as inputs for another application.

Some outputs are also available as exports. You can import these values to use in your own CloudFormation stacks that you deploy following the Find and Forget stack.

Note for using exports: After another stack imports an output value, you can't delete the stack that is exporting the output value or modify the exported output value. All of the imports must be removed before you can delete the exporting stack or modify the output value.

Consult the exporting stack output values guide to review the differences between importing exported values and using nested stacks.
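
To see which output values the stack exports, you can list them with the AWS CLI. A sketch; the JMESPath filter on the exporting stack ID is illustrative and assumes the default stack name of S3F2.

aws cloudformation list-exports \
  --query "Exports[?contains(ExportingStackId, 'S3F2')].[Name, Value]" \
  --output table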

Configuring Data Mappers

After Deploying the Solution, your first step should be to configure one or more data mappers which will connect your data to the solution. Identify the S3 Bucket containing the data you wish to connect to the solution and ensure you have defined a table in your data catalog and that all existing and future partitions (as they are created) are known to the Data Catalog. Currently AWS Glue is the only supported data catalog provider. For more information on defining your data in the Glue Data Catalog, see Defining Glue Tables. You must define your Table in the Glue Data Catalog in the same region and account as the S3 Find and Forget solution.
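
To confirm that a table is correctly defined in the Glue Data Catalog of the same account and region, you can inspect it with the AWS CLI. A sketch with placeholder database and table names:

aws glue get-table --database-name my_database --name my_table \
  --query 'Table.{Location:StorageDescriptor.Location,Columns:StorageDescriptor.Columns[].Name,PartitionKeys:PartitionKeys[].Name}'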

AWS Lake Formation Configuration

For data lakes registered with AWS Lake Formation, you must grant additional permissions in Lake Formation before you can use them with the solution. If you are not using Lake Formation, proceed directly to the Data Mapper creation section.

To grant these permissions in Lake Formation:

  1. Using the WebUIRole output from the solution CloudFormation stack as the IAM principal, use the Lake Formation Data Permissions Console to grant the Describe permission for all Glue Databases that you will want to use with the solution; then grant the Describe and Select permissions to the role for all Glue Tables that you will want to use with the solution. These permissions are necessary to create data mappers in the web interface.
  2. Using the PutDataMapperRole output from the solution CloudFormation stack as the IAM principal, use the Lake Formation Data Permissions Console to grant Describe and Select permissions for all Glue Tables that you will want to use with the solution. These permissions allow the solution to access Table metadata when creating a Data Mapper.
  3. Using the AthenaExecutionRole and GenerateQueriesRole outputs from the solution CloudFormation stack as IAM principals, use the Lake Formation Data Permissions Console to grant the Describe and Select permissions to both principals for all of the tables that you will want to use with the solution. These permissions allow the solution to plan and execute Athena queries during the Find Phase.
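
The same grants can be scripted with the AWS CLI instead of the console. A sketch for a single principal and table; the role ARN, database name and table name are placeholders, and the command should be repeated for each principal, database and table combination described above.

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/S3F2-AthenaExecutionRole \
  --permissions "DESCRIBE" "SELECT" \
  --resource '{"Table": {"DatabaseName": "my_database", "Name": "my_table"}}'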

Data Mapper Creation

  1. Access the application UI via the WebUIUrl displayed in the Outputs tab for the stack.
  2. Choose Data Mappers from the menu then choose Create Data Mapper
  3. On the Create Data Mapper page input a Name to uniquely identify this Data Mapper.
  4. Select a Query Executor Type then choose the Database and Table in your data catalog which describes the target data in S3. A list of columns will be displayed for the chosen Table.
  5. From the Partition Keys list, select the partition key(s) that you want the solution to use when generating queries. If you select none, only one query will be run for the data mapper. The more keys you select, the greater the number of smaller queries (the same query is repeated with an additional WHERE clause for each combination of partition values). If you have many small partitions, it may be more efficient to choose none or a subset of partition keys in order to increase execution speed. If instead you have very large partitions, it may be more efficient to choose all the partition keys in order to reduce the probability of failure caused by query timeout. We recommend that the average query scans no more than a few hundred GB and takes no more than 5 minutes.

As an example, consider 10 years of daily data with partition keys of year, month and day and a total size of 10TB. Declaring PartitionKeys=[] (none) would run a single 10TB query during the Find phase, which may be too much to complete within the 30-minute limit on Athena execution time. On the other hand, using all combinations of the partition keys would produce approximately 3652 queries, each probably very small; given the default Athena concurrency limit of 20, it could take a very long time to execute them all. The best choice in this scenario is probably the ['year','month'] combination, which would result in 120 queries.

  6. From the columns list, choose the column(s) the solution should use to find items in the data which should be deleted. For example, if your table has three columns named customer_id, description and created_at and you want to search for items using the customer_id, you should choose only the customer_id column from this list.
  7. Enter the ARN of the role for Fargate to assume when modifying objects in S3 buckets. This role should already exist if you have followed the Provisioning Data Access IAM Roles steps.
  8. If you do not want the solution to delete all object versions older than the newly created version, deselect Delete previous object versions after update. By default the solution will delete all previous versions after creating a new version.
  9. If you want the solution to ignore Object Not Found exceptions, select Ignore object not found exceptions during deletion. By default deletion jobs will fail if any objects that are found by the Find phase don't exist in the Delete phase. This setting can be useful if you have some other system deleting objects from the bucket, for example S3 lifecycle policies.

Note that the solution will not delete old versions for these objects. This can cause data to be retained longer than intended. Make sure there is some mechanism to handle old versions. One option would be to configure S3 lifecycle policies on non-current versions.

  10. Choose Create Data Mapper.
  11. A message is displayed advising you to update the S3 Bucket Policy for the S3 Bucket referenced by the newly created data mapper. See Granting Access to Data for more information on how to do this. Choose Return to Data Mappers.

You can also create Data Mappers directly via the API. For more information, see the API Documentation.

Granting Access to Data

After configuring a data mapper you must ensure that the S3 Find and Forget solution has the required level of access to the S3 location the data mapper refers to. The recommended way to achieve this is through the use of S3 Bucket Policies.

Note: AWS IAM uses an eventual consistency model and therefore any change you make to IAM, Bucket or KMS Key policies may take time to become visible. Ensure you have allowed time for permissions changes to propagate to all endpoints before starting a job. If your job fails with a status of FIND_FAILED and the QueryFailed events indicate S3 permissions issues, you may need to wait for the permissions changes to propagate.

Updating your Bucket Policy

To update the S3 bucket policy to grant read access to the IAM role used by Amazon Athena, and write access to the Data Access IAM role used by AWS Fargate, follow these steps:

  1. Access the application UI via the WebUIUrl displayed in the Outputs tab for the stack.
  2. Choose Data Mappers from the menu then choose the radio button for the relevant data mapper from the Data Mappers list.
  3. Choose Generate Access Policies and follow the instructions on the Bucket Access tab to update the bucket policy. If you already have a bucket policy in place, add the statements shown to your existing bucket policy rather than replacing it completely. If your data is encrypted with a Customer Managed CMK rather than an AWS Managed CMK, see Data Encrypted with a Customer Managed CMK to grant the solution access to the key. For more information on using Server-Side Encryption (SSE) with S3, see Using SSE with CMKs.
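
If you prefer to apply the policy from the command line rather than the console, a sketch is shown below. The bucket name and local file name are placeholders; the statements shown in the Bucket Access tab must be merged into the policy document before it is applied.

aws s3api get-bucket-policy --bucket my-data-bucket --query Policy --output text > bucket-policy.json
# merge the statements from the Bucket Access tab into bucket-policy.json, then:
aws s3api put-bucket-policy --bucket my-data-bucket --policy file://bucket-policy.json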

Data Encrypted with a Customer Managed CMK

Where the data you are connecting to the solution is encrypted with a Customer Managed CMK rather than an AWS Managed CMK, you must also grant the Athena and Data Access IAM roles access to use the key so that the data can be decrypted when reading and re-encrypted when writing.

Once you have updated the bucket policy as described in Updating your Bucket Policy, choose the KMS Access tab from the Generate Access Policies modal window and follow the instructions to update the key policy with the provided statements. The statements provided are for use with the policy view in the AWS console, or when making updates to the key policy via the CLI, CloudFormation or the API. If you wish to use the default view in the AWS console instead, add the Principals in the provided statements as key users. For more information, see How to Change a Key Policy.
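
The key policy can likewise be updated from the command line. A sketch; the key ID and local file name are placeholders, and the statements from the KMS Access tab must be added to the policy document before it is applied.

aws kms get-key-policy --key-id 1234abcd-12ab-34cd-56ef-1234567890ab \
  --policy-name default --query Policy --output text > key-policy.json
# add the statements from the KMS Access tab to key-policy.json, then:
aws kms put-key-policy --key-id 1234abcd-12ab-34cd-56ef-1234567890ab \
  --policy-name default --policy file://key-policy.json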

Adding to the Deletion Queue

Once your Data Mappers are configured, you can begin adding "Matches" to the Deletion Queue.

  1. Access the application UI via the WebUIUrl displayed in the Outputs tab for the stack.
  2. Choose Deletion Queue from the menu then choose Add Match to the Deletion Queue.

Matches can be Simple or Composite.

  • A Simple match is a single value to be matched against the column identifier(s) of one or more data mappers. For instance, the value 12345 to be matched against the customer_id column of DataMapperA or the admin_id column of DataMapperB.
  • A Composite match consists of one or more values to be matched against specific column identifiers of a multi-column data mapper. For instance, the tuple John and Doe to be matched against the first_name and last_name columns of DataMapperC.

To add a simple match:

  1. Choose Simple as Match Type
  2. Input a Match, which is the value to search for in your data mappers. If you wish to search for the match from all data mappers choose All Data Mappers, otherwise choose Select your Data Mappers then select the relevant data mappers from the list.
  3. Choose Add Item to the Deletion Queue and confirm you can see the match in the Deletion Queue.

To add a composite match you need to have at least one data mapper with more than one column identifier. Then:

  1. Choose Composite as Match Type
  2. Select the Data Mapper from the List
  3. Select all the columns (at least one) that you want to map to a match and then provide a value for each of them. Empty is a valid value.
  4. Choose Add Item to the Deletion Queue and confirm you can see the match in the Deletion Queue.

You can also add matches to the Deletion Queue directly via the API. For more information, see the API Documentation.

When the next deletion job runs, the solution will scan the configured columns of your data for any occurrences of the Matches present in the queue at the time the job starts and remove any items where one of the Matches is present.

If across all your data mappers you can find all items related to a single logical entity using the same value, you only need to add one Match value to the deletion queue to delete that logical entity from all data mappers.

If the value used to identify a single logical entity is not consistent across your data mappers, you should add an item to the deletion queue for each distinct value which identifies the logical entity, selecting the specific data mapper(s) to which that value is relevant.

If you make a mistake when adding a Match to the deletion queue, you can remove that match from the queue as long as there is no job running. Once a job has started no items can be removed from the deletion queue until the running job has completed. You may continue to add matches to the queue whilst a job is running, but only matches which were present when the job started will be processed by that job. Once a job completes, only the matches that job has processed will be removed from the queue.

In order to facilitate different teams using a single deployment within an organisation, the same match can be added to the deletion queue more than once. When the job executes, it will merge the lists of data mappers for duplicates in the queue.

Running a Deletion Job

Once you have configured your data mappers and added one or more items to the deletion queue, you can start a job.

  1. Access the application UI via the WebUIUrl displayed in the Outputs tab for the stack.
  2. Choose Deletion Jobs from the menu and ensure there are no jobs currently running. Choose Start a Deletion Job and review the settings displayed on the screen. For more information on how to edit these settings, see Adjusting Configuration.
  3. If you are happy with the current solution configuration choose Start a Deletion Job. The job details page should be displayed.

Once a job has started, you can leave the page and return to view its progress at any point by choosing the job ID from the Deletion Jobs list. The job details page will automatically refresh to display the current status and statistics for the job. For more information on the possible statuses and their meaning, see Deletion Job Statuses.

You can also start jobs and check their status using the API. For more information, see the API Documentation.

Job events are continuously emitted whilst a job is running. These events are used to update the status and statistics for the job. You can view all the emitted events for a job in the Job Events table. Whilst a job is running, the Load More button will continue to be displayed even if no new events have been received. Once a job has finished, the Load More button will disappear once you have loaded all the emitted events. For more information on the events which can be emitted during a job, see Deletion Job Event Types

To optimise costs, it is best practice when using the solution to start jobs on a regular schedule, rather than every time a single item is added to the Deletion Queue. This is because the marginal cost of the Find phase when deleting an additional item from the queue is far less than re-executing the Find phase (where the data mappers searched are the same). Similarly, the marginal cost of removing an additional match from an object is negligible when there is already at least one match present in the object contents.

Important

Ensure no external processes perform write/delete actions against existing objects whilst a job is running. For more information, consult the Limits guide.

Deletion Job Statuses

The list of possible job statuses is as follows:

  • QUEUED: The job has been accepted but has yet to start. Jobs are started asynchronously by a Lambda invoked by the DynamoDB event stream for the Jobs table.
  • RUNNING: The job is still in progress.
  • FORGET_COMPLETED_CLEANUP_IN_PROGRESS: The job is still in progress.
  • COMPLETED: The job finished successfully.
  • COMPLETED_CLEANUP_FAILED: The job finished successfully, however the deletion queue items could not be removed. You should remove these manually or leave them to be removed by the next job.
  • FORGET_PARTIALLY_FAILED: The job finished but it was unable to successfully process one or more objects. The Deletion DLQ will contain a message per object that could not be updated.
  • FIND_FAILED: The job failed during the Find phase as there was an issue querying one or more data mappers.
  • FORGET_FAILED: The job failed during the Forget phase as there was an issue running the Fargate tasks.
  • FAILED: An unknown error occurred during the Find and Forget workflow, for example, the Step Functions execution timed out or the execution was manually cancelled.

For more information on how to resolve statuses indicative of errors, consult the Troubleshooting guide.

Deletion Job Event Types

The list of events is as follows:

  • JobStarted: Emitted when the deletion job state machine first starts. Causes the status of the job to transition from QUEUED to RUNNING
  • FindPhaseStarted: Emitted when the deletion job has purged any messages from the query and object queues and is ready to begin searching for data.
  • FindPhaseEnded: Emitted when all queries have executed and written their results to the objects queue.
  • FindPhaseFailed: Emitted when one or more queries fail. Causes the status to transition to FIND_FAILED.
  • ForgetPhaseStarted: Emitted when the Find phase has completed successfully and the Forget phase is starting.
  • ForgetPhaseEnded: Emitted when the Forget phase has completed. If the Forget phase completes with no errors, this event causes the status to transition to FORGET_COMPLETED_CLEANUP_IN_PROGRESS. If the Forget phase completes but there was an error updating one or more objects, this causes the status to transition to FORGET_PARTIALLY_FAILED.
  • ForgetPhaseFailed: Emitted when there was an issue running the Fargate tasks. Causes the status to transition to FORGET_FAILED.
  • CleanupSucceeded: The final event emitted when a job has executed successfully and the Deletion Queue has been cleaned up. Causes the status to transition to COMPLETED.
  • CleanupFailed: The final event emitted when the job executed successfully but there was an error removing the processed matches from the Deletion Queue. Causes the status to transition to COMPLETED_CLEANUP_FAILED.
  • CleanupSkipped: Emitted when the job is finalising and the job status is one of FIND_FAILED, FORGET_FAILED or FAILED.
  • QuerySucceeded: Emitted whenever a single query executes successfully.
  • QueryFailed: Emitted whenever a single query fails.
  • ObjectUpdated: Emitted whenever an updated object is written to S3 and any associated deletions are complete.
  • ObjectUpdateFailed: Emitted whenever an object cannot be updated, an object version integrity conflict is detected or an associated deletion fails.
  • ObjectRollbackFailed: Emitted whenever a rollback (triggered by a detected version integrity conflict) fails.
  • Exception: Emitted whenever a generic error occurs during the job execution. Causes the status to transition to FAILED.

Adjusting Configuration

There are several parameters to set when Deploying the Solution which affect the behaviour of the solution in terms of data retention and performance:

  • AthenaConcurrencyLimit: Increasing the number of concurrent queries that should be executed will decrease the total time spent performing the Find phase. You should not increase this value beyond your account Service Quota for concurrent DML queries, and should ensure that the value set takes into account any other Athena DML queries that may be executing whilst a job is running.
  • DeletionTasksMaxNumber: Increasing the number of concurrent tasks that should consume messages from the object queue will decrease the total time spent performing the Forget phase.
  • QueryExecutionWaitSeconds: Decreasing this value will decrease the length of time between each check to see whether a query has completed. You should aim to set this to the "ceiling function" of your average query time. For example, if your average query takes 3.2 seconds, set this to 4.
  • QueryQueueWaitSeconds: Decreasing this value will decrease the length of time between each check to see whether additional queries can be scheduled during the Find phase. If your jobs fail due to exceeding the Step Functions execution history quota, you may have set this value too low and should increase it to allow more queries to be scheduled after each check.
  • ForgetQueueWaitSeconds: Decreasing this value will decrease the length of time between each check to see whether the Fargate object queue is empty. If your jobs fail due to exceeding the Step Functions execution history quota, you may have set this value too low.
  • JobDetailsRetentionDays: Changing this value will change how long job details and event records are retained. Set this to 0 to retain them indefinitely.

The values for these parameters are stored in an SSM Parameter Store String Parameter named /s3f2/S3F2-Configuration as a JSON object. The recommended approach for updating these values is to perform a Stack Update and change the relevant parameters for the stack.

It is possible to update the SSM Parameter directly however this is not a recommended approach. You should not alter the structure or data types of the configuration JSON object.
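
To inspect the current configuration without modifying it, you can read the parameter with the AWS CLI, for example:

aws ssm get-parameter \
  --name /s3f2/S3F2-Configuration \
  --query Parameter.Value --output text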

Once updated, the configuration will affect any future job executions. In progress and previous executions will not be affected. The current configuration values are displayed when confirming that you wish to start a job.

You can only update the vCPUs/memory allocated to Fargate tasks by performing a stack update. For more information, see Updating the Solution.

Updating the Solution

To benefit from the latest features and improvements, you should update the solution deployed to your account when a new version is published. To find out what the latest version is and what has changed since your currently deployed version, check the Changelog.

How you update the solution depends on the difference between versions. If the new version is a minor upgrade (for instance, from version 3.45 to 3.67) you should deploy using a CloudFormation Stack Update. If the new version is a major upgrade (for instance, from 2.34 to 3.0) you should perform a manual rolling deployment.

Major version releases are made in exceptional circumstances and may contain changes that prohibit backward compatibility. Minor versions releases are backward-compatible.

Identify current solution version

You can find the version of the currently deployed solution by retrieving the SolutionVersion output for the solution stack. The solution version is also shown on the Dashboard of the Web UI.
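
You can also retrieve this output with the AWS CLI; a sketch assuming the default stack name of S3F2:

aws cloudformation describe-stacks --stack-name S3F2 \
  --query "Stacks[0].Outputs[?OutputKey=='SolutionVersion'].OutputValue" --output text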

Identify the Stack URL to deploy

After reviewing the Changelog, obtain the Template Link url of the latest version from "Deploying the Solution" (it will be similar to https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml). If you wish to deploy a specific version rather than the latest version, replace latest from the url with the chosen version, for instance https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/v0.2/template.yaml.

Minor Upgrades: Perform CloudFormation Stack Update

To deploy via AWS Console:

  1. Open the CloudFormation Console Page and choose the Solution by selecting the stack's radio button, then choose "Update"
  2. Choose "Replace current template" and then input the template URL for the version you wish to deploy in the "Amazon S3 URL" textbox, then choose "Next"
  3. On the Stack Details screen, review the Parameters and then choose "Next"
  4. On the Configure stack options screen, choose "Next"
  5. On the Review stack screen, you must check the boxes for:

    • "I acknowledge that AWS CloudFormation might create IAM resources"
    • "I acknowledge that AWS CloudFormation might create IAM resources with custom names"
    • "_I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTOEXPAND"

These are required to allow CloudFormation to create a Role to allow access to resources needed by the stack and name the resources in a dynamic way.

  1. Choose "Update stack" to start the stack update.
  2. Wait for the CloudFormation stack to finish updating. Completion is indicated when the "Stack status" is "_UPDATECOMPLETE".

To deploy via the AWS CLI consult the documentation.
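
A minimal sketch of the equivalent CLI update is shown below. The stack name and template URL are illustrative; add a ParameterKey entry with UsePreviousValue=true for each existing parameter whose value you want to keep.

aws cloudformation update-stack \
  --stack-name S3F2 \
  --template-url https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml \
  --parameters ParameterKey=AdminEmail,UsePreviousValue=true \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND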

Major Upgrades: Manual Rolling Deployment

The process for a manual rolling deployment is as follows:

  1. Create a new stack from scratch
  2. Export the data from the old stack to the new stack
  3. Migrate consumers to new API and Web UI URLs
  4. Delete the old stack.

The steps for performing this process are:

  1. Deploy a new instance of the Solution by following the instructions contained in the "Deploying the Solution" section. Make sure you use unique values for the Stack Name and ResourcePrefix parameter which differ from the existing stack.
  2. Migrate Data from DynamoDB to ensure the new stack contains the necessary configuration related to Data Mappers and settings. When both stacks are deployed in the same account and region, the simplest way to migrate is via On-Demand Backup and Restore. If the stacks are deployed in different regions or accounts, you can use AWS Data Pipeline.
  3. Ensure that all the bucket policies for the Data Mappers are in place for the new stack. See the "Granting Access to Data" section for steps to do this.
  4. Review the Changelog for changes that may affect how you use the new deployment. This may require you to make changes to any software you have that interacts with the solution's API.
  5. Once all the consumers are migrated to the new stack (API and Web UI), delete the old stack.
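
If you migrate via On-Demand Backup and Restore (step 2), a minimal sketch of creating a backup of one table with the CLI is shown below. The table and backup names are illustrative; repeat for each of the solution's DynamoDB tables. Restoring a backup creates a new table, so consult the DynamoDB Backup and Restore documentation for how to load the data into the new stack's tables.

aws dynamodb create-backup \
  --table-name S3F2_DataMappers \
  --backup-name S3F2_DataMappers-migration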

Deleting the Solution

To delete a stack via AWS Console:

  1. Open the CloudFormation Console Page and choose the solution stack, then choose "Delete"
  2. Once the confirmation modal appears, choose "Delete stack".
  3. Wait for the CloudFormation stack to finish deleting. Completion is indicated when the "Stack status" is "DELETE_COMPLETE".

To delete a stack via the AWS CLI consult the documentation.
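
For example, assuming the default stack name of S3F2:

aws cloudformation delete-stack --stack-name S3F2
aws cloudformation wait stack-delete-complete --stack-name S3F2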