This section outlines steps to help you resolve issues when deploying, configuring, and using the Amazon S3 Find and Forget solution.
If you're unable to resolve an issue using this information, you can report the issue on GitHub.
If the Find phase does not identify the expected objects for the matches in the deletion queue, verify the following:
If a job remains in a QUEUED or RUNNING status for much longer than expected, there may be an issue relating to:
If the state machine is still executing but in a non-recoverable state, you can stop the state machine execution manually. This triggers an Exception job event, and the job enters a FAILED status.
If this doesn't resolve the issue or the execution isn't running, you can manually update the job status to FAILED or remove the job and any associated events from the Jobs table*.
* WARNING: You should manually intervene only when there has been a fatal error from which the system cannot recover.
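Stopping the execution manually can be done from the Step Functions console or programmatically. The following is a minimal sketch, assuming you already know the ARN of the stuck execution and pass in a boto3 Step Functions client; the function name, the error code "ManualStop", and the default cause text are illustrative choices, not part of the solution.

```python
def stop_stuck_execution(sfn_client, execution_arn,
                         cause="Manually stopped: non-recoverable state"):
    """Stop a running Step Functions execution and return its stop date.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    The error code and cause are free-form labels chosen for this sketch.
    """
    response = sfn_client.stop_execution(
        executionArn=execution_arn,
        error="ManualStop",  # hypothetical label, visible in the execution history
        cause=cause,
    )
    return response["stopDate"]
```

With boto3 installed and credentials configured, the client would typically be created with boto3.client("stepfunctions"). Heed the warning above before using this against a job that may still recover.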
A COMPLETED_CLEANUP_FAILED status indicates that the job has completed, but an error occurred when removing the processed matches from the deletion queue.
Some possible causes for this are:
You can find more details of the cause by checking the job event history for a CleanupFailed event, then viewing the event data.
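If you have exported the job event history (for example, from the API or the Jobs table), a small filter can pull out the events of interest. This is a sketch only: it assumes each event is a dictionary with EventName and EventData keys, which is an assumption about the exported shape that you should check against your own data. The same helper applies to the other event names mentioned in this section, such as Exception, QueryFailed, ForgetPhaseFailed and ObjectUpdateFailed.

```python
def find_event_data(events, event_name):
    """Return the event data of every event whose name matches event_name.

    Assumes each event is a dict with "EventName" and "EventData" keys
    (an assumed shape for this sketch). Events without a matching name
    are skipped; missing event data is returned as None.
    """
    return [
        event.get("EventData")
        for event in events
        if event.get("EventName") == event_name
    ]
```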
As the processed matches will still be on the queue, you can choose to either:
A FAILED status indicates that the job has terminated due to a generic exception.
Some possible causes for this are:
To find information on what caused the failure, check the deletion job log for an Exception event and inspect that event's event data.
Errors relating to Step Functions, such as timeouts or exceeding the permitted execution history length, may be resolvable by increasing the waiter configuration as described in Performance Configuration.
A FIND_FAILED status indicates that the job has terminated because one or more data mapper queries failed to execute.
If you are using Athena and Glue as data mappers, you should first verify the following:
If you made any changes whilst verifying the prior points, you should attempt to run a new deletion job.
To find further details of the cause of the failure, inspect the deletion job log and the event data for any QueryFailed events.
Athena queries may fail if the length of a query sent to Athena exceeds the Athena query string length limit (see Athena Service Quotas). If queries are failing for this reason, you will need to reduce the number of matches queued when running a deletion job.
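To check whether a particular query string is near the limit, you can measure its UTF-8 encoded size. The sketch below assumes a limit of 262144 bytes, which is the value listed in the Athena documentation at the time of writing; confirm it against the current Athena Service Quotas page. The constant and function names are illustrative.

```python
# Assumed limit in bytes; verify against the current Athena Service Quotas.
ATHENA_MAX_QUERY_BYTES = 262144


def query_fits_athena_limit(query, limit=ATHENA_MAX_QUERY_BYTES):
    """Return True if the UTF-8 encoded query is within the byte limit."""
    return len(query.encode("utf-8")) <= limit
```

The query text to test can be copied from the Athena Query History for the failed query.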
To troubleshoot Athena queries further, find the QueryId from the event data and match this to the query in the Athena Query History. The Athena Troubleshooting guide provides further steps.
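As an alternative to browsing the Query History in the console, the query can be looked up by ID. This is a minimal sketch, assuming the QueryId from the event data is an Athena query execution ID and that you pass in a boto3 Athena client; the function name and the shape of the returned summary are illustrative.

```python
def describe_query(athena_client, query_id):
    """Summarise the state of an Athena query execution.

    athena_client is expected to behave like boto3.client("athena").
    Returns the final state, the reason for the state change (if any),
    and the query text (if present in the response).
    """
    response = athena_client.get_query_execution(QueryExecutionId=query_id)
    execution = response["QueryExecution"]
    status = execution["Status"]
    return {
        "state": status["State"],
        "reason": status.get("StateChangeReason"),
        "query": execution.get("Query"),
    }
```

For a failed query, the reason field usually carries the error message that the Athena Troubleshooting guide can help interpret.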
A FORGET_FAILED status indicates that the job has terminated because a fatal error occurred during the forget phase of the job. S3 objects may have been modified.
Check the job log for a ForgetPhaseFailed event. Examining the event data for this event will provide you with more information about the underlying cause of the failure.
A FORGET_PARTIALLY_FAILED status indicates that the job has completed, but that the forget phase was unable to process one or more objects.
Each object that was not correctly processed will result in a message sent to the object dead letter queue ("DLQ"; see DLQUrl in the CloudFormation stack outputs) and an ObjectUpdateFailed event in the job event history containing error information. Check the content of any ObjectUpdateFailed events to ascertain the root cause of an issue.
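The messages on the DLQ can also be inspected directly. The following is a sketch, assuming you pass in a boto3 SQS client and the queue URL taken from the DLQUrl stack output; the function name is illustrative. It sets a visibility timeout of zero so that the messages are read without being hidden from other consumers, and it does not delete anything.

```python
def peek_dlq(sqs_client, dlq_url, max_messages=10):
    """Return the bodies of up to max_messages messages on the DLQ.

    sqs_client is expected to behave like boto3.client("sqs").
    SQS returns at most 10 messages per call, so the count is capped.
    Messages are not deleted and become visible again immediately.
    """
    response = sqs_client.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max(1, min(max_messages, 10)),
        VisibilityTimeout=0,
        WaitTimeSeconds=1,
    )
    # "Messages" is absent from the response when the queue is empty.
    return [message["Body"] for message in response.get("Messages", [])]
```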
Verify the following:
To reprocess the objects, run a new deletion job.