刘凡 2 years ago
parent
commit
d0fdd49032
100 changed files with 8740 additions and 0 deletions
  1. BIN
      S3/NewFind/amazon-s3-find-and-forget-master.zip
  2. 26 0
      S3/NewFind/amazon-s3-find-and-forget-master/.dependabot/config.yml
  3. 7 0
      S3/NewFind/amazon-s3-find-and-forget-master/.dockerignore
  4. 14 0
      S3/NewFind/amazon-s3-find-and-forget-master/.github/PULL_REQUEST_TEMPLATE.md
  5. 63 0
      S3/NewFind/amazon-s3-find-and-forget-master/.github/workflows/publish.yaml
  6. 51 0
      S3/NewFind/amazon-s3-find-and-forget-master/.github/workflows/release.yaml
  7. 61 0
      S3/NewFind/amazon-s3-find-and-forget-master/.github/workflows/unit-tests.yaml
  8. 161 0
      S3/NewFind/amazon-s3-find-and-forget-master/.gitignore
  9. 41 0
      S3/NewFind/amazon-s3-find-and-forget-master/.pre-commit-config.yaml
  10. 364 0
      S3/NewFind/amazon-s3-find-and-forget-master/.pylintrc
  11. 423 0
      S3/NewFind/amazon-s3-find-and-forget-master/CHANGELOG.md
  12. 7 0
      S3/NewFind/amazon-s3-find-and-forget-master/CODE_OF_CONDUCT.md
  13. 118 0
      S3/NewFind/amazon-s3-find-and-forget-master/CONTRIBUTING.md
  14. 175 0
      S3/NewFind/amazon-s3-find-and-forget-master/LICENSE
  15. 211 0
      S3/NewFind/amazon-s3-find-and-forget-master/Makefile
  16. 1 0
      S3/NewFind/amazon-s3-find-and-forget-master/NOTICE
  17. 124 0
      S3/NewFind/amazon-s3-find-and-forget-master/README.md
  18. 39 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/Dockerfile
  19. 0 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/__init__.py
  20. 138 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/cse.py
  21. 90 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/events.py
  22. 81 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/json_handler.py
  23. 315 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/main.py
  24. 170 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/parquet_handler.py
  25. 10 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/requirements.in
  26. 55 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/requirements.txt
  27. 365 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/s3.py
  28. 30 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/utils.py
  29. 3 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/aws_sdk/requirements.in
  30. 30 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/aws_sdk/requirements.txt
  31. 267 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/boto_utils/python/boto_utils.py
  32. 1 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/cr_helper/requirements.in
  33. 8 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/cr_helper/requirements.txt
  34. 1 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/decorators/requirements.in
  35. 17 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/decorators/requirements.txt
  36. 44 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/cleanup_bucket.py
  37. 37 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/cleanup_repository.py
  38. 41 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/copy_build_artefact.py
  39. 45 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/get_vpce_subnets.py
  40. 31 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/redeploy_apigw.py
  41. 29 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/rerun_pipeline.py
  42. 47 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/wait_container_build.py
  43. 180 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/handlers.py
  44. 21 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/create_data_mapper.json
  45. 21 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/delete_data_mapper.json
  46. 21 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/get_data_mapper.json
  47. 22 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/list_data_mappers.json
  48. 0 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/__init__.py
  49. 215 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/handlers.py
  50. 21 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/schemas/get_job.json
  51. 47 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/schemas/list_job_events.json
  52. 24 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/schemas/list_jobs.json
  53. 121 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/stats_updater.py
  54. 146 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/status_updater.py
  55. 159 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/stream_processor.py
  56. 180 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/queue/handlers.py
  57. 22 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/queue/schemas/list_queue_items.json
  58. 20 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/settings/handlers.py
  59. 27 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/check_query_status.py
  60. 24 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/check_queue_size.py
  61. 31 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/check_task_count.py
  62. 19 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/delete_message.py
  63. 16 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/emit_event.py
  64. 158 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/execute_query.py
  65. 499 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/generate_queries.py
  66. 20 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/orchestrate_ecs_service_scaling.py
  67. 14 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/purge_queue.py
  68. 22 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/scan_table.py
  69. 62 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/submit_query_results.py
  70. 94 0
      S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/work_query_queue.py
  71. 5 0
      S3/NewFind/amazon-s3-find-and-forget-master/cfn-publish.config
  72. 4 0
      S3/NewFind/amazon-s3-find-and-forget-master/ci/cfn_nag_blacklist.yaml
  73. 40 0
      S3/NewFind/amazon-s3-find-and-forget-master/docker_run_with_creds.sh
  74. 217 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/ARCHITECTURE.md
  75. 351 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/COST_OVERVIEW.md
  76. 119 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/LIMITS.md
  77. 140 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/LOCAL_DEVELOPMENT.md
  78. 77 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/MONITORING.md
  79. 74 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/PRODUCTION_READINESS_GUIDELINES.md
  80. 47 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/SECURITY.md
  81. 178 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/TROUBLESHOOTING.md
  82. 49 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/UPGRADE_GUIDE.md
  83. 942 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/USER_GUIDE.md
  84. 23 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/.openapi-generator-ignore
  85. 110 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/DataMapperApi.md
  86. 131 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/DeletionQueueApi.md
  87. 87 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/JobApi.md
  88. 30 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/SettingsApi.md
  89. 11 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/CreateDeletionQueueItem.md
  90. 16 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DataMapper.md
  91. 12 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DataMapperQueryExecutorParameters.md
  92. 10 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DeletionQueue.md
  93. 13 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DeletionQueueItem.md
  94. 9 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/Error.md
  95. 32 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/Job.md
  96. 16 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/JobEvent.md
  97. 22 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/JobSummary.md
  98. 9 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/ListOfCreateDeletionQueueItems.md
  99. 10 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/ListOfDataMappers.md
  100. 9 0
      S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/ListOfDeletionQueueItem.md

BIN
S3/NewFind/amazon-s3-find-and-forget-master.zip


+ 26 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.dependabot/config.yml

@@ -0,0 +1,26 @@
+version: 1
+update_configs:
+  - package_manager: "javascript"
+    directory: "/frontend"
+    update_schedule: "monthly"
+    default_labels:
+      - "frontend"
+      - "dependencies"
+  - package_manager: "javascript"
+    directory: "/"
+    update_schedule: "monthly"
+    default_labels:
+      - "ci"
+      - "dependencies"
+  - package_manager: "python"
+    directory: "/"
+    update_schedule: "monthly"
+    default_labels:
+      - "backend"
+      - "dependencies"
+  - package_manager: "docker"
+    directory: "/backend/ecs_tasks/delete_files"
+    update_schedule: "monthly"
+    default_labels:
+      - "backend"
+      - "dependencies"

+ 7 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.dockerignore

@@ -0,0 +1,7 @@
+venv/
+.idea/
+.vscode/
+tests/
+docs/
+backend/lambda_layers/
+!backend/lambda_layers/boto_utils/

+ 14 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.github/PULL_REQUEST_TEMPLATE.md

@@ -0,0 +1,14 @@
+*Issue #, if available:*
+
+*Description of changes:*
+
+*PR Checklist:*
+
+- [ ] Changelog updated
+- [ ] Unit tests (and integration tests if applicable) provided
+- [ ] All tests pass
+- [ ] Pre-commit checks pass
+- [ ] Debugging code removed
+- [ ] If releasing a new version, have you bumped the version in the main CFN template?
+
+By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+ 63 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.github/workflows/publish.yaml

@@ -0,0 +1,63 @@
+---
+
+name: Publish Version
+on:
+  release:
+    types: [created, edited]
+jobs:
+  publish:
+    name: Publish Version
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Fetch Tags
+        run: git fetch --depth=1 origin +refs/tags/*:refs/tags/* || true
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v1
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}
+          aws-region: ${{ secrets.REGION }}
+      - name: Set version
+        id: version
+        run: echo "VERSION=${GITHUB_REF/refs\/tags\//}" >> $GITHUB_ENV
+      # Cache
+      - uses: actions/cache@v1
+        with:
+          path: ~/.npm
+          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
+          restore-keys: |
+            ${{ runner.os }}-node-
+      - uses: actions/cache@v1
+        with:
+          path: ~/.cache/pip
+          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
+          restore-keys: |
+            ${{ runner.os }}-pip-
+      # Setup
+      - name: Install Snappy
+        run: sudo apt-get install libsnappy-dev
+      - name: Set up Python 3.9
+        uses: actions/setup-python@v1
+        with:
+          python-version: 3.9
+      - name: Set up Nodejs 16
+        uses: actions/setup-node@v1
+        with:
+          node-version: 16
+      - name: Set up ruby 2.6
+        uses: actions/setup-ruby@v1
+        with:
+          ruby-version: '2.6'
+      - name: Install virtualenv
+        run: pip install virtualenv
+      - name: Install dependencies
+        run: make setup
+      # Package and Upload Archive
+      - name: Build Release
+        run: make package
+      - name: Upload artefact
+        run: aws s3 cp packaged.zip s3://$CFN_BUCKET/amazon-s3-find-and-forget/$VERSION/amazon-s3-find-and-forget.zip
+        env:
+          CFN_BUCKET: ${{ secrets.CFN_BUCKET }}
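
Review note: the `Set version` step in `publish.yaml` uses Bash pattern substitution, `${GITHUB_REF/refs\/tags\//}`, to strip the `refs/tags/` prefix from the ref that triggered the release, and the result becomes the `$VERSION` path segment of the uploaded artefact. A minimal Python sketch of the same transformation is below. The function name `version_from_ref` and the sample refs are illustrative assumptions, not part of the repository.

```python
def version_from_ref(github_ref: str) -> str:
    # Mirrors Bash's ${GITHUB_REF/refs\/tags\//}: remove the first
    # occurrence of "refs/tags/" and leave the rest of the ref untouched.
    return github_ref.replace("refs/tags/", "", 1)


# Hypothetical refs, for illustration only.
print(version_from_ref("refs/tags/v0.53"))
print(version_from_ref("refs/heads/master"))
```

A ref that is not a tag passes through unchanged, so the upload path is only meaningful when the workflow is triggered by a tagged release.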

+ 51 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.github/workflows/release.yaml

@@ -0,0 +1,51 @@
+---
+
+name: Release Version
+on:
+  push:
+    branches:
+      - master
+jobs:
+  release:
+    name: Release Version
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - run: git fetch --depth=1 origin +refs/tags/*:refs/tags/* || true
+      # Cache
+      - uses: actions/cache@v1
+        with:
+          path: ~/.cache/pip
+          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
+          restore-keys: |
+            ${{ runner.os }}-pip-
+      # Setup
+      - name: Set up Python 3.9
+        uses: actions/setup-python@v1
+        with:
+          python-version: 3.9
+      - name: Install virtualenv
+        run: pip install virtualenv
+      - name: Install dependencies
+        run: make setup-predeploy
+      # Release if required
+      - name: Setup versions in env variables
+        id: version
+        run: |
+          function version { echo "$@" | awk -F. '{ printf("%d%03d%03d%03d\n", $1,$2,$3,$4); }'; }
+          echo "THIS_VERSION=$(make version | sed s/^v//)" >> $GITHUB_ENV
+          echo "THIS_VERSION_COMPARABLE=$(version $(make version | sed s/^v//))" >> $GITHUB_ENV
+          echo "LATEST_VERSION_COMPARABLE=$(version $(git describe --tags $(git rev-list --tags --max-count=1) | sed s/^v// 2> /dev/null || echo '0'))" >> $GITHUB_ENV
+      - name: Create Release
+        id: create_release
+        uses: actions/create-release@latest
+        if: env.THIS_VERSION_COMPARABLE > env.LATEST_VERSION_COMPARABLE
+        env:
+          GITHUB_TOKEN: ${{ secrets.RELEASE_TOKEN }}
+        with:
+          tag_name: v${{ env.THIS_VERSION }}
+          release_name: Release v${{ env.THIS_VERSION }}
+          body: |
+            See the CHANGELOG for a list of features included in this release
+          draft: false
+          prerelease: true
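
Review note: the `version` shell function in `release.yaml` turns a dotted version into a fixed-width number via `awk -F. '{ printf("%d%03d%03d%03d\n", $1,$2,$3,$4); }'`, so that `0.10` sorts after `0.9`. The Python sketch below reproduces that encoding under two assumptions: every component is a plain non-negative integer, and no component after the first exceeds 999 (beyond that the three-digit padding no longer guarantees correct ordering). The name `comparable` is hypothetical.

```python
def comparable(version: str) -> int:
    # Split on dots and keep at most four components, like awk's $1..$4.
    parts = [int(p) for p in version.split(".")[:4]]
    # awk formats missing fields as 0, so pad the list to four entries.
    parts += [0] * (4 - len(parts))
    # First component unpadded, the remaining three zero-padded to 3 digits.
    digits = "{:d}{:03d}{:03d}{:03d}".format(*parts)
    return int(digits)


# A release is only created when the current version is strictly newer.
this_version = comparable("0.53")
latest_tag = comparable("0.52")
print(this_version > latest_tag)
```

The `'0'` fallback used in the workflow when no tag exists maps to `0` here as well, so the first version of a repository always compares as newer.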

+ 61 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.github/workflows/unit-tests.yaml

@@ -0,0 +1,61 @@
+---
+
+name: Unit Tests
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+    types:
+      - opened
+      - edited
+      - synchronize
+jobs:
+  unit_tests:
+    name: Unit tests
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      # Cache
+      - uses: actions/cache@v1
+        with:
+          path: ~/.npm
+          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
+          restore-keys: |
+            ${{ runner.os }}-node-
+      - uses: actions/cache@v1
+        with:
+          path: ~/.cache/pip
+          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
+          restore-keys: |
+            ${{ runner.os }}-pip-
+      # Setup
+      - name: Install snappy dep
+        run: sudo apt-get install libsnappy-dev
+      - name: Set up Python 3.9
+        uses: actions/setup-python@v1
+        with:
+          python-version: 3.9
+      - name: Set up Nodejs 16
+        uses: actions/setup-node@v1
+        with:
+          node-version: 16
+      - name: Set up ruby 2.6
+        uses: actions/setup-ruby@v1
+        with:
+          ruby-version: '2.6'
+      - name: Install virtualenv
+        run: pip install virtualenv
+      - name: Install dependencies
+        run: make setup
+      # Run Tests
+      - name: CloudFormation unit tests
+        run: make test-cfn
+      - name: Backend unit tests
+        run: make test-ci
+        env:
+          AWS_DEFAULT_REGION: eu-west-1
+      - name: Frontend unit tests
+        run: make test-frontend
+      - name: Upload unit test coverage reports to Codecov
+        uses: codecov/codecov-action@v1

+ 161 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.gitignore

@@ -0,0 +1,161 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+coverage/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# celery beat schedule file
+celerybeat-schedule
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# vs code
+.vscode/
+
+# misc
+.idea/
+.env
+.DS_Store
+.env.local
+.env.development.local
+.env.test.local
+.env.production.local
+
+#front end
+node_modules
+.eslintcache
+/.pnp
+.pnp.js
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+
+# build
+packaged.yaml
+backend/lambda_layers/aws_sdk/python
+backend/lambda_layers/cr_helper/python
+backend/lambda_layers/decorators/python
+!backend/lambda_layers/decorators/python/decorators.py
+build.zip
+packaged.zip
+frontend/public/settings.js
+*.sentinel
+backend/ecs_tasks/*.tar
+
+# docs
+.openapi-generator/

+ 41 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.pre-commit-config.yaml

@@ -0,0 +1,41 @@
+repos:
+-   repo: local
+    hooks:
+    -   id: format-cfn
+        name: Format CloudFormation
+        entry: make format-cfn
+        language: system
+-   repo: local
+    hooks:
+    -   id: format-js
+        name: Format Javascript
+        entry: make format-js
+        language: system
+-   repo: local
+    hooks:
+    -   id: format-python
+        name: Format Python Code
+        entry: make format-python
+        language: system
+-   repo: local
+    hooks:
+    -   id: format-docs
+        name: Format Markdown docs
+        entry: make format-docs
+        language: system
+-   repo: local
+    hooks:
+    -   id: generate-api-docs
+        name: Generate API Docs
+        entry: make generate-api-docs
+        language: system
+-   repo: https://github.com/awslabs/git-secrets
+    rev: 5e28df337746db4f070c84f7069d365bfd0d72a8
+    hooks:
+    -   id: git-secrets
+-   repo: local
+    hooks:
+    -   id: cfn-lint
+        name: Lint CloudFormation templates
+        entry: make lint-cfn
+        language: system

+ 364 - 0
S3/NewFind/amazon-s3-find-and-forget-master/.pylintrc

@@ -0,0 +1,364 @@
+[MASTER]
+
+# Specify a configuration file.
+#rcfile=
+
+# Python code to execute, usually for sys.path manipulation such as
+# pygtk.require().
+#init-hook=
+
+# Add files or directories to the blacklist. They should be base names, not
+# paths.
+ignore=compat.py, __main__.py
+
+# Pickle collected data for later comparisons.
+persistent=yes
+
+# List of plugins (as comma separated values of python modules names) to load,
+# usually to register additional checkers.
+load-plugins=
+
+# Use multiple processes to speed up Pylint.
+jobs=1
+
+# Allow loading of arbitrary C extensions. Extensions are imported into the
+# active Python interpreter and may run arbitrary code.
+unsafe-load-any-extension=no
+
+# A comma-separated list of package or module names from where C extensions may
+# be loaded. Extensions are loading into the active Python interpreter and may
+# run arbitrary code
+extension-pkg-whitelist=
+
+# Allow optimization of some AST trees. This will activate a peephole AST
+# optimizer, which will apply various small optimizations. For instance, it can
+# be used to obtain the result of joining multiple strings with the addition
+# operator. Joining a lot of strings can lead to a maximum recursion error in
+# Pylint and this flag can prevent that. It has one side effect, the resulting
+# AST will be different than the one from reality.
+optimize-ast=no
+
+
+[MESSAGES CONTROL]
+
+# Only show warnings with the listed confidence levels. Leave empty to show
+# all. Valid levels: HIGH, INFERENCE, INFERENCE_FAILURE, UNDEFINED
+confidence=
+
+# Enable the message, report, category or checker with the given id(s). You can
+# either give multiple identifiers separated by comma (,) or put this option
+# multiple times. See also the "--disable" option for examples.
+#enable=
+
+# Disable the message, report, category or checker with the given id(s). You
+# can either give multiple identifiers separated by comma (,) or put this
+# option multiple times (only on the command line, not in the configuration
+# file where it should appear only once). You can also use "--disable=all" to
+# disable everything first and then reenable specific checks. For example, if
+# you want to run only the similarities checker, you can use "--disable=all
+# --enable=similarities". If you want to run only the classes checker, but have
+# no Warning level messages displayed, use "--disable=all --enable=classes
+# --disable=W"
+disable=W0107,W0201,R0913,R0902,E0401,C0103,E0611,R0914,W0613,E1101
+
+
+[REPORTS]
+
+# Set the output format. Available formats are text, parseable, colorized, msvs
+# (visual studio) and html. You can also give a reporter class, eg
+# mypackage.mymodule.MyReporterClass.
+output-format=text
+
+# Put messages in a separate file for each module / package specified on the
+# command line instead of printing them on stdout. Reports (if any) will be
+# written in a file name "pylint_global.[txt|html]".
+files-output=no
+
+# Tells whether to display a full report or only the messages
+reports=no
+
+# Python expression which should return a note less than 10 (10 is the highest
+# note). You have access to the variables errors warning, statement which
+# respectively contain the number of errors / warnings messages and the total
+# number of statements analyzed. This is used by the global evaluation report
+# (RP0004).
+evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)
+
+# Template used to display messages. This is a python new-style format string
+# used to format the message information. See doc for all details
+#msg-template=
+
+
+[BASIC]
+
+# List of builtins function names that should not be used, separated by a comma
+bad-functions=apply,reduce
+
+# Good variable names which should always be accepted, separated by a comma
+good-names=e,i,j,k,n,ex,Run,_
+
+# Bad variable names which should always be refused, separated by a comma
+bad-names=foo,bar,baz,toto,tutu,tata
+
+# Colon-delimited sets of names that determine each other's naming style when
+# the name regexes allow several styles.
+name-group=
+
+# Include a hint for the correct naming format with invalid-name
+include-naming-hint=yes
+
+# Regular expression matching correct function names
+function-rgx=[a-z_][a-z0-9_]{2,50}$
+
+# Naming hint for function names
+function-name-hint=[a-z_][a-z0-9_]{2,30}$
+
+# Regular expression matching correct variable names
+variable-rgx=[a-z_][a-z0-9_]{0,50}$
+
+# Naming hint for variable names
+variable-name-hint=[a-z_][a-z0-9_]{2,30}$
+
+# Regular expression matching correct constant names
+const-rgx=(([a-zA-Z_][a-zA-Z0-9_]*)|(__.*__))$
+
+# Naming hint for constant names
+const-name-hint=(([A-Z_][A-Z0-9_]*)|(__.*__))$
+
+# Regular expression matching correct attribute names
+attr-rgx=[a-z_][a-z0-9_]{1,50}$
+
+# Naming hint for attribute names
+attr-name-hint=[a-z_][a-z0-9_]{2,30}$
+
+# Regular expression matching correct argument names
+argument-rgx=[a-z_][a-z0-9_]{0,50}$
+
+# Naming hint for argument names
+argument-name-hint=[a-z_][a-z0-9_]{2,30}$
+
+# Regular expression matching correct class attribute names
+class-attribute-rgx=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$
+
+# Naming hint for class attribute names
+class-attribute-name-hint=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$
+
+# Regular expression matching correct inline iteration names
+inlinevar-rgx=[A-Za-z_][A-Za-z0-9_]*$
+
+# Naming hint for inline iteration names
+inlinevar-name-hint=[A-Za-z_][A-Za-z0-9_]*$
+
+# Regular expression matching correct class names
+class-rgx=[A-Z_][a-zA-Z0-9]+$
+
+# Naming hint for class names
+class-name-hint=[A-Z_][a-zA-Z0-9]+$
+
+# Regular expression matching correct module names
+module-rgx=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$
+
+# Naming hint for module names
+module-name-hint=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$
+
+# Regular expression matching correct method names
+method-rgx=[a-z_][a-z0-9_]{2,30}$
+
+# Naming hint for method names
+method-name-hint=[a-z_][a-z0-9_]{2,30}$
+
+# Regular expression which should only match function or class names that do
+# not require a docstring.
+no-docstring-rgx=.*
+
+# Minimum line length for functions/classes that require docstrings, shorter
+# ones are exempt.
+docstring-min-length=-1
+
+
+[FORMAT]
+
+# Maximum number of characters on a single line.
+max-line-length=190
+
+# Regexp for a line that is allowed to be longer than the limit.
+ignore-long-lines=^\s*(# )?<?https?://\S+>?$
+
+# Allow the body of an if to be on the same line as the test if there is no
+# else.
+single-line-if-stmt=no
+
+# List of optional constructs for which whitespace checking is disabled
+no-space-check=trailing-comma,dict-separator
+
+# Maximum number of lines in a module
+max-module-lines=1000
+
+# String used as indentation unit. This is usually " " (4 spaces) or "\t" (1
+# tab).
+indent-string='    '
+
+# Number of spaces of indent required inside a hanging or continued line.
+indent-after-paren=4
+
+# Expected format of line ending, e.g. empty (any line ending), LF or CRLF.
+expected-line-ending-format=
+
+
+[LOGGING]
+
+# Logging modules to check that the string format arguments are in logging
+# function parameter format
+logging-modules=logging
+
+
+[MISCELLANEOUS]
+
+# List of note tags to take in consideration, separated by a comma.
+notes=FIXME,XXX
+
+
+[SIMILARITIES]
+
+# Minimum lines number of a similarity.
+# Temp 500 until we merge initial_commit into shared codebase.
+min-similarity-lines=500 
+
+# Ignore comments when computing similarities.
+ignore-comments=yes
+
+# Ignore docstrings when computing similarities.
+ignore-docstrings=yes
+
+# Ignore imports when computing similarities.
+ignore-imports=yes
+
+
+[SPELLING]
+
+# Spelling dictionary name. Available dictionaries: none. To make it working
+# install python-enchant package.
+spelling-dict=
+
+# List of comma separated words that should not be checked.
+spelling-ignore-words=
+
+# A path to a file that contains private dictionary; one word per line.
+spelling-private-dict-file=
+
+# Tells whether to store unknown words to indicated private dictionary in
+# --spelling-private-dict-file option instead of raising a message.
+spelling-store-unknown-words=no
+
+
+[TYPECHECK]
+
+# Tells whether missing members accessed in mixin class should be ignored. A
+# mixin class is detected if its name ends with "mixin" (case insensitive).
+ignore-mixin-members=yes
+
+# List of module names for which member attributes should not be checked
+# (useful for modules/projects where namespaces are manipulated during runtime
+# and thus existing member attributes cannot be deduced by static analysis
+ignored-modules=six.moves,
+
+# List of classes names for which member attributes should not be checked
+# (useful for classes with attributes dynamically set).
+ignored-classes=SQLObject
+
+# List of members which are set dynamically and missed by pylint inference
+# system, and so shouldn't trigger E0201 when accessed. Python regular
+# expressions are accepted.
+generated-members=REQUEST,acl_users,aq_parent,objects,DoesNotExist,md5,sha1,sha224,sha256,sha384,sha512
+
+
+[VARIABLES]
+
+# Tells whether we should check for unused import in __init__ files.
+init-import=no
+
+# A regular expression matching the name of dummy variables (i.e. expectedly
+# not used).
+dummy-variables-rgx=_|dummy|ignore
+
+# List of additional names supposed to be defined in builtins. Remember that
+# you should avoid to define new builtins when possible.
+additional-builtins=
+
+# List of strings which can identify a callback function by name. A callback
+# name must start or end with one of those strings.
+callbacks=cb_,_cb
+
+
+[CLASSES]
+
+# List of method names used to declare (i.e. assign) instance attributes.
+defining-attr-methods=__init__,__new__,setUp
+
+# List of valid names for the first argument in a class method.
+valid-classmethod-first-arg=cls
+
+# List of valid names for the first argument in a metaclass class method.
+valid-metaclass-classmethod-first-arg=mcs
+
+# List of member names, which should be excluded from the protected access
+# warning.
+exclude-protected=_asdict,_fields,_replace,_source,_make
+
+
+[DESIGN]
+
+# Maximum number of arguments for function / method
+max-args=5
+
+# Argument names that match this expression will be ignored. Default to name
+# with leading underscore
+ignored-argument-names=_.*
+
+# Maximum number of locals for function / method body
+max-locals=15
+
+# Maximum number of return / yield for function / method body
+max-returns=6
+
+# Maximum number of branch for function / method body
+max-branches=12
+
+# Maximum number of statements in function / method body
+max-statements=35
+
+# Maximum number of parents for a class (see R0901).
+max-parents=6
+
+# Maximum number of attributes for a class (see R0902).
+max-attributes=7
+
+# Minimum number of public methods for a class (see R0903).
+min-public-methods=0
+
+# Maximum number of public methods for a class (see R0904).
+max-public-methods=20
+
+
+[IMPORTS]
+
+# Deprecated modules which should not be used, separated by a comma
+deprecated-modules=regsub,TERMIOS,Bastion,rexec,UserDict
+
+# Create a graph of every (i.e. internal and external) dependencies in the
+# given file (report RP0402 must not be disabled)
+import-graph=
+
+# Create a graph of external dependencies in the given file (report RP0402 must
+# not be disabled)
+ext-import-graph=
+
+# Create a graph of internal dependencies in the given file (report RP0402 must
+# not be disabled)
+int-import-graph=
+
+[EXCEPTIONS]
+
+# Exceptions that will emit a warning when being caught. Defaults to
+# "Exception"
+overgeneral-exceptions=Exception
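
To illustrate the `dummy-variables-rgx=_|dummy|ignore` setting from the `[VARIABLES]` section above, here is a minimal sketch of which variable names the pattern would accept. The helper `looks_like_dummy` is hypothetical, and the sketch assumes the pattern is applied with `re.match` (anchored at the start of the name only); the exact matching mode may differ between pylint versions, so treat this as an approximation rather than pylint's actual implementation.

```python
import re

# Value copied from the dummy-variables-rgx option in the config above.
DUMMY_VARIABLES_RGX = re.compile(r"_|dummy|ignore")


def looks_like_dummy(name: str) -> bool:
    """Hypothetical helper: True if the pattern matches at the start of name.

    Assumption: the pattern is applied with match(), so it is anchored at
    the start of the name but not at the end.
    """
    return DUMMY_VARIABLES_RGX.match(name) is not None


# Print each candidate name with the result of the check.
for candidate in ("_", "_unused", "dummy_index", "ignored", "result", "my_dummy"):
    print(candidate, looks_like_dummy(candidate))
```

Under that assumption, any name that merely starts with `_`, `dummy`, or `ignore` is treated as a dummy variable, which is broader than the three literal names might suggest.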

+ 423 - 0
S3/NewFind/amazon-s3-find-and-forget-master/CHANGELOG.md

@@ -0,0 +1,423 @@
+# Change Log
+
+## v0.53
+
+- [#332](https://github.com/awslabs/amazon-s3-find-and-forget/pull/332):
+  - Switch to
+    [Pyarrow's S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html)
+    to read from S3 on the Forget Phase
+  - Switch to
+    [boto3's upload_fileobj](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj)
+    to write to S3 on the Forget Phase
+  - Upgrade backend dependencies
+- [#327](https://github.com/awslabs/amazon-s3-find-and-forget/pull/327): Improve
+  performance of Athena results handler
+- [#329](https://github.com/awslabs/amazon-s3-find-and-forget/pull/329): Upgrade
+  frontend dependencies
+
+## v0.52
+
+- [#318](https://github.com/awslabs/amazon-s3-find-and-forget/pull/318): Added
+  support for AWS China
+- [#324](https://github.com/awslabs/amazon-s3-find-and-forget/pull/324): Upgrade
+  frontend dependencies
+
+## v0.51
+
+- [#321](https://github.com/awslabs/amazon-s3-find-and-forget/pull/321): Upgrade
+  numpy dependency
+
+## v0.50
+
+- [#322](https://github.com/awslabs/amazon-s3-find-and-forget/pull/322):
+  Upgraded to Python 3.9
+
+## v0.49
+
+- [#314](https://github.com/awslabs/amazon-s3-find-and-forget/pull/314): Fix
+  query generation step for Composite matches consisting of a single column
+- [#320](https://github.com/awslabs/amazon-s3-find-and-forget/pull/320): Fix
+  deployment issue introduced in v0.48
+
+## v0.48
+
+- [#316](https://github.com/awslabs/amazon-s3-find-and-forget/pull/316): Upgrade
+  dependencies
+- [#313](https://github.com/awslabs/amazon-s3-find-and-forget/pull/313): Add
+  option to choose IAM for authentication (in place of Cognito)
+- [#313](https://github.com/awslabs/amazon-s3-find-and-forget/pull/313): Add
+  option to not deploy WebUI component. Cognito auth is required for WebUI
+
+## v0.47
+
+- [#310](https://github.com/awslabs/amazon-s3-find-and-forget/pull/310): Improve
+  performance of Athena query generation
+- [#308](https://github.com/awslabs/amazon-s3-find-and-forget/pull/308): Upgrade
+  frontend dependencies and use npm workspaces to link frontend sub-project
+
+## v0.46
+
+- [#306](https://github.com/awslabs/amazon-s3-find-and-forget/pull/306): Adds
+  retry behaviour for old object deletion to improve reliability against
+  transient errors from Amazon S3
+
+## v0.45
+
+- [#303](https://github.com/awslabs/amazon-s3-find-and-forget/pull/303): Improve
+  performance of Athena query generation
+- [#301](https://github.com/awslabs/amazon-s3-find-and-forget/pull/301): Include
+  table name in error when query generation fails due to an invalid column type
+- Dependency version updates for:
+  - [#302](https://github.com/awslabs/amazon-s3-find-and-forget/pull/302)
+    simple-plist
+  - [#300](https://github.com/awslabs/amazon-s3-find-and-forget/pull/300) plist
+  - [#299](https://github.com/awslabs/amazon-s3-find-and-forget/pull/299)
+    ansi-regex
+  - [#298](https://github.com/awslabs/amazon-s3-find-and-forget/pull/298)
+    minimist
+  - [#296](https://github.com/awslabs/amazon-s3-find-and-forget/pull/296)
+    node-forge
+
+## v0.44
+
+- [#293](https://github.com/awslabs/amazon-s3-find-and-forget/pull/293): Upgrade
+  dependencies
+
+## v0.43
+
+- [#289](https://github.com/awslabs/amazon-s3-find-and-forget/pull/289): Upgrade
+  frontend dependencies
+
+- [#287](https://github.com/awslabs/amazon-s3-find-and-forget/pull/287): Add
+  data mapper parameter for ignoring Object Not Found exceptions encountered
+  during deletion
+
+## v0.42
+
+- [#285](https://github.com/awslabs/amazon-s3-find-and-forget/pull/285): Fix for
+  a bug that caused a job to fail with a false positive
+  `The object s3://<REDACTED> was processed successfully but no rows required deletion`
+  when processing a job with queries running for more than 30m
+
+- [#286](https://github.com/awslabs/amazon-s3-find-and-forget/pull/286): Fix for
+  a bug that causes `AthenaQueryMaxRetries` setting to be ignored
+
+- [#286](https://github.com/awslabs/amazon-s3-find-and-forget/pull/286): Make
+  state machine more resilient to transient failures by adding retry
+
+- [#284](https://github.com/awslabs/amazon-s3-find-and-forget/pull/284): Improve
+  performance of find query for data mappers with multiple column identifiers
+
+## v0.41
+
+- [#283](https://github.com/awslabs/amazon-s3-find-and-forget/pull/283): Fix for
+  a bug that caused a job to fail with `Runtime.ExitError` when processing a
+  large queue of objects to be modified
+
+- [#281](https://github.com/awslabs/amazon-s3-find-and-forget/pull/281): Improve
+  performance of query generation step for tables with many partitions
+
+## v0.40
+
+- [#280](https://github.com/awslabs/amazon-s3-find-and-forget/pull/280): Improve
+  performance for large queues of composite matches
+
+## v0.39
+
+- [#279](https://github.com/awslabs/amazon-s3-find-and-forget/pull/279): Improve
+  performance for large queues of simple matches and logging additions
+
+## v0.38
+
+- [#278](https://github.com/awslabs/amazon-s3-find-and-forget/pull/278): Fix for
+  a bug that caused a job to fail if the processing of an object took longer
+  than the lifetime of its IAM temporary access credentials
+
+## v0.37
+
+- [#276](https://github.com/awslabs/amazon-s3-find-and-forget/pull/276): First
+  attempt at fixing a bug where the access token expires and causes a Job to
+  fail if processing of an object takes more than an hour
+
+- [#275](https://github.com/awslabs/amazon-s3-find-and-forget/pull/275): Upgrade
+  JavaScript dependencies
+
+- [#274](https://github.com/awslabs/amazon-s3-find-and-forget/pull/274): Fix for
+  a bug that causes deletion to fail in parquet files when a data mapper has
+  multiple column identifiers
+
+## v0.36
+
+- [#272](https://github.com/awslabs/amazon-s3-find-and-forget/pull/272):
+  Introduce a retry mechanism when running Athena queries
+
+## v0.35
+
+- [#271](https://github.com/awslabs/amazon-s3-find-and-forget/pull/271): Support
+  for decimal type column identifiers in Parquet files
+
+## v0.34
+
+- [#270](https://github.com/awslabs/amazon-s3-find-and-forget/pull/270): Fix for
+  a bug affecting the front-end causing a 403 error when making a request to STS
+  in the Data Mappers Page
+
+## v0.33
+
+- [#266](https://github.com/awslabs/amazon-s3-find-and-forget/pull/266): Fix
+  bug when creating a data mapper for a Glue table without partition keys
+
+- [#264](https://github.com/awslabs/amazon-s3-find-and-forget/pull/264): Upgrade
+  frontend dependencies
+
+- [#263](https://github.com/awslabs/amazon-s3-find-and-forget/pull/263): Improve
+  bucket policies
+
+- [#261](https://github.com/awslabs/amazon-s3-find-and-forget/pull/261): Upgrade
+  frontend dependencies
+
+## v0.32
+
+- [#260](https://github.com/awslabs/amazon-s3-find-and-forget/pull/260): Add
+  Stockholm region
+
+## v0.31
+
+- [#245](https://github.com/awslabs/amazon-s3-find-and-forget/pull/245): CSE-KMS
+  support
+
+- [#259](https://github.com/awslabs/amazon-s3-find-and-forget/pull/259): Upgrade
+  frontend dependencies
+
+## v0.30
+
+- [#257](https://github.com/awslabs/amazon-s3-find-and-forget/pull/257):
+  Introduce data mapper setting to specify the partition keys to be used when
+  querying the data during the Find Phase
+
+## v0.29
+
+- [#256](https://github.com/awslabs/amazon-s3-find-and-forget/pull/256): Upgrade
+  backend dependencies
+
+## v0.28
+
+- [#252](https://github.com/awslabs/amazon-s3-find-and-forget/pull/252): Upgrade
+  frontend and backend dependencies
+
+## v0.27
+
+- [#248](https://github.com/awslabs/amazon-s3-find-and-forget/pull/248): Fix for
+  a bug affecting Deletion Jobs running for cross-account buckets
+- [#246](https://github.com/awslabs/amazon-s3-find-and-forget/pull/246): Upgrade
+  build dependencies
+
+## v0.26
+
+- [#244](https://github.com/awslabs/amazon-s3-find-and-forget/pull/244): Upgrade
+  frontend dependencies
+- [#243](https://github.com/awslabs/amazon-s3-find-and-forget/pull/243): Upgrade
+  frontend and build dependencies
+
+## v0.25
+
+> This version introduces breaking changes to the API and Web UI. Please consult
+> the
+> [migrating from <=v0.24 to v0.25 guide](docs/UPGRADE_GUIDE.md#migrating-from-v024-to-v025)
+
+- [#239](https://github.com/awslabs/amazon-s3-find-and-forget/pull/239): Remove
+  limit on queue size for individual jobs.
+
+## v0.24
+
+- [#240](https://github.com/awslabs/amazon-s3-find-and-forget/pull/240): Add ECR
+  API Endpoint and migrate to
+  [Fargate Platform version 1.4.0](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/platform_versions.html#platform-version-migration)
+
+## v0.23
+
+- [#238](https://github.com/awslabs/amazon-s3-find-and-forget/pull/238): Upgrade
+  frontend dependencies
+
+## v0.22
+
+- [#236](https://github.com/awslabs/amazon-s3-find-and-forget/pull/236): Export
+  API Gateway URL + Deletion Queue Table Stream ARN from main CloudFormation
+  Template
+
+## v0.21
+
+- [#232](https://github.com/awslabs/amazon-s3-find-and-forget/pull/232): Fix for
+  a bug where the Frontend did not render the Data Mappers list after a Glue
+  Table associated with a Data Mapper was deleted
+- [#233](https://github.com/awslabs/amazon-s3-find-and-forget/pull/233): Add GET
+  endpoint for specific data mapper
+- [#234](https://github.com/awslabs/amazon-s3-find-and-forget/pull/234):
+  Performance improvements for the query generation phase
+
+## v0.20
+
+- [#230](https://github.com/awslabs/amazon-s3-find-and-forget/pull/230): Upgrade
+  frontend dependencies
+- [#231](https://github.com/awslabs/amazon-s3-find-and-forget/pull/231): Upgrade
+  aws-amplify dependency
+
+## v0.19
+
+- [#226](https://github.com/awslabs/amazon-s3-find-and-forget/pull/226): Support
+  for Composite Match Ids
+- [#227](https://github.com/awslabs/amazon-s3-find-and-forget/pull/227): Upgrade
+  frontend dependencies
+
+## v0.18
+
+- [#223](https://github.com/awslabs/amazon-s3-find-and-forget/pull/223): This
+  release fixes
+  [an issue (#222)](https://github.com/awslabs/amazon-s3-find-and-forget/issues/222)
+  where new deployments of the solution could fail due to unavailability of a
+  third-party dependency. Container base images are now retrieved and bundled
+  with each release.
+
+## v0.17
+
+- [#220](https://github.com/awslabs/amazon-s3-find-and-forget/pull/220): Fix for
+  a bug affecting Parquet files with lower-cased column identifiers generating a
+  `Apache Arrow processing error: 'Field "customerid" does not exist in table schema`
+  exception during the Forget phase (for example `customerId` in parquet file
+  being mapped to lower-case `customerid` in glue table)
+
+## v0.16
+
+- [#216](https://github.com/awslabs/amazon-s3-find-and-forget/pull/216): Fix for
+  a bug affecting Parquet files with complex data types as column identifier
+  generating a
+  `Apache Arrow processing error: Mix of struct and list types not yet supported`
+  exception during the Forget phase
+- [#216](https://github.com/awslabs/amazon-s3-find-and-forget/pull/216): Fix for
+  a bug affecting workgroups other than `primary` generating a permission error
+  exception during the Find phase
+
+## v0.15
+
+- [#215](https://github.com/awslabs/amazon-s3-find-and-forget/pull/215): Support
+  for data registered with AWS Lake Formation
+
+## v0.14
+
+- [#213](https://github.com/awslabs/amazon-s3-find-and-forget/pull/213): Fix for
+  a bug causing a FIND_FAILED error related to a States.DataLimitExceed
+  exception triggered by Step Function's Athena workflow when executing the
+  SubmitQueryResults lambda
+- [#208](https://github.com/awslabs/amazon-s3-find-and-forget/pull/208): Fix bug
+  preventing PUT DataMapper from editing an existing data mapper with the same
+  location; fix Front-end DataMapper creation to prevent editing an existing one.
+
+## v0.13
+
+- [#207](https://github.com/awslabs/amazon-s3-find-and-forget/pull/207): Upgrade
+  frontend dependencies
+
+## v0.12
+
+- [#202](https://github.com/awslabs/amazon-s3-find-and-forget/pull/202): Fix a
+  bug that was affecting Partitions with non-string types generating a
+  `SYNTAX_ERROR: line x:y: '=' cannot be applied to integer, varchar(z)`
+  exception during the Find Phase
+- [#203](https://github.com/awslabs/amazon-s3-find-and-forget/pull/203): Upgrade
+  frontend dependencies
+- [#204](https://github.com/awslabs/amazon-s3-find-and-forget/pull/204): Improve
+  performance during Cleanup Phase
+- [#205](https://github.com/awslabs/amazon-s3-find-and-forget/pull/205): Fix a
+  UI issue in Firefox where a missing CORS header prevented the correct queue
+  size from being shown
+
+## v0.11
+
+- [#200](https://github.com/awslabs/amazon-s3-find-and-forget/pull/200): Add API
+  Endpoint for adding deletion queue items in batch - deprecates PATCH /v1/queue
+- [#170](https://github.com/awslabs/amazon-s3-find-and-forget/pull/170): JSON
+  support
+
+## v0.10
+
+- [#193](https://github.com/awslabs/amazon-s3-find-and-forget/pull/193): Add
+  support for datasets with Pandas indexes. Pandas indexes will be preserved if
+  present.
+- [#194](https://github.com/awslabs/amazon-s3-find-and-forget/pull/194): Remove
+  debugging code from Fargate task
+- [#195](https://github.com/awslabs/amazon-s3-find-and-forget/pull/195): Fix
+  support for requester pays buckets
+- [#196](https://github.com/awslabs/amazon-s3-find-and-forget/pull/196): Upgrade
+  backend dependencies
+- [#197](https://github.com/awslabs/amazon-s3-find-and-forget/pull/197): Fix
+  duplicated query executions during Find Phase
+
+## v0.9
+
+> This version introduces breaking changes to the CloudFormation templates.
+> Please consult the
+> [migrating from <=v0.8 to v0.9 guide](docs/UPGRADE_GUIDE.md#migrating-from-v08-to-v09)
+
+- [#189](https://github.com/awslabs/amazon-s3-find-and-forget/pull/189): UI
+  Updates
+- [#191](https://github.com/awslabs/amazon-s3-find-and-forget/pull/191): Deploy
+  VPC template by default
+
+## v0.8
+
+- [#185](https://github.com/awslabs/amazon-s3-find-and-forget/pull/185): Fix
+  dead links to VPC info in docs
+- [#186](https://github.com/awslabs/amazon-s3-find-and-forget/pull/186): Fix:
+  Solves an issue where the forget phase container could crash when redacting
+  numeric Match IDs from its logs
+- [#187](https://github.com/awslabs/amazon-s3-find-and-forget/pull/187):
+  Dependency version updates for react-scripts
+
+## v0.7
+
+- [#183](https://github.com/awslabs/amazon-s3-find-and-forget/pull/183):
+  Dependency version updates for elliptic
+
+## v0.6
+
+- [#173](https://github.com/awslabs/amazon-s3-find-and-forget/pull/173): Show
+  column types and hierarchy in the front-end during Data Mapper creation
+- [#173](https://github.com/awslabs/amazon-s3-find-and-forget/pull/173): Add
+  support for char, smallint, tinyint, double, float
+- [#174](https://github.com/awslabs/amazon-s3-find-and-forget/pull/174): Add
+  support for types nested in struct
+- [#177](https://github.com/awslabs/amazon-s3-find-and-forget/pull/177):
+  Reformat of Python source code (non-functional change)
+- Dependency version updates for:
+  - [#178](https://github.com/awslabs/amazon-s3-find-and-forget/pull/178),
+    [#180](https://github.com/awslabs/amazon-s3-find-and-forget/pull/180) lodash
+  - [#179](https://github.com/awslabs/amazon-s3-find-and-forget/pull/179)
+    websocket-extensions
+
+## v0.5
+
+- [#172](https://github.com/awslabs/amazon-s3-find-and-forget/pull/172): Fix for
+  an issue where Make may not install the required Lambda layer dependencies,
+  resulting in unusable builds.
+
+## v0.4
+
+- [#171](https://github.com/awslabs/amazon-s3-find-and-forget/pull/171): Fix for
+  a bug where API 5xx responses did not return the appropriate CORS
+  headers
+
+## v0.3
+
+- [#164](https://github.com/awslabs/amazon-s3-find-and-forget/pull/164): Fix for
+  a bug affecting v0.2 deployment via CloudFormation
+
+## v0.2
+
+- [#161](https://github.com/awslabs/amazon-s3-find-and-forget/pull/161): Fix for
+  a bug affecting Parquet files with nullable values generating a
+  `Table schema does not match schema used to create file` exception during the
+  Forget phase
+
+## v0.1
+
+Initial Release
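
Several entries above (for example v0.46, v0.42 and v0.36) mention adding retry behaviour to cope with transient errors. As a rough illustration of that general pattern only, and not the project's actual implementation, the sketch below retries an operation with exponential backoff. Every name in it (`TransientError`, `retry`, `make_flaky`, and the parameters) is hypothetical and chosen for the example.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable error (hypothetical; not from the project)."""


def retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt. The last failure is
    re-raised so the caller can still see it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky(failures):
    """Return a callable that fails `failures` times, then returns its call count."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated transient failure")
        return state["calls"]

    return operation


# Two simulated failures followed by a success; sleeping is stubbed out
# so the example runs instantly.
print(retry(make_flaky(2), sleep=lambda seconds: None))
```

Passing the `sleep` function as a parameter keeps the sketch easy to test; a real deployment would more likely rely on the retry facilities of the AWS SDK or of the state machine itself.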

+ 7 - 0
S3/NewFind/amazon-s3-find-and-forget-master/CODE_OF_CONDUCT.md

@@ -0,0 +1,7 @@
+## Code of Conduct
+
+This project has adopted the
+[Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For
+more information see the
+[Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
+opensource-codeofconduct@amazon.com with any additional questions or comments.

+ 118 - 0
S3/NewFind/amazon-s3-find-and-forget-master/CONTRIBUTING.md

@@ -0,0 +1,118 @@
+# Contributing Guidelines
+
+Thank you for your interest in contributing to our project. Whether it's a bug
+report, new feature, correction, or additional documentation, we greatly value
+feedback and contributions from our community.
+
+Please read through this document before submitting any issues or pull requests
+to ensure we have all the necessary information to effectively respond to your
+bug report or contribution.
+
+## Index
+
+- [Introduction](#introduction)
+  - [Reporting Bugs/Feature Requests](#reporting-bugsfeature-requests)
+  - [Contributing via Pull Requests](#contributing-via-pull-requests)
+  - [Finding contributions to work on](#finding-contributions-to-work-on)
+  - [Code of Conduct](#code-of-conduct)
+  - [Security issue notifications](#security-issue-notifications)
+  - [Licensing](#licensing)
+- [Contributing to the codebase](#contributing-to-the-codebase)
+
+## Introduction
+
+### Reporting Bugs/Feature Requests
+
+We welcome you to use the GitHub issue tracker to report bugs or suggest
+features.
+
+When filing an issue, please check
+[existing open](https://github.com/awslabs/amazon-s3-find-and-forget/issues), or
+[recently closed](https://github.com/awslabs/amazon-s3-find-and-forget/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20),
+issues to make sure somebody else hasn't already reported the issue. Please try
+to include as much information as you can. Details like these are incredibly
+useful:
+
+- A reproducible test case or series of steps
+- The version of our code being used
+- Any modifications you've made relevant to the bug
+- Anything unusual about your environment or deployment
+
+### Contributing via Pull Requests
+
+Contributions via pull requests are much appreciated. Before sending us a pull
+request, please ensure that:
+
+1. You are working against the latest source on the _master_ branch.
+2. You check existing open, and recently merged, pull requests to make sure
+   someone else hasn't addressed the problem already.
+3. You open an issue to discuss any significant work - we would hate for your
+   time to be wasted.
+
+To send us a pull request, please:
+
+1. Fork the repository.
+2. Modify the source; please focus on the specific change you are contributing.
+   If you also reformat all the code, it will be hard for us to focus on your
+   change.
+3. Ensure local tests pass.
+4. Commit to your fork using clear commit messages.
+5. Send us a pull request, answering any default questions in the pull request
+   interface.
+6. Pay attention to any automated CI failures reported in the pull request, and
+   stay involved in the conversation.
+
+GitHub provides additional documentation on
+[forking a repository](https://help.github.com/articles/fork-a-repo/) and
+[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
+
+### Finding contributions to work on
+
+Looking at the existing issues is a great way to find something to contribute
+to. As our projects, by default, use the default GitHub issue labels
+(enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any
+['help wanted'](https://github.com/awslabs/amazon-s3-find-and-forget/labels/help%20wanted)
+issues is a great place to start.
+
+### Code of Conduct
+
+This project has adopted the
+[Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). For
+more information see the
+[Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
+opensource-codeofconduct@amazon.com with any additional questions or comments.
+
+### Security issue notifications
+
+If you discover a potential security issue in this project we ask that you
+notify AWS/Amazon Security via our
+[vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/).
+Please do **not** create a public GitHub issue.
+
+### Licensing
+
+See the
+[LICENSE](https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/LICENSE)
+file for our project's licensing. We will ask you to confirm the licensing of
+your contribution.
+
+We may ask you to sign a
+[Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement)
+for larger changes.
+
+## Contributing to the codebase
+
+Documentation contributions can be made by cloning the repository, making
+changes to the Markdown files and then
+[issuing a Pull Request](#contributing-via-pull-requests). Small changes can be
+made by using the GitHub visual editor too.
+
+For contributions to the architecture or code, please read the
+[Local development guide](docs/LOCAL_DEVELOPMENT.md) for instructions on how to
+set up a local environment and run the tests. After issuing a
+[Pull Request](#contributing-via-pull-requests) an automated test suite will run
+and be reported on the Pull Request page. Make sure all the tests pass to
+facilitate and speed up code reviews. New features should include unit tests and
+acceptance tests when appropriate.
+
+If you need guidance or help, please let us know in the relevant GitHub issue.

+ 175 - 0
S3/NewFind/amazon-s3-find-and-forget-master/LICENSE

@@ -0,0 +1,175 @@
+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.

+ 211 - 0
S3/NewFind/amazon-s3-find-and-forget-master/Makefile

@@ -0,0 +1,211 @@
+SHELL := /bin/bash
+
+.PHONY : deploy deploy-containers pre-deploy setup test test-cov test-acceptance test-acceptance-cov test-no-state-machine test-no-state-machine-cov test-unit test-unit-cov
+
+# The name of the virtualenv directory to use
+VENV ?= venv
+
+pre-deploy:
+ifndef TEMP_BUCKET
+	$(error TEMP_BUCKET is undefined)
+endif
+ifndef ADMIN_EMAIL
+	$(error ADMIN_EMAIL is undefined)
+endif
+
+pre-run:
+ifndef ROLE_NAME
+	$(error ROLE_NAME is undefined)
+endif
+
+build-frontend:
+	npm run build --workspace frontend
+
+deploy:
+	make pre-deploy
+	make deploy-artefacts
+	make deploy-cfn
+	make setup-frontend-local-dev
+
+deploy-vpc:
+	aws cloudformation create-stack --template-body file://templates/vpc.yaml --stack-name S3F2-VPC
+
+deploy-cfn:
+	aws cloudformation package --template-file templates/template.yaml --s3-bucket $(TEMP_BUCKET) --output-template-file packaged.yaml
+	aws cloudformation deploy --template-file ./packaged.yaml --stack-name S3F2 --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
+		--parameter-overrides CreateCloudFrontDistribution=false EnableContainerInsights=true AdminEmail=$(ADMIN_EMAIL) \
+		AccessControlAllowOriginOverride=* PreBuiltArtefactsBucketOverride=$(TEMP_BUCKET) KMSKeyArns=$(KMS_KEYARNS)
+
+deploy-artefacts:
+	$(eval VERSION := $(shell $(MAKE) -s version))
+	make package-artefacts
+	aws s3 cp build.zip s3://$(TEMP_BUCKET)/amazon-s3-find-and-forget/$(VERSION)/build.zip
+
+.PHONY: format-cfn
+format-cfn:
+	$(eval VERSION := $(shell $(MAKE) -s version))
+	TEMP_FILE="$$(mktemp)" ; \
+		sed  -e '3s/.*/Description: Amazon S3 Find and Forget \(uksb-1q2j8beb0\) \(version:$(VERSION)\)/' templates/template.yaml > "$$TEMP_FILE" ; \
+		mv "$$TEMP_FILE" templates/template.yaml 
+	git add templates/template.yaml
+
+.PHONY: format-docs
+format-docs:
+	npx prettier ./*.md ./docs/*.md --write
+	git add *.md
+	git add docs/*.md
+
+.PHONY: format-js
+format-js:
+	npx prettier ./frontend/src/**/*.js --write
+	git add frontend/src/
+
+.PHONY: format-python
+format-python: | $(VENV)
+	for src in \
+		tests/ \
+		backend/ecs_tasks/ \
+		backend/lambdas/ \
+		backend/lambda_layers/boto_utils/python/boto_utils.py \
+		backend/lambda_layers/decorators/python/decorators.py \
+	; do \
+		$(VENV)/bin/black "$$src" \
+	; done
+
+generate-api-docs:
+	TEMP_FILE="$$(mktemp)" ; \
+		$(VENV)/bin/yq -y .Resources.Api.Properties.DefinitionBody ./templates/api.yaml > "$$TEMP_FILE" ; \
+		npx openapi-generator generate -i "$$TEMP_FILE" -g markdown -t ./docs/templates/ -o docs/api
+	git add docs/api
+
+.PHONY: generate-pip-requirements
+generate-pip-requirements: $(patsubst %.in,%.txt,$(shell find . -type f -name requirements.in))
+
+.PHONY: lint-cfn
+lint-cfn:
+	$(VENV)/bin/cfn-lint templates/*
+
+package:
+	make package-artefacts
+	zip -r packaged.zip \
+		backend/lambda_layers \
+		backend/lambdas \
+		build.zip \
+		cfn-publish.config \
+		templates \
+		-x '**/__pycache*' '*settings.js' @
+
+package-artefacts: backend/ecs_tasks/python_3.9-slim.tar
+	make build-frontend
+	zip -r build.zip \
+		backend/ecs_tasks/ \
+		backend/lambda_layers/boto_utils/ \
+		frontend/build \
+		-x '**/__pycache*' '*settings.js' @
+
+backend/ecs_tasks/python_3.9-slim.tar:
+	docker pull python:3.9-slim
+	docker save python:3.9-slim -o "$@"
+
+redeploy-containers:
+	$(eval ACCOUNT_ID := $(shell aws sts get-caller-identity --query Account --output text))
+	$(eval API_URL := $(shell aws cloudformation describe-stacks --stack-name S3F2 --query 'Stacks[0].Outputs[?OutputKey==`ApiUrl`].OutputValue' --output text))
+	$(eval REGION := $(shell echo $(API_URL) | cut -d'.' -f3))
+	$(eval ECR_REPOSITORY := $(shell aws cloudformation describe-stacks --stack-name S3F2 --query 'Stacks[0].Outputs[?OutputKey==`ECRRepository`].OutputValue' --output text))
+	$(eval REPOSITORY_URI := $(shell aws ecr describe-repositories --repository-names $(ECR_REPOSITORY) --query 'repositories[0].repositoryUri' --output text))
+	aws ecr get-login-password --region $(REGION) | docker login --username AWS --password-stdin $(ACCOUNT_ID).dkr.ecr.$(REGION).amazonaws.com
+	docker build -t $(ECR_REPOSITORY) -f backend/ecs_tasks/delete_files/Dockerfile .
+	docker tag $(ECR_REPOSITORY):latest $(REPOSITORY_URI):latest
+	docker push $(REPOSITORY_URI):latest
+
+redeploy-frontend:
+	$(eval WEBUI_BUCKET := $(shell aws cloudformation describe-stacks --stack-name S3F2 --query 'Stacks[0].Outputs[?OutputKey==`WebUIBucket`].OutputValue' --output text))
+	make build-frontend
+	cd frontend/build && aws s3 cp --recursive . s3://$(WEBUI_BUCKET) --acl public-read --exclude *settings.js
+
+run-local-container:
+	make pre-run
+	./docker_run_with_creds.sh
+
+setup: | $(VENV) lambda-layer-deps
+	(! [[ -d .git ]] || $(VENV)/bin/pre-commit install)
+	npm i
+	gem install cfn-nag
+
+# virtualenv setup
+.PHONY: $(VENV)
+$(VENV): $(VENV)/pip-sync.sentinel
+
+$(VENV)/pip-sync.sentinel: requirements.txt | $(VENV)/bin/pip-sync
+	$(VENV)/bin/pip-sync $<
+	touch $@
+
+$(VENV)/bin/activate:
+	test -d $(VENV) || virtualenv $(VENV)
+
+$(VENV)/bin/pip-compile $(VENV)/bin/pip-sync: $(VENV)/bin/activate
+	$(VENV)/bin/pip install pip-tools
+
+# Lambda layers
+.PHONY: lambda-layer-deps
+lambda-layer-deps: \
+	backend/lambda_layers/aws_sdk/requirements-installed.sentinel \
+	backend/lambda_layers/cr_helper/requirements-installed.sentinel \
+	backend/lambda_layers/decorators/requirements-installed.sentinel \
+	;
+
+backend/lambda_layers/%/requirements-installed.sentinel: backend/lambda_layers/%/requirements.txt | $(VENV)
+	@# pip-sync only works with virtualenv, so we can't use it here.
+	$(VENV)/bin/pip install -r $< -t $(subst requirements-installed.sentinel,python,$@)
+	touch $@
+
+setup-frontend-local-dev:
+	$(eval WEBUI_URL := $(shell aws cloudformation describe-stacks --stack-name S3F2 --query 'Stacks[0].Outputs[?OutputKey==`WebUIUrl`].OutputValue' --output text))
+	$(eval WEBUI_BUCKET := $(shell aws cloudformation describe-stacks --stack-name S3F2 --query 'Stacks[0].Outputs[?OutputKey==`WebUIBucket`].OutputValue' --output text))
+	$(if $(filter none, $(WEBUI_URL)), @echo "WebUI not deployed.", aws s3 cp s3://$(WEBUI_BUCKET)/settings.js frontend/public/settings.js)
+
+setup-predeploy:
+	virtualenv venv
+	source venv/bin/activate && pip install cfn-flip==1.2.2
+
+start-frontend-local:
+	npm start --workspace frontend
+
+start-frontend-remote:
+	$(eval WEBUI_URL := $(shell aws cloudformation describe-stacks --stack-name S3F2 --query 'Stacks[0].Outputs[?OutputKey==`WebUIUrl`].OutputValue' --output text))
+	$(if $(filter none, $(WEBUI_URL)), @echo "WebUI not deployed.", open $(WEBUI_URL))
+
+test-cfn:
+	cfn_nag templates/*.yaml --blacklist-path ci/cfn_nag_blacklist.yaml
+
+test-frontend:
+	npm t --workspace frontend
+
+test-unit: | $(VENV)
+	$(VENV)/bin/pytest -m unit --log-cli-level info --cov=backend.lambdas --cov=decorators --cov=boto_utils --cov=backend.ecs_tasks --cov-report term-missing
+
+test-ci: | $(VENV)
+	$(VENV)/bin/pytest -m unit --log-cli-level info --cov=backend.lambdas --cov=decorators --cov=boto_utils --cov=backend.ecs_tasks --cov-report xml
+
+test-acceptance-cognito: | $(VENV)
+	$(VENV)/bin/pytest -m acceptance_cognito --log-cli-level info
+
+test-acceptance-iam: | $(VENV)
+	$(VENV)/bin/pytest -m acceptance_iam --log-cli-level info
+
+test-no-state-machine: | $(VENV)
+	$(VENV)/bin/pytest -m "not state_machine" --log-cli-level info  --cov=backend.lambdas --cov=boto_utils --cov=decorators --cov=backend.ecs_tasks
+
+test: | $(VENV)
+	make test-cfn
+	$(VENV)/bin/pytest --log-cli-level info --cov=backend.lambdas --cov=decorators --cov=boto_utils --cov=backend.ecs_tasks
+	make test-frontend
+
+version:
+	@echo $(shell $(VENV)/bin/cfn-flip templates/template.yaml | $(VENV)/bin/python -c 'import sys, json; print(json.load(sys.stdin)["Mappings"]["Solution"]["Constants"]["Version"])')
+
+%/requirements.txt: %/requirements.in | $(VENV)/bin/pip-compile
+	$(VENV)/bin/pip-compile -q -o $@ $<
+
+requirements.txt: requirements.in $(shell awk '/^-r / { print $$2 }' requirements.in) | $(VENV)/bin/pip-compile
+	$(VENV)/bin/pip-compile -q -o $@ $<
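
Note on the `version` target: it pipes the JSON output of `cfn-flip` into a Python one-liner that reads `Mappings.Solution.Constants.Version`. A minimal standalone sketch of that lookup, assuming the template has already been converted to JSON. The `extract_version` helper and the sample version string are hypothetical and not part of this commit:

```python
import json


def extract_version(template_json: str) -> str:
    """Return Mappings.Solution.Constants.Version from a JSON CloudFormation template."""
    template = json.loads(template_json)
    return template["Mappings"]["Solution"]["Constants"]["Version"]


# Hypothetical template fragment; the real value lives in templates/template.yaml.
sample = json.dumps(
    {"Mappings": {"Solution": {"Constants": {"Version": "0.0.0-example"}}}}
)
print(extract_version(sample))
```

A missing mapping raises `KeyError`, which is probably the desired behaviour for a build step: the `deploy-artefacts` and `format-cfn` targets both depend on this value.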

+ 1 - 0
S3/NewFind/amazon-s3-find-and-forget-master/NOTICE

@@ -0,0 +1 @@
+Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.

+ 124 - 0
S3/NewFind/amazon-s3-find-and-forget-master/README.md

@@ -0,0 +1,124 @@
+<h1 align="center">
+    Amazon S3 Find and Forget
+    <br>
+    <img src="https://img.shields.io/github/v/release/awslabs/amazon-s3-find-and-forget?include_prereleases"> 
+    <img src="https://github.com/awslabs/amazon-s3-find-and-forget/workflows/Unit%20Tests/badge.svg"> 
+    <img src="https://codecov.io/gh/awslabs/amazon-s3-find-and-forget/branch/master/graph/badge.svg">
+</h1>
+
+> Warning: Consult the
+> [Production Readiness guidelines](docs/PRODUCTION_READINESS_GUIDELINES.md)
+> prior to using the solution with production data
+
+Amazon S3 Find and Forget is a solution for selectively erasing records from
+data lakes stored on Amazon Simple Storage Service (Amazon S3). It helps data
+lake operators handle data erasure requests, for example those made under the
+European General Data Protection Regulation (GDPR).
+
+The solution can be used with Parquet and JSON format data stored in Amazon S3
+buckets. Your data lake is connected to the solution via AWS Glue tables and by
+specifying which columns in the tables need to be used to identify the data to
+be erased.
+
+Once configured, you can queue the record identifiers whose corresponding data
+you want erased. You can then run a deletion job to remove that data from the
+objects in the data lake. A report log of all the S3 objects modified is
+provided.
+
+## Installation
+
+The solution is available as an AWS CloudFormation template and should take
+about 20 to 40 minutes to deploy. See the
+[deployment guide](docs/USER_GUIDE.md#deploying-the-solution) for one-click
+deployment instructions, and the [cost overview guide](docs/COST_OVERVIEW.md) to
+learn about costs.
+
+## Usage
+
+The solution provides a web user interface, and a REST API to allow you to
+integrate it in your own applications.
+
+See the [user guide](docs/USER_GUIDE.md) to learn how to use the solution and
+the [API specification](docs/api/README.md) to integrate the solution with your
+own applications.
+
+## Architecture
+
+The goal of the solution is to provide a secure, reliable, performant and
+cost-effective tool for finding and removing individual records within objects
+stored in S3 buckets. To achieve this goal, the solution adopts the following
+design principles:
+
+1. **Secure by design:**
+   - Every component is implemented with least privilege access
+   - Encryption is performed at all layers at rest and in transit
+   - Authentication is provided out of the box
+   - Expiration of logs is configurable
+   - Record identifiers (known as **Match IDs**) are automatically obfuscated or
+     irreversibly deleted as soon as possible when persisting state
+2. **Built to scale:** The system is designed and tested to work with
+   petabyte-scale Data Lakes containing thousands of partitions and hundreds of
+   thousands of objects
+3. **Cost optimised:**
+   - **Perform work in batches:** Since the time complexity of removing a single
+     vs multiple records in a single object is practically equal and it is
+     common for data owners to have the requirement of removing data within a
+     given _timeframe_, the solution is designed to allow the solution operator
+     to "queue" multiple matches to be removed in a single job.
+   - **Fail fast:** A deletion job takes place in two distinct phases: Find and
+     Forget. The Find phase queries the objects in your S3 data lakes to find
+     any objects which contain records where a specified column contains at
+     least one of the Match IDs in the deletion queue. If any queries fail, the
+     job is abandoned as soon as possible and the Forget phase does not take
+     place. The Forget phase takes the list of objects returned from the Find
+     phase, and deletes only the relevant rows in those objects.
+   - **Optimised for Parquet:** The split phase approach optimises scanning for
+     columnar dense formats such as Parquet. The Find phase only retrieves and
+     processes the data for relevant columns when determining which S3 objects
+     need to be processed in the Forget phase. This approach can have
+     significant cost savings when operating on large data lakes with sparse
+     matches.
+   - **Serverless:** Where possible, the solution only uses Serverless
+     components to avoid costs for idle resources. All the components for Web
+     UI, API and Deletion Jobs are Serverless (for more information consult the
+     [Cost Overview guide](docs/COST_OVERVIEW.md)).
+4. **Robust monitoring and logging:** When performing deletion jobs, information
+   is provided in real-time to provide visibility. After the job completes,
+   detailed reports are available documenting all the actions performed to
+   individual S3 Objects, and detailed error traces in case of failures to
+   facilitate troubleshooting processes and identify remediation actions. For
+   more information consult the
+   [Troubleshooting guide](docs/TROUBLESHOOTING.md).
+
+### High-level overview diagram
+
+![Architecture Diagram](docs/images/architecture.png)
+
+See the [Architecture guide](docs/ARCHITECTURE.md) to learn more about the
+architecture.
+
+## Documentation
+
+- [User Guide](docs/USER_GUIDE.md)
+- [Deployment](docs/USER_GUIDE.md#deploying-the-solution)
+- [Architecture](docs/ARCHITECTURE.md)
+- [Troubleshooting](docs/TROUBLESHOOTING.md)
+- [Monitoring the Solution](docs/MONITORING.md)
+- [Security](docs/SECURITY.md)
+- [Cost Overview](docs/COST_OVERVIEW.md)
+- [Limits](docs/LIMITS.md)
+- [API Specification](docs/api/README.md)
+- [Local Development](docs/LOCAL_DEVELOPMENT.md)
+- [Production Readiness guidelines](docs/PRODUCTION_READINESS_GUIDELINES.md)
+- [Change Log](CHANGELOG.md)
+- [Upgrade Guide](docs/UPGRADE_GUIDE.md)
+
+## Contributing
+
+Contributions are more than welcome. Please read the
+[code of conduct](CODE_OF_CONDUCT.md) and the
+[contributing guidelines](CONTRIBUTING.md).
+
+## License Summary
+
+This project is licensed under the Apache-2.0 License.
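
Note on the README's "Perform work in batches" and "Fail fast" principles: the two-phase job can be illustrated with a toy, in-memory sketch. Everything here is an assumption made for illustration (the dict-based "data lake", the `find`, `forget` and `run_job` names); the real solution queries S3 through AWS Glue tables rather than Python dicts.

```python
from typing import Dict, List, Set

# Hypothetical in-memory "data lake": object key -> rows (each row is a dict).
DataLake = Dict[str, List[dict]]


def find(lake: DataLake, column: str, match_ids: Set[str]) -> List[str]:
    """Find phase: return keys of objects holding at least one matching row."""
    return [
        key
        for key, rows in lake.items()
        if any(row.get(column) in match_ids for row in rows)
    ]


def forget(lake: DataLake, keys: List[str], column: str, match_ids: Set[str]) -> int:
    """Forget phase: rewrite only the listed objects, dropping matching rows."""
    deleted = 0
    for key in keys:
        kept = [row for row in lake[key] if row.get(column) not in match_ids]
        deleted += len(lake[key]) - len(kept)
        lake[key] = kept
    return deleted


def run_job(lake: DataLake, column: str, queue: List[str]):
    """Process the whole queue of Match IDs in a single pass (batching)."""
    match_ids = set(queue)
    # Any exception raised by find() propagates here, so forget() never runs
    # on a partial result (the "fail fast" behaviour).
    keys = find(lake, column, match_ids)
    return keys, forget(lake, keys, column, match_ids)


lake = {
    "a.json": [{"customer_id": "1"}, {"customer_id": "2"}],
    "b.json": [{"customer_id": "3"}],
}
keys, deleted = run_job(lake, "customer_id", ["1", "9"])
print(keys, deleted)
```

Only `a.json` is rewritten in this run; `b.json` is never touched because the Find phase did not return it, which is where the cost saving for sparse matches comes from.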

+ 39 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/Dockerfile

@@ -0,0 +1,39 @@
+ARG src_path=backend/ecs_tasks/delete_files
+ARG layers_path=backend/lambda_layers
+
+FROM python:3.9-slim as base
+
+RUN apt-get update --fix-missing
+RUN apt-get -y install g++ gcc libsnappy-dev
+
+FROM base as builder
+
+ARG src_path
+ARG layers_path
+
+RUN mkdir /install
+WORKDIR /install
+COPY $src_path/requirements.txt /requirements.txt
+
+RUN pip3 install \
+    -r /requirements.txt \
+    -t /install \
+    --compile \
+    --no-cache-dir
+
+FROM base
+
+ARG src_path
+ARG layers_path
+
+RUN groupadd -r s3f2 && useradd --no-log-init -r -m -g s3f2 s3f2
+USER s3f2
+RUN mkdir /home/s3f2/app
+RUN echo ${src_path}
+COPY --from=builder /install /home/s3f2/.local/lib/python3.9/site-packages/
+WORKDIR /home/s3f2/app
+COPY $src_path/* \
+     $layers_path/boto_utils/python/boto_utils.py \
+     /home/s3f2/app/
+
+CMD ["python3", "-u", "main.py"]

+ 0 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/__init__.py


+ 138 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/cse.py

@@ -0,0 +1,138 @@
+import base64
+import json
+import logging
+import os
+from io import BytesIO
+
+from cryptography.hazmat.primitives.ciphers import Cipher
+from cryptography.hazmat.primitives.ciphers.aead import AESGCM
+from cryptography.hazmat.primitives.ciphers.algorithms import AES
+from cryptography.hazmat.primitives.ciphers.modes import CBC
+from cryptography.hazmat.primitives.padding import PKCS7
+
+logger = logging.getLogger(__name__)
+
+AES_BLOCK_SIZE = 128
+ALG_CBC = "AES/CBC/PKCS5Padding"
+ALG_GCM = "AES/GCM/NoPadding"
+HEADER_ALG = "x-amz-cek-alg"
+HEADER_KEY = "x-amz-key-v2"
+HEADER_IV = "x-amz-iv"
+HEADER_MATDESC = "x-amz-matdesc"
+HEADER_TAG_LEN = "x-amz-tag-len"
+HEADER_UE_CLENGHT = "x-amz-unencrypted-content-length"
+HEADER_WRAP_ALG = "x-amz-wrap-alg"
+
+
+def is_kms_cse_encrypted(s3_metadata):
+    if HEADER_KEY in s3_metadata:
+        if s3_metadata.get(HEADER_WRAP_ALG, None) != "kms":
+            raise ValueError("Unsupported Encryption strategy")
+        if s3_metadata.get(HEADER_ALG, None) not in [ALG_CBC, ALG_GCM]:
+            raise ValueError("Unsupported Encryption algorithm")
+        return True
+    elif "x-amz-key" in s3_metadata:
+        raise ValueError("Unsupported Amazon S3 Encryption Client Version")
+    return False
+
+
+def get_encryption_aes_key(key, kms_client):
+    encryption_context = {"kms_cmk_id": key}
+    response = kms_client.generate_data_key(
+        KeyId=key, EncryptionContext=encryption_context, KeySpec="AES_256"
+    )
+    return (
+        response["Plaintext"],
+        encryption_context,
+        base64.b64encode(response["CiphertextBlob"]).decode(),
+    )
+
+
+def get_decryption_aes_key(key, material_description, kms_client):
+    return kms_client.decrypt(
+        CiphertextBlob=key, EncryptionContext=material_description
+    )["Plaintext"]
+
+
+def encrypt(buf, s3_metadata, kms_client):
+    """
+    Method to encrypt an S3 object with KMS based Client-side encryption (CSE).
+    The original object's metadata (previously used to decrypt the content) is
+    used to infer some parameters such as the algorithm originally used to encrypt
+    the previous version (which is left unchanged) and to store the new envelope,
+    including the initialization vector (IV).
+    """
+    logger.info("Encrypting Object with CSE-KMS")
+    content = buf.read()
+    alg = s3_metadata.get(HEADER_ALG, None)
+    matdesc = json.loads(s3_metadata[HEADER_MATDESC])
+    aes_key, matdesc_metadata, key_metadata = get_encryption_aes_key(
+        matdesc["kms_cmk_id"], kms_client
+    )
+    s3_metadata[HEADER_UE_CLENGHT] = str(len(content))
+    s3_metadata[HEADER_WRAP_ALG] = "kms"
+    s3_metadata[HEADER_KEY] = key_metadata
+    s3_metadata[HEADER_ALG] = alg
+    if alg == ALG_GCM:
+        s3_metadata[HEADER_TAG_LEN] = str(AES_BLOCK_SIZE)
+        result, iv = encrypt_gcm(aes_key, content)
+    else:
+        result, iv = encrypt_cbc(aes_key, content)
+    s3_metadata[HEADER_IV] = base64.b64encode(iv).decode()
+    return BytesIO(result), s3_metadata
+
+
+def decrypt(file_input, s3_metadata, kms_client):
+    """
+    Method to decrypt an S3 object with KMS based Client-side encryption (CSE).
+    The object's metadata is used to fetch the encryption envelope such as
+    the KMS key ID and the algorithm.
+    """
+    logger.info("Decrypting Object with CSE-KMS")
+    alg = s3_metadata.get(HEADER_ALG, None)
+    iv = base64.b64decode(s3_metadata[HEADER_IV])
+    material_description = json.loads(s3_metadata[HEADER_MATDESC])
+    key = s3_metadata[HEADER_KEY]
+    decryption_key = base64.b64decode(key)
+    aes_key = get_decryption_aes_key(decryption_key, material_description, kms_client)
+    content = file_input.read()
+    decrypted = (
+        decrypt_gcm(content, aes_key, iv)
+        if alg == ALG_GCM
+        else decrypt_cbc(content, aes_key, iv)
+    )
+    return BytesIO(decrypted)
+
+
+# AES/CBC/PKCS5Padding
+
+
+def encrypt_cbc(aes_key, content):
+    iv = os.urandom(16)
+    padder = PKCS7(AES.block_size).padder()
+    padded_result = padder.update(content) + padder.finalize()
+    aescbc = Cipher(AES(aes_key), CBC(iv)).encryptor()
+    result = aescbc.update(padded_result) + aescbc.finalize()
+    return result, iv
+
+
+def decrypt_cbc(content, aes_key, iv):
+    aescbc = Cipher(AES(aes_key), CBC(iv)).decryptor()
+    padded_result = aescbc.update(content) + aescbc.finalize()
+    unpadder = PKCS7(AES.block_size).unpadder()
+    return unpadder.update(padded_result) + unpadder.finalize()
+
+
+# AES/GCM/NoPadding
+
+
+def encrypt_gcm(aes_key, content):
+    iv = os.urandom(12)
+    aesgcm = AESGCM(aes_key)
+    result = aesgcm.encrypt(iv, content, None)
+    return result, iv
+
+
+def decrypt_gcm(content, aes_key, iv):
+    aesgcm = AESGCM(aes_key)
+    return aesgcm.decrypt(iv, content, None)
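
Note on `encrypt_cbc` / `decrypt_cbc`: `PKCS7(AES.block_size)` pads the plaintext to a multiple of the 128-bit AES block before CBC encryption. What that padding does can be shown with a pure-Python sketch. The `pkcs7_pad` and `pkcs7_unpad` helpers are hypothetical, written for illustration only, and are not a substitute for the `cryptography` padder used in the module:

```python
BLOCK_BYTES = 16  # AES block size: 128 bits


def pkcs7_pad(data: bytes, block: int = BLOCK_BYTES) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of `block`."""
    pad_len = block - (len(data) % block)  # always in 1..block
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes, block: int = BLOCK_BYTES) -> bytes:
    """Strip and validate PKCS7 padding."""
    if not padded or len(padded) % block:
        raise ValueError("Invalid padded length")
    pad_len = padded[-1]
    if not 1 <= pad_len <= block or padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("Invalid padding bytes")
    return padded[:-pad_len]


padded = pkcs7_pad(b"hello")
print(len(padded), pkcs7_unpad(padded))
```

Input that is already block-aligned still gains a full block of padding, so the ciphertext length alone does not reveal the plaintext length. That is presumably why `encrypt` also records the original size under `x-amz-unencrypted-content-length`.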

+ 90 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/events.py

@@ -0,0 +1,90 @@
+import json
+import os
+import logging
+import urllib.request
+import urllib.error
+from collections.abc import Iterable
+from functools import lru_cache
+
+from boto_utils import emit_event
+
+logger = logging.getLogger(__name__)
+
+
+def emit_deletion_event(message_body, stats):
+    job_id = message_body["JobId"]
+    event_data = {
+        "Statistics": stats,
+        "Object": message_body["Object"],
+    }
+    emit_event(job_id, "ObjectUpdated", event_data, get_emitter_id())
+
+
+def emit_skipped_event(message_body, skip_reason):
+    job_id = message_body["JobId"]
+    event_data = {
+        "Object": message_body["Object"],
+        "Reason": skip_reason,
+    }
+    emit_event(job_id, "ObjectUpdateSkipped", event_data, get_emitter_id())
+
+
+def emit_failure_event(message_body, err_message, event_name):
+    json_body = json.loads(message_body)
+    job_id = json_body.get("JobId")
+    if not job_id:
+        raise ValueError("Message missing Job ID")
+    event_data = {
+        "Error": err_message,
+        "Message": json_body,
+    }
+    emit_event(job_id, event_name, event_data, get_emitter_id())
+
+
+def sanitize_message(err_message, message_body):
+    """
+    Obtain all the known match IDs from the original message and ensure
+    they are masked in the given err message
+    """
+    try:
+        sanitised = err_message
+        if not isinstance(message_body, dict):
+            message_body = json.loads(message_body)
+        matches = []
+        cols = message_body.get("Columns", [])
+        for col in cols:
+            match_ids = col.get("MatchIds")
+            if isinstance(match_ids, Iterable):
+                matches.extend(match_ids)
+        for m in matches:
+            sanitised = sanitised.replace(str(m), "*** MATCH ID ***")
+        return sanitised
+    except (json.decoder.JSONDecodeError, ValueError):
+        return err_message
+
+
+@lru_cache()
+def get_emitter_id():
+    metadata_endpoint = os.getenv("ECS_CONTAINER_METADATA_URI")
+    if metadata_endpoint:
+        res = ""
+        try:
+            res = urllib.request.urlopen(metadata_endpoint, timeout=1).read()
+            metadata = json.loads(res)
+            return "ECSTask_{}".format(
+                metadata["Labels"]["com.amazonaws.ecs.task-arn"].rsplit("/", 1)[1]
+            )
+        except urllib.error.URLError as e:
+            logger.warning(
+                "Error when accessing the metadata service: {}".format(e.reason)
+            )
+        except (AttributeError, KeyError, IndexError) as e:
+            logger.warning(
+                "Malformed response from the metadata service: {}".format(res)
+            )
+        except Exception as e:
+            logger.warning(
+                "Error when getting emitter id from metadata service: {}".format(str(e))
+            )
+
+    return "ECSTask"
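
Note on `get_emitter_id`: the function keeps the last path segment of the task ARN found in the ECS container metadata. A small sketch of that parsing step on an already-fetched response, with no network call. The `emitter_id_from_metadata` helper and the sample payload (account ID, task ID) are hypothetical placeholders:

```python
import json


def emitter_id_from_metadata(raw: bytes) -> str:
    """Mirror the parsing in get_emitter_id for an already-fetched response."""
    metadata = json.loads(raw)
    task_arn = metadata["Labels"]["com.amazonaws.ecs.task-arn"]
    return "ECSTask_{}".format(task_arn.rsplit("/", 1)[1])


# Hypothetical metadata payload; the account ID and task ID are placeholders.
sample = json.dumps(
    {
        "Labels": {
            "com.amazonaws.ecs.task-arn": "arn:aws:ecs:eu-west-1:123456789012:task/0123456789abcdef"
        }
    }
).encode()
print(emitter_id_from_metadata(sample))
```

An ARN without a `/` would raise `IndexError` here, which matches the `(AttributeError, KeyError, IndexError)` branch in the original function that falls back to the plain `"ECSTask"` identifier.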

+ 81 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/json_handler.py

@@ -0,0 +1,81 @@
+from gzip import GzipFile
+from io import BytesIO
+import json
+from collections import Counter
+
+from boto_utils import json_lines_iterator
+
+from pyarrow import BufferOutputStream, CompressedOutputStream
+
+
+def initialize(input_file, out_stream, compressed):
+    if compressed:
+        bytestream = BytesIO(input_file.read())
+        input_file = GzipFile(None, "rb", fileobj=bytestream)
+    gzip_stream = CompressedOutputStream(out_stream, "gzip") if compressed else None
+    writer = gzip_stream if compressed else out_stream
+    return input_file, writer
+
+
+def find_key(key, obj):
+    """
+    Athena openx SerDe is case insensitive, and converts by default each object's key
+    to a lowercase value: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html
+
+    Here we convert the DataMapper value for the column identifier
+    (for instance, customerid) to the JSON's object key (for instance, customerId).
+    """
+    if not obj:
+        return None
+    for found_key in obj.keys():
+        if key.lower() == found_key.lower():
+            return found_key
+
+
+def get_value(key, obj):
+    """
+    Method to find a value given a nested key in an object. Example:
+    key="user.Id"
+    obj='{"user":{"id": 1234}}'
+    result=1234
+    """
+    for segment in key.split("."):
+        current_key = find_key(segment, obj)
+        if not current_key:
+            return None
+        obj = obj[current_key]
+    return obj
+
+
+def delete_matches_from_json_file(input_file, to_delete, compressed=False):
+    deleted_rows = 0
+    with BufferOutputStream() as out_stream:
+        input_file, writer = initialize(input_file, out_stream, compressed)
+        content = input_file.read().decode("utf-8")
+        total_rows = 0
+        for parsed, line in json_lines_iterator(content, include_unparsed=True):
+            total_rows += 1
+            should_delete = False
+            for column in to_delete:
+                if column["Type"] == "Simple":
+                    record = get_value(column["Column"], parsed)
+                    if record and record in column["MatchIds"]:
+                        should_delete = True
+                        break
+                else:
+                    matched = []
+                    for col in column["Columns"]:
+                        record = get_value(col, parsed)
+                        if record:
+                            matched.append(record)
+                    if matched in column["MatchIds"]:
+                        should_delete = True
+                        break
+            if should_delete:
+                deleted_rows += 1
+            else:
+                writer.write(bytes(line + "\n", "utf-8"))
+        if compressed:
+            writer.close()
+        stats = Counter({"ProcessedRows": total_rows, "DeletedRows": deleted_rows})
+        return out_stream, stats
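
Note on the matching rules in `delete_matches_from_json_file`: simple columns compare one value against the list of Match IDs, while composite columns compare a list of values against a list of lists. The sketch below copies `find_key` and `get_value` so it runs standalone; the sample row and Match IDs are made up for illustration:

```python
def find_key(key, obj):
    """Case-insensitive key lookup, as in json_handler.find_key."""
    if not obj:
        return None
    for found_key in obj.keys():
        if key.lower() == found_key.lower():
            return found_key
    return None


def get_value(key, obj):
    """Resolve a dotted key such as "user.Id" against nested dicts."""
    for segment in key.split("."):
        current_key = find_key(segment, obj)
        if not current_key:
            return None
        obj = obj[current_key]
    return obj


row = {"User": {"Id": 1234, "Country": "IT"}}

# Simple column: a single value is compared against the list of Match IDs.
simple_hit = get_value("user.id", row) in [1234, 5678]

# Composite column: the collected truthy values form a list that must equal
# one of the Match ID lists, in the same order.
values = (get_value(c, row) for c in ["user.id", "user.country"])
matched = [v for v in values if v]
composite_hit = matched in [[1234, "IT"]]

print(simple_hit, composite_hit)
```

Because the handler only keeps truthy values (`if record`), a field holding `0` or an empty string is skipped and never matches a Match ID.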

+ 315 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/main.py

@@ -0,0 +1,315 @@
+import argparse
+import json
+import os
+import sys
+import signal
+import time
+import logging
+from multiprocessing import Pool, cpu_count
+from operator import itemgetter
+
+import boto3
+import pyarrow as pa
+from boto_utils import get_session, json_lines_iterator, parse_s3_url
+from botocore.exceptions import ClientError
+from pyarrow.lib import ArrowException
+
+from cse import decrypt, encrypt, is_kms_cse_encrypted
+from events import (
+    sanitize_message,
+    emit_failure_event,
+    emit_deletion_event,
+    emit_skipped_event,
+)
+from json_handler import delete_matches_from_json_file
+from parquet_handler import delete_matches_from_parquet_file
+from s3 import (
+    delete_old_versions,
+    DeleteOldVersionsError,
+    fetch_manifest,
+    get_object_info,
+    IntegrityCheckFailedError,
+    rollback_object_version,
+    save,
+    validate_bucket_versioning,
+    verify_object_versions_integrity,
+)
+
+FIVE_MB = 5 * 2**20
+ROLE_SESSION_NAME = "s3f2"
+
+logger = logging.getLogger(__name__)
+logger.setLevel(os.getenv("LOG_LEVEL", logging.INFO))
+formatter = logging.Formatter("[%(levelname)s] %(message)s")
+handler = logging.StreamHandler(stream=sys.stdout)
+handler.setFormatter(formatter)
+logger.addHandler(handler)
+
+
+def handle_error(
+    sqs_msg,
+    message_body,
+    err_message,
+    event_name="ObjectUpdateFailed",
+    change_msg_visibility=True,
+):
+    logger.error(sanitize_message(err_message, message_body))
+    try:
+        emit_failure_event(message_body, err_message, event_name)
+    except KeyError:
+        logger.error("Unable to emit failure event due to invalid Job ID")
+    except (json.decoder.JSONDecodeError, ValueError):
+        logger.error("Unable to emit failure event due to invalid message")
+    except ClientError as e:
+        logger.error("Unable to emit failure event: %s", str(e))
+
+    if change_msg_visibility:
+        try:
+            sqs_msg.change_visibility(VisibilityTimeout=0)
+        except (
+            sqs_msg.meta.client.exceptions.MessageNotInflight,
+            sqs_msg.meta.client.exceptions.ReceiptHandleIsInvalid,
+        ) as e:
+            logger.error("Unable to change message visibility: %s", str(e))
+
+
+def handle_skip(sqs_msg, message_body, skip_reason):
+    sqs_msg.delete()
+    logger.info(sanitize_message(skip_reason, message_body))
+    emit_skipped_event(message_body, skip_reason)
+
+
+def validate_message(message):
+    body = json.loads(message)
+    mandatory_keys = ["JobId", "Object", "Columns"]
+    for k in mandatory_keys:
+        if k not in body:
+            raise ValueError("Malformed message. Missing key: {}".format(k))
+
+
+def delete_matches_from_file(input_file, to_delete, file_format, compressed=False):
+    logger.info("Generating new file without matches")
+    if file_format == "json":
+        return delete_matches_from_json_file(input_file, to_delete, compressed)
+    return delete_matches_from_parquet_file(input_file, to_delete)
+
+
+def build_matches(cols, manifest_object):
+    """
+    This function takes the columns and the manifests, and returns
+    the match_ids grouped by column.
+    Input example:
+    [{"Column":"customer_id", "Type":"Simple"}]
+    Output example:
+    [{"Column":"customer_id", "Type":"Simple", "MatchIds":[123, 234]}]
+    """
+    COMPOSITE_MATCH_TOKEN = "_S3F2COMP_"
+    manifest = fetch_manifest(manifest_object)
+    matches = {}
+    for line in json_lines_iterator(manifest):
+        if line["QueryableColumns"] not in matches:
+            matches[line["QueryableColumns"]] = []
+        is_simple = len(line["Columns"]) == 1
+        match = line["MatchId"][0] if is_simple else line["MatchId"]
+        matches[line["QueryableColumns"]].append(match)
+    return list(
+        map(
+            lambda c: {
+                "MatchIds": matches[
+                    COMPOSITE_MATCH_TOKEN.join(c["Columns"])
+                    if "Columns" in c
+                    else c["Column"]
+                ],
+                **c,
+            },
+            cols,
+        )
+    )
+
+
+def execute(queue_url, message_body, receipt_handle):
+    logger.info("Message received")
+    queue = get_queue(queue_url)
+    msg = queue.Message(receipt_handle)
+    try:
+        # Parse and validate incoming message
+        validate_message(message_body)
+        body = json.loads(message_body)
+        session = get_session(body.get("RoleArn"), ROLE_SESSION_NAME)
+        ignore_not_found_exceptions = body.get("IgnoreObjectNotFoundExceptions", False)
+        client = session.client("s3")
+        kms_client = session.client("kms")
+        cols, object_path, job_id, file_format, manifest_object = itemgetter(
+            "Columns", "Object", "JobId", "Format", "Manifest"
+        )(body)
+        input_bucket, input_key = parse_s3_url(object_path)
+        validate_bucket_versioning(client, input_bucket)
+        match_ids = build_matches(cols, manifest_object)
+        s3 = pa.fs.S3FileSystem(
+            region=os.getenv("AWS_DEFAULT_REGION"),
+            session_name=ROLE_SESSION_NAME,
+            external_id=ROLE_SESSION_NAME,
+            role_arn=body.get("RoleArn"),
+            load_frequency=60 * 60,
+        )
+        # Download the object in-memory and convert to PyArrow NativeFile
+        logger.info("Downloading and opening %s object in-memory", object_path)
+        with s3.open_input_stream(
+            "{}/{}".format(input_bucket, input_key), buffer_size=FIVE_MB
+        ) as f:
+            source_version = f.metadata()["VersionId"].decode("utf-8")
+            logger.info("Using object version %s as source", source_version)
+            # Write new file in-memory
+            compressed = object_path.endswith(".gz")
+            object_info, _ = get_object_info(
+                client, input_bucket, input_key, source_version
+            )
+            metadata = object_info["Metadata"]
+            is_encrypted = is_kms_cse_encrypted(metadata)
+            input_file = decrypt(f, metadata, kms_client) if is_encrypted else f
+            out_sink, stats = delete_matches_from_file(
+                input_file, match_ids, file_format, compressed
+            )
+        if stats["DeletedRows"] == 0:
+            raise ValueError(
+                "The object {} was processed successfully but no rows required deletion".format(
+                    object_path
+                )
+            )
+        with pa.BufferReader(out_sink.getvalue()) as output_buf:
+            if is_encrypted:
+                output_buf, metadata = encrypt(output_buf, metadata, kms_client)
+            logger.info("Uploading new object version to S3")
+            new_version = save(
+                client,
+                output_buf,
+                input_bucket,
+                input_key,
+                metadata,
+                source_version,
+            )
+        logger.info("New object version: %s", new_version)
+        verify_object_versions_integrity(
+            client, input_bucket, input_key, source_version, new_version
+        )
+        if body.get("DeleteOldVersions"):
+            logger.info(
+                "Deleting object {} versions older than version {}".format(
+                    input_key, new_version
+                )
+            )
+            delete_old_versions(client, input_bucket, input_key, new_version)
+        msg.delete()
+        emit_deletion_event(body, stats)
+    except FileNotFoundError as e:
+        err_message = "Apache Arrow S3FileSystem Error: {}".format(str(e))
+        if ignore_not_found_exceptions:
+            handle_skip(msg, body, "Ignored error: {}".format(err_message))
+        else:
+            handle_error(msg, message_body, err_message)
+    except (KeyError, ArrowException) as e:
+        err_message = "Apache Arrow processing error: {}".format(str(e))
+        handle_error(msg, message_body, err_message)
+    except IOError as e:
+        err_message = "Unable to retrieve object: {}".format(str(e))
+        handle_error(msg, message_body, err_message)
+    except MemoryError as e:
+        err_message = "Insufficient memory to work on object: {}".format(str(e))
+        handle_error(msg, message_body, err_message)
+    except ClientError as e:
+        ignore_error = False
+        err_message = "ClientError: {}".format(str(e))
+        if e.operation_name == "PutObjectAcl":
+            err_message += ". Redacted object uploaded successfully but unable to restore WRITE ACL"
+        if e.operation_name == "ListObjectVersions":
+            err_message += ". Could not verify redacted object version integrity"
+        if e.operation_name == "HeadObject" and e.response["Error"]["Code"] == "404":
+            ignore_error = ignore_not_found_exceptions
+        if ignore_error:
+            skip_reason = "Ignored error: {}".format(err_message)
+            handle_skip(msg, body, skip_reason)
+        else:
+            handle_error(msg, message_body, err_message)
+    except ValueError as e:
+        err_message = "Unprocessable message: {}".format(str(e))
+        handle_error(msg, message_body, err_message)
+    except DeleteOldVersionsError as e:
+        err_message = "Unable to delete previous versions: {}".format(str(e))
+        handle_error(msg, message_body, err_message)
+    except IntegrityCheckFailedError as e:
+        err_description, client, bucket, key, version_id = e.args
+        err_message = "Object version integrity check failed: {}".format(
+            err_description
+        )
+        handle_error(msg, message_body, err_message)
+        rollback_object_version(
+            client,
+            bucket,
+            key,
+            version_id,
+            on_error=lambda err: handle_error(
+                None, "{}", err, "ObjectRollbackFailed", False
+            ),
+        )
+    except Exception as e:
+        err_message = "Unknown error during message processing: {}".format(str(e))
+        handle_error(msg, message_body, err_message)
+
+
+def kill_handler(msgs, process_pool):
+    logger.info("Received shutdown signal. Cleaning up %s messages", str(len(msgs)))
+    process_pool.terminate()
+    for msg in msgs:
+        try:
+            handle_error(msg, msg.body, "SIGINT/SIGTERM received during processing")
+        except (ClientError, ValueError) as e:
+            logger.error("Unable to gracefully cleanup message: %s", str(e))
+    sys.exit(1 if len(msgs) > 0 else 0)
+
+
+def get_queue(queue_url, **resource_kwargs):
+    if not resource_kwargs.get("endpoint_url") and os.getenv("AWS_DEFAULT_REGION"):
+        resource_kwargs["endpoint_url"] = "https://sqs.{}.{}".format(
+            os.getenv("AWS_DEFAULT_REGION"), os.getenv("AWS_URL_SUFFIX")
+        )
+    sqs = boto3.resource("sqs", **resource_kwargs)
+    return sqs.Queue(queue_url)
+
+
+def main(queue_url, max_messages, wait_time, sleep_time):
+    logger.info("CPU count for system: %s", cpu_count())
+    messages = []
+    queue = get_queue(queue_url)
+    with Pool(maxtasksperchild=1) as pool:
+        signal.signal(signal.SIGINT, lambda *_: kill_handler(messages, pool))
+        signal.signal(signal.SIGTERM, lambda *_: kill_handler(messages, pool))
+        while True:
+            logger.info("Fetching messages...")
+            messages = queue.receive_messages(
+                WaitTimeSeconds=wait_time, MaxNumberOfMessages=max_messages
+            )
+            if len(messages) == 0:
+                logger.info("No messages. Sleeping")
+                time.sleep(sleep_time)
+            else:
+                processes = [(queue_url, m.body, m.receipt_handle) for m in messages]
+                pool.starmap(execute, processes)
+                messages = []
+
+
+def parse_args(args):
+    parser = argparse.ArgumentParser(
+        description="Read and process new deletion tasks from a deletion queue"
+    )
+    parser.add_argument("--wait_time", type=int, default=5)
+    parser.add_argument("--max_messages", type=int, default=1)
+    parser.add_argument("--sleep_time", type=int, default=30)
+    parser.add_argument(
+        "--queue_url", type=str, default=os.getenv("DELETE_OBJECTS_QUEUE")
+    )
+    return parser.parse_args(args)
+
+
+if __name__ == "__main__":
+    opts = parse_args(sys.argv[1:])
+    main(opts.queue_url, opts.max_messages, opts.wait_time, opts.sleep_time)
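
The grouping done by `build_matches` in main.py can be sketched in isolation. This is a minimal illustration, not the project's implementation: `group_matches` and the sample manifest lines are hypothetical, and the manifest field names are assumed from what the diff reads (`Columns`, `MatchId`, `QueryableColumns`).

```python
import json

COMPOSITE_MATCH_TOKEN = "_S3F2COMP_"


def group_matches(cols, manifest_lines):
    """Group MatchIds by queryable column key and attach them to each column config."""
    matches = {}
    for raw in manifest_lines:
        line = json.loads(raw)
        # Simple columns carry a one-element MatchId list; unwrap it
        is_simple = len(line["Columns"]) == 1
        match = line["MatchId"][0] if is_simple else line["MatchId"]
        matches.setdefault(line["QueryableColumns"], []).append(match)
    grouped = []
    for col in cols:
        # Composite columns are keyed by their names joined with the token
        key = (
            COMPOSITE_MATCH_TOKEN.join(col["Columns"])
            if "Columns" in col
            else col["Column"]
        )
        grouped.append({"MatchIds": matches[key], **col})
    return grouped


# Hypothetical manifest content, one JSON document per line
manifest = [
    '{"Columns": ["customer_id"], "MatchId": ["123"], "QueryableColumns": "customer_id"}',
    '{"Columns": ["customer_id"], "MatchId": ["234"], "QueryableColumns": "customer_id"}',
    '{"Columns": ["first_name", "last_name"], "MatchId": ["Jane", "Doe"], '
    '"QueryableColumns": "first_name_S3F2COMP_last_name"}',
]
cols = [
    {"Column": "customer_id", "Type": "Simple"},
    {"Columns": ["first_name", "last_name"], "Type": "Composite"},
]
grouped = group_matches(cols, manifest)
```

Using `setdefault` avoids the separate membership check; the resulting structure matches the output example in the `build_matches` docstring.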

+ 170 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/parquet_handler.py

@@ -0,0 +1,170 @@
+from decimal import Decimal
+import logging
+import os
+import sys
+from collections import Counter
+from io import BytesIO
+
+import numpy as np
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+logger = logging.getLogger(__name__)
+logger.setLevel(os.getenv("LOG_LEVEL", logging.INFO))
+formatter = logging.Formatter("[%(levelname)s] %(message)s")
+handler = logging.StreamHandler(stream=sys.stdout)
+handler.setFormatter(formatter)
+logger.addHandler(handler)
+
+
+def load_parquet(f):
+    return pq.ParquetFile(BytesIO(f.read()), memory_map=False)
+
+
+def case_insensitive_getter(from_array, value):
+    """
+    When creating a Glue Table (either manually or via crawler), column names
+    are automatically lowercased. If the column identifier is saved in
+    the data mapper consistently with the Glue table, the getter may not
+    work when accessing the key directly inside the Parquet object. To
+    prevent this from happening, we use this case-insensitive getter to
+    iterate over columns.
+    """
+    return next(x for x in from_array if value.lower() == x.lower())
+
+
+def get_row_indexes_to_delete_for_composite(table, identifiers, to_delete):
+    """
+    Iterates over the values of a particular group of columns and returns a
+    numpy mask identifying the rows to delete. The column identifier is a
+    list of simple or complex identifiers, like ["user_first_name", "user.last_name"]
+    """
+    indexes = []
+    data = {}
+    to_delete_set = set(map(tuple, to_delete))
+    for identifier in identifiers:
+        column_first_level = identifier.split(".")[0].lower()
+        if column_first_level not in data:
+            column_identifier = case_insensitive_getter(
+                table.column_names, column_first_level
+            )
+            data[column_first_level] = table.column(column_identifier).to_pylist()
+    for i in range(table.num_rows):
+        values_array = []
+        for identifier in identifiers:
+            segments = identifier.split(".")
+            current = data[segments[0].lower()][i]
+            for j in range(1, len(segments)):
+                next_segment = case_insensitive_getter(
+                    list(current.keys()), segments[j]
+                )
+                current = current[next_segment]
+            values_array.append(current)
+        indexes.append(tuple(values_array) in to_delete_set)
+    return np.array(indexes)
+
+
+def get_row_indexes_to_delete(table, identifier, to_delete):
+    """
+    Iterates over the values of a particular column and returns a
+    numpy mask identifying the rows to delete. The column identifier
+    can be simple like "customer_id" or complex like "user.info.id"
+    """
+    indexes = []
+    to_delete_set = set(to_delete)
+    segments = identifier.split(".")
+    column_identifier = case_insensitive_getter(table.column_names, segments[0])
+    for obj in table.column(column_identifier).to_pylist():
+        current = obj
+        for i in range(1, len(segments)):
+            next_segment = case_insensitive_getter(list(current.keys()), segments[i])
+            current = current[next_segment]
+        indexes.append(current in to_delete_set)
+    return np.array(indexes)
+
+
+def find_column(tree, column_name):
+    """
+    Iterates over columns, including nested within structs, to find simple
+    or complex columns.
+    """
+    for node in tree:
+        if node.name.lower() == column_name.lower():
+            return node
+        flattened = node.flatten()
+        # When the end of the tree is reached, flatten() returns an array
+        # containing a self reference: self.flatten() => [self]
+        is_tail = flattened[0].name == node.name
+        if not is_tail:
+            found = find_column(flattened, column_name)
+            if found:
+                return found
+
+
+def is_column_type_decimal(schema, column_name):
+    column = find_column(schema, column_name)
+    if not column:
+        raise ValueError("Column {} not found.".format(column_name))
+    return isinstance(column.type, pa.Decimal128Type)
+
+
+def cast_column_values(column, schema):
+    """
+    Method to cast stringified MatchIds to their actual types
+    """
+    if column["Type"] == "Simple":
+        if is_column_type_decimal(schema, column["Column"]):
+            column["MatchIds"] = [Decimal(m) for m in column["MatchIds"]]
+    else:
+        for i in range(0, len(column["Columns"])):
+            if is_column_type_decimal(schema, column["Columns"][i]):
+                for composite_match in column["MatchIds"]:
+                    composite_match[i] = Decimal(composite_match[i])
+    return column
+
+
+def delete_from_table(table, to_delete):
+    """
+    Deletes rows from an Arrow Table where any of the MatchIds is found as
+    a value in any of the columns
+    """
+    initial_rows = table.num_rows
+    for column in to_delete:
+        column = cast_column_values(column, table.schema)
+        indexes = (
+            get_row_indexes_to_delete(table, column["Column"], column["MatchIds"])
+            if column["Type"] == "Simple"
+            else get_row_indexes_to_delete_for_composite(
+                table, column["Columns"], column["MatchIds"]
+            )
+        )
+        table = table.filter(~indexes)
+        if not table.num_rows:
+            break
+    deleted_rows = initial_rows - table.num_rows
+    return table, deleted_rows
+
+
+def delete_matches_from_parquet_file(input_file, to_delete):
+    """
+    Deletes matches from a Parquet file. to_delete is a list of dicts,
+    each containing a column to search and the MatchIds to search for in
+    that particular column
+    """
+    parquet_file = load_parquet(input_file)
+    schema = parquet_file.metadata.schema.to_arrow_schema().remove_metadata()
+    total_rows = parquet_file.metadata.num_rows
+    stats = Counter({"ProcessedRows": total_rows, "DeletedRows": 0})
+    with pa.BufferOutputStream() as out_stream:
+        with pq.ParquetWriter(out_stream, schema) as writer:
+            for row_group in range(parquet_file.num_row_groups):
+                logger.info(
+                    "Row group %s/%s",
+                    str(row_group + 1),
+                    str(parquet_file.num_row_groups),
+                )
+                table = parquet_file.read_row_group(row_group)
+                table, deleted_rows = delete_from_table(table, to_delete)
+                writer.write_table(table)
+                stats.update({"DeletedRows": deleted_rows})
+        return out_stream, stats
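
The row-matching idea behind `get_row_indexes_to_delete` (walk a dotted identifier with case-insensitive key lookup, then build a boolean mask) can be tried without PyArrow. The sketch below is an assumption-laden simplification: `rows_to_delete` is a hypothetical helper that works on plain dicts rather than an Arrow table, and the sample rows are made up.

```python
def case_insensitive_getter(names, value):
    """Return the first name that equals value, ignoring case."""
    return next(n for n in names if n.lower() == value.lower())


def rows_to_delete(rows, identifier, to_delete):
    """Return a list of booleans, True for each row whose value is in to_delete."""
    to_delete_set = set(to_delete)
    segments = identifier.split(".")
    mask = []
    for row in rows:
        current = row
        # Descend one level per segment, e.g. user -> info -> id
        for segment in segments:
            key = case_insensitive_getter(list(current.keys()), segment)
            current = current[key]
        mask.append(current in to_delete_set)
    return mask


# Hypothetical rows with mixed-case struct field names
rows = [
    {"User": {"Info": {"Id": 1}}},
    {"User": {"Info": {"Id": 2}}},
    {"User": {"Info": {"Id": 3}}},
]
mask = rows_to_delete(rows, "user.info.id", [2, 3])
kept = [row for row, drop in zip(rows, mask) if not drop]
```

In the diff the mask is a numpy array and the table is filtered with its negation (`table.filter(~indexes)`); the list comprehension above plays the same role for plain Python rows.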

+ 10 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/requirements.in

@@ -0,0 +1,10 @@
+pyarrow==8.0.0
+python-snappy==0.6.1
+pandas==1.4.3
+boto3==1.24.38
+s3transfer==0.6.0
+numpy==1.22.0
+cryptography==3.4.7
+urllib3>=1.26.5
+aws-assume-role-lib>=2.10.0
+tenacity==8.0.1

+ 55 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/requirements.txt

@@ -0,0 +1,55 @@
+#
+# This file is autogenerated by pip-compile with python 3.9
+# To update, run:
+#
+#    pip-compile --output-file=backend/ecs_tasks/delete_files/requirements.txt backend/ecs_tasks/delete_files/requirements.in
+#
+aws-assume-role-lib==2.10.0
+    # via -r backend/ecs_tasks/delete_files/requirements.in
+boto3==1.24.38
+    # via
+    #   -r backend/ecs_tasks/delete_files/requirements.in
+    #   aws-assume-role-lib
+botocore==1.27.38
+    # via
+    #   boto3
+    #   s3transfer
+cffi==1.15.1
+    # via cryptography
+cryptography==3.4.7
+    # via -r backend/ecs_tasks/delete_files/requirements.in
+jmespath==1.0.1
+    # via
+    #   boto3
+    #   botocore
+numpy==1.22.0
+    # via
+    #   -r backend/ecs_tasks/delete_files/requirements.in
+    #   pandas
+    #   pyarrow
+pandas==1.4.3
+    # via -r backend/ecs_tasks/delete_files/requirements.in
+pyarrow==8.0.0
+    # via -r backend/ecs_tasks/delete_files/requirements.in
+pycparser==2.21
+    # via cffi
+python-dateutil==2.8.2
+    # via
+    #   botocore
+    #   pandas
+python-snappy==0.6.1
+    # via -r backend/ecs_tasks/delete_files/requirements.in
+pytz==2022.1
+    # via pandas
+s3transfer==0.6.0
+    # via
+    #   -r backend/ecs_tasks/delete_files/requirements.in
+    #   boto3
+six==1.16.0
+    # via python-dateutil
+tenacity==8.0.1
+    # via -r backend/ecs_tasks/delete_files/requirements.in
+urllib3==1.26.11
+    # via
+    #   -r backend/ecs_tasks/delete_files/requirements.in
+    #   botocore

+ 365 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/s3.py

@@ -0,0 +1,365 @@
+import logging
+from functools import lru_cache
+from urllib.parse import urlencode, quote_plus
+from tenacity import (
+    retry,
+    retry_if_result,
+    wait_exponential,
+    stop_after_attempt,
+    after_log,
+)
+
+from boto_utils import fetch_job_manifest, paginate
+from botocore.exceptions import ClientError
+
+from utils import remove_none, retry_wrapper
+
+
+# BEGINNING OF s3transfer MONKEY PATCH
+# https://github.com/boto/s3transfer/issues/82#issuecomment-837971614
+
+import s3transfer.upload
+import s3transfer.tasks
+
+
+class PutObjectTask(s3transfer.tasks.Task):
+    # Copied from s3transfer/upload.py, changed to return the result of client.put_object.
+    def _main(self, client, fileobj, bucket, key, extra_args):
+        with fileobj as body:
+            return client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
+
+
+class CompleteMultipartUploadTask(s3transfer.tasks.Task):
+    # Copied from s3transfer/tasks.py, changed to return a result.
+    def _main(self, client, bucket, key, upload_id, parts, extra_args):
+        print(f"Multipart upload {upload_id} for {key}.")
+        return client.complete_multipart_upload(
+            Bucket=bucket,
+            Key=key,
+            UploadId=upload_id,
+            MultipartUpload={"Parts": parts},
+            **extra_args,
+        )
+
+
+s3transfer.upload.PutObjectTask = PutObjectTask
+s3transfer.upload.CompleteMultipartUploadTask = CompleteMultipartUploadTask
+
+# END OF s3transfer MONKEY PATCH
+
+
+logger = logging.getLogger(__name__)
+
+
+def save(client, buf, bucket, key, metadata, source_version=None):
+    """
+    Save a buffer to S3, preserving any existing properties on the object
+    """
+    # Get Object Settings
+    request_payer_args, _ = get_requester_payment(client, bucket)
+    object_info_args, _ = get_object_info(client, bucket, key, source_version)
+    tagging_args, _ = get_object_tags(client, bucket, key, source_version)
+    acl_args, acl_resp = get_object_acl(client, bucket, key, source_version)
+    extra_args = {
+        **request_payer_args,
+        **object_info_args,
+        **tagging_args,
+        **acl_args,
+        **{"Metadata": metadata},
+    }
+    logger.info("Object settings: %s", extra_args)
+    # Write Object Back to S3
+    logger.info("Saving updated object to s3://%s/%s", bucket, key)
+    resp = client.upload_fileobj(buf, bucket, key, ExtraArgs=extra_args)
+    new_version_id = resp["VersionId"]
+    logger.info("Object uploaded to S3")
+    # GrantWrite cannot be set whilst uploading therefore ACLs need to be restored separately
+    write_grantees = ",".join(get_grantees(acl_resp, "WRITE"))
+    if write_grantees:
+        logger.info("WRITE grant found. Restoring additional grantees for object")
+        client.put_object_acl(
+            Bucket=bucket,
+            Key=key,
+            VersionId=new_version_id,
+            **{
+                **request_payer_args,
+                **acl_args,
+                "GrantWrite": write_grantees,
+            },
+        )
+    logger.info("Processing of file s3://%s/%s complete", bucket, key)
+    return new_version_id
+
+
+@lru_cache()
+def get_requester_payment(client, bucket):
+    """
+    Generates a dict containing the request payer args supported when calling S3.
+    GetBucketRequestPayment call will be cached
+    :returns tuple containing the info formatted for ExtraArgs and the raw response
+    """
+    request_payer = client.get_bucket_request_payment(Bucket=bucket)
+    return (
+        remove_none(
+            {
+                "RequestPayer": "requester"
+                if request_payer["Payer"] == "Requester"
+                else None,
+            }
+        ),
+        request_payer,
+    )
+
+
+@lru_cache()
+def get_object_info(client, bucket, key, version_id=None):
+    """
+    Generates a dict containing the non-ACL/Tagging args supported when uploading to S3.
+    HeadObject call will be cached
+    :returns tuple containing the info formatted for ExtraArgs and the raw response
+    """
+    kwargs = {"Bucket": bucket, "Key": key, **get_requester_payment(client, bucket)[0]}
+    if version_id:
+        kwargs["VersionId"] = version_id
+    object_info = client.head_object(**kwargs)
+    return (
+        remove_none(
+            {
+                "CacheControl": object_info.get("CacheControl"),
+                "ContentDisposition": object_info.get("ContentDisposition"),
+                "ContentEncoding": object_info.get("ContentEncoding"),
+                "ContentLanguage": object_info.get("ContentLanguage"),
+                "ContentType": object_info.get("ContentType"),
+                "Expires": object_info.get("Expires"),
+                "Metadata": object_info.get("Metadata"),
+                "ServerSideEncryption": object_info.get("ServerSideEncryption"),
+                "StorageClass": object_info.get("StorageClass"),
+                "SSECustomerAlgorithm": object_info.get("SSECustomerAlgorithm"),
+                "SSEKMSKeyId": object_info.get("SSEKMSKeyId"),
+                "WebsiteRedirectLocation": object_info.get("WebsiteRedirectLocation"),
+            }
+        ),
+        object_info,
+    )
+
+
+@lru_cache()
+def get_object_tags(client, bucket, key, version_id=None):
+    """
+    Generates a dict containing the Tagging args supported when uploading to S3
+    GetObjectTagging call will be cached
+    :returns tuple containing tagging formatted for ExtraArgs and the raw response
+    """
+    kwargs = {"Bucket": bucket, "Key": key}
+    if version_id:
+        kwargs["VersionId"] = version_id
+    tagging = client.get_object_tagging(**kwargs)
+    return (
+        remove_none(
+            {
+                "Tagging": urlencode(
+                    {tag["Key"]: tag["Value"] for tag in tagging["TagSet"]},
+                    quote_via=quote_plus,
+                )
+            }
+        ),
+        tagging,
+    )
+
+
+@lru_cache()
+def get_object_acl(client, bucket, key, version_id=None):
+    """
+    Generates a dict containing the ACL args supported when uploading to S3
+    GetObjectAcl call will be cached
+    :returns tuple containing ACL formatted for ExtraArgs and the raw response
+    """
+    kwargs = {"Bucket": bucket, "Key": key, **get_requester_payment(client, bucket)[0]}
+    if version_id:
+        kwargs["VersionId"] = version_id
+    acl = client.get_object_acl(**kwargs)
+    existing_owner = {"id={}".format(acl["Owner"]["ID"])}
+    return (
+        remove_none(
+            {
+                "GrantFullControl": ",".join(
+                    existing_owner | get_grantees(acl, "FULL_CONTROL")
+                ),
+                "GrantRead": ",".join(get_grantees(acl, "READ")),
+                "GrantReadACP": ",".join(get_grantees(acl, "READ_ACP")),
+                "GrantWriteACP": ",".join(get_grantees(acl, "WRITE_ACP")),
+            }
+        ),
+        acl,
+    )
+
+
+def get_grantees(acl, grant_type):
+    prop_map = {
+        "CanonicalUser": ("ID", "id"),
+        "AmazonCustomerByEmail": ("EmailAddress", "emailAddress"),
+        "Group": ("URI", "uri"),
+    }
+    filtered = [
+        grantee["Grantee"]
+        for grantee in acl.get("Grants")
+        if grantee["Permission"] == grant_type
+    ]
+    grantees = set()
+    for grantee in filtered:
+        identifier_type = grantee["Type"]
+        identifier_prop = prop_map[identifier_type]
+        grantees.add("{}={}".format(identifier_prop[1], grantee[identifier_prop[0]]))
+
+    return grantees
+
+
+@lru_cache()
+def validate_bucket_versioning(client, bucket):
+    resp = client.get_bucket_versioning(Bucket=bucket)
+    versioning_enabled = resp.get("Status") == "Enabled"
+    mfa_delete_enabled = resp.get("MFADelete") == "Enabled"
+
+    if not versioning_enabled:
+        raise ValueError("Bucket {} does not have versioning enabled".format(bucket))
+
+    if mfa_delete_enabled:
+        raise ValueError("Bucket {} has MFA Delete enabled".format(bucket))
+
+    return True
+
+
+@lru_cache()
+def fetch_manifest(manifest_object):
+    return fetch_job_manifest(manifest_object)
+
+
+def delete_old_versions(client, input_bucket, input_key, new_version):
+    try:
+        resp = list(
+            paginate(
+                client,
+                client.list_object_versions,
+                ["Versions", "DeleteMarkers"],
+                Bucket=input_bucket,
+                Prefix=input_key,
+                VersionIdMarker=new_version,
+                KeyMarker=input_key,
+            )
+        )
+        versions = [el[0] for el in resp if el[0] is not None]
+        delete_markers = [el[1] for el in resp if el[1] is not None]
+        versions.extend(delete_markers)
+        sorted_versions = sorted(versions, key=lambda x: x["LastModified"])
+        version_ids = [v["VersionId"] for v in sorted_versions]
+        errors = []
+        max_deletions = 1000
+        for i in range(0, len(version_ids), max_deletions):
+            objects = [
+                {"Key": input_key, "VersionId": version_id}
+                for version_id in version_ids[i : i + max_deletions]
+            ]
+            resp = delete_s3_objects(client, input_bucket, objects)
+            errors.extend(resp.get("Errors", []))
+        if len(errors) > 0:
+            raise DeleteOldVersionsError(
+                errors=[
+                    "Delete object {} version {} failed: {}".format(
+                        e["Key"], e["VersionId"], e["Message"]
+                    )
+                    for e in errors
+                ]
+            )
+    except ClientError as e:
+        raise DeleteOldVersionsError(errors=[str(e)])
+
+
+@retry(
+    wait=wait_exponential(multiplier=1, min=1, max=10),
+    stop=stop_after_attempt(10),
+    retry=(retry_if_result(lambda r: len(r.get("Errors", [])) > 0)),
+    retry_error_callback=lambda r: r.outcome.result(),
+    after=after_log(logger, logging.DEBUG),
+)
+def delete_s3_objects(client, bucket, objects):
+    return client.delete_objects(
+        Bucket=bucket,
+        Delete={
+            "Objects": objects,
+            "Quiet": True,
+        },
+    )
+
+
+def verify_object_versions_integrity(
+    client, bucket, key, from_version_id, to_version_id
+):
+    def raise_exception(msg):
+        raise IntegrityCheckFailedError(msg, client, bucket, key, to_version_id)
+
+    conflict_error_template = "A {} ({}) was detected for the given object between read and write operations ({} and {})."
+    not_found_error_template = "Previous version ({}) has been deleted."
+
+    object_versions = retry_wrapper(client.list_object_versions)(
+        Bucket=bucket,
+        Prefix=key,
+        VersionIdMarker=to_version_id,
+        KeyMarker=key,
+        MaxKeys=1,
+    )
+
+    versions = object_versions.get("Versions", [])
+    delete_markers = object_versions.get("DeleteMarkers", [])
+    all_versions = versions + delete_markers
+
+    if not len(all_versions):
+        return raise_exception(not_found_error_template.format(from_version_id))
+
+    prev_version = all_versions[0]
+    prev_version_id = prev_version["VersionId"]
+
+    if prev_version_id != from_version_id:
+        conflicting_version_type = (
+            "delete marker" if "ETag" not in prev_version else "version"
+        )
+        return raise_exception(
+            conflict_error_template.format(
+                conflicting_version_type,
+                prev_version_id,
+                from_version_id,
+                to_version_id,
+            )
+        )
+
+    return True
+
+
+def rollback_object_version(client, bucket, key, version, on_error):
+    """Delete newly created object version as soon as integrity conflict is detected"""
+    try:
+        return client.delete_object(Bucket=bucket, Key=key, VersionId=version)
+    except ClientError as e:
+        err_message = "ClientError: {}. Version rollback caused by version integrity conflict failed".format(
+            str(e)
+        )
+        on_error(err_message)
+    except Exception as e:
+        err_message = "Unknown error: {}. Version rollback caused by version integrity conflict failed".format(
+            str(e)
+        )
+        on_error(err_message)
+
+
+class DeleteOldVersionsError(Exception):
+    def __init__(self, errors):
+        super().__init__("\n".join(errors))
+        self.errors = errors
+
+
+class IntegrityCheckFailedError(Exception):
+    def __init__(self, message, client, bucket, key, version_id):
+        self.message = message
+        self.client = client
+        self.bucket = bucket
+        self.key = key
+        self.version_id = version_id
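
To see how `get_grantees` in s3.py turns an ACL response into the `Grant*` header strings, the following standalone sketch may help. `format_grantees` and `sample_acl` are hypothetical; the dict only mimics the shape of a GetObjectAcl response as the diff reads it (`Grants`, `Grantee`, `Permission`).

```python
PROP_MAP = {
    "CanonicalUser": ("ID", "id"),
    "AmazonCustomerByEmail": ("EmailAddress", "emailAddress"),
    "Group": ("URI", "uri"),
}


def format_grantees(acl, permission):
    """Collect grantees holding the given permission as 'prefix=value' strings."""
    grantees = set()
    for grant in acl.get("Grants", []):
        if grant["Permission"] != permission:
            continue
        grantee = grant["Grantee"]
        # Each grantee type stores its identifier under a different key
        prop, prefix = PROP_MAP[grantee["Type"]]
        grantees.add("{}={}".format(prefix, grantee[prop]))
    return grantees


# Hypothetical ACL response for illustration only
sample_acl = {
    "Owner": {"ID": "owner-id"},
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "abc"}, "Permission": "READ"},
        {
            "Grantee": {
                "Type": "Group",
                "URI": "http://acs.amazonaws.com/groups/global/AllUsers",
            },
            "Permission": "READ",
        },
        {"Grantee": {"Type": "CanonicalUser", "ID": "abc"}, "Permission": "WRITE"},
    ],
}
# Sorting is only for a deterministic result here; the diff joins the set directly
grant_read = ",".join(sorted(format_grantees(sample_acl, "READ")))
grant_write = ",".join(sorted(format_grantees(sample_acl, "WRITE")))
```

A non-empty WRITE string is what triggers the separate `put_object_acl` call in `save`, since the diff notes that `GrantWrite` cannot be set during upload.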

+ 30 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/ecs_tasks/delete_files/utils.py

@@ -0,0 +1,30 @@
+import time
+from botocore.exceptions import ClientError
+
+
+def remove_none(d: dict):
+    return {k: v for k, v in d.items() if v is not None and v != ""}
+
+
+def retry_wrapper(fn, retry_wait_seconds=2, retry_factor=2, max_retries=5):
+    """Exponential back-off retry wrapper for ClientError exceptions"""
+
+    def wrapper(*args, **kwargs):
+        retry_current = 0
+        # Keep the delay local so the back-off restarts on every call
+        # instead of growing across calls through the enclosing scope
+        wait_seconds = retry_wait_seconds
+
+        while True:
+            try:
+                return fn(*args, **kwargs)
+            except ClientError:
+                # Only ClientError is retried; other exceptions propagate
+                if retry_current == max_retries:
+                    # Retries exhausted: surface the most recent error
+                    raise
+                retry_current += 1
+                time.sleep(wait_seconds)
+                wait_seconds *= retry_factor
+
+    return wrapper
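
The exponential back-off pattern used by `retry_wrapper` in utils.py can be exercised without AWS. This is a sketch under stated assumptions: `TransientError` is a hypothetical stand-in for botocore's `ClientError`, and the injectable `sleep` parameter is added here purely so the delays can be observed; neither is part of the diff.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


def retry_with_backoff(fn, wait_seconds=2, factor=2, max_retries=5, sleep=time.sleep):
    """Retry fn on TransientError, multiplying the delay by factor each time."""

    def wrapper(*args, **kwargs):
        wait = wait_seconds  # local copy, so every call starts from the base delay
        for attempt in range(max_retries + 1):
            try:
                return fn(*args, **kwargs)
            except TransientError:
                if attempt == max_retries:
                    raise  # retries exhausted: re-raise the latest error
                sleep(wait)
                wait *= factor

    return wrapper


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("try again")
    return "ok"


waits = []  # records the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)()
```

Recording the delays in a list keeps the example instantaneous while still showing the doubling sequence.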

+ 3 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/aws_sdk/requirements.in

@@ -0,0 +1,3 @@
+boto3==1.24.38
+urllib3>=1.26.5
+aws-assume-role-lib>=2.10.0

+ 30 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/aws_sdk/requirements.txt

@@ -0,0 +1,30 @@
+#
+# This file is autogenerated by pip-compile with python 3.9
+# To update, run:
+#
+#    pip-compile --output-file=backend/lambda_layers/aws_sdk/requirements.txt backend/lambda_layers/aws_sdk/requirements.in
+#
+aws-assume-role-lib==2.10.0
+    # via -r backend/lambda_layers/aws_sdk/requirements.in
+boto3==1.24.38
+    # via
+    #   -r backend/lambda_layers/aws_sdk/requirements.in
+    #   aws-assume-role-lib
+botocore==1.27.38
+    # via
+    #   boto3
+    #   s3transfer
+jmespath==1.0.1
+    # via
+    #   boto3
+    #   botocore
+python-dateutil==2.8.2
+    # via botocore
+s3transfer==0.6.0
+    # via boto3
+six==1.16.0
+    # via python-dateutil
+urllib3==1.26.11
+    # via
+    #   -r backend/lambda_layers/aws_sdk/requirements.in
+    #   botocore

+ 267 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/boto_utils/python/boto_utils.py

@@ -0,0 +1,267 @@
+from datetime import datetime, timezone, timedelta
+import decimal
+import logging
+import json
+import os
+import uuid
+from functools import lru_cache, reduce
+
+import boto3
+from boto3.dynamodb.conditions import Key
+from boto3.dynamodb.types import TypeDeserializer
+from botocore.exceptions import ClientError
+from aws_assume_role_lib import assume_role
+
+deserializer = TypeDeserializer()
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+batch_size = 10  # SQS Max Batch Size
+
+s3 = boto3.resource("s3")
+ssm = boto3.client("ssm")
+ddb = boto3.resource("dynamodb")
+table = ddb.Table(os.getenv("JobTable", "S3F2_Jobs"))
+index = os.getenv("JobTableDateGSI", "Date-GSI")
+bucket_count = int(os.getenv("GSIBucketCount", 1))
+
+
+def paginate(client, method, iter_keys, **kwargs):
+    """
+    Auto paginates Boto3 client requests
+    :param client: client to use for the request
+    :param method: method on the client to call
+    :param iter_keys: keys in the response dict to return. If a single iter_key is supplied,
+    each call to next() on the returned iterator yields the next available value. If multiple
+    iter_keys are supplied, each call yields a tuple with one element per iter key.
+    Use dot notation for nested keys
+    :param kwargs: kwargs to pass to method call
+    :return: generator
+    Example:
+        paginate(s3, s3.list_object_versions, ["Versions"], Bucket="...", Prefix="...")
+        paginate(s3, s3.list_object_versions, ["Versions", "DeleteMarkers"], Bucket="...", Prefix="...")
+        paginate(athena, athena.get_query_results, ["ResultSet.Rows", "ResultSet.ResultSetMetadata.ColumnInfo"],
+                 QueryExecutionId="...", MaxResults=10)
+    """
+    paginator = client.get_paginator(method.__name__)
+    page_iterator = paginator.paginate(**kwargs)
+    if isinstance(iter_keys, str):
+        iter_keys = [iter_keys]
+    for page in page_iterator:
+        # Support dot notation nested keys
+        results = [
+            reduce(lambda d, x: d.get(x, []), k.split("."), page) for k in iter_keys
+        ]
+        longest = len(max(results, key=len))
+        for i in range(0, longest):
+            # If only one iter key supplied, return the next result for that key
+            if len(iter_keys) == 1:
+                yield results[0][i]
+            # If multiple iter keys supplied, return a tuple of the next result for each iter key,
+            # defaulting an element in the tuple to None if the corresponding iter key has no more results
+            else:
+                yield tuple(
+                    [
+                        iter_page[i] if len(iter_page) > i else None
+                        for iter_page in results
+                    ]
+                )
+
+
+def read_queue(queue, number_to_read=10):
+    msgs = []
+    while len(msgs) < number_to_read:
+        received = queue.receive_messages(
+            MaxNumberOfMessages=min((number_to_read - len(msgs)), batch_size),
+            AttributeNames=["All"],
+        )
+        if len(received) == 0:
+            break  # no messages left
+        remaining = number_to_read - len(msgs)
+        i = min(
+            remaining, len(received)
+        )  # take as many as allowed from the received messages
+        msgs = msgs + received[:i]
+    return msgs
+
+
+def batch_sqs_msgs(queue, messages, **kwargs):
+    chunks = [messages[x : x + batch_size] for x in range(0, len(messages), batch_size)]
+    for chunk in chunks:
+        entries = [
+            {
+                "Id": str(uuid.uuid4()),
+                "MessageBody": json.dumps(m),
+                **(
+                    {"MessageGroupId": str(uuid.uuid4())}
+                    if queue.attributes.get("FifoQueue", False)
+                    else {}
+                ),
+                **kwargs,
+            }
+            for m in chunk
+        ]
+        queue.send_messages(Entries=entries)
+
+
+def emit_event(job_id, event_name, event_data, emitter_id=None, created_at=None):
+    if not emitter_id:
+        emitter_id = str(uuid.uuid4())
+    if not created_at:
+        created_at = datetime.now(timezone.utc).timestamp()
+    item = {
+        "Id": job_id,
+        "Sk": "{}#{}".format(round(created_at * 1000), str(uuid.uuid4())),
+        "Type": "JobEvent",
+        "EventName": event_name,
+        "EventData": normalise_dates(event_data),
+        "EmitterId": emitter_id,
+        "CreatedAt": normalise_dates(round(created_at)),
+    }
+    expiry = get_job_expiry(job_id)
+    if expiry:
+        item["Expires"] = expiry
+    table.put_item(Item=item)
+
+
+@lru_cache()
+def get_job_expiry(job_id):
+    return table.get_item(Key={"Id": job_id, "Sk": job_id})["Item"].get("Expires", None)
+
+
+def running_job_exists():
+    jobs = []
+    for gsi_bucket in range(0, bucket_count):
+        response = table.query(
+            IndexName=index,
+            KeyConditionExpression=Key("GSIBucket").eq(str(gsi_bucket)),
+            ScanIndexForward=False,
+            FilterExpression="(#s = :r) or (#s = :q) or (#s = :c)",
+            ExpressionAttributeNames={"#s": "JobStatus"},
+            ExpressionAttributeValues={
+                ":r": "RUNNING",
+                ":q": "QUEUED",
+                ":c": "FORGET_COMPLETED_CLEANUP_IN_PROGRESS",
+            },
+            Limit=1,
+        )
+        jobs += response.get("Items", [])
+
+    return len(jobs) > 0
+
+
+def get_config():
+    try:
+        param_name = os.getenv("ConfigParam", "S3F2-Configuration")
+        return json.loads(
+            ssm.get_parameter(Name=param_name, WithDecryption=True)["Parameter"][
+                "Value"
+            ]
+        )
+    except (KeyError, ValueError) as e:
+        logger.error("Invalid configuration supplied: %s", str(e))
+        raise e
+    except ClientError as e:
+        logger.error("Unable to retrieve config: %s", str(e))
+        raise e
+    except Exception as e:
+        logger.error("Unknown error retrieving config: %s", str(e))
+        raise e
+
+
+class DecimalEncoder(json.JSONEncoder):
+    def default(self, o):
+        if isinstance(o, decimal.Decimal):
+            return round(o)
+        return super(DecimalEncoder, self).default(o)
+
+
+def utc_timestamp(**delta_kwargs):
+    return round((datetime.now(timezone.utc) + timedelta(**delta_kwargs)).timestamp())
+
+
+def convert_iso8601_to_epoch(iso_time: str):
+    normalised = iso_time.strip().replace(" ", "T")
+    with_ms = "." in normalised
+    regex = "%Y-%m-%dT%H:%M:%S.%f%z" if with_ms else "%Y-%m-%dT%H:%M:%S%z"
+    parsed = datetime.strptime(normalised, regex)
+    unix_timestamp = round(parsed.timestamp())
+    return unix_timestamp
+
+
+def normalise_dates(data):
+    if isinstance(data, str):
+        try:
+            return convert_iso8601_to_epoch(data)
+        except ValueError:
+            return data
+    elif isinstance(data, list):
+        return [normalise_dates(i) for i in data]
+    elif isinstance(data, dict):
+        return {k: normalise_dates(v) for k, v in data.items()}
+    return data
+
+
+def deserialize_item(item):
+    return {k: deserializer.deserialize(v) for k, v in item.items()}
+
+
+def parse_s3_url(s3_url):
+    if not (isinstance(s3_url, str) and s3_url.startswith("s3://")):
+        raise ValueError("Invalid S3 URL")
+    return s3_url[len("s3://") :].split("/", 1)
+
+
+def get_user_info(event):
+    req = event.get("requestContext", {})
+    # If Cognito Auth is being used
+    if "authorizer" in req:
+        auth = req.get("authorizer", {})
+        claims = auth.get("claims", {})
+        return {
+            "Username": claims.get("cognito:username", "N/A"),
+            "Sub": claims.get("sub", "N/A"),
+        }
+    # If Cognito IAM is being used
+    elif "identity" in req:
+        iden = req.get("identity", {})
+        return {
+            "Username": iden.get("userArn", "N/A"),
+            "Sub": iden.get("user", "N/A"),
+        }
+    # Default behaviour of method expected to return N/A in both fields
+    else:
+        return {
+            "Username": "N/A",
+            "Sub": "N/A",
+        }
+
+
+def get_session(assume_role_arn=None, role_session_name="s3f2"):
+    session = boto3.session.Session()
+    if assume_role_arn:
+        return assume_role(session, assume_role_arn, RoleSessionName=role_session_name)
+    return session
+
+
+def fetch_job_manifest(path):
+    bucket, obj = parse_s3_url(path)
+    return s3.Object(bucket, obj).get().get("Body").read().decode("utf-8")
+
+
+def json_lines_iterator(content, include_unparsed=False):
+    lines = content.split("\n")
+    if lines[-1] == "":
+        lines.pop()
+    for i, line in enumerate(lines):
+        try:
+            parsed = json.loads(line)
+        except (json.JSONDecodeError) as e:
+            raise ValueError(
+                "Serialization error when parsing JSON lines: {}".format(
+                    str(e).replace("line 1", "line {}".format(i + 1)),
+                )
+            )
+        if include_unparsed:
+            yield parsed, line
+        else:
+            yield parsed
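The per-page logic of `paginate` in `boto_utils.py` (dot-notation lookup, then padding shorter result lists with `None` when several keys are requested) can be sketched without any AWS client. This is an illustrative rewrite that assumes `itertools.zip_longest` is an acceptable stand-in for the index loop; `iter_page` is a hypothetical name, not part of the module.

```python
from functools import reduce
from itertools import zip_longest


def iter_page(page, iter_keys):
    """Hypothetical helper: yield the values of one response page for the given keys."""
    if isinstance(iter_keys, str):
        iter_keys = [iter_keys]
    # Resolve dot-notation keys such as "ResultSet.Rows" against the page dict
    results = [
        reduce(lambda d, x: d.get(x, []), k.split("."), page) for k in iter_keys
    ]
    if len(iter_keys) == 1:
        # Single key: yield each value on its own
        yield from results[0]
    else:
        # Multiple keys: yield tuples, filling exhausted lists with None
        yield from zip_longest(*results)
```

For example, a page whose `Rows` list is longer than its `ColumnInfo` list yields tuples whose second element becomes `None` once the shorter list is exhausted.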

+ 1 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/cr_helper/requirements.in

@@ -0,0 +1 @@
+crhelper==2.0.10

+ 8 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/cr_helper/requirements.txt

@@ -0,0 +1,8 @@
+#
+# This file is autogenerated by pip-compile with python 3.9
+# To update, run:
+#
+#    pip-compile ./backend/lambda_layers/cr_helper/requirements.in
+#
+crhelper==2.0.10
+    # via -r ./backend/lambda_layers/cr_helper/requirements.in

+ 1 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/decorators/requirements.in

@@ -0,0 +1 @@
+jsonschema==3.2.0

+ 17 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambda_layers/decorators/requirements.txt

@@ -0,0 +1,17 @@
+#
+# This file is autogenerated by pip-compile with python 3.9
+# To update, run:
+#
+#    pip-compile ./backend/lambda_layers/decorators/requirements.in
+#
+attrs==21.4.0
+    # via jsonschema
+jsonschema==3.2.0
+    # via -r ./backend/lambda_layers/decorators/requirements.in
+pyrsistent==0.18.1
+    # via jsonschema
+six==1.16.0
+    # via jsonschema
+
+# The following packages are considered to be unsafe in a requirements file:
+# setuptools

+ 44 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/cleanup_bucket.py

@@ -0,0 +1,44 @@
+import boto3
+from crhelper import CfnResource
+from decorators import with_logging
+
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+s3 = boto3.resource("s3")
+
+
+@with_logging
+def empty_bucket(bucket_name):
+    bucket = s3.Bucket(bucket_name)
+    bucket.objects.all().delete()
+    bucket.object_versions.all().delete()
+
+
+@with_logging
+@helper.create
+def create(event, context):
+    return None
+
+
+@with_logging
+@helper.update
+def update(event, context):
+    props = event["ResourceProperties"]
+    props_old = event["OldResourceProperties"]
+    web_ui_deployed = props_old.get("DeployWebUI", "true")
+    if web_ui_deployed == "true" and props["DeployWebUI"] == "false":
+        empty_bucket(props["Bucket"])
+    return None
+
+
+@with_logging
+@helper.delete
+def delete(event, context):
+    props = event["ResourceProperties"]
+    empty_bucket(props["Bucket"])
+    return None
+
+
+def handler(event, context):
+    helper(event, context)

+ 37 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/cleanup_repository.py

@@ -0,0 +1,37 @@
+import boto3
+from crhelper import CfnResource
+from boto_utils import paginate
+from decorators import with_logging
+
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+ecr_client = boto3.client("ecr")
+
+
+@with_logging
+@helper.create
+@helper.update
+def create(event, context):
+    return None
+
+
+@with_logging
+@helper.delete
+def delete(event, context):
+    props = event["ResourceProperties"]
+    repository = props["Repository"]
+    images = list(
+        paginate(
+            ecr_client, ecr_client.list_images, ["imageIds"], repositoryName=repository
+        )
+    )
+
+    if images:
+        ecr_client.batch_delete_image(imageIds=images, repositoryName=repository)
+
+    return None
+
+
+def handler(event, context):
+    helper(event, context)
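The `delete` handler in `cleanup_repository.py` passes every listed image to a single `batch_delete_image` call. The ECR API reference documents a maximum of 100 image identifiers per request, so a repository holding more images may need batching. The sketch below shows one way to do that; `chunked` and `delete_images_in_batches` are hypothetical helpers, and the client is passed in so the logic can be exercised without AWS access.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def delete_images_in_batches(ecr_client, repository, images, batch_limit=100):
    """Hypothetical helper: call batch_delete_image once per chunk of image IDs."""
    for batch in chunked(images, batch_limit):
        ecr_client.batch_delete_image(imageIds=batch, repositoryName=repository)
```

With an empty `images` list the loop body never runs, which matches the existing `if images:` guard.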

+ 41 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/copy_build_artefact.py

@@ -0,0 +1,41 @@
+import boto3
+from crhelper import CfnResource
+from decorators import with_logging
+
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+s3_client = boto3.client("s3")
+
+
+@with_logging
+@helper.create
+@helper.update
+def create(event, context):
+    props = event.get("ResourceProperties", None)
+    version = props.get("Version")
+    destination_artefact = props.get("ArtefactName")
+    destination_bucket = props.get("CodeBuildArtefactBucket")
+    destination_bucket_arn = props.get(
+        "CodeBuildArtefactBucketArn", "arn:aws:s3:::{}".format(destination_bucket)
+    )
+    source_bucket = props.get("PreBuiltArtefactsBucket")
+    source_artefact = "{}/amazon-s3-find-and-forget/{}/build.zip".format(
+        source_bucket, version
+    )
+
+    s3_client.copy_object(
+        Bucket=destination_bucket, CopySource=source_artefact, Key=destination_artefact
+    )
+
+    return "{}/{}".format(destination_bucket_arn, destination_artefact)
+
+
+@with_logging
+@helper.delete
+def delete(event, context):
+    return None
+
+
+def handler(event, context):
+    helper(event, context)

+ 45 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/get_vpce_subnets.py

@@ -0,0 +1,45 @@
+#############################################################
+# This Custom Resource is required since VPC Endpoint names #
+# and subnets are not consistent in the China region        #
+#############################################################
+
+import boto3
+from crhelper import CfnResource
+from decorators import with_logging
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+ec2_client = boto3.client("ec2")
+
+
+@with_logging
+@helper.create
+@helper.update
+def create(event, context):
+    props = event.get("ResourceProperties", None)
+    service_name = props.get("ServiceName")
+    subnet_ids = props.get("SubnetIds")
+    vpc_endpoint_type = props.get("VpcEndpointType")
+    describe_subnets = ec2_client.describe_subnets(SubnetIds=subnet_ids)
+    subnet_dict = {
+        s["AvailabilityZone"]: s["SubnetId"] for s in describe_subnets["Subnets"]
+    }
+    endpoint_service = ec2_client.describe_vpc_endpoint_services(
+        Filters=[
+            {"Name": "service-name", "Values": [f"cn.{service_name}", service_name]},
+            {"Name": "service-type", "Values": [vpc_endpoint_type]},
+        ]
+    )
+    service_details = endpoint_service["ServiceDetails"][0]
+    helper.Data["ServiceName"] = service_details["ServiceName"]
+    return ",".join([subnet_dict[s] for s in service_details["AvailabilityZones"]])
+
+
+@with_logging
+@helper.delete
+def delete(event, context):
+    return None
+
+
+def handler(event, context):
+    helper(event, context)

+ 31 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/redeploy_apigw.py

@@ -0,0 +1,31 @@
+import boto3
+from crhelper import CfnResource
+from decorators import with_logging
+
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+api_client = boto3.client("apigateway")
+
+
+@with_logging
+@helper.create
+@helper.delete
+def create(event, context):
+    return None
+
+
+@with_logging
+@helper.update
+def update(event, context):
+    props = event["ResourceProperties"]
+    props_old = event["OldResourceProperties"]
+    if props_old["DeployCognito"] != props["DeployCognito"]:
+        api_client.create_deployment(
+            restApiId=props["ApiId"], stageName=props["ApiStage"]
+        )
+    return None
+
+
+def handler(event, context):
+    helper(event, context)

+ 29 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/rerun_pipeline.py

@@ -0,0 +1,29 @@
+import boto3
+from crhelper import CfnResource
+from decorators import with_logging
+
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+pipe_client = boto3.client("codepipeline")
+
+
+@with_logging
+@helper.create
+@helper.delete
+def create(event, context):
+    return None
+
+
+@with_logging
+@helper.update
+def update(event, context):
+    props = event["ResourceProperties"]
+    props_old = event["OldResourceProperties"]
+    if props_old["DeployWebUI"] == "false" and props["DeployWebUI"] == "true":
+        pipe_client.start_pipeline_execution(name=props["PipelineName"])
+    return None
+
+
+def handler(event, context):
+    helper(event, context)

+ 47 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/custom_resources/wait_container_build.py

@@ -0,0 +1,47 @@
+import boto3
+from crhelper import CfnResource
+from boto_utils import convert_iso8601_to_epoch
+from decorators import with_logging
+
+
+helper = CfnResource(json_logging=False, log_level="DEBUG", boto_level="CRITICAL")
+
+ecr_client = boto3.client("ecr")
+s3_client = boto3.resource("s3")
+
+
+@with_logging
+@helper.create
+@helper.update
+@helper.delete
+def create(event, context):
+    return None
+
+
+@with_logging
+@helper.poll_create
+@helper.poll_update
+def poll(event, context):
+    props = event.get("ResourceProperties", None)
+    bucket = props.get("CodeBuildArtefactBucket")
+    key = props.get("ArtefactName")
+    repository = props.get("ECRRepository")
+    obj = s3_client.Object(bucket, key)
+    last_modified = convert_iso8601_to_epoch(str(obj.last_modified))
+    image_pushed_at = get_latest_image_push(repository)
+    return image_pushed_at and last_modified < image_pushed_at
+
+
+def handler(event, context):
+    helper(event, context)
+
+
+def get_latest_image_push(repository):
+    try:
+        images = ecr_client.describe_images(
+            repositoryName=repository, imageIds=[{"imageTag": "latest"}]
+        )
+
+        return convert_iso8601_to_epoch(str(images["imageDetails"][0]["imagePushedAt"]))
+    except ecr_client.exceptions.ImageNotFoundException:
+        return None

+ 180 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/handlers.py

@@ -0,0 +1,180 @@
+"""
+DataMapper handlers
+"""
+import json
+import os
+
+import boto3
+
+from boto_utils import DecimalEncoder, get_user_info, running_job_exists
+from decorators import (
+    with_logging,
+    request_validator,
+    catch_errors,
+    add_cors_headers,
+    json_body_loader,
+    load_schema,
+)
+
+dynamodb_resource = boto3.resource("dynamodb")
+table = dynamodb_resource.Table(os.getenv("DataMapperTable"))
+glue_client = boto3.client("glue")
+
+PARQUET_HIVE_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
+JSON_HIVE_SERDE = "org.apache.hive.hcatalog.data.JsonSerDe"
+JSON_OPENX_SERDE = "org.openx.data.jsonserde.JsonSerDe"
+SUPPORTED_SERDE_LIBS = [PARQUET_HIVE_SERDE, JSON_HIVE_SERDE, JSON_OPENX_SERDE]
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("get_data_mapper"))
+@catch_errors
+def get_data_mapper_handler(event, context):
+    data_mapper_id = event["pathParameters"]["data_mapper_id"]
+    item = table.get_item(Key={"DataMapperId": data_mapper_id}).get("Item")
+    if not item:
+        return {"statusCode": 404}
+
+    return {"statusCode": 200, "body": json.dumps(item, cls=DecimalEncoder)}
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("list_data_mappers"))
+@catch_errors
+def get_data_mappers_handler(event, context):
+    qs = event.get("queryStringParameters")
+    if not qs:
+        qs = {}
+    page_size = int(qs.get("page_size", 10))
+    scan_params = {"Limit": page_size}
+    start_at = qs.get("start_at")
+    if start_at:
+        scan_params["ExclusiveStartKey"] = {"DataMapperId": start_at}
+    items = table.scan(**scan_params).get("Items", [])
+    if len(items) < page_size:
+        next_start = None
+    else:
+        next_start = items[-1]["DataMapperId"]
+    return {
+        "statusCode": 200,
+        "body": json.dumps(
+            {"DataMappers": items, "NextStart": next_start}, cls=DecimalEncoder
+        ),
+    }
+
+
+@with_logging
+@add_cors_headers
+@json_body_loader
+@request_validator(load_schema("create_data_mapper"))
+@catch_errors
+def put_data_mapper_handler(event, context):
+    path_params = event["pathParameters"]
+    body = event["body"]
+    validate_mapper(body)
+    item = {
+        "DataMapperId": path_params["data_mapper_id"],
+        "Columns": body["Columns"],
+        "QueryExecutor": body["QueryExecutor"],
+        "QueryExecutorParameters": body["QueryExecutorParameters"],
+        "CreatedBy": get_user_info(event),
+        "RoleArn": body["RoleArn"],
+        "Format": body.get("Format", "parquet"),
+        "DeleteOldVersions": body.get("DeleteOldVersions", True),
+        "IgnoreObjectNotFoundExceptions": body.get(
+            "IgnoreObjectNotFoundExceptions", False
+        ),
+    }
+    table.put_item(Item=item)
+
+    return {"statusCode": 201, "body": json.dumps(item)}
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("delete_data_mapper"))
+@catch_errors
+def delete_data_mapper_handler(event, context):
+    if running_job_exists():
+        raise ValueError("Cannot delete Data Mappers whilst there is a job in progress")
+    data_mapper_id = event["pathParameters"]["data_mapper_id"]
+    table.delete_item(Key={"DataMapperId": data_mapper_id})
+
+    return {"statusCode": 204}
+
+
+def validate_mapper(mapper):
+    existing_s3_locations = get_existing_s3_locations(mapper["DataMapperId"])
+    if mapper["QueryExecutorParameters"].get("DataCatalogProvider") == "glue":
+        table_details = get_table_details_from_mapper(mapper)
+        new_location = get_glue_table_location(table_details)
+        serde_lib, serde_params = get_glue_table_format(table_details)
+        for partition in mapper["QueryExecutorParameters"].get("PartitionKeys", []):
+            if partition not in get_glue_table_partition_keys(table_details):
+                raise ValueError("Partition Key {} doesn't exist".format(partition))
+        if any([is_overlap(new_location, e) for e in existing_s3_locations]):
+            raise ValueError(
+                "A data mapper already exists which covers this S3 location"
+            )
+        if serde_lib not in SUPPORTED_SERDE_LIBS:
+            raise ValueError(
+                "The format for the specified table is not supported. The SerDe lib must be one of {}".format(
+                    ", ".join(SUPPORTED_SERDE_LIBS)
+                )
+            )
+        if serde_lib == JSON_OPENX_SERDE:
+            not_allowed_json_params = {
+                "ignore.malformed.json": "TRUE",
+                "dots.in.keys": "TRUE",
+            }
+            for param, value in not_allowed_json_params.items():
+                if param in serde_params and serde_params[param] == value:
+                    raise ValueError(
+                        "The parameter {} cannot be {} for SerDe library {}".format(
+                            param, value, JSON_OPENX_SERDE
+                        )
+                    )
+            if any([k for k, v in serde_params.items() if k.startswith("mapping.")]):
+                raise ValueError(
+                    "Column mappings are not supported for SerDe library {}".format(
+                        JSON_OPENX_SERDE
+                    )
+                )
+
+
+def get_existing_s3_locations(current_data_mapper_id):
+    items = table.scan()["Items"]
+    glue_mappers = [
+        get_table_details_from_mapper(mapper)
+        for mapper in items
+        if mapper["QueryExecutorParameters"].get("DataCatalogProvider") == "glue"
+        and mapper["DataMapperId"] != current_data_mapper_id
+    ]
+    return [get_glue_table_location(m) for m in glue_mappers]
+
+
+def get_table_details_from_mapper(mapper):
+    db = mapper["QueryExecutorParameters"]["Database"]
+    table_name = mapper["QueryExecutorParameters"]["Table"]
+    return glue_client.get_table(DatabaseName=db, Name=table_name)
+
+
+def get_glue_table_location(t):
+    return t["Table"]["StorageDescriptor"]["Location"]
+
+
+def get_glue_table_format(t):
+    return (
+        t["Table"]["StorageDescriptor"]["SerdeInfo"]["SerializationLibrary"],
+        t["Table"]["StorageDescriptor"]["SerdeInfo"]["Parameters"],
+    )
+
+
+def get_glue_table_partition_keys(t):
+    return [x["Name"] for x in t["Table"]["PartitionKeys"]]
+
+
+def is_overlap(a, b):
+    return a in b or b in a
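`is_overlap` in `data_mappers/handlers.py` uses substring containment, so two locations are treated as overlapping whenever one string appears anywhere inside the other. The sketch below reproduces that check and contrasts it with a stricter, prefix-based variant. `is_prefix_overlap` is a hypothetical alternative shown only for comparison; it is not what the handler does.

```python
def is_overlap(a, b):
    # Same check as the handler: substring containment in either direction
    return a in b or b in a


def is_prefix_overlap(a, b):
    """Hypothetical stricter variant: compare whole path segments by prefix."""
    # Normalise to a trailing slash so "data" does not match "data2"
    a = a if a.endswith("/") else a + "/"
    b = b if b.endswith("/") else b + "/"
    return a.startswith(b) or b.startswith(a)


nested = ("s3://bucket/data/", "s3://bucket/data/2020/")
siblings = ("s3://bucket/data", "s3://bucket/data2/")

print(is_overlap(*nested), is_prefix_overlap(*nested))
print(is_overlap(*siblings), is_prefix_overlap(*siblings))
```

The two checks agree on nested locations and differ on sibling prefixes that share a leading string. Whether the stricter behaviour is desirable depends on how table locations are laid out.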

+ 21 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/create_data_mapper.json

@@ -0,0 +1,21 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "Create Data Mapper Handler",
+    "type": "object",
+    "properties": {
+        "pathParameters": {
+            "description": "Path parameters for the request",
+            "type": "object",
+            "properties": {
+                "data_mapper_id": {
+                    "description": "ID of the data mapper",
+                    "type": "string",
+                    "pattern": "^([A-Za-z0-9])+$"
+                }
+            },
+            "required": ["data_mapper_id"]
+        }
+    },
+    "required": ["pathParameters"]
+}
+

+ 21 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/delete_data_mapper.json

@@ -0,0 +1,21 @@
+{
+    "$schema": "http://json-schema.org/draft-06/schema#",
+    "title": "Delete Data Mapper Handler",
+    "type": "object",
+    "properties": {
+        "pathParameters": {
+            "description": "Path parameters for the request",
+            "type": "object",
+            "properties": {
+                "data_mapper_id": {
+                    "description": "ID of the data mapper",
+                    "type": "string",
+                    "pattern": "^([A-Za-z0-9])+$"
+                }
+            },
+            "required": ["data_mapper_id"]
+        }
+    },
+    "required": ["pathParameters"]
+}
+

+ 21 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/get_data_mapper.json

@@ -0,0 +1,21 @@
+{
+    "$schema": "http://json-schema.org/draft-06/schema#",
+    "title": "Get Data Mapper Handler",
+    "type": "object",
+    "properties": {
+        "pathParameters": {
+            "description": "Path parameters for the request",
+            "type": "object",
+            "properties": {
+                "data_mapper_id": {
+                    "description": "ID of the Data Mapper",
+                    "type": "string",
+                    "pattern": "^([A-Za-z0-9])+$"
+                }
+            },
+            "required": ["data_mapper_id"]
+        }
+    },
+    "required": ["pathParameters"]
+}
+

+ 22 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/data_mappers/schemas/list_data_mappers.json

@@ -0,0 +1,22 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "List Data Mappers Handler",
+    "type": "object",
+    "properties": {
+        "queryStringParameters": {
+            "description": "Query string parameters for the request",
+            "type": [ "object", "null" ],
+            "properties": {
+                "start_at": {
+                    "description": "Starting watermark",
+                    "type": "string"
+                },
+                "page_size": {
+                    "description": "Maximum page size",
+                    "type": "string",
+                    "pattern": "^([1-9][0-9]{0,2}|1000)$"
+                }
+            }
+        }
+    }
+}
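The `page_size` pattern in `list_data_mappers.json` accepts the decimal strings 1 to 1000 with no leading zeros. A quick way to see which inputs pass is to run the same regular expression with Python's `re` module; `is_valid_page_size` is a hypothetical helper for illustration and is not part of the handlers.

```python
import re

# Same pattern as the "page_size" property in the schema
PAGE_SIZE_PATTERN = re.compile(r"^([1-9][0-9]{0,2}|1000)$")


def is_valid_page_size(value):
    """Return True if the query-string value matches the schema pattern."""
    return PAGE_SIZE_PATTERN.fullmatch(value) is not None


for candidate in ["1", "999", "1000", "0", "01", "1001", ""]:
    print(repr(candidate), is_valid_page_size(candidate))
```

The value stays a string at this stage; the handler converts it with `int()` only after validation has passed.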

+ 0 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/__init__.py


+ 215 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/handlers.py

@@ -0,0 +1,215 @@
+"""
+Job handlers
+"""
+import json
+import os
+
+import boto3
+from boto3.dynamodb.conditions import Key, Attr
+
+from boto_utils import DecimalEncoder, utc_timestamp
+from decorators import (
+    with_logging,
+    request_validator,
+    catch_errors,
+    add_cors_headers,
+    load_schema,
+)
+
+ddb = boto3.resource("dynamodb")
+table = ddb.Table(os.getenv("JobTable", "S3F2_Jobs"))
+index = os.getenv("JobTableDateGSI", "Date-GSI")
+bucket_count = int(os.getenv("GSIBucketCount", 1))
+
+end_statuses = [
+    "COMPLETED_CLEANUP_FAILED",
+    "COMPLETED",
+    "FAILED",
+    "FIND_FAILED",
+    "FORGET_FAILED",
+    "FORGET_PARTIALLY_FAILED",
+]
+
+job_summary_attributes = [
+    "Id",
+    "CreatedAt",
+    "JobStatus",
+    "JobFinishTime",
+    "JobStartTime",
+    "TotalObjectRollbackFailedCount",
+    "TotalObjectUpdatedCount",
+    "TotalObjectUpdateSkippedCount",
+    "TotalObjectUpdateFailedCount",
+    "TotalQueryCount",
+    "TotalQueryFailedCount",
+    "TotalQueryScannedInBytes",
+    "TotalQuerySucceededCount",
+    "TotalQueryTimeInMillis",
+]
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("get_job"))
+@catch_errors
+def get_job_handler(event, context):
+    job_id = event["pathParameters"]["job_id"]
+    resp = table.get_item(
+        Key={
+            "Id": job_id,
+            "Sk": job_id,
+        }
+    )
+    item = resp.get("Item")
+    if not item:
+        return {"statusCode": 404}
+
+    return {"statusCode": 200, "body": json.dumps(item, cls=DecimalEncoder)}
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("list_jobs"))
+@catch_errors
+def list_jobs_handler(event, context):
+    qs = event.get("queryStringParameters")
+    if not qs:
+        qs = {}
+    page_size = int(qs.get("page_size", 10))
+    start_at = int(qs.get("start_at", utc_timestamp()))
+
+    items = []
+    for gsi_bucket in range(0, bucket_count):
+        response = table.query(
+            IndexName=index,
+            KeyConditionExpression=Key("GSIBucket").eq(str(gsi_bucket))
+            & Key("CreatedAt").lt(start_at),
+            ScanIndexForward=False,
+            Limit=page_size,
+            ProjectionExpression=", ".join(job_summary_attributes),
+        )
+        items += response.get("Items", [])
+    items = sorted(items, key=lambda i: i["CreatedAt"], reverse=True)[:page_size]
+    if len(items) < page_size:
+        next_start = None
+    else:
+        next_start = min([item["CreatedAt"] for item in items])
+
+    return {
+        "statusCode": 200,
+        "body": json.dumps(
+            {
+                "Jobs": items,
+                "NextStart": next_start,
+            },
+            cls=DecimalEncoder,
+        ),
+    }
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("list_job_events"))
+@catch_errors
+def list_job_events_handler(event, context):
+    # Input parsing
+    job_id = event["pathParameters"]["job_id"]
+    qs = event.get("queryStringParameters")
+    mvqs = event.get("multiValueQueryStringParameters")
+    if not qs:
+        qs = {}
+        mvqs = {}
+    page_size = int(qs.get("page_size", 20))
+    start_at = qs.get("start_at", "0")
+    # Check the job exists
+    job = table.get_item(
+        Key={
+            "Id": job_id,
+            "Sk": job_id,
+        }
+    ).get("Item")
+    if not job:
+        return {"statusCode": 404}
+
+    watermark_boundary_mu = (job.get("JobFinishTime", utc_timestamp()) + 1) * 1000
+
+    # Check the watermark is not "future"
+    if int(start_at.split("#")[0]) > watermark_boundary_mu:
+        raise ValueError("Watermark {} is out of bounds for this job".format(start_at))
+
+    # Apply filters
+    filter_expression = Attr("Type").eq("JobEvent")
+    user_filters = mvqs.get("filter", [])
+    for f in user_filters:
+        k, v = f.split("=", 1)
+        filter_expression = filter_expression & Attr(k).begins_with(v)
+
+    # Because result may contain both JobEvent and Job items, we request max page_size+1 items then apply the type
+    # filter as FilterExpression. We then limit the list size to the requested page size in case the number of
+    # items after filtering is still page_size+1 i.e. the Job item wasn't on the page.
+    items = []
+    query_start_key = str(start_at)
+    last_evaluated = None
+    last_query_size = 0
+    while len(items) < page_size:
+        resp = table.query(
+            KeyConditionExpression=Key("Id").eq(job_id),
+            ScanIndexForward=True,
+            FilterExpression=filter_expression,
+            Limit=100 if len(user_filters) else page_size + 1,
+            ExclusiveStartKey={"Id": job_id, "Sk": query_start_key},
+        )
+        results = resp.get("Items", [])
+        last_query_size = len(results)
+        items.extend(results[: page_size - len(items)])
+        query_start_key = resp.get("LastEvaluatedKey", {}).get("Sk")
+        if not query_start_key:
+            break
+        last_evaluated = query_start_key
+
+    next_start = _get_watermark(
+        items, start_at, page_size, job["JobStatus"], last_evaluated, last_query_size
+    )
+
+    resp = {
+        k: v
+        for k, v in {"JobEvents": items, "NextStart": next_start}.items()
+        if v is not None
+    }
+
+    return {"statusCode": 200, "body": json.dumps(resp, cls=DecimalEncoder)}
+
+
+def _get_watermark(
+    items,
+    initial_start_key,
+    page_size,
+    job_status,
+    last_evaluated_ddb_key,
+    last_query_size,
+):
+    """
+    Work out the watermark to return to the user using the following logic:
+    1. If the job is in progress, we always return a watermark but the source of the watermark
+       is determined as follows:
+       a. We've reached the last available items in DDB and filtering has left us with fewer than the desired page
+       size, but we have a LastEvaluatedKey that allows the client to skip the filtered items next time
+       b. There is at least 1 event and there are (or will be) more items available
+       c. There are no events after the supplied watermark so just return whatever the user sent
+    2. If the job is finished, return a watermark if the last query executed indicates there *might* be more
+       results
+    """
+    next_start = None
+    if job_status not in end_statuses:
+        # Job is in progress
+        if len(items) < page_size and last_evaluated_ddb_key:
+            next_start = last_evaluated_ddb_key
+        elif 0 < len(items) <= page_size:
+            next_start = items[len(items) - 1]["Sk"]
+        else:
+            next_start = str(initial_start_key)
+    # Job is finished but there are potentially more results
+    elif last_query_size >= page_size:
+        next_start = items[len(items) - 1]["Sk"]
+
+    return next_start
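
Review note: the watermark rules in the `_get_watermark` docstring can be tried out in isolation. The following is a minimal sketch, not the module's code: `get_watermark` is a standalone restatement of the branches above, and `END_STATUSES` is an assumed stand-in for `end_statuses`, which is defined outside this hunk.

```python
# Assumed stand-in for `end_statuses`, which is defined elsewhere in the module.
END_STATUSES = {"COMPLETED", "FAILED"}


def get_watermark(items, initial_start_key, page_size, job_status, last_key, last_query_size):
    """Standalone restatement of the watermark decision, for experimentation only."""
    if job_status not in END_STATUSES:
        if len(items) < page_size and last_key:
            return last_key  # case 1a: let the client skip filtered-out items
        if 0 < len(items) <= page_size:
            return items[-1]["Sk"]  # case 1b: resume after the last returned event
        return str(initial_start_key)  # case 1c: nothing new, echo the client's watermark
    if last_query_size >= page_size:
        return items[-1]["Sk"]  # case 2: job finished, but more results may exist
    return None  # job finished and the last query came back short


# Hypothetical event sort keys in the "<timestamp>#<id>" shape the schema expects.
events = [{"Sk": "1000#a"}, {"Sk": "2000#b"}]
running_full_page = get_watermark(events, "0", 2, "RUNNING", None, 2)
finished_short_page = get_watermark(events[:1], "0", 2, "COMPLETED", None, 1)
```

A client that keeps polling a running job therefore always receives a `NextStart` value, while a finished job stops returning one once a query comes back with fewer items than the page size.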

+ 21 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/schemas/get_job.json

@@ -0,0 +1,21 @@
+{
+    "$schema": "http://json-schema.org/draft-06/schema#",
+    "title": "Get Job Handler",
+    "type": "object",
+    "properties": {
+        "pathParameters": {
+            "description": "Path parameters for the request",
+            "type": "object",
+            "properties": {
+                "job_id": {
+                    "description": "ID of the Job",
+                    "type": "string",
+                    "pattern": "^([A-Za-z0-9-])+$"
+                }
+            },
+            "required": ["job_id"]
+        }
+    },
+    "required": ["pathParameters"]
+}
+

+ 47 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/schemas/list_job_events.json

@@ -0,0 +1,47 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "List Job Events Handler",
+    "type": "object",
+    "properties": {
+        "multiValueQueryStringParameters": {
+            "type": [ "object", "null" ],
+            "properties": {
+                "filter": {
+                    "oneOf": [
+                        {
+                            "type": "string",
+                            "pattern": "^(EventName)([=])([a-zA-Z0-9]+)$"
+                        },
+                        {
+                            "type": "array",
+                            "items": {
+                                "type": "string",
+                                "pattern": "^(EventName)([=])([a-zA-Z0-9]+)$"
+                            }
+                        }
+                    ]
+                }
+            }
+        },
+        "queryStringParameters": {
+            "description": "Query string parameters for the request",
+            "type": [ "object", "null" ],
+            "properties": {
+                "start_at": {
+                    "description": "Starting watermark",
+                    "type": "string",
+                    "pattern": "^(0|([0-9]+)#([a-zA-Z0-9-\\.]+))$"
+                },
+                "page_size": {
+                    "description": "Maximum page size",
+                    "type": "string",
+                    "pattern": "^([1-9][0-9]{0,2}|1000)$"
+                },
+                "filter": {
+                    "type": "string",
+                    "pattern": "^(EventName)([=])([a-zA-Z0-9]+)$"
+                }
+            }
+        }
+    }
+}
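
Review note: the string patterns in this schema can be checked quickly with Python's `re` module. This is a rough sketch under one assumption: JSON Schema patterns follow ECMA 262 regular expression syntax, and Python's `re` is only a close approximation for simple patterns like these. The helper name `is_valid` and the sample values are illustrative.

```python
import re

# Patterns copied from the schema above (JSON escaping removed).
START_AT = re.compile(r"^(0|([0-9]+)#([a-zA-Z0-9-\.]+))$")
PAGE_SIZE = re.compile(r"^([1-9][0-9]{0,2}|1000)$")
FILTER = re.compile(r"^(EventName)([=])([a-zA-Z0-9]+)$")


def is_valid(pattern, value):
    """Return True when the whole value matches the given pattern."""
    return pattern.match(value) is not None


# "0" means "start from the beginning"; otherwise a "<digits>#<id>" watermark.
print(is_valid(START_AT, "0"), is_valid(START_AT, "1591234567890#abc-123.def"))
# Page sizes are strings holding 1 to 1000.
print(is_valid(PAGE_SIZE, "1000"), is_valid(PAGE_SIZE, "1001"))
# Only EventName filters are accepted.
print(is_valid(FILTER, "EventName=QueryFailed"), is_valid(FILTER, "JobStatus=RUNNING"))
```

Because the filter pattern only admits `EventName`, the `k, v = f.split("=")` unpacking in `list_job_events_handler` always sees exactly one `=` for validated requests.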

+ 24 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/schemas/list_jobs.json

@@ -0,0 +1,24 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "List Jobs Handler",
+    "type": "object",
+    "properties": {
+        "queryStringParameters": {
+            "description": "Query string parameters for the request",
+            "type": [ "object", "null" ],
+            "properties": {
+                "start_at": {
+                    "description": "Starting watermark",
+                    "type": "string",
+                    "pattern": "^[0-9]+$"
+                },
+                "page_size": {
+                    "description": "Maximum page size",
+                    "type": "string",
+                    "pattern": "^([1-9][0-9]{0,2}|1000)$"
+                }
+            }
+        }
+    }
+}
+

+ 121 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/stats_updater.py

@@ -0,0 +1,121 @@
+"""
+Job Stats Updater
+"""
+import json
+import logging
+
+import boto3
+from os import getenv
+from collections import Counter
+
+from boto_utils import DecimalEncoder
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+ddb = boto3.resource("dynamodb")
+table = ddb.Table(getenv("JobTable", "S3F2_Jobs"))
+
+
+def update_stats(job_id, events):
+    stats = _aggregate_stats(events)
+    job = _update_job(job_id, stats)
+    logger.info("Updated Stats for Job ID %s: %s", job_id, stats)
+    return job
+
+
+def _aggregate_stats(events):
+    stats = Counter({})
+
+    for event in events:
+        event_name = event["EventName"]
+        event_data = event.get("EventData", {})
+        if event_name in ["QuerySucceeded", "QueryFailed"]:
+            stats += Counter(
+                {
+                    "TotalQueryCount": 1,
+                    "TotalQuerySucceededCount": 1
+                    if event_name == "QuerySucceeded"
+                    else 0,
+                    "TotalQueryFailedCount": 1 if event_name == "QueryFailed" else 0,
+                    "TotalQueryScannedInBytes": event_data.get("Statistics", {}).get(
+                        "DataScannedInBytes", 0
+                    ),
+                    "TotalQueryTimeInMillis": event_data.get("Statistics", {}).get(
+                        "EngineExecutionTimeInMillis", 0
+                    ),
+                }
+            )
+        if event_name in [
+            "ObjectUpdated",
+            "ObjectUpdateSkipped",
+            "ObjectUpdateFailed",
+            "ObjectRollbackFailed",
+        ]:
+            stats += Counter(
+                {
+                    "TotalObjectUpdatedCount": 1
+                    if event_name == "ObjectUpdated"
+                    else 0,
+                    "TotalObjectUpdateSkippedCount": 1
+                    if event_name == "ObjectUpdateSkipped"
+                    else 0,
+                    "TotalObjectUpdateFailedCount": 1
+                    if event_name == "ObjectUpdateFailed"
+                    else 0,
+                    "TotalObjectRollbackFailedCount": 1
+                    if event_name == "ObjectRollbackFailed"
+                    else 0,
+                }
+            )
+
+    return stats
+
+
+def _update_job(job_id, stats):
+    try:
+        return table.update_item(
+            Key={
+                "Id": job_id,
+                "Sk": job_id,
+            },
+            ConditionExpression="#Id = :Id AND #Sk = :Sk",
+            UpdateExpression="set #qt = if_not_exists(#qt, :z) + :qt, "
+            "#qs = if_not_exists(#qs, :z) + :qs, "
+            "#qf = if_not_exists(#qf, :z) + :qf, "
+            "#qb = if_not_exists(#qb, :z) + :qb, "
+            "#qm = if_not_exists(#qm, :z) + :qm, "
+            "#ou = if_not_exists(#ou, :z) + :ou, "
+            "#os = if_not_exists(#os, :z) + :os, "
+            "#of = if_not_exists(#of, :z) + :of, "
+            "#or = if_not_exists(#or, :z) + :or",
+            ExpressionAttributeNames={
+                "#Id": "Id",
+                "#Sk": "Sk",
+                "#qt": "TotalQueryCount",
+                "#qs": "TotalQuerySucceededCount",
+                "#qf": "TotalQueryFailedCount",
+                "#qb": "TotalQueryScannedInBytes",
+                "#qm": "TotalQueryTimeInMillis",
+                "#ou": "TotalObjectUpdatedCount",
+                "#os": "TotalObjectUpdateSkippedCount",
+                "#of": "TotalObjectUpdateFailedCount",
+                "#or": "TotalObjectRollbackFailedCount",
+            },
+            ExpressionAttributeValues={
+                ":Id": job_id,
+                ":Sk": job_id,
+                ":qt": stats.get("TotalQueryCount", 0),
+                ":qs": stats.get("TotalQuerySucceededCount", 0),
+                ":qf": stats.get("TotalQueryFailedCount", 0),
+                ":qb": stats.get("TotalQueryScannedInBytes", 0),
+                ":qm": stats.get("TotalQueryTimeInMillis", 0),
+                ":ou": stats.get("TotalObjectUpdatedCount", 0),
+                ":os": stats.get("TotalObjectUpdateSkippedCount", 0),
+                ":of": stats.get("TotalObjectUpdateFailedCount", 0),
+                ":or": stats.get("TotalObjectRollbackFailedCount", 0),
+                ":z": 0,
+            },
+            ReturnValues="ALL_NEW",
+        )["Attributes"]
+    except ddb.meta.client.exceptions.ConditionalCheckFailedException:
+        logger.warning("Job %s does not exist", job_id)
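
Review note: one detail of `_aggregate_stats` that is easy to miss is that in-place addition on `collections.Counter` keeps only positive counts, so any statistic that stays at zero is dropped from the result. That appears to be why `_update_job` reads every value with `stats.get(..., 0)`. A small sketch with made-up events:

```python
from collections import Counter

stats = Counter()
for event_name in ["QuerySucceeded", "QuerySucceeded"]:
    # Same shape as the per-event counter built in _aggregate_stats (shortened).
    stats += Counter(
        {
            "TotalQueryCount": 1,
            "TotalQuerySucceededCount": 1 if event_name == "QuerySucceeded" else 0,
            "TotalQueryFailedCount": 1 if event_name == "QueryFailed" else 0,
        }
    )

# The zero-valued key was discarded by `+=`, so a default is needed when reading it.
failed = stats.get("TotalQueryFailedCount", 0)
```

Without the default, `stats.get("TotalQueryFailedCount")` would return `None` here and DynamoDB would be sent a null instead of a number.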

+ 146 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/status_updater.py

@@ -0,0 +1,146 @@
+"""
+Job Status Updater
+"""
+import json
+import logging
+import os
+
+import boto3
+
+from boto_utils import DecimalEncoder
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+ddb = boto3.resource("dynamodb")
+table = ddb.Table(os.getenv("JobTable"))
+
+status_map = {
+    "FindPhaseFailed": "FIND_FAILED",
+    "ForgetPhaseFailed": "FORGET_FAILED",
+    "Exception": "FAILED",
+    "JobStarted": "RUNNING",
+    "ForgetPhaseEnded": "FORGET_COMPLETED_CLEANUP_IN_PROGRESS",
+    "CleanupFailed": "COMPLETED_CLEANUP_FAILED",
+    "CleanupSucceeded": "COMPLETED",
+}
+
+unlocked_states = ["RUNNING", "QUEUED", "FORGET_COMPLETED_CLEANUP_IN_PROGRESS"]
+skip_cleanup_states = [
+    "FIND_FAILED",
+    "FORGET_FAILED",
+    "FAILED",
+    "FORGET_PARTIALLY_FAILED",
+]
+
+time_statuses = {
+    "JobStartTime": ["RUNNING"],
+    "JobFinishTime": [
+        "COMPLETED_CLEANUP_FAILED",
+        "COMPLETED",
+        "FAILED",
+        "FIND_FAILED",
+        "FORGET_FAILED",
+        "FORGET_PARTIALLY_FAILED",
+    ],
+}
+
+append_data_only_statuses = {
+    "QueryPlanningComplete": ["GeneratedQueries", "DeletionQueueSize", "Manifests"]
+}
+
+
+def update_status(job_id, events):
+    attr_updates = {}
+    for event in events:
+        # Handle non status events
+        event_name = event["EventName"]
+        if event_name not in status_map:
+            if event_name in append_data_only_statuses:
+                event_data = event.get("EventData", {})
+                for attribute in append_data_only_statuses[event_name]:
+                    attr_updates[attribute] = event_data[attribute]
+            continue
+
+        new_status = determine_status(job_id, event_name)
+        # Only change the status if it's still in an unlocked state
+        if (
+            not attr_updates.get("JobStatus")
+            or attr_updates.get("JobStatus") in unlocked_states
+        ):
+            attr_updates["JobStatus"] = new_status
+
+        # Update any job attributes
+        for attr, statuses in time_statuses.items():
+            if new_status in statuses and not attr_updates.get(attr):
+                attr_updates[attr] = event["CreatedAt"]
+
+    if len(attr_updates) > 0:
+        job = _update_item(job_id, attr_updates)
+        logger.info("Updated Status for Job ID %s: %s", job_id, attr_updates)
+        return job
+
+
+def determine_status(job_id, event_name):
+    new_status = status_map[event_name]
+    if event_name == "ForgetPhaseEnded" and job_has_errors(job_id):
+        return "FORGET_PARTIALLY_FAILED"
+
+    return new_status
+
+
+def job_has_errors(job_id):
+    item = table.get_item(
+        Key={
+            "Id": job_id,
+            "Sk": job_id,
+        },
+        ConsistentRead=True,
+    )["Item"]
+    return (
+        item.get("TotalObjectUpdateFailedCount", 0) > 0
+        or item.get("TotalQueryFailedCount", 0) > 0
+    )
+
+
+def _update_item(job_id, attr_updates):
+    try:
+        update_expression = "set " + ", ".join(
+            ["#{k} = :{k}".format(k=k) for k, v in attr_updates.items()]
+        )
+        attr_names = {}
+        attr_values = {}
+
+        for k, v in attr_updates.items():
+            attr_names["#{}".format(k)] = k
+            attr_values[":{}".format(k)] = v
+
+        unlocked_states_condition = " OR ".join(
+            ["#JobStatus = :{}".format(s) for s in unlocked_states]
+        )
+
+        return table.update_item(
+            Key={
+                "Id": job_id,
+                "Sk": job_id,
+            },
+            UpdateExpression=update_expression,
+            ConditionExpression="#Id = :Id AND #Sk = :Sk AND ({})".format(
+                unlocked_states_condition
+            ),
+            ExpressionAttributeNames={
+                "#Id": "Id",
+                "#Sk": "Sk",
+                "#JobStatus": "JobStatus",
+                **attr_names,
+            },
+            ExpressionAttributeValues={
+                ":Id": job_id,
+                ":Sk": job_id,
+                **{":{}".format(s): s for s in unlocked_states},
+                **attr_values,
+            },
+            ReturnValues="ALL_NEW",
+        )["Attributes"]
+    except ddb.meta.client.exceptions.ConditionalCheckFailedException:
+        logger.warning("Job %s is already in a status which cannot be updated", job_id)
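
Review note: the expression strings that `_update_item` hands to DynamoDB are easier to review when built without any AWS calls. The sketch below restates that string building as a pure function. `build_update` is a hypothetical name, and the sample attribute values are invented.

```python
def build_update(attr_updates, unlocked_states):
    """Build the update expression, placeholders and lock condition as plain data."""
    update_expression = "set " + ", ".join(
        "#{k} = :{k}".format(k=k) for k in attr_updates
    )
    names = {"#{}".format(k): k for k in attr_updates}
    values = {":{}".format(k): v for k, v in attr_updates.items()}
    # The job may only change while its current status is one of the unlocked ones.
    condition = " OR ".join("#JobStatus = :{}".format(s) for s in unlocked_states)
    return update_expression, names, values, condition


expr, names, values, condition = build_update(
    {"JobStatus": "RUNNING", "JobStartTime": 1700000000},
    ["RUNNING", "QUEUED"],
)
print(expr)
print(condition)
```

Note that the value placeholder `:RUNNING` used by the condition and the placeholder `:JobStatus` used by the update are distinct keys, so merging both dictionaries into `ExpressionAttributeValues` does not cause collisions.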

+ 159 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/jobs/stream_processor.py

@@ -0,0 +1,159 @@
+import logging
+from datetime import datetime, timezone
+from os import getenv
+import json
+import boto3
+from boto3.dynamodb.types import TypeDeserializer
+from botocore.exceptions import ClientError
+from itertools import groupby
+from operator import itemgetter
+
+from stats_updater import update_stats
+from status_updater import update_status, skip_cleanup_states
+from boto_utils import (
+    DecimalEncoder,
+    deserialize_item,
+    emit_event,
+    fetch_job_manifest,
+    json_lines_iterator,
+    utc_timestamp,
+)
+from decorators import with_logging
+
+deserializer = TypeDeserializer()
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+client = boto3.client("stepfunctions")
+ddb = boto3.resource("dynamodb")
+glue = boto3.client("glue")
+
+state_machine_arn = getenv("StateMachineArn")
+q_table = ddb.Table(getenv("DeletionQueueTable"))
+glue_db = getenv("GlueDatabase", "s3f2_manifests_database")
+glue_table = getenv("JobManifestsGlueTable", "s3f2_manifests_table")
+
+
+@with_logging
+def handler(event, context):
+    records = event["Records"]
+    new_jobs = get_records(records, "Job", "INSERT")
+    deleted_jobs = get_records(records, "Job", "REMOVE", new_image=False)
+    events = get_records(records, "JobEvent", "INSERT")
+    grouped_events = groupby(sorted(events, key=itemgetter("Id")), key=itemgetter("Id"))
+    for job in new_jobs:
+        process_job(job)
+
+    for job in deleted_jobs:
+        cleanup_manifests(job)
+
+    for job_id, group in grouped_events:
+        group = list(group)
+        update_stats(job_id, group)
+        updated_job = update_status(job_id, group)
+        # Perform cleanup if required
+        if (
+            updated_job
+            and updated_job.get("JobStatus") == "FORGET_COMPLETED_CLEANUP_IN_PROGRESS"
+        ):
+            try:
+                clear_deletion_queue(updated_job)
+                emit_event(
+                    job_id, "CleanupSucceeded", utc_timestamp(), "StreamProcessor"
+                )
+            except Exception as e:
+                emit_event(
+                    job_id,
+                    "CleanupFailed",
+                    {"Error": "Unable to clear deletion queue: {}".format(str(e))},
+                    "StreamProcessor",
+                )
+        elif updated_job and updated_job.get("JobStatus") in skip_cleanup_states:
+            emit_event(job_id, "CleanupSkipped", utc_timestamp(), "StreamProcessor")
+
+
+def process_job(job):
+    job_id = job["Id"]
+    state = {
+        k: job[k]
+        for k in [
+            "AthenaConcurrencyLimit",
+            "AthenaQueryMaxRetries",
+            "DeletionTasksMaxNumber",
+            "ForgetQueueWaitSeconds",
+            "Id",
+            "QueryExecutionWaitSeconds",
+            "QueryQueueWaitSeconds",
+        ]
+    }
+
+    try:
+        client.start_execution(
+            stateMachineArn=state_machine_arn,
+            name=job_id,
+            input=json.dumps(state, cls=DecimalEncoder),
+        )
+    except client.exceptions.ExecutionAlreadyExists:
+        logger.warning("Execution %s already exists", job_id)
+    except (ClientError, ValueError) as e:
+        emit_event(
+            job_id,
+            "Exception",
+            {
+                "Error": "ExecutionFailure",
+                "Cause": "Unable to start StepFunction execution: {}".format(str(e)),
+            },
+            "StreamProcessor",
+        )
+
+
+def cleanup_manifests(job):
+    logger.info("Removing job manifest partitions")
+    job_id = job["Id"]
+    partitions = []
+    for manifest in job.get("Manifests", []):
+        data_mapper_id = manifest.split("/")[5]
+        partitions.append([job_id, data_mapper_id])
+    max_deletion_batch_size = 25
+    for i in range(0, len(partitions), max_deletion_batch_size):
+        glue.batch_delete_partition(
+            DatabaseName=glue_db,
+            TableName=glue_table,
+            PartitionsToDelete=[
+                {"Values": partition_tuple}
+                for partition_tuple in partitions[i : i + max_deletion_batch_size]
+            ],
+        )
+
+
+def clear_deletion_queue(job):
+    logger.info("Clearing successfully deleted matches")
+    to_delete = set()
+    for manifest_object in job.get("Manifests", []):
+        manifest = fetch_job_manifest(manifest_object)
+        for line in json_lines_iterator(manifest):
+            to_delete.add(line["DeletionQueueItemId"])
+
+    with q_table.batch_writer() as batch:
+        for item_id in to_delete:
+            batch.delete_item(Key={"DeletionQueueItemId": item_id})
+
+
+def is_operation(record, operation):
+    return record.get("eventName") == operation
+
+
+def is_record_type(record, record_type, new_image):
+    image = record["dynamodb"].get("NewImage" if new_image else "OldImage")
+    if not image:
+        return False
+    item = deserialize_item(image)
+    return item.get("Type") and item.get("Type") == record_type
+
+
+def get_records(records, record_type, operation, new_image=True):
+    return [
+        deserialize_item(r["dynamodb"].get("NewImage" if new_image else "OldImage", {}))
+        for r in records
+        if is_record_type(r, record_type, new_image) and is_operation(r, operation)
+    ]
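
Review note: the `sorted(...)` call wrapped around `groupby` in `handler` is load bearing. `itertools.groupby` only merges adjacent items with equal keys, so an unsorted stream batch would yield the same job ID more than once. A minimal sketch with invented records (`group_by_job` is a hypothetical helper):

```python
from itertools import groupby
from operator import itemgetter

events = [
    {"Id": "job-b", "EventName": "JobStarted"},
    {"Id": "job-a", "EventName": "QuerySucceeded"},
    {"Id": "job-b", "EventName": "QueryFailed"},
]


def group_by_job(records):
    """Group event names by job ID, sorting first so each ID forms one group."""
    ordered = sorted(records, key=itemgetter("Id"))
    return {
        job_id: [e["EventName"] for e in group]
        for job_id, group in groupby(ordered, key=itemgetter("Id"))
    }


grouped = group_by_job(events)
# Without sorting, "job-b" shows up as two separate groups.
unsorted_keys = [job_id for job_id, _ in groupby(events, key=itemgetter("Id"))]
```

Since `sorted` is stable, events for the same job keep their original relative order inside each group.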

+ 180 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/queue/handlers.py

@@ -0,0 +1,180 @@
+"""
+Queue handlers
+"""
+import random
+import json
+import os
+import uuid
+
+import boto3
+
+from decimal import Decimal
+
+from boto_utils import (
+    DecimalEncoder,
+    get_config,
+    get_user_info,
+    paginate,
+    running_job_exists,
+    utc_timestamp,
+    deserialize_item,
+)
+from decorators import (
+    with_logging,
+    catch_errors,
+    add_cors_headers,
+    json_body_loader,
+    load_schema,
+    request_validator,
+)
+
+sfn_client = boto3.client("stepfunctions")
+ddb_client = boto3.client("dynamodb")
+ddb_resource = boto3.resource("dynamodb")
+
+deletion_queue_table_name = os.getenv("DeletionQueueTable", "S3F2_DeletionQueue")
+deletion_queue_table = ddb_resource.Table(deletion_queue_table_name)
+jobs_table = ddb_resource.Table(os.getenv("JobTable", "S3F2_Jobs"))
+bucket_count = int(os.getenv("GSIBucketCount", 1))
+max_size_bytes = 375000
+
+
+@with_logging
+@add_cors_headers
+@json_body_loader
+@catch_errors
+def enqueue_handler(event, context):
+    body = event["body"]
+    validate_queue_items([body])
+    user_info = get_user_info(event)
+    item = enqueue_items([body], user_info)[0]
+    deletion_queue_table.put_item(Item=item)
+    return {"statusCode": 201, "body": json.dumps(item, cls=DecimalEncoder)}
+
+
+@with_logging
+@add_cors_headers
+@json_body_loader
+@catch_errors
+def enqueue_batch_handler(event, context):
+    body = event["body"]
+    matches = body["Matches"]
+    validate_queue_items(matches)
+    user_info = get_user_info(event)
+    items = enqueue_items(matches, user_info)
+    return {
+        "statusCode": 201,
+        "body": json.dumps({"Matches": items}, cls=DecimalEncoder),
+    }
+
+
+@with_logging
+@add_cors_headers
+@request_validator(load_schema("list_queue_items"))
+@catch_errors
+def get_handler(event, context):
+    defaults = {"Type": "Simple"}
+    qs = event.get("queryStringParameters")
+    if not qs:
+        qs = {}
+    page_size = int(qs.get("page_size", 10))
+    scan_params = {"Limit": page_size}
+    start_at = qs.get("start_at")
+    if start_at:
+        scan_params["ExclusiveStartKey"] = {"DeletionQueueItemId": start_at}
+    items = deletion_queue_table.scan(**scan_params).get("Items", [])
+    if len(items) < page_size:
+        next_start = None
+    else:
+        next_start = items[-1]["DeletionQueueItemId"]
+    return {
+        "statusCode": 200,
+        "body": json.dumps(
+            {
+                "MatchIds": list(map(lambda item: dict(defaults, **item), items)),
+                "NextStart": next_start,
+            },
+            cls=DecimalEncoder,
+        ),
+        "headers": {"Access-Control-Expose-Headers": "content-length"},
+    }
+
+
+@with_logging
+@add_cors_headers
+@json_body_loader
+@catch_errors
+def cancel_handler(event, context):
+    if running_job_exists():
+        raise ValueError("Cannot delete matches whilst there is a job in progress")
+    body = event["body"]
+    matches = body["Matches"]
+    with deletion_queue_table.batch_writer() as batch:
+        for match in matches:
+            batch.delete_item(Key={"DeletionQueueItemId": match["DeletionQueueItemId"]})
+
+    return {"statusCode": 204}
+
+
+@with_logging
+@add_cors_headers
+@catch_errors
+def process_handler(event, context):
+    if running_job_exists():
+        raise ValueError("There is already a job in progress")
+
+    job_id = str(uuid.uuid4())
+    config = get_config()
+    item = {
+        "Id": job_id,
+        "Sk": job_id,
+        "Type": "Job",
+        "JobStatus": "QUEUED",
+        "GSIBucket": str(random.randint(0, bucket_count - 1)),
+        "CreatedAt": utc_timestamp(),
+        "CreatedBy": get_user_info(event),
+        **{k: v for k, v in config.items() if k not in ["JobDetailsRetentionDays"]},
+    }
+    if int(config.get("JobDetailsRetentionDays", 0)) > 0:
+        item["Expires"] = utc_timestamp(days=config["JobDetailsRetentionDays"])
+    jobs_table.put_item(Item=item)
+    return {"statusCode": 202, "body": json.dumps(item, cls=DecimalEncoder)}
+
+
+def validate_queue_items(items):
+    for item in items:
+        if item.get("Type", "Simple") == "Composite":
+            is_array = isinstance(item["MatchId"], list)
+            enough_columns = is_array and len(item["MatchId"]) > 0
+            just_one_mapper = len(item["DataMappers"]) == 1
+            if not is_array:
+                raise ValueError(
+                    "MatchIds of Composite type need to be specified as array"
+                )
+            if not enough_columns:
+                raise ValueError(
+                    "MatchIds of Composite type need to have a value for at least one column"
+                )
+            if not just_one_mapper:
+                raise ValueError(
+                    "MatchIds of Composite type need to be associated to exactly one Data Mapper"
+                )
+
+
+def enqueue_items(matches, user_info):
+    items = []
+    with deletion_queue_table.batch_writer() as batch:
+        for match in matches:
+            match_id = match["MatchId"]
+            data_mappers = match.get("DataMappers", [])
+            item = {
+                "DeletionQueueItemId": str(uuid.uuid4()),
+                "Type": match.get("Type", "Simple"),
+                "MatchId": match_id,
+                "CreatedAt": utc_timestamp(),
+                "DataMappers": data_mappers,
+                "CreatedBy": user_info,
+            }
+            batch.put_item(Item=item)
+            items.append(item)
+    return items
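
Review note: the Composite checks in `validate_queue_items` can be exercised with small payloads. The sketch below is a condensed restatement, not the handler's code. `validate_composite`, the error wording, and the `{"Column": ..., "Value": ...}` element shape are assumptions made for illustration, since this hunk only requires `MatchId` to be a non-empty list.

```python
def validate_composite(item):
    """Condensed restatement of the Composite checks, for illustration only."""
    if item.get("Type", "Simple") != "Composite":
        return  # Simple matches are not constrained here
    if not isinstance(item["MatchId"], list):
        raise ValueError("MatchId of Composite type must be an array")
    if len(item["MatchId"]) == 0:
        raise ValueError("MatchId of Composite type needs at least one column")
    if len(item["DataMappers"]) != 1:
        raise ValueError("Composite matches need exactly one Data Mapper")


def is_accepted(item):
    """Return True when the item passes validation, False when it is rejected."""
    try:
        validate_composite(item)
        return True
    except ValueError:
        return False


# Assumed payload shape for a composite match.
composite = {
    "Type": "Composite",
    "MatchId": [{"Column": "first_name", "Value": "John"}],
    "DataMappers": ["dm123"],
}
print(is_accepted(composite))
print(is_accepted({**composite, "DataMappers": ["dm123", "dm456"]}))
```

A Simple match such as `{"MatchId": "12345"}` passes untouched, because the checks only apply when `Type` is `Composite`.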

+ 22 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/queue/schemas/list_queue_items.json

@@ -0,0 +1,22 @@
+{
+    "$schema": "http://json-schema.org/draft-07/schema#",
+    "title": "List Queue Items Handler",
+    "type": "object",
+    "properties": {
+        "queryStringParameters": {
+            "description": "Query string parameters for the request",
+            "type": [ "object", "null" ],
+            "properties": {
+                "start_at": {
+                    "description": "Starting watermark",
+                    "type": "string"
+                },
+                "page_size": {
+                    "description": "Maximum page size",
+                    "type": "string",
+                    "pattern": "^([1-9][0-9]{0,2}|1000)$"
+                }
+            }
+        }
+    }
+}

+ 20 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/settings/handlers.py

@@ -0,0 +1,20 @@
+"""
+Settings handlers
+"""
+import json
+
+import boto3
+
+from boto_utils import get_config, DecimalEncoder
+from decorators import with_logging, catch_errors, add_cors_headers
+
+
+@with_logging
+@add_cors_headers
+@catch_errors
+def list_settings_handler(event, context):
+    config = get_config()
+    return {
+        "statusCode": 200,
+        "body": json.dumps({"Settings": config}, cls=DecimalEncoder),
+    }

+ 27 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/check_query_status.py

@@ -0,0 +1,27 @@
+import boto3
+
+from decorators import with_logging
+
+client = boto3.client("athena")
+
+
+@with_logging
+def handler(event, context):
+    execution_retries_left = event["ExecutionRetriesLeft"]
+    execution_details = client.get_query_execution(QueryExecutionId=event["QueryId"])[
+        "QueryExecution"
+    ]
+    state = execution_details["Status"]["State"]
+    needs_retry = state in ("FAILED", "CANCELLED")
+    if needs_retry:
+        execution_retries_left -= 1
+
+    result = {
+        **event,
+        "State": state,
+        "Reason": execution_details["Status"].get("StateChangeReason", "n/a"),
+        "Statistics": execution_details["Statistics"],
+        "ExecutionRetriesLeft": execution_retries_left,
+    }
+
+    return result

+ 24 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/check_queue_size.py

@@ -0,0 +1,24 @@
+"""
+Task to check the SQS Queue Size
+"""
+import boto3
+
+from decorators import with_logging
+
+sqs = boto3.resource("sqs")
+
+
+def get_attribute(q, attribute):
+    return int(q.attributes[attribute])
+
+
+@with_logging
+def handler(event, context):
+    queue = sqs.Queue(event["QueueUrl"])
+    visible = get_attribute(queue, "ApproximateNumberOfMessages")
+    not_visible = get_attribute(queue, "ApproximateNumberOfMessagesNotVisible")
+    return {
+        "Visible": visible,
+        "NotVisible": not_visible,
+        "Total": visible + not_visible,
+    }

+ 31 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/check_task_count.py

@@ -0,0 +1,31 @@
+"""
+Task to check the number of running and pending tasks
+"""
+import logging
+import boto3
+
+from decorators import with_logging
+
+logger = logging.getLogger()
+client = boto3.client("ecs")
+
+
+@with_logging
+def handler(event, context):
+    try:
+        service = client.describe_services(
+            cluster=event["Cluster"],
+            services=[
+                event["ServiceName"],
+            ],
+        )["services"][0]
+        pending = service["pendingCount"]
+        running = service["runningCount"]
+        return {"Pending": pending, "Running": running, "Total": pending + running}
+    except IndexError:
+        logger.error("Unable to find service '%s'", event["ServiceName"])
+        raise ValueError(
+            "Service {} in cluster {} not found".format(
+                event["ServiceName"], event["Cluster"]
+            )
+        )

+ 19 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/delete_message.py

@@ -0,0 +1,19 @@
+import os
+
+import logging
+import boto3
+from decorators import with_logging
+
+logger = logging.getLogger()
+sqs = boto3.resource("sqs")
+queue_url = os.getenv("QueueUrl")
+
+
+@with_logging
+def handler(event, context):
+    receipt_handle = event.get("ReceiptHandle")
+    if receipt_handle:
+        message = sqs.Message(queue_url, receipt_handle)
+        message.delete()
+    else:
+        logger.warning("No receipt handle found in event. Skipping")

+ 16 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/emit_event.py

@@ -0,0 +1,16 @@
+"""
+Task to emit events
+"""
+from uuid import uuid4
+
+from boto_utils import emit_event
+from decorators import with_logging
+
+
+@with_logging
+def handler(event, context):
+    job_id = event["JobId"]
+    event_name = event["EventName"]
+    event_data = event["EventData"]
+    emitter_id = event.get("EmitterId", str(uuid4()))
+    emit_event(job_id, event_name, event_data, emitter_id)

+ 158 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/execute_query.py

@@ -0,0 +1,158 @@
+import os
+from operator import itemgetter
+
+import boto3
+
+from decorators import with_logging
+
+client = boto3.client("athena")
+
+COMPOSITE_JOIN_TOKEN = "_S3F2COMP_"
+
+glue_db = os.getenv("GlueDatabase", "s3f2_manifests_database")
+glue_table = os.getenv("JobManifestsGlueTable", "s3f2_manifests_table")
+
+
+@with_logging
+def handler(event, context):
+    response = client.start_query_execution(
+        QueryString=make_query(event["QueryData"]),
+        ResultConfiguration={
+            "OutputLocation": "s3://{bucket}/{prefix}/".format(
+                bucket=event["Bucket"], prefix=event["Prefix"]
+            )
+        },
+        WorkGroup=os.getenv("WorkGroup", "primary"),
+    )
+    return response["QueryExecutionId"]
+
+
+def make_query(query_data):
+    """
+    Returns a query which will look like
+    SELECT DISTINCT "$path" FROM (
+        SELECT t."$path"
+        FROM "db"."table" t,
+            "manifests_db"."manifests_table" m
+        WHERE
+            m."jobid"='job1234' AND
+            m."datamapperid"='dm123' AND
+            cast(t."customer_id" as varchar)=m."queryablematchid" AND
+                m."queryablecolumns"='customer_id'
+            AND partition_key = value
+
+        UNION ALL
+
+        SELECT t."$path"
+        FROM "db"."table" t,
+            "manifests_db"."manifests_table" m
+        WHERE
+            m."jobid"='job1234' AND
+            m."datamapperid"='dm123' AND
+            cast(t."other_customer_id" as varchar)=m."queryablematchid" AND
+                m."queryablecolumns"='other_customer_id'
+            AND partition_key = value
+    )
+
+    Note: 'queryablematchid' and 'queryablecolumns' are convenience
+    stringified values of match_id and its column when the match is simple,
+    or stringified joined values when composite (for instance,
+    "John_S3F2COMP_Doe" and "first_name_S3F2COMP_last_name").
+    JobId and DataMapperId are both used as partitions for the manifest to
+    optimize query execution time.
+
+    :param query_data: a dict which looks like
+    {
+      "Database":"db",
+      "Table": "table",
+      "Columns": [
+        {"Column": "col", "Type": "Simple"},
+        {
+          "Columns": ["first_name", "last_name"],
+          "Type": "Composite"
+        }
+      ],
+      "PartitionKeys": [{"Key":"k", "Value":"val"}]
+    }
+    """
+    distinct_template = """SELECT DISTINCT "$path" FROM ({column_unions})"""
+    single_column_template = """
+    SELECT t."$path"
+    FROM "{db}"."{table}" t,
+        "{manifest_db}"."{manifest_table}" m
+    WHERE
+        m."jobid"='{job_id}' AND
+        m."datamapperid"='{data_mapper_id}' AND
+        {queryable_matches}=m."queryablematchid" AND m."queryablecolumns"=\'{queryable_columns}\'
+        {partition_filters}
+    """
+    indent = " " * 4
+    cast_as_str = "cast(t.{} as varchar)"
+    columns_composite_join_token = ", '{}', ".format(COMPOSITE_JOIN_TOKEN)
+
+    db, table, columns, data_mapper_id, job_id = itemgetter(
+        "Database", "Table", "Columns", "DataMapperId", "JobId"
+    )(query_data)
+
+    partitions = query_data.get("PartitionKeys", [])
+    partition_filters = ""
+    for partition in partitions:
+        partition_filters += " AND {key} = {value} ".format(
+            key=escape_column(partition["Key"]),
+            value=escape_item(partition["Value"]),
+        )
+
+    column_unions = ""
+    for i, col in enumerate(columns):
+        if i > 0:
+            column_unions += "\n" + indent + "UNION ALL\n"
+        is_simple = col["Type"] == "Simple"
+        queryable_matches = (
+            cast_as_str.format(escape_column(col["Column"]))
+            if is_simple
+            else cast_as_str.format(escape_column(col["Columns"][0]))
+            if len(col["Columns"]) == 1
+            else "concat({})".format(
+                columns_composite_join_token.join(
+                    "t.{0}".format(escape_column(c)) for c in col["Columns"]
+                )
+            )
+        )
+        queryable_columns = (
+            col["Column"] if is_simple else COMPOSITE_JOIN_TOKEN.join(col["Columns"])
+        )
+        column_unions += single_column_template.format(
+            db=db,
+            table=table,
+            manifest_db=glue_db,
+            manifest_table=glue_table,
+            job_id=job_id,
+            data_mapper_id=data_mapper_id,
+            queryable_matches=queryable_matches,
+            queryable_columns=queryable_columns,
+            partition_filters=partition_filters,
+        )
+    return distinct_template.format(column_unions=column_unions)
+
+
+def escape_column(item):
+    return '"{}"'.format(item.replace('"', '""').replace(".", '"."'))
+
+
+def escape_item(item):
+    if item is None:
+        return "NULL"
+    elif isinstance(item, (int, float)):
+        return escape_number(item)
+    elif isinstance(item, str):
+        return escape_string(item)
+    else:
+        raise ValueError("Unable to process supplied value")
+
+
+def escape_number(item):
+    return item
+
+
+def escape_string(item):
+    return "'{}'".format(item.replace("'", "''"))
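
Reviewer note: the escaping helpers above are what keep identifiers and partition values safe when they are interpolated into the Athena query string. A minimal, self-contained sketch of how they behave follows. `escape_column` and `escape_string` are copied from the diff; `build_partition_filter` is a hypothetical helper that mirrors the `partition_filters` loop in `make_query`, restricted to string values for brevity.

```python
def escape_column(item):
    # Double any embedded double quotes, then quote each dot-separated segment
    return '"{}"'.format(item.replace('"', '""').replace(".", '"."'))


def escape_string(item):
    # Single quotes are doubled, as in standard SQL string literals
    return "'{}'".format(item.replace("'", "''"))


def build_partition_filter(partitions):
    # Hypothetical helper (not in the diff): mirrors the partition_filters
    # loop in make_query, assuming every partition value is a string
    return "".join(
        " AND {} = {} ".format(escape_column(p["Key"]), escape_string(p["Value"]))
        for p in partitions
    )


if __name__ == "__main__":
    print(escape_column("user.name"))
    print(escape_string("O'Brien"))
    print(build_partition_filter([{"Key": "year", "Value": "2020"}]))
```

Note that `escape_column` treats every dot as a struct-path separator, so a column whose name literally contains a dot would be split into two quoted segments.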

+ 499 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/generate_queries.py

@@ -0,0 +1,499 @@
+"""
+Task for generating Athena queries from glue catalog aka Query Planning
+"""
+import json
+import os
+import boto3
+
+from operator import itemgetter
+from boto_utils import paginate, batch_sqs_msgs, deserialize_item, DecimalEncoder
+from decorators import with_logging
+
+ddb = boto3.resource("dynamodb")
+ddb_client = boto3.client("dynamodb")
+glue_client = boto3.client("glue")
+s3 = boto3.resource("s3")
+sqs = boto3.resource("sqs")
+
+queue = sqs.Queue(os.getenv("QueryQueue"))
+jobs_table = ddb.Table(os.getenv("JobTable", "S3F2_Jobs"))
+data_mapper_table_name = os.getenv("DataMapperTable", "S3F2_DataMappers")
+deletion_queue_table_name = os.getenv("DeletionQueueTable", "S3F2_DeletionQueue")
+manifests_bucket_name = os.getenv("ManifestsBucket", "S3F2-manifests-bucket")
+glue_db = os.getenv("GlueDatabase", "s3f2_manifests_database")
+glue_table = os.getenv("JobManifestsGlueTable", "s3f2_manifests_table")
+
+COMPOSITE_JOIN_TOKEN = "_S3F2COMP_"
+MANIFEST_KEY = "manifests/{job_id}/{data_mapper_id}/manifest.json"
+
+COMPOSITE_JOIN_TOKEN = "_S3F2COMP_"
+
+ARRAYSTRUCT = "array<struct>"
+ARRAYSTRUCT_PREFIX = "array<struct<"
+ARRAYSTRUCT_SUFFIX = ">>"
+STRUCT = "struct"
+STRUCT_PREFIX = "struct<"
+STRUCT_SUFFIX = ">"
+SCHEMA_INVALID = "Column schema is not valid"
+ALLOWED_TYPES = [
+    "bigint",
+    "char",
+    "decimal",
+    "double",
+    "float",
+    "int",
+    "smallint",
+    "string",
+    "tinyint",
+    "varchar",
+]
+
+
+@with_logging
+def handler(event, context):
+    job_id = event["ExecutionName"]
+    deletion_items = get_deletion_queue()
+    manifests_partitions = []
+    data_mappers = get_data_mappers()
+    total_queries = 0
+    for data_mapper in data_mappers:
+        query_executor = data_mapper["QueryExecutor"]
+        if query_executor == "athena":
+            queries = generate_athena_queries(data_mapper, deletion_items, job_id)
+            if len(queries) > 0:
+                manifests_partitions.append([job_id, data_mapper["DataMapperId"]])
+        else:
+            raise NotImplementedError(
+                "Unsupported data mapper query executor: '{}'".format(query_executor)
+            )
+
+        batch_sqs_msgs(queue, queries)
+        total_queries += len(queries)
+    write_partitions(manifests_partitions)
+    return {
+        "GeneratedQueries": total_queries,
+        "DeletionQueueSize": len(deletion_items),
+        "Manifests": [
+            "s3://{}/{}".format(
+                manifests_bucket_name,
+                MANIFEST_KEY.format(
+                    job_id=partition_tuple[0], data_mapper_id=partition_tuple[1]
+                ),
+            )
+            for partition_tuple in manifests_partitions
+        ],
+    }
+
+
+def build_manifest_row(columns, match_id, item_id, item_createdat, is_composite):
+    """
+    Function for building each row of the manifest that will be written to S3.
+
+    * What are 'queryablematchid' and 'queryablecolumns'?
+    A convenience stringified value of match_id and its column when the match
+    is simple, or a stringified joint value when composite (for instance,
+    "John_S3F2COMP_Doe" and "first_name_S3F2COMP_last_name"). The purpose of
+    these fields is to optimise query execution by doing the SQL JOINs over strings only.
+
+    * What are MatchId and Columns?
+    Original values to be used by the ECS task instead.
+    Note that the MatchId is declared as array<string> in the Glue Table as it's
+    not possible to declare it as an array of generic types and the design is to
+    use a single table schema for each match/column tuple, regardless of
+    the actual column type.
+    This means that using the "MatchId" field in Athena will always coerce its values
+    to strings, for instance [1234] => ["1234"]. That's ok because when working with
+    the manifest, the Fargate task will read and parse the JSON directly and therefore
+    will use its original type (for instance, int rather than string for the comparison).
+    """
+
+    iterable_match = match_id if is_composite else [match_id]
+    queryable = COMPOSITE_JOIN_TOKEN.join(str(x) for x in iterable_match)
+    queryable_cols = COMPOSITE_JOIN_TOKEN.join(str(x) for x in columns)
+    return (
+        json.dumps(
+            {
+                "Columns": columns,
+                "MatchId": iterable_match,
+                "DeletionQueueItemId": item_id,
+                "CreatedAt": item_createdat,
+                "QueryableColumns": queryable_cols,
+                "QueryableMatchId": queryable,
+            },
+            cls=DecimalEncoder,
+        )
+        + "\n"
+    )
+
+
+def generate_athena_queries(data_mapper, deletion_items, job_id):
+    """
+    For each Data Mapper, generates a list of the parameters needed for each
+    query execution. The matches for the given column are saved in an external
+    S3 object (aka manifest) to allow its size to grow into the thousands without
+    hitting the DDB document size limit, the SQS message size limit, or the Athena
+    query size limit. The manifest S3 path is then referenced in the SQS message.
+    """
+    manifest_key = MANIFEST_KEY.format(
+        job_id=job_id, data_mapper_id=data_mapper["DataMapperId"]
+    )
+    db = data_mapper["QueryExecutorParameters"]["Database"]
+    table_name = data_mapper["QueryExecutorParameters"]["Table"]
+    table = get_table(db, table_name)
+    columns_tree = get_columns_tree(table)
+    all_partition_keys = [p["Name"] for p in table.get("PartitionKeys", [])]
+    partition_keys = data_mapper["QueryExecutorParameters"].get(
+        "PartitionKeys", all_partition_keys
+    )
+    columns = [c for c in data_mapper["Columns"]]
+    msg = {
+        "DataMapperId": data_mapper["DataMapperId"],
+        "QueryExecutor": data_mapper["QueryExecutor"],
+        "Format": data_mapper["Format"],
+        "Database": db,
+        "Table": table_name,
+        "Columns": columns,
+        "PartitionKeys": [],
+        "DeleteOldVersions": data_mapper.get("DeleteOldVersions", True),
+        "IgnoreObjectNotFoundExceptions": data_mapper.get(
+            "IgnoreObjectNotFoundExceptions", False
+        ),
+    }
+    if data_mapper.get("RoleArn", None):
+        msg["RoleArn"] = data_mapper["RoleArn"]
+
+    # Work out which deletion items should be included in this query
+    applicable_match_ids = [
+        item
+        for item in deletion_items
+        if msg["DataMapperId"] in item.get("DataMappers", [])
+        or len(item.get("DataMappers", [])) == 0
+    ]
+    if len(applicable_match_ids) == 0:
+        return []
+
+    # Compile a list of MatchIds grouped by Column
+    columns_with_matches = {}
+    manifest = ""
+    for item in applicable_match_ids:
+        mid, item_id, item_createdat = itemgetter(
+            "MatchId", "DeletionQueueItemId", "CreatedAt"
+        )(item)
+        is_simple = not isinstance(mid, list)
+        if is_simple:
+            for column in msg["Columns"]:
+                casted = cast_to_type(mid, column, table_name, columns_tree)
+                if column not in columns_with_matches:
+                    columns_with_matches[column] = {
+                        "Column": column,
+                        "Type": "Simple",
+                    }
+                manifest += build_manifest_row(
+                    [column], casted, item_id, item_createdat, False
+                )
+        else:
+            sorted_mid = sorted(mid, key=lambda x: x["Column"])
+            query_columns = list(map(lambda x: x["Column"], sorted_mid))
+            column_key = COMPOSITE_JOIN_TOKEN.join(query_columns)
+            composite_match = list(
+                map(
+                    lambda x: cast_to_type(
+                        x["Value"], x["Column"], table_name, columns_tree
+                    ),
+                    sorted_mid,
+                )
+            )
+            if column_key not in columns_with_matches:
+                columns_with_matches[column_key] = {
+                    "Columns": query_columns,
+                    "Type": "Composite",
+                }
+            manifest += build_manifest_row(
+                query_columns, composite_match, item_id, item_createdat, True
+            )
+    s3.Bucket(manifests_bucket_name).put_object(Body=manifest, Key=manifest_key)
+    msg["Columns"] = list(columns_with_matches.values())
+    msg["Manifest"] = "s3://{}/{}".format(manifests_bucket_name, manifest_key)
+
+    if len(partition_keys) == 0:
+        return [msg]
+
+    # For every partition combo of every table, create a query
+    partitions = set()
+    for partition in get_partitions(db, table_name):
+        current = tuple(
+            (
+                all_partition_keys[i],
+                cast_to_type(v, all_partition_keys[i], table_name, columns_tree),
+            )
+            for i, v in enumerate(partition["Values"])
+            if all_partition_keys[i] in partition_keys
+        )
+        partitions.add(current)
+    ret = []
+    for current in partitions:
+        current_dict = [{"Key": k, "Value": v} for k, v in current]
+        ret.append({**msg, "PartitionKeys": current_dict})
+    return ret
+
+
+def get_deletion_queue():
+    results = paginate(
+        ddb_client, ddb_client.scan, "Items", TableName=deletion_queue_table_name
+    )
+    return [deserialize_item(result) for result in results]
+
+
+def get_data_mappers():
+    results = paginate(
+        ddb_client, ddb_client.scan, "Items", TableName=data_mapper_table_name
+    )
+    for result in results:
+        yield deserialize_item(result)
+
+
+def get_table(db, table_name):
+    return glue_client.get_table(DatabaseName=db, Name=table_name)["Table"]
+
+
+def get_columns_tree(table):
+    return list(
+        map(
+            column_mapper,
+            table["StorageDescriptor"]["Columns"] + table.get("PartitionKeys", []),
+        )
+    )
+
+
+def get_partitions(db, table_name):
+    return paginate(
+        glue_client,
+        glue_client.get_partitions,
+        ["Partitions"],
+        DatabaseName=db,
+        TableName=table_name,
+        ExcludeColumnSchema=True,
+    )
+
+
+def write_partitions(partitions):
+    """
+    In order for the manifests to be used by Athena in a JOIN, we make them
+    available as partitions keyed by the (JobId, DataMapperId) tuple.
+    """
+    max_create_batch_size = 100
+    for i in range(0, len(partitions), max_create_batch_size):
+        glue_client.batch_create_partition(
+            DatabaseName=glue_db,
+            TableName=glue_table,
+            PartitionInputList=[
+                {
+                    "Values": partition_tuple,
+                    "StorageDescriptor": {
+                        "Columns": [
+                            {"Name": "columns", "Type": "array<string>"},
+                            {"Name": "matchid", "Type": "array<string>"},
+                            {"Name": "deletionqueueitemid", "Type": "string"},
+                            {"Name": "createdat", "Type": "int"},
+                            {"Name": "queryablecolumns", "Type": "string"},
+                            {"Name": "queryablematchid", "Type": "string"},
+                        ],
+                        "Location": "s3://{}/manifests/{}/{}/".format(
+                            manifests_bucket_name,
+                            partition_tuple[0],
+                            partition_tuple[1],
+                        ),
+                        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
+                        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
+                        "Compressed": False,
+                        "SerdeInfo": {
+                            "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
+                        },
+                        "StoredAsSubDirectories": False,
+                    },
+                }
+                for partition_tuple in partitions[i : i + max_create_batch_size]
+            ],
+        )
+
+
+def get_inner_children(str, prefix, suffix):
+    """
+    Function to get inner children from complex type string
+    "struct<name:string,age:int>" => "name:string,age:int"
+    """
+    if not str.endswith(suffix):
+        raise ValueError(SCHEMA_INVALID)
+    return str[len(prefix) : -len(suffix)]
+
+
+def get_nested_children(str, nested_type):
+    """
+    Function to get next nested child type from a children string
+    starting with a complex type such as struct or array
+    "struct<name:string,age:int,s:struct<n:int>>,b:string" =>
+    "struct<name:string,age:int,s:struct<n:int>>"
+    """
+    is_struct = nested_type == STRUCT
+    prefix = STRUCT_PREFIX if is_struct else ARRAYSTRUCT_PREFIX
+    suffix = STRUCT_SUFFIX if is_struct else ARRAYSTRUCT_SUFFIX
+    n_opened_tags = len(suffix)
+    end_index = -1
+    to_parse = str[len(prefix) :]
+    for i in range(len(to_parse)):
+        char = to_parse[i : (i + 1)]
+        if char == "<":
+            n_opened_tags += 1
+        if char == ">":
+            n_opened_tags -= 1
+        if n_opened_tags == 0:
+            end_index = i
+            break
+    if end_index < 0:
+        raise ValueError(SCHEMA_INVALID)
+    return str[0 : (end_index + len(prefix) + 1)]
+
+
+def get_nested_type(str):
+    """
+    Function to get next nested child type from a children string
+    starting with a non complex type
+    "string,a:int" => "string"
+    """
+    upper_index = str.find(",")
+    return str[0:upper_index] if upper_index >= 0 else str
+
+
+def set_no_identifier_to_node_and_its_children(node):
+    """
+    Function to set canBeIdentifier=false to item and its children
+    Example:
+    {
+        name: "arr",
+        type: "array<struct>",
+        canBeIdentifier: false,
+        children: [
+            { name: "field", type: "int", canBeIdentifier: true },
+            { name: "n", type: "string", canBeIdentifier: true }
+        ]
+    } => {
+        name: "arr",
+        type: "array<struct>",
+        canBeIdentifier: false,
+        children: [
+            { name: "field", type: "int", canBeIdentifier: false },
+            { name: "n", type: "string", canBeIdentifier: false }
+        ]
+    }
+    """
+    node["CanBeIdentifier"] = False
+    for child in node.get("Children", []):
+        set_no_identifier_to_node_and_its_children(child)
+
+
+def column_mapper(col):
+    """
+    Function to map Columns from AWS Glue schema to tree
+    Example 1:
+    { Name: "Name", Type: "int" } =>
+    { name: "Name", type: "int", canBeIdentifier: true }
+    Example 2:
+    { Name: "complex", Type: "struct<a:string,b:struct<c:int>>"} =>
+    { name: "complex", type: "struct", children: [
+        { name: "a", type: "string", canBeIdentifier: false},
+        { name: "b", type: "struct", children: [
+        { name: "c", type: "int", canBeIdentifier: false}
+        ], canBeIdentifier: false}
+    ], canBeIdentifier: false}
+    """
+    prefix = suffix = None
+    result_type = col["Type"]
+    has_children = False
+
+    if result_type.startswith(ARRAYSTRUCT_PREFIX):
+        result_type = ARRAYSTRUCT
+        prefix = ARRAYSTRUCT_PREFIX
+        suffix = ARRAYSTRUCT_SUFFIX
+        has_children = True
+    elif result_type.startswith(STRUCT_PREFIX):
+        result_type = STRUCT
+        prefix = STRUCT_PREFIX
+        suffix = STRUCT_SUFFIX
+        has_children = True
+
+    type_is_decimal_with_precision = result_type.startswith("decimal(")
+
+    result = {
+        "Name": col["Name"],
+        "Type": result_type,
+        "CanBeIdentifier": col["CanBeIdentifier"]
+        if "CanBeIdentifier" in col
+        else result_type in ALLOWED_TYPES or type_is_decimal_with_precision,
+    }
+
+    if has_children:
+        result["Children"] = []
+        children_to_parse = get_inner_children(col["Type"], prefix, suffix)
+
+        while len(children_to_parse) > 0:
+            sep = ":"
+            name = children_to_parse[0 : children_to_parse.index(sep)]
+            rest = children_to_parse[len(name) + len(sep) :]
+            nested_type = "other"
+            if rest.startswith(STRUCT_PREFIX):
+                nested_type = STRUCT
+            elif rest.startswith(ARRAYSTRUCT_PREFIX):
+                nested_type = ARRAYSTRUCT
+
+            c_type = (
+                get_nested_type(rest)
+                if nested_type == "other"
+                else get_nested_children(rest, nested_type)
+            )
+            result["Children"].append(
+                column_mapper(
+                    {
+                        "Name": name,
+                        "Type": c_type,
+                        "CanBeIdentifier": c_type in ALLOWED_TYPES,
+                    }
+                )
+            )
+            children_to_parse = children_to_parse[len(name) + len(sep) + len(c_type) :]
+            if children_to_parse.startswith(","):
+                children_to_parse = children_to_parse[1:]
+
+        if result_type != STRUCT:
+            set_no_identifier_to_node_and_its_children(result)
+
+    return result
+
+
+def get_column_info(col, columns_tree):
+    current = columns_tree
+    col_array = col.split(".")
+    found = None
+    for col_segment in col_array:
+        found = next((x for x in current if x["Name"] == col_segment), None)
+        if not found:
+            return None, False
+        current = found["Children"] if "Children" in found else []
+    return found["Type"], found["CanBeIdentifier"]
+
+
+def cast_to_type(val, col, table_name, columns_tree):
+    col_type, can_be_identifier = get_column_info(col, columns_tree)
+    if not col_type:
+        raise ValueError("Column {} not found at table {}".format(col, table_name))
+    elif not can_be_identifier:
+        raise ValueError(
+            "Column {} at table {} is not a supported column type for querying".format(
+                col, table_name
+            )
+        )
+    if col_type in ("bigint", "int", "smallint", "tinyint"):
+        return int(val)
+    if col_type in ("double", "float"):
+        return float(val)
+
+    return str(val)
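
Reviewer note: the manifest rows written by `generate_athena_queries` are what the Athena query in `execute_query.py` joins against, so it may help to see what a row looks like. The sketch below is a simplified version of `build_manifest_row`. It is an assumption-laden illustration, not the code in the diff: it leaves out the `DecimalEncoder` from `boto_utils` and so assumes every value is already JSON-serialisable.

```python
import json

COMPOSITE_JOIN_TOKEN = "_S3F2COMP_"


def build_manifest_row(columns, match_id, item_id, created_at, is_composite):
    # Simple matches are wrapped in a list so both cases share one schema
    iterable_match = match_id if is_composite else [match_id]
    row = {
        "Columns": columns,
        "MatchId": iterable_match,
        "DeletionQueueItemId": item_id,
        "CreatedAt": created_at,
        # Stringified values, used only for the JOIN in Athena
        "QueryableColumns": COMPOSITE_JOIN_TOKEN.join(str(c) for c in columns),
        "QueryableMatchId": COMPOSITE_JOIN_TOKEN.join(str(m) for m in iterable_match),
    }
    # One JSON document per line, as expected by the JSON SerDe
    return json.dumps(row) + "\n"


if __name__ == "__main__":
    print(build_manifest_row(["customer_id"], 1234, "id-1", 1600000000, False), end="")
    print(
        build_manifest_row(
            ["first_name", "last_name"], ["John", "Doe"], "id-2", 1600000000, True
        ),
        end="",
    )
```

The original typed value stays in `MatchId` for the Fargate task, while `QueryableMatchId` carries the string form that Athena compares against `cast(... as varchar)`.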

+ 20 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/orchestrate_ecs_service_scaling.py

@@ -0,0 +1,20 @@
+"""
+Task to orchestrate scaling for an ECS Service
+"""
+import boto3
+
+from decorators import with_logging
+
+ecs = boto3.client("ecs")
+
+
+@with_logging
+def handler(event, context):
+    cluster = event["Cluster"]
+    max_tasks = event["DeletionTasksMaxNumber"]
+    queue_size = event["QueueSize"]
+    service = event["DeleteService"]
+    desired_count = min(queue_size, max_tasks)
+    ecs.update_service(cluster=cluster, service=service, desiredCount=desired_count)
+
+    return desired_count

+ 14 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/purge_queue.py

@@ -0,0 +1,14 @@
+"""
+Task to purge an SQS queue
+"""
+import boto3
+
+from decorators import with_logging
+
+sqs = boto3.resource("sqs")
+
+
+@with_logging
+def handler(event, context):
+    queue = sqs.Queue(event["QueueUrl"])
+    queue.purge()

+ 22 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/scan_table.py

@@ -0,0 +1,22 @@
+"""
+Task to scan a DynamoDB table
+"""
+import boto3
+from boto3.dynamodb.types import TypeDeserializer
+
+from decorators import with_logging
+from boto_utils import paginate, deserialize_item
+
+ddb_client = boto3.client("dynamodb")
+deserializer = TypeDeserializer()
+
+
+@with_logging
+def handler(event, context):
+    results = paginate(
+        ddb_client, ddb_client.scan, "Items", TableName=event["TableName"]
+    )
+
+    items = [deserialize_item(result) for result in results]
+
+    return {"Items": items, "Count": len(items)}

+ 62 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/submit_query_results.py

@@ -0,0 +1,62 @@
+"""
+Submits results from Athena queries to the Fargate deletion queue
+"""
+import os
+
+import boto3
+
+from decorators import with_logging
+from boto_utils import paginate, batch_sqs_msgs
+
+athena = boto3.client("athena")
+sqs = boto3.resource("sqs")
+queue = sqs.Queue(os.getenv("QueueUrl"))
+
+MSG_BATCH_SIZE = 500
+
+
+@with_logging
+def handler(event, context):
+    query_id = event["QueryId"]
+    results = paginate(
+        athena, athena.get_query_results, ["ResultSet.Rows"], QueryExecutionId=query_id
+    )
+    messages = []
+    msg_count = 0
+    path_field_index = None
+    for result in results:
+        is_header_row = path_field_index is None
+        if is_header_row:
+            path_field_index = next(
+                (
+                    index
+                    for (index, d) in enumerate(result["Data"])
+                    if d["VarCharValue"] == "$path"
+                ),
+                None,
+            )
+        else:
+            msg_count += 1
+            path = result["Data"][path_field_index]["VarCharValue"]
+            msg = {
+                "JobId": event["JobId"],
+                "Object": path,
+                "Columns": event["Columns"],
+                "RoleArn": event.get("RoleArn", None),
+                "DeleteOldVersions": event.get("DeleteOldVersions", True),
+                "IgnoreObjectNotFoundExceptions": event.get(
+                    "IgnoreObjectNotFoundExceptions", False
+                ),
+                "Format": event.get("Format"),
+                "Manifest": event.get("Manifest"),
+            }
+            messages.append({k: v for k, v in msg.items() if v is not None})
+
+        if len(messages) >= MSG_BATCH_SIZE:
+            batch_sqs_msgs(queue, messages)
+            messages = []
+
+    if len(messages) > 0:
+        batch_sqs_msgs(queue, messages)
+
+    return msg_count
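
Reviewer note: the handler above relies on the first row of the Athena result set being the header row, and flushes messages in batches of `MSG_BATCH_SIZE`. The sketch below isolates that logic as a pure function so it can be exercised without AWS. `extract_paths` is a hypothetical name, and unlike the handler it raises if the header has no `$path` column, where the handler would instead treat the next row as a header again.

```python
def extract_paths(rows, batch_size=500):
    """Yield batches of "$path" values from Athena-style result rows."""
    path_index = None
    batch = []
    for row in rows:
        if path_index is None:
            # The first row holds the column names rather than data
            path_index = next(
                (
                    i
                    for i, d in enumerate(row["Data"])
                    if d.get("VarCharValue") == "$path"
                ),
                None,
            )
            if path_index is None:
                raise ValueError("No $path column found in header row")
            continue
        batch.append(row["Data"][path_index]["VarCharValue"])
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left after the last full batch
        yield batch


if __name__ == "__main__":
    rows = [
        {"Data": [{"VarCharValue": "$path"}]},
        {"Data": [{"VarCharValue": "s3://bucket/a.parquet"}]},
        {"Data": [{"VarCharValue": "s3://bucket/b.parquet"}]},
        {"Data": [{"VarCharValue": "s3://bucket/c.parquet"}]},
    ]
    for paths in extract_paths(rows, batch_size=2):
        print(paths)
```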

+ 94 - 0
S3/NewFind/amazon-s3-find-and-forget-master/backend/lambdas/tasks/work_query_queue.py

@@ -0,0 +1,94 @@
+import json
+import os
+import boto3
+
+from decorators import with_logging, s3_state_store
+from boto_utils import read_queue
+
+queue_url = os.getenv("QueueUrl")
+state_machine_arn = os.getenv("StateMachineArn")
+sqs = boto3.resource("sqs")
+queue = sqs.Queue(queue_url)
+sf_client = boto3.client("stepfunctions")
+
+
+@with_logging
+@s3_state_store(offload_keys=["Data"])
+def handler(event, context):
+    concurrency_limit = int(event.get("AthenaConcurrencyLimit", 15))
+    wait_duration = int(event.get("QueryExecutionWaitSeconds", 15))
+    execution_retries_left = int(event.get("AthenaQueryMaxRetries", 2))
+    execution_id = event["ExecutionId"]
+    job_id = event["ExecutionName"]
+    previously_started = event.get("RunningExecutions", {"Data": [], "Total": 0})
+    executions = [load_execution(execution) for execution in previously_started["Data"]]
+    succeeded = [
+        execution for execution in executions if execution["status"] == "SUCCEEDED"
+    ]
+    still_running = [
+        execution for execution in executions if execution["status"] == "RUNNING"
+    ]
+    failed = [
+        execution
+        for execution in executions
+        if execution["status"] not in ["SUCCEEDED", "RUNNING"]
+    ]
+    clear_completed(succeeded)
+    is_failing = previously_started.get("IsFailing", False)
+    if len(failed) > 0:
+        is_failing = True
+    # Only abandon for failures once all running queries are done
+    if is_failing and len(still_running) == 0:
+        abandon_execution(failed)
+
+    remaining_capacity = int(concurrency_limit) - len(still_running)
+    # Only schedule new queries if there have been no errors
+    if remaining_capacity > 0 and not is_failing:
+        msgs = read_queue(queue, remaining_capacity)
+        started = []
+        for msg in msgs:
+            body = json.loads(msg.body)
+            body["AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID"] = execution_id
+            body["JobId"] = job_id
+            body["WaitDuration"] = wait_duration
+            body["ExecutionRetriesLeft"] = execution_retries_left
+            query_executor = body["QueryExecutor"]
+            if query_executor == "athena":
+                resp = sf_client.start_execution(
+                    stateMachineArn=state_machine_arn, input=json.dumps(body)
+                )
+                started.append({**resp, "ReceiptHandle": msg.receipt_handle})
+            else:
+                raise NotImplementedError(
+                    "Unsupported query executor: '{}'".format(query_executor)
+                )
+        still_running += started
+
+    return {
+        "IsFailing": is_failing,
+        "Data": [
+            {"ExecutionArn": e["executionArn"], "ReceiptHandle": e["ReceiptHandle"]}
+            for e in still_running
+        ],
+        "Total": len(still_running),
+    }
+
+
+def load_execution(execution):
+    resp = sf_client.describe_execution(executionArn=execution["ExecutionArn"])
+    resp["ReceiptHandle"] = execution["ReceiptHandle"]
+    return resp
+
+
+def clear_completed(executions):
+    for e in executions:
+        message = sqs.Message(queue.url, e["ReceiptHandle"])
+        message.delete()
+
+
+def abandon_execution(failed):
+    raise RuntimeError(
+        "Abandoning execution because one or more queries failed. {}".format(
+            ", ".join([f["executionArn"] for f in failed])
+        )
+    )
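
Reviewer note: the scheduling rules in this handler are easy to miss among the SQS and Step Functions calls. New queries are started only while nothing has failed, and the execution is abandoned only once a failure has been seen and no queries are still running. A rough sketch of that decision as a pure function follows. `plan_next_step` and its return shape are hypothetical, and it works on status strings only rather than on real `describe_execution` responses.

```python
def plan_next_step(statuses, concurrency_limit, was_failing=False):
    """Decide how many new queries to start given current execution statuses."""
    succeeded = [s for s in statuses if s == "SUCCEEDED"]
    running = [s for s in statuses if s == "RUNNING"]
    # Anything that is neither succeeded nor running counts as a failure
    failed = [s for s in statuses if s not in ("SUCCEEDED", "RUNNING")]

    is_failing = was_failing or len(failed) > 0
    # Only abandon once all running queries have finished
    if is_failing and not running:
        raise RuntimeError("Abandoning execution because one or more queries failed")

    capacity = concurrency_limit - len(running)
    # Only schedule new queries if there have been no errors
    to_start = capacity if capacity > 0 and not is_failing else 0
    return {
        "IsFailing": is_failing,
        "Completed": len(succeeded),
        "Running": len(running),
        "ToStart": to_start,
    }


if __name__ == "__main__":
    print(plan_next_step(["SUCCEEDED", "RUNNING", "RUNNING"], 15))
    print(plan_next_step(["FAILED", "RUNNING"], 15))
```

In the handler, `ToStart` corresponds to the `remaining_capacity` passed to `read_queue`, so fewer messages may actually be started if the queue is nearly empty.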

+ 5 - 0
S3/NewFind/amazon-s3-find-and-forget-master/cfn-publish.config

@@ -0,0 +1,5 @@
+bucket_name_prefix="solution-builders"
+acl="public-read"
+extra_files=build.zip
+templates="templates/template.yaml templates/role.yaml"
+regions="us-east-1 us-east-2 us-west-2 ap-northeast-1 ap-southeast-2 eu-west-1 eu-west-2 eu-central-1 eu-north-1"

+ 4 - 0
S3/NewFind/amazon-s3-find-and-forget-master/ci/cfn_nag_blacklist.yaml

@@ -0,0 +1,4 @@
+---
+RulesToSuppress:
+- id: W35
+  reason: Bucket Logging should be left to the customer to enable as it is subject to additional costs

+ 40 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docker_run_with_creds.sh

@@ -0,0 +1,40 @@
+#!/usr/bin/env bash
+
+set -e
+
+# Obtain stack and account details
+REGION=$(aws configure get region)
+JOB_TABLE=$(aws cloudformation describe-stacks \
+  --stack-name S3F2 \
+  --query 'Stacks[0].Outputs[?OutputKey==`JobTable`].OutputValue' \
+  --output text)
+QUEUE_URL=$(aws cloudformation describe-stacks \
+  --stack-name S3F2 \
+  --query 'Stacks[0].Outputs[?OutputKey==`DeletionQueueUrl`].OutputValue' \
+  --output text)
+DLQ_URL=$(aws cloudformation describe-stacks \
+  --stack-name S3F2 \
+  --query 'Stacks[0].Outputs[?OutputKey==`DLQUrl`].OutputValue' \
+  --output text)
+ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
+PARTITION=$(aws sts get-caller-identity --query Arn --output text | cut -d':' -f2)
+# Assume IAM Role to be passed to container
+SESSION_DATA=$(aws sts assume-role \
+  --role-session-name s3f2-local \
+  --role-arn arn:"${PARTITION}":iam::"${ACCOUNT_ID}":role/"${ROLE_NAME}" \
+  --query Credentials \
+  --output json)
+AWS_ACCESS_KEY_ID=$(echo "${SESSION_DATA}" | jq -r ".AccessKeyId")
+AWS_SECRET_ACCESS_KEY=$(echo "${SESSION_DATA}" | jq -r ".SecretAccessKey")
+AWS_SESSION_TOKEN=$(echo "${SESSION_DATA}" | jq -r ".SessionToken")
+# Run the container with local changes mounted
+docker run \
+	-v "$(pwd)"/backend/ecs_tasks/delete_files/:/app/:ro \
+	-e DELETE_OBJECTS_QUEUE="${QUEUE_URL}" \
+	-e DLQ="${DLQ_URL}" \
+	-e JobTable="${JOB_TABLE}" \
+	-e AWS_DEFAULT_REGION="${REGION}" \
+	-e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}" \
+	-e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}" \
+	-e AWS_SESSION_TOKEN="${AWS_SESSION_TOKEN}" \
+	s3f2

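The script above derives the AWS partition from the caller's ARN with `cut -d':' -f2` and then assembles the role ARN passed to `aws sts assume-role`. The following is a minimal Python sketch of that string handling only. The function names are hypothetical and nothing here calls AWS; it just shows which ARN field is read and how the role ARN is composed.

```python
def partition_from_arn(arn: str) -> str:
    """Return the partition, the second colon-separated field of an ARN.

    Mirrors `cut -d':' -f2` in the script (fields are 1-indexed there).
    """
    return arn.split(":")[1]


def role_arn(partition: str, account_id: str, role_name: str) -> str:
    """Compose an IAM role ARN the same way the script does."""
    return f"arn:{partition}:iam::{account_id}:role/{role_name}"


if __name__ == "__main__":
    # Illustrative ARN only; the account ID and role name are placeholders.
    caller = "arn:aws:sts::123456789012:assumed-role/demo/session"
    print(role_arn(partition_from_arn(caller), "123456789012", "demo"))
```

Reading the partition from the ARN rather than hard-coding `aws` keeps the composed role ARN valid in other partitions such as `aws-cn`.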
+ 217 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/ARCHITECTURE.md

@@ -0,0 +1,217 @@
+# Architecture
+
+## Index
+
+- [Design Principles](#design-principles)
+- [Core Components](#core-components)
+  - [Data Mappers](#data-mappers)
+  - [Deletion Queue](#deletion-queue)
+  - [Deletion Jobs](#deletion-jobs)
+- [High-level Overview](#high-level-overview)
+- [User Interface](#user-interface)
+- [Persistence Layer](#persistence-layer)
+- [Deletion Job Workflow](#deletion-job-workflow)
+  - [The Athena Find workflow](#the-athena-find-workflow)
+  - [The Forget workflow](#the-forget-workflow)
+- [See Also](#see-also)
+
+## Design Principles
+
+The goal of the solution is to provide a secure, reliable, performant and
+cost-effective tool for finding and removing individual records within objects stored
+in S3 buckets. In order to achieve these goals the solution has adopted the
+following design principles:
+
+1. **Secure by design:**
+   - Every component is implemented with least privilege access
+   - Encryption is performed at all layers at rest and in transit
+   - Authentication is provided out of the box
+   - Expiration of logs is configurable
+   - Record identifiers (known as **Match IDs**) are automatically obfuscated or
+     irreversibly deleted as soon as possible when persisting state
+2. **Built to scale:** The system is designed and tested to work with
+   petabyte-scale Data Lakes containing thousands of partitions and hundreds of
+   thousands of objects
+3. **Cost optimised:**
+   - **Perform work in batches:** Since the time complexity of removing a single
+     vs multiple records in a single object is practically equal and it is
+     common for data owners to have the requirement of removing data within a
+     given _timeframe_, the solution is designed to allow the solution operator
+     to "queue" multiple matches to be removed in a single job.
+   - **Fail fast:** A deletion job takes place in two distinct phases: Find and
+     Forget. The Find phase queries the objects in your S3 data lakes to find
+     any objects which contain records where a specified column contains at
+     least one of the Match IDs in the deletion queue. If any queries fail, the
+     job is abandoned as soon as possible and the Forget phase will not take
+     place. The Forget Phase takes the list of objects returned from the Find
+     phase, and deletes only the relevant rows in those objects.
+   - **Optimised for Parquet:** The split phase approach optimises scanning for
+     columnar dense formats such as Parquet. The Find phase only retrieves and
+     processes the data for relevant columns when determining which S3 objects
+     need to be processed in the Forget phase. This approach can have
+     significant cost savings when operating on large data lakes with sparse
+     matches.
+   - **Serverless:** Where possible, the solution only uses Serverless
+     components to avoid costs for idle resources. All the components for Web
+     UI, API and Deletion Jobs are Serverless (for more information consult the
+     [Cost Overview guide]).
+4. **Robust monitoring and logging:** When performing deletion jobs, information
+   is provided in real-time to provide visibility. After the job completes,
+   detailed reports are available documenting all the actions performed to
+   individual S3 Objects, and detailed error traces in case of failures to
+   facilitate troubleshooting processes and identify remediation actions. For
+   more information consult the [Troubleshooting guide].
+
+## Core Components
+
+The following terms are used to identify core components within the solution.
+
+### Data Mappers
+
+Data Mappers instruct the Amazon S3 Find and Forget solution how and where to
+search for items to be deleted.
+
+To find data, a Data Mapper uses:
+
+- A table in a supported _data catalog provider_ which describes the location
+  and structure of the data you want to connect to the solution. Currently, AWS
+  Glue is the only supported data catalog provider.
+- A _query executor_ which is the service the Amazon S3 Find and Forget solution
+  will use to query the data. Currently, Amazon Athena is the only supported
+  query executor.
+
+Data Mappers can be created at any time, and removed when no deletion job is
+running.
+
+### Deletion Queue
+
+The Deletion Queue is a list of matches. A match is a value you wish to search
+for, which identifies rows in your S3 data lake to be deleted. For example, a
+match could be the ID of a specific customer.
+
+Matches can be added at any time, and can be removed only when no deletion job
+is in progress.
+
+### Deletion Jobs
+
+A Deletion Job is an activity performed by Amazon S3 Find and Forget which
+queries your data in S3 defined by the Data Mappers and deletes rows containing
+any match present in the Deletion Queue.
+
+Deletion jobs can be run anytime there is not another deletion job already
+running.
+
+## High-level Overview
+
+![Architecture](images/architecture.png)
+
+## User Interface
+
+Interaction with the system is via the Web UI or the API.
+
+To use the Web UI customers must authenticate themselves. The Web UI uses the
+same Amazon Cognito User Pool as the API. It consists of an Amazon S3 static
+site hosting a React.js web app, optionally distributed by an Amazon CloudFront
+distribution, which makes authenticated requests to the API on behalf of the
+customer. Customers can also send authenticated requests directly to the API
+Gateway ([API specification]).
+
+## Persistence Layer
+
+Data Persistence is handled differently depending on the circumstances:
+
+- The customer performs an action that synchronously affects state such as
+  making an API call that results in a write or update of a document in
+  DynamoDB. In that case the Lambda API Handlers directly interact with the
+  Database and respond accordingly following the [API specification].
+- The customer performs an action that results in a contract for an asynchronous
+  promise to be fulfilled, such as running a Deletion Job. In that case, the
+  synchronous write to the database will trigger an asynchronous Lambda Job
+  Stream Processor that will perform a variety of actions depending on the
+  scenario, such as executing the Deletion Job Step Function. Asynchronous
+  actions generally handle state by writing event documents to DynamoDB that are
+  occasionally subject to further actions by the Job Stream Processor.
+
+The data is stored in DynamoDB using 3 tables:
+
+- **DataMappers**: Metadata for mapping S3 buckets to the solution.
+- **DeletionQueue**: The queue of matches to be deleted. This data is stored in
+  DynamoDB in order to provide an API that makes it easy to inspect and
+  occasionally amend the data between deletion jobs.
+- **Jobs**: Data about deletion jobs, including the Job Summary (that contains
+  an up-to-date representation of specific jobs over time) and Job Events
+  (documents containing metadata about discrete events affecting a running job).
+  Job records will be retained for the duration specified in the settings after
+  which they will be removed using the [DynamoDB TTL] feature.
+
+## Deletion Job Workflow
+
+The Deletion Job workflow is implemented as an AWS Step Function.
+
+When a Deletion Job starts, the solution gathers all the configured data mappers
+then proceeds to the Find phase.
+
+For each supported query executor, the workflow generates a list of queries it
+should run based on the data mappers associated with that query executor and the
+partitions present in the data catalog tables associated with those data
+mappers. For each generated query, a message containing the information
+required by the target query executor is added to the query queue.
+
+When all the queries have been executed, the
+[Forget Workflow](#the-forget-workflow) is executed.
+
+![Architecture](images/stepfunctions_graph_main.png)
+
+### The Athena Find Workflow
+
+The Amazon S3 Find and Forget solution currently supports one type of Find
+Workflow, operated by an AWS Step Function that leverages Amazon Athena to query
+Amazon S3.
+
+The workflow is capable of finding where specific content is located in Amazon
+S3 by using Athena's `$path` pseudo-parameter as part of each query. In this way
+the system can operate the Forget Workflow by reading/writing only relevant
+objects rather than whole buckets, optimising performance, reliability and cost.
+When each workflow completes a query, it stores the result to the Object
+Deletion SQS Queue. The speed of the Find workflow depends on the Athena
+Concurrency (subject to account limits) and wait handlers, both configurable
+when deploying the solution.
+
+![Architecture](images/stepfunctions_graph_athena.png)
+
+### The Forget Workflow
+
+The Forget workflow is operated by an AWS Step Function that uses AWS Lambda
+and AWS Fargate for computing and Amazon DynamoDB and Amazon SQS to handle
+state.
+
+When the workflow starts, a fleet of AWS Fargate tasks is instantiated to
+consume the Object Deletion Queue and start deleting content from the objects.
+When the Queue is empty, a Lambda function sets the number of tasks back to 0 in order to
+optimise cost. The number of Fargate tasks is configurable when deploying the
+solution.
+
+Note that during the Forget phase, affected S3 objects are replaced at the time
+they are processed and are subject to the [Amazon S3 data consistency model]. We
+recommend that you avoid running a Deletion Job in parallel to a workload that
+reads from the data lake unless it has been designed to handle temporary
+inconsistencies between objects.
+
+![Architecture](images/stepfunctions_graph_deletion.png)
+
+## See Also
+
+- [API Specification]
+- [Cost Overview guide]
+- [Limits]
+- [Monitoring guide]
+
+[amazon s3 data consistency model]:
+  https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
+[api specification]: ./api/README.md
+[cost overview guide]: COST_OVERVIEW.md
+[limits]: LIMITS.md
+[monitoring guide]: MONITORING.md
+[dynamodb ttl]:
+  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html
+[troubleshooting guide]: TROUBLESHOOTING.md
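
The Find and Forget phases described in ARCHITECTURE.md above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the solution's actual implementation: `build_find_query`, `run_deletion_job`, and the `run_query` and `forget_object` callables are hypothetical names, and the query text only shows how Athena's `"$path"` pseudo-column could be used to list objects containing a match. The quoting is naive and for illustration only.

```python
from typing import Callable, Iterable, List, Set


def build_find_query(database: str, table: str, column: str,
                     match_ids: Iterable[str]) -> str:
    """Build an Athena-style query listing the S3 objects that contain a match."""
    # Naive quoting for illustration: single quotes are escaped by doubling.
    values = ", ".join("'" + m.replace("'", "''") + "'" for m in match_ids)
    return (
        f'SELECT DISTINCT "$path" FROM "{database}"."{table}" '
        f'WHERE "{column}" IN ({values})'
    )


def run_deletion_job(queries: Iterable[str],
                     run_query: Callable[[str], Iterable[str]],
                     forget_object: Callable[[str], None]) -> List[str]:
    """Run the Find phase, then the Forget phase, failing fast on query errors."""
    # Find phase: collect the paths of every object with at least one match.
    # Any exception raised by run_query propagates here, so the Forget
    # phase below never starts when a query fails.
    objects: Set[str] = set()
    for query in queries:
        objects.update(run_query(query))

    # Forget phase: process only the objects returned by the Find phase.
    processed = sorted(objects)
    for path in processed:
        forget_object(path)
    return processed
```

Collecting paths into a set means an object matched by several queries is rewritten once, which is the batching benefit the design principles describe.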

+ 351 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/COST_OVERVIEW.md

@@ -0,0 +1,351 @@
+# Cost Overview
+
+Amazon S3 Find and Forget is a solution you deploy in your own AWS account using
+[AWS CloudFormation]. There is no charge for the solution: you pay only for the
+AWS services used to run the solution. This page outlines the services used by
+the solution, and examples of the charges you should expect for typical usage of
+the solution.
+
+> **Disclaimer**
+>
+> You are responsible for the cost of the AWS services used while running this
+> deployment. There is no additional cost for using the solution. For full
+> details, see the following pricing pages for each AWS service you will be
+> using. Prices are subject to change.
+
+## Index
+
+- [Overview](#overview)
+  - [AWS Fargate](#aws-fargate)
+  - [AWS Glue](#aws-glue)
+  - [AWS Lambda](#aws-lambda)
+  - [AWS Step Functions](#aws-step-functions)
+  - [Amazon API Gateway](#amazon-api-gateway)
+  - [Amazon Athena](#amazon-athena)
+  - [Amazon CloudFront](#amazon-cloudfront)
+  - [Amazon Cognito](#amazon-cognito)
+  - [Amazon DynamoDB](#amazon-dynamodb)
+  - [Amazon S3](#amazon-s3)
+  - [Amazon SQS](#amazon-sqs)
+  - [Amazon VPC](#amazon-vpc)
+  - [Other Supporting Services](#other-supporting-services)
+- [Solution Cost Estimate](#solution-cost-estimate)
+  - [Scenario 1](#scenario-1)
+  - [Scenario 2](#scenario-2)
+  - [Scenario 3](#scenario-3)
+  - [Scenario 4](#scenario-4)
+  - [Scenario 5](#scenario-5)
+
+## Overview
+
+The Amazon S3 Find and Forget solution uses a serverless computing architecture.
+This model minimises costs when you're not actively using the solution, and
+allows the solution to scale while only paying for what you use.
+
+The sample VPC provided in this solution makes use of VPC Endpoints, which have
+an hourly cost as well as data transfer cost. All the other costs depend on the
+usage of the API, and for typical usage, the greatest proportion of what you pay
+will be for use of Amazon Athena, Amazon S3 and AWS Fargate.
+
+### AWS Fargate
+
+The Forget phase of the solution uses AWS Fargate. Using Fargate, you pay for
+the duration that Fargate tasks run during the Forget phase.
+
+The AWS Fargate cost is affected by the number of Fargate tasks you choose to
+run concurrently, and their configuration (vCPU and memory). You can configure
+these parameters when deploying the Solution.
+
+[AWS Fargate Pricing]
+
+### AWS Glue
+
+AWS Glue Data Catalog is used by the solution to define data mappers. You pay a
+monthly fee based on the number of objects stored in the data catalog, and for
+requests made to the AWS Glue service when the solution runs.
+
+[AWS Glue Pricing]
+
+### AWS Lambda
+
+AWS Lambda Functions are used throughout the solution. You pay for the requests
+to, and execution time of, these functions. Functions execute when using the
+solution web interface, API, and when a deletion job runs.
+
+[AWS Lambda Pricing]
+
+### AWS Step Functions
+
+AWS Step Functions Standard Workflows are used when a deletion job runs. You pay
+for the number of state transitions in the Step Function Workflow. The number of
+state transitions will increase with the number of data mappers, and partitions
+in those data mappers, included in a deletion job.
+
+See [AWS Step Functions Pricing] and the [Deletion Job Workflow].
+
+### Amazon API Gateway
+
+Amazon API Gateway is used to provide the solution web interface and API. You
+pay for requests made when using the web interface or API, and any data
+transferred out.
+
+[Amazon API Gateway Pricing]
+
+### Amazon Athena
+
+Amazon Athena scans your data lake during the _Find phase_ of a deletion job.
+You pay for the Athena queries run based on the amount of data scanned.
+
+You can achieve significant cost savings and performance gains by reducing the
+quantity of data Athena needs to scan per query by using compression,
+partitioning and conversion of your data to a columnar format. See
+[Supported Data Formats](LIMITS.md#supported-data-formats) for more information
+regarding supported data and compression formats.
+
+The [Amazon Athena Pricing] page contains an overview of prices and provides a
+calculator to estimate the Athena query cost for each deletion job run based on
+the Data Lake size. See [Using Workgroups to Control Query Access and Costs] for
+more information on using workgroups to set limits on the amount of data each
+query or the entire workgroup can process, and to track costs.
+
+### Amazon CloudFront
+
+If you choose to deploy a CloudFront distribution for the solution interface,
+you will pay CloudFront charges for requests and data transferred when you
+access the web interface.
+
+[Amazon CloudFront Pricing]
+
+### Amazon Cognito
+
+Amazon Cognito provides authentication to secure access to the API using an
+administrative user created during deployment. You pay a monthly fee for active
+users in the Cognito User Pool.
+
+[Amazon Cognito Pricing]
+
+### Amazon DynamoDB
+
+Amazon DynamoDB stores internal state data for the solution. All tables created
+by the solution use the on-demand capacity mode of pricing. You pay for storage
+used by these tables, and DynamoDB capacity used when interacting with the
+solution web interface, API, or running a deletion job.
+
+- [Amazon DynamoDB Pricing]
+- [Solution Persistence Layer]
+
+### Amazon S3
+
+Four types of charges occur when working with Amazon S3: Storage, Requests and
+data retrievals, Data Transfer, and Management.
+
+Uses of Amazon S3 in the solution include:
+
+- The solution web interface is deployed to, and served from, an S3 Bucket
+- During the _Find_ phase, Amazon Athena will:
+  1. Retrieve data from Amazon S3 for the columns defined in the data mapper
+  1. Store its results in an S3 bucket
+- During the _Forget_ phase, for each object identified in the Find phase, a
+  program running in AWS Fargate will:
+  1. Retrieve the entire object and its metadata
+  1. Create a new version of the file, and PUT this object to a staging bucket
+  1. Delete the original object
+  1. Copy the updated object from the staging bucket to the data bucket, and
+     set any metadata identified from the original object
+  1. Delete the object from the staging bucket
+- Some artefacts, and state data relating to AWS Step Functions Workflows may be
+  stored in S3
+
+[Amazon S3 Pricing]
+
+### Amazon SQS
+
+The solution uses standard and FIFO SQS queues to handle internal state during a
+deletion job. You pay for the number of requests made to SQS. The number of
+requests increases with the number of data mappers, partitions in those data
+mappers, and the number of Amazon S3 objects processed in a deletion job.
+
+[Amazon SQS Pricing]
+
+### Amazon VPC
+
+Amazon VPC provides network connectivity for AWS Fargate tasks that run during
+the _Forget_ phase.
+
+How you build the VPC will determine the prices you pay. For example, VPC
+Endpoints and NAT Gateways are two different ways to provide network access to
+the solution's dependencies. Both ways have different hourly prices and costs
+for data transferred.
+
+The sample VPC provided in this solution makes use of VPC Endpoints, which have
+an hourly cost as well as data transfer cost. You can choose to use this sample
+VPC, however it may be more cost-efficient to use an existing suitable VPC in
+your account if you have one.
+
+- [Amazon VPC Pricing]
+- [AWS PrivateLink Pricing]
+
+### Other Supporting Services
+
+During deployment, the solution uses [AWS CodeBuild], [AWS CodePipeline] and
+[AWS Lambda] custom resources to deploy the frontend and the backend. [AWS
+Fargate] uses [Amazon Elastic Container Registry] to store container images.
+
+## Solution Cost Estimate
+
+You are responsible for the cost of the AWS services used while running this
+solution. As of the date of publication of this version of the source code, the
+estimated cost to run a job with different Data Lake configurations in the
+Europe (Ireland) region is shown in the tables below. The estimates do not
+include VPC costs.
+
+| Summary                   |                      |
+| ------------------------- | -------------------- |
+| [Scenario 1](#scenario-1) | 100GB Snappy Parquet |
+| [Scenario 2](#scenario-2) | 750GB Snappy Parquet |
+| [Scenario 3](#scenario-3) | 10TB Snappy Parquet  |
+| [Scenario 4](#scenario-4) | 50TB Snappy Parquet  |
+| [Scenario 5](#scenario-5) | 100GB Gzip JSON      |
+
+### Scenario 1
+
+This example shows how the charges would be calculated for a deletion job where:
+
+- Your dataset is 100GB of Snappy compressed Parquet objects that are
+  distributed across 2 Partitions
+- The S3 bucket containing the objects is in the same region as the S3 Find and
+  Forget Solution
+- The total size of the data held in the column queried by Athena is 6.8GB
+- The Find phase returns 15 objects which need to be modified
+- The Forget phase uses 3 Fargate tasks with 4 vCPUs and 30GB of memory each,
+  running concurrently for 60 minutes
+
+| Service        | Spending | Notes                                                       |
+| -------------- | -------- | ----------------------------------------------------------- |
+| Amazon Athena  | \$0.03   | 6.8GB of data scanned                                       |
+| AWS Fargate    | \$0.89   | 3 tasks x 4 vCPUs, 30GB memory x 1 hour                     |
+| Amazon S3      | \$0.01   | \$0.01 of requests and data retrieval. \$0 of data transfer |
+| Other services | \$0.05   | n/a                                                         |
+| Total          | \$0.98   | n/a                                                         |
+
+> Note: This estimate doesn't include the costs for Amazon VPC
+
+### Scenario 2
+
+This example shows how the charges would be calculated for a deletion job where:
+
+- Your dataset is 750GB of Snappy compressed Parquet objects that are
+  distributed across 1000 Partitions
+- The S3 bucket containing the objects is in the same region as the S3 Find and
+  Forget Solution
+- The total size of the data held in the column queried by Athena is 10GB
+- The Find phase returns 1000 objects which need to be modified
+- The Forget phase uses 50 Fargate tasks with 4 vCPUs and 30GB of memory each,
+  running concurrently for 45 minutes
+
+| Service        | Spending | Notes                                                       |
+| -------------- | -------- | ----------------------------------------------------------- |
+| Amazon Athena  | \$0.05   | 10GB of data scanned                                        |
+| AWS Fargate    | \$11.07  | 50 tasks x 4 vCPUs, 30GB memory x 0.75 hours                |
+| Amazon S3      | \$0.01   | \$0.01 of requests and data retrieval. \$0 of data transfer |
+| Other services | \$0.01   | n/a                                                         |
+| Total          | \$11.14  | n/a                                                         |
+
+> Note: This estimate doesn't include the costs for Amazon VPC
+
+### Scenario 3
+
+This example shows how the charges would be calculated for a deletion job where:
+
+- Your dataset is 10TB of Snappy compressed Parquet objects that are distributed
+  across 2000 Partitions
+- The S3 bucket containing the objects is in the same region as the S3 Find and
+  Forget Solution
+- The total size of the data held in the column queried by Athena is 156GB
+- The Find phase returns 11000 objects which need to be modified
+- The Forget phase uses 100 Fargate tasks with 4 vCPUs and 30GB of memory each,
+  running concurrently for 150 minutes
+
+| Service        | Spending | Notes                                                       |
+| -------------- | -------- | ----------------------------------------------------------- |
+| Amazon Athena  | \$0.76   | 156GB of data scanned                                       |
+| AWS Fargate    | \$73.82  | 100 tasks x 4 vCPUs, 30GB memory x 2.5 hours                |
+| Amazon S3      | \$0.11   | \$0.11 of requests and data retrieval. \$0 of data transfer |
+| Other services | \$1      | n/a                                                         |
+| Total          | \$75.69  | n/a                                                         |
+
+> Note: This estimate doesn't include the costs for Amazon VPC
+
+### Scenario 4
+
+This example shows how the charges would be calculated for a deletion job where:
+
+- Your dataset is 50TB of Snappy compressed Parquet objects that are distributed
+  across 5300 Partitions
+- The S3 bucket containing the objects is in the same region as the S3 Find and
+  Forget Solution
+- The total size of the data held in the column queried by Athena is 671GB
+- The Find phase returns 45300 objects which need to be modified
+- The Forget phase uses 100 Fargate tasks with 4 vCPUs and 30GB of memory each,
+  running concurrently for 10.5 hours
+
+| Service        | Spending | Notes                                                       |
+| -------------- | -------- | ----------------------------------------------------------- |
+| Amazon Athena  | \$3.28   | 671GB of data scanned                                       |
+| AWS Fargate    | \$310.03 | 100 tasks x 4 vCPUs, 30GB memory x 10.5 hours               |
+| Amazon S3      | \$0.49   | \$0.49 of requests and data retrieval. \$0 of data transfer |
+| Other services | \$3      | n/a                                                         |
+| Total          | \$316.80 | n/a                                                         |
+
+> Note: This estimate doesn't include the costs for Amazon VPC
+
+### Scenario 5
+
+This example shows how the charges would be calculated for a deletion job where:
+
+- Your dataset is 100GB of Gzip compressed JSON objects that are distributed
+  across 310 Partitions
+- The S3 bucket containing the objects is in the same region as the S3 Find and
+  Forget Solution
+- The Find phase returns 3500 objects which need to be modified
+- The Forget phase uses 50 Fargate tasks with 4 vCPUs and 30GB of memory each,
+  running concurrently for 22 minutes
+
+| Service        | Spending | Notes                                                       |
+| -------------- | -------- | ----------------------------------------------------------- |
+| Amazon Athena  | \$0.50   | 100GB of data scanned                                       |
+| AWS Fargate    | \$5.31   | 50 tasks x 4 vCPUs, 30GB memory x 0.36 hours                |
+| Amazon S3      | \$0.03   | \$0.03 of requests and data retrieval. \$0 of data transfer |
+| Other services | \$0.05   | n/a                                                         |
+| Total          | \$5.89   | n/a                                                         |
+
+> Note: This estimate doesn't include the costs for Amazon VPC
+
+[aws cloudformation]: https://aws.amazon.com/cloudformation/
+[aws codebuild]: https://aws.amazon.com/codebuild/pricing/
+[aws codepipeline]: https://aws.amazon.com/codepipeline/pricing/
+[aws fargate pricing]: https://aws.amazon.com/fargate/pricing/
+[aws fargate]: https://aws.amazon.com/fargate/pricing/
+[aws glue pricing]: https://aws.amazon.com/glue/pricing/
+[aws lambda pricing]: https://aws.amazon.com/lambda/pricing/
+[aws lambda]: https://aws.amazon.com/lambda/pricing/
+[aws privatelink pricing]: https://aws.amazon.com/privatelink/pricing/
+[aws step functions pricing]: https://aws.amazon.com/step-functions/pricing/
+[amazon api gateway pricing]: https://aws.amazon.com/api-gateway/pricing/
+[amazon athena pricing]: https://aws.amazon.com/athena/pricing/
+[amazon cloudfront pricing]: https://aws.amazon.com/cloudfront/pricing/
+[amazon cognito pricing]: https://aws.amazon.com/cognito/pricing/
+[amazon dynamodb pricing]: https://aws.amazon.com/dynamodb/pricing/
+[amazon elastic container registry]: https://aws.amazon.com/ecr/pricing/
+[amazon s3 pricing]: https://aws.amazon.com/s3/pricing/
+[amazon sqs pricing]: https://aws.amazon.com/sqs/pricing/
+[amazon vpc pricing]: https://aws.amazon.com/vpc/pricing/
+[deletion job workflow]: ARCHITECTURE.md#deletion-job-workflow
+[solution persistence layer]: ARCHITECTURE.md#persistence-layer
+[using workgroups to control query access and costs]:
+  https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html
+[vpc configuration]:
+  USER_GUIDE.md#pre-requisite-Configuring-a-vpc-for-the-solution
+
+[some vpc endpoints]:
+  https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/templates/vpc.yaml
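
The Fargate and Athena figures in the COST_OVERVIEW.md scenario tables above can be reproduced with simple arithmetic. The sketch below is a hedged illustration: the per-hour and per-TB rates are assumptions that do not appear in the document, chosen because they reproduce the table values for Europe (Ireland), and the function names are hypothetical. Check the pricing pages for current rates before relying on any result.

```python
# Assumed rates in USD. These are NOT stated in the document; they are
# illustrative values that reproduce the scenario tables.
FARGATE_VCPU_HOUR = 0.04048   # per vCPU per hour (assumption)
FARGATE_GB_HOUR = 0.004445    # per GB of memory per hour (assumption)
ATHENA_PER_TB = 5.00          # per TB of data scanned (assumption)


def fargate_cost(tasks: int, vcpus: int, memory_gb: int, hours: float) -> float:
    """Cost of running `tasks` identical Fargate tasks concurrently for `hours`."""
    per_task_hour = vcpus * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR
    return tasks * hours * per_task_hour


def athena_cost(scanned_gb: float) -> float:
    """Cost of the Find phase for a given amount of scanned data (1 TB = 1024 GB)."""
    return scanned_gb / 1024 * ATHENA_PER_TB


if __name__ == "__main__":
    # Scenario 1: 3 tasks x 4 vCPUs, 30GB memory x 1 hour; 6.8GB scanned.
    print(f"Fargate: ${fargate_cost(3, 4, 30, 1.0):.2f}")  # Fargate: $0.89
    print(f"Athena:  ${athena_cost(6.8):.2f}")             # Athena:  $0.03
```

Fargate dominates every scenario because it scales with task count and run time, whereas Athena only pays for the queried column in Parquet data lakes.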

+ 119 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/LIMITS.md

@@ -0,0 +1,119 @@
+# Limits
+
+This section describes current limitations of the Amazon S3 Find and Forget
+solution. We are actively working on adding additional features and supporting
+more data formats. For feature requests, please open an issue on our [Issue
+Tracker].
+
+## Supported Data Formats
+
+The following data formats are supported:
+
+#### Apache Parquet
+
+|                                       |                                                                                                                                                                                       |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Compression on Read                   | Snappy, Brotli, Gzip, uncompressed                                                                                                                                                    |
+| Compression on Write                  | Snappy                                                                                                                                                                                |
+| Supported Types for Column Identifier | bigint, char, decimal, double, float, int, smallint, string, tinyint, varchar. Nested types (types whose parent is a struct, map, array) are only supported for **struct** type (\*). |
+| Notes                                 | (\*) When using a type nested in a struct as column identifier with Apache Parquet files, use Athena engine version 2. For more information, see [Managing Workgroups]                |
+
+#### JSON
+
+|                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Compression on Read                   | Gzip, uncompressed (\*\*)                                                                                                                                                                                                                                                                                                                                                                                                                               |
+| Compression on Write                  | Gzip, uncompressed (\*\*)                                                                                                                                                                                                                                                                                                                                                                                                                               |
+| Supported Types for Column Identifier | number, string. Nested types (types whose parent is a object, array) are only supported for **object** type.                                                                                                                                                                                                                                                                                                                                            |
+| Notes                                 | (\*\*) The compression type is determined from the file extension. If no file extension is present the solution treats the data as uncompressed. If the data is compressed make sure the file name includes the compression extension, such as `gz`.<br><br>When using OpenX JSON SerDe, `ignore.malformed.json` cannot be `TRUE`, `dots.in.keys` cannot be `TRUE`, and column mappings are not supported. For more information, see [OpenX JSON SerDe] |
+
+## Supported Query Providers
+
+The following data catalog provider and query executor combinations are
+supported:
+
+| Catalog Provider | Query Executor |
+| ---------------- | -------------- |
+| AWS Glue         | Amazon Athena  |
+
+## Concurrency Limits
+
+| Limit                   | Value                     |
+| ----------------------- | ------------------------- |
+| Max Concurrent Jobs     | 1                         |
+| Max Athena Concurrency  | See account service quota |
+| Max Fargate Concurrency | See account service quota |
+
+## Other Limitations
+
+- Only buckets with versioning set to **Enabled** are supported
+- Decompressed individual object size must be less than the Fargate task memory
+  limit (`DeletionTaskMemory`) specified when launching the stack
+- S3 Objects using the `GLACIER` or `DEEP_ARCHIVE` storage classes are not
+  supported and will be ignored
+- The bucket targeted by a data mapper must be in the same region as the Amazon
+  S3 Find and Forget deployment
+- Client-side encrypted S3 Objects are supported only when a symmetric customer
+  master key (CMK) is stored in AWS Key Management Service (AWS KMS) and
+  encrypted using one of the [AWS supported SDKs].
+- If the bucket targeted by a data mapper belongs to an account other than the
+  account that the Amazon S3 Find and Forget Solution is deployed in, only
+  SSE-KMS with a customer master key (CMK) may be used for encryption
+- To avoid race conditions when objects are processed by the solution,
+  manipulating existing data lake objects must not occur while a Job is running.
+  The solution will attempt to verify object integrity between read and write
+  operations and attempt to rollback any changes if an inconsistency is
+  detected. If the rollback fails, you will need to manually reconcile the
+  object versions to avoid data inconsistency or loss
+- We recommend that you avoid running a Deletion Job in parallel to a workload
+  that reads from the data lake unless it has been designed to handle temporary
+  inconsistencies between objects
+- Buckets with MFA Delete enabled are not supported
+- When the _Ignore object not found exceptions during deletion_ setting is
+  enabled, the solution will not delete old versions for ignored objects. Make
+  sure there is some mechanism for deleting these old versions to avoid
+  **retaining data longer than intended**.
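
Two of the bucket requirements above (versioning set to **Enabled**, MFA Delete not enabled) can be checked before creating a data mapper. The sketch below is a minimal example, assuming you pass it the dictionary returned by boto3's `get_bucket_versioning`; the helper name `versioning_problems` is hypothetical and not part of the solution.

```python
from typing import Any, List, Mapping


def versioning_problems(versioning: Mapping[str, Any]) -> List[str]:
    """List the reasons a bucket's versioning configuration is unsupported.

    `versioning` is expected to look like the response of
    boto3.client("s3").get_bucket_versioning(Bucket=...), where the "Status"
    and "MFADelete" keys are absent if they were never configured.
    """
    problems = []
    if versioning.get("Status") != "Enabled":
        problems.append("versioning is not Enabled")
    if versioning.get("MFADelete") == "Enabled":
        problems.append("MFA Delete is enabled")
    return problems
```

An empty list means the bucket passes these two checks only; the other limitations in this list still need to be reviewed separately.
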
+
+## Service Quotas
+
+If you wish to increase the number of concurrent queries that Athena can run
+and therefore speed up the Find phase, you will need to request a Service Quota
+increase for Athena. For more information, consult the [Athena Service Quotas]
+page. Similarly, to increase the number of concurrent Fargate tasks and
+therefore speed up the Forget phase, consult the [Fargate Service Quotas] page.
+When configuring the solution, you should not set an `AthenaConcurrencyLimit` or
+`DeletionTasksMaxNumber` greater than the respective Service Quota for your
+account.
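
The rule in the paragraph above can be expressed as a simple comparison. This is a sketch under stated assumptions: the helper name `settings_over_quota` is hypothetical, and the quota values must be looked up for your own account (the numbers in the usage note are placeholders, not real quotas).

```python
from typing import List, Mapping


def settings_over_quota(
    settings: Mapping[str, int], quotas: Mapping[str, int]
) -> List[str]:
    """Return the names of settings whose value exceeds the matching quota."""
    return sorted(
        name
        for name, value in settings.items()
        if name in quotas and value > quotas[name]
    )
```

For example, with placeholder quotas of 20 for `AthenaConcurrencyLimit` and 100 for `DeletionTasksMaxNumber`, settings of 25 and 3 respectively would flag only `AthenaConcurrencyLimit`.
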
+
+Amazon S3 Find and Forget is also bound by any other service quotas which apply
+to the underlying AWS services that it leverages. For more information, consult
+the AWS docs for [Service Quotas] and the relevant Service Quota page for the
+service in question:
+
+- [SQS Service Quotas]
+- [Step Functions Service Quotas]
+- [DynamoDB Service Quotas]
+
+[aws supported sdks]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryption.html
+[issue tracker]: https://github.com/awslabs/amazon-s3-find-and-forget/issues
+[service quotas]:
+  https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
+[athena service quotas]:
+  https://docs.aws.amazon.com/athena/latest/ug/service-limits.html
+[fargate service quotas]:
+  https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-quotas.html
+[step functions service quotas]:
+  https://docs.aws.amazon.com/step-functions/latest/dg/limits.html
+[sqs service quotas]:
+  https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-quotas.html
+[dynamodb service quotas]:
+  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
+[deletion job]: ARCHITECTURE.md#deletion-jobs
+[deletion queue]: ARCHITECTURE.md#deletion-queue
+[managing workgroups]:
+  https://docs.aws.amazon.com/athena/latest/ug/workgroups-create-update-delete.html
+[openx json serde]:
+  https://docs.aws.amazon.com/athena/latest/ug/json-serde.html#openx-json-serde

+ 140 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/LOCAL_DEVELOPMENT.md

@@ -0,0 +1,140 @@
+# Local Development
+
+This section details how to run the solution locally and deploy your code
+changes from the command line.
+
+## Pre-Requisites
+
+The following dependencies must be installed:
+
+- AWS CLI (v1)
+- Python >=3.9 and pip
+- node.js >= v16.11 and npm >= 8
+- virtualenv
+- Ruby >= 2.6
+- libsnappy-dev/snappy-devel (debian/centos)
+- docker
+- jq
+- Java JRE
+
+Once you have installed all pre-requisites, you must run the following command
+to create a `virtualenv` and install all frontend/backend dependencies before
+commencing development.
+
+```bash
+make setup
+```
+
+This command only needs to be run once.
+
+## Build and Deploy from Source
+
+To deploy the solution manually from source to your AWS account, run the
+following command:
+
+```bash
+make deploy \
+  REGION=<aws-region> \
+  ADMIN_EMAIL=<your-email-address> \
+  TEMP_BUCKET=<temp-bucket-name>
+```
+
+If you use KMS for client-side encryption, you'll also need to pass a
+`KMS_KEYARNS` environment variable to the `make deploy` script, containing the
+comma-delimited list of KMS key ARNs used for client-side encryption.
+
+> For information on how to obtain your subnet and security group IDs, see
+> [Configuring a VPC for the Solution](USER_GUIDE.md#configuring-a-vpc-for-the-solution).
+
+This will deploy the Amazon S3 Find and Forget solution using the AWS CLI
+profile of the current shell. By default this will be the profile `default`.
+
+The following commands are also available:
+
+- `make deploy-artefacts`: Packages and uploads the Forget task Docker image and
+  frontend React app to the solution bucket. This will trigger CodePipeline to
+  automatically deploy these artefacts
+- `make deploy-vpc`: Deploys only the VPC CloudFormation template
+- `make deploy-cfn`: Deploys only the CloudFormation template
+- `make redeploy-containers`: Manually packages and deploys the Forget task
+  Docker image to ECR via the AWS CLI rather than using CodePipeline.
+- `make redeploy-frontend`: Manually packages and deploys the frontend React app
+  to S3 via the AWS CLI rather than using CodePipeline.
+- `make start-frontend-remote`: Opens the frontend of the deployed Amazon S3
+  Find and Forget solution
+
+## Running Locally
+
+> **Important**: Running the frontend/forget task locally requires the solution
+> CloudFormation stack to be deployed. For more info, see
+> [Build and Deploy From Source](#build-and-deploy-from-source)
+
+To run the frontend locally, run the following commands:
+
+- `make setup-frontend-local-dev`: Downloads a copy of the configuration file
+  required for the frontend app to run locally
+- `make start-frontend-local`: Runs the frontend app locally on `localhost:3000`
+
+> In order to allow your locally running frontend to connect to the deployed
+> API, you will need to set the `AccessControlAllowOriginOverride` parameter
+> to \* when deploying the solution stack
+
+To run the "Forget" task locally using Docker, run the following command:
+
+```bash
+docker build -f backend/ecs_tasks/delete_files/Dockerfile -t s3f2 .
+make run-local-container ROLE_NAME=<your-sqs-access-role-name>
+```
+
+The container needs to connect to the deletion queue deployed by the solution
+and therefore AWS credentials are required in the container environment. You
+will need to setup an IAM role which has access to process messages from the
+queue and provide the role name as an input. The above command will perform STS
+Assume Role via the AWS CLI using `ROLE_NAME` as the target role in order to
+obtain temporary credentials. These temporary credentials will be injected into
+the container as environment variables.
+
+The command uses your default CLI profile to assume the role. You can override
+the profile being used as follows:
+
+```bash
+make run-local-container ROLE_NAME=<your-sqs-access-role-name> AWS_PROFILE=my-profile
+```
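
The credential injection described above can be pictured as mapping an STS Assume Role response onto the standard AWS environment variables. The sketch below is an illustration only: the Makefile's actual mechanism is not shown in this guide, and the helper names `credentials_to_env` and `docker_env_args` are hypothetical.

```python
from typing import Any, Dict, List, Mapping


def credentials_to_env(assume_role_response: Mapping[str, Any]) -> Dict[str, str]:
    """Map the "Credentials" of an STS AssumeRole response to AWS env vars."""
    credentials = assume_role_response["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
        "AWS_SESSION_TOKEN": credentials["SessionToken"],
    }


def docker_env_args(env: Mapping[str, str]) -> List[str]:
    """Build `-e NAME=value` arguments suitable for a `docker run` command."""
    args: List[str] = []
    for name in sorted(env):
        args += ["-e", f"{name}={env[name]}"]
    return args
```

Because the credentials are temporary, a long-running local container may need to be restarted with fresh credentials once they expire.
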
+
+#### Run Tests
+
+> **Important**: Running acceptance tests requires the solution CloudFormation
+> stack to be deployed. For more info, see
+> [Build and Deploy From Source](#build-and-deploy-from-source)
+
+The following commands are available for running tests:
+
+- `make test`: Run all unit and acceptance tests for the backend and frontend.
+- `make test-acceptance-cognito`: Run all backend task acceptance tests using
+  Cognito authentication
+- `make test-acceptance-iam`: Run all backend task acceptance tests using IAM
+  authentication
+- `make test-cfn`: Run CloudFormation related unit tests
+- `make test-unit`: Run all backend task unit tests
+- `make test-frontend`: Run all frontend tests
+
+> Note: some acceptance tests require a KMS Symmetric Key to be created in
+> advance and specified during the solution's deployment.
+
+#### Updating Python Library Dependencies
+
+In this project, Python library dependencies are stored in two forms:
+
+1. `requirements.in` is a hand-managed file, and may contain loose version
+   specifications
+2. `requirements.txt` is a machine-generated file, generated by the former,
+   which contains strict versions for all required dependencies
+
+When running Make, if it detects a change to the `requirements.in` file it will
+automatically regenerate `requirements.txt` for you using the latest published
+versions of libraries. You can also manually trigger this by running Make with
+the requirements.txt file as the target (for instance,
+`make ./backend/ecs_tasks/delete_files/requirements.txt`).
+
+For advanced use-cases, you can use `pip-compile` outside of Make to change
+specific library versions without regenerating the entire file.

+ 77 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/MONITORING.md

@@ -0,0 +1,77 @@
+# Monitoring
+
+### Key Metrics
+
+The following metrics are important indicators of the health of the Amazon S3
+Find and Forget Solution:
+
+- `AWS/SQS - ApproximateNumberOfMessagesVisible` for the Object Deletion Queue
+  DLQ. Any value > 0 for this metric indicates that 1 or more objects could not
+  be processed during a deletion job. The job which triggered the message(s) to
+  be put in the queue will have a status of **COMPLETED_WITH_ERRORS** and the
+  `ObjectUpdateFailed` event(s) will contain further debugging information.
+- `AWS/SQS - ApproximateNumberOfMessagesVisible` for the Events DLQ. Any value >
+  0 for this metric indicates that 1 or more Job Events could not be processed.
+- `AWS/Athena - ProcessedBytes/TotalExecutionTime`. If the average processed
+  bytes and/or total execution time per query is rising, it may indicate that
+  the average partition size is also growing. This is not an issue per se,
+  however if partitions grow too large (or your dataset is unpartitioned), you
+  may eventually encounter Athena errors.
+- `AWS/States - ExecutionsFailed`. State machine executions failing indicates
+  that the Amazon S3 Find and Forget solution is misconfigured. To resolve
+  this, find the State Machine execution which failed and investigate the cause
+  of the failure.
+- `AWS/States - ExecutionsTimedOut`. State machine timeouts indicate that Amazon
+  S3 Find and Forget is unable to complete a job before Step Functions kills the
+  execution due to it exceeding the allowed execution time limit. See
+  [Troubleshooting] for more details.
+
+If required, you can create CloudWatch Alarms for any of the aforementioned
+metrics to be notified of potential solution misconfiguration.
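
As one example of such an alarm, the sketch below builds the parameters for an alarm that fires when a dead letter queue holds any visible messages. It is a minimal example under stated assumptions: the helper name `dlq_alarm_params`, the alarm name, the period and the missing-data behaviour are illustrative choices, and the queue name must be taken from your own deployment.

```python
from typing import Any, Dict


def dlq_alarm_params(queue_name: str, alarm_name: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds; an illustrative choice
        "EvaluationPeriods": 1,
        "Threshold": 0.0,  # any value > 0 indicates a failure
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means empty
    }


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **dlq_alarm_params("<your-dlq-name>", "s3f2-dlq-not-empty"))
```

To be notified when the alarm fires, you would also pass `AlarmActions` (for instance an SNS topic ARN), which is omitted here.
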
+
+### Service Level Monitoring
+
+All standard metrics for the services used by the Amazon S3 Find and Forget
+Solution are available. For detailed information about the metrics and logging
+for a given service, view the relevant Monitoring docs for that service. The key
+services used by the solution:
+
+- [Lambda Metrics] / [Lambda Logging]
+- [ECS Metrics] / [ECS Logging] <sup>1</sup>
+- [Athena Metrics] <sup>2</sup>
+- [Step Functions Metrics]
+- [SQS Metrics]
+- [DynamoDB Metrics]
+- [S3 Metrics]
+
+<sup>1</sup> CloudWatch Container Insights can be enabled when deploying the
+solution by setting `EnableContainerInsights` to `true`. Using Container
+Insights will incur additional charges. It is disabled by default.
+
+<sup>2</sup> To obtain Athena metrics, you will need to enable metrics for the
+workgroup you are using to execute the queries as described [in the Athena
+docs][athena metrics]. By default the solution uses the **primary** workgroup,
+however you can change this when deploying the stack using the `AthenaWorkGroup`
+parameter.
+
+[lambda metrics]:
+  https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions-metrics.html
+[lambda logging]:
+  https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions-logs.html
+[ecs metrics]:
+  https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html
+[ecs logging]:
+  https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html#viewing_awslogs
+[ecs container insights]:
+  https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-container-insights.html
+[step functions metrics]:
+  https://docs.aws.amazon.com/step-functions/latest/dg/procedure-cw-metrics.html#cloudwatch-step-functions-execution-metrics
+[athena metrics]:
+  https://docs.aws.amazon.com/athena/latest/ug/query-metrics-viewing.html
+[dynamodb metrics]:
+  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/monitoring-cloudwatch.html
+[s3 metrics]:
+  https://docs.aws.amazon.com/AmazonS3/latest/dev/cloudwatch-monitoring.html
+[sqs metrics]:
+  https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-monitoring-using-cloudwatch.html
+[troubleshooting]: ./TROUBLESHOOTING.md

+ 74 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/PRODUCTION_READINESS_GUIDELINES.md

@@ -0,0 +1,74 @@
+# Production Readiness Guidelines
+
+It is important to conduct your own testing prior to using this solution with
+production data. The following guidelines provide steps you can follow to
+mitigate against unexpected behaviours such as unwanted data loss, or unexpected
+high spend that could arise by using the solution in an incompatible
+configuration.
+
+## 1. Review the Solution Limits
+
+Consult the [Limits] guide to check your datasets are in a supported format and
+your S3 bucket configuration is compatible with the solution requirements.
+
+## 2. Learn about costs
+
+Consult the [Cost Overview guide] to learn about the costs of running the
+solution, and ways to set spend limits.
+
+## 3. Deploy the solution in a test environment
+
+We recommend first evaluating the solution by deploying it in an AWS account you
+use for testing, with a sample of your dataset. After configuring the solution,
+identify a set of queries to run against your dataset before and after [running
+a Deletion Job].
+
+> **Note:** You don't need to have a full copy of each dataset, but we recommend
+> using at least the same schema to make sure the test queries are as close to
+> production as possible.
+
+## 4. Run your test queries
+
+These are examples of test queries:
+
+- Count the total number of rows in a dataset (A)
+- Count the number of rows that need to be deleted from the same dataset (B)
+- Run a query to fetch one or more rows that won't be affected by deletion but
+  contained in an object that will be rewritten because of other rows (C)
+
+After running a deletion job:
+
+- Repeat the first 2 queries to make sure the row count is correct:
+  A<sub>1</sub>=A<sub>0</sub>-B<sub>0</sub> and B<sub>1</sub>=0
+- Repeat the third query to ensure the rows have been re-written without
+  affecting their schema (for instance, there is no unwanted type coercion
+  against `date` or `number` types): C<sub>1</sub>=C<sub>0</sub>
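
The three comparisons above can be written down as a small check that takes the query results from before and after the deletion job. This is a sketch, not part of the solution: the function name `deletion_check_failures` and its parameters are hypothetical, and the results are assumed to have been fetched by your own queries.

```python
from typing import Any, List, Sequence


def deletion_check_failures(
    total_before: int,           # A0: total rows before the job
    matches_before: int,         # B0: rows expected to be deleted
    sample_before: Sequence[Any],  # C0: unaffected rows in rewritten objects
    total_after: int,            # A1
    matches_after: int,          # B1
    sample_after: Sequence[Any],   # C1
) -> List[str]:
    """Return a description of every check that did not hold."""
    failures = []
    if total_after != total_before - matches_before:
        failures.append("A1 != A0 - B0")
    if matches_after != 0:
        failures.append("B1 != 0")
    # Comparing full rows also surfaces unwanted type coercion.
    if list(sample_after) != list(sample_before):
        failures.append("C1 != C0")
    return failures
```

An empty list means all three checks passed. Comparing values with their types intact (rather than as formatted strings) makes coercion of `date` or `number` columns easier to spot.
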
+
+If any error occurs or data doesn't match, review the [troubleshooting guide] or
+check for [existing issues]. If you cannot find a resolution, feel free to [open
+an issue].
+
+## 5. Identify your own extra requirements
+
+These guidelines are provided as suggested steps to identify your own acceptance
+criteria, but they are not intended to be an exhaustive list. You should
+consider testing any other factors that may apply to your workload before moving
+to production. If you have any questions, please [open an issue]. We appreciate
+your feedback.
+
+## 6. Deploy the solution in production
+
+For greater confidence, consider repeating the test queries in production
+before and after a deletion job. For extra safety, you can configure your data
+mappers to **not** delete the previous versions of objects after write, so that
+if anything goes wrong you can manually recover older versions of the objects.
+Remember to turn the setting back on after you finish testing and, if desired,
+manually delete the previous versions.
+
+[cost overview guide]: COST_OVERVIEW.md
+[existing issues]: https://github.com/awslabs/amazon-s3-find-and-forget/issues
+[limits]: LIMITS.md
+[open an issue]: https://github.com/awslabs/amazon-s3-find-and-forget/issues
+[running a deletion job]: USER_GUIDE.md#running-a-deletion-job
+[troubleshooting guide]: TROUBLESHOOTING.md

+ 47 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/SECURITY.md

@@ -0,0 +1,47 @@
+# Security
+
+When you build systems on AWS infrastructure, security responsibilities are
+shared between you and AWS. This shared model can reduce your operational burden
+as AWS operates, manages, and controls the components from the host operating
+system and virtualization layer down to the physical security of the facilities
+in which the services operate. For more information about security on AWS, visit
+the [AWS Security Center].
+
+## IAM Roles
+
+AWS Identity and Access Management (IAM) roles enable customers to assign
+granular access policies and permissions to services and users on AWS. This
+solution creates several IAM roles, including roles that grant the solution’s
+AWS Lambda functions access to the other AWS services used in this solution.
+
+## Amazon Cognito
+
+Amazon Cognito is used for managing access to the web user interface and the
+API. For more information, consult [Accessing the application].
+
+Amazon Cognito offers an option to enable Multi-Factor Authentication (MFA).
+Follow the instructions found
+[here](https://docs.aws.amazon.com/cognito/latest/developerguide/user-pool-settings-mfa.html)
+to add MFA to your deployment. Do not make any changes to the User Pool other
+than the MFA setting, as other changes may cause problems when upgrading or
+updating the solution in the future.
+
+## Amazon CloudFront
+
+This solution deploys a static website hosted in an Amazon S3 bucket. To help
+reduce latency and improve security, this solution includes an Amazon CloudFront
+distribution with an origin access identity, which is a special CloudFront user
+that helps restrict access to the solution’s website bucket contents. For more
+information, see [Restricting Access to Amazon S3 Content by Using an Origin
+Access Identity].
+
+If you wish to increase the security of the web user interface or the API, we
+recommend considering integrating [AWS WAF], which gives you control over how
+traffic reaches your applications by enabling you to create security rules such
+as filtering out specific traffic patterns you define.
+
+[aws security center]: https://aws.amazon.com/security
+[aws waf]: https://aws.amazon.com/waf
+[accessing the application]: USER_GUIDE.md#accessing-the-application
+[restricting access to amazon s3 content by using an origin access identity]:
+  https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-s3.html

+ 178 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/TROUBLESHOOTING.md

@@ -0,0 +1,178 @@
+# Troubleshooting
+
+This section outlines steps to assist you with resolving issues deploying,
+configuring and using the Amazon S3 Find and Forget solution.
+
+If you're unable to resolve an issue using this information you can
+[report the issue on GitHub](../CONTRIBUTING.md#reporting-bugsfeature-requests).
+
+### Expected Results Not Found
+
+If the Find phase does not identify the expected objects for the matches in the
+deletion queue, verify the following:
+
+- You have chosen the relevant data mappers for the matches in the deletion
+  queue.
+- Your data mappers are referencing the correct S3 locations.
+- Your data mappers have been configured to search the correct columns.
+- All partitions have been loaded into the Glue Data Catalog.
+
+### Job appears stuck in QUEUED/RUNNING status
+
+If a job remains in a QUEUED or RUNNING status for much longer than expected,
+there may be an issue relating to:
+
+- AWS Fargate accessing the ECR service endpoint. Enabling the required network
+  access from the subnets/security groups in which Forget Fargate tasks are
+  launched will unblock the job without requiring manual intervention. For more
+  information see [VPC Configuration] in the [User Guide].
+- Errors in job table stream processor.
+  [Check the logs](https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions-logs.html)
+  of the stream processor Lambda function for errors.
+- Unhandled state machine execution errors. If there are no errors in the job
+  event history which indicate an issue, check the state machine execution
+  history of the execution with the same name as the blocked job ID.
+- The containers have exhausted memory or vCPU capacity while processing large
+  (GB-sized or larger) files. See also
+  [Service Level Monitoring](MONITORING.md#service-level-monitoring).
+
+If the state machine is still executing but in a non-recoverable state, you can
+stop the state machine execution manually which will trigger an Exception job
+event — the job will enter a `FAILED` status.
+
+If this doesn't resolve the issue or the execution isn't running, you can
+manually update the job status to FAILED or remove the job and any associated
+events from the Jobs table<sup>\*</sup>.
+
+<sup>\*</sup> **WARNING:** You should manually intervene only when there has
+been a fatal error from which the system cannot recover.
+
+### Job status: COMPLETED_CLEANUP_FAILED
+
+A `COMPLETED_CLEANUP_FAILED` status indicates that the job has completed, but an
+error occurred when removing the processed matches from the deletion queue.
+
+Some possible causes for this are:
+
+- The stream processor Lambda function does not have permissions to manipulate
+  the DynamoDB table.
+- The item has been manually removed from the deletion queue table via a direct
+  call to the DynamoDB API.
+
+You can find more details of the cause by checking the job event history for a
+**CleanupFailed** event, then viewing the event data.
+
+As the processed matches will still be on the queue, you can choose to either:
+
+- Manually remove the processed matches via the solution web interface or APIs.
+- Take no action — the matches will remain in the queue and be re-processed
+  during the next deletion job run.
+
+### Job status: FAILED
+
+A `FAILED` status indicates that the job has terminated due to a generic
+exception.
+
+Some possible causes for this are:
+
+- One of the tasks in the main step function failed.
+- There was a permissions issue encountered in one of the solution components.
+- The state machine execution has timed out, or has exceeded the service quota
+  for state machine execution history.
+
+To find information on what caused the failure, check the deletion job log for
+an **Exception** event and inspect that event's event data.
+
+Errors relating to Step Functions such as timeouts or exceeding the permitted
+execution history length, may be resolvable by increasing the waiter
+configuration as described in [Performance Configuration].
+
+### Job status: FIND_FAILED
+
+A `FIND_FAILED` status indicates that the job has terminated because one or more
+data mapper queries failed to execute.
+
+If you are using Athena and Glue as data mappers, you should first verify the
+following:
+
+- You have granted permissions to the Athena IAM role for access to the S3
+  buckets referenced by your data mappers **and** any AWS KMS keys used to
+  encrypt the S3 objects. For more information see [Permissions Configuration]
+  in the [User Guide].
+- The concurrency setting for the solution does not exceed the limits for
+  concurrent Athena queries for your AWS account or the Athena workgroup the
+  solution is configured to use. For more information see [Performance
+  Configuration] in the [User Guide].
+- Your data is compatible within the [solution limits].
+
+If you made any changes whilst verifying the prior points, you should attempt to
+run a new deletion job.
+
+To find further details of the cause of the failure you should inspect the
+deletion job log and inspect the event data for any **QueryFailed** events.
+
+Athena queries may fail if the length of a query sent to Athena exceeds the
+Athena query string length limit (see [Athena Service Quotas]). If queries are
+failing for this reason, you will need to reduce the number of matches queued
+when running a deletion job.
+
+To troubleshoot Athena queries further, find the `QueryId` from the event data
+and match this to the query in the [Athena Query History]. You can use the
+[Athena Troubleshooting] guide for Athena troubleshooting steps.
+
+### Job status: FORGET_FAILED
+
+A `FORGET_FAILED` status indicates that the job has terminated because a fatal
+error occurred during the _forget_ phase of the job. S3 objects _may_ have been
+modified.
+
+Check the job log for a **ForgetPhaseFailed** event. Examining the event data
+for this event will provide you with more information about the underlying cause
+of the failure.
+
+### Job status: FORGET_PARTIALLY_FAILED
+
+A `FORGET_PARTIALLY_FAILED` status indicates that the job has completed, but
+that the _forget_ phase was unable to process one or more objects.
+
+Each object that was not correctly processed will result in a message sent to
+the object dead letter queue ("DLQ"; see `DLQUrl` in the CloudFormation stack
+outputs) and an **ObjectUpdateFailed** event in the job event history containing
+error information. Check the content of any **ObjectUpdateFailed** events to
+ascertain the root cause of an issue.
+
+Verify the following:
+
+- No other processes created a new version of existing objects while the job was
+  running. When the system creates a new version of an object, an integrity
+  check is performed to verify that during processing, no new versions of the
+  object were created and that a delete marker for the object was not created.
+  If either case is detected, an **ObjectUpdateFailed** event will be present in
+  the job event history and a rollback will be attempted. If the rollback
+  fails, an **ObjectRollbackFailed** event will be present in the job event
+  history containing error information.
+- You have granted permissions to the Fargate task IAM role for access to the S3
+  buckets referenced by your data mappers **and** any AWS KMS keys used to
+  encrypt the data. For more information see [Permissions Configuration] in the
+  [User Guide].
+- You have configured the VPC used for the Fargate tasks according to the [VPC
+  Configuration] section.
+- Your data is compatible within the [solution limits].
+- Your data is not corrupted.
+
+To reprocess the objects, run a new deletion job.
+
+[user guide]: USER_GUIDE.md
+[vpc configuration]:
+  USER_GUIDE.md#pre-requisite-configuring-a-vpc-for-the-solution
+[permissions configuration]: USER_GUIDE.md#granting-access-to-data
+[performance configuration]: USER_GUIDE.md#adjusting-performance-configuration
+[athena service quotas]:
+  https://docs.aws.amazon.com/athena/latest/ug/service-limits.html
+[athena query history]:
+  https://docs.aws.amazon.com/athena/latest/ug/querying.html#queries-viewing-history
+[athena troubleshooting]:
+  https://docs.aws.amazon.com/athena/latest/ug/troubleshooting.html
+[solution limits]: LIMITS.md
+[cloudwatch container insights]:
+  https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html

+ 49 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/UPGRADE_GUIDE.md

@@ -0,0 +1,49 @@
+# Upgrade Guide
+
+## Migrating from <=v0.24 to v0.25
+
+Prior to v0.25, the Deletion Queue was synchronously processed on Job Creation
+and stored in DynamoDB. As a result, the job API provided the full queue for a
+given job in the `DeletionQueueItems` property and there was a limit of ~375KB
+on the queue size for each individual Job. If the size for a given job would
+have exceeded the allowed space, the `DeletionQueueItemsSkipped` property would
+have been set to `true` and it would have been necessary to run one or more
+deletion jobs, upon completion, to process the whole queue.
+
+Starting from v0.25, the queue is processed asynchronously after job creation
+and is stored in S3 in order to remove the queue limit. As a result:
+
+1. The `DeletionQueueItemsSkipped` and `DeletionQueueItems` fields are removed
+   from the `GET /jobs/{job_id}` and `DELETE /queue` APIs.
+2. A new Job Event is created when the Query Planning ends called
+   `QueryPlanningComplete` that contains details of the query planning phase.
+3. After Query Planning, the `QueryPlanningComplete` event payload is available
+   in the `GET /jobs/{job_id}` API for lookup of the properties:
+   - `GeneratedQueries` is the number of queries planned for execution
+   - `DeletionQueueSize` is the size of the queue for the Job
+   - `Manifests` is an array of S3 Objects containing the location for the Job
+     manifests. There is a manifest for each combination of `JobId` and
+     `DataMapperId`, and each manifest contains the full queue including the
+     MatchIds.
+4. The manifests follow the same expiration policy as the Job Details (they will
+   get automatically removed if the `JobDetailsRetentionDays` parameter is
+   configured when installing the solution).
+5. If you relied on the removed `DeletionQueueItems` parameter to inspect the
+   Job's queue, you'll need to migrate to fetching the S3 Manifests or querying
+   the AWS Glue Manifests Table.
+6. The deletion queue items are not visible in the UI anymore in the job details
+   page or in the job JSON export.
+
+## Migrating from <=v0.8 to v0.9
+
+The default behaviour of the solution has been changed in v0.9 to deploy and use
+a purpose-built VPC when creating the solution CloudFormation stack.
+
+If you have deployed the standalone VPC stack provided in previous versions, you
+should set `DeployVpc` to **true** when upgrading to v0.9 and input the same
+values for the `FlowLogsGroup` and `FlowLogsRoleArn` parameters that were used
+when deploying the standalone VPC stack. After the deployment of v0.9 is
+complete, you should delete the old VPC stack.
+
+To continue using an existing VPC, you must set `DeployVpc` to **false** when
+upgrading to v0.9.

+ 942 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/USER_GUIDE.md

@@ -0,0 +1,942 @@
+# User Guide
+
+This section describes how to install, configure and use the Amazon S3 Find and
+Forget solution.
+
+## Index
+
+- [User Guide](#user-guide)
+  - [Index](#index)
+  - [Pre-requisites](#pre-requisites)
+    - [Configuring a VPC for the Solution](#configuring-a-vpc-for-the-solution)
+      - [Creating a New VPC](#creating-a-new-vpc)
+      - [Using an Existing VPC](#using-an-existing-vpc)
+    - [Provisioning Data Access IAM Roles](#provisioning-data-access-iam-roles)
+  - [Deploying the Solution](#deploying-the-solution)
+  - [Accessing the application](#accessing-the-application)
+    - [Logging in for the first time (only relevant if the Web UI is deployed)](#logging-in-for-the-first-time-only-relevant-if-the-web-ui-is-deployed)
+    - [Managing users (only relevant if Cognito is chosen for authentication)](#managing-users-only-relevant-if-cognito-is-chosen-for-authentication)
+    - [Making authenticated API requests](#making-authenticated-api-requests)
+      - [Cognito](#cognito)
+      - [IAM](#iam)
+    - [Integrating the solution with other applications using CloudFormation stack outputs](#integrating-the-solution-with-other-applications-using-cloudformation-stack-outputs)
+  - [Configuring Data Mappers](#configuring-data-mappers)
+    - [AWS Lake Formation Configuration](#aws-lake-formation-configuration)
+    - [Data Mapper Creation](#data-mapper-creation)
+  - [Granting Access to Data](#granting-access-to-data)
+    - [Updating your Bucket Policy](#updating-your-bucket-policy)
+    - [Data Encrypted with a Customer Managed CMK](#data-encrypted-with-a-customer-managed-cmk)
+  - [Adding to the Deletion Queue](#adding-to-the-deletion-queue)
+  - [Running a Deletion Job](#running-a-deletion-job)
+    - [Deletion Job Statuses](#deletion-job-statuses)
+    - [Deletion Job Event Types](#deletion-job-event-types)
+  - [Adjusting Configuration](#adjusting-configuration)
+  - [Updating the Solution](#updating-the-solution)
+    - [Identify current solution version](#identify-current-solution-version)
+    - [Identify the Stack URL to deploy](#identify-the-stack-url-to-deploy)
+    - [Minor Upgrades: Perform CloudFormation Stack Update](#minor-upgrades-perform-cloudformation-stack-update)
+    - [Major Upgrades: Manual Rolling Deployment](#major-upgrades-manual-rolling-deployment)
+  - [Deleting the Solution](#deleting-the-solution)
+
+## Pre-requisites
+
+### Configuring a VPC for the Solution
+
+The Fargate tasks used by this solution to perform deletions must be able to
+access the following AWS services, either via an Internet Gateway or via [VPC
+Endpoints]:
+
+- Amazon S3 (gateway endpoint _com.amazonaws.**region**.s3_)
+- Amazon DynamoDB (gateway endpoint _com.amazonaws.**region**.dynamodb_)
+- Amazon CloudWatch Monitoring (interface endpoint
+  _com.amazonaws.**region**.monitoring_) and Logs (interface endpoint
+  _com.amazonaws.**region**.logs_)
+- AWS ECR API (interface endpoint _com.amazonaws.**region**.ecr.api_) and Docker
+  (interface endpoint _com.amazonaws.**region**.ecr.dkr_)
+- Amazon SQS (interface endpoint _com.amazonaws.**region**.sqs_)
+- AWS STS (interface endpoint _com.amazonaws.**region**.sts_)
+- AWS KMS (interface endpoint _com.amazonaws.**region**.kms_) - **required only
+  if S3 Objects are encrypted using AWS KMS client-side encryption**
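
To check an existing VPC against this list, it can help to print the expected
endpoint service names for a region. The following is a minimal sketch; the
function name `required_endpoint_services` and its arguments are illustrative
and are not part of the solution.

```bash
# Print the VPC endpoint service names listed above for one region.
# Usage: required_endpoint_services REGION [INCLUDE_KMS]
# (hypothetical helper; INCLUDE_KMS defaults to "false")
required_endpoint_services() {
  local region="$1" include_kms="${2:-false}"
  local suffix
  for suffix in s3 dynamodb monitoring logs ecr.api ecr.dkr sqs sts; do
    printf 'com.amazonaws.%s.%s\n' "$region" "$suffix"
  done
  # KMS is only required when objects use AWS KMS client-side encryption.
  if [ "$include_kms" = "true" ]; then
    printf 'com.amazonaws.%s.kms\n' "$region"
  fi
}
```

For example, `required_endpoint_services eu-west-1 true` prints nine service
names. These can be compared with the endpoints present in a VPC, as returned by
`aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values="$VPC_ID" --query 'VpcEndpoints[*].ServiceName'`.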
+
+#### Creating a New VPC
+
+By default the CloudFormation template will create a new VPC that has been
+purpose-built for the solution. The VPC includes VPC endpoints for the
+aforementioned services, and does not provision internet connectivity.
+
+You can use the provided VPC to operate the solution with no further
+customisations. However, if you have more complex requirements it is recommended
+to use an existing VPC as described in the following section.
+
+#### Using an Existing VPC
+
+Amazon S3 Find and Forget can also be used in an existing VPC. You may want to
+do this if you have requirements that aren't met by using the VPC provided with
+the solution.
+
+To use an existing VPC, set the `DeployVpc` parameter to `false` when launching
+the solution CloudFormation stack. You must also specify the subnets and
+security groups that the Fargate tasks will use by setting the `VpcSubnets` and
+`VpcSecurityGroups` parameters respectively.
+
+The subnets and security groups that you specify must allow the tasks to connect
+to the aforementioned AWS services.
+
+You can obtain your subnet and security group IDs from the AWS Console or by
+using the AWS CLI. If using the AWS CLI, you can use the following command to
+get a list of VPCs:
+
+```bash
+aws ec2 describe-vpcs \
+  --query 'Vpcs[*].{ID:VpcId,Name:Tags[?Key==`Name`].Value | [0], IsDefault: IsDefault}'
+```
+
+Once you have found the VPC you wish to use, to get a list of subnets and
+security groups in that VPC:
+
+```bash
+export VPC_ID=<chosen-vpc-id>
+aws ec2 describe-subnets \
+  --filter Name=vpc-id,Values="$VPC_ID" \
+  --query 'Subnets[*].{ID:SubnetId,Name:Tags[?Key==`Name`].Value | [0],AZ:AvailabilityZone}'
+aws ec2 describe-security-groups \
+  --filter Name=vpc-id,Values="$VPC_ID" \
+  --query 'SecurityGroups[*].{ID:GroupId,Name:GroupName}'
+```
+
+### Provisioning Data Access IAM Roles
+
+The Fargate tasks used by this solution to perform deletions require a specific
+IAM role to exist in each account that owns a bucket that you will use with the
+solution. The role must have the exact name **S3F2DataAccessRole** (no path). A
+CloudFormation template is available as part of this solution which can be
+deployed separately to the main stack in each account. A way to deploy this role
+to many accounts, for example across your organization, is to use [AWS
+CloudFormation StackSets].
+
+To deploy this template manually, use the IAM Role Template "Deploy to AWS
+button" in [Deploying the Solution](#deploying-the-solution) then follow steps
+5-9. The **Outputs** tab will contain the Role ARN which you will need when
+adding data mappers.
+
+You will need to grant this role read and write access to your data. We
+recommend you do this using a bucket policy. For more information, see
+[Granting Access to Data](#granting-access-to-data).
+
+## Deploying the Solution
+
+The solution is deployed as an
+[AWS CloudFormation](https://aws.amazon.com/cloudformation) template and should
+take about 20 to 40 minutes to deploy.
+
+Your access to the AWS account must have IAM permissions to launch AWS
+CloudFormation templates that create IAM roles and to create the solution
+resources.
+
+> **Note** You are responsible for the cost of the AWS services used while
+> running this solution. For full details, see the pricing pages for each AWS
+> service you will be using in this sample. Prices are subject to change.
+
+1. Deploy the latest CloudFormation template using the AWS Console by choosing
+   the "_Launch Template_" button below for your preferred AWS region. If you
+   wish to [deploy using the AWS CLI] instead, you can refer to the "_Template
+   Link_" to download the template files.
+
+| Region                                     | Launch Template                                                                                                                                                                                                                                   | Template Link                                                                                                                   | Launch IAM Role Template                                                                                                                                                                                                                           | IAM Role Template Link                                                                                                      |
+| ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
+| **US East (N. Virginia)** (us-east-1)      | [Launch](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)                | [Link](https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)           | [Launch](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)                | [Link](https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)           |
+| **US East (Ohio)** (us-east-2)             | [Launch](https://console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-us-east-2.s3.us-east-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)                | [Link](https://solution-builders-us-east-2.s3.us-east-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)           | [Launch](https://console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-us-east-2.s3.us-east-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)                | [Link](https://solution-builders-us-east-2.s3.us-east-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)           |
+| **US West (Oregon)** (us-west-2)           | [Launch](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-us-west-2.s3.us-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)                | [Link](https://solution-builders-us-west-2.s3.us-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)           | [Launch](https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-us-west-2.s3.us-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)                | [Link](https://solution-builders-us-west-2.s3.us-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)           |
+| **Asia Pacific (Sydney)** (ap-southeast-2) | [Launch](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml) | [Link](https://solution-builders-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml) | [Launch](https://console.aws.amazon.com/cloudformation/home?region=ap-southeast-2#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml) | [Link](https://solution-builders-ap-southeast-2.s3.ap-southeast-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml) |
+| **Asia Pacific (Tokyo)** (ap-northeast-1)  | [Launch](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-ap-northeast-1.s3.ap-northeast-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml) | [Link](https://solution-builders-ap-northeast-1.s3.ap-northeast-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml) | [Launch](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-ap-northeast-1.s3.ap-northeast-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml) | [Link](https://solution-builders-ap-northeast-1.s3.ap-northeast-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml) |
+| **EU (Ireland)** (eu-west-1)               | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-eu-west-1.s3.eu-west-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)                | [Link](https://solution-builders-eu-west-1.s3.eu-west-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)           | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-eu-west-1.s3.eu-west-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)                | [Link](https://solution-builders-eu-west-1.s3.eu-west-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)           |
+| **EU (London)** (eu-west-2)                | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-west-2#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-eu-west-2.s3.eu-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)                | [Link](https://solution-builders-eu-west-2.s3.eu-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)           | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-west-2#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-eu-west-2.s3.eu-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)                | [Link](https://solution-builders-eu-west-2.s3.eu-west-2.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)           |
+| **EU (Frankfurt)** (eu-central-1)          | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-central-1#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-eu-central-1.s3.eu-central-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)       | [Link](https://solution-builders-eu-central-1.s3.eu-central-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)     | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-central-1#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-eu-central-1.s3.eu-central-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)       | [Link](https://solution-builders-eu-central-1.s3.eu-central-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)     |
+| **EU (Stockholm)** (eu-north-1)            | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-north-1#/stacks/new?stackName=S3F2&templateURL=https://solution-builders-eu-north-1.s3.eu-north-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)             | [Link](https://solution-builders-eu-north-1.s3.eu-north-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml)         | [Launch](https://console.aws.amazon.com/cloudformation/home?region=eu-north-1#/stacks/new?stackName=S3F2-Role&templateURL=https://solution-builders-eu-north-1.s3.eu-north-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)             | [Link](https://solution-builders-eu-north-1.s3.eu-north-1.amazonaws.com/amazon-s3-find-and-forget/latest/role.yaml)         |
+
+2. If prompted, log in using your AWS account credentials.
+3. You should see a screen titled "_Create Stack_" at the "_Specify template_"
+   step. The fields specifying the CloudFormation template are pre-populated.
+   Choose the _Next_ button at the bottom of the page.
+4. On the "_Specify stack details_" screen you should provide values for the
+   following parameters of the CloudFormation stack:
+
+   - **Stack Name:** (Default: S3F2) This is the name that is used to refer to
+     this stack in CloudFormation once deployed.
+   - **AdminEmail:** The email address you wish to set up as the initial user of
+     this Amazon S3 Find and Forget deployment.
+   - **DeployWebUI:** (Default: true) Whether to deploy the Web UI as part of
+     the solution. If set to **true**, the AuthMethod parameter must be set to
+     **Cognito**. If set to **false**, interaction with the solution is
+     performed via the API Gateway only.
+   - **AuthMethod:** (Default: Cognito) The authentication method to be used for
+     the solution. Must be set to **Cognito** if DeployWebUI is true.
+
+   The following parameters are optional and allow further customisation of the
+   solution if required:
+
+   - **DeployVpc:** (Default: true) Whether to deploy the solution provided VPC.
+     If you wish to use your own VPC, set this value to false. The solution
+     provided VPC uses VPC Endpoints to access the required services which will
+     incur additional costs. For more details, see the [VPC Endpoint Pricing]
+     page.
+   - **VpcSecurityGroups:** (Default: "") List of security group IDs to apply to
+     Fargate deletion tasks. For more information on how to obtain these IDs,
+     see
+     [Configuring a VPC for the Solution](#configuring-a-vpc-for-the-solution).
+     If _DeployVpc_ is true, this parameter is ignored.
+   - **VpcSubnets:** (Default: "") List of subnets to run Fargate deletion tasks
+     in. For more information on how to obtain these IDs, see
+     [Configuring a VPC for the Solution](#configuring-a-vpc-for-the-solution).
+     If _DeployVpc_ is true, this parameter is ignored.
+   - **FlowLogsGroup**: (Default: "") If using the solution provided VPC,
+     defines the CloudWatch Log group which should be used for flow logs. If not
+     set, flow logs will not be enabled. If _DeployVpc_ is false, this parameter
+     is ignored. Enabling flow logs will incur additional costs. See the
+     [CloudWatch Logs Pricing] page for the associated costs.
+   - **FlowLogsRoleArn**: (Default: "") If using the solution provided VPC,
+     defines which IAM Role should be used to send flow logs to CloudWatch. If
+     not set, flow logs will not be enabled. If _DeployVpc_ is false, this
+     parameter is ignored.
+   - **CreateCloudFrontDistribution:** (Default: true) Creates a CloudFront
+     distribution for accessing the web interface of the solution.
+   - **AccessControlAllowOriginOverride:** (Default: false) Allows overriding
+     the origin from which the API can be called. If 'false' is provided, the
+     API will only accept requests from the Web UI origin.
+   - **AthenaConcurrencyLimit:** (Default: 20) The number of concurrent Athena
+     queries the solution will run when scanning your data lake.
+   - **AthenaQueryMaxRetries:** (Default: 2) Max number of retries for each
+     Athena query after a failure.
+   - **DeletionTasksMaxNumber:** (Default: 3) Max number of concurrent Fargate
+     tasks to run when performing deletions.
+   - **DeletionTaskCPU:** (Default: 4096) Fargate task CPU limit. For more info
+     see [Fargate Configuration]
+   - **DeletionTaskMemory:** (Default: 30720) Fargate task memory limit. For
+     more info see [Fargate Configuration]
+   - **QueryExecutionWaitSeconds:** (Default: 3) How long to wait when checking
+     if an Athena Query has completed.
+   - **QueryQueueWaitSeconds:** (Default: 3) How long to wait when checking if
+     the current number of executing queries is less than the specified
+     concurrency limit.
+   - **ForgetQueueWaitSeconds:** (Default: 30) How long to wait when checking if
+     the Forget phase is complete
+   - **AccessLogsBucket:** (Default: "") The name of the bucket to use for
+     storing the Web UI access logs. Leave blank to disable UI access logging.
+     Ensure the provided bucket has the appropriate permissions configured. For
+     more information see [CloudFront Access Logging Permissions] if
+     **CreateCloudFrontDistribution** is set to true, or [S3 Access Logging
+     Permissions] if not.
+   - **CognitoAdvancedSecurity:** (Default: "OFF") The setting to use for
+     Cognito advanced security. Allowed values for this parameter are: OFF,
+     AUDIT and ENFORCED. For more information on this parameter, see [Cognito
+     Advanced Security]
+   - **EnableAPIAccessLogging:** (Default: false) Whether to enable access
+     logging via CloudWatch Logs for API Gateway. Enabling this feature will
+     incur additional costs.
+   - **EnableContainerInsights:** (Default: false) Whether to enable CloudWatch
+     Container Insights.
+   - **JobDetailsRetentionDays:** (Default: 0) How long job records should
+     remain in the Job table and how long job manifests should remain in the S3
+     manifests bucket. Use 0 to retain data indefinitely. **Note**: if the
+     retention setting is changed, it will only apply to new deletion jobs in
+     DynamoDB; existing deletion jobs will retain the TTL set at the time they
+     were run. The policy will, however, apply immediately to new and existing
+     job manifests in S3.
+   - **EnableDynamoDBBackups:** (Default: false) Whether to enable [DynamoDB
+     Point-in-Time Recovery] for the DynamoDB tables. Enabling this feature will
+     incur additional costs. See the [DynamoDB Pricing] page for the associated
+     costs.
+   - **RetainDynamoDBTables:** (Default: true) Whether to retain the DynamoDB
+     tables upon Stack Update and Stack Deletion.
+   - **AthenaWorkGroup:** (Default: primary) The Athena work group that should
+     be used when the solution runs Athena queries.
+   - **PreBuiltArtefactsBucketOverride:** (Default: false) Overrides the default
+     Bucket containing Front-end and Back-end pre-built artefacts. Use this if
+     you are using a customised version of these artefacts.
+   - **ResourcePrefix:** (Default: S3F2) Resource prefix to apply to resource
+     names when creating statically named resources.
+   - **KMSKeyArns:** (Default: "") Comma-delimited list of KMS Key ARNs used for
+     Client-side Encryption. Leave empty if data is not client-side encrypted
+     with KMS.
+
+   When completed, click _Next_
+
+5. [Configure stack options](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-add-tags.html)
+   if desired, then click _Next_.
+6. On the review screen, you must check the boxes for:
+
+   - "_I acknowledge that AWS CloudFormation might create IAM resources_"
+   - "_I acknowledge that AWS CloudFormation might create IAM resources with
+     custom names_"
+   - "_I acknowledge that AWS CloudFormation might require the following
+     capability: CAPABILITY_AUTO_EXPAND_"
+
+   These are required to allow CloudFormation to create a Role to allow access
+   to resources needed by the stack and name the resources in a dynamic way.
+
+7. Choose _Create Stack_
+8. Wait for the CloudFormation stack to launch. Completion is indicated when the
+   "Stack status" is "_CREATE_COMPLETE_".
+   - You can monitor the stack creation progress in the "Events" tab.
+9. Note the _WebUIUrl_ displayed in the _Outputs_ tab for the stack. This is
+   used to access the application.
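
Steps 1 to 8 can also be carried out from the AWS CLI. The following is a
minimal sketch, assuming the template URL pattern shown in the table above, the
default stack name, and default values for every optional parameter. The
function name `deploy_s3f2` and its arguments are illustrative. The three
capabilities correspond to the acknowledgements listed in step 6.

```bash
# Create the solution stack from the published template.
# Usage: deploy_s3f2 ADMIN_EMAIL [REGION]
# (hypothetical helper; REGION must be one of the regions in the table above)
deploy_s3f2() {
  local admin_email="$1" region="${2:-us-east-1}"
  local template_url="https://solution-builders-${region}.s3.${region}.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml"
  aws cloudformation create-stack \
    --region "$region" \
    --stack-name S3F2 \
    --template-url "$template_url" \
    --parameters "ParameterKey=AdminEmail,ParameterValue=${admin_email}" \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND
}
```

After calling `deploy_s3f2 admin@example.com eu-west-1`, running
`aws cloudformation wait stack-create-complete --stack-name S3F2` blocks until
the stack reaches _CREATE_COMPLETE_.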
+
+## Accessing the application
+
+The solution provides a web user interface and a REST API to allow you to
+integrate it in your own applications. If you have chosen not to deploy the Web
+UI you will need to use the API to interface with the solution.
+
+### Logging in for the first time (only relevant if the Web UI is deployed)
+
+1. Note the _WebUIUrl_ displayed in the _Outputs_ tab for the stack. This is
+   used to access the application.
+2. When accessing the web user interface for the first time, you will be
+   prompted to insert a username and a password. In the username field, enter
+   the admin e-mail specified during stack creation. In the password field,
+   enter the temporary password sent by the system to the admin e-mail. Then
+   select "Sign In".
+3. Next, you will need to reset the password. Enter a new password and then
+   select "Submit".
+4. Now you should be able to access all the functionalities.
+
+### Managing users (only relevant if Cognito is chosen for authentication)
+
+To add more users to the application:
+
+1. Access the [Cognito Console] and choose "Manage User Pools".
+2. Select the solution's User Pool (its name is displayed as
+   _CognitoUserPoolName_ in the _Outputs_ tab for the CloudFormation stack).
+3. Select "Users and Groups" from the menu on the right.
+4. Use this page to create or manage users. For more information, consult the
+   [Managing Users in User Pools Guide].
+
+### Making authenticated API requests
+
+To use the API directly, you will need to authenticate requests using the
+Cognito User Pool or IAM. The method for authenticating differs depending on
+which authentication option was chosen:
+
+#### Cognito
+
+After resetting the password via the UI, you can make authenticated requests
+using the AWS CLI:
+
+1. Note the _CognitoUserPoolId_, _CognitoUserPoolClientId_ and _ApiUrl_
+   parameters displayed in the _Outputs_ tab for the stack.
+2. Take note of the Cognito user email and password.
+3. Generate a token by running this command with the values you noted in the
+   previous steps:
+
+   ```sh
+   aws cognito-idp admin-initiate-auth \
+     --user-pool-id $COGNITO_USER_POOL_ID \
+     --client-id $COGNITO_USER_POOL_CLIENT_ID \
+     --auth-flow ADMIN_NO_SRP_AUTH \
+     --auth-parameters "USERNAME=$USER_EMAIL_ADDRESS,PASSWORD=$USER_PASSWORD"
+   ```
+
+4. Use the `IdToken` generated by the previous command to make an authenticated
+   request to the API. For instance, the following command will show the matches
+   in the deletion queue:
+
+   ```sh
+   curl $API_URL/v1/queue -H "Authorization: Bearer $ID_TOKEN"
+   ```
+
+For more information, consult the [Cognito REST API integration guide].
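
Steps 3 and 4 can be combined so that the token is captured directly rather than
copied from the JSON response. This is a minimal sketch: the function name
`get_id_token` is illustrative, and it assumes the four environment variables
used in step 3 are already set. The `--query` option selects the `IdToken` field
of the `AuthenticationResult` object in the response.

```bash
# Sign in with the admin auth flow and print only the IdToken.
# Assumes COGNITO_USER_POOL_ID, COGNITO_USER_POOL_CLIENT_ID,
# USER_EMAIL_ADDRESS and USER_PASSWORD are set (hypothetical helper).
get_id_token() {
  aws cognito-idp admin-initiate-auth \
    --user-pool-id "$COGNITO_USER_POOL_ID" \
    --client-id "$COGNITO_USER_POOL_CLIENT_ID" \
    --auth-flow ADMIN_NO_SRP_AUTH \
    --auth-parameters "USERNAME=${USER_EMAIL_ADDRESS},PASSWORD=${USER_PASSWORD}" \
    --query 'AuthenticationResult.IdToken' \
    --output text
}
```

The result can then be used as `ID_TOKEN="$(get_id_token)"` before running the
`curl` command from step 4.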
+
+#### IAM
+
+IAM authentication for API requests uses the
+[Signature Version 4 signing process](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html).
+Add the resulting signature to the **Authorization** header when making requests
+to the API.
+
+Use the Sigv4 process linked above to generate the Authorization header value
+and then call the API as normal:
+
+```sh
+curl $API_URL/v1/queue -H "Authorization: $Sigv4Auth"
+```
+
+IAM authentication can be used anywhere you have AWS credentials with the
+correct permissions; this could be an IAM User or an assumed IAM Role.
+
+Please refer to the documentation
+[here](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-control-access-using-iam-policies-to-invoke-api.html)
+to understand how to define the IAM policy to match your requirements. The ARN
+for the API can be found in the value of the `ApiArn` CloudFormation Stack
+Output.
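
One way to produce the Sigv4 signature without computing it by hand is to let
`curl` sign the request, using the `--aws-sigv4` option available in curl
7.75.0 and later. The sketch below is an assumption about how you might call
the API, not the only supported method. The function name `call_api_sigv4` is
illustrative, and it expects `API_URL`, `AWS_REGION`, `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` to be set. The signing service name for API Gateway is
`execute-api`.

```bash
# Call an API path with a Sigv4 signature generated by curl.
# Usage: call_api_sigv4 /v1/queue   (hypothetical helper)
call_api_sigv4() {
  local path="$1"
  # Build the curl argument list in the function's positional parameters.
  set -- --silent --show-error \
    --aws-sigv4 "aws:amz:${AWS_REGION}:execute-api" \
    --user "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}"
  # Temporary credentials (for example an assumed role) also need the token.
  if [ -n "${AWS_SESSION_TOKEN:-}" ]; then
    set -- "$@" -H "x-amz-security-token: ${AWS_SESSION_TOKEN}"
  fi
  curl "$@" "${API_URL}${path}"
}
```

For instance, `call_api_sigv4 /v1/queue` sends the same request as the command
above, with the Authorization header filled in by `curl`.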
+
+### Integrating the solution with other applications using CloudFormation stack outputs
+
+Applications deployed using AWS CloudFormation in the same AWS account and
+region can integrate with Find and Forget by using CloudFormation output values.
+You can use the solution stack as a nested stack to use its outputs (such as the
+API URL) as inputs for another application.
+
+Some outputs are also available as exports. You can import these values to use
+in your own CloudFormation stacks that you deploy following the Find and Forget
+stack.
+
+**Note for using exports:** After another stack imports an output value, you
+can't delete the stack that is exporting the output value or modify the exported
+output value. All of the imports must be removed before you can delete the
+exporting stack or modify the output value.
+
+Consult the [exporting stack output values] guide to review the differences
+between importing exported values and using nested stacks.
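
Outside of CloudFormation, for example in a deployment script, the same output
values can be read with the AWS CLI. A minimal sketch, where the function name
`stack_output` is illustrative and `S3F2` is the default stack name:

```bash
# Print a single output value from a CloudFormation stack.
# Usage: stack_output STACK_NAME OUTPUT_KEY   (hypothetical helper)
stack_output() {
  local stack_name="$1" output_key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='${output_key}'].OutputValue" \
    --output text
}
```

For example, `API_URL="$(stack_output S3F2 ApiUrl)"` stores the API URL for use
in later commands. To see which values are available as exports in the current
account and region, `aws cloudformation list-exports` lists every export name
and value.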
+
+## Configuring Data Mappers
+
+After [Deploying the Solution](#deploying-the-solution), your first step should
+be to configure one or more [data mappers](ARCHITECTURE.md#data-mappers) which
+will connect your data to the solution. Identify the S3 Bucket containing the
+data you wish to connect to the solution and ensure you have defined a table in
+your data catalog and that all existing and future partitions (as they are
+created) are known to the Data Catalog. Currently AWS Glue is the only supported
+data catalog provider. For more information on defining your data in the Glue
+Data Catalog, see [Defining Glue Tables]. You must define your Table in the Glue
+Data Catalog in the same region and account as the S3 Find and Forget solution.
+
+### AWS Lake Formation Configuration
+
+For data lakes registered with AWS Lake Formation, you must grant additional
+permissions in Lake Formation before you can use them with the solution. If you
+are not using Lake Formation, proceed directly to the
+[Data Mapper creation](#data-mapper-creation) section.
+
+To grant these permissions in Lake Formation:
+
+1. Using the **WebUIRole** output from the solution CloudFormation stack as the
+   IAM principal, use the [Lake Formation Data Permissions Console] to grant the
+   `Describe` permission for all Glue Databases that you will want to use with
+   the solution; then grant the `Describe` and `Select` permissions to the role
+   for all Glue Tables that you will want to use with the solution. These
+   permissions are necessary to create data mappers in the web interface.
+2. Using the **PutDataMapperRole** output from the solution CloudFormation stack
+   as the IAM principal, use the [Lake Formation Data Permissions Console] to
+   grant `Describe` and `Select` permissions for all Glue Tables that you will
+   want to use with the solution. These permissions allow the solution to access
+   Table metadata when creating a Data Mapper.
+3. Using the **AthenaExecutionRole** and **GenerateQueriesRole** outputs from
+   the solution CloudFormation stack as IAM principals, use the [Lake Formation
+   Data Permissions Console] to grant the `Describe` and `Select` permissions to
+   both principals for all of the tables that you will want to use with the
+   solution. These permissions allow the solution to plan and execute Athena
+   queries during the Find Phase.
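
If you have many tables, the table-level grants above can be scripted with the
AWS CLI instead of the console. The following is a minimal sketch of the
`Describe` and `Select` grant on a single table. The function name
`grant_table_access` and its arguments are illustrative, and the role ARN is
assumed to come from the relevant stack output.

```bash
# Grant Describe and Select on one Glue table to an IAM role.
# Usage: grant_table_access ROLE_ARN DATABASE TABLE   (hypothetical helper)
grant_table_access() {
  local role_arn="$1" database="$2" table="$3"
  aws lakeformation grant-permissions \
    --principal "DataLakePrincipalIdentifier=${role_arn}" \
    --permissions DESCRIBE SELECT \
    --resource "{\"Table\":{\"DatabaseName\":\"${database}\",\"Name\":\"${table}\"}}"
}
```

Calling the function once per role and table covers the table grants in steps 1
to 3. The database-level `Describe` grant for the **WebUIRole** in step 1 is a
separate grant and is not covered by this sketch.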
+
+### Data Mapper Creation
+
+1. Access the application UI via the **WebUIUrl** displayed in the _Outputs_ tab
+   for the stack.
+2. Choose **Data Mappers** from the menu then choose **Create Data Mapper**
+3. On the Create Data Mapper page input a **Name** to uniquely identify this
+   Data Mapper.
+4. Select a **Query Executor Type** then choose the **Database** and **Table**
+   in your data catalog which describes the target data in S3. A list of columns
+   will be displayed for the chosen Table.
+5. From the Partition Keys list, select the partition key(s) that you want the
+   solution to use when generating the queries. If you select none, only one
+   query will be performed for the data mapper. If you select any or all, you'll
+   have a greater number of smaller queries (the same query will be repeated
+   with an additional `WHERE` clause for each combination of partition values).
+   If you have a lot of small partitions, it may be more efficient to choose
+   none or a subset of partition keys from the list in order to increase speed
+   of execution. If instead you have very big partitions, it may be more
+   efficient to choose all the partition keys in order to reduce the probability
+   of failure caused by query timeout. We recommend that the average query size
+   not exceed hundreds of GBs and that queries take no more than 5 minutes.
+
+   > As an example, let's consider 10 years of daily data with partition keys of
+   > `year`, `month` and `day` with a total size of `10TB`. By declaring
+   > PartitionKeys=`[]` (none), a single query of `10TB` would run during the
+   > Find phase, and that may be too much to complete within the 30-minute limit
+   > of Athena execution time. On the other hand, using all the combinations of
+   > the partition keys we would have approximately `3652` queries, each
+   > probably very small, and given the default Athena concurrency limit of
+   > `20`, it may take a very long time to execute all of them. The best choice
+   > in this scenario is possibly the `['year','month']` combination, which
+   > would result in `120` queries.
+
+6. From the columns list, choose the column(s) the solution should use to
+   find items in the data which should be deleted. For example, if your table
+   has three columns named **customer_id**, **description** and **created_at**
+   and you want to search for items using the **customer_id**, you should choose
+   only the **customer_id** column from this list.
+7. Enter the ARN of the role for Fargate to assume when modifying objects in S3
+   buckets. This role should already exist if you have followed the
+   [Provisioning Data Access IAM Roles](#provisioning-data-access-iam-roles)
+   steps.
+8. If you want the solution to keep older object versions rather than deleting
+   all versions except the latest, deselect _Delete previous object versions
+   after update_. By default the solution will delete all previous versions
+   after creating a new version.
+9. If you want the solution to ignore Object Not Found exceptions, select
+   _Ignore object not found exceptions during deletion_. By default deletion
+   jobs will fail if any objects that are found by the Find phase don't exist in
+   the Delete phase. This setting can be useful if you have some other system
+   deleting objects from the bucket, for example S3 lifecycle policies.
+
+   Note that the solution **will not** delete old versions for these objects.
+   This can cause data to be **retained longer than intended**. Make sure there
+   is some mechanism to handle old versions. One option would be to configure
+   [S3 lifecycle policies] on non-current versions.
+
+10. Choose **Create Data Mapper**.
+11. A message is displayed advising you to update the S3 Bucket Policy for the
+    S3 Bucket referenced by the newly created data mapper. See
+    [Granting Access to Data](#granting-access-to-data) for more information on
+    how to do this. Choose **Return to Data Mappers**.
+
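The partition key trade-off described in step 5 can be checked with a little arithmetic. The sketch below is illustrative only: `count_queries` is a hypothetical helper (not part of the solution), and the date range is an assumed ten-year window of daily partitions. It counts the distinct partition value combinations, which is the number of queries the Find phase would generate for a given choice of partition keys.

```python
from datetime import date, timedelta


def count_queries(start, end, partition_keys):
    """Count distinct partition value combinations for daily data (inclusive range)."""
    extract = {
        "year": lambda d: d.year,
        "month": lambda d: d.month,
        "day": lambda d: d.day,
    }
    combinations = set()
    current = start
    while current <= end:
        # An empty key list yields the empty tuple, i.e. a single query.
        combinations.add(tuple(extract[key](current) for key in partition_keys))
        current += timedelta(days=1)
    return len(combinations)


# Assumed window: ten years of daily data.
START, END = date(2010, 1, 1), date(2019, 12, 31)
print(count_queries(START, END, []))
print(count_queries(START, END, ["year", "month"]))
print(count_queries(START, END, ["year", "month", "day"]))
```

For this assumed window the three calls give 1, 120 and 3652 queries, which matches the figures in the example above.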
+You can also create Data Mappers directly via the API. For more information, see
+the [API Documentation].
+
+## Granting Access to Data
+
+After configuring a data mapper you must ensure that the S3 Find and Forget
+solution has the required level of access to the S3 location the data mapper
+refers to. The recommended way to achieve this is through the use of [S3 Bucket
+Policies].
+
+> **Note:** AWS IAM uses an
+> [eventual consistency model](https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_general.html#troubleshoot_general_eventual-consistency)
+> and therefore any change you make to IAM, Bucket or KMS Key policies may take
+> time to become visible. Ensure you have allowed time for permissions changes
+> to propagate to all endpoints before starting a job. If your job fails with a
+> status of FIND_FAILED and the `QueryFailed` events indicate S3 permissions
+> issues, you may need to wait for the permissions changes to propagate.
+
+### Updating your Bucket Policy
+
+To update the S3 bucket policy to grant **read** access to the IAM role used by
+Amazon Athena, and **write** access to the Data Access IAM role used by AWS
+Fargate, follow these steps:
+
+1. Access the application UI via the **WebUIUrl** displayed in the _Outputs_ tab
+   for the stack.
+2. Choose **Data Mappers** from the menu then choose the radio button for the
+   relevant data mapper from the **Data Mappers** list.
+3. Choose **Generate Access Policies** and follow the instructions on the
+   **Bucket Access** tab to update the bucket policy. If you already have a
+   bucket policy in place, add the statements shown to your existing bucket
+   policy rather than replacing it completely. If your data is encrypted with a
+   **Customer Managed CMK** rather than an **AWS Managed CMK**, see
+   [Data Encrypted with Customer Managed CMK](#data-encrypted-with-a-customer-managed-cmk)
+   to grant the solution access to the Customer Managed CMK. For more
+   information on using Server-Side Encryption (SSE) with S3, see [Using SSE
+   with CMKs].
+
+### Data Encrypted with a Customer Managed CMK
+
+Where the data you are connecting to the solution is encrypted with a Customer
+Managed CMK rather than an AWS Managed CMK, you must also grant the Athena and
+Data Access IAM roles access to use the key so that the data can be decrypted
+when reading and re-encrypted when writing.
+
+Once you have updated the bucket policy as described in
+[Updating your Bucket Policy](#updating-your-bucket-policy), choose the **KMS
+Access** tab from the **Generate Access Policies** modal window and follow the
+instructions to update the key policy with the provided statements. The
+statements provided are for use when using the **policy view** in the AWS
+console or making updates to the key policy via the CLI, CloudFormation or the
+API. If you wish to use the **default view** in the AWS console, add the
+**Principals** in the provided statements as **key users**. For more
+information, see [How to Change a Key Policy].
+
+## Adding to the Deletion Queue
+
+Once your Data Mappers are configured, you can begin adding "Matches" to the
+[Deletion Queue](ARCHITECTURE.md#deletion-queue).
+
+1. Access the application UI via the **WebUIUrl** displayed in the _Outputs_ tab
+   for the stack.
+2. Choose **Deletion Queue** from the menu then choose **Add Match to the
+   Deletion Queue**.
+
+Matches can be **Simple** or **Composite**.
+
+- A **Simple** match is a value to be matched against any column identifier of
+  one or more data mappers. For instance a value _12345_ to be matched against
+  the _customer_id_ column of _DataMapperA_ or the _admin_id_ of _DataMapperB_.
+- A **Composite** match consists of one or more values to be matched against
+  specific column identifiers of a multi-column based data mapper. For instance
+  a tuple _John_ and _Doe_ to be matched against the _first_name_ and
+  _last_name_ columns of _DataMapperC_.
+
+To add a simple match:
+
+1. Choose _Simple_ as **Match Type**
+2. Input a **Match**, which is the value to search for in your data mappers. If
+   you wish to search for the match in all data mappers choose **All Data
+   Mappers**, otherwise choose **Select your Data Mappers** then select the
+   relevant data mappers from the list.
+3. Choose **Add Item to the Deletion Queue** and confirm you can see the match
+   in the Deletion Queue.
+
+To add a composite match you need to have at least one data mapper with more
+than one column identifier. Then:
+
+1. Choose _Composite_ as **Match Type**
+2. Select the Data Mapper from the List
+3. Select all the columns (at least one) that you want to map to a match and
+   then provide a value for each of them. Empty is a valid value.
+4. Choose **Add Item to the Deletion Queue** and confirm you can see the match
+   in the Deletion Queue.
+
+You can also add matches to the Deletion Queue directly via the API. For more
+information, see the [API Documentation].
+
+When the next deletion job runs, the solution will scan the configured columns
+of your data for any occurrences of the Matches present in the queue at the time
+the job starts and remove any items where one of the Matches is present.
+
+If across all your data mappers you can find all items related to a single
+logical entity using the same value, you only need to add one Match value to the
+deletion queue to delete that logical entity from all data mappers.
+
+If the value used to identify a single logical entity is not consistent across
+your data mappers, you should add an item to the deletion queue **for each
+distinct value** which identifies the logical entity, selecting the specific
+data mapper(s) to which that value is relevant.
+
+If you make a mistake when adding a Match to the deletion queue, you can remove
+that match from the queue as long as there is no job running. Once a job has
+started no items can be removed from the deletion queue until the running job
+has completed. You may continue to add matches to the queue whilst a job is
+running, but only matches which were present when the job started will be
+processed by that job. Once a job completes, only the matches that job has
+processed will be removed from the queue.
+
+In order to facilitate different teams using a single deployment within an
+organisation, the same match can be added to the deletion queue more than once.
+When the job executes, it will merge the lists of data mappers for duplicates in
+the queue.
+
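The merging of duplicate matches described above can be pictured with a small sketch. This is not the solution's own code: the item shape (`MatchId` and `DataMappers` keys) is assumed for illustration, and the sketch only covers items that list their data mappers explicitly.

```python
from collections import defaultdict


def merge_duplicate_matches(queue_items):
    """Merge the data mapper lists of queue items that share the same match value."""
    merged = defaultdict(list)
    for item in queue_items:
        mappers = merged[item["MatchId"]]
        for mapper in item["DataMappers"]:
            if mapper not in mappers:  # keep first-seen order, drop repeats
                mappers.append(mapper)
    return dict(merged)


# Two teams queued the same match against different data mappers.
queue = [
    {"MatchId": "12345", "DataMappers": ["DataMapperA"]},
    {"MatchId": "12345", "DataMappers": ["DataMapperB", "DataMapperA"]},
    {"MatchId": "67890", "DataMappers": ["DataMapperC"]},
]
print(merge_duplicate_matches(queue))
```

In this example the match `12345` ends up being searched for in both `DataMapperA` and `DataMapperB`, once each.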
+## Running a Deletion Job
+
+Once you have configured your data mappers and added one or more items to the
+deletion queue, you can start a job.
+
+1. Access the application UI via the **WebUIUrl** displayed in the _Outputs_ tab
+   for the stack.
+2. Choose **Deletion Jobs** from the menu and ensure there are no jobs currently
+   running. Choose **Start a Deletion Job** and review the settings displayed on
+   the screen. For more information on how to edit these settings, see
+   [Adjusting Configuration](#adjusting-configuration).
+3. If you are happy with the current solution configuration choose **Start a
+   Deletion Job**. The job details page should be displayed.
+
+Once a job has started, you can leave the page and return to view its progress
+at any point by choosing the job ID from the Deletion Jobs list. The job details
+page will automatically refresh to display the current status and statistics
+for the job. For more information on the possible statuses and their meaning,
+see [Deletion Job Statuses](#deletion-job-statuses).
+
+You can also start jobs and check their status using the API. For more
+information, see the [API Documentation].
+
+Job events are continuously emitted whilst a job is running. These events are
+used to update the status and statistics for the job. You can view all the
+emitted events for a job in the **Job Events** table. Whilst a job is running,
+the **Load More** button will continue to be displayed even if no new events
+have been received. Once a job has finished, the **Load More** button will
+disappear once you have loaded all the emitted events. For more information on
+the events which can be emitted during a job, see
+[Deletion Job Event Types](#deletion-job-event-types).
+
+To optimise costs, it is best practice when using the solution to start jobs on
+a regular schedule, rather than every time a single item is added to the
+Deletion Queue. This is because the marginal cost of the Find phase when
+deleting an additional item from the queue is far less than re-executing the
+Find phase (where the data mappers searched are the same). Similarly, the
+marginal cost of removing an additional match from an object is negligible when
+there is already at least 1 match present in the object contents.
+
+> **Important**
+>
+> Ensure no external processes perform write/delete actions against existing
+> objects whilst a job is running. For more information, consult the [Limits]
+> guide.
+
+### Deletion Job Statuses
+
+The list of possible job statuses is as follows:
+
+- `QUEUED`: The job has been accepted but has yet to start. Jobs are started
+  asynchronously by a Lambda invoked by the [DynamoDB event
+  stream][dynamodb streams] for the Jobs table.
+- `RUNNING`: The job is still in progress.
+- `FORGET_COMPLETED_CLEANUP_IN_PROGRESS`: The Forget phase has completed with
+  no errors and the job is still in progress whilst the Deletion Queue is
+  cleaned up.
+- `COMPLETED`: The job finished successfully.
+- `COMPLETED_CLEANUP_FAILED`: The job finished successfully, however the
+  deletion queue items could not be removed. You should manually remove these
+  or leave them to be removed on the next job.
+- `FORGET_PARTIALLY_FAILED`: The job finished but it was unable to successfully
+  process one or more objects. The Deletion DLQ for messages will contain a
+  message per object that could not be updated.
+- `FIND_FAILED`: The job failed during the Find phase as there was an issue
+  querying one or more data mappers.
+- `FORGET_FAILED`: The job failed during the Forget phase as there was an issue
+  running the Fargate tasks.
+- `FAILED`: An unknown error occurred during the Find and Forget workflow, for
+  example, the Step Functions execution timed out or the execution was manually
+  cancelled.
+
+For more information on how to resolve statuses indicative of errors, consult
+the [Troubleshooting] guide.
+
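When polling a job from a script, it can help to group the statuses listed above by what they mean for the caller. The following is a minimal sketch: the status names come from the list above, while the three group labels and the `classify_status` helper are hypothetical.

```python
IN_PROGRESS = {"QUEUED", "RUNNING", "FORGET_COMPLETED_CLEANUP_IN_PROGRESS"}
# COMPLETED_CLEANUP_FAILED still means the deletions succeeded; only the
# deletion queue cleanup needs attention.
SUCCEEDED = {"COMPLETED", "COMPLETED_CLEANUP_FAILED"}
FAILED = {"FORGET_PARTIALLY_FAILED", "FIND_FAILED", "FORGET_FAILED", "FAILED"}


def classify_status(status):
    """Map a job status onto 'in_progress', 'succeeded' or 'failed'."""
    if status in IN_PROGRESS:
        return "in_progress"
    if status in SUCCEEDED:
        return "succeeded"
    if status in FAILED:
        return "failed"
    raise ValueError(f"Unknown job status: {status}")


print(classify_status("FORGET_COMPLETED_CLEANUP_IN_PROGRESS"))
```

A polling loop would keep waiting while the result is `in_progress` and stop on either of the other two values.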
+### Deletion Job Event Types
+
+The list of events is as follows:
+
+- `JobStarted`: Emitted when the deletion job state machine first starts. Causes
+  the status of the job to transition from `QUEUED` to `RUNNING`.
+- `FindPhaseStarted`: Emitted when the deletion job has purged any messages from
+  the query and object queues and is ready to start searching for data.
+- `FindPhaseEnded`: Emitted when all queries have executed and written their
+  results to the objects queue.
+- `FindPhaseFailed`: Emitted when one or more queries fail. Causes the status to
+  transition to `FIND_FAILED`.
+- `ForgetPhaseStarted`: Emitted when the Find phase has completed successfully
+  and the Forget phase is starting.
+- `ForgetPhaseEnded`: Emitted when the Forget phase has completed. If the Forget
+  phase completes with no errors, this event causes the status to transition to
+  `FORGET_COMPLETED_CLEANUP_IN_PROGRESS`. If the Forget phase completes but
+  there was an error updating one or more objects, this causes the status to
+  transition to `FORGET_PARTIALLY_FAILED`.
+- `ForgetPhaseFailed`: Emitted when there was an issue running the Fargate
+  tasks. Causes the status to transition to `FORGET_FAILED`.
+- `CleanupSucceeded`: The **final** event emitted when a job has executed
+  successfully and the Deletion Queue has been cleaned up. Causes the status to
+  transition to `COMPLETED`.
+- `CleanupFailed`: The **final** event emitted when the job executed
+  successfully but there was an error removing the processed matches from the
+  Deletion Queue. Causes the status to transition to `COMPLETED_CLEANUP_FAILED`.
+- `CleanupSkipped`: Emitted when the job is finalising and the job status is one
+  of `FIND_FAILED`, `FORGET_FAILED` or `FAILED`.
+- `QuerySucceeded`: Emitted whenever a single query executes successfully.
+- `QueryFailed`: Emitted whenever a single query fails.
+- `ObjectUpdated`: Emitted whenever an updated object is written to S3 and any
+  associated deletions are complete.
+- `ObjectUpdateFailed`: Emitted whenever an object cannot be updated, an object
+  version integrity conflict is detected or an associated deletion fails.
+- `ObjectRollbackFailed`: Emitted whenever a rollback (triggered by a detected
+  version integrity conflict) fails.
+- `Exception`: Emitted whenever a generic error occurs during the job execution.
+  Causes the status to transition to `FAILED`.
+
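The relationship between events and statuses can be sketched as a small reducer that replays event names in order. This is an illustration, not the solution's implementation: `replay_status` is a hypothetical helper, and inferring a partial failure from earlier `ObjectUpdateFailed` events is a simplifying assumption.

```python
# Events which the list above describes as causing a fixed status transition.
SIMPLE_TRANSITIONS = {
    "JobStarted": "RUNNING",
    "FindPhaseFailed": "FIND_FAILED",
    "ForgetPhaseFailed": "FORGET_FAILED",
    "CleanupSucceeded": "COMPLETED",
    "CleanupFailed": "COMPLETED_CLEANUP_FAILED",
    "Exception": "FAILED",
}


def replay_status(event_names):
    """Derive a job status by replaying event names in emission order."""
    status = "QUEUED"
    object_errors = False
    for name in event_names:
        if name == "ObjectUpdateFailed":
            object_errors = True  # assumption: this marks the Forget phase as partially failed
        elif name == "ForgetPhaseEnded":
            status = (
                "FORGET_PARTIALLY_FAILED"
                if object_errors
                else "FORGET_COMPLETED_CLEANUP_IN_PROGRESS"
            )
        elif name in SIMPLE_TRANSITIONS:
            status = SIMPLE_TRANSITIONS[name]
        # All other events (for example QuerySucceeded) leave the status unchanged.
    return status


print(replay_status(["JobStarted", "FindPhaseStarted", "QueryFailed", "FindPhaseFailed"]))
```

Events such as `QuerySucceeded` or `ObjectUpdated` only feed the job statistics, so the sketch ignores them when computing the status.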
+## Adjusting Configuration
+
+There are several parameters to set when
+[Deploying the Solution](#deploying-the-solution) which affect the behaviour of
+the solution in terms of data retention and performance:
+
+- `AthenaConcurrencyLimit`: Increasing the number of concurrent queries that
+  should be executed will decrease the total time spent performing the Find
+  phase. You should not increase this value beyond your account Service Quota
+  for concurrent DML queries, and should ensure that the value set takes into
+  account any other Athena DML queries that may be executing whilst a job is
+  running.
+- `DeletionTasksMaxNumber`: Increasing the number of concurrent tasks that
+  should consume messages from the object queue will decrease the total time
+  spent performing the Forget phase.
+- `QueryExecutionWaitSeconds`: Decreasing this value will decrease the length of
+  time between each check to see whether a query has completed. You should aim
+  to set this to the "ceiling function" of your average query time. For example,
+  if your average query takes 3.2 seconds, set this to 4.
+- `QueryQueueWaitSeconds`: Decreasing this value will decrease the length of
+  time between each check to see whether additional queries can be scheduled
+  during the Find phase. If your jobs fail due to exceeding the Step Functions
+  execution history quota, you may have set this value too low and should
+  increase it to allow more queries to be scheduled after each check.
+- `ForgetQueueWaitSeconds`: Decreasing this value will decrease the length of
+  time between each check to see whether the Fargate object queue is empty. If
+  your jobs fail due to exceeding the Step Functions execution history quota,
+  you may have set this value too low.
+- `JobDetailsRetentionDays`: Changing this value will change how long job
+  details and events are retained for. Set this to 0 to retain them
+  indefinitely.
+
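The "ceiling function" guidance for `QueryExecutionWaitSeconds` can be worked through as follows. This is a sketch under the assumption that you have collected a list of observed query durations in seconds; `recommended_query_wait_seconds` is a hypothetical helper name.

```python
import math


def recommended_query_wait_seconds(query_durations):
    """Return the ceiling of the average query duration, in whole seconds."""
    if not query_durations:
        raise ValueError("At least one query duration is required")
    average = sum(query_durations) / len(query_durations)
    return math.ceil(average)


# Observed durations averaging roughly 3.2 seconds.
print(recommended_query_wait_seconds([3.0, 3.2, 3.4]))
```

With an average of about 3.2 seconds the function returns 4, as in the example given for the parameter.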
+The values for these parameters are stored in an SSM Parameter Store String
+Parameter named `/s3f2/S3F2-Configuration` as a JSON object. The recommended
+approach for updating these values is to perform a
+[Stack Update](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-direct.html)
+and change the relevant parameters for the stack.
+
+It is possible to [update the SSM Parameter][updating an ssm parameter]
+directly; however, this is not a recommended approach. **You should not alter the structure
+or data types of the configuration JSON object.**
+
+Once updated, the configuration will affect any **future** job executions. In
+progress and previous executions will **not** be affected. The current
+configuration values are displayed when confirming that you wish to start a job.
+
+You can only update the vCPUs/memory allocated to Fargate tasks by performing a
+stack update. For more information, see
+[Updating the Solution](#updating-the-solution).
+
+## Updating the Solution
+
+To benefit from the latest features and improvements, you should update the
+solution deployed to your account when a new version is published. To find out
+what the latest version is and what has changed since your currently deployed
+version, check the [Changelog].
+
+How you update the solution depends on the difference between versions. If the
+new version is a _minor_ upgrade (for instance, from version 3.45 to 3.67) you
+should deploy using a CloudFormation Stack Update. If the new version is a
+_major_ upgrade (for instance, from 2.34 to 3.0) you should perform a manual
+rolling deployment.
+
+Major version releases are made in exceptional circumstances and may contain
+changes that prohibit backward compatibility. Minor version releases are
+backward-compatible.
+
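The rule for choosing an upgrade method can be expressed as a comparison of the major version numbers. The sketch below assumes version strings of the form `3.45` or `v0.2` and does not handle downgrades; `upgrade_path` is a hypothetical helper.

```python
def upgrade_path(current_version, target_version):
    """Pick the upgrade method by comparing major version numbers."""

    def major(version):
        # Accept both "3.45" and "v3.45" style version strings.
        return int(version.lstrip("v").split(".")[0])

    if major(target_version) > major(current_version):
        return "manual rolling deployment"
    return "stack update"


print(upgrade_path("3.45", "3.67"))
print(upgrade_path("2.34", "3.0"))
```

The first call corresponds to a minor upgrade (stack update) and the second to a major upgrade (manual rolling deployment).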
+### Identify current solution version
+
+You can find the version of the currently deployed solution by retrieving the
+`SolutionVersion` output for the solution stack. The solution version is also
+shown on the Dashboard of the Web UI.
+
+### Identify the Stack URL to deploy
+
+After reviewing the [Changelog], obtain the `Template Link` url of the latest
+version from ["Deploying the Solution"](#deploying-the-solution) (it will be
+similar to
+`https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/latest/template.yaml`).
+If you wish to deploy a specific version rather than the latest version, replace
+`latest` in the URL with the chosen version, for instance
+`https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/amazon-s3-find-and-forget/v0.2/template.yaml`.
+
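Substituting the version into the template URL is a simple string operation. The sketch below reuses the example URL shown above as its pattern; the URL for your deployment should always be taken from "Deploying the Solution", and `template_url` is a hypothetical helper.

```python
# Pattern taken from the example URL above; yours may differ.
TEMPLATE_URL_PATTERN = (
    "https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/"
    "amazon-s3-find-and-forget/{version}/template.yaml"
)


def template_url(version="latest"):
    """Return the template URL for a given version label."""
    return TEMPLATE_URL_PATTERN.format(version=version)


print(template_url())
print(template_url("v0.2"))
```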
+### Minor Upgrades: Perform CloudFormation Stack Update
+
+To deploy via AWS Console:
+
+1. Open the [CloudFormation Console Page] and choose the Solution by selecting
+   the stack's radio button, then choose "Update"
+2. Choose "Replace current template" and then input the template URL for the
+   version you wish to deploy in the "Amazon S3 URL" textbox, then choose "Next"
+3. On the _Stack Details_ screen, review the Parameters and then choose "Next"
+4. On the _Configure stack options_ screen, choose "Next"
+5. On the _Review stack_ screen, you must check the boxes for:
+
+   - "_I acknowledge that AWS CloudFormation might create IAM resources_"
+   - "_I acknowledge that AWS CloudFormation might create IAM resources with
+     custom names_"
+   - "_I acknowledge that AWS CloudFormation might require the following
+     capability: CAPABILITY_AUTO_EXPAND_"
+
+   These are required to allow CloudFormation to create a Role to allow access
+   to resources needed by the stack and name the resources in a dynamic way.
+
+6. Choose "Update stack" to start the stack update.
+7. Wait for the CloudFormation stack to finish updating. Completion is indicated
+   when the "Stack status" is "_UPDATE_COMPLETE_".
+
+To deploy via the AWS CLI
+[consult the documentation](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/update-stack.html).
+
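For a scripted update, the console steps above map onto the arguments of the CloudFormation `UpdateStack` call. The sketch below only builds the keyword arguments, so it runs without AWS credentials; the stack name and the choice to reuse previous values for every listed parameter are assumptions. With boto3 installed, the result could be passed to `boto3.client("cloudformation").update_stack(**kwargs)`.

```python
def build_update_stack_kwargs(stack_name, template_url, parameter_keys):
    """Build keyword arguments for a CloudFormation UpdateStack request."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Reuse the currently deployed value for each listed parameter.
        "Parameters": [
            {"ParameterKey": key, "UsePreviousValue": True}
            for key in parameter_keys
        ],
        # These correspond to the three acknowledgement boxes in the console.
        "Capabilities": [
            "CAPABILITY_IAM",
            "CAPABILITY_NAMED_IAM",
            "CAPABILITY_AUTO_EXPAND",
        ],
    }


kwargs = build_update_stack_kwargs(
    "S3F2",  # hypothetical stack name
    "https://solution-builders-us-east-1.s3.us-east-1.amazonaws.com/"
    "amazon-s3-find-and-forget/latest/template.yaml",
    ["AthenaConcurrencyLimit", "DeletionTasksMaxNumber"],
)
print(kwargs["Capabilities"])
```

Any parameter introduced by the new template version has no previous value, so it would need an explicit `ParameterValue` entry or would fall back to the template default.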
+### Major Upgrades: Manual Rolling Deployment
+
+The process for a manual rolling deployment is as follows:
+
+1. Create a new stack from scratch
+2. Export the data from the old stack to the new stack
+3. Migrate consumers to new API and Web UI URLs
+4. Delete the old stack.
+
+The steps for performing this process are:
+
+1. Deploy a new instance of the Solution by following the instructions contained
+   in the ["Deploying the Solution" section](#deploying-the-solution). Make sure
+   you use unique values for the Stack Name and ResourcePrefix parameters which
+   differ from the existing stack.
+2. Migrate Data from DynamoDB to ensure the new stack contains the necessary
+   configuration related to Data Mappers and settings. When both stacks are
+   deployed in the same account and region, the simplest way to migrate is via
+   [On-Demand Backup and Restore]. If the stacks are deployed in different
+   regions or accounts, you can use [AWS Data Pipeline].
+3. Ensure that all the bucket policies for the Data Mappers are in place for the
+   new stack. See the
+   ["Granting Access to Data" section](#granting-access-to-data) for steps to do
+   this.
+4. Review the [Changelog] for changes that may affect how you use the new
+   deployment. This may require you to make changes to any software you have
+   that interacts with the solution's API.
+5. Once all the consumers are migrated to the new stack (API and Web UI), delete
+   the old stack.
+
+## Deleting the Solution
+
+To delete a stack via AWS Console:
+
+1. Open the [CloudFormation Console Page] and choose the solution stack, then
+   choose "Delete"
+2. Once the confirmation modal appears, choose "Delete stack".
+3. Wait for the CloudFormation stack to finish deleting. Completion is indicated
+   when the "Stack status" is "_DELETE_COMPLETE_".
+
+To delete a stack via the AWS CLI
+[consult the documentation](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/delete-stack.html).
+
+[api documentation]: api/README.md
+[troubleshooting]: TROUBLESHOOTING.md
+[fargate configuration]:
+  https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html#fargate-tasks-size
+[vpc endpoints]:
+  https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html
+[vpc endpoint pricing]: https://aws.amazon.com/privatelink/pricing/
+[cloudwatch logs pricing]: https://aws.amazon.com/cloudwatch/pricing/
+[dynamodb streams]:
+  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
+[dynamodb point-in-time recovery]:
+  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery.html
+[dynamodb pricing]: https://aws.amazon.com/dynamodb/pricing/on-demand/
+[defining glue tables]:
+  https://docs.aws.amazon.com/glue/latest/dg/tables-described.html
+[s3 bucket policies]:
+  https://docs.aws.amazon.com/AmazonS3/latest/dev/using-iam-policies.html
+[using sse with cmks]:
+  https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html
+[customer master keys]:
+  https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#master_keys
+[how to change a key policy]:
+  https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html#key-policy-modifying-how-to
+[cross account s3 access]:
+  https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
+[cross account kms access]:
+  https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying-external-accounts.html
+[updating an ssm parameter]:
+  https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-paramstore-cli.html
+[deploy using the aws cli]:
+  https://docs.aws.amazon.com/cli/latest/reference/cloudformation/deploy/index.html
+[cloudformation console page]:
+  https://console.aws.amazon.com/cloudformation/home
+[changelog]: ../CHANGELOG.md
+[on-demand backup and restore]:
+  https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html
+[aws data pipeline]: https://aws.amazon.com/datapipeline
+[cognito advanced security]:
+  https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-user-pool-settings-advanced-security.html
+[cloudfront access logging permissions]:
+  https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#AccessLogsBucketAndFileOwnership
+[s3 access logging permissions]:
+  https://docs.aws.amazon.com/AmazonS3/latest/dev/enable-logging-programming.html#grant-log-delivery-permissions-general
+[limits]: LIMITS.md
+[aws cloudformation stacksets]:
+  https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/what-is-cfnstacksets.html
+[cognito console]: https://console.aws.amazon.com/cognito
+[managing users in user pools guide]:
+  https://docs.aws.amazon.com/cognito/latest/developerguide/managing-users.html
+[cognito rest api integration guide]:
+  https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-invoke-api-integrated-with-cognito-user-pool.html
+[lake formation data permissions console]:
+  https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html
+[exporting stack output values]:
+  https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-exports.html
+[s3 lifecycle policies]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html

+ 23 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/.openapi-generator-ignore

@@ -0,0 +1,23 @@
+# OpenAPI Generator Ignore
+# Generated by openapi-generator https://github.com/openapitools/openapi-generator
+
+# Use this file to prevent files from being overwritten by the generator.
+# The patterns follow closely to .gitignore or .dockerignore.
+
+# As an example, the C# client generator defines ApiClient.cs.
+# You can make changes and tell OpenAPI Generator to ignore just this file by uncommenting the following line:
+#ApiClient.cs
+
+# You can match any string of characters against a directory, file or extension with a single asterisk (*):
+#foo/*/qux
+# The above matches foo/bar/qux and foo/baz/qux, but not foo/bar/baz/qux
+
+# You can recursively match patterns against a directory, file or extension with a double asterisk (**):
+#foo/**/qux
+# This matches foo/bar/qux, foo/baz/qux, and foo/bar/baz/qux
+
+# You can also negate patterns with an exclamation (!).
+# For example, you can ignore all files in a docs folder with the file extension .md:
+#docs/*.md
+# Then explicitly reverse the ignore rule for a single file:
+#!docs/README.md

+ 110 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/DataMapperApi.md

@@ -0,0 +1,110 @@
+# DataMapperApi
+
+All URIs are relative to *https://your-apigw-id.execute-api.region.amazonaws.com/Prod*
+
+Method | HTTP request | Description
+------------- | ------------- | -------------
+[**DeleteDataMapper**](DataMapperApi.md#deletedatamapper) | **DELETE** /v1/data_mappers/{data_mapper_id} | Removes a data mapper
+[**GetDataMapper**](DataMapperApi.md#getdatamapper) | **GET** /v1/data_mappers/{data_mapper_id} | Returns the details of a data mapper
+[**ListDataMappers**](DataMapperApi.md#listdatamappers) | **GET** /v1/data_mappers | Lists data mappers
+[**PutDataMapper**](DataMapperApi.md#putdatamapper) | **PUT** /v1/data_mappers/{data_mapper_id} | Creates or modifies a data mapper
+
+
+<a name="deletedatamapper"></a>
+## **DeleteDataMapper**
+
+Removes a data mapper
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **DataMapperId** | **String**| Data Mapper ID path parameter | [default to null]
+
+### Return type
+
+null (empty response body)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: Not defined
+
+<a name="getdatamapper"></a>
+## **GetDataMapper**
+
+Returns the details of a data mapper
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **DataMapperId** | **String**| Data Mapper ID path parameter | [default to null]
+
+### Return type
+
+[**DataMapper**](../Models/DataMapper.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
+<a name="listdatamappers"></a>
+## **ListDataMappers**
+
+Lists data mappers
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **StartAt** | **String**| Start at watermark query string parameter | [optional] [default to 0]
+ **PageSize** | **Integer**| Page size query string parameter. Min: 1. Max: 1000 | [optional] [default to null]
+
+### Return type
+
+[**ListOfDataMappers**](../Models/ListOfDataMappers.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
+<a name="putdatamapper"></a>
+## **PutDataMapper**
+
+Creates or modifies a data mapper
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **DataMapperId** | **String**| Data Mapper ID path parameter | [default to null]
+ **DataMapper** | [**DataMapper**](../Models/DataMapper.md)| Request body containing details of the Data Mapper to create or modify |
+
+### Return type
+
+[**DataMapper**](../Models/DataMapper.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: application/json
+- **Accept**: application/json
+
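As an illustration of how the routes in the table above combine with the base URL, the sketch below builds the method and URL for each Data Mapper operation. `build_request` is a hypothetical helper, the base URL is the placeholder from this page, and the authorization required by the Authorizer is not shown.

```python
from urllib.parse import quote

# Methods and paths copied from the table above.
ROUTES = {
    "DeleteDataMapper": ("DELETE", "/v1/data_mappers/{data_mapper_id}"),
    "GetDataMapper": ("GET", "/v1/data_mappers/{data_mapper_id}"),
    "ListDataMappers": ("GET", "/v1/data_mappers"),
    "PutDataMapper": ("PUT", "/v1/data_mappers/{data_mapper_id}"),
}


def build_request(base_url, operation, data_mapper_id=None):
    """Return the (HTTP method, URL) pair for a Data Mapper operation."""
    method, path = ROUTES[operation]
    if "{data_mapper_id}" in path:
        if not data_mapper_id:
            raise ValueError(f"{operation} requires a data mapper ID")
        # Percent-encode the ID so it is safe to use as a path segment.
        path = path.replace("{data_mapper_id}", quote(data_mapper_id, safe=""))
    return method, base_url.rstrip("/") + path


BASE_URL = "https://your-apigw-id.execute-api.region.amazonaws.com/Prod"
print(build_request(BASE_URL, "GetDataMapper", "my mapper"))
print(build_request(BASE_URL, "ListDataMappers"))
```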

+ 131 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/DeletionQueueApi.md

@@ -0,0 +1,131 @@
+# DeletionQueueApi
+
+All URIs are relative to *https://your-apigw-id.execute-api.region.amazonaws.com/Prod*
+
+Method | HTTP request | Description
+------------- | ------------- | -------------
+[**AddItemToDeletionQueue**](DeletionQueueApi.md#additemtodeletionqueue) | **PATCH** /v1/queue | Adds an item to the deletion queue (Deprecated: use PATCH /v1/queue/matches)
+[**AddItemsToDeletionQueue**](DeletionQueueApi.md#additemstodeletionqueue) | **PATCH** /v1/queue/matches | Adds one or more items to the deletion queue
+[**DeleteMatches**](DeletionQueueApi.md#deletematches) | **DELETE** /v1/queue/matches | Removes one or more items from the deletion queue
+[**ListDeletionQueueMatches**](DeletionQueueApi.md#listdeletionqueuematches) | **GET** /v1/queue | Lists deletion queue items
+[**StartDeletionJob**](DeletionQueueApi.md#startdeletionjob) | **DELETE** /v1/queue | Starts a job for the items in the deletion queue
+
+
+<a name="additemtodeletionqueue"></a>
+## **AddItemToDeletionQueue**
+
+Adds an item to the deletion queue (Deprecated: use PATCH /v1/queue/matches)
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **CreateDeletionQueueItem** | [**CreateDeletionQueueItem**](../Models/CreateDeletionQueueItem.md)| Request body containing details of the Match to add to the Deletion Queue |
+
+### Return type
+
+[**DeletionQueueItem**](../Models/DeletionQueueItem.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: application/json
+- **Accept**: application/json
+
+<a name="additemstodeletionqueue"></a>
+## **AddItemsToDeletionQueue**
+
+Adds one or more items to the deletion queue
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **ListOfCreateDeletionQueueItems** | [**ListOfCreateDeletionQueueItems**](../Models/ListOfCreateDeletionQueueItems.md)|  |
+
+### Return type
+
+[**ListOfDeletionQueueItem**](../Models/ListOfDeletionQueueItem.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: application/json
+- **Accept**: application/json
+
+<a name="deletematches"></a>
+## **DeleteMatches**
+
+Removes one or more items from the deletion queue
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **ListOfMatchDeletions** | [**ListOfMatchDeletions**](../Models/ListOfMatchDeletions.md)|  |
+
+### Return type
+
+null (empty response body)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: application/json
+- **Accept**: Not defined
+
+<a name="listdeletionqueuematches"></a>
+## **ListDeletionQueueMatches**
+
+Lists deletion queue items
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **StartAt** | **String**| Start at watermark query string parameter | [optional] [default to 0]
+ **PageSize** | **Integer**| Page size query string parameter. Min: 1. Max: 1000 | [optional] [default to null]
+
+### Return type
+
+[**DeletionQueue**](../Models/DeletionQueue.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
+<a name="startdeletionjob"></a>
+## **StartDeletionJob**
+
+Starts a job for the items in the deletion queue
+
+### Parameters
+This endpoint does not need any parameters.
+
+### Return type
+
+[**Job**](../Models/Job.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
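
Reviewer note on `DeletionQueueApi.md`: `ListDeletionQueueMatches` is paginated through `StartAt` and `PageSize`, and the `DeletionQueue` model returns a `NextStart` watermark. A minimal sketch of a client-side loop follows. It rests on assumptions the diff does not confirm: the query string names `start_at` and `page_size`, an empty or missing `NextStart` meaning "last page", and a hypothetical `fetch_page` callable standing in for the authorized HTTP call.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical transport: receives the query parameters and returns the
# decoded DeletionQueue body, e.g. {"MatchIds": [...], "NextStart": "..."}.
FetchPage = Callable[[Dict[str, str]], dict]


def iter_queue_items(fetch_page: FetchPage, page_size: int = 100) -> Iterator[dict]:
    """Yield every deletion queue item by following NextStart watermarks."""
    # The docs state Min: 1 and Max: 1000 for PageSize.
    if not 1 <= page_size <= 1000:
        raise ValueError("PageSize must be between 1 and 1000")
    start_at: Optional[str] = None
    while True:
        # Assumed query string names; the docs only list StartAt/PageSize.
        params = {"page_size": str(page_size)}
        if start_at is not None:
            params["start_at"] = start_at
        page = fetch_page(params)
        yield from page.get("MatchIds", [])
        # Assumption: an empty or absent NextStart means no further pages.
        start_at = page.get("NextStart") or None
        if start_at is None:
            break
```

Passing the transport in as a callable keeps the loop independent of any particular HTTP library or authorizer setup.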

+ 87 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/JobApi.md

@@ -0,0 +1,87 @@
+# JobApi
+
+All URIs are relative to *https://your-apigw-id.execute-api.region.amazonaws.com/Prod*
+
+Method | HTTP request | Description
+------------- | ------------- | -------------
+[**GetJob**](JobApi.md#getjob) | **GET** /v1/jobs/{job_id} | Returns the details of a job
+[**GetJobEvents**](JobApi.md#getjobevents) | **GET** /v1/jobs/{job_id}/events | Lists all events for a job
+[**ListJobs**](JobApi.md#listjobs) | **GET** /v1/jobs | Lists all jobs
+
+
+<a name="getjob"></a>
+## **GetJob**
+
+Returns the details of a job
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **JobId** | **String**| Job ID path parameter | [default to null]
+
+### Return type
+
+[**Job**](../Models/Job.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
+<a name="getjobevents"></a>
+## **GetJobEvents**
+
+Lists all events for a job
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **JobId** | **String**| Job ID path parameter | [default to null]
+ **StartAt** | **String**| Start at watermark query string parameter | [optional] [default to 0]
+ **PageSize** | **Integer**| Page size query string parameter. Min: 1. Max: 1000 | [optional] [default to null]
+ **Filter** | [**oneOf&lt;string,array&gt;**](../Models/.md)| Filters to apply in the format [key][operator][value]. If multiple filters are supplied, they will be applied on an **AND** basis. Supported keys: EventName. Supported operators: = | [optional] [default to null]
+
+### Return type
+
+[**ListOfJobEvents**](../Models/ListOfJobEvents.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
+<a name="listjobs"></a>
+## **ListJobs**
+
+Lists all jobs
+
+### Parameters
+
+Name | Type | Description  | Notes
+------------- | ------------- | ------------- | -------------
+ **StartAt** | **String**| Start at watermark query string parameter | [optional] [default to 0]
+ **PageSize** | **Integer**| Page size query string parameter. Min: 1. Max: 1000 | [optional] [default to null]
+
+### Return type
+
+[**ListOfJobs**](../Models/ListOfJobs.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+
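
Reviewer note on `JobApi.md`: `GetJobEvents` accepts filters in the form `[key][operator][value]`, with `EventName` as the only supported key and `=` as the only supported operator. The sketch below builds such a filter and a URL-encoded query string. The query string names `filter` and `page_size`, and the event name `ExampleEvent`, are placeholders chosen for illustration, not values taken from the diff.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode

SUPPORTED_KEYS = {"EventName"}   # per the Filter parameter description
SUPPORTED_OPERATORS = {"="}      # per the Filter parameter description


def format_filter(key: str, operator: str, value: str) -> str:
    """Return one filter string in the documented [key][operator][value] form."""
    if key not in SUPPORTED_KEYS:
        raise ValueError(f"unsupported filter key: {key}")
    if operator not in SUPPORTED_OPERATORS:
        raise ValueError(f"unsupported filter operator: {operator}")
    return f"{key}{operator}{value}"


def job_events_query(filters: Iterable[str], page_size: Optional[int] = None) -> str:
    """Build a query string; each filter is sent as a repeated parameter."""
    # Assumption: parameters are named "filter" and "page_size" on the wire.
    params = [("filter", f) for f in filters]
    if page_size is not None:
        if not 1 <= page_size <= 1000:
            raise ValueError("PageSize must be between 1 and 1000")
        params.append(("page_size", str(page_size)))
    return urlencode(params)


# "ExampleEvent" is a made-up event name used only to show the shape.
example = job_events_query([format_filter("EventName", "=", "ExampleEvent")], 50)
```

Note that `urlencode` percent-encodes the `=` inside the filter value, so the operator cannot be confused with the parameter separator.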

+ 30 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Apis/SettingsApi.md

@@ -0,0 +1,30 @@
+# SettingsApi
+
+All URIs are relative to *https://your-apigw-id.execute-api.region.amazonaws.com/Prod*
+
+Method | HTTP request | Description
+------------- | ------------- | -------------
+[**GetSettings**](SettingsApi.md#getsettings) | **GET** /v1/settings | Gets the solution settings
+
+
+<a name="getsettings"></a>
+## **GetSettings**
+
+Gets the solution settings
+
+### Parameters
+This endpoint does not need any parameters.
+
+### Return type
+
+[**Settings**](../Models/Settings.md)
+
+### Authorization
+
+[Authorizer](../README.md#Authorizer)
+
+### HTTP request headers
+
+- **Content-Type**: Not defined
+- **Accept**: application/json
+

+ 11 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/CreateDeletionQueueItem.md

@@ -0,0 +1,11 @@
+# CreateDeletionQueueItem
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Type** | [**String**](string.md) | MatchId Type | [optional] [default to Simple] [enum: Simple, Composite]
+**MatchId** | [**oneOf&lt;string,array&gt;**](oneOf&lt;string,array&gt;.md) | The Match ID to remove from the deletion queue | [default to null]
+**DataMappers** | [**List**](string.md) | The list of data mappers to apply to this Match ID | [optional] [default to ["*"]]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+
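
Reviewer note on `CreateDeletionQueueItem.md`: together with `ListOfCreateDeletionQueueItems`, this model defines the body for `PATCH /v1/queue/matches`. A minimal sketch of assembling that body is shown here, assuming the documented defaults (`Type` of `Simple`, `DataMappers` of `["*"]`). The helper names are hypothetical, and the internal shape of a `Composite` `MatchId` is not described in this diff, so it is passed through unchanged.

```python
import json
from typing import Iterable, List, Optional, Union

VALID_TYPES = ("Simple", "Composite")  # enum from the model table


def make_queue_item(
    match_id: Union[str, list],
    item_type: str = "Simple",
    data_mappers: Optional[Iterable[str]] = None,
) -> dict:
    """Build one CreateDeletionQueueItem dictionary (hypothetical helper)."""
    if item_type not in VALID_TYPES:
        raise ValueError(f"Type must be one of {VALID_TYPES}")
    if item_type == "Simple" and not isinstance(match_id, str):
        raise TypeError("a Simple MatchId is expected to be a string")
    mappers: List[str] = list(data_mappers) if data_mappers else ["*"]
    return {"Type": item_type, "MatchId": match_id, "DataMappers": mappers}


def make_add_items_body(items: Iterable[dict]) -> str:
    """Serialize items as a ListOfCreateDeletionQueueItems JSON body."""
    return json.dumps({"Matches": list(items)})


body = make_add_items_body(
    [make_queue_item("12345"), make_queue_item("67890", data_mappers=["mapper-a"])]
)
```

The resulting string would be sent with `Content-Type: application/json`, as listed under the endpoint's HTTP request headers.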

+ 16 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DataMapper.md

@@ -0,0 +1,16 @@
+# DataMapper
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**DataMapperId** | [**String**](string.md) | The ID of the data mapper | [optional] [default to null]
+**Format** | [**String**](string.md) | The format of the dataset | [optional] [default to parquet] [enum: json, parquet]
+**QueryExecutor** | [**String**](string.md) | The query executor used to query your dataset | [default to null] [enum: athena]
+**Columns** | [**List**](string.md) | Columns to query for MatchIds in the dataset | [default to null]
+**QueryExecutorParameters** | [**DataMapper_QueryExecutorParameters**](DataMapper_QueryExecutorParameters.md) |  | [default to null]
+**RoleArn** | [**String**](string.md) | Role ARN to assume when performing operations in S3 for this data mapper. The role must have the exact name 'S3F2DataAccessRole'. | [default to null]
+**DeleteOldVersions** | [**Boolean**](boolean.md) | Toggles deleting all non-latest versions of an object after a new redacted version is created | [optional] [default to true]
+**IgnoreObjectNotFoundExceptions** | [**Boolean**](boolean.md) | Toggles ignoring Object Not Found errors during deletion | [optional] [default to false]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 12 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DataMapperQueryExecutorParameters.md

@@ -0,0 +1,12 @@
+# DataMapperQueryExecutorParameters
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**DataCatalogProvider** | [**String**](string.md) | The data catalog provider which contains the database table with metadata about your S3 data lake | [optional] [default to null] [enum: glue]
+**Database** | [**String**](string.md) | The database in the data catalog which contains the metadata table | [default to null]
+**Table** | [**String**](string.md) | The table in the data catalog database containing the metadata for your data lake | [default to null]
+**PartitionKeys** | [**List**](string.md) | The partition keys to use on each query. This lets you control the number and size of the queries. When omitted, all the table partitions are used. | [optional] [default to null]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 10 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DeletionQueue.md

@@ -0,0 +1,10 @@
+# DeletionQueue
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**MatchIds** | [**List**](DeletionQueueItem.md) | The list of Match IDs currently in the queue | [optional] [default to null]
+**NextStart** | [**String**](string.md) | The watermark to use when requesting the next page of results | [optional] [default to ]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 13 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/DeletionQueueItem.md

@@ -0,0 +1,13 @@
+# DeletionQueueItem
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**DeletionQueueItemId** | [**String**](string.md) | The Deletion Queue Item unique identifier | [default to null]
+**Type** | [**String**](string.md) | MatchId Type | [default to Simple] [enum: Simple, Composite]
+**MatchId** | [**oneOf&lt;string,array&gt;**](oneOf&lt;string,array&gt;.md) | The Match ID to remove from the deletion queue | [default to null]
+**CreatedAt** | [**Integer**](integer.md) | Deletion queue item creation date as Epoch timestamp | [default to null]
+**DataMappers** | [**List**](string.md) | The list of data mappers to apply to this Match ID | [default to null]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 9 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/Error.md

@@ -0,0 +1,9 @@
+# Error
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Message** | [**String**](string.md) | Error message | [default to null]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 32 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/Job.md

@@ -0,0 +1,32 @@
+# Job
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Id** | [**String**](string.md) | The Job ID | [default to null]
+**JobStatus** | [**String**](string.md) | The Job status. When a job is first created, it remains in QUEUED status until the workflow starts | [default to QUEUED] [enum: QUEUED, RUNNING, FORGET_COMPLETED_CLEANUP_IN_PROGRESS, COMPLETED, COMPLETED_CLEANUP_FAILED, FAILED, FIND_FAILED, FORGET_FAILED, FORGET_PARTIALLY_FAILED]
+**CreatedAt** | [**Integer**](integer.md) | Job creation date as Epoch timestamp | [default to null]
+**JobStartTime** | [**Integer**](integer.md) | Job start date as Epoch timestamp | [optional] [default to null]
+**JobFinishTime** | [**Integer**](integer.md) | Job finish date as Epoch timestamp | [optional] [default to null]
+**AthenaConcurrencyLimit** | [**Integer**](integer.md) | Athena concurrency setting for this job | [default to null]
+**AthenaQueryMaxRetries** | [**Integer**](integer.md) | Max number of retries for each Athena query after a failure | [default to null]
+**DeletionTasksMaxNumber** | [**Integer**](integer.md) | Max Fargate tasks setting for this job | [default to null]
+**ForgetQueueWaitSeconds** | [**Integer**](integer.md) | Forget queue wait setting for this job | [default to null]
+**QueryExecutionWaitSeconds** | [**Integer**](integer.md) | Query execution wait setting for this job | [default to null]
+**QueryQueueWaitSeconds** | [**Integer**](integer.md) | Query queue worker wait setting for this job | [default to null]
+**TotalObjectUpdatedCount** | [**Integer**](integer.md) | Total number of successfully updated objects | [optional] [default to 0]
+**TotalObjectUpdateSkippedCount** | [**Integer**](integer.md) | Total number of skipped objects | [optional] [default to 0]
+**TotalObjectUpdateFailedCount** | [**Integer**](integer.md) | Total number of objects which could not be successfully updated | [optional] [default to 0]
+**TotalObjectRollbackFailedCount** | [**Integer**](integer.md) | Total number of objects which could not be successfully rolled back after detecting an integrity conflict | [optional] [default to 0]
+**TotalQueryCount** | [**Integer**](integer.md) | Total number of queries executed during the find phase | [optional] [default to 0]
+**TotalQueryFailedCount** | [**Integer**](integer.md) | Total number of unsuccessfully executed queries during the find phase | [optional] [default to 0]
+**TotalQueryScannedInBytes** | [**Integer**](integer.md) | Total amount of data scanned during the find phase | [optional] [default to 0]
+**TotalQuerySucceededCount** | [**Integer**](integer.md) | Total number of successfully executed queries during the find phase | [optional] [default to 0]
+**TotalQueryTimeInMillis** | [**Integer**](integer.md) | Total time spent by the query executor for this job | [optional] [default to 0]
+**Expires** | [**Integer**](integer.md) | Expiry date when the item will be deleted as Epoch time | [optional] [default to null]
+**Sk** | [**String**](string.md) | Internal field used as part of DynamoDB single table design | [default to null]
+**Type** | [**String**](string.md) | Internal field used as part of DynamoDB single table design | [default to null] [enum: Job]
+**GSIBucket** | [**String**](string.md) | Internal field used as part of DynamoDB single table design | [default to null]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+
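
Reviewer note on `Job.md`: a client polling `GET /v1/jobs/{job_id}` needs to know when to stop. The sketch below splits the documented `JobStatus` enum into in-progress and finished values. That split is an inference from the status names, not something the diff states, and the duration helper assumes `JobStartTime` and `JobFinishTime` share the same epoch unit.

```python
from typing import Optional

# Full enum as listed in the Job model table.
ALL_STATUSES = {
    "QUEUED",
    "RUNNING",
    "FORGET_COMPLETED_CLEANUP_IN_PROGRESS",
    "COMPLETED",
    "COMPLETED_CLEANUP_FAILED",
    "FAILED",
    "FIND_FAILED",
    "FORGET_FAILED",
    "FORGET_PARTIALLY_FAILED",
}

# Assumption: these three values mean the workflow is still running.
IN_PROGRESS_STATUSES = {"QUEUED", "RUNNING", "FORGET_COMPLETED_CLEANUP_IN_PROGRESS"}


def is_finished(job: dict) -> bool:
    """Return True when the job's status is outside the in-progress set."""
    status = job["JobStatus"]
    if status not in ALL_STATUSES:
        raise ValueError(f"unknown JobStatus: {status}")
    return status not in IN_PROGRESS_STATUSES


def job_duration(job: dict) -> Optional[int]:
    """Return finish minus start, or None while either timestamp is absent."""
    start = job.get("JobStartTime")    # optional field per the model
    finish = job.get("JobFinishTime")  # optional field per the model
    if start is None or finish is None:
        return None
    return finish - start
```

Raising on an unknown status is deliberate: if the enum grows in a later release, a poller fails loudly instead of looping forever.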

+ 16 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/JobEvent.md

@@ -0,0 +1,16 @@
+# JobEvent
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Id** | [**String**](string.md) | The Job ID | [optional] [default to null]
+**CreatedAt** | [**Integer**](integer.md) | Job creation date as Epoch timestamp | [optional] [default to null]
+**EventName** | [**String**](string.md) | The Job Event name | [optional] [default to null]
+**EventData** | [**Object**](.md) | Free form field containing data about the event. Structure varies based on the event | [optional] [default to null]
+**EmitterId** | [**String**](string.md) | The identifier for the service or service instance which emitted the event | [optional] [default to null]
+**Expires** | [**Integer**](integer.md) | Expiry date when the item will be deleted as Epoch time | [optional] [default to null]
+**Sk** | [**String**](string.md) | Internal field used as part of DynamoDB single table design | [optional] [default to null]
+**Type** | [**String**](string.md) | Internal field used as part of DynamoDB single table design | [optional] [default to null] [enum: Job]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 22 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/JobSummary.md

@@ -0,0 +1,22 @@
+# JobSummary
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Id** | [**String**](string.md) | The Job ID | [default to null]
+**JobStatus** | [**String**](string.md) | The Job status. When a job is first created, it remains in QUEUED status until the workflow starts | [default to QUEUED] [enum: QUEUED, RUNNING, FORGET_COMPLETED_CLEANUP_IN_PROGRESS, COMPLETED, COMPLETED_CLEANUP_FAILED, FAILED, FIND_FAILED, FORGET_FAILED, FORGET_PARTIALLY_FAILED]
+**CreatedAt** | [**Integer**](integer.md) | Job creation date as Epoch timestamp | [default to null]
+**JobStartTime** | [**Integer**](integer.md) | Job start date as Epoch timestamp | [optional] [default to null]
+**JobFinishTime** | [**Integer**](integer.md) | Job finish date as Epoch timestamp | [optional] [default to null]
+**TotalObjectUpdatedCount** | [**Integer**](integer.md) | Total number of successfully updated objects | [optional] [default to 0]
+**TotalObjectUpdateSkippedCount** | [**Integer**](integer.md) | Total number of skipped objects | [optional] [default to 0]
+**TotalObjectUpdateFailedCount** | [**Integer**](integer.md) | Total number of objects which could not be successfully updated | [optional] [default to 0]
+**TotalObjectRollbackFailedCount** | [**Integer**](integer.md) | Total number of objects which could not be successfully rolled back after detecting an integrity conflict | [optional] [default to 0]
+**TotalQueryCount** | [**Integer**](integer.md) | Total number of queries executed during the find phase | [optional] [default to 0]
+**TotalQueryFailedCount** | [**Integer**](integer.md) | Total number of unsuccessfully executed queries during the find phase | [optional] [default to 0]
+**TotalQueryScannedInBytes** | [**Integer**](integer.md) | Total amount of data scanned during the find phase | [optional] [default to 0]
+**TotalQuerySucceededCount** | [**Integer**](integer.md) | Total number of successfully executed queries during the find phase | [optional] [default to 0]
+**TotalQueryTimeInMillis** | [**Integer**](integer.md) | Total time spent by the query executor for this job | [optional] [default to 0]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 9 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/ListOfCreateDeletionQueueItems.md

@@ -0,0 +1,9 @@
+# ListOfCreateDeletionQueueItems
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Matches** | [**List**](CreateDeletionQueueItem.md) | List of Deletion Queue Items | [default to null]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 10 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/ListOfDataMappers.md

@@ -0,0 +1,10 @@
+# ListOfDataMappers
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**DataMappers** | [**List**](DataMapper.md) | The list of data mappers | [optional] [default to null]
+**NextStart** | [**String**](string.md) | The watermark to use when requesting the next page of results | [optional] [default to ]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

+ 9 - 0
S3/NewFind/amazon-s3-find-and-forget-master/docs/api/Models/ListOfDeletionQueueItem.md

@@ -0,0 +1,9 @@
+# ListOfDeletionQueueItem
+## Properties
+
+Name | Type | Description | Notes
+------------ | ------------- | ------------- | -------------
+**Matches** | [**List**](DeletionQueueItem.md) | List of Deletion Queue Item objects | [default to null]
+
+[[Back to Model list]](../README.md#documentation-for-models) [[Back to API list]](../README.md#documentation-for-api-endpoints) [[Back to README]](../README.md)
+

Some files were not shown because too many files changed in this diff