  1. .. image:: https://github.com/ckan/ckanext-archiver/actions/workflows/test.yml/badge.svg
  2. :target: https://github.com/ckan/ckanext-archiver/actions/workflows/test.yml
================
ckanext-archiver
================
  6. Overview
  7. --------
The CKAN Archiver Extension downloads all of a CKAN site's resources, for three purposes:

1. to offer the user a 'cached' copy, in case the link becomes broken
2. to tell the user (and publishers) if the link is broken, on both the dataset/resource and in a 'Broken Links' report
3. to let other extensions, such as ckanext-qa or ckanext-packagezip, analyse the downloaded file.
  12. Demo:
  13. .. image:: archiver_resource.png
  14. :alt: Broken link check info and a cached copy offered on resource
  15. .. image:: archiver_report.png
  16. :alt: Broken link report
  17. Compatibility: Requires CKAN version 2.1 or later
  18. TODO:
  19. * Show brokenness on the package page (not just the resources)
  20. * Prettify the html bits
  21. * Add brokenness to search facets using IFacet
  22. Operation
  23. ---------
When a resource is archived, the information about the archival - whether it failed, the filename on disk, file size etc - is stored in the Archival table. (In ckanext-archiver v0.1 it was stored in TaskStatus and on the Resource itself.) This is added to the dataset during the package_show call (using a schema key), so the information is also available over the API.
  25. Other extensions can subscribe to the archiver's ``IPipe`` interface to hear about datasets being archived. e.g. ckanext-qa will detect its file type and give it an openness score, or ckanext-packagezip will create a zip of the files in a dataset.
  26. Archiver works on Celery queues, so when Archiver is notified of a dataset/resource being created or updated, it puts an 'update request' on a queue. Celery calls the Archiver 'update task' to do each archival. You can start Celery with multiple processes, to archive in parallel.
  27. You can also trigger an archival using paster on the command-line.
  28. By default, two queues are used:
  29. 1. 'bulk' for a regular archival of all the resources
2. 'priority' for when a user edits a one-off resource

This means that the 'bulk' queue can happily work through large quantities slowly, such as re-archiving every single resource once a week. Meanwhile, if a new resource is put into CKAN, it can be downloaded straight away via the 'priority' queue.
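
The split between the two queues can be pictured with a small standalone sketch. This is not the extension's code: ``choose_queue`` and the trigger labels are hypothetical, and only the queue names 'bulk' and 'priority' come from the description above::

    # Hypothetical illustration only; not part of ckanext-archiver.
    BULK_QUEUE = 'bulk'
    PRIORITY_QUEUE = 'priority'


    def choose_queue(trigger):
        """Return the queue name for an archival request.

        ``trigger`` is an assumed label: 'user-edit' for a resource a user
        has just created or changed, anything else for scheduled
        re-archiving of existing resources.
        """
        if trigger == 'user-edit':
            return PRIORITY_QUEUE
        return BULK_QUEUE

Because each queue is served by its own worker, a long backlog on 'bulk' does not delay requests placed on 'priority'.
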
  32. Installation
  33. ------------
  34. To install ckanext-archiver:
  35. 1. Activate your CKAN virtual environment, for example::
  36. . /usr/lib/ckan/default/bin/activate
  37. 2. Install the ckanext-archiver and ckanext-report Python packages into your virtual environment::
  38. pip install -e git+http://github.com/datagovuk/ckanext-report.git#egg=ckanext-report
  39. pip install -e git+http://github.com/ckan/ckanext-archiver.git#egg=ckanext-archiver
  40. 3. Install the archiver dependencies::
  41. pip install -r ckanext-archiver/requirements.txt
  42. 4. Now create the database tables::
  43. paster --plugin=ckanext-archiver archiver init --config=production.ini
  44. paster --plugin=ckanext-report report initdb --config=production.ini
5. Add ``archiver report`` to the ``ckan.plugins`` setting in your CKAN
   config file (by default the config file is located at
   ``/etc/ckan/default/production.ini``).

6. Install a Celery queue backend - see later section.

7. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu::
  50. sudo service apache2 reload
  51. Upgrade from version 0.1 to 2.x
  52. -------------------------------
NB If you are upgrading ckanext-archiver and use ckanext-qa too, then you will need to upgrade ckanext-qa to version 2.x at the same time.
  54. NB Previously you needed both ckanext-archiver and ckanext-qa to see the broken link report. This functionality has now moved to ckanext-archiver. So now you only need ckanext-qa if you want the 5 stars of openness functionality.
  55. 1. Activate your CKAN virtual environment, for example::
  56. . /usr/lib/ckan/default/bin/activate
2. Install ckanext-report (if not already installed)::
  58. pip install -e git+http://github.com/datagovuk/ckanext-report.git#egg=ckanext-report
  59. 3. Add ``report`` to the ``ckan.plugins`` setting in your CKAN config file (it
  60. should already have ``archiver``) (by default the config file is located at
  61. ``/etc/ckan/default/production.ini``).
  62. 4. Also in your CKAN config file, rename old config option keys if you have them:
  63. * ``ckan.cache_url_root`` to ``ckanext-archiver.cache_url_root``
  64. * ``ckanext.archiver.user_agent_string`` to ``ckanext-archiver.user_agent_string``
  65. 5. Upgrade the ckanext-archiver Python package::
  66. cd ckanext-archiver
  67. git pull
  68. python setup.py develop
  69. 6. Create the new database tables::
  70. paster --plugin=ckanext-archiver archiver init --config=production.ini
  71. 7. Ensure the archiver dependencies are installed::
  72. pip install -r requirements.txt
  73. 8. Install the developer dependencies, needed for the migration::
  74. pip install -r dev-requirements.txt
  75. 9. Migrate your database to the new Archiver tables::
  76. python ckanext/archiver/bin/migrate_task_status.py --write production.ini
  77. Migrations post 2.0
  78. -------------------
  79. Over time it is possible that the database structure will change. In these cases you can use the migrate command to update the database schema.
  80. ::
  81. paster --plugin=ckanext-archiver archiver migrate -c <path to CKAN ini file>
  82. This is only necessary if you update ckanext-archiver and already have the database tables in place.
  83. Installing a Celery queue backend
  84. ---------------------------------
  85. Archiver uses Celery to manage its 'queues'. You need to install a queue back-end, such as Redis or RabbitMQ.
Redis backend
~~~~~~~~~~~~~
  88. Redis can be installed like this::
  89. sudo apt-get install redis-server
Install the Python library into your Python environment::

    /usr/lib/ckan/default/bin/pip install redis==2.10.1
  92. It must then be configured in your CKAN config (e.g. production.ini) by inserting a new section, e.g. before `[app:main]`::
  93. [app:celery]
  94. BROKER_BACKEND = redis
  95. BROKER_HOST = redis://localhost/1
  96. CELERY_RESULT_BACKEND = redis
  97. REDIS_HOST = 127.0.0.1
  98. REDIS_PORT = 6379
  99. REDIS_DB = 0
  100. REDIS_CONNECT_RETRY = True
  101. Number of items in the queue 'bulk'::
  102. redis-cli -n 1 LLEN bulk
  103. See item 0 in the queue (which is the last to go on the queue & last to be processed)::
  104. redis-cli -n 1 LINDEX bulk 0
  105. To delete all the items on the queue::
  106. redis-cli -n 1 DEL bulk
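
The same queue checks can be made from Python. The following is a minimal sketch, assuming the redis-py client installed above and Python 3; ``queue_summary`` is a hypothetical helper, not part of ckanext-archiver, and it only uses the ``llen`` and ``lindex`` client methods::

    def queue_summary(client, queue_name='bulk'):
        """Return (length, item_at_index_0) for a queue stored as a Redis list.

        ``client`` is any object with redis-py style ``llen`` and ``lindex``
        methods. ``lindex`` returns None when the queue is empty.
        """
        length = client.llen(queue_name)       # same as: LLEN bulk
        first = client.lindex(queue_name, 0)   # same as: LINDEX bulk 0
        return length, first

With redis-py installed, calling ``queue_summary(redis.StrictRedis(host='localhost', port=6379, db=1))`` would inspect the same database as ``redis-cli -n 1``.
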
  107. Installing SNI support
  108. ----------------------
  109. When archiving resources on servers which use HTTPS, you might encounter this error::
  110. requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
Whilst this could possibly be a problem with the server, it is most likely because you need to install SNI support on the machine that ckanext-archiver runs on. Server Name Indication (SNI) is for when a server has multiple SSL certificates, which is a relatively new feature in HTTPS. It requires installing a recent version of OpenSSL plus the Python libraries to make use of this feature.

If you have SNI support installed then this command should run without the above error::

    python -c 'import requests; requests.get("https://files.datapress.com")'
  114. On Ubuntu 12.04 you can install SNI support by doing this::
  115. sudo apt-get install libffi-dev
  116. . /usr/lib/ckan/default/bin/activate
  117. pip install 'cryptography==0.9.3' pyOpenSSL ndg-httpsclient pyasn1
You should also check that your OpenSSL version is 1.0.0 or later::

    python -c "import ssl; print(ssl.OPENSSL_VERSION)"

SNI was reportedly added in OpenSSL version 0.9.8j, but there are reported problems with 0.9.8y, 0.9.8zc & 0.9.8zg, so 1.0.0+ is recommended.
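
The version check can also be scripted. This is a minimal sketch using only the standard library ``ssl`` module; ``sni_report`` is a hypothetical helper name. It describes the interpreter's own ``ssl`` module, so on an old Python 2 that relies on pyOpenSSL for SNI it may report ``has_sni`` as False even though requests works::

    import ssl


    def sni_report():
        """Summarise the TLS support of the Python interpreter running this code."""
        return {
            'openssl_version': ssl.OPENSSL_VERSION,
            # OPENSSL_VERSION_INFO is a tuple of integers, compared here
            # against the recommended minimum of 1.0.0.
            'openssl_at_least_1_0_0': ssl.OPENSSL_VERSION_INFO >= (1, 0, 0),
            # HAS_SNI is absent on old Python 2 releases, so default to False.
            'has_sni': getattr(ssl, 'HAS_SNI', False),
        }


    print(sni_report())
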
  121. For more about enabling SNI in python requests see:
  122. * https://stackoverflow.com/questions/18578439/using-requests-with-tls-doesnt-give-sni-support/18579484#18579484
  123. * https://github.com/kennethreitz/requests/issues/2022
  124. Config settings
  125. ---------------
  126. 1. Enabling Archiver to listen to resource changes
  127. If you want the archiver to run automatically when a new CKAN resource is added, or the url of a resource is changed,
  128. then edit your CKAN config file (eg: development.ini) to enable the extension:
  129. ::
  130. ckan.plugins = archiver
  131. If there are other plugins activated, add this to the list (each plugin should be separated with a space).
  132. **Note:** You can still run the archiver manually (from the command line) on specific resources or on all resources
  133. in a CKAN instance without enabling the plugin. See section 'Using Archiver' for details.
  134. 2. Other CKAN config options
  135. The following config variable should also be set in your CKAN config:
  136. * ``ckan.site_url`` = URL to your CKAN instance
  137. This is the URL that the archive process (in Celery) will use to access the CKAN API to update it about the cached URLs. If your internal network names your CKAN server differently, then specify this internal name in config option: ``ckan.site_url_internally``
  138. 3. Additional Archiver settings
  139. Add the settings to the CKAN config file:
  140. * ``ckanext-archiver.archive_dir`` = path to the directory that archived files will be saved to (e.g. ``/www/resource_cache``)
  141. * ``ckanext-archiver.cache_url_root`` = URL where you will be publicly serving the cached files stored locally at ckanext-archiver.archive_dir.
  142. * ``ckanext-archiver.max_content_length`` = the maximum size (in bytes) of files to archive (default ``50000000`` =50MB)
  143. * ``ckanext-archiver.user_agent_string`` = identifies the archiver to servers it archives from
* ``ckanext-archiver.verify_https`` = true/false: whether to verify HTTPS connections, so that archival fails when the URL specifies HTTPS but the certificate does not verify.
  145. 4. Nightly report generation
  146. Configure the reports to be generated each night using cron. e.g.::
  147. 0 6 * * * www-data /usr/lib/ckan/default/bin/paster --plugin=ckanext-report report generate --config=/etc/ckan/default/production.ini
  148. 5. Your web server should serve the files from the archive_dir.
  149. With nginx you insert a new ``location`` after the ckan one. e.g. here we have configured ``ckanext-archiver.archive_dir`` to ``/www/resource_cache`` and serve these files at location ``/resource_cache`` (i.e. ``http://mysite.com/resource_cache`` )::

    server {
        # ckan
        location / {
            proxy_pass http://127.0.0.1:8080/;
            ...
        }
        # archived files
        location /resource_cache {
            # nginx appends the request URI to "root", so this serves
            # /www/resource_cache/<path> for /resource_cache/<path>
            root /www;
        }
    }
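
The relationship between ``ckanext-archiver.archive_dir`` and ``ckanext-archiver.cache_url_root`` can be checked with a short standalone script. This is a sketch under stated assumptions: it is written for Python 3, ``load_archiver_settings`` and ``cached_url`` are hypothetical helpers rather than archiver functions, the settings are assumed to live in the ``[app:main]`` section, and the ``ab/data.csv`` path used in the example afterwards is made up rather than the archiver's real directory layout::

    import configparser
    import posixpath

    REQUIRED_KEYS = (
        'ckanext-archiver.archive_dir',
        'ckanext-archiver.cache_url_root',
    )


    def load_archiver_settings(ini_text, section='app:main'):
        """Parse CKAN-style ini text and return that section as a dict."""
        # interpolation=None leaves values such as %(here)s untouched.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(ini_text)
        settings = dict(parser[section])
        missing = [key for key in REQUIRED_KEYS if key not in settings]
        if missing:
            raise ValueError('missing settings: %s' % ', '.join(missing))
        # 50000000 is the default quoted above for max_content_length.
        settings['ckanext-archiver.max_content_length'] = int(
            settings.get('ckanext-archiver.max_content_length', 50000000))
        return settings


    def cached_url(settings, file_path):
        """Map a file under archive_dir to the URL it should be served at."""
        archive_dir = settings['ckanext-archiver.archive_dir']
        relative = posixpath.relpath(file_path, archive_dir)
        if relative == '..' or relative.startswith('../'):
            raise ValueError('%s is not inside %s' % (file_path, archive_dir))
        url_root = settings['ckanext-archiver.cache_url_root'].rstrip('/')
        return url_root + '/' + relative

With ``archive_dir`` set to ``/www/resource_cache`` and ``cache_url_root`` set to ``http://mysite.com/resource_cache``, ``cached_url(settings, '/www/resource_cache/ab/data.csv')`` gives ``http://mysite.com/resource_cache/ab/data.csv``, which is the URL the web server needs to answer.
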
  160. Legacy settings
  161. ~~~~~~~~~~~~~~~
  162. Older versions of ckanext-archiver put these settings in
  163. ckanext/archiver/settings.py as variables ARCHIVE_DIR and MAX_CONTENT_LENGTH
  164. but this is no longer available.
  165. There used to be an option DATA_FORMATS for filtering the resources
  166. archived, but that has now been removed in ckanext-archiver v2.0, since it
  167. is now not only caching files, but is seen as a broken link checker, which
  168. applies whatever the format.
  169. Using Archiver
  170. --------------
  171. First, make sure that Celery is running for each queue. For test/local use, you can run::
  172. paster --plugin=ckanext-archiver celeryd2 run all -c development.ini
  173. However in production you'd run the priority and bulk queues separately, or else the priority queue will not have any priority over the bulk queue. This can be done by running these two commands in separate terminals::
  174. paster --plugin=ckanext-archiver celeryd2 run priority -c production.ini
  175. paster --plugin=ckanext-archiver celeryd2 run bulk -c production.ini
For production use, we recommend setting up Celery to run with supervisord. Run ``apt-get install supervisor`` and use ``bin/celery-supervisor.conf`` as a configuration template.

If you are running CKAN 2.7 or higher, configure job workers instead: see http://docs.ckan.org/en/2.8/maintaining/background-tasks.html#using-supervisor
  178. An archival can be triggered by adding a dataset with a resource or updating a resource URL. Alternatively you can run::
  179. paster --plugin=ckanext-archiver archiver update [dataset] --queue=priority -c <path to CKAN config>
  180. Here ``dataset`` is a CKAN dataset name or ID, or you can omit it to archive all datasets.
  181. For a full list of manual commands run::
  182. paster --plugin=ckanext-archiver archiver --help
  183. Once you've done some archiving you can generate a Broken Links report::
  184. paster --plugin=ckanext-report report generate broken-links --config=production.ini
  185. And view it on your CKAN site at ``/report/broken-links``.
  186. Testing
  187. -------
  188. To run the tests:
  189. 1. Activate your CKAN virtual environment, for example::
  190. . /usr/lib/ckan/default/bin/activate
  191. 2. If not done already, install the dev requirements::
    (pyenv)~/pyenv/src/ckan$ pip install -r ../ckanext-archiver/dev-requirements.txt
  193. 3. From the CKAN root directory (not the extension root) do::
  194. (pyenv)~/pyenv/src/ckan$ nosetests --ckan ../ckanext-archiver/tests/ --with-pylons=../ckanext-archiver/test-core.ini
Translations
------------

To translate the plugin to a new language (e.g. "pl") run ``python setup.py init_catalog -l pl``.

To update the template file with new translations added in the code or templates,
run ``python setup.py extract_messages`` in the root plugin directory. Then run
``./ckanext/archiver/i18n/unique_pot.sh -v`` to strip other plugins' translations.

To update the translation files for locale "pl" with the new template run ``python setup.py update_catalog -l pl``.
  202. Building Debian package
  203. -----------------------
NB this attempt at creating a Debian package is experimental. Important package dependencies have yet to be specified. The outstanding issue is that some dependencies do not exist as Debian packages (e.g. messytables).
  205. To build the debian package::
  206. cd ckanext-archiver; dpkg-buildpackage -us -uc -i -I -rfakeroot
  207. To list the package contents::
  208. dpkg --contents ../python-ckanext-archiver_0.1-1_all.deb
  209. Questions
  210. ---------
  211. The archiver information is not appearing on the resource page
  212. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  213. Check that it is appearing in the API for the dataset - see question below.
  214. The archiver information is not appearing in the API (package_show)
  215. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
i.e. if you browse this path on your website: ``/api/action/package_show?id=<package_name>`` then you don't see the ``archiver`` key at the dataset level or resource level.

Check that the ``paster archiver update`` command completed OK. Check that ``paster celeryd2 run`` has done the archiving OK. Check the dataset has at least one resource. Check that you have ``archiver`` in your ``ckan.plugins`` and have restarted CKAN.
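
To see quickly where the key is missing, the API response can be inspected with a few lines of Python. This is a minimal sketch: ``missing_archiver_info`` is a hypothetical helper, and it only checks for the presence of the ``archiver`` key at the two levels mentioned above, without assuming anything about what the key contains::

    def missing_archiver_info(package):
        """List the places in a package_show result that lack an 'archiver' key.

        ``package`` is the ``result`` dict from the JSON returned by
        ``/api/action/package_show``.
        """
        missing = []
        if 'archiver' not in package:
            missing.append('dataset')
        for resource in package.get('resources', []):
            if 'archiver' not in resource:
                missing.append('resource %s' % resource.get('id'))
        return missing

An empty list means the key is present on the dataset and on every resource. A dataset with no resources only ever reports on the dataset level, which matches the check above that the dataset has at least one resource.
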
  218. 'SSL handshake' error
  219. ~~~~~~~~~~~~~~~~~~~~~
  220. When archiving resources on servers which use HTTPS, you might encounter this error::
  221. requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
This is probably because you don't have SNI support, which requires installing a recent version of OpenSSL - see section "Installing SNI support".