Usage
-----

Quick start using pywb_ (Google Chrome must be installed already):

.. code:: bash

    pip install crocoite pywb
    crocoite http://example.com/ example.com.warc.gz
    wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080

.. _pywb: https://github.com/ikreymer/pywb
It is recommended to install at least Microsoft’s Corefonts_ as well as
DejaVu_, Liberation_ or a similar font family covering a wide range of
character sets. Otherwise page screenshots may be unusable due to missing
glyphs.

.. _Corefonts: http://corefonts.sourceforge.net/
.. _DejaVu: https://dejavu-fonts.github.io/
.. _Liberation: https://pagure.io/liberation-fonts
Recursion
^^^^^^^^^

.. program:: crocoite

By default crocoite will only retrieve the URL specified on the command line.
However, it can follow links as well. There are currently two recursion
strategies available, depth- and prefix-based.

.. code:: bash

    crocoite -r 1 https://example.com/ example.com.warc.gz

will retrieve ``example.com`` and all pages directly referred to by it.
Increasing the number increases the depth, so a value of :samp:`2` would first
grab ``example.com``, then queue all pages linked there as well as every
reference on each of those pages.
On the other hand

.. code:: bash

    crocoite -r prefix https://example.com/dir/ example.com.warc.gz

will retrieve the URL specified and all pages referenced which have the same
URL prefix. The trailing slash is significant: without it crocoite would also
grab ``/dir-something`` or ``/dir.html``, for example.
If an output file template is used, each page is written to an individual
file. For example

.. code:: bash

    crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'

will write one file per page to files like
:file:`example.com-2019-09-09T15:15:15+02:00-1.warc.gz`. ``seqnum`` is unique
to each page of a single job and should always be used.
When running a recursive job, increasing the concurrency (i.e. how many pages
are fetched at the same time) can speed up the process. For example, you can
pass :option:`-j` :samp:`4` to retrieve four pages at the same time. Keep in
mind that each process starts a full browser that requires a lot of resources
(one to two GB of RAM and one or two CPU cores).
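Putting concurrency, recursion and an output template together, a recursive
crawl might be started like this (the host and template values are only
illustrative):

.. code:: bash

    # Fetch up to four pages in parallel, recursing by URL prefix;
    # the output template writes each page to its own WARC file.
    crocoite -r prefix -j 4 https://example.com/ \
        '{host}-{date}-{seqnum}.warc.gz'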
Customizing
^^^^^^^^^^^

.. program:: crocoite-single

Under the hood :program:`crocoite` starts one instance of
:program:`crocoite-single` to fetch each page. You can customize its options
by appending a command template like this:

.. code:: bash

    crocoite -r prefix https://example.com example.com.warc.gz -- \
        crocoite-single --timeout 5 -k '{url}' '{dest}'

This reduces the global timeout to 5 seconds and ignores TLS errors. If an
option is prefixed with an exclamation mark (``!``) it will not be expanded.
This is useful for passing :option:`--warcinfo`, which expects JSON-encoded
data.
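As a sketch of the ``!`` escape (the JSON payload here is purely
illustrative), prefixing the argument keeps its braces from being treated as
template fields:

.. code:: bash

    # Without the leading "!", the braces in the JSON argument would be
    # interpreted as substitution fields of the command template.
    crocoite -r prefix https://example.com example.com.warc.gz -- \
        crocoite-single --warcinfo '!{"operator": "alice"}' '{url}' '{dest}'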
Command line options
^^^^^^^^^^^^^^^^^^^^

Below is a list of all command line arguments available:

.. program:: crocoite

crocoite
++++++++

Front-end with recursion support and simple job management.

.. option:: -j N, --concurrency N

    Maximum number of concurrent fetch jobs.

.. option:: -r POLICY, --recursion POLICY

    Enables recursion based on POLICY, which can be a positive integer
    (recursion depth) or the string :kbd:`prefix`.

.. option:: --tempdir DIR

    Directory for temporary WARC files.
.. program:: crocoite-single

crocoite-single
+++++++++++++++

Back-end to fetch a single page.

.. option:: -b SET-COOKIE, --cookie SET-COOKIE

    Add a cookie to the browser’s cookie jar. This option always *appends*
    cookies, replacing those provided by :option:`-c`.

    .. versionadded:: 1.1

.. option:: -c FILE, --cookie-jar FILE

    Load cookies from FILE. :program:`crocoite` provides a default cookie
    file, which contains cookies to, for example, circumvent age restrictions.
    This option *replaces* that default file.

    .. versionadded:: 1.1

.. option:: --idle-timeout SEC

    Time after which a page is considered “idle”.

.. option:: -k, --insecure

    Allow insecure connections, i.e. self-signed or expired HTTPS
    certificates.

.. option:: --timeout SEC

    Global archiving timeout.

.. option:: --warcinfo JSON

    Inject additional JSON-encoded information into the resulting WARC.
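For instance, the two cookie options could be combined through a command
template like the following sketch (the jar file name and the cookie string
are only illustrative):

.. code:: bash

    # Replace the default cookie jar with our own file, then append one
    # extra cookie on top of it.
    crocoite https://example.com/ example.com.warc.gz -- \
        crocoite-single -c my-cookies.txt \
            -b 'session=0123abcd; Domain=example.com' '{url}' '{dest}'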
IRC bot
^^^^^^^

A simple IRC bot (“chromebot”) is provided with the command
:program:`crocoite-irc`. It reads its configuration from a config file like
the example provided in :file:`contrib/chromebot.json` and supports the
following commands:

a <url> -j <concurrency> -r <policy> -k -b <set-cookie>
    Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
    Get job status for <uuid>
r <uuid>
    Revoke or abort running job with <uuid>
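Assuming the bot has joined a channel, an archiving request might look like
this (the nick and URL are illustrative, and the exact reply format may
differ)::

    <alice> chromebot: a https://example.com/ -r prefix -j 2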