Usage
-----

Quick start using pywb_, expects Google Chrome to be installed already:

.. code:: bash

    pip install crocoite pywb
    crocoite http://example.com/ example.com.warc.gz
    wb-manager init test && wb-manager add test example.com.warc.gz
    wayback &
    $BROWSER http://localhost:8080

.. _pywb: https://github.com/ikreymer/pywb

It is recommended to install at least Microsoft’s Corefonts_ as well as DejaVu_,
Liberation_ or a similar font family covering a wide range of character sets.
Otherwise page screenshots may be unusable due to missing glyphs.

.. _Corefonts: http://corefonts.sourceforge.net/
.. _DejaVu: https://dejavu-fonts.github.io/
.. _Liberation: https://pagure.io/liberation-fonts

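On Debian-based systems, for example, that recommendation roughly corresponds
to the packages below; the package names are an assumption and differ between
distributions:

.. code:: bash

    # Debian/Ubuntu package names; adjust for your distribution
    sudo apt install ttf-mscorefonts-installer fonts-dejavu fonts-liberation
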
Recursion
^^^^^^^^^

.. program:: crocoite

By default crocoite will only retrieve the URL specified on the command line.
However it can follow links as well. There are currently two recursion
strategies available, depth- and prefix-based.

.. code:: bash

    crocoite -r 1 https://example.com/ example.com.warc.gz

will retrieve ``example.com`` and all pages directly referred to by it.
Increasing the number increases the depth, so a value of :samp:`2` would first
grab ``example.com``, then queue all pages linked from it as well as every
reference on each of those pages.

On the other hand

.. code:: bash

    crocoite -r prefix https://example.com/dir/ example.com.warc.gz

will retrieve the URL specified and all pages referenced which have the same
URL prefix. The trailing slash is significant. Without it crocoite would also
grab ``/dir-something`` or ``/dir.html`` for example.

If an output file template is used each page is written to an individual file.
For example

.. code:: bash

    crocoite -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'

will write one page per file, named like
:file:`example.com-2019-09-09T15:15:15+02:00-1.warc.gz`. ``seqnum`` is unique
to each page of a single job and should always be used.

When running a recursive job, increasing the concurrency (i.e. how many pages
are fetched at the same time) can speed up the process. For example you can
pass :option:`-j` :samp:`4` to retrieve four pages at the same time. Keep in
mind that each process starts a full browser that requires a lot of resources
(one to two GB of RAM and one or two CPU cores).

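For instance, a prefix crawl fetching four pages in parallel could be started
like this (URL and output template are just placeholders):

.. code:: bash

    crocoite -j 4 -r prefix https://example.com/ '{host}-{date}-{seqnum}.warc.gz'
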
Customizing
^^^^^^^^^^^

.. program:: crocoite-single

Under the hood :program:`crocoite` starts one instance of
:program:`crocoite-single` to fetch each page. You can customize its options by
appending a command template like this:

.. code:: bash

    crocoite -r prefix https://example.com example.com.warc.gz -- \
        crocoite-single --timeout 5 -k '{url}' '{dest}'

This reduces the global timeout to 5 seconds and ignores TLS errors. If an
option is prefixed with an exclamation mark (``!``) it will not be expanded.
This is useful for passing :option:`--warcinfo`, which expects JSON-encoded
data.

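As a sketch, injecting operator metadata that way could look like the
following; the JSON payload is only an illustration, and the exclamation mark
keeps its braces from being treated as template fields:

.. code:: bash

    crocoite https://example.com/ example.com.warc.gz -- \
        crocoite-single --warcinfo '!{"operator": "Jane Doe"}' '{url}' '{dest}'
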
Command line options
^^^^^^^^^^^^^^^^^^^^

Below is a list of all command line arguments available:

.. program:: crocoite

crocoite
++++++++

Front-end with recursion support and simple job management.

.. option:: -j N, --concurrency N

    Maximum number of concurrent fetch jobs.

.. option:: -r POLICY, --recursion POLICY

    Enables recursion based on POLICY, which can be a positive integer
    (recursion depth) or the string :kbd:`prefix`.

.. option:: --tempdir DIR

    Directory for temporary WARC files.

.. program:: crocoite-single

crocoite-single
+++++++++++++++

Back-end to fetch a single page.

.. option:: -b SET-COOKIE, --cookie SET-COOKIE

    Add a cookie to the browser’s cookie jar. This option always *appends*
    cookies, overriding duplicates provided by :option:`-c`.

    .. versionadded:: 1.1

.. option:: -c FILE, --cookie-jar FILE

    Load cookies from FILE. :program:`crocoite` provides a default cookie file,
    which contains cookies to, for example, circumvent age restrictions. This
    option *replaces* that default file.

    .. versionadded:: 1.1

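For example, loading a custom cookie jar and appending one extra cookie on top
of it could look like this; the file name and cookie value are placeholders,
and the cookie is assumed to be given in ``Set-Cookie`` header syntax:

.. code:: bash

    crocoite-single -c my-cookies.txt -b 'session=1234' \
        https://example.com/ example.com.warc.gz
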
.. option:: --idle-timeout SEC

    Time after which a page is considered “idle”.

.. option:: -k, --insecure

    Allow insecure connections, i.e. self-signed or expired HTTPS certificates.

.. option:: --timeout SEC

    Global archiving timeout.

.. option:: --warcinfo JSON

    Inject additional JSON-encoded information into the resulting WARC.

IRC bot
^^^^^^^

A simple IRC bot (“chromebot”) is provided with the command
:program:`crocoite-irc`. It reads its configuration from a config file like the
example provided in :file:`contrib/chromebot.json` and supports the following
commands:

a <url> -j <concurrency> -r <policy> -k -b <set-cookie>
    Archive <url> with <concurrency> processes according to recursion <policy>
s <uuid>
    Get job status for <uuid>
r <uuid>
    Revoke or abort running job with <uuid>