- Rationale
- ---------
- Most modern websites depend heavily on executing code, usually JavaScript, on
- the user’s machine. They also make use of new and emerging Web technologies
- like HTML5, WebSockets, service workers and more. Even worse from the
- preservation point of view, they also require some form of user interaction to
- dynamically load more content (infinite scrolling, dynamic comment loading,
- etc.).
-
- The naive approach of fetching an HTML page, parsing it and extracting
- links to referenced resources is therefore not sufficient to create a faithful
- snapshot of these web applications. A full browser, capable of running scripts and
- providing modern Web APIs, is required for this task. Thankfully,
- Google Chrome can run without a display (headless mode) and can be controlled by
- external programs, allowing them to navigate and extract or inject data.
-
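
To illustrate the limitation, the naive approach can be sketched with Python's standard library. This is only an illustration and not part of crocoite; the page content, the ``LinkExtractor`` class and the ``extract_links`` helper are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect URLs referenced by static markup (href/src attributes)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page: one image is in the markup, a second one is only
# added by a script once the page runs in a browser.
PAGE = """
<html><body>
<img src="/static/logo.png">
<script>
  var img = new Image();
  img.src = "/api/photo-" + Date.now() + ".jpg";
  document.body.appendChild(img);
</script>
</body></html>
"""


def extract_links(html, base_url):
    """Parse the markup without executing it and return the found URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(PAGE, "https://example.com/")
print(links)
```

Only the statically referenced image is found. The URL assembled by the script at runtime never reaches the parser, which is why a browser that actually executes the page is needed.
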
- This section describes the solutions crocoite offers and explains the design
- decisions behind them.
-
- crocoite captures resources by listening to Chrome’s `network events`_ and
- requesting the response body using `Network.getResponseBody`_. This approach
- has caveats: the original HTTP requests and responses, as sent over the wire,
- are not available and are instead reconstructed from parsed data. The character
- encoding of text documents is changed to UTF-8, and the body of HTTP
- redirect responses cannot be retrieved due to a race condition.
-
- .. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
- .. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody
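
The message flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not crocoite's actual code: the WebSocket transport to Chrome is left out, the helper names ``command``, ``on_event`` and ``decode_body`` are hypothetical, and the sample event is hand-written. Only the protocol names ``Network.enable``, ``Network.loadingFinished`` and ``Network.getResponseBody`` come from the DevTools protocol.

```python
import base64
import json
from itertools import count

_ids = count(1)  # every DevTools command carries a unique numeric id


def command(method, **params):
    """Serialise one DevTools protocol command as a JSON message."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def on_event(event):
    """Return the follow-up command for a network event, or None."""
    if event.get("method") == "Network.loadingFinished":
        request_id = event["params"]["requestId"]
        return command("Network.getResponseBody", requestId=request_id)
    return None


def decode_body(result):
    """Turn a Network.getResponseBody result into bytes.

    Text bodies arrive as an already decoded string, so the best this
    sketch can do is re-encode them as UTF-8.
    """
    if result["base64Encoded"]:
        return base64.b64decode(result["body"])
    return result["body"].encode("utf-8")


# Sent once per tab to start receiving network events.
enable = command("Network.enable")

# Hand-written sample event; real ones carry additional fields.
event = {"method": "Network.loadingFinished", "params": {"requestId": "1000.1"}}
follow_up = json.loads(on_event(event))
print(follow_up["method"], follow_up["params"])
```

The re-encoding step in ``decode_body`` shows where the UTF-8 caveat comes from: the browser hands over parsed text, not the bytes that were on the wire.
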
-
- At the same time, this approach allows crocoite to rely on Chrome’s well-tested network
- stack and HTTP parser. It thus supports HTTP versions 1 and 2 as well as
- protocols like SSL/TLS and QUIC. Depending on Chrome also eliminates the
- need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL/TLS
- traffic and present a fake certificate to the browser in order to store the
- transmitted content.
-
- .. _warcprox: https://github.com/internetarchive/warcprox
-
- WARC records generated by crocoite are therefore an abstract view of the
- resource they represent and not necessarily the data sent over the wire. A URL
- fetched with HTTP/2, for example, will still result in an HTTP/1.1
- request/response pair in the WARC file. This may be undesirable from
- an archivist’s point of view (“save the data exactly like we received it”), but
- this level of abstraction is inevitable when dealing with more than one
- protocol.
-
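
The kind of reconstruction described here can be sketched as below. This is an assumption-laden illustration rather than crocoite's actual serialisation code: the function name ``build_http11_response`` and the sample headers are made up, and details such as header ordering or content encoding are ignored.

```python
from http import HTTPStatus


def build_http11_response(status, headers, body):
    """Serialise parsed response data as an HTTP/1.1 message.

    `headers` is a list of (name, value) pairs as a browser might report
    them. HTTP/2 pseudo-headers such as ":status" have no HTTP/1.1
    equivalent and are dropped.
    """
    try:
        # HTTP/2 has no reason phrase, so one is looked up for the code.
        reason = HTTPStatus(status).phrase
    except ValueError:
        reason = ""
    lines = [f"HTTP/1.1 {status} {reason}"]
    lines += [f"{name}: {value}" for name, value in headers
              if not name.startswith(":")]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("iso-8859-1") + body


raw = build_http11_response(
    200,
    [(":status", "200"), ("content-type", "text/html; charset=utf-8")],
    b"<p>hi</p>",
)
print(raw.decode("iso-8859-1"))
```

The status line and the dropped pseudo-header are invented by the serialiser, which is exactly why the stored record is an abstraction and not a copy of the transmitted data.
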
- crocoite also interacts with and therefore alters the grabbed websites. It does
- so by injecting `behavior scripts`_ into the site. Typically these are written
- in JavaScript, because interacting with a page is easier this way. These
- scripts then perform different tasks: extracting targets from visible
- hyperlinks, clicking buttons or scrolling the website to load more content,
- as well as taking a static screenshot of ``<canvas>`` elements for the DOM
- snapshot (see below).
-
- .. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data
-
- Replaying archived WARCs can be quite challenging and might not be possible
- with current technology (or even at all):
-
- - Some sites request assets based on screen resolution, pixel ratio and
-   supported image formats (webp). Replaying those with different parameters
-   won’t work, since the assets for those parameters were never captured.
-   Example: missguided.com.
- - Some fetch different scripts based on user agent. Example: youtube.com.
- - Requests containing randomly generated JavaScript callback function names
-   won’t work. Example: weather.com.
- - Range requests (``Range: bytes=1-100``) are captured as-is, making playback
-   difficult.
-
- crocoite offers two methods to work around these issues. First, it can save a
- DOM snapshot to the WARC file. It contains the entire DOM in HTML format, minus
- ``<script>`` tags, after the site has been fully loaded and can thus be
- displayed without executing scripts. Obviously, JavaScript-based navigation
- no longer works. Second, it saves a screenshot of the full page,
- so even if future browsers cannot render and display the stored HTML, a fully
- rendered version of the website can be shown instead.
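
The idea of a script-free snapshot can be illustrated with a small standard-library sketch. It is not how crocoite produces its snapshot (the source does not say how the tags are removed); ``ScriptStripper``, ``strip_scripts`` and the sample markup are hypothetical and only show the effect on already serialised HTML.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML unchanged, except for <script> elements."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html):
    """Return the markup with all script elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


SNAPSHOT = (
    '<!DOCTYPE html><html><head><script src="app.js"></script></head>'
    '<body><p>Hi &amp; bye</p><script>document.title = "changed";</script>'
    "</body></html>"
)
print(strip_scripts(SNAPSHOT))
```

Because the markup was serialised after the scripts had already run, the content they generated stays in the snapshot even though the scripts themselves are gone.
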