Rationale
---------

Most modern websites depend heavily on executing code, usually JavaScript, on
the user’s machine. They also make use of new and emerging Web technologies
like HTML5, WebSockets, service workers and more. Even worse from the
preservation point of view, they also require some form of user interaction to
dynamically load more content (infinite scrolling, dynamic comment loading,
etc.).

The naive approach of fetching an HTML page, parsing it and extracting
links to referenced resources is therefore not sufficient to create a faithful
snapshot of these web applications. A full browser, capable of running scripts
and providing modern Web APIs, is required for this task. Thankfully,
Google Chrome runs without a display (headless mode) and can be controlled by
external programs, allowing them to navigate and extract or inject data.

This section describes the solutions crocoite offers and explains the design
decisions taken.

crocoite captures resources by listening to Chrome’s `network events`_ and
requesting the response body using `Network.getResponseBody`_. This approach
has caveats: the original HTTP requests and responses, as sent over the wire,
are not available. They are reconstructed from parsed data. The character
encoding of text documents is changed to UTF-8. And the content body of HTTP
redirects cannot be retrieved due to a race condition.

.. _network events: https://chromedevtools.github.io/devtools-protocol/1-3/Network
.. _Network.getResponseBody: https://chromedevtools.github.io/devtools-protocol/1-3/Network#method-getResponseBody

But at the same time it allows crocoite to rely on Chrome’s well-tested network
stack and HTTP parser. Thus it supports HTTP version 1 and 2 as well as
transport protocols like SSL and QUIC. Depending on Chrome also eliminates the
need for a man-in-the-middle proxy, like warcprox_, which has to decrypt SSL
traffic and present a fake certificate to the browser in order to store the
transmitted content.

.. _warcprox: https://github.com/internetarchive/warcprox

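
As a rough illustration of the capture approach described above, the following
sketch builds the DevTools protocol messages involved and decodes a
``Network.getResponseBody`` result. It is not crocoite’s actual code: the helper
names ``make_command`` and ``decode_body`` and the request id ``1000.1`` are
made up for this example, and no connection to Chrome is opened.

.. code-block:: python

   import base64
   import json
   from itertools import count

   _ids = count(1)  # every protocol command carries a unique id

   def make_command(method, **params):
       """Serialize a DevTools protocol command as a JSON text frame."""
       return json.dumps({"id": next(_ids), "method": method, "params": params})

   def decode_body(result):
       """Return the bytes of a Network.getResponseBody result.

       Binary bodies arrive base64-encoded and are flagged with
       ``base64Encoded``. Text bodies arrive as an already decoded
       string, so the original character encoding is no longer known
       and the sketch stores them as UTF-8.
       """
       body = result["body"]
       if result.get("base64Encoded"):
           return base64.b64decode(body)
       return body.encode("utf-8")

   # Enable network events first, then request a body by its request id.
   enable = make_command("Network.enable")
   fetch = make_command("Network.getResponseBody", requestId="1000.1")

The ``decode_body`` branch for text bodies shows where the UTF-8 caveat comes
from: the bytes written to the archive are re-encoded, not the bytes received.
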
WARC records generated by crocoite are therefore an abstract view of the
resource they represent and not necessarily the data sent over the wire. A URL
fetched with HTTP/2, for example, will still result in an HTTP/1.1
request/response pair in the WARC file. This may be undesirable from
an archivist’s point of view (“save the data exactly like we received it”). But
this level of abstraction is inevitable when dealing with more than one
protocol.

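
To make the abstraction concrete, here is a minimal sketch of how a parsed
response might be rendered as an HTTP/1.1 response head, whatever protocol was
used on the wire. The function ``build_response_head`` is hypothetical and only
illustrates the idea; crocoite’s own serialization may differ.

.. code-block:: python

   def build_response_head(status, status_text, headers):
       """Render a parsed response as an HTTP/1.1 status line plus headers."""
       lines = ["HTTP/1.1 {} {}".format(status, status_text)]
       for name, value in headers.items():
           # HTTP/2 pseudo-headers such as ":status" have no HTTP/1.1 form.
           if name.startswith(":"):
               continue
           lines.append("{}: {}".format(name, value))
       # An empty line terminates the header block.
       return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")

   head = build_response_head(
       200, "OK", {":status": "200", "content-type": "text/html"}
   )

The result is a syntactically valid HTTP/1.1 head, but it is a reconstruction:
header order, casing and framing of the original exchange are not preserved.
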
crocoite also interacts with and therefore alters the grabbed websites. It does
so by injecting `behavior scripts`_ into the site. Typically these are written
in JavaScript, because interacting with a page is easier this way. These
scripts then perform different tasks: extracting targets from visible
hyperlinks, clicking buttons or scrolling the website to load more content,
as well as taking a static screenshot of ``<canvas>`` elements for the DOM
snapshot (see below).

.. _behavior scripts: https://github.com/PromyLOPh/crocoite/tree/master/crocoite/data

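
Script injection of this kind can be sketched with the DevTools protocol’s
``Runtime.evaluate`` method, which runs a JavaScript expression in the page.
The scroll snippet and the helper ``evaluate_command`` below are hypothetical
and far simpler than the real behavior scripts; whether crocoite uses exactly
this method is an assumption.

.. code-block:: python

   import json

   # Hypothetical behavior script: scroll down by one viewport height.
   SCROLL_SCRIPT = "window.scrollBy(0, window.innerHeight);"

   def evaluate_command(message_id, script):
       """Wrap a JavaScript snippet in a Runtime.evaluate protocol command."""
       return json.dumps({
           "id": message_id,
           "method": "Runtime.evaluate",
           "params": {"expression": script},
       })

   command = evaluate_command(1, SCROLL_SCRIPT)
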
Replaying archived WARCs can be quite challenging and might not be possible
with current technology (or even at all):

- Some sites request assets based on screen resolution, pixel ratio and
  supported image formats (webp). Replaying those with different parameters
  won’t work, since assets for those are missing. Example: missguided.com.
- Some fetch different scripts based on user agent. Example: youtube.com.
- Requests containing randomly generated JavaScript callback function names
  won’t work. Example: weather.com.
- Range requests (``Range: bytes=1-100``) are captured as-is, making playback
  difficult.

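
The callback problem from the list above can be illustrated with two URLs that
differ only in a random function name. The URLs, the parameter name
``callback`` and the helper ``same_resource`` are invented for this sketch; it
shows why an exact URL lookup misses during replay and is not something
crocoite itself does.

.. code-block:: python

   from urllib.parse import parse_qsl, urlsplit

   def same_resource(archived_url, replay_url, ignore=("callback",)):
       """Compare two URLs while ignoring selected query parameters."""
       def key(url):
           parts = urlsplit(url)
           query = sorted(
               (k, v) for k, v in parse_qsl(parts.query) if k not in ignore
           )
           return (parts.scheme, parts.netloc, parts.path, query)
       return key(archived_url) == key(replay_url)

   archived = "https://api.example.com/forecast?city=berlin&callback=jsonp_48151"
   replayed = "https://api.example.com/forecast?city=berlin&callback=jsonp_62342"

   exact_match = archived == replayed                # False: names differ
   fuzzy_match = same_resource(archived, replayed)   # True: same resource

Even when the lookup is relaxed like this, the archived response body still
calls the old function name, so the page’s script would not receive the data.
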
crocoite offers two methods to work around these issues. Firstly, it can save a
DOM snapshot to the WARC file. It contains the entire DOM in HTML format minus
``<script>`` tags after the site has been fully loaded and thus can be
displayed without executing scripts. Obviously, JavaScript-based navigation
does not work any more. Secondly, it also saves a screenshot of the full page,
so even if future browsers cannot render and display the stored HTML, a fully
rendered version of the website can be replayed instead.
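
The “minus ``<script>`` tags” step can be sketched with Python’s standard
library alone. This is only an illustration of the idea, not crocoite’s
implementation, which takes its snapshot from the live DOM inside Chrome; the
names ``ScriptStripper`` and ``strip_scripts`` are made up here.

.. code-block:: python

   from html.parser import HTMLParser

   class ScriptStripper(HTMLParser):
       """Re-emit HTML while dropping <script> elements and their content."""

       def __init__(self):
           # Keep character references untouched so they can be re-emitted.
           super().__init__(convert_charrefs=False)
           self.out = []
           self.in_script = False

       def handle_starttag(self, tag, attrs):
           if tag == "script":
               self.in_script = True
           elif not self.in_script:
               self.out.append(self.get_starttag_text())

       def handle_startendtag(self, tag, attrs):
           # Self-closing tags such as <br/> are copied verbatim.
           if tag != "script" and not self.in_script:
               self.out.append(self.get_starttag_text())

       def handle_endtag(self, tag):
           if tag == "script":
               self.in_script = False
           elif not self.in_script:
               self.out.append("</{}>".format(tag))

       def handle_data(self, data):
           if not self.in_script:
               self.out.append(data)

       def handle_entityref(self, name):
           self.out.append("&{};".format(name))

       def handle_charref(self, name):
           self.out.append("&#{};".format(name))

       def handle_comment(self, data):
           self.out.append("<!--{}-->".format(data))

       def handle_decl(self, decl):
           self.out.append("<!{}>".format(decl))

   def strip_scripts(html):
       """Return the given HTML with all <script> elements removed."""
       parser = ScriptStripper()
       parser.feed(html)
       parser.close()
       return "".join(parser.out)

   snapshot = strip_scripts('<p>Hi &amp; bye<script>alert("<b>")</script><br/></p>')

The remaining markup can be displayed without running any code, which is also
why script-driven navigation no longer works in such a snapshot.
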