WARCIO: WARC (and ARC) Streaming Library
========================================

.. image:: https://travis-ci.org/webrecorder/warcio.svg?branch=master
      :target: https://travis-ci.org/webrecorder/warcio
.. image:: https://codecov.io/gh/webrecorder/warcio/branch/master/graph/badge.svg
      :target: https://codecov.io/gh/webrecorder/warcio

Background
----------

This library provides a fast, standalone way to read and write the `WARC
Format <https://en.wikipedia.org/wiki/Web_ARChive>`__ commonly used in
web archives. It supports Python 2.7+ and Python 3.4+ (using
`six <https://pythonhosted.org/six/>`__, the only external dependency).

warcio supports reading and writing of WARC files compliant with both the `WARC 1.0 <http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf>`__
and `WARC 1.1 <http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf>`__ ISO standards.

Install with: ``pip install warcio``

This library is a spin-off of the WARC reading and writing component of
the `pywb <https://github.com/webrecorder/pywb>`__ high-fidelity replay
library, a key component of
`Webrecorder <https://github.com/webrecorder/webrecorder>`__.

The library is designed for fast, low-level access to web archival
content, oriented around a stream of WARC records rather than files.

Reading WARC Records
--------------------

A key feature of the library is the ability to iterate over a stream of
WARC records using the ``ArchiveIterator``.

It includes the following features:

- Reading a WARC 1.0, WARC 1.1 or ARC stream
- On the fly ARC to WARC record conversion
- Decompressing and de-chunking HTTP payload content stored in WARC/ARC files.

For example, the following prints the URL for each WARC ``response``
record:

.. code:: python

    from warcio.archiveiterator import ArchiveIterator

    with open('path/to/file', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))

The stream object could be a file on disk or a remote network stream.
The ``ArchiveIterator`` reads the WARC content in a single pass. The
``record`` is represented by an ``ArcWarcRecord`` object which contains
the format (ARC or WARC), record type, the record headers, HTTP headers
(if any), and raw stream for reading the payload.

.. code:: python

    class ArcWarcRecord(object):
        def __init__(self, *args):
            (self.format, self.rec_type, self.rec_headers, self.raw_stream,
             self.http_headers, self.content_type, self.length) = args

Reading WARC Content
~~~~~~~~~~~~~~~~~~~~

The ``raw_stream`` can be used to read the rest of the payload directly.
A special ``ArcWarcRecord.content_stream()`` function provides a stream that
automatically decompresses and de-chunks the HTTP payload, if it is
compressed and/or transfer-encoding chunked.

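To make the two terms concrete, the following standard-library sketch shows
what de-chunking and decompressing an HTTP payload involve. It is an
illustration only, not warcio's implementation: ``dechunk`` is a hypothetical
helper written for this example, and the sample body is built by hand. With
warcio, ``record.content_stream().read()`` is meant to take care of both steps.

.. code:: python

    import gzip
    import io


    def dechunk(data):
        """Decode an HTTP/1.1 chunked body (hypothetical helper, ignores trailers)."""
        stream = io.BytesIO(data)
        out = bytearray()
        while True:
            size_line = stream.readline()
            # the chunk size is hexadecimal; anything after ';' is a chunk extension
            size = int(size_line.split(b';', 1)[0].strip(), 16)
            if size == 0:
                break
            out += stream.read(size)
            stream.readline()  # consume the CRLF that terminates the chunk
        return bytes(out)


    html = b'<html>hello</html>'
    compressed = gzip.compress(html)

    # build a chunked body by hand: two chunks plus the terminating zero-size chunk
    half = len(compressed) // 2
    parts = [compressed[:half], compressed[half:]]
    chunked = b''.join(b'%x\r\n%s\r\n' % (len(p), p) for p in parts) + b'0\r\n\r\n'

    # de-chunk first (transfer encoding), then decompress (content encoding)
    decoded = gzip.decompress(dechunk(chunked))
    print(decoded)

The order matters: the chunked transfer encoding wraps the compressed bytes,
so it has to be removed before the gzip data can be decompressed.
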
ARC Files
~~~~~~~~~

The library provides support for reading (but not writing) ARC files.
The ARC format is legacy but is important to support in a consistent
manner. The ``ArchiveIterator`` can equally iterate over ARC and WARC
files to emit ``ArcWarcRecord`` objects. The special ``arc2warc`` option
converts ARC records to WARCs on the fly, allowing them to be
accessed using the same API.

(Special ``WARCIterator`` and ``ARCIterator`` subclasses of ``ArchiveIterator``
are also available to read only WARC or only ARC files.)

WARC and ARC Streaming
~~~~~~~~~~~~~~~~~~~~~~

For example, here is a snippet for reading an ARC and a WARC using the
same API.

The example streams a WARC and ARC file over HTTP using
`requests <http://docs.python-requests.org/en/master/>`__, printing the
``warcinfo`` record (or ARC header) and any response records (or all ARC
records) that contain HTML:

.. code:: python

    import requests
    from warcio.archiveiterator import ArchiveIterator

    def print_records(url):
        resp = requests.get(url, stream=True)

        for record in ArchiveIterator(resp.raw, arc2warc=True):
            if record.rec_type == 'warcinfo':
                print(record.raw_stream.read())

            elif record.rec_type == 'response':
                if record.http_headers.get_header('Content-Type') == 'text/html':
                    print(record.rec_headers.get_header('WARC-Target-URI'))
                    print(record.content_stream().read())
                    print('')

    # WARC
    print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz')

    # ARC with arc2warc
    print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz')

Writing WARC Records
--------------------

Starting with 1.6, warcio introduces a way to capture HTTP/S traffic directly
to a WARC file, by monkey-patching Python's ``http.client`` library.

This approach works well with the popular ``requests`` library often used to fetch
HTTP/S content. Note that ``requests`` must be imported after the ``capture_http`` module.

Quick Start to Writing a WARC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fetching the url ``https://example.com/`` while capturing the response and request
into a gzip compressed WARC file named ``example.warc.gz`` can be done with the following four lines:

.. code:: python

    from warcio.capture_http import capture_http
    import requests  # requests must be imported after capture_http

    with capture_http('example.warc.gz'):
        requests.get('https://example.com/')

The WARC ``example.warc.gz`` will contain two records (the response is written first, then the request).

To write to a default in-memory buffer (``BufferWARCWriter``), omit the filename and use ``with capture_http() as writer:``.

Additional requests made in the ``capture_http`` context will be appended to the WARC as expected.

The ``WARC-IP-Address`` header will also be added for each record if the IP address is available.

The following example (similar to a `unit test from the test suite <test/test_capture_http.py>`__) demonstrates the resulting records created with ``capture_http``:

.. code:: python

    from warcio.capture_http import capture_http
    import requests  # requests must be imported after capture_http
    from warcio.archiveiterator import ArchiveIterator

    with capture_http() as writer:
        requests.get('http://example.com/')
        requests.get('https://google.com/')

    expected = [('http://example.com/', 'response', True),
                ('http://example.com/', 'request', True),
                ('https://google.com/', 'response', True),
                ('https://google.com/', 'request', True),
                ('https://www.google.com/', 'response', True),
                ('https://www.google.com/', 'request', True)
               ]

    actual = [
              (record.rec_headers['WARC-Target-URI'],
               record.rec_type,
               'WARC-IP-Address' in record.rec_headers)

              for record in ArchiveIterator(writer.get_stream())
             ]

    assert actual == expected

Customizing WARC Writing
~~~~~~~~~~~~~~~~~~~~~~~~

The library provides a simple and extensible interface for writing
standards-compliant WARC files.

The library comes with a basic ``WARCWriter`` class for writing to a
single WARC file and ``BufferWARCWriter`` for writing to an in-memory
buffer. The ``BaseWARCWriter`` can be extended to support more complex
operations.

(There is no support for writing legacy ARC files.)

For more flexibility, such as using a custom ``WARCWriter`` class,
the Quick Start example can be written as:

.. code:: python

    from warcio.capture_http import capture_http
    from warcio import WARCWriter
    import requests  # requests *must* be imported after capture_http

    with open('example.warc.gz', 'wb') as fh:
        warc_writer = WARCWriter(fh)
        with capture_http(warc_writer):
            requests.get('https://example.com/')

WARC/1.1 Support
~~~~~~~~~~~~~~~~

By default, warcio creates WARC 1.0 records for maximum compatibility with existing tools.
To create WARC/1.1 records, specify the WARC version as follows:

.. code:: python

    with capture_http('example.warc.gz', warc_version='1.1'):
        ...

.. code:: python

    WARCWriter(fh, warc_version='1.1')
    ...

When using WARC 1.1, the main difference is that the ``WARC-Date`` timestamp header
will be written with microsecond precision, while WARC 1.0 only supports second precision.

WARC 1.0:

.. code::

    WARC/1.0
    ...
    WARC-Date: 2018-12-26T10:11:12Z

WARC 1.1:

.. code::

    WARC/1.1
    ...
    WARC-Date: 2018-12-26T10:11:12.456789Z

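The difference in precision can be reproduced with the standard library
alone. The snippet below is a small illustration of the two timestamp shapes
shown above, not the code warcio uses to produce them; the variable names are
chosen for this example.

.. code:: python

    from datetime import datetime, timezone

    captured = datetime(2018, 12, 26, 10, 11, 12, 456789, tzinfo=timezone.utc)

    # WARC 1.0 style: second precision, the fractional part is dropped
    date_1_0 = captured.strftime('%Y-%m-%dT%H:%M:%SZ')

    # WARC 1.1 style: microsecond precision
    date_1_1 = captured.strftime('%Y-%m-%dT%H:%M:%S.%fZ')

    print(date_1_0)  # 2018-12-26T10:11:12Z
    print(date_1_1)  # 2018-12-26T10:11:12.456789Z
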
Filtering HTTP Capture
~~~~~~~~~~~~~~~~~~~~~~

When capturing via HTTP, it is possible to provide a custom filter function,
which can be used to determine whether a particular pair of request and response
records should be written to the WARC file or skipped.

The filter function is called with the request and response record
before they are written, and can be used to substitute a different record (for example, a revisit
instead of a response), or to skip writing altogether by returning nothing, as shown below:

.. code:: python

    def filter_records(request, response, request_recorder):
        # return None, None to indicate records should be skipped
        if response.http_headers.get_statuscode() != '200':
            return None, None

        # the response record can be replaced with a revisit record
        elif check_for_dedup():
            response = create_revisit_record(...)

        return request, response

    with capture_http('example.warc.gz', filter_records):
        requests.get('https://example.com/')

Please refer to
`test/test\_capture_http.py <test/test_capture_http.py>`__ for additional examples
of capturing ``requests`` traffic to WARC.

Manual/Advanced WARC Writing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before 1.6, this was the primary method for fetching a url and then
writing to a WARC. This process is a bit more verbose,
but provides full control of WARC creation and avoids monkey-patching.

The following example loads ``http://example.com/``, creates a WARC
response record, and writes it, gzip compressed, to ``example.warc.gz``.
The block and payload digests are computed automatically.

.. code:: python

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    import requests

    with open('example.warc.gz', 'wb') as output:
        writer = WARCWriter(output, gzip=True)

        resp = requests.get('http://example.com/',
                            headers={'Accept-Encoding': 'identity'},
                            stream=True)

        # get raw headers from urllib3
        headers_list = resp.raw.headers.items()

        http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')

        record = writer.create_warc_record('http://example.com/', 'response',
                                            payload=resp.raw,
                                            http_headers=http_headers)

        writer.write_record(record)

The library also includes additional semantics for:

- Creating ``warcinfo`` and ``revisit`` records
- Writing ``response`` and ``request`` records together
- Writing custom WARC records
- Reading a full WARC record from a stream

Please refer to `warcwriter.py <warcio/warcwriter.py>`__ and
`test/test\_writer.py <test/test_writer.py>`__ for additional examples.

WARCIO CLI: Indexing and Recompression
--------------------------------------

The library currently ships with a few simple command line tools.

Index
~~~~~

The ``warcio index`` command prints a simple index of the records in the
WARC file as newline-delimited JSON lines (NDJSON).

WARC header fields to include in the index can be specified via the
``-f`` flag, and are included in the JSON block (in order, for
convenience).

::

    warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length

    {"warc-type": "warcinfo", "content-length": "137"}
    {"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
    {"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}

HTTP header fields can be included by prefixing them with
``http:``. The special field ``offset`` refers to the record offset within
the WARC file.

::

    warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri

    {"offset": "0", "content-type": "application/warc-fields"}
    {"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
    {"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}

(Note: this library does not produce the CDX or CDXJ format indexes often
associated with web archives. To create these indexes, please see the
`cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`__ tool, which extends warcio indexing to provide this functionality.)

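Because each index line is a standalone JSON object, the output is easy to
post-process. The sketch below parses the sample lines from the second
example above with the standard ``json`` module and collects the offsets of
records whose HTTP content type is HTML. The variable names are chosen for
this illustration; in practice the lines would come from the command's
output rather than from a string literal.

.. code:: python

    import json

    # sample lines copied from the index output shown in this section
    index_output = '''\
    {"offset": "0", "content-type": "application/warc-fields"}
    {"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
    {"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}
    '''

    entries = [json.loads(line) for line in index_output.splitlines() if line.strip()]

    # fields that do not apply to a record are simply absent, so use .get()
    html_offsets = [
        int(entry['offset'])
        for entry in entries
        if entry.get('http:content-type', '').startswith('text/html')
    ]
    print(html_offsets)  # [405]

Note that all values, including ``offset``, are emitted as strings in the
sample output, hence the explicit ``int()`` conversion.
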
Check
~~~~~

The ``warcio check`` command will check the payload and block digests
of WARC records, if possible. An exit value of 1 indicates a failure.
``warcio check -v`` will print verbose output for each record in the
WARC file.

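As a rough picture of what a digest check involves, the sketch below
recomputes a digest over some bytes and compares it with a recorded value.
It assumes the common ``sha1:<base32>`` labelling seen in WARC digest
headers; ``warc_digest`` is a hypothetical helper for this illustration and
is not part of warcio, whose own checking logic may differ in detail.

.. code:: python

    import base64
    import hashlib


    def warc_digest(data):
        """Return a 'sha1:<base32>' label for data (assumed labelling scheme)."""
        return 'sha1:' + base64.b32encode(hashlib.sha1(data).digest()).decode('ascii')


    payload = b'example payload'
    recorded = warc_digest(payload)  # stands in for the value stored in the record

    # an intact payload reproduces the recorded digest, a modified one does not
    intact_ok = warc_digest(payload) == recorded
    modified_ok = warc_digest(payload + b'!') == recorded
    print(intact_ok, modified_ok)  # True False
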
Recompress
~~~~~~~~~~

The ``recompress`` command allows for re-compressing or normalizing WARC
(or ARC) files to a record-compressed, gzipped WARC file.

Each WARC record is compressed individually and concatenated. This is
the 'canonical' WARC storage format used by
`Webrecorder <https://github.com/webrecorder/webrecorder>`__ and other
web archiving institutions, and usually stored with a ``.warc.gz``
extension.

It can be used to:

- Compress an uncompressed WARC
- Convert any ARC file to a compressed WARC
- Fix an improperly compressed WARC file (e.g. a WARC compressed entirely instead of by record)

::

    warcio recompress ./input.arc.gz ./output.warc.gz

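The difference between record compression and compressing a file as a whole
can be seen with the standard ``gzip`` and ``zlib`` modules. This is a
conceptual sketch only: the byte strings are placeholders rather than valid
WARC records, and ``first_member`` is a hypothetical helper, not a warcio
function.

.. code:: python

    import gzip
    import zlib

    # placeholder data standing in for two WARC records
    records = [b'placeholder record one\r\n', b'placeholder record two\r\n']

    # record-compressed: every record is its own gzip member, written back to back
    members = [gzip.compress(r) for r in records]
    record_compressed = b''.join(members)

    # compressed entirely: one gzip member around the whole file
    whole_compressed = gzip.compress(b''.join(records))

    # both forms decompress to the same bytes when read from start to end
    same_content = gzip.decompress(record_compressed) == gzip.decompress(whole_compressed)


    def first_member(data):
        """Decompress only the first gzip member; return (output, leftover bytes)."""
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
        return decomp.decompress(data), decomp.unused_data


    # only the record-compressed form has a member boundary after the first record
    first, rest = first_member(record_compressed)
    everything, leftover = first_member(whole_compressed)
    print(same_content, first == records[0], rest == members[1], leftover == b'')

The member boundaries are what make it possible to start reading at the
offset of a single record without decompressing everything before it.
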
Extract
~~~~~~~

The ``extract`` command provides a way to extract the WARC and HTTP headers and/or the payload of a WARC record
to stdout. Given a WARC filename and an offset, ``extract`` will print the (decompressed) record at that offset
in the file to stdout.

Specifying ``--payload`` or ``--headers`` will output only the payload or only the WARC + HTTP headers (if any), respectively.

::

    warcio extract [--payload | --headers] filename offset

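The idea of reading a single record by offset can be sketched with the
standard library, assuming a record-compressed file in which the offset
points at the start of a gzip member. ``read_record_at`` is a hypothetical
helper for this illustration and the file contents are placeholders, not
real WARC records; the ``extract`` command itself additionally understands
WARC and HTTP headers.

.. code:: python

    import gzip
    import os
    import tempfile
    import zlib


    def read_record_at(path, offset):
        """Return the decompressed gzip member that starts at offset (hypothetical helper)."""
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
        out = bytearray()
        with open(path, 'rb') as fh:
            fh.seek(offset)
            # stop as soon as the end of this member is reached
            while not decomp.eof:
                block = fh.read(8192)
                if not block:
                    break
                out += decomp.decompress(block)
        return bytes(out)


    # placeholder data standing in for two WARC records
    records = [b'placeholder record one\r\n', b'placeholder record two\r\n']
    members = [gzip.compress(r) for r in records]

    with tempfile.NamedTemporaryFile(suffix='.warc.gz', delete=False) as tmp:
        tmp.write(b''.join(members))

    try:
        # the second record starts right after the first compressed member
        second = read_record_at(tmp.name, len(members[0]))
    finally:
        os.remove(tmp.name)

    print(second)
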
License
~~~~~~~

``warcio`` is licensed under the Apache 2.0 License and is part of the
Webrecorder project.

See `NOTICE <NOTICE>`__ and `LICENSE <LICENSE>`__ for details.