1.7.4
~~~~~

- ``capture_http`` support for chunk-encoded requests `#116 <https://github.com/webrecorder/warcio/pull/116>`_
- indexer: option to enable ``verify_http`` `#116 <https://github.com/webrecorder/warcio/pull/116>`_
- Enable writing block digests for warcinfo records `#115 <https://github.com/webrecorder/warcio/pull/115>`_

1.7.3
~~~~~

- Fix documentation for ``capture_http`` ``filter_records`` `#110 <https://github.com/webrecorder/warcio/pull/110>`_
- Fix ``capture_http`` with http and https proxies `#113 <https://github.com/webrecorder/warcio/pull/113>`_

1.7.2
~~~~~

- Ensure 1.1 revisit profile used with WARC/1.1 revisits `#96 <https://github.com/webrecorder/warcio/pull/96>`_
- Include record offsets in ``warcio check`` output `#98 <https://github.com/webrecorder/warcio/pull/98>`_
- CI fix for Python 2.7, use ``jinja<3.0.0`` (`#105 <https://github.com/webrecorder/warcio/pull/105>`_)
- Fix in ``StatusAndHeaders`` when writing, then reading record `#106 <https://github.com/webrecorder/warcio/pull/106>`_
- Fix issues related to HTTP header re-encoding, ensure correct content-length and %-encoding `#106 <https://github.com/webrecorder/warcio/pull/106>`_, `#107 <https://github.com/webrecorder/warcio/pull/107>`_

1.7.1
~~~~~

- Windows fixes: Fix reading from stdin, ensure all WARCs/ARCs are treated as binary `#86 <https://github.com/webrecorder/warcio/pull/86>`_
- Fix ``ensure_digest(block=True)`` breaking on an existing record; ``RecordBuilder`` supports ``header_filter`` `#85 <https://github.com/webrecorder/warcio/pull/85>`_

1.7.0
~~~~~

- Docs and Misc Cleanup: add docs for ``extract`` tool, correct doc for ``get_statuscode()``, move all CLI tools to separate modules for better reusability.
- Support indexing a WARC read from stdin `#79 <https://github.com/webrecorder/warcio/pull/79>`_
- Automatically %-encode URLs that have a space in ``WARC-Target-URI`` `#80 <https://github.com/webrecorder/warcio/pull/80>`_
- Separate record creation into ``RecordBuilder`` class to allow building WARC records without a ``WARCWriter``, which now derives from ``RecordBuilder`` `#63 <https://github.com/webrecorder/warcio/pull/63>`_
- Support optionally checking the block and payload digests of ARC/WARC records `#54 <https://github.com/webrecorder/warcio/pull/54>`_, `#58 <https://github.com/webrecorder/warcio/pull/58>`_, `#68 <https://github.com/webrecorder/warcio/pull/68>`_, `#77 <https://github.com/webrecorder/warcio/pull/77>`_

  - ``ArchiveIterator`` and ``ArcWarcRecordLoader`` now accept a ``check_digests`` boolean keyword argument on creation, indicating whether each record's digest should be checked; defaults to ``False``
  - Core digest checking functionality is provided by ``DigestChecker`` and ``DigestVerifyingReader``, importable from `warcio.digestverifyingreader <digestverifyingreader.py>`_
  - New block and payload digest checking utility class, ``Checker``, has been added and is importable from `warcio.checker <checker.py>`_
  - The CLI has been updated to provide ``warcio check``, a command for performing block and payload digest checking

- Ensured that ``ARCHeadersParser``'s splitting on spaces does not split on spaces inside URIs `#62 <https://github.com/webrecorder/warcio/pull/62>`_
- Move the ``compute_headers_buffer`` method and ``headers_buff`` property to ``StatusAndHeaders`` and fix incorrect digests in some test WARCs `#67 <https://github.com/webrecorder/warcio/pull/67>`_
- Ensured that the ``BaseWARCWriter`` does not use a mutable default value for the ``warc_header_dict`` keyword argument `#70 <https://github.com/webrecorder/warcio/pull/70>`_

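
To illustrate what a block or payload digest check involves, the following minimal sketch recomputes a SHA-1 digest and compares it with a ``sha1:``-prefixed, base32-encoded value of the kind commonly seen in ``WARC-Payload-Digest`` headers. It uses only the standard library; ``compute_digest`` and ``digest_matches`` are hypothetical helper names, not part of the warcio API, and the sketch makes no claim about how ``DigestChecker`` is implemented.

```python
import base64
import hashlib


def compute_digest(data, algorithm="sha1"):
    """Return a digest string such as 'sha1:<BASE32>' for the given bytes."""
    hasher = hashlib.new(algorithm)
    hasher.update(data)
    return algorithm + ":" + base64.b32encode(hasher.digest()).decode("ascii")


def digest_matches(data, expected):
    """Recompute the digest named in `expected` and compare (hypothetical helper)."""
    algorithm = expected.split(":", 1)[0]
    return compute_digest(data, algorithm) == expected


payload = b"hello world"
recorded = compute_digest(payload)               # value a writer would have stored
print(digest_matches(payload, recorded))         # unchanged payload
print(digest_matches(payload + b"!", recorded))  # corrupted payload
```

A mismatch only says that the stored bytes and the recorded digest disagree; it does not say which of the two is wrong.
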
1.6.3
~~~~~

- Make ``warcio recompress`` more robust in fixing improperly compressed WARCs, ``--verbose`` mode for printing results `#52 <https://github.com/webrecorder/warcio/issues/52>`_
- ``BufferedReader`` supports streaming all members of a multi-member gzip file with the ``read_all_members=True`` option.

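
For readers unfamiliar with multi-member gzip files, this standard-library sketch shows why reading all members needs explicit handling: a single ``zlib`` decompressor stops at the end of the first member and leaves the rest in ``unused_data``. The function name ``read_all_members`` is borrowed from the option above purely for illustration; this is not warcio's ``BufferedReader`` implementation.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect a gzip header and trailer


def read_all_members(data):
    """Decompress every member of a concatenated (multi-member) gzip byte string."""
    chunks = []
    while data:
        decomp = zlib.decompressobj(GZIP_WBITS)
        chunks.append(decomp.decompress(data))
        chunks.append(decomp.flush())
        data = decomp.unused_data  # bytes left over after this member ended
    return b"".join(chunks)


two_members = gzip.compress(b"first ") + gzip.compress(b"second")

# A single decompressor only yields the first member.
first_only = zlib.decompressobj(GZIP_WBITS).decompress(two_members)
everything = read_all_members(two_members)
print(first_only, everything)
```
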
1.6.2
~~~~~

- Ensure any non-ascii data in HTTP headers is %-encoded, even if non-conformant to RFC 8187 `#51 <https://github.com/webrecorder/warcio/issues/51>`_

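
As a rough illustration of %-encoding non-ASCII header data, the sketch below leaves printable ASCII alone and encodes everything else as UTF-8 percent-escapes using ``urllib.parse.quote``. The helper name ``encode_non_ascii`` and the exact set of characters left untouched are assumptions made for this example; warcio's own encoding rules may differ.

```python
from urllib.parse import quote

# Printable ASCII punctuation and space are left untouched (an assumption of this sketch).
ASCII_SAFE = " !\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"


def encode_non_ascii(value):
    """Percent-encode only the non-ASCII characters of a header value, as UTF-8."""
    return quote(value, safe=ASCII_SAFE)


header = 'attachment; filename="café.txt"'
encoded = encode_non_ascii(header)
print(encoded)
```
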
1.6.1
~~~~~

- Fixes for ``warcio.utils.open()`` not opening files in binary mode in Python 2.7 on Windows `#49 <https://github.com/webrecorder/warcio/issues/49>`_
- ``capture_http()`` various fixes and improvements, default writer, ``WARC-IP-Address`` header support `#50 <https://github.com/webrecorder/warcio/issues/50>`_

1.6.0
~~~~~

- Support WARC/1.1 standard WARC records, reading `#39 <https://github.com/webrecorder/warcio/issues/39>`_ and writing `#46 <https://github.com/webrecorder/warcio/issues/46>`_ with microsecond precision ``WARC-Date``
- Support simplified semantics for capturing http traffic to a WARC `#43 <https://github.com/webrecorder/warcio/issues/43>`_
- Support parsing incorrect wget 1.19 WARCs with angle brackets, e.g. ``WARC-Target-URI: <uri>`` `#42 <https://github.com/webrecorder/warcio/issues/42>`_
- Correct encoding of non-ascii HTTP headers per RFC 8187 `#45 <https://github.com/webrecorder/warcio/issues/45>`_
- New Util Added: ``warcio.utils.open`` provides exclusive creation mode ``open(..., 'x')`` for Python 2.7

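
The difference between a second-precision and a microsecond-precision ``WARC-Date`` can be sketched with the standard library as follows. ``warc_date`` and its ``use_micros`` flag are hypothetical names chosen for this example and are not taken from the warcio API.

```python
from datetime import datetime, timezone


def warc_date(dt, use_micros=False):
    """Format an aware datetime as a WARC-Date style string in UTC (hypothetical helper)."""
    dt = dt.astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if use_micros else "%Y-%m-%dT%H:%M:%SZ"
    return dt.strftime(fmt)


moment = datetime(2020, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc)
print(warc_date(moment))                   # 2020-01-02T03:04:05Z
print(warc_date(moment, use_micros=True))  # 2020-01-02T03:04:05.678901Z
```
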
1.5.3
~~~~~

- ``ArchiveIterator`` calls new ``close_decompressor()`` function in ``BufferedReader`` instead of ``close()`` to only close the decompressor, not the underlying stream. `#35 <https://github.com/webrecorder/warcio/issues/35>`_

1.5.2
~~~~~

- Write any errors during decompression to stderr `#31 <https://github.com/webrecorder/warcio/issues/31>`_
- ``to_native_str()`` returns original value unchanged if not a string/bytes type
- ``WARCWriter.create_revisit_record()`` accepts additional WARC headers dictionary
- ``ArchiveIterator.close()`` added which calls ``decompressor.flush()`` to address possible issues in `#34 <https://github.com/webrecorder/warcio/issues/34>`_
- Switch ``WARC-Record-ID`` uuid creation to ``uuid4()`` from ``uuid1()``

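
For context on the ``uuid1()`` to ``uuid4()`` switch: in Python's ``uuid`` module, ``uuid1()`` derives its value from a host identifier and the current time, while ``uuid4()`` is random. The sketch below builds a record ID in the ``<urn:uuid:...>`` form commonly used for ``WARC-Record-ID``; ``make_record_id`` is a hypothetical helper, not warcio's own function.

```python
import uuid


def make_record_id():
    """Build a WARC-Record-ID style value from a random (version 4) UUID."""
    return "<urn:uuid:{}>".format(uuid.uuid4())


record_id = make_record_id()
print(record_id)
```
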
1.5.1
~~~~~

- remove ``test/data`` from wheel build, as it breaks latest setuptools wheel installation
- add ``Content-Length`` when adding ``Content-Range`` via ``StatusAndHeaders.add_range`` `#29 <https://github.com/webrecorder/warcio/issues/29>`_

1.5.0
~~~~~

- new ``extract`` CLI command `#26 <https://github.com/webrecorder/warcio/issues/26>`_ (by @nlevitt)
- fix for writing WARC record with no content-type `#27 <https://github.com/webrecorder/warcio/issues/27>`_ (by @thomaspreece)
- better verification of chunk header before attempting to de-chunk with ``ChunkedDataReader``
- ``MANIFEST.in`` added (by @pmlandwehr)

1.4.0
~~~~~

- Indexing API improvements:

  - Indexer class moved to ``indexer.py`` and all aspects of indexing process can be extended.
  - Support for accessing http headers with ``http:``-prefixed fields `#22 <https://github.com/webrecorder/warcio/issues/22>`_
  - Special fields: ``filename`` field and ``http:status``
  - JSON ``offset`` and ``length`` fields returned as strings for consistency.

- ``ArchiveIterator`` API: add ``get_record_offset()`` and ``get_record_length()`` to return current offset/length, iterator now tracks current record
- ``StatusAndHeaders`` accepts headers in more flexible formats (mapping, byte or string) and normalizes to string tuples `#19 <https://github.com/webrecorder/warcio/issues/19>`_

1.3.4
~~~~~

- Continuing to read more data to decompress (introduced in 1.3.2 for brotli decompression) now only happens if no unused data remains; otherwise, the reader is likely at the end of a gzip member.

1.3.3
~~~~~

- Set default read ``block_size`` to 16384, ensure ``block_size`` is never None (caused an issue in Python 2.7)

1.3.2
~~~~~

- Fixes issues with ``BufferedReader`` returning an empty response due to the brotli decompressor requiring additional data; for more details see `#21 <https://github.com/webrecorder/warcio/issues/21>`_

1.3.1
~~~~~

- Fixes `#15 <https://github.com/webrecorder/warcio/issues/15>`_, including:

  - ``WARCWriter.create_warc_record()`` works correctly when specifying a payload with no length param.
  - Writing DNS records now works (tests included).
  - HTTP headers only expected for writing ``request``, ``response`` records if the URI has a ``http:`` or ``https:`` scheme (consistent with reading).

1.3
~~~

- Support for reading "streaming" WARC records, with no ``Content-Length`` set. ``Content-Length`` and digests computed as expected when the record is written.
- Additional tests for streaming WARC records, loading HTTP headers+payload from buffer, POST request record, arc2warc conversion.
- ``recompress`` command now parses records fully and generates correct block and payload digests.
- ``WARCWriter.writer.create_record_from_stream()`` removed, redundant with ``ArcWarcRecordLoader()``

1.2
~~~

- Support for special field ``offset`` to include WARC record offset when indexing (by @nlevitt, `#4 <https://github.com/webrecorder/warcio/issues/4>`_)
- ``ArchiveIterator`` supports full iterator semantics
- WARC headers encoded/decoded as UTF-8, with fallback to ISO-8859-1 (see `#6 <https://github.com/webrecorder/warcio/issues/6>`_, `#7 <https://github.com/webrecorder/warcio/issues/7>`_)
- ``ArchiveIterator``, ``StatusAndHeaders`` and ``WARCWriter`` now available from package root (by @nlevitt, `#10 <https://github.com/webrecorder/warcio/issues/10>`_)
- ``StatusAndHeaders`` supports dict-like API (by @nlevitt, `#11 <https://github.com/webrecorder/warcio/issues/11>`_)
- When reading, http headers never added by default, unless ``ensure_http_headers=True`` is set (see `#12 <https://github.com/webrecorder/warcio/issues/12>`_, `#13 <https://github.com/webrecorder/warcio/issues/13>`_)
- All tests run on Windows, CI using Appveyor
- Additional tests for writing/reading resource, metadata records
- ``warcio -V`` now outputs current version.

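
The UTF-8 decoding with ISO-8859-1 fallback mentioned above can be sketched in a few lines of standard-library Python. ``decode_header`` is a hypothetical name used only for this illustration and does not describe warcio's internal code.

```python
def decode_header(raw):
    """Decode header bytes as UTF-8, falling back to ISO-8859-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


as_utf8 = decode_header("café".encode("utf-8"))         # valid UTF-8
as_latin1 = decode_header("café".encode("iso-8859-1"))  # invalid UTF-8, falls back
print(as_utf8, as_latin1)
```

Because ISO-8859-1 accepts any byte sequence, the fallback never raises, at the cost of possibly mis-decoding bytes that were written in some other encoding.
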
1.1
~~~

- Header filtering: support filtering via custom header function, instead of an exclusion list
- Add tests for invalid data passed to ``recompress``, remove unused code

1.0
~~~

Initial Release!