unicut - Truncate Unicode strings to byte length

This project provides and analyses Python functions to truncate a Unicode string in such a way that its UTF-8 byte representation does not exceed a given number of bytes.

Introduction

For the uninitiated, a brief example. Consider the string

Happy Days!😀

This string has 12 characters, but its UTF-8 representation has 15 bytes: 11 for the ASCII characters Happy Days! and 4 bytes (\xf0\x9f\x98\x80) for the smiley 😀.

If we want to cut the string to a length of, say, 14 bytes (or, more precisely, if we want to truncate the UTF-8 representation of the string to contain no more than 14 bytes), then we have to cut before the last character:

Happy Days!

But we cannot cut the byte representation at exactly 14 bytes, because that would leave three dangling bytes of the smiley in the byte stream, making it invalid UTF-8. Therefore we have to cut at 11 bytes!
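A quick demonstration of both points in Python (this snippet is only illustrative and not part of the repository's code):

    s = "Happy Days!\N{GRINNING FACE}"   # the example string from above
    data = s.encode("utf-8")
    print(len(s), len(data))             # -> 12 15

    # Slicing the bytes at 14 splits the smiley and leaves invalid UTF-8:
    try:
        data[:14].decode("utf-8")
    except UnicodeDecodeError as exc:
        print("invalid UTF-8:", exc)

    # Cutting at 11 bytes keeps only complete characters:
    print(data[:11].decode("utf-8"))     # -> Happy Days!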

So cutting Unicode strings to (UTF-8) byte lengths is not trivial.

Different implementations

We have two different truncation implementations here. The first, truncate_by_concating, is my initial naive attempt via concatenation; the second, truncate_by_backing_up_bytes, is in its core by Stack Overflow user zvone. He kindly wrote it up in an answer to my question about this topic. I have amended the version here to be safe against some edge cases which I found via property-based testing with Hypothesis.
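The authoritative code lives in truncate.py; the two ideas can be sketched roughly like this (a sketch under assumed signatures, not necessarily identical to the repository's code):

    def truncate_by_concating(s: str, max_bytes: int) -> str:
        """Naive approach: append characters while the encoded result still fits."""
        result, size = "", 0
        for ch in s:
            size += len(ch.encode("utf-8"))
            if size > max_bytes:
                break
            result += ch
        return result

    def truncate_by_backing_up_bytes(s: str, max_bytes: int) -> str:
        """Cut the encoded bytes, then back out of a split character."""
        encoded = s.encode("utf-8")
        if len(encoded) <= max_bytes:
            return s
        i = max_bytes
        # A UTF-8 continuation byte has the form 0b10xxxxxx. If the byte at
        # the cut position is one, the cut landed inside a multi-byte
        # character: back up until i points at that character's lead byte,
        # so the slice excludes the partial character entirely.
        while i > 0 and (encoded[i] & 0xC0) == 0x80:
            i -= 1
        return encoded[:i].decode("utf-8")

A property-based test in the spirit of test_truncate.py could look like this (again an assumption about the actual test, not a copy of it):

    from hypothesis import given, strategies as st

    @given(st.text(), st.integers(min_value=0, max_value=64))
    def test_fits_and_is_a_prefix(s, max_bytes):
        out = truncate_by_backing_up_bytes(s, max_bytes)
        assert len(out.encode("utf-8")) <= max_bytes  # fits the byte budget
        assert s.startswith(out)                      # only cut at the end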

Performance

Furthermore, I wrote a few performance tests with the timeit module from the Standard Library (see timeit_truncate.py). Here is the output:

--- Timings WITHOUT cutting the strings ---
Time 'truncate_by_concating' with SHORT_UNICODE string UNCUT
100000 loops, best of 5: 0.794 usec per loop
Time 'truncate_by_backing_up_bytes' with SHORT_UNICODE string UNCUT
100000 loops, best of 5: 3.57 usec per loop
Time 'truncate_by_concating' with LONG_UNICODE string UNCUT
100000 loops, best of 5: 0.888 usec per loop
Time 'truncate_by_backing_up_bytes' with LONG_UNICODE string UNCUT
100000 loops, best of 5: 1.62 usec per loop

--- Timings WITH cutting the strings (at len-2) ---
Time 'truncate_by_concating' with SHORT_UNICODE string CUT at len-2
100000 loops, best of 5: 1.48 usec per loop
Time 'truncate_by_backing_up_bytes' with SHORT_UNICODE string CUT at len-2
100000 loops, best of 5: 5.18 usec per loop
Time 'truncate_by_concating' with LONG_UNICODE string CUT at len-2
100000 loops, best of 5: 22.1 usec per loop
Time 'truncate_by_backing_up_bytes' with LONG_UNICODE string CUT at len-2
100000 loops, best of 5: 4.08 usec per loop
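For reference, one line of such a table can be reproduced roughly like this (a sketch: the concrete test string and the "len-2" cut are assumptions about timeit_truncate.py, not a copy of it):

    import timeit

    setup = (
        "from truncate import truncate_by_concating as f\n"
        "s = 'Happy Days!\\U0001F600' * 100\n"
    )
    stmt = "f(s, len(s.encode('utf-8')) - 2)"

    # best-of-5 over 100000 loops, like the output above
    best = min(timeit.repeat(stmt, setup=setup, repeat=5, number=100_000))
    print(f"{best / 100_000 * 1e6:.3g} usec per loop")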

Very surprisingly, truncate_by_concating is faster than truncate_by_backing_up_bytes in every case except cutting the long string! This "shouldn't" be the case - particularly for larger strings.

TODO

  • Find out why truncate_by_backing_up_bytes is mostly slower than truncate_by_concating.