This project provides and analyses Python functions to truncate a Unicode string in a way that its UTF-8 byte representation does not exceed a given number of bytes.
For the uninitiated, a brief example. Consider the string

```
Happy Days!😀
```

This string has 12 characters, but its UTF-8 representation has 15 bytes: 11 for the ASCII characters `Happy Days!` and 4 bytes (`\xf0\x9f\x98\x80`) for the smiley 😀.
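A quick check in the Python REPL confirms these counts:

```python
s = "Happy Days!\N{GRINNING FACE}"  # same string as "Happy Days!😀"

print(len(s))                  # 12 characters
print(len(s.encode("utf-8")))  # 15 bytes
print(s.encode("utf-8")[11:])  # b'\xf0\x9f\x98\x80', the four bytes of the smiley
```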
If we want to cut the string to a length of, say, 14 bytes (or, more precisely, if we want to truncate the UTF-8 representation of the string so that it contains no more than 14 bytes), then we have to cut before the last character:

```
Happy Days!
```

But we cannot simply cut the byte representation at 14 bytes, because that would leave three dangling bytes of the smiley in the byte stream, making it invalid UTF-8. Therefore we have to cut at 11 bytes!
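This is easy to demonstrate: slicing the encoded bytes at 14 leaves a partial character that no longer decodes, while slicing at 11 works fine:

```python
s = "Happy Days!\N{GRINNING FACE}"

s.encode("utf-8")[:11].decode("utf-8")  # 'Happy Days!' -- valid UTF-8
s.encode("utf-8")[:14].decode("utf-8")  # raises UnicodeDecodeError: unexpected end of data
```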
So cutting Unicode strings to (UTF-8) byte lengths is not trivial.
We have two different truncation implementations here. The first, `truncate_by_concating`, is my naive attempt via concatenation; the second, `truncate_by_backing_up_bytes`, is at its core by Stack Overflow user zvone, who kindly wrote it up in an answer to my question on this topic. I have amended the version here to be safe against some edge cases which I found via property-based testing with Hypothesis.
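The actual implementations live in `truncate.py`; in spirit, the two approaches look roughly like this (a minimal sketch, not the repository's exact code):

```python
def truncate_by_concating(s: str, max_bytes: int) -> str:
    # Naive approach: grow the result character by character while the
    # encoded size still fits into the byte budget.
    result, size = "", 0
    for ch in s:
        size += len(ch.encode("utf-8"))
        if size > max_bytes:
            break
        result += ch
    return result

def truncate_by_backing_up_bytes(s: str, max_bytes: int) -> str:
    # Encode once, cut at the byte budget, then back up over dangling bytes.
    if max_bytes <= 0:
        return ""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    cut = max_bytes
    # UTF-8 continuation bytes match the pattern 0b10xxxxxx; step back
    # until `cut` points at the first byte of a character.
    while cut > 0 and (encoded[cut] & 0b11000000) == 0b10000000:
        cut -= 1
    return encoded[:cut].decode("utf-8")

print(truncate_by_backing_up_bytes("Happy Days!\N{GRINNING FACE}", 14))  # Happy Days!
```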
Furthermore, I wrote a few performance tests with the `timeit` module from the standard library. Here is the output:
```
--- Timings WITHOUT cutting the strings ---
Time 'truncate_by_concating' with SHORT_UNICODE string UNCUT
100000 loops, best of 5: 0.794 usec per loop
Time 'truncate_by_backing_up_bytes' with SHORT_UNICODE string UNCUT
100000 loops, best of 5: 3.57 usec per loop
Time 'truncate_by_concating' with LONG_UNICODE string UNCUT
100000 loops, best of 5: 0.888 usec per loop
Time 'truncate_by_backing_up_bytes' with LONG_UNICODE string UNCUT
100000 loops, best of 5: 1.62 usec per loop
--- Timings WITH cutting the strings (at len-2) ---
Time 'truncate_by_concating' with SHORT_UNICODE string CUT at len-2
100000 loops, best of 5: 1.48 usec per loop
Time 'truncate_by_backing_up_bytes' with SHORT_UNICODE string CUT at len-2
100000 loops, best of 5: 5.18 usec per loop
Time 'truncate_by_concating' with LONG_UNICODE string CUT at len-2
100000 loops, best of 5: 22.1 usec per loop
Time 'truncate_by_backing_up_bytes' with LONG_UNICODE string CUT at len-2
100000 loops, best of 5: 4.08 usec per loop
```
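Each number is a best-of-five `timeit` measurement; a single one can be reproduced along these lines (a hypothetical sketch; the exact test strings and the meaning of "len-2" are defined in `timeit_truncate.py`):

```python
import timeit

from truncate import truncate_by_backing_up_bytes  # assumes truncate.py is on the path

s = "Happy Days!\N{GRINNING FACE}" * 100  # hypothetical stand-in for LONG_UNICODE
n = len(s.encode("utf-8")) - 2            # one possible reading of "CUT at len-2"

best = min(timeit.repeat("truncate_by_backing_up_bytes(s, n)",
                         globals=globals(), repeat=5, number=100_000))
print(f"100000 loops, best of 5: {best / 100_000 * 1e6:.3g} usec per loop")
```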
Very surprisingly, `truncate_by_concating` is faster than `truncate_by_backing_up_bytes` in most of these runs!!! This "shouldn't" be the case, particularly for larger strings. Only for the long string that actually gets cut is `truncate_by_backing_up_bytes` faster than `truncate_by_concating`.