Inconsistent UTF8 decoding behaviour based on the underlying chunking #330

utdemir · 2021-04-26T06:38:44Z

Depending on the way the input is chunked, decodeUtf8With returns different results for the "same" bytestring. See:

> import qualified Data.Text.Lazy.Encoding as T
> import qualified Data.Text.Encoding.Error as T
> import qualified Data.ByteString as BS
> import qualified Data.ByteString.Lazy as BL
> let bs1 = BL.fromChunks [BS.pack [194], BS.pack [97, 98, 99]]
> let bs2 = BL.fromChunks [BS.pack [194, 97, 98, 99]]
> bs1
"\194abc"
> bs2
"\194abc"
> bs1 == bs2
True
> T.decodeUtf8With T.lenientDecode bs1
"\65533bc"
> T.decodeUtf8With T.lenientDecode bs2
"\65533abc"

Another example:

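> :set -XOverloadedStrings
> import Data.Text.Encoding (Decoding (..), streamDecodeUtf8With)
> import Data.Text.Encoding.Error (lenientDecode)
> let Some y l r = streamDecodeUtf8With lenientDecode "\194"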
> (y, l)
("","\194")
> let Some y' l' r' = r "abcde"
> (y', l')
("\65533bcde","")

but:

> let Some y l r = streamDecodeUtf8With lenientDecode "\194abcde"
> (y, l)
("\65533abcde","")
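The manual stepping above can be wrapped in a small helper to make the comparison between chunked and one-shot decoding easier to repeat. This is only a sketch: `feedChunks` is a hypothetical name, not part of the text library, and it assumes the `Decoding (Some)` constructor and `streamDecodeUtf8With` are exported by `Data.Text.Encoding`, as they are in the session above.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (Decoding (Some), streamDecodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: feed chunks one by one through the streaming decoder.
-- Returns all decoded text plus whatever bytes were left undecoded after the
-- last chunk (an empty ByteString if no chunk was fed).
feedChunks :: [BS.ByteString] -> (T.Text, BS.ByteString)
feedChunks = go (streamDecodeUtf8With lenientDecode) [] BS.empty
  where
    go _ acc leftover [] = (T.concat (reverse acc), leftover)
    go k acc _ (c : cs) =
      case k c of
        Some out leftover k' -> go k' (out : acc) leftover cs

main :: IO ()
main = do
  -- The two inputs from the report: split after the first byte, and unsplit.
  print (feedChunks [BS.pack [194], BS.pack [97, 98, 99, 100, 101]])
  print (feedChunks [BS.pack [194, 97, 98, 99, 100, 101]])
```

If decoding were independent of chunking, both lines would print the same text. No expected output is given here because the result depends on the installed version of text.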

I noticed this while property-testing a function that uses streamDecodeUtf8With against a simpler one that uses decodeUtf8With on strict Text, but the same issue appeared in other places too.

To me, this sounds like a bug, since it breaks equational reasoning: two lazy bytestrings that compare equal decode to different texts. If it is the expected behaviour for some reason, it should be documented prominently around the decodeUtf8 functions.
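The property at stake can be checked exhaustively for small inputs. The following is a minimal sketch rather than a test from the library's own suite; `allChunkings` and `chunkingCounterexamples` are hypothetical names introduced here. It lists every way of cutting a strict bytestring into non-empty chunks and reports the chunkings whose lenient decoding differs from the single-chunk result.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Text.Encoding.Error (lenientDecode)

-- Every way of cutting a strict ByteString into non-empty consecutive chunks.
-- An input of n > 0 bytes has 2^(n-1) chunkings, so keep inputs short.
allChunkings :: BS.ByteString -> [[BS.ByteString]]
allChunkings bs
  | BS.null bs = [[]]
  | otherwise =
      [ pre : rest
      | n <- [1 .. BS.length bs]
      , let (pre, suf) = BS.splitAt n bs
      , rest <- allChunkings suf
      ]

-- Chunkings whose lenient decoding differs from decoding the input as a
-- single chunk. An empty result means decoding did not depend on chunking.
chunkingCounterexamples :: BS.ByteString -> [[BS.ByteString]]
chunkingCounterexamples bs =
  [ cs
  | cs <- allChunkings bs
  , decode (BL.fromChunks cs) /= expected
  ]
  where
    decode = TLE.decodeUtf8With lenientDecode
    expected = decode (BL.fromStrict bs)

main :: IO ()
main = print (chunkingCounterexamples (BS.pack [194, 97, 98, 99]))
```

On a version of text affected by this issue, the list would be expected to include the chunking from the report, `["\194", "abc"]`. A generator of random byte strings could replace the fixed input to turn this into a property test.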

This issue looks vaguely relevant: #60


Bodigrim · 2021-04-30T21:11:01Z

Looks bad :(

Lysxia · 2021-05-01T18:35:51Z

I think I found the cause of the bug, though I'm not sure how to resolve it. In the above example, which of the two results is the preferable one? The bs1 behavior seems simplest to implement.

The problem is in cbits/cbits.c, in the function _hs_text_decode_utf8_int. After an error, the last pointer normally points to the end of the part that was successfully decoded (i.e., a prefix which is a valid UTF-8 string), and we recover by skipping to the next byte. The exception is when a chunk ends in the middle of a UTF-8 code point that turns out to be invalid once the next chunk arrives. In that case last points to the beginning of the new chunk, which is technically "in the middle" of an undecoded sequence, so we end up skipping more than one byte.

Lysxia added the bug label May 1, 2021

Lysxia mentioned this issue May 2, 2021 in "Fix UTF-8 decoding of lazy bytestrings" #333 (merged)

Bodigrim closed this as completed in #333 May 22, 2021
