Inconsistent UTF8 decoding behaviour based on the underlying chunking #330

Closed
utdemir opened this issue Apr 26, 2021 · 2 comments · Fixed by #333

@utdemir

utdemir commented Apr 26, 2021

Depending on the way the input is chunked, decodeUtf8With returns different results for the "same" bytestring. See:

> import qualified Data.Text.Lazy.Encoding as T
> import qualified Data.Text.Encoding.Error as T
> import qualified Data.ByteString as BS
> import qualified Data.ByteString.Lazy as BL
> let bs1 = BL.fromChunks [BS.pack [194], BS.pack [97, 98, 99]]
> let bs2 = BL.fromChunks [BS.pack [194, 97, 98, 99]]
> bs1
"\194abc"
> bs2
"\194abc"
> bs1 == bs2
True
> T.decodeUtf8With T.lenientDecode bs1
"\65533bc"
> T.decodeUtf8With T.lenientDecode bs2
"\65533abc"

Another example:

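> :set -XOverloadedStrings
> import Data.Text.Encoding (Decoding(..), streamDecodeUtf8With)
> import Data.Text.Encoding.Error (lenientDecode)
> let Some y l r = streamDecodeUtf8With lenientDecode "\194"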
> (y, l)
("","\194")
> let Some y' l' r' = r "abcde"
> (y', l')
("\65533bcde","")

but:

> let Some y l r = streamDecodeUtf8With lenientDecode "\194abcde"
> (y, l)
("\65533abcde","")

I noticed this while property-testing a function that uses streamDecodeUtf8With against a simpler one that uses decodeUtf8With on strict Text, but the same issue appeared in other places too.

To me, this sounds like a bug, since it breaks equational reasoning: two equal bytestrings decode to different Text values. If it is the expected behaviour for some reason, it should be documented prominently around the decodeUtf8 functions.
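
For anyone who wants to check their installed version, here is a minimal sketch of the invariant I would expect to hold: decoding should not depend on where the lazy ByteString is split. The helper names splits and chunkingInvariant are made up for this example; only decodeUtf8With, lenientDecode and the ByteString functions come from the libraries.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: every way of cutting a strict ByteString into two
-- chunks. BL.fromChunks drops empty chunks, so the first and last entries
-- are single-chunk lazy ByteStrings.
splits :: BS.ByteString -> [BL.ByteString]
splits bs =
  [ BL.fromChunks [a, b]
  | i <- [0 .. BS.length bs]
  , let (a, b) = BS.splitAt i bs
  ]

-- Hypothetical property: all chunkings decode to the same lazy Text as the
-- single-chunk input.
chunkingInvariant :: BS.ByteString -> Bool
chunkingInvariant bs = all ((== expected) . decode) (splits bs)
  where
    decode = TLE.decodeUtf8With lenientDecode
    expected = decode (BL.fromStrict bs)

main :: IO ()
main = print (chunkingInvariant (BS.pack [194, 97, 98, 99]))
```

Going by the session above, I would expect this to print False on an affected version of text. The same predicate could be handed to a property-testing library with arbitrary byte lists as input.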

This issue looks vaguely relevant: #60

@Bodigrim
Contributor

Looks bad :(

@Lysxia Lysxia added the bug label May 1, 2021
@Lysxia
Contributor

Lysxia commented May 1, 2021

I think I found the cause of the bug, though I'm not sure how to resolve it. In the above example, which of the two results is the preferable one? The bs1 behavior seems simplest to implement.

The problem is in cbits/cbits.c, in the function _hs_text_decode_utf8_int: after an error, the last pointer points to the end of the part that was successfully decoded (i.e., a prefix that is a valid UTF-8 string), and we recover by skipping to the next byte. That holds unless a chunk ends in the middle of a UTF-8 code point that turns out to be invalid once the next chunk arrives. In that case last points to the beginning of the new chunk, which is technically "in the middle" of an undecoded sequence, so we end up skipping more than one byte.
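
To make the intended recovery rule concrete, here is a small pure model in Haskell. It is only a sketch of one chunk-independent rule (replace exactly the offending byte, then resume at the next byte, carrying an incomplete sequence over to the next chunk). It is not the C code in cbits.c and not necessarily what a fix should look like. All names in it (Step, step, feed, decodeChunks) are invented for the illustration, and it uses only base.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

data Step
  = Decoded Char [Word8]  -- one code point, plus the remaining input
  | Invalid [Word8]       -- first byte is bad; input after skipping that byte
  | Incomplete            -- input ends inside a sequence that may still be valid

-- Try to decode one code point from the front of the input.
step :: [Word8] -> Step
step [] = Incomplete
step (b : rest)
  | b < 0x80               = Decoded (chr (fromIntegral b)) rest
  | b >= 0xC2 && b <= 0xDF = continue 1 0x80    (fromIntegral (b .&. 0x1F)) rest
  | b >= 0xE0 && b <= 0xEF = continue 2 0x800   (fromIntegral (b .&. 0x0F)) rest
  | b >= 0xF0 && b <= 0xF4 = continue 3 0x10000 (fromIntegral (b .&. 0x07)) rest
  | otherwise              = Invalid rest
  where
    -- continue k lo acc xs: k continuation bytes still expected, lo is the
    -- smallest code point allowed for this length (rejects overlong forms).
    continue :: Int -> Int -> Int -> [Word8] -> Step
    continue 0 lo acc xs
      | acc < lo || acc > 0x10FFFF || (acc >= 0xD800 && acc <= 0xDFFF) = Invalid rest
      | otherwise = Decoded (chr acc) xs
    continue _ _ _ [] = Incomplete
    continue k lo acc (c : xs)
      | c .&. 0xC0 == 0x80 =
          continue (k - 1) lo ((acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) xs
      | otherwise = Invalid rest  -- skip only the lead byte, keep the rest

-- Decode as much as possible; return the output and the undecoded tail.
feed :: [Word8] -> (String, [Word8])
feed [] = ("", [])
feed bs = case step bs of
  Decoded c rest -> let (out, left) = feed rest in (c : out, left)
  Invalid rest   -> let (out, left) = feed rest in ('\xFFFD' : out, left)
  Incomplete     -> ("", bs)

-- Decode a list of chunks, prepending the undecoded tail to the next chunk.
-- Bytes still pending at the end of input each become U+FFFD.
decodeChunks :: [[Word8]] -> String
decodeChunks = go []
  where
    go carry [] = map (const '\xFFFD') carry
    go carry (c : cs) =
      let (out, left) = feed (carry ++ c)
      in out ++ go left cs

main :: IO ()
main = do
  print (decodeChunks [[194], [97, 98, 99]])  -- prints "\65533abc"
  print (decodeChunks [[194, 97, 98, 99]])    -- prints "\65533abc"
```

Because every recovery decision is made on the carried bytes plus the new chunk, the model gives the same answer for both chunkings of the example, and that answer matches the bs2 result. Whether that or the bs1 result is preferable is still the open question.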
