> let Some y l r = streamDecodeUtf8With lenientDecode "\194"
> (y, l)
("","\194")
> let Some y' l' r' = r "abcde"
> (y', l')
("\65533bcde","")
but:
> let Some y l r = streamDecodeUtf8With lenientDecode "\194abcde"
> (y, l)
("\65533abcde","")
I noticed this while property-testing a function that uses streamDecodeUtf8With against a simpler one that uses decodeUtf8With on strict Text, but the same issue appeared in other places too.
To me, this sounds like a bug, since it breaks equational reasoning; but if it is the expected behaviour for some reason, it should be documented prominently around the decodeUtf8 functions.
I think I found the cause of the bug, though I'm not sure how to resolve it. In the above example, which of the two results is the preferable one? The bs1 behavior seems simplest to implement.
The problem is in cbits/cbits.c, in the function _hs_text_decode_utf8_int: after an error, the last pointer points to the end of the part that was successfully decoded (i.e., a prefix which is a valid UTF-8 string), and we recover by skipping to the next byte. That is, unless a chunk ends in the middle of a UTF-8 code point which turns out to be invalid once the next chunk arrives. Then last will point to the beginning of the new chunk, which is technically "in the middle" of an undecoded sequence, causing us to skip more than one byte.
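For anyone who wants to check this against their installed version of text, here is a minimal sketch that decodes the same bytes once in one piece and once for every two-way split, and reports where the results differ. The helper names decodeChunked and twoWaySplits are made up for this example and are not part of the library. No output is shown, because it depends on the version of text in use.

```haskell
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (Decoding (..), decodeUtf8With, streamDecodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: feed the chunks one by one to the streaming decoder.
-- Returns the concatenated text plus any bytes left undecoded at the end.
decodeChunked :: [B.ByteString] -> (T.Text, B.ByteString)
decodeChunked = go (streamDecodeUtf8With lenientDecode B.empty)
  where
    go (Some t leftover _) [] = (t, leftover)
    go (Some t _ next) (c : cs) =
      let (rest, leftover) = go (next c) cs
       in (T.append t rest, leftover)

-- Hypothetical helper: every way of cutting the input into two chunks.
twoWaySplits :: B.ByteString -> [[B.ByteString]]
twoWaySplits bs =
  [ [a, b] | i <- [0 .. B.length bs], let (a, b) = B.splitAt i bs ]

main :: IO ()
main = do
  -- The bytes of "\194abcde" from the example above.
  let input = B.pack [194, 97, 98, 99, 100, 101]
      whole = decodeUtf8With lenientDecode input
  putStrLn ("whole: " ++ show whole)
  mapM_ (report whole) (twoWaySplits input)
  where
    report whole chunks = do
      let (chunked, leftover) = decodeChunked chunks
          verdict
            | chunked == whole && B.null leftover = "agrees"
            | otherwise = "DIFFERS"
      putStrLn (show chunks ++ " -> " ++ show (chunked, leftover) ++ " " ++ verdict)
```

If the decoder is insensitive to chunking, every line should say "agrees"; a "DIFFERS" on the split right after the \194 byte would correspond to the case described above.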
Depending on the way the input is chunked, decodeUtf8With returns different results for the "same" bytestring; see the examples above.
This issue looks vaguely relevant: #60