bytes!() should encode to ASCII instead of UTF-8 #13955
Comments
We could disallow all non-ascii codepoints and people who want out-of-range bytes can write them explicitly: bytes!("foo", 0xff, "bar") (This works now.)
Yes, that would be better still. Ideally, unescaped non-ASCII code points should be disallowed, and ASCII bytes could be represented with
(Conversely, I'd be really surprised if bytes!("ä") etc encoded as latin1 and not utf-8.)
@ben0x539, I agree. I changed the title to say ASCII rather than Latin-1, though ideally I’d still like escapes like
Responding to #18702 (comment) and #18702 (comment): @mahkoh, as the person who filed it, I can assure you this bug is not about the \xHH syntax and is not fixed by #18504. As the title says, it’s about arbitrarily encoding to UTF-8 for a feature that is used in contexts where text is not necessarily in UTF-8. I argue that what you call the expected behavior is wrong, for reasons explained above. That C is doing it wrong is not a reason for us to do the same.
You explained nothing like this above.
That's a claim in need of proof, not an explanation. I use lots of byte literals in places where everything is expected to be UTF8.
I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized. This issue has been moved to the RFCs repo: rust-lang/rfcs#683
Update: The original title was bytes!() should encode to "Latin-1" instead of UTF-8, but Latin-1 turned out to be not such a good idea. See discussion below.
Currently, the bytes!() macro encodes character and string arguments as UTF-8. This gives surprising behavior such as bytes!("\xFF") == [0xC3, 0xBF] instead of [0xFF].
This macro should not assume UTF-8, since its typical use case is working with bytes in an encoding that is ASCII-compatible but is not necessarily UTF-8.
Instead, it should map code points in the U+0000 .. U+00FF range to bytes with the same numerical value, and trigger a compile-time error for other code points. (This encoding is sometimes known as "Latin-1", although the official definition of ISO/IEC 8859-1:1998 leaves some bytes unmapped.)
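To make the difference concrete, here is a rough sketch of the two mappings as ordinary functions. The names latin1_bytes and utf8_bytes are hypothetical and exist only for illustration. They are not the macro's implementation, and they report out-of-range code points at run time, whereas the proposal asks for a compile-time error.

```rust
/// Hypothetical helper (not an existing Rust API): maps each code point in
/// U+0000..=U+00FF to the byte with the same numerical value. Returns the
/// first offending character as the error for anything outside that range.
fn latin1_bytes(s: &str) -> Result<Vec<u8>, char> {
    s.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Ok(code_point as u8)
            } else {
                Err(c)
            }
        })
        .collect()
}

/// Hypothetical helper showing the UTF-8 behavior described above:
/// the bytes are simply the string's UTF-8 encoding.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}
```

With these definitions, utf8_bytes("\u{FF}") yields [0xC3, 0xBF], while latin1_bytes("\u{FF}") yields Ok([0xFF]), and a code point such as U+20AC is rejected by latin1_bytes.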