bytes!() should encode to ASCII instead of UTF-8 #13955

Closed
SimonSapin opened this issue May 5, 2014 · 9 comments

@SimonSapin
Contributor

Update: The original title was 'bytes!() should encode to "Latin-1" instead of UTF-8', but Latin-1 turned out not to be such a good idea. See discussion below.


Currently, the bytes!() macro encodes character and string arguments as UTF-8. This gives surprising behavior such as bytes!("\xFF") == [0xC3, 0xBF] instead of [0xFF].
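
The bytes!() macro has since been removed from Rust, so the behavior above cannot be reproduced directly. As a minimal sketch of the same effect, the hypothetical helper `utf8_bytes` below (not part of any library) UTF-8-encodes a single code point; U+00FF comes out as two bytes rather than one.

```rust
/// Hypothetical helper: returns the UTF-8 encoding of one code point.
pub fn utf8_bytes(c: char) -> Vec<u8> {
    // `to_string` yields a UTF-8 `String`; `into_bytes` exposes its raw bytes.
    c.to_string().into_bytes()
}
```

Calling `utf8_bytes('\u{FF}')` yields `[0xC3, 0xBF]`, the two-byte result the issue describes, while an ASCII code point such as `'A'` stays a single byte.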

This macro should not assume UTF-8, since its typical use case is working with bytes in an encoding that is ASCII-compatible but is not necessarily UTF-8.

Instead, it should map code points in the U+0000 .. U+00FF range to bytes with the same numerical value, and trigger a compile-time error for other code points. (This encoding is sometimes known as "Latin-1", although the official definition of ISO/IEC 8859-1:1998 leaves some bytes unmapped.)
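
The proposed mapping can be sketched as a runtime function. This is only an illustration: the name `encode_latin1` is hypothetical, and it reports the offending character through a `Result` at run time, whereas the proposal asks for a compile-time error inside the macro.

```rust
/// Hypothetical helper: maps each code point in U+0000..=U+00FF to the byte
/// with the same numerical value. Returns the first character outside that
/// range as an error.
pub fn encode_latin1(s: &str) -> Result<Vec<u8>, char> {
    s.chars()
        .map(|c| {
            let n = c as u32;
            if n <= 0xFF {
                Ok(n as u8) // same numerical value as the code point
            } else {
                Err(c) // not representable in a single byte
            }
        })
        .collect()
}
```

With this sketch, `encode_latin1("\u{FF}")` gives `Ok(vec![0xFF])`, and any string containing a code point above U+00FF is rejected.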

@huonw
Member

huonw commented May 5, 2014

We could disallow all non-ASCII code points, and people who want out-of-range bytes can write them explicitly:

bytes!("foo", 0xff, "bar")

(This works now.)
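
Since bytes!() no longer exists, here is a rough sketch of the same idea as ordinary functions: text pieces must be pure ASCII, and any other byte is pushed explicitly. Both `ascii_bytes` and `foo_ff_bar` are hypothetical names, and the check happens at run time rather than at compile time.

```rust
/// Hypothetical helper: accepts only ASCII text and returns the first
/// non-ASCII character as an error otherwise.
pub fn ascii_bytes(s: &str) -> Result<Vec<u8>, char> {
    match s.chars().find(|c| !c.is_ascii()) {
        Some(c) => Err(c),
        None => Ok(s.as_bytes().to_vec()),
    }
}

/// Builds the equivalent of bytes!("foo", 0xff, "bar") piece by piece.
pub fn foo_ff_bar() -> Result<Vec<u8>, char> {
    let mut out = ascii_bytes("foo")?;
    out.push(0xff); // the out-of-range byte is written explicitly
    out.extend(ascii_bytes("bar")?);
    Ok(out)
}
```

Under this scheme a string containing "ä" is rejected instead of being silently encoded in any particular encoding.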

@SimonSapin
Contributor Author

Yes, that would be better still.

Ideally, unescaped non-ASCII code points should be disallowed, and non-ASCII bytes could be represented with escapes like \xFF (this matches Python 3 byte string literals). But this may require having the tokenizer record more data than is otherwise required. At this point, the effort might be better spent adding full byte literals: rust-lang/rfcs#69
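
For comparison, byte string literals in current Rust behave this way: ASCII text is written directly and each \xHH escape produces exactly one byte. The function name `cafe_latin1` below is only an illustrative wrapper.

```rust
/// Returns a byte-string literal mixing ASCII text with one escaped byte.
pub fn cafe_latin1() -> &'static [u8] {
    // `\xE9` is passed through as the single byte 0xE9, not UTF-8 encoded.
    b"caf\xE9"
}

// Writing b"café" with an unescaped non-ASCII character does not compile:
// such characters must be written as escapes in a byte-string literal.
```

The result is four bytes long, with the last byte equal to 0xE9.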

@ben0x539
Contributor

ben0x539 commented May 6, 2014

(Conversely, I'd be really surprised if bytes!("ä") etc encoded as latin1 and not utf-8.)

@SimonSapin changed the title from 'bytes!() should encode to "Latin-1" instead of UTF-8' to 'bytes!() should encode to ASCII instead of UTF-8' on May 7, 2014
@SimonSapin
Contributor Author

@ben0x539, I agree. I changed the title to say ASCII rather than Latin-1, though ideally I’d still like escapes like "\xFF" to be supported and pass through to a single byte.

@SimonSapin
Contributor Author

Responding to #18702 (comment) and #18702 (comment):

@mahkoh, as the person who filed it, I can assure you this bug is not about the \xHH syntax and is not fixed by #18504. As the title says, it’s about arbitrarily encoding to UTF-8 for a feature that is used in contexts where text is not necessarily in UTF-8. I argue that what you call the expected behavior is wrong, for reasons explained above. That C is doing it wrong is not a reason for us to do the same.

@mahkoh
Contributor

mahkoh commented Nov 6, 2014

> I argue that what you call the expected behavior is wrong, for reasons explained above.

You explained nothing like this above.

@SimonSapin
Contributor Author

> This macro should not assume UTF-8, since its typical use case is working with bytes in an encoding that is ASCII-compatible but is not necessarily UTF-8.

@mahkoh
Contributor

mahkoh commented Nov 6, 2014

That's a claim in need of proof, not an explanation. I use lots of byte literals in places where everything is expected to be UTF-8.

@steveklabnik
Member

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#683
