bytes!() should encode to ASCII instead of UTF-8 #13955

SimonSapin · 2014-05-05T23:26:02Z

Update: The original title was bytes!() should encode to "Latin-1" instead of UTF-8, but Latin-1 turned out to be not such a good idea. See discussion below.

Currently, the bytes!() macro encodes character and string arguments as UTF-8. This gives surprising behavior such as bytes!("\xFF") == [0xC3, 0xBF] instead of [0xFF].
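For illustration, the same UTF-8 behavior can be reproduced with ordinary string methods. This is only a sketch: it uses str::as_bytes rather than the bytes!() macro discussed here, the code point is written as \u{FF}, and the helper name utf8_bytes is made up for the example.

```rust
// Hypothetical helper: return the UTF-8 encoding of a string as owned bytes.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // U+00FF is a single code point, but UTF-8 needs two bytes for it.
    let encoded = utf8_bytes("\u{FF}");
    assert_eq!(encoded, [0xC3, 0xBF]);

    // Pure ASCII input is unaffected: one byte per character.
    assert_eq!(utf8_bytes("foo"), [0x66, 0x6F, 0x6F]);
}
```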

This macro should not assume UTF-8, since its typical use case is working with bytes in an encoding that is ASCII-compatible but is not necessarily UTF-8.

Instead, it should map code points in the U+0000 .. U+00FF range to bytes with the same numerical value, and trigger a compile-time error for other code points. (This encoding is sometimes known as "Latin-1", although the official definition of ISO/IEC 8859-1:1998 leaves some bytes unmapped.)
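A rough sketch of the proposed mapping, assuming a runtime check in place of the compile-time error (the function name encode_latin1 and the choice to return the offending character as the error are assumptions made for this example, not part of any existing API):

```rust
// Hypothetical helper: map each code point in U+0000..=U+00FF to the byte
// with the same numerical value. Any other code point is returned as the
// error value instead of being encoded.
fn encode_latin1(s: &str) -> Result<Vec<u8>, char> {
    s.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Ok(code_point as u8)
            } else {
                Err(c)
            }
        })
        .collect()
}

fn main() {
    // U+00FF becomes the single byte 0xFF rather than [0xC3, 0xBF].
    assert_eq!(encode_latin1("\u{FF}"), Ok(vec![0xFF]));
    assert_eq!(encode_latin1("caf\u{E9}"), Ok(vec![b'c', b'a', b'f', 0xE9]));

    // U+0100 is outside the range, so the input is rejected.
    assert_eq!(encode_latin1("\u{100}"), Err('\u{100}'));
}
```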

huonw · 2014-05-05T23:45:49Z

We could disallow all non-ASCII code points, and people who want out-of-range bytes can write them explicitly:

bytes!("foo", 0xff, "bar")

(This works now.)
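One way to picture this ASCII-only rule outside of bytes!() is a const fn check, sketched below. The names ascii_only, FOO, HIGH, BAR and foo_ff_bar are invented for the example, and this is not how the macro itself was implemented.

```rust
// Hypothetical helper: accept only ASCII input.
// When evaluated in a const context, a failed assert is a compile-time error.
const fn ascii_only(s: &str) -> &[u8] {
    let bytes = s.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        assert!(bytes[i].is_ascii(), "non-ASCII input is not allowed");
        i += 1;
    }
    bytes
}

// String parts are checked at compile time.
const FOO: &[u8] = ascii_only("foo");
const BAR: &[u8] = ascii_only("bar");
// Out-of-range bytes are written explicitly as integers.
const HIGH: &[u8] = &[0xff];

// Rough counterpart of bytes!("foo", 0xff, "bar").
fn foo_ff_bar() -> Vec<u8> {
    [FOO, HIGH, BAR].concat()
}

fn main() {
    assert_eq!(foo_ff_bar(), [b'f', b'o', b'o', 0xff, b'b', b'a', b'r']);
}
```

Replacing "foo" with a string containing a non-ASCII character in one of the const items should make the build fail rather than silently producing UTF-8 bytes.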

SimonSapin · 2014-05-05T23:55:37Z

Yes, that would be better still.

Ideally, unescaped non-ASCII code points should be disallowed, and non-ASCII bytes could be represented with escapes such as \xFF (this matches Python 3 byte string literals). But this may require having the tokenizer record more data than is otherwise required. At this point, the effort might be better spent adding full byte literals: rust-lang/rfcs#69
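For reference, byte and byte string literals as they exist in Rust today behave this way. A small sketch (the function name sample is made up for the example):

```rust
// Hypothetical helper returning a byte string literal with a \xFF escape.
fn sample() -> &'static [u8] {
    // The escape passes through as the single byte 0xFF, with no UTF-8 encoding.
    b"foo\xFFbar"
}

fn main() {
    let bytes = sample();
    assert_eq!(bytes.len(), 7);
    assert_eq!(bytes[3], 0xFF);

    // A byte literal accepts the same escape for a single u8.
    assert_eq!(b'\xFF', 0xFF_u8);

    // An unescaped non-ASCII character, as in b"ä", is rejected by the
    // compiler, so it is only mentioned in this comment.
}
```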

ben0x539 · 2014-05-06T23:22:20Z

(Conversely, I'd be really surprised if bytes!("ä") etc encoded as latin1 and not utf-8.)

SimonSapin · 2014-05-07T08:39:56Z

@ben0x539, I agree. I changed the title to say ASCII rather than Latin-1, though ideally I’d still like escapes like "\xFF" to be supported and pass through to a single byte.

SimonSapin · 2014-11-06T18:52:19Z

Responding to #18702 (comment) and #18702 (comment):

@mahkoh, as the person who filed it, I can assure you this bug is not about the \xHH syntax and is not fixed by #18504. As the title says, it’s about arbitrarily encoding to UTF-8 for a feature that is used in contexts where text is not necessarily in UTF-8. I argue that what you call the expected behavior is wrong, for reasons explained above. That C is doing it wrong is not a reason for us to do the same.

mahkoh · 2014-11-06T18:57:30Z

> I argue that what you call the expected behavior is wrong, for reasons explained above.

You explained nothing like this above.

SimonSapin · 2014-11-06T19:04:02Z

> This macro should not assume UTF-8, since its typical use case is working with bytes in an encoding that is ASCII-compatible but is not necessarily UTF-8.

mahkoh · 2014-11-06T19:05:44Z

That's a claim in need of proof, not an explanation. I use lots of byte literals in places where everything is expected to be UTF8.

steveklabnik · 2015-01-21T20:07:00Z

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#683

SimonSapin mentioned this issue May 6, 2014

RFC: Add byte and byte string literals rust-lang/rfcs#69

Merged

SimonSapin changed the title from bytes!() should encode to "Latin-1" instead of UTF-8 to bytes!() should encode to ASCII instead of UTF-8 on May 7, 2014

thestinger added the B-RFC label Sep 26, 2014

steveklabnik mentioned this issue Jan 21, 2015

bytes!() should encode to ASCII instead of UTF-8 rust-lang/rfcs#683

Closed

steveklabnik closed this as completed Jan 21, 2015
