From 9fdd8f17e0851b7cf9629f5578a93b5a77dfd938 Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 14:03:56 +0100 Subject: [PATCH 01/10] Add c_str_literal rfc. --- text/3348-c-str-literal.md | 106 +++++++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) create mode 100644 text/3348-c-str-literal.md diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md new file mode 100644 index 00000000000..faac2f5fa14 --- /dev/null +++ b/text/3348-c-str-literal.md @@ -0,0 +1,106 @@ +- Feature Name: `c_str_literal` +- Start Date: 2022-11-15 +- RFC PR: [rust-lang/rfcs#3348](https://github.com/rust-lang/rfcs/pull/3348) +- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000) + +# Summary +[summary]: #summary + +`c"…"` string literals. + +# Motivation +[motivation]: #motivation + +Looking at the [amount of `cstr!()` invocations just on GitHub](https://cs.github.com/?scopeName=All+repos&scope=&q=cstr%21+lang%3Arust) it seems like C string literals +are a widely used feature. Implementing `cstr!()` as a `macro_rules` or `proc_macro` requires non-trivial code to get it completely right (e.g. refusing embedded nul bytes), +and is still less flexible than it should be (e.g. in terms of accepted escape codes). + +In Rust 2021, we reserved prefixes for (string) literals, so let's make use of that. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +`c"abc"` is a [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). A nul byte (`b'\0'`) is appended to it in memory and the result is a `&CStr`. + +All escape codes and characters accepted by `""` and `b""` literals are accepted, except the nul byte (`\0`). +So, both UTF-8 and non-UTF-8 data can co-exist in a C string. E.g. `c"hello\x80我叫\u{1F980}"`. + +The raw string literal variant is prefixed with `cr`. For example, `cr"\"` and `r##"Hello "world"!"##`. (Just like `r""` and `br""`.) 
+ +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +Two new [string literal types](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings): `c"…"` and `cr#"…"#`. + +Accepted escape codes: [Quote](https://doc.rust-lang.org/reference/tokens.html#quote-escapes) & [Unicode](https://doc.rust-lang.org/reference/tokens.html#unicode-escapes) & [Byte](https://doc.rust-lang.org/reference/tokens.html#byte-escapes). + +Unicode characters are accepted and encoded as UTF-8. That is, `c"🦀"`, `c"\u{1F980}"` and `c"\xf0\x9f\xa6\x80"` are all accepted and equivalent. + +The type of the expression is [`&core::ffi::CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). So, the `CStr` type will have to become a lang item. + +Interactions with string related macros: + +- The [`concat` macro](https://doc.rust-lang.org/stable/std/macro.concat.html) will _not_ accept these literals, just like it doesn't accept byte string literals. +- The [`format_args` macro](https://doc.rust-lang.org/stable/std/macro.format_args.html) will _not_ accept such a literal as the format string, just like it doesn't accept a byte string literal. + +(This might change in the future. E.g. `format_args!(c"…")` would be cool, but that would require generalizing the macro and `fmt::Arguments` to work for other kinds of strings. (Ideally also for `b"…"`.)) + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +* No `c""` literal, but just a `cstr!()` macro. (Possibly as part of the standard library.) + + This requires [complicated machinery](https://github.com/rust-lang/rust/pull/101607/files) to implement correctly. + + The trivial implementation of using `concat!($s, "\0")` is problematic for several reasons, including non-string input and embedded nul bytes. + (The unstable `concat_bytes!()` solves some of the problems.) 
+ + The popular [`cstr` crate](https://crates.io/crates/cstr) is a proc macro to work around the limitations of a `macro_rules` implementation, but that also has many downsides. + + Even if we had the right language features for a trivial correct implementation, there are many code bases where C strings are the primary form of string, + making `cstr!("..")` syntax quite cumbersome. + +* Allowing only valid UTF-8 and unicode-oriented escape codes (like in `"…"`, e.g. `螃蟹` or `\u{1F980}` but not `\xff`). + + For regular string literals, we have this restriction because `&str` is required to be valid UTF-8. + However, C literals (and objects of our `&CStr` type) aren't necessarily valid UTF-8. + +* Allowing only ASCII characters rand byte-oriented escape codes (like in `b"…"`, e.g. `\xff` but not `螃蟹` or `\u{1F980}`). + + While C literals (and `&CStr`) aren't necessarily valid UTF-8, they often do contain UTF-8 data. + Refusing to put UTF-8 in it would make the feature less useful and would unnecessarily make it harder to use unicode in programs that mainly use C strings. + +* Having separate `c"…"` and `bc"…"` string literal prefixes for UTF-8 and non-UTF8. + + Both of those would be the same type (`&CStr`). Unless we add a special "always valid UTF-8 C string" type, there's not much use in separating them. + +* Use `z` instead of `c` (`z"…"`), for "zero terminated" instead of "C string". + + We already have a type called `CStr` for this, so `c` seems consistent. + +# Drawbacks +[drawbacks]: #drawbacks + +- The `CStr` type needs some work. `&CStr` is currently a wide pointer, but it's supposed to be a thin pointer. See https://doc.rust-lang.org/1.65.0/src/core/ffi/c_str.rs.html#87 + + It's not a blocker, but we might want to try to fix that before stabilizing `c"…"`. + +# Prior art +[prior-art]: #prior-art + +- NIM has `cstring"…"`. +- COBOL has `Z"…"`. +- Probably a lot more languages, but it's hard to search for. 
:) + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +- Should we make `&CStr` a thin pointer before stabilizing this? (If so, how?) +- Should the (unstable) [`concat_bytes` macro](https://github.com/rust-lang/rust/issues/87555) accept C string literals? (If so, should it evaluate to a C string or byte string?) + +# Future possibilities +[future-possibilities]: #future-possibilities + +- Make `concat!()` or `concat_bytes!()` work with `c"…"`. +- Make `format_args!(c"…")` (and `format_args!(b"…")`) work. +- Improve the `&CStr` type, and make it FFI safe. From df9bd285827d77563d5bb734cc6333db34c2a606 Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 14:37:33 +0100 Subject: [PATCH 02/10] =?UTF-8?q?Add=20c'=E2=80=A6'=20idea.?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- text/3348-c-str-literal.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index faac2f5fa14..dcde5564c29 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -78,6 +78,15 @@ Interactions with string related macros: We already have a type called `CStr` for this, so `c` seems consistent. +- Also add `c'…'` as [`c_char`](https://doc.rust-lang.org/stable/core/ffi/type.c_char.html) literal. + + It'd be identical to `b'…'`, except it'd be a `c_char` instead of `u8`. + + This would easily lead to unportable code, since `c_char` is `i8` or `u8` depending on the platform. (Not a wrapper type, but a direct type alias.) + E.g. `fn f(_: i8) {} f(c'a');` would compile only on some platforms. + + An alternative is to allow `c'…'` to implicitly be either a `u8` or `i8`. (Just like integer literals can implicitly become one of many types.) 
+ # Drawbacks [drawbacks]: #drawbacks @@ -95,7 +104,10 @@ Interactions with string related macros: # Unresolved questions [unresolved-questions]: #unresolved-questions +- Also add `c'…'` C character literals? (`u8`, `i8`, `c_char`, or something more flexible?) + - Should we make `&CStr` a thin pointer before stabilizing this? (If so, how?) + - Should the (unstable) [`concat_bytes` macro](https://github.com/rust-lang/rust/issues/87555) accept C string literals? (If so, should it evaluate to a C string or byte string?) # Future possibilities From a1306b66d653fdd2ea6cecaf3282c26b381a8504 Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 14:51:01 +0100 Subject: [PATCH 03/10] Update. --- text/3348-c-str-literal.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index dcde5564c29..ce6614dccea 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -11,7 +11,7 @@ # Motivation [motivation]: #motivation -Looking at the [amount of `cstr!()` invocations just on GitHub](https://cs.github.com/?scopeName=All+repos&scope=&q=cstr%21+lang%3Arust) it seems like C string literals +Looking at the [amount of `cstr!()` invocations just on GitHub](https://cs.github.com/?scopeName=All+repos&scope=&q=cstr%21+lang%3Arust) (about 3.2k files with matches) it seems like C string literals are a widely used feature. Implementing `cstr!()` as a `macro_rules` or `proc_macro` requires non-trivial code to get it completely right (e.g. refusing embedded nul bytes), and is still less flexible than it should be (e.g. in terms of accepted escape codes). @@ -25,7 +25,7 @@ In Rust 2021, we reserved prefixes for (string) literals, so let's make use of t All escape codes and characters accepted by `""` and `b""` literals are accepted, except the nul byte (`\0`). So, both UTF-8 and non-UTF-8 data can co-exist in a C string. E.g. `c"hello\x80我叫\u{1F980}"`. 
-The raw string literal variant is prefixed with `cr`. For example, `cr"\"` and `r##"Hello "world"!"##`. (Just like `r""` and `br""`.) +The raw string literal variant is prefixed with `cr`. For example, `cr"\"` and `cr##"Hello "world"!"##`. (Just like `r""` and `br""`.) # Reference-level explanation [reference-level-explanation]: #reference-level-explanation @@ -97,6 +97,7 @@ Interactions with string related macros: # Prior art [prior-art]: #prior-art +- C as C string literals (`"…"`). :) - NIM has `cstring"…"`. - COBOL has `Z"…"`. - Probably a lot more languages, but it's hard to search for. :) From 76518703c76225506f5f5c12d4fc9d7eccc8bb08 Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 15:07:32 +0100 Subject: [PATCH 04/10] Add implicit "" idea. --- text/3348-c-str-literal.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index ce6614dccea..7fc21a51078 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -60,6 +60,14 @@ Interactions with string related macros: Even if we had the right language features for a trivial correct implementation, there are many code bases where C strings are the primary form of string, making `cstr!("..")` syntax quite cumbersome. +- No `c""` literal, but make it possible for `""` to implicitly become a `&CStr` through magic. + + We already allow integer literals (e.g. `123`) to become one of many types, so perhaps we could do the same to string literals. + + (It could be a built-in fixed set of types (e.g. just `str`, `[u8]`, and `CStr`), + or it could be something extensible through something like a `const trait FromStringLiteral`. + Not sure how that would exactly work, but it sounds cool.) + * Allowing only valid UTF-8 and unicode-oriented escape codes (like in `"…"`, e.g. `螃蟹` or `\u{1F980}` but not `\xff`). For regular string literals, we have this restriction because `&str` is required to be valid UTF-8. 
From 5fa805631e7c25890ee09c62789a112e40a0bc7b Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 15:57:36 +0100 Subject: [PATCH 05/10] Update Co-authored-by: konsumlamm <44230978+konsumlamm@users.noreply.github.com> --- text/3348-c-str-literal.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index 7fc21a51078..5120cf02c17 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -106,7 +106,7 @@ Interactions with string related macros: [prior-art]: #prior-art - C as C string literals (`"…"`). :) -- NIM has `cstring"…"`. +- Nim has `cstring"…"`. - COBOL has `Z"…"`. - Probably a lot more languages, but it's hard to search for. :) From 534349c6889d34d09ecdc4b4fd9d1569dd5bc7fa Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 20:58:43 +0100 Subject: [PATCH 06/10] Fix typo. --- text/3348-c-str-literal.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index 5120cf02c17..09ec347ebed 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -105,7 +105,7 @@ Interactions with string related macros: # Prior art [prior-art]: #prior-art -- C as C string literals (`"…"`). :) +- C has C string literals (`"…"`). :) - Nim has `cstring"…"`. - COBOL has `Z"…"`. - Probably a lot more languages, but it's hard to search for. :) From b4ccc53a7a9aff92035b6e951738fa7f9b7e318c Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Tue, 15 Nov 2022 21:01:25 +0100 Subject: [PATCH 07/10] Clarify all nuls are disallowed. 
--- text/3348-c-str-literal.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index 09ec347ebed..407d9502564 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -22,7 +22,7 @@ In Rust 2021, we reserved prefixes for (string) literals, so let's make use of t `c"abc"` is a [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). A nul byte (`b'\0'`) is appended to it in memory and the result is a `&CStr`. -All escape codes and characters accepted by `""` and `b""` literals are accepted, except the nul byte (`\0`). +All escape codes and characters accepted by `""` and `b""` literals are accepted, except nul bytes. So, both UTF-8 and non-UTF-8 data can co-exist in a C string. E.g. `c"hello\x80我叫\u{1F980}"`. The raw string literal variant is prefixed with `cr`. For example, `cr"\"` and `cr##"Hello "world"!"##`. (Just like `r""` and `br""`.) @@ -34,6 +34,8 @@ Two new [string literal types](https://doc.rust-lang.org/reference/tokens.html#c Accepted escape codes: [Quote](https://doc.rust-lang.org/reference/tokens.html#quote-escapes) & [Unicode](https://doc.rust-lang.org/reference/tokens.html#unicode-escapes) & [Byte](https://doc.rust-lang.org/reference/tokens.html#byte-escapes). +Nul bytes are disallowed, whether as escape code or source character (e.g. `"\0"`, `"\x00"`, `"\u{0}"` or `"␀"`). + Unicode characters are accepted and encoded as UTF-8. That is, `c"🦀"`, `c"\u{1F980}"` and `c"\xf0\x9f\xa6\x80"` are all accepted and equivalent. The type of the expression is [`&core::ffi::CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). So, the `CStr` type will have to become a lang item. From f30e5babe2452320b88bd88988226e078f509fac Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Thu, 17 Nov 2022 17:48:30 +0100 Subject: [PATCH 08/10] Update! 
--- text/3348-c-str-literal.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index 407d9502564..b22bede4ac9 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -39,6 +39,7 @@ Nul bytes are disallowed, whether as escape code or source character (e.g. `"\0" Unicode characters are accepted and encoded as UTF-8. That is, `c"🦀"`, `c"\u{1F980}"` and `c"\xf0\x9f\xa6\x80"` are all accepted and equivalent. The type of the expression is [`&core::ffi::CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). So, the `CStr` type will have to become a lang item. +(`no_core` programs that don't use `c""` string literals won't need to define this lang item.) Interactions with string related macros: @@ -124,6 +125,12 @@ Interactions with string related macros: # Future possibilities [future-possibilities]: #future-possibilities +(These aren't necessarily all good ideas.) + - Make `concat!()` or `concat_bytes!()` work with `c"…"`. - Make `format_args!(c"…")` (and `format_args!(b"…")`) work. - Improve the `&CStr` type, and make it FFI safe. +- Accept unicode characters and escape codes in `b""` literals too: [RFC 3349](https://github.com/rust-lang/rfcs/pull/3349). +- More prefixes! `w""`, `os""`, `path""`, `utf16""`, `brokenutf16""`, `utf32""`, `wtf8""`, `ebcdic""`, … +- No more prefixes! Have `let a: &CStr = "…";` work through magic, removing the need for prefixes. + (That won't happen any time soon probably, so that shouldn't block `c"…"` now.) From 0056759434a8deb0cb392b95c83f6cc0645b0008 Mon Sep 17 00:00:00 2001 From: Mara Bos Date: Fri, 18 Nov 2022 11:33:05 +0100 Subject: [PATCH 09/10] Typo. 
Co-authored-by: teor --- text/3348-c-str-literal.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index b22bede4ac9..faf3786359a 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -76,7 +76,7 @@ Interactions with string related macros: For regular string literals, we have this restriction because `&str` is required to be valid UTF-8. However, C literals (and objects of our `&CStr` type) aren't necessarily valid UTF-8. -* Allowing only ASCII characters rand byte-oriented escape codes (like in `b"…"`, e.g. `\xff` but not `螃蟹` or `\u{1F980}`). +* Allowing only ASCII characters and byte-oriented escape codes (like in `b"…"`, e.g. `\xff` but not `螃蟹` or `\u{1F980}`). While C literals (and `&CStr`) aren't necessarily valid UTF-8, they often do contain UTF-8 data. Refusing to put UTF-8 in it would make the feature less useful and would unnecessarily make it harder to use unicode in programs that mainly use C strings. From 2196c96216b0b2d535eb1486d3239ad924b1899b Mon Sep 17 00:00:00 2001 From: Tyler Mandry Date: Wed, 14 Dec 2022 18:22:00 -0500 Subject: [PATCH 10/10] Add tracking issue link --- text/3348-c-str-literal.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/3348-c-str-literal.md b/text/3348-c-str-literal.md index faf3786359a..3e1278ad761 100644 --- a/text/3348-c-str-literal.md +++ b/text/3348-c-str-literal.md @@ -1,7 +1,7 @@ - Feature Name: `c_str_literal` - Start Date: 2022-11-15 - RFC PR: [rust-lang/rfcs#3348](https://github.com/rust-lang/rfcs/pull/3348) -- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000) +- Rust Issue: [rust-lang/rust#105723](https://github.com/rust-lang/rust/issues/105723) # Summary [summary]: #summary
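
Reviewer note, not part of the patches: the semantics described in the RFC text above (a nul byte appended in memory, Unicode characters encoded as UTF-8, nul bytes rejected) can be roughly modelled with the long-stable `CStr::from_bytes_with_nul` API. This is only a sketch under assumptions: the helper name `model_c_literal` is hypothetical, and the check runs at run time here, whereas the RFC has the compiler reject nul bytes in the literal itself.

```rust
use std::ffi::CStr;

/// Hypothetical helper (not from the RFC): models the bytes a `c"…"` literal
/// would denote by appending the terminating nul and validating the result.
fn model_c_literal(content: &[u8]) -> Option<Vec<u8>> {
    let mut bytes = content.to_vec();
    bytes.push(0); // the nul byte the RFC says is appended in memory
    // `from_bytes_with_nul` fails if a nul byte appears anywhere but the end,
    // mirroring the RFC's rule that nul bytes are disallowed in the literal.
    CStr::from_bytes_with_nul(&bytes).ok()?;
    Some(bytes)
}

fn main() {
    // Models c"abc": three content bytes plus the terminator.
    let abc = model_c_literal(b"abc").unwrap();
    let c_str = CStr::from_bytes_with_nul(&abc).unwrap();
    assert_eq!(c_str.to_bytes(), b"abc");
    assert_eq!(c_str.to_bytes_with_nul(), b"abc\0");

    // The RFC describes c"\u{1F980}" and c"\xf0\x9f\xa6\x80" as equivalent,
    // because Unicode characters are encoded as UTF-8.
    let crab_unicode = model_c_literal("\u{1F980}".as_bytes()).unwrap();
    let crab_bytes = model_c_literal(b"\xf0\x9f\xa6\x80").unwrap();
    assert_eq!(crab_unicode, crab_bytes);

    // An embedded nul is rejected (at run time in this model).
    assert!(model_c_literal(b"a\0b").is_none());
}
```

The model deliberately avoids the `c"…"` syntax itself so that it compiles regardless of edition or toolchain support for the feature.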