Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Unicode XID related functionalities to proc_macro crate #540

Open
crlf0710 opened this issue Feb 15, 2025 · 6 comments
Open

Add Unicode XID related functionalities to proc_macro crate #540

crlf0710 opened this issue Feb 15, 2025 · 6 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@crlf0710
Copy link
Member

Proposal

Problem statement

When generating code in proc macros, sometimes one need to generate identifiers from user-provided strings. It's necessary to validate the validity of such strings to judge whether they can actually be used as identifiers.

One can use community-provided crates to do such, but they can easily go out of sync with later rustc updates, and needs to bump to new releases for updating such dependencies.

Motivating examples or use cases

Solution sketch

I'd propose adding new trait and implement on char within proc_macro crate.

Alternatives

  • Add such methods on char as inherent methods. (But might not be worthwhile, because normal rust programs rarely needs to use xid-related checks.

  • Leave things as is

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@crlf0710 crlf0710 added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Feb 15, 2025
@hanna-kruppe
Copy link

If the intent is to match what Rust considers a valid identifier, the XID start/continue functions may be a bit too low-level. They’re not quite enough on their own to ensure Ident::new{,_raw} won’t panic, due to special cases like bare _ and a few keywords that can’t be used as raw identifiers. Individual proc macro crates (or shared helper libraries) can hard-code the current exceptions but that can also become out of date. Conversely, it’s conceivable (though unlikely IMO) that the identifier lexme gets more liberal than the current “(XID_Start | _) XID_Continue* minus a few exceptions”.

Can you give some examples of how, exactly, proc macros would use this new API?

@cuviper
Copy link
Member

cuviper commented Feb 15, 2025

Maybe something like Ident::try_new (and _raw) would be better?

@hanna-kruppe
Copy link

hanna-kruppe commented Feb 15, 2025

On the other hand, if we forget about proc macros for a while, I wouldn’t mind the standard library providing the character classes. The most downloaded reverse dependencies of unicode-ident are mostly proc macro related but there’s a also a whole lot of crates implementing a non-Rust language that want the XID character classes for their respective identifier grammar.

These crates don’t need to match the unicode version used by rustc, but if they use the standard library for other unicode character classes then they already need to recompile with a new Rust release to fully pick up new Unicode versions. It doesn’t need to be in core but if the sysroot contains a copy of the data anyway (and not just for target with host tools, once rust-lang/rust#130856 is implemented) then it might as well live in the place where it can be reused the most.

@crlf0710
Copy link
Member Author

Ident::try_new is of course useful, but as these functions are inherently parsing-related i still think there're cases that providing the low level functions alongside might serve more flexibility, .

I did suggest adding these functions to proc_macro crate upon the expectation that rust-lang/rust#130856 will get implemented though.

Can you give some examples of how, exactly, proc macros would use this new API?

I won't propose anything here, instead i'll just show a few examples i collected from crates.io.

Case 1: rocket-codegen

The code here https://github.com/rwf2/Rocket/blob/28891e8072136f4641a33fb8c3f2aafce9d88d5b/core/codegen/src/attribute/param/parse.rs#L242C4-L248
validates whether a string can be used as an Rust identifier, it doesn't build the Ident immediately, instead it collects the string into its internal datastructure for later use.

Case 2: derive-more

The code here
https://github.com/JelteF/derive_more/blob/a78d8ee41dc4c5fd73572b65dee4cc1369a4c6f1/impl/src/fmt/parsing.rs#L516-L527
parses identifier syntax, with a hand-rolled nom-like parser combinator

Case 3: bunt-macros

The code here
https://github.com/LukasKalbertodt/bunt/blob/master/macros/src/parse/fmt.rs#L351-L352
parses the std::fmt-style format string, which might include identifiers referring to arguments


My original motivation was just exempting crates like this each year to bumping xid crates dependencies on new Unicode releases, following rust language toolchain's update. Instead they could just use standard libraries provided data. Maybe the motivation wasn't strong enough to put these into core, but i think at least it won't do harm to include those in proc_macro crate.


For implementing other-languages, those language itself often have its own pace of following the Unicode releases, actually the community-crate based approach might serve them better. For example, unicode-normalization has more than two dozen releases over a few years, if a language uses, say, Unicode 8.0.0, it will use one specific version of crate unicode-normalization for their specific usage.

@hanna-kruppe
Copy link

hanna-kruppe commented Feb 16, 2025

Thank you for the examples. I've looked a little bit into each and how they diverge from what Ident accepts:

  • In Rocket, macros like #[get("hello/<name>")] primarily want to match the identifier written in the parameter list of the function being annotated. There are other decorations around the identifier but these are removed before even looking at the alleged identifier, so it probably could use Ident::try_new. The actual identifier parsing is oversimplified, e.g., it does not accept raw identifiers. Keywords are accepted by the macro's own parsing but since they must also occur in the function (which is passed through to rustc) this is probably harmless.
  • In bunt, the format string parsing is subtly incorrect. In some places identifiers are not validated at all, although named arguments in format strings are subtly different from Idents (format!("{crate}", crate = 1) and format!("{self}") are valid). In other places the checks are too strict, e.g., bunt::println!("{x:_width$}", x = 1, _width = 10) is rejected because the line you linked forgot that _ is a valid identifier start but not in XID_Start.
  • Like bunt, the identifier parsing in derive-more seems to be focused on format strings. I haven't poked around its behavior as much as with bunt. It seems to be more consistent with what it considers a valid identifier.

The most surprising thing I've learned now is that fmt string parsing can't quite use Ident even if Ident gained a method suitable for incremental parsing, because format!() is more liberal about its argument names.

@hanna-kruppe
Copy link

hanna-kruppe commented Feb 16, 2025

For implementing other-languages, those language itself often have its own pace of following the Unicode releases, actually the community-crate based approach might serve them better. For example, unicode-normalization has more than two dozen releases over a few years, if a language uses, say, Unicode 8.0.0, it will use one specific version of crate unicode-normalization for their specific usage.

I don't think most dependents of unicode-ident / unicode-xid or similar crates are deliberately pinning a specific Unicode version, or even paying attention whether the various Unicode-related things they depend on (including std) agree on which Unicode version they use. Even a very mature, well-specified language like Java nowadays just says

The Java SE platform tracks the Unicode Standard as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.

Especially for the purpose of defining "what's an identifier" making it precise is not very important because you don't get users filing issues like "I wanted to use U+10940 SIDETIC LETTER N01 in my variable names, when will you update to Unicode 17?" — nor will anyone complain when new code points are eventually accepted as part of identifiers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

3 participants