Idea: number suffixes as annotations #510

Open
zkat opened this issue Feb 24, 2025 · 45 comments
Labels
enhancement New feature or request

Comments

@zkat
Member

zkat commented Feb 24, 2025

There was a discussion over at YaLTeR/niri#1142 (comment) when trying to come up with enhancements to Niri's DSL regarding numerical stuff. The tl;dr is that there's interest in being able to type things like 100px and 50%, and not having to quote it.

I thought about this a bit, and I would like to propose the following change to the number grammar to facilitate this. This change would be forward compatible: existing v2 parsers would error (instead of returning incorrect data).

The change is as follows:

  1. any valid number may be suffixed, with no spaces, by a string (of any kind, but most likely an unquoted string). If this is done, the suffix will be used as the type annotation for that number value. That is, 100px would be equivalent to (px)100.
  2. If you have BOTH a parenthetical annotation, AND a suffix, it will be treated as a syntax error. You simply cannot do (px)100%

One catch: a very narrow range of invalid v2 documents will now become valid (since foo 100px will no longer be an error)
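
For illustration, the intended equivalence can be sketched in Python. This is a hypothetical helper, not any KDL library's API: the name `parse_value` and its regex are made up for this example, it only understands plain decimal numbers, and it treats any run of non-space characters after the number as the suffix.

```python
import re

# Hypothetical pattern for this sketch only: optional (tag) prefix,
# a plain decimal number, and an optional bare suffix.
_VALUE = re.compile(
    r"^(?:\((?P<prefix>[^()\s]+)\))?"
    r"(?P<num>[+-]?\d[\d_]*(?:\.\d[\d_]*)?)"
    r"(?P<suffix>[^\s\d()][^\s()]*)?$"
)

def parse_value(text):
    """Return (tag, number), or raise ValueError."""
    m = _VALUE.match(text)
    if m is None:
        raise ValueError(f"not a number: {text!r}")
    prefix, suffix = m.group("prefix"), m.group("suffix")
    if prefix and suffix:
        # Point 2 of the proposal: (px)100% is a syntax error.
        raise ValueError("cannot combine a (tag) prefix with a suffix")
    num = m.group("num").replace("_", "")
    value = float(num) if "." in num else int(num)
    return (prefix or suffix, value)
```

With this sketch, `parse_value("100px")` and `parse_value("(px)100")` return the same pair, and `parse_value("(px)100%")` raises.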

Thoughts?

@zkat zkat added the enhancement New feature or request label Feb 24, 2025
@tabatkins
Contributor

I think it's an incredible idea, but as a CSS author, I'm a little biased. ^_^

Seriously, tho, I think it really is a very good idea. It's often the case that you don't need to annotate your numbers, but there are plenty of cases where numbers are heavily annotated, and this would be a huge boon to those. I think the wide-spread basic familiarity with CSS makes this a pretty natural thing to read and understand (and useful for CSS interop), and while you have to make the leap to "the unit is the tag", that's also really the only logical interpretation for it.

One catch: a very narrow range of invalid v2 documents will now become valid (since foo 100px will no longer be an error)

This is true of literally any change that's not strictly tightening the parsing; it's perfectly fine unless we have reason to believe that's a common error. KDL's lack of error recovery makes this more or less a non-issue, imo.

@bgotink
Member

bgotink commented Feb 24, 2025

I assume this wouldn't apply to hex numbers? I don't think 0xffpx would be a great idea.

@tabatkins
Contributor

I assume so, yeah. Just decimals.

@zkat
Member Author

zkat commented Feb 24, 2025

@tabatkins i was hoping you would chime in! I had a feeling it would be either "perfect, no notes" or "beware! Run! Here be dragons!"

I'm still curious whether you think there are any gotchas about this based on your experience, but it doesn't sound like it.

And yeah: originally I was gonna say these should parse as strings, but having them be numbers-with-special-type-syntax just clicked the moment I thought of it.

I also thought of whether we might want this kind of thing for any other types and I think the only other types that can even parse this way would be quoted/raw strings. I am less interested in this, but enabling this for them would enable e.g. C++ style string suffix syntax, like "foo"sv for string_view. Potentially useful for some classes of DSL. We could just toss this in for free.

@zkat
Member Author

zkat commented Feb 24, 2025

@bgotink ooh, good point. I hadn't thought that far.

@JohnDowson

JohnDowson commented Feb 24, 2025

@bgotink

I assume this wouldn't apply to hex numbers? I don't think 0xffpx would be a great idea.

Apparently that's a perfectly cromulent way to spell numbers according to Rust+Clippy:
Image

@sudoBash418

I assume this wouldn't apply to hex numbers? I don't think 0xffpx would be a great idea.

Counterpoint: having 0x0123u16 work would be neat.

@ikeycode

For disks-rs provisioning, having storage units (gb, gib) and constraints (50%) would actually be very nice, rather than the currently floated internal proposal of (GB)10.

@nleanba

nleanba commented Feb 24, 2025

I think this would be very useful. I would suggest allowing a _ between number and suffix, to allow for better readability if desired. (e.g. 0xdeadbeefusize vs 0xdeadbeef_usize)

One small problem I see is the ambiguity between hex digits a-f and a string suffix. It might be easiest to disallow an unquoted suffix containing a-f for hex numbers. (I'm assuming suffixes containing 0-9 need to be quoted too.) [edit: although allowing 123u16 would also be good, I think]

(123dx is clear, but 0x123dx is not, also 0x might be problematic to allow)

@tabatkins
Contributor

Apparently that's a perfectly cromulent way to spell numbers according to Rust+Clippy:

Hm, I suppose so. It's just such a footgun, especially if your hex number happens to only contain digits.

I would suggest allowing a _ between number and suffix, to allow for better readability if desired. (e.g. 0xdeadbeefusize vs 0xdeadbeef_usize)

That would already be allowed implicitly; _ is allowed in the digits of all numeric syntaxes, and nothing prevents it from being the final character either.

Maybe, in fact, this is required for the non-decimal numeric syntaxes - you must have a _ separating the numeric part from the unit part. Decimals can flow straight into the unit without one.

also 0x might be problematic to allow

Yeah, we'd disallow 0b, 0o, and 0x, definitely, same as we disallow true as an ident string, to avoid confusion.

@nleanba

nleanba commented Feb 24, 2025

That would already be allowed implicitly; _ is allowed in the digits of all numeric syntaxes, and nothing prevents it from being the final character either.

The specification is not entirely clear on this, it only ever uses the phrasing "which may be separated by _", which should maybe be made more explicit to allow or disallow them.

Edit: the formal grammar does allow a trailing, but not an initial _, which is sensible. To potentially remove footguns, I think 0x_a, 0x_ and such should also be disallowed (they are currently invalid numbers, and parsing it as (x_a)0/(x_)0 is probably not sensible)


I think requiring a _ between number and suffix for non-decimals might be the best option. It also prevents silent errors in case of typos (0xdesdbeef = (sdbeef)0xde is probably not what was intended, and would be an error if _ is required).

@tabatkins
Contributor

The grammar, at least, is clear:

decimal := sign? integer ('.' integer)? exponent?
exponent := ('e' | 'E') sign? integer
integer := digit (digit | '_')*
digit := [0-9]
sign := '+' | '-'

hex := sign? '0x' hex-digit (hex-digit | '_')*
octal := sign? '0o' [0-7] [0-7_]*
binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*

This allows 1________2__ as a valid way to spell 12. In the absence of any specific wording in the Number section of the spec, I'm assuming that "may be separated by _" is just a bit of descriptive text and does not imply any further restrictions. (And my impl also makes that assumption and doesn't fail any tests...)
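
As a rough check of that reading, the grammar can be transcribed into Python regexes. This is only a sketch: the names `parse_number`, `DECIMAL`, `HEX`, `OCTAL` and `BINARY` are assumptions made for this example, and Python's `int`/`float` stand in for KDL number values.

```python
import re

INTEGER = r"[0-9][0-9_]*"
DECIMAL = re.compile(rf"[+-]?{INTEGER}(?:\.{INTEGER})?(?:[eE][+-]?{INTEGER})?")
HEX = re.compile(r"[+-]?0x[0-9a-fA-F][0-9a-fA-F_]*")
OCTAL = re.compile(r"[+-]?0o[0-7][0-7_]*")
BINARY = re.compile(r"[+-]?0b[01][01_]*")

def parse_number(text):
    """Parse text per the grammar above; return int or float, else raise ValueError."""
    digits = text.replace("_", "")  # underscores carry no value
    for pattern, base in ((HEX, 16), (OCTAL, 8), (BINARY, 2)):
        if pattern.fullmatch(text):
            return int(digits, base)
    if DECIMAL.fullmatch(text):
        is_float = any(c in digits for c in ".eE")
        return float(digits) if is_float else int(digits)
    raise ValueError(f"not a number: {text!r}")
```

Under this transcription `1________2__` parses as 12, while `_12`, `0x` and `0x_a` are all rejected.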

Also _ cannot be allowed as the initial character

Already taken care of by the grammar, you have to start with an actual digit.

It also prevents silent errors in case of typos (0xdesdbeef = (sdbeef)0xde is probably not what was intended, and would be an error if _ is required)

Yes, that is one of the footgun fears I have about allowing it unreservedly. It is better to have the kdl document fail to parse than unexpectedly misparse, when possible. So yeah, I'm leaning towards the _ separator.

And since we already have to have an (implicit) restriction that the unit can't start with a _ (because 123_ is already a valid value, so 123_px should parse as (px)123_), this works nicely overall.

I suppose in theory we only need it for hex numbers, while binary/octal could just take suffixes directly, but I think the advantage of drawing a consistent boundary around "the prefixed numeric syntaxes need a _ before a suffix" is worth it.
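
A minimal sketch of the "_ before a suffix" idea for hex numbers, assuming the hex digits are matched greedily first. The function name `parse_hex_with_suffix` is hypothetical, and the sketch does not check that the suffix is a valid ident string.

```python
import re

HEX_PREFIX = re.compile(r"[+-]?0x[0-9a-fA-F][0-9a-fA-F_]*")

def parse_hex_with_suffix(text):
    """Return (tag, value); tag is None when there is no suffix."""
    m = HEX_PREFIX.match(text)
    if m is None:
        raise ValueError(f"not a hex number: {text!r}")
    number, suffix = m.group(), text[m.end():]
    if suffix and not number.endswith("_"):
        # e.g. 0xdesdbeef: the typo must not silently become (sdbeef)0xde
        raise ValueError(f"suffix {suffix!r} needs a '_' separator")
    return (suffix or None, int(number.replace("_", ""), 16))
```

Here `0xdeadbeef_usize` yields the tag `usize`, while `0xdesdbeef` and `0xffpx` fail to parse instead of misparsing.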

@tabatkins
Contributor

I also thought of whether we might want this kind of thing for any other types and I think the only other types that can even parse this way would be quoted/raw strings. I am less interested in this, but enabling this for them would enable e.g. C++ style string suffix syntax, like "foo"sv for string_view. Potentially useful for some classes of DSL. We could just toss this in for free.

I'm less sanguine about this, especially with the general string syntax for the suffix. It means "foo""bar" is valid and equivalent to (bar)"foo", and that feels much more likely to be intended as two strings that just accidentally lost their separating space. (You could even write a raw multiline string as the suffix...)

In fact, I'm a bit loath to allow 123"foo" for the same reason. I think if you're using a "unit" that can't be written as an ident string, you're already hanging onto enough extra syntax that wrapping it in a () as well is fine, and clearer overall. I'd be happier to restrict it to just ident strings as suffixes.

@zkat
Member Author

zkat commented Feb 25, 2025

Ok so to summarize:

  1. Decimals need no separator before the suffix
  2. Prefixed number types require a _
  3. Only unquoted syntax is allowed
  4. Unsure about strings, BUT if we do strings, it MUST be unquoted-suffix-only

Did I miss anything?

p.s. the only reason Rust allows these suffixes for binders is because the number of symbols is limited and well defined. There is no way for any of the number suffixes in Rust to conflict (for example, Rust does not have literal hex float syntax, hence the f32/64 suffixes are not a problem)

@JohnDowson

JohnDowson commented Feb 25, 2025

2. Prefixed number types require a _

I think it'd be in the spirit of allowing - in identifiers to also allow 0xdeadbeef-px as an alternative to 0xdeadbeef_px. Hopefully, you aren't planning on adding math expressions any time soon :)

@zkat
Member Author

zkat commented Feb 25, 2025

The point of the _ prefix specifically is that it is already parsed as valid number syntax. Furthermore, that's the precedent already set by existing languages that allow these kinds of suffixes (123_usize is legal. 123-usize is not.)

I believe the value of allowing - as a separator should be significant, and I'm not sure it meets that bar for me right now. Happy to hear from others, though.

@jaxter184

jaxter184 commented Feb 25, 2025

Overall, I think these changes would be really great! But a few things that come to mind:

  • 10,000 is interpreted as (,000)10 since ,000 is a valid bare string identifier
    • Probably unlikely that anyone will do this, even by accident. I don't know any programming or configuration language syntax that allows commas in numbers
  • Same with any float that accidentally uses a , instead of a ., which might be a reasonable typo for folks who are from places where that's the norm (Italy? France?).
  • Implementations mistakenly "simplifying" (.2)1 into 1.2
    • imo anyone who writes (.2)1 deserves whatever's coming to them
  • Bare semver strings like 1.2.3-tag are now valid KDL (equivalent to (.3-tag)1.2)

I don't think any of these are showstoppers, but maybe someone's got a clever solution that resolves them?

Also, what's the use case for allowing this syntax on strings? I understand the number case because units are really common, but is something like "2024-02-24"date really any better than (date)"2024-02-24"?

@mwh

mwh commented Feb 25, 2025

0x1_face or 0x1_f64 are currently valid numbers and I assume have to remain that way, but 0x1_faces or 0x1_ft would mean (faces)0x1 and (ft)0x1 as I understand it under this model. It's a bit uncomfortable how the presence of a non-[a-f] character somewhere changes the semantics so dramatically. Alternatively, tags that happen to start with hex digits can't be written that way, which is an uncomfortable inconsistency, but probably better. This is a weak but insufficient reason in favour of - also.

@tjol
Contributor

tjol commented Feb 25, 2025

It's probably best to ban suffixes from starting with . and , because of the edge-cases @jaxter184 mentioned.

What about e? I suppose 1e2 = 100.0 ; 1m2 = (m2)1 ; 1em = (em)1 and 1e2m = (m)100.0, but that's a bit messy

@tabatkins
Contributor

Ooof, yeah, all those cases (from @jaxter184, @mwh, @tjol) concern me. We could add more restrictions to the syntax of valid suffixes, but that's making the feature more complex than I think is warranted.

I think, as @mwh says, this is a reasonable argument towards using a different separator, with - being a real possibility. Playing with it a bit, I think it would look okay even on decimal, which solves all the issues.

  • 10,000 remains invalid (and any decimal number accidentally using , rather than .)
  • (.2)1 simplifies to 1-.2, which is no longer confusable.
  • 1.2.3-tag remains invalid. (A simple 1.2-tag becomes valid and equivalent to (tag)1.2, but that's not a normal semver)
  • 0x1_face remains a valid hex number, 0x1_faces remains an invalid hex number. You'd have to write the latter as 0x1-faces or 0x1_face-s, depending on what you actually meant.
  • 1e2 remains a valid decimal number, 1em and 1e2m remain invalid - they'd have to be written as 1-em and either 1-e2m or 1e2-m, depending on what you actually meant.
  • 0x remains invalid, as do 0x_ and 0x_a.

I think, in general, this looks okay. It's an easy character to type (no Shift required), already indicates a minor separation semantically, and is low-impact visually but still distinct. Playing with some of the motivating examples from the OP's links:

preset-column-widths {
    - 33-%
    - 50-%
    - 66-%
    - 1000-px
}
main-size 20-px
main-size tab=10-%

That looks okay? Not quite as good as examples without the -, but okay.


Just messing around a bit more, we already have an existing generic metacharacter in #. This is visually more impactful and requires a Shift, but let's just check out how it looks:

preset-column-widths {
    - 33#%
    - 50#%
    - 66#%
    - 1000#px
}
main-size 20#px
main-size tab=10#%
mixed #"string"# #true 5#px

I'm a little less happy with that, but it's not terrible, and it's visually consistent with several other things in the language. I don't think this sort of consistency is necessary, but it is nice. Hm.

@JohnDowson

preset-column-widths {
    - 33#%
    - 50#%
    - 66#%
    - 1000#px
}
main-size 20#px
main-size tab=10#%
mixed #"string"# #true 5#px

The combination of a # with non-alnum symbols is hard to read. Also, # might be way harder to reach on non-English keyboard layouts. For example, the standard Russian layout doesn't have a # at all.

@mwh

mwh commented Feb 25, 2025

It's unfortunate that 1e2 is a valid number, because it rules out the simple version being used even for decimal. When the separator is only for "unusual" number formats it's ok, but the motivating examples of 50% and 100px are barely worthwhile as 50-% over (%)50.

Would it be reasonable to set out:

  1. Numeric literals use maximal munch, including any valid exponential suffix, and then the remainder of the token after that makes a postfix type annotation, which may have a separator prefix.
  2. Numeric literals that already contain letters cannot use unseparated postfix types. This includes both exponentials and all alternative bases.
  3. Types that are initially shaped like an exponential suffix, i.e. ^[eE][-+]?[0-9].*, will require a separator.
  4. Types starting with ., -, or , also always require a separator.
  5. Otherwise, no separator is needed for postfix tags on decimal literals.

"Require a separator" could also be "can only be written as prefix (type)value".

This is still a slightly uncomfortable special case, but much less liable to slip through than hex digits are, and the distinguishing element is always immediately at the beginning of the type tag, analogous to the existing prohibition on .1 as an identifier. Realistically, I don't expect types that look like any of the prohibited cases to arise naturally. I think (2) is not strictly necessary other than for hex, but may be a good idea anyway. It means no 1e2px, -2.5e-1em, 0x32%.

1em would be fine as there is no valid exponential in it, and it is unambiguously (em)1. 50% is (%)50. 1e2m is not permitted. 10,000 is invalid. (e2)1 cannot be rewritten to 1e2. Other bases would still use the separator all the time, or at least hex would. In practice, I think this is fine, but there are pathological cases that can be constructed.


Using - as a separator allows cases like -1e-1---1 for (--1)-0.1. A symbol that is not an identifier character, like #, may be better. It is possible that the separator could be available only to the non-decimal bases, which eliminates some of the worst cases.

@tabatkins
Contributor

tabatkins commented Feb 25, 2025

Also, # might be way harder to reach on non-english keyboard layouts. For example, the standard Russian layout doesn't have a # at all.

That's already an issue for anyone using KDL, as without easy access to # you can't write KDL's keywords or raw strings. Similarly, # is heavily used in CSS (and now JS, with private fields), so anyone doing web programming is, presumably, already doing something to fix that for themselves.

"Require a separator" could also be "can only be written as prefix (type)value".

Yeah, if the separator isn't required more or less at all times, I'd prefer to not have it at all and just require a normal tag.

Would it be reasonable to set out: [rules]

Assuming we don't use a separator, this boils down to:

  1. Only decimal numbers without exponential syntax can use suffixes.
  2. Suffixes can't start with . or ,.
  3. Suffixes can't be confusable with exponential syntax (can't start with e\d or e-\d).

If we then allow a separator, ideally # as it's already our metacharacter, then any number syntax can have a suffix and no rules are needed for the suffix's contents beyond "be an ident string".

For the vast majority of cases you'll encounter, this means you can just drop a suffix in and it'll be fine. 100%, 1.2em, etc. All perfectly unambiguous. All the troublesome parses from before are still disallowed. And it still allows 0x1234_face#u16, 1e2#em, 1#.2, etc.

I think this is a pretty good set of rules.

@JohnDowson

JohnDowson commented Feb 25, 2025

  1. Numeric literals use maximal munch, including any valid exponential suffix, and then the remainder of the token after that makes a postfix type annotation, which may have a separator prefix.
    1e2m is not permitted.

I don't really see how that follows? I'd imagine that 1e2m would unambiguously parse as (m)1e2 under maximum munch?

@tabatkins
Contributor

Their rule 2 prevents it from being valid, however. (I rephrased it a bit to make that more explicit in my version.)

@zkat
Member Author

zkat commented Feb 26, 2025

I appreciate all the noodling y'all are doing around getting this to make sense. I appreciate that this is valuable enough that it might be worth a "bit" of complexity.

I will point out that even if we have complexity around certain very unlikely corner cases, as long as we find good ways to make sure those corner cases can become errors when you "hold it wrong", we can get away with a bit of complexity around the rules.

As long as we retain a boundary of "strongly avoid accidental bad data", and that "intuitive" cases just work, I think a few caveats are ok.

Another point to make: this is, as far as I'm concerned, a strongly DSL-oriented feature. Most "just data" KDL documents are unlikely to use this feature. And DSLs that decide to use it will most likely have a very limited set of them, which they will have further opportunity to verify works well with the intended data those suffixes will be attached to. Not just that, but since their suffixes are likely to be limited, "nonsensical" type annotations derived from them will have a further opportunity to be caught as errors (even if we DID allow 1e2m -> (m)100.0, the application will also go like "what the hell is m?")

@tabatkins
Contributor

tabatkins commented Feb 26, 2025

So to be more precise, my proposal for suffixes is now:

  1. Suffixes are an alternative way to spell a tag on numbers. They're written as either 123ident or 123#ident, where ident is any ident string, and are equivalent to (ident)123.
  2. If written without the # separator (a "bare" suffix), there are some additional restrictions:
    • Bare suffixes can only be used on decimal-syntax numbers not using the exponential syntax. (So no binary/octal/hex, or decimal with exponential.)
    • Bare suffixes can't start with ., ,, _, or with e followed by a digit, or with e followed with + or - followed by a digit. (So no masquerading as parts of decimal syntax.)
  3. If written with the # separator, any ident string is allowed, on any numeric syntax.

Valid examples:

  • 10% ((%)10)
  • 2.5em ((em)2.5)
  • 1_234_em ((em)1234)
  • 1e2#px ((px)1e2)
  • 0x1a2b#u16 ((u16)0x1a2b)

Invalid examples:

  • 10.1.2
  • 10,000
  • 1e2e3
  • 0x1a2bu16
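
To sanity-check the examples above, here is a minimal Python sketch of these rules. Everything named here (`parse_suffixed`, `IDENT`, `BAD_BARE_START`) is an assumption of the sketch, not an existing parser. In particular `IDENT` is a simplified stand-in for KDL's ident-string rules: it ignores keywords and disallowed code points.

```python
import re

INT = r"[0-9][0-9_]*"
PLAIN_DECIMAL = re.compile(rf"[+-]?{INT}(?:\.{INT})?")
EXP_DECIMAL = re.compile(rf"[+-]?{INT}(?:\.{INT})?[eE][+-]?{INT}")
PREFIXED = re.compile(r"[+-]?0(?:x[0-9a-fA-F][0-9a-fA-F_]*|o[0-7][0-7_]*|b[01][01_]*)")
# Simplified ident string: no reserved punctuation, must not look like a number.
IDENT = re.compile(r'(?![+-]?\.?[0-9])[^\s(){}\[\]/\\"#;=]+')
# Rule 2: starts that a bare suffix may not have.
BAD_BARE_START = re.compile(r"[.,_]|e[+-]?[0-9]")

def parse_suffixed(text):
    """Return (tag, value) under the rules above, or raise ValueError."""
    # Try the most specific syntax first so the number is matched greedily.
    for kind, pattern in (("prefixed", PREFIXED), ("exp", EXP_DECIMAL), ("plain", PLAIN_DECIMAL)):
        m = pattern.match(text)
        if m:
            break
    else:
        raise ValueError(f"not a number: {text!r}")
    digits, rest = m.group().replace("_", ""), text[m.end():]
    if kind == "prefixed":
        value = int(digits, 0)  # base 0 lets Python read the 0x/0o/0b prefix
    elif kind == "exp" or "." in digits:
        value = float(digits)
    else:
        value = int(digits)
    if not rest:
        return (None, value)
    explicit = rest.startswith("#")
    ident = rest[1:] if explicit else rest
    if not IDENT.fullmatch(ident):
        raise ValueError(f"invalid suffix: {rest!r}")
    if not explicit and (kind != "plain" or BAD_BARE_START.match(ident)):
        raise ValueError(f"bare suffix not allowed here: {rest!r}")
    return (ident, value)
```

In this sketch the five valid examples produce the listed tags, and the four invalid examples all raise `ValueError`.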

@zkat
Member Author

zkat commented Feb 26, 2025

This seems good.

2b raises some questions for me about whether there's other "high conflict likelihood" syntax other than , that we've overlooked.

@tabatkins
Contributor

Ah, now that you mention that, obviously I need to restrict an initial _ too. ^_^

@zkat
Member Author

zkat commented Feb 27, 2025

So it sounds like we've walked through the main concerns. Would anyone like to PR this?

I might shop this around a bit in the meantime and see if anyone else spots a fatal flaw.

I'm a bit sad that the feature is a little wrinkly: I do like the elegance of things fitting in really really nicely and cleanly, but also the sheer value of having the "happy path" for this feature makes the slight messiness of the rules around it feel ok? Idk what do y'all think?

@bgotink
Member

bgotink commented Feb 27, 2025

@tabatkins wouldn't 10.1#.2 be invalid because .2 is not a valid identifier?

Earlier you also mentioned that 0b, 0o and 0x should be forbidden. Does that still apply?

I think it's worth noting that KDL forbids keywords without # as identifiers, so this syntax automatically forbids 100#nan for example, making sure that missing a space between two values is still guaranteed to be an error even with this change.

@zkat I think this feature makes a lot of sense. 100ms is a lot more intuitive to write than (ms)100 and supporting this gives KDL file formats the power to provide a clean answer to the age-old question "wait, what unit of time is this timeout/delay/duration configured in?".
The limitation for hex numbers is unfortunate but easily solved with #, the limitations around the allowed idents are narrow enough that I don't think many people will encounter them.

@tabatkins
Contributor

tabatkins commented Feb 27, 2025

wouldn't 10.1#.2 be invalid because .2 is not a valid identifier?

lol, indeed, that was just bad on my part. I'll fix.

Earlier you also mentioned that 0b, 0o and 0x should be forbidden. Does that still apply?

I feel like these are the least troublesome cases; they'd still be allowed under my rules and I think that's okay. They're clearly invalid if thinking of binary/octal/hex numbers, and if you're not thinking of those syntaxes, it's clearly a 0 with a suffix of b. So that's clear.

But, hm, a unit like b1 is more troublesome, since it'll be misinterpreted on parse if used on a 0. But only when used on a 0; all other values are unambiguous. Similarly, a value like 0xface would misparse. (If we're applying greedy parsing, that luckily prevents 0x1234abcdg from reparsing into a 0 with a suffix of x1234abcdg; instead it'll parse as a hex value with a bare g suffix, which is invalid.)

If we wanted to prevent this misparse, I think the rule would have to be:

  • If the unit starts with b, o, or x, and also contains a valid binary/octal/hex digit (respectively), it must additionally contain a character that's not a valid digit or a _.

So this would declare 5b0 to be invalid, and 5xface, etc, because 0b0/0xface/etc would misparse. But it would allow 5b0x, because 0b0x is a parse error (assuming we're careful to phrase the parsing as greedy) rather than a misparse. (So we'd probably put a note in the spec warning against creating b/o/x units whose second character is a valid digit, but not disallow it.) 5b_ is valid, but 5b_0 is invalid.
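
The rule can be written down as a small predicate. This is a sketch only: the name `violates_box_rule` and the `DIGITS` table are made up for the example, and the function looks at the suffix alone, not at the number it is attached to.

```python
# Valid digits for each prefixed syntax, keyed by the suffix's first letter.
DIGITS = {"b": "01", "o": "01234567", "x": "0123456789abcdefABCDEF"}

def violates_box_rule(suffix):
    """True if the suffix is rejected by the b/o/x rule above."""
    digits = DIGITS.get(suffix[:1])
    if digits is None:
        return False  # rule only concerns suffixes starting with b, o or x
    rest = suffix[1:]
    has_digit = any(c in digits for c in rest)
    only_digits = all(c in digits or c == "_" for c in rest)
    # Rejected when it contains a digit and nothing but digits and underscores.
    return has_digit and only_digits
```

This reproduces the cases listed: `b0`, `xface` and `b_0` are rejected, while `b0x` and `b_` pass.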

That keeps the happy path as wide as possible.

Thoughts on if this is worthwhile or not?

so this syntax automatically forbids 100#nan for example, making sure that missing a space between two values is still guaranteed to be an error even with this change.

Yup, a somewhat accidental but nice coincidence.

@tabatkins
Contributor

tabatkins commented Feb 27, 2025

We could also slightly widen the restriction, to go along with the e restriction - we disallow suffixes that start with an e followed by a digit, even if they contain other non-digit values, because even if you have those non-numeric characters it'll be a guaranteed misparse when you actually use it. That is, 5e2x will parse as 5e2 with a suffix of x and be a parse error anyway, so might as well rule it out entirely.

The 0x/etc case only causes parse errors on 0, but we could still harmonize the restrictions a bit so that we just disallow all suffixes that start with b followed by a binary digit, or x followed by a hex digit, etc. Rules out slightly more bare suffixes, but makes the rules slightly easier to remember and makes the parse error trigger more predictably.

Edit: yeah, i'm liking this slightly wider restriction better, I don't like technically-allowed values that cause parse errors in common cases. So:

  1. Suffixes are an alternative way to spell a tag on numbers. They're written as either 123ident or 123#ident, where ident is any ident string, and are equivalent to (ident)123.
  2. If written without the # separator (a "bare" suffix), there are some additional restrictions to avoid the number misparsing as another numeric syntax:
    • Bare suffixes can only be used on decimal-syntax numbers not using the exponential syntax. (So no binary/octal/hex, or decimal with exponential.)
    • Bare suffixes can't start with ., ,, or _.
    • If the bare suffix starts with e, it can't be followed by a digit, or followed with + or - followed by a digit.
    • If the bare suffix starts with b, it can't be followed by a valid binary digit.
    • If the bare suffix starts with o, it can't be followed by a valid octal digit.
    • If the bare suffix starts with x, it can't be followed by a valid hexadecimal digit.
  3. If written with the # separator, any ident string is allowed, on any numeric syntax.

@jaxter184

Could bare (123abc) suffixes be implemented first to test the waters, and then separated suffixes (123#abc) be added later? As far as I can tell, they're backwards-compatible, and it might be informative to implement bare suffixes first since all of the use cases I've seen so far fit with that (12px 12em 12% 12f32), and then see all the other ways people use it and where the friction points are before making any decisions on how separated suffixes should be implemented.

The main drawback being that type tags on binary/octal/hex values would be limited to the existing parenthetical prefixes (until separated suffixes are added). Also, most of the planning/restrictions/complications pertain to bare suffixes, and discussing separated suffixes has been helpful in figuring out how the bare suffixes should work, so maybe it's just easier for everyone involved (implementers, users, etc) to reduce the number of KDL versions floating around and bundle the changes together.

If the bare suffix starts with x, it can't be followed by a valid hexadecimal digit.

Just to clarify, does the term "hexadecimal digit" also include _?

wouldn't 10.1#.2 be invalid because .2 is not a valid identifier?

lol, indeed, that was just bad on my part. I'll fix.

Isn't .2 a valid bare identifier, and therefore a valid separated suffix?

@YaLTeR

YaLTeR commented Feb 28, 2025

I feel like these are the least troublesome cases; they'd still be allowed under my rules and I think that's okay. They're clearly invalid if thinking of binary/octal/hex numbers, and if you're not thinking of those syntaxes, it's clearly a 0 with a suffix of b. So that's clear.

Something that comes to mind in favor of allowing these identifiers:

  • 0b makes sense in the context of "bytes" (0b, 0kb, 0mb)
  • 0x makes sense in the context of multipliers (0x, 2x, 4x)

@bgotink
Member

bgotink commented Feb 28, 2025

Just to clarify, does the term "hexadecimal digit" also include _?

It should, to make sure 0b_0 remains invalid. Otherwise it would become too easy imo to mistakenly put an underscore at the start of a number. That's something parsers can now provide a helpful error message for, if we don't include _ it would suddenly become valid as (b_0)0.

wouldn't 10.1#.2 be invalid because .2 is not a valid identifier?

lol, indeed, that was just bad on my part. I'll fix.

Isn't .2 a valid bare identifier, and therefore a valid separated suffix?

No, a dot at the start of an identifier cannot be followed by a number.

kdl/draft-marchan-kdl2.md

Lines 966 to 967 in a88c450

dotted-ident :=
sign? '.' ((identifier-char - digit) identifier-char*)?
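
As a quick illustration, that production can be approximated with a Python regex. `IDENT_CHAR` here is a simplified stand-in for the spec's identifier-char (an assumption of this sketch), and `is_dotted_ident` is a hypothetical helper name.

```python
import re

# Simplified identifier-char: anything except whitespace and (){}[]/\"#;=
# (the real spec also excludes some code points).
IDENT_CHAR = r'[^\s(){}\[\]/\\"#;=]'
# sign? '.' ((identifier-char - digit) identifier-char*)?
DOTTED_IDENT = re.compile(rf"[+-]?\.(?:(?![0-9]){IDENT_CHAR}{IDENT_CHAR}*)?")

def is_dotted_ident(text):
    """True if text matches the dotted-ident production as approximated here."""
    return DOTTED_IDENT.fullmatch(text) is not None
```

So `.foo` and a lone `.` match, while `.2` does not, which is why `10.1#.2` has no valid suffix.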

@zkat
Member Author

zkat commented Feb 28, 2025

I'm just gonna toss this one over the fence:

We could require that this feature prefix decimals with #

That is: #10px

And we make it so the numbers are ALWAYS non-exponential decimals.

It's a special case, but it needs fewer protections because we know we're switching modes. Like, I don't think we need to protect against things like , because we're no longer worried about typos in "regular" numbers. This would also be 1000% forward compatible because it would be completely unambiguously invalid v2

That is, while this might be a little bit surprising at first, we could TOTALLY allow this: #0b -> (b)0.

I know some folks have worried before about overuse of # but I just find it to be a very nice "mode switching" character, and the syntactic cost of it ends up being relatively low, though definitely noticeable:


preset-column-widths {
    - #33%
    - #50%
    - #66%
    - #1000px
}
main-size #20px
main-size tab=#10%
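
A minimal Python sketch of how the #-prefixed form could be read, assuming the number is matched greedily as a non-exponential decimal and everything after it is the suffix. The name `parse_hash_number` is hypothetical, and rejecting a missing suffix is this sketch's own assumption.

```python
import re

# Non-exponential decimal only, as in the proposal above.
NUM = re.compile(r"[+-]?[0-9][0-9_]*(?:\.[0-9][0-9_]*)?")

def parse_hash_number(text):
    """Parse '#<decimal><suffix>' into (tag, value); raise ValueError otherwise."""
    if not text.startswith("#"):
        raise ValueError(f"missing '#': {text!r}")
    m = NUM.match(text, 1)  # start matching right after the '#'
    if m is None:
        raise ValueError(f"no decimal after '#': {text!r}")
    suffix = text[m.end():]
    if not suffix:
        # Assumption of this sketch: a '#' number without a suffix is rejected.
        raise ValueError(f"missing suffix: {text!r}")
    digits = m.group().replace("_", "")
    return (suffix, float(digits) if "." in digits else int(digits))
```

With this reading, `#0b` comes out as the tag `b` on the value 0, with no special-casing needed.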

@tabatkins
Contributor

I think it reads better than 33#%, yeah. But it doesn't cleanly extend to ever having suffixes on non-decimal numbers, which is a little sad. But the fact that we could drop all restrictions on the suffix is indeed nice.

Is there any possibility of using this for non-decimal numbers? I imagine 0b#... might work? It would still require some restrictions on hex, tho. I feel like if we did the non-decimals we'd still want to put the # before the suffix, like 0x1234#u16.

I'd really like to try and navigate the 33% happy path if possible, tho. :(

@zkat
Member Author

zkat commented Feb 28, 2025

I really want a happy path, too, but I don't want suffixes to feel like this:

Image

and this thread definitely makes me feel like we're heading in that direction.

@tabatkins
Contributor

tabatkins commented Mar 3, 2025

Okay, here's my final attempt at charting a happy path that's not too overcomplicated. If this still isn't good enough, then I agree, going with a # prefix is the way to do it.

  1. Bare suffixes can only be used on decimal-syntax numbers without exponential syntax.
  2. Bare suffixes can't make the number look like other syntaxes:
    • Bare suffixes can't start with ., ,, or _.
    • Bare suffixes can't start with [a-zA-Z][0-9_], [eE][+-][0-9_], or [xX][a-fA-F].
  3. Explicit suffixes (num#ident) can be used on any numeric syntax with any ident.

This just widens the restrictions a bit to let us collapse the special cases as much as possible - all ascii alphas are restricted, and they all restrict all ascii digits. Unfortunately I still need to call out the e and x one specially, but I don't see a way to avoid that without allowing suffixes that would produce misparses with some numbers. I think writing the restrictions in this more compact form makes them look less scary, too. ^_^

This still allows all single-character suffixes (so 2x is still valid), and I think the restriction on second characters won't actually hit reasonable suffixes. x is the only one I'm annoyed about. This does, however, allow all existing CSS units (and all that I can imagine existing, really). As long as we never try to add a base 36 numeric syntax (really, anything base 11 or higher), we're fairly future-proof too.
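
The bare-suffix restrictions in rule 2 fit in a single regex. The sketch below uses hypothetical names (`FORBIDDEN_START`, `bare_suffix_ok`) and checks only the start of the suffix; it does not check that the suffix is a valid ident string or that the number is a non-exponential decimal.

```python
import re

# A bare suffix may not start with any of these (rule 2 above).
FORBIDDEN_START = re.compile(r"[.,_]|[a-zA-Z][0-9_]|[eE][+-][0-9_]|[xX][a-fA-F]")

def bare_suffix_ok(suffix):
    """True if the suffix passes the bare-suffix start restrictions."""
    return FORBIDDEN_START.match(suffix) is None
```

Single letters such as `x` and units such as `px`, `em`, `%` or `xp` pass. Note that `u16` does not, since a letter followed by a digit is restricted, so it would need the explicit form `123#u16`.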

@zkat
Member Author

zkat commented Mar 4, 2025

@tabatkins i think I'm willing to live with this tbh. I think we can shop it around for a bit and see what the community thinks. Could put it behind a feature flag in kdl-rs in the meantime.

@zkat
Member Author

zkat commented Mar 4, 2025

Tragic: can't say stuff like 0xp 😔 dreams of using this for RPGs shattered

Edit: oh wait no, you can.

Ohohohoho

@bgotink
Member

bgotink commented Mar 6, 2025

Bare suffixes can't start with [a-zA-Z][0-9_], e[+-][0-9], or x[a-fA-F].

That should be [eE][+-][0-9] as KDL allows 1E3 as 1000.

Could we also make it [xX][a-fA-F] to prevent confusion with 0XDEAFBEEF vs 0xDEADBEEF?

EDIT: and I think we should make it [eE][+-][0-9_] so 1e-_3 remains invalid.

@tabatkins
Contributor

ah yeah, sorry, my brain was still sitting in case-insensitive mode even tho I addressed casing in the character classes.

@bgotink
Member

bgotink commented Mar 6, 2025

I've published a pre-release of @bgotink/kdl to npm which supports number suffixes. You can play with it in the browser console over at https://github.bram.dev/kdl/.
For example:

kdl.parse('node 10xp', {as: 'node'}).entries[0].value
Screenshot of the result of the previous code block, showing an object with value 10 and a tag 'xp', the tag is noted as suffix
