Clarify intended semantics of isize, usize, and regex types #505

Open
remexre opened this issue Feb 18, 2025 · 3 comments

@remexre

remexre commented Feb 18, 2025

Other than the isize, usize, and regex types, all of the reserved types in the spec have a simple consumer-independent procedure for checking whether a node is of that type: checking if it is in a certain fixed range for integer types or checking if it meets some regex for string types.

Because of this, it seems very sensible for a parser to reject a node like (u8)300 or (ipv4)"example.com", and for editor tooling to warn people, as they write KDL documents, that such nodes are malformed.
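
To make the "consumer-independent procedure" concrete, here is a minimal Python sketch of such a check. The `INT_RANGES` table and the `is_valid` helper are hypothetical names, not part of any KDL library; only a few reserved types are covered, and the standard `ipaddress` module stands in for whatever pattern a real implementation would use for ipv4.

```python
import ipaddress

# Inclusive ranges for a few fixed-width integer types (illustrative subset).
INT_RANGES = {
    "u8": (0, 2**8 - 1),
    "i8": (-(2**7), 2**7 - 1),
    "u32": (0, 2**32 - 1),
    "u64": (0, 2**64 - 1),
}


def is_valid(type_name, value):
    """Return True if `value` satisfies the annotation `type_name`."""
    if type_name in INT_RANGES:
        low, high = INT_RANGES[type_name]
        return isinstance(value, int) and low <= value <= high
    if type_name == "ipv4":
        if not isinstance(value, str):
            return False
        try:
            ipaddress.IPv4Address(value)
            return True
        except ValueError:
            return False
    # Annotations this sketch does not know about are left unchecked.
    return True


print(is_valid("u8", 300))              # False
print(is_valid("ipv4", "example.com"))  # False
```

Nothing in this check depends on who runs it, which is the property the three types below lack.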

These three reserved types lack that property:

  • The range for isize and usize depends on whether the producer or consumer's platform was intended, and the details of that platform.
  • Whether a string is a valid regex or not depends on the actual regex syntax being used.
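
As a sketch of the first point, assuming usize is an unsigned integer exactly as wide as a pointer (the Rust reading), the same value can be in range on one platform and out of range on another. `usize_range` and `fits_usize` are hypothetical helper names.

```python
import struct


def usize_range(pointer_bits):
    """Inclusive range of an unsigned pointer-sized integer."""
    return (0, 2**pointer_bits - 1)


def fits_usize(value, pointer_bits):
    low, high = usize_range(pointer_bits)
    return low <= value <= high


# Pointer width of the machine running this script (the "consumer").
HOST_POINTER_BITS = struct.calcsize("P") * 8

value = 1_000_000_000_000
print(fits_usize(value, 32))  # False
print(fits_usize(value, 64))  # True
print(fits_usize(value, HOST_POINTER_BITS))  # depends on the host
```

A validator has to pick a value for `pointer_bits`, and the document itself does not say whether the producer's or the consumer's is meant.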

It feels like there should be multiple regex types for the different regex standards; e.g. ere-regex for POSIX extended regular expressions or pcre-regex for the PCRE library's flavor.
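
A small illustration of why the flavor matters, using Python's `re` module as one example of a non-ECMAScript engine: the ECMAScript spelling of a named capture group is rejected by Python, which uses a different spelling for the same feature. `compiles_in_python` is a hypothetical helper.

```python
import re

# Named capture group in ECMAScript syntax (ES2018 and later).
ECMA_PATTERN = r"(?<year>\d{4})"
# The equivalent pattern in Python's own syntax.
PYTHON_PATTERN = r"(?P<year>\d{4})"


def compiles_in_python(pattern):
    """Return True if Python's `re` module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles_in_python(ECMA_PATTERN))    # False
print(compiles_in_python(PYTHON_PATTERN))  # True
```

So a consumer that validates a bare (regex) string with its host language's engine can reject a string that the producer considered perfectly valid.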

Unfortunately, there are so many flavors that most probably shouldn't get reserved names, but giving reserved names to the ones that warrant them makes the intended semantics clearer, and makes it less likely that a parser consuming a KDL document accidentally interprets a regex as the wrong flavor.

(If I had to pick and choose, I'd reserve/standardize ecma-regex and ere-regex today and leave the others, but this is a weakly held opinion.)


For isize and usize, I think the question comes down to the semantics of type annotations in general; I don't think I understand how they are supposed to interact with the intended data model.

In particular, are the following pairs of values treated as equivalent in the intended data model (and which if any of them should-in-a-SHOULD-sense be errors):

  • (non-reserved-type)#true and #true
  • (u8)300 and 300
  • (u8)300 and (u8)44
  • (usize)100 and (u32)100
  • (usize)100 and (u64)100
  • (usize)1000000000000 and (usize)3567587328
  • (usize)1000000000000 and (u32)1000000000000
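
For context on where the second value in some of these pairs comes from: it is what the first value becomes under a wrapping (modulo 2^n) conversion, assuming 8 bits for u8 and a 32-bit usize. A quick Python check, with `wrap_unsigned` as a hypothetical helper:

```python
def wrap_unsigned(value, bits):
    """Reduce `value` modulo 2**bits, as a wrapping cast to an unsigned type would."""
    return value % (2**bits)


print(wrap_unsigned(300, 8))                 # 44
print(wrap_unsigned(1_000_000_000_000, 32))  # 3567587328
print(wrap_unsigned(1_000_000_000_000, 64))  # 1000000000000
```

Whether a consumer should wrap like this, reject the node, or keep the value as written is exactly the question.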

Personally, I think standardizing the isize and usize types would be a mistake; it doesn't seem like a good choice to depend on the specifics of either the producer's or consumer's platform, and on many platforms (for example Common Lisp, JavaScript, OCaml, or Python, all for different reasons) the nearest "natural" semantics don't necessarily correspond to something one would expect from the Rust isize and usize types, or from the C size_t.

@zkat
Member

zkat commented Feb 18, 2025

I need to find where I put this, but I wrote about regex being specifically ECMAScript Regex, but I guess it didn't make it into the spec.

isize and usize I'm more comfortable with: They're kinda YOLO types as it is, and it's also worth noting that type annotations in KDL are themselves very YOLO: the spec says implementations MAY pay attention to them, but if they do, then they SHOULD use certain semantics.

My personal expectation, and what I think should be the expectation of most users and implementors, is for usize to correspond to Rust's usize and C's size_t, and isize to Rust's isize. With all their potential quirks.

As with those types, if you're using these types, you should probably take appropriate caution.

We could decide not to reserve them on the basis of "KDL should be predictable and cross-platform", but that would not prevent people from using them. This way, we at least set some ground rules about what they're intended to be, instead of ignoring them so folks can shove anything they want in there without anyone potentially giving them a fair warning about it.

@zkat
Member

zkat commented Feb 18, 2025

Also I like the idea of expanding regex reservations a bit to be:

  • regex & ecma-regex - ECMAScript-compatible regular expression
  • pcre-regex - PCRE-compatible regular expression
  • ere-regex - POSIX-compatible regular expression

At the same time: I think a more prudent approach is to say regex is the only reserved, and it SHOULD be ECMAScript, but implementations MAY provide an option to override this behavior to the user, or document the alternative default, with the knowledge that non-ECMA regular expressions may not be fully portable between KDL consumers. I kinda like this better. There's a LOT of regex engines out there, as you say, and I think picking an option that's "most practical default + ability to change as needed" seems prudent here. wdyt?

@remexre
Author

remexre commented Feb 18, 2025

I think that if it's documented that regex SHOULD be ECMAScript that's good; I might strengthen the documentation requirement to a SHOULD but that's just quibbling at this point. :)

Maybe if it's too lengthy to add multiple regex types, there could be a general suggestion of a naming scheme to follow, just to encourage people to name their new type awk-regex and not regex-awk, and to treat awk-regex the same as regex when they do document a different default for regex.
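
As a rough sketch of how a consumer might apply that naming scheme together with an overridable default (the `regex_flavor` function, the `DEFAULT_FLAVOR` constant, and the flavor strings are all hypothetical, not anything from the spec):

```python
DEFAULT_FLAVOR = "ecma"  # what a bare (regex) annotation SHOULD mean
SUFFIX = "-regex"


def regex_flavor(type_name, default=DEFAULT_FLAVOR):
    """Map a type annotation to a regex flavor name, or None if it is not a regex type."""
    if type_name == "regex":
        return default
    if type_name.endswith(SUFFIX) and len(type_name) > len(SUFFIX):
        return type_name[: -len(SUFFIX)]
    return None


print(regex_flavor("regex"))                 # ecma
print(regex_flavor("awk-regex"))             # awk
print(regex_flavor("regex", default="awk"))  # awk
print(regex_flavor("regex-awk"))             # None
```

With a flavor-first prefix, a consumer that documents awk as its default ends up treating (regex) and (awk-regex) identically without any special casing.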

I personally don't think those three are too lengthy, but that's a weakly held opinion too.


Perhaps this belongs in a comment on #486 instead, but if isize and usize are "YOLO types," should they be removed from the list of reserved formats there? Schema validation, at least, really seems like something that needs to be predictable and cross-platform.

I think I'd also prefer that there at least be some sort of warning in the spec that usize and isize might have a non-power-of-two number of bits, might be unbounded, etc. depending on the platform -- I think that someone who's skimming a document would recognize that the type could be 32-bit or 64-bit, with typical two's complement behavior for isize, but I'm not sure an unprepared reader would immediately realize the diversity of what the "default integer type" on various platforms is.
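
To give a rough idea of that diversity, here is a Python sketch comparing the range of the "natural" integer on a few platforms. The platform labels are made up for this example, the table is illustrative rather than exhaustive, and the OCaml row assumes a 64-bit system.

```python
# Inclusive (low, high) range of the "natural" integer; None means unbounded.
NATURAL_INT_RANGES = {
    "rust-isize-64": (-(2**63), 2**63 - 1),
    "rust-isize-32": (-(2**31), 2**31 - 1),
    # OCaml's int is 63 bits wide on 64-bit systems.
    "ocaml-int-64": (-(2**62), 2**62 - 1),
    # Number.MIN_SAFE_INTEGER .. Number.MAX_SAFE_INTEGER
    "javascript-safe-integer": (-(2**53 - 1), 2**53 - 1),
    # Python's int is arbitrary precision.
    "python-int": (None, None),
}


def fits(value, platform):
    """Return True if `value` is representable as that platform's natural integer."""
    low, high = NATURAL_INT_RANGES[platform]
    return (low is None or low <= value) and (high is None or value <= high)


value = 2**62
for platform in NATURAL_INT_RANGES:
    print(platform, fits(value, platform))
```

The value 2**62 fits a 64-bit Rust isize and a Python int, but not an OCaml int, a JavaScript safe integer, or a 32-bit isize.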

It might deserve a "Security Considerations" subsection, though I suppose similar overflow-related issues apply to many JSON implementations and the JSON RFC doesn't mention it in its Security Considerations section. ¯\_(ツ)_/¯
