Word Boundaries in Opening Node Tags #39
Comments
Yes, because it's invalid syntax.
That's the correct way to write it.
A space must be inserted. The (not yet documented) rule is: A name must be ended by
The same rule is applied in the new parser.
That would be convenient in some cases. But more error-prone for humans. Consider this code:
It reads like a node with name The 'whitespace after name' rule avoids such traps.
That's going to complicate the details of many editor syntaxes: whereas the intuitive solution is a general RegEx pattern
the problematic aspect here is the definition of whitespace character Hopefully the E.g. in the Oniguruma engine, whitespace is defined as:
and Unicode space as:
etc. The point is that an editor syntax should mark as The correct alternative RegEx seems to be the If we were dealing with ordinary code, this wouldn't have been a problem, but since PML sources are text documents you can't really predict what characters to expect (anything Unicode might be there). All I'm saying here is that considering a tag as ending on a boundary character (i.e. any char which is not a valid tag ID char) offers a stricter definition, whereas relying on "whitespace" seems more of a generically-defined behaviour. Both approaches have pros and cons, but it's worth considering and exploring the cons too, just to be aware of potential issues.
Yes, unfortunately there is no universal standard definition for 'whitespace'. To keep it simple, a whitespace character after a tag name in PML is defined as a "space, tab, or new line", as stated in my previous comment (i.e. regex
If we used Currently a non-empty PML tag is defined by the regex A downside of this approach is that if a tag name is followed by a character that is considered to be whitespace in other definitions (e.g. a form feed) then the PML parser will generate an error (illegal character in tag name). However this kind of illegal code will probably be very rare in practice.
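A small Python sketch may make the trade-off concrete. It assumes the narrow definition corresponds to the character class `[ \t\r\n]` (a reading of "space, tab, or new line"; including the carriage return is an assumption) and compares it with Python's Unicode-aware `\s`. The names `PML_SPACE`, `UNICODE_SPACE` and `classify` are made up for this illustration and are not part of PML.

```python
import re

# ASSUMPTION: narrow definition = space, tab, or new line (CR included as a guess).
PML_SPACE = re.compile(r"[ \t\r\n]")
# Python's \s on str patterns matches Unicode whitespace.
UNICODE_SPACE = re.compile(r"\s")

samples = {
    "space": " ",
    "tab": "\t",
    "form feed": "\f",
    "no-break space": "\u00a0",
    "thin space": "\u2009",
}

def classify(ch):
    """Return (matches narrow class, matches Unicode whitespace) for one character."""
    return (bool(PML_SPACE.fullmatch(ch)), bool(UNICODE_SPACE.fullmatch(ch)))

results = {name: classify(ch) for name, ch in samples.items()}
for name, (narrow, broad) in results.items():
    print(f"{name:15} narrow={narrow!s:5} unicode={broad}")
```

Under these assumptions, the form feed, no-break space and thin space are whitespace for the broad definition only, which is the kind of character that would trigger the "illegal character in tag name" error described above.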
That would be invalid PML code, if it appears at the end of a document. But in the new parser it can be valid if it's at the end of an inserted file (because in very rare cases the tag could continue in the parent file).
I don't think it would be so rare in real editing practice. If PML is to support non-Western languages, it needs to consider what whitespace characters are used in those locales. Even within the European languages, French has special rules regarding the use of thin-spaces when enclosing text in double quotes or Guillemets:
The French language is known to pose many challenges to lightweight markup syntaxes when it comes to correctly spacing punctuation marks (e.g. when to insert thin-spaces, non-breaking spaces, etc.), which often requires special extensions to handle them. What happens if an opening tag is followed by a control character to change text direction (e.g. a Western text that needs to insert some Arabic or Hebrew words or sentences)? A It might be tempting to say "just insert a space, PMLC will consume it"; except that this might not work in many cases, e.g. when you want/need to apply a node to part of a word without splitting it semantically (Arabic has a cursive-like alphabet, where letters are joined with each other by a base line, so you can't insert a space of any kind without breaking the word or word compound). I've encountered this whitespace problem so many times in real document editing work. I can't vouch for all the existing languages (spoken or dead) covered by Unicode, but knowing Arabic and Hebrew I can assure you that the I am skeptical of undefined behaviours, and trusting the "whitespace" definition as a reference is basically opening the doors to undefined behaviour. Look at how much damage undefined behaviour has done with C and C++, being the root cause of most bugs and vulnerabilities.
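As a rough check of the concern about direction control characters, the Python sketch below looks up the Unicode general category of a few such characters and tests them against two tag patterns. `TAG_BOUNDARY` follows the `\b` style of pattern quoted in the issue (adapted here to the `b` node), while `TAG_SPACE` is a hypothetical whitespace-terminated variant written only for comparison; neither is taken from pmlc itself.

```python
import re
import unicodedata

marks = {
    "LEFT-TO-RIGHT MARK": "\u200e",
    "RIGHT-TO-LEFT MARK": "\u200f",
    "WORD JOINER": "\u2060",
}

# Boundary-based tag pattern, in the style of the issue's example.
TAG_BOUNDARY = re.compile(r"(?<!\\)\[b\b")
# Hypothetical whitespace-based variant (Unicode-aware \s), for comparison.
TAG_SPACE = re.compile(r"(?<!\\)\[b\s")

report = {}
for name, ch in marks.items():
    # Tag, then the control character, then the Arabic letter meem.
    source = "[b" + ch + "\u0645]"
    report[name] = (
        unicodedata.category(ch),           # Unicode general category
        ch.isspace(),                       # whitespace according to Python?
        bool(TAG_BOUNDARY.search(source)),  # does the boundary pattern match?
        bool(TAG_SPACE.search(source)),     # does the whitespace pattern match?
    )

for name, row in report.items():
    print(name, row)
```

These characters are format characters (category `Cf`), not whitespace, so a whitespace-based rule rejects a tag immediately followed by one of them even when a broad Unicode definition of whitespace is used, whereas the boundary-based pattern still recognises the tag.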
I meant that an editor syntax (or an LSP Lang Server) has to account also for the possibility that the text being edited might lead to an opening node being at the end of the document.
Anyhow, I'll update Sublime PML to implement the suggested pattern, and also add some edge case tests in the PML Playground to see how it plays out in real case usages. I'm assuming that the first whitespace character following the tag is consumed by the parser, i.e. it's considered a mere separator which won't generate the equivalent whitespace char in the output (that would alleviate potential problems).
Yes, sure. The parser consumes the mandatory whitespace character after the tag's name, so that it does not appear in the text. After this whitespace character, any Unicode character can be inserted as part of the text, including all Unicode whitespace characters. For example, suppose we want to write the word 'analyzing' in italics, with the letter 'z' in bold. This is the PML code:
And this is the HTML code generated by the PML converter (class attributes removed):
Can currently be done like this (for HTML output):
HTML generated:
Fully agree. Undefined behaviour can create real damage of all kinds. Tree structure validation must be added in a future PML version.
That's great. But I still want to test it with some real case examples, e.g. with RTL languages, to ensure that editors don't insert some extra unwanted characters when typing tags like this, e.g. some language or direction control characters due to the switch from an RTL main text to the Western alphabet when inserting a node in the middle of a word; like your bold I'll have to carry out these tests in another editor though, because Sublime Text supports RTL languages very poorly. Notepad++ does a better job, and I haven't tried VSCode. Ideally, these tests should be done in an editor designed with RTL support in mind.
A preliminary basic test (using Notepad++) seems to work OK (1st line the plain word; 2nd line the middle consonant is in bold):
برمجة
بر[b م]جة
(in GitHub preview the second line is broken, but in the resulting HTML after pmlc conversion it looks OK)
where the word with the bold consonant results in the following HTML:
<p class="pml-paragraph">بر<b class="pml-bold">م</b>جة</p>
but of course the editing experience is a bit messy when the cursor encounters an RTL language change, and insertions break the sentence.
My fear is that proper RTL editors might be inserting direction control characters in the background here; after all, without them it would be very hard to work with the text. I'm not sure of this, since it's been a long time since I edited similar texts; but I remember that in XP Arabic Edition there were similar problems when editing HTML documents in various apps, depending on the degree of support for RTL (or how Win Arabic intervened therein).
Note on SGML Entities
As a side note, handling similar cases (i.e. an Arabic word in the middle of an English sentence) would be much easier by encoding the Arabic text to HTML entities:
This would also lift most of the problems associated with limited RTL support in editors. Conversion from Arabic to HTML entities can be easily done with an HTML entity encoder/decoder: https://mothereff.in/html-entities But in PML this would require wrapping the encoded text in a Not allowed:
[verbatim بر[b م]جة]
So the above example can't be achieved this way.
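For reference, the entity encoding mentioned in the note on SGML entities can be sketched in a few lines of Python. The helper name `to_numeric_entities` is made up for this example; it only shows the conversion to decimal character references and says nothing about how PML would have to wrap the result.

```python
import html

def to_numeric_entities(text):
    """Encode every non-ASCII character as a decimal HTML character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

word = "برمجة"
encoded = to_numeric_entities(word)
print(encoded)

# Round trip: decoding the references gives back the original word.
assert html.unescape(encoded) == word
```

The encoded form is pure ASCII, so it sidesteps the editor's BiDi handling, at the cost of a source text that is no longer readable as Arabic.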
BiDi Sample Doc
I've added some preliminary samples of RTL usage in PML: https://github.com/tajmone/pml-playground/blob/main/pml-samples/bidi-text.pml So far, so good (but without a dedicated editor it's a real pain to edit source files with BiDi texts).
The new parser will also support the For example this:
... is parsed as:
Great. It's nice to see Arabic and Hebrew text used in PML.
That's a useful addition indeed!
Thin-Space Node?
I would suggest also adding a thin-space node, because I believe French uses them quite a lot to separate enclosing double quotes and chevrons from the quoted word(s). Maybe something like «[thsp]Guillemets[thsp]» Although in Italian we don't have explicit rules for this, when using Guillemets good printers follow the French tradition and do add thin spaces.
Word-Joiners around Thin-Spaces?
The problem though is that these should also be non-breaking spaces, i.e. you don't want the text to wrap after an opening double chevron or before a closing one. So maybe the I do realize that the Unicode escapes could be used instead, but since it's a basic punctuation in French it might deserve a native node, just like there's
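To make the characters involved explicit, here is a minimal Python sketch of what such a thin-space node could expand to. The `[thsp]` node is only a suggestion in this thread, and the function `guillemets` and its `glue` parameter are hypothetical names used for illustration; the code points themselves (THIN SPACE U+2009, WORD JOINER U+2060, NARROW NO-BREAK SPACE U+202F) are standard Unicode.

```python
THIN_SPACE = "\u2009"    # breakable thin space
WORD_JOINER = "\u2060"   # zero-width, prohibits a line break on either side
NARROW_NBSP = "\u202f"   # single-character non-breaking narrow space

def guillemets(text, glue=True):
    """Wrap text in guillemets separated by thin spaces.

    With glue=True each thin space is surrounded by word joiners,
    so a line cannot wrap next to the chevrons.
    """
    space = WORD_JOINER + THIN_SPACE + WORD_JOINER if glue else THIN_SPACE
    return "\u00ab" + space + text + space + "\u00bb"

quoted = guillemets("Guillemets")
print([f"U+{ord(ch):04X}" for ch in quoted[:4]])
```

A single NARROW NO-BREAK SPACE might serve the same purpose as the joiner, thin space, joiner sequence; which of the two a native node should emit is left open here.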
Inconsistent Parser Behaviour
I've started implementing in Sublime PML the tag boundary pattern as you suggested (
L1[nl[- comment -]]L2
L1[b[nl[- comment -] bold]]L2
L1[b[- comment -] bold ]L2
The above nodes Now, I'm quite unsure what I should do. Should I keep a more relaxed syntax that correctly highlights nodes which might cause errors at build time, or should I instead enforce a stricter approach which doesn't highlight nodes which are valid PML? IMO, it's better to keep the syntax as it was, and revert to using just a A syntax should be a good approximation of how the converter sees a document, but the focus is on the author who's editing the document, not on mimicking the parser 100%. It doesn't need to cover all edge cases, especially if this introduces unnecessary complexity.
Bear in mind that before implementing any syntax element in Sub.PML I always do extensive local tests, trying to figure out each tag's behaviour in various contexts (line breaks, tabs, etc.). So far tests have been a better guide than documentation, because many aspects are not documented and not all tags behave the same.
The point here is that we're looking at the PML syntax from different angles, due to different needs: you're interested in proper parsing, whereas I'm more interested in a realistic highlighting of the syntax, which needs to be real-time performant and easy to maintain. So I should better highlight correctly The above considerations hold true also for a PML Lang Serv, where accuracy needs to be sacrificed for performance. Unlike PMLC, which acts on a source file which is immutable, syntax highlighters have to deal with authors editing the code all the time, each keystroke forcing a highlighter update.
I'm very sorry, Tristano, because I wasn't precise enough in my previous comment. The space (i.e. Before continuing the discussion I would like to ask you: Couldn't you simply ignore the character after the name, and just apply the following regex for a valid name (as specified in the pXML BNF)?
That's also the rule applied in the new parser. Applying this regex works correctly for your initial example: This rule is less likely to change in a future version, compared to the rule of what can follow a name. And it would probably be easier to change in the future (for example if later we really want to stick to the more complex rule for XML names). If you use a
That's why the
These attribute lists will be enclosed within round brackets?
The problem is that without the
If I've understood the above RegEx correctly, it introduces two new chars to ID names: The problem with the new pattern is that the introduction of the chars Are the
Indeed, conformance to XML naming convention for identifiers is desirable, especially when converting from HTML/XML-based formats to PML, so IDs don't undergo lossy renaming. I guess eventually I'll have to adapt all the node tags RegEx to the new scheme.
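The interaction between XML-style names and a `\b` boundary can be sketched in Python. The name regex proposed in this thread is not reproduced here; the sketch simply assumes that name characters include ASCII letters, digits, `_`, `-` and `.` (XML names do allow the hyphen and the full stop), and the node names `c-x` and `c.x` are hypothetical. `WITH_LOOKAHEAD` is one possible alternative, not the pattern used by pmlc or Sublime PML.

```python
import re

# Boundary-based pattern in the style of the original issue, for the `c` node.
WITH_BOUNDARY = re.compile(r"(?<!\\)\[c\b")
# Alternative: the tag name ends where no further name character follows.
# ASSUMPTION: name characters are ASCII letters, digits, '_', '-' and '.'.
WITH_LOOKAHEAD = re.compile(r"(?<!\\)\[c(?![A-Za-z0-9_.\-])")

cases = [
    "[c code]",       # ordinary use
    "[c\\[admon]",    # tag directly followed by an escape
    "[c-x text]",     # hypothetical node whose name merely starts with "c"
    "[c.x text]",     # same, with a full stop
    "\\[c escaped",   # escaped bracket, not a tag at all
]

outcome = {
    s: (bool(WITH_BOUNDARY.search(s)), bool(WITH_LOOKAHEAD.search(s)))
    for s in cases
}
for s, (boundary, lookahead) in outcome.items():
    print(f"{s!r:16} boundary={boundary!s:5} lookahead={lookahead}")
```

With hyphens and full stops allowed in names, `\b` treats `[c-x` as a `c` tag because `-` is not a word character, while the negative lookahead keeps the two apart without requiring whitespace after the name.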
That's the pXML syntax. It is required in strict pXML, as explained here. Because PML version 2.0 will be based on pXML, attributes must be enclosed in parentheses if the PML code is written in strict pXML. However, the PML parser also applies lenient parsing, which allows (among other simplifications) the parentheses to be omitted. I am currently not sure yet if the new parser's lenient mode will work exactly as in the current parser. I will try to keep it compatible (because it makes the PML syntax more succinct), but it's one of the more challenging features to implement in PML 2.0. Lenient parsing will also need to be well documented, because it's important for editor plugin developers. Here is an example of how lenient parsing will probably work, and make PML more succinct:
Yes, because pXML names must be compatible with XML names.
Yes, that's why I suggested to just use the regex for names.
No and no. Hence using If you need to detect the end of a name, maybe you can still use
That would be a nice addition for some people. However, I am a bit hesitant to add this as a standard node, for the following reasons:
Note: To keep discussions focused and limited in size, I suggest that in the future we create new discussions/issues for subjects that deserve a new, separate entry.
Unicode escape sequences are now available in version 2.0.0.
Info: I'm currently working on that feature.
The line
An [c\[admon] block
raises a conversion error, forcing me to change the line to
An [c \[admon] block
I would expect the pmlc parser to consider any invalid token ID character (i.e. [^a-z_]) as a word boundary, making it unnecessary to insert a space when a tag is followed by an escape, the [ of a nested node, etc.
At least, this is how I've implemented most tags in Sublime PML, where nodes are usually captured via a RegEx like
(?<!\\)\[c\b
This seems the natural way to handle opening node tags, for it will cover all sorts of contexts, e.g. the tag being followed by spaces, tabs, or even a new line:
Some [c inline code ]
which is perfectly valid PML code.
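The difference between the two behaviours can be reproduced with a short Python sketch. `BOUNDARY` is the RegEx quoted above; `WHITESPACE` is only an approximation of the reported pmlc behaviour (a space, tab or line break must follow the tag name) and is an assumption made for this illustration, not the parser's actual rule.

```python
import re

# Pattern quoted in the issue: unescaped "[c" followed by a word boundary.
BOUNDARY = re.compile(r"(?<!\\)\[c\b")
# ASSUMPTION: whitespace-terminated variant approximating the reported behaviour.
WHITESPACE = re.compile(r"(?<!\\)\[c[ \t\r\n]")

lines = [
    r"An [c\[admon] block",      # tag directly followed by an escape
    r"An [c \[admon] block",     # same line with the extra space
    "Some [c\ninline code\n]",   # tag followed by a new line
]

matches = [(bool(BOUNDARY.search(s)), bool(WHITESPACE.search(s))) for s in lines]
for s, (boundary, whitespace) in zip(lines, matches):
    print(f"{s!r:28} boundary={boundary!s:5} whitespace={whitespace}")
```

Only the first line separates the two patterns: the boundary-based RegEx highlights it as a `c` node, while the whitespace-based one does not, mirroring the conversion error described above.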
In any case, these details are important to know to correctly implement editor syntaxes that correctly mimic pmlc's behaviour; as the saying goes, "The devil is in the details".