bug: Open left angle bracket followed immediately by an alpha character causes next tag to be sanitized #733

fayyazul-centaurlabs · 2024-06-14T15:45:33Z

Describe the bug

When an unclosed left angle bracket (used in a "less than" context) is immediately followed by an alpha character, the next tag is sanitized (escaped) instead of being passed through.

Python Version: 3.12.2
Bleach Version: 6.1.0

To Reproduce

Steps to reproduce the behavior:

>>> import bleach
>>> bleach.clean('<t <a></a> <a></a>')
'&lt;t &lt;a&gt; <a></a>'

Expected behavior

>>> bleach.clean('<t <a></a> <a></a>')
'&lt;t <a></a> <a></a>'

Additional context

The above error does not occur for non-alpha characters:

>>> bleach.clean('<5 <a></a> <a></a>')
'&lt;5 <a></a> <a></a>'

willkg · 2024-06-17T11:33:55Z

The test cases are helpful. Does this issue come up often? If so, what does the corpus look like?

fayyazul-centaurlabs · 2024-09-25T13:44:09Z

Sorry for the late reply, I missed your comment somehow. But yes, this issue did come up often. The corpus was a collection of science articles, which quite often use a less-than sign followed by a letter as a variable. I ultimately worked around it by regex-replacing the angle brackets of HTML tags, escaping any remaining angle brackets, then undoing the replacement. Not ideal (and it would probably produce errors in the general case), but it worked for my specific situation.
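For anyone landing here with the same problem, here is a minimal sketch of that kind of workaround. It is a single-pass variant of the same idea (leave recognized tags alone, escape every other angle bracket) rather than the literal replace/escape/undo, and it is not the exact code I used. `ALLOWED_TAGS` and `escape_stray_angle_brackets` are hypothetical names for illustration, and the regex is deliberately simplistic.

```python
import re

# Hypothetical allowlist for illustration; use whatever tags you permit.
ALLOWED_TAGS = ("a", "b", "i", "em", "strong", "p")

# Matches an opening or closing tag for an allowlisted name,
# e.g. <a href="x"> or </a>. Attribute text may not contain < or >.
TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^<>]*)?/?>" % "|".join(ALLOWED_TAGS),
    re.IGNORECASE,
)


def _escape(fragment):
    # Only angle brackets are escaped here; ampersands are left untouched.
    return fragment.replace("<", "&lt;").replace(">", "&gt;")


def escape_stray_angle_brackets(text):
    """Escape '<' and '>' that are not part of an allowlisted tag."""
    pieces = []
    last = 0
    for match in TAG_RE.finditer(text):
        # Escape brackets in the text between recognized tags.
        pieces.append(_escape(text[last:match.start()]))
        # Keep the recognized tag exactly as written.
        pieces.append(match.group(0))
        last = match.end()
    pieces.append(_escape(text[last:]))
    return "".join(pieces)


print(escape_stray_angle_brackets("<t <a></a> <a></a>"))
```

The pre-escaped string can then be handed to the sanitizer as usual. Note the obvious failure mode: prose like `x <a and b> y` would be misread as a tag, which is why this only suited my corpus.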

willkg · 2024-10-25T19:21:55Z

This is a really complicated case. I don't know if we're ever going to get this right.

<t <a></a> <a></a>
   ^

I think the parser moves forward until it hits an invalid-character-in-attribute-name at ^. Then it switches from parsing a tag to parsing character data and slurps up <t <a> as character data:

{'type': Characters, 'data': '<t <a>'}

Bleach will then escape that. Then it slurps up the </a> and I think it drops it. Maybe because it's a closing tag with no matching opening tag. Then it moves on as you would expect.

I'll have to think about how Bleach can figure out that <t followed by an open tag should be treated as Characters and then a StartTag such that the internal state is correct.
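To make the desired outcome concrete, here is a toy sketch of the token stream described above: `<t ` as Characters, followed by a proper StartTag. This is not Bleach's or html5lib's tokenizer; `toy_tokens` and the string token types are hypothetical, and the regex ignores most of what a real HTML tokenizer has to handle.

```python
import re

# Toy pattern: '<', optional '/', a tag name, optional attribute text
# containing no angle brackets, then '>'. Anything else is character data.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*)?)>")


def toy_tokens(text):
    """Split text into Characters / StartTag / EndTag tokens (toy version)."""
    tokens = []
    last = 0
    for m in TAG_RE.finditer(text):
        if m.start() > last:
            # Everything between well-formed tags is character data,
            # including a stray '<t ' that never reaches a '>'.
            tokens.append({"type": "Characters", "data": text[last:m.start()]})
        kind = "EndTag" if m.group(1) else "StartTag"
        tokens.append({"type": kind, "name": m.group(2).lower()})
        last = m.end()
    if last < len(text):
        tokens.append({"type": "Characters", "data": text[last:]})
    return tokens


for token in toy_tokens("<t <a></a> <a></a>"):
    print(token)
```

The hard part in the real parser is not producing this sequence in isolation but getting there after the tokenizer has already committed to parsing `<t` as a tag, without leaving its internal state inconsistent.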

I'll keep this open in case someone else wants to tinker with it and/or I find some more free time, but I don't think I'm going to get further with it today.

fayyazul-centaurlabs added the "untriaged" label (Bug reports that haven't been triaged) on Jun 14, 2024
