bug: Open left angle bracket followed immediately by an alpha character causes next tag to be sanitized #733

Open
fayyazul-centaurlabs opened this issue Jun 14, 2024 · 3 comments
Labels
untriaged Bug reports that haven't been triaged

Comments

@fayyazul-centaurlabs

Describe the bug

An unclosed left angle bracket followed immediately by an alphabetic character (used in a "less than" context) causes the tag that comes right after it to be sanitized.

  • Python Version: 3.12.2
  • Bleach Version: 6.1.0

To Reproduce

Steps to reproduce the behavior:

>>> import bleach
>>> bleach.clean('<t <a></a> <a></a>')
'&lt;t &lt;a&gt; <a></a>'

Expected behavior

>>> bleach.clean('<t <a></a> <a></a>')
'&lt;t <a></a> <a></a>'

Additional context

The above error does not occur for non-alpha characters:

>>> bleach.clean('<5 <a></a> <a></a>')
'&lt;5 <a></a> <a></a>'
@willkg
Member

willkg commented Jun 17, 2024

The test cases are helpful. Does this issue come up often? If so, what does the corpus look like?

@fayyazul-centaurlabs
Author

Sorry for the late reply, I missed your comment somehow. But yes, this issue did come up often. The corpus was a collection of science articles, which quite often use a less-than sign followed by a letter as a variable. I ultimately worked around it by using a regex to replace the angle brackets of HTML tags, escaping any remaining angle brackets, then undoing the replacement. Not ideal (and it would probably produce errors in general), but it worked for my specific situation.
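For anyone who wants to try something similar, here is a minimal sketch of that kind of workaround. It is not the code actually used: the `ALLOWED_TAGS` tuple, the private-use placeholder characters, and the function name `escape_stray_brackets` are all assumptions made for illustration, and it only deals with angle brackets (ampersands and attribute values containing `<` or `>` are not handled).

```python
import re

# Hypothetical allowlist; the comment above does not say which tags were kept.
ALLOWED_TAGS = ("a", "b", "i", "em", "strong")

# Matches an opening or closing tag whose name is in the allowlist.
# Group 1 captures everything between the angle brackets.
TAG_RE = re.compile(r"<(/?(?:%s)\b[^<>]*)>" % "|".join(ALLOWED_TAGS))

# Private-use code points stand in for the brackets of recognised tags.
# Assumes the input text never contains these characters.
OPEN, CLOSE = "\ue000", "\ue001"


def escape_stray_brackets(text):
    # 1. Swap the brackets of recognised tags for placeholders.
    protected = TAG_RE.sub(lambda m: OPEN + m.group(1) + CLOSE, text)
    # 2. Escape every angle bracket that is left over.
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # 3. Put the real brackets back.
    return escaped.replace(OPEN, "<").replace(CLOSE, ">")


print(escape_stray_brackets("<t <a></a> <a></a>"))
```

The result would then be handed to `bleach.clean` as usual; the sketch only pre-processes the text and does not replace sanitizing.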

@willkg
Member

willkg commented Oct 25, 2024

This is a really complicated case. I don't know if we're ever going to get this right.

<t <a></a> <a></a>
   ^

I think the parser moves forward until it hits an invalid-character-in-attribute-name at ^. Then it switches from parsing a tag to parsing character data and slurps up <t <a> as character data:

{'type': Characters, 'data': '<t <a>'}

Bleach will then escape that. Then it slurps up the </a> and I think it drops it, maybe because it's a closing tag with no matching opening tag. Then it moves on as you would expect.

I'll have to think about how Bleach can figure out that <t followed by an open tag should be treated as Characters and then a StartTag such that the internal state is correct.
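One way to tinker with the idea outside of Bleach is a pre-pass over the input text. The sketch below is only a heuristic and is not how Bleach or its parser works: the name `escape_unclosed_lt` and the regex are assumptions for illustration. It escapes a `<` that is followed by a letter and then runs into another `<` before any `>` closes it, so the parser would see character data followed by an ordinary start tag. It will misfire on tags whose quoted attribute values contain `<`.

```python
import re

# A "<" followed by a letter that reaches another "<" before any ">".
# The lookahead leaves everything after the first "<" untouched.
UNCLOSED_LT_RE = re.compile(r"<(?=[A-Za-z][^<>]*<)")


def escape_unclosed_lt(text):
    # Replace only the stray "<" with its entity; real tags are left alone.
    return UNCLOSED_LT_RE.sub("&lt;", text)


print(escape_unclosed_lt("<t <a></a> <a></a>"))
```

A `<t` at the very end of the input, with no later `<`, is not matched by this pattern and would still need separate handling.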

I'll keep this open in case someone else wants to tinker with it and/or I find some more free time, but I don't think I'm going to get further with it today.
