[Bug]: Auto-linking cannot deal with some cases of line-breaks #19463

Open
Snuffleupagus opened this issue Feb 10, 2025 · 3 comments

@Snuffleupagus
Collaborator

Attach (recommended) or Link to PDF file

https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#page=2

Web browser and its version

N/A

Operating system and its version

N/A

PDF.js version

master, or any version after PR #19110

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

  1. Load https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#page=2
  2. Note that an inferred link contains: http://www.adobe.com/devnet/pdf/

What is the expected behavior?

That the "full" link, i.e. http://www.adobe.com/devnet/pdf/pdf_reference.html, should be found.

Note: Acrobat Reader is able to detect the full URL.

What went wrong?

The line-break is causing the URL to be truncated.
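A minimal sketch of the truncation, assuming the page text reaches the link detector with a line break right after the trailing slash. The pattern and the `find_links` helper are hypothetical simplifications for illustration, not the matcher PDF.js actually uses:

```python
import re

# Hypothetical, simplified URL pattern (an assumption for illustration only).
URL_PATTERN = re.compile(r"https?://\S+")

# Assumed text layout: the URL is split across two lines after "pdf/".
page_text = "http://www.adobe.com/devnet/pdf/\npdf_reference.html"


def find_links(text):
    """Return every substring of `text` that matches the simplified pattern."""
    return URL_PATTERN.findall(text)


links = find_links(page_text)
print(links)
```

Because the line break counts as whitespace, the match ends at `pdf/` and the `pdf_reference.html` part on the next line is never considered.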

Link to a viewer

No response

Additional context

No response

@ryzokuken
Collaborator

This happens because of how that PDF document is structured: the two spans that contain the two parts of the link are separated in the DOM. It could be made to work by changing the logic that flattens the content of a page, but that would directly conflict with the test for https://github.com/mozilla/pdf.js/blob/master/test/pdfs/bug1019475_2.pdf, which checks that the broken link isn't detected.

@nicolo-ribaudo
Contributor

We need to be careful about this, since there are many valid cases of URLs that end at the end of a line and might look like the start of a multiline URL:

Mozilla: www.mozilla.org/en-US/
Google: www.google.com

Here we should not detect www.mozilla.org/en-US/Google as a URL.
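The false positive can be reproduced with a small sketch. It assumes a naive approach that concatenates consecutive lines before matching; the pattern and both helpers (`links_per_line`, `links_joined`) are hypothetical and only illustrate the risk, they are not taken from PDF.js:

```python
import re

# Hypothetical, simplified URL pattern (an assumption for illustration only).
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

lines = ["Mozilla: www.mozilla.org/en-US/", "Google: www.google.com"]


def _matches(text):
    # Strip trailing punctuation that the greedy pattern may have swallowed.
    return [m.rstrip(".,:;") for m in URL_PATTERN.findall(text)]


def links_per_line(lines):
    """Match each line on its own, so a URL never crosses a line break."""
    return [link for line in lines for link in _matches(line)]


def links_joined(lines):
    """Naively concatenate the lines first, so a URL may span a line break."""
    return _matches("".join(lines))


print(links_per_line(lines))
print(links_joined(lines))
```

Matching per line keeps the two URLs separate, while joining the lines glues `Google` onto the first one. The same joining step is what would recover a URL that really is split across lines, which is why the two goals conflict.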

@Snuffleupagus
Collaborator Author

Given the risk of false positives mentioned above, is this still something that we want to try to fix, or should we first wait and see how common this problem is?
