[Bug]: getTextContent() fails with Bad encoding in flate stream (with test case) #19609

Open
mikethea1 opened this issue Mar 5, 2025 · 4 comments

Comments

@mikethea1

Attach (recommended) or Link to PDF file

Corrupted PDF.pdf

Note: I saw that #11207 was closed due to lack of a publicly available test case. Hopefully this file can help!

Web browser and its version

nodejs 20.18.1

Operating system and its version

Windows 11

PDF.js version

pdfjs-dist 4.10.38

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

// getDocument() returns a loading task; await its `promise` to get the document.
const pdfDocument = await pdfjslib.getDocument(...).promise;
const page = await pdfDocument.getPage(16); // or other pages (see "What went wrong?")
const textContent = await page.getTextContent({ includeMarkedContent: false }); // throws here

What is the expected behavior?

While this PDF opens in Chrome, not all pages render. It is definitely corrupted. That said, there seemed to be some interest in handling at least the flate stream error more gracefully, so I figured it was worth filing.

For my use case, I'd love it if PDF.js did not choke in these cases and instead yielded a page with whatever detail about the page was available (e.g. falling back to blank), ideally with a flag on the page object letting me know whether errors occurred.

I understand that this might not be the goal of the library (at least not for all of these issues).
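
To make the request concrete, here is a rough caller-side sketch of the behavior I have in mind. It is only an illustration: `extractTextTolerantly` and the `{ pageNumber, text, error }` result shape are hypothetical names of my own, not part of PDF.js, and the sketch assumes that a failure on one page leaves the document usable for the remaining pages.

```javascript
// Hypothetical helper (not a PDF.js API): extract text page by page and
// record failures instead of letting one bad page abort the whole run.
async function extractTextTolerantly(pdfDocument) {
  const results = [];
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    try {
      const page = await pdfDocument.getPage(pageNumber);
      const textContent = await page.getTextContent({
        includeMarkedContent: false,
      });
      // Join the text items into one string per page.
      const text = textContent.items.map((item) => item.str).join(" ");
      results.push({ pageNumber, text, error: null });
    } catch (error) {
      // Fall back to a blank page and keep the error as the "flag".
      results.push({ pageNumber, text: "", error });
    }
  }
  return results;
}
```

Calling `await extractTextTolerantly(pdfDocument)` and filtering on `error !== null` would then list the damaged pages. This works around the problem in user code; it does not recover any text from the pages that fail.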

What went wrong?

Processing this file in PDF.js, I see a number of errors:

  • getTextContent() on pages 16, 87, and 101 fails with UnknownErrorException: Bad encoding in flate stream
  • getTextContent() on page 32 fails with UnknownErrorException: Bad (uncompressed) XRef entry: 101R
  • getPage() on pages 33-40 and 91-100 fails with UnknownErrorException: Illegal character: 41

Link to a viewer

No response

Additional context

No response

@Snuffleupagus
Collaborator

This is a really corrupt PDF document, and note that even Adobe Reader (i.e. the PDF reference implementation) cannot open and render all pages correctly.
Hence it's not clear, at least to me, that it's entirely meaningful to try and "improve" things here since it'll never be perfect given that the document itself is broken.

@mikethea1
Author

@Snuffleupagus I hear you, but I wonder if it could at least help resolve #11207 since from the conversation there it seemed that there was appetite to resolve the flate stream issue given an available PDF to work from.

@Snuffleupagus
Collaborator

> but I wonder if it could at least help resolve #11207 since from the conversation there it seemed that there was appetite to resolve the flate stream issue given an available PDF to work from.

Sorry, but I really don't understand how that old issue is relevant to the current discussion. Please note that there's not going to be just a single way, but rather any number of ways, in which a /FlateDecode stream could be corrupted to make it unreadable.
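
As a rough illustration of that point, the sketch below damages one small zlib stream (the format used by /FlateDecode) in three different ways using Node's built-in `zlib` module. The sample content and the `tryInflate` helper are made up for this example; none of this is PDF.js code, and the three cases are far from exhaustive.

```javascript
const zlib = require("node:zlib");

// Made-up sample data standing in for a page content stream.
const original = Buffer.from("BT /F1 12 Tf (Hello, PDF) Tj ET\n".repeat(20));
const compressed = zlib.deflateSync(original);

// Hypothetical helper: attempt to inflate and report success or the error.
function tryInflate(label, bytes) {
  try {
    const output = zlib.inflateSync(bytes);
    return { label, ok: true, length: output.length };
  } catch (error) {
    return { label, ok: false, message: error.message };
  }
}

// 1. Damaged zlib header: the first byte no longer passes the header check.
const badHeader = Buffer.from(compressed);
badHeader[0] = 0x00;

// 2. Truncated stream: the second half of the data is missing.
const truncated = compressed.subarray(0, Math.floor(compressed.length / 2));

// 3. Damaged trailer: the data is intact but the final checksum byte is wrong.
const badChecksum = Buffer.from(compressed);
badChecksum[badChecksum.length - 1] ^= 0xff;

const report = [
  tryInflate("intact", compressed),
  tryInflate("bad header", badHeader),
  tryInflate("truncated", truncated),
  tryInflate("bad checksum", badChecksum),
];

for (const entry of report) {
  console.log(entry);
}
```

Each damaged variant fails for a different reason and would need a different recovery strategy (for example, the third still contains all of the data, while the second has lost half of it), so a fix for one kind of corruption says little about the others.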

@mikethea1
Author

> Sorry, but I really don't understand how that old issue is relevant to the current discussion

When I hit the issue and googled the error, the old issue came up. What I saw there was a lot of back and forth about the need for a publicly available test case, so I figured there might be interest in investigating this file as one. I understand that this file has multiple issues and might be considered "too damaged" to be worth investigating.

> Please note that there's not going to be just a single way, but rather any number of ways, in which a /FlateDecode stream could be corrupted to make it unreadable.

Fair. The only thing I can offer is that the first two pages with this error DO render successfully in Chrome, which suggests that at least one other implementation has worked around the particular error this file has on those pages. So the manifestation here is likely not unique to this document.
