[Bug]: getTextContent() fails with Bad encoding in flate stream (with test case) #19609

mikethea1 · 2025-03-05T19:10:27Z

Attach (recommended) or Link to PDF file

Corrupted PDF.pdf

Note: I saw that #11207 was closed due to lack of a publicly available test case. Hopefully this file can help!

Web browser and its version

nodejs 20.18.1

Operating system and its version

Windows 11

PDF.js version

pdfjs-dist 4.10.38

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

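import * as pdfjslib from "pdfjs-dist/legacy/build/pdf.mjs"; // legacy build, intended for Node.js
// getDocument() returns a loading task; the document is available via its .promise
const pdfDocument = await pdfjslib.getDocument(...).promise;
const page = await pdfDocument.getPage(16); // or other pages (see "What went wrong?")
const textContent = await page.getTextContent({ includeMarkedContent: false }); // throws here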

What is the expected behavior?

While this PDF opens in Chrome, not all pages render. It is definitely corrupted. That said, there seemed to be some interest in handling at least the flate stream error more gracefully so I figured it was worth filing.

For my use case, I'd love it if PDF.js did not choke in these cases and instead yielded a page with whatever detail about the page was available (e.g. falling back to blank), ideally with a flag on the page object letting me know whether errors occurred.

I understand that this might not be the goal of the library (at least not for all of these issues).

What went wrong?

Processing this file in PDF.js, I see a number of errors:

getTextContent() on pages 16, 87, and 101 fails with UnknownErrorException: Bad encoding in flate stream
getTextContent() on page 32 fails with UnknownErrorException: Bad (uncompressed) XRef entry: 101R
getPage() on pages 33-40 and 91-100 fails with UnknownErrorException: Illegal character: 41
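
In the meantime, the per-page failures listed above can be contained in calling code. The following is a minimal sketch, assuming only the getPage() and getTextContent() calls from the reproduction steps; the helper name extractTextTolerant and the shape of its result objects are hypothetical and not part of PDF.js:

```javascript
// Hypothetical helper (not a PDF.js API): collect text page by page and
// record failures instead of letting the first corrupt page abort the run.
async function extractTextTolerant(pdfDocument, pageNumbers) {
  const results = [];
  for (const pageNumber of pageNumbers) {
    try {
      const page = await pdfDocument.getPage(pageNumber);
      const textContent = await page.getTextContent({ includeMarkedContent: false });
      results.push({ pageNumber, items: textContent.items, error: null });
    } catch (err) {
      // Fall back to an empty page and keep the error as the "flag".
      results.push({ pageNumber, items: [], error: err });
    }
  }
  return results;
}
```

Called as `await extractTextTolerant(pdfDocument, [16, 32, 33])`, this would be expected to return one entry per requested page, with `error` set on the pages that threw. It only isolates the failures; it does not recover any text from the damaged pages.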

Link to a viewer

No response

Additional context

No response

Snuffleupagus · 2025-03-05T19:21:18Z

This is a really corrupt PDF document, and note that even Adobe Reader (i.e. the PDF reference implementation) cannot open and render all pages correctly.
Hence it's not clear, at least to me, that it's entirely meaningful to try and "improve" things here since it'll never be perfect given that the document itself is broken.

mikethea1 · 2025-03-05T21:05:33Z

@Snuffleupagus I hear you, but I wonder if it could at least help resolve #11207 since from the conversation there it seemed that there was appetite to resolve the flate stream issue given an available PDF to work from.

Snuffleupagus · 2025-03-05T21:20:41Z

> but I wonder if it could at least help resolve #11207 since from the conversation there it seemed that there was appetite to resolve the flate stream issue given an available PDF to work from.

Sorry, but I really don't understand how that old issue is relevant to the current discussion. Please note that there's not going to be just a single way, but rather any number of ways, in which a /FlateDecode stream could be corrupted to make it unreadable.
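
To illustrate why the kind of damage matters, here is a small sketch using Node's built-in zlib module, not PDF.js's own decoder. It assumes one specific form of corruption, a stream cut off halfway, and shows that a lenient decoder can salvage a prefix of the data in that case. The helper names inflateStrict and inflateLenient are made up for this example, and nothing here says what is actually wrong with the streams in the attached file:

```javascript
const zlib = require("zlib");

// Build a zlib-wrapped deflate stream, the same container /FlateDecode uses.
const original = Buffer.from(
  Array.from({ length: 500 }, (_, i) => `line ${i}: some page text\n`).join("")
);
const compressed = zlib.deflateSync(original);

// Simulate one kind of damage: the stream is cut off halfway.
const truncated = compressed.subarray(0, Math.floor(compressed.length / 2));

// Strict decoding: returns null when zlib rejects the input.
function inflateStrict(buf) {
  try {
    return zlib.inflateSync(buf);
  } catch (err) {
    return null;
  }
}

// Lenient decoding: with finishFlush set to Z_SYNC_FLUSH, Node returns
// whatever could be decoded before the input ran out.
function inflateLenient(buf) {
  return zlib.inflateSync(buf, { finishFlush: zlib.constants.Z_SYNC_FLUSH });
}

const strictResult = inflateStrict(truncated); // null for the truncated stream
const partial = inflateLenient(truncated);     // a prefix of the original data
```

Other kinds of damage, such as overwritten bytes in the middle of the stream, generally cannot be handled this way, so a workaround for one broken file says little about the next one.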

mikethea1 · 2025-03-05T21:37:33Z

> Sorry, but I really don't understand how that old issue is relevant to the current discussion

When I hit the issue and googled the error, the old issue came up, and what I saw was a lot of back and forth about the need for a publicly available test case. I figured that there might be interest in investigating this file as a test case. I understand that this file has multiple issues and might be considered "too damaged" to be worthy of investigation.

> Please note that there's not going to be just a single way, but rather any number of ways, in which a /FlateDecode stream could be corrupted to make it unreadable.

Fair. The only thing I can offer is that the first two pages with this error DO render successfully in Chrome, so that suggests that at least one other system has worked around the particular error this file has on those pages. So likely the manifestation here is not entirely unique to this document.

Snuffleupagus added the corrupted-pdf label Mar 5, 2025
