[Bug]: getTextContent() fails with Bad encoding in flate stream (with test case) #19609
Comments
This is a really corrupt PDF document, and note that even Adobe Reader (i.e. the PDF reference implementation) cannot open and render all pages correctly.
@Snuffleupagus I hear you, but I wonder if it could at least help resolve #11207 since from the conversation there it seemed that there was appetite to resolve the flate stream issue given an available PDF to work from.
Sorry, but I really don't understand how that old issue is relevant to the current discussion. Please note that there's not going to be just a single way, but rather any number of ways, in which a /FlateDecode stream could be corrupted to make it unreadable.
When I hit the issue and googled the error, the old issue came up. What I saw there was a lot of back and forth about the need for a publicly available test case, so I figured there might be interest in investigating this file as one. I understand that this file has multiple issues and might be considered "too damaged" to be worthy of investigation.
Fair. The only thing I can offer is that the first two pages with this error DO render successfully in Chrome, so that suggests that at least one other system has worked around the particular error this file has on those pages. So likely the manifestation here is not entirely unique to this document.
Attach (recommended) or Link to PDF file
Corrupted PDF.pdf
Note: I saw that #11207 was closed due to lack of a publicly available test case. Hopefully this file can help!
Web browser and its version
nodejs 20.18.1
Operating system and its version
Windows 11
PDF.js version
pdfjs-dist 4.10.38
Is the bug present in the latest PDF.js version?
Yes
Is a browser extension
No
Steps to reproduce the problem
What is the expected behavior?
While this PDF opens in Chrome, not all pages render. It is definitely corrupted. That said, there seemed to be some interest in handling at least the flate stream error more gracefully so I figured it was worth filing.
For my use case, I'd love it if PDF.js did not choke in these cases and instead yielded a page with whatever detail about the page was available (e.g. falling back to blank), ideally with a flag on the page object letting me know whether errors occurred.
I understand that this might not be the goal of the library (at least not for all of these issues).
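To make the request concrete, here is a rough caller-side sketch of the behavior described above. It is not an existing PDF.js feature: `extractTextTolerant` and the shape of its result are hypothetical names for illustration, and `doc` is assumed to be a `PDFDocumentProxy`-like object exposing `numPages` and `getPage(n)`, whose pages expose `getTextContent()`.

```javascript
// Sketch only: extract text page by page and record failures instead of aborting.
// `extractTextTolerant` and the { pageNumber, text, error } shape are hypothetical.
async function extractTextTolerant(doc) {
  const pages = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    try {
      const page = await doc.getPage(pageNumber);
      const content = await page.getTextContent();
      // Text items carry their string in `str`; anything without it is skipped.
      const text = content.items.map((item) => item.str ?? "").join(" ");
      pages.push({ pageNumber, text, error: null });
    } catch (err) {
      // Fall back to a blank page and flag it so the caller can tell what failed.
      const error =
        err instanceof Error ? `${err.name}: ${err.message}` : String(err);
      pages.push({ pageNumber, text: "", error });
    }
  }
  return pages;
}
```

With pdfjs-dist, `doc` would typically come from `await getDocument(...).promise`. A failure at that stage happens before any page is reached, so it would need its own handling, and this wrapper only helps when errors surface per page.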
What went wrong?
Processing this file with PDF.js, I see a number of errors:
UnknownErrorException: Bad encoding in flate stream
UnknownErrorException: Bad (uncompressed) XRef entry: 101R
UnknownErrorException: Illegal character: 41
Link to a viewer
No response
Additional context
No response