Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PaddleOCR and Refactor Text Extraction #745

Merged
merged 62 commits into from
Oct 2, 2024
Merged

Conversation

karanataryn
Copy link
Contributor

@karanataryn karanataryn commented Aug 30, 2024

  • Refactors the Text Extraction interface to contain both PDFMiner and OCR, allowing them to act as part of the pipeline per page or supplement text to the whole document.
  • Keeps the old OCR implementation under the per_element_ocr flag.
  • Adds PaddleOCR, Tesseract, EasyOCR, and Legacy (Easy + Tesseract) as OCR options.
  • Refactors code to make it cleaner and changes necessary documentation.

@karanataryn karanataryn changed the title Refactor and Add Options for OCR Add PaddleOCR and Refactor Text Extraction Sep 9, 2024
@karanataryn karanataryn requested a review from bsowell September 20, 2024 22:39
Copy link
Contributor

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally OK to me. I'd try to avoid calling PdfMiner an OCR model in the code, just to avoid confusion, but I get that it serves a similar purpose.

@@ -60,9 +60,6 @@ def _can_retry(e: BaseException) -> bool:
return False


pdf_miner_cache = DiskCache(str(Path.home() / ".sycamore/PDFMinerCache"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting dilemma here. Changing to page-at-a-time makes it harder to cache PdfMiner output. Yet, PdfMiner still represents a significant chunk of work that would benefit from caching. Is there a way to stream PdfMiner output into the cache and stream it out? The only approach that comes to mind is disk spooling, which I expect to be faster than PdfMiner itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that there isn't a clear solution here. I think we should wait on this until after this PR, and decide on an approach for caching for both OCR and PDFMiner.

@karanataryn
Copy link
Contributor Author

Looks generally OK to me. I'd try to avoid calling PdfMiner an OCR model in the code, just to avoid confusion, but I get that is serves a similar purpose.

It's a Text Extractor not an OCR Model in the code. Only the OCR Models are called as the latter.

Copy link
Contributor

@bsowell bsowell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm happy with this. Just want to understand what we are setting the defaults to.

Copy link
Contributor

@bsowell bsowell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment, but otherwise I think this looks okay.

@karanataryn karanataryn enabled auto-merge (squash) October 2, 2024 18:51
@karanataryn karanataryn merged commit e6493d8 into main Oct 2, 2024
10 of 11 checks passed
@karanataryn karanataryn deleted the ksampath/change-ocr branch October 2, 2024 18:59
Copy link
Contributor

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A number of comments. I read it in detail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants