Add PaddleOCR and Refactor Text Extraction #745
Looks generally OK to me. I'd try to avoid calling PdfMiner an OCR model in the code, just to avoid confusion, but I get that it serves a similar purpose.
lib/sycamore/sycamore/tests/integration/transforms/test_partition.py
@@ -60,9 +60,6 @@ def _can_retry(e: BaseException) -> bool:
    return False


pdf_miner_cache = DiskCache(str(Path.home() / ".sycamore/PDFMinerCache"))
Interesting dilemma here. Changing to page-at-a-time makes it harder to cache PdfMiner output. Yet, PdfMiner still represents a significant chunk of work that would benefit from caching. Is there a way to stream PdfMiner output into the cache and stream it out? The only approach that comes to mind is disk spooling, which I expect to be faster than PdfMiner itself.
Agree that there isn't a clear solution here. I think we should wait on this until after this PR, and decide on an approach for caching for both OCR and PDFMiner.
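For whenever this gets revisited, the disk-spooling idea raised above could look roughly like the sketch below. It is only an illustration under stated assumptions, not what this PR or Sycamore implements: `cached_pages`, `extract_pages`, and the one-JSON-line-per-page layout are hypothetical names and choices, and it assumes the per-page extractor output is JSON-serializable.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator


def cached_pages(
    pdf_bytes: bytes,
    extract_pages: Callable[[bytes], Iterable[list]],  # hypothetical page-at-a-time extractor
    cache_dir: Path,
) -> Iterator[list]:
    """Yield per-page extractor output, spooling it to disk as JSON lines."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(pdf_bytes).hexdigest()
    spool = cache_dir / f"{key}.jsonl"

    if spool.exists():
        # Cache hit: stream pages back one line at a time.
        with spool.open("r", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
        return

    # Cache miss: spool to a temp file so a partial run never looks complete.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for page in extract_pages(pdf_bytes):
                f.write(json.dumps(page) + "\n")
                yield page
        # Publish the spool file only after every page was written.
        os.replace(tmp_name, spool)
    finally:
        # Clean up if extraction failed or the consumer stopped early.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
```

Keying on a hash of the document bytes keeps the page-at-a-time interface intact: the caller still iterates pages lazily, and a cache entry only becomes visible once a full pass over the document has finished.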
It's a Text Extractor, not an OCR Model, in the code. Only the OCR models are referred to as the latter.
I think I'm happy with this. Just want to understand what we are setting the defaults to.
One minor comment, but otherwise I think this looks okay.
A number of comments. I read it in detail.
per_element_ocr
flag.