Adding new docs for Aryn Partitioning Service. Added a gentle introduction to APS docs and rearranged some of the existing APS docs. #660
Added a gentle introduction to APS docs and rearranged some of the existing APS docs.
A couple tweaks but lg
* ``use_ocr``: It defaults to ``false``, where the partitioner attempts to directly extract the text from the underlying PDF using PDFMiner. If ``true``, the partitioner detects and extracts text using Tesseract, an open source OCR library.
* ``extract_table_structure``: If ``true``, the partitioner runs a table extraction model separate from the segmentation model in order to extract cells from regions of the document identified as tables.
* ``extract_images``: If ``true``, the partitioner crops each region identified as an image and attaches it to the associated ``ImageElement``. This can later be fed into the ``SummarizeImages`` transform when used within Sycamore.
* ``selected_pages``: You can specify a page (like ``[11]`` ), a page range (like ``[[25,30]]`` ), or a combination of both (like ``[[11, [25,30]]`` ) of your PDF to process. The first page of the PDF is ``1``, not ``0``.
"like ``[[11, [25,30]]``" has an unclosed bracket
lol nice catch. Henry the python interpreter.
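For reference, the ``selected_pages`` format discussed in this thread can be sketched as a small validator. This is only an illustration, not the service's own parsing: `expand_selected_pages` is a hypothetical helper, and treating `[25, 30]` as an inclusive range is an assumption the docs under review do not state.

```python
from typing import List, Union

# A selected_pages entry is either a single page number or a [start, end] pair.
PageSpec = Union[int, List[int]]


def expand_selected_pages(selected_pages: List[PageSpec]) -> List[int]:
    """Expand a selected_pages-style list into sorted, 1-indexed page numbers.

    Hypothetical helper for illustration only.
    Assumption: a two-item list such as [25, 30] is an inclusive range.
    """
    pages = set()
    for entry in selected_pages:
        if isinstance(entry, int):
            start = end = entry
        elif (
            isinstance(entry, list)
            and len(entry) == 2
            and all(isinstance(p, int) for p in entry)
        ):
            start, end = entry
        else:
            raise ValueError(f"expected a page number or [start, end], got {entry!r}")
        # Pages are 1-indexed, so 0 and negative numbers are rejected.
        if start < 1 or end < start:
            raise ValueError(f"invalid page or range: {entry!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


# The corrected "combination" example: page 11 plus pages 25 through 30.
print(expand_selected_pages([11, [25, 30]]))
```

Written with balanced brackets, the combined example is a single list holding a page and a range, which is the shape the validator accepts.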
@@ -0,0 +1,220 @@
## A Gentle Introduction to the Aryn Partitioning Service
You can use the Aryn Partitioning Service to easily chunk and extract data from complex PDFs. The Partitioning Service can extract paragraphs, tables and images and returns detailed information about the components it has just identified in a JSON object. The following two sections will walk you through two examples where we segment PDF documents and extract a table and an image from those documents using the python aryn-sdk.
double space at 'tables and images' I think.
Might rephrase that sentence to end with a gerundive/dependent clause rather than an independent clause, i.e. 'and returns' -> ', returning'. Just to avoid and _ and.
ack
Let’s focus on the following code that makes a call to the Aryn Partitioning Service:

```
if you do ```python then md will highlight it all pretty-like
## param extract_table_structure (boolean): extract tables and their structural content. default: False
## param use_ocr (boolean): extract text using an OCR model instead of extracting embedded text in PDF. default: False
## returns: JSON object with elements representing information inside the PDF
partitioned_file = partition_file(curr_file, aryn_api_key, extract_table_structure=True, use_ocr=True)
what's curr_file?
hmm for some reason i thought "file" was a keyword in python (colab kept highlighting it as such). I'll change it to "file"
Also I didn't want to comment every single parameter either because I felt like it distracted from the actual line of code where we call "partition_file"
Oh it's defined in the notebook? I'm just seeing a random param called curr_file and not knowing where it's defined or what kind of thing it is. Might be worth adding a
curr_file = open('my-document.pdf', 'rb')
in the block
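One way to act on this suggestion is to wrap the call so the reader sees where the file object comes from. The sketch below is not the docs' final code: `partition_pdf` is a hypothetical helper, and `partition_fn` stands in for the SDK's `partition_file`, assumed to take the open file and the API key positionally as in the snippet under review.

```python
from typing import Any, Callable


def partition_pdf(
    path: str,
    aryn_api_key: str,
    partition_fn: Callable[..., Any],
    **options: Any,
) -> Any:
    """Open a PDF in binary mode and hand it to a partitioning function.

    Hypothetical helper for illustration. partition_fn is assumed to accept
    (file, api_key, **options), mirroring the call shown in the diff.
    """
    # 'rb' matches the reviewer's suggestion; the with block closes the file
    # once the call returns, even if partition_fn raises.
    with open(path, "rb") as curr_file:
        return partition_fn(curr_file, aryn_api_key, **options)
```

With the SDK installed, the call would look something like `partition_pdf('my-document.pdf', aryn_api_key, partition_file, extract_table_structure=True, use_ocr=True)`. Passing the function in keeps the sketch runnable without network access or an API key.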