Adding new docs for Aryn Partitioning Service. Added a gentle introduction to APS docs and rearranged some of the existing APS docs. #660

Merged
AbhijitP-009 merged 6 commits into main from abhijit-docs
Aug 8, 2024

AbhijitP-009
Contributor

No description provided.

Added a gentle introduction to APS docs and rearranged some of the existing APS docs.
Collaborator

HenryL27 left a comment

A couple tweaks but lg

* ``use_ocr``: It defaults to ``false``, where the partitioner attempts to directly extract the text from the underlying PDF using PDFMiner. If ``true``, the partitioner detects and extracts text using Tesseract, an open source OCR library.
* ``extract_table_structure``: If ``true``, the partitioner runs a table extraction model separate from the segmentation model in order to extract cells from regions of the document identified as tables.
* ``extract_images``: If ``true``, the partitioner crops each region identified as an image and attaches it to the associated ``ImageElement``. This can later be fed into the ``SummarizeImages`` transform when used within Sycamore.
* ``selected_pages``: You can specify a page (like ``[11]`` ), a page range (like ``[[25,30]]`` ), or a combination of both (like ``[[11, [25,30]]`` ) of your PDF to process. The first page of the PDF is ``1``, not ``0``.
Collaborator

"like [[11, [25,30]]" has an unclosed bracket

Contributor Author

lol nice catch. Henry the python interpreter.
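
For reference, the bracket problem is easy to confirm mechanically. The sketch below checks whether a `selected_pages` example parses as a Python literal and expands a well-formed value into a flat page list. `expand_selected_pages` and `parses_as_literal` are hypothetical helpers written for this illustration, not part of the aryn-sdk, and treating ranges as inclusive on both ends is an assumption the docs under review do not state.

```python
import ast


def parses_as_literal(text):
    """Return True if `text` is a well-formed Python literal."""
    try:
        ast.literal_eval(text)
    except (SyntaxError, ValueError):
        return False
    return True


def expand_selected_pages(selected_pages):
    """Expand a value such as [11, [25, 30]] into a flat list of page numbers.

    Hypothetical helper. Assumes ranges are inclusive on both ends; pages are
    1-indexed, as the docs state.
    """
    pages = []
    for entry in selected_pages:
        if isinstance(entry, int):
            start = end = entry
        elif isinstance(entry, (list, tuple)) and len(entry) == 2:
            start, end = entry
        else:
            raise ValueError(f"expected a page or a [start, end] pair, got {entry!r}")
        if start < 1 or end < start:
            raise ValueError(f"pages start at 1 and ranges must ascend: {entry!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parses_as_literal("[[11, [25,30]]"))    # the example from the diff -> False
print(parses_as_literal("[11, [25,30]]"))     # one plausible fix -> True
print(expand_selected_pages([11, [25, 30]]))  # [11, 25, 26, 27, 28, 29, 30]
```

Dropping the extra leading bracket is only one plausible reading of the intended example; adding a closing bracket would also parse, but would nest the single page inside a list.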

@@ -0,0 +1,220 @@
## A Gentle Introduction to the Aryn Partitioning Service
You can use the Aryn Partitioning Service to easily chunk and extract data from complex PDFs. The Partitioning Service can extract paragraphs, tables and images and returns detailed information about the components it has just identified in a JSON object. The following two sections will walk you through two examples where we segment PDF documents and extract a table and an image from those documents using the python aryn-sdk.
Collaborator

double space at 'tables and images' I think.
Might rephrase that sentence to end with a gerundive/dependent clause rather than an independent clause, i.e. 'and returns' -> ', returning'. Just to avoid and _ and.

Contributor Author

ack


Let’s focus on the following code that makes a call to the Aryn Partitioning Service:

```
Collaborator

if you do ```python then md will highlight it all pretty-like

## param extract_table_structure (boolean): extract tables and their structural content. default: False
## param use_ocr (boolean): extract text using an OCR model instead of extracting embedded text in PDF. default: False
## returns: JSON object with elements representing information inside the PDF
partitioned_file = partition_file(curr_file, aryn_api_key, extract_table_structure=True, use_ocr=True)
Collaborator

what's curr_file?

Contributor Author

hmm for some reason i thought "file" was a keyword in python (colab kept highlighting it as such). I'll change it to "file"

Also I didn't want to comment every single parameter either because I felt like it distracted from the actual line of code where we call "partition_file"
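
A quick way to settle the keyword question, as a small sketch using only the standard library: `file` is not a reserved word, and on Python 3 it is not a builtin either. (Python 2 did have a builtin `file` type, which is possibly why some editors still highlight the name.)

```python
import builtins
import keyword

# `file` is an ordinary identifier, so it is safe to use as a variable name.
file_is_keyword = keyword.iskeyword("file")
# Python 3 removed the Python 2 builtin `file` type.
file_is_builtin = hasattr(builtins, "file")

print(file_is_keyword)  # False
print(file_is_builtin)  # False on Python 3
```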

Collaborator

Oh it's defined in the notebook? I'm just seeing a random param called curr_file and not knowing where it's defined or what kind of thing it is. Might be worth adding a

curr_file = open('my-document.pdf', 'rb')

in the block
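
One way to make the snippet self-explanatory is to show the file being opened right next to the call. The sketch below is an illustration under assumptions, not the code that was merged: `partition_pdf` is a hypothetical wrapper, and the partition function is passed in as `partition_fn` so the example runs without the SDK or an API key. The positional and keyword arguments mirror the `partition_file` call quoted in the diff.

```python
from typing import Any, Callable


def partition_pdf(path: str, aryn_api_key: str, partition_fn: Callable[..., Any]) -> Any:
    """Open the PDF at `path` in binary mode and hand it to `partition_fn`.

    Hypothetical wrapper: `partition_fn` stands in for the SDK's
    `partition_file`, called the same way as in the docs snippet.
    """
    # The context manager closes the file handle once partitioning returns.
    with open(path, "rb") as curr_file:
        return partition_fn(
            curr_file,
            aryn_api_key,
            extract_table_structure=True,
            use_ocr=True,
        )
```

With the aryn-sdk installed, `partition_fn` would presumably be the `partition_file` function the docs already use, for example `partition_pdf("my-document.pdf", aryn_api_key, partition_file)`.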

AbhijitP-009 merged commit da5b49c into main Aug 8, 2024
9 of 10 checks passed
AbhijitP-009 deleted the abhijit-docs branch August 8, 2024 20:22