Adding new docs for Aryn Partitioning Service. Added a gentle introduction to APS docs and rearranged some of the existing APS docs. #660

Merged

AbhijitP-009 merged 6 commits into main from abhijit-docs

Aug 8, 2024

Contributor

AbhijitP-009 commented Aug 8, 2024

No description provided.

AbhijitP-009 added 3 commits

August 8, 2024 09:59


          Adding new docs for Aryn Partitioning Service.

5c61977

Added a gentle introduction to APS docs and rearranged some of the existing APS docs.


          Adding more docs

84896f8


          Fixing the docs to get rid of some comments

40ec3b7

HenryL27 approved these changes

Collaborator

HenryL27 left a comment

A couple tweaks but lg

docs/source/aryn_cloud/accessing_the_partitioning_service.rst Outdated

+              * ``use_ocr``: It defaults to ``false``, where the partitioner attempts to directly extract the text from the underlying PDF using PDFMiner.  If ``true``, the partitioner detects and extracts text using Tesseract, an open source OCR library.
+              * ``extract_table_structure``: If ``true``, the partitioner runs a table extraction model separate from the segmentation model in order to extract cells from regions of the document identified as tables.
+              * ``extract_images``: If ``true``, the partitioner crops each region identified as an image and attaches it to the associated ``ImageElement``. This can later be fed into the ``SummarizeImages`` transform when used within Sycamore.
+              * ``selected_pages``: You can specify a page (like ``[11]`` ), a page range (like ``[[25,30]]`` ), or a combination of both (like ``[[11, [25,30]]`` ) of your PDF to process. The first page of the PDF is ``1``, not ``0``.

Collaborator

HenryL27 Aug 8, 2024

"like [[11, [25,30]]" has an unclosed bracket

Contributor Author

AbhijitP-009 Aug 8, 2024

lol nice catch. Henry the python interpreter.
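For reference, a minimal sketch of how a `selected_pages` value like the ones in the diff could be expanded into individual page numbers. This is not the service's own code: the helper name `expand_selected_pages` is hypothetical, and it assumes that the balanced form of the combined example is `[11, [25,30]]` and that ranges include both endpoints, which the docs under review do not state.

```python
from typing import List, Union

# A single page (int) or a [start, end] pair, mirroring the examples in the diff.
PageSpec = Union[int, List[int]]


def expand_selected_pages(selected_pages: List[PageSpec]) -> List[int]:
    """Expand a selected_pages-style list into sorted, 1-indexed page numbers.

    Hypothetical helper; ranges are ASSUMED to be inclusive on both ends.
    """
    pages = set()
    for item in selected_pages:
        if isinstance(item, int):
            start, end = item, item
        elif (
            isinstance(item, list)
            and len(item) == 2
            and all(isinstance(x, int) for x in item)
        ):
            start, end = item
        else:
            raise ValueError(f"expected a page or a [start, end] pair, got {item!r}")
        # The docs say the first page is 1, not 0.
        if start < 1 or end < start:
            raise ValueError(f"invalid page selection: {item!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


single = expand_selected_pages([11])
page_range = expand_selected_pages([[25, 30]])
combined = expand_selected_pages([11, [25, 30]])
```

Running a literal through a check like this catches malformed values (for example a zero page or a three-element range) before a request is sent.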

docs/source/aryn_cloud/gentle_introduction.md Outdated

		@@ -0,0 +1,220 @@
		## A Gentle Introduction to the Aryn Partitioning Service
		You can use the Aryn Partitioning Service to easily chunk and extract data from complex PDFs. The Partitioning Service can extract paragraphs, tables and images and returns detailed information about the components it has just identified in a JSON object. The following two sections will walk you through two examples where we segment PDF documents and extract a table and an image from those documents using the python aryn-sdk.

Collaborator

HenryL27 Aug 8, 2024

double space at 'tables and images' I think.
Might rephrase that sentence to end with a gerundive/dependent clause rather than an independent clause, i.e. 'and returns' -> ', returning'. Just to avoid and _ and.

Contributor Author

AbhijitP-009 Aug 8, 2024

ack

docs/source/aryn_cloud/gentle_introduction.md Outdated


		Let’s focus on the following code that makes a call to the Aryn Partitioning Service:

		```

Collaborator

HenryL27 Aug 8, 2024

if you do ```python then md will highlight it all pretty-like

docs/source/aryn_cloud/gentle_introduction.md Outdated

+              ## param extract_table_structure (boolean): extract tables and their structural content. default: False
+              ## param use_ocr (boolean): extract text using an OCR model instead of extracting embedded text in PDF. default: False
+              ## returns: JSON object with elements representing information inside the PDF
+              partitioned_file = partition_file(curr_file, aryn_api_key, extract_table_structure=True, use_ocr=True)

Collaborator

HenryL27 Aug 8, 2024

what's curr_file?

Contributor Author

AbhijitP-009 Aug 8, 2024

hmm for some reason i thought "file" was a keyword in python (colab kept highlighting it as such). I'll change it to "file"

Also I didn't want to comment every single parameter either because I felt like it distracted from the actual line of code where we call "partition_file"
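As a quick check on the naming question: in Python 3, `file` is neither a reserved keyword nor a builtin (it was a builtin type in Python 2, which is likely why some editors still highlight it). A small sketch using only the standard library:

```python
import builtins
import keyword

# keyword.iskeyword reports whether a string is a reserved word in this Python.
file_is_keyword = keyword.iskeyword("file")

# Python 3 removed the Python 2 builtin `file` type, so the name is free to use.
file_is_builtin = hasattr(builtins, "file")

# For comparison, `open` is a builtin, so reusing that name would shadow it.
open_is_builtin = hasattr(builtins, "open")
```

Under Python 3 the first two values are `False` and the last is `True`, so `file` is safe as a variable name.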

Collaborator

HenryL27 Aug 8, 2024

Oh it's defined in the notebook? I'm just seeing a random param called curr_file and not knowing where it's defined or what kind of thing it is. Might be worth adding a

curr_file = open('my-document.pdf', 'rb')

in the block
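One way the suggestion above could look in the docs is sketched below. To keep the example self-contained, the partition function is passed in as an argument; the wrapper name `partition_pdf` and the file name are hypothetical, and the call shape `(file, api_key, **options)` is assumed from the snippet under review rather than taken from the SDK reference.

```python
from typing import Any, Callable


def partition_pdf(
    path: str,
    api_key: str,
    partition_fn: Callable[..., Any],
    **options: Any,
) -> Any:
    """Open a PDF in binary mode and hand the file object to partition_fn.

    Hypothetical wrapper: partition_fn is expected to behave like the
    partition_file call quoted in the diff (file, API key, keyword options).
    """
    # The context manager closes the file even if the call raises.
    with open(path, "rb") as pdf_file:
        return partition_fn(pdf_file, api_key, **options)
```

With the SDK's `partition_file` imported, a call might read `partition_pdf("my-document.pdf", aryn_api_key, partition_file, extract_table_structure=True, use_ocr=True)`. Using `with` also avoids leaving the file handle open, which a bare `open(...)` in the snippet would.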

AbhijitP-009 added 3 commits

August 8, 2024 12:13


          Addressing comments

98ef3b0


          Fixing bracket

b8a886e


          more fixes

1efc16b

AbhijitP-009 merged commit da5b49c into main

9 of 10 checks passed

AbhijitP-009 deleted the abhijit-docs branch

August 8, 2024 20:22
