DOC: Minor improvements (#2542)

j-t-1 · stefan6419846 · web-flow · commit ed1d43823cb1 · 2024-03-26T13:38:07.000+01:00
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
diff --git a/docs/dev/pdf-format.md b/docs/dev/pdf-format.md
@@ -1,6 +1,6 @@
 # The PDF Format
 
-It's recommended to look in the PDF specification for details and clarifications.
+It is recommended to look in the PDF specification for details and clarifications.
 This is only intended to give a very rough overview of the format.
 
 ## Overall Structure
@@ -32,7 +32,7 @@ Let's go through it step-by-step:
 
 * `xref` is just a keyword that specifies the start of the xref table.
 * `42` is the numerical ID of the first object in this xref section; `5` is the number of entries in the xref table.
-* Now every object has 3 entries `nnnnnnnnnn ggggg n`: The 10-digit byte offset,
+* Now every object has 3 entries `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
   a 5-digit generation number, and a literal keyword which is either `n` or `f`.
     * `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
       the object is in the file.
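
To make the entry layout described in this hunk concrete, here is a minimal sketch that splits one xref entry into its three fields. The names `XrefEntry` and `parse_xref_entry` and the sample entry are made up for illustration; this is not how pypdf reads the table.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # 10-digit byte offset of the object in the file
    generation: int  # 5-digit generation number
    in_use: bool     # True for the keyword "n" (in use), False for "f" (free)


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one xref entry of the form 'nnnnnnnnnn ggggg n' (hypothetical helper)."""
    offset, generation, keyword = line.split()
    if len(offset) != 10 or len(generation) != 5 or keyword not in ("n", "f"):
        raise ValueError(f"malformed xref entry: {line!r}")
    return XrefEntry(int(offset), int(generation), keyword == "n")


# Sample entry invented for this sketch: object at byte 17, generation 0, in use.
entry = parse_xref_entry("0000000017 00000 n")
print(entry.offset, entry.generation, entry.in_use)
```

The strict length check is a design choice for the sketch: it rejects anything that does not match the fixed-width layout rather than guessing.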
@@ -49,10 +49,10 @@ Let's go through it step-by-step:
 
 The body is a sequence of indirect objects:
 
-`counter generationnumber << the_object >> endobj`
+`counter generation_number obj << the_object >> endobj`
 
 * `counter` (integer) is a unique identifier for the object.
-* `generationnumber` (integer) is the generation number of the object.
+* `generation_number` (integer) is the generation number of the object.
 * `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
   specify which kind of object it is.
 * `endobj` marks the end of the object.
@@ -91,11 +91,11 @@ Let's go through it:
 * `%%EOF` is the end-of-file marker.
 
 The trailer dictionary is a key-value list. The keys are specified in
-Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).
+Table 15 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).
 
 * `/Root` (dictionary) contains the document catalog.
-    * The `5` is the object number of the catalog dictionary
-    * `0` is the generation number of the catalog dictionary
+    * The `5` is the object number of the catalog dictionary.
+    * `0` is the generation number of the catalog dictionary.
     * `R` is the keyword that indicates that the object is a reference to the
       catalog dictionary.
 * `/Size` (integer) contains the total number of entries in the file's xref table.
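
The `5 0 R` reference described above can be taken apart with a few lines of code. This is only a sketch: `parse_reference` is a hypothetical helper, not part of pypdf, and it assumes the reference arrives as a single whitespace-separated token string.

```python
import re
from typing import Tuple

# object number, generation number, then the literal keyword R
_REFERENCE = re.compile(r"^(\d+)\s+(\d+)\s+R$")


def parse_reference(token: str) -> Tuple[int, int]:
    """Return (object_number, generation_number) for a token like '5 0 R'."""
    match = _REFERENCE.match(token.strip())
    if match is None:
        raise ValueError(f"not an indirect reference: {token!r}")
    return int(match.group(1)), int(match.group(2))


object_number, generation = parse_reference("5 0 R")
print(object_number, generation)
```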
@@ -110,4 +110,4 @@ pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
 ```
 
 Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
-our favorite IDE / text editor.
+your favorite IDE / text editor.
diff --git a/docs/dev/pypdf-parsing.md b/docs/dev/pypdf-parsing.md
@@ -13,14 +13,14 @@ structure of parsing:
    proceeds to parse the objects in the PDF. Objects in a PDF can be of various
    types such as dictionaries, arrays, streams, and simple data types (e.g.,
    integers, strings). pypdf parses these objects and stores them in
-   {py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`
-   via {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
+   {py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`,
+   populated by {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
 3. **Decoding content streams**: The content of a PDF is typically stored in
    content streams, which are sequences of PDF operators and operands. pypdf
    decodes these content streams by applying filters (e.g., `FlateDecode`,
    `LZWDecode`) specified in the stream's dictionary. This is only done when the
-   object is requested via {py:meth}`PdfReader.get_object
-   <pypdf.PdfReader.get_object>` in the `PdfReader._get_object_from_stream` method.
+   object is requested by {py:meth}`PdfReader.get_object
+   <pypdf.PdfReader.get_object>`, which uses the `PdfReader._get_object_from_stream` method.
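
As a rough illustration of what applying a `FlateDecode` filter means, the sketch below compresses and decompresses a small content stream with Python's standard `zlib` module, which implements the same zlib/deflate format. The sample operators are invented, and the sketch ignores decode parameters such as predictors that real streams may declare; it is not pypdf's implementation.

```python
import zlib

# A tiny, made-up content stream: PDF operators and operands as raw bytes.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# What a PDF writer would store in a stream whose dictionary lists /FlateDecode.
encoded = zlib.compress(content)

# What a reader does when the stream is actually requested.
decoded = zlib.decompress(encoded)

print(decoded == content)
```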
 
 ## References
 
diff --git a/docs/user/extract-text.md b/docs/user/extract-text.md
@@ -1,6 +1,6 @@
 # Extract Text from a PDF
 
-You can extract text from a PDF like this:
+You can extract text from a PDF:
 
 ```python
 from pypdf import PdfReader
@@ -10,7 +10,7 @@ page = reader.pages[0]
 print(page.extract_text())
 ```
 
-You can also choose to limit the text orientation you want to extract, e.g:
+You can also choose to limit the text orientation you want to extract:
 
 ```python
 # extract only text oriented up
@@ -42,7 +42,7 @@ Refer to [extract\_text](../modules/PageObject.html#pypdf._page.PageObject.extra
 
 ## Using a visitor
 
-You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.
+You can use visitor functions to control which part of a page you want to process and extract. The visitor functions you provide will get called for each operator or for each text fragment.
 
 The function provided in argument visitor_text of function extract_text has five arguments:
 * text: the current text (as long as possible, can be up to a full line)
@@ -51,19 +51,19 @@ The function provided in argument visitor_text of function extract_text has five
 * font-dictionary: full font dictionary
 * font-size: the size (in text coordinate space)
 
-The matrix stores 6 parameters. The first 4 provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical)
+The matrix stores six parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical).
 It is recommended to use the user_matrix as it takes all transformations into account.
 
 Notes:
 
- - as indicated in the PDF 1.7 reference, page 204 the user matrix applies to text space/image space/form space/pattern space.
- - if you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
-`txt2user = mult(tm, cm))`
-The font-size is the raw text size, that is affected by the `user_matrix`
+ - As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
+ - If you want to get the full transformation from text to user space, you can use the `mult` function (available in global import) as follows:
+`txt2user = mult(tm, cm)`.
+The font size is the raw text size and is affected by the `user_matrix`.
 
 
 The font-dictionary may be None in case of unknown fonts.
-If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
+If not None it could contain something like key "/BaseFont" with value "/Arial,Bold".
 
 **Caveat**: In complicated documents the calculated positions may be difficult to determine (for example, if you move from multiple forms to page user space).
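
To show what combining the text matrix and the current transformation matrix amounts to, here is a self-contained sketch of six-parameter matrix multiplication. It deliberately does not import `mult`, so it runs without pypdf; the names `compose` and `apply` and the sample matrices are assumptions made for this illustration. It follows the usual PDF convention where `[a, b, c, d, e, f]` maps `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`.

```python
from typing import List, Tuple

Matrix = List[float]  # [a, b, c, d, e, f]


def compose(m: Matrix, n: Matrix) -> Matrix:
    """Return the matrix that applies m first and then n (hypothetical helper)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def apply(m: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    """Transform a point with a six-parameter matrix."""
    x, y = point
    return (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])


tm = [1, 0, 0, 1, 10, 20]  # sample text matrix: translate by (10, 20)
cm = [2, 0, 0, 2, 0, 0]    # sample transformation matrix: scale by 2

txt2user = compose(tm, cm)
print(txt2user)                 # [2, 0, 0, 2, 20, 40]
print(apply(txt2user, (1, 1)))  # (22, 42)
```

The first four entries of the result carry the rotation/scaling part and the last two the translation, matching the description of the matrix parameters above.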
 
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.
 
 ### Example 1: Ignore header and footer
 
-The following example reads the text of page 4 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).
 
 ```python
 from pypdf import PdfReader
@@ -97,10 +97,10 @@ print(text_body)
 
 ### Example 2: Extract rectangles and texts into an SVG file
 
-The following example converts page 3 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
+The following example converts page three of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into an
 [SVG file](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics).
 
-Such a SVG export may help to understand whats going on in a page.
+Such an SVG export may help to understand what is going on in a page.
 
 ```python
 from pypdf import PdfReader
@@ -131,13 +131,13 @@ dwg.save()
 
 The SVG generated here is bottom-up because the coordinate systems of PDF and SVG differ.
 
-Unfortunately in complicated PDF documents the coordinates given to the visitor-functions may be wrong.
+Unfortunately, in complicated PDF documents the coordinates given to the visitor functions may be wrong.
 
 ## Why Text Extraction is hard
 
 ### Unclear Objective
 
-Extracting text from a PDF can be pretty tricky. In several cases there is no
+Extracting text from a PDF can be tricky. In several cases there is no
 clear answer as to what the expected result should look like:
 
 1. **Paragraphs**: Should the text of a paragraph have line breaks at the same places
@@ -191,7 +191,7 @@ printing. It was not created for parsing the content. PDF files don't contain a
 semantic layer.
 
 Specifically, there is no information about what the header, footer, page numbers,
-tables, and paragraphs are. The visual appearence is there and people might
+tables, and paragraphs are. The visual appearance is there and people might
 find heuristics to make educated guesses, but there is no way of being certain.
 
 This is a shortcoming of the PDF file format, not of pypdf.
diff --git a/docs/user/post-processing-in-text-extraction.md b/docs/user/post-processing-in-text-extraction.md
@@ -1,15 +1,13 @@
-# Post-Processing in Text Extraction
+# Post-Processing of Text Extraction
 
-Post-processing can recognizably improve the results of text extraction.
-It is, however, outside of the scope of pypdf itself. Hence the library will
-not give any direct support for it. It is a natural language processing (NLP)
-task.
+Post-processing can noticeably improve the results of text extraction. It is,
+however, outside of the scope of pypdf itself. Hence the library will not give
+any direct support for it. It is a natural language processing (NLP) task.
 
-This page lists a few examples what can be done as well as a community
-recipie that can be used as a best-practice general purpose post processing
-step. If you know more about the specific domain of your documents, e.g. the
-language, it is likely that you can find custom solutions that work better in
-your context
+This page lists a few examples of what can be done as well as a community recipe
+that can be used as a general-purpose post-processing step. If you know more
+about the specific domain of your documents, e.g. the language, it is likely
+that you can find custom solutions that work better in your context.
 
 ## Ligature Replacement
 
@@ -32,7 +30,7 @@ def replace_ligatures(text: str) -> str:
     return text
 ```
 
-## De-Hyphenation
+## Dehyphenation
 
 Hyphens are used to break words up so that the appearance of the page is nicer.
 
@@ -77,11 +75,11 @@ def dehyphenate(lines: List[str], line_no: int) -> List[str]:
 
 The following header/footer removal has several drawbacks:
 
-* False-positives, e.g. for the first page when there is a date like 2021.
+* False positives, e.g. for the first page when there is a date like 2024.
 * False negatives in many cases:
-    * Dynamic part, e.g. page label is in the header
-    * Even/odd pages have different headers
-    * Some pages, e.g. the first one or chapter pages, don't have a header
+    * Dynamic part, e.g. page label is in the header.
+    * Even/odd pages have different headers.
+    * Some pages, e.g. the first one or chapter pages, do not have a header.
 
 ```python
 def remove_footer(extracted_texts: list[str], page_labels: list[str]):
@@ -105,9 +103,9 @@ def remove_footer(extracted_texts: list[str], page_labels: list[str]):
 
 ## Other ideas
 
-* Whitespaces between Units: Between a number and it's unit should be a space
+* Whitespaces in units: There should be a space between a number and its unit
   ([source](https://tex.stackexchange.com/questions/20962/should-i-put-a-space-between-a-number-and-its-unit)).
   That means: 42 ms, 42 GHz, 42 GB.
 * Percent: English style guides prescribe writing the percent sign following the number without any space between (e.g. 50%).
-* Whitespaces before dots: Should typically be removed
-* Whitespaces after dots: Should typically be added
+* Whitespaces before dots: Should typically be removed.
+* Whitespaces after dots: Should typically be added.
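
The ideas in this list could be prototyped with a few regular expressions. The sketch below is only a starting point under stated assumptions: `normalize_spacing` is a hypothetical helper, the unit list is a tiny made-up sample, and the dot rule only fires before an uppercase letter, so it will still mangle things like dotted identifiers.

```python
import re

# Small, made-up sample of units; a real list would depend on your documents.
_UNITS = r"(?:ms|GHz|GB)"


def normalize_spacing(text: str) -> str:
    """Apply a few simple spacing rules to extracted text (hypothetical helper)."""
    # Whitespace before dots: "stopped ." -> "stopped."
    text = re.sub(r"\s+\.", ".", text)
    # Whitespace after dots: "ms.Then" -> "ms. Then" (only before an uppercase letter)
    text = re.sub(r"\.(?=[A-Z])", ". ", text)
    # Whitespace in units: "42ms" -> "42 ms"
    text = re.sub(rf"(\d)({_UNITS})\b", r"\1 \2", text)
    # Percent: "50 %" -> "50%"
    text = re.sub(r"(\d)\s+%", r"\1%", text)
    return text


print(normalize_spacing("It took 42ms .Then it stopped ."))
```

Because the rules are applied one after another, their order matters: removing the space before a dot first is what lets the next rule see `ms.Then` and insert the missing space.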
diff --git a/docs/user/streaming-data.md b/docs/user/streaming-data.md
@@ -73,4 +73,4 @@ obj = s3.get_object(Body=csv_buffer.getvalue(), Bucket="my-bucket", Key="my/doc.
 reader = PdfReader(BytesIO(obj["Body"].read()))
 ```
 
-It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769))
+It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769)).