Commit ed1d438

j-t-1 and stefan6419846 authored
DOC: Minor improvements (#2542)
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
1 parent 24709a3 commit ed1d438

5 files changed, +44 -46 lines changed

docs/dev/pdf-format.md

+8-8
@@ -1,6 +1,6 @@
 # The PDF Format

-It's recommended to look in the PDF specification for details and clarifications.
+It is recommended to look in the PDF specification for details and clarifications.
 This is only intended to give a very rough overview of the format.

 ## Overall Structure
@@ -32,7 +32,7 @@ Let's go through it step-by-step:

 * `xref` is just a keyword that specifies the start of the xref table.
 * `42` is the numerical ID of the first object in this xref section; `5` is the number of entries in the xref table.
-* Now every object has 3 entries `nnnnnnnnnn ggggg n`: The 10-digit byte offset,
+* Now every object has 3 entries `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
 a 5-digit generation number, and a literal keyword which is either `n` or `f`.
 * `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
 the object is in the file.
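
The entry layout described in this hunk can be sketched in a few lines of Python. This is only an illustration, not how pypdf parses xref tables: `XrefEntry` and `parse_xref_entry` are hypothetical names, and the sample entries are made up. It assumes the usual PDF meaning of the keyword, where `n` marks an object in use and `f` marks a free entry.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    """One entry of a classic xref table (hypothetical helper type)."""

    offset: int      # byte offset of the object in the file
    generation: int  # generation number
    in_use: bool     # True for keyword `n`, False for keyword `f`


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one `nnnnnnnnnn ggggg n` entry."""
    offset, generation, keyword = line.split()
    # The format fixes the field widths: 10 digits and 5 digits.
    if len(offset) != 10 or len(generation) != 5:
        raise ValueError(f"malformed xref entry: {line!r}")
    if keyword not in ("n", "f"):
        raise ValueError(f"unknown xref keyword: {keyword!r}")
    return XrefEntry(int(offset), int(generation), keyword == "n")


# Made-up sample entry: an in-use object at byte offset 15.
entry = parse_xref_entry("0000000015 00000 n")
print(entry.offset, entry.generation, entry.in_use)
```

A real reader would then seek to `entry.offset` in the file and parse the indirect object that starts there.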
@@ -49,10 +49,10 @@ Let's go through it step-by-step:

 The body is a sequence of indirect objects:

-`counter generationnumber << the_object >> endobj`
+`counter generation_number << the_object >> endobj`

 * `counter` (integer) is a unique identifier for the object.
-* `generationnumber` (integer) is the generation number of the object.
+* `generation_number` (integer) is the generation number of the object.
 * `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
 specify which kind of object it is.
 * `endobj` marks the end of the object.
@@ -91,11 +91,11 @@ Let's go through it:
 * `%%EOF` is the end-of-file marker.

 The trailer dictionary is a key-value list. The keys are specified in
-Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).
+Table 15 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).

 * `/Root` (dictionary) contains the document catalog.
-* The `5` is the object number of the catalog dictionary
-* `0` is the generation number of the catalog dictionary
+* The `5` is the object number of the catalog dictionary.
+* `0` is the generation number of the catalog dictionary.
 * `R` is the keyword that indicates that the object is a reference to the
 catalog dictionary.
 * `/Size` (integer) contains the total number of entries in the files xref table.
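
To make the `5 0 R` notation concrete, the following sketch pulls the `/Root` reference out of a trailer with a regular expression. It is a simplification for illustration only: a real parser tokenizes the dictionary instead of pattern matching. `find_root_reference` and the sample trailer bytes are hypothetical.

```python
import re
from typing import Tuple

# `/Root`, then object number, generation number and the keyword `R`.
ROOT_REFERENCE = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R")


def find_root_reference(trailer: bytes) -> Tuple[int, int]:
    """Return (object number, generation number) of the catalog."""
    match = ROOT_REFERENCE.search(trailer)
    if match is None:
        raise ValueError("trailer has no /Root reference")
    return int(match.group(1)), int(match.group(2))


# Made-up trailer for illustration.
sample_trailer = b"trailer\n<< /Size 6 /Root 5 0 R >>\nstartxref\n1234\n%%EOF"
print(find_root_reference(sample_trailer))
```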
@@ -110,4 +110,4 @@ pdftk crazyones.pdf output crazyones-uncomp.pdf uncompress
 ```

 Then rename `crazyones-uncomp.pdf` to `crazyones-uncomp.txt` and open it in
-our favorite IDE / text editor.
+your favorite IDE / text editor.

docs/dev/pypdf-parsing.md

+4-4
@@ -13,14 +13,14 @@ structure of parsing:
 proceeds to parse the objects in the PDF. Objects in a PDF can be of various
 types such as dictionaries, arrays, streams, and simple data types (e.g.,
 integers, strings). pypdf parses these objects and stores them in
-{py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`
-via {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
+{py:meth}`PdfReader.resolved_objects <pypdf.PdfReader.resolved_objects>`,
+populated by {py:meth}`cache_indirect_object <pypdf.PdfReader.cache_indirect_object>`.
 3. **Decoding content streams**: The content of a PDF is typically stored in
 content streams, which are sequences of PDF operators and operands. pypdf
 decodes these content streams by applying filters (e.g., `FlateDecode`,
 `LZWDecode`) specified in the stream's dictionary. This is only done when the
-object is requested via {py:meth}`PdfReader.get_object
-<pypdf.PdfReader.get_object>` in the `PdfReader._get_object_from_stream` method.
+object is requested by {py:meth}`PdfReader.get_object
+<pypdf.PdfReader.get_object>` which uses the `PdfReader._get_object_from_stream` method.

 ## References

docs/user/extract-text.md

+15-15
@@ -1,6 +1,6 @@
 # Extract Text from a PDF

-You can extract text from a PDF like this:
+You can extract text from a PDF:

 ```python
 from pypdf import PdfReader
@@ -10,7 +10,7 @@ page = reader.pages[0]
 print(page.extract_text())
 ```

-You can also choose to limit the text orientation you want to extract, e.g:
+You can also choose to limit the text orientation you want to extract:

 ```python
 # extract only text oriented up
@@ -42,7 +42,7 @@ Refer to [extract\_text](../modules/PageObject.html#pypdf._page.PageObject.extra

 ## Using a visitor

-You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.
+You can use visitor functions to control which part of a page you want to process and extract. The visitor functions you provide will get called for each operator or for each text fragment.

 The function provided in argument visitor_text of function extract_text has five arguments:
 * text: the current text (as long as possible, can be up to a full line)
@@ -51,19 +51,19 @@ The function provided in argument visitor_text of function extract_text has five
 * font-dictionary: full font dictionary
 * font-size: the size (in text coordinate space)

-The matrix stores 6 parameters. The first 4 provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical)
+The matrix stores six parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical).
 It is recommended to use the user_matrix as it takes into all transformations.

 Notes :

-- as indicated in the PDF 1.7 reference, page 204 the user matrix applies to text space/image space/form space/pattern space.
-- if you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
-`txt2user = mult(tm, cm))`
-The font-size is the raw text size, that is affected by the `user_matrix`
+- As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
+- If you want to get the full transformation from text to user space, you can use the `mult` function (available in global import) as follows:
+`txt2user = mult(tm, cm))`.
+The font size is the raw text size and affected by the `user_matrix`.


 The font-dictionary may be None in case of unknown fonts.
-If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
+If not None it could contain something like key "/BaseFont" with value "/Arial,Bold".

 **Caveat**: In complicated documents the calculated positions may be difficult to (if you move from multiple forms to page user space for example).
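
For readers unfamiliar with six-parameter matrices, the sketch below shows what combining a text matrix with a current transformation matrix amounts to. It is a standalone illustration of the PDF convention `[a, b, c, d, e, f]` (row vectors, so text to user space is `tm` followed by `cm`), not a copy of pypdf's `mult`. `multiply_matrices`, `transform_point`, and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def multiply_matrices(m: Sequence[float], n: Sequence[float]) -> List[float]:
    """Concatenate two PDF matrices: apply `m` first, then `n`."""
    return [
        m[0] * n[0] + m[1] * n[2],         # a
        m[0] * n[1] + m[1] * n[3],         # b
        m[2] * n[0] + m[3] * n[2],         # c
        m[2] * n[1] + m[3] * n[3],         # d
        m[4] * n[0] + m[5] * n[2] + n[4],  # e: horizontal translation
        m[4] * n[1] + m[5] * n[3] + n[5],  # f: vertical translation
    ]


def transform_point(m: Sequence[float], x: float, y: float) -> Tuple[float, float]:
    """Map the point (x, y) through the matrix `m`."""
    return (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])


# Made-up values: the text matrix translates by (10, 20),
# the current transformation matrix scales by 2 and translates by (5, 5).
tm = [1, 0, 0, 1, 10, 20]
cm = [2, 0, 0, 2, 5, 5]
txt2user = multiply_matrices(tm, cm)
print(txt2user)                         # [2, 0, 0, 2, 25, 45]
print(transform_point(txt2user, 1, 1))  # (27, 47)
```

The last two entries of the combined matrix give the position of the text origin in user space, which is what the header and footer filtering in the examples of this page compares against.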

@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.

 ### Example 1: Ignore header and footer

-The following example reads the text of page 4 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y < 720) and footer (y > 50).

 ```python
 from pypdf import PdfReader
@@ -97,10 +97,10 @@ print(text_body)

 ### Example 2: Extract rectangles and texts into a SVG-file

-The following example converts page 3 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
+The following example converts page three of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
 [SVG file](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics).

-Such a SVG export may help to understand whats going on in a page.
+Such a SVG export may help to understand what is going on in a page.

 ```python
 from pypdf import PdfReader
@@ -131,13 +131,13 @@ dwg.save()

 The SVG generated here is bottom-up because the coordinate systems of PDF and SVG differ.

-Unfortunately in complicated PDF documents the coordinates given to the visitor-functions may be wrong.
+Unfortunately in complicated PDF documents the coordinates given to the visitor functions may be wrong.

 ## Why Text Extraction is hard

 ### Unclear Objective

-Extracting text from a PDF can be pretty tricky. In several cases there is no
+Extracting text from a PDF can be tricky. In several cases there is no
 clear answer what the expected result should look like:

 1. **Paragraphs**: Should the text of a paragraph have line breaks at the same places
@@ -191,7 +191,7 @@ printing. It was not created for parsing the content. PDF files don't contain a
 semantic layer.

 Specifically, there is no information what the header, footer, page numbers,
-tables, and paragraphs are. The visual appearence is there and people might
+tables, and paragraphs are. The visual appearance is there and people might
 find heuristics to make educated guesses, but there is no way of being certain.

 This is a shortcoming of the PDF file format, not of pypdf.

docs/user/post-processing-in-text-extraction.md

+16-18
@@ -1,15 +1,13 @@
-# Post-Processing in Text Extraction
+# Post-Processing of Text Extraction

-Post-processing can recognizably improve the results of text extraction.
-It is, however, outside of the scope of pypdf itself. Hence the library will
-not give any direct support for it. It is a natural language processing (NLP)
-task.
+Post-processing can recognizably improve the results of text extraction. It is,
+however, outside of the scope of pypdf itself. Hence the library will not give
+any direct support for it. It is a natural language processing (NLP) task.

-This page lists a few examples what can be done as well as a community
-recipie that can be used as a best-practice general purpose post processing
-step. If you know more about the specific domain of your documents, e.g. the
-language, it is likely that you can find custom solutions that work better in
-your context
+This page lists a few examples what can be done as well as a community recipe
+that can be used as a general purpose post-processing step. If you know more
+about the specific domain of your documents, e.g. the language, it is likely
+that you can find custom solutions that work better in your context.

 ## Ligature Replacement

@@ -32,7 +30,7 @@ def replace_ligatures(text: str) -> str:
 return text
 ```

-## De-Hyphenation
+## Dehyphenation

 Hyphens are used to break words up so that the appearance of the page is nicer.

@@ -77,11 +75,11 @@ def dehyphenate(lines: List[str], line_no: int) -> List[str]:

 The following header/footer removal has several drawbacks:

-* False-positives, e.g. for the first page when there is a date like 2021.
+* False-positives, e.g. for the first page when there is a date like 2024.
 * False-negatives in many cases:
-* Dynamic part, e.g. page label is in the header
-* Even/odd pages have different headers
-* Some pages, e.g. the first one or chapter pages, don't have a header
+* Dynamic part, e.g. page label is in the header.
+* Even/odd pages have different headers.
+* Some pages, e.g. the first one or chapter pages, do not have a header.

 ```python
 def remove_footer(extracted_texts: list[str], page_labels: list[str]):
@@ -105,9 +103,9 @@ def remove_footer(extracted_texts: list[str], page_labels: list[str]):

 ## Other ideas

-* Whitespaces between Units: Between a number and it's unit should be a space
+* Whitespaces in units: Between a number and its unit should be a space.
 ([source](https://tex.stackexchange.com/questions/20962/should-i-put-a-space-between-a-number-and-its-unit)).
 That means: 42 ms, 42 GHz, 42 GB.
 * Percent: English style guides prescribe writing the percent sign following the number without any space between (e.g. 50%).
-* Whitespaces before dots: Should typically be removed
-* Whitespaces after dots: Should typically be added
+* Whitespaces before dots: Should typically be removed.
+* Whitespaces after dots: Should typically be added.
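
Some of the ideas listed in this hunk can be approximated with regular expressions. The sketch below is not part of pypdf or of this commit. `normalize_dot_spacing`, `space_before_unit`, and the short `UNITS` list are hypothetical, and the rules are deliberately conservative: a space is only added after a dot when an uppercase letter follows, so decimals such as `3.14` and abbreviations such as `e.g.` stay untouched.

```python
import re

# Hypothetical, deliberately short list of units to recognize.
UNITS = ("ms", "GHz", "GB")
UNIT_PATTERN = re.compile(r"(\d)(" + "|".join(UNITS) + r")\b")


def normalize_dot_spacing(text: str) -> str:
    """Remove whitespace before dots and add a space after sentence dots."""
    # "end ." -> "end."
    text = re.sub(r"[ \t]+\.", ".", text)
    # "end.Next" -> "end. Next" (only before an uppercase letter)
    return re.sub(r"\.(?=[A-Z])", ". ", text)


def space_before_unit(text: str) -> str:
    """Insert a space between a number and a known unit: "42ms" -> "42 ms"."""
    return UNIT_PATTERN.sub(r"\1 \2", text)


print(normalize_dot_spacing("First sentence .Second one."))
print(space_before_unit("It took 42ms at 42GHz."))
```

Heuristics like these have false positives of their own, for example file names such as `report.PDF` or abbreviations such as `U.S.A.`, so they should be checked against the documents at hand.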

docs/user/streaming-data.md

+1-1
@@ -73,4 +73,4 @@ obj = s3.get_object(Body=csv_buffer.getvalue(), Bucket="my-bucket", Key="my/doc.
 reader = PdfReader(BytesIO(obj["Body"].read()))
 ```

-It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769))
+It works similarly for Google Cloud Storage ([example](https://stackoverflow.com/a/68403628/562769)).
