docs/user/extract-text.md
+15 -15
@@ -1,6 +1,6 @@
1
1
# Extract Text from a PDF
2
2
3
-
You can extract text from a PDF like this:
3
+
You can extract text from a PDF:
4
4
5
5
```python
6
6
from pypdf import PdfReader
@@ -10,7 +10,7 @@ page = reader.pages[0]
10
10
print(page.extract_text())
11
11
```
12
12
13
-
You can also choose to limit the text orientation you want to extract, e.g:
13
+
You can also choose to limit the text orientation you want to extract:
14
14
15
15
```python
16
16
# extract only text oriented up
@@ -42,7 +42,7 @@ Refer to [extract\_text](../modules/PageObject.html#pypdf._page.PageObject.extra
42
42
43
43
## Using a visitor
44
44
45
-
You can use visitor-functions to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.
45
+
You can use visitor functions to control which part of a page you want to process and extract. The visitor functions you provide will get called for each operator or for each text fragment.
46
46
47
47
The function provided in argument visitor_text of function extract_text has five arguments:
48
48
* text: the current text (as long as possible, can be up to a full line)
@@ -51,19 +51,19 @@ The function provided in argument visitor_text of function extract_text has five
51
51
* font-dictionary: full font dictionary
52
52
* font-size: the size (in text coordinate space)
53
53
54
-
The matrix stores 6 parameters. The first 4 provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical)
54
+
The matrix stores six parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical).
55
55
It is recommended to use the user_matrix as it takes into account all transformations.
56
56
57
57
Notes:
58
58
59
-
-as indicated in the PDF 1.7 reference, page 204 the user matrix applies to text space/image space/form space/pattern space.
60
-
-if you want to get the full transformation from text to user space, you can use the `mult` function (availalbe in global import) as follows:
61
-
`txt2user = mult(tm, cm))`
62
-
The font-size is the raw text size, that is affected by the `user_matrix`
59
+
- As indicated in §8.3.3 of the PDF 1.7 or PDF 2.0 specification, the user matrix applies to text space/image space/form space/pattern space.
60
+
- If you want to get the full transformation from text to user space, you can use the `mult` function (available in global import) as follows:
61
+
`txt2user = mult(tm, cm)`.
62
+
The font size is the raw text size and is affected by the `user_matrix`.
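
To make the matrix notes above concrete, here is a minimal sketch of how two six-parameter matrices might be combined. It does not call pypdf; `compose` and `apply_matrix` are hypothetical helpers written for illustration, and the sketch assumes the row-vector convention of the PDF specification, where `[a, b, c, d, e, f]` maps `(x, y)` to `(a*x + c*y + e, b*x + d*y + f)`. It is intended to mirror what `mult(tm, cm)` computes, not to replace it.

```python
def compose(m, n):
    """Combine two 6-element PDF matrices: apply m first, then n (hypothetical helper)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def apply_matrix(matrix, x, y):
    """Transform the point (x, y) with a 6-element PDF matrix (hypothetical helper)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Made-up sample values: the text matrix translates by (10, 20),
# the current transformation matrix scales by 2 and translates by (5, 5).
tm = [1, 0, 0, 1, 10, 20]
cm = [2, 0, 0, 2, 5, 5]

txt2user = compose(tm, cm)
print(txt2user)                      # first four values: rotation/scaling, last two: translation
print(apply_matrix(txt2user, 0, 0))  # where the text-space origin lands in user space
```

The order of the arguments matters: text space is transformed first, then the current transformation matrix is applied.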
63
63
64
64
65
65
The font-dictionary may be None in case of unknown fonts.
66
-
If not None it may e.g. contain key "/BaseFont" with value "/Arial,Bold".
66
+
If it is not None, it may contain, for example, the key "/BaseFont" with the value "/Arial,Bold".
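
As a rough illustration of the five arguments, the following sketch builds a visitor that records each text fragment together with its base font, guarding against a font dictionary of None. The name `make_collector` and the sample values are assumptions made for this example; the visitor is called by hand here so the snippet runs without a PDF. Reading the position from the last two entries of `tm` is a simplification; as noted above, the user matrix is usually the better choice.

```python
def make_collector(fragments):
    """Return a visitor that appends one record per text fragment (hypothetical helper)."""

    def visitor_text(text, cm, tm, font_dict, font_size):
        # The font dictionary may be None for unknown fonts.
        base_font = None if font_dict is None else font_dict.get("/BaseFont")
        fragments.append(
            {
                "text": text,
                "x": tm[4],  # horizontal translation
                "y": tm[5],  # vertical translation
                "font": base_font,
                "size": font_size,
            }
        )

    return visitor_text


fragments = []
visitor = make_collector(fragments)

# With pypdf, the visitor would be passed as: page.extract_text(visitor_text=visitor)
# Here it is called directly with made-up values.
identity = [1, 0, 0, 1, 0, 0]
visitor("Hello", identity, [1, 0, 0, 1, 72, 700], {"/BaseFont": "/Arial,Bold"}, 12)
visitor("World", identity, [1, 0, 0, 1, 72, 680], None, 12)

for fragment in fragments:
    print(fragment)
```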
67
67
68
68
**Caveat**: In complicated documents the calculated positions may be difficult to determine (for example, if you move from multiple forms to page user space).
69
69
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.
72
72
73
73
### Example 1: Ignore header and footer
74
74
75
-
The following example reads the text of page 4 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores header (y < 720) and footer (y > 50).
75
+
The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).
76
76
77
77
```python
78
78
from pypdf import PdfReader
@@ -97,10 +97,10 @@ print(text_body)
97
97
98
98
### Example 2: Extract rectangles and texts into a SVG-file
99
99
100
-
The following example converts page 3 of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a
100
+
The following example converts page three of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf) into a