
Commit f45979b

DOC: Mention memory consumption of text extraction (#3168)
Closes #3167.
1 parent 3d8941d commit f45979b


1 file changed, +12 -2 lines changed

docs/user/extract-text.md

@@ -32,9 +32,19 @@ print(page.extract_text(extraction_mode="layout", layout_mode_strip_rotated=Fals
 
 Refer to {func}`~pypdf._page.PageObject.extract_text` for more details.
 
+```{note}
+Extracting the text of a page requires parsing its whole content stream. This can require quite a lot of memory -
+we have seen 10 GB RAM being required for an uncompressed content stream of about 300 MB (which should not occur
+very often).
+
+To limit the size of the content streams to process (and avoid OOM errors in your application), consider
+checking `len(page.get_contents().get_data())` beforehand.
+```
+
 ## Using a visitor
 
-You can use visitor functions to control which part of a page you want to process and extract. The visitor functions you provide will get called for each operator or for each text fragment.
+You can use visitor functions to control which part of a page you want to process and extract. The visitor functions
+you provide will get called for each operator or for each text fragment.
 
 The function provided in argument visitor_text of function extract_text has five arguments:
 * text: the current text (as long as possible, can be up to a full line)
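
The size check suggested in the note added by this hunk could be wrapped roughly as follows. This is a minimal sketch and not part of the commit: the helper name `extract_text_with_limit` and the 50 MB default are assumptions chosen for illustration. Only `get_contents()`, `get_data()`, and `extract_text()` come from the documentation being changed.

```python
from typing import Optional

# Hypothetical default limit (50 MB); tune it to the memory you can afford.
MAX_CONTENT_STREAM_BYTES = 50 * 1024 * 1024


def extract_text_with_limit(
    page, max_bytes: int = MAX_CONTENT_STREAM_BYTES
) -> Optional[str]:
    """Extract text from a page, or return None if its content stream is too large.

    `page` is expected to behave like a pypdf page object, i.e. to offer
    `get_contents()` and `extract_text()`.
    """
    contents = page.get_contents()
    if contents is not None:
        # Size of the content stream data, as in the note's suggested check.
        size = len(contents.get_data())
        if size > max_bytes:
            return None  # Skip this page instead of risking an OOM error.
    # No content stream to measure, or it is within the limit.
    return page.extract_text()
```

With a real document, `page` would typically be an element of `PdfReader(...).pages`. Returning `None` lets the caller distinguish a skipped page from a page that has no text.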
@@ -110,7 +120,7 @@ def visitor_svg_rect(op, args, cm, tm):
         dwg.add(dwg.rect((x, y), (w, h), stroke="red", fill_opacity=0.05))
 
 
-def visitor_svg_text(text, cm, tm, fontDict, fontSize):
+def visitor_svg_text(text, cm, tm, font_dict, font_size):
     (x, y) = (cm[4], cm[5])
     dwg.add(dwg.text(text, insert=(x, y), fill="blue"))
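
To illustrate the snake_case signature this hunk switches to, a text visitor might look like the sketch below. The factory name `make_band_visitor` and the y-range filter are hypothetical and not part of the commit. Reading the vertical position from `cm[5]` follows the `visitor_svg_text` example in this diff.

```python
from typing import Callable, List


def make_band_visitor(y_min: float, y_max: float, parts: List[str]) -> Callable:
    """Return a text visitor that keeps fragments whose y lies in (y_min, y_max)."""

    def visitor_text(text, cm, tm, font_dict, font_size):
        # cm[5] is used as the vertical position, as in visitor_svg_text above.
        y = cm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_text
```

Assuming `page` is a pypdf page object, the visitor would be passed as `page.extract_text(visitor_text=make_band_visitor(50, 720, parts))`, after which `"".join(parts)` holds the text collected from that band.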
