Take a long time after choose an unicode font #907
Comments
Hi @diego-insaurralde and welcome! 😊 I'd be happy to try to help you, If you don't know how to craft some minimal code reproducing your issue, |
Hello @Lucas-C, I appreciate your attention. Sorry for keeping you waiting; let me explain now. I rewrote all the code just to show you, and I will use an example txt file that contains only the letter "Ó". In the PdfGenerator class, line 24, I add a Unicode font. I chose DejaVu and put all the fonts in the fonts folder, in the same directory (just for testing). For the second test, I set a Unicode font, DejaVu, and on line 66 I changed the encoding to UTF-8. How can I handle it? First, here is my PdfGenerator class, which uses FPDF:
from fpdf import FPDF
from typing import List

class PdfGenerator(FPDF):
    def __init__(
        self,
        header_btb: List[str],
        registers: List[str],
        totals: dict,
        initial_page: int,
        initial_seq_number: int,
    ):
        self.header_btb = header_btb
        self.registers = registers
        self.totals = totals
        self.initial_page = initial_page
        self.initial_seq_number = initial_seq_number
        self.line_separator = 1
        self.line_height = 3
        FPDF.__init__(self)
        self.add_font('DejaVuSans', '', 'fonts/DejaVuSans.ttf')

    def header(self) -> None:
        self.set_font("Times", "", 5)
        self.set_text_color(19, 16, 196)
        self.ln(2)
        self.header_btb[2] = (
            self.header_btb[2][:163] + f"{self.initial_page + self.page_no() - 1:>7d}"
        )
        for line in self.header_btb:
            if "------" in line:
                break
            self.cell(0, self.line_height, line, 0, self.line_separator)
        self.set_text_color(128, 128, 128)
        self.cell(
            0,
            self.line_height,
            f"{' ' * 5}{'_' * 141}{' ' * 5}",
            0,
            self.line_separator,
        )
        self.ln(1)
        self.set_text_color(19, 16, 196)

    def generate_report(self):
        sep_total = 128 * " " + "-" * 42
        self.set_margins(left=0.5, right=0.5, top=0.3)
        self.alias_nb_pages()
        self.add_page()
        self.set_font("Times", "", 5)
        self.set_text_color(19, 16, 196)
        for index, header in enumerate(self.registers):
            for item in header["register_contents"]:
                self.cell(0, self.line_height, item, 0, self.line_separator)
            self.set_text_color(128, 128, 128)
            self.cell(0, self.line_height, sep_total, 0, self.line_separator)
            # self.line(142, self.get_y(), 170, self.get_y())
            self.set_text_color(19, 16, 196)
            self.cell(0, self.line_height, header["total_line"], 0, self.line_separator)
        self.set_text_color(19, 16, 196)
        self.cell(0, self.line_height, self.totals["total"], 0, self.line_separator)
Second file, which handles the txt file and generates the PDF:
from pdf_gen import PdfGenerator
from werkzeug.datastructures import FileStorage
import io

def separate_header_content(file: io.TextIOWrapper):
    j = 0
    header = []
    contents = []
    isHeaderBuilding = False
    isNewContent = True
    numericNew = 0
    registers = []
    for line in file:
        if "Cabe" in line:
            j = 1
            if not header:
                isHeaderBuilding = True
                header.append(line)
            else:
                isHeaderBuilding = False
        elif j < 8:
            if isHeaderBuilding:
                header.append(line)
            j += 1
        elif "---" in line:
            continue
        elif "Total" in line:
            total_line = line
        else:
            numeric = line[1:8].strip()
            if numeric.isnumeric():
                if isNewContent:
                    isNewContent = False
                    numericNew = numeric
                    contents.append(line)
                else:
                    if numeric == numericNew:
                        contents.append(line)
                    else:
                        old_contents = contents.copy()
                        registers.append({
                            "register_contents": old_contents,
                            "total_line": total_line
                        })
                        contents.clear()
                        isNewContent = True
    return header, registers

def test_pdf(file_utf8, file_latin1=""):
    input("PRESS ANY BUTTON TO START")
    with open(file_utf8, "rb") as f:
        f1 = FileStorage(f)
        f1_buff = io.TextIOWrapper(f1, encoding="ISO-8859-1")  # encoding="utf-8-sig")
        header, registers = separate_header_content(f1_buff)
    for line in header:
        print(line)
    if file_latin1:
        f2 = open(file_latin1, "rb")
        f2 = FileStorage(f2)
        f2_buff = io.TextIOWrapper(f2, encoding="ISO-8859-1")
        header2, registers2 = separate_header_content(f2_buff)
        header += header2
        registers += registers2
    pdf = PdfGenerator(header, registers, {"total": "0,0", "accumulated": "0,0"}, 1, 1)
    pdf.generate_report()
    pdf.output(name="teste.pdf")
    if file_latin1:
        f2.close()

if __name__ == "__main__":
    file_utf8 = "utf8.txt"
    file_latin1 = ""
    test_pdf(file_utf8)
|
@diego-insaurralde, have you tried profiling your code?
import cProfile
cProfile.run('test_pdf(file_utf8)')
This would give you (and us) a first idea of where your program is spending all that time. |
@diego-insaurralde could you please execute the command |
I ran it just now and got this result. For the first one, I loaded the data into pandas, sorted it by cumtime, and took the first 50 rows.
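For readers who want a similar report without pandas, here is a minimal sketch using only the standard library's `cProfile` and `pstats` modules. `top_cumulative` and `busy_work` are hypothetical names introduced for illustration; in this thread the profiled call would be `test_pdf(file_utf8)`.

```python
import cProfile
import io
import pstats


def top_cumulative(func, *args, limit=50, **kwargs):
    """Profile func(*args, **kwargs) and return (result, text report).

    The report lists the `limit` entries with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buffer.getvalue()


def busy_work(n):
    # Hypothetical stand-in for the real workload, e.g. test_pdf(file_utf8).
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = top_cumulative(busy_work, 100_000, limit=10)
    print(report)
```

Sorting by cumulative time puts the call chains that contain the slow code at the top, which is usually the quickest way to see where a long-running job spends its time.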
|
Sure! But the main libraries are Flask and fpdf2.
|
Hi @diego-insaurralde Thank you for taking the time to provide some code reproducing your problem. First, I want to mention that the script you provided is not MINIMAL: Second, when strictly following your instructions, I got an
But when I executed those commands in my Linux shell:
echo 'Ó' > utf8.txt
time python issue_907.py
It seems that your Nevertheless I tested your script by feeding it itself ( Here is the single-file Python script I used, based on the code snippets you shared: |
Hello Lucas, thanks for your answer.
from fpdf import FPDF
from typing import List
from werkzeug.datastructures import FileStorage
import io

class PdfGenerator(FPDF):
    def __init__(self, registers: List[str]):
        self.registers = registers
        FPDF.__init__(self)
        self.add_font('DejaVuSans', '', 'fonts/DejaVuSans.ttf')

    def generate_report(self):
        self.set_margins(left=0.5, right=0.5, top=0.3)
        self.alias_nb_pages()
        self.add_page()
        self.set_font('DejaVuSans', '', 5)
        self.set_text_color(19, 16, 196)
        for register in self.registers:
            self.cell(0, 3, register, 0, 1)

def test_pdf(file_utf8):
    with open(file_utf8, "rb") as f:
        f1 = FileStorage(f)
        f1_buff = io.TextIOWrapper(f1, encoding="utf-8")
        lines = f1_buff.readlines()
    pdf = PdfGenerator(lines)
    pdf.generate_report()
    pdf.output(name="teste.pdf")

if __name__ == "__main__":
    file_utf8 = "utf8.txt"
    test_pdf(file_utf8)
|
A patch by @andersonhc has just been merged which should reduce the time required for this specific example by roughly 30%. When using the built-in fonts, then the text pretty much gets written directly to the PDF file, with hardly any processing overhead. Since we're talking about an 11+ MB file with ~100k lines of text and ~7 million characters here, any pure Python solution will eventually reach its limits. Maybe you can get it to work somehow, but it may also be that you'll have to find a package written in a compiled language that does things a lot faster. As to your original description of the production job running "forever", have you tried to instrument your code so it will give you some progress feedback? If you eg. write some console output for each new page, then this would give you a better idea about how close you are to a working solution, or how far away from it. |
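The progress-feedback suggestion could be sketched as a small mixin. This assumes the PDF class starts every page, including automatic page breaks, through an `add_page()` method, which is believed to be how fpdf2's `FPDF` behaves. `PageProgressMixin` and `progress_every` are hypothetical names, not part of fpdf2.

```python
import sys
import time


class PageProgressMixin:
    """Report progress each time a new page is started.

    Assumption: the class this is mixed into exposes add_page() and also
    calls it for automatic page breaks.
    """

    progress_every = 100  # hypothetical setting: report every N pages

    def add_page(self, *args, **kwargs):
        super().add_page(*args, **kwargs)
        if not hasattr(self, "_progress_start"):
            # Lazily initialised so the mixin needs no __init__ of its own.
            self._progress_start = time.perf_counter()
            self._pages_started = 0
        self._pages_started += 1
        if self._pages_started % self.progress_every == 0:
            elapsed = time.perf_counter() - self._progress_start
            print(
                f"{self._pages_started} pages after {elapsed:.1f}s",
                file=sys.stderr,
                flush=True,
            )
```

Usage would look like `class PdfGenerator(PageProgressMixin, FPDF): ...`, with the mixin listed first so that its `add_page()` wraps the library's. A steadily growing page count would show that the job is slow rather than stuck.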
I understand. Is it not possible to add a Unicode font as a native option, similar to the fonts in the ISO-8859-1 encoding? From my perspective, as someone external to the situation, it seems unusual that this library doesn't handle UTF-8 encodings very well, since this encoding is widely used. |
Short answer: No.
That statement makes no sense. Unicode file encodings like UTF-8 are completely irrelevant in this context and fpdf2 never even sees them in normal operation. At the same time, no other Python based PDF library can handle Unicode text with TTF fonts as completely and correctly as fpdf2, and none of them is significantly faster. Maybe you should try to get at least some basic understanding of the relevant topics (Unicode, TTF fonts, and the PDF specification) before criticizing the work of other people. In summary, I repeat: it is likely impossible for any Python library to deliver the performance you wish for. |
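To illustrate the point about file encodings: once each input has been decoded with its own codec, both become ordinary Python `str` objects, and those are all the PDF library receives. A minimal sketch, where `decode_lines` is a hypothetical helper and the byte strings stand in for the two input files:

```python
def decode_lines(raw: bytes, encoding: str) -> list[str]:
    """Decode raw file bytes with the given codec and split them into lines."""
    return raw.decode(encoding).splitlines()


# The same text stored in the two encodings mentioned in this issue.
latin1_bytes = "Ó total\n".encode("iso-8859-1")
utf8_bytes = "Ó total\n".encode("utf-8")

# On disk, the byte sequences differ.
print(latin1_bytes != utf8_bytes)

# After decoding, the strings are identical, so lines from both files can be
# merged into one list before any PDF text is written.
lines = decode_lines(latin1_bytes, "iso-8859-1") + decode_lines(utf8_bytes, "utf-8")
print(lines[0] == lines[1])
```

The important part is choosing the right codec per file when reading; decoding a UTF-8 file as ISO-8859-1 does not raise an error but silently produces the wrong characters.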
I agree overall with @gmischler's answer. Thank you also for providing a second, shorter code snippet. I took the time to investigate a little, to try to figure out where the bottleneck is.
The last line in the table above is a specific code line that is evaluated almost 8 million times: With an average of 0.01ms per execution, this code is already very fast. And finally, I think it may be interesting to translate this usage scenario into a non-regression performance test, |
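A non-regression performance test of this kind could be sketched as below. `exec_time_below` and the two-second budget are hypothetical choices for illustration, and the loop is a stand-in for the PDF-generating scenario; this is not necessarily how fpdf2's own test suite does it.

```python
import time
from contextlib import contextmanager


@contextmanager
def exec_time_below(seconds: float):
    """Raise AssertionError if the wrapped block runs longer than `seconds`."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    assert elapsed < seconds, f"took {elapsed:.2f}s, budget was {seconds:.2f}s"


def test_many_lines_stay_fast():
    # Stand-in workload: a real test would render many lines with a TTF font.
    with exec_time_below(seconds=2.0):
        total = sum(len(f"line {i}") for i in range(100_000))
    assert total > 0
```

Wall-clock budgets depend on the machine running the tests, so the threshold would need generous headroom to avoid flaky failures on slow CI runners.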
Oh, I missed that the repeated calls to Anyway I opened #911 and I'd be happy to have your reviews on it, |
I also compared the performance of the script provided by @diego-insaurralde (this one: issue_907.py) between
So I think we got an actual speed regression in I'm sorry @gmischler, but this shows that your previous statements were maybe too categorical:
Given that, in this specific usage scenario, |
First of all, I would like to apologize. I didn't mean to criticize anything, only to provoke a discussion so I could understand better. After your explanation, @gmischler, I now see how delicate the solution can be. I have a problem and I was looking for a solution; that's why I'm here. @Lucas-C, thank you so much for your clarifications. Now I can hope for a solution xD. |
I profiled
Here is the result as an SVG flame graph (produced using flameprof): It makes very obvious the places where optimizations would be worthwhile 🙂 I also opened #913 to try to make |
Comparing the two graphs, a few things jump out at me: I suspect though, that really significant gains can better be reached on an architectural level than by individual local optimizations. |
I totally agree, but right now I do not see clearly what architectural changes we should make to improve things... Here is the performance flame graph for Not much has changed in the shape of the stacks... |
FYI I added a section in our docs on how to investigate performance issues: |
I've just been looking through #913 again and it seems that the good idea of caching glyph IDs could be expanded on further. At the moment, every time a specific glyph needs to be rendered, a new @andersonhc, am I missing anything obvious here? Is there any situation where a |
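One reading of the caching idea is sketched below. All names here (`Glyph`, `GlyphCache`, `lookup_glyph_id`) are hypothetical and do not reflect fpdf2's internals; the sketch only shows creating one object per code point and reusing it on later requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical immutable record for one glyph of a font."""
    glyph_id: int
    unicode: int


class GlyphCache:
    """Create each Glyph once and hand out the same object afterwards."""

    def __init__(self, lookup_glyph_id):
        # lookup_glyph_id: callable mapping a code point to a glyph ID,
        # standing in for a comparatively expensive font table lookup.
        self._lookup = lookup_glyph_id
        self._glyphs = {}

    def get(self, char: str) -> Glyph:
        code_point = ord(char)
        glyph = self._glyphs.get(code_point)
        if glyph is None:
            # First request for this character: do the lookup and remember it.
            glyph = Glyph(self._lookup(code_point), code_point)
            self._glyphs[code_point] = glyph
        return glyph
```

Sharing instances like this is only safe if nothing mutates them after creation, which is why the sketch uses a frozen dataclass.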
Sounds great!! 👍 |
Given that PR #913 has been merged, I'm going to close this. Please open a new issue on this subject if need be! |
Hi Everyone,
First of all, I'm sorry if someone has already asked for help on the same subject, and for my terrible English xD.
I have to convert two txt files into one PDF. Each one has a different encoding: one is ISO-8859-1, the other UTF-8.
When I use only the ISO-8859-1 file, the PDF is generated quickly and correctly. When I try with both, I run into trouble: I need a Unicode font, so I'm using the DejaVu font, and now generating the PDF takes a long time and never finishes.
I tried loading the font from a specific directory. No error is displayed; I have to interrupt it manually because it never ends.
In the beginning, I was using FPDF version 1, and a warning was displayed: "UserWarning: cmap value too big/small", and it took a long time too. I saw that this version is deprecated, so now I'm here with version 2, with the same problem (but the UserWarning is not displayed).
So that's it, I'm waiting for an answer.
Thanks a lot to all contributors.