-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db_dump: Add support for dumping straight to gzip. #2767
Conversation
I thought you were redoing earlier changes but no, it was actually |
sched/db_dump.cpp
Outdated
} | ||
|
||
void write(const char* fmt, va_list args) { | ||
gzvprintf(gz, fmt, args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gzvprintf()
is relatively new addition. RHEL 6 users are not going to like you using it and I don't think RHEL 7 users either. Another problem is that it uses 8 kB buffer and at least team descriptions can be larger than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I considered the small buffer but discarded it as an issue as I only saw small writes. I'll have another look at it to see what the alternatives are. Compatibility might indeed be an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using vasprintf
and then dumping the entire buffer to libz? It's going to be a bit slower but you are not bound to a buffer size.
Out of curiosity I double checked printf and its min buffer size is apparently 4095 character.
gzvprintf is both limited by version and to its buffer size.
The caller, ZFILE, is responsible for ensuring that the output stream it uses is in the correct state. This makes the stream implementations smaller.
sched/db_dump.cpp
Outdated
// | ||
// File streams | ||
// | ||
class OutputStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the class names be uppercase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The coding standards are here: https://boinc.berkeley.edu/trac/wiki/CodingStyle which I think stats that class names should be upper case. @davidpanderson can confirm if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheAspens Thank you! I'll check it out and tweak the classes.
I started thinking. This code with classes and all is good. But... Just before GDPR came into effect I downloaded host stats from all the projects BOINCstats knows of. Every single project had gzipped files. `db_dump´ could just as well always unconditionally output gzipped files and nobody would notice. So... I wonder if those other output options should be removed to clean up the code a bit and hard code |
@JuhaSointusalo I had the same thought when I did this. I assumed the flexibility was needed but I have yet to see an export which was not gzip besides the |
For WCG we have always used the gzip option - so we wouldn't object to that change either. |
@denravonska - if you decide you are going to remove those options (that are apparently unused) then make add WIP to the pull request. I will try to review this either over the weekend or monday - but I'll leave it alone if i see the WIP. |
But one thought is - since this change is complete (except for the review of coding standards) maybe leave this pull request as is and make the additional changes to remove the unused options in another pull request. |
(sorry for multiple messages) - either way - if you could just post in this thread when you feel you are ready for review, that would let anyone who can do the review know for sure. Thanks! |
Changed the class names to conform to the standard. Any help testing would be appreciated as I'm still trying to find the password to the database in the virtual machine :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uncompressed and gzipped works fine. Regular zip doesn't but that didn't work before these changes either. zip
wants both archive name and file list on the command line. Since clearly nobody is using zipped files and if the option is going away I don't see the need to fix it.
I'll leave it for @TheAspens to test this on a real project database. |
I tried this three times against a database with a million host rows and a bunch of user rows as well and it ran in roughly the same time. However, since the files were small enough that they would be cached in memory it may not have caught the issue Rom outlined in the associated issue. It produced identical files as the version in master. Sorry for the delay in getting this tested. However, it looks good and I'm going to merge it. Thanks for the contribution! |
Great! Out of interest, what was the rough size of the gzip? |
The host records I created was 105M compressed and 732M uncompressed. |
That is a pretty large data set and I would expect some speedup. Unless the host is very beefy. |
A refactor of db_dump which splits the compression logic out of ZFILE and into separate output stream classes. This opens up for on-the-fly gzip compression for #694.
I have tested this in the VM image but it could be good if someone with a larger database tries an export to see if there are any concrete time gains.