db_dump: Add support for dumping straight to gzip. #2767

Merged
TheAspens merged 7 commits into BOINC:master from the refactor-stat-export branch on Oct 31, 2018

Conversation

denravonska (Contributor)

A refactor of db_dump which splits the compression logic out of ZFILE and into separate output stream classes. This opens the door to on-the-fly gzip compression for #694.

I have tested this in the VM image, but it would be good if someone with a larger database tried an export to see whether there are any concrete time gains.
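
For illustration, here is a rough sketch of the kind of split described above, with the compression details living in small stream classes instead of inside ZFILE. The class and method names are assumptions for the example, not the exact code in this pull request:

```cpp
#include <cstddef>
#include <cstdio>
#include <zlib.h>

// Illustrative interface only: ZFILE writes through this, and each output
// format supplies its own implementation.
class OUTPUT_STREAM {
public:
    virtual ~OUTPUT_STREAM() {}
    virtual bool open(const char* path) = 0;
    virtual void write(const char* buf, size_t len) = 0;
    virtual void close() = 0;
};

// Plain file output via stdio.
class FILE_STREAM : public OUTPUT_STREAM {
    FILE* f = nullptr;
public:
    bool open(const char* path) override { f = fopen(path, "w"); return f != nullptr; }
    void write(const char* buf, size_t len) override { fwrite(buf, 1, len, f); }
    void close() override { if (f) { fclose(f); f = nullptr; } }
};

// On-the-fly gzip output via zlib's gzFile API.
class GZIP_STREAM : public OUTPUT_STREAM {
    gzFile gz = nullptr;
public:
    bool open(const char* path) override { gz = gzopen(path, "wb"); return gz != nullptr; }
    void write(const char* buf, size_t len) override { gzwrite(gz, buf, (unsigned)len); }
    void close() override { if (gz) { gzclose(gz); gz = nullptr; } }
};
```

ZFILE would then pick a stream implementation based on the configured output format and write everything through the common interface.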

denravonska changed the title from "Add support for dumping straight to gzip." to "db_dump: Add support for dumping straight to gzip." on Oct 17, 2018
JuhaSointusalo (Contributor)

I thought you were redoing earlier changes, but no, it was actually db_purge the last time. Both deal with compressed files, so there could be some code-sharing opportunities for a future enhancement. Your approach with classes looks nicer than db_purge's void pointers.

    void write(const char* fmt, va_list args) {
        gzvprintf(gz, fmt, args);
    }
JuhaSointusalo (Contributor)

gzvprintf() is a relatively new addition. RHEL 6 users are not going to like you using it, and I don't think RHEL 7 users will either. Another problem is that it uses an 8 kB buffer, and at least team descriptions can be larger than that.

denravonska (Contributor Author), Oct 17, 2018

Good point. I considered the small buffer but dismissed it as an issue since I only saw small writes. I'll take another look to see what the alternatives are. Compatibility might indeed be an issue.

denravonska (Contributor Author)

How about using vasprintf and then dumping the entire buffer to libz? It's going to be a bit slower, but you are not bound by a fixed buffer size.

Out of curiosity I double-checked printf, and its minimum buffer size is apparently 4095 characters.

gzvprintf is limited both by zlib version and by its buffer size.
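
A minimal sketch of that vasprintf-based alternative, assuming a gzFile handle like the one in the hunk above; the function name here is made up for the example:

```cpp
#include <cstdarg>
#include <cstdlib>
#include <zlib.h>

// Format into a heap buffer sized by vasprintf(), then hand the whole
// buffer to zlib. This avoids gzvprintf()'s 8 kB internal buffer and its
// availability problem on older zlib versions.
void gz_write_fmt(gzFile gz, const char* fmt, va_list args) {
    char* buf = nullptr;
    int len = vasprintf(&buf, fmt, args);  // GNU/BSD extension, not ISO C
    if (len < 0) return;                   // allocation or formatting failed
    gzwrite(gz, buf, (unsigned)len);
    free(buf);
}
```

Since vasprintf() is a GNU/BSD extension, a fully portable variant could call vsnprintf() twice, once with a null buffer to measure the length and once to format into a malloc'd buffer, before passing the result to gzwrite().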
// The caller, ZFILE, is responsible for ensuring that the output stream it
// uses is in the correct state. This makes the stream implementations smaller.
//
// File streams
//
class OutputStream {
denravonska (Contributor Author)

Should the class names be uppercase?

TheAspens (Member)

The coding standards are here: https://boinc.berkeley.edu/trac/wiki/CodingStyle, which I think states that class names should be upper case. @davidpanderson can confirm if needed.

denravonska (Contributor Author)

@TheAspens Thank you! I'll check it out and tweak the classes.

JuhaSointusalo (Contributor)

I started thinking. This code with classes and all is good. But...

Just before GDPR came into effect I downloaded host stats from all the projects BOINCstats knows of. Every single project had gzipped files. `db_dump` could just as well unconditionally output gzipped files and nobody would notice.

So... I wonder if those other output options should be removed to clean up the code a bit, and db_dump hard-coded to always gzip files? The note at the top of db_dump.cpp kind of hints in that direction.

denravonska (Contributor Author)

@JuhaSointusalo I had the same thought when I did this. I assumed the flexibility was needed, but I have yet to see an export that was not gzipped, apart from the tables.xml in the VirtualBox sample.

TheAspens (Member)

For WCG we have always used the gzip option - so we wouldn't object to that change either.

TheAspens (Member)

@denravonska - if you decide you are going to remove those options (which are apparently unused), then add WIP to the pull request. I will try to review this either over the weekend or on Monday, but I'll leave it alone if I see the WIP.

TheAspens (Member)

One thought, though: since this change is complete (except for the coding standards review), maybe leave this pull request as is and make the additional changes to remove the unused options in another pull request.

TheAspens (Member), Oct 19, 2018

(Sorry for the multiple messages.) Either way, if you could post in this thread when you feel you are ready for review, that would let anyone who can do the review know for sure. Thanks!

denravonska (Contributor Author)

Changed the class names to conform to the standard. Any help testing would be appreciated as I'm still trying to find the password to the database in the virtual machine :)

JuhaSointusalo (Contributor) left a review

Uncompressed and gzipped output work fine. Regular zip doesn't, but it didn't work before these changes either: zip wants both the archive name and a file list on the command line. Since clearly nobody is using zipped files, and the option may be going away, I don't see a need to fix it.

JuhaSointusalo (Contributor)

I'll leave it for @TheAspens to test this on a real project database.

TheAspens (Member)

I tried this three times against a database with a million host rows and a fair number of user rows, and it ran in roughly the same time as master. However, since the files were small enough to be cached in memory, it may not have reproduced the issue Rom outlined in the associated issue.

It produced files identical to those from the version in master.

Sorry for the delay in getting this tested. However, it looks good and I'm going to merge it. Thanks for the contribution!

TheAspens merged commit bfe1fdd into BOINC:master on Oct 31, 2018
denravonska (Contributor Author)

Great! Out of interest, what was the rough size of the gzip?

denravonska deleted the refactor-stat-export branch on November 1, 2018, 07:40
TheAspens (Member)

The host records I created were 105 MB compressed and 732 MB uncompressed.

denravonska (Contributor Author)

That is a pretty large data set, and I would expect some speedup unless the host is very beefy.
