db_dump: Add support for dumping straight to gzip. #2767

Merged
TheAspens merged 7 commits into BOINC:master from the refactor-stat-export branch on Oct 31, 2018

Conversation

denravonska (Contributor)

A refactor of db_dump which splits the compression logic out of ZFILE and into separate output stream classes. This opens the door to on-the-fly gzip compression for #694.

I have tested this in the VM image, but it would be good if someone with a larger database tried an export to see whether there are any concrete time gains.
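
For illustration, here is a rough sketch of the kind of split described above, with the compression details living in small stream classes instead of inside ZFILE. The class and method names are assumptions for the example, not the exact code in this pull request:

```cpp
#include <cstddef>
#include <cstdio>
#include <zlib.h>

// Illustrative interface only: ZFILE writes through this, and each output
// format supplies its own implementation.
class OUTPUT_STREAM {
public:
    virtual ~OUTPUT_STREAM() {}
    virtual bool open(const char* path) = 0;
    virtual void write(const char* buf, size_t len) = 0;
    virtual void close() = 0;
};

// Plain file output via stdio.
class FILE_STREAM : public OUTPUT_STREAM {
    FILE* f = nullptr;
public:
    bool open(const char* path) override { f = fopen(path, "w"); return f != nullptr; }
    void write(const char* buf, size_t len) override { fwrite(buf, 1, len, f); }
    void close() override { if (f) { fclose(f); f = nullptr; } }
};

// On-the-fly gzip output via zlib's gzFile API.
class GZIP_STREAM : public OUTPUT_STREAM {
    gzFile gz = nullptr;
public:
    bool open(const char* path) override { gz = gzopen(path, "wb"); return gz != nullptr; }
    void write(const char* buf, size_t len) override { gzwrite(gz, buf, (unsigned)len); }
    void close() override { if (gz) { gzclose(gz); gz = nullptr; } }
};
```

ZFILE would then pick a stream implementation based on the configured output format and write everything through the common interface.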

denravonska changed the title from "Add support for dumping straight to gzip." to "db_dump: Add support for dumping straight to gzip." on Oct 17, 2018
JuhaSointusalo (Contributor)

I thought you were redoing earlier changes, but no, it was actually db_purge the last time. Both deal with compressed files, so there could be some code-sharing opportunities for a future enhancement. Your approach with classes looks nicer than db_purge's void pointers.

    void write(const char* fmt, va_list args) {
        gzvprintf(gz, fmt, args);
    }
JuhaSointusalo (Contributor)

gzvprintf() is a relatively new addition. RHEL 6 users are not going to like you using it, and I don't think RHEL 7 users will either. Another problem is that it uses an 8 kB buffer, and at least team descriptions can be larger than that.

denravonska (Contributor Author), Oct 17, 2018

Good point. I considered the small buffer but dismissed it as an issue since I only saw small writes. I'll take another look to see what the alternatives are. Compatibility might indeed be an issue.

denravonska (Contributor Author)

How about using vasprintf and then dumping the entire buffer to libz? It's going to be a bit slower, but you are not bound by a fixed buffer size.

Out of curiosity I double-checked printf, and its minimum buffer size is apparently 4095 characters.

gzvprintf is limited both by zlib version and by its buffer size.
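
A minimal sketch of that vasprintf-based alternative, assuming a gzFile handle like the one in the hunk above; the function name here is made up for the example:

```cpp
#include <cstdarg>
#include <cstdlib>
#include <zlib.h>

// Format into a heap buffer sized by vasprintf(), then hand the whole
// buffer to zlib. This avoids gzvprintf()'s 8 kB internal buffer and its
// availability problem on older zlib versions.
void gz_write_fmt(gzFile gz, const char* fmt, va_list args) {
    char* buf = nullptr;
    int len = vasprintf(&buf, fmt, args);  // GNU/BSD extension, not ISO C
    if (len < 0) return;                   // allocation or formatting failed
    gzwrite(gz, buf, (unsigned)len);
    free(buf);
}
```

Since vasprintf() is a GNU/BSD extension, a fully portable variant could call vsnprintf() twice, once with a null buffer to measure the length and once to format into a malloc'd buffer, before passing the result to gzwrite().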
// The caller, ZFILE, is responsible for ensuring that the output stream it
// uses is in the correct state. This makes the stream implementations smaller.
//
// File streams
//
class OutputStream {
denravonska (Contributor Author)

Should the class names be uppercase?

TheAspens (Member)

The coding standards are here: https://boinc.berkeley.edu/trac/wiki/CodingStyle, which I think states that class names should be upper case. @davidpanderson can confirm if needed.

denravonska (Contributor Author)

@TheAspens Thank you! I'll check it out and tweak the classes.

JuhaSointusalo (Contributor)

I started thinking. This code with classes and all is good. But...

Just before GDPR came into effect I downloaded host stats from all the projects BOINCstats knows of. Every single project had gzipped files. `db_dump` could just as well unconditionally output gzipped files and nobody would notice.

So... I wonder if those other output options should be removed to clean up the code a bit, and db_dump hard-coded to always gzip files? The note at the top of db_dump.cpp kind of hints in that direction.

denravonska (Contributor Author)

@JuhaSointusalo I had the same thought when I did this. I assumed the flexibility was needed, but I have yet to see an export that was not gzipped, apart from the tables.xml in the VirtualBox sample.

TheAspens (Member)

For WCG we have always used the gzip option - so we wouldn't object to that change either.

TheAspens (Member)

@denravonska - if you decide you are going to remove those options (which are apparently unused), then add WIP to the pull request. I will try to review this either over the weekend or on Monday, but I'll leave it alone if I see the WIP.

TheAspens (Member)

One thought, though: since this change is complete (except for the coding standards review), maybe leave this pull request as is and make the additional changes to remove the unused options in another pull request.

TheAspens (Member), Oct 19, 2018

(Sorry for the multiple messages.) Either way, if you could post in this thread when you feel you are ready for review, that would let anyone who can do the review know for sure. Thanks!

denravonska (Contributor Author)

Changed the class names to conform to the standard. Any help testing would be appreciated as I'm still trying to find the password to the database in the virtual machine :)

JuhaSointusalo (Contributor) left a review

Uncompressed and gzipped output work fine. Regular zip doesn't, but it didn't work before these changes either: zip wants both the archive name and a file list on the command line. Since clearly nobody is using zipped files, and the option may be going away, I don't see a need to fix it.

JuhaSointusalo (Contributor)

I'll leave it for @TheAspens to test this on a real project database.

TheAspens (Member)

I tried this three times against a database with a million host rows and a fair number of user rows, and it ran in roughly the same time as master. However, since the files were small enough to be cached in memory, it may not have reproduced the issue Rom outlined in the associated issue.

It produced files identical to those from the version in master.

Sorry for the delay in getting this tested. However, it looks good and I'm going to merge it. Thanks for the contribution!

TheAspens merged commit bfe1fdd into BOINC:master on Oct 31, 2018
denravonska (Contributor Author)

Great! Out of interest, what was the rough size of the gzip?

denravonska deleted the refactor-stat-export branch on November 1, 2018, 07:40
TheAspens (Member)

The host records I created were 105 MB compressed and 732 MB uncompressed.

denravonska (Contributor Author)

That is a pretty large data set, and I would expect some speedup unless the host is very beefy.
