
Traffic needed for 'list' and 'prune' operations on a remote repository increases linearly with time #167

Open
Ernest0x opened this issue Dec 29, 2014 · 4 comments

Comments

@Ernest0x
Contributor

I have a pretty frequent backup task which is scheduled to run every 5 minutes and uses a remote attic repository. The task basically consists of an attic list operation to get the time of the last backup, a create operation to actually take a new backup, and a prune operation to delete older archives.

The problem is that the traffic this task generates increases linearly with time. I am not saying linearly with the number of archives, because the prune command has already started pruning older backups, so at each execution one new archive is created and one old archive is deleted. What surprises me is that the traffic generated by the create command is negligible compared to the traffic generated by the list and prune commands. The create command takes only 2-3 seconds, while the list and prune commands take much longer (~2-3 minutes combined) and generate a lot of traffic.

The direction of that traffic is: remote repo -> target host.
It looks like the attic list and prune commands need to fetch a lot of data from the remote repository in order to do what they do, which does not make sense to me. The text listing of all archives (measured by piping the output of attic list through wc -c) is only ~125 KB.
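For anyone wanting to reproduce the measurement, one rough way to attribute traffic to each attic command is to snapshot the interface RX byte counter before and after the command and take the difference. This is a hedged sketch, not part of attic: the parser below works on `/proc/net/dev`-style text (Linux), and the interface name is an assumption you would adjust for your machine.

```python
def rx_bytes(proc_net_dev_text, iface):
    """Return received bytes for `iface` from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        line = line.strip()
        if line.startswith(iface + ":"):
            # The first field after the colon is the RX bytes counter.
            fields = line.split(":", 1)[1].split()
            return int(fields[0])
    raise ValueError("interface not found: %s" % iface)
```

Usage would be something like: read `/proc/net/dev` before running `attic list`, read it again afterwards, and diff the two `rx_bytes` values for your network interface.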

Here are some stats from a 'create' command:

 Archive name: frequent-backup_2014-12-29-16:01
 Archive fingerprint: f6b0794b37aa66dfe0f11abbf178057c5ed198afc5f1aabf117ec58a8f4f45d5
 Start time: Mon Dec 29 16:01:33 2014
 End time: Mon Dec 29 16:01:35 2014
 Duration: 2.01 seconds
 Number of files: 2862

                        Original size      Compressed size    Deduplicated size
 This archive:                2.98 GB            520.24 MB            316.50 kB
 All archives:                5.57 TB            968.40 GB              2.76 GB

Any thoughts?

If there is any extra information that would be helpful, let me know.

@Ernest0x
Contributor Author

In addition, upgrading attic from 0.13 to 0.14 on both the target and remote repo machines does not make any difference.

@Ernest0x
Contributor Author

Well, I figured out that all the slowness and high traffic comes from calling Archive.list_archives(), a method used by both do_prune() and do_list() that creates an Archive object for each archive in the repository. This is very costly in time and traffic, because the remote repository is accessed once per archive.
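The access pattern described above can be sketched roughly like this (simplified, hypothetical names, not attic's actual code): every archive in the repository triggers a separate fetch of its full metadata, so listing costs one round trip per archive.

```python
class FakeRepository:
    """Stands in for a remote repository; counts round trips."""
    def __init__(self, archives):
        self._archives = archives   # archive id -> metadata dict
        self.round_trips = 0

    def get(self, archive_id):
        self.round_trips += 1       # each call is one network round trip
        return self._archives[archive_id]

def list_archives(repo, archive_ids):
    """Fetch full metadata of every archive just to read name and time."""
    result = []
    for aid in archive_ids:
        meta = repo.get(aid)        # whole metadata fetched, items included
        result.append((meta["name"], meta["time"]))
    return result
```

With N archives this performs N fetches, which would explain why list/prune cost grows with the number of archives kept in the repository.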

As a workaround, I changed my backup task so that it now caches the last backup time locally, avoiding the list operation; this cut the traffic roughly in half. However, I am afraid I cannot do anything about the prune operation, other than not pruning so often.
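The workaround could look something like the following minimal sketch: persist the last backup time in a small local state file so the scheduled task can skip the expensive `attic list` call. The file path and JSON format are purely illustrative assumptions.

```python
import json
import os

STATE_FILE = "/tmp/last_backup_state.json"   # illustrative location

def save_last_backup_time(ts, path=STATE_FILE):
    """Record the timestamp of the backup that just completed."""
    with open(path, "w") as f:
        json.dump({"last_backup": ts}, f)

def load_last_backup_time(path=STATE_FILE):
    """Return the cached timestamp, or None to fall back to `attic list`."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["last_backup"]
```

The backup script would call `load_last_backup_time()` at the start and `save_last_backup_time()` after a successful `attic create`, only falling back to `attic list` when the cache file is missing.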

@jborg, if I am not missing some important information here, I think that the list and prune operations could/should mostly be done on the remote repository side, with only the results reported back to the 'client' side, so as to avoid all that traffic and those round trips. Do you agree?

@ThomasWaldmann
Contributor

@Ernest0x some thoughts:

You say it surprises you that backup is so fast. Well, if you run it every 5 minutes, not much has changed since the last backup. Attic keeps a local cache with file info from the last backup, so it can quickly skip over unchanged files and only examines changed ones more deeply. Even for the changed ones, it only transfers chunks to the backup repo that are new and not already stored there (it has a local chunk id cache to decide that quickly).
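The chunk-id-cache decision described above can be illustrated with a small sketch (again simplified and hypothetical, not attic's real code): only chunks whose id is absent from the local cache need to be sent to the repository.

```python
import hashlib

def chunk_id(data):
    """Content-derived chunk identifier (attic uses keyed HMACs; a plain
    SHA-256 stands in here for illustration)."""
    return hashlib.sha256(data).hexdigest()

def chunks_to_upload(chunks, known_ids):
    """Return only the chunks not already recorded in the local id cache,
    updating the cache for the ones we are about to upload."""
    new = []
    for chunk in chunks:
        cid = chunk_id(chunk)
        if cid not in known_ids:
            known_ids.add(cid)
            new.append(chunk)
    return new
```

This is why the `create` in the original report deduplicates 520 MB of compressed data down to ~316 kB of actual transfer: almost every chunk is already known.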

About list and prune: I think the root cause of the high traffic you observed is that the "items" list of an archive is contained in its main metadata dictionary, and that dictionary needs to be loaded even for small pieces of info like the name and timestamp. Unlike the name and timestamp, the items list can get rather big: it is a list of chunk ids for all the chunks that store the items' metadata.

So "list" only needs a little info from there, but it must load the complete data structure.
"prune" at first also needs only a little info, mostly the timestamp and name. If it decides to delete an archive, it of course also needs the items list for THAT archive.
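The layout described above can be sketched as follows (field names simplified and hypothetical): the small fields that `list` actually needs are bundled into the same dictionary as the potentially huge "items" chunk-id list, so fetching one means fetching both.

```python
import json

# Hypothetical archive metadata dict; attic's real format is msgpack-based,
# JSON is used here only to make the size difference easy to see.
archive_metadata = {
    "name": "frequent-backup_2014-12-29-16:01",
    "time": "2014-12-29T16:01:33",
    # Chunk ids referencing the item (file) metadata: this part grows
    # with the number of files and dominates the download size.
    "items": ["id%06d" % i for i in range(100000)],
}

small_part = json.dumps({k: archive_metadata[k] for k in ("name", "time")})
whole_blob = json.dumps(archive_metadata)
# `list` only needs small_part, but the repo stores (and serves) whole_blob.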

@Ernest0x
Contributor Author

@ThomasWaldmann I may have misstated that: it does not surprise me how fast the backup is, but how slow the list/prune commands are while the backup is that fast. As for your suggestion, it is not clear to me exactly what change you propose. It sounds like it would require changes to the repository format, right? Maybe a diagram showing a 'prune' operation and how it accesses the remote repository's (proposed) data structures would help.
