
Traffic needed for 'list' and 'prune' operations on a remote repository increases linearly with time #167

Open
Ernest0x opened this issue Dec 29, 2014 · 4 comments

Comments

@Ernest0x
Contributor

I have a pretty frequent backup task which is scheduled to run every 5 minutes and uses a remote attic repository. The task basically consists of an attic list operation to get the time of the last backup, a create operation to actually take a new backup, and a prune operation to delete older archives.

The problem is that the traffic this task generates increases linearly with time. I am not saying linearly with the number of archives, because the prune command has already started pruning older backups, so at each execution one new archive is created and one old archive is deleted. What surprises me is that the traffic generated by the create command is negligible compared to the traffic generated by the list and prune commands. The create command takes only 2-3 seconds, while the list and prune commands take much longer (~2-3 minutes combined) and generate a lot of traffic.

The direction of that traffic is: remote repo -> target host.
It looks like the attic list and prune commands need to fetch a lot of data from the remote repository in order to do what they do, which does not make sense to me. The text listing of all archives (measured by piping the output of attic list through wc -c) is only ~125 KB.
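For anyone wanting to reproduce the measurement, one rough way to attribute traffic to each attic command is to snapshot the interface RX byte counter before and after the command and take the difference. This is a hedged sketch, not part of attic: the parser below works on `/proc/net/dev`-style text (Linux), and the interface name is an assumption you would adjust for your machine.

```python
def rx_bytes(proc_net_dev_text, iface):
    """Return received bytes for `iface` from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        line = line.strip()
        if line.startswith(iface + ":"):
            # The first field after the colon is the RX bytes counter.
            fields = line.split(":", 1)[1].split()
            return int(fields[0])
    raise ValueError("interface not found: %s" % iface)
```

Usage would be something like: read `/proc/net/dev` before running `attic list`, read it again afterwards, and diff the two `rx_bytes` values for your network interface.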

Here are some stats from a 'create' command:

 Archive name: frequent-backup_2014-12-29-16:01
 Archive fingerprint: f6b0794b37aa66dfe0f11abbf178057c5ed198afc5f1aabf117ec58a8f4f45d5
 Start time: Mon Dec 29 16:01:33 2014
 End time: Mon Dec 29 16:01:35 2014
 Duration: 2.01 seconds
 Number of files: 2862

                        Original size      Compressed size    Deduplicated size
 This archive:                2.98 GB            520.24 MB            316.50 kB
 All archives:                5.57 TB            968.40 GB              2.76 GB

Any thoughts?

If there is any extra information that would be helpful, let me know.

@Ernest0x
Contributor Author

In addition, upgrading attic from 0.13 to 0.14 on both the target and remote repo machines does not make any difference.

@Ernest0x
Contributor Author

Well, I figured out that all the slowness and high traffic comes from calling Archive.list_archives(), a method used by both do_prune() and do_list() that creates an Archive object for each archive in the repository. This is very costly in time and traffic, because the remote repository is accessed once per archive.
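The access pattern described above can be sketched roughly like this (simplified, hypothetical names, not attic's actual code): every archive in the repository triggers a separate fetch of its full metadata, so listing costs one round trip per archive.

```python
class FakeRepository:
    """Stands in for a remote repository; counts round trips."""
    def __init__(self, archives):
        self._archives = archives   # archive id -> metadata dict
        self.round_trips = 0

    def get(self, archive_id):
        self.round_trips += 1       # each call is one network round trip
        return self._archives[archive_id]

def list_archives(repo, archive_ids):
    """Fetch full metadata of every archive just to read name and time."""
    result = []
    for aid in archive_ids:
        meta = repo.get(aid)        # whole metadata fetched, items included
        result.append((meta["name"], meta["time"]))
    return result
```

With N archives this performs N fetches, which would explain why list/prune cost grows with the number of archives kept in the repository.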

As a workaround, I changed my backup task so that it now caches the last backup time locally, avoiding the list operation; this cut the traffic roughly in half. However, I am afraid I cannot do anything about the prune operation, other than not pruning so often.
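The workaround could look something like the following minimal sketch: persist the last backup time in a small local state file so the scheduled task can skip the expensive `attic list` call. The file path and JSON format are purely illustrative assumptions.

```python
import json
import os

STATE_FILE = "/tmp/last_backup_state.json"   # illustrative location

def save_last_backup_time(ts, path=STATE_FILE):
    """Record the timestamp of the backup that just completed."""
    with open(path, "w") as f:
        json.dump({"last_backup": ts}, f)

def load_last_backup_time(path=STATE_FILE):
    """Return the cached timestamp, or None to fall back to `attic list`."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["last_backup"]
```

The backup script would call `load_last_backup_time()` at the start and `save_last_backup_time()` after a successful `attic create`, only falling back to `attic list` when the cache file is missing.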

@jborg, if I am not missing some important information here, I think that the list and prune operations could/should mostly be done on the remote repository side, with only the results reported back to the 'client' side, so as to avoid all that traffic and those round trips. Do you agree?

@ThomasWaldmann
Contributor

@Ernest0x some thoughts:

You say it surprises you that backup is so fast. Well, if you run it every 5 minutes, not much has changed since the last backup. Attic keeps a local cache with file info from the last backup, so it can quickly skip over unchanged files and only examines changed ones more deeply. Even for the changed ones, it only transfers chunks to the backup repo that are new and not already stored there (it has a local chunk id cache to decide that quickly).
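The chunk-id-cache decision described above can be illustrated with a small sketch (again simplified and hypothetical, not attic's real code): only chunks whose id is absent from the local cache need to be sent to the repository.

```python
import hashlib

def chunk_id(data):
    """Content-derived chunk identifier (attic uses keyed HMACs; a plain
    SHA-256 stands in here for illustration)."""
    return hashlib.sha256(data).hexdigest()

def chunks_to_upload(chunks, known_ids):
    """Return only the chunks not already recorded in the local id cache,
    updating the cache for the ones we are about to upload."""
    new = []
    for chunk in chunks:
        cid = chunk_id(chunk)
        if cid not in known_ids:
            known_ids.add(cid)
            new.append(chunk)
    return new
```

This is why the `create` in the original report deduplicates 520 MB of compressed data down to ~316 kB of actual transfer: almost every chunk is already known.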

About list and prune: I think the root cause of the high traffic you observed is that the "items" list of an archive is contained in its main metadata dictionary, and that dictionary needs to be loaded even for small pieces of info like the name and timestamp. Unlike the name and timestamp, the items list can get rather big: it is a list of chunk ids for all the chunks that store the items' metadata.

So "list" only needs a little info from there, but it must load the complete data structure.
"prune" at first also needs only a little info, mostly the timestamp and name. If it decides to delete an archive, it of course also needs the items list for THAT archive.
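The layout described above can be sketched as follows (field names simplified and hypothetical): the small fields that `list` actually needs are bundled into the same dictionary as the potentially huge "items" chunk-id list, so fetching one means fetching both.

```python
import json

# Hypothetical archive metadata dict; attic's real format is msgpack-based,
# JSON is used here only to make the size difference easy to see.
archive_metadata = {
    "name": "frequent-backup_2014-12-29-16:01",
    "time": "2014-12-29T16:01:33",
    # Chunk ids referencing the item (file) metadata: this part grows
    # with the number of files and dominates the download size.
    "items": ["id%06d" % i for i in range(100000)],
}

small_part = json.dumps({k: archive_metadata[k] for k in ("name", "time")})
whole_blob = json.dumps(archive_metadata)
# `list` only needs small_part, but the repo stores (and serves) whole_blob.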

@Ernest0x
Contributor Author

@ThomasWaldmann I may have misstated that: it does not surprise me how fast the backup is, but how slow the list/prune commands are while the backup is that fast. As for your suggestion, it is not clear to me exactly what change you propose. It sounds like it would require changes to the repository format, right? Maybe a diagram showing a 'prune' operation and how it accesses the remote repository's (proposed) data structures would help.
