Segfault with borg 1.1.16 on Alpine Linux #5899
Comments
There is a lot of C code below that call. Can you run it with gdb, so we get more information about the precise bug location?
Tried to do this today, but unfortunately I could not trigger the segfault when calling borgbackup interactively. I'll instrument our normal backup runs to write a core dump on crash and do a post-mortem analysis based on the core file, but this can take a couple of days because we only do one backup run per day and not every run triggers a segfault.
Hmm, if it is not reproducible, it could also be a hardware issue, e.g. faulty RAM - maybe run memtest86+ for at least one pass?
I don't think it is a hardware issue (especially faulty RAM), because in that case I should have noticed errors in applications other than borgbackup, and the errors in borgbackup should not consistently occur at the same place in the software. Anyway, by instrumenting our normal backup runs I managed to get a core dump I can analyze with gdb. Here's the stack trace for the crash:
Stack frames 0 through 8 are the system handling the segfault. The interesting part, where the segfault actually occurred, should be frame 9, somewhere in borg's compiled C extension code. Unfortunately, Alpine Linux does not seem to ship debug symbols for borgbackup (neither in the default package nor as a separate debug package).
Here's the same stack trace with debug symbols for python3 installed (from the corresponding debug package):
Thanks for the effort, but I guess I really need the information about the code line where it crashes. Considering that other people didn't report this crash, it might be something specific to your environment or to your borg build.
Understandable. I'll try to build debug symbols for borgbackup (and contribute them back to Alpine Linux, since other packages do have supplementary debug symbol packages available), but I don't know how soon I can get around to that.
It probably is, but at the moment I'm not sure if it is a generic borgbackup-on-Alpine-Linux problem (Alpine uses musl as its libc, which could be the cause) or something specific to my setup. Feel free to close the issue; I'll open a new one if/when I have more information.
You should definitely do a memory test. Borg is very good at finding memory issues, since it has error detection mechanisms. Many issues posted here that only involve borg crashing turn out to be faulty hardware. If you can't run memtest86+, at least run something like memtester that doesn't require downtime. Also, check SMART for disk errors.
I guess it might be useful to keep this open for a while, just in case other Alpine Linux users experience the same issue and/or can add more specific information. If I get at least the borg version, module name AND code line number that is causing the segfault, I can have a look at the code and check whether there is a bug or a compatibility problem in borg. Without that, I can't do anything about it.
I'm having pretty much this exact same issue. I'm using the Borgmatic docker image (b3vis/borgmatic) on Debian, and it only happens every so often (about once every week or two, with backups running hourly). The Docker image is based on Alpine though, so the fact that I run Debian on the host probably doesn't matter. I don't think it's faulty hardware because there's no weirdness with other programs. This server is a Hetzner VPS, if it matters.
I'm having the same problem. Environment:
I see the same traceback on several devices, although I am unable to trigger it reliably. I'm running a similar borg setup with the same repository on other Raspbian/Debian and Ubuntu devices and can't remember a segfault there. It might be some weird interaction with musl (Alpine) instead of glibc (Debian).
Running on 32-bit and 64-bit ARM (Raspberry Pi).
I did get a different traceback back in September, before an update, but unfortunately I didn't record the old version information.
Let's look at the traceback. The crash really seems to be in that compact() call. That C function lives in borg's _hashindex.c. To point to the real problem, we need the precise line number in the code; it can be found by running the code within gdb and letting it crash.
Also seeing this. How can I attach gdb to borg so it gives you what is needed to understand the fault, @ThomasWaldmann? (A command line to use when running borg, or ...?)
You can find examples of gdb usage in this issue tracker or via Google; I use it rarely.
I'm still seeing this, and this is what I'm getting from running it in gdb:
I'm not sure what I'm looking at or for, so any hints and help are very much appreciated :)
I guess this is a problem:
The Cython object is compiled without debug symbols. I'll try to figure out how to change that.
Without further info we can't do anything about this.
My apologies for the long silence. I'm still affected by this, seeing it repeatedly for 60 nightly runs in a row. For me, it's a journey of figuring out how to build gdb for arm+alpine, how to use cygdb, and (presumably, when I get to it) how to build borg with debug symbols and reproduce the crash under the debugger. I only get a few hours for this every other week, so it'll be a while, but I still hope to get there and contribute the info when I do. Thanks!
Same here, but the other way around: since sometime in fall last year (October, I think), the issue stopped happening on my installation. No real idea why; I changed nothing on the setup besides pulling in the regular updates from alpine.org. I think it is still worth investigating, but as @kidmose said, this is quite a bit of effort on Alpine because no ready-made debug information exists for the libraries in question.
@tomaszduda23 I guess there is no need to check which part of the condition it was. Both of these macros access the memory at idx, so if idx is beyond the valid range, that is of course a problem.
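To make the evaluation-order point concrete, here is a minimal, self-contained C sketch. The bucket layout, macro names (BUCKET_IS_EMPTY / BUCKET_IS_DELETED), and sentinel values are invented for illustration; this is not borg's actual _hashindex.c code, just the pattern the fix is about: the bound check on idx has to come before any macro that reads memory at idx.

```c
/* Illustrative sketch only -- invented bucket layout and macro names,
 * not borg's actual _hashindex.c. */
#include <stdio.h>
#include <stdint.h>

#define NUM_BUCKETS    8
#define BUCKET_EMPTY   0xFFFFFFFFu
#define BUCKET_DELETED 0xFFFFFFFEu

/* both macros dereference memory at idx */
#define BUCKET_IS_EMPTY(b, idx)   ((b)[(idx)] == BUCKET_EMPTY)
#define BUCKET_IS_DELETED(b, idx) ((b)[(idx)] == BUCKET_DELETED)

int main(void) {
    uint32_t buckets[NUM_BUCKETS];
    for (int i = 0; i < NUM_BUCKETS; i++)
        buckets[i] = BUCKET_EMPTY;          /* all buckets unused */

    int idx = 0;
    /* buggy order: the macros run before the bound check, so once idx
     * reaches NUM_BUCKETS they read one element past the array:
     *
     *   while ((BUCKET_IS_EMPTY(buckets, idx) ||
     *           BUCKET_IS_DELETED(buckets, idx)) && idx < NUM_BUCKETS)
     *       idx++;
     */

    /* fixed order: && short-circuits, so the macros never see an
     * out-of-range idx */
    while (idx < NUM_BUCKETS &&
           (BUCKET_IS_EMPTY(buckets, idx) || BUCKET_IS_DELETED(buckets, idx)))
        idx++;

    printf("first used bucket (NUM_BUCKETS means none): %d\n", idx);
    return 0;
}
```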
For the fix, see PR #6816.
Is there a tool like Valgrind that could be used in a test suite to check for errors like this?
Good question; I am not familiar with such C tools.
I find the hashindex_compact code a bit confusing, but at least I could not find an error besides the issues addressed in PR #6816. If somebody has time, please review.
@enkore please check ^^^
hashindex_compact: fix eval order (check idx before use), fixes borgbackup#5899; also: fix "off by one" comment
But can it be used with Cython modules? And is there a way to trigger this bug so it could be added as a test? OTOH, maybe a very short independent executable could be built using _hashindex.c that manually adds a few entries and calls hashindex_compact?
It should be possible to use it; it is just a matter of adding C compilation flags and loading the libs: https://tobywf.com/2021/02/python-ext-asan/ I have not tested it though.
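Purely as an illustration (a standalone toy program, not borg code and not the Cython build from the link above), an out-of-bounds read of the same shape as the bug discussed here is exactly what AddressSanitizer catches when the file is compiled with -fsanitize=address:

```c
/* Toy example, not borg code. Compile with: cc -fsanitize=address -g oob.c
 * Without the flag it usually runs silently despite the undefined behavior;
 * with it, ASan aborts with a stack-buffer-overflow report pointing at the
 * offending read. */
#include <stdio.h>

#define NUM_BUCKETS 8

int main(void) {
    int buckets[NUM_BUCKETS] = {0};
    int idx = 0;
    /* bound check comes after the read, so buckets[NUM_BUCKETS] is read
     * once idx runs off the end -- the class of bug fixed in PR #6816 */
    while (buckets[idx] == 0 && idx < NUM_BUCKETS)
        idx++;
    printf("idx = %d\n", idx);
    return 0;
}
```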
@enkore Can you explain what hashindex_compact is supposed to do? Is it moving all non-empty and non-deleted (tombstone) entries to an initial segment of the hash table? If so, doesn't this destroy efficient lookup? The only place I can see it called from is cache.write_archive_index, so maybe it is intended to "ruin" the hash table but make it take up less space on disk?
It is so we can save the index to disk without all the empty and deleted buckets, so the file takes up less space. At load time, the compact data is read back and the table is resized again, so it works as a normal hash table afterwards.
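For readers following along, here is a conceptual sketch of such a compaction step, using an invented bucket layout rather than borg's real _hashindex.c structures. It only shows the idea: used buckets are copied to the front so that just the first `used` entries need to be written out, at the price of breaking hash-based lookup until the table is rebuilt at load time.

```c
/* Conceptual sketch with an invented bucket layout (not borg's real code). */
#include <stdio.h>
#include <stdint.h>

#define NUM_BUCKETS    8
#define BUCKET_EMPTY   0xFFFFFFFFu
#define BUCKET_DELETED 0xFFFFFFFEu

typedef struct { uint32_t key; uint32_t value; } bucket_t;

static int bucket_used(const bucket_t *b) {
    return b->key != BUCKET_EMPTY && b->key != BUCKET_DELETED;
}

/* returns the number of used buckets, now located at indices [0, used) */
static int compact(bucket_t *buckets, int num_buckets) {
    int used = 0;
    for (int idx = 0; idx < num_buckets; idx++) {
        if (bucket_used(&buckets[idx]))
            buckets[used++] = buckets[idx];   /* copy down, overwrite gaps */
    }
    return used;
}

int main(void) {
    bucket_t buckets[NUM_BUCKETS] = {
        { BUCKET_EMPTY, 0 }, { 42, 1 }, { BUCKET_DELETED, 0 }, { 7, 2 },
        { BUCKET_EMPTY, 0 }, { BUCKET_EMPTY, 0 }, { 13, 3 }, { BUCKET_EMPTY, 0 },
    };
    int used = compact(buckets, NUM_BUCKETS);
    printf("%d used buckets would be written to disk\n", used);
    return 0;
}
```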
If I'm right, I wonder if it would be cleaner to get rid of hashindex_compact, and instead have hashindex_write iterate over the hash table and only write out the non-empty/non-deleted entries (or do this when passed a flag). The logic of that would be clear, it would probably be faster, and it would be non-destructive... But obviously there's a lot going on here that I'm not aware of, so take this with a grain of salt.
But does that hash table get used after being written to disk? If so, I think keys will be in random positions, so lookups will be slow. In any case, it seems strange to me to reorganize a hash table in memory, with lots of copying, just so that the non-empty buckets can be written to a file, since that could be done directly.
Yes, it gets used whenever the main chunks index needs to be rebuilt from these per-archive chunks indexes. I agree that just writing the occupied buckets to disk would also work and might be easier than what we do now; it needs checking in the code how much effort that actually is and whether it is worth the change (the current code just writes one big block to disk). A similar consideration has to be made at load time: instead of load-compact + resize, we could also load each single bucket and directly build a hash table of the desired size.
hashindex_compact: fix eval order (check idx before use), fixes #5899
@jdchristensen Thinking now about writing out the used buckets individually, without compacting the hash table first, I have the feeling it might be much slower, as this would generate very many very small writes (e.g. ~40 bytes per bucket and per write). Some buffering by the OS will likely make it a bit less bad, but probably still not as efficient as compacting plus one big write. And our hash table files can be rather large...
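A rough sketch of that tradeoff, with an invented 40-byte bucket size and made-up file names (not borg's actual I/O code): variant A issues one fwrite() call per occupied bucket, variant B compacts in memory first and then issues a single large write. stdio/OS buffering softens the per-bucket cost, as noted above, but the single big write remains the simpler path.

```c
/* Sketch of the I/O tradeoff -- invented bucket size and file names. */
#include <stdio.h>
#include <string.h>

#define BUCKET_SIZE 40          /* roughly 40 bytes per bucket, as noted above */
#define NUM_BUCKETS 1000

static unsigned char table[NUM_BUCKETS][BUCKET_SIZE];
static int occupied[NUM_BUCKETS];   /* 1 = bucket is in use */

int main(void) {
    /* mark every third bucket as occupied, just to have some data */
    for (int i = 0; i < NUM_BUCKETS; i++)
        occupied[i] = (i % 3 == 0);

    /* Variant A: one small write per occupied bucket */
    FILE *fa = fopen("index_a.bin", "wb");
    if (!fa) return 1;
    for (int i = 0; i < NUM_BUCKETS; i++)
        if (occupied[i])
            fwrite(table[i], BUCKET_SIZE, 1, fa);
    fclose(fa);

    /* Variant B: compact occupied buckets to the front (destructive,
     * like the compaction discussed above), then one big write */
    int used = 0;
    for (int i = 0; i < NUM_BUCKETS; i++) {
        if (occupied[i]) {
            if (used != i)
                memcpy(table[used], table[i], BUCKET_SIZE);
            used++;
        }
    }
    FILE *fb = fopen("index_b.bin", "wb");
    if (!fb) return 1;
    fwrite(table, BUCKET_SIZE, (size_t)used, fb);
    fclose(fb);

    printf("both variants wrote %d buckets\n", used);
    return 0;
}
```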
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes
Is this a BUG / ISSUE report or a QUESTION?
BUG
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
borg 1.1.16
Operating system (distribution) and version.
Alpine Linux "edge" (rolling release)
Hardware / network configuration, and filesystems used.
How much data is handled by borg?
Original data: ~ 500GB (filesystems and metadata of ~20 containers)
After deduplication: ~10GB per backup run in 20 archives (1 archive per container)
Total (all archives): ~3TB
Describe the problem you're observing.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
I use borgbackup to take daily backups of LXC/LXD containers on a dedicated container host. Each backup run calls borg create in the toplevel directory of a snapshot of the container file system (snapshots are created outside of borgbackup using Ceph tools). $ARCHIVE_URL is the SSH URL of an archive on a remote borgbackup server, and $CONTAINER is the container ID.
The first call to borg create in a backup run fails with a segfault on about every second run (see below for the backtrace). The subsequent calls to back up different containers in the same run have had no errors so far. The segfault is not specific to one container: if I change the order of the containers for backup, I get the same error for a different container, but always for the first borg create call of a backup run.
Include any warning/errors/backtraces from the system logs
The crash always occurs at the same line in Python code: line 740 in /usr/lib/python3.9/site-packages/borg/cache.py, which is simply a call to chunk_idx.compact(), where chunk_idx is an instance of ChunkIndex, which is implemented in C as far as I can see. So I suspect a bug in chunk_idx.compact().