Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault with borg 1.1.16 on Alpine Linux #5899

Closed
rkunze opened this issue Jul 5, 2021 · 59 comments
Closed

Segfault with borg 1.1.16 on Alpine Linux #5899

rkunze opened this issue Jul 5, 2021 · 59 comments
Labels
Milestone

Comments

@rkunze
Copy link

rkunze commented Jul 5, 2021

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

BUG

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.1.16

Operating system (distribution) and version.

Alpine Linux "edge" (rolling release)

Hardware / network configuration, and filesystems used.

  • Borgbackup metadata (cache and configuration) is in a btrfs partition on the host machine
  • Data to back up (LXC/LXD container metadata and filesystems) is ext4 on Ceph RBD block devices

How much data is handled by borg?

Original data: ~ 500GB (filesystems and metadata of ~20 containers)
After deduplication: ~10GB per backup run in 20 archives (1 archive per container)
Total (all archives): ~3TB

Describe the problem you're observing.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

I use borgbackup to take daily backups of LXC/LXD containers on a dedicated container host. Each backup run calls

borg create "$ARCHIVE_URL::$CONTAINER-{now:%Y-%m-%dT%H:%M:%S}" . --stats

in the toplevel directory of a snapshot of the container file system (snapshots are created outside of borgbackup using Ceph tools). $ARCHIVE_URL is the SSH URL of an archive on a remote borgbackup server, and $CONTAINER is the container ID.

The first call to borg create in this backup run fails with a segfault on about every second run (see below for the backtrace).

The subsequent calls to back up different containers in the same run have had no errors so far.

The segfault is not specific to one container: If I change the order of the containers for backup I get the same error for a different container - but always for the first borg create call of a backup run.

Include any warning/errors/backtraces from the system logs

The crash occurs always in the same line in python code:

Fatal Python error: Segmentation fault

Current thread 0x00007fe4c6126b48 (most recent call first):
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 740 in write_archive_index
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 736 in fetch_and_build_idx
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 824 in create_master_idx
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 870 in sync
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 474 in __init__
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 379 in local
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 387 in __new__
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 586 in do_create
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 177 in wrapper
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4622 in run
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4690 in main
  File "/usr/bin/borg", line 33 in <module>
Segmentation fault

Line 740 in /usr/lib/python3.9/site-packages/borg/cache.py is simply

compact_chunks_archive_saved_space += chunk_idx.compact()

where chunk_idx is an instance of ChunkIndex, which is implemeted in C as far as I can see.
So I suspect a bug in chunk_idx.compact()

@ThomasWaldmann
Copy link
Member

There is a lot of C code below that call, can you run with gdb, so we get more information about the precise bug location.
See other (potentially closed) issues with segfaults in borg.

@ThomasWaldmann ThomasWaldmann added this to the 1.1.17 milestone Jul 5, 2021
@rkunze
Copy link
Author

rkunze commented Jul 6, 2021

Tried to do this today, but unfortunately I could not trigger the segfault when calling borgbackup interactively.

I'll instrument our normal backup runs to write a core dump on crash and do a post-mortem analysis based on the core file, but this can take a couple of days because we only do one backup run per day and not every run triggers a segfault.

@ThomasWaldmann
Copy link
Member

Hmm, if it is not reproducible, it could also be a hw issue, e.g. faulty RAM - maybe run memtest86+ for at least one pass?

@rkunze
Copy link
Author

rkunze commented Jul 8, 2021

I don't think it is a hardware issue (especially faulty RAM) because in that case, I should have noticed errors in other applications than borgbackup, and the errors in borgbackup should not consistently occur at the same place in the software.

Anyways, by instrumenting our normal backup runs I managed to get a core dump I can analyze with gdb. Here's the stack trace for the crash:

# gdb /usr/bin/python3 /var/tmp/borg.110780.core 
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python3...
(No debugging symbols found in /usr/bin/python3)

warning: core file may not match specified executable file.
[New LWP 110780]
Core was generated by `/usr/bin/python3 /usr/bin/borg create ssh://backup@localhost:5618/media/Tivano-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f2eca7833f2 in setjmp () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007f2eca7833f2 in setjmp () from /lib/ld-musl-x86_64.so.1
#1  0x00007f2eca78354d in raise () from /lib/ld-musl-x86_64.so.1
#2  0x000000000000000b in ?? ()
#3  0x00007f2eca7d3b7c in ?? () from /lib/ld-musl-x86_64.so.1
#4  0x0000000000000002 in ?? ()
#5  0x00007f2eca623924 in ?? () from /usr/lib/libpython3.9.so.1.0
#6  0x00007f2eca4e1bdb in ?? () from /usr/lib/libpython3.9.so.1.0
#7  0x00007f2eca4e1cd5 in ?? () from /usr/lib/libpython3.9.so.1.0
#8  <signal handler called>
#9  0x00007f2ec9a778b3 in ?? () from /usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-x86_64-linux-musl.so
#10 0x00007f2eca51b69f in ?? () from /usr/lib/libpython3.9.so.1.0
#11 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#12 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#13 0x00007f2eca512b29 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#14 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#15 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#16 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#17 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#18 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#19 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#20 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#21 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#22 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#23 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#24 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#25 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#26 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#27 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#28 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#29 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#30 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#31 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#32 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#33 0x00007f2eca512b29 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#34 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#35 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#36 0x00007f2eca5142be in _PyObject_FastCallDictTstate () from /usr/lib/libpython3.9.so.1.0
#37 0x00007f2eca51d25d in _PyObject_Call_Prepend () from /usr/lib/libpython3.9.so.1.0
#38 0x00007f2eca51d08d in ?? () from /usr/lib/libpython3.9.so.1.0
#39 0x00007f2eca514746 in ?? () from /usr/lib/libpython3.9.so.1.0
#40 0x00007f2eca514613 in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0
#41 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#42 0x00007f2eca512c47 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#43 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#44 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#45 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#46 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#47 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#48 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#49 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#50 0x00007f2eca5142be in _PyObject_FastCallDictTstate () from /usr/lib/libpython3.9.so.1.0
#51 0x00007f2eca51d25d in _PyObject_Call_Prepend () from /usr/lib/libpython3.9.so.1.0
#52 0x00007f2eca53e518 in ?? () from /usr/lib/libpython3.9.so.1.0
#53 0x00007f2eca514704 in ?? () from /usr/lib/libpython3.9.so.1.0
#54 0x00007f2eca514613 in _PyObject_MakeTpCall () from /usr/lib/libpython3.9.so.1.0
#55 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#56 0x00007f2eca512c47 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#57 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#58 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#59 0x00007f2eca51e3fa in PyVectorcall_Call () from /usr/lib/libpython3.9.so.1.0
#60 0x00007f2eca513011 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#61 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#62 0x00007f2eca517e65 in _PyFunction_Vectorcall () from /usr/lib/libpython3.9.so.1.0
#63 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#64 0x00007f2eca51e0d9 in ?? () from /usr/lib/libpython3.9.so.1.0
#65 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#66 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#67 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#68 0x00007f2eca517efd in ?? () from /usr/lib/libpython3.9.so.1.0
#69 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#70 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#71 0x00007f2eca512b29 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#72 0x00007f2eca517efd in ?? () from /usr/lib/libpython3.9.so.1.0
#73 0x00007f2eca513cd7 in ?? () from /usr/lib/libpython3.9.so.1.0
#74 0x00007f2eca513c08 in ?? () from /usr/lib/libpython3.9.so.1.0
#75 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.9.so.1.0
#76 0x00007f2eca50d131 in ?? () from /usr/lib/libpython3.9.so.1.0
#77 0x00007f2eca56b230 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.9.so.1.0
#78 0x00007f2eca56b1b0 in PyEval_EvalCodeEx () from /usr/lib/libpython3.9.so.1.0
#79 0x00007f2eca56b172 in PyEval_EvalCode () from /usr/lib/libpython3.9.so.1.0
#80 0x00007f2eca57de3f in ?? () from /usr/lib/libpython3.9.so.1.0
#81 0x00007f2eca578328 in ?? () from /usr/lib/libpython3.9.so.1.0
#82 0x00007f2eca4ca6a4 in ?? () from /usr/lib/libpython3.9.so.1.0
#83 0x00007f2eca4ca432 in PyRun_SimpleFileExFlags () from /usr/lib/libpython3.9.so.1.0
#84 0x00007f2eca593065 in Py_RunMain () from /usr/lib/libpython3.9.so.1.0
#85 0x00007f2eca55a477 in Py_BytesMain () from /usr/lib/libpython3.9.so.1.0
#86 0x00007f2eca759a03 in ?? () from /lib/ld-musl-x86_64.so.1
#87 0x00007f2eca7599dc in ?? () from /lib/ld-musl-x86_64.so.1
#88 0x00007ffc177eb9e0 in ?? ()
#89 0x0000000000000000 in ?? ()
(gdb) 

Stack frames 0 through 8 is the system handling the segfault. The interesting part where the segfault actually occurred should be frame 9, somewhere in /usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-x86_64-linux-musl.so.

Unfortunately, Alpine Linux seems to not ship debug symbols for borgbackup (neither in the default borgbackup package nor in any extra debugging package as far as I can see), so I can't pinpoint it further.

@rkunze
Copy link
Author

rkunze commented Jul 8, 2021

Here's the same stacktrace with debug symbols for python3 (package python3-dbg in alpine Linux) installed. Unfortunately, still not debug symbols for /usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-x86_64-linux-musl.so:

# gdb /usr/bin/python3 /var/tmp/borg.110780.core 
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python3...
Reading symbols from /usr/lib/debug//usr/bin/python3.9.debug...

warning: core file may not match specified executable file.
[New LWP 110780]
Core was generated by `/usr/bin/python3 /usr/bin/borg create ssh://backup@localhost:5618/media/Tivano-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f2eca7833f2 in setjmp () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007f2eca7833f2 in setjmp () from /lib/ld-musl-x86_64.so.1
#1  0x00007f2eca78354d in raise () from /lib/ld-musl-x86_64.so.1
#2  0x000000000000000b in ?? ()
#3  0x00007f2eca7d3b7c in ?? () from /lib/ld-musl-x86_64.so.1
#4  0x0000000000000002 in ?? ()
#5  0x00007f2eca623924 in ?? () from /usr/lib/libpython3.9.so.1.0
#6  0x00007f2eca4e1bdb in faulthandler_dump_traceback (fd=2, fd@entry=37175, all_threads=0, interp=<optimized out>) at ./Modules/faulthandler.c:243
#7  0x00007f2eca4e1cd5 in faulthandler_fatal_error (signum=0) at ./Modules/faulthandler.c:349
#8  <signal handler called>
#9  0x00007f2ec9a778b3 in ?? () from /usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-x86_64-linux-musl.so
#10 0x00007f2eca51b69f in method_vectorcall_NOARGS (func=0x7f2ec96d1f40, args=0x7f2e9d892bf0, nargsf=<optimized out>, kwnames=0x0) at Objects/descrobject.c:434
#11 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec96d1f40, args=0x7f2e9d892bf0, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#12 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=0x7ffc177e90f0, oparg=oparg@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#13 0x00007f2eca512b29 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2e9d892a40, throwflag=<optimized out>) at Python/ceval.c:3504
#14 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec92b47c0, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec955ecb8, argcount=<optimized out>, kwnames=0x0, 
    kwargs=0x7f2ec955ecc8, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f2ec925e1c0, name=0x7f2ec946ad00, qualname=0x7f2ec9151d50) at Python/ceval.c:4327
#15 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec955ecb8, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:396
#16 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f025e0, args=0x7f2ec955ecb8, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:118
#17 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177e9418, oparg=oparg@entry=2, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#18 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec955eae0, throwflag=<optimized out>) at Python/ceval.c:3518
#19 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec92b4710, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec94d0ef0, argcount=<optimized out>, kwnames=0x0, 
    kwargs=0x7f2ec94d0f08, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f2ec8dbc1d0, name=0x7f2ec946ac60, qualname=0x7f2ec9151cf0) at Python/ceval.c:4327
#20 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec94d0ef0, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:396
#21 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f02550, args=0x7f2ec94d0ef0, nargsf=9223372036854775811, kwnames=0x0) at ./Include/cpython/abstract.h:118
#22 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177e9738, oparg=oparg@entry=3, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#23 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec94d0cc0, throwflag=<optimized out>) at Python/ceval.c:3518
#24 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec92b49d0, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2e9d4a6868, argcount=<optimized out>, kwnames=0x0, 
    kwargs=0x7f2e9d4a6870, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f2ec8e0d5e0, name=0x7f2ec946af80, qualname=0x7f2ec92b6090) at Python/ceval.c:4327
#25 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2e9d4a6868, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:396
#26 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f02430, args=0x7f2e9d4a6868, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#27 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177e9a58, oparg=oparg@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#28 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2e9d4a6640, throwflag=<optimized out>) at Python/ceval.c:3518
#29 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec92b4c90, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec8efa600, argcount=<optimized out>, kwnames=0x0, 
    kwargs=0x7f2ec8efa608, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7f2eca39c8f0, qualname=0x7f2ec915ae70) at Python/ceval.c:4327
#30 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec8efa600, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:396
#31 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec9105dc0, args=0x7f2ec8efa600, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#32 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=0x7ffc177e9d70, oparg=oparg@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#33 0x00007f2eca512b29 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec8efa440, throwflag=<optimized out>) at Python/ceval.c:3504
#34 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec912c870, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec8fc5d58, argcount=<optimized out>, kwnames=0x7f2ec8e04fa8, 
    kwargs=0x7f2ec8fc5d60, kwcount=9, kwstep=1, defs=0x7f2ec92d0a18, defcount=6, kwdefs=0x0, closure=0x0, name=0x7f2eca42b530, qualname=0x7f2ec92b1120) at Python/ceval.c:4327
#35 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec8fc5d58, nargsf=<optimized out>, kwnames=0x7f2ec8e04f90) at Objects/call.c:396
#36 0x00007f2eca5142be in _PyObject_FastCallDictTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec9105790, args=0x7ffc177e9fd0, nargsf=<optimized out>, kwargs=0x7f2ec9256140) at Objects/call.c:129
#37 0x00007f2eca51d25d in _PyObject_Call_Prepend (tstate=0x7f2eca46ec30, callable=0x7f2ec9105790, obj=<optimized out>, args=0x7f2eca42f040, kwargs=<optimized out>) at Objects/call.c:489
#38 0x00007f2eca51d08d in slot_tp_init (self=self@entry=0x7f2ec8f2b9d0, args=args@entry=0x7f2eca42f040, kwds=kwds@entry=0x7f2ec9256140) at Objects/typeobject.c:6943
#39 0x00007f2eca514746 in type_call (type=<optimized out>, type@entry=0x7f2ec95aeba0, args=args@entry=0x7f2eca42f040, kwds=kwds@entry=0x7f2ec9256140) at Objects/typeobject.c:1026
#40 0x00007f2eca514613 in _PyObject_MakeTpCall (tstate=0x7f2eca46ec30, callable=0x7f2ec95aeba0, args=<optimized out>, nargs=0, keywords=0x7f2ec9161350) at Objects/call.c:191
#41 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177ea210, oparg=oparg@entry=9, kwnames=kwnames@entry=0x7f2ec9161350) at Python/ceval.c:5072
#42 0x00007f2eca512c47 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec95733f0, throwflag=<optimized out>) at Python/ceval.c:3535
#43 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec912c240, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec9577c20, argcount=<optimized out>, kwnames=0x0, 
    kwargs=0x7f2ec9577c20, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f2ec9bb97b0, name=0x7f2eca058a30, qualname=0x7f2ec915bda0) at Python/ceval.c:4327
#44 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec9577c20, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:396
#45 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f21700, args=0x7f2ec9577c20, nargsf=9223372036854775808, kwnames=0x0) at ./Include/cpython/abstract.h:118
#46 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177ea528, oparg=oparg@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#47 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec95779e0, throwflag=<optimized out>) at Python/ceval.c:3518
#48 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec912c3a0, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec8fc5db8, argcount=<optimized out>, kwnames=0x7f2ec94c0e38, 
    kwargs=0x7f2ec8fc5dd8, kwcount=6, kwstep=1, defs=0x7f2ec90e8b48, defcount=9, kwdefs=0x0, closure=0x0, name=0x7f2eca42b570, qualname=0x7f2ec915a8f0) at Python/ceval.c:4327
#49 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec8fc5db8, nargsf=<optimized out>, kwnames=0x7f2ec94c0e20) at Objects/call.c:396
#50 0x00007f2eca5142be in _PyObject_FastCallDictTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec91050d0, args=0x7ffc177ea780, nargsf=<optimized out>, kwargs=0x7f2ec9253d80) at Objects/call.c:129
#51 0x00007f2eca51d25d in _PyObject_Call_Prepend (tstate=0x7f2eca46ec30, callable=0x7f2ec91050d0, obj=<optimized out>, args=0x7f2ec8e06dc0, kwargs=<optimized out>) at Objects/call.c:489
--Type <RET> for more, q to quit, c to continue without paging--c
#52 0x00007f2eca53e518 in slot_tp_new (type=0x7f2ec95adbe0, args=0x7f2ec8e06dc0, kwds=0x7f2ec9253d80) at Objects/typeobject.c:6973
#53 0x00007f2eca514704 in type_call (type=type@entry=0x7f2ec95adbe0, args=args@entry=0x7f2ec8e06dc0, kwds=kwds@entry=0x7f2ec9253d80) at Objects/typeobject.c:1014
#54 0x00007f2eca514613 in _PyObject_MakeTpCall (tstate=0x7f2eca46ec30, callable=0x7f2ec95adbe0, args=<optimized out>, nargs=3, keywords=0x7f2ec9e7ffa0) at Objects/call.c:191
#55 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177ea9b0, oparg=oparg@entry=9, kwnames=kwnames@entry=0x7f2ec9e7ffa0) at Python/ceval.c:5072
#56 0x00007f2eca512c47 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec9581640, throwflag=<optimized out>) at Python/ceval.c:3535
#57 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec9e753a0, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec9142de8, argcount=<optimized out>, kwnames=0x7f2ec8e01898, kwargs=0x7f2ec9142df8, kwcount=3, kwstep=1, defs=0x7f2ec9d063d8, defcount=2, kwdefs=0x0, closure=0x0, name=0x7f2ec9cd5030, qualname=0x7f2ec9e77d50) at Python/ceval.c:4327
#58 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec9142de8, nargsf=<optimized out>, kwnames=0x7f2ec8e01880) at Objects/call.c:396
#59 0x00007f2eca51e3fa in PyVectorcall_Call (callable=0x7f2ec8f26820, tuple=<optimized out>, kwargs=0x7f2ec92560c0) at Objects/call.c:243
#60 0x00007f2eca513011 in do_call_core (kwdict=0x7f2ec92560c0, callargs=0x7f2ec9b3dbc0, func=0x7f2ec8f26820, tstate=<optimized out>) at Python/ceval.c:5120
#61 _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec95757d0, throwflag=<optimized out>) at Python/ceval.c:3580
#62 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=0x7f2ec9e7b0e0, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f2ec9580888, argcount=<optimized out>, kwnames=0x0, kwargs=0x7f2ec9580898, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f2ec8e10540, name=0x7f2ec9cd5030, qualname=0x7f2ec9e77d50) at Python/ceval.c:4327
#63 0x00007f2eca517e65 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f2ec9580888, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:396
#64 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f268b0, args=0x7f2ec9580888, nargsf=2, kwnames=0x0) at ./Include/cpython/abstract.h:118
#65 0x00007f2eca51e0d9 in method_vectorcall (method=0x7f2ec8f04a80, args=0x7f2ec9580890, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:53
#66 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f04a80, args=0x7f2ec9580890, nargsf=9223372036854775809, kwnames=0x0) at ./Include/cpython/abstract.h:118
#67 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177eb078, oparg=oparg@entry=1, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#68 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec95806d0, throwflag=<optimized out>) at Python/ceval.c:3518
#69 0x00007f2eca517efd in function_code_fastcall (tstate=0x7f2eca46ec30, co=<optimized out>, args=<optimized out>, nargs=2, globals=<optimized out>) at Objects/call.c:330
#70 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f214c0, args=0x7f2ec958e438, nargsf=9223372036854775810, kwnames=0x0) at ./Include/cpython/abstract.h:118
#71 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=0x7ffc177eb290, oparg=oparg@entry=2, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#72 0x00007f2eca512b29 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2ec958e240, throwflag=<optimized out>) at Python/ceval.c:3504
#73 0x00007f2eca517efd in function_code_fastcall (tstate=0x7f2eca46ec30, co=<optimized out>, args=<optimized out>, nargs=0, globals=<optimized out>) at Objects/call.c:330
#74 0x00007f2eca513cd7 in _PyObject_VectorcallTstate (tstate=0x7f2eca46ec30, callable=0x7f2ec8f215e0, args=0x7f2eca3093f0, nargsf=9223372036854775808, kwnames=0x0) at ./Include/cpython/abstract.h:118
#75 0x00007f2eca513c08 in call_function (tstate=tstate@entry=0x7f2eca46ec30, pp_stack=pp_stack@entry=0x7ffc177eb4b8, oparg=oparg@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5072
#76 0x00007f2eca512bb4 in _PyEval_EvalFrameDefault (tstate=0x7f2eca46ec30, f=0x7f2eca309270, throwflag=<optimized out>) at Python/ceval.c:3518
#77 0x00007f2eca50d131 in _PyEval_EvalCode (tstate=0x7f2eca46ec30, _co=_co@entry=0x7f2eca268710, globals=globals@entry=0x7f2eca346600, locals=locals@entry=0x7f2eca346600, args=args@entry=0x0, argcount=argcount@entry=0, kwnames=0x0, kwargs=0x0, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4327
#78 0x00007f2eca56b230 in _PyEval_EvalCodeWithName (_co=0x7f2eca268710, globals=0x7f2eca346600, locals=0x7f2eca346600, args=0x0, argcount=0, kwnames=0x0, kwargs=0x0, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4359
#79 0x00007f2eca56b1b0 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4375
#80 0x00007f2eca56b172 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:826
#81 0x00007f2eca57de3f in run_eval_code_obj (tstate=tstate@entry=0x7f2eca46ec30, co=co@entry=0x7f2eca268710, globals=globals@entry=0x7f2eca346600, locals=locals@entry=0x7f2eca346600) at Python/pythonrun.c:1219
#82 0x00007f2eca578328 in run_mod (mod=mod@entry=0x7f2eca0d33e8, filename=filename@entry=0x7f2eca354db0, globals=globals@entry=0x7f2eca346600, locals=locals@entry=0x7f2eca346600, flags=flags@entry=0x7ffc177eb890, arena=arena@entry=0x7f2eca3d6910) at Python/pythonrun.c:1240
#83 0x00007f2eca4ca6a4 in pyrun_file (fp=fp@entry=0x7f2eca22faa0, filename=filename@entry=0x7f2eca354db0, start=start@entry=257, globals=globals@entry=0x7f2eca346600, locals=locals@entry=0x7f2eca346600, closeit=closeit@entry=1, flags=0x7ffc177eb890) at Python/pythonrun.c:1138
#84 0x00007f2eca4ca432 in pyrun_simple_file (flags=0x7ffc177eb890, closeit=1, filename=0x7f2eca354db0, fp=0x7f2eca22faa0) at Python/pythonrun.c:449
#85 PyRun_SimpleFileExFlags (fp=0x7f2eca22faa0, filename=<optimized out>, closeit=1, flags=0x7ffc177eb890) at Python/pythonrun.c:482
#86 0x00007f2eca593065 in pymain_run_file (cf=0x7ffc177eb890, config=0x7f2eca46c400) at Modules/main.c:373
#87 pymain_run_python (exitcode=0x7ffc177eb888) at Modules/main.c:598
#88 Py_RunMain () at Modules/main.c:677
#89 0x00007f2eca55a477 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:731
#90 0x00007f2eca759a03 in ?? () from /lib/ld-musl-x86_64.so.1
#91 0x00007f2eca7599dc in ?? () from /lib/ld-musl-x86_64.so.1
#92 0x00007ffc177eb9e0 in ?? ()
#93 0x0000000000000000 in ?? ()
(gdb) 

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jul 8, 2021

Thanks for the effort, but guess i really need the information about the code line where it crashes.

Considering that other people didn't report that crash, it might be something specific to your environment or to your borg build.

@rkunze
Copy link
Author

rkunze commented Jul 8, 2021

Thanks for the effort, but guess i really need the information about the code line where it crashes.

Understandable. I'll try and build debug symbols for borgbackup (and contribute it back to Alpine Linux, since other packages do have supplementary debug symbol packages available), but I don't know how soon I can get around to that.

Considering that other people didn't report that crash, it might be something specific to your environment or to your borg build.

It probably is, but at the moment I'm not sure if it is a generic borgbackup on Alpine Linux problem (Alpine uses musl as libc, could be caused by that) or specific to my setup.

Feel free to close the issue, I'll open a new issue if/when I have more information.

@jdchristensen
Copy link
Contributor

You should definitely do a memory test. Borg is very good at finding memory issues, since it has error detection mechanisms. Many issues posted here, that only involve borg crashing, turn out to be faulty hardware. If you can't run memtest86+, at least run something like memtester that doesn't require downtime. Also, check SMART for disk errors.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jul 9, 2021

Guess it might be useful to keep this open for a while, just in case other alpine linux users experience the same issue and/or can add more / more specific information.

If I get at least borg version, module name AND code line number that is causing the segfault, I can have a look at the code and check if there is a bug or compatibility problem in borg. Without that, I can't do anything about it.

@BBaoVanC
Copy link

BBaoVanC commented Sep 1, 2021

I'm having pretty much this exact same issue; I'm using the Borgmatic docker image (b3vis/borgmatic) on Debian, and it only happens every so often (like once every week or two, with backups running hourly). The Docker image is based on Alpine though, so the fact that I run Debian on my host probably doesn't matter.

I don't think it's faulty hardware because there's no weirdness with other programs. This server is a Hetzner VPS if it matters.

@sinbino
Copy link

sinbino commented Sep 8, 2021

I'm having the same problem.
I use borg in an alpine container under alpine host.
Sometimes I get a segfault.
It always causes a segfault in write_archive_index.

Environment:

  • Alpine Linux 3.14.1
  • Borg 1.1.16
Current thread 0x00007f4eafe6fb48 (most recent call first):
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 740 in write_archive_index
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 736 in fetch_and_build_idx
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 824 in create_master_idx
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 870 in sync
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 474 in __init__
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 379 in local
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 387 in __new__
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 586 in do_create
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 177 in wrapper
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4622 in run
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4690 in main
  File "/usr/bin/borg", line 33 in <module>
Segmentation fault (core dumped)

@eljoest
Copy link

eljoest commented Nov 7, 2021

I see the same traceback on several devices, although I am unable to trigger it reliably.

I'm running a similar borg setup with the same repository on other raspian/debian & ubuntu devices and can't remember a segfault. It might be some weird interaction with libmusl (alpine) instead of libc (debian).

borg create --stats --compression=zlib,2 --list --filter=AME <backupname> <path or paths> <several --excludes>

Fatal Python error: Segmentation fault

Current thread 0x0000007fa1383c48 (most recent call first):
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 740 in write_archive_index
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 736 in fetch_and_build_idx
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 824 in create_master_idx
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 870 in sync
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 474 in __init__
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 379 in local
  File "/usr/lib/python3.9/site-packages/borg/cache.py", line 387 in __new__
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 586 in do_create
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 177 in wrapper
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4622 in run
  File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4690 in main
  File "/usr/bin/borg", line 33 in <module>

Running on 32 bit and 64 bit arm (Raspberry Pi),

  • borg 1.1.16

  • Python 3.9.5

  • alpine 3.14.0/3.14.2

  • inside docker

  • Backup cache is on a btrfs partition

  • Repository is on a btrfs disk of a different host, connecting through ssh

  • All archives:

    • Original size 8.40 TB,
    • Compressed size 7.87 TB,
    • Deduplicated size 276.04 GB
    • Unique chunks 952397
    • Total chunks 20011066

I did get a different Traceback back in September before an update, but unfortunately I didn't record the old version information.

Fatal Python error: Segmentation fault                                                                                                                                                                              Current thread 0xb6f2c558 (most recent call first):
  File "/usr/lib/python3.8/site-packages/borg/archive.py", line 186 in backup_io_iter
  File "/usr/lib/python3.8/site-packages/borg/archive.py", line 998 in chunk_file
  File "/usr/lib/python3.8/site-packages/borg/archive.py", line 1071 in process_file                        File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 625 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 649 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 649 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 649 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 649 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 649 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 649 in _process
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 535 in create_inner
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 574 in do_create
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 166 in wrapper
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 4461 in run
  File "/usr/lib/python3.8/site-packages/borg/archiver.py", line 4529 in main
  File "/usr/bin/borg", line 11 in <module>

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Nov 7, 2021

Let's look at cache.py:740 first:

            compact_chunks_archive_saved_space += chunk_idx.compact()

The crash really seems to be in that hashindex_compact function, but as this is C code, it does not show details about it in the python traceback.

That C function is in _hashindex.c:626+ and it just does some hashtable checks and memmove and memcpy.

To point to the real problem, we need the precise line number in the code - it can be found by running the code within gdb and letting it crash.

@ThomasWaldmann
Copy link
Member

The archive.py:186 issue seems to point to some problem in chunkify maybe, but without any version information and without gdb pointing to the specific line of C code that causes the segfault, we can not find that problem.

@kidmose
Copy link

kidmose commented Dec 9, 2021

Also seing this:

backup_1   | 2021-12-09T03:04:32.106682299Z Fatal Python error: Segmentation fault
backup_1   | 2021-12-09T03:04:32.106792091Z
backup_1   | 2021-12-09T03:04:32.106827091Z Current thread 0xb6f6d390 (most recent call first):
backup_1   | 2021-12-09T03:04:32.107060549Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 740 in write_archive_index
backup_1   | 2021-12-09T03:04:32.107227341Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 736 in fetch_and_build_idx
backup_1   | 2021-12-09T03:04:32.107519216Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 824 in create_master_idx
backup_1   | 2021-12-09T03:04:32.107715924Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 870 in sync
backup_1   | 2021-12-09T03:04:32.107941132Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 474 in __init__
backup_1   | 2021-12-09T03:04:32.108296340Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 379 in local
backup_1   | 2021-12-09T03:04:32.108483965Z   File "/usr/lib/python3.9/site-packages/borg/cache.py", line 387 in __new__
backup_1   | 2021-12-09T03:04:32.109525215Z   File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 589 in do_create
backup_1   | 2021-12-09T03:04:32.109564965Z   File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 177 in wrapper
backup_1   | 2021-12-09T03:04:32.109601257Z   File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4635 in run
backup_1   | 2021-12-09T03:04:32.109634423Z   File "/usr/lib/python3.9/site-packages/borg/archiver.py", line 4703 in main
backup_1   | 2021-12-09T03:04:32.109667923Z   File "/usr/bin/borg", line 33 in <module>
backup_1   | 2021-12-09T03:04:32.440414819Z Segmentation fault (core dumped)

with borg 1.1.17 in a Alpine Linux 3.15 docker container, running on a odroid HC1 (armv7l) with armbian (Kernel: 5.4.160).

How can I attach gdb to borg so it gives you what is needed to understand the fault, @ThomasWaldmann ? (A commandline to use when running borg or ...?)

@ThomasWaldmann
Copy link
Member

You can find examples of gdb usage in this issue tracker or via google, I use it rarely.

@kidmose
Copy link

kidmose commented Jan 4, 2022

I'm still seeing this, and this is what i'm getting from running it in gdb;

+ cat /etc/issue
Welcome to Alpine Linux 3.15
Kernel \r on an \m (\l)

+ uname -a
Linux eba1fa84bf8c 5.4.160-odroidxu4 #21.08.6 SMP PREEMPT Mon Nov 22 12:18:25 UTC 2021 armv7l Linux
+ borg --version
borg 1.1.17
++ which borg
+ gdb --args /usr/bin/python3 /usr/bin/borg create --stats '::nextcloud-{hostname}-{now}' /var/www/html /db_backup
GNU gdb (GDB) 11.1
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv7-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/python3...
Reading symbols from /usr/lib/debug//usr/bin/python3.9.debug...
(gdb) r
Starting program: /usr/bin/python3 /usr/bin/borg create --stats ::nextcloud-\{hostname\}-\{now\} /var/www/html /db_backup

Program received signal SIGILL, Illegal instruction.
0xb673999e in ?? () from /lib/libcrypto.so.1.1
(gdb) c
Continuing.

Program received signal SIGILL, Illegal instruction.
0xb673998c in ?? () from /lib/libcrypto.so.1.1
(gdb) c
Continuing.

Program received signal SIGILL, Illegal instruction.
0xb6739992 in ?? () from /lib/libcrypto.so.1.1
(gdb) c
Continuing.

Program received signal SIGILL, Illegal instruction.
0xb6739998 in ?? () from /lib/libcrypto.so.1.1
(gdb) c
Continuing.
[Detaching after fork from child process 82]
[Detaching after fork from child process 83]

Program received signal SIGSEGV, Segmentation fault.
0xb630de76 in ?? () from /usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-arm-linux-musleabihf.so
(gdb) 

The SIGILLs from /lib/libcrypto.so.1.1 seems to be expexted (https://stackoverflow.com/a/45653148)

I'm not sure what I'm looking at or for, so any hints and help is very much appreciated :)

@kidmose
Copy link

kidmose commented Jan 4, 2022

I guess this is a problem:

objdump --syms /usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-arm-linux-musleabihf.so

/usr/lib/python3.9/site-packages/borg/hashindex.cpython-39-arm-linux-musleabihf.so:     file format elf32-littlearm

SYMBOL TABLE:
no symbols

The cython object is compiled without debug symbols. I'll try and figure out how to change that.

@ThomasWaldmann
Copy link
Member

Without further infos we can't do anything for this.

@kidmose
Copy link

kidmose commented Feb 21, 2022

My apologies for the long silence. I'm still affected by this, seeing it repeatedly for 60 nightly runs in a row. For me, it's a journey of figuring out how to build gdb for arm+alpine, how to use cygdb and (presumably, when I get to it) how to build borg with debug symbols and reproduce the crash with the debugger. I only get a few hours for this every other week, so it'll be a while, but I still hope to get there and contribute with the info when I get there. Thanks!

@rkunze
Copy link
Author

rkunze commented Feb 21, 2022

Same here, but the other way around: Since somewhen in fall last year (October I think), the issue stopped happening on my installation. No real idea why, I changed nothing on the setup besides pulling in the regular updates from alpine.org

I think it is still worth investigating, but as @kidmose said this is quite a bit of effort on alpine because no ready-made debug information exists for the libraries in question.

@ThomasWaldmann
Copy link
Member

@tomaszduda23 guess no need to check on which part of the condition. both these 2 macros access the memory at idx, so if idx is beyond the valid range, that is of course a problem.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 29, 2022

fix see PR #6816.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.x, 1.2.2 Jun 29, 2022
@jdchristensen
Copy link
Contributor

Is there a tool like valgrind that could be used in a test suite to check for errors like this?

@ThomasWaldmann
Copy link
Member

Good question, I am not familiar with such C tools.

@ThomasWaldmann
Copy link
Member

I find that hashindex_compact code a bit confusing, but at least i could not find an error besides the stuff in PR #6816. If somebody has time, please review.

@ThomasWaldmann
Copy link
Member

@enkore please check ^^^

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jun 29, 2022
@tomaszduda23
Copy link

Is there a tool like valgrind that could be used in a test suite to check for errors like this?

https://clang.llvm.org/docs/AddressSanitizer.html

@jdchristensen
Copy link
Contributor

But can it be used with cython modules? And is there a way to trigger this bug so it could be added as a test? OTOH, maybe a very short independent executable can be built using _hashindex.c that manually adds a few entries and calls hasindex_compact?

@tomaszduda23
Copy link

It should be possible to use it. It is just matter of adding c compilation flags and loading libs. https://tobywf.com/2021/02/python-ext-asan/ I have not tested it though.

@jdchristensen
Copy link
Contributor

@enkore Can you explain what hashindex_compact is supposed to do? Is it moving all non-empty and non-deleted (tombstone) entries to an initial segment of the hash table? If so, doesn't this destroy efficient lookup? The only place I can see it called is from cache.write_archive_index, so maybe it is intended to "ruin" the hash table but make it take up less space on disk?

@ThomasWaldmann
Copy link
Member

It is so we can save .compact files to disk for chunks.archives.d/*, not wasting disk space with empty buckets.

@ThomasWaldmann
Copy link
Member

At load time, if permit_compact is False it will resize such a compact hashtable, so there are free buckets.

@jdchristensen
Copy link
Contributor

If I'm right, I wonder if it would be cleaner to get rid of hashindex_compact, and instead have hashindex_write iterate over the hash table and only write out the non-empty/non-deleted entries (or do this when passed a flag). The logic of that would be clear, it would probably be faster, and it would be non-destructive... But obviously there's a lot going on here that I'm not aware of, so take this with a grain of salt.

@jdchristensen
Copy link
Contributor

It is so we can save .compact files to disk for chunks.archives.d/*, not wasting disk space with empty buckets.

But does that hash table get used after being written to disk? If so, I think keys will be in random positions, so look ups will be slow. In any case, it seems strange to me to reorganize a hash table in memory, with lots of copying, just so that the non-empty buckets can get written to a file, since that could be done directly.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jun 30, 2022

Yes, it gets used whenever it needs to rebuild the main chunks index from these per-archive chunks indexes.
So it loads the compact representation into memory, then resizes it (== rebuilds a bigger hashtable from that data).
Only after that it will do lookups, so there's no problem with performance.

I agree that just writing the occupied buckets to disk would also work and might be easier than what we do now, needs checking in the code how much effort that actually is and whether it is worth the change (the current code just writes one big block to disk).

A similar consideration has to be done at load time: instead of load compact + resize, we could also load each single bucket and directly build the hashtable of the desired size.

ThomasWaldmann added a commit that referenced this issue Jun 30, 2022
…ster

hashindex_compact: fix eval order (check idx before use), fixes #5899
ThomasWaldmann added a commit that referenced this issue Jun 30, 2022
hashindex_compact: fix eval order (check idx before use), fixes #5899
ThomasWaldmann added a commit that referenced this issue Jun 30, 2022
hashindex_compact: fix eval order (check idx before use), fixes #5899
@ThomasWaldmann
Copy link
Member

@jdchristensen When thinking now about writing out the used buckets individually without compacting the hash table first, I had the feeling it might be much slower as this would generate very many very small writes (like e.g. 40 bytes per bucket / per write). Likely some buffering by the OS will make it a bit less bad though, but likely not as efficient as compacting + one big write. And our hashtable files can be rather large...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants