Possible segfault with multiple m5copy instances using the same jive5ab? #24
Hmmm, sorry to learn about this! Any idea whether, when it happens, it happens at the same phase during a transfer? That knowledge might help ...
The jive5ab we have running here reports itself as:

To transfer data as fast as possible with m5copy, we had in this case opted for running multiple m5copy instances in parallel, in this case 6 of them. Each instance used the same receiver and sender jive5ab instances, but with a different data port, with each being executed like:

After multiple failures last night, I decided to fall back to one single instance. No issues so far (a couple of hours). I also tried running the correlation job in parallel to stress it; no issues yet. So I now suspect it has to do with the multiple instances. Does this seem reasonable? I tried to re-run with multiple instances, but it did not fail straight away (it never did before either; it always took some time), so it is hard to prove. But we could set up some proper multi-instance local tests if this seems like a plausible idea. (The preferred solution would be to convince the correlator to accept etd/etc.)
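For reference, running several m5copy instances against the same sender/receiver jive5ab pair might look roughly like the sketch below. The actual commands were not quoted above, so the hosts, ports, scan pattern and even the option used to select the data port are assumptions here; consult m5copy's own help for the real option names.

```bash
# Hypothetical sketch only -- hosts, ports, scan pattern and the -udt/-p
# options are assumptions, not the commands actually used in this report.
SRC='vbs://flexbuff-sender:2620/exp_stn_*'       # sender jive5ab (control port)
DST='file://corr-receiver:2620/data/exp/'        # receiver jive5ab at the correlator

for port in 2630 2631 2632 2633 2634 2635; do    # six parallel instances
    m5copy -udt -p ${port} "${SRC}" "${DST}" > m5copy.${port}.log 2>&1 &
done
wait                                             # block until all transfers finish
```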
Thanks for the update! Based on this feedback I'm starting to suspect the ...
No vbs option I'm afraid, only vbs --> file. After waiting a little longer with 5 streams it crashed again, same error, so I'm fairly sure it's connected to this.

Interestingly: jive5ab locally crashed with the output above, leaving no instance running. Remotely, however, jive5ab survived partly. I'll explain the "partly". We need to start a receiving jive5ab instance at the remote correlator. The remote server does not have screen installed, so we start "screen" locally, then ssh to the remote correlator inside that screen, and start jive5ab there. After the crash, I checked the screen and found I had been logged out from the remote correlator. However, when logging back in, I found with "ps" that jive5ab was still running, and was accepting data. So whatever caused this crash made the local jive5ab fail totally, and the remote one only partly...
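In other words, the remote receiver is started roughly like this (session and host names invented for illustration):

```bash
screen -S jive5ab-recv         # local screen session, since the remote host has no screen
ssh oper@remote-correlator     # inside that screen: log in to the remote correlator
jive5ab                        # and start the receiving jive5ab instance there
```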
Would it be possible to compile the local version with debug (...)?
Already built with Debug; now prepared for a core dump next time it happens. Meanwhile, Simon C here checked dmesg after the failures. Almost all of them had the same message:
This would perhaps suggest it's code-related rather than e.g. bad hardware? I'll get back with a core dump if/when it happens again.
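For reference, a minimal sketch of making sure a core file is actually written the next time it crashes, assuming a Linux host where the core pattern can be changed; the pattern below is only an example.

```bash
# In the shell (or screen session) that starts jive5ab: lift the core size limit
ulimit -c unlimited

# Check where the kernel currently writes core files ...
cat /proc/sys/kernel/core_pattern

# ... and, if desired, write them to /tmp with a descriptive name
# (%e = executable, %p = pid, %h = hostname, %t = unix timestamp)
echo '/tmp/core-%e.%p.%h.%t' | sudo tee /proc/sys/kernel/core_pattern
```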
Oh it's definitely code related :-) No worries there. This snippet might already be enough to trace it to somewhere in the binary, which would help.
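A rough sketch of how such a dmesg line can be mapped to a source location, assuming a debug build; the addresses below are made up for illustration.

```bash
# dmesg reports something like:
#   jive5ab[1234]: segfault at 30 ip 000055e8617dd78e sp ... error 4 in jive5ab[55e861600000+...]
# Subtract the mapping base from the instruction pointer and resolve the
# resulting offset in the (debug) binary:
addr2line -f -C -e /usr/local/bin/jive5ab 0x1dd78e

# Or the same via gdb:
gdb -batch -ex 'info line *0x1dd78e' /usr/local/bin/jive5ab
```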
Using the coredump at ... I was able to trace the SIGSEGV *happening in the garbage collector thread* that the UDT library starts (!):

    #6 0x000055e8617dd78e in CSndBuffer::~CSndBuffer (this=0x7f8e7c0021a0, __in_chrg=<optimized out>) at /usr/local/src/jive5ab.git/libudt5ab/buffer.cpp:101
    #7 0x000055e8617e208b in CUDT::~CUDT (this=0x55e8652c9780, __in_chrg=<optimized out>) at /usr/local/src/jive5ab.git/libudt5ab/core.cpp:195
    #8 0x000055e8617cab58 in CUDTSocket::~CUDTSocket (this=0x55e8652c9650, __in_chrg=<optimized out>) at /usr/local/src/jive5ab.git/libudt5ab/api.cpp:99

so it's likely that ~CSndBuffer() is being executed from different threads and the d'tor has no protection at all against that. That is a probable cause for sigsegv'ing, so I added a mutex (on non-Win32 systems) to see if that fixes the issue reported.
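For completeness, roughly the kind of gdb invocation that produces a backtrace like the one above; the core file path is a placeholder, substitute the actual one.

```bash
# Load the matching (debug) binary together with the core file and print the
# stacks of all threads; the SIGSEGV turns up in UDT's garbage-collector
# thread, with ~CSndBuffer() in its backtrace.
gdb -batch -ex 'thread apply all bt' \
    /usr/local/bin/jive5ab /tmp/core-jive5ab.PID.HOST.TIME
```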
Thanks! We have installed the latest issue-25 code and will use it for upcoming transfers to see if it fails. It may take a few weeks to feel confident whether it is fixed or not - I will report back once we have logged some significant transfer hours.
We had some time without errors, but yesterday evening we got a segfault again. No core dump (as far as I could see; maybe @psi-c could find one), but in dmesg I found these lines:
Not sure if it is relevant, but dmesg also shows many lines like this for several hours before the crash:
but maybe that's unrelated.
That is weird - could it be that someone is trying RPC-based attacks by randomly sending UDP data to ports? For the UDT protocol incoming traffic must be allowed; is the machine firewalled to allow incoming traffic only from the remote host you're transferring to? Otherwise, ...
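For example, restricting the UDT data port to the known transfer peer could look roughly like this; the port number and peer address are placeholders.

```bash
# Placeholders: 2630 = UDT data port, 192.0.2.10 = remote station/correlator.
# Accept UDP on the data port only from the known peer ...
sudo iptables -A INPUT -p udp --dport 2630 -s 192.0.2.10 -j ACCEPT
# ... and drop that port for everyone else, so random scans (like the
# RPC-looking packets in dmesg) never reach jive5ab.
sudo iptables -A INPUT -p udp --dport 2630 -j DROP
```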
@haavee The RPC is probably indeed due to an unrestrictive firewall; @psi-c should be able to comment on that. I think it's probably unrelated to the error at hand.
and dmesg -T said:
I get an ...
See if you can give it a poke now :)
Interesting observation + question following from that. "Of course" it happens inside the UDT library again. I see the code ...

Can you elaborate a bit on the actual transfer command line(s) used leading up to the ...
Well, we have transfers in both directions. We have had segfaults on transfers initiated both from Ishioka and Ny-Ålesund, as well as on our own outgoing transfers (mainly to USNO). In this particular case, it was dealing with incoming data from Ishioka when it aborted. I don't have their exact command, but I can ask if that helps? At least I know the target was to store the data as VBS on our end. Maybe the fix applied so far addressed an outgoing bug, but there is still one for receiving data?
Thx for the clarification. Yes it might be that the patch I did fixed outgoing issues. Please try to catch more ...
Another one:
this time also during receiving data:
core dump at gyller:/tmp/core-jive5ab-issue25.1846251.gyller.1683523682
Note: recent segfaults happen with only ONE incoming transfer stream, so this segfault seems to be unrelated to the "multiple" in the subject of this issue. Still, I think it's best for clarity to continue the list of crashes in this ticket? Most recent one: jive5ab log tail:
dmesg -T:
core dump at gyller:/tmp/core-jive5ab-issue25.15128.gyller.1684745591
EDIT: ...
Thx! I don't have (read) permission on the coredump :-(
Readable now!
We have a jive5ab instance running in a screen, to be used for sending data to correlators. On a number of occasions, this instance has crashed during a data transfer with a segfault (ugh). This evening it happened again, and I realised that I had just started a correlation job on the same machine (I never correlate while recording, but I usually don't worry about transfers, as I assume they should just go a bit slower). The jive5ab screen had this to say (with the remote receiver IP masked):

Any clues as to why this could happen? (We use m5copy to transfer data to this correlator since they do not (yet) accept etd/etc.)