File descriptors are not all released in vbs2file #37

Open
HF-vlbi opened this issue Dec 24, 2024 · 10 comments
@HF-vlbi

HF-vlbi commented Dec 24, 2024

After we updated jive5ab from 3.0.0 to 3.1.0, an error message [Too many open files] started popping up when we tried to combine the observed VLBI data on the FlexBuff.


SRC::VBS [127.0.0.1:2620] r11185_is_345-1659* ===> DST::FILE [127.0.0.1:2620:] /mnt/nfs_banana/vlbi/se_test/ Unacceptable return code 4 from query 'runtime=lkBFpvQi:transient' - !runtime = 4 : /usr/local/src/jive5ab/jive5ab-3.1.0/src/mountpoint.cc@837 [(mtab=::setmntent("/etc/mtab", "r"))] fails - Too many open files(24); [acceptable: [0]]

jive5ab_error_log.txt

In the attached log, the data [1647][1651][1654] were combined, but the file descriptors were never released, so jive5ab ended up holding those three files open.

This problem does not occur in version 3.0.0, so there appears to be a bug in jive5ab-3.1.0.

I would be grateful if someone could look into this issue and fix the bug in jive5ab-3.1.0.

Thank you, and Merry Christmas!!

@haavee
Member

haavee commented Dec 24, 2024

Thank you for reporting this issue! We will look into it.
Looking at your use case, I may have a suggestion that you could use until the issue is fixed (and maybe continue to use).

If I am correct, you want to create a single file for each of the "vbs"-style scattered recordings, right?
Should that be the case, please take a look at the VBS tools. They require "libfuse-dev" to be installed.

After compiling, the vbs_fs tool can be pointed at the mountpoint(s) containing the scattered data and will present the recordings as single files in a (temporary) mountpoint of your choice. After that, the individual recordings can be processed with any UNIX/Linux command (such as cp); they are regular files to the operating system.
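A rough sketch of that workflow (only a guess based on the description above; the FUSE mountpoint is a placeholder, the recording name and destination are taken from the log earlier in this issue, and the exact vbs_fs options should be checked with vbs_fs --help on your installation):

# present the scattered vbs recordings as single files under a FUSE mountpoint of your choice (hypothetical path)
mkdir -p /mnt/vbs_fuse
vbs_fs /mnt/vbs_fuse
# the recordings now look like ordinary files, so any UNIX tool works on them
cp /mnt/vbs_fuse/r11185_is_345-1659 /mnt/nfs_banana/vlbi/se_test/
# unmount the FUSE filesystem when done (standard FUSE unmount)
fusermount -u /mnt/vbs_fuse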

Let me know if this helps you.
Merry Christmas!

@HF-vlbi
Author

HF-vlbi commented Dec 25, 2024 via email

haavee pushed a commit that referenced this issue Jan 19, 2025
HarunaF: after scan_set= + disk2file, one (1) file descriptor was leaked;
after many disk2files the process runs out of max open files!

Indeed, this is caused by scan_set= caching the open file for later use;
vbs_read() will start /really/ opening the file(s).

Probably better to close the whole file after use; this means that after
a disk2file the scan as set by "scan_set=" is invalidated/has become
unusable for e.g. scan_check?

Whether that is desirable needs to be investigated - if the scan_set=
scan should remain accessible after disk2file= then we need to come up
with a different solution to prevent leaking file descriptors.
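For what it's worth, the leak described in this commit message can be observed from the shell; a rough sketch, assuming the jive5ab control port is 2620 (as in the log above) and using placeholder scan_set=/disk2file= arguments:

# open file descriptors held by jive5ab before the transfer
ls /proc/$(pidof jive5ab)/fd | wc -l
# select a scan and copy it to a file (arguments are placeholders, adjust to your setup)
echo "scan_set=<scan name>;" | nc -q 1 127.0.0.1 2620
echo "disk2file=<output file>;" | nc -q 1 127.0.0.1 2620
# after the disk2file completes, count again: with the leak, every
# scan_set= + disk2file cycle leaves one extra descriptor behind
ls /proc/$(pidof jive5ab)/fd | wc -l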
@haavee
Member

haavee commented Jan 19, 2025

Hi HF-vlbi,

I have created a new branch issue-37. Can you try that version?
After git clone (or git pull of an existing checkout), do git checkout issue-37 and re-compile.
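For reference, the full sequence would look roughly like this (repository URL and CMake-style build assumed; adjust to how you normally build jive5ab):

# fresh clone, or 'git pull' inside an existing checkout
git clone https://github.com/jive-vlbi/jive5ab.git
cd jive5ab
git checkout issue-37
# re-compile (jive5ab 3.x builds with CMake)
mkdir -p build && cd build
cmake .. && make -j$(nproc)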

Could you verify that this fixes the issue? Maybe it provokes other usability problems that I have not foreseen.
Thanks in advance for your testing time!

@ops-obs

ops-obs commented Jan 22, 2025

We are also seeing this with a vbs: ===> file: transfer.
jive5ab-3.1.1
Linux flexbuff-2yg 6.8.0-49-generic #49~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 6 17:42:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Unacceptable return code 4 from query 'runtime=4iS5QPxG:transient' - !runtime = 4 : /home/observer/jive5ab_src/jive5ab/src/mountpoint.cc@837 [(mtab=::setmntent("/etc/mtab", "r"))] fails - Too many open files(24); [acceptable: [0]]

At the time, we were reading for the transfer and recording (writing) concurrently.
lsof shows that jive5ab has thousands of vbs files still open after they had been transferred successfully. Normal behaviour is just 10-30 vbs file chunks open at a time for each file transfer, which then get closed.
The jive5ab log shows that during recording some disks were removed from set_disks because it was unable to write there, due to too many open files from that mountpoint.
We can't reproduce the triggering circumstances, but jive5ab 3.1.1 regularly fails with the above error when the open-file load is excessive.
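A quick way to see how close the process is to its descriptor limit (standard Linux tooling, nothing specific to jive5ab; the path pattern is a guess at the usual FlexBuff layout):

# per-process limit on open files for the running jive5ab
grep 'Max open files' /proc/$(pidof jive5ab)/limits
# number of descriptors currently held by the process
ls /proc/$(pidof jive5ab)/fd | wc -l
# how many of those point at vbs chunks on the data disks (adjust the path pattern)
lsof -p $(pidof jive5ab) | grep -c '/mnt/disk'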

@ops-obs

ops-obs commented Feb 4, 2025

We tried the issue-37 branch version.
!version? 0 : jive5ab : 3.2.0-dev : 64bit : Release : flexbuff-3hb : 04-Feb-2025 : 22h27m50s : nossapi :

Still the same behaviour with m5copy vbs ==> file transfers: open files just keep piling up after a transfer has completed. The number of files held open by the jive5ab process goes up by tens per scan transfer; after 15 minutes it is hundreds of files, a few thousand after some hours, and eventually a problematic ~10,000, at which point jive5ab can't access some disks and fails - Too many open files(24).

We notice that e.g. 11 files are open during a scan transfer; when the transfer completes, some get closed, then minutes later 11 are open again.
lsof | grep r11193_hb_034-1701_dssy | wc -l
11
lsof | grep r11193_hb_034-1701_dssy | wc -l
7
lsof | grep r11193_hb_034-1701_dssy | wc -l
11
The above is still true (alternating between 7 and 11 open files) after one hour.

We also see the "too many open files" behaviour when we mount vbs_fs and try an m5copy file ==> file transfer.
If we mount vbs_fs and use rsync to copy the files over, there is no problem.

@ops-obs

ops-obs commented Feb 11, 2025

We notice that it is a single vbs chunk, with the same file descriptor, open for reading many times by multiple threads.
jive5ab 1555485 observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
jive5ab 1555485 1555486 jive5ab observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
jive5ab 1555485 1555654 jive5ab observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
jive5ab 1555485 1603786 jive5ab observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
jive5ab 1555485 1603787 jive5ab observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
jive5ab 1555485 1603788 jive5ab observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
jive5ab 1555485 1603789 jive5ab observer 93r REG 65,17 501108448 25782338374 /mnt/disk17/r41192_yg_031-0311_sx/r41192_yg_031-0311_sx.00000023
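To summarise such lines per file, the lsof output can be grouped on its last field (the file path), counting how often each chunk of this recording is listed as open; for example:

lsof | grep r41192_yg_031-0311_sx | awk '{print $NF}' | sort | uniq -c | sort -rn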

@haavee
Member

haavee commented Feb 12, 2025

Hi ops-obs,

Thanks for your test results! The oscillation between 7 and 11 open files seems OK ("seems the root cause is solved") - but if m5copy file => file runs out of file descriptors, that's not good, so there is more to it.

In discussions with other users it would seem that the current "fix" has unwanted side effects, so this will definitely not be the final solution. Stay tuned, but don't hold your breath ...

I wonder whether it is a coincidence, or a definite hint of something fundamentally wrong, that it is the same file that remains open.

@HF-vlbi
Author

HF-vlbi commented Mar 3, 2025

Dear all,
Sorry for being late in testing the update.
During our antenna maintenance, we checked the updated version.

It seems that the previous issue was resolved and it worked well, thank you.

jive5ab_error_log_20250219.txt

However, since the VDIF file naming was changed in the new version, we decided to keep using the old version until we are ready to update our local scripts that refer to the file names.

Example:
old file name: b25011_is_011-0629b_6
new file name: b25011_is_011-0629b_ds6

Thank you for your kind and quick support.

@haavee
Member

haavee commented Mar 5, 2025

Thanks for the feedback!
(And note that this is probably not the best solution for getting vbs-recorded data into a single file on disk ... ;-) )

@haavee
Member

haavee commented Mar 5, 2025

I have one more request, if that's OK. After discussing the current approach with another user, it became clear that whilst the current fix is fine for this specific use case, it would impact other use cases badly.
So I have tried another approach that a) fixes this use case and b) does not negatively impact other use cases.

Would you be kind enough to pull the issue-37 branch again, and try this fix once more, to see if it fixes the reported problem? That would be greatly appreciated!
UPDATE 05 Mar 2025 14:53 UTC: I have implemented a better approach - the one triggering this original message broke everything! Please make sure you get the issue-37 branch as fresh as possible, i.e. after the mentioned time stamp.

Note: for details about why another approach was needed, see the commit message.
