Deadlock in threaded dgemv #660
Comments
Do you know what the input size for dgemv is? On the machine, is there only one JVM process calling OpenBLAS?
The matrix is approx 50k rows x 10k cols (the row count is ~49k and the col count is precisely 10k), and there's a single thread in a single process calling OpenBLAS.
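For reference, a minimal sketch of the kind of call involved — the dimensions match the description above, but the allocation and parameter choices here are illustrative, not taken from the actual application:

```c
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int m = 49000, n = 10000;               /* ~49k x 10k, as described above */
    double *A = calloc((size_t)m * n, sizeof(double));
    double *x = calloc((size_t)n, sizeof(double));
    double *y = calloc((size_t)m, sizeof(double));
    if (!A || !x || !y) return 1;

    /* y = 1.0 * A * x + 0.0 * y; row-major storage, leading dimension = n */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n, x, 1, 0.0, y, 1);

    free(A); free(x); free(y);
    return 0;
}
```

With a multithreaded OpenBLAS build, a single call of this size is split across the worker threads internally, which is where exec_blas_async_wait from the backtrace comes into play.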
If sched_yield uses 100% of CPU, something is badly broken with your system's process scheduler and timers.
I have the exact same problem with OpenBLAS called from numpy, same backtrace. Here's the full backtrace as suggested by @brada4: https://gist.github.com/jbaiter/8dd11939bedfa9d9bf7e417c86f02bd9 The version is 0.2.18-1ubuntu1 running on Linux 4.4.0-38-generic.
Could be related to races like
Had the same problem, via LAPACK, using libopenblas 0.2.19-1 on Debian, on an Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz with 20 cores. The main thread got stuck after a few hours in exec_blas_async_wait, called from LAPACK's dsytri. The kernel was 4.5.0-2-amd64 #1 SMP Debian 4.5.3-2 (2016-05-08) x86_64 GNU/Linux. The other threads were all in pthread_cond_destroy. (gdb) bt
I am making extremely slow progress, if any, as thread programming is not exactly my forte. Part of the problem now seems to be that blas_thread_server appears to be "thinking" in terms of cpus while exec_blas_async_wait is iterating over threads, so I have a hard time putting mutex locks in the latter on the right item to achieve anything.
Would either of you be willing to retest with the changed blas_server.c and memory.c linked from #995 (or point me to some self-contained test to check this myself), please?
I think I just ran into this problem too. I'd be very happy for a workaround .. I have a very simple test case with numpy 1.12.0 in Python 3.6.
Here's the backtrace:
@martin-frbg I don't have an easy-to-reproduce case as it happened after a very long computation launched by someone else, and the code has evolved since so that it doesn't make as many calls now. If the fix works on @sparseinference's case I would like to start using it on a development machine to catch some regressions, but there are no guarantees that I would run into the bug if it isn't fixed.
I can sacrifice some performance but still make stuff work with: export OPENBLAS_NUM_THREADS=1
Same here, we've been running with OPENBLAS_NUM_THREADS=1 since November and have had no issues. I parallelize at the process level.
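For anyone who cannot easily change the environment of the hosting process, OpenBLAS also exposes openblas_set_num_threads() (declared in its cblas.h), so the same workaround can be applied programmatically. A minimal sketch, assuming the application links against OpenBLAS directly:

```c
/* Provided by OpenBLAS; caps the number of threads used by subsequent BLAS calls. */
extern void openblas_set_num_threads(int num_threads);

void disable_blas_threading(void) {
    /* Same effect as exporting OPENBLAS_NUM_THREADS=1 before startup:
       later BLAS calls in this process run single-threaded. */
    openblas_set_num_threads(1);
}
```

Calling it once at startup, before the first BLAS call, mirrors the environment-variable workaround described above.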
@sparseinference which version of OpenBLAS do you see this with - one of the releases (including the latest 0.2.19) or the current git develop tree? The fix (if it is one) is only in the latter so far.
@martin-frbg - I think it's 2.18. The library loaded is
I had the same problem with version 0.2.18. That is, my process was deadlocked at 100% use of one core. In addition, #0 through #3 of the backtrace are identical to that reported by @bkgood. The long and repetitive calculation I was running was getting deadlocked in a small fraction of calls, but in effect multiple times a day. Based on the information in this issue thread, I then compiled OpenBLAS from the develop branch on github (commit 1acfc78, 4/3/2017), and then compiled the latest numpy (1.12.1) and scipy using this version of OpenBLAS. The calculation has since been running uninterrupted for more than a day. Thus, for me, the newest github revision most likely fixes this deadlock. I will report should the problem reappear, or appear with much lower likelihood.
We seem to be running into a similar deadlock, but interestingly it is hit when we have multiple processes on the same machine both using BLAS at the same time. Setting
Sorry for the long text, just trying to dig in deep. Perhaps this is actually a separate issue - if so, I'd be happy to open another ticket. EDIT: FWIW, we are using code that has the changes from @brada4's PR.
What makes you think the situation is similar, and what hardware are you seeing this on? (From your comment, the OpenBLAS version you use is probably something like 0.2.20, but it might help to know that as well.) Sharing resources between two independent processes would, I believe, be entirely unexpected and unwanted. While both may use the same in-memory copy of the library, the operating system should clone any memory pages that receive process-specific writes to some flag value.
The thing that makes me think it's similar is that we're showing multiple threads stuck at the same point. As for hardware, it's an Amazon EC2 instance. We've got 2 cores, x86 64-bit. The OS is RHEL 7, using kernel 3.10.0-693.17.1.el7.x86_64. We're using OpenBLAS 0.2.20. I share your confusion about any sharing between processes, yet the fact remains that we can reproduce the deadlock when running two processes, but not with one. It's possible that some indirect interaction is going on (e.g., having the CPUs tied up with one process changes the way the threads interleave for the other process in such a way that conditions become just right for the deadlock)... but I hope for our sake and for the maintainers' that the answer is more straightforward than that!
Two cores only does not sound quite right? How many threads are you trying to run on this? Any chance to reproduce this on non-virtual hardware?
Yes, it's only 2 cores. So far we've only seen this on a somewhat dinky machine we use for building and testing. Reproducing it on non-virtual hardware will be difficult because our build at present only supports RHEL 7, and we don't have any physical hardware running that.
How many threads are you expecting to run on that poor thing - I see nine in that snippet of the backtrace from just one of your concurrent processes alone? Maybe you are just seeing threads fighting for a bit of CPU time. I take it your code is proprietary and/or hard to port to different environments, or is this just some scipy?
We aren't directly controlling the number of threads in the Python application. Our stack is a Flask app which accesses BLAS through numpy/scipy. So basically, the server spawns off threads when requests come in. I don't think it's just fighting for CPU time. When run in series, the workload we're hitting it with takes less than a minute. When it hangs, we've left it for probably close to half an hour without any progress. The code is unfortunately proprietary. We only use OpenBLAS on RHEL (that's a whole other story), so we can't easily compare on other environments. I should say that at this point we may be fine just using a workaround of
I just rechecked this situation using the same Python example that I posted above. As expected, 24 threads were created and all cores were running at 100%. @augray, perhaps a test using Flask would reproduce your problem?
I've been trying to reproduce using a sharable example, but no luck so far :-( . Perhaps I would be better served to just ask for a detailed explanation of how behavior changes when
Thanks, that's helpful confirmation! We don't directly control the number of threads (that's determined by the server our app is running in). However, we haven't noticed any performance degradation except when there's a complete deadlock. For instance, submitting 6 jobs at once to two separate processes doesn't take much longer than submitting a job to a single process 3 times in serial. As for the other, non-BLAS threads in our app, they're not CPU bound, so I don't think they should affect CPU contention too much. I'm thinking we'll probably go with
I am encountering the same kind of bug. I tried to compile OpenBLAS with thread sanitizer enabled to see what happens with the test. Actually, something has been found:
I am not sure that these two data races are directly related to this issue, but it looks like compiling with sanitizers is very helpful for testing, and maybe the same approach could be used to track down the problem. EDIT: I pasted the same error twice.
I just wanted to track down at least these two.

```diff
diff --git a/driver/level3/level3_thread.c b/driver/level3/level3_thread.c
index fec873e..da87a98 100644
--- a/driver/level3/level3_thread.c
+++ b/driver/level3/level3_thread.c
@@ -91,7 +91,7 @@
 #endif
 typedef struct {
-  volatile BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
+  _Atomic BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
 } job_t;
```

The other seems to be related to a wrong mutex. Unfortunately, this is the first time I have looked at the OpenBLAS code, and I can't really hypothesize how things should work. Nevertheless, at line 965 of blas_server.c:

```diff
diff --git a/driver/others/blas_server.c b/driver/others/blas_server.c
index 9debe17..5abb362 100644
--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -960,14 +960,9 @@ int BLASFUNC(blas_thread_shutdown)(void){
   for (i = 0; i < blas_num_threads - 1; i++) {
-    blas_lock(&exec_queue_lock);
-
-    thread_status[i].queue = (blas_queue_t *)-1;
-
-    blas_unlock(&exec_queue_lock);
-
     pthread_mutex_lock (&thread_status[i].lock);
+    thread_status[i].queue = (blas_queue_t *)-1;
     thread_status[i].status = THREAD_STATUS_WAKEUP;
     pthread_cond_signal (&thread_status[i].wakeup);
```

As I said, I have never hacked on OpenBLAS before, so I could be completely wrong about the patches. I just hope that this helps somehow.
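For context, the idiom the second patch moves toward - changing the shared state and signalling while holding the same mutex the waiting thread uses - looks roughly like this as a standalone sketch (the waiter/ready names are invented for illustration; this is not OpenBLAS code):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wakeup = PTHREAD_COND_INITIALIZER;
static int ready = 0;   /* shared state, only touched while holding `lock` */

static void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                       /* re-check the predicate after every wake-up */
        pthread_cond_wait(&wakeup, &lock);
    pthread_mutex_unlock(&lock);
    puts("waiter woke up");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);

    /* Modify the state and signal while holding the same mutex the waiter
       uses; updating the state under a different lock (as the old shutdown
       code did with exec_queue_lock) can lose the wake-up. */
    pthread_mutex_lock(&lock);
    ready = 1;
    pthread_cond_signal(&wakeup);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```

The key point is that the waiter re-checks the predicate in a loop and the writer updates it under the same lock, so a wake-up cannot slip in between the check and the wait.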
The second one should actually be fixed by #1299 since September; unfortunately, this is only in the develop branch as there has not been a release in the meantime. The first one I would never have been able to find - with some luck this will finally fix the long-standing thread safety issues in the multithreaded level3 BLAS.
I created this fork to track down the issues I am finding. I will need to write code to conditionally use
Thanks, I will hold off on my partial PR then.
Very good hint about checking without sanitizers. Unfortunately, the check is triggered as well. However, I catch it from the

Nevertheless, on a completely different topic: on an x86-based machine, even if using volatile for pre-C11 could be a decent compromise (it will always allow CPU reordering, and therefore possible data races, but at least

That's the problem; now for possible solutions. AFAIK there is no way of having portable atomic operations before C11, even from POSIX. I hope I am wrong, but if this is the case, I think there are only two solutions, and using

The first is to drop pre-C11 support. Let's be honest: we are in 2018, and GCC had C11 support even before 2011. I think this could be an acceptable solution. If this is not possible, then the only way, IMHO, is to create a general set of 'atomic types' and then write wrappers for every possible operation, for every platform. In this way every simple operation would have to be replaced with an inlined function with raw asm code. This is obviously a huge amount of work, it would be very difficult to maintain, and the code would be bloated by things like

That said, I think the actual state of OpenBLAS is, unfortunately, far from thread-safe. With (relatively) modern compilers we have the possibility of throwing away a large number of issues without a particularly big effort, but pre-C11 must be dropped. I am the last one to arrive, and I will never have time to maintain this great library, therefore I cannot decide anything like this; I just wanted to show you my point of view. For now, let's try to solve the current issues ;)
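As a rough sketch of the wrapper approach described above (the names atomic_long_t, atomic_load_long, and atomic_store_long are invented for this illustration and are not part of OpenBLAS), one could fall back to the GCC/Clang __atomic builtins where C11 atomics are not available, instead of hand-written assembly for every platform:

```c
/* Hypothetical portability shim: C11 atomics where available,
   GCC/Clang __atomic builtins otherwise. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_ATOMICS__)
  #include <stdatomic.h>
  typedef _Atomic long atomic_long_t;
  #define atomic_load_long(p)      atomic_load(p)
  #define atomic_store_long(p, v)  atomic_store((p), (v))
#elif defined(__GNUC__)
  typedef long atomic_long_t;
  #define atomic_load_long(p)      __atomic_load_n((p), __ATOMIC_SEQ_CST)
  #define atomic_store_long(p, v)  __atomic_store_n((p), (v), __ATOMIC_SEQ_CST)
#else
  #error "no atomic primitives available on this platform"
#endif
```

This keeps the per-platform surface small, at the cost of requiring a reasonably recent GCC or Clang on the non-C11 path.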
As far as I know, thread safety is not an issue when OpenMP is used, and the library has been used successfully on big NUMA machines as well. For non-OpenMP builds, it seems to behave "reasonably well" in most cases, at least when USE_SIMPLE_THREADED_LEVEL3 is set. So I do not think the situation is as bleak as you make it. Adding C11
Possibly the cpuid instruction used in the x86_64 "WhereAmI()" is unsuitable for "modern" big Intel systems. https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
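For reference, a minimal sketch of the classic technique a WhereAmI()-style helper relies on - reading the initial APIC ID from CPUID leaf 1. This illustrates the approach discussed in the linked article and is not the actual OpenBLAS code; on large or hyperthreaded systems these IDs are not contiguous, which is one way the enumeration can go wrong:

```c
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

/* Returns the initial APIC ID of the core the calling thread is running on. */
static unsigned int initial_apic_id(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 not supported */
    return (ebx >> 24) & 0xff;    /* bits 31:24 of EBX hold the initial APIC ID */
}

int main(void) {
    /* Without pinning, the thread may migrate between calls,
       so two consecutive reads can legitimately differ. */
    printf("initial APIC ID: %u\n", initial_apic_id());
    return 0;
}
```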
I can show you as much information as possible, but I don't have any idea how I could find a possible issue with the OS (nevertheless, the machine works flawlessly, I just do not know what to search for to discover anomalous behaviours).
Feel free to ask me for any specific information. The question is: could this be exactly the problem we are encountering?
Sorry, it took me some time to write down the previous reply. Maybe we don't need to touch the
(Un?)fortunately it seems to me that you may well be more experienced with multithreading in general and NUMA architectures in particular than I am. As far as I can tell, gotoblas_affinity_init() in init.c does a sched_setaffinity() for itself before enumerating the cores with whereAmI(), and I assume this should be equivalent to what your test code is doing with the non- (or less-)portable pthread_setaffinity_np() function that appears to be a GNU extension.
Can you add "numactl -H" output? I have not seen out-of-order core numbering recently.
Sure!
You are absolutely right, I just took the first one in my test... just because ;) I replaced it with the

Nevertheless, I want to update you on my tests: I started thinking that the problem was about thread switching in critical sections (not the kind that could create a data race, but something like a couple of instructions one expects to be executed without interruption), but now I am not sure. The test I have now is much more similar to what is going on inside OpenBLAS -- a
I see that it is always true that the first ID is (generally) different from the second, which in turn is always equal to the third. This means that, at least on this machine with this OS, the rescheduling happens as soon as the
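A minimal standalone sketch of that kind of affinity experiment (illustrative only - the pinned CPU number and the sequence of reads are made up here, not copied from the original test):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                 /* pin the calling thread to CPU 1 (arbitrary choice) */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    int first  = sched_getcpu();      /* on the machine discussed above, this still reported the old CPU */
    sched_yield();                    /* give the scheduler a chance to move the thread */
    int second = sched_getcpu();
    int third  = sched_getcpu();      /* second and third were observed to agree */

    printf("before yield: %d, after yield: %d, %d\n", first, second, third);
    return 0;
}
```

On the machine described above, the first read could still report the old CPU, while the two reads after the yield agreed with each other.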
I have been quite busy, but I took a bit of time to make a simple check. I initialized all the values of

I am not completely sure this is the right way to perform this check, but surely @martin-frbg has a better knowledge of the code. Martin, if you have the possibility to give it a look, maybe we could be on the right track. Thanks! 😄
Not sure what you want me to do here - are you seeing active cpu nodes that never had their cpu_info set to a sensible value? From the git history, the last person to do any serious work on the NUMA mapping seems to have been ashwinyes (remapping cpus to keep all tasks on a common node if possible). I do note there are a few printfs to dump the cpu_info already in the code that should become active when compiling with DEBUG=1.
I think I fixed the issue. As I said previously, I made a fork where you can go and perform some tests. Let me explain what I found.

At the same time,

The commit I pushed to my repo seems to fix the deadlock. I would really like to use the thread sanitizer to perform additional tests, but as suggested previously by @martin-frbg, this seems to introduce problems. This could also mean that there are still some issues, but it is difficult to say. If you have the possibility to perform some tests, it would be really nice. In theory, things should keep working flawlessly for anyone for whom they were previously working, and at the same time, anyone who was experiencing the deadlock should now be able to run applications and tests without problems.
Thanks for the detailed explanation - in retrospect it makes sense that a misbehaviour in the mapping would be related to the dubious nature of the hyperthreading semi-cores. (Which would also prevent
Actually, this is my fault: I did not include the initialization and the assertions in the commit of the patch, just to focus on the fix. But you are right, it is useful to have a commit with these; I will make one.
After I wrote the previous comment, I remembered I can use valgrind with helgrind. And the result is different from the thread sanitizer (which led to a strange deadlock during the destruction phase), but it still involves issues. In detail, the

Even if changing the mapping behaviour seems to solve the deadlock problem, I want to be sure I did not introduce other issues, or, if there are other related bugs left, I would like to fix them. The problems with the thread sanitizer and valgrind leave me very suspicious.
I made some real-case tests, and I found that the issues are still there 😢. I am going to continue searching for bugs.
Ouch, too bad. Still the same symptoms, though, and are you testing with dgemv (as was the original issue here)? It is possible that the bug I tracked down in #1497 (using the return value from blas_quickdivide to calculate block sizes without checking that it is non-zero) may be present in some of the level3 BLAS files as well.
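To illustrate the failure mode being described, here is a self-contained sketch; quickdivide below is a stand-in for OpenBLAS's internal blas_quickdivide, and the variable names are made up for the example:

```c
#include <stdio.h>

/* Stand-in for the internal blas_quickdivide(): integer division used to
   split n rows of work among a number of threads. */
static long quickdivide(long n, long threads) { return n / threads; }

int main(void) {
    long n = 3, threads = 8;              /* fewer rows than threads */
    long width = quickdivide(n, threads); /* rounds down to 0 here */

    /* Without this guard, a loop that advances by `width` makes no progress,
       which is one plausible way to spin or hang forever. */
    if (width == 0) width = 1;

    for (long i = 0; i < n; i += width)
        printf("block starting at row %ld, width %ld\n", i, width);
    return 0;
}
```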
I've compiled OpenBLAS dfe1eef and I'm linking to it from a JNI interface and calling it from the JVM. It was compiled with USE_THREAD=1 and NO_AFFINITY=0 using gcc 4.4.7. After approximately 5 days of processing between 5 and 10 dgemv calls per second, I saw a deadlock occur, with a thread consuming 100% of a CPU core.
I grabbed this backtrace; this was the only thread with symbols in libblas.so:
Thread 1 (process 60961):
#0 0x00007fde1dce2287 in sched_yield () from /lib64/libc.so.6
#1 0x00007fdcca2f5ba5 in exec_blas_async_wait () from ./libblas.so.3
#2 0x00007fdcca2f6272 in exec_blas () from ./libblas.so.3
#3 0x00007fdcca132a74 in dgemv_thread_n () from ./libblas.so.3
#4 0x00007fdcca0f4934 in cblas_dgemv () from ./libblas.so.3
#5 0x00007fdccb31dafb in Java_com_github_fommil_netlib_NativeSystemBLAS_dgemv_1offsets () from /tmp/jniloader5273540829126619406netlib-native_system-linux-x86_64.so
#6 0x00007fde094c04a9 in ?? ()
#7 0x0000000000000000 in ?? ()
./libblas.so.3 is
lrwxrwxrwx 1 wgood users 35 Sep 23 15:16 libblas.so.3 -> libopenblas_sandybridgep-r0.2.14.so
make output looks like this:
OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)
OS ... Linux
Architecture ... x86_64
BINARY ... 64bit
C compiler ... GCC (command line : gcc)
Fortran compiler ... GFORTRAN (command line : gfortran)
Library Name ... libopenblas_sandybridgep-r0.2.14.a (Multi threaded; Max num-threads is 24)
The JVM is only calling dgemv from a single thread at any given point in time.
I will continue to watch for this; I suspect it will happen again in the next day or so, although not all JVM processes failed after 5 days. My long-term plan was to migrate to a single-threaded BLAS and do my own threading, which would fix this issue as well, but I'll try to see if I can learn anything more about it in the meanwhile.