Deadlock in threaded dgemv #660

Closed
bkgood opened this issue Oct 5, 2015 · 47 comments · Fixed by #1234
@bkgood commented Oct 5, 2015

I've compiled OpenBLAS dfe1eef and I'm linking to it from a JNI interface and calling it from the JVM. It was compiled with USE_THREAD=1 and NO_AFFINITY=0 using gcc 4.4.7. After approximately 5 days of processing between 5 and 10 dgemv calls per second, I saw a deadlock, with one thread consuming 100% of a CPU core.

I grabbed this backtrace; this was the only thread with symbols in libblas.so:
Thread 1 (process 60961):
#0 0x00007fde1dce2287 in sched_yield () from /lib64/libc.so.6
#1 0x00007fdcca2f5ba5 in exec_blas_async_wait () from ./libblas.so.3
#2 0x00007fdcca2f6272 in exec_blas () from ./libblas.so.3
#3 0x00007fdcca132a74 in dgemv_thread_n () from ./libblas.so.3
#4 0x00007fdcca0f4934 in cblas_dgemv () from ./libblas.so.3
#5 0x00007fdccb31dafb in Java_com_github_fommil_netlib_NativeSystemBLAS_dgemv_1offsets () from /tmp/jniloader5273540829126619406netlib-native_system-linux-x86_64.so
#6 0x00007fde094c04a9 in ?? ()
#7 0x0000000000000000 in ?? ()

./libblas.so.3 is
lrwxrwxrwx 1 wgood users 35 Sep 23 15:16 libblas.so.3 -> libopenblas_sandybridgep-r0.2.14.so

make output looks like this:
OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

OS ... Linux
Architecture ... x86_64
BINARY ... 64bit
C compiler ... GCC (command line : gcc)
Fortran compiler ... GFORTRAN (command line : gfortran)
Library Name ... libopenblas_sandybridgep-r0.2.14.a (Multi threaded; Max num-threads is 24)

The JVM is only calling dgemv from a single thread at any given point in time.

I will continue to watch for this; I suspect it will happen again within the next day or so, although not all JVM processes failed after 5 days. My long-term plan was to migrate to a single-threaded BLAS and do my own threading, which would avoid this issue as well, but in the meantime I'll try to learn more about it.

xianyi added the Bug label on Oct 5, 2015

@xianyi (Collaborator) commented Oct 5, 2015

Do you know what the input size for dgemv is? Is there only one JVM process on the machine calling OpenBLAS?

@bkgood (Author) commented Oct 5, 2015

The matrix is approx 50k rows x 10k cols (row count is ~49k and col count is precisely 10k) and there's a single thread in a single process calling OpenBLAS.

@brada4 (Contributor) commented Dec 1, 2015

If sched_yield uses 100% of a CPU, something is badly broken with your system's process scheduler and timers.
Can you post the output of the following to a gist?

gdb> attach
gdb> thread apply all backtrace
gdb> detach
gdb> quit

@jbaiter commented Oct 27, 2016

I have the exact same problem with OpenBLAS called from numpy, same backtrace.

Here's the full backtrace as suggested by @brada4: https://gist.github.com/jbaiter/8dd11939bedfa9d9bf7e417c86f02bd9

The version is 0.2.18-1ubuntu1 running on Linux 4.4.0-38-generic.

@martin-frbg (Collaborator) commented:

Could be related to races like
==7085== Possible data race during write of size 8 at 0x641C00 by thread #2
==7085== Locks held: none
==7085== at 0x432F91: blas_thread_server (blas_server.c:438)
==7085== by 0x4C2F736: mythread_wrapper (hg_intercepts.c:234)
==7085== by 0x54680A3: start_thread (pthread_create.c:309)
==7085==
==7085== This conflicts with a previous read of size 8 by thread #1
==7085== Locks held: none
==7085== at 0x4333F2: exec_blas_async_wait (blas_server.c:726)
==7085== by 0x4335A5: exec_blas (blas_server.c:795)
==7085== by 0x42C3DE: sgbmv_thread_n (gbmv_thread.c:260)
==7085== by 0x416480: sgbmv_ (gbmv.c:240)
==7085== by 0x40D3A1: schk1_ (sblat2.f:592)
==7085== by 0x413F6F: MAIN__ (sblat2.f:305)
==7085== by 0x4148FE: main (sblat2.f:390)
==7085== Address 0x641c00 is 0 bytes inside data symbol "thread_status"
seen in #995. I am a bit unsure how to (sanely) put a lock around or into the "while" loop at line 726,

while(thread_status[queue -> assigned].queue) {
  YIELDING;
 };

at the moment (and it may all be a red herring).
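For context, the loop quoted above is a busy-wait that spins on a flag written by another thread; judging by the backtraces in this issue, YIELDING expands to sched_yield() here. Below is a generic pthread sketch of the two patterns under discussion - not OpenBLAS code, with the flag and function names invented for the illustration:

#include <pthread.h>
#include <sched.h>

/* Busy-wait variant: the waiter spins until a worker clears the flag. */
static volatile int pending = 1;

static void wait_busy(void) {
  while (pending)            /* another thread clears this */
    sched_yield();           /* what YIELDING amounts to here */
}

/* Condition-variable variant: the waiter sleeps instead of spinning,
 * but every writer of the flag must now take the same mutex. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int pending_cv = 1;

static void wait_cv(void) {
  pthread_mutex_lock(&lock);
  while (pending_cv)
    pthread_cond_wait(&done, &lock);
  pthread_mutex_unlock(&lock);
}

static void finish_cv(void) {  /* called by the worker when it is done */
  pthread_mutex_lock(&lock);
  pending_cv = 0;
  pthread_cond_signal(&done);
  pthread_mutex_unlock(&lock);
}

The difficulty described above is exactly the second variant's requirement: every writer of thread_status[...].queue would have to agree on the same lock.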

@berke commented Oct 31, 2016

Had the same problem, via Lapack, using libopenblas 0.2.19-1 on Debian, on Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz with 20 cores. The main thread got stuck after a few hours on exec_blas_async_wait called from dsytri from Lapack. Kernel was 4.5.0-2-amd64 #1 SMP Debian 4.5.3-2 (2016-05-08) x86_64 GNU/Linux. The other threads were all in pthread_cond_destroy.

(gdb) bt
#0 0x00007f2c9704b308 in exec_blas_async_wait () from /usr/lib/libopenblas.so.0
#1 0x00007f2c9704b3a7 in exec_blas () from /usr/lib/libopenblas.so.0
#2 0x00007f2c96e7a2ee in dsymv_thread_U () from /usr/lib/libopenblas.so.0
#3 0x00007f2c9a6cb043 in dsymv_ () from /usr/lib/libblas.so.3
#4 0x00007f2c9a12e3ff in dsytri_ () from /usr/lib/liblapack.so.3

@martin-frbg (Collaborator) commented:

I am making extremely slow progress, if any, as thread programming is not exactly my forte. Part of the problem now seems to be that blas_thread_server appears to be "thinking" in terms of cpus while exec_blas_async_wait is iterating over threads, so I have a hard time putting mutex locks in the latter on the right item to achieve anything.

@martin-frbg (Collaborator) commented:

Would either of you be willing to retest with the changed blas_server.c and memory.c linked from #995 (or point me to some self-contained test to check this myself), please?

@sparseinference commented:

I think I just ran into this problem too. I'd be very happy for a workaround...

I have a very simple test case with numpy 1.12.0 in Python 3.6.
It hangs at the call to dot.

from numpy import *
from numpy.random import normal,uniform

A = normal(size=(350,35))
b = uniform(size=350)

for i in range(1000000):
    for _ in range(1000):
        x = normal(size=35)
        y = dot(A,x) + b
    print(f"[{i:6d}] {y[0]:10.4f}")

Here's the backtrace:

GNU gdb (Ubuntu 7.11.90.20161005-0ubuntu1) 7.11.90.20161005-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) r testdot.py
Starting program: /home/peter/python36/bin/python testdot.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3c17700 (LWP 16527)]
[New Thread 0x7ffff3416700 (LWP 16528)]
[New Thread 0x7fffeec15700 (LWP 16529)]
[New Thread 0x7fffee414700 (LWP 16530)]
[New Thread 0x7fffe9c13700 (LWP 16531)]
[New Thread 0x7fffe7412700 (LWP 16532)]
[New Thread 0x7fffe4c11700 (LWP 16533)]
[New Thread 0x7fffe2410700 (LWP 16534)]
[New Thread 0x7fffdfc0f700 (LWP 16535)]
[New Thread 0x7fffdd40e700 (LWP 16536)]
[New Thread 0x7fffdac0d700 (LWP 16537)]
[     0]     7.0992
[     1]    10.2106
[     2]   -14.5927
[     3]    10.9267
[     4]     0.8033
[     5]     8.1307
[     6]     9.4003
[     7]    -2.9214
[     8]     3.1963
[     9]    -4.4260
[    10]    -8.6859
[    11]    -1.5888
[    12]     2.4825
[    13]    -3.8991
[    14]     2.1277
[    15]   -14.3231
[    16]     6.9087
[    17]     2.6755
[    18]    -3.1991
[    19]     7.4059
^C
Thread 1 "python" received signal SIGINT, Interrupt.
0x00007ffff71ce4e7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff71ce4e7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff4249e75 in exec_blas_async_wait ()
   from /home/peter/python36/lib/python3.6/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#2  0x00007ffff4249f66 in exec_blas () from /home/peter/python36/lib/python3.6/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#3  0x00007ffff40789f5 in dgemv_thread_t () from /home/peter/python36/lib/python3.6/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#4  0x00007ffff4038d97 in cblas_dgemv () from /home/peter/python36/lib/python3.6/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#5  0x00007ffff66700a3 in gemv (typenum=typenum@entry=12, order=<optimized out>, trans=trans@entry=CblasNoTrans, A=A@entry=0x7ffff6df3350, 
    lda=<optimized out>, X=X@entry=0x7ffff7ed5bc0, incX=1, R=R@entry=0x7fffd5207c60) at numpy/core/src/multiarray/cblasfuncs.c:95
#6  0x00007ffff6670e9c in cblas_matrixproduct (typenum=typenum@entry=12, ap1=ap1@entry=0x7ffff6df3350, ap2=ap2@entry=0x7ffff7ed5bc0, out=out@entry=0x0)
    at numpy/core/src/multiarray/cblasfuncs.c:623
#7  0x00007ffff663ac5c in PyArray_MatrixProduct2 (op1=<optimized out>, op2=0x7ffff7ed5bc0, out=0x0) at numpy/core/src/multiarray/multiarraymodule.c:938
#8  0x00007ffff663c08a in array_matrixproduct (__NPY_UNUSED_TAGGEDdummy=<optimized out>, args=0x7ffff7e0bec8, kwds=0x0)
    at numpy/core/src/multiarray/multiarraymodule.c:2180
#9  0x0000555555685363 in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=<optimized out>, func_obj=0x7ffff6b30f78)
    at Objects/methodobject.c:231
#10 _PyCFunction_FastCallKeywords () at Objects/methodobject.c:295
#11 0x00005555556dea1d in call_function () at Python/ceval.c:4788
#12 0x00005555556d702f in _PyEval_EvalFrameDefault () at Python/ceval.c:3275
#13 0x00005555556d6187 in PyEval_EvalFrameEx (throwflag=0, f=0x7ffff7f6ba20) at Python/ceval.c:718
#14 _PyEval_EvalCodeWithName () at Python/ceval.c:4119
#15 0x00005555556d5f03 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, args=0x0, 
    locals=locals@entry=0x7ffff7f3d288, globals=globals@entry=0x7ffff7f3d288, _co=_co@entry=0x7ffff7ef7930) at Python/ceval.c:4140
#16 PyEval_EvalCode (co=co@entry=0x7ffff7ef7930, globals=globals@entry=0x7ffff7f3d288, locals=locals@entry=0x7ffff7f3d288) at Python/ceval.c:695
#17 0x0000555555727712 in run_mod () at Python/pythonrun.c:980
#18 0x0000555555727b6d in PyRun_FileExFlags () at Python/pythonrun.c:933
#19 0x000055555572794e in PyRun_SimpleFileExFlags () at Python/pythonrun.c:396
#20 0x000055555572d631 in run_file (p_cf=0x7fffffffda30, filename=0x555555ab72c0 L"testdot.py", fp=0x555555b41f00) at Modules/main.c:320
#21 Py_Main () at Modules/main.c:780
#22 0x00005555556395ad in main () at ./Programs/python.c:69
#23 0x00007ffff71033f1 in __libc_start_main (main=0x5555556394c0 <main>, argc=2, argv=0x7fffffffdc48, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffdc38) at ../csu/libc-start.c:291
#24 0x0000555555724c8a in _start ()
(gdb) 
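For reference, a rough C equivalent of the numpy loop above that drives the same threaded dgemv path directly through CBLAS. This is a sketch, not a guaranteed reproducer: the matrix sizes are copied from the Python example, only the documented cblas_dgemv interface is used, and the build/link line (something like -lopenblas) will depend on your installation.

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define M 350   /* rows, as in the numpy example */
#define N  35   /* columns */

int main(void) {
  double *A = malloc(sizeof(double) * M * N);
  double *x = malloc(sizeof(double) * N);
  double *y = malloc(sizeof(double) * M);
  if (!A || !x || !y) return 1;

  for (long i = 0; i < (long)M * N; i++) A[i] = (double)rand() / RAND_MAX;
  for (int i = 0; i < N; i++) x[i] = (double)rand() / RAND_MAX;

  for (long iter = 0; ; iter++) {          /* run until it hangs (or forever) */
    /* y = 1.0 * A * x + 0.0 * y, row-major, no transpose */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, M, N,
                1.0, A, N, x, 1, 0.0, y, 1);
    if (iter % 100000 == 0)
      printf("[%ld] %.4f\n", iter, y[0]);
  }
  return 0;                                /* not reached */
}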

@berke commented Feb 2, 2017

@martin-frbg I don't have an easy-to-reproduce case, as it happened after a very long computation launched by someone else, and the code has since evolved so that it no longer makes as many calls. If the fix works on @sparseinference's case I would like to start using it on a development machine to catch regressions, but there is no guarantee that I would run into the bug even if it isn't fixed.

@sparseinference commented:

I can sacrifice some performance but still make stuff work with:

export OPENBLAS_NUM_THREADS=1
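If it helps anyone, the same restriction can also be applied at run time from the calling code when the OpenBLAS-specific extension is available. A small sketch: openblas_set_num_threads is declared in OpenBLAS's own cblas.h, but whether a given build or distribution's headers expose it is an assumption worth checking.

/* Limit OpenBLAS to one thread programmatically instead of via the
 * OPENBLAS_NUM_THREADS environment variable. */
extern void openblas_set_num_threads(int num_threads);

void force_single_threaded_blas(void) {
  openblas_set_num_threads(1);
}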

@berke commented Feb 2, 2017

Same here, we've been running with OPENBLAS_NUM_THREADS=1 since November and had no issues. I parallelize at the process level.

@martin-frbg (Collaborator) commented:

@sparseinference, which version of OpenBLAS do you see this with - one of the releases (including the latest 0.2.19) or the current git develop tree? The fix (if it is one) is only in the latter so far.

@sparseinference commented:

@martin-frbg - I think it's 0.2.18. The library loaded is libopenblasp-r0.2.18.so.

@pyIonControl commented:

I had the same problem with version 0.2.18. That is, my process was deadlocked at 100% use of one core, and frames #0 through #3 of the backtrace are identical to those reported by @bkgood. The long and repetitive calculation I was running was getting deadlocked in only a small fraction of calls, but in effect multiple times a day. Based on the information in this issue thread, I then compiled OpenBLAS from the develop branch on github (commit 1acfc78, 4/3/2017) and compiled the latest numpy (1.12.1) and scipy against this version of OpenBLAS. The calculation has since been running uninterrupted for more than a day, so for me the newest github revision most likely fixes this deadlock. I will report should the problem reappear, or appear with much lower likelihood.

@augray commented Feb 22, 2018

We seem to be running into a similar deadlock, but interestingly it is hit when we have multiple processes on the same machine using BLAS at the same time. Setting OPENBLAS_NUM_THREADS=1 for both processes seems to fix the issue, but I'm still left with a couple of questions.

  1. How is it that we hit the deadlock only when we run two independent processes using BLAS? Is BLAS sharing resources between these two processes, and if so, what resources?
  2. If BLAS is sharing resources between the two processes somehow, then how does setting OPENBLAS_NUM_THREADS=1 on each of the processes change this relationship?
  3. Based on experimentation, it appears that the BLAS computations from the two processes are happening in parallel, when each one is set with OPENBLAS_NUM_THREADS=1. Can you confirm that this is actually the case?
  4. What are the implications of setting OPENBLAS_NUM_THREADS=1? Obviously it changes how many threads OpenBLAS uses :-) . But threads for what? Is BLAS using multiple threads for, say, a single matrix multiplication, or is the parallelism at the job level (e.g. I can ask BLAS to do 2 matrix multiplications at once and it will do each one on a separate thread)? How does this relate to whatever cross-process interaction is going on? Is there anything else I should be aware of?

Sorry for the long text, just trying to dig in deep. Perhaps this is actually a separate issue - if so, I'd be happy to open another ticket.

EDIT: FWIW, we are using code that has the changes from @brada4's PR.

@martin-frbg (Collaborator) commented:

What makes you think the situation is similar, and what hardware are you seeing this on? (From your comment, the OpenBLAS version you use is probably something like 0.2.20, but it might help to know that as well.) Sharing resources between two independent processes would, I believe, be entirely unexpected and unwanted. While both may use the same in-memory copy of the library, the operating system should clone any memory pages that receive process-specific writes to some flag value.

@augray commented Feb 22, 2018

The thing that makes me think it's similar is that we're seeing multiple threads stuck at #0 0x00007fa1aeb3ae47 in sched_yield () from /lib64/libc.so.6 #1 0x00007fa1a19d47c5 in exec_blas_async (), which is similar to what @sparseinference posted, and setting OPENBLAS_NUM_THREADS=1 also resolves it, as he suggested. Here's a full backtrace of the hung process in a gist.

As for hardware, it's an Amazon EC2 instance. We've got 2 cores, x86 64-bit. The OS is RHEL 7, using kernel 3.10.0-693.17.1.el7.x86_64. We're using OpenBLAS 0.2.20.

I share your confusion about any sharing between processes, yet the fact remains that we can reproduce the deadlock when running two processes, but not with one. It's possible some indirect interaction is going on (e.g. having the CPUs tied up with one process changes the way the threads interleave in the other process in such a way that conditions become just right for the deadlock)... but I hope, for our sake and the maintainers', that the answer is more straightforward than that!

@martin-frbg (Collaborator) commented:

Only two cores does not sound quite right - how many threads are you trying to run on this? Any chance to reproduce this on non-virtual hardware?

@augray commented Feb 22, 2018

Yes, it's only 2 cores. So far we've only seen this on a somewhat dinky machine we use for building and testing. Reproducing it on non-virtual hardware will be difficult because our build at present only supports RHEL 7, and we don't have any physical hardware running that.

@martin-frbg (Collaborator) commented:

How many threads are you expecting to run on that poor thing - I see nine in that snippet of the backtrace from just one of your concurrent processes alone? Maybe you are just seeing threads fighting for a bit of CPU time. I take it your code is proprietary and/or hard to port to different environments, or is this just some scipy?

@augray commented Feb 22, 2018

We aren't directly controlling the number of threads in the Python application. Our stack is a Flask app which accesses BLAS through numpy/scipy, so basically the server spawns threads as requests come in.

I don't think it's just fighting for CPU time. When run in series, the workload we're hitting it with takes less than a minute. When it hangs, we've left it for probably close to half an hour without any progress.

The code is unfortunately proprietary. We only use OpenBLAS on RHEL (that's a whole other story), and so we can't easily compare on other environments.

I should say that at this point we may be fine just using the OPENBLAS_NUM_THREADS=1 workaround, provided we feel comfortable that we understand exactly how that changes OpenBLAS's behavior. In other words, if you can confirm that there's no direct resource sharing going on, just answering 3 & 4 from my earlier post would go a long way for us. By the way, thanks for the time you've already spent looking at this!

@sparseinference commented:

I just rechecked this situation using the same python example that I posted above. I removed the restriction on OpenBLAS threads and ran the test in two terminal sessions concurrently.
The CPU has 6 hyper-threading cores.
Kernel: 4.13.0-19
numpy: 1.14.1
openblas: 0.2.20

As expected, 24 threads were created and all cores were running at 100%.
I let the test run for about 5 minutes with no problems.

@augray, perhaps a test using Flask would reproduce your problem?

@augray commented Feb 23, 2018

I've been trying to reproduce this with a sharable example, but no luck so far :-( . Perhaps I would be better served just asking for a detailed explanation of how the behavior changes when OPENBLAS_NUM_THREADS=1 is set. I can open that as a docs ticket if that would be preferable?

@augray commented Feb 23, 2018

Thanks, that's helpful confirmation! We don't directly control the number of threads (that's determined by the server our app is running in). However, we haven't noticed any performance degradation except when there's a complete deadlock. For instance, submitting 6 jobs at once to two separate processes doesn't take much longer than submitting a job to a single process 3 times in series. The other, non-BLAS threads in our app are not CPU bound, so I don't think they should affect CPU contention too much.

I'm thinking we'll probably go with OPENBLAS_NUM_THREADS=1 for each process. Since there's no sharing going on between the two processes, there's definitely no chance for any kind of deadlock in single-threaded processes, and the performance as we've tested it so far with this setting is sufficient. I guess we'll just have to assume the explanation for the reproducibility being restricted to the two-process case is something nebulous like "non-deterministic thread ordering behavior, influenced by increased CPU contention, happens to make conditions amenable to deadlock on our OS and hardware with our particular workload". I appreciate the info!

@dodomorandi commented Mar 9, 2018

I am encountering the same kind of bug. I tried compiling OpenBLAS with the thread sanitizer enabled to see what happens with the tests, and something has indeed been found:

==================
WARNING: ThreadSanitizer: data race (pid=37875)
  Write of size 8 at 0x7ffdb5b290b0 by thread T1:
    #0 inner_thread level3_thread.c:397 (sblat3+0x000000434527)
    #1 blas_thread_server /home/edoardo/src/OpenBLAS/driver/others/blas_server.c:433 (sblat3+0x0000004406f9)
    
  Previous read of size 8 at 0x7ffdb5b290b0 by main thread:
    #0 inner_thread level3_thread.c:416 (sblat3+0x000000434621)
    #1 exec_blas /home/edoardo/src/OpenBLAS/driver/others/blas_server.c:820 (sblat3+0x000000441557)
    #2 gemm_driver level3_thread.c:672 (sblat3+0x000000433e72)
    #3 ssymm_thread_LU level3_thread.c:733 (sblat3+0x000000433e72)
    #4 ssymm_ /home/edoardo/src/OpenBLAS/interface/symm.c (sblat3+0x000000414a34)
    #5 schk2_ /home/edoardo/src/OpenBLAS/test/sblat3.f:795 (sblat3+0x000000408ca9)

  Location is stack of main thread.

  Thread T1 (tid=37877, running) created by main thread at:
    #0 pthread_create ../../../../gcc-7.3.0/libsanitizer/tsan/tsan_interceptors.cc:900 (libtsan.so.0+0x000000028260)
    #1 blas_thread_init /home/edoardo/src/OpenBLAS/driver/others/blas_server.c:579 (sblat3+0x00000043fc69)
    #2 gotoblas_init /home/edoardo/src/OpenBLAS/driver/others/memory.c:1435 (sblat3+0x00000043f2ea)
    #3 __libc_csu_init <null> (sblat3+0x00000051a51c)

SUMMARY: ThreadSanitizer: data race level3_thread.c:397 in inner_thread
==================
==================
WARNING: ThreadSanitizer: data race (pid=13862)
  Read of size 8 at 0x000000650200 by thread T1 (mutexes: write M12):
    #0 blas_thread_server ~/src/OpenBLAS/driver/others/blas_server.c:343 (sblat2+0x00000042e737)

  Previous write of size 8 at 0x000000650200 by main thread (mutexes: write M11):
    #0 blas_thread_shutdown_ ~/src/OpenBLAS/driver/others/blas_server.c:965 (sblat2+0x000000430e17)
    #1 blas_shutdown ~/src/OpenBLAS/driver/others/memory.c:1252 (sblat2+0x00000042d675)
    #2 gotoblas_quit ~/src/OpenBLAS/driver/others/memory.c:1455 (sblat2+0x00000042d8df)
    #3 _dl_fini <null> (ld-linux-x86-64.so.2+0x0000000104e6)

  Location is global 'thread_status' of size 8192 at 0x000000650200 (sblat2+0x000000650200)

  Mutex M12 (0x000000650218) created at:
    #0 pthread_mutex_init ../../../../gcc-7.3.0/libsanitizer/tsan/tsan_interceptors.cc:1117 (libtsan.so.0+0x000000028a6e)
    #1 blas_thread_init ~/src/OpenBLAS/driver/others/blas_server.c:572 (sblat2+0x00000042e1a0)
    #2 gotoblas_init ~/src/OpenBLAS/driver/others/memory.c:1435 (sblat2+0x00000042d87a)
    #3 __libc_csu_init <null> (sblat2+0x000000444b6c)

  Mutex M11 (0x000000650188) created at:
    #0 pthread_mutex_lock ../../../../gcc-7.3.0/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:3608 (libtsan.so.0+0x00000003872f)
    #1 blas_thread_init ~/src/OpenBLAS/driver/others/blas_server.c:556 (sblat2+0x00000042e067)
    #2 gotoblas_init ~/src/OpenBLAS/driver/others/memory.c:1435 (sblat2+0x00000042d87a)
    #3 __libc_csu_init <null> (sblat2+0x000000444b6c)

  Thread T1 (tid=13864, running) created by main thread at:
    #0 pthread_create ../../../../gcc-7.3.0/libsanitizer/tsan/tsan_interceptors.cc:900 (libtsan.so.0+0x000000028260)
    #1 blas_thread_init ~/src/OpenBLAS/driver/others/blas_server.c:579 (sblat2+0x00000042e1f9)
    #2 gotoblas_init ~/src/OpenBLAS/driver/others/memory.c:1435 (sblat2+0x00000042d87a)
    #3 __libc_csu_init <null> (sblat2+0x000000444b6c)

SUMMARY: ThreadSanitizer: data race ~/src/OpenBLAS/driver/others/blas_server.c:343 in blas_thread_server
==================

I am not sure whether these two data races are directly related to this issue, but it looks like compiling with sanitizers is very helpful for testing, and maybe the same approach could be used to track down the problem.

EDIT: I had originally pasted the same error twice.

@dodomorandi commented:

I just wanted to track down at least these two.
The first one is related to a misuse of volatile. Please use the _Atomic keyword or a related atomic type; volatile is not meant to prevent data races.
A possible patch is the following:

diff --git a/driver/level3/level3_thread.c b/driver/level3/level3_thread.c
index fec873e..da87a98 100644
--- a/driver/level3/level3_thread.c
+++ b/driver/level3/level3_thread.c
@@ -91,7 +91,7 @@
 #endif
 
 typedef struct {
-  volatile BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
+  _Atomic BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE];
 } job_t;
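To illustrate the semantic difference this patch is after, here is a minimal self-contained C11 sketch (not OpenBLAS code; the flag, result and function names are invented): with _Atomic plus explicit memory ordering, a consumer that observes the flag is guaranteed to also see the data published before it, which volatile alone does not promise on weakly ordered CPUs.

#include <stdatomic.h>
#include <stdbool.h>

static long result;                  /* data being published */
static _Atomic bool done = false;    /* completion flag */

void producer(long value) {
  result = value;
  atomic_store_explicit(&done, true, memory_order_release);
}

long consumer(void) {
  while (!atomic_load_explicit(&done, memory_order_acquire))
    ;                                /* spin until published */
  return result;                     /* ordered after the flag load */
}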

The other seems to be related to the wrong lock being used. Unfortunately, this is the first time I have looked at the OpenBLAS code, so I can only hypothesize how things should work. Nevertheless, at line 965 of blas_server.c, thread_status[i].queue is protected through blas_lock(&exec_queue_lock), but in many other places the locking is done through pthread_mutex_lock(&thread_status[i].lock) - which is exactly the lock taken in the following lines. I cannot be sure, but a possible fix that seems to work is the following:

diff --git a/driver/others/blas_server.c b/driver/others/blas_server.c
index 9debe17..5abb362 100644
--- a/driver/others/blas_server.c
+++ b/driver/others/blas_server.c
@@ -960,14 +960,9 @@ int BLASFUNC(blas_thread_shutdown)(void){
 
   for (i = 0; i < blas_num_threads - 1; i++) {
 
-    blas_lock(&exec_queue_lock);
-
-    thread_status[i].queue = (blas_queue_t *)-1;
-
-    blas_unlock(&exec_queue_lock);
-
     pthread_mutex_lock  (&thread_status[i].lock);
 
+    thread_status[i].queue = (blas_queue_t *)-1;
     thread_status[i].status = THREAD_STATUS_WAKEUP;
 
     pthread_cond_signal (&thread_status[i].wakeup);

As I said, I have never hacked on OpenBLAS before, so I could be completely wrong about these patches. I just hope this helps somehow.

@martin-frbg (Collaborator) commented:

The second one should actually be fixed by #1299 since September; unfortunately this is only in the develop branch, as there has not been a release in the meantime. The first one I would never have been able to find - with some luck this will finally fix the long-standing thread safety issues in the multithreaded level3 BLAS.

@dodomorandi commented Mar 12, 2018

I created this fork to track down the issues I am finding. I will need to write code to use _Atomic only when C11 support is available, but this is a minor issue for now.
The big problem I am encountering is related to node mapping; in particular I have two different issues, and I am not sure whether they are related. The problem is triggered by the safety utest: sometimes I get a deadlock, and sometimes node_mapping is written more than once at the same index. I created this commit to help me (or anyone else, obviously) track down the issue, but for now I am quite stuck.

@martin-frbg (Collaborator) commented:

Thanks, I will hold off on my partial PR then.
Do you see the duplicate mapping also without the thread sanitizer? (Though it would not surprise me if that pthread_atfork hack of stopping and re-initializing threads around a fork were a bit fragile.)

@dodomorandi commented:

Very good hint about checking without sanitizers. Unfortunately, the check is triggered as well; however, this time I catch it from the bug_695 test. Some more details: I am working on a NUMA machine with HT. From what I can understand, it looks like the APIC ID obtained from WhereAmI() is not unambiguous across NUMA nodes. But I do not have deep knowledge of NUMA machines, so I can be (and probably am) wrong. What do you think @martin-frbg?

Nevertheless, on a completely different topic: on x86-based machines, using volatile as a pre-C11 compromise might be tolerable (it still permits CPU reordering, and therefore possible data races, but at least mov instructions are atomic when the data lies entirely within one cache line); AFAIK it is a complete mess on other architectures such as MIPS and ARM, especially with older CPUs. Look at this simple test from Compiler Explorer (yes, it is in C++ rather than C, but only because C11 constructs like _Atomic gave me some problems and the C compiler seems to be available only for x86 targets; the results are the same in this case) - you will notice a big difference between using atomic and volatile variables.

That's the problem; now for possible solutions. AFAIK there is no portable way of having atomic operations before C11, not even from POSIX. I hope I am wrong, but if that is the case, I think there are only two solutions, and using volatile is not one of them.

The first is to drop pre-C11 support. Let's be honest: it is 2018, and GCC has had C11 support for years; I think this could be an acceptable solution.

If this is not possible, then the only other way, IMHO, is to create a general set of 'atomic types' and then write wrappers for every possible operation, for every platform. In this approach every simple operation would have to be replaced with an inlined function containing raw asm code. This is obviously a huge amount of work, it would be very difficult to maintain, and the code would be bloated by things like atomic_add_float(f, 3.f) for every atomic operation that needs to be performed.
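For what it is worth, on GCC and Clang one possible middle ground between dropping pre-C11 support and hand-writing asm per platform is the compilers' __atomic builtins, which predate their full _Atomic support. A sketch of such a hypothetical wrapper (the macro names are invented, and whether this covers every compiler OpenBLAS targets is exactly the open question):

/* Hypothetical compatibility macros -- not part of OpenBLAS.  Under C11
 * the variables would be declared _Atomic and <stdatomic.h> used directly;
 * the fallback below relies on the GCC/Clang __atomic builtins. */
#if defined(__GNUC__)
  #define BLAS_LOAD_ACQ(p)     __atomic_load_n((p), __ATOMIC_ACQUIRE)
  #define BLAS_STORE_REL(p, v) __atomic_store_n((p), (v), __ATOMIC_RELEASE)
#else
  #error "no atomic primitives known for this compiler"
#endif

/* usage sketch: spin until another thread clears a queue slot
 *   while (BLAS_LOAD_ACQ(&thread_status[i].queue) != NULL) sched_yield();
 */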

That said, I think the current state of OpenBLAS is, unfortunately, far from thread-safe. With (relatively) modern compilers we have the possibility of eliminating a large number of issues without a particularly big effort, but pre-C11 would have to be dropped. I am the newest arrival here and will never have the time to maintain this great library, so I cannot decide anything like this; I just wanted to show you my point of view.

For now, let's try to solve the current issues ;)

@martin-frbg (Collaborator) commented Mar 13, 2018

As far as I know, thread safety is not an issue when OpenMP is used, and the library has been used successfully on big NUMA machines as well. For non-OpenMP builds, it seems to behave "reasonably well" in most cases, at least when USE_SIMPLE_THREADED_LEVEL3 is set. So I do not think the situation is as bleak as you make it out to be. Adding C11 atomic operations where available should improve things, but I see no reason to completely remove the kludgy volatile and just tell affected users that we're in 2018.
I do not know anything about the behaviour of WhereAmI() on NUMA installations, but I expect there should be a way to get unambiguous IDs on such systems.
Can you provide more details about the platform you are using, and can you exclude any hardware/firmware/OS issues with that particular system?

@martin-frbg (Collaborator) commented:

Possibly the cpuid instruction used in the x86_64 "WhereAmI()" is unsuitable for "modern" big Intel systems. https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration discusses means of enumerating the processor topology, but I do not have time to look into this right now.
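To make the concern concrete: CPUID leaf 1 only reports the 8-bit initial APIC ID, while the extended topology leaf 0xB returns the full 32-bit x2APIC ID that large systems need for unambiguous identification. A small GCC-specific illustration of the two queries (not the code OpenBLAS actually uses, and AMD systems may report this differently):

#include <cpuid.h>    /* GCC/Clang helper macros for the cpuid instruction */
#include <stdio.h>

int main(void) {
  unsigned int eax, ebx, ecx, edx;

  __cpuid(1, eax, ebx, ecx, edx);
  printf("leaf 1 initial APIC id: %u (8 bits only)\n", (ebx >> 24) & 0xff);

  if (__get_cpuid_max(0, NULL) >= 0x0b) {     /* leaf 0xB may not exist */
    __cpuid_count(0x0b, 0, eax, ebx, ecx, edx);
    printf("leaf 0xB x2APIC id: %u (32 bits)\n", edx);
  }
  return 0;
}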

@dodomorandi commented:

> Can you provide more details about the platform you are using, and can you exclude any hardware/firmware/OS issues with that particular system?

I can show you as much information as possible, but I don't have any idea how I could find possible issues with the OS (the machine otherwise works flawlessly; I just do not know what to look for to discover anomalous behaviour).
This is from lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(TM) Processor 6274
Stepping:              2
CPU MHz:               2200.052
BogoMIPS:              4399.74
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28
NUMA node1 CPU(s):     32,36,40,44,48,52,56,60
NUMA node2 CPU(s):     2,6,10,14,18,22,26,30
NUMA node3 CPU(s):     34,38,42,46,50,54,58,62
NUMA node4 CPU(s):     3,7,11,15,19,23,27,31
NUMA node5 CPU(s):     35,39,43,47,51,55,59,63
NUMA node6 CPU(s):     1,5,9,13,17,21,25,29
NUMA node7 CPU(s):     33,37,41,45,49,53,57,61

Feel free to ask me any specific information.
However, to explore the behaviour of my machine a bit, I wrote a simple program in C++. It creates a pool of threads, and every thread gets the CPU index and the APIC ID from sched_getcpu and the cpuid instruction and stores them. As soon as I spawn a thread, I use pthread_setaffinity_np to set its affinity. Then I join the threads and check whether the IDs (both CPU and APIC) are unique.
What I found is that this led to spurious ID collisions, for both types of IDs. After a bit I understood the problem: the main thread had not been pinned. As soon as I made the main thread affine using the same procedure, the ID collisions stopped.

The question is: could this be exactly the problem we are encountering?
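A C reconstruction of the kind of test described above (my sketch, not dodomorandi's actual C++ program; it uses the GNU-specific sched_getcpu and pthread_setaffinity_np mentioned in the comment, and NTHREADS is an arbitrary choice):

#define _GNU_SOURCE
#include <cpuid.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 8                     /* keep <= number of online CPUs */

struct result { int cpu; unsigned int apic; };

static unsigned int read_apic_id(void) {
  unsigned int eax, ebx, ecx, edx;
  __cpuid(1, eax, ebx, ecx, edx);
  return (ebx >> 24) & 0xff;           /* initial APIC ID from leaf 1 */
}

static void *worker(void *arg) {
  struct result *r = arg;
  /* Note the race the comment describes: this may run before the main
   * thread has applied the affinity mask for this thread. */
  r->cpu  = sched_getcpu();
  r->apic = read_apic_id();
  return NULL;
}

int main(void) {
  pthread_t tid[NTHREADS];
  struct result res[NTHREADS];
  cpu_set_t set;

  for (int i = 0; i < NTHREADS; i++) {
    pthread_create(&tid[i], NULL, worker, &res[i]);
    CPU_ZERO(&set);
    CPU_SET(i, &set);                  /* pin thread i to CPU i after spawning */
    pthread_setaffinity_np(tid[i], sizeof(set), &set);
  }
  for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);

  /* report duplicate IDs */
  for (int i = 0; i < NTHREADS; i++)
    for (int j = i + 1; j < NTHREADS; j++)
      if (res[i].cpu == res[j].cpu || res[i].apic == res[j].apic)
        printf("collision between threads %d and %d\n", i, j);
  return 0;
}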

@dodomorandi commented:

> Possibly the cpuid instruction used in the x86_64 "WhereAmI()" is unsuitable for "modern" big Intel systems. https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration discusses means of enumerating the processor topology, but I do not have time to look into this right now.

Sorry, it took me some time to write the previous reply. Maybe we don't need to touch the WhereAmI function, for now. But I will wait for your opinion to get a better idea.

@martin-frbg (Collaborator) commented:

(Un?)fortunately it seems to me that you may well be more experienced with multithreading in general and NUMA architectures in particular than I am. As far as I can tell, gotoblas_affinity_init() in init.c does a sched_setaffinity() for itself before enumerating the cores with WhereAmI(), and I assume this should be equivalent to what your test code is doing with the non- (or less-)portable pthread_setaffinity_np() function, which appears to be a GNU extension.

@brada4 (Contributor) commented Mar 15, 2018

Can you add "numactl -H" output? I have not seen out-of-order core numbering recently.
The initial report is on small systems, nothing to do with NUMA.
IMO trying to override OpenMP thread placement is wrong.

@dodomorandi commented:

> Can you add "numactl -H" output?

Sure!

available: 8 nodes (0-7)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 32166 MB
node 0 free: 29854 MB
node 1 cpus: 32 36 40 44 48 52 56 60
node 1 size: 32254 MB
node 1 free: 31911 MB
node 2 cpus: 2 6 10 14 18 22 26 30
node 2 size: 32254 MB
node 2 free: 31786 MB
node 3 cpus: 34 38 42 46 50 54 58 62
node 3 size: 32254 MB
node 3 free: 31936 MB
node 4 cpus: 3 7 11 15 19 23 27 31
node 4 size: 32254 MB
node 4 free: 30836 MB
node 5 cpus: 35 39 43 47 51 55 59 63
node 5 size: 32254 MB
node 5 free: 31727 MB
node 6 cpus: 1 5 9 13 17 21 25 29
node 6 size: 32254 MB
node 6 free: 32004 MB
node 7 cpus: 33 37 41 45 49 53 57 61
node 7 size: 32237 MB
node 7 free: 31628 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  22  16  22  16  22 
  1:  16  10  22  16  16  22  22  16 
  2:  16  22  10  16  16  16  16  16 
  3:  22  16  16  10  16  16  22  22 
  4:  16  16  16  16  10  16  16  22 
  5:  22  22  16  16  16  10  22  16 
  6:  16  22  16  22  16  22  10  16 
  7:  22  16  16  22  22  16  16  10 

> I assume this should be equivalent to what your test code is doing with the non- (or less-)portable pthread_setaffinity_np() function, which appears to be a GNU extension

You are absolutely right, I just took the first one in my test... just because ;) I have replaced it with sched_setaffinity for consistency.

Nevertheless, I want to update you on my tests: I started thinking that the problem was about thread switching in critical sections (not the kind that creates a data race, but a couple of instructions one expects to be executed without interruption), but now I am not sure. The test I have now is much closer to what goes on inside OpenBLAS -- a sched_setaffinity is called inside the thread. I perform the following steps inside each spawned thread, in order:

  • The APIC ID is collected
  • sched_setaffinity is called to make the thread affine to the CPU equal to the thread index
  • The APIC ID is collected again
  • The thread is explicitly yielded
  • The APIC ID is collected once more

I see that the first ID is (generally) different from the second, which in turn is always equal to the third. This means that, at least on this machine with this OS, the rescheduling happens as soon as sched_setaffinity is called. I just wanted to be sure that nothing unexpected happens in this context.
Oh, and the test written this way never leads to APIC ID collisions. Probably I am heading in the wrong direction 😞

@dodomorandi commented:

I have been quite busy, but I took a bit of time to make a simple check. I initialized all the values of common->cpu_info inside driver/others/init.c with ULONG_MAX. Then I put some assertions around, just to check that the cpu_info entries had been written, and I see these assertions firing.

I am not completely sure this is the right way to perform this check, but surely @martin-frbg has better knowledge of the code. Martin, if you have the possibility to give it a look, maybe we are on the right track. Thanks! 😄

@martin-frbg (Collaborator) commented:

Not sure what you want me to do here - are you seeing active cpu nodes that never had their cpu_info set to a sensible value? From the git history, the last person to do any serious work on the NUMA mapping seems to have been ashwinyes (remapping cpus to keep all tasks on a common node if possible). I do note there are already a few printfs in the code to dump the cpu_info, which should become active when compiling with DEBUG=1.

@dodomorandi commented:

I think I fixed the issue. As I said previously, I made a fork where you can go and perform some tests.

Let me explain what I found. numa_mapping sets the common->cpu_info array using the counter count, which is incremented only when an assignment is performed. When hyperthreading is disabled, we end up assigning only common -> num_procs / 2 entries.
At the same time, inside local_cpu_map, mapping is used as a counter of mapped cores, while cpu is incremented in any case. This means that mapping can only reach common -> num_procs / 2 with hyperthreading disabled, but cpu can exceed this value. Therefore, here cpu_mapping[mapping] must be set from common->cpu_info[mapping], not common->cpu_info[cpu].

At the same time, cpu_sub_mapping maps onto the complete set of cores, so using it to take elements from common->cpu_info can only lead to reading uninitialized values.

The commit I pushed to my repo seems to fix the deadlock. I would really like to use the thread sanitizer to perform additional tests, but as suggested previously by @martin-frbg, this seems to introduce problems of its own. This could also mean that there are still some issues left, but it is difficult to say.

If you have the possibility to perform some tests, that would be really nice. In theory things should keep working for those for whom they were already working, while those who were experiencing the deadlock should now be able to run applications and tests without problems.
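A distilled illustration of the indexing mismatch described above (deliberately simplified, with made-up values; the real code in init.c is more involved). The array is pre-filled with a sentinel, mirroring the ULONG_MAX trick from the earlier comment, so the bad reads become visible:

#include <stdio.h>

#define NPROCS 8

int main(void) {
  long cpu_info[NPROCS];
  int count = 0;

  for (int i = 0; i < NPROCS; i++) cpu_info[i] = -1;   /* sentinel */

  /* Phase 1: with hyperthreading disabled only every other "core" is
   * assigned, so the valid entries are packed at indices 0..count-1. */
  for (int cpu = 0; cpu < NPROCS; cpu++)
    if (cpu % 2 == 0)
      cpu_info[count++] = 1000 + cpu;                  /* count ends at NPROCS/2 */

  /* Phase 2 (the bug pattern): indexing with the raw loop counter walks
   * past the packed region into the sentinel entries. */
  for (int cpu = 0, mapping = 0; cpu < NPROCS && mapping < count; cpu++) {
    if (cpu % 2 != 0) continue;
    printf("mapping %d: cpu_info[cpu=%d] = %ld (may be sentinel), "
           "cpu_info[mapping=%d] = %ld (valid)\n",
           mapping, cpu, cpu_info[cpu], mapping, cpu_info[mapping]);
    mapping++;
  }
  return 0;
}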

@martin-frbg (Collaborator) commented Apr 5, 2018

Thanks for the detailed explanation - in retrospect it makes sense that a misbehaviour in the mapping would be related to the dubious nature of the hyperthreading semi-cores. (Which would also prevent it from showing up on other big systems that use a different architecture.)
In my defense, I had looked at your fork but did not see a commit that contained the initialization of common->cpu_info that you had mentioned; then I had a look at the code paths that read or write this variable and became confused rather quickly. Still, after reading your patch I am unsure whether dropping cpu_sub_mapping would let the code still work as intended when hyperthreading is enabled (but I will take your word for it). Maybe I should just read what you wrote aloud three more times to get it.
As an alternative to the thread sanitizer, perhaps helgrind from the valgrind package can provide additional insight (but there too, the environmental changes necessary for observing may change the system under observation).

@dodomorandi commented:

> In my defense, I had looked at your fork but did not see a commit that contained the initialization of common->cpu_info that you had mentioned

Actually, this is my fault: I did not include the initialization and the assertions in the commit with the patch, to keep the focus on the fix. But you are right, it is useful to have a commit with these, and I will make one.

> As an alternative to the thread sanitizer, perhaps helgrind from the valgrind package can provide additional insight (but there too, the environmental changes necessary for observing may change the system under observation)

After I wrote the previous comment, I remembered I could use valgrind with helgrind. The result is different from the thread sanitizer (which led to a strange deadlock during the destruction phase), but it still shows issues. In detail, openblas_utest aborts on the check I put inside gotoblas_set_affinity to avoid node remapping. But I have not been able to reproduce this situation without valgrind (so far).

Even if changing the mapping behaviour seems to solve the deadlock problem, I want to be sure I did not introduce other issues, or, if there are other related bugs left, I would like to fix them. The problems with the thread sanitizer and valgrind leave me very suspicious.

@dodomorandi commented:

I ran some real-world tests, and the issues are still there 😢. I am going to continue searching for bugs.

@martin-frbg (Collaborator) commented:

Ouch, too bad. Still the same symptoms though, and are you testing with dgemv (as in the original issue here)? It is possible that the bug I tracked down in #1497 (using the return value from blas_quickdivide to calculate block sizes without checking that it is non-zero) may be present in some of the level3 BLAS files as well.

@martin-frbg (Collaborator) commented Dec 2, 2018

Hopefully this is now resolved with PR #1865, the fix for #1844 (where it was found that an old optimization (#532) for inputs with small M and large N had introduced local static storage that broke reentrancy).
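For readers wondering how local static storage breaks reentrancy, a generic illustration (not the actual kernel code from #532): every thread calling the non-reentrant variant shares the single static scratch buffer, so concurrent calls can overwrite each other's intermediate results; passing the workspace in (or allocating it per call) avoids that.

#define NB 64

/* Non-reentrant: one scratch buffer shared by all callers in the process. */
double *scale_copy_static(const double *x, int n, double alpha) {
  static double scratch[NB];
  for (int i = 0; i < n && i < NB; i++)
    scratch[i] = alpha * x[i];
  return scratch;            /* another thread may clobber this at any time */
}

/* Reentrant: the caller supplies its own workspace. */
void scale_copy(const double *x, int n, double alpha, double *work) {
  for (int i = 0; i < n; i++)
    work[i] = alpha * x[i];
}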
