
'named symbol not found' error at runtime in cuda_piKernelCreate if static library with kernel is linked #4307

Closed
ivorobts opened this issue Aug 10, 2021 · 13 comments · Fixed by #4616
Labels: bug (Something isn't working), compiler (Compiler related issue), cuda (CUDA back-end)

Comments


ivorobts commented Aug 10, 2021

A simple example showing the use of a static library containing a kernel, which fails at runtime with the CUDA BE.
Here is the content of the file sycl_lib.cpp:

#include<CL/sycl.hpp>
using namespace sycl;
void func() {
  queue q;
  q.submit([&](sycl::handler &h) {
      sycl::stream os(1024, 768, h);
      h.parallel_for(32, [=](sycl::id<1> i) {
          os<<i<<"\n";
        });
    });
}

and a simple main.cpp:

void func();
int main() {
  func();
  return 0;
}

Compilation of sycl_lib.cpp, static library creation, and linking all work fine:

clang++  -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -fsycl-unnamed-lambda -c sycl_lib.cpp
ar rvs sycl_lib.a sycl_lib.o
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice main.cpp -o test-lib sycl_lib.a
SYCL_BE=PI_CUDA ./test-lib

The application crashes at runtime with the following error:

PI CUDA ERROR:
        Value:           500
        Name:            CUDA_ERROR_NOT_FOUND
        Description:     named symbol not found
        Function:        cuda_piKernelCreate
        Source Location: /llvm/sycl/plugins/cuda/pi_cuda.cpp:2380

terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown OpenCL error code) -999 (Unknown OpenCL error code)

Note that ahead-of-time compilation and linking work fine for CPU or Intel GPU devices. This looks like a limitation of the CUDA BE.

ivorobts added the bug label on Aug 10, 2021

Michoumichmich commented Aug 10, 2021

Hello, I also get the same issue on the CUDA BE, but this is not a limitation of the CUDA back-end. Could you try https://github.com/Michoumichmich/llvm/commit/e3eb050b73bff24cb8335f84fa6a51f1ebaf05c5.diff? This is merely a temporary workaround for a bug that was introduced by f7ce532 in a pulldown.

rodburns added the cuda label on Aug 11, 2021
ravil-mobile (Contributor) commented:

Thanks a lot. It saved my day.


Michoumichmich commented Aug 12, 2021

Thanks a lot. It saved my day.

You're welcome!

danchitnis commented:

It seems to happen on some versions but not others. It happens both on WSL and native Linux; I assumed it was related to a GPU driver mismatch with CUDA. Perhaps you can make the error info more useful with debugging traces?


Michoumichmich commented Aug 20, 2021

It seems to happen on some versions but not others. It happens both on WSL and native Linux; I assumed it was related to a GPU driver mismatch with CUDA. Perhaps you can make the error info more useful with debugging traces?

The error comes from the fact that the names of the generated device object files were changed, but clang-offload-bundler is still called with the old names, so it fails silently. If you compile with -v you won't even see ptxas compiling the kernels during the final linking of the .so object or executable. It might seem to work if you did not clean your build directory, as you would then be using old files.
If your bug depends on the version of CUDA, it might not be related to this exact issue.
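
As a quick sanity check, one can grep the verbose driver output for ptxas during the final link. This is only a sketch reusing the file names from the original report; the exact output will vary between compiler versions:

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -v main.cpp sycl_lib.a -o test-lib 2>&1 | grep ptxas

If nothing is printed, the device code was not compiled as part of the link.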

danchitnis commented:

So far it is working for me, but I am not using the latest commit (it is about a month old). I am aware that there is a recent bug with NVIDIA containers affecting WSL (and native Linux), so these two bugs may not be related.


Michoumichmich commented Sep 3, 2021

Hello @ivorobts, can you try compiling with nvptx64-nvidia-cuda--sm_50 on the current branch to see if it's fixed?
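
For reference, here is how that triple would slot into the commands from the original report; this is untested and simply substitutes the suggested triple string into the existing -fsycl-targets flag:

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda--sm_50 -fsycl-unnamed-lambda -c sycl_lib.cpp
ar rvs sycl_lib.a sycl_lib.o
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda--sm_50 main.cpp -o test-lib sycl_lib.a
SYCL_BE=PI_CUDA ./test-lib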


npmiller commented Sep 8, 2021

This was fixed recently by this PR:

If you want to try it, note that your SYCL code is missing a wait, which might cause issues at the moment; something like this works fine though:

#include<CL/sycl.hpp>
using namespace sycl;
void func() {
  queue q;
  q.submit([&](sycl::handler &h) {
      sycl::stream os(1024, 768, h);
      h.parallel_for(32, [=](sycl::id<1> i) {
          os<<i<<"\n";
        });
    }).wait_and_throw();
}

I'm not entirely sure whether it should work without the wait, but either way that would be a separate issue to investigate; the original problem reported here should be fixed.


npmiller commented Sep 8, 2021

Just as a side note, I double-checked with the SYCL 2020 specification and the wait is indeed required in this case; the queue destructor does not implicitly wait on kernel completion. Section 3.9.8.1:

Note that the destructors of other SYCL objects (sycl::queue, sycl::context, ...) do not block. Only a sycl::buffer, sycl::sampled_image or sycl::unsampled_image destructor might block. The rationale is that an object without any side effect on the host does not need to block on destruction as it would impact the performance. So it is up to the programmer to use a member function to wait for completion in some cases if this does not fit the goal.
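
In other words, the wait has to be explicit. As an illustration only, here is a minimal variation of the snippet above that waits on the queue itself instead of on the event returned by submit (either form should be fine):

#include <CL/sycl.hpp>
using namespace sycl;
void func() {
  queue q;
  q.submit([&](sycl::handler &h) {
      sycl::stream os(1024, 768, h);
      h.parallel_for(32, [=](sycl::id<1> i) {
          os << i << "\n";
        });
    });
  // Wait explicitly before the queue goes out of scope; per the
  // specification quoted above, the queue destructor will not block.
  q.wait_and_throw();
}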

AerialMantis (Contributor) commented:

We believe this issue to be resolved now, so we are closing the ticket. @ivorobts, if you continue to see any issue here please feel free to comment and we can re-open the ticket.

Luigi-Crisci commented:

Hi,
I'm having the same issue with the latest build from the sycl branch when I try to compile the same code provided by @ivorobts. This happens when targeting nvptx64_nvidia_cuda.

This was fixed recently by this PR:

If you want to try it, note that your SYCL code is missing a wait, which might cause issues at the moment; something like this works fine though:

#include<CL/sycl.hpp>
using namespace sycl;
void func() {
  queue q;
  q.submit([&](sycl::handler &h) {
      sycl::stream os(1024, 768, h);
      h.parallel_for(32, [=](sycl::id<1> i) {
          os<<i<<"\n";
        });
    }).wait_and_throw();
}

I'm not entirely sure whether it should work without the wait, but either way that would be a separate issue to investigate; the original problem reported here should be fixed.

I see that it should have been fixed; maybe the latest build broke it again?


bader commented Sep 21, 2021

Re-opening the issue to check again.

bader reopened this on Sep 21, 2021
npmiller (Contributor) commented:

I had a look and it is indeed failing again with the most recent build. I was able to track it down to the following commit:

I'll look into fixing this again.

npmiller added a commit to npmiller/llvm that referenced this issue Sep 22, 2021
The patch 9838076 changed triple processing so the GPU arch could be deduced even without the extra `-`, so we no longer need to add padding. It also seems like SYCL was inadvertently removed from the branch adding the bound arch to the triple.

So this patch fixes adding the bound arch to the triple when using SYCL and removes the leftover triple padding code in the offload deps command.

The test was also updated accordingly, and it now also checks the triple used for `clang-offload-deps` so that we can hopefully catch mismatches between the two earlier in the future.

This fixes intel#4307
romanovvlad pushed a commit that referenced this issue Sep 23, 2021