"Import Error: /libs/libth_common.so: Undefined Symbol" While Building #808

Closed
eurus-ch opened this issue Jan 4, 2024 · 8 comments

eurus-ch commented Jan 4, 2024

Hi,

While trying to run this:

python build.py --model_dir $model_dir$ \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --max_batch_size 4 \
                --max_input_len 128 \
                --max_output_len 128

we run into this fatal error, a strange undefined symbol:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 56, in _init
    torch.classes.load_library(ft_decoder_lib)
  File "/usr/local/lib/python3.10/dist-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 841, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "TensorRT-LLM/examples/llama/build.py", line 33, in <module>
    from weight import (get_scaling_factors, load_from_awq_llama, load_from_binary,
  File "TensorRT-LLM/examples/llama/weight.py", line 24, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 61, in <module>
    _init(log_level="error")
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 59, in _init
    raise ImportError(str(e) + msg)
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

Before that, we built the wheel with:

python3 ./scripts/build_wheel.py --clean  --trt_root /usr/local/tensorrt

The software versions are:

tensorboard               2.9.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorrt                  9.2.0.post12.dev5
tensorrt-llm              0.7.1
torch-tensorrt            0.0.0
pytorch-quantization      2.1.2
torch                     2.1.0a0+32f93b1
torch-tensorrt            0.0.0
torchdata                 0.7.0a0
torchtext                 0.16.0a0
torchvision               0.16.0a0

Do you have any clue how to solve this? Many thanks!

Shixiaowei02 self-assigned this Jan 4, 2024
Shixiaowei02 added the triaged label (Issue has been triaged by maintainers) Jan 4, 2024

Shixiaowei02 (Collaborator) commented Jan 4, 2024

Please ensure that you build and run TensorRT-LLM in the same environment. Alternatively, you can try building TensorRT-LLM in a Docker container by executing this command:

make -C docker release_build

Thank you!
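
A note on why the build and run environments matter here. The missing symbol in the first traceback contains `Ss`, the Itanium C++ ABI abbreviation for the pre-C++11 libstdc++ `std::string`, which suggests `libth_common.so` was compiled against a PyTorch built with the old string ABI. The sketch below is a rough substring heuristic for reading such symbols, not a demangler; the function names are hypothetical, and `torch.compiled_with_cxx11_abi()` is only called if you invoke the helper yourself.

```python
def guess_string_abi(mangled: str) -> str:
    """Guess the libstdc++ string ABI implied by an Itanium-mangled symbol.

    Returns "cxx11", "pre-cxx11", or "unknown". This only looks for telltale
    substrings, so it can misfire on identifiers that happen to contain them.
    """
    if "cxx11" in mangled:
        # Matches the std::__cxx11:: namespace and the [abi:cxx11] tag (B5cxx11).
        return "cxx11"
    if "Ss" in mangled:
        # "Ss" abbreviates std::string as mangled under the old (pre-C++11) ABI.
        return "pre-cxx11"
    # No std::string in the signature, so the symbol says nothing about the ABI.
    return "unknown"


def torch_uses_cxx11_abi() -> bool:
    """Report the ABI of the PyTorch that is importable in this environment."""
    import torch  # imported lazily so the heuristic above works without PyTorch

    return bool(torch.compiled_with_cxx11_abi())


# Undefined symbols reported in this thread.
SYMBOLS_FROM_THREAD = [
    "_ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_",
    "_ZN3c1017RegisterOperatorsD1Ev",
    "_ZNK12tensorrt_llm8executor8Response11getErrorMsgB5cxx11Ev",
]
```

If `guess_string_abi` says "pre-cxx11" for the symbol the library wants while `torch_uses_cxx11_abi()` returns `True` (or the reverse), the wheel was most likely built against a different PyTorch than the one being imported. An "unknown" result, as for `_ZN3c1017RegisterOperatorsD1Ev`, leaves the ABI undetermined and points more generally at a PyTorch version mismatch.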

eurus-ch (Author) commented Jan 4, 2024

With tensorrt-llm 0.6.1, the error changes to this:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMemoryInfo_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/TensorRT-LLM/./examples/llama/build.py", line 906, in <module>
    build(0, args)
  File "/TensorRT-LLM/./examples/llama/build.py", line 850, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/TensorRT-LLM/./examples/llama/build.py", line 609, in build_rank_engine
    profiler.print_memory_usage(f'Rank {rank} Engine build starts')
  File "/TensorRT-LLM/tensorrt_llm/profiler.py", line 197, in print_memory_usage
    alloc_device_mem, _, _ = device_memory_info(device=device)
  File "/TensorRT-LLM/tensorrt_llm/profiler.py", line 148, in device_memory_info
    mem_info = _device_get_memory_info_fn(handle)
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 2438, in nvmlDeviceGetMemoryInfo
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
  File "/usr/local/lib/python3.10/dist-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

Thank you, but I'm developing inside a Docker container, and building another container within it seems to be restricted, so...

woskii commented Jan 8, 2024

ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

===================================================================
I solved this error by manually installing PyTorch 2.1.0, with a command like this:

pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
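
Before reinstalling, it can help to see which of these pins the current environment already satisfies. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the comparison deliberately ignores the local build tag (such as `+cu121`) while treating pre-release versions like `2.1.0a0` as a mismatch.

```python
from importlib import metadata

# Pins taken from the pip command above.
PINS = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def public_version(version: str) -> str:
    """Drop the local build tag, e.g. "2.1.0+cu121" becomes "2.1.0"."""
    return version.split("+", 1)[0]


def find_mismatches(pins, installed):
    """Return {package: (wanted, found)} for every pin that is not satisfied.

    `installed` maps package names to version strings; a missing or None
    entry means the package is not installed.
    """
    mismatches = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        if found is None or public_version(found) != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_versions(names):
    """Look up installed versions; packages that are absent map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Calling `find_mismatches(PINS, installed_versions(PINS))` reports what differs in the active environment. With the versions from the original report (`torch 2.1.0a0+32f93b1`, `torchvision 0.16.0a0`, no `torchaudio` listed), all three pins would be flagged.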

ekagra-ranjan commented Jan 11, 2024

> raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
> pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

I faced this issue too. This was the fix: NVIDIA/k8s-device-plugin#331 (comment)

dongteng commented Feb 2, 2024

> ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
> FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.
>
> ===================================================================
> I solved this error by manually installing pytorch 2.1.0
> Command like this:
> pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

Thanks

AbhisKmr commented Apr 1, 2024

I'm still facing the same issue.

> ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_
> FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.
>
> ===================================================================
> I solved this error by manually installing pytorch 2.1.0
> Command like this:
> pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

I'm still getting the same error. My environment is:
attrs 23.2.0
av 10.0.0
bcrypt 4.1.2
braceexpand 0.1.7
certifi 2020.6.20
cffi 1.16.0
chardet 4.0.0
charset-normalizer 3.3.2
coloredlogs 15.0.1
cryptography 42.0.5
ctranslate2 3.24.0
dbus-python 1.2.16
distro 1.9.0
distro-info 1.0+deb11u1
docker 7.0.0
docker-compose 1.29.2
dockerpty 0.4.1
docopt 0.6.2
einops 0.7.0
encodec 0.1.1
fastcore 1.5.29
faster-whisper 0.9.0
fastprogress 1.0.3
ffmpeg-python 0.2.0
filelock 3.13.3
flatbuffers 24.3.25
fsspec 2024.3.1
future 1.0.0
httplib2 0.18.1
huggingface-hub 0.17.3
humanfriendly 10.0
HyperPyYAML 1.2.2
idna 2.10
Jinja2 3.1.3
joblib 1.3.2
jsonschema 3.2.0
kaldialign 0.9.1
llvmlite 0.42.0
MarkupSafe 2.1.5
more-itertools 10.2.0
mpmath 1.3.0
networkx 3.2.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
nvidia-pyindex 1.0.9
onnxruntime 1.16.0
openai-whisper 20231117
packaging 24.0
paramiko 3.4.0
pillow 10.2.0
pip 20.3.4
protobuf 5.26.1
pycparser 2.22
pycurl 7.43.0.6
PyGObject 3.38.0
PyNaCl 1.5.0
pyrsistent 0.20.0
PySimpleSOAP 1.16.2
python-apt 2.2.1
python-debian 0.1.39
python-debianbts 3.1.0
python-dotenv 0.21.1
python-snappy 0.5.3
PyYAML 5.4.1
regex 2023.12.25
reportbug 7.10.3+deb11u1
requests 2.31.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
scipy 1.12.0
sentencepiece 0.2.0
setuptools 52.0.0
six 1.16.0
soundfile 0.12.1
speechbrain 0.5.16
sympy 1.12
tensorrt 8.6.1.post1
tensorrt-bindings 8.6.1
tensorrt-libs 8.6.1
texttable 1.7.0
tiktoken 0.3.3
tokenizers 0.14.1
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchvision 0.16.0+cu121
tqdm 4.66.2
triton 2.1.0
typing-extensions 4.10.0
unattended-upgrades 0.1
urllib3 1.26.5
vocos 0.1.0
websocket-client 0.59.0
websockets 12.0
wheel 0.34.2
WhisperSpeech 0.8

Development hardware: Google Cloud

Error message:

FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 58, in _init
    torch.classes.load_library(ft_decoder_lib)
  File "/usr/local/lib/python3.10/dist-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 933, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/WhisperFusion/main.py", line 11, in <module>
    from whisper_live.trt_server import TranscriptionServer
  File "/root/WhisperFusion/whisper_live/trt_server.py", line 17, in <module>
    from whisper_live.trt_transcriber import WhisperTRTLLM
  File "/root/WhisperFusion/whisper_live/trt_transcriber.py", line 16, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 64, in <module>
    _init(log_level="error")
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 61, in _init
    raise ImportError(str(e) + msg)
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libth_common.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev
FATAL: Decoding operators failed to load. This may be caused by the incompatibility between PyTorch and TensorRT-LLM. Please rebuild and install TensorRT-LLM.

CoderHam pushed a commit to CoderHam/TensorRT-LLM that referenced this issue May 3, 2024
- There is a bug that was fixed in the 526 driver release. For older driver versions the recommendation is to downgrade the pynvml version to 11.4.0 and use 11.5.0 only for drivers after 526.

Uses the legacy pynvml memory usage function even with pynvml 11.5.0 if the driver version is older than 526.

Mentioned in the issue as well: NVIDIA#808 (comment)
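
The driver-version rule in that commit message can be sketched as follows. This is an illustration of the decision only, not the code from the referenced commit; the function names are hypothetical. The one pynvml call used to read the driver version, `nvmlSystemGetDriverVersion`, is wrapped in a helper that is not run unless you call it.

```python
def driver_major(driver_version) -> int:
    """Parse the major component of a driver version such as "525.105.17"."""
    if isinstance(driver_version, bytes):
        # Some pynvml releases return bytes rather than str.
        driver_version = driver_version.decode()
    return int(driver_version.split(".", 1)[0])


def should_use_legacy_memory_info(driver_version, first_fixed_major: int = 526) -> bool:
    """True when the driver predates the release that carries the fix.

    On such drivers libnvidia-ml.so.1 may lack nvmlDeviceGetMemoryInfo_v2,
    so the older, unversioned memory query is the safer choice.
    """
    return driver_major(driver_version) < first_fixed_major


def query_driver_version():
    """Read the driver version through NVML (requires pynvml and an NVIDIA driver)."""
    import pynvml  # imported lazily so the pure helpers work without a GPU

    pynvml.nvmlInit()
    try:
        return pynvml.nvmlSystemGetDriverVersion()
    finally:
        pynvml.nvmlShutdown()
```

For example, `should_use_legacy_memory_info(query_driver_version())` tells you whether the machine falls on the "older than 526" side of the rule, in which case pinning pynvml to 11.4.0 is the recommendation quoted above.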
nv-guomingz (Collaborator)

Please feel free to reopen this ticket if needed.

DeekshithaDPrakash commented Nov 25, 2024

+------------------+---------+-------------+
| Model            | Version | Status      |
+------------------+---------+-------------+
| postprocessing   | 1       | READY       |
| preprocessing    | 1       | READY       |
| tensorrt_llm     | 1       | UNAVAILABLE |
| tensorrt_llm_bls | 1       | READY       |
+------------------+---------+-------------+

Full status for tensorrt_llm:
UNAVAILABLE: Not found: unable to load shared library: /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so: undefined symbol: _ZNK12tensorrt_llm8executor8Response11getErrorMsgB5cxx11Ev

I hit the same issue when I execute the launch_triton_server.py file.
