ModuleNotFoundError: No module named 'transformers_modules' with API serving using baichuan-7b #572

Closed
McCarrtney opened this issue Jul 25, 2023 · 14 comments

@McCarrtney

I tried to deploy an API server using baichuan-7b, but there is an error:

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model /root/data/zyy/baichuan-7B --host 0.0.0.0 --port 11114 --tensor-parallel-size 2 --trust-remote-code
(RayWorker pid=53626) No module named 'transformers_modules'
(RayWorker pid=53626) Traceback (most recent call last):
(RayWorker pid=53626)   File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/ray/_private/serialization.py", line 387, in deserialize_objects
(RayWorker pid=53626)     obj = self._deserialize_object(data, metadata, object_ref)
(RayWorker pid=53626)   File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/ray/_private/serialization.py", line 268, in _deserialize_object
(RayWorker pid=53626)     return self._deserialize_msgpack_data(data, metadata_fields)
(RayWorker pid=53626)   File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/ray/_private/serialization.py", line 223, in _deserialize_msgpack_data
(RayWorker pid=53626)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(RayWorker pid=53626)   File "/root/miniconda3/envs/vllm/lib/python3.9/site-packages/ray/_private/serialization.py", line 213, in _deserialize_pickle5_data
(RayWorker pid=53626)     obj = pickle.loads(in_band)
(RayWorker pid=53626) ModuleNotFoundError: No module named 'transformers_modules'
@imhuay

imhuay commented Jul 25, 2023

It seems that baichuan is not supported yet. You can refer to this repository: https://github.com/gameofdimension/vllm-cn

Or refer to this document: https://vllm.readthedocs.io/en/latest/models/adding_model.html

@McCarrtney
Author

Thank you! That's quite helpful

@Sanster
Contributor

Sanster commented Jul 27, 2023

Adding this environment variable (replace the path with your modules directory) makes it work, but the results generated by the model are completely incorrect.

PYTHONPATH=/root/.cache/huggingface/modules
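
For reference, here is a rough sketch of the same idea applied from Python rather than the shell. The modules path is the default Hugging Face cache location and the model path is copied from the report above; both are assumptions that may differ in your setup, and the variable has to be set before vLLM spawns its Ray workers:

import os

# Workaround sketch: prepend the dynamic-modules directory to PYTHONPATH so that
# freshly spawned Ray workers can import 'transformers_modules' when unpickling
# the model config. The path below is the default HF cache location (assumption).
os.environ["PYTHONPATH"] = (
    os.path.expanduser("~/.cache/huggingface/modules")
    + os.pathsep
    + os.environ.get("PYTHONPATH", "")
)

from vllm import LLM, SamplingParams  # import only after setting the variable

llm = LLM(
    model="/root/data/zyy/baichuan-7B",  # model path from the original report
    tensor_parallel_size=2,
    trust_remote_code=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))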

@mklf
Contributor

mklf commented Jul 27, 2023

This is caused by this call:

self._run_workers("init_worker",
                  get_all_outputs=True,
                  worker_init_fn=lambda: Worker(
                      self.model_config,
                      self.parallel_config,
                      ...
                  ))

This lambda function captures the self object, which holds a reference to the config class loaded from the remote Hugging Face modules ('transformers_modules'), so it cannot be unpickled on the Ray workers.

You can try this to avoid capturing the self object:

import copy
model_config = copy.deepcopy(self.model_config)
parallel_config = copy.deepcopy(self.parallel_config)
scheduler_config = copy.deepcopy(self.scheduler_config)
self._run_workers("init_worker",
                      get_all_outputs=True,
                      worker_init_fn=lambda: Worker(
                          model_config,
                          parallel_config,
                          scheduler_config,
                          None,
                          None,
                      ))
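
As a side note, here is a small self-contained sketch (hypothetical names, unrelated to vLLM's actual classes) showing why a lambda that closes over self drags the whole object, and therefore its config, into the payload that Ray ships to the workers:

import cloudpickle  # the serializer Ray uses for remote functions

class Engine:
    def __init__(self):
        # stands in for a heavyweight config whose class lives in a module
        # (e.g. 'transformers_modules') that only exists on the driver
        self.config = list(range(100_000))

    def bad_factory(self):
        # closes over self: serializing the lambda serializes the whole Engine
        return lambda: type(self.config)

    def good_factory(self):
        config_type = type(self.config)   # copy out only what the worker needs
        return lambda: config_type        # no reference to self is captured

engine = Engine()
print(len(cloudpickle.dumps(engine.bad_factory())))   # large: the Engine and its config ride along
print(len(cloudpickle.dumps(engine.good_factory())))  # small: only the captured value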

For baichuan, please note that baichuan-7b uses rotary embeddings while baichuan-13b uses ALiBi; you can refer to #512.

@Sanster
Contributor

Sanster commented Jul 27, 2023

@mklf Can you get normal output using baichuan-7b in distributed mode? In my environment MPT-7B works fine in distributed mode, but baichuan-7b/13b always returns garbage output.

[screenshot: baichuan-7b normal output using 1 GPU]

[screenshot: baichuan-7b garbage output using tensor_parallel_size=2]

@mklf
Contributor

mklf commented Jul 27, 2023

No, in #512 they mentioned:


Our code is currently only compatible with non-distributed deployments, i.e., setups involving a single GPU and single model.

While our code is operational with distributed deployment using tensor parallelism, the results it produces are not yet accurate. We are actively looking for community help to rectify this issue.

@mklf
Contributor

mklf commented Jul 27, 2023

@Sanster I fixed the bug. Here is the updated code: you can replace the existing load_weights method with the following (tested on baichuan-13b):

  def load_weights(self,
                   model_name_or_path: str,
                   cache_dir: Optional[str] = None,
                   use_np_cache: bool = False):
      tensor_model_parallel_rank = get_tensor_model_parallel_rank()
      state_dict = self.state_dict()

      for name, loaded_weight in hf_model_weights_iterator(
              model_name_or_path, cache_dir, use_np_cache):
          if "rotary_emb.inv_freq" in name:
              continue

          is_gate_up_weight = False
          for stride_id, weight_name in enumerate(["gate_proj", "up_proj"]):
              if weight_name not in name:
                  continue
              param = state_dict[name.replace(weight_name, "gate_up_proj")]
              shard_size = param.shape[0] // 2
              loaded_weight = loaded_weight[
                  shard_size * tensor_model_parallel_rank:shard_size *
                  (tensor_model_parallel_rank + 1)]
              param_slice = param.data[shard_size * stride_id:shard_size *
                                       (stride_id + 1)]
              assert param_slice.shape == loaded_weight.shape
              param_slice.copy_(loaded_weight)
              is_gate_up_weight = True
              break
          if is_gate_up_weight:
              continue

          param = state_dict[name]

          # Newly added: W_pack is the fused QKV projection. The checkpoint stores
          # it as [Q; K; V] stacked along dim 0; reorder it per attention head so
          # that row-wise sharding gives each tensor-parallel rank complete heads,
          # then restore the [Q; K; V] layout inside this rank's shard.
          if "W_pack.weight" in name:
              head_size = self.config.hidden_size // self.config.num_attention_heads
              loaded_weight = (
                  loaded_weight.contiguous()
                  .view(
                      3,
                      self.config.num_attention_heads,
                      head_size,
                      self.config.hidden_size,
                  )
                  .transpose(0, 1)
                  .contiguous()
                  .view(-1, self.config.hidden_size)
              )

              shard_size = param.shape[0]
              start = shard_size * tensor_model_parallel_rank
              end = shard_size * (tensor_model_parallel_rank + 1)
              loaded_weight = loaded_weight[start:end]
              loaded_weight = loaded_weight.view(
                  -1, 3, head_size, self.config.hidden_size
              )
              loaded_weight = loaded_weight.transpose(0, 1)
              loaded_weight = loaded_weight.reshape(-1, self.config.hidden_size)

          load_tensor_parallel_weights(param, loaded_weight, name,
                                       self._column_parallel_weights,
                                       self._row_parallel_weights,
                                       tensor_model_parallel_rank)
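
To make the reordering easier to follow, here is a standalone toy sketch (made-up sizes and variable names, independent of the code above) of what the W_pack transformation does for one tensor-parallel rank:

import torch

# toy dimensions, chosen only for illustration
hidden_size = 8
num_heads = 4
head_size = hidden_size // num_heads   # 2
tp_size = 2                            # tensor-parallel world size
rank = 0                               # this worker's rank

# the checkpoint stores the fused QKV weight as [Q; K; V] stacked on dim 0
w_pack = torch.arange(3 * hidden_size * hidden_size, dtype=torch.float32)
w_pack = w_pack.view(3 * hidden_size, hidden_size)

# 1) interleave per head: (3, heads, head_size, hidden) -> (heads, 3, head_size, hidden),
#    so contiguous row blocks now correspond to whole attention heads
interleaved = (
    w_pack.view(3, num_heads, head_size, hidden_size)
    .transpose(0, 1)
    .contiguous()
    .view(-1, hidden_size)
)

# 2) take this rank's rows (row-wise sharding)
shard_rows = interleaved.shape[0] // tp_size
local = interleaved[rank * shard_rows:(rank + 1) * shard_rows]

# 3) restore the [Q; K; V] ordering inside the shard, which is the layout
#    the attention layer expects
local = local.view(-1, 3, head_size, hidden_size).transpose(0, 1).reshape(-1, hidden_size)
print(local.shape)  # torch.Size([12, 8]) == (3 * hidden_size / tp_size, hidden_size)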

@McCarrtney
Author

We have tried baichuan-7b API serving on a single GPU; it works and the generation is good. But it is about 4x slower than LLaMA 7B. Is there anything that slows down baichuan inference?

@Sanster
Contributor

Sanster commented Jul 27, 2023

@mklf Thank you for your reminder. I have also found the issue in "load_weights". The method I used to fix it is similar to yours. Here is my implementation: #598

@zhuohan123
Member

Fixed by #599

@Jingru
Contributor

Jingru commented Aug 17, 2023

I have the same issue when serving a local internlm-7b model with tensor_parallel=4. Any ideas?

@Jingru
Contributor

Jingru commented Aug 25, 2023

I have the same issue when serving a local internlm-7b model with tensor_parallel=4. Any ideas?

This pull request #871 will solve the problem.

@lihkinVerma

lihkinVerma commented Dec 20, 2023

It's a long-standing issue in the transformers code. To fix it, never use a model name containing periods (.) when using the trust_remote_code feature; change the name to use an underscore (_) or some other symbol instead.

@nightflight-dk

The workaround for me was to switch to multiprocessing (disable Ray) and remove '.' from the model name (path), if present, before instantiating the engine, e.g. "weights/Phi3.5mini-instruct" -> "weights/Phi35mini-instruct".
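
A rough sketch of that workaround (the paths come from the example above; copying vs. renaming in place is a matter of preference):

import os
import shutil

from vllm import LLM

src = "weights/Phi3.5mini-instruct"   # local weights directory whose name contains '.'
dst = src.replace(".", "")            # "weights/Phi35mini-instruct"
if not os.path.exists(dst):
    shutil.copytree(src, dst)         # or os.rename(src, dst) to move instead of copy

# In recent vLLM versions the Ray backend can also be avoided by selecting the
# multiprocessing executor; the exact option name depends on the version you run.
llm = LLM(model=dst, trust_remote_code=True)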
