
No module named 'graphgpt' #56

Closed
xxrrnn opened this issue Mar 28, 2024 · 21 comments

xxrrnn commented Mar 28, 2024

I am already inside the GraphGPT folder and run the file with `sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh`, but I still get this error. How can I fix it?

tjb-tech (Collaborator) commented Mar 29, 2024

> I am already inside the GraphGPT folder and run the file with `sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh`, but I still get this error. How can I fix it?

Hello, could you share the exact error message?

Melo-1017 commented:

@tjb-tech Hello, I have hit the same problem. Here is the command I run:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
And here is the error it produces:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
scripts/tune_script/graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23255) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 23256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 23257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 23258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

A previous issue mentioned that for ModuleNotFoundError: No module named 'graphgpt' you should switch the working directory to GraphGPT, but I checked the path and it is fine: I am already under the GraphGPT directory. Have you ever seen this situation? How should I handle it?

xxrrnn (Author) commented Mar 29, 2024

> @tjb-tech Hello, I have hit the same problem. Here is the command I run, and the error it produces: (quotes the full command and error log from the comment above)

My error message is the same as theirs.

Ffffffffire commented:

I ran into the same problem as well.

tjb-tech (Collaborator) commented:

Hello, have you tried putting the graphgpt_stage1.sh script directly under the GraphGPT directory before running it? Also, could you post the exact contents of the script you are using? That would help us debug this for you.

tjb-tech (Collaborator) commented:

> @tjb-tech Hello, I am now running under the GraphGPT directory. Below are the script contents and the output. Besides the "no module" error there are other errors as well; any help would be appreciated, thanks.

# to fill in the following path to run the first stage of our GraphGPT!
model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv
output_model=../stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Output:

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE    assets                 data        graphgpt            images      requirements.txt  stage_1  text-graph-grounding  vicuna-7b-v1.5-16k
README.md  clip_gt_arxiv_pub.pkl  graph_data  graphgpt_stage1.sh  playground  scripts           tests    vicuna-7b-v1.5        wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh ./scripts/tune_script/graphgpt_stage1.sh 
./scripts/tune_script/graphgpt_stage1.sh: 2: ./vicuna-7b-v1.5-16k: Permission denied
./data/graph_matching.json: 1: Syntax error: Unterminated quoted string
./scripts/tune_script/graphgpt_stage1.sh: 5: clip_gt_arxiv: not found
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19608) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 19609)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 19610)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 19611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19608)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Hello, the following lines in your shell script are incorrect. In shell, a variable assignment must not have a space after the "=": with the space, the shell parses the rest of the line as a command to run instead of as the variable's value:

model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv

They should be:

model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv

Also, have you tried putting the graphgpt_stage1.sh script under the GraphGPT directory before running it?
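To illustrate the point (a minimal sketch, not taken from the thread): in POSIX shells, `name= value` is parsed as "run the command `value` with `name` set to the empty string in its environment". That is why the log above shows `./vicuna-7b-v1.5-16k: Permission denied`: the shell tried to execute the model directory itself as a command.

```shell
#!/bin/sh
# 'model_path= true' is parsed as: run the command `true` with
# model_path="" in its environment; the assignment does not persist
# in the current shell.
model_path= true
echo "with a space:    [${model_path}]"   # empty

# 'model_path=./vicuna-7b-v1.5-16k' is an ordinary assignment.
model_path=./vicuna-7b-v1.5-16k
echo "without a space: [${model_path}]"
```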

Melo-1017 commented:

I just tried running it again under GraphGPT and got the same result. Here is what I run:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

My error output:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 122699) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 122700)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 122701)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 122702)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 122699)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

xxrrnn (Author) commented Mar 30, 2024

@tjb-tech
Thanks for your earlier reply. I have fixed the lines you pointed out and run the sh file under the GraphGPT folder, but I still get the "No module named graphgpt" error.
Script contents:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Output:

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE    assets                 data        graphgpt            images      requirements.txt  stage_1  text-graph-grounding  vicuna-7b-v1.5-16k
README.md  clip_gt_arxiv_pub.pkl  graph_data  graphgpt_stage1.sh  playground  scripts           tests    vicuna-7b-v1.5        wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh graphgpt_stage1.sh
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19846) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 19847)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 19848)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 19849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19846)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

tjb-tech (Collaborator) commented:

Hello, I have not run into this problem before. Could you try replacing python with python3, or replacing python -m torch.distributed.run with torchrun? Concretely:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1

wandb offline
python3 -m torch.distributed.run  --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
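One more thing that may be worth ruling out (my own guess, not something confirmed in this thread): `graphgpt/train/train_mem.py` is launched as a plain script, so the worker processes put the script's directory (`graphgpt/train/`) on `sys.path` rather than the repo root, and `import graphgpt` can then fail even when the working directory is correct. Putting the repo root on `PYTHONPATH` before launching, or installing the repo as an editable package if it ships a `setup.py`/`pyproject.toml`, is a common fix for this kind of `ModuleNotFoundError`:

```shell
# Run from inside the GraphGPT checkout.

# Option 1: put the repo root at the front of PYTHONPATH so every
# spawned worker process can resolve `import graphgpt`:
export PYTHONPATH="$PWD${PYTHONPATH:+:$PYTHONPATH}"
# python3 -m torch.distributed.run ... graphgpt/train/train_mem.py ...

# Option 2: install the repo in editable mode (only if the repo
# provides a setup.py or pyproject.toml):
# pip install -e .
```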

tjb-tech (Collaborator) commented:

You can give the approach above a try.

Melo-1017 commented:

@tjb-tech Hello, I just tried this method; the error seems different now:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
      File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (        
from graphgpt.train.llama_flash_attn_monkey_patch import (from graphgpt.train.llama_flash_attn_monkey_patch import (

ModuleNotFoundError: No module named 'graphgpt'
ModuleNotFoundErrorModuleNotFoundError: : No module named 'graphgpt'No module named 'graphgpt'

    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@tjb-tech
Copy link
Collaborator

@tjb-tech Hi, I just tried that method, and the error seems to have changed:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
      File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (        
from graphgpt.train.llama_flash_attn_monkey_patch import (from graphgpt.train.llama_flash_attn_monkey_patch import (

ModuleNotFoundError: No module named 'graphgpt'
ModuleNotFoundErrorModuleNotFoundError: : No module named 'graphgpt'No module named 'graphgpt'

    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Have you tried `torchrun`? Or changing `python3 -m torch.distributed.run` to `python3.8 -m torch.distributed.run` or `python3.8 torch.distributed.run`? We have never run into this problem before, so please try all of them.

@Melo-1017
Copy link


Have you tried `torchrun`? Or changing `python3 -m torch.distributed.run` to `python3.8 -m torch.distributed.run` or `python3.8 torch.distributed.run`? We have never run into this problem before, so please try all of them.

OK. With `python3.8 -m torch.distributed.run` the same error persists, and `python3.8 torch.distributed.run` shows:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
python3.8: can't open file 'torch.distributed.run': [Errno 2] No such file or directory

@tjb-tech
Copy link
Collaborator

tjb-tech commented Mar 30, 2024


OK. With `python3.8 -m torch.distributed.run` the same error persists, and `python3.8 torch.distributed.run` shows:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
python3.8: can't open file 'torch.distributed.run': [Errno 2] No such file or directory

Hi, the cause of this error is still unclear (it runs fine on my machine, but some users have reported the problem). As a temporary workaround, you can add the following code at the very top of train_mem.py to add the path explicitly:

import os
import sys

# train_mem.py lives at <repo>/graphgpt/train/, so going up two directory
# levels from the script's own directory yields the repo root; appending it
# to sys.path lets `import graphgpt` resolve.
curPath = os.path.abspath(os.path.dirname(__file__))
rootPath = os.path.split(os.path.split(curPath)[0])[0]
print(curPath, rootPath)
sys.path.append(rootPath)

We will look into what is actually causing this problem later.
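The two nested `os.path.split` calls above simply climb two directory levels from the script to the repo root. The same computation can be sketched with `pathlib`; the path below is only an illustration, not necessarily your install location:

```python
from pathlib import Path

# train_mem.py sits at <repo>/graphgpt/train/train_mem.py, so the repo root
# (the directory that contains the `graphgpt` package) is two levels above
# the script's own directory.
script = Path("/root/autodl-tmp/GraphGPT/graphgpt/train/train_mem.py")
repo_root = script.parents[2]
print(repo_root)
```

In train_mem.py itself, `Path(__file__).resolve().parents[2]` would play the role of `script.parents[2]`.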

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

I solved the "No module" problem by setting an environment variable (method 2 in the linked post):
https://blog.csdn.net/weixin_48594878/article/details/120461124

What I used:
export PYTHONPATH=$PYTHONPATH:/root/autodl-tmp/GraphGPT/graphgpt
source /etc/profile
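For reference, `import graphgpt` can only resolve when the directory that *contains* the `graphgpt` package, i.e. the repo root, is on the module search path. A self-contained sketch with a throwaway dummy package (all paths here are hypothetical, not part of the repo):

```shell
# Build a dummy layout: /tmp/pp_demo plays the repo root and
# /tmp/pp_demo/graphgpt plays the package.
mkdir -p /tmp/pp_demo/graphgpt
touch /tmp/pp_demo/graphgpt/__init__.py

# Putting the package's *parent* directory on PYTHONPATH makes the import work.
PYTHONPATH="/tmp/pp_demo" python3 -c "import graphgpt; print('import ok')"
```

The same rule applies to the real repo: the path exported should be the checkout root, the directory whose listing shows the `graphgpt/` folder.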

@tjb-tech
Copy link
Collaborator


Thank you for your answer!

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

After resolving the "No module" error, the following error appeared. How should I fix it?

You are using a model of type llama to instantiate a model of type GraphLlama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                                                                                                     | 0/2 [00:00<?, ?it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1283 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1284) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

@tjb-tech
Copy link
Collaborator

--nnodes=1 --nproc_per_node=4

The failure happens while the model is loading. Have you adjusted `--nnodes=1 --nproc_per_node=4` to match your machine?
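As a rule of thumb, `--nproc_per_node` should match the number of GPUs actually visible on the machine. A hedged shell sketch for picking it up automatically (falls back to 1 when `nvidia-smi` is unavailable):

```shell
# Count visible GPUs; fall back to 1 on machines without nvidia-smi.
NGPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
[ "$NGPUS" -gt 0 ] || NGPUS=1
echo "nproc_per_node=$NGPUS"
# e.g.: python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=$NGPUS ...
```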

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

--nnodes=1 --nproc_per_node=4

The failure happens while the model is loading. Have you adjusted `--nnodes=1 --nproc_per_node=4` to match your machine?

OK, I have changed it, but now there is a new error:
AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'

@tjb-tech
Copy link
Collaborator

OK, I have changed it, but now there is a new error: AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'

You can refer to issue #7.

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 20, in <module>
    train()
  File "/root/autodl-tmp/GraphGPT/graphgpt/train/train_graph.py", line 871, in train
    model_graph_dict = model.get_model().initialize_graph_modules(
  File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 139, in initialize_graph_modules
    clip_graph, args= load_model_pretrained(CLIP, self.config.pretrain_graph_model_path) 
  File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 54, in load_model_pretrained
    assert osp.exists(osp.join(pretrain_model_path, 'config.json')), 'config.json missing'
AssertionError: config.json missing

Sorry, I still have an error here, and I wonder where the problem is. The directory written as pretrain_graph_model_path in the vicuna config does contain a config.json, but it still fails:

{
  "_name_or_path": "vicuna-7b-v1.5-16k",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_sequence_length": 16384,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000, 
  "graph_hidden_size": 128, 
  "pretrain_graph_model_path": "/root/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT/"
}

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT# ls
clip_gt_arxiv_pub.pkl  config.json
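Since the failing line only checks `osp.exists(osp.join(pretrain_model_path, 'config.json'))`, the check can be reproduced in isolation; printing the exact path the training process receives usually reveals the mismatch (a relative working directory, a typo, or the edited config not being the one actually loaded). A minimal sketch, using a temporary directory in place of the real checkpoint folder:

```python
import os.path as osp
import tempfile

def has_config(pretrain_model_path):
    # Replicates the assertion in load_model_pretrained: the directory
    # must directly contain a config.json file.
    return osp.exists(osp.join(pretrain_model_path, "config.json"))

with tempfile.TemporaryDirectory() as d:
    before = has_config(d)                         # no config.json yet
    open(osp.join(d, "config.json"), "w").close()  # create an empty one
    after = has_config(d)
print(before, after)
```

Inserting a `print(pretrain_model_path)` just before the assertion in GraphLlama.py would show the string it is really testing.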
