
No module named 'graphgpt' #56

Closed
xxrrnn opened this issue Mar 28, 2024 · 21 comments

xxrrnn commented Mar 28, 2024

I am already inside the GraphGPT folder and run the file with `sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh`, but I still get this error. How can I fix it?

tjb-tech (Collaborator) commented Mar 29, 2024

> I am already inside the GraphGPT folder and run the file with `sh ./GraphGPT/scripts/tune_script/graphgpt_stage1.sh`, but I still get this error. How can I fix it?

Hello, could you share the exact error message?

Melo-1017 commented:

@tjb-tech Hello, I have hit the same problem. Here is the command I run:
(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
And here is the error it produces:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash scripts/tune_script/graphgpt_stage1.sh
scripts/tune_script/graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23255) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 23256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 23257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 23258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-20_09:24:02
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

A previous issue mentioned that for ModuleNotFoundError: No module named 'graphgpt' you should switch the working directory to GraphGPT, but I checked the path and it is fine: I am already under the GraphGPT directory. Have you ever seen this situation? How should I handle it?

xxrrnn (Author) commented Mar 29, 2024

> @tjb-tech Hello, I have hit the same problem. Here is the command I run, and the error it produces: (quotes the full command and error log from the comment above)

My error message is the same as theirs.

Ffffffffire commented:

I ran into the same problem as well.

tjb-tech (Collaborator) commented:

Hello, have you tried putting the graphgpt_stage1.sh script directly under the GraphGPT directory before running it? Also, could you post the exact contents of the script you are using? That would help us debug this for you.

tjb-tech (Collaborator) commented:

> @tjb-tech Hello, I am now running under the GraphGPT directory. Below are the script contents and the output. Besides the "no module" error there are other errors as well; any help would be appreciated, thanks.

# to fill in the following path to run the first stage of our GraphGPT!
model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv
output_model=../stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Output:

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE    assets                 data        graphgpt            images      requirements.txt  stage_1  text-graph-grounding  vicuna-7b-v1.5-16k
README.md  clip_gt_arxiv_pub.pkl  graph_data  graphgpt_stage1.sh  playground  scripts           tests    vicuna-7b-v1.5        wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh ./scripts/tune_script/graphgpt_stage1.sh 
./scripts/tune_script/graphgpt_stage1.sh: 2: ./vicuna-7b-v1.5-16k: Permission denied
./data/graph_matching.json: 1: Syntax error: Unterminated quoted string
./scripts/tune_script/graphgpt_stage1.sh: 5: clip_gt_arxiv: not found
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19608) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 19609)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 19610)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 19611)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_14:34:46
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19608)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Hello, the following lines in your shell script are incorrect. In shell, a variable assignment must not have a space after the "=": with the space, the shell parses the rest of the line as a command to run instead of as the variable's value:

model_path= ./vicuna-7b-v1.5-16k
instruct_ds= ./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn= clip_gt_arxiv

They should be:

model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv

Also, have you tried putting the graphgpt_stage1.sh script under the GraphGPT directory before running it?
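To illustrate the point (a minimal sketch, not taken from the thread): in POSIX shells, `name= value` is parsed as "run the command `value` with `name` set to the empty string in its environment". That is why the log above shows `./vicuna-7b-v1.5-16k: Permission denied`: the shell tried to execute the model directory itself as a command.

```shell
#!/bin/sh
# 'model_path= true' is parsed as: run the command `true` with
# model_path="" in its environment; the assignment does not persist
# in the current shell.
model_path= true
echo "with a space:    [${model_path}]"   # empty

# 'model_path=./vicuna-7b-v1.5-16k' is an ordinary assignment.
model_path=./vicuna-7b-v1.5-16k
echo "without a space: [${model_path}]"
```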

Melo-1017 commented:

I just tried running it again under GraphGPT and got the same result. Here is what I run:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

My error output:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 122699) of binary: /opt/conda/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 122700)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 122701)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 122702)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:36:27
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 122699)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

xxrrnn (Author) commented Mar 30, 2024

@tjb-tech
Thanks for your earlier reply. I have fixed the lines you pointed out and run the sh file under the GraphGPT folder, but I still get the "No module named graphgpt" error.
Script contents:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb

Output:

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# ls
LICENSE    assets                 data        graphgpt            images      requirements.txt  stage_1  text-graph-grounding  vicuna-7b-v1.5-16k
README.md  clip_gt_arxiv_pub.pkl  graph_data  graphgpt_stage1.sh  playground  scripts           tests    vicuna-7b-v1.5        wandb
(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT# sh graphgpt_stage1.sh
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 10, in <module>
    from graphgpt.train.train_graph import train
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 19846) of binary: /root/miniconda3/envs/graphgpt/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 19847)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 19848)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 19849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_14:47:56
  host      : autodl-container-d05b4ca599-9ff30ed2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19846)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

tjb-tech (Collaborator) commented:

Hello, I have not run into this problem before. Could you try replacing python with python3, or replacing python -m torch.distributed.run with torchrun? Concretely:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=/root/nas/models_hf/vicuna-7b-v1.5
instruct_ds=/root/nas/GraphGPT/train_instruct_graphmatch.json
graph_data_path=/root/nas/GraphGPT/graphgpt/graph_data/graph_data_all.pt
pretra_gnn=/root/nas/GraphGPT/graphgpt/clip_gt_arxiv
output_model=/root/nas/GraphGPT/checkpoints/stage_1

wandb offline
python3 -m torch.distributed.run  --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
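One more thing that may be worth ruling out (my own guess, not something confirmed in this thread): `graphgpt/train/train_mem.py` is launched as a plain script, so the worker processes put the script's directory (`graphgpt/train/`) on `sys.path` rather than the repo root, and `import graphgpt` can then fail even when the working directory is correct. Putting the repo root on `PYTHONPATH` before launching, or installing the repo as an editable package if it ships a `setup.py`/`pyproject.toml`, is a common fix for this kind of `ModuleNotFoundError`:

```shell
# Run from inside the GraphGPT checkout.

# Option 1: put the repo root at the front of PYTHONPATH so every
# spawned worker process can resolve `import graphgpt`:
export PYTHONPATH="$PWD${PYTHONPATH:+:$PYTHONPATH}"
# python3 -m torch.distributed.run ... graphgpt/train/train_mem.py ...

# Option 2: install the repo in editable mode (only if the repo
# provides a setup.py or pyproject.toml):
# pip install -e .
```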

tjb-tech (Collaborator) commented:

You can give the approach above a try.

Melo-1017 commented:

@tjb-tech Hello, I just tried this method; the error seems different now:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
      File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (        
from graphgpt.train.llama_flash_attn_monkey_patch import (from graphgpt.train.llama_flash_attn_monkey_patch import (

ModuleNotFoundError: No module named 'graphgpt'
ModuleNotFoundErrorModuleNotFoundError: : No module named 'graphgpt'No module named 'graphgpt'

    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@tjb-tech
Copy link
Collaborator

@tjb-tech Hi, I just tried that method, and the error seems to have changed:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
  File "graphgpt/train/train_mem.py", line 4, in <module>
Traceback (most recent call last):
      File "graphgpt/train/train_mem.py", line 4, in <module>
from graphgpt.train.llama_flash_attn_monkey_patch import (        
from graphgpt.train.llama_flash_attn_monkey_patch import (from graphgpt.train.llama_flash_attn_monkey_patch import (

ModuleNotFoundError: No module named 'graphgpt'
ModuleNotFoundErrorModuleNotFoundError: : No module named 'graphgpt'No module named 'graphgpt'

    from graphgpt.train.llama_flash_attn_monkey_patch import (
ModuleNotFoundError: No module named 'graphgpt'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 123076) of binary: /opt/conda/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
graphgpt/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 123077)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 123078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 123079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_06:54:22
  host      : e2b5ff656edd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 123076)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Have you tried `torchrun`? Or changing `python3 -m torch.distributed.run` to `python3.8 -m torch.distributed.run` or `python3.8 torch.distributed.run`? We have never run into this problem before, so please try all of them.

@Melo-1017
Copy link


Have you tried `torchrun`? Or changing `python3 -m torch.distributed.run` to `python3.8 -m torch.distributed.run` or `python3.8 torch.distributed.run`? We have never run into this problem before, so please try all of them.

OK. With `python3.8 -m torch.distributed.run` the same error persists, and `python3.8 torch.distributed.run` shows:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
python3.8: can't open file 'torch.distributed.run': [Errno 2] No such file or directory

@tjb-tech
Copy link
Collaborator

tjb-tech commented Mar 30, 2024


OK. With `python3.8 -m torch.distributed.run` the same error persists, and `python3.8 torch.distributed.run` shows:

(graphgpt) root@e2b5ff656edd:~/nas/GraphGPT# bash graphgpt_stage1.sh
graphgpt_stage1.sh: line 8: wandb: command not found
python3.8: can't open file 'torch.distributed.run': [Errno 2] No such file or directory

Hi, the cause of this error is still unclear (it runs fine on my machine, but some users have reported the problem). As a temporary workaround, you can add the following code at the very top of train_mem.py to add the path explicitly:

import os
import sys

# train_mem.py lives at <repo>/graphgpt/train/, so going up two directory
# levels from the script's own directory yields the repo root; appending it
# to sys.path lets `import graphgpt` resolve.
curPath = os.path.abspath(os.path.dirname(__file__))
rootPath = os.path.split(os.path.split(curPath)[0])[0]
print(curPath, rootPath)
sys.path.append(rootPath)

We will look into what is actually causing this problem later.
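The two nested `os.path.split` calls above simply climb two directory levels from the script to the repo root. The same computation can be sketched with `pathlib`; the path below is only an illustration, not necessarily your install location:

```python
from pathlib import Path

# train_mem.py sits at <repo>/graphgpt/train/train_mem.py, so the repo root
# (the directory that contains the `graphgpt` package) is two levels above
# the script's own directory.
script = Path("/root/autodl-tmp/GraphGPT/graphgpt/train/train_mem.py")
repo_root = script.parents[2]
print(repo_root)
```

In train_mem.py itself, `Path(__file__).resolve().parents[2]` would play the role of `script.parents[2]`.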

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

I solved the "No module" problem by setting an environment variable (method 2 in the linked post):
https://blog.csdn.net/weixin_48594878/article/details/120461124

What I used:
export PYTHONPATH=$PYTHONPATH:/root/autodl-tmp/GraphGPT/graphgpt
source /etc/profile
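For reference, `import graphgpt` can only resolve when the directory that *contains* the `graphgpt` package, i.e. the repo root, is on the module search path. A self-contained sketch with a throwaway dummy package (all paths here are hypothetical, not part of the repo):

```shell
# Build a dummy layout: /tmp/pp_demo plays the repo root and
# /tmp/pp_demo/graphgpt plays the package.
mkdir -p /tmp/pp_demo/graphgpt
touch /tmp/pp_demo/graphgpt/__init__.py

# Putting the package's *parent* directory on PYTHONPATH makes the import work.
PYTHONPATH="/tmp/pp_demo" python3 -c "import graphgpt; print('import ok')"
```

The same rule applies to the real repo: the path exported should be the checkout root, the directory whose listing shows the `graphgpt/` folder.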

@tjb-tech
Copy link
Collaborator


Thank you for your answer!

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

After resolving the "No module" error, the following error appeared. How should I fix it?

You are using a model of type llama to instantiate a model of type GraphLlama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                                                                                                     | 0/2 [00:00<?, ?it/s]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1283 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1284) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

@tjb-tech
Copy link
Collaborator

--nnodes=1 --nproc_per_node=4

The failure happens while the model is loading. Have you adjusted `--nnodes=1 --nproc_per_node=4` to match your machine?
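As a rule of thumb, `--nproc_per_node` should match the number of GPUs actually visible on the machine. A hedged shell sketch for picking it up automatically (falls back to 1 when `nvidia-smi` is unavailable):

```shell
# Count visible GPUs; fall back to 1 on machines without nvidia-smi.
NGPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
[ "$NGPUS" -gt 0 ] || NGPUS=1
echo "nproc_per_node=$NGPUS"
# e.g.: python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=$NGPUS ...
```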

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

--nnodes=1 --nproc_per_node=4

The failure happens while the model is loading. Have you adjusted `--nnodes=1 --nproc_per_node=4` to match your machine?

OK, I have changed it, but now there is a new error:
AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'

@tjb-tech
Copy link
Collaborator

OK, I have changed it, but now there is a new error: AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'

You can refer to issue #7.

@xxrrnn
Copy link
Author

xxrrnn commented Mar 30, 2024

Traceback (most recent call last):
  File "graphgpt/train/train_mem.py", line 20, in <module>
    train()
  File "/root/autodl-tmp/GraphGPT/graphgpt/train/train_graph.py", line 871, in train
    model_graph_dict = model.get_model().initialize_graph_modules(
  File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 139, in initialize_graph_modules
    clip_graph, args= load_model_pretrained(CLIP, self.config.pretrain_graph_model_path) 
  File "/root/autodl-tmp/GraphGPT/graphgpt/model/GraphLlama.py", line 54, in load_model_pretrained
    assert osp.exists(osp.join(pretrain_model_path, 'config.json')), 'config.json missing'
AssertionError: config.json missing

Sorry, I still have an error here, and I wonder where the problem is. The directory written as pretrain_graph_model_path in the vicuna config does contain a config.json, but it still fails:

{
  "_name_or_path": "vicuna-7b-v1.5-16k",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_sequence_length": 16384,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000, 
  "graph_hidden_size": 128, 
  "pretrain_graph_model_path": "/root/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT/"
}

(graphgpt) root@autodl-container-d05b4ca599-9ff30ed2:~/autodl-tmp/GraphGPT/Arxiv-PubMed-GraphCLIP-GT# ls
clip_gt_arxiv_pub.pkl  config.json
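Since the failing line only checks `osp.exists(osp.join(pretrain_model_path, 'config.json'))`, the check can be reproduced in isolation; printing the exact path the training process receives usually reveals the mismatch (a relative working directory, a typo, or the edited config not being the one actually loaded). A minimal sketch, using a temporary directory in place of the real checkpoint folder:

```python
import os.path as osp
import tempfile

def has_config(pretrain_model_path):
    # Replicates the assertion in load_model_pretrained: the directory
    # must directly contain a config.json file.
    return osp.exists(osp.join(pretrain_model_path, "config.json"))

with tempfile.TemporaryDirectory() as d:
    before = has_config(d)                         # no config.json yet
    open(osp.join(d, "config.json"), "w").close()  # create an empty one
    after = has_config(d)
print(before, after)
```

Inserting a `print(pretrain_model_path)` just before the assertion in GraphLlama.py would show the string it is really testing.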
