NotImplementedError: Cannot copy out of meta tensor; no data! with Multi-node training #26971
Comments
Relevant: #26631 @pacman100 |
Hello @ari9dam, the PR you tagged above should resolve this issue. Please recreate the FSDP config via |
Thank you, that solved it. I have one more question, @pacman100: should I pass a torch dtype when loading the model? I'm using bf16 in the accelerate config, and I get this warning: "You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour" |
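For the dtype question above, here is a minimal sketch of what passing a dtype at load time could look like. This is not confirmed by the maintainers in this thread. The helper names `dtype_name_for` and `load_model` are hypothetical, and the mapping assumes the accelerate `mixed_precision` values `bf16`, `fp16` and `no`. How Flash Attention itself is enabled depends on the installed transformers version and is left out here.

```python
# Hypothetical helpers: pick a torch dtype that matches the accelerate
# mixed_precision setting and pass it to from_pretrained via torch_dtype.

# Assumed mapping from accelerate mixed_precision strings to torch dtype names.
_PRECISION_TO_DTYPE = {
    "bf16": "bfloat16",
    "fp16": "float16",
    "no": "float32",
}


def dtype_name_for(mixed_precision: str) -> str:
    """Return the torch dtype attribute name for a mixed_precision string."""
    try:
        return _PRECISION_TO_DTYPE[mixed_precision]
    except KeyError:
        raise ValueError(f"unsupported mixed_precision: {mixed_precision!r}")


def load_model(model_name: str, mixed_precision: str = "bf16"):
    """Load a causal LM with an explicit dtype (sketch, not the thread's fix)."""
    # Imported lazily so the mapping helper can be used without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    torch_dtype = getattr(torch, dtype_name_for(mixed_precision))
    return AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype)
```

With `bf16` in the accelerate config, `load_model("<model name>", "bf16")` would request `torch.bfloat16` weights, which is the kind of explicit dtype the warning asks for.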
I also had this issue and fixed it by changing
to
here
(This is for 8 GPUs per node; for 4 GPUs per node it should be 4, etc.) |
System Info
Who can help?
@muellerz @pacman100
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
The training job works on A100 GPUs with 1 node and 8 GPUs. It fails when the job uses more than 1 node, with the error:
Expected behavior
No error