No effect from InitProcessGroupKwargs timeout #2236
Comments
Hi @Randl, thanks for asking. This is normal for the NCCL backend. I invite you to read the description of the
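For context, a minimal sketch of how the timeout is typically wired up, using only the public `accelerate` API. The environment variable below is PyTorch's documented switch for making NCCL collectives actually respect the timeout; whether it applies to this particular setup is an assumption:

```python
import os
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# With the NCCL backend, the process-group timeout is generally only
# enforced when async error handling / blocking wait is enabled; without
# it, a hung collective may never raise. (Usually set in the shell before
# launch; shown here in-script only for illustration.)
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

# Forwarded to torch.distributed.init_process_group(timeout=...).
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```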
@Randl can you rerun your code building accelerate from
@muellerzr I've also tried to set
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

I don't think it was addressed?

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

still not resolved?

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

...
Looking into this again this week, sorry for the delay |
I'm definitely seeing an effect here. Note that the timeout only applies in situations where

Minimal test:

```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from torch import tensor

kwargs = [InitProcessGroupKwargs(timeout=timedelta(seconds=4))]
accelerator = Accelerator(kwargs_handlers=kwargs)

if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)  # stall the main process past the 4 s timeout
else:
    t = tensor(0).to(accelerator.device)

accelerator.wait_for_everyone()  # other ranks block here until the timeout fires
print("All called!")
```

This will lead to a failure; change that 4 to a 10 and it'll pass.
Can you give us more of your trace? It doesn't hint at where it's failing.

I don't have access to the machine currently. I'll update you when I can run stuff on it.

Thanks, that's helpful.
I see the exact issue; it's due to
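Since the comment above is cut off, the following is only a guess at the failure mode, not the confirmed cause: if the default process group is initialized before the kwargs handler is seen, the custom timeout never reaches `init_process_group`.

```python
# Hypothetical scenario (an assumption, not the confirmed root cause):
# torch.distributed is initialized before Accelerator receives the
# kwargs handler, so the custom timeout is silently ignored.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.state import PartialState

PartialState()  # initializes the process group with the default 1800 s timeout
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))]
)  # too late: the group already exists, so the 4 s timeout has no effect
```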
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
… and run the training
3. Get crash due to timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d

Note that the timeout is still 1800 seconds, the torch.distributed default (see also huggingface/alignment-handbook#59).
Expected behavior
The timeout is increased, and there is no crash.