No effect from InitProcessGroupKwargs timeout #1403
Comments
Actually, @Randl, when are you doing this in your code?
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
And what is your full code? (I still think it may be a TRL issue, but I need that to be 100% sure.)
I may have found the solution. @Randl can you try again (I know it'll take a while to run), installing transformers via Finally narrowed it down.
I'll update you when I run it.
@Randl were you able to try it out? 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.
Wondering if this was addressed?
@thepowerfuldeez Wondering how this was addressed?
Followup from huggingface/accelerate#2236 (comment)
cc @muellerzr
I'll copy the main text from there; there are some more details in the discussion.
System Info
Reproduction
and run the training
3. Get crash due to timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d
Note that the timeout is still 1800 seconds.
(see also huggingface/alignment-handbook#59)
Expected behavior
The timeout is increased, and there is no crash.
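For reference, the configuration discussed in this thread can be sketched roughly as below. This is a minimal sketch, not the reporter's actual script: the helper names `extended_timeout` and `make_accelerator` are hypothetical and only for illustration, and the `accelerate` import is done lazily inside the function on the assumption that the library is installed when it is called.

```python
from datetime import timedelta

# 1800 seconds is the timeout value reported in the crash above.
DEFAULT_TIMEOUT_S = 1800


def extended_timeout(factor: int = 6) -> timedelta:
    """Hypothetical helper: return factor * 1800 seconds as a timedelta."""
    return timedelta(seconds=factor * DEFAULT_TIMEOUT_S)


def make_accelerator(timeout: timedelta):
    """Hypothetical helper: build an Accelerator with a custom process-group timeout.

    Assumes `accelerate` is installed; imported here so the helper above
    can be used without it.
    """
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    # Same construction as the one-liner quoted in the comments.
    return Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timeout)])
```

With the factor of 6 used in the thread, `extended_timeout()` is a timedelta of 3 hours, so the expectation is a 3-hour limit instead of the 1800 seconds seen in the crash. Calling `make_accelerator(extended_timeout())` would then mirror the setup in question.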