Multi-node multi-GPU setup #5

Open
233function opened this issue Oct 1, 2024 · 1 comment

Comments

@233function

Hello! I keep running into OOM errors when extrapolating the 72B model on 64 GPUs. Is something misconfigured in multi_node.yaml?
multi_node.yaml
debug: false
deepspeed_config:
  deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
  deepspeed_multinode_launcher: standard
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
num_processes: 128
num_machines: 128
main_training_function: main
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
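
A side note, not from the original thread: in an Accelerate config, num_machines is the number of nodes and num_processes is the total number of GPU processes across all nodes, so 128/128 describes a 128-node job with one process per node rather than a 64-GPU run. Below is a minimal sketch of the same file for 64 GPUs, assuming 8 nodes with 8 GPUs each; the node layout, IP address, and port are placeholder assumptions, not values taken from the issue.

debug: false
deepspeed_config:
  deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
  deepspeed_multinode_launcher: standard
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0              # set per node: 0 on the main node, 1-7 on the others
main_process_ip: 10.0.0.1    # placeholder: reachable address of the main node
main_process_port: 29500     # placeholder: any free port shared by all nodes
main_training_function: main
num_machines: 8              # number of nodes
num_processes: 64            # total GPU processes = 8 nodes x 8 GPUs
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Each node would then run accelerate launch --config_file multi_node.yaml with its own machine_rank and the training script's arguments.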

zero3_offload.json
{ "bf16": { "enabled": "auto" }, "fp16": { "enabled": "auto" }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": 0, "warmup_max_lr": 5e-5, "warmup_num_steps": 0, "warmup_type": "linear" } }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.1 } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false }

@zhiyuanhubj
Owner

Hello, sorry for the late response. We haven't fully tested the YAML for multi-node training yet, but we are currently working on training the LLM with a 512k context window, which requires multi-node training. We will release the script within two weeks.
