Multi-node multi-GPU setup #5

Open
233function opened this issue Oct 1, 2024 · 1 comment

Comments

@233function

Hello! I keep running into OOM errors when extrapolating the 72B model on 64 GPUs. Is something misconfigured in multi_node.yaml?
multi_node.yaml
debug: false
deepspeed_config:
  deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
  deepspeed_multinode_launcher: standard
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
num_processes: 128
num_machines: 128
main_training_function: main
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
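
A side note, not from the original thread: in an Accelerate config, num_machines is the number of nodes and num_processes is the total number of GPU processes across all nodes, so 128/128 describes a 128-node job with one process per node rather than a 64-GPU run. Below is a minimal sketch of the same file for 64 GPUs, assuming 8 nodes with 8 GPUs each; the node layout, IP address, and port are placeholder assumptions, not values taken from the issue.

debug: false
deepspeed_config:
  deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
  deepspeed_multinode_launcher: standard
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0              # set per node: 0 on the main node, 1-7 on the others
main_process_ip: 10.0.0.1    # placeholder: reachable address of the main node
main_process_port: 29500     # placeholder: any free port shared by all nodes
main_training_function: main
num_machines: 8              # number of nodes
num_processes: 64            # total GPU processes = 8 nodes x 8 GPUs
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Each node would then run accelerate launch --config_file multi_node.yaml with its own machine_rank and the training script's arguments.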

zero3_offload.json
{ "bf16": { "enabled": "auto" }, "fp16": { "enabled": "auto" }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": 0, "warmup_max_lr": 5e-5, "warmup_num_steps": 0, "warmup_type": "linear" } }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.1 } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false }

@zhiyuanhubj
Owner

Hello, sorry for the late response. We haven't fully tested the YAML for multi-node training yet, but we are currently working on training the LLM with a 512k context window, which requires multi-node training. We will release the script within two weeks.
