Hello, sorry for the late response. We haven't fully tested the YAML for multi-node training yet. However, we are currently training an LLM with a 512k context window, which requires multi-node training, and we will release the script within two weeks.
Hello! I keep running into OOM when doing context-length extrapolation with a 72B model on 64 GPUs. Is something misconfigured in multi_node.yaml?
multi_node.yaml
```yaml
debug: false
deepspeed_config:
  deepspeed_config_file: utils/accelerate_configs/zero3_offload.json
  deepspeed_multinode_launcher: standard
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
num_processes: 128
num_machines: 128
main_training_function: main
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
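For context on the two topology fields: in accelerate's config, `num_machines` is the number of nodes and `num_processes` is the total number of processes across all nodes (one per GPU), so `128`/`128` describes 128 single-GPU machines rather than a 64-GPU job. A minimal sketch of what those fields would look like for 64 GPUs, assuming 8 nodes with 8 GPUs each (the node count is an assumption; adjust to your cluster):

```yaml
# Sketch only: assumes 8 nodes x 8 GPUs = 64 GPUs total.
num_machines: 8     # number of nodes participating in the job
num_processes: 64   # total processes across all nodes, one per GPU
```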
zero3_offload.json
{ "bf16": { "enabled": "auto" }, "fp16": { "enabled": "auto" }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": 0, "warmup_max_lr": 5e-5, "warmup_num_steps": 0, "warmup_type": "linear" } }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.1 } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false }