[PIR] Fix memory leak when training models containing if control flow #64130

Closed
wants to merge 5 commits into from

Conversation

huangjiyi
Member

@huangjiyi huangjiyi commented May 8, 2024

PR Category

Others

PR Types

Others

Description

  • Fixes a GPU memory leak that occurs when training the PaddleDetection model ppyoloe_plus_crn_l_80e_coco under PIR

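Problem description:

When ppyoloe_plus_crn_l_80e_coco is trained under PIR, memory allocated keeps growing as training proceeds until the run fails with an out of memory error.

PIR training log before the fix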
Epoch: [0] [   0/6250] mem_allocated: 1021 MB mem_reserved: 3496 MB
Epoch: [0] [   1/6250] mem_allocated: 1125 MB mem_reserved: 3496 MB
Epoch: [0] [   2/6250] mem_allocated: 1229 MB mem_reserved: 3496 MB
Epoch: [0] [   3/6250] mem_allocated: 1698 MB mem_reserved: 8608 MB
Epoch: [0] [   4/6250] mem_allocated: 1920 MB mem_reserved: 8608 MB
Epoch: [0] [   5/6250] mem_allocated: 2059 MB mem_reserved: 8608 MB
Epoch: [0] [   6/6250] mem_allocated: 2492 MB mem_reserved: 8822 MB
Epoch: [0] [   7/6250] mem_allocated: 2596 MB mem_reserved: 8822 MB
Epoch: [0] [   8/6250] mem_allocated: 2716 MB mem_reserved: 8822 MB
Epoch: [0] [   9/6250] mem_allocated: 2939 MB mem_reserved: 8822 MB
Epoch: [0] [  10/6250] mem_allocated: 3044 MB mem_reserved: 8822 MB
...
Epoch: [0] [  19/6250] mem_allocated: 5268 MB mem_reserved: 10415 MB
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 27.000000MB memory on GPU 0, 10.906372GB memory has been allocated and available memory is only 4.062500MB.
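
To put a number on the growth in the log above, the sketch below parses three of the pre-fix lines (iterations 0, 10 and 19) and computes the average increase in mem_allocated per iteration. The helper names `parse_allocated` and `growth_per_iteration` are made up for this illustration and are not part of Paddle or PaddleDetection.

```python
import re

# Three lines copied from the pre-fix PIR log above.
LOG = """\
Epoch: [0] [   0/6250] mem_allocated: 1021 MB mem_reserved: 3496 MB
Epoch: [0] [  10/6250] mem_allocated: 3044 MB mem_reserved: 8822 MB
Epoch: [0] [  19/6250] mem_allocated: 5268 MB mem_reserved: 10415 MB
"""

# Captures the iteration index and the mem_allocated value in MB.
PATTERN = re.compile(r"\[\s*(\d+)/\d+\]\s+mem_allocated:\s+(\d+) MB")


def parse_allocated(log):
    """Return (iteration, mem_allocated in MB) pairs found in the log text."""
    return [(int(m.group(1)), int(m.group(2))) for m in PATTERN.finditer(log)]


def growth_per_iteration(samples):
    """Average MB gained per iteration between the first and last sample."""
    (first_it, first_mb), (last_it, last_mb) = samples[0], samples[-1]
    return (last_mb - first_mb) / (last_it - first_it)


samples = parse_allocated(LOG)
print(f"{growth_per_iteration(samples):.1f} MB per iteration")  # 223.5 MB per iteration
```

An average gain of roughly 220 MB per iteration with no plateau is the signature of a leak rather than of allocator warm-up, which is what the flat 825 MB in the other two logs confirms.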
PIR training log after the fix
Epoch: [0] [   0/6250] mem_allocated: 825 MB mem_reserved: 3497 MB
Epoch: [0] [   1/6250] mem_allocated: 825 MB mem_reserved: 3497 MB
Epoch: [0] [   2/6250] mem_allocated: 825 MB mem_reserved: 3497 MB
Epoch: [0] [   3/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [   4/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [   5/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [   6/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [   7/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [   8/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [   9/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Epoch: [0] [  10/6250] mem_allocated: 825 MB mem_reserved: 8529 MB
Old IR training log
Epoch: [0] [   0/6250] mem_allocated: 825 MB mem_reserved: 3480 MB
Epoch: [0] [   1/6250] mem_allocated: 825 MB mem_reserved: 3480 MB
Epoch: [0] [   2/6250] mem_allocated: 825 MB mem_reserved: 3480 MB
Epoch: [0] [   3/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [   4/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [   5/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [   6/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [   7/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [   8/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [   9/6250] mem_allocated: 825 MB mem_reserved: 8517 MB
Epoch: [0] [  10/6250] mem_allocated: 825 MB mem_reserved: 8517 MB

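Root cause

  1. The results computed inside the true or false branch of an if_instruction (inner outputs) are copied, by sharing memory, into other Variables that serve as the outputs of the if_instruction (if outputs). The inner outputs are never GC'd, so when the if outputs are GC'd the inner outputs still hold the same memory and it is never actually released.
  2. In ppyoloe_plus_crn_l_80e_coco, every iteration defines new inner outputs for the same if control flow in the Program, so the amount of unreclaimed GPU memory keeps growing during training until an OOM error is raised.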
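
The two points above can be reproduced with a small reference-counting model. This is only a sketch of the mechanism, not Paddle's executor: `ToyScope`, `train`, the variable names and the 100 MB buffer size are all assumptions made for illustration.

```python
import itertools


class ToyScope:
    """Toy variable scope: a buffer is freed only when its last holder is GC'd."""

    def __init__(self):
        self._ids = itertools.count()
        self.vars = {}     # variable name -> buffer id
        self.holders = {}  # live buffer id -> names of the variables sharing it
        self.sizes = {}    # live buffer id -> size in MB

    def create(self, name, size_mb):
        """Allocate a fresh buffer owned by a new variable."""
        buf = next(self._ids)
        self.vars[name] = buf
        self.holders[buf] = {name}
        self.sizes[buf] = size_mb

    def share(self, src, dst):
        """Make dst hold the same buffer as src, without copying the data."""
        buf = self.vars[src]
        self.vars[dst] = buf
        self.holders[buf].add(dst)

    def gc(self, name):
        """Drop a variable; free its buffer if nobody else holds it."""
        buf = self.vars.pop(name)
        self.holders[buf].discard(name)
        if not self.holders[buf]:
            del self.holders[buf], self.sizes[buf]

    def allocated_mb(self):
        return sum(self.sizes.values())


def train(steps, inner_output_mb=100):
    """Run the same if instruction repeatedly and record allocated memory."""
    scope = ToyScope()
    history = []
    for step in range(steps):
        inner = f"inner_out_{step}"           # a new inner output every iteration
        scope.create(inner, inner_output_mb)  # result computed inside the branch
        scope.share(inner, "if_out")          # if output shares the same memory
        scope.gc("if_out")                    # only the if output is ever GC'd
        history.append(scope.allocated_mb())
    return history


print(train(5))  # [100, 200, 300, 400, 500]
```

Each iteration leaves one inner output holding its buffer, so allocated memory grows linearly, matching the shape of the pre-fix log.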

Solution

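  • GC the inner outputs after the if_instruction has finished. Inner outputs must not be GC'd inside the if op's sub-block: nothing references them there yet, so they would be GC'd before being copied to the if outputs, which causes an error.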
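
Why the timing of the GC matters can be shown with a toy reference-counted store. This is a sketch under stated assumptions, not the code in this PR: `SharedBuffers`, `run_if` and the `gc_in_sub_block` flag are hypothetical names chosen for the example.

```python
class SharedBuffers:
    """Toy reference-counted storage; names here are not Paddle's API."""

    def __init__(self):
        self.var_to_buf = {}  # variable name -> buffer object
        self.refcount = {}    # live buffer object -> number of holders

    def create(self, name):
        buf = object()
        self.var_to_buf[name] = buf
        self.refcount[buf] = 1

    def share(self, src, dst):
        if src not in self.var_to_buf:
            raise RuntimeError(f"{src} was GC'd before it could be shared")
        buf = self.var_to_buf[src]
        self.var_to_buf[dst] = buf
        self.refcount[buf] += 1

    def gc(self, name):
        buf = self.var_to_buf.pop(name)
        self.refcount[buf] -= 1
        if self.refcount[buf] == 0:
            del self.refcount[buf]  # last holder gone: the buffer is freed

    def live_buffers(self):
        return len(self.refcount)


def run_if(store, step, gc_in_sub_block):
    inner = f"inner_out_{step}"
    store.create(inner)           # computed inside the true/false branch
    if gc_in_sub_block:
        store.gc(inner)           # too early: nothing references it yet
    store.share(inner, "if_out")  # hand the result to the if output
    if not gc_in_sub_block:
        store.gc(inner)           # GC once the if instruction is done


store = SharedBuffers()
for step in range(5):
    run_if(store, step, gc_in_sub_block=False)
    store.gc("if_out")            # normal GC of the if output
print(store.live_buffers())       # 0

try:
    run_if(SharedBuffers(), 0, gc_in_sub_block=True)
except RuntimeError as err:
    print(err)                    # inner_out_0 was GC'd before it could be shared
```

With the GC placed after the share, every buffer is released once the if output is collected; placed inside the sub-block, the inner output disappears before the share and the instruction fails.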


paddle-bot bot commented May 8, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label May 8, 2024
@huangjiyi huangjiyi changed the title [PIR] Fix out of memory error when training models containing if control flow [PIR] Fix memory leak when training models containing if control flow May 10, 2024
@huangjiyi huangjiyi closed this May 12, 2024