Update auto wrap policy and remove duplicate load in trainer.fit #175

MattIrv · 2024-11-04T17:06:15Z

This updates the benchmark to use the FSDP wrap policy defined in https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html#identify-large-layers.
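For reference, a minimal sketch of the pattern from the linked Lightning docs section, where the wrap policy is a set of layer classes passed to `FSDPStrategy`. The helper name `build_fsdp_strategy` is hypothetical, and the two Transformer layer classes are the ones from the docs example. They are an assumption here, not necessarily the layers this benchmark's model uses.

```python
def build_fsdp_strategy():
    """Return an FSDPStrategy that wraps each listed layer type in its own FSDP unit."""
    # Imported lazily so this module can still be imported on a machine
    # that does not have torch or lightning installed.
    import torch.nn as nn
    from lightning.pytorch.strategies import FSDPStrategy

    # Assumption: these are the layer classes from the Lightning docs example.
    # Substitute the large, repeated layer classes of the actual model.
    policy = {nn.TransformerEncoderLayer, nn.TransformerDecoderLayer}
    return FSDPStrategy(auto_wrap_policy=policy)
```

The returned strategy would then be passed to the trainer, for example `Trainer(accelerator="cuda", devices=4, strategy=build_fsdp_strategy())`.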

I tested this on a machine with 4 GPUs. Before this change, I got a CUDA OOM at a certain number of parameters for my model; after the change, I do not.

I also removed the checkpoint restore path from the call to trainer.fit during load, since it was making the trainer restore from that checkpoint an extra time, which is time-consuming. In addition, the trainer, strategy, and model are now explicitly deleted before being reset, in case that helps prevent memory leaks.
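The delete-before-reset idea can be sketched in plain Python. This is an illustration only: `StubModel`, `build_model`, and `run_save_then_load` are hypothetical stand-ins, not code from `train.py`, and the sketch tracks a single object rather than the real trainer, strategy, and model.

```python
import gc
import weakref


class StubModel:
    """Hypothetical stand-in for the benchmark's model."""

    def __init__(self, num_params: int) -> None:
        self.weights = [0.0] * num_params


def build_model(num_params: int) -> StubModel:
    return StubModel(num_params)


def run_save_then_load(num_params: int = 1000) -> bool:
    model = build_model(num_params)
    # A weak reference lets us observe when the old model is actually freed.
    old_ref = weakref.ref(model)
    # ... the save phase would run here ...

    # Drop the only strong reference before building the replacement, so the
    # old object can be reclaimed instead of coexisting with the new one.
    del model
    gc.collect()
    freed_before_rebuild = old_ref() is None

    model = build_model(num_params)
    # ... the load phase would run here ...
    return freed_before_rebuild
```

Deleting a name only helps when nothing else still references the object, which is why the change is described as something that may help rather than a guaranteed fix.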

dataflux_pytorch/benchmark/checkpointing/multinode/train.py

Update auto wrap policy and remove duplicate load in trainer.fit

e855919

MattIrv requested review from Yash9060, jdnurme and abhibyreddi November 4, 2024 17:06

MattIrv requested a review from a team as a code owner November 4, 2024 17:06

abhibyreddi approved these changes Nov 4, 2024

jdnurme approved these changes Nov 4, 2024

dataflux_pytorch/benchmark/checkpointing/multinode/train.py (review thread resolved)

dataflux_pytorch/benchmark/checkpointing/multinode/train.py (review thread resolved)

Merge branch 'main' of github.com:GoogleCloudPlatform/dataflux-pytorc… …h into mirvine/fsdp5

8e26d82

MattIrv enabled auto-merge (squash) November 4, 2024 19:11

Yash9060 approved these changes Nov 4, 2024

MattIrv merged commit 7195b58 into main Nov 4, 2024
5 checks passed

MattIrv deleted the mirvine/fsdp5 branch November 4, 2024 19:22

abhibyreddi pushed a commit that referenced this pull request Nov 8, 2024

Update auto wrap policy and remove duplicate load in trainer.fit (#175)

bc09a0d

abhibyreddi pushed a commit that referenced this pull request Nov 8, 2024

Update auto wrap policy and remove duplicate load in trainer.fit (#175)

18acb93

Yash9060 pushed a commit that referenced this pull request Nov 8, 2024

Update auto wrap policy and remove duplicate load in trainer.fit (#175)

372356e
