Trainer memory usage issue #20630
Unanswered
ys010 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi everyone,
I'm experiencing a significant memory usage difference between a manual training loop and using pl.Trainer in PyTorch Lightning, and I'm hoping someone might have insights.
Here's the situation:
When I run the training loop manually (iterating over the DataLoader, calling training_step, running the backward pass, and stepping the optimizer), memory usage is approximately 1 GB.
However, when I use pl.Trainer with the same LightningModule and DataLoader, memory usage jumps to about 5 GB.
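For reference, here is a simplified sketch of the manual-loop side of the comparison. The model and data below are small placeholders, not my real ones (my actual module is larger), but the pattern is the same:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64)

# Manual loop (roughly 1 GB with my real model):
model = LitModel().to("mps")
optimizer = model.configure_optimizers()
for batch_idx, batch in enumerate(loader):
    batch = [t.to("mps") for t in batch]
    loss = model.training_step(batch, batch_idx)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```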
What I've tried:
I've disabled callbacks (callbacks=[]), logging (logger=False), and validation (val_check_interval=None) in pl.Trainer.
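Concretely, the Trainer run looks roughly like this (a sketch building on the code above; accelerator, devices, and max_epochs are just what I pass to make it run on MPS, the other flags are the ones mentioned above):

```python
# Trainer run (roughly 5 GB with the same model and data):
trainer = pl.Trainer(
    accelerator="mps",
    devices=1,
    max_epochs=1,
    callbacks=[],             # no callbacks
    logger=False,             # no logging
    val_check_interval=None,  # no validation (and I pass no val dataloader)
)
trainer.fit(model, train_dataloaders=loader)
```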
My questions:
Are there any other hidden memory-consuming features in pl.Trainer that I might be missing?
Could there be differences in how pl.Trainer handles the DataLoader or performs other optimizations that contribute to the higher memory usage?
Are there any recommended methods for profiling the memory usage of the trainer itself? (A sketch of what I'm currently measuring is below.)
Could this be a new issue with how MPS is handled on macOS Sequoia?
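For the profiling question, all I've done so far is watch process RSS and the MPS allocator counters around the fit call from the sketch above. A minimal version of that measurement (assuming PyTorch 2.x, where torch.mps exposes these counters, and psutil installed):

```python
import os

import psutil
import torch


def report_memory(tag: str) -> None:
    """Print process RSS and MPS allocator stats at a given point."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    msg = f"[{tag}] RSS: {rss_gb:.2f} GB"
    if torch.backends.mps.is_available():
        allocated_gb = torch.mps.current_allocated_memory() / 1e9
        driver_gb = torch.mps.driver_allocated_memory() / 1e9
        msg += f" | MPS allocated: {allocated_gb:.2f} GB | MPS driver: {driver_gb:.2f} GB"
    print(msg)


report_memory("before fit")
trainer.fit(model, train_dataloaders=loader)
report_memory("after fit")
```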
[Edit]:
When I use the CPU (no MPS), memory usage with the Trainer is as expected.