Interleave IO operations with kernel calculation #23

felipeZ · 2020-02-13T14:01:36Z

Currently, all IO and kernel operations happen in a single stream. Performance would increase significantly if we interleaved them across multiple streams.

benvanwerkhoven · 2020-02-17T08:05:45Z

I've been thinking about this. To overlap these operations you do indeed need streams, so that computation in one stream can overlap with data transfers in other streams. It might be enough to use multiple threads, one for each stream. However, I know that in a single-threaded application the host memory has to be allocated as page-locked (pinned) memory, so that the cudaMemcpy operations can be performed by DMA. That is the only way to make the async API calls truly asynchronous with respect to the host.
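To make the single-threaded case concrete, here is a minimal sketch, not the project's actual code. It splits a buffer into chunks and gives each chunk its own stream, so that the copy of one chunk can overlap with the kernel of another. The `scale` kernel, the `run_pipeline` function, and the chunk sizes are all hypothetical stand-ins. The host buffer is allocated with `cudaMallocHost` so that `cudaMemcpyAsync` can return without blocking the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(err_));  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Hypothetical stand-in kernel: doubles every element of a chunk.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes n_chunks chunks of n_per_chunk floats, one stream per chunk.
// Returns true if every element came back doubled.
bool run_pipeline(int n_per_chunk, int n_chunks) {
    const size_t chunk_bytes = static_cast<size_t>(n_per_chunk) * sizeof(float);
    const size_t total = static_cast<size_t>(n_per_chunk) * n_chunks;

    float* h = nullptr;  // page-locked host buffer, eligible for DMA
    float* d = nullptr;  // device buffer
    CHECK(cudaMallocHost(&h, total * sizeof(float)));
    CHECK(cudaMalloc(&d, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(n_chunks);
    for (auto& s : streams) CHECK(cudaStreamCreate(&s));

    // Work within one stream runs in order; work in different streams may overlap.
    for (int c = 0; c < n_chunks; ++c) {
        float* hc = h + static_cast<size_t>(c) * n_per_chunk;
        float* dc = d + static_cast<size_t>(c) * n_per_chunk;
        CHECK(cudaMemcpyAsync(dc, hc, chunk_bytes, cudaMemcpyHostToDevice, streams[c]));
        scale<<<(n_per_chunk + 255) / 256, 256, 0, streams[c]>>>(dc, n_per_chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hc, dc, chunk_bytes, cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    bool ok = true;
    for (size_t i = 0; i < total; ++i) ok = ok && (h[i] == 2.0f);

    for (auto& s : streams) CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok;
}
```

Calling `run_pipeline(1 << 20, 4)` would push four chunks through four streams. Whether the copies and kernels actually overlap depends on the device and is best confirmed in a profiler timeline rather than assumed.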

Perhaps you won't need it, because you will be using multiple threads, in which case it might not hurt performance when the CPU thread blocks on cudaMemcpyAsync. But if you don't see any overlap between copies in one stream and copies in the opposite direction or computations in other streams, then this could be the cause. Also, I expect the achieved bandwidth of cudaMemcpy to increase significantly if you allocate host memory that is page-locked and aligned. Depending on how Eigen is coded, really achieving this might require modifying Eigen; I haven't checked that.
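One option that might avoid changing how the host memory is allocated is to page-lock an existing buffer in place with `cudaHostRegister`. The sketch below uses a `std::vector<float>` as a stand-in for memory owned by another library. Whether this works for Eigen storage is an untested assumption, and the function name `roundtrip_pinned_in_place` is hypothetical.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(err_));  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Pins an existing host buffer, copies it to the device and back through a
// stream, then unpins it. Returns true if the data round-trips unchanged.
bool roundtrip_pinned_in_place(std::vector<float>& buf) {
    const size_t bytes = buf.size() * sizeof(float);
    const std::vector<float> expected = buf;  // reference copy for the check

    // Page-lock the memory the vector already owns; no reallocation happens.
    CHECK(cudaHostRegister(buf.data(), bytes, cudaHostRegisterDefault));

    float* d = nullptr;
    cudaStream_t stream;
    CHECK(cudaMalloc(&d, bytes));
    CHECK(cudaStreamCreate(&stream));

    CHECK(cudaMemcpyAsync(d, buf.data(), bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Clear the host side so the comparison below really tests the copy back.
    std::fill(buf.begin(), buf.end(), 0.0f);

    CHECK(cudaMemcpyAsync(buf.data(), d, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    CHECK(cudaHostUnregister(buf.data()));  // unpin before the vector is freed

    return buf == expected;
}
```

For example, `std::vector<float> v(1 << 20, 3.0f); roundtrip_pinned_in_place(v);` pins, copies, and unpins one buffer. Registering memory has a cost of its own, so this probably only pays off for buffers that are reused across many transfers.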

felipeZ added the enhancement New feature or request label Feb 13, 2020

felipeZ self-assigned this Feb 13, 2020
