Add MultiTensorApply to calculate L2-Norm in DistributedFusedLamb optimizer #39900
+355
−116
PR types
Performance optimization
PR changes
OPs
Describe
Use MultiTensorApply to improve the L2-Norm calculation in the DistributedFusedLamb optimizer. Before this change, DistributedFusedLamb called cub::DeviceSegmentedReduce to compute the Parameter L2-Norm and the Trust Ratio Div L2-Norm. This PR switches to the MultiTensorApply approach and tunes parameters such as the maximum number of tensors and the maximum number of chunks per kernel launch. Benchmark: BERT Large (batch_size = 56, max_seq_len = 512, pure_fp16):
- Paddle Baseline (using cub::DeviceSegmentedReduce).
- MaxTensorNumPerLaunch=110, MaxChunkNumPerLaunch=320 (matching the NV configuration): 0.7%-2.2% faster than NV, essentially on par; about a 95% improvement over the Paddle Baseline.
- MaxTensorNumPerLaunch=50, MaxChunkNumPerLaunch=680: about 14% faster than NV; about a 96% improvement over the Paddle Baseline.