Add MultiTensorApply to calculate L2-Norm in DistributedFusedLamb optimizer #39900
+355
−116
PR types
Performance optimization
PR changes
OPs
Describe
Use MultiTensorApply to improve the L2-Norm calculation in the DistributedFusedLamb optimizer. Before this change, DistributedFusedLamb called cub::DeviceSegmentedReduce to compute the Parameter L2-Norm and the Trust Ratio Div L2-Norm. This PR switches to the MultiTensorApply approach and tunes parameters such as the maximum number of tensors and the maximum number of chunks per kernel launch. Benchmark: BERT Large (batch_size = 56, max_seq_len = 512, pure_fp16):
- Paddle Baseline (using cub::DeviceSegmentedReduce).
- MaxTensorNumPerLaunch=110, MaxChunkNumPerLaunch=320 (matching the NV configuration): 0.7%-2.2% faster than NV, essentially on par; about a 95% improvement over the Paddle Baseline.
- MaxTensorNumPerLaunch=50, MaxChunkNumPerLaunch=680: about 14% faster than NV; about a 96% improvement over the Paddle Baseline.