From b3828abd211c2d067d32ed9cf8826de2379340cb Mon Sep 17 00:00:00 2001
From: thunder95 <290844930@qq.com>
Date: Mon, 1 Aug 2022 11:08:53 +0800
Subject: [PATCH] =?UTF-8?q?=E3=80=90Hackathon=20No.31=E3=80=91=E4=B8=BA=20?=
 =?UTF-8?q?Paddle=20=E4=BC=98=E5=8C=96=20dist=20op=20=E5=9C=A8=20GPU=20?=
 =?UTF-8?q?=E4=B8=8A=E7=9A=84=E8=AE=A1=E7=AE=97=E6=80=A7=E8=83=BD=20(#187)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../APIs/20220708_api_design_for_bucketize.md | 133 --------
 .../APIs/20220708_api_design_for_trapezoid.md | 285 ------------------
 .../OPs-Perf/20220706_Selu_op_optimization.md |  83 -----
 .../OPs-Perf/20220714_dist_op_optimization.md |  89 ++++++
 4 files changed, 89 insertions(+), 501 deletions(-)
 delete mode 100644 rfcs/APIs/20220708_api_design_for_bucketize.md
 delete mode 100644 rfcs/APIs/20220708_api_design_for_trapezoid.md
 delete mode 100644 rfcs/OPs-Perf/20220706_Selu_op_optimization.md
 create mode 100644 rfcs/OPs-Perf/20220714_dist_op_optimization.md
diff --git a/rfcs/APIs/20220708_api_design_for_bucketize.md b/rfcs/APIs/20220708_api_design_for_bucketize.md
deleted file mode 100644
index b73d855f3..000000000
--- a/rfcs/APIs/20220708_api_design_for_bucketize.md
+++ /dev/null
@@ -1,133 +0,0 @@
-# paddle.Tensor.bucketize 设计文档
-
-|API名称 | paddle.bucketize | 
-|---|---|
-|提交作者<input type="checkbox" class="rowselector hidden"> | 李芳钰 | 
-|提交时间<input type="checkbox" class="rowselector hidden"> | 2022-07-8 | 
-|版本号 | V1.0 | 
-|依赖飞桨版本<input type="checkbox" class="rowselector hidden"> | develop | 
-|文件名 | 20220708_api_design_for_bucketize.md<br> | 
-
-# 一、概述
-
-## 1、相关背景
-为了提升飞桨API丰富度，Paddle需要扩充API`paddle.bucketize`的功能。
-## 2、功能目标
-增加API`paddle.bucketize`，实现根据边界返回输入值的桶索引。
-## 3、意义
-飞桨支持`paddle.bucketize`的API功能。
-
-# 二、飞桨现状
-目前paddle可直接由`paddle.searchsorted`API，直接实现该功能。
-
-paddle已经实现了[paddle.searchsorted](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/search.py#L910)API,所以只需要调用该API既可以实现该功能。
-
-需要注意的是`paddle.bucketize`处理的sorted_sequence特殊要求为1-D Tensor。
-
-# 三、业内方案调研
-## Numpy 
-### 实现方法
-以现有numpy python API组合实现，[代码位置](https://github.com/numpy/numpy/blob/v1.23.0/numpy/lib/function_base.py#L5447-L5555).
-其中核心代码为：
-```Python
-    x = _nx.asarray(x)
-    bins = _nx.asarray(bins)
-
-    # here for compatibility, searchsorted below is happy to take this
-    if np.issubdtype(x.dtype, _nx.complexfloating):
-        raise TypeError("x may not be complex")
-
-    mono = _monotonicity(bins)
-    if mono == 0:
-        raise ValueError("bins must be monotonically increasing or decreasing")
-
-    # this is backwards because the arguments below are swapped
-    side = 'left' if right else 'right'
-    if mono == -1:
-        # reverse the bins, and invert the results
-        return len(bins) - _nx.searchsorted(bins[::-1], x, side=side)
-    else:
-        return _nx.searchsorted(bins, x, side=side)
-```
-整体逻辑为：
-
-- 通过`_monotonicity`判断箱子是否单调递增或者递减。
-- 然后根据`mono`和参数`right`决定是否需要反转箱子。
-- 最后也是通过`searchsorted`直接返回输入对应的箱子索引。
-
-## Pytorch
-Pytorch中有API`torch.bucketize(input, boundaries, *, out_int32=False, right=False, out=None) → Tensor`。在pytorch中，介绍为：
-```
-Returns the indices of the buckets to which each value in the input belongs, where the boundaries of the buckets are set by boundaries. Return a new tensor with the same size as input. If right is False (default), then the left boundary is closed. 
-```
-
-### 实现方法
-在实现方法上，Pytorch的整体逻辑与Numpy基本一致，[代码位置](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Bucketization.cpp)。其中核心代码为：
-```c++
-Tensor& bucketize_out_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right, Tensor& result) {
-  TORCH_CHECK(boundaries.dim() == 1, "boundaries tensor must be 1 dimension, but got dim(", boundaries.dim(), ")");
-  at::native::searchsorted_out_cpu(boundaries, self, out_int32, right, nullopt, nullopt, result);
-  return result;
-}
-```
-整体逻辑为：
-- 检查输入参数`boundaries`。
-- 然后直接利用`searchsorted_out_cpu`返回结果。
-
-## Tensorflow
-Tensorflow`tft.bucketize(
-    x: common_types.ConsistentTensorType,
-    num_buckets: int,
-    epsilon: Optional[float] = None,
-    weights: Optional[tf.Tensor] = None,
-    elementwise: bool = False,
-    name: Optional[str] = None
-) -> common_types.ConsistentTensorType`。在Tensorflow中，介绍为：
-Returns a bucketized column, with a bucket index assigned to each input.
-
-### 实现方法
-在实现方法上，Tensorflow的API参数设计于Numpy和Pytorch都不大相同，[代码位置](https://github.com/tensorflow/transform/blob/v1.9.0/tensorflow_transform/mappers.py#L1690-L1770)。这里就不具体分析其核心代码了，因为和我们想要实现的功能有很大的差距。
-
-
-# 四、对比分析
-- 使用场景与功能：Pytorch会比Numpy更贴和我们想要实现的功能，因为Pytorch也是仅针对1-D Tensor，而Numpy支持多维。
-
-# 五、方案设计
-## 命名与参数设计
-API设计为`paddle.bucketize(x, sorted_sequence, out_int32=False, right=False, name=None)`
-命名与参数顺序为：形参名`input`->`x`,  与paddle其他API保持一致性，不影响实际功能使用。
-参数类型中，`x`为N-D Tensor，`sorted_sequence`为1-D Tensor。
-
-## 底层OP设计
-使用已有API组合实现，不再单独设计OP。
-
-## API实现方案
-主要按下列步骤进行实现,实现位置为`paddle/tensor/math.py`与`searchsorted`方法放在一起：
-1. 使用`len(sorted_sequence)`检验参数`sorted_sequence`的维度。
-2. 使用`paddle.searchsorted`得到输入的桶索引。
-
-
-# 六、测试和验收的考量
-测试考虑的case如下：
-
-- 和numpy结果的数值的一致性, `paddle.bucketize`,和`numpy.searchsorted`结果是否一致；
-- 参数`right`为True和False时输出的正确性；
-- `out_int32`为True和False时输出dtype正确性；
-- 未输入`right`时的输出正确性；
-- 未输入`out_int32`时的输出正确性；
-- 错误检查：输入`x`不是Tensor时,能否正确抛出错误；
-- 错误检查：输入`sorted_sequence`不是一维张量时,能否正确抛出错误；
-- 错误检查：未输入`x`和`sorted_sequence`时,能否正确抛出错误；
-
-# 七、可行性分析及规划排期
-
-方案主要依赖现有paddle api组合而成，且依赖的`paddle.searchsorted`已经在 Paddle repo 的 python/paddle/tensor/search.py [目录中](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/search.py#L910)。工期上可以满足在当前版本周期内开发完成。
-
-# 八、影响面
-为独立新增API，对其他模块没有影响
-
-# 名词解释
-无
-# 附件及参考资料
-无
-
diff --git a/rfcs/APIs/20220708_api_design_for_trapezoid.md b/rfcs/APIs/20220708_api_design_for_trapezoid.md
deleted file mode 100644
index e175bc96a..000000000
--- a/rfcs/APIs/20220708_api_design_for_trapezoid.md
+++ /dev/null
@@ -1,285 +0,0 @@
-# paddle.trapezoid 设计文档
-
-| API名称 | trapezoid |
-| --- | --- |
-| 提交作者<input type="checkbox" class="rowselector hidden"> | [kunkun0w0](https://github.com/kunkun0w0) |
-| 提交时间<input type="checkbox" class="rowselector hidden"> | 2022-07-08 |
-| 版本号 | V1.0 |
-| 依赖飞桨版本<input type="checkbox" class="rowselector hidden"> | develop |
-| 文件名 | 20220708_api_design_for_trapezoid.md<br> |
-
-# 一、概述
-
-## 1、相关背景
-
-Paddle需要扩充API：paddle.trapezoid 和 Tensor.trapezoid。
-
-## 2、功能目标
-
-实现 [trapezoid rule](https://en.wikipedia.org/wiki/Trapezoidal_rule) 的算法，支持输入N 维 Tensor，在指定的某一维实现 [trapezoid rule](https://en.wikipedia.org/wiki/Trapezoidal_rule) 算法。
-
-## 3、意义
-
-为 paddle 框架中提供一种通过对函数的左右黎曼和求平均来逼近函数定积分的技术(即trapezoid rule)。
-
-# 二、飞桨现状
-
-飞桨内暂时无相关近似求积分的API。
-
-# 三、业内方案调研
-
-Pytorch 中有相关的函数
-
-```python
-torch.trapezoid(y, x=None, dx=None, dim=- 1)
-```
-
-在 PyTorch 中，介绍为：
-
-> Computes the [trapezoidal rule](https://en.wikipedia.org/wiki/Trapezoidal_rule) along `dim`. By default the spacing between elements is assumed to be 1, but `dx` can be used to specify a different constant spacing, and `x` can be used to specify arbitrary spacing along `dim`.
-> 
-> Assuming `y` is a one-dimensional tensor with elements $y_0​,y_1​,...,y_n$​, the default computation is
-> 
-> $$
-> \sum_{i=1}^{n-1} \frac{1}{2}\left(y_{i}+y_{i-1}\right)
-> $$
-> 
-> When `dx` is specified the computation becomes
-> 
-> $$
-> \sum_{i=1}^{n-1} \frac{\Delta x}{2}\left(y_{i}+y_{i-1}\right)
-> $$
-> 
-> effectively multiplying the result by `dx`. When `x` is specified, assuming `x` is also a one-dimensional tensor with elements $x_0​,x_1​,...,x_n$​, the computation becomes
-> 
-> $$
-> \sum_{i=1}^{n-1} \frac{\left(x_{i}-x_{i-1}\right)}{2}\left(y_{i}+y_{i-1}\right)
-> $$
-> 
-> When `x` and `y` have the same size, the computation is as described above and no broadcasting is needed. The broadcasting behavior of this function is as follows when their sizes are different. For both `x` and `y`, the function computes the difference between consecutive elements along dimension `dim`. This effectively creates two tensors, x_diff and y_diff, that have the same shape as the original tensors except their lengths along the dimension `dim` is reduced by 1. After that, those two tensors are broadcast together to compute final output as part of the trapezoidal rule. See the examples below for details.
-
-PyTorch C++ 代码：
-
-```cpp
-// The estimated integral of a function y of x,
-// sampled at points (y_1, ..., y_n) that are separated by distance (dx_1, ..., dx_{n-1}),
-// is given by the trapezoid rule:
-//
-// \sum_{i=1}^{n-1}  dx_i * (y_i + y_{i+1}) / 2
-//
-// TODO: if we extend TensorIterator to accept 3 inputs,
-// we can probably make this a bit more performant.
-Tensor do_trapezoid(const Tensor& y, const Tensor& dx, int64_t dim) {
-    Tensor left = y.slice(dim, 0, -1);
-    Tensor right = y.slice(dim, 1);
-    // If the dimensions of 'dx' and '(left + right)' do not match
-    // broadcasting is attempted here.
-    return ((left + right) * dx).sum(dim) / 2.;
-}
-
-// When dx is constant, the above formula simplifies
-// to dx * [(\sum_{i=1}^n y_i) - (y_1 + y_n)/2]
-Tensor do_trapezoid(const Tensor& y, double dx, int64_t dim) {
-    return (y.sum(dim) - (y.select(dim, 0) + y.select(dim, -1)) * (0.5)) * dx;
-}
-```
-
-- 如果输入的`dx`是double，即相邻两个采样点间隔相同，则计算如下：$$\sum_{i=1}^{n-1} \frac{\Delta x}{2}\left(y_{i}+y_{i-1}\right)$$
-- 如果输入的`dx`是tensor，即使用tensor指定相邻两个采样点间隔，则计算如下 (其中$\Delta x_i = x_{i+1}-x_i$)：$$\sum_{i=1}^{n-1} \frac{\Delta x_i}{2}\left(y_{i}+y_{i-1}\right)$$
-
-Tensorflow python 代码
-
-```python
-def trapz(
-    y,
-    x=None,
-    dx=None,
-    axis=-1,
-    name=None,
-):
-  """Integrate y(x) on the specified axis using the trapezoidal rule.
-
-  Computes ∫ y(x) dx ≈ Σ [0.5 (y_k + y_{k+1}) * (x_{k+1} - x_k)]
-
-  Args:
-    y: Float `Tensor` of values to integrate.
-    x: Optional, Float `Tensor` of points corresponding to the y values. The
-      shape of x should match that of y. If x is None, the sample points are
-      assumed to be evenly spaced dx apart.
-    dx: Scalar float `Tensor`. The spacing between sample points when x is None.
-      If neither x nor dx is provided then the default is dx = 1.
-    axis: Scalar integer `Tensor`. The axis along which to integrate.
-    name: Python `str` name prefixed to ops created by this function.
-      Default value: `None`, uses name='trapz'.
-
-  Returns:
-    Float `Tensor` integral approximated by trapezoidal rule.
-      Has the shape of y but with the dimension associated with axis removed.
-  """
-  with tf.name_scope(name or 'trapz'):
-    if not (x is None or dx is None):
-      raise ValueError('Not permitted to specify both x and dx input args.')
-    dtype = dtype_util.common_dtype([y, x, dx], dtype_hint=tf.float32)
-    axis = ps.convert_to_shape_tensor(axis, dtype=tf.int32, name='axis')
-    axis_rank = tensorshape_util.rank(axis.shape)
-    if axis_rank is None:
-      raise ValueError('Require axis to have a static shape.')
-    if axis_rank:
-      raise ValueError(
-          'Only permitted to specify one axis, got axis={}'.format(axis))
-    y = tf.convert_to_tensor(y, dtype=dtype, name='y')
-    y_shape = ps.convert_to_shape_tensor(ps.shape(y), dtype=tf.int32)
-    length = y_shape[axis]
-    if x is None:
-      if dx is None:
-        dx = 1.
-      dx = tf.convert_to_tensor(dx, dtype=dtype, name='dx')
-      if ps.shape(dx):
-        raise ValueError('Expected dx to be a scalar, got dx={}'.format(dx))
-      elem_sum = tf.reduce_sum(y, axis=axis)
-      elem_sum -= 0.5 * tf.reduce_sum(
-          tf.gather(y, [0, length - 1], axis=axis),
-          axis=axis)  # half weight endpoints
-      return elem_sum * dx
-    else:
-      x = tf.convert_to_tensor(x, dtype=dtype, name='x')
-      tensorshape_util.assert_is_compatible_with(x.shape, y.shape)
-      dx = (
-          tf.gather(x, ps.range(1, length), axis=axis) -
-          tf.gather(x, ps.range(0, length - 1), axis=axis))
-      return 0.5 * tf.reduce_sum(
-          (tf.gather(y, ps.range(1, length), axis=axis) +
-           tf.gather(y, ps.range(0, length - 1), axis=axis)) * dx,
-          axis=axis)
-```
-
-整体逻辑为：
-
-- 确保 `x` 和 `dx` 不都为空
-- 确保输入 `axis` 是一个有效的 `int` 维度
-- 若 `x` 为空
-  - 若 `dx` 为空
-    - 设 `dx` 等于 1，根据下式按照指定维度直接计算结果：$$\sum_{i=1}^{n-1} \frac{1}{2}\left(y_{i}+y_{i-1}\right)$$
-  - 若 `dx` 不为空，根据下式按照指定维度直接计算结果：$$\sum_{i=1}^{n-1} \frac{\Delta x}{2}\left(y_{i}+y_{i-1}\right)$$
-- 若 `x` 不为空：按照指定维度利用差分求解间距，根据下式直接计算结果：$$\sum_{i=1}^{n-1} \frac{\left(x_{i}-x_{i-1}\right)}{2}\left(y_{i}+y_{i-1}\right)$$
-
-
-# 四、对比分析
-
-计算思路基本一致，无功能差别，只是在分片方式上不一样。
-
-在计算差分过程中，PyTorch使用slice然后相减，从而完成差分；Tensorflow使用gather然后相减，从而完成差分。
-
-# 五、设计思路与实现方案
-
-## 命名与参数设计
-
-`paddle.trapezoid(y, x=None, dx=None, axis=-1)` 参数说明如下：
-
-- **y** (Tensor) – 计算 trapezoidal rule 时所需的值。
-  
-- **x** (Tensor) – 可选，**y** 中数值对应的点的浮点数所组成的 Tensor；**x** 的形状应与 **y** 的形状相匹配；如果 **x** 为 None，则假定采样点均匀分布 **dx**。
-  
-- **dx** (float) - 相邻采样点之间的常数间隔；当**x**和**dx**均未指定时，**dx**默认为1.0。
-  
-- **dim** (int) – 计算 trapezoidal rule 时 **y** 的维度。
-  
-
-`Tensor.trapezoid(x=None, dx=None, axis=-1)` 参数说明如下：
-
-- **x** (Tensor) – 可选，`Tensor` 中数值对应的点的浮点数所组成的 Tensor；**x** 的形状应与 **y** 的形状相匹配；如果 **x** 为 None，则假定采样点均匀分布 **dx**。
-  
-- **dx** (float) - 相邻采样点之间的常数间隔；当**x**和**dx**均未指定时，**dx**默认为1.0。
-  
-- **dim** (int) – 计算 trapezoidal rule 时 **y** 的维度。
-  
-
-输出是一个Tensor，其形状与 **y** 的形状与用于计算 trapezoidal rule 时的维度有关。
-
-## 底层OP设计
-
-使用`paddle.diff`和`Tensor.sum`组合实现。
-
-## API实现方案
-
-核心计算公式如下: $$\sum_{i=1}^{n-1} \frac{\Delta x_i}{2}\left(y_{i}+y_{i-1}\right)$$
-
-- 若 `x` 为空
-  - 若 `dx` 为空, 则 $\Delta x_i = 1.0$
-  - 若 `dx` 不为空, 则 $\Delta x_i = \text{dx}$
-- 若 `x` 不为空：按照指定维度进行如下差分：$\Delta x_i = x_{i+1} - x_{i}$
-
-
-demo:
-
-```python
-def trapezoid(y, x=None, dx=1.0, axis=-1):
-    if x is None:
-        d = dx
-    else:
-        d = paddle.diff(x, axis=axis)
-
-    nd = y.ndim
-    slice1 = [slice(None)] * nd
-    slice2 = [slice(None)] * nd
-    slice1[axis] = slice(1, None)
-    slice2[axis] = slice(None, -1)
-
-    result = (d * (y[tuple(slice1)] + y[tuple(slice2)]) / 2.0).sum(axis)
-
-    return result
-```
-
-# 六、测试和验收的考量
-
-1. 结果正确性: 
-    - 前向计算: `paddle.trapezoid`(和 `Tensor.trapezoid`) 计算结果与 `np.trapz` 计算结果一致。
-    - 反向计算: `paddle.trapezoid`(和 `Tensor.trapezoid`) 计算结果反向传播所得到的梯度与使用 numpy 手动计算的结果一致。令输出 $p$ 对 $x_i$ 求导所得梯度为 $g_i$ 则:
-        - 当 $i=1$ 时, $g_i = \Delta x_1$ 
-        - 当 $\text{1} < \text{i}< \text{n-1}$ 时, $g_i = \frac{\Delta x_{i-1} + \Delta x_{i}}{2}$
-        - 当 $i=n$ 时, $g_i = \Delta x_{n-1}$
-        
-2. 硬件场景: 在CPU和GPU硬件条件下的运行结果一致。
-  
-3. 异常测试:
-    - 数据类型检验:
-        - y 要求为 paddle.Tensor
-        - x 若有输入, 则要求为 paddle.Tensor
-        - dx 若有输入, 则要求为 float
-        - axis 若有输入, 则要求为 int
-    - 具体数值检验:
-        - 若 x 有输入, 已知 y 的尺寸为 `[d_1, d_2, ... , d_n]` 且 `axis=k` , 则 x 的尺寸只能为 `[d_k]` 或 `[d_1, d_2, ... , d_n]`
-        - 若 dx 有输入, 则要非负
-        - 若 axis 有输入, 则要求 y 存在该维度
-        
-3. 各参数输入组合有效:
-    - 检查只输入 y 的情况
-        - 正常计算
-    - 检查输入 y 和 dx 的情况
-        - 正常计算
-    - 检查输入 y 和 x 的情况
-        - 正常计算
-    - 检查输入 y, dx 和 axis 的情况
-        - 检查y是否存在输入的axis索引, 若存在则正常计算; 否则抛出异常
-    - 检查输入 y, x 和 axis 的情况
-        - 检查 y 是否存在输入的 axis 索引, 若存在则正常计算; 否则抛出异常
-        - 检查 x 和 y 的尺寸是否匹配, 若存在则正常计算; 否则抛出异常
-        - 其余情况正常计算
-    - 其他组合输入
-        - 异常组合输入, 抛出异常
-
-# 七、可行性分析和排期规划
-
-方案主要依赖现有`paddle.diff`和`Tensor.sum`组合实现，可以满足在当前版本周期内开发完成。
-
-# 八、影响面
-
-为独立新增API，对其他模块没有影响
-
-# 名词解释
-
-无
-
-# 附件及参考资料
-
-无
diff --git a/rfcs/OPs-Perf/20220706_Selu_op_optimization.md b/rfcs/OPs-Perf/20220706_Selu_op_optimization.md
deleted file mode 100644
index 107431fa7..000000000
--- a/rfcs/OPs-Perf/20220706_Selu_op_optimization.md
+++ /dev/null
@@ -1,83 +0,0 @@
-# Selu OP性能优化设计文档
-
-
-| 基本信息                                                     | 内容                                                         |
-| ------------------------------------------------------------ | ------------------------------------------------------------ |
-| 提交作者<input type="checkbox" class="rowselector hidden">   | carryyu                                               |
-| 提交时间<input type="checkbox" class="rowselector hidden">   | 2022-07-06                                                   |
-| 版本号                                                       | V1.0                                   |
-| 依赖飞桨版本<input type="checkbox" class="rowselector hidden"> | PaddleDevelop                      |
-| 文件名                                                       | 20220706_Selu_op_optimization.md<br> |
-
-# 1 背景与意义
-
-目前Paddle中的Selu是通过Eigen组合实现，没有用到一些性能优化的技巧，存在性能优化的空间。
-
-## 1.1 飞桨现状
-
-目前的实现有一定的性能优化空间，可以加入一些性能优化的技巧。当前性能如下表：
-| Case No. | device | input_shape | input_type | Paddle Perf(ms) |
-|---|---|---|---|---|
-| 1 | Tesla T4 | [8, 1024, 3072] | float32 | 0.9122 | 
-| 2 | Tesla T4 | [8, 1024, 3072] | float64 | 5.2592 |
-
-## 1.2 业内方案调研
-
-Pytorch中对应`paddle.nn.functional.selu` 的Api为 `torch.nn.functional.selu`。调研发现Pytorch中采用的是`SeluKernel` Kernel完成该OP的GPU实现。PyTorch采用的方案是1维线程设置完成整体计算，整体性能如下：
-| Case No. | device | input_shape | input_type | Pytorch Perf(ms) |
-|---|---|---|---|---|
-| 1 | Tesla T4 | [8, 1024, 3072] | float32 | 0.8349 | 
-| 2 | Tesla T4 | [8, 1024, 3072] | float64 | 5.4939 |
-
-## 1.3 对比分析
-
-目前Paddle与Pytorch的方案几乎相同，但理论上可以通过向量化读取和写入等手段进行优化，进一步提升算子性能。
-
-# 2 设计方案与性能预期
-
-## 2.1 关键模块与性能提升点
-
-通过使用飞桨内部的Elementwise Kernel来进行计算。通过向量化读取、向量化写入以及gpu_launch_config.h中的线程配置方法对算子进行优化，预计提升5%。
-
-## 2.2 Host端计算流程
-
-通过gpu_launch_config.h中的线程配置方法配置1D线程。
-
-## 2.4 Device端计算流程
-
-设备端通过kps::ReadData和kps::WriteData对数据进行读写，再对每个值进行selu计算。
-
-# 3 测试和验收的考量
-
-参考：[算子性能优化验收标准](http://agroup.baidu.com/paddle-perf/md/article/4892913)
-完成优化后，Paddle与优化前的Paddle的性能对比效果如下，达到了预期性能提升效果（提升5%）：
-| Case No. | device | input_shape | input_type | Paddle Perf(ms) | Old-Paddle Perf(ms) | diff |
-|---|---|---|---|---|---|---|
-| 1 | Tesla T4 | [8, 1024, 3072] | float32 | 0.8277 | 0.9122 | faster than 9.26% |
-| 2 | Tesla T4 | [8, 1024, 3072] | float64 | 4.5655 | 5.2592 | faster than 13.19% |
-
-完成优化后，Paddle与Pytorch的性能对比效果如下，在fp32情况下基本与Pytorch持平，在fp64情况下提升较大 ：
-| Case No. | device | input_shape | input_type | Paddle Perf(ms) | Pytorch Perf(ms) | diff |
-|---|---|---|---|---|---|---|
-| 1 | Tesla T4 | [8, 1024, 3072] | float32 | 0.8277 | 0.8349 | faster than 0.86% |
-| 2 | Tesla T4 | [8, 1024, 3072] | float64 | 4.5655 | 5.4939 | faster than 16.89% |
-
-# 4 可行性分析和排期规划
-
-时间和开发排期规划，主要milestone
-
-| No. | 开发内容 | 预期时间 |
-|---|---|---|
-| 1 | 理清Paddle中OP设计思路，同类产品中最佳设计方案  | 2022-07-06 |
-| 2 | 完成开发文档设计  | 2022-07-14 |
-| 3 | 完成代码开发工作，并通过线程CI测试 | 2022-07-17 |
-
-# 5 影响面
-
-需要进一步讨论的问题，开放性问题，有争议问题；对其他模块是否有影响。
-
-# 名词解释
-
-# 附件及参考资料
-
-[1]. [OP Benchmark使用指南](https://github.com/PaddlePaddle/benchmark/blob/master/api/README.md)
diff --git a/rfcs/OPs-Perf/20220714_dist_op_optimization.md b/rfcs/OPs-Perf/20220714_dist_op_optimization.md
new file mode 100644
index 000000000..7110ce1ee
--- /dev/null
+++ b/rfcs/OPs-Perf/20220714_dist_op_optimization.md
@@ -0,0 +1,89 @@
+# dist OP Performance Optimization Design Document
+
+
+| Basic information | Content |
+| --- | --- |
+| Author | thunder95 |
+| Submission date | 2022-07-14 |
+| Version | V1.0 |
+| Required Paddle version | PaddleDevelop |
+| File name | 20220714_dist_op_optimization.md |
+
+
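+# 1 Background and Motivation
+
+The Dist operator in Paddle already performs well thanks to PNormKernel, which is implemented with the Kernel Primitive API.
+Any remaining performance gains may be achievable with a native custom kernel.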
+
+## 1.1 Current State in Paddle
+
+Current performance, measured on the PaddlePaddle develop branch:
+
+| Case No. | input_shape | p | Paddle Perf (ms) |
+|---|---|---|---|
+| 0 | [1000,1000] | 2.0 | 0.2338 |
+| 1 | [1000,1000] | inf | 0.1843 |
+| 2 | [1000,1000] | 0 | 0.1586 |
+
+All three cases use an input of shape [1000,1000] and differ only in the value of p: 2.0, inf, and 0.
+Current API documentation: https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/dist_cn.html#dist
+
+## 1.2 Survey of Industry Solutions
+
+PyTorch's dist operator also runs on the GPU. Its overall performance, measured on PyTorch v1.12:
+
+| Case No. | input_shape | p | PyTorch Perf (ms) |
+|---|---|---|---|
+| 0 | [1000,1000] | 2.0 | 0.2492 |
+| 1 | [1000,1000] | inf | 0.2134 |
+| 2 | [1000,1000] | 0 | 0.1586 |
+
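+## 1.3 Comparative Analysis
+
+The Paddle and PyTorch API designs are nearly identical. In the three test cases above, Paddle is faster than PyTorch for p = 2.0 and p = inf and on par with it for p = 0.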
+
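+# 2 Design and Expected Performance
+
+## 2.1 Key Modules and Optimization Points
+
+Rewrite `kps::Reduce` so that the intermediate steps are fused into a single kernel. The expected speedup is at least 1.3x.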
+
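+## 2.2 Host-Side Computation Flow
+
+Align the shapes of the input tensors through broadcasting.
+
+## 2.3 Device-Side Computation Flow
+
+The device side reads and writes data with kps::ReadData and kps::WriteData, then applies the dist computation to each pair of values.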
+
+# 3 Testing and Acceptance Considerations
+
+Reference: [Operator performance optimization acceptance criteria](http://agroup.baidu.com/paddle-perf/md/article/4892913)
+
+
+
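+# 4 Feasibility Analysis and Schedule
+
+Development schedule and main milestones:
+
+| No. | Task | Expected date |
+|---|---|---|
+| 1 | Understand the OP design approach in Paddle and the best designs in comparable frameworks | 2022-07-11 |
+| 2 | Complete the design document | 2022-07-17 |
+| 3 | Complete the code and pass the CI tests | 2022-07-24 |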
+
+
+
+# 5 Impact
+
+The optimized operator runs independently. No other operators or modules need to change, and the API design stays the same as before.
+
+
+# Glossary
+
+
+# Appendix and References
+
+[1]. [OP Benchmark User Guide](https://github.com/PaddlePaddle/benchmark/blob/master/api/README.md)
+
+