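Attention

Scaled Dot Product Attention FP16/BF16 Forward

This operation computes the scaled dot product attention (SDPA), as

\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\)

using the FlashAttention-2 algorithm as described in the paper FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. It is applicable to both training and inference, and can optionally generate a stats tensor to be used for the backward training computation.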
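As a point of comparison for the formula above, here is a minimal pure-Python sketch of naive single-head attention. It is not the FlashAttention-2 algorithm and does not call cuDNN; the function names `softmax` and `sdpa_reference` are hypothetical, and the `attn_scale` default of 1.0 simply mirrors the default described in the configurable options.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa_reference(q, k, v, attn_scale=1.0):
    """Naive attention for one head.

    q: [S_q][D_qk], k: [S_kv][D_qk], v: [S_kv][D_v], all lists of floats.
    Returns o: [S_q][D_v].
    """
    out = []
    for q_row in q:
        # One row of attention scores: scaled dot products with every key.
        scores = [attn_scale * sum(a * b for a, b in zip(q_row, k_row))
                  for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
d = len(q[0])
o = sdpa_reference(q, k, v, attn_scale=1.0 / math.sqrt(d))
```

Each output row is a convex combination of the rows of `v`, so the first query (which matches the first key best) lands closer to `v[0]` than to `v[1]`.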

Python sample: samples/python/50_scaled_dot_product_attention.ipynb
Python sample with paged caches: samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
C++ sample: samples/cpp/sdpa
Python tests: test/python/test_mhas.py

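Configurable Options

Attention scale (attn_scale): Applies a scaling factor to the attention scores before the softmax, such as \(\frac{1}{\sqrt{\text{d}}}\). Set to 1.0 by default.
Bias mask: Applies an additive bias mask to the attention scores. You must pass a bias tensor as specified in the tensors section below. Dimensions that are passed as 1 apply a broadcast mask over the attention scores.
Alibi mask: Attention with Linear Biases (ALiBi) is an additive mask applied to the attention scores, as described in the paper Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.
Padding mask: Also called variable sequence length, this option masks out padded time steps so that they are ignored in the computation. You must pass a per-batch sequence length as specified in the tensors section below.
Causal mask: Fills the upper triangular part of the attention score matrix with negative infinity. The diagonal of the matrix starts at the top left. It can be set by calling set_diagonal_band_right_bound(0) and set_diagonal_alignment(DiagonalAlignment_t::TOP_LEFT) as described below, or by using the deprecated option set_causal_mask.
Bottom-right causal mask: Similar to the causal mask, but the diagonal starts at the bottom right. In addition, when variable sequence lengths are used, the diagonal is specified on a per-batch basis, depending on the sequence length of the given batch. It can be set by calling set_diagonal_band_right_bound(0) and set_diagonal_alignment(DiagonalAlignment_t::BOTTOM_RIGHT), or by using the deprecated option set_causal_mask_bottom_right.
Diagonal band: Everything between the left and right bounds of the diagonal band is attended to. By default, both bounds are treated as infinite, so everything is attended to.
Diagonal band left bound (new) / sliding window length (deprecated): Specifies that, for row index row_idx, attention scores up to and including column row_idx-left_bound are not computed, and the corresponding entries of the score matrix are filled with negative infinity.
Diagonal band right bound: Specifies that, for row index row_idx, attention scores beyond column row_idx+right_bound are not computed, and the corresponding entries of the score matrix are filled with negative infinity.
Diagonal alignment: Specifies whether, when a left and/or right bound is used, the diagonal starts at the top left ("TOP_LEFT") or is anchored at the bottom right, aligned with the actual sequence lengths ("BOTTOM_RIGHT").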
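The band rules above can be sketched as a boolean mask in plain Python. This is an illustration only: `band_mask` is a hypothetical helper, not a cuDNN call, and it assumes that BOTTOM_RIGHT alignment shifts the diagonal by `s_kv - s_q` so that it ends at the bottom-right corner. `True` means the score is attended to; `False` means it would be filled with negative infinity.

```python
def band_mask(s_q, s_kv, left_bound=None, right_bound=None, bottom_right=False):
    """Return an s_q x s_kv mask; None for a bound means 'infinite'."""
    # Assumption: bottom-right alignment moves the diagonal so that the last
    # query row lines up with the last key column.
    shift = (s_kv - s_q) if bottom_right else 0
    mask = []
    for row in range(s_q):
        diag = row + shift
        mask_row = []
        for col in range(s_kv):
            attended = True
            # Left bound: columns up to and including diag - left_bound are masked.
            if left_bound is not None and col <= diag - left_bound:
                attended = False
            # Right bound: columns beyond diag + right_bound are masked.
            if right_bound is not None and col > diag + right_bound:
                attended = False
            mask_row.append(attended)
        mask.append(mask_row)
    return mask

causal = band_mask(3, 3, right_bound=0)
bottom_right_causal = band_mask(2, 4, right_bound=0, bottom_right=True)
sliding_window = band_mask(4, 4, left_bound=2, right_bound=0)
```

With `right_bound=0` the mask is lower triangular (causal); adding `left_bound=2` keeps only the diagonal and one column to its left, which is the sliding-window pattern.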
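Dropout: Randomly zeros some of the attention weights after the softmax as a form of regularization. You can configure dropout in two ways:
- To use the more performant Philox RNG dropout implementation, you must provide
  - An RNG seed, passed as a cuDNN tensor.
  - An RNG offset, passed as a cuDNN tensor.
  - A float representing the dropout probability, which is the probability that any given weight is set to zero.
  - (Debug only) An output RNG dump generated by the Philox RNG, passed as a cuDNN tensor.
- To use a user-provided dropout mask, you must provide
  - A dropout mask that matches the dimensions of the attention weights, indicating which weights to drop. Dimensions that are passed as 1 apply a broadcast dropout mask.
  - A dropout scale used to rescale the remaining weights accordingly, such as \(1 / (1 - \text{dropout probability})\).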
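The effect of a custom dropout mask and its scale can be sketched as follows. This is a plain-Python illustration, not the cuDNN implementation; `apply_dropout` is a hypothetical name, and the convention that a mask value of 1 keeps a weight and 0 drops it is an assumption made for this example.

```python
def apply_dropout(weights, mask, dropout_probability):
    """Zero the masked weights and rescale the survivors by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - dropout_probability)
    return [[w * m * scale for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

weights = [[0.25, 0.75], [0.5, 0.5]]
mask = [[1, 0], [1, 1]]  # assumed convention: 1 = keep, 0 = drop
dropped = apply_dropout(weights, mask, dropout_probability=0.5)
```

Rescaling by \(1 / (1 - p)\) keeps the expected value of each weight unchanged when weights are dropped independently with probability \(p\).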
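Packed layout: The query, key, value, and output tensors are expected to be ragged tensors, which are tensors with nested variable-length lists as inner dimensions. You must pass another tensor, called the ragged offset tensor, using the Tensor_attributes.set_ragged_offset() method. The ragged offset tensor must be a tensor of size \((B + 1, 1, 1, 1)\) that contains the offsets of the nested tensors, counted in number of elements rather than bytes. The last value of the offset tensor specifies the offset of the past-the-end element of the ragged tensor. For more information on the supported layouts, see Supported Tensor Layouts.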
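Paged attention: With a paged K and/or V cache, the K/V blocks no longer need to be contiguous, which lets you make better use of memory by avoiding fragmentation.
- For this, you must
  - Pass a page table k tensor containing offsets into the container of K blocks. This is optional and only needed if the K cache is paged.
  - Pass a page table v tensor containing offsets into the container of V blocks. This is optional and only needed if the V cache is paged.
  - Pass everything the padding mask above requires (that is, the per-batch sequence lengths for the K and V caches). This is required if at least one of the K/V caches is paged.
  - Optionally, but recommended, pass the maximum sequence length of the K/V caches. If omitted, it is (over)estimated, which may lead to a corrupted graph in some corner cases.
- The offsets into the K/V containers are calculated as
  - \(Kcache[b,h,s,d] = K[page\ table\ k[b,1,s / bs_k, 1],h,s\ mod\ bs_{k},d]\)
  - \(Vcache[b,h,s,d] = V[page\ table\ v[b,1,s / bs_v, 1],h,s\ mod\ bs_{v},d]\)
- See also the PagedAttention paper.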
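The lookup formula above can be sketched with nested Python lists. This is an illustration under simplifying assumptions: `paged_lookup` is a hypothetical helper, the page table is stored as `[B][num_pages]` rather than \((B, 1, ceil(S_{kv}/bs), 1)\), and the division `s / bs` is taken to be integer division.

```python
def paged_lookup(container, page_table, b, h, s, d, block_size):
    """Read logical element cache[b, h, s, d] from a paged container.

    container: [num_blocks][H][block_size][D]
    page_table: [B][num_pages], each entry is a block index into container.
    """
    block = page_table[b][s // block_size]   # which physical block holds token s
    return container[block][h][s % block_size][d]

# Three physical blocks, one head, block size 2, head dim 1.
# Each element encodes its physical location as block * 10 + position.
num_blocks, block_size = 3, 2
container = [[[[blk * 10 + pos] for pos in range(block_size)]]
             for blk in range(num_blocks)]

# Batch 0 holds 4 tokens stored in physical blocks 2 and 0, in that order.
page_table = [[2, 0]]
logical = [paged_lookup(container, page_table, 0, 0, s, 0, block_size)
           for s in range(4)]
```

Logical tokens 0 and 1 come from physical block 2, and tokens 2 and 3 come from block 0, so the cache does not need contiguous storage.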

Tensors

Input Tensors

Tensor Name	Device	Data Type	Dimensions
Q	GPU	FP16 or BF16	\((B, H_{q}, S_{q}, D_{qk})\)
K	GPU	FP16 or BF16	\((B, H_{k}, S_{kv}, D_{qk})\), or \((num\_blocks_{k}, H_{k}, bs_{k}, D_{qk})\) in case of a paged K cache
V	GPU	FP16 or BF16	\((B, H_{v}, S_{kv}, D_{v})\), or \((num\_blocks_{v}, H_{v}, bs_{v}, D_{v})\) in case of a paged V cache
(Bias mask) Bias Mask	GPU	FP16 or BF16	\((1, 1, S_{q}, S_{kv})\), \((1, H_{q}, S_{q}, S_{kv})\), \((B, 1, S_{q}, S_{kv})\), or \((B, H_{q}, S_{q}, S_{kv})\)
(Padding mask/Paged caches) Sequence Length Q	GPU	INT32	\((B, 1, 1, 1)\)
(Padding mask/Paged caches) Sequence Length KV	GPU	INT32	\((B, 1, 1, 1)\)
(Philox RNG Dropout) Seed	CPU or GPU	INT32 or INT64	\((1, 1, 1, 1)\)
(Philox RNG Dropout) Offset	CPU or GPU	INT32 or INT64	\((1, 1, 1, 1)\)
(Custom Dropout Mask) Mask	GPU	FP16 or BF16	\((1, 1, S_{q}, S_{kv})\), \((1, H_{q}, S_{q}, S_{kv})\), \((B, 1, S_{q}, S_{kv})\), or \((B, H_{q}, S_{q}, S_{kv})\)
(Custom Dropout Mask) Scale	GPU	FP32	\((1, 1, 1, 1)\)
(Packed layout) Ragged Offset	GPU	INT32	\((B + 1, 1, 1, 1)\)
(Paged attention) Page Table K	GPU	INT32	\((B, 1, ceil(S_{kv}/bs_{k}), 1)\)
(Paged attention) Page Table V	GPU	INT32	\((B, 1, ceil(S_{kv}/bs_{v}), 1)\)
(Paged attention) Max Sequence Length KV	CPU	INT32 or INT64	\((1, 1, 1, 1)\)

Output Tensors

Tensor Name	Device	Data Type	Dimensions
O	GPU	FP16 or BF16	\((B, H_{q}, S_{q}, D_{v})\)
Stats (training only)	GPU	FP32	\((B, H_{q}, S_{q}, 1)\)
(Philox RNG Dropout) RNG Dump	GPU	FP32	\((B, H_{q}, S_{q}, S_{kv})\)

Where

\(B\) is the batch size
\(H_{q}\) is the number of query heads
\(H_{k}\) is the number of key heads
\(H_{v}\) is the number of value heads
\(S_{q}\) is the sequence length of the query
\(S_{kv}\) is the sequence length of the key and value
\(D_{qk}\) is the embedding dimension per head of the query and key
\(D_{v}\) is the embedding dimension per head of the value
\(bs_{k}\) is the (power of 2) block size of the K container
\(bs_{v}\) is the (power of 2) block size of the V container
\(num\_blocks_{k}\) is the number of blocks in the K container
\(num\_blocks_{v}\) is the number of blocks in the V container

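Grouped Query Attention (GQA) and Multi Query Attention (MQA)

As described in the paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints:

When \(H_{k}\) and \(H_{v}\) are less than \(H_{q}\) and are factors of \(H_{q}\), this operation performs the Grouped Query Attention (GQA) computation.
When \(H_{k}\) and \(H_{v}\) are both set to 1, this operation performs the Multi Query Attention (MQA) computation.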
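To illustrate the head-count conditions above, the sketch below maps each query head to the key/value head it would share. The text does not spell out the mapping, so the contiguous grouping used here (consecutive query heads share one KV head) is an assumption based on the common GQA convention, and `kv_head_for_query_head` is a hypothetical helper.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the KV head index shared by the given query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("the number of KV heads must be a factor of H_q")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size

gqa_mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
mqa_mapping = [kv_head_for_query_head(h, 8, 1) for h in range(8)]
```

With \(H_{q} = 8\) and two KV heads, each KV head serves four query heads; with a single KV head (MQA), every query head maps to head 0.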

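Limitations

All input and output tensors must have the float16 or bfloat16 data type, except the softmax stats output tensor, which must be float32.
The per-head embedding dimensions \(D_{qk}\) and \(D_{v}\) must be multiples of 8, with a maximum value of 128.
For all of the tensors above, the stride of the per-head embedding dimension (\(D_{qk}\) and \(D_{v}\)) must be 1.
This operation is only supported on GPUs with the NVIDIA Ampere architecture (SM80) or newer.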
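The head-dimension limits listed above are easy to check before building a graph. The following is a small sketch of such a pre-check; `check_head_dims` is a hypothetical helper and is not part of cuDNN, which performs its own validation.

```python
def check_head_dims(d_qk, d_v, max_dim=128):
    """Return a list of human-readable violations of the head-dim limits."""
    problems = []
    for name, dim in (("D_qk", d_qk), ("D_v", d_v)):
        if dim % 8 != 0:
            problems.append(f"{name}={dim} is not a multiple of 8")
        if dim > max_dim:
            problems.append(f"{name}={dim} exceeds the maximum of {max_dim}")
    return problems

ok = check_head_dims(64, 128)
bad = check_head_dims(100, 256)
```

An empty list means both dimensions satisfy the documented constraints.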

C++ API

// returns [output, softmax_stats]
std::array<std::shared_ptr<Tensor_attributes>, 2> 
sdpa(std::shared_ptr<Tensor_attributes> q,
     std::shared_ptr<Tensor_attributes> k,
     std::shared_ptr<Tensor_attributes> v,
     SDPA_attributes options);

The options parameter of type SDPA_attributes is used to control the attributes of the forward operation, as detailed below

SDPA_attributes& set_is_inference(bool const value);

SDPA_attributes& set_attn_scale(std::shared_ptr<Tensor_attributes> value);
SDPA_attributes& set_attn_scale(float const value);

// ========================== BEGIN paged attn options =====================
SDPA_attributes& set_paged_attention_k_table(std::shared_ptr<Tensor_attributes> value);
SDPA_attributes& set_paged_attention_v_table(std::shared_ptr<Tensor_attributes> value);
SDPA_attributes& set_paged_attention_max_seq_len_kv(int const value);
// ==========================  END  paged attn options =====================

// ========================== BEGIN    var len options =====================
SDPA_attributes& set_padding_mask(bool const value);

// integer tensor that specifies the sequence length of each batch
SDPA_attributes& set_seq_len_q(std::shared_ptr<Tensor_attributes> value);
SDPA_attributes& set_seq_len_kv(std::shared_ptr<Tensor_attributes> value);
// ==========================  END     var len options =====================

// ========================== BEGIN score mod options =====================
SDPA_attributes& set_score_mod(std::function<Tensor_t(Graph_t, Tensor_t)>);

// Use in combination to set diagonal masking
SDPA_attributes& set_diagonal_alignment(DiagonalAlignment_t const alignment);
SDPA_attributes& set_diagonal_band_left_bound(int const value);
SDPA_attributes& set_diagonal_band_right_bound(int const value);

// DEPRECATED
// Sets the diagonal position to TOP_LEFT
// calls set_diagonal_band_right_bound(0) if no right_bound was specified
SDPA_attributes& set_causal_mask(bool const value);

// DEPRECATED
// Sets the diagonal position to BOTTOM_RIGHT
// and calls set_diagonal_band_right_bound(0) if no right_bound was specified
SDPA_attributes& set_causal_mask_bottom_right(bool const value);

// DEPRECATED
// calls set_diagonal_band_left_bound(value)
SDPA_attributes& set_sliding_window_length(int const value);

SDPA_attributes& set_bias(std::shared_ptr<Tensor_attributes> value);

SDPA_attributes& set_alibi_mask(bool const value);
// ==========================  END  score mod options =====================

// ========================== BEGIN   dropout options =====================
SDPA_attributes& set_dropout(float const probability,
                             std::shared_ptr<Tensor_attributes> seed,
                             std::shared_ptr<Tensor_attributes> offset);

SDPA_attributes& set_dropout(std::shared_ptr<Tensor_attributes> mask,
                             std::shared_ptr<Tensor_attributes> scale);

// for debugging dropout mask with seed and offset
SDPA_attributes& set_rng_dump(std::shared_ptr<Tensor_attributes> value);
// ==========================  END    dropout options =====================

SDPA_attributes& set_compute_data_type(DataType_t value);

Python API

Args:
    q (cudnn_tensor): The query data.
    k (cudnn_tensor): The key data. When page_table_k is provided, 'k' is a container of non-contiguous key data.
    v (cudnn_tensor): The value data. When page_table_v is provided, 'v' is a container of non-contiguous value data.
    is_inference (bool): Whether it is an inference step or training step.
    attn_scale (Optional[Union[float, cudnn_tensor]]): The scale factor for attention. Default is None.
    bias (Optional[cudnn_tensor]): The bias data for attention. Default is None.
    use_alibi_mask (Optional[bool]): Whether to use alibi mask. Default is False.
    use_padding_mask (Optional[bool]): Whether to use padding mask. Default is False.
    seq_len_q (Optional[cudnn_tensor]): The sequence length of the query.
    seq_len_kv (Optional[cudnn_tensor]): The sequence length of the key.
    dropout (Optional[Union[Tuple[(probability: float, seed: cudnn_tensor, offset: cudnn_tensor)], Tuple[mask: cudnn_tensor, scale: cudnn_tensor]]]): Whether to do dropout. Default is None.
    rng_dump (Optional[cudnn_tensor]): Debug tensor to dump the Philox RNG dropout mask. Default is None.
    paged_attention_k_table (Optional[cudnn_tensor]): The page table to look up offsets into 'k'
    paged_attention_v_table (Optional[cudnn_tensor]): The page table to look up offsets into 'v'
    paged_attention_max_seq_len_kv (Optional[integer]): The maximum sequence length for k/v caches when paged attention is active.
    compute_data_type (Optional[cudnn.data_type]): The data type for computation. Default is NOT_SET.
    name (Optional[str]): The name of the operation.
Preferred masking Args:
    diagonal_alignment (Optional[cudnn.diagonal_alignment]): One of {"TOP_LEFT", "BOTTOM_RIGHT"}. E.g., causal masking can be performed by setting diagonal_alignment=TOP_LEFT, and right_bound=0. Default is TOP_LEFT.
    left_bound (Optional[int]): An integer >= 1 specifying the offset to the left of the main diagonal to attend to. Default is None, implying +Inf.
    right_bound (Optional[int]): An integer >= 0 specifying the offset to the right of the main diagonal to attend to. Default is None, implying +Inf.
Deprecated masking Args (can cause undetermined behavior when combined with the Preferred masking args):
    sliding_window_length (Optional[int]): A positive int specifying the left bound sliding window length
    use_causal_mask (Optional[bool]): Whether to use causal mask. Default is False.
    use_causal_mask_bottom_right (Optional[bool]): Whether to use bottom right aligned causal mask. Default is False.

Returns:
    o (cudnn_tensor): The output data.
    stats (Optional[cudnn_tensor]): The softmax statistics in case the operation is in a training step.

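Scaled Dot Product Attention FP16/BF16 Backward

This operation computes the gradient tensors for scaled dot product attention (SDPA) using the FlashAttention-2 algorithm as described in the paper FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. You are required to pass the stats tensor from the forward operation to the backward operation as an input.

Python sample: samples/python/51_scaled_dot_product_attention_backward.ipynb
C++ sample: samples/cpp/sdpa
Python tests: test/python/test_mhas.py

Configurable Options

All the configurable options mentioned for the forward operation, including ragged tensors and GQA/MQA, also apply to the backward operation.

Tensors

All the tensor requirements of the forward operation also apply to the backward operation. The gradient tensors for the query, key, value, output, and bias should have the same properties as their non-gradient counterparts.

Input Tensors

Tensor Name	Device	Data Type	Dimensions
dO	GPU	FP16 or BF16	\((B, H_{q}, S_{q}, D_{v})\)

Output Tensors

Tensor Name	Device	Data Type	Dimensions
dQ	GPU	FP16 or BF16	\((B, H_{q}, S_{q}, D_{qk})\)
dK	GPU	FP16 or BF16	\((B, H_{k}, S_{kv}, D_{qk})\)
dV	GPU	FP16 or BF16	\((B, H_{v}, S_{kv}, D_{v})\)

Limitations

All the limitations mentioned for the forward operation also apply to the backward operation.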

C++ API

// returns [dQ, dK, dV]
std::array<std::shared_ptr<Tensor_attributes>, 3>
sdpa_backward(std::shared_ptr<Tensor_attributes> q,
              std::shared_ptr<Tensor_attributes> k,
              std::shared_ptr<Tensor_attributes> v,
              std::shared_ptr<Tensor_attributes> o,
              std::shared_ptr<Tensor_attributes> dO,
              std::shared_ptr<Tensor_attributes> stats,
              SDPA_backward_attributes options);

The options parameter of type SDPA_backward_attributes is used to control the attributes of the backward operation, as detailed below

SDPA_backward_attributes& set_attn_scale(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes& set_attn_scale(float const value);

// ========================== BEGIN    var len options =====================
SDPA_backward_attributes& set_padding_mask(bool const value);

// integer tensor that specifies the sequence length of each batch
SDPA_backward_attributes& set_seq_len_q(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes& set_seq_len_kv(std::shared_ptr<Tensor_attributes> value);

// the maximum number of sequence tokens for all batches, used for workspace allocation
SDPA_backward_attributes& set_max_total_seq_len_q(int64_t const value);
SDPA_backward_attributes& set_max_total_seq_len_kv(int64_t const value);
// ==========================  END     var len options =====================

// ========================== BEGIN score mod options =====================
SDPA_backward_attributes& set_score_mod(std::function<Tensor_t(Graph_t, Tensor_t)>);

// Use in combination to set diagonal masking, e.g. (bottom right) causal masking
SDPA_backward_attributes& set_diagonal_alignment(DiagonalAlignment_t const alignment);
SDPA_backward_attributes& set_diagonal_band_left_bound(int const value);
SDPA_backward_attributes& set_diagonal_band_right_bound(int const value);

// DEPRECATED
// Sets the diagonal position to TOP_LEFT
// calls set_diagonal_band_right_bound(0) if no right_bound was specified
SDPA_backward_attributes& set_causal_mask(bool const value);

// DEPRECATED
// Sets the diagonal position to BOTTOM_RIGHT
// and calls set_diagonal_band_right_bound(0) if no right_bound was specified
SDPA_backward_attributes& set_causal_mask_bottom_right(bool const value);

// DEPRECATED
// calls set_diagonal_band_left_bound(value)
SDPA_backward_attributes& set_sliding_window_length(int const value);

SDPA_backward_attributes& set_bias(std::shared_ptr<Tensor_attributes> value);
SDPA_backward_attributes& set_dbias(std::shared_ptr<Tensor_attributes> value);

SDPA_backward_attributes& set_alibi_mask(bool const value);
// ==========================  END  score mod options =====================

// ========================== BEGIN   dropout options =====================
SDPA_backward_attributes& set_dropout(float const probability,
                                      std::shared_ptr<Tensor_attributes> seed,
                                      std::shared_ptr<Tensor_attributes> offset);
SDPA_backward_attributes& set_dropout(std::shared_ptr<Tensor_attributes> mask,
                                      std::shared_ptr<Tensor_attributes> scale,
                                      std::shared_ptr<Tensor_attributes> scale_inv);

// for debugging dropout mask with seed and offset
SDPA_backward_attributes& set_rng_dump(std::shared_ptr<Tensor_attributes> value);
// ==========================  END    dropout options =====================

SDPA_backward_attributes& set_deterministic_algorithm(bool const value);

SDPA_backward_attributes& set_compute_data_type(DataType_t const value);

Python API

Args:
    q (cudnn_tensor): The query data.
    k (cudnn_tensor): The key data.
    v (cudnn_tensor): The value data.
    o (cudnn_tensor): The output data.
    dO (cudnn_tensor): The output loss gradient.
    stats (cudnn_tensor): The softmax statistics from the forward pass.
    attn_scale (Optional[Union[float, cudnn_tensor]]): The scale factor for attention. Default is None.
    bias (Optional[cudnn_tensor]): The bias data for attention. Default is None.
    dBias (Optional[cudnn_tensor]): The dBias data for attention. Default is None.
    use_alibi_mask (Optional[bool]): Whether to use alibi mask. Default is False.
    use_padding_mask (Optional[bool]): Whether to use padding mask. Default is False.
    seq_len_q (Optional[cudnn_tensor]): The sequence length of the query.
    seq_len_kv (Optional[cudnn_tensor]): The sequence length of the key.
    dropout (Optional[Union[Tuple[(probability: float, seed: cudnn_tensor, offset: cudnn_tensor)], Tuple[mask: cudnn_tensor, scale: cudnn_tensor]]]): Whether to do dropout. Default is None.
    rng_dump (Optional[cudnn_tensor]): Debug tensor to dump the Philox RNG dropout mask. Default is None.
    use_deterministic_algorithm (Optional[bool]): Whether to always use deterministic algorithm. Default is False.
    compute_data_type (Optional[cudnn.data_type]): The data type for computation. Default is NOT_SET.
    name (Optional[str]): The name of the operation.
Preferred masking Args:
    diagonal_alignment (Optional[cudnn.diagonal_alignment]): One of {"TOP_LEFT", "BOTTOM_RIGHT"}. E.g., causal masking can be performed by setting diagonal_alignment=TOP_LEFT, and right_bound=0. Default is TOP_LEFT.
    left_bound (Optional[int]): An integer >= 1 specifying the offset to the left of the main diagonal to attend to. Default is None, implying +Inf.
    right_bound (Optional[int]): An integer >= 0 specifying the offset to the right of the main diagonal to attend to. Default is None, implying +Inf.
Deprecated masking Args (can cause undetermined behavior when combined with the Preferred masking args):
    sliding_window_length (Optional[int]): A positive int specifying the left bound sliding window length
    use_causal_mask (Optional[bool]): Whether to use causal mask. Default is False.
    use_causal_mask_bottom_right (Optional[bool]): Whether to use bottom right aligned causal mask. Default is False.

Returns:
    dQ (cudnn_tensor): The query gradient data.
    dK (cudnn_tensor): The key gradient data.
    dV (cudnn_tensor): The value gradient data.

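Scaled Dot Product Attention FP8 Forward

This operation computes the scaled dot product attention (SDPA) in the 8-bit floating point (FP8) data type, using the FlashAttention-2 algorithm as described in the paper FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. It is applicable to both training and inference, and can optionally generate a stats tensor to be used for the backward training computation.

The FP8 data type consists of two encodings:

FP8_E4M3 (1 sign bit, 4 exponent bits, and 3 mantissa bits)
FP8_E5M2 (1 sign bit, 5 exponent bits, and 2 mantissa bits).

Due to the limited numerical precision of the FP8 data type, for practical use cases you must scale values computed in FP32 format before storing them in FP8 format, and descale the values stored in FP8 format before performing computations on them. For more information, refer to the Transformer Engine FP8 primer.

The suggested value for the scaling factor is computed as: (maximum representable value in the FP8 format) / (absolute maximum value seen in the tensor from the previous layer).

For E4M3, the suggested scaling factor is 448.f/ prev_layer_tensor_amax (rounded down to the nearest power of 2)
For E5M2, the suggested scaling factor is 57344.f/ prev_layer_tensor_amax (rounded down to the nearest power of 2)

The suggested value for the descale factor is the reciprocal of the scale factor.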
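The scaling recipe above can be sketched in a few lines of Python. This is an illustration of the arithmetic only; `recommended_scale` and `FP8_MAX` are hypothetical names, and the sketch assumes `amax` is strictly positive.

```python
import math

# Maximum representable magnitudes quoted in the text for each encoding.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}

def recommended_scale(amax, fmt):
    """Scale = fp8_max / amax, rounded down to the nearest power of 2."""
    raw = FP8_MAX[fmt] / amax
    return 2.0 ** math.floor(math.log2(raw))

scale_e4m3 = recommended_scale(3.0, "E4M3")   # 448 / 3 is about 149.3
descale_e4m3 = 1.0 / scale_e4m3               # descale is the reciprocal
scale_e5m2 = recommended_scale(1.0, "E5M2")   # 57344 lies between 2^15 and 2^16
```

Rounding down to a power of 2 keeps the scaled values inside the representable range, and multiplying by a power of 2 changes only the exponent, not the mantissa.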

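Since scaling and descaling are critical for convergence with the FP8 data type, you are required to pass scaling and descaling input tensors, as well as amax output tensors.

C++ sample: samples/cpp/sdpa

Configurable Options

The current FP8 support is a subset of the options supported for FP16 and BF16. We are actively working on expanding the support for FP8.

Attention scale (attn_scale): Applies a scaling factor to the attention scores before the softmax, such as \(\frac{1}{\sqrt{\text{d}}}\). Set to 1.0 by default.
Causal mask: Fills the upper triangular part of the attention score matrix with negative infinity.

Tensors

The tensors in the forward operation are defined as follows

\(P = QK^T\)

\(S = \text{softmax}(P)\)

\(O = SV\)

Input Tensors

Tensor Name	Device	Data Type	Dimensions
Q	GPU	E4M3 or E5M2	\((B, H_{q}, S_{q}, D_{qk})\)
K	GPU	E4M3 or E5M2	\((B, H_{k}, S_{kv}, D_{qk})\)
V	GPU	E4M3 or E5M2	\((B, H_{v}, S_{kv}, D_{v})\)
Descale Q	GPU	FP32	\((1, 1, 1, 1)\)
Descale K	GPU	FP32	\((1, 1, 1, 1)\)
Descale V	GPU	FP32	\((1, 1, 1, 1)\)
(Bias mask) Bias Mask	GPU	E4M3 or E5M2	\((1, 1, S_{q}, S_{kv})\), \((1, H_{q}, S_{q}, S_{kv})\), \((B, 1, S_{q}, S_{kv})\), or \((B, H_{q}, S_{q}, S_{kv})\)
(Padding mask) Sequence Length Q	GPU	INT32	\((B, 1, 1, 1)\)
(Padding mask) Sequence Length KV	GPU	INT32	\((B, 1, 1, 1)\)
(Philox RNG Dropout) Seed	CPU or GPU	INT32 or INT64	\((1, 1, 1, 1)\)
(Philox RNG Dropout) Offset	CPU or GPU	INT32 or INT64	\((1, 1, 1, 1)\)
(Custom Dropout Mask) Mask	GPU	E4M3 or E5M2	\((1, 1, S_{q}, S_{kv})\), \((1, H_{q}, S_{q}, S_{kv})\), \((B, 1, S_{q}, S_{kv})\), or \((B, H_{q}, S_{q}, S_{kv})\)
(Custom Dropout Mask) Scale	GPU	FP32	\((1, 1, 1, 1)\)
Descale S	GPU	FP32	\((1, 1, 1, 1)\)
Scale S	GPU	FP32	\((1, 1, 1, 1)\)

Output Tensors

Tensor Name	Device	Data Type	Dimensions
O	GPU	E4M3 or E5M2	\((B, H_{q}, S_{q}, D_{v})\)
Stats (training only)	GPU	FP32	\((B, H_{q}, S_{q}, 1)\)
AMax S	GPU	FP32	\((1, 1, 1, 1)\)
AMax O	GPU	FP32	\((1, 1, 1, 1)\)

Where

\(B\) is the batch size
\(H_{q}\) is the number of query heads
\(H_{k}\) is the number of key heads
\(H_{v}\) is the number of value heads
\(S_{q}\) is the sequence length of the query
\(S_{kv}\) is the sequence length of the key and value
\(D_{qk}\) is the embedding dimension per head of the query and key
\(D_{v}\) is the embedding dimension per head of the value

Grouped Query Attention (GQA) and Multi Query Attention (MQA)

The GQA and MQA behavior described for the FP16/BF16 forward operation also applies to the FP8 forward operation.

Limitations

The per-head embedding dimensions \(D_{qk}\) and \(D_{v}\) must be multiples of 8, with a maximum value of 128.
For all of the tensors above, the stride of the per-head embedding dimension (\(D_{qk}\) and \(D_{v}\)) must be 1.
This operation is only supported on GPUs with the NVIDIA Hopper architecture (SM90) or newer.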

C++ API

// returns [o, stats, amax_s, amax_o]
std::array<std::shared_ptr<Tensor_attributes>, 4>
Graph::sdpa_fp8(std::shared_ptr<Tensor_attributes> q,
                std::shared_ptr<Tensor_attributes> k,
                std::shared_ptr<Tensor_attributes> v,
                std::shared_ptr<Tensor_attributes> descale_q,
                std::shared_ptr<Tensor_attributes> descale_k,
                std::shared_ptr<Tensor_attributes> descale_v,
                std::shared_ptr<Tensor_attributes> descale_s,
                std::shared_ptr<Tensor_attributes> scale_s,
                std::shared_ptr<Tensor_attributes> scale_o,
                SDPA_fp8_attributes attributes);

The attributes parameter of type SDPA_fp8_attributes is used to control the attributes of the forward operation, as detailed below

SDPA_fp8_attributes&
set_is_inference(bool const value);

SDPA_fp8_attributes&
set_attn_scale(std::shared_ptr<Tensor_attributes> value);

SDPA_fp8_attributes&
set_attn_scale(float const value);

SDPA_fp8_attributes&
set_causal_mask(bool const value);

SDPA_fp8_attributes&
set_bias(std::shared_ptr<Tensor_attributes> value);

SDPA_fp8_attributes&
set_padding_mask(bool const value);

SDPA_fp8_attributes&
set_seq_len_q(std::shared_ptr<Tensor_attributes> value);

SDPA_fp8_attributes&
set_seq_len_kv(std::shared_ptr<Tensor_attributes> value);

SDPA_fp8_attributes&
set_dropout(float const probability,
            std::shared_ptr<Tensor_attributes> seed,
            std::shared_ptr<Tensor_attributes> offset);

SDPA_fp8_attributes&
set_dropout(std::shared_ptr<Tensor_attributes> mask,
            std::shared_ptr<Tensor_attributes> scale);

Python API

Args:
    q (cudnn_tensor): The query data.
    k (cudnn_tensor): The key data.
    v (cudnn_tensor): The value data.
    descale_q (cudnn_tensor): Descale factor for query.
    descale_k (cudnn_tensor): Descale factor for key.
    descale_v (cudnn_tensor): Descale factor for value.
    descale_s (cudnn_tensor): Descale factor for S tensor.
    scale_s (cudnn_tensor): Scale factor for S tensor.
    scale_o (cudnn_tensor): Scale factor for output.
    is_inference (bool): Whether it is an inference step or training step.
    attn_scale (Optional[Union[float, cudnn_tensor]]): The scale factor for attention. Default is None.
    use_causal_mask (Optional[bool]): Whether to use causal mask. Default is False.
    compute_data_type (Optional[cudnn.data_type]): The data type for computation. Default is NOT_SET.
    name (Optional[str]): The name of the operation.

Returns:
    o (cudnn_tensor): The output data.
    stats (Optional[cudnn_tensor]): The softmax statistics in case the operation is in a training step.
    amax_s (cudnn_tensor): The absolute maximum of S tensor.
    amax_o (cudnn_tensor): The absolute maximum of output tensor.

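Scaled Dot Product Attention FP8 Backward

This operation computes the gradients for scaled dot product attention (SDPA) in the 8-bit floating point (FP8) data type, using the FlashAttention-2 algorithm as described in the paper FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. You are required to pass the stats tensor from the forward operation to the backward operation as an input.

C++ sample: samples/cpp/sdpa

Configurable Options

All the configurable options mentioned for the FP8 forward operation, including ragged tensors and GQA/MQA, also apply to the backward operation.

Tensors

The tensors in the backward operation are defined as follows

\(dV = S^TdO\)

\(dS = dOV^T\)

\(dP = \text{dSoftmax}(dS)\)

\(dQ = dPK\)

\(dK = dP^TQ\)

Input Tensors

Tensor Name	Device	Data Type	Dimensions
Q	GPU	E4M3 or E5M2	\((B, H_{q}, S_{q}, D_{qk})\)
K	GPU	E4M3 or E5M2	\((B, H_{k}, S_{kv}, D_{qk})\)
V	GPU	E4M3 or E5M2	\((B, H_{v}, S_{kv}, D_{v})\)
O	GPU	E4M3 or E5M2	\((B, H_{q}, S_{q}, D_{v})\)
dO	GPU	E4M3 or E5M2	\((B, H_{q}, S_{q}, D_{v})\)
Stats	GPU	FP32	\((B, H_{q}, S_{q}, 1)\)
Descale Q	GPU	FP32	\((1, 1, 1, 1)\)
Descale K	GPU	FP32	\((1, 1, 1, 1)\)
Descale V	GPU	FP32	\((1, 1, 1, 1)\)
Descale O	GPU	FP32	\((1, 1, 1, 1)\)
Descale dO	GPU	FP32	\((1, 1, 1, 1)\)
Descale S	GPU	FP32	\((1, 1, 1, 1)\)
Descale dP	GPU	FP32	\((1, 1, 1, 1)\)
Scale S	GPU	FP32	\((1, 1, 1, 1)\)
Scale dQ	GPU	FP32	\((1, 1, 1, 1)\)
Scale dK	GPU	FP32	\((1, 1, 1, 1)\)
Scale dV	GPU	FP32	\((1, 1, 1, 1)\)
Scale dP	GPU	FP32	\((1, 1, 1, 1)\)

Output Tensors

Tensor Name	Device	Data Type	Dimensions
dQ	GPU	E4M3 or E5M2	\((B, H_{q}, S_{q}, D_{qk})\)
dK	GPU	E4M3 or E5M2	\((B, H_{k}, S_{kv}, D_{qk})\)
dV	GPU	E4M3 or E5M2	\((B, H_{v}, S_{kv}, D_{v})\)
Amax dQ	GPU	FP32	\((1, 1, 1, 1)\)
Amax dK	GPU	FP32	\((1, 1, 1, 1)\)
Amax dV	GPU	FP32	\((1, 1, 1, 1)\)
Amax dP	GPU	FP32	\((1, 1, 1, 1)\)

Where

\(B\) is the batch size
\(H_{q}\) is the number of query heads
\(H_{k}\) is the number of key heads
\(H_{v}\) is the number of value heads
\(S_{q}\) is the sequence length of the query
\(S_{kv}\) is the sequence length of the key and value
\(D_{qk}\) is the embedding dimension per head of the query and key
\(D_{v}\) is the embedding dimension per head of the value

Limitations

All the limitations mentioned for the forward operation also apply to the backward operation.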

C++ API

// returns [dQ, dK, dV, amax_dQ, amax_dK, amax_dV, amax_dP]
std::array<std::shared_ptr<Tensor_attributes>, 7>
Graph::sdpa_fp8_backward(std::shared_ptr<Tensor_attributes> q,
                         std::shared_ptr<Tensor_attributes> k,
                         std::shared_ptr<Tensor_attributes> v,
                         std::shared_ptr<Tensor_attributes> o,
                         std::shared_ptr<Tensor_attributes> dO,
                         std::shared_ptr<Tensor_attributes> Stats,
                         std::shared_ptr<Tensor_attributes> descale_q,
                         std::shared_ptr<Tensor_attributes> descale_k,
                         std::shared_ptr<Tensor_attributes> descale_v,
                         std::shared_ptr<Tensor_attributes> descale_o,
                         std::shared_ptr<Tensor_attributes> descale_do,
                         std::shared_ptr<Tensor_attributes> descale_s,
                         std::shared_ptr<Tensor_attributes> descale_dp,
                         std::shared_ptr<Tensor_attributes> scale_s,
                         std::shared_ptr<Tensor_attributes> scale_dq,
                         std::shared_ptr<Tensor_attributes> scale_dk,
                         std::shared_ptr<Tensor_attributes> scale_dv,
                         std::shared_ptr<Tensor_attributes> scale_dp,
                         SDPA_fp8_backward_attributes attributes);

The attributes parameter of type SDPA_fp8_backward_attributes is used to control the attributes of the backward operation, as detailed below

SDPA_fp8_backward_attributes&
set_attn_scale(std::shared_ptr<Tensor_attributes> value);

SDPA_fp8_backward_attributes&
set_attn_scale(float const value);

SDPA_fp8_backward_attributes&
set_causal_mask(bool const value);

Python API

Args:
    q (cudnn_tensor): The query data.
    k (cudnn_tensor): The key data.
    v (cudnn_tensor): The value data.
    o (cudnn_tensor): The output data.
    dO (cudnn_tensor): The output gradient data.
    stats (cudnn_tensor): The softmax statistics in case the operation is in a training step.
    descale_q (cudnn_tensor): Descale factor for query.
    descale_k (cudnn_tensor): Descale factor for key.
    descale_v (cudnn_tensor): Descale factor for value.
    descale_o (cudnn_tensor): Descale factor for output.
    descale_dO (cudnn_tensor): Descale factor for output gradient.
    descale_s (cudnn_tensor): Descale factor for S tensor.
    descale_dP (cudnn_tensor): Descale factor for P gradient tensor.
    scale_s (cudnn_tensor): Scale factor for S tensor.
    scale_dQ (cudnn_tensor): Scale factor for query gradient.
    scale_dK (cudnn_tensor): Scale factor for key gradient.
    scale_dV (cudnn_tensor): Scale factor for value gradient.
    scale_dP (cudnn_tensor): Scale factor for dP gradient.
    attn_scale (Optional[Union[float, cudnn_tensor]]): The scale factor for attention. Default is None.
    use_causal_mask (Optional[bool]): Whether to use causal mask. Default is False.
    compute_data_type (Optional[cudnn.data_type]): The data type for computation. Default is NOT_SET.
    name (Optional[str]): The name of the operation.

Returns:
    dQ (cudnn_tensor): The query gradient data.
    dK (cudnn_tensor): The key gradient data.
    dV (cudnn_tensor): The value gradient data.
    amax_dQ (cudnn_tensor): The absolute maximum of query gradient tensor.
    amax_dK (cudnn_tensor): The absolute maximum of key gradient tensor.
    amax_dV (cudnn_tensor): The absolute maximum of value gradient tensor.
    amax_dP (cudnn_tensor): The absolute maximum of dP tensor.

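Supported Tensor Layouts

The cuDNN API expresses the layout of the \(Q\), \(K\), \(V\), and \(O\) tensors and their corresponding gradients in terms of strides.

For example, suppose \(Q\) has dims = \([5, 7, 4, 3]\) and strides = \([84, 12, 3, 1]\). The element at index \([i, j, k, l]\) can be accessed at position \(Q_{ptr} + i * 84 + j * 12 + k * 3 + l * 1\). Note how each stride is multiplied by its index to obtain the position of every element.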
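The stride arithmetic in this example can be reproduced with a short Python sketch. Both helpers, `dense_strides` and `element_offset`, are hypothetical names used only for illustration; the first computes row-major strides for a dense tensor, the second applies the index-times-stride sum.

```python
def dense_strides(dims):
    """Row-major (last dimension fastest) strides for a dense tensor."""
    strides = [1] * len(dims)
    for axis in range(len(dims) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * dims[axis + 1]
    return strides

def element_offset(index, strides):
    """Offset, in elements, of the given index from the tensor's base pointer."""
    return sum(i * s for i, s in zip(index, strides))

dims = [5, 7, 4, 3]
strides = dense_strides(dims)
offset = element_offset([1, 2, 3, 2], strides)      # 1*84 + 2*12 + 3*3 + 2*1
last = element_offset([4, 6, 3, 2], strides)        # last element of the tensor
```

The last element lands at offset 419, one less than the 420 elements the tensor holds, which confirms that these strides describe a dense, non-overlapping layout.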

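The following examples show common uses of the attention tensors and how to express them in cuDNN.

The following notation is used in the examples
\(B\) is the batch size
\(H_{q}\) is the number of query heads
\(H_{k}\) is the number of key heads
\(H_{v}\) is the number of value heads
\(S_{q}\) is the sequence length of the query
\(S_{kv}\) is the sequence length of the key and value
\(D_{qk}\) is the embedding dimension per head of the query and key
\(D_{v}\) is the embedding dimension per head of the value

Case 1: \(Q\), \(K\), \(V\), \(O\) are tensors in dense, non-overlapping memory
This is the basic case, in which you can specify dims and strides for each of \(Q\), \(K\), \(V\), and \(O\) in any stride order. The only restriction is that the stride of the last dimension (the per-head embedding dimension \(D_{qk}\) and \(D_v\)) must be 1.
For example, for \(Q\) with dims = \([B, H_q, S_q, D_{qk}]\), cuDNN supports layouts including (but not limited to)
- strides = \([S_q \times H_q \times D_{qk}, D_{qk}, H_q \times D_{qk}, 1]\), which is called the BSHD layout
- strides = \([H_q \times D_{qk}, D_{qk}, B \times H_q \times D_{qk}, 1]\), which is called the SBHD layout
Case 2: \(Q\), \(K\), \(V\) are tensors in a dense interleaved layout
In some cases, you may want to interleave the \(Q\), \(K\), \(V\) tensors to simplify the matrix multiplication that precedes the scaled dot product operation. For example, you can allocate a single tensor of size = \(3 \times B \times H \times S \times D\) and specify \(Q\), \(K\), \(V\) dims = \([B, H, S, D]\); cuDNN supports layouts including (but not limited to)
- strides = \([S \times 3 \times H \times D, D, 3 \times H \times D, 1]\), which is called BS3HD,
  with the \(Q\), \(K\), \(V\) pointers offset into the packed storage as follows
  \(Q_{ptr}\) = \(Storage_{ptr}\)
  \(K_{ptr}\) = \(Storage_{ptr} + 1 \times H \times D\)
  \(V_{ptr}\) = \(Storage_{ptr} + 2 \times H \times D\)
- strides = \([H \times 3 \times D, 3 \times D, B \times H \times 3 \times D, 1]\), which is called SBH3D,
  with the \(Q\), \(K\), \(V\) pointers offset into the packed storage as follows
  \(Q_{ptr}\) = \(Storage_{ptr}\)
  \(K_{ptr}\) = \(Storage_{ptr} + 1 \times D\)
  \(V_{ptr}\) = \(Storage_{ptr} + 2 \times D\)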
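One way to sanity-check the interleaved layouts above is to enumerate every element of \(Q\), \(K\), and \(V\) and confirm that together they cover the shared storage exactly once. The sketch below does this in plain Python for a small shape; the helper names `bs3hd`, `sbh3d`, and `touched_offsets` are hypothetical, and offsets are counted in elements.

```python
from itertools import product

def bs3hd(b, h, s, d):
    """Strides and base offsets for Q/K/V in the BS3HD layout."""
    strides = [s * 3 * h * d, d, 3 * h * d, 1]
    base = {"q": 0, "k": 1 * h * d, "v": 2 * h * d}
    return strides, base

def sbh3d(b, h, s, d):
    """Strides and base offsets for Q/K/V in the SBH3D layout."""
    strides = [h * 3 * d, 3 * d, b * h * 3 * d, 1]
    base = {"q": 0, "k": 1 * d, "v": 2 * d}
    return strides, base

def touched_offsets(dims, strides, base):
    """Every storage offset reached by any element of Q, K, or V."""
    offsets = []
    for start in base.values():
        for index in product(*(range(n) for n in dims)):
            offsets.append(start + sum(i * st for i, st in zip(index, strides)))
    return offsets

dims = [2, 2, 3, 4]                  # B, H, S, D
total = 3 * 2 * 2 * 3 * 4            # size of the shared allocation
covers = {}
for layout in (bs3hd, sbh3d):
    strides, base = layout(*dims)
    offsets = touched_offsets(dims, strides, base)
    # Exactly-once coverage: the sorted offsets are 0, 1, ..., total - 1.
    covers[layout.__name__] = sorted(offsets) == list(range(total))
```

If either layout left gaps or overlapped, the corresponding entry in `covers` would be `False`.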
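Case 3: \(Q\), \(K\), \(V\) are tensors where not all tokens are valid
Consider a Q tensor with two batches (\(B\) = 2) that hold sequences of different lengths, ["aa", "bbb"]. Assume the maximum sequence length \(S\) = 8 and the number of heads \(H = 1\). In this case, indicate the actual sequence length of each batch with a sequence length tensor seq_len = [2, 3], and pass it to the SDPA node using set_seq_len_q() and set_seq_len_kv(). Note that every element of the sequence length tensor must be no greater than the maximum sequence length \(S\).

cuDNN supports layouts for variable sequence lengths including (but not limited to)
- Fully padded layout
  Q[b=0] = aa000000
  Q[b=1] = bbb00000
  dims = \([B=2, H=1, S=8, D=64]\)
  strides = \([S \times H \times D=512, D=64, H \times D=64, 1]\)
  
  cuDNN reads the data according to the strides.
- Fully packed layout, also known as THD, where T = sum(seq_len)
  Q = aabbb
  dims = \([B=2, H=1, S=8, D=64]\)
  strides = \([S \times H \times D=512, D=64, H \times D=64, 1]\)
  
  The strides stay the same, but they are incorrect, because the second batch begins at 64*2. Therefore, you must set the ragged_offset tensor using the <tensor>.set_ragged_offset(<ragged_offset_tensor>) API. This is an integer tensor of size \(B + 1\) that indicates the starting position of each batch; its last element (the \(B + 1\)-th) is the end position of the last batch. For this case, ragged_offset should be [0, 2 * H * D, (2+3) * H * D] = [0, 128, 320]
  
  If this layout is used with bprop, it is recommended that you use set_max_total_seq_len_q to pass the maximum number of tokens, which is used to compute the workspace size. Otherwise, the maximum is assumed to be \(B \times S\).
- Valid tokens in each batch are packed together
  Q = aa00bbb0
  For this case, set the ragged offsets to [0, 4 * H * D, (4+3) * H * D] = [0, 256, 448]
- Valid tokens are not packed together
  Q = a0abbb00bb000000
  Ragged offsets are not sufficient to represent this case. This case is not supported.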
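The ragged offsets quoted in the supported cases above follow directly from where each batch starts in storage. The sketch below reproduces both sets of numbers; `ragged_offsets` is a hypothetical helper that takes the starting token position of each batch plus the length of the last sequence, and returns offsets in elements.

```python
def ragged_offsets(token_starts, last_seq_len, h, d):
    """Offsets (in elements) of each batch start, plus the end of the last batch.

    token_starts: starting token position of every batch within the storage.
    last_seq_len: number of valid tokens in the final batch.
    """
    positions = list(token_starts) + [token_starts[-1] + last_seq_len]
    return [p * h * d for p in positions]

H, D = 1, 64

# Fully packed (THD): "aabbb", batches start at tokens 0 and 2.
thd = ragged_offsets([0, 2], last_seq_len=3, h=H, d=D)

# Valid tokens packed per batch: "aa00bbb0", batches start at tokens 0 and 4.
per_batch = ragged_offsets([0, 4], last_seq_len=3, h=H, d=D)
```

The result has \(B + 1\) entries, matching the \((B + 1, 1, 1, 1)\) shape required for the ragged offset tensor.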

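cudnn Flex Attention API

The SDPA and SDPA backward operations now accept the functions set_score_mod and set_score_mod_bprop, which allow the attention score matrix to be modified. These functions can be used to program a subgraph of pointwise operations, which in turn is used to program the score modifier. Note that use of these functions is mutually exclusive with the ready-made options.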