Other Methods#

Get Memory Layout#

The BLAS::get_layout_<gmem/smem>_<a/b/c>() functions return the global or shared memory CuTe layout of matrix A, B, or C, determined by the matrix size, arrangement, and leading dimensions. For shared memory layouts, if the leading dimensions are not passed explicitly as arguments, they are deduced from the LeadingDimension operator. For global memory layouts, custom leading dimensions must be passed as static or dynamic integral types; otherwise they are deduced from the matrix size.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c();

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c();

// Overloads for specifying the leading dimensions statically, at compile time.
// integral_type can be any signed or unsigned integer type; the leading
// dimension values (lda/ldb/ldc) must be values of that type.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const std::integral_constant<integral_type, lda>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const std::integral_constant<integral_type, ldb>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const std::integral_constant<integral_type, ldc>);

// Overloads for specifying the leading dimensions at execution time.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const unsigned int ldc);

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c(const unsigned int ldc);

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a combination of a memory tag (global or shared) and the layout (cute::Layout) of matrix A, B, or C, which can be passed directly to cublasdx::make_tensor to create a tensor.

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns matrix layouts that match the order set via the Arrangement operator. For example, if the order of matrix A is set to cublasdx::row_major, the returned layout follows row-major order.

If the user provides dynamic leading dimensions at execution time, these functions accept the leading dimension as an argument; see the example below.

Example

using BLAS = decltype(...);

extern __shared__ __align__(16) char smem[];

// a, b, c are pointers to global memory of input matrices A and B and output matrix C
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());

auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c());

// With leading dimensions specified during the compilation time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(std::integral_constant<int, lda>{}));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(std::integral_constant<int, ldb>{}));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(std::integral_constant<int, ldc>{}));

// With leading dimensions specified during the execution time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(ldc));

auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem, lda, ldb, ldc);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c(ldc));

Suggested Shared Memory Layout#

In addition to the get_layout_smem_* functions, cuBLASDx provides functions that return suggested custom shared memory layouts for matrices A, B, and C, determined by the value type, matrix size, arrangement, alignment, block size, and GPU architecture. These suggested layouts are designed to improve the performance of both the matrix multiplication itself and the copy operations between shared and global memory, and therefore depend on the arrangement of the matrices. The suggested layouts ignore the leading dimensions set with the LeadingDimension operator. The largest improvements are expected for row-major A and column-major B.

__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_c();

BLAS::suggest_layout_smem_<a/b/c>() returns a combination of a shared memory tag and the layout (cute::Layout) of A/B/C, which can be passed directly to cublasdx::make_tensor to create a tensor.

Example

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float, float, double>());

extern __shared__ __align__(16) char smem[];

// Slice shared memory into pointers for the A, B, and C matrices
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);

// Create suggested shared memory layout for optimal performance
auto suggested_smem_a = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto suggested_smem_b = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto suggested_smem_c = cublasdx::make_tensor(smem_c, BLAS::suggest_layout_smem_c());
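A typical next step is to fill the suggested shared memory tensors from global memory before running the GEMM. The kernel sketch below assumes the cublasdx::copy / cublasdx::copy_wait cooperative copy API and the alignment_of trait; the pointer names (a, b) and the value-type alias are placeholders, so treat this as an outline rather than a complete kernel.

```cuda
// Hypothetical kernel sketch: load A and B from global memory into the
// suggested shared memory layouts before executing the GEMM.
template<class BLAS, class ValueType = typename BLAS::c_value_type>
__global__ void gemm_kernel(const ValueType* a, const ValueType* b, ValueType* /* c */) {
    extern __shared__ __align__(16) char smem[];

    auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());

    auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());

    // Cooperative, block-wide copies from global memory into the suggested layouts.
    using alignment = cublasdx::alignment_of<BLAS>;
    cublasdx::copy<BLAS, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<BLAS, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();

    // ... execute BLAS on the shared memory tensors and copy the results back ...
}
```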

Data Partitioner#

A data partitioner is an object that knows the execution context and implementation details of a GEMM, and can therefore infer which elements of the C matrix map to any given thread. cuBLASDx uses partitioners as helper objects for partitioning global and shared memory tensors, and for getting, copying, modifying, and transforming register fragments.

For more information on the available partitioner functionality, see Partitioner and Register Fragment Tensors.

Get Data Partitioner#

The default data partitioner is intended for the non-suggested execution context. Mixing contexts may lead to performance degradation.

// Get layouts
auto a_smem_layout = BLAS::get_layout_smem_a();
auto b_smem_layout = BLAS::get_layout_smem_b();

// Get partitioner
auto partitioner = BLAS::get_partitioner();

Suggest Data Partitioner#

The suggested data partitioner is intended for the suggested execution context. Mixing contexts may lead to performance degradation.

// Suggest layouts
auto a_smem_layout = BLAS::suggest_layout_smem_a();
auto b_smem_layout = BLAS::suggest_layout_smem_b();

// Suggest partitioner
auto partitioner = BLAS::suggest_partitioner();
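Once obtained, a partitioner is typically used to move this thread's share of the C matrix between memory and a register fragment. The device-code sketch below assumes the register-fragment API (make_accumulator_fragment, cublasdx::copy_fragment, alignment_of) and a global memory pointer c from the surrounding kernel; exact signatures may differ, so consult the Partitioner and Register Fragment Tensors section before relying on them.

```cuda
// Hypothetical sketch (inside a kernel): move C through a register fragment
// using the suggested partitioner.
auto partitioner = BLAS::suggest_partitioner();

// Per-thread register fragment holding this thread's elements of C.
auto c_frag = partitioner.make_accumulator_fragment();

// Load this thread's share of C from global memory into registers,
// then store it back after the computation.
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy_fragment<alignment::c>(c_global_tensor, c_frag, partitioner);
// ... accumulate GEMM results into c_frag ...
cublasdx::copy_fragment<alignment::c>(c_frag, c_global_tensor, partitioner);
```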