Other Methods#
Get Memory Layout#
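The BLAS::get_layout_<gmem/smem>_<a/b/c>() functions return the global memory or shared memory CuTe layout for matrix A, B, or C, as determined by the matrix size, arrangement, and leading dimension. For shared memory layouts, a leading dimension that is not passed explicitly as an argument is inferred from the LeadingDimension operator. For global memory layouts, a custom leading dimension must be passed as a static or dynamic integral type; otherwise it is inferred from the matrix size.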
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c();
// Overloads for specifying the leading dimensions statically at compile time.
// integral_type can be either a signed or an unsigned integer type, and the
// leading dimension value (lda, ldb, ldc) is a constant of that type.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const std::integral_constant<integral_type, lda>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const std::integral_constant<integral_type, ldb>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const std::integral_constant<integral_type, ldc>);
// Overloads for specifying the leading dimensions at run time.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const unsigned int ldc);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c(const unsigned int ldc);
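BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a combination of a memory tag (global or shared) and the layout (cute::Layout) of matrix A, B, or C. The result can be passed directly to cublasdx::make_tensor to create a tensor.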
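BLAS::get_layout_<gmem/smem>_<a/b/c>() returns the matrix layout that corresponds to the order set through the Arrangement operator. For example, if the order of matrix A is set to cublasdx::row_major, the returned layout follows row-major order.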
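To make the roles of arrangement and leading dimension concrete, here is a minimal plain C++ sketch of how a (row, column) coordinate maps to a linear memory offset. It does not use cuBLASDx or CuTe; the `Order` enum, `offset`, and `check_offsets` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an arrangement setting; not a cuBLASDx type.
enum class Order { col_major, row_major };

// Linear offset of element (row, col) in a matrix stored with leading dimension ld.
// Column-major: elements of one column are adjacent, columns are ld elements apart.
// Row-major: elements of one row are adjacent, rows are ld elements apart.
constexpr std::size_t offset(Order order, std::size_t row, std::size_t col, std::size_t ld) {
    return order == Order::col_major ? row + col * ld
                                     : row * ld + col;
}

// Checks for a 4 x 3 matrix, element (1, 2).
inline void check_offsets() {
    // Tightly packed column-major storage: ld equals the number of rows (4).
    assert(offset(Order::col_major, 1, 2, 4) == 9);
    // Tightly packed row-major storage: ld equals the number of columns (3).
    assert(offset(Order::row_major, 1, 2, 3) == 5);
    // Padded column-major storage: a custom ld of 8 spreads the columns further apart.
    assert(offset(Order::col_major, 1, 2, 8) == 17);
}
```

A custom leading dimension changes only the distance between consecutive columns (or rows), which is why it can be supplied separately from the matrix size.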
If the leading dimension is only known at run time, pass it to the function as an argument, as in the following example.
Example
using BLAS = decltype(...);
extern __shared__ __align__(16) char smem[];
// a, b, c are pointers to global memory of input matrices A and B and output matrix C
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c());
// Alternative: leading dimensions specified at compile time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(std::integral_constant<int, lda>{}));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(std::integral_constant<int, ldb>{}));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(std::integral_constant<int, ldc>{}));
// Alternative: leading dimensions specified at run time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(ldc));
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem, lda, ldb, ldc);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c(ldc));
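The difference between the compile-time and run-time overloads used in the example can be sketched in plain C++. The snippet below is an illustration only and assumes nothing about how cuBLASDx implements its overloads; `col_major_offset` and `check_dynamic` are hypothetical helpers. It shows that a `std::integral_constant` argument carries the leading dimension in its type, so the value is available as a constant expression, while an `unsigned int` argument is an ordinary run-time value.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper, not part of cuBLASDx: column-major offset with a
// leading dimension passed as a run-time value.
constexpr std::size_t col_major_offset(std::size_t row, std::size_t col, unsigned int ld) {
    return row + col * static_cast<std::size_t>(ld);
}

// Overload taking the leading dimension as a type. LD is a template argument,
// so it is a compile-time constant inside the function and can be validated
// with static_assert.
template <class T, T LD>
constexpr std::size_t col_major_offset(std::size_t row, std::size_t col,
                                       std::integral_constant<T, LD>) {
    static_assert(LD > 0, "leading dimension must be positive");
    return row + col * static_cast<std::size_t>(LD);
}

// Static leading dimension: the whole call can be evaluated at compile time.
static_assert(col_major_offset(1, 2, std::integral_constant<int, 8>{}) == 17,
              "static leading dimension");

// Dynamic leading dimension: the value is only known when the function runs.
inline void check_dynamic(unsigned int ld) {
    assert(col_major_offset(1, 2, ld) == 1 + 2 * static_cast<std::size_t>(ld));
}
```

Both overloads compute the same offset; the static form simply gives the compiler more information to work with.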
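Data Partitioner#
A data partitioner is an object that knows the execution context and implementation details of a GEMM, and can infer from them which elements of matrix C are mapped to each thread. cuBLASDx uses partitioners as helper objects for partitioning global memory and shared memory tensors, and for getting, copying, modifying, and transforming register fragments.
For more information about the available partitioner functionality, see Partitioner and Register Fragment Tensors.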
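As a rough mental model of what "mapping elements of C to threads" means, consider the toy partitioner below. It is a sketch in plain C++ with an invented round-robin scheme; `ToyPartitioner` and its members are hypothetical and do not reflect the mapping or the interface that cuBLASDx actually uses.

```cpp
#include <cassert>
#include <cstddef>

// Toy partitioner (hypothetical, not the real cuBLASDx mapping): the m x n
// elements of C are numbered in column-major order and dealt out to threads
// round-robin, so thread t owns linear indices t, t + threads, t + 2 * threads, ...
struct ToyPartitioner {
    std::size_t m, n, threads;

    struct Coord { std::size_t row, col; };

    // Number of C elements owned by thread t (assumes t < threads).
    std::size_t fragment_size(std::size_t t) const {
        const std::size_t total = m * n;
        return t < total ? (total - t + threads - 1) / threads : 0;
    }

    // (row, col) coordinates in C of the i-th element of thread t's fragment
    // (assumes i < fragment_size(t)).
    Coord coord(std::size_t t, std::size_t i) const {
        const std::size_t linear = t + i * threads;
        return {linear % m, linear / m};
    }
};

// Every element of a 4 x 3 matrix is owned by exactly one of 8 threads.
inline void check_toy_partitioner() {
    const ToyPartitioner p{4, 3, 8};
    std::size_t owned = 0;
    for (std::size_t t = 0; t < p.threads; ++t) owned += p.fragment_size(t);
    assert(owned == p.m * p.n);
}
```

With such a mapping in hand, each thread can keep its own elements of C in registers and later write them back to the matching coordinates, which is the kind of bookkeeping a partitioner takes care of.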
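Get Data Partitioner#
The default data partitioner is intended for the non-suggested execution context, that is, for use together with the layouts returned by get_layout_smem_<a/b/c>(). Mixing contexts may degrade performance.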
// Get layouts
auto a_smem_layout = BLAS::get_layout_smem_a();
auto b_smem_layout = BLAS::get_layout_smem_b();
// Get partitioner
auto partitioner = BLAS::get_partitioner();
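Suggest Data Partitioner#
The suggested data partitioner is intended for the suggested execution context, that is, for use together with the layouts returned by suggest_layout_smem_<a/b/c>(). Mixing contexts may degrade performance.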
// Suggest layouts
auto a_smem_layout = BLAS::suggest_layout_smem_a();
auto b_smem_layout = BLAS::suggest_layout_smem_b();
// Suggest partitioner
auto partitioner = BLAS::suggest_partitioner();