Just-In-Time (JIT) Compilation
This section describes the just-in-time (JIT) compilation feature. It allows users to compile dedicated kernels that maximize the performance of a specific operation.
The complexity of a given contraction (e.g., the number and order of its modes, the number of contracted modes) determines the size of its kernel search space (i.e., the set of candidate kernels that could be used to perform the contraction). As complexity increases, this search space can become very large. Our pre-compiled kernels are carefully selected to perform well across a wide range of contractions; however, as the complexity of a contraction grows, a just-in-time-compiled kernel tailored to that contraction can outperform any kernel drawn from a fixed-size set of pre-compiled kernels.
JIT compilation overcomes this limitation by creating a kernel that can better exploit the optimization opportunities available for the given contraction.
Compiling a kernel typically costs 1-8 seconds (depending on the kernel and the host CPU); this cost is incurred only once per kernel, during the planning phase, and the kernel can be reused by subsequent contractions (i.e., kernels are cached automatically once compiled).
All JIT-compiled kernels are added to a kernel cache that is accessible library-wide (i.e., shared across library handles). We provide functions to write this kernel cache to disk and read it back, to avoid paying the cost of JIT-compiling the same kernels again when a program is rerun.
The remainder of this section assumes familiarity with the Getting Started guide.
Note
The JIT compilation feature is only available for GPUs of compute capability greater than or equal to 8.0 (Ampere or newer). Furthermore, this feature is currently limited to tensor contractions.
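As a minimal sketch (not part of the original examples, and assuming cuda_runtime.h is included as in the listings below), the compute capability can be queried at run time to decide whether JIT compilation should be requested:
int device = 0;
cudaDeviceProp prop;
if (cudaGetDeviceProperties(&prop, device) == cudaSuccess && prop.major >= 8)
{
    // Ampere (compute capability 8.x) or newer: JIT compilation is available,
    // so CUTENSOR_JIT_MODE_DEFAULT may be passed to cutensorCreatePlanPreference().
}
else
{
    // Older GPU: fall back to CUTENSOR_JIT_MODE_NONE (pre-compiled kernels).
}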
Introductory Example
This subsection provides a basic overview of the API calls and features related to JIT compilation.
We start by computing a contraction using the same steps described in the Getting Started guide, but with a different contraction whose larger number of modes highlights the advantage of JIT compilation. We then describe the modifications required to enable JIT compilation and compare the performance of the pre-compiled and the JIT-compiled kernels.
Our code computes the contraction
C[modeC] = alpha * A[modeA] * B[modeB] + beta * C[modeC]
with modeA = {0, 1, 2, …, 24}, modeB = {25, 26, 27, 28, 29, 30, 5, 7, 11, 13, 16, 18, 22}, and modeC = {0, 1, 2, 3, 4, 6, 8, 9, 25, 26, 10, 12, 14, 27, 15, 28, 17, 19, 29, 20, 21, 30, 23, 24} (note that we now name each mode with a number rather than a letter, since the number of modes exceeds the number of letters in the Latin alphabet).
All operands hold single-precision complex values, and the computation is performed using emulated single-precision arithmetic (3XTF32). All modes have an extent of 2.
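Since every mode has an extent of 2, the memory footprint is easy to work out: A has 25 modes and therefore 2^25 = 33,554,432 complex values (256 MiB at 8 bytes per value), C has 24 modes (128 MiB), and B has 13 modes (64 KiB), for a total of roughly 0.38 GiB; this matches the "Total memory" value printed by the example below.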
The steps for computing the above contraction are identical to those of the Getting Started guide and are listed below:
#include <chrono>
#include <complex>
#include <stdlib.h>
#include <stdio.h>
#include <unordered_map>
#include <vector>
#include <cuda_runtime.h>
#include <cutensor.h>
// Handle cuTENSOR errors
#define HANDLE_ERROR(x) { \
const auto err = x; \
if( err != CUTENSOR_STATUS_SUCCESS ) \
{ printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
};
// Handle CUDA errors
#define HANDLE_CUDA_ERROR(x) { \
const auto err = x; \
if( err != cudaSuccess ) \
{ printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); exit(-1); } \
};
class CPUTimer
{
public:
void start()
{
start_ = std::chrono::steady_clock::now();
}
double seconds()
{
end_ = std::chrono::steady_clock::now();
elapsed_ = end_ - start_;
//return in ms
return elapsed_.count() * 1000;
}
private:
typedef std::chrono::steady_clock::time_point tp;
tp start_;
tp end_;
std::chrono::duration<double> elapsed_;
};
struct GPUTimer
{
GPUTimer()
{
cudaEventCreate(&start_);
cudaEventCreate(&stop_);
cudaEventRecord(start_, 0);
}
~GPUTimer()
{
cudaEventDestroy(start_);
cudaEventDestroy(stop_);
}
void start()
{
cudaEventRecord(start_, 0);
}
float seconds()
{
cudaEventRecord(stop_, 0);
cudaEventSynchronize(stop_);
float time;
cudaEventElapsedTime(&time, start_, stop_);
return time * 1e-3;
}
private:
cudaEvent_t start_, stop_;
};
int main()
{
typedef std::complex<float> TypeA;
typedef std::complex<float> TypeB;
typedef std::complex<float> TypeC;
typedef std::complex<float> TypeScalar;
auto alpha = TypeScalar(1.1, 0.0);
auto beta = TypeScalar(0.0, 0.0);
// CUDA types
cutensorDataType_t typeA = CUTENSOR_C_32F;
cutensorDataType_t typeB = CUTENSOR_C_32F;
cutensorDataType_t typeC = CUTENSOR_C_32F;
cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_3XTF32;
/* ***************************** */
// Create vector of modes
std::vector<int> modeC{0,1,2,3,4,6,8,9,25,26,10,12,14,27,15,28,17,19,29,20,21,30,23,24};
std::vector<int> modeA{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24};
std::vector<int> modeB{25,26,27,28,29,30,5,7,11,13,16,18,22};
int nmodeA = modeA.size();
int nmodeB = modeB.size();
int nmodeC = modeC.size();
// Extents
std::unordered_map<int, int64_t> extent;
for (auto i = 0; i <= 30; i++)
extent[i] = 2;
// Create a vector of extents for each tensor
std::vector<int64_t> extentC;
for (auto mode : modeC)
extentC.push_back(extent[mode]);
std::vector<int64_t> extentA;
for (auto mode : modeA)
extentA.push_back(extent[mode]);
std::vector<int64_t> extentB;
for (auto mode : modeB)
extentB.push_back(extent[mode]);
/**********************
* Allocating data
**********************/
// Number of elements of each tensor
size_t elementsA = 1;
for (auto mode : modeA)
elementsA *= extent[mode];
size_t elementsB = 1;
for (auto mode : modeB)
elementsB *= extent[mode];
size_t elementsC = 1;
for (auto mode : modeC)
elementsC *= extent[mode];
// Size in bytes
size_t sizeA = sizeof(TypeA) * elementsA;
size_t sizeB = sizeof(TypeB) * elementsB;
size_t sizeC = sizeof(TypeC) * elementsC;
printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
// Allocate on device
void *A_d, *B_d, *C_d;
HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
// Allocate on host
TypeA *A = (TypeA*) malloc(sizeof(TypeA) * elementsA);
TypeB *B = (TypeB*) malloc(sizeof(TypeB) * elementsB);
TypeC *C = (TypeC*) malloc(sizeof(TypeC) * elementsC);
if (A == nullptr || B == nullptr || C == nullptr)
{
printf("Error: Host allocation of A, B, or C.\n");
exit(-1);
}
/*******************
* Initialize data
*******************/
for (int64_t i = 0; i < elementsA; i++)
A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for (int64_t i = 0; i < elementsB; i++)
B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for (int64_t i = 0; i < elementsC; i++)
C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
// Copy to device
HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
/*************************
* cuTENSOR
*************************/
// Initialize cuTENSOR library
cutensorHandle_t handle;
HANDLE_ERROR(cutensorCreate(&handle));
/**********************
* Create Tensor Descriptors
**********************/
cutensorTensorDescriptor_t descA;
HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
&descA,
nmodeA,
extentA.data(),
NULL,/*stride*/
typeA, kAlignment));
cutensorTensorDescriptor_t descB;
HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
&descB,
nmodeB,
extentB.data(),
NULL,/*stride*/
typeB, kAlignment));
cutensorTensorDescriptor_t descC;
HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
&descC,
nmodeC,
extentC.data(),
NULL,/*stride*/
typeC, kAlignment));
/*******************************
* Create Contraction Descriptor
*******************************/
cutensorOperationDescriptor_t desc;
HANDLE_ERROR(cutensorCreateContraction(handle,
&desc,
descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
descC, modeC.data(),
descCompute));
/**************************
* Set the algorithm to use -- without just-in-time compilation
***************************/
const cutensorAlgo_t algo = CUTENSOR_ALGO_GETT;
cutensorPlanPreference_t planPref;
HANDLE_ERROR(cutensorCreatePlanPreference(handle,
&planPref,
algo,
CUTENSOR_JIT_MODE_NONE));
/**********************
* Query workspace estimate
**********************/
uint64_t workspaceSizeEstimate = 0;
const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
desc,
planPref,
workspacePref,
&workspaceSizeEstimate));
// Allocate workspace
void *work = nullptr;
if (workspaceSizeEstimate > 0)
{
HANDLE_CUDA_ERROR(cudaMalloc(&work, workspaceSizeEstimate));
}
/**************************
* Create Contraction Plan -- without just-in-time compilation
**************************/
cutensorPlan_t plan;
HANDLE_ERROR(cutensorCreatePlan(handle,
&plan,
desc,
planPref,
workspaceSizeEstimate));
/**********************
* Execute the tensor contraction
**********************/
cudaStream_t stream;
HANDLE_CUDA_ERROR(cudaStreamCreate(&stream));
double minTimeCUTENSOR = 1e100;
for (int i=0; i < 3; ++i)
{
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
// Set up timing
GPUTimer timer;
timer.start();
HANDLE_ERROR(cutensorContract(handle,
plan,
(void*) &alpha, A_d, B_d,
(void*) &beta, C_d, C_d,
work, workspaceSizeEstimate, stream))
// Synchronize and measure timing
auto time = timer.seconds();
minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
}
/*************************/
float flops;
HANDLE_ERROR(cutensorOperationDescriptorGetAttribute(handle,
desc,
CUTENSOR_OPERATION_DESCRIPTOR_FLOPS,
(void*)&flops,
sizeof(flops)));
auto gflops = flops / 1e9;
auto gflopsPerSec = gflops / minTimeCUTENSOR;
printf("cuTENSOR : %6.0f GFLOPs/s\n", gflopsPerSec);
HANDLE_ERROR(cutensorDestroy(handle));
HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
HANDLE_CUDA_ERROR(cudaStreamDestroy(stream));
HANDLE_ERROR(cutensorDestroyPlanPreference(planPref));
HANDLE_ERROR(cutensorDestroyPlan(plan));
if (A) free(A);
if (B) free(B);
if (C) free(C);
if (A_d) cudaFree(A_d);
if (B_d) cudaFree(B_d);
if (C_d) cudaFree(C_d);
if (work) cudaFree(work);
printf("Successful completion\n");
return 0;
}
All that is required to enable JIT compilation is to change the last argument of cutensorCreatePlanPreference() from CUTENSOR_JIT_MODE_NONE to CUTENSOR_JIT_MODE_DEFAULT; no further changes are needed:
cutensorPlanPreference_t planPrefJit;
cutensorCreatePlanPreference(handle,
&planPrefJit,
algo,
CUTENSOR_JIT_MODE_DEFAULT);
The kernel is compiled during the call to cutensorCreatePlan(). This call is blocking, and the compilation may take a few seconds. Once the plan has been created, the compiled kernel is stored in the kernel cache (see Reading and Writing the Kernel Cache from/to Disk) and is used by subsequent calls to cutensorContract().
Note
To reuse a JIT-compiled kernel in subsequent contractions (with a different plan), the user must again set CUTENSOR_JIT_MODE_DEFAULT when calling cutensorCreatePlanPreference(); otherwise, pre-compiled kernels are used.
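To make this concrete, the following sketch shows the JIT path in isolation; it assumes that handle, desc, algo, and the HANDLE_ERROR macro from the example above are already in scope:
cutensorPlanPreference_t planPrefJit;
HANDLE_ERROR(cutensorCreatePlanPreference(handle, &planPrefJit, algo,
                                          CUTENSOR_JIT_MODE_DEFAULT));

uint64_t workspaceSizeJit = 0;
HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle, desc, planPrefJit,
                                           CUTENSOR_WORKSPACE_DEFAULT, &workspaceSizeJit));

cutensorPlan_t planJit;
// Blocking call: the specialized kernel is JIT-compiled here and added to the kernel cache.
HANDLE_ERROR(cutensorCreatePlan(handle, &planJit, desc, planPrefJit, workspaceSizeJit));

// Any later plan that should pick up the cached kernel must again be created from a
// plan preference that requests CUTENSOR_JIT_MODE_DEFAULT.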
The complete, runnable example follows:
#include <chrono>
#include <complex>
#include <stdlib.h>
#include <stdio.h>
#include <unordered_map>
#include <vector>
#include <cuda_runtime.h>
#include <cutensor.h>
// Handle cuTENSOR errors
#define HANDLE_ERROR(x) { \
const auto err = x; \
if( err != CUTENSOR_STATUS_SUCCESS ) \
{ printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
};
// Handle CUDA errors
#define HANDLE_CUDA_ERROR(x) { \
const auto err = x; \
if( err != cudaSuccess ) \
{ printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); exit(-1); } \
};
class CPUTimer
{
public:
void start()
{
start_ = std::chrono::steady_clock::now();
}
double seconds()
{
end_ = std::chrono::steady_clock::now();
elapsed_ = end_ - start_;
//return in ms
return elapsed_.count() * 1000;
}
private:
typedef std::chrono::steady_clock::time_point tp;
tp start_;
tp end_;
std::chrono::duration<double> elapsed_;
};
struct GPUTimer
{
GPUTimer()
{
cudaEventCreate(&start_);
cudaEventCreate(&stop_);
cudaEventRecord(start_, 0);
}
~GPUTimer()
{
cudaEventDestroy(start_);
cudaEventDestroy(stop_);
}
void start()
{
cudaEventRecord(start_, 0);
}
float seconds()
{
cudaEventRecord(stop_, 0);
cudaEventSynchronize(stop_);
float time;
cudaEventElapsedTime(&time, start_, stop_);
return time * 1e-3;
}
private:
cudaEvent_t start_, stop_;
};
int main()
{
typedef std::complex<float> TypeA;
typedef std::complex<float> TypeB;
typedef std::complex<float> TypeC;
typedef std::complex<float> TypeScalar;
auto alpha = TypeScalar(1.1, 0.0);
auto beta = TypeScalar(0.0, 0.0);
// CUDA types
cutensorDataType_t typeA = CUTENSOR_C_32F;
cutensorDataType_t typeB = CUTENSOR_C_32F;
cutensorDataType_t typeC = CUTENSOR_C_32F;
cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_3XTF32;
/* ***************************** */
// Create vector of modes
std::vector<int> modeC{0,1,2,3,4,6,8,9,25,26,10,12,14,27,15,28,17,19,29,20,21,30,23,24};
std::vector<int> modeA{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24};
std::vector<int> modeB{25,26,27,28,29,30,5,7,11,13,16,18,22};
int nmodeA = modeA.size();
int nmodeB = modeB.size();
int nmodeC = modeC.size();
// Extents
std::unordered_map<int, int64_t> extent;
for (auto i = 0; i <= 30; i++)
extent[i] = 2;
// Create a vector of extents for each tensor
std::vector<int64_t> extentC;
for (auto mode : modeC)
extentC.push_back(extent[mode]);
std::vector<int64_t> extentA;
for (auto mode : modeA)
extentA.push_back(extent[mode]);
std::vector<int64_t> extentB;
for (auto mode : modeB)
extentB.push_back(extent[mode]);
/**********************
* Allocating data
**********************/
// Number of elements of each tensor
size_t elementsA = 1;
for (auto mode : modeA)
elementsA *= extent[mode];
size_t elementsB = 1;
for (auto mode : modeB)
elementsB *= extent[mode];
size_t elementsC = 1;
for (auto mode : modeC)
elementsC *= extent[mode];
// Size in bytes
size_t sizeA = sizeof(TypeA) * elementsA;
size_t sizeB = sizeof(TypeB) * elementsB;
size_t sizeC = sizeof(TypeC) * elementsC;
printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
// Allocate on device
void *A_d, *B_d, *C_d;
HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
// Allocate on host
TypeA *A = (TypeA*) malloc(sizeof(TypeA) * elementsA);
TypeB *B = (TypeB*) malloc(sizeof(TypeB) * elementsB);
TypeC *C = (TypeC*) malloc(sizeof(TypeC) * elementsC);
if (A == nullptr || B == nullptr || C == nullptr)
{
printf("Error: Host allocation of A, B, or C.\n");
exit(-1);
}
/*******************
* Initialize data
*******************/
for (int64_t i = 0; i < elementsA; i++)
A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for (int64_t i = 0; i < elementsB; i++)
B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for (int64_t i = 0; i < elementsC; i++)
C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
// Copy to device
HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
/*************************
* cuTENSOR
*************************/
// Initialize cuTENSOR library
cutensorHandle_t handle;
HANDLE_ERROR(cutensorCreate(&handle));
/**********************
* Create Tensor Descriptors
**********************/
cutensorTensorDescriptor_t descA;
HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
&descA,
nmodeA,
extentA.data(),
NULL,/*stride*/
typeA, kAlignment));
cutensorTensorDescriptor_t descB;
HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
&descB,
nmodeB,
extentB.data(),
NULL,/*stride*/
typeB, kAlignment));
cutensorTensorDescriptor_t descC;
HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
&descC,
nmodeC,
extentC.data(),
NULL,/*stride*/
typeC, kAlignment));
/*******************************
* Create Contraction Descriptor
*******************************/
cutensorOperationDescriptor_t desc;
HANDLE_ERROR(cutensorCreateContraction(handle,
&desc,
descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
descC, modeC.data(),
descCompute));
/**************************
* Set the algorithm to use -- without just-in-time compilation
***************************/
const cutensorAlgo_t algo = CUTENSOR_ALGO_GETT;
cutensorPlanPreference_t planPref;
HANDLE_ERROR(cutensorCreatePlanPreference(handle,
&planPref,
algo,
CUTENSOR_JIT_MODE_NONE));
/**********************
* Query workspace estimate
**********************/
uint64_t workspaceSizeEstimate = 0;
const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
desc,
planPref,
workspacePref,
&workspaceSizeEstimate));
// Allocate workspace
void *work = nullptr;
if (workspaceSizeEstimate > 0)
{
HANDLE_CUDA_ERROR(cudaMalloc(&work, workspaceSizeEstimate));
}
/**************************
* Create Contraction Plan -- without just-in-time compilation
**************************/
cutensorPlan_t plan;
HANDLE_ERROR(cutensorCreatePlan(handle,
&plan,
desc,
planPref,
workspaceSizeEstimate));
/**********************
* Execute the tensor contraction
**********************/
cudaStream_t stream;
HANDLE_CUDA_ERROR(cudaStreamCreate(&stream));
double minTimeCUTENSOR = 1e100;
for (int i=0; i < 3; ++i)
{
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
// Set up timing
GPUTimer timer;
timer.start();
HANDLE_ERROR(cutensorContract(handle,
plan,
(void*) &alpha, A_d, B_d,
(void*) &beta, C_d, C_d,
work, workspaceSizeEstimate, stream))
// Synchronize and measure timing
auto time = timer.seconds();
minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
}
/*************************/
/**************************
* Set the algorithm to use -- with just-in-time compilation
***************************/
cutensorPlanPreference_t planPrefJit;
HANDLE_ERROR(cutensorCreatePlanPreference(handle,
&planPrefJit,
algo,
CUTENSOR_JIT_MODE_DEFAULT));
/**********************
* Query workspace estimate
**********************/
uint64_t workspaceSizeEstimateJit = 0;
const cutensorWorksizePreference_t workspacePrefJit = CUTENSOR_WORKSPACE_DEFAULT;
HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
desc,
planPrefJit,
workspacePrefJit,
&workspaceSizeEstimateJit));
// Allocate workspace
void *workJit = nullptr;
if (workspaceSizeEstimateJit > 0)
{
HANDLE_CUDA_ERROR(cudaMalloc(&workJit, workspaceSizeEstimateJit));
}
/**************************
* Create Contraction Plan -- with just-in-time compilation
**************************/
cutensorPlan_t planJit;
CPUTimer jitPlanTimer;
jitPlanTimer.start();
// This is where the kernel is actually compiled
HANDLE_ERROR(cutensorCreatePlan(handle,
&planJit,
desc,
planPrefJit,
workspaceSizeEstimateJit));
auto jitPlanTime = jitPlanTimer.seconds();
/**********************
* Execute the tensor contraction using the JIT compiled kernel
**********************/
double minTimeCUTENSORJit = 1e100;
for (int i=0; i < 3; ++i)
{
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
// Set up timing
GPUTimer timer;
timer.start();
HANDLE_ERROR(cutensorContract(handle,
planJit,
(void*) &alpha, A_d, B_d,
(void*) &beta, C_d, C_d,
workJit, workspaceSizeEstimateJit, stream))
// Synchronize and measure timing
auto time = timer.seconds();
minTimeCUTENSORJit = (minTimeCUTENSORJit < time) ? minTimeCUTENSORJit : time;
}
/*************************/
float flops;
HANDLE_ERROR(cutensorOperationDescriptorGetAttribute(handle,
desc,
CUTENSOR_OPERATION_DESCRIPTOR_FLOPS,
(void*)&flops,
sizeof(flops)));
auto gflops = flops / 1e9;
auto gflopsPerSec = gflops / minTimeCUTENSOR;
auto gflopsPerSecJit = gflops / minTimeCUTENSORJit;
printf("cuTENSOR : %6.0f GFLOPs/s\n", gflopsPerSec);
printf("cuTENSOR JIT: %6.0f GFLOPs/s\n", gflopsPerSecJit);
printf("Speedup: %.1fx\n", gflopsPerSecJit / gflopsPerSec);
printf("JIT Compilation time: %.1f seconds\n", jitPlanTime / 1e3);
HANDLE_ERROR(cutensorDestroy(handle));
HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
HANDLE_CUDA_ERROR(cudaStreamDestroy(stream));
HANDLE_ERROR(cutensorDestroyPlanPreference(planPref));
HANDLE_ERROR(cutensorDestroyPlan(plan));
HANDLE_ERROR(cutensorDestroyPlanPreference(planPrefJit));
HANDLE_ERROR(cutensorDestroyPlan(planJit));
if (A) free(A);
if (B) free(B);
if (C) free(C);
if (A_d) cudaFree(A_d);
if (B_d) cudaFree(B_d);
if (C_d) cudaFree(C_d);
if (work) cudaFree(work);
if (workJit) cudaFree(workJit);
printf("Successful completion\n");
return 0;
}
The output of the above program on an NVIDIA H100 PCIe GPU is:
cuTENSOR : 774 GFLOPs/s
cuTENSOR JIT: 5374 GFLOPs/s
Speedup: 6.9x
JIT Compilation time: 8.3 seconds
This concludes the introductory example.
Reading and Writing the Kernel Cache from/to Disk
JIT compilation can take a significant amount of time, in particular when it is applied to tens or hundreds of distinct contractions. To amortize this overhead, we provide the ability to write the kernel cache to disk once all plans have been created. That way, the compilation cost of each plan is incurred only once across repeated executions of a program.
Note
The kernel cache file stores information about the cuTENSOR version (CUTENSOR_VERSION), the CUDA version on the system (CUDA_VERSION), and the GPU model (GPU_MODEL) that were used. For a kernel cache file to be read successfully, all three values must match exactly on the target system.
To write the kernel cache to a file once all plans that use JIT compilation have been created, use the cutensorWriteKernelCacheToFile() function:
cutensorWriteKernelCacheToFile(handle, "kernelCache.bin");
To read the file and load the kernels into a running cuTENSOR instance, simply use the cutensorReadKernelCacheFromFile() function:
cutensorReadKernelCacheFromFile(handle, "kernelCache.bin");
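On the first run the cache file will typically not exist yet, so it is good practice to inspect the returned status; a minimal sketch, mirroring the check used in the full example below (and reusing its HANDLE_ERROR macro):
const auto readStatus = cutensorReadKernelCacheFromFile(handle, "kernelCache.bin");
if (readStatus == CUTENSOR_STATUS_IO_ERROR)
    printf("No kernel cache found; kernels will be JIT-compiled during this run.\n");
else if (readStatus != CUTENSOR_STATUS_SUCCESS)
    HANDLE_ERROR(readStatus);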
Note
After reading the kernel cache from a file, the user must still enable JIT compilation during cutensorCreatePlanPreference() (by providing the CUTENSOR_JIT_MODE_DEFAULT argument) for the contractions that should use the previously JIT-compiled kernels.
Below we repeat the Introductory Example, except that at line 188 we check whether a kernel cache file can be read, and at line 399 we write the kernel cache to a file. On the second execution of this example the kernel cache is read and the compilation is avoided.
1#include <chrono>
2#include <complex>
3#include <stdlib.h>
4#include <stdio.h>
5#include <unordered_map>
6#include <vector>
7
8#include <cuda_runtime.h>
9#include <cutensor.h>
10
11// Handle cuTENSOR errors
12#define HANDLE_ERROR(x) { \
13 const auto err = x; \
14 if( err != CUTENSOR_STATUS_SUCCESS ) \
15 { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
16};
17
18// Handle CUDA errors
19#define HANDLE_CUDA_ERROR(x) { \
20 const auto err = x; \
21 if( err != cudaSuccess ) \
22 { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); exit(-1); } \
23};
24
25class CPUTimer
26{
27public:
28 void start()
29 {
30 start_ = std::chrono::steady_clock::now();
31 }
32
33 double seconds()
34 {
35 end_ = std::chrono::steady_clock::now();
36 elapsed_ = end_ - start_;
37 //return in ms
38 return elapsed_.count() * 1000;
39 }
40
41private:
42 typedef std::chrono::steady_clock::time_point tp;
43 tp start_;
44 tp end_;
45 std::chrono::duration<double> elapsed_;
46};
47
48struct GPUTimer
49{
50 GPUTimer()
51 {
52 cudaEventCreate(&start_);
53 cudaEventCreate(&stop_);
54 cudaEventRecord(start_, 0);
55 }
56
57 ~GPUTimer()
58 {
59 cudaEventDestroy(start_);
60 cudaEventDestroy(stop_);
61 }
62
63 void start()
64 {
65 cudaEventRecord(start_, 0);
66 }
67
68 float seconds()
69 {
70 cudaEventRecord(stop_, 0);
71 cudaEventSynchronize(stop_);
72 float time;
73 cudaEventElapsedTime(&time, start_, stop_);
74 return time * 1e-3;
75 }
76 private:
77 cudaEvent_t start_, stop_;
78};
79
80int main()
81{
82 typedef std::complex<float> TypeA;
83 typedef std::complex<float> TypeB;
84 typedef std::complex<float> TypeC;
85 typedef std::complex<float> TypeScalar;
86
87 auto alpha = TypeScalar(1.1, 0.0);
88 auto beta = TypeScalar(0.0, 0.0);
89
90 // CUDA types
91 cutensorDataType_t typeA = CUTENSOR_C_32F;
92 cutensorDataType_t typeB = CUTENSOR_C_32F;
93 cutensorDataType_t typeC = CUTENSOR_C_32F;
94 cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_3XTF32;
95
96
97 /* ***************************** */
98
99 // Create vector of modes
100 std::vector<int> modeC{0,1,2,3,4,6,8,9,25,26,10,12,14,27,15,28,17,19,29,20,21,30,23,24};
101 std::vector<int> modeA{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24};
102 std::vector<int> modeB{25,26,27,28,29,30,5,7,11,13,16,18,22};
103 int nmodeA = modeA.size();
104 int nmodeB = modeB.size();
105 int nmodeC = modeC.size();
106
107 // Extents
108 std::unordered_map<int, int64_t> extent;
109 for (auto i = 0; i <= 30; i++)
110 extent[i] = 2;
111
112 // Create a vector of extents for each tensor
113 std::vector<int64_t> extentC;
114 for (auto mode : modeC)
115 extentC.push_back(extent[mode]);
116 std::vector<int64_t> extentA;
117 for (auto mode : modeA)
118 extentA.push_back(extent[mode]);
119 std::vector<int64_t> extentB;
120 for (auto mode : modeB)
121 extentB.push_back(extent[mode]);
122
123 /**********************
124 * Allocating data
125 **********************/
126
127 // Number of elements of each tensor
128 size_t elementsA = 1;
129 for (auto mode : modeA)
130 elementsA *= extent[mode];
131 size_t elementsB = 1;
132 for (auto mode : modeB)
133 elementsB *= extent[mode];
134 size_t elementsC = 1;
135 for (auto mode : modeC)
136 elementsC *= extent[mode];
137
138 // Size in bytes
139 size_t sizeA = sizeof(TypeA) * elementsA;
140 size_t sizeB = sizeof(TypeB) * elementsB;
141 size_t sizeC = sizeof(TypeC) * elementsC;
142 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
143
144 // Allocate on device
145 void *A_d, *B_d, *C_d;
146 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
147 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
148 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
149
150 // Allocate on host
151 TypeA *A = (TypeA*) malloc(sizeof(TypeA) * elementsA);
152 TypeB *B = (TypeB*) malloc(sizeof(TypeB) * elementsB);
153 TypeC *C = (TypeC*) malloc(sizeof(TypeC) * elementsC);
154
155 if (A == nullptr || B == nullptr || C == nullptr)
156 {
157 printf("Error: Host allocation of A, B, or C.\n");
158 exit(-1);
159 }
160
161 /*******************
162 * Initialize data
163 *******************/
164
165 for (int64_t i = 0; i < elementsA; i++)
166 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
167 for (int64_t i = 0; i < elementsB; i++)
168 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
169 for (int64_t i = 0; i < elementsC; i++)
170 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
171
172 // Copy to device
173 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
174 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
175 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
176
177 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
178
179 /*************************
180 * cuTENSOR
181 *************************/
182
183 // Initialize cuTENSOR library
184 cutensorHandle_t handle;
185 HANDLE_ERROR(cutensorCreate(&handle));
186
187 // Read kernel cache from file (if the file was generated by a prior execution)
188 auto readKernelCacheStatus = cutensorReadKernelCacheFromFile(handle, "kernelCache.bin");
189
190 if (readKernelCacheStatus == CUTENSOR_STATUS_IO_ERROR)
191 printf("No kernel cache found. It will be generated before the end of this execution.\n");
192 else if (readKernelCacheStatus == CUTENSOR_STATUS_SUCCESS)
193 printf("Kernel cache found and read successfully.\n");
194 else
195 HANDLE_ERROR(readKernelCacheStatus);
196
197 /**********************
198 * Create Tensor Descriptors
199 **********************/
200
201 cutensorTensorDescriptor_t descA;
202 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
203 &descA,
204 nmodeA,
205 extentA.data(),
206 NULL,/*stride*/
207 typeA, kAlignment));
208
209 cutensorTensorDescriptor_t descB;
210 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
211 &descB,
212 nmodeB,
213 extentB.data(),
214 NULL,/*stride*/
215 typeB, kAlignment));
216
217 cutensorTensorDescriptor_t descC;
218 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
219 &descC,
220 nmodeC,
221 extentC.data(),
222 NULL,/*stride*/
223 typeC, kAlignment));
224
225 /*******************************
226 * Create Contraction Descriptor
227 *******************************/
228
229 cutensorOperationDescriptor_t desc;
230 HANDLE_ERROR(cutensorCreateContraction(handle,
231 &desc,
232 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
233 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
234 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
235 descC, modeC.data(),
236 descCompute));
237
238 /**************************
239 * Set the algorithm to use -- without just-in-time compilation
240 ***************************/
241
242 const cutensorAlgo_t algo = CUTENSOR_ALGO_GETT;
243
244 cutensorPlanPreference_t planPref;
245 HANDLE_ERROR(cutensorCreatePlanPreference(handle,
246 &planPref,
247 algo,
248 CUTENSOR_JIT_MODE_NONE));
249
250 /**********************
251 * Query workspace estimate
252 **********************/
253
254 uint64_t workspaceSizeEstimate = 0;
255 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
256 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
257 desc,
258 planPref,
259 workspacePref,
260 &workspaceSizeEstimate));
261 // Allocate workspace
262 void *work = nullptr;
263 if (workspaceSizeEstimate > 0)
264 {
265 HANDLE_CUDA_ERROR(cudaMalloc(&work, workspaceSizeEstimate));
266 }
267
268 /**************************
269 * Create Contraction Plan -- without just-in-time compilation
270 **************************/
271
272 cutensorPlan_t plan;
273 HANDLE_ERROR(cutensorCreatePlan(handle,
274 &plan,
275 desc,
276 planPref,
277 workspaceSizeEstimate));
278
279 /**********************
280 * Execute the tensor contraction
281 **********************/
282
283 cudaStream_t stream;
284 HANDLE_CUDA_ERROR(cudaStreamCreate(&stream));
285
286 double minTimeCUTENSOR = 1e100;
287 for (int i=0; i < 3; ++i)
288 {
289 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
290
291 // Set up timing
292 GPUTimer timer;
293 timer.start();
294
295 HANDLE_ERROR(cutensorContract(handle,
296 plan,
297 (void*) &alpha, A_d, B_d,
298 (void*) &beta, C_d, C_d,
299 work, workspaceSizeEstimate, stream))
300
301 // Synchronize and measure timing
302 auto time = timer.seconds();
303
304 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
305 }
306
307 /*************************/
308
309 /**************************
310 * Set the algorithm to use -- with just-in-time compilation
311 ***************************/
312
313 cutensorPlanPreference_t planPrefJit;
314 HANDLE_ERROR(cutensorCreatePlanPreference(handle,
315 &planPrefJit,
316 algo,
317 CUTENSOR_JIT_MODE_DEFAULT));
318
319 /**********************
320 * Query workspace estimate
321 **********************/
322
323 uint64_t workspaceSizeEstimateJit = 0;
324 const cutensorWorksizePreference_t workspacePrefJit = CUTENSOR_WORKSPACE_DEFAULT;
325 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
326 desc,
327 planPrefJit,
328 workspacePrefJit,
329 &workspaceSizeEstimateJit));
330 // Allocate workspace
331 void *workJit = nullptr;
332 if (workspaceSizeEstimateJit > 0)
333 {
334 HANDLE_CUDA_ERROR(cudaMalloc(&workJit, workspaceSizeEstimateJit));
335 }
336
337 /**************************
338 * Create Contraction Plan -- with just-in-time compilation
339 **************************/
340
341 cutensorPlan_t planJit;
342 CPUTimer jitPlanTimer;
343 jitPlanTimer.start();
344 // This is where the kernel is actually compiled
345 HANDLE_ERROR(cutensorCreatePlan(handle,
346 &planJit,
347 desc,
348 planPrefJit,
349 workspaceSizeEstimateJit));
350 auto jitPlanTime = jitPlanTimer.seconds();
351
352 /**********************
353 * Execute the tensor contraction using the JIT compiled kernel
354 **********************/
355
356 double minTimeCUTENSORJit = 1e100;
357 for (int i=0; i < 3; ++i)
358 {
359 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
360
361 // Set up timing
362 GPUTimer timer;
363 timer.start();
364
365 HANDLE_ERROR(cutensorContract(handle,
366 planJit,
367 (void*) &alpha, A_d, B_d,
368 (void*) &beta, C_d, C_d,
369 workJit, workspaceSizeEstimateJit, stream))
370
371 // Synchronize and measure timing
372 auto time = timer.seconds();
373
374 minTimeCUTENSORJit = (minTimeCUTENSORJit < time) ? minTimeCUTENSORJit : time;
375 }
376
377 /*************************/
378
379 float flops;
380 HANDLE_ERROR(cutensorOperationDescriptorGetAttribute(handle,
381 desc,
382 CUTENSOR_OPERATION_DESCRIPTOR_FLOPS,
383 (void*)&flops,
384 sizeof(flops)));
385 auto gflops = flops / 1e9;
386 auto gflopsPerSec = gflops / minTimeCUTENSOR;
387 auto gflopsPerSecJit = gflops / minTimeCUTENSORJit;
388
389 printf("cuTENSOR : %6.0f GFLOPs/s\n", gflopsPerSec);
390 printf("cuTENSOR JIT: %6.0f GFLOPs/s\n", gflopsPerSecJit);
391 printf("Speedup: %.1fx\n", gflopsPerSecJit / gflopsPerSec);
392 printf("JIT Compilation time: %.1f seconds ", jitPlanTime / 1e3);
393 if (readKernelCacheStatus == CUTENSOR_STATUS_SUCCESS)
394 printf("(Kernel cache file was read successfully; Compilation was not required)\n");
395 else
396 printf("\n");
397
398 // Write kernel cache to file
399 HANDLE_ERROR(cutensorWriteKernelCacheToFile(handle, "kernelCache.bin"))
400 printf("Kernel cache written to file. Will be read in next execution.\n");
401
402 HANDLE_ERROR(cutensorDestroy(handle));
403 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
404 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
405 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
406 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
407 HANDLE_CUDA_ERROR(cudaStreamDestroy(stream));
408 HANDLE_ERROR(cutensorDestroyPlanPreference(planPref));
409 HANDLE_ERROR(cutensorDestroyPlan(plan));
410 HANDLE_ERROR(cutensorDestroyPlanPreference(planPrefJit));
411 HANDLE_ERROR(cutensorDestroyPlan(planJit));
412
413 if (A) free(A);
414 if (B) free(B);
415 if (C) free(C);
416 if (A_d) cudaFree(A_d);
417 if (B_d) cudaFree(B_d);
418 if (C_d) cudaFree(C_d);
419 if (work) cudaFree(work);
420 if (workJit) cudaFree(workJit);
421
422 printf("Successful completion\n");
423 return 0;
424}