Plan Cache

This section introduces the software-managed plan cache, which has the following key features:

  • Minimizes launch-related overhead (e.g., due to kernel selection).

  • Overhead-free autotuning (a.k.a. incremental autotuning).

    • This feature enables users to automatically find the best implementation for a given problem, thereby increasing performance.

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.

  • Serialization and deserialization of the cache:

    • Allows users to store the state of the cache to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (i.e., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).
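Because lookups are keyed on the operation descriptor and the cache is shared by all threads that use the handle, the same problem may even be planned concurrently. A minimal sketch (assuming the HANDLE_ERROR macro as well as handle, desc, planPref, and workspaceSizeEstimate are set up as in the examples below):

#include <thread>

// Both threads plan the identical contraction through one shared handle;
// the plan cache inside the handle is read and updated in a thread-safe manner.
auto planIt = [&]() {
    cutensorPlan_t p;
    HANDLE_ERROR(cutensorCreatePlan(handle, &p, desc, planPref, workspaceSizeEstimate));
    HANDLE_ERROR(cutensorDestroyPlan(p));
};
std::thread t0(planIt);
std::thread t1(planIt);
t0.join();
t1.join();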

The remainder of this section assumes familiarity with the Getting Started guide.

Note

The cache is active by default; it can be disabled via the CUTENSOR_DISABLE_PLAN_CACHE environment variable (see Environment Variables).
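For instance, disabling the cache programmatically could look as follows (a sketch assuming a POSIX environment and that the variable is set before the handle is created):

#include <cstdlib> // setenv

// Disable the plan cache for this process before initializing cuTENSOR.
setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", /*overwrite=*/ 1);

cutensorHandle_t handle;
HANDLE_ERROR(cutensorCreate(&handle));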

Incremental Autotuning

The incremental autotuning feature enables users to automatically explore different implementations, called candidates, for a given operation.

When the cache is used together with incremental autotuning (CUTENSOR_AUTOTUNE_MODE_INCREMENTAL), successive invocations of the same operation (albeit with potentially different data pointers) are performed by different candidates; the timings of those candidates are measured automatically and the fastest candidate is stored in the plan cache. The number of different candidates to explore is configurable by the user (via CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT); all subsequent calls to the same problem will then map to the fastest of the measured candidates stored in the cache.
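As a minimal sketch of the resulting call pattern (assuming a plan that was created with CUTENSOR_AUTOTUNE_MODE_INCREMENTAL and an incremental count of 4, exactly as configured in the advanced example below):

// The first incCount invocations each exercise (and implicitly time) a different
// candidate; every subsequent invocation reuses the fastest measured candidate
// that was stored in the plan cache.
const uint32_t incCount = 4;
for (uint32_t i = 0; i < incCount + 1; ++i)
    HANDLE_ERROR(cutensorContract(handle, plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));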

This approach to autotuning has some key advantages:

  • Candidates are evaluated at a point in time where the hardware caches are in a production-like state (i.e., the state of the hardware caches reflects the real-world situation).

  • Overhead is minimized (i.e., no timing loop, no synchronization).

    • Moreover, candidates are evaluated in the order given by our performance model (from fastest to slowest).

Incremental autotuning is especially powerful when combined with cuTENSOR's cache-serialization feature (via cutensorHandleWritePlanCacheToFile and cutensorHandleReadPlanCacheFromFile), by writing the tuned cache to disk.

Note

We recommend warming up the GPU (i.e., reaching steady-state performance) before autotuning is invoked, to minimize fluctuations in the measured performance.
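A simple warm-up could look as follows (a sketch; the assumption is that a short burst of memory traffic on a scratch buffer is enough to bring the clocks of the target GPU to steady state):

void *scratch = nullptr;
const size_t scratchBytes = 256ull * 1024 * 1024; // 256 MiB scratch buffer
HANDLE_CUDA_ERROR(cudaMalloc(&scratch, scratchBytes));
for (int i = 0; i < 100; ++i) // generate load to ramp up the GPU clocks
    HANDLE_CUDA_ERROR(cudaMemset(scratch, 0, scratchBytes));
HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
HANDLE_CUDA_ERROR(cudaFree(scratch));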

Introductory Example

This subsection provides an overview of the cache-related API calls and functionality. In addition to the steps outlined in the Getting Started guide, in this example we also:

  • Set a suitable cache size.

  • Configure the caching behavior on a per-contraction basis (via cutensorPlanPreferenceSetAttribute).

Let us start with the same example outlined in the Getting Started guide: since caching is enabled by default in cuTENSOR 2.x, it already takes advantage of the plan cache. While optional, the following code demonstrates how to adjust the cache size from its implementation-defined initial value.

// Set cache size
constexpr uint32_t numEntries = 128;
HANDLE_ERROR( cutensorHandleResizePlanCache(handle, numEntries) );

// ...

Note that the number of entries is user-configurable; ideally we want the cache to be sized with enough capacity for each distinct contraction call of the application. Since this may not always be possible (due to memory constraints), cuTENSOR's plan cache evicts cache entries using a least-recently-used (LRU) policy. Users also have the option to disable caching on a per-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note that the cache lookup happens at the time the plan is created. Hence, this technique is especially useful if the same contraction is planned multiple times within the same application.
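For illustration, a minimal sketch (using desc, planPref, and workspaceSizeEstimate as set up in the full example): the second cutensorCreatePlan call for the same operation descriptor is served from the cache and is therefore cheap:

cutensorPlan_t plan1, plan2;
HANDLE_ERROR(cutensorCreatePlan(handle, &plan1, desc, planPref, workspaceSizeEstimate)); // cache miss: full planning
HANDLE_ERROR(cutensorCreatePlan(handle, &plan2, desc, planPref, workspaceSizeEstimate)); // cache hit: inexpensive lookup
HANDLE_ERROR(cutensorDestroyPlan(plan2));
HANDLE_ERROR(cutensorDestroyPlan(plan1));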

To disable caching for a certain contraction (i.e., to opt out), the following setting in the cutensorPlanPreference_t must be modified:

const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
     handle,
     planPref,
     CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
     &cacheMode,
     sizeof(cutensorCacheMode_t)));

This concludes the introductory example.

Advanced Example

This example extends the previous one and explains how to:

  • Take advantage of incremental autotuning.

    • It is recommended to warm up the GPU (i.e., reach steady-state performance) before autotuning (to avoid large fluctuations in the measured performance).

  • Use tags to distinguish two otherwise identical tensor contractions.

    • This feature is useful if the hardware caches of the GPU are (likely to be) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call), and the cache state is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions).

  • Write the plan-cache state to a file and read it back in.

Let us start by enabling incremental autotuning. To do so, we modify the cutensorPlanPreference_t as follows:

const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
    &autotuneMode,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));

The first call to cutensorPlanPreferenceSetAttribute enables incremental autotuning, while the second sets CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls for the same problem are looked up from the plan cache. Higher values of incCount explore more candidates and thus incur a larger initial overhead, but they can also result in better performance if the initial overhead can be amortized (e.g., when the cache is written to disk). We believe that a CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT of 4 is a good default value.

The following code incorporates these changes:

#include <stdlib.h>
#include <stdio.h>

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensor.h>

#define HANDLE_ERROR(x)                                               \
{ const auto err = x;                                                 \
  if( err != CUTENSOR_STATUS_SUCCESS )                                \
  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
};

#define HANDLE_CUDA_ERROR(x)                                      \
{ const auto err = x;                                             \
  if( err != cudaSuccess )                                        \
  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
};

struct GPUTimer
{
    GPUTimer()
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, 0);
    }

    float seconds()
    {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }
    private:
    cudaEvent_t start_, stop_;
};

int main()
{
    typedef float floatTypeA;
    typedef float floatTypeB;
    typedef float floatTypeC;
    typedef float floatTypeCompute;

    cutensorDataType_t typeA = CUTENSOR_R_32F;
    cutensorDataType_t typeB = CUTENSOR_R_32F;
    cutensorDataType_t typeC = CUTENSOR_R_32F;
    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;

    floatTypeCompute alpha = (floatTypeCompute)1.1f;
    floatTypeCompute beta  = (floatTypeCompute)0.f;

    /**********************
     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
     **********************/

    std::vector<int> modeC{'m','u','n','v'};
    std::vector<int> modeA{'m','h','k','n'};
    std::vector<int> modeB{'u','k','v','h'};
    int nmodeA = modeA.size();
    int nmodeB = modeB.size();
    int nmodeC = modeC.size();

    std::unordered_map<int, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['v'] = 64;
    extent['h'] = 64;
    extent['k'] = 64;

    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;

    std::vector<int64_t> extentC;
    for (auto mode : modeC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentA;
    for (auto mode : modeA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modeB)
        extentB.push_back(extent[mode]);

    /**********************
     * Allocating data
     **********************/

    size_t elementsA = 1;
    for (auto mode : modeA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modeB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modeC)
        elementsC *= extent[mode];

    size_t sizeA = sizeof(floatTypeA) * elementsA;
    size_t sizeB = sizeof(floatTypeB) * elementsB;
    size_t sizeC = sizeof(floatTypeC) * elementsC;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);

    void *A_d, *B_d, *C_d;
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));

    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
    assert(uintptr_t(A_d) % kAlignment == 0);
    assert(uintptr_t(B_d) % kAlignment == 0);
    assert(uintptr_t(C_d) % kAlignment == 0);

    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);

    if (A == NULL || B == NULL || C == NULL)
    {
        printf("Error: Host allocation of A, B, or C.\n");
        return -1;
    }

    /*******************
     * Initialize data
     *******************/

    for (int64_t i = 0; i < elementsA; i++)
        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsB; i++)
        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsC; i++)
        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;

    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));

    /*************************
     * cuTENSOR
     *************************/

    cutensorHandle_t handle;
    HANDLE_ERROR(cutensorCreate(&handle));

    /**********************
     * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
     **********************/
    uint32_t numEntries = 128;
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));

    /**********************
     * Create Tensor Descriptors
     **********************/
    cutensorTensorDescriptor_t descA;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descA,
                 nmodeA,
                 extentA.data(),
                 NULL,/*stride*/
                 typeA, kAlignment));

    cutensorTensorDescriptor_t descB;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descB,
                 nmodeB,
                 extentB.data(),
                 NULL,/*stride*/
                 typeB, kAlignment));

    cutensorTensorDescriptor_t descC;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descC,
                 nmodeC,
                 extentC.data(),
                 NULL,/*stride*/
                 typeC, kAlignment));

    /*******************************
     * Create Contraction Descriptor
     *******************************/

    cutensorOperationDescriptor_t desc;
    HANDLE_ERROR(cutensorCreateContraction(handle,
                 &desc,
                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(),
                 descCompute));

    /**************************
     * PlanPreference: Set the algorithm to use and enable incremental autotuning
     ***************************/

    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

    cutensorPlanPreference_t planPref;
    HANDLE_ERROR(cutensorCreatePlanPreference(
                               handle,
                               &planPref,
                               algo,
                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation

    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
        &cacheMode,
        sizeof(cutensorCacheMode_t)));

    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
        &autotuneMode,
        sizeof(cutensorAutotuneMode_t)));

    const uint32_t incCount = 4;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
        &incCount,
        sizeof(uint32_t)));

    /**********************
     * Query workspace estimate
     **********************/

    uint64_t workspaceSizeEstimate = 0;
    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
                                          desc,
                                          planPref,
                                          workspacePref,
                                          &workspaceSizeEstimate));

    /**************************
     * Create Contraction Plan
     **************************/

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    uint64_t actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    void *work = nullptr;
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    double minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    double transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    HANDLE_ERROR(cutensorDestroy(handle));
    HANDLE_ERROR(cutensorDestroyPlan(plan));
    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (A_d) cudaFree(A_d);
    if (B_d) cudaFree(B_d);
    if (C_d) cudaFree(C_d);
    if (work) cudaFree(work);

    return 0;
}

Let us further extend this example by writing the plan cache to a file and reading it back in, provided that it has been written previously:

const char planCacheFilename[] = "./planCache.bin";
uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
        planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
    printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
}
else
{
    printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
            numCachelines);
}

// ...

status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) couldn't be written to.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
    printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
            cutensorGetErrorString(status));
}
else
{
    printf("Plan cache successfully stored to %s.\n", planCacheFilename);
}

Warning

cutensorHandleReadPlanCacheFromFile only succeeds if the plan cache is sufficiently large to read all of the cachelines stored in the file; otherwise, CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the sufficient number of cachelines is stored in numCachelinesRead.
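One way to handle this case is to grow the cache to the reported size and retry; a minimal sketch:

uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
        planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // numCachelines now holds the number of cachelines required to read the file.
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numCachelines));
    HANDLE_ERROR(cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines));
}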

With these changes in place, the example now looks as follows:

#include <stdlib.h>
#include <stdio.h>

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensor.h>

#define HANDLE_ERROR(x)                                               \
{ const auto err = x;                                                 \
  if( err != CUTENSOR_STATUS_SUCCESS )                                \
  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
};

#define HANDLE_CUDA_ERROR(x)                                      \
{ const auto err = x;                                             \
  if( err != cudaSuccess )                                        \
  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
};

struct GPUTimer
{
    GPUTimer()
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, 0);
    }

    float seconds()
    {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }
    private:
    cudaEvent_t start_, stop_;
};

int main()
{
    typedef float floatTypeA;
    typedef float floatTypeB;
    typedef float floatTypeC;
    typedef float floatTypeCompute;

    cutensorDataType_t typeA = CUTENSOR_R_32F;
    cutensorDataType_t typeB = CUTENSOR_R_32F;
    cutensorDataType_t typeC = CUTENSOR_R_32F;
    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;

    floatTypeCompute alpha = (floatTypeCompute)1.1f;
    floatTypeCompute beta  = (floatTypeCompute)0.f;

    /**********************
     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
     **********************/

    std::vector<int> modeC{'m','u','n','v'};
    std::vector<int> modeA{'m','h','k','n'};
    std::vector<int> modeB{'u','k','v','h'};
    int nmodeA = modeA.size();
    int nmodeB = modeB.size();
    int nmodeC = modeC.size();

    std::unordered_map<int, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['v'] = 64;
    extent['h'] = 64;
    extent['k'] = 64;

    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;

    std::vector<int64_t> extentC;
    for (auto mode : modeC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentA;
    for (auto mode : modeA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modeB)
        extentB.push_back(extent[mode]);

    /**********************
     * Allocating data
     **********************/

    size_t elementsA = 1;
    for (auto mode : modeA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modeB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modeC)
        elementsC *= extent[mode];

    size_t sizeA = sizeof(floatTypeA) * elementsA;
    size_t sizeB = sizeof(floatTypeB) * elementsB;
    size_t sizeC = sizeof(floatTypeC) * elementsC;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);

    void *A_d, *B_d, *C_d;
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));

    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
    assert(uintptr_t(A_d) % kAlignment == 0);
    assert(uintptr_t(B_d) % kAlignment == 0);
    assert(uintptr_t(C_d) % kAlignment == 0);

    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);

    if (A == NULL || B == NULL || C == NULL)
    {
        printf("Error: Host allocation of A, B, or C.\n");
        return -1;
    }

    /*******************
     * Initialize data
     *******************/

    for (int64_t i = 0; i < elementsA; i++)
        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsB; i++)
        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsC; i++)
        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;

    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));

    /*************************
     * cuTENSOR
     *************************/

    cutensorHandle_t handle;
    HANDLE_ERROR(cutensorCreate(&handle));

    /**********************
     * Load plan cache
     **********************/

    // holds information about the per-handle plan cache
    const char planCacheFilename[] = "./planCache.bin";
    uint32_t numCachelines = 0;
    cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
    }
    else
    {
        printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
                numCachelines);
    }

    /**********************
     * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
     **********************/
    uint32_t numEntries = 128;
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));

    /**********************
     * Create Tensor Descriptors
     **********************/
    cutensorTensorDescriptor_t descA;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descA,
                 nmodeA,
                 extentA.data(),
                 NULL,/*stride*/
                 typeA, kAlignment));

    cutensorTensorDescriptor_t descB;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descB,
                 nmodeB,
                 extentB.data(),
                 NULL,/*stride*/
                 typeB, kAlignment));

    cutensorTensorDescriptor_t descC;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descC,
                 nmodeC,
                 extentC.data(),
                 NULL,/*stride*/
                 typeC, kAlignment));

    /*******************************
     * Create Contraction Descriptor
     *******************************/

    cutensorOperationDescriptor_t desc;
    HANDLE_ERROR(cutensorCreateContraction(handle,
                 &desc,
                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(),
                 descCompute));

    /**************************
     * PlanPreference: Set the algorithm to use and enable incremental autotuning
     ***************************/

    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

    cutensorPlanPreference_t planPref;
    HANDLE_ERROR(cutensorCreatePlanPreference(
                               handle,
                               &planPref,
                               algo,
                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation

    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
        &cacheMode,
        sizeof(cutensorCacheMode_t)));

    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
        &autotuneMode,
        sizeof(cutensorAutotuneMode_t)));

    const uint32_t incCount = 4;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
        &incCount,
        sizeof(uint32_t)));

    /**********************
     * Query workspace estimate
     **********************/

    uint64_t workspaceSizeEstimate = 0;
    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
                                          desc,
                                          planPref,
                                          workspacePref,
                                          &workspaceSizeEstimate));

    /**************************
     * Create Contraction Plan
     **************************/

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    uint64_t actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    void *work = nullptr;
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    double minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    double transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) couldn't be written to.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
                cutensorGetErrorString(status));
    }
    else
    {
        printf("Plan cache successfully stored to %s.\n", planCacheFilename);
    }

    HANDLE_ERROR(cutensorDestroy(handle));
    HANDLE_ERROR(cutensorDestroyPlan(plan));
    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (A_d) cudaFree(A_d);
    if (B_d) cudaFree(B_d);
    if (C_d) cudaFree(C_d);
    if (work) cudaFree(work);

    return 0;
}

Finally, let us add a second contraction loop, but this time we want a different cacheline to be used to cache the otherwise identical contraction: this can be useful if the state of the GPU's hardware caches differs substantially between the two calls (i.e., in a way that affects the measured runtime of the kernels). To do so, we use the CUTENSOR_OPERATION_DESCRIPTOR_TAG attribute:

uint32_t tag = 1;
HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
     handle,
     desc,
     CUTENSOR_OPERATION_DESCRIPTOR_TAG,
     &tag,
     sizeof(uint32_t)));

With this change in place, the example code now looks as follows:

#include <stdlib.h>
#include <stdio.h>

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensor.h>

#define HANDLE_ERROR(x)                                               \
{ const auto err = x;                                                 \
  if( err != CUTENSOR_STATUS_SUCCESS )                                \
  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
};

#define HANDLE_CUDA_ERROR(x)                                      \
{ const auto err = x;                                             \
  if( err != cudaSuccess )                                        \
  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
};

struct GPUTimer
{
    GPUTimer()
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, 0);
    }

    float seconds()
    {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }
    private:
    cudaEvent_t start_, stop_;
};

int main()
{
    typedef float floatTypeA;
    typedef float floatTypeB;
    typedef float floatTypeC;
    typedef float floatTypeCompute;

    cutensorDataType_t typeA = CUTENSOR_R_32F;
    cutensorDataType_t typeB = CUTENSOR_R_32F;
    cutensorDataType_t typeC = CUTENSOR_R_32F;
    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;

    floatTypeCompute alpha = (floatTypeCompute)1.1f;
    floatTypeCompute beta  = (floatTypeCompute)0.f;

    /**********************
     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
     **********************/

    std::vector<int> modeC{'m','u','n','v'};
    std::vector<int> modeA{'m','h','k','n'};
    std::vector<int> modeB{'u','k','v','h'};
    int nmodeA = modeA.size();
    int nmodeB = modeB.size();
    int nmodeC = modeC.size();

    std::unordered_map<int, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['v'] = 64;
    extent['h'] = 64;
    extent['k'] = 64;

    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;

    std::vector<int64_t> extentC;
    for (auto mode : modeC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentA;
    for (auto mode : modeA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modeB)
        extentB.push_back(extent[mode]);

    /**********************
     * Allocating data
     **********************/

    size_t elementsA = 1;
    for (auto mode : modeA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modeB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modeC)
        elementsC *= extent[mode];

    size_t sizeA = sizeof(floatTypeA) * elementsA;
    size_t sizeB = sizeof(floatTypeB) * elementsB;
    size_t sizeC = sizeof(floatTypeC) * elementsC;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);

    void *A_d, *B_d, *C_d;
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));

    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
    assert(uintptr_t(A_d) % kAlignment == 0);
    assert(uintptr_t(B_d) % kAlignment == 0);
    assert(uintptr_t(C_d) % kAlignment == 0);

    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);

    if (A == NULL || B == NULL || C == NULL)
    {
        printf("Error: Host allocation of A, B, or C.\n");
        return -1;
    }

    /*******************
     * Initialize data
     *******************/

    for (int64_t i = 0; i < elementsA; i++)
        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsB; i++)
        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsC; i++)
        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;

    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));

    /*************************
     * cuTENSOR
     *************************/

    cutensorHandle_t handle;
    HANDLE_ERROR(cutensorCreate(&handle));

    /**********************
     * Load plan cache
     **********************/

    // holds information about the per-handle plan cache
    const char planCacheFilename[] = "./planCache.bin";
    uint32_t numCachelines = 0;
    cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
    }
    else
    {
        printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
                numCachelines);
    }

    /**********************
     * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
     **********************/
    uint32_t numEntries = 128;
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));

    /**********************
     * Create Tensor Descriptors
     **********************/
    cutensorTensorDescriptor_t descA;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descA,
                 nmodeA,
                 extentA.data(),
                 NULL,/*stride*/
                 typeA, kAlignment));

    cutensorTensorDescriptor_t descB;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descB,
                 nmodeB,
                 extentB.data(),
                 NULL,/*stride*/
                 typeB, kAlignment));

    cutensorTensorDescriptor_t descC;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descC,
                 nmodeC,
                 extentC.data(),
                 NULL,/*stride*/
                 typeC, kAlignment));

    /*******************************
     * Create Contraction Descriptor
     *******************************/

    cutensorOperationDescriptor_t desc;
    HANDLE_ERROR(cutensorCreateContraction(handle,
                 &desc,
                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(),
                 descCompute));

    /**************************
     * PlanPreference: Set the algorithm to use and enable incremental autotuning
     ***************************/

    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

    cutensorPlanPreference_t planPref;
    HANDLE_ERROR(cutensorCreatePlanPreference(
                               handle,
                               &planPref,
                               algo,
                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation

    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
        &cacheMode,
        sizeof(cutensorCacheMode_t)));

    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
        &autotuneMode,
        sizeof(cutensorAutotuneMode_t)));

    const uint32_t incCount = 4;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
        &incCount,
        sizeof(uint32_t)));

    /**********************
     * Query workspace estimate
     **********************/

    uint64_t workspaceSizeEstimate = 0;
    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
                                          desc,
                                          planPref,
                                          workspacePref,
                                          &workspaceSizeEstimate));

    /**************************
     * Create Contraction Plan
     **************************/

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    uint64_t actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    void *work = nullptr;
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    double minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    double transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    uint32_t tag = 1;
    HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
         handle,
         desc,
         CUTENSOR_OPERATION_DESCRIPTOR_TAG,
         &tag,
         sizeof(uint32_t)));

    /**************************
     * Create Contraction Plan (with a different tag)
     **************************/

    HANDLE_ERROR(cutensorDestroyPlan(plan)); // release the previous plan before re-planning with the tag
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    if (work)
    {
        HANDLE_CUDA_ERROR(cudaFree(work));
        work = nullptr;
    }
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) couldn't be written to.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
                cutensorGetErrorString(status));
    }
    else
    {
        printf("Plan cache successfully stored to %s.\n", planCacheFilename);
    }

    HANDLE_ERROR(cutensorDestroy(handle));
    HANDLE_ERROR(cutensorDestroyPlan(plan));
    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (A_d) cudaFree(A_d);
    if (B_d) cudaFree(B_d);
    if (C_d) cudaFree(C_d);
    if (work) cudaFree(work);

    return 0;
}

You can invoke the binary once more to confirm that the cache now holds two entries; this time it should report that two cachelines were read from the file (./planCache.bin).

This concludes our plan-cache examples; you can find them (including timings and warm-up runs) in our samples repository.

If you have any further questions or suggestions, please do not hesitate to reach out to us.