Plan Cache

This section introduces the software-managed plan cache, which has the following key features:

  • Minimizes launch-related overhead (e.g., due to kernel selection).

  • Overhead-free autotuning (a.k.a. incremental autotuning).

    • This feature enables users to automatically find the best implementation for a given problem, thereby increasing performance.

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.

  • Serialization and deserialization of the cache:

    • Allows users to store the state of the cache to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (i.e., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).
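Because lookups are keyed on the operation descriptor and the cache is shared by all threads that use the handle, the same problem may even be planned concurrently. A minimal sketch (assuming the HANDLE_ERROR macro as well as handle, desc, planPref, and workspaceSizeEstimate are set up as in the examples below):

#include <thread>

// Both threads plan the identical contraction through one shared handle;
// the plan cache inside the handle is read and updated in a thread-safe manner.
auto planIt = [&]() {
    cutensorPlan_t p;
    HANDLE_ERROR(cutensorCreatePlan(handle, &p, desc, planPref, workspaceSizeEstimate));
    HANDLE_ERROR(cutensorDestroyPlan(p));
};
std::thread t0(planIt);
std::thread t1(planIt);
t0.join();
t1.join();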

The remainder of this section assumes familiarity with the Getting Started guide.

Note

The cache is active by default; it can be disabled via the CUTENSOR_DISABLE_PLAN_CACHE environment variable (see Environment Variables).
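For instance, disabling the cache programmatically could look as follows (a sketch assuming a POSIX environment and that the variable is set before the handle is created):

#include <cstdlib> // setenv

// Disable the plan cache for this process before initializing cuTENSOR.
setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", /*overwrite=*/ 1);

cutensorHandle_t handle;
HANDLE_ERROR(cutensorCreate(&handle));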

Incremental Autotuning

The incremental autotuning feature enables users to automatically explore different implementations, called candidates, for a given operation.

When the cache is used together with incremental autotuning (CUTENSOR_AUTOTUNE_MODE_INCREMENTAL), successive invocations of the same operation (albeit with potentially different data pointers) are performed by different candidates; the timings of those candidates are measured automatically and the fastest candidate is stored in the plan cache. The number of different candidates to explore is configurable by the user (via CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT); all subsequent calls to the same problem will then map to the fastest of the measured candidates stored in the cache.
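As a minimal sketch of the resulting call pattern (assuming a plan that was created with CUTENSOR_AUTOTUNE_MODE_INCREMENTAL and an incremental count of 4, exactly as configured in the advanced example below):

// The first incCount invocations each exercise (and implicitly time) a different
// candidate; every subsequent invocation reuses the fastest measured candidate
// that was stored in the plan cache.
const uint32_t incCount = 4;
for (uint32_t i = 0; i < incCount + 1; ++i)
    HANDLE_ERROR(cutensorContract(handle, plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));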

This approach to autotuning has some key advantages:

  • Candidates are evaluated at a point in time where the hardware caches are in a production-like state (i.e., the state of the hardware caches reflects the real-world situation).

  • Overhead is minimized (i.e., no timing loop, no synchronization).

    • Moreover, candidates are evaluated in the order given by our performance model (from fastest to slowest).

Incremental autotuning is especially powerful when combined with cuTENSOR's cache-serialization feature (via cutensorHandleWritePlanCacheToFile and cutensorHandleReadPlanCacheFromFile), by writing the tuned cache to disk.

Note

We recommend warming up the GPU (i.e., reaching steady-state performance) before autotuning is invoked, to minimize fluctuations in the measured performance.
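A simple warm-up could look as follows (a sketch; the assumption is that a short burst of memory traffic on a scratch buffer is enough to bring the clocks of the target GPU to steady state):

void *scratch = nullptr;
const size_t scratchBytes = 256ull * 1024 * 1024; // 256 MiB scratch buffer
HANDLE_CUDA_ERROR(cudaMalloc(&scratch, scratchBytes));
for (int i = 0; i < 100; ++i) // generate load to ramp up the GPU clocks
    HANDLE_CUDA_ERROR(cudaMemset(scratch, 0, scratchBytes));
HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
HANDLE_CUDA_ERROR(cudaFree(scratch));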

Introductory Example

This subsection provides an overview of the cache-related API calls and functionality. In addition to the steps outlined in the Getting Started guide, in this example we also:

  • Set a suitable cache size.

  • Configure the caching behavior on a per-contraction basis (via cutensorPlanPreferenceSetAttribute).

Let us start with the same example outlined in the Getting Started guide: since caching is enabled by default in cuTENSOR 2.x, it already takes advantage of the plan cache. While optional, the following code demonstrates how to adjust the cache size from its implementation-defined initial value.

// Set cache size
constexpr uint32_t numEntries = 128;
HANDLE_ERROR( cutensorHandleResizePlanCache(handle, numEntries) );

// ...

Note that the number of entries is user-configurable; ideally we want the cache to be sized with enough capacity for each distinct contraction call of the application. Since this may not always be possible (due to memory constraints), cuTENSOR's plan cache evicts cache entries using a least-recently-used (LRU) policy. Users also have the option to disable caching on a per-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note that the cache lookup happens at the time the plan is created. Hence, this technique is especially useful if the same contraction is planned multiple times within the same application.
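For illustration, a minimal sketch (using desc, planPref, and workspaceSizeEstimate as set up in the full example): the second cutensorCreatePlan call for the same operation descriptor is served from the cache and is therefore cheap:

cutensorPlan_t plan1, plan2;
HANDLE_ERROR(cutensorCreatePlan(handle, &plan1, desc, planPref, workspaceSizeEstimate)); // cache miss: full planning
HANDLE_ERROR(cutensorCreatePlan(handle, &plan2, desc, planPref, workspaceSizeEstimate)); // cache hit: inexpensive lookup
HANDLE_ERROR(cutensorDestroyPlan(plan2));
HANDLE_ERROR(cutensorDestroyPlan(plan1));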

To disable caching for a certain contraction (i.e., to opt out), the following setting in the cutensorPlanPreference_t must be modified:

const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
     handle,
     planPref,
     CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
     &cacheMode,
     sizeof(cutensorCacheMode_t)));

This concludes the introductory example.

Advanced Example

This example extends the previous one and explains how to:

  • Take advantage of incremental autotuning.

    • It is recommended to warm up the GPU (i.e., reach steady-state performance) before autotuning (to avoid large fluctuations in the measured performance).

  • Use tags to distinguish two otherwise identical tensor contractions.

    • This feature is useful if the hardware caches of the GPU are (likely to be) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call), and the cache state is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions).

  • Write the plan-cache state to a file and read it back in.

Let us start by enabling incremental autotuning. To do so, we modify the cutensorPlanPreference_t as follows:

const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
    &autotuneMode,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));

The first call to cutensorPlanPreferenceSetAttribute enables incremental autotuning, while the second sets CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls for the same problem are looked up from the plan cache. Higher values of incCount explore more candidates and thus incur a larger initial overhead, but they can also result in better performance if the initial overhead can be amortized (e.g., when the cache is written to disk). We believe that a CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT of 4 is a good default value.

The following code incorporates these changes:

#include <stdlib.h>
#include <stdio.h>

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensor.h>

#define HANDLE_ERROR(x)                                               \
{ const auto err = x;                                                 \
  if( err != CUTENSOR_STATUS_SUCCESS )                                \
  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
};

#define HANDLE_CUDA_ERROR(x)                                      \
{ const auto err = x;                                             \
  if( err != cudaSuccess )                                        \
  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
};

struct GPUTimer
{
    GPUTimer()
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, 0);
    }

    float seconds()
    {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }
    private:
    cudaEvent_t start_, stop_;
};

int main()
{
    typedef float floatTypeA;
    typedef float floatTypeB;
    typedef float floatTypeC;
    typedef float floatTypeCompute;

    cutensorDataType_t typeA = CUTENSOR_R_32F;
    cutensorDataType_t typeB = CUTENSOR_R_32F;
    cutensorDataType_t typeC = CUTENSOR_R_32F;
    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;

    floatTypeCompute alpha = (floatTypeCompute)1.1f;
    floatTypeCompute beta  = (floatTypeCompute)0.f;

    /**********************
     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
     **********************/

    std::vector<int> modeC{'m','u','n','v'};
    std::vector<int> modeA{'m','h','k','n'};
    std::vector<int> modeB{'u','k','v','h'};
    int nmodeA = modeA.size();
    int nmodeB = modeB.size();
    int nmodeC = modeC.size();

    std::unordered_map<int, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['v'] = 64;
    extent['h'] = 64;
    extent['k'] = 64;

    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;

    std::vector<int64_t> extentC;
    for (auto mode : modeC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentA;
    for (auto mode : modeA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modeB)
        extentB.push_back(extent[mode]);

    /**********************
     * Allocating data
     **********************/

    size_t elementsA = 1;
    for (auto mode : modeA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modeB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modeC)
        elementsC *= extent[mode];

    size_t sizeA = sizeof(floatTypeA) * elementsA;
    size_t sizeB = sizeof(floatTypeB) * elementsB;
    size_t sizeC = sizeof(floatTypeC) * elementsC;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);

    void *A_d, *B_d, *C_d;
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));

    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
    assert(uintptr_t(A_d) % kAlignment == 0);
    assert(uintptr_t(B_d) % kAlignment == 0);
    assert(uintptr_t(C_d) % kAlignment == 0);

    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);

    if (A == NULL || B == NULL || C == NULL)
    {
        printf("Error: Host allocation of A, B, or C.\n");
        return -1;
    }

    /*******************
     * Initialize data
     *******************/

    for (int64_t i = 0; i < elementsA; i++)
        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsB; i++)
        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsC; i++)
        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;

    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));

    /*************************
     * cuTENSOR
     *************************/

    cutensorHandle_t handle;
    HANDLE_ERROR(cutensorCreate(&handle));

    /**********************
     * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
     **********************/
    uint32_t numEntries = 128;
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));

    /**********************
     * Create Tensor Descriptors
     **********************/
    cutensorTensorDescriptor_t descA;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descA,
                 nmodeA,
                 extentA.data(),
                 NULL,/*stride*/
                 typeA, kAlignment));

    cutensorTensorDescriptor_t descB;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descB,
                 nmodeB,
                 extentB.data(),
                 NULL,/*stride*/
                 typeB, kAlignment));

    cutensorTensorDescriptor_t descC;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descC,
                 nmodeC,
                 extentC.data(),
                 NULL,/*stride*/
                 typeC, kAlignment));

    /*******************************
     * Create Contraction Descriptor
     *******************************/

    cutensorOperationDescriptor_t desc;
    HANDLE_ERROR(cutensorCreateContraction(handle,
                 &desc,
                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(),
                 descCompute));

    /**************************
     * PlanPreference: Set the algorithm to use and enable incremental autotuning
     ***************************/

    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

    cutensorPlanPreference_t planPref;
    HANDLE_ERROR(cutensorCreatePlanPreference(
                               handle,
                               &planPref,
                               algo,
                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation

    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
        &cacheMode,
        sizeof(cutensorCacheMode_t)));

    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
        &autotuneMode,
        sizeof(cutensorAutotuneMode_t)));

    const uint32_t incCount = 4;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
        &incCount,
        sizeof(uint32_t)));

    /**********************
     * Query workspace estimate
     **********************/

    uint64_t workspaceSizeEstimate = 0;
    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
                                          desc,
                                          planPref,
                                          workspacePref,
                                          &workspaceSizeEstimate));

    /**************************
     * Create Contraction Plan
     **************************/

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    uint64_t actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    void *work = nullptr;
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    double minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    double transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    HANDLE_ERROR(cutensorDestroy(handle));
    HANDLE_ERROR(cutensorDestroyPlan(plan));
    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (A_d) cudaFree(A_d);
    if (B_d) cudaFree(B_d);
    if (C_d) cudaFree(C_d);
    if (work) cudaFree(work);

    return 0;
}

Let us further extend this example by writing the plan cache to a file and reading it back in, provided that it has been written previously:

const char planCacheFilename[] = "./planCache.bin";
uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
        planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
    printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
}
else
{
    printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
            numCachelines);
}

// ...

status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) couldn't be written to.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
    printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
            cutensorGetErrorString(status));
}
else
{
    printf("Plan cache successfully stored to %s.\n", planCacheFilename);
}

Warning

cutensorHandleReadPlanCacheFromFile only succeeds if the plan cache is sufficiently large to read all of the cachelines stored in the file; otherwise, CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the sufficient number of cachelines is stored in numCachelinesRead.
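One way to handle this case is to grow the cache to the reported size and retry; a minimal sketch:

uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
        planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // numCachelines now holds the number of cachelines required to read the file.
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numCachelines));
    HANDLE_ERROR(cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines));
}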

With these changes in place, the example now looks as follows:

#include <stdlib.h>
#include <stdio.h>

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensor.h>

#define HANDLE_ERROR(x)                                               \
{ const auto err = x;                                                 \
  if( err != CUTENSOR_STATUS_SUCCESS )                                \
  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
};

#define HANDLE_CUDA_ERROR(x)                                      \
{ const auto err = x;                                             \
  if( err != cudaSuccess )                                        \
  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
};

struct GPUTimer
{
    GPUTimer()
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, 0);
    }

    float seconds()
    {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }
    private:
    cudaEvent_t start_, stop_;
};

int main()
{
    typedef float floatTypeA;
    typedef float floatTypeB;
    typedef float floatTypeC;
    typedef float floatTypeCompute;

    cutensorDataType_t typeA = CUTENSOR_R_32F;
    cutensorDataType_t typeB = CUTENSOR_R_32F;
    cutensorDataType_t typeC = CUTENSOR_R_32F;
    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;

    floatTypeCompute alpha = (floatTypeCompute)1.1f;
    floatTypeCompute beta  = (floatTypeCompute)0.f;

    /**********************
     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
     **********************/

    std::vector<int> modeC{'m','u','n','v'};
    std::vector<int> modeA{'m','h','k','n'};
    std::vector<int> modeB{'u','k','v','h'};
    int nmodeA = modeA.size();
    int nmodeB = modeB.size();
    int nmodeC = modeC.size();

    std::unordered_map<int, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['v'] = 64;
    extent['h'] = 64;
    extent['k'] = 64;

    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;

    std::vector<int64_t> extentC;
    for (auto mode : modeC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentA;
    for (auto mode : modeA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modeB)
        extentB.push_back(extent[mode]);

    /**********************
     * Allocating data
     **********************/

    size_t elementsA = 1;
    for (auto mode : modeA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modeB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modeC)
        elementsC *= extent[mode];

    size_t sizeA = sizeof(floatTypeA) * elementsA;
    size_t sizeB = sizeof(floatTypeB) * elementsB;
    size_t sizeC = sizeof(floatTypeC) * elementsC;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);

    void *A_d, *B_d, *C_d;
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));

    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
    assert(uintptr_t(A_d) % kAlignment == 0);
    assert(uintptr_t(B_d) % kAlignment == 0);
    assert(uintptr_t(C_d) % kAlignment == 0);

    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);

    if (A == NULL || B == NULL || C == NULL)
    {
        printf("Error: Host allocation of A, B, or C.\n");
        return -1;
    }

    /*******************
     * Initialize data
     *******************/

    for (int64_t i = 0; i < elementsA; i++)
        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsB; i++)
        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsC; i++)
        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;

    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));

    /*************************
     * cuTENSOR
     *************************/

    cutensorHandle_t handle;
    HANDLE_ERROR(cutensorCreate(&handle));

    /**********************
     * Load plan cache
     **********************/

    // holds information about the per-handle plan cache
    const char planCacheFilename[] = "./planCache.bin";
    uint32_t numCachelines = 0;
    cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
    }
    else
    {
        printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
                numCachelines);
    }

    /**********************
     * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
     **********************/
    uint32_t numEntries = 128;
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));

    /**********************
     * Create Tensor Descriptors
     **********************/
    cutensorTensorDescriptor_t descA;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descA,
                 nmodeA,
                 extentA.data(),
                 NULL,/*stride*/
                 typeA, kAlignment));

    cutensorTensorDescriptor_t descB;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descB,
                 nmodeB,
                 extentB.data(),
                 NULL,/*stride*/
                 typeB, kAlignment));

    cutensorTensorDescriptor_t descC;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descC,
                 nmodeC,
                 extentC.data(),
                 NULL,/*stride*/
                 typeC, kAlignment));

    /*******************************
     * Create Contraction Descriptor
     *******************************/

    cutensorOperationDescriptor_t desc;
    HANDLE_ERROR(cutensorCreateContraction(handle,
                 &desc,
                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(),
                 descCompute));

    /**************************
     * PlanPreference: Set the algorithm to use and enable incremental autotuning
     ***************************/

    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

    cutensorPlanPreference_t planPref;
    HANDLE_ERROR(cutensorCreatePlanPreference(
                               handle,
                               &planPref,
                               algo,
                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation

    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
        &cacheMode,
        sizeof(cutensorCacheMode_t)));

    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
        &autotuneMode,
        sizeof(cutensorAutotuneMode_t)));

    const uint32_t incCount = 4;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
        &incCount,
        sizeof(uint32_t)));

    /**********************
     * Query workspace estimate
     **********************/

    uint64_t workspaceSizeEstimate = 0;
    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
                                          desc,
                                          planPref,
                                          workspacePref,
                                          &workspaceSizeEstimate));

    /**************************
     * Create Contraction Plan
     **************************/

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    uint64_t actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    void *work = nullptr;
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    double minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    double transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) couldn't be written to.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
                cutensorGetErrorString(status));
    }
    else
    {
        printf("Plan cache successfully stored to %s.\n", planCacheFilename);
    }

    HANDLE_ERROR(cutensorDestroy(handle));
    HANDLE_ERROR(cutensorDestroyPlan(plan));
    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (A_d) cudaFree(A_d);
    if (B_d) cudaFree(B_d);
    if (C_d) cudaFree(C_d);
    if (work) cudaFree(work);

    return 0;
}

Finally, let us add a second contraction loop, but this time we want a different cacheline to be used to cache the otherwise identical contraction: this can be useful if the state of the GPU's hardware caches differs substantially between the two calls (i.e., in a way that affects the measured runtime of the kernels). To do so, we use the CUTENSOR_OPERATION_DESCRIPTOR_TAG attribute:

uint32_t tag = 1;
HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
     handle,
     desc,
     CUTENSOR_OPERATION_DESCRIPTOR_TAG,
     &tag,
     sizeof(uint32_t)));

With this change in place, the example code now looks as follows:

#include <stdlib.h>
#include <stdio.h>

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensor.h>

#define HANDLE_ERROR(x)                                               \
{ const auto err = x;                                                 \
  if( err != CUTENSOR_STATUS_SUCCESS )                                \
  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
};

#define HANDLE_CUDA_ERROR(x)                                      \
{ const auto err = x;                                             \
  if( err != cudaSuccess )                                        \
  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
};

struct GPUTimer
{
    GPUTimer()
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, 0);
    }

    float seconds()
    {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }
    private:
    cudaEvent_t start_, stop_;
};

int main()
{
    typedef float floatTypeA;
    typedef float floatTypeB;
    typedef float floatTypeC;
    typedef float floatTypeCompute;

    cutensorDataType_t typeA = CUTENSOR_R_32F;
    cutensorDataType_t typeB = CUTENSOR_R_32F;
    cutensorDataType_t typeC = CUTENSOR_R_32F;
    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;

    floatTypeCompute alpha = (floatTypeCompute)1.1f;
    floatTypeCompute beta  = (floatTypeCompute)0.f;

    /**********************
     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
     **********************/

    std::vector<int> modeC{'m','u','n','v'};
    std::vector<int> modeA{'m','h','k','n'};
    std::vector<int> modeB{'u','k','v','h'};
    int nmodeA = modeA.size();
    int nmodeB = modeB.size();
    int nmodeC = modeC.size();

    std::unordered_map<int, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['v'] = 64;
    extent['h'] = 64;
    extent['k'] = 64;

    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;

    std::vector<int64_t> extentC;
    for (auto mode : modeC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentA;
    for (auto mode : modeA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modeB)
        extentB.push_back(extent[mode]);

    /**********************
     * Allocating data
     **********************/

    size_t elementsA = 1;
    for (auto mode : modeA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modeB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modeC)
        elementsC *= extent[mode];

    size_t sizeA = sizeof(floatTypeA) * elementsA;
    size_t sizeB = sizeof(floatTypeB) * elementsB;
    size_t sizeC = sizeof(floatTypeC) * elementsC;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);

    void *A_d, *B_d, *C_d;
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));

    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
    assert(uintptr_t(A_d) % kAlignment == 0);
    assert(uintptr_t(B_d) % kAlignment == 0);
    assert(uintptr_t(C_d) % kAlignment == 0);

    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);

    if (A == NULL || B == NULL || C == NULL)
    {
        printf("Error: Host allocation of A, B, or C.\n");
        return -1;
    }

    /*******************
     * Initialize data
     *******************/

    for (int64_t i = 0; i < elementsA; i++)
        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsB; i++)
        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
    for (int64_t i = 0; i < elementsC; i++)
        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;

    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));

    /*************************
     * cuTENSOR
     *************************/

    cutensorHandle_t handle;
    HANDLE_ERROR(cutensorCreate(&handle));

    /**********************
     * Load plan cache
     **********************/

    // holds information about the per-handle plan cache
    const char planCacheFilename[] = "./planCache.bin";
    uint32_t numCachelines = 0;
    cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
    }
    else
    {
        printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
                numCachelines);
    }

    /**********************
     * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
     **********************/
    uint32_t numEntries = 128;
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));

    /**********************
     * Create Tensor Descriptors
     **********************/
    cutensorTensorDescriptor_t descA;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descA,
                 nmodeA,
                 extentA.data(),
                 NULL,/*stride*/
                 typeA, kAlignment));

    cutensorTensorDescriptor_t descB;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descB,
                 nmodeB,
                 extentB.data(),
                 NULL,/*stride*/
                 typeB, kAlignment));

    cutensorTensorDescriptor_t descC;
    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
                 &descC,
                 nmodeC,
                 extentC.data(),
                 NULL,/*stride*/
                 typeC, kAlignment));

    /*******************************
     * Create Contraction Descriptor
     *******************************/

    cutensorOperationDescriptor_t desc;
    HANDLE_ERROR(cutensorCreateContraction(handle,
                 &desc,
                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
                 descC, modeC.data(),
                 descCompute));

    /**************************
     * PlanPreference: Set the algorithm to use and enable incremental autotuning
     ***************************/

    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

    cutensorPlanPreference_t planPref;
    HANDLE_ERROR(cutensorCreatePlanPreference(
                               handle,
                               &planPref,
                               algo,
                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation

    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
        &cacheMode,
        sizeof(cutensorCacheMode_t)));

    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
        &autotuneMode,
        sizeof(cutensorAutotuneMode_t)));

    const uint32_t incCount = 4;
    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
        handle,
        planPref,
        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
        &incCount,
        sizeof(uint32_t)));

    /**********************
     * Query workspace estimate
     **********************/

    uint64_t workspaceSizeEstimate = 0;
    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
                                          desc,
                                          planPref,
                                          workspacePref,
                                          &workspaceSizeEstimate));

    /**************************
     * Create Contraction Plan
     **************************/

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    uint64_t actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    void *work = nullptr;
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    double minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    double transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    uint32_t tag = 1;
    HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
         handle,
         desc,
         CUTENSOR_OPERATION_DESCRIPTOR_TAG,
         &tag,
         sizeof(uint32_t)));

    /**************************
     * Create Contraction Plan (with a different tag)
     **************************/

    HANDLE_ERROR(cutensorDestroyPlan(plan)); // release the previous plan before re-planning with the tag
    HANDLE_ERROR(cutensorCreatePlan(handle,
                 &plan,
                 desc,
                 planPref,
                 workspaceSizeEstimate));

    /**************************
     * Optional: Query information about the created plan
     **************************/

    // query actually used workspace
    actualWorkspaceSize = 0;
    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
        plan,
        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
        &actualWorkspaceSize,
        sizeof(actualWorkspaceSize)));

    // At this point the user knows exactly how much memory is needed by the operation and
    // only the smaller actual workspace needs to be allocated
    assert(actualWorkspaceSize <= workspaceSizeEstimate);

    if (work)
    {
        HANDLE_CUDA_ERROR(cudaFree(work));
        work = nullptr;
    }
    if (actualWorkspaceSize > 0)
    {
        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
    }

    /**********************
     * Run
     **********************/

    minTimeCUTENSOR = 1e100;
    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
    {
        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        // Set up timing
        GPUTimer timer;
        timer.start();

        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
        HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));

        // Synchronize and measure timing
        auto time = timer.seconds();

        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    /*************************/

    transferedBytes = sizeC + sizeA + sizeB;
    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
    transferedBytes /= 1e9;
    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);

    status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
    if (status == CUTENSOR_STATUS_IO_ERROR)
    {
        printf("File (%s) couldn't be written to.\n", planCacheFilename);
    }
    else if (status != CUTENSOR_STATUS_SUCCESS)
    {
        printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
                cutensorGetErrorString(status));
    }
    else
    {
        printf("Plan cache successfully stored to %s.\n", planCacheFilename);
    }

    HANDLE_ERROR(cutensorDestroy(handle));
    HANDLE_ERROR(cutensorDestroyPlan(plan));
    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (A_d) cudaFree(A_d);
    if (B_d) cudaFree(B_d);
    if (C_d) cudaFree(C_d);
    if (work) cudaFree(work);

    return 0;
}

You can invoke the binary once more to confirm that the cache now holds two entries; this time it should report that two cachelines were read from the file (./planCache.bin).

This concludes our plan-cache examples; you can find them (including timings and warm-up runs) in our samples repository.

If you have any further questions or suggestions, please do not hesitate to reach out to us.