Plan Cache
This section introduces the software-managed plan cache, which has the following key features:

- It minimizes launch-related overhead (e.g., due to kernel selection).
- It offers overhead-free autotuning (also known as incremental autotuning): this feature enables users to automatically find the best implementation for a given problem, thereby increasing performance.
- The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.
- The cache can be serialized and deserialized, allowing users to store the cache state to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (i.e., a cutensorOperationDescriptor_t) to an actual implementation (encoded by a cutensorPlan_t).
The remainder of this section assumes familiarity with the Getting Started guide.
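To make this lookup-table view concrete, here is a minimal sketch; it assumes that handle, desc, planPref, and workspaceSizeEstimate have been set up exactly as in the Getting Started guide. The first cutensorCreatePlan call performs the kernel selection and populates the cache, while the second call for the identical descriptor is served from the cache.

// Minimal sketch; handle, desc, planPref, and workspaceSizeEstimate are assumed
// to exist as in the Getting Started example.
cutensorPlan_t planFirst, planSecond;

// First creation: selects a kernel and inserts the result into the plan cache.
HANDLE_ERROR(cutensorCreatePlan(handle, &planFirst, desc, planPref, workspaceSizeEstimate));

// Second creation for the identical problem: served from the plan cache.
HANDLE_ERROR(cutensorCreatePlan(handle, &planSecond, desc, planPref, workspaceSizeEstimate));

HANDLE_ERROR(cutensorDestroyPlan(planFirst));
HANDLE_ERROR(cutensorDestroyPlan(planSecond));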
Note

The cache is active by default; it can be disabled via the CUTENSOR_DISABLE_PLAN_CACHE environment variable (see Environment Variables).
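If you prefer to disable the cache programmatically for a single process, a minimal sketch is shown below; it assumes that the environment variable is evaluated no later than handle creation, so it must be set before cutensorCreate is called (setting the variable in the shell before launching the application achieves the same effect).

#include <cstdlib>  // setenv

// Sketch (assumption: the variable is read when the handle is created,
// so it has to be set beforehand).
setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", /*overwrite=*/1);

cutensorHandle_t handle;
HANDLE_ERROR(cutensorCreate(&handle));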
Incremental Autotuning
The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given operation.

When the cache is used together with incremental autotuning (CUTENSOR_AUTOTUNE_MODE_INCREMENTAL), successive invocations of the same operation (albeit with potentially different data pointers) will be performed by different candidates; the timings of those candidates are measured automatically and the fastest candidate is stored in the plan cache. The number of candidates to explore is configurable by the user (via CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT); all subsequent calls for the same problem will then be mapped to the fastest (measured) candidate stored in the cache.
This autotuning approach has some key advantages:

- Candidates are evaluated at a point in time when the hardware caches are in a production-like state (i.e., the state of the hardware caches reflects the real-world situation).
- The overhead is minimal (i.e., no timing loop, no synchronization).

Moreover, candidates are evaluated in the order given by our performance model (i.e., from fastest to slowest).
Incremental autotuning is especially powerful when combined with cuTENSOR's cache serialization feature (via cutensorHandleWritePlanCacheToFile and cutensorHandleReadPlanCacheFromFile), since the tuned cache can be written to disk and reused.
Note

We recommend warming up the GPU (i.e., reaching steady-state performance) before the autotuning takes place, in order to minimize fluctuations in the measured performance.
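A warm-up does not require any cuTENSOR-specific API; any sustained GPU work issued before the autotuned, timed contractions will do. The sketch below is one minimal option and assumes C_d and sizeC from the examples later in this section (the buffer is overwritten, which is harmless there because it is re-initialized before every timed run).

// Sketch of a GPU warm-up: issue some untimed GPU work to reach steady-state clocks.
constexpr int numWarmupRuns = 3; // illustrative value
for (int i = 0; i < numWarmupRuns; ++i)
{
    HANDLE_CUDA_ERROR(cudaMemset(C_d, 0, sizeC));
}
HANDLE_CUDA_ERROR(cudaDeviceSynchronize()); // ensure the warm-up has finished before autotuning starts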
Introductory Example
This subsection gives an overview of the cache-related API calls and features. In addition to the steps outlined in the Getting Started guide, in this example we also:

- set a suitable cache size, and
- configure the caching behavior on a per-contraction basis (via cutensorPlanPreferenceSetAttribute).

Let us start with the same example outlined in the Getting Started guide: since caching is enabled by default in cuTENSOR 2.x, that example already takes advantage of the cache. While optional, the following code demonstrates how to adjust the cache size from its implementation-defined initial value:
// Set the plan-cache size
constexpr uint32_t numEntries = 128;
HANDLE_ERROR( cutensorHandleResizePlanCache(handle, numEntries) );
// ...
Note that the number of entries is user-configurable; ideally, the cache should be large enough to provide sufficient capacity for every distinct contraction call of your application. Since this may not always be possible (e.g., due to memory constraints), cuTENSOR's plan cache evicts entries using a least-recently-used (LRU) policy. Users also have the option to disable caching on a per-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note that the cache lookup happens at the time the plan is created. Hence, this technique is especially useful if the same contraction is planned multiple times within the same application.
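The sketch below illustrates this; it assumes handle, desc, planPref, and workspaceSizeEstimate from the introductory example, and that the cache entry is owned by the handle and therefore survives destruction of the plan object. Only the first cutensorCreatePlan call pays the full planning cost, while the following ones are served from the cache.

#include <chrono>

// Sketch: plan the same contraction several times and observe the cache hit.
for (int i = 0; i < 3; ++i)
{
    auto t0 = std::chrono::steady_clock::now();

    cutensorPlan_t plan;
    HANDLE_ERROR(cutensorCreatePlan(handle, &plan, desc, planPref, workspaceSizeEstimate));

    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("Plan creation %d took %.3f ms\n", i, ms); // iterations after the first should be much cheaper

    HANDLE_ERROR(cutensorDestroyPlan(plan)); // the cache entry in the handle is not affected
}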
To disable caching for a particular contraction (i.e., to opt out), the corresponding cutensorPlanPreference_t needs to be modified as follows:
const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
    &cacheMode,
    sizeof(cutensorCacheMode_t)));
This concludes the introductory example.
Advanced Example
This example extends the previous one and explains how to:

- take advantage of incremental autotuning;
- warm up the GPU (i.e., reach steady-state performance) before autotuning starts, to avoid large fluctuations in the measured performance;
- use tags to distinguish two otherwise identical tensor contractions: this is useful if the GPU's hardware cache state is (likely) substantially different between the two calls (e.g., because one of the operands was just read or written by a preceding call) and the cache state is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions);
- write the plan cache state to a file and read it back in.
Let us start by enabling incremental autotuning. To do so, we modify the cutensorPlanPreference_t as follows:
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
    &autotuneMode,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));
The first call to cutensorPlanPreferenceSetAttribute enables incremental autotuning, while the second sets CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls look the plan up from the plan cache. Higher values of incCount explore more candidates and therefore incur a larger initial overhead, but they can also lead to better performance if that initial overhead can be amortized (e.g., when the tuned cache is written to disk). We feel that a CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT of 4 is a good default value.
The following code incorporates these changes:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <unordered_map>
5#include <vector>
6#include <cassert>
7
8#include <cuda_runtime.h>
9#include <cutensor.h>
10
11#define HANDLE_ERROR(x) \
12{ const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15};
16
17#define HANDLE_CUDA_ERROR(x) \
18{ const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21};
22
23struct GPUTimer
24{
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53};
54
55int main()
56{
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163  * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
164 **********************/
165 uint32_t numEntries = 128;
166 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
167
168 /**********************
169 * Create Tensor Descriptors
170 **********************/
171 cutensorTensorDescriptor_t descA;
172 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
173 &descA,
174 nmodeA,
175 extentA.data(),
176 NULL,/*stride*/
177 typeA, kAlignment));
178
179 cutensorTensorDescriptor_t descB;
180 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
181 &descB,
182 nmodeB,
183 extentB.data(),
184 NULL,/*stride*/
185 typeB, kAlignment));
186
187 cutensorTensorDescriptor_t descC;
188 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
189 &descC,
190 nmodeC,
191 extentC.data(),
192 NULL,/*stride*/
193 typeC, kAlignment));
194
195 /*******************************
196 * Create Contraction Descriptor
197 *******************************/
198
199 cutensorOperationDescriptor_t desc;
200 HANDLE_ERROR(cutensorCreateContraction(handle,
201 &desc,
202 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
203 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
204 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
205 descC, modeC.data(),
206 descCompute));
207
208 /**************************
209 * PlanPreference: Set the algorithm to use and enable incremental autotuning
210 ***************************/
211
212 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
213
214 cutensorPlanPreference_t planPref;
215 HANDLE_ERROR(cutensorCreatePlanPreference(
216 handle,
217 &planPref,
218 algo,
219 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
220
221 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
222 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
223 handle,
224 planPref,
225 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
226 &cacheMode,
227 sizeof(cutensorCacheMode_t)));
228
229 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
230 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
231 handle,
232 planPref,
233 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
234 &autotuneMode ,
235 sizeof(cutensorAutotuneMode_t)));
236
237 const uint32_t incCount = 4;
238 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
239 handle,
240 planPref,
241 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
242 &incCount,
243 sizeof(uint32_t)));
244
245 /**********************
246 * Query workspace estimate
247 **********************/
248
249 uint64_t workspaceSizeEstimate = 0;
250 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
251 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
252 desc,
253 planPref,
254 workspacePref,
255 &workspaceSizeEstimate));
256
257 /**************************
258 * Create Contraction Plan
259 **************************/
260
261 cutensorPlan_t plan;
262 HANDLE_ERROR(cutensorCreatePlan(handle,
263 &plan,
264 desc,
265 planPref,
266 workspaceSizeEstimate));
267
268 /**************************
269 * Optional: Query information about the created plan
270 **************************/
271
272 // query actually used workspace
273 uint64_t actualWorkspaceSize = 0;
274 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
275 plan,
276 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
277 &actualWorkspaceSize,
278 sizeof(actualWorkspaceSize)));
279
280     // At this point the user knows exactly how much memory is needed by the operation and
281 // only the smaller actual workspace needs to be allocated
282 assert(actualWorkspaceSize <= workspaceSizeEstimate);
283
284 void *work = nullptr;
285 if (actualWorkspaceSize > 0)
286 {
287 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
288 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
289 }
290
291 /**********************
292 * Run
293 **********************/
294
295 double minTimeCUTENSOR = 1e100;
296 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
297 {
298 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
299 cudaDeviceSynchronize();
300
301 // Set up timing
302 GPUTimer timer;
303 timer.start();
304
305 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
306 HANDLE_ERROR(cutensorContract(handle,
307 plan,
308 (void*) &alpha, A_d, B_d,
309 (void*) &beta, C_d, C_d,
310 work, actualWorkspaceSize, 0 /* stream */));
311
312 // Synchronize and measure timing
313 auto time = timer.seconds();
314
315 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
316 }
317
318 /*************************/
319
320 double transferedBytes = sizeC + sizeA + sizeB;
321 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
322 transferedBytes /= 1e9;
323 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
324
325 HANDLE_ERROR(cutensorDestroy(handle));
326 HANDLE_ERROR(cutensorDestroyPlan(plan));
327 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
328 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
329 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
330 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
331
332 if (A) free(A);
333 if (B) free(B);
334 if (C) free(C);
335 if (A_d) cudaFree(A_d);
336 if (B_d) cudaFree(B_d);
337 if (C_d) cudaFree(C_d);
338 if (work) cudaFree(work);
339
340 return 0;
341}
Let us further extend the example by writing the plan cache to a file and reading it back in (if it has been written previously):
const char planCacheFilename[] = "./planCache.bin";
uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
}
else
{
printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
numCachelines);
}
// ...
status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) couldn't be written to.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
cutensorGetErrorString(status));
}
else
{
printf("Plan cache successfully stored to %s.\n", planCacheFilename);
}
Warning

cutensorHandleReadPlanCacheFromFile only succeeds if the plan cache is sufficiently large to hold all cachelines stored in the file; otherwise, CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is reported via the numCachelinesRead argument.
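One possible way to recover in that case, sketched below under the assumption that the value reported via the third argument can be passed directly to cutensorHandleResizePlanCache, is to grow the cache accordingly and retry the read:

uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
                              planCacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // The file holds more cachelines than the current cache can store:
    // resize the cache to the reported size and retry the read.
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numCachelinesRead));
    HANDLE_ERROR(cutensorHandleReadPlanCacheFromFile(handle,
                     planCacheFilename, &numCachelinesRead));
}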
With these changes in place, the example now looks as follows:
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 #include <unordered_map>
5 #include <vector>
6 #include <cassert>
7
8 #include <cuda_runtime.h>
9 #include <cutensor.h>
10
11 #define HANDLE_ERROR(x) \
12 { const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15 };
16
17 #define HANDLE_CUDA_ERROR(x) \
18 { const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21 };
22
23 struct GPUTimer
24 {
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53 };
54
55 int main()
56 {
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Load plan cache
164 **********************/
165
166 // holds information about the per-handle plan cache
167 const char planCacheFilename[] = "./planCache.bin";
168 uint32_t numCachelines = 0;
169 cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170 planCacheFilename, &numCachelines);
171 if (status == CUTENSOR_STATUS_IO_ERROR)
172 {
173 printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174 }
175 else if (status != CUTENSOR_STATUS_SUCCESS)
176 {
177 printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178 }
179 else
180 {
181 printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182 numCachelines);
183 }
184
185 /**********************
186  * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
187 **********************/
188 uint32_t numEntries = 128;
189 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191 /**********************
192 * Create Tensor Descriptors
193 **********************/
194 cutensorTensorDescriptor_t descA;
195 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196 &descA,
197 nmodeA,
198 extentA.data(),
199 NULL,/*stride*/
200 typeA, kAlignment));
201
202 cutensorTensorDescriptor_t descB;
203 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204 &descB,
205 nmodeB,
206 extentB.data(),
207 NULL,/*stride*/
208 typeB, kAlignment));
209
210 cutensorTensorDescriptor_t descC;
211 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212 &descC,
213 nmodeC,
214 extentC.data(),
215 NULL,/*stride*/
216 typeC, kAlignment));
217
218 /*******************************
219 * Create Contraction Descriptor
220 *******************************/
221
222 cutensorOperationDescriptor_t desc;
223 HANDLE_ERROR(cutensorCreateContraction(handle,
224 &desc,
225 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228 descC, modeC.data(),
229 descCompute));
230
231 /**************************
232 * PlanPreference: Set the algorithm to use and enable incremental autotuning
233 ***************************/
234
235 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237 cutensorPlanPreference_t planPref;
238 HANDLE_ERROR(cutensorCreatePlanPreference(
239 handle,
240 &planPref,
241 algo,
242 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246 handle,
247 planPref,
248 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249 &cacheMode,
250 sizeof(cutensorCacheMode_t)));
251
252 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254 handle,
255 planPref,
256 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257 &autotuneMode ,
258 sizeof(cutensorAutotuneMode_t)));
259
260 const uint32_t incCount = 4;
261 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262 handle,
263 planPref,
264 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265 &incCount,
266 sizeof(uint32_t)));
267
268 /**********************
269 * Query workspace estimate
270 **********************/
271
272 uint64_t workspaceSizeEstimate = 0;
273 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275 desc,
276 planPref,
277 workspacePref,
278 &workspaceSizeEstimate));
279
280 /**************************
281 * Create Contraction Plan
282 **************************/
283
284 cutensorPlan_t plan;
285 HANDLE_ERROR(cutensorCreatePlan(handle,
286 &plan,
287 desc,
288 planPref,
289 workspaceSizeEstimate));
290
291 /**************************
292 * Optional: Query information about the created plan
293 **************************/
294
295 // query actually used workspace
296 uint64_t actualWorkspaceSize = 0;
297 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298 plan,
299 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300 &actualWorkspaceSize,
301 sizeof(actualWorkspaceSize)));
302
303     // At this point the user knows exactly how much memory is needed by the operation and
304 // only the smaller actual workspace needs to be allocated
305 assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307 void *work = nullptr;
308 if (actualWorkspaceSize > 0)
309 {
310 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312 }
313
314 /**********************
315 * Run
316 **********************/
317
318 double minTimeCUTENSOR = 1e100;
319 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320 {
321 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322 cudaDeviceSynchronize();
323
324 // Set up timing
325 GPUTimer timer;
326 timer.start();
327
328 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329 HANDLE_ERROR(cutensorContract(handle,
330 plan,
331 (void*) &alpha, A_d, B_d,
332 (void*) &beta, C_d, C_d,
333 work, actualWorkspaceSize, 0 /* stream */));
334
335 // Synchronize and measure timing
336 auto time = timer.seconds();
337
338 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339 }
340
341 /*************************/
342
343 double transferedBytes = sizeC + sizeA + sizeB;
344 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345 transferedBytes /= 1e9;
346 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348 status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
349 if (status == CUTENSOR_STATUS_IO_ERROR)
350 {
351 printf("File (%s) couldn't be written to.\n", planCacheFilename);
352 }
353 else if (status != CUTENSOR_STATUS_SUCCESS)
354 {
355 printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
356 cutensorGetErrorString(status));
357 }
358 else
359 {
360 printf("Plan cache successfully stored to %s.\n", planCacheFilename);
361 }
362
363
364 HANDLE_ERROR(cutensorDestroy(handle));
365 HANDLE_ERROR(cutensorDestroyPlan(plan));
366 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
367 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
368 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
369 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
370
371 if (A) free(A);
372 if (B) free(B);
373 if (C) free(C);
374 if (A_d) cudaFree(A_d);
375 if (B_d) cudaFree(B_d);
376 if (C_d) cudaFree(C_d);
377 if (work) cudaFree(work);
378
379 return 0;
380 }
Finally, let us add a second contraction loop, but this time we want a different cacheline to be used to cache the otherwise identical contraction: this can be useful if the hardware cache state of the GPU differs substantially between the two calls (i.e., in a way that affects the measured kernel runtimes). To do so, we use the CUTENSOR_OPERATION_DESCRIPTOR_TAG attribute:
uint32_t tag = 1;
HANDLE_ERROR(cutensorOperationDescriptorSetAttribute(
    handle,
    desc,
    CUTENSOR_OPERATION_DESCRIPTOR_TAG,
    &tag,
    sizeof(uint32_t)));
With this change in place, the example code now looks as follows:
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 #include <unordered_map>
5 #include <vector>
6 #include <cassert>
7
8 #include <cuda_runtime.h>
9 #include <cutensor.h>
10
11 #define HANDLE_ERROR(x) \
12 { const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15 };
16
17 #define HANDLE_CUDA_ERROR(x) \
18 { const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21 };
22
23 struct GPUTimer
24 {
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53 };
54
55 int main()
56 {
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Load plan cache
164 **********************/
165
166 // holds information about the per-handle plan cache
167 const char planCacheFilename[] = "./planCache.bin";
168 uint32_t numCachelines = 0;
169 cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170 planCacheFilename, &numCachelines);
171 if (status == CUTENSOR_STATUS_IO_ERROR)
172 {
173 printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174 }
175 else if (status != CUTENSOR_STATUS_SUCCESS)
176 {
177 printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178 }
179 else
180 {
181 printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182 numCachelines);
183 }
184
185 /**********************
186  * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
187 **********************/
188 uint32_t numEntries = 128;
189 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191 /**********************
192 * Create Tensor Descriptors
193 **********************/
194 cutensorTensorDescriptor_t descA;
195 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196 &descA,
197 nmodeA,
198 extentA.data(),
199 NULL,/*stride*/
200 typeA, kAlignment));
201
202 cutensorTensorDescriptor_t descB;
203 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204 &descB,
205 nmodeB,
206 extentB.data(),
207 NULL,/*stride*/
208 typeB, kAlignment));
209
210 cutensorTensorDescriptor_t descC;
211 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212 &descC,
213 nmodeC,
214 extentC.data(),
215 NULL,/*stride*/
216 typeC, kAlignment));
217
218 /*******************************
219 * Create Contraction Descriptor
220 *******************************/
221
222 cutensorOperationDescriptor_t desc;
223 HANDLE_ERROR(cutensorCreateContraction(handle,
224 &desc,
225 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228 descC, modeC.data(),
229 descCompute));
230
231 /**************************
232 * PlanPreference: Set the algorithm to use and enable incremental autotuning
233 ***************************/
234
235 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237 cutensorPlanPreference_t planPref;
238 HANDLE_ERROR(cutensorCreatePlanPreference(
239 handle,
240 &planPref,
241 algo,
242 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246 handle,
247 planPref,
248 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249 &cacheMode,
250 sizeof(cutensorCacheMode_t)));
251
252 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254 handle,
255 planPref,
256 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257 &autotuneMode ,
258 sizeof(cutensorAutotuneMode_t)));
259
260 const uint32_t incCount = 4;
261 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262 handle,
263 planPref,
264 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265 &incCount,
266 sizeof(uint32_t)));
267
268 /**********************
269 * Query workspace estimate
270 **********************/
271
272 uint64_t workspaceSizeEstimate = 0;
273 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275 desc,
276 planPref,
277 workspacePref,
278 &workspaceSizeEstimate));
279
280 /**************************
281 * Create Contraction Plan
282 **************************/
283
284 cutensorPlan_t plan;
285 HANDLE_ERROR(cutensorCreatePlan(handle,
286 &plan,
287 desc,
288 planPref,
289 workspaceSizeEstimate));
290
291 /**************************
292 * Optional: Query information about the created plan
293 **************************/
294
295 // query actually used workspace
296 uint64_t actualWorkspaceSize = 0;
297 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298 plan,
299 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300 &actualWorkspaceSize,
301 sizeof(actualWorkspaceSize)));
302
303     // At this point the user knows exactly how much memory is needed by the operation and
304 // only the smaller actual workspace needs to be allocated
305 assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307 void *work = nullptr;
308 if (actualWorkspaceSize > 0)
309 {
310 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312 }
313
314 /**********************
315 * Run
316 **********************/
317
318 double minTimeCUTENSOR = 1e100;
319 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320 {
321 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322 cudaDeviceSynchronize();
323
324 // Set up timing
325 GPUTimer timer;
326 timer.start();
327
328 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329 HANDLE_ERROR(cutensorContract(handle,
330 plan,
331 (void*) &alpha, A_d, B_d,
332 (void*) &beta, C_d, C_d,
333 work, actualWorkspaceSize, 0 /* stream */));
334
335 // Synchronize and measure timing
336 auto time = timer.seconds();
337
338 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339 }
340
341 /*************************/
342
343 double transferedBytes = sizeC + sizeA + sizeB;
344 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345 transferedBytes /= 1e9;
346 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348 uint32_t tag = 1;
349 HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
350                  handle,
351                  desc,
352 CUTENSOR_OPERATION_DESCRIPTOR_TAG,
353 &tag,
354 sizeof(uint32_t)));
355
356 /**************************
357 * Create Contraction Plan (with a different tag)
358 **************************/
359
360     HANDLE_ERROR(cutensorDestroyPlan(plan)); // destroy the previous plan before re-planning with the new tag
361 HANDLE_ERROR(cutensorCreatePlan(handle,
362 &plan,
363 desc,
364 planPref,
365 workspaceSizeEstimate));
366
367 /**************************
368 * Optional: Query information about the created plan
369 **************************/
370
371 // query actually used workspace
372     actualWorkspaceSize = 0; // reuse the variable declared earlier
373 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
374 plan,
375 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
376 &actualWorkspaceSize,
377 sizeof(actualWorkspaceSize)));
378
379     // At this point the user knows exactly how much memory is needed by the operation and
380 // only the smaller actual workspace needs to be allocated
381 assert(actualWorkspaceSize <= workspaceSizeEstimate);
382
383     if (work) cudaFree(work); work = nullptr; // release the previous workspace before reallocating
384 if (actualWorkspaceSize > 0)
385 {
386 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
387 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
388 }
389
390 /**********************
391 * Run
392 **********************/
393
394     minTimeCUTENSOR = 1e100; // reset the minimum time for the tagged contraction
395 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
396 {
397 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
398 cudaDeviceSynchronize();
399
400 // Set up timing
401 GPUTimer timer;
402 timer.start();
403
404 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
405 HANDLE_ERROR(cutensorContract(handle,
406 plan,
407 (void*) &alpha, A_d, B_d,
408 (void*) &beta, C_d, C_d,
409 work, actualWorkspaceSize, 0 /* stream */));
410
411 // Synchronize and measure timing
412 auto time = timer.seconds();
413
414 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
415 }
416
417 /*************************/
418
419 double transferedBytes = sizeC + sizeA + sizeB;
420 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
421 transferedBytes /= 1e9;
422 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
423
424 status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
425 if (status == CUTENSOR_STATUS_IO_ERROR)
426 {
427 printf("File (%s) couldn't be written to.\n", planCacheFilename);
428 }
429 else if (status != CUTENSOR_STATUS_SUCCESS)
430 {
431 printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
432 cutensorGetErrorString(status));
433 }
434 else
435 {
436 printf("Plan cache successfully stored to %s.\n", planCacheFilename);
437 }
438
439
440 HANDLE_ERROR(cutensorDestroy(handle));
441 HANDLE_ERROR(cutensorDestroyPlan(plan));
442 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
443 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
444 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
445 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
446
447 if (A) free(A);
448 if (B) free(B);
449 if (C) free(C);
450 if (A_d) cudaFree(A_d);
451 if (B_d) cudaFree(B_d);
452 if (C_d) cudaFree(C_d);
453 if (work) cudaFree(work);
454
455 return 0;
456 }
You can invoke the binary a second time to confirm that the cache now holds two entries; on that run it should report that 2 cachelines were read from the file.
This concludes our example of the plan cache; you can find these examples (including timing and warm-up runs) in the samples repository.

Please reach out to us if you have any further questions or suggestions.