nvrtc

NVRTC 库的用户指南。

1. 简介

NVRTC 是 CUDA C++ 的运行时编译库。它接受字符字符串形式的 CUDA C++ 源代码，并创建可用于获取 PTX 的句柄。NVRTC 生成的 PTX 字符串可以由 cuModuleLoadData 和 cuModuleLoadDataEx 加载，并使用 nvJitLink 库或 CUDA Driver API 的 cuLinkAddData 与其他模块链接。这种机制通常可以提供纯粹的离线静态编译无法实现的优化和性能。

在没有 NVRTC（或 CUDA 中任何运行时编译支持）的情况下，如果用户希望在其应用程序或库中实现运行时编译，则需要在运行时生成一个单独的进程来执行 nvcc，但不幸的是，这种方法有以下缺点：

编译开销往往高于必要水平。
最终用户需要安装 nvcc 和相关工具，这使得分发使用运行时编译的应用程序变得复杂。

NVRTC 通过提供一个库接口来解决这些问题，该接口消除了与生成单独进程、磁盘 I/O 等相关的开销，同时保持应用程序部署的简单性。

2. 入门指南

2.1. 系统要求

NVRTC 在以下平台受支持：Linux x86_64、Linux ppc64le、Linux aarch64、Windows x86_64。

注意：NVRTC 不依赖于 CUDA 工具包中的任何其他库或头文件，并且可以在没有 GPU 的系统上运行。

2.2. 安装

NVRTC 是 CUDA 工具包发布版的一部分，组件按以下方式组织在 CUDA 工具包安装目录中：

在 Windows 上
- include\nvrtc.h
- bin\nvrtc64_Major Release Version_0.dll
- bin\nvrtc-builtins64_Major Release VersionMinor Release Version.dll
- lib\x64\nvrtc.lib
- lib\x64\nvrtc_static.lib
- lib\x64\nvrtc-builtins_static.lib
- doc\pdf\NVRTC_User_Guide.pdf
在 Linux 上
- include/nvrtc.h
- lib64/libnvrtc.so
- lib64/libnvrtc.so.Major Release Version
- lib64/libnvrtc.so.Major Release Version.Minor Release Version.<构建版本>
- lib64/libnvrtc-builtins.so
- lib64/libnvrtc-builtins.so.Major Release Version.Minor Release Version
- lib64/libnvrtc-builtins.so.Major Release Version.Minor Release Version.<构建版本>
- lib64/libnvrtc_static.a
- lib64/libnvrtc-builtins_static.a
- doc/pdf/NVRTC_User_Guide.pdf

3. 用户界面

本章介绍 NVRTC 的 API。API 的基本用法在基本用法中进行了解释。

错误处理

通用信息查询

编译

支持的编译选项

预编译头文件 (PCH) (CUDA 12.8+)

主机助手

3.1. 错误处理

NVRTC 定义了以下枚举类型和函数，用于 API 调用错误处理。

枚举

nvrtcResult: 枚举类型 nvrtcResult 定义 API 调用结果代码。

函数

const char * nvrtcGetErrorString(nvrtcResult result): nvrtcGetErrorString 是一个辅助函数，它返回一个描述给定 nvrtcResult 代码的字符串，例如，NVRTC_SUCCESS 对应于 "NVRTC_SUCCESS" 。

3.1.1. 枚举

enum nvrtcResult

枚举类型 nvrtcResult 定义 API 调用结果代码。

NVRTC API 函数返回 nvrtcResult 以指示调用结果。

值

enumerator NVRTC_SUCCESS

enumerator NVRTC_ERROR_OUT_OF_MEMORY

enumerator NVRTC_ERROR_PROGRAM_CREATION_FAILURE

enumerator NVRTC_ERROR_INVALID_INPUT

enumerator NVRTC_ERROR_INVALID_PROGRAM

enumerator NVRTC_ERROR_INVALID_OPTION

enumerator NVRTC_ERROR_COMPILATION

enumerator NVRTC_ERROR_BUILTIN_OPERATION_FAILURE

enumerator NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION

enumerator NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION

enumerator NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID

enumerator NVRTC_ERROR_INTERNAL_ERROR

enumerator NVRTC_ERROR_TIME_FILE_WRITE_FAILED

enumerator NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED

enumerator NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED

enumerator NVRTC_ERROR_PCH_CREATE

enumerator NVRTC_ERROR_CANCELLED

3.1.2. 函数

const char *nvrtcGetErrorString(nvrtcResult result)

nvrtcGetErrorString 是一个辅助函数，它返回一个描述给定 nvrtcResult 代码的字符串，例如，NVRTC_SUCCESS 对应于 "NVRTC_SUCCESS"。

对于无法识别的枚举值，它返回 "NVRTC_ERROR unknown"。

参数: result – [in] CUDA 运行时编译 API 结果代码。
返回值: 给定 nvrtcResult 代码的消息字符串。

3.2. 通用信息查询

NVRTC 定义了以下函数用于通用信息查询。

函数

nvrtcResult nvrtcGetNumSupportedArchs(int *numArchs): nvrtcGetNumSupportedArchs 使用 NVRTC 支持的架构数量设置输出参数 numArchs。
nvrtcResult nvrtcGetSupportedArchs(int *supportedArchs): nvrtcGetSupportedArchs 使用 NVRTC 支持的架构填充通过输出参数 supportedArchs 传递的数组。
nvrtcResult nvrtcVersion(int *major, int *minor): nvrtcVersion 使用 CUDA 运行时编译版本号设置输出参数 major 和 minor。

3.2.1. 函数

nvrtcResult nvrtcGetNumSupportedArchs(int *numArchs)

nvrtcGetNumSupportedArchs 使用 NVRTC 支持的架构数量设置输出参数 numArchs。

然后可以使用它将数组传递给 nvrtcGetSupportedArchs 以获取支持的架构。

参见 nvrtcGetSupportedArchs

参数

numArchs – [out] 支持的架构数量。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

nvrtcResult nvrtcGetSupportedArchs(int *supportedArchs)

nvrtcGetSupportedArchs 使用 NVRTC 支持的架构填充通过输出参数 supportedArchs 传递的数组。

该数组按升序排序。要传递的数组大小可以使用 nvrtcGetNumSupportedArchs 确定。

参见 nvrtcGetNumSupportedArchs

参数

supportedArchs – [out] 支持的架构的排序数组。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

nvrtcResult nvrtcVersion(int *major, int *minor)

nvrtcVersion 使用 CUDA 运行时编译版本号设置输出参数 major 和 minor。

参数

major – [out] CUDA 运行时编译主版本号。
minor – [out] CUDA 运行时编译次版本号。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

3.3. 编译

NVRTC 定义了以下类型和函数，用于实际编译。

函数

nvrtcResult nvrtcAddNameExpression(nvrtcProgram prog, const char *const name_expression): nvrtcAddNameExpression 记录给定的名称表达式，该表达式表示 global 函数或 device /__constant__ 变量的地址。
nvrtcResult nvrtcCompileProgram(nvrtcProgram prog, int numOptions, const char *const *options): nvrtcCompileProgram 编译给定的程序。
nvrtcResult nvrtcCreateProgram(nvrtcProgram *prog, const char *src, const char *name, int numHeaders, const char *const *headers, const char *const *includeNames): nvrtcCreateProgram 使用给定的输入参数创建 nvrtcProgram 的实例，并使用它设置输出参数 prog。
nvrtcResult nvrtcDestroyProgram(nvrtcProgram *prog): nvrtcDestroyProgram 销毁给定的程序。
nvrtcResult nvrtcGetCUBIN(nvrtcProgram prog, char *cubin): nvrtcGetCUBIN 将先前编译 prog 生成的 cubin 存储在 cubin 指向的内存中。
nvrtcResult nvrtcGetCUBINSize(nvrtcProgram prog, size_t *cubinSizeRet): nvrtcGetCUBINSize 使用先前编译 prog 生成的 cubin 的大小设置 cubinSizeRet 的值。
nvrtcResult nvrtcGetLTOIR(nvrtcProgram prog, char *LTOIR): nvrtcGetLTOIR 将先前编译 prog 生成的 LTO IR 存储在 LTOIR 指向的内存中。
nvrtcResult nvrtcGetLTOIRSize(nvrtcProgram prog, size_t *LTOIRSizeRet): nvrtcGetLTOIRSize 使用先前编译 prog 生成的 LTO IR 的大小设置 LTOIRSizeRet 的值。
nvrtcResult nvrtcGetLoweredName(nvrtcProgram prog, const char *const name_expression, const char **lowered_name): nvrtcGetLoweredName 提取 global 函数或 device /__constant__ 变量的降级（已修改）名称，并更新 *lowered_name 以指向它。
nvrtcResult nvrtcGetNVVM(nvrtcProgram prog, char *nvvm): 弃用通知：此函数将在未来的版本中移除。
nvrtcResult nvrtcGetNVVMSize(nvrtcProgram prog, size_t *nvvmSizeRet): 弃用通知：此函数将在未来的版本中移除。
nvrtcResult nvrtcGetOptiXIR(nvrtcProgram prog, char *optixir): nvrtcGetOptiXIR 将先前编译 prog 生成的 OptiX IR 存储在 optixir 指向的内存中。
nvrtcResult nvrtcGetOptiXIRSize(nvrtcProgram prog, size_t *optixirSizeRet): nvrtcGetOptiXIRSize 使用先前编译 prog 生成的 OptiX IR 的大小设置 optixirSizeRet 的值。
nvrtcResult nvrtcGetPTX(nvrtcProgram prog, char *ptx): nvrtcGetPTX 将先前编译 prog 生成的 PTX 存储在 ptx 指向的内存中。
nvrtcResult nvrtcGetPTXSize(nvrtcProgram prog, size_t *ptxSizeRet): nvrtcGetPTXSize 使用先前编译 prog 生成的 PTX 的大小设置 ptxSizeRet 的值（包括尾部的 NULL ）。
nvrtcResult nvrtcGetProgramLog(nvrtcProgram prog, char *log): nvrtcGetProgramLog 将先前编译 prog 生成的日志存储在 log 指向的内存中。
nvrtcResult nvrtcGetProgramLogSize(nvrtcProgram prog, size_t *logSizeRet): nvrtcGetProgramLogSize 使用先前编译 prog 生成的日志的大小设置 logSizeRet（包括尾部的 NULL ）。
nvrtcResult nvrtcSetFlowCallback(nvrtcProgram prog, int(*callback)(void *, void *), void *payload): nvrtcSetFlowCallback 注册一个回调函数，编译器将在调用 nvrtcCompileProgram 期间的不同点调用该函数，并且回调函数可以决定是否通过返回特定值来取消编译。

类型定义

nvrtcProgram: nvrtcProgram 是编译单元，也是程序的不透明句柄。

3.3.1. 函数

nvrtcResult nvrtcAddNameExpression(nvrtcProgram prog, const char *const name_expression)

nvrtcAddNameExpression 记录给定的名称表达式，该表达式表示 global 函数或 device/__constant__ 变量的地址。

必须在后续调用 nvrtcGetLoweredName 时提供相同的名称表达式字符串，以提取降级名称。

另请参阅

nvrtcGetLoweredName

参数

prog – [in] CUDA 运行时编译程序。
name_expression – [in] 表示 global 函数或 device/__constant__ 变量地址的常量表达式。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION

nvrtcResult nvrtcCompileProgram(nvrtcProgram prog, int numOptions, const char *const *options)

nvrtcCompileProgram 编译给定的程序。

它支持支持的编译选项中列出的编译选项。

参数

prog – [in] CUDA 运行时编译程序。
numOptions – [in] 传递的编译器选项的数量。
options – [in] C 字符串数组形式的编译器选项。options 在 numOptions 为 0 时可以为 NULL。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_OUT_OF_MEMORY
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_OPTION
NVRTC_ERROR_COMPILATION
NVRTC_ERROR_BUILTIN_OPERATION_FAILURE
NVRTC_ERROR_TIME_FILE_WRITE_FAILED
NVRTC_ERROR_CANCELLED

nvrtcResult nvrtcCreateProgram(nvrtcProgram *prog, const char *src, const char *name, int numHeaders, const char *const *headers, const char *const *includeNames)

nvrtcCreateProgram 使用给定的输入参数创建 nvrtcProgram 的实例，并使用它设置输出参数 prog。

另请参阅

nvrtcDestroyProgram

参数

prog – [out] CUDA 运行时编译程序。
src – [in] CUDA 程序源代码。
name – [in] CUDA 程序名称。name 可以是 NULL；当 name 为 NULL 或 “”（空字符串）时，将使用 "default_program"。
numHeaders – [in] 使用的头文件数量。numHeaders 必须大于或等于 0。
headers – [in] 头文件的源。headers 当 numHeaders 为 0 时，可以是 NULL。
includeNames – [in] 每个头文件的名称，CUDA 程序源代码可以通过这些名称包含它们。includeNames 当 numHeaders 为 0 时，可以是 NULL。这些头文件必须使用此处指定的精确名称包含。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_OUT_OF_MEMORY
NVRTC_ERROR_PROGRAM_CREATION_FAILURE
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcDestroyProgram(nvrtcProgram *prog)

nvrtcDestroyProgram 销毁给定的程序。

另请参阅

nvrtcCreateProgram

参数

prog – [in] CUDA 运行时编译程序。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetCUBIN(nvrtcProgram prog, char *cubin)

nvrtcGetCUBIN 将之前编译 prog 生成的 cubin 存储在 cubin 指向的内存中。

如果为 -arch 指定的值是虚拟架构而不是实际架构，则 cubin 不可用。

另请参阅

nvrtcGetCUBINSize

参数

prog – [in] CUDA 运行时编译程序。
cubin – [out] 编译和汇编的结果。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetCUBINSize(nvrtcProgram prog, size_t *cubinSizeRet)

nvrtcGetCUBINSize 使用之前编译 prog 生成的 cubin 的大小设置 cubinSizeRet 的值。

如果为 -arch 指定的值是虚拟架构而不是实际架构，则 cubinSizeRet 的值设置为 0。

另请参阅

nvrtcGetCUBIN

参数

prog – [in] CUDA 运行时编译程序。
cubinSizeRet – [out] 生成的 cubin 的大小。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetLTOIR(nvrtcProgram prog, char *LTOIR)

nvrtcGetLTOIR 将之前编译 prog 生成的 LTO IR 存储在 LTOIR 指向的内存中。

如果程序在编译时未使用 -dlto，则 LTO IR 不可用。

另请参阅

nvrtcGetLTOIRSize

参数

prog – [in] CUDA 运行时编译程序。
LTOIR – [out] 编译结果。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetLTOIRSize(nvrtcProgram prog, size_t *LTOIRSizeRet)

nvrtcGetLTOIRSize 使用之前编译 prog 生成的 LTO IR 的大小设置 LTOIRSizeRet 的值。

如果程序在编译时未使用 -dlto，则 LTOIRSizeRet 的值设置为 0。

另请参阅

nvrtcGetLTOIR

参数

prog – [in] CUDA 运行时编译程序。
LTOIRSizeRet – [out] 生成的 LTO IR 的大小。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetLoweredName(nvrtcProgram prog, const char *const name_expression, const char **lowered_name)

nvrtcGetLoweredName 提取 **global** 函数或 **device**/__constant__ 变量的降级（mangled）名称，并更新 *lowered_name 以指向它。

当 NVRTC 程序被 nvrtcDestroyProgram 销毁时，包含名称的内存将被释放。相同的名称表达式必须先前已提供给 nvrtcAddNameExpression。

另请参阅

nvrtcAddNameExpression

参数

prog – [in] CUDA 运行时编译程序。
name_expression – [in] 表示 global 函数或 device/__constant__ 变量地址的常量表达式。
lowered_name – [out] 由函数初始化，以指向包含与提供的名称表达式相对应的降级（mangled）名称的 C 字符串。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID

nvrtcResult nvrtcGetNVVM(nvrtcProgram prog, char *nvvm)

弃用通知：此函数将在未来的版本中移除。

请改用 nvrtcGetLTOIR（和 nvrtcGetLTOIRSize）。

nvrtcResult nvrtcGetNVVMSize(nvrtcProgram prog, size_t *nvvmSizeRet)

弃用通知：此函数将在未来的版本中移除。

请改用 nvrtcGetLTOIRSize（和 nvrtcGetLTOIR）。

nvrtcResult nvrtcGetOptiXIR(nvrtcProgram prog, char *optixir)

nvrtcGetOptiXIR 将之前编译 prog 生成的 OptiX IR 存储在 optixir 指向的内存中。

如果程序在编译时使用了与 OptiX IR 生成不兼容的选项，则 OptiX IR 不可用。

另请参阅

nvrtcGetOptiXIRSize

参数

prog – [in] CUDA 运行时编译程序。
optixir – [out] Optix IR 编译结果。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetOptiXIRSize(nvrtcProgram prog, size_t *optixirSizeRet)

nvrtcGetOptiXIRSize 使用之前编译 prog 生成的 OptiX IR 的大小设置 optixirSizeRet 的值。

如果程序在编译时使用了与 OptiX IR 生成不兼容的选项，则 nvrtcGetOptiXIRSize 的值设置为 0。

另请参阅

nvrtcGetOptiXIR

参数

prog – [in] CUDA 运行时编译程序。
optixirSizeRet – [out] 生成的 LTO IR 的大小。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetPTX(nvrtcProgram prog, char *ptx)

nvrtcGetPTX 将之前编译 prog 生成的 PTX 存储在 ptx 指向的内存中。

另请参阅

nvrtcGetPTXSize

参数

prog – [in] CUDA 运行时编译程序。
ptx – [out] 编译结果。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetPTXSize(nvrtcProgram prog, size_t *ptxSizeRet)

nvrtcGetPTXSize 使用之前编译 prog 生成的 PTX 的大小（包括尾部的 NULL）设置 ptxSizeRet 的值。

另请参阅

nvrtcGetPTX

参数

prog – [in] CUDA 运行时编译程序。
ptxSizeRet – [out] 生成的 PTX 的大小（包括尾部的 NULL）。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetProgramLog(nvrtcProgram prog, char *log)

nvrtcGetProgramLog 将之前编译 prog 生成的日志存储在 log 指向的内存中。

另请参阅

nvrtcGetProgramLogSize

参数

prog – [in] CUDA 运行时编译程序。
log – [out] 编译日志。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetProgramLogSize(nvrtcProgram prog, size_t *logSizeRet)

nvrtcGetProgramLogSize 使用之前编译 prog 生成的日志的大小（包括尾部的 NULL）设置 logSizeRet。

请注意，即使 prog 的编译成功，也可能会生成包含警告和信息性消息的编译日志。

另请参阅

nvrtcGetProgramLog

参数

prog – [in] CUDA 运行时编译程序。
logSizeRet – [out] 编译日志的大小（包括尾部的 NULL）。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcSetFlowCallback(nvrtcProgram prog, int (*callback)(void*, void*), void *payload)

nvrtcSetFlowCallback 注册一个回调函数，编译器将在调用 nvrtcCompileProgram 期间的不同点调用该函数，并且回调函数可以决定是否通过返回特定值来取消编译。

回调函数必须满足以下约束

(1) 其签名应为

int callback(void* param1, void* param2);

当调用回调时，编译器将始终将 payload 传递给 param1，以便回调可以根据 payload 做出决策。目前，它将始终将 NULL 传递给 param2，param2 保留供将来扩展使用。

(2) 它必须返回 1 以取消编译，或返回 0 以继续。其他返回值保留供将来使用。

(3) 它必须返回一致的值。一旦它在某一点返回 1，则在当前正在进行的 nvrtcCompileProgram 调用期间的所有后续调用中，它都必须返回 1。

(4) 它必须是线程安全的。

(5) 它不得调用任何 nvrtc/libnvvm/ptx API。

参数

prog – [in] CUDA 运行时编译程序。
callback – [in] 发出取消信号的回调。
payload – [in] 在调用回调时作为参数传递。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT

3.3.2. 类型定义

typedef struct _nvrtcProgram *nvrtcProgram

nvrtcProgram 是编译单元，也是程序的不透明句柄。

要编译 CUDA 程序字符串，必须首先使用 nvrtcCreateProgram 创建 nvrtcProgram 的实例，然后使用 nvrtcCompileProgram 进行编译。

3.4. 支持的编译选项

NVRTC 支持以下编译选项。

带有两个前导破折号 (--) 的选项名称是长选项名称，带有一个前导破折号 (-) 的选项名称是短选项名称。可以使用短选项名称代替长选项名称。当编译选项需要参数时，赋值运算符 (=) 用于分隔编译选项参数和编译选项名称，例如，"--gpu-architecture=compute_60"。或者，可以在没有赋值运算符的单独字符串中指定编译选项名称和参数，例如，"--gpu-architecture" "compute_60"。单字符短选项名称，例如 -D、-U 和 -I，不需要赋值运算符，并且编译选项名称和参数可以存在于同一字符串中，它们之间可以有或没有空格。例如，"-D=<def>"、"-D<def>" 和 "-D <def>" 均受支持。

有效的编译器选项包括

编译目标
- --gpu-architecture=<arch> (-arch)
  
  指定必须为其编译输入的 GPU 架构类别的名称。
  - 有效的 <arch>
    - compute_50
    - compute_52
    - compute_53
    - compute_60
    - compute_61
    - compute_62
    - compute_70
    - compute_72
    - compute_75
    - compute_80
    - compute_87
    - compute_89
    - compute_90
    - compute_90a
    - compute_100
    - compute_100a
    - sm_50
    - sm_52
    - sm_53
    - sm_60
    - sm_61
    - sm_62
    - sm_70
    - sm_72
    - sm_75
    - sm_80
    - sm_87
    - sm_89
    - sm_90
    - sm_90a
    - sm_100
    - sm_100a
  - 默认值：compute_52
单独编译 / 全程序编译
- --device-c (-dc)
  
  生成可重定位的代码，该代码可以与其他可重定位的设备代码链接。它等效于 --relocatable-device-code=true。
- --device-w (-dw)
  
  生成不可重定位的代码。它等效于 --relocatable-device-code=false。
- --relocatable-device-code={true|false} (-rdc)
  
  启用（禁用）生成可重定位的设备代码。
  - 默认值：false
- --extensible-whole-program (-ewp)
  
  对设备代码执行可扩展的全程序编译。
  - 默认值：false
调试支持
- --device-debug (-G)
  
  生成调试信息。如果未指定 --dopt，则关闭所有优化。
- --generate-line-info (-lineinfo)
  
  生成行号信息。
代码生成
- --dopt on (-dopt)
- --dopt=on
  
  启用设备代码优化。当与 -G 一起指定时，为优化的设备代码启用有限的调试信息生成（目前仅限行号信息）。当未指定 -G 时，-dopt=on 是隐式的。
- --ptxas-options <options> (-Xptxas)
- --ptxas-options=<options>
  
  直接为 PTX 优化汇编器 ptxas 指定选项。
- --maxrregcount=<N> (-maxrregcount)
  
  指定 GPU 函数可以使用的最大寄存器数量。在达到函数特定的限制之前，较高的值通常会提高执行此函数的单个 GPU 线程的性能。但是，由于线程寄存器是从每个 GPU 上的全局寄存器池中分配的，因此此选项的较高值也会减小最大线程块大小，从而减少线程并行度。因此，良好的 maxrregcount 值是权衡的结果。如果未指定此选项，则不假定最大值。小于 ABI 所需的最小寄存器数的值将被编译器提升到 ABI 最小限制。
- --ftz={true|false} (-ftz)
  
  执行单精度浮点运算时，将非正规值刷新为零或保留非正规值。
  
  --use_fast_math 意味着 --ftz=true。
  - 默认值：false
- --prec-sqrt={true|false} (-prec-sqrt)
  
  对于单精度浮点平方根，使用 IEEE 四舍五入到最接近的模式或使用更快的近似值。--use_fast_math 意味着 --prec-sqrt=false。
  - 默认值：true
- --prec-div={true|false} (-prec-div) 对于单精度浮点除法和倒数，使用 IEEE 四舍五入到最接近的模式或使用更快的近似值。--use_fast_math 意味着 --prec-div=false。
  - 默认值：true
- --fmad={true|false} (-fmad)
  
  启用（禁用）将浮点乘法和加法/减法收缩为浮点乘加运算（FMAD、FFMA 或 DFMA）。--use_fast_math 意味着 --fmad=true。
  - 默认值：true
- --use_fast_math (-use_fast_math)
  
  使用快速数学运算。--use_fast_math 意味着 --ftz=true --prec-div=false --prec-sqrt=false --fmad=true。
- --extra-device-vectorization (-extra-device-vectorization)
  
  在 NVVM 优化器中启用更激进的设备代码向量化。
- --modify-stack-limit={true|false} (-modify-stack-limit)
  
  在 Linux 上，在编译期间，使用 setrlimit() 将堆栈大小增加到允许的最大值。限制在编译结束时重置为先前的值。注意：setrlimit() 更改整个进程的值。
  - 默认值：true
- --dlink-time-opt (-dlto)
  
  生成中间代码以进行后续的链接时优化。它意味着 -rdc=true。注意：当使用此选项时，应使用 nvrtcGetLTOIR API，因为不会生成 PTX 或 Cubin。
- --gen-opt-lto (-gen-opt-lto)
  
  在生成 LTO IR 之前运行优化器pass。
- --optix-ir (-optix-ir)
  
  生成 OptiX IR。Optix IR 仅供 OptiX 通过适当的 API 使用。链接时优化（-dlto）不支持此功能。
  
  注意：当使用此选项时，应使用 nvrtcGetOptiX API，因为不会生成 PTX 或 Cubin。
- --jump-table-density=[0-101] (-jtd)
  
  在 switch 语句中指定 case 密度百分比，并将其用作确定是否使用跳转表（brx.idx 指令）来实现 switch 语句的最小阈值。默认值为 101。百分比范围为 0 到 101（含）。
- --device-stack-protector={true|false} (-device-stack-protector)
  
  启用（禁用）在设备代码中生成堆栈金丝雀。
  - 默认值：false
预处理
- --define-macro=<def> (-D)
  
  <def> 可以是 <name> 或 <name=definitions>。
  - <name>
    
    将 <name> 预定义为宏，定义为 1。
  - <name>=<definition>
    
    <definition> 的内容将被标记化和预处理，就像它们在翻译阶段三中出现在 #define 指令中一样。特别是，定义将被嵌入式换行符截断。
- --undefine-macro=<def> (-U)
  
  取消先前对 <def> 的任何定义。
- --include-path=<dir> (-I)
  
  添加目录 <dir> 到头文件搜索目录列表中。这些路径会在传递给 nvrtcCreateProgram 的头文件列表之后被搜索。
- --pre-include=<header> (-include)
  
  在预处理期间预包含 <header>。
- --no-source-include (-no-source-include)
  
  预处理器默认将每个输入源文件的目录添加到包含路径中。此选项禁用此功能，并且仅考虑显式指定的路径。
语言方言
- --std={c++03|c++11|c++14|c++17|c++20} (-std)
  
  设置语言方言为 C++03、C++11、C++14、C++17 或 C++20
  - 默认值： c++17
- --builtin-move-forward={true|false} (-builtin-move-forward)
  
  当选择 C++11 或更高版本的语言方言时，提供 std::move 和 std::forward 的内置定义。
  - 默认值：true
- --builtin-initializer-list={true|false} (-builtin-initializer-list)
  
  当选择 C++11 或更高版本的语言方言时，提供 std::initializer_list 类和成员函数的内置定义。
  - 默认值：true
预编译头文件支持 (CUDA 12.8+)
- --pch (-pch)
  
  启用自动 PCH 处理。
- --create-pch=<file-name> (-create-pch)
  
  创建 PCH 文件。
- --use-pch=<file-name> (-use-pch)
  
  使用指定的 PCH 文件。
- --pch-dir=<directory-name> (-pch-dir)
  
  当使用自动 PCH (-pch) 时，在指定目录中查找和创建 PCH 文件。当使用显式 PCH (-create-pch 或 -use-pch) 时，目录名称会作为指定文件名的前缀，除非文件名是绝对路径名。
- --pch-verbose={true|false} (-pch-verbose)
  
  在自动 PCH 模式下，对于每个无法在当前编译中使用的 PCH 文件，在编译日志中打印原因。
  - 默认值：true
- --pch-messages={true|false} (-pch-messages)
  
  如果在当前编译中创建或使用了 PCH 文件，则在编译日志中打印消息。
  - 默认值：true
- --instantiate-templates-in-pch={true|false} (-instantiate-templates-in-pch)
  
  启用或禁用在 PCH 创建之前实例化模板。实例化模板可能会增加 PCH 文件的大小，同时降低使用 PCH 文件时的编译成本（因为可以跳过一些模板实例化）。
  - 默认值：true
杂项。
- --disable-warnings (-w)
  
  禁止显示所有警告消息。
- --restrict (-restrict)
  
  程序员断言所有内核指针参数都是 restrict 指针。
- --device-as-default-execution-space (-default-device)
  
  将没有执行空间注释的实体视为 __device__ 实体。
- --device-int128 (-device-int128)
  
  允许设备代码中使用 __int128 类型。还会导致宏 __CUDACC_RTC_INT128__ 被定义。
- --device-float128 (-device-float128)
  
  允许设备代码中使用 __float128 和 _Float128 类型。还会导致宏 D__CUDACC_RTC_FLOAT128__ 被定义。
- --optimization-info=<kind> (-opt-info)
  
  为指定类型的优化提供优化报告。支持以下类型标签
  - inline : 当函数被内联时发出备注。
- --display-error-number (-err-no)
  
  显示警告消息的诊断编号。（默认）
- --no-display-error-number (-no-err-no)
  
  禁用显示警告消息的诊断编号。
- --diag-error=<error-number>,… (-diag-error)
  
  为指定的诊断消息编号发出错误。消息编号可以用逗号分隔。
- --diag-suppress=<error-number>,… (-diag-suppress)
  
  抑制指定的诊断消息编号。消息编号可以用逗号分隔。
- --diag-warn=<error-number>,… (-diag-warn)
  
  为指定的诊断消息编号发出警告。消息编号可以用逗号分隔。
- --brief-diagnostics={true|false} (-brief-diag)
  
  此选项禁用或启用在诊断信息中显示源代码行和列信息。 --brief-diagnostics=true 将不显示源代码行和列信息。
  - 默认值：false
- --time=<file-name> (-time)
  
  生成一个逗号分隔值表，其中包含每个编译阶段所花费的时间，并将其附加到作为选项参数给定的文件的末尾。如果文件不存在，则在表的第一行生成列标题。如果文件名为“-”，则定时数据将写入编译日志。
- --split-compile=<number-of-threads> (-split-compile=<number-of-threads>)
  
  并行执行编译器优化。拆分编译尝试通过使编译器能够并发运行某些优化pass来减少编译时间。此选项接受一个数值，用于指定编译器可以使用的最大线程数。也可以通过设置 --split-compile=0 来允许编译器使用系统上可用的最大线程数。设置 --split-compile=1 将导致此选项被忽略。
- --fdevice-syntax-only (-fdevice-syntax-only)
  
  在前端语法检查后结束设备编译。此选项不会生成有效的设备代码。
- --minimal (-minimal)
  
  省略某些语言功能以减少小型程序的编译时间。特别是，以下内容被省略
  - 纹理和表面函数以及关联的类型，例如 cudaTextureObject_t。
  - cudadevrt 设备代码库提供的 CUDA 运行时函数，通常以前缀 “cuda” 命名，例如 cudaMalloc。
  - 从设备代码启动内核。
  - 与 CUDA 运行时和驱动程序 API 关联的类型和宏，由 cuda/tools/cudart/driver_types.h 提供，通常以前缀 “cuda” 命名，例如 cudaError_t。
- --device-stack-protector (-device-stack-protector)
  
  在设备代码中启用堆栈金丝雀。堆栈金丝雀使利用某些涉及堆栈局部变量的内存安全漏洞变得更加困难。编译器使用启发式方法评估每个函数中此类错误的风险。只有那些被认为是高风险的函数才会使用堆栈金丝雀。
- --fdevice-time-trace=<file-name> (-fdevice-time-trace=<file-name>) 启用时间分析器，根据给定的 <file-name> 输出 JSON 文件。结果可以在 chrome://tracing 上进行分析以获得火焰图可视化效果。

3.5. 预编译头文件 (PCH) (CUDA 12.8+)

另请参阅传递给 nvrtcCompileProgram 的 PCH 相关标志。

函数

nvrtcResult nvrtcGetPCHCreateStatus(nvrtcProgram prog): 返回 PCH 创建状态。
nvrtcResult nvrtcGetPCHHeapSize(size_t *ret): 检索 PCH 堆的当前大小。
nvrtcResult nvrtcGetPCHHeapSizeRequired(nvrtcProgram prog, size_t *size): 检索编译给定程序所需的 PCH 堆的所需大小。
nvrtcResult nvrtcSetPCHHeapSize(size_t size): 设置 PCH 堆的大小。

3.5.1. 函数

nvrtcResult nvrtcGetPCHCreateStatus(nvrtcProgram prog)

返回 PCH 创建状态。

NVRTC_SUCCESS 表示 PCH 已成功创建。 NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED 表示未尝试创建 PCH，原因是在之前的 nvrtcCompileProgram 调用期间未请求 PCH 功能，或者请求了自动 PCH 处理，但编译器选择不创建 PCH 文件。 NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED 表示可能已创建 PCH 文件，但编译器在 PCH 堆中空间不足。在这种情况下，可以使用 nvrtcGetPCHHeapSizeRequired() 查询所需的堆大小，可以使用 nvrtcSetPCHHeapSize() 为此大小重新分配堆，并且可以再次调用 nvrtcCompileProgram() 和新的 NVRTC 程序实例来重新尝试 PCH 创建。 NVRTC_ERROR_PCH_CREATE 表示错误条件阻止了 PCH 文件的创建。

参数

prog – [in] CUDA 运行时编译程序。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED
NVRTC_ERROR_PCH_CREATE
NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED
NVRTC_ERROR_INVALID_PROGRAM

nvrtcResult nvrtcGetPCHHeapSize(size_t *ret)

检索 PCH 堆的当前大小。

参数

ret – [out] 指向将存储 PCH 堆大小的位置的指针

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT

nvrtcResult nvrtcGetPCHHeapSizeRequired(nvrtcProgram prog, size_t *size)

检索编译给定程序所需的 PCH 堆的所需大小。

参数

prog – [in] CUDA 运行时编译程序。
size – [out] 指向将存储 PCH 堆所需大小的位置的指针

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT 仅当 nvrtcGetPCHCreateStatus() 返回 NVRTC_SUCCESS 或 NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED 时，使用此函数检索的大小才有效

nvrtcResult nvrtcSetPCHHeapSize(size_t size)

设置 PCH 堆的大小。

请求的大小可能会向上舍入到平台相关的对齐方式（例如，页面大小）。如果 PCH 堆已分配，则堆内存将被释放，并且将分配新的 PCH 堆。

参数

size – [in] 请求的 PCH 堆大小，以字节为单位

返回值

NVRTC_SUCCESS

3.6. 主机助手

NVRTC 定义了以下函数，以便更轻松地与主机代码交互。

函数

nvrtcResult nvrtcGetTypeName(const std::type_info &tinfo, std::string *result): nvrtcGetTypeName 将类型的源代码级别名称存储在给定的 std::string 位置。
nvrtcResult nvrtcGetTypeName(std::string *result): nvrtcGetTypeName 将模板类型参数 T 的源代码级别名称存储在给定的 std::string 位置。

3.6.1. 函数

inline nvrtcResult nvrtcGetTypeName(const std::type_info &tinfo, std::string *result)

nvrtcGetTypeName 将类型的源代码级别名称存储在给定的 std::string 位置。

仅当宏 NVRTC_GET_TYPE_NAME 定义为非零值时，才提供此函数。当分别使用 gcc/clang 或 cl.exe 编译器时，它使用 abi::__cxa_demangle 或 UnDecorateSymbolName 函数调用来提取类型名称。如果名称提取失败，它将返回 NVRTC_INTERNAL_ERROR，否则 *result 将使用提取的名称进行初始化。

Windows 特定说明

nvrtcGetTypeName() 不是多线程安全的，因为它调用了 UnDecorateSymbolName()，而 UnDecorateSymbolName() 不是多线程安全的。
返回的字符串可能包含 Microsoft 特定的关键字，例如 __ptr64 和 __cdecl。

参数

tinfo – [in] 对给定类型的 std::type_info 对象的引用。
result – [in] 指向 std::string 的指针，类型名称将存储在其中。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INTERNAL_ERROR

template<typename T> nvrtcResult nvrtcGetTypeName(std::string *result)

nvrtcGetTypeName 将模板类型参数 T 的源代码级别名称存储在给定的 std::string 位置。

仅当宏 NVRTC_GET_TYPE_NAME 定义为非零值时，才提供此函数。当分别使用 gcc/clang 或 cl.exe 编译器时，它使用 abi::__cxa_demangle 或 UnDecorateSymbolName 函数调用来提取类型名称。如果名称提取失败，它将返回 NVRTC_INTERNAL_ERROR，否则 *result 将使用提取的名称进行初始化。

Windows 特定说明

nvrtcGetTypeName() 不是多线程安全的，因为它调用了 UnDecorateSymbolName()，而 UnDecorateSymbolName() 不是多线程安全的。
返回的字符串可能包含 Microsoft 特定的关键字，例如 __ptr64 和 __cdecl。

参数

result – [in] 指向 std::string 的指针，类型名称将存储在其中。

返回值

NVRTC_SUCCESS
NVRTC_ERROR_INTERNAL_ERROR

4. 语言

与离线 nvcc 编译器不同，NVRTC 旨在仅编译设备 CUDA C++ 代码。它不接受输入代码中的主机代码或主机编译器扩展，除非另有说明。

4.1. 执行空间

NVRTC 使用 __host__ 作为默认执行空间，如果它在输入中遇到任何主机代码，则会生成错误。也就是说，如果输入包含带有显式 __host__ 注释或没有执行空间注释的实体，NVRTC 将发出错误。 __host__ __device__ 函数被视为设备函数。

NVRTC 提供了一个编译选项 --device-as-default-execution-space (请参阅支持的编译选项)，该选项启用了一种替代编译模式，在该模式中，没有执行空间注释的实体被视为 __device__ 实体。

4.2. 单独编译

NVRTC 本身不提供任何链接器。但是，用户可以使用 nvJitLink 库或 CUDA Driver API 中的 cuLinkAddData 将生成的重定位 PTX 代码与其他重定位代码链接。要生成可重定位的 PTX 代码，需要编译选项 --relocatable-device-code=true 或 --device-c。

4.3. 动态并行

NVRTC 在以下条件下支持动态并行

编译目标必须是 compute 35 或更高版本。
必须启用单独编译 (--relocatable-device-code=true 或 --device-c) 或可扩展的整体程序编译 (--extensible-whole-program)。
生成的 PTX 必须链接到 CUDA 设备运行时 (cudadevrt) 库 (请参阅单独编译)。

示例：动态并行提供了一个简单的示例。

4.4. 整数大小

不同的操作系统以不同的方式定义整数类型大小。 Linux x86_64 实现 LP64，而 Windows x86_64 实现 LLP64。

表 1. LLP64 和 LP64 的整数大小（以位为单位）
	`short`	`int`	`long`	`long long`	指针和 `size_t`
LLP64	16	32	32	64	64
LP64	16	32	64	64	64

NVRTC 在 Linux 上实现 LP64，在 Windows 上实现 LLP64。

NVRTC 通过 __int128 类型支持 128 位整数类型。这可以使用 --device-int128 标志启用。 128 位整数支持在 Windows 上不可用。

4.5. 包含语法

当调用 nvrtcCompileProgram() 时，当前工作目录将添加到头文件搜索路径，该路径用于查找使用带引号的语法（例如，#include "foo.h"）包含的文件，然后再编译代码。

4.6. 预定义宏

__CUDACC_RTC__：用于区分用户代码中的运行时和离线 nvcc 编译。
__CUDACC__：定义与离线 nvcc 编译相同的语义。
__CUDACC_RDC__：定义与离线 nvcc 编译相同的语义。
__CUDACC_EWP__：定义与离线 nvcc 编译相同的语义。
__CUDACC_DEBUG__：定义与离线 nvcc 编译相同的语义。
__CUDA_ARCH__：定义与离线 nvcc 编译相同的语义。
__CUDA_ARCH_LIST__：定义与离线 nvcc 编译相同的语义。
__CUDACC_VER_MAJOR__：定义为由 nvrtcVersion 返回的主版本号。
__CUDACC_VER_MINOR__：定义为由 nvrtcVersion 返回的次版本号。
__CUDACC_VER_BUILD__：定义为构建版本号。
__NVCC_DIAG_PRAGMA_SUPPORT__：定义与离线 nvcc 编译相同的语义。
__CUDACC_RTC_INT128__：在编译期间指定 -device-int128 标志时定义，并指示支持 __int128 类型。
NULL：空指针常量。
va_start
va_end
va_arg
va_copy : 当选择 C++11 或更高版本的语言方言时定义。
__cplusplus
_WIN64 : 在 Windows 平台上定义。
__LP64__ : 在非 Windows 平台上定义，其中 long int 和指针类型为 64 位。
__cdecl : 在所有平台上定义为空。
__ptr64 : 在 Windows 平台上定义为空。
__CUDACC_RTC_MINIMAL__：在编译期间指定 -minimal 标志时定义（自 CUDA 12.4 起）。
隐式提供在 nv/target 头文件中定义的宏，例如 NV_IF_TARGET。
__CUDACC_DEVICE_ATOMIC_BUILTINS__：当支持设备原子编译器内置函数时定义。有关更多详细信息，请参阅 CUDA C++ 编程指南。

4.7. 预定义类型

clock_t
size_t
ptrdiff_t
va_list：请注意，此类型的定义可能与 nvcc 在编译 CUDA 代码时选择的定义不同。
CUDA 运行时头文件中提供的预定义类型（例如 dim3、char4 等），当使用 nvcc 离线编译时也可用，除非另有说明。
std::initializer_list<T>：在 C++11 和更高版本方言中隐式提供，除非指定 -builtin-initializer-list=false。
std::move<T>, std::forward<T>：在 C++11 和更高版本方言中隐式提供，除非指定 -builtin-move-forward=false。

4.8. 内置函数

CUDA 运行时头文件中提供的内置函数（当使用 nvcc 离线编译时）可用，除非另有说明。

4.9. 默认 C++ 方言

默认 C++ 方言是 C++17。可以使用 -std 标志选择其他方言。

5. 基本用法

本文档的这一节使用一个简单的示例，“单精度 α⋅X 加 Y”（SAXPY），如图 1 所示，解释了使用 NVRTC 进行运行时编译所涉及的内容。为了简洁和可读性，未显示 API 返回值的错误检查。完整的代码清单在示例：SAXPY中提供。

图 1. SAXPY 的 CUDA 源字符串

const char *saxpy = "                                          \n\
extern \"C\" __global__                                        \n\
void saxpy(float a, float *x, float *y, float *out, size_t n)  \n\
{                                                              \n\
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;         \n\
   if (tid < n) {                                              \n\
      out[tid] = a * x[tid] + y[tid];                          \n\
   }                                                           \n\
}                                                              \n";

首先，需要创建 nvrtcProgram 的实例。图 2 显示了为 SAXPY 创建 nvrtcProgram。由于 SAXPY 不需要任何头文件，因此将 0 作为 numHeaders 传递，将 NULL 作为 headers 和 includeNames 传递。

图 2. SAXPY 的 nvrtcProgram 创建

nvrtcProgram prog;
nvrtcCreateProgram(&prog, // prog
        saxpy,         // buffer
        "saxpy.cu",    // name
        0,             // numHeaders
        NULL,          // headers
        NULL);         // includeNames

如果 SAXPY 有任何 #include 指令，则可以将被 #include 的文件的内容作为 headers 的元素传递，并将其名称作为 includeNames 的元素传递。例如，#include <foo.h> 和 #include <bar.h> 将需要 2 作为 numHeaders，{ "<foo.h 的内容>", "<bar.h 的内容>" } 作为 headers，以及 { "foo.h", "bar.h" } 作为 includeNames（<foo.h 的内容> 和 <bar.h 的内容> 必须替换为 foo.h 和 bar.h 的实际内容）。或者，如果头文件保证在运行时存在于文件系统中，则可以使用编译选项 -I。

一旦创建了用于编译的 nvrtcProgram 实例，就可以通过 nvrtcCompileProgram 进行编译，如图 3 所示。此示例中使用了两个编译选项 --gpu-architecture=compute_80 和 --fmad=false，以生成用于 compute_80 架构的代码，并禁用将浮点乘法和加法/减法收缩为浮点乘加运算。可以根据需要使用其他编译选项组合，并且支持的编译选项列出了有效的编译选项。

图 3. 启用 FMAD 的 compute_80 的 SAXPY 编译

const char *opts[] = {"--gpu-architecture=compute_80",
         "--fmad=false"};
nvrtcCompileProgram(prog,     // prog
         2,        // numOptions
         opts);    // options

编译完成后，用户可以获取程序编译日志和生成的 PTX，如图 4 所示。当编译失败时，NVRTC 不会生成有效的 PTX，并且即使编译成功，如果需要，它也可能会生成程序编译日志。

一个 nvrtcProgram 可以通过 nvrtcCompileProgram 使用不同的编译选项多次编译，并且用户只能检索由上次编译生成的 PTX 和日志。

图 4. 获取生成的 PTX 和程序编译日志

// Obtain compilation log from the program.

size_t logSize;

nvrtcGetProgramLogSize(prog, &logSize);
char *log = new char[logSize];
nvrtcGetProgramLog(prog, log);
// Obtain PTX from the program.
size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
char *ptx = new char[ptxSize];
nvrtcGetPTX(prog, ptx);

当不再需要 nvrtcProgram 的实例时，可以通过 nvrtcDestroyProgram 销毁它，如图 5 所示。

图 5. nvrtcProgram 的销毁

nvrtcDestroyProgram(&prog);

生成的 PTX 可以通过 CUDA Driver API 进一步操作，用于执行或链接。图 6 显示了用于执行生成的 PTX 的示例代码序列。

图 6. 使用 NVRTC 生成的 PTX 执行 SAXPY

CUdevice cuDevice;
CUcontext context;
CUmodule module;
CUfunction kernel;
cuInit(0);
cuDeviceGet(&cuDevice, 0);
cuCtxCreate(&context, 0, cuDevice);
cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
cuModuleGetFunction(&kernel, module, "saxpy");
size_t n = size_t n = NUM_THREADS * NUM_BLOCKS;
size_t bufferSize = n * sizeof(float);
float a = ...;
float *hX = ..., *hY = ..., *hOut = ...;
CUdeviceptr dX, dY, dOut;
cuMemAlloc(&dX, bufferSize);
cuMemAlloc(&dY, bufferSize);
cuMemAlloc(&dOut, bufferSize);
cuMemcpyHtoD(dX, hX, bufferSize);
cuMemcpyHtoD(dY, hY, bufferSize);
void *args[] = { &a, &dX, &dY, &dOut, &n };
cuLaunchKernel(kernel,
            NUM_THREADS, 1, 1,   // grid dim
            NUM_BLOCKS, 1, 1,    // block dim
            0, NULL,             // shared mem and stream
            args,                // arguments
            0);
cuCtxSynchronize();
cuMemcpyDtoH(hOut, dOut, bufferSize);

6. 预编译头文件 (CUDA 12.8+)

6.1. 概述

预编译头文件 (PCH) 是一种编译时优化功能，适用于在连续的编译器调用中编译同一组“前缀”头文件的情况。例如，考虑两个翻译单元 a.cu 和 b.cu，它们包含相同的头文件集

//a.cu
#include "foo.h"
#include "bar.h"

//<-- 'header stop' point
int xxx;

//b.cu
#include "foo.h"
#include "bar.h"

//<-- 'header stop' point
double ddd;

假设 a.cu 使用 NVRTC 编译，然后是 b.cu。如果 PCH 功能被激活，在编译 a.cu 时，编译器会识别头文件停止点，这通常是主源文件中不属于预处理指令的第一个标记 3。然后，编译器将其内部状态保存到 PCH 文件。稍后，当编译 b.cu 时，编译器会确定直到头文件停止点的前处理器指令前缀，检查是否提供了兼容的 PCH 文件，并通过从 PCH 文件重新加载其内部状态并继续编译来跳过解析头文件。

如果头文件很大，这可以显著节省编译时间。编译器支持自动和显式 PCH 模式。自动模式由 -pch 标志指定；在这种模式下，编译器将自动创建和使用 PCH 文件。在显式模式下，使用 --create-pch=filename 标志显式创建 PCH 文件，并使用 --use-pch=filename 标志在后续编译中指定使用。

3: 有关头文件停止点确定的详细信息，请参阅文档的后续部分。

6.2. 实现概述

PCH 编译器实现保存和恢复编译器的内部状态。内部状态包括内存缓冲区的内容，这些缓冲区包含指向数据结构的指针。不幸的是，在现代操作系统上，一种称为地址空间布局随机化 (ASLR) 的安全功能导致动态内存分配（例如 malloc/mmap）返回的地址在二进制文件的每次调用中都不同。因此，在一个程序调用期间创建的 PCH 文件通常与程序的下一次运行不兼容，因为动态分配返回的内存地址不再与 PCH 文件中保存的编译器状态中的对象地址匹配。

因此，PCH 文件必须在 NVRTC 库的同一动态实例中创建和使用。图 1 显示了编译器实现的概述。连续进行了 2 次 NVRTC 调用，第一次编译 a.cu，下一次编译 b.cu。在内部，编译器有两个不同的堆 - PCH 堆和瞬态堆。当请求 PCH 处理时，PCH 堆会被延迟分配。一旦分配，为 PCH 堆分配的地址空间不会在 a.cu 的 NVRTC 调用结束时返回给操作系统（但是，后备内存会被“取消提交”，以便操作系统可以重复使用它）。在下一次 b.cu 的 NVRTC 调用中，使用 PCH 处理，内存对象从 PCH 堆分配。如果内存分配的顺序与之前的 NVRTC 调用相同，则内存分配器返回的地址现在与之前的 NVRTC 调用中返回的值匹配（因为 PCH 堆的地址空间被保留了）。这允许从编译 a.cu 期间创建的 PCH 文件中成功恢复编译器的保留状态。

一旦 PCH 堆耗尽，或者 PCH 处理未激活，编译器将从瞬态堆分配。瞬态堆在当前 NVRTC 编译调用完成后释放。如果在请求创建 PCH 时 PCH 堆空间耗尽，编译器将报告错误 (NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED)；创建 PCH 文件所需的内存量可以使用 nvrtcGetPCHHeapSizeRequired() 查询。在使用 nvrtcSetPCHHeapSize() 调整 PCH 堆大小后，可以再次尝试创建 PCH。

创建 PCH 文件时，编译器将保存一个元数据前缀，该前缀将用于检查 PCH 文件是否兼容。元数据前缀包括以下信息

来自主源文件的预处理指令的初始序列，直到头文件停止点。
命令行选项。
编译器版本。
PCH 堆的基地址。

当考虑使用 PCH 文件时，会检查元数据前缀中的信息以确保兼容性。

6.3. 自动 PCH

通过将 -pch 传递给 NVRTC 编译调用来激活自动 PCH 模式。在自动 PCH 模式下，编译器将首先检测头文件停止点。然后，它将从文件系统中查找扩展名为 .pch 的兼容 PCH 文件。还可以使用 -pch-dir 标志显式指定要搜索 PCH 文件的目录位置。如果找到合适的 PCH 文件，将使用它，编译器将跳过解析直到头文件停止点的头文件序列。编译器将为每个被认为不兼容使用的 PCH 文件打印消息到编译日志，并提供不兼容的原因。此外，编译器也可能选择创建新的 PCH 文件。如果 PCH 文件无法创建，编译仍将成功；函数 nvrtcGetPCHCreateStatus() 可用于检索 PCH 创建的状态，并将报告 NVRTC_SUCCESS (成功)、NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED、NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED 或 NVRTC_ERROR_PCH_CREATE 之一。

编译器在自动 PCH 处理期间创建的 PCH 文件在 NVRTC 库卸载时被删除。

示例：自动 PCH (CUDA 12.8+) 演示了自动 PCH 的使用。

6.4. 显式 PCH 创建和使用

或者，可以使用 --create-pch=filename 显式创建 PCH 文件，并使用 --use-pch=filename 使用。与自动 PCH 一样，在创建 PCH 文件时，可以使用 nvrtcGetPCHCreateStatus() 检查 PCH 创建的状态。

示例：显式 PCH 创建/使用 (CUDA 12.8+) 演示了 PCH 文件的显式创建和使用。

6.5. 确定头文件停止点

PCH 文件包含编译器状态，该状态解析到头文件停止点。头文件停止点通常是主源文件中不属于预处理指令的第一个标记。例如

#include "foo.h"
#include "bar.h"
int qqq;

这里，头文件停止点是 ‘int’。或者，可以使用 #pragma nv_hdrstop 指定头文件停止点

#include "foo.h"
#pragma nv_hdrstop

#include "bar.h"
int qqq;

如果预期的头文件停止点或 #pragma nv_hdrstop 在 #if 内部，则头文件停止点是最外层的封闭 if

#include "aaa.h"
#ifndef FOO_H
#define FOO_H 1
#include "bbb.h"
#endif
#if MYMACRO
int qqq;
#endif

在这里，第一个非预处理标记是 int，但是，头文件停止点是封闭的 #if MYMACRO 块的开始。

6.6. PCH 失败条件

由于创建 PCH 文件的编译器调用与当前调用之间在以下任何方面不匹配，PCH 文件可能被视为不兼容使用

NVRTC 命令行参数。
主源文件的预处理指令（例如 #include）的初始序列。
PCH 堆基地址。如果 PCH 文件是由 NVRTC 库的不同动态实例创建的，或者在创建 PCH 文件后使用 nvrtcSetPCHHeapSize() 调整了 PCH 堆的大小，则可能会发生这种情况。
编译器版本。

注意： 编译器不存储 PCH 前缀中引用的头文件序列的文件修改时间。用户有责任确保自创建 PCH 文件以来，头文件内容 4 没有更改。

PCH 文件创建可能因以下原因失败

头文件停止点之前的代码中存在错误。
遇到了 __DATE__ 或 __TIME__ 宏。
遇到了 pragma #pragma nv_no_pch。
头文件停止点不在顶层声明之间。例子

// foo.h
static

// foo.cu
#include "foo.h"
int qqq;

在达到头文件停止点之前，PCH 堆已耗尽。如果发生这种情况，nvrtcGetPCHCreateStatus() 将报告 NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED。可以通过调用 nvrtcGetPCHHeapSizeRequired() 检索所需的堆大小，并通过调用 nvrtcSetPCHHeapSize() 增加堆大小，并且可以再次尝试创建 PCH。示例：PCH 堆大小调整 (CUDA 12.8+) 演示了这种模式。

4: 从文件系统读取的头文件以及直接作为字符串指定给 nvrtcCreateProgram() 的头文件。

6.7. PCH 堆控制

当首次请求 PCH 处理时，PCH 堆会被延迟分配。在该编译调用结束时，PCH 堆的后备内存将返回给操作系统（“取消提交”），但地址空间不会被释放。在下一次需要 PCH 处理的 NVRTC 调用期间，PCH 堆的后备内存将从操作系统重新获取。

默认 PCH 堆大小为 256 MB。环境变量 NVRTC_PCH_HEAP_SIZE 在 NVRTC 库初始化时读取，可用于修改默认 PCH 堆分配大小（以字节为单位）。可以使用 nvrtcGetPCHHeapSize() 检索 PCH 堆大小（以字节为单位），并使用 nvrtcSetPCHHeapSize() 设置。用户指定的 PCH 堆大小向上舍入为平台相关的值 5。

注意：将 PCH 堆大小设置为 0 将释放 PCH 堆并禁用 PCH 处理。

PCH 堆基地址编码在生成的 PCH 文件中。 nvrtcSetPCHHeapSize() 将释放当前分配的 PCH 堆并分配新的堆。因此，在调用 nvrtcSetPCHHeapSize() 之前创建的 PCH 文件很可能与未来的编译不兼容，因为 PCH 堆基地址几乎肯定已更改。

示例：PCH 堆大小调整 (CUDA 12.8+) 列出了一个完整的可运行示例，演示了 PCH 堆大小调整。

5: 例如，平台上的页面大小。

6.8. 其他控制

6.8.1. 环境变量

这些变量在 NVRTC 初始化期间读取

NVRTC_PCH_HEAP_SIZE <size> : 设置默认 PCH 堆大小（以字节为单位）。堆在首次请求 PCH 处理时延迟分配。
NVRTC_DISABLE_PCH : 禁用所有 NVRTC 调用的 PCH 处理。

6.8.2. Pragma

支持以下 pragma

#pragma nv_hdrstop : 指示 PCH 头文件停止。
#pragma nv_no_pch : 禁用当前源文件的 PCH 文件创建。

6.8.3. 标志

有关 nvrtCompileProgram 支持的 PCH 相关标志，请参阅支持的编译选项。

6.8.4. 在创建 PCH 之前实例化模板

标志 -instantiate-templates-in-pch={true|false} 可用于控制是否在创建 PCH 文件之前实例化模板。这可能会增加 PCH 文件的大小，同时加快使用 PCH 文件的编译速度（因为模板实例化不需要再次完成）。默认情况下，此标志处于启用状态。

7. 访问降级的名称

NVRTC 将按照 IA64 ABI 的规定，修改 __global__ 函数名称以及 __device__ 和 __constant__ 变量的名称。如果生成的 PTX 正在使用 CUDA Driver API 加载，则必须按名称查找内核函数或 __device__/__constant__ 变量，但是当名称已被修改时，这很难做到。为了解决这个问题，NVRTC 提供了 API 函数，这些函数将源级别的 __global__ 函数或 __device__/__constant__ 变量名称映射到生成的 PTX 中存在的修改后的名称。

两个 API 函数 nvrtcAddNameExpression 和 nvrtcGetLoweredName 协同工作以提供此功能。首先，将表示 __global__ 函数或 __device__/__constant__ 变量地址的“名称表达式”字符串提供给 nvrtcAddNameExpression。然后，使用 nvrtcCompileProgram 编译程序。在编译期间，NVRTC 将在用户程序末尾将名称表达式字符串解析为 C++ 常量表达式。常量表达式必须提供 __global__ 函数或 __device__/__constant__ 变量的地址。最后，使用原始名称表达式调用函数 nvrtcGetLoweredName，它返回指向降级名称的指针。降级的名称可用于在 CUDA Driver API 中引用内核或变量。

NVRTC 保证在调用 nvrtcAddNameExpression 中引用的任何 __global__ 函数或 __device__/__constant__ 变量都将存在于生成的 PTX 中（如果该定义在输入源代码中可用）。

7.1. 示例

示例：使用降级的名称列出了一个完整的可运行示例。一些相关的代码片段

GPU 源代码 ('gpu_program') 包含各种 __global__ 函数/函数模板以及 __device__/__constant__ 变量的定义

const char *gpu_program = "                                     \n\
__device__ int V1; // set from host code                        \n\
static __global__ void f1(int *result) { *result = V1 + 10; }   \n\
namespace N1 {                                                  \n\
   namespace N2 {                                               \n\
      __constant__ int V2; // set from host code                \n\
      __global__ void f2(int *result) { *result = V2 + 20; }    \n\
   }                                                            \n\
}                                                               \n\
template<typename T>                                            \n\
__global__ void f3(int *result) { *result = sizeof(T); }        \n\

主机源代码使用各种名称表达式调用 nvrtcAddNameExpression，这些名称表达式引用 __global__ 函数和 __device__/__constant__ 变量的地址

kernel_name_vec.push_back("&f1");
..
kernel_name_vec.push_back("N1::N2::f2");
..
kernel_name_vec.push_back("f3<int>");
..
kernel_name_vec.push_back("f3<double>");

// add name expressions to NVRTC. Note this must be done before
// the program is compiled.
for (size_t i = 0; i < name_vec.size(); ++i)
NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, kernel_name_vec[i].c_str()));
..
// add expressions for  __device__ / __constant__ variables to NVRTC
variable_name_vec.push_back("&V1");
..
variable_name_vec.push_back("&N1::N2::V2");
..
for (size_t i = 0; i < variable_name_vec.size(); ++i)
NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog,
variable_name_vec[i].c_str()));

然后使用 nvrtcCompileProgram 编译 GPU 程序。生成的 PTX 加载到 GPU 上。查找 __device__/__constant__ 变量和 __global__ 函数的修改后的名称

// note: this call must be made after NVRTC program has been
// compiled and before it has been destroyed.
NVRTC_SAFE_CALL(nvrtcGetLoweredName(
prog,
variable_name_vec[i].c_str(), // name expression
&name                         // lowered name
));
..
NVRTC_SAFE_CALL(nvrtcGetLoweredName(
prog,
kernel_name_vec[i].c_str(), // name expression
&name // lowered name
));

然后，__device__/__constant__ 变量的修改后的名称用于在模块中查找变量，并使用 CUDA Driver API 更新其值

CUdeviceptr variable_addr;
CUDA_SAFE_CALL(cuModuleGetGlobal(&variable_addr, NULL, module, name));
CUDA_SAFE_CALL(cuMemcpyHtoD(variable_addr,
&initial_value, sizeof(initial_value)));

然后，内核的修改后的名称用于使用 CUDA Driver API 启动它

CUfunction kernel;
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));
...
CUDA_SAFE_CALL(
cuLaunchKernel(kernel,
1, 1, 1, // grid dim
1, 1, 1, // block dim
0, NULL, // shared mem and stream
args, 0));

7.2. 注意

调用顺序：所有名称表达式都必须在使用 nvrtcCompileProgram 编译 NVRTC 程序之前使用 nvrtcAddNameExpression 添加。这是必需的，因为名称表达式在用户程序末尾解析，并且可能会触发模板实例化。必须仅在 NVRTC 程序编译后且在销毁之前，通过调用 nvrtcGetLoweredName 查找降级的名称。 nvrtcGetLoweredName 返回的指针指向 NVRTC 拥有的内存，并且当 NVRTC 程序被销毁时（nvrtcDestroyProgram）此内存被释放。因此，正确的调用顺序是：nvrtcAddNameExpression、nvrtcCompileProgram、nvrtcGetLoweredName、nvrtcDestroyProgram。
相同的名称表达式：传递给 nvrtcAddNameExpression 和 nvrtcGetLoweredName 的名称表达式字符串必须具有相同的字符。例如，“foo”和“foo ”不是相同的字符串，即使它们在语义上指的是同一个实体 (foo)，因为第二个字符串有一个额外的空格字符。
常量表达式：名称表达式字符串中的字符在用户程序末尾解析为 C++ 常量表达式。解析期间的任何错误都将导致编译失败，并且将在编译日志中生成编译器诊断信息。常量表达式必须引用 __global__ 函数或 __device__/__constant__ 变量的地址。
重载函数的地址：如果 NVRTC 源代码具有多个重载的 __global__ 函数，则名称表达式必须使用强制转换操作来消除歧义。但是，C++11 之前的 C++ 方言不允许在常量表达式中使用强制转换。如果使用此类名称表达式，请使用 -std 命令行标志在 C++11 或更高版本的方言中编译代码。示例：假设 GPU 代码字符串包含
```
__global__ void foo(int) { }
__global__ void foo(char) { }
```
名称表达式 (void(*)(int))foo 正确地消除了 foo(int) 的歧义，但程序必须在 C++11 或更高版本的方言（例如 -std=c++11）中编译，因为在 C++11 之前的常量表达式中不允许强制转换。

8. 与模板主机代码交互

在某些情况下，基于主机代码中的模板参数在设备代码中实例化 __global__ 函数模板很有用。 NVRTC 辅助函数 nvrtcGetTypeName 可用于提取主机代码中类型的源级别名称，并且此字符串可用于实例化 __global__ 函数模板，并使用 nvrtcAddNameExpression 和 nvrtcGetLoweredName 函数获取实例化的修改后的名称。

nvrtcGetTypeName 在 NVRTC 头文件中内联定义，并且当使用非零值定义宏 NVRTC_GET_TYPE_NAME 时可用。当分别使用 gcc/clang 和 cl.exe 编译器时，它使用 abi::__cxa_demangle 和 UnDecorateSymbolName 主机代码函数。用户可能需要指定额外的头文件路径和库来查找使用的主机函数（abi::__cxa_demangle / UnDecorateSymbolName）。有关参考，请参阅下面示例的构建说明 (nvrtcGetTypeName 构建说明)。

8.1. 模板主机代码示例

示例：使用 nvrtcGetTypeName 列出了一个完整的可运行示例。一些相关的代码片段

GPU 源代码 (gpu_program) 包含 __global__ 函数模板的定义

const char *gpu_program = " \n\
namespace N1 { struct S1_t { int i; double d; }; } \n\
template<typename T> \n\
__global__ void f3(int *result) { *result = sizeof(T); } \n\
\n";

主机代码函数 getKernelNameForType 基于主机模板类型 T 创建 __global__ 函数模板实例化的名称表达式。类型 T 的名称使用 nvrtcGetTypeName 提取

template <typename T>
std::string getKernelNameForType(void)
{
// Look up the source level name string for the type "T" using
// nvrtcGetTypeName() and use it to create the kernel name
std::string type_name;
NVRTC_SAFE_CALL(nvrtcGetTypeName<T>(&type_name));
return std::string("f3<") + type_name + ">";
}

名称表达式使用 nvrtcAddNameExpression 函数呈现给 NVRTC

name_vec.push_back(getKernelNameForType<int>());
..
name_vec.push_back(getKernelNameForType<double>());
..
name_vec.push_back(getKernelNameForType<N1::S1_t>());
..
for (size_t i = 0; i < name_vec.size(); ++i)
NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, name_vec[i].c_str()));

然后使用 nvrtcCompileProgram 编译 GPU 程序。生成的 PTX 加载到 GPU 上。查找 __global__ 函数模板实例化的修改后的名称

// note: this call must be made after NVRTC program has been
// compiled and before it has been destroyed.
NVRTC_SAFE_CALL(nvrtcGetLoweredName(
prog,
name_vec[i].c_str(), // name expression
&name // lowered name
));

然后，修改后的名称用于使用 CUDA Driver API 启动内核

CUfunction kernel;
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));
...
CUDA_SAFE_CALL(
cuLaunchKernel(kernel,
1, 1, 1, // grid dim
1, 1, 1, // block dim
0, NULL, // shared mem and stream
args, 0));

9. 版本控制方案

9.1. NVRTC 共享库版本控制

在下文中，MAJOR 和 MINOR 表示 CUDA 工具包的主版本和次版本。例如，对于 CUDA 11.2，MAJOR 为 “11”，MINOR 为 “2”。

Linux
- 在 CUDA 11.3 之前的 CUDA 工具包中，soname 设置为 “MAJOR.MINOR”。
- 在 CUDA 11.3 及更高版本的 11.x 工具包中，soname 字段设置为 “11.2”。
- 在主版本 > 11 的 CUDA 工具包（例如 CUDA 12.x）中，soname 字段设置为 “MAJOR”。
Windows
- 在 cuda 11.3 之前的 CUDA 工具包中，DLL 名称的格式为 “nvrtc64_XY_0.dll”，其中 X = MAJOR，Y = MINOR。
- 在 CUDA 11.3 及更高版本的 11.x 工具包中，DLL 名称为 “nvrtc64_112_0.dll”。
- 在主版本 > 11 的 CUDA 工具包（例如 CUDA 12.x）中，DLL 名称的格式为 “nvrtc64_X0_0.dll”，其中 X = MAJOR。

考虑主版本 > 11 的 CUDA 工具包。此 CUDA 工具包中的 NVRTC 共享库将具有与先前次要版本的同一 CUDA 工具包中的 NVRTC 共享库相同的 soname (Linux) 或 DLL 名称 (Windows)。同样，CUDA 11.3 及更高版本 11.x 版本中的 NVRTC 共享库将具有与 CUDA 11.2 中的 NVRTC 共享库相同的 soname (Linux) 或 DLL 名称 (Windows)。

由于上述版本控制方案，链接到特定 NVRTC 共享库的 NVRTC 客户端将继续与具有匹配 soname (Linux) 或 DLL 名称 (Windows) 的未来 NVRTC 共享库一起工作。这允许 NVRTC 客户端利用更新的 NVRTC 共享库 1 中提供的错误修复和增强功能。但是，更新的 NVRTC 共享库可能会生成 CUDA Driver API 的较旧 CUDA 驱动程序功能不接受的版本 PTX，如最佳实践指南中所述。

解决此问题的一些方法

安装与包含正在使用的 NVRTC 库的 CUDA 工具包兼容的更新的 CUDA 驱动程序。
使用 NVRTC 直接编译为 SASS 而不是 PTX（请参阅最佳实践指南）。

或者，NVRTC 客户端可以链接到静态 NVRTC 库，或者重新分发特定版本的 NVRTC 共享库，并使用 dlopen (Linux) 或 LoadLibrary (Windows) 函数在运行时使用该库。这两种方法都允许 NVRTC 客户端保持对其部署期间使用的 NVRTC 版本的控制，以确保可预测的功能和性能。

9.2. NVRTC-builtins 库

NVRTC-builtins 库包含作为 NVRTC 包一部分的辅助代码。它仅供 NVRTC 库内部使用。每个 NVRTC 库仅与来自同一 CUDA 工具包的 NVRTC-builtins 库兼容。

10. 杂项说明

10.1. 线程安全

多个线程可以并发调用 NVRTC API 函数，只要没有竞争条件。在此上下文中，如果多个线程使用相同的 nvrtcProgram 参数并发调用 NVRTC API 函数，则会发生竞争条件，其中至少一个线程正在调用 nvrtcCompileProgram 或 nvrtcAddNameExpression 2。

自 CUDA 12.3 起，NVRTC 允许并发调用 nvrtcCompileProgram，也可能并发调用嵌入式 NVVM 优化器/代码生成阶段。设置环境变量 NVRTC_DISABLE_CONCURRENT_NVVM 将禁用此行为，即，嵌入式 NVVM 优化器/代码生成阶段的调用将被串行化。

10.2. 堆栈大小

在 Linux 上，NVRTC 将在编译期间使用 setrlimit() 函数将堆栈大小增加到允许的最大值。这降低了编译器在处理复杂输入源时耗尽堆栈的可能性。编译完成后，堆栈大小将重置为先前的值。

由于 setrlimit() 更改了整个进程的堆栈大小，因此它也会影响可能同时执行的其他应用程序线程。命令行标志 -modify-stack-limit=false 将阻止 NVRTC 修改堆栈限制。

10.3. NVRTC 静态库

NVRTC 静态库引用了在 NVRTC-builtins 静态库和 PTX 编译器静态库中定义的函数。请参阅构建说明以获取示例。

11. 示例：SAXPY

11.1. 代码 (saxpy.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>

#define NUM_THREADS 128
#define NUM_BLOCKS 32
#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
  do {                                                            \
    CUresult result = x;                                          \
    if (result != CUDA_SUCCESS) {                                 \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
                << msg << '\n';                                   \
      exit(1);                                                    \
    }                                                             \
} while(0)

const char *saxpy = "                                           \n\
extern \"C\" __global__                                         \n\
void saxpy(float a, float *x, float *y, float *out, size_t n)   \n\
{                                                               \n\
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;           \n\
  if (tid < n) {                                                \n\
    out[tid] = a * x[tid] + y[tid];                             \n\
  }                                                             \n\
}                                                               \n";

int main()
{
   // Create an instance of nvrtcProgram with the SAXPY code string.
   nvrtcProgram prog;
   NVRTC_SAFE_CALL(
      nvrtcCreateProgram(&prog,         // prog
                        saxpy,         // buffer
                        "saxpy.cu",    // name
                        0,             // numHeaders
                        NULL,          // headers
                        NULL));        // includeNames
   // Compile the program with fmad disabled.
   // Note: Can specify GPU target architecture explicitly with '-arch' flag.
   const char *opts[] = {"--fmad=false"};
   nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                   1,     // numOptions
                                                   opts); // options
   // Obtain compilation log from the program.
   size_t logSize;
   NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
   char *log = new char[logSize];
   NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
   std::cout << log << '\n';
   delete[] log;
   if (compileResult != NVRTC_SUCCESS) {
      exit(1);
   }
   // Obtain PTX from the program.
   size_t ptxSize;
   NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
   char *ptx = new char[ptxSize];
   NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
   // Destroy the program.
   NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));
   // Load the generated PTX and get a handle to the SAXPY kernel.
   CUdevice cuDevice;
   CUcontext context;
   CUmodule module;
   CUfunction kernel;
   CUDA_SAFE_CALL(cuInit(0));
   CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
   CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
   CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));
   CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "saxpy"));
   // Generate input for execution, and create output buffers.
   size_t n = NUM_THREADS * NUM_BLOCKS;
   size_t bufferSize = n * sizeof(float);
   float a = 5.1f;
   float *hX = new float[n], *hY = new float[n], *hOut = new float[n];
   for (size_t i = 0; i < n; ++i) {
      hX[i] = static_cast<float>(i);
      hY[i] = static_cast<float>(i * 2);
   }
   CUdeviceptr dX, dY, dOut;
   CUDA_SAFE_CALL(cuMemAlloc(&dX, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dY, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dX, hX, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dY, hY, bufferSize));
   // Execute SAXPY.
   void *args[] = { &a, &dX, &dY, &dOut, &n };
   CUDA_SAFE_CALL(
      cuLaunchKernel(kernel,
                     NUM_BLOCKS, 1, 1,    // grid dim
                     NUM_THREADS, 1, 1,   // block dim
                     0, NULL,             // shared mem and stream
                     args, 0));           // arguments
   CUDA_SAFE_CALL(cuCtxSynchronize());
   // Retrieve and print output.
   CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));
   for (size_t i = 0; i < n; ++i) {
      std::cout << a << " * " << hX[i] << " + " << hY[i]
               << " = " << hOut[i] << '\n';
   }
   // Release resources.
   CUDA_SAFE_CALL(cuMemFree(dX));
   CUDA_SAFE_CALL(cuMemFree(dY));
   CUDA_SAFE_CALL(cuMemFree(dOut));
   CUDA_SAFE_CALL(cuModuleUnload(module));
   CUDA_SAFE_CALL(cuCtxDestroy(context));
   delete[] hX;
   delete[] hY;
   delete[] hOut;
   delete[] ptx;
   return 0;
}

11.2. Saxpy 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe saxpy.cpp /Fesaxpy ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ saxpy.cpp -o saxpy \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe saxpy.cpp /Fesaxpy  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ saxpy.cpp -o saxpy \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

12. 示例：使用 Lowered Name

12.1. 代码 (lowered-name.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>
#include <vector>
#include <string>

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                              \
   nvrtcResult result = x;                                        \
   if (result != NVRTC_SUCCESS) {                                 \
      std::cerr << "\nerror: " #x " failed with error "           \
               << nvrtcGetErrorString(result) << '\n';            \
      exit(1);                                                    \
   }                                                              \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
do {                                                              \
   CUresult result = x;                                           \
   if (result != CUDA_SUCCESS) {                                  \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
               << msg << '\n';                                    \
      exit(1);                                                    \
   }                                                              \
} while(0)

const char *gpu_program = "
  __device__ int V1; // set from host code                        \n\
  static __global__ void f1(int *result) { *result = V1 + 10; }   \n\
  namespace N1 {                                                  \n\
    namespace N2 {                                                \n\
       __constant__ int V2; // set from host code                 \n\
       __global__ void f2(int *result) { *result = V2 + 20; }     \n\
      }                                                           \n\
   }                                                              \n\
   template<typename T>                                           \n\
   __global__ void f3(int *result) { *result = sizeof(T); }       \n\
                                                                  \n";

int main()
{
   // Create an instance of nvrtcProgram
   nvrtcProgram prog;
   NVRTC_SAFE_CALL(nvrtcCreateProgram(&prog,         // prog
                                       gpu_program,   // buffer
                                       "prog.cu",     // name
                                       0,             // numHeaders
                                       NULL,          // headers
                                       NULL));        // includeNames

   // add all name expressions for kernels
   std::vector<std::string> kernel_name_vec;
   std::vector<std::string> variable_name_vec;
   std::vector<int> variable_initial_value;

   std::vector<int> expected_result;

   // note the name expressions are parsed as constant expressions
   kernel_name_vec.push_back("&f1");
   expected_result.push_back(10 + 100);

   kernel_name_vec.push_back("N1::N2::f2");
   expected_result.push_back(20 + 200);

   kernel_name_vec.push_back("f3<int>");
   expected_result.push_back(sizeof(int));

   kernel_name_vec.push_back("f3<double>");
   expected_result.push_back(sizeof(double));

   // add kernel name expressions to NVRTC. Note this must be done before
   // the program is compiled.
   for (size_t i = 0; i < kernel_name_vec.size(); ++i)
      NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, kernel_name_vec[i].c_str()));

   // add expressions for  __device__ / __constant__ variables to NVRTC
   variable_name_vec.push_back("&V1");
   variable_initial_value.push_back(100);

   variable_name_vec.push_back("&N1::N2::V2");
   variable_initial_value.push_back(200);

   for (size_t i = 0; i < variable_name_vec.size(); ++i)
      NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, variable_name_vec[i].c_str()));

   nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                   0,     // numOptions
                                                   NULL); // options
   // Obtain compilation log from the program.
   size_t logSize;
   NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
   char *log = new char[logSize];
   NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
   std::cout << log << '\n';
   delete[] log;
   if (compileResult != NVRTC_SUCCESS) {
      exit(1);
   }
   // Obtain PTX from the program.
   size_t ptxSize;
   NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
   char *ptx = new char[ptxSize];
   NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
   // Load the generated PTX
   CUdevice cuDevice;
   CUcontext context;
   CUmodule module;

   CUDA_SAFE_CALL(cuInit(0));
   CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
   CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
   CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));

   CUdeviceptr dResult;
   int hResult = 0;
   CUDA_SAFE_CALL(cuMemAlloc(&dResult, sizeof(hResult)));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dResult, &hResult, sizeof(hResult)));

   // for each of the __device__/__constant__ variable address
   // expressions provided to NVRTC, extract the lowered name for the
   // corresponding variable, and set its value
   for (size_t i = 0; i < variable_name_vec.size(); ++i) {
      const char *name;

      // note: this call must be made after NVRTC program has been
      // compiled and before it has been destroyed.
      NVRTC_SAFE_CALL(nvrtcGetLoweredName(
                           prog,
            variable_name_vec[i].c_str(), // name expression
            &name                         // lowered name
                                          ));
      int initial_value = variable_initial_value[i];

      // get pointer to variable using lowered name, and set its
      // initial value
      CUdeviceptr variable_addr;
      CUDA_SAFE_CALL(cuModuleGetGlobal(&variable_addr, NULL, module, name));
      CUDA_SAFE_CALL(cuMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value)));
   }


   // for each of the kernel name expressions previously provided to NVRTC,
   // extract the lowered name for corresponding __global__ function,
   // and launch it.

   for (size_t i = 0; i < kernel_name_vec.size(); ++i) {
      const char *name;

      // note: this call must be made after NVRTC program has been
      // compiled and before it has been destroyed.
      NVRTC_SAFE_CALL(nvrtcGetLoweredName(
                           prog,
            kernel_name_vec[i].c_str(), // name expression
            &name                // lowered name
                                          ));

      // get pointer to kernel from loaded PTX
      CUfunction kernel;
      CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));

      // launch the kernel
      std::cout << "\nlaunching " << name << " ("
            << kernel_name_vec[i] << ")" << std::endl;

      void *args[] = { &dResult };
      CUDA_SAFE_CALL(
         cuLaunchKernel(kernel,
            1, 1, 1,             // grid dim
            1, 1, 1,             // block dim
            0, NULL,             // shared mem and stream
            args, 0));           // arguments
      CUDA_SAFE_CALL(cuCtxSynchronize());

      // Retrieve the result
      CUDA_SAFE_CALL(cuMemcpyDtoH(&hResult, dResult, sizeof(hResult)));

      // check against expected value
      if (expected_result[i] != hResult) {
         std::cout << "\n Error: expected result = " << expected_result[i]
                  << " , actual result = " << hResult << std::endl;
         exit(1);
      }
   }  // for

   // Release resources.
   CUDA_SAFE_CALL(cuMemFree(dResult));
   CUDA_SAFE_CALL(cuModuleUnload(module));
   CUDA_SAFE_CALL(cuCtxDestroy(context));
   delete[] ptx;

   // Destroy the program.
   NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

   return 0;
}

12.2. Lowered Name 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe lowered-name.cpp /Felowered-name ^
/I "%CUDA_PATH%"\include ^
"%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ lowered-name.cpp -o lowered-name \
-I $CUDA_PATH/include \
-L $CUDA_PATH/lib64 \
-lnvrtc -lcuda \
-Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe lowered-name.cpp /Felowered-name  ^
/I "%CUDA_PATH%"\include ^
"%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
"%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
"%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
"%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ lowered-name.cpp -o lowered-name \
-I $CUDA_PATH/include \
-L $CUDA_PATH/lib64 \
-lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static \
-lcuda -lpthread

13. 示例：使用 nvrtcGetTypeName

13.1. 代码 (host-type-name.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>
#include <vector>
#include <string>

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                            \
   nvrtcResult result = x;                                       \
   if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
               << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
   }                                                             \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
do {                                                            \
   CUresult result = x;                                          \
   if (result != CUDA_SUCCESS) {                                 \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
               << msg << '\n';                                   \
      exit(1);                                                    \
   }                                                             \
} while(0)

const char *gpu_program = "                                     \n\
namespace N1 { struct S1_t { int i; double d; }; }              \n\
template<typename T>                                            \n\
__global__ void f3(int *result) { *result = sizeof(T); }        \n\
                                                               \n";


// note: this structure is also defined in GPU code string. Should ideally
// be in a header file included by both GPU code string and by CPU code.
namespace N1 { struct S1_t { int i; double d; }; };
template <typename T>
std::string getKernelNameForType(void)
{
   // Look up the source level name string for the type "T" using
   // nvrtcGetTypeName() and use it to create the kernel name
   std::string type_name;
   NVRTC_SAFE_CALL(nvrtcGetTypeName<T>(&type_name));
   return std::string("f3<") + type_name + ">";
}

int main()
{
// Create an instance of nvrtcProgram
nvrtcProgram prog;
NVRTC_SAFE_CALL(
   nvrtcCreateProgram(&prog,         // prog
                     gpu_program,   // buffer
                     "gpu_program.cu",    // name
                     0,             // numHeaders
                     NULL,          // headers
                     NULL));        // includeNames

// add all name expressions for kernels
std::vector<std::string> name_vec;
std::vector<int> expected_result;

// note the name expressions are parsed as constant expressions
name_vec.push_back(getKernelNameForType<int>());
expected_result.push_back(sizeof(int));

name_vec.push_back(getKernelNameForType<double>());
expected_result.push_back(sizeof(double));

name_vec.push_back(getKernelNameForType<N1::S1_t>());
expected_result.push_back(sizeof(N1::S1_t));


// add name expressions to NVRTC. Note this must be done before
// the program is compiled.
for (size_t i = 0; i < name_vec.size(); ++i)
   NVRTC_SAFE_CALL(nvrtcAddNameExpression(prog, name_vec[i].c_str()));

nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                0,     // numOptions
                                                NULL); // options
// Obtain compilation log from the program.
size_t logSize;
NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
char *log = new char[logSize];
NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
std::cout << log << '\n';
delete[] log;
if (compileResult != NVRTC_SUCCESS) {
   exit(1);
}
// Obtain PTX from the program.
size_t ptxSize;
NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = new char[ptxSize];
NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));


// Load the generated PTX
CUdevice cuDevice;
CUcontext context;
CUmodule module;

CUDA_SAFE_CALL(cuInit(0));
CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));

CUdeviceptr dResult;
int hResult = 0;
CUDA_SAFE_CALL(cuMemAlloc(&dResult, sizeof(hResult)));
CUDA_SAFE_CALL(cuMemcpyHtoD(dResult, &hResult, sizeof(hResult)));

// for each of the name expressions previously provided to NVRTC,
// extract the lowered name for corresponding __global__ function,
// and launch it.

for (size_t i = 0; i < name_vec.size(); ++i) {
   const char *name;

   // note: this call must be made after NVRTC program has been
   // compiled and before it has been destroyed.
   NVRTC_SAFE_CALL(nvrtcGetLoweredName(
                        prog,
         name_vec[i].c_str(), // name expression
         &name                // lowered name
                                       ));

   // get pointer to kernel from loaded PTX
   CUfunction kernel;
   CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, name));

   // launch the kernel
   std::cout << "\nlaunching " << name << " ("
         << name_vec[i] << ")" << std::endl;

   void *args[] = { &dResult };
   CUDA_SAFE_CALL(
      cuLaunchKernel(kernel,
         1, 1, 1,             // grid dim
         1, 1, 1,             // block dim
         0, NULL,             // shared mem and stream
         args, 0));           // arguments
   CUDA_SAFE_CALL(cuCtxSynchronize());

   // Retrieve the result
   CUDA_SAFE_CALL(cuMemcpyDtoH(&hResult, dResult, sizeof(hResult)));

   // check against expected value
   if (expected_result[i] != hResult) {
      std::cout << "\n Error: expected result = " << expected_result[i]
      << " , actual result = " << hResult << std::endl;
      exit(1);
   }
}  // for

// Release resources.
CUDA_SAFE_CALL(cuMemFree(dResult));
CUDA_SAFE_CALL(cuModuleUnload(module));
CUDA_SAFE_CALL(cuCtxDestroy(context));
delete[] ptx;

// Destroy the program.
NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

return 0;
}

13.2. nvrtcGetTypeName 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp /Fehost-type-name ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib DbgHelp.lib

Linux

g++ -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp -o host-type-name \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp /Fehost-type-name  ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
   "%CUDA_PATH%"\lib\x64\cuda.lib DbgHelp.lib user32.lib Ws2_32.lib

Linux

g++ -DNVRTC_GET_TYPE_NAME=1 host-type-name.cpp -o host-type-name \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static \
   -lcuda -lpthread

14. 示例：动态并行

代码 (dynamic-parallelism.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <iostream>

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                              \
   nvrtcResult result = x;                                        \
   if (result != NVRTC_SUCCESS) {                                 \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
   }                                                              \
} while(0)
#define CUDA_SAFE_CALL(x)                                         \
do {                                                              \
   CUresult result = x;                                           \
   if (result != CUDA_SUCCESS) {                                  \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
                << msg << '\n';                                   \
      exit(1);                                                    \
   }                                                              \
} while(0)

const char *dynamic_parallelism = "                             \n\
extern \"C\" __global__                                         \n\
void child(float *out, size_t n)                                \n\
{                                                               \n\
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;          \n\
   if (tid < n) {                                               \n\
      out[tid] = tid;                                           \n\
   }                                                            \n\
}                                                               \n\
                                                                \n\
extern \"C\" __global__                                         \n\
void parent(float *out, size_t n,                               \n\
            size_t numBlocks, size_t numThreads)                \n\
{                                                               \n\
   child<<<numBlocks, numThreads>>>(out, n);                    \n\
   cudaDeviceSynchronize();                                     \n\
}                                                               \n";
int main(int argc, char *argv[])
{
if (argc < 2) {
   std::cout << "Usage: dynamic-parallelism <path to cudadevrt library>\n\n"
             << "<path to cudadevrt library> must include the cudadevrt\n"
             << "library name itself, e.g., Z:\\path\\to\\cudadevrt.lib on \n"
             << "Windows and /path/to/libcudadevrt.a on Linux.\n";
   exit(1);
}
size_t numBlocks = 32;
size_t numThreads = 128;
// Create an instance of nvrtcProgram with the code string.
nvrtcProgram prog;
NVRTC_SAFE_CALL(
   nvrtcCreateProgram(&prog,                       // prog
                     dynamic_parallelism,          // buffer
                     "dynamic_parallelism.cu",     // name
                     0,                            // numHeaders
                     NULL,                         // headers
                     NULL));                       // includeNames
// Compile the program for compute_35 with rdc enabled.
const char *opts[] = {"--gpu-architecture=compute_35",
                        "--relocatable-device-code=true"};
nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                2,     // numOptions
                                                opts); // options
// Obtain compilation log from the program.
size_t logSize;
NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
char *log = new char[logSize];
NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
std::cout << log << '\n';
delete[] log;
if (compileResult != NVRTC_SUCCESS) {
   exit(1);
}
// Obtain PTX from the program.
size_t ptxSize;
NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = new char[ptxSize];
NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
// Destroy the program.
NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));
// Load the generated PTX and get a handle to the parent kernel.
CUdevice cuDevice;
CUcontext context;
CUlinkState linkState;
CUmodule module;
CUfunction kernel;
CUDA_SAFE_CALL(cuInit(0));
CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
CUDA_SAFE_CALL(cuLinkCreate(0, 0, 0, &linkState));
CUDA_SAFE_CALL(cuLinkAddFile(linkState, CU_JIT_INPUT_LIBRARY, argv[1],
                              0, 0, 0));
CUDA_SAFE_CALL(cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                              (void *)ptx, ptxSize, "dynamic_parallelism.ptx",
                              0, 0, 0));
size_t cubinSize;
void *cubin;
CUDA_SAFE_CALL(cuLinkComplete(linkState, &cubin, &cubinSize));
CUDA_SAFE_CALL(cuModuleLoadData(&module, cubin));
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "parent"));
// Generate input for execution, and create output buffers.
size_t n = numBlocks * numThreads;
size_t bufferSize = n * sizeof(float);
float *hOut = new float[n];
CUdeviceptr dX, dY, dOut;
CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));
// Execute parent kernel.
void *args[] = { &dOut, &n, &numBlocks, &numThreads };
CUDA_SAFE_CALL(
   cuLaunchKernel(kernel,
                  1, 1, 1,    // grid dim
                  1, 1, 1,    // block dim
                  0, NULL,    // shared mem and stream
                  args, 0));  // arguments
CUDA_SAFE_CALL(cuCtxSynchronize());
// Retrieve and print output.
CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));

for (size_t i = 0; i < n; ++i) {
   std::cout << hOut[i] << '\n';
}
// Release resources.
CUDA_SAFE_CALL(cuMemFree(dOut));
CUDA_SAFE_CALL(cuModuleUnload(module));
CUDA_SAFE_CALL(cuLinkDestroy(linkState));
CUDA_SAFE_CALL(cuCtxDestroy(context));
delete[] hOut;
delete[] ptx;
return 0;
}

14.1. 动态并行构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe dynamic-parallelism.cpp /Fedynamic-parallelism ^
   /I "%CUDA_PATH%\include" ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ dynamic-parallelism.cpp -o dynamic-parallelism \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe dynamic-parallelism.cpp /Fedynamic-parallelism  ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
  "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ dynamic-parallelism.cpp -o dynamic-parallelism \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

15. 示例：设备 LTO（链接时优化）

本节演示设备链接时优化 (LTO)。LTO IR 有两个单元。第一个单元是使用 nvcc 离线生成的，方法是将架构指定为 -arch lto_XX（请参阅代码 (offline.cu)）。生成的 LTO IR 打包在 fatbinary 中。

第二个单元是使用 NVRTC 在线生成的，方法是指定标志 -dlto（请参阅代码 (online.cpp)）。

然后，这两个单元被传递给 libnvJitLink* API 函数，这些函数将 LTO IR 链接在一起，对链接的 IR 运行优化器并生成 cubin（请参阅代码 (online.cpp)）。然后 cubin 被加载到 GPU 上并执行。

15.1. 代码 (offline.cu)

__device__  float compute(float a, float x, float y) {
return a * x + y;
}

15.2. 代码 (online.cpp)

#include <nvrtc.h>
#include <cuda.h>
#include <nvJitLink.h>
#include <iostream>

#define NUM_THREADS 128
#define NUM_BLOCKS 32

#define NVRTC_SAFE_CALL(x)                                        \
do {                                                              \
   nvrtcResult result = x;                                        \
   if (result != NVRTC_SUCCESS) {                                 \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
   }                                                              \
} while(0)

#define CUDA_SAFE_CALL(x)                                         \
do {                                                              \
   CUresult result = x;                                           \
   if (result != CUDA_SUCCESS) {                                  \
      const char *msg;                                            \
      cuGetErrorName(result, &msg);                               \
      std::cerr << "\nerror: " #x " failed with error "           \
                << msg << '\n';                                   \
      exit(1);                                                    \
   }                                                              \
} while(0)

#define NVJITLINK_SAFE_CALL(h,x)                                  \
do {                                                              \
   nvJitLinkResult result = x;                                    \
   if (result != NVJITLINK_SUCCESS) {                             \
      std::cerr << "\nerror: " #x " failed with error "           \
                << result << '\n';                                \
      size_t lsize;                                               \
      result = nvJitLinkGetErrorLogSize(h, &lsize);               \
      if (result == NVJITLINK_SUCCESS && lsize > 0) {             \
         char *log = (char*)malloc(lsize);                        \
         result = nvJitLinkGetErrorLog(h, log);                   \
         if (result == NVJITLINK_SUCCESS) {                       \
            std::cerr << "error: " << log << '\n';                \
            free(log);                                            \
         }                                                        \
      }                                                           \
      exit(1);                                                    \
   }                                                              \
} while(0)

const char *lto_saxpy = "                                       \n\
extern __device__ float compute(float a, float x, float y);     \n\
                                                                \n\
extern \"C\" __global__                                         \n\
void saxpy(float a, float *x, float *y, float *out, size_t n)   \n\
{                                                               \n\
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;          \n\
   if (tid < n) {                                               \n\
      out[tid] = compute(a, x[tid], y[tid]);                    \n\
   }                                                            \n\
} \n";

int main(int argc, char *argv[])
{
   size_t numBlocks = 32;
   size_t numThreads = 128;
   // Create an instance of nvrtcProgram with the code string.
   nvrtcProgram prog;
   NVRTC_SAFE_CALL(
      nvrtcCreateProgram(&prog,                      // prog
                        lto_saxpy,                   // buffer
                        "lto_saxpy.cu",              // name
                        0,                           // numHeaders
                        NULL,                        // headers
                        NULL));                      // includeNames

   // specify that LTO IR should be generated for LTO operation
   const char *opts[] = {"-dlto",
                           "--relocatable-device-code=true"};
   nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
                                                   2,     // numOptions
                                                   opts); // options
   // Obtain compilation log from the program.
   size_t logSize;
   NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
   char *log = new char[logSize];
   NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
   std::cout << log << '\n';
   delete[] log;
   if (compileResult != NVRTC_SUCCESS) {
      exit(1);
   }
   // Obtain generated LTO IR from the program.
   size_t LTOIRSize;
   NVRTC_SAFE_CALL(nvrtcGetLTOIRSize(prog, &LTOIRSize));
   char *LTOIR = new char[LTOIRSize];
   NVRTC_SAFE_CALL(nvrtcGetLTOIR(prog, LTOIR));
   // Destroy the program.
   NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

   CUdevice cuDevice;
   CUcontext context;
   CUmodule module;
   CUfunction kernel;
   CUDA_SAFE_CALL(cuInit(0));
   CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
   CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));

   // Load the generated LTO IR and the LTO IR generated offline
   // and link them together.
   nvJitLinkHandle handle;
   // Dynamically determine the arch to link for
   int major = 0;
   int minor = 0;
   CUDA_SAFE_CALL(cuDeviceGetAttribute(&major,
                     CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
   CUDA_SAFE_CALL(cuDeviceGetAttribute(&minor,
                     CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
   int arch = major*10 + minor;
   char smbuf[16];
   sprintf(smbuf, "-arch=sm_%d", arch);
   const char *lopts[] = {"-dlto", smbuf};
   NVJITLINK_SAFE_CALL(handle, nvJitLinkCreate(&handle, 2, lopts));

   // NOTE: assumes "offline.fatbin" is in the current directory
   // The fatbinary contains LTO IR generated offline using nvcc
   NVJITLINK_SAFE_CALL(handle, nvJitLinkAddFile(handle, NVJITLINK_INPUT_FATBIN,
                                 "offline.fatbin"));
   NVJITLINK_SAFE_CALL(handle, nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR,
                              (void *)LTOIR, LTOIRSize, "lto_online"));

   // The call to nvJitLinkComplete causes linker to link together the two
   // LTO IR modules (offline and online), do optimization on the linked LTO IR,
   // and generate cubin from it.
   NVJITLINK_SAFE_CALL(handle, nvJitLinkComplete(handle));
   size_t cubinSize;
   NVJITLINK_SAFE_CALL(handle, nvJitLinkGetLinkedCubinSize(handle, &cubinSize));
   void *cubin = malloc(cubinSize);
   NVJITLINK_SAFE_CALL(handle, nvJitLinkGetLinkedCubin(handle, cubin));
   NVJITLINK_SAFE_CALL(handle, nvJitLinkDestroy(&handle));


   CUDA_SAFE_CALL(cuModuleLoadData(&module, cubin));
   CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "saxpy"));

   // Generate input for execution, and create output buffers.
   size_t n = NUM_THREADS * NUM_BLOCKS;
   size_t bufferSize = n * sizeof(float);
   float a = 5.1f;
   float *hX = new float[n], *hY = new float[n], *hOut = new float[n];
   for (size_t i = 0; i < n; ++i) {
      hX[i] = static_cast<float>(i);
      hY[i] = static_cast<float>(i * 2);
   }
   CUdeviceptr dX, dY, dOut;
   CUDA_SAFE_CALL(cuMemAlloc(&dX, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dY, bufferSize));
   CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dX, hX, bufferSize));
   CUDA_SAFE_CALL(cuMemcpyHtoD(dY, hY, bufferSize));
   // Execute SAXPY.
   void *args[] = { &a, &dX, &dY, &dOut, &n };
   CUDA_SAFE_CALL(
      cuLaunchKernel(kernel,
                     NUM_BLOCKS, 1, 1,    // grid dim
                     NUM_THREADS, 1, 1,   // block dim
                     0, NULL,             // shared mem and stream
                     args, 0));           // arguments
   CUDA_SAFE_CALL(cuCtxSynchronize());
   // Retrieve and print output.
   CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));

   for (size_t i = 0; i < n; ++i) {
      std::cout << a << " * " << hX[i] << " + " << hY[i]
               << " = " << hOut[i] << '\n';
   }
   // Release resources.
   CUDA_SAFE_CALL(cuMemFree(dX));
   CUDA_SAFE_CALL(cuMemFree(dY));
   CUDA_SAFE_CALL(cuMemFree(dOut));
   CUDA_SAFE_CALL(cuModuleUnload(module));
   CUDA_SAFE_CALL(cuCtxDestroy(context));
   free(cubin);
   delete[] hX;
   delete[] hY;
   delete[] hOut;
   delete[] LTOIR;
   return 0;
}

15.3. 设备 LTO 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

将 offline.cu 编译为包含 LTO IR 的 fatbinary（根据需要将 lto_52 更改为不同的 lto_XX 架构）。
```
nvcc -arch lto_52 -rdc=true -fatbin offline.cu
```

使用 NVRTC 共享库

Windows

cl.exe online.cpp /Feonline ^
   /I "%CUDA_PATH%\include" ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib ^
   "%CUDA_PATH%"\lib\x64\nvJitLink.lib ^
   "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ online.cpp -o online \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lnvJitLink -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe online.cpp /Feonline  ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvJitLink_static.lib ^
   "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
   "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ online.cpp -o online \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvJitLink_static -lnvptxcompiler_static -lcuda \
   -lpthread

16. 示例：自动 PCH (CUDA 12.8+)

此示例演示自动 PCH 模式，该模式通过将 -pch 传递给 nvrtcCompileProgram 调用来启用。有两个不同的程序 first 和 second 包含相同的头文件

const char *first  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"auto_pch_common.h\" \n"
                      "__global__ void other(double *a, double *b) {\n"
                      "*a = *b + doit();\n}\n";

当使用 NVRTC 和 -pch 编译 first 时，编译器将为包含 auto_pch_common.h 头文件创建 PCH 文件。当使用 -pch 编译 second 时，编译器将透明地使用先前创建的 PCH 文件。

以下是程序运行时输出

compiling first program
"default_program": creating precompiled header file "default_program.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_SUCCESS

 compiling second program (expect to use PCH)
"default_program": using precompiled header file "default_program.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED

在编译 first 程序时，PCH 文件已成功创建。在编译 second 程序时，PCH 文件已成功使用，但编译器选择不创建另一个 PCH 文件（因此 nvrtcGetPCHCreateStatus() 返回 NVRC_ERROR_NO_PCH_CREATE_ATTEMPTED - 这是预期的）。

16.1. 代码 (auto_pch_common.h)

__device__ double qqq = 10;

template <typename T>
__device__ double get(T in) { return sin(in) + qqq; }

__device__  double doit() { return get<double>(0.5); }

16.2. 代码 (auto_pch.cpp)

#include <iostream>
#include <nvrtc.h>

#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
  } while(0)


const char *docompile(const char *progstr)
{
  nvrtcProgram prog;

  NVRTC_SAFE_CALL(
    nvrtcCreateProgram(&prog,
      progstr,         // buffer
      "",              // name
      0,               // numHeaders
      NULL,            // headers
      NULL));          // includeNames

  const char *opts[] = { "-pch" /* automatic PCH */
                       };

  nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
    sizeof(opts) / sizeof(opts[0]),     // numOptions
    opts); // options


  // Obtain compilation log from the program.
  size_t logSize;
  NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  char* log = new char[logSize];
  NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
  std::cout << log;
  delete[] log;
  if (compileResult != NVRTC_SUCCESS) {
    exit(1);
  }
  nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
  std::cout << "\n nvrtcGetPCHCreateStatus returned : "
            << nvrtcGetErrorString(err) << std::endl;

  // Obtain PTX from the program.
  size_t ptxSize;
  NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
  char* ptx = new char[ptxSize];
  NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
  // Destroy the program.
  NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

  return ptx;
}

int main()
{
const char *first  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void other(double *a, double *b) {\n"
                     "*a = *b + doit();\n}\n";

std::cout << "\n compiling first program\n";
const char *ptr1 = docompile(first);

std::cout << "\n compiling second program (expect to use PCH)\n";
const char *ptr2 = docompile(second);

delete [] ptr1;
delete [] ptr2;

}

16.3. 自动 PCH 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe auto_pch.cpp /Feauto_pch ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ auto_pch.cpp -o auto_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe auto_pch.cpp /Feauto_pch  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ auto_pch.cpp -o auto_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

17. 示例：显式 PCH 创建/使用 (CUDA 12.8+)

此示例演示显式 PCH 模式，其中用户代码可以使用 “-create-pch=filename” 和 “-use-pch=filename” NVRTC 标志分别显式创建和使用 PCH 文件。有两个不同的程序 first 和 second 包含相同的头文件

const char *first  = "#include \"common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"common.h\" \n"
                      "__global__ void other(double *a, double *b) {\n"
                      "*a = *b + doit();\n}\n";

当使用 NVRTC 和 -create-pch=foo.pch 编译 first 时，编译器将为包含 common.h 头文件创建 PCH 文件 (“foo.pch”)。当使用 -use-pch=foo.pch 编译 second 时，编译器将使用指定的 PCH 文件 “foo.pch”。

以下是程序运行时输出

compiling first program
"default_program": creating precompiled header file "foo.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_SUCCESS

 compiling second program (expect to use PCH)
"default_program": using precompiled header file "foo.pch"

 nvrtcGetPCHCreateStatus returned : NVRTC_ERROR_NO_PCH_CREATE_ATTEMPTED

在编译 first 程序时，PCH 文件 “foo.pch” 已成功创建。在编译 second 程序时，PCH 文件 “foo.pch” 已成功使用；但没有创建 PCH 文件（因此 nvrtcGetPCHCreateStatus() 返回 NVRC_ERROR_NO_PCH_CREATE_ATTEMPTED）。

17.1. 代码 (common.h)

__device__ double qqq = 10;

template <typename T>
__device__ double get(T in) { return sin(in) + qqq; }

__device__  double doit() { return get<double>(0.5); }

17.2. 代码 (explicit_pch.cpp)

#include <iostream>
#include <nvrtc.h>

#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
  } while(0)


const char *docompile(const char *progstr, const char *pch_mode)
{
  nvrtcProgram prog;

  NVRTC_SAFE_CALL(
    nvrtcCreateProgram(&prog,
      progstr,         // buffer
      "",              // name
      0,               // numHeaders
      NULL,            // headers
      NULL));          // includeNames

  const char *opts[] = { pch_mode /*create/use PCH */
                       };

  nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
    sizeof(opts) / sizeof(opts[0]),     // numOptions
    opts); // options


  // Obtain compilation log from the program.
  size_t logSize;
  NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  char* log = new char[logSize];
  NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
  std::cout << log;
  delete[] log;
  if (compileResult != NVRTC_SUCCESS) {
    exit(1);
  }
  nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
  std::cout << "\n nvrtcGetPCHCreateStatus returned : " << nvrtcGetErrorString(err) << std::endl;

  // Obtain PTX from the program.
  size_t ptxSize;
  NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
  char* ptx = new char[ptxSize];
  NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
  // Destroy the program.
  NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

  return ptx;
}

int main()
{
const char *first  = "#include \"common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"common.h\" \n"
                     "__global__ void other(double *a, double *b) {\n"
                     "*a = *b + doit();\n}\n";

std::cout << "\n compiling first program\n";
const char *ptr1 = docompile(first, "-create-pch=foo.pch");

std::cout << "\n compiling second program (expect to use PCH)\n";
const char *ptr2 = docompile(second, "-use-pch=foo.pch");

delete [] ptr1;
delete [] ptr2;

}

17.3. 显式 PCH 构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe explicit_pch.cpp /Feexplicit_pch ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ explicit_pch.cpp -o explicit_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe explicit_pch.cpp /Feexplicit_pch  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ explicit_pch.cpp -o explicit_pch \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

18. 示例：PCH 堆调整大小 (CUDA 12.8+)

PCH 堆在 nvrtcCompileProgram() 调用中是持久的。在内存受限的环境中，可能需要将 PCH 堆的大小调整为小于默认值。此示例展示了如何调整 PCH 堆的大小。首先，堆大小设置为较低的值 (8 KB) 6

NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(8*1024));

然后，使用 -pch 调用 nvrtcCompileProgram() 。PCH 创建预计会失败，nvrtcGetPCHCreateStatus() 返回 NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED。然后可以通过调用 nvrtcGetPCHHeapSizeRequired() 来检索所需的 PCH 堆大小，并且可以通过调用 nvrtcSetPCHHeapSize() 来调整 PCH 堆的大小

nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
std::cout << "\nnvrtcGetPCHCreateStatus returned : " << nvrtcGetErrorString(err) << std::endl;

if (err == NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED) {
    size_t size;
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSize(&size));
    ...
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSizeRequired(prog, &size));
    ..
    NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(size));
}

现在，使用同一文件请求 PCH 创建的下一个 NVRTC 编译预计将成功。这是程序的输出

compiling first program
auto_pch_common.h(7): warning #639-D: insufficient preallocated memory for generation of precompiled header file (4481024 bytes required)
  __device__  double doit() { return get<double>(0.5); }
                                                       ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"


nvrtcGetPCHCreateStatus returned : NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED
nvrtcGetPCHHeapSize() before: 8192
nvrtcGetPCHHeapSizeRequired() reports: 4481024
nvrtcGetPCHHeapSize() after: 4481024

 compiling second program (expect to use PCH)
"default_program": creating precompiled header file "default_program.pch"

nvrtcGetPCHCreateStatus returned : NVRTC_SUCCESS

6: 但不是 0，因为这将禁用 PCH 操作。

18.1. 代码 (auto_pch_common.h)

__device__ double qqq = 10;

template <typename T>
__device__ double get(T in) { return sin(in) + qqq; }

__device__  double doit() { return get<double>(0.5); }

18.2. 代码 (pch_resize.cpp)

#include <iostream>
#include <nvrtc.h>

#define NVRTC_SAFE_CALL(x)                                        \
  do {                                                            \
    nvrtcResult result = x;                                       \
    if (result != NVRTC_SUCCESS) {                                \
      std::cerr << "\nerror: " #x " failed with error "           \
                << nvrtcGetErrorString(result) << '\n';           \
      exit(1);                                                    \
    }                                                             \
  } while(0)


const char *docompile(const char *progstr)
{
  nvrtcProgram prog;

  NVRTC_SAFE_CALL(
    nvrtcCreateProgram(&prog,
      progstr,         // buffer
      "",              // name
      0,               // numHeaders
      NULL,            // headers
      NULL));          // includeNames

  const char *opts[] = { "-pch"  /* automatic PCH */
                       };

  nvrtcResult compileResult = nvrtcCompileProgram(prog,  // prog
    sizeof(opts) / sizeof(opts[0]),     // numOptions
    opts); // options


  // Obtain compilation log from the program.
  size_t logSize;
  NVRTC_SAFE_CALL(nvrtcGetProgramLogSize(prog, &logSize));
  char* log = new char[logSize];
  NVRTC_SAFE_CALL(nvrtcGetProgramLog(prog, log));
  std::cout << log;
  delete[] log;
  if (compileResult != NVRTC_SUCCESS) {
    exit(1);
  }
  nvrtcResult err = nvrtcGetPCHCreateStatus(prog);
  std::cout << "\nnvrtcGetPCHCreateStatus returned : " << nvrtcGetErrorString(err) << std::endl;

  if (err == NVRTC_ERROR_PCH_CREATE_HEAP_EXHAUSTED) {
    size_t size;
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSize(&size));
    std::cout << "nvrtcGetPCHHeapSize() before: " << size << std::endl;
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSizeRequired(prog, &size));
    std::cout << "nvrtcGetPCHHeapSizeRequired() reports: " << size << std::endl;
    NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(size));
    NVRTC_SAFE_CALL(nvrtcGetPCHHeapSize(&size));
    std::cout << "nvrtcGetPCHHeapSize() after: " << size << std::endl;
  }

  // Obtain PTX from the program.
  size_t ptxSize;
  NVRTC_SAFE_CALL(nvrtcGetPTXSize(prog, &ptxSize));
  char* ptx = new char[ptxSize];
  NVRTC_SAFE_CALL(nvrtcGetPTX(prog, ptx));
  // Destroy the program.
  NVRTC_SAFE_CALL(nvrtcDestroyProgram(&prog));

  return ptx;
}

int main()
{
const char *first  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void foo(double *ptr) {\n"
                     "*ptr = doit();\n}\n";

const char *second  = "#include \"auto_pch_common.h\" \n"
                     "__global__ void other(double *a, double *b) {\n"
                     "*a = *b + doit();\n}\n";

//set NVRTC PCH heap to a low initial value (8 KB) (note: don't use 0)
NVRTC_SAFE_CALL(nvrtcSetPCHHeapSize(8*1024));

std::cout << "\n compiling first program\n";
const char *ptr1 = docompile(first);

std::cout << "\n compiling second program (expect to use PCH)\n";
const char *ptr2 = docompile(second);

delete [] ptr1;
delete [] ptr2;

}

18.3. PCH 堆调整大小构建说明

假设环境变量 CUDA_PATH 指向 CUDA 工具包安装目录，请按如下方式构建此示例

使用 NVRTC 共享库

Windows

cl.exe pch_resize.cpp /Fepch_resize ^
   /I "%CUDA_PATH%"\include ^
   "%CUDA_PATH%"\lib\x64\nvrtc.lib "%CUDA_PATH%"\lib\x64\cuda.lib

Linux

g++ pch_resize.cpp -o pch_resize \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc -lcuda \
   -Wl,-rpath,$CUDA_PATH/lib64

使用 NVRTC 静态库

Windows

cl.exe pch_resize.cpp /Fepch_resize  ^
    /I "%CUDA_PATH%"\include ^
    "%CUDA_PATH%"\lib\x64\nvrtc_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvrtc-builtins_static.lib ^
    "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib ^
    "%CUDA_PATH%"\lib\x64\cuda.lib user32.lib Ws2_32.lib

Linux

g++ pch_resize.cpp -o pch_resize \
   -I $CUDA_PATH/include \
   -L $CUDA_PATH/lib64 \
   -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda \
   -lpthread

18.4. 声明

18.4.1. 声明

本文档仅供参考，不得视为对产品的特定功能、条件或质量的保证。NVIDIA Corporation（“NVIDIA”）对本文档中包含的信息的准确性或完整性不作任何明示或暗示的陈述或保证，并且对本文档中包含的任何错误不承担任何责任。NVIDIA 对因使用此类信息而造成的后果或使用，或因其使用而可能导致的侵犯第三方专利或其他权利的行为不承担任何责任。本文档不构成对开发、发布或交付任何材料（如下定义）、代码或功能的承诺。

NVIDIA 保留随时对此文档进行更正、修改、增强、改进和任何其他更改的权利，恕不另行通知。

客户应在下订单前获取最新的相关信息，并应验证此类信息是否为最新且完整。

NVIDIA 产品根据订单确认时提供的 NVIDIA 标准销售条款和条件进行销售，除非 NVIDIA 和客户的授权代表签署的单独销售协议（“销售条款”）另有约定。NVIDIA 在此明确反对将任何客户通用条款和条件应用于购买本文档中引用的 NVIDIA 产品。本文档未直接或间接地形成任何合同义务。

NVIDIA 产品并非设计、授权或保证适用于医疗、军事、航空、航天或生命支持设备，也不适用于 NVIDIA 产品发生故障或故障可能合理预期会导致人身伤害、死亡或财产或环境损害的应用。NVIDIA 对在此类设备或应用中包含和/或使用 NVIDIA 产品不承担任何责任，因此，此类包含和/或使用风险由客户自行承担。

NVIDIA 不保证或声明基于本文档的产品将适用于任何特定用途。NVIDIA 不一定执行每个产品所有参数的测试。客户全权负责评估和确定本文档中包含的任何信息的适用性，确保产品适合客户计划的应用，并执行应用所需的测试，以避免应用或产品出现缺陷。客户产品设计中的缺陷可能会影响 NVIDIA 产品的质量和可靠性，并可能导致超出本文档中包含的附加或不同条件和/或要求。NVIDIA 对可能基于或归因于以下原因的任何缺陷、损坏、成本或问题不承担任何责任：(i) 以任何违反本文档的方式使用 NVIDIA 产品或 (ii) 客户产品设计。

在本文档下，未授予任何 NVIDIA 专利权、版权或其他 NVIDIA 知识产权的明示或暗示许可。NVIDIA 发布的有关第三方产品或服务的信息不构成 NVIDIA 授予使用此类产品或服务的许可，也不构成 NVIDIA 对其的保证或认可。使用此类信息可能需要从第三方获得专利或许可或第三方的其他知识产权，或者从 NVIDIA 获得 NVIDIA 的专利或其他知识产权的许可。

只有在事先获得 NVIDIA 书面批准的情况下，才允许复制本文档中的信息，复制时不得进行修改，并且必须完全遵守所有适用的出口法律和法规，并附带所有相关的条件、限制和声明。

本文档和所有 NVIDIA 设计规范、参考板、文件、图纸、诊断程序、列表和其他文档（统称为“材料”）均“按原样”提供。NVIDIA 不对材料作出任何明示、暗示、法定或其他方面的保证，并明确否认所有关于不侵权、适销性和针对特定用途适用性的默示保证。在法律未禁止的范围内，在任何情况下，NVIDIA 均不对因使用本文档而引起的任何损害（包括但不限于任何直接、间接、特殊、偶然、惩罚性或后果性损害，无论如何引起，也无论责任理论如何）承担责任，即使 NVIDIA 已被告知可能发生此类损害。尽管客户可能因任何原因遭受任何损害，但 NVIDIA 对本文所述产品的客户的累计总责任应根据产品的销售条款进行限制。

18.4.2. OpenCL

OpenCL 是 Apple Inc. 的商标，已授权 Khronos Group Inc. 使用。

18.4.3. 商标

NVIDIA 和 NVIDIA 徽标是 NVIDIA Corporation 在美国和其他国家/地区的商标或注册商标。其他公司和产品名称可能是与其相关的各自公司的商标。

1: 较新 NVRTC 共享库中编译器优化器启发式的更改也可能导致生成的代码的性能波动。
2: 这些 API 函数会修改关联的 nvrtcProgram 的状态。