Getting Started#

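nvmath-python brings the power of the NVIDIA math libraries to the Python ecosystem. The package aims to provide intuitive, Pythonic APIs that give users full access to all the features offered by NVIDIA's libraries in a variety of execution spaces. nvmath-python works seamlessly with existing Python array/tensor frameworks and focuses on providing functionality that is missing from those frameworks.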

To learn more about the design of nvmath-python, visit our Overview.

Installation#

To quickly install nvmath-python, run the following command:

pip install nvmath-python[cu12,dx]

For more details, visit the Installation Guide.

Examples#

The examples below briefly demonstrate the basic capabilities of nvmath-python. More examples are available in our GitHub repository.

Matrix multiplication#

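The nvmath-python API gives access to all parameters of the underlying NVIDIA cuBLASLt library. Some of these parameters are not available in other wrappers of the NVIDIA C-API libraries.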

>>> import cupy as cp
>>> import nvmath
>>>
>>> m, n, k = 123, 456, 789
>>> a = cp.random.rand(m, k).astype(cp.float32)
>>> b = cp.random.rand(k, n).astype(cp.float32)
>>>
>>> # Use the stateful nvmath.linalg.advanced.Matmul object in order to separate planning
>>> # from actual execution of matrix multiplication. nvmath-python allows you to fine-tune
>>> # your operations by, for example, selecting a mixed-precision compute type.
>>> options = {
...     "compute_type": nvmath.linalg.advanced.MatmulComputeType.COMPUTE_32F_FAST_16F
... }
>>> with nvmath.linalg.advanced.Matmul(a, b, options=options) as mm:
...     algorithms = mm.plan()
...     result = mm.execute()
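
To give a feel for what a mixed-precision compute type trades off, here is a minimal CPU-only sketch; it is not nvmath-python code. It assumes that `COMPUTE_32F_FAST_16F` lets the library down-convert float32 inputs to 16-bit floats while still producing 32-bit results. That reading, the helper names `to_half` and `dot`, and the use of Python's double-precision floats for accumulation are all assumptions made for illustration.

```python
import random
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot(xs, ys):
    """Plain dot product, standing in for one output element of a matmul."""
    return sum(x * y for x, y in zip(xs, ys))


random.seed(0)
k = 789  # same inner dimension as in the example above

# One row of `a` and one column of `b`, with values in [0, 1).
a_row = [random.random() for _ in range(k)]
b_col = [random.random() for _ in range(k)]

# Reference result using the inputs as given.
exact = dot(a_row, b_col)

# Same computation after rounding every input to half precision
# (hypothetical stand-in for the down-conversion step).
reduced = dot([to_half(x) for x in a_row], [to_half(y) for y in b_col])

rel_err = abs(exact - reduced) / exact
print(f"relative error from 16-bit inputs: {rel_err:.2e}")
```

Under these assumptions, the relative error of a single output element stays near the half-precision rounding level, which is the accuracy cost paid for the faster compute path.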

To learn more about matrix multiplication in nvmath-python, see Matmul.

FFT with callbacks#

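User-defined functions can be compiled to the LTO-IR format and provided as an epilog or prolog to the FFT operation, enabling link-time optimization and fusion.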

This example shows how to perform a convolution by providing a Python callback function as the prolog to the IFFT operation.

>>> import cupy as cp
>>> import nvmath
>>>
>>> # Create the data for the batched 1-D FFT.
>>> B, N = 256, 1024
>>> a = cp.random.rand(B, N, dtype=cp.float64) + 1j * cp.random.rand(B, N, dtype=cp.float64)
>>>
>>> # Create the data to use as filter.
>>> filter_data = cp.sin(a)
>>>
>>> # Define the prolog function for the inverse FFT.
>>> # A convolution corresponds to pointwise multiplication in the frequency domain.
>>> def convolve(data_in, offset, filter_data, unused):
...     # Note we are accessing `data_in` and `filter_data` with a single `offset` integer,
...     # even though the input and `filter_data` are 2D tensors (batches of samples).
...     # Care must be taken to ensure that both arrays accessed here have the same memory
...     # layout.
...     return data_in[offset] * filter_data[offset] / N
>>>
>>> # Compile the prolog to LTO-IR.
>>> with cp.cuda.Device():
...     prolog = nvmath.fft.compile_prolog(convolve, "complex128", "complex128")
>>>
>>> # Perform the forward FFT, followed by the inverse FFT, applying the filter as a prolog.
>>> r = nvmath.fft.fft(a, axes=[-1])
>>> r = nvmath.fft.ifft(r, axes=[-1], prolog={
...         "ltoir": prolog,
...         "data": filter_data.data.ptr
...     })
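
The identity the prolog relies on can be checked on the CPU with a minimal sketch, independent of nvmath-python. It uses a naive, unnormalized DFT and assumes the inverse transform is unnormalized too, which is what the `/ N` in the callback above suggests. The helper names `dft` and `circular_convolve` are hypothetical. Unlike the example above, where `filter_data` is applied directly in the frequency domain, the filter here starts in the time domain and is transformed first.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) unnormalized DFT; inverse=True only flips the exponent sign."""
    n = len(x)
    sign = 1 if inverse else -1
    return [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def circular_convolve(x, h):
    """Direct circular convolution, used as the reference result."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.25, 0.0, 0.0]
n = len(x)

# Forward transforms of the signal and the filter.
X = dft(x)
H = dft(h)

# Pointwise multiply in the frequency domain and fold in the 1/n scaling,
# mirroring what the `convolve` prolog does before the inverse transform.
product = [X[k] * H[k] / n for k in range(n)]
y = dft(product, inverse=True)

expected = circular_convolve(x, h)
max_err = max(abs(a - b) for a, b in zip(y, expected))
print(f"max deviation from direct convolution: {max_err:.2e}")
```

Fusing the multiplication and scaling into the inverse transform, as the prolog does, avoids a separate pass over the data between the two FFTs.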

For more details, see the FFT callbacks documentation.

Device APIs#

The device APIs of nvmath-python let you access the functionality of the cuFFTDx, cuBLASDx, and cuRAND libraries from within your kernels.

This example shows how to use cuRAND to sample single-precision values from a normal distribution.

First, create an array of bit-generator states (one per thread). This example uses the Philox4_32_10 generator.

>>> from numba import cuda
>>> from nvmath.device import random
>>> compiled_apis = random.Compile()
>>>
>>> threads, blocks = 64, 64
>>> nthreads = blocks * threads
>>>
>>> states = random.StatesPhilox4_32_10(nthreads)
>>>
>>> # Next, define and launch a setup kernel, which will initialize the states using
>>> # nvmath.device.random.init function.
>>> @cuda.jit(link=compiled_apis.files, extensions=compiled_apis.extension)
... def setup(states):
...     i = cuda.grid(1)
...     random.init(1234, i, 0, states[i])
>>>
>>> setup[blocks, threads](states)
>>>
>>> # With your states array ready, you can use samplers such as
>>> # nvmath.device.random.normal2 to sample random values in your kernels.
>>> @cuda.jit(link=compiled_apis.files, extensions=compiled_apis.extension)
... def kernel(states):
...     i = cuda.grid(1)
...     random_values = random.normal2(states[i])
>>>
>>> kernel[blocks, threads](states)
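
A sampler that returns two values per call fits naturally with the Box-Muller transform, which turns two uniform samples into two independent normal samples. The following CPU-only sketch illustrates that idea; it is not the cuRAND implementation, and whether `nvmath.device.random.normal2` uses exactly this method is an assumption. The function name `normal_pair` is hypothetical, and Python's `random.Random` stands in for the per-thread generator state.

```python
import math
import random


def normal_pair(rng):
    """Return two independent standard-normal samples via Box-Muller."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


# One generator with a fixed seed plays the role of a single thread's state.
rng = random.Random(1234)

samples = []
for _ in range(10_000):
    samples.extend(normal_pair(rng))

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, variance={variance:.3f}")
```

With 20,000 samples, the mean and variance should land close to 0 and 1, which is a quick sanity check on any normal sampler.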