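Getting Started
nvmath-python brings the power of the NVIDIA math libraries to the Python ecosystem. The package aims to provide intuitive, Pythonic APIs that give users full access to all the features offered by NVIDIA's libraries in a variety of execution spaces. nvmath-python works seamlessly with existing Python array/tensor frameworks and focuses on providing functionality that is missing from those frameworks.
To learn more about the design of nvmath-python, visit our Overview.
Installation
To install nvmath-python quickly, run the following command: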
pip install nvmath-python[cu12,dx]
For more details, see the Installation Guide.
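After installing, it can be handy to confirm that the package is importable before running the examples. The following is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical and is not part of nvmath-python.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    # find_spec locates the module without importing it.
    return importlib.util.find_spec(module_name) is not None


# "nvmath" is the import name used throughout the examples on this page.
print("nvmath importable:", is_installed("nvmath"))
```

This only checks that the module can be located; it does not verify that a GPU, driver, or the CUDA libraries are usable.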
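Examples
The examples below briefly demonstrate the basic capabilities of nvmath-python. You can find more examples in our GitHub repository.
Matrix multiplication
The nvmath-python API gives you access to all parameters of the underlying NVIDIA cuBLASLt library. Some of these parameters are unavailable in other wrappers of NVIDIA's C-API libraries.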
>>> import cupy as cp
>>> import nvmath
>>>
>>> m, n, k = 123, 456, 789
>>> a = cp.random.rand(m, k).astype(cp.float32)
>>> b = cp.random.rand(k, n).astype(cp.float32)
>>>
>>> # Use the stateful nvmath.linalg.advanced.Matmul object in order to separate planning
>>> # from actual execution of matrix multiplication. nvmath-python allows you to fine-tune
>>> # your operations by, for example, selecting a mixed-precision compute type.
>>> options = {
...     "compute_type": nvmath.linalg.advanced.MatmulComputeType.COMPUTE_32F_FAST_16F
... }
>>> with nvmath.linalg.advanced.Matmul(a, b, options=options) as mm:
...     algorithms = mm.plan()
...     result = mm.execute()
To learn more about matrix multiplication in nvmath-python, see Matmul.
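To build some intuition for what a mixed-precision compute type trades away, here is a rough host-side emulation in pure Python. It assumes the behavior the cuBLAS documentation describes for `COMPUTE_32F_FAST_16F`, namely that 32-bit inputs may be down-converted to 16-bit while results are kept in 32-bit. The function names are hypothetical, and the sketch is not bit-exact with what the GPU does.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_fp16_inputs_fp32_accumulate(xs, ys):
    """Dot product with inputs rounded to fp16 and a running sum kept in fp32."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_single(acc + to_half(x) * to_half(y))
    return acc


# Small, hypothetical input vectors (one row of A and one column of B).
xs = [0.1 * (i + 1) for i in range(8)]
ys = [1.0 / (i + 3) for i in range(8)]

exact = sum(x * y for x, y in zip(xs, ys))
mixed = dot_fp16_inputs_fp32_accumulate(xs, ys)
rel_err = abs(mixed - exact) / abs(exact)
print(f"relative error from fp16 input rounding: {rel_err:.2e}")
```

The error is dominated by rounding the inputs to half precision (about three decimal digits), which is worth keeping in mind when comparing `result` against a full-precision reference.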
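FFT with callbacks
User-defined functions can be compiled to the LTO-IR format and provided as an epilog or prolog to the FFT operation, allowing for link-time optimization and fusion.
This example shows how to perform a convolution by providing a Python callback function as a prolog to the IFFT operation.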
>>> import cupy as cp
>>> import nvmath
>>>
>>> # Create the data for the batched 1-D FFT.
>>> B, N = 256, 1024
>>> a = cp.random.rand(B, N, dtype=cp.float64) + 1j * cp.random.rand(B, N, dtype=cp.float64)
>>>
>>> # Create the data to use as filter.
>>> filter_data = cp.sin(a)
>>>
>>> # Define the prolog function for the inverse FFT.
>>> # A convolution corresponds to pointwise multiplication in the frequency domain.
>>> def convolve(data_in, offset, filter_data, unused):
...     # Note we are accessing `data_in` and `filter_data` with a single `offset` integer,
...     # even though the input and `filter_data` are 2D tensors (batches of samples).
...     # Care must be taken to ensure that both arrays accessed here have the same memory
...     # layout.
...     return data_in[offset] * filter_data[offset] / N
>>>
>>> # Compile the prolog to LTO-IR.
>>> with cp.cuda.Device():
...     prolog = nvmath.fft.compile_prolog(convolve, "complex128", "complex128")
>>>
>>> # Perform the forward FFT, followed by the inverse FFT, applying the filter as a prolog.
>>> r = nvmath.fft.fft(a, axes=[-1])
>>> r = nvmath.fft.ifft(r, axes=[-1], prolog={
...     "ltoir": prolog,
...     "data": filter_data.data.ptr
... })
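The comment in the prolog about indexing 2D data with a single `offset` can be made concrete with a small host-side sketch. It assumes both arrays are contiguous and row-major (C order), so that element `(b, n)` of a `(B, N)` array lives at flat position `b * N + n`. The helper `flat_offset` is hypothetical and only illustrates the arithmetic.

```python
def flat_offset(batch_index: int, sample_index: int, n_samples: int) -> int:
    """Flat element offset of (batch_index, sample_index) in a row-major (B, N) array."""
    return batch_index * n_samples + sample_index


B, N = 4, 8  # small stand-ins for the 256 x 1024 batch used in the example

# A row-major 2D array stored as one flat list, similar to what a callback indexes.
flat = [float(b * 100 + n) for b in range(B) for n in range(N)]

b, n = 2, 5
offset = flat_offset(b, n, N)
print(offset, flat[offset])
```

If the two arrays had different layouts (for example, one transposed or strided), the same `offset` would point at different logical elements, which is why the layouts must match.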
For more details, see the FFT callbacks documentation.
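The idea behind the example, that pointwise multiplication in the frequency domain corresponds to a convolution, can be checked on the host with a naive DFT. The sketch below uses only the standard library and assumes unnormalized forward and inverse transforms, which is why the product is scaled by `1 / N` as in the prolog. The functions `dft` and `circular_convolution` are illustrative helpers, not nvmath-python APIs.

```python
import cmath


def dft(x, inverse=False):
    """Naive unnormalized DFT (or inverse DFT) of a sequence, O(N^2)."""
    n = len(x)
    sign = 1 if inverse else -1
    return [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def circular_convolution(x, h):
    """Direct circular convolution: y[k] = sum_m x[m] * h[(k - m) mod N]."""
    n = len(x)
    return [sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)]


N = 8
signal = [complex(i + 1, -i) for i in range(N)]
kernel = [complex(0.5 ** i, 0.0) for i in range(N)]

# Frequency-domain filter, playing the role of `filter_data` in the example.
filter_freq = dft(kernel)

# Forward transform, then multiply pointwise and scale by 1/N before the
# unnormalized inverse transform (the per-element work done by the prolog).
spectrum = dft(signal)
filtered = [spectrum[k] * filter_freq[k] / N for k in range(N)]
via_fft = dft(filtered, inverse=True)

direct = circular_convolution(signal, kernel)
max_diff = max(abs(a - b) for a, b in zip(via_fft, direct))
print(f"max difference: {max_diff:.2e}")
```

Fusing the multiplication into the inverse transform as a prolog avoids a separate pass over the data, which is the benefit the callback mechanism is aiming for.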
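Device APIs
The device APIs of nvmath-python let you access the functionality of the cuFFTDx, cuBLASDx, and cuRAND libraries from within your kernels.
This example shows how to use cuRAND to sample single-precision values from a normal distribution.
First, create the array of bit-generator states (one per thread). In this example, we will use the Philox4_32_10 generator.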
>>> from numba import cuda
>>> from nvmath.device import random
>>> compiled_apis = random.Compile()
>>>
>>> threads, blocks = 64, 64
>>> nthreads = blocks * threads
>>>
>>> states = random.StatesPhilox4_32_10(nthreads)
>>>
>>> # Next, define and launch a setup kernel, which will initialize the states using
>>> # nvmath.device.random.init function.
>>> @cuda.jit(link=compiled_apis.files, extensions=compiled_apis.extension)
... def setup(states):
...     i = cuda.grid(1)
...     random.init(1234, i, 0, states[i])
>>>
>>> setup[blocks, threads](states)
>>>
>>> # With your states array ready, you can use samplers such as
>>> # nvmath.device.random.normal2 to sample random values in your kernels.
>>> @cuda.jit(link=compiled_apis.files, extensions=compiled_apis.extension)
... def kernel(states):
...     i = cuda.grid(1)
...     random_values = random.normal2(states[i])
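Both kernels rely on `cuda.grid(1)` giving each thread a unique index into `states`. The following host-side sketch reproduces that index arithmetic in plain Python to show why a launch of `blocks` by `threads` covers the `nthreads` states exactly once. The helper `global_thread_index` is hypothetical and stands in for what Numba computes on the device.

```python
def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """1D global thread index, the value cuda.grid(1) yields inside a kernel."""
    return block_idx * block_dim + thread_idx


threads, blocks = 64, 64
nthreads = blocks * threads

indices = [
    global_thread_index(b, threads, t) for b in range(blocks) for t in range(threads)
]

# Every state slot 0..nthreads-1 is claimed by exactly one thread.
covers_all_states = sorted(indices) == list(range(nthreads))
print(covers_all_states)
```

If the launch configuration ever produced more threads than states, the kernels would need a bounds check such as `if i < states_count` before touching `states[i]`.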
To learn more about this and other device APIs, visit the documentation for nvmath.
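The name `normal2` and the plural `random_values` suggest that the sampler yields normal values in pairs. One classic way to produce two normal samples at a time is the Box-Muller transform, which the cuRAND documentation mentions for its paired normal samplers. The sketch below illustrates that transform on the host with Python's own generator; it is an assumption-laden illustration, not the implementation used by `nvmath.device.random`.

```python
import math
import random as py_random


def box_muller(u1: float, u2: float) -> tuple:
    """Map uniform samples u1 in (0, 1] and u2 in [0, 1) to two standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


rng = py_random.Random(1234)  # host-side generator, unrelated to Philox
samples = []
for _ in range(20000):
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
    z0, z1 = box_muller(u1, rng.random())
    samples.extend((z0, z1))

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} variance={variance:.3f}")
```

The sample mean and variance should land close to 0 and 1, which is a quick sanity check you could also apply to values written out by a device kernel.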