Low-level C Quick Start Guide

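Some applications need to compress or decompress many small inputs, so we provide an additional API to do this efficiently. These API calls combine all compression/decompression work into a single execution, which greatly improves performance compared with running each input individually. The API relies on the user to split the data into chunks and to manage metadata such as the compressed and uncompressed chunk sizes. When splitting data, chunks should be of roughly equal size for best performance, both to balance load well and to expose enough parallelism. So even when there are multiple inputs to compress, it may still be best to break each input into smaller chunks.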
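The splitting step described above can be sketched in plain C. This is a minimal host-side illustration, not nvCOMP code: the helper names `chunk_count` and `split_into_chunks` are hypothetical, and the chunk size is left to the caller. In a real program the chunk pointers would refer to device-accessible memory, and both arrays would be placed in device-accessible memory before being passed to the batched API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover total_bytes with chunks of chunk_bytes. */
static size_t chunk_count(size_t total_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

/*
 * Fill chunk_ptrs[i] and chunk_sizes[i] for every chunk of one input.
 * All chunks hold chunk_bytes bytes except possibly the last one, which
 * keeps the chunks as equal in size as a fixed chunk size allows.
 * Returns the number of chunks written (never more than max_chunks).
 */
static size_t split_into_chunks(const char* input, size_t total_bytes,
                                size_t chunk_bytes,
                                const void** chunk_ptrs, size_t* chunk_sizes,
                                size_t max_chunks)
{
    size_t n = chunk_count(total_bytes, chunk_bytes);
    assert(n <= max_chunks);
    for (size_t i = 0; i < n; ++i) {
        size_t offset = i * chunk_bytes;
        size_t remaining = total_bytes - offset;
        chunk_ptrs[i] = input + offset;
        chunk_sizes[i] = remaining < chunk_bytes ? remaining : chunk_bytes;
    }
    return n;
}
```

For example, a 100000-byte input split with a chunk size of 65536 bytes yields two chunks of 65536 and 34464 bytes.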

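The low-level batched C API provides a set of functions for batched decompression and compression.

In the API descriptions below, replace <compression_method> with the desired compression algorithm, which can be one of the following: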

ans
bitcomp
cascaded
deflate
gdeflate
gzip (decompression only)
lz4
snappy
zstd

For example, for LZ4, nvcompBatched<compression_method>CompressAsync becomes nvcompBatchedLZ4CompressAsync, and nvcompBatched<compression_method>DecompressAsync becomes nvcompBatchedLZ4DecompressAsync.
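The naming rule can be mimicked with the C preprocessor. The sketch below is only an illustration of the substitution: the macros `BATCHED_NAME`, `STRINGIFY_` and `STRINGIFY` and the helper `lz4_compress_name` are hypothetical and not part of nvCOMP. The name is turned into a string so it can be inspected without linking against the library.

```c
#include <assert.h>
#include <string.h>

/* Paste a method name and an operation into the batched naming pattern. */
#define BATCHED_NAME(method, op) nvcompBatched##method##op

/* Two-step stringification so the pasted name is expanded before quoting. */
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)

/* The LZ4 compression entry point name, as text. */
static const char* lz4_compress_name(void)
{
    return STRINGIFY(BATCHED_NAME(LZ4, CompressAsync));
}
```

Note that the method is written with the capitalization used in the function names (LZ4 here), not the lowercase spelling used in the list above.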

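Some compressors have alignment requirements (of at most 8 bytes) on the user-provided input, output, and/or temporary buffers. See the documentation in the corresponding header in include/ for details on the alignment requirements of any particular API.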
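When buffers are carved out of a larger allocation, a small helper can keep offsets and pointers on the required boundary. The following is a minimal sketch with hypothetical helper names (`align_up`, `is_aligned`); it assumes the alignment is a power of two, which holds for the values of up to 8 bytes mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round value up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Return 1 if ptr is aligned to alignment (a power of two), else 0. */
static int is_aligned(const void* ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

A check such as `is_aligned(ptr, 8)` before launching a batched call is a cheap way to catch misaligned buffers early, since violating the alignment conditions may result in undefined behaviour.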

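Compression API

Batched compression requires a temporary workspace allocated in device memory. The size of this workspace is computed with the following function: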

/**
* @brief Get the amount of temporary memory required on the GPU for compression.
*
* @param[in] num_chunks The number of chunks of memory in the batch.
* @param[in] max_uncompressed_chunk_bytes The maximum size of a chunk in the
* batch.
* @param[in] format_opts Compression options.
* @param[out] temp_bytes The amount of GPU memory that will be temporarily
* required during compression.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>CompressGetTempSize(
    size_t num_chunks,
    size_t max_uncompressed_chunk_bytes,
    nvcompBatched<compression_method>Opts_t format_opts,
    size_t * temp_bytes);

Compression is then performed with:

/**
* @brief Perform batched asynchronous compression.
*
* @note Violating any of the conditions listed in the parameter descriptions
* below may result in undefined behaviour.
*
* @param[in] device_uncompressed_chunk_ptrs Array with size \p num_chunks of pointers
* to the uncompressed data chunks. Both the pointers and the uncompressed data
* should reside in device-accessible memory.
* Each chunk must be aligned to the value in the `input` member of the
* `nvcompAlignmentRequirements_t` object output by
* `nvcompBatched<compression_method>CompressGetRequiredAlignments`
* when called with the same \p format_opts.
* @param[in] device_uncompressed_chunk_bytes Array with size \p num_chunks of
* sizes of the uncompressed chunks in bytes.
* The sizes should reside in device-accessible memory.
* @param[in] max_uncompressed_chunk_bytes The size of the largest uncompressed chunk.
* @param[in] num_chunks Number of chunks of data to compress.
* @param[in] device_temp_ptr The temporary GPU workspace, could be NULL in case
* temporary memory is not needed.
* Must be aligned to the value in the `temp` member of the
* `nvcompAlignmentRequirements_t` object output by
* `nvcompBatched<compression_method>CompressGetRequiredAlignments` when called with the same
* \p format_opts.
* @param[in] temp_bytes The size of the temporary GPU memory pointed to by
* `device_temp_ptr`.
* @param[out] device_compressed_chunk_ptrs Array with size \p num_chunks of pointers
* to the output compressed buffers. Both the pointers and the compressed
* buffers should reside in device-accessible memory. Each compressed buffer
* should be preallocated with the size given by
* `nvcompBatched<compression_method>CompressGetMaxOutputChunkSize`.
* Each compressed buffer must be aligned to the value in the `output` member of the
* `nvcompAlignmentRequirements_t` object output by
* `nvcompBatched<compression_method>CompressGetRequiredAlignments`
* when called with the same \p format_opts.
* @param[out] device_compressed_chunk_bytes Array with size \p num_chunks,
* to be filled with the compressed sizes of each chunk.
* The buffer should be preallocated in device-accessible memory.
* @param[in] format_opts Compression options.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successfully launched, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>CompressAsync(
    const void* const* device_uncompressed_chunk_ptrs,
    const size_t* device_uncompressed_chunk_bytes,
    size_t max_uncompressed_chunk_bytes,
    size_t num_chunks,
    void* device_temp_ptr,
    size_t temp_bytes,
    void* const* device_compressed_chunk_ptrs,
    size_t* device_compressed_chunk_bytes,
    nvcompBatched<compression_method>Opts_t format_opts,
    cudaStream_t stream);
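The output side of the call needs one preallocated, suitably aligned buffer per chunk. One common way to arrange this is to carve a single allocation into equally spaced buffers. The sketch below shows only the pointer arithmetic, on the host, with hypothetical helper names (`round_up_pow2`, `layout_output_buffers`). It assumes that `max_out_bytes` is the per-chunk capacity obtained from nvcompBatched<compression_method>CompressGetMaxOutputChunkSize, that `alignment` is the required output alignment, and that the base of the allocation already satisfies that alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to a multiple of alignment (a power of two). */
static size_t round_up_pow2(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/*
 * Carve one allocation (slab) into num_chunks output buffers.
 * Each buffer has room for max_out_bytes bytes, and consecutive buffers
 * are spaced by a stride that is a multiple of alignment, so every
 * buffer is aligned as long as the slab base is.
 * Writes out_ptrs[i] and returns the number of bytes the slab must hold.
 */
static size_t layout_output_buffers(char* slab, size_t num_chunks,
                                    size_t max_out_bytes, size_t alignment,
                                    void** out_ptrs)
{
    size_t stride = round_up_pow2(max_out_bytes, alignment);
    for (size_t i = 0; i < num_chunks; ++i) {
        out_ptrs[i] = slab + i * stride;
    }
    return num_chunks * stride;
}
```

In a real program the slab would be device memory, and the resulting pointer array would itself have to reside in device-accessible memory before being passed as device_compressed_chunk_ptrs.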

Decompression API

Decompression also requires a temporary workspace. Its size is computed with:

/**
* @brief Get the amount of temporary memory required on the GPU for decompression.
*
* @param[in] num_chunks Number of chunks of data to be decompressed.
* @param[in] max_uncompressed_chunk_bytes The size of the largest chunk in bytes
* when uncompressed.
* @param[out] temp_bytes The amount of GPU memory that will be temporarily required
* during decompression.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>DecompressGetTempSize(
    size_t num_chunks,
    size_t max_uncompressed_chunk_bytes,
    size_t * temp_bytes);

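During decompression, device memory buffers large enough to hold the decompressed results must be provided. The following three workflows are supported:

1) The uncompressed size of each buffer is known exactly (for example, Apache Parquet, see apache/parquet-format).
2) Only the maximum uncompressed size across all buffers is known (for example, Apache ORC).
3) No information about the uncompressed sizes is provided (for example, Apache Avro).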
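The three workflows differ only in how the output buffer capacities are chosen. A minimal host-side sketch of that decision follows. The enum `size_knowledge_t` and the function `plan_output_capacities` are hypothetical names introduced for illustration and are not part of nvCOMP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three workflows; not part of nvCOMP. */
typedef enum {
    SIZES_EXACT,    /* every uncompressed size is known */
    SIZES_MAX_ONLY, /* only the largest uncompressed size is known */
    SIZES_UNKNOWN   /* nothing is known about the uncompressed sizes */
} size_knowledge_t;

/*
 * Fill buffer_bytes[i] with the capacity to allocate for chunk i.
 * Returns 1 if the capacities are ready to use. Returns 0 if the sizes
 * must first be computed from the compressed data (the third workflow),
 * in which case buffer_bytes is left untouched.
 */
static int plan_output_capacities(size_knowledge_t knowledge,
                                  const size_t* exact_bytes,
                                  size_t max_bytes,
                                  size_t num_chunks,
                                  size_t* buffer_bytes)
{
    assert(buffer_bytes != NULL);
    switch (knowledge) {
    case SIZES_EXACT:
        for (size_t i = 0; i < num_chunks; ++i) {
            buffer_bytes[i] = exact_bytes[i];
        }
        return 1;
    case SIZES_MAX_ONLY:
        for (size_t i = 0; i < num_chunks; ++i) {
            buffer_bytes[i] = max_bytes;
        }
        return 1;
    case SIZES_UNKNOWN:
    default:
        return 0;
    }
}
```

In the second workflow every buffer is sized for the worst case, which trades extra memory for not needing a separate size query.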

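For case 3), nvCOMP provides an API that preprocesses the compressed file to determine the appropriate sizes for the decompression output buffers. The API is as follows: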

/**
* @brief Asynchronously compute the number of bytes of uncompressed data for
* each compressed chunk.
*
* This is needed when we do not know the expected output size.
*
* @note If the stream is corrupt, the calculated sizes will be invalid.
*
* @note Violating any of the conditions listed in the parameter descriptions
* below may result in undefined behaviour.
*
* @param[in] device_compressed_chunk_ptrs Array with size \p num_chunks of
* pointers in device-accessible memory to compressed buffers.
* Each buffer must be aligned to the value in
* `nvcompBatched<compression_method>DecompressRequiredAlignments.input`.
* @param[in] device_compressed_chunk_bytes Array with size \p num_chunks of sizes
* of the compressed buffers in bytes. The sizes should reside in device-accessible memory.
* @param[out] device_uncompressed_chunk_bytes Array with size \p num_chunks
* to be filled with the sizes, in bytes, of each uncompressed data chunk.
* This argument needs to be preallocated in device-accessible memory.
* @param[in] num_chunks Number of data chunks to compute sizes of.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>GetDecompressSizeAsync(
    const void* const* device_compressed_chunk_ptrs,
    const size_t* device_compressed_chunk_bytes,
    size_t* device_uncompressed_chunk_bytes,
    size_t num_chunks,
    cudaStream_t stream);

Once the decompressed sizes are known, the decompression API can be used:

/**
* @brief Perform batched asynchronous decompression.
*
* @note Violating any of the conditions listed in the parameter descriptions
* below may result in undefined behaviour.
*
* @param[in] device_compressed_chunk_ptrs Array with size \p num_chunks of pointers
* in device-accessible memory to device-accessible compressed buffers.
* Each buffer must be aligned to the value in
* `nvcompBatched<compression_method>DecompressRequiredAlignments.input`.
* @param[in] device_compressed_chunk_bytes Array with size \p num_chunks of sizes of
* the compressed buffers in bytes. The sizes should reside in device-accessible memory.
* @param[in] device_uncompressed_buffer_bytes Array with size \p num_chunks of sizes,
* in bytes, of the output buffers to be filled with uncompressed data for each chunk.
* The sizes should reside in device-accessible memory. If a
* size is not large enough to hold all decompressed data, the decompressor
* will set the status in \p device_statuses corresponding to the
* overflow chunk to `nvcompErrorCannotDecompress`.
* @param[out] device_uncompressed_chunk_bytes Array with size \p num_chunks to
* be filled with the actual number of bytes decompressed for every chunk.
* This argument needs to be preallocated, but can be nullptr if desired,
* in which case the actual sizes are not reported.
* @param[in] num_chunks Number of chunks of data to decompress.
* @param[in] device_temp_ptr The temporary GPU space.
* Must be aligned to the value in `nvcompBatched<compression_method>DecompressRequiredAlignments.temp`.
* @param[in] temp_bytes The size of the temporary GPU space.
* @param[out] device_uncompressed_chunk_ptrs Array with size \p num_chunks of
* pointers in device-accessible memory to decompressed data. Each uncompressed
* buffer needs to be preallocated in device-accessible memory, have the size
* specified by the corresponding entry in \p device_uncompressed_buffer_bytes,
* and be aligned to the value in
* `nvcompBatched<compression_method>DecompressRequiredAlignments.output`.
* @param[out] device_statuses Array with size \p num_chunks of statuses in
* device-accessible memory. This argument needs to be preallocated. For each
* chunk, if the decompression is successful, the status will be set to
* `nvcompSuccess`. If the decompression is not successful, for example due to
* the corrupted input or out-of-bound errors, the status will be set to
* `nvcompErrorCannotDecompress`.
* Can be nullptr if desired, in which case error status is not reported.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successfully launched, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>DecompressAsync(
    const void* const* device_compressed_chunk_ptrs,
    const size_t* device_compressed_chunk_bytes,
    const size_t* device_uncompressed_buffer_bytes,
    size_t* device_uncompressed_chunk_bytes,
    size_t num_chunks,
    void* const device_temp_ptr,
    size_t temp_bytes,
    void* const* device_uncompressed_chunk_ptrs,
    nvcompStatus_t* device_statuses,
    cudaStream_t stream);

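Note that for LZ4, Snappy, and GDeflate, both device_uncompressed_chunk_bytes and device_statuses may be passed as nullptr. In that case the functions do not compute those values. In particular, if device_statuses is nullptr, out-of-bounds (OOB) error checking is disabled, which can significantly improve decompression throughput.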
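When the optional outputs are requested, they are typically copied back to the host and inspected after the stream has finished. The sketch below shows one way such a check might look. It uses a stand-in status type, `chunk_status_t`, with placeholder values, so that it compiles without the library. Real code would use nvcompStatus_t and compare against nvcompSuccess. The function name `summarize_results` is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status type for this sketch only; values are placeholders. */
typedef enum { CHUNK_OK = 0, CHUNK_CANNOT_DECOMPRESS = 1 } chunk_status_t;

/*
 * Scan host copies of the per-chunk statuses and reported sizes.
 * Either array may be NULL, mirroring the optional outputs: with no
 * statuses, every chunk is assumed to have succeeded, and with no sizes,
 * nothing is added to the total.
 * Returns the number of chunks flagged as failed. If total_bytes is not
 * NULL, it receives the sum of the reported sizes of the other chunks.
 */
static size_t summarize_results(const chunk_status_t* statuses,
                                const size_t* actual_bytes,
                                size_t num_chunks,
                                size_t* total_bytes)
{
    size_t failed = 0;
    size_t total = 0;
    for (size_t i = 0; i < num_chunks; ++i) {
        int ok = (statuses == NULL) || (statuses[i] == CHUNK_OK);
        if (!ok) {
            ++failed;
            continue;
        }
        if (actual_bytes != NULL) {
            total += actual_bytes[i];
        }
    }
    assert(failed <= num_chunks);
    if (total_bytes != NULL) {
        *total_bytes = total;
    }
    return failed;
}
```

Skipping the statuses removes the only per-chunk signal of corrupted input, so the throughput gain is best reserved for data that is known to be valid.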

Batched Compression/Decompression Example - LZ4

For an example of batched compression and decompression with LZ4, see examples/low_level_quickstart_example.cpp.