Low-level C Quick Start Guide#
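Some applications need to compress or decompress many small inputs, so we provide an additional API to do this efficiently. These API calls combine all of the compression or decompression work into a single execution, which greatly improves performance compared with processing each input individually. This API relies on the user to split the data into chunks and to manage metadata such as the compressed and uncompressed chunk sizes. When splitting data, chunks should be roughly equal in size for best performance, so that the load is well balanced and enough parallelism is exposed. Consequently, even when there are multiple inputs to compress, it may still be best to break each input into smaller chunks.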
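As a rough illustration of the splitting step, the sketch below plans fixed-size chunks over a single input on the host. The helper names (`chunk_count`, `plan_chunks`) and the choice of one fixed chunk size are assumptions made for this example, not part of nvCOMP. Every chunk except possibly the last comes out the same size, which is one simple way to keep the chunks roughly equal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an nvCOMP function): number of chunks needed
 * to cover total_bytes when each chunk holds at most chunk_bytes. */
size_t chunk_count(size_t total_bytes, size_t chunk_bytes)
{
  assert(chunk_bytes > 0);
  return total_bytes / chunk_bytes + (total_bytes % chunk_bytes != 0);
}

/* Hypothetical helper: fill offsets[i] and sizes[i] for every chunk and
 * return the largest chunk size, i.e. the value a caller would pass as
 * max_uncompressed_chunk_bytes. */
size_t plan_chunks(size_t total_bytes, size_t chunk_bytes,
                   size_t* offsets, size_t* sizes, size_t num_chunks)
{
  assert(num_chunks == chunk_count(total_bytes, chunk_bytes));
  size_t max_bytes = 0;
  for (size_t i = 0; i < num_chunks; ++i) {
    size_t begin = i * chunk_bytes;
    size_t remaining = total_bytes - begin;
    offsets[i] = begin;
    sizes[i] = remaining < chunk_bytes ? remaining : chunk_bytes;
    if (sizes[i] > max_bytes) {
      max_bytes = sizes[i];
    }
  }
  return max_bytes;
}
```

With a 10-byte input and 4-byte chunks this yields three chunks of 4, 4, and 2 bytes at offsets 0, 4, and 8. The device pointer for chunk i would then be the input's base pointer plus offsets[i].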
The low-level batched C API provides a set of functions for batched compression and decompression.
In the following API descriptions, replace <compression_method> with the desired compression algorithm, which can be one of:
- ans
- bitcomp
- cascaded
- deflate
- gdeflate
- gzip (decompression only)
- lz4
- snappy
- zstd
For example, for LZ4, nvcompBatched<compression_method>CompressAsync becomes nvcompBatchedLZ4CompressAsync, and nvcompBatched<compression_method>DecompressAsync becomes nvcompBatchedLZ4DecompressAsync.
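When the algorithm is chosen at compile time, the substitution can be mechanized with the preprocessor. The macros below are hypothetical conveniences written for this guide, not part of nvCOMP; they simply paste the algorithm name into the function name pattern described above. The spelling passed as `method` has to match the capitalization used in the real function names (for example `LZ4`).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro (not provided by nvCOMP): builds the identifier
 * nvcompBatched<method><op> by token pasting. */
#define NVCOMP_BATCHED(method, op) nvcompBatched##method##op

/* Two-level stringification so the pasted identifier is expanded first. */
#define NVCOMP_STR_IMPL(x) #x
#define NVCOMP_STR(x) NVCOMP_STR_IMPL(x)

/* Sanity check: the pasted names match the ones given in the text. */
void check_lz4_names(void)
{
  assert(strcmp(NVCOMP_STR(NVCOMP_BATCHED(LZ4, CompressAsync)),
                "nvcompBatchedLZ4CompressAsync") == 0);
  assert(strcmp(NVCOMP_STR(NVCOMP_BATCHED(LZ4, DecompressAsync)),
                "nvcompBatchedLZ4DecompressAsync") == 0);
}
```

In real code the macro would appear in call position, for example NVCOMP_BATCHED(LZ4, CompressAsync)(...), with the corresponding nvCOMP header included.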
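Some compressors have alignment requirements (of up to 8 bytes) on the user-provided input, output, and/or temporary buffers. See the documentation in the corresponding header under include/ for the alignment requirements of any particular API.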
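A defensive check before launching a batch can catch misaligned buffers early. The helper below is a hypothetical host-side utility, not an nvCOMP function, and it assumes the required alignment is a power of two. The actual value for each buffer must come from the header documentation for the chosen compressor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of nvCOMP): returns 1 if ptr is a
 * multiple of alignment, 0 otherwise. Assumes a power-of-two alignment.
 * Only the address value is inspected; ptr is never dereferenced. */
int is_aligned(const void* ptr, size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return ((uintptr_t)ptr & (uintptr_t)(alignment - 1)) == 0;
}
```

Pointers into the middle of a larger allocation (for example a base pointer plus a per-chunk offset) are the usual place where alignment is lost, so those are the ones worth checking.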
Compression API#
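To perform batched compression, a temporary workspace must be allocated in device memory. The required size of this workspace is computed with the following function: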
/**
* @brief Get the amount of temporary memory required on the GPU for compression.
*
* @param[in] num_chunks The number of chunks of memory in the batch.
* @param[in] max_uncompressed_chunk_bytes The maximum size of a chunk in the
* batch.
* @param[in] format_opts Compression options.
* @param[out] temp_bytes The amount of GPU memory that will be temporarily
* required during compression.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>CompressGetTempSize(
    size_t num_chunks,
    size_t max_uncompressed_chunk_bytes,
    nvcompBatched<compression_method>Opts_t format_opts,
    size_t* temp_bytes);
Compression is then performed with:
/**
* @brief Perform batched asynchronous compression.
*
* @note Violating any of the conditions listed in the parameter descriptions
* below may result in undefined behaviour.
*
* @param[in] device_uncompressed_chunk_ptrs Array with size \p num_chunks of pointers
* to the uncompressed data chunks. Both the pointers and the uncompressed data
* should reside in device-accessible memory.
* Each chunk must be aligned to the value in the `input` member of the
* `nvcompAlignmentRequirements_t` object output by
* `nvcompBatched<compression_method>CompressGetRequiredAlignments`
* when called with the same \p format_opts.
* @param[in] device_uncompressed_chunk_bytes Array with size \p num_chunks of
* sizes of the uncompressed chunks in bytes.
* The sizes should reside in device-accessible memory.
* @param[in] max_uncompressed_chunk_bytes The size of the largest uncompressed chunk.
* @param[in] num_chunks Number of chunks of data to compress.
* @param[in] device_temp_ptr The temporary GPU workspace, could be NULL in case
* temporary memory is not needed.
* Must be aligned to the value in the `temp` member of the
* `nvcompAlignmentRequirements_t` object output by
* `nvcompBatched<compression_method>CompressGetRequiredAlignments` when called with the same
* \p format_opts.
* @param[in] temp_bytes The size of the temporary GPU memory pointed to by
* `device_temp_ptr`.
* @param[out] device_compressed_chunk_ptrs Array with size \p num_chunks of pointers
* to the output compressed buffers. Both the pointers and the compressed
* buffers should reside in device-accessible memory. Each compressed buffer
* should be preallocated with the size given by
* `nvcompBatched<compression_method>CompressGetMaxOutputChunkSize`.
* Each compressed buffer must be aligned to the value in the `output` member of the
* `nvcompAlignmentRequirements_t` object output by
* `nvcompBatched<compression_method>CompressGetRequiredAlignments`
* when called with the same \p format_opts.
* @param[out] device_compressed_chunk_bytes Array with size \p num_chunks,
* to be filled with the compressed sizes of each chunk.
* The buffer should be preallocated in device-accessible memory.
* @param[in] format_opts Compression options.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successfully launched, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>CompressAsync(
    const void* const* device_uncompressed_chunk_ptrs,
    const size_t* device_uncompressed_chunk_bytes,
    size_t max_uncompressed_chunk_bytes,
    size_t num_chunks,
    void* device_temp_ptr,
    size_t temp_bytes,
    void* const* device_compressed_chunk_ptrs,
    size_t* device_compressed_chunk_bytes,
    nvcompBatched<compression_method>Opts_t format_opts,
    cudaStream_t stream);
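The output side of this call needs one preallocated buffer per chunk, each with the size reported by nvcompBatched<compression_method>CompressGetMaxOutputChunkSize and the required output alignment. One way to arrange this, sketched below under stated assumptions, is to carve all buffers out of a single allocation using a fixed, alignment-padded stride. The helper names are hypothetical, the alignment is assumed to be a power of two, and the base pointer is assumed to already satisfy that alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: distance in bytes between consecutive buffers,
 * i.e. max_chunk_bytes rounded up to a power-of-two alignment. */
size_t slab_stride(size_t max_chunk_bytes, size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (max_chunk_bytes + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: sets ptrs[i] = base + i * stride for each chunk.
 * Only pointer arithmetic is performed; base is never dereferenced. */
void fill_slab_ptrs(void* base, size_t stride, size_t num_chunks,
                    void** ptrs)
{
  for (size_t i = 0; i < num_chunks; ++i) {
    ptrs[i] = (char*)base + i * stride;
  }
}
```

The single allocation would be stride * num_chunks bytes. The resulting pointer array still has to be copied to device-accessible memory before it is passed as device_compressed_chunk_ptrs.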
Decompression API#
Decompression also requires a temporary workspace. Its size is computed with the following function:
/**
* @brief Get the amount of temporary memory required on the GPU for decompression.
*
* @param[in] num_chunks Number of chunks of data to be decompressed.
* @param[in] max_uncompressed_chunk_bytes The size of the largest chunk in bytes
* when uncompressed.
* @param[out] temp_bytes The amount of GPU memory that will be temporarily required
* during decompression.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>DecompressGetTempSize(
    size_t num_chunks,
    size_t max_uncompressed_chunk_bytes,
    size_t* temp_bytes);
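During decompression, device memory buffers large enough to hold the decompressed results must be provided. Three workflows are supported:
1) The uncompressed size of each buffer is known exactly (e.g., Apache Parquet, apache/parquet-format).
2) Only the maximum uncompressed size across all buffers is known (e.g., Apache ORC).
3) No information about the uncompressed sizes is provided (e.g., Apache Avro).
For case 3), nvCOMP provides an API that preprocesses the compressed file to determine the appropriate sizes for the decompression output buffers. This API is as follows: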
/**
* @brief Asynchronously compute the number of bytes of uncompressed data for
* each compressed chunk.
*
* This is needed when we do not know the expected output size.
*
* @note If the stream is corrupt, the calculated sizes will be invalid.
*
* @note Violating any of the conditions listed in the parameter descriptions
* below may result in undefined behaviour.
*
* @param[in] device_compressed_chunk_ptrs Array with size \p num_chunks of
* pointers in device-accessible memory to compressed buffers.
* Each buffer must be aligned to the value in
* `nvcompBatched<compression_method>DecompressRequiredAlignments.input`.
* @param[in] device_compressed_chunk_bytes Array with size \p num_chunks of sizes
* of the compressed buffers in bytes. The sizes should reside in device-accessible memory.
* @param[out] device_uncompressed_chunk_bytes Array with size \p num_chunks
* to be filled with the sizes, in bytes, of each uncompressed data chunk.
* This argument needs to be preallocated in device-accessible memory.
* @param[in] num_chunks Number of data chunks to compute sizes of.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>GetDecompressSizeAsync(
    const void* const* device_compressed_chunk_ptrs,
    const size_t* device_compressed_chunk_bytes,
    size_t* device_uncompressed_chunk_bytes,
    size_t num_chunks,
    cudaStream_t stream);
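The three workflows differ only in where the output buffer sizes come from. The sketch below captures that decision on the host; the enum and function names are hypothetical, introduced for illustration. For cases 1) and 2) it fills the per-chunk buffer sizes directly, and for case 3) it reports that a size query such as nvcompBatched<compression_method>GetDecompressSizeAsync has to run first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the three workflows described above. */
typedef enum {
  SIZES_EXACT,    /* 1) exact uncompressed size known per chunk */
  SIZES_MAX_ONLY, /* 2) only the maximum uncompressed size is known */
  SIZES_UNKNOWN   /* 3) no size information available */
} size_info_t;

/* Hypothetical helper: fills buffer_bytes[i] for each chunk when the
 * sizes can be derived from metadata. Returns 1 when buffer_bytes was
 * filled, 0 when the sizes must first be computed from the compressed
 * data itself (buffer_bytes is left untouched in that case). */
int plan_output_sizes(size_info_t info, const size_t* exact_bytes,
                      size_t max_bytes, size_t* buffer_bytes,
                      size_t num_chunks)
{
  switch (info) {
  case SIZES_EXACT:
    assert(exact_bytes != NULL);
    for (size_t i = 0; i < num_chunks; ++i) {
      buffer_bytes[i] = exact_bytes[i];
    }
    return 1;
  case SIZES_MAX_ONLY:
    for (size_t i = 0; i < num_chunks; ++i) {
      buffer_bytes[i] = max_bytes;
    }
    return 1;
  case SIZES_UNKNOWN:
  default:
    return 0;
  }
}
```

In case 2) every buffer is allocated at the maximum size, which trades some memory for skipping the extra size-query pass.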
Once the decompressed sizes are known, the decompression API can be used:
/**
* @brief Perform batched asynchronous decompression.
*
* @note Violating any of the conditions listed in the parameter descriptions
* below may result in undefined behaviour.
*
* @param[in] device_compressed_chunk_ptrs Array with size \p num_chunks of pointers
* in device-accessible memory to device-accessible compressed buffers.
* Each buffer must be aligned to the value in
* `nvcompBatched<compression_method>DecompressRequiredAlignments.input`.
* @param[in] device_compressed_chunk_bytes Array with size \p num_chunks of sizes of
* the compressed buffers in bytes. The sizes should reside in device-accessible memory.
* @param[in] device_uncompressed_buffer_bytes Array with size \p num_chunks of sizes,
* in bytes, of the output buffers to be filled with uncompressed data for each chunk.
* The sizes should reside in device-accessible memory. If a
* size is not large enough to hold all decompressed data, the decompressor
* will set the status in \p device_statuses corresponding to the
* overflow chunk to `nvcompErrorCannotDecompress`.
* @param[out] device_uncompressed_chunk_bytes Array with size \p num_chunks to
* be filled with the actual number of bytes decompressed for every chunk.
* This argument needs to be preallocated, but can be nullptr if desired,
* in which case the actual sizes are not reported.
* @param[in] num_chunks Number of chunks of data to decompress.
* @param[in] device_temp_ptr The temporary GPU space.
* Must be aligned to the value in `nvcompBatched<compression_method>DecompressRequiredAlignments.temp`.
* @param[in] temp_bytes The size of the temporary GPU space.
* @param[out] device_uncompressed_chunk_ptrs Array with size \p num_chunks of
* pointers in device-accessible memory to decompressed data. Each uncompressed
* buffer needs to be preallocated in device-accessible memory, have the size
* specified by the corresponding entry in \p device_uncompressed_buffer_bytes,
* and be aligned to the value in
* `nvcompBatched<compression_method>DecompressRequiredAlignments.output`.
* @param[out] device_statuses Array with size \p num_chunks of statuses in
* device-accessible memory. This argument needs to be preallocated. For each
* chunk, if the decompression is successful, the status will be set to
* `nvcompSuccess`. If the decompression is not successful, for example due to
* the corrupted input or out-of-bound errors, the status will be set to
* `nvcompErrorCannotDecompress`.
* Can be nullptr if desired, in which case error status is not reported.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successfully launched, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>DecompressAsync(
    const void* const* device_compressed_chunk_ptrs,
    const size_t* device_compressed_chunk_bytes,
    const size_t* device_uncompressed_buffer_bytes,
    size_t* device_uncompressed_chunk_bytes,
    size_t num_chunks,
    void* const device_temp_ptr,
    size_t temp_bytes,
    void* const* device_uncompressed_chunk_ptrs,
    nvcompStatus_t* device_statuses,
    cudaStream_t stream);
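Note that for LZ4, Snappy, and GDeflate, device_uncompressed_chunk_bytes and device_statuses can both be specified as nullptr. When they are nullptr, these methods do not compute the corresponding values. In particular, if device_statuses is nullptr, out-of-bounds (OOB) error checking is disabled, which can significantly improve decompression throughput.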
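When device_statuses is supplied, the per-chunk results still have to be inspected after the stream has finished and the array has been copied back to the host. The sketch below shows that host-side scan. To keep it compilable without the nvCOMP headers it uses a stand-in enum, `demo_status_t`, whose names and numeric values are assumptions made for this example; real code would use nvcompStatus_t and compare against nvcompSuccess.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for nvcompStatus_t so the example compiles on its own.
 * The names and values here are assumptions, not nvCOMP's definitions. */
typedef enum {
  DEMO_SUCCESS = 0,
  DEMO_CANNOT_DECOMPRESS = 1
} demo_status_t;

/* Hypothetical helper: returns the index of the first chunk whose status
 * is not DEMO_SUCCESS, or num_chunks if every chunk succeeded. */
size_t first_failed_chunk(const demo_status_t* host_statuses,
                          size_t num_chunks)
{
  assert(host_statuses != NULL || num_chunks == 0);
  for (size_t i = 0; i < num_chunks; ++i) {
    if (host_statuses[i] != DEMO_SUCCESS) {
      return i;
    }
  }
  return num_chunks;
}
```

Passing nullptr for device_statuses skips this check entirely, so that option is probably best reserved for compressed inputs that are already known to be valid.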
Batched Compression/Decompression Example - LZ4#
For an example of batched compression and decompression using LZ4, see examples/low_level_quickstart_example.cpp.