Collective Communication Functions

The following NCCL APIs provide some commonly used collective operations.

ncclAllReduce

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length count in sendbuff using the op operation and leaves identical copies of the result in each recvbuff.

The operation is performed in place if sendbuff == recvbuff.
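The semantics above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `model_allreduce_sum` is a hypothetical helper (not part of NCCL), the ranks are simulated as plain CPU arrays of `float`, and the reduction is fixed to a sum.

```c
#include <stddef.h>

/* Host-side model of a sum all-reduce over `nranks` simulated ranks.
 * send[r] and recv[r] stand in for rank r's sendbuff and recvbuff.
 * Hypothetical helper for illustration only; this is not an NCCL call. */
void model_allreduce_sum(int nranks, size_t count,
                         const float *const *send, float *const *recv)
{
    for (size_t i = 0; i < count; ++i) {
        /* Read element i from every rank before writing anything,
         * so the model also works when send[r] == recv[r] (in place). */
        float acc = 0.0f;
        for (int r = 0; r < nranks; ++r)
            acc += send[r][i];
        /* Every rank ends up with an identical copy of the result. */
        for (int r = 0; r < nranks; ++r)
            recv[r][i] = acc;
    }
}
```

Passing the same pointer for a rank's send and receive slots models the in-place case described above.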

ncclBroadcast

ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Copies count elements from sendbuff on the root rank to recvbuff on all ranks. sendbuff is only used on rank root and is ignored on all other ranks.

The operation is performed in place if sendbuff == recvbuff.
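As a rough illustration of the broadcast semantics, the following host-side sketch simulates the ranks as CPU arrays. `model_broadcast` is a hypothetical name introduced here, the element type is assumed to be `float`, and the model assumes buffers are either identical or non-overlapping.

```c
#include <stddef.h>
#include <string.h>

/* Host-side model of a broadcast from `root` to all simulated ranks.
 * Only send[root] is read; the other entries of `send` may be NULL,
 * mirroring the rule that sendbuff is ignored on non-root ranks.
 * Hypothetical helper for illustration only; this is not an NCCL call. */
void model_broadcast(int nranks, size_t count, int root,
                     const float *const *send, float *const *recv)
{
    const float *src = send[root];
    for (int r = 0; r < nranks; ++r) {
        /* In-place case: the destination already holds the data. */
        if (recv[r] != src)
            memcpy(recv[r], src, count * sizeof(float));
    }
}
```

When the root passes the same buffer as both source and destination, the copy on the root is skipped and only the other ranks are written.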

ncclResult_t ncclBcast(void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

Legacy in-place version of ncclBroadcast, similar to MPI_Bcast. A call to

ncclBcast(buff, count, datatype, root, comm, stream)

is equivalent to

ncclBroadcast(buff, buff, count, datatype, root, comm, stream)

ncclReduce

ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)

Reduces data arrays of length count in sendbuff into recvbuff on the root rank using the op operation. recvbuff is only used on rank root and is ignored on all other ranks.

The operation is performed in place if sendbuff == recvbuff.
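A minimal host-side sketch of the reduce semantics follows, assuming a sum reduction over `float` data. `model_reduce_sum` and its `root_recv` parameter are hypothetical names used for illustration; only the root's receive buffer is modeled because recvbuff is ignored elsewhere.

```c
#include <stddef.h>

/* Host-side model of a sum reduce onto a single root rank.
 * send[r] stands in for rank r's sendbuff; root_recv stands in for
 * the root's recvbuff, the only receive buffer that is written.
 * Hypothetical helper for illustration only; this is not an NCCL call. */
void model_reduce_sum(int nranks, size_t count,
                      const float *const *send, float *root_recv)
{
    for (size_t i = 0; i < count; ++i) {
        /* Accumulate element i from all ranks before storing it, so
         * root_recv may alias the root's own send buffer (in place). */
        float acc = 0.0f;
        for (int r = 0; r < nranks; ++r)
            acc += send[r][i];
        root_recv[i] = acc;
    }
}
```

The non-root send buffers are left untouched; only the root observes the reduced result.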

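ncclAllGather

ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)

Gathers sendcount values from all GPUs and leaves identical copies of the result in each recvbuff. The data received from rank i is placed at offset i*sendcount.

Note: This assumes the receive count is equal to nranks*sendcount, which means that recvbuff must have room for at least nranks*sendcount elements.

The operation is performed in place if sendbuff == recvbuff + rank * sendcount, where the offset is counted in elements.

Related links: AllGather, In-place Operations.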
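The offset rule and the in-place condition for all-gather can be sketched on the host as follows. This is an illustrative model only: `model_allgather` is a hypothetical helper, the ranks are simulated as CPU arrays, and the element type is assumed to be `float`.

```c
#include <stddef.h>
#include <string.h>

/* Host-side model of all-gather over `nranks` simulated ranks.
 * Each recv[dst] must hold nranks * sendcount elements; the block
 * contributed by rank src lands at offset src * sendcount.
 * Hypothetical helper for illustration only; this is not an NCCL call. */
void model_allgather(int nranks, size_t sendcount,
                     const float *const *send, float *const *recv)
{
    for (int dst = 0; dst < nranks; ++dst) {
        for (int src = 0; src < nranks; ++src) {
            float *slot = recv[dst] + (size_t)src * sendcount;
            /* In-place case: send[src] == recv[src] + src * sendcount,
             * so a rank's own block is already where it belongs. */
            if (slot != send[src])
                memcpy(slot, send[src], sendcount * sizeof(float));
        }
    }
}
```

For the in-place case, each simulated rank passes a pointer into its own receive buffer at offset rank * sendcount as its send buffer, matching the condition stated above.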

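ncclReduceScatter

ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)

Reduces data in sendbuff from all GPUs using the op operation and leaves the reduced result scattered over the devices, so that recvbuff on rank i contains the i-th block of the result.

Note: This assumes the send count is equal to nranks*recvcount, which means that sendbuff must hold at least nranks*recvcount elements.

The operation is performed in place if recvbuff == sendbuff + rank * recvcount, where the offset is counted in elements.

Related links: ReduceScatter, In-place Operations.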
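The block layout of reduce-scatter can be modeled on the host in a few lines. This sketch assumes a sum reduction over `float` data and simulates the ranks as CPU arrays; `model_reducescatter_sum` is a hypothetical name, not an NCCL function.

```c
#include <stddef.h>

/* Host-side model of a sum reduce-scatter over `nranks` simulated ranks.
 * Each send[src] holds nranks * recvcount elements; rank dst receives
 * the reduced block that starts at element dst * recvcount.
 * Hypothetical helper for illustration only; this is not an NCCL call. */
void model_reducescatter_sum(int nranks, size_t recvcount,
                             const float *const *send, float *const *recv)
{
    for (int dst = 0; dst < nranks; ++dst) {
        for (size_t i = 0; i < recvcount; ++i) {
            size_t k = (size_t)dst * recvcount + i;
            /* Sum element k across all ranks before storing, so that
             * recv[dst] == send[dst] + dst * recvcount (in place) is safe. */
            float acc = 0.0f;
            for (int src = 0; src < nranks; ++src)
                acc += send[src][k];
            recv[dst][i] = acc;
        }
    }
}
```

In the in-place case, each rank's result overwrites only its own block inside its send buffer; the other blocks of that buffer keep their input values in this model.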