
Megatron utils

is_only_data_parallel()

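Checks whether you are in a distributed Megatron environment with only data parallelism active.

This is useful if you are working on a model, loss function, etc., and you know that you do not yet support Megatron model parallelism. You can test that the only kind of parallelism in use is data parallelism.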

Returns

Type    Description
bool    True if data parallel is the only parallel mode, False otherwise.

Source code in bionemo/llm/utils/megatron_utils.py
import torch.distributed
from megatron.core import parallel_state


def is_only_data_parallel() -> bool:
    """Checks to see if you are in a distributed megatron environment with only data parallelism active.

    This is useful if you are working on a model, loss, etc and you know that you do not yet support megatron model
    parallelism. You can test that the only kind of parallelism in use is data parallelism.

    Returns:
        True if data parallel is the only parallel mode, False otherwise.
    """
    if not (torch.distributed.is_available() and parallel_state.is_initialized()):
        raise RuntimeError("This function is only defined within an initialized megatron parallel environment.")
    # Idea: when world_size == data_parallel_world_size, then you know that you are fully DDP, which means you are not
    #  using model parallelism (meaning virtual GPUs composed of several underlying GPUs that you need to reduce over).

    world_size: int = torch.distributed.get_world_size()
    dp_world_size: int = parallel_state.get_data_parallel_world_size()
    return world_size == dp_world_size
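
The comparison above can be illustrated without a GPU cluster. The following is a minimal sketch, not part of the library: the helper names `data_parallel_size` and `only_data_parallel` are hypothetical, and it assumes the data-parallel world size equals the total world size divided by the product of the tensor-parallel and pipeline-parallel degrees (other parallel modes are ignored here).

```python
def data_parallel_size(world_size: int, tensor_parallel_size: int = 1, pipeline_parallel_size: int = 1) -> int:
    """Hypothetical helper: number of data-parallel replicas for a given layout.

    Assumption: every model replica spans tensor_parallel_size * pipeline_parallel_size ranks.
    """
    model_parallel_size = tensor_parallel_size * pipeline_parallel_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by the model-parallel size.")
    return world_size // model_parallel_size


def only_data_parallel(world_size: int, tensor_parallel_size: int = 1, pipeline_parallel_size: int = 1) -> bool:
    """Hypothetical helper mirroring the check: True when every rank holds a full model replica."""
    dp_size = data_parallel_size(world_size, tensor_parallel_size, pipeline_parallel_size)
    return world_size == dp_size


# 8 ranks, no model parallelism: each rank is its own replica.
print(only_data_parallel(8))
# 8 ranks with tensor parallelism of 2: only 4 replicas, so model parallelism is active.
print(only_data_parallel(8, tensor_parallel_size=2))
```

In a real job the sizes come from `torch.distributed.get_world_size()` and `parallel_state.get_data_parallel_world_size()`, as in the source listing, rather than from arguments.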