Megatron dataset compatibility

DatasetDistributedNondeterministic

Bases: AssertionError

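Raised when a dataset is not safe to use deterministically in a distributed (model-parallel) setting, for example because it calls `torch.manual_seed`.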

Source code in `bionemo/testing/megatron_dataset_compatibility.py`
class DatasetDistributedNondeterministic(AssertionError):
    """Datasets are not deterministic in a distributed (model-parallel) setting."""

DatasetLocallyNondeterministic

Bases: AssertionError

Datasets are not locally deterministic.

Source code in `bionemo/testing/megatron_dataset_compatibility.py`
class DatasetLocallyNondeterministic(AssertionError):
    """Datasets are not locally deterministic."""

assert_dataset_compatible_with_megatron(dataset, index=0, assert_elements_equal=assert_dict_tensors_approx_equal)

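Make sure that a dataset passes some basic sanity checks for Megatron determinism constraints.

Constraints tested
  • dataset[i] returns the same element each time it is requested, regardless of device
  • dataset[i] does not call known problematic randomization procedures (currently `torch.manual_seed`, `torch.cuda.manual_seed`, and `torch.cuda.manual_seed_all`).

As more constraints are discovered, they should be added to this test.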

Source code in `bionemo/testing/megatron_dataset_compatibility.py`
def assert_dataset_compatible_with_megatron(
    dataset: torch.utils.data.Dataset[TensorCollectionOrTensor],
    index: Index = 0,
    assert_elements_equal: Callable[
        [TensorCollectionOrTensor, TensorCollectionOrTensor], None
    ] = assert_dict_tensors_approx_equal,
):
    """Make sure that a dataset passes some basic sanity checks for megatron determinism constraints.

    Constraints tested:
        * dataset[i] returns the same element regardless of device
        * dataset[i] doesn't make calls to known problematic randomization procedures (currently `torch.manual_seed`).

    As more constraints are discovered, they should be added to this test.
    """
    # 1. Make sure the dataset is deterministic when you ask for the same elements.
    n_elements = len(dataset)  # type: ignore
    assert n_elements > 0, "Need one element or more to test"
    try:
        assert_elements_equal(dataset[index], dataset[index])
    except AssertionError as e_0:
        raise DatasetLocallyNondeterministic(e_0) from e_0
    with (
        patch("torch.manual_seed") as mock_manual_seed,
        patch("torch.cuda.manual_seed") as mock_cuda_manual_seed,
        patch("torch.cuda.manual_seed_all") as mock_cuda_manual_seed_all,
    ):
        _ = dataset[index]
    if mock_manual_seed.call_count > 0 or mock_cuda_manual_seed.call_count > 0 or mock_cuda_manual_seed_all.call_count > 0:
        raise DatasetDistributedNondeterministic(
            "You cannot safely use torch.manual_seed in a cluster with model parallelism. Use torch.Generator directly."
            " See https://github.com/NVIDIA/Megatron-LM/blob/dddecd19/megatron/core/tensor_parallel/random.py#L198-L199"
        )
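
The seed-patching check above can be illustrated without torch. The following is a minimal sketch, not the bionemo implementation: it uses Python's standard `random` module as a stand-in, so `random.seed` plays the role of `torch.manual_seed` and a private `random.Random` plays the role of a `torch.Generator`. The names `GlobalSeedDataset`, `LocalGeneratorDataset`, and `count_global_seed_calls` are hypothetical.

```python
import random
from unittest.mock import patch


class GlobalSeedDataset:
    """Hypothetical dataset that reseeds the process-wide RNG on every access."""

    def __len__(self) -> int:
        return 4

    def __getitem__(self, index: int) -> float:
        random.seed(index)  # mutates global RNG state: the pattern the check rejects
        return random.random()


class LocalGeneratorDataset:
    """Hypothetical dataset that draws from a private, per-index generator."""

    def __len__(self) -> int:
        return 4

    def __getitem__(self, index: int) -> float:
        rng = random.Random(index)  # local state only; the global RNG is untouched
        return rng.random()


def count_global_seed_calls(dataset, index: int = 0) -> int:
    """Fetch one element with `random.seed` replaced by a mock and count calls to it."""
    with patch("random.seed") as mock_seed:
        _ = dataset[index]
    return mock_seed.call_count


# Both datasets are locally deterministic, but only one touches the global seed.
assert count_global_seed_calls(GlobalSeedDataset()) == 1
assert count_global_seed_calls(LocalGeneratorDataset()) == 0
```

Both toy datasets return the same value on repeated access, so a repeat-access comparison alone would not tell them apart. Counting calls to the patched seeding function is what separates them, which is why the function above performs both checks.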

assert_dataset_elements_not_equal(dataset, index_a=0, index_b=1, assert_elements_equal=assert_dict_tensors_approx_equal)

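Test the case where two indices return different elements on datasets that employ randomness, such as masking.

NOTE: If your dataset has no randomness of any kind, use `assert_dataset_compatible_with_megatron` and skip this test. This test is for checking that a dataset which applies a random transform to its elements as a function of the index actually does so for two different indices that map to the same underlying object. It also runs `assert_dataset_compatible_with_megatron` behind the scenes, so if you run this test you do not need to run the other one as well.

With epoch upsampling approaches, some underlying index, say index=0, is called multiple times by a wrapping dataset object. For example, suppose you have a dataset of length 1 and wrap it in an upsampler that extends it to length 2 by mapping index 0 to 0 and index 1 to 0. The wrapper applies randomness to the result, and we expect a different mask on each call even though the underlying object is the same. Again, this test only applies to datasets that employ randomness. Another approach some of our datasets take is a special index that captures both the underlying index and the epoch index. This tuple of indices is used internally to seed the mask. With that kind of dataset, index_a could be (epoch=0, idx=0) and index_b could be (epoch=1, idx=0), for example. We expect those to return different random features.

To use this test effectively, identify cases where two indices return the same underlying object but you expect the dataset to apply different randomization to each.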
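
The upsampling scenario described above can be sketched as follows. This is a minimal illustration under stated assumptions, not bionemo code: `UpsampledMaskingDataset` and all of its parameters are hypothetical, and Python's `random.Random` stands in for a `torch.Generator` so the sketch runs without torch.

```python
import random
from typing import List, Sequence


class UpsampledMaskingDataset:
    """Hypothetical wrapper that repeats a base dataset and masks per wrapped index."""

    def __init__(
        self,
        base: Sequence[List[int]],
        length: int,
        seed: int = 0,
        mask_prob: float = 0.15,
        mask_token: int = -1,
    ) -> None:
        self.base = base
        self.length = length
        self.seed = seed
        self.mask_prob = mask_prob
        self.mask_token = mask_token

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> List[int]:
        if not 0 <= index < self.length:
            raise IndexError(index)
        # With a base of length 1, wrapped indices 0 and 1 both read base[0].
        tokens = self.base[index % len(self.base)]
        # The generator is seeded from the wrapped index, not the underlying one,
        # so each wrapped index gets its own reproducible mask.
        rng = random.Random(self.seed * self.length + index)
        return [self.mask_token if rng.random() < self.mask_prob else tok for tok in tokens]


base = [list(range(64))]  # a single underlying sample of 64 tokens
dataset = UpsampledMaskingDataset(base, length=2)
assert dataset[0] == dataset[0]  # same wrapped index: identical mask on every call
assert dataset[0] != dataset[1]  # same underlying sample, different mask
```

In this sketch index_a=0 and index_b=1 are the pair to hand to the test: they resolve to the same underlying sample but are expected to come back with different masks.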

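Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `Dataset[TensorCollectionOrTensor]` | Dataset object with randomness (e.g. masking) to test. | required |
| `index_a` | `Index` | Index for some element. | `0` |
| `index_b` | `Index` | Index for a different element. | `1` |
| `assert_elements_equal` | `Callable[[TensorCollectionOrTensor, TensorCollectionOrTensor], None]` | Function used to compare two returned batch elements. The default works for both tensors and dictionaries of tensors. | `assert_dict_tensors_approx_equal` |
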
Source code in `bionemo/testing/megatron_dataset_compatibility.py`
def assert_dataset_elements_not_equal(
    dataset: torch.utils.data.Dataset[TensorCollectionOrTensor],
    index_a: Index = 0,
    index_b: Index = 1,
    assert_elements_equal: Callable[
        [TensorCollectionOrTensor, TensorCollectionOrTensor], None
    ] = assert_dict_tensors_approx_equal,
):
    """Test the case where two indices return different elements on datasets that employ randomness, like masking.

    NOTE: if you have a dataset without any kinds of randomness, just use the `assert_dataset_compatible_with_megatron`
    test and skip this one. This test is for the case when you want to test that a dataset that applies a random
    transform to your elements as a function of index actually does so with two different indices that map to the same
    underlying object. This test also runs `assert_dataset_compatible_with_megatron` behind the scenes so if you
    do this you do not need to also do the other.

    With epoch upsampling approaches, some underlying index, say index=0, will be called multiple times by some wrapping
    dataset object. For example if you have a dataset of length 1, and you wrap it in an up-sampler that maps it to
    length 2 by mapping index 0 to 0 and 1 to 0, then in that wrapper we apply randomness to the result and we expect
    different masks to be used for each call, even though the underlying object is the same. Again this test only
    applies to a dataset that employs randomness. Another approach some of our datasets take is to use a special index
    that captures both the underlying index, and the epoch index. This tuple of indices is used internally to seed the
    mask. If that kind of dataset is used, then index_a could be (epoch=0, idx=0) and index_b could be (epoch=1, idx=0),
    for example. We expect those to return different random features.

    The idea for using this test effectively is to identify cases where you have two indices that return the same
    underlying object, but where you expect different randomization to be applied to each by the dataset.

    Args:
        dataset: dataset object with randomness (eg masking) to test.
        index_a: index for some element. Defaults to 0.
        index_b: index for a different element. Defaults to 1.
        assert_elements_equal: Function to compare two returned batch elements. Defaults to
            `assert_dict_tensors_approx_equal` which works for both tensors and dictionaries of tensors.
    """
    # 0, first sanity check for determinism/compatibility on idx0 and idx1
    assert_dataset_compatible_with_megatron(dataset, index=index_a, assert_elements_equal=assert_elements_equal)
    assert_dataset_compatible_with_megatron(dataset, index=index_b, assert_elements_equal=assert_elements_equal)
    # 1, now check that index_a != index_b
    with pytest.raises(AssertionError):
        assert_elements_equal(dataset[index_a], dataset[index_b])
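
The epoch-aware index and the final inequality check can be sketched in plain Python. This is an illustration under stated assumptions, not the bionemo implementation: `EpochIndex`, `EpochSeededMaskingDataset`, `assert_lists_equal`, and `assert_not_equal` are hypothetical names, `random.Random` stands in for a `torch.Generator`, and a `try`/`except` plays the role of `pytest.raises(AssertionError)` so the sketch has no third-party dependencies.

```python
import random
from typing import List, NamedTuple, Sequence


class EpochIndex(NamedTuple):
    """Hypothetical index pairing an epoch counter with an underlying sample index."""

    epoch: int
    idx: int


class EpochSeededMaskingDataset:
    """Hypothetical dataset whose mask is seeded by the (epoch, idx) pair."""

    def __init__(self, samples: Sequence[List[int]], mask_prob: float = 0.15, mask_token: int = -1) -> None:
        self.samples = samples
        self.mask_prob = mask_prob
        self.mask_token = mask_token

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: EpochIndex) -> List[int]:
        # Cantor pairing: every pair of non-negative ints maps to its own seed.
        total = index.epoch + index.idx
        seed = total * (total + 1) // 2 + index.idx
        rng = random.Random(seed)
        return [self.mask_token if rng.random() < self.mask_prob else tok for tok in self.samples[index.idx]]


def assert_lists_equal(actual: List[int], expected: List[int]) -> None:
    assert actual == expected, f"{actual} != {expected}"


def assert_not_equal(dataset, index_a, index_b) -> None:
    """Pass only when the equality assertion fails for the two indices."""
    try:
        assert_lists_equal(dataset[index_a], dataset[index_b])
    except AssertionError:
        return  # the elements differ, which is the expected outcome
    raise AssertionError("elements at both indices are identical")


dataset = EpochSeededMaskingDataset([list(range(64))])
first_epoch, second_epoch = EpochIndex(epoch=0, idx=0), EpochIndex(epoch=1, idx=0)
assert_lists_equal(dataset[first_epoch], dataset[first_epoch])  # deterministic per index
assert_not_equal(dataset, first_epoch, second_epoch)  # different mask per epoch
```

Deriving the seed from the full index keeps each element reproducible on its own, independent of access order, while still varying the mask between epochs.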

assert_dict_tensors_approx_equal(actual, expected)

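Assert that two tensors, or two dictionaries of tensors with matching keys, are approximately equal (compared with `torch.testing.assert_close`).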

Source code in `bionemo/testing/megatron_dataset_compatibility.py`
def assert_dict_tensors_approx_equal(actual: TensorCollectionOrTensor, expected: TensorCollectionOrTensor) -> None:
    """Assert that two tensors are equal."""
    if isinstance(actual, dict) and isinstance(expected, dict):
        a_keys, b_keys = actual.keys(), expected.keys()
        assert a_keys == b_keys
        for key in a_keys:
            torch.testing.assert_close(actual=actual[key], expected=expected[key])
    else:
        torch.testing.assert_close(actual=actual, expected=expected)
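
The dict-or-single-value dispatch above can be mirrored for element types other than tensors, since the `assert_elements_equal` parameter of the functions on this page accepts any two-argument callable that raises `AssertionError` on mismatch. The following is a minimal sketch under that assumption: `assert_floats_approx_equal` is a hypothetical comparator for plain Python floats, using `math.isclose` in place of `torch.testing.assert_close` so it runs without torch.

```python
import math
from typing import Dict, Union

FloatOrDict = Union[float, Dict[str, float]]


def assert_floats_approx_equal(
    actual: FloatOrDict,
    expected: FloatOrDict,
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> None:
    """Hypothetical comparator: approximate equality for floats or dicts of floats."""
    if isinstance(actual, dict) and isinstance(expected, dict):
        # Key sets must match exactly before any values are compared.
        assert actual.keys() == expected.keys(), "key sets differ"
        for key in actual:
            assert math.isclose(
                actual[key], expected[key], rel_tol=rel_tol, abs_tol=abs_tol
            ), f"mismatch at key {key!r}"
    else:
        assert math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol), "values differ"


# Values within tolerance pass silently; a mismatch raises AssertionError.
assert_floats_approx_equal({"loss": 1.0}, {"loss": 1.0 + 1e-12})
assert_floats_approx_equal(0.5, 0.5)
```

Because the tolerances have defaults, the comparator can still be called with exactly two positional arguments, matching the `Callable[[...], None]` shape shown in the signatures above.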