
Mask

DiscreteMaskedPrior

Bases: DiscretePriorDistribution

A subclass representing a discrete masked prior distribution.

Source code in bionemo/moco/distributions/prior/discrete/mask.py
class DiscreteMaskedPrior(DiscretePriorDistribution):
    """A subclass representing a Discrete Masked prior distribution."""

    def __init__(self, num_classes: int = 10, mask_dim: Optional[int] = None, inclusive: bool = True) -> None:
        """Discrete Masked prior distribution.

        There are three ways to define the problem, and they are hard to reconcile:

        1. [..., M, ....] inclusive anywhere --> an existing LLM tokenizer where the mask has a specific index that is not at the end
        2. [......, M] inclusive at the end --> mask_dim = None with inclusive = True (the default) places the mask at the end
        3. [.....] + [M] exclusive --> num_classes represents the number of data classes and a separate MASK dimension is added.
            - Note the pad_sample function is provided to help add this extra external dimension.

        Args:
            num_classes (int): The number of classes in the distribution. Defaults to 10.
            mask_dim (int): The index for the mask token. Defaults to num_classes - 1 if inclusive or num_classes if exclusive.
            inclusive (bool): Whether the mask is included in the specified number of classes.
                                If True, the mask is considered as one of the classes.
                                If False, the mask is considered as an additional class. Defaults to True.
        """
        if inclusive:
            if mask_dim is None:
                mask_dim = num_classes - 1
            else:
                if mask_dim >= num_classes:
                    raise ValueError(
                        "As Inclusive accounts for the mask as one of the specified num_classes, the provided mask_dim cannot be >= to num_classes"
                    )
            prior_dist = torch.zeros((num_classes))
            prior_dist[-1] = 1.0
            super().__init__(num_classes, prior_dist)
            self.mask_dim = mask_dim
        else:
            prior_dist = torch.zeros((num_classes + 1))
            prior_dist[-1] = 1.0
            super().__init__(num_classes + 1, prior_dist)
            self.mask_dim = num_classes
        if torch.sum(self.prior_dist).item() - 1.0 >= 1e-5:
            raise ValueError("Invalid probability distribution. Must sum to 1.0")

    def sample(
        self,
        shape: Tuple,
        mask: Optional[Tensor] = None,
        device: Union[str, torch.device] = "cpu",
        rng_generator: Optional[torch.Generator] = None,
    ) -> Tensor:
        """Generates a specified number of samples.

        Args:
            shape (Tuple): The shape of the samples to generate.
            device (Union[str, torch.device]): The device on which to create the samples, e.g. "cpu" or "cuda".
            mask (Optional[Tensor]): An optional mask to apply to the samples. Defaults to None.
            rng_generator: An optional :class:`torch.Generator` for reproducible sampling. Defaults to None.

        Returns:
            Tensor: An int64 tensor of samples.
        """
        samples = torch.ones(shape, dtype=torch.int64, device=device) * self.mask_dim
        if mask is not None:
            samples = samples * mask[(...,) + (None,) * (len(samples.shape) - len(mask.shape))]
        return samples

    def is_masked(self, sample: Tensor) -> Tensor:
        """Creates a mask for whether a state is masked.

        Args:
            sample (Tensor): The sample to check.

        Returns:
            Tensor: A float tensor indicating whether the sample is masked.
        """
        return (sample == self.mask_dim).float()

    def pad_sample(self, sample: Tensor) -> Tensor:
        """Pads the input sample with zeros along the last dimension.

        Args:
            sample (Tensor): The input sample to be padded.

        Returns:
            Tensor: The padded sample.
        """
        # Create a zeros tensor with the same shape as the original tensor, except the last dimension is 1
        zeros = torch.zeros((*sample.shape[:-1], 1), dtype=torch.float, device=sample.device)
        # Concatenate along the last dimension to make the shape (..., N+1)
        padded_sample = torch.cat((sample, zeros), dim=-1)
        return padded_sample

__init__(num_classes=10, mask_dim=None, inclusive=True)

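Discrete Masked prior distribution.

There are three ways to define the problem, and they are hard to reconcile:

  1. [..., M, ....] inclusive anywhere --> an existing LLM tokenizer where the mask has a specific index that is not at the end
  2. [......, M] inclusive at the end --> mask_dim = None with inclusive = True (the default) places the mask at the end
  3. [.....] + [M] exclusive --> num_classes represents the number of data classes and a separate MASK dimension is added.
    • Note that the pad_sample function is provided to help add this extra external dimension.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| num_classes | int | The number of classes in the distribution. | 10 |
| mask_dim | Optional[int] | The index of the mask token. If None, defaults to num_classes - 1 when inclusive is True, or num_classes when inclusive is False. | None |
| inclusive | bool | Whether the mask is included in the specified number of classes. If True, the mask counts as one of the classes. If False, the mask is an additional class. | True |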
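
The three configurations above can be sketched as a small index calculation. This is a plain-Python illustration that follows the branching of the constructor source shown on this page; `resolve_mask_layout` is a hypothetical helper, not part of bionemo, and it only returns the total class count and the mask index rather than building a distribution.

```python
from typing import Optional, Tuple


def resolve_mask_layout(
    num_classes: int = 10,
    mask_dim: Optional[int] = None,
    inclusive: bool = True,
) -> Tuple[int, int]:
    """Return (total_classes, mask_index) following the constructor's branching.

    Hypothetical helper for illustration only; it is not part of bionemo.
    """
    if inclusive:
        if mask_dim is None:
            # Case 2: the mask is the last of the num_classes classes.
            mask_dim = num_classes - 1
        elif mask_dim >= num_classes:
            raise ValueError("mask_dim must be < num_classes when inclusive=True")
        # Case 1 (explicit index) or case 2 (default index).
        return num_classes, mask_dim
    # Case 3: the mask is an extra class appended after the data classes.
    # As in the source shown here, any mask_dim passed in is not used in this branch.
    return num_classes + 1, num_classes


case_1 = resolve_mask_layout(num_classes=30, mask_dim=4)       # (30, 4)
case_2 = resolve_mask_layout(num_classes=20)                   # (20, 19)
case_3 = resolve_mask_layout(num_classes=20, inclusive=False)  # (21, 20)
print(case_1, case_2, case_3)
```

In the exclusive case the distribution has one more class than `num_classes`, which is why one-hot data over `num_classes` entries needs the extra dimension that `pad_sample` adds.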
Source code in bionemo/moco/distributions/prior/discrete/mask.py
def __init__(self, num_classes: int = 10, mask_dim: Optional[int] = None, inclusive: bool = True) -> None:
    """Discrete Masked prior distribution.

    There are three ways to define the problem, and they are hard to reconcile:

    1. [..., M, ....] inclusive anywhere --> an existing LLM tokenizer where the mask has a specific index that is not at the end
    2. [......, M] inclusive at the end --> mask_dim = None with inclusive = True (the default) places the mask at the end
    3. [.....] + [M] exclusive --> num_classes represents the number of data classes and a separate MASK dimension is added.
        - Note the pad_sample function is provided to help add this extra external dimension.

    Args:
        num_classes (int): The number of classes in the distribution. Defaults to 10.
        mask_dim (int): The index for the mask token. Defaults to num_classes - 1 if inclusive or num_classes if exclusive.
        inclusive (bool): Whether the mask is included in the specified number of classes.
                            If True, the mask is considered as one of the classes.
                            If False, the mask is considered as an additional class. Defaults to True.
    """
    if inclusive:
        if mask_dim is None:
            mask_dim = num_classes - 1
        else:
            if mask_dim >= num_classes:
                raise ValueError(
                    "As Inclusive accounts for the mask as one of the specified num_classes, the provided mask_dim cannot be >= to num_classes"
                )
        prior_dist = torch.zeros((num_classes))
        prior_dist[-1] = 1.0
        super().__init__(num_classes, prior_dist)
        self.mask_dim = mask_dim
    else:
        prior_dist = torch.zeros((num_classes + 1))
        prior_dist[-1] = 1.0
        super().__init__(num_classes + 1, prior_dist)
        self.mask_dim = num_classes
    if torch.sum(self.prior_dist).item() - 1.0 >= 1e-5:
        raise ValueError("Invalid probability distribution. Must sum to 1.0")

is_masked(sample)

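Creates a mask indicating whether a state is masked.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample | Tensor | The sample to check. | required |

Returns

| Type | Description |
| --- | --- |
| Tensor | A float tensor indicating whether the sample is masked. |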

Source code in bionemo/moco/distributions/prior/discrete/mask.py
def is_masked(self, sample: Tensor) -> Tensor:
    """Creates a mask for whether a state is masked.

    Args:
        sample (Tensor): The sample to check.

    Returns:
        Tensor: A float tensor indicating whether the sample is masked.
    """
    return (sample == self.mask_dim).float()
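
As a dependency-free illustration of the comparison above, the sketch below applies the same elementwise test to a plain Python list instead of a tensor. The names `MASK_INDEX` and `is_masked_list` are assumptions made for this example (here `num_classes=20` with the mask as the last class), not bionemo API.

```python
from typing import List

MASK_INDEX = 19  # assumed: num_classes=20 with the mask as the last class


def is_masked_list(sample: List[int], mask_index: int = MASK_INDEX) -> List[float]:
    """List-based analog of `(sample == mask_dim).float()`."""
    return [1.0 if token == mask_index else 0.0 for token in sample]


tokens = [3, 19, 7, 19, 0]
flags = is_masked_list(tokens)             # [0.0, 1.0, 0.0, 1.0, 0.0]
fraction_masked = sum(flags) / len(flags)  # 0.4
print(flags, fraction_masked)
```

Because the result is a float rather than a boolean, it can be summed or averaged directly, for example to track what fraction of positions are still masked.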

pad_sample(sample)

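Pads the input sample with a single zero along the last dimension, giving shape (..., N+1).

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample | Tensor | The input sample to be padded. | required |

Returns

| Type | Description |
| --- | --- |
| Tensor | The padded sample. |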

Source code in bionemo/moco/distributions/prior/discrete/mask.py
def pad_sample(self, sample: Tensor) -> Tensor:
    """Pads the input sample with zeros along the last dimension.

    Args:
        sample (Tensor): The input sample to be padded.

    Returns:
        Tensor: The padded sample.
    """
    # Create a zeros tensor with the same shape as the original tensor, except the last dimension is 1
    zeros = torch.zeros((*sample.shape[:-1], 1), dtype=torch.float, device=sample.device)
    # Concatenate along the last dimension to make the shape (..., N+1)
    padded_sample = torch.cat((sample, zeros), dim=-1)
    return padded_sample
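
To make the shape change concrete, here is a plain-Python analog that uses nested lists in place of tensors. It assumes the input rows are one-hot vectors over the data classes only (the exclusive configuration); `pad_rows` is a hypothetical name used for this sketch.

```python
from typing import List


def pad_rows(rows: List[List[float]]) -> List[List[float]]:
    """Append a 0.0 to every row, mirroring torch.cat((sample, zeros), dim=-1)."""
    return [row + [0.0] for row in rows]


# Two positions, one-hot encoded over 3 data classes (no mask channel yet).
one_hot = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
padded = pad_rows(one_hot)
print(padded)  # [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
```

The appended entry is the slot for the separate MASK class; it is zero for real data because a data token never has probability mass on the mask.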

sample(shape, mask=None, device='cpu', rng_generator=None)

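Generates samples of the specified shape. Every entry is set to the mask index.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| shape | Tuple | The shape of the samples to generate. | required |
| device | Union[str, torch.device] | The device on which to create the samples, e.g. "cpu" or "cuda". | 'cpu' |
| mask | Optional[Tensor] | An optional mask to apply to the samples. | None |
| rng_generator | Optional[Generator] | An optional torch.Generator for reproducible sampling. | None |

Returns

| Type | Description |
| --- | --- |
| Tensor | An int64 tensor of samples. |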

Source code in bionemo/moco/distributions/prior/discrete/mask.py
def sample(
    self,
    shape: Tuple,
    mask: Optional[Tensor] = None,
    device: Union[str, torch.device] = "cpu",
    rng_generator: Optional[torch.Generator] = None,
) -> Tensor:
    """Generates a specified number of samples.

    Args:
        shape (Tuple): The shape of the samples to generate.
        device (Union[str, torch.device]): The device on which to create the samples, e.g. "cpu" or "cuda".
        mask (Optional[Tensor]): An optional mask to apply to the samples. Defaults to None.
        rng_generator: An optional :class:`torch.Generator` for reproducible sampling. Defaults to None.

    Returns:
        Tensor: An int64 tensor of samples.
    """
    samples = torch.ones(shape, dtype=torch.int64, device=device) * self.mask_dim
    if mask is not None:
        samples = samples * mask[(...,) + (None,) * (len(samples.shape) - len(mask.shape))]
    return samples
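
The effect of the optional mask can be sketched without torch. The version below is a simplified plain-Python analog restricted to a 2-D (batch, length) shape with a mask of the same shape; the real method also broadcasts a lower-rank mask over trailing dimensions, which this sketch does not attempt. `sample_all_masked` is a hypothetical name.

```python
from typing import List, Optional


def sample_all_masked(
    batch: int,
    length: int,
    mask_index: int,
    mask: Optional[List[List[int]]] = None,
) -> List[List[int]]:
    """Fill a (batch, length) grid with mask_index, then multiply by the mask."""
    samples = [[mask_index] * length for _ in range(batch)]
    if mask is not None:
        # Positions where the mask is 0 become 0, as with `samples * mask`.
        samples = [
            [value * keep for value, keep in zip(sample_row, mask_row)]
            for sample_row, mask_row in zip(samples, mask)
        ]
    return samples


padding_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
result = sample_all_masked(batch=2, length=4, mask_index=19, mask=padding_mask)
print(result)  # [[19, 19, 19, 0], [19, 19, 0, 0]]
```

Since the mask is applied by multiplication, excluded positions hold the value 0, which is also a valid class index; downstream code presumably needs to keep the mask around to tell the two apart.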