
Datasets

ESMMaskedResidueDataset

Bases: Dataset

Dataset class for ESM pretraining that implements cluster sampling of UniRef50 and UniRef90 sequences.

Megatron-LM expects the input datasets to be indexable, and the output of the dataset for a given index to be deterministic. With cluster sampling this can be tricky, since we need to perform weighted sampling over the UniRef50 clusters.

Here, getitem(i) returns a randomly sampled UniRef90 sequence from the i % len(dataset)-th UniRef50 cluster, with i controlling the random seed used for selecting the UniRef90 sequence and performing the masking.

Multi-epoch training

Currently, this class owns the logic for upsampling proteins for multi-epoch training by directly passing a total_samples larger than the number of clusters provided. This is done because Megatron training assumes that dataset[i] always returns exactly the same tensors in distributed training. Because we want the mask pattern and the cluster sampling to vary each time a given cluster is drawn, we create our own pseudo-epochs inside the dataset itself. Eventually we would like to move away from this paradigm, let multi-epoch training vary the dataset's random state through a callback, and let the Megatron samplers handle the epoch-to-epoch shuffling of sample order.
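
Below is a minimal usage sketch, not taken from the source file, showing the deterministic pseudo-epoch sampling contract described above. The EpochIndex import path and the toy in-memory protein dataset are illustrative assumptions; in practice the protein dataset is a ProteinSQLiteDataset.

import torch
from torch.utils.data import Dataset

from bionemo.core.data.multi_epoch_dataset import EpochIndex  # assumed import path
from bionemo.esm2.data.dataset import ESMMaskedResidueDataset


class ToyProteinDataset(Dataset):
    """Stand-in for ProteinSQLiteDataset: maps sequence ids to raw sequences."""

    def __init__(self, sequences: dict):
        self._sequences = sequences

    def __getitem__(self, idx: str) -> str:
        return self._sequences[idx]


proteins = ToyProteinDataset(
    {"UniRef90_A": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "UniRef90_B": "MKTAYIAKQRQISFVKSHFSRQ"}
)
clusters = [["UniRef90_A", "UniRef90_B"]]  # one UniRef50 cluster with two UniRef90 members

dataset = ESMMaskedResidueDataset(protein_dataset=proteins, clusters=clusters, seed=42)

# The same (epoch, idx) pair always yields the same member choice, crop, and mask.
a = dataset[EpochIndex(epoch=0, idx=0)]
b = dataset[EpochIndex(epoch=0, idx=0)]
assert torch.equal(a["text"], b["text"])

# A different epoch re-rolls both the sampled cluster member and the mask pattern.
c = dataset[EpochIndex(epoch=1, idx=0)]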

Source code in bionemo/esm2/data/dataset.py
class ESMMaskedResidueDataset(Dataset):
    """Dataset class for ESM pretraining that implements cluster sampling of UniRef50 and UniRef90 sequences.

    Megatron-LM expects the input datasets to be indexable, and for the output of the dataset for a given index to be
    deterministic. In cluster sampling, this can be tricky, since we need to perform weighted sampling over UniRef50
    clusters.

    Here, the getitem(i) returns a randomly sampled UniRef90 sequence from the i % len(dataset) UniRef50 cluster, with i
    controlling the random seed used for selecting the UniRef90 sequence and performing the masking.

    !!! note "Multi-epoch training"

        Currently, this class owns the logic for upsampling proteins for multi-epoch training by directly passing a
        total_samples that's larger than the number of clusters provided. This is done because megatron training assumes
    that `dataset[i]` will always return the exact same tensors in distributed training. Because we want to vary
        mask patterns and cluster sampling each time a given cluster is sampled, we create our own pseudo-epochs inside
        the dataset itself. Eventually we'd like to move away from this paradigm and allow multi-epoch training to vary
        the dataset's random state through a callback, and allow megatron samplers to handle the epoch-to-epoch
        shuffling of sample order.

    """

    def __init__(
        self,
        protein_dataset: Dataset,
        clusters: Sequence[Sequence[str]],
        seed: int = np.random.SeedSequence().entropy,  # type: ignore
        max_seq_length: int = 1024,
        mask_prob: float = 0.15,
        mask_token_prob: float = 0.8,
        mask_random_prob: float = 0.1,
        random_mask_strategy: RandomMaskStrategy = RandomMaskStrategy.ALL_TOKENS,
        tokenizer: tokenizer.BioNeMoESMTokenizer = tokenizer.get_tokenizer(),
    ) -> None:
        """Initializes the dataset.

        Args:
            protein_dataset: Dataset containing protein sequences, indexed by UniRef90 ids.
            clusters: UniRef90 ids for all training sequences, bucketed by UniRef50 cluster. Alternatively for
                validation, this can also just be a list of UniRef50 ids, with each entry being a length-1 list with a
                single UniRef50 id.
            seed: Random seed for reproducibility. This seed is mixed with the index of the sample to retrieve to ensure
                that __getitem__ is deterministic, but can be random across different runs. If None, a random seed is
                generated.
            max_seq_length: Crop long sequences to a maximum of this length, including BOS and EOS tokens.
            mask_prob: The overall probability a token is included in the loss function. Defaults to 0.15.
            mask_token_prob: Proportion of masked tokens that get assigned the <MASK> id. Defaults to 0.8.
            mask_random_prob: Proportion of tokens that get assigned a random natural amino acid. Defaults to 0.1.
            random_mask_strategy: Whether to replace random masked tokens with all tokens or amino acids only. Defaults to RandomMaskStrategy.ALL_TOKENS.
            tokenizer: The input ESM tokenizer. Defaults to the standard ESM tokenizer.
        """
        self.protein_dataset = protein_dataset
        self.clusters = clusters
        self.seed = seed
        self.max_seq_length = max_seq_length
        self.random_mask_strategy = random_mask_strategy

        if tokenizer.mask_token_id is None:
            raise ValueError("Tokenizer does not have a mask token.")

        self.mask_config = masking.BertMaskConfig(
            tokenizer=tokenizer,
            random_tokens=range(len(tokenizer.all_tokens))
            if self.random_mask_strategy == RandomMaskStrategy.ALL_TOKENS
            else range(4, 24),
            mask_prob=mask_prob,
            mask_token_prob=mask_token_prob,
            random_token_prob=mask_random_prob,
        )

        self.tokenizer = tokenizer

    def __len__(self) -> int:
        """Returns the number of clusters, which constitutes a single epoch."""
        return len(self.clusters)

    def __getitem__(self, index: EpochIndex) -> BertSample:
        """Deterministically masks and returns a protein sequence from the dataset.

        This method samples from the i % len(dataset) cluster from the input clusters list. Random draws of the same
        cluster can be achieved by calling this method with i + len(dataset), i.e., wrapping around the dataset length.

        Args:
            index: The current epoch and the index of the cluster to sample.

        Returns:
            A (possibly-truncated), masked protein sequence with CLS and EOS tokens and associated mask fields.
        """
        # Initialize a random number generator with a seed that is a combination of the dataset seed, epoch, and index.
        rng = np.random.default_rng([self.seed, index.epoch, index.idx])
        if not len(self.clusters[index.idx]):
            raise ValueError(f"Cluster {index.idx} is empty.")

        sequence_id = rng.choice(self.clusters[index.idx])
        sequence = self.protein_dataset[sequence_id]

        # We don't want special tokens before we pass the input to the masking function; we add these in the collate_fn.
        tokenized_sequence = self._tokenize(sequence)
        cropped_sequence = _random_crop(tokenized_sequence, self.max_seq_length, rng)

        # Get a single integer seed for torch from our rng, since the index tuple is hard to pass directly to torch.
        torch_seed = random_utils.get_seed_from_rng(rng)
        masked_sequence, labels, loss_mask = masking.apply_bert_pretraining_mask(
            tokenized_sequence=cropped_sequence,  # type: ignore
            random_seed=torch_seed,
            mask_config=self.mask_config,
        )

        return {
            "text": masked_sequence,
            "types": torch.zeros_like(masked_sequence, dtype=torch.int64),
            "attention_mask": torch.ones_like(masked_sequence, dtype=torch.int64),
            "labels": labels,
            "loss_mask": loss_mask,
            "is_random": torch.zeros_like(masked_sequence, dtype=torch.int64),
        }

    def _tokenize(self, sequence: str) -> torch.Tensor:
        """Tokenize a protein sequence.

        Args:
            sequence: The protein sequence.

        Returns:
            The tokenized sequence.
        """
        tensor = self.tokenizer.encode(sequence, add_special_tokens=True, return_tensors="pt")
        return tensor.flatten()  # type: ignore

__getitem__(index)

Deterministically masks and returns a protein sequence from the dataset.

This method samples from the i % len(dataset)-th cluster in the input clusters list. Fresh random draws of the same cluster can be obtained by calling this method with i + len(dataset), i.e., wrapping around the dataset length.

Parameters:

    index (EpochIndex): The current epoch and the index of the cluster to sample. Required.

Returns:

    BertSample: A (possibly truncated), masked protein sequence with CLS and EOS tokens and associated mask fields.

Source code in bionemo/esm2/data/dataset.py
def __getitem__(self, index: EpochIndex) -> BertSample:
    """Deterministically masks and returns a protein sequence from the dataset.

    This method samples from the i % len(dataset) cluster from the input clusters list. Random draws of the same
    cluster can be achieved by calling this method with i + len(dataset), i.e., wrapping around the dataset length.

    Args:
        index: The current epoch and the index of the cluster to sample.

    Returns:
        A (possibly-truncated), masked protein sequence with CLS and EOS tokens and associated mask fields.
    """
    # Initialize a random number generator with a seed that is a combination of the dataset seed, epoch, and index.
    rng = np.random.default_rng([self.seed, index.epoch, index.idx])
    if not len(self.clusters[index.idx]):
        raise ValueError(f"Cluster {index.idx} is empty.")

    sequence_id = rng.choice(self.clusters[index.idx])
    sequence = self.protein_dataset[sequence_id]

    # We don't want special tokens before we pass the input to the masking function; we add these in the collate_fn.
    tokenized_sequence = self._tokenize(sequence)
    cropped_sequence = _random_crop(tokenized_sequence, self.max_seq_length, rng)

    # Get a single integer seed for torch from our rng, since the index tuple is hard to pass directly to torch.
    torch_seed = random_utils.get_seed_from_rng(rng)
    masked_sequence, labels, loss_mask = masking.apply_bert_pretraining_mask(
        tokenized_sequence=cropped_sequence,  # type: ignore
        random_seed=torch_seed,
        mask_config=self.mask_config,
    )

    return {
        "text": masked_sequence,
        "types": torch.zeros_like(masked_sequence, dtype=torch.int64),
        "attention_mask": torch.ones_like(masked_sequence, dtype=torch.int64),
        "labels": labels,
        "loss_mask": loss_mask,
        "is_random": torch.zeros_like(masked_sequence, dtype=torch.int64),
    }
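
Continuing the toy sketch above, the following lines, which are illustrative rather than from the source, inspect the BertSample dictionary returned by this method.

sample = dataset[EpochIndex(epoch=0, idx=0)]
print(sorted(sample.keys()))           # attention_mask, is_random, labels, loss_mask, text, types
print(sample["text"].shape)            # 1-D: the (possibly cropped) tokenized length, incl. CLS/EOS
print(int(sample["loss_mask"].sum()))  # number of positions that contribute to the loss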

__init__(protein_dataset, clusters, seed=np.random.SeedSequence().entropy, max_seq_length=1024, mask_prob=0.15, mask_token_prob=0.8, mask_random_prob=0.1, random_mask_strategy=RandomMaskStrategy.ALL_TOKENS, tokenizer=tokenizer.get_tokenizer())

Initializes the dataset.

Parameters:

    protein_dataset (Dataset): Dataset containing protein sequences, indexed by UniRef90 IDs. Required.
    clusters (Sequence[Sequence[str]]): UniRef90 IDs for all training sequences, bucketed by UniRef50 cluster. Alternatively, for validation, this can also just be a list of UniRef50 IDs, with each entry being a length-1 list containing a single UniRef50 ID. Required.
    seed (int): Random seed for reproducibility. This seed is mixed with the index of the sample to retrieve, so that __getitem__ is deterministic within a run but can vary across runs. If None, a random seed is generated. Default: np.random.SeedSequence().entropy.
    max_seq_length (int): Crop long sequences to a maximum of this length, including BOS and EOS tokens. Default: 1024.
    mask_prob (float): Overall probability that a token is included in the loss function. Default: 0.15.
    mask_token_prob (float): Proportion of masked tokens that are assigned the <MASK> id. Default: 0.8.
    mask_random_prob (float): Proportion of tokens that are assigned a random natural amino acid. Default: 0.1.
    random_mask_strategy (RandomMaskStrategy): Whether to replace randomly masked tokens with all tokens or amino acids only. Default: RandomMaskStrategy.ALL_TOKENS.
    tokenizer (BioNeMoESMTokenizer): The input ESM tokenizer. Default: get_tokenizer().
Source code in bionemo/esm2/data/dataset.py
def __init__(
    self,
    protein_dataset: Dataset,
    clusters: Sequence[Sequence[str]],
    seed: int = np.random.SeedSequence().entropy,  # type: ignore
    max_seq_length: int = 1024,
    mask_prob: float = 0.15,
    mask_token_prob: float = 0.8,
    mask_random_prob: float = 0.1,
    random_mask_strategy: RandomMaskStrategy = RandomMaskStrategy.ALL_TOKENS,
    tokenizer: tokenizer.BioNeMoESMTokenizer = tokenizer.get_tokenizer(),
) -> None:
    """Initializes the dataset.

    Args:
        protein_dataset: Dataset containing protein sequences, indexed by UniRef90 ids.
        clusters: UniRef90 ids for all training sequences, bucketed by UniRef50 cluster. Alternatively for
            validation, this can also just be a list of UniRef50 ids, with each entry being a length-1 list with a
            single UniRef50 id.
        seed: Random seed for reproducibility. This seed is mixed with the index of the sample to retrieve to ensure
            that __getitem__ is deterministic, but can be random across different runs. If None, a random seed is
            generated.
        max_seq_length: Crop long sequences to a maximum of this length, including BOS and EOS tokens.
        mask_prob: The overall probability a token is included in the loss function. Defaults to 0.15.
        mask_token_prob: Proportion of masked tokens that get assigned the <MASK> id. Defaults to 0.8.
        mask_random_prob: Proportion of tokens that get assigned a random natural amino acid. Defaults to 0.1.
        random_mask_strategy: Whether to replace random masked tokens with all tokens or amino acids only. Defaults to RandomMaskStrategy.ALL_TOKENS.
        tokenizer: The input ESM tokenizer. Defaults to the standard ESM tokenizer.
    """
    self.protein_dataset = protein_dataset
    self.clusters = clusters
    self.seed = seed
    self.max_seq_length = max_seq_length
    self.random_mask_strategy = random_mask_strategy

    if tokenizer.mask_token_id is None:
        raise ValueError("Tokenizer does not have a mask token.")

    self.mask_config = masking.BertMaskConfig(
        tokenizer=tokenizer,
        random_tokens=range(len(tokenizer.all_tokens))
        if self.random_mask_strategy == RandomMaskStrategy.ALL_TOKENS
        else range(4, 24),
        mask_prob=mask_prob,
        mask_token_prob=mask_token_prob,
        random_token_prob=mask_random_prob,
    )

    self.tokenizer = tokenizer

__len__()

Returns the number of clusters, which constitutes a single epoch.

Source code in bionemo/esm2/data/dataset.py
def __len__(self) -> int:
    """Returns the number of clusters, which constitutes a single epoch."""
    return len(self.clusters)

ProteinSQLiteDataset

Bases: Dataset

Dataset for protein sequences stored in a SQLite database.

Source code in bionemo/esm2/data/dataset.py
class ProteinSQLiteDataset(Dataset):
    """Dataset for protein sequences stored in a SQLite database."""

    def __init__(self, db_path: str | os.PathLike):
        """Initializes the dataset.

        Args:
            db_path: Path to the SQLite database.
        """
        self.conn = sqlite3.connect(str(db_path))
        self.cursor = self.conn.cursor()
        self._len = None

    def __len__(self) -> int:
        """Returns the number of proteins in the dataset.

        Returns:
            Number of proteins in the dataset.
        """
        if self._len is None:
            self.cursor.execute("SELECT COUNT(*) FROM protein")
            self._len = int(self.cursor.fetchone()[0])
        return self._len

    def __getitem__(self, idx: str) -> str:
        """Returns the sequence of a protein at a given index.

        TODO: This method may want to support batched indexing for improved performance.

        Args:
            idx: An identifier for the protein sequence. For training data, these are UniRef90 IDs, while for validation
                data, they are UniRef50 IDs.

        Returns:
            The protein sequence as a string.
        """
        if not isinstance(idx, str):
            raise TypeError(f"Expected string, got {type(idx)}: {idx}.")

        self.cursor.execute("SELECT sequence FROM protein WHERE id = ?", (idx,))
        return self.cursor.fetchone()[0]
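
A minimal sketch, not the project's preprocessing pipeline: it builds a toy SQLite database with the protein(id, sequence) schema queried above and reads a sequence back by its UniRef ID. Column types are assumptions.

import sqlite3
import tempfile
from pathlib import Path

from bionemo.esm2.data.dataset import ProteinSQLiteDataset

db_path = Path(tempfile.mkdtemp()) / "proteins.db"
with sqlite3.connect(str(db_path)) as conn:
    conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, sequence TEXT)")  # column types assumed
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?)",
        [("UniRef90_A", "MKTAYIAKQR"), ("UniRef90_B", "MKTAYIAKQS")],
    )

dataset = ProteinSQLiteDataset(db_path)
print(len(dataset))           # 2
print(dataset["UniRef90_A"])  # MKTAYIAKQR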

__getitem__(idx)

Returns the sequence of a protein at a given index.

TODO: This method may want to support batched indexing for improved performance.

Parameters:

    idx (str): An identifier for the protein sequence. For training data, these are UniRef90 IDs, while for validation data, they are UniRef50 IDs. Required.

Returns:

    str: The protein sequence as a string.

Source code in bionemo/esm2/data/dataset.py
def __getitem__(self, idx: str) -> str:
    """Returns the sequence of a protein at a given index.

    TODO: This method may want to support batched indexing for improved performance.

    Args:
        idx: An identifier for the protein sequence. For training data, these are UniRef90 IDs, while for validation
            data, they are UniRef50 IDs.

    Returns:
        The protein sequence as a string.
    """
    if not isinstance(idx, str):
        raise TypeError(f"Expected string, got {type(idx)}: {idx}.")

    self.cursor.execute("SELECT sequence FROM protein WHERE id = ?", (idx,))
    return self.cursor.fetchone()[0]

__init__(db_path)

Initializes the dataset.

Parameters:

    db_path (str | os.PathLike): Path to the SQLite database. Required.
Source code in bionemo/esm2/data/dataset.py
def __init__(self, db_path: str | os.PathLike):
    """Initializes the dataset.

    Args:
        db_path: Path to the SQLite database.
    """
    self.conn = sqlite3.connect(str(db_path))
    self.cursor = self.conn.cursor()
    self._len = None

__len__()

Returns the number of proteins in the dataset.

Returns:

    int: Number of proteins in the dataset.

Source code in bionemo/esm2/data/dataset.py
def __len__(self) -> int:
    """Returns the number of proteins in the dataset.

    Returns:
        Number of proteins in the dataset.
    """
    if self._len is None:
        self.cursor.execute("SELECT COUNT(*) FROM protein")
        self._len = int(self.cursor.fetchone()[0])
    return self._len

RandomMaskStrategy

Bases: str, Enum

Enum for different random masking strategies.

In ESM2 pretraining, 15% of all tokens are masked, and among those, 10% are replaced with a random token. This class controls the set of random tokens to choose from.

Source code in bionemo/esm2/data/dataset.py
class RandomMaskStrategy(str, Enum):
    """Enum for different random masking strategies.

    In ESM2 pretraining, 15% of all tokens are masked and among which 10% are replaced with a random token. This class controls the set of random tokens to choose from.

    """

    AMINO_ACIDS_ONLY = "amino_acids_only"
    """Mask only with amino acid tokens."""

    ALL_TOKENS = "all_tokens"
    """Mask with all tokens in the tokenizer, including special tokens, padding and non-canonical amino acid tokens."""

ALL_TOKENS = 'all_tokens' class-attribute instance-attribute

Mask with all tokens in the tokenizer, including special tokens, padding, and non-canonical amino acid tokens.

AMINO_ACIDS_ONLY = 'amino_acids_only' class-attribute instance-attribute

Mask only with amino acid tokens.
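
Because the enum inherits from str, members compare equal to their plain string values; a short illustrative check:

from bionemo.esm2.data.dataset import RandomMaskStrategy

strategy = RandomMaskStrategy.AMINO_ACIDS_ONLY
assert strategy == "amino_acids_only"          # str-enum members equal their values
assert RandomMaskStrategy.ALL_TOKENS.value == "all_tokens"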

create_train_dataset(cluster_file, db_path, total_samples, seed, max_seq_length=1024, mask_prob=0.15, mask_token_prob=0.8, mask_random_prob=0.1, random_mask_strategy=RandomMaskStrategy.ALL_TOKENS, tokenizer=tokenizer.get_tokenizer())

Creates a training dataset for ESM pretraining.

Parameters:

    cluster_file (str | os.PathLike): Path to the cluster file. The file should contain a "ur90_id" column, where each row contains a list of UniRef90 IDs for a single UniRef50 cluster. Required.
    db_path (str | os.PathLike): Path to the SQLite database. Required.
    total_samples (int): Total number of samples to draw from the dataset. Required.
    seed (int): Random seed for reproducibility. Required.
    max_seq_length (int): Crop long sequences to a maximum of this length, including BOS and EOS tokens. Default: 1024.
    mask_prob (float): Overall probability that a token is included in the loss function. Default: 0.15.
    mask_token_prob (float): Proportion of masked tokens that are assigned the <MASK> id. Default: 0.8.
    mask_random_prob (float): Proportion of tokens that are assigned a random natural amino acid. Default: 0.1.
    random_mask_strategy (RandomMaskStrategy): Whether to replace randomly masked tokens with all tokens or amino acids only. Default: RandomMaskStrategy.ALL_TOKENS.
    tokenizer (BioNeMoESMTokenizer): The input ESM tokenizer. Default: get_tokenizer().

Returns:

    A dataset for ESM pretraining.

Raises:

    ValueError: If the cluster file does not exist, the database file does not exist, or the cluster file does not contain a "ur90_id" column.

Source code in bionemo/esm2/data/dataset.py
def create_train_dataset(
    cluster_file: str | os.PathLike,
    db_path: str | os.PathLike,
    total_samples: int,
    seed: int,
    max_seq_length: int = 1024,
    mask_prob: float = 0.15,
    mask_token_prob: float = 0.8,
    mask_random_prob: float = 0.1,
    random_mask_strategy: RandomMaskStrategy = RandomMaskStrategy.ALL_TOKENS,
    tokenizer: tokenizer.BioNeMoESMTokenizer = tokenizer.get_tokenizer(),
):
    """Creates a training dataset for ESM pretraining.

    Args:
        cluster_file: Path to the cluster file. The file should contain a "ur90_id" column, where each row contains a
            list of UniRef90 ids for a single UniRef50 cluster.
        db_path: Path to the SQLite database.
        total_samples: Total number of samples to draw from the dataset.
        seed: Random seed for reproducibility.
        max_seq_length: Crop long sequences to a maximum of this length, including BOS and EOS tokens.
        mask_prob: The overall probability a token is included in the loss function. Defaults to 0.15.
        mask_token_prob: Proportion of masked tokens that get assigned the <MASK> id. Defaults to 0.8.
        mask_random_prob: Proportion of tokens that get assigned a random natural amino acid. Defaults to 0.1.
        random_mask_strategy: Whether to replace random masked tokens with all tokens or amino acids only. Defaults to RandomMaskStrategy.ALL_TOKENS.
        tokenizer: The input ESM tokenizer. Defaults to the standard ESM tokenizer.

    Returns:
        A dataset for ESM pretraining.

    Raises:
        ValueError: If the cluster file does not exist, the database file does not exist, or the cluster file does not
            contain a "ur90_id" column.
    """
    if not Path(cluster_file).exists():
        raise ValueError(f"Cluster file {cluster_file} not found.")

    if not Path(db_path).exists():
        raise ValueError(f"Database file {db_path} not found.")

    cluster_df = pd.read_parquet(cluster_file)
    if "ur90_id" not in cluster_df.columns:
        raise ValueError(f"Training cluster file must contain a 'ur90_id' column. Found columns {cluster_df.columns}.")

    protein_dataset = ProteinSQLiteDataset(db_path)
    masked_cluster_dataset = ESMMaskedResidueDataset(
        protein_dataset=protein_dataset,
        clusters=cluster_df["ur90_id"],
        seed=seed,
        max_seq_length=max_seq_length,
        mask_prob=mask_prob,
        mask_token_prob=mask_token_prob,
        mask_random_prob=mask_random_prob,
        random_mask_strategy=random_mask_strategy,
        tokenizer=tokenizer,
    )

    return MultiEpochDatasetResampler(masked_cluster_dataset, num_samples=total_samples, shuffle=True, seed=seed)
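
A hedged usage sketch: the file paths are hypothetical, and the parquet file is assumed to contain a "ur90_id" column whose rows are lists of UniRef90 IDs for a single UniRef50 cluster, as described above.

from bionemo.esm2.data.dataset import create_train_dataset

train_dataset = create_train_dataset(
    cluster_file="train_clusters.parquet",  # hypothetical path
    db_path="uniref90_sequences.db",        # hypothetical path
    total_samples=1_000_000,  # larger than the cluster count -> pseudo-epoch upsampling
    seed=42,
    max_seq_length=1024,
)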

create_valid_clusters(cluster_file)

Creates a pandas Series of UniRef50 cluster IDs from a cluster parquet file.

Parameters:

    cluster_file (str | os.PathLike): Path to the cluster file. The file should contain a single column named "ur50_id" with UniRef50 IDs, one UniRef50 ID per row. Required.

Returns:

    Series: A pandas Series of UniRef50 cluster IDs.

Source code in bionemo/esm2/data/dataset.py
def create_valid_clusters(cluster_file: str | os.PathLike) -> pd.Series:
    """Create a pandas series of UniRef50 cluster IDs from a cluster parquet file.

    Args:
        cluster_file: Path to the cluster file. The file should contain a single column named "ur50_id" with UniRef50
        IDs, with one UniRef50 ID per row.

    Returns:
        A pandas series of UniRef50 cluster IDs.
    """
    if not Path(cluster_file).exists():
        raise ValueError(f"Cluster file {cluster_file} not found.")

    cluster_df = pd.read_parquet(cluster_file)
    if "ur50_id" not in cluster_df.columns:
        raise ValueError(
            f"Validation cluster file must contain a 'ur50_id' column. Found columns {cluster_df.columns}."
        )
    clusters = cluster_df["ur50_id"].apply(lambda x: [x])
    return clusters

create_valid_dataset(clusters, db_path, seed, total_samples=None, max_seq_length=1024, mask_prob=0.15, mask_token_prob=0.8, mask_random_prob=0.1, random_mask_strategy=RandomMaskStrategy.ALL_TOKENS, tokenizer=tokenizer.get_tokenizer())

Creates a validation dataset for ESM pretraining.

Parameters:

    clusters (pd.Series | str | os.PathLike): Clusters as a pd.Series, or path to the cluster file. The file should contain a single column named "ur50_id" with UniRef50 IDs, one UniRef50 ID per row. Required.
    db_path (str | os.PathLike): Path to the SQLite database. Required.
    seed (int): Random seed for reproducibility. Required.
    total_samples (int | None): Total number of samples to draw from the dataset. Default: None.
    max_seq_length (int): Crop long sequences to a maximum of this length, including BOS and EOS tokens. Default: 1024.
    mask_prob (float): Overall probability that a token is included in the loss function. Default: 0.15.
    mask_token_prob (float): Proportion of masked tokens that are assigned the <MASK> id. Default: 0.8.
    mask_random_prob (float): Proportion of tokens that are assigned a random natural amino acid. Default: 0.1.
    random_mask_strategy (RandomMaskStrategy): Whether to replace randomly masked tokens with all tokens or amino acids only. Default: RandomMaskStrategy.ALL_TOKENS.
    tokenizer (BioNeMoESMTokenizer): The input ESM tokenizer. Default: get_tokenizer().

Raises:

    ValueError: If the cluster file does not exist, the database file does not exist, or the cluster file does not contain a "ur50_id" column.

Source code in bionemo/esm2/data/dataset.py
def create_valid_dataset(  # noqa: D417
    clusters: pd.Series | str | os.PathLike,
    db_path: str | os.PathLike,
    seed: int,
    total_samples: int | None = None,
    max_seq_length: int = 1024,
    mask_prob: float = 0.15,
    mask_token_prob: float = 0.8,
    mask_random_prob: float = 0.1,
    random_mask_strategy: RandomMaskStrategy = RandomMaskStrategy.ALL_TOKENS,
    tokenizer: tokenizer.BioNeMoESMTokenizer = tokenizer.get_tokenizer(),
):
    """Creates a validation dataset for ESM pretraining.

    Args:
        clusters: Clusters as a pd.Series, or path to the cluster file. The file should contain a single column
            named "ur50_id" with UniRef50 IDs, with one UniRef50 ID per row.
        db_path: Path to the SQLite database.
        total_samples: Total number of samples to draw from the dataset.
        seed: Random seed for reproducibility.
        max_seq_length: Crop long sequences to a maximum of this length, including BOS and EOS tokens.
        mask_prob: The overall probability a token is included in the loss function. Defaults to 0.15.
        mask_token_prob: Proportion of masked tokens that get assigned the <MASK> id. Defaults to 0.8.
        mask_random_prob: Proportion of tokens that get assigned a random natural amino acid. Defaults to 0.1.
        random_mask_strategy: Whether to replace random masked tokens with all tokens or amino acids only. Defaults to RandomMaskStrategy.ALL_TOKENS.

    Raises:
        ValueError: If the cluster file does not exist, the database file does not exist, or the cluster file does not
            contain a "ur50_id" column.
    """
    if isinstance(clusters, (str, os.PathLike)):
        clusters = create_valid_clusters(clusters)

    elif not isinstance(clusters, pd.Series):
        raise ValueError(f"Clusters must be a pandas Series. Got {type(clusters)}.")

    if not Path(db_path).exists():
        raise ValueError(f"Database file {db_path} not found.")

    protein_dataset = ProteinSQLiteDataset(db_path)
    masked_dataset = ESMMaskedResidueDataset(
        protein_dataset=protein_dataset,
        clusters=clusters,
        seed=seed,
        max_seq_length=max_seq_length,
        mask_prob=mask_prob,
        mask_token_prob=mask_token_prob,
        mask_random_prob=mask_random_prob,
        random_mask_strategy=random_mask_strategy,
        tokenizer=tokenizer,
    )

    return MultiEpochDatasetResampler(masked_dataset, num_samples=total_samples, shuffle=True, seed=seed)
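
A hedged sketch for the validation path; the file paths are hypothetical, and the parquet file is assumed to contain a "ur50_id" column with one UniRef50 ID per row.

from bionemo.esm2.data.dataset import create_valid_clusters, create_valid_dataset

valid_clusters = create_valid_clusters("valid_clusters.parquet")  # hypothetical path; each entry is a length-1 list with one UniRef50 ID
valid_dataset = create_valid_dataset(
    clusters=valid_clusters,          # a path to the parquet file is also accepted
    db_path="uniref50_sequences.db",  # hypothetical path
    seed=42,
)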