Dataset

SingleCellDataset

Bases: Dataset

A dataset class for single-cell pre-training. These can be generated using the sc_memmap.py script. Future updates will contain a more comprehensive workflow for generating a sparse memmap from scRNA-seq data.

Parameters

data_path (str): Path where the single-cell files are stored in SingleCell Memmap format. It should contain the following files:
    - metadata.json: contains the number of rows in the dataset.
    - Gene expression matrix stored in CSR format as numpy.memmap:
        - data.npy: non-zero gene expression values.
        - col_ptr.npy: gene indices corresponding to each entry in data.npy.
        - row_ptr.npy: row index pointers for each cell sample.
    Required.
tokenizer (Any): The tokenizer to use for tokenizing the input data. Required.
median_dict (dict): A dictionary containing median values for each gene. Defaults to None.
max_len (int): The maximum length of the input sequence. Defaults to 1024.
include_unrecognized_vocab_in_dataset (bool): If set to True, a hard check is performed to verify that all gene identifiers are in the user-supplied tokenizer vocab. Defaults to False, which means any gene identifier not in the user-supplied tokenizer vocab is excluded.

Attributes

data_path (str): Path where the single-cell files are stored in SCDL memmap format.
max_len (int): The maximum length of the input sequence.
metadata (dict): Metadata loaded from metadata.json.
gene_medians (dict): A dictionary containing median values for each gene. If None, a median of '1' is assumed for all genes.
num_train (int): The number of samples in the training split.
num_val (int): The number of samples in the validation split.
num_test (int): The number of samples in the test split.
index_offset (int): The offset applied to the indices.
length (int): The total number of samples in the dataset.
gene_data (numpy.memmap): Gene expression values stored in CSR format.
gene_data_indices (numpy.memmap): Gene indices associated with the gene values.
gene_data_ptr (numpy.memmap): Row index pointers for each sample.
tokenizer: The tokenizer used for tokenizing the input data.
dataset_ccum (numpy.ndarray): Cumulative sum of row counts, used to map row indices to dataset IDs.
dataset_map (dict): Mapping of dataset ID to dataset name.
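
To make the CSR layout of gene_data, gene_data_indices, and gene_data_ptr concrete, here is a minimal sketch using toy NumPy arrays in place of the real memmaps; the array contents are illustrative only.

import numpy as np

# Toy CSR arrays standing in for gene_data, gene_data_indices, and gene_data_ptr.
gene_data = np.array([3.0, 1.0, 7.0, 2.0])    # non-zero expression values
gene_data_indices = np.array([5, 12, 0, 9])   # gene (column) index of each value
gene_data_ptr = np.array([0, 2, 4])           # row offsets: cell i spans [ptr[i], ptr[i+1])

cell = 1
start, end = gene_data_ptr[cell], gene_data_ptr[cell + 1]
values = gene_data[start:end]             # array([7., 2.])
gene_idxs = gene_data_indices[start:end]  # array([0, 9])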

Methods

__len__(): Returns the length of the dataset.
__getitem__(idx): Returns the item at the given index.

See Also

bionemo/data/singlecell/sc_memmap.py - creates the artifacts required for instantiating a single-cell dataset from hdf5 files.
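
As a rough usage sketch, assuming the memmap artifacts already exist under a local directory (data/cellxgene_scdl is a made-up path) and that a tokenizer and median dictionary have been obtained separately, the dataset can be instantiated like this:

from bionemo.geneformer.data.singlecell.dataset import SingleCellDataset

# `tokenizer` and `median_dict` are assumed to come from your own preprocessing;
# the path below is a placeholder for a directory produced by sc_memmap.py / SCDL.
dataset = SingleCellDataset(
    data_path="data/cellxgene_scdl",
    tokenizer=tokenizer,
    median_dict=median_dict,
    max_len=1024,
)
print(len(dataset))  # total number of cells in the memmap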

Source code in bionemo/geneformer/data/singlecell/dataset.py
class SingleCellDataset(Dataset):
    """A dataset class for single-cell pre-training. These can be generated using the sc_memmap.py script. Future
    updates will contain more comprehensive workflows for generating a Sparse Memmap from scRNA-seq.

    Args:
        data_path (str): Path where the single cell files are stored in SingleCell Memmap format. It should contain the following files:
            - `metadata.json`: Contains the number of rows in the dataset.
            - Gene expression matrix stored in CSR format as `numpy.memmap`:
                - `data.npy`: Non-zero gene expression values.
                - `col_ptr.npy`: Indices of the corresponding genes for each entry in data.npy.
                - `row_ptr.npy`: Row index pointers for each cell sample.
        tokenizer: The tokenizer to use for tokenizing the input data.
        median_dict (dict, optional): A dictionary containing median values for each gene. Defaults to None.
        max_len (int, optional): The maximum length of the input sequence. Defaults to 1024.
        include_unrecognized_vocab_in_dataset (bool, optional): If set to True, a hard check is performed to verify all gene identifiers are in the user-supplied tokenizer vocab. Defaults to False, which means any gene identifier not in the user-supplied tokenizer vocab will be excluded.

    Attributes:
        data_path (str): Path where the single cell files are stored in SCDL memmap format.
        max_len (int): The maximum length of the input sequence.
        metadata (dict): Metadata loaded from `metadata.json`.
        gene_medians (dict): A dictionary containing median values for each gene. If None, a median of '1' is assumed for all genes.
        num_train (int): The number of samples in the training split.
        num_val (int): The number of samples in the validation split.
        num_test (int): The number of samples in the test split.
        index_offset (int): The offset to apply to the indices.
        length (int): The total number of samples in the dataset.
        gene_data (numpy.memmap): Gene expression values stored in CSR format.
        gene_data_indices (numpy.memmap): Gene indices associated with gene values.
        gene_data_ptr (numpy.memmap): Row index pointers for each sample.
        tokenizer: The tokenizer used for tokenizing the input data.
        dataset_ccum (numpy.ndarray): Cumulative sum of row counts to map row indices to dataset id.
        dataset_map (dict): Mapping of dataset id to dataset name.

    Methods:
        __len__(): Returns the length of the dataset.
        __getitem__(idx): Returns the item at the given index.

    See Also:
        bionemo/data/singlecell/sc_memmap.py - creates the artifacts required for instantiating a singlecell dataset from hdf5 files.
    """  # noqa: D205

    def __init__(  # noqa: D107
        self,
        data_path: str | Path,
        tokenizer: Any,
        median_dict: Optional[dict] = None,
        max_len: int = 1024,
        mask_prob: float = 0.15,
        mask_token_prob: float = 0.8,
        random_token_prob: float = 0.1,
        prepend_cls_token: bool = True,
        eos_token: int | None = None,
        include_unrecognized_vocab_in_dataset: bool = False,
        seed: int = np.random.SeedSequence().entropy,  # type: ignore
    ):
        super().__init__()

        self.data_path = data_path
        self.max_len = max_len
        self.random_token_prob = random_token_prob
        self.mask_token_prob = mask_token_prob
        self.mask_prob = mask_prob
        self.prepend_cls_token = prepend_cls_token
        self._seed = seed
        self.eos_token = eos_token

        self.scdl = SingleCellMemMapDataset(str(data_path))
        self.length = len(self.scdl)
        # - median dict
        self.gene_medians = median_dict
        self.tokenizer = tokenizer
        self.include_unrecognized_vocab_in_dataset = include_unrecognized_vocab_in_dataset

    def __len__(self):  # noqa: D105
        return self.length

    def __getitem__(self, index: EpochIndex) -> types.BertSample:
        """Performs a lookup and the required transformation for the model."""
        rng = np.random.default_rng([self._seed, index.epoch, index.idx])
        values, feature_ids = self.scdl.get_row(index.idx, return_features=True, feature_vars=["feature_id"])
        assert (
            len(feature_ids) == 1
        )  # we expect feature_ids to be a list containing one np.array with the row's feature ids
        gene_data, col_idxs = np.array(values[0]), np.array(values[1])
        if len(gene_data) == 0:
            raise ValueError(
                "SingleCellMemap data provided is invalid; the gene expression data parsed for the specified index is empty."
            )
        return process_item(
            gene_data,
            col_idxs,
            feature_ids[0],
            self.tokenizer,
            gene_median=self.gene_medians,
            rng=rng,
            max_len=self.max_len,
            mask_token_prob=self.mask_token_prob,
            mask_prob=self.mask_prob,
            random_token_prob=self.random_token_prob,
            prepend_cls_token=self.prepend_cls_token,
            eos_token=self.eos_token,
            include_unrecognized_vocab_in_dataset=self.include_unrecognized_vocab_in_dataset,
        )

__getitem__(index)

Performs a lookup and the required transformation for the model.

Source code in bionemo/geneformer/data/singlecell/dataset.py
def __getitem__(self, index: EpochIndex) -> types.BertSample:
    """Performs a lookup and the required transformation for the model."""
    rng = np.random.default_rng([self._seed, index.epoch, index.idx])
    values, feature_ids = self.scdl.get_row(index.idx, return_features=True, feature_vars=["feature_id"])
    assert (
        len(feature_ids) == 1
    )  # we expect feature_ids to be a list containing one np.array with the row's feature ids
    gene_data, col_idxs = np.array(values[0]), np.array(values[1])
    if len(gene_data) == 0:
        raise ValueError(
            "SingleCellMemap data provided is invalid; the gene expression data parsed for the specified index is empty."
        )
    return process_item(
        gene_data,
        col_idxs,
        feature_ids[0],
        self.tokenizer,
        gene_median=self.gene_medians,
        rng=rng,
        max_len=self.max_len,
        mask_token_prob=self.mask_token_prob,
        mask_prob=self.mask_prob,
        random_token_prob=self.random_token_prob,
        prepend_cls_token=self.prepend_cls_token,
        eos_token=self.eos_token,
        include_unrecognized_vocab_in_dataset=self.include_unrecognized_vocab_in_dataset,
    )
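
A small sketch of looking up one item, continuing the hypothetical `dataset` object from the earlier example; the EpochIndex import path is an assumption. Because the per-item RNG is seeded from (seed, epoch, idx), repeating the same lookup yields the same masked sample.

import torch
from bionemo.core.data.multi_epoch_dataset import EpochIndex  # import path assumed

index = EpochIndex(epoch=0, idx=0)
sample_a = dataset[index]
sample_b = dataset[index]

# The RNG is derived from (seed, epoch, idx), so the masking is reproducible.
assert torch.equal(sample_a["text"], sample_b["text"])
print(list(sample_a.keys()))  # ['text', 'types', 'attention_mask', 'labels', 'loss_mask', 'is_random']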

process_item(gene_data, gene_idxs, feature_ids, tokenizer, gene_median, rng, max_len=1024, mask_prob=0.15, mask_token_prob=0.8, random_token_prob=0.1, target_sum=10000, normalize=True, prepend_cls_token=True, eos_token=None, include_unrecognized_vocab_in_dataset=False)

Process a single item in the dataset.

Optionally performs median normalization and rank ordering. The tokenizer's CLS token is added to the beginning of every sample. Gene names are converted to Ensembl IDs before tokenizing. Expects gene_median to contain Ensembl IDs as keys.

Parameters

gene_data (list): List of gene data; these are expression counts. Required.
gene_idxs (list): List of gene indices; these are keys in 'metadata['feature_ids']' and correspond to the CSR entries. Required.
feature_ids (list): Feature IDs for the full dataset. Required.
tokenizer (Tokenizer): Tokenizer object. Required.
gene_median (dict): Dictionary of gene medians. Expects Ensembl IDs as keys. Required.
rng (Generator): Random number generator to ensure deterministic results. Required.
max_len (int): Maximum length of the item. Defaults to 1024. Sequences shorter than max_len are padded and sequences longer than max_len are truncated.
mask_prob (float): Probability of masking a token. Defaults to 0.15.
mask_token_prob (float): Probability of replacing a masked token with the mask token. Defaults to 0.8.
random_token_prob (float): Probability of replacing a masked token with a random token. Defaults to 0.1.
target_sum (int): Target sum for normalization. Defaults to 10000.
normalize (bool): Flag to normalize the gene data. Defaults to True. When set, gene tokens are re-ordered by their median-normalized expression value.
prepend_cls_token (bool): Whether to prepend the CLS token to each sample. Defaults to True.
eos_token (int | None): Optional EOS token id to append to each sample. Defaults to None.
include_unrecognized_vocab_in_dataset (bool): If set to True, a hard check is performed to verify that all gene identifiers are in the user-supplied tokenizer vocab. Defaults to False, which means any gene identifier not in the user-supplied tokenizer vocab is excluded.

Returns

dict (BertSample): Processed item dictionary.

NOTE: This method is very important and very useful. To generalize it, we should add an abstraction for datasets that have some kind of functor transformation.
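
The normalization and rank-ordering step can be illustrated with toy numbers (a minimal sketch, not the actual implementation, which operates on the tokenized gene arrays):

import numpy as np

# Toy expression counts for one cell, per-gene medians, and token ids (illustrative values).
gene_expression_cell = np.array([5.0, 1.0, 4.0])
gene_expression_medians = np.array([8.0, 1.0, 2.0])
token_ids = np.array([101, 102, 103])
target_sum = 10_000

# 1. Scale the counts so they sum to target_sum.
normalized = gene_expression_cell / gene_expression_cell.sum() * target_sum
# 2. Divide by each gene's median expression.
normalized = normalized / gene_expression_medians
# 3. Rank order: sort the gene tokens by normalized expression, descending.
order = np.argsort(-normalized)
token_ids = token_ids[order]

print(normalized[order])  # highest median-normalized expression first
print(token_ids)          # gene tokens in rank order, ready for truncation and masking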

Source code in bionemo/geneformer/data/singlecell/dataset.py
def process_item(  # noqa: D417
    gene_data: np.ndarray,
    gene_idxs: np.ndarray,
    feature_ids: np.ndarray,
    tokenizer: GeneTokenizer,
    gene_median: dict,
    rng: np.random.Generator,
    max_len: int = 1024,
    mask_prob: float = 0.15,
    mask_token_prob: float = 0.8,
    random_token_prob: float = 0.1,
    target_sum: int = 10000,
    normalize: bool = True,
    prepend_cls_token: bool = True,
    eos_token: None | int = None,
    include_unrecognized_vocab_in_dataset: bool = False,
) -> types.BertSample:
    """Process a single item in the dataset.

    Optionally performs median normalization and rank ordering. The tokenizer's CLS token is added to the beginning
    of every sample. Converts gene names to Ensembl IDs before tokenizing. Expects gene_median to contain Ensembl IDs as keys.

    Args:
        gene_data (list): List of gene data, these are expression counts.
        gene_idxs (list): List of gene indices, these are keys in 'metadata['feature_ids']' and correspond to the CSR entries.
        feature_ids (list): Feature ids for the full dataset.
        tokenizer (Tokenizer): Tokenizer object.
        gene_median (dict): Dictionary of gene medians. Expects Ensembl IDs to be keys.
        rng: Random number generator to ensure deterministic results.
        max_len (int): Maximum length of the item. Defaults to 1024. Applies padding to any sequence shorter than max_len and truncates any sequence longer than max_len.
        mask_prob (float): Probability of masking a token. Defaults to 0.15.
        target_sum (int): Target sum for normalization. Defaults to 10000.
        normalize (bool): Flag to normalize the gene data. Defaults to True.
            When set, this re-orders the gene tokens by their median expression value.
        mask_token_prob (float): Probability of replacing a masked token with the mask token. Defaults to 0.8.
        random_token_prob (float): Probability of replacing a masked token with a random token. Defaults to 0.1.
        prepend_cls_token (bool): Whether to prepend the CLS token to each sample. Defaults to True.
        eos_token (None | int): Optional EOS token id to append to each sample. Defaults to None.
        include_unrecognized_vocab_in_dataset (bool, optional): If set to True, a hard check is performed to verify all gene identifiers are in the user-supplied tokenizer vocab. Defaults to False, which means any gene identifier not in the user-supplied tokenizer vocab will be excluded.

    Returns:
        dict: Processed item dictionary.

    NOTE: this method is very important and very useful. To generalize this we should add an abstraction for
        Datasets that have some kind of functor transformation.
    """
    if max_len < 1:
        raise ValueError(f"max_len must be greater than 1, {max_len=}")

    if gene_median is None:
        raise ValueError("gene_median must be provided for this tokenizer")

    if prepend_cls_token:
        max_len = max_len - 1  # - minus 1 for [CLS] token
    if eos_token is not None:
        max_len = max_len - 1  # - minus 1 for [EOS] token

    gene_names = feature_ids[gene_idxs]

    gene_expression_cell, token_ids, gene_expression_medians = _gather_medians(
        gene_names,
        gene_data,
        normalize,
        tokenizer.vocab,
        gene_median,
        include_unrecognized_vocab_in_dataset=include_unrecognized_vocab_in_dataset,
    )

    if normalize:
        # re-order according to expression median normalized rank. descending order.

        gene_expression_cell = gene_expression_cell / gene_expression_cell.sum() * target_sum
        gene_expression_cell = gene_expression_cell / gene_expression_medians.astype(float)
        idxs = np.argsort(
            -gene_expression_cell
        )  # sort in descending order so that the 0th position is the highest value.
        gene_expression_cell = gene_expression_cell[idxs]
        token_ids = token_ids[idxs]

    # - select max_len subset, set sample to False so it doesn't permute the already rank-ordered expression values.
    token_ids = sample_or_truncate(token_ids, max_len, sample=False)
    with torch.no_grad(), torch.device("cpu"):
        masked_tokens, labels, loss_mask = masking.apply_bert_pretraining_mask(
            tokenized_sequence=torch.from_numpy(token_ids),
            random_seed=int(random_utils.get_seed_from_rng(rng)),
            mask_config=masking.BertMaskConfig(
                tokenizer=tokenizer,
                random_tokens=range(len(tokenizer.special_tokens), len(tokenizer.vocab)),
                mask_prob=mask_prob,
                mask_token_prob=mask_token_prob,
                random_token_prob=random_token_prob,
            ),
        )
        cls_token = tokenizer.token_to_id(tokenizer.cls_token) if prepend_cls_token else None
        if cls_token is not None or eos_token is not None:
            masked_tokens, labels, loss_mask = masking.add_cls_and_eos_tokens(
                sequence=masked_tokens,
                labels=labels,
                loss_mask=loss_mask,
                cls_token=cls_token,
                eos_token=eos_token,
            )

        # NeMo megatron assumes this return structure.
        return {
            "text": masked_tokens,
            "types": torch.zeros_like(masked_tokens, dtype=torch.int64),
            "attention_mask": torch.ones_like(masked_tokens, dtype=torch.int64),
            "labels": labels,
            "loss_mask": loss_mask,
            "is_random": torch.zeros_like(masked_tokens, dtype=torch.int64),
        }