跳到内容

Nvfaidx

NvFaidx

NvFaidx 是 PyFaidx 的 rest + pyo3 替代品,它为参考基因组提供类似字典的接口。

此类是 SequenceAccessor 的集合,通过类似字典的方式按序列 ID 组织。SequenceAcecessors 类似于底层索引中实际序列条目的类似字典的接口。此外,还提供了用于解析 faidx 文件、构建 faidx 文件以及将 faidx 文件存储到磁盘的实用程序。

重要提示 默认情况下,所有 fasta 文件都构建内存中的 faidx 对象。这是由于在使用多进程(例如,默认构造函数在运行时创建这些文件)时可能发生的简单错误。但是,存在手动创建这些方法的方法,用户可以更好地控制多进程。

示例

>>> index = NvFaidx(fasta_file, faidx_path=None, ignore_existing_fai=True)
>>> index['chr1'] # Returns a SequenceAccessor for chr1
>>> index['chr1'][0:10] # Returns the first 10 bases of chr1.
>>> faidx_filename = NvFaidx.create_faidx(fasta_file) # Creates a faidx to disk.
>>> index = NvFaidx(fasta_File, faidx_filename, ignore_existing_fai = True) # Uses a faidx from disk.

动机和更多细节

NvFaidx 使用 Noodles 作为 Fai 对象的后端构建,并使用内存映射来支持底层 fasta。使用 Memmaps 的后端提供以下好处: - 内核通过使用页面错误来实现此机制 - mmap'd 文件中的每次读取都会导致页面错误:内存中没有可读取的内容! - 内核通过转到磁盘、读取指定偏移量 + 索引中的文件,然后将刚刚读取的内容返回给用户进程来处理此页面错误,从而防止上下文切换带来的惩罚。

背景:PyFaidx 或任何基于缓冲读取的索引都不是进程安全的,因此与 pytorch 数据加载器不兼容。由于操作顺序,底层文件句柄在进程之间共享,当调用 seek() 执行随机查找时,这可能会导致派生进程中出现意外行为。参考:https://github.com/mdshw5/pyfaidx/issues/211

对于一个好的解决方案,我们需要三件事

1) 安全的索引创建,在多进程或多节点场景中,这应限制为单个节点,其中所有工作进程都将阻塞,直到完成(上面未实现) 2) 索引对象实例化必须快速。 3) 索引对象的只读使用必须在 Python 中既是线程安全的又是进程安全的。

另请参阅:bionemo.noodles.nvfaidx.SequenceAccessor

源代码在 bionemo/noodles/nvfaidx.py
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
class NvFaidx:
    """NvFaidx is a rest + pyo3 replacement for PyFaidx that provides a dictionary-like interface to reference genomes.

    This class is a collection of SequenceAccessors, organized by sequence-id in a dictionary like manner. SequenceAcecessors
     are similar dict-like interfaces over actual sequence entries in the underlying index. Furthermore, utilities are provided
     for parsing faidx files, building faidx files, and storing faidx files to disk.

    **IMPORTANT** by default all fasta files build an in-memory faidx object. This is due easy mistakes that may occur
    if a faidx file is constructed while using multi-processing (such as a default constructor that creates these files on the fly).
    However, methods exist to create these methods manually where a user has more control over multiprocessing.

    Examples:
        >>> index = NvFaidx(fasta_file, faidx_path=None, ignore_existing_fai=True)
        >>> index['chr1'] # Returns a SequenceAccessor for chr1
        >>> index['chr1'][0:10] # Returns the first 10 bases of chr1.
        >>> faidx_filename = NvFaidx.create_faidx(fasta_file) # Creates a faidx to disk.
        >>> index = NvFaidx(fasta_File, faidx_filename, ignore_existing_fai = True) # Uses a faidx from disk.


    Motivation and more details:

    NvFaidx is built using Noodles as a backend for Fai objects, and memory maps for backing the underlying fasta.
    Using a backend of Memmaps provide the following benefits:
        - The kernel implements this mechanism by using page faults
        - Each read in a mmap'd file results in a page fault: there's nothing in memory to read!
        - The kernel handles this page fault by going to the disk, reading the file in the specified offset + index,
            then returning to the user process with what it just read, preventing penalties from context switching.

    *Context*: PyFaidx or _any_ buffered read based index is not process safe, and therefore does not play nice with pytorch dataloaders.
    Due to the order of operations, the underlying file handle is shared between processes, when `seek()` is called to perform random lookups,
    this can cause unexpected behavior in the forked processes.
    Ref: https://github.com/mdshw5/pyfaidx/issues/211

    For a good solution we need three things:
        1) Safe index creation, in multi-process or multi-node scenarios, this should be restricted to a single node
            where all workers block until it is complete (not implemented above)
        2) Index object instantion must be fast.
        3) Read-only use of the index object must be both thread safe and process safe with python.

    See Also: bionemo.noodles.nvfaidx.SequenceAccessor
    """

    def __init__(
        self,
        fasta_path: str | Path,
        faidx_path: Optional[str | Path] = None,
        ignore_existing_fai: bool = True,
        allow_duplicate_seqids: bool = False,
    ):
        """Construct a dict-like object representing a memmapped, indexed FASTA file.

        This is an indexed fasta reader. Consequences of this are that the FASTA file must be well formed, meaning
        sequence-ids and line-lengths must conform to FASTA standards. Additionally, the order of returned seqid, sequence
        pairs when iterating over the index is not guaranteed to be the same order as the underlying fasta file.

        Args:
            fasta_path (str): Path to the FASTA file.
            faidx_path (str): Path to the FAI index file. If None, one will be created.
            ignore_existing_fai (bool): If True, ignore any existing FAI file and create an in-memory index. Note that
                this will also ignore `faidx_path`.
            allow_duplicate_seqids (bool): If true, will produce index for invalid fastas which contain duplicate seqids.
                In this scenario, indexing is performed by integer rather than strings.

                Example with invalid seqids.
                    >chr1 dupes|not|allowd
                    ATGATGATGATG
                    >chr1 whoops|there|is|dupe
                    ATGATGATGATG
                NvFaidx:
                    {
                        0 : SequenceAccessor(chr1 dupes|not|allowed),
                        1 : SequenceAccessor(chr1 whoops|there|is|dupe)
                    }

        """
        if isinstance(fasta_path, Path):
            fasta_path = str(fasta_path)
        elif not isinstance(fasta_path, str):
            raise TypeError(f"fasta_path must be a `str` or `pathlib.Path`, got: {type(fasta_path)}")

        if isinstance(faidx_path, Path):
            faidx_path = str(faidx_path)
        elif not isinstance(faidx_path, str) and faidx_path is not None:
            raise TypeError(f"faidx_path must be a `str`, `pathlib.Path`, or None. got: {type(faidx_path)}")

        match (fasta_path, faidx_path, ignore_existing_fai):
            case (_, _, True):
                self.reader = PyIndexedMmapFastaReader(fasta_path, ignore_existing_fai=ignore_existing_fai)
            case (_, faidx_path, _) if faidx_path is not None:
                self.reader = PyIndexedMmapFastaReader.from_fasta_and_faidx(fasta_path, faidx_path)
            # In this case, faidx path is None and ignore_existing is False, and it covers all other cases.
            case (_, None, False):
                # But the logic here doesnt make sense, ignore_existing is false, but it should only use if it if it exists.
                self.reader = PyIndexedMmapFastaReader(fasta_path, False)
            case _:
                raise ValueError("unreachable condition.")

        self.records: Dict[str | int, PyFaidxRecord] = {record.name: record for record in self.reader.records()}
        if len(self.records) != len(self.reader.records()):
            if not allow_duplicate_seqids:
                raise ValueError(
                    "Non-unique sequence-id detected in FASTA, this is invalid. Correct headers and try again or pass allow_duplicate_seqid'"
                )
            else:
                self.records: Dict[str | int, PyFaidxRecord] = dict(enumerate(self.reader.records()))

    def __getitem__(self, seqid: str) -> SequenceAccessor:  # noqa: D105
        if seqid not in self.records:
            raise KeyError(f"Sequence '{seqid}' not found in index.")

        # Return a SequenceAccessor for slicing access
        record_length = self.records[seqid].length
        return SequenceAccessor(self.reader, seqid, record_length)

    def __contains__(self, seqid: str) -> bool:  # noqa: D105
        return seqid in self.records

    def __len__(self) -> int:  # noqa: D105
        return len(self.records)

    def keys(self) -> set[str]:  # noqa: D102
        return set(self.records.keys())

    # These provide dict like iteration functionality
    def __iter__(self):  # noqa: D105
        return iter(self.keys())

    def items(self):  # noqa: D102
        for key in self.keys():
            yield key, self[key][:]

    def values(self):  # noqa: D102
        for key in self.keys():
            yield self[key][:]

    @staticmethod
    def create_faidx(fasta_filename: str | Path, force: bool = False) -> str:
        """Create a FAI index for a FASTA file, the result is saved in the same location as `fasta_filename`, with a .fai extension.

        Args:
            fasta_filename (str): Path to the FASTA file to be indexed.
            force (bool): Delete existing faidx file and create a new index file.
        """
        if isinstance(fasta_filename, Path):
            fasta_filename = str(fasta_filename)
        return PyIndexedMmapFastaReader.create_faidx(fasta_filename, force)

__init__(fasta_path, faidx_path=None, ignore_existing_fai=True, allow_duplicate_seqids=False)

构造一个类似字典的对象,表示内存映射的索引 FASTA 文件。

这是一个索引的 fasta 读取器。这样做的后果是,FASTA 文件必须格式良好,这意味着序列 ID 和行长度必须符合 FASTA 标准。此外,在迭代索引时,返回的 seqid、序列对的顺序不保证与底层 fasta 文件的顺序相同。

参数

名称 类型 描述 默认值
fasta_path str

FASTA 文件的路径。

必需
faidx_path str

FAI 索引文件的路径。如果为 None,将创建一个。

None
ignore_existing_fai bool

如果为 True,则忽略任何现有的 FAI 文件并创建内存索引。请注意,这也将忽略 faidx_path

True
allow_duplicate_seqids bool

如果为 true,将为包含重复 seqid 的无效 fasta 生成索引。在这种情况下,索引是通过整数而不是字符串执行的。

具有无效 seqid 的示例。 >chr1 dupes|not|allowd ATGATGATGATG >chr1 whoops|there|is|dupe ATGATGATGATG NvFaidx: { 0 : SequenceAccessor(chr1 dupes|not|allowed), 1 : SequenceAccessor(chr1 whoops|there|is|dupe) }

False
源代码在 bionemo/noodles/nvfaidx.py
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
def __init__(
    self,
    fasta_path: str | Path,
    faidx_path: Optional[str | Path] = None,
    ignore_existing_fai: bool = True,
    allow_duplicate_seqids: bool = False,
):
    """Construct a dict-like object representing a memmapped, indexed FASTA file.

    This is an indexed fasta reader. Consequences of this are that the FASTA file must be well formed, meaning
    sequence-ids and line-lengths must conform to FASTA standards. Additionally, the order of returned seqid, sequence
    pairs when iterating over the index is not guaranteed to be the same order as the underlying fasta file.

    Args:
        fasta_path (str): Path to the FASTA file.
        faidx_path (str): Path to the FAI index file. If None, one will be created.
        ignore_existing_fai (bool): If True, ignore any existing FAI file and create an in-memory index. Note that
            this will also ignore `faidx_path`.
        allow_duplicate_seqids (bool): If true, will produce index for invalid fastas which contain duplicate seqids.
            In this scenario, indexing is performed by integer rather than strings.

            Example with invalid seqids.
                >chr1 dupes|not|allowd
                ATGATGATGATG
                >chr1 whoops|there|is|dupe
                ATGATGATGATG
            NvFaidx:
                {
                    0 : SequenceAccessor(chr1 dupes|not|allowed),
                    1 : SequenceAccessor(chr1 whoops|there|is|dupe)
                }

    """
    if isinstance(fasta_path, Path):
        fasta_path = str(fasta_path)
    elif not isinstance(fasta_path, str):
        raise TypeError(f"fasta_path must be a `str` or `pathlib.Path`, got: {type(fasta_path)}")

    if isinstance(faidx_path, Path):
        faidx_path = str(faidx_path)
    elif not isinstance(faidx_path, str) and faidx_path is not None:
        raise TypeError(f"faidx_path must be a `str`, `pathlib.Path`, or None. got: {type(faidx_path)}")

    match (fasta_path, faidx_path, ignore_existing_fai):
        case (_, _, True):
            self.reader = PyIndexedMmapFastaReader(fasta_path, ignore_existing_fai=ignore_existing_fai)
        case (_, faidx_path, _) if faidx_path is not None:
            self.reader = PyIndexedMmapFastaReader.from_fasta_and_faidx(fasta_path, faidx_path)
        # In this case, faidx path is None and ignore_existing is False, and it covers all other cases.
        case (_, None, False):
            # But the logic here doesnt make sense, ignore_existing is false, but it should only use if it if it exists.
            self.reader = PyIndexedMmapFastaReader(fasta_path, False)
        case _:
            raise ValueError("unreachable condition.")

    self.records: Dict[str | int, PyFaidxRecord] = {record.name: record for record in self.reader.records()}
    if len(self.records) != len(self.reader.records()):
        if not allow_duplicate_seqids:
            raise ValueError(
                "Non-unique sequence-id detected in FASTA, this is invalid. Correct headers and try again or pass allow_duplicate_seqid'"
            )
        else:
            self.records: Dict[str | int, PyFaidxRecord] = dict(enumerate(self.reader.records()))

create_faidx(fasta_filename, force=False) staticmethod

为 FASTA 文件创建 FAI 索引,结果保存在与 fasta_filename 相同的位置,扩展名为 .fai。

参数

名称 类型 描述 默认值
fasta_filename str

要索引的 FASTA 文件的路径。

必需
force bool

删除现有的 faidx 文件并创建一个新的索引文件。

False
源代码在 bionemo/noodles/nvfaidx.py
243
244
245
246
247
248
249
250
251
252
253
@staticmethod
def create_faidx(fasta_filename: str | Path, force: bool = False) -> str:
    """Create a FAI index for a FASTA file, the result is saved in the same location as `fasta_filename`, with a .fai extension.

    Args:
        fasta_filename (str): Path to the FASTA file to be indexed.
        force (bool): Delete existing faidx file and create a new index file.
    """
    if isinstance(fasta_filename, Path):
        fasta_filename = str(fasta_filename)
    return PyIndexedMmapFastaReader.create_faidx(fasta_filename, force)

SequenceAccessor

SequenceAccessor 为索引的 FASTA 文件中的单个序列提供类似字典的接口。

这允许随机访问序列,可以通过索引或切片。

源代码在 bionemo/noodles/nvfaidx.py
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
class SequenceAccessor:
    """SequenceAccessor provides a dictionary-like interface to a single sequence in an indexed FASTA file.

    This allows for random access to the sequence, either by index, or by slice.
    """

    def __init__(self, reader: PyIndexedMmapFastaReader, seqid: str, length: int) -> None:
        """Construct a SequenceAccessor object. Ultimately this is used as a convenience object with NvFaidx.

        When querying the following are true:
            - Negative indexing is supported, but it does not wrap. so query[-10000] for a sequence of length 1 will fail.
            - out of bounds indexing is truncated: query[1:999999999] will return a string from position 1 to the terminus.
            - reversed slices return the empty string: query[999:1] is the empty string.
            - empty slice returns the full string: query[:] is the full string of the sequence.
            - beginning of slice is beyond the range of the contig, the empty string is returned.

        Additionally there are convenience methods that you may find useful in the class definition.

        Args:
            reader (PyIndexedMmapFastaReader): The indexed reader object that provides access to the underlying FASTA file.
            seqid (str): The sequence identifier.
            length (int): The length of the sequence.
        """
        self.reader = reader
        self.seqid = seqid
        self.length = length

    def __getitem__(self, key: int | slice) -> str:  # noqa: D105
        if isinstance(key, slice):
            # Provide defaults for missing arguments in the slice.
            start = key.start if key.start is not None else 0
            stop = key.stop if key.stop is not None else self.length

            # Handle negative cases, remember, you can be arbitrarily negative in a slice.
            if start < 0:
                start += self.length
            if stop < 0:
                stop += self.length

            # Clamp normalized indices to valid range
            start = max(0, min(self.length, start))
            stop = max(0, min(self.length, stop))

            # Bounds checking after normalization
            if start > stop:
                return ""  # Return empty string for an empty slice

            # Construct region string
            region_str = f"{self.seqid}:{start + 1}-{stop}"  # +1 for 1-based indexing
            return self.reader.read_sequence_mmap(region_str)

        elif isinstance(key, int):
            # Normalize single integer for negative indexing
            if key < 0:
                key += self.length

            # Bounds checking
            if key < 0 or key >= self.length:
                raise IndexError(f"Position {key} is out of bounds for '{self.seqid}' with length {self.length}.")

            # Query single nucleotide by creating a 1-length region
            region_str = f"{self.seqid}:{key + 1}-{key + 1}"  # +1 for 1-based indexing
            return self.reader.read_sequence_mmap(region_str)

        else:
            raise TypeError("Index must be an integer or a slice.")

    def __len__(self) -> int:  # noqa: D105
        return self.length

    def sequence_id(self) -> str:
        """Returns the sequenceid of this SequenceAccessor."""
        return self.seqid

    def sequence(self) -> str:
        """Returns the sequence associated with this SequenceAccessor as a string."""
        return self[:]

__init__(reader, seqid, length)

构造 SequenceAccessor 对象。最终,这用作 NvFaidx 的便利对象。

查询时,以下情况为真
  • 支持负索引,但不换行。因此,查询长度为 1 的序列的 query[-10000] 将失败。
  • 超出范围的索引将被截断:query[1:999999999] 将返回从位置 1 到末尾的字符串。
  • 反向切片返回空字符串:query[999:1] 是空字符串。
  • 空切片返回完整字符串:query[:] 是序列的完整字符串。
  • 切片的开头超出 contig 的范围,将返回空字符串。

此外,类定义中还有您可能会觉得有用的便捷方法。

参数

名称 类型 描述 默认值
reader PyIndexedMmapFastaReader

索引读取器对象,提供对底层 FASTA 文件的访问。

必需
seqid str

序列标识符。

必需
length int

序列的长度。

必需
源代码在 bionemo/noodles/nvfaidx.py
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
def __init__(self, reader: PyIndexedMmapFastaReader, seqid: str, length: int) -> None:
    """Construct a SequenceAccessor object. Ultimately this is used as a convenience object with NvFaidx.

    When querying the following are true:
        - Negative indexing is supported, but it does not wrap. so query[-10000] for a sequence of length 1 will fail.
        - out of bounds indexing is truncated: query[1:999999999] will return a string from position 1 to the terminus.
        - reversed slices return the empty string: query[999:1] is the empty string.
        - empty slice returns the full string: query[:] is the full string of the sequence.
        - beginning of slice is beyond the range of the contig, the empty string is returned.

    Additionally there are convenience methods that you may find useful in the class definition.

    Args:
        reader (PyIndexedMmapFastaReader): The indexed reader object that provides access to the underlying FASTA file.
        seqid (str): The sequence identifier.
        length (int): The length of the sequence.
    """
    self.reader = reader
    self.seqid = seqid
    self.length = length

sequence()

返回与此 SequenceAccessor 关联的序列,作为字符串。

源代码在 bionemo/noodles/nvfaidx.py
103
104
105
def sequence(self) -> str:
    """Returns the sequence associated with this SequenceAccessor as a string."""
    return self[:]

sequence_id()

返回此 SequenceAccessor 的序列 ID。

源代码在 bionemo/noodles/nvfaidx.py
 99
100
101
def sequence_id(self) -> str:
    """Returns the sequenceid of this SequenceAccessor."""
    return self.seqid