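Note: It takes about 10 minutes to deploy this notebook as a Launchable. As of this writing, we are working on a free tier, so a credit card may be required. You can reach out to your NVIDIA representative for credits.

ESM-2 Inference¶

This tutorial demonstrates how to run ESM-2 inference using a CSV file with a sequences column. To pre-train the ESM-2 model, refer to the ESM-2 Pretraining tutorial.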

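Note: Some of the cells below generate long text output. We use

%%capture --no-display --no-stderr cell_output

to suppress this output. Comment out or delete this line in the cells below to restore the full output.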

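Setup and Assumptions¶

In this tutorial, we demonstrate how to download an ESM-2 checkpoint, create a CSV file of protein sequences, and run inference with the ESM-2 model.

All commands should be executed inside the BioNeMo Docker container, which has all ESM-2 dependencies pre-installed. For more information on how to build or pull the BioNeMo2 container, refer to the Initialization Guide.

Import Required Libraries¶

In [1]: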

%%capture --no-display --no-stderr cell_output

import os
import torch
import shutil
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

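Working Directory¶

Set the work directory to store data and results.

Note: We set the following to clean up the work directory created by this notebook:

cleanup : bool = True

In [2]: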

cleanup : bool = True

In [3]:

work_dir="/workspace/bionemo2/esm2_inference_tutorial"

if cleanup and os.path.exists(work_dir):
    shutil.rmtree(work_dir)

if not os.path.exists(work_dir):
    os.makedirs(work_dir)
    print(f"Directory '{work_dir}' created.")
else:
    print(f"Directory '{work_dir}' already exists.")

Directory '/workspace/bionemo2/esm2_inference_tutorial' created.

Download Model Checkpoints¶

The following code downloads the pre-trained model esm2/650m:2.0 from the NGC registry:

In [4]:

from bionemo.core.data.load import load

checkpoint_path = load("esm2/650m:2.0")
print(checkpoint_path)

/home/bionemo/.cache/bionemo/0798767e843e3d54315aef91934d28ae7d8e93c2849d5fcfbdf5fac242013997-esm2_650M_nemo2.tar.gz.untar

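Data¶

We use the InMemoryCSVDataset class to load protein sequence data from a .csv file. The data file must contain at least a sequences column and can optionally contain a labels column for fine-tuning applications. Here is an example of how to create your own inference input data from a list of sequences in Python:

In [5]: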

import pandas as pd

artificial_sequence_data = [
    "TLILGWSDKLGSLLNQLAIANESLGGGTIAVMAERDKEDMELDIGKMEFDFKGTSVI",
    "LYSGDHSTQGARFLRDLAENTGRAEYELLSLF",
    "GRFNVWLGGNESKIRQVLKAVKEIGVSPTLFAVYEKN",
    "DELTALGGLLHDIGKPVQRAGLYSGDHSTQGARFLRDLAENTGRAEYELLSLF",
    "KLGSLLNQLAIANESLGGGTIAVMAERDKEDMELDIGKMEFDFKGTSVI",
    "LFGAIGNAISAIHGQSAVEELVDAFVGGARISSAFPYSGDTYYLPKP",
    "LGGLLHDIGKPVQRAGLYSGDHSTQGARFLRDLAENTGRAEYELLSLF",
    "LYSGDHSTQGARFLRDLAENTGRAEYELLSLF",
    "ISAIHGQSAVEELVDAFVGGARISSAFPYSGDTYYLPKP",
    "SGSKASSDSQDANQCCTSCEDNAPATSYCVECSEPLCETCVEAHQRVKYTKDHTVRSTGPAKT",
]

# Create a DataFrame
df = pd.DataFrame(artificial_sequence_data, columns=["sequences"])

# Save the DataFrame to a CSV file
data_path = os.path.join(work_dir, "sequences.csv")
df.to_csv(data_path, index=False)
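
The same CSV layout extends to the optional labels column mentioned above. The following is a minimal sketch using only Python's standard csv module. The file name sequences_with_labels.csv and the numeric label values are hypothetical placeholders chosen for illustration; they are not part of this tutorial's data.

```python
import csv
import os
import tempfile

# Hypothetical example rows: (sequence, label). The label values are made up
# for illustration and carry no biological meaning.
rows = [
    ("LYSGDHSTQGARFLRDLAENTGRAEYELLSLF", 0.1),
    ("GRFNVWLGGNESKIRQVLKAVKEIGVSPTLFAVYEKN", 0.7),
]

# Write to a temporary directory so the sketch does not touch work_dir.
out_dir = tempfile.mkdtemp()
labeled_path = os.path.join(out_dir, "sequences_with_labels.csv")

with open(labeled_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sequences", "labels"])  # required column, then optional column
    writer.writerows(rows)

# Read the file back to confirm the header and the row count.
with open(labeled_path, newline="") as f:
    records = list(csv.DictReader(f))

print(len(records), list(records[0].keys()))
```

For inference alone, the sequences column is sufficient, as in the cell above.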

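Run Inference¶

Similar to PyTorch Lightning, ESM-2 inference builds on a few key classes:

MegatronStrategy - launches and sets up parallelism for NeMo and Megatron-LM.
Trainer - configures training settings and logging.
ESMFineTuneDataModule - loads sequence data for fine-tuning and inference.
ESM2Config - configures the ESM-2 model as a BionemoLightningModule.

For a detailed description of these classes, refer to the ESM-2 Pretraining and ESM-2 Fine-tuning tutorials.

To run inference on the data created in the previous step, we can use the infer_esm2 executable, which calls bionemo-framework/sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/infer_esm2.py. A full description of the inference arguments is available by passing --help to the command:

In [6]: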

! infer_esm2 --help

2024-12-16 20:19:23 - faiss.loader - INFO - Loading faiss with AVX512 support.
2024-12-16 20:19:23 - faiss.loader - INFO - Successfully loaded faiss with AVX512 support.
[NeMo W 2024-12-16 20:19:24 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
      warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
    
[NeMo W 2024-12-16 20:19:24 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap(obj)`` instead.
      cm = get_cmap("Set1")
    
usage: infer_esm2 [-h] --checkpoint-path CHECKPOINT_PATH --data-path DATA_PATH
                  --results-path RESULTS_PATH
                  [--precision {fp16,bf16,fp32,bf16-mixed,fp32-mixed,16-mixed,fp16-mixed,16,32}]
                  [--num-gpus NUM_GPUS] [--num-nodes NUM_NODES]
                  [--micro-batch-size MICRO_BATCH_SIZE]
                  [--pipeline-model-parallel-size PIPELINE_MODEL_PARALLEL_SIZE]
                  [--tensor-model-parallel-size TENSOR_MODEL_PARALLEL_SIZE]
                  [--prediction-interval {epoch,batch}] [--include-hiddens]
                  [--include-input-ids] [--include-embeddings]
                  [--include-logits] [--config-class CONFIG_CLASS]

Infer ESM2.

options:
  -h, --help            show this help message and exit
  --checkpoint-path CHECKPOINT_PATH
                        Path to the ESM2 pretrained checkpoint
  --data-path DATA_PATH
                        Path to the CSV file containing sequences and label
                        columns
  --results-path RESULTS_PATH
                        Path to the results directory.
  --precision {fp16,bf16,fp32,bf16-mixed,fp32-mixed,16-mixed,fp16-mixed,16,32}
                        Precision type to use for training.
  --num-gpus NUM_GPUS   Number of GPUs to use for training. Default is 1.
  --num-nodes NUM_NODES
                        Number of nodes to use for training. Default is 1.
  --micro-batch-size MICRO_BATCH_SIZE
                        Micro-batch size. Global batch size is inferred from
                        this.
  --pipeline-model-parallel-size PIPELINE_MODEL_PARALLEL_SIZE
                        Pipeline model parallel size. Default is 1.
  --tensor-model-parallel-size TENSOR_MODEL_PARALLEL_SIZE
                        Tensor model parallel size. Default is 1.
  --prediction-interval {epoch,batch}
                        Intervals to write DDP predictions into disk
  --include-hiddens     Include hiddens in output of inference
  --include-input-ids   Include input_ids in output of inference
  --include-embeddings  Include embeddings in output of inference
  --include-logits      Include per-token logits in output.
  --config-class CONFIG_CLASS
                        Model configs link model classes with losses, and
                        handle model initialization (including from a prior
                        checkpoint). This is how you can fine-tune a model.
                        First train with one config class that points to one
                        model class and loss, then implement and provide an
                        alternative config class that points to a variant of
                        that model and alternative loss. In the future this
                        script should also provide similar support for picking
                        different data modules for fine-tuning with different
                        data types. Choices: dict_keys(['ESM2Config',
                        'ESM2FineTuneSeqConfig', 'ESM2FineTuneTokenConfig'])

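The hidden states (usually the output of each layer in a neural network) can be obtained by passing the --include-hiddens argument to the ESM-2 inference function in BioNeMo Framework.

The hidden states can be converted into fixed-size vector embeddings. This is done by removing the hidden state vectors that correspond to padding tokens and then averaging the rest. This process is commonly used when the goal is to create a single vector representation from the model's hidden states, which can then be used for various sequence-level downstream tasks such as classification (for example, subcellular localization) or regression (for example, melting temperature prediction). To obtain the embedding results, use the --include-embeddings argument.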
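
The pooling step described above can be sketched in plain Python on toy data. This is only an illustration of "drop padding positions, then average", not the BioNeMo implementation; the function name mean_pool and the toy values are assumptions. The padding ID of 1 matches the position of '<pad>' in the token list printed later in this tutorial.

```python
def mean_pool(hidden_states, input_ids, pad_id):
    """Average the hidden vectors of all non-padding positions of one sequence."""
    kept = [vec for vec, tok in zip(hidden_states, input_ids) if tok != pad_id]
    if not kept:
        raise ValueError("sequence contains only padding tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


PAD_ID = 1  # index of '<pad>' in the ESM-2 vocabulary shown later in this tutorial

# Toy input: 4 positions with hidden size 2; the last two positions are padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]
ids = [5, 6, PAD_ID, PAD_ID]

embedding = mean_pool(hidden, ids, PAD_ID)
print(embedding)  # [2.0, 3.0]
```

The padded positions (with the large values) do not influence the result, which is why the embedding has a fixed size regardless of sequence length.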

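By passing the hidden states of an amino acid sequence through the BERT language model head, we obtain output logits at each position, which can then be converted to probabilities. This is enabled with the --include-logits argument. Logits here are the raw, unnormalized scores that indicate the likelihood of each class; they are not probabilities themselves and can be any real number, including negative values.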
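
As a small illustration of turning logits into probabilities, the sketch below applies a softmax to the logits of a single position. The softmax is the standard choice for this conversion; the three toy values are made up, and a real position would have one logit per vocabulary entry.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


position_logits = [2.0, -1.0, 0.5]  # toy values; logits may be negative
probs = softmax(position_logits)
print(probs)
```

Larger logits map to larger probabilities, and negative logits still yield valid (small, positive) probabilities.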

Now let's call the infer_esm2 executable with the relevant arguments to compute and optionally return embeddings, hidden states, and logits.

In [7]:

%%capture --no-display --no-stderr cell_output

! infer_esm2 --checkpoint-path {checkpoint_path} \
             --data-path {data_path} \
             --results-path {work_dir} \
             --micro-batch-size 3 \
             --num-gpus 1 \
             --precision "bf16-mixed" \
             --include-hiddens \
             --include-embeddings \
             --include-logits \
             --include-input-ids
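
Outside a notebook, the same command could be assembled from Python. The sketch below only builds and prints the argument list, using flags that appear in the --help output above; the helper name build_infer_command and the /path/to/... values are hypothetical placeholders.

```python
import shlex


def build_infer_command(checkpoint_path, data_path, results_path,
                        micro_batch_size=3, num_gpus=1, precision="bf16-mixed"):
    """Assemble the infer_esm2 argument list used in this tutorial."""
    return [
        "infer_esm2",
        "--checkpoint-path", str(checkpoint_path),
        "--data-path", str(data_path),
        "--results-path", str(results_path),
        "--micro-batch-size", str(micro_batch_size),
        "--num-gpus", str(num_gpus),
        "--precision", precision,
        "--include-hiddens",
        "--include-embeddings",
        "--include-logits",
        "--include-input-ids",
    ]


# Placeholder paths; substitute checkpoint_path, data_path, and work_dir.
cmd = build_infer_command("/path/to/checkpoint", "/path/to/sequences.csv", "/path/to/results")
print(shlex.join(cmd))
```

Inside the BioNeMo container, the list could then be passed to subprocess.run(cmd, check=True) to launch inference.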

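Inference Results¶

Inference predictions are stored in one .pt file per device. Since we used only one device in the previous step (--num-gpus 1), the results were written to {work_dir}/predictions__rank_0.pt under the work directory of this notebook (defined above). The .pt file contains a dictionary of {'result_key': torch.Tensor} that can be loaded with PyTorch:

In [8]: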

import torch
results = torch.load(f"{work_dir}/predictions__rank_0.pt")

for key, val in results.items():
    if val is not None:
        print(f'{key}\t{val.shape}')

token_logits	torch.Size([1024, 10, 128])
hidden_states	torch.Size([10, 1024, 1280])
input_ids	torch.Size([10, 1024])
embeddings	torch.Size([10, 1280])

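In this example, results is a Python dictionary with the following keys: ['token_logits', 'hidden_states', 'input_ids', 'embeddings']. The logits (token_logits) tensor has dimensions [sequence, batch, hidden] to improve training performance. Below, we transpose the first two dimensions to get a batch-first shape like the rest of the output tensors.

In [9]: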

logits = results['token_logits'].transpose(0, 1)  # s, b, h  -> b, s, h
print(logits.shape)

torch.Size([10, 1024, 128])

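The last dimension of token_logits is 128, where the first 33 positions correspond to the amino acid vocabulary, followed by 95 padding positions. We use tokenizer.vocab_size to filter out the padding and keep only the 33 vocabulary positions.

In [10]: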

from bionemo.esm2.data.tokenizer import get_tokenizer
tokenizer = get_tokenizer()

tokens = tokenizer.all_tokens
print(f"There are {tokenizer.vocab_size} unique tokens: {tokens}.")

aa_logits = logits[..., :tokenizer.vocab_size]  # filter out the 95 paddings and only keep 33 vocab positions
print(f"Logits shape after removing the paddings in hidden dimension: {aa_logits.shape}")

There are 33 unique tokens: ['<cls>', '<pad>', '<eos>', '<unk>', 'L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', 'X', 'B', 'U', 'Z', 'O', '.', '-', '<null_1>', '<mask>'].
Logits shape after removing the paddings in hidden dimension: torch.Size([10, 1024, 33])
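
The 95 padding positions are consistent with rounding the 33-token vocabulary up to the next multiple of 128. The sketch below shows that arithmetic; the divisor of 128 is an assumption inferred from the shapes above and is not stated in this tutorial.

```python
def padded_vocab_size(vocab_size, multiple):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return -(-vocab_size // multiple) * multiple  # ceiling division, then scale


vocab_size = 33  # tokenizer.vocab_size reported above
padded = padded_vocab_size(vocab_size, 128)  # 128 is an assumed divisor
print(padded, padded - vocab_size)  # 128 95
```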

Let's set aside the tokens that correspond to the 20 known amino acids.

In [11]:

aa_tokens = ['L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C']

aa_indices = [i for i, token in enumerate(tokens) if token in aa_tokens]
extra_indices = [i for i, token in enumerate(tokens) if token not in aa_tokens]

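The sequence dimension in this example (1024) is the maximum sequence length, which includes padding, EOS, and BOS tokens. To filter for the relevant amino acid information, we can use the input sequence IDs in the results to create a mask that can be used to extract the relevant information in aa_logits:

In [12]: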

input_ids = results['input_ids'] # b, s
# mask where non-amino acid tokens are True
mask = torch.isin(input_ids, torch.tensor(extra_indices))
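
One way such a mask could be used is to skip non-amino-acid positions when reading off the most likely residue at each position. The following is a pure-Python sketch on toy data; the six-token vocabulary and all logit values are hypothetical stand-ins for the 33-token vocabulary and the real aa_logits, where the same steps would be done with tensor indexing.

```python
# Hypothetical miniature vocabulary standing in for the 33-token ESM-2 vocabulary.
toy_tokens = ["<cls>", "<pad>", "<eos>", "L", "A", "G"]
toy_aa_indices = [3, 4, 5]      # positions of amino acid tokens
toy_extra_indices = {0, 1, 2}   # positions of special tokens

# One toy sequence: <cls> L A <eos> <pad>
toy_input_ids = [0, 3, 4, 2, 1]

# Toy logits of shape [sequence, vocab]; the values are made up.
toy_logits = [
    [9.0, 0.0, 0.0, 0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0, 2.5, 0.1, 0.3],
    [0.0, 0.0, 0.0, 0.2, 1.9, 0.3],
    [0.0, 0.0, 8.0, 0.1, 0.2, 0.3],
    [0.0, 7.0, 0.0, 0.1, 0.2, 0.3],
]

# True where the input token is not an amino acid (same convention as `mask`).
toy_mask = [tok in toy_extra_indices for tok in toy_input_ids]

predicted = []
for is_extra, row in zip(toy_mask, toy_logits):
    if is_extra:
        continue  # skip <cls>, <eos>, and <pad> positions
    best = max(toy_aa_indices, key=lambda i: row[i])  # argmax over amino acids only
    predicted.append(toy_tokens[best])

print("".join(predicted))  # LA
```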

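DDP Inference Support¶

Although this tutorial uses a single device for inference, ESM-2 in BioNeMo Framework supports distributed inference. Simply set --num-gpus n to run distributed inference on n devices. The output predictions are written to predictions__rank_<0...n-1>.pt under the path provided by --results-path. Moreover, by optionally including the input token IDs with --include-input-ids, we can ensure a 1:1 mapping between input sequences and output predictions.

The following snippet can be used to load the predictions and collate them into a single dictionary.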

import glob
from bionemo.llm.lightning import batch_collator

collated_predictions = batch_collator([torch.load(path) for path in glob.glob(f"{work_dir}/predictions__rank_*.pt")])
for key, val in collated_predictions.items():
    if val is not None:
        print(f'{key}\t{val.shape}')

# token_logits	torch.Size([1024, 10, 128])
# hidden_states	torch.Size([10, 1024, 1280])
# input_ids     torch.Size([10, 1024])
# embeddings	torch.Size([10, 1280])
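
Because predictions gathered from several ranks may not come back in the order of the input CSV, the input IDs can be decoded back to strings to match each prediction row to its input sequence. The sketch below assumes the token list printed earlier in this tutorial; the helper name decode and the toy rows are made up for illustration.

```python
# Token list as printed by the tokenizer cell earlier in this tutorial.
TOKENS = ['<cls>', '<pad>', '<eos>', '<unk>', 'L', 'A', 'G', 'V', 'S', 'E', 'R',
          'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', 'X',
          'B', 'U', 'Z', 'O', '.', '-', '<null_1>', '<mask>']
SPECIAL = {"<cls>", "<pad>", "<eos>"}


def decode(ids):
    """Turn one row of input IDs back into its amino acid string."""
    return "".join(TOKENS[i] for i in ids if TOKENS[i] not in SPECIAL)


# Toy prediction rows, in an order that differs from the CSV.
toy_input_ids = [
    [0, 6, 5, 2, 1, 1],  # <cls> G A <eos> <pad> <pad>
    [0, 4, 8, 6, 2, 1],  # <cls> L S G <eos> <pad>
]
csv_sequences = ["LSG", "GA"]  # hypothetical CSV order

decoded = [decode(row) for row in toy_input_ids]
row_of = {seq: i for i, seq in enumerate(decoded)}

# For each CSV sequence, the index of its row in the prediction tensors.
order = [row_of[seq] for seq in csv_sequences]
print(order)  # [1, 0]
```

If the CSV contains duplicate sequences, as the example data in this tutorial does, any matching row can be used, since identical inputs yield the same predictions.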

For a more in-depth example of inference and of converting logits to probabilities, refer to the ESM-2 Mutant Design tutorial.