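Inference Endpoints
The AlphaFold2-Multimer NIM provides the following endpoints:
protein-structure/alphafold2/multimer/predict-structure-from-sequences - Predicts a protein structure from a list of input amino acid sequences.
protein-structure/alphafold2/multimer/predict-msa-from-sequences - Performs multiple sequence alignment (MSA) and returns the MSAs and templates used for AlphaFold2 inference. This endpoint is useful for batching long-running, CPU-intensive MSA runs ahead of structure prediction.
protein-structure/alphafold2/multimer/predict-structure-from-msa - Performs structure prediction from input MSAs and templates. This is useful when working with precomputed or custom/external MSAs.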
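Usage
Below, we outline the three endpoints of the API. Each comes with a realistic request example that should run when the NIM is configured correctly.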
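Predicting Structure from Multiple Input Sequences (Multimer)
The predict-structure-from-sequences endpoint provides the complete end-to-end structure prediction pipeline, from protein sequences to a multimeric protein structure. It requires at least 1 and at most 6 amino acid sequences, and it also exposes a number of tunable parameters:
sequences: An array of valid amino acid sequences. If you are unsure whether your sequences are valid, refer to the amino acid code table.
databases: A list containing any of uniref90, mgnify, and small_bfd. These databases hold the sequences used to build the multiple sequence alignment (MSA), which is the input to the structure prediction neural network in AlphaFold2. In general, passing all three databases gives the most accurate structure prediction at the cost of the longest runtime.
algorithm: The algorithm used for multiple sequence alignment. Currently, only jackhmmer is supported.
e_value: The sequence e-value used to filter sequences in the MSA. Smaller values mean a more stringent alignment: the sequences included are more likely to share a common origin, but the sensitivity of the MSA is reduced. The default of 0.0001 is generally a good choice. This value ranges from 0 to 1.
bit_score: The sequence bit score used for filtering before the MSA. If this value is passed, it is used for filtering instead of the e-value. A good starting point is around 200. This value must be greater than zero.
iterations: The number of MSA iterations to perform. In general, the default of iterations=1 is sufficient and takes the least time.
relax_prediction: Set to True to run structural relaxation after prediction. Defaults to True, which helps resolve clashes in the predicted structure.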
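The constraints in the parameter list above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the NIM itself: the helper name build_structure_payload is hypothetical, and treating the e_value bounds as inclusive is an assumption, since the text only says the value ranges from 0 to 1.

```python
ALLOWED_DATABASES = {"uniref90", "mgnify", "small_bfd"}


def build_structure_payload(sequences, databases=("uniref90", "mgnify", "small_bfd"),
                            e_value=None, bit_score=None, iterations=None,
                            relax_prediction=None):
    """Build a request body for predict-structure-from-sequences.

    Hypothetical client-side helper: it only enforces the limits stated in
    the parameter list and leaves unset options to the server defaults.
    """
    if not 1 <= len(sequences) <= 6:
        raise ValueError("expected between 1 and 6 amino acid sequences")
    unknown = set(databases) - ALLOWED_DATABASES
    if unknown:
        raise ValueError(f"unsupported databases: {sorted(unknown)}")
    # Assumption: the 0-to-1 range for e_value is inclusive.
    if e_value is not None and not 0 <= e_value <= 1:
        raise ValueError("e_value must be between 0 and 1")
    if bit_score is not None and bit_score <= 0:
        raise ValueError("bit_score must be greater than zero")

    payload = {"sequences": list(sequences), "databases": list(databases)}
    optional = {
        "e_value": e_value,
        "bit_score": bit_score,
        "iterations": iterations,
        "relax_prediction": relax_prediction,
    }
    # Only send the options the caller actually set.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

The returned dictionary can be serialized with json.dumps and posted to the endpoint; options left as None are omitted so that the server applies its own defaults.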
Here is an example that uses cURL to query with sequences and the full set of databases:
curl -X 'POST' \
-i \
"http://127.0.0.1:8000/protein-structure/alphafold2/multimer/predict-structure-from-sequences" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{"sequences": ["MNVIDIAIAMAI", "IAMNVIDIAAI"], "databases": ["uniref90", "mgnify", "small_bfd"]}'
Here is the same example, this time using the Python requests module:
import requests
import json

url = "http://127.0.0.1:8000/protein-structure/alphafold2/multimer/predict-structure-from-sequences"
sequences = ["MNVIDIAIAMAI", "IAMNVIDIAAI"]  # Replace with the actual sequences you want to perform structure prediction on.
headers = {
    "content-type": "application/json"
}
data = {
    "sequences": sequences,
    "databases": ["uniref90", "mgnify", "small_bfd"]
}
response = requests.post(url, headers=headers, data=json.dumps(data))
# Check if the request was successful
if response.ok:
    print("Request succeeded:", response.json())
else:
    print("Request failed:", response.status_code, response.text)
The output of this endpoint is a PDB file. The PDB format can be viewed easily with pymol and other viewers; see the pymol website for documentation and usage.
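Because the result is PDB text, a quick sanity check on a multimer prediction is to count the residues in each chain. The sketch below assumes you already hold the PDB contents as a Python string (how that string is extracted from the response body is not specified here) and relies only on the fixed-column layout of ATOM records in the PDB format. The function name residues_per_chain is hypothetical.

```python
def residues_per_chain(pdb_text):
    """Count residues per chain by counting C-alpha ATOM records.

    Uses the fixed columns of the PDB format: the atom name sits in
    columns 13-16 and the chain identifier in column 22 (1-based).
    """
    counts = {}
    for line in pdb_text.splitlines():
        # Skip HETATM, TER, END, and any truncated lines.
        if not line.startswith("ATOM") or len(line) < 22:
            continue
        if line[12:16].strip() != "CA":
            continue
        chain_id = line[21]
        counts[chain_id] = counts.get(chain_id, 0) + 1
    return counts
```

For a two-sequence request, the result would be expected to hold one entry per chain, each matching the length of the corresponding input sequence. Structures with alternate locations would be double counted by this simple approach.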
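Predicting MSA from Multiple Input Sequences (Multimer)
The predict-msa-from-sequences endpoint generates the multiple sequence alignments (MSAs) and templates used for structure prediction. This is useful if you want to batch this step on a different (CPU-heavy) node.
Here is an example query using cURL: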
curl -X 'POST' \
-i \
"http://127.0.0.1:8000/protein-structure/alphafold2/multimer/predict-msa-from-sequences" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{"sequences": ["MNVIDIAIAMAI", "IAMNVIDIAAI"], "databases": ["uniref90", "mgnify", "small_bfd"]}'
Here is the same query in Python using the requests module:
import requests
import json

url = "http://127.0.0.1:8000/protein-structure/alphafold2/multimer/predict-msa-from-sequences"
sequences = ["MNVIDIAIAMAI", "IAMNVIDIAAI"]  # Replace with the actual sequences you want to compute MSAs for.
headers = {
    "content-type": "application/json"
}
data = {
    "sequences": sequences,
    "databases": ["uniref90", "mgnify", "small_bfd"]
}
response = requests.post(url, headers=headers, data=json.dumps(data))
# Check if the request was successful
if response.ok:
    print("Request succeeded:", response.json())
else:
    print("Request failed:", response.status_code, response.text)
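The predict-msa-from-sequences endpoint accepts the following parameters:
sequences: An array of valid amino acid sequences. If you are unsure whether your sequences are valid, refer to the amino acid code table.
databases: A list containing any of uniref90, mgnify, and small_bfd. These databases hold the sequences used to build the multiple sequence alignment (MSA), which is the input to the structure prediction neural network in AlphaFold2. In general, passing all three databases gives the most accurate structure prediction at the cost of the longest runtime. If you must choose only one, uniref90 is considered the best choice, but using all three is still recommended.
algorithm: The algorithm used for multiple sequence alignment. Currently, only jackhmmer is supported.
e_value: The sequence e-value used to filter sequences in the MSA. Smaller values mean a more stringent alignment: the sequences included are more likely to share a common origin, but the sensitivity of the MSA is reduced. The default of 0.0001 is generally a good choice. This value ranges from 0 to 1.
bit_score: The sequence bit score used for filtering before the MSA. If this value is passed, it is used for filtering instead of the e-value. A good starting point is around 200. This value must be greater than zero.
iterations: The number of MSA iterations to perform. In general, the default of iterations=1 is sufficient and takes the least time.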
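Since MSA runs are long and CPU-intensive, it can be worth checking sequences locally before submitting them. The sketch below flags any character outside the 20 standard one-letter amino acid codes. Whether the NIM also accepts ambiguity codes (such as X or B) or lowercase letters is not stated here, so this strict uppercase check is an assumption, and invalid_residues is a hypothetical helper name.

```python
# The 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return the sorted characters in `sequence` that are not standard codes.

    An empty list means every character is one of the 20 standard
    uppercase one-letter codes.
    """
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)


for seq in ["MNVIDIAIAMAI", "IAMNVIDIAAI"]:
    bad = invalid_residues(seq)
    print(seq, "OK" if not bad else f"unexpected characters: {bad}")
```

Both example sequences used in this section pass the check.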
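Predicting Protein Structure from MSA
The predict-structure-from-msa endpoint accepts the results of the predict-msa-from-sequences endpoint and runs structure prediction.
Note: We do not recommend using cURL to run MSA-to-structure prediction, because the input contains characters that need careful escaping in bash. For the best user experience, we recommend interacting with this endpoint through the Python requests module.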
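To illustrate why Python is the easier route, the short sketch below serializes an alignment containing a Stockholm-format MSA with the standard json module. The newlines and quotes inside the MSA string are escaped automatically, which is the part that is error-prone to do by hand in a shell command. The payload values are the same illustrative ones used elsewhere in this section.

```python
import json

# A Stockholm-format MSA contains newlines and '#' characters.
stockholm = (
    "# STOCKHOLM 1.0\n\n"
    "-151285509650596177 STARWARSNVIDIAAAAAA\n"
    "#=GC RF xxxxxxxxxxxxxxxxxxx\n"
    "//\n"
)
alignments = [{"uniref90": ["uniref90", stockholm, "sto"]}]

# json.dumps encodes each real newline as the two-character sequence \n,
# so the request body is a single line with no manual escaping.
body = json.dumps({"sequences": ["STARWARSNVIDIAAAAAA"], "alignments": alignments})
print(body)

# Decoding the body recovers the original alignment unchanged.
round_tripped = json.loads(body)["alignments"]
print(round_tripped == alignments)
```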
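The predict-structure-from-msa endpoint accepts the following parameters:
sequences: An array of valid amino acid sequences. If you are unsure whether your sequences are valid, refer to the amino acid code table.
alignments: The MSA results from predict-msa-from-sequences. This is an array of dictionaries, one per input amino acid sequence, whose entries take the form {<db name> : {<db name>, <MSA output>, <MSA output format>}}.
templates: The templates from the structure database search. These use a format specific to AlphaFold2 internals; for more details on the fields, see here.
relax_prediction: Set to True to run structural relaxation after prediction. Defaults to True, which helps resolve clashes in the predicted structure.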
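When alignments come from a custom or external source, a structural check before submission can catch shape mistakes early. The sketch below is one way to check the layout described above: one dictionary per input sequence, each mapping a database name to a three-item entry of database name, MSA output, and MSA output format. The helper name check_alignments is hypothetical, and requiring the first item to repeat the dictionary key is an assumption read off the stated form, not a documented server rule.

```python
def check_alignments(sequences, alignments):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(alignments) != len(sequences):
        problems.append(
            f"expected {len(sequences)} alignment dictionaries, got {len(alignments)}"
        )
    for i, per_sequence in enumerate(alignments):
        for db_name, entry in per_sequence.items():
            if len(entry) != 3:
                problems.append(
                    f"alignments[{i}][{db_name!r}]: expected 3 items, got {len(entry)}"
                )
            # Assumption: the first item repeats the database name used as the key.
            elif entry[0] != db_name:
                problems.append(
                    f"alignments[{i}][{db_name!r}]: first item is {entry[0]!r}"
                )
    return problems
```

An empty return value only means the layout looks right; it says nothing about whether the MSA content itself is valid.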
Here is an example of a request to the predict-structure-from-msa endpoint using the Python requests module:
import requests
import json
url = "http://127.0.0.1:8000/protein-structure/alphafold2/multimer/predict-structure-from-msa"
sequences = ["STARWARSNVIDIAAAAAA"] # Replace with the actual MSA sequences.
alignments = [{
'uniref90': ['uniref90', '# STOCKHOLM 1.0\n\n-151285509650596177 STARWARSNVIDIAAAAAA\n#=GC RF xxxxxxxxxxxxxxxxxxx\n//\n', 'sto'],
'small_bfd': ['small_bfd', '# STOCKHOLM 1.0\n\n-151285509650596177 STARWARSNVIDIAAAAAA\n#=GC RF xxxxxxxxxxxxxxxxxxx\n//\n', 'sto']
}]
templates = [
    [
        {'index': 1, 'name': '5X6U_E Ragulator complex protein LAMTOR3, Ragulator; Ragulator complex, scaffold, roadblock, lysosome; 2.4A {Homo sapiens}', 'aligned_cols': 10, 'sum_probs': 0.0, 'query': 'RSNVIDIAAA', 'hit_sequence': 'ASNIIDVSAA', 'indices_query': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [23, 24, 25, 26, 27, 28, 29, 30, 31, 32]},
        {'index': 2, 'name': '5X6V_E Ragulator complex protein LAMTOR3, Ragulator; Ragulator Rag GTPase complex, scaffold; 2.02A {Homo sapiens}', 'aligned_cols': 10, 'sum_probs': 7.9, 'query': 'RSNVIDIAAA', 'hit_sequence': 'ASNIIDVSAA', 'indices_query': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [23, 24, 25, 26, 27, 28, 29, 30, 31, 32]},
        {'index': 3, 'name': '6EHP_E Ragulator complex protein LAMTOR3, Ragulator; Scaffolding complex, Rag-GTPase, mTOR, Ragulator; 2.3A {Homo sapiens}', 'aligned_cols': 10, 'sum_probs': 0.0, 'query': 'RSNVIDIAAA', 'hit_sequence': 'ASNIIDVSAA', 'indices_query': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [45, 46, 47, 48, 49, 50, 51, 52, 53, 54]},
        {'index': 4, 'name': '6EHR_E Ragulator complex protein LAMTOR3, Ragulator; Scaffolding complex, Rag-GTPases, mTOR, Ragulator; 2.898A {Homo sapiens}', 'aligned_cols': 10, 'sum_probs': 7.8, 'query': 'RSNVIDIAAA', 'hit_sequence': 'ASNIIDVSAA', 'indices_query': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [45, 46, 47, 48, 49, 50, 51, 52, 53, 54]},
        {'index': 5, 'name': '6CTD_B Large-conductance mechanosensitive channel; Channel Mechanosensitive Mycobacterium tuberculosis, MEMBRANE; 5.8A {Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra)}', 'aligned_cols': 11, 'sum_probs': 8.7, 'query': 'ARSNVIDIAAA', 'hit_sequence': 'ARGNIVDLAVA', 'indices_query': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]},
        {'index': 6, 'name': '3HZQ_A Large-conductance mechanosensitive channel; intermediate state Mechanosensitive channel osmoregulation; 3.82A {Staphylococcus aureus subsp. aureus MW2}', 'aligned_cols': 11, 'sum_probs': 8.6, 'query': 'ARSNVIDIAAA', 'hit_sequence': 'LKGNVLDLAIA', 'indices_query': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]},
        {'index': 7, 'name': '6B9X_A Ragulator complex protein LAMTOR1, Ragulator; Ragulator, Lamtor, SIGNALING PROTEIN; 1.42A {Homo sapiens}', 'aligned_cols': 12, 'sum_probs': 0.0, 'query': 'WARSNVIDIAAA', 'hit_sequence': 'KTASNIIDVSAA', 'indices_query': [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70]},
        {'index': 8, 'name': '4V7H_BM Ribosome; eukaryotic ribosome, 80S, RACK1 protein; HET: OMC, PSU, 5MU, 1MA, OMG, 5MC, YYG, 7MG, 2MG, H2U, M2G; 8.9A {Thermomyces lanuginosus}', 'aligned_cols': 15, 'sum_probs': 9.1, 'query': 'RWARSNVIDIAAAAA', 'hit_sequence': 'GWKAAAAAAAAAAAA', 'indices_query': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'indices_hit': [139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153]},
        {'index': 9, 'name': '6QKP_A Nucleoid-associated protein Lsr2; Tuberculosis, DNA organisation, Transcriptional regulator; NMR {Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv)}', 'aligned_cols': 12, 'sum_probs': 9.2, 'query': 'RWARSNVIDIAA', 'hit_sequence': 'EWARRNGHNVST', 'indices_query': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'indices_hit': [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]},
        {'index': 10, 'name': "1QGN_F CYSTATHIONINE GAMMA-SYNTHASE; METHIONINE BIOSYNTHESIS, PYRIDOXAL 5'-PHOSPHATE, GAMMA-FAMILY; HET: PLP; 2.9A {Nicotiana tabacum} SCOP: c.67.1.3", 'aligned_cols': 10, 'sum_probs': 0.0, 'query': 'NVIDIAAAAA', 'hit_sequence': 'KAVDAAAAAA', 'indices_query': [8, 9, 10, 11, 12, 13, 14, 15, 16, 17], 'indices_hit': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]},
        {'index': 11, 'name': '2OAR_E Large-conductance mechanosensitive channel; stretch activated ion channel mechanosensitive; 3.5A {Mycobacterium tuberculosis H37Ra} SCOP: f.16.1.1', 'aligned_cols': 11, 'sum_probs': 8.9, 'query': 'ARSNVIDIAAA', 'hit_sequence': 'ARGNIVDLAVA', 'indices_query': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'indices_hit': [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]},
        {'index': 12, 'name': '5XKX_A Flavin-containing monooxygenase; Dimethylsulfoniopropionate (DMSP) lyase, LYASE; 1.5A {Acinetobacter bereziniae NIPH 3}', 'aligned_cols': 10, 'sum_probs': 8.0, 'query': 'ARWARSNVID', 'hit_sequence': 'TVWARTTAQD', 'indices_query': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'indices_hit': [356, 357, 358, 359, 360, 361, 362, 363, 364, 365]}
    ]
]
headers = {
    "content-type": "application/json"
}
data = {
    "sequences": sequences,
    "alignments": alignments,
    "templates": templates
}
response = requests.post(url, headers=headers, data=json.dumps(data))
# Check if the request was successful
if response.ok:
    print("Request succeeded:", response.json())
else:
    print("Request failed:", response.status_code, response.text)
The structure prediction module scales quadratically with sequence length. Long sequences can take several hours to predict.
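As a rough planning aid, quadratic scaling means that doubling the sequence length roughly quadruples the structure prediction time. The sketch below turns that into a back-of-the-envelope estimate. The reference timing is a made-up placeholder, not a measured figure, and the estimate ignores fixed overheads such as MSA generation and relaxation. How length is counted for a multimer (for example, as the combined length of all chains) is also an assumption here.

```python
def relative_cost(length, reference_length):
    """Cost of predicting `length` residues relative to `reference_length`,
    under a purely quadratic model."""
    return (length / reference_length) ** 2


def estimate_minutes(length, reference_length, reference_minutes):
    """Scale a timing you measured yourself to a new sequence length."""
    return reference_minutes * relative_cost(length, reference_length)


# Doubling the length gives four times the cost.
print(relative_cost(1000, 500))

# Hypothetical numbers: if 500 residues took 10 minutes on your hardware,
# 1500 residues would be estimated at nine times that.
print(estimate_minutes(1500, 500, 10))
```

Measuring one short prediction on your own deployment and scaling from it gives a more useful estimate than any generic figure.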