FQ2BAM Tutorial
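This tutorial shows how to run FQ2BAM, our core alignment tool, which aligns FASTQ files at very high speed following GATK best practices. It includes the gold-standard aligner BWA-MEM, with built-in coordinate sorting of the output file, and can optionally apply base quality score recalibration (BQSR) and mark duplicate reads.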
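The fq2bam tool aligns, sorts (by coordinate), and marks duplicates in paired-end FASTQ data. The data files used in this example come from the sample data downloaded in the previous section.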
By default, the fq2bam tool requires at least 38 GB of GPU memory; the --low-memory option reduces this requirement to 16 GB of GPU memory, at the cost of slower processing.
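The choice between the default mode and --low-memory can be sketched as a small shell function. This is only an illustration: the function name choose_memory_flag is hypothetical, and it takes the card's nominal memory in GB as an argument rather than querying the GPU, because tools that report GPU memory often do so in MiB and the reported total can be slightly below the nominal size.

```shell
# Hypothetical helper (not part of Parabricks): pick the fq2bam memory flag
# from the GPU's nominal memory in GB. The 38 GB and 16 GB thresholds are
# the ones quoted in the text above.
choose_memory_flag() {
    gpu_mem_gb=$1
    if [ "$gpu_mem_gb" -ge 38 ]; then
        echo ""                 # enough memory for the default mode
    elif [ "$gpu_mem_gb" -ge 16 ]; then
        echo "--low-memory"     # fits in 16 GB, but runs slower
    else
        echo "insufficient GPU memory: ${gpu_mem_gb} GB" >&2
        return 1
    fi
}

choose_memory_flag 40   # prints an empty line
choose_memory_flag 16   # prints --low-memory
```

The result can be placed, unquoted, right after pbrun fq2bam as $(choose_memory_flag 16), so that an empty result expands to nothing.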
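If you run the following command on the NVIDIA Parabricks sample data, you should get the same results as those shown here.
Before running the command, make sure your current directory is the one where you extracted the sample data; it should contain a parabricks_sample subdirectory.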
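Before launching the container, the directory requirement can be verified with a short pre-flight check. The sketch below assumes the three input paths used by the docker command that follows; the function name check_sample_data is hypothetical.

```shell
# Hypothetical pre-flight check: confirm that the reference and the two
# FASTQ files used by the fq2bam command exist under the given directory
# (default: the current directory).
check_sample_data() {
    base=${1:-.}
    missing=0
    for f in \
        parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
        parabricks_sample/Data/sample_1.fq.gz \
        parabricks_sample/Data/sample_2.fq.gz
    do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $base/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_sample_data "$(pwd)"; then
    echo "sample data found"
else
    echo "run this from the directory that contains parabricks_sample"
fi
```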
$ docker run \
--gpus all \
--rm \
--volume $(pwd):/workdir \
--volume $(pwd):/outputdir \
nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1 \
pbrun fq2bam \
--ref /workdir/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/parabricks_sample/Data/sample_1.fq.gz /workdir/parabricks_sample/Data/sample_2.fq.gz \
--out-bam /outputdir/fq2bam_output.bam
[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /workdir/parabricks_sample/Data/sample_1.fq.gz and
/workdir/parabricks_sample/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2022-Sep-02 19:49:27] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:49:27] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2022-Sep-02 19:49:27] || Version 4.0.0-1 ||
[PB Info 2022-Sep-02 19:49:27] || GPU-BWA mem, Sorting Phase-I ||
[PB Info 2022-Sep-02 19:49:27] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Warning 2022-Sep-02 19:50:02][ParaBricks/src/pbOpts.cu:325]
WARNING
The system has 12 threads, however recommended number of threads with 1 GPU is 16.
The run might not finish or might have less than expected performance.
[PB Info 2022-Sep-02 19:50:02] GPU-BWA mem
[PB Info 2022-Sep-02 19:50:02] ProgressMeter Reads Base Pairs Aligned
[PB Info 2022-Sep-02 19:50:45] 5043564 580000000
[PB Info 2022-Sep-02 19:51:21] 10087128 1160000000
[PB Info 2022-Sep-02 19:51:59] 15130692 1740000000
[PB Info 2022-Sep-02 19:52:39] 20174256 2320000000
[PB Info 2022-Sep-02 19:53:20] 25217820 2900000000
[PB Info 2022-Sep-02 19:53:58] 30261384 3480000000
[PB Info 2022-Sep-02 19:54:36] 35304948 4060000000
[PB Info 2022-Sep-02 19:55:13] 40348512 4640000000
[PB Info 2022-Sep-02 19:55:53] 45392076 5220000000
[PB Info 2022-Sep-02 19:56:36] 50435640 5800000000
[PB Info 2022-Sep-02 19:57:02]
GPU-BWA Mem time: 420.426442 seconds
[PB Info 2022-Sep-02 19:57:02] GPU-BWA Mem is finished.
[main] CMD: /usr/local/parabricks/binaries//bin/bwa mem -Z ./pbOpts.txt /workdir/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta /workdir/parabricks_sample/Data/sample_1.fq.gz /workdir/parabricks_sample/Data/sample_2.fq.gz @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[main] Real time: 455.468 sec; CPU: 4766.384 sec
[PB Info 2022-Sep-02 19:57:02] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:02] || Program: GPU-BWA mem, Sorting Phase-I ||
[PB Info 2022-Sep-02 19:57:02] || Version: 4.0.0-1 ||
[PB Info 2022-Sep-02 19:57:02] || Start Time: Fri Sep 2 19:49:27 2022 ||
[PB Info 2022-Sep-02 19:57:02] || End Time: Fri Sep 2 19:57:02 2022 ||
[PB Info 2022-Sep-02 19:57:02] || Total Time: 7 minutes 35 seconds ||
[PB Info 2022-Sep-02 19:57:02] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:03] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:03] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2022-Sep-02 19:57:03] || Version 4.0.0-1 ||
[PB Info 2022-Sep-02 19:57:03] || Sorting Phase-II ||
[PB Info 2022-Sep-02 19:57:03] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:03] progressMeter - Percentage
[PB Info 2022-Sep-02 19:57:03] 0.0 0.00 GB
[PB Info 2022-Sep-02 19:57:13] 72.8 0.00 GB
[PB Info 2022-Sep-02 19:57:23] Sorting and Marking: 20.001 seconds
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:23] || Program: Sorting Phase-II ||
[PB Info 2022-Sep-02 19:57:23] || Version: 4.0.0-1 ||
[PB Info 2022-Sep-02 19:57:23] || Start Time: Fri Sep 2 19:57:03 2022 ||
[PB Info 2022-Sep-02 19:57:23] || End Time: Fri Sep 2 19:57:23 2022 ||
[PB Info 2022-Sep-02 19:57:23] || Total Time: 20 seconds ||
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:23] || Parabricks accelerated Genomics Pipeline ||
[PB Info 2022-Sep-02 19:57:23] || Version 4.0.0-1 ||
[PB Info 2022-Sep-02 19:57:23] || Marking Duplicates, BQSR ||
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:24] progressMeter - Percentage
[PB Info 2022-Sep-02 19:57:34] 13.6 16.60 GB
[PB Info 2022-Sep-02 19:57:44] 31.1 13.45 GB
[PB Info 2022-Sep-02 19:57:54] 46.8 10.22 GB
[PB Info 2022-Sep-02 19:58:04] 61.1 7.05 GB
[PB Info 2022-Sep-02 19:58:14] 77.3 3.84 GB
[PB Info 2022-Sep-02 19:58:24] 91.4 0.60 GB
[PB Info 2022-Sep-02 19:58:34] 100.0 0.00 GB
[PB Info 2022-Sep-02 19:59:18] BQSR and writing final BAM: 113.592 seconds
[PB Info 2022-Sep-02 19:59:18] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:59:18] || Program: Marking Duplicates, BQSR ||
[PB Info 2022-Sep-02 19:59:18] || Version: 4.0.0-1 ||
[PB Info 2022-Sep-02 19:59:18] || Start Time: Fri Sep 2 19:57:23 2022 ||
[PB Info 2022-Sep-02 19:59:18] || End Time: Fri Sep 2 19:59:18 2022 ||
[PB Info 2022-Sep-02 19:59:18] || Total Time: 1 minute 55 seconds ||
[PB Info 2022-Sep-02 19:59:18] ------------------------------------------------------------------------------
Please visit https://docs.nvda.net.cn/clara/#parabricks for detailed documentation
On an AWS g4dn.8xlarge instance (32 vCPUs, one T4 GPU, 128 GB of memory), this takes about six minutes.
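To compare your own run with that figure, the per-phase Total Time lines in the log can be summed. The awk sketch below is an illustration only: the function name sum_phase_seconds is hypothetical, and its handling of an hours unit is an assumption, since the sample log shows only minutes and seconds. For the sample log shown here, which was captured on a 12-thread system, the three phases add up to 590 seconds (9 minutes 50 seconds).

```shell
# Hypothetical helper: sum every "Total Time: ..." line of a Parabricks log,
# read from the files given as arguments or from standard input.
sum_phase_seconds() {
    awk '
        /Total Time:/ {
            # Each number is followed by its unit, e.g. "7 minutes 35 seconds".
            for (i = 1; i < NF; i++) {
                if ($(i + 1) ~ /^hours?$/)   total += $i * 3600   # assumed unit
                if ($(i + 1) ~ /^minutes?$/) total += $i * 60
                if ($(i + 1) ~ /^seconds?$/) total += $i
            }
        }
        END { print total + 0 }
    ' "$@"
}

# The three Total Time lines from the log above.
sum_phase_seconds <<'EOF'
[PB Info 2022-Sep-02 19:57:02] || Total Time: 7 minutes 35 seconds ||
[PB Info 2022-Sep-02 19:57:23] || Total Time: 20 seconds ||
[PB Info 2022-Sep-02 19:59:18] || Total Time: 1 minute 55 seconds ||
EOF
# prints 590
```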
If you get an out-of-memory error, make sure your machine has enough RAM and that no other programs are using a large amount of memory.
This fq2bam command generates three output files:
$ ls -l
total 14330820
-rw-r--r-- 1 root root 4819386804 Sep 2 15:58 fq2bam_output.bam
-rw-r--r-- 1 root root 6882792 Sep 2 15:59 fq2bam_output.bam.bai
-rw-r--r-- 1 root root 87690 Sep 2 15:59 fq2bam_output_chrs.txt
(input files not shown)
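A quick way to confirm that all three files were written is sketched below. The function name check_outputs is hypothetical, and the companion file names are derived from the listing above (the .bai index next to the BAM, and a _chrs.txt file named after the BAM without its extension); treat that naming rule as an assumption drawn from this one example.

```shell
# Hypothetical helper: check that the BAM passed to --out-bam and its two
# companion files exist and are non-empty.
check_outputs() {
    status=0
    for f in "$1" "$1.bai" "${1%.bam}_chrs.txt"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

if check_outputs fq2bam_output.bam; then
    echo "all fq2bam outputs present"
else
    echo "some fq2bam outputs are missing"
fi
```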
The first line of fq2bam_output.bam (viewed with the samtools view fq2bam_output.bam command) looks like this:
HWI-D00127:570:HK3TJBCX2:1:1202:9643:76055 99 chr1 10027 26 24M5I86M = 10178 231 ACCCTAACCCTAACCCTAACCCGACCCCGACCCCGACCCAAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACC DDDDDHGHIIIIIHIIHHIHHHIHIIIIIIHDHHIHHHIHIHIIIIFHIEHHIIHHIIIIEHIIIIHHIHIIICHE@1FHH?1GEFE1111D11<FH11<FD11<<FFE111<11 MD:Z:22T5T0A4T5T41A27 PG:Z:MarkDuplicates RG:Z:HK3TJBCX2.1 NM:i:11 AS:i:69 XS:i:72
....
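The second column of the record above (99) is the SAM FLAG, a bit field. As a reading aid, the sketch below decodes a FLAG value in plain shell; the function name decode_sam_flag is hypothetical, and the bit names follow the short names used by samtools. Reads that fq2bam marks as duplicates carry the DUP bit (1024).

```shell
# Hypothetical helper: print the name of every bit set in a SAM FLAG value.
# Bit order follows the SAM specification, starting at 0x1.
decode_sam_flag() {
    flag=$1
    bit=1
    for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
                READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY
    do
        if [ $(( $flag & $bit )) -ne 0 ]; then
            echo "$name"
        fi
        bit=$(( $bit * 2 ))
    done
}

decode_sam_flag 99
# prints PAIRED, PROPER_PAIR, MREVERSE and READ1, one per line
```

So the record shown is the first read of a properly paired alignment whose mate maps to the reverse strand, and it is not marked as a duplicate.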
If the fq2bam command is run on a system with insufficient memory, you will see this message after the initial header:
WARNING
The system has 62 GB of memory, however recommended RAM with 1 GPU is 64 GB.
The run might not finish or might have less than expected performance.
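Both warnings shown in this tutorial (threads and RAM) can be anticipated with a check on the host before starting a run. The sketch below uses the one-GPU recommendations quoted in those warnings (16 threads, 64 GB of RAM); the function name check_host_resources is hypothetical, and how Parabricks itself measures memory is not stated here, so treat the result as advisory. The total reported by the kernel is usually slightly below the installed amount.

```shell
# Hypothetical check against the one-GPU recommendations quoted in the
# warnings: 16 CPU threads and 64 GB of RAM.
check_host_resources() {
    threads=$1
    ram_gb=$2
    status=0
    if [ "$threads" -lt 16 ]; then
        echo "only $threads threads; 16 recommended for 1 GPU" >&2
        status=1
    fi
    if [ "$ram_gb" -lt 64 ]; then
        echo "only $ram_gb GB of RAM; 64 GB recommended for 1 GPU" >&2
        status=1
    fi
    return "$status"
}

# On Linux, read the values from the system (MemTotal is reported in kB).
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    total_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
    check_host_resources "$(nproc)" "$total_gb" ||
        echo "host is below the recommended resources"
fi
```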