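Understanding Your Grace Machine
After you boot your Grace machine, run the sudo ipmitool fru print command and check the information about the NVIDIA Grace module.
The following is sample output from a Grace CPU Superchip machine. The FRU device description is PG535 and the product name is C2.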
FRU Device Description : PG535 (ID 192)
Board Mfg Date : [REDACTED]
Board Mfg : NVIDIA
Board Product : PG535
Board Serial : [REDACTED]
Board Part Number : 699-2G535-0200-DV2
Product Manufacturer : NVIDIA
Product Name : C2
Product Part Number : 900-2G535-0000-000
Product Version : B-R00
Product Serial : [REDACTED]
The following is sample output from a Grace Hopper Superchip machine. The FRU device description is PG530 and the product name is GH200 480GB.
FRU Device Description : PG530 (ID 133)
Board Mfg Date : [REDACTED]
Board Mfg : NVIDIA
Board Product : PG530
Board Serial : [REDACTED]
Board Part Number : 699-2G530-0206-QS1
Product Manufacturer : NVIDIA
Product Name : GH200 480GB
Product Part Number : 900-2G530-0000-000
Product Version : A-R00
Product Serial : [REDACTED]
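The two FRU listings above can be told apart programmatically. The following is a minimal sketch, assuming the output uses the `key : value` layout shown above; the helper names `parse_fru` and `classify` are hypothetical, and the board-to-module mapping covers only the two boards shown in this section (other SKUs may report different values).

```python
def parse_fru(text):
    """Parse `ipmitool fru print` output into a list of per-device dicts."""
    devices = []
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if key == "FRU Device Description":
            devices.append({})  # a new FRU record starts here
        if devices:
            devices[-1][key] = value
    return devices


def classify(device):
    """Map a FRU record to a module type (assumption: only PG535 and PG530 are known)."""
    board = device.get("Board Product", "")
    if board == "PG535":
        return "Grace CPU Superchip"
    if board == "PG530":
        return "Grace Hopper Superchip"
    return "unknown"


# Abbreviated copy of the Grace Hopper sample output above.
sample = """\
FRU Device Description : PG530 (ID 133)
Board Mfg : NVIDIA
Board Product : PG530
Product Name : GH200 480GB
"""

records = parse_fru(sample)
print(classify(records[0]), "-", records[0]["Product Name"])
```

On a live system, the text passed to `parse_fru` would come from the sudo ipmitool fru print command instead of the embedded sample.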
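Checking the CPU
The lscpu command-line utility in Linux gathers CPU architecture information from sysfs and /proc/cpuinfo and displays it in the terminal.
After you boot your Grace machine, run the lscpu command and check the CPU.
The following is sample output from a Grace CPU Superchip machine: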
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Vendor ID: ARM
Model: 0
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 2
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3582.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp
sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg
dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2
frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
L1d: 9 MiB (144 instances)
L1i: 9 MiB (144 instances)
L2: 144 MiB (144 instances)
L3: 228 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-71
NUMA node1 CPU(s): 72-143
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
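From this output, you can see the number of CPU sockets, the number of cores per socket, the number of hardware threads per core, and the maximum and minimum CPU frequencies. You can also find the sizes of the L1, L2, and L3 caches.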
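As a quick consistency check, the socket, core, and thread counts reported by lscpu should multiply out to the total CPU count. The following is a minimal sketch, assuming the `key: value` layout shown above; `parse_lscpu` is a hypothetical helper, and the sample is an abbreviated copy of the Grace CPU Superchip output.

```python
def parse_lscpu(text):
    """Turn `lscpu` output into a dict keyed by the field name."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Abbreviated copy of the Grace CPU Superchip sample output above.
sample = """\
CPU(s): 144
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 2
CPU max MHz: 3582.0000
CPU min MHz: 81.0000
"""

info = parse_lscpu(sample)
sockets = int(info["Socket(s)"])
cores_per_socket = int(info["Core(s) per socket"])
threads_per_core = int(info["Thread(s) per core"])

# Logical CPUs = sockets x cores per socket x threads per core.
logical_cpus = sockets * cores_per_socket * threads_per_core
print(f"{sockets} sockets x {cores_per_socket} cores x {threads_per_core} threads = {logical_cpus} CPUs")
print("matches CPU(s) field:", logical_cpus == int(info["CPU(s)"]))
```

For the sample values, 2 sockets x 72 cores x 1 thread gives 144, which agrees with the CPU(s) field.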
The following is sample output from a Grace Hopper Superchip system:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Vendor ID: ARM
Model: 0
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 1
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3591.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp
sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg
dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2
frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
L1d: 4.5 MiB (72 instances)
L1i: 4.5 MiB (72 instances)
L2: 72 MiB (72 instances)
L3: 114 MiB (1 instance)
NUMA:
NUMA node(s): 9
NUMA node0 CPU(s): 0-71
NUMA node1 CPU(s):
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s):
NUMA node5 CPU(s):
NUMA node6 CPU(s):
NUMA node7 CPU(s):
NUMA node8 CPU(s):
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Not affected
Srbds: Not affected
Tsx async abort: Not affected
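Note
This output shows nine NUMA nodes. The first node (node 0) corresponds to the Grace CPU, the second node (node 1) corresponds to the Hopper GPU, and the remaining seven nodes (nodes 2-8) correspond to NVIDIA Multi-Instance GPU (MIG) instances.
If MIG mode is not used, you can ignore these seven MIG nodes.
Checking the Non-Uniform Memory Access Settings
The lscpu output contains basic information about the non-uniform memory access (NUMA) settings on your Grace machine.
To learn more about the NUMA settings, run the numactl -H command. The following is sample output from a Grace CPU Superchip machine: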
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
68 69 70 71
node 0 size: 245090 MB
node 0 free: 99633 MB
node 1 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 1 size: 245317 MB
node 1 free: 126895 MB
node distances:
node 0 1
0: 10 40
1: 40 10
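The output shows that this machine has two NUMA nodes, lists the cores on each node, and reports the amount of memory available on each node. It also shows the node distances between NUMA nodes, which help the kernel scheduler run application threads on the CPU cores closest to the memory where their data resides.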
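The node distance table can be read into a lookup structure to compare local and remote access costs. The following is a minimal sketch, assuming the `node distances:` block has the layout shown above; `parse_distances` is a hypothetical helper, and the sample is an abbreviated copy of the two-node output.

```python
def parse_distances(text):
    """Return {(src, dst): distance} from the 'node distances' block of `numactl -H`."""
    lines = text.splitlines()
    # Locate the line that introduces the distance matrix.
    start = next(i for i, line in enumerate(lines)
                 if line.strip().startswith("node distances"))
    # The next line is the column header: the word "node" followed by node IDs.
    header = [int(n) for n in lines[start + 1].split()[1:]]
    distances = {}
    for line in lines[start + 2:]:
        label, sep, rest = line.partition(":")
        if not sep:
            break  # end of the matrix
        src = int(label)
        for dst, value in zip(header, rest.split()):
            distances[(src, dst)] = int(value)
    return distances


# Abbreviated copy of the Grace CPU Superchip sample output above.
sample = """\
available: 2 nodes (0-1)
node distances:
node 0 1
0: 10 40
1: 40 10
"""

dist = parse_distances(sample)
print("local  (0 -> 0):", dist[(0, 0)])
print("remote (0 -> 1):", dist[(0, 1)])
```

In this sample the remote distance (40) is four times the local distance (10). These are relative values reported by the firmware rather than measured latencies.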
The following is sample output from a Grace Hopper Superchip system:
available: 9 nodes (0-8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
71
node 0 size: 490310 MB
node 0 free: 166560 MB
node 1 cpus:
node 1 size: 95232 MB
node 1 free: 92094 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node distances:
node 0 1 2 3 4 5 6 7 8
0: 10 80 80 80 80 80 80 80 80
1: 80 10 255 255 255 255 255 255 255
2: 80 255 10 255 255 255 255 255 255
3: 80 255 255 10 255 255 255 255 255
4: 80 255 255 255 10 255 255 255 255
5: 80 255 255 255 255 10 255 255 255
6: 80 255 255 255 255 255 10 255 255
7: 80 255 255 255 255 255 255 10 255
8: 80 255 255 255 255 255 255 255 10
As described in Checking the CPU, if MIG is not used, you can ignore the last seven NUMA nodes.
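To separate the nodes that matter from the ones that can be ignored, each node can be grouped by whether it has CPUs and whether it has memory. The following is a minimal sketch with hypothetical helpers `parse_nodes` and `classify_nodes`; it assumes each `node N cpus:` entry is on a single line (as numactl prints it, unlike the wrapped listing above), and the sample is shortened to four CPUs and three nodes for readability.

```python
import re


def parse_nodes(text):
    """Collect the CPU list and memory size (MB) for each node in `numactl -H` output."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) (cpus|size):\s*(.*)", line.strip())
        if not match:
            continue  # skips 'free' lines and the distance matrix
        node = nodes.setdefault(int(match.group(1)), {"cpus": [], "size_mb": 0})
        if match.group(2) == "cpus":
            node["cpus"] = [int(c) for c in match.group(3).split()]
        else:
            node["size_mb"] = int(match.group(3).split()[0])
    return nodes


def classify_nodes(nodes):
    """Group node IDs into CPU nodes, memory-only nodes, and empty nodes."""
    groups = {"cpu": [], "memory_only": [], "empty": []}
    for node_id, node in sorted(nodes.items()):
        if node["cpus"]:
            groups["cpu"].append(node_id)
        elif node["size_mb"] > 0:
            groups["memory_only"].append(node_id)
        else:
            groups["empty"].append(node_id)
    return groups


# Shortened illustration based on the Grace Hopper sample output above.
sample = """\
node 0 cpus: 0 1 2 3
node 0 size: 490310 MB
node 1 cpus:
node 1 size: 95232 MB
node 2 cpus:
node 2 size: 0 MB
"""

print(classify_nodes(parse_nodes(sample)))
```

Applied to the full Grace Hopper output, this grouping would place node 0 under `cpu`, node 1 (the GPU memory) under `memory_only`, and nodes 2-8 under `empty`.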
Checking the GPU
Run the nvidia-smi command to display the status of the GPUs in the system.
The following is sample output from a Grace Hopper Superchip system:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06             Driver Version: 535.104.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GH200 480GB                    Off | 00000009:01:00.0 Off |                    0 |
| N/A   29C    P0             108W / 900W |      0MiB / 97871MiB |      8%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
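For scripting, the same fields are easier to read from the CSV query mode of nvidia-smi than from the table. The following is a minimal sketch using the `--query-gpu` and `--format=csv,noheader` options; `parse_gpu_csv` and `query_gpus` are hypothetical helpers, and the sample line is an assumption built from the values in the table above, not captured output.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi: GPU name, total memory, current MIG mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,mig.mode.current",
    "--format=csv,noheader",
]


def parse_gpu_csv(text):
    """Parse the CSV rows produced by the query above into dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {"name": row[0], "memory_total": row[1], "mig_mode": row[2]}
        for row in rows
        if row
    ]


def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Illustrative line assembled from the table above (assumption, not real output).
sample = "GH200 480GB, 97871 MiB, Disabled\n"
gpus = parse_gpu_csv(sample)
print(gpus[0]["name"], "|", gpus[0]["memory_total"], "| MIG:", gpus[0]["mig_mode"])
```

On a machine with the driver installed, calling `query_gpus()` would replace the embedded sample.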
Checking the Memory
A common way to check the memory on a Grace system is to run the sudo dmidecode -t memory command. The following is sample output from a Grace-Grace machine:
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.6.0 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.
Handle 0x000B, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Single-bit ECC
Maximum Capacity: 480 GB
Error Information Handle: No Error
Number Of Devices: 2
Handle 0x000C, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000B
Error Information Handle: 0x0000
Total Width: 540 bits
Data Width: 480 bits
Size: 240 GB
Form Factor: Die
Set: None
Locator: Not Specified
Bank Locator: Not Specified
Type: LPDDR5
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: 9223381974177924187
Asset Tag: Not Specified
Part Number: Not Specified
Rank: 1
Configured Memory Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Memory Technology: DRAM
Memory Operating Mode Capability: None
Firmware Version: Not Specified
Module Manufacturer ID: Unknown
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: None
Cache Size: None
Logical Size: None
Handle 0x000D, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000B
Error Information Handle: 0x0000
Total Width: 540 bits
Data Width: 480 bits
Size: 240 GB
Form Factor: Die
Set: None
Locator: Not Specified
Bank Locator: Not Specified
Type: LPDDR5
Type Detail: None
Speed: Unknown
Manufacturer: Not Specified
Serial Number: 9223382071351559259
Asset Tag: Not Specified
Part Number: Not Specified
Rank: 1
Configured Memory Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Memory Technology: DRAM
Memory Operating Mode Capability: None
Firmware Version: Not Specified
Module Manufacturer ID: Unknown
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: None
Cache Size: None
Logical Size: None
The output shows two regions of LPDDR5 memory, each 240 GB in size, with one region attached to each Grace chip.
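The per-device sizes can be summed to confirm the total against the Maximum Capacity field. The following is a minimal sketch, assuming the `Memory Device` record layout shown above and sizes reported in GB; `memory_device_sizes_gb` is a hypothetical helper, and the sample is an abbreviated copy of the output.

```python
def memory_device_sizes_gb(text):
    """Return the size in GB of each 'Memory Device' record in dmidecode output."""
    sizes = []
    in_device = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Handle "):
            in_device = False  # a new DMI record begins
        elif stripped == "Memory Device":
            in_device = True
        elif in_device and stripped.startswith("Size:"):
            # Matches only the plain 'Size:' field, not 'Volatile Size:' and similar.
            parts = stripped.split(":", 1)[1].split()
            if len(parts) == 2 and parts[1] == "GB":
                sizes.append(int(parts[0]))
    return sizes


# Abbreviated copy of the sample output above.
sample = """\
Handle 0x000B, DMI type 16, 23 bytes
Physical Memory Array
    Maximum Capacity: 480 GB
    Number Of Devices: 2
Handle 0x000C, DMI type 17, 92 bytes
Memory Device
    Size: 240 GB
    Type: LPDDR5
    Volatile Size: None
Handle 0x000D, DMI type 17, 92 bytes
Memory Device
    Size: 240 GB
    Type: LPDDR5
    Volatile Size: None
"""

sizes = memory_device_sizes_gb(sample)
print(f"{len(sizes)} devices, {sum(sizes)} GB total")
```

For the sample, two 240 GB devices add up to 480 GB, which agrees with the Maximum Capacity reported for the physical memory array.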