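Understanding Your Grace Machine

After you boot your Grace machine, run the sudo ipmitool fru print command and check the information about the NVIDIA Grace module.

The following is sample output from a Grace CPU Superchip machine. The FRU device description is PG535, and the product name is C2.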

FRU Device Description  : PG535 (ID 192)
Board Mfg Date          : [REDACTED]
Board Mfg               : NVIDIA
Board Product           : PG535
Board Serial            : [REDACTED]
Board Part Number       : 699-2G535-0200-DV2
Product Manufacturer    : NVIDIA
Product Name            : C2
Product Part Number     : 900-2G535-0000-000
Product Version         : B-R00
Product Serial          : [REDACTED]

The following is sample output from a Grace Hopper Superchip machine. The FRU device description is PG530, and the product name is GH200 480GB.

FRU Device Description  : PG530 (ID 133)
Board Mfg Date          : [REDACTED]
Board Mfg               : NVIDIA
Board Product           : PG530
Board Serial            : [REDACTED]
Board Part Number       : 699-2G530-0206-QS1
Product Manufacturer    : NVIDIA
Product Name            : GH200 480GB
Product Part Number     : 900-2G530-0000-000
Product Version         : A-R00
Product Serial          : [REDACTED]
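The two outputs above differ mainly in the Board Product and Product Name fields, so a script can tell the platforms apart by reading them. The following is a minimal Python sketch, assuming the text of a single FRU record is already available as a string. The names parse_fru, identify_board, and KNOWN_BOARDS are hypothetical, and the mapping only covers the two boards shown in this section.

```python
# Sketch: classify a Grace machine from `sudo ipmitool fru print` text.
# Assumption: `text` holds one FRU record; if several records are
# concatenated, later fields overwrite earlier ones.
KNOWN_BOARDS = {
    "PG535": "Grace CPU Superchip",
    "PG530": "Grace Hopper Superchip",
}


def parse_fru(text):
    """Return the 'Key : Value' fields of one FRU record as a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def identify_board(text):
    """Map the 'Board Product' field to a platform name, if known."""
    board = parse_fru(text).get("Board Product", "")
    return KNOWN_BOARDS.get(board, "unknown board: " + repr(board))
```

To use it on a live system, pass in the captured output of the ipmitool command shown above.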

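Checking the CPU

The lscpu command-line utility in Linux gathers CPU architecture information from sysfs and /proc/cpuinfo and displays it in the terminal.

After you boot your Grace machine, run the lscpu command and check the CPU.

The following is sample output from a Grace CPU Superchip machine: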

Architecture:                   aarch64
  CPU op-mode(s):                 64-bit
  Byte Order:                     Little Endian
CPU(s):                         144
  On-line CPU(s) list:            0-143
Vendor ID:                      ARM
  Model:                          0
  Thread(s) per core:             1
  Core(s) per socket:             72
  Socket(s):                      2
  Stepping:                       r0p0
  Frequency boost:                disabled
  CPU max MHz:                    3582.0000
  CPU min MHz:                    81.0000
  BogoMIPS:                       2000.00
  Flags:                          fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp a
                                  simdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4
                                  asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs
                                  sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesh
                                  a3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
  L1d:                           9 MiB (144 instances)
  L1i:                           9 MiB (144 instances)
  L2:                            144 MiB (144 instances)
  L3:                            228 MiB (2 instances)
NUMA:
  NUMA node(s):                  2
  NUMA node0 CPU(s):             0-71
  NUMA node1 CPU(s):             72-143
Vulnerabilities:
  Itlb multihit:                 Not affected
  L1tf:                          Not affected
  Mds:                           Not affected
  Meltdown:                      Not affected
  Mmio stale data:               Not affected
  Retbleed:                      Not affected
  Spec store bypass:             Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                    Mitigation; __user pointer sanitization
  Spectre v2:                    Not affected
  Srbds:                         Not affected
  Tsx async abort:               Not affected

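From this output, you can see the number of CPU sockets, the number of cores per socket, the number of hardware threads per core, and the maximum and minimum CPU frequencies. You can also find the sizes of the L1, L2, and L3 caches.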
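The socket, core, and thread counts multiply to the total CPU count: for the Grace CPU Superchip sample, 2 sockets x 72 cores x 1 thread = 144 CPUs. The following Python sketch checks this arithmetic against lscpu text. The function names parse_lscpu and logical_cpus are hypothetical, and the parser is deliberately simple: it keeps the first occurrence of each key and skips wrapped continuation lines such as the Flags list.

```python
def parse_lscpu(text):
    """Collect 'Key: Value' pairs from lscpu text.

    Section headers with no value (for example 'Caches (sum of all):')
    and wrapped continuation lines are skipped.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            info.setdefault(key.strip(), value.strip())
    return info


def logical_cpus(info):
    """Sockets x cores per socket x threads per core."""
    return (int(info["Socket(s)"])
            * int(info["Core(s) per socket"])
            * int(info["Thread(s) per core"]))
```

On a healthy system, logical_cpus(info) is expected to equal int(info["CPU(s)"]); a mismatch may indicate offline cores.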

The following is sample output from a Grace Hopper Superchip system:

Architecture:                   aarch64
  CPU op-mode(s):                 64-bit
  Byte Order:                     Little Endian
CPU(s):                         72
  On-line CPU(s) list:            0-71
Vendor ID:                      ARM
  Model:                          0
  Thread(s) per core:             1
  Core(s) per socket:             72
  Socket(s):                      1
  Stepping:                       r0p0
  Frequency boost:                disabled
  CPU max MHz:                    3591.0000
  CPU min MHz:                    81.0000
  BogoMIPS:                       2000.00
  Flags:                          fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
                                  cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha
                                  512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpod
                                  p sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint s
                                  vei8mm svebf16 i8mm bf16 dgh bti
Caches (sum of all):
  L1d:                           4.5 MiB (72 instances)
  L1i:                           4.5 MiB (72 instances)
  L2:                            72 MiB (72 instances)
  L3:                            114 MiB (1 instance)
NUMA:
  NUMA node(s):                  9
  NUMA node0 CPU(s):             0-71
  NUMA node1 CPU(s):
  NUMA node2 CPU(s):
  NUMA node3 CPU(s):
  NUMA node4 CPU(s):
  NUMA node5 CPU(s):
  NUMA node6 CPU(s):
  NUMA node7 CPU(s):
  NUMA node8 CPU(s):
Vulnerabilities:
  Itlb multihit:                 Not affected
  L1tf:                          Not affected
  Mds:                           Not affected
  Meltdown:                      Not affected
  Mmio stale data:               Not affected
  Retbleed:                      Not affected
  Spec store bypass:             Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:                    Mitigation; __user pointer sanitization
  Spectre v2:                    Not affected
  Srbds:                         Not affected
  Tsx async abort:               Not affected

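Note

This output shows nine NUMA nodes. The first node corresponds to the Grace CPU, the second node corresponds to the Hopper GPU, and the remaining seven nodes correspond to NVIDIA Multi-Instance GPU (MIG) instances.

If MIG mode is not in use, you can ignore these seven MIG nodes.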

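Checking the Non-Uniform Memory Access Settings

The lscpu output contains basic information about the non-uniform memory access (NUMA) settings on your Grace machine.

To learn more about the NUMA settings, run the numactl -H command. The following is sample output from a Grace CPU Superchip machine: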

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
68 69 70 71
node 0 size: 245090 MB
node 0 free: 99633 MB
node 1 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 1 size: 245317 MB
node 1 free: 126895 MB
node distances:
node 0   1
  0: 10 40
  1: 40 10

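The output shows that this machine has two NUMA nodes, which CPU cores belong to each node, and the total and free memory on each node. It also shows the distances between NUMA nodes, which help the kernel scheduler run application threads on the CPU cores closest to the memory where their data resides.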
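To work with these values in a script, the numactl -H text can be turned into a small data structure. The following is a rough Python sketch, assuming the text layout shown above, including CPU lists that wrap onto continuation lines. The function name parse_numactl and the dictionary keys are hypothetical.

```python
import re


def parse_numactl(text):
    """Parse `numactl -H` text into per-node info and a distance matrix."""
    nodes = {}           # node id -> {"cpus": [...], "size_mb": int, "free_mb": int}
    distances = {}       # node id -> list of distances, in node order
    current_cpus = None  # CPU list still open for wrapped continuation lines
    in_distances = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("node distances"):
            in_distances, current_cpus = True, None
            continue
        if in_distances:
            # Rows look like "0: 10 40"; the header row has no colon.
            m = re.match(r"(\d+):\s+(.*)", line)
            if m:
                distances[int(m.group(1))] = [int(d) for d in m.group(2).split()]
            continue
        m = re.match(r"node (\d+) (cpus|size|free):\s*(.*)", line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {"cpus": []})
            kind, rest = m.group(2), m.group(3)
            if kind == "cpus":
                node["cpus"] = [int(c) for c in rest.split()]
                current_cpus = node["cpus"]
            else:
                node[kind + "_mb"] = int(rest.split()[0])
                current_cpus = None
        elif current_cpus is not None and re.fullmatch(r"[\d\s]+", line):
            # Continuation of a wrapped "node N cpus:" line.
            current_cpus.extend(int(c) for c in line.split())
    return nodes, distances
```

With the result, distances[0][1] gives the distance from node 0 to node 1, which is 40 in the sample above.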

The following is sample output from a Grace Hopper Superchip system:

available: 9 nodes (0-8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
71
node 0 size: 490310 MB
node 0 free: 166560 MB
node 1 cpus:
node 1 size: 95232 MB
node 1 free: 92094 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node distances:
node     0   1    2    3    4    5    6    7    8
     0:  10  80   80   80   80   80   80   80   80
     1:  80  10   255  255  255  255  255  255  255
     2:  80  255  10   255  255  255  255  255  255
     3:  80  255  255  10   255  255  255  255  255
     4:  80  255  255  255  10   255  255  255  255
     5:  80  255  255  255  255  10   255  255  255
     6:  80  255  255  255  255  255  10   255  255
     7:  80  255  255  255  255  255  255  10   255
     8:  80  255  255  255  255  255  255  255  10

As described in Checking the CPU, you can ignore the last seven NUMA nodes if MIG is not in use.
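One way to separate the nodes that matter from the unused ones is to look at CPU count and memory size together. The following Python sketch is illustrative only: the labels and the classify_nodes helper are hypothetical, and the per-node values are copied by hand from the Grace Hopper sample output above.

```python
def classify_nodes(nodes):
    """nodes: {node_id: (cpu_count, size_mb)} -> {node_id: label}."""
    labels = {}
    for node_id, (cpu_count, size_mb) in nodes.items():
        if cpu_count > 0:
            labels[node_id] = "cpu"          # has CPU cores (Grace CPU)
        elif size_mb > 0:
            labels[node_id] = "memory-only"  # memory but no cores (GPU memory)
        else:
            labels[node_id] = "empty"        # no cores, no memory
    return labels


# Values copied from the Grace Hopper sample output above.
gh200_nodes = {0: (72, 490310), 1: (0, 95232)}
gh200_nodes.update({n: (0, 0) for n in range(2, 9)})

labels = classify_nodes(gh200_nodes)
in_use = [n for n, label in labels.items() if label != "empty"]
print(in_use)  # prints [0, 1]
```

In this sample, the seven nodes labeled empty are the ones the note says can be ignored when MIG is not in use.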

Checking the GPU

Run the nvidia-smi command to display the status of the GPUs in the system.

The following is sample output from a Grace Hopper Superchip system:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.06             Driver Version: 535.104.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GH200 480GB                    Off | 00000009:01:00.0 Off |                    0 |
| N/A   29C    P0             108W / 900W |      0MiB / 97871MiB |      8%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
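The table above is meant for people to read. For scripts, nvidia-smi also provides a CSV query mode through the --query-gpu and --format options. The following Python sketch builds such a query and parses its output. The field selection and the names FIELDS, QUERY_CMD, and parse_gpu_csv are choices made for this example, and the exact strings the driver returns (for example, how the GPU name is spelled) may differ from the table above.

```python
import csv
import io

# Fields to request; each is a documented --query-gpu property.
FIELDS = ["name", "memory.total", "memory.used",
          "utilization.gpu", "mig.mode.current"]

# Command whose output parse_gpu_csv() expects. Run it with, for example,
# subprocess.run(QUERY_CMD, capture_output=True, text=True).stdout
QUERY_CMD = ["nvidia-smi",
             "--query-gpu=" + ",".join(FIELDS),
             "--format=csv,noheader,nounits"]


def parse_gpu_csv(text):
    """Turn each CSV row into a dict keyed by the queried field names."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row)))
            for row in reader if row]
```

Each returned dictionary describes one GPU, so a single-GPU Grace Hopper system is expected to yield a list with one entry.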

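Checking the Memory

A common way to check the memory on a Grace system is to run the sudo dmidecode -t memory command. The following is sample output from a Grace CPU Superchip machine: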

# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.6.0 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.
Handle 0x000B, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 480 GB
        Error Information Handle: No Error
        Number Of Devices: 2
Handle 0x000C, DMI type 17, 92 bytes
Memory Device
       Array Handle: 0x000B
       Error Information Handle: 0x0000
       Total Width: 540 bits
       Data Width: 480 bits
       Size: 240 GB
       Form Factor: Die
       Set: None
       Locator: Not Specified
       Bank Locator: Not Specified
       Type: LPDDR5
       Type Detail: None
       Speed: Unknown
       Manufacturer: Not Specified
       Serial Number: 9223381974177924187
       Asset Tag: Not Specified
       Part Number: Not Specified
       Rank: 1
       Configured Memory Speed: Unknown
       Minimum Voltage: Unknown
       Maximum Voltage: Unknown
       Configured Voltage: Unknown
       Memory Technology: DRAM
       Memory Operating Mode Capability: None
       Firmware Version: Not Specified
       Module Manufacturer ID: Unknown
       Module Product ID: Unknown
       Memory Subsystem Controller Manufacturer ID: Unknown
       Memory Subsystem Controller Product ID: Unknown
       Non-Volatile Size: None
       Volatile Size: None
       Cache Size: None
       Logical Size: None
Handle 0x000D, DMI type 17, 92 bytes
Memory Device
       Array Handle: 0x000B
       Error Information Handle: 0x0000
       Total Width: 540 bits
       Data Width: 480 bits
       Size: 240 GB
       Form Factor: Die
       Set: None
       Locator: Not Specified
       Bank Locator: Not Specified
       Type: LPDDR5
       Type Detail: None
       Speed: Unknown
       Manufacturer: Not Specified
       Serial Number: 9223382071351559259
       Asset Tag: Not Specified
       Part Number: Not Specified
       Rank: 1
       Configured Memory Speed: Unknown
       Minimum Voltage: Unknown
       Maximum Voltage: Unknown
       Configured Voltage: Unknown
       Memory Technology: DRAM
       Memory Operating Mode Capability: None
       Firmware Version: Not Specified
       Module Manufacturer ID: Unknown
       Module Product ID: Unknown
       Memory Subsystem Controller Manufacturer ID: Unknown
       Memory Subsystem Controller Product ID: Unknown
       Non-Volatile Size: None
       Volatile Size: None
       Cache Size: None
       Logical Size: None

The output shows two LPDDR5 memory regions of 240 GB each, one for each Grace chip.
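The two 240 GB regions add up to the 480 GB reported as Maximum Capacity. The following Python sketch performs that check on dmidecode text. The function name memory_device_sizes_gb is hypothetical, and the parser assumes the layout shown above, where each Memory Device entry has a line that starts with Size: followed by a number and a unit.

```python
def memory_device_sizes_gb(text):
    """Return the size in GB of every populated 'Memory Device' entry."""
    units = {"MB": 1 / 1024, "GB": 1, "TB": 1024}
    sizes = []
    in_device = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            in_device = False          # a new DMI structure starts
        elif line == "Memory Device":
            in_device = True
        elif in_device and line.startswith("Size:"):
            # Only the plain "Size:" field; "Cache Size:" etc. do not match.
            parts = line.split()[1:]   # e.g. ["240", "GB"]
            if len(parts) == 2 and parts[0].isdigit() and parts[1] in units:
                sizes.append(int(parts[0]) * units[parts[1]])
    return sizes
```

For the sample output above, the function would return two entries of 240 each, whose sum is 480.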