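Troubleshoot MLAG Issues

This topic outlines a few scenarios that illustrate how you use NetQ to troubleshoot Multi-Chassis Link Aggregation (MLAG) on Cumulus Linux switches. Each scenario starts with a log message that indicates the current MLAG state.

NetQ can monitor many aspects of an MLAG configuration, including:

  • Verifying the current state of all nodes
  • Verifying the dual connectivity state
  • Checking that the peer link is part of the bridge
  • Verifying that MLAG bonds are not bridge members
  • Verifying that the VXLAN interface is not a bridge member
  • Checking for remote-side service failures caused by systemctl
  • Checking for VLAN-VNI mapping mismatches
  • Checking for layer 3 MTU mismatches on peer link subinterfaces
  • Checking for VXLAN active-active address inconsistencies
  • Verifying that STP priorities are the same across both peers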

Scenario 1: All Nodes Are Up

When the MLAG configuration is running smoothly, NetQ sends a message indicating that all nodes are up:

2017-05-22T23:13:09.683429+00:00 noc-pr netq-notifier[5501]: INFO: CLAG: All nodes are up

Running netq show mlag confirms this:

cumulus@switch:~$ netq show mlag
Matching clag records:
Hostname          Peer              SysMac             State      Backup #Bond #Dual Last Changed
                                                                            s
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P)        spine02           00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:49 2019
spine02           spine01(P)        00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:53 2019
leaf01(P)         leaf02            44:38:39:ff:ff:01  up         up     12    12    Thu Feb  7 18:31:15 2019
leaf02            leaf01(P)         44:38:39:ff:ff:01  up         up     12    12    Thu Feb  7 18:31:20 2019
leaf03(P)         leaf04            44:38:39:ff:ff:02  up         up     12    12    Thu Feb  7 18:31:26 2019
leaf04            leaf03(P)         44:38:39:ff:ff:02  up         up     12    12    Thu Feb  7 18:31:30 2019

You can also verify that a specific node is up:

cumulus@switch:~$ netq spine01 show mlag
Matching mlag records:
Hostname          Peer              SysMac             State      Backup #Bond #Dual Last Changed
                                                                            s
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P)        spine02           00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:49 2019

Similarly, checking the MLAG state with NetQ also confirms this:

cumulus@switch:~$ netq check mlag
clag check result summary:

Total nodes         : 6
Checked nodes       : 6
Failed nodes        : 0
Rotten nodes        : 0
Warning nodes       : 0


Peering Test             : passed
Backup IP Test           : passed
Clag SysMac Test         : passed
VXLAN Anycast IP Test    : passed
Bridge Membership Test   : passed
Spanning Tree Test       : passed
Dual Home Test           : passed
Single Home Test         : passed
Conflicted Bonds Test    : passed
ProtoDown Bonds Test     : passed
SVI Test                 : passed

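NVIDIA deprecated the clag keyword and replaced it with the mlag keyword. The clag keyword still works for now, but you should start using the mlag keyword instead. Remember to also update any scripts that use the clag keyword.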
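One way to update scripts that still use the deprecated keyword is a whole-word substitution restricted to lines that call netq. The following is a minimal sketch, not an NVIDIA-provided tool: the function names migrate_line and migrate_script are hypothetical, and it assumes the keyword appears as a standalone word in commands of the form netq check clag.

```python
import re

# Matches "clag" only as a whole word, so clagd and clagctl are left untouched.
CLAG_KEYWORD = re.compile(r"\bclag\b")
NETQ_CALL = re.compile(r"\bnetq\b")


def migrate_line(line: str) -> str:
    """Replace the deprecated clag keyword with mlag on lines that call netq."""
    if NETQ_CALL.search(line) is None:
        return line  # not a NetQ command; leave unchanged
    return CLAG_KEYWORD.sub("mlag", line)


def migrate_script(text: str) -> str:
    """Apply migrate_line to every line, preserving the original line endings."""
    return "".join(migrate_line(line) for line in text.splitlines(keepends=True))


print(migrate_script("netq check clag json\nsudo clagctl\n"), end="")
```

The substitution is case-sensitive and rewrites every standalone clag on a matching line, so review the resulting diff before committing it.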

When you are logged in directly to a switch, you can run clagctl to get the state:

cumulus@switch:/var/log$ sudo clagctl
    
The peer is alive
Peer Priority, ID, and Role: 4096 00:02:00:00:00:4e primary
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 secondary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (active)
System MAC: 44:38:39:ff:ff:01
    
CLAG Interfaces
Our Interface    Peer Interface   CLAG Id Conflicts            Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38            vx-38            -       -                    -
vx-33            vx-33            -       -                    -
hostbond4        hostbond4        1       -                    -
hostbond5        hostbond5        2       -                    -
vx-37            vx-37            -       -                    -
vx-36            vx-36            -       -                    -
vx-35            vx-35            -       -                    -
vx-34            vx-34            -       -                    -

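Scenario 2: Dual-connected Bond Is Down

When dual connectivity is lost in an MLAG configuration, you receive messages from NetQ similar to the following: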

2017-05-22T23:14:40.290918+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 1 link(s) are down. They are: spine01 hostbond5
2017-05-22T23:14:53.081480+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 1 node(s) have failures. They are: spine01
2017-05-22T23:14:58.161267+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01
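If you forward these notifications to another system, it can help to split each line into its parts. The sketch below is an illustration only: the line layout (timestamp, host, netq-notifier[pid], severity, subsystem, text) is inferred from the sample messages in this topic rather than from a documented format, and the names NotifierEvent and parse_notifier_line are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Layout inferred from the sample messages in this topic:
# <timestamp> <host> netq-notifier[<pid>]: <SEVERITY>: <SUBSYSTEM>: <text>
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<host>\S+)\s+netq-notifier\[\d+\]:\s+"
    r"(?P<severity>[A-Z]+):\s+(?P<subsystem>[A-Z]+):\s+(?P<text>.*)$"
)


class NotifierEvent(NamedTuple):
    timestamp: str
    severity: str
    subsystem: str
    text: str
    affected: List[str]


def parse_notifier_line(line: str) -> Optional[NotifierEvent]:
    """Return the parsed event, or None if the line does not match the layout."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    text = match.group("text")
    # Entries after "They are:" are comma-separated in the samples.
    _, marker, tail = text.partition("They are:")
    affected = [item.strip() for item in tail.split(",") if item.strip()] if marker else []
    return NotifierEvent(
        match.group("timestamp"),
        match.group("severity"),
        match.group("subsystem"),
        text,
        affected,
    )


SAMPLE = (
    "2017-05-22T23:14:58.161267+00:00 noc-pr netq-notifier[5501]: "
    "WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01"
)
event = parse_notifier_line(SAMPLE)
print(event.subsystem, event.affected)
```

For LINK messages each entry is a hostname and interface pair such as spine01 hostbond5, so a caller would split those entries once more on whitespace.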

To begin your investigation, show the status of the clagd service:

cumulus@switch:~$ netq spine01 show services clagd
Matching services records:
Hostname          Service              PID   VRF             Enabled Active Monitored Status           Uptime                    Last Changed
----------------- -------------------- ----- --------------- ------- ------ --------- ---------------- ------------------------- -------------------------
spine01           clagd                2678  default         yes     yes    yes       ok               23h:57m:16s               Thu Feb  7 18:30:49 2019

Checking the MLAG status provides the reason for the failure:

cumulus@switch:~$ netq check mlag
Checked Nodes: 6, Warning Nodes: 2
Node             Reason
---------------- --------------------------------------------------------------------------
spine01          Link Down: hostbond5
leaf01           Singly Attached Bonds: hostbond5

You can retrieve the output in JSON format for export to another tool:

cumulus@switch:~$ netq check mlag json
{
    "warningNodes": [
        { 
            "node": "spine01", 
            "reason": "Link Down: hostbond5" 
        }
        ,
        { 
            "node": "leaf01", 
            "reason": "Singly Attached Bonds: hostbond5" 
        }
    ],
    "failedNodes":[
    ],
    "summary":{
        "checkedNodeCount":6,
        "failedNodeCount":0,
        "warningNodeCount":2
    }
}
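As an illustration of consuming this JSON in another tool, the sketch below groups the reported reasons by node. It relies only on the keys visible in the example output (warningNodes, failedNodes, node, reason); the function name reasons_by_node is hypothetical, and the sample is adapted from the output shown in this topic.

```python
import json
from collections import defaultdict

# Adapted from the netq check mlag json example in this topic.
SAMPLE = """
{
    "warningNodes": [
        {"node": "spine01", "reason": "Link Down: hostbond5"},
        {"node": "leaf01", "reason": "Singly Attached Bonds: hostbond5"}
    ],
    "failedNodes": [],
    "summary": {"checkedNodeCount": 6, "failedNodeCount": 0, "warningNodeCount": 2}
}
"""


def reasons_by_node(raw: str) -> dict:
    """Map each node name to a list of (level, reason) pairs."""
    report = json.loads(raw)
    grouped = defaultdict(list)
    for level, key in (("warning", "warningNodes"), ("failed", "failedNodes")):
        # Either list may be absent from the output, so default to empty.
        for entry in report.get(key, []):
            grouped[entry["node"]].append((level, entry["reason"]))
    return dict(grouped)


for node, findings in reasons_by_node(SAMPLE).items():
    for level, reason in findings:
        print(f"{node}: {level}: {reason}")
```

To run this against live data, one option is to capture the command output with subprocess.run(["netq", "check", "mlag", "json"], capture_output=True, text=True).stdout and pass it to reasons_by_node.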

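After you fix the issue, you can show the MLAG state to see whether all the nodes are up. The notifications from NetQ indicate that all nodes are up, and the netq check command also reports no failures.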

cumulus@switch:~$ netq show mlag
    
Matching clag records:
Hostname          Peer              SysMac             State      Backup #Bond #Dual Last Changed
                                                                            s
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P)        spine02           00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:49 2019
spine02           spine01(P)        00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:53 2019
leaf01(P)         leaf02            44:38:39:ff:ff:01  up         up     12    12    Thu Feb  7 18:31:15 2019
leaf02            leaf01(P)         44:38:39:ff:ff:01  up         up     12    12    Thu Feb  7 18:31:20 2019
leaf03(P)         leaf04            44:38:39:ff:ff:02  up         up     12    12    Thu Feb  7 18:31:26 2019
leaf04            leaf03(P)         44:38:39:ff:ff:02  up         up     12    12    Thu Feb  7 18:31:30 2019

When you are logged in directly to a switch, you can run clagctl to get the state:

cumulus@switch:/var/log$ sudo clagctl
    
The peer is alive
Peer Priority, ID, and Role: 4096 00:02:00:00:00:4e primary
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 secondary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (active)
System MAC: 44:38:39:ff:ff:01
    
CLAG Interfaces
Our Interface    Peer Interface   CLAG Id Conflicts            Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38            vx-38            -       -                    -
vx-33            vx-33            -       -                    -
hostbond4        hostbond4        1       -                    -
hostbond5        -                2       -                    -
vx-37            vx-37            -       -                    -
vx-36            vx-36            -       -                    -
vx-35            vx-35            -       -                    -
vx-34            vx-34            -       -                    -

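Scenario 3: VXLAN Active-active Device or Interface Is Down

When a VXLAN active-active device or interface in an MLAG configuration is down, the log messages also include VXLAN checks: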

2017-05-22T23:16:51.517522+00:00 noc-pr netq-notifier[5501]: WARNING: VXLAN: 2 node(s) have failures. They are: spine01, leaf01
2017-05-22T23:16:51.525403+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 2 link(s) are down. They are: leaf01 vx-37, spine01 vx-37
2017-05-22T23:17:04.703044+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01

To begin your investigation, show the status of the clagd service:

cumulus@switch:~$ netq spine01 show services clagd
    
Matching services records:
Hostname          Service              PID   VRF             Enabled Active Monitored Status           Uptime                    Last Changed
----------------- -------------------- ----- --------------- ------- ------ --------- ---------------- ------------------------- -------------------------
spine01           clagd                2678  default         yes     yes    yes       error            23h:57m:16s               Thu Feb  7 18:30:49 2019

Checking the MLAG status provides the reason for the failure:

cumulus@switch:~$ netq check mlag
Checked Nodes: 6, Warning Nodes: 2, Failed Nodes: 2
Node             Reason
---------------- --------------------------------------------------------------------------
spine01          Protodown Bonds: vx-37:vxlan-single
leaf01           Protodown Bonds: vx-37:vxlan-single

You can retrieve the output in JSON format for export to another tool:

cumulus@switch:~$ netq check mlag json
{
    "failedNodes": [
        { 
            "node": "spine01", 
            "reason": "Protodown Bonds: vx-37:vxlan-single" 
        }
        ,
        { 
            "node": "leaf01", 
            "reason": "Protodown Bonds: vx-37:vxlan-single" 
        }
    ],
    "summary":{ 
            "checkedNodeCount": 6, 
            "failedNodeCount": 2, 
            "warningNodeCount": 2 
    }
}
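A monitoring wrapper might only need the summary counts from this output. The sketch below maps them to an exit status; the 0/1/2 values follow a common monitoring-plugin convention and are an assumption here, not something NetQ defines, and the function name exit_status is hypothetical.

```python
import json

# Assumed convention (not defined by NetQ): 0 = ok, 1 = warning, 2 = critical.
OK, WARNING, CRITICAL = 0, 1, 2

# Summary values taken from the example output in this topic.
SAMPLE = (
    '{"summary": {"checkedNodeCount": 6, '
    '"failedNodeCount": 2, "warningNodeCount": 2}}'
)


def exit_status(raw: str) -> int:
    """Derive a status code from the summary block of the check output."""
    summary = json.loads(raw).get("summary", {})
    if summary.get("failedNodeCount", 0) > 0:
        return CRITICAL  # failures take precedence over warnings
    if summary.get("warningNodeCount", 0) > 0:
        return WARNING
    return OK


print(exit_status(SAMPLE))
```

A script could pass the result to sys.exit so that a scheduler or monitoring system reacts to failed nodes before warning nodes.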

After you fix the issue, you can show the MLAG state to see whether all the nodes are up:

cumulus@switch:~$ netq show mlag
Matching clag session records are:
Hostname          Peer              SysMac             State      Backup #Bond #Dual Last Changed
                                                                         s
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P)        spine02           00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:49 2019
spine02           spine01(P)        00:01:01:10:00:01  up         up     24    24    Thu Feb  7 18:30:53 2019
leaf01(P)         leaf02            44:38:39:ff:ff:01  up         up     12    12    Thu Feb  7 18:31:15 2019
leaf02            leaf01(P)         44:38:39:ff:ff:01  up         up     12    12    Thu Feb  7 18:31:20 2019
leaf03(P)         leaf04            44:38:39:ff:ff:02  up         up     12    12    Thu Feb  7 18:31:26 2019
leaf04            leaf03(P)         44:38:39:ff:ff:02  up         up     12    12    Thu Feb  7 18:31:30 2019

When you are logged in directly to a switch, you can run clagctl to get the state:

cumulus@switch:/var/log$ sudo clagctl
 
The peer is alive
Peer Priority, ID, and Role: 4096 00:02:00:00:00:4e primary
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 secondary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (active)
System MAC: 44:38:39:ff:ff:01
 
CLAG Interfaces
Our Interface    Peer Interface   CLAG Id Conflicts            Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38            vx-38            -       -                    -
vx-33            vx-33            -       -                    -
hostbond4        hostbond4        1       -                    -
hostbond5        hostbond5        2       -                    -
vx-37            -                -       -                    vxlan-single
vx-36            vx-36            -       -                    -
vx-35            vx-35            -       -                    -
vx-34            vx-34            -       -                    -
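To spot rows like vx-37 without reading the table by eye, the CLAG Interfaces section can be scanned with a short script. This is a sketch under stated assumptions: it expects the five-column layout shown in the clagctl output above, assumes the first four columns contain no spaces, and treats a "-" peer or a non-"-" conflict or proto-down value as worth a look. The names parse_clag_interfaces and needs_attention are hypothetical.

```python
from typing import Dict, List

# A few rows copied from the clagctl output shown in this topic.
SAMPLE = """\
CLAG Interfaces
Our Interface    Peer Interface   CLAG Id Conflicts            Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
hostbond4        hostbond4        1       -                    -
hostbond5        hostbond5        2       -                    -
vx-37            -                -       -                    vxlan-single
vx-36            vx-36            -       -                    -
"""

FIELDS = ("interface", "peer", "clag_id", "conflicts", "proto_down")


def parse_clag_interfaces(output: str) -> List[Dict[str, str]]:
    """Return one dict per row that follows the dashed separator line."""
    rows = []
    in_table = False
    for line in output.splitlines():
        if line.startswith("----"):
            in_table = True  # data rows follow the separator
            continue
        if not in_table:
            continue
        if not line.strip():
            break  # a blank line ends the table
        # Split into at most five cells; only the last may contain spaces.
        cells = line.strip().split(None, 4)
        if len(cells) == len(FIELDS):
            rows.append(dict(zip(FIELDS, cells)))
    return rows


def needs_attention(row: Dict[str, str]) -> bool:
    """Heuristic: no peer interface, a conflict, or a proto-down reason."""
    return row["peer"] == "-" or row["conflicts"] != "-" or row["proto_down"] != "-"


for row in parse_clag_interfaces(SAMPLE):
    if needs_attention(row):
        print(row["interface"], row["proto_down"])
```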

Scenario 4: Remote-side clagd Stopped by systemctl Command

If you stop the clagd service with the systemctl command, NetQ Notifier sends messages similar to the following:

2017-05-22T23:51:19.539033+00:00 noc-pr netq-notifier[5501]: WARNING: VXLAN: 1 node(s) have failures. They are: leaf01
2017-05-22T23:51:19.622379+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 2 link(s) flapped and are down. They are: leaf01 hostbond5, leaf01 hostbond4
2017-05-22T23:51:19.622922+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 23 link(s) are down. They are: leaf01 VlanA-1-104-v0, leaf01 VlanA-1-101-v0, leaf01 VlanA-1, leaf01 vx-33, leaf01 vx-36, leaf01 vx-37, leaf01 vx-34, leaf01 vx-35, leaf01 swp7, leaf01 VlanA-1-102-v0, leaf01 VlanA-1-103-v0, leaf01 VlanA-1-100-v0, leaf01 VlanA-1-106-v0, leaf01 swp8, leaf01 VlanA-1.106, leaf01 VlanA-1.105, leaf01 VlanA-1.104, leaf01 VlanA-1.103, leaf01 VlanA-1.102, leaf01 VlanA-1.101, leaf01 VlanA-1.100, leaf01 VlanA-1-105-v0, leaf01 vx-38
2017-05-22T23:51:27.696572+00:00 noc-pr netq-notifier[5501]: INFO: LINK: 15 link(s) are up. They are: leaf01 VlanA-1.106, leaf01 VlanA-1-104-v0, leaf01 VlanA-1.104, leaf01 VlanA-1.103, leaf01 VlanA-1.101, leaf01 VlanA-1-100-v0, leaf01 VlanA-1.100, leaf01 VlanA-1.102, leaf01 VlanA-1-101-v0, leaf01 VlanA-1-102-v0, leaf01 VlanA-1.105, leaf01 VlanA-1-103-v0, leaf01 VlanA-1-106-v0, leaf01 VlanA-1, leaf01 VlanA-1-105-v0
2017-05-22T23:51:36.156708+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01

Showing the MLAG state reveals which nodes are down:

cumulus@switch:~$ netq show mlag
Matching CLAG session records are:
Node             Peer             SysMac            State Backup #Bonds #Dual Last Changed
---------------- ---------------- ----------------- ----- ------ ------ ----- -------------------------
spine01(P)       spine02          00:01:01:10:00:01 up    up     9      9     Thu Feb  7 18:30:53 2019
spine02          spine01(P)       00:01:01:10:00:01 up    up     9      9     Thu Feb  7 18:31:04 2019
leaf01                            44:38:39:ff:ff:01 down  n/a    0      0     Thu Feb  7 18:31:13 2019
leaf03(P)        leaf04           44:38:39:ff:ff:02 up    up     8      8     Thu Feb  7 18:31:19 2019
leaf04           leaf03(P)        44:38:39:ff:ff:02 up    up     8      8     Thu Feb  7 18:31:25 2019
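The down session can also be picked out of this table programmatically. The sketch below is one possible approach, not a NetQ feature: because the Peer cell can be empty, it counts fields from the right-hand end of each row, relying on the timestamp always splitting into five tokens. It assumes that a healthy session has State up and equal #Bonds and #Dual values; the names parse_mlag_sessions and unhealthy are hypothetical.

```python
from typing import Dict, List

# Two rows taken from the netq show mlag output in this scenario.
SAMPLE = """\
Node             Peer             SysMac            State Backup #Bonds #Dual Last Changed
---------------- ---------------- ----------------- ----- ------ ------ ----- -------------------------
spine01(P)       spine02          00:01:01:10:00:01 up    up     9      9     Thu Feb  7 18:30:53 2019
leaf01                            44:38:39:ff:ff:01 down  n/a    0      0     Thu Feb  7 18:31:13 2019
"""


def parse_mlag_sessions(output: str) -> List[Dict[str, str]]:
    """Parse data rows that follow the dashed separator line."""
    sessions = []
    in_table = False
    for line in output.splitlines():
        if line.startswith("----"):
            in_table = True
            continue
        tokens = line.split()
        # A data row has 11 tokens without a peer and 12 with one;
        # the timestamp accounts for the last five.
        if not in_table or len(tokens) not in (11, 12):
            continue
        sessions.append({
            # Drop the "(P)" marker that follows some hostnames in the output.
            "node": tokens[0].replace("(P)", ""),
            "peer": tokens[1].replace("(P)", "") if len(tokens) == 12 else "",
            "state": tokens[-9],
            "backup": tokens[-8],
            "bonds": tokens[-7],
            "dual": tokens[-6],
        })
    return sessions


def unhealthy(session: Dict[str, str]) -> bool:
    """Assumed health rule: state is up and every bond is dual-connected."""
    return session["state"] != "up" or session["bonds"] != session["dual"]


for session in parse_mlag_sessions(SAMPLE):
    if unhealthy(session):
        print(session["node"], session["state"])
```

Counting from the right avoids depending on exact column alignment, which can vary between outputs.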

Checking the MLAG status provides the reason for the failure:

cumulus@switch:~$ netq check mlag
Checked Nodes: 6, Warning Nodes: 1, Failed Nodes: 2
Node             Reason
---------------- --------------------------------------------------------------------------
spine01          Peer Connectivity failed
leaf01           Peer Connectivity failed

You can retrieve the output in JSON format for export to another tool:

cumulus@switch:~$ netq check mlag json
{
    "failedNodes": [
        { 
            "node": "spine01", 
            "reason": "Peer Connectivity failed" 
        }
        ,
        { 
            "node": "leaf01", 
            "reason": "Peer Connectivity failed" 
        }
    ],
    "summary":{ 
        "checkedNodeCount": 6, 
        "failedNodeCount": 2, 
        "warningNodeCount": 1 
    }
}

When you are logged in directly to a switch, you can run clagctl to get the state:

cumulus@switch:~$ sudo clagctl
    
The peer is not alive
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 primary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (inactive)
System MAC: 44:38:39:ff:ff:01
    
CLAG Interfaces
Our Interface    Peer Interface   CLAG Id Conflicts            Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38            -                -       -                    -
vx-33            -                -       -                    -
hostbond4        -                1       -                    -
hostbond5        -                2       -                    -
vx-37            -                -       -                    -
vx-36            -                -       -                    -
vx-35            -                -       -                    -
vx-34            -                -       -                    -