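Troubleshoot MLAG Issues
This topic outlines a few scenarios that illustrate how to use NetQ to troubleshoot Multi-Chassis Link Aggregation (MLAG) on Cumulus Linux switches. Each scenario starts with a log message that indicates the current MLAG state.
NetQ can monitor many aspects of an MLAG configuration, including:
- Verifying the current state of all nodes
- Verifying the dual connectivity state
- Checking that the peer link is part of the bridge
- Verifying that MLAG bonds are not bridge members
- Verifying that the VXLAN interface is not a bridge member
- Checking for remote-side service failures caused by systemctl
- Checking for VLAN-VNI mapping mismatches
- Checking for layer 3 MTU mismatches on peer link subinterfaces
- Checking for VXLAN active-active address inconsistencies
- Verifying that STP priorities are the same across both peers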
Scenario 1: All Nodes Are Up
When the MLAG configuration is running smoothly, NetQ sends a message indicating that all nodes are up:
2017-05-22T23:13:09.683429+00:00 noc-pr netq-notifier[5501]: INFO: CLAG: All nodes are up
Running netq show mlag confirms this:
cumulus@switch:~$ netq show mlag
Matching clag records:
Hostname Peer SysMac State Backup #Bonds #Dual Last Changed
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P) spine02 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:49 2019
spine02 spine01(P) 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:53 2019
leaf01(P) leaf02 44:38:39:ff:ff:01 up up 12 12 Thu Feb 7 18:31:15 2019
leaf02 leaf01(P) 44:38:39:ff:ff:01 up up 12 12 Thu Feb 7 18:31:20 2019
leaf03(P) leaf04 44:38:39:ff:ff:02 up up 12 12 Thu Feb 7 18:31:26 2019
leaf04 leaf03(P) 44:38:39:ff:ff:02 up up 12 12 Thu Feb 7 18:31:30 2019
You can also verify that a specific node is up:
cumulus@switch:~$ netq spine01 show mlag
Matching mlag records:
Hostname Peer SysMac State Backup #Bonds #Dual Last Changed
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P) spine02 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:49 2019
Similarly, checking the MLAG state with NetQ also confirms this:
cumulus@switch:~$ netq check mlag
clag check result summary:
Total nodes : 6
Checked nodes : 6
Failed nodes : 0
Rotten nodes : 0
Warning nodes : 0
Peering Test : passed
Backup IP Test : passed
Clag SysMac Test : passed
VXLAN Anycast IP Test : passed
Bridge Membership Test : passed
Spanning Tree Test : passed
Dual Home Test : passed
Single Home Test : passed
Conflicted Bonds Test : passed
ProtoDown Bonds Test : passed
SVI Test : passed
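The summary above is a list of "name : value" lines, so a script can check it mechanically. The following is a minimal Python sketch, assuming the text layout matches the sample shown here; the function names parse_check_summary and failed_tests are hypothetical helpers, not part of NetQ, and any value other than "passed" is treated as a problem.

```python
def parse_check_summary(text):
    """Split 'Name : value' lines from the check summary into a dict."""
    results = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        # Skip lines without a value, such as the summary title line.
        if sep and value.strip():
            results[name.strip()] = value.strip()
    return results


def failed_tests(results):
    """Return names of '... Test' entries whose value is not 'passed'."""
    return [name for name, value in results.items()
            if name.endswith("Test") and value != "passed"]


# Sample text copied from the output above (shortened).
sample = """clag check result summary:
Total nodes : 6
Checked nodes : 6
Failed nodes : 0
Peering Test : passed
Backup IP Test : passed
SVI Test : passed
"""

summary = parse_check_summary(sample)
print(summary["Total nodes"], failed_tests(summary))
```

An empty list from failed_tests means every test line in the summary reported passed.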
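NVIDIA has deprecated the clag keyword and replaced it with the mlag keyword. The clag keyword still works for now, but you should start using the mlag keyword. Remember to also update any scripts that use the clag keyword.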
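One way to update scripts is to rewrite the keyword only on lines that invoke netq, leaving names such as clagd and clagctl alone. The following Python sketch illustrates the idea; migrate_line is a hypothetical helper, and the word-boundary match is an assumption about how the keyword appears in your scripts, so review the result before saving it.

```python
import re

# Matches the standalone word "clag", but not "clagd" or "clagctl".
CLAG_WORD = re.compile(r"\bclag\b")
NETQ_WORD = re.compile(r"\bnetq\b")


def migrate_line(line):
    """Replace the clag keyword with mlag on lines that call netq."""
    if NETQ_WORD.search(line):
        return CLAG_WORD.sub("mlag", line)
    return line


script = [
    "netq check clag json",
    "netq spine01 show services clagd",
    "sudo clagctl",
]
for original in script:
    print(migrate_line(original))
```

Only the first line changes, because the other two contain clag solely as part of a longer name.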
When you are logged directly into a switch, you can run clagctl to get the state:
cumulus@switch:/var/log$ sudo clagctl
The peer is alive
Peer Priority, ID, and Role: 4096 00:02:00:00:00:4e primary
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 secondary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (active)
System MAC: 44:38:39:ff:ff:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38 vx-38 - - -
vx-33 vx-33 - - -
hostbond4 hostbond4 1 - -
hostbond5 hostbond5 2 - -
vx-37 vx-37 - - -
vx-36 vx-36 - - -
vx-35 vx-35 - - -
vx-34 vx-34 - - -
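Scenario 2: Dual-connected Bond Is Down
When dual connectivity is lost in an MLAG configuration, you receive messages from NetQ similar to the following: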
2017-05-22T23:14:40.290918+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 1 link(s) are down. They are: spine01 hostbond5
2017-05-22T23:14:53.081480+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 1 node(s) have failures. They are: spine01
2017-05-22T23:14:58.161267+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01
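The notifier messages above follow a regular shape: timestamp, host, process, severity, category, and text. The following Python sketch splits a line into those fields so that alerts can be filtered by severity or category. The regular expression is inferred only from the sample lines in this topic, and parse_notifier_line is a hypothetical helper, not a NetQ API.

```python
import re
from datetime import datetime

# Pattern inferred from the sample netq-notifier lines in this topic.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<host>\S+) netq-notifier\[(?P<pid>\d+)\]: "
    r"(?P<severity>[A-Z]+): (?P<category>[A-Z]+): (?P<message>.*)$"
)


def parse_notifier_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["timestamp"] = datetime.fromisoformat(event["timestamp"])
    event["pid"] = int(event["pid"])
    return event


line = ("2017-05-22T23:14:40.290918+00:00 noc-pr netq-notifier[5501]: "
        "WARNING: LINK: 1 link(s) are down. They are: spine01 hostbond5")
event = parse_notifier_line(line)
print(event["severity"], event["category"], event["message"])
```

Filtering parsed events for a severity of WARNING and a category of CLAG isolates the MLAG failures among the link and VXLAN messages.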
To begin your investigation, show the status of the clagd service:
cumulus@switch:~$ netq spine01 show services clagd
Matching services records:
Hostname Service PID VRF Enabled Active Monitored Status Uptime Last Changed
----------------- -------------------- ----- --------------- ------- ------ --------- ---------------- ------------------------- -------------------------
spine01 clagd 2678 default yes yes yes ok 23h:57m:16s Thu Feb 7 18:30:49 2019
Checking the MLAG status provides the reason for the failure:
cumulus@switch:~$ netq check mlag
Checked Nodes: 6, Warning Nodes: 2
Node Reason
---------------- --------------------------------------------------------------------------
spine01 Link Down: hostbond5
leaf01 Singly Attached Bonds: hostbond5
You can retrieve the output in JSON format for export to another tool:
cumulus@switch:~$ netq check mlag json
{
  "warningNodes": [
    {
      "node": "spine01",
      "reason": "Link Down: hostbond5"
    },
    {
      "node": "leaf01",
      "reason": "Singly Attached Bonds: hostbond5"
    }
  ],
  "failedNodes": [],
  "summary": {
    "checkedNodeCount": 6,
    "failedNodeCount": 0,
    "warningNodeCount": 2
  }
}
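A consuming tool can flatten this JSON into one row per affected node. The following Python sketch assumes the keys seen in the sample output (warningNodes, failedNodes, node, reason); summarize_check is a hypothetical helper, and the sample string is modeled on the output above rather than captured from a live system.

```python
import json


def summarize_check(raw):
    """Return (level, node, reason) tuples from the JSON check output."""
    data = json.loads(raw)
    rows = []
    # Either list may be absent, so fall back to an empty list.
    for key, level in (("failedNodes", "failed"), ("warningNodes", "warning")):
        for entry in data.get(key, []):
            rows.append((level, entry["node"], entry["reason"]))
    return rows


sample = """
{
  "warningNodes": [
    {"node": "spine01", "reason": "Link Down: hostbond5"},
    {"node": "leaf01", "reason": "Singly Attached Bonds: hostbond5"}
  ],
  "failedNodes": [],
  "summary": {"checkedNodeCount": 6, "failedNodeCount": 0, "warningNodeCount": 2}
}
"""

for level, node, reason in summarize_check(sample):
    print(f"{level:8} {node:10} {reason}")
```

Failed nodes are listed before warning nodes so that the most severe entries appear first.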
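After you fix the issue, you can show the MLAG state to see whether all the nodes are up. The notifications from NetQ indicate that all nodes are up, and netq check also reports no failures.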
cumulus@switch:~$ netq show mlag
Matching clag records:
Hostname Peer SysMac State Backup #Bonds #Dual Last Changed
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P) spine02 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:49 2019
spine02 spine01(P) 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:53 2019
leaf01(P) leaf02 44:38:39:ff:ff:01 up up 12 12 Thu Feb 7 18:31:15 2019
leaf02 leaf01(P) 44:38:39:ff:ff:01 up up 12 12 Thu Feb 7 18:31:20 2019
leaf03(P) leaf04 44:38:39:ff:ff:02 up up 12 12 Thu Feb 7 18:31:26 2019
leaf04 leaf03(P) 44:38:39:ff:ff:02 up up 12 12 Thu Feb 7 18:31:30 2019
When you are logged directly into a switch, you can run clagctl to get the state:
cumulus@switch:/var/log$ sudo clagctl
The peer is alive
Peer Priority, ID, and Role: 4096 00:02:00:00:00:4e primary
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 secondary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (active)
System MAC: 44:38:39:ff:ff:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38 vx-38 - - -
vx-33 vx-33 - - -
hostbond4 hostbond4 1 - -
hostbond5 - 2 - -
vx-37 vx-37 - - -
vx-36 vx-36 - - -
vx-35 vx-35 - - -
vx-34 vx-34 - - -
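Scenario 3: VXLAN Active-active Device or Interface Is Down
When a VXLAN active-active device or interface in an MLAG configuration is down, the log messages also include a VXLAN check: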
2017-05-22T23:16:51.517522+00:00 noc-pr netq-notifier[5501]: WARNING: VXLAN: 2 node(s) have failures. They are: spine01, leaf01
2017-05-22T23:16:51.525403+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 2 link(s) are down. They are: leaf01 vx-37, spine01 vx-37
2017-05-22T23:17:04.703044+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01
To begin your investigation, show the status of the clagd service:
cumulus@switch:~$ netq spine01 show services clagd
Matching services records:
Hostname Service PID VRF Enabled Active Monitored Status Uptime Last Changed
----------------- -------------------- ----- --------------- ------- ------ --------- ---------------- ------------------------- -------------------------
spine01 clagd 2678 default yes yes yes error 23h:57m:16s Thu Feb 7 18:30:49 2019
Checking the MLAG status provides the reason for the failure:
cumulus@switch:~$ netq check mlag
Checked Nodes: 6, Warning Nodes: 2, Failed Nodes: 2
Node Reason
---------------- --------------------------------------------------------------------------
spine01 Protodown Bonds: vx-37:vxlan-single
leaf01 Protodown Bonds: vx-37:vxlan-single
You can retrieve the output in JSON format for export to another tool:
cumulus@switch:~$ netq check mlag json
{
  "failedNodes": [
    {
      "node": "spine01",
      "reason": "Protodown Bonds: vx-37:vxlan-single"
    },
    {
      "node": "leaf01",
      "reason": "Protodown Bonds: vx-37:vxlan-single"
    }
  ],
  "summary": {
    "checkedNodeCount": 6,
    "failedNodeCount": 2,
    "warningNodeCount": 2
  }
}
After you fix the issue, you can show the MLAG state to see whether all the nodes are up:
cumulus@switch:~$ netq show mlag
Matching clag session records are:
Hostname Peer SysMac State Backup #Bonds #Dual Last Changed
----------------- ----------------- ------------------ ---------- ------ ----- ----- -------------------------
spine01(P) spine02 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:49 2019
spine02 spine01(P) 00:01:01:10:00:01 up up 24 24 Thu Feb 7 18:30:53 2019
leaf01(P) leaf02 44:38:39:ff:ff:01 up up 12 12 Thu Feb 7 18:31:15 2019
leaf02 leaf01(P) 44:38:39:ff:ff:01 up up 12 12 Thu Feb 7 18:31:20 2019
leaf03(P) leaf04 44:38:39:ff:ff:02 up up 12 12 Thu Feb 7 18:31:26 2019
leaf04 leaf03(P) 44:38:39:ff:ff:02 up up 12 12 Thu Feb 7 18:31:30 2019
When you are logged directly into a switch, you can run clagctl to get the state:
cumulus@switch:/var/log$ sudo clagctl
The peer is alive
Peer Priority, ID, and Role: 4096 00:02:00:00:00:4e primary
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 secondary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (active)
System MAC: 44:38:39:ff:ff:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38 vx-38 - - -
vx-33 vx-33 - - -
hostbond4 hostbond4 1 - -
hostbond5 hostbond5 2 - -
vx-37 - - - vxlan-single
vx-36 vx-36 - - -
vx-35 vx-35 - - -
vx-34 vx-34 - - -
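In the interface table above, a missing peer interface or a Proto-Down Reason other than "-" marks the interface that needs attention. The following Python sketch picks those rows out of the clagctl text. It assumes each row has exactly five whitespace-separated fields, as in this sample; parse_clag_interfaces is a hypothetical helper, and real output with multi-word fields would need a different parser.

```python
def parse_clag_interfaces(text):
    """Parse rows that follow the dashed separator of the interface table."""
    rows = []
    in_table = False
    for line in text.splitlines():
        tokens = line.split()
        # The dashed separator line marks the start of the data rows.
        if tokens and len(tokens[0]) > 1 and set(tokens[0]) == {"-"}:
            in_table = True
            continue
        if in_table and len(tokens) == 5:
            # A lone "-" means the field is empty.
            ours, peer, clag_id, conflicts, reason = (
                None if token == "-" else token for token in tokens
            )
            rows.append({"interface": ours, "peer": peer, "clag_id": clag_id,
                         "conflicts": conflicts, "proto_down": reason})
    return rows


sample = """Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38 vx-38 - - -
hostbond4 hostbond4 1 - -
hostbond5 hostbond5 2 - -
vx-37 - - - vxlan-single
vx-36 vx-36 - - -
"""

problems = [row for row in parse_clag_interfaces(sample)
            if row["peer"] is None or row["proto_down"] is not None]
print([(row["interface"], row["proto_down"]) for row in problems])
```

For this sample, the only flagged interface is vx-37 with the reason vxlan-single, matching the failure reported by netq check mlag.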
Scenario 4: Remote-side clagd Stopped by systemctl Command
If you stop the clagd service with the systemctl command, NetQ Notifier sends messages similar to the following:
2017-05-22T23:51:19.539033+00:00 noc-pr netq-notifier[5501]: WARNING: VXLAN: 1 node(s) have failures. They are: leaf01
2017-05-22T23:51:19.622379+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 2 link(s) flapped and are down. They are: leaf01 hostbond5, leaf01 hostbond4
2017-05-22T23:51:19.622922+00:00 noc-pr netq-notifier[5501]: WARNING: LINK: 23 link(s) are down. They are: leaf01 VlanA-1-104-v0, leaf01 VlanA-1-101-v0, leaf01 VlanA-1, leaf01 vx-33, leaf01 vx-36, leaf01 vx-37, leaf01 vx-34, leaf01 vx-35, leaf01 swp7, leaf01 VlanA-1-102-v0, leaf01 VlanA-1-103-v0, leaf01 VlanA-1-100-v0, leaf01 VlanA-1-106-v0, leaf01 swp8, leaf01 VlanA-1.106, leaf01 VlanA-1.105, leaf01 VlanA-1.104, leaf01 VlanA-1.103, leaf01 VlanA-1.102, leaf01 VlanA-1.101, leaf01 VlanA-1.100, leaf01 VlanA-1-105-v0, leaf01 vx-38
2017-05-22T23:51:27.696572+00:00 noc-pr netq-notifier[5501]: INFO: LINK: 15 link(s) are up. They are: leaf01 VlanA-1.106, leaf01 VlanA-1-104-v0, leaf01 VlanA-1.104, leaf01 VlanA-1.103, leaf01 VlanA-1.101, leaf01 VlanA-1-100-v0, leaf01 VlanA-1.100, leaf01 VlanA-1.102, leaf01 VlanA-1-101-v0, leaf01 VlanA-1-102-v0, leaf01 VlanA-1.105, leaf01 VlanA-1-103-v0, leaf01 VlanA-1-106-v0, leaf01 VlanA-1, leaf01 VlanA-1-105-v0
2017-05-22T23:51:36.156708+00:00 noc-pr netq-notifier[5501]: WARNING: CLAG: 2 node(s) have failures. They are: spine01, leaf01
Showing the MLAG state reveals which nodes are down:
cumulus@switch:~$ netq show mlag
Matching CLAG session records are:
Node Peer SysMac State Backup #Bonds #Dual Last Changed
---------------- ---------------- ----------------- ----- ------ ------ ----- -------------------------
spine01(P) spine02 00:01:01:10:00:01 up up 9 9 Thu Feb 7 18:30:53 2019
spine02 spine01(P) 00:01:01:10:00:01 up up 9 9 Thu Feb 7 18:31:04 2019
leaf01 44:38:39:ff:ff:01 down n/a 0 0 Thu Feb 7 18:31:13 2019
leaf03(P) leaf04 44:38:39:ff:ff:02 up up 8 8 Thu Feb 7 18:31:19 2019
leaf04 leaf03(P) 44:38:39:ff:ff:02 up up 8 8 Thu Feb 7 18:31:25 2019
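Because the Peer column is empty for leaf01, splitting each row from the left misaligns the fields. The following Python sketch reads each row from the right instead, where the column count is fixed. It assumes the layout shown in this sample (a five-token date, then #Dual, #Bonds, Backup, State, and SysMac); parse_mlag_rows is a hypothetical helper, not a NetQ API.

```python
def parse_mlag_rows(text):
    """Parse data rows of the MLAG table, allowing an empty Peer column."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows carry a MAC address ten tokens from the end.
        if len(tokens) < 11 or tokens[-10].count(":") != 5:
            continue
        lead = tokens[:-10]  # hostname, plus the peer when present
        rows.append({
            "hostname": lead[0].replace("(P)", ""),
            "primary": lead[0].endswith("(P)"),
            "peer": lead[1].replace("(P)", "") if len(lead) > 1 else None,
            "sysmac": tokens[-10],
            "state": tokens[-9],
            "backup": tokens[-8],
            "bonds": int(tokens[-7]),
            "dual": int(tokens[-6]),
        })
    return rows


sample = """Node Peer SysMac State Backup #Bonds #Dual Last Changed
spine01(P) spine02 00:01:01:10:00:01 up up 9 9 Thu Feb 7 18:30:53 2019
leaf01 44:38:39:ff:ff:01 down n/a 0 0 Thu Feb 7 18:31:13 2019
leaf04 leaf03(P) 44:38:39:ff:ff:02 up up 8 8 Thu Feb 7 18:31:25 2019
"""

down = [row["hostname"] for row in parse_mlag_rows(sample)
        if row["state"] != "up"]
print(down)
```

The same rows can also be compared on bonds and dual to spot nodes whose bonds are not all dual-connected.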
Checking the MLAG status provides the reason for the failure:
cumulus@switch:~$ netq check mlag
Checked Nodes: 6, Warning Nodes: 1, Failed Nodes: 2
Node Reason
---------------- --------------------------------------------------------------------------
spine01 Peer Connectivity failed
leaf01 Peer Connectivity failed
You can retrieve the output in JSON format for export to another tool:
cumulus@switch:~$ netq check mlag json
{
  "failedNodes": [
    {
      "node": "spine01",
      "reason": "Peer Connectivity failed"
    },
    {
      "node": "leaf01",
      "reason": "Peer Connectivity failed"
    }
  ],
  "summary": {
    "checkedNodeCount": 6,
    "failedNodeCount": 2,
    "warningNodeCount": 1
  }
}
When you are logged directly into a switch, you can run clagctl to get the state:
cumulus@switch:~$ sudo clagctl
The peer is not alive
Our Priority, ID, and Role: 8192 44:38:39:00:a5:38 primary
Peer Interface and IP: peerlink-3.4094 169.254.0.9
VxLAN Anycast IP: 36.0.0.20
Backup IP: 27.0.0.20 (inactive)
System MAC: 44:38:39:ff:ff:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
vx-38 - - - -
vx-33 - - - -
hostbond4 - 1 - -
hostbond5 - 2 - -
vx-37 - - - -
vx-36 - - - -
vx-35 - - - -
vx-34 - - - -