Recently I faced an interconnect failure in one of my customer environments. Everything ran smoothly until Monday morning, when I received a notification event about the interconnect link being down.
Cluster Interconnect link is DOWN
In the environment I have the pleasure to work with, we have a fabric-attached MetroCluster configuration. Between the filers there are two HA interconnect cables (attached to the FC switches), and heartbeat communication is served via the MetroCluster FC-VI card. If this single dual-ported card goes down, we can talk about a small disaster, because without heartbeat messages between the nodes a takeover is guaranteed.
Cluster Interconnect link, FC-VI troubleshooting – Survived node investigation
On the surviving node you have several things to check. First, it's good to get familiar with the syslog.
filer2(takeover)> rdfile /etc/messages
Note: remember that you can view older logs in messages.0, messages.1, etc.
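For example, if the failure happened before the current log rotation, the previous log can be read the same way:

```
filer2(takeover)> rdfile /etc/messages.0
```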
You should see output similar to this:
Sat May 30 00:38:41 MEST [filer2: cf.ic.xferTimedOut:error]: WAFL interconnect transfer timed out
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicViError:info]: Interconnect nic 1 has error on VI #4 SEND_DESC_TIMEOUT 7
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Sat May 30 00:38:42 MEST [filer2: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of filer2 by filer1 disabled (interconnect error)
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #10 SEND_DESC_TIMEOUT 4
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Sat May 30 00:38:43 MEST [filer2: cf.fsm.partnerNotResponding:notice]: Cluster monitor: partner not responding
Sat May 30 00:38:43 MEST [filer2: cf.fsm.takeoverCountdown:warning]: Cluster monitor: takeover scheduled in 9 seconds
Sat May 30 00:38:45 MEST [filer2: cf.fsm.autoTakeoverCancelled:notice]: Cluster monitor: pending takeover cancelled
Sat May 30 00:38:45 MEST [filer2: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of filer1 disabled (status of backup mailbox is uncertain)
Sat May 30 00:38:51 MEST [filer2: cf.rv.notConnected:error]: Connection for cfo_rv failed
Sat May 30 00:38:51 MEST [filer2: cf.rv.notConnected:error]: Connection for cfo_rv2 failed
Sat May 30 00:38:52 MEST [filer2: ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible due to reason status of backup mailbox is uncertain.
Sat May 30 00:38:54 MEST [filer2: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of filer1 enabled
Sat May 30 00:38:54 MEST [filer2: cf.fsm.takeover.noHeartbeat:ALERT]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
Sat May 30 00:38:54 MEST [filer2: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Sat May 30 00:38:54 MEST [filer2: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Sat May 30 00:38:55 MEST [filer1/filer2: coredump.spare.none:info]: No sparecore disk was found.
Sat May 30 00:38:56 MEST [filer2: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Sat May 30 00:38:56 MEST [filer2: raid.replay.partner.nvram:notice]: Replaying partner NVRAM.
Sat May 30 00:38:56 MEST [filer2: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
The ic (interconnect) command
filer2(takeover)> priv set diag
filer2(takeover)*> ic status
It shows output that strongly confirms our concerns:
Link 0: down
Link 1: down
cfo_rv connection state :  NOT CONNECTED
cfo_rv nic used         :  0
cfo_rv2 connection state : NOT CONNECTED
cfo_rv2 nic used         : 1
Note: at this stage you can try ic reset nic or ic reset link.
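A sketch of what that attempt looks like in diag mode (link numbers 0 and 1 correspond to the two FC-VI ports; the exact argument syntax may vary between Data ONTAP releases, so check the inline help first). If the card itself has failed, the links will stay down after the reset:

```
filer2(takeover)*> ic reset link 0
filer2(takeover)*> ic reset link 1
filer2(takeover)*> ic status
```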
Time for sysconfig – FC-VI card
filer2(takeover)> sysconfig -a
From this perspective you can also see confirmation and more details about the non-working interface.
slot 2: FCVI Host Adapter 2a (QLogic 2432(2462) rev. 3, F-port, )
        Physical link: UP
        FC Node Name: 21:00:00:00:00:00:00:00
        Firmware rev: 5.0.2
        Serial No: XXXXXXXXXXXXXX
        Host Port Id: 0x20400
        Cacheline size: 16
        FC Packet size: 2048
        SFF Vendor: FINISAR CORP.
        SFF Part Number: XXXXXXXXXXXXXX
        SFF Serial Number: XXXXXXXXXXXXXX
        SFF Capabilities: 1, 2 or 4 Gbit
        Link Data Rate: 4 Gbit
slot 2: FCVI Host Adapter 2b (QLogic 2432(2462) rev. 3, F-port, )
        Physical link: UP
        FC Node Name: 21:00:00:00:00:00:00:00
        Firmware rev: 5.0.2
        Serial No: XXXXXXXXXXXXXX
        Host Port Id: 0x20400
        Cacheline size: 16
        FC Packet size: 2048
        SFF Vendor: FINISAR CORP.
        SFF Part Number: XXXXXXXXXXXXXX
        SFF Serial Number: XXXXXXXXXXXXXX
        SFF Capabilities: 1, 2 or 4 Gbit
        Link Data Rate: 4 Gbit
        Switch Port: F2_switch2:4
Cluster Interconnect link, FC-VI troubleshooting – Taken over node investigation
Now it's time to look at the taken-over node. Use the RLM or SP to connect remotely to the filer. Remember to use the naroot user.
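Assuming the RLM has already been configured with an IP address, the connection is a plain SSH session (the address below is only an example):

```
$ ssh naroot@10.0.0.15
Password:
RLM rfiler1>
```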
RLM – event log
Investigate the event log:
RLM rfiler1> events all
Now our view of the failure becomes clearer.
Record 1589: Fri May 29 22:39:01 2015 [RLM.critical]: Filer Reboots
Record 1590: Fri May 29 22:39:01 2015 [Trap Event.critical]: hwassist abnormal_reboot (28)
Record 1591: Fri May 29 22:39:01 2015 [Trap Event.critical]: SNMP abnormal_reboot (28)
Record 1592: Fri May 29 22:39:55 2015 [RLM.critical]: Filer Reboots
Record 1593: Fri May 29 22:50:05 2015 [RLM.critical]: Heartbeat stopped
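On a busy system the full event list can be long; the RLM CLI can also narrow the output, for example by showing only the newest records or searching for a string (subcommand availability depends on the RLM firmware version):

```
RLM rfiler1> events newest 10
RLM rfiler1> events search reboot
```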
RLM – system log
RLM rfiler1> system log
And now there is no doubt.
================ Log #4 start time Fri May 29 22:39:31 2015
Press CTRL-C for special boot menu
The platform doesn't support service processor
Processing PCI error...
Probing PXB(21,6,0)
Probing PXB(21,7,0)
Probing EXB(21,8,0)
report Dv(24,8,0) from error source register 0x1840.
Probing EXB(21,9,0)
Probing EXB(21,10,0)
Probing EXB(21,11,0)
PANIC: PCI Error NMI from device(s): Dv(24,8,0). HT2000(2) ALERT - PXB(21,6,0): Status(SigTrAbt), SecStatus(RcvMstAbt), Err(MstAbt), SpcErr(RSpcSCE).
PXB(21,7,0): SecStatus(RcvMstAbt).
EXB(21,8,0): SecStatus(RcvMstAbt), RootErr (UCor,NFatal), UCorrErr(CpAbt).
EXB(21,9,0): SecStatus(RcvMstAbt).
EXB(21,10,0): SecStatus(RcvMstAbt).
in process main_proc on release NetApp Release 7.3.6P2 on Fri May 29 22:39:32 GMT 2015
version: NetApp Release 7.3.6P2: Wed Sep 14 01:32:17 PDT 2011
cc flags: 2O
DUMPCORE: START
DUMPCORE: END -- coredump *NOT* written.
halt after panic during system initialization
RLM – system console
From the system console view you can see the boot loader and, as the name suggests, choose how ONTAP boots. This is also useful for booting a specific diagnostic image.
RLM rfiler1> system console
From this mode we can boot the diagnostic image and run a full diagnostic check. But be aware that it may not see the failure of your FC-VI card.
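On this generation of filers the diagnostic image is typically started from the boot loader prompt (reachable via the system console); availability of the command depends on the platform and boot loader version:

```
LOADER> boot_diags
```

Remember that since the card has already disappeared from the PCI bus after the panic, the diagnostics may simply not list it rather than flag it as failed.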
Cluster Interconnect link, FC-VI troubleshooting – FC SAN switch connectivity
That's right. Last but not least, it is worth checking the connectivity between the filers and the switches. To put it simply, assume the configuration looks more or less like the one in Picture 2: two filers located at different sites, with the back end provided by SAN switches.
FC SAN switches – zoning configuration
You are not able to check the WWPNs for the ports assigned to the faulty card, at least not remotely. Our hunch tells us that to confirm the FC-VI failure we should check the fabric where the faulty card is attached. However, we do know the WWPNs assigned to the FC-VI card on the surviving node, therefore we can look them up in the existing zoning configuration. After you locate the port related to the FC-VI card, check its port status on your Brocade switch:
switch1_f1:admin> portshow 4
You will see something like this:
portState: 2    Offline
portPhys:  4    No_Light
portScn:   2    Offline
Confirm this also for the other port on the same fabric.
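A few other standard Brocade FOS commands help cross-check the picture: switchshow gives the state of all ports at a glance, sfpshow reads the SFP on the suspected port, and nodefind searches the fabric for a WWPN (the WWPN below is the masked example from the sysconfig output above). If the FC-VI port is dead, nodefind will return nothing for its WWPN:

```
switch1_f1:admin> switchshow
switch1_f1:admin> sfpshow 4
switch1_f1:admin> nodefind 21:00:00:00:00:00:00:00
```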
Cluster Interconnect link, FC-VI troubleshooting – Summary
After all these steps we can confirm the FC-VI failure:
1. No heartbeat between filer nodes.
2. Node1 has been taken over by Node2.
3. Interconnect link is DOWN for both FC-VI ports.
4. The RLM connection to Node1 shows information about a PCI failure. Collect information from the logs.
5. The diagnostic test did not show the PCI card failure (because the card is already invisible to the system).
6. FC links are down between the FC-VI ports and the switches on the fabric where the suspected card is attached.
Useful resources:
How to use RLM to collect data required to troubleshoot a hardware issue
How to troubleshoot unexplained takeovers or reboots
Location and meaning of dual-port, 2-Gb MetroCluster adapter LEDs
Data ONTAP® 8.2 High Availability and MetroCluster Configuration Guide For 7-Mode