[MetroCluster] How to troubleshoot interconnect link down and FC-VI

Recently I faced an interconnect failure in one of my customers' environments. Everything ran smoothly until Monday morning, when I received a notification event about the interconnect link being down.

Cluster Interconnect link is DOWN

In the environment I have the pleasure to work with, we have a fabric-attached MetroCluster configuration. Between the filers there are two HA interconnect cables (attached to the FC switches), and the heartbeat communication is carried by the MetroCluster FC-VI card. If this single dual-ported card goes down, we can talk about a small disaster, because without heartbeat messages between the nodes a takeover is guaranteed.

Picture 1. MetroCluster Interconnect card FCVI

Cluster Interconnect link, FC-VI troubleshooting – Surviving node investigation

On the surviving node there are several things to check. First, it's good to get familiar with the syslog.

Checking syslog

filer2(takeover)> rdfile /etc/messages
Note: remember that older logs can be viewed in messages.0, messages.1, etc.

You should see similar output:

Sat May 30 00:38:41 MEST [filer2: cf.ic.xferTimedOut:error]: WAFL interconnect transfer timed out
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicViError:info]: Interconnect nic 1 has error on VI #4 SEND_DESC_TIMEOUT 7
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Sat May 30 00:38:42 MEST [filer2: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of filer2 by filer1 disabled (interconnect error)
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicViError:info]: Interconnect nic 0 has error on VI #10 SEND_DESC_TIMEOUT 4
Sat May 30 00:38:42 MEST [filer2: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Sat May 30 00:38:43 MEST [filer2: cf.fsm.partnerNotResponding:notice]: Cluster monitor: partner not responding
Sat May 30 00:38:43 MEST [filer2: cf.fsm.takeoverCountdown:warning]: Cluster monitor: takeover scheduled in 9 seconds
Sat May 30 00:38:45 MEST [filer2: cf.fsm.autoTakeoverCancelled:notice]: Cluster monitor: pending takeover cancelled
Sat May 30 00:38:45 MEST [filer2: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of filer1 disabled (status of backup mailbox is uncertain)
Sat May 30 00:38:51 MEST [filer2: cf.rv.notConnected:error]: Connection for cfo_rv failed
Sat May 30 00:38:51 MEST [filer2: cf.rv.notConnected:error]: Connection for cfo_rv2 failed
Sat May 30 00:38:52 MEST [filer2: ha.takeoverImpNotDef:warning]: Takeover of the partner node is impossible due to reason status of backup mailbox is uncertain.
Sat May 30 00:38:54 MEST [filer2: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of filer1 enabled
Sat May 30 00:38:54 MEST [filer2: cf.fsm.takeover.noHeartbeat:ALERT]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
Sat May 30 00:38:54 MEST [filer2: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Sat May 30 00:38:54 MEST [filer2: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Sat May 30 00:38:55 MEST [filer1/filer2: coredump.spare.none:info]: No sparecore disk was found.
Sat May 30 00:38:56 MEST [filer2: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Sat May 30 00:38:56 MEST [filer2: raid.replay.partner.nvram:notice]: Replaying partner NVRAM.
Sat May 30 00:38:56 MEST [filer2: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
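Besides reading the syslog, you can quickly confirm the HA state from the surviving node with cf status (a sketch – the exact wording of the output differs between Data ONTAP releases):

filer2(takeover)> cf status
filer2 has taken over filer1.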

ic – the interconnect command

filer2(takeover)> priv set diag
filer2(takeover)*> ic status
The output strongly confirms our concerns:

    Link 0: down
    Link 1: down
    cfo_rv connection state :            NOT CONNECTED
    cfo_rv nic used :                0
    cfo_rv2 connection state :            NOT CONNECTED
    cfo_rv2 nic used :                1

Note: at this stage you can try ic reset nic or ic reset link.
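A minimal sketch of such a reset attempt (the argument syntax is my assumption and may differ per release; the link numbers refer to the ones reported as down by ic status, and diag privilege is still active from the previous step):

filer2(takeover)*> ic reset link 0
filer2(takeover)*> ic reset link 1
filer2(takeover)*> ic status

If both links stay down after the reset, continue with the hardware checks below.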

Time for sysconfig – FC-VI card

filer2(takeover)> sysconfig -a
Here you can also see confirmation and more details about the non-working interface.

    slot 2: FCVI Host Adapter 2a (QLogic 2432(2462) rev. 3, F-port, )
        Physical link: UP
        FC Node Name:    21:00:00:00:00:00:00:00
        Firmware rev:    5.0.2    Serial No:    XXXXXXXXXXXXXX
        Host Port Id:    0x20400
        Cacheline size:    16    FC Packet size:    2048
        SFF Vendor:    FINISAR CORP.
        SFF Part Number:    XXXXXXXXXXXXXX
        SFF Serial Number:    XXXXXXXXXXXXXX
        SFF Capabilities:  1, 2 or 4 Gbit
        Link Data Rate:    4 Gbit
    slot 2: FCVI Host Adapter 2b (QLogic 2432(2462) rev. 3, F-port, )
        Physical link: UP
        FC Node Name:    21:00:00:00:00:00:00:00
        Firmware rev:    5.0.2    Serial No:    XXXXXXXXXXXXXX
        Host Port Id:    0x20400
        Cacheline size:    16    FC Packet size:    2048
        SFF Vendor:    FINISAR CORP.
        SFF Part Number:    XXXXXXXXXXXXXX
        SFF Serial Number:    XXXXXXXXXXXXXX
        SFF Capabilities:  1, 2 or 4 Gbit
        Link Data Rate:    4 Gbit
        Switch Port:    F2_switch2:4

Cluster Interconnect link, FC-VI troubleshooting – Taken-over node investigation

Now it's time to look at our taken-over node. Use your RLM or SP to connect remotely to the filer. Remember to use the naroot user.
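A minimal example of such a connection over SSH (the RLM hostname below is hypothetical – use the address of your filer's RLM/SP):

admin@host$ ssh naroot@rlm-filer1
RLM rfiler1>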

RLM – event log

Investigate the event log:
RLM rfiler1> events all
Now our view of the failure becomes clearer.

Record 1589: Fri May 29 22:39:01 2015 [RLM.critical]: Filer Reboots
Record 1590: Fri May 29 22:39:01 2015 [Trap Event.critical]: hwassist abnormal_reboot (28)
Record 1591: Fri May 29 22:39:01 2015 [Trap Event.critical]: SNMP abnormal_reboot (28)
Record 1592: Fri May 29 22:39:55 2015 [RLM.critical]: Filer Reboots
Record 1593: Fri May 29 22:50:05 2015 [RLM.critical]: Heartbeat stopped
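Depending on the RLM firmware, you can usually also limit the listing to the most recent entries or search it for a keyword (the exact sub-commands are an assumption on my side – check the RLM help if they are not available in your version):

RLM rfiler1> events newest 10
RLM rfiler1> events search reboot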

RLM – system log

RLM rfiler1> system log
And now there is no doubt.

================ Log #4 start time Fri May 29 22:39:31 2015
Press CTRL-C for special boot menu
The platform doesn't support service processor 
Processing PCI error...
Probing PXB(21,6,0)
Probing PXB(21,7,0)
Probing EXB(21,8,0)
    report Dv(24,8,0) from error source register 0x1840.
Probing EXB(21,9,0)
Probing EXB(21,10,0)
Probing EXB(21,11,0)

PANIC: PCI Error NMI from device(s): Dv(24,8,0). HT2000(2) ALERT - PXB(21,6,0): Status(SigTrAbt), SecStatus(RcvMstAbt), Err(MstAbt), SpcErr(RSpcSCE). PXB(21,7,0): SecStatus(RcvMstAbt). EXB(21,8,0): SecStatus(RcvMstAbt), RootErr

(UCor,NFatal), UCorrErr(CpAbt). EXB(21,9,0): SecStatus(RcvMstAbt). EXB(21,10,0): SecStatus(RcvMstAbt).  in process main_proc on release NetApp Release 7.3.6P2 on Fri May 29 22:39:32 GMT 2015


version: NetApp Release 7.3.6P2: Wed Sep 14 01:32:17 PDT 2011
cc flags: 2O

DUMPCORE: START

DUMPCORE: END -- coredump *NOT* written.
halt after panic during system initialization

RLM – system console

From the system console view you can see the boot loader and, as the name suggests, choose how ONTAP boots. It is also useful for booting the dedicated diagnostic image.
RLM rfiler1> system console
LOADER> boot_diags
Entering this mode boots the diagnostic image, from which you can run a full diagnostic check. But be aware that it may not see the failure of your FC-VI card.
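Once the diagnostics are finished, you can return to normal operation from the same LOADER prompt (a sketch – boot_ontap is the standard LOADER command for booting Data ONTAP):

LOADER> boot_ontap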

Cluster Interconnect link, FC-VI troubleshooting – FC SAN switch connectivity

Last, but not least, it is worth checking the connectivity between the filers and the switches. To keep it simple, assume that the configuration looks more or less like the one in Picture 2: two filers located at different sites, with the back end provided by SAN switches.

Picture 2. Diagram of the presented fabric-attached FC-VI configuration.

FC SAN switches – zoning configuration

You are not able to check the WWPNs of the ports assigned to the faulty card, at least not remotely. Our hunch tells us that to confirm the FC-VI failure we should check the fabric where the faulty card is attached. We do, however, know the WWPNs assigned to the FC-VI card on the surviving node, therefore we can find them in the existing zoning configuration. After you locate the port related to the FC-VI card, check its port status on your Brocade switch:
switch1_f1:admin> portshow 4
And you will see something like this.

portState: 2    Offline  
portPhys:  4    No_Light 
portScn:   2    Offline   

Confirm this also for the other port on the same fabric.
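A few more Brocade commands can help correlate the FC-VI WWPNs with the physical port and rule out a cabling or SFP problem (a sketch – the port number follows the portshow example above, and the output format varies with the FOS version):

switch1_f1:admin> zoneshow
switch1_f1:admin> switchshow
switch1_f1:admin> sfpshow 4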

Cluster Interconnect link, FC-VI troubleshooting – Summary

After all these steps we can confirm the FC-VI failure:
1. No heartbeat between the filer nodes.
2. Node1 has been taken over by Node2.
3. The interconnect link is DOWN for both FC-VI ports.
4. The RLM session to Node1 shows information about a PCI failure; further details collected from the logs.
5. The diagnostic test did not show the PCI card failure (because the card is already invisible to the system).
6. The FC links between the switches and the FC-VI ports are down on the fabric where the suspected card is attached.

Useful docs:
How to use RLM to collect data required to troubleshoot a hardware issue
How to troubleshoot unexplained takeovers or reboots
Location and meaning of dual-port, 2-Gb MetroCluster adapter LEDs
Data ONTAP® 8.2 High Availability and MetroCluster Configuration Guide For 7-Mode
