HP 3PAR CRC errors in correlation with Brocade SAN

HP 3PAR CRC errors and Invalid transmission word in correlation with Brocade SAN switches

The advantage of HP 3PAR is hidden in monitoring mechanism. One of them is especially useful in case we do not have any special monitoring within our SAN. Thanks to that, we have opportunity to detect possible issue, before it will affect substantially our environment.

3par showhost -lesb

Intermittent CRC Errors Detected

Let’s take a look how HP 3PAR can present such an information to us. First example comes from HP 3PAR Service Processor Onsite Customer Care (SPOCC).

Event type: evt_host_port_crc_errors			     
ID: 30003
Component: Port 3:1:3
Short Dsc: Host Port 3:1:3 experienced over 50 CRC errors (50) in 24 hours Event String: Host Port 3:1:3 experienced over 50 CRC errors (50) in 24 hours
...
Event String: Port 3:1:3 Degraded (Intermittent CRC Errors Detected {0x2})

Same thing can be checked on HP 3PAR system itself. Just look at the event log.

3PAR-cluster cli% shoeventlog -oneline -startt 1/1/16
2016-01-01 12:33:18 GMT        2 Minor           FC LESB Error   sw_port:2:3:2 FC LESB Error Port ID [2:3:2]-Counters: (Invalid transmission word) (Invalid CRC) -ALPAs:  140700
2016-01-01 12:44:19 GMT        2 Minor           FC LESB Error   sw_port:3:1:3 FC LESB Error Port ID [3:1:3]-Counters: (Invalid transmission word)-ALPAs:  140700


Next step lead us to deeper investigation on the presented issue.

Fibre Channel Link Error Status Block (LESB) on HP 3PAR

HP 3PAR offers LESB which stands for Link Error Status Block and this functionality is preserved for Fibre Channel. It gives six counters with fundamental checks of errors in terms of connectivity and its quality.

First do a check on all hosts within system for recognition of host that could be responsible for generated errors. Command showhost with -lesb option gives us necessary information.

3PAR-cluster cli% showhost -lesb
 Id Name       -WWN/iSCSI_Name- Port  ALPA     LinkFail LossSync LossSig PrimSeq InvWord InvCRC
  0 hostdb_1   5001418231E0666A 3:1:3 0x140700        0        0       0       0    1032    627
               5001418231E0666B 3:2:3 0xa0700         0        0       0       0       0      0
               5001418231E0666A 2:3:2 0x140700        0        0       0       0    1032    627
               5001418231E0666B 2:2:3 0xa0700         0        0       0       0       0      0

As shown above (of course output is truncated to things that have matter for this article) one host is tricky, moreover only one SFP/HBA of this host could be consider as problematic. Also we can note that storage ports are the same to those one reported by 3PAR system.

Let’s go deeper and make sure that hostdb_1 is the issue, not the storage port. It is time to check storage ports and hosts that are logged under them. First ..

3PAR-cluster cli% showportlesb single 2:3:2
     ID     ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
 0x140200 22210001CE003356        0        1       0       0       0      0
host0   0x140700 5001418231E0666A        0        0       0       0    1036    629
...
...

.. and the second one.

3PAR-cluster cli% showportlesb single 3:1:3
     ID     ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
 0x140300 23210001CE003457               0        2       0       0       0      0
host0   0x140700 5001418231E0666A        0        0       0       0    1036    629
...
...

The above output should show many hosts (depends on your configuration). As it was proved counters indicates on a specific WWPN of host.

Brocade SAN investigation on CRC errors

This is the time to look at the fabric. At this point we have knowledge that issue should not be caused by invalid storage port, as counters related to specific storage ports are clean. But to leave nothing to chance, we also check storage ports within fabric.

Brocade SAN investigation – searching for WWPNs on fabric

But first things first, let’s go and find problematic node and its connection within switch.

fabric2:admin> nodefind 50:01:41:82:31:E0:66:6A

Local:
 Type Pid    COS     PortName                NodeName                 SCR
 N    140700;      3;50:01:43:80:21:e0:58:7a;50:01:41:82:31:e0:66:7b; 0x00000003
FC4s: FCP
NodeSymb: [33] "HPBJ362A FW:v4.02.03 DVR:v6.5.6.3"
Fabric Port Name: 20:07:00:27:f8:22:44:3e
Permanent Port Name: 50:01:41:82:31:E0:66:6A
Device type: Physical Initiator
Port Index: 7
Share Area: No
Device Shared in Other AD: No
Redirect: No
Partial: No
LSAN: No
Aliases: hostdb_1_HBA2
Storage ports:

WWPN belongs to HBA marked by 3PAR as faulted has been found. Now find the connection between related storage ports and switch.

fabric2:admin> nodefind 22:21:00:01:CE:00:33:56

Local:
 Type Pid    COS     PortName                NodeName                 SCR
 N    140200;      3;22:21:00:01:CE:00:33:56;1d:f4:00:01:ad:00:23:65; 0x00000003
...
Permanent Port Name: 22:21:00:01:CE:00:33:56
Device type: Physical Target
Port Index: 2
...
fabric2:admin> nodefind 23:21:00:01:CE:00:34:57

Local:
 Type Pid    COS     PortName                NodeName                 SCR
 N    140300;      3;23:21:00:01:CE:00:34:57;1d:f5:00:01:ad:00:23:55; 0x00000003
...
Permanent Port Name: 23:21:00:01:CE:00:34:57
Device type: Physical Target
Port Index: 3
...

Let’s summarize our findings.

fabric2:admin> switchshow
2   2   140200   id    N4       Online      FC  F-Port  22:21:00:01:CE:00:33:56
3   3   140300   id    N4       Online      FC  F-Port  23:21:00:01:CE:00:34:57
7   7   140700   id    N8       Online      FC  F-Port  50:01:41:82:31:E0:66:6A

Brocade SAN investigation – error counters in practice

Start with the ports, where investigated storage ports are connected. Check them using below commands.

  • portstatsshow
  • porterrshow
  • portshow

If some counters appear to be in abnormal values, check them, however related to CRC errors could be also visible at this level. Other errors are not the subject of this article. Also to be sure that problem has continuous character, clear the counters on all investigated ports and give them time to show you how bad they are.

Time to look on port where host was attached.

fabric2:admin> portstatsshow 7
stat_wtx                442075321   4-byte words transmitted
stat_wrx                3050945891  4-byte words received
stat_ftx                2713593668  Frames transmitted
stat_frx                2811690785  Frames received
stat_c2_frx             0           Class 2 frames received
stat_c3_frx             2811690785  Class 3 frames received
stat_lc_rx              0           Link control frames received
stat_mc_rx              0           Multicast frames received
stat_mc_to              0           Multicast timeouts
stat_mc_tx              0           Multicast frames transmitted
tim_rdy_pri             0           Time R_RDY high priority
tim_txcrd_z             0       Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc  0- 3:  0           0           0           0
tim_txcrd_z_vc  4- 7:  0           0           0           0
tim_txcrd_z_vc  8-11:  0           0           0           0
tim_txcrd_z_vc 12-15:  0           0           0           0
er_enc_in               0           Encoding errors inside of frames
er_crc                  0           Frames with CRC errors
er_trunc                0           Frames shorter than minimum
er_toolong              0           Frames longer than maximum
er_bad_eof              0           Frames with bad end-of-frame
er_enc_out              8493        Encoding error outside of frames
er_bad_os               1046        Invalid ordered set
er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout
er_tx_c3_timeout        0           Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach      0           Class 3 frames discarded due to destination unreachable
er_other_discard        0           Other discards
er_type1_miss           0           frames with FTB type 1 miss
er_type2_miss           0           frames with FTB type 2 miss
er_type6_miss           0           frames with FTB type 6 miss
er_zone_miss            0           frames with hard zoning miss
er_lun_zone_miss        0           frames with LUN zoning miss
er_crc_good_eof         0           Crc error with good eof
er_inv_arb              0           Invalid ARB
open                    0           loop_open
transfer                0           loop_transfer
opened                  0           FL_Port opened
starve_stop             0           tenancies stopped due to starvation
fl_tenancy              0           number of times FL has the tenancy
nl_tenancy              0           number of times NL has the tenancy
zero_tenancy            0           zero tenancy

From above we can see two questionable counters:

  • er_enc_out
  • er_bad_os

With errors visible on 3PAR it is also possible to see on switch port increased values on counters related to bad transmitted frames, especially described as CRC.

Another useful output can be generated using porterrshow command.

fabric2:admin> porterrshow
          frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs
       tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                 tx    rx    err
  7:    2.7g   2.8g   0      0      0      0      0      0      8.4k   0      0      7      18      0      0      0      0      0

Now let’s take a look at counters visible under portshow command.

fabric2:admin> portshow 7
portIndex:   7
portName: port7
portHealth: No Fabric Watch License
...
portFlags: 0x24b03       PRESENT ACTIVE F_PORT G_PORT U_PORT LOGICAL_ONLINE LOGIN NOELP LED ACCEPT FLOGI
...
portState: 1    Online
...
Interrupts:        0          Link_failure: 0          Frjt:         0
Unknown:           0          Loss_of_sync: 7          Fbsy:         0
Lli:               99         Loss_of_sig:  18
Proc_rqrd:         208744     Protocol_err: 0
Timed_out:         0          Invalid_word: 8493
Rx_flushed:        0          Invalid_crc:  0
Tx_unavail:        0          Delim_err:    0
Free_buffer:       0          Address_err:  0
Overrun:           0          Lr_in:        8
Suspended:         0          Lr_out:       0
Parity_err:        0          Ols_in:       0
2_parity_err:      0          Ols_out:      8
CMI_bus_err:       0

Above output gives us very precious clues on the nature of our issue. But before any conclusions, do the check of SFP.

fabric2:admin> sfpshow 7 -f
Identifier:  3    SFP
Connector:   7    LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding:    1    8B10B
Baud Rate:   85   (units 100 megabaud)
Length 9u:   0    (units km)
Length 9u:   0    (units 100 meters)
Length 50u (OM2):  5    (units 10 meters)
Length 50u (OM3):  0    (units 10 meters)
Length 62.5u:2    (units 10 meters)
Length Cu:   0    (units 1 meter)
Vendor Name: HP-F     BROCADE
Vendor OUI:  00:05:1e
Vendor PN:   AJ716A
Vendor Rev:  A
Wavelength:  850  (units nm)
Options:     003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max:      0
BR Min:      0
Serial No:   UAF11228000060N
Date Code:   120709
DD Type:     0x68
Enh Options: 0xfa
Status/Ctrl: 0x30
Alarm flags[0,1] = 0x0, 0x0
Warn Flags[0,1] = 0x0, 0x0
                                           Alarm                  Warn
                                    low         high       low         high
Temperature: 42      Centigrade      -10        90         -5          85
Current:     6.176   mAmps           1.000      17.000     2.000       14.000
Voltage:     3295.0  mVolts          2900.0     3700.0     3000.0      3600.0
RX Power:    -3.2    dBm (475.9uW)   10.0   uW  1258.9 uW  15.8   uW   1000.0 uW
TX Power:    -3.3    dBm (464.1 uW)  125.9  uW  631.0  uW  158.5  uW   562.3  uW

Look at electrical parameters and Rx/Tx power, whether values are within acceptable frames. As we can see, in this case, those characteristics are just fine.

Brocade SAN investigation – final thoughts on CRC errors

The investigation we did lead us finally to some conclusions, which I am more than happy to present here.

First signs of disease that is sweeping our optical veins are errors related to encoding and invalid ordered set.

er_enc_out              8493        Encoding error outside of frames
er_bad_os               1046        Invalid ordered set

First value er_enc_out represents encoding error outside of frames. This counter is mostly connected to physical issues and if other counters on switch port related to CRC are not growing, it suggests problem with optical cable.

Second counter er_bad_os stands for invalid ordered set. What is ordered set? In simple words this is four byte transmission word (the smallest transmission unit defined in FC) that contains data and special character. Ordered sets are able to obtain bit and word synchronization. An ordered set is always begin with special character K28.5.

One of the most recognized Ordered Sets:

  • Start-of-Frame (SOF) and End-of-Frame (EOF) as frame delimiters.
  • Primitive signals (IDLE/ARBff and R_RDY), which are responsible for basic flow control.
  • Primitive sequence which is transmitted continuously. This sequence of ordered sets gives detection and visibility on port state and its behavior.
      » Offline state (OLS) indicates that port is beginning link initialization protocol, or received and recognized the NOS protocol or in the worst case scenario went in offline state.
      » Not operational (NOS) is transmitted when transmitting port is detecting link failure. It also can marked offline port, where waiting for OLS sequence.
      » Link reset (LR) for the purpose of initiate link reset.
      » Link reset response (LRR) for confirmation of LR.

Conclusion: Ordered Sets that are coming to investigated port is broken. Most of the Ordered Sets are IDLE/ARBff. It happen that some of them could not be processed, but it will not cause any issue. But, and there is always a but, many wrong ordered sets in a row can cause issue including loss of sync, and basically means that we are loosing connectivity.

Our hunch should lead us to hardware related issue, whether with cable or with HBA. But this is not the end, and time to move on with investigation another counters that come from portshow.

Interrupts:        0          Link_failure: 0          Frjt:         0
Unknown:           0          Loss_of_sync: 7          Fbsy:         0
Lli:               99         Loss_of_sig:  18
Proc_rqrd:         208744     Protocol_err: 0
Timed_out:         0          Invalid_word: 8493
Rx_flushed:        0          Invalid_crc:  0
Tx_unavail:        0          Delim_err:    0
Free_buffer:       0          Address_err:  0
Overrun:           0          Lr_in:        8
Suspended:         0          Lr_out:       0
Parity_err:        0          Ols_in:       0
2_parity_err:      0          Ols_out:      8
CMI_bus_err:       0

Error counters from above give us final confirmation about the real face of issue. Let me explain, what can be found there.

  • LLi stands for low-level interface and indicates problems with physical state or primitive sequences.
  • Proc_rqrd – Frames delivered for embedded N_Port processing. Processing required list interrupts – number of received frames that cannot be processed by HW. Reasons can be due to invalid DID, SID, DID not in routing tables, invalid VC, class of service, etc.
  • Loss_of_sync – port cannot synchronize (obtain bit and word synchronization).
  • Loss_of_sig – port is losing signal.

All those error counters strongly suggest loss of connectivity because of faulty SFP or cable. Take a closer look on below counters.

Lr_in:        8         Ols_out:      8
Lr_out:       0         Ols_in:       0

Lr and Ols are previously explained primitive sequences. Here we have simple rule:

  • Pair: Lr_in and Ols_out.
  • Pair: Lr_out and Ols_in.

In and out stand for direction of sequence. In means that it is going to switch port, out means that transmission is initiated from switch port. In general if in above pairs we have situation that one value of parameter is significantly higher (far different from equal), then the problem is obvious. If _in is much higher than _out then issue is coming to the switch, and in opposite to this, if _out counter in pair is much higher, then the switch itself cause the issue.

In our case we have equal values in pair Lr_in and Ols_out, but still they are not zero. Link reset came several times, adding to this loss_of_sync, loss_of_sig and many invalid words, conclusion is obvious.

How can we explained it? In most cases reason lies in faulty cable. Statistically it significantly rather caused by faulty HBA (its SFP) of host. That’s it. Just replace cable, in worst case scenario – HBA, and problem will gone as with touch of magic wand.

How to reset LESB counters on HP 3PAR?

Unfortunately answer to this question is negative. Manual clearly enough tells that LESB counter among with CRC errors cannot be reset. Probably such a possibility is directly in shell, so the best way is to contact HP support.

7 thoughts on “HP 3PAR CRC errors in correlation with Brocade SAN

  1. Hello mr. Regmen!

    We thanks a LOOOTT this your post because we solved a big problem involving 3PAR CRC erros an HP Virtual Connect: new ports on VC are using optical fibre mono mode rather than multi mode and it generate a huge CRC quantity erros.

    Following this tutorial step by step we found the problem and solved it.

    Thanks a lot one more time!

    Best Regards,
    Juliano Flores

  2. I followed your guide and it helped me track some issues. Turns out my issue was due to what looks to be outdated HBA firmware that is incompatible with 3.1.1 and not a cable or bad HBA/SFP

Leave a Reply

Your email address will not be published. Required fields are marked *