HP 3PAR CRC errors and Invalid transmission word in correlation with Brocade SAN switches
The advantage of HP 3PAR is hidden in monitoring mechanism. One of them is especially useful in case we do not have any special monitoring within our SAN. Thanks to that, we have opportunity to detect possible issue, before it will affect substantially our environment.
Intermittent CRC Errors Detected
Let’s take a look how HP 3PAR can present such an information to us. First example comes from HP 3PAR Service Processor Onsite Customer Care (SPOCC).
Event type: evt_host_port_crc_errors ID: 30003 Component: Port 3:1:3 Short Dsc: Host Port 3:1:3 experienced over 50 CRC errors (50) in 24 hours Event String: Host Port 3:1:3 experienced over 50 CRC errors (50) in 24 hours ... Event String: Port 3:1:3 Degraded (Intermittent CRC Errors Detected {0x2})
Same thing can be checked on HP 3PAR system itself. Just look at the event log.
3PAR-cluster cli% shoeventlog -oneline -startt 1/1/16 2016-01-01 12:33:18 GMT 2 Minor FC LESB Error sw_port:2:3:2 FC LESB Error Port ID [2:3:2]-Counters: (Invalid transmission word) (Invalid CRC) -ALPAs: 140700 2016-01-01 12:44:19 GMT 2 Minor FC LESB Error sw_port:3:1:3 FC LESB Error Port ID [3:1:3]-Counters: (Invalid transmission word)-ALPAs: 140700
Next step lead us to deeper investigation on the presented issue.
Fibre Channel Link Error Status Block (LESB) on HP 3PAR
HP 3PAR offers LESB which stands for Link Error Status Block and this functionality is preserved for Fibre Channel. It gives six counters with fundamental checks of errors in terms of connectivity and its quality.
First do a check on all hosts within system for recognition of host that could be responsible for generated errors. Command showhost with -lesb option gives us necessary information.
3PAR-cluster cli% showhost -lesb Id Name -WWN/iSCSI_Name- Port ALPA LinkFail LossSync LossSig PrimSeq InvWord InvCRC 0 hostdb_1 5001418231E0666A 3:1:3 0x140700 0 0 0 0 1032 627 5001418231E0666B 3:2:3 0xa0700 0 0 0 0 0 0 5001418231E0666A 2:3:2 0x140700 0 0 0 0 1032 627 5001418231E0666B 2:2:3 0xa0700 0 0 0 0 0 0
As shown above (of course output is truncated to things that have matter for this article) one host is tricky, moreover only one SFP/HBA of this host could be consider as problematic. Also we can note that storage ports are the same to those one reported by 3PAR system.
Let’s go deeper and make sure that hostdb_1 is the issue, not the storage port. It is time to check storage ports and hosts that are logged under them. First ..
3PAR-cluster cli% showportlesb single 2:3:2 ID ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC 0x140200 22210001CE003356 0 1 0 0 0 0 host0 0x140700 5001418231E0666A 0 0 0 0 1036 629 ... ...
.. and the second one.
3PAR-cluster cli% showportlesb single 3:1:3 ID ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC 0x140300 23210001CE003457 0 2 0 0 0 0 host0 0x140700 5001418231E0666A 0 0 0 0 1036 629 ... ...
The above output should show many hosts (depends on your configuration). As it was proved counters indicates on a specific WWPN of host.
Brocade SAN investigation on CRC errors
This is the time to look at the fabric. At this point we have knowledge that issue should not be caused by invalid storage port, as counters related to specific storage ports are clean. But to leave nothing to chance, we also check storage ports within fabric.
Brocade SAN investigation – searching for WWPNs on fabric
But first things first, let’s go and find problematic node and its connection within switch.
fabric2:admin> nodefind 50:01:41:82:31:E0:66:6A Local: Type Pid COS PortName NodeName SCR N 140700; 3;50:01:43:80:21:e0:58:7a;50:01:41:82:31:e0:66:7b; 0x00000003 FC4s: FCP NodeSymb: [33] "HPBJ362A FW:v4.02.03 DVR:v6.5.6.3" Fabric Port Name: 20:07:00:27:f8:22:44:3e Permanent Port Name: 50:01:41:82:31:E0:66:6A Device type: Physical Initiator Port Index: 7 Share Area: No Device Shared in Other AD: No Redirect: No Partial: No LSAN: No Aliases: hostdb_1_HBA2 Storage ports:
WWPN belongs to HBA marked by 3PAR as faulted has been found. Now find the connection between related storage ports and switch.
fabric2:admin> nodefind 22:21:00:01:CE:00:33:56 Local: Type Pid COS PortName NodeName SCR N 140200; 3;22:21:00:01:CE:00:33:56;1d:f4:00:01:ad:00:23:65; 0x00000003 ... Permanent Port Name: 22:21:00:01:CE:00:33:56 Device type: Physical Target Port Index: 2 ...
fabric2:admin> nodefind 23:21:00:01:CE:00:34:57 Local: Type Pid COS PortName NodeName SCR N 140300; 3;23:21:00:01:CE:00:34:57;1d:f5:00:01:ad:00:23:55; 0x00000003 ... Permanent Port Name: 23:21:00:01:CE:00:34:57 Device type: Physical Target Port Index: 3 ...
Let’s summarize our findings.
fabric2:admin> switchshow 2 2 140200 id N4 Online FC F-Port 22:21:00:01:CE:00:33:56 3 3 140300 id N4 Online FC F-Port 23:21:00:01:CE:00:34:57 7 7 140700 id N8 Online FC F-Port 50:01:41:82:31:E0:66:6A
Brocade SAN investigation – error counters in practice
Start with the ports, where investigated storage ports are connected. Check them using below commands.
- portstatsshow
- porterrshow
- portshow
If some counters appear to be in abnormal values, check them, however related to CRC errors could be also visible at this level. Other errors are not the subject of this article. Also to be sure that problem has continuous character, clear the counters on all investigated ports and give them time to show you how bad they are.
Time to look on port where host was attached.
fabric2:admin> portstatsshow 7 stat_wtx 442075321 4-byte words transmitted stat_wrx 3050945891 4-byte words received stat_ftx 2713593668 Frames transmitted stat_frx 2811690785 Frames received stat_c2_frx 0 Class 2 frames received stat_c3_frx 2811690785 Class 3 frames received stat_lc_rx 0 Link control frames received stat_mc_rx 0 Multicast frames received stat_mc_to 0 Multicast timeouts stat_mc_tx 0 Multicast frames transmitted tim_rdy_pri 0 Time R_RDY high priority tim_txcrd_z 0 Time TX Credit Zero (2.5Us ticks) tim_txcrd_z_vc 0- 3: 0 0 0 0 tim_txcrd_z_vc 4- 7: 0 0 0 0 tim_txcrd_z_vc 8-11: 0 0 0 0 tim_txcrd_z_vc 12-15: 0 0 0 0 er_enc_in 0 Encoding errors inside of frames er_crc 0 Frames with CRC errors er_trunc 0 Frames shorter than minimum er_toolong 0 Frames longer than maximum er_bad_eof 0 Frames with bad end-of-frame er_enc_out 8493 Encoding error outside of frames er_bad_os 1046 Invalid ordered set er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout er_tx_c3_timeout 0 Class 3 transmit frames discarded due to timeout er_c3_dest_unreach 0 Class 3 frames discarded due to destination unreachable er_other_discard 0 Other discards er_type1_miss 0 frames with FTB type 1 miss er_type2_miss 0 frames with FTB type 2 miss er_type6_miss 0 frames with FTB type 6 miss er_zone_miss 0 frames with hard zoning miss er_lun_zone_miss 0 frames with LUN zoning miss er_crc_good_eof 0 Crc error with good eof er_inv_arb 0 Invalid ARB open 0 loop_open transfer 0 loop_transfer opened 0 FL_Port opened starve_stop 0 tenancies stopped due to starvation fl_tenancy 0 number of times FL has the tenancy nl_tenancy 0 number of times NL has the tenancy zero_tenancy 0 zero tenancy
From above we can see two questionable counters:
- er_enc_out
- er_bad_os
With errors visible on 3PAR it is also possible to see on switch port increased values on counters related to bad transmitted frames, especially described as CRC.
Another useful output can be generated using porterrshow command.
fabric2:admin> porterrshow frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err 7: 2.7g 2.8g 0 0 0 0 0 0 8.4k 0 0 7 18 0 0 0 0 0
Now let’s take a look at counters visible under portshow command.
fabric2:admin> portshow 7 portIndex: 7 portName: port7 portHealth: No Fabric Watch License ... portFlags: 0x24b03 PRESENT ACTIVE F_PORT G_PORT U_PORT LOGICAL_ONLINE LOGIN NOELP LED ACCEPT FLOGI ... portState: 1 Online ... Interrupts: 0 Link_failure: 0 Frjt: 0 Unknown: 0 Loss_of_sync: 7 Fbsy: 0 Lli: 99 Loss_of_sig: 18 Proc_rqrd: 208744 Protocol_err: 0 Timed_out: 0 Invalid_word: 8493 Rx_flushed: 0 Invalid_crc: 0 Tx_unavail: 0 Delim_err: 0 Free_buffer: 0 Address_err: 0 Overrun: 0 Lr_in: 8 Suspended: 0 Lr_out: 0 Parity_err: 0 Ols_in: 0 2_parity_err: 0 Ols_out: 8 CMI_bus_err: 0
Above output gives us very precious clues on the nature of our issue. But before any conclusions, do the check of SFP.
fabric2:admin> sfpshow 7 -f Identifier: 3 SFP Connector: 7 LC Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist Encoding: 1 8B10B Baud Rate: 85 (units 100 megabaud) Length 9u: 0 (units km) Length 9u: 0 (units 100 meters) Length 50u (OM2): 5 (units 10 meters) Length 50u (OM3): 0 (units 10 meters) Length 62.5u:2 (units 10 meters) Length Cu: 0 (units 1 meter) Vendor Name: HP-F BROCADE Vendor OUI: 00:05:1e Vendor PN: AJ716A Vendor Rev: A Wavelength: 850 (units nm) Options: 003a Loss_of_Sig,Tx_Fault,Tx_Disable BR Max: 0 BR Min: 0 Serial No: UAF11228000060N Date Code: 120709 DD Type: 0x68 Enh Options: 0xfa Status/Ctrl: 0x30 Alarm flags[0,1] = 0x0, 0x0 Warn Flags[0,1] = 0x0, 0x0 Alarm Warn low high low high Temperature: 42 Centigrade -10 90 -5 85 Current: 6.176 mAmps 1.000 17.000 2.000 14.000 Voltage: 3295.0 mVolts 2900.0 3700.0 3000.0 3600.0 RX Power: -3.2 dBm (475.9uW) 10.0 uW 1258.9 uW 15.8 uW 1000.0 uW TX Power: -3.3 dBm (464.1 uW) 125.9 uW 631.0 uW 158.5 uW 562.3 uW
Look at electrical parameters and Rx/Tx power, whether values are within acceptable frames. As we can see, in this case, those characteristics are just fine.
Brocade SAN investigation – final thoughts on CRC errors
The investigation we did lead us finally to some conclusions, which I am more than happy to present here.
First signs of disease that is sweeping our optical veins are errors related to encoding and invalid ordered set.
er_enc_out 8493 Encoding error outside of frames er_bad_os 1046 Invalid ordered set
First value er_enc_out represents encoding error outside of frames. This counter is mostly connected to physical issues and if other counters on switch port related to CRC are not growing, it suggests problem with optical cable.
Second counter er_bad_os stands for invalid ordered set. What is ordered set? In simple words this is four byte transmission word (the smallest transmission unit defined in FC) that contains data and special character. Ordered sets are able to obtain bit and word synchronization. An ordered set is always begin with special character K28.5.
One of the most recognized Ordered Sets:
- Start-of-Frame (SOF) and End-of-Frame (EOF) as frame delimiters.
- Primitive signals (IDLE/ARBff and R_RDY), which are responsible for basic flow control.
- Primitive sequence which is transmitted continuously. This sequence of ordered sets gives detection and visibility on port state and its behavior.
- » Offline state (OLS) indicates that port is beginning link initialization protocol, or received and recognized the NOS protocol or in the worst case scenario went in offline state.
- » Not operational (NOS) is transmitted when transmitting port is detecting link failure. It also can marked offline port, where waiting for OLS sequence.
- » Link reset (LR) for the purpose of initiate link reset.
- » Link reset response (LRR) for confirmation of LR.
Conclusion: Ordered Sets that are coming to investigated port is broken. Most of the Ordered Sets are IDLE/ARBff. It happen that some of them could not be processed, but it will not cause any issue. But, and there is always a but, many wrong ordered sets in a row can cause issue including loss of sync, and basically means that we are loosing connectivity.
Our hunch should lead us to hardware related issue, whether with cable or with HBA. But this is not the end, and time to move on with investigation another counters that come from portshow.
Interrupts: 0 Link_failure: 0 Frjt: 0 Unknown: 0 Loss_of_sync: 7 Fbsy: 0 Lli: 99 Loss_of_sig: 18 Proc_rqrd: 208744 Protocol_err: 0 Timed_out: 0 Invalid_word: 8493 Rx_flushed: 0 Invalid_crc: 0 Tx_unavail: 0 Delim_err: 0 Free_buffer: 0 Address_err: 0 Overrun: 0 Lr_in: 8 Suspended: 0 Lr_out: 0 Parity_err: 0 Ols_in: 0 2_parity_err: 0 Ols_out: 8 CMI_bus_err: 0
Error counters from above give us final confirmation about the real face of issue. Let me explain, what can be found there.
- LLi stands for low-level interface and indicates problems with physical state or primitive sequences.
- Proc_rqrd – Frames delivered for embedded N_Port processing. Processing required list interrupts – number of received frames that cannot be processed by HW. Reasons can be due to invalid DID, SID, DID not in routing tables, invalid VC, class of service, etc.
- Loss_of_sync – port cannot synchronize (obtain bit and word synchronization).
- Loss_of_sig – port is losing signal.
All those error counters strongly suggest loss of connectivity because of faulty SFP or cable. Take a closer look on below counters.
Lr_in: 8 Ols_out: 8 Lr_out: 0 Ols_in: 0
Lr and Ols are previously explained primitive sequences. Here we have simple rule:
- Pair: Lr_in and Ols_out.
- Pair: Lr_out and Ols_in.
In and out stand for direction of sequence. In means that it is going to switch port, out means that transmission is initiated from switch port. In general if in above pairs we have situation that one value of parameter is significantly higher (far different from equal), then the problem is obvious. If _in is much higher than _out then issue is coming to the switch, and in opposite to this, if _out counter in pair is much higher, then the switch itself cause the issue.
In our case we have equal values in pair Lr_in and Ols_out, but still they are not zero. Link reset came several times, adding to this loss_of_sync, loss_of_sig and many invalid words, conclusion is obvious.
How can we explained it? In most cases reason lies in faulty cable. Statistically it significantly rather caused by faulty HBA (its SFP) of host. That’s it. Just replace cable, in worst case scenario – HBA, and problem will gone as with touch of magic wand.
How to reset LESB counters on HP 3PAR?
Unfortunately answer to this question is negative. Manual clearly enough tells that LESB counter among with CRC errors cannot be reset. Probably such a possibility is directly in shell, so the best way is to contact HP support.
Hello mr. Regmen!
We thanks a LOOOTT this your post because we solved a big problem involving 3PAR CRC erros an HP Virtual Connect: new ports on VC are using optical fibre mono mode rather than multi mode and it generate a huge CRC quantity erros.
Following this tutorial step by step we found the problem and solved it.
Thanks a lot one more time!
Best Regards,
Juliano Flores
Dear Juliano,
I am very glad that you found it useful 🙂
Cheers!
I followed your guide and it helped me track some issues. Turns out my issue was due to what looks to be outdated HBA firmware that is incompatible with 3.1.1 and not a cable or bad HBA/SFP
Very thorough breakdown. Awesome stuff.
Thanks for your guide very useful and informative.
Must tell you my friend… WE ARE FRIENDS NOW!
Your article saved my life last night. Was finally able to sleep. Found a very similar issue on HP VC modules thanks to “er_bad_os” hint.
that is the KB that mimics my issues which was happening every 2-3 months:
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c02619780
You have a good one and much appreciated again!
Now it’s clear! Respect and a BIG thank you!