With a fully patched Rocky Linux 8 server and Intel’s E810 NIC, we are seeing a large number of rx_csum_bad.nic errors. Researching this issue turns up the following article, which I do not have access to: Intel i40e and ice driver report "rx_csum_bad.nic" statistic growing - Red Hat Customer Portal
Other sources confirm that the ice driver from the Linux kernel has other known issues, and the source install of Intel’s ice driver is the way to go:
https://community.intel.com/t5/Ethernet-Products/ice-rx-errros-is-too-sensitive-to-IP-TCP-attack-packets-Intel/m-p/1668177/highlight/true
https://forum.proxmox.com/threads/default-ice-driver-kernel-6-poor-performance.133855/
Obviously compiling drivers from source isn’t the most desirable solution going forward, so I was curious if anybody had a more suitable Rocky Linux solution? Or if anybody has access to RH’s solution? Should I post something on https://bugs.rockylinux.org/ even though RH already has a “Solution Verified - Updated November 13 2024”?
Appreciate any help and guidance.
You can register yourself on the Red Hat portal by creating an account and you should be able to gain access to the documentation. If it’s still not available at that point, you can register for a Red Hat Developer subscription and once that is active on the same account you registered on the portal, you’ll then be able to read the documentation.
I don’t believe the Proxmox link applies in your case, since it concerns kernel 6.x whereas Rocky 8 ships a 4.18 kernel.
If you register on the RH site and the fix there doesn’t help you, at that point I would open a bug on the Rocky bugs website.
Unfortunately there’s not much in that RH doc. It’s kind of expected behavior given the Intel forum post too.
–
Environment
- Red Hat Enterprise Linux
- Intel 700-series (i40e driver) NIC
- Intel 800-series (ice driver) NIC
Issue
- Intel i40e and ice driver report rx_csum_bad.nic statistic growing
- NIC has RX-ERR counter grown, and ethtool -S statistics show a similar number of rx_csum_bad.nic
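For anyone who wants to watch how fast the counter grows, here is a minimal Python sketch that parses `ethtool -S` output and reports the change in rx_csum_bad.nic between two readings. It is not part of the Red Hat article. The sample text and the helper names (`parse_ethtool_stats`, `counter_delta`, `read_stats`) are made up for illustration, and the interface name is whatever your system uses.

```python
import re
import subprocess


def parse_ethtool_stats(text):
    """Parse the text printed by `ethtool -S <iface>` into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     rx_csum_bad.nic: 1234".
        # The "NIC statistics:" header has no number and is skipped.
        match = re.match(r"\s*(\S+):\s*(-?\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def counter_delta(before, after, name="rx_csum_bad.nic"):
    """Return how much one counter grew between two ethtool -S readings."""
    old = parse_ethtool_stats(before).get(name, 0)
    new = parse_ethtool_stats(after).get(name, 0)
    return new - old


def read_stats(iface):
    """Run `ethtool -S` for an interface (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical readings, invented for this example (not real measurements).
SAMPLE_BEFORE = """NIC statistics:
     rx_unicast: 1000
     rx_csum_bad.nic: 40
"""
SAMPLE_AFTER = """NIC statistics:
     rx_unicast: 1900
     rx_csum_bad.nic: 65
"""

if __name__ == "__main__":
    print(counter_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

On a real host you would call `read_stats()` twice, some seconds apart, and pass both results to `counter_delta()`. Comparing the growth against the total received packet count gives a rough idea of what share of traffic is being flagged.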
Resolution
These counters indicate a Layer 3 or Layer 4 checksum issue.
Troubleshoot the sender and network path to determine where the checksum failure occurs, and resolve it.
Root Cause
The Intel 700-series and 800-series have many advanced features.
One feature is that the hardware performs a checksum check of IPv4, IPv6, local Layer 4, and UDP Tunnel outer header.
This is done in driver functions i40e_rx_checksum() and ice_rx_csum() which grow the hw_csum_rx_error statistic.
In ethtool -S reporting, that hw_csum_rx_error becomes rx_csum_bad.nic.
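To make the "checksum check" concrete, here is a small Python sketch of the standard Internet checksum (RFC 1071) applied to an IPv4 header. This is only an illustration of the arithmetic, not the driver's code: the NIC does this in hardware, and the function names and the sample header bytes below are my own choices for the example.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        # Fold any carry above 16 bits back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """Check an IPv4 header checksum.

    Summing a valid header, checksum field included, gives 0xFFFF,
    so the complemented result is 0.
    """
    ihl = (header[0] & 0x0F) * 4  # header length in bytes
    return len(header) >= ihl >= 20 and internet_checksum(header[:ihl]) == 0


# Illustrative 20-byte header (192.168.0.1 -> 192.168.0.199, UDP).
GOOD = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
# Same header with the TTL byte changed but the checksum left untouched,
# as a misbehaving device on the path might do.
BAD = GOOD[:8] + b"\x3f" + GOOD[9:]

if __name__ == "__main__":
    print(ipv4_header_ok(GOOD), ipv4_header_ok(BAD))
```

A header like `BAD` is the kind of packet that would be counted as a checksum error on receipt. Running the same check over headers taken from a packet capture is one way to confirm whether the bad checksums are really on the wire.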
What that tells you is that there are communication errors in your network that were not being recorded and/or reported before, and the new driver now discloses them.
So you apparently have an upstream networking problem. Possibly an overloaded or failing switch or router, or a defective cable or cable connector.
I doubt that it is “an overloaded or failing switch or router, or a defective cable or cable connector.” I can reproduce this exact issue in multiple different environments, with different upstream networking involved. However, you are making me think a little more about what else, specifically, could be having an impact “upstream”. Interesting, and I’ll have to think on this some more.
If you’ve already done the network troubleshooting and really need the OEM driver, you can make a request over at ELRepo.