On multiple production systems with 5 network interfaces (Intel i210 NICs, and also some systems with older Intel 82574 NICs), we are now seeing the output of netstat -in and ip -s link show indicating TX drops, like so:
e4 is not connected (ip addr shows NO-CARRIER). The other interfaces are all connected. e3 has an embedded device which has powered off and on a few times, and the TX drops increment when that happens.
e0 and e1 are connected to COTS switches.
e2 is connected to a FreeBSD host (which shows no drops, errors or collisions in its netstat output).
The small numbers of TX drops for e0, e1, and e2 appeared early on, at some point while the interfaces were being initialized.
In RL9.5, these TX drops on the same physical hosts are consistently at 0 on all the systems (a dozen or so) where RL9.6 is showing drops.
I suspect this is just an accounting change, but I'd like to pin down what part of the stack changed (or what setting changed) that would trigger these drops.
ethtool -S <ifname> does not show any TX drops for the i210, but it does show tx_dropped values that match the netstat -in numbers for the older 82574 NICs.
kernel is 5.14.0-570.30.1.el9_6.x86_64
igb driver in use - all interfaces 1 Gbps.
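For cross-checking the tools against each other: netstat -in ultimately reads /proc/net/dev, so the raw counters can be pulled directly. A minimal sketch, assuming the standard /proc/net/dev column layout (the optional file argument is only there so a saved snapshot can be inspected too):

```shell
# Print per-interface TX drop counters from a /proc/net/dev-style file.
# After replacing the ':' following the interface name, field 13 is the
# TX "drop" column (8 RX fields, then TX bytes, packets, errs, drop).
tx_drops() {
  awk 'NR > 2 { gsub(/:/, " "); print $1, $13 }' "${1:-/proc/net/dev}"
}
```

Usage: `tx_drops` on a live box, or `tx_drops snapshot.txt` against a copy saved from a 9.5 boot, so the same parser is applied to both kernels.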
First question… has anyone else here seen this issue firsthand (zero TX drops on 9.5, non-zero drops on 9.6)?
Next question… does anyone have an inkling of what changed in the network stack that could trigger this?
Looking at the last column, where it says "TX-DR", I only see numbers for e3 and e4 (you already explained why those two are dropping). For e0, e1, and e2 the column is blank, so where are the drops?
It's not 1 or 3. Same hardware (inside and outside the box). Switching between RL9.5 and RL9.6, it is ALWAYS 0 TX drops on RL9.5 and non-zero on RL9.6.
If it's 2 (interface misconfig), then it is somehow caused by a change in the underlying code (network stack), because the /etc files are the same.
It could certainly be 4 (but I'll add: changes to kernel network code and/or userland tools like netstat(8) or ip(8)).
So I'm mainly leaning towards 4 and trying to figure out what changed between 9.5 and 9.6 that could be implicated. Hence the post here for help (and corroborating stories, positive or negative, from people who are running RL9.5 and RL9.6).
I have an old ProLiant DL380e that crashes with 9.6 kernels, and I'm stuck using 5.14.0-503.40.1.el9_5 on it, which was the last 9.5 kernel release. So my situation is far worse than yours.
I expect that if you still have a 9.5 kernel installed, like the version above, you'll probably find it also works fine, but as soon as you boot into a 9.6 kernel the problems start. In the end I just use kernel-lt from ELRepo instead: I get a 6.x kernel and much more hardware support than the default kernel in 9.x.
Evidently RHEL changed something in hardware support for 9.6 that is causing these issues.
Maybe try installing the ELRepo kernel-lt on yours as well and see if the problems go away:
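Something like the following should do it on EL9 (check elrepo.org for the current release package and instructions; run as root):

```shell
# Import the ELRepo signing key and install the EL9 release package,
# then pull kernel-lt from the elrepo-kernel repository.
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm
dnf --enablerepo=elrepo-kernel install kernel-lt
reboot   # then select the 6.x kernel-lt entry in GRUB if it is not the default
```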
Yep, I tried three or four of the subsequent kernels that came out afterwards, and the effect was still the same. I gave up trying further ones after that, since the ELRepo one worked. I could of course raise a bug report for it, but I'm not entirely bothered about it. Either it wasn't meant to happen, and is thus a regression, or it was intentional. I would have expected it to work at least until the EOL of EL9.
Since the server doesn't support x86_64-v3, I'll be retiring it in 2032 anyway, assuming it doesn't break before then.
In my earlier reply, I suggested running "ip -s link show e3", which might help you obtain more information.
In my (fully up to date) Rocky 9.6 running on a Dell PowerEdge server, "netstat -in" shows 21 RX-DRP (out of 549,188,830), but "ifconfig eno3" has more detail and shows those 21 as actually "MISSED".
Google Gemini states that "The missed statistic in the Linux command ip refers to packets that a network interface driver failed to process and dropped before they could be passed up the networking stack. This typically indicates a performance bottleneck where the system, often the CPU, isn't fast enough to handle the incoming packet rate."
You mention your computer is an old ProLiant DL380e, which Gemini tells me was released circa 2012. Could your "dropped" packets actually be "missed" packets due to the CPU?
One possibility is that 9.5 is giving inaccurate results and 9.6 is showing the real situation (unlikely, but possible). So maybe try using something other than netstat, first on 9.5 and then on 9.6: a raw kernel counter, the /proc filesystem, or some other tool.
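For raw kernel counters with no netstat or ip in the way, sysfs exposes per-interface statistics directly. A small sketch (the optional root argument only exists so the function can be pointed at a saved copy of the tree; on a live box, call it with no argument):

```shell
# List tx_dropped straight from the kernel's per-interface statistics
# under /sys/class/net/<ifname>/statistics/.
show_tx_dropped() {
  root="${1:-}"
  for dev in "$root"/sys/class/net/*; do
    [ -f "$dev/statistics/tx_dropped" ] || continue
    printf '%s %s\n' "${dev##*/}" "$(cat "$dev/statistics/tx_dropped")"
  done
}
```

If these sysfs numbers match netstat on 9.6, the kernel itself is counting the drops, and the userland tools are off the hook.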
Really interesting to see your charts. The y-axis labels are cut off; is the typical throughput a constant 1,000 Mbit/s? Google Gemini lists these three reasons for input discards, and I'd guess it is the first one, since the problem went away with the new RPM:
A full receive buffer: The NICās internal buffer is full and cannot store the incoming packet.
Invalid packet format: The packet is malformed and the NIC cannot properly read it.
Hardware limitations: The card is unable to process packets as fast as they are arriving due to hardware constraints.
So, maybe the new RPM is just faster at processing the receive buffers?
(and that would be helpful for the OP too, who stated having a circa-2012 server)
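If a full receive ring is the suspect, it might be worth checking whether the driver's RX ring is already at its hardware maximum; ethtool can show and, hardware permitting, enlarge it. Illustrative only: eno3 and the 4096 value are assumptions, so use your own interface name and whatever maximum your NIC reports.

```shell
ethtool -g eno3            # show current and maximum RX/TX ring sizes
ethtool -G eno3 rx 4096    # grow the RX ring toward the reported maximum
```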
OK, there are a lot of things going on in this thread; let me see what I can do to help with bisecting, together with willing community members (I'm waiting on hardware procurement).
There is an issue with various Intel NICs (i210, 82574, E810) on Rocky 9.6 that didn't happen on 9.5.
Iwalker has a kernel panic on boot. I'm not going to chalk this up to the specific issue above; the ELRepo kernel-lt is 6.1, which is both ahead of and behind 9.6, depending on the subsystem.
If you have the kernel panic stack trace from that, I'm happy to look at it.
We know that the Intel OOT driver solves the RX issue (or at least appears to do so). One of the problems is that the OOT driver is just blob dumps, and it's hard to validate what is in that OOT driver versus what is in Linus's tree, still on the LKML, or not yet even pushed to the LKML. But it's a good data point.
However, to look at the code deeper, I need to understand what starts to work when, to narrow down the possible change sets. Grepping through the upstream changelog, there wasn't anything SUPER obvious around RX errors/drops in the net-new changes (i.e. things without a Fixes: line in the commit).
For those with this issue who have resources for testing: since I don't have access to this hardware yet, I need to lean on those who have time and non-prod resources.
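For volunteers, something like the sketch below could standardize what gets collected, so that runs under 9.5 and 9.6 kernels can simply be diffed. The e0..e4 names follow this thread and are assumptions; substitute your own interfaces.

```shell
#!/bin/sh
# Snapshot the drop/miss counters discussed in this thread,
# one output file per booted kernel version.
collect_drops() {
  echo "kernel: $(uname -r)"
  for i in e0 e1 e2 e3 e4; do
    [ -d "/sys/class/net/$i" ] || continue
    echo "== $i =="
    ip -s link show "$i"
    ethtool -S "$i" 2>/dev/null | grep -Ei 'drop|miss' || true
  done
}
collect_drops > "drops-$(uname -r).txt"
```

Running it once on a 9.5 boot and once on a 9.6 boot on the same host, then diffing the two files, would pin down exactly which counters diverge.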