Date: Thu, 23 May 2024 09:12:16 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 279245] igc(4) I226 (and I225) hangups
Message-ID: <bug-279245-227@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245

            Bug ID: 279245
           Summary: igc(4) I226 (and I225) hangups
           Product: Base System
           Version: 13.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: freebsd_email@congenio.de

When using an I226 under OpnSense (FreeBSD 13.2-RELEASE kernel - I also
tried FreeBSD 14.0-RELEASE), I experience connection hangups about once per
day under no specific circumstances (the maximum was 3 times within one
hour; I have also gone three days without any).

This problem manifests as a dead connection (no packets are received, none
are sent), but the low-level counters (dev.igc.0.mac_stats) still increase.

The condition can be cleared by bringing the interface down and up again or
by briefly disconnecting the cable.

There are reports of this and other related problems all over the internet
for different OSes, see:

Windows:
https://forums.evga.com/PSA-Intel-I226V-25GbE-on-Raptor-Lake-Motherboards-Has-a-Connection-Drop-Issue-No-Fix-m3595279.aspx

OpnSense (FreeBSD):
https://forum.opnsense.org/index.php?topic=40404.msg199288#msg199288

pfSense (FreeBSD):
https://forum.netgate.com/topic/181571/chinese-i226-v-on-23-05-1-problems

My specific variant is an I226-V, rev. 4, built into a Minisforum MS-01:

igc0@pci0:87:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet

However, there are reports of the I226-LM in the same machine showing the
same behaviour, see: https://forum.opnsense.org/index.php?topic=40556

igc1@pci0:88:0:0: class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-LM'
    class      = network
    subclass   = ethernet

This seems to indicate
that at least the I226 family (which is a successor to the problem-ridden
I225 using the same driver module) is affected by this problem.

I tried all possible settings I could think of to make this go away, like
reducing the speed from 2.5 to 1 Gbps and disabling EEE (which is off by
default anyway), to no avail.

Interestingly, the Minisforum MS-01 has attracted much interest in the last
few months, and there was a specific review on YouTube where the creator
states in a comment that he is not seeing this problem
(https://www.youtube.com/watch?v=_wgX1sDab-M). However, he runs OpnSense
under a Proxmox hypervisor, thus using the Linux driver modules (OpnSense
itself uses the virtualized virtio NICs).

This and the reports of gamers stating they had "micro-hangs" manifesting as
short lags in online games got me thinking.

So I compared the Linux and FreeBSD drivers and found that the Linux driver
has a specific routine to catch, log, and clear "TX hang" conditions, see
from line 3150 here:
https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/igc/igc_main.c,
which reads:

    if (test_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
        struct igc_hw *hw = &adapter->hw;

        /* Detect a transmit hang in hardware, this serializes the
         * check with the clearing of time_stamp and movement of i
         */
        clear_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
        if (tx_buffer->next_to_watch &&
            time_after(jiffies, tx_buffer->time_stamp +
            (adapter->tx_timeout_factor * HZ)) &&
            !(rd32(IGC_STATUS) & IGC_STATUS_TXOFF) &&
            (rd32(IGC_TDH(tx_ring->reg_idx)) != readl(tx_ring->tail)) &&
            !tx_ring->oper_gate_closed) {
            /* detected Tx unit hang */
            netdev_err(tx_ring->netdev,
                       "Detected Tx Unit Hang\n"
                       " Tx Queue <%d>\n"
                       " TDH <%x>\n"
                       " TDT <%x>\n"
                       " next_to_use <%x>\n"
                       " next_to_clean <%x>\n"
                       "buffer_info[next_to_clean]\n"
                       " time_stamp <%lx>\n"
                       " next_to_watch <%p>\n"
                       " jiffies <%lx>\n"
                       " desc.status <%x>\n",
                       tx_ring->queue_index,
                       rd32(IGC_TDH(tx_ring->reg_idx)),
                       readl(tx_ring->tail),
                       tx_ring->next_to_use,
                       tx_ring->next_to_clean,
                       tx_buffer->time_stamp,
                       tx_buffer->next_to_watch,
                       jiffies,
                       tx_buffer->next_to_watch->wb.status);
            netif_stop_subqueue(tx_ring->netdev,
                                tx_ring->queue_index);

            /* we are about to reset, no point in enabling stuff */
            return true;
        }
    }

There is also a routine to reset the adapter:

    /**
     * igc_tx_timeout - Respond to a Tx Hang
     * @netdev: network interface device structure
     * @txqueue: queue number that timed out
     **/
    static void igc_tx_timeout(struct net_device *netdev,
                               unsigned int __always_unused txqueue)
    {
        struct igc_adapter *adapter = netdev_priv(netdev);
        struct igc_hw *hw = &adapter->hw;

        /* Do the reset outside of interrupt context */
        adapter->tx_timeout_count++;
        schedule_work(&adapter->reset_task);
        wr32(IGC_EICS,
             (adapter->eims_enable_mask & ~adapter->eims_other));
    }

I did not see anything to this effect in the FreeBSD igc driver module.
Intel themselves do not offer an OEM driver for FreeBSD in their Intel
Network Connections 29.1 package.

So, my theory is that there is a hardware idiosyncrasy in this Intel adapter
family which causes packet flow to stop sometimes. The Linux driver module
handles this by checking whether pending packets have gone unprocessed for a
short period.

That detection and handling would hardly be there if there were no problem,
so I take the existence of the underlying problem as a given. I suspect that
the Windows drivers contain the same handling, too, although I cannot verify
this without access to the source code. However, this would be in line with
the "micro-hangs" other users have observed under Windows.

Alas, under FreeBSD, there is no handling of this condition, which might
explain the total packet loss after it occurs.

If it were fixed in FreeBSD, it would be a great benefit for applications
like pfSense and OpnSense, since right now these adapters are essentially
unusable.
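To make the idea concrete, here is a minimal user-space sketch in C of the kind of check I have in mind. It is not driver code: the struct, field, and function names are made up for illustration and do not exist in the FreeBSD or Linux igc drivers. It only mirrors the conditions of the Linux snippet above (pending work older than a timeout, transmitter not paused, hardware head not equal to tail).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified snapshot of one transmit ring.
 * All names here are illustrative assumptions, not driver symbols. */
struct tx_ring_state {
    uint32_t hw_head;      /* index the hardware has processed up to (cf. TDH) */
    uint32_t sw_tail;      /* index the driver last queued (cf. TDT) */
    bool     work_pending; /* at least one descriptor awaits completion */
    uint64_t oldest_tick;  /* tick at which the oldest pending descriptor was queued */
    bool     tx_paused;    /* transmitter paused by flow control (cf. TXOFF) */
};

/* Hypothetical watchdog bookkeeping. */
struct tx_watchdog {
    uint64_t timeout_ticks; /* how long pending work may stay unprocessed */
    unsigned hang_count;    /* number of hangs detected so far */
};

/* Returns true when the ring looks hung: work is pending, it has been
 * pending for more than timeout_ticks, the transmitter is not paused,
 * and the hardware head has not caught up with the software tail. */
bool tx_ring_looks_hung(const struct tx_ring_state *ring,
                        uint64_t now, uint64_t timeout_ticks)
{
    assert(ring != NULL);

    if (!ring->work_pending)
        return false;
    /* Treat a clock that appears to run backwards as "not yet timed out". */
    if (now < ring->oldest_tick || now - ring->oldest_tick <= timeout_ticks)
        return false;
    if (ring->tx_paused)
        return false;
    return ring->hw_head != ring->sw_tail;
}

/* Meant to be called periodically. Returns true when the caller should
 * schedule a reset of the adapter outside of interrupt context. */
bool tx_watchdog_poll(struct tx_watchdog *wd,
                      const struct tx_ring_state *ring, uint64_t now)
{
    assert(wd != NULL);

    if (!tx_ring_looks_hung(ring, now, wd->timeout_ticks))
        return false;
    wd->hang_count++;
    return true;
}
```

A periodic timer in a real driver would presumably fill such a snapshot from register reads and trigger a reinitialisation when the poll returns true, which is roughly what bringing the interface down and up does by hand. How that would fit into the FreeBSD driver is something I have not investigated.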
A potential fix would still produce "micro-hangs" once in a while; however,
this is far better than losing the connection completely.

-- 
You are receiving this mail because:
You are the assignee for the bug.