Date:      Thu, 23 May 2024 09:12:16 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 279245] igc(4) I226 (and I225) hangups
Message-ID:  <bug-279245-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279245

            Bug ID: 279245
           Summary: igc(4) I226 (and I225) hangups
           Product: Base System
           Version: 13.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: freebsd_email@congenio.de

When using an I226 under OpnSense (FreeBSD 13.2-RELEASE kernel - I also tried
FreeBSD 14.0-RELEASE), I experience connection hangups about once per day
with no identifiable trigger (at most 3 times within one hour; at other
times none in three days).

This problem manifests as a dead connection (no packets are received, none
are sent), but the low-level counters (dev.igc.0.mac_stats) still increase.
The condition can be cleared by bringing the interface down and up again or
by briefly disconnecting the cable.

There are reports on this and other related problems all over the internet
for different OSes, see:

Windows:
https://forums.evga.com/PSA-Intel-I226V-25GbE-on-Raptor-Lake-Motherboards-Has-a-Connection-Drop-Issue-No-Fix-m3595279.aspx
OpnSense (FreeBSD):
https://forum.opnsense.org/index.php?topic=40404.msg199288#msg199288
pfSense (FreeBSD):
https://forum.netgate.com/topic/181571/chinese-i226-v-on-23-05-1-problems

My specific variant is an I226-V, rev.4, built into a Minisforum MS-01:

igc0@pci0:87:0:0:       class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086
device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet


However, there are reports of the I226-LM connected to the same machine
showing the same behaviour, see:
https://forum.opnsense.org/index.php?topic=40556

igc1@pci0:88:0:0:       class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086
device=0x125b subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-LM'
    class      = network
    subclass   = ethernet

This seems to indicate that at least the I226 family (which is a successor
to the problem-ridden I225 using the same driver module) is affected by this
problem.
I tried all possible settings I could think of to make this go away, such as
reducing the speed from 2.5 to 1 Gbps and disabling EEE (which is off by
default anyway), to no avail.

Interestingly, the Minisforum MS-01 has gained much interest in the last few
months and there was a specific review on YouTube where the creator states
in a comment that he is not seeing this problem
(https://www.youtube.com/watch?v=_wgX1sDab-M). However, he uses OpnSense
under a Proxmox hypervisor, thus using the Linux driver modules (OpnSense
itself uses the virtualized virtio NICs).

This and the reports of gamers stating they had "micro-hangs" manifesting as
short lags in online games got me thinking.
So I compared the Linux and FreeBSD drivers and found that the Linux driver
has a specific routine to catch, log and clear "TX hang" conditions, see
from line 3150 here:
https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/igc/igc_main.c,
which reads:

        if (test_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
                struct igc_hw *hw = &adapter->hw;

                /* Detect a transmit hang in hardware, this serializes the
                * check with the clearing of time_stamp and movement of i
                */
                clear_bit(IGC_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
                if (tx_buffer->next_to_watch &&
                    time_after(jiffies, tx_buffer->time_stamp +
                    (adapter->tx_timeout_factor * HZ)) &&
                    !(rd32(IGC_STATUS) & IGC_STATUS_TXOFF) &&
                    (rd32(IGC_TDH(tx_ring->reg_idx)) != readl(tx_ring->tail)) &&
                    !tx_ring->oper_gate_closed) {
                        /* detected Tx unit hang */
                        netdev_err(tx_ring->netdev,
                                   "Detected Tx Unit Hang\n"
                                   "  Tx Queue             <%d>\n"
                                   "  TDH                  <%x>\n"
                                   "  TDT                  <%x>\n"
                                   "  next_to_use          <%x>\n"
                                   "  next_to_clean        <%x>\n"
                                   "buffer_info[next_to_clean]\n"
                                   "  time_stamp           <%lx>\n"
                                   "  next_to_watch        <%p>\n"
                                   "  jiffies              <%lx>\n"
                                   "  desc.status          <%x>\n",
                                   tx_ring->queue_index,
                                   rd32(IGC_TDH(tx_ring->reg_idx)),
                                   readl(tx_ring->tail),
                                   tx_ring->next_to_use,
                                   tx_ring->next_to_clean,
                                   tx_buffer->time_stamp,
                                   tx_buffer->next_to_watch,
                                   jiffies,
                                   tx_buffer->next_to_watch->wb.status);
                        netif_stop_subqueue(tx_ring->netdev,
                                            tx_ring->queue_index);

                        /* we are about to reset, no point in enabling stuff */
                        return true;
                }
        }

There is also a routine to reset the adapter:

/**
 * igc_tx_timeout - Respond to a Tx Hang
 * @netdev: network interface device structure
 * @txqueue: queue number that timed out
 **/
static void igc_tx_timeout(struct net_device *netdev,
                           unsigned int __always_unused txqueue)
{
        struct igc_adapter *adapter = netdev_priv(netdev);
        struct igc_hw *hw = &adapter->hw;

        /* Do the reset outside of interrupt context */
        adapter->tx_timeout_count++;
        schedule_work(&adapter->reset_task);
        wr32(IGC_EICS,
             (adapter->eims_enable_mask & ~adapter->eims_other));
}

I did not see anything to this effect in the FreeBSD igc driver module.

Intel themselves do not offer an OEM driver for FreeBSD in their Intel
Network Connections 29.1 package.

So, my theory is that there is a hardware idiosyncrasy in this Intel adapter
family which causes packet flow to stop sometimes.
This is handled in the Linux driver module by checking whether no packets
have been processed for a short period.
That detection and handling would not be there if there were no problem, so
we can take this for a fact.

I suspect that the same handling is contained in the Windows drivers, too -
which I cannot ascertain because I cannot look at the source code.
However, this would be in line with the observed "micro-hangs" under Windows
from other users.

Alas, under FreeBSD, there is no handling of this condition, which might
explain the total packet loss after it occurs.
If it were fixed in FreeBSD, it would be a great benefit for applications
like pfSense and OpnSense, since right now these adapters are essentially
unusable.
A potential fix would still produce "micro-hangs" once in a while; however,
this is far better than losing the connection completely.

-- 
You are receiving this mail because:
You are the assignee for the bug.


