Date:      Sat, 13 Sep 2025 13:28:04 +0000
From:      R Tyler Croy <rtyler@brokenco.de>
To:        Kevin Bowling <kevin.bowling@kev009.com>
Cc:        Karl Denninger <karl@denninger.net>, freebsd-net@freebsd.org
Subject:   Re: Intermittent failure of routing/gateway with ix(4) (x86_64)
Message-ID:  <s8ozURoM0GrRaAOlHq_ZvZtl_4GPgZ41D2X7GF90B_ySAjTNIcV198tMBD649LGjyNdhozMziNVENp4twgEHtuh2zxdcH03tgFUvWpr8dDY=@brokenco.de>
In-Reply-To: <CAK7dMtA6Sab8GJzZ5MS5U=V_q3es2da8=0ykjkL158HX_uNCtA@mail.gmail.com>
References:  <6HcKz3OxJSnjZdoCr4I0mksk9RemKJflHVgnkYI8-FydM4mDDldzTm8qthQ0iJCftaETOVeQFde4fz5i1703B8Gd2ZBvPmBwF_MMnhuJ8VM=@brokenco.de> <7f463f6f-3a6e-47db-aab8-7817cc6192a5@denninger.net> <sSicWwTvxOBE6MKtGEVz2jjuAVg4HK5nhZtn1KCEnBNhAyDNlVzWQFXPgAcaj2tDBDKSLQ7RHU3QXeaUbuQVkR4MWcTtiBBejX0k69IYFjA=@brokenco.de> <CAK7dMtAMCCGrpJt4=RDxsBdit-phDLtzF3V=dEMm%2BeQNbUJQvw@mail.gmail.com> <MoHSr-TloFU0n1vS7a8AekF0A5eDUQyr8JiDvjG5JAWtykKU6KdVx1i8E_ub5LydXacoNgKiaBj2Jq6h4j09L06ARlcgIfoVH9nTnx9Jtxw=@brokenco.de> <CAK7dMtA6Sab8GJzZ5MS5U=V_q3es2da8=0ykjkL158HX_uNCtA@mail.gmail.com>

(replies inline)


On Saturday, August 30th, 2025 at 10:45 AM, Kevin Bowling <kevin.bowling@kev009.com> wrote:

>
> You've got an assortment of MAC level errors going on in the 'after':
> +dev.ix.1.mac_stats.checksum_errs: 137
> +dev.ix.1.mac_stats.rx_missed_packets: 930676
> +dev.ix.1.mac_stats.rx_errs: 930676
> +dev.ix.0.mac_stats.checksum_errs: 45
> +dev.ix.0.mac_stats.local_faults: 563829
> +dev.ix.0.mac_stats.short_discards: 6
> +dev.ix.0.mac_stats.byte_errs: 6
> +dev.ix.0.mac_stats.ill_errs: 6
>
> I would be surprised if it is not a hardware issue given the MAC
> errors on both ports.
>
> It's been a minute since I looked at this but IIRC the thermal diode
> is somehow botched somewhere in the ix family so we don't get a
> notification in software if the PHY over temps. But it's also
> possible yours is already cooked or some other issue. A potentially
> useful hint, the X550 uses a lot less power and produces less heat.


I have modified the case and added a lot more ventilation, pushing air
specifically over the two NICs, which seems to have helped, maybe? It's
such an intermittent issue that it's hard to tell whether something has
changed due to my actions, lunar alignment, relative humidity, etc.



Here's another snippet from before/after a stall that I observed overnight
here.

-dev.ix.1.mac_stats.checksum_errs: 0
+dev.ix.1.mac_stats.checksum_errs: 100
 dev.ix.1.mac_stats.rx_errs: 0
 dev.ix.1.queue1.interrupt_rate: 31250
 dev.ix.1.queue0.interrupt_rate: 31250
-dev.ix.0.mac_stats.checksum_errs: 1
+dev.ix.0.mac_stats.checksum_errs: 373
 dev.ix.0.mac_stats.rec_len_errs: 0
-dev.ix.0.mac_stats.byte_errs: 0
-dev.ix.0.mac_stats.ill_errs: 0
+dev.ix.0.mac_stats.byte_errs: 3
+dev.ix.0.mac_stats.ill_errs: 3
 dev.ix.0.mac_stats.crc_errs: 0
-dev.ix.0.mac_stats.rx_errs: 0
+dev.ix.0.mac_stats.rx_errs: 15284
 dev.ix.0.queue1.interrupt_rate: 31250
 dev.ix.0.queue0.interrupt_rate: 31250


With the "cooked" comment, I'm wondering whether you think these NICs are
irreparably damaged and thus no amount of fan-fiddling will improve the
situation, assuming it's thermal? The other interesting variable I have
observed is that since making the cooling modifications and relocating the
machine in the rack, the stalls _seem_ to be occurring overnight, now
sometime between 3 and 5 am local.



I have set up a periodic cron to snapshot the `sysctl dev.ix` to see if I
can observe any other patterns with the data.
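
A minimal sketch of how two such snapshots could be compared, assuming each
snapshot is simply the saved text output of `sysctl dev.ix`. The function
names `parse_sysctl` and `counter_deltas` are hypothetical helpers made up
for illustration, not part of any FreeBSD tool, and the choice to look only
at `mac_stats` counters is likewise an assumption.

```python
import re

# Matches integer-valued lines such as "dev.ix.0.mac_stats.rx_errs: 15284".
# Non-numeric OIDs (descriptions, version strings) are deliberately skipped.
_COUNTER_LINE = re.compile(r"^(dev\.ix\.\d+\.\S+):\s*(-?\d+)\s*$")


def parse_sysctl(text):
    """Return {oid: value} for every integer-valued dev.ix line in text."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after):
    """Return {oid: after - before} for mac_stats counters that changed.

    An OID missing from the earlier snapshot is treated as starting at 0
    (an assumption made here for simplicity).
    """
    deltas = {}
    for oid, new_value in after.items():
        if ".mac_stats." not in oid:
            continue  # ignore interrupt rates and other non-MAC OIDs
        change = new_value - before.get(oid, 0)
        if change != 0:
            deltas[oid] = change
    return deltas
```

Feeding it the text of two snapshot files taken around a stall (read with
ordinary file I/O) would yield only the MAC counters that moved, which may
make it easier to line the jumps up against the time of day.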

I purchased these in a lot with two other NICs that are in other machines
in the same rack, but those do _not_ act as gateways. Those two devices
have performed to expectations, albeit in wildly different chassis. If
10GigE NICs were cheaper I would toss these in the bin and get some
different cards, but I'm rather motivated to ensure there's not a
software/configuration issue here first :)


Cheers






Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?s8ozURoM0GrRaAOlHq_ZvZtl_4GPgZ41D2X7GF90B_ySAjTNIcV198tMBD649LGjyNdhozMziNVENp4twgEHtuh2zxdcH03tgFUvWpr8dDY=>