Date: Sat, 13 Sep 2025 13:28:04 +0000
From: R Tyler Croy <rtyler@brokenco.de>
To: Kevin Bowling <kevin.bowling@kev009.com>
Cc: Karl Denninger <karl@denninger.net>, freebsd-net@freebsd.org
Subject: Re: Intermittent failure of routing/gateway with ix(4) (x86_64)
Message-ID: <s8ozURoM0GrRaAOlHq_ZvZtl_4GPgZ41D2X7GF90B_ySAjTNIcV198tMBD649LGjyNdhozMziNVENp4twgEHtuh2zxdcH03tgFUvWpr8dDY=@brokenco.de>
In-Reply-To: <CAK7dMtA6Sab8GJzZ5MS5U=V_q3es2da8=0ykjkL158HX_uNCtA@mail.gmail.com>
References: <6HcKz3OxJSnjZdoCr4I0mksk9RemKJflHVgnkYI8-FydM4mDDldzTm8qthQ0iJCftaETOVeQFde4fz5i1703B8Gd2ZBvPmBwF_MMnhuJ8VM=@brokenco.de>
 <7f463f6f-3a6e-47db-aab8-7817cc6192a5@denninger.net>
 <sSicWwTvxOBE6MKtGEVz2jjuAVg4HK5nhZtn1KCEnBNhAyDNlVzWQFXPgAcaj2tDBDKSLQ7RHU3QXeaUbuQVkR4MWcTtiBBejX0k69IYFjA=@brokenco.de>
 <CAK7dMtAMCCGrpJt4=RDxsBdit-phDLtzF3V=dEMm%2BeQNbUJQvw@mail.gmail.com>
 <MoHSr-TloFU0n1vS7a8AekF0A5eDUQyr8JiDvjG5JAWtykKU6KdVx1i8E_ub5LydXacoNgKiaBj2Jq6h4j09L06ARlcgIfoVH9nTnx9Jtxw=@brokenco.de>
 <CAK7dMtA6Sab8GJzZ5MS5U=V_q3es2da8=0ykjkL158HX_uNCtA@mail.gmail.com>

(replies inline)

On Saturday, August 30th, 2025 at 10:45 AM, Kevin Bowling <kevin.bowling@kev009.com> wrote:

>
> You've got an assortment of MAC level errors going on in the 'after':
> +dev.ix.1.mac_stats.checksum_errs: 137
> +dev.ix.1.mac_stats.rx_missed_packets: 930676
> +dev.ix.1.mac_stats.rx_errs: 930676
> +dev.ix.0.mac_stats.checksum_errs: 45
> +dev.ix.0.mac_stats.local_faults: 563829
> +dev.ix.0.mac_stats.short_discards: 6
> +dev.ix.0.mac_stats.byte_errs: 6
> +dev.ix.0.mac_stats.ill_errs: 6
>
> I would be surprised if it is not a hardware issue given the MAC
> errors on both ports.
>
> It's been a minute since I looked at this but IIRC the thermal diode
> is somehow botched somewhere in the ix family so we don't get a
> notification in software if the PHY over temps. But it's also
> possible yours is already cooked or some other issue. A potentially
> useful hint, the X550 uses a lot less power and produces less heat.

I have modified the case and added a lot more ventilation pushing specifically over the two NICs which seems to have helped, maybe? It's such an intermittent issue that it's hard to tell whether something has changed due to my actions, lunar alignment, relative humidity, etc.

Here's another snippet from before/after a stall that I observed overnight here.

-dev.ix.1.mac_stats.checksum_errs: 0
+dev.ix.1.mac_stats.checksum_errs: 100
 dev.ix.1.mac_stats.rx_errs: 0
 dev.ix.1.queue1.interrupt_rate: 31250
 dev.ix.1.queue0.interrupt_rate: 31250
-dev.ix.0.mac_stats.checksum_errs: 1
+dev.ix.0.mac_stats.checksum_errs: 373
 dev.ix.0.mac_stats.rec_len_errs: 0
-dev.ix.0.mac_stats.byte_errs: 0
-dev.ix.0.mac_stats.ill_errs: 0
+dev.ix.0.mac_stats.byte_errs: 3
+dev.ix.0.mac_stats.ill_errs: 3
 dev.ix.0.mac_stats.crc_errs: 0
-dev.ix.0.mac_stats.rx_errs: 0
+dev.ix.0.mac_stats.rx_errs: 15284
 dev.ix.0.queue1.interrupt_rate: 31250
 dev.ix.0.queue0.interrupt_rate: 31250

With the "cooked" comment I'm wondering if you might think these NICs are irreparably damaged and thus no amount of fan-fiddling will improve the situation, assuming it's thermal? The other interesting variable I have observed is that since making cooling modifications and relocating in the rack, the stalls _seem_ to be occurring overnight, sometime now between 3-5am local.

I have set up a periodic cron to snapshot the `sysctl dev.ix` to see if I can observe any other patterns with the data.

I purchased these in a lot with two other NICs that are in other machines in the same rack but those do _not_ act as gateways. Those two devices have performed to expectations albeit in wildly different chassis.
If 10GigE NICs were cheaper I would toss these in the bin and get some different cards, but I'm rather motivated to ensure there's not a software/configuration issue here first :)

Cheers
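
For the periodic `sysctl dev.ix` snapshots mentioned above, one way to look for patterns is to compare two saved dumps and list only the counters that moved. What follows is a minimal Python sketch under stated assumptions, not a description of the actual cron job: it assumes each snapshot is a plain-text file in the usual `name: value` form (for example, produced by `sysctl dev.ix > FILE`). The function names `parse_snapshot` and `changed_counters`, and the choice to filter on `.mac_stats.`, are illustrative and based only on the counter names quoted in this thread.

```python
#!/usr/bin/env python3
"""Compare two saved `sysctl dev.ix` snapshots and print counters that changed."""
import sys


def parse_snapshot(text):
    """Parse 'name: value' lines into a dict, keeping only integer-valued entries."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # not a 'name: value' line
        try:
            counters[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip non-numeric values such as description strings
    return counters


def changed_counters(before, after, only_mac_stats=True):
    """Return {name: (old, new)} for counters present in both snapshots that differ.

    Assumption: the interesting error counters live under '.mac_stats.',
    as in the snippets quoted in this thread.
    """
    changes = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None or old == new:
            continue
        if only_mac_stats and ".mac_stats." not in name:
            continue
        changes[name] = (old, new)
    return changes


# Hypothetical usage: python3 ixdiff.py before.txt after.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        before = parse_snapshot(f.read())
    with open(sys.argv[2]) as f:
        after = parse_snapshot(f.read())
    for name, (old, new) in sorted(changed_counters(before, after).items()):
        print(f"{name}: {old} -> {new} ({new - old:+d})")
```

The script and file names (`ixdiff.py`, `before.txt`, `after.txt`) are hypothetical. Running it over consecutive snapshots would show which MAC counters jump around a stall, and timestamps in the snapshot file names could then be lined up against the 3-5am window.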