Date: Tue, 31 Dec 2013 20:53:23 -0500
From: Curtis Villamizar <curtis@ipv6.occnc.com>
To: Yonghyeon PYUN <pyunyh@gmail.com>
Cc: freebsd-stable@freebsd.org, curtis@ipv6.occnc.com
Subject: regression: msk0 watchdog timeout and interrupt storm
Message-ID: <201401010153.s011rNcm082703@maildrop2.v6ds.occnc.com>

I'm getting an interrupt storm from mskc running with the latest
if_msk.c code. The OS is built from source (259540):

  FreeBSD 10.0-PRERELEASE (GENERIC) #0 r259540: Sat Dec 21 00:05:39 EST 2013

While not the latest, the point is that sys/dev/msk is up to date wrt
stable_9 and also wrt head.

The odd thing is that the machine seemed to run fine for a day or two
and then started exhibiting this behaviour and has become useless.

This is now highly reproducible (it happens within seconds when trying
to do a long file transfer between two machines with GbE), so if there
is anything I can do to instrument this, please make suggestions.

What I know so far is:

  1.  When the watchdog occurs, Y2_IS_STAT_BMU is set in the prior
      interrupt mask.

  2.  This would put us in from msk_intr into msk_handle_events, with
      msk_handle_events returning 0.

  3.  msk_handle_events reads in sc->msk_stat_cons. The last recorded
      value of sc->msk_stat_cons is always 1024.

  4.  The only way to exit msk_handle_events with sc->msk_stat_cons
      greater than zero yet not do anything is to hit the conditional
      at the top of the loop and fall out:

          sd = &sc->msk_stat_ring[cons];
          control = le32toh(sd->msk_control);
          if ((control & HW_OWNER) == 0)
                  break;

  5.  The code after the loop can return zero if the ring buffer
      pointer hasn't moved. That code is:

          sc->msk_stat_cons = cons;
          bus_dmamap_sync(sc->msk_stat_tag, sc->msk_stat_map,
              BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);

          if (rxput[MSK_PORT_A] > 0)
                  msk_rxput(sc->msk_if[MSK_PORT_A]);
          if (rxput[MSK_PORT_B] > 0)
                  msk_rxput(sc->msk_if[MSK_PORT_B]);

          return (sc->msk_stat_cons != CSR_READ_2(sc, STAT_PUT_IDX));

  6.  If the return value is zero, the interrupt isn't cleared. That
      was suspect. The code in msk_intr is:

          domore = msk_handle_events(sc);
          if ((status & Y2_IS_STAT_BMU) != 0 && domore == 0)
                  CSR_WRITE_4(sc, STAT_CTRL, SC_STAT_CLR_IRQ);

  7.  This code before the return in msk_handle_events should force
      the clear but doesn't fix anything.

          if ((control & HW_OWNER) == 0)
                  return;

This looks like some sort of fall-off-the-end-of-a-ring-buffer type of
problem (since it always points to entry 0x400), but since I haven't
done driver work in ages, that is mostly just a wild guess and I
really have no idea yet as to what is going wrong.

Also please keep me on the Cc since I'm not subscribed to the list,
though I will check the archives from time to time.

Thanks,

Curtis

reference:
http://lists.freebsd.org/pipermail/freebsd-stable/2013-November/075699.html
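
The early exit described in points 4 and 5 can be modelled outside the
kernel. What follows is a minimal sketch, not the driver code: every
sketch_ name, the ring size of 2048 and the owner-bit value are
assumptions made for illustration, the put index is a plain field
standing in for CSR_READ_2(sc, STAT_PUT_IDX), and clearing the owner
bit on each consumed entry is also an assumption about how entries are
handed back.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative stand-ins only. These are NOT the definitions from
 * if_mskreg.h; the values are assumptions chosen for the sketch.
 */
#define SKETCH_RING_CNT 2048u        /* hypothetical status ring size */
#define SKETCH_HW_OWNER 0x80000000u  /* hypothetical owner bit */

struct sketch_stat_desc {
	uint32_t control;
};

struct sketch_softc {
	struct sketch_stat_desc ring[SKETCH_RING_CNT];
	unsigned int stat_cons; /* consumer index kept by the driver */
	unsigned int put_idx;   /* stand-in for the STAT_PUT_IDX register */
};

/*
 * Consume entries starting at stat_cons for as long as the owner bit is
 * set. Stores the number of consumed entries in *processed and returns
 * nonzero when the consumer index differs from the put index.
 */
static int
sketch_handle_events(struct sketch_softc *sc, unsigned int *processed)
{
	unsigned int cons = sc->stat_cons;

	assert(cons < SKETCH_RING_CNT);
	*processed = 0;
	for (;;) {
		struct sketch_stat_desc *sd = &sc->ring[cons];

		/* Early exit from point 4: entry not owned, nothing to do. */
		if ((sd->control & SKETCH_HW_OWNER) == 0)
			break;
		/* Assumed hand-back: drop the owner bit, keep other bits. */
		sd->control &= ~SKETCH_HW_OWNER;
		(*processed)++;
		cons = (cons + 1) % SKETCH_RING_CNT;
	}
	sc->stat_cons = cons;
	/* Same shape as the return statement quoted in point 5. */
	return (sc->stat_cons != sc->put_idx);
}
```

With stat_cons and put_idx both at 1024 and no owner bit set at entry
1024, the function consumes nothing and returns 0, which is the state
described in points 2 and 3. If put_idx has moved on but entry 1024
still lacks the owner bit, it again consumes nothing but returns
nonzero.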