From owner-freebsd-stable@FreeBSD.ORG Wed Jan 1 01:53:26 2014
Message-Id: <201401010153.s011rNcm082703@maildrop2.v6ds.occnc.com>
To: Yonghyeon PYUN
Subject: regression: msk0 watchdog timeout and interrupt storm
From: Curtis Villamizar
Date: Tue, 31 Dec 2013 20:53:23 -0500
Cc: freebsd-stable@freebsd.org, curtis@ipv6.occnc.com
Reply-To: curtis@ipv6.occnc.com
List-Id: Production branch of FreeBSD source code

I'm getting an interrupt storm from mskc running with the latest
if_msk.c code.

The OS is built from source (259540):

    FreeBSD 10.0-PRERELEASE (GENERIC) #0 r259540: Sat Dec 21 00:05:39 EST 2013

While that is not the latest revision, the point is that sys/dev/msk is
up to date wrt stable_9 and also wrt head.

The odd thing is that the machine seemed to run fine for a day or two
and then started exhibiting this behaviour and has become useless.
This is now highly reproducible (it happens within seconds when trying
to do a long file transfer between two machines with GbE), so if there
is anything I can do to instrument this, please make suggestions.

What I know so far is:

1. When the watchdog occurs, Y2_IS_STAT_BMU is set in the prior
   interrupt mask.

2. This would take us from msk_intr into msk_handle_events, with
   msk_handle_events returning 0.

3. msk_handle_events reads in sc->msk_stat_cons. The last recorded
   value of sc->msk_stat_cons is always 1024.

4. The only way to exit msk_handle_events with sc->msk_stat_cons
   greater than zero yet not do anything is to hit the conditional at
   the top of the loop and fall out:

       sd = &sc->msk_stat_ring[cons];
       control = le32toh(sd->msk_control);
       if ((control & HW_OWNER) == 0)
               break;

5. The code after the loop can return zero if the ring buffer pointer
   hasn't moved. That code is:

       sc->msk_stat_cons = cons;
       bus_dmamap_sync(sc->msk_stat_tag, sc->msk_stat_map,
           BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);

       if (rxput[MSK_PORT_A] > 0)
               msk_rxput(sc->msk_if[MSK_PORT_A]);
       if (rxput[MSK_PORT_B] > 0)
               msk_rxput(sc->msk_if[MSK_PORT_B]);

       return (sc->msk_stat_cons != CSR_READ_2(sc, STAT_PUT_IDX));

6. If the return value is zero, the interrupt isn't cleared. That was
   suspect. The code in msk_intr is:

       domore = msk_handle_events(sc);
       if ((status & Y2_IS_STAT_BMU) != 0 && domore == 0)
               CSR_WRITE_4(sc, STAT_CTRL, SC_STAT_CLR_IRQ);

7. This code before the return in msk_handle_events should force the
   clear but doesn't fix anything.

       if ((control & HW_OWNER) == 0)
               return;

This looks like some sort of "fall off the end of a ring buffer" type of
problem (since it always points to entry 0x400), but since I haven't
done driver work in ages, that is mostly just a wild guess and I really
have no idea yet as to what is going wrong.

Also, please keep me on the Cc since I'm not subscribed to the list,
though I will check the archives from time to time.
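
The control flow in items 4 to 6 can be modelled in user space to see which path writes the interrupt clear. What follows is only a rough sketch, not driver code: model_softc, model_desc, model_handle_events, model_intr and MODEL_RING_CNT are hypothetical names, the HW_OWNER value is assumed, and the wrap-around increment of the consumer index is an assumption about how the real loop advances. The model does not try to reproduce the 1024 value; it only shows how the quoted test on domore behaves when the entry at cons is not hardware-owned.

```c
#include <assert.h>
#include <stdint.h>

#define HW_OWNER       0x80000000u /* assumed value, for this model only */
#define MODEL_RING_CNT 8u          /* hypothetical ring size, not the driver's */

struct model_desc {
	uint32_t control;
};

struct model_softc {
	struct model_desc ring[MODEL_RING_CNT];
	uint32_t stat_cons;  /* software consumer index */
	uint32_t put_idx;    /* stand-in for CSR_READ_2(sc, STAT_PUT_IDX) */
	unsigned handled;    /* entries consumed so far */
	unsigned irq_clears; /* stand-in for SC_STAT_CLR_IRQ writes */
};

/*
 * Same shape as the quoted loop: stop at the first entry that is not
 * hardware-owned, then report whether the put index is still ahead.
 */
static int
model_handle_events(struct model_softc *sc)
{
	uint32_t cons = sc->stat_cons;

	/* The model requires a valid starting index. */
	assert(cons < MODEL_RING_CNT);

	for (;;) {
		struct model_desc *sd = &sc->ring[cons];

		if ((sd->control & HW_OWNER) == 0)
			break;
		sd->control &= ~HW_OWNER;           /* hand entry back */
		sc->handled++;
		cons = (cons + 1) % MODEL_RING_CNT; /* assumed wrap-around */
	}
	sc->stat_cons = cons;
	return (sc->stat_cons != sc->put_idx);
}

/*
 * Same test as the quoted msk_intr fragment, assuming the status bit
 * is already set: the clear is written only when domore is 0.
 */
static int
model_intr(struct model_softc *sc)
{
	int domore = model_handle_events(sc);

	if (domore == 0)
		sc->irq_clears++;
	return (domore);
}
```

In this model, if the entry at stat_cons is not hardware-owned while put_idx differs from stat_cons, model_intr consumes nothing, returns nonzero, and never records a clear, no matter how many times it is called. Whether the real hardware can get into that state is exactly what I can't tell yet.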
Thanks,

Curtis

reference:
http://lists.freebsd.org/pipermail/freebsd-stable/2013-November/075699.html