Date:      Wed, 01 Jan 2014 16:44:57 -0500
From:      Curtis Villamizar <curtis@ipv6.occnc.com>
To:        curtis@ipv6.occnc.com
Cc:        Yonghyeon PYUN <pyunyh@gmail.com>, freebsd-stable@freebsd.org
Subject:   Re: regression: msk0 watchdog timeout and interrupt storm
Message-ID:  <201401012144.s01LivSi099164@maildrop2.v6ds.occnc.com>
In-Reply-To: Your message of "Tue, 31 Dec 2013 20:53:23 -0500." <201401010153.s011rNcm082703@maildrop2.v6ds.occnc.com>

Replying to self (and top posting).

I'm not sure if the problem is fixed or masked.

The symptom (watchdog and interrupt storm) has gone away with the
following change in if_mskreg.h:

@@ -2329,8 +2329,13 @@
  */
 #if (BUS_SPACE_MAXADDR > 0xFFFFFFFF)
 #define        MSK_64BIT_DMA
+#if 1
+#define MSK_TX_RING_CNT                256
+#define MSK_RX_RING_CNT                256
+#else
 #define MSK_TX_RING_CNT                384
 #define MSK_RX_RING_CNT                512
+#endif
 #else
 #undef MSK_64BIT_DMA
 #define MSK_TX_RING_CNT                256

This backs out a very small part of the change made to if_mskreg.h in
revision 227582.

The following is what I think is affected by this change:

	count = imin(4096, roundup2(count, 1024));
	sc->msk_stat_count = count;
	stat_sz = count * sizeof(struct msk_stat_desc);

With MSK_TX_RING_CNT and MSK_RX_RING_CNT both back at 256, count ends
up being 1024 (and stat_sz 8192 bytes).
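
The arithmetic can be checked outside the kernel. The following is a minimal sketch, not the driver code: the starting value 3 * RX + TX is an assumption (that line is not quoted above), the 8-byte descriptor size is inferred from 1024 entries giving 8192 bytes, and imin and roundup2_int are local stand-ins for the kernel helpers.

```c
#include <assert.h>

/* Assumed sizeof(struct msk_stat_desc), inferred from 1024 entries -> 8192 bytes. */
#define STAT_DESC_SIZE 8

/* Local stand-in for the kernel's imin(). */
static int imin(int a, int b)
{
    return a < b ? a : b;
}

/* Local stand-in for roundup2(); y must be a power of two. */
static int roundup2_int(int x, int y)
{
    assert(y > 0 && (y & (y - 1)) == 0);
    return (x + (y - 1)) & ~(y - 1);
}

/*
 * ASSUMPTION: the initial count is 3 * RX + TX.  Only the imin/roundup2
 * line below is taken from the code quoted in this message.
 */
static int stat_count(int rx_ring_cnt, int tx_ring_cnt)
{
    int count = 3 * rx_ring_cnt + tx_ring_cnt;

    return imin(4096, roundup2_int(count, 1024));
}
```

Under that assumption, stat_count(256, 256) gives 1024 entries (8192 bytes), while the 512/384 values would give 2048 entries.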

For me the problem is fixed/masked but I would also consider putting
the increase to MSK_TX_RING_CNT and MSK_RX_RING_CNT back and forcing
count above to be no greater than 1024 if that would help someone else
debug the problem.  I'm not sure where the 4096 came from but
replacing that with 1024 is equivalent to "count = 1024" with no math
involved.
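
To illustrate why lowering the cap makes the math irrelevant, here is a small sketch. capped_count is a hypothetical helper, and imin and roundup2_int are local stand-ins for the kernel helpers; none of this is driver code.

```c
#include <assert.h>

/* Local stand-in for the kernel's imin(). */
static int imin(int a, int b)
{
    return a < b ? a : b;
}

/* Local stand-in for roundup2(); y must be a power of two. */
static int roundup2_int(int x, int y)
{
    assert(y > 0 && (y & (y - 1)) == 0);
    return (x + (y - 1)) & ~(y - 1);
}

/* Hypothetical variant of the quoted line with the cap lowered from 4096 to 1024. */
static int capped_count(int count)
{
    /* roundup2 yields a multiple of 1024 that is >= 1024 for any count >= 1,
     * so the minimum with 1024 is always 1024. */
    return imin(1024, roundup2_int(count, 1024));
}
```

For any positive starting count the result is 1024, which is the same as assigning the constant directly.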

This does seem to me like a regression in 10.0 caused by the change to
if_mskreg.h (Nov 16).  The workaround so far has been fine for me.

Curtis


In message <201401010153.s011rNcm082703@maildrop2.v6ds.occnc.com>
Curtis Villamizar writes:
>  
> I'm getting an interrupt storm from mskc running with the latest
> if_msk.c code.  The OS is built from source (259540):
>  
> FreeBSD 10.0-PRERELEASE (GENERIC) #0 r259540: Sat Dec 21 00:05:39 EST 2013
>  
> While not the latest, the point is that sys/dev/msk is up to date wrt
> stable_9 and also wrt head.
>  
> The odd thing is that the machine seemed to run fine for a day or two
> and then started exhibiting this behaviour and has become useless.
>  
> This is now highly reproducible (it happens within seconds when trying
> to do a long file transfer between two machines with GbE) so if there
> is anything I can do to instrument this, please make suggestions.
>  
> What I know so far is:
>  
>   1.  When the watchdog occurs, Y2_IS_STAT_BMU is set in the prior
>       interrupt mask.
>  
>   2.  This would take us from msk_intr into msk_handle_events, with
>       msk_handle_events returning 0.
>  
>   3.  msk_handle_events reads in sc->msk_stat_cons.  The last recorded
>       value of sc->msk_stat_cons is always 1024.
>  
>   4.  The only way to exit msk_handle_events with sc->msk_stat_cons
>       greater than zero yet not do anything is to hit the conditional
>       at the top of the loop and fall out:
>  
>       sd = &sc->msk_stat_ring[cons];
>       control = le32toh(sd->msk_control);
>       if ((control & HW_OWNER) == 0)
>           break;
>  
>   5.  The code after the loop can return zero if the ring buffer
>       pointer hasn't moved.  That code is:
>  
>       sc->msk_stat_cons = cons;
>       bus_dmamap_sync(sc->msk_stat_tag, sc->msk_stat_map,
>           BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
>  
>       if (rxput[MSK_PORT_A] > 0)
>               msk_rxput(sc->msk_if[MSK_PORT_A]);
>       if (rxput[MSK_PORT_B] > 0)
>               msk_rxput(sc->msk_if[MSK_PORT_B]);
>  
>       return (sc->msk_stat_cons != CSR_READ_2(sc, STAT_PUT_IDX));
>  
>   6.  If the return value is zero, the interrupt isn't cleared.  That
>       was suspect.  The code in msk_intr is:
>  
>       domore = msk_handle_events(sc);
>       if ((status & Y2_IS_STAT_BMU) != 0 && domore == 0)
>               CSR_WRITE_4(sc, STAT_CTRL, SC_STAT_CLR_IRQ);
>  
>   7.  This code before the return in msk_handle_events should force
>       the clear but doesn't fix anything.
>  
>       if ((control & HW_OWNER) == 0)
>               return;
>  
> This looks like some sort of falling off the end of a ring buffer
> problem (since it always points to entry 0x400) but since I haven't
> done driver work in ages, that is mostly just a wild guess and I
> really have no idea yet as to what is going wrong.
>  
> Also please keep me on the Cc since I'm not subscribed to the list,
> though I will check the archives from time to time.
>  
> Thanks,
>  
> Curtis
>  
>  
> reference:
> http://lists.freebsd.org/pipermail/freebsd-stable/2013-November/075699.html
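
The early-exit path described in items 4 to 6 of the quoted message can be modelled in user space. This is only a sketch under stated assumptions: the ring size, the per-call work limit, the HW_OWNER value, and the step that clears the ownership bit are placeholders that do not come from the quoted code, and struct stat_ring, handle_events and irq_would_be_cleared are hypothetical names.

```c
#include <stdint.h>

#define STAT_RING_CNT 2048          /* hypothetical ring size for the model */
#define HW_OWNER      0x80000000u   /* stand-in for the driver's ownership bit */
#define PROCESS_LIMIT 128           /* hypothetical per-call work limit */

struct stat_ring {
    uint32_t control[STAT_RING_CNT]; /* models sd->msk_control per entry */
    int      cons;                   /* models sc->msk_stat_cons */
    int      put;                    /* models the STAT_PUT_IDX register */
};

/* Returns nonzero ("domore") when the consumer index has not caught up. */
static int handle_events(struct stat_ring *r)
{
    int cons = r->cons;
    int done = 0;

    while (done < PROCESS_LIMIT) {
        uint32_t control = r->control[cons];

        if ((control & HW_OWNER) == 0)
            break;                               /* entry not handed to software */
        r->control[cons] = control & ~HW_OWNER;  /* assumed: hand the entry back */
        cons = (cons + 1) % STAT_RING_CNT;       /* wrap at the end of the ring */
        done++;
    }
    r->cons = cons;
    return (r->cons != r->put);
}

/* Models the quoted msk_intr test: the IRQ is cleared only when domore == 0. */
static int irq_would_be_cleared(struct stat_ring *r, int stat_bmu_set)
{
    int domore = handle_events(r);

    return (stat_bmu_set && domore == 0);
}
```

With cons and put both at 1024 and no owned entry at that index, handle_events consumes nothing and returns 0, which matches the observed symptom of the consumer index sitting at 1024.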


