Date:      Tue, 9 May 2017 11:09:12 +0100
From:      Roger Pau Monné <royger@FreeBSD.org>
To:        Colin Percival <cperciva@tarsnap.com>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r301198 - head/sys/dev/xen/netfront
Message-ID:  <20170509100912.h3ylwugahvfi5cc2@dhcp-3-128.uk.xensource.com>
In-Reply-To: <0100015bccba6abc-4c3b1487-25e3-4640-8221-885341ece829-000000@email.amazonses.com>
References:  <201606021116.u52BGajD047287@repo.freebsd.org> <0100015bccba6abc-4c3b1487-25e3-4640-8221-885341ece829-000000@email.amazonses.com>

On Wed, May 03, 2017 at 05:13:40AM +0000, Colin Percival wrote:
> On 06/02/16 04:16, Roger Pau Monné wrote:
> > Author: royger
> > Date: Thu Jun  2 11:16:35 2016
> > New Revision: 301198
> > URL: https://svnweb.freebsd.org/changeset/base/301198
> 
> I think this commit is responsible for panics I'm seeing in EC2 on T2 family
> instances.  Every time a DHCP request is made, we call into xn_ifinit_locked
> (not sure why -- something to do with making the interface promiscuous?) and
> hit this code
> 
> > @@ -1760,7 +1715,7 @@ xn_ifinit_locked(struct netfront_info *n
> >  		xn_alloc_rx_buffers(rxq);
> >  		rxq->ring.sring->rsp_event = rxq->ring.rsp_cons + 1;
> >  		if (RING_HAS_UNCONSUMED_RESPONSES(&rxq->ring))
> > -			taskqueue_enqueue(rxq->tq, &rxq->intrtask);
> > +			xn_rxeof(rxq);
> >  		XN_RX_UNLOCK(rxq);
> >  	}
> 
> but under high traffic volumes I think a separate thread can already be
> running in xn_rxeof, having dropped the RX lock while it passes a packet
> up the stack.  This would result in two different threads trying to process
> the same set of responses from the ring, with (unsurprisingly) bad results.

Hm, right, xn_rxeof drops the lock while pushing the packet up the stack.
There's an "XXX" comment right above that. Could you try removing the lock
dropping and see what happens?

I'm not sure there's any reason to drop the lock here; I very much doubt
if_input is going to sleep.
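
To make the failure mode concrete, here is a minimal, single-threaded toy
model in C of a ring consumer that releases its lock while delivering each
packet. It is a sketch under assumptions, not the netfront code: toy_ring,
toy_rxeof, toy_second_consumer and toy_run are hypothetical names, the lock is
a plain boolean, the second consumer is simulated by a callback invoked at the
delivery point, and the exact place where the real driver advances rsp_cons
may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8

/* Hypothetical stand-in for an RX queue; not the real netfront structures. */
struct toy_ring {
    unsigned rsp_prod;    /* responses published by the backend */
    unsigned rsp_cons;    /* responses consumed by the frontend */
    bool locked;          /* stand-in for the RX mutex */
    int seen[RING_SIZE];  /* how many times each slot was processed */
};

typedef void (*deliver_fn)(struct toy_ring *);

/* Consume all pending responses. The caller must hold the lock. */
static void toy_rxeof(struct toy_ring *r, bool drop_lock, deliver_fn deliver)
{
    assert(r->locked);
    while (r->rsp_cons != r->rsp_prod) {
        unsigned i = r->rsp_cons;      /* snapshot of the consumer index */
        r->seen[i % RING_SIZE]++;
        if (drop_lock)
            r->locked = false;         /* lock released around delivery */
        if (deliver != NULL)
            deliver(r);                /* "pass the packet up the stack" */
        r->locked = true;              /* lock re-acquired (or still held) */
        r->rsp_cons = i + 1;           /* advance from the stale snapshot */
    }
}

/* Simulated second consumer, e.g. a reinit path that also drains the ring. */
static void toy_second_consumer(struct toy_ring *r)
{
    if (r->locked)
        return;                        /* a real thread would block here */
    r->locked = true;
    toy_rxeof(r, false, NULL);
    r->locked = false;
}

/* Returns the largest number of times any single response was processed. */
static int toy_run(bool drop_lock)
{
    struct toy_ring r = { .rsp_prod = 3 };
    int max = 0;

    r.locked = true;
    toy_rxeof(&r, drop_lock, toy_second_consumer);
    r.locked = false;
    toy_second_consumer(&r);           /* the blocked consumer finally runs */

    for (size_t s = 0; s < RING_SIZE; s++)
        if (r.seen[s] > max)
            max = r.seen[s];
    return max;
}
```

In this model, toy_run(true) reports that some response was handled more than
once, because the second consumer starts from the same rsp_cons. With
toy_run(false) the lock stays held across delivery and each response is
handled exactly once.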

> I'm not 100% sure that this is what's causing the panic, but it's definitely
> happening under high traffic conditions immediately after xn_ifinit_locked is
> called, so I think my speculation is well-founded.
> 
> There are a few things I don't understand here:
> 1. Why DHCP requests are resulting in calls into xn_ifinit_locked.

Maybe DHCP flaps the interface up and down? TBH I have no idea.
Enabling/disabling certain features (CSUM, TSO) will also cause the interface
to reconnect, which might cause incoming packets to get stuck in the RX ring.

> 2. Why the calls into xn_ifinit_locked are only happening on T2 instances
> and not on any of the other EC2 instances I've tried.

Maybe T2 instances are in a noisier environment? I'm afraid I have no idea
about that one either.

> 3. Why xn_ifinit_locked is consuming ring responses.

There might be pending RX packets on the ring, so netfront consumes them and
signals netback. In the unlikely event that the RX ring was full when
xn_ifinit_locked is called, not doing this would mean the RX queue would get
stuck forever, since there's no guarantee netfront will receive event channel
notifications.
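
The reasoning above can be sketched with a toy model of the response side of a
shared ring. This is an illustration under assumptions, not the driver or the
Xen headers: the toy_* names are hypothetical, memory barriers are omitted,
and the notify test is modeled on the event-index comparison that the Xen
shared-ring macros use, as far as I recall it. The scenario also assumes that
the event raised while the interface was being reinitialized is lost or
ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t toy_idx;

struct toy_sring {
    toy_idx rsp_prod;   /* written by the backend */
    toy_idx rsp_event;  /* written by the frontend: "notify me at this index" */
};

struct toy_front {
    struct toy_sring *sring;
    toy_idx rsp_cons;   /* private consumer index */
};

/* Backend: publish n responses; return true if an event should be sent. */
static bool toy_push_responses(struct toy_sring *s, toy_idx n)
{
    toy_idx old_prod = s->rsp_prod;
    toy_idx new_prod = old_prod + n;

    s->rsp_prod = new_prod;
    /* Notify only if rsp_event lies within (old_prod, new_prod]. */
    return (toy_idx)(new_prod - s->rsp_event) < (toy_idx)(new_prod - old_prod);
}

/* Frontend: re-arm the event index, then report whether work is pending. */
static bool toy_rearm(struct toy_front *f)
{
    f->sring->rsp_event = f->rsp_cons + 1;
    return f->sring->rsp_prod != f->rsp_cons;
}

/* True if a frontend that skips the pending check would miss a later wakeup. */
static bool toy_would_stall(void)
{
    struct toy_sring s = { .rsp_prod = 0, .rsp_event = 1 };
    struct toy_front f = { .sring = &s, .rsp_cons = 0 };

    /* Two responses arrive during reinit; assume this event is never seen. */
    bool notified = toy_push_responses(&s, 2);
    assert(notified);
    (void)notified;

    /* Re-arming sets rsp_event = 1, which is already behind rsp_prod = 2. */
    bool pending = toy_rearm(&f);

    /* A later push (2 -> 3) does not cross rsp_event, so no event is sent. */
    bool later_event = toy_push_responses(&s, 1);

    return pending && !later_event;
}
```

Under these assumptions toy_would_stall() returns true: responses are pending
after re-arming, yet a later push raises no event, so the frontend has to check
for unconsumed responses itself and process them.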

> so I'm not sure what the solution is, but hopefully someone who knows this
> code better will be able to help...

My first try would be to stop dropping the lock in xn_rxeof; I think dropping
it there is utterly incorrect. Holding it should prevent multiple consumers
from poking at the ring at the same time.

Roger.


