From owner-svn-src-all@freebsd.org  Tue May  9 10:09:18 2017
Return-Path: <owner-svn-src-all@freebsd.org>
Delivered-To: svn-src-all@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5D2ADD631D2;
 Tue,  9 May 2017 10:09:18 +0000 (UTC)
 (envelope-from royger@gmail.com)
Received: from mail-qk0-x22e.google.com (mail-qk0-x22e.google.com
 [IPv6:2607:f8b0:400d:c09::22e])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 00C891FD1;
 Tue,  9 May 2017 10:09:17 +0000 (UTC)
 (envelope-from royger@gmail.com)
Received: by mail-qk0-x22e.google.com with SMTP id a72so58089493qkj.2;
 Tue, 09 May 2017 03:09:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=sender:date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to
 :user-agent; bh=pVrFE1xF34rK4krEsRGJ9Xao9emvlBrs+R2VvBT6NoQ=;
 b=cDPkZGd68S61rlreR3ucUGq9bHR8qbr9WRs+l4HT23tTW9iBVobyaNYSJc20rsDFjb
 jiWdt63Cft6h5SCaSn9Hon1R/d5Wlnhbqrx59//o3YbyXo6HXK7ofIdZrn554S0ybZzk
 kBbF3PfukL1k9C+FTIyG3f+7aNqJXNoJRdg3yIsEvaact7hGclNK8ZfvXwRMOjWvZnKs
 DRW5SXi+enF+663na4A2+wwujWVcAE9L6RvA/Pfz+j7mK3FFwwIvtLtMtn/g7eAV03Lt
 3H3LRo9djJKSl4zAKC+1jJF92dpyPufCLvr45RtbiAT2YhVltT9z9B98dijbedvmOtos
 u2ag==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:sender:date:from:to:cc:subject:message-id
 :references:mime-version:content-disposition
 :content-transfer-encoding:in-reply-to:user-agent;
 bh=pVrFE1xF34rK4krEsRGJ9Xao9emvlBrs+R2VvBT6NoQ=;
 b=Jcy5gnKsW6mkBxYxQW1XFMf+KMc44SFP+eT9PfZh+RypAk5h3njpAPL9T7OVRaJiBD
 Iy3o3q+LylRzMZkRbDmRrnIxajcMm45nZZbICSZ8JDNV1n+aaTNXDKrrAGAZuyImeihB
 KYyPwjs2/nYa8NDg3aULd5R7+85nAxdRk/Sb3owGiyY/QSturp49YxVcrD0XxrOi6K7I
 C92n2ohOnVgdLCff+5ah1NyNOIcBLk0AOncBHvop0NH6kj1b8NOeEO6F/AJUWYJgdAFF
 0IJ0qIliuKF3YgQraacb/6BCdOoQT1jBCgCLV99MPUuEaR3nqQ2xMbdsdYwgErdklLeA
 Bc5A==
X-Gm-Message-State: AN3rC/4s7p60M1USDcQ6mOH2hvJXRpMVPmyt8EDJKnl9n0RMLUkXytcB
 UtyB19WYU03AGGTk
X-Received: by 10.80.146.51 with SMTP id i48mr25038295eda.48.1494324556951;
 Tue, 09 May 2017 03:09:16 -0700 (PDT)
Received: from localhost (default-46-102-197-194.interdsl.co.uk.
 [46.102.197.194])
 by smtp.gmail.com with ESMTPSA id z27sm4924832edb.54.2017.05.09.03.09.15
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Tue, 09 May 2017 03:09:16 -0700 (PDT)
Sender: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@gmail.com>
Date: Tue, 9 May 2017 11:09:12 +0100
From: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@FreeBSD.org>
To: Colin Percival <cperciva@tarsnap.com>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org,
 svn-src-head@freebsd.org
Subject: Re: svn commit: r301198 - head/sys/dev/xen/netfront
Message-ID: <20170509100912.h3ylwugahvfi5cc2@dhcp-3-128.uk.xensource.com>
References: <201606021116.u52BGajD047287@repo.freebsd.org>
 <0100015bccba6abc-4c3b1487-25e3-4640-8221-885341ece829-000000@email.amazonses.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <0100015bccba6abc-4c3b1487-25e3-4640-8221-885341ece829-000000@email.amazonses.com>
User-Agent: NeoMutt/20170428 (1.8.2)
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all/>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 09 May 2017 10:09:18 -0000

On Wed, May 03, 2017 at 05:13:40AM +0000, Colin Percival wrote:
> On 06/02/16 04:16, Roger Pau Monné wrote:
> > Author: royger
> > Date: Thu Jun  2 11:16:35 2016
> > New Revision: 301198
> > URL: https://svnweb.freebsd.org/changeset/base/301198
> 
> I think this commit is responsible for panics I'm seeing in EC2 on T2 family
> instances.  Every time a DHCP request is made, we call into xn_ifinit_locked
> (not sure why -- something to do with making the interface promiscuous?) and
> hit this code
> 
> > @@ -1760,7 +1715,7 @@ xn_ifinit_locked(struct netfront_info *n
> >  		xn_alloc_rx_buffers(rxq);
> >  		rxq->ring.sring->rsp_event = rxq->ring.rsp_cons + 1;
> >  		if (RING_HAS_UNCONSUMED_RESPONSES(&rxq->ring))
> > -			taskqueue_enqueue(rxq->tq, &rxq->intrtask);
> > +			xn_rxeof(rxq);
> >  		XN_RX_UNLOCK(rxq);
> >  	}
> 
> but under high traffic volumes I think a separate thread can already be
> running in xn_rxeof, having dropped the RX lock while it passes a packet
> up the stack.  This would result in two different threads trying to process
> the same set of responses from the ring, with (unsurprisingly) bad results.

Hm, right, xn_rxeof drops the lock while pushing the packet up the stack.
There's an "XXX" comment just above that; could you try removing the lock
dropping and see what happens?

I'm not sure there's any reason to drop the lock here; I very much doubt
if_input is going to sleep.

> I'm not 100% sure that this is what's causing the panic, but it's definitely
> happening under high traffic conditions immediately after xn_ifinit_locked is
> called, so I think my speculation is well-founded.
> 
> There are a few things I don't understand here:
> 1. Why DHCP requests are resulting in calls into xn_ifinit_locked.

Maybe DHCP flaps the interface up and down? TBH I have no idea.
Enabling/disabling certain features (CSUM, TSO) will also cause the interface
to reconnect, which might cause incoming packets to get stuck in the RX ring.

> 2. Why the calls into xn_ifinit_locked are only happening on T2 instances
> and not on any of the other EC2 instances I've tried.

Maybe T2 instances are in a noisier environment? I'm afraid I have no idea.

> 3. Why xn_ifinit_locked is consuming ring responses.

There might be pending RX packets on the ring, so netfront consumes them and
signals netback. In the unlikely event that the RX ring was full when
xn_ifinit_locked is called, not doing this would mean the RX queue would get
stuck forever, since there's no guarantee netfront will receive event channel
notifications.
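
To illustrate why the pending responses have to be checked by hand, here is a
minimal userland sketch of the notification rule as I understand it from the
ring macros in Xen's public io/ring.h. All toy_* names are hypothetical, memory
barriers are left out, and this is not the netfront code itself: the backend
only signals when rsp_prod crosses rsp_event, so if rsp_event is set to
rsp_cons + 1 while responses are already queued, later pushes do not generate
an event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t toy_idx;	/* free-running index, wraps naturally */

/* Hypothetical, simplified stand-in for the shared ring indexes. */
struct toy_sring {
	toy_idx rsp_prod;	/* written by the backend */
	toy_idx rsp_event;	/* written by the frontend: "signal me when
				 * rsp_prod reaches this value" */
};

/*
 * Backend side: publish `count` responses and report whether an event
 * should be sent.  An event is due only if rsp_event lies within the
 * range of indexes that this push has just covered.
 */
bool
toy_push_responses(struct toy_sring *s, toy_idx count)
{
	toy_idx old_prod = s->rsp_prod;
	toy_idx new_prod = old_prod + count;

	assert(count > 0);
	s->rsp_prod = new_prod;
	return ((toy_idx)(new_prod - s->rsp_event) <
	    (toy_idx)(new_prod - old_prod));
}

/* Frontend side: are there responses that have not been consumed yet? */
bool
toy_has_unconsumed(const struct toy_sring *s, toy_idx rsp_cons)
{
	return (s->rsp_prod != rsp_cons);
}
```

With rsp_cons at 0 and three responses already queued, setting rsp_event to 1
leaves toy_has_unconsumed() true while a further toy_push_responses() reports
that no event is needed; that is the case the explicit check after updating
rsp_event is meant to catch.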

> so I'm not sure what the solution is, but hopefully someone who knows this
> code better will be able to help...

My first try would be to stop dropping the lock in xn_rxeof; I think dropping
it there is utterly incorrect. Keeping it held should prevent multiple
consumers from poking at the ring at the same time.
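
As a rough illustration of the idea (a userland sketch with pthreads, not a
patch against netfront; every toy_* name is made up for the example), the
consumer below keeps the lock held across the whole drain loop, including the
call that hands the packet up. A second caller then has to wait until
rsp_cons has caught up, so it never sees the same responses. This assumes the
input callback neither sleeps nor re-enters the queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u	/* hypothetical size, power of two */

/* Hypothetical stand-in for an RX queue; not the netfront structures. */
struct toy_rxq {
	pthread_mutex_t lock;	/* plays the role of the RX lock */
	unsigned int rsp_prod;	/* next slot the producer fills */
	unsigned int rsp_cons;	/* next slot the consumer reads */
	int slots[TOY_RING_SIZE];
};

typedef void (*toy_input_fn)(int pkt, void *arg);

void
toy_rxq_init(struct toy_rxq *q)
{
	assert(q != NULL);
	pthread_mutex_init(&q->lock, NULL);
	q->rsp_prod = q->rsp_cons = 0;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
int
toy_rxq_push(struct toy_rxq *q, int pkt)
{
	int error = -1;

	pthread_mutex_lock(&q->lock);
	if (q->rsp_prod - q->rsp_cons < TOY_RING_SIZE) {
		q->slots[q->rsp_prod % TOY_RING_SIZE] = pkt;
		q->rsp_prod++;
		error = 0;
	}
	pthread_mutex_unlock(&q->lock);
	return (error);
}

/*
 * Consumer side: drain everything that is pending without ever dropping
 * the lock, so concurrent callers cannot process the same slot twice.
 * Returns the number of packets handed to `input`.
 */
unsigned int
toy_rxeof(struct toy_rxq *q, toy_input_fn input, void *arg)
{
	unsigned int n = 0;

	pthread_mutex_lock(&q->lock);
	while (q->rsp_cons != q->rsp_prod) {
		int pkt = q->slots[q->rsp_cons % TOY_RING_SIZE];

		q->rsp_cons++;
		input(pkt, arg);	/* lock still held here */
		n++;
	}
	pthread_mutex_unlock(&q->lock);
	return (n);
}

/* Trivial input callback for illustration: counts delivered packets. */
void
toy_count_input(int pkt, void *arg)
{
	(void)pkt;
	(*(unsigned int *)arg)++;
}
```

Calling toy_rxeof() from two places (say, an interrupt task and an init path)
is then safe in this toy model: whichever caller takes the lock second finds
the ring already drained by the first.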

Roger.