Date: Tue, 9 May 2017 11:09:12 +0100
From: Roger Pau Monné
To: Colin Percival
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r301198 - head/sys/dev/xen/netfront
Message-ID: <20170509100912.h3ylwugahvfi5cc2@dhcp-3-128.uk.xensource.com>
References: <201606021116.u52BGajD047287@repo.freebsd.org> <0100015bccba6abc-4c3b1487-25e3-4640-8221-885341ece829-000000@email.amazonses.com>
In-Reply-To: <0100015bccba6abc-4c3b1487-25e3-4640-8221-885341ece829-000000@email.amazonses.com>

On Wed, May 03, 2017 at 05:13:40AM +0000, Colin Percival wrote:
> On 06/02/16 04:16, Roger Pau Monné wrote:
> > Author: royger
> > Date: Thu Jun 2 11:16:35 2016
> > New Revision: 301198
> > URL: https://svnweb.freebsd.org/changeset/base/301198
>
> I think this commit is responsible for panics I'm seeing in EC2 on T2 family
> instances. Every time a DHCP request is made, we call into xn_ifinit_locked
> (not sure why -- something to do with making the interface promiscuous?) and
> hit this code
>
> > @@ -1760,7 +1715,7 @@ xn_ifinit_locked(struct netfront_info *n
> >                  xn_alloc_rx_buffers(rxq);
> >                  rxq->ring.sring->rsp_event = rxq->ring.rsp_cons + 1;
> >                  if (RING_HAS_UNCONSUMED_RESPONSES(&rxq->ring))
> > -                        taskqueue_enqueue(rxq->tq, &rxq->intrtask);
> > +                        xn_rxeof(rxq);
> >                  XN_RX_UNLOCK(rxq);
> >          }
>
> but under high traffic volumes I think a separate thread can already be
> running in xn_rxeof, having dropped the RX lock while it passes a packet
> up the stack. This would result in two different threads trying to process
> the same set of responses from the ring, with (unsurprisingly) bad results.

Hm, right, xn_rxeof drops the lock while pushing the packet up the stack.
There's an "XXX" comment on top of that; could you try removing the lock
dropping and see what happens?

I'm not sure there's any reason to drop the lock here; I very much doubt
if_input is going to sleep.

> I'm not 100% sure that this is what's causing the panic, but it's definitely
> happening under high traffic conditions immediately after xn_ifinit_locked is
> called, so I think my speculation is well-founded.
>
> There are a few things I don't understand here:
> 1. Why DHCP requests are resulting in calls into xn_ifinit_locked.

Maybe DHCP flaps the interface up and down? TBH I have no idea.
Enabling/disabling certain features (CSUM, TSO) will also cause the interface
to reconnect, which might cause incoming packets to get stuck in the RX ring.

> 2. Why the calls into xn_ifinit_locked are only happening on T2 instances
> and not on any of the other EC2 instances I've tried.

Maybe T2 instances are in a noisier environment? I'm afraid I have no idea
about that one.

> 3. Why xn_ifinit_locked is consuming ring responses.

There might be pending RX packets on the ring, so netfront consumes them and
signals netback. In the unlikely event that the RX ring is full when
xn_ifinit_locked is called, not doing this would mean the RX queue would get
stuck forever, since there's no guarantee netfront will receive event channel
notifications.

> so I'm not sure what the solution is, but hopefully someone who knows this
> code better will be able to help...

My first try would be to stop dropping the lock in xn_rxeof; I think dropping
it there is utterly incorrect. Keeping it held should prevent multiple
consumers from poking at the ring at the same time.

Roger.
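
To make the failure mode discussed above concrete, here is a minimal
single-threaded sketch in C. It is not the netfront code: toy_rxq, toy_rxeof,
toy_input, toy_other_thread, toy_duplicates and NRESP are all made-up names,
the "lock" is a plain flag, and the second thread is simulated by a nested
call from the input hook. The cached ring index is also an assumption, chosen
only to make the overlap visible; the real driver's failure mode may differ in
detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NRESP 8                 /* hypothetical number of pending responses */

/* Toy stand-in for an RX queue; not the real netfront structure. */
struct toy_rxq {
    bool locked;                /* stands in for the RX lock */
    unsigned int rsp_cons;      /* next response to consume */
    unsigned int rsp_prod;      /* one past the last produced response */
    int seen[NRESP];            /* times each response was handled */
    bool drop_lock;             /* drop the lock around toy_input()? */
    bool reinit_pending;        /* a second caller is waiting for the lock */
};

static void toy_rxeof(struct toy_rxq *q);

/* Second caller (think of an init path on another thread).
 * It only gets to run if the lock is currently free. */
static void toy_other_thread(struct toy_rxq *q)
{
    if (q->locked || !q->reinit_pending)
        return;                 /* lock held: a real thread would block */
    q->reinit_pending = false;
    q->locked = true;
    toy_rxeof(q);
    q->locked = false;
}

/* Stand-in for passing a packet up the stack; the point at which
 * another thread may get scheduled. */
static void toy_input(struct toy_rxq *q, unsigned int idx)
{
    q->seen[idx]++;
    toy_other_thread(q);
}

/* Caller must hold the lock. Walks the ring using a cached index
 * and publishes the new consumer index at the end. */
static void toy_rxeof(struct toy_rxq *q)
{
    assert(q->locked);
    unsigned int i = q->rsp_cons;
    while (i != q->rsp_prod) {
        unsigned int idx = i++;
        if (q->drop_lock)
            q->locked = false;  /* the questionable unlock */
        toy_input(q, idx);
        if (q->drop_lock)
            q->locked = true;
    }
    q->rsp_cons = i;
}

/* Returns how many responses were handled more than once. */
static int toy_duplicates(bool drop_lock)
{
    struct toy_rxq q = { .rsp_prod = NRESP, .drop_lock = drop_lock,
                         .reinit_pending = true };
    q.locked = true;            /* first caller takes the lock */
    toy_rxeof(&q);
    q.locked = false;
    toy_other_thread(&q);       /* a blocked second caller runs now */

    int dups = 0;
    for (size_t i = 0; i < NRESP; i++)
        if (q.seen[i] > 1)
            dups++;
    return dups;
}
```

In this toy model, toy_duplicates(true) reports every response as handled
twice, while toy_duplicates(false) reports none: with the lock held across the
whole loop, the second caller only runs once the ring has been drained.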