From owner-freebsd-current@FreeBSD.ORG  Sat May  2 09:03:53 2015
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 311DAD52;
 Sat,  2 May 2015 09:03:53 +0000 (UTC)
Received: from mail-ig0-x22c.google.com (mail-ig0-x22c.google.com
 [IPv6:2607:f8b0:4001:c05::22c])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 022A411CC;
 Sat,  2 May 2015 09:03:52 +0000 (UTC)
Received: by igbyr2 with SMTP id yr2so54764736igb.0;
 Sat, 02 May 2015 02:03:52 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=MqnlBVLvIfhoB1jY4Nx8wpqCp5Urb5ZF7nsOCKYzQDw=;
 b=exsQiJHRpF9wrn9sanao/NH//RPuIpOQDeKBYIjKxKXnsXpe8iDnC5KG9z5CKM7M7f
 C/RCvQ4zHVe2lZBbiRAlCD9WNYhQANAv4ti6S49nvjBIHibL9ReqFemkmUWievwGtSBm
 w4cMr0xXpxd/Sq21Zcojw3SITBOHvOl3iFoiYAOmM56T/bkAm7R6f2LLfHRavtw+p2kG
 0YFG9K1jOpGIG5HscqAdEatoa0vqlSETFIeQ1c0nSovu3xfzaUdY67wKLjZCm0erxRop
 gR9Ez5xspdMsTmRafhqnR4LPnJGag8R4rigVhSYJ5q0QqQWKr6FLTFCQ2cD9Cy1AfABh
 y8gA==
MIME-Version: 1.0
X-Received: by 10.107.168.143 with SMTP id e15mr16701890ioj.88.1430557432213; 
 Sat, 02 May 2015 02:03:52 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.36.38.133 with HTTP; Sat, 2 May 2015 02:03:52 -0700 (PDT)
In-Reply-To: <1494.1430550164@critter.freebsd.dk>
References: <1494.1430550164@critter.freebsd.dk>
Date: Sat, 2 May 2015 02:03:52 -0700
X-Google-Sender-Auth: QBT3T5CkZxuyjqTKRW1Ppx7tBxg
Message-ID: <CAJ-VmonoY4aE6=mRkJZT-HRA25gKqSpSPvHFj1L-m9r9MaLgtQ@mail.gmail.com>
Subject: Re: iwn crashes in current (r282269)
From: Adrian Chadd <adrian@freebsd.org>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>, 
 "freebsd-wireless@freebsd.org" <freebsd-wireless@freebsd.org>
Cc: "current@freebsd.org" <current@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 May 2015 09:03:53 -0000

Hi,


On 2 May 2015 at 00:02, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> May  2 01:01:34 critter kernel: iwn0: device timeout
> May  2 01:01:34 critter kernel: firmware: 'iwn6000g2afw' version 0: 677296 bytes loaded at 0xffffffff81f880c0
> May  2 01:01:34 critter kernel: iwn0: iwn_read_firmware: ucode rev=0x12a80601
> May  2 01:01:40 critter kernel: iwn0: iwn_tx_data: m=0xfffff80236fe8500: seqno (9550) (78) != ring index (0) !
> May  2 01:01:40 critter kernel: iwn0: iwn_intr: fatal firmware error
> May  2 01:01:40 critter kernel: iwn0: iwn_panicked: controller panicked, iv_state = 5; resetting...
> May  2 01:01:40 critter kernel: firmware: 'iwn6000g2afw' version 0: 677296 bytes loaded at 0xffffffff81f880c0
> May  2 01:01:40 critter kernel: iwn0: iwn_read_firmware: ucode rev=0x12a80601
>
> And then the machine hung.
>
> No further details, as the screen-blanker was on.

So there's something odd with iwn and sequence number allocations.
What's supposed to happen here is that:

* net80211 handles sequence number allocation;
* then A-MPDU is negotiated;
* then the driver handles sequence number allocations.

The firmware requires that for 11n transmit, each frame goes into a
ring slot that's seqno % 256. It's not an arbitrary slot. It'll panic
otherwise, like you saw above.
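
As a rough sketch of that constraint (this is not iwn(4) code; the
helper names and the RING_SLOTS constant are made up for illustration,
with 256 taken from the "seqno % 256" rule above), the check the log
line is complaining about boils down to:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 256  /* assumed ring size, from "seqno % 256" above */

/* Hypothetical helper: the ring slot the firmware expects for a seqno. */
static unsigned
expected_slot(uint16_t seqno)
{
	return (seqno % RING_SLOTS);
}

/* Hypothetical helper: nonzero if the frame sits in the expected slot. */
static int
slot_matches(uint16_t seqno, unsigned ring_cur)
{
	assert(ring_cur < RING_SLOTS);
	return (expected_slot(seqno) == ring_cur);
}
```

With the values from the log, expected_slot(9550) is 78 while the ring
index was 0, so slot_matches(9550, 0) is false. That is the
"seqno (9550) (78) != ring index (0)" mismatch reported just before the
firmware error.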

Now, something's upsetting it. It may be a noisy environment leading
to BAR frame transmissions and eventual tear-down of the A-MPDU state,
leading to net80211 taking over sequence number allocation again. I
fixed a whole bunch of those races in the ath(4) driver when I implemented
11n and found there's no locking at all going on there. :( It could
also be something inside net80211 that's advancing the sequence number
space, even though A-MPDU is enabled.

There's only a couple of places where ni_txseqs is updated in
net80211. If it were getting updated there, it should be obvious. But
it does do a check to see if AMPDU is enabled and running, and none of
that is consistently locked.

iwn_addba_response() sets the ni_txseqs entry for the tid to be whatever was
negotiated during the aggregation negotiation (ADDBA) and then sets
the initial ring slot id to be whatever the starting sequence number
is ('ssn' in *_ampdu_tx_start()). iwn_tx_data() does do sequence
number allocation there. It's possible we're seeing races where
aggregation is being torn down during active transmit and the state is
all mucked up.

I recall seeing issues in ath(4) where there were some packets queued
between sending out the initial aggregation negotiation and it being
negotiated, which meant some packets would go out with sequence
numbers /after/ what was initially negotiated during ADDBA. I.e.:

* you're at seq X, and you negotiate ADDBA at seq X;
* you queue a bunch of transmit frames, seq X -> X + n;
* peer says "ADDBA acceptable, starting seq X";
* the next frame you transmit comes from seq X + n + 1, but the other
peer is confused.

Here it may show up as:

* you negotiate seq X via addba;
* you queue a bunch more frames via the normal transmit path;
* you get the addba response, set initial ssn to X;
* the 'cur' pointer here in the ring is now X % 256, but the next
frame you transmit is (X + n) % 256, and stuff is out of alignment.
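
To make the suspected sequence concrete, here's a small userland model
of it. It's only a sketch under assumptions: struct tid_model and the
functions below are hypothetical and don't correspond to the real
iwn(4) or net80211 structures. The only things taken as given are the
256-slot rule above and the 12-bit 802.11 sequence number space.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 256   /* from "seqno % 256" in the text */
#define SEQ_SPACE  4096  /* 802.11 sequence numbers are 12 bits */

/* Hypothetical model of one TID's transmit state; not the driver's. */
struct tid_model {
	uint16_t next_seq;  /* next sequence number to hand out */
	uint16_t ring_cur;  /* ring slot the next aggregate frame uses */
};

/* Allocate a sequence number the way the normal transmit path would. */
static uint16_t
alloc_seq(struct tid_model *t)
{
	uint16_t s = t->next_seq;

	t->next_seq = (uint16_t)((s + 1) % SEQ_SPACE);
	return (s);
}

/* ADDBA response arrives: ring pointer is set from the negotiated ssn. */
static void
addba_response(struct tid_model *t, uint16_t ssn)
{
	t->ring_cur = ssn % RING_SLOTS;
}

/* Slots between where the ring points and where the next seqno belongs. */
static unsigned
slot_skew(const struct tid_model *t)
{
	return ((t->next_seq % RING_SLOTS + RING_SLOTS - t->ring_cur) %
	    RING_SLOTS);
}

/* Replay: negotiate at seq x, queue n frames, then get the response. */
static unsigned
replay(uint16_t x, unsigned n)
{
	struct tid_model t = { .next_seq = x, .ring_cur = 0 };
	uint16_t ssn;

	assert(x < SEQ_SPACE);
	ssn = t.next_seq;               /* ADDBA request carries seq x */
	for (unsigned i = 0; i < n; i++)
		(void)alloc_seq(&t);    /* frames queued while ADDBA pending */
	addba_response(&t, ssn);
	return (slot_skew(&t));
}
```

In this model replay(100, 0) gives a skew of 0 (nothing queued, so the
ring and the sequence space agree), while replay(100, 5) gives 5: the
next frame wants slot (X + 5) % 256 but the ring points at X % 256.
Note that replay(100, 256) also gives 0, so a skew that happens to be a
multiple of 256 wouldn't trip the check at all.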

So, would someone please help see if that's the case? That'd be really
helpful. :)


-adrian