From owner-freebsd-bugs@FreeBSD.ORG Mon Jul 30 22:30:18 2012 Return-Path: Delivered-To: freebsd-bugs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1B1BE1065691 for ; Mon, 30 Jul 2012 22:30:18 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 06EB58FC17 for ; Mon, 30 Jul 2012 22:30:18 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q6UMUHis090429 for ; Mon, 30 Jul 2012 22:30:17 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q6UMUHEu090426; Mon, 30 Jul 2012 22:30:17 GMT (envelope-from gnats) Date: Mon, 30 Jul 2012 22:30:17 GMT Message-Id: <201207302230.q6UMUHEu090426@freefall.freebsd.org> To: freebsd-bugs@FreeBSD.org From: Adrian Chadd Cc: Subject: Re: kern/170254: [ath] Atheros 9220 device timeouts. X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Adrian Chadd List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 30 Jul 2012 22:30:18 -0000 The following reply was made to PR kern/170254; it has been noted by GNATS. From: Adrian Chadd To: lev@freebsd.org Cc: bug-followup@freebsd.org, freebsd-wireless@freebsd.org Subject: Re: kern/170254: [ath] Atheros 9220 device timeouts. Date: Mon, 30 Jul 2012 15:21:33 -0700 And just so others can see what's going on.. here's an example "device timeout" that's worth investigating. On 30 July 2012 00:18, Lev Serebryakov wrote: > Jul 30 01:20:50 gateway kernel: ath0: device timeout > Jul 30 01:20:50 gateway kernel: ath0: ath_reset: called > Jul 30 01:20:50 gateway kernel: ath0: ath_stoptxdma: tx queue [9] 0x1b8b000, link 0 .. queue 9 is the beacon queue. > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [0] 0, link 0 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [1] 0x20e6780, link 0xd58e7420 The hardware thinks the TX descriptor pointer for this TX queue is 0x20e6780. Let's see if that lines up with reality. > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [2] 0, link 0 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [3] 0, link 0 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [8] 0x2115b40, link 0xd5915ba0 > Jul 30 01:20:50 gateway kernel: Q1[ 0] (DS.V:0xd58e63c0 DS.P:0x20e63c0) L:020e6420 D:16753a3e F:0001 > Jul 30 01:20:50 gateway kernel: Seq: 45152 swtry: 0 ADDBAW?: 0 DOBAW?: 0 > Jul 30 01:20:50 gateway kernel: 413f0612 0000902c 24348000 1b0e090d 022c0180 315a02d8 > Jul 30 01:20:50 gateway kernel: 08000000 0006318c 00000000 3f000000 3f000000 3f000000 00000000 00000000 > Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 First half of a two-part frame. Looks fine to me. It's a legacy frame. > Jul 30 01:20:50 gateway kernel: (DS.V:0xd58e6420 DS.P:0x20e6420) L:020e73c0 D:167e7810 F:0001 > Jul 30 01:20:50 gateway kernel: Seq: 45152 swtry: 0 ADDBAW?: 0 DOBAW?: 0 > Jul 30 01:20:50 gateway kernel: 00000000 000005dc 24348000 1b0e090d aa55aa55 aa55aa55 > Jul 30 01:20:50 gateway kernel: a801aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 00000000 00000000 > Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > Jul 30 01:20:50 gateway kernel: [end] This is the second half. Now, note the 0xaa55? It looks like the descriptor memory was never bzero()'ed. I'll look into why this is. I would've expected that the descriptor contents would be bzero()'ed before used. But - it's not marked as completed, so it's not a "missed TX completion interrupt" issue. It could be a "hardware was never given this particular descriptor to TX" issue. It could be a "hardware was given this descriptor but it hung" issue. It also unfortunately may be a "previous descriptor was corrupted and the hardware stopped processing things at this point" issue. Since the HWQ isn't empty, it's not a "nothing was actually added to the HW queue" issue. Now, the hardware TX descriptor pointer (TXDP) is pointing to 0x20e6780. That isn't this descriptor at all. So it's quite likely pointing at something garbage. I'll explain how it works in detail below. Next: Q8 is the content-after-beacon queue. The single frame in it did complete, but for whatever reason, the driver hadn't serviced this queue yet to handle said completed frame. > Jul 30 01:20:50 gateway kernel: Q8[ 0] (DS.V:0xd5915b40 DS.P:0x2115b40) L:02115ba0 D:1f00913e F:0003 * > Jul 30 01:20:50 gateway kernel: Seq: 2224 swtry: 0 ADDBAW?: 0 DOBAW?: 0 > Jul 30 01:20:50 gateway kernel: 413f007c 01005028 00018000 0000001b 000004aa 00000000 > Jul 30 01:20:50 gateway kernel: 08000000 0000000c 00000000 3f000000 3f000000 3f000000 00000000 00000000 > Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > Jul 30 01:20:50 gateway kernel: (DS.V:0xd5915ba0 DS.P:0x2115ba0) L:00000000 D:1f008c46 F:0003 * > Jul 30 01:20:50 gateway kernel: Seq: 2224 swtry: 0 ADDBAW?: 0 DOBAW?: 0 > Jul 30 01:20:50 gateway kernel: 00000000 00000048 00018000 0000001b aa55aa55 aa55aa55 > Jul 30 01:20:50 gateway kernel: a801aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 00808080 00000001 > Jul 30 01:20:50 gateway kernel: 0396cbc4 40bc118a 8c006fc7 80808080 80808080 80808080 80808080 00000983 > Jul 30 01:20:50 gateway kernel: [end] How often is this happening? Now, what's going on in Q1: The TXDP points to a descriptor that isn't the one that's in the head of the TX descriptor list. There are a few possibilities: There may be some races in the HW TXQ handling which is not keeping that descriptor list in lock-step with what the hardware expects. I'll review this again. There's a race where the TX DMA engine will re-read the link pointer of the previous descriptor to look at where the next descriptor is, and if that descriptor has been freed by the hardware - or if the link pointer pointed to NULL at that time) the hardware will stop transmitting. The only way to recover this is to write a new TXDP and write "1" to the relevant TXE bit for that queue. Sam implemented a fix for a single-queue device in his TDMA work - the driver stores the last descriptor that was handled and doesn't free it until the next descriptor in the list has been successfully TX'ed. That way if TX stops for a little while (eg because of noisy air, or RX that occurs, or it hits the end of a TX burst opportunity) the hardware will be able to re-read the link pointer from the last completed descriptor in order to get the next descriptor to start TXing from. There are two issues with Sam's implementation: * It doesn't track this TX buffer stuff for each queue (which it needs to!) - it only does it globally. Since there's 10 TX queues with separate DMA engines, we should really be keeping this "last" buffer around for each TX queue; * It's only enabled when you enabled TDMA support. So - do you have TDMA support compiled into the driver/kernel? If not, would you mind compiling in TDMA support so this code gets included, and we'll see if it helps? The next thing we can try is to hack up the TX watchdog routine to first try re-setting the DMA for any TX queue that's stopped, if the queue just has unserviced frames. In this instance, TX should just start up again. That won't _fix_ the problem, but it'll help us narrow down whether it's actually what I think it is. Finally, once I finish the AR93xx support, this issue will go away. The FIFO model was designed specifically to avoid this software hack. :-) Adrian