Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 30 Jul 2012 22:30:17 GMT
From:      Adrian Chadd <adrian@freebsd.org>
To:        freebsd-bugs@FreeBSD.org
Subject:   Re: kern/170254: [ath] Atheros 9220 device timeouts.
Message-ID:  <201207302230.q6UMUHEu090426@freefall.freebsd.org>

next in thread | raw e-mail | index | archive | help
The following reply was made to PR kern/170254; it has been noted by GNATS.

From: Adrian Chadd <adrian@freebsd.org>
To: lev@freebsd.org
Cc: bug-followup@freebsd.org, freebsd-wireless@freebsd.org
Subject: Re: kern/170254: [ath] Atheros 9220 device timeouts.
Date: Mon, 30 Jul 2012 15:21:33 -0700

 And just so others can see what's going on.. here's an example "device
 timeout" that's worth investigating.
 
 On 30 July 2012 00:18, Lev Serebryakov <lev@freebsd.org> wrote:
 
 > Jul 30 01:20:50 gateway kernel: ath0: device timeout
 > Jul 30 01:20:50 gateway kernel: ath0: ath_reset: called
 > Jul 30 01:20:50 gateway kernel: ath0: ath_stoptxdma: tx queue [9] 0x1b8b000, link 0
 
 .. queue 9 is the beacon queue.
 
 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [0] 0, link 0
 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [1] 0x20e6780, link 0xd58e7420
 
 The hardware thinks the TX descriptor pointer for this TX queue is
 0x20e6780. Let's see if that lines up with reality.
 
 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [2] 0, link 0
 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [3] 0, link 0
 > Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [8] 0x2115b40, link 0xd5915ba0
 
 > Jul 30 01:20:50 gateway kernel: Q1[  0] (DS.V:0xd58e63c0 DS.P:0x20e63c0) L:020e6420 D:16753a3e F:0001
 > Jul 30 01:20:50 gateway kernel: Seq: 45152 swtry: 0 ADDBAW?: 0 DOBAW?: 0
 > Jul 30 01:20:50 gateway kernel: 413f0612 0000902c 24348000 1b0e090d 022c0180 315a02d8
 > Jul 30 01:20:50 gateway kernel: 08000000 0006318c 00000000 3f000000 3f000000 3f000000 00000000 00000000
 > Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 
 First half of a two-part frame. Looks fine to me. It's a legacy frame.
 
 > Jul 30 01:20:50 gateway kernel: (DS.V:0xd58e6420 DS.P:0x20e6420) L:020e73c0 D:167e7810 F:0001
 > Jul 30 01:20:50 gateway kernel: Seq: 45152 swtry: 0 ADDBAW?: 0 DOBAW?: 0
 > Jul 30 01:20:50 gateway kernel: 00000000 000005dc 24348000 1b0e090d aa55aa55 aa55aa55
 > Jul 30 01:20:50 gateway kernel: a801aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 00000000 00000000
 > Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 > Jul 30 01:20:50 gateway kernel: [end]
 
 This is the second half. Now, note the 0xaa55? It looks like the
 descriptor memory was never bzero()'ed. I'll look into why this is. I
 would've expected that the descriptor contents would be bzero()'ed
 before used.
 
 But - it's not marked as completed, so it's not a "missed TX
 completion interrupt" issue. It could be a "hardware was never given
 this particular descriptor to TX" issue. It could be a "hardware was
 given this descriptor but it hung" issue. It also unfortunately may be
 a "previous descriptor was corrupted and the hardware stopped
 processing things at this point" issue. Since the HWQ isn't empty,
 it's not a "nothing was actually added to the HW queue" issue.
 
 Now, the hardware TX descriptor pointer (TXDP) is pointing to
 0x20e6780. That isn't this descriptor at all. So it's quite likely
 pointing at something garbage. I'll explain how it works in detail
 below.
 
 Next: Q8 is the content-after-beacon queue. The single frame in it did
 complete, but for whatever reason, the driver hadn't serviced this
 queue yet to handle said completed frame.
 
 > Jul 30 01:20:50 gateway kernel: Q8[  0] (DS.V:0xd5915b40 DS.P:0x2115b40) L:02115ba0 D:1f00913e F:0003 *
 > Jul 30 01:20:50 gateway kernel: Seq: 2224 swtry: 0 ADDBAW?: 0 DOBAW?: 0
 > Jul 30 01:20:50 gateway kernel: 413f007c 01005028 00018000 0000001b 000004aa 00000000
 > Jul 30 01:20:50 gateway kernel: 08000000 0000000c 00000000 3f000000 3f000000 3f000000 00000000 00000000
 > Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 > Jul 30 01:20:50 gateway kernel: (DS.V:0xd5915ba0 DS.P:0x2115ba0) L:00000000 D:1f008c46 F:0003 *
 > Jul 30 01:20:50 gateway kernel: Seq: 2224 swtry: 0 ADDBAW?: 0 DOBAW?: 0
 > Jul 30 01:20:50 gateway kernel: 00000000 00000048 00018000 0000001b aa55aa55 aa55aa55
 > Jul 30 01:20:50 gateway kernel: a801aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 00808080 00000001
 > Jul 30 01:20:50 gateway kernel: 0396cbc4 40bc118a 8c006fc7 80808080 80808080 80808080 80808080 00000983
 > Jul 30 01:20:50 gateway kernel: [end]
 
 How often is this happening?
 
 Now, what's going on in Q1: The TXDP points to a descriptor that isn't
 the one that's in the head of the TX descriptor list. There are a few
 possibilities:
 
 There may be some races in the HW TXQ handling which is not keeping
 that descriptor list in lock-step with what the hardware expects. I'll
 review this again.
 
 There's a race where the TX DMA engine will re-read the link pointer
 of the previous descriptor to look at where the next descriptor is,
 and if that descriptor has been freed by the hardware - or if the link
 pointer pointed to NULL at that time) the hardware will stop
 transmitting. The only way to recover this is to write a new TXDP and
 write "1" to the relevant TXE bit for that queue.
 
 Sam implemented a fix for a single-queue device in his TDMA work - the
 driver stores the last descriptor that was handled and doesn't free it
 until the next descriptor in the list has been successfully TX'ed.
 That way if TX stops for a little while (eg because of noisy air, or
 RX that occurs, or it hits the end of a TX burst opportunity) the
 hardware will be able to re-read the link pointer from the last
 completed descriptor in order to get the next descriptor to start
 TXing from.
 
 There are two issues with Sam's implementation:
 
 * It doesn't track this TX buffer stuff for each queue (which it needs
 to!) - it only does it globally. Since there's 10 TX queues with
 separate DMA engines, we should really be keeping this "last" buffer
 around for each TX queue;
 * It's only enabled when you enabled TDMA support.
 
 So - do you have TDMA support compiled into the driver/kernel? If not,
 would you mind compiling in TDMA support so this code gets included,
 and we'll see if it helps?
 
 The next thing we can try is to hack up the TX watchdog routine to
 first try re-setting the DMA for any TX queue that's stopped, if the
 queue just has unserviced frames. In this instance, TX should just
 start up again. That won't _fix_ the problem, but it'll help us narrow
 down whether it's actually what I think it is.
 
 Finally, once I finish the AR93xx support, this issue will go away.
 The FIFO model was designed specifically to avoid this software hack.
 :-)
 
 
 
 
 Adrian



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201207302230.q6UMUHEu090426>