Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 30 Jul 2012 15:21:33 -0700
From:      Adrian Chadd <adrian@freebsd.org>
To:        lev@freebsd.org
Cc:        freebsd-wireless@freebsd.org, bug-followup@freebsd.org
Subject:   Re: kern/170254: [ath] Atheros 9220 device timeouts.
Message-ID:  <CAJ-Vmo=0MokPCeif5TtCiZzFwkg8Qgi_fpwpO_yYtWcZR-h86w@mail.gmail.com>
In-Reply-To: <653414070.20120730111836@serebryakov.spb.ru>
References:  <653414070.20120730111836@serebryakov.spb.ru>

next in thread | previous in thread | raw e-mail | index | archive | help
And just so others can see what's going on.. here's an example "device
timeout" that's worth investigating.

On 30 July 2012 00:18, Lev Serebryakov <lev@freebsd.org> wrote:

> Jul 30 01:20:50 gateway kernel: ath0: device timeout
> Jul 30 01:20:50 gateway kernel: ath0: ath_reset: called
> Jul 30 01:20:50 gateway kernel: ath0: ath_stoptxdma: tx queue [9] 0x1b8b000, link 0

.. queue 9 is the beacon queue.

> Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [0] 0, link 0
> Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [1] 0x20e6780, link 0xd58e7420

The hardware thinks the TX descriptor pointer for this TX queue is
0x20e6780. Let's see if that lines up with reality.

> Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [2] 0, link 0
> Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [3] 0, link 0
> Jul 30 01:20:50 gateway kernel: ath0: ath_tx_stopdma: tx queue [8] 0x2115b40, link 0xd5915ba0

> Jul 30 01:20:50 gateway kernel: Q1[  0] (DS.V:0xd58e63c0 DS.P:0x20e63c0) L:020e6420 D:16753a3e F:0001
> Jul 30 01:20:50 gateway kernel: Seq: 45152 swtry: 0 ADDBAW?: 0 DOBAW?: 0
> Jul 30 01:20:50 gateway kernel: 413f0612 0000902c 24348000 1b0e090d 022c0180 315a02d8
> Jul 30 01:20:50 gateway kernel: 08000000 0006318c 00000000 3f000000 3f000000 3f000000 00000000 00000000
> Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

First half of a two-part frame. Looks fine to me. It's a legacy frame.

> Jul 30 01:20:50 gateway kernel: (DS.V:0xd58e6420 DS.P:0x20e6420) L:020e73c0 D:167e7810 F:0001
> Jul 30 01:20:50 gateway kernel: Seq: 45152 swtry: 0 ADDBAW?: 0 DOBAW?: 0
> Jul 30 01:20:50 gateway kernel: 00000000 000005dc 24348000 1b0e090d aa55aa55 aa55aa55
> Jul 30 01:20:50 gateway kernel: a801aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 00000000 00000000
> Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> Jul 30 01:20:50 gateway kernel: [end]

This is the second half. Now, note the 0xaa55? It looks like the
descriptor memory was never bzero()'ed. I'll look into why this is. I
would've expected that the descriptor contents would be bzero()'ed
before used.

But - it's not marked as completed, so it's not a "missed TX
completion interrupt" issue. It could be a "hardware was never given
this particular descriptor to TX" issue. It could be a "hardware was
given this descriptor but it hung" issue. It also unfortunately may be
a "previous descriptor was corrupted and the hardware stopped
processing things at this point" issue. Since the HWQ isn't empty,
it's not a "nothing was actually added to the HW queue" issue.

Now, the hardware TX descriptor pointer (TXDP) is pointing to
0x20e6780. That isn't this descriptor at all. So it's quite likely
pointing at something garbage. I'll explain how it works in detail
below.

Next: Q8 is the content-after-beacon queue. The single frame in it did
complete, but for whatever reason, the driver hadn't serviced this
queue yet to handle said completed frame.

> Jul 30 01:20:50 gateway kernel: Q8[  0] (DS.V:0xd5915b40 DS.P:0x2115b40) L:02115ba0 D:1f00913e F:0003 *
> Jul 30 01:20:50 gateway kernel: Seq: 2224 swtry: 0 ADDBAW?: 0 DOBAW?: 0
> Jul 30 01:20:50 gateway kernel: 413f007c 01005028 00018000 0000001b 000004aa 00000000
> Jul 30 01:20:50 gateway kernel: 08000000 0000000c 00000000 3f000000 3f000000 3f000000 00000000 00000000
> Jul 30 01:20:50 gateway kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> Jul 30 01:20:50 gateway kernel: (DS.V:0xd5915ba0 DS.P:0x2115ba0) L:00000000 D:1f008c46 F:0003 *
> Jul 30 01:20:50 gateway kernel: Seq: 2224 swtry: 0 ADDBAW?: 0 DOBAW?: 0
> Jul 30 01:20:50 gateway kernel: 00000000 00000048 00018000 0000001b aa55aa55 aa55aa55
> Jul 30 01:20:50 gateway kernel: a801aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 aa55aa55 00808080 00000001
> Jul 30 01:20:50 gateway kernel: 0396cbc4 40bc118a 8c006fc7 80808080 80808080 80808080 80808080 00000983
> Jul 30 01:20:50 gateway kernel: [end]

How often is this happening?

Now, what's going on in Q1: The TXDP points to a descriptor that isn't
the one that's in the head of the TX descriptor list. There are a few
possibilities:

There may be some races in the HW TXQ handling which is not keeping
that descriptor list in lock-step with what the hardware expects. I'll
review this again.

There's a race where the TX DMA engine will re-read the link pointer
of the previous descriptor to look at where the next descriptor is,
and if that descriptor has been freed by the hardware - or if the link
pointer pointed to NULL at that time) the hardware will stop
transmitting. The only way to recover this is to write a new TXDP and
write "1" to the relevant TXE bit for that queue.

Sam implemented a fix for a single-queue device in his TDMA work - the
driver stores the last descriptor that was handled and doesn't free it
until the next descriptor in the list has been successfully TX'ed.
That way if TX stops for a little while (eg because of noisy air, or
RX that occurs, or it hits the end of a TX burst opportunity) the
hardware will be able to re-read the link pointer from the last
completed descriptor in order to get the next descriptor to start
TXing from.

There are two issues with Sam's implementation:

* It doesn't track this TX buffer stuff for each queue (which it needs
to!) - it only does it globally. Since there's 10 TX queues with
separate DMA engines, we should really be keeping this "last" buffer
around for each TX queue;
* It's only enabled when you enabled TDMA support.

So - do you have TDMA support compiled into the driver/kernel? If not,
would you mind compiling in TDMA support so this code gets included,
and we'll see if it helps?

The next thing we can try is to hack up the TX watchdog routine to
first try re-setting the DMA for any TX queue that's stopped, if the
queue just has unserviced frames. In this instance, TX should just
start up again. That won't _fix_ the problem, but it'll help us narrow
down whether it's actually what I think it is.

Finally, once I finish the AR93xx support, this issue will go away.
The FIFO model was designed specifically to avoid this software hack.
:-)




Adrian



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-Vmo=0MokPCeif5TtCiZzFwkg8Qgi_fpwpO_yYtWcZR-h86w>