From owner-freebsd-stable@FreeBSD.ORG  Wed Aug 27 01:11:13 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 046651065674;
	Wed, 27 Aug 2008 01:11:13 +0000 (UTC)
	(envelope-from alex.kovalenko@verizon.net)
Received: from vms044pub.verizon.net (vms044pub.verizon.net [206.46.252.44])
	by mx1.freebsd.org (Postfix) with ESMTP id BB3B08FC23;
	Wed, 27 Aug 2008 01:11:12 +0000 (UTC)
	(envelope-from alex.kovalenko@verizon.net)
Received: from [10.0.3.231] ([70.111.10.128])
	by vms044.mailsrvcs.net (Sun Java System Messaging Server 6.2-6.01
	(built Apr
	3 2006)) with ESMTPA id <0K6800H8YJA0I168@vms044.mailsrvcs.net>; Tue,
	26 Aug 2008 20:10:49 -0500 (CDT)
Date: Tue, 26 Aug 2008 21:10:40 -0400
From: "Alexandre \"Sunny\" Kovalenko" <alex.kovalenko@verizon.net>
In-reply-to: <487983FA.5090807@freebsd.org>
To: Sam Leffler <sam@freebsd.org>
Message-id: <1219799440.1118.18.camel@RabbitsDen>
MIME-version: 1.0
X-Mailer: Evolution 2.22.3.1 FreeBSD GNOME Team Port
Content-type: text/plain; charset=utf-8
Content-transfer-encoding: 8BIT
References: <1210640542.1008.33.camel@RabbitsDen>
	<4828FDE5.8090409@freebsd.org> <1210696676.985.18.camel@RabbitsDen>
	<482DB4FE.3010300@freebsd.org> <1215824846.1074.23.camel@RabbitsDen>
	<4878250E.8040205@freebsd.org> <1215867198.1551.4.camel@RabbitsDen>
	<4878E25E.3000000@freebsd.org> <1215887769.1551.6.camel@RabbitsDen>
	<487983FA.5090807@freebsd.org>
Cc: stable@freebsd.org
Subject: Re: Hard(?) lock
	when	reassociating	ath	with	wpa_supplicant	on	RELENG_7
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Aug 2008 01:11:13 -0000

On Sat, 2008-07-12 at 21:26 -0700, Sam Leffler wrote:
> Alexandre "Sunny" Kovalenko wrote:
> > On Sat, 2008-07-12 at 09:57 -0700, Sam Leffler wrote:
> >   
> >> Alexandre "Sunny" Kovalenko wrote:
> >>     
> >>> On Fri, 2008-07-11 at 20:29 -0700, Sam Leffler wrote:
> >>>   
> >>>       
> >>>> Alexandre "Sunny" Kovalenko wrote:
> >>>>     
> >>>>         
> >>>>> On Fri, 2008-05-16 at 12:23 -0400, Sam Leffler wrote:
> >>>>>       
> >>>>>           
> >>>>>> Alexandre "Sunny" Kovalenko wrote:
> >>>>>>         
> >>>>>>             
> >>>>>>> On Mon, 2008-05-12 at 19:33 -0700, Sam Leffler wrote: 
> >>>>>>>           
> >>>>>>>               
> >>>>>>>> Alexandre "Sunny" Kovalenko wrote:
> >>>>>>>>             
> >>>>>>>>                 
> >>>>>>>>> I seem to be able to lock my machine by going into wpa_cli and asking it
> >>>>>>>>> to 'reassoc'. The reason for question mark after "hard" is that debug
> >>>>>>>>> information (caused by wlandebug and athdebug) is being printed on the
> >>>>>>>>> console. The only way to get machine's attention is to hold power button
> >>>>>>>>> for 8 seconds.
> >>>>>>>>>               
> >>>>>>>>>                   
> >>>>>>>> So this is just livelock due to console debug msgs.
> >>>>>>>>             
> >>>>>>>>                 
> >>>>>>> I am not sure, I have parsed this well enough, so I will try to clarify:
> >>>>>>> machine becomes unresponsive *without* any debugging turned on, to an
> >>>>>>> extent that pressing the power button twice *does not* generate ACPI
> >>>>>>> console message (something to the tune of "going into S5 already --
> >>>>>>> gimme a break"). If I turn ath debugging on, I do see those messages,
> >>>>>>> and only them, scrolling on the console.
> >>>>>>>           
> >>>>>>>               
> >>>>>> Guess I misunderstood you.
> >>>>>>         
> >>>>>>             
> >>>>> I have finally got enough time and equipment to investigate this
> >>>>> further. Here are some conclusions:
> >>>>>
> >>>>> -- at this point (RELENG_7 as of July 9th around 15:30 EST) it is indeed
> >>>>> a livelock.
> >>>>>
> >>>>> -- all system does, is executing ath_intr (if_ath.c) in the tight loop
> >>>>> with the same status -- 0x1000 AKA HAL_INT_MIB. In order to eliminate
> >>>>> possibility that I have caused livelock with the debug messages, I have
> >>>>> put conditional panic() into ath_intr, as soon as sc->sc_stats.ast_mib
> >>>>> reaches 10,000. Without any kind of the debug messages, it will be
> >>>>> triggered within 40-60 seconds after starting of wpa_supplicant.
> >>>>>
> >>>>> -- I suspect that comment below, might not hold true on my equipment
> >>>>> if (status & HAL_INT_MIB) {
> >>>>>    sc->sc_stats.ast_mib++;
> >>>>>    /*
> >>>>>     * Disable interrupts until we service the MIB
> >>>>>     * interrupt; otherwise it will continue to fire.
> >>>>>     */
> >>>>>    ath_hal_intrset(ah, 0);
> >>>>>    /*
> >>>>>     * Let the hal handle the event.  We assume it will <============
> >>>>>     * clear whatever condition caused the interrupt.   <============
> >>>>>     */
> >>>>>     ath_hal_mibevent(ah, &sc->sc_halstats);
> >>>>>     ath_hal_intrset(ah, sc->sc_imask);
> >>>>> }
> >>>>>
> >>>>> My hardware is:
> >>>>> ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413,
> >>>>> RF5413)
> >>>>> ath0: <Atheros 5212> mem 0xedf00000-0xedf0ffff irq 17 at device 0.0 on
> >>>>> pci3
> >>>>> ath0: [ITHREAD]
> >>>>> ath0: using obsoleted if_watchdog interface
> >>>>> ath0: Ethernet address: 00:16:cf:26:2f:3f
> >>>>> ath0: mac 10.3 phy 6.1 radio 10.2
> >>>>>
> >>>>> My wpa_supplicant.conf is:
> >>>>> ctrl_interface=/var/run/wpa_supplicant
> >>>>> ctrl_interface_group=wheel
> >>>>> eapol_version=2
> >>>>>
> >>>>> network={
> >>>>>   ssid="XXXXXXXXXXX"
> >>>>>   scan_ssid=1
> >>>>>   priority=1
> >>>>>   proto=WPA
> >>>>>   pairwise=TKIP
> >>>>>   group=TKIP
> >>>>>   key_mgmt=WPA-PSK
> >>>>>   psk="xxxxxxxxxxxxxxxxxxxxxx"
> >>>>> }
> >>>>>
> >>>>> Access point is Netgear WNDR3300-1B with 11N and 11G SSID set up to
> >>>>> different values. Only 11G SSID is configured in wpa_supplicant.conf. In
> >>>>> the test setup, AP is with 10' (3m) from the laptop.
> >>>>>
> >>>>> AP is successfully used by handful of Windows clients (including this
> >>>>> same laptop) and iBook G4.
> >>>>>
> >>>>> Neither wpa_supplicant with '-d -d' nor wlandebug 0xFFFFFFFF show
> >>>>> anything but normal scan. 
> >>>>>
> >>>>> athdebug 0xFFFFFFFF loops with ath_intr: status 0x1000 until I power
> >>>>> down my laptop.
> >>>>>
> >>>>> I would appreciate any suggestion on what I can investigate further --
> >>>>> at this point I have comfortable console setup and should be able to
> >>>>> field requests for further information much better.
> >>>>>
> >>>>>       
> >>>>>           
> >>>> Are you running powerd?
> >>>>     
> >>>>         
> >>> I do. And I just tried disabling it, and I could not reproduce the
> >>> problem any more. Is there any way to reconcile if_ath with powerd?
> >>>
> >>>   
> >>>       
> >> Don't know.  There appear to be two issues.  When the MIB interrupts 
> >> arrive the kernel may service them w/ the cpu at a reduced clock 
> >> frequency.  Since powered is currently the only mechanism for increasing 
> >> the frequency and it runs in user space it can take a while to bump the 
> >> clock rate leading to livelock (because the logic to reduce the _cause_ 
> >> of the MIB interrupt takes a long time to run).  John Baldwin suggested 
> >> raising the clock frequency when handling interrupts in the kernel but 
> >> nothing has been done to make that happen.
> >>
> >> Separately there is a question as to why the MIB interrupts are 
> >> happening at all.  This is possibly due to misprogramming of the 
> >> baseband h/w in the ath card.  Unfortunately I've been trying to get 
> >> Atheros to help understand/resolve this question for a very long time 
> >> (as their code also exhibits this behaviour) but they've been 
> >> unresponsive.  I have some experimental code to address this in new hal 
> >> versions (such as 0.10.5.6 available in http://www.freebsd.org/~sam) but 
> >> apparently it does not entirely fix the problem.
> >>     
> > Would it be of any value to you, if I build the new hal and see what
> > happen? I can live without powerd as the workaround, but I'd rather help
> > if I can.
> >
> >   
> Testing the latest hal is always useful but I've not MFC'd many ath 
> changes to RELENG_7 and some people have reported problems on RELENG_7 
> w/ the new hal that do not occur on HEAD.

I have finally carved enough quality time to see whether new HAL makes
any difference WRT the issue, and I am happy to report that, with the
HAL version 0.10.5.6, I could no longer reproduce it. I am seeing whole
lot of messages

update_stats: bogus ndx0 -1, max 12, mode 3

and 

bogus rix 255, max 12, mode 3


scrolling by, but I don't know whether these are related to the problem,
I have seen with the previous version of the HAL, or the downside of
using new HAL with RELENG_7.

Reverting the HAL brings about lockups pretty reliably.

At this point I am very happy with my machine, however, if there is any
additional information that I can provide, or any patches to test, I
will be more than happy to oblige.

Thank you very much for doing this work.


-- 
Alexandre "Sunny" Kovalenko (Олександр Коваленко)