From owner-freebsd-stable@FreeBSD.ORG Wed Aug 27 01:11:13 2008 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 046651065674; Wed, 27 Aug 2008 01:11:13 +0000 (UTC) (envelope-from alex.kovalenko@verizon.net) Received: from vms044pub.verizon.net (vms044pub.verizon.net [206.46.252.44]) by mx1.freebsd.org (Postfix) with ESMTP id BB3B08FC23; Wed, 27 Aug 2008 01:11:12 +0000 (UTC) (envelope-from alex.kovalenko@verizon.net) Received: from [10.0.3.231] ([70.111.10.128]) by vms044.mailsrvcs.net (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPA id <0K6800H8YJA0I168@vms044.mailsrvcs.net>; Tue, 26 Aug 2008 20:10:49 -0500 (CDT) Date: Tue, 26 Aug 2008 21:10:40 -0400 From: "Alexandre \"Sunny\" Kovalenko" In-reply-to: <487983FA.5090807@freebsd.org> To: Sam Leffler Message-id: <1219799440.1118.18.camel@RabbitsDen> MIME-version: 1.0 X-Mailer: Evolution 2.22.3.1 FreeBSD GNOME Team Port Content-type: text/plain; charset=utf-8 Content-transfer-encoding: 8BIT References: <1210640542.1008.33.camel@RabbitsDen> <4828FDE5.8090409@freebsd.org> <1210696676.985.18.camel@RabbitsDen> <482DB4FE.3010300@freebsd.org> <1215824846.1074.23.camel@RabbitsDen> <4878250E.8040205@freebsd.org> <1215867198.1551.4.camel@RabbitsDen> <4878E25E.3000000@freebsd.org> <1215887769.1551.6.camel@RabbitsDen> <487983FA.5090807@freebsd.org> Cc: stable@freebsd.org Subject: Re: Hard(?) lock when reassociating ath with wpa_supplicant on RELENG_7 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Aug 2008 01:11:13 -0000 On Sat, 2008-07-12 at 21:26 -0700, Sam Leffler wrote: > Alexandre "Sunny" Kovalenko wrote: > > On Sat, 2008-07-12 at 09:57 -0700, Sam Leffler wrote: > > > >> Alexandre "Sunny" Kovalenko wrote: > >> > >>> On Fri, 2008-07-11 at 20:29 -0700, Sam Leffler wrote: > >>> > >>> > >>>> Alexandre "Sunny" Kovalenko wrote: > >>>> > >>>> > >>>>> On Fri, 2008-05-16 at 12:23 -0400, Sam Leffler wrote: > >>>>> > >>>>> > >>>>>> Alexandre "Sunny" Kovalenko wrote: > >>>>>> > >>>>>> > >>>>>>> On Mon, 2008-05-12 at 19:33 -0700, Sam Leffler wrote: > >>>>>>> > >>>>>>> > >>>>>>>> Alexandre "Sunny" Kovalenko wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>>> I seem to be able to lock my machine by going into wpa_cli and asking it > >>>>>>>>> to 'reassoc'. The reason for question mark after "hard" is that debug > >>>>>>>>> information (caused by wlandebug and athdebug) is being printed on the > >>>>>>>>> console. The only way to get machine's attention is to hold power button > >>>>>>>>> for 8 seconds. > >>>>>>>>> > >>>>>>>>> > >>>>>>>> So this is just livelock due to console debug msgs. > >>>>>>>> > >>>>>>>> > >>>>>>> I am not sure, I have parsed this well enough, so I will try to clarify: > >>>>>>> machine becomes unresponsive *without* any debugging turned on, to an > >>>>>>> extent that pressing the power button twice *does not* generate ACPI > >>>>>>> console message (something to the tune of "going into S5 already -- > >>>>>>> gimme a break"). If I turn ath debugging on, I do see those messages, > >>>>>>> and only them, scrolling on the console. > >>>>>>> > >>>>>>> > >>>>>> Guess I misunderstood you. > >>>>>> > >>>>>> > >>>>> I have finally got enough time and equipment to investigate this > >>>>> further. Here are some conclusions: > >>>>> > >>>>> -- at this point (RELENG_7 as of July 9th around 15:30 EST) it is indeed > >>>>> a livelock. > >>>>> > >>>>> -- all system does, is executing ath_intr (if_ath.c) in the tight loop > >>>>> with the same status -- 0x1000 AKA HAL_INT_MIB. In order to eliminate > >>>>> possibility that I have caused livelock with the debug messages, I have > >>>>> put conditional panic() into ath_intr, as soon as sc->sc_stats.ast_mib > >>>>> reaches 10,000. Without any kind of the debug messages, it will be > >>>>> triggered within 40-60 seconds after starting of wpa_supplicant. > >>>>> > >>>>> -- I suspect that comment below, might not hold true on my equipment > >>>>> if (status & HAL_INT_MIB) { > >>>>> sc->sc_stats.ast_mib++; > >>>>> /* > >>>>> * Disable interrupts until we service the MIB > >>>>> * interrupt; otherwise it will continue to fire. > >>>>> */ > >>>>> ath_hal_intrset(ah, 0); > >>>>> /* > >>>>> * Let the hal handle the event. We assume it will <============ > >>>>> * clear whatever condition caused the interrupt. <============ > >>>>> */ > >>>>> ath_hal_mibevent(ah, &sc->sc_halstats); > >>>>> ath_hal_intrset(ah, sc->sc_imask); > >>>>> } > >>>>> > >>>>> My hardware is: > >>>>> ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, > >>>>> RF5413) > >>>>> ath0: mem 0xedf00000-0xedf0ffff irq 17 at device 0.0 on > >>>>> pci3 > >>>>> ath0: [ITHREAD] > >>>>> ath0: using obsoleted if_watchdog interface > >>>>> ath0: Ethernet address: 00:16:cf:26:2f:3f > >>>>> ath0: mac 10.3 phy 6.1 radio 10.2 > >>>>> > >>>>> My wpa_supplicant.conf is: > >>>>> ctrl_interface=/var/run/wpa_supplicant > >>>>> ctrl_interface_group=wheel > >>>>> eapol_version=2 > >>>>> > >>>>> network={ > >>>>> ssid="XXXXXXXXXXX" > >>>>> scan_ssid=1 > >>>>> priority=1 > >>>>> proto=WPA > >>>>> pairwise=TKIP > >>>>> group=TKIP > >>>>> key_mgmt=WPA-PSK > >>>>> psk="xxxxxxxxxxxxxxxxxxxxxx" > >>>>> } > >>>>> > >>>>> Access point is Netgear WNDR3300-1B with 11N and 11G SSID set up to > >>>>> different values. Only 11G SSID is configured in wpa_supplicant.conf. In > >>>>> the test setup, AP is with 10' (3m) from the laptop. > >>>>> > >>>>> AP is successfully used by handful of Windows clients (including this > >>>>> same laptop) and iBook G4. > >>>>> > >>>>> Neither wpa_supplicant with '-d -d' nor wlandebug 0xFFFFFFFF show > >>>>> anything but normal scan. > >>>>> > >>>>> athdebug 0xFFFFFFFF loops with ath_intr: status 0x1000 until I power > >>>>> down my laptop. > >>>>> > >>>>> I would appreciate any suggestion on what I can investigate further -- > >>>>> at this point I have comfortable console setup and should be able to > >>>>> field requests for further information much better. > >>>>> > >>>>> > >>>>> > >>>> Are you running powerd? > >>>> > >>>> > >>> I do. And I just tried disabling it, and I could not reproduce the > >>> problem any more. Is there any way to reconcile if_ath with powerd? > >>> > >>> > >>> > >> Don't know. There appear to be two issues. When the MIB interrupts > >> arrive the kernel may service them w/ the cpu at a reduced clock > >> frequency. Since powered is currently the only mechanism for increasing > >> the frequency and it runs in user space it can take a while to bump the > >> clock rate leading to livelock (because the logic to reduce the _cause_ > >> of the MIB interrupt takes a long time to run). John Baldwin suggested > >> raising the clock frequency when handling interrupts in the kernel but > >> nothing has been done to make that happen. > >> > >> Separately there is a question as to why the MIB interrupts are > >> happening at all. This is possibly due to misprogramming of the > >> baseband h/w in the ath card. Unfortunately I've been trying to get > >> Atheros to help understand/resolve this question for a very long time > >> (as their code also exhibits this behaviour) but they've been > >> unresponsive. I have some experimental code to address this in new hal > >> versions (such as 0.10.5.6 available in http://www.freebsd.org/~sam) but > >> apparently it does not entirely fix the problem. > >> > > Would it be of any value to you, if I build the new hal and see what > > happen? I can live without powerd as the workaround, but I'd rather help > > if I can. > > > > > Testing the latest hal is always useful but I've not MFC'd many ath > changes to RELENG_7 and some people have reported problems on RELENG_7 > w/ the new hal that do not occur on HEAD. I have finally carved enough quality time to see whether new HAL makes any difference WRT the issue, and I am happy to report that, with the HAL version 0.10.5.6, I could no longer reproduce it. I am seeing whole lot of messages update_stats: bogus ndx0 -1, max 12, mode 3 and bogus rix 255, max 12, mode 3 scrolling by, but I don't know whether these are related to the problem, I have seen with the previous version of the HAL, or the downside of using new HAL with RELENG_7. Reverting the HAL brings about lockups pretty reliably. At this point I am very happy with my machine, however, if there is any additional information that I can provide, or any patches to test, I will be more than happy to oblige. Thank you very much for doing this work. -- Alexandre "Sunny" Kovalenko (Олександр Коваленко)