From owner-freebsd-stable@FreeBSD.ORG Sat Jul 12 18:36:17 2008 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 87F67106564A; Sat, 12 Jul 2008 18:36:17 +0000 (UTC) (envelope-from alex.kovalenko@verizon.net) Received: from vms044pub.verizon.net (vms044pub.verizon.net [206.46.252.44]) by mx1.freebsd.org (Postfix) with ESMTP id 69BC38FC08; Sat, 12 Jul 2008 18:36:17 +0000 (UTC) (envelope-from alex.kovalenko@verizon.net) Received: from [10.0.3.231] ([70.111.175.14]) by vms044.mailsrvcs.net (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPA id <0K3W007CXP0F1AO4@vms044.mailsrvcs.net>; Sat, 12 Jul 2008 13:36:17 -0500 (CDT) Date: Sat, 12 Jul 2008 14:36:09 -0400 From: "Alexandre \"Sunny\" Kovalenko" In-reply-to: <4878E25E.3000000@freebsd.org> To: Sam Leffler Message-id: <1215887769.1551.6.camel@RabbitsDen> MIME-version: 1.0 X-Mailer: Evolution 2.22.3.1 FreeBSD GNOME Team Port Content-type: text/plain; charset=utf-8 Content-transfer-encoding: 8BIT References: <1210640542.1008.33.camel@RabbitsDen> <4828FDE5.8090409@freebsd.org> <1210696676.985.18.camel@RabbitsDen> <482DB4FE.3010300@freebsd.org> <1215824846.1074.23.camel@RabbitsDen> <4878250E.8040205@freebsd.org> <1215867198.1551.4.camel@RabbitsDen> <4878E25E.3000000@freebsd.org> Cc: stable@freebsd.org Subject: Re: Hard(?) lock when reassociating ath with wpa_supplicant on RELENG_7 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Jul 2008 18:36:17 -0000 On Sat, 2008-07-12 at 09:57 -0700, Sam Leffler wrote: > Alexandre "Sunny" Kovalenko wrote: > > On Fri, 2008-07-11 at 20:29 -0700, Sam Leffler wrote: > > > >> Alexandre "Sunny" Kovalenko wrote: > >> > >>> On Fri, 2008-05-16 at 12:23 -0400, Sam Leffler wrote: > >>> > >>>> Alexandre "Sunny" Kovalenko wrote: > >>>> > >>>>> On Mon, 2008-05-12 at 19:33 -0700, Sam Leffler wrote: > >>>>> > >>>>>> Alexandre "Sunny" Kovalenko wrote: > >>>>>> > >>>>>>> I seem to be able to lock my machine by going into wpa_cli and asking it > >>>>>>> to 'reassoc'. The reason for question mark after "hard" is that debug > >>>>>>> information (caused by wlandebug and athdebug) is being printed on the > >>>>>>> console. The only way to get machine's attention is to hold power button > >>>>>>> for 8 seconds. > >>>>>>> > >>>>>> So this is just livelock due to console debug msgs. > >>>>>> > >>>>> I am not sure, I have parsed this well enough, so I will try to clarify: > >>>>> machine becomes unresponsive *without* any debugging turned on, to an > >>>>> extent that pressing the power button twice *does not* generate ACPI > >>>>> console message (something to the tune of "going into S5 already -- > >>>>> gimme a break"). If I turn ath debugging on, I do see those messages, > >>>>> and only them, scrolling on the console. > >>>>> > >>>> Guess I misunderstood you. > >>>> > >>> I have finally got enough time and equipment to investigate this > >>> further. Here are some conclusions: > >>> > >>> -- at this point (RELENG_7 as of July 9th around 15:30 EST) it is indeed > >>> a livelock. > >>> > >>> -- all system does, is executing ath_intr (if_ath.c) in the tight loop > >>> with the same status -- 0x1000 AKA HAL_INT_MIB. In order to eliminate > >>> possibility that I have caused livelock with the debug messages, I have > >>> put conditional panic() into ath_intr, as soon as sc->sc_stats.ast_mib > >>> reaches 10,000. Without any kind of the debug messages, it will be > >>> triggered within 40-60 seconds after starting of wpa_supplicant. > >>> > >>> -- I suspect that comment below, might not hold true on my equipment > >>> if (status & HAL_INT_MIB) { > >>> sc->sc_stats.ast_mib++; > >>> /* > >>> * Disable interrupts until we service the MIB > >>> * interrupt; otherwise it will continue to fire. > >>> */ > >>> ath_hal_intrset(ah, 0); > >>> /* > >>> * Let the hal handle the event. We assume it will <============ > >>> * clear whatever condition caused the interrupt. <============ > >>> */ > >>> ath_hal_mibevent(ah, &sc->sc_halstats); > >>> ath_hal_intrset(ah, sc->sc_imask); > >>> } > >>> > >>> My hardware is: > >>> ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, > >>> RF5413) > >>> ath0: mem 0xedf00000-0xedf0ffff irq 17 at device 0.0 on > >>> pci3 > >>> ath0: [ITHREAD] > >>> ath0: using obsoleted if_watchdog interface > >>> ath0: Ethernet address: 00:16:cf:26:2f:3f > >>> ath0: mac 10.3 phy 6.1 radio 10.2 > >>> > >>> My wpa_supplicant.conf is: > >>> ctrl_interface=/var/run/wpa_supplicant > >>> ctrl_interface_group=wheel > >>> eapol_version=2 > >>> > >>> network={ > >>> ssid="XXXXXXXXXXX" > >>> scan_ssid=1 > >>> priority=1 > >>> proto=WPA > >>> pairwise=TKIP > >>> group=TKIP > >>> key_mgmt=WPA-PSK > >>> psk="xxxxxxxxxxxxxxxxxxxxxx" > >>> } > >>> > >>> Access point is Netgear WNDR3300-1B with 11N and 11G SSID set up to > >>> different values. Only 11G SSID is configured in wpa_supplicant.conf. In > >>> the test setup, AP is with 10' (3m) from the laptop. > >>> > >>> AP is successfully used by handful of Windows clients (including this > >>> same laptop) and iBook G4. > >>> > >>> Neither wpa_supplicant with '-d -d' nor wlandebug 0xFFFFFFFF show > >>> anything but normal scan. > >>> > >>> athdebug 0xFFFFFFFF loops with ath_intr: status 0x1000 until I power > >>> down my laptop. > >>> > >>> I would appreciate any suggestion on what I can investigate further -- > >>> at this point I have comfortable console setup and should be able to > >>> field requests for further information much better. > >>> > >>> > >> Are you running powerd? > >> > > I do. And I just tried disabling it, and I could not reproduce the > > problem any more. Is there any way to reconcile if_ath with powerd? > > > > > > Don't know. There appear to be two issues. When the MIB interrupts > arrive the kernel may service them w/ the cpu at a reduced clock > frequency. Since powered is currently the only mechanism for increasing > the frequency and it runs in user space it can take a while to bump the > clock rate leading to livelock (because the logic to reduce the _cause_ > of the MIB interrupt takes a long time to run). John Baldwin suggested > raising the clock frequency when handling interrupts in the kernel but > nothing has been done to make that happen. > > Separately there is a question as to why the MIB interrupts are > happening at all. This is possibly due to misprogramming of the > baseband h/w in the ath card. Unfortunately I've been trying to get > Atheros to help understand/resolve this question for a very long time > (as their code also exhibits this behaviour) but they've been > unresponsive. I have some experimental code to address this in new hal > versions (such as 0.10.5.6 available in http://www.freebsd.org/~sam) but > apparently it does not entirely fix the problem. Would it be of any value to you, if I build the new hal and see what happen? I can live without powerd as the workaround, but I'd rather help if I can. > > Sam > -- Alexandre "Sunny" Kovalenko (Олександр Коваленко)