From owner-freebsd-stable@FreeBSD.ORG Sun Jul 13 04:26:37 2008 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 432E61065680 for ; Sun, 13 Jul 2008 04:26:37 +0000 (UTC) (envelope-from sam@freebsd.org) Received: from ebb.errno.com (ebb.errno.com [69.12.149.25]) by mx1.freebsd.org (Postfix) with ESMTP id 0CAF48FC08 for ; Sun, 13 Jul 2008 04:26:36 +0000 (UTC) (envelope-from sam@freebsd.org) Received: from trouble.errno.com (trouble.errno.com [10.0.0.248]) (authenticated bits=0) by ebb.errno.com (8.13.6/8.12.6) with ESMTP id m6D4QYTO055050 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 12 Jul 2008 21:26:36 -0700 (PDT) (envelope-from sam@freebsd.org) Message-ID: <487983FA.5090807@freebsd.org> Date: Sat, 12 Jul 2008 21:26:34 -0700 From: Sam Leffler Organization: FreeBSD Project User-Agent: Thunderbird 2.0.0.9 (X11/20071125) MIME-Version: 1.0 To: "Alexandre \"Sunny\" Kovalenko" References: <1210640542.1008.33.camel@RabbitsDen> <4828FDE5.8090409@freebsd.org> <1210696676.985.18.camel@RabbitsDen> <482DB4FE.3010300@freebsd.org> <1215824846.1074.23.camel@RabbitsDen> <4878250E.8040205@freebsd.org> <1215867198.1551.4.camel@RabbitsDen> <4878E25E.3000000@freebsd.org> <1215887769.1551.6.camel@RabbitsDen> In-Reply-To: <1215887769.1551.6.camel@RabbitsDen> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-DCC--Metrics: ebb.errno.com; whitelist Cc: stable@freebsd.org Subject: Re: Hard(?) lock when reassociating ath with wpa_supplicant on RELENG_7 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 13 Jul 2008 04:26:37 -0000 Alexandre "Sunny" Kovalenko wrote: > On Sat, 2008-07-12 at 09:57 -0700, Sam Leffler wrote: > >> Alexandre "Sunny" Kovalenko wrote: >> >>> On Fri, 2008-07-11 at 20:29 -0700, Sam Leffler wrote: >>> >>> >>>> Alexandre "Sunny" Kovalenko wrote: >>>> >>>> >>>>> On Fri, 2008-05-16 at 12:23 -0400, Sam Leffler wrote: >>>>> >>>>> >>>>>> Alexandre "Sunny" Kovalenko wrote: >>>>>> >>>>>> >>>>>>> On Mon, 2008-05-12 at 19:33 -0700, Sam Leffler wrote: >>>>>>> >>>>>>> >>>>>>>> Alexandre "Sunny" Kovalenko wrote: >>>>>>>> >>>>>>>> >>>>>>>>> I seem to be able to lock my machine by going into wpa_cli and asking it >>>>>>>>> to 'reassoc'. The reason for question mark after "hard" is that debug >>>>>>>>> information (caused by wlandebug and athdebug) is being printed on the >>>>>>>>> console. The only way to get machine's attention is to hold power button >>>>>>>>> for 8 seconds. >>>>>>>>> >>>>>>>>> >>>>>>>> So this is just livelock due to console debug msgs. >>>>>>>> >>>>>>>> >>>>>>> I am not sure, I have parsed this well enough, so I will try to clarify: >>>>>>> machine becomes unresponsive *without* any debugging turned on, to an >>>>>>> extent that pressing the power button twice *does not* generate ACPI >>>>>>> console message (something to the tune of "going into S5 already -- >>>>>>> gimme a break"). If I turn ath debugging on, I do see those messages, >>>>>>> and only them, scrolling on the console. >>>>>>> >>>>>>> >>>>>> Guess I misunderstood you. >>>>>> >>>>>> >>>>> I have finally got enough time and equipment to investigate this >>>>> further. Here are some conclusions: >>>>> >>>>> -- at this point (RELENG_7 as of July 9th around 15:30 EST) it is indeed >>>>> a livelock. >>>>> >>>>> -- all system does, is executing ath_intr (if_ath.c) in the tight loop >>>>> with the same status -- 0x1000 AKA HAL_INT_MIB. In order to eliminate >>>>> possibility that I have caused livelock with the debug messages, I have >>>>> put conditional panic() into ath_intr, as soon as sc->sc_stats.ast_mib >>>>> reaches 10,000. Without any kind of the debug messages, it will be >>>>> triggered within 40-60 seconds after starting of wpa_supplicant. >>>>> >>>>> -- I suspect that comment below, might not hold true on my equipment >>>>> if (status & HAL_INT_MIB) { >>>>> sc->sc_stats.ast_mib++; >>>>> /* >>>>> * Disable interrupts until we service the MIB >>>>> * interrupt; otherwise it will continue to fire. >>>>> */ >>>>> ath_hal_intrset(ah, 0); >>>>> /* >>>>> * Let the hal handle the event. We assume it will <============ >>>>> * clear whatever condition caused the interrupt. <============ >>>>> */ >>>>> ath_hal_mibevent(ah, &sc->sc_halstats); >>>>> ath_hal_intrset(ah, sc->sc_imask); >>>>> } >>>>> >>>>> My hardware is: >>>>> ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, >>>>> RF5413) >>>>> ath0: mem 0xedf00000-0xedf0ffff irq 17 at device 0.0 on >>>>> pci3 >>>>> ath0: [ITHREAD] >>>>> ath0: using obsoleted if_watchdog interface >>>>> ath0: Ethernet address: 00:16:cf:26:2f:3f >>>>> ath0: mac 10.3 phy 6.1 radio 10.2 >>>>> >>>>> My wpa_supplicant.conf is: >>>>> ctrl_interface=/var/run/wpa_supplicant >>>>> ctrl_interface_group=wheel >>>>> eapol_version=2 >>>>> >>>>> network={ >>>>> ssid="XXXXXXXXXXX" >>>>> scan_ssid=1 >>>>> priority=1 >>>>> proto=WPA >>>>> pairwise=TKIP >>>>> group=TKIP >>>>> key_mgmt=WPA-PSK >>>>> psk="xxxxxxxxxxxxxxxxxxxxxx" >>>>> } >>>>> >>>>> Access point is Netgear WNDR3300-1B with 11N and 11G SSID set up to >>>>> different values. Only 11G SSID is configured in wpa_supplicant.conf. In >>>>> the test setup, AP is with 10' (3m) from the laptop. >>>>> >>>>> AP is successfully used by handful of Windows clients (including this >>>>> same laptop) and iBook G4. >>>>> >>>>> Neither wpa_supplicant with '-d -d' nor wlandebug 0xFFFFFFFF show >>>>> anything but normal scan. >>>>> >>>>> athdebug 0xFFFFFFFF loops with ath_intr: status 0x1000 until I power >>>>> down my laptop. >>>>> >>>>> I would appreciate any suggestion on what I can investigate further -- >>>>> at this point I have comfortable console setup and should be able to >>>>> field requests for further information much better. >>>>> >>>>> >>>>> >>>> Are you running powerd? >>>> >>>> >>> I do. And I just tried disabling it, and I could not reproduce the >>> problem any more. Is there any way to reconcile if_ath with powerd? >>> >>> >>> >> Don't know. There appear to be two issues. When the MIB interrupts >> arrive the kernel may service them w/ the cpu at a reduced clock >> frequency. Since powered is currently the only mechanism for increasing >> the frequency and it runs in user space it can take a while to bump the >> clock rate leading to livelock (because the logic to reduce the _cause_ >> of the MIB interrupt takes a long time to run). John Baldwin suggested >> raising the clock frequency when handling interrupts in the kernel but >> nothing has been done to make that happen. >> >> Separately there is a question as to why the MIB interrupts are >> happening at all. This is possibly due to misprogramming of the >> baseband h/w in the ath card. Unfortunately I've been trying to get >> Atheros to help understand/resolve this question for a very long time >> (as their code also exhibits this behaviour) but they've been >> unresponsive. I have some experimental code to address this in new hal >> versions (such as 0.10.5.6 available in http://www.freebsd.org/~sam) but >> apparently it does not entirely fix the problem. >> > Would it be of any value to you, if I build the new hal and see what > happen? I can live without powerd as the workaround, but I'd rather help > if I can. > > Testing the latest hal is always useful but I've not MFC'd many ath changes to RELENG_7 and some people have reported problems on RELENG_7 w/ the new hal that do not occur on HEAD. Sam