From owner-freebsd-stable@FreeBSD.ORG Fri Apr 6 10:17:19 2012 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2B1B0106564A; Fri, 6 Apr 2012 10:17:19 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com [209.85.212.172]) by mx1.freebsd.org (Postfix) with ESMTP id 332578FC08; Fri, 6 Apr 2012 10:17:18 +0000 (UTC) Received: by wibhj6 with SMTP id hj6so337090wib.13 for ; Fri, 06 Apr 2012 03:17:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=Dwa2bmPA5hjiEWgEEyjsap/2xtyLmQ9+f0sEE+s3PuM=; b=0G3RfwEKkxSlBM56X5uDqtAlhMNESgvGnPCl0sdXALBUibxz0E1wxQvVZuDhzmdQDY I4W74lil0wRTydcuIpeHXFfdAmTbsFJTh2am70Zk7PZJxL8rGjmqi2dDS9dcJkCzJxiF PoYjFSwFcnchOaFAOGOYbpqA58G5U+IGas+te2QwqkCU9upMRKdOLDMswDojERUHwHyY Ok70QbnJ443PDmApv7oaaHAz8HZHzBFOeUV6kNbjZj3WduI07Y4+ujUaorkrcOes3ODz Jg02Q3cbJ/xvNZ+mynCynzt3gupYMQgw8s2rYaOahwRRpLLKK/4BUY5JE8MV/nXbUX4f uhog== Received: by 10.180.101.230 with SMTP id fj6mr18667536wib.13.1333707436263; Fri, 06 Apr 2012 03:17:16 -0700 (PDT) Received: from mavbook2.mavhome.dp.ua (pc.mavhome.dp.ua. [212.86.226.226]) by mx.google.com with ESMTPS id o2sm8494973wiv.11.2012.04.06.03.17.13 (version=SSLv3 cipher=OTHER); Fri, 06 Apr 2012 03:17:14 -0700 (PDT) Sender: Alexander Motin Message-ID: <4F7EC2A8.3000001@FreeBSD.org> Date: Fri, 06 Apr 2012 13:17:12 +0300 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:10.0.2) Gecko/20120226 Thunderbird/10.0.2 MIME-Version: 1.0 To: John Baldwin References: <201204041624.q34GOXvc063506@ambrisko.com> <201204041447.49398.jhb@freebsd.org> In-Reply-To: <201204041447.49398.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-stable@freebsd.org, scottl@freebsd.org Subject: Re: [stable-ish 9] Dell R815 ipmi(4) attach failure X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 06 Apr 2012 10:17:19 -0000 On 04/04/12 21:47, John Baldwin wrote: > On Wednesday, April 04, 2012 12:24:33 pm Doug Ambrisko wrote: >> John Baldwin writes: >> | On Tuesday, April 03, 2012 12:37:50 pm Doug Ambrisko wrote: >> |> John Baldwin writes: >> |> | On Monday, April 02, 2012 7:27:13 pm Doug Ambrisko wrote: >> |> |> Doug Ambrisko writes: >> |> |> | John Baldwin writes: >> |> |> | | On Saturday, March 31, 2012 3:25:48 pm Doug Ambrisko wrote: >> |> |> | |> Sean Bruno writes: >> |> |> | |> | Noting a failure to attach to the onboard IPMI controller > with >> | this >> |> | dell >> |> |> | |> | R815. Not sure what to start poking at and thought I'd > though >> | this >> |> | over >> |> |> | |> | here for comment. >> |> |> | |> | >> |> |> | |> | -bash-4.2$ dmesg |grep ipmi >> |> |> | |> | ipmi0: KCS mode found at io 0xca8 on acpi >> |> |> | |> | ipmi1: on isa0 >> |> |> | |> | device_attach: ipmi1 attach returned 16 >> |> |> | |> | ipmi1: on isa0 >> |> |> | |> | device_attach: ipmi1 attach returned 16 >> |> |> | |> | ipmi0: Timed out waiting for GET_DEVICE_ID >> |> |> | |> >> |> |> | |> I've run into this recently. A quick hack to fix it is: >> |> |> | |> >> |> |> | |> Index: ipmi.c >> |> |> | |> >> | =================================================================== >> |> |> | |> RCS file: /cvs/src/sys/dev/ipmi/ipmi.c,v >> |> |> | |> retrieving revision 1.14 >> |> |> | |> diff -u -p -r1.14 ipmi.c >> |> |> | |> --- ipmi.c 14 Apr 2011 07:14:22 -0000 1.14 >> |> |> | |> +++ ipmi.c 31 Mar 2012 19:18:35 -0000 >> |> |> | |> @@ -695,7 +695,6 @@ ipmi_startup(void *arg) >> |> |> | |> if (error == EWOULDBLOCK) { >> |> |> | |> device_printf(dev, "Timed out waiting for >> | GET_DEVICE_ID\n"); >> |> |> | |> ipmi_free_request(req); >> |> |> | |> - return; >> |> |> | |> } else if (error) { >> |> |> | |> device_printf(dev, "Failed GET_DEVICE_ID: %d\n", > error); >> |> |> | |> ipmi_free_request(req); >> |> |> | |> >> |> |> | |> The issue is that the wakeup doesn't actually wake up the > msleep >> |> |> | |> in ipmi_submit_driver_request. The error being reported is > that >> |> |> | |> the msleep timed out. This doesn't seem to be critical > problem >> |> |> | |> since after this things seemed to work work. I saw this on > 9.X. >> |> |> | |> Haven't seen it on 8.2. Not sure about -current. >> |> |> | |> >> |> |> | |> It doesn't happen on all machines. >> |> |> | | >> |> |> | | Hmm, are you seeing the KCS thread manage the request but the >> | wakeup() >> |> | is >> |> |> | | lost? >> |> |> | >> |> |> | It was a couple of weeks ago that I played with it. I put > printf's >> |> |> | around the msleep and wakeup. I saw the wakeup called but the > sleep >> |> |> | not get it. I can try the test again later today. Right now my > main >> |> |> | work machine is recovering from a power outage. This was with 9.0 >> |> |> | when I first saw it. This issue seems to only happen at boot > time. >> |> |> | If I kldload the module after the system is booted then it seems > to >> | work >> |> |> | okay. The KCS part was working fine and got the data okay from > the >> |> |> | request. I haven't seen or heard any issues with 8.2. >> |> |> >> |> |> With -current I patched ipmi.c with: >> |> |> Index: ipmi.c >> |> |> =================================================================== >> |> |> --- ipmi.c (revision 233806) >> |> |> +++ ipmi.c (working copy) >> |> |> @@ -523,7 +523,11 @@ >> |> |> * waiter that we awaken. >> |> |> */ >> |> |> if (req->ir_owner == NULL) >> |> |> +{ >> |> |> +device_printf(sc->ipmi_dev, "DEBUG %s %d before wakeup >> |> | %d\n",__FUNCTION__,__LINE__,ticks); >> |> |> wakeup(req); >> |> |> +device_printf(sc->ipmi_dev, "DEBUG %s %d after wakeup >> |> | %d\n",__FUNCTION__,__LINE__,ticks); >> |> |> +} >> |> |> else { >> |> |> dev = req->ir_owner; >> |> |> TAILQ_INSERT_TAIL(&dev->ipmi_completed_requests, > req, >> |> | ir_link); >> |> |> @@ -543,7 +547,11 @@ >> |> |> IPMI_LOCK(sc); >> |> |> error = sc->ipmi_enqueue_request(sc, req); >> |> |> if (error == 0) >> |> |> +{ >> |> |> +device_printf(sc->ipmi_dev, "DEBUG %s %d before msleep >> |> | %d\n",__FUNCTION__,__LINE__,ticks); >> |> |> error = msleep(req,&sc->ipmi_lock, 0, "ipmireq", > timo); >> |> |> +device_printf(sc->ipmi_dev, "DEBUG %s %d after msleep >> |> | %d\n",__FUNCTION__,__LINE__,ticks); >> |> |> +} >> |> |> if (error == 0) >> |> |> error = req->ir_error; >> |> |> IPMI_UNLOCK(sc); >> |> |> @@ -695,8 +703,11 @@ >> |> |> error = ipmi_submit_driver_request(sc, req, MAX_TIMEOUT); >> |> |> if (error == EWOULDBLOCK) { >> |> |> device_printf(dev, "Timed out waiting for >> | GET_DEVICE_ID\n"); >> |> |> + printf("DJA\n"); >> |> |> +/* >> |> |> ipmi_free_request(req); >> |> |> return; >> |> |> +*/ >> |> |> } else if (error) { >> |> |> device_printf(dev, "Failed GET_DEVICE_ID: %d\n", > error); >> |> |> ipmi_free_request(req); >> |> |> >> |> |> and get >> |> |> # dmesg | grep ipmi >> |> |> ipmi0: KCS mode found at io 0xca8 on acpi >> |> |> ipmi1: on isa0 >> |> |> device_attach: ipmi1 attach returned 16 >> |> |> ipmi1: on isa0 >> |> |> device_attach: ipmi1 attach returned 16 >> |> |> ipmi0: DEBUG ipmi_submit_driver_request 551 before msleep 2 >> |> |> ipmi0: DEBUG ipmi_complete_request 527 before wakeup 6201 >> |> |> ipmi0: DEBUG ipmi_complete_request 529 after wakeup 6263 >> |> |> ipmi0: DEBUG ipmi_submit_driver_request 553 after msleep 6323 >> |> | >> |> | Actually, can you compile with: >> |> | >> |> | options KTR >> |> | options KTR_COMPILE=KTR_SCHED >> |> | options KTR_MASK=KTR_SCHED >> |> | >> |> | and then add a temporary hack to ipmi.c to set ktr_mask to 0 after >> |> | ipmi_submit_driver_request() returns in ipmi_startup()? You can >> |> | then use 'ktrdump -ct' after boot to capture a log of what the > scheduler >> |> | did including if it timed out the sleep, etc. I think this would be >> |> | useful for figuring out what went wrong. It does seem that it timed >> |> | out after 3 seconds. >> |> >> |> Assuming I didn't mess up, the log should be at: >> |> http://people.freebsd.org/~ambrisko/ipmi_ktr_dump.txt >> |> again, I using ipmi(4) as module loaded via the loader. >> | >> | If you use "-ct" then you get a file you can feed into schedgraph. >> | However, just reading the log, it seems that IRQ 20 keeps preempting >> | the KCS worker thread preventing it from getting anything done. Also, >> | there seem to be a lot of threads on CPU 0's runqueue waiting for a >> | chance to run (load average of 12 or 13 the entire time). You can try >> | just bumping up the max timeout from 3 seconds to higher perhaps. Not >> | sure why IRQ 20 keeps firing though. It might be related to USB, so >> | you could try fiddling with USB options in the BIOS perhaps, or disabling >> | the USB drivers to see if that fixes IPMI. >> >> Tried without USB in kernel: >> http://people.freebsd.org/~ambrisko/ipmi_ktr_dump_no_usb.txt > > Hmm, it's still just running constantly (note that the idle thread is > _never_ scheduled). The lion's share of the time seems to be spent in > "xpt_thrd". Note that there are several places where nothing happens except > that "xpt_thrd" runs constantly (spinning) during 10's of statclock ticks. I > would maybe start debugging that to see what in the world it is doing. Maybe > it is polling some hardware down in xpt_action() (i.e., xpt_action() for a > single bus called down into a driver and it is just spinning using polling > instead of sleeping and waiting for an interrupt). "xpt_thrd" is a bus scanner thread. It is scheduled by CAM for every bus on attach and by controller driver on hot-plug events. For some controllers it may be quite CPU-hungry. For example, for legacy ATA controllers, where bus reset may take many seconds of hardware polling, while devices just spinning up. For ahci(4) it was improved about year ago to not use polling when possible, but it still may loop for some time if controller is not responding on reset. What mfi(4), mentioned in log, does during scanning, I am not sure. -- Alexander Motin