Date:      Fri, 06 Apr 2012 13:17:12 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        John Baldwin <jhb@freebsd.org>
Cc:        freebsd-stable@freebsd.org, scottl@freebsd.org
Subject:   Re: [stable-ish 9] Dell R815 ipmi(4) attach failure
Message-ID:  <4F7EC2A8.3000001@FreeBSD.org>
In-Reply-To: <201204041447.49398.jhb@freebsd.org>
References:  <201204041624.q34GOXvc063506@ambrisko.com> <201204041447.49398.jhb@freebsd.org>

On 04/04/12 21:47, John Baldwin wrote:
> On Wednesday, April 04, 2012 12:24:33 pm Doug Ambrisko wrote:
>> John Baldwin writes:
>> | On Tuesday, April 03, 2012 12:37:50 pm Doug Ambrisko wrote:
>> |>  John Baldwin writes:
>> |>  | On Monday, April 02, 2012 7:27:13 pm Doug Ambrisko wrote:
>> |>  |>  Doug Ambrisko writes:
>> |>  |>  | John Baldwin writes:
>> |>  |>  | | On Saturday, March 31, 2012 3:25:48 pm Doug Ambrisko wrote:
>> |>  |>  | |>  Sean Bruno writes:
>> |>  |>  | |>  | Noting a failure to attach to the onboard IPMI controller
>> |>  |>  | |>  | with this dell R815.  Not sure what to start poking at and
>> |>  |>  | |>  | thought I'd throw this over here for comment.
>> |>  |>  | |>  |
>> |>  |>  | |>  | -bash-4.2$ dmesg |grep ipmi
>> |>  |>  | |>  | ipmi0: KCS mode found at io 0xca8 on acpi
>> |>  |>  | |>  | ipmi1: <IPMI System Interface> on isa0
>> |>  |>  | |>  | device_attach: ipmi1 attach returned 16
>> |>  |>  | |>  | ipmi1: <IPMI System Interface> on isa0
>> |>  |>  | |>  | device_attach: ipmi1 attach returned 16
>> |>  |>  | |>  | ipmi0: Timed out waiting for GET_DEVICE_ID
>> |>  |>  | |>
>> |>  |>  | |>  I've run into this recently.  A quick hack to fix it is:
>> |>  |>  | |>
>> |>  |>  | |>  Index: ipmi.c
>> |>  |>  | |>  ===================================================================
>> |>  |>  | |>  RCS file: /cvs/src/sys/dev/ipmi/ipmi.c,v
>> |>  |>  | |>  retrieving revision 1.14
>> |>  |>  | |>  diff -u -p -r1.14 ipmi.c
>> |>  |>  | |>  --- ipmi.c	14 Apr 2011 07:14:22 -0000	1.14
>> |>  |>  | |>  +++ ipmi.c	31 Mar 2012 19:18:35 -0000
>> |>  |>  | |>  @@ -695,7 +695,6 @@ ipmi_startup(void *arg)
>> |>  |>  | |>   	if (error == EWOULDBLOCK) {
>> |>  |>  | |>   		device_printf(dev, "Timed out waiting for GET_DEVICE_ID\n");
>> |>  |>  | |>   		ipmi_free_request(req);
>> |>  |>  | |>  -		return;
>> |>  |>  | |>   	} else if (error) {
>> |>  |>  | |>   		device_printf(dev, "Failed GET_DEVICE_ID: %d\n", error);
>> |>  |>  | |>   		ipmi_free_request(req);
>> |>  |>  | |>
>> |>  |>  | |>  The issue is that the wakeup doesn't actually wake up the msleep
>> |>  |>  | |>  in ipmi_submit_driver_request.  The error being reported is that
>> |>  |>  | |>  the msleep timed out.  This doesn't seem to be a critical problem
>> |>  |>  | |>  since after this things seemed to work.  I saw this on 9.X.
>> |>  |>  | |>  Haven't seen it on 8.2.  Not sure about -current.
>> |>  |>  | |>
>> |>  |>  | |>  It doesn't happen on all machines.
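
To make the failure mode concrete: the pattern at issue is a plain
msleep()/wakeup() handshake.  Below is a minimal sketch (not the ipmi(4)
code; all demo_* names are invented here) of how a wakeup() that arrives
only after the timeout has nobody left to wake, so the waiter still sees
EWOULDBLOCK:

    /*
     * Sketch only: a waiter that msleep()s with a timeout and a completer
     * that calls wakeup().  If completion takes longer than 'timo' ticks,
     * msleep() has already returned EWOULDBLOCK and the later wakeup() is
     * effectively lost.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx demo_lock;
    static int demo_done;

    static void
    demo_init(void)
    {
            mtx_init(&demo_lock, "demo", NULL, MTX_DEF);
    }

    static int
    demo_wait(int timo)
    {
            int error;

            mtx_lock(&demo_lock);
            error = demo_done ? 0 :
                msleep(&demo_done, &demo_lock, 0, "demowt", timo);
            mtx_unlock(&demo_lock);
            return (error);         /* EWOULDBLOCK on timeout */
    }

    static void
    demo_complete(void)
    {
            mtx_lock(&demo_lock);
            demo_done = 1;
            wakeup(&demo_done);     /* no-op if the waiter already timed out */
            mtx_unlock(&demo_lock);
    }

In the driver the wait in question is the msleep(req, ...) shown further
down in the debug patch.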
>> |>  |>  | |
>> |>  |>  | | Hmm, are you seeing the KCS thread manage the request but the
>> |>  |>  | | wakeup() is lost?
>> |>  |>  |
>> |>  |>  | It was a couple of weeks ago that I played with it.  I put printf's
>> |>  |>  | around the msleep and wakeup.  I saw the wakeup called but the sleep
>> |>  |>  | did not get it.  I can try the test again later today.  Right now my
>> |>  |>  | main work machine is recovering from a power outage.  This was with
>> |>  |>  | 9.0 when I first saw it.  This issue seems to only happen at boot
>> |>  |>  | time.  If I kldload the module after the system is booted then it
>> |>  |>  | seems to work okay.  The KCS part was working fine and got the data
>> |>  |>  | okay from the request.  I haven't seen or heard any issues with 8.2.
>> |>  |>
>> |>  |>  With -current I patched ipmi.c with:
>> |>  |>  Index: ipmi.c
>> |>  |>  ===================================================================
>> |>  |>  --- ipmi.c      (revision 233806)
>> |>  |>  +++ ipmi.c      (working copy)
>> |>  |>  @@ -523,7 +523,11 @@
>> |>  |>           * waiter that we awaken.
>> |>  |>           */
>> |>  |>          if (req->ir_owner == NULL)
>> |>  |>  +{
>> |>  |>  +device_printf(sc->ipmi_dev, "DEBUG %s %d before wakeup %d\n",__FUNCTION__,__LINE__,ticks);
>> |>  |>                  wakeup(req);
>> |>  |>  +device_printf(sc->ipmi_dev, "DEBUG %s %d after wakeup %d\n",__FUNCTION__,__LINE__,ticks);
>> |>  |>  +}
>> |>  |>          else {
>> |>  |>                  dev = req->ir_owner;
>> |>  |>                  TAILQ_INSERT_TAIL(&dev->ipmi_completed_requests, req, ir_link);
>> |>  |>  @@ -543,7 +547,11 @@
>> |>  |>          IPMI_LOCK(sc);
>> |>  |>          error = sc->ipmi_enqueue_request(sc, req);
>> |>  |>          if (error == 0)
>> |>  |>  +{
>> |>  |>  +device_printf(sc->ipmi_dev, "DEBUG %s %d before msleep %d\n",__FUNCTION__,__LINE__,ticks);
>> |>  |>                  error = msleep(req, &sc->ipmi_lock, 0, "ipmireq", timo);
>> |>  |>  +device_printf(sc->ipmi_dev, "DEBUG %s %d after msleep %d\n",__FUNCTION__,__LINE__,ticks);
>> |>  |>  +}
>> |>  |>          if (error == 0)
>> |>  |>                  error = req->ir_error;
>> |>  |>          IPMI_UNLOCK(sc);
>> |>  |>  @@ -695,8 +703,11 @@
>> |>  |>          error = ipmi_submit_driver_request(sc, req, MAX_TIMEOUT);
>> |>  |>          if (error == EWOULDBLOCK) {
>> |>  |>                  device_printf(dev, "Timed out waiting for GET_DEVICE_ID\n");
>> |>  |>  +               printf("DJA\n");
>> |>  |>  +/*
>> |>  |>                  ipmi_free_request(req);
>> |>  |>                  return;
>> |>  |>  +*/
>> |>  |>          } else if (error) {
>> |>  |>                  device_printf(dev, "Failed GET_DEVICE_ID: %d\n", error);
>> |>  |>                  ipmi_free_request(req);
>> |>  |>
>> |>  |>  and get
>> |>  |>    # dmesg | grep ipmi
>> |>  |>    ipmi0: KCS mode found at io 0xca8 on acpi
>> |>  |>    ipmi1: <IPMI System Interface> on isa0
>> |>  |>    device_attach: ipmi1 attach returned 16
>> |>  |>    ipmi1: <IPMI System Interface> on isa0
>> |>  |>    device_attach: ipmi1 attach returned 16
>> |>  |>    ipmi0: DEBUG ipmi_submit_driver_request 551 before msleep 2
>> |>  |>    ipmi0: DEBUG ipmi_complete_request 527 before wakeup 6201
>> |>  |>    ipmi0: DEBUG ipmi_complete_request 529 after wakeup 6263
>> |>  |>    ipmi0: DEBUG ipmi_submit_driver_request 553 after msleep 6323
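
For scale, assuming the default hz of 1000 (so a tick is roughly a
millisecond): the driver thread blocked at tick 2, but the KCS worker only
called wakeup() at tick 6201, i.e. (6201 - 2) / 1000, about 6.2 seconds
later.  With a timeout of roughly 3 seconds (about 3000 ticks, per John's
note below), the msleep() had already expired and reports EWOULDBLOCK,
and the late wakeup() finds nobody sleeping; the "after msleep 6323" line
would then only mark when the waiting thread finally got the CPU back.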
>> |>  |
>> |>  | Actually, can you compile with:
>> |>  |
>> |>  | options  	KTR
>> |>  | options  	KTR_COMPILE=KTR_SCHED
>> |>  | options 	KTR_MASK=KTR_SCHED
>> |>  |
>> |>  | and then add a temporary hack to ipmi.c to set ktr_mask to 0 after
>> |>  | ipmi_submit_driver_request() returns in ipmi_startup()?  You can
>> |>  | then use 'ktrdump -ct' after boot to capture a log of what the scheduler
>> |>  | did including if it timed out the sleep, etc.  I think this would be
>> |>  | useful for figuring out what went wrong.  It does seem that it timed
>> |>  | out after 3 seconds.
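
In case it helps, the temporary hack could look something like the sketch
below (untested; it assumes the global ktr_mask declared in <sys/ktr.h>
and the ipmi_startup() code quoted earlier):

    #include <sys/ktr.h>            /* declares the global ktr_mask */

            /* ... inside ipmi_startup() in ipmi.c ... */
            error = ipmi_submit_driver_request(sc, req, MAX_TIMEOUT);
            /*
             * XXX temporary hack: stop KTR logging right here so the trace
             * of the boot-time sleep/timeout is not overwritten before it
             * can be dumped with ktrdump(8).  The existing error handling
             * below is unchanged.
             */
            ktr_mask = 0;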
>> |>
>> |>  Assuming I didn't mess up, the log should be at:
>> |>  	http://people.freebsd.org/~ambrisko/ipmi_ktr_dump.txt
>> |>  Again, I'm using ipmi(4) as a module loaded via the loader.
>> |
>> | If you use "-ct" then you get a file you can feed into schedgraph.
>> | However, just reading the log, it seems that IRQ 20 keeps preempting
>> | the KCS worker thread preventing it from getting anything done.  Also,
>> | there seem to be a lot of threads on CPU 0's runqueue waiting for a
>> | chance to run (load average of 12 or 13 the entire time).  You can try
>> | just bumping up the max timeout from 3 seconds to higher perhaps.  Not
>> | sure why IRQ 20 keeps firing though.  It might be related to USB, so
>> | you could try fiddling with USB options in the BIOS perhaps, or disabling
>> | the USB drivers to see if that fixes IPMI.
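
For the timeout experiment, one quick (untested) way to try it, assuming
the third argument really is a tick count as the msleep() in the quoted
code suggests, would be to pass a larger value at the call site in
ipmi_startup():

            /* was: error = ipmi_submit_driver_request(sc, req, MAX_TIMEOUT); */
            error = ipmi_submit_driver_request(sc, req, 10 * hz);   /* ~10 s instead of ~3 s */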
>>
>> Tried without USB in kernel:
>> 	http://people.freebsd.org/~ambrisko/ipmi_ktr_dump_no_usb.txt
>
> Hmm, it's still just running constantly (note that the idle thread is
> _never_ scheduled).  The lion's share of the time seems to be spent in
> "xpt_thrd".  Note that there are several places where nothing happens except
> that "xpt_thrd" runs constantly (spinning) during tens of statclock ticks.  I
> would maybe start debugging that to see what in the world it is doing.  Maybe
> it is polling some hardware down in xpt_action() (i.e., xpt_action() for a
> single bus called down into a driver and it is just spinning using polling
> instead of sleeping and waiting for an interrupt).

"xpt_thrd" is a bus scanner thread. It is scheduled by CAM for every bus 
on attach and by the controller driver on hot-plug events. For some
controllers it can be quite CPU-hungry: for example, on legacy ATA
controllers a bus reset may take many seconds of hardware polling while
devices are still spinning up. For ahci(4) this was improved about a year
ago to avoid polling when possible, but it may still loop for some time
if the controller does not respond to reset. What mfi(4), mentioned in
the log, does during scanning, I am not sure.
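
If it is useful to see where that thread spends its time, its kernel stack
can be sampled: find the pid of the [xpt_thrd] kernel process with ps(1)
and run 'procstat -kk <pid>' a few times while the scan is running;
repeated identical stacks would show what it is polling on.  (procstat -kk
needs stack(9) support compiled into the kernel.)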

-- 
Alexander Motin


