From: Scott Long
Date: Wed, 27 Sep 2006 10:32:49 -0600
To: Oliver Brandmueller
Cc: freebsd-stable@freebsd.org, John Baldwin
Subject: Re: 6.2 SHOWSTOPPER - em completely unusable on 6.2
Message-ID: <451AA7B1.5080202@samsco.org>
In-Reply-To: <20060927155904.GM22229@e-Gitt.NET>

Oliver Brandmueller wrote:
> Hi,
>
> On Wed, Sep 27, 2006 at 08:55:53AM -0700, Jeremy Chadwick wrote:
>>>The SMBus Interface is not used at all (it's not even really usable).
>>>Anyway, as soon as I unload the ichsmb module I cannot trigger the
>>>problem anymore. If I load it again, the problem can again be triggered
>>>by a buildworld. Statistical relevance: I did 4 buildworlds, alternating
>>>the load/unload of ichsmb - both times with ichsmb loaded I saw 3
>>>watchdog timeouts during the buildworld was running, while ichsmb was
>>>not loaded I did not see a single watchdog timeout. The use of the
>>>interface was around the same during all the time (constant NFS traffic
>>>of around 1-2 MBit/s).
>>
>>Interesting find. For what it's worth -- I too load the appropriate
>>smbus drivers on the system with the "em0 problem" (loading smbus and
>>ichsmb). That system is a single processor / single core system, with
>>HT disabled in the BIOS (which doesn't matter since FreeBSD disables
>>it anyways). Kernel is non-SMP. Only reason I mention this is:
>>
>>
>>>The UP/SMP idea seems to be only of interest, because on an UP machine
>>>it's more likely to share interrupts than on SMP machines, it has
>>>nothing to do with the fact of UP or SMP itself.
>
>
> I don't think it has to do especially with ichsmb here, but only with the
> fact, that ichsmb is for me exactly the thing that shares the interrupt
> with the em interface that shows the problems.
>
> - Oliver
>

My theory here is that something in the kernel, likely VM/VFS, is
holding the Giant lock for an inordinate amount of time. During this
time, an interrupt fires on the shared em/ichsmb interrupt. The em
interrupt handler runs and schedules a task to handle the event. Then
the system blocks the interrupt at the PIC and schedules the ichsmb
ithread. However, as soon as this ithread tries to run, it gets blocked
on the Giant lock that is held elsewhere. While it is blocked, the
interrupt stays masked at the PIC, blocking out both ichsmb and em
device interrupts.
Normally the PIC would get unmasked after the ithread has run, but
until the ithread unblocks, this cannot happen. This goes on long
enough that pending transactions on the em interface trigger a timeout.

Assuming this analysis is correct, there are a couple of questions.
First would be, why is the ithread being blocked for so long? Is the
Giant lock actually being held continuously for that long, or is it
being dropped and relocked often but the scheduler isn't giving the
ithread a chance to grab it and run? Second is, why is this only being
noticed now? Whether the em driver uses an INTR_FAST handler, like it
does now, or an ithread handler, like it used to in 6.1, doesn't affect
the ichsmb driver and its interaction with the Giant lock. Maybe there
isn't a direct correlation here, and it's just a coincidence that
something else in the system changed at the same time the driver
changed.

I have a few ideas on tracking down the root cause, but they are pretty
painful and slow. The root cause does need to be found and fixed, as
it's either a very bad scheduler bug, or a very badly misbehaving
subsystem. Both have implications for other possible problems in
FreeBSD. Also, the usb driver has the same potential for blocking as
the ichsmb driver, as do other drivers.

But in the meantime, something needs to be done for 6.2. The options
are:

1. Revert the em driver to its 6.1 form, ask people to test if the
problem persists. If it doesn't, leave it at that for now.

2. Add INTR_FAST shims to the usb and ichsmb drivers so that neither
uses an ithread. Without an ithread, no PIC masking will happen, and
these drivers can block all they want without interfering with the em
driver. This is a bit of risky work, though, and may not be possible
if the devices don't support certain functionality. Also, it doesn't
address the root problem. But, getting more interrupt handlers away
from needing Giant is a good thing, even if this is only a band-aid.

3. Spend the time tracking down and fixing the root problem for 6.2.
This is ideal, but it is also an unbounded problem. Thus, it is
absolutely not conducive to having a timely and successful 6.2 release.

4. Do nothing for now and tell people to disable usb, ichsmb, etc, as
needed. This, of course, is not a good option.

Option 1 is the quickest and likely most risk-free fix for the 6.2
release. If someone could test doing a revert and report back, I would
appreciate it. Any volunteers?

Scott
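
The masking sequence in the theory above can be sketched as a toy
model. This is only an illustration under stated assumptions, not
kernel code: the names simulate_shared_irq, Lockout, giant_free_tick
and watchdog_ticks are hypothetical, time is counted in arbitrary
ticks, and the scheduler is assumed to run the ichsmb ithread the
moment Giant becomes free (which is exactly one of the open questions).

```python
from dataclasses import dataclass


@dataclass
class Lockout:
    masked_ticks: int      # how long the shared line stayed masked at the PIC
    watchdog_fired: bool   # True if the em watchdog would expire in that window


def simulate_shared_irq(irq_tick, giant_free_tick, watchdog_ticks,
                        ithread_on_line=True):
    """Toy model of one interrupt on a line shared by em and ichsmb.

    All parameters are hypothetical and measured in arbitrary ticks.
    """
    # With no ithread handler on the line (option 2), nothing masks the
    # line, so the em device is never locked out.
    if not ithread_on_line:
        return Lockout(0, False)

    # The em fast handler runs at irq_tick and only schedules a task.
    # The line is then masked until the ichsmb ithread has run, and the
    # ithread cannot run before Giant is released elsewhere.
    ithread_runs_at = max(irq_tick, giant_free_tick)
    masked = ithread_runs_at - irq_tick

    # Pending em transactions time out if the lockout outlasts the watchdog.
    return Lockout(masked, masked > watchdog_ticks)


if __name__ == "__main__":
    # Giant held for a long time: the lockout exceeds the watchdog window.
    print(simulate_shared_irq(irq_tick=10, giant_free_tick=500, watchdog_ticks=100))
    # Giant released quickly: the line is masked only briefly.
    print(simulate_shared_irq(irq_tick=10, giant_free_tick=50, watchdog_ticks=100))
    # No ithread on the shared line: no masking at all.
    print(simulate_shared_irq(10, 500, 100, ithread_on_line=False))
```

The model deliberately leaves out the second possibility raised above,
that Giant is dropped and relocked often while the ithread is starved.
Covering that case would require modelling the scheduler as well.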