Date: Wed, 27 Sep 2006 10:10:41 -0700
From: "Jack Vogel" <jfvogel@gmail.com>
To: "Scott Long" <scottl@samsco.org>
Cc: freebsd-stable@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject: Re: 6.2 SHOWSTOPPER - em completely unusable on 6.2
Message-ID: <2a41acea0609271010w45d79c86ne82b45bc9b551e4a@mail.gmail.com>
In-Reply-To: <451AA7B1.5080202@samsco.org>
References: <451A1375.5080202@gneto.com> <20060927071538.GF22229@e-Gitt.NET>
    <451A4189.5020906@samsco.org> <20060927152824.GJ22229@e-Gitt.NET>
    <20060927155553.GB14563@icarus.home.lan> <20060927155904.GM22229@e-Gitt.NET>
    <451AA7B1.5080202@samsco.org>
As an optional data point, you might wish to consider the Intel driver
I am about to release. It has everything that 6.2 does EXCEPT the
interrupt changes. I kept those out because I didn't want to break
backward compatibility.

If someone who has reproduced this problem wants to check this, speak up
and I'll send a tarball.

Jack

On 9/27/06, Scott Long <scottl@samsco.org> wrote:
> Oliver Brandmueller wrote:
> > Hi,
> >
> > On Wed, Sep 27, 2006 at 08:55:53AM -0700, Jeremy Chadwick wrote:
> >
> >>>The SMBus Interface is not used at all (it's not even really usable).
> >>>Anyway, as soon as I unload the ichsmb module I cannot trigger the
> >>>problem anymore. If I load it again, the problem can again be triggered
> >>>by a buildworld. Statistical relevance: I did 4 buildworlds, alternating
> >>>the load/unload of ichsmb - both times with ichsmb loaded I saw 3
> >>>watchdog timeouts while the buildworld was running, while with ichsmb
> >>>not loaded I did not see a single watchdog timeout. The use of the
> >>>interface was about the same the whole time (constant NFS traffic
> >>>of around 1-2 MBit/s).
> >>
> >>Interesting find. For what it's worth -- I too load the appropriate
> >>smbus drivers on the system with the "em0 problem" (loading smbus and
> >>ichsmb). That system is a single processor / single core system, with
> >>HT disabled in the BIOS (which doesn't matter since FreeBSD disables
> >>it anyways). Kernel is non-SMP. Only reason I mention this is:
> >>
> >>
> >>>The UP/SMP idea seems to be only of interest, because on an UP machine
> >>>it's more likely to share interrupts than on SMP machines, it has
> >>>nothing to do with the fact of UP or SMP itself.
> >
> >
> > I don't think it has to do with ichsmb specifically here, but only with
> > the fact that ichsmb is for me exactly the thing that shares the
> > interrupt with the em interface that shows the problems.
> >
> > - Oliver
> >
>
> My theory here is that something in the kernel, likely VM/VFS, is
> holding the Giant lock for an inordinate amount of time. During this
> time, an interrupt fires on the shared em/ichsmb interrupt. The em
> interrupt handler runs and schedules a task to handle the event. Then
> the system blocks the interrupt at the PIC and schedules the ichsmb
> ithread. However, as soon as this ithread tries to run, it gets blocked
> on the Giant lock that is held elsewhere. While it is blocked, the
> interrupt stays masked at the PIC, blocking out both ichsmb and em
> device interrupts. Normally the PIC would get unmasked after the
> ithread has run, but until the ithread unblocks, this cannot happen.
> This goes on long enough that pending transactions on the em interface
> trigger a timeout.
>
> Assuming this analysis is correct, there are a couple of questions.
> First would be, why is the ithread being blocked for so long? Is the
> Giant lock actually being held continuously for that long, or is it being
> dropped and relocked often but the scheduler isn't giving the ithread a
> chance to grab it and run? Second is, why is this only being noticed
> now? Whether the em driver uses an INTR_FAST handler, like it does now,
> or an ithread handler, like it used to in 6.1, doesn't affect the ichsmb
> driver and its interaction with the Giant lock. Maybe there isn't a
> direct correlation here, and it's just a coincidence that something else
> in the system changed at the same time as the driver changing.
>
> I have a few ideas on tracking down the root cause, but they are
> pretty painful and slow. The root cause does need to be found and
> fixed, as it's either a very bad scheduler bug, or a very badly
> misbehaving subsystem. Both have implications for other possible
> problems in FreeBSD. Also, the usb driver has the same potential for
> blocking as the ichsmb driver, as do other drivers. But in the
> meantime, something needs to be done for 6.2. The options are:
>
> 1. Revert the em driver to its 6.1 form, ask people to test if the
> problem persists. If it doesn't, leave it at that for now.
>
> 2. Add INTR_FAST shims to the usb and ichsmb drivers so that neither
> uses an ithread. Without an ithread, no PIC masking will happen, and
> these drivers can block all they want without interfering with the
> em driver. This is a bit of risky work, though, and may not be possible
> if the devices don't support certain functionality. Also, it doesn't
> address the root problem. But, getting more interrupt handlers away
> from needing Giant is a good thing, even if this is only a band-aid.
>
> 3. Spend the time tracking down and fixing the root problem for 6.2.
> This is ideal, but it is also an unbounded problem. Thus, it is
> absolutely not conducive to a timely and successful 6.2 release.
>
> 4. Do nothing for now and tell people to disable usb, ichsmb, etc, as
> needed. This, of course, is not a good option.
>
> Option 1 is the quickest and likely most risk-free fix for the 6.2
> release. If someone could test doing a revert and report back, I would
> appreciate it. Any volunteers?
>
> Scott
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>
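
The masking sequence in the quoted theory can be illustrated with a toy,
tick-based model. This is only a sketch under stated assumptions, not
kernel code: the function name simulate, the Result type, and the tick
counts are all hypothetical, and the real scheduler, PIC, and em watchdog
are far more involved. It assumes the shared line is masked from the
moment the interrupt fires until the ithread sharing it has acquired
Giant and run, and that the em watchdog fires once a pending transmit has
waited a fixed number of ticks for its completion interrupt.

```python
from dataclasses import dataclass


@dataclass
class Result:
    masked_ticks: int      # how long the shared interrupt line stayed masked
    watchdog_fired: bool   # whether the (modeled) em watchdog timed out


def simulate(giant_hold_ticks: int, watchdog_ticks: int,
             shares_irq_with_ithread: bool, total_ticks: int = 100) -> Result:
    """Toy model: an interrupt fires at tick 0 on a line shared by a fast
    handler (em) and, optionally, an ithread handler that needs Giant."""
    # The line is masked only if an ithread handler shares it; a line with
    # nothing but fast handlers is never masked in this model.
    line_masked = shares_irq_with_ithread
    masked_ticks = 0
    tx_pending = True      # em transmit waiting for its completion interrupt
    tx_wait = 0
    watchdog_fired = False

    for tick in range(total_ticks):
        giant_held = tick < giant_hold_ticks
        if line_masked and not giant_held:
            # The ithread finally gets Giant, runs, and the line is unmasked.
            line_masked = False
        if line_masked:
            masked_ticks += 1
        if tx_pending:
            if not line_masked:
                tx_pending = False          # completion interrupt delivered
            else:
                tx_wait += 1
                if tx_wait >= watchdog_ticks:
                    watchdog_fired = True   # pending transmit timed out
                    tx_pending = False
    return Result(masked_ticks, watchdog_fired)


if __name__ == "__main__":
    # Giant held longer than the watchdog period, line shared with an ithread.
    print(simulate(giant_hold_ticks=10, watchdog_ticks=5,
                   shares_irq_with_ithread=True))
    # Same Giant hold time, but no ithread on the line (as in option 2).
    print(simulate(giant_hold_ticks=10, watchdog_ticks=5,
                   shares_irq_with_ithread=False))
```

In this model the timeout depends only on whether an ithread shares the
line and on how long Giant is held, not on how the em handler itself is
implemented, which matches the observation that unloading ichsmb makes
the watchdog timeouts disappear.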