Date: Wed, 21 Jul 2010 14:25:57 +0200
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: Andriy Gapon <avg@icyb.net.ua>
Cc: freebsd-stable@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject: Re: 8.1-RC2 MCE caused by some LAPIC/clock changes?
Message-ID: <5CABE3EC-1EE7-4B6B-85EA-70AA2A107948@hostpoint.ch>
In-Reply-To: <4C46B0C6.4020400@icyb.net.ua>
References: <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch>
 <9DCFE2F6-D7CB-49CB-8EBC-06C1E5EBB727@hostpoint.ch>
 <F744F475-3D2B-4BC6-856A-A5D302AA8681@hostpoint.ch>
 <201007201559.45081.jhb@freebsd.org>
 <6781BC8B-51E0-4F8B-9307-9C062DE70C21@hostpoint.ch>
 <4C46B0C6.4020400@icyb.net.ua>

On 21.07.2010, at 10:33, Andriy Gapon wrote:

> on 21/07/2010 03:57 Markus Gebert said the following:
>> Another thing though: Today I compared verbose boot output from 8-stable and
>> the current box. I saw that the ioapic sets up IRQ routing differently on
>> these two systems although the hardware is the same. This seemed not so
>> interesting at first, but then I noticed that 8-stable sets up two routes (to
>> lapic0 and lapic2, or sometimes lapic3) for IRQ58 (mpt0), while current only
>> uses one route (to lapic0).
>
> My understanding is that it's not "two routes", but re-routing.
> During early boot all interrupts are bound to BSP; later, when APs become
> online, the interrupts are re-distributed among available CPUs.

I guess you're right, misinterpretation on my side. Thanks for clarifying
this. Now being aware of this, it seems to me that in the
machdep.lapic_allclocks=0 case, there might just be more interrupts to be
assigned/routed due to "more clocks being used". If that's true, maybe it's
just "luck" that in this case the mpt interrupt gets assigned to lapic0/cpu0
and the box runs fine. I'm just guessing though, since I have no clue how
interrupts are assigned to lapics exactly (round-robin? some logic?).

>> I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box behave
>> like the one running current. Indeed, this seems to have changed IRQ58 to be
>> routed to lapic0 only. And the box was running for hours without showing the
>> symptoms.
>>
>> I just checked boot verbose output of my 8-stable box again (booted with
>> machdep.lapic_allclocks=0 as mentioned above). And now it seems to have set
>> up IRQ routes just like the current box (one route for IRQ58 to lapic0).
>
> Not sure how to interpret this properly.
> One possibility is a hardware problem where interrupt message route between
> ioapic2 and CPU to which lapic3 belongs is flaky.
> Perhaps, this might be a FreeBSD problem: it could be that the system somehow
> tells to not set up such routes, but we don't listen. But this is far-fetched.

I'm not sure either. If my "theory" above proved to be true, it would have
been just luck that 6.x and 7.x (and current) run just fine on the X4100M2.
A (short) test on Ubuntu didn't trigger the problem, so the Linux kernel is
either lucky too by selecting an interrupt route that is "not flaky", or
there's indeed some way to figure out not to use some lapics for some
interrupts. Or we didn't test Linux thoroughly enough.


Markus
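
Illustration of the "luck" hypothesis above: a toy model, not FreeBSD's actual interrupt assignment code. It assumes purely round-robin assignment (one of the possibilities raised in the message, not a confirmed behaviour), and the IRQ lists and the function name shuffle_irqs are hypothetical. It only shows how adding a few more interrupt sources could move IRQ 58 from one CPU to another.

```python
from itertools import cycle


def shuffle_irqs(irqs, cpus):
    """Toy model: hand each IRQ to the next CPU in round-robin order.

    This is an assumption for illustration only; the real kernel logic
    may differ.
    """
    next_cpu = cycle(cpus)
    return {irq: next(next_cpu) for irq in irqs}


cpus = [0, 1, 2, 3]

# Hypothetical IRQ lists: the second one has two extra clock-related
# sources, as might be the case when more clocks are in use.
fewer_sources = [1, 9, 58]
more_sources = [0, 1, 8, 9, 58]

print(shuffle_irqs(fewer_sources, cpus)[58])  # 2
print(shuffle_irqs(more_sources, cpus)[58])   # 0
```

Under these assumptions, IRQ 58 ends up on CPU 2 with the shorter list and on CPU 0 with the longer one, purely as a side effect of how many sources were assigned before it.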