From: Markus Gebert
To: Andriy Gapon
Cc: freebsd-stable@freebsd.org, John Baldwin
Date: Wed, 21 Jul 2010 14:25:57 +0200
Subject: Re: 8.1-RC2 MCE caused by some LAPIC/clock changes?

On 21.07.2010, at 10:33, Andriy Gapon wrote:

> on 21/07/2010 03:57 Markus Gebert said the following:
>> Another thing though: Today I compared verbose boot output from 8-stable and
>> the current box.
>> I saw that the ioapic sets up IRQ routing differently on
>> these two systems although the hardware is the same. This seemed not so
>> interesting at first, but then I noticed that 8-stable sets up two routes (to
>> lapic0 and lapic2, or sometimes lapic3) for IRQ58 (mpt0), while current only
>> uses one route (to lapic0).
>
> My understanding that it's not "two routes", but re-routing.
> During early boot all interrupts are bound to BSP; later, when APs become
> online, the interrupts are re-distributed among available CPUs.

I guess you're right, misinterpretation on my side. Thanks for clarifying
this. Now being aware of this, it seems to me that in the
machdep.lapic_allclocks=0 case, there might just be more interrupts to be
assigned/routed due to "more clocks being used". If that's true, maybe it's
just "luck" that in this case the mpt interrupt gets assigned to lapic0/cpu0
and the box runs fine. I'm just guessing though, since I have no clue how
interrupts are assigned to lapics exactly (round-robin? some logic?).

>> I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box behave
>> like the one running current. Indeed, this seems to have changed IRQ58 to be
>> routed to lapic0 only. And the box was running for hours without showing the
>> symptoms.
>>
>> I just checked boot verbose output of my 8-stable box again (booted with
>> machdep.lapic_allclocks=0 as mentioned above). And now it seems to have set
>> up IRQ routes just like the current box (one route for IRQ58 to lapic0).
>
> Not sure how to interpret this properly.
> One possibility is a hardware problem where interrupt message route between
> ioapic2 and CPU to which lapic3 belongs is flaky.
> Perhaps, this might be a FreeBSD problem: it could be that the system somehow
> tells to not set up such routes, but we don't listen. But this is far fetched.

I'm not sure either.
If my "theory" above proved to be true, it would have been
just luck that 6.x and 7.x (and current) run just fine on the X4100M2. A
(short) test on Ubuntu didn't trigger the problem, so the Linux kernel is
either lucky too by selecting an interrupt route that is "not flaky", or
there's indeed some way to figure out not to use some lapics for some
interrupts. Or we didn't test Linux thoroughly enough.

Markus