From owner-freebsd-current@FreeBSD.ORG Wed Feb 25 07:17:54 2009
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C8A88106567A;
	Wed, 25 Feb 2009 07:17:54 +0000 (UTC)
	(envelope-from lstewart@freebsd.org)
Received: from lauren.room52.net (lauren.room52.net [210.50.193.198])
	by mx1.freebsd.org (Postfix) with ESMTP id 4DCB08FC1D;
	Wed, 25 Feb 2009 07:17:54 +0000 (UTC)
	(envelope-from lstewart@freebsd.org)
Received: from lstewart.caia.swin.edu.au (lstewart.caia.swin.edu.au [136.186.229.95])
	(authenticated bits=0)
	by lauren.room52.net (8.14.3/8.14.3) with ESMTP id n1P6joMm058667
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 25 Feb 2009 17:45:52 +1100 (EST)
	(envelope-from lstewart@freebsd.org)
Message-ID: <49A4E919.1070503@freebsd.org>
Date: Wed, 25 Feb 2009 17:45:45 +1100
From: Lawrence Stewart
User-Agent: Thunderbird 2.0.0.19 (X11/20090213)
MIME-Version: 1.0
To: Luigi Rizzo
References: <20081121231400.GA94863@onelab2.iet.unipi.it>
	<49A35CB1.4050304@freebsd.org>
In-Reply-To: <49A35CB1.4050304@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_SOFTFAIL
	autolearn=disabled version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on lauren.room52.net
Cc: kib@freebsd.org, current@freebsd.org
Subject: [SOLVED] Re: Recent versions of pxeboot hang/panic on AMD platform.
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Wed, 25 Feb 2009 07:17:55 -0000

Lawrence Stewart wrote:
> Luigi Rizzo wrote:
>> [copying some people involved with recent related commits]
>>
>> As reported in kern/118222 recent versions of pxeboot hang/panic
>> on AMD platform.
>>
>> Initial reports mentioned that the RELENG_6 versions worked well,
>> however i found out that even the recent RELENG_6 code is problematic.
>>
>> Specifically, the problem i see on two machines with AMD CPU (one
>> is an Asus M2N-VM) motherboard netbooting with PXEboot, is that the
>> loading of config files or binary modules (kernel, etc.) randomly
>> hangs with recent version of pxeboot (RELENG_6, RELENG_7 and HEAD
>> all give the same behaviour).
>>
>> The same system works fine with an old version of pxeboot from RELENG_6.
>>
>> Things seem to work fine on i386 (tried a Pentium4, N270 and on qemu)
>> with all the versions below.
>>
>> To make some investigation i started with a reliable version
>> (RELENG_6, early 2008) and moved forward to figure out where the
>> problem was introduced. I found the following:
>>
>> RELENG_6 as of 2008.03.01 (svn 176674) works
>> RELENG_6 as of 2008.03.15 (svn 177190) works
>>     (same as previous)
>> RELENG_6 as of 2008.03.31 (svn 177768) does NOT work
>>     changed files:
>>     Index: RELENG_6/sys/boot/i386/boot2/boot2.c
>>     Index: RELENG_6/sys/boot/i386/btx/btx/Makefile
>>     Index: RELENG_6/sys/boot/i386/btx/btx/btx.S
>>     Index: RELENG_6/sys/boot/i386/gptboot/gptboot.c
>>     Index: RELENG_6/sys/boot/i386/libi386/biossmap.c
>>     Index: RELENG_6/sys/boot/i386/libi386/biosmem.c
>>
>> There is a recent, related change (august 2008) which however
>> does not seem to fix the bug.
>>
>> (all the above is basically an MFC of something applied slightly
>> earlier to
>> head and RELENG_7 .
>> I have experienced the same exact bug with a fresh
>> head and RELENG_7, even though I have not found the exact point there
>> where the problem arised).
>>
>> The fact that the failure occurs at random times, even quite early
>> (e.g. while reading the Forth config files) suggests that the problem
>> may be related to interrupts coming at the wrong time.
>> Unfortunately the changes to btx.S (which i believe may be related to
>> the problem, as the changes to the other files seem innocuous or
>> unrelated)
>> are beyond my knowledge.
>> So, anyone has ideas on what could be happening here, and especially
>> how likely it is that we might see the same problem with a disk or
>> usb-based
>> booting ?
>
> Just adding a "me too" with pxeboot built from head r188509. Running
> with pxeboot from AMD64 6.3-RELEASE as Luigi's research hinted seems to
> resolve the issue for me also. I haven't tried pxeboot built from
> r177768 yet though to see if it too fails.
>
> To quickly touch on symptoms... I've never seen a panic. I experience
> permanent hangs that occur maybe 50% (or possibly even more) of the time
> when I reboot or cold start the machine. Only option is to reboot when
> it hangs. Rebooting a few times will eventually allow the boot process
> to finish and then once the kernel kicks off probing, all is good.
>
> Hardware is an Intel 865GM chipset based Gigabyte mainboard with a 3GHz
> HTT P4 CPU (HTT enabled).
>
> Happy to help debug further if anyone has ideas to try.
>

On a whim I decided to try a PCI Intel GigE NIC I had lying around... lo
and behold, I can't make the machine hang during boot any more with
pxeboot built from head r188509.

To be a bit more specific about the hardware involved, the motherboard is
a Gigabyte GA-8I865GM-775 with BIOS version F5, and the onboard NIC shows
up as follows:

Marvell Yukon pxe rom version (reported during boot): 1.11

pciconf -lv:

skc0@pci0:1:9:0: class=0x020000 card=0xe0001458 chip=0x432011ab rev=0x13 hdr=0x00
    vendor     = 'Marvell Semiconductor (Was: Galileo Technology Ltd)'
    device     = 'Yukon 88E8001/8003/8010 PCI Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet

Onboard NIC related verbose dmesg output:

skc0: port 0xa400-0xa4ff mem 0xf9040000-0xf9043fff irq 20 at device 9.0 on pci1
skc0: Reserved 0x4000 bytes for rid 0x10 type 3 at 0xf9040000
skc0: interrupt moderation is 100 us
skc0: Marvell Yukon Lite Gigabit Ethernet rev. (0x9)
skc0: chip ver = 0xb1
skc0: chip rev = 0x09
skc0: SK_EPROM0 = 0x10
skc0: SRAM size = 0x010000
sk0: on skc0
sk0: bpf attached
sk0: Ethernet address:
miibus0: on sk0
e1000phy0: PHY 0 on miibus0
e1000phy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto
ioapic0: routing intpin 20 (PCI IRQ 20) to lapic 0 vector 54
skc0: [MPSAFE]
skc0: [ITHREAD]

As a follow on from the Intel NIC discovery, I also noticed John's commit
from yesterday (r189017), which looked promising, and took it for a spin.
I'm happy to report that it appears to resolve the hang with the Marvell
card's pxe rom. After at least a dozen reboot/cold start attempts it
hasn't hung once, whereas pxeboot built from r189016 hangs most of the
time. The addon Intel NIC is still unfazed by either pxeboot version and
boots just fine regardless.

So for me at least, it looks like the case is closed. Thanks go to Tor,
John and Bjoern for their work on r189017.

Cheers,
Lawrence