From: John Nielsen
Date: Mon, 9 Jan 2012 12:50:54 -0500
To: Freddie Cash
Cc: FreeBSD Stable
Subject: Re: Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system

On Jan 9, 2012, at 12:40 PM, Freddie Cash wrote:

> Just wondering if anyone else has run into a similar issue.
>
> We have a ZFS storage server that was running 8.2-STABLE (from around
> beginning of Dec 2011) without any issues, that was upgraded to
> 9.0-RELEASE (to consolidate all the ZFS and networking fixes/updates
> and bring it up to version parity with our other ZFS storage server
> running 9.0) last Thursday.
> The "svn switch" of the source tree, the
> buildworld, the buildkernel, the installkernel, the reboot with the
> new kernel, the installworld, the reboot into the new world, the
> mergemaster processes all completed successfully. About half-way
> through the "make delete-old" process, the box locked up. No messages
> on the console, no log entries of any kind, everything just stopped.
> Had to do a power-cycle. And then everything went to hell. :(
>
> On reboot, the loader complained about not being able to determine
> which disk it was booting from (even though the new loader had already
> booted at least once), and gave strange messages about
> panic/free/something or other (didn't write that error down).
>
> I was able to boot using a 9.0 install CD, drop to a loader prompt,
> unload the kernel/modules from CD, load the kernel/modules from the
> harddrive, set currdev to the harddrive, and boot. But no matter what
> I did (gpart bootcode using pmbr/gptboot from CD or from HD; copy
> loader from CD, copy /boot from CD), I could not get the loader on the
> HD to load the kernel; always gave the same error message: can't
> determine which disk we're booting from.
>
> After trying for 24 hours to make it work, I just re-installed off the
> 9.0-RELEASE CD.
>
> Now, this box (alphadrive) will freeze after running for between 3 and
> 10 hours. Even when left completely idle, it will lock up after about
> 3 hours. :(
>
> I have another system (betadrive) that's almost identical hardware
> (chassis, backplane, SATA controllers are different, everything else
> is the same) that went from 8.2-STABLE to 9.0-RC2 to 9.0-RC3 to
> 9.0-RELEASE without any issues. I've tried copying /boot/loader.conf,
> /etc/make.conf, /etc/src.conf, /etc/sysctl.conf, /etc/rc.conf from
> betadrive to alphadrive, without any change in the freezing behaviour.
>
> These are ZFS storage systems, with / (UFS) and swap on SSDs, with 16
> or 24 SATA HDs in the pool (3x 5-disk raidz2 + spare and 4x 6-disk
> raidz2 resp). All of the ZFS settings are identical between the two
> systems (pool name, pool properties, ZFS filesystems, ZFS properties
> per filesystem). Dedupe and compression (LZJB) are enabled on both
> systems.
>
> When alphadrive locks up, there are no entries made in any log files;
> there are no log entries on the console; there are no entries in the
> BIOS event log; there are no entries in the IPMI event log; the
> CPU/case temps are below 40C (emergency shutoff is 75C) as shown via
> IPMI; RAM usage is under 20 GB (24 GB per box) with the lowest being
> under 2 GB used (I run top on the console so I can see the stats when
> it locks up, and the time it locks up). It just ... stops.
>
> The system will even lock up when running in single-user mode, with
> only / mounted (ZFS not loaded, zpool not imported).
>
> Hardware (alphadrive):
>   Chenbro 5U rackmount chassis with 24 hot-swap drive bays
>   SuperMicro H8DGi-F motherboard
>   AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
>   24 GB DDR3-SDRAM
>   3x SuperMicro AOC-USAS-L8i SATA controllers (multi-lane break-out cables)
>   8x Seagate 7200.12 1.5 TB SATA harddrives
>   16x WD RE4 1.0 TB SATA harddrives
>   1x Kingston 60 GB SSD (for /, swap, L2ARC)
>
> Hardware (betadrive):
>   SuperMicro 4U rackmount chassis with 16 hot-swap drive bays
>   SuperMicro H8DGi-F motherboard
>   AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
>   24 GB DDR3-SDRAM
>   2x SuperMicro AOC-USAS2-L8i SATA controllers (multi-lane cables)
>   16x WD RE4 2.0 TB SATA harddrives
>   1x Kingston 60 GB SSD (for /, swap, L2ARC)
>
> betadrive runs perfectly with FreeBSD 9.0-RELEASE.
> alphadrive locks up with FreeBSD 9.0-RELEASE.
>
> We're currently investigating hardware firmware revisions to see if
> anything else is different between the two systems.
>
> Has anyone experience anything similar?
> Does anyone have any ideas on
> what to look for? Any suggestions on what to try next?

From what you've said I strongly suspect that you have some kind of
hardware issue. Dodgy RAM is my first guess, something cooling-related
is my 2nd, and PSU is my 3rd. It is a little suspicious that you only
started having problems after your upgrade but it could be coincidence
or it could be something about the new software tickling the hardware
differently than the old.

Open it up, make sure you don't have dust buildup and that all the fans
are spinning, re-seat the RAM and then boot into memtest for a few
hours. If you have spare similar hardware you can also try swapping
components until you isolate the fault.

Good luck,

JN