From: Freddie Cash
To: FreeBSD Stable
Date: Mon, 9 Jan 2012 09:40:41 -0800
Subject: Upgrade from 8.2-STABLE to 9.0-RELEASE wedges on SuperMicro H8DGiF-based system

Good morning,

Just wondering if anyone else has run into a similar issue.
We have a ZFS storage server that was running 8.2-STABLE (from around the beginning of Dec 2011) without any issues. Last Thursday it was upgraded to 9.0-RELEASE, to consolidate all the ZFS and networking fixes/updates and to bring it up to version parity with our other ZFS storage server running 9.0.

The "svn switch" of the source tree, the buildworld, the buildkernel, the installkernel, the reboot with the new kernel, the installworld, the reboot into the new world, and the mergemaster process all completed successfully. About half-way through the "make delete-old" process, the box locked up. No messages on the console, no log entries of any kind; everything just stopped. Had to do a power-cycle.

And then everything went to hell. :(

On reboot, the loader complained about not being able to determine which disk it was booting from (even though the new loader had already booted at least once), and gave strange messages about panic/free/something or other (didn't write that error down).

I was able to boot by using a 9.0 install CD: drop to a loader prompt, unload the kernel/modules from the CD, load the kernel/modules from the hard drive, set currdev to the hard drive, and boot. But no matter what I did (running gpart bootcode with pmbr/gptboot from the CD or from the HD; copying the loader from the CD; copying /boot from the CD), I could not get the loader on the HD to load the kernel. It always gave the same error message: can't determine which disk we're booting from. After trying for 24 hours to make it work, I just re-installed from the 9.0-RELEASE CD.

Now, this box (alphadrive) will freeze after running for between 3 and 10 hours. Even when left completely idle, it will lock up after about 3 hours. :(

I have another system (betadrive) with almost identical hardware (the chassis, backplane, and SATA controllers are different; everything else is the same) that went from 8.2-STABLE to 9.0-RC2 to 9.0-RC3 to 9.0-RELEASE without any issues.
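For readers less familiar with the source-upgrade path, the sequence alphadrive went through maps roughly onto the commands below. This is a sketch, not the exact commands that were run: the /usr/src location and the releng/9.0 branch URL are assumptions, and the two reboots split the real procedure into three separate sessions. By default every command is only printed; setting DRY_RUN=0 executes a single selected session.

```shell
#!/bin/sh
# Sketch of the 8.2-STABLE -> 9.0-RELEASE source upgrade, in the order described.
# ASSUMPTIONS: sources live in /usr/src; BRANCH_URL is an illustrative choice.
SRC=/usr/src
BRANCH_URL=${BRANCH_URL:-svn://svn.freebsd.org/base/releng/9.0}
DRY_RUN=${DRY_RUN:-1}

# Print a command; execute it only when DRY_RUN=0, stopping on the first failure.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@" || exit 1
    fi
}

# Session 1 (old system): update the tree, build, install the new kernel.
session1() {
    run svn switch "$BRANCH_URL" "$SRC"
    run make -C "$SRC" buildworld
    run make -C "$SRC" buildkernel
    run make -C "$SRC" installkernel
    echo "# now reboot with the new kernel"
}

# Session 2 (new kernel, old world): install the new world.
session2() {
    run make -C "$SRC" installworld
    echo "# now reboot into the new world"
}

# Session 3 (new kernel and world): merge config files, remove obsolete files.
# This is where alphadrive hung, about half-way through delete-old.
session3() {
    run mergemaster
    run make -C "$SRC" delete-old
}

# Usage: sh upgrade.sh [1|2|3]. With no argument, all sessions are printed only.
case "${1:-all}" in
    1) session1 ;;
    2) session2 ;;
    3) session3 ;;
    all) DRY_RUN=1; session1; session2; session3 ;;
    *) echo "usage: $0 [1|2|3]" >&2; exit 2 ;;
esac
```

Forcing print-only mode when no session is given keeps the script from running installworld on the old kernel by accident.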
I've tried copying /boot/loader.conf, /etc/make.conf, /etc/src.conf, /etc/sysctl.conf, and /etc/rc.conf from betadrive to alphadrive, without any change in the freezing behaviour.

These are ZFS storage systems, with / (UFS) and swap on SSDs, and with 16 or 24 SATA HDs in the pool (3x 5-disk raidz2 + spare and 4x 6-disk raidz2, respectively). All of the ZFS settings are identical between the two systems (pool name, pool properties, ZFS filesystems, ZFS properties per filesystem). Dedupe and compression (LZJB) are enabled on both systems.

When alphadrive locks up, there are no entries in any log files; there are no messages on the console; there are no entries in the BIOS event log; there are no entries in the IPMI event log; the CPU/case temps are below 40C (emergency shutoff is 75C), as shown via IPMI; and RAM usage is under 20 GB (out of 24 GB per box), with the lowest being under 2 GB used (I run top on the console so I can see the stats, and the time, when it locks up). It just ... stops. The system will even lock up when running in single-user mode, with only / mounted (ZFS not loaded, zpool not imported).

Hardware (alphadrive):
  Chenbro 5U rackmount chassis with 24 hot-swap drive bays
  SuperMicro H8DGi-F motherboard
  AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
  24 GB DDR3-SDRAM
  3x SuperMicro AOC-USAS-L8i SATA controllers (multi-lane break-out cables)
  8x Seagate 7200.12 1.5 TB SATA hard drives
  16x WD RE4 1.0 TB SATA hard drives
  1x Kingston 60 GB SSD (for /, swap, L2ARC)

Hardware (betadrive):
  SuperMicro 4U rackmount chassis with 16 hot-swap drive bays
  SuperMicro H8DGi-F motherboard
  AMD Opteron 2218 CPU (8-cores at 2.0 GHz)
  24 GB DDR3-SDRAM
  2x SuperMicro AOC-USAS2-L8i SATA controllers (multi-lane cables)
  16x WD RE4 2.0 TB SATA hard drives
  1x Kingston 60 GB SSD (for /, swap, L2ARC)

betadrive runs perfectly with FreeBSD 9.0-RELEASE.
alphadrive locks up with FreeBSD 9.0-RELEASE.
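The two pool layouts described above can be sketched as zpool(8)/zfs(8) commands. The pool name `storage` and the da0, da1, ... device names are hypothetical placeholders (the real pool name and disk assignments aren't given here), setting dedup and compression on the top-level dataset is an assumption, and the sketch only prints the commands rather than creating anything.

```shell
#!/bin/sh
# Sketch only: POOL and the daN device names are hypothetical placeholders.
# Nothing is created; the zpool/zfs commands are just printed.
POOL=storage

# vdev_spec GROUPS WIDTH [FIRST]
# Print GROUPS raidz2 vdevs of WIDTH disks each, numbering disks from daFIRST.
vdev_spec() {
    local groups="$1" width="$2" next="${3:-0}" spec="" g=0 d
    while [ "$g" -lt "$groups" ]; do
        spec="$spec raidz2"
        d=0
        while [ "$d" -lt "$width" ]; do
            spec="$spec da$next"
            next=$((next + 1))
            d=$((d + 1))
        done
        g=$((g + 1))
    done
    echo "${spec# }"    # drop the leading space
}

echo "# alphadrive: 24 disks as 4x 6-disk raidz2"
echo "zpool create $POOL $(vdev_spec 4 6)"
echo "# betadrive: 16 disks as 3x 5-disk raidz2 plus one hot spare"
echo "zpool create $POOL $(vdev_spec 3 5) spare da15"
echo "# both (assumed set on the top-level dataset): dedup and LZJB compression"
echo "zfs set dedup=on $POOL"
echo "zfs set compression=lzjb $POOL"
```

Numbering the disks through one running counter makes the spare fall out naturally: three groups of five use da0 through da14, leaving da15 for the spare.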
We're currently investigating the hardware's firmware revisions to see if anything else is different between the two systems.

Has anyone experienced anything similar? Does anyone have any ideas on what to look for? Any suggestions on what to try next?

--
Freddie Cash
fjwcash@gmail.com