From: guy.helmer@gmail.com
To: fa.freebsd.hackers@googlegroups.com
Cc: freebsd-hackers@freebsd.org, freebsd-questions@freebsd.org
Date: Mon, 1 Oct 2012 13:00:40 -0700 (PDT)
Subject: Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
Message-ID: <452f3689-b2ad-43e2-8835-8691b25f75c9@googlegroups.com>
Newsgroups: fa.freebsd.hackers
List-Id: Technical Discussions relating to FreeBSD

On Wednesday, June 6, 2012 8:36:04 PM UTC-5, Mark Felder wrote:
> Hi guys I'm excitedly posting this from my phone.
> Good news for you guys, bad news for us -- we were building HA storage on
> vmware for a client and can now replicate the crash on demand. I'll be
> posting details when I get home to my PC tonight, but this hopefully is
> enough to replicate the crash for any curious followers:
>
> ESXi 5
> 9 or 9-STABLE
> HAST
> 1 cpu is fine
> 1GB of ram
> UFS SUJ on HAST device
> No special loader.conf, sysctl, etc
> No need for VMWare tools
> Run Bonnie++ on the HAST device
>
> We can get the crash to happen on the first run of bonnie++ right now.
> I'll post the exact specs and precise command run in the PR. We found an
> old post from 2004 when we looked up the process state obtained from
> CTRL+T -- flswai -- which describes the symptoms nearly perfectly.
>
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2004-02/0250.html
>
> Hopefully this gets us closer to a fix...

Is this a crash or a hang? Over the past couple of weeks, I've been working
with a FreeBSD 9.1RC1 system under VMware ESXi 5.0 with a 64GB UFS root FS
and 2TB ZFS filesystem mounted via a virtual LSI SAS interface. Sometimes
during heavy I/O load (rsync from other servers) on the ZFS FS, this shows
up in /var/log/messages:

Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 5 ee 60 16 0 1 0 0
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 42 51 0 1 0 0
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 64 51 0 1 0 0
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 66 51 0 1 0 0
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
...
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 41 f3 94 99 0 1 0 0
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): Retrying command

These have been happening roughly every other day.

mpt0 and em0 were sharing int 18, so today I put

hint.mpt.0.msi_enable="1"

into /boot/device.hints and rebooted; now mpt0 is using int 256. I'll see
if it helps.

Guy
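
For anyone reading the retried WRITE(10) commands in the log excerpt above, here is a minimal sketch of decoding the CDB bytes into a starting block address and a transfer length. It relies only on the standard WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte big-endian transfer length in bytes 7-8). The function name decode_write10 and the regular expression are illustrative choices, not part of any FreeBSD tool, and the sketch assumes the CDB bytes are printed as space-separated hex as in the log lines.

```python
import re

# Captures the hex bytes that follow "CDB:" in a CAM error line such as
# "(da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 5 ee 60 16 0 1 0 0"
CDB_RE = re.compile(r"CDB:\s*((?:[0-9a-fA-F]{1,2}\s*)+)")


def decode_write10(line):
    """Return (lba, blocks) for a WRITE(10) CDB found in line, else None.

    decode_write10 is a hypothetical helper written for this example.
    """
    match = CDB_RE.search(line)
    if match is None:
        return None
    cdb = [int(tok, 16) for tok in match.group(1).split()]
    # A WRITE(10) CDB is exactly 10 bytes and starts with opcode 0x2a.
    if len(cdb) != 10 or cdb[0] != 0x2A:
        return None
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    sample = ("Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): "
              "WRITE(10). CDB: 2a 0 5 ee 60 16 0 1 0 0")
    print(decode_write10(sample))
```

Every CDB in the excerpt carries a transfer length of 0x0100, i.e. 256 blocks, which would be 128 KiB per write if the virtual disk uses 512-byte sectors (an assumption; the sector size is not stated in the message).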