From owner-freebsd-hackers@FreeBSD.ORG  Mon Oct  1 20:00:41 2012
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D7F4B106564A;
	Mon,  1 Oct 2012 20:00:41 +0000 (UTC)
	(envelope-from guy.helmer@gmail.com)
Received: from mail-vc0-f186.google.com (mail-vc0-f186.google.com
	[209.85.220.186])
	by mx1.freebsd.org (Postfix) with ESMTP id 65EAB8FC0A;
	Mon,  1 Oct 2012 20:00:41 +0000 (UTC)
Received: by vcbfy7 with SMTP id fy7so5085663vcb.13
	for <multiple recipients>; Mon, 01 Oct 2012 13:00:40 -0700 (PDT)
Received: by 10.236.201.134 with SMTP id b6mr2228394yho.15.1349121640824; Mon,
	01 Oct 2012 13:00:40 -0700 (PDT)
Path: glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: fa.freebsd.hackers
Date: Mon, 1 Oct 2012 13:00:40 -0700 (PDT)
In-Reply-To: <fa.WpvXSexDuh60oPevUfqY+fAuWnE@ifi.uio.no>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=216.81.189.9;
	posting-account=UWBIDAoAAADr2Rwohhx_YBn6X-E8F7Gr
NNTP-Posting-Host: 216.81.189.9
References: <fa.AteGcyczS0yepFNHJLTAcaouoeQ@ifi.uio.no>
	<fa.6wX0axVDJXcSbIQcQBtBTui7/9U@ifi.uio.no>
	<fa.NpTOEiPP0T0zl1kPGJMatOK9x+8@ifi.uio.no>
	<fa.h+La4qtNP+efKwHpYLeSk4Y0Kcc@ifi.uio.no>
	<fa.WpvXSexDuh60oPevUfqY+fAuWnE@ifi.uio.no>
User-Agent: G2/1.0
X-Google-Web-Client: true
X-Google-IP: 216.81.189.9
MIME-Version: 1.0
Message-ID: <452f3689-b2ad-43e2-8835-8691b25f75c9@googlegroups.com>
From: guy.helmer@gmail.com
To: fa.freebsd.hackers@googlegroups.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Mailman-Approved-At: Mon, 01 Oct 2012 20:38:55 +0000
Cc: freebsd-hackers@freebsd.org, freebsd-questions@freebsd.org
Subject: Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
X-List-Received-Date: Mon, 01 Oct 2012 20:00:42 -0000

On Wednesday, June 6, 2012 8:36:04 PM UTC-5, Mark Felder wrote:
> Hi guys I'm excitedly posting this from my phone. Good news for you guys,
> bad news for us -- we were building HA storage on vmware for a client and
> can now replicate the crash on demand. I'll be posting details when I get
> home to my PC tonight, but this hopefully is enough to replicate the crash
> for any curious followers:
>
> ESXi 5
> 9 or 9-STABLE
> HAST
> 1 cpu is fine
> 1GB of ram
> UFS SUJ on HAST device
> No special loader.conf, sysctl, etc
> No need for VMWare tools
> Run Bonnie++ on the HAST device
>
> We can get the crash to happen on the first run of bonnie++ right now. I'll
> post the exact specs and precise command run in the PR. We found an old
> post from 2004 when we looked up the process state obtained from CTRL+T --
> flswai -- which describes the symptoms nearly perfectly.
>
>  http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2004-02/0250.html
>
> Hopefully this gets us closer to a fix...

Is this a crash or a hang? Over the past couple of weeks, I've been working
with a FreeBSD 9.1-RC1 system under VMware ESXi 5.0 with a 64GB UFS root
filesystem and a 2TB ZFS filesystem mounted via a virtual LSI SAS interface.
Sometimes, during heavy I/O load (rsync from other servers) on the ZFS
filesystem, messages like these show up in /var/log/messages:

Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 5 ee 60 16 0 1 0 0
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 42 51 0 1 0 0
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 64 51 0 1 0 0
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 66 51 0 1 0 0
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
...
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 41 f3 94 99 0 1 0 0
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): Retrying command

These have been happening roughly every other day.
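
For anyone wanting to see which blocks the failed writes targeted, the CDB bytes in the log lines above can be unpacked by hand. The sketch below assumes the standard 10-byte WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte big-endian transfer length in bytes 7-8); the function name decode_write10 is made up for this example and is not part of any FreeBSD tool.

```python
def decode_write10(cdb_text):
    """Decode a WRITE(10) CDB printed as space-separated hex bytes.

    Returns (lba, blocks). Assumes the standard SCSI WRITE(10) layout;
    the helper name is hypothetical, chosen for this sketch only.
    """
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB: %r" % cdb_text)
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7..8: transfer length in blocks
    return lba, blocks


# CDBs copied from the log excerpt above.
for text in ("2a 0 5 ee 60 16 0 1 0 0",
             "2a 0 3 ef 42 51 0 1 0 0",
             "2a 0 41 f3 94 99 0 1 0 0"):
    lba, blocks = decode_write10(text)
    print("LBA 0x%08x, %d blocks" % (lba, blocks))
```

Every CDB in the excerpt carries a transfer length of 0x0100, i.e. 256 blocks per write. If the virtual disk uses 512-byte sectors (an assumption, the log does not say), that is 128 KiB per request.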

mpt0 and em0 were sharing IRQ 18, so today I put
hint.mpt.0.msi_enable="1"
into /boot/device.hints and rebooted; now mpt0 is using IRQ 256. I'll see
if it helps.
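
To judge whether a change like this helps, one option is to count the "SCSI status: Busy" events per day and per device before and after the reboot. The following is a minimal sketch, assuming the syslog line format shown in the excerpt above; the regular expression and the name busy_per_day are illustrative only, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
#   Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
# The pattern is an assumption based on the excerpt, not a general syslog parser.
BUSY_RE = re.compile(
    r"^(\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: "
    r"\((\w+):mpt\d+:[\d:]+\): SCSI status: Busy"
)


def busy_per_day(lines):
    """Count Busy events keyed by (date string, device name)."""
    counts = Counter()
    for line in lines:
        m = BUSY_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


SAMPLE = [
    "Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error",
    "Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy",
    "Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): Retrying command",
    "Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy",
]

for (day, dev), n in sorted(busy_per_day(SAMPLE).items()):
    print(day, dev, n)
```

Against a real log, the same function can be fed an open file, for example busy_per_day(open("/var/log/messages")), since it only iterates over lines.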

Guy