From owner-freebsd-current@FreeBSD.ORG  Sun Aug  9 17:12:42 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
MIME-Version: 1.0
Sender: artemb@gmail.com
In-Reply-To: <ed91d4a80908071106l3951f384r3fa845eda2fcb0d3@mail.gmail.com>
References: <ed91d4a80908071106l3951f384r3fa845eda2fcb0d3@mail.gmail.com>
Date: Sun, 9 Aug 2009 10:12:41 -0700
X-Google-Sender-Auth: 566cff3dc35df880
Message-ID: <ed91d4a80908091012t2a9db9f8v89dfd9c06fa35113@mail.gmail.com>
From: Artem Belevich <fbsdlist@src.cx>
To: freebsd-scsi@freebsd.org, freebsd-current@freebsd.org, 
	Wes Morgan <morganw@chemikals.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: 
Subject: [SOLVED] Re: mpt errors - UNIT ATTENTION asc:29,0
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
X-List-Received-Date: Sun, 09 Aug 2009 17:12:43 -0000

A bit more digging showed that these mpt errors match the
UDMA_CRC_Error_Count reported by the drives via SMART.
Connecting the same drives with the same cables to the ICH7 SATA ports
completely eliminates the errors, which suggests that the drives and
cables themselves are OK.
So my guess is that there's a fair amount of crosstalk between the LSI1068
ports on the motherboard (Asus P5BV/SAS).
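
For anyone who wants to check the same counter, here is a minimal sketch
of pulling UDMA_CRC_Error_Count (SMART attribute 199) out of
`smartctl -A` output. It assumes smartmontools is installed and that the
drives answer SMART queries through the controller; the helper name
crc_count is made up for illustration, and the device names in the usage
comment are the ones from the pool listing.

```shell
#!/bin/sh
# Print the raw value of UDMA_CRC_Error_Count from `smartctl -A`
# output supplied on stdin. In smartctl's attribute table the second
# column is the attribute name and the last column is the raw value.
crc_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $NF }'
}

# Hypothetical usage (may need an extra option such as `-d sat`,
# depending on the controller):
# for d in da0 da1 da2 da3 da4 da5; do
#     printf '%s: ' "$d"
#     smartctl -A "/dev/$d" | crc_count
# done
```

Recording the values before and after a stress run gives the number of
new CRC errors per drive, which can then be compared with the mpt
messages in the log.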

In the end I switched on spread spectrum clocking on the drives
(jumper 1-2 on WD SATA drives) and the errors almost completely
disappeared. I've now had 1 CRC error after ~10TB read/written,
versus the hundreds there used to be.
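
A rough way to compare error rates before and after a change like this
is to tally the UNIT ATTENTION messages per drive from the kernel log.
This is only a sketch: the function name count_unit_attention is made up
for illustration, and it assumes the messages look like the ones quoted
below and end up in the default /var/log/messages.

```shell
#!/bin/sh
# Count "UNIT ATTENTION asc:29,0" kernel messages per device from
# syslog text on stdin. The device name is taken from the
# "(daN:mpt0:...)" prefix of each matching line.
count_unit_attention() {
    awk '/UNIT ATTENTION asc:29,0/ {
        if (match($0, /\(da[0-9]+:/)) {
            # Strip the leading "(" and the trailing ":".
            dev = substr($0, RSTART + 1, RLENGTH - 2)
            n[dev]++
        }
    }
    END { for (d in n) print d, n[d] }' | sort
}

# Hypothetical usage:
# count_unit_attention < /var/log/messages
```

Each output line is a device name followed by the number of matching
messages seen for it.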

What didn't quite work:
* Forcing the drives into 1.5Gb/s mode (jumper 5-6 on WD drives). Errors
became somewhat less frequent, but didn't go away.
* Replacing the SATA cables. I tried three different sets with virtually
no change in error rate.

--Artem



On Fri, Aug 7, 2009 at 11:06 AM, Artem Belevich<artemb@gmail.com> wrote:
> Hi,
>
> I'm running 8.0-BETA2 on an Asus P5BV/SAS with a built-in LSI1068
> controller with 8 SATA ports. 6 of the ports are hooked up to 1TB WD
> Green drives. The drives are used as a single raidz2 ZFS pool:
>
>         NAME        STATE     READ WRITE CKSUM
>         z2          ONLINE       0     0     0
>           raidz2    ONLINE       0     0     0
>             da1     ONLINE       0     0     0
>             da0     ONLINE       0     0     0
>             da2     ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>
> I'm running a simple stress test that copies a 10GB file until it fills
> the volume and then runs "zpool scrub" on it.
>
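> # Create a 10GB file of random data (bs=1m is FreeBSD dd syntax).
> dd if=/dev/urandom of=/z2/f.0 bs=1m count=10240
> # Copy it repeatedly until the pool fills up (bash brace expansion).
> for f in {1..350}; do echo $f; cp /z2/f.$((f-1)) /z2/f.$f; done
> zpool scrub z2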
>
> What concerns me is that I'm periodically getting error messages from
> the mpt driver. They usually start a few hours after the start of the
> script, and by the end of it they are happening every few minutes,
> seemingly randomly, on all six drives.
>
> Aug  7 10:25:32 buz kernel: mpt0: mpt_cam_event: 0x16
> Aug  7 10:25:32 buz kernel: mpt0: mpt_cam_event: 0x16
> Aug  7 10:25:32 buz kernel: (da4:mpt0:0:4:0): READ(10). CDB: 28 0 46 32 97 c0 0 0 80 0
> Aug  7 10:25:32 buz kernel: (da4:mpt0:0:4:0): CAM Status: SCSI Status Error
> Aug  7 10:25:32 buz kernel: (da4:mpt0:0:4:0): SCSI Status: Check Condition
> Aug  7 10:25:32 buz kernel: (da4:mpt0:0:4:0): UNIT ATTENTION asc:29,0
> Aug  7 10:25:32 buz kernel: (da4:mpt0:0:4:0): Power on, reset, or bus device reset occurred
> Aug  7 10:25:32 buz kernel: (da4:mpt0:0:4:0): Retrying Command (per Sense Data)
>
> ZFS scrub does not seem to report any issues so far - no checksum or
> read/write errors. WD's hard drive diagnostics tools didn't find any
> issues with the drives either.
>
> Could somebody shed some light on why such errors would happen? Is it
> some sort of hardware issue? Driver bug? Issue with compatibility
> between controller and the drives? System configuration issue (some
> sysctl/tunable needs tweaking, perhaps)?
>
> I'd appreciate any hints on what could be going on and what should be
> done about it.
>
> Thanks,
> --Artem
>