From: Karli Sjöberg <Karli.Sjoberg@slu.se>
To: "Kenneth D. Merry" <ken@freebsd.org>
Date: Wed, 26 Oct 2011 11:36:44 +0200
Cc: "freebsd-scsi@freebsd.org" <freebsd-scsi@freebsd.org>,
	"fs@freebsd.org" <fs@freebsd.org>
Subject: Re: AOC-USAS2-L8i zfs panics and SCSI errors in messages

Hi all,

I tracked down what causes the panics!

I got a tip from aragon and phoenix on the forum about
/etc/periodic/security/100.chksetuid

They suggested putting:
daily_status_security_chksetuid_enable="NO"
into /etc/periodic.conf

I can now run periodic daily without any panics. I'm still wondering about
the cause of this. The explanation from the forum was that this phase is
too demanding for multi-TB systems. But I have several multi-TB servers
with FreeBSD and ZFS, and none of them has ever behaved this way. Besides,
the panic is instantaneous, not degenerative. I imagine that a run like
that would start out OK and then get worse and worse, getting gradually
slower until it just couldn't cope any more and hung. This feels more like
hitting a wall, as if it found something that it couldn't deal with and
had no choice but to panic immediately.

I'm hoping this can be resolved without having to know beforehand about
putting settings into periodic.conf that you couldn't have anticipated.

@Ken
The hard drives are connected with two breakout cables from each controller
to the caddies, using CABMS2FN05 cables from:
http://www.promise.com/single_page_session/page.aspx?region=en-US&m=575&rsn=114

From controller 1 -> channel 1 -> ports 1,2,3 -> ports 1,2,3 in caddie 1
From controller 1 -> channel 2 -> ports 1,2 -> ports 4,5 in caddie 1

From controller 2 -> channel 1 -> ports 1,2,3 -> ports 1,2,3 in caddie 2
From controller 2 -> channel 2 -> ports 1,2 -> ports 4,5 in caddie 2

Is there any problem with that type of cabling?

These timeouts happen with all hard drives at one time or another. Would
that mean that all the cables are bad, or perhaps of poor quality?
Regarding the firmware, they are all running version 1AQ10001. I'm going
to search for known problems with that, and if you know something, you are
welcome to share ;)

Best Regards
Karli Sjöberg

On 25 Oct 2011, at 21:33, Kenneth D. Merry wrote:

On Thu, Oct 20, 2011 at 13:28:17 +0200, Karli Sjöberg wrote:
Hi,

I'm in the process of migrating from a Sun/Oracle system to another
Supermicro/FreeBSD system, doing zfs send/recv between them. Twice now, the
system has panicked while not doing anything at all, and it's throwing a
lot of SCSI/CAM-related errors during IO-intensive operations like
send/recv and resilver, and zpool has sometimes reported read/write errors
on the hard drives. Best part is that the errors in messages concern all
hard drives at one time or another, and they are connected with separate
cables, controllers and caddies. Specs:

HW:
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3,2GHz
4x  2GB 1333MHZ DDR3 ECC UDIMM
10x SAMSUNG HD204UI (in a raidz2 zpool)
1x  OCZ Vertex 3 240GB (L2ARC)

SW:
# uname -a
FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
# zpool get version pool1
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default

I got the panic from the IPMI KVM:
http://i55.tinypic.com/synpzk.png

Looking at the panic, this is a ZFS panic.  Nothing the disks do should
be able to cause ZFS to panic.  ZFS is panicking in avl_add():

/*
 * This is unfortunate.  We want to call panic() here, even for
 * non-DEBUG kernels.  In userland, however, we can't depend on anything
 * in libc or else the rtld build process gets confused.  So, all we can
 * do in userland is resort to a normal ASSERT().
 */
if (avl_find(tree, new_node, &where) != NULL)
#ifdef _KERNEL
	panic("avl_find() succeeded inside avl_add()");
#else
	ASSERT(0);
#endif
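
To illustrate what that check guards against, here is a minimal sketch in
Python (chosen for brevity; this is not the ZFS code, and `ToyTree` with its
`find`/`add` methods is a made-up stand-in). It assumes only what the excerpt
shows: adding a node that compares equal to one already in the tree is
treated as a bug in the caller.

```python
import bisect


class ToyTree:
    """Sorted container standing in for an AVL tree (hypothetical, not ZFS code)."""

    def __init__(self):
        self._keys = []

    def find(self, key):
        """Return the index of key if present, else None (loosely like avl_find())."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        return None

    def add(self, key):
        """Insert key; an equal key already present is a logic error, as in avl_add()."""
        if self.find(key) is not None:
            # The kernel calls panic() at this point; this sketch raises instead.
            raise RuntimeError("find() succeeded inside add()")
        bisect.insort(self._keys, key)


tree = ToyTree()
tree.add(10)
tree.add(20)
try:
    tree.add(10)  # second insert of an equal key
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True
```

In this toy version the check fires on the caller's own bookkeeping (the same
key handed in twice), not on anything a disk returned.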

There are certainly timeouts and two terminated IOCs in the log below.
That does suggest a hardware or driver problem, but it isn't very obvious
what it might be.

I have seen bad behavior with SATA drives behind 3Gb Maxim expanders
talking to 6Gb LSI controllers, but your particular configuration does not
involve any expanders, and therefore is not that particular STP issue.

My best guess, and it is a guess, is that either the drives are misbehaving
(i.e. firmware type problem) or you've got a cabling issue.

If you have more hardware available, you might try swapping out the cables
and/or drives to see if you can reproduce the drive errors with a
different setup.  If you swap the drives, I would use a different brand if
you've got them available.

I'm CCing the fs list, perhaps someone there can look at the stack trace
above and figure out what ZFS might be doing.

Again, ZFS should survive any errors from the drives, and the panic above
looks like ZFS is flagging a logic bug somewhere.


And an extract from /var/log/messages:
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(10). CDB: 2a 0 6 13 66 f 0 0 f 0
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): WRITE(6). CDB: a 0 1 b2 2 0
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): CAM status: SCSI Status Error
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI status: Check Condition
Oct 19 17:37:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 859
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 495
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 725
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 722
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: (1:4:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:40:38 fs2-7 last message repeated 3 times
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 859 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 495
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 495 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 725
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 725 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 722
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 722 complete
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_complete_tm_request: sending deferred task management request for handle 0x0c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x0c SMID 438 complete
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 6 25 4f 75 0 0 b 0
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): WRITE(10). CDB: 2a 0 2d a5 10 ca 0 0 80 0
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): CAM status: SCSI Status Error
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI status: Check Condition
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:45:40 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 636
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 888
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 983
Oct 19 17:45:41 fs2-7 kernel: mps0: (0:1:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:45:41 fs2-7 last message repeated 2 times
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 976 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 636
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 636 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 888
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 888 complete
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 983
Oct 19 17:45:41 fs2-7 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 983 complete
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 a7 2 0 0 3 0
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): WRITE(10). CDB: 2a 0 6 40 b0 9 0 0 9 0
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): CAM status: SCSI Status Error
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI status: Check Condition
Oct 19 17:45:42 fs2-7 kernel: (da1:mps0:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
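
For anyone wanting to summarise an extract like this, here is a rough Python
sketch (an illustration only, not a tool from this thread). It counts the
"SCSI command timeout" lines per controller and disk, and decodes the LBA and
block count from a WRITE(10) CDB as printed by CAM. The function names
`count_timeouts` and `decode_write10` are made up for this example, and the
sample lines are copied from the extract with their wrapped halves joined.

```python
import re
from collections import Counter

# Sample lines taken from the extract; a full run would read /var/log/messages.
SAMPLE = """\
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 859
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 495
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 725
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 722
Oct 19 17:40:38 fs2-7 kernel: (da9:mps1:0:4:0): SCSI command timeout on device handle 0x000c SMID 438
Oct 19 17:40:38 fs2-7 kernel: mps1: (1:4:0) terminated ioc 804b scsi 0 state c xfer 0
Oct 19 17:45:40 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 976
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 636
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 888
Oct 19 17:45:41 fs2-7 kernel: (da1:mps0:0:1:0): SCSI command timeout on device handle 0x000a SMID 983
"""

# Matches "(daN:mpsM:bus:target:lun): SCSI command timeout".
TIMEOUT_RE = re.compile(r"\((da\d+):(mps\d+):\d+:\d+:\d+\): SCSI command timeout")


def count_timeouts(lines):
    """Count timeout messages keyed by (controller, disk)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            disk, controller = match.group(1), match.group(2)
            counts[(controller, disk)] += 1
    return counts


def decode_write10(cdb_text):
    """Return (lba, blocks) from a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = [int(byte, 16) for byte in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length
    return lba, blocks


counts = count_timeouts(SAMPLE.splitlines())
for (controller, disk), n in sorted(counts.items()):
    print(f"{controller} {disk}: {n} timeouts")

lba, blocks = decode_write10("2a 0 6 13 66 f 0 0 f 0")
print(f"WRITE(10) at LBA {lba}, {blocks} blocks")
```

Grouping by controller as well as disk makes it easier to see whether the
timeouts cluster on one HBA or cable, or are spread across both.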

What's going on?

Regards
Karli Sjöberg

Ken
--
Kenneth Merry
ken@FreeBSD.ORG


Kind regards
-------------------------------------------------------------------------------
Karli Sjöberg
Swedish University of Agricultural Sciences
Box 7079 (Visiting Address Kronåsvägen 8)
S-750 07 Uppsala, Sweden
Phone:  +46-(0)18-67 15 66
karli.sjoberg@slu.se