From owner-freebsd-current@FreeBSD.ORG  Tue Nov  4 11:20:19 2003
Date: Tue, 4 Nov 2003 14:18:48 -0500 (EST)
From: Robert Watson <rwatson@freebsd.org>
X-Sender: robert@fledge.watson.org
To: Nils Andreas Hakansson <n96andha@midgard.liu.se>
In-Reply-To: <20031104184810.E16804@gandalf.midgard.liu.se>
Message-ID: <Pine.NEB.3.96L.1031104141012.53223E-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: Magnus Dahlstedt <n97magda@midgard.liu.se>
cc: freebsd-current@freebsd.org
Subject: Re: Page fault


On Tue, 4 Nov 2003, Nils Andreas Hakansson wrote:

> I've disabled softupdates because of
> a panic("softdep_move_dependencies: need merge code");

Can't comment on this bit.  Might want to send e-mail to Kirk directly.

> Could someone take a look at this?
> 
> pst: timeout mfa=0x0032d5d0 cmd=0x02
> pst: timeout mfa=0x00336390 cmd=0x02
> pst: timeout mfa=0x0034cdd0 cmd=0x02
> <cut>
> pst: timeout mfa=0x003b7ab0 cmd=0x02
> pst: timeout mfa=0x00396db0 cmd=0x02
> pst: timeout mfa=0x003a3530 cmd=0x02
> pst: timeout mfa=0x00376890 cmd=0x02

This is your storage device getting unhappy, but I'm not really informed
enough on pst to say how or why.  I don't know if it is because the
requests are bad, or because the controller/chain/device is unable to
service the request.

> ufs_access(): Error retrieving ACL on object (5).
> <cut>
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).

This is the UFS ACL code failing closed: it's unable to read the ACLs from
disk because the read returned EIO (error 5, an I/O failure).  This is a
correct response to that scenario.
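
To illustrate what "failing closed" means in general, here is a minimal
user-space sketch.  It is not the actual ufs_access() code: the names
read_acl_sketch() and access_check_sketch(), the struct, and the choice
of EACCES as the denial code are all assumptions made for the example.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical, simplified ACL: a single "read allowed" flag. */
struct acl_sketch {
    int allow_read;
};

/*
 * Hypothetical stand-in for reading an ACL from disk.
 * Returns 0 on success, or EIO when told to simulate a failing disk.
 */
int read_acl_sketch(int simulate_io_error, struct acl_sketch *out)
{
    if (simulate_io_error)
        return EIO;
    out->allow_read = 1;
    return 0;
}

/*
 * Fail closed: if the ACL cannot be retrieved for any reason, deny
 * access instead of guessing.  Returns 0 to allow, EACCES to deny.
 */
int access_check_sketch(int simulate_io_error)
{
    struct acl_sketch acl = { 0 };
    int error = read_acl_sketch(simulate_io_error, &acl);

    if (error != 0) {
        fprintf(stderr, "sketch: could not read ACL (%d), denying\n", error);
        return EACCES;
    }
    return acl.allow_read ? 0 : EACCES;
}
```

Calling access_check_sketch(1) simulates the failing disk and returns a
denial rather than granting access on incomplete information.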

> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; lapic.id = 00000000
> fault virtual address   = 0xae18c0de
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xc066a566
> stack pointer           = 0x10:0xea3a78cc
> frame pointer           = 0x10:0xea3a7900
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 76932 (smbd)
> kernel: type 12 trap, code=0
> Stopped at      generic_bcopy+0x1a:     repe movsl      (%esi),%es:(%edi)
> db> trace
> generic_bcopy(cf6b0000,1a8,2,c06bd12c,0) at generic_bcopy+0x1a
> ffs_getextattr(ea3a7960,ea3a795c,c05159ad,d0346200,184) at ffs_getextattr+0xe0

This appears to be a bug in UFS2's handling of corrupted extended
attribute (EA) data on disk.  We have some changes in the TrustedBSD
development trees to improve resilience to on-disk corruption, but
haven't merged them yet.  Just to confirm, could you use "gdb -k" on a
copy of your kernel with debugging symbols to see where
*ffs_getextattr+0xe0 is (for example, with "list *ffs_getextattr+0xe0")?
For me, it turns up in ffs_vnops.c:1616, which is a variable assignment.
There's a bcopy not far above there, which seems the likely candidate.
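
As a rough illustration of the kind of resilience meant here, the sketch
below validates a length field read from an untrusted buffer before
copying.  It is not the TrustedBSD change and not the real UFS2 EA
layout: the record format (a 4-byte host-endian length followed by the
payload) and the function name copy_record_checked() are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy one record out of "area", a buffer as read from disk and
 * therefore untrusted.  Hypothetical record layout: a 4-byte
 * host-endian length, then that many payload bytes.
 * Returns 0 on success, EINVAL if the record does not fit inside the
 * area, or ERANGE if the destination is too small.
 */
int copy_record_checked(const uint8_t *area, size_t area_len,
    uint8_t *dst, size_t dst_len, size_t *copied)
{
    uint32_t rec_len;

    /* The length field itself must fit in the area. */
    if (area_len < sizeof(rec_len))
        return EINVAL;
    memcpy(&rec_len, area, sizeof(rec_len));

    /* A corrupted length must not walk past the end of the area. */
    if (rec_len > area_len - sizeof(rec_len))
        return EINVAL;

    /* Nor may it overflow the caller's buffer. */
    if (rec_len > dst_len)
        return ERANGE;

    memcpy(dst, area + sizeof(rec_len), rec_len);
    *copied = rec_len;
    return 0;
}
```

With checks like these, a garbage length turns into an error return to
the caller instead of a copy that runs off into unmapped memory.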

> vn_extattr_get(cb1a8c8c,8,2,c06bd12c,ea3a79d0) at vn_extattr_get+0xaa
> ufs_getacl(ea3a7a14,ea3a7a40,c061560b,ea3a7a14,c06df280) at ufs_getacl+0x99
> ufs_vnoperate(ea3a7a14,c06df280,2,a6,c853cd10) at ufs_vnoperate+0x18
> ufs_access(ea3a7a6c,ea3a7b28,c057dcc9,ea3a7a6c,c0716cc8) at ufs_access+0xca
> ufs_vnoperate(ea3a7a6c,c0716cc8,c0716cc8,c853cd10,cb1a8c8c) at ufs_vnoperate+0x18
> vn_open_cred(ea3a7bdc,ea3a7cdc,1a4,d0bb7800,22) at vn_open_cred+0x359
> vn_open(ea3a7bdc,ea3a7cdc,1a4,22,c3ee0fb4) at vn_open+0x30
> kern_open(c853cd10,bfbff130,0,1,1a4) at kern_open+0x143
> open(c853cd10,ea3a7d14,c06c44d0,3ed,3) at open+0x30
> syscall(bfbf002f,82b002f,bfbf002f,bfbffd70,82b3724) at syscall+0x28f
> Xint0x80_syscall() at Xint0x80_syscall+0x1d
> --- syscall (5, FreeBSD ELF32, open), eip = 0x662b5233, esp = 0xbfbff07c, ebp = 0xbfbff098 ---
> db> show locks
> exclusive sleep mutex Giant r = 0 (0xc07115c0) locked @ /usr/src/sys/vm/vm_fault.c:223

Holding Giant here is good.  So to summarize:

- This could be the result of a disk read failure.
- The UFS2 extended attribute code appears to be intolerant of said
  failure, hence the page fault.
- The ACL code failed closed properly, although perhaps not so usefully.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories