From owner-freebsd-hackers@FreeBSD.ORG  Wed May  7 10:47:12 2003
Date: Wed, 7 May 2003 13:47:34 -0400
From: Ali Bahar <alih@internetDog.org>
To: freebsd-hackers@freebsd.org
Message-ID: <20030507134734.A12455@internetDog.org>
References: <20030504113221.A27756@internetDog.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20030504113221.A27756@internetDog.org>;
	from alih@internetDog.org on Sun, May 04, 2003 at 11:32:21AM -0400
Subject: Re: cache_purge > cache_zap segmentation fault
Reply-To: alih@internetDog.org

On Sun, May 04, 2003 at 11:32:21AM -0400, Ali wrote:

> this post may be of interest to people familiar with the filesystem code. 

>   syscall2 > open > vn_open > namei > lookup > ufs_vnoperate > 
>     vfs_cache_lookup > ufs_vnoperate > ufs_lookup > ffs_vget > getnewvnode >
>       cache_purge > cache_zap

The name cache is corrupted. 

Most of the call chains pass through getnewvnode, so a new file is being
opened. The only chain observed without getnewvnode went through
cache_enter instead, so a new cache entry was being created.

I consider it a corruption because a namecache node has a junk value
for nc_src.le_next. This is then dereferenced as the next namecache
node, causing the segmentation fault.


   (gdb) p ncp
   $4 = (struct namecache *) 0xc0d62b40
   (gdb) p *ncp
   $5 = {
     nc_hash = {
       le_next = 0x0, 
       le_prev = 0xc0cd2ae4
     }, 
     nc_src = {
       le_next = 0x117, 
       le_prev = 0xc0002a48
     }, 
     nc_dst = {
       tqe_next = 0x0, 
       tqe_prev = 0xc61f9940
     }, 
     nc_dvp = 0xc61f33c0, 
     nc_vp = 0xc61f98c0, 
     nc_flag = 0 '\0', 
     nc_nlen = 7 '\a', 
     nc_name = 0xc0d62b62 "time.el<FB>\t\b[<FB>\t\bX<FB>\t\bM<FB>\t\bJ<FB>\t\b;<FB>\t\b"
   }


As 'cache_purge > cache_zap' is involved, it may be that a namecache
node deletion has left a deleted node still linked into one of the lists.
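
To catch this sort of damage before the bad pointer is followed, a list
walk can validate each link first. The following is only a userland
sketch, not kernel code: 'struct node', 'in_pool' and
'list_is_consistent' are hypothetical names, and the "pool" bounds check
stands in for whatever test of a plausible kernel address one would use
in practice. It relies on the <sys/queue.h> LIST invariant that an
element's le_prev points at the pointer that leads to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct namecache: a single list linkage. */
struct node {
	LIST_ENTRY(node) link;
	int id;
};
LIST_HEAD(node_head, node);

/* True if p points exactly at one of the n elements of pool. */
static bool
in_pool(const struct node *p, const struct node *pool, size_t n)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t lo = (uintptr_t)pool;

	return a >= lo && a < lo + n * sizeof(*pool) &&
	    (a - lo) % sizeof(*pool) == 0;
}

/*
 * Walk the list, checking every pointer before following it and
 * checking that le_prev points back at the pointer that led here.
 */
static bool
list_is_consistent(struct node_head *head, const struct node *pool, size_t n)
{
	struct node **prevp = &head->lh_first;
	struct node *np = head->lh_first;
	size_t steps = 0;

	while (np != NULL) {
		if (!in_pool(np, pool, n) || ++steps > n)
			return false;	/* junk pointer, or a cycle */
		if (np->link.le_prev != prevp)
			return false;	/* back-link does not match */
		prevp = &np->link.le_next;
		np = np->link.le_next;
	}
	return true;
}

/* Build a 3-node list, verify it, then plant the junk value from the dump. */
static bool
demo_detects_junk(void)
{
	struct node pool[3];
	struct node_head head;

	LIST_INIT(&head);
	for (int i = 0; i < 3; i++) {
		pool[i].id = i;
		LIST_INSERT_HEAD(&head, &pool[i], link);
	}
	assert(list_is_consistent(&head, pool, 3));

	/* Same kind of damage as nc_src.le_next = 0x117 above. */
	pool[1].link.le_next = (struct node *)(uintptr_t)0x117;
	return !list_is_consistent(&head, pool, 3);
}
```

Running such a check after every insert and remove would narrow down
which operation first leaves the list inconsistent, rather than finding
out much later when the junk pointer is followed.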

What I do not know is whether there is a single system-wide name cache
or a per-directory cache linked list (LL). Neither the beastie book
(McKusick et al.) nor the FreeBSD Developers' Handbook seems to cover
this. Knowing the answer would help me determine what the LLs are
supposed to look like, and thereby help diagnose when the LL begins
to go wrong.
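
One hypothesis consistent with the three link fields in the dump
(nc_hash, nc_src, nc_dst) is that each entry sits on three lists at
once: a bucket of one system-wide hash table, a per-directory list
hanging off nc_dvp, and a per-target list hanging off nc_vp. This is a
guess that still has to be checked against vfs_cache.c. The userland
model below sketches it; 'struct fake_vnode', 'struct nc_entry',
'hashtbl', 'model_enter', 'model_zap' and 'model_purge' are all
hypothetical names, and negative entries (nc_vp == NULL) are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

struct nc_entry;
LIST_HEAD(nc_list, nc_entry);
TAILQ_HEAD(nc_tailq, nc_entry);

/* Hypothetical stand-in for struct vnode. */
struct fake_vnode {
	struct nc_list  cache_src;	/* entries for names in this directory */
	struct nc_tailq cache_dst;	/* entries that name this vnode */
};

/* Link fields named after the ones in the gdb dump. */
struct nc_entry {
	LIST_ENTRY(nc_entry)  nc_hash;
	LIST_ENTRY(nc_entry)  nc_src;
	TAILQ_ENTRY(nc_entry) nc_dst;
	struct fake_vnode *nc_dvp;
	struct fake_vnode *nc_vp;
};

#define NBUCKETS 8
static struct nc_list hashtbl[NBUCKETS];	/* assumed global hash table */
static int numcache;

static void
vnode_init(struct fake_vnode *vp)
{
	LIST_INIT(&vp->cache_src);
	TAILQ_INIT(&vp->cache_dst);
}

/* Link a new entry onto all three lists. */
static struct nc_entry *
model_enter(struct fake_vnode *dvp, struct fake_vnode *vp, unsigned hash)
{
	struct nc_entry *ncp = calloc(1, sizeof(*ncp));

	assert(ncp != NULL);
	ncp->nc_dvp = dvp;
	ncp->nc_vp = vp;
	LIST_INSERT_HEAD(&hashtbl[hash % NBUCKETS], ncp, nc_hash);
	LIST_INSERT_HEAD(&dvp->cache_src, ncp, nc_src);
	TAILQ_INSERT_HEAD(&vp->cache_dst, ncp, nc_dst);
	numcache++;
	return ncp;
}

/* Unlink from all three lists before freeing; missing one leaves a
 * freed node reachable from that list. */
static void
model_zap(struct nc_entry *ncp)
{
	LIST_REMOVE(ncp, nc_hash);
	LIST_REMOVE(ncp, nc_src);
	TAILQ_REMOVE(&ncp->nc_vp->cache_dst, ncp, nc_dst);
	numcache--;
	free(ncp);
}

/* Drop every entry that refers to vp, as directory or as target. */
static void
model_purge(struct fake_vnode *vp)
{
	while (vp->cache_src.lh_first != NULL)
		model_zap(vp->cache_src.lh_first);
	while (vp->cache_dst.tqh_first != NULL)
		model_zap(vp->cache_dst.tqh_first);
}

/* Two names in one directory; purge one file, then the directory. */
static int
demo_purge(void)
{
	struct fake_vnode dir, file_a, file_b;
	int after_file;

	vnode_init(&dir);
	vnode_init(&file_a);
	vnode_init(&file_b);
	model_enter(&dir, &file_a, 1);
	model_enter(&dir, &file_b, 2);

	model_purge(&file_a);		/* removes only the entry naming file_a */
	after_file = numcache;
	model_purge(&dir);		/* removes what is left under dir */
	return after_file * 10 + numcache;
}
```

If the real layout is like this, then a junk nc_src.le_next would point
at the per-directory list of nc_dvp as the one to inspect, since the
nc_hash and nc_dst links in the dump look sane.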



> P.P.S. It's been occurring intermittently, and increasingly,
> recently. (Due to its increased prevalence, I even suspected that the
> frequency of kernel crashes might have corrupted the filesystem in a
> way ignorable/imperceptible by fsck/me!)

I no longer think so.
Certainly a 'typical' filesystem corruption would lead to all sorts of
random faults, not the consistent call chains noted above. This
is closer to a 'bug' than to a 'corruption'. Nonetheless, it may still
be (somehow!) caused by me, rather than being a bug in the generic kernel.

regards,
ali
-- 
             Jesus was an Arab.