From owner-freebsd-current@FreeBSD.ORG  Sun Aug 19 20:54:27 2007
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7D96E16A419;
	Sun, 19 Aug 2007 20:54:27 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 2EEB913C45D;
	Sun, 19 Aug 2007 20:54:27 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id E1E8B4A3A0;
	Sun, 19 Aug 2007 16:54:26 -0400 (EDT)
Date: Sun, 19 Aug 2007 21:54:26 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: dds@FreeBSD.org, jeff@FreeBSD.org, attilio@FreeBSD.org
Message-ID: <20070819214953.I1099@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: current@FreeBSD.org
Subject: Crash in accounting code: encode_long(), due to bad rusage data?
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 19 Aug 2007 20:54:27 -0000


Diomidis, Jeff, Attilio,

I recently upgraded two servers from FreeBSD 6-STABLE to FreeBSD 7-CURRENT in 
anticipation of the forthcoming release.  Both of them run with accounting 
enabled at all times.  When a large pine session was exiting on one of the two 
boxes, I ran into the following panic:

panic: encode_long: -ve value -32749
KDB: enter: panic
db> bt
Tracing pid 81759 tid 100377 td 0xc4a76400
kdb_enter(c0a9f844,1,c0a9ab3f,e6937b10,1,...) at kdb_enter+0x32
panic(c0a9ab3f,ffff8013,e6937bb0,e6937c14,c0719dc2,...) at panic+0x124
encode_long(e6937bb0,e6937bb8,e6937bc0,e6937bb8,0,...) at encode_long+0x1d
acct_process(c4a76400,c0a9c0c2,162,129,e6937c50,...) at acct_process+0x1c2
exit1(c4a76400,0,e6937d2c,c0a08303,c4a76400,...) at exit1+0x957
sys_exit(c4a76400,e6937cfc,4,e6937d38,c0b4bed8,...) at sys_exit+0x1d
syscall(e6937d38) at syscall+0x2b3
Xint0x80_syscall() at Xint0x80_syscall+0x20
--- syscall (1, FreeBSD ELF32, sys_exit), eip = 0x355f2093, esp = 0xbfbfe0ec, 
ebp = 0xbfbfe108 ---

kgdb was of mixed helpfulness:

#2  0xc048d575 in db_command_loop () at /usr/src/sys/ddb/db_command.c:401
#3  0xc048ece5 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:222
#4  0xc0773d56 in kdb_trap (type=3, code=0, tf=0xe6937a90)
     at /usr/src/sys/kern/subr_kdb.c:502
#5  0xc0a08aeb in trap (frame=0xe6937a90) at /usr/src/sys/i386/i386/trap.c:621
#6  0xc09ede3b in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#7  0xc0773ed2 in kdb_enter (msg=0xc0a9f844 "panic") at cpufunc.h:60
#8  0xc074c5d4 in panic (fmt=0xc0a9ab3f "encode_long: -ve value %ld")
     at /usr/src/sys/kern/kern_shutdown.c:547
#9  0xc0719bad in encode_long (val=18) at /usr/src/sys/kern/kern_acct.c:525
#10 0xc0719dc2 in acct_process (td=0xc4a76400)
     at /usr/src/sys/kern/kern_acct.c:391
#11 0xc072d7e7 in exit1 (td=0xc4a76400, rv=0)
     at /usr/src/sys/kern/kern_exit.c:355
#12 0xc072e21d in sys_exit (td=Could not find the frame base for "sys_exit".
) at /usr/src/sys/kern/kern_exit.c:98
#13 0xc0a08303 in syscall (frame=0xe6937d38)
     at /usr/src/sys/i386/i386/trap.c:1008
#14 0xc09edea0 in Xint0x80_syscall ()
     at /usr/src/sys/i386/i386/exception.s:196
#15 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 9
#9  0xc0719bad in encode_long (val=18) at /usr/src/sys/kern/kern_acct.c:525
525             KASSERT(val >= 0,  ("encode_long: -ve value %ld", val));
(kgdb) print val
$1 = 18
(kgdb) list
520     encode_long(long val)
521     {
522             int norm_exp;   /* Normalized exponent */
523             int shift;
524
525             KASSERT(val >= 0,  ("encode_long: -ve value %ld", val));
526             if (val == 0)
527                     return (0);
528             norm_exp = fls(val) - 1;
529             shift = FLT_MANT_DIG - norm_exp - 1;
(kgdb) up
#10 0xc0719dc2 in acct_process (td=0xc4a76400)
     at /usr/src/sys/kern/kern_acct.c:391
391                     acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss +
(kgdb) list
386             tmp = ut;
387             timevaladd(&tmp, &st);
388             /* Convert tmp (i.e. u + s) into hz units to match ru_i*. */
389             t = tmp.tv_sec * hz + tmp.tv_usec / tick;
390             if (t)
391                     acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss +
392                         + ru.ru_isrss) / t);
393             else
394                     acct.ac_mem = 0;
395
(kgdb) inspect ru
$2 = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0,
     tv_usec = 0}, ru_maxrss = 509908, ru_ixrss = 26810576,
   ru_idrss = -2016077424, ru_isrss = 1044992, ru_minflt = 372535,
   ru_majflt = 42, ru_nswap = 0, ru_inblock = 58, ru_oublock = 278,
   ru_msgsnd = 1907, ru_msgrcv = 4671, ru_nsignals = 32, ru_nvcsw = 19892,
   ru_nivcsw = 4614}

I find the large negative value in ru_idrss somewhat sad to contemplate, and 
while this could well be a problem with capturing of process runtime 
information, I'd like it if the accounting code were robust against this sort 
of bug, rather than panicking, and I guess I'd also rather than the process 
runtime information also be correctly captured :-).

I have a usable core dump, and have disabled accounting in the interim.

Robert N M Watson
Computer Laboratory
University of Cambridge