Date: Sun, 19 Aug 2007 14:56:38 -0700 (PDT) From: Jeff Roberson <jroberson@chesapeake.net> To: Robert Watson <rwatson@freebsd.org> Cc: attilio@freebsd.org, dds@freebsd.org, current@freebsd.org Subject: Re: Crash in accounting code: encode_long(), due to bad rusage data? Message-ID: <20070819145532.O568@10.0.0.1> In-Reply-To: <20070819144903.U568@10.0.0.1> References: <20070819214953.I1099@fledge.watson.org> <20070819144903.U568@10.0.0.1>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 19 Aug 2007, Jeff Roberson wrote:
> On Sun, 19 Aug 2007, Robert Watson wrote:
>
>>
>> Diomidis, Jeff, Attilio,
>>
>> I recently upgraded two servers from FreeBSD 6-STABLE to FreeBSD 7-CURRENT
>> in anticipation of the forthcoming release. Both of them run with
>> accounting enabled at all times. When a large pine session was exiting on
>> one of the two boxes, I ran into the following panic:
>>
>> panic: encode_long: -ve value -32749
>> KDB: enter: panic
>> db> bt
>> Tracing pid 81759 tid 100377 td 0xc4a76400
>> kdb_enter(c0a9f844,1,c0a9ab3f,e6937b10,1,...) at kdb_enter+0x32
>> panic(c0a9ab3f,ffff8013,e6937bb0,e6937c14,c0719dc2,...) at panic+0x124
>> encode_long(e6937bb0,e6937bb8,e6937bc0,e6937bb8,0,...) at encode_long+0x1d
>> acct_process(c4a76400,c0a9c0c2,162,129,e6937c50,...) at acct_process+0x1c2
>> exit1(c4a76400,0,e6937d2c,c0a08303,c4a76400,...) at exit1+0x957
>> sys_exit(c4a76400,e6937cfc,4,e6937d38,c0b4bed8,...) at sys_exit+0x1d
>> syscall(e6937d38) at syscall+0x2b3
>> Xint0x80_syscall() at Xint0x80_syscall+0x20
>> --- syscall (1, FreeBSD ELF32, sys_exit), eip = 0x355f2093, esp =
>> 0xbfbfe0ec, ebp = 0xbfbfe108 ---
>>
>> kgdb was of mixed helpfulness:
>>
>> #2 0xc048d575 in db_command_loop () at /usr/src/sys/ddb/db_command.c:401
>> #3 0xc048ece5 in db_trap (type=3, code=0) at
>> /usr/src/sys/ddb/db_main.c:222
>> #4 0xc0773d56 in kdb_trap (type=3, code=0, tf=0xe6937a90)
>> at /usr/src/sys/kern/subr_kdb.c:502
>> #5 0xc0a08aeb in trap (frame=0xe6937a90) at
>> /usr/src/sys/i386/i386/trap.c:621
>> #6 0xc09ede3b in calltrap () at /usr/src/sys/i386/i386/exception.s:139
>> #7 0xc0773ed2 in kdb_enter (msg=0xc0a9f844 "panic") at cpufunc.h:60
>> #8 0xc074c5d4 in panic (fmt=0xc0a9ab3f "encode_long: -ve value %ld")
>> at /usr/src/sys/kern/kern_shutdown.c:547
>> #9 0xc0719bad in encode_long (val=18) at /usr/src/sys/kern/kern_acct.c:525
>> #10 0xc0719dc2 in acct_process (td=0xc4a76400)
>> at /usr/src/sys/kern/kern_acct.c:391
>> #11 0xc072d7e7 in exit1 (td=0xc4a76400, rv=0)
>> at /usr/src/sys/kern/kern_exit.c:355
>> #12 0xc072e21d in sys_exit (td=Could not find the frame base for
>> "sys_exit".
>> ) at /usr/src/sys/kern/kern_exit.c:98
>> #13 0xc0a08303 in syscall (frame=0xe6937d38)
>> at /usr/src/sys/i386/i386/trap.c:1008
>> #14 0xc09edea0 in Xint0x80_syscall ()
>> at /usr/src/sys/i386/i386/exception.s:196
>> #15 0x00000033 in ?? ()
>> Previous frame inner to this frame (corrupt stack?)
>> (kgdb) frame 9
>> #9 0xc0719bad in encode_long (val=18) at /usr/src/sys/kern/kern_acct.c:525
>> 525 KASSERT(val >= 0, ("encode_long: -ve value %ld", val));
>> (kgdb) print val
>> $1 = 18
>> (kgdb) list
>> 520 encode_long(long val)
>> 521 {
>> 522 int norm_exp; /* Normalized exponent */
>> 523 int shift;
>> 524
>> 525 KASSERT(val >= 0, ("encode_long: -ve value %ld", val));
>> 526 if (val == 0)
>> 527 return (0);
>> 528 norm_exp = fls(val) - 1;
>> 529 shift = FLT_MANT_DIG - norm_exp - 1;
>> (kgdb) up
>> #10 0xc0719dc2 in acct_process (td=0xc4a76400)
>> at /usr/src/sys/kern/kern_acct.c:391
>> 391 acct.ac_mem = encode_long((ru.ru_ixrss +
>> ru.ru_idrss
>
> ru_idrss an integral value of the amount of unshared memory residing
> in the data segment of a process (expressed in units of
> kilobytes * ticks-of-execution).
>
> That seems like it's pretty prone to overflow given that this is a 32bit
> machine. This may just be an improper assert. I'm shocked that noone has
> run into this so far.
Sorry, should elaborate, see below.
>
> Jeff
>
>> +
>> (kgdb) list
>> 386 tmp = ut;
>> 387 timevaladd(&tmp, &st);
>> 388 /* Convert tmp (i.e. u + s) into hz units to match ru_i*.
>> */
>> 389 t = tmp.tv_sec * hz + tmp.tv_usec / tick;
>> 390 if (t)
>> 391 acct.ac_mem = encode_long((ru.ru_ixrss +
>> ru.ru_idrss +
>> 392 + ru.ru_isrss) / t);
>> 393 else
>> 394 acct.ac_mem = 0;
>> 395
>> (kgdb) inspect ru
>> $2 = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0,
>> tv_usec = 0}, ru_maxrss = 509908, ru_ixrss = 26810576,
>> ru_idrss = -2016077424, ru_isrss = 1044992, ru_minflt = 372535,
>> ru_majflt = 42, ru_nswap = 0, ru_inblock = 58, ru_oublock = 278,
>> ru_msgsnd = 1907, ru_msgrcv = 4671, ru_nsignals = 32, ru_nvcsw = 19892,
>> ru_nivcsw = 4614}
>
Can you tell me what's in t, tmp, hz and tick?
Thanks,
Jeff
>
>> I find the large negative value in ru_idrss somewhat sad to contemplate,
>> and while this could well be a problem with capturing of process runtime
>> information, I'd like it if the accounting code were robust against this
>> sort of bug, rather than panicking, and I guess I'd also rather than the
>> process runtime information also be correctly captured :-).
>>
>> I have a usable core dump, and have disabled accounting in the interim.
>>
>> Robert N M Watson
>> Computer Laboratory
>> University of Cambridge
>> _______________________________________________
>> freebsd-current@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
>>
>
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20070819145532.O568>
