Date:      Wed, 21 Sep 2011 11:40:25 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        hackers@FreeBSD.org
Subject:   Re: SW_WATCHDOG vs new eventtimer code
Message-ID:  <4E79A2F9.4010802@FreeBSD.org>
In-Reply-To: <4E78F762.5000906@FreeBSD.org>
References:  <4E78E755.8050404@FreeBSD.org> <4E78F1E7.7020502@FreeBSD.org> <4E78F762.5000906@FreeBSD.org>

Andriy Gapon wrote:
> on 20/09/2011 23:04 Alexander Motin said the following:
>> On 20.09.2011 22:19, Andriy Gapon wrote:
>>> just want to check with you first if the following makes sense.
>>> I use SW_WATCHDOG on one of the test machines, which was recently updated
>>> from stable/8 to head.  Now it gets seemingly random watchdog events.
>>> My theory is that this is because of the eventtimer logic.
>>> If during an idle period we accumulate enough timer ticks and then run all
>>> those ticks very rapidly, then the SW_WATCHDOG code may get the impression
>>> that it was not patted for many real ticks.
>>> Not sure what would be the best way to make SW_WATCHDOG happier/smarter.
>> The eventtimer code is now set to generate interrupts at least 4 times
>> per second for each CPU. Since SW_WATCHDOG only handles periods of more
>> than one second, I would say it should not be hurt. I would try to add
>> some debugging there to see what's going on (how big the tick bursts are).
>> I'll try to do it tomorrow.

I've built a kernel with SW_WATCHDOG and run watchdogd with the tightest
parameters (-s 1 -t 2), but have observed no problems so far.

> Just in case, here is a debugging snippet from a panic that I've got:
> #14 0xffffffff80660ae5 in handleevents (now=0xffffff80e3e0b8b0, fake=0) at
> /usr/src/sys/kern/kern_clocksource.c:209
> 209             while (bintime_cmp(now, &state->nextstat, >=)) {
> (kgdb) list
> 204             }
> 205             if (runs && fake < 2) {
> 206                     hardclock_anycpu(runs, usermode);
> 207                     done = 1;
> 208             }
> 209             while (bintime_cmp(now, &state->nextstat, >=)) {
> 210                     if (fake < 2)
> 211                             statclock(usermode);
> 212                     bintime_add(&state->nextstat, &statperiod);
> 213                     done = 1;
> (kgdb) p state->nextstat
> $1 = {sec = 90, frac = 15986939599958264124}
> (kgdb) p *now
> $3 = {sec = 106, frac = 11494276814354478452}
> (kgdb) p statperiod
> $4 = {sec = 0, frac = 145249953336295682}
> 
> (kgdb) fr 13
> #13 0xffffffff8042603e in hardclock_anycpu (cnt=15761, usermode=Variable
> "usermode" is not available.
> ) at atomic.h:183
> 183     atomic.h: No such file or directory.
>         in atomic.h
> (kgdb) p cnt
> $5 = 15761
> (kgdb) p newticks
> $6 = 15000
> (kgdb) p watchdog_ticks
> $7 = 16000
> 
> Watchdog timeout was set to ~16 seconds.

It looks like your system was stalled for about 15 seconds, or for some
reason the system uptime jumped 15 seconds forward. Were you doing anything
special at that moment, or have you seen anything strange in the system's
behavior? What timecounter are you using? I see you are using the HPET
eventtimer, but on what hardware (is it per-CPU or global)?
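
The 15-second estimate can be checked directly from the two values in the
kgdb output. Below is a minimal user-space sketch, not kernel code: struct bt
is a hypothetical stand-in for the kernel's struct bintime (whole seconds
plus a 64-bit binary fraction), and bt_sub() and bt_to_ms() are illustrative
helpers written for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the kernel's struct bintime:
 * whole seconds plus a 64-bit binary fraction (units of 2^-64 s).
 */
struct bt {
	int64_t  sec;
	uint64_t frac;
};

/* Return later - earlier, borrowing one second if the fraction wraps. */
static struct bt
bt_sub(struct bt later, struct bt earlier)
{
	struct bt d;

	/* This sketch only handles non-negative differences. */
	assert(later.sec > earlier.sec ||
	    (later.sec == earlier.sec && later.frac >= earlier.frac));
	d.sec = later.sec - earlier.sec;
	if (later.frac < earlier.frac)
		d.sec--;			/* borrow from the seconds */
	d.frac = later.frac - earlier.frac;	/* unsigned wrap is intended */
	return (d);
}

/* Convert to whole milliseconds, using the top 32 bits of the fraction. */
static int64_t
bt_to_ms(struct bt t)
{
	return (t.sec * 1000 + (int64_t)(((t.frac >> 32) * 1000) >> 32));
}
```

Feeding it *now = {106, 11494276814354478452} and state->nextstat =
{90, 15986939599958264124} gives a gap of roughly 15.76 seconds. That is in
line with cnt = 15761 in hardclock_anycpu(), assuming hz = 1000 (inferred
from watchdog_ticks = 16000 for a ~16 second timeout).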

Building a kernel with KTR_SPARE2 tracing enabled should help to collect
valuable information about timer behavior before the crash.

-- 
Alexander Motin


