From: Alexander Motin
Date: Wed, 21 Sep 2011 11:40:25 +0300
To: Andriy Gapon
Cc: hackers@FreeBSD.org
Subject: Re: SW_WATCHDOG vs new eventtimer code
Message-ID: <4E79A2F9.4010802@FreeBSD.org>
In-Reply-To: <4E78F762.5000906@FreeBSD.org>
References: <4E78E755.8050404@FreeBSD.org> <4E78F1E7.7020502@FreeBSD.org> <4E78F762.5000906@FreeBSD.org>

Andriy Gapon wrote:
> on 20/09/2011 23:04 Alexander Motin said the following:
>> On 20.09.2011 22:19, Andriy Gapon wrote:
>>> just want to check with you first if the following makes sense.
>>> I use SW_WATCHDOG on one of the test machines, which was recently updated
>>> from stable/8 to head. Now it gets seemingly random watchdog events.
>>> My theory is that this is because of the eventtimer logic.
>>> If during an idle period we accumulate enough timer ticks and then run all
>>> those ticks very rapidly, then the SW_WATCHDOG code may get the impression
>>> that it was not patted for many real ticks.
>>> Not sure what would be the best way to make SW_WATCHDOG happier/smarter.
>> The eventtimer code is now set to generate interrupts at least 4 times per
>> second for each CPU. Since SW_WATCHDOG only handles periods longer
>> than one second, I would say it should not be hurt. I would try to add
>> some debug there to see what's going on (how big the tick bursts are).
>> I'll try to do it tomorrow.
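The argument quoted above (bursts capped at a quarter of a second cannot exhaust a timeout of a second or more) can be checked with a toy model. This is only a sketch, not the kernel's SW_WATCHDOG code: `run_watchdog`, `steady_schedule`, and the `HZ = 1000` tick rate are hypothetical names and values chosen for illustration.

```python
HZ = 1000  # assumed tick rate; the thread does not state it


def run_watchdog(timeout_ticks, events):
    """Replay events against a simple countdown watchdog.

    events is a list of ("ticks", n) or ("pat", None) tuples.
    Returns the index of the event at which the watchdog fires, or None.
    """
    remaining = timeout_ticks
    for index, (kind, count) in enumerate(events):
        if kind == "pat":
            # A pat reloads the countdown to the full timeout.
            remaining = timeout_ticks
        else:
            # A burst of ticks is charged against the countdown at once.
            remaining -= count
            if remaining <= 0:
                return index
    return None


def steady_schedule(seconds, burst_ticks, pat_interval_ticks):
    """Deliver ticks in fixed-size bursts, patting every pat_interval_ticks."""
    events = []
    since_pat = 0
    for _ in range(seconds * HZ // burst_ticks):
        events.append(("ticks", burst_ticks))
        since_pat += burst_ticks
        if since_pat >= pat_interval_ticks:
            events.append(("pat", None))
            since_pat = 0
    return events


# Bursts capped at a quarter second, a pat every second, a 2 s timeout.
capped = steady_schedule(seconds=60, burst_ticks=HZ // 4, pat_interval_ticks=HZ)
capped_result = run_watchdog(2 * HZ, capped)

# One uncapped burst larger than the timeout, with no pat in between.
uncapped_result = run_watchdog(2 * HZ, [("ticks", 3 * HZ)])

print("capped bursts fire at:", capped_result)
print("uncapped burst fires at:", uncapped_result)
```

In this model the capped schedule never fires, while a single burst longer than the timeout fires immediately, so a spurious event would need a burst far larger than the quarter-second cap.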
I've built a kernel with SW_WATCHDOG and run watchdogd with the tightest
parameters (-s 1 -t 2), but have observed no problems so far.

> Just in case, here is a debugging snippet from a panic that I've got:
> #14 0xffffffff80660ae5 in handleevents (now=0xffffff80e3e0b8b0, fake=0) at
> /usr/src/sys/kern/kern_clocksource.c:209
> 209         while (bintime_cmp(now, &state->nextstat, >=)) {
> (kgdb) list
> 204         }
> 205         if (runs && fake < 2) {
> 206                 hardclock_anycpu(runs, usermode);
> 207                 done = 1;
> 208         }
> 209         while (bintime_cmp(now, &state->nextstat, >=)) {
> 210                 if (fake < 2)
> 211                         statclock(usermode);
> 212                 bintime_add(&state->nextstat, &statperiod);
> 213                 done = 1;
> (kgdb) p state->nextstat
> $1 = {sec = 90, frac = 15986939599958264124}
> (kgdb) p *now
> $3 = {sec = 106, frac = 11494276814354478452}
> (kgdb) p statperiod
> $4 = {sec = 0, frac = 145249953336295682}
>
> (kgdb) fr 13
> #13 0xffffffff8042603e in hardclock_anycpu (cnt=15761, usermode=Variable
> "usermode" is not available.
> ) at atomic.h:183
> 183     atomic.h: No such file or directory.
>         in atomic.h
> (kgdb) p cnt
> $5 = 15761
> (kgdb) p newticks
> $6 = 15000
> (kgdb) p watchdog_ticks
> $7 = 16000
>
> Watchdog timeout was set to ~16 seconds.

It looks like your system was stalled for about 15 seconds, or for some
reason the system uptime jumped 15 seconds forward. Were you doing anything
special at that moment, or have you seen anything strange in system behavior?
What timecounter are you using? I see you are using the HPET eventtimer, but
on what hardware (is it per-CPU or global)?

Building the kernel with KTR_SPARE2 tracing enabled should help to collect
valuable info about timer behavior before the crash.

-- 
Alexander Motin
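The "about 15 seconds" estimate can be reproduced from the kgdb values quoted above. The sketch below assumes the usual bintime layout, where `frac` counts units of 2^-64 second; the helper names are hypothetical, and the 1000 Hz tick rate is only inferred from `watchdog_ticks = 16000` for a timeout of roughly 16 seconds.

```python
FRAC_ONE = 1 << 64  # assumed: a bintime fraction counts units of 2**-64 s


def bintime_sub(a, b):
    """Return a - b for (sec, frac) pairs, borrowing from sec when needed."""
    sec = a[0] - b[0]
    frac = a[1] - b[1]
    if frac < 0:
        frac += FRAC_ONE
        sec -= 1
    return sec, frac


def bintime_to_seconds(bt):
    """Convert a (sec, frac) pair to floating-point seconds."""
    return bt[0] + bt[1] / FRAC_ONE


# Values copied from the kgdb session.
now = (106, 11494276814354478452)
nextstat = (90, 15986939599958264124)
statperiod = (0, 145249953336295682)
cnt = 15761  # hardclock ticks passed to hardclock_anycpu()

lag = bintime_sub(now, nextstat)
lag_seconds = bintime_to_seconds(lag)

# statclock rate implied by statperiod
stat_hz = 1 / bintime_to_seconds(statperiod)

# Whole stat periods that fit into the lag (integer arithmetic, no rounding)
pending_stat = (lag[0] * FRAC_ONE + lag[1]) // statperiod[1]

# Seconds covered by cnt ticks if hz were 1000 (an inference, see above)
cnt_seconds = cnt / 1000

print(f"now - nextstat: {lag_seconds:.3f} s")
print(f"implied stathz: {stat_hz:.1f}")
print(f"pending stat periods: {pending_stat}")
print(f"cnt at an assumed hz of 1000: {cnt_seconds:.3f} s")
```

Both figures come out at roughly 15.76 seconds, and statperiod corresponds to a statclock rate of about 127 Hz, so the hardclock and statclock backlogs agree with each other on the length of the gap.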