From owner-freebsd-hackers@FreeBSD.ORG Fri May 7 00:50:25 2010 Return-Path: Delivered-To: freebsd-hackers@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C875D106564A for ; Fri, 7 May 2010 00:50:25 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx06.syd.optusnet.com.au (fallbackmx06.syd.optusnet.com.au [211.29.132.8]) by mx1.freebsd.org (Postfix) with ESMTP id C0F308FC0C for ; Fri, 7 May 2010 00:50:24 +0000 (UTC) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by fallbackmx06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o46MvQS6005886 for ; Fri, 7 May 2010 08:57:26 +1000 Received: from besplex.bde.org (c122-106-171-133.carlnfd1.nsw.optusnet.com.au [122.106.171.133]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o46MvLpL001132 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 7 May 2010 08:57:22 +1000 Date: Fri, 7 May 2010 08:57:22 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Kostik Belousov In-Reply-To: <20100506134719.GY23646@deviant.kiev.zoral.com.ua> Message-ID: <20100507074313.L1195@besplex.bde.org> References: <4BCBA7F8.3060109@natsys-lab.com> <20100425145929.GJ2422@deviant.kiev.zoral.com.ua> <4BDF2E4C.4090009@natsys-lab.com> <20100503163524.GE23646@deviant.kiev.zoral.com.ua> <4BE0C075.2040106@natsys-lab.com> <20100504212448.GS23646@deviant.kiev.zoral.com.ua> <4BE2B93F.4020003@natsys-lab.com> <20100506134719.GY23646@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Mailman-Approved-At: Fri, 07 May 2010 02:45:39 +0000 Cc: freebsd-hackers@FreeBSD.org, Alexander Krizhanovsky , ur4ina@gmail.com, bde@FreeBSD.org Subject: Re: [PATCH] RUSAGE_THREAD X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 07 May 2010 00:50:25 -0000 On Thu, 6 May 2010, Kostik Belousov wrote: > On Thu, May 06, 2010 at 04:42:39PM +0400, Alexander Krizhanovsky wrote: >> ... >> Indeed, I concerned more about PROC_LOCK/PROC_SLOCK - they are acquired >> both in kern_getrusage() and could be a problem in multithreaded process >> with intensive performance accounting in threads. This happens for >> example in resource accounting systems for shared hosting which are >> solved by custom extensions of Apache, MySQL and other servers. > > We still need per-process lock for resource summation code. It's hard to see how it can be done without any lock at all. The oldest example of per-CPU summation code is in vm_meter.c It doesn't use any locks, but without any lock at all it is fundamentally broken. Incoherencies in the data (due to reading and writing the counters at different unsynchronized times in the past) would be more obvious for times (except when the unsynchronized accesses give an off-by-2**32 error due to wraparound of a 32-bit counter), so I think getrusage() should be more careful than vm_meter. calcru() already does a lot of work and locking just to synchronize times relative to previous readings. > On the other > hand, I do not understand why p_rux have to be protected by spinlock and > not by process mutex. Hm, spinlock disables preemption, but only for > curthread, It is probably mostly historical. The read-modify-write of pc_switchtime and some other things used to be protected by splstatclock(). Hmm, that was the lowest spl level, closer to a non-spin mutex than anything. In RELENG_4, calcru() doesn't modify switchtime either, and mi_switch() has comments saying that even the splstatclock() should be unnecessary. RELENG_4 is of course non-preemptive, so it has close to the equivalent of critical_enter() automatically. > Process spinlock is not acquired in the mi_switch(), so I > really do not see a reason not to change protection to mutex. mi_switch)() holds thread lock, which is a special type of spinlock. I think the following philosphy is adequate for reading times: the time returned shall be no earlier than the time (according to a uniquely selected clock at the point where the time is read) at the beginning of the call, and coherent with itself. (It will be indeterminately far in the past by the time that it gets back to the caller. Times read by a single caller shall be monotonic, but times read by separate callers are incomparable unless the callers do additional synchronization). Thus the locked part of reading a clock is: - lock sufficiently to synchronize the next step - update memory copy of the time, so that it is not in the past - drop locks needed for the update, but keep ones needed to prevent the memory copy changing - read and use the memory copy, reassembling it into the form needed by the caller. This step may be long delayed from the update, but there is no point in trying to minimize the delay for this step, since we must soon drop all time-related locks and thus lose all control over the delay seen by the caller. Only in very critical kernel environments would it be necessary and possible to hold strong locks to minimize the delay across the whole operation of reading the time and acting on the result. >> I asked about it in my message from 3rd of May and also there I've >> proposesed following patch for calcru1(): >> >> - if ((int64_t)tu < 0) { >> - /* XXX: this should be an assert /phk */ >> - printf("calcru: negative runtime of %jd usec for pid %d >> (%s)\n", >> - (intmax_t)tu, p->p_pid, p->p_comm); >> - tu = ruxp->rux_tu; >> - } >> + KASSERT((int64_t)tu < 0, ("calcru: negative runtime of %jd usec " >> + "for pid %d (%s)\n", >> + (intmax_t)tu, p->p_pid, p->p_comm)); >> > I did understand your proposal, but, as I said, I have no opinion. This patch is mostly backwards: - backwards condition in KASSERT() - removal of KNF formatting - removal of the fixup when the KASSERT() is null. I think I prefer removing the whole statement. But the KASSERT() may be useful for verifying that the failure case really is unreachable due to programming bugs. I think it reachable due to hardware bugs, but only ones that are less likely than a memory parity error, and the handling of hardware bugs here shouldn't be to panic. Bruce