From owner-freebsd-hackers@FreeBSD.ORG  Fri May  7 00:50:25 2010
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id C875D106564A
	for <freebsd-hackers@FreeBSD.org>; Fri,  7 May 2010 00:50:25 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from fallbackmx06.syd.optusnet.com.au
	(fallbackmx06.syd.optusnet.com.au [211.29.132.8])
	by mx1.freebsd.org (Postfix) with ESMTP id C0F308FC0C
	for <freebsd-hackers@FreeBSD.org>; Fri,  7 May 2010 00:50:24 +0000 (UTC)
Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au
	[211.29.132.185])
	by fallbackmx06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o46MvQS6005886
	for <freebsd-hackers@FreeBSD.org>; Fri, 7 May 2010 08:57:26 +1000
Received: from besplex.bde.org (c122-106-171-133.carlnfd1.nsw.optusnet.com.au
	[122.106.171.133])
	by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o46MvLpL001132
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 7 May 2010 08:57:22 +1000
Date: Fri, 7 May 2010 08:57:22 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Kostik Belousov <kostikbel@gmail.com>
In-Reply-To: <20100506134719.GY23646@deviant.kiev.zoral.com.ua>
Message-ID: <20100507074313.L1195@besplex.bde.org>
References: <4BCBA7F8.3060109@natsys-lab.com>
	<20100425145929.GJ2422@deviant.kiev.zoral.com.ua>
	<4BDF2E4C.4090009@natsys-lab.com>
	<20100503163524.GE23646@deviant.kiev.zoral.com.ua>
	<4BE0C075.2040106@natsys-lab.com>
	<20100504212448.GS23646@deviant.kiev.zoral.com.ua>
	<4BE2B93F.4020003@natsys-lab.com>
	<20100506134719.GY23646@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Mailman-Approved-At: Fri, 07 May 2010 02:45:39 +0000
Cc: freebsd-hackers@FreeBSD.org, Alexander Krizhanovsky <ak@natsys-lab.com>,
	ur4ina@gmail.com, bde@FreeBSD.org
Subject: Re: [PATCH] RUSAGE_THREAD
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 07 May 2010 00:50:25 -0000

On Thu, 6 May 2010, Kostik Belousov wrote:

> On Thu, May 06, 2010 at 04:42:39PM +0400, Alexander Krizhanovsky wrote:
>> ...
>> Indeed, I concerned more about PROC_LOCK/PROC_SLOCK - they are acquired
>> both in kern_getrusage() and could be a problem in multithreaded process
>> with intensive performance accounting in threads. This happens for
>> example in resource accounting systems for shared hosting which are
>> solved by custom extensions of Apache, MySQL and other servers.
>
> We still need per-process lock for resource summation code.

It's hard to see how it can be done without any lock at all.  The
oldest example of per-CPU summation code is in vm_meter.c  It doesn't
use any locks, but without any lock at all it is fundamentally broken.
Incoherencies in the data (due to reading and writing the counters at
different unsynchronized times in the past) would be more obvious for
times (except when the unsynchronized accesses give an off-by-2**32
error due to wraparound of a 32-bit counter), so I think getrusage()
should be more careful than vm_meter.  calcru() already does a lot of
work and locking just to synchronize times relative to previous readings.

> On the other
> hand, I do not understand why p_rux have to be protected by spinlock and
> not by process mutex. Hm, spinlock disables preemption, but only for
> curthread,

It is probably mostly historical.  The read-modify-write of pc_switchtime
and some other things used to be protected by splstatclock().  Hmm,
that was the lowest spl level, closer to a non-spin mutex than anything.
In RELENG_4, calcru() doesn't modify switchtime either, and mi_switch()
has comments saying that even the splstatclock() should be unnecessary.
RELENG_4 is of course non-preemptive, so it has close to the equivalent
of critical_enter() automatically.

> Process spinlock is not acquired in the mi_switch(), so I
> really do not see a reason not to change protection to mutex.

mi_switch)() holds thread lock, which is a special type of spinlock.

I think the following philosphy is adequate for reading times: the time
returned shall be no earlier than the time (according to a uniquely
selected clock at the point where the time is read) at the beginning of
the call, and coherent with itself.  (It will be indeterminately far in
the past by the time that it gets back to the caller.  Times read by a
single caller shall be monotonic, but times read by separate callers are
incomparable unless the callers do additional synchronization).  Thus
the locked part of reading a clock is:
- lock sufficiently to synchronize the next step
- update memory copy of the time, so that it is not in the past
- drop locks needed for the update, but keep ones needed to prevent the
   memory copy changing
- read and use the memory copy, reassembling it into the form needed by
   the caller.  This step may be long delayed from the update, but there
   is no point in trying to minimize the delay for this step, since we
   must soon drop all time-related locks and thus lose all control over
   the delay seen by the caller.
Only in very critical kernel environments would it be necessary and
possible to hold strong locks to minimize the delay across the whole
operation of reading the time and acting on the result.

>> I asked about it in my message from 3rd of May and also there I've
>> proposesed following patch for calcru1():
>>
>> -       if ((int64_t)tu < 0) {
>> -               /* XXX: this should be an assert /phk */
>> -               printf("calcru: negative runtime of %jd usec for pid %d
>> (%s)\n",
>> -                   (intmax_t)tu, p->p_pid, p->p_comm);
>> -               tu = ruxp->rux_tu;
>> -       }
>> +       KASSERT((int64_t)tu < 0, ("calcru: negative runtime of %jd usec "
>> +                               "for pid %d (%s)\n",
>> +                               (intmax_t)tu, p->p_pid, p->p_comm));
>>
> I did understand your proposal, but, as I said, I have no opinion.

This patch is mostly backwards:
- backwards condition in KASSERT()
- removal of KNF formatting
- removal of the fixup when the KASSERT() is null.

I think I prefer removing the whole statement.  But the KASSERT() may be
useful for verifying that the failure case really is unreachable due to
programming bugs.  I think it reachable due to hardware bugs, but only
ones that are less likely than a memory parity error, and the handling
of hardware bugs here shouldn't be to panic.

Bruce