From owner-freebsd-arch@FreeBSD.ORG Sun May 27 06:28:36 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BAA9E16A469 for ; Sun, 27 May 2007 06:28:36 +0000 (UTC) (envelope-from peterjeremy@optushome.com.au) Received: from turion.vk2pj.dyndns.org (c220-239-3-125.belrs4.nsw.optusnet.com.au [220.239.3.125]) by mx1.freebsd.org (Postfix) with ESMTP id 4901B13C46A for ; Sun, 27 May 2007 06:28:36 +0000 (UTC) (envelope-from peterjeremy@optushome.com.au) Received: from turion.vk2pj.dyndns.org (localhost.vk2pj.dyndns.org [127.0.0.1]) by turion.vk2pj.dyndns.org (8.14.1/8.14.1) with ESMTP id l4R6SL3v021766; Sun, 27 May 2007 16:28:21 +1000 (EST) (envelope-from peter@turion.vk2pj.dyndns.org) Received: (from peter@localhost) by turion.vk2pj.dyndns.org (8.14.1/8.14.1/Submit) id l4R6SJcb021765; Sun, 27 May 2007 16:28:19 +1000 (EST) (envelope-from peter) Date: Sun, 27 May 2007 16:28:19 +1000 From: Peter Jeremy To: "M. Warner Losh" Message-ID: <20070527062819.GX1992@turion.vk2pj.dyndns.org> References: <86fy5wkim5.fsf@dwp.des.no> <20070517094445.GD1149@turion.vk2pj.dyndns.org> <20070522.230026.1611667537.imp@bsdimp.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="LKTjZJSUETSlgu2t" Content-Disposition: inline In-Reply-To: <20070522.230026.1611667537.imp@bsdimp.com> X-PGP-Key: http://members.optusnet.com.au/peterjeremy/pubkey.asc User-Agent: Mutt/1.5.15 (2007-04-06) Cc: des@des.no, antinvidia@gmail.com, freebsd-arch@freebsd.org Subject: Re: A problem with the select(2) interface X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 27 May 2007 06:28:36 -0000 --LKTjZJSUETSlgu2t Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2007-May-22 23:00:26 -0400, "M. Warner Losh" wrote: >Index: select.2 >=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >RCS file: /cache/ncvs/src/lib/libc/sys/select.2,v >retrieving revision 1.33 >diff -u -r1.33 select.2 >--- select.2 9 Jan 2007 00:28:15 -0000 1.33 >+++ select.2 23 May 2007 03:00:14 -0000 >@@ -222,3 +222,6 @@ > by the > .Fn select > system call. >+.Fx=20 >+does not modify the return value, which can cause problems for applicatio= ns >+ported from other systems. That would make me happy. --=20 Peter Jeremy --LKTjZJSUETSlgu2t Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (FreeBSD) iD8DBQFGWSUD/opHv/APuIcRAqEbAJ4uCXZhWVD7Fx15p7Yvn9G58fw/LQCfeMOi YlFIwLK27IS0/fSyW47W+CE= =Alq8 -----END PGP SIGNATURE----- --LKTjZJSUETSlgu2t-- From owner-freebsd-arch@FreeBSD.ORG Mon May 28 13:32:00 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5900416A46C for ; Mon, 28 May 2007 13:32:00 +0000 (UTC) (envelope-from rink@thunderstone.rink.nu) Received: from mx1.rink.nu (thunderstone.rink.nu [80.112.228.34]) by mx1.freebsd.org (Postfix) with ESMTP id 0E79C13C480 for ; Mon, 28 May 2007 13:31:59 +0000 (UTC) (envelope-from rink@thunderstone.rink.nu) Received: from localhost (localhost [127.0.0.1]) by mx1.rink.nu (Postfix) with ESMTP id 33F996D457 for ; Mon, 28 May 2007 15:01:06 +0200 (CEST) X-Virus-Scanned: amavisd-new at rink.nu Received: from mx1.rink.nu ([127.0.0.1]) by localhost (thunderstone.rink.nu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id we5BG582TeFD for ; Mon, 28 May 2007 15:01:01 +0200 (CEST) Received: from thunderstone.rink.nu (localhost [127.0.0.1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.rink.nu (Postfix) with ESMTP id BF9386D433 for ; Mon, 28 May 2007 15:01:01 +0200 (CEST) Received: (from rink@localhost) by thunderstone.rink.nu (8.13.8/8.13.8/Submit) id l4SD11wc008181 for arch@FreeBSD.org; Mon, 28 May 2007 15:01:01 +0200 (CEST) (envelope-from rink) Date: Mon, 28 May 2007 15:01:01 +0200 From: Rink Springer To: arch@FreeBSD.org Message-ID: <20070528130101.GD48357@rink.nu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.13 (2006-08-11) Cc: Subject: FreeBSD/xen structure X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2007 13:32:00 -0000 Hi everyone, As I've just mailed to current@, work is well underway on the Xen porting effort. However, as not only I but a lot of people will want to see this work integrated into CURRENT at some point, I'd like to raise a discussion on the directory layout I'm using. It has not changed from Kip Macy's perforce tree, but I want to ensure that this will be suitable for inclusion in the tree. Basically, i386-xen (it's i386 only for now) is a sub-architecture just like pc98. The layout is the following: i386-xen/ Xen main tree compile/ Compile tree conf/ Kernel configs i386-xen/ Low-level code, comparable to i386/i386/ include/ Include files - most include their i386/include/ version, but some are different or extended. This is basically the machine-dependant stuff. Should a port of amd64-xen happen in the future, it would go using a simular directory layout. Xen-dependant but architecture-independant drivers (such as the Xen block device drivers) are put in the dev/xen directory. It should be possible to use these drivers in a amd64-xen version as well. I'd prefer to keep Xen in a i386-xen tree, as there are quite a lot of changes, comparable to the amd64 <-> i386 split. And I am sure we are not really in favour for douzens of #ifdef XEN's in the tree. Are there any questions, comments, remarks etc. on this layout? You can inspect the work in perforce (//projects/xen3); currently, none of my changes have been committed, but the layout is the same. Thanks, -- Rink P.W. Springer - http://rink.nu "It is such a quiet thing, to fall. But yet a far more terrible thing, to admit it." - Darth Traya From owner-freebsd-arch@FreeBSD.ORG Mon May 28 13:42:38 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6974B16A4A9 for ; Mon, 28 May 2007 13:42:38 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 07BAE13C448 for ; Mon, 28 May 2007 13:42:37 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.64.3]) by phk.freebsd.dk (Postfix) with ESMTP id 574E917380; Mon, 28 May 2007 13:42:36 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.1/8.14.1) with ESMTP id l4SDh5Ap002521; Mon, 28 May 2007 13:43:06 GMT (envelope-from phk@critter.freebsd.dk) To: Rink Springer From: "Poul-Henning Kamp" In-Reply-To: Your message of "Mon, 28 May 2007 15:01:01 +0200." <20070528130101.GD48357@rink.nu> Date: Mon, 28 May 2007 13:43:05 +0000 Message-ID: <2520.1180359785@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: arch@FreeBSD.org Subject: Re: FreeBSD/xen structure X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2007 13:42:38 -0000 In message <20070528130101.GD48357@rink.nu>, Rink Springer writes: >Xen-dependant but architecture-independant drivers (such as the Xen >block device drivers) are put in the dev/xen directory. Presumably, sys/contrib/xen will be used to the extent we import code from the Xen repo ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Mon May 28 13:50:43 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5330516A478; Mon, 28 May 2007 13:50:43 +0000 (UTC) (envelope-from rink@thunderstone.rink.nu) Received: from mx1.rink.nu (thunderstone.rink.nu [80.112.228.34]) by mx1.freebsd.org (Postfix) with ESMTP id 126ED13C487; Mon, 28 May 2007 13:50:43 +0000 (UTC) (envelope-from rink@thunderstone.rink.nu) Received: from localhost (localhost [127.0.0.1]) by mx1.rink.nu (Postfix) with ESMTP id 546376D452; Mon, 28 May 2007 15:49:16 +0200 (CEST) X-Virus-Scanned: amavisd-new at rink.nu Received: from mx1.rink.nu ([127.0.0.1]) by localhost (thunderstone.rink.nu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Ji5I2oqYsZ8k; Mon, 28 May 2007 15:49:12 +0200 (CEST) Received: from thunderstone.rink.nu (localhost [127.0.0.1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.rink.nu (Postfix) with ESMTP id 0CAF06D433; Mon, 28 May 2007 15:49:12 +0200 (CEST) Received: (from rink@localhost) by thunderstone.rink.nu (8.13.8/8.13.8/Submit) id l4SDnBGe011203; Mon, 28 May 2007 15:49:11 +0200 (CEST) (envelope-from rink) Date: Mon, 28 May 2007 15:49:11 +0200 From: Rink Springer To: Poul-Henning Kamp Message-ID: <20070528134911.GA11031@rink.nu> References: <20070528130101.GD48357@rink.nu> <2520.1180359785@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2520.1180359785@critter.freebsd.dk> User-Agent: Mutt/1.5.13 (2006-08-11) Cc: arch@FreeBSD.org, Rink Springer Subject: Re: FreeBSD/xen structure X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 May 2007 13:50:43 -0000 On Mon, May 28, 2007 at 01:43:05PM +0000, Poul-Henning Kamp wrote: > In message <20070528130101.GD48357@rink.nu>, Rink Springer writes: > > >Xen-dependant but architecture-independant drivers (such as the Xen > >block device drivers) are put in the dev/xen directory. > > Presumably, sys/contrib/xen will be used to the extent we import > code from the Xen repo ? Yes - any GPL-ed stuff should end up there, of course. -- Rink P.W. Springer - http://rink.nu "It is such a quiet thing, to fall. But yet a far more terrible thing, to admit it." - Darth Traya From owner-freebsd-arch@FreeBSD.ORG Tue May 29 05:44:49 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9381316A4C0; Tue, 29 May 2007 05:44:49 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from mrout1-b.corp.dcn.yahoo.com (mrout1-b.corp.dcn.yahoo.com [216.109.112.27]) by mx1.freebsd.org (Postfix) with ESMTP id 60B3713C447; Tue, 29 May 2007 05:44:49 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from minion.local.neville-neil.com (proxy8.corp.yahoo.com [216.145.48.13]) by mrout1-b.corp.dcn.yahoo.com (8.13.8/8.13.8/y.out) with ESMTP id l4T5YPMY007660; Mon, 28 May 2007 22:34:26 -0700 (PDT) Date: Tue, 29 May 2007 10:55:54 +0900 Message-ID: From: gnn@freebsd.org To: Rink Springer In-Reply-To: <20070528130101.GD48357@rink.nu> References: <20070528130101.GD48357@rink.nu> User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.7 Emacs/22.0.95 (i386-apple-darwin8.8.2) MULE/5.0 (SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Cc: arch@freebsd.org Subject: Re: FreeBSD/xen structure X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 05:44:49 -0000 At Mon, 28 May 2007 15:01:01 +0200, Rink Springer wrote: > > Hi everyone, > > As I've just mailed to current@, work is well underway on the Xen > porting effort. However, as not only I but a lot of people will want to > see this work integrated into CURRENT at some point, I'd like to raise a > discussion on the directory layout I'm using. It has not changed from > Kip Macy's perforce tree, but I want to ensure that this will be > suitable for inclusion in the tree. > > Basically, i386-xen (it's i386 only for now) is a sub-architecture just > like pc98. The layout is the following: > > i386-xen/ Xen main tree > compile/ Compile tree > conf/ Kernel configs > i386-xen/ Low-level code, comparable to i386/i386/ > include/ Include files - most include their i386/include/ > version, but some are different or extended. > > This is basically the machine-dependant stuff. Should a port of amd64-xen > happen in the future, it would go using a simular directory layout. > > Xen-dependant but architecture-independant drivers (such as the Xen > block device drivers) are put in the dev/xen directory. It should be > possible to use these drivers in a amd64-xen version as well. > > I'd prefer to keep Xen in a i386-xen tree, as there are quite a lot of > changes, comparable to the amd64 <-> i386 split. And I am sure we are > not really in favour for douzens of #ifdef XEN's in the tree. > > Are there any questions, comments, remarks etc. on this layout? You can > inspect the work in perforce (//projects/xen3); currently, none of my > changes have been committed, but the layout is the same. This looks correct to me, and I think following the pc98 "sub-architecture" model is the right way to go. Thanks for working on this! Please let us know when it's ready to play with. Thanks, George From owner-freebsd-arch@FreeBSD.ORG Tue May 29 05:55:48 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 91F5416A549; Tue, 29 May 2007 05:55:48 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 476A213C448; Tue, 29 May 2007 05:55:48 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.13.8/8.13.4) with ESMTP id l4T5svvS025363; Mon, 28 May 2007 23:54:57 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Mon, 28 May 2007 23:55:13 -0600 (MDT) Message-Id: <20070528.235513.420517798.imp@bsdimp.com> To: rink@freebsd.org From: "M. Warner Losh" In-Reply-To: <20070528130101.GD48357@rink.nu> References: <20070528130101.GD48357@rink.nu> X-Mailer: Mew version 4.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Mon, 28 May 2007 23:54:57 -0600 (MDT) Cc: arch@freebsd.org Subject: Re: FreeBSD/xen structure X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 05:55:48 -0000 In message: <20070528130101.GD48357@rink.nu> Rink Springer writes: : Hi everyone, : : As I've just mailed to current@, work is well underway on the Xen : porting effort. However, as not only I but a lot of people will want to : see this work integrated into CURRENT at some point, I'd like to raise a : discussion on the directory layout I'm using. It has not changed from : Kip Macy's perforce tree, but I want to ensure that this will be : suitable for inclusion in the tree. : : Basically, i386-xen (it's i386 only for now) is a sub-architecture just : like pc98. The layout is the following: : : i386-xen/ Xen main tree : compile/ Compile tree : conf/ Kernel configs : i386-xen/ Low-level code, comparable to i386/i386/ : include/ Include files - most include their i386/include/ : version, but some are different or extended. : : This is basically the machine-dependant stuff. Should a port of amd64-xen : happen in the future, it would go using a simular directory layout. : : Xen-dependant but architecture-independant drivers (such as the Xen : block device drivers) are put in the dev/xen directory. It should be : possible to use these drivers in a amd64-xen version as well. : : I'd prefer to keep Xen in a i386-xen tree, as there are quite a lot of : changes, comparable to the amd64 <-> i386 split. And I am sure we are : not really in favour for douzens of #ifdef XEN's in the tree. How pervasive are these diffs? And to what extent are you able to reuse the i386 stuff? Can you characterize, say, the number of lines and/or files that would be affected if you went the ifdef route? It sounds like the tree is right, but that has its own cost and shouldn't be undertaken lightly. : Are there any questions, comments, remarks etc. on this layout? You can : inspect the work in perforce (//projects/xen3); currently, none of my : changes have been committed, but the layout is the same. I like the layout, assuming it is needed. Warner From owner-freebsd-arch@FreeBSD.ORG Tue May 29 18:01:43 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 34D6A16A400 for ; Tue, 29 May 2007 18:01:43 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id D93DA13C448 for ; Tue, 29 May 2007 18:01:42 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TI1enX042960 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO) for ; Tue, 29 May 2007 14:01:41 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 11:01:36 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: arch@freebsd.org Message-ID: <20070529105856.L661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 18:01:43 -0000 I'm working with Attilio to break down rusage further to be per-thread in places where it is protected by the global scheduler lock. To support this, I am interested in moving the rlimit cpulimit check into userret(), or perhaps ast(). Is there any reason why we need to check this on every context switch? Any objections to moving it? Eventually it will require a different lock from the one we obtain to call mi_switch(). Jeff From owner-freebsd-arch@FreeBSD.ORG Tue May 29 18:12:02 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4D3B116A421 for ; Tue, 29 May 2007 18:12:02 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 3FBB913C48A for ; Tue, 29 May 2007 18:12:02 +0000 (UTC) (envelope-from bright@elvis.mu.org) Received: by elvis.mu.org (Postfix, from userid 1192) id C89381A4D84; Tue, 29 May 2007 11:13:10 -0700 (PDT) Date: Tue, 29 May 2007 11:13:10 -0700 From: Alfred Perlstein To: Jeff Roberson Message-ID: <20070529181310.GP21795@elvis.mu.org> References: <20070529105856.L661@10.0.0.1> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20070529105856.L661@10.0.0.1> User-Agent: Mutt/1.4.2.2i Cc: arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 18:12:02 -0000 * Jeff Roberson [070529 11:07] wrote: > I'm working with Attilio to break down rusage further to be per-thread in > places where it is protected by the global scheduler lock. To support > this, I am interested in moving the rlimit cpulimit check into userret(), > or perhaps ast(). Is there any reason why we need to check this on every > context switch? Any objections to moving it? Eventually it will require > a different lock from the one we obtain to call mi_switch(). Er, as long as it's checked each place where we can issue a signal so a SIGXCPU can be sent when tsleep/cv_wait_sig... right? -Alfred From owner-freebsd-arch@FreeBSD.ORG Tue May 29 18:58:12 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B00B816A476 for ; Tue, 29 May 2007 18:58:12 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.freebsd.org (Postfix) with ESMTP id 2C94C13C48C for ; Tue, 29 May 2007 18:58:11 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l4TIw0kY071297; Tue, 29 May 2007 14:58:00 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Tue, 29 May 2007 14:56:38 -0400 User-Agent: KMail/1.9.6 References: <20070529105856.L661@10.0.0.1> In-Reply-To: <20070529105856.L661@10.0.0.1> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705291456.38515.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 29 May 2007 14:58:00 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/3323/Tue May 29 08:10:43 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 18:58:12 -0000 On Tuesday 29 May 2007 02:01:36 pm Jeff Roberson wrote: > I'm working with Attilio to break down rusage further to be per-thread in > places where it is protected by the global scheduler lock. To support > this, I am interested in moving the rlimit cpulimit check into userret(), > or perhaps ast(). Is there any reason why we need to check this on every > context switch? Any objections to moving it? Eventually it will require > a different lock from the one we obtain to call mi_switch(). I think using a per-process spin lock (or a pool of spin locks) would be a good first step. I wouldn't do anything more complicated unless the simple approach doesn't work. The only reason to not move the check into userret() would be if one is worried about threads chewing up CPU time while they are in the kernel w/o bouncing out to userland. Also, it matters which one happens more often (userret() vs mi_switch()). If on average threads perform multiple system calls during a single time slice (no idea if this is true or not), then moving the check to userret() would actually hurt performance. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue May 29 19:04:13 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A1A8916A400 for ; Tue, 29 May 2007 19:04:13 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 622FA13C45A for ; Tue, 29 May 2007 19:04:13 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 9822D17380; Tue, 29 May 2007 19:04:11 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.1/8.14.1) with ESMTP id l4TJ4gnd032970; Tue, 29 May 2007 19:04:42 GMT (envelope-from phk@critter.freebsd.dk) To: Jeff Roberson From: "Poul-Henning Kamp" In-Reply-To: Your message of "Tue, 29 May 2007 11:01:36 MST." <20070529105856.L661@10.0.0.1> Date: Tue, 29 May 2007 19:04:42 +0000 Message-ID: <32969.1180465482@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 19:04:13 -0000 In message <20070529105856.L661@10.0.0.1>, Jeff Roberson writes: >I'm working with Attilio to break down rusage further to be per-thread in >places where it is protected by the global scheduler lock. To support >this, I am interested in moving the rlimit cpulimit check into userret(), >or perhaps ast(). Is there any reason why we need to check this on every >context switch? Any objections to moving it? Eventually it will require >a different lock from the one we obtain to call mi_switch(). For all I care, we can do the cpu limit check once per second only. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Tue May 29 19:04:49 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9A81616A468 for ; Tue, 29 May 2007 19:04:49 +0000 (UTC) (envelope-from kip.macy@gmail.com) Received: from ik-out-1112.google.com (ik-out-1112.google.com [66.249.90.179]) by mx1.freebsd.org (Postfix) with ESMTP id 0F62613C469 for ; Tue, 29 May 2007 19:04:48 +0000 (UTC) (envelope-from kip.macy@gmail.com) Received: by ik-out-1112.google.com with SMTP id c21so848512ika for ; Tue, 29 May 2007 12:04:47 -0700 (PDT) DKIM-Signature: a=rsa-sha1; c=relaxed/relaxed; d=gmail.com; s=beta; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=qADNHAqynDBrhEJquUCEfpfqps80Jg3RIfjz1BiffAX3zMxRs3gqneHz/8gYPiq6jvAsAHgQ7m/x0qStX7V270e+DN2y+/BR2dPzyjuv9He/qaOeoucm+bIGQ+SOdHP4OXOljkuyWgfbd4+Y4UXyvk01hI59fPU1+fAcOdgwu8s= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=LsPQrmMNMBZCZxmrbN7XIhwjQf9rpLIuPUPSH6hwuIW1wklX9Utmook9wYNtCCq5NCuMqJ9nUgadXntP5Cs9STAUA3kR48RUpusaboydoFEyUIIv5sAZp8U2oBRRqo9deJVqY8+p929xhYUeOM2gG4NwX5nzX+o1VqaywpCbJis= Received: by 10.78.138.6 with SMTP id l6mr2081186hud.1180465487505; Tue, 29 May 2007 12:04:47 -0700 (PDT) Received: by 10.78.107.13 with HTTP; Tue, 29 May 2007 12:04:47 -0700 (PDT) Message-ID: Date: Tue, 29 May 2007 12:04:47 -0700 From: "Kip Macy" To: "John Baldwin" In-Reply-To: <200705291456.38515.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 19:04:49 -0000 > I think using a per-process spin lock (or a pool of spin locks) would be a > good first step. I wouldn't do anything more complicated unless the simple > approach doesn't work. The only reason to not move the check into userret() > would be if one is worried about threads chewing up CPU time while they are > in the kernel w/o bouncing out to userland. Also, it matters which one > happens more often (userret() vs mi_switch()). If on average threads perform > multiple system calls during a single time slice (no idea if this is true or > not), then moving the check to userret() would actually hurt performance. Processes can certainly make numerous system calls within a single time slice. However, in userret it would be protected by a per process or per thread blocking mutex as opposed to a global spin mutex. It would be surprising if it isn't a net win, although it is quite possible that on a 2-way system the extra locking could have an adverse effect on some workloads. -Kip From owner-freebsd-arch@FreeBSD.ORG Tue May 29 19:19:40 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7532116A46B; Tue, 29 May 2007 19:19:40 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 322B213C447; Tue, 29 May 2007 19:19:40 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TJJbxk068785 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 15:19:38 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 12:19:33 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: John Baldwin In-Reply-To: <200705291456.38515.jhb@freebsd.org> Message-ID: <20070529121653.P661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 19:19:40 -0000 On Tue, 29 May 2007, John Baldwin wrote: > On Tuesday 29 May 2007 02:01:36 pm Jeff Roberson wrote: >> I'm working with Attilio to break down rusage further to be per-thread in >> places where it is protected by the global scheduler lock. To support >> this, I am interested in moving the rlimit cpulimit check into userret(), >> or perhaps ast(). Is there any reason why we need to check this on every >> context switch? Any objections to moving it? Eventually it will require >> a different lock from the one we obtain to call mi_switch(). > > I think using a per-process spin lock (or a pool of spin locks) would be a > good first step. I wouldn't do anything more complicated unless the simple > approach doesn't work. The only reason to not move the check into userret() > would be if one is worried about threads chewing up CPU time while they are > in the kernel w/o bouncing out to userland. Also, it matters which one > happens more often (userret() vs mi_switch()). If on average threads perform > multiple system calls during a single time slice (no idea if this is true or > not), then moving the check to userret() would actually hurt performance. The problem with using a pool or per-process spinlock is that it keeps the contention in the process domain, rather than thread domain. For multithreaded processes this will give the same contention as a global scheduler lock, only slightly reduced in scope. I'd like to solve this in such a way that we don't have to revisit it again. I think I'm going to make the rusage struct per-thread and aggregate it on demand. There will be a lot of code churn, but it will be simple. There are a few cases where which will be complicated, and cpulimit is one of them. Jeff > > -- > John Baldwin > From owner-freebsd-arch@FreeBSD.ORG Tue May 29 20:04:16 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D21B016A4E1 for ; Tue, 29 May 2007 20:04:16 +0000 (UTC) (envelope-from me@johnea.net) Received: from mail.johnea.net (johnea.net [70.167.123.7]) by mx1.freebsd.org (Postfix) with ESMTP id 53B7A13C5A0 for ; Tue, 29 May 2007 20:04:16 +0000 (UTC) (envelope-from me@johnea.net) Received: from [192.168.100.211] (unknown [192.168.100.211]) by mail.johnea.net (Postfix) with ESMTP id D1C40A5D for ; Tue, 29 May 2007 13:05:49 -0700 (PDT) Message-ID: <465C8767.5020504@johnea.net> Date: Tue, 29 May 2007 13:04:55 -0700 From: johnea User-Agent: Mozilla Thunderbird 1.0.7 (X11/20051002) X-Accept-Language: en-us, en MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <4655C238.7000809@johnea.net> <20070524171930.GA348@lor.one-eyed-alien.net> In-Reply-To: <20070524171930.GA348@lor.one-eyed-alien.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: i386? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 20:04:17 -0000 Hi Again, Ever have one of those moments where your fingers type before your brain gets involved? Most people do, at least occasionally. That brings me to the subject of my previous post. Given just how stupid this question was, I was quite amazed at the polite nature of the replies. It's clear to me now that i386 is just a name for all x86 architectures, not a specific processor. So THANK YOU for your patient responses! And Thanx Again for all of your hard work on FreeBSD. ciao... johnea > On Thu, May 24, 2007 at 09:50:00AM -0700, johnea wrote: > >> Hi, >> >> Why doesn't the FreeBSD project support a 32bit intel architecture newer >> than 386? > > > FreeBSD does support all x86 platforms newer than the 80386. We do not > support the 80386 at all. For historical reasons we call the platform > i386. Changing the name would be difficult and largely pointless i386. > From owner-freebsd-arch@FreeBSD.ORG Tue May 29 20:11:43 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 42F7016A4CF for ; Tue, 29 May 2007 20:11:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail20.syd.optusnet.com.au (mail20.syd.optusnet.com.au [211.29.132.201]) by mx1.freebsd.org (Postfix) with ESMTP id 02D6813C4C3 for ; Tue, 29 May 2007 20:11:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-225-63.carlnfd3.nsw.optusnet.com.au (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail20.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4TKBKlM008040 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 06:11:24 +1000 Date: Wed, 30 May 2007 06:11:21 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Jeff Roberson In-Reply-To: <20070529105856.L661@10.0.0.1> Message-ID: <20070530044158.H93160@delplex.bde.org> References: <20070529105856.L661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 20:11:43 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > I'm working with Attilio to break down rusage further to be per-thread in > places where it is protected by the global scheduler lock. To support this, > I am interested in moving the rlimit cpulimit check into userret(), or > perhaps ast(). Is there any reason why we need to check this on every > context switch? Any objections to moving it? Eventually it will require a > different lock from the one we obtain to call mi_switch(). Checking it on every userret() would be a good pessimization, and scheduling ast()'s to check it occasionally wouldn't be far behind. Syscalls occur many times more often than context switches. Locking is needed for the updating the runtime in mi_switch, so it seems best to check for the runtime becoming too large there while holding the same lock. This lock is currently sched_lock and it covers the following other fields: Fields in in the proc struct (or attached to it, not counting the thread struct): p_stats->p_ru.ru_nvcsw p_stats->p_ru.ru_nivcsw p_stats->p_ru.rux_runtime (this) p_stats->p_ru.rux_uticks p_stats->p_ru.rux_iticks p_stats->p_ru.rux_sticks p_cpulimit (less volatile than the others) p_sflag (needs sched_lock more than the others?) p_flag (KSE only, invalid? -- needs PROC lock) Related fields in the pcpu: switchtime switchticks Related fields in the thread struct: td_uticks td_sticks td_iticks td_flags Unrelated fields in the thread struct: td_critnest (in a KASSERT() only) td_owepreempt (in a KASSERT() only) td_generation td_sched (not yet proerly locked) td_priority (this and following only in CTR*()) td_inhibitors td_lockname td_wmesg I don't see why you emphasized the cpulimit check, since it is the least delicate part of the rusage maintainance done by mi_check(). I don't see how the locking requirements can be reduced significantly by keeping the rusage in the thread struct more of the time (than the tick counts already are) and only accumulating it into the proc struct when needed. There must be a context switch of `switchtime' here, and this seems to require sched locking. It doesn't help to accumulate the runtime into the thread struct instead of into the rusage -- sched locking still seems to be required to switch the runtime. The core of this switch is: new_switchtime = cpu_ticks(); delta_runtime = new_switchtime - PCPU_GET(switchtime); PCPU_SET(switchtime, new_switchtime); and this must be done without letting any other thread run on the CPU. This seems to require sched_lock. After determining delta_runtime, we can add it into either thread data for accumulation later or to more global data (currently rusage). This part doesn't need such strong locking. We currently add it to rusage since we already hold the lock for this, and then checking the limit here also has low cost. Note that in the above code, keeping `switchtime' in the pcpu is an optimization (for space and perhaps for clarity). It could be kept in the thread that owns the CPU. Then its locking requirements might be clearer -- it must be switched at the same time as td_oncpu. The latter clearly requires sched_lock. The code for the latter has moved from mi_switch() sched_switch(). Maybe you emphasized the cpulimit check because this point wasn't clear -- things that strictly required sched_lock were supposed to have all been moved to sched_switch()? p_sflag and td_flag are only needed if we detect that the cpulimit has been exceeded (to schedule an AST). Moving the limit check to an AST would still require sched locking for setting td_flag. The cpulimit has a low resolution, so it doesn't need to be checked very often. The 4BSD scheduler could easily check it every second or so in the main loop of the scheduler. Accesses in the CTR*()s and KASSERT()s complicate the analysis and removing locks -- you don't want to have to acquire a lock just to debug, and the debugging code also gets in the way of understanding the locking, not to mention the function. Bruce From owner-freebsd-arch@FreeBSD.ORG Tue May 29 20:25:58 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B58DD16A46C for ; Tue, 29 May 2007 20:25:58 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 8309F13C455 for ; Tue, 29 May 2007 20:25:58 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TKPtme093099 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 16:25:56 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 13:25:51 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070530044158.H93160@delplex.bde.org> Message-ID: <20070529132155.S661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <20070530044158.H93160@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 20:25:58 -0000 On Wed, 30 May 2007, Bruce Evans wrote: > On Tue, 29 May 2007, Jeff Roberson wrote: > >> I'm working with Attilio to break down rusage further to be per-thread in >> places where it is protected by the global scheduler lock. To support >> this, I am interested in moving the rlimit cpulimit check into userret(), >> or perhaps ast(). Is there any reason why we need to check this on every >> context switch? Any objections to moving it? Eventually it will require a >> different lock from the one we obtain to call mi_switch(). > > Checking it on every userret() would be a good pessimization, and > scheduling ast()'s to check it occasionally wouldn't be far behind. > Syscalls occur many times more often than context switches. Yes, you're right. I think I'll make a callout that will check once per second. > > Locking is needed for the updating the runtime in mi_switch, so it > seems best to check for the runtime becoming too large there while > holding the same lock. This lock is currently sched_lock and it covers > the following other fields: I think you missed the point of my threadlock patch. There will no longer be a global scheduler lock. Because of this, multiple threads in a process may context switch at the same time. The list below will be protected by a thread lock or a per-process spinlock depending on the use. > > Fields in in the proc struct (or attached to it, not counting the > thread struct): > > p_stats->p_ru.ru_nvcsw > p_stats->p_ru.ru_nivcsw > p_stats->p_ru.rux_runtime (this) > p_stats->p_ru.rux_uticks > p_stats->p_ru.rux_iticks > p_stats->p_ru.rux_sticks > p_cpulimit (less volatile than the others) > p_sflag (needs sched_lock more than the others?) > p_flag (KSE only, invalid? -- needs PROC lock) > > Related fields in the pcpu: > > switchtime > switchticks > > Related fields in the thread struct: > > td_uticks > td_sticks > td_iticks > td_flags > > Unrelated fields in the thread struct: > > td_critnest (in a KASSERT() only) > td_owepreempt (in a KASSERT() only) > td_generation > td_sched (not yet proerly locked) > td_priority (this and following only in CTR*()) > td_inhibitors > td_lockname > td_wmesg > > I don't see why you emphasized the cpulimit check, since it is the > least delicate part of the rusage maintainance done by mi_check(). I > don't see how the locking requirements can be reduced significantly > by keeping the rusage in the thread struct more of the time (than the > tick counts already are) and only accumulating it into the proc struct > when needed. There must be a context switch of `switchtime' here, and > this seems to require sched locking. It doesn't help to accumulate > the runtime into the thread struct instead of into the rusage -- sched > locking still seems to be required to switch the runtime. The core of > this switch is: If we move to a per-thread rusage the only time we have to aggregate and deal with it on a per-process basis is at thread or process exit, and when we check the process runtime to check resource limits. That's why I'm interested in this case. The others are more straightforward. This one must move out of mi_switch() however, or be adjusted with atomics or similar. Jeff > > new_switchtime = cpu_ticks(); > delta_runtime = new_switchtime - PCPU_GET(switchtime); > PCPU_SET(switchtime, new_switchtime); > > and this must be done without letting any other thread run on the CPU. > This seems to require sched_lock. After determining delta_runtime, > we can add it into either thread data for accumulation later or to > more global data (currently rusage). This part doesn't need such > strong locking. We currently add it to rusage since we already hold > the lock for this, and then checking the limit here also has low cost. > > Note that in the above code, keeping `switchtime' in the pcpu is an > optimization (for space and perhaps for clarity). It could be kept > in the thread that owns the CPU. Then its locking requirements might > be clearer -- it must be switched at the same time as td_oncpu. The > latter clearly requires sched_lock. The code for the latter has moved > from mi_switch() sched_switch(). Maybe you emphasized the cpulimit > check because this point wasn't clear -- things that strictly required > sched_lock were supposed to have all been moved to sched_switch()? > > p_sflag and td_flag are only needed if we detect that the cpulimit has > been exceeded (to schedule an AST). Moving the limit check to an AST > would still require sched locking for setting td_flag. > > The cpulimit has a low resolution, so it doesn't need to be checked > very often. The 4BSD scheduler could easily check it every second or > so in the main loop of the scheduler. > > Accesses in the CTR*()s and KASSERT()s complicate the analysis and removing > locks -- you don't want to have to acquire a lock just to debug, and the > debugging code also gets in the way of understanding the locking, not to > mention the function. > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Tue May 29 20:45:44 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0DC3816A421 for ; Tue, 29 May 2007 20:45:44 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outZ.internet-mail-service.net (outZ.internet-mail-service.net [216.240.47.249]) by mx1.freebsd.org (Postfix) with ESMTP id ECBAF13C46A for ; Tue, 29 May 2007 20:45:43 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.32) with ESMTP; Tue, 29 May 2007 13:45:43 -0700 Received: from julian-mac.elischer.org (nat.ironport.com [63.251.108.100]) by idiom.com (Postfix) with ESMTP id E9B7F125B4B; Tue, 29 May 2007 13:45:42 -0700 (PDT) Message-ID: <465C90F5.1000906@elischer.org> Date: Tue, 29 May 2007 13:45:41 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Jeff Roberson References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> In-Reply-To: <20070529121653.P661@10.0.0.1> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 20:45:44 -0000 Jeff Roberson wrote: > > On Tue, 29 May 2007, John Baldwin wrote: > >> On Tuesday 29 May 2007 02:01:36 pm Jeff Roberson wrote: >>> I'm working with Attilio to break down rusage further to be >>> per-thread in >>> places where it is protected by the global scheduler lock. To support >>> this, I am interested in moving the rlimit cpulimit check into >>> userret(), >>> or perhaps ast(). Is there any reason why we need to check this on >>> every >>> context switch? Any objections to moving it? Eventually it will >>> require >>> a different lock from the one we obtain to call mi_switch(). >> >> I think using a per-process spin lock (or a pool of spin locks) would >> be a >> good first step. I wouldn't do anything more complicated unless the >> simple >> approach doesn't work. The only reason to not move the check into >> userret() >> would be if one is worried about threads chewing up CPU time while >> they are >> in the kernel w/o bouncing out to userland. Also, it matters which one >> happens more often (userret() vs mi_switch()). If on average threads >> perform >> multiple system calls during a single time slice (no idea if this is >> true or >> not), then moving the check to userret() would actually hurt performance. > > The problem with using a pool or per-process spinlock is that it keeps > the contention in the process domain, rather than thread domain. For > multithreaded processes this will give the same contention as a global > scheduler lock, only slightly reduced in scope. I'd like to solve this > in such a way that we don't have to revisit it again. > > I think I'm going to make the rusage struct per-thread and aggregate it > on demand. There will be a lot of code churn, but it will be simple. > There are a few cases where which will be complicated, and cpulimit is > one of them. So, there should be somewhere to store the aggregated stats from threads that have already exited. > > Jeff > >> >> -- >> John Baldwin >> > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Tue May 29 21:10:34 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 30FDF16A400; Tue, 29 May 2007 21:10:34 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail33.syd.optusnet.com.au (mail33.syd.optusnet.com.au [211.29.132.104]) by mx1.freebsd.org (Postfix) with ESMTP id AA91F13C44C; Tue, 29 May 2007 21:10:33 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-225-63.carlnfd3.nsw.optusnet.com.au (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail33.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4TLALPK016030 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 07:10:22 +1000 Date: Wed, 30 May 2007 07:10:23 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Jeff Roberson In-Reply-To: <20070529121653.P661@10.0.0.1> Message-ID: <20070530065423.H93410@delplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 21:10:34 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > The problem with using a pool or per-process spinlock is that it keeps the > contention in the process domain, rather than thread domain. For > multithreaded processes this will give the same contention as a global > scheduler lock, only slightly reduced in scope. I'd like to solve this in > such a way that we don't have to revisit it again. > > I think I'm going to make the rusage struct per-thread and aggregate it on > demand. There will be a lot of code churn, but it will be simple. There are Ugh. > a few cases where which will be complicated, and cpulimit is one of them. No, cpulimit is simple because it can be fuzzy, unlike calcru() which require the rusage to be up to date. I see how rusage accumulation can help for everything _except_ the runtime and tick counts (i.e., for stuff updated by statclock()). For the runtime and tick counts, the possible savings seem to be small and negative. calcru() would have to run the accumulation code and the accumulation code would have to acquire something like sched_lock to transfer the per-thread data (since the lock for updating that data is something like sched_lock). This is has the same locking overheads and larger non-locking overheads than accumulating the runtime directly into the rusage at context switch time -- calcru() needs to acquire something like sched_lock either way. Bruce From owner-freebsd-arch@FreeBSD.ORG Tue May 29 21:18:40 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 622D816A400; Tue, 29 May 2007 21:18:40 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 06E7713C45B; Tue, 29 May 2007 21:18:39 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TLIbPK009179 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 17:18:38 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 14:18:32 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070530065423.H93410@delplex.bde.org> Message-ID: <20070529141342.D661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 21:18:40 -0000 On Wed, 30 May 2007, Bruce Evans wrote: > On Tue, 29 May 2007, Jeff Roberson wrote: > >> The problem with using a pool or per-process spinlock is that it keeps the >> contention in the process domain, rather than thread domain. For >> multithreaded processes this will give the same contention as a global >> scheduler lock, only slightly reduced in scope. I'd like to solve this in >> such a way that we don't have to revisit it again. >> >> I think I'm going to make the rusage struct per-thread and aggregate it on >> demand. There will be a lot of code churn, but it will be simple. There >> are > > Ugh. > >> a few cases where which will be complicated, and cpulimit is one of them. > > No, cpulimit is simple because it can be fuzzy, unlike calcru() which require > the rusage to be up to date. cpulimit is complicated because it requires aggregate statistics from all threads like rusage. It may be queried infrequently however. It's just one of the few cases where we actually examine the values as if we still only have one thread per process. > > I see how rusage accumulation can help for everything _except_ the > runtime and tick counts (i.e., for stuff updated by statclock()). For > the runtime and tick counts, the possible savings seem to be small and > negative. calcru() would have to run the accumulation code and the > accumulation code would have to acquire something like sched_lock to > transfer the per-thread data (since the lock for updating that data > is something like sched_lock). This is has the same locking overheads > and larger non-locking overheads than accumulating the runtime directly > into the rusage at context switch time -- calcru() needs to acquire > something like sched_lock either way. Yes, it will make calcru() more expensive. However, this should be infrequent relative to context switches. It's only used for calls to getrusage(), fill_kinfo_proc(), and certain clock_gettime() calls. The thing that will protect mi_switch() is not process global. I want to keep process global locks out of mi_switch() or we reduce concurrency for multi-threaded applications. Jeff > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Tue May 29 21:38:36 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7EA7D16A421 for ; Tue, 29 May 2007 21:38:36 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by mx1.freebsd.org (Postfix) with ESMTP id 0598D13C447 for ; Tue, 29 May 2007 21:38:35 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4TLcASW012615 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 07:38:21 +1000 Date: Wed, 30 May 2007 07:38:11 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Julian Elischer In-Reply-To: <465C90F5.1000906@elischer.org> Message-ID: <20070530072219.G11288@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <465C90F5.1000906@elischer.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 21:38:36 -0000 On Tue, 29 May 2007, Julian Elischer wrote: > Jeff Roberson wrote: >> I think I'm going to make the rusage struct per-thread and aggregate it on >> demand. There will be a lot of code churn, but it will be simple. There >> are a few cases where which will be complicated, and cpulimit is one of >> them. > > So, there should be somewhere to store the aggregated stats from threads that > have already exited. We already have that. It is the per-process rusage. There is already delayed accumulation for tick counts (these are currently accumulated in the rusage from the thread at context switch time). There is also delayed conversion of stats to the form needed by getrusage(). Stats are kept in raw form as long as possible (forever if nothing calls getrusage() or wait[34]() to look at them) to avoid conversion overhead for them on every exit(). So there are many precedents for delayed stats handling. Bruce From owner-freebsd-arch@FreeBSD.ORG Tue May 29 21:42:03 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BD8FC16A469 for ; Tue, 29 May 2007 21:42:03 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.freebsd.org (Postfix) with ESMTP id 258B013C455 for ; Tue, 29 May 2007 21:42:02 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.8/8.13.8) with ESMTP id l4TLfohl072298; Tue, 29 May 2007 17:41:50 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Jeff Roberson Date: Tue, 29 May 2007 17:37:24 -0400 User-Agent: KMail/1.9.6 References: <20070529105856.L661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> In-Reply-To: <20070529141342.D661@10.0.0.1> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200705291737.25355.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 29 May 2007 17:41:50 -0400 (EDT) X-Virus-Scanned: ClamAV 0.88.3/3323/Tue May 29 08:10:43 2007 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 21:42:03 -0000 On Tuesday 29 May 2007 05:18:32 pm Jeff Roberson wrote: > On Wed, 30 May 2007, Bruce Evans wrote: > > I see how rusage accumulation can help for everything _except_ the > > runtime and tick counts (i.e., for stuff updated by statclock()). For > > the runtime and tick counts, the possible savings seem to be small and > > negative. calcru() would have to run the accumulation code and the > > accumulation code would have to acquire something like sched_lock to > > transfer the per-thread data (since the lock for updating that data > > is something like sched_lock). This is has the same locking overheads > > and larger non-locking overheads than accumulating the runtime directly > > into the rusage at context switch time -- calcru() needs to acquire > > something like sched_lock either way. > > Yes, it will make calcru() more expensive. However, this should be > infrequent relative to context switches. It's only used for calls to > getrusage(), fill_kinfo_proc(), and certain clock_gettime() calls. > > The thing that will protect mi_switch() is not process global. I want to > keep process global locks out of mi_switch() or we reduce concurrency for > multi-threaded applications. I still think it would be wise to try the simple approach first and only engage in further complexity if it is warranted. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue May 29 21:50:17 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7A8D316A400; Tue, 29 May 2007 21:50:17 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 3908313C455; Tue, 29 May 2007 21:50:17 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TLoAFn018017 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 17:50:15 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 14:50:05 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: John Baldwin In-Reply-To: <200705291737.25355.jhb@freebsd.org> Message-ID: <20070529144657.T661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <200705291737.25355.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 21:50:17 -0000 On Tue, 29 May 2007, John Baldwin wrote: > On Tuesday 29 May 2007 05:18:32 pm Jeff Roberson wrote: >> On Wed, 30 May 2007, Bruce Evans wrote: >>> I see how rusage accumulation can help for everything _except_ the >>> runtime and tick counts (i.e., for stuff updated by statclock()). For >>> the runtime and tick counts, the possible savings seem to be small and >>> negative. calcru() would have to run the accumulation code and the >>> accumulation code would have to acquire something like sched_lock to >>> transfer the per-thread data (since the lock for updating that data >>> is something like sched_lock). This is has the same locking overheads >>> and larger non-locking overheads than accumulating the runtime directly >>> into the rusage at context switch time -- calcru() needs to acquire >>> something like sched_lock either way. >> >> Yes, it will make calcru() more expensive. However, this should be >> infrequent relative to context switches. It's only used for calls to >> getrusage(), fill_kinfo_proc(), and certain clock_gettime() calls. >> >> The thing that will protect mi_switch() is not process global. I want to >> keep process global locks out of mi_switch() or we reduce concurrency for >> multi-threaded applications. > > I still think it would be wise to try the simple approach first and only > engage in further complexity if it is warranted. I have indirectly shown that this approach will not yield sufficient results by decreasing the scope of the sched lock in other ways. This would gate context switches in the same way that a global scheduler lock would, except not over as long of a period. Moving stats to be per-thread really is not very complicated, and very likely optimizes the common case even in the absence of increased concurrency. We require fewer indirections for all stats increments in this way and also touch fewer cache lines in mi_switch(). Jeff > > -- > John Baldwin > From owner-freebsd-arch@FreeBSD.ORG Tue May 29 21:57:22 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0A1DB16A474 for ; Tue, 29 May 2007 21:57:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id C211813C4DB for ; Tue, 29 May 2007 21:57:21 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TLvH2N020001 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 17:57:18 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 14:57:12 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070530072219.G11288@besplex.bde.org> Message-ID: <20070529145508.U661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <465C90F5.1000906@elischer.org> <20070530072219.G11288@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Julian Elischer , freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 21:57:22 -0000 On Wed, 30 May 2007, Bruce Evans wrote: > On Tue, 29 May 2007, Julian Elischer wrote: > >> Jeff Roberson wrote: >>> I think I'm going to make the rusage struct per-thread and aggregate it on >>> demand. There will be a lot of code churn, but it will be simple. There >>> are a few cases where which will be complicated, and cpulimit is one of >>> them. >> >> So, there should be somewhere to store the aggregated stats from threads >> that have already exited. > > We already have that. It is the per-process rusage. There is already > delayed accumulation for tick counts (these are currently accumulated in > the rusage from the thread at context switch time). There is also > delayed conversion of stats to the form needed by getrusage(). Stats > are kept in raw form as long as possible (forever if nothing calls > getrusage() or wait[34]() to look at them) to avoid conversion overhead > for them on every exit(). So there are many precedents for delayed > stats handling. >From thread_exit(): /* Add our usage into the usage of all our children. */ if (p->p_numthreads == 1) ruadd(p->p_ru, &p->p_rux, &p->p_stats->p_cru, &p->p_crux); Is the comment wrong or the code wrong? This adds our child's rusage to our own. I assume the comment is actually wrong and that we're adding our children's resource usage to our own so the pstats structure can be freed and our parent can aggregate it all together. Jeff > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Tue May 29 22:28:13 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C741816A421 for ; Tue, 29 May 2007 22:28:13 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 912C013C480 for ; Tue, 29 May 2007 22:28:13 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4TMSBhi026110 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO) for ; Tue, 29 May 2007 18:28:12 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 15:28:07 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: arch@freebsd.org Message-ID: <20070529152059.Y661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: initial rusage patch. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 22:28:13 -0000 http://people.freebsd.org/~jeff/rusage3.diff I'm providing this patch for discussion only. I've just implemented enough that you can see the fallout from this change. I have not yet tested enough to say that this is perfect. I have also not yet fixed the RLIMIT_CPU check in mi_switch(). You can see, however, that with this change there is no access to struct proc in mi_switch() except for INVARIANTS and KTR. Aside, of course, for the rlimit that needs to move anyway. You can also see, that most access to rusage are done through fewer indirects and to local memory. The storage impact of struct proc doesn't change as pstat can be reclaimed as it could before. Furthermore, the only time we need locks is in rufetch() where we aggregate the threads counters and rusage structs into one allocated by the caller. Doing this aggregation less frequently means we're touching struct proc less frequently. In this patch the scheduler lock protects this aggregation. In my threadlock diff this will be protected by the per process spinlock and the thread lock. However, in most places that we aggregate with calcru() we're grabbing a spinlock anyway. So it is not so expensive to grab another. Thanks, Jeff From owner-freebsd-arch@FreeBSD.ORG Tue May 29 23:32:48 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1938F16A41F for ; Tue, 29 May 2007 23:32:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au [211.29.132.191]) by mx1.freebsd.org (Postfix) with ESMTP id A8C3413C44B for ; Tue, 29 May 2007 23:32:47 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-225-63.carlnfd3.nsw.optusnet.com.au (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4TNWVdi016912 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 09:32:33 +1000 Date: Wed, 30 May 2007 09:32:33 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Jeff Roberson In-Reply-To: <20070529145508.U661@10.0.0.1> Message-ID: <20070530083125.H93647@delplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <465C90F5.1000906@elischer.org> <20070530072219.G11288@besplex.bde.org> <20070529145508.U661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Julian Elischer , freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 23:32:48 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > On Wed, 30 May 2007, Bruce Evans wrote: >> We already have that. It is the per-process rusage. There is already >> delayed accumulation for tick counts (these are currently accumulated in >> the rusage from the thread at context switch time). There is also >> delayed conversion of stats to the form needed by getrusage(). Stats >> ... > >> From thread_exit(): > /* Add our usage into the usage of all our children. */ > if (p->p_numthreads == 1) > ruadd(p->p_ru, &p->p_rux, &p->p_stats->p_cru, &p->p_crux); > > Is the comment wrong or the code wrong? This adds our child's rusage to our > own. I assume the comment is actually wrong and that we're adding our > children's resource usage to our own so the pstats structure can be freed and > our parent can aggregate it all together. Yes, it is the comment that is wrong. The comment that (now partly) describes this code correctly is still in kern_exit.c: % /* % * Save exit status and finalize rusage info except for times, % * adding in child rusage info later when our time is locked. % */ Recent history: kern_exit.c 1.276 moved the call later in the file and updated this comment. 1.284 moved the call even later and added the wrong comment to it. 1.286 moved the call and its wrong comment to kern_thread.c. The "later" in the comment in kern_exit.c is now even later and in a different file. I think 1.276 and 1.284 had no significant effect, since the races for the non-times parts of the rusage are lost before the part of the code that is moved, and 1.276 and 1.284 don't affect the race for the times parts. Bruce From owner-freebsd-arch@FreeBSD.ORG Tue May 29 23:57:02 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3909E16A421 for ; Tue, 29 May 2007 23:57:02 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outO.internet-mail-service.net (outO.internet-mail-service.net [216.240.47.238]) by mx1.freebsd.org (Postfix) with ESMTP id 23EA513C44B for ; Tue, 29 May 2007 23:57:02 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.32) with ESMTP; Tue, 29 May 2007 16:57:01 -0700 Received: from julian-mac.elischer.org (nat.ironport.com [63.251.108.100]) by idiom.com (Postfix) with ESMTP id 75879125B2B; Tue, 29 May 2007 16:57:01 -0700 (PDT) Message-ID: <465CBDCB.2040202@elischer.org> Date: Tue, 29 May 2007 16:56:59 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Jeff Roberson References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <465C90F5.1000906@elischer.org> <20070530072219.G11288@besplex.bde.org> <20070529145508.U661@10.0.0.1> In-Reply-To: <20070529145508.U661@10.0.0.1> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 29 May 2007 23:57:02 -0000 Jeff Roberson wrote: > On Wed, 30 May 2007, Bruce Evans wrote: > >> On Tue, 29 May 2007, Julian Elischer wrote: >> >>> Jeff Roberson wrote: >>>> I think I'm going to make the rusage struct per-thread and aggregate >>>> it on demand. There will be a lot of code churn, but it will be >>>> simple. There are a few cases where which will be complicated, and >>>> cpulimit is one of them. >>> >>> So, there should be somewhere to store the aggregated stats from >>> threads that have already exited. >> >> We already have that. It is the per-process rusage. There is already >> delayed accumulation for tick counts (these are currently accumulated in >> the rusage from the thread at context switch time). There is also >> delayed conversion of stats to the form needed by getrusage(). Stats >> are kept in raw form as long as possible (forever if nothing calls >> getrusage() or wait[34]() to look at them) to avoid conversion overhead >> for them on every exit(). So there are many precedents for delayed >> stats handling. > >> From thread_exit(): > /* Add our usage into the usage of all our children. */ > if (p->p_numthreads == 1) > ruadd(p->p_ru, &p->p_rux, &p->p_stats->p_cru, &p->p_crux); This is only called when the last thread is exiting (as part of exit()) .... and we are adding our process's rusage to the storage for have already put their rusage numbers. I assume that wait() will reap it from there later. making it per-thread would effectively remove the conditional and make something similar happen on every thread_exit(). (not necessarily a bad thing). > > Is the comment wrong or the code wrong? This adds our child's rusage to > our own. I assume the comment is actually wrong and that we're adding > our children's resource usage to our own so the pstats structure can be > freed and our parent can aggregate it all together. > > Jeff > > >> >> Bruce >> From owner-freebsd-arch@FreeBSD.ORG Wed May 30 00:46:17 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EA77216A400 for ; Wed, 30 May 2007 00:46:17 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 9CD9F13C468 for ; Wed, 30 May 2007 00:46:17 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4U0kFKs055461 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO) for ; Tue, 29 May 2007 20:46:16 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 17:46:10 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: arch@freebsd.org In-Reply-To: <20070529152059.Y661@10.0.0.1> Message-ID: <20070529174045.G661@10.0.0.1> References: <20070529152059.Y661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: Re: initial rusage patch. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 00:46:18 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > http://people.freebsd.org/~jeff/rusage3.diff > > I'm providing this patch for discussion only. I've just implemented enough > that you can see the fallout from this change. I have not yet tested enough > to say that this is perfect. I have also not yet fixed the RLIMIT_CPU check > in mi_switch(). I have updated the patch at the same location to solve various exit related races. This patch appears to work in the common case. I have only to fix the cpu limit code. I see three potential ways to do this: 1) Make a thread that runs once per second and scans all procs looking for exceeded cpu limits. 2) Improve on 1 by making a queue of procs with limits set. 3) Improve on 2 by setting a per-process timeout. The only disadvantage to 2 and 3 is that they require extra space in the proc but reduce runtime, especially on systems with no limits. I advocate 3 and will start on it later unless anyone makes a strong suggestion otherwise. Actually, it seems that if I use a callout_handle the storage overhead in the proc is only one pointer. So this definitely seems like the way to go. Jeff > > You can see, however, that with this change there is no access to struct proc > in mi_switch() except for INVARIANTS and KTR. Aside, of course, for the > rlimit that needs to move anyway. > > You can also see, that most access to rusage are done through fewer indirects > and to local memory. The storage impact of struct proc doesn't change as > pstat can be reclaimed as it could before. > > Furthermore, the only time we need locks is in rufetch() where we aggregate > the threads counters and rusage structs into one allocated by the caller. > Doing this aggregation less frequently means we're touching struct proc less > frequently. > > In this patch the scheduler lock protects this aggregation. In my threadlock > diff this will be protected by the per process spinlock and the thread lock. > However, in most places that we aggregate with calcru() we're grabbing a > spinlock anyway. So it is not so expensive to grab another. > > Thanks, > Jeff > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Wed May 30 02:53:44 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CFDD216A400 for ; Wed, 30 May 2007 02:53:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au [211.29.132.183]) by mx1.freebsd.org (Postfix) with ESMTP id 2BEA313C45D for ; Wed, 30 May 2007 02:53:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4U2rT2f012334 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 12:53:32 +1000 Date: Wed, 30 May 2007 12:53:30 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070529152059.Y661@10.0.0.1> Message-ID: <20070530094538.D11725@besplex.bde.org> References: <20070529152059.Y661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: initial rusage patch. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 02:53:44 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > http://people.freebsd.org/~jeff/rusage3.diff % Index: compat/svr4/svr4_misc.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/compat/svr4/svr4_misc.c,v % retrieving revision 1.92 % diff -u -p -r1.92 svr4_misc.c % --- compat/svr4/svr4_misc.c 18 May 2007 07:10:44 -0000 1.92 % +++ compat/svr4/svr4_misc.c 29 May 2007 14:42:19 -0000 % @@ -1238,7 +1238,7 @@ loop: % /* Found a zombie, so cache info in local variables. */ % pid = p->p_pid; % status = p->p_xstat; % - ru = *p->p_ru; % + rufetch(p, &ru); % calcru(p, &ru.ru_utime, &ru.ru_stime); % PROC_UNLOCK(p); % sx_sunlock(&proctree_lock); This and similar changes seem to be wrong. *p->p_ru is where things have been accumulated already. At this point (with a zombie in wait*()), I think all of the threads have gone away so rufetch() will find nothing. Accumulation into *p->p_ru must occur at thread exit time, and wait*() doesn't need changing to access it. The non-compat wait*() does this correctly. % Index: kern/kern_acct.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_acct.c,v % retrieving revision 1.89 % diff -u -p -r1.89 kern_acct.c % --- kern/kern_acct.c 22 May 2007 06:51:37 -0000 1.89 % +++ kern/kern_acct.c 29 May 2007 14:43:21 -0000 % @@ -383,19 +383,19 @@ acct_process(struct thread *td) % acct.ac_etime = encode_timeval(tmp); % % /* (4) The average amount of memory used */ % - r = &p->p_stats->p_ru; % + rufetch(p, &ru); Should be same null change as as for wait*()? If not, an rufetch() call is probably needed before the calcru() in (2). % Index: kern/kern_clock.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_clock.c,v % retrieving revision 1.198 % diff -u -p -r1.198 kern_clock.c % --- kern/kern_clock.c 28 May 2007 21:50:54 -0000 1.198 % +++ kern/kern_clock.c 29 May 2007 13:15:14 -0000 % @@ -396,8 +396,8 @@ stopprofclock(p) % /* % * Statistics clock. Grab profile sample, and if divider reaches 0, % * do process and kernel statistics. Most of the statistics are only % - * used by user-level statistics programs. The main exceptions are % - * ke->ke_uticks, p->p_rux.rux_sticks, p->p_rux.rux_iticks, and p->p_estcpu. % + * used by user-level statistics programs. % + * % * This should be called by all active processors. % */ % void Better just remove third sentence. Statistics involving tick counters are still done here, and now just not in p_rux. Kernel-only statistics not mentioned in the comment were always done here. p_estcpu rotted earlier. It is now spelled td_estcpu and is only maintained by SCHED_4BSD. It is still spelled p_estcpu in comments near ttyinfo() and supposed to be maintained by schedulers for sorting in ttyinfo(). The part of the comment about the divider rotted even earlier so it should be removed too. % Index: kern/kern_exit.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v % retrieving revision 1.298 % diff -u -p -r1.298 kern_exit.c % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 % +++ kern/kern_exit.c 29 May 2007 15:02:48 -0000 % @@ -229,7 +229,7 @@ retry: % */ % EVENTHANDLER_INVOKE(process_exit, p); % % - MALLOC(p->p_ru, struct rusage *, sizeof(struct rusage), % + MALLOC(p->p_exitru, struct rusage *, sizeof(struct rusage), % M_ZOMBIE, M_WAITOK); % /* % * If parent is waiting for us to exit or exec, This is a better name, but more churn. % @@ -445,8 +445,8 @@ retry: % PROC_LOCK(p); % p->p_xstat = rv; % p->p_xthread = td; % - p->p_stats->p_ru.ru_nvcsw++; % - *p->p_ru = p->p_stats->p_ru; % + rufetch(p, p->p_exitru); % + p->p_exitru->ru_nvcsw++; % % /* % * Notify interested parties of our demise. This fixes 2 races (needed sched_lock here). If possible this should be moved later, together with the times finalization in thread_exit(), to pick up any changes that occur during exit. % @@ -721,7 +721,7 @@ loop: % if (status) % *status = p->p_xstat; /* convert to int */ % if (rusage) { % - *rusage = *p->p_ru; % + *rusage = *p->p_exitru; % calcru(p, &rusage->ru_utime, &rusage->ru_stime); % } % This is correct code for wait*() -- don't call rufetch(), but just use the result of the rufetch() to p_exitru. % Index: kern/kern_resource.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v % retrieving revision 1.171 % diff -u -p -r1.171 kern_resource.c % --- kern/kern_resource.c 27 May 2007 20:50:23 -0000 1.171 % +++ kern/kern_resource.c 29 May 2007 15:39:33 -0000 % @@ -802,10 +802,12 @@ calcru(struct proc *p, struct timeval *u % * We reset the thread and CPU state as if we had performed a context % * switch right here. % */ % - if (curthread->td_proc == p) { % - td = curthread; % + td = curthread; % + if (td->td_proc == p) { % u = cpu_ticks(); % p->p_rux.rux_runtime += u - PCPU_GET(switchtime); % + p->p_rux.rux_runtime += td->td_runtime; % + td->td_runtime = 0; % PCPU_SET(switchtime, u); % p->p_rux.rux_uticks += td->td_uticks; % td->td_uticks = 0; The old code here was broken in rev.1.153-1.154. This function (calcru()) is called quite often on non-current threads (mainly from fill_kinfo*()). That used to be handled by doing read-only accesses to the times on the other CPUs and non-current threads. I think the tick counts were only per-process before 1.153, so no special handling was needed for them. Now it is even more necessary to look at times on other CPUs, since rux_runtime is missing not only the current timeslice for threads running on other CPUs, but also the runtime accumulated in td_runtime but not yet copied to rux_runtime for all threads in the process other than curthread for the current CPU. Something like rufetch() for the times and tick counts only is needed here. % @@ -950,22 +952,21 @@ kern_getrusage(td, who, rup) % } % % void % -ruadd(ru, rux, ru2, rux2) % - struct rusage *ru; % - struct rusage_ext *rux; % - struct rusage *ru2; % - struct rusage_ext *rux2; % -{ % - register long *ip, *ip2; % - register int i; % - % - rux->rux_runtime += rux2->rux_runtime; % - rux->rux_uticks += rux2->rux_uticks; % - rux->rux_sticks += rux2->rux_sticks; % - rux->rux_iticks += rux2->rux_iticks; % - rux->rux_uu += rux2->rux_uu; % - rux->rux_su += rux2->rux_su; % - rux->rux_tu += rux2->rux_tu; % +ruadd(struct rusage *ru, struct rusage_ext *rux, struct rusage *ru2, % + struct rusage_ext *rux2) % +{ % + long *ip, *ip2; % + int i; % + % + if (rux) { % + rux->rux_runtime += rux2->rux_runtime; % + rux->rux_uticks += rux2->rux_uticks; % + rux->rux_sticks += rux2->rux_sticks; % + rux->rux_iticks += rux2->rux_iticks; % + rux->rux_uu += rux2->rux_uu; % + rux->rux_su += rux2->rux_su; % + rux->rux_tu += rux2->rux_tu; % + } % if (ru->ru_maxrss < ru2->ru_maxrss) % ru->ru_maxrss = ru2->ru_maxrss; % ip = &ru->ru_first; Unrelated style changes. Style bug checking (rux != NULL). I don't like the interface chaange. % @@ -975,6 +976,33 @@ ruadd(ru, rux, ru2, rux2) % } % % /* % + * Update the rusage_ext structure and fetch a valid aggregate rusage % + * for proc p if storage for one is supplied. % + */ % +void % +rufetch(struct proc *p, struct rusage *ru) ru should be spelled rup when it is a pointer. % +{ % + struct thread *td; % + % + if (ru) % + memset(ru, 0, sizeof(*ru)); I think (ru != NULL) shouldn't be supported and doesn't occur in this patch. The pointer is always &foo or p->p_exitru. % + mtx_lock_spin(&sched_lock); % + FOREACH_THREAD_IN_PROC(p, td) { % + p->p_rux.rux_runtime += td->td_runtime; % + td->td_runtime = 0; % + p->p_rux.rux_uticks += td->td_uticks; % + td->td_uticks = 0; % + p->p_rux.rux_iticks += td->td_iticks; % + td->td_iticks = 0; % + p->p_rux.rux_sticks += td->td_sticks; % + td->td_sticks = 0; % + if (ru) % + ruadd(ru, NULL, &td->td_ru, NULL); % + } % + mtx_unlock_spin(&sched_lock); % +} This is unaware of other CPUs and the per-CPU data switchticks and switchtime on them, but probably should be. Some callers will get stale times by not adding in the delta from switchtime to the current time on all CPUs. These callers also typically do: rufetch(p, rup); calcru(p, &rup->ru_utime, &rup->ru_stime); This gives a lot of duplication and some races. rufetch() adds in the times for other threads, so problems from not doing this addition in calcru() are reduced. calcru() repeats most of the code in the above loop for curthread only. calcru() handles switchtime etc. where the above doesn't, but again for curthread only. Races are caused by dropping the lock between rufetch() and calcru(). It's bogus that rufetch() copies p's rusage to *rup and then we use p's rusage but not not *rup in the calcru() call. Calcru() also uses rusage_ext which belongs only to p. These scatterings of the data give unnecessary races. For wait*() the races probably don't occur because p's rusage is not updated after we copy it, but for getrusage() it results in the times being slightly more up to date than the rest of the results. This bogusness is mostly old. % Index: sys/proc.h % =================================================================== % RCS file: /usr/home/ncvs/src/sys/sys/proc.h,v % retrieving revision 1.477 % diff -u -p -r1.477 proc.h % --- sys/proc.h 24 Apr 2007 10:59:21 -0000 1.477 % +++ sys/proc.h 29 May 2007 15:14:31 -0000 % @@ -49,6 +49,7 @@ % #include % #include /* XXX. */ % #include % +#include More namespace pollution. rusage_ext exists partly to avoid this include. The include of rtprio is XXXed because it is pollution. % @@ -255,10 +256,12 @@ struct thread { % struct kse_upcall *td_upcall; /* (k + j) Upcall structure. */ % u_int td_estcpu; /* (j) Sum of the same field in KSEs. */ % u_int td_slptime; /* (j) How long completely blocked. */ % - u_int td_pticks; /* (k) Statclock hits for profiling */ % - u_int td_sticks; /* (k) Statclock hits in system mode. */ % - u_int td_iticks; /* (k) Statclock hits in intr mode. */ % - u_int td_uticks; /* (k) Statclock hits in user mode. */ % + struct rusage td_ru; /* (j) rusage information */ % + uint64_t td_runtime; /* (j) How many cpu ticks we've run. */ % + uint64_t td_pticks; /* (j) Statclock hits for profiling */ % + uint64_t td_sticks; /* (j) Statclock hits in system mode. */ % + uint64_t td_iticks; /* (j) Statclock hits in intr mode. */ % + u_int td_uticks; /* (j) Statclock hits in user mode. */ u_int is the the correct type for these tick counters. This is enough for 388 days worth of ticks unless stathz is broken (> 128 Hz). 64-bis never actually worked even for the accumulated tick counters, since the algorithms in calcru() depend on them only using 32 bits for things like 32x32 -> 64 bit multiplications to not overflow. Thus rusage times never worked for processes with more than 388 days user or sys time. It is easy to arrange for accumulation of the per-thread counters more often than every 388 days so that they don't overflow. % Index: ufs/ufs/ufs_bmap.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/ufs/ufs/ufs_bmap.c,v % retrieving revision 1.65 % diff -u -p -r1.65 ufs_bmap.c % --- ufs/ufs/ufs_bmap.c 25 Oct 2005 19:46:15 -0000 1.65 % +++ ufs/ufs/ufs_bmap.c 29 May 2007 14:21:47 -0000 % @@ -226,7 +226,7 @@ ufs_bmaparray(vp, bn, bnp, nbp, runp, ru % vfs_busy_pages(bp, 0); % bp->b_iooffset = dbtob(bp->b_blkno); % bstrategy(bp); % - curproc->p_stats->p_ru.ru_inblock++; /* XXX */ % + curthread->td_ru.ru_inblock++; % error = bufwait(bp); % if (error) { % brelse(bp); Was the XXX because it was locked by (n) (nothing)? Is it locked now? The XXX was cloned to ext2fs but was not attached to most accesses to ru_inblock. % Index: vm/vm_fault.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/vm/vm_fault.c,v % retrieving revision 1.228 % diff -u -p -r1.228 vm_fault.c % --- vm/vm_fault.c 22 May 2007 04:45:59 -0000 1.228 % +++ vm/vm_fault.c 29 May 2007 15:21:57 -0000 % @@ -918,15 +918,10 @@ readrest: % * Unlock everything, and return % */ % unlock_and_deallocate(&fs); % - PROC_LOCK(curproc); % - if ((curproc->p_sflag & PS_INMEM) && curproc->p_stats) { % - if (hardfault) { % - curproc->p_stats->p_ru.ru_majflt++; % - } else { % - curproc->p_stats->p_ru.ru_minflt++; % - } % - } % - PROC_UNLOCK(curproc); % + if (hardfault) % + curthread->td_ru.ru_majflt++; % + else % + curthread->td_ru.ru_minflt++; % % return (KERN_SUCCESS); % } What locks these now? The locking legend still says (c) (proc). Bruce From owner-freebsd-arch@FreeBSD.ORG Wed May 30 03:07:23 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DCFC616A421; Wed, 30 May 2007 03:07:23 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail24.syd.optusnet.com.au (mail24.syd.optusnet.com.au [211.29.133.165]) by mx1.freebsd.org (Postfix) with ESMTP id 7BE8913C43E; Wed, 30 May 2007 03:07:23 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail24.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4U379d0023367 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 13:07:12 +1000 Date: Wed, 30 May 2007 13:07:10 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070529141342.D661@10.0.0.1> Message-ID: <20070530125553.G12128@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 03:07:23 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > On Wed, 30 May 2007, Bruce Evans wrote: > >> On Tue, 29 May 2007, Jeff Roberson wrote: >>> a few cases where which will be complicated, and cpulimit is one of them. >> >> No, cpulimit is simple because it can be fuzzy, unlike calcru() which >> require >> the rusage to be up to date. > > cpulimit is complicated because it requires aggregate statistics from all > threads like rusage. It may be queried infrequently however. It's just one > of the few cases where we actually examine the values as if we still only > have one thread per process. It still doesn't need very accurate statistics, unlike the others. However, as you point out, almost all of the other cases are already more aware of multiple threads and heavyweight to handle it (e.g., calcru() already had a related accumulation loop until it was broken). cpulimit is complicated and/or different because it shouldn't do heavyweight accumulation. >> I see how rusage accumulation can help for everything _except_ the >> runtime and tick counts (i.e., for stuff updated by statclock()). For >> the runtime and tick counts, the possible savings seem to be small and >> negative. calcru() would have to run the accumulation code and the >> accumulation code would have to acquire something like sched_lock to >> transfer the per-thread data (since the lock for updating that data >> is something like sched_lock). This is has the same locking overheads >> and larger non-locking overheads than accumulating the runtime directly >> into the rusage at context switch time -- calcru() needs to acquire >> something like sched_lock either way. > > Yes, it will make calcru() more expensive. However, this should be > infrequent relative to context switches. It's only used for calls to > getrusage(), fill_kinfo_proc(), and certain clock_gettime() calls. > > The thing that will protect mi_switch() is not process global. I want to > keep process global locks out of mi_switch() or we reduce concurrency for > multi-threaded applications. This became clearer with patches and would have been clearer with (smaller) diffs in mail -- mi_switch() still needs locking but it isn't sched locking. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed May 30 03:12:22 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7B5B816A400 for ; Wed, 30 May 2007 03:12:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 2CD3413C468 for ; Wed, 30 May 2007 03:12:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4U3CJ8o086115 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 23:12:20 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 20:12:14 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070530094538.D11725@besplex.bde.org> Message-ID: <20070529195649.E661@10.0.0.1> References: <20070529152059.Y661@10.0.0.1> <20070530094538.D11725@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: initial rusage patch. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 03:12:22 -0000 On Wed, 30 May 2007, Bruce Evans wrote: > On Tue, 29 May 2007, Jeff Roberson wrote: > >> http://people.freebsd.org/~jeff/rusage3.diff > > % Index: compat/svr4/svr4_misc.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/compat/svr4/svr4_misc.c,v > % retrieving revision 1.92 > % diff -u -p -r1.92 svr4_misc.c > % --- compat/svr4/svr4_misc.c 18 May 2007 07:10:44 -0000 1.92 > % +++ compat/svr4/svr4_misc.c 29 May 2007 14:42:19 -0000 > % @@ -1238,7 +1238,7 @@ loop: > % /* Found a zombie, so cache info in local variables. > */ > % pid = p->p_pid; > % status = p->p_xstat; > % - ru = *p->p_ru; > % + rufetch(p, &ru); > % calcru(p, &ru.ru_utime, &ru.ru_stime); > % PROC_UNLOCK(p); > % sx_sunlock(&proctree_lock); > > This and similar changes seem to be wrong. *p->p_ru is where things have > been accumulated already. At this point (with a zombie in wait*()), I > think all of the threads have gone away so rufetch() will find nothing. > Accumulation into *p->p_ru must occur at thread exit time, and wait*() > doesn't need changing to access it. The non-compat wait*() does this > correctly. Please refresh your patch. I fixed this in the follow up when I changed rufetch() to check first for p_ruexit. There is no p_ru and stats are not accumulated in pstats anymore. They are accumulated in p_rux and td_ru. > > % Index: kern/kern_acct.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_acct.c,v > % retrieving revision 1.89 > % diff -u -p -r1.89 kern_acct.c > % --- kern/kern_acct.c 22 May 2007 06:51:37 -0000 1.89 > % +++ kern/kern_acct.c 29 May 2007 14:43:21 -0000 > % @@ -383,19 +383,19 @@ acct_process(struct thread *td) > % acct.ac_etime = encode_timeval(tmp); > % % /* (4) The average amount of memory used */ > % - r = &p->p_stats->p_ru; > % + rufetch(p, &ru); > > Should be same null change as as for wait*()? If not, an rufetch() call > is probably needed before the calcru() in (2). What do you mean by 2? > > % Index: kern/kern_clock.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_clock.c,v > % retrieving revision 1.198 > % diff -u -p -r1.198 kern_clock.c > % --- kern/kern_clock.c 28 May 2007 21:50:54 -0000 1.198 > % +++ kern/kern_clock.c 29 May 2007 13:15:14 -0000 > % @@ -396,8 +396,8 @@ stopprofclock(p) > % /* > % * Statistics clock. Grab profile sample, and if divider reaches 0, > % * do process and kernel statistics. Most of the statistics are only > % - * used by user-level statistics programs. The main exceptions are > % - * ke->ke_uticks, p->p_rux.rux_sticks, p->p_rux.rux_iticks, and > p->p_estcpu. > % + * used by user-level statistics programs. % + * > % * This should be called by all active processors. > % */ > % void > > Better just remove third sentence. Statistics involving tick counters > are still done here, and now just not in p_rux. Kernel-only statistics > not mentioned in the comment were always done here. p_estcpu rotted > earlier. It is now spelled td_estcpu and is only maintained by > SCHED_4BSD. It is still spelled p_estcpu in comments near ttyinfo() > and supposed to be maintained by schedulers for sorting in ttyinfo(). > > The part of the comment about the divider rotted even earlier so it > should be removed too. > > % Index: kern/kern_exit.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v > % retrieving revision 1.298 > % diff -u -p -r1.298 kern_exit.c > % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 > % +++ kern/kern_exit.c 29 May 2007 15:02:48 -0000 > % @@ -229,7 +229,7 @@ retry: > % */ > % EVENTHANDLER_INVOKE(process_exit, p); > % % - MALLOC(p->p_ru, struct rusage *, sizeof(struct rusage), > % + MALLOC(p->p_exitru, struct rusage *, sizeof(struct rusage), > % M_ZOMBIE, M_WAITOK); > % /* > % * If parent is waiting for us to exit or exec, > > This is a better name, but more churn. I want to change the name so no one mistakenly follows the pointer. I'm explicitly breaking the API because it has a totally different use now. > > % @@ -445,8 +445,8 @@ retry: > % PROC_LOCK(p); > % p->p_xstat = rv; > % p->p_xthread = td; > % - p->p_stats->p_ru.ru_nvcsw++; > % - *p->p_ru = p->p_stats->p_ru; > % + rufetch(p, p->p_exitru); > % + p->p_exitru->ru_nvcsw++; > % % /* > % * Notify interested parties of our demise. > > This fixes 2 races (needed sched_lock here). > > If possible this should be moved later, together with the times > finalization in thread_exit(), to pick up any changes that occur during > exit. I think I have resolved this in my most recent patch. > > % @@ -721,7 +721,7 @@ loop: > % if (status) > % *status = p->p_xstat; /* convert to int */ > % if (rusage) { > % - *rusage = *p->p_ru; > % + *rusage = *p->p_exitru; > % calcru(p, &rusage->ru_utime, > &rusage->ru_stime); > % } > % > > This is correct code for wait*() -- don't call rufetch(), but just use > the result of the rufetch() to p_exitru. In my newest patch rufetch() acomplishes the same thing. It'd be nice to always handle it via this function so it's consistent. Perhaps I should change this also to be an rufetch(). Now the setting of p_exitru signals callers to no longer iterate over the list. > > % Index: kern/kern_resource.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v > % retrieving revision 1.171 > % diff -u -p -r1.171 kern_resource.c > % --- kern/kern_resource.c 27 May 2007 20:50:23 -0000 1.171 > % +++ kern/kern_resource.c 29 May 2007 15:39:33 -0000 > % @@ -802,10 +802,12 @@ calcru(struct proc *p, struct timeval *u > % * We reset the thread and CPU state as if we had performed a context > % * switch right here. > % */ > % - if (curthread->td_proc == p) { > % - td = curthread; > % + td = curthread; > % + if (td->td_proc == p) { > % u = cpu_ticks(); > % p->p_rux.rux_runtime += u - PCPU_GET(switchtime); > % + p->p_rux.rux_runtime += td->td_runtime; > % + td->td_runtime = 0; > % PCPU_SET(switchtime, u); > % p->p_rux.rux_uticks += td->td_uticks; > % td->td_uticks = 0; > > The old code here was broken in rev.1.153-1.154. This function (calcru()) > is called quite often on non-current threads (mainly from fill_kinfo*()). > That used to be handled by doing read-only accesses to the times on the > other CPUs and non-current threads. I think the tick counts were only > per-process before 1.153, so no special handling was needed for them. > > Now it is even more necessary to look at times on other CPUs, since > rux_runtime is missing not only the current timeslice for threads running > on other CPUs, but also the runtime accumulated in td_runtime but not > yet copied to rux_runtime for all threads in the process other than > curthread for the current CPU. Something like rufetch() for the times > and tick counts only is needed here. I can do that. I had intended to do this when I allowed rufetch() to take a NULL rusage. Although it still will be inaccurate with threads that are presently running on another cpu. Although it can only be off by as many ticks occur between context switches and it currently suffers from this limitation. > > % @@ -950,22 +952,21 @@ kern_getrusage(td, who, rup) > % } > % % void > % -ruadd(ru, rux, ru2, rux2) > % - struct rusage *ru; > % - struct rusage_ext *rux; > % - struct rusage *ru2; > % - struct rusage_ext *rux2; > % -{ > % - register long *ip, *ip2; > % - register int i; > % - > % - rux->rux_runtime += rux2->rux_runtime; > % - rux->rux_uticks += rux2->rux_uticks; > % - rux->rux_sticks += rux2->rux_sticks; > % - rux->rux_iticks += rux2->rux_iticks; > % - rux->rux_uu += rux2->rux_uu; > % - rux->rux_su += rux2->rux_su; > % - rux->rux_tu += rux2->rux_tu; > % +ruadd(struct rusage *ru, struct rusage_ext *rux, struct rusage *ru2, > % + struct rusage_ext *rux2) > % +{ > % + long *ip, *ip2; > % + int i; > % + > % + if (rux) { > % + rux->rux_runtime += rux2->rux_runtime; > % + rux->rux_uticks += rux2->rux_uticks; > % + rux->rux_sticks += rux2->rux_sticks; > % + rux->rux_iticks += rux2->rux_iticks; > % + rux->rux_uu += rux2->rux_uu; > % + rux->rux_su += rux2->rux_su; > % + rux->rux_tu += rux2->rux_tu; > % + } > % if (ru->ru_maxrss < ru2->ru_maxrss) > % ru->ru_maxrss = ru2->ru_maxrss; > % ip = &ru->ru_first; > > Unrelated style changes. > > Style bug checking (rux != NULL). > > I don't like the interface chaange. I can define another function which adds copies the ru and not the rux if you like that better. > > % @@ -975,6 +976,33 @@ ruadd(ru, rux, ru2, rux2) > % } > % % /* > % + * Update the rusage_ext structure and fetch a valid aggregate rusage > % + * for proc p if storage for one is supplied. > % + */ > % +void > % +rufetch(struct proc *p, struct rusage *ru) > > ru should be spelled rup when it is a pointer. > > % +{ > % + struct thread *td; > % + > % + if (ru) > % + memset(ru, 0, sizeof(*ru)); > > I think (ru != NULL) shouldn't be supported and doesn't occur in this > patch. The pointer is always &foo or p->p_exitru. I think it will be necessary when we only want to sanitize the rux. > > % + mtx_lock_spin(&sched_lock); > % + FOREACH_THREAD_IN_PROC(p, td) { > % + p->p_rux.rux_runtime += td->td_runtime; > % + td->td_runtime = 0; > % + p->p_rux.rux_uticks += td->td_uticks; > % + td->td_uticks = 0; > % + p->p_rux.rux_iticks += td->td_iticks; > % + td->td_iticks = 0; > % + p->p_rux.rux_sticks += td->td_sticks; > % + td->td_sticks = 0; > % + if (ru) > % + ruadd(ru, NULL, &td->td_ru, NULL); > % + } > % + mtx_unlock_spin(&sched_lock); > % +} > > This is unaware of other CPUs and the per-CPU data switchticks and > switchtime on them, but probably should be. Some callers will get > stale times by not adding in the delta from switchtime to the current > time on all CPUs. These callers also typically do: > > rufetch(p, rup); > calcru(p, &rup->ru_utime, &rup->ru_stime); > > This gives a lot of duplication and some races. rufetch() adds in > the times for other threads, so problems from not doing this addition > in calcru() are reduced. calcru() repeats most of the code in the > above loop for curthread only. calcru() handles switchtime etc. > where the above doesn't, but again for curthread only. > > Races are caused by dropping the lock between rufetch() and calcru(). Yes, I think we should require the sched lock to be held over the whole blob or make another function that does both with sched lock held. In my threadlock diff this should be proc slock. I noticed this problem but ignored it because there are presently races in calcru as well without this patc. > > It's bogus that rufetch() copies p's rusage to *rup and then we use > p's rusage but not not *rup in the calcru() call. Calcru() also uses > rusage_ext which belongs only to p. These scatterings of the data > give unnecessary races. For wait*() the races probably don't occur > because p's rusage is not updated after we copy it, but for getrusage() > it results in the times being slightly more up to date than the rest > of the results. This bogusness is mostly old. > > % Index: sys/proc.h > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/sys/proc.h,v > % retrieving revision 1.477 > % diff -u -p -r1.477 proc.h > % --- sys/proc.h 24 Apr 2007 10:59:21 -0000 1.477 > % +++ sys/proc.h 29 May 2007 15:14:31 -0000 > % @@ -49,6 +49,7 @@ > % #include > % #include /* XXX. */ > % #include > % +#include > > More namespace pollution. rusage_ext exists partly to avoid this include. > The include of rtprio is XXXed because it is pollution. Yes, I thought you wouldn't be happy about this. I'm not impressed with the idea of changing every kernel file that includes proc.h to also include resource.h however. > > % @@ -255,10 +256,12 @@ struct thread { > % struct kse_upcall *td_upcall; /* (k + j) Upcall structure. */ > % u_int td_estcpu; /* (j) Sum of the same field in KSEs. > */ > % u_int td_slptime; /* (j) How long completely blocked. > */ > % - u_int td_pticks; /* (k) Statclock hits for profiling > */ > % - u_int td_sticks; /* (k) Statclock hits in system mode. > */ > % - u_int td_iticks; /* (k) Statclock hits in intr mode. > */ > % - u_int td_uticks; /* (k) Statclock hits in user mode. > */ > % + struct rusage td_ru; /* (j) rusage information */ > % + uint64_t td_runtime; /* (j) How many cpu ticks we've run. > */ > % + uint64_t td_pticks; /* (j) Statclock hits for profiling > */ > % + uint64_t td_sticks; /* (j) Statclock hits in system mode. > */ > % + uint64_t td_iticks; /* (j) Statclock hits in intr mode. > */ > % + u_int td_uticks; /* (j) Statclock hits in user mode. > */ > > u_int is the the correct type for these tick counters. This is enough > for 388 days worth of ticks unless stathz is broken (> 128 Hz). 64-bis > never actually worked even for the accumulated tick counters, since > the algorithms in calcru() depend on them only using 32 bits for things > like 32x32 -> 64 bit multiplications to not overflow. Thus rusage > times never worked for processes with more than 388 days user or sys time. > It is easy to arrange for accumulation of the per-thread counters more > often than every 388 days so that they don't overflow. I changed them to match the 64bit counters they were feeding into in the proc. If you like I can change them back and check for overflow. > > % Index: ufs/ufs/ufs_bmap.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/ufs/ufs/ufs_bmap.c,v > % retrieving revision 1.65 > % diff -u -p -r1.65 ufs_bmap.c > % --- ufs/ufs/ufs_bmap.c 25 Oct 2005 19:46:15 -0000 1.65 > % +++ ufs/ufs/ufs_bmap.c 29 May 2007 14:21:47 -0000 > % @@ -226,7 +226,7 @@ ufs_bmaparray(vp, bn, bnp, nbp, runp, ru > % vfs_busy_pages(bp, 0); > % bp->b_iooffset = dbtob(bp->b_blkno); > % bstrategy(bp); > % - curproc->p_stats->p_ru.ru_inblock++; /* XXX */ > % + curthread->td_ru.ru_inblock++; > % error = bufwait(bp); > % if (error) { > % brelse(bp); > > Was the XXX because it was locked by (n) (nothing)? Is it locked now? > The XXX was cloned to ext2fs but was not attached to most accesses to > ru_inblock. I don't really believe it needs a lock. It's only modified by curthread now and only read by threads which are calling getrusage() while this thread is probably concurrently running and very likely changing the counter. > > % Index: vm/vm_fault.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/vm/vm_fault.c,v > % retrieving revision 1.228 > % diff -u -p -r1.228 vm_fault.c > % --- vm/vm_fault.c 22 May 2007 04:45:59 -0000 1.228 > % +++ vm/vm_fault.c 29 May 2007 15:21:57 -0000 > % @@ -918,15 +918,10 @@ readrest: > % * Unlock everything, and return > % */ > % unlock_and_deallocate(&fs); > % - PROC_LOCK(curproc); > % - if ((curproc->p_sflag & PS_INMEM) && curproc->p_stats) { > % - if (hardfault) { > % - curproc->p_stats->p_ru.ru_majflt++; > % - } else { > % - curproc->p_stats->p_ru.ru_minflt++; > % - } > % - } > % - PROC_UNLOCK(curproc); > % + if (hardfault) > % + curthread->td_ru.ru_majflt++; > % + else > % + curthread->td_ru.ru_minflt++; > % % return (KERN_SUCCESS); > % } > > What locks these now? The locking legend still says (c) (proc). Same as above. I intend to change the legend. > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Wed May 30 03:15:15 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id ADB2716A421; Wed, 30 May 2007 03:15:15 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 5E30313C457; Wed, 30 May 2007 03:15:15 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4U3FD3b086501 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 29 May 2007 23:15:14 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 20:15:08 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070530125553.G12128@besplex.bde.org> Message-ID: <20070529201255.X661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 03:15:15 -0000 On Wed, 30 May 2007, Bruce Evans wrote: > On Tue, 29 May 2007, Jeff Roberson wrote: > >> On Wed, 30 May 2007, Bruce Evans wrote: >> >>> On Tue, 29 May 2007, Jeff Roberson wrote: > >>>> a few cases where which will be complicated, and cpulimit is one of them. >>> >>> No, cpulimit is simple because it can be fuzzy, unlike calcru() which >>> require >>> the rusage to be up to date. >> >> cpulimit is complicated because it requires aggregate statistics from all >> threads like rusage. It may be queried infrequently however. It's just >> one of the few cases where we actually examine the values as if we still >> only have one thread per process. > > It still doesn't need very accurate statistics, unlike the others. > However, as you point out, almost all of the other cases are already more > aware of multiple threads and heavyweight to handle it (e.g., calcru() > already had a related accumulation loop until it was broken). cpulimit > is complicated and/or different because it shouldn't do heavyweight > accumulation. > >>> I see how rusage accumulation can help for everything _except_ the >>> runtime and tick counts (i.e., for stuff updated by statclock()). For >>> the runtime and tick counts, the possible savings seem to be small and >>> negative. calcru() would have to run the accumulation code and the >>> accumulation code would have to acquire something like sched_lock to >>> transfer the per-thread data (since the lock for updating that data >>> is something like sched_lock). This is has the same locking overheads >>> and larger non-locking overheads than accumulating the runtime directly >>> into the rusage at context switch time -- calcru() needs to acquire >>> something like sched_lock either way. >> >> Yes, it will make calcru() more expensive. However, this should be >> infrequent relative to context switches. It's only used for calls to >> getrusage(), fill_kinfo_proc(), and certain clock_gettime() calls. >> >> The thing that will protect mi_switch() is not process global. I want to >> keep process global locks out of mi_switch() or we reduce concurrency for >> multi-threaded applications. > > This became clearer with patches and would have been clearer with > (smaller) diffs in mail -- mi_switch() still needs locking but it isn't > sched locking. Hopefully you see the value in my approach now? I don't think it's turning out so badly, except for some details which need refining. It certainly make mi_switch() and statclock() cleaner. And hopefully we can remove more code from ast() and mi_switch() by changing the cpu limits. Jeff > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Wed May 30 05:12:23 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 66A2816A421 for ; Wed, 30 May 2007 05:12:23 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id F162513C455 for ; Wed, 30 May 2007 05:12:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4U5CJGK098854 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 01:12:21 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 29 May 2007 22:12:14 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070529201255.X661@10.0.0.1> Message-ID: <20070529220936.W661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 05:12:23 -0000 I have updated the patch at: http://people.freebsd.org/~jeff/rusage3.diff I have incorporated much of the feedback from the earlier diff. I have also changed the cpulimit code to use a callback. This removes a PS_ flag and a bunch of code from mi_switch() and ast(). mi_switch() now does not need to touch the proc at all unless debugging is enabled. I have tested the cpu limiting code as well and it seems to function properly. My only reservation is the extra overhead of another callout per proc. Thanks, Jeff From owner-freebsd-arch@FreeBSD.ORG Wed May 30 06:18:33 2007 Return-Path: X-Original-To: arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3352716A482 for ; Wed, 30 May 2007 06:18:33 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail11.syd.optusnet.com.au (mail11.syd.optusnet.com.au [211.29.132.192]) by mx1.freebsd.org (Postfix) with ESMTP id A9D9513C46C for ; Wed, 30 May 2007 06:18:32 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail11.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4U6IIDS017262 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 16:18:21 +1000 Date: Wed, 30 May 2007 16:18:20 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070529195649.E661@10.0.0.1> Message-ID: <20070530150615.Y12540@besplex.bde.org> References: <20070529152059.Y661@10.0.0.1> <20070530094538.D11725@besplex.bde.org> <20070529195649.E661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: initial rusage patch. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 06:18:33 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > On Wed, 30 May 2007, Bruce Evans wrote: > >> On Tue, 29 May 2007, Jeff Roberson wrote: >> >>> http://people.freebsd.org/~jeff/rusage3.diff >> >> % Index: compat/svr4/svr4_misc.c >> % =================================================================== >> % RCS file: /usr/home/ncvs/src/sys/compat/svr4/svr4_misc.c,v >> % retrieving revision 1.92 >> % diff -u -p -r1.92 svr4_misc.c >> % --- compat/svr4/svr4_misc.c 18 May 2007 07:10:44 -0000 1.92 >> % +++ compat/svr4/svr4_misc.c 29 May 2007 14:42:19 -0000 >> % @@ -1238,7 +1238,7 @@ loop: >> % /* Found a zombie, so cache info in local variables. >> */ >> % pid = p->p_pid; >> % status = p->p_xstat; >> % - ru = *p->p_ru; >> % + rufetch(p, &ru); >> % calcru(p, &ru.ru_utime, &ru.ru_stime); >> % PROC_UNLOCK(p); >> % sx_sunlock(&proctree_lock); >> >> This and similar changes seem to be wrong. *p->p_ru is where things have >> been accumulated already. At this point (with a zombie in wait*()), I >> think all of the threads have gone away so rufetch() will find nothing. >> Accumulation into *p->p_ru must occur at thread exit time, and wait*() >> doesn't need changing to access it. The non-compat wait*() does this >> correctly. > > Please refresh your patch. I fixed this in the follow up when I changed > rufetch() to check first for p_ruexit. There is no p_ru and stats are not > accumulated in pstats anymore. They are accumulated in p_rux and td_ru. I don't like that fix. Why not just use p_exitru after finalizing it except for the times? The non-compat wait*() still does this like I want. >> % Index: kern/kern_acct.c >> % =================================================================== >> % RCS file: /usr/home/ncvs/src/sys/kern/kern_acct.c,v >> % retrieving revision 1.89 >> % diff -u -p -r1.89 kern_acct.c >> % --- kern/kern_acct.c 22 May 2007 06:51:37 -0000 1.89 >> % +++ kern/kern_acct.c 29 May 2007 14:43:21 -0000 >> % @@ -383,19 +383,19 @@ acct_process(struct thread *td) >> % acct.ac_etime = encode_timeval(tmp); >> % % /* (4) The average amount of memory used */ >> % - r = &p->p_stats->p_ru; >> % + rufetch(p, &ru); >> >> Should be same null change as as for wait*()? If not, an rufetch() call >> is probably needed before the calcru() in (2). > > What do you mean by 2? Step (2) of the excessively commented steps in kern_acct.c, of which the above shows step (4). >> % Index: kern/kern_clock.c >> ... Please trim quotes. >> % Index: kern/kern_exit.c >> [p_exitru] is a better name [than p_ru], but more churn. > > I want to change the name so no one mistakenly follows the pointer. I'm > explicitly breaking the API because it has a totally different use now. It doesn't seem to any different except for the parts that I don't like. It was already the exit ru. Do you plan further changes than make it a non-exit ru? >> % @@ -445,8 +445,8 @@ retry: >> % PROC_LOCK(p); >> % p->p_xstat = rv; >> % p->p_xthread = td; >> % - p->p_stats->p_ru.ru_nvcsw++; >> % - *p->p_ru = p->p_stats->p_ru; >> % + rufetch(p, p->p_exitru); >> % + p->p_exitru->ru_nvcsw++; >> % % /* >> % * Notify interested parties of our demise. >> >> This fixes 2 races (needed sched_lock here). >> >> If possible this should be moved later, together with the times >> finalization in thread_exit(), to pick up any changes that occur during >> exit. > > I think I have resolved this in my most recent patch. No, it doesn't move things later so it has incompletely finalized data, and it restores the races fixed by calling rufetch() in the above. thread_exit() has not yet been called for the last thread in the process. A statclock interrupt may generate new statistics after any code near the above "finalizes" the statstics, and without locking a statclock interrupt in the middle of the copying may gave an incoherent set of statistics. The rufetch() call in the above fixes the second of these problems by sypplying locking. >> % @@ -721,7 +721,7 @@ loop: >> % if (status) >> % *status = p->p_xstat; /* convert to int */ >> % if (rusage) { >> % - *rusage = *p->p_ru; >> % + *rusage = *p->p_exitru; >> % calcru(p, &rusage->ru_utime, >> &rusage->ru_stime); >> % } >> % >> >> This is correct code for wait*() -- don't call rufetch(), but just use >> the result of the rufetch() to p_exitru. > > In my newest patch rufetch() acomplishes the same thing. It'd be nice to > always handle it via this function so it's consistent. Perhaps I should > change this also to be an rufetch(). Now the setting of p_exitru signals > callers to no longer iterate over the list. I prefer the direct access. >> % @@ -950,22 +952,21 @@ kern_getrusage(td, who, rup) >> % ... >> % +ruadd(struct rusage *ru, struct rusage_ext *rux, struct rusage *ru2, >> % + struct rusage_ext *rux2) >> % +{ >> % + long *ip, *ip2; >> % + int i; >> % + >> % + if (rux) { >> % + rux->rux_runtime += rux2->rux_runtime; >> ... >> I don't like the interface chaange. > > I can define another function which adds copies the ru and not the rux if you > like that better. Yes, I like specialized functions. >> % Index: sys/proc.h >> % =================================================================== >> % RCS file: /usr/home/ncvs/src/sys/sys/proc.h,v >> % retrieving revision 1.477 >> % diff -u -p -r1.477 proc.h >> % --- sys/proc.h 24 Apr 2007 10:59:21 -0000 1.477 >> % +++ sys/proc.h 29 May 2007 15:14:31 -0000 >> % @@ -49,6 +49,7 @@ >> % #include >> % #include /* XXX. */ >> % #include >> % +#include >> >> More namespace pollution. rusage_ext exists partly to avoid this include. >> The include of rtprio is XXXed because it is pollution. > > Yes, I thought you wouldn't be happy about this. I'm not impressed with the > idea of changing every kernel file that includes proc.h to also include > resource.h however. It could remain indirect, or minimize the pollution using a new header that declares only the resource struct (but the latter would be an obfuscation). >> % @@ -255,10 +256,12 @@ struct thread { >> ... >> % + uint64_t td_pticks; /* (j) Statclock hits for profiling >> */ >> % + uint64_t td_sticks; /* (j) Statclock hits in system mode. >> */ >> % + uint64_t td_iticks; /* (j) Statclock hits in intr mode. >> */ >> % + u_int td_uticks; /* (j) Statclock hits in user mode. >> */ >> >> u_int is the the correct type for these tick counters. This is enough >> for 388 days worth of ticks unless stathz is broken (> 128 Hz). 64-bis >> ... > I changed them to match the 64bit counters they were feeding into in the > proc. If you like I can change them back and check for overflow. Their overflow can't happen :-). I think 32+64->64 bit additions are efficient enough (more efficient that 64+64->64 on 32-bit machines) so these counters should be optimized for space. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed May 30 06:36:16 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 66C5316A421; Wed, 30 May 2007 06:36:16 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail34.syd.optusnet.com.au (mail34.syd.optusnet.com.au [211.29.133.218]) by mx1.freebsd.org (Postfix) with ESMTP id DF90A13C45B; Wed, 30 May 2007 06:36:15 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail34.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4U6a4D2014239 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 16:36:05 +1000 Date: Wed, 30 May 2007 16:36:05 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070529201255.X661@10.0.0.1> Message-ID: <20070530161912.P12540@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 06:36:16 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > On Wed, 30 May 2007, Bruce Evans wrote: > >> On Tue, 29 May 2007, Jeff Roberson wrote: >>> The thing that will protect mi_switch() is not process global. I want to >>> keep process global locks out of mi_switch() or we reduce concurrency for >>> multi-threaded applications. >> >> This became clearer with patches and would have been clearer with >> (smaller) diffs in mail -- mi_switch() still needs locking but it isn't >> sched locking. > > Hopefully you see the value in my approach now? I don't think it's turning > out so badly, except for some details which need refining. It certainly make > mi_switch() and statclock() cleaner. And hopefully we can remove more code > from ast() and mi_switch() by changing the cpu limits. Yes, it shows promise. I like the possibility of fixing the stats for other CPUs in calcru() in a general way. The aggregation points also give a good place to fix the scaling by the cputick frequency. The cputick frequency may be variable, so the current frequency should not be used to scale all previously recorded ticks. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed May 30 10:59:15 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9376816A400 for ; Wed, 30 May 2007 10:59:15 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail34.syd.optusnet.com.au (mail34.syd.optusnet.com.au [211.29.133.218]) by mx1.freebsd.org (Postfix) with ESMTP id 2C98E13C455 for ; Wed, 30 May 2007 10:59:14 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail34.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4UAx4MV028021 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 20:59:05 +1000 Date: Wed, 30 May 2007 20:59:05 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070529220936.W661@10.0.0.1> Message-ID: <20070530201618.T13220@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 10:59:15 -0000 On Tue, 29 May 2007, Jeff Roberson wrote: > I have updated the patch at: > > http://people.freebsd.org/~jeff/rusage3.diff New minor points about an even later update: % Index: kern/kern_exit.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v % retrieving revision 1.298 % diff -u -p -r1.298 kern_exit.c % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 % +++ kern/kern_exit.c 29 May 2007 21:38:21 -0000 % ... % @@ -229,7 +233,7 @@ retry: % */ % EVENTHANDLER_INVOKE(process_exit, p); % % - MALLOC(p->p_ru, struct rusage *, sizeof(struct rusage), % + MALLOC(ru, struct rusage *, sizeof(struct rusage), % M_ZOMBIE, M_WAITOK); % /* % * If parent is waiting for us to exit or exec, Perhaps this should not be micro-optimized for space (allocate it inside struct proc for the whole process lifetime). Alternatively, inherit it from the first thread in the process that exits. % Index: kern/kern_resource.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v % retrieving revision 1.171 % diff -u -p -r1.171 kern_resource.c % --- kern/kern_resource.c 27 May 2007 20:50:23 -0000 1.171 % +++ kern/kern_resource.c 29 May 2007 22:06:05 -0000 % @@ -619,6 +620,47 @@ setrlimit(td, uap) % return (error); % } % % +static void % +lim_cb(void *arg) % +{ % + struct rlimit rlim; % + struct thread *td; % + struct proc *p; Unsorted decls. % + % + p = (struct proc *)arg; Unnecessary cast. % + PROC_LOCK_ASSERT(p, MA_OWNED); % + /* % + * Check if the process exceeds its cpu resource allocation. If % + * it reaches the max, arrange to kill the process in ast(). % + */ % + mtx_lock_spin(&sched_lock); % + FOREACH_THREAD_IN_PROC(p, td) % + ruxcollect(&p->p_rux, td); % + if (p->p_cpulimit != RLIM_INFINITY && This should just retun if the limit is infinity. p_cpulimit is still marked as locked by sched_lock, but proc locking should be enough now, so sched locking (or whatever it becomes) is not needed for the early test and retrun. The early return rarely happens but it gives simpler code as well as saving time when it happens. % + p->p_rux.rux_runtime > p->p_cpulimit * cpu_tickrate()) { % + lim_rlimit(p, RLIMIT_CPU, &rlim); % + if (p->p_rux.rux_runtime >= rlim.rlim_max * cpu_tickrate()) { Hmm, the tick rate conversion bug gives more than wrong stats. Here it shoots processs. E.g., suppose you have a cpulimit of 1 hour and you throttle the CPU to 8 times slower. Processes that have run for > 7.5 minutes at the old (higher) tick rate are then killed here. % ... % + if (p->p_cpulimit != RLIM_INFINITY) % + callout_reset(&p->p_limco, hz, lim_cb, (void *)p); Unnecessary cast. % Index: kern/kern_synch.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_synch.c,v % retrieving revision 1.295 % diff -u -p -r1.295 kern_synch.c % --- kern/kern_synch.c 18 May 2007 07:10:45 -0000 1.295 % +++ kern/kern_synch.c 29 May 2007 21:35:40 -0000 % @@ -401,35 +401,17 @@ mi_switch(int flags, struct thread *newt % ... % /* % * Compute the amount of time during which the current % - * process was running, and add that to its total so far. % + * thread was running, and add that to its total so far. % */ % new_switchtime = cpu_ticks(); % - p->p_rux.rux_runtime += (new_switchtime - PCPU_GET(switchtime)); % - p->p_rux.rux_uticks += td->td_uticks; % - td->td_uticks = 0; % - p->p_rux.rux_iticks += td->td_iticks; % - td->td_iticks = 0; % - p->p_rux.rux_sticks += td->td_sticks; % - td->td_sticks = 0; % - % + td->td_runtime += (new_switchtime - PCPU_GET(switchtime)); Clean further by removing unnecesary parens. % td->td_generation++; /* bump preempt-detect counter */ % - % - /* % - * Check if the process exceeds its cpu resource allocation. If % - * it reaches the max, arrange to kill the process in ast(). % - */ % - if (p->p_cpulimit != RLIM_INFINITY && % - p->p_rux.rux_runtime >= p->p_cpulimit * cpu_tickrate()) { % - p->p_sflag |= PS_XCPU; % - td->td_flags |= TDF_ASTPENDING; % - } % - % /* % * Finish up stats for outgoing thread. % */ Clean further by merging the "start" and "finish" code (so that switchtime is cleared immediately after it is copied, etc). There is nothing in between any more. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed May 30 11:18:00 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6692E16A46D for ; Wed, 30 May 2007 11:18:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx01.syd.optusnet.com.au (fallbackmx01.syd.optusnet.com.au [211.29.132.93]) by mx1.freebsd.org (Postfix) with ESMTP id F176913C44B for ; Wed, 30 May 2007 11:17:59 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail31.syd.optusnet.com.au (mail31.syd.optusnet.com.au [211.29.132.102]) by fallbackmx01.syd.optusnet.com.au (8.12.11.20060308/8.12.11) with ESMTP id l4TKqjSB008093 for ; Wed, 30 May 2007 06:52:45 +1000 Received: from c211-30-225-63.carlnfd3.nsw.optusnet.com.au (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail31.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4TKqfA0030283 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 06:52:43 +1000 Date: Wed, 30 May 2007 06:52:42 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Kip Macy In-Reply-To: Message-ID: <20070530062757.L93410@delplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: rusage breakdown and cpu limits. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 11:18:00 -0000 On Tue, 29 May 2007, Kip Macy wrote: >> I think using a per-process spin lock (or a pool of spin locks) would be a >> good first step. I wouldn't do anything more complicated unless the simple >> approach doesn't work. The only reason to not move the check into >> userret() >> would be if one is worried about threads chewing up CPU time while they are >> in the kernel w/o bouncing out to userland. Also, it matters which one >> happens more often (userret() vs mi_switch()). If on average threads >> perform >> multiple system calls during a single time slice (no idea if this is true >> or >> not), then moving the check to userret() would actually hurt performance. > > Processes can certainly make numerous system calls within a single > time slice. Not many more than a few hundred million syscalls can be made within a timeslice of 100 mS. FreeBSD does too many context switches for interrupts, so the number in practice seem to be mostly in the range of 1-10, but I hope for 100-1000. > However, in userret it would be protected by a per process > or per thread blocking mutex as opposed to a global spin mutex. It > would be surprising if it isn't a net win, although it is quite > possible that on a 2-way system the extra locking could have an > adverse effect on some workloads. Any locking within userret() would be a good pessimization. There are none now, but still a lot of bloat. In this case, correct proc locking isn't even possible, since the runtime update must occur while something like a global scheduler lock is held. When a context switch occurs, the lock must protect at least the old process and the new process, and somehow prevent interference from other processes. The update of the runtime needs essentially the same lock. Any locking in userret() would need to use the same lock as the update to be perfectly correct. Fortunately, the cpulimit limit check only needs to be correct to within seconds or even minutes. A sloppy unlocked check don't often enough would work OK, at least if you re-check with correct locking before killing the process. Alternatively, the sloppiness can be due to delayed updates -- let the rusage data lag by up to a second or so in the context of the check; the runtime would accumulate accurately somewhere, but the check wouldn't see it all the accumulation step would need the full lock for reading and writing the scattered data and a lesser lock for updating the accumulated data. userret() still shouldn't be pessimized by acquiring the lesser lock. I still think this misses the point -- the check is the easy part, and can be done at no extra locking cost while the full lock is held. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed May 30 19:04:27 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A18D616A421 for ; Wed, 30 May 2007 19:04:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 5842513C45E for ; Wed, 30 May 2007 19:04:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4UJ4OHO074016 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Wed, 30 May 2007 15:04:25 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 30 May 2007 12:04:17 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070530201618.T13220@besplex.bde.org> Message-ID: <20070530115752.F661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 30 May 2007 19:04:27 -0000 Ok, hopefully we're in the home stretch here. Patch is at the same location. I did the following: 1) Renamed p_ru back. I erroneously thought some code was accessing pstats indirectly through here. That's why I changed the name. 2) Cleaned up some more comments, casts, style bugs, etc. 3) Fixed a problem where we migh've called callout_init_mtx() multiple times. 4) Renamed ruxcollect since it doesn't do the same thing as rucollect(). 5) Resorted lim_cb() which now holds the sched_lock over much less code but is certainly safe as is. Hopefully you will find this to your liking. I intend to fix some more of the races in a follow-up commit when I change the scheduler locking. Sorry for the top post. Jeff On Wed, 30 May 2007, Bruce Evans wrote: > On Tue, 29 May 2007, Jeff Roberson wrote: > >> I have updated the patch at: >> >> http://people.freebsd.org/~jeff/rusage3.diff > > New minor points about an even later update: > > % Index: kern/kern_exit.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v > % retrieving revision 1.298 > % diff -u -p -r1.298 kern_exit.c > % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 > % +++ kern/kern_exit.c 29 May 2007 21:38:21 -0000 > % ... > % @@ -229,7 +233,7 @@ retry: > % */ > % EVENTHANDLER_INVOKE(process_exit, p); > % % - MALLOC(p->p_ru, struct rusage *, sizeof(struct rusage), > % + MALLOC(ru, struct rusage *, sizeof(struct rusage), > % M_ZOMBIE, M_WAITOK); > % /* > % * If parent is waiting for us to exit or exec, > > Perhaps this should not be micro-optimized for space (allocate it inside > struct proc for the whole process lifetime). Alternatively, inherit it > from the first thread in the process that exits. > > % Index: kern/kern_resource.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v > % retrieving revision 1.171 > % diff -u -p -r1.171 kern_resource.c > % --- kern/kern_resource.c 27 May 2007 20:50:23 -0000 1.171 > % +++ kern/kern_resource.c 29 May 2007 22:06:05 -0000 > % @@ -619,6 +620,47 @@ setrlimit(td, uap) > % return (error); > % } > % % +static void > % +lim_cb(void *arg) > % +{ > % + struct rlimit rlim; > % + struct thread *td; > % + struct proc *p; > > Unsorted decls. > > % + > % + p = (struct proc *)arg; > > Unnecessary cast. > > % + PROC_LOCK_ASSERT(p, MA_OWNED); > % + /* > % + * Check if the process exceeds its cpu resource allocation. If > % + * it reaches the max, arrange to kill the process in ast(). > % + */ > % + mtx_lock_spin(&sched_lock); > % + FOREACH_THREAD_IN_PROC(p, td) > % + ruxcollect(&p->p_rux, td); > % + if (p->p_cpulimit != RLIM_INFINITY && > > This should just retun if the limit is infinity. > > p_cpulimit is still marked as locked by sched_lock, but proc locking > should be enough now, so sched locking (or whatever it becomes) is > not needed for the early test and retrun. The early return rarely > happens but it gives simpler code as well as saving time when it > happens. > > % + p->p_rux.rux_runtime > p->p_cpulimit * cpu_tickrate()) { > % + lim_rlimit(p, RLIMIT_CPU, &rlim); > % + if (p->p_rux.rux_runtime >= rlim.rlim_max * cpu_tickrate()) { > > Hmm, the tick rate conversion bug gives more than wrong stats. Here it > shoots processs. E.g., suppose you have a cpulimit of 1 hour and you > throttle the CPU to 8 times slower. Processes that have run for > 7.5 > minutes at the old (higher) tick rate are then killed here. > > % ... > % + if (p->p_cpulimit != RLIM_INFINITY) > % + callout_reset(&p->p_limco, hz, lim_cb, (void *)p); > > Unnecessary cast. > > % Index: kern/kern_synch.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_synch.c,v > % retrieving revision 1.295 > % diff -u -p -r1.295 kern_synch.c > % --- kern/kern_synch.c 18 May 2007 07:10:45 -0000 1.295 > % +++ kern/kern_synch.c 29 May 2007 21:35:40 -0000 > % @@ -401,35 +401,17 @@ mi_switch(int flags, struct thread *newt > % ... > % /* > % * Compute the amount of time during which the current > % - * process was running, and add that to its total so far. > % + * thread was running, and add that to its total so far. > % */ > % new_switchtime = cpu_ticks(); > % - p->p_rux.rux_runtime += (new_switchtime - PCPU_GET(switchtime)); > % - p->p_rux.rux_uticks += td->td_uticks; > % - td->td_uticks = 0; > % - p->p_rux.rux_iticks += td->td_iticks; > % - td->td_iticks = 0; > % - p->p_rux.rux_sticks += td->td_sticks; > % - td->td_sticks = 0; > % - > % + td->td_runtime += (new_switchtime - PCPU_GET(switchtime)); > > Clean further by removing unnecesary parens. > > % td->td_generation++; /* bump preempt-detect counter */ > % - > % - /* > % - * Check if the process exceeds its cpu resource allocation. If > % - * it reaches the max, arrange to kill the process in ast(). > % - */ > % - if (p->p_cpulimit != RLIM_INFINITY && > % - p->p_rux.rux_runtime >= p->p_cpulimit * cpu_tickrate()) { > % - p->p_sflag |= PS_XCPU; > % - td->td_flags |= TDF_ASTPENDING; > % - } > % - > % /* > % * Finish up stats for outgoing thread. > % */ > > Clean further by merging the "start" and "finish" code (so that > switchtime is cleared immediately after it is copied, etc). There is > nothing in between any more. > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Thu May 31 00:17:57 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1171516A468 for ; Thu, 31 May 2007 00:17:57 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail32.syd.optusnet.com.au (mail32.syd.optusnet.com.au [211.29.132.63]) by mx1.freebsd.org (Postfix) with ESMTP id 8CECD13C489 for ; Thu, 31 May 2007 00:17:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail32.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4V0HhuN021232 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 31 May 2007 10:17:45 +1000 Date: Thu, 31 May 2007 10:17:17 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070530115752.F661@10.0.0.1> Message-ID: <20070531091419.S826@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2007 00:17:57 -0000 On Wed, 30 May 2007, Jeff Roberson wrote: > Ok, hopefully we're in the home stretch here. Patch is at the same location. > I did the following: > ... % Index: kern/kern_acct.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_acct.c,v % retrieving revision 1.89 % diff -u -p -r1.89 kern_acct.c % --- kern/kern_acct.c 22 May 2007 06:51:37 -0000 1.89 % +++ kern/kern_acct.c 30 May 2007 00:14:12 -0000 % @@ -337,7 +337,7 @@ acct_process(struct thread *td) % struct timeval ut, st, tmp; % struct plimit *newlim, *oldlim; % struct proc *p; % - struct rusage *r; % + struct rusage ru; % int t, ret, vfslocked; % % /* % @@ -370,6 +370,7 @@ acct_process(struct thread *td) % bcopy(p->p_comm, acct.ac_comm, sizeof acct.ac_comm); % % /* (2) The amount of user and system time that was used */ % + rufetch(p, &ru); % calcru(p, &ut, &st); % acct.ac_utime = encode_timeval(ut); % acct.ac_stime = encode_timeval(st); % @@ -383,19 +384,18 @@ acct_process(struct thread *td) % acct.ac_etime = encode_timeval(tmp); % % /* (4) The average amount of memory used */ % - r = &p->p_stats->p_ru; % tmp = ut; % timevaladd(&tmp, &st); % /* Convert tmp (i.e. u + s) into hz units to match ru_i*. */ % t = tmp.tv_sec * hz + tmp.tv_usec / tick; % if (t) % - acct.ac_mem = encode_long((r->ru_ixrss + r->ru_idrss + % - + r->ru_isrss) / t); % + acct.ac_mem = encode_long((ru.ru_ixrss + ru.ru_idrss + % + + ru.ru_isrss) / t); % else % acct.ac_mem = 0; % % /* (5) The number of disk I/O operations done */ % - acct.ac_io = encode_long(r->ru_inblock + r->ru_oublock); % + acct.ac_io = encode_long(ru.ru_inblock + ru.ru_oublock); % % /* (6) The UID and GID of the process */ % acct.ac_uid = p->p_ucred->cr_ruid; I see the problem here. acct_process() should use the finalized p_ru. However, acct_process() is called from exit1() eve before the premature finalization there. Thus the rufetch() here is necessary to aggregate the stats, but it may give slightly anachronistic stats. No one cares (especially for the acct times, which had a granularity of 1/64 second until recently, so they were usually 0 and rarely showed the differences of a few uS due to this race), but it would be cleaner not to do extra work here. % Index: kern/kern_exit.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v % retrieving revision 1.298 % diff -u -p -r1.298 kern_exit.c % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 % +++ kern/kern_exit.c 30 May 2007 00:10:21 -0000 % @@ -169,7 +170,8 @@ retry: % * Threading support has been turned off. % */ % } % - % + KASSERT(p->p_numthreads == 1, % + ("exit1: proc %p exiting with %d threads", p, p->p_numthreads)); % /* % * Wakeup anyone in procfs' PIOCWAIT. They should have a hold % * on our vmspace, so we should block below until they have Lost some formatting. % @@ -438,15 +442,17 @@ retry: % PROC_UNLOCK(q); % } % % - /* % - * Save exit status and finalize rusage info except for times, % - * adding in child rusage info later when our time is locked. % - */ % + /* Save exit status. */ % PROC_LOCK(p); % p->p_xstat = rv; % p->p_xthread = td; % - p->p_stats->p_ru.ru_nvcsw++; % - *p->p_ru = p->p_stats->p_ru; % + /* % + * All statistics have been aggregated into the final td_ru by % + * thread_exit(), just copy it into p_ru here. % + */ % + *ru = td->td_ru; % + ru->ru_nvcsw++; % + p->p_ru = ru; % % /* % * Notify interested parties of our demise. Still have various bugs here, starting with the grammar error (comma splice) in the comment. I now understand what you mean by thread_exit() having aggregated the stats -- thread_exit() hasn't been called yet for this thread, but it has been called for other threads in the process, if any. So td_ru has aggregated stats, and accessing it directly just gives the old races for the unthreaded case: - td_ru is locked by sched_lock, but there is no sched_locking here. - any further updates of td_ru except for the times are not copied into p_ru unless they occur during the above copy in which case they corrupt the copy. These bugs can probably be fixed by moving the copying to the final thread_exit(). Maybe acct_process() can be moved there too, but I suspect too much of the process has gone away by then for acct_process() to work. % Index: kern/kern_resource.c % =================================================================== % RCS file: /usr/home/ncvs/src/sys/kern/kern_resource.c,v % retrieving revision 1.171 % diff -u -p -r1.171 kern_resource.c % --- kern/kern_resource.c 27 May 2007 20:50:23 -0000 1.171 % +++ kern/kern_resource.c 30 May 2007 11:52:28 -0000 % @@ -619,6 +619,38 @@ setrlimit(td, uap) % return (error); % } % % +static void % +lim_cb(void *arg) % +{ % + struct rlimit rlim; % + struct thread *td; % + struct proc *p; % + % + p = arg; % ... % + callout_reset(&p->p_limco, hz, lim_cb, p); % +} % + A better cleanup than implicitly casting p to "void *" here: use the original "void *" argment `arg'. % Index: sys/proc.h % =================================================================== % RCS file: /usr/home/ncvs/src/sys/sys/proc.h,v % retrieving revision 1.477 % diff -u -p -r1.477 proc.h % --- sys/proc.h 24 Apr 2007 10:59:21 -0000 1.477 % +++ sys/proc.h 30 May 2007 11:41:53 -0000 % ... % @@ -572,7 +576,7 @@ struct proc { % struct mdproc p_md; /* Any machine-dependent fields. */ % struct callout p_itcallout; /* (h + c) Interval timer callout. */ % u_short p_acflag; /* (c) Accounting flags. */ % - struct rusage *p_ru; /* (a) Exit information. XXX */ % + struct rusage *p_ru; /* (a) Exit information. */ The locking annotation doesn't work well for pointers. Locking for the contents is more interesting than locking for the pointer. The locking annotation for struct rusage says mostly (j) (sched_lock), but that only applies to td_ru -- it doesn't apply here and applies even less to rusages in userland. resource.h is supposed to be a userland header. % Index: sys/resource.h % =================================================================== % RCS file: /usr/home/ncvs/src/sys/sys/resource.h,v % retrieving revision 1.30 % diff -u -p -r1.30 resource.h % --- sys/resource.h 16 Nov 2005 18:18:52 -0000 1.30 % +++ sys/resource.h 29 May 2007 20:29:22 -0000 % @@ -50,33 +50,31 @@ % /* % * Resource utilization information. % * % - * Locking key: % - * c - locked by proc mtx % - * j - locked by sched_lock mtx % - * n - not locked, lazy % + * All fields are only modified by curthread and % + * no locks are required to read. % */ Oops, I now see that you removed all the (j)s and sched locking here. But this isn't quite right -- many of the fields are accessed by statclock(), so something like interrupt atomicity or splhigh() is required to access them. The above (mainly in exit1()) gives much the same races as in RELENG_4: - RELENG_4: statclock() uses splhigh() but not interrupt atomicity. exit1() uses no locking and thus races with statclock(). above: statclock() still uses sched_lock but not interrupt atomicity. exit1() uses no locking and thus races with statclock(). Time fields are mostly in rux and still fully locked by sched_lock. exit1() copies some of them to p_ru, but that copy is not used. I think proc locking is still used for p_ru -- it is used in kern_wait(), where it seems to be necessary to prevent multiple threads in the parent process racing to reap the child. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu May 31 00:36:47 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1400316A41F for ; Thu, 31 May 2007 00:36:47 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outU.internet-mail-service.net (outU.internet-mail-service.net [216.240.47.244]) by mx1.freebsd.org (Postfix) with ESMTP id F1D9D13C4C3 for ; Thu, 31 May 2007 00:36:46 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.32) with ESMTP; Wed, 30 May 2007 17:36:46 -0700 Received: from julian-mac.elischer.org (nat.ironport.com [63.251.108.100]) by idiom.com (Postfix) with ESMTP id 2C0E9125B4A; Wed, 30 May 2007 17:36:46 -0700 (PDT) Message-ID: <465E189C.4000609@elischer.org> Date: Wed, 30 May 2007 17:36:44 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.0 (Macintosh/20070326) MIME-Version: 1.0 To: Bruce Evans References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> In-Reply-To: <20070531091419.S826@besplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2007 00:36:47 -0000 Bruce Evans wrote: > - RELENG_4: statclock() uses splhigh() but not interrupt atomicity. > exit1() uses no locking and thus races with statclock(). > above: statclock() still uses sched_lock but not interrupt atomicity. sched_lock blocks interrupts > exit1() uses no locking and thus races with statclock(). > Time fields are mostly in rux and still fully locked by sched_lock. > exit1() copies some of them to p_ru, but that copy is not used. I > think proc locking is still used for p_ru -- it is used in kern_wait(), > where it seems to be necessary to prevent multiple threads in the > parent process racing to reap the child. > > Bruce > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Thu May 31 01:11:45 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id AAEEB16A421 for ; Thu, 31 May 2007 01:11:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au [211.29.132.191]) by mx1.freebsd.org (Postfix) with ESMTP id 2D35213C45A for ; Thu, 31 May 2007 01:11:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l4V1BTn3007608 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 31 May 2007 11:11:31 +1000 Date: Thu, 31 May 2007 11:11:31 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Julian Elischer In-Reply-To: <465E189C.4000609@elischer.org> Message-ID: <20070531110004.H1221@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <465E189C.4000609@elischer.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2007 01:11:45 -0000 On Wed, 30 May 2007, Julian Elischer wrote: > Bruce Evans wrote: > >> - RELENG_4: statclock() uses splhigh() but not interrupt atomicity. >> exit1() uses no locking and thus races with statclock(). >> above: statclock() still uses sched_lock but not interrupt atomicity. > > > sched_lock blocks interrupts > >> exit1() uses no locking and thus races with statclock(). ^^^^^^^^^^^^^^^^^^^^^^^ exit1() uses no locking and is thus unaffected by sched_lock. More precisely, it uses mounds of locking, but at the point that it races with statclock() it holds only PROC_LOCK(p). It acquires sched_lock for the first time much later. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu May 31 08:19:28 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1C2CC16A468 for ; Thu, 31 May 2007 08:19:28 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id E65B913C46A for ; Thu, 31 May 2007 08:19:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l4V8JON1026052 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Thu, 31 May 2007 04:19:26 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Thu, 31 May 2007 01:19:16 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070531091419.S826@besplex.bde.org> Message-ID: <20070531010631.N661@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 31 May 2007 08:19:28 -0000 On Thu, 31 May 2007, Bruce Evans wrote: > On Wed, 30 May 2007, Jeff Roberson wrote: > > % Index: kern/kern_exit.c > % =================================================================== > % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v > % retrieving revision 1.298 > % diff -u -p -r1.298 kern_exit.c > % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 > % +++ kern/kern_exit.c 30 May 2007 00:10:21 -0000 > % @@ -169,7 +170,8 @@ retry: > % * Threading support has been turned off. > % */ > % } > % - > % + KASSERT(p->p_numthreads == 1, > % + ("exit1: proc %p exiting with %d threads", p, p->p_numthreads)); > % /* > % * Wakeup anyone in procfs' PIOCWAIT. They should have a hold > % * on our vmspace, so we should block below until they have > > Lost some formatting. The extra newline? Isn't that a style violation? > > % @@ -438,15 +442,17 @@ retry: > % PROC_UNLOCK(q); > % } > % % - /* > % - * Save exit status and finalize rusage info except for times, > % - * adding in child rusage info later when our time is locked. > % - */ > % + /* Save exit status. */ > % PROC_LOCK(p); > % p->p_xstat = rv; > % p->p_xthread = td; > % - p->p_stats->p_ru.ru_nvcsw++; > % - *p->p_ru = p->p_stats->p_ru; > % + /* > % + * All statistics have been aggregated into the final td_ru by > % + * thread_exit(), just copy it into p_ru here. > % + */ > % + *ru = td->td_ru; > % + ru->ru_nvcsw++; > % + p->p_ru = ru; > % % /* > % * Notify interested parties of our demise. > > Still have various bugs here, starting with the grammar error (comma > splice) in the comment. I now understand what you mean by thread_exit() > having aggregated the stats -- thread_exit() hasn't been called yet > for this thread, but it has been called for other threads in the process, > if any. So td_ru has aggregated stats, and accessing it directly just > gives the old races for the unthreaded case: > - td_ru is locked by sched_lock, but there is no sched_locking here. > - any further updates of td_ru except for the times are not copied into > p_ru unless they occur during the above copy in which case they corrupt > the copy. > These bugs can probably be fixed by moving the copying to the final > thread_exit(). Maybe acct_process() can be moved there too, but I > suspect too much of the process has gone away by then for acct_process() > to work. I tried moving the rusage summing and assignment to p_ru to later in this function. This lead to a race which produced "runtime went backwards" printfs because runtime became 0. I didn't quite understand this as we check for PRS_ZOMBIE in wait, but I also didn't investigate terribly far. More notes on locking below. > > Oops, I now see that you removed all the (j)s and sched locking here. > But this isn't quite right -- many of the fields are accessed by > statclock(), so something like interrupt atomicity or splhigh() is > required to access them. The above (mainly in exit1()) gives much > the same races as in RELENG_4: > - RELENG_4: statclock() uses splhigh() but not interrupt atomicity. > exit1() uses no locking and thus races with statclock(). > above: statclock() still uses sched_lock but not interrupt atomicity. > exit1() uses no locking and thus races with statclock(). > Time fields are mostly in rux and still fully locked by sched_lock. > exit1() copies some of them to p_ru, but that copy is not used. I > think proc locking is still used for p_ru -- it is used in kern_wait(), > where it seems to be necessary to prevent multiple threads in the > parent process racing to reap the child. I believe the sched lock locking is necessary for rux. For rusage, however, it's only ever modified by curthread. The only races that are present when reading without locks are potentially inconsistent rusage where the stats are not all from a single snapshot of the program. However, the stats can obviously change before the copy makes it back to user-space anyhow. I don't see a real race that needs protecting. This is why I believe the code in exit1() is also safe although technically it could lose some information on the way to thread_exit(). To fix this we'd have to do the final accumulation under sched_lock the last time we lock it. However there are other races in the proc code there that were not immediately obvious. Also, doing the accumulation here negates the possibility of running process accounting after p_ru is valid, which it is not doing now. Related to rusage but unrelated to my patch; Why are irxss, irdss, and irsss expressed in units of bytes*ticks of execution when rusage doesn't report the ticks of execution and rather a timeval? Are consumers expected to sum the timevals and then extrapolate? Jeff > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 01:17:42 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9DA1C16A421 for ; Fri, 1 Jun 2007 01:17:42 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 467D913C487 for ; Fri, 1 Jun 2007 01:17:42 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l511HUkA077510 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Thu, 31 May 2007 21:17:36 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Thu, 31 May 2007 18:17:26 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070531010631.N661@10.0.0.1> Message-ID: <20070531181228.W799@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 01:17:42 -0000 On Thu, 31 May 2007, Jeff Roberson wrote: > On Thu, 31 May 2007, Bruce Evans wrote: > > I believe the sched lock locking is necessary for rux. For rusage, however, > it's only ever modified by curthread. The only races that are present when > reading without locks are potentially inconsistent rusage where the stats are > not all from a single snapshot of the program. However, the stats can > obviously change before the copy makes it back to user-space anyhow. I don't > see a real race that needs protecting. > Now that I've said all of that and committed the patch, I just realized that there is still one race that is unacceptable. When the thread exits in thread_exit() and adds the stats of both threads together we could lose changes in the still-running thread. I guess I may just leave a ru in struct proc that they are added to. In the threadlock patch I serialize thread_exit() on a per-process basis. This will be compatible with the locking there. This will also get rid of that MALLOC you didn't like at the expense of slightly bloating struct proc, which I don't like. Jeff > This is why I believe the code in exit1() is also safe although technically > it could lose some information on the way to thread_exit(). To fix this we'd > have to do the final accumulation under sched_lock the last time we lock it. > However there are other races in the proc code there that were not > immediately obvious. Also, doing the accumulation here negates the > possibility of running process accounting after p_ru is valid, which it is > not doing now. > > Related to rusage but unrelated to my patch; Why are irxss, irdss, and irsss > expressed in units of bytes*ticks of execution when rusage doesn't report the > ticks of execution and rather a timeval? Are consumers expected to sum the > timevals and then extrapolate? > > Jeff > >> >> Bruce >> > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 06:25:23 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CF8D616A400 for ; Fri, 1 Jun 2007 06:25:23 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 9AA1013C457 for ; Fri, 1 Jun 2007 06:25:23 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l516PLGK022023 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO) for ; Fri, 1 Jun 2007 02:25:22 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Thu, 31 May 2007 23:25:17 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: arch@freebsd.org Message-ID: <20070531221856.B799@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: Last chance for threadlock review. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 06:25:23 -0000 Next week I intend to commit the final scheduler lock decomposition patch. Before that time I intend to: 1) Continue to test more variations on schedulers, architectures, preemption, debugging features, and a buildworld or two. 2) Post the patch to current@ to appeal for more testers. Kris Kennaway and myself have been hammering on it pretty hard but there are always cases that are missed when it's only two individuals. 3) Solicit for more thorough review here! I haven't had much feedback on the difficult parts of the patch. I really would like more eyes on it. Some things to focus on: a) Turnstile locking b) sleepqueue locking c) blocked lock d) thread_lock() e) New cpu_switch() argument. I assume adding an argument to a function doesn't effect the assembly implementation if it is unused. After this diff goes in I will provide a copy of ULE for download that actually implements per-cpu scheduler locks. So far the system has been fully stable with my threadlock diff, however, there are still some problems with the per-cpu locks in ULE. There are also some significant performance wins depending on the workload. I also intend to produce some follow-on patches to improve rusage and add more architecture support for the new cpu_switch(). Architectures that do not implement the special cpu_switch() can not take advantage of the per-cpu ULE but otherwise work. http://people.freebsd.org/~jeff/threadlock.diff Thanks, Jeff From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 08:18:54 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7C70716A400 for ; Fri, 1 Jun 2007 08:18:54 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail12.syd.optusnet.com.au (mail12.syd.optusnet.com.au [211.29.132.193]) by mx1.freebsd.org (Postfix) with ESMTP id 008A613C484 for ; Fri, 1 Jun 2007 08:18:53 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail12.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l518IeYT017863 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 18:18:43 +1000 Date: Fri, 1 Jun 2007 18:18:42 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070531010631.N661@10.0.0.1> Message-ID: <20070601154833.O4207@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 08:18:54 -0000 On Thu, 31 May 2007, Jeff Roberson wrote: > On Thu, 31 May 2007, Bruce Evans wrote: > >> On Wed, 30 May 2007, Jeff Roberson wrote: >> >> % Index: kern/kern_exit.c >> % =================================================================== >> % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v >> % retrieving revision 1.298 >> % diff -u -p -r1.298 kern_exit.c >> % --- kern/kern_exit.c 14 May 2007 22:21:58 -0000 1.298 >> % +++ kern/kern_exit.c 30 May 2007 00:10:21 -0000 >> % @@ -169,7 +170,8 @@ retry: >> % * Threading support has been turned off. >> % */ >> % } >> % - >> % + KASSERT(p->p_numthreads == 1, >> % + ("exit1: proc %p exiting with %d threads", p, p->p_numthreads)); >> % /* >> % * Wakeup anyone in procfs' PIOCWAIT. They should have a hold >> % * on our vmspace, so we should block below until they have >> >> Lost some formatting. > > The extra newline? Isn't that a style violation? This is arguable. Although strict KNF may require no extra blank lines (indent -sob), and this rule is not too bad for block comments because the line with the "/*" provides some separation, there are many examples of old code in kern, including many in this file, where a blank line is left before the block comment. In the 4.4Lite2 kern_exit.c, the count is approx. 10:8 in favour of no extra blank line before block comments within functions (not counting cases where there clearly shouldn't be one, e.g., after #ifdef). In the current kern_exit.c after this change was committed, the count is 32:13 in favour of a blank line (down from 33:12 before this change). My version of style.9 requires leaving a blank line before comments and not leaving a blank line before statements (if you want to leave a blank line between statements to form separate blocks of statements, there must be a comment at the start of each new block). >> % @@ -438,15 +442,17 @@ retry: >> % PROC_UNLOCK(q); >> % } >> % % - /* >> % - * Save exit status and finalize rusage info except for times, >> % - * adding in child rusage info later when our time is locked. >> % - */ >> % + /* Save exit status. */ >> % PROC_LOCK(p); >> % p->p_xstat = rv; >> % p->p_xthread = td; >> % - p->p_stats->p_ru.ru_nvcsw++; >> % - *p->p_ru = p->p_stats->p_ru; >> % + /* >> % + * All statistics have been aggregated into the final td_ru by >> % + * thread_exit(), just copy it into p_ru here. >> % + */ >> % + *ru = td->td_ru; >> % + ru->ru_nvcsw++; >> % + p->p_ru = ru; >> % % /* >> % * Notify interested parties of our demise. >> >> Still have various bugs here [mainly races with statclock]. >> These bugs can probably be fixed by moving the copying to the final >> thread_exit(). Maybe acct_process() can be moved there too, but I >> suspect too much of the process has gone away by then for acct_process() >> to work. > > I tried moving the rusage summing and assignment to p_ru to later in this > function. This lead to a race which produced "runtime went backwards" > printfs because runtime became 0. I didn't quite understand this as we check > for PRS_ZOMBIE in wait, but I also didn't investigate terribly far. Some technical problem. >> Oops, I now see that you removed all the (j)s and sched locking here. >> But this isn't quite right -- many of the fields are accessed by >> statclock(), so something like interrupt atomicity or splhigh() is >> required to access them. The above (mainly in exit1()) gives much >> the same races as in RELENG_4: >> - RELENG_4: statclock() uses splhigh() but not interrupt atomicity. >> exit1() uses no locking and thus races with statclock(). >> above: statclock() still uses sched_lock but not interrupt atomicity. >> exit1() uses no locking and thus races with statclock(). >> Time fields are mostly in rux and still fully locked by sched_lock. >> exit1() copies some of them to p_ru, but that copy is not used. I >> think proc locking is still used for p_ru -- it is used in kern_wait(), >> where it seems to be necessary to prevent multiple threads in the >> parent process racing to reap the child. > > I believe the sched lock locking is necessary for rux. For rusage, however, > it's only ever modified by curthread. The only races that are present when > reading without locks are potentially inconsistent rusage where the stats are > not all from a single snapshot of the program. However, the stats can > obviously change before the copy makes it back to user-space anyhow. I don't > see a real race that needs protecting. No, at least the ru_*rss fields are updated by statclock(). These seem to be the only problematic fields. They are unimportant. However, they should be easy to protect correctly using the same lock as the td_*ticks fields. They are still protected by sched_lock in statclock() and rufetch() but not here. Hmm, this is confusing. Normal locking is not used for thread-local fields. Instead, a side effect of spinlocking is used: mtx_lock_spin(&any) in non-interrupt code has the side effect of masking interrupts on the current CPU, so statclock() can access anything on the current CPU without acquiring any locks, just like an interrupt handler on a UP system. This is used for td_*ticks. It doesn't work for ru_*rss since since exit1() doesn't hold a spinlock when copying td_ru. The sched_locking of ru_*rss in statclock() doesn't help -- I think it now does nothing except waste time, since these fields are now thread-local so they need the same locking as td_*ticks, which is null in statclock() but non-null elsewhere. Related bugs: - td_[usip]ticks are still under (j) (sched_lock) in proc.h. - td_(uu,us}ticks have always (?) been under (k) (thread-local). That is more correct than (j), but they are updated by an interrupt handler and seem to be accessed without disabling interrupts elsewhere. > This is why I believe the code in exit1() is also safe Modification by only curthread isn't quite enough -- see above. > although technically > it could lose some information on the way to thread_exit(). To fix this we'd > have to do the final accumulation under sched_lock the last time we lock it. > However there are other races in the proc code there that were not > immediately obvious. Also, doing the accumulation here negates the > possibility of running process accounting after p_ru is valid, which it is > not doing now. I see. > Related to rusage but unrelated to my patch; Why are irxss, irdss, and irsss > expressed in units of bytes*ticks of execution when rusage doesn't report the > ticks of execution and rather a timeval? Are consumers expected to sum the > timevals and then extrapolate? I don't really know, but guess this is historical. Ticks can be managed more efficiently than timevals. There are remarkably few consumers - mainly time(1), csh and window(1) in /usr/src. csh still seems to hard- code the tick freqency as 100. This would make all rss values off by a factor of stathz/100. time(1) hard-coded the frequency as 100 in FreeBSD-1.1. Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 08:54:38 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 45E6016A46D for ; Fri, 1 Jun 2007 08:54:38 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au [211.29.132.191]) by mx1.freebsd.org (Postfix) with ESMTP id D458D13C44B for ; Fri, 1 Jun 2007 08:54:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l518sQJa022148 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 18:54:27 +1000 Date: Fri, 1 Jun 2007 18:54:28 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans In-Reply-To: <20070601154833.O4207@besplex.bde.org> Message-ID: <20070601184224.A4207@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070601154833.O4207@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 08:54:38 -0000 On Fri, 1 Jun 2007, Bruce Evans wrote: > Hmm, this is confusing. Normal locking is not used for thread-local > fields. Instead, a side effect of spinlocking is used: > mtx_lock_spin(&any) in non-interrupt code has the side effect of > masking interrupts on the current CPU, so statclock() can access > anything on the current CPU without acquiring any locks, just like > an interrupt handler on a UP system. This is used for td_*ticks. > It doesn't work for ru_*rss since since exit1() doesn't hold a > spinlock when copying td_ru. The sched_locking of ru_*rss in > statclock() doesn't help -- I think it now does nothing except > waste time, since these fields are now thread-local so they need > the same locking as td_*ticks, which is null in statclock() but > non-null elsewhere. > > Related bugs: > - td_[usip]ticks are still under (j) (sched_lock) in proc.h. > - td_(uu,us}ticks have always (?) been under (k) (thread-local). That > is more correct than (j), but they are updated by an interrupt handler > and seem to be accessed without disabling interrupts elsewhere. Oops, it's more confusing than that. It is not a bug for td_[usip]ticks to be under sched_lock, since they must be under a more specific lock than `any' for when they are reset in ruxagg(). That lock has the dual purpose of locking out interrupts as a side effect and locking out other threads explicitly. Please add a lock assertion in ruxagg() and friends that the relevant lock is held. Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 08:57:29 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 70C2916A41F for ; Fri, 1 Jun 2007 08:57:29 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 47FFC13C489 for ; Fri, 1 Jun 2007 08:57:29 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l518vRw2034047 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 04:57:28 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Fri, 1 Jun 2007 01:57:22 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070601154833.O4207@besplex.bde.org> Message-ID: <20070601014601.I799@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070601154833.O4207@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 08:57:29 -0000 On Fri, 1 Jun 2007, Bruce Evans wrote: >> The extra newline? Isn't that a style violation? > > This is arguable. Although strict KNF may require no extra blank lines > (indent -sob), and this rule is not too bad for block comments because > the line with the "/*" provides some separation, there are many examples > of old code in kern, including many in this file, where a blank line > is left before the block comment. In the 4.4Lite2 kern_exit.c, the > count is approx. 10:8 in favour of no extra blank line before block > comments within functions (not counting cases where there clearly > shouldn't be one, e.g., after #ifdef). In the current kern_exit.c > after this change was committed, the count is 32:13 in favour of a > blank line (down from 33:12 before this change). > > My version of style.9 requires leaving a blank line before comments > and not leaving a blank line before statements (if you want to leave > a blank line between statements to form separate blocks of statements, > there must be a comment at the start of each new block). I see, in that case, I'll follow this rule in the future. It seems sensible to me. >>> These bugs can probably be fixed by moving the copying to the final >>> thread_exit(). Maybe acct_process() can be moved there too, but I >>> suspect too much of the process has gone away by then for acct_process() >>> to work. >> >> I tried moving the rusage summing and assignment to p_ru to later in this >> function. This lead to a race which produced "runtime went backwards" >> printfs because runtime became 0. I didn't quite understand this as we >> check for PRS_ZOMBIE in wait, but I also didn't investigate terribly far. > > Some technical problem. Well, I think the whole exit/wait path could probably use some attention. There is an incredible amount of locking and unlocking to deal with various races and lock order issues. And there are many subtle effects which aren't immediately obvious. I'm nervous about how many bugs might be caused if we start going down this path so close to 7.0. > >> I believe the sched lock locking is necessary for rux. For rusage, >> however, it's only ever modified by curthread. The only races that are >> present when reading without locks are potentially inconsistent rusage >> where the stats are not all from a single snapshot of the program. However, >> the stats can obviously change before the copy makes it back to user-space >> anyhow. I don't see a real race that needs protecting. > > No, at least the ru_*rss fields are updated by statclock(). These > seem to be the only problematic fields. They are unimportant. However, > they should be easy to protect correctly using the same lock as the > td_*ticks fields. They are still protected by sched_lock in statclock() > and rufetch() but not here. > > Hmm, this is confusing. Normal locking is not used for thread-local > fields. Instead, a side effect of spinlocking is used: > mtx_lock_spin(&any) in non-interrupt code has the side effect of > masking interrupts on the current CPU, so statclock() can access > anything on the current CPU without acquiring any locks, just like > an interrupt handler on a UP system. This is used for td_*ticks. > It doesn't work for ru_*rss since since exit1() doesn't hold a > spinlock when copying td_ru. The sched_locking of ru_*rss in > statclock() doesn't help -- I think it now does nothing except > waste time, since these fields are now thread-local so they need > the same locking as td_*ticks, which is null in statclock() but > non-null elsewhere. Yes, you're right. I'll move the copying under the scheduler lock in exit1() so we can't lose any of the rss information. > > Related bugs: > - td_[usip]ticks are still under (j) (sched_lock) in proc.h. > - td_(uu,us}ticks have always (?) been under (k) (thread-local). That > is more correct than (j), but they are updated by an interrupt handler > and seem to be accessed without disabling interrupts elsewhere. Well td_[uisp]ticks are set and cleared while the sched_lock is held. In threadlock.diff the thread lock is responsible for this. That reminds me that I didn't add the per-thread locking to rufetch() in the patch I posted earlier. Thanks, Jeff > >> This is why I believe the code in exit1() is also safe > > Modification by only curthread isn't quite enough -- see above. > >> although technically it could lose some information on the way to >> thread_exit(). To fix this we'd have to do the final accumulation under >> sched_lock the last time we lock it. However there are other races in the >> proc code there that were not immediately obvious. Also, doing the >> accumulation here negates the possibility of running process accounting >> after p_ru is valid, which it is not doing now. > > I see. > >> Related to rusage but unrelated to my patch; Why are irxss, irdss, and >> irsss expressed in units of bytes*ticks of execution when rusage doesn't >> report the ticks of execution and rather a timeval? Are consumers expected >> to sum the timevals and then extrapolate? > > I don't really know, but guess this is historical. Ticks can be managed > more efficiently than timevals. There are remarkably few consumers > - mainly time(1), csh and window(1) in /usr/src. csh still seems to hard- > code the tick freqency as 100. This would make all rss values off > by a factor of stathz/100. time(1) hard-coded the frequency as 100 > in FreeBSD-1.1. > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 09:28:04 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5472116A400 for ; Fri, 1 Jun 2007 09:28:04 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 0129E13C44B for ; Fri, 1 Jun 2007 09:28:03 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l519S1FQ036567 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 05:28:02 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Fri, 1 Jun 2007 02:27:56 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070601184224.A4207@besplex.bde.org> Message-ID: <20070601021132.J799@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070601154833.O4207@besplex.bde.org> <20070601184224.A4207@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 09:28:04 -0000 On Fri, 1 Jun 2007, Bruce Evans wrote: > On Fri, 1 Jun 2007, Bruce Evans wrote: > >> Hmm, this is confusing. Normal locking is not used for thread-local >> fields. Instead, a side effect of spinlocking is used: >> mtx_lock_spin(&any) in non-interrupt code has the side effect of >> masking interrupts on the current CPU, so statclock() can access >> anything on the current CPU without acquiring any locks, just like >> an interrupt handler on a UP system. This is used for td_*ticks. >> It doesn't work for ru_*rss since since exit1() doesn't hold a >> spinlock when copying td_ru. The sched_locking of ru_*rss in >> statclock() doesn't help -- I think it now does nothing except >> waste time, since these fields are now thread-local so they need >> the same locking as td_*ticks, which is null in statclock() but >> non-null elsewhere. >> >> Related bugs: >> - td_[usip]ticks are still under (j) (sched_lock) in proc.h. >> - td_(uu,us}ticks have always (?) been under (k) (thread-local). That >> is more correct than (j), but they are updated by an interrupt handler >> and seem to be accessed without disabling interrupts elsewhere. > > Oops, it's more confusing than that. It is not a bug for td_[usip]ticks > to be under sched_lock, since they must be under a more specific lock > than `any' for when they are reset in ruxagg(). That lock has the dual > purpose of locking out interrupts as a side effect and locking out other > threads explicitly. > > Please add a lock assertion in ruxagg() and friends that the relevant > lock is held. I have such an assert in a threadlock.diff that I will post tomorrow. I could add it in before as well, but I suspect threadlock is going in next week and there are different locks involved there anyway. Jeff > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 09:50:37 2007 Return-Path: X-Original-To: freebsd-arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E379B16A41F for ; Fri, 1 Jun 2007 09:50:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail33.syd.optusnet.com.au (mail33.syd.optusnet.com.au [211.29.132.104]) by mx1.freebsd.org (Postfix) with ESMTP id 67EF413C448 for ; Fri, 1 Jun 2007 09:50:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail33.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l519i6b6015553 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 19:50:27 +1000 Date: Fri, 1 Jun 2007 19:44:03 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jeff Roberson In-Reply-To: <20070531181228.W799@10.0.0.1> Message-ID: <20070601192834.S4657@besplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070531181228.W799@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@FreeBSD.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 09:50:38 -0000 On Thu, 31 May 2007, Jeff Roberson wrote: > Now that I've said all of that and committed the patch, I just realized that > there is still one race that is unacceptable. When the thread exits in > thread_exit() and adds the stats of both threads together we could lose > changes in the still-running thread. I think I see. The same problem seems to affect all calls to ruxagg() and rucollect() for threads that aren't curthread. You cannot control the stats for other threads using a spinlock since statclock() doesn't use a spinlock for the tick counts and shouldn't (modulo this bug) use one for the rss's. Resetting the tick counts in ruxagg() is particulary dangerous. Resetting the runtime in ruxagg() isn't a problem because the runtime isn't touched by statclock(). ruxcollect() only does insufficently locked accesses for reading the rss's, except in thread_exit(). It should be easy to avoid the resettings by accumulating into a local rux as is already done for ru's (put an rux in each thread and add these up when required). This reduces to the same problem as for the rss's. Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 09:57:27 2007 Return-Path: X-Original-To: freebsd-arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CE85716A400 for ; Fri, 1 Jun 2007 09:57:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 936D013C483 for ; Fri, 1 Jun 2007 09:57:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l519vO7r039190 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 05:57:26 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Fri, 1 Jun 2007 02:57:19 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070601192834.S4657@besplex.bde.org> Message-ID: <20070601025432.N799@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070531181228.W799@10.0.0.1> <20070601192834.S4657@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@FreeBSD.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 09:57:27 -0000 On Fri, 1 Jun 2007, Bruce Evans wrote: > On Thu, 31 May 2007, Jeff Roberson wrote: > >> Now that I've said all of that and committed the patch, I just realized >> that there is still one race that is unacceptable. When the thread exits >> in thread_exit() and adds the stats of both threads together we could lose >> changes in the still-running thread. > > I think I see. > > The same problem seems to affect all calls to ruxagg() and rucollect() > for threads that aren't curthread. You cannot control the stats for > other threads using a spinlock since statclock() doesn't use a spinlock > for the tick counts and shouldn't (modulo this bug) use one for the > rss's. Resetting the tick counts in ruxagg() is particulary dangerous. > Resetting the runtime in ruxagg() isn't a problem because the runtime > isn't touched by statclock(). ruxcollect() only does insufficently > locked accesses for reading the rss's, except in thread_exit(). It > should be easy to avoid the resettings by accumulating into a local > rux as is already done for ru's (put an rux in each thread and add > these up when required). This reduces to the same problem as for the > rss's. Well, it isn't terribly inconvenient to hold the sched_lock/thread_lock in statclock() where the stats are updated. This makes the solution simply to grab thread/sched lock in ruxagg(). Jeff > > Bruce > From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 10:18:57 2007 Return-Path: X-Original-To: freebsd-arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8163416A4F4 for ; Fri, 1 Jun 2007 10:18:57 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail11.syd.optusnet.com.au (mail11.syd.optusnet.com.au [211.29.132.192]) by mx1.freebsd.org (Postfix) with ESMTP id C4B1313C483 for ; Fri, 1 Jun 2007 10:18:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c211-30-225-63.carlnfd3.nsw.optusnet.com.au (c211-30-225-63.carlnfd3.nsw.optusnet.com.au [211.30.225.63]) by mail11.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l51AIgSr010918 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 20:18:45 +1000 Date: Fri, 1 Jun 2007 20:18:44 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Jeff Roberson In-Reply-To: <20070601014601.I799@10.0.0.1> Message-ID: <20070601200348.G6201@delplex.bde.org> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070601154833.O4207@besplex.bde.org> <20070601014601.I799@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@FreeBSD.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 10:18:57 -0000 On Fri, 1 Jun 2007, Jeff Roberson wrote: > On Fri, 1 Jun 2007, Bruce Evans wrote: > Well, I think the whole exit/wait path could probably use some attention. > There is an incredible amount of locking and unlocking to deal with various > races and lock order issues. And there are many subtle effects which aren't > immediately obvious. I'm nervous about how many bugs might be caused if we > start going down this path so close to 7.0. I'm afraid to look too closely :-). >> Related bugs: >> - td_[usip]ticks are still under (j) (sched_lock) in proc.h. >> - td_(uu,us}ticks have always (?) been under (k) (thread-local). That >> is more correct than (j), but they are updated by an interrupt handler >> and seem to be accessed without disabling interrupts elsewhere. [See other replies for large modifications to this] > Well td_[uisp]ticks are set and cleared while the sched_lock is held. In > threadlock.diff the thread lock is responsible for this. That reminds me > that I didn't add the per-thread locking to rufetch() in the patch I posted > earlier. But the ticks fields aren't aren't under sched_lock in the patches or committed version. The could easily be under time_lock, but were carefully pushed out of that too in the time_lock changes. Per-thread locking in statclock() and rufetch() could fix this but would give more locking overhead in statclock(). Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 19:23:37 2007 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6F2BA16A469 for ; Fri, 1 Jun 2007 19:23:37 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 20BC913C447 for ; Fri, 1 Jun 2007 19:23:37 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l51JNU2b091022 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO) for ; Fri, 1 Jun 2007 15:23:35 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Fri, 1 Jun 2007 12:23:25 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: arch@freebsd.org In-Reply-To: <20070531221856.B799@10.0.0.1> Message-ID: <20070601122118.B606@10.0.0.1> References: <20070531221856.B799@10.0.0.1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: Re: Last chance for threadlock review. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 19:23:37 -0000 On Thu, 31 May 2007, Jeff Roberson wrote: > Next week I intend to commit the final scheduler lock decomposition patch. > Before that time I intend to: > > 1) Continue to test more variations on schedulers, architectures, > preemption, debugging features, and a buildworld or two. > > 2) Post the patch to current@ to appeal for more testers. Kris Kennaway and > myself have been hammering on it pretty hard but there are always cases that > are missed when it's only two individuals. > > 3) Solicit for more thorough review here! I haven't had much feedback on > the difficult parts of the patch. I really would like more eyes on it. Some > things to focus on: > a) Turnstile locking > b) sleepqueue locking > c) blocked lock > d) thread_lock() > e) New cpu_switch() argument. I assume adding an argument to a function > doesn't effect the assembly implementation if it is unused. I should add: f) Signals, which are already a mess, have had some code consolidated. Earlier on there were bugs in this. If you have any great ideas for simplifying the signal code I'm all ears. I have considered using the per-process spinlock to block the threads on when they are in the stopped state. This may reduce some of the locking further. Jeff > > After this diff goes in I will provide a copy of ULE for download that > actually implements per-cpu scheduler locks. So far the system has been > fully stable with my threadlock diff, however, there are still some problems > with the per-cpu locks in ULE. There are also some significant performance > wins depending on the workload. > > I also intend to produce some follow-on patches to improve rusage and add > more architecture support for the new cpu_switch(). Architectures that do > not implement the special cpu_switch() can not take advantage of the per-cpu > ULE but otherwise work. > > http://people.freebsd.org/~jeff/threadlock.diff > > Thanks, > Jeff > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 19:37:58 2007 Return-Path: X-Original-To: freebsd-arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DAF4316A469 for ; Fri, 1 Jun 2007 19:37:58 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 7AEB813C43E for ; Fri, 1 Jun 2007 19:37:58 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.101] (c-71-231-138-78.hsd1.or.comcast.net [71.231.138.78]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id l51JbuaQ094843 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Fri, 1 Jun 2007 15:37:57 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Fri, 1 Jun 2007 12:37:51 -0700 (PDT) From: Jeff Roberson X-X-Sender: jroberson@10.0.0.1 To: Bruce Evans In-Reply-To: <20070601200348.G6201@delplex.bde.org> Message-ID: <20070601123530.B606@10.0.0.1> References: <20070529105856.L661@10.0.0.1> <200705291456.38515.jhb@freebsd.org> <20070529121653.P661@10.0.0.1> <20070530065423.H93410@delplex.bde.org> <20070529141342.D661@10.0.0.1> <20070530125553.G12128@besplex.bde.org> <20070529201255.X661@10.0.0.1> <20070529220936.W661@10.0.0.1> <20070530201618.T13220@besplex.bde.org> <20070530115752.F661@10.0.0.1> <20070531091419.S826@besplex.bde.org> <20070531010631.N661@10.0.0.1> <20070601154833.O4207@besplex.bde.org> <20070601014601.I799@10.0.0.1> <20070601200348.G6201@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@FreeBSD.org Subject: Re: Updated rusage patch X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2007 19:37:59 -0000 On Fri, 1 Jun 2007, Bruce Evans wrote: > On Fri, 1 Jun 2007, Jeff Roberson wrote: > >> On Fri, 1 Jun 2007, Bruce Evans wrote: > >> Well, I think the whole exit/wait path could probably use some attention. >> There is an incredible amount of locking and unlocking to deal with various >> races and lock order issues. And there are many subtle effects which >> aren't immediately obvious. I'm nervous about how many bugs might be >> caused if we start going down this path so close to 7.0. > > I'm afraid to look too closely :-). > >>> Related bugs: >>> - td_[usip]ticks are still under (j) (sched_lock) in proc.h. >>> - td_(uu,us}ticks have always (?) been under (k) (thread-local). That >>> is more correct than (j), but they are updated by an interrupt handler >>> and seem to be accessed without disabling interrupts elsewhere. > [See other replies for large modifications to this] > >> Well td_[uisp]ticks are set and cleared while the sched_lock is held. In >> threadlock.diff the thread lock is responsible for this. That reminds me >> that I didn't add the per-thread locking to rufetch() in the patch I posted >> earlier. > > But the ticks fields aren't aren't under sched_lock in the patches or > committed version. The could easily be under time_lock, but were > carefully pushed out of that too in the time_lock changes. Per-thread > locking in statclock() and rufetch() could fix this but would give more > locking overhead in statclock(). Please grep for statclock in threadlock.diff. This removes time_lock from statclock all together and protects the whole thing with thread_lock(). With this change all cpus can execute statclock() concurrently with sched_smp.c. This patch also has fixes for locking ruxagg() as well as asserts. It does not yet protect the ru copying in exit(). I want to figure out the synchronization issues with wait first. Jeff > > Bruce >