From: Michal Mertl
To: freebsd-current@freebsd.org
Cc: olli@lurza.secnetix.de
Subject: Re: vmstat's entries type
Date: Wed, 26 Jul 2006 03:12:13 +0200
Message-Id: <1153876334.55623.49.camel@genius.i.cz>

Oliver Fromme wrote:
> John Baldwin wrote:
> > On Sunday 23 July 2006 20:03, Sten Daniel Sørsdal wrote:
> > > sthaug at nethelp.no wrote:
> > > > > > One approach that we could use for 64-bit counters would be to just
> > > > > > use 32-bits one, and poll them for overflow and bump an overflow
> > > > > > count. This assumes that the 32-bit counters overflow much less often
> > > > > > than the polling interval, and easily triples the amount of storage
> > > > > > for each of them... It is ugly :-(
> > > > > >
> > > > > What's wrong with the add+adc (asm) approach found on any i386?
> > > >
> > > > Presumably the fact that add + adc isn't an atomic operation. So if
> > > > you want to guarantee 64 bit consistency, you need locking or similar.
> > > >
> > >
> > > Would it not be necessary to do this locking anyway?
> > > I don't see how polling for overflow would help this consistency.
> > > Are both suggestions insufficient?
> >
> > I actually think that add + adc is ok for the case of incrementing simple
> > counters. You can even do 'inc ; addc $0'
>
> (I'm familiar with asm programming, but I'm not a low-level
> threading or SMP expert, so please excuse me if this is a
> dumb question ...)
>
> If you just do add+adc (or inc+adc) and another thread (on
> the same or different processor, I don't know) happens to
> read the counter value at the same time (i.e. after the
> lower 32bit have overflowed, but before the upper 32bit get
> incremented), then that other thread would get a value
> that's off by 2^32.
>
> What am I missing?

I don't remember all the details, but when I proposed (with patches)
a change to the network counters several years ago, I gave in to the
(possibly right) opposition. It probably came from BDE, but I don't
remember.

Your explanation of a possible failure scenario is just one example of
what can go wrong there. It (an 'add' instruction followed by 'adc' -
or more generally working with 64bit counters on a 32bit architecture -
or even more generally working with an integer in kernel context in a
multiprocessor environment) can go wrong in more "exotic" ways on
architectures without implicitly coherent caches: you can read an old
value of something, modify it, and write it back, overwriting a much
more recent copy, or something like that. Even a simple increment may
not be fully safe (it is also, in the end, a read-modify-write
operation, which can, in theory at least, be interrupted between any
two of its steps).
I have not studied this enough, but it makes sense to me, and I believe
these were among the reasons why 64 bit counters on 32 bit I386 were
rejected at the time.

The modifications of the counters may be wrapped in preprocessor macros,
though. The right implementation of the macro can be 100% correct, but
it will add a big overhead, e.g. the lock instruction prefix (needed on
I386 SMP) possibly takes hundreds of cycles to execute.

Therefore, I think that we should either go with per-CPU copies of the
counter, in whatever size is appropriate, and have the total be the sum
of the values (possibly also taking care of overflow), or we should just
accept the status quo: use something "natural" for the architecture
(e.g. int or long) and hope for the best (a wrong counter normally
doesn't cause any problems). It (int or sometimes long) has been good
enough for decades.

The first way (per-CPU counters) shouldn't be that difficult to do
(almost?) correctly either. I just wanted to propose the change with
less potential for a bikeshed (even a possibly fruitful one :-)) and
more potential for a real change in the sources (and better counter
values for me - I believe that I won't build any new 32bit system, so
long should be long enough for me).

I believe I should be able to code both ways (and can easily test I386
and AMD64, but these are not good test architectures AFAIK, as they
ensure cache coherency).

My patch for network counters wrapped every counter operation in a macro
which could have been expanded to different code. Doing the operations
absolutely safely was terribly inefficient. Even per-CPU increments
probably aren't 100% safe, and that was the reason PCPU_LAZY_INC was
introduced - so that consumers knew they can't really rely on the
counter value.

Michal
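
For illustration only, here is a minimal userland sketch in C of the
per-CPU idea described above: one slot per CPU in the architecture's
natural word size, every update going through a macro, and a reader that
sums the slots into a 64-bit total. All names (SKETCH_NCPU,
sketch_counter, SKETCH_COUNTER_ADD, sketch_counter_fetch) are
hypothetical and are not existing FreeBSD interfaces. The sketch assumes
the caller passes the index of the CPU it is running on and cannot
migrate during the update. A real kernel version would have to guarantee
that itself, and the reader may still see slightly stale slots.

```c
#include <stdint.h>

/* Hypothetical names throughout; none of this is FreeBSD kernel API. */
#define SKETCH_NCPU 4	/* assumed fixed number of CPUs */

struct sketch_counter {
	/* One slot per CPU; each slot is only written by its own CPU. */
	unsigned long slot[SKETCH_NCPU];
};

/*
 * Every update goes through this macro, so its expansion could later be
 * swapped (plain add, atomic add, ...) without touching the callers.
 * Here it is a plain, non-atomic add on the caller's own slot.
 */
#define SKETCH_COUNTER_ADD(c, cpu, n) \
	((c)->slot[(cpu)] += (unsigned long)(n))

/* Reader side: sum the per-CPU slots into one 64-bit total. */
static uint64_t
sketch_counter_fetch(const struct sketch_counter *c)
{
	uint64_t total = 0;
	int i;

	for (i = 0; i < SKETCH_NCPU; i++)
		total += c->slot[i];
	return (total);
}
```

A caller would zero-initialize a struct sketch_counter, use
SKETCH_COUNTER_ADD(&c, cpu, 1) on the hot path, and call
sketch_counter_fetch(&c) only when the value is reported. The slots are
kept in the natural word size so that each update is a single-word
operation. Handling slot overflow on 32-bit systems is left out of this
sketch.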