From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 05:19:11 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DD0C6106564A; Sun, 3 Jun 2012 05:19:11 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 5DDF48FC0A; Sun, 3 Jun 2012 05:19:11 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q535J5Qr016796; Sun, 3 Jun 2012 08:19:05 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q535J4m1082109; Sun, 3 Jun 2012 08:19:04 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q535J4j0082108; Sun, 3 Jun 2012 08:19:04 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sun, 3 Jun 2012 08:19:04 +0300 From: Konstantin Belousov To: Bruce Evans Message-ID: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> <20120602171632.GC2358@deviant.kiev.zoral.com.ua> <20120603063330.H3418@besplex.bde.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="eEhvUqzJgUABKnxr" Content-Disposition: inline In-Reply-To: <20120603063330.H3418@besplex.bde.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) 
 on skuns.kiev.zoral.com.ua
Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org
Subject: Re: Fwd: [RFC] Kernel shared variables
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 03 Jun 2012 05:19:11 -0000

--eEhvUqzJgUABKnxr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote:
> On Sat, 2 Jun 2012, Konstantin Belousov wrote:
>
> >On Sat, Jun 02, 2012 at 06:00:06PM +0100, Attilio Rao wrote:
> >>...
> >>2012/6/2 Konstantin Belousov :
> >>>On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
> >[Tried to trim the text]
>
> [Trimmed more]
>
> >>>Right, exactly, and this is why I object to the "offsets" approach.
> >>>It basically moves us to the old times of the "jump tables" shared
> >>>libraries, that fortunately was never a case for FreeBSD even when
> >>>a.out was used.
> >>
> >>I'm objecting to this either.
> >My english is not good enough to understand this. Do you agree or disagree
> >with my statement that 'indexes' make it very hard to maintain ABI ?
>
> Syscall numbers are basically indexes, and work OK (because there aren't
> many of them even after ~30-35 years of accumulating them).
>
> >...
> >>The gettimeofday() implementation is a different story than what is asked
> >>here.
> >
> >But the goal is to have fast clocks, right ? What else is planned ?
> >
> >In fact, I think that if the whole goal is only fast clocks, then we
> >do not need any additional system mechanisms, since we can easily export
> >coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
> >which is ugly but bearable.
>
> How do you get the timehands offsets?
> These only need to be updated
> every second or so, or when used, but how can the application know
> when they need to be updated if this is not done automatically in the
> kernel by writing to a shared page?  I can only think of the
> application arranging an alarm signal every second or so and updating
> then.  No good for libraries.

What are timehands offsets? Do you mean things like leap seconds?

This is indeed problematic for auxv. For auxv it could be solved by
providing an offset for the next recheck using syscalls, and making the
libc code respect this offset. But I do think that a vdso in a shared
page is the right solution, not auxv.

>
> rdtsc is also very unportable, even on CPUs that have it.  But all other
> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
> and as accurate as it is now.

Non-rdtsc hardware probably cannot be used at all, due to the need to
provide usermode access to device registers. The mere presence of rdtsc
does not mean that usermode can indeed use it; that should be decided by
the kernel based on the current in-kernel time source. If rdtsc is not
usable, the corresponding data should not be exported, or the
implementation should go directly into a syscall or whatever.

In fact, I would be very grateful if an expert in time-keeping provided
a concise description of the algorithm for translating rdtsc output into
a struct timeval, also enumerating the required parameters.
--eEhvUqzJgUABKnxr Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/K88gACgkQC3+MBN1Mb4glQwCg1YIEeb2XDWk6r2fPtZ1/5rB0 yfYAoIXaW0zTrBFZOBQHEVFDhV1t/pNY =N/wE -----END PGP SIGNATURE----- --eEhvUqzJgUABKnxr-- From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 07:19:07 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C561F106564A; Sun, 3 Jun 2012 07:19:07 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id A3D148FC12; Sun, 3 Jun 2012 07:19:06 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id KAA28084; Sun, 03 Jun 2012 10:19:04 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1Sb55c-000M72-5L; Sun, 03 Jun 2012 10:19:04 +0300 Message-ID: <4FCB0FE5.4050607@FreeBSD.org> Date: Sun, 03 Jun 2012 10:19:01 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Attilio Rao , Mitsuru IWASAKI References: <20120603.002554.119853142.iwasaki@jp.FreeBSD.org> In-Reply-To: X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-acpi@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: cpu stopping [Was: preparation for x86/acpica/acpi_wakeup.c] X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Jun 2012 07:19:07 -0000 on 03/06/2012 00:39 Attilio Rao said the following: > 
> The first thing to consider is that right now we only have 2 states
> for CPUs: started and stopped. These states are controlled by
> started_cpus and stopped_cpus masks respectively. It seems you really
> want to add an intermediate level among the 2 where you have: started
> -> suspended -> started -> suspended ... -> stopped and you need to
> expand the mechanism for dealing with started and stopped cpus to do
> that. I'm pretty sure this will be very helpful also for other
> architectures that want to do the same.

As the first thing I must admit that I haven't looked at the patch :-)

But really I don't see why we need to differentiate between the stopped and
suspended states, as both of them ultimately mean exactly the same thing - CPUs
are spinning on some condition (and they are in a well-defined place and state).

My view of how this should work is:
- there can be only one master CPU that controls all other (slave) CPUs
- the master sets entry and exit hooks
- the master signals slaves to enter the stop state
- the slaves execute the enter hook and start spinning on the release condition
- the master does whatever it wants to do in this special system state
- the master signals the slaves to resume
- the slaves exit the spin loop and execute the exit hook

We have almost all of this in place. Only now we have different IPIs and
different IPI handlers to do the job (cpustop_handler and cpususpend_handler).
I think that the hooks model should be more universal.

In my opinion, what really would deserve a completely independent path is the
hard-stop case, as this can be invoked nested within the other cases, e.g. in
exotic situations like a breakpoint or a trap or a panic in the suspend or the
normal stop code paths.
-- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 09:54:10 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6A6AC106564A; Sun, 3 Jun 2012 09:54:10 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id 7908D8FC08; Sun, 3 Jun 2012 09:54:09 +0000 (UTC) Received: by laai10 with SMTP id i10so3072293laa.13 for ; Sun, 03 Jun 2012 02:54:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=Pr4Hnj2HezsD1rehQuPsqCPcosSqnis1+4C86gL9b50=; b=WvH6NwwC/Ix+vXjpZx/aZxicqBDNEqCfqd2xK20qQaEu33kbIpese2sz68YUzszipm 3iC62advSifR+flumJlcijLxoann4oMUSJsWiS4aBf1ThZqh66wqr8Z1ckOfknXmrR0R 27lE/v8a/9PikykdUrdKwkIumen/4o6Al3ppoTRGfJ3rTh5QFT+LgZUsJz5153RxV5sf yOE9HT4zKJNDeaekISQHfWln1LkxLOieH/iXjzjMcnlV5c/wufDh8xutg6NdjNFcEl/G wdiwwi9aeF6F2kPIE7h5r8gLTYxgm1nj+EJS1DY4atHi1ME3wsoP/kYCKZjtY6EqlCHl fvIA== MIME-Version: 1.0 Received: by 10.152.103.11 with SMTP id fs11mr8689233lab.23.1338717248070; Sun, 03 Jun 2012 02:54:08 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Sun, 3 Jun 2012 02:54:07 -0700 (PDT) In-Reply-To: <4FCB0FE5.4050607@FreeBSD.org> References: <20120603.002554.119853142.iwasaki@jp.FreeBSD.org> <4FCB0FE5.4050607@FreeBSD.org> Date: Sun, 3 Jun 2012 10:54:07 +0100 X-Google-Sender-Auth: qSLZcHBUQgn9exkzIhnujbmC38s Message-ID: From: Attilio Rao To: Andriy Gapon Content-Type: text/plain; charset=UTF-8 Cc: freebsd-acpi@freebsd.org, Mitsuru IWASAKI , freebsd-arch@freebsd.org Subject: Re: cpu stopping [Was: preparation for x86/acpica/acpi_wakeup.c] X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related 
 to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 03 Jun 2012 09:54:10 -0000

2012/6/3 Andriy Gapon :
> on 03/06/2012 00:39 Attilio Rao said the following:
>> The first thing to consider is that right now we only have 2 states
>> for CPUs: started and stopped. These states are controlled by
>> started_cpus and stopped_cpus masks respectively. It seems you really
>> want to add an intermediate level among the 2 where you have: started
>> -> suspended -> started -> suspended ... -> stopped and you need to
>> expand the mechanism for dealing with started and stopped cpus to do
>> that. I'm pretty sure this will be very helpful also for other
>> architectures that want to do the same.
>
> As the first thing I must admit that I haven't looked at the patch :-)
>
>
> But really I don't see why we need to differentiate between stopped and
> suspended state as both of them ultimately mean exactly the same thing - CPUs
> are spinning on some condition (and they are in a well-defined place and state).

This is debatable and I'm not sure I agree.
At some point we may want to implement on-the-fly suspension for
CPUs, which is a different event than "stopping" (where stopping would
be "permanent stopping" and suspending would be "suspension that can be
recovered from").

The important thing here is that we need to expand our model in
a way that makes it simple to add more CPU states than just
started/stopped. Right now we don't have any architecture for this in
place.
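
To make the "more states than started/stopped" idea concrete, here is a minimal sketch of a per-CPU state with a transition table. Everything in it is hypothetical: the `cpu_state` enum, the `CPUST_*` names and `cpu_state_change()` are illustrative and not existing kernel interfaces, and the choice of which transitions are legal (suspended can only go back to started, stopped is terminal) is an assumption read from the "permanent stopping" versus "recoverable suspension" wording above.

```c
#include <stdbool.h>

/* Hypothetical per-CPU states; names are illustrative only. */
enum cpu_state {
	CPUST_STARTED,
	CPUST_SUSPENDED,	/* recoverable: may go back to started */
	CPUST_STOPPED,		/* assumed permanent in this sketch */
	CPUST_COUNT
};

/* allowed[from][to] is true for the transitions this sketch permits. */
static const bool allowed[CPUST_COUNT][CPUST_COUNT] = {
	[CPUST_STARTED]   = { [CPUST_SUSPENDED] = true, [CPUST_STOPPED] = true },
	[CPUST_SUSPENDED] = { [CPUST_STARTED] = true },
	/* CPUST_STOPPED: no outgoing transitions */
};

/* Move *cur to next if the table allows it; report whether it happened. */
bool
cpu_state_change(enum cpu_state *cur, enum cpu_state next)
{
	if (!allowed[*cur][next])
		return (false);
	*cur = next;
	return (true);
}
```

Adding a further state under this scheme would mean one new enumerator and one new row in the table, which is the kind of extensibility being asked for.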
> My view of how this should work is:
> - there can be only one master CPU that controls all other (slave) CPUs
> - the master sets entry and exit hooks
> - the master signals slaves to enter the stop state
> - the slaves execute the enter hook and start spinning on the release condition
> - the master does whatever it wants to do in this special system state
> - the master signals the slaves to resume
> - the slaves exit the spin loop and execute the exit hook
>
> We have almost all of this in place. Only now we have different IPIs and
> different IPI handlers to do the job (cpustop_handler and cpususpend_handler).
> I think that the hooks model should be more universal.

By hook, do you mean something like a rendezvous handler? I'm not sure I
understand otherwise.

> In my opinion, what really would deserve a completely independent path is the
> hard-stop case. As this can be invoked nested to the other cases. E.g. exotic
> situations like a breakpoint or a trap or a panic in the suspend or the normal
> stop code paths.

What I'm really interested in is expanding our model in a way that it can
handle multiple CPU states. Then it is just a matter of adding the
right states and it is all trivial work.

However, as already mentioned, I'm not sure I would equate
suspended with stopped.

Attilio

-- 
Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 10:49:46 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4E753106566B; Sun, 3 Jun 2012 10:49:46 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au [211.29.132.182]) by mx1.freebsd.org (Postfix) with ESMTP id 295408FC1E; Sun, 3 Jun 2012 10:49:43 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q53AnRFV000363 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 3 Jun 2012 20:49:29 +1000 Date: Sun, 3 Jun 2012 20:49:27 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> Message-ID: <20120603184315.T856@besplex.bde.org> References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua> <20120602171632.GC2358@deviant.kiev.zoral.com.ua> <20120603063330.H3418@besplex.bde.org> <20120603051904.GG2358@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Jun 2012 10:49:46 -0000 On Sun, 3 Jun 2012, Konstantin Belousov wrote: > On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote: >> On Sat, 2 Jun 2012, Konstantin Belousov wrote: >>> ... 
>>> In fact, I think that if the whole goal is only fast clocks, then we
>>> do not need any additional system mechanisms, since we can easily export
>>> coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
>>> which is ugly but bearable.
>>
>> How do you get the timehands offsets?  These only need to be updated
>> every second or so, or when used, but how can the application know
>> when they need to be updated if this is not done automatically in the
>> kernel by writing to a shared page?  I can only think of the
>> application arranging an alarm signal every second or so and updating
>> then.  No good for libraries.
> What is timehands offsets ? Do you mean things like leap seconds ?

Yes.  binuptime() is:

% void
% binuptime(struct bintime *bt)
% {
% 	struct timehands *th;
% 	u_int gen;
%
% 	do {
% 		th = timehands;
% 		gen = th->th_generation;
% 		*bt = th->th_offset;
% 		bintime_addx(bt, th->th_scale * tc_delta(th));
% 	} while (gen == 0 || gen != th->th_generation);
% }

Without the kernel providing th->th_offset, you have to do lots of ntp
handling for yourself (compatibly with the kernel) just to get an
accuracy of 1 second.  Leap seconds don't affect CLOCK_MONOTONIC, but
they do affect CLOCK_REALTIME, which is the clock id used by
gettimeofday().  For the former, you only have to advance the offset
for yourself occasionally (compatibly with the kernel) and manage
(compatibly with the kernel, especially in the long term) ntp slewing
and other syscall/sysctl kernel activity that micro-adjusts
th->th_scale.

> This is indeed problematic for auxv. For auxv it could be solved by
> providing offset for next recheck using syscalls, and making libc code to
> respect this offset. But, I do think that vdso in shared page
> is the right solution, not auxv.

timehands in a shared page is close to working.  th_generation protects
things in the same way as in the kernel, modulo assumptions that writes
are ordered.
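
As an illustration of the generation check in binuptime() above, here is a minimal userland sketch of the same retry loop, with the write ordering made explicit through C11 atomics rather than assumed. The `struct shared_th` layout and both function names are hypothetical; this is not the kernel's struct timehands, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared-page snapshot; not the kernel's struct timehands. */
struct shared_th {
	_Atomic uint32_t gen;		/* 0 means "update in progress" */
	_Atomic uint64_t offset_sec;
	_Atomic uint64_t offset_frac;
};

/* Writer (single writer assumed): invalidate, update, then republish. */
void
shared_th_publish(struct shared_th *th, uint64_t sec, uint64_t frac)
{
	uint32_t next;

	next = atomic_load_explicit(&th->gen, memory_order_relaxed) + 1;
	if (next == 0)			/* never publish generation 0 */
		next = 1;
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset_sec, sec, memory_order_relaxed);
	atomic_store_explicit(&th->offset_frac, frac, memory_order_relaxed);
	atomic_store_explicit(&th->gen, next, memory_order_release);
}

/*
 * Reader: retry until the generation is non-zero and unchanged across
 * the data reads.  Spins forever if nothing was ever published.
 */
void
shared_th_read(struct shared_th *th, uint64_t *sec, uint64_t *frac)
{
	uint32_t gen;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		*sec = atomic_load_explicit(&th->offset_sec,
		    memory_order_relaxed);
		*frac = atomic_load_explicit(&th->offset_frac,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

The reader never writes to the shared structure, which is what would let the page be mapped read-only into applications.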
>> rdtsc is also very unportable, even on CPUs that have it.  But all other
>> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
>> and as accurate as it is now.
> !rdtsc hardware is probably cannot be used at all due to need to provide
> usermode access to device registers. The mere presence of rdtsc does not
> means that usermode indeed can use it, it should be decided by kernel
> based on the current in-kernel time source. If rdtsc is not usable, the
> corresponding data should not be exported, or implementation should go
> directly into syscall or whatever.

But then applications would:
- use gettimeofday() more than they should ("it works on Linux"), even more
  than now, since then "it works on FreeBSD-x86" too
- just be slow when gettimeofday() is slow
- kludge around gettimeofday() being slow like they do now
- kludge around gettimeofday() being slow not like they do now (use more
  complications to probe whether it is slow).

I found some RedHat documentation for gettimeofday() in VDSO.  It seems
to leave it to the sysadmin to "tune" gettimeofday() using a boot
parameter to configure gettimeofday() being accurate/slow, less-accurate/
less-slow, or inaccurate/fast.  A per-process parameter would be more
correct and harder to use (add mounds of autoconfig and runtime code in
every program[mer] that cares to detect and use it).

> In fact, I would be very grateful if an expert in time-keeping provided
> concise description of the algorithm for translating rdtsc output into
> struct timeval, also enumerating required parameters.

See above.  You just scale tc_delta(th) == (uint32_t)(rdtsc() - rdtsc_offset)
when th is for TSC, using a carefully managed fixed point scale factor.
The delta is reduced to 32 bits so that the scaling can be efficient.
The result is a bintime fraction which is added to a bintime offset.
Both offsets are even more carefully managed, and everything is protected
by th_generation, and for optimality there are multiple timehands so
that th_generation very rarely changes underneath you.  The resulting
bintime is then converted to a timeval or timespec as required.  This
gives uptimes.  Another offset is added for real times.  Times in
seconds are handled more directly; it is assumed that time_t is atomic
so that th_generation is not needed for protecting them.

The TSC frequency is limited to about 4 GHz, so the above tc_delta()
works for about 4 seconds after rdtsc_offset is updated.  But the
bintime fraction only works for 1 second.  If either of these wraps,
then the result is still later than the update time; however, it may
be earlier than a previous result.  So the update must occur at least
once per second for the TSC.  Otherwise, negative time differences
occur (the final result is in advance of th_offset since the bintime
fraction is >= 0, but will be before a previous final result if the
bintime fraction wraps).  Negative time differences are worse than
lost "ticks" that cause all results to be in the past.  The updates are
broken by at least stopping in ddb and perhaps by suspend/resume.  The
correct fix is probably to update (or just zap) the timecounter as the
first step of resuming from ddb or sleep (this must be done before any
other timecounter call).  Note that times going backwards cannot be
detected in binuptime(), etc., since to detect it you would have to
write the previous time, but that would require pessimal locking that
is intentionally left out.

Timecounter internals like th_offsets are currently private in kern_tc.c.
I don't like exposing them for this, or cloning them for FFCLOCK.
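
The arithmetic described above can be sketched as follows. This is only an illustration under stated assumptions: `struct bt` is a stand-in modeled on struct bintime, the function names are hypothetical, the counter values are passed in as plain integers instead of being read with rdtsc, and the generation protection and ntp adjustment of the scale are left out.

```c
#include <stdint.h>

/* Stand-in modeled on struct bintime: seconds plus a 64-bit fraction. */
struct bt {
	int64_t  sec;
	uint64_t frac;		/* units of 2^-64 seconds */
};

/* Approximately 2^64 / freq_hz, computed without overflowing 64 bits. */
uint64_t
tsc_scale(uint64_t freq_hz)
{
	return (((UINT64_C(1) << 63) / freq_hz) * 2);
}

/*
 * offset is the time at the last update and tsc_then the counter value
 * read at that update.  The delta is truncated to 32 bits, then scaled
 * to a fraction; the product wraps if the delta covers a second or more.
 */
struct bt
tsc_to_bt(struct bt offset, uint64_t scale, uint64_t tsc_now,
    uint64_t tsc_then)
{
	uint32_t delta = (uint32_t)(tsc_now - tsc_then);
	uint64_t add = scale * delta;
	struct bt r = offset;

	r.frac += add;
	if (r.frac < add)	/* carry out of the fraction */
		r.sec++;
	return (r);
}

/* Convert to seconds and microseconds using the top 32 fraction bits. */
void
bt_to_usec(struct bt b, int64_t *sec, uint32_t *usec)
{
	*sec = b.sec;
	*usec = (uint32_t)((UINT64_C(1000000) * (uint32_t)(b.frac >> 32)) >> 32);
}
```

With an assumed counter frequency of 2^30 Hz, a delta of 2^29 ticks adds half a second, while a delta of 2^30 + 2^28 ticks (1.25 seconds) adds only a quarter of a second because the fraction wraps. That lost second is why the update has to happen at least once per second.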
Bruce From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 14:42:47 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9135A106566B; Sun, 3 Jun 2012 14:42:47 +0000 (UTC) (envelope-from iwasaki@jp.FreeBSD.org) Received: from locore.org (ns01.locore.org [218.45.21.227]) by mx1.freebsd.org (Postfix) with ESMTP id 236248FC0C; Sun, 3 Jun 2012 14:42:47 +0000 (UTC) Received: from localhost (celeron.v4.locore.org [192.168.0.10]) by locore.org (8.14.5/8.14.5/iwasaki) with ESMTP/inet id q53EgiCq031408; Sun, 3 Jun 2012 23:42:44 +0900 (JST) (envelope-from iwasaki@jp.FreeBSD.org) Date: Sun, 03 Jun 2012 23:42:43 +0900 (JST) Message-Id: <20120603.234243.28389486.iwasaki@jp.FreeBSD.org> To: avg@FreeBSD.org From: Mitsuru IWASAKI In-Reply-To: <4FCB0FE5.4050607@FreeBSD.org> References: <20120603.002554.119853142.iwasaki@jp.FreeBSD.org> <4FCB0FE5.4050607@FreeBSD.org> X-Mailer: Mew version 3.3 on Emacs 20.7 / Mule 4.0 (HANANOEN) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: attilio@FreeBSD.org, freebsd-acpi@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: Re: cpu stopping X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Jun 2012 14:42:47 -0000 Hi, thanks for comments. > As the first thing I must admit that I haven't looked at the patch :-) Never mind :) What I'm trying to do in the patches is just to unify amd64/i386 independent part (acpi_wakeup.c) for the code maintenance, so please let's commit it first, then start re-design the cpususpend_handler(). 
> But really I don't see why we need to differentiate between stopped and
> suspended state as both of them ultimately mean exactly the same thing - CPUs
> are spinning on some condition (and they are in a well-defined place and state).

Yes, the amd64/i386 cpususpend_handler() is actually very similar to
cpustop_handler(); some resume-related procedures are added for suspend.

> My view of how this should work is:
> - there can be only one master CPU that controls all other (slave) CPUs
> - the master sets entry and exit hooks

The entry hook for suspending might be
----
		ctx_fpusave(suspfpusave[cpu]);
		wbinvd();
		CPU_SET_ATOMIC(cpu, &stopped_cpus);
----
and for stopping it is
----
	/* Indicate that we are stopped */
	CPU_SET_ATOMIC(cpu, &stopped_cpus);
----
Correct?  I think the stopping hook can be replaced with the suspending hook.

The exit hook for suspending is
----
		pmap_init_pat();
		load_cr3(susppcbs[cpu]->pcb_cr3);
		initializecpu();
		PCPU_SET(switchtime, 0);
		PCPU_SET(switchticks, ticks);
[snip]
	/* Resume MCA and local APIC */
	mca_resume();
	lapic_setup(0);
----
For stopping it should be
----
	if (cpu == 0 && cpustop_restartfunc != NULL) {
		cpustop_restartfunc();
		cpustop_restartfunc = NULL;
	}
----

> - the master signals slaves to enter the stop state
> - the slaves execute the enter hook and start spinning on the release condition
> - the master does whatever it wants to do in this special system state
> - the master signals the slaves to resume
> - the slaves exit the spin loop and execute the exit hook

I think it would be possible.
However, I personally think that work on x86/x86/mp_machdep.c has higher
priority and would be more effective than merging cpususpend/stop_handler().
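
The entry/exit hook pairs listed above could be expressed as a pair of function pointers run by one common handler, roughly as in the following sketch. It is a userland illustration only: `struct cpu_park_hooks`, `cpu_park()`, the two masks and the demo hooks are hypothetical names, the masks stand in for stopped_cpus and the release condition, and CPU numbers are assumed to be below 32.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical pair of callbacks; not an existing kernel interface. */
struct cpu_park_hooks {
	void (*enter_hook)(int cpu);	/* e.g. save FPU state, wbinvd */
	void (*exit_hook)(int cpu);	/* e.g. reload CR3, set up the LAPIC */
};

atomic_uint parked_mask;	/* stand-in for stopped_cpus */
atomic_uint release_mask;	/* stand-in for the release condition */

/* Body of a single handler that every slave CPU would run. */
void
cpu_park(int cpu, const struct cpu_park_hooks *hooks)
{
	unsigned int bit = 1u << cpu;	/* assumes cpu < 32 */

	if (hooks != NULL && hooks->enter_hook != NULL)
		hooks->enter_hook(cpu);
	atomic_fetch_or(&parked_mask, bit);	/* tell the master we parked */

	while ((atomic_load(&release_mask) & bit) == 0)
		;				/* spin until released */

	atomic_fetch_and(&release_mask, ~bit);
	atomic_fetch_and(&parked_mask, ~bit);
	if (hooks != NULL && hooks->exit_hook != NULL)
		hooks->exit_hook(cpu);
}

/* Demo hooks that only record the order in which they ran. */
int hook_seq;
int enter_order, exit_order;

void
demo_enter(int cpu)
{
	(void)cpu;
	enter_order = ++hook_seq;
}

void
demo_exit(int cpu)
{
	(void)cpu;
	exit_order = ++hook_seq;
}
```

In this shape the stop and suspend cases differ only in which hook pair the master installs before signalling the slaves; whether that is enough for the hard-stop case is a separate question.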
Thanks From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 19:02:02 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BD3B41065672 for ; Sun, 3 Jun 2012 19:02:02 +0000 (UTC) (envelope-from matthewstory@gmail.com) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 7C68E8FC0C for ; Sun, 3 Jun 2012 19:02:02 +0000 (UTC) Received: by obcni5 with SMTP id ni5so7921841obc.13 for ; Sun, 03 Jun 2012 12:02:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=0jdHHvmFDrycVQfGFLm6bz/vexsA4didlqLd1cNGU2Y=; b=VdoWc4erh/u7c7W16aZuAcOfiJCKknyJPW6Udg/oRo5z/MKXL6i4oYAFxGc/iErbVV UhqLOiaEivAwLy3z1Tf6n6Wtp+iwfNJxUbEQ1Ry7TkKiBy702YDxKVjldXc0TR5MMOKg 631JDhuT8D0R6/fpcgFlPYiAhCs8LTHVQULVnaMWBvvbUVWave5pc6MwtzmU6Nc2s/Wg DzHQl5J6PHb/xyEl/ZhVM9DvQpt0opI0Pf63cWOk3qh0P7QT7KFYKAcrrpQ6PDGy69hW gtk9K5dfE3crL8xTQJV3YeW4WgkLS2cVxpQZT4B18n4ShAIuMdrwaa5DH1HYK21fu9w2 iCEQ== MIME-Version: 1.0 Received: by 10.60.3.40 with SMTP id 8mr9506170oez.31.1338750122095; Sun, 03 Jun 2012 12:02:02 -0700 (PDT) Received: by 10.76.116.68 with HTTP; Sun, 3 Jun 2012 12:02:02 -0700 (PDT) Date: Sun, 3 Jun 2012 15:02:02 -0400 Message-ID: From: Matthew Story To: freebsd-arch@freebsd.org Content-Type: multipart/mixed; boundary=e89a8f83a34578732404c1960d10 X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: ASCII Notes from FreeBSD Network Summit at BSDCan X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Jun 2012 19:02:02 -0000 --e89a8f83a34578732404c1960d10 Content-Type: text/plain; charset=ISO-8859-1 gnn asked me to forward these along to arch. 
notes are (as) literal a copy of the whiteboard session as I could work into ASCII -- regards, matt --e89a8f83a34578732404c1960d10 Content-Type: text/plain; charset=US-ASCII; name="network-whiteboard-part1.txt" Content-Disposition: attachment; filename="network-whiteboard-part1.txt" Content-Transfer-Encoding: base64 X-Attachment-Id: f_h30h8hf30 KiBtYnVmCiAgLT4gdmFyaWFibGUKICAtPiBtdGFnCiAgLT4gb2ZmbG9hZGluZwogIC0+IGluZGly ZWN0aW9uCiogbDIvbDMgc3BsaXQKKiBpZm5ldCByZWRlc2lnbgogICAgLT4gcXVldWUKICAgIC0+ IGluZGlyZWN0aW9uCiAgICAtPiBkZWR1cGUgMTBHCiAgICAtPiB2YXJpYWJsZSBzaXplCiogY2hl Y2tzdW0KKiBJT1ggcm9hZG1hcAoqIG5ldG1hcAoqIGxhdGVuY3kvYncgbWVhc3VyZQoqIE5JQy9T dGFjayBsb2FkIGRpc3RyaWJ1dGlvbgoKbWJ1ZiBwcm9ibGVtcyBpbmRpcmVjdGlvbjoKICAgIC0+ IDIgdHlwZXMgb2YgbWJ1ZnMKICAgICAgICAtPiB2ZXJ5IHNtYWxsCiAgICAgICAgLT4gdmVyeSBs YXJnZQogICAgLT4gdG9vIG11Y2ggaW5kaXJlY3Rpb24KICAgICAgICAtPiBKZWZmUiBwYXRjaD8K ICAgICAgICAgICAgLT4gdmFyaWFibGUtc2l6ZSBtYnVmIHBhdGNoCiAgICAgICAgICAgICAgICAt PiBhbnlvbmUgb3duPwogICAgICAgICAgICAgICAgICAgIC0+IG5vCiAgICAgICAgICAgICAgICAt PiB5b3UgZG9uJ3QgaGF2ZSB0byBoYXZlIGluZGlyZWN0aW9uCiAgICAgICAgICAgICAgICAtPiBu byBjbHVzdGVycyByZXF1aXJlZAogICAgICAgICAgICAgICAgICAgIC0+IHN1cHBvcnQgZm9yIGNs dXN0ZXJzIHJlbWFpbnMsCiAgICAgICAgICAgICAgICAgICAgICAgbmVjZXNzYXJ5IGZvciBhcmNo IHcvbyBhY2Nlc3MgdG8gYWxsIG1lbW9yeQogICAgICAgICAgICAgICAgLT4gcGF0Y2ggaXMgc3Bl Y2lmaWMsIGFueSBvdGhlciBjb25jZXJucz8KICAgICAgICAgICAgICAgICAgICAtPiBJTyBWZWN0 b3IgZGVzaWduLCBzY2F0dGVyL2dhdGhlciAoc29tZSBzb3J0IG9rIGlvdmVjKQogICAgICAgICAg ICAgICAgICAgIC0+IGJhdGNoaW5nPwogICAgICAgICAgICAgICAgICAgICAgICAtPiBzYWNyaWZp Y2UgbGVzcyBpbmRpcmVjdGlvbiBpbiBoZWFkZXIsCiAgICAgICAgICAgICAgICAgICAgICAgICAg IGZvciBtb3JlIGluZGlyZWN0aW9uIGluIG1ldGEtZGF0YQogICAgICAgICAgICAgICAgICAgICAg ICAtPiBvciBpcyBpdCBqdXN0IG1vdmluZyB0aGUgaW5kaXJlY3Rpb24/CiAgICAgICAgICAgICAg ICAgICAgLT4gc3RyaXBwaW5nIGhlYWRlcnMKICAgICAgICAgICAgICAgICAgICAtPiBoZWFkZXIg YXQgZW5kPyAKICAgICAgICAgICAgICAgICAgICAtPiBzaXplIGNob2ljZXMKICAgICAgICAgICAg 
ICAgICAgICAtPiBwcm9maWxpbmcKICAgICAgICAgICAgICAgICAgICAtPiBwcml2YXRlIGFsbG9j YXRpb24KICAgICAgICAgICAgICAgIC0+IGFueW9uZSBvd24/CiAgICAgICAgICAgICAgICAgICAg LT4geWVzLCBycnNACgpXaGF0IGRvIHdlIHdhbnQgdG8gc3RvcmUgaW4gdmFyaWFibGUgbWJ1ZnMK ClZMQU4gSUQgKGV0YykKUSBpbiBRIGluIC4uLgpNQUMgQWRkcmVzcwpNUExTCkZsb3cgSUQgKyB0 eXBlCkZJQgo4MDIuMTEgLS0tLS0+IFFvUyAoM2IpLCBBZ2UgKDhiKSwgU2VxICgxOGIpLCB2aWV3 IFRJRCAoNGIpLCBSYXRlIGNvbnRyb2wgKDE2QikKSW50ZXJmYWNlSUQgKyBnZW5lcmF0aW9uCkZp cmV3YWxsIFJ1bGVzIDggLSAxNkIgKGp1bmlwZXIpCihjYW4ndCByZWFkKQpQYWNrZXQgVGltZXN0 YW1wICg2NGIpCkxvY2FsIERhdGEgKENQVSwgZXRjKQpKb3VybmFsIG9mIHVzZSAodHJhY2UpCklQ U2VjIC0+IGRhdGEgJiByZWZlcmVuY2UKSGVhZGVyIHBhcnNlIHN0YXRlCk1BQyBsYWJlbHMKVklN QUdFPyAocG9pbnRlcikKVFNPCkNoZWNrc3VtCkNBUlAsIExBR0cKQUxUUSB0YWcK --e89a8f83a34578732404c1960d10 Content-Type: text/plain; charset=US-ASCII; name="clusters.txt" Content-Disposition: attachment; filename="clusters.txt" Content-Transfer-Encoding: base64 X-Attachment-Id: f_h30h8z9q1 Ky0tLS0tLS0tLS0tLS0tLS0tLS0tLS0rICAgICAgKy0tLS0tLS0tLS0tLS0tLS0tLS0tKwp8IGhl YWRlciAgICAgcG9pbnQgdG8gICstLS0tLS0rPiBjbHVzdGVyICAgICAgICAgICB8CnwgICAgICAg ICAgICAgICAgICBvciAgKy0rICAgIHwgICAgICAgICAgICAgICAgICAgIHwKKy0tLS0tLS0tLS0t LS0tLS0tLS0tLS0rIHwgICAgfCAgICAgICAgICAgICAgICAgICAgfAp8IGRhdGEgICAgICAgICAg ICAgICAgPCstKyAgICB8ICAgICAgICAgICAgICAgICAgICB8CnwgICAgICAgICAgICAgICAgICAg ICAgfCAgICAgIHwgICAgICAgICAgICAgICAgICAgIHwKfCAgICAgICAgICAgICAgICAgICAgICB8 ICAgICAgKy0tLS0tLS0tLS0tLS0tLS0tLS0tKworLS0tLS0tLS0tLS0tLS0tLS0tLS0tLSsKLiAg ICAgICAgIHwgIHwgICAgICAgICAuIAouICAgICAgICAgfCAgfCAgICAgICAgIC4gIHByb3Bvc2Vk IHNvbHV0aW9uIC4uLgouICAgICAgICB2fHZ2fHYgICAgICAgIC4KLiAgICAgICAgIHZ2dnYgICAg ICAgICAuIAouICAgICAgICAgIHZ2ICAgICAgICAgIC4KLiAuIC4gLiAuIC4uLi4gLiAuIC4gLiAu Cg== --e89a8f83a34578732404c1960d10-- From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 19:35:26 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) 
with ESMTP id 53EEA106566B; Sun, 3 Jun 2012 19:35:26 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 3964C8FC1B; Sun, 3 Jun 2012 19:35:25 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id WAA01101; Sun, 03 Jun 2012 22:35:16 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1SbGa4-000Mor-3y; Sun, 03 Jun 2012 22:35:16 +0300 Message-ID: <4FCBBC72.8070209@FreeBSD.org> Date: Sun, 03 Jun 2012 22:35:14 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Attilio Rao References: <20120603.002554.119853142.iwasaki@jp.FreeBSD.org> <4FCB0FE5.4050607@FreeBSD.org> In-Reply-To: X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-acpi@FreeBSD.org, Mitsuru IWASAKI , freebsd-arch@FreeBSD.org Subject: Re: cpu stopping [Was: preparation for x86/acpica/acpi_wakeup.c] X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Jun 2012 19:35:26 -0000 on 03/06/2012 12:54 Attilio Rao said the following: > 2012/6/3 Andriy Gapon : >> on 03/06/2012 00:39 Attilio Rao said the following: >>> The first thing to consider is that right now we only have 2 states >>> for CPUs: started and stopped. These states are controlled by >>> started_cpus and stopped_cpus masks respectively. It seems you really >>> want to add an intermediate level among the 2 where you have: started >>> -> suspended -> started -> suspended ... 
-> stopped and you need to
>>> expand the mechanism for dealing with started and stopped cpus to do
>>> that. I'm pretty sure this will be very helpful also for other
>>> architectures that want to do the same.
>>
>> As the first thing I must admit that I haven't looked at the patch :-)
>>
>>
>> But really I don't see why we need to differentiate between stopped and
>> suspended state as both of them ultimately mean exactly the same thing - CPUs
>> are spinning on some condition (and they are in a well-defined place and state).
>
> This is debatable and I'm not sure I agree.
> At some point we may want to implement CPU on-the-fly suspension for
> CPUs which is a different event than "stopping" (where stopping will
> be "permanent stopping" and suspending will be "possible to recover
> suspension").

Right, but that should operate on the level above the current code.
I.e. first stop all slave CPUs, then set the state of the target CPU (which
includes the global view of that state), then resume all other CPUs.

> The important thing about this is that we need to expand our model in
> a way that it makes simple to add more states to the CPUs than simple
> started/stopped. Right now we don't have any architecture for this in
> place.

I can't disagree with this, but I think that the current IPI-to-stop code is
not the place for that. It's too low level.

>> My view of how this should work is:
>> - there can be only one master CPU that controls all other (slave) CPUs
>> - the master sets entry and exit hooks
>> - the master signals slaves to enter the stop state
>> - the slaves execute the enter hook and start spinning on the release condition
>> - the master does whatever it wants to do in this special system state
>> - the master signals the slaves to resume
>> - the slaves exit the spin loop and execute the exit hook
>>
>> We have almost all of this in place. Only now we have different IPIs and
>> different IPI handlers to do the job (cpustop_handler and cpususpend_handler).
>> I think that the hooks model should be more universal. > > For hook you mean like a rendezvous handler? I'm not sure I understand > otherwise. Maybe, perhaps. I meant just a couple of function pointers. cpustop_restartfunc seems to be a better analogy. >> In my opinion, what really would deserve a completely independent path is the >> hard-stop case. As this can be invoked nested to the other cases. E.g. exotic >> situations like a breakpoint or a trap or a panic in the suspend or the normal >> stop code paths. > > What I'm really interested is expanding our model in a way that it can > handle multiple CPU states. Then it is just a matter of adding the > right states and it is all trivial work. > > And however, as already mentioned, I'm not sure I would assimilate > suspended = stopped. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 19:45:41 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 24093106564A; Sun, 3 Jun 2012 19:45:41 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id DBFA78FC1F; Sun, 3 Jun 2012 19:45:39 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id WAA01145; Sun, 03 Jun 2012 22:45:34 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1SbGk1-000MpY-T6; Sun, 03 Jun 2012 22:45:33 +0300 Message-ID: <4FCBBEDD.5000604@FreeBSD.org> Date: Sun, 03 Jun 2012 22:45:33 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Mitsuru IWASAKI References: <20120603.002554.119853142.iwasaki@jp.FreeBSD.org> <4FCB0FE5.4050607@FreeBSD.org> 
<20120603.234243.28389486.iwasaki@jp.FreeBSD.org>
In-Reply-To: <20120603.234243.28389486.iwasaki@jp.FreeBSD.org>
X-Enigmail-Version: 1.5pre
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: attilio@FreeBSD.org, freebsd-acpi@FreeBSD.org, freebsd-arch@FreeBSD.org
Subject: Re: cpu stopping
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 03 Jun 2012 19:45:41 -0000

on 03/06/2012 17:42 Mitsuru IWASAKI said the following:
> Hi, thanks for comments.
>
>> As the first thing I must admit that I haven't looked at the patch :-)
>
> Never mind :) What I'm trying to do in the patches is just to unify
> amd64/i386 independent part (acpi_wakeup.c) for the code maintenance,
> so please let's commit it first, then start re-design the
> cpususpend_handler().

In no way am I trying to delay your work :)
I just shared my view on the design of the cpu stopping code.

>> But really I don't see why we need to differentiate between stopped and
>> suspended state as both of them ultimately mean exactly the same thing - CPUs
>> are spinning on some condition (and they are in a well-defined place and state).
>
> Yes, amd64/i386 cpususpend_handler() is very similar to cpustop_handler()
> actually, some resume related procedures are added for suspend.
>
>> My view of how this should work is:
>> - there can be only one master CPU that controls all other (slave) CPUs
>> - the master sets entry and exit hooks
>
> Entry hook for suspending might be
> ----
> ctx_fpusave(suspfpusave[cpu]);
> wbinvd();
> CPU_SET_ATOMIC(cpu, &stopped_cpus);
> ----
>
> and for stopping is
> ----
> /* Indicate that we are stopped */
> CPU_SET_ATOMIC(cpu, &stopped_cpus);
> ----
>
> Correct?

Yes.
The only nit is that CPU_SET_ATOMIC(cpu, &stopped_cpus) could be part of the
wait loop prologue.
No need to duplicate it in each hook.

> I think stopping hook can be replaced with suspending hook.

Perhaps... But let's not go into this topic just yet.

> Exit hook for suspending is
> ----
> pmap_init_pat();
> load_cr3(susppcbs[cpu]->pcb_cr3);
> initializecpu();
> PCPU_SET(switchtime, 0);
> PCPU_SET(switchticks, ticks);
> [snip]
> /* Resume MCA and local APIC */
> mca_resume();
> lapic_setup(0);
> ----
>
> For stopping should be
> ----
> if (cpu == 0 && cpustop_restartfunc != NULL) {
>         cpustop_restartfunc();
>         cpustop_restartfunc = NULL;
> }
> ----
>
>> - the master signals slaves to enter the stop state
>> - the slaves execute the enter hook and start spinning on the release condition
>> - the master does whatever it wants to do in this special system state
>> - the master signals the slaves to resume
>> - the slave exit the spin loop and execute the exit hook
>
> I think it would be possible. However I personally think that
> priority of x86/x86/mp_machdep.c is higher and more effective than
> merging cpususpend/stop_handler().

I do not disagree.
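The master/slave protocol discussed in this message, with the entry and exit
hooks as "just a couple of function pointers", could be sketched in userland C
roughly as follows. This is only an illustration under stated assumptions:
stop_hooks, master_set_hooks(), master_release() and slave_stop_handler() are
hypothetical names, two plain bit masks stand in for stopped_cpus and
started_cpus, and the demo hooks only count calls where real hooks would do
the save/flush or resume work quoted in this thread. The "mark as stopped"
step sits in the common prologue rather than in each hook.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the FreeBSD kernel API. */
typedef void (*cpu_hook_t)(int cpu);

struct stop_hooks {
	cpu_hook_t enter;	/* run by a slave before it starts spinning */
	cpu_hook_t leave;	/* run by a slave after it is released */
};

static struct stop_hooks hooks;		/* set by the master before signalling */
static atomic_uint stopped_mask;	/* stands in for stopped_cpus */
static atomic_uint release_mask;	/* stands in for started_cpus */

/* Demo hooks: they only count calls. */
static int enter_calls, leave_calls;
static void demo_enter(int cpu) { (void)cpu; enter_calls++; }
static void demo_leave(int cpu) { (void)cpu; leave_calls++; }

/* Master side: install the hooks for this stop or suspend episode. */
static void
master_set_hooks(cpu_hook_t enter, cpu_hook_t leave)
{
	hooks.enter = enter;
	hooks.leave = leave;
}

/* Master side: allow one slave to leave its spin loop. */
static void
master_release(int cpu)
{
	atomic_fetch_or(&release_mask, 1u << cpu);
}

/* Slave side: one handler shared by the stop and suspend cases. */
static void
slave_stop_handler(int cpu)
{
	unsigned int bit;

	assert(cpu >= 0 && cpu < 32);
	bit = 1u << cpu;
	if (hooks.enter != NULL)
		hooks.enter(cpu);
	/* Common prologue: mark this CPU stopped once, not in every hook. */
	atomic_fetch_or(&stopped_mask, bit);
	while ((atomic_load(&release_mask) & bit) == 0)
		;	/* spin until the master releases this CPU */
	/* Common epilogue: clear both bits, then run the exit hook. */
	atomic_fetch_and(&release_mask, ~bit);
	atomic_fetch_and(&stopped_mask, ~bit);
	if (hooks.leave != NULL)
		hooks.leave(cpu);
}
```

In a single-threaded trial, calling master_release(1) before
slave_stop_handler(1) makes the spin loop fall through at once, so both hooks
run exactly once and both masks end up clear.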
-- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Sun Jun 3 20:02:10 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9FBBC1065745; Sun, 3 Jun 2012 20:02:10 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id A3F3A8FC1A; Sun, 3 Jun 2012 20:02:09 +0000 (UTC) Received: by laai10 with SMTP id i10so3262639laa.13 for ; Sun, 03 Jun 2012 13:02:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=dUPpHTdTJTdWNZBt33McuZiSIzerq38ci95MZLizuIg=; b=ysizMijAvLhD1tY3B6qSu1V1CfNTU92DMr4EGBk+nYjYr+s9XZKnUc3M17gT2pzYJp O9KLbi3YI5KIdPtoD7CEgU9c2/kHdPLM8qpSL/UWydFcoePPSjW14l87IOZUO1Xypl6V te7tQYsfeztiGf8RKcONNtnAYOUw0o0A2rt/NuwapqjqQuJmXDOtCx7Qon4NUPcsyE5t mWRiiO0b6sCMu6jHeJuh56ipp6ki4HPslwVWjGUxtYcGGxBCsBD3n/wUl4/Xm22kWWfu xUDIR2KWaNdHleZLuC2aGDAC3D5Dkxtoqlf9P8ffzGdi//Skmz5ROJ1LnDj5BL/Ej7fV rewA== MIME-Version: 1.0 Received: by 10.152.103.11 with SMTP id fs11mr9876014lab.23.1338753721887; Sun, 03 Jun 2012 13:02:01 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Sun, 3 Jun 2012 13:02:01 -0700 (PDT) In-Reply-To: <4FCBBC72.8070209@FreeBSD.org> References: <20120603.002554.119853142.iwasaki@jp.FreeBSD.org> <4FCB0FE5.4050607@FreeBSD.org> <4FCBBC72.8070209@FreeBSD.org> Date: Sun, 3 Jun 2012 21:02:01 +0100 X-Google-Sender-Auth: xSOxkHrJHCFZkowbHu9KLCDOs4E Message-ID: From: Attilio Rao To: Andriy Gapon Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: freebsd-acpi@freebsd.org, Mitsuru IWASAKI , freebsd-arch@freebsd.org Subject: Re: cpu stopping [Was: preparation for x86/acpica/acpi_wakeup.c] 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 03 Jun 2012 20:02:10 -0000

2012/6/3 Andriy Gapon :
> on 03/06/2012 12:54 Attilio Rao said the following:
>> 2012/6/3 Andriy Gapon :
>>> on 03/06/2012 00:39 Attilio Rao said the following:
>>>> The first thing to consider is that right now we only have 2 states
>>>> for CPUs: started and stopped. These states are controlled by
>>>> started_cpus and stopped_cpus masks respectively. It seems you really
>>>> want to add an intermediate level among the 2 where you have: started
>>>> -> suspended -> started -> suspended ... -> stopped and you need to
>>>> expand the mechanism for dealing with started and stopped cpus to do
>>>> that. I'm pretty sure this will be very helpful also for other
>>>> architectures that want to do the same.
>>>
>>> As the first thing I must admit that I haven't looked at the patch :-)
>>>
>>>
>>> But really I don't see why we need to differentiate between stopped and
>>> suspended state as both of them ultimately mean exactly the same thing - CPUs
>>> are spinning on some condition (and they are in a well-defined place and state).
>>
>> This is debatable and I'm not sure I agree.
>> At some point we may want to implement CPU on-the-fly suspension for
>> CPUs which is a different event than "stopping" (where stopping will
>> be "permanent stopping" and suspending will be "possible to recover
>> suspension").
>
> Right, but that should operate on the level above the current code.
> I.e. first stop all slave CPUs, then set state of a target CPU (which includes
> global view of that state), then resume all other CPUs.
>
>> The important thing about this is that we need to expand our model in
>> a way that it makes simple to add more states to the CPUs than simple
>> started/stopped.
Right now we don't have any architecture for this in
>> place.
>
> I can't disagree with this, but I think that the current IPI-to-stop code is not
> a place for that. It's too low level.

Yeah, I was referring in particular to the handling of the masks and a few
other things (stoppcbs, which could be rebased as suspendpcbs for
that, etc.).

The point I'm really trying to make is: our model is very heavily
biased towards the on/off case (started/stopped), and we need to abstract
this and have a framework for adding several CPU states.
Once you have an abstracted model, you can easily add several states.
This is not simple work, and it is even less simple for
synchronization, which right now is very much simplified/unhandled.

I would be very happy if you or Mitsuru plan to work on that.

Attilio


--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 17:50:11 2012
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 00AC7106564A; Mon, 4 Jun 2012 17:50:10 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id BD3B18FC12; Mon, 4 Jun 2012 17:50:10 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 287E1B990; Mon, 4 Jun 2012 13:50:10 -0400 (EDT)
From: John Baldwin
To: freebsd-arch@freebsd.org
Date: Mon, 4 Jun 2012 10:53:51 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; )
References: <20120602171632.GC2358@deviant.kiev.zoral.com.ua>
In-Reply-To:
MIME-Version: 1.0
Content-Type: Text/Plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <201206041053.51802.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
(bigwig.baldwin.cx); Mon, 04 Jun 2012 13:50:10 -0400 (EDT) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 17:50:11 -0000 On Saturday, June 02, 2012 1:27:58 pm Attilio Rao wrote: > >> The gettimeofday() implementation is a different story than what is asked here. > > > > But the goal is to have fast clocks, right ? What else is planned ? > > > > In fact, I think that if the whole goal is only fast clocks, then we > > do not need any additional system mechanisms, since we can easily export > > coefficients for rdtsc formula already. E.g. we can put it into elf auxv, > > which is ugly but bearable. > > Not sure if there is anything else besides gettimeofday() that we want > right now, in particular on global basis. > I just mean to say that I don't think Giovanni put a lot of effort in > correctness/robustness of gettimeofday userland implementation, so we > should not judge that part of the patch too tightly. I think this is an important question actually. Is there anything that really needs to be here besides gettimeofday()? I mean, is there any real-world application that needs to call getpid() or getppid() a bunch of times? Things that are static like that the application can easily cache (and should if it actually needs it). gettimeofday() is different because it is dynamic. > >> > Interesting question is how much shared the shared page needs be. > >> > Obvious needs are shared between all same-ABI processes, but I can also > >> > easily see a need for the per-process private information be present in > >> > the 'private-shared' page. For silly but typical example, useful for > >> > moronix-style benchmarks, see getpid(). 
> >> > >> Really the performance benefits of having fast getpid() is marginal if > >> compared to heavilly used things like gettimeofday(). I cannot think > >> of a per-process page implementing a fast syscall that can bring many > >> perfomance advantages. > > > > This is completely true, but there may be other process-private data that > > could benefit from the low access cost. I just do not know right now. > > I don't know either, thus I don't think there is a big urgence for > per-process shared pages at all. I can't think of anything useful. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 17:50:12 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 291EA1065675; Mon, 4 Jun 2012 17:50:12 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 8B5178FC1D; Mon, 4 Jun 2012 17:50:11 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id E8B70B99B; Mon, 4 Jun 2012 13:50:10 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Mon, 4 Jun 2012 11:01:57 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> In-Reply-To: <20120603184315.T856@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201206041101.57486.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Mon, 04 Jun 2012 13:50:11 -0400 (EDT) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 17:50:12 -0000 On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote: > On Sun, 3 Jun 2012, Konstantin Belousov wrote: > > > On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote: > >> On Sat, 2 Jun 2012, Konstantin Belousov wrote: > >>> ... > >>> In fact, I think that if the whole goal is only fast clocks, then we > >>> do not need any additional system mechanisms, since we can easily export > >>> coefficients for rdtsc formula already. E.g. we can put it into elf auxv, > >>> which is ugly but bearable. > >> > >> How do you get the timehands offsets? These only need to be updated > >> every second or so, or when used, but how can the application know > >> when they need to be updated if this is not done automatically in the > >> kernel by writing to a shared page? I can only think of the > >> application arranging an alarm signal every second or so and updating > >> then. No good for libraries. > > What is timehands offsets ? Do you mean things like leap seconds ? > > Yes. binuptime() is: > > % void > % binuptime(struct bintime *bt) > % { > % struct timehands *th; > % u_int gen; > % > % do { > % th = timehands; > % gen = th->th_generation; > % *bt = th->th_offset; > % bintime_addx(bt, th->th_scale * tc_delta(th)); > % } while (gen == 0 || gen != th->th_generation); > % } > > Without the kernel providing th->th_offset, you have to do lots of ntp > handling for yourself (compatibly with the kernel) just to get an > accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but > they do affect CLOCK_REALTIME which is the clock id used by > gettimeofday(). 
For the former, you only have to advance the offset
> for yourself occasionally (compatibly with the kernel) and manage
> (compatibly with the kernel, especially in the long term) ntp slewing
> and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.

I think duplicating this logic in userland would just be wasteful. I have
a private fast gettimeofday() at my current job and it works by exporting
the current timehands structure (well, the equivalent) to userland. The
userland bits then fetch a copy of the details and do the same as bintime().
(I move the math (bintime_addx() and the multiply) out of the loop, however.)

> > This is indeed problematic for auxv. For auxv it could be solved by
> > providing offset for next recheck using syscalls, and making libc code to
> > respect this offset. But, I do think that vdso in shared page
> > is the right solution, not auxv.
>
> timehands in a shared pages is close to working. th_generation protects
> things in the same way as in the kernel, modulo assumptions that writes
> are ordered.

It would work fine. And in fact, having multiple timehands is actually a
bug, not a feature. It lets you compute bogus timestamps if you get preempted
at the wrong time and end up with time jumping around. At Yahoo! we reduced
the number of timehands structures down to 2 or some such, and I'm now of
the opinion we should just have one and dispense with the entire array.

For my userland case I only export a single timehands copy.

> >> rdtsc is also very unportable, even on CPUs that have it. But all other
> >> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
> >> and as accurate as it is now.

For all the hardware where people run mysql and similar software that calls
gettimeofday() a lot, rdtsc() works just fine.

> > !rdtsc hardware probably cannot be used at all due to the need to provide
> > usermode access to device registers.
The mere presence of rdtsc does not > > means that usermode indeed can use it, it should be decided by kernel > > based on the current in-kernel time source. If rdtsc is not usable, the > > corresponding data should not be exported, or implementation should go > > directly into syscall or whatever. Yes, the patches I have only work if the kernel uses the TSC as its main timecounter as well. > But then applications would: > - use gettimeofday() more than they should ("it works on Linux"), even > more than now since when "it works on FreeBSD-x86" too > - just be slow when gettimeofday() is slow > - kludge around gettimeofday() being slow like they do now > - kludge around gettimeofday() being slow not like they do now (use more > complications to probe it being slow). Some applications really need fine-grained timing with as little overhead as possible. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 18:19:30 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3C44010657DB; Mon, 4 Jun 2012 18:19:30 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 817478FC0A; Mon, 4 Jun 2012 18:19:29 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q54IJIpJ045754; Mon, 4 Jun 2012 21:19:18 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q54IJH3K092775; Mon, 4 Jun 2012 21:19:17 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q54IJHlL092774; Mon, 4 Jun 2012 21:19:17 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: 
deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Mon, 4 Jun 2012 21:19:17 +0300 From: Konstantin Belousov To: John Baldwin Message-ID: <20120604181917.GD85127@deviant.kiev.zoral.com.ua> References: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="mSxgbZZZvrAyzONB" Content-Disposition: inline In-Reply-To: <201206041101.57486.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 18:19:30 -0000 --mSxgbZZZvrAyzONB Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote: > On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote: > > On Sun, 3 Jun 2012, Konstantin Belousov wrote: > >=20 > > > On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote: > > >> On Sat, 2 Jun 2012, Konstantin Belousov wrote: > > >>> ... > > >>> In fact, I think that if the whole goal is only fast clocks, then we > > >>> do not need any additional system mechanisms, since we can easily e= xport > > >>> coefficients for rdtsc formula already. E.g. we can put it into elf= auxv, > > >>> which is ugly but bearable. 
> > >>
> > >> How do you get the timehands offsets? These only need to be updated
> > >> every second or so, or when used, but how can the application know
> > >> when they need to be updated if this is not done automatically in the
> > >> kernel by writing to a shared page? I can only think of the
> > >> application arranging an alarm signal every second or so and updating
> > >> then. No good for libraries.
> > > What is timehands offsets ? Do you mean things like leap seconds ?
> >
> > Yes. binuptime() is:
> >
> > % void
> > % binuptime(struct bintime *bt)
> > % {
> > % 	struct timehands *th;
> > % 	u_int gen;
> > %
> > % 	do {
> > % 		th = timehands;
> > % 		gen = th->th_generation;
> > % 		*bt = th->th_offset;
> > % 		bintime_addx(bt, th->th_scale * tc_delta(th));
> > % 	} while (gen == 0 || gen != th->th_generation);
> > % }
> >
> > Without the kernel providing th->th_offset, you have to do lots of ntp
> > handling for yourself (compatibly with the kernel) just to get an
> > accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but
> > they do affect CLOCK_REALTIME which is the clock id used by
> > gettimeofday(). For the former, you only have to advance the offset
> > for yourself occasionally (compatibly with the kernel) and manage
> > (compatibly with the kernel, especially in the long term) ntp slewing
> > and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.
>
> I think duplicating this logic in userland would just be wasteful. I have
> a private fast gettimeofday() at my current job and it works by exporting
> the current timehands structure (well, the equivalent) to userland. The
> userland bits then fetch a copy of the details and do the same as bintime().
> (I move the math (bintime_addx() and the multiply)) out of the loop however.
Yesterday I started an implementation which uses a shared page to export
some variant of timehands, and uses auxv to provide libc with a pointer
to timehands when rdtsc is reasonable. I have almost finished both the
32-bit and 64-bit userspace, but there is kernel-side work left.

Is your implementation ready or close to being ready for commit?
In other words, should I drop the effort, or continue?

>
> > > This is indeed problematic for auxv. For auxv it could be solved by
> > > providing offset for next recheck using syscalls, and making libc code to
> > > respect this offset. But, I do think that vdso in shared page
> > > is the right solution, not auxv.
> >
> > timehands in a shared pages is close to working. th_generation protects
> > things in the same way as in the kernel, modulo assumptions that writes
> > are ordered.
>
> It would work fine. And in fact, having multiple timehands is actually a
> bug, not a feature. It lets you compute bogus timestamps if you get preempted
> at the wrong time and end up with time jumping around. At Yahoo! we reduced
> the number of timehands structures down to 2 or some such, and I'm now of
> the opinion we should just have one and dispense with the entire array.
>
> For my userland case I only export a single timehands copy.

Well, I have to use two copies due to time_t ABI differences, one for
32-bit and one for 64-bit.

>
> > >> rdtsc is also very unportable, even on CPUs that have it. But all other
> > >> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
> > >> and as accurate as it is now.
>
> For all the hardware where people run mysql and similar software that calls
> gettimeofday() a lot, rdtsc() works just fine.

I also try to mimic the kernel code as closely as possible, so there are two
possible tsc counters; selection is managed by the kernel, but the code lives
in libc or possibly a vdso.
But I do not see immediate use for vdso just for gettimeofday(2) and
clock_gettime(2), although having vdso to provide unwinding tables for
signal trampolines is _very_ desirable.

>
> > > !rdtsc hardware probably cannot be used at all due to the need to provide
> > > usermode access to device registers. The mere presence of rdtsc does not
> > > mean that usermode indeed can use it, it should be decided by kernel
> > > based on the current in-kernel time source. If rdtsc is not usable, the
> > > corresponding data should not be exported, or implementation should go
> > > directly into syscall or whatever.
>
> Yes, the patches I have only work if the kernel uses the TSC as its main
> timecounter as well.
>
> > But then applications would:
> > - use gettimeofday() more than they should ("it works on Linux"), even
> >   more than now since when "it works on FreeBSD-x86" too
> > - just be slow when gettimeofday() is slow
> > - kludge around gettimeofday() being slow like they do now
> > - kludge around gettimeofday() being slow not like they do now (use more
> >   complications to probe it being slow).
>
> Some applications really need fine-grained timing with as little overhead
> as possible.
>=20 > --=20 > John Baldwin --mSxgbZZZvrAyzONB Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/M/CUACgkQC3+MBN1Mb4ghXwCgkPtKRATwrzKbJDD0j9LeoqLR 0/MAnRtpx6mS4HOad3y/lgGdV2bducK9 =zlG/ -----END PGP SIGNATURE----- --mSxgbZZZvrAyzONB-- From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 20:51:10 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 89F79106564A; Mon, 4 Jun 2012 20:51:10 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail16.syd.optusnet.com.au (mail16.syd.optusnet.com.au [211.29.132.197]) by mx1.freebsd.org (Postfix) with ESMTP id 020DD8FC12; Mon, 4 Jun 2012 20:51:09 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail16.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q54Kp06q023119 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 5 Jun 2012 06:51:01 +1000 Date: Tue, 5 Jun 2012 06:51:00 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <201206041101.57486.jhb@freebsd.org> Message-ID: <20120605054930.H3236@besplex.bde.org> References: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 20:51:10 -0000 On Mon, 4 Jun 2012, 
John Baldwin wrote:

> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
>> On Sun, 3 Jun 2012, Konstantin Belousov wrote:
>>> What is timehands offsets ? Do you mean things like leap seconds ?
>>
>> Yes. binuptime() is:
>>
>> % void
>> % binuptime(struct bintime *bt)
>> % {
>> % 	struct timehands *th;
>> % 	u_int gen;
>> %
>> % 	do {
>> % 		th = timehands;
>> % 		gen = th->th_generation;
>> % 		*bt = th->th_offset;
>> % 		bintime_addx(bt, th->th_scale * tc_delta(th));
>> % 	} while (gen == 0 || gen != th->th_generation);
>> % }
>>
>> Without the kernel providing th->th_offset, you have to do lots of ntp
>> handling for yourself (compatibly with the kernel) just to get an
>> accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but
>> they do affect CLOCK_REALTIME which is the clock id used by
>> gettimeofday(). For the former, you only have to advance the offset
>> for yourself occasionally (compatibly with the kernel) and manage
>> (compatibly with the kernel, especially in the long term) ntp slewing
>> and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.
>
> I think duplicating this logic in userland would just be wasteful. I have

Sure. I modestly proposed it.

> a private fast gettimeofday() at my current job and it works by exporting
> the current timehands structure (well, the equivalent) to userland. The
> userland bits then fetch a copy of the details and do the same as bintime().

How do you keep this up to date, especially for leap seconds?

> (I move the math (bintime_addx() and the multiply)) out of the loop however.

My version has a comment saying to do that, but I just noticed that it
wouldn't work so well -- the timehands fields would have to be copied to
local variables while under protection of the generation count, so it
would give messier code to optimize a case that occurs _very_ rarely.

>> timehands in a shared pages is close to working.
th_generation protects >> things in the same way as in the kernel, modulo assumptions that writes >> are ordered. > > It would work fine. And in fact, having multiple timehands is actually a > bug, not a feature. It lets you compute bogus timestamps if you get preempted > at the wrong time and end up with time jumping around. At Yahoo! we reduced > the number of timehands structures down to 2 or some such, and I'm now of > the opinion we should just have one and dispense with the entire array. No, it is a feature. The time should never jump around (backwards), but it can easily jump forwards. It makes little difference if preemption occurs after the timehands have been read, or while reading them but in such a way that the timehands become stale during preemption but not stale enough for their generation to change so that you notice that they are stale -- you get a stale timestamp either way (with staleness approximately the preemption time). Times read by different threads can easily have different staleness according to which timehands they ended up using and this may be quite different from which timehands they started using and from which timehands is active after they return. Perhaps this is what you mean. But again, this happens anyway when the preemption occurs after the timehands have been read. The main point of timehands was originally to give a copy of the time that was stable for a time hopefully long enough for the timehands to be read without them being clobbered by an update. 
binuptime() was:

1.59 (phk 26-Mar-98): void
1.113 (phk 07-Feb-02): binuptime(struct bintime *bt)
1.113 (phk 07-Feb-02): {
1.113 (phk 07-Feb-02): 	struct timecounter *tc;
1.113 (phk 07-Feb-02):
1.113 (phk 07-Feb-02): 	tc = timecounter;
1.113 (phk 07-Feb-02): 	*bt = tc->tc_offset;
1.113 (phk 07-Feb-02): 	bintime_addx(bt, tc->tc_scale * tco_delta(tc));
1.113 (phk 07-Feb-02): }

This has an obvious race if the thread running this is preempted for a long
time, so that the copy of the time is actually not stable for long enough.
This was fixed (except I think in some cases using ddb) by using the
generation count.

With the generation count, multiple timehands are probably unnecessary,
but they reduce locking bugs (no memory ordering for the generation count)
and give the optimization that binuptime() etc. don't have to spin waiting
for updates. Now it is the thread doing the updates that gets the most
advantages from multiple timehands. It doesn't have to worry much about
locking, or being preempted, or blocking for a long time, since it knows
that binuptime() etc. will keep using a previous generation safely and not
busy-wait for it, provided only that it doesn't block for so long that the
oldest previous generation becomes too old to work. 2 timehands are
probably enough for this, but 1 isn't.

> For my userland case I only export a single timehands copy.

So readers block for a long time if the writer is updating and the
writer blocks? Works best for UP :-). Actually, there are problems in the
kernel even for UP. Consider the writer doing an update and being preempted
by ddb, and ddb using binuptime(), though it shouldn't. This is deadlock if
there is only 1 timehands. My version runs the update as a normal interrupt
handler so that it can be interrupted by fast interrupt handlers.
This gives similar problems -- fast interrupt handlers shouldn't call
binuptime() either (this can deadlock in the timecounter hardware function
for at least the i8254 timecounter), but they do and this is useful for
things like timestamps from serial hardware. Multiple timehands at least
limit this problem. Applications have similar problems (more like my kernel
version since applications can't get as exclusive an access as a fast
interrupt handler can).

>>>> rdtsc is also very unportable, even on CPUs that have it. But all other
>>>> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
>>>> and as accurate as it is now.
>
> For all the hardware where people run mysql and similar software that calls
> getimeofday() a lot, rdtsc() works just fine.

That wasn't the case until recently (except 10-15 years ago for UP with
no SMM). Someone just fixed the rdtsc()-based time function in dtrace. It
tries to add a per-cpu rdtsc() offset, but the offset was backwards. It
takes P-state invariance and maybe more for the offset to be 0 and
not drift.

>>> !rdtsc hardware is probably cannot be used at all due to need to provide
>>> usermode access to device registers. The mere presence of rdtsc does not
>>> means that usermode indeed can use it, it should be decided by kernel
>>> based on the current in-kernel time source. If rdtsc is not usable, the
>>> corresponding data should not be exported, or implementation should go
>>> directly into syscall or whatever.
>
> Yes, the patches I have only work if the kernel uses the TSC as its main
> timecounter as well.

The detail I miss most is the TSC being available for use in userland
even if it is not the primary timecounter. Maybe its quality is enough
for the application, or the application can fix it up using per-cpu
offsets.
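[Editorial sketch, not part of the original message.] The generation-count
read loop discussed in this thread, applied to a timehands copy living in a
shared page, might look roughly like the C below. Everything here is an
assumption made for illustration: the struct shared_th layout, its field
names, the demo_counter() stand-in for rdtsc, and the use of C11 atomics to
supply the write ordering that the thread says the kernel code takes for
granted. The math is done on local copies after the loop, as John describes
for his version. Like the binuptime() shown earlier in the thread, the
scale * delta product only stays within 64 bits if the writer updates often
enough.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical layout of a timehands copy in a shared page. The field
 * names are invented for this sketch; they only mirror th_generation,
 * th_offset, th_scale and the counter base discussed in the thread.
 */
struct shared_th {
	_Atomic uint32_t gen;		/* 0 while the writer is mid-update */
	_Atomic uint64_t off_sec;	/* whole seconds of the offset */
	_Atomic uint64_t off_frac;	/* fraction of a second, in 2^-64 s */
	_Atomic uint64_t scale;		/* 2^-64 s per counter tick */
	_Atomic uint32_t base;		/* counter value at the last update */
	_Atomic uint32_t mask;		/* valid bits of the counter */
};

struct bt {
	uint64_t sec;
	uint64_t frac;
};

/* Stand-in for the hardware counter (e.g. rdtsc) so the sketch runs anywhere. */
static uint32_t demo_counter_value;

static uint32_t
demo_counter(void)
{
	return (demo_counter_value);
}

static struct bt
shared_binuptime(struct shared_th *th, uint32_t (*read_counter)(void))
{
	struct bt r;
	uint64_t sec, frac, scale;
	uint32_t gen, delta;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		sec = atomic_load_explicit(&th->off_sec, memory_order_relaxed);
		frac = atomic_load_explicit(&th->off_frac, memory_order_relaxed);
		scale = atomic_load_explicit(&th->scale, memory_order_relaxed);
		/* Read the counter inside the loop so it pairs with this base. */
		delta = (read_counter() -
		    atomic_load_explicit(&th->base, memory_order_relaxed)) &
		    atomic_load_explicit(&th->mask, memory_order_relaxed);
		/* Keep the field loads from moving past the re-check below. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));

	/* The multiply and add run on private copies, outside the retry loop. */
	r.sec = sec;
	r.frac = frac + scale * delta;
	if (r.frac < frac)
		r.sec++;	/* carry out of the 64-bit fraction */
	return (r);
}
```

A caller would pass a function that reads the real counter in place of
demo_counter(). With a single exported copy, a reader that sees generation 0
simply retries until the writer finishes, which is the blocking behaviour
questioned above.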
Bruce From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 21:16:12 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2B1FF1065672; Mon, 4 Jun 2012 21:16:12 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from mail-qa0-f49.google.com (mail-qa0-f49.google.com [209.85.216.49]) by mx1.freebsd.org (Postfix) with ESMTP id 8D0B48FC17; Mon, 4 Jun 2012 21:16:11 +0000 (UTC) Received: by qabj40 with SMTP id j40so2205084qab.15 for ; Mon, 04 Jun 2012 14:16:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=47d5Z5JgH9ySqYO2XKNw1/VzpGOBNoLmoXkzW0H/sQ4=; b=vfoNg9FxHW016viznrJ7FWcwnWn2FHqz+57u/FedhZT0zKs8zFGUu2SKBbcdhgt21/ BymWkbKcqUHWOgBIcCGKmPX6F8xDrFLLjLYwUa3sO8bYGM+5tf6uC9/YIlPQ+Wg/77qU Y0uktFvXhURuYhyt/WV1EizZOQMrxllFrR5Zgr/J2Z0RFYiS08/MA3y/6635azStnGFJ VTEUdaD6kPwJnpU754WjNjXZIwry4wUphANHxiJ2b/pBHvHbcffuclP+Z8uAOSSpXZgL NPhthxgEk8aP/dVC5QC7LtpUFRoVJF5UjUfAWib6ayr5/Xee8fsXgov8c9efX/dJV+E8 I3ag== MIME-Version: 1.0 Received: by 10.224.202.8 with SMTP id fc8mr14783196qab.40.1338844570879; Mon, 04 Jun 2012 14:16:10 -0700 (PDT) Sender: giovanni.trematerra@gmail.com Received: by 10.229.160.20 with HTTP; Mon, 4 Jun 2012 14:16:10 -0700 (PDT) In-Reply-To: <20120604181917.GD85127@deviant.kiev.zoral.com.ua> References: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org> <20120604181917.GD85127@deviant.kiev.zoral.com.ua> Date: Mon, 4 Jun 2012 23:16:10 +0200 X-Google-Sender-Auth: c7e69wjeNf-PVXITIT-QPF4N8XM Message-ID: From: Giovanni Trematerra To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Alan Cox , Alexander Kabaev , Attilio 
Rao , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 21:16:12 -0000

On Mon, Jun 4, 2012 at 8:19 PM, Konstantin Belousov wrote:
> On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote:
>> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
>> > On Sun, 3 Jun 2012, Konstantin Belousov wrote:
>> I think duplicating this logic in userland would just be wasteful. I have
>> a private fast gettimeofday() at my current job and it works by exporting
>> the current timehands structure (well, the equivalent) to userland. The
>> userland bits then fetch a copy of the details and do the same as bintime().
>> (I move the math (bintime_addx() and the multiply)) out of the loop however.
> I started yesterday an implementation which uses shared page to export
> some variant of timehands, and uses auxv to provide the libc with a pointer
> to timehands when rdtsc is reasonable.
>
> I almost finished both 32bit and 64bit userspace, but there is
> kernel-side work left. Is your implementation ready or close to be ready
> for commit ? In other words, should I drop the efforts, or continue ?
>

Hey wait, What are you doing?
This is completely unfair. You didn't even review my patch.
I really don't understand your way to completely ignore me and start implement
yesterday something you didn't care about for more than 3 years.
It costs me a lot of time and energy and I think I deserve more respect that
just be ignored.
-- Gianni From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 21:30:16 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DC4501065678; Mon, 4 Jun 2012 21:30:15 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 96ECE8FC14; Mon, 4 Jun 2012 21:30:15 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id ECFD9B94F; Mon, 4 Jun 2012 17:30:14 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Date: Mon, 4 Jun 2012 17:22:07 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <201206041101.57486.jhb@freebsd.org> <20120604181917.GD85127@deviant.kiev.zoral.com.ua> In-Reply-To: <20120604181917.GD85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201206041722.07269.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Mon, 04 Jun 2012 17:30:15 -0400 (EDT) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 21:30:16 -0000 On Monday, June 04, 2012 2:19:17 pm Konstantin Belousov wrote: > On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote: > > On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote: > > > On Sun, 3 Jun 2012, Konstantin Belousov wrote: > > > > > > > On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote: > > > >> On Sat, 2 Jun 
2012, Konstantin Belousov wrote: > > > >>> ... > > > >>> In fact, I think that if the whole goal is only fast clocks, then we > > > >>> do not need any additional system mechanisms, since we can easily export > > > >>> coefficients for rdtsc formula already. E.g. we can put it into elf auxv, > > > >>> which is ugly but bearable. > > > >> > > > >> How do you get the timehands offsets? These only need to be updated > > > >> every second or so, or when used, but how can the application know > > > >> when they need to be updated if this is not done automatically in the > > > >> kernel by writing to a shared page? I can only think of the > > > >> application arranging an alarm signal every second or so and updating > > > >> then. No good for libraries. > > > > What is timehands offsets ? Do you mean things like leap seconds ? > > > > > > Yes. binuptime() is: > > > > > > % void > > > % binuptime(struct bintime *bt) > > > % { > > > % struct timehands *th; > > > % u_int gen; > > > % > > > % do { > > > % th = timehands; > > > % gen = th->th_generation; > > > % *bt = th->th_offset; > > > % bintime_addx(bt, th->th_scale * tc_delta(th)); > > > % } while (gen == 0 || gen != th->th_generation); > > > % } > > > > > > Without the kernel providing th->th_offset, you have to do lots of ntp > > > handling for yourself (compatibly with the kernel) just to get an > > > accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but > > > they do affect CLOCK_REALTIME which is the clock id used by > > > gettimeofday(). For the former, you only have to advance the offset > > > for yourself occasionally (compatibly with the kernel) and manage > > > (compatibly with the kernel, especially in the long term) ntp slewing > > > and other syscall/sysctl kernel activity that micro-adjusts th->th_scale. > > > > I think duplicating this logic in userland would just be wasteful. 
I have > > a private fast gettimeofday() at my current job and it works by exporting > > the current timehands structure (well, the equivalent) to userland. The > > userland bits then fetch a copy of the details and do the same as bintime(). > > (I move the math (bintime_addx() and the multiply)) out of the loop however. > I started yesterday an implementation which uses shared page to export > some variant of timehands, and uses auxv to provide the libc with a pointer > to timehands when rdtsc is reasonable. > > I almost finished both 32bit and 64bit userspace, but there is > kernel-side work left. Is your implementation ready or close to be ready > for commit ? In other words, should I drop the efforts, or continue ? No, mine is not general purpose. I'll see if I can make a public patch of what it looks like. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 21:30:16 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5B8D4106567F; Mon, 4 Jun 2012 21:30:16 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 2B7358FC08; Mon, 4 Jun 2012 21:30:16 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 86A4FB95B; Mon, 4 Jun 2012 17:30:15 -0400 (EDT) From: John Baldwin To: Giovanni Trematerra Date: Mon, 4 Jun 2012 17:23:49 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <20120604181917.GD85127@deviant.kiev.zoral.com.ua> In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201206041723.49562.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Mon, 04 Jun 2012 
17:30:15 -0400 (EDT) Cc: Alan Cox , Alexander Kabaev , Attilio Rao , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 21:30:16 -0000 On Monday, June 04, 2012 5:16:10 pm Giovanni Trematerra wrote: > On Mon, Jun 4, 2012 at 8:19 PM, Konstantin Belousov wrote: > > On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote: > >> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote: > >> > On Sun, 3 Jun 2012, Konstantin Belousov wrote: > > >> I think duplicating this logic in userland would just be wasteful. I have > >> a private fast gettimeofday() at my current job and it works by exporting > >> the current timehands structure (well, the equivalent) to userland. The > >> userland bits then fetch a copy of the details and do the same as bintime(). > >> (I move the math (bintime_addx() and the multiply)) out of the loop however. > > I started yesterday an implementation which uses shared page to export > > some variant of timehands, and uses auxv to provide the libc with a pointer > > to timehands when rdtsc is reasonable. > > > > I almost finished both 32bit and 64bit userspace, but there is > > kernel-side work left. Is your implementation ready or close to be ready > > for commit ? In other words, should I drop the efforts, or continue ? > > > > Hey wait, What are you doing? > This is completely unfair. You didn't even review my patch. > I really don't understand your way to completely ignore me and start implement > yesterday something you didn't care about for more than 3 years. > It costs me a lot of time and energy and I think I deserve more respect that > just be ignored. In fairness, I would not be able to use your version of gettimeofday(). 
My application requires something where we can interpolate based on the value of rdtsc(). Also, I don't really see the need to export anything other than the details to make gettimeofday() faster. I don't see a practical need for using shared variables for getpid(), getpgid(), getppid(), getuid(), or the like. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 21:30:17 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 22B1C106566C; Mon, 4 Jun 2012 21:30:17 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id BCE4A8FC0A; Mon, 4 Jun 2012 21:30:16 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 1ABA8B9A3; Mon, 4 Jun 2012 17:30:16 -0400 (EDT) From: John Baldwin To: Bruce Evans Date: Mon, 4 Jun 2012 17:30:05 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <201206041101.57486.jhb@freebsd.org> <20120605054930.H3236@besplex.bde.org> In-Reply-To: <20120605054930.H3236@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201206041730.05478.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Mon, 04 Jun 2012 17:30:16 -0400 (EDT) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 21:30:17 -0000 On Monday, 
June 04, 2012 4:51:00 pm Bruce Evans wrote: > On Mon, 4 Jun 2012, John Baldwin wrote: > > On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote: > >> On Sun, 3 Jun 2012, Konstantin Belousov wrote: > >>> What is timehands offsets ? Do you mean things like leap seconds ? > >> > >> Yes. binuptime() is: > >> > >> % void > >> % binuptime(struct bintime *bt) > >> % { > >> % struct timehands *th; > >> % u_int gen; > >> % > >> % do { > >> % th = timehands; > >> % gen = th->th_generation; > >> % *bt = th->th_offset; > >> % bintime_addx(bt, th->th_scale * tc_delta(th)); > >> % } while (gen == 0 || gen != th->th_generation); > >> % } > >> > >> Without the kernel providing th->th_offset, you have to do lots of ntp > >> handling for yourself (compatibly with the kernel) just to get an > >> accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but > >> they do affect CLOCK_REALTIME which is the clock id used by > >> gettimeofday(). For the former, you only have to advance the offset > >> for yourself occasionally (compatibly with the kernel) and manage > >> (compatibly with the kernel, especially in the long term) ntp slewing > >> and other syscall/sysctl kernel activity that micro-adjusts th->th_scale. > > > > I think duplicating this logic in userland would just be wasteful. I have > > Sure. I modestly proposed it. > > > a private fast gettimeofday() at my current job and it works by exporting > > the current timehands structure (well, the equivalent) to userland. The > > userland bits then fetch a copy of the details and do the same as bintime(). > > How do you keep this up to date, especially for leap seconds? I added a hack to tc_windup() where it updates the shared copy of the variables with the results of the tc_windup() call each time it is invoked. 
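[Editorial sketch, not part of the original message.] The kind of hook
described here, where each tc_windup() pass copies its results into one
shared structure, could be sketched as the writer below. This is not John's
patch: the struct shared_th layout, struct th_update and
shared_th_publish() are names made up for illustration, and the sketch
assumes a single writer. The generation is set to 0 for the duration of the
update, so the window in which readers must retry covers only these few
stores.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared copy of the timehands fields (invented layout). */
struct shared_th {
	_Atomic uint32_t gen;		/* 0 while an update is in flight */
	_Atomic uint64_t off_sec;	/* whole seconds of the offset */
	_Atomic uint64_t off_frac;	/* fraction of a second, in 2^-64 s */
	_Atomic uint64_t scale;		/* 2^-64 s per counter tick */
	_Atomic uint32_t base;		/* counter value at this update */
	_Atomic uint32_t mask;		/* valid bits of the counter */
};

/* Values one windup pass would hand to userland (names are made up). */
struct th_update {
	uint64_t off_sec, off_frac, scale;
	uint32_t base, mask;
};

/* Assumes exactly one writer; concurrent writers would need a lock. */
static void
shared_th_publish(struct shared_th *th, const struct th_update *u)
{
	uint32_t gen;

	gen = atomic_load_explicit(&th->gen, memory_order_relaxed);

	/* Generation 0 tells readers to retry until the update is done. */
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&th->off_sec, u->off_sec, memory_order_relaxed);
	atomic_store_explicit(&th->off_frac, u->off_frac, memory_order_relaxed);
	atomic_store_explicit(&th->scale, u->scale, memory_order_relaxed);
	atomic_store_explicit(&th->base, u->base, memory_order_relaxed);
	atomic_store_explicit(&th->mask, u->mask, memory_order_relaxed);

	/* Never publish 0 as a valid generation, even on wraparound. */
	if (++gen == 0)
		gen = 1;
	/* The release store makes the field stores visible before the new gen. */
	atomic_store_explicit(&th->gen, gen, memory_order_release);
}
```

A reader pairs with this by loading gen, copying the fields, and re-checking
gen, in the same way as the binuptime() loop quoted above.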
> My version has a comment saying to do that, but I just noticed that > it wouldn't work so well -- the timehands fields would have to be > copied to local variables while under protection of the generation > count, so it would give messier code to optimize a case that occurs > _very_ rarely. It's not that messy in my experience. > >> timehands in a shared pages is close to working. th_generation protects > >> things in the same way as in the kernel, modulo assumptions that writes > >> are ordered. > > > > It would work fine. And in fact, having multiple timehands is actually a > > bug, not a feature. It lets you compute bogus timestamps if you get preempted > > at the wrong time and end up with time jumping around. At Yahoo! we reduced > > the number of timehands structures down to 2 or some such, and I'm now of > > the opinion we should just have one and dispense with the entire array. > > No, it is a feature. The time should never jump around (backwards), but > it can easily jump forwards. It makes little difference if preemption > occurs after the timehands have been read, or while reading them but in > such a way that the timehands become stale during preemption but not stale > enough for their generation to change so that you notice that they are > stale -- you get a stale timestamp either way (with staleness approximately > the preemption time). Times read by different threads can easily have > different staleness according to which timehands they ended up using and > this may be quite different from which timehands they started using and > from which timehands is active after they return. Perhaps this is what > you mean. But again, this happens anyway when the preemption occurs after > the timehands have been read. Time definitely jumped backwards at Yahoo!. The problem case was when NTP was adjusting the time, so if you used a timehands structure that was a few generations old (stale), you could have a fairly large component that was (delta * scale). 
If the scale had slowed down in subsequent updates, then the computed time would jump out into the future. On the next time update with a newer timehands, the effective base was less than the previous calculation thought it should have been, and the scale was smaller, so the end result if the TSC had not advanced very far was for the new time to be less than the previous time, and thus time jumped backwards. > The main point of timehands was originally to give a copy of the time > that was stable for a time hopefully long enough for the timehands to be > read without them being clobbered by an update. binuptime() was: > > 1.59 (phk 26-Mar-98): void > 1.113 (phk 07-Feb-02): binuptime(struct bintime *bt) > 1.113 (phk 07-Feb-02): { > 1.113 (phk 07-Feb-02): struct timecounter *tc; > 1.113 (phk 07-Feb-02): > 1.113 (phk 07-Feb-02): tc = timecounter; > 1.113 (phk 07-Feb-02): *bt = tc->tc_offset; > 1.113 (phk 07-Feb-02): bintime_addx(bt, tc->tc_scale * tco_delta(tc)); > 1.113 (phk 07-Feb-02): } > > This has an obvious race if the thread running this is preempted for a long > time, so that the copy of the time is actually not stable for long enough. > This was fixed (except I think in some cases using ddb) by using the > generation count. The problem with having too many timehands structures is you can get a stable timehands structure that is too stale. > > For my userland case I only export a single timehands copy. > > So readers block for a long time if the writer is updating and the > writer blocks? Works best for UP :-). The update to the shared timehands structure does not take a long time, specifically for userland it does not require all of tc_windup()'s execution time, merely the time to update the values. > > For all the hardware where people run mysql and similar software that calls > > getimeofday() a lot, rdtsc() works just fine. > > That wasn't the case until recently (except 10-15 years ago for UP with > no SMM). 
Someone just fixed rdtsc()-based time function in dtrace. It > tries to add a per-cpu rdtsc() offset, but the offset was backwards. It > takes P-state invariance and maybe more for the offset to be 0 and > not drift. I do have the luxury of using fairly modern Intel CPUs at work, and all of them have invariant TSCs. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 22:12:27 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0B9111065673; Mon, 4 Jun 2012 22:12:27 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 9554D8FC1B; Mon, 4 Jun 2012 22:12:26 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q54MC30n091092; Tue, 5 Jun 2012 01:12:03 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q54MC2ej094220; Tue, 5 Jun 2012 01:12:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q54MC284094219; Tue, 5 Jun 2012 01:12:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 5 Jun 2012 01:12:02 +0300 From: Konstantin Belousov To: Giovanni Trematerra Message-ID: <20120604221202.GG85127@deviant.kiev.zoral.com.ua> References: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org> <20120604181917.GD85127@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Y+xroYBkGM9OatJL" Content-Disposition: 
inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Alan Cox , Alexander Kabaev , Attilio Rao , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 22:12:27 -0000 --Y+xroYBkGM9OatJL Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Mon, Jun 04, 2012 at 11:16:10PM +0200, Giovanni Trematerra wrote:
> On Mon, Jun 4, 2012 at 8:19 PM, Konstantin Belousov wrote:
> > On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote:
> >> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
> >> > On Sun, 3 Jun 2012, Konstantin Belousov wrote:
>
> >> I think duplicating this logic in userland would just be wasteful. I have
> >> a private fast gettimeofday() at my current job and it works by exporting
> >> the current timehands structure (well, the equivalent) to userland. The
> >> userland bits then fetch a copy of the details and do the same as bintime().
> >> (I move the math (bintime_addx() and the multiply)) out of the loop however.
> > I started yesterday an implementation which uses shared page to export
> > some variant of timehands, and uses auxv to provide the libc with a pointer
> > to timehands when rdtsc is reasonable.
> >
> > I almost finished both 32bit and 64bit userspace, but there is
> > kernel-side work left. Is your implementation ready or close to be ready
> > for commit ? In other words, should I drop the efforts, or continue ?
> >
>
> Hey wait, What are you doing?
> This is completely unfair. You didn't even review my patch.
I did. I am quite saddened if you did not notice that I did review your
patch.

> I really don't understand your way to completely ignore me and start implement
> yesterday something you didn't care about for more than 3 years.
> It costs me a lot of time and energy and I think I deserve more respect that
> just be ignored.

I did not ignore the problem for 3 years. In fact, I did some, IMO
non-trivial, development moving the whole issue forward. In particular, I
developed the shared page infrastructure that is currently used (yes,
we already do have a properly implemented shared page and a sub-allocator
of memory from it). I did some relevant rtld and libc changes; in
particular, libc now has full access to and uses auxv. So I consider this
statement a form of insult.

I indeed never had much desire to delve into the timekeeping code. But
the periodically recurring discussions, and the final flamefest about the
issue, made me realize that I had spent more effort discussing the
'shared page' 'idea' than it would take to implement fast gettimeofday()
and clock_gettime() using the existing infrastructure.

Having my hands somewhat deep in everything ABI/ELF, I very much
want to avoid painting myself into a corner with unsustainable decisions
that make ABI maintenance problematic. So I decided to save my time and
implement it 'properly', to close the question and possibly remove the
item from the ideas page.

Please note that what I do now is still not a vdso. It does allow the
vdso to plug into the framework later, but currently I only plan to reuse
the shared page and auxv transport to implement gettimeofday() without a
usermode->kernelmode trip.

If you want to do a full-fledged vdso at once, I will be very pleased
and will probably support this work technically.

For the posted patch, I respond with a NACK.
--Y+xroYBkGM9OatJL Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/NMrIACgkQC3+MBN1Mb4i3NACePjulGq8ZJL/dXcHjRCmvf3M7 1EIAnjkQGFTHATGrwScdXfF08wQ19zzp =FgPt -----END PGP SIGNATURE----- --Y+xroYBkGM9OatJL-- From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 22:42:59 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 391F31065670; Mon, 4 Jun 2012 22:42:59 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from mail-qa0-f47.google.com (mail-qa0-f47.google.com [209.85.216.47]) by mx1.freebsd.org (Postfix) with ESMTP id 9E9CC8FC14; Mon, 4 Jun 2012 22:42:58 +0000 (UTC) Received: by qabg1 with SMTP id g1so1910309qab.13 for ; Mon, 04 Jun 2012 15:42:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=NhlwsO0jlVnsBdzv3UwsBnq8l/AWNtwdk9Bs4WjfqFA=; b=MjHpZ8KNIGKk2J1PzwPs5hU9YqUWLjSDoagbq/n0KEjyuk8PBt3WRvm9YCEMq86vo6 iHCofrZP4hiST1apkeJ/qwd8RXi5cDjW+2dFq+kKXxQn6P32I4LyuRCp/bActbLe/mZf hQP3GGA4LQ+Dau0IPnYrhYX6qoe7TxHMC1m9FmBRZNCKS4ex/bklqczJ3PTZ0/CF61ZU FneWSLEmSLeJ2aIMPQD6Hh8aPD4geCogHpuGsVV2rX5qBaDRgAUUHUuqvYL+2TwEjXjd OTeRSXwVV7r08MLQ2oXJ1+zzrziCWDQJI+BjaGaSIbFEAFM9XaI8auoQPZ8ksiRumx4x qAHw== MIME-Version: 1.0 Received: by 10.229.137.14 with SMTP id u14mr4365698qct.87.1338849777748; Mon, 04 Jun 2012 15:42:57 -0700 (PDT) Received: by 10.229.160.20 with HTTP; Mon, 4 Jun 2012 15:42:57 -0700 (PDT) In-Reply-To: <20120604221202.GG85127@deviant.kiev.zoral.com.ua> References: <20120603051904.GG2358@deviant.kiev.zoral.com.ua> <20120603184315.T856@besplex.bde.org> <201206041101.57486.jhb@freebsd.org> <20120604181917.GD85127@deviant.kiev.zoral.com.ua> <20120604221202.GG85127@deviant.kiev.zoral.com.ua> Date: Tue, 
5 Jun 2012 00:42:57 +0200 Message-ID: From: Giovanni Trematerra To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Alan Cox , Alexander Kabaev , Attilio Rao , freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 22:42:59 -0000

On Tue, Jun 5, 2012 at 12:12 AM, Konstantin Belousov wrote:
> On Mon, Jun 04, 2012 at 11:16:10PM +0200, Giovanni Trematerra wrote:
>> On Mon, Jun 4, 2012 at 8:19 PM, Konstantin Belousov wrote:
>> > On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote:
>> >> On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
>> >> > On Sun, 3 Jun 2012, Konstantin Belousov wrote:
>> >>
>> >> I think duplicating this logic in userland would just be wasteful.  I have
>> >> a private fast gettimeofday() at my current job and it works by exporting
>> >> the current timehands structure (well, the equivalent) to userland.  The
>> >> userland bits then fetch a copy of the details and do the same as bintime().
>> >> (I move the math (bintime_addx() and the multiply)) out of the loop however.
>> > I started yesterday an implementation which uses shared page to export
>> > some variant of timehands, and uses auxv to provide the libc with a pointer
>> > to timehands when rdtsc is reasonable.
>> >
>> > I almost finished both 32bit and 64bit userspace, but there is
>> > kernel-side work left. Is your implementation ready or close to be ready
>> > for commit ? In other words, should I drop the efforts, or continue ?
>> >
>>
>> Hey wait, What are you doing?
>> This is completely unfair. You didn't even review my patch.
> I did. I am quite saddened if you did not note that I did reviewed your
> patch.
>
>> I really don't understand your way to completely ignore me and start implement
>> yesterday something you didn't care about for more than 3 years.
>> It costs me a lot of time and energy and I think I deserve more respect that
>> just be ignored.
>
> I did not ignored the problem for 3 years. In fact, I did some, IMO
> non-trivial development moving the whole issue forward. In particular, I
> developed the shared page infrastructure that are currently used (yes,
> we already do have properly implemented shared page and sub-allocator
> of memory from it). I did some relevant rtld and libc changes, in
> particular, libc now have full access and uses auxv. So I consider this
> statement as a form of insult.
>

Really? My apologies if you felt insulted. I didn't do it on purpose.
Honestly, I don't think there will be other occasions to hurt your feelings.

>
> If you want to do full-flesged vdso at once, I will be very much pleasured
> and probably support this work technically.

Thank you for your offer. I appreciate it, but I'm not going to work
on it anymore.
-- Gianni From owner-freebsd-arch@FreeBSD.ORG Mon Jun 4 23:26:40 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CDB041065670; Mon, 4 Jun 2012 23:26:40 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 473E58FC08; Mon, 4 Jun 2012 23:26:40 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q54NQasv002000 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 5 Jun 2012 09:26:37 +1000 Date: Tue, 5 Jun 2012 09:26:36 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <201206041730.05478.jhb@freebsd.org> Message-ID: <20120605075448.B3655@besplex.bde.org> References: <201206041101.57486.jhb@freebsd.org> <20120605054930.H3236@besplex.bde.org> <201206041730.05478.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Jun 2012 23:26:40 -0000 On Mon, 4 Jun 2012, John Baldwin wrote: > On Monday, June 04, 2012 4:51:00 pm Bruce Evans wrote: >> On Mon, 4 Jun 2012, John Baldwin wrote: >>> ... >>> a private fast gettimeofday() at my current job and it works by exporting >>> the current timehands structure (well, the equivalent) to userland. 
The >>> userland bits then fetch a copy of the details and do the same as bintime(). >> >> How do you keep this up to date, especially for leap seconds? > > I added a hack to tc_windup() where it updates the shared copy of the variables > with the results of the tc_windup() call each time it is invoked. > >> My version has a comment saying to do that, but I just noticed that >> it wouldn't work so well -- the timehands fields would have to be >> copied to local variables while under protection of the generation >> count, so it would give messier code to optimize a case that occurs >> _very_ rarely. > > It's not that messy in my experience. Just 3-4 lines. With only 16 copies of them in kern_tc.c. I doubt that you provide all of these :-). But full clock_gettime() support requires more than half of these :-(. Maybe even all of the FFCLOCK parts for full compatibility :-(. >>>> timehands in a shared pages is close to working. th_generation protects >>>> things in the same way as in the kernel, modulo assumptions that writes >>>> are ordered. >>> >>> It would work fine. And in fact, having multiple timehands is actually a >>> bug, not a feature. It lets you compute bogus timestamps if you get preempted >>> at the wrong time and end up with time jumping around. At Yahoo! we reduced >>> the number of timehands structures down to 2 or some such, and I'm now of >>> the opinion we should just have one and dispense with the entire array. >> >> No, it is a feature. The time should never jump around (backwards), but >> it can easily jump forwards. It makes little difference if preemption >> occurs after the timehands have been read, or while reading them but in >> such a way that the timehands become stale during preemption but not stale >> enough for their generation to change so that you notice that they are >> stale -- you get a stale timestamp either way (with staleness approximately >> the preemption time). 
Times read by different threads can easily have >> different staleness according to which timehands they ended up using and >> this may be quite different from which timehands they started using and >> from which timehands is active after they return. Perhaps this is what >> you mean. But again, this happens anyway when the preemption occurs after >> the timehands have been read. > > Time definitely jumped backwards at Yahoo!. The problem case was when NTP > was adjusting the time, so if you used a timehands structure that was a > few generations old (stale), you could have a fairly large component that > was (delta * scale). > If the scale had slowed down in subsequent updates, > then the computed time would jump out into the future. On the next time > update with a newer timehands, the effective base was less than the previous > calculation thought it should have been, and the scale was smaller, so the > end result if the TSC had not advanced very far was for the new time to be > less than the previous time, and thus time jumped backwards. Hmm, changing th_scale in tc_windup() indeed seems to be quite broken, and reducing to 1 timehands might work around this. tc_windup() captures the time using the current scale, so any future reads on the new timehands will be monotonic, but current and future reads on other timehands may be too far ahead. Current and future reads on the new timehands are prevented by the generation count -- this is why reducing to 1 timehands might work. Reducing the number of timehands to >= 2 only reduces the maximum non-monotonicity. Someone should test this using adjtime(2). I think you can use it to slew much faster than ntpd will let you. To fix this, just kill all the timehands by setting their generation count to 0 iff changing the scale. This preserves the optimization except when the scale changes, unless ntp (kernel PLL, etc.) changes it almost every time. 
ntpd only changes things every few seconds or minutes, and I hope the kernel doesn't need to change the frequency often.

BTW, the ntp part of the locking is quite broken. You can see this in kern_adjtime() where it uses Giant locking to try to protect the non-atomic write to time_adjtime and other less critical variables. kern_ntptime.c still says that almost everything must be locked by splclock(), but that is null. Actually, almost everything must be locked by something that locks out tc_windup() or a little more. Sched locking might have done it for hardclock() and tc_windup(), but no mutex except Giant has ever been used in kern_ntptime.c. There is also essentially null locking for pps calls from non-clock fast interrupt handlers. The worst case in a useful configuration seems to be 1 CPU executing ntp_update_second() via a fast interrupt handler, and another CPU executing hardpps() via another fast interrupt handler. A non-useful configuration might have more than 1 other CPU executing hardpps().

>>> For my userland case I only export a single timehands copy.
>>
>> So readers block for a long time if the writer is updating and the
>> writer blocks? Works best for UP :-).
>
> The update to the shared timehands structure does not take a long time,
> specifically for userland it does not require all of tc_windup()'s
> execution time, merely the time to update the values.

But whenever it is preempted, it may take a long time. You have no control short of privileged rtprio and nice to prevent preemption. OTOH, tc_windup() is run from a fast interrupt handler. Almost nothing can prevent it being called without much delay or preempt it (sched locking of it used to delay it enough to cause significant lock contention).
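
A minimal sketch of the generation-count protocol discussed above, assuming a single exported copy of the timehands data. The names struct shared_th, th_update() and th_read(), and the nanosecond base with a 32.32 fixed-point scale, are hypothetical; this is not the kernel's struct timehands and not any patch posted in this thread. The reader copies the fields to locals under the generation check and does the arithmetic outside the loop.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared-page record; NOT the kernel's struct timehands. */
struct shared_th {
	_Atomic uint32_t gen;		/* 0 means "update in progress" */
	_Atomic uint64_t base_ns;	/* time at offset_count, in ns */
	_Atomic uint64_t scale;		/* ns per counter tick, 32.32 fixed point */
	_Atomic uint32_t offset_count;	/* counter value captured at base_ns */
};

/* Writer side: invalidate, publish the new values, then bump the generation. */
static void
th_update(struct shared_th *th, uint64_t base_ns, uint64_t scale,
    uint32_t count)
{
	uint32_t gen = atomic_load_explicit(&th->gen, memory_order_relaxed);

	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->base_ns, base_ns, memory_order_relaxed);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&th->offset_count, count, memory_order_relaxed);
	if (++gen == 0)		/* never publish 0 as a valid generation */
		gen = 1;
	atomic_store_explicit(&th->gen, gen, memory_order_release);
}

/*
 * Reader side: retry until a consistent snapshot is copied to locals.
 * The delta is assumed small enough that delta * scale fits in 64 bits.
 */
static uint64_t
th_read(struct shared_th *th, uint32_t count_now)
{
	uint64_t base, scale;
	uint32_t gen, start;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		base = atomic_load_explicit(&th->base_ns, memory_order_relaxed);
		scale = atomic_load_explicit(&th->scale, memory_order_relaxed);
		start = atomic_load_explicit(&th->offset_count,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));

	/* The math runs on the local copies, outside the retry loop. */
	return (base + (((uint64_t)(uint32_t)(count_now - start) * scale) >> 32));
}
```

With one copy, a reader that races an update simply spins until the generation is non-zero and stable, which is the blocking behaviour questioned above if the writer can be preempted. Setting the generation to 0 is also the mechanism suggested earlier for killing stale timehands when the scale changes.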
Bruce From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 07:49:34 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2371D1065675; Tue, 5 Jun 2012 07:49:34 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.dawidek.net (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id C2FCB8FC12; Tue, 5 Jun 2012 07:49:33 +0000 (UTC) Received: from localhost (58.wheelsystems.com [83.12.187.58]) by mail.dawidek.net (Postfix) with ESMTPSA id 27352FE1; Tue, 5 Jun 2012 09:49:32 +0200 (CEST) Date: Tue, 5 Jun 2012 09:47:42 +0200 From: Pawel Jakub Dawidek To: "Andrey A. Chernov" Message-ID: <20120605074741.GA1391@garage.freebsd.pl> References: <201206042134.q54LYoVJ067685@svn.freebsd.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="5mCyUwZo2JvN/JJP" Content-Disposition: inline In-Reply-To: <201206042134.q54LYoVJ067685@svn.freebsd.org> X-OS: FreeBSD 10.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, freebsd-arch@FreeBSD.org Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 07:49:34 -0000 --5mCyUwZo2JvN/JJP Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Jun 04, 2012 at 09:34:49PM +0000, Andrey A. 
Chernov wrote:
> Author: ache
> Date: Mon Jun 4 21:34:49 2012
> New Revision: 236582
> URL: http://svn.freebsd.org/changeset/base/236582
>
> Log:
> 1) IEEE Std 1003.1-2008, "errno" section, is explicit that
>
> "The setting of errno after a successful call to a function is
> unspecified unless the description of that function specifies that
> errno shall not be modified."

Very interesting. However free(3) is always successful. Maybe we need more context here, but the sentence above might talk about functions that can either succeed or fail and such functions do set errno on failure, but we don't know what they do to errno on success - they sometimes interact with errno, free(3) never does.

I am aware that my interpretation might be wishful thinking, but it is pretty obvious that we should save the errno value when calling a function that can fail: when we save errno we don't know whether the call will fail or not, so we have to do that. But requiring errno to be saved when calling a void function that can't fail is simply silly and complicates the code without a reason.

> However, free() in IEEE Std 1003.1-2008 does not mention its interaction
> with errno, so MAY modify it after successful call
> (it depends on particular free() implementation, OS-specific, etc.).

Expecting documentation to describe interaction with some global variable that the function doesn't need is pretty silly too (ok, errno is special, but still). It makes sense to describe all the cases when the function actually uses the global variable, but for a function that never fails and should never touch the global it doesn't make sense. Maybe that's why it doesn't mention interaction with errno?

I agree that the standards aren't clear, but if saving errno around free(3) is the way to go, then I'm sure we have many more problems in our code. Even if it is not supposed to be portable it should be correct: we never know who will take the code and port it, or when.
I guess what I'm trying to say here is that this is much bigger change than it looks and we should probably agree on some global rule here. --=20 Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! http://tupytaj.pl --5mCyUwZo2JvN/JJP Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAk/NuZ0ACgkQForvXbEpPzSfyACeK8eSY42ZOt2Sl1X4SOxGXsdC WvIAoOFeogjkUqP7aMxtyL4lqO4yUNyp =sCiA -----END PGP SIGNATURE----- --5mCyUwZo2JvN/JJP-- From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 12:25:36 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AFE8A106564A for ; Tue, 5 Jun 2012 12:25:36 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 409178FC15 for ; Tue, 5 Jun 2012 12:25:36 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id CAE656F43 for ; Tue, 5 Jun 2012 12:25:34 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 6F9DD95EF; Tue, 5 Jun 2012 14:25:34 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: arch@freebsd.org Date: Tue, 05 Jun 2012 14:25:33 +0200 Message-ID: <86bokyvtc2.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Subject: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 12:25:36 -0000 While working on Capsicum last year, I noticed that some of the spare KTR types are (ab)used for different purposes by different parts of the code. 
KTR_SPARE[234] are all documented as "/* XXX Used by cxgb */", but KTR_SPARE2, for instance, is widely used for clock events. Here is a complete list:

sys/sys/ktr.h: #define KTR_SPARE2 0x00000800 /* XXX Used by cxgb */
sys/sys/ktr.h: #define KTR_SPARE3 0x00008000 /* XXX Used by cxgb */
sys/sys/ktr.h: #define KTR_SPARE4 0x00010000 /* XXX Used by cxgb */
sys/geom/sched/gs_scheduler.h: #define KTR_GSCHED KTR_SPARE4
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "ipi at %d: now %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "handle at %d: now %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR2(KTR_SPARE2, "skip at %d: %d", curcpu, skip);
sys/kern/kern_clocksource.c: CTR5(KTR_SPARE2, "next at %d: next %d.%08x%08x by %d",
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "intr at %d: now %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR5(KTR_SPARE2, "load p at %d: now %d.%08x first in %d.%08x",
sys/kern/kern_clocksource.c: CTR5(KTR_SPARE2, "load at %d: next %d.%08x%08x eq %d",
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "idle at %d: now %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "active at %d: now %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "set_cyc at %d: now %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "set_cyc at %d: t %d.%08x%08x",
sys/kern/kern_clocksource.c: CTR3(KTR_SPARE2, "new co at %d: on %d in %d",
sys/amd64/amd64/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
sys/amd64/amd64/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
sys/dev/cxgb/cxgb_osdep.h: #define KTR_CXGB KTR_SPARE2
sys/dev/cxgb/ulp/iw_cxgb/iw_cxgb_hal.h: #define KTR_IW_CXGB KTR_SPARE4
sys/dev/cxgb/ulp/tom/cxgb_defs.h: #define KTR_TOM KTR_SPARE2
sys/dev/cxgb/ulp/tom/cxgb_defs.h: #define KTR_TCB KTR_SPARE3
sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c: CTR2(KTR_SPARE2, "wr_ack: snd_una=%u credits=%d", snd_una, credits);
sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c: CTR1(KTR_SPARE2, "wr_ack: sbdrop(%d)", bytes);
sys/dev/gem/if_gem.c: #define KTR_GEM KTR_SPARE2
sys/dev/drm2/drmP.h: #define KTR_DRM_REG KTR_SPARE3
sys/dev/hme/if_hme.c: #define KTR_HME KTR_SPARE2 /* XXX */
sys/dev/cas/if_cas.c: #define KTR_CAS KTR_SPARE2
sys/dev/ath/if_ath.c: #define ATH_KTR_INTR KTR_SPARE4
sys/dev/ath/if_ath.c: #define ATH_KTR_ERR KTR_SPARE3
sys/dev/ath/if_ath_rx.c: #define ATH_KTR_INTR KTR_SPARE4
sys/dev/ath/if_ath_rx.c: #define ATH_KTR_ERR KTR_SPARE3
sys/i386/xen/xen_machdep.c: CTR0(KTR_SPARE2, "ni_cli disabling interrupts");
sys/i386/xen/xen_machdep.c: CTR2(KTR_SPARE2, "%x xen_restore_flags eflags %x", rebp(), eflags);
sys/i386/xen/xen_machdep.c: CTR1(KTR_SPARE2, "%x xen_cli disabling interrupts", rebp());
sys/i386/xen/xen_machdep.c: CTR1(KTR_SPARE2, "%x xen_sti enabling interrupts", rebp());
sys/i386/i386/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
sys/i386/i386/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
sys/powerpc/powerpc/cpu.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
sys/powerpc/powerpc/cpu.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
sys/pc98/pc98/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
sys/pc98/pc98/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
sys/sparc64/sparc64/pmap.c: CTR5(KTR_SPARE2,
sys/sparc64/sparc64/tsb.c: CTR5(KTR_SPARE2,
sys/sparc64/include/bus.h: #define KTR_BUS KTR_SPARE2

Most of this is in device drivers, which should use KTR_DEV. There is one major use of KTR_SPAREx in common code: KTR_SPARE2 is used for clock events. It is also used incorrectly by the sparc64 pmap core (there is a separate KTR_PMAP for that).

I suggest that we

1) rename one of the spare KTRs to KTR_CLOCK and use that for clock events. I already have a patch for that.

2) eliminate all other use of KTR_SPARE[0-9] in non-device code. I think the existing KTRs should already cover most cases.

3) modify device drivers to use KTR_DEV for events that aren't covered by existing, more specific KTRs, which is almost none.
For instance, there is no reason why cxgb shouldn't just use KTR_NET. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 13:07:10 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 563611065674; Tue, 5 Jun 2012 13:07:10 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 1052A8FC17; Tue, 5 Jun 2012 13:07:09 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 19EBE6F72; Tue, 5 Jun 2012 13:07:09 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id D331B95FE; Tue, 5 Jun 2012 15:07:08 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: John Baldwin References: <20120602171632.GC2358@deviant.kiev.zoral.com.ua> <201206041053.51802.jhb@freebsd.org> Date: Tue, 05 Jun 2012 15:07:08 +0200 In-Reply-To: <201206041053.51802.jhb@freebsd.org> (John Baldwin's message of "Mon, 4 Jun 2012 10:53:51 -0400") Message-ID: <86y5o1vrer.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 13:07:10 -0000 John Baldwin writes: > I think this is an important question actually. Is there anything > that really needs to be here besides gettimeofday()? 
I mean, is there > any real-world application that needs to call getpid() or getppid() a > bunch of times? Yes, for fork detection when accessing resources shared between descendants of the process that allocated them. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 13:09:24 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7E97C106566B; Tue, 5 Jun 2012 13:09:24 +0000 (UTC) (envelope-from ache@vniz.net) Received: from vniz.net (vniz.net [194.87.13.69]) by mx1.freebsd.org (Postfix) with ESMTP id E639D8FC1D; Tue, 5 Jun 2012 13:09:23 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by vniz.net (8.14.5/8.14.5) with ESMTP id q55D9MpJ014011; Tue, 5 Jun 2012 17:09:22 +0400 (MSK) (envelope-from ache@vniz.net) Received: (from ache@localhost) by localhost (8.14.5/8.14.5/Submit) id q55D9MQe014010; Tue, 5 Jun 2012 17:09:22 +0400 (MSK) (envelope-from ache) Date: Tue, 5 Jun 2012 17:09:22 +0400 From: Andrey Chernov To: Pawel Jakub Dawidek Message-ID: <20120605130922.GE13306@vniz.net> Mail-Followup-To: Andrey Chernov , Pawel Jakub Dawidek , src-committers@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, svn-src-head@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="gj572EiMnwbLXET9" Content-Disposition: inline In-Reply-To: <20120605074741.GA1391@garage.freebsd.pl> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: svn-src-head@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 13:09:24 -0000 --gj572EiMnwbLXET9 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Jun 05, 2012 at 09:47:42AM +0200, Pawel Jakub Dawidek wrote:
> > "The setting of errno after a successful call to a function is
> > unspecified unless the description of that function specifies that
> > errno shall not be modified."
>
> Very interesting. However free(3) is always successful. Maybe we need
> more context here, but the sentence above might talk about functions
> that can either succeed or fail and such functions do set errno on
> failure, but we don't know what they do to errno on success - they
> sometimes interact with the errno, free(3) never does.

According to the Austin Group interpretation, this sentence also applies to functions which always succeed; please see http://austingroupbugs.net/view.php?id=385

> I aware that my interpretation might be too wishful, but it is pretty
> obvious to save errno value when calling a function that can eventually
> fail - when we save the errno we don't know if it will fail or not, so
> we have to do that, but requiring to save errno when calling a void
> function that can't fail is simply silly and complicates the code
> without a reason.

It can still fail due to internal errors; it just does not return failure. For internal errors POSIX states that the errno state is unspecified.

> I agree that the standards aren't clear, but if saving errno around
> free(3) is the way to go, then I'm sure we have much more problems in
> our code, even if it is not suppose to be portable it should be correct
> - we never know who and when will take the code and port it.

Currently they are pretty clear on that point, although I agree that if POSIX said it should not modify errno, life would be easier.
Let's watch their further steps, since they are already aware of this specific problem.

> I guess what I'm trying to say here is that this is much bigger change
> than it looks and we should probably agree on some global rule here.

...which does not violate standards.

-- 
http://ache.vniz.net/
--gj572EiMnwbLXET9 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAk/OBQIACgkQVg5YK5ZEdN3tRwCfSZV9vBpAGgmbFiu6NQuciGF1 ussAn3c6HZUcV5JLevuVuJGCnrw/PpBI =sd4B -----END PGP SIGNATURE----- --gj572EiMnwbLXET9--
From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 13:10:08 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 27A0B1065670; Tue, 5 Jun 2012 13:10:08 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id CAEC28FC0A; Tue, 5 Jun 2012 13:10:07 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 2130B6F77; Tue, 5 Jun 2012 13:10:07 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id DB8CD9600; Tue, 5 Jun 2012 15:10:06 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Pawel Jakub Dawidek References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> Date: Tue, 05 Jun 2012 15:10:06 +0200 In-Reply-To: <20120605074741.GA1391@garage.freebsd.pl> (Pawel Jakub Dawidek's message of "Tue, 5 Jun 2012 09:47:42 +0200") Message-ID: <86txypvr9t.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, "Andrey A. 
Chernov" , freebsd-arch@FreeBSD.org Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 13:10:08 -0000 Pawel Jakub Dawidek writes: > Very interesting. However free(3) is always successful. Maybe we need > more context here, but the sentence above might talk about functions > that can either succeed or fail and such functions do set errno on > failure, but we don't know what they do to errno on success - they > sometimes interact with the errno, free(3) never does. Even if free() itself never fails, it might have side effects such as unmapping a slab, logging a KTR event etc. which can modify errno. DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 13:35:36 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 31FA91065674; Tue, 5 Jun 2012 13:35:36 +0000 (UTC) (envelope-from ache@vniz.net) Received: from vniz.net (vniz.net [194.87.13.69]) by mx1.freebsd.org (Postfix) with ESMTP id 534128FC0C; Tue, 5 Jun 2012 13:35:12 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by vniz.net (8.14.5/8.14.5) with ESMTP id q55DZA9X014565; Tue, 5 Jun 2012 17:35:10 +0400 (MSK) (envelope-from ache@vniz.net) Received: (from ache@localhost) by localhost (8.14.5/8.14.5/Submit) id q55DZ8EY014564; Tue, 5 Jun 2012 17:35:08 +0400 (MSK) (envelope-from ache) Date: Tue, 5 Jun 2012 17:35:08 +0400 From: Andrey Chernov To: Dag-Erling Sm??rgrav Message-ID: <20120605133508.GA14460@vniz.net> Mail-Followup-To: Andrey Chernov , Dag-Erling Sm??rgrav , Pawel Jakub Dawidek , svn-src-head@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG References: 
<201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <86txypvr9t.fsf@ds4.des.no> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <86txypvr9t.fsf@ds4.des.no> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: svn-src-head@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 13:35:36 -0000 On Tue, Jun 05, 2012 at 03:10:06PM +0200, Dag-Erling Sm??rgrav wrote: > Pawel Jakub Dawidek writes: > > Very interesting. However free(3) is always successful. Maybe we need > > more context here, but the sentence above might talk about functions > > that can either succeed or fail and such functions do set errno on > > failure, but we don't know what they do to errno on success - they > > sometimes interact with the errno, free(3) never does. > > Even if free() itself never fails, it might have side effects such as > unmapping a slab, logging a KTR event etc. which can modify errno. I totally agree. Even if our free() will be cleaned in this sense or save errno internally, we need the code which not relays on some particular implementation but works in general scope with any standard-conformant free(). 
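
A minimal sketch of the portable pattern argued for above: save errno before calling free() on an error path and restore it afterwards, so the caller sees the original error whatever a conforming free() does to errno. The names free_keep_errno() and fail_with_cleanup() are hypothetical and are not the code committed in r236582.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of libc: free() that leaves errno untouched. */
static void
free_keep_errno(void *p)
{
	int saved = errno;	/* remember the error being reported */

	free(p);		/* may legally change errno per POSIX 2008 */
	errno = saved;		/* restore it for the caller */
}

/*
 * Hypothetical error path: an earlier call has failed and set errno
 * (simulated here with ENOENT); clean up without losing that value.
 */
static int
fail_with_cleanup(void *buf)
{
	errno = ENOENT;		/* stand-in for a failed open(2), read(2), ... */
	free_keep_errno(buf);
	return (-1);
}
```

The wrapper keeps the save and restore in one place, so call sites do not depend on how any particular free() implementation behaves.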
-- http://ache.vniz.net/ From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 13:39:02 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6AFF81065675; Tue, 5 Jun 2012 13:39:02 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 017DB8FC1B; Tue, 5 Jun 2012 13:39:01 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q55Dcg18059915; Tue, 5 Jun 2012 16:38:42 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q55DcfIw099602; Tue, 5 Jun 2012 16:38:41 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q55DcefT099601; Tue, 5 Jun 2012 16:38:40 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 5 Jun 2012 16:38:40 +0300 From: Konstantin Belousov To: John Baldwin Message-ID: <20120605133840.GK85127@deviant.kiev.zoral.com.ua> References: <201206041101.57486.jhb@freebsd.org> <20120604181917.GD85127@deviant.kiev.zoral.com.ua> <201206041722.07269.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="PEfPc/DjvCj+JzNg" Content-Disposition: inline In-Reply-To: <201206041722.07269.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: 
freebsd-arch@freebsd.org Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 13:39:02 -0000 --PEfPc/DjvCj+JzNg Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Mon, Jun 04, 2012 at 05:22:07PM -0400, John Baldwin wrote:
> On Monday, June 04, 2012 2:19:17 pm Konstantin Belousov wrote:
> > On Mon, Jun 04, 2012 at 11:01:57AM -0400, John Baldwin wrote:
> > > On Sunday, June 03, 2012 6:49:27 am Bruce Evans wrote:
> > > > On Sun, 3 Jun 2012, Konstantin Belousov wrote:
> > > >
> > > > > On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote:
> > > > >> On Sat, 2 Jun 2012, Konstantin Belousov wrote:
> > > > >>> ...
> > > > >>> In fact, I think that if the whole goal is only fast clocks, then we
> > > > >>> do not need any additional system mechanisms, since we can easily export
> > > > >>> coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
> > > > >>> which is ugly but bearable.
> > > > >>
> > > > >> How do you get the timehands offsets? These only need to be updated
> > > > >> every second or so, or when used, but how can the application know
> > > > >> when they need to be updated if this is not done automatically in the
> > > > >> kernel by writing to a shared page? I can only think of the
> > > > >> application arranging an alarm signal every second or so and updating
> > > > >> then. No good for libraries.
> > > > > What is timehands offsets ? Do you mean things like leap seconds ?
> > > >
> > > > Yes. binuptime() is:
> > > >
> > > > % void
> > > > % binuptime(struct bintime *bt)
> > > > % {
> > > > % 	struct timehands *th;
> > > > % 	u_int gen;
> > > > %
> > > > % 	do {
> > > > % 		th = timehands;
> > > > % 		gen = th->th_generation;
> > > > % 		*bt = th->th_offset;
> > > > % 		bintime_addx(bt, th->th_scale * tc_delta(th));
> > > > % 	} while (gen == 0 || gen != th->th_generation);
> > > > % }
> > > >
> > > > Without the kernel providing th->th_offset, you have to do lots of ntp
> > > > handling for yourself (compatibly with the kernel) just to get an
> > > > accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but
> > > > they do affect CLOCK_REALTIME which is the clock id used by
> > > > gettimeofday(). For the former, you only have to advance the offset
> > > > for yourself occasionally (compatibly with the kernel) and manage
> > > > (compatibly with the kernel, especially in the long term) ntp slewing
> > > > and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.
> > >
> > > I think duplicating this logic in userland would just be wasteful. I have
> > > a private fast gettimeofday() at my current job and it works by exporting
> > > the current timehands structure (well, the equivalent) to userland. The
> > > userland bits then fetch a copy of the details and do the same as bintime().
> > > (I move the math (bintime_addx() and the multiply)) out of the loop however.
> > I started yesterday an implementation which uses shared page to export
> > some variant of timehands, and uses auxv to provide the libc with a pointer
> > to timehands when rdtsc is reasonable.
> >
> > I almost finished both 32bit and 64bit userspace, but there is
> > kernel-side work left. Is your implementation ready or close to be ready
> > for commit ? In other words, should I drop the efforts, or continue ?
>
> No, mine is not general purpose. I'll see if I can make a public patch of what
> it looks like.
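For readers following the quoted exchange, the generation-checked read could look roughly like this in userland. This is only a sketch under stated assumptions: struct shared_th, its field names, the nanosecond units and read_uptime_ns() are all hypothetical, the scale is a plain integer instead of the kernel's bintime fraction, and the counter value is passed in rather than read with rdtsc. It is not the layout used by the patch mentioned below.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userland view of the exported data; not the kernel layout. */
struct shared_th {
	_Atomic uint32_t gen;	/* 0 while the kernel is mid-update */
	uint64_t offset_ns;	/* time at the last windup */
	uint64_t ns_per_tick;	/* simplified integer scale */
	uint64_t counter_base;	/* counter value at the last windup */
};

/*
 * Copy the fields under the generation check, then do the arithmetic
 * outside the loop, as described in the quoted text.
 */
static uint64_t
read_uptime_ns(struct shared_th *th, uint64_t counter_now)
{
	uint32_t gen;
	uint64_t off, scale, base;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		off = th->offset_ns;
		scale = th->ns_per_tick;
		base = th->counter_base;
		/* Keep the field reads ahead of the generation re-check. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));

	return (off + scale * (counter_now - base));
}
```

The writer side (bumping the generation around each update, with matching release ordering) is left out; a real implementation would also need the fields themselves to be read in a race-free way.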
My first version that seems to work on amd64 is at http://people.freebsd.org/~kib/misc/moronix.1.patch The plugs do allow for the new gettimeofday code to be replaced by vdso version in future. This is definitely WIP, in particular, the memory barriers handling in the __vdso_gettimeofday and in the tc_windup updater is missing. Also, clock_gettime() support would require ABI change. I only compiled amd64 kernel, i386 is probably broken, other architectures are definitely broken. --PEfPc/DjvCj+JzNg Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/OC+AACgkQC3+MBN1Mb4gFvgCg7kdxK3EZJGiLz8SDf3/xTkEg XA8An0Mb5+KWdwgLW+SjCaI7UFY3ufJS =Ev9z -----END PGP SIGNATURE----- --PEfPc/DjvCj+JzNg-- From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 14:32:30 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 24CFF106564A for ; Tue, 5 Jun 2012 14:32:30 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 6A1F78FC1A for ; Tue, 5 Jun 2012 14:32:29 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q55EWGdu067703; Tue, 5 Jun 2012 17:32:16 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q55EWGf6099940; Tue, 5 Jun 2012 17:32:16 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q55EWFBL099939; Tue, 5 Jun 2012 17:32:15 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 5 Jun 2012 
17:32:15 +0300 From: Konstantin Belousov To: Dag-Erling Sm??rgrav Message-ID: <20120605143215.GL85127@deviant.kiev.zoral.com.ua> References: <86bokyvtc2.fsf@ds4.des.no> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="bFUYW7mPOLJ+Jd2A" Content-Disposition: inline In-Reply-To: <86bokyvtc2.fsf@ds4.des.no> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 14:32:30 -0000 --bFUYW7mPOLJ+Jd2A Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Jun 05, 2012 at 02:25:33PM +0200, Dag-Erling Sm??rgrav wrote: > While working on Capsicum last year, I noticed that some of the spare > KTR types are (ab)used for different purposes by different parts of the > code. KTR_SPARE[234] are all documented as "/* XXX Used by cxgb */", > but KTR_SPARE3, for instance, is widely used for clock events. 
Here is
> a complete list:
>
> sys/sys/ktr.h: #define KTR_SPARE2 0x00000800 /* XXX Used by cxgb */
> sys/sys/ktr.h: #define KTR_SPARE3 0x00008000 /* XXX Used by cxgb */
> sys/sys/ktr.h: #define KTR_SPARE4 0x00010000 /* XXX Used by cxgb */
> sys/geom/sched/gs_scheduler.h: #define KTR_GSCHED KTR_SPARE4
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "ipi at %d: now %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "handle at %d: now %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR2(KTR_SPARE2, "skip at %d: %d", curcpu, skip);
> sys/kern/kern_clocksource.c: CTR5(KTR_SPARE2, "next at %d: next %d.%08x%08x by %d",
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "intr at %d: now %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR5(KTR_SPARE2, "load p at %d: now %d.%08x first in %d.%08x",
> sys/kern/kern_clocksource.c: CTR5(KTR_SPARE2, "load at %d: next %d.%08x%08x eq %d",
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "idle at %d: now %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "active at %d: now %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "set_cyc at %d: now %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR4(KTR_SPARE2, "set_cyc at %d: t %d.%08x%08x",
> sys/kern/kern_clocksource.c: CTR3(KTR_SPARE2, "new co at %d: on %d in %d",
> sys/amd64/amd64/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
> sys/amd64/amd64/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
> sys/dev/cxgb/cxgb_osdep.h: #define KTR_CXGB KTR_SPARE2
> sys/dev/cxgb/ulp/iw_cxgb/iw_cxgb_hal.h: #define KTR_IW_CXGB KTR_SPARE4
> sys/dev/cxgb/ulp/tom/cxgb_defs.h: #define KTR_TOM KTR_SPARE2
> sys/dev/cxgb/ulp/tom/cxgb_defs.h: #define KTR_TCB KTR_SPARE3
> sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c: CTR2(KTR_SPARE2, "wr_ack: snd_una=%u credits=%d", snd_una, credits);
> sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c: CTR1(KTR_SPARE2, "wr_ack: sbdrop(%d)", bytes);
> sys/dev/gem/if_gem.c: #define KTR_GEM KTR_SPARE2
> sys/dev/drm2/drmP.h: #define KTR_DRM_REG KTR_SPARE3
> sys/dev/hme/if_hme.c: #define KTR_HME KTR_SPARE2 /* XXX */
> sys/dev/cas/if_cas.c: #define KTR_CAS KTR_SPARE2
> sys/dev/ath/if_ath.c: #define ATH_KTR_INTR KTR_SPARE4
> sys/dev/ath/if_ath.c: #define ATH_KTR_ERR KTR_SPARE3
> sys/dev/ath/if_ath_rx.c: #define ATH_KTR_INTR KTR_SPARE4
> sys/dev/ath/if_ath_rx.c: #define ATH_KTR_ERR KTR_SPARE3
> sys/i386/xen/xen_machdep.c: CTR0(KTR_SPARE2, "ni_cli disabling interrupts");
> sys/i386/xen/xen_machdep.c: CTR2(KTR_SPARE2, "%x xen_restore_flags eflags %x", rebp(), eflags);
> sys/i386/xen/xen_machdep.c: CTR1(KTR_SPARE2, "%x xen_cli disabling interrupts", rebp());
> sys/i386/xen/xen_machdep.c: CTR1(KTR_SPARE2, "%x xen_sti enabling interrupts", rebp());
> sys/i386/i386/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
> sys/i386/i386/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
> sys/powerpc/powerpc/cpu.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
> sys/powerpc/powerpc/cpu.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
> sys/pc98/pc98/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d",
> sys/pc98/pc98/machdep.c: CTR2(KTR_SPARE2, "cpu_idle(%d) at %d done",
> sys/sparc64/sparc64/pmap.c: CTR5(KTR_SPARE2,
> sys/sparc64/sparc64/tsb.c: CTR5(KTR_SPARE2,
> sys/sparc64/include/bus.h: #define KTR_BUS KTR_SPARE2
>
> Most of this is in device drivers, which should use KTR_DEV. There is
> one major use of KTR_SPAREx in common code: KTR_SPARE2 is used for clock
> events. It is also used incorrectly by the sparc64 pmap core (there is
> a separate KTR_PMAP for that).
>
> I suggest that we
>
> 1) rename one of the spare KTRs to KTR_CLOCK and use that for clock
> events. I already have a patch for that.
>
> 2) eliminate all other use of KTR_SPARE[0-9] in non-device code. I
> think the existing KTRs should already cover most cases.
>
> 3) modify device drivers to use KTR_DEV for events that aren't covered
> by existing, more specific KTRs, which is almost none.
For instance,
> there is no reason why cxgb shouldn't just use KTR_NET.

Moving all device drivers to KTR_DEV makes the KTR unusable for device driver debugging. When looking at the drm2 and gem traces, I do not want to see other devices' tracepoints. The amount of data from GEM is huge, and obfuscating it with unrelated debugging recycles the ktr ring faster, aside from making noise that causes the log to be meaningless.

--bFUYW7mPOLJ+Jd2A Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/OGG8ACgkQC3+MBN1Mb4j1uQCgpc0bZke1nm1HxOMv4QRMdyZP nCAAoN2XUHUgTUNM8FXQc1bf50Co5ivR =mbRt -----END PGP SIGNATURE----- --bFUYW7mPOLJ+Jd2A-- From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 14:40:33 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D1B2D1065673 for ; Tue, 5 Jun 2012 14:40:33 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 92B7F8FC15 for ; Tue, 5 Jun 2012 14:40:33 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 92DCE6FE8; Tue, 5 Jun 2012 14:40:32 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 6B55D960F; Tue, 5 Jun 2012 16:40:32 +0200 (CEST) From: Dag-Erling Smørgrav To: Konstantin Belousov References: <86bokyvtc2.fsf@ds4.des.no> <20120605143215.GL85127@deviant.kiev.zoral.com.ua> Date: Tue, 05 Jun 2012 16:40:32 +0200 In-Reply-To: <20120605143215.GL85127@deviant.kiev.zoral.com.ua> (Konstantin Belousov's message of "Tue, 5 Jun 2012 17:32:15 +0300") Message-ID: <86pq9dvn33.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: arch@freebsd.org Subject: Re: 
KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 14:40:34 -0000

Konstantin Belousov writes:
> Moving all device drivers to KTR_DEV makes the KTR unusable for device
> driver debugging. When looking at the drm2 and gem traces, I do not want
> to see other devices' tracepoints. The amount of data from GEM is huge, and
> obfuscating it with unrelated debugging recycles the ktr ring faster, aside
> from making noise that causes the log to be meaningless.

We only have a limited number of KTR types - 32, to be precise. We can't spare one for each driver, and there's no reason why *your* driver (for any value of "you") should get its own while everybody else shares KTR_DEV.

If you think KTR_DEV is too noisy, add sysctls to enable or disable tracing on a per-device basis. It should be quite easy to generalize.

(I still haven't gotten around to implementing a similar infrastructure for network interfaces...)
DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 14:49:46 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DBC311065687 for ; Tue, 5 Jun 2012 14:49:46 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 3E4EB8FC19 for ; Tue, 5 Jun 2012 14:49:44 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q55EncnA070300; Tue, 5 Jun 2012 17:49:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q55Enc67000151; Tue, 5 Jun 2012 17:49:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q55Encuk000150; Tue, 5 Jun 2012 17:49:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 5 Jun 2012 17:49:38 +0300 From: Konstantin Belousov To: Dag-Erling Sm??rgrav Message-ID: <20120605144938.GN85127@deviant.kiev.zoral.com.ua> References: <86bokyvtc2.fsf@ds4.des.no> <20120605143215.GL85127@deviant.kiev.zoral.com.ua> <86pq9dvn33.fsf@ds4.des.no> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="CQDko/0aYvuiEzgn" Content-Disposition: inline In-Reply-To: <86pq9dvn33.fsf@ds4.des.no> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on 
skuns.kiev.zoral.com.ua Cc: arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 14:49:46 -0000 --CQDko/0aYvuiEzgn Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Jun 05, 2012 at 04:40:32PM +0200, Dag-Erling Smørgrav wrote:
> Konstantin Belousov writes:
> > Moving all device drivers to KTR_DEV makes the KTR unusable for device
> > driver debugging. When looking at the drm2 and gem traces, I do not want
> > to see other devices' tracepoints. The amount of data from GEM is huge, and
> > obfuscating it with unrelated debugging recycles the ktr ring faster, aside
> > from making noise that causes the log to be meaningless.
>
> We only have a limited number of KTR types - 32, to be precise. We
> can't spare one for each driver, and there's no reason why *your* driver
> (for any value of "you") should get its own while everybody else shares
> KTR_DEV.

I want to have only *my* driver trace points in the ring, by whatever means. Breaking it right now would mean that I cannot do any GEM debugging.

>
> If you think KTR_DEV is too noisy, add sysctls to enable or disable
> tracing on a per-device basis. It should be quite easy to generalize.

So you are planning to break some useful, but possibly randomly-achieved functionality, and delegate the work to repair it to somebody else?

>
> (I still haven't gotten around to implementing a similar infrastructure
> for network interfaces...)
>=20 > DES > --=20 > Dag-Erling Sm??rgrav - des@des.no --CQDko/0aYvuiEzgn Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/OHIEACgkQC3+MBN1Mb4ipIwCeNnkTQuffMM3uGSnbZt2zY5pU rl0AoMUPQEGfFH93xKOOz/jGwcHZ5BIQ =gP0D -----END PGP SIGNATURE----- --CQDko/0aYvuiEzgn-- From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 15:02:17 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0BDD41065670 for ; Tue, 5 Jun 2012 15:02:17 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id D142E8FC16 for ; Tue, 5 Jun 2012 15:02:16 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 441E1B95D; Tue, 5 Jun 2012 11:02:16 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Tue, 5 Jun 2012 09:47:48 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <86bokyvtc2.fsf@ds4.des.no> In-Reply-To: <86bokyvtc2.fsf@ds4.des.no> MIME-Version: 1.0 Content-Type: Text/Plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <201206050947.48750.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 05 Jun 2012 11:02:16 -0400 (EDT) Cc: Dag-Erling =?utf-8?q?Sm=C3=B8rgrav?= Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 15:02:17 -0000 On Tuesday, June 05, 2012 8:25:33 am Dag-Erling Sm=C3=B8rgrav wrote: > Most of this is in device drivers, 
which should use KTR_DEV. There is
> one major use of KTR_SPAREx in common code: KTR_SPARE2 is used for clock
> events. It is also used incorrectly by the sparc64 pmap core (there is
> a separate KTR_PMAP for that).
>
> I suggest that we
>
> 1) rename one of the spare KTRs to KTR_CLOCK and use that for clock
> events. I already have a patch for that.
>
> 2) eliminate all other use of KTR_SPARE[0-9] in non-device code. I
> think the existing KTRs should already cover most cases.
>
> 3) modify device drivers to use KTR_DEV for events that aren't covered
> by existing, more specific KTRs, which is almost none. For instance,
> there is no reason why cxgb shouldn't just use KTR_NET.

There is a reason in that you may want to only get those specific events and not drown in noise from the network stack itself for example. What I tend to do in drivers where I want to do this is have something like this:

#if 0
#define KTR_CXGB KTR_DEV
#else
#define KTR_CXGB 0
#endif

and then use 'KTR_CXGB' instead of 'KTR_DEV' or 'KTR_SPARE2' explicitly. It looks like most of the drivers are already doing this and if it is #if 0'd by default, then I would just let them be.

The two CTR()s in tom/cxgb_cpl_io.c should probably be using KTR_TOM instead of KTR_SPARE2 directly.

As a long term goal I would like to switch to using individual ints instead of a 32-bit bitmask as that would let us add new trace classes with ease. I haven't figured out a design for that that I fully like yet however.
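To make the effect of that pattern concrete, here is a userland stand-in, offered only as a sketch. CTR_SIM(), ktr_mask_sim, trace_count, CXGB_TRACE and cxgb_work() are all hypothetical names invented for illustration, and the numeric value given to KTR_DEV is an assumption; the real macros and class values live in sys/sys/ktr.h.

```c
#include <stdint.h>

#define KTR_DEV		0x00000004u	/* illustrative value only */

/* Driver-private alias: off unless the developer opts in at build time. */
#ifdef CXGB_TRACE
#define KTR_CXGB	KTR_DEV
#else
#define KTR_CXGB	0
#endif

static uint32_t ktr_mask_sim = KTR_DEV;	/* stand-in for the run-time mask */
static unsigned trace_count;		/* number of events "recorded" */

/* Record an event only if its class is enabled in the mask. */
#define CTR_SIM(cls, msg) do {						\
	if ((cls) & ktr_mask_sim)					\
		trace_count++;						\
} while (0)

static unsigned
cxgb_work(void)
{
	/* Class 0 never matches the mask, so this records nothing. */
	CTR_SIM(KTR_CXGB, "cxgb: did some work");
	return (trace_count);
}
```

With the alias defined as 0 the driver's tracepoints stay in the source but add nothing to the ring; building with -DCXGB_TRACE in this sketch routes them to the shared device class instead.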
=2D-=20 John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 15:02:18 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B188E106566B; Tue, 5 Jun 2012 15:02:18 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 826C18FC08; Tue, 5 Jun 2012 15:02:18 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id D5EA7B91A; Tue, 5 Jun 2012 11:02:17 -0400 (EDT) From: John Baldwin To: "Dag-Erling =?utf-8?q?Sm=C3=B8rgrav?=" Date: Tue, 5 Jun 2012 10:08:29 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <201206041053.51802.jhb@freebsd.org> <86y5o1vrer.fsf@ds4.des.no> In-Reply-To: <86y5o1vrer.fsf@ds4.des.no> MIME-Version: 1.0 Content-Type: Text/Plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <201206051008.29568.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 05 Jun 2012 11:02:17 -0400 (EDT) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 15:02:18 -0000 On Tuesday, June 05, 2012 9:07:08 am Dag-Erling Sm=C3=B8rgrav wrote: > John Baldwin writes: > > I think this is an important question actually. Is there anything > > that really needs to be here besides gettimeofday()? 
I mean, is there > > any real-world application that needs to call getpid() or getppid() a > > bunch of times? >=20 > Yes, for fork detection when accessing resources shared between > descendants of the process that allocated them. So you call getpid() on each access to a shared resource? =2D-=20 John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 15:06:36 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C3FCE1065670 for ; Tue, 5 Jun 2012 15:06:36 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 84CC38FC12 for ; Tue, 5 Jun 2012 15:06:36 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 7032A600C; Tue, 5 Jun 2012 15:06:35 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 3CB259616; Tue, 5 Jun 2012 17:06:35 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Konstantin Belousov References: <86bokyvtc2.fsf@ds4.des.no> <20120605143215.GL85127@deviant.kiev.zoral.com.ua> <86pq9dvn33.fsf@ds4.des.no> <20120605144938.GN85127@deviant.kiev.zoral.com.ua> Date: Tue, 05 Jun 2012 17:06:34 +0200 In-Reply-To: <20120605144938.GN85127@deviant.kiev.zoral.com.ua> (Konstantin Belousov's message of "Tue, 5 Jun 2012 17:49:38 +0300") Message-ID: <86lik1vlvp.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 15:06:36 -0000 Konstantin Belousov writes: > Dag-Erling Sm=C3=B8rgrav writes: >> We only 
have a limited number of KTR types - 32, to be precise. We
>> can't spare one for each driver, and there's no reason why *your* driver
>> (for any value of "you") should get its own while everybody else shares
>> KTR_DEV.
> I want to have only *my* driver trace points in the ring, by whatever
> means. Breaking it right now would mean that I cannot do any GEM
> debugging.

Well, so does everybody else. Here is a list of files that use the same KTR that you use for GEM (KTR_SPARE2):

sys/kern/kern_clocksource.c
sys/amd64/amd64/machdep.c
sys/dev/cxgb/cxgb_osdep.h
sys/dev/cxgb/ulp/tom/cxgb_defs.h
sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c
sys/dev/gem/if_gem.c
sys/dev/hme/if_hme.c
sys/dev/cas/if_cas.c
sys/i386/xen/xen_machdep.c
sys/i386/i386/machdep.c
sys/powerpc/powerpc/cpu.c
sys/pc98/pc98/machdep.c
sys/sparc64/sparc64/pmap.c
sys/sparc64/sparc64/tsb.c
sys/sparc64/include/bus.h

Note that sys/*/*/machdep.c issue a KTR_SPARE2 event every time the CPU enters or exits the idle thread.

> > If you think KTR_DEV is too noisy, add sysctls to enable or disable
> > tracing on a per-device basis. It should be quite easy to generalize.
> So you are planning to break some useful, but possibly randomly-achieved
> functionality, and delegate the work to repair it to somebody else?

It's already broken, and you're one of the people responsible for breaking it. I'm trying to fix it.
DES --=20 Dag-Erling Sm=C3=B8rgrav - des@des.no From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 15:18:55 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 71AE81065670 for ; Tue, 5 Jun 2012 15:18:55 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by mx1.freebsd.org (Postfix) with ESMTP id E0EF18FC1F for ; Tue, 5 Jun 2012 15:18:54 +0000 (UTC) Received: by lbon10 with SMTP id n10so5079617lbo.13 for ; Tue, 05 Jun 2012 08:18:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=MHENfAEasMRpGfIstiRUJCKWvR7ZEbTuk4j/iUM9wFs=; b=A/4yJKtw07vu5KBI2n14+3AZju0hqGudY+FqTstSr7tkfyb5pWklTPUR3Ts/mqaZDE x+skyRSxWUpZ+ASX1lHoXvSjlTaWyhNR6NVqwG2g36n1xmnTCVdAD/PtJtIOCvvy5AZv kwMzQuPwUXh34XBEwKJozzJ5QXIWFc/RSvxRgCvHKU+nP8R76FRTUrzxB4TdtqH4Qpqd xchDcfT+0WdTRuTjYI3LSxoM5lxOYa8HnXv2QGM9evjHKTDjNYNTLo5td3zCwEGRqOjA 6+ahlmdtFHyTerNmrx1000lLKmUKQ9Pgm6mI3OnRD7ysmICCJMnZ8cXzOVsxcYNd4hah oXNQ== MIME-Version: 1.0 Received: by 10.152.104.171 with SMTP id gf11mr17537340lab.5.1338909533497; Tue, 05 Jun 2012 08:18:53 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Tue, 5 Jun 2012 08:18:53 -0700 (PDT) In-Reply-To: <86bokyvtc2.fsf@ds4.des.no> References: <86bokyvtc2.fsf@ds4.des.no> Date: Tue, 5 Jun 2012 16:18:53 +0100 X-Google-Sender-Auth: yoGy18sgeGBgc5IfgGCTkwhD24c Message-ID: From: Attilio Rao To: =?UTF-8?Q?Dag=2DErling_Sm=C3=B8rgrav?= Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 15:18:55 -0000

2012/6/5 Dag-Erling Smørgrav :
> While working on Capsicum last year, I noticed that some of the spare
> KTR types are (ab)used for different purposes by different parts of the
> code. KTR_SPARE[234] are all documented as "/* XXX Used by cxgb */",
> but KTR_SPARE3, for instance, is widely used for clock events. Here is
> a complete list:

The truth is, KTR is meant to be a mechanism for tracing events "on-the-fly", but the very limited set of masks/classes of events it provides makes this completely useless. I don't recall a case where I did not have to patch the KTR knobs manually to do actual debugging.

What I really would like to see is:
- Of course remove the bogus usage of KTR_SPAREx in the drivers
- Make the mask of events much bigger than the current one
- Enlarge the number of KTR_SPARE available (16 would be ok)
- By default have KTR_SPARE0-15 to be on in the kernel along with KTR option, or maybe when the kernel is still in the debugging phase (but leave in a knob for disabling it)
- Use the dynamic masking system to just mask the SPARE you are interested in. This way your driver can simply use a KTR_SPARE for development and you will mask out the right one at run time.

Attilio

-- 
Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 15:20:02 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3A8A4106564A for ; Tue, 5 Jun 2012 15:20:02 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by mx1.freebsd.org (Postfix) with ESMTP id A73818FC0A for ; Tue, 5 Jun 2012 15:20:01 +0000 (UTC) Received: by lbon10 with SMTP id n10so5080875lbo.13 for ; Tue, 05 Jun 2012 08:20:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=7zHR9gqF6/POmh+5xGUSHk6TJRkKwirfO5Sewq48GJg=; b=P/AFgL1uecmxbDpA3bPz4lOJjDMQBQ+TAVyPdO3Q5Imh0xZoYk+xExa0gpF39U4Jzn 5FjNJJAxk5dhOSY3mJwIVphn9wU+RAAJc/2lE0nhJDZ1IGqi7ujbNwWD80s3w0zPwtAd eqh36srOF6cEGb5KrULG4+uRIi67OAdKW1twmMIwknnFB2Miq2J/Rcju3vjspxZr63KG LfRtximyUGcs14ftGZKKwILDEAa5fIeaCa9lbaXP+jDtfkASR4OcITCihP9Hjc0sqb1q Ds0ymugdg3v4q+NKQEKob7i2/YEAJGrG7vYCmTRpADywMzJOrbIVQrdmOueEVJ1KFomn XGCg== MIME-Version: 1.0 Received: by 10.112.42.34 with SMTP id k2mr8325423lbl.0.1338909600597; Tue, 05 Jun 2012 08:20:00 -0700 (PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Tue, 5 Jun 2012 08:20:00 -0700 (PDT) In-Reply-To: References: <86bokyvtc2.fsf@ds4.des.no> Date: Tue, 5 Jun 2012 16:20:00 +0100 X-Google-Sender-Auth: h5OrgPT3WFakYIK2E8w_Xp9Q5o4 Message-ID: From: Attilio Rao To: =?UTF-8?Q?Dag=2DErling_Sm=C3=B8rgrav?= Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Tue, 05 Jun 2012 15:20:02 -0000

2012/6/5 Attilio Rao:
> 2012/6/5 Dag-Erling Smørgrav:
>> While working on Capsicum last year, I noticed that some of the spare
>> KTR types are (ab)used for different purposes by different parts of the
>> code.  KTR_SPARE[234] are all documented as "/* XXX Used by cxgb */",
>> but KTR_SPARE3, for instance, is widely used for clock events.  Here is
>> a complete list:
>
> The truth is, KTR is meant to be a mechanism for tracing events
> "on the fly", but the very limited set of event masks/classes it
> provides makes this completely useless.
> I don't recall a case where I did not have to patch KTR knobs manually
> to do actual debugging.
>
> What I really would like to see is:
> - Of course remove the bogus usage of KTR_SPAREX in the drivers
> - Make the mask of events much bigger than the current one
> - Enlarge the number of KTR_SPARE available (16 would be ok)
> - By default have KTR_SPARE0-15 on in the kernel along with the KTR
> option, or maybe when the kernel is still in the debugging phase (but
> leave in a knob for disabling it)
> - Use the dynamic masking system to just mask the SPARE you are
> interested in. This way your driver can simply use a KTR_SPARE for
> development and you will mask out the right one at run time.

Forgot to mention, even if this is mostly unrelated to your point: we
should do a better job of further breaking up the current set of KTR
classes on a per-subsystem basis. KTR_VFS or KTR_VM (and others) are far
too large right now.

Attilio

-- 
Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 15:44:39 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 166C710656E9; Tue, 5 Jun 2012 15:44:39 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id C54408FC16; Tue, 5 Jun 2012 15:44:38 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id BF56D603A; Tue, 5 Jun 2012 15:44:37 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 93533961D; Tue, 5 Jun 2012 17:44:37 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: John Baldwin References: <201206041053.51802.jhb@freebsd.org> <86y5o1vrer.fsf@ds4.des.no> <201206051008.29568.jhb@freebsd.org> Date: Tue, 05 Jun 2012 17:44:37 +0200 In-Reply-To: <201206051008.29568.jhb@freebsd.org> (John Baldwin's message of "Tue, 5 Jun 2012 10:08:29 -0400") Message-ID: <86haupvk4a.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 15:44:39 -0000 John Baldwin writes: > So you call getpid() on each access to a shared resource? I don't, but I've seen code that does, under the assumption that all the world is Linux and getpid() is free. 
Here's a sample from RHEL6 on a 3.1 GHz i5, using raise(0) as a baseline:

  getpid():           10,000,000 iterations in 24,400 ms
  gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms
  raise(0):           10,000,000 iterations in 1,284,593 ms

The difference between the first two is due to the fact that while
getpid() just returns a constant, gettimeofday(0, 0) performs two
comparisons first. Passing an actual struct timeval to gettimeofday()
slows it down by a factor of about 6.

(strace confirms that no system calls occur for either getpid() or
gettimeofday(0, 0))

Here is the same program running on FreeBSD 9.0-RELEASE in VirtualBox on
an otherwise idle 3.4 GHz i7:

  getpid():           10,000,000 iterations in 777,251 ms
  gettimeofday(0, 0): 10,000,000 iterations in 799,808 ms
  raise(0):           10,000,000 iterations in 2,142,275 ms

DES
-- 
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 16:22:22 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DF196106566C; Tue, 5 Jun 2012 16:22:22 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id AE2A28FC0A; Tue, 5 Jun 2012 16:22:22 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 2775FB91A; Tue, 5 Jun 2012 12:22:22 -0400 (EDT) From: John Baldwin To: "Dag-Erling =?utf-8?q?Sm=C3=B8rgrav?=" Date: Tue, 5 Jun 2012 12:22:12 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> In-Reply-To: <86haupvk4a.fsf@ds4.des.no> MIME-Version: 1.0 Content-Type: Text/Plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <201206051222.12627.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not
delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 05 Jun 2012 12:22:22 -0400 (EDT) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 16:22:23 -0000

On Tuesday, June 05, 2012 11:44:37 am Dag-Erling Smørgrav wrote:
> John Baldwin writes:
> > So you call getpid() on each access to a shared resource?
>
> I don't, but I've seen code that does, under the assumption that all the
> world is Linux and getpid() is free. Here's a sample from RHEL6 on a
> 3.1 GHz i5, using raise(0) as a baseline:
>
>   getpid():           10,000,000 iterations in 24,400 ms
>   gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms
>   raise(0):           10,000,000 iterations in 1,284,593 ms
>
> The difference between the first two is due to the fact that while
> getpid() just returns a constant, gettimeofday(0, 0) performs two
> comparisons first. Passing an actual struct timeval to gettimeofday()
> slows it down by a factor of about 6.
>
> (strace confirms that no system calls occur for either getpid() or
> gettimeofday(0, 0))
>
> Here is the same program running on FreeBSD 9.0-RELEASE in VirtualBox on
> an otherwise idle 3.4 GHz i7:
>
>   getpid():           10,000,000 iterations in 777,251 ms
>   gettimeofday(0, 0): 10,000,000 iterations in 799,808 ms
>   raise(0):           10,000,000 iterations in 2,142,275 ms

Yes, we know getpid() is slow, I think the question is does it matter that
it's slow in something other than a microbenchmark. Can you name the
application that you've seen use getpid()?
=2D-=20 John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 16:56:11 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C513F106567B; Tue, 5 Jun 2012 16:56:11 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 782AF8FC0C; Tue, 5 Jun 2012 16:56:11 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 9F9CA7300B; Tue, 5 Jun 2012 19:14:46 +0200 (CEST) Date: Tue, 5 Jun 2012 19:14:46 +0200 From: Luigi Rizzo To: John Baldwin Message-ID: <20120605171446.GA28387@onelab2.iet.unipi.it> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201206051222.12627.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov , Dag-Erling Sm??rgrav Subject: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 16:56:11 -0000 On Tue, Jun 05, 2012 at 12:22:12PM -0400, John Baldwin wrote: > On Tuesday, June 05, 2012 11:44:37 am Dag-Erling Sm??rgrav wrote: > > John Baldwin writes: > > > So you call getpid() on each access to a shared resource? > > > > I don't, but I've seen code that does, under the assumption that all the > > world is Linux and getpid() is free. 
Here's a sample from RHEL6 on a > > 3.1 GHz i5, using raise(0) as a baseline: > > > > getpid(): 10,000,000 iterations in 24,400 ms > > gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms > > raise(0): 10,000,000 iterations in 1,284,593 ms > > > > The difference between the first two is due to the fact that while > > getpid() just returns a constant, gettimeofday(0, 0) performs two > > comparisons first. Passing an actual struct timeval to gettimeofday() > > slows it down by a factor of about 6. > > > > (strace confirms that no system calls occur for either getpid() or > > gettimeofday(0, 0)) > > > > Here is the same program running on FreeBSD 9.0-RELEASE in VirtualBox on > > an otherwise idle 3.4 GHz i7: > > > > getpid(): 10,000,000 iterations in 777,251 ms > > gettimeofday(0, 0): 10,000,000 iterations in 799,808 ms > > raise(0): 10,000,000 iterations in 2,142,275 ms > > Yes, we know getpid() is slow, I think the question is does it matter that > it's slow in something other than a microbenchmark. Can you name the > application that you've seen use getpid()? i think the important question is, for any function X: Q1 "does it require horrible hacks or a huge amount of work to make X syscall-free ?" rather than Q2 "does it matter to make X fast" If the answer to Q1 is "no" then there is no question we should try to implement it. Clearly the answer changes depending on the infrastructure we have in place (e.g. without some shared kernel page we could not export gettimeofday() calibration data, or PID numbers, etc). And if we really want to educate people to use syscalls in a sensible way (which I do see as a valuable goal, just not always) we could always use an environment variable, LIBC_OPTIONS, which enables or disables certain optimizations, similar to MALLOC_OPTIONS. 
cheers luigi or > -- > John Baldwin > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 17:26:22 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4EBA21065672; Tue, 5 Jun 2012 17:26:22 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay06.ispgateway.de (smtprelay06.ispgateway.de [80.67.31.104]) by mx1.freebsd.org (Postfix) with ESMTP id 0A2438FC19; Tue, 5 Jun 2012 17:26:22 +0000 (UTC) Received: from [87.79.196.217] (helo=fabiankeil.de) by smtprelay06.ispgateway.de with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1SbxTH-0008PR-Pd; Tue, 05 Jun 2012 19:23:07 +0200 Date: Tue, 5 Jun 2012 19:15:45 +0200 From: Fabian Keil To: George Neville-Neil Message-ID: <20120605191545.65779e1e@fabiankeil.de> In-Reply-To: References: <86wr40tfhf.wl%gnn@neville-neil.com> <20120528190300.3a43fc8d@fabiankeil.de> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/Lte0l0kRmu9E+IA_EkjArnW"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: arch@freebsd.org Subject: Re: RFC: A trial io provider for DTrace... X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 17:26:22 -0000 --Sig_/Lte0l0kRmu9E+IA_EkjArnW Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable George Neville-Neil wrote: > On May 28, 2012, at 13:03 , Fabian Keil wrote: > >> Remember you need to be root to use DTrace. 
> >=20 > > Do you intent to eventually commit your patch to get dtrace working > > with sudo? I've been using it since you posted it last October and > > haven't seen any issues. > > http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028120.= html > >=20 >=20 > Sorry, what I meant was that you needed root privilege to run DTrace, > sudo will give you that. I got that, but was under the impression that the patch was still necessary to get dtrace working with sudo and thus was surprised that it hadn't been committed yet. Apparently I missed the memo that the problem has already been fixed differently and the patch is no longer required. Fabian --Sig_/Lte0l0kRmu9E+IA_EkjArnW Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAk/OPsgACgkQBYqIVf93VJ3DUgCfXiSZjUbQEKXruMiHKUXpUesO pyQAnjQ+/hVdoHfPqpqmmPr7hFCVHL22 =LRiD -----END PGP SIGNATURE----- --Sig_/Lte0l0kRmu9E+IA_EkjArnW-- From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 17:40:16 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DEDEF10656E8; Tue, 5 Jun 2012 17:40:16 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 37A748FC12; Tue, 5 Jun 2012 17:40:15 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 3E1F760A6; Tue, 5 Jun 2012 17:40:14 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id D346E962F; Tue, 5 Jun 2012 19:40:13 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: John Baldwin References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> Date: Tue, 05 Jun 2012 19:40:13 +0200 In-Reply-To: 
<201206051222.12627.jhb@freebsd.org> (John Baldwin's message of "Tue, 5 Jun 2012 12:22:12 -0400") Message-ID: <868vg1verm.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 17:40:17 -0000

John Baldwin writes:
> Yes, we know getpid() is slow, I think the question is does it matter that
> it's slow in something other than a microbenchmark. Can you name the
> application that you've seen use getpid()?

I've seen it in a proprietary multi-platform shared memory library.
Closer to home, I believe sqlite3 does the same thing, and we do this
ourselves, albeit on a smaller, non-performance-critical scale, e.g. in
the pidfile API and (IIRC) in nsswitch and the resolver.

BTW, raise(0) was a poor choice of baseline since it actually calls
getpid(), which makes no difference on Linux but does on FreeBSD.
The actual numbers for FreeBSD are:

  getpid():           10,000,000 iterations in 784,638 ms
  gettimeofday(0, 0): 10,000,000 iterations in 801,375 ms
  kill(pid, 0):       10,000,000 iterations in 1,190,791 ms

DES
-- 
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 18:37:10 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6E4D41065673; Tue, 5 Jun 2012 18:37:10 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by mx1.freebsd.org (Postfix) with ESMTP id EFC478FC1B; Tue, 5 Jun 2012 18:37:09 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q55IassO020480 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 6 Jun 2012 04:36:56 +1000 Date: Wed, 6 Jun 2012 04:36:54 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Luigi Rizzo In-Reply-To: <20120605171446.GA28387@onelab2.iet.unipi.it> Message-ID: <20120606040931.F1050@besplex.bde.org> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Mailman-Approved-At: Tue, 05 Jun 2012 18:47:47 +0000 Cc: Gianni , John Baldwin , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org, Konstantin Belousov , Dag-Erling Smørgrav Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: ,
X-List-Received-Date: Tue, 05 Jun 2012 18:37:10 -0000

On Tue, 5 Jun 2012, Luigi Rizzo wrote:

> On Tue, Jun 05, 2012 at 12:22:12PM -0400, John Baldwin wrote:
>> On Tuesday, June 05, 2012 11:44:37 am Dag-Erling Smørgrav wrote:
>>> John Baldwin writes:
>>>> So you call getpid() on each access to a shared resource?
>>>
>>> I don't, but I've seen code that does, under the assumption that all the
>>> world is Linux and getpid() is free. Here's a sample from RHEL6 on a
>>> 3.1 GHz i5, using raise(0) as a baseline:
>>>
>>>   getpid():           10,000,000 iterations in 24,400 ms
>>>   gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms
>>>   raise(0):           10,000,000 iterations in 1,284,593 ms

That's one slow system or broken units. 24.4 seconds for 10 million
"syscalls" in the fastest case? If the comma is really a decimal point,
then 24.4 milliseconds makes sense, but then the number of iterations
would be only 10, with the second comma being a syntax error. If ms
actually means microseconds, then someone should fix ping(1) to stop
pretending that it is 1000 times as fast as it is.

After adjusting by factors of 1000 here and there, this format is still
hard to parse. I like the format of nsec/operation. 10 million
operations in 24400 moroseconds seems to scale to 2.44 nsec/call (if
1 moro = 1 micro). But that is impossibly fast, unless getpid() is
inlined to a load of the shared variable (it may also need the load to
be moved outside the loop). I can't see any reasonable adjustment that
gives 24.4 nsec/call.

>>> The difference between the first two is due to the fact that while
>>> getpid() just returns a constant, gettimeofday(0, 0) performs two
>>> comparisons first. Passing an actual struct timeval to gettimeofday()
>>> slows it down by a factor of about 6.
>>>
>>> (strace confirms that no system calls occur for either getpid() or
>>> gettimeofday(0, 0))
>>>
>>> Here is the same program running on FreeBSD 9.0-RELEASE in VirtualBox on
>>> an otherwise idle 3.4 GHz i7:
>>>
>>>   getpid():           10,000,000 iterations in 777,251 ms
>>>   gettimeofday(0, 0): 10,000,000 iterations in 799,808 ms
>>>   raise(0):           10,000,000 iterations in 2,142,275 ms

2142.275 seconds is really slow.

>> Yes, we know getpid() is slow, I think the question is does it matter that
>> it's slow in something other than a microbenchmark. Can you name the
>> application that you've seen use getpid()?
>
> i think the important question is, for any function X:
> Q1 "does it require horrible hacks or a huge amount of work
> to make X syscall-free ?"
> rather than
> Q2 "does it matter to make X fast"

s/huge amount/any/

Work is all the programming work to implement it and maintain it forever.

> If the answer to Q1 is "no" then there is no question
> we should try to implement it.

The answer is sure to be "no", but you should try to implement it to see
if it is easier or works better than expected.
Bruce From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 18:57:38 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4F3351065670; Tue, 5 Jun 2012 18:57:38 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id D4F048FC1E; Tue, 5 Jun 2012 18:57:37 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q55IvTj9015074 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 6 Jun 2012 04:57:30 +1000 Date: Wed, 6 Jun 2012 04:57:29 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Andrey Chernov In-Reply-To: <20120605130922.GE13306@vniz.net> Message-ID: <20120606043731.D1124@besplex.bde.org> References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <20120605130922.GE13306@vniz.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.org Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 18:57:38 -0000 On Tue, 5 Jun 2012, Andrey Chernov wrote: > On Tue, Jun 05, 2012 at 09:47:42AM +0200, Pawel Jakub Dawidek wrote: >>> "The setting of errno after a successful call to a function is >>> unspecified unless the description of that function specifies that >>> errno shall not be modified." >> >> Very interesting. 
However free(3) is always successful. Maybe we need
>> more context here, but the sentence above might talk about functions
>> that can either succeed or fail and such functions do set errno on
>> failure, but we don't know what they do to errno on success - they
>> sometimes interact with the errno, free(3) never does.
>
> According to the Austin Group interpretation, this sentence talks about
> functions which always succeed too, please see
> http://austingroupbugs.net/view.php?id=385

This has very little to do with POSIX. It is a basic part of Standard C
that the C library may, at its option, clobber errno, gratuitously or
otherwise. From n869.txt:

    [#3] The value of errno is zero at program startup, but is
    never set to zero by any library function.159) The value
    of errno may be set to nonzero by a library function call
    whether or not there is an error, provided the use of errno
    is not documented in the description of the function in
    this International Standard.

Use of errno is not documented for free(); thus free() is permitted to
clobber errno.

POSIX may require errno to not be clobbered, especially for its functions.
It probably shouldn't do this for Standard C library functions like free(),
since this would be an extension and any use of the extension would give
unnecessarily unportable code.

>> I am aware that my interpretation might be too wishful, but it is pretty
>> obvious to save errno value when calling a function that can eventually
>> fail - when we save the errno we don't know if it will fail or not, so
>> we have to do that, but requiring to save errno when calling a void
>> function that can't fail is simply silly and complicates the code
>> without a reason.

This has very little to do with success or failure. It does complicate
the code for callers, but actually simplifies the library. Since most
library functions aren't required to preserve errno, they can call each
other without having to save and restore errno.
> It still can fail due to internal errors, it just does not return failure.
> For internal errors POSIX states that errno state is unspecified.
>
>> I agree that the standards aren't clear, but if saving errno around
>> free(3) is the way to go, then I'm sure we have much more problems in
>> our code, even if it is not supposed to be portable it should be correct
>> - we never know who and when will take the code and port it.
>
> Currently they are pretty clear on that point, although I agree that if
> POSIX says it should not modify errno, life will be easier. Let's look at
> their further movement, since they are already aware of this specific
> problem.

They are perfectly clear.

>> I guess what I'm trying to say here is that this is a much bigger change
>> than it looks and we should probably agree on some global rule here.
>
> ...which does not violate standards.

Yes, its completion is a very large and ugly change.

realpath() is a POSIX interface, so any code that implements or uses it
can safely assume POSIX requirements. But non-POSIX code can only safely
assume Standard C requirements. OTOH, the library can assume anything
that it wants and implements for itself, since it is the implementation,
so it can make free() easy to use for itself, with any extensions that
aren't incompatible with Standard C. Since free() is allowed to clobber
errno, it is also allowed to do a null clobber as a compatible extension.
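
For code that may only assume Standard C, the caller-side pattern under discussion looks roughly like the sketch below. This is only an illustration, not the realpath() change from r236582: open_with_scratch() and its scratch buffer are hypothetical, and the only point is the save/restore of errno around free() on the error path.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocates a scratch buffer, then opens `path`.
 * Returns NULL on failure with errno describing the fopen() error.
 */
static FILE *open_with_scratch(const char *path)
{
    char *scratch = malloc(4096);
    FILE *fp;

    if (scratch == NULL)
        return NULL;

    fp = fopen(path, "r");
    if (fp == NULL) {
        int saved_errno = errno;   /* remember why fopen() failed */

        free(scratch);             /* Standard C lets this change errno */
        errno = saved_errno;       /* report the original error */
        return NULL;
    }

    free(scratch);                 /* success path: errno is not meaningful */
    return fp;
}
```

The three extra lines on the error path are the cost to callers mentioned above; they become unnecessary only if the implementation promises that free() leaves errno alone.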
Bruce From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 18:44:53 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F28541065672; Tue, 5 Jun 2012 18:44:52 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id A30FC8FC20; Tue, 5 Jun 2012 18:44:52 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 3A39C7300A; Tue, 5 Jun 2012 21:03:34 +0200 (CEST) Date: Tue, 5 Jun 2012 21:03:34 +0200 From: Luigi Rizzo To: Bruce Evans Message-ID: <20120605190334.GB29067@onelab2.iet.unipi.it> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120606040931.F1050@besplex.bde.org> User-Agent: Mutt/1.4.2.3i X-Mailman-Approved-At: Tue, 05 Jun 2012 19:09:36 +0000 Cc: Gianni , John Baldwin , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org, Konstantin Belousov , Dag-Erling Sm??rgrav Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 18:44:53 -0000 On Wed, Jun 06, 2012 at 04:36:54AM +1000, Bruce Evans wrote: > On Tue, 5 Jun 2012, Luigi Rizzo wrote: ... > >>Yes, we know getpid() is slow, I think the question is does it matter that > >>it's slow in something other than a microbenchmark. Can you name the > >>application that you've seen use getpid()? 
> > > >i think the important question is, for any function X: > > Q1 "does it require horrible hacks or a huge amount of work > > to make X syscall-free ?" > >rather than > > Q2 "does it matter to make X fast" > > s/huge amount/any/ > > Work is all the programming work to implement it and maintain it forever. well, some work has a return in term of fun, beauty, pride so the balance is still favourable. cheers luigi From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 19:18:39 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 988711065673; Tue, 5 Jun 2012 19:18:39 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-pb0-f54.google.com (mail-pb0-f54.google.com [209.85.160.54]) by mx1.freebsd.org (Postfix) with ESMTP id 653698FC15; Tue, 5 Jun 2012 19:18:39 +0000 (UTC) Received: by pbbro2 with SMTP id ro2so8367613pbb.13 for ; Tue, 05 Jun 2012 12:18:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=bxQJST5Khq0Vf22zizVcAw4ZL6jRkQ1IJtPSQo0K5sE=; b=iCaaTo0BJQMJozginIS4a6y0m/9aLr4jAQR6nPdoTTbJCNIwF0nAQ4FMDWfrp2mrbU +H13BsCCYGBEK2+KyLxDSSlv85yqHsfK2pLvdd8vFHIi/Lsl6kP+a6r8hu6zQw1iRzby Eh64zrQPfP4J3x1TNlU0Mo/eI/qnXBKnyHukGEVQ+yM+U79+DCUkdediDuzG1d935QKS UUZsahrmM3wEhcaNN7iH+QlO5SIw4X7ZLCPJTAlF6BeB5WlFBn9dvVyBCa1VDgWs8jq+ PD6fAdA9fLedYAhE4Ljk6joIE0O7vZovJuJZJeTwUEN3BI/FE+NkL/WtCyVjmj/mHoY5 yrCA== MIME-Version: 1.0 Received: by 10.68.211.170 with SMTP id nd10mr15393689pbc.68.1338923918908; Tue, 05 Jun 2012 12:18:38 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.143.91.18 with HTTP; Tue, 5 Jun 2012 12:18:38 -0700 (PDT) In-Reply-To: References: <86bokyvtc2.fsf@ds4.des.no> Date: Tue, 5 Jun 2012 12:18:38 -0700 X-Google-Sender-Auth: DSeJkA5dK22r5PTP8ZQcZH3U_pA Message-ID: From: Adrian Chadd To: 
Attilio Rao Content-Type: text/plain; charset=ISO-8859-1 Cc: =?ISO-8859-1?Q?Dag=2DErling_Sm=F8rgrav?= , arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 19:18:39 -0000 Hi, I'm very tempted to make if_ath use KTR_DEV, but then have an extra ath sysctl which does something like: if (sc->sc_ktr_enable) KTR(); Adrian From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 19:41:05 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CCEDF1065749; Tue, 5 Jun 2012 19:41:05 +0000 (UTC) (envelope-from ache@vniz.net) Received: from vniz.net (vniz.net [194.87.13.69]) by mx1.freebsd.org (Postfix) with ESMTP id 2C4BE8FC19; Tue, 5 Jun 2012 19:41:04 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by vniz.net (8.14.5/8.14.5) with ESMTP id q55Jf2ab021290; Tue, 5 Jun 2012 23:41:03 +0400 (MSK) (envelope-from ache@vniz.net) Received: (from ache@localhost) by localhost (8.14.5/8.14.5/Submit) id q55Jf25B021289; Tue, 5 Jun 2012 23:41:02 +0400 (MSK) (envelope-from ache) Date: Tue, 5 Jun 2012 23:41:02 +0400 From: Andrey Chernov To: Bruce Evans Message-ID: <20120605194102.GA21173@vniz.net> Mail-Followup-To: Andrey Chernov , Bruce Evans , svn-src-head@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <20120605130922.GE13306@vniz.net> <20120606043731.D1124@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120606043731.D1124@besplex.bde.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: svn-src-head@FreeBSD.ORG, 
svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 19:41:05 -0000

On Wed, Jun 06, 2012 at 04:57:29AM +1000, Bruce Evans wrote:
> POSIX may require errno to not be clobbered, especially for its functions.
> It probably shouldn't do this for Standard C library functions like free(),
> since this would be an extension and any use of the extension would give
> unnecessarily unportable code.

POSIX seems to feel that it owns all Standard C functions now. See
"Resolved state" text for upcoming standard there:

"At line 30583 [XSH free DESCRIPTION], add a paragraph with CX shading:

The free() function shall not modify errno if ptr is a null pointer
or a pointer previously returned as if by malloc() and not yet
deallocated.

At line 30591 [APPLICATION USAGE], add a new paragraph:

Because the free() function does not modify errno for valid pointers, it
is safe to use it in cleanup code without corrupting earlier errors, ..."

> OTOH, the library can assume anything that it wants and
> implements for itself, since it is the implementation so it can make
> free() easy to use for itself, with any extensions that aren't incompatible
> with Standard C. Since free() is allowed to clobber errno, it is also
> allowed to do a null clobber as a compatible extension.

Yes, it is safe for free() itself to save errno and still stay compliant
with both current and upcoming POSIX and with Standard C. But any code
which relies on that is compliant with upcoming POSIX only. Since people
don't want mass changes in that area, this is some sort of compromise
acceptable to me (in case free() itself will save/restore errno, of
course).
-- http://ache.vniz.net/ From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 20:11:05 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7D2BF1065673; Tue, 5 Jun 2012 20:11:05 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by mx1.freebsd.org (Postfix) with ESMTP id 0813E8FC18; Tue, 5 Jun 2012 20:11:04 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q55KB1HC016442 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 6 Jun 2012 06:11:02 +1000 Date: Wed, 6 Jun 2012 06:11:01 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Andrey Chernov In-Reply-To: <20120605194102.GA21173@vniz.net> Message-ID: <20120606054555.U1456@besplex.bde.org> References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <20120605130922.GE13306@vniz.net> <20120606043731.D1124@besplex.bde.org> <20120605194102.GA21173@vniz.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , svn-src-all@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG, svn-src-head@FreeBSD.ORG Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 20:11:05 -0000 On Tue, 5 Jun 2012, Andrey Chernov wrote: > On Wed, Jun 06, 2012 at 04:57:29AM +1000, Bruce Evans wrote: >> POSIX may require errno to not be clobbered, especially for its functions. 
>> It probably shouldn't do this for Standard C library functions like free(),
>> since this would be an extension and any use of the extension would give
>> unnecessarily unportable code.
>
> POSIX seems to feel that it owns all Standard C functions now. See

Not really. They can extend anything they want...

> "Resolved state" text for upcoming standard there:
>
> "At line 30583 [XSH free DESCRIPTION], add a paragraph with CX shading:
>
> The free() function shall not modify errno if ptr is a null pointer
> or a pointer previously returned as if by malloc() and not yet
> deallocated.

...but they have to mark it as an extension, as they do here.

> At line 30591 [APPLICATION USAGE], add a new paragraph:
>
> Because the free() function does not modify errno for valid pointers, it
> is safe to use it in cleanup code without corrupting earlier errors, ..."

This is essentially unusable (so a bad idea). Instead of unconditionally
saving and restoring errno around calls to free(), portable POSIX code
can soon use a messy ifdef to avoid doing this in some cases, but still
has to do it in other cases. The result is just bloat and complexity
at the source level:

#if _POSIX_VERSION < mumble
	int sverrno;
#endif
	...
	if (wantfree)
#if _POSIX_VERSION < mumble
	{			/* I made these braces conditional ... */
		sverrno = errno;
#endif
		free(p);
#if _POSIX_VERSION < mumble
		errno = sverrno;
	}			/* ... to maximise the ugliness */
#endif

>> OTOH, the library can assume anything that it wants and
>> implements for itself, since it is the implementation so it can make
>> free() easy to use for itself, with any extensions that aren't incompatible
>> with Standard C. Since free() is allowed to clobber errno, it is also
>> allowed to do a null clobber as a compatible extension.
>
> Yes, it is safe for free() itself to save errno and still stay compliant
> with both current and upcoming POSIX and with Standard C. But any code
> which relies on that is compliant with upcoming POSIX only.
Since people > don't want mass changes in that area, this is some sort of compromise > acceptable for me (in case free() itself will save/restore errno, of > course). libc has lots of magic non-conforming code. A little more won't hurt. However, free() is currently not careful about errno. It begins with an optional utrace() call, and this can in theory fail with errno ENOMEM even if there are no bugs in malloc() (all other errors from utrace() indicate bugs in the caller, assuming that the list of errnos in its man page is complete). malloc.c makes a few other sys(lib?)calls and never saves errno. I don't know if the others are reachable from free(). Bruce From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 20:14:04 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3A8441065670; Tue, 5 Jun 2012 20:14:04 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id 78BF98FC08; Tue, 5 Jun 2012 20:14:03 +0000 (UTC) Received: by laai10 with SMTP id i10so5214325laa.13 for ; Tue, 05 Jun 2012 13:14:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=48RbuAC1Bs4gm51Wx/oRorkom0LY4BcDYWBFZcd52IQ=; b=aDb/RaZPLos5Y9ZBL3uMc2ba/7As8Jz/Zh7iW2Rujg9Fl4cOYHSYl9o0C7/UBMQsQk 7aTqm3JdRwQ20AbukwszX3Slw8Q2Vd3+uplN3QMAfKLAi1Sb4R8h5JXrIPlEYWHKYXq7 eGsXlXj7X+P739Qhnh6wyhizLsVLy5eqIvtpP4BAKvth/hKqHBT2el4999+/suC54Obe ZhwV9RPh+bMYJv1nPepm7aw1uPegjLM7znMjZiO4aHZQs7iZM0JEHZ+syFzo4slPf3LJ fQXAx88Ys/apdYL77kmGFruki17nB53yXINbST9QHUYguz5TyJZ5hJiAIg8A9iIbA8gc ky+w== MIME-Version: 1.0 Received: by 10.112.45.4 with SMTP id i4mr8723394lbm.79.1338927242177; Tue, 05 Jun 2012 13:14:02 -0700 
(PDT) Sender: asmrookie@gmail.com Received: by 10.112.27.65 with HTTP; Tue, 5 Jun 2012 13:14:02 -0700 (PDT) In-Reply-To: References: <86bokyvtc2.fsf@ds4.des.no> Date: Tue, 5 Jun 2012 21:14:02 +0100 X-Google-Sender-Auth: Hk2hMq-jYJ-9tNhvIusoz3k2wtM Message-ID: From: Attilio Rao To: Adrian Chadd Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: Dag-Erling Smørgrav , arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 20:14:04 -0000

2012/6/5 Adrian Chadd :
> Hi,
>
> I'm very tempted to make if_ath use KTR_DEV, but then have an extra
> ath sysctl which does something like:
>
> if (sc->sc_ktr_enable)
>     KTR();

But the actual problem is that your output will be overwhelmed by the
clutter of all the other KTR_DEV consumers.
We very much need a much finer granularity on KTR classes and
possibly a way to use it on-the-fly for kernel development, and I think
what I suggested earlier makes sense.

Attilio

--
Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 20:30:52 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5A32A106566C; Tue, 5 Jun 2012 20:30:52 +0000 (UTC) (envelope-from gleb.kurtsou@gmail.com) Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by mx1.freebsd.org (Postfix) with ESMTP id C05F08FC0A; Tue, 5 Jun 2012 20:30:50 +0000 (UTC) Received: by lbon10 with SMTP id n10so5373119lbo.13 for ; Tue, 05 Jun 2012 13:30:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=u8AP51ONs2zfWJ3eMhDrdW9UcthNAbwHhoYUE3ia+vg=; b=AC3AEKqo/GksdzFuQBRfpItOG6MX8gxwE+0+fpJlDUEiF4E1aB4MeZCu2JviEODC2f AJeQ2b6zhdZjUZekw5WmZZJACP0EdavtY1hzlIyjvGgAZiiWnB6R0fGkXtDih2IQKxNV Aq9bzfaEmGh41HKsF51i8/FzY3qLpljZhIaALQg+7yxnebA1sWwhBqblF0kzvqMvt0/f qrB1o2QgN3oCtjxIOE6JneH/34fw0gCFjQm8yPfcvcGjmq0QeHcoezN+dkR9Q1oT+0M9 VMkebalIeG+1ejDI8uHtr/fK0XZcbWLvDkTUeDbk/wxlmprjOghnC8tAQDHgCkxrLlRg 9sog== Received: by 10.112.26.165 with SMTP id m5mr8686727lbg.15.1338928249549; Tue, 05 Jun 2012 13:30:49 -0700 (PDT) Received: from localhost ([78.157.92.5]) by mx.google.com with ESMTPS id hg4sm4139283lab.11.2012.06.05.13.30.47 (version=SSLv3 cipher=OTHER); Tue, 05 Jun 2012 13:30:48 -0700 (PDT) Date: Tue, 5 Jun 2012 23:30:42 +0300 From: Gleb Kurtsou To: John Baldwin Message-ID: <20120605203042.GA4081@reks> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <201206051222.12627.jhb@freebsd.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , 
Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov , Dag-Erling =?utf-8?B?U23DuHJncmF2?= Subject: Re: Fwd: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 20:30:52 -0000 On (05/06/2012 12:22), John Baldwin wrote: > On Tuesday, June 05, 2012 11:44:37 am Dag-Erling Smørgrav wrote: > > John Baldwin writes: > > > So you call getpid() on each access to a shared resource? > > > > I don't, but I've seen code that does, under the assumption that all the > > world is Linux and getpid() is free. Here's a sample from RHEL6 on a > > 3.1 GHz i5, using raise(0) as a baseline: > > > > getpid(): 10,000,000 iterations in 24,400 ms > > gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms > > raise(0): 10,000,000 iterations in 1,284,593 ms > > > > The difference between the first two is due to the fact that while > > getpid() just returns a constant, gettimeofday(0, 0) performs two > > comparisons first. Passing an actual struct timeval to gettimeofday() > > slows it down by a factor of about 6. > > > > (strace confirms that no system calls occur for either getpid() or > > gettimeofday(0, 0)) > > > > Here is the same program running on FreeBSD 9.0-RELEASE in VirtualBox on > > an otherwise idle 3.4 GHz i7: > > > > getpid(): 10,000,000 iterations in 777,251 ms > > gettimeofday(0, 0): 10,000,000 iterations in 799,808 ms > > raise(0): 10,000,000 iterations in 2,142,275 ms > > Yes, we know getpid() is slow, I think the question is does it matter that > it's slow in something other than a microbenchmark. Can you name the > application that you've seen use getpid()? > arc4random* calls getpid() on every invocation (which is right thing to do, imo) to reinitialize generator after fork. 
As an example consider network daemon encrypting/decrypting packets that is likely to need randomness to encrypt or process considerable portion of data. Too much depends on the crypto protocols/algorithms used, but scenario is pretty much real-life. It's a good example when getpid() is actually needed, but not called often because of being cheap. Thanks, Gleb. From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 21:01:57 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8706810656B0; Tue, 5 Jun 2012 21:01:57 +0000 (UTC) (envelope-from ache@vniz.net) Received: from vniz.net (vniz.net [194.87.13.69]) by mx1.freebsd.org (Postfix) with ESMTP id E88D68FC0C; Tue, 5 Jun 2012 21:01:56 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by vniz.net (8.14.5/8.14.5) with ESMTP id q55L1tOI022765; Wed, 6 Jun 2012 01:01:55 +0400 (MSK) (envelope-from ache@vniz.net) Received: (from ache@localhost) by localhost (8.14.5/8.14.5/Submit) id q55L1sFR022764; Wed, 6 Jun 2012 01:01:54 +0400 (MSK) (envelope-from ache) Date: Wed, 6 Jun 2012 01:01:54 +0400 From: Andrey Chernov To: Bruce Evans Message-ID: <20120605210154.GA22370@vniz.net> Mail-Followup-To: Andrey Chernov , Bruce Evans , svn-src-head@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <20120605130922.GE13306@vniz.net> <20120606043731.D1124@besplex.bde.org> <20120605194102.GA21173@vniz.net> <20120606054555.U1456@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120606054555.U1456@besplex.bde.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: svn-src-head@FreeBSD.ORG, svn-src-all@FreeBSD.ORG, src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG Subject: Re: 
svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 21:01:57 -0000 On Wed, Jun 06, 2012 at 06:11:01AM +1000, Bruce Evans wrote: > This is essentially unusable (so a bad idea). Instead of unconditionally > saving and restoring errno around calls to free(), portable POSIX code > can soon use a messy ifdef to avoid doing this in some cases, but still > has to do it in other cases. The results is just bloat and complexity > at the source level: It looks like they now consider POSIX as moving target where previous POSIX versions compatibility is not so essential to care about much. I don't have other interpretation of their decision to suddenly accept free() as not modifying errno. Since they clearly indicate code differences for old and new standard, they are well aware of them and of resulting code bloating. > However, free() is currently not careful about errno. It begins with > an optional utrace() call, and this can in theory fail with errno ENOMEM > even if there are no bugs in malloc() (all other errors from utrace() > indicate bugs in the caller, assuming that the list of errnos in its man > page is complete). malloc.c makes a few other sys(lib?)calls and never > saves errno. I don't know if the others are reachable from free(). 
I filed a PR about that: http://www.freebsd.org/cgi/query-pr.cgi?pr=168719 -- http://ache.vniz.net/ From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 21:30:42 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BB36D106564A; Tue, 5 Jun 2012 21:30:42 +0000 (UTC) (envelope-from joerg@britannica.bec.de) Received: from mo6-p00-ob.rzone.de (mo6-p00-ob.rzone.de [IPv6:2a01:238:20a:202:5300::1]) by mx1.freebsd.org (Postfix) with ESMTP id 94EE88FC0C; Tue, 5 Jun 2012 21:30:41 +0000 (UTC) X-RZG-AUTH: :JiIXek6mfvEEUpFQdo7Fj1/zg48CFjWjQv0cW+St/nW/afgnrylsiW+1ZjV+pgsJ X-RZG-CLASS-ID: mo00 Received: from britannica.bec.de (ip-109-45-139-202.web.vodafone.de [109.45.139.202]) by smtp.strato.de (jored mo73) (RZmta 29.10 DYNA|AUTH) with (AES128-SHA encrypted) ESMTPA id A07bfao55ISlKb ; Tue, 5 Jun 2012 23:30:37 +0200 (CEST) Received: by britannica.bec.de (sSMTP sendmail emulation); Tue, 05 Jun 2012 23:30:34 +0200 Date: Tue, 5 Jun 2012 23:30:34 +0200 From: Joerg Sonnenberger To: svn-src-all@freebsd.org Message-ID: <20120605213034.GA25293@britannica.bec.de> References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <20120605130922.GE13306@vniz.net> <20120606043731.D1124@besplex.bde.org> <20120605194102.GA21173@vniz.net> <20120606054555.U1456@besplex.bde.org> <20120605210154.GA22370@vniz.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120605210154.GA22370@vniz.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG, svn-src-head@FreeBSD.ORG Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: 
Tue, 05 Jun 2012 21:30:42 -0000 On Wed, Jun 06, 2012 at 01:01:54AM +0400, Andrey Chernov wrote: > On Wed, Jun 06, 2012 at 06:11:01AM +1000, Bruce Evans wrote: > > This is essentially unusable (so a bad idea). Instead of unconditionally > > saving and restoring errno around calls to free(), portable POSIX code > > can soon use a messy ifdef to avoid doing this in some cases, but still > > has to do it in other cases. The results is just bloat and complexity > > at the source level: > > It looks like they now consider POSIX as moving target where previous > POSIX versions compatibility is not so essential to care about much. I > don't have other interpretation of their decision to suddenly accept > free() as not modifying errno. Since they clearly indicate code > differences for old and new standard, they are well aware of them and of > resulting code bloating. Can you please stop the unjustified rants? The "new" behavior of free(3) doesn't break any existing code, so it is certainly compatible with "old" free(3). The "new" behavior can be obtained easily for code that wants to be portable to "old" implementations using the C preprocessor and a small inline wrapper. As such, there is no code bloating. 
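A minimal sketch of the kind of preprocessor-plus-inline-wrapper approach described above. FREE_KEEPS_ERRNO is a hypothetical configuration macro that a build system would define on platforms whose free() is known to preserve errno, and the name xfree() is likewise only illustrative; neither appears in the thread or in any standard.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * FREE_KEEPS_ERRNO and xfree() are hypothetical names for illustration.
 * Define FREE_KEEPS_ERRNO where free() is known not to modify errno.
 */
#ifdef FREE_KEEPS_ERRNO
#define	xfree(p)	free(p)
#else
static inline void
xfree(void *p)
{
	int saved_errno = errno;	/* remember the earlier error */

	free(p);
	errno = saved_errno;		/* undo any clobbering by free() */
}
#endif
```

Cleanup paths would call xfree() and report errno afterwards. Whether this counts as bloat is the point under dispute; the sketch only shows that the conditional code can be confined to one place.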
Joerg From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 21:48:14 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 36347106566C; Tue, 5 Jun 2012 21:48:14 +0000 (UTC) (envelope-from ache@vniz.net) Received: from vniz.net (vniz.net [194.87.13.69]) by mx1.freebsd.org (Postfix) with ESMTP id 9F7C68FC14; Tue, 5 Jun 2012 21:48:13 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by vniz.net (8.14.5/8.14.5) with ESMTP id q55LmC3a023628; Wed, 6 Jun 2012 01:48:12 +0400 (MSK) (envelope-from ache@vniz.net) Received: (from ache@localhost) by localhost (8.14.5/8.14.5/Submit) id q55LmBPR023627; Wed, 6 Jun 2012 01:48:11 +0400 (MSK) (envelope-from ache) Date: Wed, 6 Jun 2012 01:48:11 +0400 From: Andrey Chernov To: Joerg Sonnenberger Message-ID: <20120605214811.GA23384@vniz.net> Mail-Followup-To: Andrey Chernov , Joerg Sonnenberger , svn-src-all@FreeBSD.ORG, Bruce Evans , svn-src-head@FreeBSD.ORG, src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , freebsd-arch@FreeBSD.ORG References: <201206042134.q54LYoVJ067685@svn.freebsd.org> <20120605074741.GA1391@garage.freebsd.pl> <20120605130922.GE13306@vniz.net> <20120606043731.D1124@besplex.bde.org> <20120605194102.GA21173@vniz.net> <20120606054555.U1456@besplex.bde.org> <20120605210154.GA22370@vniz.net> <20120605213034.GA25293@britannica.bec.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120605213034.GA25293@britannica.bec.de> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: src-committers@FreeBSD.ORG, Pawel Jakub Dawidek , svn-src-all@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG, svn-src-head@FreeBSD.ORG Subject: Re: svn commit: r236582 - head/lib/libc/stdlib X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Tue, 05 Jun 2012 21:48:14 -0000 On Tue, Jun 05, 2012 at 11:30:34PM +0200, Joerg Sonnenberger wrote: > On Wed, Jun 06, 2012 at 01:01:54AM +0400, Andrey Chernov wrote: > > On Wed, Jun 06, 2012 at 06:11:01AM +1000, Bruce Evans wrote: > > > This is essentially unusable (so a bad idea). Instead of unconditionally > > > saving and restoring errno around calls to free(), portable POSIX code > > > can soon use a messy ifdef to avoid doing this in some cases, but still > > > has to do it in other cases. The results is just bloat and complexity > > > at the source level: > > > > It looks like they now consider POSIX as moving target where previous > > POSIX versions compatibility is not so essential to care about much. I > > don't have other interpretation of their decision to suddenly accept > > free() as not modifying errno. Since they clearly indicate code > > differences for old and new standard, they are well aware of them and of > > resulting code bloating. > > Can you please stop the unjustified rants? The "new" behavior of free(3) > doesn't break any existing code, so it is certainly compatible with > "old" free(3). The "new" behavior can be obtained easily for code that > wants to be portable to "old" implementations using the C preprocessor > and a small inline wrapper. As such, there is no code bloating. Could you please read more carefully, if you decide to stay in the topic? I already say exactly that few messages behind: > Yes, it is safe for free() itself to save errno and still stay compliant > with both current and upcoming POSIX and with Standard C. > But any code which rely on that is compliant with upcoming POSIX only. It means that when some program wants to conform to current POSIX and future POSIX, it either must save errno across the free() in any case or use code bloating, just reduced by CPP macro you suggest, not eliminated. And I don't think it is good decision from POSIX side, from compatibility point of view. 
Are you pretend to attack my personal opinion or what? -- http://ache.vniz.net/ From owner-freebsd-arch@FreeBSD.ORG Tue Jun 5 22:48:06 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 96F8D1065670; Tue, 5 Jun 2012 22:48:06 +0000 (UTC) (envelope-from grehan@freebsd.org) Received: from alto.onthenet.com.au (alto.OntheNet.com.au [203.13.68.12]) by mx1.freebsd.org (Postfix) with ESMTP id 460D78FC0C; Tue, 5 Jun 2012 22:48:06 +0000 (UTC) Received: from dommail.onthenet.com.au (dommail.OntheNet.com.au [203.13.70.57]) by alto.onthenet.com.au (Postfix) with ESMTPS id DE6C51268B; Wed, 6 Jun 2012 08:41:03 +1000 (EST) Received: from 192-168-1-100.tpgi.com.au (110-174-216-99.static.tpgi.com.au [110.174.216.99]) by dommail.onthenet.com.au (MOS 4.2.4-GA) with ESMTP id BEH97495 (AUTH peterg@ptree32.com.au); Wed, 6 Jun 2012 08:41:01 +1000 Message-ID: <4FCE8AF7.40606@freebsd.org> Date: Wed, 06 Jun 2012 08:40:55 +1000 From: Peter Grehan User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.28) Gecko/20120306 Thunderbird/3.1.20 MIME-Version: 1.0 To: Attilio Rao X-Old-Subject: Re: KTR_SPAREx References: <86bokyvtc2.fsf@ds4.des.no> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Junkmail: UCE(71) X-Junkmail-Info: FH_HELO_EQ_D_D_D_D, HELO_DYNAMIC_IPADDR2, SPF_SOFTFAIL, TVD_RCVD_IP X-Junkmail-Status: score=71/51, host=dommail.onthenet.com.au Cc: =?UTF-8?B?YXY=?= , Adrian Chadd , =?UTF-8?B?RGFnLUVybGluZyBTbcO4cmdy?=, arch@freebsd.org Subject: {Spam?} Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Jun 2012 22:48:06 -0000 > We very much need an much higher granularity on KTR classes and > possibly a 
way to use it on-the-fly for kernel development and I think > what I suggested earlier makes sense. Anyone had a look at Dragonfly's ktr ? http://gitweb.dragonflybsd.org/dragonfly.git/blob/HEAD:/sys/sys/ktr.h later, Peter. From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 08:24:27 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 40843106566C; Wed, 6 Jun 2012 08:24:27 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id C47898FC16; Wed, 6 Jun 2012 08:24:26 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id E63F16395; Wed, 6 Jun 2012 08:24:19 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 8AC2A96CE; Wed, 6 Jun 2012 10:24:19 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Bruce Evans References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> Date: Wed, 06 Jun 2012 10:24:19 +0200 In-Reply-To: <20120606040931.F1050@besplex.bde.org> (Bruce Evans's message of "Wed, 6 Jun 2012 04:36:54 +1000 (EST)") Message-ID: <864nqovoek.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Mailman-Approved-At: Wed, 06 Jun 2012 12:28:08 +0000 Cc: Gianni , John Baldwin , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org, Konstantin Belousov Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 08:24:27 -0000

Bruce Evans writes:
> Dag-Erling Smørgrav writes:
> > getpid(): 10,000,000 iterations in 24,400 ms
> > gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms
> > raise(0): 10,000,000 iterations in 1,284,593 ms
> That's one slow system or broken units.

Broken units: these are microseconds, not milliseconds. Sorry.

> After adjusting by factors of 1000 here and there, this format is still
> hard to parse. I like the format of nsec/operation. 24400 10 million
> operations in 24400 moroseconds seems to scale to 2.44 nsec/call (if 1
> moro = 1 micro). But that is impossibly fast, unless getpid() is
> inlined to a load of the shared variable (it may also need the load to
> be moved outside the loop). I can't see any reasonable adjustment that
> gives 24.4 nsec/call.

#define ITERATIONS 10000000

        struct timeval start, end;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERATIONS; ++i)
                getpid();
        gettimeofday(&end, NULL);

On Linux, gcc 4.4.6 compiles this to:

# gettimeofday(&start, NULL)
0x000000000040064b <+23>: lea -0x20(%rbp),%rax
0x000000000040064f <+27>: mov $0x0,%esi
0x0000000000400654 <+32>: mov %rax,%rdi
0x0000000000400657 <+35>: callq 0x400500
# i = 0
0x000000000040065c <+40>: movl $0x0,-0x4(%rbp)
0x0000000000400663 <+47>: jmp 0x40066e
# getpid()
0x0000000000400665 <+49>: callq 0x400520
# ++i
0x000000000040066a <+54>: addl $0x1,-0x4(%rbp)
# i < ITERATIONS
0x000000000040066e <+58>: cmpl $0x98967f,-0x4(%rbp)
0x0000000000400675 <+65>: jle 0x400665
# gettimeofday(&end, NULL)
0x0000000000400677 <+67>: lea -0x30(%rbp),%rax
0x000000000040067b <+71>: mov $0x0,%esi
0x0000000000400680 <+76>: mov %rax,%rdi
0x0000000000400683 <+79>: callq 0x400500

The code generated by gcc 4.2.1 on FreeBSD is almost identical:

# gettimeofday(&start, NULL)
0x00000000004006f7 : lea -0x20(%rbp),%rdi
0x00000000004006fb : mov $0x0,%esi
0x0000000000400700 : callq 0x40057c
# i = 0
0x0000000000400705 : movl $0x0,-0x4(%rbp)
0x000000000040070c : jmp 0x400717
# getpid()
0x000000000040070e : callq 0x40059c
# ++i
0x0000000000400713 : addl $0x1,-0x4(%rbp)
# i < ITERATIONS
0x0000000000400717 : cmpl $0x98967f,-0x4(%rbp)
0x000000000040071e : jle 0x40070e
# gettimeofday(&end, NULL)
0x0000000000400720 : lea -0x30(%rbp),%rdi
0x0000000000400724 : mov $0x0,%esi
0x0000000000400729 : callq 0x40057c

I don't know why gcc 4.4.6 loads &start / &end into %rax before copying
it to %rdi instead of loading it directly into %rdi like 4.2.1 does. I
used the same command line (gcc -Wall -Wextra syscall.c) in both cases.

DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 14:05:42 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A74881065672; Wed, 6 Jun 2012 14:05:42 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 7A8BD8FC08; Wed, 6 Jun 2012 14:05:42 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id AE995B977; Wed, 6 Jun 2012 10:05:41 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Wed, 6 Jun 2012 08:06:34 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <86bokyvtc2.fsf@ds4.des.no> <4FCE8AF7.40606@freebsd.org> In-Reply-To: <4FCE8AF7.40606@freebsd.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <201206060806.34245.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 06 Jun 2012 10:05:41 -0400 (EDT) Cc: Attilio Rao , av , Adrian Chadd , Peter Grehan Subject: Re: {Spam?} Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org 
X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 14:05:42 -0000 On Tuesday, June 05, 2012 6:40:55 pm Peter Grehan wrote: > > We very much need an much higher granularity on KTR classes and > > possibly a way to use it on-the-fly for kernel development and I think > > what I suggested earlier makes sense. > > Anyone had a look at Dragonfly's ktr ? > > http://gitweb.dragonflybsd.org/dragonfly.git/blob/HEAD:/sys/sys/ktr.h That does seem to be closer to what I would like. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 16:51:30 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1A116106564A for ; Wed, 6 Jun 2012 16:51:30 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 6E1118FC26 for ; Wed, 6 Jun 2012 16:51:28 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q56GpGbG040880 for ; Wed, 6 Jun 2012 19:51:16 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q56GpFvi022201 for ; Wed, 6 Jun 2012 19:51:15 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q56GpFQI022200 for arch@freebsd.org; Wed, 6 Jun 2012 19:51:15 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 6 Jun 2012 19:51:15 +0300 From: Konstantin Belousov To: arch@freebsd.org Message-ID: 
<20120606165115.GQ85127@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="kZU6r8y0YpRwyDfh" Content-Disposition: inline User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Subject: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 16:51:30 -0000 --kZU6r8y0YpRwyDfh Content-Type: text/plain; charset=us-ascii Content-Disposition: inline A positive result from the recent flame-bait on arch@ is the working implementation of the fast gettimeofday(2) and clock_gettime(2). The speedup I see is around 6-7x on the 2600K. I think the speedup could be even bigger on the previous generation of CPUs, where lock operations and syscall entry are costlier. A sample test runs of tools/tools/syscall_timing are presented at the end of message. Patch finds yet another use for the shared page, exporting time-keeping information for the binuptime(9) algorithm and re-implementing binuptime(9) in userspace. Kernel directs usermode whether the rdtsc instruction can be used, there is a global override sysctl kern.timecounter.fast_gettime to turn it off regardless of hardware capabilities. The whole struct vdso_timekeep is versioned, as well as individual struct vdso_timehands, which should allow to implement future algorithms without breaking binary compatibility. The code is structured to eventually move __vdso_* functions out of libc into VDSO, if it ever materialize. 
This desire explains the vdso prefix and the header file names.

I implemented and tested the userspace timecounter on amd64, for both 64-bit and 32-bit binaries; it would probably work on i386 too. Other architecture maintainers are welcome to add the necessary support. You need to provide a machine/vdso.h header that defines the VDSO_TIMEHANDS_MD fields of struct vdso_timehands, which should give userspace the information it needs to implement a fast tc_get_timecount(). The fields are filled in by the per-arch cpu_fill_vdso_timehands(9) function. If your architecture supports 32-bit compat, there are cpu_fill_vdso_timehands32(9) and VDSO_TIMEHANDS_MD32 to code as well. After that, lib/libc//sys/__vdso_gettc.c should contain an implementation of the __vdso_gettc() function, an exact analogue of tc_get_timecount().

Another potential improvement for the patch is to start using the rdtscp instruction on CPUs which support it. Then we could correct rdtsc skews between packages, provided the kernel starts maintaining this information, instead of refusing to activate the tsc timecounter. In particular, on one Nehalem box I see the rdtsc SMP test failing, but Nehalems do have a useful rdtsc, so this could be fixed later.

The patch is available at http://people.freebsd.org/~kib/misc/moronix.2.patch
It is not a commit candidate yet, since non-x86 architectures are not handled even at compilation, and i386 is not tested.
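As an illustration of the binuptime(9)-style computation described in this message, here is a minimal C sketch of how userspace might turn an exported snapshot plus a fresh counter reading into an uptime. It is not taken from the patch: the struct layouts and every name ending in _sketch are hypothetical, and the only idea borrowed is the one named above, an offset recorded at the last update plus a scaled counter delta.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a bintime: seconds plus a 64-bit binary fraction. */
struct bintime_sketch {
	int64_t  sec;
	uint64_t frac;		/* units of 2^-64 seconds */
};

/* Hypothetical snapshot of what a shared page might export. */
struct timehands_sketch {
	uint64_t scale;		/* roughly 2^64 / counter frequency */
	uint32_t offset_count;	/* counter value at the last update */
	uint32_t counter_mask;	/* valid bits of the hardware counter */
	struct bintime_sketch offset;	/* uptime at the last update */
};

/* Add a 64-bit fraction, carrying into the seconds on overflow. */
static void
bintime_addx_sketch(struct bintime_sketch *bt, uint64_t x)
{
	uint64_t old = bt->frac;

	bt->frac += x;
	if (bt->frac < old)
		bt->sec++;
}

/* Uptime = offset at last update + scaled ticks elapsed since then. */
static struct bintime_sketch
binuptime_sketch(const struct timehands_sketch *th, uint32_t now)
{
	struct bintime_sketch bt = th->offset;
	uint32_t delta;

	/* The mask is assumed to be of the form 2^n - 1. */
	assert((th->counter_mask & (th->counter_mask + 1)) == 0);

	/* Unsigned subtraction plus the mask handles counter wraparound. */
	delta = (now - th->offset_count) & th->counter_mask;
	bintime_addx_sketch(&bt, th->scale * delta);
	return (bt);
}
```

With a 2^20 Hz counter (scale = 2^44), a delta of 2^19 ticks adds exactly half a second. The multiplication stays within 64 bits only while the delta is below one second's worth of ticks, so a scheme like this depends on the snapshot being refreshed often enough.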
sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday
Clock resolution: 0.000000076
test            loop    time          iterations    periteration
gettimeofday    0       1.000994225   21623297      0.000000046
gettimeofday    1       1.000994980   21596492      0.000000046
gettimeofday    2       1.001070595   21598326      0.000000046
gettimeofday    3       1.000922308   21581398      0.000000046
gettimeofday    4       1.000984264   21605539      0.000000046
gettimeofday    5       1.000989697   21601659      0.000000046
gettimeofday    6       1.000996261   21598385      0.000000046
gettimeofday    7       1.001002223   21583933      0.000000046
gettimeofday    8       1.000985847   21599442      0.000000046
gettimeofday    9       1.000994977   21600935      0.000000046
sandy% sudo sysctl kern.timecounter.fast_gettime=0 ~
kern.timecounter.fast_gettime: 1 -> 0
sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday
Clock resolution: 0.000000076
test            loop    time          iterations    periteration
gettimeofday    0       1.001002747   3219274       0.000000310
gettimeofday    1       1.000971052   3220793       0.000000310
gettimeofday    2       1.001067494   3220768       0.000000310
gettimeofday    3       1.000929999   3220812       0.000000310
gettimeofday    4       1.000996106   3217503       0.000000311
gettimeofday    5       1.001058438   3220346       0.000000310
gettimeofday    6       1.000911510   3217308       0.000000311
gettimeofday    7       1.001085906   3220128       0.000000310
gettimeofday    8       1.000920338   3216582       0.000000311
gettimeofday    9       1.000983577   3219559       0.000000310

--kZU6r8y0YpRwyDfh Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/PioMACgkQC3+MBN1Mb4jPzwCfS14QKbr3jY5UhMGJDowJalb/ NrAAoNhv10qQJOytIVY46eOp5IZ3Z9s1 =D2Fs -----END PGP SIGNATURE----- --kZU6r8y0YpRwyDfh-- From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 17:06:03 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7486A1065686 for ; Wed, 6 Jun 2012 17:06:03 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it
(onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id E91A28FC14 for ; Wed, 6 Jun 2012 17:06:02 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id F3CA17300A; Wed, 6 Jun 2012 19:24:39 +0200 (CEST) Date: Wed, 6 Jun 2012 19:24:39 +0200 From: Luigi Rizzo To: Konstantin Belousov Message-ID: <20120606172439.GA42362@onelab2.iet.unipi.it> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> User-Agent: Mutt/1.4.2.3i Cc: arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 17:06:03 -0000 On Wed, Jun 06, 2012 at 07:51:15PM +0300, Konstantin Belousov wrote: > A positive result from the recent flame-bait on arch@ is the working > implementation of the fast gettimeofday(2) and clock_gettime(2). The great job! congratulations and thanks for this work! cheers luigi > speedup I see is around 6-7x on the 2600K. I think the speedup could > be even bigger on the previous generation of CPUs, where lock > operations and syscall entry are costlier. A sample test runs of > tools/tools/syscall_timing are presented at the end of message. > > Patch finds yet another use for the shared page, exporting > time-keeping information for the binuptime(9) algorithm and > re-implementing binuptime(9) in userspace. Kernel directs usermode > whether the rdtsc instruction can be used, there is a global override > sysctl kern.timecounter.fast_gettime to turn it off regardless of > hardware capabilities. 
> > The whole struct vdso_timekeep is versioned, as well as individual > struct vdso_timehands, which should allow to implement future > algorithms without breaking binary compatibility. The code is > structured to eventually move __vdso_* functions out of libc into > VDSO, if it ever materialize. This desire explains vdso prefix and > header file names. > > I implemented and tested the userspace timecounter on amd64, both for > 64 and 32 bit binaries, it would probably work for i386 too. Other > architecture maintainers are welcome to add neccessary support there. > You need to provide machine/vdso.h header with definitions of > VDSO_TIMEHANDS_MD fields for struct vdso_timehands, which should > provide information for userspace to implement fast > tc_get_timecount(). The fields are filled in per-arch > cpu_fill_vdso_timehands(9) function. If your architecture support > 32bit compat, there are cpu_fill_vdso_timehands32(9) and > VDSO_TIMEHANDS_MD32 to code as well. After that, the > lib/libc//sys/__vdso_gettc.c should contain an implemention of > __vdso_gettc() function, exact analogue of tc_get_timecount(). > > Another potential improvement for the patch is to start using rdtscp > instruction on the CPUs which support it. Then we could correct rdtsc > skews between packages, provided kernel starts maintaining this > information, instead of refusing to activate tsc timecounter. In > particular, on one Nehalem box I see the rdtsc SMP test failing, but > Nehalems do have useful rdtsc, so it is could be fixed later. > > Patch is available at http://people.freebsd.org/~kib/misc/moronix.2.patch > It is not a commit candidate yet, since non-x86 architectures are not > handled even at compilation, and i386 is not tested. 
> > sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday > Clock resolution: 0.000000076 > test loop time iterations periteration > gettimeofday 0 1.000994225 21623297 0.000000046 > gettimeofday 1 1.000994980 21596492 0.000000046 > gettimeofday 2 1.001070595 21598326 0.000000046 > gettimeofday 3 1.000922308 21581398 0.000000046 > gettimeofday 4 1.000984264 21605539 0.000000046 > gettimeofday 5 1.000989697 21601659 0.000000046 > gettimeofday 6 1.000996261 21598385 0.000000046 > gettimeofday 7 1.001002223 21583933 0.000000046 > gettimeofday 8 1.000985847 21599442 0.000000046 > gettimeofday 9 1.000994977 21600935 0.000000046 > sandy% sudo sysctl kern.timecounter.fast_gettime=0 ~ > kern.timecounter.fast_gettime: 1 -> 0 > sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday > Clock resolution: 0.000000076 > test loop time iterations periteration > gettimeofday 0 1.001002747 3219274 0.000000310 > gettimeofday 1 1.000971052 3220793 0.000000310 > gettimeofday 2 1.001067494 3220768 0.000000310 > gettimeofday 3 1.000929999 3220812 0.000000310 > gettimeofday 4 1.000996106 3217503 0.000000311 > gettimeofday 5 1.001058438 3220346 0.000000310 > gettimeofday 6 1.000911510 3217308 0.000000311 > gettimeofday 7 1.001085906 3220128 0.000000310 > gettimeofday 8 1.000920338 3216582 0.000000311 > gettimeofday 9 1.000983577 3219559 0.000000310 > From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 18:23:54 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B0E701065672 for ; Wed, 6 Jun 2012 18:23:54 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 86DD38FC14 for ; Wed, 6 Jun 2012 18:23:54 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) 
with ESMTPSA id EC55EB918; Wed, 6 Jun 2012 14:23:53 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Wed, 6 Jun 2012 14:23:53 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> In-Reply-To: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201206061423.53179.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 06 Jun 2012 14:23:54 -0400 (EDT) Cc: Konstantin Belousov Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 18:23:54 -0000 On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote: > A positive result from the recent flame-bait on arch@ is the working > implementation of the fast gettimeofday(2) and clock_gettime(2). The > speedup I see is around 6-7x on the 2600K. I think the speedup could > be even bigger on the previous generation of CPUs, where lock > operations and syscall entry are costlier. A sample test runs of > tools/tools/syscall_timing are presented at the end of message. In general this looks good but I see a few nits / races: 1) You don't follow the model of clearing tk_current to 0 while you are updating the structure that the in-kernel timecounter code uses. This also means you have to avoid using a tk_current of 0 and that userland has to keep spinning as long as tk_current is 0. Without this I believe userland can read a partially updated structure. 2) You read tk->tk_boottime without the tk_current protection in your non-uptime routines. 
This is racy, as the kernel alters the boottime when it skews time for large adjustments from ntp, etc. To be really safe you need to read the boottime inside the loop into a local variable, and perhaps use a boolean parameter to decide whether to add it to the computed uptime.

> sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday
> Clock resolution: 0.000000076
> test            loop    time          iterations    periteration
> gettimeofday    0       1.000994225   21623297      0.000000046
> gettimeofday    1       1.000994980   21596492      0.000000046
> gettimeofday    2       1.001070595   21598326      0.000000046
> gettimeofday    3       1.000922308   21581398      0.000000046
> gettimeofday    4       1.000984264   21605539      0.000000046
> gettimeofday    5       1.000989697   21601659      0.000000046
> gettimeofday    6       1.000996261   21598385      0.000000046
> gettimeofday    7       1.001002223   21583933      0.000000046
> gettimeofday    8       1.000985847   21599442      0.000000046
> gettimeofday    9       1.000994977   21600935      0.000000046
> sandy% sudo sysctl kern.timecounter.fast_gettime=0

I think this means you can call gettimeofday() in about 46 ns now vs 310 ns the "old" way?
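The two review points in this message can be sketched as a reader-side retry loop. This is a hedged illustration in C11, not code from the patch: struct shared_time_sketch, its fields and read_time_sketch() are hypothetical, nanosecond counters stand in for the real bintime fields, and C11 atomics stand in for whatever primitives libc would actually use. The reader spins while the generation is 0, copies the boottime inside the loop, and retries if the generation changed underneath it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared-page layout; not the patch's struct vdso_timehands. */
struct shared_time_sketch {
	_Atomic uint32_t gen;		/* 0 means "update in progress" */
	_Atomic uint64_t uptime_ns;
	_Atomic uint64_t boottime_ns;
};

static uint64_t
read_time_sketch(struct shared_time_sketch *st, bool add_boottime)
{
	uint32_t gen;
	uint64_t up, boot;

	do {
		/* Acquire: the field reads below cannot move above this load. */
		gen = atomic_load_explicit(&st->gen, memory_order_acquire);
		up = atomic_load_explicit(&st->uptime_ns,
		    memory_order_relaxed);
		/* Boottime is copied inside the loop, under the same check. */
		boot = atomic_load_explicit(&st->boottime_ns,
		    memory_order_relaxed);
		/* Keep the field reads ahead of the generation re-check. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&st->gen, memory_order_relaxed));

	/* The caller decides whether it wants uptime or wall-clock time. */
	return (add_boottime ? up + boot : up);
}
```

A single boolean parameter lets the uptime and non-uptime routines share one loop, so both values always come from the same consistent snapshot.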
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 19:03:44 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3B7FA106564A; Wed, 6 Jun 2012 19:03:44 +0000 (UTC) (envelope-from iwasaki@jp.FreeBSD.org) Received: from locore.org (ns01.locore.org [218.45.21.227]) by mx1.freebsd.org (Postfix) with ESMTP id DCF5F8FC0A; Wed, 6 Jun 2012 19:03:43 +0000 (UTC) Received: from localhost (celeron.v4.locore.org [192.168.0.10]) by locore.org (8.14.5/8.14.5/iwasaki) with ESMTP/inet id q56J3gH9050606; Thu, 7 Jun 2012 04:03:42 +0900 (JST) (envelope-from iwasaki@jp.FreeBSD.org) Date: Thu, 07 Jun 2012 04:03:42 +0900 (JST) Message-Id: <20120607.040342.73368798.iwasaki@jp.FreeBSD.org> To: avg@FreeBSD.org From: Mitsuru IWASAKI In-Reply-To: <4FCBBEDD.5000604@FreeBSD.org> References: <4FCB0FE5.4050607@FreeBSD.org> <20120603.234243.28389486.iwasaki@jp.FreeBSD.org> <4FCBBEDD.5000604@FreeBSD.org> X-Mailer: Mew version 3.3 on Emacs 20.7 / Mule 4.0 (HANANOEN) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: attilio@FreeBSD.org, freebsd-acpi@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: Re: cpu stopping X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 19:03:44 -0000 Hi, I've created the patches of experimental implementation based on discussion so far. http://people.freebsd.org/~iwasaki/acpi/cpustop_hook-20120606.diff In acpi_wakeup.c, cpususpend_handler() and susppcbs are replaced with cpustop_handler() and stoppcbs. This is for RELENG_9 and only for i386 but I think it's enough for the start. 
From: Andriy Gapon Subject: Re: cpu stopping Date: Sun, 03 Jun 2012 22:45:33 +0300 Message-ID: <4FCBBEDD.5000604@FreeBSD.org> > > Never mind :) What I'm trying to do in the patches is just to unify > > amd64/i386 independent part (acpi_wakeup.c) for the code maintenance, > > so please let's commit it first, then start re-design the > > cpususpend_handler(). > > In no way I am trying to delay your work :) > Just shared my view on the design of cpu stopping code. I got it :) > >> My view of how this should work is: > >> - there can be only one master CPU that controls all other (slave) CPUs > >> - the master sets entry and exit hooks > > > > Entry hook for suspending might be > > ---- > > ctx_fpusave(suspfpusave[cpu]); > > wbinvd(); > > CPU_SET_ATOMIC(cpu, &stopped_cpus); > > ---- > > > > and for stopping is > > ---- > > /* Indicate that we are stopped */ > > CPU_SET_ATOMIC(cpu, &stopped_cpus); > > ---- > > > > Correct? > > Yes. The only nit is that CPU_SET_ATOMIC(cpu, &stopped_cpus) could be part of > the wait loop prologue. No need to duplicate it in each hook. OK, I did so. I hope the patch is not far from your idea. Thanks! 
From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 19:50:07 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 43C171065670; Wed, 6 Jun 2012 19:50:07 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id D355F8FC16; Wed, 6 Jun 2012 19:50:06 +0000 (UTC) Received: from [10.30.101.53] ([209.117.142.2]) (authenticated bits=0) by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q56JjBx3036296 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO); Wed, 6 Jun 2012 13:45:12 -0600 (MDT) (envelope-from imp@bsdimp.com) Mime-Version: 1.0 (Apple Message framework v1084) Content-Type: text/plain; charset=us-ascii From: Warner Losh In-Reply-To: <201206061423.53179.jhb@freebsd.org> Date: Wed, 6 Jun 2012 13:45:05 -0600 Content-Transfer-Encoding: 7bit Message-Id: <78461459-8D90-4AD1-9983-3522E4DA5816@bsdimp.com> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> To: John Baldwin X-Mailer: Apple Mail (2.1084) X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (harmony.bsdimp.com [10.0.0.6]); Wed, 06 Jun 2012 13:45:13 -0600 (MDT) Cc: Konstantin Belousov , freebsd-arch@FreeBSD.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 19:50:07 -0000 On Jun 6, 2012, at 12:23 PM, John Baldwin wrote: > 2) You read tk->tk_boottime without the tk_current protection in your > non-uptime routines. This is racey as the kernel alters the > boottime when it skews time for large adjustments from ntp, etc. One of the 'etc' is leap seconds. 
Warner From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 20:16:20 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 01E35106567C; Wed, 6 Jun 2012 20:16:20 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id D79468FC1C; Wed, 6 Jun 2012 20:16:18 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id XAA09341; Wed, 06 Jun 2012 23:16:11 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1ScMeJ-0009IJ-3l; Wed, 06 Jun 2012 23:16:11 +0300 Message-ID: <4FCFBA89.9030105@FreeBSD.org> Date: Wed, 06 Jun 2012 23:16:09 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: Mitsuru IWASAKI References: <4FCB0FE5.4050607@FreeBSD.org> <20120603.234243.28389486.iwasaki@jp.FreeBSD.org> <4FCBBEDD.5000604@FreeBSD.org> <20120607.040342.73368798.iwasaki@jp.FreeBSD.org> In-Reply-To: <20120607.040342.73368798.iwasaki@jp.FreeBSD.org> X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: attilio@FreeBSD.org, freebsd-acpi@FreeBSD.org, freebsd-arch@FreeBSD.org Subject: Re: cpu stopping X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 20:16:20 -0000 on 06/06/2012 22:03 Mitsuru IWASAKI said the following: > Hi, > > I've created the patches of experimental implementation based on > discussion so far. 
> > http://people.freebsd.org/~iwasaki/acpi/cpustop_hook-20120606.diff > > In acpi_wakeup.c, cpususpend_handler() and susppcbs are replaced with > cpustop_handler() and stoppcbs. > > This is for RELENG_9 and only for i386 but I think it's enough for the > start. I think that there is no need for DPCPU here. All (affected) CPUs should see the same hook, IMO. At least I can not imagine the case where something else would be required. Also, it might make sense to provide a void pointer as a potential context for for the context. As Attilio has said before this has many similarities to what smp_rendezvous does, just for different kind of situations. > From: Andriy Gapon > Subject: Re: cpu stopping > Date: Sun, 03 Jun 2012 22:45:33 +0300 > Message-ID: <4FCBBEDD.5000604@FreeBSD.org> > >>> Never mind :) What I'm trying to do in the patches is just to unify >>> amd64/i386 independent part (acpi_wakeup.c) for the code maintenance, >>> so please let's commit it first, then start re-design the >>> cpususpend_handler(). >> >> In no way I am trying to delay your work :) >> Just shared my view on the design of cpu stopping code. > > I got it :) > >>>> My view of how this should work is: >>>> - there can be only one master CPU that controls all other (slave) CPUs >>>> - the master sets entry and exit hooks >>> >>> Entry hook for suspending might be >>> ---- >>> ctx_fpusave(suspfpusave[cpu]); >>> wbinvd(); >>> CPU_SET_ATOMIC(cpu, &stopped_cpus); >>> ---- >>> >>> and for stopping is >>> ---- >>> /* Indicate that we are stopped */ >>> CPU_SET_ATOMIC(cpu, &stopped_cpus); >>> ---- >>> >>> Correct? >> >> Yes. The only nit is that CPU_SET_ATOMIC(cpu, &stopped_cpus) could be part of >> the wait loop prologue. No need to duplicate it in each hook. > > OK, I did so. > > I hope the patch is not far from your idea. > > Thanks! 
-- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 20:59:46 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 56E061065672; Wed, 6 Jun 2012 20:59:46 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id CB9678FC18; Wed, 6 Jun 2012 20:59:45 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q56Kxct6080208; Wed, 6 Jun 2012 23:59:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q56KxcOL023466; Wed, 6 Jun 2012 23:59:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q56KxckZ023465; Wed, 6 Jun 2012 23:59:38 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 6 Jun 2012 23:59:38 +0300 From: Konstantin Belousov To: John Baldwin Message-ID: <20120606205938.GS85127@deviant.kiev.zoral.com.ua> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="qZLIv6EoKi7YuaSc" Content-Disposition: inline In-Reply-To: <201206061423.53179.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-arch@freebsd.org Subject: Re: Fast 
gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 20:59:46 -0000
--qZLIv6EoKi7YuaSc Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote:
> On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote:
> > A positive result from the recent flame-bait on arch@ is the working
> > implementation of the fast gettimeofday(2) and clock_gettime(2). The
> > speedup I see is around 6-7x on the 2600K. I think the speedup could
> > be even bigger on the previous generation of CPUs, where lock
> > operations and syscall entry are costlier. A sample test runs of
> > tools/tools/syscall_timing are presented at the end of message.
>
> In general this looks good but I see a few nits / races:
>
> 1) You don't follow the model of clearing tk_current to 0 while you
>    are updating the structure that the in-kernel timecounter code
>    uses. This also means you have to avoid using a tk_current of 0
>    and that userland has to keep spinning as long as tk_current is 0.
>    Without this I believe userland can read a partially updated
>    structure.

I changed the code to be much more similar to kern_tc.c. I (re)added the generation field, which is set to 0 while the kernel is touching the timehands. I think a partial read can only happen if tc_windups occur in quick succession, or if the usermode thread is suspended for long enough. BTW, even the generation could wrap around to the previous value if the thread is stopped.

There was apparently another issue with version 2. The bcopy() is not atomic, so libc could potentially read a wrong tk_current. I redid the interface for writing to the shared page to allow use of real atomics.
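The writer side of such a generation scheme might look roughly like the following C11 sketch. It is an assumption-laden illustration rather than the interface from the patch: struct shared_page_sketch and publish_time_sketch() are hypothetical names, nanosecond counters stand in for the real timehands fields, and a single writer is assumed. The generation is set to 0 while fields are being changed, and a non-zero value is published afterwards with a release store, an ordering that a plain bcopy() of the structure does not provide.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared-page layout; not the structure used by the patch. */
struct shared_page_sketch {
	_Atomic uint32_t gen;		/* 0 means "update in progress" */
	_Atomic uint64_t uptime_ns;
	_Atomic uint64_t boottime_ns;
};

/* Single-writer update: invalidate, write the fields, then publish. */
static void
publish_time_sketch(struct shared_page_sketch *sp, uint64_t up, uint64_t boot)
{
	uint32_t next;

	/* Pick the next generation, skipping 0 because 0 means "busy". */
	next = atomic_load_explicit(&sp->gen, memory_order_relaxed) + 1;
	if (next == 0)
		next = 1;

	/* Mark the snapshot invalid before touching any field. */
	atomic_store_explicit(&sp->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&sp->uptime_ns, up, memory_order_relaxed);
	atomic_store_explicit(&sp->boottime_ns, boot, memory_order_relaxed);

	/* Release: the field stores become visible before the new gen. */
	atomic_store_explicit(&sp->gen, next, memory_order_release);
}
```

A 32-bit generation can still wrap back to an earlier value if a reader is stopped for long enough, as noted above; the sketch does not try to address that.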
>
> 2) You read tk->tk_boottime without the tk_current protection in your
>    non-uptime routines. This is racey as the kernel alters the
>    boottime when it skews time for large adjustments from ntp, etc.
>    To be really safe you need to read the boottime inside the loop
>    into a local variable and perhaps use a boolean parameter to decide
>    if you should add it to the computed uptime.

I moved the boottime from timekeep to timehands, thank you for the clarification.

>
> > sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday
> > Clock resolution: 0.000000076
> > test            loop    time          iterations    periteration
> > gettimeofday    0       1.000994225   21623297      0.000000046
> > gettimeofday    1       1.000994980   21596492      0.000000046
> > gettimeofday    2       1.001070595   21598326      0.000000046
> > gettimeofday    3       1.000922308   21581398      0.000000046
> > gettimeofday    4       1.000984264   21605539      0.000000046
> > gettimeofday    5       1.000989697   21601659      0.000000046
> > gettimeofday    6       1.000996261   21598385      0.000000046
> > gettimeofday    7       1.001002223   21583933      0.000000046
> > gettimeofday    8       1.000985847   21599442      0.000000046
> > gettimeofday    9       1.000994977   21600935      0.000000046
> > sandy% sudo sysctl kern.timecounter.fast_gettime=0
>
> I think this means you can call gettimeofday() in about 46 ns now
> vs 310 the "old" way?

Yes. This is for 32-bit binaries; for 64-bit binaries the numbers are 155 -> 25 ns on the same hardware.
Updated patch is at=20 http://people.freebsd.org/~kib/misc/moronix.3.patch --qZLIv6EoKi7YuaSc Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/PxLkACgkQC3+MBN1Mb4jxiwCfcpH7xT549HAK2pcuZFMjR6V7 pjsAoKXKsHQmD+JU5VnKmiUXve1yOlcH =U/tF -----END PGP SIGNATURE----- --qZLIv6EoKi7YuaSc-- From owner-freebsd-arch@FreeBSD.ORG Wed Jun 6 22:48:21 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 17CFB106566C for ; Wed, 6 Jun 2012 22:48:21 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 42ADF8FC0A for ; Wed, 6 Jun 2012 22:48:20 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id BAA11256; Thu, 07 Jun 2012 01:48:18 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1ScP1V-0009SO-WE; Thu, 07 Jun 2012 01:48:18 +0300 Message-ID: <4FCFDE30.4020109@FreeBSD.org> Date: Thu, 07 Jun 2012 01:48:16 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:12.0) Gecko/20120503 Thunderbird/12.0.1 MIME-Version: 1.0 To: freebsd-arch@FreeBSD.org References: <4FAC3EAB.6050303@delphij.net> <861umkurt8.fsf@ds4.des.no> <20120517055425.GA802@infradead.org> <4FC762DD.90101@FreeBSD.org> <4FC81D9C.2080801@FreeBSD.org> <4FC8E29F.2010806@shatow.net> <4FC95A10.7000806@freebsd.org> <4FC9F94B.8060708@FreeBSD.org> In-Reply-To: <4FC9F94B.8060708@FreeBSD.org> X-Enigmail-Version: 1.5pre Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: d@delphij.net Subject: Re: Allow small amount of memory be mlock()'ed by unprivileged process? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Jun 2012 22:48:21 -0000 on 02/06/2012 14:30 Andriy Gapon said the following: [snip] > Some further technical observations: > o I was overly optimistic about _full_ support for RLIMIT_MEMLOCK - mlockall() > doesn't support itat the moment and I am not sure if it is easy to implement the > support for the MCL_FUTURE case. > > o Currently the default class in default login.conf has memorylocked=unlimited > - not very smart. > > o There is also vm.max_wired sysctl (with no equivalent tunable), which > specifies number of _pages_ that can be wired system wide (by both kernel and > userland). But note that the limit applies only to userland requests, the > kernel is allowed to wire new pages even when the limit is exceeded. By default > the limit is set to 1/3 of available pages. > So watch out for this limit when using ZFS, ZFS can easily starve userland. > > o I've just discovered :-) that we also have RCTL/RACCT framework (not enabled > by default) aka "Resource Accounting" / "Resource Limits", which seems to > parallel the conventional limits in many categories including the locked memory. > Not sure why we have that and if the interactions between conventional limits, > resource limits and privileges would be easy to untangle. [snip] In case someone still follows this thread, here is another observation. While non-privileged users can not explicitly wire/lock memory for their private use, they are still subject to RLIMIT_MEMLOCK accounting. E.g. sysctl system call may temporarily wire userspace buffers and that wiring is checked against the RLIMIT_MEMLOCK limit. And some sysctl calls may require quite large buffer sizes, e.g. OIDs under kern.proc when used by e.g. fstat. 
I observed the cases when the sysctl wired more than 128KB of memory. I think that on larger/busier systems it could be even more. So, on one hand this vslock-against-RLIMIT_MEMLOCK check is good because it protects against resource starvation via abuse. On the other hand, I am not sure if this is a proper use of RLIMIT_MEMLOCK. After all, vslock-ing by e.g. sysctl is an implementation detail. The memory is wired because of how kernel does things, not because a user/process wants to wire that memory. Besides the wiring is temporary. So I am not sure that it is fair to charge that kind of memory wiring to userland. In any case, beware that if you decide to lower "locked-in-memory size" limit (RLIMIT_MEMLOCK), then some sysctls and the tools using them (like fstat) may start failing. -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 01:42:06 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 90CC2106566C; Thu, 7 Jun 2012 01:42:06 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx07.syd.optusnet.com.au (fallbackmx07.syd.optusnet.com.au [211.29.132.9]) by mx1.freebsd.org (Postfix) with ESMTP id 0B0948FC14; Thu, 7 Jun 2012 01:42:05 +0000 (UTC) Received: from mail34.syd.optusnet.com.au (mail34.syd.optusnet.com.au [211.29.133.218]) by fallbackmx07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q571aGmt020643; Thu, 7 Jun 2012 11:36:16 +1000 Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail34.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q571ZnIp015171 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 7 Jun 2012 11:35:52 +1000 Date: Thu, 7 Jun 2012 11:35:49 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <201206061423.53179.jhb@freebsd.org> Message-ID: 
<20120607084229.C1474@besplex.bde.org> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Konstantin Belousov , freebsd-arch@FreeBSD.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 01:42:06 -0000 On Wed, 6 Jun 2012, John Baldwin wrote: > On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote: >> A positive result from the recent flame-bait on arch@ is the working >> implementation of the fast gettimeofday(2) and clock_gettime(2). The >> speedup I see is around 6-7x on the 2600K. I think the speedup could >> be even bigger on the previous generation of CPUs, where lock >> operations and syscall entry are costlier. A sample test runs of >> tools/tools/syscall_timing are presented at the end of message. > > In general this looks good but I see a few nits / races: It is awefully (sic) complete and large. The patch is almost twice as large as the entire kern_tc.c in FreeBSD-4, and that was quite bloated. > 1) You don't follow the model of clearing tk_current to 0 while you > are updating the structure that the in-kernel timecounter code > uses. This also means you have to avoid using a tk_current of 0 > and that userland has to keep spinning as long as tk_current is 0. > Without this I believe userland can read a partially updated > structure. I thought that too at first, but after looking at the patch decided that it may be correct, but is too hard for me to understand. Urk, we both missed that tk_current is an index into the timehands array, so it cannot act as a generation count and it seems to be harder to lock. 
> 2) You read tk->tk_boottime without the tk_current protection in your > non-uptime routines. This is racey as the kernel alters the > boottime when it skews time for large adjustments from ntp, etc. > To be really safe you need to read the boottime inside the loop > into a local variable and perhaps use a boolean parameter to decide > if you should add it to the computed uptime. The critical problems seem to be mostly here: +static void +timehands_update(void *arg) +{ + struct sysentvec *sv; + struct vdso_timehands th; + uint32_t enabled, idx; + + sv = arg; + sx_xlock(&shared_page_alloc_sx); + enabled = tc_fill_vdso_timehands(&th); I think tc_windup() should just write to the shared page using the same delicate order that it uses for its variables now, but there are callbacks and fill functions like this. This fill function seems to be OK, since it copies to a local variable and checks th_generation to get a consistent snapshot. Now we have to copy it to the shared page atomically. + idx = sv->sv_timekeep_curr; + if (++idx >= VDSO_TH_NUM) + idx = 0; + sv->sv_timekeep_curr = idx; + if (enabled) { + shared_page_write(sv->sv_timekeep_off + + sizeof(struct vdso_timekeep) + idx * + sizeof(struct vdso_timehands), sizeof(th), &th); + } Now I seem to understand this. It has race (1) as you said. Problems are limited by it copying to (previously) old timehands which is unlikely to be in use. The user must have grabbed the pointer to them 10-100 msec ago and been preempted and still be using it. But this is precisely the corner case that the generation count is supposed to fix. shared_page_write() is essentially bcopy(), so it writes non-atomically in any order. + shared_page_write(sv->sv_timekeep_off + offsetof(struct vdso_timekeep, + tk_boottime), sizeof(struct bintime), &boottimebin); + shared_page_write(sv->sv_timekeep_off + offsetof(struct vdso_timekeep, + tk_enabled), sizeof(uint32_t), &enabled); Then more large variables are written non-atomically in any order. 
The kernel has bugs in this area too (tc_setclock() hacks on boottimebin and then does an invalid (possibly concurrent) call to tc_windup()).

+	wmb();

Then things become written if we get this far.

+	shared_page_write(sv->sv_timekeep_off +
+	    offsetof(struct vdso_timekeep, tk_current), sizeof(uint32_t),
+	    &idx);

I don't understand this. Why isn't it before wmb(), or at least done atomically? Ah, it is tk_current. Writing this as atomically 0 at the start and then atomically here should be enough (no wmb()), except for the problems with boottimebin(). Except tk_current is actually the timehands index and there is no timehands generation in userland. I don't understand this.

+	sx_xunlock(&shared_page_alloc_sx);
+}

The enabled flag should be cleared when the timecounter is switched away from a TSC. I can't see where that happens. Also, things should change if a TSC is switched to another one (TSC-low <-> TSC). That is a bit more delicate and not covered by the enabled flag.

% +static int
% +binuptime(struct bintime *bt, struct vdso_timekeep *tk)
% +{
% +	struct vdso_timehands *th;
% +	uint32_t curr;
% +
% +	do {
% +		if (!tk->tk_enabled)
% +			return (ENOSYS);

This should not be acted on before the generation count stabilizes.

% +
% +		/*
% +		 * XXXKIB. The load of tk->tk_current should use
% +		 * atomic_load_acq_32 to provide load barrier. But
% +		 * since tk points to r/o mapped page, x86
% +		 * implementation of atomic_load_acq faults.
% +		 */
% +		curr = tk->tk_current;
% +		rmb();

Memory barriers are intentionally left out in the kernel version. Isn't the generation count enough, provided it is stored using atomic_rel?

% +		th = &tk->tk_th[curr];
% +		if (th->th_algo != VDSO_TH_ALGO_1)
% +			return (ENOSYS);

I don't like having 2 conditional tests. 1 more than in the kernel seems to be needed because all timehands may become unusable here (when the kernel timecounter hardware stops being a TSC, and this happens after any previous userland check of the flags).
% + *bt = th->th_offset; % + bintime_addx(bt, th->th_scale * tc_delta(th)); % + } while (curr != tk->tk_current); With generation counts, it is only this second access to what was the generation count that needs to be atomic. If the other one is stale, then it is different from this one. % + return (0); % +} It is a large regression to use the current index instead of the old timehands. The old timehands is stable for 10-100 msec after you load it -- nothing in it, including its generation count changes in that time. So the kernel version of the above loop almost never iterates more than once -- it only iterates if it is preempted for 10-100 msec. But using the current index, you see this change as soon as the kernel updates it, and then iterate, and aren't protected by the 10-100 msec of time based locking. Even accessing the current index requires more locking. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 01:35:14 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 14A19106566B; Thu, 7 Jun 2012 01:35:14 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx10.syd.optusnet.com.au (fallbackmx10.syd.optusnet.com.au [211.29.132.251]) by mx1.freebsd.org (Postfix) with ESMTP id 6B4878FC20; Thu, 7 Jun 2012 01:35:11 +0000 (UTC) Received: from mail26.syd.optusnet.com.au (mail26.syd.optusnet.com.au [211.29.133.167]) by fallbackmx10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q56LEbQE027963; Thu, 7 Jun 2012 07:15:06 +1000 Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail26.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q56LEIJS013016 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 7 Jun 2012 07:14:26 +1000 Date: Thu, 7 Jun 2012 07:14:06 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: 
=?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= In-Reply-To: <864nqovoek.fsf@ds4.des.no> Message-ID: <20120607064951.C1106@besplex.bde.org> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="0-925939591-1339017246=:1106" X-Mailman-Approved-At: Thu, 07 Jun 2012 01:49:13 +0000 Cc: Gianni , John Baldwin , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org, Konstantin Belousov Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 01:35:14 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --0-925939591-1339017246=:1106 Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE

On Wed, 6 Jun 2012, [utf-8] Dag-Erling Smørgrav wrote:

> Bruce Evans writes:
>> Dag-Erling Smørgrav writes:
>>> getpid(): 10,000,000 iterations in 24,400 ms
>>> gettimeofday(0, 0): 10,000,000 iterations in 54,104 ms
>>> raise(0): 10,000,000 iterations in 1,284,593 ms
>> That's one slow system or broken units.
>
> Broken units, these are microseconds not milliseconds. Sorry.
>
>> After adjusting by factors of 1000 here and there, this format is still
>> hard to parse. I like the format of nsec/operation. 24400 10 million
>> operations in 24400 moroseconds seems to scale to 2.44 nsec/call (if 1
>> moro = 1 micro).
But that is impossibly fast, unless getpid() is
>> inlined to a load of the shared variable (it may also need the load to
>> be moved outside the loop). I can't see any reasonable adjustment that
>> gives 24.4 nsec/call.
>
> #define ITERATIONS 10000000
>
> struct timeval start, end;
> int i;
>
> gettimeofday(&start, NULL);
> for (i = 0; i < ITERATIONS; ++i)
>     getpid();
> gettimeofday(&end, NULL);

Now 2.44 nsec/call makes sense, but you really should add some volatiles here to ensure that getpid() is not optimized away. I get 3.48-3.49 nsec/call on an Athlon64 2GHz (the ratio of the times is almost exactly proportional to the clock frequencies, so the times in cycles must be almost identical).

> On Linux, gcc 4.4.6 compiles this to:
>
> # gettimeofday(&start, NULL)
> 0x000000000040064b <+23>: lea -0x20(%rbp),%rax
> 0x000000000040064f <+27>: mov $0x0,%esi
> 0x0000000000400654 <+32>: mov %rax,%rdi
> 0x0000000000400657 <+35>: callq 0x400500
>
> # i = 0
> 0x000000000040065c <+40>: movl $0x0,-0x4(%rbp)
> 0x0000000000400663 <+47>: jmp 0x40066e
>
> # getpid()
> 0x0000000000400665 <+49>: callq 0x400520
>
> # ++i
> 0x000000000040066a <+54>: addl $0x1,-0x4(%rbp)
>
> # i < ITERATIONS
> 0x000000000040066e <+58>: cmpl $0x98967f,-0x4(%rbp)
> 0x0000000000400675 <+65>: jle 0x400665
>
> # gettimeofday(&end, NULL)
> 0x0000000000400677 <+67>: lea -0x30(%rbp),%rax
> 0x000000000040067b <+71>: mov $0x0,%esi
> 0x0000000000400680 <+76>: mov %rax,%rdi
> 0x0000000000400683 <+79>: callq 0x400500
>
> The code generated by gcc 4.2.1 on FreeBSD is almost identical:
> ...

So it loops OK, but we can't see what getpid() does. It must not be doing much.

> I don't know why gcc 4.4.6 loads &start / &end into %rax before copying
> it to %esi instead of loading it directly into %esi like 4.2.1 does. I
> used the same command line (gcc -Wall -Wextra syscall.c) in both cases.

Probably unimportant (buried in loop overhead).
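The nsec/call format asked for above can be computed directly by the benchmark instead of being derived by hand from a total. What follows is a minimal sketch, assuming a POSIX system with clock_gettime(CLOCK_MONOTONIC); the helper names nsec_per_call() and time_getpid() are invented for this illustration and do not come from the thread or from tools/tools/syscall_timing. Each result is stored through a volatile variable so that the store has to happen on every iteration.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Convert an elapsed interval and an iteration count to nsec per call. */
static double
nsec_per_call(const struct timespec *start, const struct timespec *end,
    long iterations)
{
	double ns;

	assert(iterations > 0);
	ns = (double)(end->tv_sec - start->tv_sec) * 1e9 +
	    (double)(end->tv_nsec - start->tv_nsec);
	return (ns / (double)iterations);
}

/* Time a loop of getpid() calls and report the cost in nsec per call. */
static double
time_getpid(long iterations)
{
	struct timespec start, end;
	volatile pid_t sink;	/* forces a store on every iteration */
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		sink = getpid();
	clock_gettime(CLOCK_MONOTONIC, &end);
	(void)sink;
	return (nsec_per_call(&start, &end, iterations));
}
```

Fed the figures quoted above (24,400 usec for 10,000,000 iterations), nsec_per_call() yields 2.44 nsec/call, which is the number being argued about.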
Program for 3.48-3.49 nsec:

% volatile int gpid;

It isn't volatile, but declaring it volatile prevents gcc-3.3.1 optimizing away the whole call to getpid() (this reduces the time to 0.99 nsec = 2 cycles (2 cycles is the minimum loop overhead on most current x86)).

%
% int
% getpid(void)
% {
% 	return gpid;
% }
%
% main()
% {
% 	int i;
%
% 	for (i = 0; i < 1000000000; i++)
% 		getpid();
% }

Compiling with cc -O -fomit-frame-pointer gives:

% 08048520 <getpid>:
% 8048520:	a1 0c 97 04 08 	mov 0x804970c,%eax
% 8048525:	c3 	ret
% 8048526:	89 f6 	mov %esi,%esi
%
% 08048528 <main>:
% 8048528:	55 	push %ebp
% 8048529:	89 e5 	mov %esp,%ebp
% 804852b:	53 	push %ebx
% 804852c:	83 ec 04 	sub $0x4,%esp
% 804852f:	83 e4 f0 	and $0xfffffff0,%esp
% 8048532:	bb 00 00 00 00 	mov $0x0,%ebx
% 8048537:	90 	nop
% 8048538:	e8 e3 ff ff ff 	call 8048520
% 804853d:	43 	inc %ebx
% 804853e:	81 fb ff c9 9a 3b 	cmp $0x3b9ac9ff,%ebx
% 8048544:	7e f2 	jle 8048538
% 8048546:	8b 5d fc 	mov 0xfffffffc(%ebp),%ebx
%
% 8048549:	c9 	leave
% 804854a:	c3 	ret
% 804854b:	90 	nop

-fomit-frame-pointer gives nicer object code but has no effect on the runtime.

gettimeofday() needs several branches for null pointers, so it is much slower even before it does useful work. Your system has an indirection or 2 for shared libraries (1 for the function call and maybe more for the global pid), so it is doing well for getpid() to be no slower in cycles. kib's version has lots of layering (function calls and indirections inherited from the kernel version where they are more needed) that might make it get to the useful work at about the same time Linux has done it and returned.

5.4104 nsec/call for gettimeofday() is impossible if there is any rdtsc() hardware call or much layering. rdtsc() takes 9-12 cycles on AthlonXP and Athlon64, but 40+ cycles on Phenom+ and on most (?) Intel CPUs and on most CPUs where it is P-state invariant (it is apparently as hard or harder to synchronize in hardware as in software). So Linux can't be calling it to get 5.4104 nsec/call. But calling and using it should only take another 13-20 nsec at 3 GHz. Excessive generality in the software parts probably adds 10-20 nsec to this. ISTR measuring 29 nsec (60+ cycles) for binuptime() on Athlon XP. That's with the hardware part taking about 12 cycles. gettimeofday()'s poor API adds a lot to this.
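The cycle counts and nsec figures above are tied together by the clock frequency: nsec = cycles / GHz. A small sketch of that arithmetic follows; the function names are illustrative only and appear nowhere in the thread.

```c
#include <assert.h>

/* Time taken by a given number of CPU cycles at a clock rate in GHz. */
static double
cycles_to_nsec(double cycles, double ghz)
{
	assert(ghz > 0.0);
	return (cycles / ghz);
}

/* Number of CPU cycles that fit in a given time at a clock rate in GHz. */
static double
nsec_to_cycles(double nsec, double ghz)
{
	assert(ghz > 0.0);
	return (nsec * ghz);
}
```

For instance, 40 cycles at 3 GHz is about 13.3 nsec, the low end of the 13-20 nsec estimate above, and 20 nsec at 3 GHz corresponds to 60 cycles. In the other direction, 0.99 nsec on a 2 GHz CPU is about 2 cycles.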
Bruce --0-925939591-1339017246=:1106-- From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 03:00:44 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 90C21106566C; Thu, 7 Jun 2012 03:00:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id 24DD78FC14; Thu, 7 Jun 2012 03:00:43 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q5730YqF018640 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 7 Jun 2012 13:00:36 +1000 Date: Thu, 7 Jun 2012 13:00:34 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120606205938.GS85127@deviant.kiev.zoral.com.ua> Message-ID: <20120607130029.K1962@besplex.bde.org> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 03:00:44 -0000 On Wed, 6 Jun 2012, Konstantin Belousov wrote: > On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: >> In general this looks good but I see a few nits / races: >> >> 1) You don't follow the model of clearing tk_current to 0 while you >> are updating the structure that the in-kernel timecounter code >> uses. 
This also means you have to avoid using a tk_current of 0
>> and that userland has to keep spinning as long as tk_current is 0.
>> Without this I believe userland can read a partially updated
>> structure.
> I changed the code to be much more similar to the kern_tc.c. I (re)added
> the generation field, which is set to 0 upon kernel touching timehands.

Seems necessary.

> I think this can only happen if tc_windups occurs quite close in
> succession, or usermode thread is suspended for long enough. BTW,
> even generation could loop back to the previous value if thread is
> stopped.

tc_windup()'s close in succession are bugs, since they cycle the timehands faster than they were designed to be. We already have too many of these bugs (where tc_setclock() calls tc_windup(). I didn't notice this particular problem with it before). Now I will point out that version 2 of your patch adds more of these calls, apparently to get changes to happen sooner. But in sysctl_kern_timecounter_hardware(), such a call was intentionally left out since it is not needed. Note that tc_tick prevents calls to tc_windup() more often than about once per msec if hz > 1000.

The generation count makes tc_windup()s close in succession harmless, except they increase race possibilities by reducing the time-domain locking. The generation count is 32 bits, so it can only loop back to a previous value after 2**32 tc_windup() calls. This "can't happen". What can happen is for the timehands to cycle after something is preempted for 10-100 msec. Then the generation count allows detection of the cycling. It only has an effect in this case. Otherwise, a thread can be preempted for 10-100 seconds and start up using a timehands pointer that it read into a register that long ago, and safely use the old pointer unless its generation has changed. Even switching the timecounter works in that case. This depends on the hardware part of the timecounter not going away and the software keeping most state per-timehands.

> There was apparently another issue with version 2. The bcopy() is not
> atomic, so potentially libc could read wrong tk_current. I redid
> the interface to write to the shared page to allow use of real atomics.

Timecounter code is supposed to be lock-free except for some time-domain locking. I only see 1 problem with this: where tc_windup() writes the generation count and other things without asking for these writes to be ordered. In most cases, the time-domain locking prevents problems. E.g., when the timehands pointer is read, it remains valid for 9+ generations of cycling timehands (9+ to 90+ msec). It is only when it sleeps for this long while holding and planning to use the old pointer that it needs the generation count to actually work. Another case is if writes are out of order (can't happen on x86), so:

	/*
	 * The write to th_generation fails to protect users of th
	 * via 10-100 msec old pointers if it becomes visible unordered
	 * after any of the writes done by the bcopy(). Very rare to
	 * lose here, but th_generation's point is to not lose here.
	 */
	th->th_generation = 0;
	bcopy(tho, th, offsetof(struct timehands, th_generation));

	// finish writing th except for th_generation
	th->th_generation = ogen;
	/*
	 * The previous write to th_generation fails to protect users
	 * of th via old pointers if it becomes visible unordered before
	 * all of the other writes (users see the generation change
	 * via the old pointer, and now since it has become nonzero
	 * they use the incompletely written data). Again, only a problem
	 * after 10-100 msec.
	 */

	timehands = th;
	/*
	 * Now users can grab th via timehands. If timehands became visible
	 * unordered before all of the other writes except th_generation,
	 * then users use the incompletely written data. Now the time
	 * domain locking doesn't help.
	 */

>> 2) You read tk->tk_boottime without the tk_current protection in your
>> non-uptime routines.
This is racey as the kernel alters the >> boottime when it skews time for large adjustments from ntp, etc. >> To be really safe you need to read the boottime inside the loop >> into a local variable and perhaps use a boolean parameter to decide >> if you should add it to the computed uptime. > I moved the bootime to timehands from timekeep, thank you for the > clarification. This isn't bug for bug compatible with the kernel. The kernel has a global boottimebin which affects uses of old timehands the instance that it is changed (even before tc_windup() is called). > Updated patch is at > http://people.freebsd.org/~kib/misc/moronix.3.patch I had better not be awed by looking at it :-). Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 09:12:53 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 47489106566B; Thu, 7 Jun 2012 09:12:53 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id A61278FC1A; Thu, 7 Jun 2012 09:12:52 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q579ChPH062268; Thu, 7 Jun 2012 12:12:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q579ChHj027961; Thu, 7 Jun 2012 12:12:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q579Chjd027960; Thu, 7 Jun 2012 12:12:43 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 7 Jun 2012 12:12:43 +0300 From: Konstantin Belousov To: Bruce Evans Message-ID: 
<20120607091243.GV85127@deviant.kiev.zoral.com.ua> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <20120607130029.K1962@besplex.bde.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="fryGc0vzirnrYIcd" Content-Disposition: inline In-Reply-To: <20120607130029.K1962@besplex.bde.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 09:12:53 -0000 --fryGc0vzirnrYIcd Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jun 07, 2012 at 01:00:34PM +1000, Bruce Evans wrote: > On Wed, 6 Jun 2012, Konstantin Belousov wrote: >=20 > >On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: > >>In general this looks good but I see a few nits / races: > >> > >>1) You don't follow the model of clearing tk_current to 0 while you > >> are updating the structure that the in-kernel timecounter code > >> uses. This also means you have to avoid using a tk_current of 0 > >> and that userland has to keep spinning as long as tk_current is 0. > >> Without this I believe userland can read a partially updated > >> structure. > >I changed the code to be much more similar to the kern_tc.c. I (re)added > >the generation field, which is set to 0 upon kernel touching timehands. 
>=20 > Seems necessary. >=20 > >I think this can only happen if tc_windups occurs quite close in > >succession, or usermode thread is suspended for long enough. BTW, > >even generation could loop back to the previous value if thread is > >stopped. >=20 > tc_windup()'s close in succession are bugs, since they cycle the timehands > faster than they were designed to be. We already have too many of these > bugs (where tc_setclock() calls tc_windup(). I didn't notice this > particular problem with it before). Now I will point out that version > 2 of your patch adds more of these calls, apparently to get changes to > happen sooner. But in sysctl_kern_timecounter_hardware(), such a call > was intentionaly left out since it is not needed. Note that tc_tick > prevents calls to tc_windup() more often than about once per msec if > hz > 1000. No, I did not added more tc_windup calls. I added a recalculation of the shared page content on the timecounter change, which is not the same as tc_windup() call. This is exactly to handle a disable of usermode rdtsc use when kernel timecounter hardware changes. >=20 > The generation count makes tc_windup()s close in succession harmless, > except they increase race possibilities by reducing the time-domain > locking. The generation count is 32 bits, so it can only loop back to > a previous value after 2**32 tc_windup_calls. This "can't happen". > What can happen is for the timehands to cycle after something is > preempted for 10-100 msec. Then the generation count allows detection > of the cycling. It only has an effect in this case. Otherwise, the > a thread can be preempted for 10-100 seconds and start up using a > timehands pointer that it read into a register that long ago, and > safely use the old pointer unless its generation has changed. Even > switching the timecounter works in that case. This depends on the > hardware part of the timecounter not going away and the software > keeping most state per-timehands. 
I reinstated the generation counter for rev. 3.
>
> >There was apparently another issue with version 2. The bcopy() is not
> >atomic, so potentially libc could read wrong tk_current. I redid
> >the interface to write to the shared page to allow use of real atomics.
>
> Timecounter code is supposed to be lock-free except for some time-domain
> locking. I only see 1 problem with this: where tc_windup() writes the
> generation count and other things without asking for these writes to
> be ordered. In most cases, the time-domain locking prevents problems.

In fact, on x86 the ordering is strong enough that no barriers are needed; this is why the problem has gone unnoticed so far.

> E.g., when the timehands pointer is read, it remains valid for 9+
> generations of cycling timehands (9+ to 90+ msec). It is only when
> it sleeps for this long while holding and planning to use the old
> pointer that it needs the generation count to actually work. Another
> case is if writes are out of order (can't happen on x86), so:
>
> 	/*
> 	 * The write to th_generation fails to protect users of th
> 	 * via 10-100 msec old pointers if it becomes visible unordered
> 	 * after any of the writes done by the bcopy(). Very rare to
> 	 * lose here, but th_generation's point is to not lose here.
> 	 */
> 	th->th_generation = 0;
> 	bcopy(tho, th, offsetof(struct timehands, th_generation));
>
> 	// finish writing th except for th_generation
> 	th->th_generation = ogen;
> 	/*
> 	 * The previous write to th_generation fails to protect users
> 	 * of th via old pointers if it becomes visible unordered before
> 	 * all of the other writes (users see the generation change
> 	 * via the old pointer, and now since it has become nonzero
> 	 * they use the incompletely written data). Again, only a problem
> 	 * after 10-100 msec.
> 	 */
>
> 	timehands = th;
> 	/*
> 	 * Now users can grab th via timehands.
If timehands became visible
> 	 * unordered before all of the other writes except th_generation,
> 	 * then users use the incompletely written data. Now the time
> 	 * domain locking doesn't help.
> 	 */
>
> >>2) You read tk->tk_boottime without the tk_current protection in your
> >> non-uptime routines. This is racey as the kernel alters the
> >> boottime when it skews time for large adjustments from ntp, etc.
> >> To be really safe you need to read the boottime inside the loop
> >> into a local variable and perhaps use a boolean parameter to decide
> >> if you should add it to the computed uptime.
> >I moved the boottime to timehands from timekeep, thank you for the
> >clarification.
>
> This isn't bug for bug compatible with the kernel. The kernel has a
> global boottimebin which affects uses of old timehands the instant
> that it is changed (even before tc_windup() is called).
>
> >Updated patch is at
> >http://people.freebsd.org/~kib/misc/moronix.3.patch
>
> I had better not be awed by looking at it :-).

I will test this with your test code when I return home.
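The ordering requirements spelled out in the quoted comments can be written down explicitly with C11 atomics. The following is only a sketch under stated assumptions: struct demo_timehands, demo_update() and demo_read() are hypothetical names, the payload is cut down to two fields, and C11 <stdatomic.h> fences stand in for whatever primitives the kernel and libc would actually use. It is not the code from the patch or from kern_tc.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one timehands slot. */
struct demo_timehands {
	_Atomic uint32_t gen;		/* 0 means "update in progress" */
	_Atomic uint64_t offset;	/* payload */
	_Atomic uint64_t scale;		/* payload */
};

/* Writer side: gen goes to 0, the payload is written, gen is advanced. */
static void
demo_update(struct demo_timehands *th, uint64_t offset, uint64_t scale)
{
	uint32_t ogen;

	assert(th != NULL);
	ogen = atomic_load_explicit(&th->gen, memory_order_relaxed);
	/* Mark the slot as being modified; readers that see 0 retry. */
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	/* Order the gen = 0 store before the payload stores below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, offset, memory_order_relaxed);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	/* Advance the generation, skipping the reserved value 0. */
	if (++ogen == 0)
		ogen = 1;
	/* Release store: the payload is visible before the new generation. */
	atomic_store_explicit(&th->gen, ogen, memory_order_release);
}

/* Reader side: retry until the generation is nonzero and unchanged. */
static void
demo_read(struct demo_timehands *th, uint64_t *offset, uint64_t *scale)
{
	uint32_t gen;

	assert(th != NULL);
	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		*offset = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		*scale = atomic_load_explicit(&th->scale,
		    memory_order_relaxed);
		/* Order the payload loads before the generation re-check. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

On x86 these fences are expected to compile to little or nothing, which is consistent with the remark that the missing ordering has gone unnoticed there. On architectures with weaker ordering they are what keeps a reader from accepting a half-written slot.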
--fryGc0vzirnrYIcd Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/QcIsACgkQC3+MBN1Mb4jL9gCeM2BJ7raUIf4lK9/cnn7oOt9L DZ0AoLk1bHMpwPz6kSv9mSCtMu5jUbRJ =d4j3 -----END PGP SIGNATURE----- --fryGc0vzirnrYIcd-- From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 10:04:22 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4D1D8106567F; Thu, 7 Jun 2012 10:04:22 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id DBBFA8FC1F; Thu, 7 Jun 2012 10:04:21 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q57A42Q5072985; Thu, 7 Jun 2012 13:04:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q57A42K6028244; Thu, 7 Jun 2012 13:04:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q57A41Lb028243; Thu, 7 Jun 2012 13:04:01 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 7 Jun 2012 13:04:01 +0300 From: Konstantin Belousov To: Dag-Erling Sm??rgrav Message-ID: <20120607100401.GW85127@deviant.kiev.zoral.com.ua> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no> <20120607064951.C1106@besplex.bde.org> <86sje7sf31.fsf@ds4.des.no> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; 
protocol="application/pgp-signature"; boundary="PoKbPPFu8MuDl6RC" Content-Disposition: inline In-Reply-To: <86sje7sf31.fsf@ds4.des.no> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: John Baldwin , freebsd-arch@FreeBSD.org Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 10:04:22 -0000 --PoKbPPFu8MuDl6RC Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Thu, Jun 07, 2012 at 10:26:10AM +0200, Dag-Erling Smørgrav wrote:
> Bruce Evans writes:
> > Now 2.44 nsec/call makes sense, but you really should add some volatiles
> > here to ensure that getpid() is not optimized away.
>
> As you can see from the disassembly I provided, it isn't.
>
> > SO it loops OK, but we can't see what getpid() does. It must not be
> > doing much.
>
> Umm, yes, that's the whole point of this conversation. Linux's getpid()
> is not a syscall, but a library function that returns a constant from a
> page shared by the kernel.
>
> > 5.4104 nsec/call for gettimeofday() is impossible if there is any
> > rdtsc() hardware call or much layering.
>
> It's gettimeofday(0, 0), actually, so it doesn't need to read the clock.
> If I pass a struct timeval as the first argument - so it *does* need to
> read the clock - it's a little bit slower but still faster than an
> actual system call.
Here's another run that demonstrates this - a
> little bit slower than previous runs because I have other processes
> running:
>
> getpid(): 10,000,000 iterations in 30,377 us
> gettimeofday(0, 0): 10,000,000 iterations in 55,571 us
> gettimeofday(&tv, 0): 10,000,000 iterations in 302,634 us

So this timing is of the same order of magnitude as the times I get for
the patch: around 25 ns vs. 30 ns per gettimeofday() call. Linux seems
slower, probably due to the slower CPU? Mine is 3.4GHz, while des used
3.1GHz for the Linux box.

> kill(pid, 0): 10,000,000 iterations in 1,291,793 us
>
> I can't test a static build since RHEL6 does not provide a static libc.
>
> DES
> --
> Dag-Erling Smørgrav - des@des.no

--PoKbPPFu8MuDl6RC Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/QfJEACgkQC3+MBN1Mb4itsgCgsxTeKDTcDUfT3Q8hK0aYFBDs 0+sAoMzkk9S8GR9ivMLh2+70M0nWjqOz =tk9Z -----END PGP SIGNATURE----- --PoKbPPFu8MuDl6RC--

From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 11:02:53 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E4BDA106566B; Thu, 7 Jun 2012 11:02:53 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id 9F4508FC0C; Thu, 7 Jun 2012 11:02:53 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id 8D02868C0; Thu, 7 Jun 2012 11:02:52 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 270AF9A97; Thu, 7 Jun 2012 13:02:51 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Konstantin Belousov References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org>
<864nqovoek.fsf@ds4.des.no> <20120607064951.C1106@besplex.bde.org> <86sje7sf31.fsf@ds4.des.no> <20120607100401.GW85127@deviant.kiev.zoral.com.ua> Date: Thu, 07 Jun 2012 13:02:51 +0200 In-Reply-To: <20120607100401.GW85127@deviant.kiev.zoral.com.ua> (Konstantin Belousov's message of "Thu, 7 Jun 2012 13:04:01 +0300") Message-ID: <8662b3s7tw.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: John Baldwin , freebsd-arch@FreeBSD.org Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 11:02:54 -0000 Konstantin Belousov writes: > Linux seems slower probably due to slower CPU ? Mine is 3.4Ghz, while > des used 3.1Ghz for Linux box. I got better results on the same Linux box yesterday (by about 20%). I'm not sure what has changed. 
DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 08:26:18 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 464A2106564A; Thu, 7 Jun 2012 08:26:18 +0000 (UTC) (envelope-from des@des.no) Received: from smtp.des.no (smtp.des.no [194.63.250.102]) by mx1.freebsd.org (Postfix) with ESMTP id EB3AE8FC08; Thu, 7 Jun 2012 08:26:17 +0000 (UTC) Received: from ds4.des.no (smtp.des.no [194.63.250.102]) by smtp.des.no (Postfix) with ESMTP id E482C682D; Thu, 7 Jun 2012 08:26:10 +0000 (UTC) Received: by ds4.des.no (Postfix, from userid 1001) id 8DD0A9A65; Thu, 7 Jun 2012 10:26:10 +0200 (CEST) From: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= To: Bruce Evans References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no> <20120607064951.C1106@besplex.bde.org> Date: Thu, 07 Jun 2012 10:26:10 +0200 In-Reply-To: <20120607064951.C1106@besplex.bde.org> (Bruce Evans's message of "Thu, 7 Jun 2012 07:14:06 +1000 (EST)") Message-ID: <86sje7sf31.fsf@ds4.des.no> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Mailman-Approved-At: Thu, 07 Jun 2012 11:13:20 +0000 Cc: Gianni , John Baldwin , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@FreeBSD.org, Konstantin Belousov Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 08:26:18 -0000

Bruce Evans writes:
> Now 2.44
nsec/call makes sense, but you really should add some volatiles
> here to ensure that getpid() is not optimized away.

As you can see from the disassembly I provided, it isn't.

> SO it loops OK, but we can't see what getpid() does. It must not be
> doing much.

Umm, yes, that's the whole point of this conversation. Linux's getpid()
is not a syscall, but a library function that returns a constant from a
page shared by the kernel.

> 5.4104 nsec/call for gettimeofday() is impossible if there is any
> rdtsc() hardware call or much layering.

It's gettimeofday(0, 0), actually, so it doesn't need to read the clock.
If I pass a struct timeval as the first argument - so it *does* need to
read the clock - it's a little bit slower but still faster than an
actual system call. Here's another run that demonstrates this - a
little bit slower than previous runs because I have other processes
running:

getpid():             10,000,000 iterations in    30,377 us
gettimeofday(0, 0):   10,000,000 iterations in    55,571 us
gettimeofday(&tv, 0): 10,000,000 iterations in   302,634 us
kill(pid, 0):         10,000,000 iterations in 1,291,793 us

I can't test a static build since RHEL6 does not provide a static libc.
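For comparison with the nsec/call figures quoted earlier in this thread, the totals above can be converted to a per-call cost. The following is only a small sketch of that arithmetic; the helper name ns_per_call() is hypothetical and is not part of the benchmark program that produced these numbers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a total elapsed time in microseconds for a given number of
 * iterations into an average cost in nanoseconds per call.
 * (Hypothetical helper, for illustration only.)
 */
double
ns_per_call(uint64_t elapsed_us, uint64_t iterations)
{
	assert(iterations > 0);
	return ((double)elapsed_us * 1000.0 / (double)iterations);
}
```

Applied to the four rows above, ns_per_call(30377, 10000000) and its siblings give roughly 3.0, 5.6, 30.3 and 129.2 ns per call, so the gettimeofday(&tv, 0) case is about four times cheaper than the kill(pid, 0) call that enters the kernel.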
DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 12:37:51 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D1C9A106564A for ; Thu, 7 Jun 2012 12:37:51 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id A68A58FC19 for ; Thu, 7 Jun 2012 12:37:51 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 1CF94B922; Thu, 7 Jun 2012 08:37:51 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 7 Jun 2012 08:10:08 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120607084229.C1474@besplex.bde.org> In-Reply-To: <20120607084229.C1474@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201206070810.08166.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 07 Jun 2012 08:37:51 -0400 (EDT) Cc: Konstantin Belousov Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 12:37:51 -0000

On Wednesday, June 06, 2012 9:35:49 pm Bruce Evans wrote:
> On Wed, 6 Jun 2012, John Baldwin wrote:
>
> > On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote:
> >> A positive result from the recent flame-bait on arch@ is the working
> >> implementation of the fast gettimeofday(2) and
clock_gettime(2). The
> >> speedup I see is around 6-7x on the 2600K. I think the speedup could
> >> be even bigger on the previous generation of CPUs, where lock
> >> operations and syscall entry are costlier. A sample test runs of
> >> tools/tools/syscall_timing are presented at the end of message.
> >
> > In general this looks good but I see a few nits / races:
>
> It is awefully (sic) complete and large. The patch is almost twice as
> large as the entire kern_tc.c in FreeBSD-4, and that was quite bloated.
>
> > 1) You don't follow the model of clearing tk_current to 0 while you
> > are updating the structure that the in-kernel timecounter code
> > uses. This also means you have to avoid using a tk_current of 0
> > and that userland has to keep spinning as long as tk_current is 0.
> > Without this I believe userland can read a partially updated
> > structure.
>
> I thought that too at first, but after looking at the patch decided
> that it may be correct, but is too hard for me to understand.
> Urk, we both missed that tk_current is an index into the timehands
> array, so it cannot act as a generation count and it seems to be harder
> to lock.

Ugh, so it goes a long way to emulate the timehands array in userland.
As I mentioned previously, I consider the timehands array to be a bug.
However, I do think the generation count in the in-kernel timehands
structure is useful and should be kept (and follow the same model of
setting it to 0 before doing updates, then updating the structure, then
setting the new generation).
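The generation-count model described above (set the generation to 0, update the structure, then publish a new non-zero generation, with readers retrying if they see 0 or a changed value) can be sketched roughly as follows. This is a minimal illustration in portable C11 atomics under stated assumptions, not the code from the patch or from kern_tc.c: the structure layout, the field names th_sec and th_frac, and the function names are all hypothetical, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a timehands-like structure. */
struct timehands_sketch {
	atomic_uint th_gen;		/* 0 means "update in progress" */
	_Atomic uint64_t th_sec;	/* illustrative payload fields */
	_Atomic uint64_t th_frac;
};

/* Start with a valid (non-zero) generation and a zeroed payload. */
void
th_init(struct timehands_sketch *th)
{
	atomic_init(&th->th_gen, 1);
	atomic_init(&th->th_sec, 0);
	atomic_init(&th->th_frac, 0);
}

/* Single writer: invalidate, update, then publish a new generation. */
void
th_publish(struct timehands_sketch *th, uint64_t sec, uint64_t frac)
{
	unsigned int gen;

	gen = atomic_load_explicit(&th->th_gen, memory_order_relaxed);
	atomic_store_explicit(&th->th_gen, 0, memory_order_relaxed);
	/* Order the invalidating store before the payload stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->th_sec, sec, memory_order_relaxed);
	atomic_store_explicit(&th->th_frac, frac, memory_order_relaxed);
	if (++gen == 0)		/* never publish 0 as a valid generation */
		gen = 1;
	atomic_store_explicit(&th->th_gen, gen, memory_order_release);
}

/* Reader: retry while an update is in progress or one intervened. */
void
th_read(struct timehands_sketch *th, uint64_t *sec, uint64_t *frac)
{
	unsigned int gen;

	do {
		gen = atomic_load_explicit(&th->th_gen, memory_order_acquire);
		*sec = atomic_load_explicit(&th->th_sec, memory_order_relaxed);
		*frac = atomic_load_explicit(&th->th_frac, memory_order_relaxed);
		/* Order the payload loads before the re-check of th_gen. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->th_gen, memory_order_relaxed));
}
```

In this sketch the acquire load and release store play the roles that atomic_load_acq_int() and atomic_store_rel_int() would play in the kernel; the payload fields are declared atomic only so that the racy reads are well defined in C11.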
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 12:55:34 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D4FFD1065674 for ; Thu, 7 Jun 2012 12:55:34 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 916C78FC18 for ; Thu, 7 Jun 2012 12:55:34 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id D8930B978; Thu, 7 Jun 2012 08:55:33 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Date: Thu, 7 Jun 2012 08:50:55 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> In-Reply-To: <20120606205938.GS85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201206070850.55751.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 07 Jun 2012 08:55:34 -0400 (EDT) Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 12:55:35 -0000 On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote: > On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: > > On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote: > > > A positive result from the recent flame-bait on arch@ is the working > > > implementation of the fast 
gettimeofday(2) and clock_gettime(2). The
> > > speedup I see is around 6-7x on the 2600K. I think the speedup could
> > > be even bigger on the previous generation of CPUs, where lock
> > > operations and syscall entry are costlier. A sample test runs of
> > > tools/tools/syscall_timing are presented at the end of message.
> >
> > In general this looks good but I see a few nits / races:
> >
> > 1) You don't follow the model of clearing tk_current to 0 while you
> > are updating the structure that the in-kernel timecounter code
> > uses. This also means you have to avoid using a tk_current of 0
> > and that userland has to keep spinning as long as tk_current is 0.
> > Without this I believe userland can read a partially updated
> > structure.
> I changed the code to be much more similar to the kern_tc.c. I (re)added
> the generation field, which is set to 0 upon kernel touching timehands.

Thank you. BTW, I think we should use atomic_load_acq_int() on both accesses
to th_gen (and the in-kernel binuptime should do the same). I realize this
requires using rmb before the while condition in userland since we can't
use atomic_load_acq_int() here. I think it should also use
atomic_store_rel_int() for both stores to th_gen during the tc_windup()
callback.

> I think this can only happen if tc_windups occurs quite close in
> succession, or usermode thread is suspended for long enough. BTW,
> even generation could loop back to the previous value if thread is
> stopped.

Having the 32-bit generation count roll over should take a long while.

> > > sandy% /usr/home/pooma/build/bsd/DEV/stuff/tests/syscall_timing_32 gettimeofday
> > > Clock resolution: 0.000000076
> > > test          loop  time         iterations  periteration
> > > gettimeofday  0     1.000994225  21623297    0.000000046
> > > gettimeofday  1     1.000994980  21596492    0.000000046
> > > gettimeofday  2     1.001070595  21598326    0.000000046
> > > gettimeofday  3     1.000922308  21581398    0.000000046
> > > gettimeofday  4     1.000984264  21605539    0.000000046
> > > gettimeofday  5     1.000989697  21601659    0.000000046
> > > gettimeofday  6     1.000996261  21598385    0.000000046
> > > gettimeofday  7     1.001002223  21583933    0.000000046
> > > gettimeofday  8     1.000985847  21599442    0.000000046
> > > gettimeofday  9     1.000994977  21600935    0.000000046
> > > sandy% sudo sysctl kern.timecounter.fast_gettime=0
> >
> > I think this means you can call gettimeofday() in about 46 ns now
> > vs 310 the "old" way?
>
> Yes. This is for 32bit, while for 64 bit binaries the numbers are
> 155->25 ns on the same hw.

Ah, good. A non-generic hardcoded amd64 version is around 20ns, so this is
comparable.
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 16:07:30 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 10FAA1065675; Thu, 7 Jun 2012 16:07:30 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id D9AFC8FC21; Thu, 7 Jun 2012 16:07:29 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 3FDE7B95E; Thu, 7 Jun 2012 12:07:29 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 7 Jun 2012 09:56:02 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: In-Reply-To: MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201206070956.03129.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 07 Jun 2012 12:07:29 -0400 (EDT) Cc: Attilio Rao , alc@freebsd.org, Giovanni Trematerra , Konstantin Belousov , Alexander Kabaev Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 16:07:30 -0000 On Friday, June 01, 2012 1:53:15 pm Giovanni Trematerra wrote: > Hello, > I'd like to discuss a way to provide a mechanism to share some read-only > data between kernel and user space programs avoiding syscall overhead, > implementing some them, such as gettimeofday(3) and time(3) as ordinary > user space routine. 
> > The patch at > http://www.trematerra.net/patches/ksvar_experimental.patch I realize this thread descended a bit, and I do still think that Konstantin's patch is probably the right way forward for gettimeofday(). However, have you thought at all about a per-process page? There was another fork in this thread that dealt with per-process data such as getpid() (for which it does seem there are real-world uses). I realize the KSVAR stuff might not easily be adjusted to working with a per-process page (though Jeff did do something interesting with having a template page defined by DPCPU that was then copied for each CPU). It would also seem that for things like getpid(), getppid(), and getuid() it might be best to go the vdso route. Is that something you would be interested in working on? -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 17:28:51 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 635ED106566B; Thu, 7 Jun 2012 17:28:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id F0CF88FC0C; Thu, 7 Jun 2012 17:28:50 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q57HSeLB054444; Thu, 7 Jun 2012 20:28:40 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q57HSd0v030478; Thu, 7 Jun 2012 20:28:39 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q57HSd4F030477; Thu, 7 Jun 2012 20:28:39 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 
7 Jun 2012 20:28:39 +0300 From: Konstantin Belousov To: John Baldwin Message-ID: <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="YBGzgpgHAney5ErF" Content-Disposition: inline In-Reply-To: <201206070850.55751.jhb@freebsd.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 17:28:51 -0000 --YBGzgpgHAney5ErF Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote: > On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote: > > On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: > > > On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote: > > > > A positive result from the recent flame-bait on arch@ is the working > > > > implementation of the fast gettimeofday(2) and clock_gettime(2). The > > > > speedup I see is around 6-7x on the 2600K. I think the speedup could > > > > be even bigger on the previous generation of CPUs, where lock > > > > operations and syscall entry are costlier. 
A sample test runs of
> > > > tools/tools/syscall_timing are presented at the end of message.
> > >
> > > In general this looks good but I see a few nits / races:
> > >
> > > 1) You don't follow the model of clearing tk_current to 0 while you
> > > are updating the structure that the in-kernel timecounter code
> > > uses. This also means you have to avoid using a tk_current of 0
> > > and that userland has to keep spinning as long as tk_current is 0.
> > > Without this I believe userland can read a partially updated
> > > structure.
> > I changed the code to be much more similar to the kern_tc.c. I (re)added
> > the generation field, which is set to 0 upon kernel touching timehands.
>
> Thank you. BTW, I think we should use atomic_load_acq_int() on both accesses
> to th_gen (and the in-kernel binuptime should do the same). I realize this
> requires using rmb before the while condition in userland since we can't
> use atomic_load_acq_int() here. I think it should also use
> atomic_store_rel_int() for both stores to th_gen during the tc_windup()
> callback.

This is done. On the other hand, I removed a store_rel from updating
tk_current, since it is after the enabling store to th_gen, and the order
there does not matter.

I also did some restructuring of the userspace, removing layers that
Bruce did not like. Now top-level functions directly call binuptime().
I also shortened the preliminary operations by caching the timekeep pointer.
Its double-initialization is safe.

Latest version is at
http://people.freebsd.org/~kib/misc/moronix.4.patch

I will probably move all shared page helpers from kern_exec.c to a separate
file, but this will happen after moronix is committed.
--YBGzgpgHAney5ErF Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/Q5McACgkQC3+MBN1Mb4goxQCg1CEB9/qDJ7WNNVdNleSpqiUS kZwAniRrYMNQOjHycMeeoCOu4ixtChdl =j52Z -----END PGP SIGNATURE----- --YBGzgpgHAney5ErF-- From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 20:10:01 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 32A0F1065675; Thu, 7 Jun 2012 20:10:01 +0000 (UTC) (envelope-from alexander@leidinger.net) Received: from mail.ebusiness-leidinger.de (mail.ebusiness-leidinger.de [217.11.53.44]) by mx1.freebsd.org (Postfix) with ESMTP id A2FB78FC1A; Thu, 7 Jun 2012 20:09:59 +0000 (UTC) Received: from outgoing.leidinger.net (p4FC4380C.dip.t-dialin.net [79.196.56.12]) by mail.ebusiness-leidinger.de (Postfix) with ESMTPSA id 9B80B84473A; Thu, 7 Jun 2012 22:09:39 +0200 (CEST) Received: from unknown (IO.Leidinger.net [192.168.1.12]) by outgoing.leidinger.net (Postfix) with ESMTPS id BD8922B97; Thu, 7 Jun 2012 22:09:36 +0200 (CEST) Date: Thu, 7 Jun 2012 22:09:33 +0200 From: Alexander Leidinger To: Attilio Rao Message-ID: <20120607220933.00003865@unknown> In-Reply-To: References: <86bokyvtc2.fsf@ds4.des.no> X-Mailer: Claws Mail 3.7.10cvs42 (GTK+ 2.16.6; i586-pc-mingw32msvc) Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-EBL-MailScanner-Information: Please contact the ISP for more information X-EBL-MailScanner-ID: 9B80B84473A.A2700 X-EBL-MailScanner: Found to be clean X-EBL-MailScanner-SpamCheck: not spam, spamhaus-ZEN, SpamAssassin (not cached, score=-0.733, required 6, autolearn=disabled, ALL_TRUSTED -1.00, AWL 0.28, T_RP_MATCHES_RCVD -0.01) X-EBL-MailScanner-From: alexander@leidinger.net X-EBL-MailScanner-Watermark: 1339704580.62523@qvMjeo86+nGn4Gz9SRgRPw X-EBL-Spam-Status: No Cc: =?ISO-8859-1?Q?grav?= , Adrian 
Chadd , Dag-Erling, arch@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 20:10:01 -0000

On Tue, 5 Jun 2012 21:14:02 +0100 Attilio Rao wrote:

> 2012/6/5 Adrian Chadd :
> > Hi,
> >
> > I'm very tempted to make if_ath use KTR_DEV, but then have an extra
> > ath sysctl which does something like:
> >
> > if (sc->sc_ktr_enable)
> >     KTR();
>
> But the actual problem is that your output will be overwhelmed by the
> clutter of all the other KTR_DEV consumers.
>
> We very much need a much higher granularity on KTR classes and
> possibly a way to use it on-the-fly for kernel development and I think
> what I suggested earlier makes sense.

How many of the uncovered uses of KTR really need KTR (instead of
dtrace)? How many of them are time critical enough that dtrace is not
fast enough? How many of them need to run very early so that not enough
kernel infrastructure is available to run dtrace (can we run dtrace
scripts very early during boot (when enough kernel infrastructure is
available, before anything in userland starts) like in Solaris)?

Bye,
Alexander.
--
http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137

From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 20:44:46 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3FDAD106566C; Thu, 7 Jun 2012 20:44:46 +0000 (UTC) (envelope-from rysto32@gmail.com) Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com [209.85.215.182]) by mx1.freebsd.org (Postfix) with ESMTP id 25EF48FC15; Thu, 7 Jun 2012 20:44:41 +0000 (UTC) Received: by eaac13 with SMTP id c13so577717eaa.13 for ; Thu, 07 Jun 2012 13:44:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=nf0hL9PoXJ3UNUyQuR6Ouy/4h9VUIfDg+RT6VvUX+ak=; b=KuAS+YugtXK0NtcynR3mtDRtq8oW2mvePEX7LmQQ2UNJmsnnT90nzlX7dbehEkV/+R dW8pjgs57+RkNRoS6nKXlRxfiiU3WsjBX3FNMZqW7Np2TATEV3z26RHolURVz1D7q00N MS1aijIkdCoWaBFSpYKG20ZSlzuIWzobdrC35Y+uOR+wEkTmeomOkhaRGcCsFXkBJJh4 cZ0ot7YLdTkjBZ4JcjSnVpUO6pMuT1i0GQWQlfMy+LpohUw76xC3xqKwRt6B8FgfkREq P7aOvLS5jeA/Fqnyt07k/g/V126Ki2jV90WJAnFdQO4ll/6Ji6Thk06Q+gBLaL7Giytc AgXQ== MIME-Version: 1.0 Received: by 10.14.95.207 with SMTP id p55mr2040788eef.40.1339101881046; Thu, 07 Jun 2012 13:44:41 -0700 (PDT) Received: by 10.180.146.131 with HTTP; Thu, 7 Jun 2012 13:44:40 -0700 (PDT) In-Reply-To: <20120607220933.00003865@unknown> References: <86bokyvtc2.fsf@ds4.des.no> <20120607220933.00003865@unknown> Date: Thu, 7 Jun 2012 16:44:40 -0400 Message-ID: From: Ryan Stone To: Alexander Leidinger Content-Type: text/plain; charset=ISO-8859-1 Cc: Attilio Rao , grav , Adrian Chadd , arch@freebsd.org, Dag-Erling@freebsd.org Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: ,
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 20:44:46 -0000 On Thu, Jun 7, 2012 at 4:09 PM, Alexander Leidinger wrote: > How many of them need to run very early so that not enough > kernel infrastructure is available to run dtrace (can we run dtrace > scripts very early during boot (when enough kernel infrastructure is > available, before anything in userland starts) like in Solaris)? We don't currently have boot-time DTrace in FreeBSD. We also don't have post mortem DTrace (ie the equivalent of ktrdump -m). However, I would suspect that most of the cases in the tree where drivers have been checked in using KTR_SPARE could be replaced by DTrace. From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 20:47:29 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 74E22106566B; Thu, 7 Jun 2012 20:47:29 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net [IPv6:2001:470:1f10:75::2]) by mx1.freebsd.org (Postfix) with ESMTP id 3BE3D8FC14; Thu, 7 Jun 2012 20:47:29 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id A2522B948; Thu, 7 Jun 2012 16:47:28 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 7 Jun 2012 16:42:41 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; ) References: <86bokyvtc2.fsf@ds4.des.no> <20120607220933.00003865@unknown> In-Reply-To: <20120607220933.00003865@unknown> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201206071642.41216.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 07 Jun 2012 16:47:28 -0400 (EDT) Cc: Attilio Rao , Alexander Leidinger , Adrian Chadd , 
Dag-Erling@freebsd.org, grav Subject: Re: KTR_SPAREx X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 20:47:29 -0000

On Thursday, June 07, 2012 4:09:33 pm Alexander Leidinger wrote:
> On Tue, 5 Jun 2012 21:14:02 +0100 Attilio Rao
> wrote:
>
> > 2012/6/5 Adrian Chadd :
> > > Hi,
> > >
> > > I'm very tempted to make if_ath use KTR_DEV, but then have an extra
> > > ath sysctl which does something like:
> > >
> > > if (sc->sc_ktr_enable)
> > >     KTR();
> >
> > But the actual problem is that your output will be overwhelmed by the
> > clutter of all the other KTR_DEV consumers.
> >
> > We very much need a much higher granularity on KTR classes and
> > possibly a way to use it on-the-fly for kernel development and I think
> > what I suggested earlier makes sense.
>
> How many of the uncovered uses of KTR really need KTR (instead of
> dtrace)? How many of them are time critical enough that dtrace is not
> fast enough? How many of them need to run very early so that not enough
> kernel infrastructure is available to run dtrace (can we run dtrace
> scripts very early during boot (when enough kernel infrastructure is
> available, before anything in userland starts) like in Solaris)?

Can you run a dtrace script from ddb? (Hint: you can run 'show ktr' from
DDB, and you can use ktrdump on a crash dump to get a timeline of events
when doing post-mortem analysis.)
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 22:43:11 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 87CE1106564A; Thu, 7 Jun 2012 22:43:11 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 04F768FC0C; Thu, 7 Jun 2012 22:43:09 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q57Mh24b033391; Fri, 8 Jun 2012 01:43:02 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q57Mh1XM031780; Fri, 8 Jun 2012 01:43:01 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q57Mh18n031779; Fri, 8 Jun 2012 01:43:01 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 8 Jun 2012 01:43:01 +0300 From: Konstantin Belousov To: Peter Wemm Message-ID: <20120607224301.GB85127@deviant.kiev.zoral.com.ua> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no> <20120607064951.C1106@besplex.bde.org> <86sje7sf31.fsf@ds4.des.no> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="ds9maZbwT7uk2FVi" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 
autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , freebsd-arch@freebsd.org, Dag-Erling Sm?rgrav Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 22:43:11 -0000 --ds9maZbwT7uk2FVi Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jun 07, 2012 at 03:30:54PM -0700, Peter Wemm wrote: > On Thu, Jun 7, 2012 at 1:26 AM, Dag-Erling Sm?rgrav wrote: > > Bruce Evans writes: > >> Now 2.44 nsec/call makes sense, but you really should add some volatil= es > >> here to ensure that getpid() is not optimized away. > > > > As you can see from the disassembly I provided, it isn't. > > > >> SO it loops OK, but we can't see what getpid() does. =9AIt must not be > >> doing much. > > > > Umm, yes, that's the whole point of this conversation. =9ALinux's getpi= d() > > is not a syscall, but a library function that returns a constant from a > > page shared by the kernel. >=20 > It might be worth taking a peek at what they do before going too far > down the rabbit hole. They've had to deal with the whole ABI > stability vs kernel layout thing already. >=20 > As I recall, they literally embed a userland style .so shared object > into the kernel and make it available to the user. The dynamic linker > "finds" it via elf auxinfo and inserts it into the symbol search > order. >=20 > That way, the shared page layout is kernel specific. If they chose to > provide getpid() or gettimeofday() or whatever, its a matter of > adjusting the shared page and inserting code into the .so file. If > the page changes, the code changes. 
>=20 > Think of what we do with signal trampolines except in a way > ld-elf.so.1 can pull it into user space and gdb "sees" it as a .so > file with debug info. >=20 > I think I remember that they did the shared page thing and then > switched to providing a stub .so file. Yes, this is the thing called VDSO in the thread discussion. --ds9maZbwT7uk2FVi Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/RLnUACgkQC3+MBN1Mb4hULgCg6R/ekHO3tW9BYjjiMafdKXmR gccAoLWFdYgh2qDFvMB7fGpWn1myusKE =HShI -----END PGP SIGNATURE----- --ds9maZbwT7uk2FVi-- From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 22:47:05 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 31CB21065670 for ; Thu, 7 Jun 2012 22:47:05 +0000 (UTC) (envelope-from peter@wemm.org) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id DEBC58FC17 for ; Thu, 7 Jun 2012 22:47:04 +0000 (UTC) Received: by obcni5 with SMTP id ni5so1952667obc.13 for ; Thu, 07 Jun 2012 15:47:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wemm.org; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=ZaB+F0tRHi5JZqeDAo3qPkprunOQYi/KY9XrXhcuGNM=; b=f6nzVzE8ee7r/Kudk434F2fagmnXcM4gFqa4a0jTyx5xEbapI/cNYUFrtGHR/DtRlZ irJDhqc4Erq1am8qqM87pfDtNx+S7jyOLfe/GQ28tgDNk7Asp/E8kYN03NjqftpTmtry ujakbrRMNC4/jyS2X8BV5FvCjrm8oSqNDE3c4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding:x-gm-message-state; bh=ZaB+F0tRHi5JZqeDAo3qPkprunOQYi/KY9XrXhcuGNM=; b=nTGRU5M/+SDl4/7n9eMcyR413UlgJJ2C6nAccF3XiKNT/02mNImnlwmt8SI5xQPQxY 
4UeAgjh2Fjr3/8XXuPJbw3S7eRJxj2PV+RY3Q6RO/M5GmX1D3wNXBorDylp2eurAwvbh EFAGc5Wzo3vPFNxnbULfSC7t6cJmT10Jnc5rbBSSqVxssr77R3N5hzTwnU4uAd3Labe6 KaQ1z4P5jHDcItB+DUjVLHoAom1QQImfZ+PTXvr3SrwBRiWUxpKcYzy6gIyIRVQ9g5EZ 5KxFR28mFymu35rF5bl77zD3VMVNaMY1rcx2+64boQvSVAnQpD/wAFxASB5Cg0Lpf2rJ aUag== MIME-Version: 1.0 Received: by 10.60.172.195 with SMTP id be3mr3981812oec.48.1339109224093; Thu, 07 Jun 2012 15:47:04 -0700 (PDT) Received: by 10.182.115.35 with HTTP; Thu, 7 Jun 2012 15:47:04 -0700 (PDT) In-Reply-To: <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> Date: Thu, 7 Jun 2012 15:47:04 -0700 Message-ID: From: Peter Wemm To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Gm-Message-State: ALoCoQnX4FaIIZpmicN4p3UdGvue4oUpi+fiZK8hhtRZJmoHXmqMrlFQI/lLhYvMzrdLsHWnqGxa Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 22:47:05 -0000 On Thu, Jun 7, 2012 at 10:28 AM, Konstantin Belousov wrote: > On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote: >> On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote: >> > On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: >> > > On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote: >> > > > A positive result from the recent flame-bait on arch@ is the worki= ng >> > > > implementation of the fast gettimeofday(2) and clock_gettime(2). T= he >> > > > speedup I see is around 6-7x on the 2600K. 
I think the speedup could
>> > > > be even bigger on the previous generation of CPUs, where lock
>> > > > operations and syscall entry are costlier. A sample test runs of
>> > > > tools/tools/syscall_timing are presented at the end of message.
>> > >
>> > > In general this looks good but I see a few nits / races:
>> > >
>> > > 1) You don't follow the model of clearing tk_current to 0 while you
>> > >    are updating the structure that the in-kernel timecounter code
>> > >    uses.  This also means you have to avoid using a tk_current of 0
>> > >    and that userland has to keep spinning as long as tk_current is 0.
>> > >    Without this I believe userland can read a partially updated
>> > >    structure.
>> > I changed the code to be much more similar to the kern_tc.c. I (re)added
>> > the generation field, which is set to 0 upon kernel touching timehands.
>>
>> Thank you.  BTW, I think we should use atomic_load_acq_int() on both accesses
>> to th_gen (and the in-kernel binuptime should do the same).  I realize this
>> requires using rmb before the while condition in userland since we can't
>> use atomic_load_acq_int() here.  I think it should also use
>> atomic_store_rel_int() for both stores to th_gen during the tc_windup()
>> callback.
> This is done. On the other hand, I removed a store_rel from updating
> tk_current, since it is after enabling store to th_gen, and the order
> there does not matter.
>
> I also did some restructuring of the userspace, removing layers that
> Bruce did not liked. Now top-level functions directly call binuptime().
> I also shortened the preliminary operations by caching timekeep pointer.
> Its double-initialization is safe.
>
> Latest version is at
> http://people.freebsd.org/~kib/misc/moronix.4.patch
>
> I probably move all shared page helpers to separate file from kern_exec.c,
> but this will happen after moronix is committed.

Stepping back for a moment..
why even have a shared page at all, in common MI code?

The AMD64 kernel can simply make a page readable from within kernel
space since it's page level protected.

The i386 kernel needs the same treatment. We can save one clock cycle
from address generation by switching to page protection for the kernel
and using a full 4GB %cs/%ds/etc. With that fix we could do the same
there. I've been meaning to "fix" this for about 8 years now.

There would have been no need to allocate "space" in userland for
things like signal trampolines because it could be executed directly
from a kernel page by unprivileged user code.

Things like allocating a shared page could be a MD backend decision
for architectures that don't have page level access control for where
the kernel lives.

Things like tc_fill_vdso_timehands() could go away if userland could
be allowed to directly read the kernel's version. With a little
linker magic, the 'struct timehands' stuff could be marshaled into a
page and the auxinfo point to it.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution."
-- Robert Sewell From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 22:56:55 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2DE631065680; Thu, 7 Jun 2012 22:56:55 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id A1FFA8FC17; Thu, 7 Jun 2012 22:56:54 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q57MumjP036943; Fri, 8 Jun 2012 01:56:48 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q57MumBX031859; Fri, 8 Jun 2012 01:56:48 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q57MumTS031858; Fri, 8 Jun 2012 01:56:48 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 8 Jun 2012 01:56:48 +0300 From: Konstantin Belousov To: Peter Wemm Message-ID: <20120607225648.GC85127@deviant.kiev.zoral.com.ua> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="N+dhEFW7Y2Uiel/w" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 
3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 22:56:55 -0000 --N+dhEFW7Y2Uiel/w Content-Type: text/plain; charset=koi8-r Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jun 07, 2012 at 03:47:04PM -0700, Peter Wemm wrote: > On Thu, Jun 7, 2012 at 10:28 AM, Konstantin Belousov > wrote: > > On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote: > >> On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote: > >> > On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: > >> > > On Wednesday, June 06, 2012 12:51:15 pm Konstantin Belousov wrote: > >> > > > A positive result from the recent flame-bait on arch@ is the wor= king > >> > > > implementation of the fast gettimeofday(2) and clock_gettime(2).= The > >> > > > speedup I see is around 6-7x on the 2600K. I think the speedup c= ould > >> > > > be even bigger on the previous generation of CPUs, where lock > >> > > > operations and syscall entry are costlier. A sample test runs of > >> > > > tools/tools/syscall_timing are presented at the end of message. > >> > > > >> > > In general this looks good but I see a few nits / races: > >> > > > >> > > 1) You don't follow the model of clearing tk_current to 0 while you > >> > > =9A =9Aare updating the structure that the in-kernel timecounter c= ode > >> > > =9A =9Auses. =9AThis also means you have to avoid using a tk_curre= nt of 0 > >> > > =9A =9Aand that userland has to keep spinning as long as tk_curren= t is 0. > >> > > =9A =9AWithout this I believe userland can read a partially updated > >> > > =9A =9Astructure. > >> > I changed the code to be much more similar to the kern_tc.c. 
I (re)added
> >> > the generation field, which is set to 0 upon kernel touching timehands.
> >>
> >> Thank you.  BTW, I think we should use atomic_load_acq_int() on both accesses
> >> to th_gen (and the in-kernel binuptime should do the same).  I realize this
> >> requires using rmb before the while condition in userland since we can't
> >> use atomic_load_acq_int() here.  I think it should also use
> >> atomic_store_rel_int() for both stores to th_gen during the tc_windup()
> >> callback.
> > This is done. On the other hand, I removed a store_rel from updating
> > tk_current, since it is after enabling store to th_gen, and the order
> > there does not matter.
> >
> > I also did some restructuring of the userspace, removing layers that
> > Bruce did not liked. Now top-level functions directly call binuptime().
> > I also shortened the preliminary operations by caching timekeep pointer.
> > Its double-initialization is safe.
> >
> > Latest version is at
> > http://people.freebsd.org/~kib/misc/moronix.4.patch
> >
> > I probably move all shared page helpers to separate file from kern_exec.c,
> > but this will happen after moronix is committed.
>
> Stepping back for a moment.. why even have a shared page at all, in
> common MI code?
The decision to use a shared page is delegated to MD, but MI code handles
most of the details, since there is not much difference if a shared page is
used.

>
> The AMD64 kernel can simply make a page readable from within kernel
> space since it's page level protected.
All arches which use a shared page use it this way now. See below.

>
> The i386 kernel needs the same treatment. We can save one clock cycle
> from address generation by switching to page protection for the kernel
> and using a full 4GB %cs/%ds/etc. With that fix we could do the same
> there. I've been meaning to "fix" this for about 8 years now.
Sorry, I do not follow. Aren't we already using 4GB segments on i386?
>
> There would have been no need to allocate "space" in userland for
> things like signal trampolines because it could be executed directly
> from a kernel page by unprivileged user code.
This is how it is done already. But the shared page is mapped at a
fixed location in usermode, which simplifies things for debugging at least.

>
> Things like allocating a shared page could be a MD backend decision
> for architectures that don't have page level access control for where
> the kernel lives.
This is exactly how it is done now. The per-ABI struct sysentvec has a flag
indicating whether the shared page is needed for the ABI, and where to map it.

>
> Things like tc_fill_vdso_timehands() could go away if userland could
> be allowed to directly read the kernel's version. With a little
> linker magic, the 'struct timehands' stuff could be marshaled into a
> page and the auxinfo point to it.
I dislike the idea of directly exporting a kernel structure into userland,
since this makes it impossible to modify the kernel side of things. IMO a
rarely executed translation is not a problem, and I can control the ABI.
At least until I find time to implement VDSO, where the problem of ABI
stability for kernel->user transport will be solved completely.
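
[Editorial sketch, not part of the original message.] The generation-count
protocol discussed earlier in this message (the writer sets the generation
to 0 while it updates, the reader spins on 0 and re-checks the generation
after reading) can be illustrated roughly as below. This is a minimal
sketch under stated assumptions: it uses C11 atomics in place of FreeBSD's
atomic_load_acq_int(), atomic_store_rel_int() and rmb(), it assumes a
single updater, and the names demo_timehands, demo_init, demo_update and
demo_read are hypothetical and do not come from the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the exported data; not the layout in the patch. */
struct demo_timehands {
	atomic_uint		gen;	/* 0 means "update in progress" */
	_Atomic uint64_t	offset;	/* payload guarded by gen */
	_Atomic uint64_t	scale;	/* payload guarded by gen */
};

static void
demo_init(struct demo_timehands *th, uint64_t offset, uint64_t scale)
{
	atomic_init(&th->offset, offset);
	atomic_init(&th->scale, scale);
	atomic_init(&th->gen, 1);	/* start with a valid generation */
}

/* Writer (single updater assumed): invalidate, rewrite, publish. */
static void
demo_update(struct demo_timehands *th, uint64_t offset, uint64_t scale)
{
	unsigned int ngen;

	ngen = atomic_load_explicit(&th->gen, memory_order_relaxed) + 1;
	if (ngen == 0)		/* never publish 0; it means "busy" */
		ngen = 1;
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	/* Order the invalidation before the payload stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->offset, offset, memory_order_relaxed);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	/* Release store: payload is visible before the new generation. */
	atomic_store_explicit(&th->gen, ngen, memory_order_release);
}

/* Reader: retry while an update is in progress or the generation moved. */
static void
demo_read(struct demo_timehands *th, uint64_t *offset, uint64_t *scale)
{
	unsigned int gen;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		*offset = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		*scale = atomic_load_explicit(&th->scale,
		    memory_order_relaxed);
		/* Keep the payload loads ahead of the generation re-check. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

The acquire load and acquire fence in demo_read() play the role of the
atomic_load_acq_int()/rmb() pair proposed in the quoted text; whether that
cost is acceptable is exactly what the thread goes on to debate.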
--N+dhEFW7Y2Uiel/w Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/RMa8ACgkQC3+MBN1Mb4hAIwCgioJKGPnE7gfckztJYNCQJONj PZYAn0rdxvVdcGmz7iM5SYF8R67ivu7G =b1NG -----END PGP SIGNATURE----- --N+dhEFW7Y2Uiel/w-- From owner-freebsd-arch@FreeBSD.ORG Thu Jun 7 22:30:55 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 54CDC106566C for ; Thu, 7 Jun 2012 22:30:55 +0000 (UTC) (envelope-from peter@wemm.org) Received: from mail-ob0-f182.google.com (mail-ob0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 01EA18FC14 for ; Thu, 7 Jun 2012 22:30:54 +0000 (UTC) Received: by obcni5 with SMTP id ni5so1932921obc.13 for ; Thu, 07 Jun 2012 15:30:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=wemm.org; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=i0JQgQZ9xYJMjfa7QQfX3eHxrZLrBlf9C2/DZE1+xNw=; b=Qx3F3jdux4RfHWpVEb+oJCyAwiTfn3EvPwaA+/PybhA/wZzAn7euLLjNkE/gHpLWgn luEGvpPtKkRKrw5bZISBSxlxLUQLb5tEI3CwXxUb4bRval0uw8gr35K1viseU3onNDj/ lpfogLsqS6QrwaEHraXgzr8kxuMmDDUWIqSD4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding:x-gm-message-state; bh=i0JQgQZ9xYJMjfa7QQfX3eHxrZLrBlf9C2/DZE1+xNw=; b=B+AhmlLpCfJH+7ouyzEWQ00B8AwvzdY68PGkrbwksgncxsvovtsSs9v5BmZXg0SLyR DgXEPoi5JKppkIgulOft7PCNo6bKdEIgQAczxuzZ7pXr0F1LJ8BIPnCQqIwMDyBNO80V Bj6xgm2KfoI0WzfIAI1vZasEQHy/otmwLBWJpjYzjk5AVPRZ4/wtL0kz86ssPSnEyOzv RQr+QSDCjDqMYw7D3/Kagonhf4jl3nl5Y18jM8wGTcZf4FHu+w4Ffp1pQwyWzNwPyZnZ GOIzMqYfmXuLp+OuHXL6IMruOsv4Xp7Je8KQoM4mPhBf1Me8AooemSPjHgvUbofIfCRc 7rVA== MIME-Version: 1.0 Received: by 10.182.115.7 with SMTP id 
jk7mr3954542obb.9.1339108254334; Thu, 07 Jun 2012 15:30:54 -0700 (PDT) Received: by 10.182.115.35 with HTTP; Thu, 7 Jun 2012 15:30:54 -0700 (PDT) In-Reply-To: <86sje7sf31.fsf@ds4.des.no> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no> <20120607064951.C1106@besplex.bde.org> <86sje7sf31.fsf@ds4.des.no> Date: Thu, 7 Jun 2012 15:30:54 -0700 Message-ID: From: Peter Wemm To: =?ISO-8859-1?Q?Dag=2DErling_Sm=F8rgrav?= Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Gm-Message-State: ALoCoQn4quNIkLzAfusqu3DVMtJHj7w619QSrnyXc6rIPAMjx9uWJOnRHKIDGOstWgL6MoMmbBqS X-Mailman-Approved-At: Thu, 07 Jun 2012 23:14:33 +0000 Cc: Gianni , Alan Cox , Alexander Kabaev , Attilio Rao , Konstantin Belousov , freebsd-arch@freebsd.org, Konstantin Belousov Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 07 Jun 2012 22:30:55 -0000 On Thu, Jun 7, 2012 at 1:26 AM, Dag-Erling Sm=F8rgrav wrote: > Bruce Evans writes: >> Now 2.44 nsec/call makes sense, but you really should add some volatiles >> here to ensure that getpid() is not optimized away. > > As you can see from the disassembly I provided, it isn't. > >> SO it loops OK, but we can't see what getpid() does. =A0It must not be >> doing much. > > Umm, yes, that's the whole point of this conversation. =A0Linux's getpid(= ) > is not a syscall, but a library function that returns a constant from a > page shared by the kernel. It might be worth taking a peek at what they do before going too far down the rabbit hole. 
They've had to deal with the whole ABI stability vs kernel layout thing
already.

As I recall, they literally embed a userland style .so shared object
into the kernel and make it available to the user. The dynamic linker
"finds" it via elf auxinfo and inserts it into the symbol search
order.

That way, the shared page layout is kernel specific. If they chose to
provide getpid() or gettimeofday() or whatever, it's a matter of
adjusting the shared page and inserting code into the .so file. If
the page changes, the code changes.

Think of what we do with signal trampolines except in a way
ld-elf.so.1 can pull it into user space and gdb "sees" it as a .so
file with debug info.

I think I remember that they did the shared page thing and then
switched to providing a stub .so file.

--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution."
-- Robert Sewell From owner-freebsd-arch@FreeBSD.ORG Fri Jun 8 07:48:23 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6BC4C106566B; Fri, 8 Jun 2012 07:48:23 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail16.syd.optusnet.com.au (mail16.syd.optusnet.com.au [211.29.132.197]) by mx1.freebsd.org (Postfix) with ESMTP id 905258FC1B; Fri, 8 Jun 2012 07:48:22 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail16.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q587mCaB005326 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 8 Jun 2012 17:48:13 +1000 Date: Fri, 8 Jun 2012 17:48:12 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> Message-ID: <20120608155521.S1201@besplex.bde.org> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2012 07:48:23 -0000 On Thu, 7 Jun 2012, Konstantin Belousov wrote: > On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote: >> On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote: >>> On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: >>>> In general this looks good but I see a few 
nits / races:
>>>>
>>>> 1) You don't follow the model of clearing tk_current to 0 while you
>>>>    are updating the structure that the in-kernel timecounter code
>>>>    uses.  This also means you have to avoid using a tk_current of 0
>>>>    and that userland has to keep spinning as long as tk_current is 0.
>>>>    Without this I believe userland can read a partially updated
>>>>    structure.
>>> I changed the code to be much more similar to the kern_tc.c. I (re)added
>>> the generation field, which is set to 0 upon kernel touching timehands.
>>
>> Thank you.  BTW, I think we should use atomic_load_acq_int() on both accesses
>> to th_gen (and the in-kernel binuptime should do the same).  I realize this
>> requires using rmb before the while condition in userland since we can't
>> use atomic_load_acq_int() here.  I think it should also use
>> atomic_store_rel_int() for both stores to th_gen during the tc_windup()
>> callback.

The atomic_load_acq_int() (or rmb()) would completely defeat one of
the main points in the design of binuptime(), which was to be lock-free
so as to be efficient (the atomic_store_rel_int() is rarely done so
fixing it doesn't affect efficiency, especially on x86 after kib's
recent changes removed the serialization from it).  However, I now
think the acq part of the load is needed even on x86.  x86 allows loads
out of order, except in the case where the load is from the same
address as a previous store.  So no explicit memory barrier is needed
(on x86) for loads of th_generation to be ordered relative to stores
to th_generation.  But read barriers seem to be needed for loads of
the variables protected by th_generation to be ordered relative to
loads of th_generation.  An acq barrier for th_generation works
somewhat bogusly (on x86) by supplying a barrier for the one variable
that doesn't need it for ordering.

The correct fix seems to be to use time-domain locking even more: set
the timehands pointer to the previous generation instead of the current one.
Modulo other bugs, this gives >= 1 msec for the previous generation to
stabilize.  Speculative loads would have to be more than 1 msec in the
past to cause problems.  But they can't be, since the thread must have
been preempted for its speculative load to live that long, and the
preemption would/should have issued a barrier instruction.  Except when
the speculative load reaches a register before the preemption -- that
case is handled by the generation count: since the timehands being used
must be more than 1 generation behind for its th_generation to change,
the memory barrier instruction for the preemption ensures that the
change to th_generation is seen, so the new timehands is loaded.

Second thoughts about whether x86 needs the acq barrier: stores to all
the variables in tc_windup() are ordered by x86 memory semantics.  This
gives them a good ordering relative to the stores to th_generation, or
at least can do this.  A similar ordering is then forced for the loads
in binuptime() etc, since x86 memory semantics ensure that each load
occurs after the corresponding store to the same address.  Maybe this
is enough, or can be made to be enough with a more careful ordering of
the stores.  This is MD and hard to understand.

> This is done. On the other hand, I removed a store_rel from updating
> tk_current, since it is after enabling store to th_gen, and the order
> there does not matter.

Sigh.  The extremeness of some locking pessimizations on an Athlon64 i386
UP are:

    rdtsc                              takes  6.5 cycles
    rdtsc; movl mem,%ecx               takes  6.5 cycles
    xchgl mem,%ecx                     takes 32   cycles
    rdtsc; lfence; movl mem,%ecx       takes 34   cycles
    rdtsc; xchgl mem,%ecx              takes 38   cycles
    xchgl mem,%ecx; rdtsc              takes 40   cycles
    xchgl mem,%eax; rdtsc              takes 40   cycles
    rdtsc; xchgl mem,%eax              takes 44   cycles
    rdtsc; mfence; movl mem,%ecx       takes 52   cycles

So the software overheads are 5-8 times larger than the hardware overheads
for a TSC timecounter, even when we only lock a single load.
Later CPUs have much slower rdtsc, taking 40+ cycles, so the software
overheads are relatively smaller, especially since they are mostly in
parallel with the slow rdtsc.  On core2 i386 SMP:

    rdtsc                              takes 65 cycles (yes, 10x slower)
    rdtsc; movl mem,%ecx               takes 65 cycles
    xchgl mem,%ecx                     takes 25 cycles
    rdtsc; lfence; movl mem,%ecx       takes 73 cycles
    rdtsc; xchgl mem,%ecx              takes 74 cycles
    xchgl mem,%ecx; rdtsc              takes 74 cycles
    xchgl mem,%eax; rdtsc              takes 74 cycles
    rdtsc; xchgl mem,%eax              takes 74 cycles
    rdtsc; mfence; movl mem,%ecx       takes 69 cycles (yes, beats lfence)

Note that the get*() APIs have identical locking issues, so if you fix
them by adding memory barriers they will become slower than the current
non-get*() APIs are without locking, so their existence will be more
bogus than before (except with very slow timecounter hardware).

> I also did some restructuring of the userspace, removing layers that
> Bruce did not liked. Now top-level functions directly call binuptime().
> I also shortened the preliminary operations by caching timekeep pointer.
> Its double-initialization is safe.
>
> Latest version is at
> http://people.freebsd.org/~kib/misc/moronix.4.patch

Thanks.  I didn't look at the patch.  To be happy with it, I would require:
- about 1/4 the size of the first version (6K) for at least the pure
  timecounter parts
- fix old kernel bugs:
  - figure out what needs to be done for time-domain locking
  - fix the bug reported by jhb, that times can go backwards due to old
    timehands having a slightly different frequency.  (I tried to
    duplicate this in the kernel, but couldn't.  I used adjtime(2) with
    hacks to make it adjust the clock by +-0.5 seconds/second.  A loop
    with "adjtime 1000; adjtime -1000" then gives huge swings in the
    frequency.  But clock_gettime() didn't show any negative differences.
    I think the negative difference can't be smaller than ~100 nsec, and
    since the syscall takes longer than that even with the clock wrong by
    a factor of 2 due to the hacked adjtime, it can't see negative
    differences.)
  - figure out what TSC-low is fixing and fix it properly.  rdtsc is a
    non-serializing instruction.  Thus it can probably appear to go
    backwards.  TSC-low normally discards a factor of 128 of the
    precision of the TSC.  At 2GHz, this reduces the precision from
    0.5 nsec to 64 nsec.  Most negative differences would become 0.  I
    wonder if TSC-low is "working" just by hiding most negative
    differences.  But it can only hide most (about 127/128) of them.
    E.g., if the out-of-order times are 64 nsec and 63.5 nsec, then after
    discarding the low 7 bits (a factor of 128), the negative difference
    expands from -0.5 nsec to -64 nsec.

    Note that the large syscall overhead prevents seeing any small
    negative time differences from userland in the same way as above.
    But without that overhead, either in the kernel or in moronix
    userland, small negative time differences might be visible, depending
    on how small they are and on how efficient the functions are.

    TSC-low also breaks seeing small positive differences.  This breakage
    is visible if it is not hidden by syscall overhead or inefficient
    functions.  For some uses, truncating small positive differences to 0
    is just as bad as negative differences -- you can't distinguish
    separate events using their timestamps.  Unfortunately, timecounters
    with low resolution have this problem unavoidably.  A TSC should at
    least be able to distinguish events that are separate at the cycle
    level, though since the x86 TSC is non-serializing it has a different
    type of fuzziness.  This fuzziness shouldn't be fixed by adding
    serialization instructions for it (though one for acq may do this
    accidentally), since that would make it much slower.  rdtscp should
    rarely be used since it is serializing so it gives similar slowness.
    Does it do any more than "cpuid; rdtsc"?
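
[Editorial sketch, not part of the original message.] The TSC-low
arithmetic above can be checked with a few lines of C. This is only a
sketch under the message's own assumptions (a 2GHz TSC, so 0.5 nsec per
cycle, and a shift of 7 bits for the factor of 128); the names
raw_delta_ps and low_delta_ps are hypothetical, and the kernel's real
TSC-low code is not reproduced here.

```c
#include <stdint.h>

#define PS_PER_CYCLE	500	/* assumed 2GHz TSC: 0.5 nsec per cycle */
#define LOW_SHIFT	7	/* assumed shift: discards a factor of 128 */

/* Difference (second - first) of two raw TSC reads, in picoseconds. */
static int64_t
raw_delta_ps(uint64_t first, uint64_t second)
{
	return ((int64_t)(second - first) * PS_PER_CYCLE);
}

/* The same difference after both reads lose their low LOW_SHIFT bits. */
static int64_t
low_delta_ps(uint64_t first, uint64_t second)
{
	int64_t ticks;

	ticks = (int64_t)((second >> LOW_SHIFT) - (first >> LOW_SHIFT));
	return (ticks * (PS_PER_CYCLE << LOW_SHIFT));
}
```

With reads of 128 cycles (64 nsec) followed by 127 cycles (63.5 nsec),
raw_delta_ps(128, 127) is -500 psec while low_delta_ps(128, 127) is
-64000 psec, matching the -0.5 nsec to -64 nsec expansion described in
the text.  A pair inside one 128-cycle bucket, such as (200, 199) or
(129, 130), gives a low delta of 0, so both the negative and the small
positive difference are hidden.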
> I probably move all shared page helpers to separate file from kern_exec.c, > but this will happen after moronix is committed. It's still moronix? Why would we want that? :-) Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 8 08:03:52 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6235D106566B; Fri, 8 Jun 2012 08:03:52 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id E7F178FC0C; Fri, 8 Jun 2012 08:03:51 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q5883gvL010688 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 8 Jun 2012 18:03:44 +1000 Date: Fri, 8 Jun 2012 18:03:42 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120607091243.GV85127@deviant.kiev.zoral.com.ua> Message-ID: <20120608174919.S1594@besplex.bde.org> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <20120607130029.K1962@besplex.bde.org> <20120607091243.GV85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2012 08:03:52 -0000 On Thu, 7 Jun 2012, Konstantin Belousov wrote: > On Thu, Jun 07, 2012 at 01:00:34PM +1000, Bruce Evans wrote: >> >> 
tc_windup()'s close in succession are bugs, since they cycle the timehands >> faster than they were designed to be. We already have too many of these >> bugs (where tc_setclock() calls tc_windup(). I didn't notice this >> particular problem with it before). Now I will point out that version >> 2 of your patch adds more of these calls, apparently to get changes to >> happen sooner. But in sysctl_kern_timecounter_hardware(), such a call >> was intentionally left out since it is not needed. Note that tc_tick >> prevents calls to tc_windup() more often than about once per msec if >> hz > 1000. > No, I did not add more tc_windup calls. I added a recalculation > of the shared page content on the timecounter change, which is not > the same as tc_windup() call. This is exactly to handle a disable > of usermode rdtsc use when kernel timecounter hardware changes. Oops. I saw a parameter named tc_windup and didn't look too closely at the event handler for this. Please use a slightly different name. Frequent updates of the shared page may cause the same too-fast cycling as frequent calls to tc_windup(). Are event handlers rate-limited? If not, then someone changing the timecounter hardware from a loop in userland could cause similar problems to a settimeofday() loop. Both are privileged operations so this is not a large problem, but it is a stress test that should pass. >> [jhb wrote] >>> There was apparently another issue with version 2. The bcopy() is not >>> atomic, so potentially libc could read wrong tk_current. I redid >>> the interface to write to the shared page to allow use of real atomics. >> >> Timecounter code is supposed to be lock-free except for some time-domain >> locking. I only see 1 problem with this: where tc_windup() writes the >> generation count and other things without asking for these writes to >> be ordered. In most cases, the time-domain locking prevents problems.
> In fact, on x86 the ordering is strong enough that no barriers are needed, > this is why the problem goes unnoticed so far. Only the x86 write ordering is clearly strong enough (see another reply). Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 8 08:39:33 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D41A1106564A; Fri, 8 Jun 2012 08:39:33 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id 66C748FC0A; Fri, 8 Jun 2012 08:39:33 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q588dUPu023160 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 8 Jun 2012 18:39:31 +1000 Date: Fri, 8 Jun 2012 18:39:30 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <201206070810.08166.jhb@freebsd.org> Message-ID: <20120608180723.S1641@besplex.bde.org> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120607084229.C1474@besplex.bde.org> <201206070810.08166.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Konstantin Belousov , freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2012 08:39:33 -0000 On Thu, 7 Jun 2012, John Baldwin wrote: > On Wednesday, June 06, 2012 9:35:49 pm Bruce Evans wrote: >> On Wed, 6 Jun 2012, John Baldwin wrote: >>> 1) You don't follow the model 
of clearing tk_current to 0 while you >>> are updating the structure that the in-kernel timecounter code >>> uses. This also means you have to avoid using a tk_current of 0 >>> and that userland has to keep spinning as long as tk_current is 0. >>> Without this I believe userland can read a partially updated >>> structure. >> >> I thought that too at first, but after looking at the patch decided >> that it may be correct, but is too hard for me to understand. >> Urk, we both missed that tk_current is an index into the timehands >> array, so it cannot act as a generation count and it seems to be harder >> to lock. > > Ugh, so it goes a long way to emulate the timehands array in userland. As I > mentioned previously, I consider the timehands array to be a bug. However, I > do think the generation count in the in-kernel timehands structure is useful > and should be kept (and follow the same model of setting it to 0 before doing > updates, then updating the structure, then setting the new generation). Without the timehands array you will need slow atomic ops or maybe MD magic to make them unnecessary. I think 3 generations are necessary and sufficient: one pointed to by timehands for normal use; another that used to be pointed to by timehands and that remains valid for 1 more generation time after timehands was switched away from it; and one invalid/unready/being_updated one that will become the one pointed to by timehands 1 generation after it was updated. 2 generations are needed instead of 1 to allow updating one while the other remains usable, and 3 generations are needed instead of 2 to ensure that the one pointed to by timehands remains valid for a full generation time (average 1.5 generation times) after any read of timehands.
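The three-generation scheme described above can be sketched roughly as follows. This is not the kern_tc.c code: struct th_sketch, ring[], current, th_windup() and th_read() are hypothetical names, the payload is a stand-in for the real timehands fields, and the C11 acquire/release ordering used here is stricter than the barrier-free kernel code this thread is arguing about. The payload fields are plain (non-atomic) for brevity, which a strict reading of C11 would count as a data race under concurrent use.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NGEN 3	/* current, previous (still valid), being updated */

struct th_sketch {
	atomic_uint gen;	/* 0 means "being updated, do not use" */
	uint64_t offset;	/* stand-in for the real payload */
	uint64_t scale;
};

/* Slot 0 starts out valid (generation 1); the other slots are unready. */
static struct th_sketch ring[NGEN] = { { .gen = 1, .offset = 0, .scale = 1 } };
static _Atomic(struct th_sketch *) current = &ring[0];

/* Writer: fill the next slot, then switch the pointer to it. */
static void
th_windup(uint64_t offset, uint64_t scale)
{
	struct th_sketch *cur, *next;
	unsigned ogen;

	cur = atomic_load_explicit(&current, memory_order_relaxed);
	next = &ring[(size_t)(cur - ring + 1) % NGEN];
	ogen = atomic_load_explicit(&next->gen, memory_order_relaxed);

	/* Invalidate the slot before touching its payload. */
	atomic_store_explicit(&next->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	next->offset = offset;
	next->scale = scale;

	/* New generation, skipping the reserved value 0 on wrap. */
	if (++ogen == 0)
		ogen = 1;
	atomic_store_explicit(&next->gen, ogen, memory_order_release);
	atomic_store_explicit(&current, next, memory_order_release);
}

/* Reader: retry if the slot was invalid or changed during the read. */
static uint64_t
th_read(uint64_t delta)
{
	struct th_sketch *th;
	unsigned gen;
	uint64_t result;

	do {
		th = atomic_load_explicit(&current, memory_order_acquire);
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		result = th->offset + delta * th->scale;
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (result);
}
```

After th_windup() switches the pointer, the slot that was current is not touched again until two more updates have happened, which is the "previous generation stays valid" property the third slot buys.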
Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 8 09:16:24 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9F329106574B; Fri, 8 Jun 2012 09:16:24 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au [211.29.132.190]) by mx1.freebsd.org (Postfix) with ESMTP id 2809B8FC17; Fri, 8 Jun 2012 09:16:23 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q589G717025135 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 8 Jun 2012 19:16:09 +1000 Date: Fri, 8 Jun 2012 19:16:07 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120607100401.GW85127@deviant.kiev.zoral.com.ua> Message-ID: <20120608185204.T1708@besplex.bde.org> References: <201206051008.29568.jhb@freebsd.org> <86haupvk4a.fsf@ds4.des.no> <201206051222.12627.jhb@freebsd.org> <20120605171446.GA28387@onelab2.iet.unipi.it> <20120606040931.F1050@besplex.bde.org> <864nqovoek.fsf@ds4.des.no> <20120607064951.C1106@besplex.bde.org> <86sje7sf31.fsf@ds4.des.no> <20120607100401.GW85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Dag-Erling Sm??rgrav , freebsd-arch@freebsd.org Subject: Re: Fast vs slow syscalls (Re: Fwd: [RFC] Kernel shared variables) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2012 09:16:24 -0000 On Thu, 7 Jun 2012, Konstantin Belousov wrote: > On Thu, Jun 07, 2012 at 10:26:10AM +0200, Dag-Erling Sm??rgrav wrote: >> Bruce Evans writes: >>> Now 2.44 nsec/call 
makes sense, but you really should add some volatiles >>> here to ensure that getpid() is not optimized away. >> >> As you can see from the disassembly I provided, it isn't. >> >>> So it loops OK, but we can't see what getpid() does. It must not be >>> doing much. >> >> Umm, yes, that's the whole point of this conversation. Linux's getpid() >> is not a syscall, but a library function that returns a constant from a >> page shared by the kernel.

Of course, but we're down to nearly single-cycle times, so the difference between the library function using 1 or 2 instructions to load the value may be significant.

>>> 5.4104 nsec/call for gettimeofday() is impossible if there is any >>> rdtsc() hardware call or much layering. >> >> It's gettimeofday(0, 0), actually, so it doesn't need to read the clock. >> If I pass a struct timeval as the first argument - so it *does* need to >> read the clock - it's a little bit slower but still faster than an >> actual system call. Here's another run that demonstrates this - a >> little bit slower than previous runs because I have other processes >> running:
>>
>> getpid(): 10,000,000 iterations in 30,377 us
>> gettimeofday(0, 0): 10,000,000 iterations in 55,571 us
>> gettimeofday(&tv, 0): 10,000,000 iterations in 302,634 us
> So this timing seems to be approximately the same, to within an order of magnitude, > as the times I get for the patch, around 25 vs. 30 ns per gettimeofday() > call.

Not great. I get 6.97 nsec for a slightly reduced version of FreeBSD's 1998 version of microtime(), which was written in i386 asm. (This depends on rdtsc taking only 6.5 cycles = 3.25 nsec on the test CPU (Athlon64)).

From rev.1.40 of microtime.s:

% #include
%
% ENTRY(microtime)
%         movl    tsc_freq, %ecx
%         testl   %ecx, %ecx
%         je      i8254_microtime

This branch is predicted perfectly but costs 0.24 nsec (0.5 cycles).
%         rdtsc
%         subl    tsc_bias, %eax
%         mull    tsc_multiplier
%         movl    %edx, %eax
%         addl    timeoff+4, %eax         /* usec += time.tv_usec */
%         movl    timeoff, %edx           /* sec = time.tv_sec */

Similar to binuptime(). To convert from the old microtime.s, I just converted the variable names from aout to elf (and supplied dummy variables), and removed locking instructions (which were pushfl/cli/popfl).

%
%         cmpl    $1000000, %eax          /* usec valid? */
%         jb      1f
%         subl    $1000000, %eax          /* adjust usec */
%         incl    %edx                    /* bump sec */

Probably faster with bintimes (can be branch-free then (?)), but by converting directly to the final format we avoid a scaling step. The branch in it is predicted too perfectly by my dummy variables.

% 1:
%         movl    4(%esp), %ecx           /* load timeval pointer arg */
%         movl    %edx, (%ecx)            /* tvp->tv_sec = sec */
%         movl    %eax, 4(%ecx)           /* tvp->tv_usec = usec */
%
%         ret
%
% i8254_microtime:
%         ret                             /* XXX garbage */

> > Linux seems slower probably due to slower CPU ? Mine is 3.4Ghz, while > des used 3.1Ghz for Linux box.

If it is a different CPU model, the speed of rdtsc can vary a lot.
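For readers who do not want to trace the asm, the conversion that the microtime() listing above performs can be written out in C roughly like this. It is a sketch under assumptions: the TSC value is passed in as an argument instead of being read with rdtsc, the multiplier is taken to be 2^32 * 1000000 / tsc_freq (which is what the high half of the mull result implies), and struct timeval_sketch, tsc_multiplier_for() and microtime_sketch() are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

struct timeval_sketch {
	int32_t tv_sec;
	int32_t tv_usec;
};

/* Assumed definition: microseconds per cycle as a 0.32 fixed-point value. */
static uint32_t
tsc_multiplier_for(uint64_t tsc_freq)
{
	assert(tsc_freq > 1000000);	/* keeps the result below 2^32 */
	return ((uint32_t)((UINT64_C(1000000) << 32) / tsc_freq));
}

static void
microtime_sketch(uint32_t tsc_lo, uint32_t tsc_bias, uint32_t tsc_multiplier,
    const struct timeval_sketch *timeoff, struct timeval_sketch *tvp)
{
	uint32_t sec, usec;

	/* subl tsc_bias; mull tsc_multiplier; keep the high 32 bits (%edx) */
	usec = (uint32_t)(((uint64_t)(uint32_t)(tsc_lo - tsc_bias) *
	    tsc_multiplier) >> 32);
	usec += (uint32_t)timeoff->tv_usec;
	sec = (uint32_t)timeoff->tv_sec;

	/* cmpl $1000000; jb 1f; subl $1000000; incl %edx */
	if (usec >= 1000000) {
		usec -= 1000000;
		sec++;
	}
	tvp->tv_sec = (int32_t)sec;
	tvp->tv_usec = (int32_t)usec;
}
```

A frequency of 1,024,000,000 Hz is convenient for trying this out because the multiplier is then exactly 2^22, so 2048 cycles convert to exactly 2 usec with no truncation error.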
Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Jun 8 11:29:01 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A17321065674; Fri, 8 Jun 2012 11:29:01 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 0DBA08FC12; Fri, 8 Jun 2012 11:29:00 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q58BSoh7000719; Fri, 8 Jun 2012 14:28:50 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q58BSotU035924; Fri, 8 Jun 2012 14:28:50 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q58BSolw035923; Fri, 8 Jun 2012 14:28:50 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 8 Jun 2012 14:28:50 +0300 From: Konstantin Belousov To: Bruce Evans Message-ID: <20120608112850.GE85127@deviant.kiev.zoral.com.ua> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> <20120608155521.S1201@besplex.bde.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="mu4SaHkdL1Az71rA" Content-Disposition: inline In-Reply-To: <20120608155521.S1201@besplex.bde.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 
autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2012 11:29:01 -0000 --mu4SaHkdL1Az71rA Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Jun 08, 2012 at 05:48:12PM +1000, Bruce Evans wrote: > On Thu, 7 Jun 2012, Konstantin Belousov wrote: > > >On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote: > >>On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote: > >>>On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote: > >>>>In general this looks good but I see a few nits / races: > >>>> > >>>>1) You don't follow the model of clearing tk_current to 0 while you > >>>> are updating the structure that the in-kernel timecounter code > >>>> uses. This also means you have to avoid using a tk_current of 0 > >>>> and that userland has to keep spinning as long as tk_current is 0. > >>>> Without this I believe userland can read a partially updated > >>>> structure. > >>>I changed the code to be much more similar to the kern_tc.c. I (re)added > >>>the generation field, which is set to 0 upon kernel touching timehands. > >> > >>Thank you. BTW, I think we should use atomic_load_acq_int() on both > >>accesses > >>to th_gen (and the in-kernel binuptime should do the same). I realize > >>this > >>requires using rmb before the while condition in userland since we can't > >>use atomic_load_acq_int() here. I think it should also use > >>atomic_store_rel_int() for both stores to th_gen during the tc_windup() > >>callback.
> > The atomic_load_acq_int() (or rmb()) would completely defeat one of > the main points in the design of binuptime(), which was to be lock-free > so as to be efficient (the atomic_store_rel_int() is rarely done so > fixing it doesn't affect efficiency, especially on x86 after kib's > recent changes removed the serialization from it). However, I now think > the acq part of the load is needed even on x86. x86 allows loads out of > order, except in the case where the load is from the same address of a > previous store. So no explicit memory barrier is needed (on x86) for > loads of th_generation to be ordered relative to stores to th_generation. > But read barriers seem to be needed for loads of the variables protected > by th_generation to be ordered relative to loads of th_generation. An > acq barrier for th_generation works somewhat bogusly (on x86) by supplying > a barrier for the one variable that doesn't need it for ordering. load_acq is not a lock, it is serialization. > > The correct fix seems to be to use time-domain locking even more: set the > timehands pointer to the previous generation instead of the current one. > Modulo other bugs, this gives >= 1 msec for the previous generation to > stabilize. Speculative loads would have to be more than 1 msec in the > past to cause problems. But they can't be, since the thread must have > been preempted for its speculative load to live that long, and the > preemption would/should have issued a barrier instruction. Except when > the speculative load reaches a register before the preemption -- that case > is handled by the generation count: since the timehands being used must > be more than 1 generation behind for its th_generation to change, the > memory barrier instruction for the preemption ensures that the change to > th_generation is seen, so the new timehands is loaded.
> > Second thoughts about whether x86 needs the acq barrier: stores to all > the variables in tc_windup() are ordered by x86 memory semantics. This > gives them a good ordering relative to the stores to th_generation, or > at least can do this. A similar ordering is then forced for the loads > in binuptime() etc, since x86 memory semantics ensure that each load > occurs after the corresponding store to the same address. Maybe this > is enough, or can be made to be enough with a more careful ordering of > the stores. This is MD and hard to understand.

The ordering of loads reg. stores to the same address only happens on the same core. On x86, loads cannot be reordered with other loads, but potentially this could happen on other arches.

> > >This is done. On the other hand, I removed a store_rel from updating > >tk_current, since it is after enabling store to th_gen, and the order > >there does not matter. > > Sigh. The extremeness of some locking pessimizations on an Athlon64 i386 > UP are:
>
> rdtsc                            takes 6.5 cycles
> rdtsc; movl mem,%ecx             takes 6.5 cycles
> xchgl mem,%ecx                   takes 32 cycles
> rdtsc; lfence; movl mem,%ecx     takes 34 cycles
> rdtsc; xchgl mem,%ecx            takes 38 cycles
> xchgl mem,%ecx; rdtsc            takes 40 cycles
> xchgl mem,%eax; rdtsc            takes 40 cycles
> rdtsc; xchgl mem,%eax            takes 44 cycles
> rdtsc; mfence; movl mem,%ecx     takes 52 cycles
>
> So the software overheads are 5-8 times larger than the hardware overheads > for a TSC timecounter, even when we only lock a single load. Later CPUs > have much slower rdtsc, taking 40+ cycles, so the software overheads are > relatively smaller, especially since they are mostly in parallel with > the slow rdtsc. On core2 i386 SMP:

I suspect that what you measured for fence overhead is actually the time to retire the whole queue of read (and/or write) requests accumulated so far in the pipeline, and not the overhead of a synchronous rdtsc read.
>
> rdtsc                            takes 65 cycles (yes, 10x slower)
> rdtsc; movl mem,%ecx             takes 65 cycles
> xchgl mem,%ecx                   takes 25 cycles
> rdtsc; lfence; movl mem,%ecx     takes 73 cycles
> rdtsc; xchgl mem,%ecx            takes 74 cycles
> xchgl mem,%ecx; rdtsc            takes 74 cycles
> xchgl mem,%eax; rdtsc            takes 74 cycles
> rdtsc; xchgl mem,%eax            takes 74 cycles
> rdtsc; mfence; movl mem,%ecx     takes 69 cycles (yes, beats lfence)
>
> Note that the get*() APIs have identical locking issues, so if you fix > them by adding memory barriers they will become slower than the current > non-get*() APIs are without locking, so their existence will be more > bogus than before (except with very slow timecounter hardware). > > >I also did some restructuring of the userspace, removing layers that > >Bruce did not like. Now top-level functions directly call binuptime(). > >I also shortened the preliminary operations by caching timekeep pointer. > >Its double-initialization is safe. > > > >Latest version is at > >http://people.freebsd.org/~kib/misc/moronix.4.patch > > Thanks. I didn't look at the patch. To be happy with it, I would require: > - about 1/4 the size of the first version (6K) for at least the pure > timecounter parts

This might already happen, since I removed the layering you did not like from usermode.

> - fix old kernel bugs: > - figure out what needs to be done for time-domain locking > - fix the bug reported by jhb, that times can go backwards due to old > timehands having a slightly different frequency. > (I tried to duplicate this in the kernel, but couldn't. I used > adjtime(2) with hacks to make it adjust the clock by +-0.5 > seconds/second. A loop with "adjtime 1000; adjtime -1000" then > gives huge swings in the frequency. But clock_gettime() didn't > show any negative differences.
I think the negative difference > can't be smaller than ~100 nsec, and since the syscall takes > longer than that even with the clock wrong by a factor of 2 due to the > hacked adjtime, it can't see negative differences.) > - figure out what TSC-low is fixing and fix it properly. rdtsc is > a non-serializing instruction. Thus it can probably appear to go > backwards. TSC-low normally discards a factor of 128 of the precision > of the TSC. At 2GHz, this reduces the precision from 0.5 nsec to > 64 nsec. Most negative differences would become 0. I wonder if > TSC-low is "working" just by hiding most negative differences. > But it can only hide most (about 127/128) of them. E.g., if the > out-of-order times are 64 nsec and 63.5 nsec, then after discarding > the low 7 bits (a factor of 128), the negative difference expands from -0.5 nsec to > -64 nsec. The goal of the patch is only to move the code from kernel into userspace, trying not to change algorithms. The potential changes you describe above should be done both in kernel and in usermode after that. > > Note that the large syscall overhead prevents seeing any small > negative time differences from userland in the same way as above. > But without that overhead, either in the kernel or in moronix > userland, small negative time differences might be visible, > depending on how small they are and on how efficient the functions > are. So far I did not see time going backward in a tight gettimeofday() loop. This indeed is one of my main worries. > > TSC-low also breaks seeing small positive differences. This > breakage matters if it is not hidden by syscall overhead or inefficient > functions. For some uses, truncating small positive differences > to 0 is just as bad as negative differences -- you can't distinguish > separate events using their timestamps. Unfortunately, timecounters > with low resolution have this problem unavoidably.
A TSC should > at least be able to distinguish events that are separate at the > cycle level, though since the x86 TSC is non-serializing it has a > different type of fuzziness. This fuzziness shouldn't be fixed > by adding serialization instructions for it (though one for acq > may do this accidentally), since that would make it much slower. > rdtscp should rarely be used since it is serializing so it gives > similar slowness. Does it do any more than "cpuid; rdtsc"? rdtscp allows atomically getting the current package tsc counter and obtaining some reference to the current core identifier. If we produce 'skew tables' to compensate for different tsc initial values and possible drift, then we could use the tsc counter on a wider range of hardware, by adjusting the value returned from rdtsc by the skew table repair addendum. Rdtscp is atomic in this respect. > > >I probably move all shared page helpers to separate file from kern_exec.c, > >but this will happen after moronix is committed. > > It's still moronix? Why would we want that?
:-) >=20 > Bruce --mu4SaHkdL1Az71rA Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk/R4fIACgkQC3+MBN1Mb4hUfwCfWVsDpo5c5y89qYoQ8fjjnZcJ NZEAoMdzFuDdVtdE4xlacrcpES1fJ5Zr =94F2 -----END PGP SIGNATURE----- --mu4SaHkdL1Az71rA-- From owner-freebsd-arch@FreeBSD.ORG Fri Jun 8 12:43:42 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A003B1065688; Fri, 8 Jun 2012 12:43:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id 1A36A8FC14; Fri, 8 Jun 2012 12:43:41 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q58Chc8a011143 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 8 Jun 2012 22:43:39 +1000 Date: Fri, 8 Jun 2012 22:43:38 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov In-Reply-To: <20120608112850.GE85127@deviant.kiev.zoral.com.ua> Message-ID: <20120608215043.Q2736@besplex.bde.org> References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> <20120608155521.S1201@besplex.bde.org> <20120608112850.GE85127@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Fast gettimeofday(2) and clock_gettime(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2012 12:43:42 -0000 On Fri, 8 Jun 2012, Konstantin Belousov wrote: > On Fri, Jun 08, 2012 at 05:48:12PM +1000, Bruce Evans wrote: >> On Thu, 7 Jun 2012, Konstantin Belousov wrote: >> >>> On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote: >>>> >>>> Thank you. BTW, I think we should use atomic_load_acq_int() on both >>>> accesses >>>> to th_gen (and the in-kernel binuptime should do the same). I realize >>>> ... >> >> The atomic_load_acq_int() (or rmb()) would completely defeat one of >> the main points in the design of binuptime(), which was to be lock-free >> ... >> by th_generation to be ordered relative to loads of th_generation. An >> acq barrier for th_generation works somewhat bogusly (on x86) by supplying >> a barrier for the one variable that doesn't need it for ordering. > load_acq is not a lock, it is serialization. By "lock-free", I meant "free of all types of locks and atomic ops, including for example the x86 lock prefix which is not a lock but is often used to implement locks via its serialization properties". Then I wrote "barrier", but noted that this is acting strangely by turning th_generation into a locking gate that locks other variables. I think this depends on th_generation being loaded first (in program order) in binuptime() etc. It would be more natural to put the read barrier before the first read of another variable. >> Second thoughts about whether x86 needs the acq barrier: stores to all >> the variables in tc_windup() are ordered by x86 memory semantics. This >> gives them a good ordering relative to the stores to th_generation, or >> at least can do this. A similar ordering is then forced for the loads >> in binuptime() etc, since x86 memory semantics ensure that each load >> occurs after the corresponding store to the same address. Maybe this >> is enough, or can be made to be enough with a more careful ordering of >> the stores. 
This is MD and hard to understand. > The ordering of loads reg. stores to the same address only happens > on the same core. So my first thoughts were better. > On x86, loads cannot be reordered with other loads, > but potentially this could happen on other arches. I think you mean stores cannot be reordered with other stores. >>> This is done. On the other hand, I removed a store_rel from updating >>> tk_current, since it is after enabling store to th_gen, and the order >>> there does not matter. >> >> Sigh. The extremeness of some locking pessimizations on an Athlon64 i386 >> UP are: >> >> rdtsc takes 6.5 cycles >> rdtsc; movl mem,%ecx takes 6.5 cycles >> xchgl mem,%ecx takes 32 cycles >> rdtsc; lfence; movl mem,%ecx takes 34 cycles >> rdtsc; xchgl mem,%ecx takes 38 cycles >> xchgl mem,%ecx; rdtsc takes 40 cycles >> xchgl mem,%eax; rdtsc takes 40 cycles >> rdtsc; xchgl mem,%eax takes 44 cycles >> rdtsc; mfence; movl mem,%ecx takes 52 cycles All except the first 2 here are twice as high as they should be. >> So the software overheads are 5-8 times larger than the hardware overheads >> for a TSC timecounter, even when we only lock a single load. Later CPUs 2.5-4 times. >> have much slower rdtsc, taking 40+ cycles, so the software overheads are >> relatively smaller, especially since they are mostly in parallel with >> the slow rdtsc. On core2 i386 SMP: > I suspect that what you measured for fence overhead is actually a time > to retire whole queue or read (and/or write) requests accumulated so far > in the pipeline, and not the overhead of synchronous rdtsc read. Yes, full serialization probably takes much longer. I don't know of any better serialization instruction than cpuid (if rdtscp is not available). 
More times for Athlon64:

    rdtsc                            takes 6.5 cycles
    lfence; rdtsc                    takes 17 cycles
    rdtsc; lfence; movl mem,%ecx     takes 17 cycles (correction of above)
    cpuid; rdtsc                     takes 63 cycles

>> rdtsc                            takes 65 cycles (yes, 10x slower)
>> rdtsc; movl mem,%ecx             takes 65 cycles
>> xchgl mem,%ecx                   takes 25 cycles
>> rdtsc; lfence; movl mem,%ecx     takes 73 cycles
>> rdtsc; xchgl mem,%ecx            takes 74 cycles
>> xchgl mem,%ecx; rdtsc            takes 74 cycles
>> xchgl mem,%eax; rdtsc            takes 74 cycles
>> rdtsc; xchgl mem,%eax            takes 74 cycles
>> rdtsc; mfence; movl mem,%ecx     takes 69 cycles (yes, beats lfence)

These times (for core2) are correct. Now with cpuid:

    rdtsc                            takes 6.5 cycles
    lfence; rdtsc                    takes 75 cycles
    rdtsc; lfence; movl mem,%ecx     takes 73 cycles (correct above)
    cpuid; rdtsc                     takes 298 cycles (gak!)

>> - fix old kernel bugs: >> ... > The goal of the patch is only to move the code from kernel into userspace, > trying not to change algorithms. The potential changes you describe > above should be done both in kernel and in usermode after that.

I think being more efficient might expose more races. With syscalls, small and negative time differences can't be seen since the syscall takes longer. With kernel calls, small and negative time differences shouldn't happen since the kernel shouldn't be silly enough to spin calling a timecounter function.

>> Note that the large syscall overhead prevents seeing any small >> negative time differences from userland in the same way as above. >> But without that overhead, either in the kernel or in moronix >> userland, small negative time differences might be visible, >> depending on how small they are and on how efficient the functions >> are. > So far I did not see time going backward in tight gettimeofday() loop. > This indeed is one of my main worries.

Try taking out the shift. I plan to try to get out of order loads using cache misses. Not sure how that would give out of order rdtsc's.

>> ... >> different type of fuzziness.
This fuzziness shouldn't be fixed >> by adding serialization instructions for it (though one for acq >> may do this accidentally), since that would make it much slower. >> rdtscp should rarely be used since it is serializing so it gives >> similar slowness. Does it do any more than "cpuid; rdtsc"? > rdtscp allows to atomically get current package tsc counter and obtain > some reference to current core identifier. If we produce 'skew tables' > to compensate different tsc initial values and possible drift, then > we could use tsc counter on wider range of hardware, by adjusting > returned value from rdtsc by skew table repair addendum. Rdtscp is > atomic in this respect. rdtscp would be too slow if it is as slow as the above for cpuid; rdtsc. But at least early phenom docs say that both rdtsc and rdtscp take 41+6 cycles. I read a bit more of its documentation. It seems to be exactly rdtsc with the serialization and core number load, and without the register clobbering and extra overhead of the cpuid instruction. I only noticed the other day, when someone fixed it, that the kernel already has this skew adjustment in dtrace code. The adjustment was backwards... The index to the skew table is curcpu. There is a sched_pin() in the initialization code, but none in the timer read code, so I don't see how the latter can work right even if the adjustment is forwards. Unless the caller always does the sched_pin(), but that would be slow and probably undocumented. Bruce
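The skew-table idea discussed above, and the reason an unpinned curcpu lookup is suspect, can be illustrated with a small sketch. Everything here is hypothetical: struct tsc_sample stands for the (counter, CPU id) pair that rdtscp returns in one instruction (the id comes from the IA32_TSC_AUX MSR, and storing a CPU index there is an OS convention assumed for this example), tsc_skew[] and its sign convention (this CPU's counter minus a reference CPU's counter) are made up, and none of it is the dtrace code.

```c
#include <assert.h>
#include <stdint.h>

#define MAXCPU_SKETCH 4

/* What a single rdtscp would deliver: counter and CPU id, read together. */
struct tsc_sample {
	uint64_t tsc;
	uint32_t cpu;
};

/* Assumed convention: skew[cpu] = cpu's TSC minus the reference TSC. */
static int64_t tsc_skew[MAXCPU_SKETCH];

/* Map a raw per-CPU counter value onto the reference CPU's timeline. */
static uint64_t
tsc_adjust(struct tsc_sample s)
{
	assert(s.cpu < MAXCPU_SKETCH);
	return (s.tsc - (uint64_t)tsc_skew[s.cpu]);
}

/*
 * The racy variant: the counter was read on one CPU, but the table is
 * indexed by a CPU id looked up separately (e.g. after a migration).
 */
static uint64_t
tsc_adjust_racy(uint64_t tsc_read_earlier, uint32_t curcpu_now)
{
	struct tsc_sample s = { tsc_read_earlier, curcpu_now };

	return (tsc_adjust(s));
}
```

If CPU 1 runs 100 cycles ahead of the reference and CPU 2 runs 50 behind, a raw value of 1000 read on CPU 1 adjusts to 900. If the thread migrates to CPU 2 before the table lookup, the racy variant returns 1050 instead, an error of 150 cycles, which is the case that pinning or an atomic read of both values is meant to rule out.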