From owner-freebsd-arch@FreeBSD.ORG Thu Dec 15 03:40:57 2011 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 430761065673 for ; Thu, 15 Dec 2011 03:40:57 +0000 (UTC) (envelope-from lstewart@freebsd.org) Received: from lauren.room52.net (lauren.room52.net [210.50.193.198]) by mx1.freebsd.org (Postfix) with ESMTP id ABADD8FC0C for ; Thu, 15 Dec 2011 03:40:56 +0000 (UTC) Received: from lstewart.caia.swin.edu.au (lstewart.caia.swin.edu.au [136.186.229.95]) by lauren.room52.net (Postfix) with ESMTPSA id E76967E824; Thu, 15 Dec 2011 14:20:58 +1100 (EST) Message-ID: <4EE9679A.70909@freebsd.org> Date: Thu, 15 Dec 2011 14:20:58 +1100 From: Lawrence Stewart User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:8.0) Gecko/20111127 Thunderbird/8.0 MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <201111210417.pAL4HOdi023556@svn.freebsd.org> <4ED8575D.2010405@freebsd.org> <201112021927.25234.jkim@FreeBSD.org> <201112022002.49618.jkim@FreeBSD.org> <4EDB6814.9040403@freebsd.org> <405E9C58-F999-45C2-BC20-81ED4CA8A3E9@cubinlab.ee.unimelb.edu.au> In-Reply-To: <405E9C58-F999-45C2-BC20-81ED4CA8A3E9@cubinlab.ee.unimelb.edu.au> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=0.0 required=5.0 tests=UNPARSEABLE_RELAY autolearn=unavailable version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on lauren.room52.net Cc: Julien Ridoux , Benjamin Kaduk , Jung-uk Kim Subject: System clock APIs and feed-forward clock integration with BPF (was "Re: svn commit: r227778 - head/sys/net") X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Dec 2011 03:40:57 -0000 [Moving discussion to freebsd-arch@ and renaming thread] On 12/15/11 01:30, Julien Ridoux wrote: > > On 04/12/2011, at 11:31 PM, Lawrence Stewart wrote: > >> On 12/03/11 12:02, Jung-uk Kim wrote: >>> On Friday 02 December 2011 07:27 pm, Jung-uk Kim wrote: >>>> On Thursday 01 December 2011 11:43 pm, Lawrence Stewart wrote: >>>>> On 12/02/11 03:43, Jung-uk Kim wrote: >>>>>> On Thursday 01 December 2011 10:11 am, Lawrence Stewart wrote: >>>>>>> On 11/30/11 05:09, Jung-uk Kim wrote: >>>>>>>> On Tuesday 29 November 2011 11:13 am, Lawrence Stewart wrote: >> [snip] >>>>>>>>> Here's the first of the patches: >>>>>>>>> >>>>>>>>> http://people.freebsd.org/~lstewart/patches/misc/ffclock_bpf >>>>>>>>> _i nt act abi_10.x.r228130.patch >>>>>>>> >>>>>>>> I only glanced at it but it looks very close to what I wanted >>>>>>>> to suggest. >>>>>>> >>>>>>> Final candidate patch is at: >>>>>>> >>>>>>> http://people.freebsd.org/~lstewart/patches/misc/ffclock_bpf_i >>>>>>> nt act abi_10.x.r228180.patch >>>>>>> >>>>>>> Assuming it passes the "make tinderbox" build I'm currently >>>>>>> running and no further input is received from interested >>>>>>> parties, I plan to commit it in ~10 hours. >>>>>>> >>>>>>> Changes since the r228130 patch I sent previously: >>>>>>> >>>>>>> - The new flags in bpf.h are added unconditionally so that >>>>>>> they can always be referenced at compile time and a decision >>>>>>> made at runtime as to whether a flag will be set or not. Using >>>>>>> one of the new flags when the kernel doesn't have FFCLOCK >>>>>>> compiled in results in the flag being ignored. An app should >>>>>>> check for the existence of the "ffclock" kernel feature or the >>>>>>> "kern.sysclock" sysctl tree before attempting to use the >>>>>>> flags. >>>>>>> >>>>>>> - This patch will hopefully be MFCed at some point, so I added >>>>>>> a CTASSERT to bpf.c to ensure that the ABI of structs >>>>>>> bpf_hdr32, bpf_hdr and bpf_xhdr remains intact when FFCLOCK is >>>>>>> enabled and the union of a ffcounter and struct >>>>>>> timeval32/timeval/bpf_ts is switched in. >>>>>>> >>>>>>> - bpf_gettime() more comprehensively covers all the possible >>>>>>> cases of flag combination and does sensible things for each >>>>>>> case (even though some cases are rather silly). >>>>>>> >>>>>>> - The snippet of code at the beginning of catchpacket() that >>>>>>> was manipulating the struct bintime derived from bpf_gettime() >>>>>>> was gross and has been removed in favour of selecting the >>>>>>> right {get}bin{up}time() function call in bpf_gettime(). >>>>>> >>>>>> I did that to reduce branching. Since you are introducing more >>>>>> branches, it warrants a function pointer now. >>>> >>>> There's another reason, BTW. If mbufs are tagged with timestamps >>>> (where only monotonic timestamps are allowed), they must be >>>> converted within the bpf.c. I forgot all the details. :-( >> >> We should document this knowledge in some code comments. >> >>>>> I see, thanks for the explanation. Could you elaborate a bit more >>>>> about how you would implement the function pointer idea? I'm also >>>>> curious in the !FFCLOCK case just how much overhead having the >>>>> 2-layer nested if/else adds. I'm not an very optimisation savvy >>>>> person, but I'm wondering if it's actually worth micro-optimising >>>>> this code. >>>>> >>>>> My initial thoughts about your function pointer idea lead to >>>>> adding a function pointer in the bpf_d and setting it to the >>>>> appropriate function to get the timestamp from at bpf_d creation >>>>> or ioctl time. Whilst I like this idea, I can't see how it would >>>>> work given that the various functions involved in time/ffcounter >>>>> stamp generation all have different signatures. >>>>> >>>>> We could have multiple variants of bpf_gettime() which each call >>>>> the appropriate underlying functions to generate the appropriate >>>>> stamp. Would add quite a lot of code but would reduce the >>>>> overhead of calling bpf_gettime() to an indirect function call + >>>>> the underlying stamp generation function call. This also solves >>>>> the problem of multiple function signatures. >>>> >>>> Please see my patch: >>>> >>>> http://people.freebsd.org/~jkim/bpf_ffclock.diff >>> >>> I booted up the kernel and found it just crashes because of stupid >>> typos. Now a new patch is uploaded in place. Sorry. >>> >>> Anyway, I see no regression nor ABI breakage. :-) >> >> struct bpf_d being part of the ABI was the main thing I was concerned about, so given that it isn't I like your approach a lot. >> >> As noted by Julien, this approach does introduce problems with respect to the follow up patch that adds a permanent bh_ffcounter member to the bpf header. I thought the fromclock() API would sufficiently meet the needs of consumers like BPF, but if we were to proceed with something like Jung-uk's proposed patch I don't think that's true anymore. >> >> We decided to bite the bullet and devise an API that is more compact and can return all appropriate time information from all underlying sysclocks in an efficient manner - something like a more generic version of ffclock_abstime(). Julien and I spent quite some time today nutting out details, and Julien has done a proof of concept patch against my proposed BPF patch which looks good as a starting point for discussion: >> >> http://www.cubinlab.ee.unimelb.edu.au/~jrid/patches/r228254/sysclock_snap.patch >> >> It needs a bit more work and should be split into two patches (one introducing the API and the other being the BPF changes). >> >> With something like this though, the BPF patch becomes radically simplified in the FFCLOCK case. We don't even need a function pointer cached in the bpf_d anymore, but could cache a struct sysclock_snap there instead if we really wanted to minimise overhead in bpf_gettime(). >> >> We'll have a go at refining the patch tomorrow hopefully, but wanted to put this out there for early consideration. >> >> Cheers, >> Lawrence > > > Hi Jung-uk (and all), > > It took us a bit of time but we have a new patch. There are a lot of changes in the patch, and I will start by summarising our motivation for the changes. Hopefully I won't forget anything, but if I leave something unclear, don't hesitate to ask Lawrence or me. > > Starting with the easy bits, we drop the union in the BPF header and followed something similar to what Ben and you proposed. > > Next, the core motivation was to ensure that the timestamping function be called once only per packet captured, while keeping the ability to read both clocks and return the time from one of the two to the user. This forced us to redesign the timestamping function and one thing leading to another a fair bit of the BPF code. > > There 3 orthogonal parameters to the timestamping function in the BPF context: > - which clock is used to return the time > - what is the format of the value returned > - and a performance issue (nothing to do with timestamping or timekeeping): reading the hardware counter or not > > We handle the first one with a new sysclock_getsnapshot() which saves both clock parameters when timestamping and potentially reads the hardware counter. > The second one has been covered in previous discussion with the addition of BPF_T_FFCOUNTER. > > The performance issue created problems. In the current code, the order with which the BPF devices are open on one interface impacts the timestamping function. Let's consider two BPF devices respectively open with BPF_TSTAMP_NORMAL and BPF_TSTAMP_FAST quality. If NORMAL is open after FAST, it is at the head of the list, the hardware counter is read, and FAST benefits from higher quality timestamps. If FAST is open after NORMAL, the timestamping function is called once and return last kernel tick time, then it is called a second time for NORMAL, this time accessing the hardware counter. > > None of these cases are satisfactory. This lead to inconsistent behaviour based on the order the BPF are open. An application can impact the settings of other BPF devices and other applications / kernel consumers. The timestamping suffers from higher latency and worse, higher variability of the latency, which is bad. > > We then took a different approach, which is inspired by your work on BPF. We believe the FAST/NORMAL settings should have 2 level of granularity: system wide and per interface. The system wide is the default behaviour (for example all BPF devices are NORMAL) and the interface one super-seeds the system one. This is implemented by the use of new sysctl living in the net.bpf tree. > > This enables many scenarios such as: > - system default is FAST to improve performance because the system uses a slow timecounter (eg HPET or ACPI). > - the system has 3 interfaces. Interface 2 is put in NORMAL mode (need for accuracy but not much traffic), and the last can do hardware timestamping and is put in BPF_TSTAMP_EXTERNAL mode. > > This then naturally led to store the performance setting (none/fast/normal/external) for each interface instead of each BPF descriptor. > > Each BPF descriptor still maintains the other 2 settings: the choice of clock and the format of the timestamp independently from each other. > > We believe this new patch keeps the spirit of the changes you proposed and improves it while accomodating the changes required by the feed-forward clock and future evolutions with things like IEEE 1588 compatible NICs. > > We plan to push this patch as soon as possible. We would really appreciate if you could comment on this long email and pach very soon. > > The patch can be found here: > http://www.cubinlab.ee.unimelb.edu.au/~jrid/patches/r228429-v4/ffclock_bpf_intactabi.patch > > We will adjust the documentation as well once we have reached consensus. > > Julien (and Lawrence of course)