From: Attilio Rao
To: freebsd-arch@freebsd.org, Gianni, Alexander Kabaev, Alan Cox, Konstantin Belousov
Date: Sat, 2 Jun 2012 18:00:06 +0100
Subject: Fwd: [RFC] Kernel shared variables
References: <20120601193522.GA2358@deviant.kiev.zoral.com.ua> <20120602164847.GB2358@deviant.kiev.zoral.com.ua>
List-Id: Discussion related to FreeBSD architecture

Sorry, resending with all the recipients in.

Attilio


---------- Forwarded message ----------
From: Attilio Rao
Date: 2012/6/2
Subject: Re: [RFC] Kernel shared variables
To: Konstantin Belousov

2012/6/2 Konstantin Belousov:
> On Sat, Jun 02, 2012 at 02:01:35PM +0100, Attilio Rao wrote:
>> 2012/6/1 Konstantin Belousov:
>> > On Fri, Jun 01, 2012 at 07:53:15PM +0200, Giovanni Trematerra wrote:
>> >> Hello,
>> >> I'd like to discuss a way to provide a mechanism to share some read-only
>> >> data between kernel and user space programs avoiding syscall overhead,
>> >> implementing some of them, such as gettimeofday(3) and time(3), as ordinary
>> >> user space routines.
>> >>
>> >> The patch at
>> >> http://www.trematerra.net/patches/ksvar_experimental.patch
>> >>
>> >> is in a very experimental stage. It's just a proof-of-concept.
>> >> Only works for an AMD64 kernel and only for 64-bit applications.
>> >> The idea is to have all the variables that we want to share between kernel
>> >> and user space in one or more consecutive pages of memory that will be
>> >> mapped read-only into every running process. At the start of the first
>> >> shared page there'll be a table with as many entries as the number of
>> >> the shared variables.
>> >> Each entry is a 32-bit value that is the offset between the start of the shared
>> >> page and the start of the variable in the page. The user space processes need
>> >> to find out the map address of the shared page and use the table to access the
>> >> shared variables.
>> >> Kernel will export a variable to user space as an index, so user space code
>> >> must refer to a specific index to access a kernel shared variable.
>> >> Let's take a quick look at the KPI/API for exporting/importing kernel
>> >> shared variables.
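A minimal sketch of the offset-table lookup described above, written in plain C against an ordinary buffer rather than a real kernel mapping. The names sketch_ksvar_lookup, sketch_demo and SKETCH_TABLE_SIZE are hypothetical and do not come from the patch; only the layout (a table of 32-bit offsets at the start of the page, one entry per index) is taken from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table size; the patch calls its own constant KSVAR_TABLE_SIZE. */
#define SKETCH_TABLE_SIZE 2

/*
 * Resolve a shared variable. The page begins with a table of 32-bit
 * offsets; entry `idx` is the distance from the page start to the
 * variable. Returns NULL for a missing page or an out-of-range index.
 */
const void *
sketch_ksvar_lookup(const void *page, uint32_t idx)
{
	uint32_t off;

	if (page == NULL || idx >= SKETCH_TABLE_SIZE)
		return (NULL);
	memcpy(&off, (const char *)page + idx * sizeof(off), sizeof(off));
	return ((const char *)page + off);
}

/*
 * Build a fake "shared page" in ordinary memory and read one variable
 * back through the table. Returns -1 if the index cannot be resolved.
 */
int32_t
sketch_demo(uint32_t idx)
{
	static unsigned char page[64];
	const uint32_t table[SKETCH_TABLE_SIZE] = { 8, 12 };
	const int32_t a = 42, b = 7;
	const void *p;
	int32_t out;

	memcpy(page, table, sizeof(table));	/* offsets at the page start */
	memcpy(page + 8, &a, sizeof(a));	/* variable for index 0 */
	memcpy(page + 12, &b, sizeof(b));	/* variable for index 1 */

	p = sketch_ksvar_lookup(page, idx);
	if (p == NULL)
		return (-1);
	memcpy(&out, p, sizeof(out));
	return (out);
}
```

The bounds check on the index is an addition of this sketch; the description above does not say how an unknown index is handled.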
>> >> Say we want to implement a routine to export an int from the kernel.
>> >> To define the variable to be exported inside the kernel you would use
>> >>
>> >> KSVAR_DEFINE(0, int, test_value);
>> >>
>> >> You have just defined an int variable named "test_value" at index 0.
>> >> Inside the kernel you can write/read as usual using the symbol test_value.
>> >> Now you likely want to add to libc a function callable from user processes
>> >> that returns the test_value variable. So first of all you need to import the
>> >> variable.
>> >>
>> >> KSVAR_IMPORT(0, int, test_value);
>> >>
>> >> and to obtain a pointer to read the value you would use
>> >>
>> >> KSVAR(test_value);
>> >>
>> >> so your function would look something like this
>> >>
>> >> int get_test_value()
>> >> {
>> >>
>> >>      return (*KSVAR(test_value));
>> >> }
>> >>
>> >> Then inside your process just call the get_test_value() function as you usually
>> >> do and you'll get a kernel written value without switching to kernel mode.
>> >>
>> >> Let's see now in more detail how that could be accomplished.
>> >> The shared variables will be accessed as normal variables and are read/write
>> >> inside the kernel. The variables need to be inside the same page(s) and nothing
>> >> but the shared variables (and the table) must be in the page(s). To
>> >> obtain that I changed the linker script in this way
>> >>
>> >> --- a/sys/conf/ldscript.amd64
>> >> +++ b/sys/conf/ldscript.amd64
>> >> @@ -177,6 +177,15 @@ SECTIONS
>> >>     *(.ldata .ldata.* .gnu.linkonce.l.*)
>> >>     . = ALIGN(. != 0 ? 64 / 8 : 1);
>> >>   }
>> >> +  .ksvar ALIGN(CONSTANT (COMMONPAGESIZE)) :
>> >> +  {
>> >> +    __ksvar_set_start = .;
>> >> +    *(.ksvar_table)
>> >> +    *(.ksvar)
>> >> +
>> >> +   . = ALIGN(CONSTANT (COMMONPAGESIZE));
>> >> +   __ksvar_set_stop = .;
>> >> +  }
>> >>   . = ALIGN(64 / 8);
>> >>   _end = .; PROVIDE (end = .);
>> >>   . = DATA_SEGMENT_END (.);
>> >>
>> >> When we want to define a variable in the kernel to share with user space
>> >> we have to use the KSVAR_DEFINE macro in sys/sys/ksvar.h
>> >>
>> >> +struct ksvar_set {
>> >> +       uint32_t idx;
>> >> +       char *pksvar;
>> >> +};
>> >> +
>> >> +/*
>> >> + * Declare a variable into kernel shared linker_set.
>> >> + */
>> >> +#define        KSVAR_DEFINE(index, type, name)                 \
>> >> +       static type name __section(".ksvar");                   \
>> >> +       static struct ksvar_set name ## _ksvar_set = {          \
>> >> +               .idx = index,                                   \
>> >> +               .pksvar = (char *) &name                        \
>> >> +       };                                                      \
>> >> +       DATA_SET(ksvar_set, name ## _ksvar_set)
>> >>
>> >> Every variable must have a unique index. The indexes must
>> >> start from zero and be consecutive. When you add an index
>> >> you must bump the size of the table (KSVAR_TABLE_SIZE)
>> >> (see sys/sys/ksvar.h)
>> >>
>> >> The variables are inside the kernel static image that isn't managed
>> >> by the VM and so we need to allocate pages to map the physical addresses.
>> >> A new SYSINIT (ksvarinit) will allocate a set of vm_page_t through
>> >> the vm_phys_fictitious_reg_range interface and fill the table using
>> >> the information of the ksvar_set linker set, then will create a
>> >> vm_object_t (vm_object_ksvar),
>> >> mark the fake pages as valid and put them into it.
>> >> When a new process is created by exec(3) the vm_object_ksvar will be
>> >> mapped read-only into the process address space by the vm_map_fixed routine
>> >> just before mapping the user stack. The address of the mapping will be recorded
>> >> inside the new p_ksvar field of the struct proc.
>> >> This field will be exported through a sysctl to the user space processes.
>> >> In order to implement syscalls as user space routines, we have to find out the
>> >> mapped address of the kernel shared variables when the libc is mapped into
>> >> the process. So I added a function marked with the attribute constructor.
>> >> It will be called before any code in the user process and before any code inside
>> >> the libc.
>> >>
>> >> +__attribute((constructor)) void init_kernel_shared()
>> >> +{
>> >> +       int mib[2];
>> >> +       size_t len;
>> >> +       vm_offset_t ksvar_address;
>> >> +
>> >> +       mib[0] = CTL_KERN;
>> >> +       mib[1] = KERN_KSVAR;
>> >> +       len = sizeof(vm_offset_t);
>> >> +       if (__sysctl(mib, 2, (void *) &ksvar_address, &len, NULL, 0) != -1)
>> >> +               ksvar_table = (uint32_t *) ksvar_address;
>> >> +}
>> >>
>> >> Once the libc knows the address of the table it can access the shared
>> >> variables.
>> >>
>> >> Just as proof of concept I re-implemented gettimeofday(3) in user space.
>> >> First of all I didn't remove the entry in syscall.master, I just renamed it to
>> >> sys_gettimeofday. I need it for the fallback path.
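The discover-then-fall-back pattern described above can be sketched in isolation. This is only an illustration under assumptions: every sketch_ name is hypothetical, the "system call" is a stub that counts its invocations, and the discovery step is an explicit call instead of a constructor issuing a sysctl as in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the pointer the libc constructor would set. */
static const int *sketch_shared_value;	/* NULL until discovery succeeds */
static int sketch_slow_calls;		/* how often the fallback ran */

/* Stub for the renamed system call kept for the fallback path. */
static int
sketch_sys_get_value(void)
{
	sketch_slow_calls++;
	return (1);
}

/* Stub for the discovery step (a sysctl query in the patch). */
void
sketch_discover(const int *mapped)
{
	assert(mapped != NULL);
	sketch_shared_value = mapped;
}

/* Fast path when the shared mapping is known, slow path otherwise. */
int
sketch_get_value(void)
{
	if (sketch_shared_value == NULL)
		return (sketch_sys_get_value());
	return (*sketch_shared_value);
}

/* Expose the counter so a caller can see which path was taken. */
int
sketch_fallback_count(void)
{
	return (sketch_slow_calls);
}
```

Before sketch_discover() is called every read goes through the stubbed slow path; afterwards reads come straight from memory and the fallback counter stops increasing.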
>> >> In the kernel I introduced a struct wall_clock.
>> >>
>> >> +struct wall_clock
>> >> +{
>> >> +       struct timeval  tv;
>> >> +       struct timezone tz;
>> >> +};
>> >>
>> >> The struct is exported through the sys/sys/time.h header.
>> >> I defined a new kernel shared variable. To do so I added an index,
>> >> WALL_CLOCK_INDEX, in sys/sys/ksvar.h and bumped KSVAR_TABLE_SIZE to 1.
>> >> In sys/kern/kern_clocksource.c
>> >>
>> >> +/* kernel shared variable for implementing gettimeofday. */
>> >> +KSVAR_DEFINE(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>> >>
>> >> Now we have defined a shared variable at index WALL_CLOCK_INDEX of type
>> >> struct wall_clock and named wall_clock.
>> >> Inside handleevents I update the info exported by wall_clock.
>> >>
>> >> +       struct timeval tv;
>> >> +
>> >> +       /* update time for userspace gettimeofday */
>> >> +       microtime(&tv);
>> >> +       wall_clock.tv = tv;
>> >> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
>> >> +       wall_clock.tz.tz_dsttime = tz_dsttime;
>> >>
>> >> Now, in libc we import the shared variable
>> >>
>> >> +KSVAR_IMPORT(WALL_CLOCK_INDEX, struct wall_clock, wall_clock);
>> >>
>> >> note that WALL_CLOCK_INDEX must be the same as the one defined
>> >> inside the kernel, and define a new function gettimeofday
>> >>
>> >> +int
>> >> +gettimeofday(struct timeval *tp, struct timezone *tzp)
>> >> +{
>> >> +
>> >> +       /* fallback to syscall if kernel doesn't export ksvar */
>> >> +       if (!KSVAR_IS_ACTIVE())
>> >> +               return (sys_gettimeofday(tp, tzp));
>> >> +
>> >> +       if (tp != NULL)
>> >> +               *tp = KSVAR(wall_clock)->tv;
>> >> +       if (tzp != NULL)
>> >> +               *tzp = KSVAR(wall_clock)->tz;
>> >> +       return (0);
>> >> +}
>> >>
>> >> Now when a process calls gettimeofday, it will actually call that function.
>> >> If the process makes a lot of calls to gettimeofday, we will see a
>> >> performance boost.
>> >> Note that if ksvar are not exported from the kernel (KSVAR_IS_ACTIVE),
>> >> the function falls back to calling the actual syscall (sys_gettimeofday).
>> >>
>> >> Open tasks
>> >> - implement support for 32-bit emulated processes running in a 64-bit
>> >> environment.
>> >> - extend support to other archs
>> >> - implement more syscalls
>> >> - benchmarks
>> >> - Test, test, test.
>> >>
>> >> I'm looking forward to hearing your comments and suggestions.
>> >
>> > I very much dislike what you described, it makes ABI maintenance
>> > a nightmare.
>> > Below is some mail I wrote around Spring 2009, making some notes about
>> > desired proposal. This is what is called vdso in Linux land.
>>
>> Did you bother to read at least Giovanni's description?
>> Because this has nothing to do with VDSO in Linux.
> Did you bother to think briefly why I object?
>
>>
>> I think, he just wants to map in userland processes some pages from
>> the static image of the kernel (packed together in a specific
>> dataset). This imposes some non-trivial problems. The first thing is
>> that the static image is not thought to have physical pages tied to
>> it. The second is that he needs to make a clean design in order to let
>> consumers of this mechanism correctly locate the information they want
>> within the shared page(s) and in the end read the correct values.
> Right, exactly, and this is why I object to the "offsets" approach.
> It basically moves us to the old times of the "jump tables" shared
> libraries, that fortunately was never a case for FreeBSD even when
> a.out was used. I'm objecting to this too.
>>
>> I have some reservations on both the implementation and the approach
>> for retrieving data from the page.
>> In particular, I don't like that a new vm_object is allocated for this
>> page. What I really would like would be:
>> 1) very minimal implementation -- you just use
>> pmap_enter()/pmap_remove() specifically when needed, separately, in
>> fork(), execve(), etc. cases
> Oh, this simply cannot work.

And why?
Assuming you provide a vm_page_t from an UMA zone just like fakepage does.
Of course you cannot recycle for this purpose any page coming from
vm_page_alloc().

>> 2) more complete approach -- you make a very quick layer which lets you
>> map pages from the static image of the kernel and the shared page
>> becomes just a specific consumer of this. This way the object makes much
>> more sense because it becomes an object associated to all the static
>> image of the kernel
> So you want to circumvent the vm layer.

Not sure I agree with your opinion on this.

>>
>> About the layering, I don't like that you require both a kernel and
>> userland header to locate the objects within the page. This is very
>> likely ABI breakage prone. A mechanism is needed for retrieving at
>> run time what Giovanni calls "indexes", or making it indexes-agnostic.
>
> And this is what VDSO is for. VDSO with the standard ELF symbol
> interposition rules allows having a libc that is completely unaware of the
> shared page and 'indexes', i.e. which works both for older kernels that
> do not export the required index, and for new kernels that export the same
> information in some more advanced format. By having VDSO that exports
> e.g. gettimeofday() we would get an override for libc gettimeofday, while
> having a fully functional libc for other, future and past, kernels, even
> if the format of the data exported for super-fast gettimeofday changes.
>
> The tight coupling between VDSO and kernel is not a problem, since VDSO is
> part of the kernel from the deployment POV. Moreover, either the existing ELF
> linker in the kernel, or some trivial modifications to it, would allow
> to not use 'indexes' on the kernel side too.

I admit I don't have a better plan on how to retrieve objects from the
shared page at the moment, I didn't give much thought to it.

> We already have a shared page between kernel and whole set of the same-ABI
> processes. Currently it is used for signal trampolines only.
> The hard part of the task is to provide VDSO build glue. Also IMO the
> hard task is to define a sensible gettimeofday() implementation, probably
> using rdtsc in usermode. Shared page is easy, or at least it is already
> there without ugly and non-working vm hacks.
>
> As an additional note, already put by Bruce, the implementation of
> usermode gettimeofday is exactly opposite of any reasonable implementation.
> It loses the precision to the frequency of the event timer. The obvious
> approach is to not have any periodically updating data for gettimeofday
> purpose, and use some formula with rdtsc and kernel-provided coefficients
> on the machines where rdtsc is usable.

The gettimeofday() implementation is a different story than what is asked here.

> Interesting question is how much shared the shared page needs be.
> Obvious needs are shared between all same-ABI processes, but I can also
> easily see a need for the per-process private information be present in
> the 'private-shared' page. For silly but typical example, useful for
> moronix-style benchmarks, see getpid().

Really the performance benefit of having a fast getpid() is marginal if
compared to heavily used things like gettimeofday().
I cannot think of a per-process page implementing a fast syscall that
can bring many performance advantages.

Attilio


--
Peace can only be achieved by understanding - A. Einstein
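For reference, the "formula with rdtsc and kernel-provided coefficients" mentioned in the quoted text could take roughly the following shape. This is a sketch under assumptions, not something proposed in the thread: the struct layout, the fixed-point multiply-and-shift scaling and all sketch_ names are hypothetical, and the counter value is passed in as a parameter instead of being read with rdtsc so that the arithmetic can be checked on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical coefficients a kernel could publish in the shared page. */
struct sketch_timebase {
	uint64_t base_ns;	/* time at tsc_base, in nanoseconds */
	uint64_t tsc_base;	/* counter value sampled with base_ns */
	uint32_t mult;		/* nanoseconds per tick, scaled by 2^shift */
	uint32_t shift;		/* fixed-point shift applied after multiply */
};

/*
 * Convert a counter reading into nanoseconds:
 *   base_ns + ((tsc_now - tsc_base) * mult) >> shift
 * The multiplication can overflow for very large deltas; a real design
 * would have to bound how old the published base is allowed to get.
 */
uint64_t
sketch_tsc_to_ns(const struct sketch_timebase *tb, uint64_t tsc_now)
{
	uint64_t delta;

	assert(tb != NULL && tb->shift < 64);
	delta = tsc_now - tb->tsc_base;
	return (tb->base_ns + ((delta * tb->mult) >> tb->shift));
}
```

With mult = 512 and shift = 10 each tick counts as half a nanosecond (a 2 GHz counter), so 200 ticks past the base add 100 ns to base_ns.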