From owner-freebsd-threads@FreeBSD.ORG Mon Jun 16 18:37:43 2003
Message-ID: <2D32959E172B8F4D9B02F68266BE421401A6D7E7@mail-sc-3.nvidia.com>
From: Gareth Hughes
To: 'Julian Elischer'
Cc: threads@freebsd.org, zander@mail.minion.de, 'Daniel Eischen', Andy Ritger
Date: Mon, 16 Jun 2003 18:37:10 -0700
Subject: RE: NVIDIA and TLS
List-Id: Threading on FreeBSD

On Mon, 16 Jun 2003, Julian Elischer wrote:
>
> I'm not making comments about their qualifications in graphics,
> just that it's sad when the threading interface is distorted by graphics
> people.. In effect by insisting on having the TLS values accessible
> at the lowest high-performance parts, they are exerting "un-natural"
> pressure on the development of threads. :-/

The design of the ELF TLS spec had nothing to do with OpenGL or other
graphics people.  When this support was added to the Linux C library
and toolchain, we started using it because it met our requirements
for high performance thread-local storage.
Granted, we worked with the GNU libc developers to iron out a few
issues, but we had nothing to do with the specification itself.  The
ELF TLS spec was designed to meet the needs of a class of applications,
which OpenGL happens to fall into.  Fast thread-local storage is a good
thing in general, beyond the scope of 3D graphics alone.

> I'm saying that if it were done like this:
>
> __thread local_context_t *lc;
> medium_level_OpenGL_function()
> {
>     int linestodo=1000;
>     local_context_t *drawing_context;
>     int i;
>
>     drawing_context = lc;
>     for (i = linestodo; i; i--) {
>         OpenGL_Low_level_Thingy(drawing_context, arg1, arg2);
>     }
> }
>
> then the performance of TLS wouldn't be so crucial;
> It would still be relatively ok (maybe 5 instructions)
> but it wouldn't have to be 1 instruction
>
> In fact if they were implemented in the following way:
>
> __inline OpenGL_Low_level_Thingy(local_context_t *drawing_context,
>                                  arg1, arg2)
> {
>     __asm "blah blah "       /* load args to known regs */
>     call library entrypoint  /* with args in known regs...*/
> }
>
> you would have the fastest version of all without
> any requirement for making the TLS so specialised. (purely register
> transfer)

Umm, you clearly don't understand what I've been talking about.

Upon entry into libGL, i.e. when an OpenGL API entrypoint is called
by the application, things like the current context or current
dispatch table are fetched from thread-local storage.  Internally,
we pass these pointers around as required.
So, we might have something like this:

    void glBegin(GLenum mode)
    {
        // Grab the current dispatch pointer from TLS
        __GLdispatch *dispatch = GET_CURRENT_DISPATCH();

        // Call into the driver backend
        dispatch->Begin(mode);
    }

or, in x86 assembly:

    glBegin:
        mov     %gs:__gl_dispatch@ntpoff, %eax
        jmp     *__begin_offset(%eax)

(note that Andy mentioned this example in his original email)

This would jump to a function inside the driver like this:

    void __internal_Begin(GLenum mode)
    {
        __GLcontext *ctx = GET_CURRENT_CONTEXT();

        do_something(ctx, ...);
        do_something_else(ctx, ...);
        // and so on
    }

Once we're inside the driver, we know what the current context or
other thread-local variable's value is.  Two critical points:

1) We have to fetch the value from TLS at least once per entry into
   the driver.

2) Some of the driver backend functions are very small; typically,
   the more performance-critical a function is, the smaller it is.

In general, you want to avoid things like pthread_getspecific() inside
the dispatch layer and your 6-instruction implementation of glColor4f
or glNormal3f (which can be called millions of times per frame).

> I have no intention of wanting you to context switch to
> another thread.. I'm just saying that it's a pity you don't just
> go the route that other libraries have and make that time critical
> functions just have the value at hand already.

If the time critical function is a 6-instruction function at the
top-level of the API (that is, called directly from the dispatch
layer), how do you get this value other than looking it up out of
thread-local storage?  Caching that variable in TLS with a fast
access mechanism qualifies as "keeping the value at hand" in my books.

> No, I think you misunderstand..
>
> A single thread would always use the same context.
> I'm not saying otherwise..
> I'm just wishing that you would keep the value of its address
> around in a local stack variable a bit more instead of deriving it
> with %gs all the time.

It is, once you've looked it up.
Problem is, if the only work you do
inside the library is copy three floats off the stack (parameters to
the GL API call) into a DMA buffer, set a bit in a bitmask and return,
the time spent accessing the current context becomes a large
percentage of the time you spend in the library for that call, period.
Understand?

> If the leaf functions on OpenGL were to be implemented with
> asm interfaces (you said they were hand optimised anyhow)
> and the callers would cache the drawing context pointer in a local
> register, then My ability to give you a TLS pointer in
>
>
> getTLS: lea %eax,%gs(mumble)
>         movl mumble(%eax), %eax
>         ret
>
> would be fast enough as the cost of the extra function call would be
> amortised over many low-level calls.
> (actually it'd probably be faster than what you have now I think)

There is no caller of these functions.  In the fast paths, there is
no function call, once you get inside the library.  That's the whole
point.

1) Application calls OpenGL function.

2) OpenGL API dispatch function looks up the dispatch pointer from TLS,
   and jumps through it.  2 insns.

3) Driver function looks up the current context out of TLS, copies
   some data into a buffer, and returns.  Maybe 6 insns or so.

4) Application continues on.

The cost of TLS access becomes significant when the driver is doing
less than a dozen non-TLS-lookup instructions for the important API
calls.

-- Gareth Hughes (gareth@nvidia.com)
OpenGL Developer, NVIDIA Corporation
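
As a rough illustration of the trade-off discussed in this message, here is a minimal C sketch of a tiny API entrypoint that fetches its per-thread context in two ways: through pthread_getspecific() and through a `__thread` variable. Everything named `demo_*` is hypothetical and invented for this example; it is not taken from libGL or any driver, and it assumes a GCC- or Clang-style toolchain with `__thread` support.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver context; not from any real driver. */
typedef struct demo_context {
    unsigned dirty_bits;   /* bit 0: current color changed */
    float    color[4];
} demo_context;

/* Route 1: POSIX thread-specific data (a library call on every lookup). */
static pthread_key_t  demo_key;
static pthread_once_t demo_key_once = PTHREAD_ONCE_INIT;

static void demo_make_key(void)
{
    int rc = pthread_key_create(&demo_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Route 2: compiler-supported TLS via the GCC/Clang __thread extension. */
static __thread demo_context *demo_current;

/* Bind a context to the calling thread through both routes.
 * Must be called before either entrypoint below. */
void demo_make_current(demo_context *ctx)
{
    int rc;

    pthread_once(&demo_key_once, demo_make_key);
    rc = pthread_setspecific(demo_key, ctx);
    assert(rc == 0);
    (void)rc;
    demo_current = ctx;
}

/* The "real work": copy four floats and set one bit, nothing more. */
static void demo_store_color(demo_context *ctx,
                             float r, float g, float b, float a)
{
    if (ctx == NULL)
        return;            /* no current context: silently do nothing */
    ctx->color[0] = r;
    ctx->color[1] = g;
    ctx->color[2] = b;
    ctx->color[3] = a;
    ctx->dirty_bits |= 1u;
}

/* Entrypoint that looks the context up with pthread_getspecific(). */
void demo_color4f_slow(float r, float g, float b, float a)
{
    demo_store_color(pthread_getspecific(demo_key), r, g, b, a);
}

/* Entrypoint that reads the context straight from a __thread variable. */
void demo_color4f_fast(float r, float g, float b, float a)
{
    demo_store_color(demo_current, r, g, b, a);
}
```

Both entrypoints do the same handful of stores. The difference is only in how the context pointer is obtained: the first pays for a function call into the threading library on every invocation, while the second leaves the lookup to the compiler and linker, which on some ELF targets can reduce it to a segment-relative load like the one in the assembly example above. How large the gap is in practice depends on the platform and TLS model, so treat this as a sketch for experimentation rather than a measurement.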