Date: Wed, 18 Jun 2003 01:40:17 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Marcel Moolenaar <marcel@xcllnt.net>
Cc: aritger@nvidia.com
Subject: Re: Nvidia, TLS and __thread keyword -- an observation
Message-ID: <3EF02571.8A4C432D@mindspring.com>
References: <20030617071810.GA2451@dhcp01.pn.xcllnt.net> <20030617223910.GB57040@ns1.xcllnt.net>

Marcel Moolenaar wrote:
> I'm not sure you understand the issue (I can easily be wrong, I just
> don't see the evidence in your statement). To support the __thread
> keyword, our thread library needs to create the TLS as defined in the
> binary and its dependent shared libraries by virtue of the .tdata and
> .tbss sections/segments, based on the image of the TLS as constructed
> by the RTLD for the initial set of modules (created for the initial
> thread) and amended by TLS space defined in the dynamically loaded
> libraries; and the TLS has to be created for every new thread at the
> time the thread itself is created.

Most of these issues can be sidestepped. The correct approach is
actually the Microsoft published model, which is a rehash of another
even older model:

For each shared object, be it a library or dlopen'ed .so, you need:

1) Process attach (currently, .init)
2) Process detach (currently, .fini)
3) Thread attach (you are implying this with .tdata and .tbss)
4) Thread detach (you are implying this with .tdata and .tbss)

Really, you want an explicit interface, rather than an implied one.
This may mean "implying" the creation of .tini and .tfin sections,
or some other approach, which deal with the .tdata/.tbss, or otherwise.

Actually, a means of putting the relocation table in a per thread
code table would also resolve the relocation issues that lead to
people wanting to put locks around the RTLD references, but it's
probably more correct to resolve this by serializing the thread
attach/detach process, instead.

> The static TLS model requires the least amount of work: add support
> to allocate the TLS image for every thread creation and point the
> thread pointer to it in a way compatible with the runtime spec.

There would be no difference between static vs. dynamic, for the
most part, if one were to use the .tini/.tfin approach.
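
To make the four per-object events listed above concrete, here is a
minimal single-threaded sketch in C. Everything in it is hypothetical
and for illustration only: the names struct module_hooks, module_load,
module_unload, thread_start and thread_stop are made up, threads are
stood in for by plain integer ids, and no real RTLD or thread library
works this way as written. It only shows the bookkeeping: a load
implies a thread attach for every existing thread, and an unload
implies a thread detach for every existing thread.

```c
#include <stddef.h>

/* Hypothetical per-module hook table: one slot per event in the list. */
struct module_hooks {
    void (*process_attach)(void);   /* role played by .init today      */
    void (*process_detach)(void);   /* role played by .fini today      */
    void (*thread_attach)(int tid); /* would come from a .tini section */
    void (*thread_detach)(int tid); /* would come from a .tfin section */
};

#define MAX_MODULES 8
#define MAX_THREADS 8

static const struct module_hooks *modules[MAX_MODULES];
static size_t n_modules;
static int live_threads[MAX_THREADS];
static size_t n_threads;

/* Load: process attach, then an implicit thread attach for each
 * thread that already exists at load time. */
int module_load(const struct module_hooks *m)
{
    if (n_modules == MAX_MODULES)
        return -1;
    modules[n_modules++] = m;
    if (m->process_attach)
        m->process_attach();
    for (size_t i = 0; i < n_threads; i++)
        if (m->thread_attach)
            m->thread_attach(live_threads[i]);
    return 0;
}

/* Thread creation: attach the new thread to every loaded module. */
int thread_start(int tid)
{
    if (n_threads == MAX_THREADS)
        return -1;
    live_threads[n_threads++] = tid;
    for (size_t i = 0; i < n_modules; i++)
        if (modules[i]->thread_attach)
            modules[i]->thread_attach(tid);
    return 0;
}

/* Thread termination: detach from modules in reverse load order. */
void thread_stop(int tid)
{
    for (size_t i = 0; i < n_threads; i++) {
        if (live_threads[i] != tid)
            continue;
        for (size_t j = n_modules; j-- > 0; )
            if (modules[j]->thread_detach)
                modules[j]->thread_detach(tid);
        live_threads[i] = live_threads[--n_threads];
        return;
    }
}

/* Unload: implicit thread detach for each live thread, then
 * process detach, then drop the module from the table. */
void module_unload(const struct module_hooks *m)
{
    for (size_t i = 0; i < n_modules; i++) {
        if (modules[i] != m)
            continue;
        for (size_t t = 0; t < n_threads; t++)
            if (m->thread_detach)
                m->thread_detach(live_threads[t]);
        if (m->process_detach)
            m->process_detach();
        for (size_t j = i + 1; j < n_modules; j++)
            modules[j - 1] = modules[j];
        n_modules--;
        return;
    }
}

/* Demo module: counts how many threads are currently attached. */
static int demo_attached;
static void demo_tattach(int tid) { (void)tid; demo_attached++; }
static void demo_tdetach(int tid) { (void)tid; demo_attached--; }
static const struct module_hooks demo = {
    NULL, NULL, demo_tattach, demo_tdetach
};
```

In a real implementation the four entry points would be serialized
(for example under one lock), which is the serialization of thread
attach/detach argued for above; the sketch leaves locking out to stay
short.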

The trouble you are anticipating here is actually all related to
the fact that you have defined things as belonging to a data
interface, rather than using accessor/mutator functions to operate
on the data and to hook the thread attach/detach events (attaches
are implicit for any existing threads at time of load, and detaches
are implicit for any existing threads at time of unload -- meaning
you need to deal with it in the same fashion as create/delete, and
you need to deal with out-of-order; this is a restriction you already
have to live with anyway).

> The dynamic TLS model requires more substantial changes and involves
> RTLD as well. This is the model that requires __tls_get_addr().

I don't believe that this is true. I think the code examples omit
the case where you have triggered functions to deal with explicit
attach/detach events. And all such events can be made explicit,
and serialized. They are rare enough that there should be almost
zero cost relative to doing the same thing with static construction,
which would use a linker set to gather the .tini/.tfin function
lists together for call on thread start/stop.

Now some general comments:

Realize that no matter how you approach this, there is going to be
additional runtime overhead to thread creation/deletion for implicit
TLS support, even if you get referencing it down to 1 instruction.

No matter what you do, you will be paying an increased runtime
penalty for thread creation/termination (join, exit, etc.) in
exchange for your ability to use implicit TLS via compiler extension.

I expect that programs with high performance requirements that use
a lot of threads will lean not to use __thread, in much the same way
that C++ programmers lean not to use RTTI or exceptions -- both
available language features, whose costs exceed their benefits.

I believe that thread lifetime is proportional to the desirability
of implicit TLS, and that thread count is inversely proportional.
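
As a rough illustration of where the per-thread creation cost comes
from, here is a minimal sketch in C of building one module's TLS block
for a new thread: copy the initialized image (.tdata), then zero the
rest (.tbss). The names struct tls_template, tls_block_create and
tls_block_destroy are hypothetical, and the sketch ignores alignment,
the thread pointer, and multiple modules; it is not how any particular
RTLD or thread library does it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical description of one module's TLS template. */
struct tls_template {
    const void *tdata;      /* initialized image (.tdata)      */
    size_t      tdata_size; /* bytes to copy from tdata        */
    size_t      tbss_size;  /* bytes to zero-fill after tdata  */
};

/* Build a fresh TLS block for a new thread. The work is linear in
 * tdata_size + tbss_size and is repeated for every thread created,
 * whether or not the thread ever touches the data. */
void *tls_block_create(const struct tls_template *t)
{
    size_t total = t->tdata_size + t->tbss_size;
    unsigned char *block = malloc(total ? total : 1);

    if (block == NULL)
        return NULL;
    if (t->tdata_size)
        memcpy(block, t->tdata, t->tdata_size);
    memset(block + t->tdata_size, 0, t->tbss_size);
    return block;
}

/* Release the block when the thread terminates. */
void tls_block_destroy(void *block)
{
    free(block);
}
```

A short-lived thread pays this allocate/copy/zero/free cycle once per
lifetime, so the more threads a program creates and the shorter they
live, the larger the share of time spent here.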

People writing code that expects to be used by threaded programs
need to be aware of this. For example, an OpenGL interface onto a
MySQL database or an LDAP server or a DNS server would likely suck,
due to the thread impedance mismatch between the implementations.

Likewise, thread count would make OpenGL undesirable technology
for implementing a web browser that didn't explicitly and heavily
rely on a worker thread pool and the HTTP 1.1 persistent connections
approach. Even then, you would still be screwed by any web servers
that performed chunk-encoding, since it would rob you of the
"Content-Length:" header, and mean that in order to signal end of
data, they had to close the connection on you, losing you your
persistence.

Just some things to think about before you throw out explicit
context for frame rate improvements that are unusable to any code
that isn't a game or a benchmark... 8-(.

-- Terry