Date: Mon, 22 May 2023 08:28:13 -0600
From: Warner Losh <imp@bsdimp.com>
To: Hans Petter Selasky <hps@selasky.org>
Cc: Konstantin Belousov <kostikbel@gmail.com>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: [RFC] An idea for general kernel post-processing automation in FreeBSD
Message-ID: <CANCZdfq4D3N12c6h_Hy7Mvyg_d%2BpZBnN39CvpDU14-qDEhMFXQ@mail.gmail.com>
In-Reply-To: <394775eb-1000-60d9-2ddd-5c2aca6c99ea@selasky.org>
References: <b0461320-fb92-95cf-609b-3550eb30a589@selasky.org> <CANCZdfowk1oQnw30hk%2BA-pSq0_QK52dahF6x3AUi=A76J18R9w@mail.gmail.com> <ZGq03_RdpqXppA-h@kib.kiev.ua> <394775eb-1000-60d9-2ddd-5c2aca6c99ea@selasky.org>
On Mon, May 22, 2023 at 3:56 AM Hans Petter Selasky <hps@selasky.org> wrote:
> On 5/22/23 02:18, Konstantin Belousov wrote:
> > On Sun, May 21, 2023 at 05:34:26PM -0600, Warner Losh wrote:
> >> On Sun, May 21, 2023 at 2:13 PM Hans Petter Selasky <hps@selasky.org> wrote:
> >>
> >>> However, if the data in question is sorted at compile time, then zero
> >>> time will be spent sorting the data in the kernel. When a kernel module
> >>> is loaded and the sysinit/constructors are already sorted by their
> >>> subsystem/order/priority, all you need to do is to merge two sorted
> >>> lists into another sorted list, and this is pretty much linear.
> >>>
> >>> The question is, who can sort the sysinits:
> >>
> >> The bigger question is "Do we even want to do that?"
>
> Hi,
>
> >> Switching to a faster sort gets us from milliseconds to microseconds
> >> without adding a lot of hair.
>
> When you can more than double the benefit by simple means, why not just
> do that?

The cost/benefit ratio is terrible: removing a millisecond at boot in exchange
for a less robust build and population of sysinit is a poor trade.

> I know embedded CPUs run at kHz speeds when booting up, and only later
> on, when clocks and such are properly set up, you get the GHz speeds.

None of the tier-1 platforms do this, and we shouldn't drive optimizations
for fringe, tier-2 environments (including FIRECRACKER) when they come
with a real cost. Presorting is one that comes with a cost greater than
anything else Colin has done.

> And Linux pre-sorts the initializers too, by a simple "level". It looks
> like they encode it into the symbol name itself.

Interesting, but not relevant.

Honestly, I think we'd be better off moving to a priority queue / heap
to get the order.
We only need a partial ordering, and a heap-backed priority queue has
good worst/best case properties. It's what CAM uses to pick the next
work to do, though there'd be no code reuse. Plus, it could be done with
pointers, which is way cheaper than copying entire entries around.
Pre-sorting would preclude doing things like this.

Basically, pre-sorting will make the build more complicated for what's,
at best, a marginal gain in an area that almost nobody cares about.

> >>
> >>> 1) "ld" can sort symbols by name, but encoding the priority into the
> >>> symbol name is difficult, especially when arithmetic expressions are
> >>> used. The C pre-processor can only concatenate stuff. No character
> >>> arithmetic is possible. This solution is not possible.
> >>>
> >>> 2) The constructor attribute in "C" can take a priority argument,
> >>> probably 32-bit, and we need 64 bits. Sounds great, but how can you edit
> >>> these lists and merge the constructors? What about destructors and
> >>> priority? Maybe possible.
> >>>
> >>> 3) The compiler can output strings to a magic file during compilation,
> >>> like the names of global variables to collect and sort. The C compiler
> >>> has #error and #warning, but there is no #write_to_file, simply
> >>> speaking. Another solution is to store the file output into a separate
> >>> section in the object files and use objcopy to extract the text later
> >>> on, and process it.
> >>
> >> These are all fragile. I don't think the benefit makes the fragility
> >> worth it.
> > I agree. Linker tricks are cute but often depend on minor features of
> > linkers that break often.
>
> Yes, the linker-supported sorting tricks are fragile. My proposal is to
> use the already existing named sections for this. This is supported all
> over the place: gcc, clang and so on.

As someone who has fought with those schemes, I'd like to register my
strong opposition to expanding them unless we get a large benefit.
No such benefit has been articulated, so I'd say they are a non-starter.

> >>> It may also be another way to fetch PCI/USB device ID information,
> >>> instead of using linker hints. Currently when probing USB devices, devd
> >>> has a linear list of hints it iterates. That means for every device
> >>> being plugged in, you need to search all supported products for a match.
> >>> This is slow. Instead, a tool could at least sort the entries, so a
> >>> binary-search type function could be used, resulting in O(log2(N))
> >>> search time instead of O(N).
> >>
> >> Except that data is pathologically heterogeneous. There's nothing to sort
> >> in any meaningful way. And it's all about the runtime environment, which
> >> is impossible to know at build time (today we do it at run time). The linker
> >> hints file to know which of many things to load is almost optimal... Each
> >> file is different, especially for PCI, which is why we pre-process it once
> >> and put it into the linker hints... So it's pretty good once the system is
> >> running, but at startup we parse it multiple times and likely would do more
> >> as we move more towards MINIMAL. It's been suggested that we move
> > BTW, if we are moving to MINIMAL (which I quite like), then with the
> > proposed approach we need to merge at least N + 1 lists, where N is the
> > number of modules. Can the merge be done in place without incurring N**2
> > behaviour?
>
> Currently when klds are added during load, a new list is created,
> replacing the current one. Technically not sorted in place. This leads
> to N**2 processing, but N is the number of modules loaded, not the
> number of sysinits. You can also merge sysinits from modules two by two,
> leading to log2(N) processing time. Something like bulk-loading
> modules.

Yeah, N here is tiny: less than 10 almost always, and never bigger than
50 even in complex systems.
N^2 here costs nothing, so doing back-flips to avoid it seems like a
poor investment of time.

Then again, a heap sort would always give us n lg n behavior, since we
could insert the elements one at a time and re-heapify. Doing it with
pointers would be an even bigger win, since that removes a lot of data
copying.

> >
> >> this into devd, so the data persists in its most useful form for a long
> >> time,
> >> and that's not a bad suggestion for dealing with the growing number of
> >> execs to make this all work... It's very Unix Tooly, but also there's likely
> >> a good case to be made to optimize here... There's some ordering issues
> >> that would need to be worked out too...
> >
> > Overall, my feel is the same: if a better sort can be used to speed this up,
> > it is better to not create a monster build system. In kernel, we are not
> > limited by ABI or POSIX constraints.
>
> Prototypes need to start somewhere. This is what we want to do and these
> are the requirements. Having a separate debug channel directly from the
> compiler, maybe a debug_printf() attribute, seeing compile-time
> constants the same way the compiler does, is more clean.
>
> What comes first? The debug_printf() attribute or the section data hack?
> I guess it is easier to start with an egg than a chicken.
>
> The Linux guys use a perl script invoking "nm" to read out all the
> initializers and generate a linker script file:
>
> https://elixir.bootlin.com/linux/latest/source/scripts/generate_initcall_order.pl
>
> My solution does not require anything more than standard in-base UNIX
> utilities. And given our linker is based on LLVM and not GNU, it's
> probably better to generate C code than linker script code?

I still don't like any of this. I think it's extra complication looking
for a problem to solve... I'm not a fan.
Warner