Date: Wed, 19 Apr 2023 21:17:57 +0100 From: Jessica Clarke <jrtc27@freebsd.org> To: Hans Petter Selasky <hselasky@freebsd.org> Cc: Brooks Davis <brooks@freebsd.org>, src-committers <src-committers@freebsd.org>, dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org Subject: Re: git: 8dcf3a82c54c - main - libc: Implement bsort(3) a bitonic type of sorting algorithm. Message-ID: <2840BE79-CC25-427A-A5E9-476A38E749E3@freebsd.org> In-Reply-To: <e2ea5c89-aa48-e257-b9d5-7fe783bf216e@freebsd.org> References: <202304191206.33JC6Qcp062380@gitrepo.freebsd.org> <ZEAM3rO9r5e97BHE@spindle.one-eyed-alien.net> <a073fd36-9aa9-e988-0cc5-86af305fb654@freebsd.org> <3B5734C4-8630-4CD5-BA8D-DE33899161F1@freebsd.org> <e2ea5c89-aa48-e257-b9d5-7fe783bf216e@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 19 Apr 2023, at 20:27, Hans Petter Selasky <hselasky@freebsd.org> = wrote: >=20 > On 4/19/23 19:51, Jessica Clarke wrote: >> On 19 Apr 2023, at 18:24, Hans Petter Selasky <hselasky@freebsd.org> = wrote: >>>=20 >>> On 4/19/23 17:46, Brooks Davis wrote: >>>> On Wed, Apr 19, 2023 at 12:06:26PM +0000, Hans Petter Selasky = wrote: >>>>> The branch main has been updated by hselasky: >>>>>=20 >>>>> URL: = https://cgit.FreeBSD.org/src/commit/?id=3D8dcf3a82c54cb216df3213a013047907= 636a01da >>>>>=20 >>>>> commit 8dcf3a82c54cb216df3213a013047907636a01da >>>>> Author: Hans Petter Selasky <hselasky@FreeBSD.org> >>>>> AuthorDate: 2022-09-08 10:16:43 +0000 >>>>> Commit: Hans Petter Selasky <hselasky@FreeBSD.org> >>>>> CommitDate: 2023-04-19 12:04:22 +0000 >>>>>=20 >>>>> libc: Implement bsort(3) a bitonic type of sorting algorithm. >>>>> The bsort(3) algorithm works by swapping objects, = similarly to qsort(3), >>>>> and does not require any significant amount of additional = memory. >>>>> The bsort(3) algorithm doesn't suffer from the processing = time issues >>>>> known the plague the qsort(3) family of algorithms, and is = bounded by >>>>> a complexity of O(log2(N) * log2(N) * N), where N is the = number of >>>>> elements in the sorting array. The additional complexity = compared to >>>>> mergesort(3) is a fair tradeoff in situations where no memory = may >>>>> be allocated. >>>>> The bsort(3) APIs are identical to those of qsort(3), = allowing for >>>>> easy drop-in and testing. >>>>> The design of the bsort(3) algorithm allows for future = parallell CPU >>>>> execution when sorting arrays. The current version of the = bsort(3) >>>>> algorithm is single threaded. This is possible because fixed = areas >>>>> of the sorting data is compared at a time, and can easily be = divided >>>>> among different CPU's to sort large arrays faster. >>>>> Reviewed by: gbe@, delphij@, pauamma_gundo.com = (manpages) >>>>> Sponsored by: NVIDIA Networking >>>>> Differential Revision: https://reviews.freebsd.org/D36493 >>>> Why was this needed? I'm not really a fan to adding new, = non-standard >>>> sorts without a clear use case. I understand it has some specific >>>> advantages vs qsort, but are we going to use it? Doing so will = make our >>>> code less portable so we almost certainly can't s/qsort/bsort/. >>>=20 >>> Hi Brooks, >>>=20 >>> My long term plan is to get bsort() to replace qsort(), but because = the two algorithms have different characteristics, then some people may = complain it is good for me, but not for you. I want there to be an = option besides from qsort(), which does not use malloc() as an integral = part of sorting. And is faster than O(N*N) sorting (still the worst case = for qsort in FreeBSD). >> Why not do an adaptive qsort instead like other standard libraries? >> Also, nothing actually says qsort doesn=E2=80=99t allocate memory; in = fact, >> glibc=E2=80=99s own documentation states that one should not assume = it is >> in-place and doesn=E2=80=99t allocate. >=20 > Hi Jessica, >=20 >>> qsort() is frequently used to do all kinds of sorting, and some = pointed out that qsort() can technically be any kind of sorting = algorithm, but looking around it is not. >> Because there are variants that are guaranteed n log n? Which is = better >> than your n log^2 n. >=20 > Only so-called mergesort() algorithms can do sorting in O(N log N) = time, from what I know. And it needs to allocate working memory, when = the sorting arrays get large. =46rom past experience, malloc() is a = problem in the fast path. If only mergesort() could pre-allocate that = memory, or have a pointer to pass working memory. What is the point of = allocating and freeing memory over and over again, in certain = applications, doing frequent sorting. The API is broken. pdqsort is n log n time, in-place and doesn=E2=80=99t allocate, and is = used, for example, for Rust=E2=80=99s standard sort_unstable. >>> When I build applications of my own, I always use mergesort(), which = depend on malloc(). Having a dependency on a certain memory allocator to = get performance is not good. >>>=20 >>> I want to distinguish from qsort() by calling it bsort(). If people = use bsort() they know what they get cross-platform. >> No they don=E2=80=99t, bsort can mean multiple things. >=20 > In the C-code standard domain, can you give examples of different = meanings of bsort() which are established? I'm not aware of any such = existing usage. The b could be bitonic, bubble, binary insertion or something else? Searching the literature turns up https://dl.acm.org/doi/pdf/10.1145/3341.3348 as an algorithm described in the April 1985 Communications of the ACM. >> If you want a specific >> algorithm, put it in your program, but please don=E2=80=99t, just use = a >> standardised function like qsort. >=20 > Sorting is a quite generic thing. >=20 > Likewise with <sys/queue.h> we support four different ways to make = lists. And then after epoch(9) was introduced came another three ways, = basically replicas of <sys/queue.h> with different properties (See: = sys/contrib/ck/include/ck_queue.h) >=20 > Personally I think the same about mergesort(), qsort() and bsort(). = They are so different and unique ways to sort that we should have all of = them in libc. They each serve a purpose. Except they=E2=80=99re not, because all of what bsort provides can be = provided by qsort. All people care about are being able to quickly sort their data and whether it=E2=80=99s a stable sort. Occasionally people care = about allocating, but not often. We have multiple list implementations in sys/queue.h because they support different sets of operations. Arguably SLIST and LIST should be subsumed by STAILQ and TAILQ respectively, though. Meanwhile the CK variants exist because they=E2=80=99re lockless data structures, which the sys/queue.h ones are not (and shouldn=E2=80=99t = be). >>> If people use qsort() the output is random basically. It helps very = little my application works on FreeBSD, but not Linux, for example. >> No, the output is sorted, which is far from random. And if you need a >> stable sort you should ask for one. >=20 > Let me rephrase: Random with regards to execution time. >=20 > mergesort() : not for realtime applications (syscalls can sleep when = getting working memory) > qsort() : usually fast, but not always > bsort() : Worst case is "N * log2 N" compared to "N * N" (qsort), and = no syscalls are involved. Modified qsort() : Wort case is =E2=80=9CN * log N=E2=80=9D, and no = syscalls are involved. Please just adapt qsort to not have degenerate behaviour; it will improve our implementation of a standard C function and render your bsort() entirely obsolete. >>> In FreeBSD qsort() is typically used for sorting files up to 16K = entries per directory. Even if qsort() can explode in time, probably = it's not that important. But when using qsort() for sorting millions of = mathematical expressions, for example, doing number analysis, this is = unacceptable. >>>=20 >>> I think "C.A.R. Hoare's quicksort" technique is designed for single = CPU systemsf only. Even if the best-case average is "N*log2(N)", that = amount of processing cannot be split by multiple CPUs. The algorithm is = serial as such. >>>=20 >>> The bsort() algorithm is much more NCPU friendly, because it can = split the work into fixed size smaller and independent work loads. = Otherwise the work load doubles every time you merge two sorted lists. = And when the number of lists to merge is fewer than the number of CPUs = available, your system runs out of guts basically. >> I highly doubt you want a libc sort routine to start spawning = threads. >=20 > Right. I'm thinking more like Grand Dispatch Central, though not that = common in FreeBSD, we still have a wiki page: >=20 > https://wiki.freebsd.org/GrandCentralDispatch >=20 > bsort() itself doesn't have to create threads to be SMP'ed, but can = provide the basic functions for SMP sorting. >=20 > When sorting records, the clever way to do it, is to make a list of = pointers to those records, and sort based on that, swapping pointers = first, then records in the end. >=20 > Technically you could have 4 CPUs go in a tight spin on different = parts of the bsort() algorithm, and by use of existing atomic compare = and set pointer operations, you could make a fully deterministic sorting = algorithm, even if the different CPUs clobber on the same memory = locations. That sounds like a great way to get a big game of cache line ping pong. > This is something I want to explore later. >=20 > Neither mergesort() nor qsort() can do that. You can still parallelise them, there=E2=80=99s just a serial step to = merge or partition respectively. >> That also seems extremely contradictory to your claim about wanting = one >> that doesn=E2=80=99t allocate memory, but creating threads is going = to do much >> more than just that, and would scream POLA violation to me. >=20 > See my explanation above. TAILQ_INSERT_TAIL() can also be used in = multithreading environments, without having to link to pthreads. Only under a lock. At which point multithreading isn't relevant. Please revert this all. FreeBSD=E2=80=99s libc isn=E2=80=99t a place for = your pet projects. New public interfaces in something as core as libc should have good justification, and I still do not see that here. >>>> I also note that the swap code is pointlessly slow for size > 256 = and >>>> should almost certainly use aliasing with matching words like = memcpy >>>> implementations do. Doing so would make it easier to port this = code to >>>> CHERI where that is required. >>>=20 >>> I totally agree about the swap code being pointless. And I tried to = look where is memswap(), but it was not there. This kind of swapping is = done many places, and we could benefit from having a compiler supported = memswap() function. What do you think? >> We agree: https://github.com/CTSRD-CHERI/cheribsd/issues/207. But the >> world doesn=E2=80=99t have that in C. >=20 > Yes, that's a good idea! I'll try to follow that lead. >=20 >> We don=E2=80=99t need new non-standard sorting routines in libc. >=20 > We don't need English tea either ;-) I don=E2=80=99t know, nor do I care, what tea has to do with sorting = routines. Jess
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?2840BE79-CC25-427A-A5E9-476A38E749E3>