Date: Thu, 25 Jul 2024 18:20:42 -0500
From: Jake Freeland <jake@technologyfriends.net>
To: Mark Johnston <markj@freebsd.org>
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-hackers@freebsd.org
Subject: Re: FreeBSD hugepages
Message-ID: <beb908e5-8928-4197-9328-3df4e153ac2e@technologyfriends.net>
In-Reply-To: <912f849b-95ac-4d29-8a86-300999a0a9c4@technologyfriends.net>
References: <1ced4290-4a31-4218-8611-63a44c307e87@technologyfriends.net>
 <ZqKhP0aR0fb_f6XE@kib.kiev.ua>
 <35da66f9-b913-45ea-90f4-16a2fa072848@technologyfriends.net>
 <ZqKzCK4pHg1mrSOa@nuc>
 <4d4398e5-81ba-4fbd-9806-649ec70abdb4@technologyfriends.net>
 <ZqLTA_tSP9dEQwil@nuc>
 <ZqLUX47CAUAlq7nq@nuc>
 <912f849b-95ac-4d29-8a86-300999a0a9c4@technologyfriends.net>

On 7/25/24 18:11, Jake Freeland wrote:
> On 7/25/24 17:40, Mark Johnston wrote:
>> On Thu, Jul 25, 2024 at 06:34:43PM -0400, Mark Johnston wrote:
>>> On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
>>>> On 7/25/24 15:18, Mark Johnston wrote:
>>>>> On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
>>>>>> On 7/25/24 14:02, Konstantin Belousov wrote:
>>>>>>> On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I have been steadily working on bringing Data Plane Development
>>>>>>>> Kit (DPDK) on FreeBSD up to date with the Linux version. The
>>>>>>>> most significant hurdle so far has been supporting concurrent
>>>>>>>> DPDK processes, each with their own contiguous memory regions.
>>>>>>>>
>>>>>>>> These contiguous regions are used by DPDK as a heap for
>>>>>>>> allocating DMA buffers and other miscellaneous resources.
>>>>>>>> Retrieving the underlying memory and mapping these regions is
>>>>>>>> currently different on Linux and FreeBSD:
>>>>>>>>
>>>>>>>> On Linux, hugepages are fetched from the kernel's pre-allocated
>>>>>>>> hugepage pool and are mapped into virtual address space on DPDK
>>>>>>>> initialization. Since the hugepages exist in a pool, multiple
>>>>>>>> processes can reserve their own hugepages and operate
>>>>>>>> concurrently.
>>>>>>>>
>>>>>>>> On FreeBSD, DPDK uses an in-house contigmem kernel module that
>>>>>>>> reserves a large contiguous region of memory on load. During
>>>>>>>> DPDK initialization, the entire region is mapped into virtual
>>>>>>>> address space. This leaves no memory for another independent
>>>>>>>> DPDK process, so only one process can operate at a time.
>>>>>>>>
>>>>>>>> I could modify the DPDK contigmem module to mimic Linux's
>>>>>>>> hugepages, but I thought it would be better to integrate and
>>>>>>>> upstream a hugepage-like interface directly in the FreeBSD
>>>>>>>> kernel source. I am writing this email to see if anyone has any
>>>>>>>> advice on the matter. I did not see any previous attempts at
>>>>>>>> this in Phabricator or the commit log, but it is possible that
>>>>>>>> I missed it. I have read about transparent superpage promotion,
>>>>>>>> but that seems like a different mechanism altogether.
>>>>>>>>
>>>>>>>> At a quick glance, the implementation seems straightforward:
>>>>>>>> read some loader tunables, allocate persistent hugepages at
>>>>>>>> boot time, and create a pseudo filesystem that supports
>>>>>>>> creating and mapping hugepages. I could be underestimating the
>>>>>>>> magnitude of this task, but that is why I'm asking for thoughts
>>>>>>>> and advice :)
>>>>>>>>
>>>>>>>> For reference, here is Linux's documentation on hugepages:
>>>>>>>> https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
>>>>>>> Are posix shm largepage objects enough (they were developed to
>>>>>>> support DPDK)? Look for shm_create_largepage(3).
>>>>>> Yes, shm_create_largepage(2) looks promising, but I would like
>>>>>> the ability to allocate these largepages at boot time when memory
>>>>>> fragmentation is at a minimum. Perhaps a couple sysctl tunables
>>>>>> could be added onto the vm.largepages node to specify a pagesize
>>>>>> and allocate some number of pages at boot?
>>>>> We could add an rc script which creates named largepage objects.
>>>>> This can be done using the posixshmcontrol utility. That might not
>>>>> be early enough during boot for some purposes. In that case, we
>>>>> could have a module which creates such objects from within the
>>>>> kernel.
>>>>> This is pretty straightforward to do; I wrote a dumb version of
>>>>> this for a mips-specific project a few years ago, feel free to
>>>>> take code or inspiration from it:
>>>>> https://people.freebsd.org/~markj/tlbdemo.c
>>>> Looks simple enough. Thanks for the example code.
>>>>
>>>>>> It seems Linux had an interface similar to
>>>>>> shm_create_largepage(2) back in v2.5, but they removed it in
>>>>>> favor of their hugetlbfs filesystem. It would be nice to stay
>>>>>> close to the file-backed Linux interface to maximize code sharing
>>>>>> in userspace. It looks like the foundation for hugepages is
>>>>>> there, but the interface for allocation and access needs to be
>>>>>> extended.
>>>>> POSIX shm objects have most of the properties one would want, I'd
>>>>> expect, save the ability to access them via standard syscalls.
>>>>> What else is missing besides the ability to reserve memory at boot
>>>>> time?
>>>> Most notably, I would like the ability to allocate pages in a
>>>> specific NUMA domain.
>>> I thought this was already supported, but it seems not...
>> Thinking a bit more, I'm pretty sure I had just been using something
>> like
>>
>> $ cpuset -n prefer:<domain> posixshmcontrol create -l 1G /largepage-1G-<domain>
>>
>> so didn't need an explicit NUMA configuration parameter. In C one would
>> use cpuset_setdomain(2) instead, but that's not as convenient. So,
>> imbuing a NUMA domain in struct shm_largepage_conf is still probably a
>> reasonable thing to do.
>
> I just looked at the code, this seems very manageable. I'll draft up a
> review.
>
>>> It should be very easy to implement: extend shm_largepage_conf to
>>> include a NUMA domain parameter, and specify that domain when
>>> allocating pages for the object (in shm_largepage_dotruncate(), the
>>> vm_page_alloc_contig() call should become a
>>> vm_page_alloc_contig_domain() call).
>>>
>>>> Otherwise, in a perfect world, I'd like a unified interface for both
>>>> Linux and FreeBSD.
>>>> Linux hugepages are managed using standard system calls; files are
>>>> mmap(2)'d into virtual address space from hugetlbfs and
>>>> ftruncate(2)'d.
>>> largepage shm objects work this way as well.
>
> After reading through the man page, this is quite apparent. Not sure
> how I failed to make that connection. Anyway, this is starting to look
> easier than I thought it would be. The only difference from a
> userspace perspective that I can think of right now is how the pages
> are created (e.g. hugetlbfs open(2) on Linux vs.
> shm_create_largepage(2) on FreeBSD).

I suppose I should clarify that hugetlbfs open(2) does not create a
hugepage, but rather attaches to one. So it would be analogous to
shm_open(2) instead of shm_create_largepage(2). The hugepages are
created at boot time or via sysfs on Linux. My mistake.

Jake Freeland

>
> Thanks for the guidance Mark and Konstantin.
>
> Jake Freeland
>>>> A matching interface would not add an extra kernel
>>>> entrypoint and even more importantly, it would ease the
>>>> Linux-to-FreeBSD porting process for programs that use hugepages.
>
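
The attach-versus-create distinction above can be sketched in C. This
is only a minimal sketch, not code from DPDK or FreeBSD:
attach_region() is a hypothetical helper, and it assumes the named
object already exists (for example, one created ahead of time with the
posixshmcontrol command quoted above). It uses only shm_open(2),
fstat(2) and mmap(2); because it opens without O_CREAT, it fails if
nothing has been created yet.

```c
/*
 * Sketch: attach to an existing named POSIX shm object and map it.
 * attach_region() is a hypothetical name used for illustration only.
 * Assumption: the object was created and sized beforehand.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Returns the mapping on success, or MAP_FAILED on error.
 * On success, *lenp is set to the size of the object.
 */
void *
attach_region(const char *path, size_t *lenp)
{
	struct stat sb;
	void *addr;
	int fd;

	/* No O_CREAT: attach to an existing object, never create one. */
	fd = shm_open(path, O_RDWR, 0);
	if (fd < 0)
		return (MAP_FAILED);

	/* The size was fixed by whoever created the object. */
	if (fstat(fd, &sb) != 0 || sb.st_size <= 0) {
		close(fd);
		return (MAP_FAILED);
	}

	addr = mmap(NULL, (size_t)sb.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);

	/* The mapping remains valid after the descriptor is closed. */
	close(fd);

	if (addr != MAP_FAILED)
		*lenp = (size_t)sb.st_size;
	return (addr);
}
```

A caller would pass the object's name, e.g.
attach_region("/largepage-1G-0", &len), where the name is an assumed
instance of the /largepage-1G-<domain> pattern, and munmap(2) the
region when finished.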