Date:      Thu, 25 Jul 2024 18:20:42 -0500
From:      Jake Freeland <jake@technologyfriends.net>
To:        Mark Johnston <markj@freebsd.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, freebsd-hackers@freebsd.org
Subject:   Re: FreeBSD hugepages
Message-ID:  <beb908e5-8928-4197-9328-3df4e153ac2e@technologyfriends.net>
In-Reply-To: <912f849b-95ac-4d29-8a86-300999a0a9c4@technologyfriends.net>
References:  <1ced4290-4a31-4218-8611-63a44c307e87@technologyfriends.net> <ZqKhP0aR0fb_f6XE@kib.kiev.ua> <35da66f9-b913-45ea-90f4-16a2fa072848@technologyfriends.net> <ZqKzCK4pHg1mrSOa@nuc> <4d4398e5-81ba-4fbd-9806-649ec70abdb4@technologyfriends.net> <ZqLTA_tSP9dEQwil@nuc> <ZqLUX47CAUAlq7nq@nuc> <912f849b-95ac-4d29-8a86-300999a0a9c4@technologyfriends.net>

On 7/25/24 18:11, Jake Freeland wrote:
> On 7/25/24 17:40, Mark Johnston wrote:
>> On Thu, Jul 25, 2024 at 06:34:43PM -0400, Mark Johnston wrote:
>>> On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
>>>> On 7/25/24 15:18, Mark Johnston wrote:
>>>>> On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
>>>>>> On 7/25/24 14:02, Konstantin Belousov wrote:
>>>>>>> On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I have been steadily working on bringing Data Plane Development 
>>>>>>>> Kit (DPDK)
>>>>>>>> on FreeBSD up to date with the Linux version. The most 
>>>>>>>> significant hurdle so
>>>>>>>> far has been supporting concurrent DPDK processes, each with 
>>>>>>>> their own
>>>>>>>> contiguous memory regions.
>>>>>>>>
>>>>>>>> These contiguous regions are used by DPDK as a heap for 
>>>>>>>> allocating DMA
>>>>>>>> buffers and other miscellaneous resources. Retrieving the 
>>>>>>>> underlying memory
>>>>>>>> and mapping these regions is currently different on Linux and 
>>>>>>>> FreeBSD:
>>>>>>>>
>>>>>>>> On Linux, hugepages are fetched from the kernel's pre-allocated 
>>>>>>>> hugepage
>>>>>>>> pool and are mapped into virtual address space on DPDK 
>>>>>>>> initialization. Since
>>>>>>>> the hugepages exist in a pool, multiple processes can reserve 
>>>>>>>> their own
>>>>>>>> hugepages and operate concurrently.
>>>>>>>>
>>>>>>>> On FreeBSD, DPDK uses an in-house contigmem kernel module that 
>>>>>>>> reserves a
>>>>>>>> large contiguous region of memory on load. During DPDK 
>>>>>>>> initialization, the
>>>>>>>> entire region is mapped into virtual address space. This leaves 
>>>>>>>> no memory
>>>>>>>> for another independent DPDK process, so only one process can 
>>>>>>>> operate at a
>>>>>>>> time.
>>>>>>>>
>>>>>>>> I could modify the DPDK contigmem module to mimic Linux's 
>>>>>>>> hugepages, but I
>>>>>>>> thought it would be better to integrate and upstream a 
>>>>>>>> hugepage-like
>>>>>>>> interface directly in the FreeBSD kernel source. I am writing 
>>>>>>>> this email to
>>>>>>>> see if anyone has any advice on the matter. I did not see any 
>>>>>>>> previous
>>>>>>>> attempts at this in Phabricator or the commit log, but it is 
>>>>>>>> possible that I
>>>>>>>> missed it. I have read about transparent superpage promotion, 
>>>>>>>> but that seems
>>>>>>>> like a different mechanism altogether.
>>>>>>>>
>>>>>>>> At a quick glance, the implementation seems straightforward: 
>>>>>>>> read some
>>>>>>>> loader tunables, allocate persistent hugepages at boot time, 
>>>>>>>> and create a
>>>>>>>> pseudo filesystem that supports creating and mapping hugepages. 
>>>>>>>> I could be
>>>>>>>> underestimating the magnitude of this task, but that is why I'm 
>>>>>>>> asking for
>>>>>>>> thoughts and advice :)
>>>>>>>>
>>>>>>>> For reference, here is Linux's documentation on hugepages:
>>>>>>>> https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
>>>>>>> Are POSIX shm largepage objects enough? (They were developed to
>>>>>>> support DPDK.)  Look for shm_create_largepage(3).
>>>>>> Yes, shm_create_largepage(2) looks promising, but I would like 
>>>>>> the ability
>>>>>> to allocate these largepages at boot time, when memory 
>>>>>> fragmentation is at a
>>>>>> minimum. Perhaps a couple of sysctl tunables could be added onto the
>>>>>> vm.largepages node to specify a pagesize and allocate some number 
>>>>>> of pages
>>>>>> at boot?
>>>>> We could add an rc script which creates named largepage objects.  
>>>>> This
>>>>> can be done using the posixshmcontrol utility.  That might not be 
>>>>> early
>>>>> enough during boot for some purposes.  In that case, we could have a
>>>>> module which creates such objects from within the kernel. This is
>>>>> pretty straightforward to do; I wrote a dumb version of this for a
>>>>> mips-specific project a few years ago, feel free to take code or
>>>>> inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c
>>>> Looks simple enough. Thanks for the example code.
>>>>
>>>>>> It seems Linux had an interface similar to 
>>>>>> shm_create_largepage(2) back in
>>>>>> v2.5, but they removed it in favor of their hugetlbfs filesystem. 
>>>>>> It would
>>>>>> be nice to stay close to the file-backed Linux interface to 
>>>>>> maximize code
>>>>>> sharing in userspace. It looks like the foundation for hugepages 
>>>>>> is there,
>>>>>> but the interface for allocation and access needs to be extended.
>>>>> POSIX shm objects have most of the properties one would want, I'd
>>>>> expect, save the ability to access them via standard syscalls.  What
>>>>> else is missing besides the ability to reserve memory at boot time?
>>>> Most notably, I would like the ability to allocate pages in a 
>>>> specific NUMA
>>>> domain.
>>> I thought this was already supported, but it seems not...
>> Thinking a bit more, I'm pretty sure I had just been using something
>> like
>>
>> $ cpuset -n prefer:<domain> posixshmcontrol create -l 1G /largepage-1G-<domain>
>>
>> so didn't need an explicit NUMA configuration parameter.  In C one would
>> use cpuset_setdomain(2) instead, but that's not as convenient. So,
>> imbuing a NUMA domain in struct shm_largepage_conf is still probably a
>> reasonable thing to do.
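
For reference, the C route mentioned above might look roughly like the sketch below. It sets a "prefer" domain policy on the calling process with cpuset_setdomain(2) and then creates the largepage object. The helper names find_psind() and create_largepage_in_domain() are made up for illustration, the FreeBSD-only part is guarded so the file still builds elsewhere, and I have not verified it against a NUMA machine, so treat it as a starting point rather than the final interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/domainset.h>
#include <sys/mman.h>
#include <fcntl.h>
#endif

/* Return the index of `want` in a page-size array, or -1 if it is absent. */
int
find_psind(const size_t *sizes, int n, size_t want)
{
	assert(sizes != NULL || n == 0);
	for (int i = 0; i < n; i++)
		if (sizes[i] == want)
			return (i);
	return (-1);
}

/*
 * Hypothetical helper: create a largepage shm object after asking the
 * kernel to prefer `domain` for this process's allocations. The policy
 * change applies to the whole process and stays in effect afterwards;
 * per the discussion above, the pages themselves are allocated when the
 * object is sized with ftruncate(2), so the policy must still be in
 * place at that point. Returns a descriptor, or -1 with errno set.
 */
int
create_largepage_in_domain(const char *path, size_t pagesize, int domain)
{
#ifdef __FreeBSD__
	size_t sizes[MAXPAGESIZES];
	domainset_t mask;
	int n, psind;

	/* Translate the requested page size into a page-size index. */
	n = getpagesizes(sizes, MAXPAGESIZES);
	if (n < 0)
		return (-1);
	psind = find_psind(sizes, n, pagesize);
	if (psind < 0) {
		errno = EINVAL;
		return (-1);
	}

	/* A "prefer" policy takes a mask with exactly one domain set. */
	DOMAINSET_ZERO(&mask);
	DOMAINSET_SET(domain, &mask);
	if (cpuset_setdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask, DOMAINSET_POLICY_PREFER) != 0)
		return (-1);

	return (shm_create_largepage(path, O_CREAT | O_RDWR, psind,
	    SHM_LARGEPAGE_ALLOC_DEFAULT, 0600));
#else
	/* Not available outside FreeBSD. */
	(void)path;
	(void)pagesize;
	(void)domain;
	errno = ENOSYS;
	return (-1);
#endif
}
```

Having to change the process-wide policy just to place one object is the inconvenient part, which is the argument for carrying the domain in struct shm_largepage_conf instead.
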
>
> I just looked at the code, this seems very manageable. I'll draft up a 
> review.
>
>>> It should be very easy to implement: extend shm_largepage_conf to
>>> include a NUMA domain parameter, and specify that domain when 
>>> allocating
>>> pages for the object (in shm_largepage_dotruncate(), the
>>> vm_page_alloc_contig() call should become a
>>> vm_page_alloc_contig_domain() call).
>>>
>>>> Otherwise, in a perfect world, I'd like a unified interface for both
>>>> Linux and FreeBSD. Linux hugepages are managed using standard 
>>>> system calls;
>>>> files are mmap(2)'d into virtual address space from hugetlbfs and
>>>> ftruncate(2)'d.
>>> largepage shm objects work this way as well.
>
> After reading through the man page, this is quite apparent. Not sure 
> how I failed to make that connection. Anyway, this is starting to look 
> easier than I thought it would be. The only difference from a 
> userspace perspective that I can think of right now is how the pages 
> are created (e.g. hugetlbfs open(2) on Linux vs. 
> shm_create_largepage(2) on FreeBSD).

I suppose I should clarify that hugetlbfs open(2) does not create a 
hugepage, but rather attaches to one. So it would be analogous to a 
shm_open(2) instead of shm_create_largepage(2). The hugepages are 
created at boot time or via sysfs on Linux. My mistake.
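
To make the comparison concrete, here is a minimal sketch of the part that should be common to both systems once a descriptor is in hand: size the object with ftruncate(2) and map it with mmap(2). The function names attach_backing_file() and map_backing_fd() are invented for illustration. The attach step shown uses plain open(2), as one would on a hugetlbfs mount on Linux; on FreeBSD the descriptor would instead come from shm_open(2) on an existing largepage object. I am assuming the length passed in is a multiple of the backing page size.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Attach to a backing file by path (Linux hugetlbfs style).
 * On FreeBSD this step would be shm_open(2) on a largepage object.
 * Returns a descriptor, or -1 with errno set.
 */
int
attach_backing_file(const char *path)
{
	assert(path != NULL);
	return (open(path, O_CREAT | O_RDWR, 0600));
}

/*
 * Size the object behind fd to len bytes and map it shared.
 * This part is intended to be identical on both systems.
 * Returns NULL on failure with errno set.
 */
void *
map_backing_fd(int fd, size_t len)
{
	void *p;

	assert(fd >= 0 && len > 0);
	if (ftruncate(fd, (off_t)len) != 0)
		return (NULL);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return (p == MAP_FAILED ? NULL : p);
}
```

If this holds up, the only platform-specific code left in userspace would be the attach step.
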

Jake Freeland

>
> Thanks for the guidance Mark and Konstantin.
>
> Jake Freeland
>>>> A matching interface would not add an extra kernel
>>>> entrypoint and even more importantly, it would ease the 
>>>> Linux-to-FreeBSD
>>>> porting process for programs that use hugepages.
>



