From: Jake Freeland
To: Mark Johnston
Cc: Konstantin Belousov, freebsd-hackers@freebsd.org
Subject: Re: FreeBSD hugepages
Date: Thu, 25 Jul 2024 18:11:00 -0500
Message-ID: <912f849b-95ac-4d29-8a86-300999a0a9c4@technologyfriends.net>
References: <1ced4290-4a31-4218-8611-63a44c307e87@technologyfriends.net> <35da66f9-b913-45ea-90f4-16a2fa072848@technologyfriends.net> <4d4398e5-81ba-4fbd-9806-649ec70abdb4@technologyfriends.net>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On 7/25/24 17:40, Mark Johnston wrote:
> On Thu, Jul 25, 2024 at 06:34:43PM -0400, Mark Johnston wrote:
>> On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
>>> On 7/25/24 15:18, Mark Johnston wrote:
>>>> On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
>>>>> On 7/25/24 14:02, Konstantin Belousov wrote:
>>>>>> On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I have been steadily working on bringing Data Plane Development Kit (DPDK)
>>>>>>> on FreeBSD up to date with the Linux version. The most significant hurdle so
>>>>>>> far has been supporting concurrent DPDK processes, each with their own
>>>>>>> contiguous memory regions.
>>>>>>>
>>>>>>> These contiguous regions are used by DPDK as a heap for allocating DMA
>>>>>>> buffers and other miscellaneous resources. Retrieving the underlying memory
>>>>>>> and mapping these regions is currently different on Linux and FreeBSD:
>>>>>>>
>>>>>>> On Linux, hugepages are fetched from the kernel's pre-allocated hugepage
>>>>>>> pool and are mapped into virtual address space on DPDK initialization. Since
>>>>>>> the hugepages exist in a pool, multiple processes can reserve their own
>>>>>>> hugepages and operate concurrently.
>>>>>>>
>>>>>>> On FreeBSD, DPDK uses an in-house contigmem kernel module that reserves a
>>>>>>> large contiguous region of memory on load. During DPDK initialization, the
>>>>>>> entire region is mapped into virtual address space. This leaves no memory
>>>>>>> for another independent DPDK process, so only one process can operate at a
>>>>>>> time.
>>>>>>>
>>>>>>> I could modify the DPDK contigmem module to mimic Linux's hugepages, but I
>>>>>>> thought it would be better to integrate and upstream a hugepage-like
>>>>>>> interface directly in the FreeBSD kernel source. I am writing this email to
>>>>>>> see if anyone has any advice on the matter. I did not see any previous
>>>>>>> attempts at this in Phabricator or the commit log, but it is possible that I
>>>>>>> missed it. I have read about transparent superpage promotion, but that seems
>>>>>>> like a different mechanism altogether.
>>>>>>>
>>>>>>> At a quick glance, the implementation seems straightforward: read some
>>>>>>> loader tunables, allocate persistent hugepages at boot time, and create a
>>>>>>> pseudo filesystem that supports creating and mapping hugepages.
>>>>>>> I could be
>>>>>>> underestimating the magnitude of this task, but that is why I'm asking for
>>>>>>> thoughts and advice :)
>>>>>>>
>>>>>>> For reference, here is Linux's documentation on hugepages:
>>>>>>> https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
>>>>>> Are posix shm largepage objects enough (they were developed to support
>>>>>> DPDK)? Look for shm_create_largepage(3).
>>>>> Yes, shm_create_largepage(2) looks promising, but I would like the ability
>>>>> to allocate these largepages at boot time when memory fragmentation is at a
>>>>> minimum. Perhaps a couple sysctl tunables could be added onto the
>>>>> vm.largepages node to specify a pagesize and allocate some number of pages
>>>>> at boot?
>>>> We could add an rc script which creates named largepage objects. This
>>>> can be done using the posixshmcontrol utility. That might not be early
>>>> enough during boot for some purposes. In that case, we could have a
>>>> module which creates such objects from within the kernel. This is
>>>> pretty straightforward to do; I wrote a dumb version of this for a
>>>> mips-specific project a few years ago, feel free to take code or
>>>> inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c
>>> Looks simple enough. Thanks for the example code.
>>>
>>>>> It seems Linux had an interface similar to shm_create_largepage(2) back in
>>>>> v2.5, but they removed it in favor of their hugetlbfs filesystem. It would
>>>>> be nice to stay close to the file-backed Linux interface to maximize code
>>>>> sharing in userspace. It looks like the foundation for hugepages is there,
>>>>> but the interface for allocation and access needs to be extended.
>>>> POSIX shm objects have most of the properties one would want, I'd
>>>> expect, save the ability to access them via standard syscalls. What
>>>> else is missing besides the ability to reserve memory at boot time?
>>> Most notably, I would like the ability to allocate pages in a specific NUMA
>>> domain.
>> I thought this was already supported, but it seems not...
> Thinking a bit more, I'm pretty sure I had just been using something
> like
>
> $ cpuset -n prefer: posixshmcontrol create -l 1G /largepage-1G-
>
> so didn't need an explicit NUMA configuration parameter. In C one would
> use cpuset_setdomain(2) instead, but that's not as convenient. So,
> imbuing a NUMA domain in struct shm_largepage_conf is still probably a
> reasonable thing to do.

I just looked at the code, and this seems very manageable. I'll draft up a
review.

>> It should be very easy to implement: extend shm_largepage_conf to
>> include a NUMA domain parameter, and specify that domain when allocating
>> pages for the object (in shm_largepage_dotruncate(), the
>> vm_page_alloc_contig() call should become a
>> vm_page_alloc_contig_domain() call).
>>
>>> Otherwise, in a perfect world, I'd like a unified interface for both
>>> Linux and FreeBSD. Linux hugepages are managed using standard system calls;
>>> files are mmap(2)'d into virtual address space from hugetlbfs and
>>> ftruncate(2)'d.
>> largepage shm objects work this way as well.

After reading through the man page, this is quite apparent. Not sure how I
failed to make that connection.

Anyway, this is starting to look easier than I thought it would be. The only
difference from a userspace perspective that I can think of right now is how
the pages are created (e.g. hugetlbfs open(2) on Linux vs.
shm_create_largepage(2) on FreeBSD).

Thanks for the guidance, Mark and Konstantin.

Jake Freeland

>>> A matching interface would not add an extra kernel
>>> entrypoint and even more importantly, it would ease the Linux-to-FreeBSD
>>> porting process for programs that use hugepages.
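
For illustration only, here is a rough sketch of the userspace split described in the reply above: the platform-specific part is how the backing object is created, while ftruncate(2) and mmap(2) are shared. The helper names region_open() and region_map() are hypothetical and are not taken from DPDK or FreeBSD. The non-FreeBSD branch simply open(2)s whatever path it is given; on Linux, real code would presumably pass a path under a hugetlbfs mount. On FreeBSD, psind is assumed to be an index into the array returned by getpagesizes(3), and len must be a multiple of the selected large page size.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Platform-specific step (hypothetical helper): get a descriptor for the
 * backing object.
 * FreeBSD: create a largepage POSIX shm object; psind selects the page size
 * by its index in the getpagesizes(3) array.
 * Elsewhere: open an ordinary file; psind is ignored. On Linux the caller
 * would normally pass a path that lives on a hugetlbfs mount.
 */
int
region_open(const char *path, int psind)
{
	assert(path != NULL);
#ifdef __FreeBSD__
	return (shm_create_largepage(path, O_CREAT | O_RDWR, psind,
	    SHM_LARGEPAGE_ALLOC_DEFAULT, 0600));
#else
	(void)psind;
	return (open(path, O_CREAT | O_RDWR, 0600));
#endif
}

/*
 * Shared step (hypothetical helper): size the object with ftruncate(2) and
 * map it with mmap(2). Returns NULL on failure.
 */
void *
region_map(int fd, size_t len)
{
	void *p;

	if (ftruncate(fd, (off_t)len) != 0)
		return (NULL);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return (p == MAP_FAILED ? NULL : p);
}
```

A caller would invoke region_open() once per region, pass the descriptor to region_map(), and later release the mapping with munmap(2). Only region_open() needs an #ifdef, which is the point made above about how little differs between the two systems from userspace.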