From: Jake Freeland
To: Mark Johnston
Cc: Konstantin Belousov, freebsd-hackers@freebsd.org
Subject: Re: FreeBSD hugepages
Date: Thu, 25 Jul 2024 18:20:42 -0500
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On 7/25/24 18:11, Jake Freeland wrote:
> On 7/25/24 17:40, Mark Johnston wrote:
>> On Thu, Jul 25, 2024 at 06:34:43PM -0400, Mark Johnston wrote:
>>> On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
>>>> On 7/25/24 15:18, Mark Johnston wrote:
>>>>> On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
>>>>>> On 7/25/24 14:02, Konstantin Belousov wrote:
>>>>>>> On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I have been steadily working on bringing Data Plane Development Kit
>>>>>>>> (DPDK) on FreeBSD up to date with the Linux version. The most
>>>>>>>> significant hurdle so far has been supporting concurrent DPDK
>>>>>>>> processes, each with their own contiguous memory regions.
>>>>>>>>
>>>>>>>> These contiguous regions are used by DPDK as a heap for allocating
>>>>>>>> DMA buffers and other miscellaneous resources. Retrieving the
>>>>>>>> underlying memory and mapping these regions is currently different
>>>>>>>> on Linux and FreeBSD:
>>>>>>>>
>>>>>>>> On Linux, hugepages are fetched from the kernel's pre-allocated
>>>>>>>> hugepage pool and are mapped into virtual address space on DPDK
>>>>>>>> initialization. Since the hugepages exist in a pool, multiple
>>>>>>>> processes can reserve their own hugepages and operate concurrently.
>>>>>>>>
>>>>>>>> On FreeBSD, DPDK uses an in-house contigmem kernel module that
>>>>>>>> reserves a large contiguous region of memory on load. During DPDK
>>>>>>>> initialization, the entire region is mapped into virtual address
>>>>>>>> space. This leaves no memory for another independent DPDK process,
>>>>>>>> so only one process can operate at a time.
>>>>>>>>
>>>>>>>> I could modify the DPDK contigmem module to mimic Linux's
>>>>>>>> hugepages, but I thought it would be better to integrate and
>>>>>>>> upstream a hugepage-like interface directly in the FreeBSD kernel
>>>>>>>> source. I am writing this email to see if anyone has any advice on
>>>>>>>> the matter. I did not see any previous attempts at this in
>>>>>>>> Phabricator or the commit log, but it is possible that I missed
>>>>>>>> it. I have read about transparent superpage promotion, but that
>>>>>>>> seems like a different mechanism altogether.
>>>>>>>>
>>>>>>>> At a quick glance, the implementation seems straightforward: read
>>>>>>>> some loader tunables, allocate persistent hugepages at boot time,
>>>>>>>> and create a pseudo filesystem that supports creating and mapping
>>>>>>>> hugepages. I could be underestimating the magnitude of this task,
>>>>>>>> but that is why I'm asking for thoughts and advice :)
>>>>>>>>
>>>>>>>> For reference, here is Linux's documentation on hugepages:
>>>>>>>> https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
>>>>>>> Are POSIX shm largepage objects enough (they were developed to
>>>>>>> support DPDK)? Look for shm_create_largepage(3).
>>>>>> Yes, shm_create_largepage(2) looks promising, but I would like the
>>>>>> ability to allocate these largepages at boot time when memory
>>>>>> fragmentation is at a minimum. Perhaps a couple sysctl tunables could
>>>>>> be added onto the vm.largepages node to specify a pagesize and
>>>>>> allocate some number of pages at boot?
>>>>> We could add an rc script which creates named largepage objects. This
>>>>> can be done using the posixshmcontrol utility. That might not be
>>>>> early enough during boot for some purposes. In that case, we could
>>>>> have a module which creates such objects from within the kernel.
>>>>> This is pretty straightforward to do; I wrote a dumb version of this
>>>>> for a mips-specific project a few years ago, feel free to take code
>>>>> or inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c
>>>> Looks simple enough. Thanks for the example code.
>>>>
>>>>>> It seems Linux had an interface similar to shm_create_largepage(2)
>>>>>> back in v2.5, but they removed it in favor of their hugetlbfs
>>>>>> filesystem. It would be nice to stay close to the file-backed Linux
>>>>>> interface to maximize code sharing in userspace. It looks like the
>>>>>> foundation for hugepages is there, but the interface for allocation
>>>>>> and access needs to be extended.
>>>>> POSIX shm objects have most of the properties one would want, I'd
>>>>> expect, save the ability to access them via standard syscalls. What
>>>>> else is missing besides the ability to reserve memory at boot time?
>>>> Most notably, I would like the ability to allocate pages in a specific
>>>> NUMA domain.
>>> I thought this was already supported, but it seems not...
>> Thinking a bit more, I'm pretty sure I had just been using something
>> like
>>
>> $ cpuset -n prefer: posixshmcontrol create -l 1G /largepage-1G-
>>
>> so didn't need an explicit NUMA configuration parameter. In C one would
>> use cpuset_setdomain(2) instead, but that's not as convenient. So,
>> imbuing a NUMA domain in struct shm_largepage_conf is still probably a
>> reasonable thing to do.
>
> I just looked at the code, this seems very manageable. I'll draft up a
> review.
>
>>> It should be very easy to implement: extend shm_largepage_conf to
>>> include a NUMA domain parameter, and specify that domain when
>>> allocating pages for the object (in shm_largepage_dotruncate(), the
>>> vm_page_alloc_contig() call should become a
>>> vm_page_alloc_contig_domain() call).
>>>
>>>> Otherwise, in a perfect world, I'd like a unified interface for both
>>>> Linux and FreeBSD.
>>>> Linux hugepages are managed using standard system calls; files are
>>>> mmap(2)'d into virtual address space from hugetlbfs and
>>>> ftruncate(2)'d.
>>> largepage shm objects work this way as well.
>
> After reading through the man page, this is quite apparent. Not sure how
> I failed to make that connection. Anyway, this is starting to look easier
> than I thought it would be. The only difference from a userspace
> perspective that I can think of right now is how the pages are created
> (e.g. hugetlbfs open(2) on Linux vs. shm_create_largepage(2) on FreeBSD).

I suppose I should clarify that hugetlbfs open(2) does not create a
hugepage, but rather attaches to one. So it would be analogous to
shm_open(2) instead of shm_create_largepage(2). The hugepages are created
at boot time or via sysfs on Linux. My mistake.

Jake Freeland

>
> Thanks for the guidance Mark and Konstantin.
>
> Jake Freeland
>>>> A matching interface would not add an extra kernel entrypoint and even
>>>> more importantly, it would ease the Linux-to-FreeBSD porting process
>>>> for programs that use hugepages.
>
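
To make the create-versus-attach distinction concrete, here is a rough
sketch in C. It is illustrative only and not taken from DPDK or from any
patch in this thread. The helper names region_create() and region_attach()
are made up, and the page size index 1 passed to shm_create_largepage(2) is
an assumption (an index into getpagesizes(3), usually the 2M size on amd64).
On non-FreeBSD systems the sketch falls back to an ordinary POSIX shm object
so that it still builds.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Create a named shared memory object and size it with ftruncate(2).
 * On FreeBSD this requests a largepage object; psind 1 is an assumption
 * and `size` must then be a multiple of that large page size.
 * Elsewhere a plain POSIX shm object stands in for it.
 * Returns a file descriptor, or -1 on error.
 */
static int
region_create(const char *name, size_t size)
{
	int fd;

#ifdef __FreeBSD__
	fd = shm_create_largepage(name, O_CREAT | O_EXCL | O_RDWR, 1,
	    SHM_LARGEPAGE_ALLOC_DEFAULT, 0600);
#else
	fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
#endif
	if (fd == -1)
		return (-1);
	if (ftruncate(fd, (off_t)size) == -1) {
		close(fd);
		shm_unlink(name);
		return (-1);
	}
	return (fd);
}

/*
 * Attach to an object that already exists and map it. There is no
 * O_CREAT here, so this fails if nobody created the object beforehand,
 * much like opening a hugetlbfs file only attaches to pooled pages.
 * Returns the mapping, or MAP_FAILED on error.
 */
static void *
region_attach(const char *name, size_t size)
{
	void *p;
	int fd;

	fd = shm_open(name, O_RDWR, 0);
	if (fd == -1)
		return (MAP_FAILED);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after close */
	return (p);
}
```

In this picture, an rc script or kernel module would play the role of
region_create() once at boot, and each DPDK process would only ever call
something like region_attach() on the name it was given.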