From nobody Thu Jul 25 21:11:22 2024
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4WVNq31cRPz5RKTh
	for <freebsd-hackers@mlmmj.nyi.freebsd.org>; Thu, 25 Jul 2024 21:11:27 +0000 (UTC)
	(envelope-from jake@technologyfriends.net)
Received: from ci74p00im-qukt09090501.me.com (ci74p00im-qukt09090501.me.com [17.57.156.22])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id 4WVNq26xNfz4G0N
	for <freebsd-hackers@freebsd.org>; Thu, 25 Jul 2024 21:11:26 +0000 (UTC)
	(envelope-from jake@technologyfriends.net)
Authentication-Results: mx1.freebsd.org;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=technologyfriends.net; s=sig1; t=1721941885;
	bh=miwWTOTo4TOzWgG4TNgpUhVhQhtkNzkYMcQBBEeMLz4=;
	h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type;
	b=jcSfAQ8hop1AV80Se9K4M50/r++4spR48NTz4ewVwc6b3gNg+KvUUYtnN4pJj1iGf
	 XUyuwhj3Y4H7f0pGK3P11HmHxzeeYoUU4qs8GCE6XUVabr1iDhwi0zyOBoY7JwR7Vf
	 OYfszJTEAvANaLQttTVUgkzk7ZnI0ijlyf3aiWjqjQLA/28b/+epb7h+SC7CZ3sjfP
	 hsxI2/srfKuXFKR6dawP688XFXNyBrt5iZRDQrI2f6ymFfbv8Quvfg54x0Q4UnvBo9
	 /qp+BscbEBKlxqKm3SOs5BQzc7/TRkU+kiHfjCV8SSpifJPlSr36LvPME9Hy3gQrv0
	 Rj5DcVMCdo0iQ==
Received: from [10.0.233.209] (ci77p00im-dlb-asmtp-mailmevip.me.com [17.57.156.26])
	by ci74p00im-qukt09090501.me.com (Postfix) with ESMTPSA id 4201D46401FD;
	Thu, 25 Jul 2024 21:11:24 +0000 (UTC)
Message-ID: <4d4398e5-81ba-4fbd-9806-649ec70abdb4@technologyfriends.net>
Date: Thu, 25 Jul 2024 16:11:22 -0500
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@FreeBSD.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: FreeBSD hugepages
To: Mark Johnston <markj@freebsd.org>
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-hackers@freebsd.org
References: <1ced4290-4a31-4218-8611-63a44c307e87@technologyfriends.net>
 <ZqKhP0aR0fb_f6XE@kib.kiev.ua>
 <35da66f9-b913-45ea-90f4-16a2fa072848@technologyfriends.net>
 <ZqKzCK4pHg1mrSOa@nuc>
Content-Language: en-US
From: Jake Freeland <jake@technologyfriends.net>
In-Reply-To: <ZqKzCK4pHg1mrSOa@nuc>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Proofpoint-GUID: e_CfyNCBo8I2LLM_8NiIBs3OBGYI2-85
X-Proofpoint-ORIG-GUID: e_CfyNCBo8I2LLM_8NiIBs3OBGYI2-85
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.272,Aquarius:18.0.1039,Hydra:6.0.680,FMLib:17.12.28.16
 definitions=2024-07-25_21,2024-07-25_03,2024-05-17_01
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 adultscore=0 phishscore=0
 suspectscore=0 malwarescore=0 mlxscore=0 spamscore=0 bulkscore=0
 mlxlogscore=999 clxscore=1030 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.19.0-2308100000 definitions=main-2407250145
X-Spamd-Bar: ----
X-Rspamd-Pre-Result: action=no action;
	module=replies;
	Message is reply to one we originated
X-Spamd-Result: default: False [-4.00 / 15.00];
	REPLY(-4.00)[];
	ASN(0.00)[asn:714, ipnet:17.57.156.0/24, country:US]
X-Rspamd-Queue-Id: 4WVNq26xNfz4G0N

On 7/25/24 15:18, Mark Johnston wrote:
> On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
>> On 7/25/24 14:02, Konstantin Belousov wrote:
>>> On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
>>>> Hi there,
>>>>
>>>> I have been steadily working on bringing Data Plane Development Kit (DPDK)
>>>> on FreeBSD up to date with the Linux version. The most significant hurdle so
>>>> far has been supporting concurrent DPDK processes, each with their own
>>>> contiguous memory regions.
>>>>
>>>> These contiguous regions are used by DPDK as a heap for allocating DMA
>>>> buffers and other miscellaneous resources. Retrieving the underlying memory
>>>> and mapping these regions is currently different on Linux and FreeBSD:
>>>>
>>>> On Linux, hugepages are fetched from the kernel's pre-allocated hugepage
>>>> pool and are mapped into virtual address space on DPDK initialization. Since
>>>> the hugepages exist in a pool, multiple processes can reserve their own
>>>> hugepages and operate concurrently.
>>>>
>>>> On FreeBSD, DPDK uses an in-house contigmem kernel module that reserves a
>>>> large contiguous region of memory on load. During DPDK initialization, the
>>>> entire region is mapped into virtual address space. This leaves no memory
>>>> for another independent DPDK process, so only one process can operate at a
>>>> time.
>>>>
>>>> I could modify the DPDK contigmem module to mimic Linux's hugepages, but I
>>>> thought it would be better to integrate and upstream a hugepage-like
>>>> interface directly in the FreeBSD kernel source. I am writing this email to
>>>> see if anyone has any advice on the matter. I did not see any previous
>>>> attempts at this in Phabricator or the commit log, but it is possible that I
>>>> missed it. I have read about transparent superpage promotion, but that seems
>>>> like a different mechanism altogether.
>>>>
>>>> At a quick glance, the implementation seems straightforward: read some
>>>> loader tunables, allocate persistent hugepages at boot time, and create a
>>>> pseudo filesystem that supports creating and mapping hugepages. I could be
>>>> underestimating the magnitude of this task, but that is why I'm asking for
>>>> thoughts and advice :)
>>>>
>>>> For reference, here is Linux's documentation on hugepages:
>>>> https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
>>> Are POSIX shm largepage objects enough? (They were developed to support
>>> DPDK.)  Look for shm_create_largepage(3).
>> Yes, shm_create_largepage(2) looks promising, but I would like the ability
>> to allocate these largepages at boot time when memory fragmentation is at a
>> minimum. Perhaps a couple sysctl tunables could be added onto the
>> vm.largepages node to specify a pagesize and allocate some number of pages
>> at boot?
> We could add an rc script which creates named largepage objects.  This
> can be done using the posixshmcontrol utility.  That might not be early
> enough during boot for some purposes.  In that case, we could have a
> module which creates such objects from within the kernel.  This is
> pretty straightforward to do; I wrote a dumb version of this for a
> mips-specific project a few years ago, feel free to take code or
> inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c

Looks simple enough. Thanks for the example code.

>> It seems Linux had an interface similar to shm_create_largepage(2) back in
>> v2.5, but they removed it in favor of their hugetlbfs filesystem. It would
>> be nice to stay close to the file-backed Linux interface to maximize code
>> sharing in userspace. It looks like the foundation for hugepages is there,
>> but the interface for allocation and access needs to be extended.
> POSIX shm objects have most of the properties one would want, I'd
> expect, save the ability to access them via standard syscalls.  What
> else is missing besides the ability to reserve memory at boot time?

Most notably, I would like the ability to allocate pages in a specific
NUMA domain. Beyond that, in a perfect world, I'd like a unified interface
for both Linux and FreeBSD. Linux hugepages are managed using standard
system calls: files on hugetlbfs are sized with ftruncate(2) and mmap(2)'d
into virtual address space. A matching interface would not add an
extra kernel entry point and, even more importantly, would ease the
Linux-to-FreeBSD porting process for programs that use hugepages.
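
To make the file-backed flow concrete, here is a rough sketch of what a
caller does on the Linux side. It is an illustration under assumptions,
not DPDK's actual code: map_backing_file() is a hypothetical helper name,
and nothing in it is hugepage-specific beyond the path that is passed in.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) a backing file, size it with
 * ftruncate(2), and map it shared with mmap(2).
 *
 * On Linux, `path` would live on a hugetlbfs mount and `len` would have to
 * be a multiple of the hugepage size; neither is checked here.
 * Returns the mapping, or NULL on failure.
 */
static void *
map_backing_file(const char *path, size_t len)
{
	void *addr;
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return (NULL);
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return (NULL);
	}
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* The mapping remains valid after the close. */
	return (addr == MAP_FAILED ? NULL : addr);
}
```

On Linux this would be called with a path under a hugetlbfs mount (the
mount point is site-specific, so treat any particular path as an
assumption). The point is that the same three calls, open(2),
ftruncate(2), and mmap(2), could in principle serve both platforms if
FreeBSD exposed largepages through a filesystem path.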

Thanks,
Jake Freeland