Date: Thu, 25 Jul 2024 18:40:31 -0400
From: Mark Johnston
To: Jake Freeland
Cc: Konstantin Belousov, freebsd-hackers@freebsd.org
Subject: Re: FreeBSD hugepages
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Thu, Jul 25, 2024 at 06:34:43PM -0400, Mark Johnston wrote:
> On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
> > On 7/25/24 15:18, Mark Johnston wrote:
> > > On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
> > > > On 7/25/24 14:02, Konstantin Belousov wrote:
> > > > > On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
> > > > > > Hi there,
> > > > > >
> > > > > > I have been steadily working on bringing Data Plane Development Kit (DPDK)
> > > > > > on FreeBSD up to date with the Linux version. The most significant hurdle so
> > > > > > far has been supporting concurrent DPDK processes, each with their own
> > > > > > contiguous memory regions.
> > > > > >
> > > > > > These contiguous regions are used by DPDK as a heap for allocating DMA
> > > > > > buffers and other miscellaneous resources. Retrieving the underlying memory
> > > > > > and mapping these regions is currently different on Linux and FreeBSD:
> > > > > >
> > > > > > On Linux, hugepages are fetched from the kernel's pre-allocated hugepage
> > > > > > pool and are mapped into virtual address space on DPDK initialization. Since
> > > > > > the hugepages exist in a pool, multiple processes can reserve their own
> > > > > > hugepages and operate concurrently.
> > > > > >
> > > > > > On FreeBSD, DPDK uses an in-house contigmem kernel module that reserves a
> > > > > > large contiguous region of memory on load. During DPDK initialization, the
> > > > > > entire region is mapped into virtual address space. This leaves no memory
> > > > > > for another independent DPDK process, so only one process can operate at a
> > > > > > time.
> > > > > >
> > > > > > I could modify the DPDK contigmem module to mimic Linux's hugepages, but I
> > > > > > thought it would be better to integrate and upstream a hugepage-like
> > > > > > interface directly in the FreeBSD kernel source. I am writing this email to
> > > > > > see if anyone has any advice on the matter. I did not see any previous
> > > > > > attempts at this in Phabricator or the commit log, but it is possible that I
> > > > > > missed it. I have read about transparent superpage promotion, but that seems
> > > > > > like a different mechanism altogether.
> > > > > >
> > > > > > At a quick glance, the implementation seems straightforward: read some
> > > > > > loader tunables, allocate persistent hugepages at boot time, and create a
> > > > > > pseudo filesystem that supports creating and mapping hugepages. I could be
> > > > > > underestimating the magnitude of this task, but that is why I'm asking for
> > > > > > thoughts and advice :)
> > > > > >
> > > > > > For reference, here is Linux's documentation on hugepages:
> > > > > > https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
> > > > > Are POSIX shm largepage objects enough (they were developed to support
> > > > > DPDK)? Look for shm_create_largepage(3).
> > > > Yes, shm_create_largepage(2) looks promising, but I would like the ability
> > > > to allocate these largepages at boot time when memory fragmentation is at a
> > > > minimum. Perhaps a couple sysctl tunables could be added onto the
> > > > vm.largepages node to specify a pagesize and allocate some number of pages
> > > > at boot?
> > > We could add an rc script which creates named largepage objects. This
> > > can be done using the posixshmcontrol utility. That might not be early
> > > enough during boot for some purposes. In that case, we could have a
> > > module which creates such objects from within the kernel. This is
> > > pretty straightforward to do; I wrote a dumb version of this for a
> > > mips-specific project a few years ago, feel free to take code or
> > > inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c
> >
> > Looks simple enough. Thanks for the example code.
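
As a rough illustration of the userspace route discussed above (an rc-time helper that creates named largepage objects), the following C sketch looks up the page-size index reported by getpagesizes(3), creates the object with shm_create_largepage(), and sizes it with ftruncate(2). The helper names pick_psind() and create_largepage_object(), the 0600 mode, and the default allocation policy are assumptions made for this sketch and do not come from the thread or from any existing tool. The FreeBSD-specific part is guarded so that the lookup helper also builds on other systems.

```c
#include <stddef.h>

/*
 * Return the index of `want` within the page-size array, or -1 if the
 * size is not supported. shm_create_largepage() takes this index as its
 * "psind" argument. (Hypothetical helper, for illustration only.)
 */
int
pick_psind(const size_t *sizes, int nsizes, size_t want)
{
    for (int i = 0; i < nsizes; i++) {
        if (sizes[i] == want)
            return (i);
    }
    return (-1);
}

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create a named largepage object backed by `npages` pages of `pagesize`
 * bytes each. Returns an open descriptor, or -1 on failure.
 * (Hypothetical helper; mode and allocation policy are assumptions.)
 */
int
create_largepage_object(const char *path, size_t pagesize, size_t npages)
{
    size_t sizes[MAXPAGESIZES];
    int fd, n, psind;

    n = getpagesizes(sizes, MAXPAGESIZES);
    if (n == -1)
        return (-1);
    psind = pick_psind(sizes, n, pagesize);
    if (psind == -1)
        return (-1);
    fd = shm_create_largepage(path, O_CREAT | O_RDWR, psind,
        SHM_LARGEPAGE_ALLOC_DEFAULT, 0600);
    if (fd == -1)
        return (-1);
    /* The backing pages are allocated when the object is sized. */
    if (ftruncate(fd, (off_t)(pagesize * npages)) == -1) {
        close(fd);
        return (-1);
    }
    return (fd);
}
#endif
```

A consumer such as DPDK could then open the same name and mmap(2) it with MAP_SHARED. Whether this runs early enough during boot is the open question raised in the thread.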
> >
> > > > It seems Linux had an interface similar to shm_create_largepage(2) back in
> > > > v2.5, but they removed it in favor of their hugetlbfs filesystem. It would
> > > > be nice to stay close to the file-backed Linux interface to maximize code
> > > > sharing in userspace. It looks like the foundation for hugepages is there,
> > > > but the interface for allocation and access needs to be extended.
> > > POSIX shm objects have most of the properties one would want, I'd
> > > expect, save the ability to access them via standard syscalls. What
> > > else is missing besides the ability to reserve memory at boot time?
> >
> > Most notably, I would like the ability to allocate pages in a specific NUMA
> > domain.
>
> I thought this was already supported, but it seems not...

Thinking a bit more, I'm pretty sure I had just been using something
like

$ cpuset -n prefer: posixshmcontrol create -l 1G /largepage-1G-

so didn't need an explicit NUMA configuration parameter. In C one would
use cpuset_setdomain(2) instead, but that's not as convenient.

So, imbuing a NUMA domain in struct shm_largepage_conf is still probably
a reasonable thing to do.

> It should be very easy to implement: extend shm_largepage_conf to
> include a NUMA domain parameter, and specify that domain when allocating
> pages for the object (in shm_largepage_dotruncate(), the
> vm_page_alloc_contig() call should become a
> vm_page_alloc_contig_domain() call).
>
> > Otherwise, in a perfect world, I'd like a unified interface for both
> > Linux and FreeBSD. Linux hugepages are managed using standard system calls;
> > files are mmap(2)'d into virtual address space from hugetlbfs and
> > ftruncate(2)'d.
>
> largepage shm objects work this way as well.
>
> > A matching interface would not add an extra kernel
> > entrypoint and even more importantly, it would ease the Linux-to-FreeBSD
> > porting process for programs that use hugepages.
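
To make the cpuset_setdomain(2) remark above concrete, here is a minimal C sketch of setting a "prefer" domain policy on the calling process before it creates a largepage object, which is roughly what running a command under cpuset -n with a prefer policy arranges. The helper names prefer_domain() and largepage_name(), the per-domain naming scheme, and the domain number a caller would pass are all assumptions made for this sketch. Whether the policy reaches the largepage allocation path in every case has not been verified here.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Build a per-domain object name such as "/largepage-1G-0".
 * The naming scheme is an assumption made for this sketch.
 * Returns the name length, or -1 if the buffer is too small.
 */
int
largepage_name(char *buf, size_t len, const char *size_tag, int domain)
{
    int n;

    n = snprintf(buf, len, "/largepage-%s-%d", size_tag, domain);
    if (n < 0 || (size_t)n >= len)
        return (-1);
    return (n);
}

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/domainset.h>

/*
 * Ask the kernel to prefer memory from `domain` for the calling
 * process. An id of -1 refers to the caller itself.
 * (Hypothetical helper, for illustration only.)
 */
int
prefer_domain(int domain)
{
    domainset_t mask;

    DOMAINSET_ZERO(&mask);
    DOMAINSET_SET(domain, &mask);
    return (cpuset_setdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
        sizeof(mask), &mask, DOMAINSET_POLICY_PREFER));
}
#endif
```

A caller would invoke prefer_domain() first and then create and size the object. An explicit domain field in struct shm_largepage_conf, as suggested in the thread, would make the extra call unnecessary.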