FreeBSD Mail Archives

Date:      Fri, 02 Feb 2024 19:47:40 -0500
From:      "Drew Gallatin" <gallatin@freebsd.org>
To:        "Rick Macklem" <rick.macklem@gmail.com>, "Richard Scheffenegger" <rscheff@freebsd.org>
Cc:        "freebsd-net@FreeBSD.org" <freebsd-net@freebsd.org>, "FreeBSD Transport" <freebsd-transport@freebsd.org>, rmacklem@freebsd.org, kp@freebsd.org
Subject:   Re: Increasing TCP TSO size support
Message-ID:  <e5df5725-ac9c-4e88-ade5-b0a561bfacd6@app.fastmail.com>
In-Reply-To:  <CAM5tNy6TbvXqrRRD=XpDBRGk81rzW5k38AzXeKFKLDL01fOYQQ@mail.gmail.com>
References:  <2c31ac44-b34b-469c-a6de-fdd927ec2f9e@freebsd.org> <CAM5tNy6TbvXqrRRD=XpDBRGk81rzW5k38AzXeKFKLDL01fOYQQ@mail.gmail.com>

--d72aaec284da4bab8e1160d4085e3fc4
Content-Type: text/plain



On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
>  A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS write request
> or read reply will result in a 514 element mbuf chain. Each of these (mostly 2K mbuf clusters)
> are non-contiguous data segments. (I suspect most NICs do not handle this many segments well,
> if at all.)

Excellent point.

> 
> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for the ktls), but I do not
> know what it would take to make these work for non-KTLS TSO?


Sendfile already uses M_EXTPG mbufs. When I was initially doing the M_EXTPG work for kTLS, I added support for using M_EXTPG mbufs in sendfile regardless of whether or not kTLS was in use.  That reduced CPU use marginally on 64-bit platforms (due to shorter socket buffer mbuf chains, and hence less pointer chasing), and quite a bit more on 32-bit platforms (due to also not needing to map memory into the kernel map, and by reducing pointer chasing even more, as more pages fit into an M_EXTPG mbuf when a paddr_t is 32 bits).


> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?

No, it's fully aware of how to handle M_EXTPG mbufs.  Look at tcp_m_copy().  We added code in the segment counting part of that function to count the hdr/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page being misaligned.
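
As a rough illustration of that kind of segment counting, here is a minimal userland sketch. It is not the code in tcp_m_copy(): the struct, its field names, and the 4K page size are all assumptions made for the example, and it assumes every page is its own DMA segment (no coalescing of physically adjacent pages). The header and trailer each add one segment when present, and a payload that starts partway into its first page can spill onto one extra page.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_BYTES 4096u  /* assumed page size for this model */

/* Hypothetical stand-in for an M_EXTPG mbuf; NOT the kernel's struct mbuf. */
struct model_extpg {
	size_t hdr_len;       /* inline header bytes (0 if none) */
	size_t trl_len;       /* inline trailer bytes (0 if none) */
	size_t first_pg_off;  /* offset of the payload within its first page */
	size_t data_len;      /* payload bytes held in the pages */
};

/*
 * Count the data segments this model buffer would contribute:
 * one for the header, one per page touched by the payload,
 * one for the trailer.
 */
static size_t
model_extpg_segments(const struct model_extpg *m)
{
	size_t segs = 0;

	assert(m->first_pg_off < MODEL_PAGE_BYTES);

	if (m->hdr_len > 0)
		segs++;
	if (m->data_len > 0) {
		/* Round up over the span measured from the start of the first page. */
		segs += (m->first_pg_off + m->data_len + MODEL_PAGE_BYTES - 1) /
		    MODEL_PAGE_BYTES;
	}
	if (m->trl_len > 0)
		segs++;
	return (segs);
}
```

In this model a 16K payload that is page aligned counts as 4 segments, while the same payload starting 100 bytes into its first page counts as 5, which is the misalignment effect mentioned above.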

> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can become
> a single contiguous data segment for TSO or ???)

No, it just means that a NIC driver has been verified not to call mtod() on an M_EXTPG mbuf and dereference the resulting data pointer (which would make it go "boom").

But the page size is only 4K on most platforms.  So while an M_EXTPG mbuf can hold 5 pages (from memory, too lazy to do the math right now) and reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k per mbuf), the S/G list that a NIC will need to consume would likely decrease only by a factor of 2.  And even then only if the busdma code to map mbufs for DMA is not coalescing adjacent mbufs.  I know busdma does some coalescing, but I can't recall if it coalesces physically adjacent mbufs.
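
To make the arithmetic above concrete, here is a small sketch that compares chain length and S/G entry count for the two layouts. The constants are assumptions taken from this thread: 2K clusters, 4K pages, and the from-memory figure of 5 pages per M_EXTPG mbuf. It also assumes page-aligned data and no busdma coalescing, and it ignores the extra header mbufs that make up the 514-element chain quoted earlier. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_BYTES   2048u  /* 2K mbuf cluster, as in the NFS example */
#define PAGE_BYTES      4096u  /* 4K page; platform dependent */
#define PAGES_PER_EXTPG 5u     /* from-memory figure in this thread; an assumption */

static size_t
div_round_up(size_t n, size_t d)
{
	assert(d != 0);
	return ((n + d - 1) / d);
}

/* mbufs in the chain when each mbuf carries one 2K cluster. */
static size_t
cluster_chain_len(size_t payload)
{
	return (div_round_up(payload, CLUSTER_BYTES));
}

/* S/G entries for the cluster layout: one per cluster, no coalescing. */
static size_t
cluster_sg_entries(size_t payload)
{
	return (cluster_chain_len(payload));
}

/* S/G entries for the M_EXTPG layout: one per 4K page, no coalescing. */
static size_t
extpg_sg_entries(size_t payload)
{
	return (div_round_up(payload, PAGE_BYTES));
}

/* mbufs in the chain when each M_EXTPG mbuf carries PAGES_PER_EXTPG pages. */
static size_t
extpg_chain_len(size_t payload)
{
	return (div_round_up(extpg_sg_entries(payload), PAGES_PER_EXTPG));
}
```

Under these assumptions, a 1 Mbyte payload is 512 mbufs and 512 S/G entries with clusters, versus 52 mbufs and 256 S/G entries with M_EXTPG: the chain shrinks by roughly a factor of 10, the S/G list only by a factor of 2.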

> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being called) were to
> all work ok for M_EXTPG mbufs, it would be easy to enable that for NFS (non-TLS case).


It does.  You should enable it for at least TCP.

Drew
--d72aaec284da4bab8e1160d4085e3fc4--

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?e5df5725-ac9c-4e88-ade5-b0a561bfacd6>
