Date:      Sat, 7 Jul 2018 20:28:38 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Andrew Gallatin <gallatin@cs.duke.edu>
Cc:        "src-committers@freebsd.org" <src-committers@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>
Subject:   Re: svn commit: r335967 - head/sys/dev/mxge
Message-ID:  <YTOPR0101MB095358F0FF099F552CB35A49DD460@YTOPR0101MB0953.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <b58d62c9-bb81-f214-f3d0-8f5a388479db@cs.duke.edu>
References:  <201807050120.w651KP5K045633@pdx.rh.CN85.dnsmgr.net> <97ae3381-7c25-7b41-9670-84b825722f52@cs.duke.edu> <YTOPR0101MB09538327E9FF2BF025485638DD400@YTOPR0101MB0953.CANPRD01.PROD.OUTLOOK.COM>, <b58d62c9-bb81-f214-f3d0-8f5a388479db@cs.duke.edu>

Andrew Gallatin wrote:
>Given that we do TSO like Linux, and not like MS (meaning
>we express the size of the pre-segmented packet using
>a 16-bit value in the IPv4/IPv6 header), supporting more
>than 64K is not possible in FreeBSD, so I'm basically
>saying "nerf this constraint".
Well, my understanding was that the total length of the TSO
segment is in the first header mbuf of the chain handed to
the net driver.
I thought the 16-bit length field in the IP header was normally filled in
with that length because certain drivers/hardware expected it there.
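(As a rough sketch of what I mean, assuming a generic driver transmit path;
the fields are the stock ones in the mbuf packet header, nothing
driver-specific:)

/*
 * Sketch only: where a driver sees the TSO information the stack hands
 * down in the pkthdr of the first mbuf of the chain.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

static void
example_tso_pkthdr(struct mbuf *m)
{
	int total_len, mss;

	if ((m->m_pkthdr.csum_flags & CSUM_TSO) != 0) {
		/* Total length of the pre-segmented packet. */
		total_len = m->m_pkthdr.len;
		/* MSS that each segment generated by the NIC must fit. */
		mss = m->m_pkthdr.tso_segsz;
		printf("TSO request: %d bytes, %d byte segments\n",
		    total_len, mss);
	}
}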

>MS windows does it better / different; they express the
>size of the pre-segmented packet in packet metadata,
>leaving ip->ip_len = 0.  This is better, since
>then the pseudo hdr checksum in the template header can be
>re-used (with the len added) for every segment by the NIC.
>If you've ever seen a driver set ip->ip_len = 0, and re-calc
>the pseudo-hdr checksum, that's why.   This is also why
>MS LSOv2 can support TSO of packets larger than 64K, since they're
>not constrained by the 16-bit value in the IP{4,6} header.
>The value of TSO larger than 64K is questionable at best though.
>Without pacing, you'd just get more packets dropped when
>talking across the internet..
I think some drivers already do TSO segments greater than 64K.
(It has been a while, but I recall grep'ing for a case where if_hw_tsomax was
set to a large value and did find one. I think it was a "vm" fake hardware
driver.)
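(For reference, these are the knobs I mean, as a driver would set them before
ether_ifattach(); the numbers below are made up, just a sketch:)

/*
 * Sketch only: the per-interface TSO limits a driver advertises to the
 * stack.  The numbers are illustrative, not from any real hardware.
 */
#include <sys/param.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/in.h>
#include <netinet/ip.h>

static void
example_set_tso_limits(struct ifnet *ifp)
{

	/* Largest pre-segmented packet accepted (bytes). */
	ifp->if_hw_tsomax = IP_MAXPACKET;
	/* Most s/g segments the hardware can take per packet. */
	ifp->if_hw_tsomaxsegcount = 32;
	/* Largest single contiguous segment (bytes). */
	ifp->if_hw_tsomaxsegsize = 65536;
}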

I suspect the challenge is more finding out what the hardware actually
expects the IP header length field to be set to. If MS uses a setting of 0,
I'd guess most newer hardware can handle that?
Beyond that, this is way out of my area of expertise;-)
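(As I understand it, the driver idiom Andrew describes looks roughly like the
following for the IPv4 case; this is only a sketch, not taken from any
particular driver:)

/*
 * Sketch only: zero ip_len and seed th_sum with the pseudo-header
 * checksum (without the length) so the NIC can reuse it for every
 * segment it generates.  IPv4 case only.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <machine/in_cksum.h>

static void
example_tso_header_fixup(struct ip *ip, struct tcphdr *th)
{

	ip->ip_sum = 0;
	ip->ip_len = 0;		/* hardware fills in the per-segment length */
	th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
	    htons(IPPROTO_TCP));
}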

>> if_hw_tsomaxsegsize is the maximum size of contiguous memory
>> that a "chunk" of the TSO segment can be stored in for handling by
>> the driver's transmit side. Since higher

>And this is what I object to.  TCP should not care about
>this.  Drivers should use busdma, or otherwise be capable of
>chopping large contig regions down to chunks that they can
>handle.   If a driver can really only handle 2K, then it should
>be having busdma give it an s/g list that is 2x as long, not having
>TCP call m_dupcl() 2x as often on page-sized data generated by
>sendfile (or more on non-x86 with larger pages).
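(For illustration, the busdma pattern being described is roughly the
following; the MYDRV_MAXSEGS limit and the tag/map arguments are hypothetical
names, just a sketch:)

/*
 * Sketch only: let busdma build however long an s/g list the hardware
 * allows, compacting the mbuf chain only when it is truly too fragmented.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/bus.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

#define	MYDRV_MAXSEGS	64		/* hypothetical hardware limit */

static int
example_load_tx_mbuf(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **mp)
{
	bus_dma_segment_t segs[MYDRV_MAXSEGS];
	struct mbuf *m;
	int error, nsegs;

	error = bus_dmamap_load_mbuf_sg(tag, map, *mp, segs, &nsegs,
	    BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* Too many segments: compact the chain and retry once. */
		m = m_collapse(*mp, M_NOWAIT, MYDRV_MAXSEGS);
		if (m == NULL)
			return (ENOBUFS);
		*mp = m;
		error = bus_dmamap_load_mbuf_sg(tag, map, *mp, segs,
		    &nsegs, BUS_DMA_NOWAIT);
	}
	return (error);
}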
>
>> level code such as NFS (and iSCSI, I think?) uses MCLBYTE clusters,
>> anything 2K or higher normally works the same.  Not sure about
>> sosend(), but I think it also copies the data into MCLBYTE clusters?
>> This would change if someday jumbo mbuf clusters become the norm.
>> (I tried changing the NFS code to use jumbo clusters, but it would
>>   result in fragmentation of the memory used for mbuf cluster allocation,
>>   so I never committed it.)
>
>At least for sendfile(), vm pages are wrapped up and attached to
>mbufs, so you have 4K (and potentially much more on non-x86).
>Doesn't NFS do something similar when sending data, or do you copy
>into clusters?
Most NFS RPC messages are small and fit into a regular mbuf. I have to look
at the code to see when/if it uses an mbuf cluster for those. (It has changed
a few times over the years.)
For Read replies, it uses a chain of mbuf clusters. I suspect that it could
do what sendfile does for UFS. Part of the problem is that NFS clients can do
byte-aligned reads of any size, so going through the buffer cache is useful
sometimes. For write requests, odd-sized writes that are byte-aligned can
often happen when a loader does its thing.
For ZFS, I have no idea. I'm not a ZFS guy.
For write requests, the server gets whatever the TCP layer passes up,
which is normally a chain of mbufs.
(For the client substitute Read/Write, since the writes are copied out of the
buffer cache and the Read replies come up from TCP.)
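(A sketch of building such a chain of regular clusters for a read reply; the
NFS code has its own helpers for this, so the call below is only
illustrative:)

/*
 * Sketch only: one way to allocate a chain of regular (MCLBYTES)
 * clusters big enough to hold "len" bytes of read reply data.
 */
#include <sys/param.h>
#include <sys/mbuf.h>

static struct mbuf *
example_alloc_reply_chain(int len)
{

	/* m_getm2() chains together enough mbufs/clusters for len bytes. */
	return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
}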

>I have changes which I have not upstreamed yet which enhance mbufs to
>carry TLS metadata & vector of physical addresses (which I call
>unmapped mbufs) for sendfile and kernel TLS.  As part of that,
>sosend (for kTLS) can allocate many pages and attach them to one mbuf.
>The idea (for kTLS) is that you can keep an entire TLS record (with
>framing information) in a single unmapped mbuf, which saves a
>huge amount of CPU which would be lost to cache misses doing
>pointer-chasing of really long mbuf chains (TLS hdrs and trailers
>are generally 13 and 16 bytes).  The goal was to regain CPU
>during Netflix's transition to https streaming.  However, it
>is unintentionally quite helpful on i386, since it reduces
>overhead from having to map/unmap sf_bufs. FWIW, these mbufs
>have been in production at Netflix for over a year, and carry
>a large fraction of the world's internet traffic :)
These could probably be useful for the NFS server doing read replies, since
it does a VOP_READ() with a "uio" that refers to buffers (which happen to be
mbuf cluster data areas right now).
For the other cases, I'd have to look at it more closely.

They do sound interesting, rick
[stuff snipped]


