Date: Sat, 7 Jul 2018 20:28:38 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: "src-committers@freebsd.org" <src-committers@freebsd.org>,
    "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>,
    "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>
Subject: Re: svn commit: r335967 - head/sys/dev/mxge
Message-ID: <YTOPR0101MB095358F0FF099F552CB35A49DD460@YTOPR0101MB0953.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <b58d62c9-bb81-f214-f3d0-8f5a388479db@cs.duke.edu>
References: <201807050120.w651KP5K045633@pdx.rh.CN85.dnsmgr.net>
 <97ae3381-7c25-7b41-9670-84b825722f52@cs.duke.edu>
 <YTOPR0101MB09538327E9FF2BF025485638DD400@YTOPR0101MB0953.CANPRD01.PROD.OUTLOOK.COM>
 <b58d62c9-bb81-f214-f3d0-8f5a388479db@cs.duke.edu>
Andrew Gallatin wrote:
>Given that we do TSO like Linux, and not like MS (meaning
>we express the size of the pre-segmented packet using a
>16-bit value in the IPv4/IPv6 header), supporting more
>than 64K is not possible in FreeBSD, so I'm basically
>saying "nerf this constraint".
Well, my understanding was that the total length of the TSO segment is in
the first header mbuf of the chain handed to the net driver.
I thought the 16-bit length field in the IP header was normally filled in
with the total length because certain drivers/hardware expected that.

>MS Windows does it better / different; they express the
>size of the pre-segmented packet in packet metadata,
>leaving ip->ip_len = 0. This is better, since
>then the pseudo hdr checksum in the template header can be
>re-used (with the len added) for every segment by the NIC.
>If you've ever seen a driver set ip->ip_len = 0, and re-calc
>the pseudo-hdr checksum, that's why. This is also why
>MS LSOv2 can support TSO of packets larger than 64K, since they're
>not constrained by the 16-bit value in the IP{4,6} header.
>The value of TSO larger than 64K is questionable at best though.
>Without pacing, you'd just get more packets dropped when
>talking across the internet..
I think some drivers already do TSO segments greater than 64K.
(It has been a while, but I recall "grep"ing for a case where if_hw_tsomax
was set to a large value and did find one. I think it was a "vm" fake
hardware driver.)
I suspect the challenge is more finding out what the hardware actually
expects the IP header length field to be set to. If MS uses a setting of 0,
I'd guess most newer hardware can handle that?
Beyond that, this is way out of my area of expertise;-)

>> if_hw_tsomaxsegsize is the maximum size of contiguous memory
>> that a "chunk" of the TSO segment can be stored in for handling by
>> the driver's transmit side. Since higher
>And this is what I object to. TCP should not care about
>this. Drivers should use busdma, or otherwise be capable of
>chopping large contig regions down to chunks that they can
>handle. If a driver can really only handle 2K, then it should
>be having busdma give it an s/g list that is 2x as long, not having
>TCP call m_dupcl() 2x as often on page-sized data generated by
>sendfile (or more on non-x86 with larger pages).
>
>> level code such as NFS (and iSCSI, I think?) uses MCLBYTE clusters,
>> anything 2K or higher normally works the same. Not sure about
>> sosend(), but I think it also copies the data into MCLBYTE clusters?
>> This would change if someday jumbo mbuf clusters become the norm.
>> (I tried changing the NFS code to use jumbo clusters, but it would
>> result in fragmentation of the memory used for mbuf cluster allocation,
>> so I never committed it.)
>
>At least for sendfile(), vm pages are wrapped up and attached to
>mbufs, so you have 4K (and potentially much more on non-x86).
>Doesn't NFS do something similar when sending data, or do you copy
>into clusters?
Most NFS RPC messages are small and fit into a regular mbuf. I would have to
look at the code to see when/if it uses an mbuf cluster for those. (It has
changed a few times over the years.)
For Read replies, it uses a chain of mbuf clusters. I suspect that it could
do what sendfile does for UFS. Part of the problem is that NFS clients can
do byte-aligned reads of any size, so going through the buffer cache is
useful sometimes. For write requests, odd-sized writes that are byte-aligned
can often happen when a loader does its thing.
For ZFS, I have no idea. I'm not a ZFS guy.
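(Circling back to the ip->ip_len = 0 case Andrew mentions, for anyone who
hasn't seen it: my recollection is that a driver's TSO setup path ends up
doing roughly the following. This is an untested sketch from memory; the
function name is made up and it isn't lifted from mxge or any other driver.)

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <machine/in_cksum.h>

/*
 * Sketch only: zero ip_len and leave a pseudo-header checksum (without
 * the length) in th_sum, so the NIC can add the per-segment length when
 * it generates each segment.  Assumes the IPv4 and TCP headers are
 * contiguous in the first mbuf and "hlen" is the IPv4 header length.
 */
static void
tso_hdr_fixup_sketch(struct mbuf *m, int hlen)
{
        struct ip *ip;
        struct tcphdr *th;

        ip = mtod(m, struct ip *);
        th = (struct tcphdr *)((char *)ip + hlen);

        ip->ip_sum = 0;         /* NIC recomputes the IP header checksum */
        ip->ip_len = 0;         /* total length carried in TSO metadata  */
        /* Pseudo-header sum without the length. */
        th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
            htons(IPPROTO_TCP));
}

The point being that the checksum the NIC starts from no longer depends on
the total length, so the 16-bit ip_len field stops being what limits the
size of the pre-segmented packet.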
For write requests, the server gets whatever the TCP layer passes up, which
is normally a chain of mbufs.
(For the client, substitute Read/Write, since the writes are copied out of
the buffer cache and the Read replies come up from TCP.)

>I have changes which I have not upstreamed yet which enhance mbufs to
>carry TLS metadata & vector of physical addresses (which I call
>unmapped mbufs) for sendfile and kernel TLS. As part of that,
>sosend (for kTLS) can allocate many pages and attach them to one mbuf.
>The idea (for kTLS) is that you can keep an entire TLS record (with
>framing information) in a single unmapped mbuf, which saves a
>huge amount of CPU which would be lost to cache misses doing
>pointer-chasing of really long mbuf chains (TLS hdrs and trailers
>are generally 13 and 16 bytes). The goal was to regain CPU
>during Netflix's transition to https streaming. However, it
>is unintentionally quite helpful on i386, since it reduces
>overhead from having to map/unmap sf_bufs. FWIW, these mbufs
>have been in production at Netflix for over a year, and carry
>a large fraction of the world's internet traffic :)
These could probably be useful for the NFS server doing read replies, since
it does a VOP_READ() with a "uio" that refers to buffers (which happen to be
mbuf cluster data areas right now).
For the other cases, I'd have to look at it more closely.
They do sound interesting, rick

[stuff snipped]
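ps: In case the "uio that refers to buffers" bit above isn't clear, the read
reply path does roughly the following today. This is a simplified, untested
sketch from memory (the function name is made up, and the real code handles
errors, pkthdrs, etc.); with unmapped mbufs the uio would point at the
attached pages instead of cluster data areas.

#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/proc.h>
#include <sys/ucred.h>
#include <sys/uio.h>
#include <sys/vnode.h>

/*
 * Sketch only: read "len" (> 0) bytes at "off" from the vnode into a
 * chain of regular 2K mbuf clusters by pointing a uio at the cluster
 * data areas.  The vnode is assumed locked by the caller (hence
 * IO_NODELOCKED); the caller frees the chain on error.
 */
static int
nfs_readtoclusters_sketch(struct vnode *vp, off_t off, int len,
    struct ucred *cred, struct thread *td, struct mbuf **mpp)
{
        struct mbuf *m, *m2;
        struct iovec *iv;
        struct uio io;
        int error, i, left, niov;

        niov = howmany(len, MCLBYTES);
        iv = malloc(niov * sizeof(*iv), M_TEMP, M_WAITOK);
        *mpp = m = m_getcl(M_WAITOK, MT_DATA, 0);
        left = len;
        for (i = 0; i < niov; i++) {
                m->m_len = MIN(left, MCLBYTES);
                iv[i].iov_base = mtod(m, caddr_t);
                iv[i].iov_len = m->m_len;
                left -= m->m_len;
                if (left > 0) {
                        m2 = m_getcl(M_WAITOK, MT_DATA, 0);
                        m->m_next = m2;
                        m = m2;
                }
        }
        io.uio_iov = iv;
        io.uio_iovcnt = niov;
        io.uio_offset = off;
        io.uio_resid = len;
        io.uio_segflg = UIO_SYSSPACE;
        io.uio_rw = UIO_READ;
        io.uio_td = td;
        error = VOP_READ(vp, &io, IO_NODELOCKED, cred);
        free(iv, M_TEMP);
        return (error);
}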