From: Rick Macklem <rmacklem@uoguelph.ca>
To: Andrew Gallatin
CC: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r335967 - head/sys/dev/mxge
Date: Sat, 7 Jul 2018 20:28:38 +0000
List-Id: SVN commit messages for the src tree for head/-current

Andrew Gallatin wrote:
>Given that we do TSO like Linux, and not like MS (meaning
>we express the size of the pre-segmented packet using
>a 16-bit value in the IPv4/IPv6 header), supporting more
>than 64K is not possible in FreeBSD, so I'm basically
>saying "nerf this constraint".

Well, my understanding was that the total length of the TSO
segment is in the first header mbuf of the chain handed to the net driver.
I thought the 16-bit length field in the IP header was normally filled in
with the length because certain drivers/hardware expected that.

>MS windows does it better / different; they express the
>size of the pre-segmented packet in packet metadata,
>leaving ip->ip_len = 0.
>This is better, since
>then the pseudo hdr checksum in the template header can be
>re-used (with the len added) for every segment by the NIC.
>If you've ever seen a driver set ip->ip_len = 0, and re-calc
>the pseudo-hdr checksum, that's why. This is also why
>MS LSOv2 can support TSO of packets larger than 64K, since they're
>not constrained by the 16-bit value in the IP{4,6} header.
>The value of TSO larger than 64K is questionable at best though.
>Without pacing, you'd just get more packets dropped when
>talking across the internet.

I think some drivers already do TSO segments greater than 64K.
(It has been a while, but I recall "grep"ing for a case where if_hw_tsomax
was set to a large value and did find one. I think it was a "vm" fake
hardware driver.)

I suspect the challenge is more finding out what the hardware actually
expects the IP header length to be set to. If MS uses a setting of 0, I'd
guess most newer hardware can handle that?
Beyond that, this is way out of my area of expertise ;-)

>> if_hw_tsomaxsegsize is the maximum size of contiguous memory
>> that a "chunk" of the TSO segment can be stored in for handling by
>> the driver's transmit side. Since higher
>And this is what I object to. TCP should not care about
>this. Drivers should use busdma, or otherwise be capable of
>chopping large contig regions down to chunks that they can
>handle. If a driver can really only handle 2K, then it should
>be having busdma give it an s/g list that is 2x as long, not having
>TCP call m_dupcl() 2x as often on page-sized data generated by
>sendfile (or more on non-x86 with larger pages).
>
>> level code such as NFS (and iSCSI, I think?) uses MCLBYTES clusters,
>> anything 2K or higher normally works the same. Not sure about
>> sosend(), but I think it also copies the data into MCLBYTES clusters?
>> This would change if someday jumbo mbuf clusters become the norm.
>> (I tried changing the NFS code to use jumbo clusters, but it would
>> result in fragmentation of the memory used for mbuf cluster allocation,
>> so I never committed it.)
>
>At least for sendfile(), vm pages are wrapped up and attached to
>mbufs, so you have 4K (and potentially much more on non-x86).
>Doesn't NFS do something similar when sending data, or do you copy
>into clusters?

Most NFS RPC messages are small and fit into a regular mbuf. I have to look
at the code to see when/if it uses an mbuf cluster for those. (It has
changed a few times over the years.)
For Read replies, it uses a chain of mbuf clusters. I suspect that it could
do what sendfile does for UFS. Part of the problem is that NFS clients can
do byte-aligned reads of any size, so going through the buffer cache is
useful sometimes. For write requests, odd-sized writes that are byte-aligned
can often happen when a loader does its thing.
For ZFS, I have no idea. I'm not a ZFS guy.
For write requests, the server gets whatever the TCP layer passes up,
which is normally a chain of mbufs.
(For the client, substitute Read/Write, since the writes are copied out of
the buffer cache and the Read replies come up from TCP.)

>I have changes which I have not upstreamed yet which enhance mbufs to
>carry TLS metadata & vector of physical addresses (which I call
>unmapped mbufs) for sendfile and kernel TLS. As part of that,
>sosend (for kTLS) can allocate many pages and attach them to one mbuf.
>The idea (for kTLS) is that you can keep an entire TLS record (with
>framing information) in a single unmapped mbuf, which saves a
>huge amount of CPU which would be lost to cache misses doing
>pointer-chasing of really long mbuf chains (TLS hdrs and trailers
>are generally 13 and 16 bytes). The goal was to regain CPU
>during Netflix's transition to https streaming. However, it
>is unintentionally quite helpful on i386, since it reduces
>overhead from having to map/unmap sf_bufs.
>FWIW, these mbufs
>have been in production at Netflix for over a year, and carry
>a large fraction of the world's internet traffic :)

These could probably be useful for the NFS server doing read replies, since
it does a VOP_READ() with a "uio" that refers to buffers (which happen to be
mbuf cluster data areas right now).
For the other cases, I'd have to look at it more closely.

They do sound interesting, rick
[stuff snipped]
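
Illustration of the pseudo-header point quoted earlier in this message
(that with ip->ip_len left at 0, a pseudo-header checksum computed once
can be re-used for every segment with only the length added). This is a
minimal sketch of the one's-complement arithmetic, assuming IPv4 and TCP
(protocol 6). It is not code from FreeBSD or from any driver; every
function name and the example addresses and segment sizes are
hypothetical.

```python
import socket
import struct


def ones_sum(data: bytes, acc: int = 0) -> int:
    """Add the 16-bit big-endian words of data into an accumulator."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd-length buffer with a zero byte
    for (word,) in struct.iter_unpack("!H", data):
        acc += word
    return acc


def fold(acc: int) -> int:
    """Fold carries back into the low 16 bits (one's-complement add)."""
    while acc >> 16:
        acc = (acc & 0xFFFF) + (acc >> 16)
    return acc


def pseudo_partial(src: bytes, dst: bytes, proto: int = 6) -> int:
    """Pseudo-header sum over addresses and protocol, length left out."""
    return fold(ones_sum(src + dst + struct.pack("!H", proto)))


def add_length(partial: int, tcp_len: int) -> int:
    """Per-segment step: add the TCP length to the re-used partial sum."""
    return fold(partial + tcp_len)


def pseudo_full(src: bytes, dst: bytes, tcp_len: int, proto: int = 6) -> int:
    """Reference: the pseudo-header sum recomputed from scratch."""
    return fold(ones_sum(src + dst + struct.pack("!HH", proto, tcp_len)))


# Hypothetical example: documentation addresses, 20-byte TCP header.
src = socket.inet_aton("192.0.2.1")
dst = socket.inet_aton("198.51.100.7")
template = pseudo_partial(src, dst)  # computed once per TSO request
for payload in (1448, 1448, 600):    # assumed per-segment payload sizes
    tcp_len = 20 + payload
    assert add_length(template, tcp_len) == pseudo_full(src, dst, tcp_len)
```

The loop checks that adding the per-segment length to the shared partial
sum gives the same value as recomputing the whole pseudo-header for that
segment. The final TCP checksum would still need the sum over the TCP
header and payload, followed by the one's complement.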