From: Terry Lambert
Date: Tue, 27 May 2003 08:53:55 -0700
To: Igor Sysoev
cc: arch@freebsd.org
Subject: Re: sendfile(2) SF_NOPUSH flag proposal
Message-ID: <3ED38A13.524529B2@mindspring.com>

Igor Sysoev wrote:
> > I would be really surprised if you were able to demonstrate a
> > measurable performance difference which was above the noise.
>
> I hope I will demonstrate at least CPU usage in near future.

See other post: that's the only place I expect there to be a
potential win; however, unless your CPU power is relatively low
compared to memory and PCI bus bandwidth, I expect the limiting
factor to be PCI bus bandwidth first, memory second, and CPU
overhead a distant third.
That changes if you are doing crypto, but then IPSEC changes all
your assumptions.

> > You were talking about the file and the header living in the
> > same packet.
>
> I mean that if you have 230 bytes header then sendfile() will send it
> in separate packet nevertheless the size of header and of the file.
> Something like this - 230, 1460, 1460, ...

Again, see other post: this is arguably a sendfile(2) bug, though a
really minor one; one which should be addressed in the sendfile(2)
implementation, and doesn't need options added to the API in order
to address it.

> > > it will return me 230 bytes:
> >
> > The "HEAD" is atypical, compared to the "GET"; the full Google
> > front page is larger than that, and consists of multiple files;
> > assuming you support HTTP/1.1 and pipelining, it's going to be
> > a back-to-back transfer involving multiple sendfile() calls.
>
> I use HEAD to show you the size of the HTTP header.
> The HEAD is atypical but such small HTTP header is typical.

Here is my problem: you are arguing both amortized cost and total
cost, depending on which is more supportive of your main thesis.

These arguments are separate and orthogonal to each other: they
don't support each other. You can argue tiny files, and a
relatively high total cost, or you can argue large files and
pipelining, and a relatively high amortized cost, but you can't
argue both tiny and large files, and both many connections and one
connection, at the same time.

Personally, I'd step back and get the arguments straight, and get
an implementation that demonstrates statistically significant
performance differences, and then come back, if I wanted to press
the case for additional option flags.

I have done this several times in the past, e.g. with my soft
interrupt coalescing implementation that's now part of most of the
ethernet drivers people care about.

Actually, in this case, I'd just try to fix sendfile(2) to do the
packet coalescing I'd expect, given the relative state of the
TCP_NODELAY and TCP_NOPUSH option flags.

> > 3 packets vs. 6. And using HTTP/1.0, there's also the three
> > handshake packets, SYN/SYN-ACK/ACK, and the three teardown
> > packets, FIN/FIN-ACK/ACK (or 4), plus the ACK's for
> > the packets you sent (should be one ACK, since that's below
> > the TCP window size).
>
> Actually 6 vs. 6 for this 8K file. But I said about another thing.
> Let's see 48K file and 250 bytes header. sendfile() usually sends
> it as 4K or 8K hunks so there are 48/8 * 6 + 1 (header) = 37 packets.
> But (48K + 250) / 1460 = 33 * 1460 + 1270 i.e. 34 packets.
> It's 8% decrease of data packets.

Which may or may not be a possible win; it depends on how close to
the bandwidth limit you are capable of driving your hardware. The
bandwidth delay product between you and the other end of the
connection is probably going to be a much more significant factor
when moving barely enough data to trigger one window framing event
(forced ACK).

> Add here the possible retransmitions.

Retransmissions are probably irrelevant; when you talk about a
retransmit, you are talking about data which is persisting in your
send sockbuf because it is outstanding unacknowledged data. At that
point, the mbuf chains are assembled.

The internal fragmentation you are complaining about here happens
because of the initial lack of a TF_NOPUSH flag on the tcpcb when
tcp_output() is called on it after the headers have been enqueued,
but before any file data has been enqueued.

So when a retransmit, if any, is necessary, the packet stream will
not have the same decoalesced state: it will retransmit exactly as
you wanted it to transmit in the first place.
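
For reference, the packet counts in the quoted 48K example can be
checked with a short script. This is only a sketch of the arithmetic,
not of what the stack actually does: it assumes the header always goes
out on its own, that each 8K hunk is segmented independently with no
coalescing across hunk boundaries, and a 1460-byte segment payload.
The function names are made up for illustration.

```python
MSS = 1460  # assumed payload bytes per full-sized TCP segment


def segments(nbytes, mss=MSS):
    """Number of segments needed to carry nbytes (integer ceiling)."""
    return (nbytes + mss - 1) // mss


def packets_chunked(header_len, file_len, chunk_len, mss=MSS):
    """Header sent alone, then each chunk segmented on its own."""
    count = segments(header_len, mss)
    remaining = file_len
    while remaining > 0:
        this_chunk = min(chunk_len, remaining)
        count += segments(this_chunk, mss)
        remaining -= this_chunk
    return count


def packets_coalesced(header_len, file_len, mss=MSS):
    """Header and file data packed back to back into full segments."""
    return segments(header_len + file_len, mss)


if __name__ == "__main__":
    header, size, chunk = 250, 48 * 1024, 8 * 1024
    a = packets_chunked(header, size, chunk)
    b = packets_coalesced(header, size)
    print(a, b, f"{100 * (a - b) / a:.1f}% fewer data packets")
```

Under those assumptions this gives 37 packets for the chunked case and
34 for the coalesced one, matching the counts quoted above. It says
nothing about whether the difference is measurable on real hardware.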

BTW: I'm still wary of the initial fault on the file data, if it's
not already in cache: arguably, it's better to start sending the
headers, and avoid the startup latency of delaying sending the
headers until the fault is satisfied: part of the thing that's going
to be eating your PCI bandwidth is the disk I/O, and your disks are
going to be the slowest data sources/sinks in the whole equation.

> > Really: it's in the noise. Unless you are paying by packet
> > count, you probably shouldn't care.
>
> So do you consider that IP fragmentation is the good thing ?

Depends; can I go end-to-end without any fragmentation at all, or
am I required to use frags to get packets through at all? If I have
to use frags to get packets through, fragged data is *much* better
than no data. 8-) 8-).

In any case, I expect that this should be handled in the context of
TCP_NODELAY and TCP_NOPUSH, rather than by adding options to work
around an arguably broken sendfile(2).

-- 
Terry