Date: Tue, 27 May 2003 22:54:53 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Igor Sysoev <is@rambler-co.ru>
Cc: arch@freebsd.org
Subject: Re: sendfile(2) SF_NOPUSH flag proposal
Message-ID: <3ED44F2D.DAF1FA08@mindspring.com>
References: <Pine.BSF.4.21.0305272137250.49494-100000@is>

Igor Sysoev wrote:
> How do suppose to coalesce the file pages ? Wire two or more pages
> to mbuf's at once ?

It's done by the network driver, using the network card's
scatter/gather DMA.

> Terry, I do not understand you.
> My argument is simple - I want to avoid the partial packets because it
> decreases the number of packets. That's all. There's nothing about
> amortized cost or total cost. I do not even know what they are.

The total cost is the total overhead, in packets, of sending a given
amount of data. For a small amount of data, the total cost is small
compared to the overhead involved in sending the Ethernet, IP, and
TCP headers.

The amortized cost is how much an extra packet costs you to send,
relative to what you have to send anyway. If you have a lot of data
to send, sending an extra packet or two is really not very costly,
since it's just one more packet out of hundreds.

If you argue there's a tiny amount of data, then the total cost is
important. If you argue there's a lot of data, then the amortized
cost is important.

When you talk about extra packets being sent, you can't claim that
the amortized cost is important for a small amount of data, or that
the total cost is important for a huge amount of data.

Your focus on the number of packets, rather than on your ability to
move a total amount of data at or near the theoretical maximum, makes
no sense.

> > Actually, in this case, I'd just try to fix sendfile(2) to
> > do the packet coalescing I'd expect, given the relative
> > state of the TCP_NODELAY and TCP_NOPUSH options flags.
>
> Actually, sendfile() already works according to TCP_NOPUSH flag.
> I do not know about TCP_NODELAY - I do not work with it.
> But if you turn TCP_NOPUSH on then sendfile() will send the full packets.
> If you turn TCP_NOPUSH off then sendfile() will send some packets partially
> filled. It's correct.

Sending some packets partially filled, instead of just the last
packet in a series partially filled, is *wrong*, IMO.
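
A rough sketch of the arithmetic behind the points above, in Python.
The numbers are assumptions chosen for illustration: a 1460-byte MSS,
54 bytes of Ethernet, IP, and TCP headers per packet (no options), and
4096-byte file pages. The per-page model assumes each page is
packetized on its own, which is a simplification of the partial-packet
behavior under discussion, not a description of what sendfile(2)
actually does. All function names are hypothetical.

```python
MSS = 1460          # assumed TCP payload bytes in a full packet
HEADER_BYTES = 54   # assumed Ethernet (14) + IP (20) + TCP (20) headers
PAGE = 4096         # assumed size of one file page


def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b


def packets_coalesced(nbytes, mss=MSS):
    """Packets needed when only the last packet may be partially filled."""
    return ceil_div(nbytes, mss)


def packets_per_page(nbytes, page=PAGE, mss=MSS):
    """Packets needed if every page is packetized separately (assumed model)."""
    full_pages, tail = divmod(nbytes, page)
    return full_pages * ceil_div(page, mss) + ceil_div(tail, mss)


def header_fraction(npackets, nbytes, header=HEADER_BYTES):
    """Fraction of the bytes on the wire that are headers."""
    header_total = npackets * header
    return header_total / (nbytes + header_total)


if __name__ == "__main__":
    for size in (100, 1_000_000):
        full = packets_coalesced(size)
        partial = packets_per_page(size)
        print(
            f"{size} bytes: {full} packets coalesced, "
            f"{partial} packets per-page, "
            f"header share {header_fraction(full, size):.3f} "
            f"vs {header_fraction(partial, size):.3f}"
        )
```

Under these assumptions a 100-byte send is dominated by header
overhead whichever way it is packetized, while for a large file the
extra partially filled packets change the header share by only a
fraction of a percentage point.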

> > BTW: I'm still wary of the initial fault on the file data, if
> > it's not already in cache: arguably, it's better to start
> > sending the headers, and avoid the startup latency of delaying
> > sending the headers until the fault is satisfied: part of the
> > thing that's going to be eating your PCI bandwidth is the
> > disk I/O, and your disks are going to be the slowest data
> > sources/sinks in the whole equation.
>
> I agree but after all it's 20ms or so delay.

Plus the delay for the NETISR.

> > In any case, I expect that this should be handled in the
> > context of TCP_NODELAY and TCP_NOPUSH, rather than by adding
> > options to work around an arguably broken sendfile(2).
>
> sendfile() already works nice with TCP_NOPUSH. I propose only the flags
> that allow to turn TCP_NOPUSH (actually TF_NOPUSH) on/off inside sendfile().
> Then in one syscall you can turn TCP_NOPUSH on, send the HTTP header, the file
> pages and turn TCP_NOPUSH off if all file pages are wired to mbuf's.
> And this TCP_NOPUSH state is not bound by sendfile() internals, you
> can control it via setsockopt/getsockopt(TCP_NOPUSH).

You're wrong about what TCP_NOPUSH is for; it's only for letting the
last packet of one system call be concatenated with the first packet
of the next, to avoid sending a partially filled packet between
separate system calls. When you call sendfile with a file, headers,
and trailers, you are making *only one system call*.

"man 4 tcp" tells us:

     TCP_NOPUSH   By convention, the sender-TCP will set the ``push''
                  bit and begin transmission immediately (if permitted)
                  at the end of every user call to write(2) or
                  writev(2).  The TCP_NOPUSH option is provided to
                  allow servers to easily make use of Transaction TCP
                  (see ttcp(4)).  When the option is set to a non-zero
                  value, TCP will delay sending any data at all until
                  either the socket is closed, or the internal send
                  buffer is filled.
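
To make the "only one system call" point concrete, here is a minimal
Python sketch of a gather write. It is an analogy, not sendfile(2)
itself: it uses socket.sendmsg(), the writev(2)-style call that Python
exposes, over a local socketpair, so no TCP options are involved and
it says nothing about how TCP_NOPUSH behaves. The helper names and the
sample header, body, and trailer are hypothetical.

```python
import socket


def send_gathered(sock, buffers):
    """Hand several discrete buffers to the kernel in one sendmsg() call.

    Returns the number of bytes the kernel accepted.
    """
    return sock.sendmsg(buffers)


def recv_exactly(sock, nbytes):
    """Read until nbytes have arrived or the peer closes the connection."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


header = b"HTTP/1.0 200 OK\r\n\r\n"   # stand-in for the sendfile headers
body = b"file contents"               # stand-in for the file pages
trailer = b"\r\n"                     # stand-in for the sendfile trailers

left, right = socket.socketpair()
try:
    sent = send_gathered(left, [header, body, trailer])
    received = recv_exactly(right, sent)
finally:
    left.close()
    right.close()

print(sent, received == header + body + trailer)
```

The three buffers reach the receiver as one contiguous byte stream
even though they were separate in user space, which is the property
the header, file, and trailer arguments of sendfile(2) share with
writev(2).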

FWIW, here's what it tells us about TCP_NODELAY:

     TCP_NODELAY  Under most circumstances, TCP sends data when it is
                  presented; when outstanding data has not yet been
                  acknowledged, it gathers small amounts of output to
                  be sent in a single packet once an acknowledgement
                  is received.  For a small number of clients, such as
                  window systems that send a stream of mouse events
                  which receive no replies, this packetization may
                  cause significant delays.  The boolean option
                  TCP_NODELAY defeats this algorithm.

IMO, sendfile(2) should be acting the way you want it to act simply
by your *NOT* setting TCP_NODELAY.

If you *do* set TCP_NOPUSH, then it should delay sending the last
partial packet until the timer fires, or until you write(2),
writev(2), sendfile(2), or send/sendto/sendmsg(2) more data.

NOTE: TCP_NOPUSH *specifically* mentions writev(2), which, like
sendfile(2), takes data from multiple discrete buffers and sends it.

Make sense now? You think sendfile(2) needs options; I think
sendfile(2) is broken.

-- Terry