From owner-freebsd-net@FreeBSD.ORG  Wed Dec  5 12:32:59 2012
Date: Wed, 5 Dec 2012 04:32:57 -0800 (PST)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: Andre Oppermann, Bruce Evans
Cc: freebsd-net@FreeBSD.org, Adrian Chadd, John Baldwin
Subject: Re: Latency issues with buf_ring
Message-ID: <1354710777.97879.YahooMailClassic@web121602.mail.ne1.yahoo.com>
In-Reply-To: <20121205112511.Q932@besplex.bde.org>
List-Id: Networking and TCP/IP with FreeBSD

--- On Tue, 12/4/12, Bruce Evans wrote:

> From: Bruce Evans
> Subject: Re: Latency issues with buf_ring
> To: Andre Oppermann
> Cc: Adrian Chadd, Barney Cordoba, John Baldwin, freebsd-net@FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
>
> On Tue, 4 Dec 2012, Andre Oppermann wrote:
>
> > For most if not all ethernet drivers from 100Mbit/s the TX DMA rings
> > are so large that buffering at the IFQ level doesn't make sense anymore
> > and only adds latency.
>
> I found sort of the opposite for bge at 1Gbps.  Most or all bge NICs
> have a tx ring size of 512.  The ifq length is the tx ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4, 10000) to
> maximize pps.  This does bad things to latency and worse things to
> caching (512 buffers might fit in the L2 cache, but 10000 buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
>
> > So it could simply directly put everything into
> > the TX DMA and not even try to soft-queue.  If the TX DMA ring is full
> > ENOBUFS is returned instead of filling yet another queue.
>
> That could work, but upper layers currently don't understand ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries is not many,
> so even if upper layers understood ENOBUFS it is not easy for them to
> _always_ respond fast enough to keep the tx active, unless there are
> upstream buffers with many more than 512 entries.  There needs to be
> enough buffering somewhere so that the tx ring can be replenished
> almost instantly from the buffer, to handle the worst-case latency
> for the threads generating new (unbuffered) packets.  At the line rate
> of ~1.5 Mpps for 1 Gbps, the maximum latency that can be covered by
> 512 entries is only 340 usec.
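
Just to sanity-check those figures (my arithmetic, not anything from Bruce's
mail or the bge code): a minimal ethernet frame occupies 84 byte times on the
wire (64-byte frame plus 8-byte preamble plus 12-byte inter-frame gap), so
1 Gbps works out to roughly 1.49 Mpps, and a full 512-entry ring drains in
about 344 usec:

    /* Back-of-envelope check of the ~1.5 Mpps and ~340 usec figures. */
    #include <stdio.h>

    int
    main(void)
    {
        double ns_per_pkt = 84 * 8;      /* 84 wire bytes = 672 bit times = 672 ns at 1 Gbps */
        double pps = 1e9 / ns_per_pkt;   /* ~1.49 Mpps line rate for minimal frames */
        double drain_us = 512 * ns_per_pkt / 1000;   /* time to empty a full 512-entry tx ring */

        printf("%.2f Mpps, %.0f usec to drain 512 descriptors\n",
            pps / 1e6, drain_us);
        return (0);
    }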

> > However there
> > are ALTQ interactions and other mechanisms which have to be considered
> > too making it a bit more involved.
>
> I didn't try to handle ALTQ or even optimize for TCP.
>
> More details: to maximize pps, the main detail is to ensure that the tx
> ring never becomes empty.  The tx then transmits as fast as possible.
> This requires some watermark processing, but FreeBSD has almost none
> for tx rings.  The following normally happens for packet generators
> like ttcp and netsend:
>
> - loop calling send() or sendto() until the tx ring (and also any
>   upstream buffers) fill up.  Then ENOBUFS is returned.
>
> - watermark processing is broken in the user API at this point.  There
>   is no way for the application to wait for the ENOBUFS condition to
>   go away (select() and poll() don't work).  Applications use poor
>   workarounds:
>
> - old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.  This
>   was barely good enough for 1 Mbps ethernet (line rate ~1500 pps is 27
>   per 18 msec, so IFQ_MAXLEN = 50 combined with just a 1-entry tx ring
>   provides a safety factor of about 2).  Expansion of the tx ring size to
>   512 makes this work with 10 Mbps ethernet too.  Expansion of the ifq
>   to 511 gives another factor of 2.  After losing the safety factor of 2,
>   we can now handle 40 Mbps ethernet, and are only a factor of 25 short
>   for 1 Gbps.  My hardware can't do line rate for small packets -- it
>   can only do 640 kpps.  Thus ttcp is only a factor of 11 short of
>   supporting the hardware at 1 Gbps.
>
>   This assumes that sleeps of 18 msec are actually possible, which
>   they aren't with HZ = 100 giving a granularity of 10 msec, so that
>   sleep(18 msec) actually sleeps for an average of 23 msec.  -current
>   uses the bad default of HZ = 1000.  With that, sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep for more like 1
>   msec if that is possible.  Then the average sleep is 1.5 msec.  ttcp
>   can keep up with the hardware with that, and is only slightly behind
>   the hardware with the worst-case sleep of 2 msec (512+511 packets
>   generated every 2 msec is 511.5 kpps).
>
>   I normally use old ttcp, except I modify it to sleep for 1 msec instead
>   of 18 in one version, and in another version I remove the sleep so that
>   it busy-waits in a loop that calls send() which almost always returns
>   ENOBUFS.  The latter wastes a lot of CPU, but is almost good enough
>   for throughput testing.
>
> - newer ttcp tries to program the sleep time in microseconds.  This doesn't
>   really work, since the sleep granularity is normally at least a millisecond,
>   and even if it could be the 340 microseconds needed by bge with no ifq
>   (see above, and better divide the 340 by 2), then this is quite short
>   and would take almost as much CPU as busy-waiting.  I consider HZ = 1000
>   to be another form of polling/busy-waiting and don't use it except for
>   testing.
>
> - netrate/netsend also uses a programmed sleep time.  This doesn't really
>   work, as above.  netsend also tries to limit its rate based on sleeping.
>   This is further from working, since even finer-grained sleeps are needed
>   to limit the rate accurately than to keep up with the maximum rate.
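
For concreteness, the userland workaround being described amounts to a loop
like the following sketch (mine, not ttcp source; the 18 msec constant is old
ttcp's, and as noted the effective sleep is bounded below by the tick
granularity):

    /* Sketch of the old-ttcp-style ENOBUFS workaround: since select()/poll()
     * cannot wait for tx-ring space, just back off with a fixed sleep. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>
    #include <unistd.h>

    static void
    blast(int s, const void *pkt, size_t len)
    {
        for (;;) {
            if (send(s, pkt, len, 0) == -1) {
                if (errno == ENOBUFS) {
                    /* Old ttcp slept 18 msec here; with HZ = 100 that is
                     * really ~23 msec on average.  ~1 msec keeps up with
                     * 1 Gbps hardware much better, if the tick allows it. */
                    usleep(18000);
                    continue;
                }
                break;      /* real error */
            }
        }
    }

The busy-wait variant Bruce mentions simply drops the usleep() and eats the
CPU instead.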

> Watermark processing at the kernel level is not quite as broken.  It
> is mostly non-existent, but partly works, sort of accidentally.  The
> difference is now that there is a tx "eof" or "completion" interrupt
> which indicates the condition corresponding to the ENOBUFS condition
> going away, so that the kernel doesn't have to poll for this.  This
> is not really an "eof" interrupt (unless bge is programmed insanely,
> to interrupt only after the tx ring is completely empty).  It acts as
> primitive watermarking.  bge can be programmed to interrupt after
> having sent every N packets (strictly, after every N buffer descriptors,
> but for small packets these are the same).  When there are more than
> N packets to start, say M, this acts as a watermark at M-N packets.
> bge is normally misprogrammed with N = 10.  At the line rate of 1.5 Mpps,
> this asks for an interrupt rate of 150 kHz, which is far too high and
> is usually unreachable, so reaching the line rate is impossible due to
> the CPU load from the interrupts.  I use N = 384 or 256 so that the
> interrupt rate is not the dominant limit.  However, N = 10 is better
> for latency and works under light loads.  It also reduces the amount
> of buffering needed.
>
> The ifq works more as part of accidental watermarking than as a buffer.
> It is the same size as the tx ring (actually 1 smaller for bogus reasons),
> so it is not really useful as a buffer.  However, with no explicit
> watermarking, any separate buffer like the ifq provides a sort of
> watermark at the boundary between the buffers.  The usefulness of this
> would be most obvious if the tx "eof" interrupt were actually for eof
> (perhaps that is what it was originally).  Then on the eof interrupt,
> there is no time at all to generate new packets, and the time when the
> tx is idle can be minimized by keeping pre-generated packets handy where
> they can be copied to the tx ring at tx "eof" interrupt time.  A buffer
> of about the same size as the tx ring (or maybe 1/4 the size) is enough
> for this.
>
> OTOH, with bge misprogrammed to interrupt after every 10 tx packets, the
> ifq is useless for its watermark purposes.  The watermark is effectively
> in the tx ring, and very strangely placed there at 10 below the top
> (ring full).  Normally tx watermarks are placed near the bottom (ring
> empty).  They must not be placed too near the bottom, else there would
> not be enough time to replenish the ring between the time when the "eof"
> (really, the "watermark") interrupt is received and when the tx runs
> dry.  They should not be placed too near the top like they are in -current's
> bge, else the point of having a large tx ring is defeated and there are
> too many interrupts.  However, when they are placed near the top, latency
> requirements are reduced.
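
To make the watermark idea concrete, explicit handling in a driver's
tx-completion path might look roughly like this (all names are invented for
illustration, this is not bge code; the watermark placement it assumes is the
one Bruce settles on below):

    /* Hypothetical tx-completion handler with an explicit low watermark.
     * None of these names come from a real driver. */
    #define FOO_TX_RING_CNT   512
    #define FOO_TX_LO_WATER   (FOO_TX_RING_CNT / 2)   /* middle of the ring */

    struct foo_softc {
        int foo_tx_free;              /* free tx descriptors */
        /* ... */
    };

    int  foo_collect_completed(struct foo_softc *);   /* reclaim finished descriptors */
    void foo_refill_from_ifq(struct foo_softc *);     /* move packets from the soft queue */

    static void
    foo_txeof(struct foo_softc *sc)
    {
        sc->foo_tx_free += foo_collect_completed(sc);

        /* Refill only once occupancy has fallen to the low watermark.
         * With the watermark in the middle, ~256 descriptors (~170 usec
         * at 1 Gbps line rate) of slack remain before the ring runs dry,
         * and the completion interrupt can be coalesced to match instead
         * of firing every 10 packets. */
        if (FOO_TX_RING_CNT - sc->foo_tx_free <= FOO_TX_LO_WATER)
            foo_refill_from_ifq(sc);
    }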

> I recently worked on buffering for sio and noticed similar related
> problems for tx watermarks.  Don't laugh -- serial i/o 1 character at
> a time at 3.686400 Mbps has much the same timing requirements as
> ethernet i/o 1 packet at a time at 1 Gbps.  Each serial character
> takes ~2.7 usec and each minimal ethernet packet takes ~0.67 usec.
> With tx "ring" sizes of 128 and 512 respectively, the ring times for
> full to empty are 347 usec for serial i/o and 341 usec for ethernet i/o.
> Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx.  It consists of
>   just keeping at least 1 entry in the tx ring at all times.  Latency
>   must be kept below ~340 usec to have any chance of this.  This is not
>   so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the packets, so you
>   don't have to worry about latency affecting the generators.
> - the need for watermark processing is better known for rx, since it
>   obviously doesn't work to generate the rx "eof" interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even the
> 2 GHz CPU couldn't keep up with line rate (so from full to empty takes
> 800 usec).
>
> It turned out that the best position for the tx low watermark is about
> 1/4 or 1/2 from the bottom for both sio and bge.  It must be fairly
> high, else the latency requirements are not met.  In the middle is a
> good general position.  Although it apparently "wastes" half of the ring
> to make the latency requirements easier to meet (without very
> system-dependent tuning), the efficiency lost from this is reasonably
> small.
>
> Bruce

I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load-tested doesn't really
illustrate anything.  All of his drivers are functional but optimized for
nothing.

BC