Date: Sat, 20 Mar 2010 12:17:33 -0600
From: Scott Long <scottl@samsco.org>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Alexander Motin <mav@freebsd.org>, FreeBSD-Current <freebsd-current@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: Increasing MAXPHYS
Message-ID: <891E2580-8DE3-4B82-81C4-F2C07735A854@samsco.org>
In-Reply-To: <201003201753.o2KHrH5x003946@apollo.backplane.com>
References: <4BA4E7A9.3070502@FreeBSD.org> <201003201753.o2KHrH5x003946@apollo.backplane.com>
On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>
> :All above I have successfully tested last months with MAXPHYS of 1MB on
> :i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know any issues denying increasing MAXPHYS in HEAD?
> :- are there any specific opinions about value? 512K, 1MB, MD?
> :
> :--
> :Alexander Motin
>
>    (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
>    might hit up against KVM exhaustion issues in unrelated subsystems.
>    nswbuf typically maxes out at around 256.  For i386 1MB is probably
>    too large (256M of reserved KVM is a lot for i386).  On amd64 there
>    shouldn't be a problem.
>

Yes, this needs to be addressed.  I've never gotten a clear answer from
VM people like Peter Wemm and Alan Cox on what should be done.

>    Diminishing returns get hit pretty quickly with larger MAXPHYS values.
>    As long as the I/O can be pipelined the reduced transaction rate
>    becomes less interesting when the transaction rate is less than a
>    certain level.  Off the cuff I'd say 2000 tps is a good basis for
>    considering whether it is an issue or not.  256K is actually quite
>    a reasonable value.  Even 128K is reasonable.
>

I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
I even added some hooks into CAM to support this, and I thought that I had
discussed this extensively with Alexander at the time.  Guess it was yet
another wasted conversation with him =-(  I'll repeat it here for the record.

What I call the silly-i/o-test, filling a disk up with the dd command,
yields performance improvements up to a MAXPHYS of 512K.  Beyond that the
gain is negligible, and the test actually starts running into contention
on the VM page queues lock.  There is some work to break down this lock,
so it's worth revisiting in the future.
For the non-silly-i/o-test, where I do real file i/o using various sequential
and random patterns, there was a modest improvement up to 256K, and a slight
improvement up to 512K.  This surprised me, as I figured that most filesystem
i/o would be in UFS block sized chunks.  Then I realized that the UFS
clustering code was actually taking advantage of the larger I/Os.  The
improvement really depends on the workload, of course, and I wouldn't expect
it to be noticeable for most people unless they're running something like a
media server.

Besides the nswbuf sizing problem, there is a real problem that a lot of
drivers have incorrectly assumed over the years that MAXPHYS and DFLTPHYS
are particular values, and they've sized their data structures accordingly.
Before these values are changed, an audit needs to be done OF EVERY SINGLE
STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS in
the ata driver, testing that your machine boots, and then committing the
change to source control.  Some drivers will have non-obvious restrictions
based on the number of SG elements allowed in a particular command format.
MPT comes to mind (its multi-message SG code seems to be broken when I tried
testing large MAXPHYS on it), but I bet that there are others.

Windows has a MAXPHYS equivalent of 1M.  Linux has an equivalent of an odd
number less than 512K.  For the purpose of benchmarking against these OSes,
having comparable capabilities is essential; Linux easily beats FreeBSD in
the silly-i/o-test because of the MAXPHYS difference (though FreeBSD
typically stomps Linux in real I/O because of vastly better latency and
caching algorithms).

I'm fine with raising MAXPHYS in production once the problems are addressed.

>    Nearly all the issues I've come up against in the last few years have
>    been related more to pipeline algorithms breaking down and less with
>    I/O size.
>    The cluster_read() code is especially vulnerable to
>    algorithmic breakdowns when fast media (such as a SSD) is involved.
>    e.g. I/Os queued from the previous cluster op can create stall
>    conditions in subsequent cluster ops before they can issue new I/Os
>    to keep the pipeline hot.
>

Yes, this is another very good point.  It's time to start really figuring
out what SSD means for FreeBSD I/O.

Scott