Date: Sat, 29 Jun 2013 05:35:32 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Alexander Motin <mav@FreeBSD.org>
Cc: Adrian Chadd <adrian@freebsd.org>, hackers@freebsd.org
Subject: Re: b_freelist TAILQ/SLIST
Message-ID: <20130629023532.GW91021@kib.kiev.ua>
In-Reply-To: <51CE0AF7.6090906@FreeBSD.org>
References: <51CCAE14.6040504@FreeBSD.org> <20130628065732.GL91021@kib.kiev.ua> <51CE0AF7.6090906@FreeBSD.org>

On Sat, Jun 29, 2013 at 01:15:19AM +0300, Alexander Motin wrote:
> On 28.06.2013 09:57, Konstantin Belousov wrote:
> > On Fri, Jun 28, 2013 at 12:26:44AM +0300, Alexander Motin wrote:
> >> While doing some profiles of GEOM/CAM IOPS scalability, on some test
> >> patterns I've noticed serious congestion with spinning on global
> >> pbuf_mtx mutex inside getpbuf() and relpbuf(). Since that code is
> >> already very simple, I've tried to optimize probably the only thing
> >> possible there: switch bswlist from TAILQ to SLIST. As I can see,
> >> b_freelist field of struct buf is really used as TAILQ in some other
> >> places, so I've just added another SLIST_ENTRY field. And result
> >> appeared to be surprising -- I can no longer reproduce the issue at all.
> >> May be it was just unlucky synchronization of specific test, but I've
> >> seen in on two different systems and rechecked results with/without
> >> patch three times.
> > This is too unbelievable. Could it be, e.g. some cache line conflicts
> > which cause the trashing, in fact ?
>
> I think it indeed may be a cache trashing. I've made some profiling for
> getpbuf()/relpbuf() and found interesting results. With patched kernel
> using SLIST profiling shows mostly one point of RESOURCE_STALLS.ANY in
> relpbuf() -- first lock acquisition causes 78% of them. Later memory
> accesses including the lock release are hitting the same cache line and
> almost free. With "clean" kernel using TAILQ I see RESOURCE_STALLS.ANY
> spread almost equally between lock acquisition, bswlist access and lock
> release. It looks like the cache line is constantly erased by something.
>
> My guess was that patch somehow changed cache line sharing. But several
> checks with nm shown that, while memory allocation indeed changed
> slightly, in both cases content of the cache line in question is
> absolutely the same, just shifted in memory by 128 bytes.
>
> I guess the cache line could be trashed by threads doing adaptive
> spinning on lock after collision happened. That trashing increases lock
> hold time and even more increases chance of additional collisions. May
> be switch from TAILQ to SLIST slightly reduces lock hold time, reducing
> chance of cumulative effect. The difference is not big, but in this test
> this global lock acquired 1.5M times per second by 256 threads on 24
> CPUs (12xL2 and 2xL3 caches).
>
> Another guess was that we have some bad case of false cache line
> sharing, but I don't know how that can be either checked or avoided.
>
> At the last moment mostly for luck I've tried to switch pbuf_mtx from
> mtx to mtx_padalign on "clean" kernel. For my surprise that also seems
> fixed the congestion problem, but I can't explain why.
> RESOURCE_STALLS.ANY still show there is cache trashing, but the lock
> spinning has gone.
>
> Any ideas about what is going on there?

FWIW, Jeff just changed the pbuf_mtx allocation to use padalign; it is a
somewhat unrelated change in r252330.

Are pbuf_mtx and bswlist located next to each other in your kernel?

If yes, then I would expect that the explanation is how the MESI
protocol and atomics work. When performing a locked op, the CPU takes
the whole cache line into exclusive ownership. Since our locks
try the cmpset as the first operation, and then run the 'adaptive' loop
interleaving cmpset and a check of the ownership, false cache line
sharing between pbuf_mtx and bswlist should result in exactly such
effects. Different cores should bounce the ownership of the cache
line, slowing down the accesses.

AFAIR Intel exports some MESI events as performance counters, but I
was not able to find a reference in the architecturally defined
events. They seem to be present in the model-specific part.
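
To make the false-sharing question concrete, here is a minimal user-space
sketch, not kernel code. It assumes a 64-byte cache line, and fake_mtx,
fake_list_head, the two layout structs and the helper functions are all
hypothetical stand-ins for pbuf_mtx and bswlist, not the real definitions.
It only shows how a lock word and a list head can end up on one line, and
how padding (similar in spirit to mtx_padalign) separates them.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte lines, typical for x86; not taken from the thread. */
#define LINE_SIZE 64

/* Hypothetical stand-ins for pbuf_mtx and the bswlist head. */
struct fake_mtx { _Atomic uintptr_t owner; };
struct fake_list_head { void *first; };

/* Lock word and list head packed together: both land on one line. */
struct shared_layout {
	alignas(LINE_SIZE) struct fake_mtx mtx;
	struct fake_list_head head;
};

/* Each member starts its own line, similar in spirit to mtx_padalign. */
struct padded_layout {
	alignas(LINE_SIZE) struct fake_mtx mtx;
	alignas(LINE_SIZE) struct fake_list_head head;
};

static_assert(sizeof(struct padded_layout) == 2 * LINE_SIZE,
    "padded layout should span exactly two lines");

static struct shared_layout shared_example;
static struct padded_layout padded_example;

/* True if both addresses fall within the same LINE_SIZE-aligned block. */
static bool
same_cache_line(const void *a, const void *b)
{
	return ((uintptr_t)a / LINE_SIZE == (uintptr_t)b / LINE_SIZE);
}

/*
 * One cmpset attempt. The compare-and-swap is the locked op that pulls
 * the whole line into exclusive ownership, whether or not it succeeds.
 */
static bool
fake_trylock(struct fake_mtx *m, uintptr_t tid)
{
	uintptr_t unowned = 0;

	return (atomic_compare_exchange_strong(&m->owner, &unowned, tid));
}

static void
fake_unlock(struct fake_mtx *m)
{
	atomic_store(&m->owner, 0);
}
```

With the shared layout, every failed fake_trylock() from a spinning thread
would also take away the line that holds the list head from the lock owner;
with the padded layout the list head sits on a different line. Checking the
real symbols would still need nm on the actual kernel.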