Date: Sat, 3 Mar 2018 14:48:06 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Mateusz Guzik <mjg@freebsd.org> Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r330294 - head/sys/kern Message-ID: <20180303135130.L1442@besplex.bde.org> In-Reply-To: <201803022126.w22LQRKH034839@repo.freebsd.org> References: <201803022126.w22LQRKH034839@repo.freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 2 Mar 2018, Mateusz Guzik wrote: > Log: > sx: fix adaptive spinning broken in r327397 > > The condition was flipped. > > In particular heavy multithreaded kernel builds on zfs started suffering > due to nested sx locks. > > For instance make -s -j 128 buildkernel: > > before: 3326.67s user 1269.62s system 6981% cpu 1:05.84 total > after: 3365.55s user 911.27s system 6871% cpu 1:02.24 total What are you doing to make builds so slow? Kernel builds here take about 5 seconds. A full world takes about 120 seconds. This is with old (~5.2) worlds. (I don't like newer worlds since they are so much slower. Yesterday I did more than 200 makeworlds looking for a lost wakeup for idle threads. When makeworld has a sys time of 50-100 seconds, it is hard to see the effect of optimizations in the noise. I recently rewrote the statistics reporting to show real, user, sys, interrupt, total = sum of these and idle times for the whole system. Based on kern.cp_time. This is innaccurate, but it is much better than rusage times which can't show times for other threads or even interrupt and idle times for threads in the build's tree. Results look like: 128.57r 566.24u 77.10s 0.83i 644.17t 384.36I 408X i386-12-cold-102-40 168.53r 559.01u 76.95s 1.07i 637.03t 711.19I 735X i386-12-cold-102-40 Here I is the idle time according to kern.cp_time and X is the idle time according to td_runtime. The latter is much more accurate and shows that there is an error of about 24 seconds = 24 * (stathz = 127) = 3048 ticks in the u+s+i vs I ticks (u+s+i is too high by 24 seconds or 3 seconds per CPU, and I is too low by 24 seconds). The second run shows an enormous variance in the idle time -- of 327 seconds or 41 seconds per CPU. This pessimization was mostly the result of enabling PREEMPTION and IPI_PREEMPTION, which was supposed to decrease the variance. Normally I turn these off to reduce context switches, and this somehow reduces the variance to more like 5 seconds per CPU. Context switches turned out to be mostly for waking up idle threads and this should not be affected by PREEMPTION, and my optimization of turning off PREEMPTION makes little difference. That it helps on average and for the best case was determined by running many more than a few hundred makeworlds. Increasing kern.sched.quantum makes similarly little difference. I don't normally use that since it was a pessimization on average. Now I think I understand why it is a pessimization: it saves a little u+s+i time by reducing context switches, but increases I time by reducing compensation for the missing wakeups. These benchmarks are with 4BSD. ULE is slightly slower and has lower variance but its variance is high enough to indicate missed wakeups. I can reduce the variance using spinloops, so that idle threads never sleep. This only works with HTT turned off (the above is wirh 4 cores times 2 threads). With HTT on, spinloops in idle threads slow down running threads by abiut 50%. ULE is aware of this and avoids internal spinloops for HTT cases. 4BSD and lower levels are not aware of this, but it takes the misconfigurations machdep.idle_mwait=0 and machdep.idle= spin to reach the spinloop case. Some times for ULE: 127.06r 565.81u 79.40s 0.65i 645.86t 370.58I 394X i386-12-cold-102-40 125.83r 566.72u 79.10s 0.61i 646.43t 360.24I 382X i386-12-warm-102-40 130.33r 565.56u 79.41s 0.65i 645.62t 397.02I 417X i386-12-cold-102-40 118.70r 566.55u 80.46s 0.54i 647.56t 302.05I 325X i386-12-warm-102-40 127.09r 569.36u 77.07s 0.98i 647.41t 369.28I 390X i386-12-cold-102-40 115.80r 570.67u 79.17s 0.63i 650.46t 275.90I 297X i386-12-warm-102-40 115.80r is the best time seen recently. The u+s+i = t times average a couple of seconds higher for ULE. This is not quite in the noise -- 2 seconds for 8 CPUs is 0.25 seconds in real time. The noise in the I time dominates the signal in the u+s+i = t times, so it isn't clear if ULE is doing anything better or worse. If the u+s+i times can be trusted, then ULE's affinity optimizations are apparently negative. I have some optimizations in 4BSD too. These might be negative too. One seemed to work, but only on 1 system, and recent benchmarks indicated that this is just because it accidentally reduced lost wakeups. Its improvements for u+s+i are in the noise. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180303135130.L1442>