Date:      Sat, 3 Mar 2018 14:48:06 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Mateusz Guzik <mjg@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r330294 - head/sys/kern
Message-ID:  <20180303135130.L1442@besplex.bde.org>
In-Reply-To: <201803022126.w22LQRKH034839@repo.freebsd.org>
References:  <201803022126.w22LQRKH034839@repo.freebsd.org>

On Fri, 2 Mar 2018, Mateusz Guzik wrote:

> Log:
>  sx: fix adaptive spinning broken in r327397
>
>  The condition was flipped.
>
>  In particular heavy multithreaded kernel builds on zfs started suffering
>  due to nested sx locks.
>
>  For instance make -s -j 128 buildkernel:
>
>  before: 3326.67s user 1269.62s system 6981% cpu 1:05.84 total
>  after: 3365.55s user 911.27s system 6871% cpu 1:02.24 total
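
The "flipped" condition is easiest to see in a sketch.  This is a
hypothetical reconstruction, not the actual sx(9) code or the r330294
diff: an adaptive spin loop should keep spinning while the lock owner is
running on another CPU, and block once the owner sleeps; inverting the
test does the opposite.

    /*
     * Hypothetical sketch only.  lock_owner() is an invented helper;
     * TD_IS_RUNNING() and cpu_spinwait() are the real kernel ones.
     */
    struct thread *owner;

    for (;;) {
        owner = lock_owner(lk);         /* hypothetical helper */
        if (owner == NULL || !TD_IS_RUNNING(owner))
            break;                      /* owner off CPU: go to sleep */
        cpu_spinwait();                 /* owner on CPU: keep spinning */
    }

A flipped test -- e.g. "if (TD_IS_RUNNING(owner)) break;" -- stops
spinning exactly when it is profitable (owner on CPU) and spins when it
is not (owner asleep), so contended acquires block where they should
spin, as in the nested-sx slowdown described above.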

What are you doing to make builds so slow?  Kernel builds here take about
5 seconds.  A full world takes about 120 seconds.  This is with old (~5.2)
worlds.  (I don't like newer worlds since they are so much slower.)

Yesterday I did more than 200 makeworlds looking for a lost wakeup for
idle threads.  When makeworld has a sys time of 50-100 seconds, it is
hard to see the effect of optimizations in the noise.  I recently rewrote
the statistics reporting to show real, user, sys, and interrupt times,
total = sum of user, sys, and interrupt, and idle times for the whole
system, based on kern.cp_time.  This is inaccurate, but it is much better
than rusage times, which can't show times for other threads, or even
interrupt and idle times for threads in the build's tree.  Results look
like:

  128.57r  566.24u  77.10s  0.83i  644.17t 384.36I 408X i386-12-cold-102-40
  168.53r  559.01u  76.95s  1.07i  637.03t 711.19I 735X i386-12-cold-102-40

Here I is the idle time according to kern.cp_time and X is the idle time
according to td_runtime.  The latter is much more accurate and shows that
there is an error of about 24 seconds (24 * (stathz = 127) = 3048 ticks)
between the u+s+i ticks and the I ticks: u+s+i is too high by 24 seconds,
or 3 seconds per CPU, and I is too low by 24 seconds.
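
For reference, whole-system times like these can be computed from a pair
of kern.cp_time samples.  A minimal userland sketch, assuming the
aggregate (not per-CPU) reading and dividing statclock ticks by stathz
from kern.clockrate; the workload placeholder is illustrative:

    #include <sys/types.h>
    #include <sys/resource.h>   /* CPUSTATES, CP_USER, ..., CP_IDLE */
    #include <sys/sysctl.h>
    #include <sys/time.h>       /* struct clockinfo */
    #include <err.h>
    #include <stdio.h>

    static void
    sample(long t[CPUSTATES])
    {
        size_t len = sizeof(long) * CPUSTATES;

        if (sysctlbyname("kern.cp_time", t, &len, NULL, 0) == -1)
            err(1, "kern.cp_time");
    }

    int
    main(void)
    {
        struct clockinfo ci;
        size_t cilen = sizeof(ci);
        long a[CPUSTATES], b[CPUSTATES];

        if (sysctlbyname("kern.clockrate", &ci, &cilen, NULL, 0) == -1)
            err(1, "kern.clockrate");
        sample(a);
        /* ... run the workload here, e.g. fork/exec the build ... */
        sample(b);
        /* Each bucket counts statclock ticks summed over all CPUs. */
        printf("%.2fu %.2fs %.2fi %.2fI\n",
            (double)(b[CP_USER] + b[CP_NICE] -
                a[CP_USER] - a[CP_NICE]) / ci.stathz,
            (double)(b[CP_SYS] - a[CP_SYS]) / ci.stathz,
            (double)(b[CP_INTR] - a[CP_INTR]) / ci.stathz,
            (double)(b[CP_IDLE] - a[CP_IDLE]) / ci.stathz);
        return (0);
    }

Since the buckets are summed over all CPUs, a 24-second total error like
the one above is only about 3 seconds per CPU on 8 CPUs.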

The second run shows an enormous variance in the idle time -- of 327
seconds, or 41 seconds per CPU.  This pessimization was mostly the result
of enabling PREEMPTION and IPI_PREEMPTION, which were supposed to decrease
the variance.  Normally I turn these off to reduce context switches, and
this somehow reduces the variance to more like 5 seconds per CPU.  Context
switches turned out to be mostly for waking up idle threads, which should
not be affected by PREEMPTION, so my optimization of turning off
PREEMPTION makes little difference.  That it helps on average and in the
best case was determined by running many more than a few hundred
makeworlds.  Increasing kern.sched.quantum makes similarly little
difference.  I don't normally use that since it was a pessimization on
average.  Now I think I understand why it is a pessimization: it saves a
little u+s+i time by reducing context switches, but increases I time by
reducing compensation for the missing wakeups.
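
For anyone reproducing this: PREEMPTION and IPI_PREEMPTION are kernel
config options and kern.sched.quantum is a runtime sysctl.  A hypothetical
config fragment for the "turned off" kernels (nooptions only removes
options that the included config actually sets):

    include         GENERIC
    ident           BENCH-NOPREEMPT
    nooptions       PREEMPTION      # GENERIC sets this
    nooptions       IPI_PREEMPTION  # only if your config added it

kern.sched.quantum can then be adjusted at run time with sysctl(8).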

These benchmarks are with 4BSD.  ULE is slightly slower and has lower
variance, but its variance is still high enough to indicate missed
wakeups.

I can reduce the variance using spinloops, so that idle threads never
sleep.  This only works with HTT turned off (the above is with 4 cores
times 2 threads).  With HTT on, spinloops in idle threads slow down
running threads by about 50%.  ULE is aware of this and avoids internal
spinloops for HTT cases.  4BSD and lower levels are not aware of this,
but it takes the misconfigurations machdep.idle_mwait=0 and
machdep.idle=spin to reach the spinloop case.
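
The misconfiguration amounts to two stock x86 tunables (the ones named
above); an /etc/sysctl.conf fragment, only sensible with HTT off:

    machdep.idle_mwait=0    # don't use MONITOR/MWAIT while idling
    machdep.idle=spin       # spin instead of hlt/acpi idling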

Some times for ULE:

  127.06r  565.81u  79.40s  0.65i  645.86t 370.58I 394X i386-12-cold-102-40
  125.83r  566.72u  79.10s  0.61i  646.43t 360.24I 382X i386-12-warm-102-40
  130.33r  565.56u  79.41s  0.65i  645.62t 397.02I 417X i386-12-cold-102-40
  118.70r  566.55u  80.46s  0.54i  647.56t 302.05I 325X i386-12-warm-102-40
  127.09r  569.36u  77.07s  0.98i  647.41t 369.28I 390X i386-12-cold-102-40
  115.80r  570.67u  79.17s  0.63i  650.46t 275.90I 297X i386-12-warm-102-40

115.80r is the best time seen recently.

The u+s+i = t times average a couple of seconds higher for ULE.  This is
not quite in the noise -- 2 seconds for 8 CPUs is 0.25 seconds in real time.
The noise in the I time dominates the signal in the u+s+i = t times, so
it isn't clear if ULE is doing anything better or worse.  If the u+s+i
times can be trusted, then ULE's affinity optimizations are apparently
negative.

I have some optimizations in 4BSD too.  These might be negative too.  One
seemed to work, but only on one system, and recent benchmarks indicated
that this was just because it accidentally reduced lost wakeups.  Its
improvements for u+s+i are in the noise.

Bruce


