Date:      Wed, 4 Apr 2012 20:28:53 +0200
From:      Svatopluk Kraus <onwahe@gmail.com>
To:        Adrian Chadd <adrian@freebsd.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, hackers@freebsd.org
Subject:   Re: [vfs] buf_daemon() slows down write() severely on low-speed CPU
Message-ID:  <CAFHCsPU9N5GkWJ6D2s2BsgDAygnFVAzgJNVJzNy=YjMG1HMw2g@mail.gmail.com>
In-Reply-To: <CAJ-VmokHguLUUAhiwS9xkFRMBiiWbUrYRmDDp_PMTXFj3NR4hw@mail.gmail.com>
References:  <CAFHCsPVqNCYj-obQqS4iyKR-xK0AaRJU_6KX=fccEK4U8NaktQ@mail.gmail.com> <20120312181921.GF75778@deviant.kiev.zoral.com.ua> <CAFHCsPWZD065A0su_LJn8Q4RW1pft_DobbsgSph1NNZ=mNXYYw@mail.gmail.com> <20120315112959.GP75778@deviant.kiev.zoral.com.ua> <CAFHCsPXZ7MmaVpfN0H%2BaY5ubvY3L_guN2ugUOZ-wEnhKuJV=tw@mail.gmail.com> <CAJ-VmokHguLUUAhiwS9xkFRMBiiWbUrYRmDDp_PMTXFj3NR4hw@mail.gmail.com>

On Wed, Mar 21, 2012 at 5:55 AM, Adrian Chadd <adrian@freebsd.org> wrote:
> Hi,
>
> I'm interested in this, primarily because I'm tinkering with file
> storage stuff on my little (most wifi targetted) embedded MIPS
> platforms.
>
> So what's the story here? How can I reproduce your issue and do some
> of my own profiling/investigation?
>
>
> Adrian

Hi,

your interest prompted me to do a more solid and comparable
investigation on my embedded ELAN486 platform. Besides collecting more
test results, I did full tracing of the related VFS, filesystem, and
disk function calls. It took some time to understand what the issue
really is about.

My test case:
A single file copy (no O_FSYNC), which means that no other filesystem
operation is served at the same time. The file size must be big enough
with respect to the hidirtybuffers value. Other processes on the
machine where the test was run were almost inactive. The real copy
time was profiled. In all tests the machine was booted, the file was
copied, the file was removed, and the machine was rebooted. Thus, the
file was always copied into the same disk layout.

The motivation is that my embedded machines mostly don't write to the
disk at all. Only during a software update does a single process write
to the disk (file by file). That doesn't have to be a problem at all,
but an update must succeed even under full CPU load. So, the writing
should be tuned so that it doesn't affect other processes too much and
finishes in finite time.

On my embedded ELAN486 machines, flash memory is used as the disk.
That means reading is very fast, but writing is slow. Further, the
flash memory is divided into sectors and only a complete sector can be
erased at once. A sector erasure is a very time-expensive operation.

When I tried to tune the VFS by changing various parameters, I found
out that the real copy time depends on two things. Both of them are
related to bufdaemon, namely its feature of trying to work harder when
its buffer flushing mission is failing. It's no surprise that the best
copy times were achieved when bufdaemon was excluded from buffer
flushing entirely (by setting the VFS parameters accordingly).

This bufdaemon feature brings along (with respect to the real copy time):
1. the bufdaemon runtime itself,
2. very frequent flushing of the filesystem buffers.

What really happens in the test case on my machine:

The copy program uses a buffer for copying. The default buffer size
is 128 KiB in my case. The simplified sys_write() call path for
DTYPE_VNODE files and VREG vnodes is the following:

sys_write()
{
 dofilewrite()
 {
  bwillwrite()
  fo_write() => vn_write()
  {
   bwillwrite()
   vn_lock()
   VOP_WRITE()
   VOP_UNLOCK()
  }
 }
}
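
The bwillwrite() calls are what throttle the writer itself. Roughly,
as I read it (my simplified paraphrase of vfs_bio.c, details may
differ between versions), it does something like this:

/* Simplified paraphrase of bwillwrite(); not the exact kernel code. */
void
bwillwrite(void)
{

	/*
	 * If the system already has too many dirty buffers, wake up
	 * bufdaemon and put the writing process to sleep until the
	 * count drops below hidirtybuffers again.
	 */
	if (numdirtybuffers >= hidirtybuffers) {
		mtx_lock(&nblock);
		while (numdirtybuffers >= hidirtybuffers) {
			bd_wakeup(1);
			needsbuffer |= VFS_BIO_NEED_DIRTYFLUSH;
			msleep(&needsbuffer, &nblock,
			    (PRIBIO + 4) | PCATCH, "flswai", 0);
		}
		mtx_unlock(&nblock);
	}
}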

So, all 128 KiB is written under the VNODE lock. When I recall the
machine defaults:

hidirtybuffers: 134
lodirtybuffers: 67
bufdirtythresh: 120
buffer size (filesystem block size): 512 bytes

and do some simple calculations:

134 * 512 = 68608  -> high water byte count
120 * 512 = 61440  -> bufdirtythresh byte count
 67 * 512 = 34304  -> low water byte count

then it's obvious that bufdaemon has something to do during each
sys_write(): a single 128 KiB write dirties up to 256 buffers, well
above hidirtybuffers. However, almost all of the dirty buffers belong
to the new file VNODE, and that VNODE is locked. What remains are the
filesystem buffers only, i.e., the superblock buffer and the free
block bitmap buffers. So, bufdaemon iterates over the whole dirty
buffer queue, which takes a SIGNIFICANT time on my machine, and most
of the time it does not find any buffer it is able to flush. If
bufdaemon flushes one or two buffers, kern_yield() is called and a new
iteration is started, until no buffer is flushed at all. So, very
often TWO full iterations over the dirty buffer queue are done just to
flush one or two filesystem buffers and still fail to reach the
lodirtybuffers threshold. The bufdaemon runtime grows. Moreover, the
frequent flushing of the filesystem buffers brings along a higher CPU
load (geom down thread, geom up thread, disk thread scheduling) and a
re-ordering of the disk block writes. The correct disk block write
order is important for the flash disk. Further, while the file data
buffers age but are not flushed, the filesystem buffers are flushed
over and over again.
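
To show where the repeated queue iterations and the hz/10 sleeps come
from, this is my simplified paraphrase of the buf_daemon() main loop
(not the exact vfs_bio.c code, details differ between versions):

/* Simplified paraphrase of buf_daemon(); not the exact kernel code. */
static void
buf_daemon(void)
{

	for (;;) {
		/*
		 * Flush until we drop below the low water mark.  Each
		 * buf_do_flush() call walks the dirty buffer queue; on
		 * my machine that walk alone is expensive.
		 */
		while (numdirtybuffers > lodirtybuffers) {
			if (buf_do_flush(NULL) == 0)
				break;	/* nothing could be flushed */
			kern_yield(PRI_USER);
		}

		mtx_lock(&bdlock);
		if (numdirtybuffers <= lodirtybuffers) {
			/* Reached the low water mark; sleep until woken. */
			bd_request = 0;
			msleep(&bd_request, &bdlock, PVM, "psleep", hz);
		} else {
			/*
			 * Failed to reach lodirtybuffers; retry after
			 * hz/10.  This sleep is what my qsleep counter
			 * below counts.
			 */
			bd_request = 1;
			msleep(&bd_request, &bdlock, PVM, "qsleep", hz / 10);
		}
		mtx_unlock(&bdlock);
	}
}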

Of course, I use a sector cache in the flash disk driver, but I can't
cache too many sectors because of the total memory size. So, the
filesystem disk blocks are written often and that evokes more disk
sector flushes. A sector flush really takes a long time, so the real
copy time grows beyond control. Last but not least, the flash memory
is worn out uselessly.

Well, this is my old story. Just to be honest, I had quite forgotten
that my kernel was compiled with the FULL_PREEMPTION option. Things
are very much worse in that case. However, the option just makes the
issue worse; the issue doesn't disappear without it.

In this old story, I played with and focused on the bufdirtythresh
value. However, bufdirtythresh changes too much the way how and by
whom buffers are flushed. I recorded the disk sector flush count and
the total count of disk_strategy() calls with the BIO_WRITE command
(and the total byte count to write). I used a file of size 2235517
bytes. When caching 4 disk sectors, without the FULL_PREEMPTION
option, the results are the following:

bufdirtythresh: 120, 85, 40
real copy time: 1:16.909, 38.031, 23.117
cp runtime: 8.876, 9.419, 9.922
bufdaemon runtime: 4.95, 0.26, 0.01
fldisk runtime: 39.64, 18.23, 8.50
idle runtime: 18.99, 7.79, 2.76
qsleep: 358, 32, 0
flush_count: 49, 39, 25
write_count: 452, 529, 931
write_size: 2428928, 2461184, 2649600

The idle runtime is, in fact, the disk sector erasure runtime. qsleep
is a counter of how often bufdaemon slept with the hz/10 timeout, so
in some sense it shows how often bufdaemon failed to reach the
lodirtybuffers threshold. flush_count is the number of disk sector
flushes, and in fact many of the runtimes depend on it. write_count is
the number of disk_strategy() calls with the BIO_WRITE command.
write_size is the total number of bytes to write in the BIO_WRITE
commands.

It can be seen that in the bufdirtythresh=40 case, even though
write_count is the biggest, flush_count is the lowest. Thus, the disk
block write order was the best with respect to the disk sector cache.
However, it also showed that there are other things involved in the
issue which I hadn't figured out before: VFS clusters and disk
fragmentation.

As a matter of interest, the same tests with the FULL_PREEMPTION option:

bufdirtythresh: 120, 85, 40
real copy time: 2:00.389, 27.385, 21.975
cp runtime: 10.873, 9.538, 9.988
bufdaemon runtime: 48.36, 0.11, 0.00
fldisk runtime: 35.96, 11.67, 7.75
idle runtime: 16.59, 4.23, 2.20
qsleep: 4180, 10, 0
flush_count: 44, 30, 24
write_count: 1112, 532, 931
write_size: 2767360, 2462720, 2649600

It's interesting that the real copy times are better for
bufdirtythresh=85 and 40. However, the times depend on flush_count
again, i.e., it is the disk block write order that is better.

As I said, the FULL_PREEMPTION option just makes things worse. The
issue starts when the cp process is preempted in favour of bufdaemon
(while numdirtybuffers is greater than lodirtybuffers) during
VOP_WRITE(), which is called with the VNODE locked. Initially (without
the FULL_PREEMPTION option), that happens if:
1. bufdaemon's sleep times out,
2. a thread with realtime priority is scheduled.

Threads with realtime priorities always preempt fully (with or without
the FULL_PREEMPTION option). When a realtime thread finishes its job,
the thread with the highest priority is scheduled, i.e., bufdaemon can
interrupt the cp process this way too. After that, the filesystem
buffers can be bawrite()d by bufdaemon, which causes the cp process to
sometimes have to sleep in getblk() (td_wmesg = "getblk"), and
bufdaemon is scheduled again. Moreover, once this starts, bufdaemon
sleeps with the hz/10 timeout, so it's more likely that its sleep
times out again. I'm not sure, but once this starts, it looks like the
cp process can sometimes be preempted (contested) as well, and
bufdaemon scheduled.

One more word about the FULL_PREEMPTION option. I know that the
option is not supported (unfortunately). The option also makes things
worse because of the synchronization between bd_wakeup() and bufdaemon
via the bdlock mutex. With FULL_PREEMPTION, bufdaemon is woken up,
scheduled, and immediately blocks on the contested bdlock, so the
process which called bd_wakeup() must be scheduled again just to
release bdlock, and only then is bufdaemon scheduled again. Something
like the following would help:

critical_enter();
wakeup(&bd_request);
mtx_unlock(&bdlock);
critical_exit();

In the UP case it's always OK. In the SMP case it's better than nothing.
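
For context, this is how I imagine the change fitting into
bd_wakeup(); the body below is my simplified paraphrase of the
vfs_bio.c code, so the surrounding details may differ:

/*
 * Simplified paraphrase of bd_wakeup() with the proposed change; not
 * the exact kernel code.
 */
static void
bd_wakeup(int dirtybuflevel)
{

	mtx_lock(&bdlock);
	if (bd_request == 0 && numdirtybuffers >= dirtybuflevel) {
		bd_request = 1;
		/*
		 * Keep the woken bufdaemon from preempting us before we
		 * drop bdlock; otherwise (with FULL_PREEMPTION) it just
		 * blocks on the contested mutex and forces an extra
		 * context switch back to us.
		 */
		critical_enter();
		wakeup(&bd_request);
		mtx_unlock(&bdlock);
		critical_exit();
		return;
	}
	mtx_unlock(&bdlock);
}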

Now, a new story. I compiled the kernel without the FULL_PREEMPTION
option and made a ramdisk from the flash disk, so no disk sector flush
happened during the copy. Only the VFS and the filesystem remained
under investigation.

I already mentioned VFS clusters and disk fragmentation. My flash
disk is almost not fragmented, so even 64 KiB clusters are written at
once. The news is that the clusters are bawrite()d. I did not
investigate it fully, but the mnt_iosize_max field of struct mount
limits the maximal cluster size (see the sketch below). This bawriting
of clusters keeps numdirtybuffers low and makes things better. It is
still about how to exclude bufdaemon from the dirty buffer flushing.
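
A minimal sketch of that relationship as I understand it; the helper
name here is mine, just for illustration, not the real vfs_cluster.c
code:

/*
 * Hypothetical helper, only to illustrate the limit: a cluster is
 * issued as a single I/O, so it can never span more buffers than
 * mnt_iosize_max allows.
 */
static int
max_cluster_buffers(struct mount *mp, int blksize)
{

	return (mp->mnt_iosize_max / blksize);
}

/*
 * With my defaults (blksize = 512):
 *   mnt_iosize_max = 128 KiB -> up to 256 buffers per cluster,
 *                               well above hidirtybuffers = 134;
 *   mnt_iosize_max =  32 KiB -> up to 64 buffers per cluster,
 *                               below lodirtybuffers = 67, so the
 *                               cluster layer bawrites them before
 *                               bufdaemon ever has anything to do.
 */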

I also mentioned the buffer size the copy program is using. A smaller
buffer size makes the probability of bufdaemon's influence smaller.
However, once the described issue starts ...

I left bufdirtythresh tuning and started tuning mnt_iosize_max and
the copy program buffer size instead. On my machine, the default
mnt_iosize_max is MAXPHYS = 128 KiB. Just as a reminder, 68608 bytes
is my default high water dirty buffer byte count.

Results without the FULL_PREEMPTION option, with bufdirtythresh=120,
file size 2235517 bytes, and the ramdisk:

mnt_iosize_max = 128 KiB
------------------------
cp buffer size: 128, 64, 32, 16 KiB
real copy time: 19.234, 15.708, 15.700, 16.054
cp runtime: 10.978, 9.837, 9.895, 9.995
bufdaemon runtime: 2.93, 1.14, 0.97, 1.12
ramdisk runtime: 3.08, 2.98, 2.99, 3.00
qsleep: 172, 46, 37, 47
write_count: 473, 284, 283, 292
write_size: 2439680, 2336768, 2337280, 2341888
dirtybufferflushes: 134, 5, 25, 25
altbufferflushes: 1, 0, 0, 0

The dirtybufferflushes and altbufferflushes counters are kernel
globals (see the sketch below for where they are incremented). As I
said, the cp buffer size just makes the probability smaller. The
bufdaemon runtime is significant compared to the ramdisk runtime. Here
mnt_iosize_max is bigger than the mentioned high water mark.
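
Those counters are incremented when the writing process itself has to
flush because dirtybufthresh was crossed. Roughly, as I read the
bdwrite()/bufbdflush() path (my simplified paraphrase, locking and
other details omitted and possibly different between versions):

/* Simplified paraphrase of bufbdflush(); not the exact kernel code. */
void
bufbdflush(struct bufobj *bo, struct buf *bp)
{
	struct buf *nbp;

	if (bo->bo_dirty.bv_cnt > dirtybufthresh + 10) {
		/* Wildly over the limit: sync the whole vnode. */
		(void) VOP_FSYNC(bp->b_vp, MNT_NOWAIT, curthread);
		altbufferflushes++;
	} else if (bo->bo_dirty.bv_cnt > dirtybufthresh) {
		/*
		 * Over the limit: the writer flushes one dirty buffer
		 * of its own vnode, in its own context, instead of
		 * leaving it to bufdaemon.
		 */
		TAILQ_FOREACH(nbp, &bo->bo_dirty.bv_hd, b_bobufs) {
			if (nbp == bp)
				continue;
			bawrite(nbp);
			dirtybufferflushes++;
			break;
		}
	}
}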

mnt_iosize_max = 64 KiB
------------------------
cp buffer size: 128, 64, 32, 16 KiB
real copy time: 17.566, 15.669, 15.708, 15.543
cp runtime: 9.934, 9.632, 9.730, 9.797
bufdaemon runtime: 2.40, 1.18, 1.10, 0.90
ramdisk runtime: 3.09, 2.99, 3.00, 2.98
qsleep: 137, 53, 43, 30
write_count: 451, 291, 280, 269
write_size: 2421760, 2339840, 2334720, 2329088
dirtybufferflushes: 134, 12, 5, 6
altbufferflushes: 1, 0, 0, 0

Bufdaemon is excluded a little, but dirtybufthresh=120 is not. That
is, 120 * 512 = 61440 bytes is still less than mnt_iosize_max, so
dirtybufthresh is still in the game (see the dirtybufferflushes
counts).

mnt_iosize_max = 32 KiB
------------------------
cp buffer size: 128, 64, 32, 16 KiB
real copy time: 14.580, 14.327, 14.491, 14.609
cp runtime: 9.864, 9.680, 9.732, 9.841
bufdaemon runtime: 0.04, 0.01, 0.01, 0.03
ramdisk runtime: 2.98, 2.98, 2.97, 2.99
qsleep: 2, 0, 0, 2
write_count: 213, 206, 194, 191
write_size: 2266112, 2262528, 2256384, 2254848
dirtybufferflushes: 0, 0, 0, 0
altbufferflushes: 0, 0, 0, 0

Bufdaemon is excluded and so is dirtybufthresh. The dirty buffers are
bawrite()d by the VFS cluster layer.

All times and results are typical (not statistically processed) ones.
The issue is not so visible when a ramdisk is used. If the cp process
is sleeping because of a buffer bawrite, which can be slow on my flash
disk, the bufdaemon runtime can grow very much. On a high-speed CPU,
the issue can be totally hidden in the system workload noise. However,
on my machine the issue exists without a doubt.

I think it's clear how to try to reproduce my issue. In the FreeBSD
kernel, MAXPHYS = 128 KiB is the maximal mnt_iosize_max value. The
hidirtybuffers, dirtybufthresh, and lodirtybuffers values, together
with the VFS buffer size, must be set with respect to mnt_iosize_max:
they (converted to bytes) must be lower than mnt_iosize_max. On my
machine, for example, hidirtybuffers * 512 = 68608 bytes is well below
mnt_iosize_max = 131072 bytes, so the high water mark is crossed under
the VNODE lock before the cluster layer writes anything out.

I know that my machine configuration and utilization is a special
one. Thus, I'm not pushing any change. Once again, I'm just sharing my
experience. However, for embedded platforms it can be more common than
one might think.

Svata


