FreeBSD Mail Archives

Date:      Thu, 1 May 2014 11:59:56 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Matthew Fleming <mdf@freebsd.org>
Cc:        "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "src-committers@freebsd.org" <src-committers@freebsd.org>, Eitan Adler <eadler@freebsd.org>, Ian Lepore <ian@freebsd.org>
Subject:   Re: svn commit: r265132 - in head: share/man/man4 sys/dev/null
Message-ID:  <20140501094737.J1261@besplex.bde.org>
In-Reply-To: <CAMBSHm9mocqTVBeC0WUwg8=t_5aRcWXQV0eb=jYAqavmS1Z-Cw@mail.gmail.com>
References:  <201404300620.s3U6Kmn6074492@svn.freebsd.org> <1398869319.22079.54.camel@revolution.hippie.lan> <CAMBSHm9mocqTVBeC0WUwg8=t_5aRcWXQV0eb=jYAqavmS1Z-Cw@mail.gmail.com>


On Wed, 30 Apr 2014, Matthew Fleming wrote:

> On Wed, Apr 30, 2014 at 7:48 AM, Ian Lepore <ian@freebsd.org> wrote:

>> For some reason this reminded me of something I've been wanting for a
>> while but never get around to writing... /dev/ones, it's just
>> like /dev/zero except it returns 0xff bytes.  Useful for dd'ing to wipe
>> out flash-based media.
>
> dd if=/dev/zero | tr "\000" "\377" | dd of=<xxx>

Why all these processes and i/o's?

tr </dev/zero "\000" "\377"

The dd's may be needed for controlling the block sizes.
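
A minimal sketch of such a pipeline with explicit block sizes, assuming a POSIX shell. The name `write_ones` and its arguments are hypothetical, not from the thread. `LC_ALL=C` is a precaution so that tr treats `\377` as a single byte, and giving `ibs`/`obs` separately (rather than `bs`) asks the second dd to reassemble short pipe reads into full 64K writes.

```sh
# Hypothetical helper: write COUNT 64K blocks of 0xff bytes to TARGET.
write_ones() {
    target=$1
    count=$2
    # First dd bounds the amount of data; tr turns each NUL into 0xff;
    # second dd reblocks the pipe output into 64K writes.
    dd if=/dev/zero bs=64k count="$count" 2>/dev/null |
        LC_ALL=C tr '\000' '\377' |
        dd of="$target" ibs=64k obs=64k 2>/dev/null
}
```

For example, `write_ones /tmp/ones.img 16` writes 1 MB of 0xff bytes to a scratch file. Pointing it at a real device is destructive.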

> But it's not quite the same.

It is better, since it is not limited to 0xff bytes :-).

Oops, perhaps not.  tr not only uses stdio to pessimize the i/o; it uses
wide characters 1 at a time.  It used to use only characters 1 at a time.

yes(1) is limited to newline bytes, or newlines mixed with strings.  It
also uses stdio to pessimize the i/o, but not wide characters yet.

stdio's pessimizations begin with naively believing that st_blksize gives
a good i/o size.  For most non-regular files, including all (?) devices
and all (?) pipes, st_blksize is PAGE_SIZE.  For disks, this has been
broken significantly since FreeBSD-4 where it was the disk's si_bsize_best
(usually 64K).  For pipes, this has been broken significantly since
FreeBSD-4 where it was pipe_buffer.size (either PIPE_SIZE = 16K or
BIG_PIPE_SIZE = 64K).
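
One way to see what st_blksize a given file reports is stat(1). This is only a sketch: the helper name `blksize` is hypothetical, and it assumes either GNU stat (`-c %o`) or BSD stat (`-f %k`) is installed. The GNU form is tried first because BSD stat rejects `-c` without printing anything to stdout.

```sh
# Hypothetical helper: print st_blksize for one file.
blksize() {
    # GNU stat: -c %o ; BSD stat: -f %k (both report st_blksize).
    stat -c %o "$1" 2>/dev/null || stat -f %k "$1"
}
```

Running `blksize /dev/zero` and then `blksize` on a disk device shows the size that stdio would pick for each.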

So standard utilities tend to be too slow to use on disks.  You have to
use dd and relatively complicated pipelines to get adequate block sizes.
Sometimes dd or a special utility is needed to get adequate control and
error handling.  I have such a special utility for copying disks
with bad sectors, but prefer to use just cp for copying disks.  cp
doesn't use stdio, and doesn't use mmap() above a certain small size; it
uses read/write() with a fixed block size of 64K or maybe larger in
-current, so it works OK for copying disks.
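
For comparison, the dd form of a fixed 64K block copy could be sketched as below. The name `copy_64k` is hypothetical. The comment mentions `conv=noerror,sync` as the usual dd variant for media with read errors; it is left out of the command because `sync` pads every short read, including a final partial block, with NUL bytes.

```sh
# Hypothetical helper: copy SRC to DST using 64K reads and writes.
copy_64k() {
    # For media with bad sectors, adding conv=noerror,sync keeps dd going
    # after a read error and pads the failed or short block with NULs.
    dd if="$1" of="$2" bs=64k 2>/dev/null
}
```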

The most broken utilities that I use often for disk devices are:

- md5.  This (really libmd/mdXhl.c) has been broken on all devices (really
   on all non-regular files) since ~2001.  It is broken by misusing
   st_size instead of by trusting st_blksize.  st_size is only valid
   for regular files, but is used on other file types to break them.
   For example:

     pts/21:bde@freefall:~> md5 /dev/null
     MD5 (/dev/null) = d41d8cd98f00b204e9800998ecf8427e
     pts/21:bde@freefall:~> md5 /dev/zero
     MD5 (/dev/zero) = d41d8cd98f00b204e9800998ecf8427e

   Similarly for disk devices.  All devices are seen as empty by md5.

   The workaround is to use a pipeline, or just stdin.  "cat /dev/zero | md5"
   and even "md5 </dev/zero" confuse md5 into using a different input method
   that works.  OTOH, "md5 /dev/fd/0" sees an empty device file, and
   "cat /dev/zero | md5 /dev/fd/0" fails immediately with a seek error.
   Pipes have st_size == 0 too, so the input method that stats the file
   would see an empty file too, so it must not be reached in the working
   case.  "md5 /dev/fd/0" apparently just stats the device file, and this
   appears to be empty.  I'm not sure if it is the tty device file or
   /dev/fd/0 that is seen.  "cat /dev/zero | md5 /dev/fd/0" apparently
   reaches the buggy code, but somehow gets further and fails trying to
   seek.

   To get adequate block sizes for disks, use dd in the pipeline that must
   be used for other reasons.

   I only recently noticed that pipes have st_blksize = PAGE_SIZE, so that
   if you pipe to stdio utilities then the i/o will be pessimized, and
   reblocking with another dd in the pipeline is needed to get back to an
   adequate size.  PAGE_SIZE is large enough to not be very pessimal for
   some uses.

- cmp.  cmp uses mmap() excessively for regular files, but for device files
   it uses per-char stdio excessively.

   (
   More on md5.  The i/o routine for the working case is in the application
   (md5/md5.c).  This uses fread() with the bad block size BUFSIZ.  This
   is still 1024.  It is more broken than st_blksize.  However, fread()
   is not per-char, so it is reasonably efficient.  stdio uses st_blksize
   for read() from the file.  When the file is regular, the block size
   is again relatively unimportant provided the file system has a large
   enough block size or does clustering.  For device files, clustering
   might occur at levels below the file system, but usually doesn't for
   disks.  Instead, small i/o's get relatively slower with time except
   on high-end SSDs with high transactions per second, because clustering
   at low levels takes too many transactions.

   The i/o routine for the non-working case is in the library
   (libmd/mdXhl.c).  It uses read(), but with the silly stdio block
   size of BUFSIZ.  libmd files have several includes of <stdio.h>, but
   don't seem to use stdio except for bugs like this.  The result is that
   the i/o is especially pessimized for the usual regular file case.
   Buffering in the kernel limits this pessimization.
   )

   The device file case for cmp just uses getc()/putc().  This first
   gets the st_blksize pessimization.  Then it gets the slow per-char
   i/o from using getc()/putc().  For disks, the first pessimization
   tends to dominate but the second one is noticeable.  For fast
   input devices it is very noticeable.  On freefall now:
   "dd if=/dev/zero bs=1m count=4k of=/dev/null": speed is 21GB/sec;
   "dd if=/dev/zero bs=1m count=4k | cmp - /dev/zero": speed is 187MB/sec.
   The overhead is a factor of 110.  With iron disks, the overhead would
   be a factor of about 1/2.

   The loop in cmp for regular files is slow too, but only in comparison
   with the memcpy() that is (essentially) used for reading /dev/zero
   and with the memcmp() that should be used by cmp.  It just compares
   bytewise and has mounds of bookkeeping to count characters and lines
   for the rare cases that fail.  The usual case should just use mmap()
   of the whole file (if not read()) and memcmp() on that.

   I recently noticed a very bad case for cmp on regular files too.  I
   was comparing large files on a cd9660 file system on a DVD, under
   an old version of FreeBSD.  cmp mmap()s the whole file.  The i/o
   for this is done by vm, and vm generated only minimal i/o's with
   the cd9660 block size of 2K.  read() would have done clustering
   to a block size of 64K.  Perhaps vm is better now, but it is hard
   to see how it could do as well as read() without doing the same
   clustering as read().

   One workaround for this is to prefetch files into the buffer (vmio)
   cache using read().  It is hard to avoid thrashing of the cache
   with this, so I used workarounds like diff'ing the files instead
   of cmp'ing them.  diff is much heavier weight, but it runs faster
   since it doesn't use mmap() (gnu diff seems to use fread() and
   suffers from stdio using st_blksize).
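
The md5 workaround described earlier (hash through a pipeline, with dd supplying the block size) might look like the sketch below. The name `md5_blocks` is hypothetical, and the fallback to `md5sum` is an assumption for systems without FreeBSD's md5. The count argument is there only so that the example terminates on endless devices such as /dev/zero.

```sh
# Hypothetical helper: MD5 of the first COUNT 64K blocks of SRC.
md5_blocks() {
    src=$1
    count=$2
    if command -v md5 >/dev/null 2>&1; then
        hasher=md5        # FreeBSD
    else
        hasher=md5sum     # assumption: GNU coreutils fallback
    fi
    # Hashing stdin avoids the code path that trusts st_size for devices.
    dd if="$src" bs=64k count="$count" 2>/dev/null | "$hasher" | awk '{print $1}'
}
```

With this, `md5_blocks /dev/zero 1` hashes 64K of real data rather than reporting the empty-input digest shown in the transcript above.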

Bruce


Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140501094737.J1261>
