Date:      Wed, 4 Feb 2004 21:44:12 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Doug White <dwhite@gumbysoft.com>
Cc:        Putinas Piliponis <putinas.piliponis@icnspot.net>
Subject:   Re: atacontrol rebuild and iostat 
Message-ID:  <20040204205305.B1469@gamplex.bde.org>
In-Reply-To: <20040203170027.E86301@carver.gumbysoft.com>
References:  <200402031308.i13D8F8F022178@cwsys.cwsent.com> <20040203170027.E86301@carver.gumbysoft.com>

On Tue, 3 Feb 2004, Doug White wrote:

> On Tue, 3 Feb 2004, Cy Schubert wrote:
>
> > > Why nothing ?
> >
> > Iostat doesn't see the I/Os because RAID rebuilds occur within the
> > controller, the I/Os are not initiated in the O/S nor any of its utilities,
> > therefore the FreeBSD kernel doesn't see them. The O/S doesn't see the
> > I/Os. Atacontrol sees 3% because it specifically queries the controller for
> > that information.
>
> ATARAID is purely OS driven.  The OS issues the writes for the rebuild, as
> well as failure detection and mirroring.  You're thinking of SCSI
> controllers, or 3ware controllers.
>
> Since the rebuild I/O is driven by the kernel, it bypasses the normal I/O
> path and thus doesn't register in the stats.  If you try to do heavy I/O
> to the devices, you'll find the performance is reduced.

Drivers should register all interesting i/o transactions, but GEOM now hides
even more details from them than before, so iostat often shows bogus stats.
E.g., if you try to write 256K-blocks to an ad (non-RAID) disk, then there
are many layers of deblocking and enblocking, and iostat shows the wrong layer:

- first, physio() knows that you don't really want the 256K-blocks that you
  asked for (this is a bug for some devices but not disks).  It deblocks to
  block size dev->si_iosize_max.  si_iosize_max is supposed to be
  device-specific, but it is now just bogus.  GEOM always sets it to
  MAXPHYS (128K) for disks.

  si_iosize_max is bogus for other reasons.  Disk devices need to support
  reading blocks of sizes up to (VM_INITIAL_PAGEIN * PAGE_SIZE) bytes (64K
  on i386 and 128K on alphas...) for execve() to work.  The size for this
  on alphas is accidentally the same as MAXPHYS, so si_iosize_max must be
  MAXPHYS or larger for non-broken disk devices and there is no point in
  having it.  This is mostly fixed in -current, but in RELENG_4 most
  disk devices advertise a bogus limit of DFLTPHYS = 64K.  They had better
  support MAXPHYS = 128K and deblock it internally to support alphas.  The
  acd driver in RELENG_4 advertises a bogus limit of 32K or 126K but actually
  does 128K or possibly more without deblocking.

- second, GEOM registers the i/o's with devstat with the sizes that it gets
  from physio() (128K in this example).

- third, GEOM deblocks the 128K blocks to the maximum sizes advertised by
  the driver in the new d_maxsize struct member.   The ad driver could
  handle 128K-blocks without deblocking in RELENG_4 (this is an old
  optimization by dyson, except the maximum was 127K or 127.5K in the
  first version of it because some drivers were claimed to not like 128K).
  Now the ad driver can only handle 64K blocks, so GEOM turns the
  128K-blocks into 64K ones.

- fourth, there might be another layer of deblocking in the driver (there
  isn't one for ad AFAIK).  iostat would not show it.

- fifth, there is more deblocking in the drive.  Sectors are normally 512
  bytes, at least virtually, so there is a lot of deblocking to get them
  from a 64K-block.  iostat just doesn't support this level.

  The best block size to use and the best deblocking strategies are not
  clear, but iostat should show the size sent to the hardware (or sizes
  at all levels) so that the best sizes and deblocking strategies can be
  chosen.

Bruce