Date:      Fri, 27 Feb 2015 17:08:46 -0700
From:      "Kenneth D. Merry" <ken@FreeBSD.ORG>
To:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Cc:        current@FreeBSD.ORG, scsi@FreeBSD.ORG
Subject:   Re: sa(4) driver changes available for test
Message-ID:  <20150228000846.GA33584@mithlond.kdm.org>
In-Reply-To: <54F0BFE1.4000000@omnilan.de>
References:  <20150214003232.GA63990@mithlond.kdm.org> <20150219001347.GA57416@mithlond.kdm.org> <54EEEE1E.7020007@omnilan.de> <20150226224202.GA14015@mithlond.kdm.org> <54F0BFE1.4000000@omnilan.de>

On Fri, Feb 27, 2015 at 20:05:05 +0100, Harald Schmalzbauer wrote:
>  Regarding Kenneth D. Merry's message from 26.02.2015 23:42 (localtime):
> 
> [...]
> >>> And (untested) patches against FreeBSD stable/10 as of SVN revision 278974:
> >>>
> >>> http://people.freebsd.org/~ken/sa_changes.stable_10.20150218.1.txt
> [...]
> 
> > I'm glad it is working well for you!  You can do larger I/O sizes with the
> > Adaptec by changing your MAXPHYS and DFLTPHYS values in your kernel config
> > file.  e.g.:
> >
> > options         MAXPHYS=(1024*1024)
> > options         DFLTPHYS=(1024*1024)
> >
> > If you set those values larger, you won't be able to do more than 132K with
> > the sym(4) driver on an x86 box.  (It limits the maximum I/O size to 33
> > segments * PAGE_SIZE.)
> 
> Thanks for the hint! I wasn't aware that kern.cam.sa.N.maxio has driver
> limitations corresponding to the system's MAXPHYS/DFLTPHYS. I thought only
> silicon limitations define its value.

It depends on the driver.  I thought that the Adaptec drivers go off of
MAXPHYS (because that's what the driver author told me last week :), but
in looking at the code, they actually have a hard-coded value that can be
increased.  You can bump AHC_MAXPHYS or AHD_MAXPHYS in aic7xxx_osm.h or
aic79xx_osm.h, respectively.  In order to make any difference, though, you
would have to bump MAXPHYS/DFLTPHYS (so the sa(4) driver will use that
value) or change the ahc(4)/ahd(4) driver to set the maxio field in the
path inquiry CCB.
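
For reference, the pattern in a driver's XPT_PATH_INQ handler looks
roughly like this.  This is only a sketch (the maxio field in struct
ccb_pathinq is real, but the surrounding code varies from driver to
driver):

	case XPT_PATH_INQ:
	{
		struct ccb_pathinq *cpi = &ccb->cpi;

		/* ... fill in the other path inquiry fields ... */

		/*
		 * Advertise the largest I/O the controller can handle.
		 * If a driver leaves maxio at 0, peripherals like sa(4)
		 * fall back to DFLTPHYS/MAXPHYS-based defaults.
		 */
		cpi->maxio = AHC_MAXPHYS;

		cpi->ccb_h.status = CAM_REQ_CMP;
		xpt_done(ccb);
		break;
	}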

> But in order to have a best-matching pre-production test environment, I
> nevertheless replaced it, now using mpt(4) instead of ahc(4)/ahc_pci on
> PCI-X@S3210 (for parallel tape drives I consistently have mpt(4)@PCIe,
> which is the same LSI (53C1020) chip but with an on-board PCI-X<->PCIe bridge).

Okay.  That should work.

> Still just works fine! :-) (stable_10.20150218.1-patchset with LTO2,
> LTO3 and DDS5)
> With DDS5, density is reported as "unknown". If I remember correctly,
> you have your DDS4 reporting "DDS4"?

That means that we need to add DDS5 to the density table in libmt.  Can
you send the output of 'mt status -v'?  It would actually be helpful for
all three drives.

Also, do any of your drives give a full report for 'mt getdensity'?  If so,
can you send that as well?  (By full report, I mean more than one line.)

We don't have density codes for DDS-5/DAT 72, DAT 160 or DAT 320 yet.  It
looks like DDS-5 should be 0x47.
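
The fix on our side is just another row in libmt's density table.  As a
sketch (simplified; the real table carries more columns, e.g. bpmm/bpi
figures):

	static const struct {
		int		 code;	/* SCSI density code */
		const char	*name;
	} density_names[] = {
		{ 0x26, "DDS-4" },
		{ 0x47, "DAT-72 (DDS-5)" },	/* proposed addition */
	};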

> > > therefore I'd like to point to the new port misc/vdmfec
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197950
> > That looks cool. :)  I'm not a ports committer, but hopefully one of them
> > will pick it up.
> 
> Cool it is indeed, but whether it's really useful or not is beyond my
> expertise. I haven't been able to collect much MT experience yet.
> I know that LTO and similar "modern" MT technology do their own ECC (in
> the meaning of erasure code, mostly Reed-Solomon).
> What I don't know (but want to be best prepared for) is how arbitrary
> LTO drives behave if the one (1) in 10^17 bits is detected to be
> uncorrectable.
> If it wasn't detected, the post erasure code (vdmfec in that case) would
> help for sure.
> But if the drive just cuts the output, or stops streaming altogether, would
> vdmfec be useless?

There is a difference between the uncorrectable bit error rate and the
undetectable bit error rate.  The uncorrectable bit error rate for LTO-6 is
1 in 10^17.  It is 1 in 10^19 for Oracle T10000 C/D drives, and 1 in 10^20
for IBM TS1150.  Seagate Enterprise drives claim to have an uncorrectable
bit error rate of 1 sector per 10^15 bits read.
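
To put 1 in 10^17 in perspective: that is one expected uncorrectable bit
per 10^17 / 8 = 1.25 x 10^16 bytes, i.e. roughly 12.5 PB read, or on the
order of 5000 full LTO-6 tapes (2.5 TB native each) per uncorrectable bit.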

See:

http://www.oracle.com/us/products/servers-storage/storage/tape-storage/t10000c-reliability-wp-409919.pdf

http://www.spectralogic.com/index.cfm?fuseaction=home.displayFile&DocID=2513

http://www.seagate.com/www-content/product-content/enterprise-hdd-fam/enterprise-capacity-3-5-hdd/constellation-es-4/en-us/docs/enterprise-capacity-3-5-hdd-ds1791-8-1410us.pdf

The second white paper claims that tape has an undetectable bit error rate
of 1 in 1.6x10^33 bits.  I assume it is referring to TS1150, but I don't
know for sure.

It is far more likely that your tape or tape drive will break than it is
that you would get a bad bit back from the drive.

> According to excerpts of "Study of Perpendicular AME Media in a Linear
> Tape Drive", LTO-4 has a soft read error rate of 1 in 10^6 bits and DDS
> has 1 in 10^4 bits (!!!, according to HP C1537A DDS 3 - ACT/Apricot). So
> with DDS, _every_ single block pax(1) writes to tape needs to be
> internally corrected! Of course, nobody wants zfs' send output stream to
> DDS, it's much too slow/small, but it's worth mentioning.
> 
> For archives of zfs streams, I don't feel safe relying on the tape
> drives' FEC, which was designed for backup solutions which do their own
> blocking+checksumming, so the rarely expected uncorrectable read
> error would at worst lead to a few unrecoverable files, which even in the
> case of database files are most likely post-recoverable.
> But with one flipped bit in the zfs stream, you'd lose hundreds of
> gigabytes, completely unrecoverable!
> As long as the tape keeps spitting out complete blocks, even when the
> drive knows that the output is not correct, vdmfec ought to be
> the holy grail :-)

A tape drive or hard drive isn't going to return bits that it knows aren't
correct.  They'll return an error instead.  The bit error rate of a tape
drive is lower than the bit error rate for a hard drive, so you're less
likely to get bad bits back from a tape drive than a disk drive.

Another thing you have to consider, if you're concerned about the bit error
rates, is the error rate of the link that you're using to connect to the
tape or disk drive.  The tape/disk might read the block correctly, but
you could also get corruption on the link.

This article talks about disk/tape bit error rates and the link bit error
rates.  I haven't read the whole thing, but it should make for some
interesting reading:

http://www.enterprisestorageforum.com/storage-technology/sas-vs.-sata-1.html

With ZFS, you get protection from link and disk errors via its checksums
and RAID.  If there is a checksum error on one drive, it can rebuild the
corrupted data using the parity/mirror information.
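
You can exercise exactly that repair path with a scrub and watch the
CKSUM column for repaired errors, e.g. (pool name is just an example):

	# zpool scrub tank
	# zpool status -v tank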

If you want to do the same thing with a tape drive, you would need to write
your data to two tapes, or have another copy somewhere that you could use
as your recovery copy in case of corruption.

FWIW, the sa(4) driver now supports protection information on a per-block
basis.  LTO (at least newer drives) and TS drives support adding a CRC on
the end of each tape block written to and read from the drive.  The drive
will verify it on writes, and supply the checksum on reads.  You could also
do your own FEC scheme.
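
The protection knobs are exposed as sysctl variables under
kern.cam.sa.N.protection.  The output below is illustrative only; run
'sysctl -d kern.cam.sa.0.protection' on a patched system for the
authoritative list:

	# sysctl kern.cam.sa.0.protection
	kern.cam.sa.0.protection.protection_supported: 1
	kern.cam.sa.0.protection.prot_method: 1
	kern.cam.sa.0.protection.pi_length: 4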

> Going slightly more off topic:
> For me, one hot candidate for being another holy grail is mbuffer(1).
> 
> I don't know if tar/pax/cpio do any kind of FIFO buffering at all, but
> for zfs-send-streaming, mbuffer(1) is obligatory. Even with really huge
> block sizes, you can't saturate the LTO-3 native rate. With mbuffer(1) it's
> no problem to stream at the LTO-4 native rate with a tape-transport blocksize
> of 32k.
> Btw, besides the FIFO-buffering, I also miss star(1) for its
> multi-volume support. tar(1) in base isn't really useful for tape
> buddies; IMHO it's hardly adequate for any purpose and I don't
> understand its widespread usage. Most likely the absence of dump(8) for
> zfs misleads people to tar(1) ;-)
> 
> Were there ever thoughts about implementing FIFO-buffering in sa(4)?
> We don't have mbuffer(1) in base, but I think that, to complete FreeBSD's
> tape support, users should find all the technology/tools needed for using
> modern tape drives in base. If sa(4) could provide sysctl-controlled
> FIFO-buffering, some base tools would be a bit more appropriate for tape
> usage, I think.

It would probably be easier and better to just put mbuffer in the base.

The challenge with doing buffering in the tape driver is that the userland
application would need to use a different API to write to the driver.  The
standard write(2) API requires that it return status when the block is
written.  So if we're going to buffer up a bunch of I/O in the
tape driver, we would need to do it with an async I/O type interface so
that we could return an individual error status for any I/O that fails.
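
To illustrate the API shape (this is just a userland sketch of POSIX AIO,
not a statement about what sa(4) supports today), note how each queued
block gets its own completion status, which plain write(2) cannot provide
once several blocks are in flight:

	#include <aio.h>
	#include <errno.h>
	#include <string.h>
	#include <sys/types.h>

	/* Queue one tape block without waiting for it to hit the media. */
	static int
	queue_block(int fd, void *buf, size_t len, struct aiocb *cb)
	{
		memset(cb, 0, sizeof(*cb));
		cb->aio_fildes = fd;
		cb->aio_buf = buf;
		cb->aio_nbytes = len;
		return (aio_write(cb));
	}

	/* Later: wait for that particular block and fetch its own status. */
	static ssize_t
	reap_block(struct aiocb *cb)
	{
		const struct aiocb *list[1] = { cb };

		while (aio_error(cb) == EINPROGRESS)
			aio_suspend(list, 1, NULL);
		return (aio_return(cb));
	}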

If you're not able to stream to a tape drive with ZFS send, but it does
work with mbuffer, then the issue is just an inconsistent output rate from
ZFS send.  mbuffer overcomes the consistency problem with buffering.
We could do the buffering in the kernel, but that would mean rewriting any
userland application that wants to talk to the tape drive.
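
For the archives, the usual userland pipeline looks something like this
(buffer and block sizes are only examples; tune -m/-s for your drive):

	zfs send -R pool@snap | mbuffer -m 1G -s 128k -P 90 -o /dev/nsa0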

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG


