Date:      Wed, 13 Sep 2000 13:08:11 -0700 (PDT)
From:      dhesi@rahul.net (Rahul Dhesi)
To:        freebsd-stable@freebsd.org
Subject:   Re: SCSI retries without errors in /var/log/messages?
Message-ID:  <20000913200811.56E267C63@yellow.rahul.net>
References:  <freebsd-stable.20000911141718.A51045@panzer.kdm.org>


"Kenneth D. Merry" <ken@kdm.org> writes:

>The timeout for read and write operations in the da(4) driver is 60
>seconds, and we retry things four times.

And I understand that an error is logged only if all retries fail.  So
potentially we could try three times, with a total 180 second delay, then
succeed on the fourth try, and no error would be logged.

So I thought about this, and wondered if we could have ongoing SCSI
delays with no errors logged.

Suppose there is a SCSI hardware problem such that every I/O operation
has a 0.01 probability of timing out, which means it has a 0.99
probability of succeeding.

As a first approximation, out of every 100 I/O operations typically one
will time out, causing a 60-second delay.  If we are doing 30 I/O
operations per second, then we will encounter one 60-second delay for
every 3.3 seconds of actual I/O, on the average.  Which really means
that our rate of I/O operations will be reduced to 100 in 63.3 seconds,
or about 1.6 I/O operations per second.  A very, very slow computer
system.

Will this show up in syslog?  Only when all 4 tries fail, the
probability of which is (0.01)**4, which is 1 in 100000000.  At 30
I/O operations per second, with no delays, we should see one syslog
entry every 39 days or so.  But if every 100th I/O operation is delayed
by 60 seconds, then we are really averaging a much lower rate of I/O
operations per second: 100000000 operations take about 63 million
seconds, so we might not see a syslog entry for about two years.
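
For anyone who wants to check the arithmetic, here is a rough sketch in
Python.  It assumes what the text above assumes: every try is
independent, I/O is strictly serialized, and an error is logged only
when all 4 tries fail.  The function and constant names are made up for
illustration, and the delay added by retries that themselves time out
is ignored as a second-order effect.

```python
# Back-of-the-envelope model; all names here are illustrative.
P_TIMEOUT = 0.01    # chance that a single try times out (assumed)
TIMEOUT_S = 60.0    # da(4) read/write timeout quoted above, in seconds
TRIES = 4           # tries before an error is logged (as assumed above)
BASE_RATE = 30.0    # I/O operations per second when nothing times out


def effective_rate(p=P_TIMEOUT, timeout_s=TIMEOUT_S, base_rate=BASE_RATE):
    """Operations per second once first-try timeouts are charged for."""
    seconds_per_op = 1.0 / base_rate + p * timeout_s
    return 1.0 / seconds_per_op


def days_between_log_entries(rate, p=P_TIMEOUT, tries=TRIES):
    """Mean days between operations on which every try fails."""
    ops_per_logged_error = 1.0 / p ** tries
    return ops_per_logged_error / rate / 86400.0


if __name__ == "__main__":
    slow = effective_rate()
    print(f"effective rate: {slow:.2f} ops/s")
    print(f"no delays:   {days_between_log_entries(BASE_RATE):.0f} days")
    print(f"with delays: {days_between_log_entries(slow):.0f} days")
```

Changing P_TIMEOUT or TRIES shows how sensitive the interval between
logged errors is to those two assumptions.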

This is an approximate calculation, but the orders of magnitude should
be about right.  The most likely source of error in my logic above is
that the probability of encountering an error on the first try of a
specific I/O operation might not be independent of the probability of an
error on a retry.  Thus it might not be correct to use the term
(0.01)**4 above.  But this will very much depend on the exact reason for
the error.
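
To get a feel for how much the independence assumption matters, here is
a small Python sketch.  It is purely hypothetical: p_retry stands for
the chance that a retry times out given that the previous try did, a
number I have no measurement for.  With p_retry equal to 0.01 it
reduces to the (0.01)**4 term.

```python
def p_all_tries_fail(p_first, p_retry, tries=4):
    """Probability that the first try and every retry time out.

    p_first: chance that the first try times out.
    p_retry: chance that a retry times out, given that the previous try
             did (hypothetical; equals p_first if tries are independent).
    """
    return p_first * p_retry ** (tries - 1)


if __name__ == "__main__":
    for p_retry in (0.01, 0.1, 0.5, 0.9):
        p = p_all_tries_fail(0.01, p_retry)
        print(f"p_retry={p_retry}: one logged error per {1 / p:,.0f} operations")
```

The more strongly a failed try predicts a failed retry, the more often
errors would reach syslog, so correlated failures would make the
problem easier to spot, not harder.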
-- 
Rahul






