Date:      Sun, 26 Jul 1998 18:15:55 -0500
From:      Karl Denninger  <karl@mcs.net>
To:        Terry Lambert <tlambert@primenet.com>
Cc:        Mike Smith <mike@smith.net.au>, wollman@khavrinen.lcs.mit.edu, dswartz@druber.com, current@FreeBSD.ORG, dillon@best.net
Subject:   Re: MMAP problems
Message-ID:  <19980726181555.49644@mcs.net>
In-Reply-To: <199807262228.PAA18917@usr01.primenet.com>; from Terry Lambert on Sun, Jul 26, 1998 at 10:28:08PM +0000
References:  <199807261647.JAA10667@antipodes.cdrom.com> <199807262228.PAA18917@usr01.primenet.com>

On Sun, Jul 26, 1998 at 10:28:08PM +0000, Terry Lambert wrote:
> > > The data which gets written is usually a block of zeros, but it may not be;
> > > it can also be random trash.  It's also not always one block (it could be
> > > more than one), but it IS always, at least from what I'm seeing here, a
> > > multiple of 512 bytes (disk blocksize).
> > 
> > The significant question in light of Garrett's description seems to be 
> > whether the trash that's written is actually being written by the 
> > process in error because that's what it got from a previous read, or 
> > whether the process is actually writing the right stuff and it's being 
> > corrupted on the way down.
> 
> See other postings.
> 
> Because the corrupt data can be non-zero, I do not believe Garrett's
> explanation is the correct one; instead, I believe the same page is
> being pointed to by two mappings at the same time because I don't
> believe that mmap() references are being revoked correctly.

Hmmm...

Yes, I can confirm (for certain) that the corrupt data is not always zero.
It FREQUENTLY is zero, but not always.  If it is non-zero it generally is
identifiable as a chunk of another message (unfortunately I haven't gotten 
a HEADER yet; if I do, I will be able to track down where the chunk of data 
actually came from).
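
A quick way to locate the zero-filled case in a suspect spool file is to scan
it on 512-byte boundaries.  This is only a rough sketch: find_zero_blocks and
BLOCK_SIZE are hypothetical names (not part of diablo), and it will miss the
non-zero corruption entirely, since that looks like ordinary article text.

```python
BLOCK_SIZE = 512  # assumed disk block size, per the observations in this thread


def find_zero_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the offsets of block-aligned runs that are entirely zero bytes.

    Only whole blocks are checked; a short trailing fragment is ignored.
    """
    zero = bytes(block_size)  # block_size zero bytes to compare against
    return [
        off
        for off in range(0, len(data) - block_size + 1, block_size)
        if data[off:off + block_size] == zero
    ]
```

Running it over the contents of a file (read in binary mode) gives the offsets
worth looking at by hand; an empty list only rules out the all-zero case.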

> Note that I have seen the bug I am describing on both 2.2.6 and 3.0
> systems.  These are production systems that open and hold open the
> password file a long time, and which access the crontab with an
> annoyingly (and probably undesirably) high frequency.  This results
> in corrupt crontabs.  On the other hand, corrupt crontabs are much
> better than silently corrupted user data... 8-(.

This is particularly bad, since one of the potential "fixes" Matt has given
me is to back down to 2.2.6.  Of course, if this problem is IN 2.2.6, then
backing down on that machine will do a big nothing.

I suspect the real culprit is that I'm running basically all my feeds in 
"realtime" mode - if I were delaying by 10 minutes, diablo would never be
writing to the same file that the feeder program was reading at a given time
(i.e., the file that was open for MMAP would never be open for write at the
same time).  

I *can* confirm that the files where the corruption is being seen absolutely 
ARE open for write when the errors occur; I've managed to catch the system
"in the act" doing this, and have found the file open at the time.

I'm going to put a "q1" flag on all the feeds (which should keep the system
from "chasing its tail" so closely) and see if the problem goes away.  Doing
that should prevent the software from attempting to read via mmap in the
last-written (by write(2)) page.

If you're right then this should make the problem disappear. 
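
The access pattern under suspicion (one descriptor appending with write(2)
while a read-only shared mapping of the same file is read) can be sketched
roughly as below.  This is not diablo's code and it is not a reproduction of
the bug; check_mmap_write_coherence is a made-up name, it runs in a single
process, and it assumes a POSIX system.  With a coherent buffer cache the
mapped view should always match what was just written.

```python
import mmap
import os
import tempfile

BLOCK = 512                 # assumed disk block size
REGION = mmap.PAGESIZE * 2  # map two pages so writes land in more than one page


def check_mmap_write_coherence(rounds: int = 64) -> list[int]:
    """Write blocks with pwrite() and read them back through a shared mapping.

    Returns the file offsets where the mapped view differed from the data
    that was just written (an empty list means no mismatch was seen).
    """
    mismatches = []
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, REGION)  # size the file so the mapping is fully backed
        with mmap.mmap(fd, REGION, access=mmap.ACCESS_READ) as view:
            for i in range(rounds):
                off = (i * BLOCK) % REGION
                payload = bytes([1 + i % 255]) * BLOCK  # never all zeros
                os.pwrite(fd, payload, off)
                if view[off:off + BLOCK] != payload:
                    mismatches.append(off)
    finally:
        os.close(fd)
        os.unlink(path)
    return mismatches
```

A real test would need a separate writer and reader process and a much longer
run, since the corruption described here is intermittent.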

Then the question becomes how to fix it.

-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly / All Lines K56Flex/DOV
			     | NEW! Corporate ISDN Prices dropped by up to 50%!
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost
