Date:      Sun, 28 Nov 1999 19:05:12 -0600 (CST)
From:      Joe Greco <jgreco@ns.sol.net>
To:        mike@sentex.net (Mike Tancsa)
Cc:        stable@FreeBSD.ORG, mcgovern@spoon.beta.com
Subject:   Re: mmap bugs (was Re: ahc problems (with vinum?))
Message-ID:  <199911290105.TAA91433@aurora.sol.net>
In-Reply-To: <4.1.19991128190819.0518bd70@granite.sentex.ca> from Mike Tancsa at "Nov 28, 1999  7:13: 2 pm"

> At 02:45 PM 11/28/99 , Joe Greco wrote:
> >I have certainly beat the $#!+ out of these systems in a variety of ways,
> >and have run into some odd things.  Most were traceable to SCSI issues.
> >Some didn't get classified.  I'm running vinum in a ten-filesystem config
> >on top of the 18 18GB drives, and I copy in data from another machine.  I
> >then have an application which mmap()'s the files, doing search and replace
> >ops on the data.  Running this app in parallel causes the system to hang
> >(eventually causing the watchdog to expire and reset the system).  Running
> >it serially on one fs at a time doesn't.  This is probably the most
> >worrisome of the issues I've seen.  If you have a recommended revision of
> >the ahc driver you'd like me to try, let me know.
> 
> Can you post more details of the mmap bug you have come across ?  It would
> be nice if this were fixed for 3.4. mcgovern@spoon.beta.com is coordinating
> testing of RCs for 3.4.  Perhaps this is a problem that someone could
> fix in time.

That's the problem, I don't really know what it is.  I'd sure love to see it
fixed, since anything that can hang a system in such a manner is unsettling,
but I don't really have much of an idea what's causing it.  It could be a
vinum thing, it could be some VM thing, it could be my crappy programming
(but userland programs should never puke the kernel).

I'll show you the program, the wrapper script, and a description of the
specific environment and use.  I'll also try to get around to doing some
additional debugging, but basically I've been seeing a soft system lockup:
userland processes appear to stop running, but the console still responds to
vty changes.  Pressing return results in an echo, but the underlying program
doesn't seem to receive it, and further keystrokes are not echoed.
The kernel is still sane enough to be running my watchdog code, which will
eventually cause the system to reboot via software.  However, it does a
forced termination of the kernel since killing init doesn't work.

% cat filesed.c
/*
 * filesed.c
 *
 * (c) 1999 Joe Greco and sol.net Network Services.  All Rights Reserved.
 *
 * mmap a file, hunting for a string.  Replace with an identical-length
 * string.  Intended for scouring a spool and replacing Path: hosts after
 * a load-via-disk-copy.
 *
 * filesed 'from' 'to' file [file...]
 */

#include	<stdio.h>
#include	<stdlib.h>
#include	<string.h>
#include	<unistd.h>
#include	<fcntl.h>
#include	<sys/types.h>
#include	<sys/stat.h>
#include	<sys/mman.h>





int filesed(file, from, to)
char *file, *from, *to;
{
	int count = 0;
	int slen = strlen(from);
	struct stat statbuf;
	caddr_t map;
	char *here, *end, *ptr;
	int fd;

	if (stat(file, &statbuf) < 0) {
		perror(file);
		return(-1);
	}
	if ((fd = open(file, O_RDWR, 0)) < 0) {
		perror(file);
		return(-1);
	}
	if ((map = mmap(NULL, statbuf.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
		perror(file);	/* report before close() can clobber errno */
		close(fd);
		return(-1);
	}

	/* Search and replace. */
	here = map;
	end = map + statbuf.st_size - slen + 1;	/* one past the last possible match start */

	while (here < end) {
		ptr = memchr(here, *from, end - here);
		if (! ptr) {
			here = end;
		} else {
			if (! memcmp(ptr, from, slen)) {
				memcpy(ptr, to, slen);
				count++;
			}
			here = ptr + 1;
		}
	}

	if (munmap(map, statbuf.st_size) < 0) {
		perror(file);
	}
	close(fd);	/* don't leak a descriptor per file; xargs passes many files per run */
	if (count) {
		printf("%s: %d change%s\n", file, count, count == 1 ? "" : "s");
	} else {
		printf("%s: no changes\n", file);
	}
	return(0);
}





int main(argc, argv)
int argc;
char *argv[];
{
	int slen;
	char *from;
	char *to;

	if (argc < 4) {
		fprintf(stderr, "usage: filesed <fromstring> <tostring> <file> [file ...]\n");
		exit(1);
	}

	from = argv[1];
	to = argv[2];
	slen = strlen(from);
	if (slen != strlen(to)) {
		fprintf(stderr, "error: string lengths must be identical\n");
		exit(1);
	}
	if (! slen) {
		fprintf(stderr, "error: zero-length string unacceptable\n");
		exit(1);
	}
	argv += 3;
	argc -= 3;

	while (argc) {
		filesed(*argv, from, to);
		argv++;
		argc--;
	}
	return(0);
}
% cat fixpath.sh
#! /bin/sh -

case "${1}" in
	spool*|bins*)	;;
	*)		exit 1;;
esac

for i in /news/spool/news/N.*; do
	find "${i}" -type f -name 'B.*' -print | xargs ./filesed "$1" "$2" &
done

What happens is I've got a system that looks like this:

% df -k
Filesystem      1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s2a        158783    21626   124455    15%    /
/dev/da0s2h        772075       97   710212     0%    /export/home/u0
/dev/da0s2e        198399   143748    38780    79%    /usr
/dev/da0s2f        119055     8264   101267     8%    /usr/local
/dev/da0s2g       1016303     3078   931921     0%    /var
procfs                  4        4        0   100%    /proc
/dev/vinum/news  14142987  2120003 12022984    15%    /news
/dev/vinum/n0    31821718 14782554 17039164    46%    /news/spool/news/N.00
/dev/vinum/n1    31821718 14921680 16900038    47%    /news/spool/news/N.01
/dev/vinum/n2    31821718 15535917 16285801    49%    /news/spool/news/N.02
/dev/vinum/n3    31821718 14769382 17052336    46%    /news/spool/news/N.03
/dev/vinum/n4    31821718 15435368 16386350    49%    /news/spool/news/N.04
/dev/vinum/n5    31821718 14619211 17202507    46%    /news/spool/news/N.05
/dev/vinum/n6    31821718 15547271 16274447    49%    /news/spool/news/N.06
/dev/vinum/n7    31821718 14721799 17099919    46%    /news/spool/news/N.07
/dev/vinum/n8    31821718        1 31821717     0%    /news/spool/news/N.08

which is an ASUS P2B-DS with the previously mentioned dmesg.  Each "n?"
partition is striped across two 18GB drives, striped across controllers too.
The data on the "n?" partitions is Usenet article data, stored in Matt
Dillon's Diablo format - many articles per file, maybe 10000 files per FS.

To install a new server, I build it and then load each filesystem across the
network.  I can't afford to lose months' worth of data.  The only downside to
this is that the Path: lines are then wrong, since they'll say that the
article came in on "server1" but the data is actually now on "server2" due
to my cross-network-copy.
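
To make the Path: fix concrete, here is a minimal sketch of the same-length
replacement on a plain in-memory buffer rather than a mapped file.  The name
replace_same_len() is made up for illustration; this is not the filesed code
above, just the same idea in isolation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Replace every occurrence of 'from' in buf[0..len) with 'to'.
 * Both strings must have the same nonzero length, so the buffer never
 * grows or shrinks.  Returns the number of replacements, or -1 if the
 * lengths are unusable.  (Hypothetical helper, not part of filesed.c.)
 */
int replace_same_len(char *buf, size_t len, const char *from, const char *to)
{
	size_t slen = strlen(from);
	char *here, *end, *ptr;
	int count = 0;

	if (slen == 0 || slen != strlen(to))
		return -1;
	if (len < slen)
		return 0;

	here = buf;
	end = buf + (len - slen) + 1;	/* one past the last possible match start */

	while (here < end) {
		/* Jump to the next candidate first byte, then verify the rest. */
		ptr = memchr(here, from[0], (size_t)(end - here));
		if (ptr == NULL)
			break;
		if (memcmp(ptr, from, slen) == 0) {
			memcpy(ptr, to, slen);
			count++;
		}
		here = ptr + 1;
	}
	return count;
}
```

Called on a buffer holding "Path: server1!upstream" with from "server1" and
to "server2", it rewrites the host name in place and returns 1.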

Since I'm working in a distributed server environment and occasionally need
to do debugging, I felt it necessary to change these files.  Since this is
a nice fast SMP dual PII/400, and there's lots of drives, the theoretical
limiting factors are the SCSI busses and the CPU.  So I decided to try
running my little filesed program in parallel on all filesystems, maximizing
the concurrency and hopefully maxing out the CPU or the SCSI busses.

Instead, it hangs the $*!*$# system after doing a few thousand files.
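
One way to narrow this down might be to sit somewhere between "serial" and
"everything at once": cap the number of filesystems being worked on at the
same time and see at what level the hang starts.  A minimal sketch, assuming
a hypothetical run_limited() helper and a caller-supplied job callback
(neither is part of the script above; a real job would run find and filesed
for filesystem number n, and demo_job() is only a stand-in):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap one child: 0 if it exited cleanly, 1 if it failed, -1 if wait() failed. */
static int reap_one(void)
{
	int status;

	if (wait(&status) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}

/*
 * Run job(0) .. job(njobs - 1), each in its own child process, with at
 * most 'maxpar' children alive at once.  Returns the number of jobs that
 * failed, or -1 if wait() itself failed.  (Hypothetical helper.)
 */
int run_limited(int njobs, int maxpar, int (*job)(int))
{
	int running = 0, failed = 0, i, r;
	pid_t pid;

	if (maxpar < 1)
		maxpar = 1;
	for (i = 0; i < njobs; i++) {
		if (running >= maxpar) {	/* at the cap: wait for a free slot */
			if ((r = reap_one()) < 0)
				return -1;
			failed += r;
			running--;
		}
		pid = fork();
		if (pid < 0) {			/* could not start this job */
			failed++;
			continue;
		}
		if (pid == 0)			/* child: run the job, skip stdio flush */
			_exit(job(i) == 0 ? 0 : 1);
		running++;
	}
	while (running > 0) {			/* drain whatever is still running */
		if ((r = reap_one()) < 0)
			return -1;
		failed += r;
		running--;
	}
	return failed;
}

/* Hypothetical stand-in job for trying the helper: odd-numbered jobs "fail". */
int demo_job(int n)
{
	return n % 2;
}
```

With maxpar set to 1 this degenerates to the serial case that works; raising
it one step at a time would show whether the hang needs all filesystems busy
or just more than one.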

If you have a suggested test/debug methodology, please let me know.  I can
also arrange for console access if someone wishes to poke at the machine.
I'm also willing to try patches/etc.  I'm just not quite sure what to do.
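
One test that might help split "vinum or SCSI" from "VM" is to take mmap()
out of the picture entirely and do the same edits with plain read() and
write().  A minimal sketch, assuming a hypothetical filesed_rw() that reads
the whole file into memory (fine for a test, not something tuned for a big
spool):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Same job as filesed(), but with read()/lseek()/write() and no mmap().
 * Hypothetical test variant: returns the number of replacements, or -1
 * on error.  'from' and 'to' must be the same nonzero length.
 */
int filesed_rw(const char *file, const char *from, const char *to)
{
	size_t slen = strlen(from), size, got = 0, i;
	struct stat sb;
	char *buf;
	ssize_t n = 0;
	int fd, count = 0;

	if (slen == 0 || slen != strlen(to))
		return -1;
	if ((fd = open(file, O_RDWR)) < 0)
		return -1;
	if (fstat(fd, &sb) < 0 || (buf = malloc((size_t)sb.st_size + 1)) == NULL) {
		close(fd);
		return -1;
	}
	size = (size_t)sb.st_size;

	/* Pull the whole file into memory; read() may return short counts. */
	while (got < size && (n = read(fd, buf + got, size - got)) > 0)
		got += (size_t)n;
	if (n < 0) {
		free(buf);
		close(fd);
		return -1;
	}

	/* Write each replacement back at the offset where the match was found. */
	for (i = 0; got >= slen && i <= got - slen; i++) {
		if (memcmp(buf + i, from, slen) != 0)
			continue;
		memcpy(buf + i, to, slen);	/* keep the buffer in step with the file */
		if (lseek(fd, (off_t)i, SEEK_SET) < 0 ||
		    write(fd, to, slen) != (ssize_t)slen) {
			count = -1;
			break;
		}
		count++;
	}

	free(buf);
	close(fd);
	return count;
}
```

If the parallel run still hangs with this variant, mmap() is probably not the
culprit; if it survives, the VM side looks more suspicious.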

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator			      jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI			   414/342-4847

