Date: Sun, 28 Nov 1999 19:05:12 -0600 (CST)
From: Joe Greco <jgreco@ns.sol.net>
To: mike@sentex.net (Mike Tancsa)
Cc: stable@FreeBSD.ORG, mcgovern@spoon.beta.com
Subject: Re: mmap bugs (was Re: ahc problems (with vinum?))
Message-ID: <199911290105.TAA91433@aurora.sol.net>
In-Reply-To: <4.1.19991128190819.0518bd70@granite.sentex.ca> from Mike Tancsa at "Nov 28, 1999 7:13: 2 pm"

> At 02:45 PM 11/28/99 , Joe Greco wrote:
> >I have certainly beat the $#!+ out of these systems in a variety of ways,
> >and have run into some odd things. Most were traceable to SCSI issues.
> >Some didn't get classified. I'm running vinum in a ten-filesystem config
> >on top of the 18 18GB drives, and I copy in data from another machine. I
> >then have an application which mmap()'s the files, doing search and replace
> >ops on the data. Running this app in parallel causes the system to hang
> >(eventually causing the watchdog to expire and reset the system). Running
> >it serially on one fs at a time doesn't. This is probably the most
> >worrisome of the issues I've seen. If you have a recommended revision of
> >the ahc driver you'd like me to try, let me know.
>
> Can you post more details of the mmap bug you have come across ? It would
> be nice if this were fixed for 3.4. mcgovern@spoon.beta.com is coordinating
> testing of RCs for 3.4. Perhaps this is a problem that someone could be
> fix in time.

That's the problem, I don't really know what it is. I'd sure love to see
it fixed, since anything that can hang a system in such a manner is
unsettling, but I don't really have much of an idea what's causing it. It
could be a vinum thing, it could be some VM thing, it could be my crappy
programming (but userland programs should never puke the kernel).

I'll show you the program, the wrapper script, and a description of the
specific environment and use. I'll also try to get around to doing some
additional debugging, but basically I've been seeing a soft system lockup
(userland processes appear to stop running, but console is responsive to
vty changes, pressing return results in an echo but the underlying program
doesn't seem to receive it and then further keystrokes are not echoed).
The kernel is still sane enough to be running my watchdog code, which will
eventually cause the system to reboot via software.
However, it does a forced termination of the kernel since killing init
doesn't work.

% cat filesed.c
/*
 * filesed.c
 *
 * (c) 1999 Joe Greco and sol.net Network Services. All Rights Reserved.
 *
 * mmap a file, hunting for a string. Replace with an identical-length
 * string. Intended for scouring a spool and replacing Path: hosts after
 * a load-via-disk-copy.
 *
 * filesed 'from' 'to' file [file...]
 */

#include <stdio.h>
#include <stdlib.h>	/* exit() */
#include <string.h>	/* strlen(), memchr(), memcmp(), memcpy() */
#include <unistd.h>	/* close() */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

int filesed(file, from, to)
char *file, *from, *to;
{
	int count = 0;
	int slen = strlen(from);
	struct stat statbuf;
	caddr_t map;
	char *here, *end, *ptr;
	int fd;

	if (stat(file, &statbuf) < 0) {
		perror(file);
		return(-1);
	}

	if ((fd = open(file, O_RDWR, 0)) < 0) {
		perror(file);
		return(-1);
	}

	/* mmap() signals failure with MAP_FAILED; report before close() can touch errno */
	if ((map = mmap(NULL, statbuf.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) {
		perror(file);
		close(fd);
		return(-1);
	}

	/* The mapping survives close(), so release the descriptor now */
	close(fd);

	/* Search and replace. */
	here = map;
	/* end is one past the last offset at which a match can still start */
	end = map + statbuf.st_size - slen + 1;
	while (here < end) {
		ptr = memchr(here, *from, end - here);
		if (! ptr) {
			here = end;
		} else {
			if (! memcmp(ptr, from, slen)) {
				memcpy(ptr, to, slen);
				count++;
			}
			here = ptr + 1;
		}
	}

	if (munmap(map, statbuf.st_size) < 0) {
		perror(file);
	}

	if (count) {
		printf("%s: %d change%s\n", file, count, count == 1 ? "" : "s");
	} else {
		printf("%s: no changes\n", file);
	}
	return(0);
}

int main(argc, argv)
int argc;
char *argv[];
{
	int slen;
	char *from;
	char *to;

	if (argc < 4) {
		fprintf(stderr, "usage: filesed <fromstring> <tostring> <file> [file ...]\n");
		exit(1);
	}
	from = argv[1];
	to = argv[2];
	slen = strlen(from);
	if (slen != strlen(to)) {
		fprintf(stderr, "error: string lengths must be identical\n");
		exit(1);
	}
	if (! slen) {
		fprintf(stderr, "error: zero-length string unacceptable\n");
		exit(1);
	}
	argv += 3;
	argc -= 3;
	while (argc) {
		filesed(*argv, from, to);
		argv++;
		argc--;
	}
	return(0);
}

% cat fixpath.sh
#! /bin/sh -

case "${1}" in
	spool*|bins*)	continue;;
	*)		exit 1;;
esac

for i in /news/spool/news/N.*; do
	find ${i} -type f -name 'B.*' -print | xargs ./filesed $1 $2 &
done

What happens is I've got a system that looks like this:

% df -k
Filesystem       1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s2a         158783    21626   124455    15%    /
/dev/da0s2h         772075       97   710212     0%    /export/home/u0
/dev/da0s2e         198399   143748    38780    79%    /usr
/dev/da0s2f         119055     8264   101267     8%    /usr/local
/dev/da0s2g        1016303     3078   931921     0%    /var
procfs                   4        4        0   100%    /proc
/dev/vinum/news   14142987  2120003 12022984    15%    /news
/dev/vinum/n0     31821718 14782554 17039164    46%    /news/spool/news/N.00
/dev/vinum/n1     31821718 14921680 16900038    47%    /news/spool/news/N.01
/dev/vinum/n2     31821718 15535917 16285801    49%    /news/spool/news/N.02
/dev/vinum/n3     31821718 14769382 17052336    46%    /news/spool/news/N.03
/dev/vinum/n4     31821718 15435368 16386350    49%    /news/spool/news/N.04
/dev/vinum/n5     31821718 14619211 17202507    46%    /news/spool/news/N.05
/dev/vinum/n6     31821718 15547271 16274447    49%    /news/spool/news/N.06
/dev/vinum/n7     31821718 14721799 17099919    46%    /news/spool/news/N.07
/dev/vinum/n8     31821718        1 31821717     0%    /news/spool/news/N.08

which is an ASUS P2B-DS with the previously mentioned dmesg. Each "n?"
partition is striped across two 18GB drives, striped across controllers
too.

The data on the "n?" partitions is Usenet article data, stored in Matt
Dillon's Diablo format - many articles per file, maybe 10000 files per FS.

To install a new server, I build it and then load each filesystem across
the network. I can't afford to lose months' worth of data. The only
downside to this is that the Path: lines are then wrong, since they'll say
that the article came in on "server1" but the data is actually now on
"server2" due to my cross-network-copy. Since I'm working in a distributed
server environment and occasionally need to do debugging, I felt it
necessary to change these files.
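
For anyone who wants to poke at the core operation without a news spool,
the following is a minimal sketch of the same idea: map a file shared and
read/write, then overwrite each match with an equal-length string. It is
not filesed.c itself. The names replace_equal_length() and
replace_in_file(), and the FILESED_DEMO_MAIN guard, are made up for this
illustration, and it assumes an ordinary POSIX system.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite every occurrence of `from` in buf[0..len)
 * with `to`, which must have the same length.  Returns the match count.
 */
int
replace_equal_length(char *buf, size_t len, const char *from, const char *to)
{
	size_t slen = strlen(from);
	size_t i = 0;
	int count = 0;

	if (slen == 0 || slen != strlen(to) || len < slen)
		return 0;
	/* A match can start no later than offset len - slen. */
	while (i + slen <= len) {
		char *p = memchr(buf + i, from[0], len - slen + 1 - i);

		if (p == NULL)
			break;
		if (memcmp(p, from, slen) == 0) {
			memcpy(p, to, slen);
			count++;
		}
		i = (size_t)(p - buf) + 1;
	}
	return count;
}

/*
 * Hypothetical helper: apply the replacement to a file through a shared
 * mapping.  Returns the match count, or -1 on error.
 */
int
replace_in_file(const char *path, const char *from, const char *to)
{
	struct stat sb;
	void *map;
	int fd, count;

	if ((fd = open(path, O_RDWR)) < 0)
		return -1;
	if (fstat(fd, &sb) < 0) {
		close(fd);
		return -1;
	}
	if (sb.st_size == 0) {		/* nothing to map, nothing to change */
		close(fd);
		return 0;
	}
	map = mmap(NULL, (size_t)sb.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	close(fd);			/* the mapping outlives the descriptor */
	if (map == MAP_FAILED)
		return -1;
	count = replace_equal_length(map, (size_t)sb.st_size, from, to);
	if (munmap(map, (size_t)sb.st_size) < 0)
		return -1;
	return count;
}

#ifdef FILESED_DEMO_MAIN
/* Build with -DFILESED_DEMO_MAIN for a standalone command. */
int
main(int argc, char *argv[])
{
	int i;

	if (argc < 4) {
		fprintf(stderr, "usage: %s from to file [file ...]\n", argv[0]);
		return 1;
	}
	for (i = 3; i < argc; i++)
		printf("%s: %d\n", argv[i],
		    replace_in_file(argv[i], argv[1], argv[2]));
	return 0;
}
#endif
```

Keeping the buffer scan separate from the mmap handling means the search
logic can be checked on an ordinary array, including a match that ends
exactly at the last byte.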

Since this is a nice fast SMP dual PII/400, and there's lots of drives,
the theoretical limiting factors are the SCSI busses and the CPU. So I
decided to try running my little filesed program in parallel on all
filesystems, maximizing the concurrency and hopefully maxxing out the CPU
or the SCSI busses.

Instead, it hangs the $*!*$# system after doing a few thousand files.

If you have a suggested test/debug methodology, please let me know. I can
also arrange for console access if someone wishes to poke at the machine.
I'm also willing to try patches/etc. I'm just not quite sure what to do.

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator                           jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI                       414/342-4847


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
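
One possible way to narrow down the parallel case is a reduced load
generator that takes vinum, the spool data, and xargs out of the picture:
fork several workers, and have each one dirty a shared file mapping and
flush it. The sketch below is only that, a sketch. The names
dirty_scratch_file() and run_parallel(), the /tmp scratch path, the
MMAP_STRESS_DEMO_MAIN guard, and the sizes are all assumptions chosen for
illustration, and nothing here is known to reproduce the hang described
above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical worker: create an unlinked scratch file of `bytes` bytes,
 * map it shared, write one byte per page, and flush it synchronously.
 * Returns 0 on success, -1 on any failure.
 */
int
dirty_scratch_file(size_t bytes)
{
	char path[] = "/tmp/mmap_stress_XXXXXX";
	long pagesz = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);
	char *map;
	size_t off;

	if (fd < 0)
		return -1;
	unlink(path);			/* file vanishes once it is unmapped */
	if (ftruncate(fd, (off_t)bytes) < 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping outlives the descriptor */
	if (map == MAP_FAILED)
		return -1;
	for (off = 0; off < bytes; off += (size_t)pagesz)
		map[off] = 'x';		/* touch every page */
	if (msync(map, bytes, MS_SYNC) < 0) {
		munmap(map, bytes);
		return -1;
	}
	return munmap(map, bytes);
}

/*
 * Fork `nworkers` children that each run the worker above, then wait for
 * all of them.  Returns how many children exited with status 0.
 */
int
run_parallel(int nworkers, size_t bytes)
{
	int i, status, ok = 0;

	for (i = 0; i < nworkers; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;		/* could not start more workers */
		if (pid == 0)
			_exit(dirty_scratch_file(bytes) == 0 ? 0 : 1);
	}
	while (wait(&status) > 0)
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	return ok;
}

#ifdef MMAP_STRESS_DEMO_MAIN
/* Build with -DMMAP_STRESS_DEMO_MAIN for a standalone command. */
int
main(void)
{
	/* Ten workers and 64MB each are arbitrary illustrative values. */
	int ok = run_parallel(10, (size_t)64 << 20);

	printf("%d of 10 workers finished cleanly\n", ok);
	return ok == 10 ? 0 : 1;
}
#endif
```

Pointing the scratch path at different filesystems, or running with one
worker versus many, would be the obvious knobs to turn when comparing the
serial and parallel behaviour.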