From: Matthias Andree
To: freebsd-stable@freebsd.org, freebsd-scsi@freebsd.org
Cc: re@freebsd.org
Date: Wed, 19 Jan 2005 14:49:40 +0100
Subject: 4.11-RC3: SCSI+UFS+softupdates corruption (write cache DISABLED!)

Hi,

I had a FreeBSD 4.11-RC3 machine reboot without advance notice; the last
thing the network syslogd captured was an attempted ahc0 (Adaptec 2940 UW
Pro) recovery.

Syslog excerpt as captured by the remote machine, with date and
"hostname /kernel:" and card state dumps removed (can be provided if
necessary). I wonder if the SCSI error recovery attempts caused the
reboot. I have no hints either way, but this machine is otherwise stable.

13:28:35 ahc0: Recovery Initiated
13:28:53 (da0:ahc0:0:0:0): SCB 0x16 - timed out
13:28:53 sg[0] - Addr 0x6da3800 : Length 2048
13:28:53 (da0:ahc0:0:0:0): Other SCB Timeout
13:28:53 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:28:53 ahc0: Recovery Initiated
13:29:02 (da0:ahc0:0:0:0): SCB 0x1b - timed out
13:29:04 (da0:ahc0:0:0:0): BDR message in message buffer
13:29:04 ahc0: Timedout SCBs already complete. Interrupts may not be functioning.
13:29:04 ahc0: Recovery Initiated
13:29:16 Kernel Free SCB list: 9 4 15 20
13:29:17 sg[7] - Addr 0x3bea000 : Length 4096
13:29:18 ahc0: Issued Channel A Bus Reset. 25 SCBs aborted

When the machine came back up, it remained in single-user mode because of
a softupdates inconsistency that fsck reported:

| # fsck -p /usr
| /dev/da0s1g: DIRECTORY CORRUPTED I=175105 OWNER=root MODE=40755
| /dev/da0s1g: SIZE=512 MTIME=Jan 18 15:14 2005
| /dev/da0s1g: DIR=?
|
| /dev/da0s1g: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.

I have not yet run fsck for interactive repair, because I want to know
what is going on here and allow debugging this.

At the time of the crash, these tasks were running:

1. amanda was running a dump(8)
2. I was installing manpages from /usr/src/share/man/man4
3. a cvsup for the ports tree was running (this is likely related to the
   problem)

| # fsdb -r /dev/da0s1g
| fsdb (inum: 2)> inode 175105
| current inode: directory
| I=175105 MODE=40755 SIZE=512
| MTIME=Jan 18 15:14:48 2005 [0 nsec]
| CTIME=Jan 18 15:14:48 2005 [0 nsec]
| ATIME=Jun 19 03:05:43 2003 [0 nsec]
| OWNER=root GRP=wheel LINKCNT=2 FLAGS=0 BLKCNT=4 GEN=4e5151f9
| fsdb (inum: 175105)> cd ..
| component `..': fsdb: name `..' not found in current inode directory

I checked with camcontrol: the write cache is off (see below), but the
queue algorithm modifier is on and cannot be switched off.

Digging through the old structures with find reveals:

| 175101 4 drwxr-xr-x  3 root wheel  512 Sep  1  2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd
| 175102 4 drwxr-xr-x  2 root wheel  512 Sep  1  2002 /usr/X11R6/lib/perl5/site_perl/5.005/i386-freebsd/auto
| 175103 4 drwxr-xr-x  5 root wheel  512 Aug 23  2002 /usr/sup
| 175104 4 drwxr-xr-x  2 root wheel  512 Jan 19 13:29 /usr/sup/src-all
> 175105 4 drwxr-xr-x  2 root wheel  512 Jan 18 15:14 /usr/sup/ports-all
| 175106 4 drwxr-xr-x  2 root wheel  512 Jan 18 15:14 /usr/sup/doc-all
| 175107 4 drwxr-xr-x 22 root wheel 1024 Sep 28 19:47 /usr/doc
| 175108 4 drwxr-xr-x  6 root wheel  512 Dec 19 13:26 /usr/doc/de_DE.ISO8859-1
| 175109 4 drwxr-xr-x  5 root wheel  512 Dec 27  2003 /usr/doc/de_DE.ISO8859-1/books

And, as expected:

| # ls -la /usr/sup/ports-all/
| #

Why can, under such circumstances, a softupdates file system become so
corrupt that fsck -p cannot fix it, and end up with directories lacking
. and ..? Is this a kernel/softupdates bug?

How can this directory become empty? locate has this information
recorded:

/usr/sup/ports-all
/usr/sup/ports-all/#cvs.cvsup-2279.0
/usr/sup/ports-all/checkouts.cvs:.

so apparently three (checkouts.cvs:., . and ..) or four files (perhaps
the # file) have disappeared. I'm not sure if fsck will revive them; I
want to avoid destroying data useful for debugging.

Is the Queue Algorithm Modifier a problem?
(see below) I cannot set this to 0 on this drive, "camcontrol: error
sending mode select command" with -P0 and -P3. (Micropolis 4345WS)

How do I go about providing the file system metadata so someone can take
a look at it? The file system is 3.5 G in size, so anything that goes
beyond meta data is not feasible. Providing SSH access to the failed
machine may work though if I'm sent your OpenSSH v2-format key.

# camcontrol inquiry da0
pass0: Fixed Direct Access SCSI-2 device
pass0: Serial Number 77HT45XXXX
pass0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit), Tagged Queueing Enabled

# camcontrol modepage da0 -m8
IC: 0
ABPF: 0
CAP: 0
DISC: 0
SIZE: 0
WCE: 0
MF: 0
RCD: 0
...

# camcontrol modepage da0 -m10
RLEC: 0
Queue Algorithm Modifier: 1
QErr: 0
DQue: 0
...

-- 
Matthias Andree
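
For anyone who wants to repeat the two mode page checks quoted above on
their own drives, the fields can be pulled out of the camcontrol output
mechanically. This is only a minimal sketch, assuming the "NAME: value"
line format shown in the modepage output above; modepage_field is a
hypothetical helper name, not part of camcontrol.

```shell
#!/bin/sh
# modepage_field NAME
# Hypothetical helper: reads `camcontrol modepage` text on stdin and prints
# the value of the field whose label (the text before the colon) is NAME.
# Returns non-zero if no such field is present in the input.
modepage_field() {
    awk -F': *' -v name="$1" '
        {
            key = $1
            sub(/^[ \t]+/, "", key)   # tolerate leading whitespace
        }
        key == name { print $2; found = 1; exit }
        END { exit !found }
    '
}

# Intended use on a live system (not run here):
#   camcontrol modepage da0 -m8  | modepage_field WCE
#   camcontrol modepage da0 -m10 | modepage_field "Queue Algorithm Modifier"
```

Fed the output quoted in this message, the first call would print the WCE
value (0, write cache off) and the second the Queue Algorithm Modifier
value (1). Lines without a colon, such as the "..." placeholders, are
simply skipped.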