Date: Mon, 7 Jul 2008 21:59:13 GMT
From: Andrew Hammond <andrew.george.hammond@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: kern/125382: ENOSPC may be misleading, consider EIO
Message-ID: <200807072159.m67LxDnd002481@www.freebsd.org>
Resent-Message-ID: <200807072200.m67M01Rt026647@freefall.freebsd.org>

>Number: 125382
>Category: kern
>Synopsis: ENOSPC may be misleading, consider EIO
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Mon Jul 07 22:00:01 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator: Andrew Hammond
>Release: 6.2 amd64
>Organization: AdECN, a Microsoft Company
>Environment:
FreeBSD db1.sjc.adecn.com 6.2-RELEASE-p6 FreeBSD 6.2-RELEASE-p6 #1: Thu Jul 19 09:21:10 PDT 2007 root@qaipc1.qa1.adecn.com:/usr/obj/usr/src/sys/ADECNDB amd64
>Description:
Found the following error message in PostgreSQL logs:

vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device

Didn't make sense since device is only at 18% usage. Got on pgsql-hackers mailing list (subject "the un-vacuumable table", thread starts at http://archives.postgresql.org/pgsql-hackers/2008-06/msg00922.php).

> Have you looked into the machine's kernel log to see if there is any
> evidence of low-level distress (hardware or filesystem level)? I'm
> wondering if ENOSPC is being reported because it is the closest
> available errno code, but the real problem is something different than
> the error message text suggests. Other than the errno the symptoms
> all look quite a bit like a bad-sector problem ...

Uhm, just for the record FileWrite returns error messages which get printed this way for two reasons other than write(2) returning ENOSPC:

1) if FileAccess has to reopen the file then open(2) could return an error. I don't see how open returns ENOSPC without O_CREAT (and that's cleared for reopening)

2) If write(2) returns < 0 but doesn't set errno. That also seems like a strange case that shouldn't happen, but perhaps there's some reason it can.
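
Case 2 above can be illustrated with a minimal C sketch. It is not the PostgreSQL FileWrite source: checked_write, write_fn and the two fake writers are hypothetical names introduced only for this example. The assumption is that the caller clears errno before the write and, if the write does not complete and errno is still zero, guesses ENOSPC. With that pattern, "No space left on device" can appear in a log even though the kernel never returned ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Same signature as write(2), so a real or a fake writer can be passed in. */
typedef ssize_t (*write_fn)(int fd, const void *buf, size_t len);

/*
 * Hypothetical wrapper: returns 0 if all len bytes were written, -1 otherwise.
 * If the write is incomplete and errno was left untouched, errno is set to
 * ENOSPC as a guess (an assumption of this sketch, not a diagnosis).
 */
int checked_write(write_fn do_write, int fd, const void *buf, size_t len)
{
    assert(do_write != NULL);

    errno = 0;
    ssize_t n = do_write(fd, buf, len);
    if (n == (ssize_t)len)
        return 0;
    if (errno == 0)
        errno = ENOSPC;
    return -1;
}

/* Fake writer: reports a short write and leaves errno alone. */
ssize_t short_writer(int fd, const void *buf, size_t len)
{
    (void)fd;
    (void)buf;
    return (ssize_t)(len / 2);
}

/* Fake writer: fails the way a device error would be expected to. */
ssize_t eio_writer(int fd, const void *buf, size_t len)
{
    (void)fd;
    (void)buf;
    (void)len;
    errno = EIO;
    return -1;
}
```

Passing write itself as do_write gives the normal behaviour. With short_writer the caller sees ENOSPC although no ENOSPC was ever returned by the writer, and with eio_writer the original EIO is preserved.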

On Thu, Jul 3, 2008 at 10:57 PM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
>> Have you looked into the machine's kernel log to see if there is any
>> evidence of low-level distress (hardware or filesystem level)? I'm
>> wondering if ENOSPC is being reported because it is the closest
>> available errno code, but the real problem is something different than
>> the error message text suggests. Other than the errno the symptoms
>> all look quite a bit like a bad-sector problem ...

da1 is the storage device where the PGDATA lives.

Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929ba560:6810 timed out for ccb 0xffffff0000e20000 (req->ccb 0xffffff0000e20000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b90c0:6811 timed out for ccb 0xffffff0001081000 (req->ccb 0xffffff0001081000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b9f88:6812 timed out for ccb 0xffffff0000d93800 (req->ccb 0xffffff0000d93800)
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929ba560:6810 function 0
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929bcc90:6813 timed out for ccb 0xffffff03e132dc00 (req->ccb 0xffffff03e132dc00)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929ba560:6810
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929ba560:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b90c0:6811 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b90c0:6811
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b90c0:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b9f88:6812 function 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b9f88:6812
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b9f88:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929bcc90:6813 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929bcc90:6813
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929bcc90:0 completed
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 65 1b 71 a0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)

Tom Lane writes:

Also, I suggest filing a bug with your kernel distributor --- ENOSPC was a totally misleading error code here. Seems like EIO would be more appropriate. They'll probably want to see the kernel log.

regards, tom lane
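
The kernel log above prints each failed command as a raw WRITE(16) CDB. As a reading aid, here is a small C sketch that decodes one. It assumes the standard 16-byte layout (opcode 0x8a in byte 0, a big-endian 64-bit logical block address in bytes 2 to 9, and a big-endian 32-bit transfer length in bytes 10 to 13). The names decode_write16, struct write16 and first_cdb are made up for this example and do not come from the FreeBSD sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCSI_WRITE_16 0x8a  /* opcode seen in byte 0 of the logged CDBs */

struct write16 {
    uint64_t lba;     /* starting logical block address */
    uint32_t blocks;  /* transfer length, in logical blocks */
};

/* Decode a 16-byte WRITE(16) CDB. Returns 0 on success, -1 on a wrong opcode. */
int decode_write16(const uint8_t cdb[16], struct write16 *out)
{
    assert(cdb != NULL && out != NULL);

    if (cdb[0] != SCSI_WRITE_16)
        return -1;

    out->lba = 0;
    for (int i = 2; i <= 9; i++)        /* bytes 2..9: big-endian LBA */
        out->lba = (out->lba << 8) | cdb[i];

    out->blocks = 0;
    for (int i = 10; i <= 13; i++)      /* bytes 10..13: big-endian length */
        out->blocks = (out->blocks << 8) | cdb[i];

    return 0;
}

/* First logged command: 8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0 */
const uint8_t first_cdb[16] = {
    0x8a, 0x00, 0x00, 0x00, 0x00, 0x01, 0x6c, 0x99,
    0x09, 0xc0, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00
};
```

Decoded this way, the first logged command is a write of 0x20 (32) blocks starting at LBA 0x16c9909c0. Nothing in the command itself has anything to do with free space, which fits the suggestion that EIO would describe the failure better than ENOSPC.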
