Date: Mon, 7 Jul 2008 21:59:13 GMT
From: Andrew Hammond <andrew.george.hammond@gmail.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: kern/125382: ENOSPC may be misleading, consider EIO
Message-ID: <200807072159.m67LxDnd002481@www.freebsd.org>
Resent-Message-ID: <200807072200.m67M01Rt026647@freefall.freebsd.org>

>Number:         125382
>Category:       kern
>Synopsis:       ENOSPC may be misleading, consider EIO
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 07 22:00:01 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator:     Andrew Hammond
>Release:        6.2 amd64
>Organization:
AdECN, a Microsoft Company
>Environment:
FreeBSD db1.sjc.adecn.com 6.2-RELEASE-p6 FreeBSD 6.2-RELEASE-p6 #1: Thu Jul 19 09:21:10 PDT 2007 root@qaipc1.qa1.adecn.com:/usr/obj/usr/src/sys/ADECNDB amd64
>Description:
Found the following error message in PostgreSQL logs:

vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device

Didn't make sense since device is only at 18% usage. Got on pgsql-hackers mailing list (subject "the un-vacuumable table", thread starts at http://archives.postgresql.org/pgsql-hackers/2008-06/msg00922.php).

> Have you looked into the machine's kernel log to see if there is any
> evidence of low-level distress (hardware or filesystem level)? I'm
> wondering if ENOSPC is being reported because it is the closest
> available errno code, but the real problem is something different than
> the error message text suggests. Other than the errno the symptoms
> all look quite a bit like a bad-sector problem ...

Uhm, just for the record FileWrite returns error messages which get printed this way for two reasons other than write(2) returning ENOSPC:

1) if FileAccess has to reopen the file then open(2) could return an error. I don't see how open returns ENOSPC without O_CREAT (and that's cleared for reopening)

2) If write(2) returns < 0 but doesn't set errno. That also seems like a strange case that shouldn't happen, but perhaps there's some reason it can.
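
The second case above can be sketched in C. This is only a minimal illustration, not PostgreSQL's actual FileWrite: the names effective_errno and checked_write are hypothetical, and the sketch assumes the caller clears errno before calling write(2) and falls back to ENOSPC whenever the write comes up short or fails without errno being set. Under that assumption, a "No space left on device" message can appear even though the device is not full.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: decide which errno a caller would report for a
 * write that returned `returned` bytes out of `requested`, given the
 * errno value observed right after the call.
 * Returns 0 when the write completed in full.
 */
static int effective_errno(ssize_t returned, size_t requested, int saved_errno)
{
    if (returned >= 0 && (size_t)returned == requested)
        return 0;               /* full write, nothing to report */
    if (saved_errno != 0)
        return saved_errno;     /* the kernel told us what went wrong */
    return ENOSPC;              /* assumption: no errno means disk full */
}

/*
 * Hypothetical wrapper around write(2) using the rule above.
 * Returns the byte count on a full write, or -1 with errno set.
 */
static ssize_t checked_write(int fd, const void *buf, size_t len)
{
    errno = 0;                  /* so a stale value is not misread */
    ssize_t n = write(fd, buf, len);
    int err = effective_errno(n, len, errno);

    if (err == 0)
        return n;
    errno = err;
    return -1;
}
```

With a rule like this, a caller that prints strerror(errno) cannot tell a genuinely full device apart from a short write caused by something else, which is why the kernel log is worth checking.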

On Thu, Jul 3, 2008 at 10:57 PM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
>> Have you looked into the machine's kernel log to see if there is any
>> evidence of low-level distress (hardware or filesystem level)? I'm
>> wondering if ENOSPC is being reported because it is the closest
>> available errno code, but the real problem is something different than
>> the error message text suggests. Other than the errno the symptoms
>> all look quite a bit like a bad-sector problem ...

da1 is the storage device where the PGDATA lives.

Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929ba560:6810 timed out for ccb 0xffffff0000e20000 (req->ccb 0xffffff0000e20000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b90c0:6811 timed out for ccb 0xffffff0001081000 (req->ccb 0xffffff0001081000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b9f88:6812 timed out for ccb 0xffffff0000d93800 (req->ccb 0xffffff0000d93800)
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929ba560:6810 function 0
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929bcc90:6813 timed out for ccb 0xffffff03e132dc00 (req->ccb 0xffffff03e132dc00)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929ba560:6810
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929ba560:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b90c0:6811 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b90c0:6811
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b90c0:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b9f88:6812 function 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b9f88:6812
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b9f88:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929bcc90:6813 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929bcc90:6813
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929bcc90:0 completed
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 65 1b 71 a0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)

Tom Lane writes:

Also, I suggest filing a bug with your kernel distributor --- ENOSPC was a totally misleading error code here. Seems like EIO would be more appropriate. They'll probably want to see the kernel log.

regards, tom lane
