Date: Sun, 5 Apr 2009 03:37:36 -0400 From: Michael Conlen <m@obmail.net> To: freebsd-fs@freebsd.org Subject: Bizarre IO errors Message-ID: <8FEAE0BA-5723-437C-8215-D2AEC7783713@obmail.net>
next in thread | raw e-mail | index | archive | help
First the background FreeBSD nfs4.tarhost.com 7.1-RELEASE-p3 FreeBSD 7.1-RELEASE-p3 #0: Sat Mar 7 00:15:02 EST 2009 root@nfs4.tarhost.com:/usr/obj/usr/src/ sys/GENERIC amd64 (Two of these processors) CPU: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (2500.10-MHz K8- class CPU) Memory usable memory = 17165377536 (16370 MB) avail memory = 16626044928 (15855 MB) The Disk system aac0: <Adaptec RAID 52445> mem 0xd8000000-0xd81fffff irq 48 at device 0.0 on pci1 aac0: Enabling 64-bit address support aac0: Enable Raw I/O aac0: Enable 64-bit array aac0: New comm. interface enabled aac0: [ITHREAD] aac0: Adaptec 52445, aac driver 2.0.0-1 Controller Status : Optimal Channel description : SAS/SATA Controller Model : Adaptec 52445 Controller Serial Number : 8A321083874 Physical Slot : 5 Temperature : 60 C/ 140 F (Normal) Installed memory : 512 MB Copyback : Disabled Background consistency check : Disabled Automatic Failover : Enabled Global task priority : High Performance Mode : Default/Dynamic Defunct disk drive count : 0 Logical devices/Failed/Degraded : 2/0/0 Logical device number 0 Logical device name : system RAID level : 1 Status of logical device : Optimal Size : 953334 MB Read-cache mode : Enabled Write-cache mode : Enabled (write-back) Write-cache setting : Enabled (write-back) Partitioned : Yes Protected by Hot-Spare : Yes Dedicated Hot-Spare : 0,4 Dedicated Hot-Spare : 0,12 Bootable : Yes Failed stripes : No -------------------------------------------------------- Logical device segment information -------------------------------------------------------- Segment 0 : Present (0,0) Segment 1 : Present (0,8) Logical device number 1 Logical device name : data RAID level : 10 Status of logical device : Optimal Size : 5720064 MB Stripe-unit size : 256 KB Read-cache mode : Enabled Write-cache mode : Disabled (write-through) Write-cache setting : Enabled (write-back) when protected by battery Partitioned : Yes Protected by Hot-Spare : No Bootable : No Failed stripes : No Filesystem Size Used Avail Capacity Mounted on /dev/aacd0s1a 15G 417M 14G 3% / devfs 1.0K 1.0K 0B 100% /dev /dev/aacd0s1e 15G 30K 14G 0% /tmp /dev/aacd0s1f 62G 3.2G 54G 6% /usr /dev/aacd0s1g 762G 354M 701G 0% /usr/local /dev/aacd0s1d 15G 3.5G 11G 24% /var /dev/aacd1p1 5.3T 874G 4.0T 18% /usr/local/export This system is setup as a NFS File server. It handled several stress tests for two weeks before going online. The files were transfered to the system and it was placed online and ran fine for a few days. There are 10 web servers which access the file server, but the file servers have a cluster of caches in front of them so the load isn't too bad. I see peaks around 60Mbit/sec of traffic from the NFS server when taking backups and 30 MBit/sec otherwise. IO is minimized due to the large amount of available ram. It takes about three hours before disk caching fills the available memory, so there's not a lot of really hot data it's going to the disks for, mostly just writes. iostat generally reports about 1 MB/sec, maybe 2 at most. During stress tests I'd seen in excess of 800 MB/sec though usually the 300-400 MB/sec range. Now for the strange. First notice the first two g_vfs_done lines. The offset is negative. After repeating the second 290 times we see this odd pattern of the error taking many lines to display. After that the logs continue occasionally showing a full error line then followed by a line broken up as below. The first 8 or so lines after the first have the same offset. After that the offset switches. to offset=1666490991559323648. That's in excess of 1 Exabyte. I've got a lot of disk but not that much. That seems to indicate that it's not really an IO error since the offset is way off the end of what could possibly be disk (Can someone confirm or deny that?). The offset occasionally changes to a negative number or some other value but this particular offset is repeated over and over. Between 1 and 2 AM over 2 GB of this log was generated. About two minutes after this started NFS stopped responding to the NFS clients in a prompt manner. Once the server was restarted it runs fine for sometime but this pattern soon (within minutes) repeats. I have noticed that several read errors seem to get logged several at a time then there will be occasional pauses but this is a sample of just the "last message repeated" which have 10 or more times repeated At the moment I can only assume that the negative and exceedingly large offsets are a symptom of something beyond simply "disk problems" since the messages logged indicate the OS was attempting to do something it shouldn't. Can this be confirmed or denied. The controller reports no problems. Is there anything else you can point me to? Start of problems in syslog Apr 4 00:00:00 nfs4 newsyslog[986]: logfile turned over due to size>100K Apr 4 01:11:59 nfs4 rpc.statd: unmon request from localhost, no matching monitor Apr 4 14:10:44 nfs4 rpc.statd: unmon request from localhost, no matching monitor Apr 4 14:10:44 nfs4 rpc.statd: unmon request from localhost, no matching monitor Apr 5 01:08:48 nfs4 kernel: g_vfs_done():aacd1p1[READ(offset=-6163487656308658176, length=32768)]error = 5 Apr 5 01:08:48 nfs4 kernel: g_vfs_done():aacd1p1[READ(offset=-2344660732015456256, length=32768)]error = 5 Apr 5 01:08:48 nfs4 last message repeated 290 times Apr 5 01:08:48 nfs4 kernel: g_vfs_done(): Apr 5 01:08:48 nfs4 kernel: aacd1p Apr 5 01:08:48 nfs4 kernel: 1[RE Apr 5 01:08:48 nfs4 kernel: AD( Apr 5 01:08:48 nfs4 kernel: off Apr 5 01:08:48 nfs4 kernel: set Apr 5 01:08:48 nfs4 kernel: =- Apr 5 01:08:48 nfs4 kernel: 234 Apr 5 01:08:48 nfs4 kernel: 466 Apr 5 01:08:48 nfs4 kernel: 073 Apr 5 01:08:48 nfs4 kernel: 201 Apr 5 01:08:48 nfs4 kernel: 545 Apr 5 01:08:48 nfs4 kernel: 6256, Apr 5 01:08:48 nfs4 kernel: len Apr 5 01:08:48 nfs4 kernel: gth Apr 5 01:08:48 nfs4 kernel: =32 Apr 5 01:08:48 nfs4 kernel: 76 Apr 5 01:08:48 nfs4 kernel: 8)] Apr 5 01:08:48 nfs4 kernel: err Apr 5 01:08:48 nfs4 kernel: or Apr 5 01:08:48 nfs4 kernel: = 5 Apr 5 01:08:48 nfs4 kernel: Log of "message repeated" with more than 9 times repeated. Apr 5 01:57:27 nfs4 last message repeated 75 times Apr 5 01:57:28 nfs4 last message repeated 434 times Apr 5 01:57:38 nfs4 last message repeated 18848 times Apr 5 01:57:43 nfs4 last message repeated 9894 times Apr 5 01:57:45 nfs4 last message repeated 435 times Apr 5 01:57:45 nfs4 last message repeated 105 times Apr 5 01:57:45 nfs4 last message repeated 433 times Apr 5 01:57:45 nfs4 last message repeated 303 times Apr 5 01:57:46 nfs4 last message repeated 421 times
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?8FEAE0BA-5723-437C-8215-D2AEC7783713>