Date: Sun, 1 Jun 2008 23:40:23 -0700
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Andrew Hill <lists@thefrog.net>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS lockup in "zfs" state
Message-ID: <20080602064023.GA95247@eos.sc1.parodius.com>
In-Reply-To: <16a6ef710806012304m48b63161oee1bc6d11e54436a@mail.gmail.com>
References: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net>
 <20080518124217.GA16222@eos.sc1.parodius.com>
 <93F07874-8D5F-44AE-945F-803FFC3B9279@thefrog.net>
 <16a6ef710806012304m48b63161oee1bc6d11e54436a@mail.gmail.com>

On Mon, Jun 02, 2008 at 04:04:12PM +1000, Andrew Hill wrote:
> On Mon, May 19, 2008 at 1:11 AM, Andrew Hill <lists@thefrog.net> wrote:
>
> > i tend to find that the timeouts occur on one or two disks at once - e.g.
> > ad0 and 2 will complain of timeouts, and the system locks up shortly
> > thereafter...
>
> after spitting out the usual errors from ad0 and ad2 (in this case) with
> TIMEOUTs and subsequent FAILUREs on READ_DMA[48] and WRITE_DMA[48]...
>
> i got the following panic
>
> vm_fault: pager read error, pid 1552 (tlsmgr)
> ad0: FAILURE - READ_DMA48 timed out LBA=352903900
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
> ad2: FAILURE - WRITE_DMA timed out LBA=239717693
> panic: ZFS: I/O failure (write on <unknown> off 0: zio 0xffffff001d47c810
> [L0 ZIL intent log] b000L/b000P DVA[0]=<0:c807795000:d000> zilog
> uncompressed LE contiguous birth=750230 fill=0
> cksum=69f76525a84e1816:f6d86fe1d94cd68c:39:8af): error 5
> KDB: enter: panic
> [thread pid 72 tid 100071 ]
> Stopped at kdb_enter_why+0x3d: movq $0,0x39b248(%rip)
> db>

I would say the ZFS crash is a result of the ad0/ad2 timeouts. The ZIL
log shows a hard checksum failure in the ZIL, which indicates a serious
problem -- very likely hardware-related (or rather, at a lower level
than ZFS).

You've read this already, but maybe you missed the DMA error part:

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

The DMA errors can actually be legitimate too -- it's very hard to tell
whether they're spurious (e.g. a FreeBSD bug) or real. If the problem is
reproducible, that's convenient, because it makes it easier to provide
you additional help.

I really need to sit down and write a huge HOWTO doc for people on how
to diagnose whether or not their disks or cables are bad, etc... It's a
very hard thing to document, because everyone's situation is different.

The first piece to start with is simplest, though: install
ports/sysutils/smartmontools and provide the output of
"smartctl -a /dev/ad0" and the same for /dev/ad2. Actual disk errors
will very likely show up there in one of the counters, or in the SMART
log. I'd personally like to see the output from smartctl, because it's
something you can do while the system is up/working.

The next step would involve replacing your cables. If the problem
continues, you've at least removed one piece of the puzzle.

Next, replace the disks -- especially if they were bought at the same
time, and are from the same vendor. Hard disk vendors are known to have
bad batches of disks. For sake of example, I just had two Western
Digital disks (which I bought at the same time) fail a short I/O test,
returning errors at different LBAs (blocks). The 2nd one only started
showing problems a few weeks after the first. I obviously got both of
them RMA'd.

Finally, replace the controller or motherboard. Some people have
reported success with this.

> generally the lockups don't result in a panic (at least not in the short
> term of 5-10 minutes), so i can't be sure that this panic is necessarily
> caused by the same problem, but thought it might be worth posting in case it
> gives an indication of the location/cause of the deadlock

The DMA timeout errors you've seen, others have seen as well --
including me -- even when the hardware, disks, cabling, and controllers
are in a 100% working state. (Even switching OSes results in no errors,
indicating there is a problem with FreeBSD in some way.)

If the problem is reproducible, you should get in contact with Scott
Long and let him poke at things. (I mentioned this last time. :-) ) I
myself am not familiar enough with the FreeBSD kernel, the device
drivers, or working with the kernel at such a low level to debug things
of this nature.

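For the smartctl step above, here is a rough sketch of how one might skim
the attribute table in saved "smartctl -a" output for counters worth a
closer look. The function name, the choice of watched attributes, and the
sample text are all illustrative assumptions (the sample is made up, not
output from the system discussed here), and attribute names vary by
vendor, so treat it as a starting point rather than a diagnosis.

```python
import re

# Attribute names commonly reported by smartmontools; this short watch list
# is an assumption for illustration, not an exhaustive or authoritative set.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def suspicious_attributes(smartctl_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes whose
    raw value is greater than zero in the SMART attribute table."""
    found = {}
    for line in smartctl_text.splitlines():
        # Table rows have 10 columns; RAW_VALUE is last and may contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and non-table lines
        name, raw = fields[1], fields[9]
        match = re.match(r"\d+", raw)
        if name in watched and match and int(match.group()) > 0:
            found[name] = int(match.group())
    return found


# Made-up sample in the usual attribute-table layout (NOT real output).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE))
```

To use it on a real disk, save the output first (for example
"smartctl -a /dev/ad0 > ad0.txt") and pass the file's contents to
suspicious_attributes(). Non-zero CRC error counts tend to point toward
cabling, while reallocated or pending sectors point toward the disk
itself, but neither is conclusive on its own.
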
> unfortunately i couldn't get a backtrace or core dump for 'political'
> reasons (the system was required for use by others) but i'll see if i can
> get a panic happening after-hours to get some more info...

I can't tell you what to do or how to do your job, but honestly you
should be pulling this system out of production and replacing it with a
different one, or a different implementation, or a different OS. Your
users/employees are probably getting ticked off at the crashes, and it
probably irritates you too. The added benefit is that you could get
Scott access to the box.

-- 
| Jeremy Chadwick                                  jdc at parodius.com |
| Parodius Networking                         http://www.parodius.com/ |
| UNIX Systems Administrator                    Mountain View, CA, USA |
| Making life hard for others since 1977.                PGP: 4BD6C0CB |