From: Ruslan Kovtun
Organization: Home
To: freebsd-fs@freebsd.org
Cc: Andrew Hill
Reply-To: yalur@mail.ru
Date: Mon, 2 Jun 2008 22:31:45 +0300
Subject: Re: ZFS lockup in "zfs" state
Message-Id: <200806022231.46079.yalur@mail.ru>
In-Reply-To: <20080602064023.GA95247@eos.sc1.parodius.com>
References: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net>
 <16a6ef710806012304m48b63161oee1bc6d11e54436a@mail.gmail.com>
 <20080602064023.GA95247@eos.sc1.parodius.com>

Hi.

I hit the same problem very often with an HDD (READ_DMA UDMA ICRC error)
that is in a ZFS pool. Previously this HDD was in the mirror ar0, not in
a ZFS pool. It sometimes failed there too, but without any panic: it was
only detached from the mirror. After I added this HDD to the ZFS pool,
the problem appeared.
I am sure that this is a problem with the hard disk.

Smartmontools notified me by mail that UDMA_CRC_Error_Count had increased
after the HDD failure, and according to smartctl the HDD has a hardware
problem. I replaced the cable and tried connecting this HDD to another
port, but with no result: it is 100% a hard disk problem.

I cannot get a kernel core dump from the panic:
savecore: no dumps found
:(

Only logs are available. In the log file:

Jun  1 10:43:11 yalur kernel: ad16: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=233909187
Jun  1 10:43:20 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: WARNING - SET_MULTI taskqueue timeout - completing request directly
Jun  1 10:43:36 yalur kernel: ad16: TIMEOUT - READ_DMA retrying (0 retries left) LBA=233909187
Jun  1 11:07:50 yalur syslogd: restart
Jun  1 11:07:50 yalur syslogd: kernel boot file is /boot/kernel/kernel
Jun  1 11:07:50 yalur kernel: ad16: FAILURE - device detached
Jun  1 11:07:50 yalur kernel: subdisk16: detached
Jun  1 11:07:50 yalur kernel: ad16: detached
Jun  1 11:07:50 yalur kernel:
Jun  1 11:07:50 yalur kernel:
Jun  1 11:07:50 yalur kernel: Fatal trap 12: page fault while in kernel mode
Jun  1 11:07:50 yalur kernel: cpuid = 0; apic id = 00
Jun  1 11:07:50 yalur kernel: fault virtual address = 0x2c
Jun  1 11:07:50 yalur kernel: fault code = supervisor write, page not present
Jun  1 11:07:50 yalur kernel: instruction pointer = 0x20:0x805aab85
Jun  1 11:07:50 yalur kernel: stack pointer = 0x28:0xed71ac5c
Jun  1 11:07:50 yalur kernel: frame pointer = 0x28:0xed71ac70
Jun  1 11:07:50 yalur kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Jun  1 11:07:50 yalur kernel: = DPL 0, pres 1, def32 1, gran 1
Jun  1 11:07:50 yalur kernel: processor eflags = interrupt enabled, resume, IOPL = 0
Jun  1 11:07:50 yalur kernel: current process = 3 (g_up)
Jun  1 11:07:50 yalur kernel: trap number = 12
Jun  1 11:07:50 yalur kernel: panic: page fault
Jun  1 11:07:50 yalur kernel: cpuid = 0

[root@yalur /home/ruslan]# zpool status
  pool: data
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Jun  2 12:05:52 2008
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad20    ONLINE       0     0     0
        spares
          ad26      AVAIL

errors: No known data errors

On Monday, 02 June 2008, Jeremy Chadwick wrote:
> On Mon, Jun 02, 2008 at 04:04:12PM +1000, Andrew Hill wrote:
> > On Mon, May 19, 2008 at 1:11 AM, Andrew Hill wrote:
> > > i tend to find that the timeouts occur on one or two disks at once -
> > > e.g. ad0 and 2 will complain of timeouts, and the system locks up
> > > shortly thereafter...
> >
> > after spitting out the usual errors from ad0 and ad2 (in this case) with
> > TIMEOUTs and subsequent FAILUREs on READ_DMA[48] and WRITE_DMA[48]...
> >
> > i got the following panic
> >
> > vm_fault: pager read error, pid 1552 (tlsmgr)
> > ad0: FAILURE - READ_DMA48 timed out LBA=352903900
> > swap_pager: indefinite wait buffer: bufobj: 0, blkno: 437, size: 4096
> > ad2: FAILURE - WRITE_DMA timed out LBA=239717693
> > panic: ZFS: I/O failure (write on off 0: zio 0xffffff001d47c810
> > [L0 ZIL intent log] b000L/b000P DVA[0]=<0:c807795000:d000> zilog
> > uncompressed LE contiguous birth=750230 fill=0
> > cksum=69f76525a84e1816:f6d86fe1d94cd68c:39:8af): error 5
> > KDB: enter: panic
> > [thread pid 72 tid 100071 ]
> > Stopped at kdb_enter_why+0x3d: movq $0,0x39b248(%rip)
> > db>
>
> I would say the ZFS crash is a result of the ad0/ad2 timeouts. The ZIL
> log shows a hard checksum failure in the ZIL, which indicates a serious
> problem -- very likely hardware-related (or rather, at a lower level
> than ZFS).
>
> You've read this already, but maybe you missed the DMA error part:
>
> http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues
>
> The DMA errors can actually be legitimate too -- it's very hard to
> troubleshoot if they're superfluous (e.g. a FreeBSD bug) or if they're
> real. If the problem is reproducible, then this is convenient with
> regards to providing you additional help.
>
> I really need to sit down and write a huge HOWTO doc for people on how
> to diagnose whether or not their disks or cables are bad, etc... It's a
> very hard thing to document, because everyone's situation is different.
>
> The first piece to start with is simplest, though: install
> ports/sysutils/smartmontools and provide the output of "smartctl -a
> /dev/ad0" and /dev/ad2. Actual disk errors will very likely show up
> there in one of the counters, or in the SMART log. I'd personally like
> to see the output from smartctl, because it's something you can do while
> the system is up/working.
>
> The next step would involve replacing your cables.
> If the problem continues, you've at least removed one piece of the puzzle.
>
> Next, replace the disks -- especially if they were bought at the same
> time, and are from the same vendor. Hard disk vendors are known to have
> bad batches of disks. For sake of example, I just had two Western
> Digital disks (which I bought at the same time) fail a short I/O test,
> returning errors at different LBAs (blocks). The 2nd one only started
> showing problems a few weeks after the first. I obviously got both of
> them RMA'd.
>
> Finally, replace the controller or motherboard. Some people have
> reported success with this.
>
> > generally the lockups don't result in a panic (at least not in the short
> > term of 5-10 minutes), so i can't be sure that this panic is necessarily
> > caused by the same problem, but thought it might be worth posting in case
> > it gives an indication of the location/cause of the deadlock
>
> The DMA timeout errors you've seen, others have seen as well --
> including me -- even when the hardware, disks, cabling, and controllers
> are in a 100% working state. (Even switching OSes results in no errors,
> indicating there is a problem with FreeBSD in some way.)
>
> If the problem is reproducible, you should get in contact with Scott
> Long and let him poke at things. (I mentioned this last time. :-) )
> I myself am not familiar with the FreeBSD kernel, the device drivers, or
> working with the kernel at such a low level to debug things of this
> nature.
>
> > unfortunately i couldn't get a backtrace or core dump for 'political'
> > reasons (the system was required for use by others) but i'll see if i can
> > get a panic happening after-hours to get some more info...
>
> I can't tell you what to do or how to do your job, but honestly you
> should be pulling this system out of production and replacing it with a
> different one, or a different implementation, or a different OS.
> Your users/employees are probably getting ticked off at the crashes, and it
> probably irritates you too. The added benefit is that you could get
> Scott access to the box.

-- 
________________
Best regards, Ruslan Kovtun
mailto
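
To see at a glance which disk is producing the ATA errors in a log like the
one quoted in this message, the kernel messages can be tallied per device and
severity. This is a minimal Python sketch under stated assumptions, not a
FreeBSD tool: the regular expression and the name tally_ata_events are
illustrative, and the sample lines are copied from the log in this message.

```python
import re
from collections import Counter

# Assumed pattern for the ATA kernel messages quoted in this thread, e.g.
# "... kernel: ad16: WARNING - READ_DMA UDMA ICRC error ...".
# It is an illustration only and may not cover every message the driver emits.
ATA_LINE = re.compile(
    r"kernel: (?P<dev>ad\d+): (?P<level>WARNING|TIMEOUT|FAILURE) - (?P<msg>.*)"
)


def tally_ata_events(lines):
    """Count (device, severity) pairs across syslog lines."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            counts[(match["dev"], match["level"])] += 1
    return counts


# Sample lines taken from the log in this message.
SAMPLE = [
    "Jun  1 10:43:11 yalur kernel: ad16: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=233909187",
    "Jun  1 10:43:36 yalur kernel: ad16: WARNING - SET_MULTI taskqueue timeout - completing request directly",
    "Jun  1 10:43:36 yalur kernel: ad16: TIMEOUT - READ_DMA retrying (0 retries left) LBA=233909187",
    "Jun  1 11:07:50 yalur kernel: ad16: FAILURE - device detached",
    "Jun  1 11:07:50 yalur kernel: ad16: detached",
    "Jun  1 11:07:50 yalur kernel: panic: page fault",
]

if __name__ == "__main__":
    for (dev, level), n in sorted(tally_ata_events(SAMPLE).items()):
        print(f"{dev} {level}: {n}")
```

In practice the lines would come from the system log file rather than a
hard-coded list. A device that keeps collecting WARNING and TIMEOUT entries
is the first candidate for the smartctl check and cable swap described in
the quoted reply.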