From owner-freebsd-stable@FreeBSD.ORG Fri Sep 20 21:32:11 2013
From: olivier <olivier777a7@gmail.com>
To: Volodymyr Kostyrko
Cc: "freebsd-stable@freebsd.org", zfs-devel@freebsd.org
Date: Fri, 20 Sep 2013 14:32:08 -0700
Subject: Re: 9.2PRERELEASE ZFS panic in lzjb_compress
References: <51E944B0.5080409@gmail.com>
One last piece of information I just got: the problem is not specific to LZJB compression. I switched to LZ4 and get the same sort of panic:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 28
fault virtual address   = 0xffffff8581c48000
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8195f6d1
stack pointer           = 0x28:0xffffffcf950ee850
frame pointer           = 0x28:0xffffffcf950ee8f0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (zio_write_issue_hig)
trap number             = 12
panic: page fault
cpuid = 8
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffcf950ee2e0
kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffcf950ee3a0
panic() at panic+0x1ce/frame 0xffffffcf950ee4a0
trap_fatal() at trap_fatal+0x290/frame 0xffffffcf950ee500
trap_pfault() at trap_pfault+0x211/frame 0xffffffcf950ee590
trap() at trap+0x344/frame 0xffffffcf950ee790
calltrap() at calltrap+0x8/frame 0xffffffcf950ee790
--- trap 0xc, rip = 0xffffffff8195f6d1, rsp = 0xffffffcf950ee850, rbp = 0xffffffcf950ee8f0 ---
lz4_compress() at lz4_compress+0x81/frame 0xffffffcf950ee8f0
zio_compress_data() at zio_compress_data+0x92/frame 0xffffffcf950ee920
zio_write_bp_init() at zio_write_bp_init+0x24b/frame 0xffffffcf950ee970
zio_execute() at zio_execute+0xc3/frame 0xffffffcf950ee9b0
taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xffffffcf950eea00
taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xffffffcf950eea20
fork_exit() at fork_exit+0x11f/frame 0xffffffcf950eea70
fork_trampoline() at fork_trampoline+0xe/frame 0xffffffcf950eea70
--- trap 0, rip = 0, rsp = 0xffffffcf950eeb30, rbp = 0 ---

(I am now
trying without any compression.)

On Fri, Sep 20, 2013 at 11:25 AM, olivier wrote:

> Got another, very similar panic again on recent 9-STABLE (r255602); I
> assume the latest 9.2 release candidate is affected too. Does anybody have
> any idea of what could be causing this, or of a workaround other than
> turning compression off?
> Unlike the last panic I reported, this one did not occur during a zfs
> send/receive operation. There were just a number of processes potentially
> writing to disk at the same time.
> All hardware is healthy as far as I can tell (memory is ECC and no errors
> in the logs; zpool status and smartctl show no problems).
>
> (Several CPUs faulted at once, so the console output below is interleaved.)
>
> Fatal trap 12: page fault while in kernel mode
>
> cpuid = 4; apic id = 24
> cpuid = 51; apic id = 83
> fault virtual address   = 0xffffff8700a9cc65
> fault virtual address   = 0xffffff8700ab0ea9
> fault code              = supervisor read data, page not present
>
> instruction pointer     = 0x20:0xffffffff8195ff47
> fault code              = supervisor read data, page not present
> stack pointer           = 0x28:0xffffffcf951390a0
> Fatal trap 12: page fault while in kernel mode
> frame pointer           = 0x28:0xffffffcf951398f0
> Fatal trap 12: page fault while in kernel mode
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> instruction pointer     = 0x20:0xffffffff8195ffa4
> stack pointer           = 0x28:0xffffffcf951250a0
> processor eflags        = frame pointer = 0x28:0xffffffcf951258f0
> interrupt enabled, code segment = base 0x0, limit 0xfffff, type 0x1b
> resume, IOPL = 0
> cpuid = 28; apic id = 4c
> Fatal trap 12: page fault while in kernel mode
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> current process         = 0 (zio_write_issue_hig)
> processor eflags        = fault virtual address = 0xffffff8700aa22ac
> interrupt enabled, fault code = supervisor read data, page not present
> resume, IOPL = 0
> trap number             = 12
> instruction pointer     = 0x20:0xffffffff8195ffa4
> current process         = 0 (zio_write_issue_hig)
> panic: page fault
> cpuid = 4
> KDB: stack
backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffcf95138b30
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffcf95138bf0
> panic() at panic+0x1ce/frame 0xffffffcf95138cf0
> trap_fatal() at trap_fatal+0x290/frame 0xffffffcf95138d50
> trap_pfault() at trap_pfault+0x211/frame 0xffffffcf95138de0
> trap() at trap+0x344/frame 0xffffffcf95138fe0
> calltrap() at calltrap+0x8/frame 0xffffffcf95138fe0
> --- trap 0xc, rip = 0xffffffff8195ff47, rsp = 0xffffffcf951390a0, rbp = 0xffffffcf951398f0 ---
> lzjb_compress() at lzjb_compress+0xa7/frame 0xffffffcf951398f0
> zio_compress_data() at zio_compress_data+0x92/frame 0xffffffcf95139920
> zio_write_bp_init() at zio_write_bp_init+0x24b/frame 0xffffffcf95139970
> zio_execute() at zio_execute+0xc3/frame 0xffffffcf951399b0
> taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xffffffcf95139a00
> taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xffffffcf95139a20
> fork_exit() at fork_exit+0x11f/frame 0xffffffcf95139a70
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffcf95139a70
> --- trap 0, rip = 0, rsp = 0xffffffcf95139b30, rbp = 0 ---
>
> 0x51f47 is in lzjb_compress
> (/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c:74).
> 69              }
> 70              if (src > (uchar_t *)s_start + s_len - MATCH_MAX) {
> 71                      *dst++ = *src++;
> 72                      continue;
> 73              }
> 74              hash = (src[0] << 16) + (src[1] << 8) + src[2];
> 75              hash += hash >> 9;
> 76              hash += hash >> 5;
> 77              hp = &lempel[hash & (LEMPEL_SIZE - 1)];
> 78              offset = (intptr_t)(src - *hp) & OFFSET_MASK;
>
> dmesg output is at http://pastebin.com/U34fwJ5f
> kernel config is at http://pastebin.com/c9HKfcsz
> I can provide more information if useful.
> Thanks
>
>
> On Fri, Jul 19, 2013 at 6:52 AM, Volodymyr Kostyrko wrote:
>
>> 19.07.2013 07:04, olivier wrote:
>>
>>> Hi,
>>> Running 9.2-PRERELEASE #19 r253313 I got the following panic:
>>>
>>> Fatal trap 12: page fault while in kernel mode
>>> cpuid = 22; apic id = 46
>>> fault virtual address   = 0xffffff827ebca30c
>>> fault code              = supervisor read data, page not present
>>> instruction pointer     = 0x20:0xffffffff81983055
>>> stack pointer           = 0x28:0xffffffcf75bd60a0
>>> frame pointer           = 0x28:0xffffffcf75bd68f0
>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>>> processor eflags        = interrupt enabled, resume, IOPL = 0
>>> current process         = 0 (zio_write_issue_hig)
>>> trap number             = 12
>>> panic: page fault
>>> cpuid = 22
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffcf75bd5b30
>>> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffcf75bd5bf0
>>> panic() at panic+0x1ce/frame 0xffffffcf75bd5cf0
>>> trap_fatal() at trap_fatal+0x290/frame 0xffffffcf75bd5d50
>>> trap_pfault() at trap_pfault+0x211/frame 0xffffffcf75bd5de0
>>> trap() at trap+0x344/frame 0xffffffcf75bd5fe0
>>> calltrap() at calltrap+0x8/frame 0xffffffcf75bd5fe0
>>> --- trap 0xc, rip = 0xffffffff81983055, rsp = 0xffffffcf75bd60a0, rbp = 0xffffffcf75bd68f0 ---
>>> lzjb_compress() at lzjb_compress+0x185/frame 0xffffffcf75bd68f0
>>> zio_compress_data() at zio_compress_data+0x92/frame 0xffffffcf75bd6920
>>> zio_write_bp_init() at zio_write_bp_init+0x24b/frame 0xffffffcf75bd6970
>>> zio_execute() at zio_execute+0xc3/frame 0xffffffcf75bd69b0
>>> taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xffffffcf75bd6a00
>>> taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xffffffcf75bd6a20
>>> fork_exit() at fork_exit+0x11f/frame 0xffffffcf75bd6a70
>>> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffcf75bd6a70
>>> --- trap 0, rip = 0, rsp = 0xffffffcf75bd6b30, rbp = 0 ---
>>> lzjb_compress+0x185 corresponds to line 85 in
>>> 80              cpy = src - offset;
>>> 81              if (cpy >= (uchar_t *)s_start && cpy != src &&
>>> 82                  src[0] == cpy[0] && src[1] == cpy[1] && src[2] == cpy[2]) {
>>> 83                      *copymap |= copymask;
>>> 84                      for (mlen = MATCH_MIN; mlen < MATCH_MAX; mlen++)
>>> 85                              if (src[mlen] != cpy[mlen])
>>> 86                                      break;
>>> 87                      *dst++ = ((mlen - MATCH_MIN) << (NBBY - MATCH_BITS)) |
>>> 88                          (offset >> NBBY);
>>> 89                      *dst++ = (uchar_t)offset;
>>>
>>> I think it's the first time I've seen this panic. It happened while
>>> doing a send/receive. I have two pools with lzjb compression; I don't
>>> know which of these pools caused the problem, but one of them was the
>>> source of the send/receive.
>>>
>>> I only have a textdump, but I'm happy to try to provide more
>>> information that could help anyone look into this.
>>> Thanks
>>> Olivier
>>
>> Oh, I can add to this one. I have a full core dump of the same problem,
>> caused by copying a large set of files from an lzjb-compressed pool to
>> an lz4-compressed pool. vfs.zfs.recover was set.
>>
>> #1  0xffffffff8039d954 in kern_reboot (howto=260)
>>     at /usr/src/sys/kern/kern_shutdown.c:449
>> #2  0xffffffff8039ddce in panic (fmt=<value optimized out>)
>>     at /usr/src/sys/kern/kern_shutdown.c:637
>> #3  0xffffffff80620a6a in trap_fatal (frame=<value optimized out>,
>>     eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:879
>> #4  0xffffffff80620d25 in trap_pfault (frame=0x0, usermode=0)
>>     at /usr/src/sys/amd64/amd64/trap.c:700
>> #5  0xffffffff806204f6 in trap (frame=0xffffff821ca43600)
>>     at /usr/src/sys/amd64/amd64/trap.c:463
>> #6  0xffffffff8060a032 in calltrap ()
>>     at /usr/src/sys/amd64/amd64/exception.S:232
>> #7  0xffffffff805a9367 in vm_page_alloc (object=0xffffffff80a34030,
>>     pindex=16633, req=97) at /usr/src/sys/vm/vm_page.c:1445
>> #8  0xffffffff8059c42e in kmem_back (map=0xfffffe00010000e8,
>>     addr=18446743524021862400, size=16384, flags=<value optimized out>)
>>     at /usr/src/sys/vm/vm_kern.c:362
>> #9  0xffffffff8059c2ac in kmem_malloc (map=0xfffffe00010000e8, size=16384,
>>     flags=257) at /usr/src/sys/vm/vm_kern.c:313
>> #10 0xffffffff80595104 in uma_large_malloc (size=<value optimized out>,
>>     wait=257) at /usr/src/sys/vm/uma_core.c:994
>> #11 0xffffffff80386b80 in malloc (size=16384, mtp=0xffffffff80ea7c40,
>>     flags=0) at /usr/src/sys/kern/kern_malloc.c:492
>> #12 0xffffffff80c9e13c in lz4_compress (s_start=0xffffff80d0b19000,
>>     d_start=0xffffff8159445000, s_len=131072, d_len=114688, n=-2)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/lz4.c:843
>> #13 0xffffffff80cdde25 in zio_compress_data (c=<value optimized out>,
>>     src=<value optimized out>, dst=0xffffff8159445000, s_len=131072)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:109
>> #14 0xffffffff80cda012 in zio_write_bp_init (zio=0xfffffe0143a12000)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1107
>> #15 0xffffffff80cd8ec6 in zio_execute (zio=0xfffffe0143a12000)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1305
#16 0xffffffff803e25e6 in taskqueue_run_locked (queue=0xfffffe00060ca300)
>>     at /usr/src/sys/kern/subr_taskqueue.c:312
>> #17 0xffffffff803e2e38 in taskqueue_thread_loop (arg=<value optimized out>)
>>     at /usr/src/sys/kern/subr_taskqueue.c:501
>> #18 0xffffffff8036f40a in fork_exit (
>>     callout=0xffffffff803e2da0,
>>     arg=0xfffffe00060cc3d0, frame=0xffffff821ca43a80)
>>     at /usr/src/sys/kern/kern_fork.c:988
>> #19 0xffffffff8060a56e in fork_trampoline ()
>>     at /usr/src/sys/amd64/amd64/exception.S:606
>>
>> I have a full crash dump in case someone wants to look at it.
>>
>> --
>> Sphinx of black quartz, judge my vow.