From owner-freebsd-current@FreeBSD.ORG Sun Sep 7 13:02:31 2014 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 60C1C3DD for ; Sun, 7 Sep 2014 13:02:31 +0000 (UTC) Received: from smtprelay06.ispgateway.de (smtprelay06.ispgateway.de [80.67.31.102]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B2C141D64 for ; Sun, 7 Sep 2014 13:02:29 +0000 (UTC) Received: from [87.79.252.204] (helo=fabiankeil.de) by smtprelay06.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1XQc6k-0003XZ-Vt for freebsd-current@freebsd.org; Sun, 07 Sep 2014 15:02:19 +0200 Date: Sun, 7 Sep 2014 15:02:19 +0200 From: Fabian Keil To: FreeBSD Current Subject: ZFS-related panic: "possible" spa->spa_errlog_lock deadlock Message-ID: <492dbacb.5942cc9b@fabiankeil.de> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; boundary="Sig_/RF=29s9FFtRhTIQX5d+=r1I"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 07 Sep 2014 13:02:31 -0000 --Sig_/RF=29s9FFtRhTIQX5d+=r1I Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Using a kernel built from FreeBSD 11.0-CURRENT r271182 I got the following panic yesterday: [...] Unread portion of the kernel message buffer: [6880] panic: deadlkres: possible deadlock detected for 0xfffff80015289490,= blocked for 1800503 ticks [6880]=20 [6880] cpuid =3D 1 [6880] KDB: stack backtrace: [6880] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe= 009432e9e0 [6880] kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe009432ea90 [6880] panic() at panic+0x1c1/frame 0xfffffe009432eb50 [6880] deadlkres() at deadlkres+0x588/frame 0xfffffe009432ebb0 [6880] fork_exit() at fork_exit+0x9a/frame 0xfffffe009432ebf0 [6880] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009432ebf0 [6880] --- trap 0, rip =3D 0, rsp =3D 0xfffffe009432ecb0, rbp =3D 0 --- [6880] KDB: enter: panic [6880] Uptime: 1h54m40s [6880] Dumping 354 out of 1973 MB:..5%..14%..23%..32%..41%..55%..64%..73%..= 82%..91% Reading symbols from /boot/kernel/zfs.ko.symbols...done. [...] Loaded symbols for /boot/kernel/profile.ko.symbols #0 doadump (textdump=3D1) at pcpu.h:219 219 pcpu.h: No such file or directory. in pcpu.h (kgdb) where #0 doadump (textdump=3D1) at pcpu.h:219 #1 0xffffffff80597bad in kern_reboot (howto=3D260) at /usr/src/sys/kern/ke= rn_shutdown.c:447 #2 0xffffffff80598100 in panic (fmt=3D) at /usr/src/s= ys/kern/kern_shutdown.c:746 #3 0xffffffff80539b98 in deadlkres () at /usr/src/sys/kern/kern_clock.c:240 #4 0xffffffff8055e8da in fork_exit (callout=3D0xffffffff80539610 , arg=3D0x0, frame=3D0xfffffe009432ec00) at /usr/src/sys/kern/kern_fork.c= :977 #5 0xffffffff8083fb5e in fork_trampoline () at /usr/src/sys/amd64/amd64/ex= ception.S:605 #6 0x0000000000000000 in ?? () Current language: auto; currently minimal (kgdb) set $td=3D(struct thread *)0xfffff80015289490 (kgdb) tid $td->td_tid [Switching to thread 1184 (Thread 101428)]#0 sched_switch (td=3D0xfffff800= 15289490, newtd=3D, flags=3D) at = /usr/src/sys/kern/sched_ule.c:1932 1932 cpuid =3D PCPU_GET(cpuid); (kgdb) bt #0 sched_switch (td=3D0xfffff80015289490, newtd=3D, f= lags=3D) at /usr/src/sys/kern/sched_ule.c:1932 #1 0xffffffff805a23c1 in mi_switch (flags=3D260, newtd=3D0x0) at /usr/src/= sys/kern/kern_synch.c:493 #2 0xffffffff805e4bca in sleepq_wait (wchan=3D0x0, pri=3D0) at /usr/src/sy= s/kern/subr_sleepqueue.c:631 #3 0xffffffff805a12b2 in _sx_xlock_hard (sx=3D0xfffff800062ed820, tid=3D18= 446735277971510416, opts=3D, file=3D0x0, line=3D0) at = /usr/src/sys/kern/kern_sx.c:676 #4 0xffffffff805a0add in _sx_xlock (sx=3D0x0, opts=3D0, file=3D, line=3D0) at sx.h:154 #5 0xffffffff8114a691 in spa_get_errlog_size (spa=3D0xfffff800062ed000) at= /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_errlog.c:142 #6 0xffffffff8113f549 in spa_get_stats (name=3D0xfffffe0006126000 "spacelo= op", config=3D0xfffffe00950e58e8, altroot=3D0xfffffe0006126430 "", buflen= =3D2048) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:3287 #7 0xffffffff81189a45 in zfs_ioc_pool_stats (zc=3D0xfffffe0006126000) at /= usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:1656 #8 0xffffffff81187290 in zfsdev_ioctl (dev=3D, zcmd= =3D, arg=3D, flag=3D, td=3D) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:= 6136 #9 0xffffffff80464a55 in devfs_ioctl_f (fp=3D0xfffff80017107f50, com=3D322= 2821381, data=3D0xfffff8004fb4b740, cred=3D, td=3D0xff= fff80015289490) at /usr/src/sys/fs/devfs/devfs_vnops.c:757 #10 0xffffffff805f3c3d in kern_ioctl (td=3D0xfffff80015289490, fd=3D, com=3D0) at file.h:311 #11 0xffffffff805f381c in sys_ioctl (td=3D0xfffff80015289490, uap=3D0xfffff= e00950e5b80) at /usr/src/sys/kern/sys_generic.c:702 #12 0xffffffff8085c2db in amd64_syscall (td=3D0xfffff80015289490, traced=3D= 0) at subr_syscall.c:133 #13 0xffffffff8083f90b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exce= ption.S:390 #14 0x00000008019fc3da in ?? () (kgdb) f 3 #3 0xffffffff805a12b2 in _sx_xlock_hard (sx=3D0xfffff800062ed820, tid=3D18= 446735277971510416, opts=3D, file=3D0x0, line=3D0) at = /usr/src/sys/kern/kern_sx.c:676 676 sleepq_wait(&sx->lock_object, 0); (kgdb) p sx->lock_object $14 =3D {lo_name =3D 0xffffffff81202163 "spa->spa_errlog_lock", lo_flags = =3D 40960000, lo_data =3D 0, lo_witness =3D 0x0} This happened several minutes after a couple of zpool processes stopped responding while accessing information about the following pool: fk@r500 ~ $zpool status -v spaceloop pool: spaceloop state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://illumos.org/msg/ZFS-8000-9P scan: scrub repaired 2.98M in 16h27m with 0 errors on Sun Sep 7 12:25:36= 2014 config: NAME STATE READ WRITE CKSUM spaceloop ONLINE 0 0 0 label/spaceloop.eli ONLINE 0 0 26 errors: No known data errors Dump time was 21:58:48, the vdev disappeared a while before that, therefore I believe the "possible" deadlock is a real one: Sep 6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_write_done() failed (er= ror=3D5) label/spaceloop.eli[WRITE(offset=3D2235700736, length=3D16384)] Sep 6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_read_done() failed (err= or=3D5) label/spaceloop.eli[READ(offset=3D4008976384, length=3D8192)] Sep 6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_read_done() failed (err= or=3D5) label/spaceloop.eli[READ(offset=3D4009238528, length=3D8192)] Sep 6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_read_done() failed (err= or=3D5) label/spaceloop.eli[READ(offset=3D270336, length=3D8192)] Sep 6 21:18:27 r500 kernel: [4459] (da1:umass-sim1:1:0:0): WRITE(10). CDB:= 2a 00 00 60 65 0c 00 00 40 00=20 Sep 6 21:18:27 r500 kernel: [4459] (da1:umass-sim1:1:0:0): CAM status: CCB= request completed with an error Sep 6 21:18:27 r500 kernel: [4459] (da1:umass-sim1:1:0:0): Retrying command Sep 6 21:18:45 r500 kernel: [4477] (da1:umass-sim1:1:0:0): WRITE(10). CDB:= 2a 00 00 60 65 0c 00 00 40 00=20 Sep 6 21:18:45 r500 kernel: [4477] (da1:umass-sim1:1:0:0): CAM status: CCB= request completed with an error Sep 6 21:18:45 r500 kernel: [4477] (da1:umass-sim1:1:0:0): Retrying command Sep 6 21:19:03 r500 kernel: [4495] (da1:umass-sim1:1:0:0): WRITE(10). CDB:= 2a 00 00 60 65 0c 00 00 40 00=20 Sep 6 21:19:03 r500 kernel: [4495] (da1:umass-sim1:1:0:0): CAM status: CCB= request completed with an error Sep 6 21:19:03 r500 kernel: [4495] (da1:umass-sim1:1:0:0): Retrying command Sep 6 21:19:21 r500 kernel: [4514] (da1:umass-sim1:1:0:0): WRITE(10). CDB:= 2a 00 00 60 65 0c 00 00 40 00=20 Sep 6 21:19:21 r500 kernel: [4514] (da1:umass-sim1:1:0:0): CAM status: CCB= request completed with an error Sep 6 21:19:21 r500 kernel: [4514] (da1:umass-sim1:1:0:0): Retrying command Sep 6 21:19:40 r500 kernel: [4532] (da1:umass-sim1:1:0:0): WRITE(10). CDB:= 2a 00 00 60 65 0c 00 00 40 00=20 Sep 6 21:19:40 r500 kernel: [4532] (da1:umass-sim1:1:0:0): CAM status: CCB= request completed with an error Sep 6 21:19:40 r500 kernel: [4532] (da1:umass-sim1:1:0:0): Error 5, Retrie= s exhausted Sep 6 21:19:40 r500 kernel: [4532] GEOM_ELI: g_eli_write_done() failed (er= ror=3D5) label/spaceloop.eli[WRITE(offset=3D3234469888, length=3D32768)] [...] Sep 6 21:29:46 r500 kernel: [5138] GEOM_ELI: Device label/takems.eli destr= oyed. Sep 6 21:29:52 r500 kernel: [5144] ugen0.3: at usbus0 (d= isconnected) Sep 6 21:29:52 r500 kernel: [5144] umass2: at uhub0, port 1, addr 3 (disco= nnected) Sep 6 21:29:52 r500 kernel: [5144] da2 at umass-sim2 bus 2 scbus4 target 0= lun 0 Sep 6 21:29:52 r500 kernel: [5144] da2: s/n AA04012= 900007508 detached Sep 6 21:29:52 r500 kernel: [5144] pass4 at umass-sim2 bus 2 scbus4 target= 0 lun 0 Sep 6 21:29:52 r500 kernel: [5144] pass4: s/n AA040= 12900007508 detached Sep 6 21:29:52 r500 kernel: [5144] (pass4:umass-sim2:2:0:0): Periph destro= yed Sep 6 21:29:52 r500 kernel: [5144] (da2:umass-sim2:2:0:0): Periph destroyed Sep 6 21:44:56 r500 kernel: [6049] umass1: at uhub1, port 1, addr 3 (disco= nnected) Sep 6 21:44:56 r500 kernel: [6049] da1 at umass-sim1 bus 1 scbus3 target 0= lun 0 Sep 6 21:44:56 r500 kernel: [6049] da1: s/n 0= 000000000002511 detached Sep 6 21:44:56 r500 kernel: [6049] pass3 at umass-sim1 bus 1 scbus3 target= 0 lun 0 Sep 6 21:44:56 r500 kernel: [6049] pass3: s/n= 0000000000002511 detached Sep 6 21:44:56 r500 kernel: [6049] (pass3:umass-sim1:1:0:0): Periph destro= yed Sep 6 21:44:56 r500 kernel: [6049] GEOM_ELI: Device label/spaceloop.eli de= stroyed. Sep 6 21:44:56 r500 kernel: [6049] GEOM_ELI: Detached label/spaceloop.eli = on last close. Sep 6 21:44:56 r500 kernel: [6049] (da1:umass-sim1:1:0:0): Periph destroyed Sep 6 21:44:57 r500 ZFS: vdev is removed, pool_guid=3D13593515477493551420= vdev_guid=3D2311923824790456307 Sep 6 21:44:57 r500 kernel: [6049] umass1: on usbus1 Sep 6 21:44:57 r500 kernel: [6049] umass1: SCSI over Bulk-Only; quirks = =3D 0x4101 Sep 6 21:44:57 r500 kernel: [6049] umass1:3:1: Attached to scbus3 Sep 6 21:46:10 r500 kernel: [6122] (probe0:umass-sim1:1:0:0): INQUIRY. CDB= : 12 00 00 00 24 00=20 Sep 6 21:46:10 r500 kernel: [6122] (probe0:umass-sim1:1:0:0): CAM status: = CCB request completed with an error Sep 6 21:46:10 r500 kernel: [6122] (probe0:umass-sim1:1:0:0): Retrying com= mand Sep 6 21:46:45 r500 kernel: [6157] ugen1.3: at usbus1 (disconnec= ted) Sep 6 21:46:45 r500 kernel: [6157] umass1: at uhub1, port 1, addr 3 (disco= nnected) Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB= : 12 00 00 00 24 00=20 Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: = CCB request completed with an error Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Retrying com= mand Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB= : 12 00 00 00 24 00=20 Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: = CCB request completed with an error Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Retrying com= mand Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB= : 12 00 00 00 24 00=20 Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: = CCB request completed with an error Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Retrying com= mand Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB= : 12 00 00 00 24 00=20 Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: = CCB request completed with an error Sep 6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Error 5, Ret= ries exhausted Two other pools, one using USB as well, were unaffected until the panic was triggered. A pleasant surprise was that the system entered ddb from Xorg and dumping on /dev/ada0s1b worked despite /dev/ada0s1b.eli being configured as swap device. I wasn't aware that this was working now. Fabian --Sig_/RF=29s9FFtRhTIQX5d+=r1I Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEARECAAYFAlQMV1sACgkQBYqIVf93VJ30SwCggvna+u5WCWCr2ysP3JaM0sP6 xdEAoIPZ7NS8C3Pua3cW1cFYAG1IQY/Z =2tkT -----END PGP SIGNATURE----- --Sig_/RF=29s9FFtRhTIQX5d+=r1I--