Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 27 Dec 2012 21:41:45 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Andreas Longwitz <longwitz@incore.de>
Cc:        freebsd-stable@freebsd.org, fs@freebsd.org
Subject:   Re: FS hang with suspfs when creating snapshot on a UFS + GJOURNAL setup
Message-ID:  <20121227194145.GM82219@kib.kiev.ua>
In-Reply-To: <50DC8999.8000708@incore.de>
References:  <50DC30F6.1050904@incore.de> <20121227133355.GI82219@kib.kiev.ua> <50DC8999.8000708@incore.de>

next in thread | previous in thread | raw e-mail | index | archive | help

--zHDeOHGDnzKksZSU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Dec 27, 2012 at 06:47:05PM +0100, Andreas Longwitz wrote:
> Konstantin Belousov wrote:
> > On Thu, Dec 27, 2012 at 12:28:54PM +0100, Andreas Longwitz wrote:
> >> On a FreeBSD 8-Stable machine with UFS + GJOURNAL (no SU) I observed t=
he
> >> same behaviour as described for UFS+SU+J in
> >>   lists.freebsd.org/pipermail/freebsd-current/2012-January/030937.html.
> >>
> >> The snapshot was initiated by amanda with
> >>   dump 3ubLshf 64 1048576 0 - /dev/mirror/gm0p10.journal (pid 50347)
> >> and the process creating the snapshot is
> >>   /sbin/mksnap_ffs /home /home/.snap/dump_snapshot (pid 50350).
> >>
> >> The process 50350 hangs and also all following processes that tried to
> >> access the home partition, some of them work, but don't come to an end,
> >> e.g. a shell after (Ctrl T):
> >>   load: 0.61  cmd: sh 52670 [suspfs] 43.46r 0.00u 0.00s 0% 2272k.
> >>
> >> All write I/O's to all gjournaled partitions are stopped. Under normal
> >> circumstances the snapshot is taken in five seconds, so I have definit=
iv
> >> not the problem I have described in
> >>   lists.freebsd.org/pipermail/freebsd-geom/2012-May/005246.html.
> >>
> >> .....
> >>
> >> It seems there is a deadlock on the suspfs lock, but I could not figure
> >> out who holds this lock.
> >> Any hints how to get better diagnostic information for next time the
> >> error occurs are welcome.
> >=20
> > The
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/ke=
rneldebug-deadlocks.html
> > provides the instructions.
> >=20
> > The suspfs is owned by the snapshot creator. The question is, where is =
it
> > blocked.
>=20
> Thanks for answer.
>=20
> In the meantime I can reproduce the problem and got some more
> information. It looks that there is a deadlock between the two processes
> with pid 18 (g_journal switcher) and pid 7126 (/sbin/mksnap_ffs /home
> /home/.snap/my_snapshot):
>=20
> 0    18     0   0  45  0     0    16 snaplk DL    ??    4:40,34
>          [g_journal switcher]
> 0  7126  1933   0  50  0  5836  1176 suspfs D     ??    0:00,44
>          /sbin/mksnap_ffs /home /home/.snap/my_snapshot
>=20
> procstat -t 18 -->
>   PID    TID COMM               TDNAME           CPU  PRI STATE   WCHAN
>    18 100076 g_journal switcher g_journal switch   0  129 sleep   snaplk
> procstat  -t 7126 -->
>   PID    TID COMM               TDNAME           CPU  PRI STATE   WCHAN
>  7126 100157 mksnap_ffs       -                    1  134 sleep   suspfs
> procstat -kk 18 -->
>   PID    TID COMM             TDNAME           KSTACK
>    18 100076 g_journal switcher g_journal switch mi_switch+0x186
>       sleepq_wait+0x42 __lockmgr_args+0x49b ffs_copyonwrite
>       +0x19a ffs_geom_strategy+0x1b5 bufwrite+0xe9 ffs_sbupdate+0x12a
>       g_journal_ufs_clean+0x3e g_journal_switcher+0xe5e fork
>       _exit+0x11f fork_trampoline+0xe
> procstat -kk 7126 -->
>   PID    TID COMM             TDNAME           KSTACK
>  7126 100157 mksnap_ffs       -                mi_switch+0x186
>       sleepq_wait+0x42 _sleep+0x373 vn_start_write+0xdf ffs_s
>       napshot+0xe2b ffs_mount+0x65a vfs_donmount+0xdc5 nmount+0x63
>       amd64_syscall+0x1f4 Xfast_syscall+0xfc
>=20
> >From DDB:
> db> show lockedvnods
> Locked vnodes
>=20
> 0xffffff012281a938: tag ufs, type VREG
>     usecount 1, writecount 0, refcount 3339 mountedhere 0
>     flags (VV_SYSTEM)
>     lock type snaplk: EXCL by thread 0xffffff000807a470 (pid 7126)
>  with exclusive waiters pending
>         ino 23552, on dev mirror/gm0p10.journal
=2E..
> db> alltrace (pid 18 and 7126)
>=20
> Tracing command g_journal switcher pid 18 tid 100076 td 0xffffff0002bd5000
> sched_switch() at sched_switch+0xde
> mi_switch() at mi_switch+0x186
> sleepq_wait() at sleepq_wait+0x42
> __lockmgr_args() at __lockmgr_args+0x49b
> ffs_copyonwrite() at ffs_copyonwrite+0x19a
> ffs_geom_strategy() at ffs_geom_strategy+0x1b5
> bufwrite() at bufwrite+0xe9
> ffs_sbupdate() at ffs_sbupdate+0x12a
> g_journal_ufs_clean() at g_journal_ufs_clean+0x3e
> g_journal_switcher() at g_journal_switcher+0xe5e
> fork_exit() at fork_exit+0x11f
> fork_trampoline() at fork_trampoline+0xe
> --- trap 0, rip =3D 0, rsp =3D 0xffffff8242ca8cf0, rbp =3D 0 ---
>=20
> Tracing command mksnap_ffs pid 7126 tid 100157 td 0xffffff000807a470
> sched_switch() at sched_switch+0xde
> mi_switch() at mi_switch+0x186
> sleepq_wait() at sleepq_wait+0x42
> _sleep() at _sleep+0x373
> vn_start_write() at vn_start_write+0xdf
> ffs_snapshot() at ffs_snapshot+0xe2b
Can you look up the line number for the ffs_snapshot+0xe2b ?

I think the bug is that vn_start_write() is called while the snaplock
is owned, after the out1 label in ffs_snapshot() (I am looking at the
HEAD code).
> ffs_mount() at ffs_mount+0x65a
> vfs_donmount() at vfs_donmount+0xdc5
> nmount() at nmount+0x63
> amd64_syscall() at amd64_syscall+0x1f4
> Xfast_syscall() at Xfast_syscall+0xfc
> --- syscall (378, FreeBSD ELF64, nmount), rip =3D 0x18069e35c, rsp =3D
> 0x7fffffffe358, rbp =3D 0x7fffffffedc7 ---
>=20
> There are a lot of other - but later started than pid 7126 - processes
> sitting on the suspfs lock, most of them hang on a close with a stack
> like this:
>=20
> Tracing command cat pid 17726 tid 100387 td 0xffffff012e4bd470
> sched_switch() at sched_switch+0xde
> mi_switch() at mi_switch+0x186
> sleepq_wait() at sleepq_wait+0x42
> _sleep() at _sleep+0x373
> vn_start_write() at vn_start_write+0xdf
> vn_close() at vn_close+0x5b
> vn_closefile() at vn_closefile+0x5a
> _fdrop() at _fdrop+0x23
> closef() at closef+0x45
> kern_close() at kern_close+0x163
> amd64_syscall() at amd64_syscall+0x1f4
> Xfast_syscall() at Xfast_syscall+0xfc
> --- syscall (6, FreeBSD ELF64, close), rip =3D 0x18073f07c, rsp =3D
> 0x7fffffffeb68, rbp =3D 0 ---
>=20
>=20
> If more information is necessary to catch this problem, I can try to
> reproduce the problem with a suitable debug kernel.
>=20
> --=20
> Andreas Longwitz

--zHDeOHGDnzKksZSU
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJQ3KR4AAoJEJDCuSvBvK1BOTEP/RIsmiLgE2r+ansViY7TGdv9
a/P0lQK/+a/+TZ2K24h+J5/sqBHFNnBq9mxD+8OwLBdGjYtoKb4uDio2DyS/xjBd
r1nP+XZxL5DS9cHTE7gzFJvX4i6T+g77L9PQkWaux1LhY06NW0gW/XytzWUcMl27
gQ66zXgpzvxC6IrQHOm8CN1F94N2wmu6o+3/mpP9XzbvShljkx+t2u7CqZpwQ7US
ItqSLMo6oUzfYYcWXskTud2BFbfcDaZXZ4ZwNt5MwmpnGBti+Cxq9D04NDNy5TYa
0zEsELjWv7DHxSOviD2YNdQR1o3dYPtHrWrnnFTMOL2ZFOao/XoDQmT8oR1qbUoB
I2BfpPBJ7rb52nS5oota19H3bjYqpTwQU/JFTQfoq9vkrt0WgktDPf+bgJdltw+y
Iaeq5osr2pdLjbdTiPve0UGgbFD4ayQVuRUOYJ/2dwC/eSQ+suMVhFgwx37o+TfQ
qKms7hpu3IuYKHBA+Von4RBB+g58PKbvEfAI9VY3D6xGj9hZ3R5wrh5GXvW3dI5U
4mBjPFaJXolubmwAgmJWNPcb5mNuJhEj0fhs/9XtN5V+bux6szt3cZdEN68zjA9F
k8hkP+FfvOa3VqGCdF6EfjcItcyUXw/l2vYJcmWzcSmP+EO8geyiE/DYtcKANvQd
hBoR2RLQOYj541So28Xd
=ywrw
-----END PGP SIGNATURE-----

--zHDeOHGDnzKksZSU--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20121227194145.GM82219>