From owner-freebsd-bugs@freebsd.org Tue Dec 12 19:02:30 2017 Return-Path: Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DB015E803B8 for ; Tue, 12 Dec 2017 19:02:30 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id C35786C25D for ; Tue, 12 Dec 2017 19:02:30 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id vBCJ2UWW092580 for ; Tue, 12 Dec 2017 19:02:30 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-bugs@FreeBSD.org Subject: [Bug 224292] processes are hanging in state ufs / possible deadlock in file system Date: Tue, 12 Dec 2017 19:02:29 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: wosch@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version rep_platform op_sys bug_status bug_severity priority component assigned_to reporter Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Dec 2017 19:02:31 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D224292 Bug ID: 224292 Summary: processes are hanging in state ufs / possible deadlock in file system Product: Base System Version: CURRENT Hardware: Any OS: Any Status: New Severity: Affects Only Me Priority: --- Component: kern Assignee: freebsd-bugs@FreeBSD.org Reporter: wosch@FreeBSD.org A friend and I are testing perl modules on FreeBSD, on all the major versio= ns (12-current, 11-stable, 11-RELEASE). The tests are done in a cheap VM (1GB = RAM, 40GB SSD, 1 core), apparently running on virtio/kvm. While testing we noticed that the VM randomly crashes every 1-3 days, and we had to run a hard fsck on the console. I first thought it is a resource problem, but munin doesn=E2=80=99t show any shortage of RAM, disk or file d= escriptors. We find out that the crashes occurred shortly after processes hanging in st= ate =E2=80=9Cufs=E2=80=9D, as shown in top(1) output: $ top -b 300 | egrep ufs 73278 root 1 31 0 8364K 2340K ufs 0:02 10.35% find 34548 cpansand 1 52 10 43828K 15348K ufs 0:00 0.00% perl 34824 cpansand 1 20 0 21060K 1960K ufs 0:00 0.00% procstat 34557 cpansand 1 52 10 13180K 2112K ufs 0:00 0.00% sh $ ps xuawww | egrep 73278=20=20 root 73278 0.0 0.2 8364 2340 - D 03:01 0:02.25 find= -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g= +x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} + root 73667 0.0 0.0 412 312 8 R+ 03:08 0:00.00 egrep 73278 $ ps xuawww | egrep 34548=20=20 cpansand 34548 0.0 1.5 43828 15348 4 DN+ 15:29 0:00.10 /usr/perl5.26.1p/bin/perl t/test_command.t Another example: $ lsof | egrep 12710 perl5.26. 12710 cpansand cwd VDIR=20=20=20=20=20=20=20=20=20=20=20= =20=20=20=20=20=20=20=20 18446735278340920664 0 / -- home/cpansand/.cpan/build/2017112212/Fsdb-2.= 64-1 perl5.26. 12710 cpansand rtd VDIR=20=20=20=20=20=20=20=20=20=20=20= =20=20=20=20=20=20=20=20=20 7507057954083258955 0 / (/dev/vtbd0p3) perl5.26. 12710 cpansand 0u VCHR 0,90=20=20=20=20=20= =20=20=20=20=20=20=20=20 0t20972 90 /dev/pts/0 perl5.26. 12710 cpansand 1u PIPE 0xfffff8000399f168=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20 0 ->0xfffff8000399f000 perl5.26. 12710 cpansand 2u PIPE 0xfffff8001c30da50=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20 0 ->0xfffff8001c30d8e8 perl5.26. 12710 cpansand 3u PIPE 0xfffff8001c30ebe0=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20 16384 ->0xfffff8001c30ed48 perl5.26. 12710 cpansand 5u PIPE 0xfffff8000399f168=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20 0 ->0xfffff8000399f000 perl5.26. 12710 cpansand 6u PIPE 0xfffff8001c30da50=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20 0 ->0xfffff8001c30d8e8 perl5.26. 12710 cpansand 7u PIPE 0xfffff8000399f168=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20 0 ->0xfffff8000399f000 perl5.26. 12710 cpansand 8u PIPE 0xfffff8001c30da50=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20 0 ->0xfffff8001c30d8e8 $ top -b 300 | egrep ufs 12358 cpansand 1 20 10 268M 12980K ufs 0:04 0.00% perl 12710 cpansand 1 20 10 21412K 13808K ufs 0:00 0.00% perl5.26= .0 56111 root 1 20 0 7752K 3180K ufs 0:00 0.00% bash 56077 root 1 52 0 6332K 2312K ufs 0:00 0.00% find 12719 cpansand 1 20 10 7060K 1472K ufs 0:00 0.00% sh $ lsof | egrep '12710|12358' | egrep home perl 12358 cpansand cwd VDIR=20=20=20=20=20=20=20=20=20=20=20= =20=20=20=20=20=20=20=20 18446735278340920664 0 / -- home/cpansand/.cpan/build/2017112212/Fsdb-2.= 64-1 perl5.26. 12710 cpansand cwd VDIR=20=20=20=20=20=20=20=20=20=20=20= =20=20=20=20=20=20=20=20 18446735278340920664 0 / -- home/cpansand/.cpan/build/2017112212/Fsdb-2.= 64-1 Perl often hangs in a directory during a build of a CPAN module. If I try to switch to this directory, my current shell will hang too: $ cd /home/cpansand/.cpan/build/2017112212/Fsdb-2.64-1 [hangs] $ find /home/cpansand/ /home/cpansand/ /home/cpansand/.login_conf /home/cpansand/.cshrc /home/cpansand/.rhosts /home/cpansand/.shrc /home/cpansand/.mail_aliases /home/cpansand/.profile /home/cpansand/.mailrc /home/cpansand/.login /home/cpansand/.ssh /home/cpansand/.ssh/authorized_keys /home/cpansand/.ssh/known_hosts /home/cpansand/.zshrc /home/cpansand/bin /home/cpansand/src /home/cpansand/.cpan /home/cpansand/.cpan/build /home/cpansand/.cpan/build/2017112215 /home/cpansand/.cpan/build/2017112212 [ hangs] $ top -b 300 | egrep ufs 12358 cpansand 1 20 10 268M 12980K ufs 0:04 0.00% perl 12710 cpansand 1 20 10 21412K 13808K ufs 0:00 0.00% perl5.26= .0 56111 root 1 20 0 7752K 3180K ufs 0:00 0.00% bash 56077 root 1 52 0 6332K 2312K ufs 0:00 0.00% find 57160 root 1 20 0 6332K 1792K ufs 0:00 0.00% find 57169 root 1 20 0 6332K 1792K ufs 0:00 0.00% find 12719 cpansand 1 20 10 7060K 1472K ufs 0:00 0.00% sh $ pstree |-+=3D 00675 root /usr/sbin/cron -s | \-+- 55939 root cron: running job (cron) | \-+=3D 55941 root /bin/sh - /usr/sbin/periodic daily | \-+- 55943 root lockf -t 0 /var/run/periodic.daily.lock /bin/sh /usr/sbin/periodic LOCKED daily | \-+- 55944 root /bin/sh /usr/sbin/periodic LOCKED daily | |-+- 55953 root /bin/sh /usr/sbin/periodic LOCKED daily | | \-+- 56059 root /bin/sh /etc/periodic/daily/450.status-security | | \-+- 56060 root /bin/sh - /usr/sbin/periodic security | | \-+- 56062 root lockf -t 0 /var/run/periodic.security.lock /bin/sh /usr/sbin/periodic LOCKED security | | \-+- 56063 root /bin/sh /usr/sbin/periodic LOCKED securi= ty | | |-+- 56071 root /bin/sh /usr/sbin/periodic LOCKED secu= rity | | | \-+- 56073 root /bin/sh - /etc/periodic/security/100.chksetuid | | | |--- 56077 root find -sx / /dev/null ( ! -fstype l= ocal ) -prune -o -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u= +s -or -perm -g+s ) -exec ls -liTd {} + | | | \-+- 56078 root /bin/sh - /etc/periodic/security/100.chksetuid | | | \--- 56080 root cat \--=3D 12358 cpansand /usr/perl5.26.0p/bin/perl5.26.0 (CPAN.pm test J/JO/JOHNH/Fsdb-2.64.tar.gz) (perl) |-+- 12710 cpansand /usr/perl5.26.0p/bin/perl5.26.0 t/test_command.t | \--- 12719 cpansand sh -c blib/script/cgi_to_db TEST/cgi_to_db_ex.in TEST/cgi_to_db_ex.in-2 < TEST/cgi_to_db_ex.in 2>&1 root 56077 0.0 0.2 6332 1936 - D 03:01 0:00.02 find = -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g= +x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} + cpansand 12710 0.0 1.4 21412 13780 0 DN+ 14:11 0:00.10 /usr/perl5.26.0p/bin/perl5.26.0 t/test_command.t $ ps -o wchan -auxgww | grep ufs=20=20=20=20 - root 19 0.0 0.0 0 16 - DL Wed02 0:03.= 12 [bufspacedaemon] ufs root 56077 0.0 0.2 6332 1936 - D 03:01 0:00.= 02 find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {}= + ufs cpansand 12358 0.0 0.7 274408 7280 0 DNs+ 14:11 0:03.= 89 /usr/perl5.26.0p/bin/perl5.26.0 (CPAN.pm test J/JO/JOHNH/Fsdb-2.64.tar.gz) (perl) ufs cpansand 12710 0.0 1.4 21412 13780 0 DN+ 14:11 0:00.= 10 /usr/perl5.26.0p/bin/perl5.26.0 t/test_command.t ufs cpansand 12719 0.0 0.1 7060 1444 0 DN+ 14:11 0:00.= 00 sh -c blib/script/cgi_to_db TEST/cgi_to_db_ex.in TEST/cgi_to_db_ex.in-2 < TEST/cgi_to_db_ex.in 2>&1 - root 78756 0.0 0.0 412 304 7 R+ 04:18 0:00.= 00 grep ufs ufs root 56111 0.0 0.3 7752 2688 8 Ds+ 03:03 0:00.= 02 -bash (bash) ufs root 57160 0.0 0.1 6332 1416 10 D+ 03:15 0:00.= 00 find . ufs root 57169 0.0 0.1 6332 1416 11 D+ 03:15 0:00.= 00 find /home/cpansand/ The perl interpreter usually crashed while testing the CPAN module J/JO/JOHNH/Fsdb-2.64.tar.gz in test t/test_command.t But I think this is a red herring. The module just runs fine if you test on= ly this module. It must be a combination of other, earlier test which finally bring Fsdb-2.64.tar.gz down. We saw this error on FreeBSD 9.x, 10.3, 11.1 Release, 11-stable, 12-current= .=20 My gut feeling is that we have a deadlock or forgotten lock of a directory/= file in the ufs code. Some kernel thread forgot to release a lock. Any other pro= cess which need to open this directory will hang, and soon or later (2-4 hours) = the machine crashes hard. Everytime the machine crashed we had to run fsck twice, repair inodes and t= he superblock on the console manually. This is not fun. We saw the errors at different cloud providers. I=E2=80=99m pretty sure it = is not a bad setup, it is a general problem of the FreeBSD kernel/ufs filesystem. --=20 You are receiving this mail because: You are the assignee for the bug.=