Date: Tue, 12 Dec 2017 19:02:29 +0000 From: bugzilla-noreply@freebsd.org To: freebsd-bugs@FreeBSD.org Subject: [Bug 224292] processes are hanging in state ufs / possible deadlock in file system Message-ID: <bug-224292-8@https.bugs.freebsd.org/bugzilla/>
index | next in thread | raw e-mail
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224292 Bug ID: 224292 Summary: processes are hanging in state ufs / possible deadlock in file system Product: Base System Version: CURRENT Hardware: Any OS: Any Status: New Severity: Affects Only Me Priority: --- Component: kern Assignee: freebsd-bugs@FreeBSD.org Reporter: wosch@FreeBSD.org A friend and I are testing perl modules on FreeBSD, on all the major versions (12-current, 11-stable, 11-RELEASE). The tests are done in a cheap VM (1GB RAM, 40GB SSD, 1 core), apparently running on virtio/kvm. While testing we noticed that the VM randomly crashes every 1-3 days, and we had to run a hard fsck on the console. I first thought it is a resource problem, but munin doesn’t show any shortage of RAM, disk or file descriptors. We find out that the crashes occurred shortly after processes hanging in state “ufs”, as shown in top(1) output: $ top -b 300 | egrep ufs 73278 root 1 31 0 8364K 2340K ufs 0:02 10.35% find 34548 cpansand 1 52 10 43828K 15348K ufs 0:00 0.00% perl 34824 cpansand 1 20 0 21060K 1960K ufs 0:00 0.00% procstat 34557 cpansand 1 52 10 13180K 2112K ufs 0:00 0.00% sh $ ps xuawww | egrep 73278 root 73278 0.0 0.2 8364 2340 - D 03:01 0:02.25 find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} + root 73667 0.0 0.0 412 312 8 R+ 03:08 0:00.00 egrep 73278 $ ps xuawww | egrep 34548 cpansand 34548 0.0 1.5 43828 15348 4 DN+ 15:29 0:00.10 /usr/perl5.26.1p/bin/perl t/test_command.t Another example: $ lsof | egrep 12710 perl5.26. 12710 cpansand cwd VDIR 18446735278340920664 0 / -- home/cpansand/.cpan/build/2017112212/Fsdb-2.64-1 perl5.26. 12710 cpansand rtd VDIR 7507057954083258955 0 / (/dev/vtbd0p3) perl5.26. 12710 cpansand 0u VCHR 0,90 0t20972 90 /dev/pts/0 perl5.26. 12710 cpansand 1u PIPE 0xfffff8000399f168 0 ->0xfffff8000399f000 perl5.26. 12710 cpansand 2u PIPE 0xfffff8001c30da50 0 ->0xfffff8001c30d8e8 perl5.26. 12710 cpansand 3u PIPE 0xfffff8001c30ebe0 16384 ->0xfffff8001c30ed48 perl5.26. 12710 cpansand 5u PIPE 0xfffff8000399f168 0 ->0xfffff8000399f000 perl5.26. 12710 cpansand 6u PIPE 0xfffff8001c30da50 0 ->0xfffff8001c30d8e8 perl5.26. 12710 cpansand 7u PIPE 0xfffff8000399f168 0 ->0xfffff8000399f000 perl5.26. 12710 cpansand 8u PIPE 0xfffff8001c30da50 0 ->0xfffff8001c30d8e8 $ top -b 300 | egrep ufs 12358 cpansand 1 20 10 268M 12980K ufs 0:04 0.00% perl 12710 cpansand 1 20 10 21412K 13808K ufs 0:00 0.00% perl5.26.0 56111 root 1 20 0 7752K 3180K ufs 0:00 0.00% bash 56077 root 1 52 0 6332K 2312K ufs 0:00 0.00% find 12719 cpansand 1 20 10 7060K 1472K ufs 0:00 0.00% sh $ lsof | egrep '12710|12358' | egrep home perl 12358 cpansand cwd VDIR 18446735278340920664 0 / -- home/cpansand/.cpan/build/2017112212/Fsdb-2.64-1 perl5.26. 12710 cpansand cwd VDIR 18446735278340920664 0 / -- home/cpansand/.cpan/build/2017112212/Fsdb-2.64-1 Perl often hangs in a directory during a build of a CPAN module. If I try to switch to this directory, my current shell will hang too: $ cd /home/cpansand/.cpan/build/2017112212/Fsdb-2.64-1 [hangs] $ find /home/cpansand/ /home/cpansand/ /home/cpansand/.login_conf /home/cpansand/.cshrc /home/cpansand/.rhosts /home/cpansand/.shrc /home/cpansand/.mail_aliases /home/cpansand/.profile /home/cpansand/.mailrc /home/cpansand/.login /home/cpansand/.ssh /home/cpansand/.ssh/authorized_keys /home/cpansand/.ssh/known_hosts /home/cpansand/.zshrc /home/cpansand/bin /home/cpansand/src /home/cpansand/.cpan /home/cpansand/.cpan/build /home/cpansand/.cpan/build/2017112215 /home/cpansand/.cpan/build/2017112212 [ hangs] $ top -b 300 | egrep ufs 12358 cpansand 1 20 10 268M 12980K ufs 0:04 0.00% perl 12710 cpansand 1 20 10 21412K 13808K ufs 0:00 0.00% perl5.26.0 56111 root 1 20 0 7752K 3180K ufs 0:00 0.00% bash 56077 root 1 52 0 6332K 2312K ufs 0:00 0.00% find 57160 root 1 20 0 6332K 1792K ufs 0:00 0.00% find 57169 root 1 20 0 6332K 1792K ufs 0:00 0.00% find 12719 cpansand 1 20 10 7060K 1472K ufs 0:00 0.00% sh $ pstree |-+= 00675 root /usr/sbin/cron -s | \-+- 55939 root cron: running job (cron) | \-+= 55941 root /bin/sh - /usr/sbin/periodic daily | \-+- 55943 root lockf -t 0 /var/run/periodic.daily.lock /bin/sh /usr/sbin/periodic LOCKED daily | \-+- 55944 root /bin/sh /usr/sbin/periodic LOCKED daily | |-+- 55953 root /bin/sh /usr/sbin/periodic LOCKED daily | | \-+- 56059 root /bin/sh /etc/periodic/daily/450.status-security | | \-+- 56060 root /bin/sh - /usr/sbin/periodic security | | \-+- 56062 root lockf -t 0 /var/run/periodic.security.lock /bin/sh /usr/sbin/periodic LOCKED security | | \-+- 56063 root /bin/sh /usr/sbin/periodic LOCKED security | | |-+- 56071 root /bin/sh /usr/sbin/periodic LOCKED security | | | \-+- 56073 root /bin/sh - /etc/periodic/security/100.chksetuid | | | |--- 56077 root find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} + | | | \-+- 56078 root /bin/sh - /etc/periodic/security/100.chksetuid | | | \--- 56080 root cat \--= 12358 cpansand /usr/perl5.26.0p/bin/perl5.26.0 (CPAN.pm test J/JO/JOHNH/Fsdb-2.64.tar.gz) (perl) |-+- 12710 cpansand /usr/perl5.26.0p/bin/perl5.26.0 t/test_command.t | \--- 12719 cpansand sh -c blib/script/cgi_to_db TEST/cgi_to_db_ex.in TEST/cgi_to_db_ex.in-2 < TEST/cgi_to_db_ex.in 2>&1 root 56077 0.0 0.2 6332 1936 - D 03:01 0:00.02 find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} + cpansand 12710 0.0 1.4 21412 13780 0 DN+ 14:11 0:00.10 /usr/perl5.26.0p/bin/perl5.26.0 t/test_command.t $ ps -o wchan -auxgww | grep ufs - root 19 0.0 0.0 0 16 - DL Wed02 0:03.12 [bufspacedaemon] ufs root 56077 0.0 0.2 6332 1936 - D 03:01 0:00.02 find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( -perm -u+x -or -perm -g+x -or -perm -o+x ) ( -perm -u+s -or -perm -g+s ) -exec ls -liTd {} + ufs cpansand 12358 0.0 0.7 274408 7280 0 DNs+ 14:11 0:03.89 /usr/perl5.26.0p/bin/perl5.26.0 (CPAN.pm test J/JO/JOHNH/Fsdb-2.64.tar.gz) (perl) ufs cpansand 12710 0.0 1.4 21412 13780 0 DN+ 14:11 0:00.10 /usr/perl5.26.0p/bin/perl5.26.0 t/test_command.t ufs cpansand 12719 0.0 0.1 7060 1444 0 DN+ 14:11 0:00.00 sh -c blib/script/cgi_to_db TEST/cgi_to_db_ex.in TEST/cgi_to_db_ex.in-2 < TEST/cgi_to_db_ex.in 2>&1 - root 78756 0.0 0.0 412 304 7 R+ 04:18 0:00.00 grep ufs ufs root 56111 0.0 0.3 7752 2688 8 Ds+ 03:03 0:00.02 -bash (bash) ufs root 57160 0.0 0.1 6332 1416 10 D+ 03:15 0:00.00 find . ufs root 57169 0.0 0.1 6332 1416 11 D+ 03:15 0:00.00 find /home/cpansand/ The perl interpreter usually crashed while testing the CPAN module J/JO/JOHNH/Fsdb-2.64.tar.gz in test t/test_command.t But I think this is a red herring. The module just runs fine if you test only this module. It must be a combination of other, earlier test which finally bring Fsdb-2.64.tar.gz down. We saw this error on FreeBSD 9.x, 10.3, 11.1 Release, 11-stable, 12-current. My gut feeling is that we have a deadlock or forgotten lock of a directory/file in the ufs code. Some kernel thread forgot to release a lock. Any other process which need to open this directory will hang, and soon or later (2-4 hours) the machine crashes hard. Everytime the machine crashed we had to run fsck twice, repair inodes and the superblock on the console manually. This is not fun. We saw the errors at different cloud providers. I’m pretty sure it is not a bad setup, it is a general problem of the FreeBSD kernel/ufs filesystem. -- You are receiving this mail because: You are the assignee for the bug.home | help
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?bug-224292-8>
