Date:      Fri, 10 Sep 2010 12:21:51 +0300
From:      Kostik Belousov <kostikbel@gmail.com>
To:        freebsd <free.bsd@webstyle.ch>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: strange problem with FreeBSD 7.3 64bit
Message-ID:  <20100910092151.GO2465@deviant.kiev.zoral.com.ua>
In-Reply-To: <4C89F014.1050601@webstyle.ch>
References:  <4C89F014.1050601@webstyle.ch>

On Fri, Sep 10, 2010 at 10:45:08AM +0200, freebsd wrote:
> hi list,
>
> we upgraded some 20 boxes from 7.1 and 7.2 to 7.3-RELEASE-p2 (all amd64)
> and now are experiencing some weird behaviour on 6 of them with rsnapshot:
>
> after a few days/several weeks (seems to be completely random),
> rsnapshot reports that it can't start due to its lockfile and process
> still being present. on such boxes either a zombie rm or find process
> (which presumably were launched by rsnapshot) can be found.
> if the backup was done to a separate partition (physical disks or RAIDs)
> any access (ls, stat, fsck, etc) to the partition would kill the current
> SSH session, creating a new zombie of the process one just started.
> unmounting the affected partition would render the server completely
> unresponsive and require a hardware reset.
>
> when trying to restart, the machines wouldn't even shut down completely
> but hung somewhere after syncing buffers; only a hardware reset
> worked. after the reboot, those partitions were unmounted and fscked,
> after which the backups would work again until the next error happened.
>
> the hardware of the affected and unaffected systems is:
>
> HP ProLiant DL380 G4
> HP ProLiant DL380 G5
> HP ProLiant DL360 G5
>
> there is no visible pattern between affected and unaffected boxes. also
> those machines were upgraded the exact same way, running identical
> kernels (more or less GENERIC, with QUOTA activated).
>
> we upgraded the most critical boxes, which showed that behaviour
> daily, to 8.0-RELEASE, and the behaviour has not recurred in
> nearly 3 months now.
>
> we installed a debug-kernel on an affected box, but the machine wouldn't
> panic when the error occurred. when trying to unmount the affected
> partition it just went completely unresponsive, as mentioned above.
>
> before trying to unmount, procstat -ak showed some processes with
> VOP_LOCK1_APV:
>
> 55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup
> vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
> 70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf
> ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
>
> since this hardware has been working before 7.3 and -- as we assume --
> would work again with 8.*, we would be grateful for any hints what could
> be the cause of all this.

It sounds like a deadlock, but the cause cannot be identified without
further diagnostics. It might be the driver (ciss, I assume), but it may
also be the quota code, or even something else.

Please follow the instructions at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
to obtain the required information.
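
While collecting that information, it may help to pick the threads stuck
on vnode locks out of a saved procstat -ak dump. The following is only a
rough sketch in Python, not part of the handbook procedure. The function
name blocked_threads and the frame list are hypothetical, and it assumes
each thread occupies one line of the form "PID TID COMM TDNAME frame ..."
with no spaces inside COMM or TDNAME. The sample text reuses the find
stack quoted above.

```python
# Hypothetical helper for filtering saved `procstat -ak` output.
# Assumption: one thread per line, whitespace-separated fields in the order
# PID TID COMM TDNAME frame1 frame2 ...; COMM and TDNAME contain no spaces.

# Frames taken from the stacks quoted in this thread; adjust as needed.
LOCK_FRAMES = ("VOP_LOCK1_APV", "_lockmgr")


def blocked_threads(procstat_text, frames=LOCK_FRAMES):
    """Return (pid, tid, comm, stack) for threads whose stack has any of `frames`."""
    hits = []
    for line in procstat_text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a thread record.
        if len(fields) < 5 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        pid, tid, comm = int(fields[0]), int(fields[1]), fields[2]
        stack = fields[4:]  # fields[3] is the thread name column
        if any(frame in stack for frame in frames):
            hits.append((pid, tid, comm, stack))
    return hits


# Sample input: the header is an assumption about the column titles,
# the thread line is the `find` stack from the report, joined onto one line.
SAMPLE = (
    "  PID    TID COMM TDNAME KSTACK\n"
    "55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire "
    "_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup "
    "vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall\n"
)

if __name__ == "__main__":
    for pid, tid, comm, stack in blocked_threads(SAMPLE):
        # Print the thread and the syscall-side end of its stack.
        print(pid, tid, comm, "blocked; outermost frame:", stack[-1])
```

This only narrows down which threads are waiting; it does not show who
holds the lock, so the DDB output described in the handbook is still
needed.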
