Date:      Fri, 10 Sep 2010 10:45:08 +0200
From:      freebsd <free.bsd@webstyle.ch>
To:        freebsd-stable@freebsd.org
Subject:   strange problem with FreeBSD 7.3 64bit
Message-ID:  <4C89F014.1050601@webstyle.ch>

hi list,

we upgraded some 20 boxes from 7.1 and 7.2 to 7.3-RELEASE-p2 (all amd64)
and are now seeing some weird behaviour with rsnapshot on 6 of them:

after anywhere from a few days to several weeks (seemingly at random),
rsnapshot reports that it can't start because its lockfile and process
are still present. on such boxes either a zombie rm or find process
(presumably launched by rsnapshot) can be found.
if the backup was done to a separate partition (physical disks or RAIDs),
any access (ls, stat, fsck, etc.) to the partition would kill the current
SSH session and leave the process one just started as a new zombie.
unmounting the affected partition would render the server completely
unresponsive and require a hardware reset.

when trying to restart, the machines wouldn't even shut down completely
but hung somewhere after syncing buffers; only a hardware reset
worked. after the reboot, those partitions were unmounted and fscked,
after which the backups would work again until the next error happened.

the affected and unaffected systems run on the following hardware:

HP ProLiant DL380 G4
HP ProLiant DL380 G5
HP ProLiant DL360 G5

there is no visible pattern between affected and unaffected boxes. also 
those machines were upgraded the exact same way, running identical 
kernels (more or less GENERIC, with QUOTA activated).

we upgraded the most critical boxes, which showed that behaviour on a
daily basis, to 8.0-RELEASE, and the behaviour has not reappeared for
nearly 3 months now.

we installed a debug kernel on an affected box, but the machine wouldn't
panic when the error occurred. when trying to unmount the affected
partition it just went completely unresponsive, as mentioned above.

before trying to unmount, procstat -ak showed some processes with
VOP_LOCK1_APV in their kernel stacks:

55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire 
_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup 
vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire 
_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf 
ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
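
for anyone who wants to pick such threads out of a longer procstat -ak
dump, here is a minimal python sketch. it is only an illustration, not
something we ran on the affected boxes. the names parse_procstat_ak and
blocked_threads are made up for this example, and it assumes that every
record starts with "PID TID COMM TDNAME", that COMM and TDNAME contain no
whitespace, and that long stacks may be wrapped onto following lines as
in the paste above.

```python
import re

# a record starts with "PID TID COMM TDNAME"; the rest is the kernel stack.
# assumption: COMM and TDNAME contain no whitespace.
RECORD_START = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s*(.*)$")


def parse_procstat_ak(text):
    """Return one dict per thread found in `procstat -ak` style output."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "PID":
            continue  # skip blank lines and the column header
        match = RECORD_START.match(line)
        if match:
            pid, tid, comm, tdname, rest = match.groups()
            records.append({
                "pid": int(pid),
                "tid": int(tid),
                "comm": comm,
                "tdname": tdname,
                "stack": rest.split(),
            })
        elif records:
            # wrapped line: more stack frames for the previous thread
            records[-1]["stack"].extend(fields)
    return records


def blocked_threads(records, frame="VOP_LOCK1_APV"):
    """Return the threads whose kernel stack contains the given frame."""
    return [r for r in records if frame in r["stack"]]


# the two stacks quoted in this mail
SAMPLE = """\
55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire
_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup
vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire
_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf
ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
"""

if __name__ == "__main__":
    for thread in blocked_threads(parse_procstat_ak(SAMPLE)):
        print(thread["pid"], thread["tid"], thread["comm"])
```

on a live box the text would come from the saved output of procstat -ak
instead of SAMPLE. a different frame name can be passed as the second
argument to blocked_threads to look for other wait points.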

since this hardware worked fine before 7.3 and -- as we assume --
would work again with 8.*, we would be grateful for any hints as to what
could be the cause of all this.

kind regards
Flo


