Date:      Wed, 15 Oct 2008 01:35:38 -0700
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        Peter Jeremy <peterjeremy@optushome.com.au>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: System hanging during dump
Message-ID:  <20081015083538.GA72190@icarus.home.lan>
In-Reply-To: <20081015082428.GE26536@server.vk2pj.dyndns.org>
References:  <20081015082428.GE26536@server.vk2pj.dyndns.org>

On Wed, Oct 15, 2008 at 07:24:28PM +1100, Peter Jeremy wrote:
> Last night, I attempted a full, compressed backup of my 181GB /home
> (on a PATA disk) to a remote system.  The backup started at 2159 and
> everything appeared normal until about 0040 when the system became
> non-responsive and this lasted until the dump completed at 1033.  This
> is the first full backup of /home I've made for several years (due to
> lack of space).
> 
> I noticed the non-responsiveness at about 0500 when:
> - The dump, gzip and fifo pipeline were running normally.
> - A 'systat -v' I had started was running normally (though it
>   reported an excessive number of 'D' processes).  Other values
>   all appeared normal.
> - No response to return key at a zsh prompt
> - No response to up/down arrows in mutt
> [above all done in pre-existing ssh sessions from another host]
> - telnet to port 22 connected but didn't produce a banner.
> 
> The duration above is based on system logs - which show nothing
> happened during this period.  At the end, there were various anomalous
> entries:
> Oct 15 10:33:27 server ntpd[750]: too many recvbufs allocated (40)
> Oct 15 10:33:30 server sshd[947]: error: accept: Software caused connection abort
> Oct 15 10:33:34 server kernel: TCP: [192.168.123.123]:59516 to [192.168.123.200]:25 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
> 
> Possibly useful information:
> The dump pipeline was:
> dump -uaL0 -C 32 -f - /home | reblock | gzip [stdout connected to socket
> to remote server]
> 'reblock' is basically a 200MB FIFO I wrote to desynchronise the (often
> I/O bound) dump from the CPU-bound gzip.
> 
> server% uname -a
> FreeBSD server.vk2pj.dyndns.org 7.0-STABLE FreeBSD 7.0-STABLE #18: Sun May 18 15:02:39 EST 2008     root@server.vk2pj.dyndns.org:/var/obj/k7/usr/src/sys/server  i386
> server% df -ki
> Filesystem  1024-blocks      Used   Avail Capacity iused    ifree %iused  Mounted on
> /dev/ad0s3d   204648864 181911710 6365246    97% 1703016 11353942   13%   /home
> 
> About the only thing that happened at around this time was nightly
> updates.  These start at 0005, fetching CTM cvs-cur updates, applying
> them to /home/ncvs, then cvs updating /home/ports.  Looking at
> timestamps, /home/ports/graphics/icod/CVS/Entries was updated at
> 0042 and /home/ports/graphics/imlib2_loaders/CVS/Entries (the next
> entry) was updated at 1034.
> 
> Whilst /home is fairly full, I can't see that the snapshot meta and
> rollback data would have occupied the 20GB free (and no 'out-of-space'
> messages were generated).  Is there some limit on the number of inodes
> that can be updated whilst a snapshot exists?
> 
> Has anyone else seen anything similar?

It's a known problem documented in my Wiki -- see "dump/restore".  Note
the part about UFS2 snapshot generation.  I'm almost certain this is
what you're describing.

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

This is one of the many reasons why I moved our backup infrastructure
over to use rsnapshot/rsync, despite the atime modification problem.
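
As a side note on the free-space question in the quoted message, the
numbers in the quoted 'df -ki' line can be cross-checked with a short
sketch.  The three block counts are copied from that line; treating
the gap between unused space and "Avail" as the filesystem's reserved
space is an assumption on my part (the figures happen to be consistent
with an 8% reserve), not something stated in the thread.

```python
# Figures copied from the quoted `df -ki` line for /home (1024-byte blocks).
total_kb = 204648864
used_kb = 181911710
avail_kb = 6365246

# Blocks not holding data, counting space that df does not report as available.
unused_kb = total_kb - used_kb

# Assumption: the difference between unused and "Avail" is reserved space.
reserved_kb = unused_kb - avail_kb


def to_gib(kb):
    """Convert a count of 1024-byte blocks to GiB."""
    return kb / (1024 * 1024)


# Reserve as a share of the whole filesystem.
reserved_pct = 100 * reserved_kb / total_kb

# df's Capacity column is based on used / (used + avail).
capacity_pct = 100 * used_kb / (used_kb + avail_kb)

print(f"unused:   {to_gib(unused_kb):.1f} GiB")
print(f"avail:    {to_gib(avail_kb):.1f} GiB")
print(f"reserved: {to_gib(reserved_kb):.1f} GiB ({reserved_pct:.1f}% of total)")
print(f"capacity: {capacity_pct:.1f}%")
```

Under that assumption, the "20GB free" figure matches total minus used,
while the "Avail" column is considerably smaller.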

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



