Date:      Mon, 24 Jan 2000 15:50:28 -0600 (CST)
From:      Sean Heber <sean@fifthace.com>
To:        freebsd-questions@freebsd.org
Subject:   Update regarding stuck file systems
Message-ID:  <Pine.BSF.4.10.10001241456160.2386-100000@marvin.fifthace.com>

Ok, you may remember my previous e-mail about this a few days ago..  I
have since done a LOT of testing.  I don't have much of a conclusion
(which is why I'm writing again).

As you may recall, my system had an odd problem.  If I ran my backup
script (which tars files on one hard drive and puts them on another hard
drive), all file system access stopped.  So, the box would still be up,
top would still be running on the console, but nothing would work because
the OS couldn't seem to read from the drive.

The kicker, though, is no error messages.  Nothing in the logs.  Nothing
on the console.  It would just stop and the processes would happily wait
for data from the drives, but none would ever come.

So, after a whole lot of swearing and Dew drinking, I have narrowed it
down only slightly.  It seems that for some reason this only happens
around 1:00 - 2:30 AM or so.  Never any other times.

For example, as I write this a backup is being performed.  For testing
purposes I've been running one backup after another since 8:00 AM (3:30 PM
now).  No problems at all.

I can't think of any reason why this would fail in the early morning hours
and never any other time.  It's not uptime related since just yesterday I
had the box up and down (while testing this) and everything was going
great.  When I tried to run the backup again around 1:30AM, it died.  I
was forced to hit the reset button.  Once the system came back up, I
figured I would try to narrow things more.  So, I unloaded vinum on my two
IDE backup drives (see below), reformatted one and gave it the same mount
point.  (So the backup would still work.  I don't need all that space just
yet.)  Once that was done, vinum was not loaded and I gave it another
shot.  The backup froze again.  The box had only been up about 30 minutes.

The first night I made the backup process, I put it at the end of my
daily.local cron script.  It runs at 1:59 or something like that.  Before
that time, the box was up for 2 days.  That first night brought it down
with a frozen file system.

The night after, I gave the backup script its own entry in crontab for
3:00AM.  It worked just fine.  When I woke up in the morning things still
worked.

Just the other night I changed the cron's run time to 12:05 AM.  That also
made it through the night just fine.

Does any of this make any sense?  It doesn't to me.

I suppose I have two basic questions here:
1) Is there any way to make this work aside from the obvious "Don't run it
between 1:00 and 2:30 AM"?  Because this really bothers me.  I have no
idea if heavy server load would cause this to happen or if this is just a
backup problem due to something stupid I'm doing.

2) I really need a better backup method.  The idea originally was to have
a duplicate structure on the backup drive as well as the main drive so
that in the event of a disk failure the broken drive could just be
unplugged.  Is that reasonable?  Obviously using tar the way I am doesn't
really allow this.  The catch (at least it seems like one to me) is the
drives are all different sizes.. (see below)


Ok, the famed "below":


Running FreeBSD 3.3-RELEASE (I had 3.4-STABLE before.  Don't ask.  Long
story.  But the problem is still the same in either case.)
SMP Kernel
256 MB RAM
Dual PII-400MHz
Currently sitting in my room with no other active users and no outside
activity via web or anything (it's still being configured, after all)

Drives:
SCSI id6: 4.5 GB (boot: /, /usr, swap)
SCSI id9: 9.0 GB (backup: /eddie)
IDE bus1master: 37 GB (data: /sites)
IDE bus1slave: none
IDE bus2master: 25 GB (backup1)
IDE bus2slave: 20 GB (backup2)

The last two backup drives are concatenated using vinum.  Mounted as
/wowbagger.

The idea is that everything on the boot SCSI drive could be on the backup
SCSI drive, and the same for the IDE.  This layout is like this because
our original plan was to have the ability to unplug the broken drive and
get things back up with minimum pain.  But using tar sort of defeats the
purpose--which is why I would like some more suggestions.  :-)

The backup script does this right now:

echo "Backup /:"
tar -cslpf /eddie/root.tar /
echo

# Backup by itself to be handy, maybe.
echo "Backup /usr/local:"
tar -clspf /eddie/usr.local.tar /usr/local
echo

echo "Backup all of /usr:"
tar -clpsf /eddie/usr.tar /usr
echo

echo "Backup /sites:"
tar -clpsf /wowbagger/sites.tar /sites
echo

Make sense?  One thing I just realized, though, is that I might hit that
famed 2GB file limit.  I imagine FreeBSD is prone to this?  Oh well.  I
need a better method anyway..
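
One possible direction, sketched here only as an assumption and not as
something already tested on this box: pipe one tar into another so the
backup drive ends up holding a plain copy of the directory tree instead of
a single archive file.  That gives the "duplicate structure" idea a chance
and avoids one huge .tar file.  The function name mirror_tree is made up
for illustration, and the sketch does not stop at file system boundaries
the way -l does in the script above.

```shell
#!/bin/sh
# mirror_tree SRC DST (hypothetical helper): copy the tree under SRC into
# DST as ordinary files by piping one tar into another.
mirror_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # "-f -" makes the archive stdout/stdin; -p keeps permissions on extract.
    # Note: files deleted from SRC are NOT removed from DST by this.
    tar -cf - -C "$src" . | tar -xpf - -C "$dst"
}

# Hypothetical usage with the mount points listed earlier:
# mirror_tree /sites /wowbagger/sites
```

The return status is that of the extracting tar, so a failure on the
reading side can go unnoticed; a real script would need to check both.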

Just so you know, here's the current df:

Filesystem        1K-blocks     Used    Avail Capacity  Mounted on
/dev/da0s1a           99183    45741    45508    50%    /
/dev/da0s1e         3713364   507654  2908641    15%    /usr
/dev/da1s1e         8679993  1227161  6758433    15%    /eddie
/dev/wd0s1e        35503710   449097 32214317     1%    /sites
/dev/vinum/vinum0  43643010   996729 39154841     2%    /wowbagger
procfs                    4        4        0   100%    /proc


As you can see, the partitions that are being backed up are not over 2GB,
so that shouldn't be the problem right now.
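
The arithmetic behind that: 2 GB is 2 x 1024 x 1024 = 2097152 1K-blocks,
and the largest Used figure among the file systems being backed up is /usr
at 507654.  A small sketch for repeating the check later, assuming the
column layout of the df output above (the name over_2gb is made up, and
Used is only a rough stand-in for the size of an uncompressed tar file):

```shell
#!/bin/sh
# over_2gb (hypothetical helper): read "df -k"-style output on stdin and
# print the mount point of every file system whose Used column (field 3)
# exceeds 2 GB, i.e. 2097152 1K-blocks.  The header line is skipped.
over_2gb() {
    awk -v limit=2097152 'NR > 1 && $3 + 0 > limit { print $NF }'
}

# Usage:
# df -k | over_2gb
```

Fed the df listing above, it should print nothing, since no Used value
exceeds the limit.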

Anyway..  I'm looking for some input here.  It's very very hard to make
this problem happen.  I can try all day and nothing will come of it, but
wait until 1:30AM or so, and it happens almost (key word) every time.  Is
something deadlocking?  Perhaps something to do with SMP?  Or am I doing
something terribly stupid?  (feel free to flame..  I need to learn
sometime, right? :-)

I hope someone has a clue of where to start digging, at least.  The last
e-mail generated one response.  The person suggested I try removing drives
one by one from the equation.  I'm going to attempt that tonight in more 
detail.  The problem is, setting the clock to 1:30 AM myself doesn't
seem to matter.  Maybe it's tied to the BIOS time...  Or perhaps it's not
time related at all and just really really coincidental that it happens
around that time all the time regardless of how long the box was up, how
hot it is, etc.
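
One way to test the time-of-day theory would be to see what else cron
starts in that window, since the daily run is at 1:59 or so.  A rough
sketch, assuming the system crontab format (minute, hour, day, month,
weekday, user, command); the function name jobs_in_window is made up, and
it only catches entries whose hour field is literally 1 or 2, not "*",
ranges, or lists:

```shell
#!/bin/sh
# jobs_in_window (hypothetical helper): read system-crontab-format lines on
# stdin and print the entries whose hour field is exactly 1 or 2.
jobs_in_window() {
    awk '
        $1 ~ /^#/ { next }      # skip comment lines
        NF < 6    { next }      # skip blank lines and VAR=value settings
        $2 ~ /^0?[12]$/         # field 2 is the hour; matching lines print
    '
}

# Usage:
# jobs_in_window < /etc/crontab
```

Anything it lists besides the backup is a candidate for running at the
same time as the tar jobs.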

l8r
Sean

PS> ARG!!!!  (This has been driving me nuts for the past 4.5 days now)


