Date:      Mon, 3 Jul 95 21:43 CDT
From:      uhclem%nemesis@fw.ast.com (Frank Durda IV)
To:        bugs@freebsd.org
Subject:   State of Problem 389 (and 392)?
Message-ID:  <m0sSxxM-0004w1C@nemesis.lonestar.org>

Has anybody looked into problem 389 since it was reported back in May?   This
had to do with the filesystem being corrupted by lots of file/directory
deletions and file/directory creations going on at the same time.   You
eventually end up with directories that can't be deleted by rmdir because the
link counts are wrong.   Then you must run fsck two or three times to
completely straighten things out.   This still happens in 2.0.5R.

Two of my client sites are really bugging me about this, as they clean
the filesystems every day and encounter the residue of this bug.  It makes
them paranoid.

There was a similar problem with DOS file systems that was reported under
392 and has apparently been closed, but I see no evidence of it being fixed.
If anyone knows what happened to 392, I'd like to know.   Thanks.

						Frank Durda IV
						uhclem%nemesis@fw.ast.com

Here is the 389 report again.

>Number:         389
>Category:       bin
>Synopsis:       Simultaneous creation/deletion of dirs corrupts filesystem [FDIV024]
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs (FreeBSD bugs mailing list)
>State:          open
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon May  8 21:20:00 1995
>Originator:     Frank Durda IV
>Organization:
>Release:        FreeBSD 2.0.950412-SNAP i386 and FreeBSD 1.1.5.1
>Environment:

[FDIV024]

FreeBSD 2.0.950412-SNAP i386 	(also on 2.0.5R)
Stock kernel, "make world" kernel, or custom kernel.
Problem also noted in FreeBSD 1.1.5.1 on stock and custom kernel.

>Description:

On my 1.1.5.1 system, I discovered that I frequently ended up with
directories in my news partition that could not be deleted.  rmdir refused
to delete the directories because of bad link counts.  Running fsck at least
two times would correct the link counts so that the directories
could be deleted.

I recently discovered that I could cause bogus link counts on demand, simply
by trying to remove files and directories while other processes were
trying to create files and directories in the same tree.

In my case, I was doing some rm -rf commands on selected portions of the
newsgroups to obtain space, but at the same time the cnews system was
injecting new articles and re-creating some of the directories I was
deleting.  Note that the partition DOES NOT have to be low on space to
create the problem.  I reproduced it on a root filesystem that had 7.7 Meg
free in the worst case.

I tested the latest snapshot and determined the problem still exists.

>How-To-Repeat:

By using tar and rm I can reproduce the problem on the latest SNAP
or 1.1.5.1.

In my case, I created a tar file containing about 6 Meg of a heavily
expired alt.* tree using  
	cd /usr/spool/news/alt
	tar cvf /tmp/news.tar *
FYI, the alt tree consisted of 538 directories and 1684 files.  It seems
more important to have a large number of directories than it is to
have lots of files.  Using the news tree provided this but the failure
can probably be caused by using other distribution trees that have lots
of directories and small files.

Now log in as root on at least two screens of the system under test.
On screen 1,
	cd /
	mkdir test
	cd test
Now ftp the news.tar file from the remote system to this location.
DO NOT USE /tmp in place of /test!  (If you crash, you lose things.)
	mkdir scramble
	cd scramble
	tar xvf ../news.tar
	sync
You can fsck here to verify things are sane at this point if you want.

Now that the news tree is extracted, begin to exercise the system.
The numbers indicate which virtual screen to use for the commands:
	1	tar xvf ../news.tar &
	2	rm -rf [l-r]* &
	2	rm -rf [a-k]* &
	2	rm -rf [0-9]* &
	2	rm -rf [s-z]* &
Now monitor on screen 1 until the tar is about half-way through
(by directory), and then repeat all of the above commands.
	
Now wait until both tars complete and wait for all of the rm's
to finish.  Then issue:
	rm -rf * 
and note any "Directory not removed..." messages.

If the rm finishes and you didn't get any error messages, start over,
and maybe start three cycles of extract and rm running at once.
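
The manual procedure above can be approximated by a small script.  This is
only a sketch, not the exact sequence used for this report.  The function
name race_once, the one-second stagger between rounds (instead of watching
by eye for the half-way point), and dropping tar's "v" flag to keep the
output quiet are all assumptions made for illustration.  Run it only in a
scratch directory on a filesystem you can afford to fsck.

```sh
#!/bin/sh
# Sketch of the extract/remove race described in this report.
# race_once is a hypothetical helper name, not part of any FreeBSD tool.
#
# race_once TARFILE WORKDIR ROUNDS
#   TARFILE must be an absolute path.  WORKDIR must already exist and
#   should contain nothing you care about.
race_once() {
    (
        tarfile=$1
        workdir=$2
        rounds=$3
        cd "$workdir" || exit 1

        i=0
        while [ "$i" -lt "$rounds" ]; do
            # One extraction racing against four removals, using the
            # same glob ranges as the manual procedure.
            tar xf "$tarfile" 2>/dev/null &
            rm -rf [l-r]* 2>/dev/null &
            rm -rf [a-k]* 2>/dev/null &
            rm -rf [0-9]* 2>/dev/null &
            rm -rf [s-z]* 2>/dev/null &
            # Assumption: a fixed delay stands in for "until the tar is
            # about half-way through".
            sleep 1
            i=$((i + 1))
        done

        # Wait for every background tar and rm started above.
        wait

        # Final cleanup; error messages from rm stay visible on stderr.
        rm -rf ./*

        # Print whatever could not be removed (empty on a healthy run).
        ls -A
    )
}
```

For example, "race_once /test/news.tar /test/scramble 3" would start three
overlapping cycles.  Any names it prints afterwards are candidates for the
undeletable directories shown below; an empty result means the run did not
trigger the problem and can be repeated.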

	[WARNING - Doing too many extract/rm pairs at once caused the
	processes to hang with no disk I/O.  Characters were echoed
	(for a while) and CAPS LOCK toggles.  Then the system output a
	message indicating that syslogd had terminated and that it was
	syncing disks.  However it just hung there and never halted.
	This only happened once and may be related to the VNODE lock
	problem.  I think this lock/shutdown is unrelated to the
	problem I am reporting.  My systems have between 8 and 12 Meg of RAM.]

Using the above procedure, I eventually ended up with the
following undeletable directories:

ls -aliR
total 5
 9032 drwxrwxr-x   4 root  bin    3072 May  8 21:55 .
  142 drwxrwxr-x   3 root  wheel   512 May  8 21:55 ..
13788 drwxrwxr-x   5 news  news    512 May  8 21:49 politics
13524 drwxrwxr-x  10 news  news    512 May  8 21:49 society

scramble/politics:
total 4
13788 drwxrwxr-x  5 news  news   512 May  8 21:49 .
 9032 drwxrwxr-x  4 root  bin   3072 May  8 21:55 ..

scramble/society:
total 4
13524 drwxrwxr-x  10 news  news   512 May  8 21:49 .
 9032 drwxrwxr-x   4 root  bin   3072 May  8 21:55 ..
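
In the listing above, "politics" and "society" show link counts of 5 and 10
although neither has any visible subdirectory.  A rough way to spot such
directories without waiting for rmdir to fail is sketched here.  It assumes
the usual UFS/FFS convention that a directory's link count equals 2 plus the
number of its subdirectories, which does not hold on every filesystem.  The
function names scan_dirs and report_bad_links are hypothetical, and path
names containing newlines are not handled.

```sh
#!/bin/sh
# Sketch: flag directories whose link count differs from
# 2 + (number of subdirectories).  Both function names are hypothetical.

# scan_dirs ROOT
# Print "<links> <subdirs> <path>" for every directory under ROOT.
scan_dirs() {
    find "$1" -type d -print | while IFS= read -r dir; do
        # Field 2 of "ls -ld" is the link count.
        links=$(ls -ld "$dir" | awk '{ print $2 }')
        subdirs=0
        # The three globs cover normal names and dot-names except . and ..
        for entry in "$dir"/* "$dir"/.[!.]* "$dir"/..?*; do
            # [ -d ] follows symlinks, so exclude them explicitly.
            if [ -d "$entry" ] && [ ! -L "$entry" ]; then
                subdirs=$((subdirs + 1))
            fi
        done
        echo "$links $subdirs $dir"
    done
}

# report_bad_links
# Read scan_dirs output on stdin; print only the mismatching directories.
report_bad_links() {
    awk '{
        expected = 2 + $2
        if ($1 != expected) {
            path = $0
            sub(/^[0-9]+ [0-9]+ /, "", path)
            printf "%s: count %d should be %d\n", path, $1, expected
        }
    }'
}
```

Usage would be "scan_dirs /test/scramble | report_bad_links".  Fed the
counts from the listing above (5 links and 0 subdirectories for politics,
10 and 0 for society, 4 and 2 for scramble itself), report_bad_links flags
politics and society and stays silent about scramble.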

I then sync'ed and halted the system.  On reboot, I ran fsck
with these results:

fsck -y /dev/wd0a
** /dev/rwd0a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
UNREF DIR  I=13581  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=13581 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=13578  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=13578 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=13544  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=13544 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=13792  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:47 1995 
RECONNECT? [yn] 
DIR I=13792 CONNECTED. PARENT WAS I=13788

UNREF DIR  I=13539  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=13539 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=13555  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=13555 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=13536  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=13536 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=9037  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=9037 CONNECTED. PARENT WAS I=13524

UNREF DIR  I=399  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:47 1995 
RECONNECT? [yn] 
DIR I=399 CONNECTED. PARENT WAS I=13788

UNREF DIR  I=4892  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:47 1995 
RECONNECT? [yn] 
DIR I=4892 CONNECTED. PARENT WAS I=13788

UNREF DIR  I=166  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995 
RECONNECT? [yn] 
DIR I=166 CONNECTED. PARENT WAS I=13524

** Phase 4 - Check Reference Counts
LINK COUNT DIR I=166  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=399  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:47 1995  COUNT 2 SHOULD BE 3
ADJUST? [yn] 
LINK COUNT DIR I=4892  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:47 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=9037  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13536  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13539  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13544  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13555  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13578  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13581  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:43 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13792  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:47 1995  COUNT 1 SHOULD BE 2
ADJUST? [yn] 
** Phase 5 - Check Cyl groups
CLEAN FLAG NOT SET IN SUPERBLOCK
FIX? [yn] 
924 files, 43271 used, 32792 free (272 frags, 4065 blocks, 0.4% fragmentation)

***** FILE SYSTEM WAS MODIFIED *****

***** REBOOT NOW *****


Now I re-ran fsck because in the past it always took multiple passes to
really correct the problems:

fsck -y /dev/wd0a
** /dev/rwd0a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
LINK COUNT DIR I=13524  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:49 1995  COUNT 10 SHOULD BE 2
ADJUST? [yn] 
LINK COUNT DIR I=13788  OWNER=news MODE=40775
SIZE=512 MTIME=May  8 21:49 1995  COUNT 5 SHOULD BE 2
ADJUST? [yn] 
** Phase 5 - Check Cyl groups
924 files, 43271 used, 32792 free (272 frags, 4065 blocks, 0.4% fragmentation)

***** FILE SYSTEM WAS MODIFIED *****

***** REBOOT NOW *****

Finally, I re-ran fsck a third time:

** /dev/rwd0a
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
924 files, 43271 used, 32792 free (272 frags, 4065 blocks, 0.4% fragmentation)
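
Since it took three fsck runs before one came back without changes, the
repetition can be scripted.  The sketch below is an illustration only: the
function name fsck_until_clean is hypothetical, and it decides whether to
run again purely by looking for the "FILE SYSTEM WAS MODIFIED" banner seen
in the output above rather than relying on any exit status.  It does not
address whether a reboot is needed between passes on a mounted root
filesystem.

```sh
#!/bin/sh
# Sketch: repeat a checker command until its output no longer contains
# the "FILE SYSTEM WAS MODIFIED" banner.  fsck_until_clean is a
# hypothetical helper name.
#
# fsck_until_clean MAXPASSES COMMAND [ARGS...]
#   Prints each pass's output on stderr and the number of passes needed
#   on stdout.  Returns 1 if MAXPASSES runs all reported modifications.
#   COMMAND must not prompt (for fsck, that means using -y or -n),
#   because its output is captured rather than shown live.
fsck_until_clean() {
    max=$1
    shift
    pass=1
    while [ "$pass" -le "$max" ]; do
        # Capture the whole run first so the checker is never cut short.
        out=$("$@" 2>&1)
        printf '%s\n' "--- pass $pass ---" "$out" >&2
        case $out in
            *"FILE SYSTEM WAS MODIFIED"*)
                pass=$((pass + 1))
                ;;
            *)
                echo "$pass"
                return 0
                ;;
        esac
    done
    return 1
}
```

A call such as "fsck_until_clean 5 fsck -y /dev/wd0a" would stop at the
first pass that makes no changes, or give up after five passes.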

OK, here is what the directory looks like now:


total 6102
 142 drwxrwxr-x   3 root  wheel      512 May  8 22:04 .
   2 drwxr-xr-x  17 root  wheel      512 May  8 21:55 ..
 143 -rw-rw-r--   1 root  wheel      505 May  8 21:55 sample1	*
 145 -rw-rw-r--   1 root  wheel     3135 May  8 22:01 sample2	*
 146 -rw-rw-r--   1 root  wheel      588 May  8 22:02 sample3	*
 147 -rw-rw-r--   1 root  wheel      297 May  8 22:02 sample4	*
 148 -rw-rw-r--   1 root  wheel        0 May  8 22:04 sample5	*
9032 drwxrwxr-x   4 root  bin       3072 May  8 21:55 scramble
 144 -rw-rw-r--   1 root  wheel  6225920 May  8 21:55 news.tar

./scramble:
total 5
 9032 drwxrwxr-x  4 root  bin    3072 May  8 21:55 .
  142 drwxrwxr-x  3 root  wheel   512 May  8 22:04 ..
13788 drwxrwxr-x  2 news  news    512 May  8 21:49 politics
13524 drwxrwxr-x  2 news  news    512 May  8 21:49 society

./scramble/politics:
total 4
13788 drwxrwxr-x  2 news  news   512 May  8 21:49 .
 9032 drwxrwxr-x  4 root  bin   3072 May  8 21:55 ..

./scramble/society:
total 4
13524 drwxrwxr-x  2 news  news   512 May  8 21:49 .
 9032 drwxrwxr-x  4 root  bin   3072 May  8 21:55 ..

* are the "tee" logs of fsck and ls for the bug report.  They were
  written to a different partition and moved back to this location after
  the fscks completed and the system was rebooted.

At this point, "politics" and "society" could be deleted with rmdir.
(The directories and their files reconnected by fsck land in lost+found.)


>Fix:
	
Not known.

*END*

>Audit-Trail:
>Unformatted:

*END2*



