Date: Sun, 2 Sep 2012 14:52:28 +0200
From: Mateusz Guzik
To: Marcel Moolenaar
Cc: freebsd-fs@freebsd.org, Grzegorz Bernacki
Subject: Re: NANDFS: out of space panic
Message-ID: <20120902125228.GA29075@dft-labs.eu>
In-Reply-To: <96DC4416-6CA5-45B4-B790-068797FAA2C6@xcllnt.net>
References: <96DC4416-6CA5-45B4-B790-068797FAA2C6@xcllnt.net>

On Fri, Aug 31, 2012 at 01:09:40PM +0100, Marcel Moolenaar wrote:
> On Aug 22, 2012, at 4:34 PM, Boris Astardzhiev wrote:
> > Now when I attempt to delete /mnt/file1234 I get a panic:
> > root@smartcpe:/mnt # rm file1234
> > panic: bmap_truncate_mapping: error 28 when truncate at level 1
>

I think this is a step in the right direction and should help with the
originally reported test case:
http://people.freebsd.org/~mjg/patches/nandfs-out-of-space1.diff

> 2. There's a real bug. For me it gives the following panic
> (by virtue of me changing the behaviour of point 1):
>
> nandfs_new_segment: cannot create segment error 1
> create_segment: cannot create next segment
[snip]
> panic: brelse: not dirty
> cpuid = 0
> KDB: enter: panic
>

While the error handling in this case is clearly bogus, I believe the
sole fact that nandfs ran out of free segments is a sign of another,
more important bug.

This mail ended up quite lengthy and has some repetitions, so sorry for
that. I have not touched this filesystem for a couple of months now and
may remember things incorrectly. The ideas presented here are my own
and may or may not be of any value.
Some definitions to help avoid confusion:

segment - fixed-size contiguous area filled with blocks containing user
          and filesystem data; a partition consists of some number of
          segments
sufile  - segment usage file (tells you which segments are free and so on)
datfile - virtual-to-physical block translation map
ifile   - inode file, contains all user file inodes

Free space is reclaimed as follows:
1. the cleaner reads n segments and dirties the blocks that are still
   in use
2. the syncer writes the dirtied blocks into new segments, along with an
   updated sufile, and erases the old segments
3. repeat with the next n segments; when the end of the partition is
   reached, start over from the beginning

In other words, nandfs needs some free space in order to reclaim
anything. Thus, if the user is allowed to use up all available
segments, nandfs is unable to clean up.

The fs should allow the user to write data only up to some point. After
that point is reached it should still be possible to remove at least
some of the data (which results in writes), and after that it should be
possible to reclaim free space (which results in additional writes).

So we need either a safe enough first threshold (i.e. you can reach it,
delete everything from the fs, and it still has room to clean up) or a
safe enough second threshold (you are allowed to delete stuff only up
to some point). In both cases the fs can return ENOSPC, or it can try
to adapt to the situation by suspending write operations and trying to
free up more space than it would under normal conditions.

nandfs currently maintains only one threshold and returns ENOSPC when
it is reached. Only removal operations are then allowed (as noted
earlier, these cause additional writes; the threshold is simply ignored
for them), and unfortunately this can leave the fs without any free
segments. So this is a "first threshold" with an incorrect value. A
rough sketch of the kind of check I have in mind is appended at the end
of this mail.

Some ways in which nandfs could adapt:

Less coding: temporarily increase the number of segments scanned per
iteration, or the frequency of iterations, until an acceptable amount
of free space is reached (or there is no more space to reclaim).

More coding: scan the entire filesystem and free up the top n segments
with stale blocks. Possibly track this information continuously, so
that a full scan is needed only once per mount.

Another thing that could help is reducing the amount of data written
during deletion. I believe that during large file removal intermediate
bmap blocks can be written out, even though a segment later such blocks
become stale.

The datfile and ifile never shrink. So if you happen to "free" all
virtual block numbers from a given datfile block, that block is still
carried around even though it could simply be removed (note that this
does not mean the datfile leaks anything - such blocks are reused as
new vblocks are allocated; the situation is similar for the ifile).

Again, these ideas may be completely bogus or of little value.

> Also: design documentation is missing right now, which
> does mean that there's a pretty steep curve for anyone
> who didn't write the file system to go in and fix any
> bugs.
>

While it's true that there is no documentation describing the current
state of nandfs, many of its ideas originated from nilfs2, so one can
get some understanding by reading the nilfs2 materials.

-- 
Mateusz Guzik
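
For illustration only, here is a minimal sketch of the two-threshold
check discussed above. This is not the actual nandfs code: the names
and reserve values are invented for this mail, and in practice the
reserves would have to be derived from the segment size and from how
much the cleaner writes per pass.

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented reserve values, purely for illustration. */
#define RESV_SEGS_WRITE		8	/* kept free for removals + cleaning */
#define RESV_SEGS_DELETE	4	/* kept free strictly for the cleaner */

struct fs_counters {
	uint64_t free_segs;	/* clean (erased) segments available */
};

/* Can a regular write (create, append, ...) be admitted? */
static bool
can_write(const struct fs_counters *fc)
{
	return (fc->free_segs > RESV_SEGS_WRITE);
}

/*
 * Can a removal be admitted?  Removals still dirty bmap/ifile/datfile
 * blocks and therefore consume segments, so they have to stop before
 * the cleaner's own reserve is eaten.
 */
static bool
can_delete(const struct fs_counters *fc)
{
	return (fc->free_segs > RESV_SEGS_DELETE);
}

/* Returns 0 if the operation may proceed, ENOSPC otherwise. */
static int
admit_operation(const struct fs_counters *fc, bool is_removal)
{
	return ((is_removal ? can_delete(fc) : can_write(fc)) ? 0 : ENOSPC);
}

The point is only that removals get their own, lower threshold, so that
even after writers start seeing ENOSPC the cleaner is still guaranteed
a few clean segments to copy live blocks into.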