From: Andriy Gapon <avg@icyb.net.ua>
Date: Mon, 04 Feb 2008 17:47:42 +0200
To: Remko Lodder, scottl@FreeBSD.org, freebsd-fs@freebsd.org
Cc: freebsd-hackers@freebsd.org, Pav Lucistnik, Bruce Evans
Message-ID: <47A7339E.4070708@icyb.net.ua>
In-Reply-To: <47A2F404.7010208@icyb.net.ua>
Subject: fs/udf: vm pages "overlap" while reading large dir
List-Id: Filesystems (freebsd-fs@freebsd.org)
More on the problem with reading big directories on UDF.

First, some sleuthing. I have come to believe that the problem is caused
by some larger change in the vfs/vm/buf area: it seems that VMIO is now
applied to more vnode types than before. In particular, vnodes for
devices now seem to have a non-NULL v_object (or maybe this is about
directory vnodes, I am not sure). It also seems that the problem should
happen for any directory larger than four 2048-byte sectors (I think
that any directory with > 300 files would definitely qualify).

After some code reading and run-time debugging, here are some facts
about udf directory reading:

1. bread-ing is done via the device vnode (as opposed to directory
   vnodes); as a consequence, udf_strategy is not involved in directory
   reading
2. the device vnode has a non-NULL v_object, which means that VMIO is
   used
3. the code seems to assume that non-VM buffers are used
4. bread is done on as much data as possible, up to MAXBSIZE [and even
   slightly more in some cases], i.e. the code determines the directory
   data size and attempts to read everything in one go
5. the physical sector number, converted to DEV_BSIZE (512-byte)
   sectors, is passed to bread - this is because of #1 above and
   peculiarities of the buf system
6. directory reading and file reading are quite different, because the
   latter does its reading via the file vnode

Let's consider a simple case of a "directory read" of 3 * PAGE_SIZE
(4K) = 12 KB starting from a physical sector N with N%2 = 0, on media
with a sector size of 2K (the most common CD/DVD case):

1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
   N%2 = 0
2. 4*N is passed as the sector number to bread
3. bo_bsize of the corresponding bufobj is the media sector size,
   i.e. 2K
4. the actual read in bread will happen from b_iooffset of
   4*N*DEV_BSIZE = N*2K, i.e. the correct byte offset within the device
5. for a fresh read, getblk will be called
6. getblk will set b_offset to blkno*bo_bsize = 4*N*2K, i.e. 4 times
   the correct byte offset within the device
7. allocbuf will allocate pages using the B_VMIO case
8. allocbuf calculates the base page index as pi = b_offset/PAGE_SIZE;
   this means the following: sectors N and N+1 will be bound to the
   page with index 4*N*2K/4K = 2*N; sectors N+2 and N+3 will be bound
   to the next page, 2*N + 1; sectors N+4 and N+5 will be bound to the
   next page, 2*N + 2

Now let's consider a "directory read" of 2 sectors (1 page) starting at
physical sector N+1. Using the same calculations as above, we see that
this sector will be tied to the page with index 2*(N+1) = 2*N + 2. And
that page already contains valid cached data from the previous read,
*but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.

So we see that, because b_offset is 4 times the actual byte offset, we
get incorrect page indexing. Namely, sector X gets associated with
different pages depending on which sector was used as the starting
sector of the bread. If bread starts at sector N, then the data of
sector N+1 is associated with (the second half of) the page with index
2*N; but if bread starts at sector N+1, then it gets associated with
(the first half of) the page with index 2*(N+1).

Previously, before the VMIO change, data for all reads was put into
separate buffers that did not share anything between them. So the
problem was limited to wasting memory on duplicate data; no actual
overruns happened. Now we have the overruns, because the VM pages are
shared between the buffers of the same vnode.

One obvious solution is to limit bread size to 2*PAGE_SIZE =
4 * sector_size. In this case, as before, we would waste some memory on
duplicate data, but we would avoid page overruns. If we limit bread
size even more, to 1 sector, then we would have no duplicate data at
all. But there would still be some resource waste: each page would
correspond to one sector, so each 4K page would hold only 2K of valid
data, with the other half unused.
Another solution, which seems better to me, is to do "usual" reading:
pass a directory vnode and a 2048-byte sector offset to bread. In this
case udf_strategy will get called for blkno translation, all offsets
and indexes will be correct, and everything will work as it does for
all other filesystems. The only overhead in this case comes from the
need to handle the UDF_INVALID_BMAP case (where data is embedded into
the file entry). That means we have to call udf_bmap_internal to see if
we have that special case and handle it; if the case is not special, we
call bread on the directory vnode, which means that udf_bmap_internal
gets called again in udf_strategy. One optimization that can be done
here is to create a lighter version of udf_bmap_internal that would
merely check for the special case and wouldn't do anything else.

I am attaching a patch based on the ideas above. It fixes the problem
for me and doesn't seem to create any new ones :-)

About the patch:
hunk #1 fixes a copy/paste error
hunk #2 fixes udf_strategy to not set a junk b_blkno in case of
udf_bmap_internal failure, and also replaces divisions/multiplications
with bit-shifts
hunk #3 really limits the max size in bread to MAXBSIZE and also makes
the bread be done on a directory vnode

BTW, maybe it's a good idea to further limit the size of bread()
anyway, e.g. always reading 1 sector. This should make a difference
only in a minority of cases, and I am not sure about the performance
effects (I do not expect a difference in anything other than
performance).

on 01/02/2008 12:27 Andriy Gapon said the following:
> on 01/02/2008 12:00 Andriy Gapon said the following:
>> ---- a different, under-debugged problem -----
>> BTW, on some smaller directories (but still large ones) I get some
>> very strange problems with reading a directory too. It seems like
>> some bad interaction between udf and the buffer cache system.
>> I added a lot of debugging prints, and the problem looks like the
>> following:
>>
>> read starting at physical sector (2048-byte one) N, size ~20K, N%4=0
>> bread(4 * N, some_big_size)
>> correct data is read
>> repeat the above a couple dozen times
>> read starting at physical sector (2048-byte one) N+1, size ~20K
>> bread(4 * (N+1), some_big_size)
>> data is read from physical sector N+4 (instead of N+1)
>>
>> I remember that Bruce Evans warned me that something like this could
>> happen, but I couldn't understand him, because I don't understand
>> the VM/buffer subsystem. I'll try to dig up the email.
>>
>
> Sorry for the flood - additional info:
> if I limit the max read size in udf_readatoffset() to 2048 (instead
> of MAXBSIZE), then large directories can be read OK. Seems like
> something with overlapping buffers, maybe?
>
> BTW, here's how I created the test environment for the described
> issues (in tcsh):
>
> mkdir /tmp/bigdir
> cd /tmp/bigdir
> set i=1
> while ($i < NNNNN)
>     touch file.$i
>     set i=`expr $i + 1`
> end
> cd /tmp
> mkisofs -udf -o test.iso bigdir
> mdconfig -a -t vnode -f test.iso -S 2048 -u 0
> mount_udf /dev/md0 /mnt
> ls -l /mnt

-- 
Andriy Gapon