From: Julian Elischer <julian@elischer.org>
Date: Mon, 04 Feb 2008 10:36:55 -0800
To: Andriy Gapon
Cc: Bruce Evans, freebsd-hackers@freebsd.org, scottl@FreeBSD.org, freebsd-fs@freebsd.org, Pav Lucistnik, Remko Lodder
Subject: Re: fs/udf: vm pages "overlap" while reading large dir [patch]

Andriy Gapon wrote:
> More on the problem with reading big directories on UDF.

You do realise that you have now made yourself the official maintainer
of the UDF file system by submitting a competent and insightful
analysis of the problem?

> First, some sleuthing. I came to believe that the problem is caused by
> some larger change in the vfs/vm/buf area. It seems that VMIO is now
> applied to more vnode types than before. In particular, vnodes for
> devices now seem to have a non-NULL v_object (or maybe this is about
> directory vnodes, I am not sure).
>
> It also seems that the problem should happen for any directory larger
> than four 2048-byte sectors (I think any directory with more than 300
> files would definitely qualify).
>
> After some code reading and run-time debugging, here are some facts
> about udf directory reading:
> 1. bread-ing is done via the device vnode (as opposed to directory
> vnodes); as a consequence, udf_strategy is not involved in directory
> reading
> 2. the device vnode has a non-NULL v_object, which means that VMIO is
> used
> 3. the code seems to assume that non-VM buffers are used
> 4. bread is done on as much data as possible, up to MAXBSIZE [and even
> slightly more in some cases], i.e. the code determines the directory
> data size and attempts to read everything in one go
> 5. the physical sector number, scaled to DEV_BSIZE (512-byte) units, is
> passed to bread - this is because of #1 above and peculiarities of the
> buf system
> 6. directory reading and file reading are quite different, because the
> latter does its reading via the file vnode
>
> Let's consider a simple case of "directory reading" 3 * PAGE_SIZE (4K)
> bytes starting at a physical sector N with N%2 = 0, from media with a
> sector size of 2K (the most common CD/DVD case):
> 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
> N%2 = 0
> 2. 4*N is passed as the block number to bread
> 3. bo_bsize of the corresponding bufobj is the media sector size, i.e.
> 2K
> 4. the actual read in bread will happen at a b_iooffset of
> 4*N*DEV_BSIZE = N*2K, i.e. the correct byte offset within the device
> 5. for a fresh read, getblk will be called
> 6. getblk will set b_offset to blkno*bo_bsize = 4*N*2K, i.e. 4 times
> the correct byte offset within the device
> 7. allocbuf will allocate pages using the B_VMIO case
> 8. allocbuf calculates the base page index as pi = b_offset/PAGE_SIZE
> this means the following:
> sectors N and N+1 will be bound to the page with index 4*N*2K/4K = 2*N
> sectors N+2 and N+3 will be bound to the next page, 2*N + 1
> sectors N+4 and N+5 will be bound to the next page, 2*N + 2
>
> Now let's consider a "directory read" of 2 sectors (1 page) starting at
> physical sector N+1.
> Using the same calculations as above, we see that this read will be
> tied to the page with index 2*(N+1) = 2*N + 2.
> And that page already contains valid cached data from the previous
> read, *but* it is the data from sectors N+4 and N+5, not N+1 and N+2.
>
> So we see that, because b_offset is 4 times the actual byte offset, we
> get incorrect page indexing. Namely, sector X gets associated with
> different pages depending on which sector is used as the starting
> sector for bread. If bread starts at sector N, then the data of sector
> N+1 is associated with (the second half of) the page with index 2*N;
> but if bread starts at sector N+1, then it gets associated with (the
> first half of) the page with index 2*(N+1).
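>
> To make the arithmetic concrete, here is a small user-space sketch of
> the index calculation (illustrative only; the constants mirror the
> description above, this is not the actual kernel code):
>
> #include <stdio.h>
>
> #define DEV_BSIZE   512
> #define SECTOR_SIZE 2048
> #define PAGE_SIZE   4096
>
> /* Page index as allocbuf derives it: b_offset = blkno * bo_bsize,
>  * where blkno is the DEV_BSIZE-scaled sector number (4*N) and
>  * bo_bsize is the media sector size (2K). */
> static long page_index(long sector)
> {
>     long blkno = sector * (SECTOR_SIZE / DEV_BSIZE); /* 4*N */
>     long b_offset = blkno * SECTOR_SIZE; /* 4x the real byte offset */
>     return b_offset / PAGE_SIZE;
> }
>
> int main(void)
> {
>     long N = 100; /* any even physical sector number */
>
>     /* bread at N maps sectors N..N+5 onto pages 2N, 2N+1, 2N+2,
>      * so sector N+4 lands in page 2N+2 ... */
>     printf("bread@N:   sector N+4 -> page %ld\n", page_index(N) + 2);
>     /* ... while bread at N+1 binds sector N+1 to that very same page
>      * 2*(N+1) = 2N+2, aliasing the stale cached data. */
>     printf("bread@N+1: sector N+1 -> page %ld\n", page_index(N + 1));
>     return 0;
> }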
> Previously, before the VMIO change, the data for all reads was put
> into separate buffers that did not share anything between them. So the
> problem was limited to wasting memory on duplicate data, and no actual
> overruns happened. Now we have the overruns, because the VM pages are
> shared between the buffers of the same vnode.
>
> One obvious solution is to limit the bread size to 2*PAGE_SIZE = 4 *
> sector_size. In this case, as before, we would waste some memory on
> duplicate data, but we would avoid the page overruns. If we limit the
> bread size even more, to 1 sector, then we would not have any
> duplicate data at all. But there would still be some resource waste:
> each page would correspond to one sector, so a 4K page would hold only
> 2K of valid data and the other half of each page would be unused.
>
> Another solution, which seems better to me, is to do the "usual" kind
> of reading: pass the directory vnode and a 2048-byte sector offset to
> bread. In this case udf_strategy gets called for the blkno
> translation, all offsets and indexes are correct, and everything works
> as it does for all other filesystems.
> The only overhead in this case comes from the need to handle the
> UDF_INVALID_BMAP case (where the data is embedded in the file entry).
> It means that we have to call udf_bmap_internal to see if we have that
> special case and handle it; if the case is not special, we call bread
> on the directory vnode, which means that udf_bmap_internal gets called
> again in udf_strategy.
> One optimization that can be done here is to create a lighter version
> of udf_bmap_internal that merely checks for the special case and
> doesn't do anything else.
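>
> In code, the idea looks roughly like this (a sketch only - it assumes
> the usual fs/udf names such as node->i_vnode, udfmp->im_devvp and
> udfmp->bshift, and elides the error handling):
>
> /* Old scheme (facts #1 and #5): read via the device vnode with a
>  * DEV_BSIZE-scaled block number; udf_strategy is bypassed and getblk
>  * computes a b_offset that is 4 times too large. */
> error = bread(udfmp->im_devvp,
>     sector << (udfmp->bshift - DEV_BSHIFT), size, NOCRED, &bp);
>
> /* Proposed scheme: check for the embedded-data case first, then read
>  * via the directory vnode in logical 2048-byte sectors, letting
>  * udf_strategy do the blkno translation. */
> if (udf_bmap_internal(node, offset, &sector, &max_size) ==
>     UDF_INVALID_BMAP) {
>         /* The data lives in the file entry itself; copy it directly
>          * instead of reading from the device. */
> } else {
>         error = bread(node->i_vnode, offset >> udfmp->bshift,
>             size, NOCRED, &bp);
> }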
> I am attaching a patch based on the ideas above. It fixes the problem
> for me and doesn't seem to create any new ones :-)
> About the patch:
> hunk #1 fixes a copy/paste error
> hunk #2 fixes udf_strategy so that it does not set a junk b_blkno in
> case of udf_bmap_internal failure, and also replaces
> divisions/multiplications with bit-shifts (see the P.S. at the end of
> this mail for an illustration)
> hunk #3 really limits the max size in bread to MAXBSIZE and also makes
> the bread happen on the directory vnode
>
> BTW, maybe it's a good idea to further limit the size of bread()
> anyway, e.g. to always read 1 sector. This should make a difference
> only in a minority of cases, and I am not sure about the performance
> effects (I do not expect a difference in anything other than
> performance).
>
> on 01/02/2008 12:27 Andriy Gapon said the following:
>> on 01/02/2008 12:00 Andriy Gapon said the following:
>>> ---- a different, under-debugged problem -----
>>> BTW, on some smaller directories (but still large ones) I get some
>>> very strange problems with reading a directory too. It seems like
>>> some bad interaction between udf and the buffer cache system. I
>>> added a lot of debugging prints and the problem looks like the
>>> following:
>>>
>>> read starting at physical sector (2048-byte one) N, size is ~20K,
>>> N%4=0
>>> bread(4 * N, some_big_size)
>>> correct data is read
>>> repeat the above a couple dozen times
>>> read starting at physical sector (2048-byte one) N+1, size is ~20K
>>> bread(4 * (N+1), some_big_size)
>>> data is read from physical sector N+4 (instead of N+1)
>>>
>>> I remember that Bruce Evans warned me that something like this could
>>> happen, but I couldn't understand him because I don't understand the
>>> VM/buffer subsystem. I'll try to dig up the email.
>>>
>> Sorry for the flood - additional info:
>> if I limit the max read size in udf_readatoffset() to 2048 (instead
>> of MAXBSIZE), then large directories can be read OK. Seems like
>> something with overlapping buffers, maybe?
>>
>> BTW, here's how I created the test environment for the described
>> issues (in tcsh):
>>
>> mkdir /tmp/bigdir
>> cd /tmp/bigdir
>> set i=1
>> while ($i < NNNNN)
>>   touch file.$i
>>   set i=`expr $i + 1`
>> end
>> cd /tmp
>> mkisofs -udf -o test.iso bigdir
>> mdconfig -a -t vnode -f test.iso -S 2048 -u 0
>> mount_udf /dev/md0 /mnt
>> ls -l /mnt
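>
> P.S. To illustrate the hunk #2 idea in udf_strategy (a sketch with
> assumed names, not the actual patch text; bshift is 11 for 2048-byte
> sectors, DEV_BSHIFT is 9):
>
> if ((error = udf_bmap_internal(node,
>     bp->b_lblkno << udfmp->bshift, &sector, &max_size)) != 0) {
>         /* Fail the I/O instead of storing a junk b_blkno. */
>         bp->b_error = error;
>         bp->b_ioflags |= BIO_ERROR;
>         bufdone(bp);
>         return (0);
> }
> /* sector * (sector_size / DEV_BSIZE) becomes a plain shift. */
> bp->b_blkno = sector << (udfmp->bshift - DEV_BSHIFT);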