From: Andriy Gapon <avg@icyb.net.ua>
Date: Mon, 04 Feb 2008 17:47:42 +0200
To: Remko Lodder, scottl@FreeBSD.org, freebsd-fs@freebsd.org
Cc: freebsd-hackers@freebsd.org, Pav Lucistnik, Bruce Evans
Message-ID: <47A7339E.4070708@icyb.net.ua>
In-Reply-To: <47A2F404.7010208@icyb.net.ua>
Subject: fs/udf: vm pages "overlap" while reading large dir
List-Id: Filesystems (freebsd-fs@freebsd.org)
More on the problem with reading big directories on UDF.

First, some sleuthing. I have come to believe that the problem is caused
by some larger change in the vfs/vm/buf area: it seems that VMIO is now
applied to more vnode types than before. In particular, vnodes for
devices now seem to have a non-NULL v_object (or maybe this is about
directory vnodes, I am not sure). It also seems that the problem should
happen for any directory larger than four 2048-byte sectors (I think
that any directory with > 300 files would definitely qualify).

After some code reading and run-time debugging, here are some facts
about udf directory reading:

1. bread-ing is done via the device vnode (as opposed to directory
   vnodes); as a consequence, udf_strategy is not involved in directory
   reading
2. the device vnode has a non-NULL v_object, which means that VMIO is
   used
3. the code seems to assume that non-VM buffers are used
4. bread is done on as much data as possible, up to MAXBSIZE [and even
   slightly more in some cases], i.e. the code determines the directory
   data size and attempts to read everything in one go
5. the physical sector number, converted to DEV_BSIZE (512-byte)
   sectors, is passed to bread - this is because of #1 above and
   peculiarities of the buf system
6. directory reading and file reading are quite different, because the
   latter does its reading via the file vnode

Let's consider a simple case of a "directory read" of 3 * PAGE_SIZE
(4K) = 12 KB starting from a physical sector N with N%2 = 0, on media
with a sector size of 2K (the most common CD/DVD case):

1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
   N%2 = 0
2. 4*N is passed as the sector number to bread
3. bo_bsize of the corresponding bufobj is the media sector size,
   i.e. 2K
4. the actual read in bread will happen from b_iooffset of
   4*N*DEV_BSIZE = N*2K, i.e. the correct byte offset within the device
5. for a fresh read, getblk will be called
6. getblk will set b_offset to blkno*bo_bsize = 4*N*2K, i.e. 4 times
   the correct byte offset within the device
7. allocbuf will allocate pages using the B_VMIO case
8. allocbuf calculates the base page index as pi = b_offset/PAGE_SIZE;
   this means the following: sectors N and N+1 will be bound to the
   page with index 4*N*2K/4K = 2*N; sectors N+2 and N+3 will be bound
   to the next page, 2*N + 1; sectors N+4 and N+5 will be bound to the
   next page, 2*N + 2

Now let's consider a "directory read" of 2 sectors (1 page) starting at
physical sector N+1. Using the same calculations as above, we see that
this sector will be tied to the page with index 2*(N+1) = 2*N + 2. And
that page already contains valid cached data from the previous read,
*but* it contains data from sectors N+4 and N+5 instead of N+1 and N+2.

So we see that, because b_offset is 4 times the actual byte offset, we
get incorrect page indexing. Namely, sector X gets associated with
different pages depending on which sector was used as the starting
sector of the bread. If bread starts at sector N, then the data of
sector N+1 is associated with (the second half of) the page with index
2*N; but if bread starts at sector N+1, then it gets associated with
(the first half of) the page with index 2*(N+1).

Previously, before the VMIO change, data for all reads was put into
separate buffers that did not share anything between them. So the
problem was limited to wasting memory on duplicate data; no actual
overruns happened. Now we have the overruns, because the VM pages are
shared between the buffers of the same vnode.

One obvious solution is to limit bread size to 2*PAGE_SIZE =
4 * sector_size. In this case, as before, we would waste some memory on
duplicate data, but we would avoid page overruns. If we limit bread
size even more, to 1 sector, then we would have no duplicate data at
all. But there would still be some resource waste: each page would
correspond to one sector, so each 4K page would hold only 2K of valid
data, with the other half unused.
Another solution, which seems better to me, is to do "usual" reading:
pass a directory vnode and a 2048-byte sector offset to bread. In this
case udf_strategy will get called for blkno translation, all offsets
and indexes will be correct, and everything will work as it does for
all other filesystems. The only overhead in this case comes from the
need to handle the UDF_INVALID_BMAP case (where data is embedded into
the file entry). That means we have to call udf_bmap_internal to see if
we have that special case and handle it; if the case is not special, we
call bread on the directory vnode, which means that udf_bmap_internal
gets called again in udf_strategy. One optimization that can be done
here is to create a lighter version of udf_bmap_internal that would
merely check for the special case and wouldn't do anything else.

I am attaching a patch based on the ideas above. It fixes the problem
for me and doesn't seem to create any new ones :-)

About the patch:
hunk #1 fixes a copy/paste error
hunk #2 fixes udf_strategy to not set a junk b_blkno in case of
udf_bmap_internal failure, and also replaces divisions/multiplications
with bit-shifts
hunk #3 really limits the max size in bread to MAXBSIZE and also makes
the bread be done on a directory vnode

BTW, maybe it's a good idea to further limit the size of bread()
anyway, e.g. always reading 1 sector. This should make a difference
only in a minority of cases, and I am not sure about the performance
effects (I do not expect a difference in anything other than
performance).

on 01/02/2008 12:27 Andriy Gapon said the following:
> on 01/02/2008 12:00 Andriy Gapon said the following:
>> ---- a different, under-debugged problem -----
>> BTW, on some smaller directories (but still large ones) I get some
>> very strange problems with reading a directory too. It seems like
>> some bad interaction between udf and the buffer cache system.
>> I added a lot of debugging prints, and the problem looks like the
>> following:
>>
>> read starting at physical sector (2048-byte one) N, size ~20K, N%4=0
>> bread(4 * N, some_big_size)
>> correct data is read
>> repeat the above a couple dozen times
>> read starting at physical sector (2048-byte one) N+1, size ~20K
>> bread(4 * (N+1), some_big_size)
>> data is read from physical sector N+4 (instead of N+1)
>>
>> I remember that Bruce Evans warned me that something like this could
>> happen, but I couldn't understand him, because I don't understand
>> the VM/buffer subsystem. I'll try to dig up the email.
>>
>
> Sorry for the flood - additional info:
> if I limit the max read size in udf_readatoffset() to 2048 (instead
> of MAXBSIZE), then large directories can be read OK. Seems like
> something with overlapping buffers, maybe?
>
> BTW, here's how I created the test environment for the described
> issues (in tcsh):
>
> mkdir /tmp/bigdir
> cd /tmp/bigdir
> set i=1
> while ($i < NNNNN)
>     touch file.$i
>     set i=`expr $i + 1`
> end
> cd /tmp
> mkisofs -udf -o test.iso bigdir
> mdconfig -a -t vnode -f test.iso -S 2048 -u 0
> mount_udf /dev/md0 /mnt
> ls -l /mnt

-- 
Andriy Gapon