Date: Mon, 04 Feb 2008 10:36:55 -0800
From: Julian Elischer <julian@elischer.org>
To: Andriy Gapon <avg@icyb.net.ua>
Cc: Bruce Evans <bde@zeta.org.au>, freebsd-hackers@freebsd.org, scottl@FreeBSD.org,
    freebsd-fs@freebsd.org, Pav Lucistnik <pav@FreeBSD.org>, Remko Lodder <remko@FreeBSD.org>
Subject: Re: fs/udf: vm pages "overlap" while reading large dir [patch]
Message-ID: <47A75B47.2040604@elischer.org>
In-Reply-To: <47A735A4.3060506@icyb.net.ua>
References: <200612221824.kBMIOhfM049471@freefall.freebsd.org> <47A2EDB0.8000801@icyb.net.ua>
    <47A2F404.7010208@icyb.net.ua> <47A735A4.3060506@icyb.net.ua>
Andriy Gapon wrote:
> More on the problem with reading big directories on UDF.

You do realise that you have now made yourself the official maintainer
of the UDF file system by submitting a competent and insightful
analysis of the problem?

>
> First, some sleuthing. I came to believe that the problem is caused by
> some larger change in the vfs/vm/buf area. It seems that now VMIO is
> applied to more vnode types than before. In particular it seems that
> now vnodes for devices have a non-NULL v_object (or maybe this is about
> directory vnodes, I am not sure).
>
> Also it seems that the problem should happen for any directory larger
> than four 2048-byte sectors (I think that any directory with > 300
> files would definitely qualify).
>
> After some code reading and run-time debugging, here are some facts
> about udf directory reading:
> 1. bread-ing is done via the device vnode (as opposed to directory
>    vnodes); as a consequence udf_strategy is not involved in directory
>    reading
> 2. the device vnode has a non-NULL v_object, which means that VMIO is
>    used
> 3. it seems that the code assumed that non-VM buffers are used
> 4. bread is done on as much data as possible, up to MAXBSIZE [and even
>    slightly more in some cases] (i.e. the code determines the directory
>    data size and attempts to read everything in one go)
> 5. the physical sector number converted to DEV_BSIZE (512-byte) sectors
>    is passed to bread - this is because of #1 above and peculiarities
>    of the buf system
> 6. directory reading and file reading are quite different, because the
>    latter does its reading via the file vnode
>
> Let's consider a simple case of a "directory read" of 3 * PAGE_SIZE
> (4K) = 12K bytes starting from a physical sector N with N%2 = 0, from
> media with a sector size of 2K (the most common CD/DVD case):
> 1. we want to read 12 KB = 3 pages = 6 sectors starting from sector N,
>    N%2 = 0
> 2. 4*N is passed as the sector number to bread
> 3. bo_bsize of the corresponding bufobj is the media sector size, i.e. 2K
> 4. the actual read in bread will happen from a b_iooffset of
>    4*N*DEV_BSIZE = N*2K, i.e. the correct offset within the device
> 5. for a fresh read getblk will be called
> 6. getblk will set b_offset to blkno*bo_bsize = 4*N*2K, i.e. 4 times
>    the correct byte offset within the device
> 7. allocbuf will allocate pages using the B_VMIO case
> 8. allocbuf calculates the base page index as pi = b_offset/PAGE_SIZE;
>    this means the following:
>    sectors N and N+1 will be bound to the page with index 4*N*2K/4K = 2*N
>    sectors N+2 and N+3 will be bound to the next page, 2*N + 1
>    sectors N+4 and N+5 will be bound to the next page, 2*N + 2
>
> Now let's consider a "directory read" of 2 sectors (1 page) starting at
> physical sector N+1.
> Using the same calculations as above, we see that these sectors will be
> tied to the page with index 2*(N+1) = 2*N + 2.
> And this page already contains valid cached data from the previous
> read, *but* it contains data from sectors N+4 and N+5 instead of N+1
> and N+2.
>
> So we see that, because b_offset is 4 times the actual byte offset, we
> get incorrect page indexing. Namely, sector X gets associated with
> different pages depending on which sector is used as the starting
> sector for bread. If bread starts at sector N, then the data of sector
> N+1 is associated with (the second half of) the page with index 2*N;
> but if bread starts at sector N+1, then it gets associated with (the
> first half of) the page with index 2*(N+1).
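To make the arithmetic above easier to check, here is a small
stand-alone program that reproduces the index calculation as I read it.
The constants and names are only illustrative; this is not the actual
getblk()/allocbuf() code:

#include <stdio.h>

#define PAGE_SIZE   4096    /* VM page size */
#define DEV_BSIZE   512     /* block unit the buf layer counts blkno in */
#define SECT_SIZE   2048    /* UDF/CD media sector size (bo_bsize) */

/*
 * Page index that the analysis above says allocbuf() ends up using:
 * b_offset = blkno * bo_bsize, with blkno = 4 * N (DEV_BSIZE units),
 * so b_offset is 4 times the real byte offset on the device.
 */
static long
page_index(long start_sector, long sector_in_buf)
{
        long blkno = start_sector * (SECT_SIZE / DEV_BSIZE);    /* 4 * N */
        long b_offset = blkno * SECT_SIZE;                      /* 4 * N * 2K */

        return ((b_offset + sector_in_buf * SECT_SIZE) / PAGE_SIZE);
}

int
main(void)
{
        long N = 100;   /* any even starting sector */

        /* bread() starting at N: sector N+1 lands in page 2*N. */
        printf("read from N:   sector N+1 -> page %ld\n", page_index(N, 1));

        /*
         * bread() starting at N+1: sector N+1 lands in page 2*(N+1),
         * the page that already holds sectors N+4 and N+5 from the
         * first read.
         */
        printf("read from N+1: sector N+1 -> page %ld\n", page_index(N + 1, 0));
        return (0);
}

With N = 100 this prints page 200 for the first read and page 202 for
the second, i.e. the same media sector gets bound to two different
pages, and page 202 already holds the cached data of sectors N+4/N+5.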
>
> Previously, before the VMIO change, the data for all reads was put into
> separate buffers that did not share anything between them. So the
> problem was limited to wasting memory on duplicate data, but no actual
> over-runs happened. Now we have the over-runs, because the VM pages are
> shared between the buffers of the same vnode.
>
> One obvious solution is to limit the bread size to 2*PAGE_SIZE = 4 *
> sector_size. In this case, as before, we would waste some memory on
> duplicate data but we would avoid page overruns. If we limit the bread
> size even more, to 1 sector, then we would not have any duplicate data
> at all. But there would still be some resource waste - each page would
> correspond to one sector, so a 4K page would hold only 2K of valid data
> and the other half of each page would be unused.
>
> Another solution, which seems better to me, is to do "usual" reading -
> pass the directory vnode and a 2048-byte sector offset to bread. In
> this case udf_strategy will get called for the blkno translation, all
> offsets and indexes will be correct, and everything will work perfectly,
> as is the case for all other filesystems.
> The only overhead comes from the need to handle the UDF_INVALID_BMAP
> case (where the data is embedded into the file entry). It means that we
> have to call bmap_internal to see if we have that special case and
> handle it, and if the case is not special we would call bread on the
> directory vnode, which means that bmap_internal would be called again
> in udf_strategy.
> One optimization that can be done here is to create a lighter version
> of bmap_internal that would merely check for the special case and
> wouldn't do anything else.
>
> I am attaching a patch based on the ideas above. It fixes the problem
> for me and doesn't seem to create any new ones :-)
> About the patch:
> hunk #1 fixes a copy/paste error
> hunk #2 fixes strategy so that it does not set a junk b_blkno when
> udf_bmap_internal fails, and also replaces divisions/multiplications
> with bit-shifts
> hunk #3 limits the max size in bread to MAXBSIZE and also makes the
> bread be done on the directory vnode
>
> BTW, maybe it's a good idea to further limit the size of bread()
> anyway, e.g. to always read 1 sector. This should make a difference
> only in a minority of cases, and I am not sure about the performance
> effects (I do not expect a difference in anything other than
> performance).
>
> on 01/02/2008 12:27 Andriy Gapon said the following:
>> on 01/02/2008 12:00 Andriy Gapon said the following:
>>> ---- a different, under-debugged problem -----
>>> BTW, on some smaller directories (but still large ones) I get some
>>> very strange problems with reading a directory too. It seems like
>>> some bad interaction between udf and the buffer cache system. I added
>>> a lot of debugging prints and the problem looks like the following:
>>>
>>> read starting at physical sector (2048-byte one) N, size is ~20K, N%4=0
>>> bread(4 * N, some_big_size)
>>> correct data is read
>>> repeat the above a couple dozen times
>>> read starting at physical sector (2048-byte one) N+1, size is ~20K
>>> bread(4 * (N+1), some_big_size)
>>> data is read from physical sector N+4 (instead of N+1)
>>>
>>> I remember that Bruce Evans warned me that something like this could
>>> happen, but I couldn't understand him because I don't understand the
>>> VM/buffer subsystem. I'll try to dig up the email.
>>
>> Sorry for the flood - additional info:
>> if I limit the max read size in udf_readatoffset() to 2048 (instead of
>> MAXBSIZE), then large directories can be read OK. Seems like something
>> with overlapping buffers, maybe?
>>
>> BTW, here's how I created the test environment for the described
>> issues (in tcsh):
>>
>> mkdir /tmp/bigdir
>> cd /tmp/bigdir
>> set i=1
>> while ($i < NNNNN)
>>     touch file.$i
>>     set i=`expr $i + 1`
>> end
>> cd /tmp
>> mkisofs -udf -o test.iso bigdir
>> mdconfig -a -t vnode -f test.iso -S 2048 -u 0
>> mount_udf /dev/md0 /mnt
>> ls -l /mnt
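Just to make sure I follow the "bread on the directory vnode" idea
described above, here is a stand-alone sketch of the control flow it
seems to imply. Every name and signature in it is made up for
illustration (udf_read_dir_at, bread_dir, and so on); it only shows the
shape of the approach, not the real udf_readatoffset() or
udf_bmap_internal() code:

#include <stdio.h>

#define UDF_SECTOR_SIZE 2048

struct dirvnode {               /* stand-in for the directory vnode */
        int data_embedded;      /* UDF_INVALID_BMAP-style case */
};

/* "lighter" bmap check: only answers "is the data embedded in the file entry?" */
static int
udf_is_data_embedded(const struct dirvnode *dvp)
{
        return (dvp->data_embedded);
}

/* stand-in for reading directory data stored directly in the file entry */
static int
udf_read_embedded(const struct dirvnode *dvp, long offset)
{
        printf("read offset %ld from embedded file-entry data\n", offset);
        return (0);
}

/*
 * Stand-in for bread() on the DIRECTORY vnode: lblkno is a file-relative
 * 2048-byte block, so the real udf_strategy() would do the translation
 * and b_offset / page indexing would stay consistent.
 */
static int
bread_dir(const struct dirvnode *dvp, long lblkno, int size)
{
        printf("bread(dir vnode, lblkno=%ld, size=%d)\n", lblkno, size);
        return (0);
}

static int
udf_read_dir_at(const struct dirvnode *dvp, long offset)
{
        if (udf_is_data_embedded(dvp))
                return (udf_read_embedded(dvp, offset));
        /*
         * Normal case: file-relative block number, no DEV_BSIZE scaling
         * and no device vnode involved.
         */
        return (bread_dir(dvp, offset / UDF_SECTOR_SIZE, UDF_SECTOR_SIZE));
}

int
main(void)
{
        struct dirvnode normal = { 0 }, embedded = { 1 };

        udf_read_dir_at(&normal, 5 * UDF_SECTOR_SIZE + 100);
        udf_read_dir_at(&embedded, 100);
        return (0);
}

The point of the "lighter" bmap check is, as described above, that the
full udf_bmap_internal() translation would otherwise run twice in the
common case: once to rule out the embedded-data case and once again
inside udf_strategy() when the buffer is actually read.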