Date: Tue, 5 Mar 2002 18:32:54 -0500 (EST)
From: Zhihui Zhang
To: Lars Eggert
Cc: "Rogier R. Mulhuijzen", Julian Elischer, freebsd-hackers@FreeBSD.ORG
Subject: Re: A weird disk behaviour
In-Reply-To: <3C85542B.5060100@isi.edu>

I apologize to all who have followed this. I made a typo in the original
email. What I observed is that writing LESS performs WORSE. Since all
blocks are laid out contiguously and I write them sequentially, there
should not be any seek problem.

I have modified the kernel in kern_physio.c and found that physio() is
called the expected number of times. I even added some code to record the
time elapsed there:

        t1 = time_second;
        BUF_STRATEGY(bp, 0);
        spl = splbio();
        while ((bp->b_flags & B_DONE) == 0)
                tsleep((caddr_t)bp, PRIBIO, "physstr", 0);
        splx(spl);
        t2 = time_second;
        physio_time += t2 - t1;

The physio_time (a sysctl variable) is close to the time reported by the
user program.

-Zhihui

On Tue, 5 Mar 2002, Lars Eggert wrote:

> Zhihui Zhang wrote:
> > Several times slower! The point is that writing less data performs
> > worse. So I call it weird.
>
> Huh? You originally said:
>
> > (1) Write each block fully and sequentially, ie. 8192 bytes.
> >
> > (2) I still write these blocks sequentially, but for each block I only
> > write part of it.
> ...
> > I find out that the performance of (2) is several times better than the
> > performance of (1). Can anyone explain to me why this is the case?
>
> If (2) is better than (1), then writing *less* data is faster. Which is
> it, now?
>
> Lars
>
> > -Zhihui
> >
> > On Tue, 5 Mar 2002, Lars Eggert wrote:
> >
> >>Zhihui Zhang wrote:
> >>
> >>>Well, the core of my program is as follows (RANDOM(x) returns a value
> >>>between 0 and x):
> >>>
> >>>        blocksize = 8192;
> >>>        write_size_low = 512;
> >>>
> >>>        time(&time1);
> >>>        for (i = 0; i < write_count; i++) {
> >>>                write_size = write_size_low +
> >>>                    RANDOM(write_size_high-write_size_low);
> >>>                write_size = roundup(write_size, DEV_BSIZE);
> >>>                if (testcase == 1)
> >>>                        write_size = blocksize;
> >>>                write_block(rawfd, sectorno, buf, write_size);
> >>>                sectorno += blocksize / DEV_BSIZE;
> >>>        }
> >>>        time(&time2);
> >>>
> >>>If testcase is one, then the time elapsed (time2 - time1) is much less.
> >>>
> >>How "much less" in milliseconds?
> >>
> >>Also, in your original mail, you said you had 15,000 of these 8K blocks,
> >>which is only 120MB or so. Use 150,000 or 1,500,000 and check your
> >>results then.
> >>
> >>Lars
> >>
> >>>-Zhihui
> >>>
> >>>On Tue, 5 Mar 2002, Lars Eggert wrote:
> >>>
> >>>>I agree that it's probably caching at some level. You're only writing
> >>>>about 120MB of data (and half that in your second case). Bump these to a
> >>>>couple of GB and see what happens.
> >>>>
> >>>>Also, could you post your actual measurements?
> >>>>
> >>>>Lars
> >>>>
> >>>>Zhihui Zhang wrote:
> >>>>
> >>>>>The machine has 128M memory. I am doing physical I/O one block at a time,
> >>>>>so there should be no memory copy.
> >>>>>
> >>>>>-Zhihui
> >>>>>
> >>>>>On Tue, 5 Mar 2002, Rogier R.
> >>>>>Mulhuijzen wrote:
> >>>>>
> >>>>>>At 16:03 5-3-2002 -0500, Zhihui Zhang wrote:
> >>>>>>
> >>>>>>>On Tue, 5 Mar 2002, Julian Elischer wrote:
> >>>>>>>
> >>>>>>>>more writes fit in the disk's write cache?
> >>>>>>>>
> >>>>>>>For (1), it writes 15000 * 8192 bytes in all. For (2), it writes 15000 *
> >>>>>>>4096 bytes in all (assuming the random number distributes evenly between 0
> >>>>>>>and 8192). So your suggestion does not make sense to me.
> >>>>>>>
> >>>>>>How large is your buffercache? It might be that the 15000 * ~4096 roughly
> >>>>>>matches with your cache, and 15000 * 8192 doesn't.
> >>>>>>
> >>>>>>Case (1) would require a lot more physical IO in that case than case (2)
> >>>>>>would require.
> >>>>>>
> >>>>>> Doc
> >>>>>>
> >>>>>>>-Zhihui
> >>>>>>>
> >>>>>>>>On Tue, 5 Mar 2002, Zhihui Zhang wrote:
> >>>>>>>>
> >>>>>>>>>I am doing some raw I/O test on a Seagate SCSI disk running FreeBSD 4.5.
> >>>>>>>>>This situation is like this:
> >>>>>>>>>
> >>>>>>>>>+-----+----+----+----+----+----+----+----+----+----+---+------
> >>>>>>>>>|     |    |    |    |    |    |    |    |    |    |   | ....
> >>>>>>>>>+-----+----+----+----+----+----+----+----+----+----+---+------
> >>>>>>>>>
> >>>>>>>>>Each block is of fixed size, say 8192 bytes. Now I have a user program
> >>>>>>>>>writing each contiguously laid out block sequentially using /dev/daxxx
> >>>>>>>>>interface. There are a lot of them, say 15000. I write the blocks in two
> >>>>>>>>>ways (the data used in writing are garbage):
> >>>>>>>>>
> >>>>>>>>>(1) Write each block fully and sequentially, ie. 8192 bytes.
> >>>>>>>>>
> >>>>>>>>>(2) I still write these blocks sequentially, but for each block I only
> >>>>>>>>>write part of it.
> >>>>>>>>>Exactly how many bytes are written inside each block is
> >>>>>>>>>determined by a random number between 512 .. 8192 bytes (rounded up
> >>>>>>>>>to a multiple of 512 bytes).
> >>>>>>>>>
> >>>>>>>>>I find out that the performance of (2) is several times better than the
> >>>>>>>>>performance of (1). Can anyone explain to me why this is the case?
> >>>>>>>>>
> >>>>>>>>>Thanks for any suggestions or hints.
> >>>>>>>>>
> >>>>>>>>>-Zhihui
> >>>>>>>>>
> >>>>>>>>>To Unsubscribe: send mail to majordomo@FreeBSD.org
> >>>>>>>>>with "unsubscribe freebsd-hackers" in the body of the message
> >>>>
> >>>>--
> >>>>Lars Eggert                  Information Sciences Institute
> >>>>http://www.isi.edu/larse/    University of Southern California
> >>
> >>--
> >>Lars Eggert                  Information Sciences Institute
> >>http://www.isi.edu/larse/    University of Southern California
>
> --
> Lars Eggert                  Information Sciences Institute
> http://www.isi.edu/larse/    University of Southern California

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
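
Both measurements in this thread (time() in the user program, time_second
in the kernel patch) tick once per second, while Lars asks for the
difference in milliseconds. The following is only a minimal sketch of how
the user-side loop could pick its write size and time itself at finer
resolution, assuming gettimeofday() is acceptable for that purpose. The
names round_up, pick_write_size and elapsed_ms are hypothetical helpers,
not part of the original program, and the rnd argument stands in for the
value returned by RANDOM().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

#define DEV_BSIZE 512	/* sector size used in the thread */

/* Round x up to the next multiple of y (same idea as roundup()). */
static size_t
round_up(size_t x, size_t y)
{
	assert(y > 0);
	return ((x + y - 1) / y) * y;
}

/*
 * Hypothetical helper: choose the per-block write size the way the
 * quoted loop does.  rnd stands in for
 * RANDOM(write_size_high - write_size_low).
 */
static size_t
pick_write_size(int testcase, size_t blocksize, size_t low, size_t rnd)
{
	if (testcase == 1)
		return blocksize;		/* case (1): full block */
	return round_up(low + rnd, DEV_BSIZE);	/* case (2): partial block */
}

/* Elapsed whole milliseconds between two gettimeofday() samples. */
static long
elapsed_ms(const struct timeval *start, const struct timeval *end)
{
	long long us;

	us = (long long)(end->tv_sec - start->tv_sec) * 1000000LL +
	    (long long)(end->tv_usec - start->tv_usec);
	return (long)(us / 1000);
}
```

In the loop quoted above, one would call gettimeofday(&t1, NULL) before the
loop and gettimeofday(&t2, NULL) after it, then report elapsed_ms(&t1, &t2)
instead of time2 - time1.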