Date:      Tue, 5 Mar 2002 18:32:54 -0500 (EST)
From:      Zhihui Zhang <zzhang@cs.binghamton.edu>
To:        Lars Eggert <larse@ISI.EDU>
Cc:        "Rogier R. Mulhuijzen" <drwilco@drwilco.net>, Julian Elischer <julian@elischer.org>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: A weird disk behaviour
Message-ID:  <Pine.SOL.4.21.0203051826390.13181-100000@onyx>
In-Reply-To: <3C85542B.5060100@isi.edu>


I apologize to all who have followed this. I made a typo in the original
email. What I observed is that writing LESS performs WORSE. Since all
blocks are laid out contiguously and I write them sequentially, there
should not be any seek problem.  I have modified the kernel in
kern_physio.c and found that physio() is called the expected number of
times. I even added some code there to record the elapsed time:

                        t1 = time_second;
                         
                        BUF_STRATEGY(bp, 0);
                        spl = splbio();
                        while ((bp->b_flags & B_DONE) == 0)  
                                tsleep((caddr_t)bp, PRIBIO, "physstr", 0);
                        splx(spl);
                                
                        t2 = time_second;
                        physio_time += t2 - t1;

The value of physio_time (a sysctl variable) is close to the time reported
by the user program.
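
Note that time_second advances only once per second, so each t2 - t1 in the
snippet above is a whole number of seconds, and short requests contribute 0
or 1. As a rough user-space sketch of the same accumulate-the-elapsed-time
idea at finer resolution, something like the following could be used. The
names elapsed_ns, io_time_ns and timed_call are hypothetical and appear
nowhere in the kernel; this assumes a POSIX system that provides
clock_gettime() with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Elapsed time from *start to *end, in nanoseconds. */
long long
elapsed_ns(const struct timespec *start, const struct timespec *end)
{
	return (long long)(end->tv_sec - start->tv_sec) * 1000000000LL +
	    (long long)(end->tv_nsec - start->tv_nsec);
}

/* Hypothetical accumulator, playing the role of physio_time. */
long long io_time_ns;

/*
 * Run fn(arg) and add the time it took to io_time_ns.
 * fn stands in for "issue one request and wait for it to finish".
 */
void
timed_call(void (*fn)(void *), void *arg)
{
	struct timespec t1, t2;
	int rc;

	assert(fn != NULL);
	rc = clock_gettime(CLOCK_MONOTONIC, &t1);
	assert(rc == 0);
	fn(arg);
	rc = clock_gettime(CLOCK_MONOTONIC, &t2);
	assert(rc == 0);
	io_time_ns += elapsed_ns(&t1, &t2);
}
```

Wrapping each write call of the test program in timed_call() and printing
io_time_ns at the end would give a per-request total that can be compared
between the two test cases.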

-Zhihui

On Tue, 5 Mar 2002, Lars Eggert wrote:

> Zhihui Zhang wrote:
> > Several times slower! The point is that writing less data performs
> > worse. So I call it weird.
> 
> Huh? You originally said:
> 
>  > (1) Write each block fully and sequentially, ie. 8192 bytes.
>  >
>  > (2) I still write these blocks sequentially, but for each block I only
>  > write part of it.
> ...
>  > I find that the performance of (2) is several times better than the
>  > performance of (1). Can anyone explain to me why this is the case?
> 
> If (2) is better than (1), then writing *less* data is faster. Which is 
> it, now?
> 
> Lars
> 
> 
> 
> > -Zhihui
> > 
> > On Tue, 5 Mar 2002, Lars Eggert wrote:
> > 
> > 
> >>Zhihui Zhang wrote:
> >>
> >>>Well, the core of my program is as follows (RANDOM(x) returns a value
> >>>between 0 and x):
> >>>
> >>>        blocksize = 8192;
> >>>        write_size_low = 512;
> >>>
> >>>	time(&time1);
> >>>	for (i = 0; i < write_count; i++) {
> >>>		write_size = write_size_low +
> >>>                         RANDOM(write_size_high-write_size_low);
> >>>		write_size = roundup(write_size, DEV_BSIZE);
> >>>		if (testcase == 1)
> >>>			write_size = blocksize;
> >>>		write_block(rawfd, sectorno, buf, write_size);
> >>>		sectorno += blocksize / DEV_BSIZE;
> >>>	}
> >>>        time(&time2);
> >>>
> >>>If testcase is one, then the time elapsed (time2 - time1) is much less.
> >>>
> >>How "much less" in milliseconds?
> >>
> >>Also, in your original mail, you said you had 15,000 of these 8K blocks, 
> >>which is only 120MB or so. Use 150,000 or 1,500,000 and check your 
> >>results then.
> >>
> >>Lars
> >>
> >>
> >>
> >>
> >>>-Zhihui
> >>>
> >>>On Tue, 5 Mar 2002, Lars Eggert wrote:
> >>>
> >>>
> >>>
> >>>>I agree that it's probably caching at some level. You're only writing 
> >>>>about 120MB of data (and half that in your second case). Bump these to a 
> >>>>couple of GB and see what happens.
> >>>>
> >>>>Also, could you post your actual measurements?
> >>>>
> >>>>Lars
> >>>>
> >>>>
> >>>>Zhihui Zhang wrote:
> >>>>
> >>>>
> >>>>>The machine has 128M memory. I am doing physical I/O one block at a time,
> >>>>>so there should be no memory copy.
> >>>>>
> >>>>>-Zhihui
> >>>>>
> >>>>>On Tue, 5 Mar 2002, Rogier R. Mulhuijzen wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>At 16:03 5-3-2002 -0500, Zhihui Zhang wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>On Tue, 5 Mar 2002, Julian Elischer wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>more writes fit in the disk's write cache?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>For (1), it writes 15000 * 8192 bytes in all.  For (2), it writes 15000 *
> >>>>>>>4096 bytes in all (assuming the random number distributes evenly between 0
> >>>>>>>and 8192).  So your suggestion does not make sense to me.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>How large is your buffercache?  It might be that the 15000 * ~4096 roughly 
> >>>>>>matches with your cache, and 15000 * 8192 doesn't.
> >>>>>>
> >>>>>>Case (1) would require a lot more physical IO in that case than case (2) 
> >>>>>>would require.
> >>>>>>
> >>>>>>       Doc
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>-Zhihui
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>On Tue, 5 Mar 2002, Zhihui Zhang wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>I am doing some raw I/O test on a seagate SCSI disk running FreeBSD 4.5.
> >>>>>>>>>This situation is like this:
> >>>>>>>>>
> >>>>>>>>>+-----+----+----+----+----+----+----+----+----+----+---+------
> >>>>>>>>>|     |    |    |    |    |    |    |    |    |    |   | ....
> >>>>>>>>>+-----+----+----+----+----+----+----+----+----+----+---+------
> >>>>>>>>>
> >>>>>>>>>Each block is of fixed size, say 8192 bytes. Now I have a user program
> >>>>>>>>>writing each contiguously laid out block sequentially using /dev/daxxx
> >>>>>>>>>interface. There are a lot of them, say 15000.  I write the blocks in two
> >>>>>>>>>ways (the data used in writing are garbage):
> >>>>>>>>>
> >>>>>>>>>(1) Write each block fully and sequentially, ie. 8192 bytes.
> >>>>>>>>>
> >>>>>>>>>(2) I still write these blocks sequentially, but for each block I only
> >>>>>>>>>write part of it.  Exactly how many bytes are written inside each block
> >>>>>>>>>is determined by a random number between 512 .. 8192 bytes (rounded up
> >>>>>>>>>to a multiple of 512 bytes).
> >>>>>>>>>
> >>>>>>>>>I find that the performance of (2) is several times better than the
> >>>>>>>>>performance of (1). Can anyone explain to me why this is the case?
> >>>>>>>>>
> >>>>>>>>>Thanks for any suggestions or hints.
> >>>>>>>>>
> >>>>>>>>>-Zhihui
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>To Unsubscribe: send mail to majordomo@FreeBSD.org
> >>>>>>>>>with "unsubscribe freebsd-hackers" in the body of the message
> >>>>
> >>>>-- 
> >>>>Lars Eggert <larse@isi.edu>               Information Sciences Institute
> >>>>http://www.isi.edu/larse/              University of Southern California
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>-- 
> >>Lars Eggert <larse@isi.edu>               Information Sciences Institute
> >>http://www.isi.edu/larse/              University of Southern California
> >>
> >>
> 
> 
> 
> -- 
> Lars Eggert <larse@isi.edu>               Information Sciences Institute
> http://www.isi.edu/larse/              University of Southern California
> 
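
To make the comparison in the quoted thread concrete, here is a minimal
sketch of the write-size selection from the quoted test loop, with the disk
I/O left out so that only the amount of data per test case is computed. It
is not the original program: random_upto() is a hypothetical stand-in for
the RANDOM() macro, ROUNDUP mirrors the usual roundup() macro, and the upper
bound of 8192 is taken from the original description, since the quoted code
does not show write_size_high.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEV_BSIZE 512	/* sector size assumed throughout the thread */

/* Round x up to the next multiple of y. */
#define ROUNDUP(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

/* Hypothetical stand-in for RANDOM(x): a value between 0 and x. */
unsigned
random_upto(unsigned x)
{
	return (unsigned)rand() % (x + 1);
}

/* Bytes written into one block: the full block in case 1, part of it otherwise. */
size_t
pick_write_size(int testcase, size_t blocksize, size_t low, size_t high)
{
	size_t n;

	assert(low <= high && high <= blocksize);
	if (testcase == 1)
		return blocksize;
	n = low + random_upto((unsigned)(high - low));
	return ROUNDUP(n, DEV_BSIZE);
}

/* Total bytes written over write_count blocks of 8192 bytes each. */
unsigned long long
total_bytes(int testcase, unsigned write_count)
{
	unsigned long long total = 0;
	unsigned i;

	for (i = 0; i < write_count; i++)
		total += pick_write_size(testcase, 8192, 512, 8192);
	return total;
}
```

With 15000 blocks, case (1) always comes to 15000 * 8192 bytes, while case
(2) lands somewhere between 15000 * 512 and 15000 * 8192 bytes, typically a
bit over half of the case (1) total. Both cases issue the same number of
requests at the same starting sectors, so only the length of each request
differs.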





