FreeBSD Mail Archives

Date:      Wed, 6 Mar 2002 11:43:16 -0500 (EST)
From:      Zhihui Zhang <zzhang@cs.binghamton.edu>
To:        "Brian T.Schellenberger" <bts@babbleon.org>
Cc:        Lars Eggert <larse@ISI.EDU>, "Rogier R. Mulhuijzen" <drwilco@drwilco.net>, Julian Elischer <julian@elischer.org>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: A weird disk behaviour
Message-ID:  <Pine.SOL.4.21.0203061132090.5743-100000@onyx>
In-Reply-To: <20020306025055.2C1B5BA03@i8k.babbleon.org>




On Tue, 5 Mar 2002, Brian T.Schellenberger wrote:

> On Tuesday 05 March 2002 06:32 pm, Zhihui Zhang wrote:
> > I apologize to all who have followed this. I made a typo in the original
> > email. What I observed is that writing LESS performs WORSE. Since all
> > blocks are laid out contiguously and I write them sequentially, there
> > should not be any seek problem.
> 
> Hmmm . . . perhaps I misunderstood you, but I thought you said in the
> original mail that you were writing to the same number of disk blocks
> either way, but in some cases you were writing partial blocks and in some
> cases full blocks.  How do you do that if you don't seek?
> 
> If you aren't seeking, then you must be, in the slower case, writing partial 
> blocks.  Well, there is some size where the disk has physical blocks.  On 
> some disks, writes are always done in full physical blocks.  To write a 
> partial block, the block is read from disk, the data to be written is 
> substituted and then the entire block is written.  This would certainly be 
> likely to be slower than writing a whole block.

In the case of partial block writes, I move to the next block, which is
contiguous to the current block.  So the start address of each write is
exactly the same in both cases. The only difference is that one case writes
full blocks and the other writes partial blocks. I also do not read anything
during the partial block writes, and I think the disk controller should not
do so either.
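
To make the layout concrete, here is a minimal sketch of the offset
arithmetic (not the actual test program). The names BLOCK_SIZE,
SECTOR_SIZE, write_start, write_len and gap_after are made up for
illustration, and the 512-byte sector size is an assumption standing in
for DEV_BSIZE:

```c
#include <assert.h>

#define SECTOR_SIZE 512UL   /* assumed value of DEV_BSIZE */
#define BLOCK_SIZE  8192UL  /* fixed on-disk block size from the test */

/* Byte offset where write number i starts; the same in both test cases. */
unsigned long write_start(unsigned long i)
{
    return i * BLOCK_SIZE;
}

/* Round n up to the next multiple of the sector size. */
unsigned long round_to_sector(unsigned long n)
{
    return ((n + SECTOR_SIZE - 1) / SECTOR_SIZE) * SECTOR_SIZE;
}

/*
 * Length of one write: a full block in test case 1, otherwise the
 * requested byte count rounded up to whole sectors.
 */
unsigned long write_len(int testcase, unsigned long requested)
{
    assert(requested > 0 && requested <= BLOCK_SIZE);
    if (testcase == 1)
        return BLOCK_SIZE;
    return round_to_sector(requested);
}

/* Bytes left untouched between the end of a write and the next block. */
unsigned long gap_after(int testcase, unsigned long requested)
{
    return BLOCK_SIZE - write_len(testcase, requested);
}
```

For example, write_start(3) is 24576 in either case, while
write_len(2, 700) is 1024, which leaves a 7168-byte gap that is skipped
before the next write begins.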

-Zhihui
 
> Does this possibly explain what you are seeing?
> 
> Note that I have no clue whether this happens with many real disks, or even 
> with any made in the last 20 years, but I have heard tell of such things.
> 
> >  I have modified the kernel in
> > kern_physio.c and found that physio() is called the expected number of
> > times. I even added some code to record the time elapsed there:
> >
> >                         t1 = time_second;
> >
> >                         BUF_STRATEGY(bp, 0);
> >                         spl = splbio();
> >                         while ((bp->b_flags & B_DONE) == 0)
> >                                 tsleep((caddr_t)bp, PRIBIO, "physstr", 0);
> >                         splx(spl);
> >
> >                         t2 = time_second;
> >                         physio_time += t2 - t1;
> >
> > the physio_time (a sysctl variable) is close to the time reported by the
> > user program.
> >
> > -Zhihui
> >
> > On Tue, 5 Mar 2002, Lars Eggert wrote:
> > > Zhihui Zhang wrote:
> > > > Several times slower! The point is that writing less data performs
> > > > worse. So I call it weird.
> > >
> > > Huh? You originally said:
> > >  > (1) Write each block fully and sequentially, ie. 8192 bytes.
> > >  >
> > >  > (2) I still write these blocks sequentially, but for each block I only
> > >  > write part of it.
> > >
> > > ...
> > >
> > >  > I find out that the performance of (2) is several times better than the
> > >  > performance of (1). Can anyone explain to me why this is the case?
> > >
> > > If (2) is better than (1), then writing *less* data is faster. Which is
> > > it, now?
> > >
> > > Lars
> > >
> > > > -Zhihui
> > > >
> > > > On Tue, 5 Mar 2002, Lars Eggert wrote:
> > > >>Zhihui Zhang wrote:
> > > >>>Well, the core of my program is as follows (RANDOM(x) returns a value
> > > >>>between 0 and x):
> > > >>>
> > > >>>        blocksize = 8192;
> > > >>>        write_size_low = 512;
> > > >>>
> > > >>>	time(&time1);
> > > >>>	for (i = 0; i < write_count; i++) {
> > > >>>		write_size = write_size_low +
> > > >>>                         RANDOM(write_size_high-write_size_low);
> > > >>>		write_size = roundup(write_size, DEV_BSIZE);
> > > >>>		if (testcase == 1)
> > > >>>			write_size = blocksize;
> > > >>>		write_block(rawfd, sectorno, buf, write_size);
> > > >>>		sectorno += blocksize / DEV_BSIZE;
> > > >>>	}
> > > >>>        time(&time2);
> > > >>>
> > > >>>If testcase is one, then the time elapsed (time2 - time1) is much
> > > >>> less.
> > > >>
> > > >>How "much less" in milliseconds?
> > > >>
> > > >>Also, in your original mail, you said you had 15,000 of these 8K
> > > >> blocks, which is only 120MB or so. Use 150,000 or 1,500,000 and check
> > > >> your results then.
> > > >>
> > > >>Lars
> > > >>
> > > >>>-Zhihui
> > > >>>
> > > >>>On Tue, 5 Mar 2002, Lars Eggert wrote:
> > > >>>>I agree that it's probably caching at some level. You're only writing
> > > >>>>about 120MB of data (and half that in your second case). Bump these
> > > >>>> to a couple of GB and see what happens.
> > > >>>>
> > > >>>>Also, could you post your actual measurements?
> > > >>>>
> > > >>>>Lars
> > > >>>>
> > > >>>>Zhihui Zhang wrote:
> > > >>>>>The machine has 128M memory. I am doing physical I/O one block at a
> > > >>>>> time, so there should be no memory copy.
> > > >>>>>
> > > >>>>>-Zhihui
> > > >>>>>
> > > >>>>>On Tue, 5 Mar 2002, Rogier R. Mulhuijzen wrote:
> > > >>>>>>At 16:03 5-3-2002 -0500, Zhihui Zhang wrote:
> > > >>>>>>>On Tue, 5 Mar 2002, Julian Elischer wrote:
> > > >>>>>>>>more writes fit in the disk's write cache?
> > > >>>>>>>
> > > >>>>>>>For (1), it writes 15000 * 8192 bytes in all.  For (2), it writes
> > > >>>>>>> 15000 * 4096 bytes in all (assuming the random number distributes
> > > >>>>>>> evenly between 0 and 8192).  So your suggestion does not make
> > > >>>>>>> sense to me.
> > > >>>>>>
> > > >>>>>>How large is your buffercache?  it might be that the 15000 * ~4096
> > > >>>>>> roughly matches with your cache, and 15000 * 8192 doesn't.
> > > >>>>>>
> > > >>>>>>Case (1) would require a lot more physical IO in that case than
> > > >>>>>> case (2) would require.
> > > >>>>>>
> > > >>>>>>       Doc
> > > >>>>>>
> > > >>>>>>>-Zhihui
> > > >>>>>>>
> > > >>>>>>>>On Tue, 5 Mar 2002, Zhihui Zhang wrote:
> > > >>>>>>>>>I am doing some raw I/O test on a seagate SCSI disk running
> > > >>>>>>>>> FreeBSD 4.5. This situation is like this:
> > > >>>>>>>>>
> > > >>>>>>>>>+-----+----+----+----+----+----+----+----+----+----+---+------
> > > >>>>>>>>>|     |    |    |    |    |    |    |    |    |    |   | ....
> > > >>>>>>>>>+-----+----+----+----+----+----+----+----+----+----+---+------
> > > >>>>>>>>>
> > > >>>>>>>>>Each block is of fixed size, say 8192 bytes. Now I have a user
> > > >>>>>>>>> program writing each contiguously laid out block sequentially
> > > >>>>>>>>> using /dev/daxxx interface. There are a lot of them, say 15000.
> > > >>>>>>>>>  I write the blocks in two ways (the data used in writing are
> > > >>>>>>>>> garbage):
> > > >>>>>>>>>
> > > >>>>>>>>>(1) Write each block fully and sequentially, ie. 8192 bytes.
> > > >>>>>>>>>
> > > >>>>>>>>>(2) I still write these blocks sequentially, but for each block
> > > >>>>>>>>> I only write part of it.  Exactly how many bytes are written
> > > >>>>>>>>> inside each block is determined by a random number between
> > > >>>>>>>>> 512 .. 8192 bytes (rounded up to a multiple of 512 bytes).
> > > >>>>>>>>>
> > > >>>>>>>>>I find out that the performance of (2) is several times better
> > > >>>>>>>>> than the performance of (1). Can anyone explain to me why this
> > > >>>>>>>>> is the case?
> > > >>>>>>>>>
> > > >>>>>>>>>Thanks for any suggestions or hints.
> > > >>>>>>>>>
> > > >>>>>>>>>-Zhihui
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>To Unsubscribe: send mail to majordomo@FreeBSD.org
> > > >>>>>>>>>with "unsubscribe freebsd-hackers" in the body of the message
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>--
> > > >>>>Lars Eggert <larse@isi.edu>               Information Sciences Institute
> > > >>>>http://www.isi.edu/larse/              University of Southern California
> > > >>
> > > >>--
> > > >>Lars Eggert <larse@isi.edu>               Information Sciences Institute
> > > >>http://www.isi.edu/larse/              University of Southern California
> > >
> > > --
> > > Lars Eggert <larse@isi.edu>               Information Sciences Institute
> > > http://www.isi.edu/larse/              University of Southern California
> >
> 
> -- 
> Brian T. Schellenberger . . . . . . .   bts@wnt.sas.com (work)
> Brian, the man from Babble-On . . . .   bts@babbleon.org (personal)
>                                 ME -->  http://www.babbleon.org
> http://www.eff.org   <-- GOOD GUYS -->  http://www.programming-freedom.org 
> 
> 



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?Pine.SOL.4.21.0203061132090.5743-100000>
