Date:      Sat, 24 Aug 1996 11:31:26 -0500 (CDT)
From:      Joe Greco <jgreco@brasil.moneng.mei.com>
To:        michaelv@MindBender.serv.net (Michael L. VanLoon)
Cc:        jgreco@brasil.moneng.mei.com, michael@memra.com, craigs@os.com, freebsd-isp@freebsd.org, mvanloon@microsoft.com
Subject:   Re: Anyone using ccd (FreeBSD disk striper) for news
Message-ID:  <199608241631.LAA28292@brasil.moneng.mei.com>
In-Reply-To: <199608240708.AAA02076@MindBender.serv.net> from "Michael L. VanLoon" at Aug 24, 96 00:08:15 am

> >Build it for speed and as close to zero latency as possible.  Use more disks
> >instead of less.  Stripe lots of FAST 1GB disks - like the new Hawk 31055's
> >- instead of going with larger drives.  2 9ms 1G disks are ALWAYS faster
> >than 1 8ms 2G disk, and the price is similar!  Go with more SCSI busses.
> >NCR controllers are $60 apiece.  Get three, and a 10/100 PCI Ethernet
> >controller, and you're still only putting out about $350 for your I/O
> >controllers.
> 
> Have you compared Adaptec 2940UW's with tagged-command-queuing enabled
> to these?  I found tagged-queuing to be a huge win in some benchmarks
> I ran recently when comparing a BusLogic and Adaptec controller.  Does
> the NCR driver do tagged-command-queuing?

# ncrcontrol
T:L  Vendor   Device           Rev  Speed   Max Wide Tags
0:0  SEAGATE  ST31055N         0318  10.0  10.0   8    4
1:0  SEAGATE  ST31055N         0318  10.0  10.0   8    4
2:0  SEAGATE  ST31055N         0318  10.0  10.0   8    4
3:0  SEAGATE  ST31055N         0318  10.0  10.0   8    4
4:0  SEAGATE  ST31055N         0318  10.0  10.0   8    4

Yes (that's the Tags column above).  I believe that you can set the number;
I haven't seen any reason yet to bother with it.

> >Use a large stripe size.  I use 1 cylinder group.  You are not striping for
> >bandwidth.  You are striping for CONCURRENCY.  You _want_ one mechanism to
> >be able to handle an _entire_ file access on its own.
> 
> Is this something you just deduced, or have you proven this under real
> newsfeed conditions? 

It's something that simple filesystem concurrency tests favored, it's the
traditional news wisdom, and if you think about it, it makes a lot of
sense.

Your traditional striping paradigm is designed to double the BANDWIDTH off
the disks... i.e. combine two drives that peak out at 2.5MB/s to get an
aggregate 5MB/s.  This is done by a combination of small stripe sizes,
the fact that the drive will tend to read ahead, and concurrent read
requests.  You end up with multiple mechanisms whose heads are moving 
closely in sync.

This is stupid for news, where your average transaction is very small, and
in reality what you want is not greater bandwidth but greater transactions
per second.  You get that by designing your disk I/O subsystem for
concurrency:  if you open a particular file, you want (best case) ONE
mechanism to do the directory lookup and data fetch for that file.
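
Back-of-the-envelope numbers (my round figures, not measurements): a 5400
RPM drive runs roughly 9ms average seek plus ~5.6ms average rotational
latency, call it ~15ms per random access, or about 65 transactions per
second per spindle.  Five independent spindles gets you 300+ TPS, but ONLY
if each transaction stays on one mechanism; split it across two and you
burn two of those ~15ms slots to move one article.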

In reality that best case is hopeless for a path like

/news/comp/protocols/tcp-ip/domains/12345

because each directory will, in general, be in a different area of the
disk.
So the best optimization you can make is hope that you can arrange for
"domains/12345" to be accessed by a single mechanism, which you can do by
setting a LARGE stripe size.  Incidentally, you often get a free ride on
the "/news/comp/protocols/tcp-ip" portion, because a good amount of that is
likely to be cached by the system already.  Your terminal node directories
("domains" in this case) are the least likely to be cached and the most
likely to be read.

You see how it works?  :-)  It ain't perfect but there's no obviously
better solution unless you move to a news-specialized FS.

> I'm still slightly skeptical -- I think I'd
> start by trying smaller interleaves to increase the likelihood of
> randomizing the drive accessed per file, going with maybe cluster size
> (16K) up to a physical drive cylinder (~600K, probably) per
> interleave.  But, if you've done extensive testing (and only if you've
> done extensive testing) of these alternatives, I'll take your advice
> as the direction to go in.

You have the right idea (randomize the drive accessed per file!!!!) but you
have to remember that you often are forced to do that lookup in the
"domains" directory, and then fetch the data.  A smaller interleave means an
increased likelihood that one mechanism will do the directory lookup and the
other gets the data.  This is inefficient because the first mechanism was
already in the neighborhood and the second mechanism's time is being wasted.
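
If you want to play with the mapping yourself, here is a toy sketch (mine,
not the actual ccd code; ccd really maps in sectors, and the offsets below
are made up) of the standard round-robin stripe mapping,
disk = (offset / interleave) % ndisks:

#include <stdio.h>

/* Which component disk serves a given byte offset, for a given
 * interleave?  Standard round-robin stripe mapping. */
static int
disk_for(long offset, long interleave, int ndisks)
{
	return (int)((offset / interleave) % ndisks);
}

int
main(void)
{
	long dirblk = 3L * 1024 * 1024;		/* made-up directory block offset */
	long datablk = dirblk + 24L * 1024;	/* file data 24K past it */
	int ndisks = 5;

	printf("16K interleave: dir on disk %d, data on disk %d\n",
	    disk_for(dirblk, 16L * 1024, ndisks),
	    disk_for(datablk, 16L * 1024, ndisks));
	printf("1MB interleave: dir on disk %d, data on disk %d\n",
	    disk_for(dirblk, 1024L * 1024, ndisks),
	    disk_for(datablk, 1024L * 1024, ndisks));
	return (0);
}

With the 16K interleave the directory block and the data land on different
spindles (disks 2 and 3); with the 1MB interleave one mechanism does the
whole job (disk 3 both times).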

It is not up to me to convince you, however.  Do your own tests and draw
your own conclusions.  Then go look with DejaNews through news.software.nntp
for discussions of this in the past.

> How many drives per controller, and controllers per machine would you
> say is "optimum"?

At $60 a controller I say stuff the machine with controllers and spread
your disks out over them!  (On a PCI system that means 3 SCSI controllers.
Compare your average Sun with its $800 SCSI controllers: most Sun news
servers have only one or maybe two SCSI busses.)

It only costs you $120 more (two extras) to get one third the SCSI bus
contention of using a single controller.  At that price why bother figuring
out if two or three is optimal... "just do it".

Your drives then obviously get spread out among the busses.  Note:  I stripe
_across_ busses because I intuitively believe that this may give me better
response.
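
For concreteness, here is roughly what such a setup looks like.  Treat this
as a sketch: I am assuming the ccdconfig(8) interface, the device names are
illustrative (sd0/sd3 on bus 0, sd1/sd4 on bus 1, sd2 on bus 2, so adjacent
components alternate busses), and the 2048-sector (1MB) interleave is a
stand-in for "one cylinder group"; check newfs output for your real
cylinder group size.

# ccdconfig ccd0 2048 none /dev/sd0e /dev/sd1e /dev/sd2e /dev/sd3e /dev/sd4e
# newfs /dev/rccd0c
# mount /dev/ccd0c /news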

> >Don't compromise on RAM.  Stuff it.  My feeds box has 128MB RAM.  The
> >readers have 256MB (we had some fun with that though).
> 
> What special tricks did you need to do to FreeBSD to make it run in
> 128MB of RAM?

128MB works fine with Triton-I and Triton-II.  RTFMM (read the fine
motherboard manual) for RAM recommendations, though.  You then set "options
MAXMEM" in your kernel config, because your standard PC BIOS apparently
reports memory above 64MB in some odd fashion that FreeBSD doesn't
comprehend yet.
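
For the record, the kernel config line looks something like this (the value
is in kilobytes; this is from memory, so verify it against LINT):

options		"MAXMEM=(128*1024)"	# 128MB box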

> 256MB?  Anything?

We had a summer-long adventure with 256MB.  You need a Triton-II board.  If
you really plan to do this, contact me in e-mail and I'll talk to some
people and give you some more details.

> Did I understand that you're running a 2.2 snapshot?  Is there a
> particular reason you're using this and not 2.1.5?

I'm using 2.1.5R.  I do not use snapshots on production systems.

> Also, what ethernet card has given you the best results (specific
> model, please)?

I've used the Kingston PCI 10bT cheapies with great success; I think it's
the KNE40T, but I don't recall the model # for sure (my supplier knows what
I mean when I order one).  The SMC EtherPower 10/100 (9332?) works great as
well; I've seen these hooked up to a SynOptics switch, and you can really
shovel data around.

> I'm going to be setting up a killer newsfeed-sucking machine at work
> to do performance testing against, and I want to wring as much
> performance as I can out of this box (It'll be a Dell OptiPlex P5
> 133MHz -- the rest is up to me).
> 
> Any other tips you (or anyone else) would like to share?

"The more, the merrier".  That applies to every resource: RAM, disks, SCSI
busses, etc.

... JG


