From owner-freebsd-chat Fri Feb 14 19:38:44 2003
Date: Fri, 14 Feb 2003 19:37:02 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Brad Knowles
Cc: Rahul Siddharthan, freebsd-chat@freebsd.org
Subject: Re: Email push and pull (was Re: matthew dillon)

Brad Knowles wrote:
> At 3:08 PM -0800 2003/02/14, Terry Lambert wrote:
> > I've got to say that on any mail server I've ever worked on, the
> > limitation on what it could handle was *never* disk I/O, unless
> > it was before trivial tuning had been done.  It was always network
> > I/O, bus bandwidth, or CPU.
>
> I have yet to see a mail system that had limitations in any of these
> areas, and other than the ones you've seen, I have yet to hear of
> such a mail system from any other mail systems expert I have talked
> to -- including the AOL mail system.

I like to build systems that are inherently scalable, so that you can
just throw more resources at them.  That means I'm usually more
concerned with breaking the ties between where the data resides and
where it gets read/written from.

For the one-machine-per-50,000-domains case we dealt with, the
machines in question were 2x166MHz PPC 604's.  I guess if the machines
were 3GHz P4's or something, then my perspective might be different...
though probably not much, since such systems would not have had a
crossbar bus, and there are stalls from CPU clock multipliers which I
did not have to deal with.

The pipes in and out have always been on the order of T1 to T3, as
well, with an assumption of 50% bandwidth in and 50% bandwidth out.
In no case was I interested in the leaf node server, unless it was on
the other end of a (comparatively) low bandwidth link.

The only thing that concerned me about the disk was pool retention
time for time-in-queue at the main queue depth.  I was only concerned
about that for iteration, not lookup: for lookup, the system in
question had a btree-structured directory implementation, so lookups
were always O(log2(N) + 1).
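
To put a number on that: with a btree-structured directory, lookup
cost grows with the log of the queue depth, so even a very deep queue
stays cheap to search.  A back-of-the-envelope sketch (the sizes are
hypothetical, and this is not code from that system):

    /*
     * Approximate btree directory lookup cost: about log2(N) + 1
     * comparisons for N entries.  Compile with -lm.
     */
    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
            unsigned long sizes[] = { 1000, 100000, 1000000 };
            size_t i;

            for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                    printf("%8lu queue entries -> ~%.0f comparisons\n",
                        sizes[i], floor(log2((double)sizes[i])) + 1);
            return (0);
    }

Even a million queue entries comes out to around 20 comparisons per
lookup, which is why iteration, not lookup, was the only disk concern.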
Even then, since the data was being moved to per-domain queues, the
main mail queue never ended up getting very deep, even when the server
was saturated at 100Mbit, which was "worst case" (if the queue had
started growing and continued growing at that point, I would have had
to throttle the input rate to avoid a queue livelock state, and ensure
that the drain rate was >= the fill rate; I sketch what I mean further
down in this reply).

The FS's were AIX JFS, striped, on 6 10,000 RPM SCSI spindles.  I can
imagine there being problems with UFS in a similar circumstance,
unless you carefully designed your usage semantics to ensure that you
would not introduce stalls through your choice of FS.  Even so,
however, I think it's possible to implement in such a way as to avoid
the stalls.

But my problems were *always* network saturation, before hitting any
other limit (even CPU).

> > As far as I'm concerned, for most applications, "disks are fast
> > enough"; even an IDE disk doing thermal recalibration can keep
> > up with full frame rate digitized video for simultaneous record
> > and playback.  5 of those, and you're at the limit of 100Mbit
> > ethernet in and out, and you need 5 for RAID.
>
> If all you ever care about is bandwidth and not latency, and you
> really do get the bandwidth numbers claimed by the manufacturer as
> opposed to the ones that we tend to see in the real world, I might
> possibly believe this.  Of course, I have yet to hear of a
> theoretical application where this would be the case, but I am
> willing to concede the point that such a thing might perhaps exist.

8-).

> > FWIW, since you really don't give a damn about timestamps on the
> > queue files in question, etc., you can relax POSIX guarantees on
> > certain metadata updates that were put there to make the DEC VMS
> > engineers happy by slowing down UNIX, relative to VMS, so they
> > would not file a lawsuit over POSIX being required for government
> > contracts.
>
> In what way don't you care about timestamps?  And which timestamps
> don't we care about?  Are you talking about noatime, or something
> else?  Note that noatime doesn't exist for NFS mounts, at least it's
> not one I have been able to specify on our systems.

You have to specify it on the NFS server system, or you specify it on
the NFS client system so the transactions are not attempted -- if you
care about the request going over the wire and being ignored, rather
than just being ignored.  Since the issue you are personally fighting
is write latency for requests you should probably not be making
(8-)), you ought to work hard on this.

If your NFS client systems are FreeBSD, then it's a fairly minor hack
to add the option to the NFS client code.  If your NFS server is
FreeBSD, then turning it off on the exported mount is probably
sufficient.  If neither is FreeBSD... well, can you switch to FreeBSD?
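
Here is the throttle sketch I promised above.  It is purely
illustrative -- the names, the limits, and the idea of gating on
observed queue growth are my own simplification, not code from that
system:

    /*
     * Hypothetical input-rate throttle: defer new SMTP connections
     * (with a 4xx) whenever the main queue is both deep and still
     * growing, i.e. the drain rate has fallen below the fill rate.
     * The soft limit and sample depths are made up.
     */
    #include <stdio.h>

    #define QUEUE_SOFT_LIMIT 10000UL    /* entries before we defer */

    static unsigned long prev_depth;

    static int
    should_defer(unsigned long depth)
    {
            int growing = depth > prev_depth;

            prev_depth = depth;
            return (growing && depth > QUEUE_SOFT_LIMIT);
    }

    int
    main(void)
    {
            unsigned long samples[] = { 9000, 11000, 12000, 11500 };
            size_t i;

            for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                    printf("depth %lu -> %s\n", samples[i],
                        should_defer(samples[i]) ? "defer (421)" : "accept");
            return (0);
    }

The details don't matter; the point is just that an input gate keeps
the drain rate >= the fill rate, so the queue can't grow without
bound.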
> > Because those are not magically a consequence of increased
> > complexity.  Complexity can be managed.
>
> At what cost?

Engineering.

> > And they are back to transmitting 1 copy each,
>
> If they're putting the recipient name into the body of the e-mail
> message, then they're doing that anyway.  Since they don't care about
> whether any of their spam is lost, they can run from memory-based
> filesystems.  They can generate orders of magnitude more traffic than
> you could handle on the same hardware, simply because they don't have
> to worry what happens if the system crashes.  Moreover, they can use
> open relays and high-volume spam-sending networks to further increase
> their amplitude.

The point is not what they can do, it's what you can do.  You've
already admitted a 1.3x multiple-recipient rate.

> > Only if the messages came in on separate sessions, and they are
> > back to transmitting 1 copy each, and they lose their amplification
> > effect in any attack.
>
> See above.  Using SIS hurts you far more than it could possibly hurt
> them.

It's not intended to "hurt them".  It's only intended to deal with
multiple recipients for a single message.  SPAM is almost the only
type of externally generated mail that gets multiple recipient
targets.  The point is not to "hurt" them (if you wanted that, you
would run RBL or ORBS or SPEWS or ... and not accept connections from
their servers in the first place), but to mitigate their effect on
your storage costs.  Note that this is the same philosophy you've been
espousing all along with quotas: you don't care if it causes a problem
for your users, only if it causes a problem for you.

Internally, you have a higher connectedness between users, so you get
much larger than your 1.3 multiplier, and for email lists, it's higher
still.  In fact, I would go so far as to say that DJB's idea of
sending a reference is applicable to email list messages, only the
messages would be stored on the list server, instead of on the sender
machine.  There are MIME types for this, and it would be really useful
for any list which intends to archive its content anyway.  8-).

> >> So, you're right back where you started, and yet you've paid such
> >> a very high price.
> >
> > It's a price you have to pay anyway.
>
> No, you don't.

At the point that you no longer care which machine you send a user
connection to in order to retrieve their mail, you no longer care
where you send the mail, or whether the mail is a single instance
stored multiple times, a real replica, or a virtual replica (SIS).
It takes a small amount of additional work.

> > You mean "simply because we, the provider, failed to protect them
> > from a DOS".
>
> One user's DOS is another user's normal level of e-mail.  It's
> impossible to protect them from DOS at that level, because you cannot
> possibly know, a priori, what is a DOS for which person.  At higher
> levels, you can detect a DOS and at least delay it by having circuit
> breakers, such as quotas.

The repeated mailing ("mail bombing") that started this thread is, or
should be, simple to detect.  Yes, it's a trivial case, but it's the
most common case.  You don't have to go to a compute-intensive
technique to deal with it.

> > OK, that's a 25% reduction in the metadata overhead required,
> > which is what you claim is the bottleneck.  That doesn't look
> > insignificant to me.
>
> Read the slides again.  It doesn't reduce the meta-data overhead at
> all, only the data bandwidth required.  Using ln to create a hard
> link to another file requires just as much synchronous meta-data
> overhead as it does to create the file in the first place -- the only
> difference is that you didn't have to also store another copy of the
> file contents.

You are storing the reference wrong.  Use an encapsulated reference,
not a hard link.  That permits the metadata operations to occur
simultaneously, instead of constraining them to occur serially, the
way a link does.  In many of the systems I've seen, where the domain
name is used as an index into a hashed directory structure, you would
not be able to hard link in any case, since the link targets would be
on different physical FS's.
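
To make "encapsulated reference" concrete: I mean something as simple
as dropping a tiny file into each recipient's mail directory that
names the single stored copy of the message, rather than hard-linking
the copy itself.  A minimal sketch -- the paths, the ".ref" suffix,
and the one-line format are all hypothetical:

    /*
     * Each recipient gets an independent create+write of a small
     * reference file, so the per-recipient metadata updates are not
     * serialized on a shared link count, and the shared copy can live
     * on a different filesystem than the mailbox directories.
     */
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int
    drop_reference(const char *maildir, const char *refname,
        const char *store_path)
    {
            char path[1024], buf[1100];
            int fd, len;

            snprintf(path, sizeof(path), "%s/%s.ref", maildir, refname);
            len = snprintf(buf, sizeof(buf), "X-Message-Store: %s\n",
                store_path);

            if ((fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600)) == -1)
                    return (-1);
            if (write(fd, buf, len) != len) {
                    close(fd);
                    unlink(path);
                    return (-1);
            }
            return (close(fd));
    }

    int
    main(void)
    {
            /* One shared copy, two recipients; all paths hypothetical. */
            const char *store = "/var/spool/sis/ab/abcdef0123456789";

            if (drop_reference("/tmp", "user1-msg0001", store) == -1 ||
                drop_reference("/tmp", "user2-msg0001", store) == -1)
                    err(1, "drop_reference");
            puts("references written");
            return (0);
    }

The delivery agent resolves the reference at read time; reclaiming the
shared copy once the last reference is gone becomes a batchable
garbage collection problem, instead of synchronous metadata work paid
at delivery time.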
> However, as we said before, storing a copy of the file contents
> is cheap -- what kills us is the synchronous meta-data overhead.

You keep saying this, and then you keep arranging the situation (order
of operations, FS backing store, network transport protocol, etc.) so
that it's true, instead of trying to arrange things so it isn't.

> > My argument would be that this should be handled out-of-band
> > via a feedback mechanism, rather than in-band via an EQUOTA,
> > using the quota as a feedback mechanism.
>
> What kind of out-of-band mechanism did you have in mind?  Are we
> re-inventing the entirety of Internet e-mail yet once again?

No, we are not.  The transport protocols are the transport protocols,
and you are constrained to implement to the transport protocols, no
matter what else you do.  But you are not constrained to depend on
rename-based two-phase commits (for example), if your FS or data store
exports a transaction interface for use by applications: you can use
that transaction interface instead.

> > You're going to do that to the user anyway.  Worse, you are going
> > to give them a mailbox full of DOS crap, and drop good messages
> > in the toilet (you've taken responsibility for the delivery, so
> > the sender may not even have them any more, so when you drop them
> > after the 4 days, they are screwed;
>
> As soon as the user notices the overflowing mailbox, they can call
> the helpdesk, and the people on the helpdesk have tools available to
> them to do mass cleanup, sparing the user from having to deal with
> the problem.  That gives them seven days to notice the problem and
> fix it, before things might start bouncing.  We will likewise have
> daily monitoring processes that will set off alarms if a mailbox
> overflows, so that we can go take a look at it immediately.

So your queue return time is 7 days.

I have to say, I've personally dealt with "help desk" escalations for
problems like this, and it's incredibly labor intensive.  You should
always design as if you were going to have to deal with 100,000
customers or more, so that you put yourself in a position where manual
processes will not scale, and then think about the problem.

> >> Bait not taken.  The customer is paying me to implement quotas.
> >> This is a basic requirement.
> >
> > This is likely the source of the disconnect.  I view the person
> > whose mail I'm taking responsibility for as the customer.
>
> The users don't pay my salary.  The customer does.  I do everything I
> can to help the users in every way possible, but when it comes down
> to a choice of whether to do A or B, the customer decides -- not the
> users.

Which explains the general level of user satisfaction with this
industry, according to a recent survey, I think.  8-) 8-).

> > You wouldn't implement an out-of-band mechanism instead?
>
> Not at the price of re-inventing the entirety of Internet e-mail, no.

Something simple like recognizing repetitive size/sender/subject
pairing on the SMTP transit server would do.
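
For what it's worth, "simple" really is simple here.  A minimal sketch
of that kind of detector -- the table size, window, and threshold are
invented for illustration, and a real transit server would hang this
off its envelope-processing path:

    /*
     * Hypothetical mail-bomb detector: count recent messages sharing
     * the same (size, sender, subject) triple, and start deferring
     * once the count passes a threshold inside a time window.
     */
    #include <stdio.h>
    #include <time.h>

    #define TABLE_SIZE      4096
    #define WINDOW_SECS     600     /* 10 minute window */
    #define BOMB_THRESHOLD  20      /* copies before we defer */

    struct entry {
            unsigned long   hash;
            unsigned        count;
            time_t          first_seen;
    };

    static struct entry table[TABLE_SIZE];

    static unsigned long
    triple_hash(unsigned long size, const char *sender, const char *subj)
    {
            unsigned long h = 5381 + size;
            const char *p;

            for (p = sender; *p != '\0'; p++)
                    h = h * 33 + (unsigned char)*p;
            for (p = subj; *p != '\0'; p++)
                    h = h * 33 + (unsigned char)*p;
            return (h);
    }

    /* Return nonzero if this message looks like a repeat bombing run. */
    static int
    is_mail_bomb(unsigned long size, const char *sender, const char *subj)
    {
            unsigned long h = triple_hash(size, sender, subj);
            struct entry *e = &table[h % TABLE_SIZE];
            time_t now = time(NULL);

            if (e->hash != h || now - e->first_seen > WINDOW_SECS) {
                    e->hash = h;
                    e->count = 0;
                    e->first_seen = now;
            }
            return (++e->count > BOMB_THRESHOLD);
    }

    int
    main(void)
    {
            int i, deferred = 0;

            for (i = 0; i < 25; i++)
                    if (is_mail_bomb(17234, "bomber@example.com", "READ ME"))
                            deferred++;
            printf("25 identical messages, %d deferred\n", deferred);
            return (0);
    }

Hash collisions and table aging would need real handling in
production, but nothing about it is compute intensive.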
> > You'd insist on the in-band mechanism of an MDA error, after
> > you've already accepted responsibility for the message you aren't
> > going to be able to deliver?
>
> The message was accepted and delivered to stable storage, and we
> would have done the best we could possibly do to actually deliver it
> to the user's mailbox.  However, that's a gateway function -- the
> user's mailbox doesn't speak SMTP, and therefore we would have
> fulfilled all of our required duties, to the best of our ability.  No
> one has any right to expect any better.

Ugh.  Would you, as a user, bet your company on that level of service?

> > You *will* drop stuff in /dev/null.  Any queue entries you remove
> > are dropped in /dev/null.
>
> They're not removed or dropped in /dev/null.  I don't know where you
> pulled that out of your hat, but on our real-world mail systems we
> would generate a bounce message.

And send it to "<>", if it were a bounce for a DSN?

> > My recommendation would be to use an inode FS as a variable
> > granularity block store, and use that for storing messages.
>
> It must be nice to be in a place where you can afford the luxury of
> contemplating completely re-writing the filesystem code, or even the
> entire OS.

You mean the FreeBSD-chat mailing list?  8-) 8-).

That capability is one of the reasons people participate in the
FreeBSD project.

> Not my decision.  I wasn't given a choice.
[ ... ]
> I wish I could be.  Not my decision.  I wasn't given a choice.

So the cowboy tells his friend he'll be right back, and rides to town
to talk to the doctor.  The doctor is in the middle of a delicate
surgery, but pauses long enough to tell the cowboy that he'll have to
cut X's over the bite marks, and then suck the poison out.  The cowboy
rushes back to his friend and says "Bad news, Clem; Doc says you're
going to die!".  8-) 8-).

> > This is an artifact of using the new Sleepycat code.  You can
> > actually compile it to use the older code, which can be made to
> > not use mmap.
>
> As of what version is this still possible?  How far back do you have
> to go?  And are you sure that Cyrus would still work with that?

2.8.  It's not like OpenLDAP, which needs the transactioning
interfaces; it's pretty straightforward code.

> Certainly, when it comes to SAMS, all this stuff is pre-compiled and
> you don't get the option of building Berkeley DB in a different
> manner, etc....

Yes, you end up having to compile things yourself.

> > That's all to the good: by pushing it from 40 seconds to ~8 minutes,
> > you favor my argument that the operation is network bound.
>
> Indirectly, perhaps.  The real limitation is in the NFS
> implementation on the server, including how it handles synchronous
> meta-data updates.  A major secondary factor is the client NFS
> implementation.

If you have control over the clients, you can avoid making update
requests.  If you have no control over either, well, "Bad news, Clem".

> > Yes, it is.  If you read previous postings, I suggested that the
> > bastion SMTP server would forward the messages to the IMAP server
> > that will in the future serve them, in order to permit local
> > delivery.
>
> There will be a designated primary server for a given mailbox, but
> any of the other back-end servers could potentially also receive a
> request for delivery or access to the same mailbox.  Our hope is that
> 99% of all requests will go through the designated primary (for
> obvious performance reasons), but we cannot currently design the
> system so that *only* the designated back-end server is allowed to
> serve that particular mailbox.

Not unless you are willing to accept "hot fail over" as a strategy to
use in place of replication.  Though... you *could* allow any of the
replicas to accept and queue on behalf of the primary, but then
deliver only to the primary; presumably you'd be able to replace a
primary within 7 days.

-- Terry