Date:      Fri, 14 Feb 2003 15:08:50 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Brad Knowles <brad.knowles@skynet.be>
Cc:        Rahul Siddharthan <rsidd@online.fr>, freebsd-chat@freebsd.org
Subject:   Re: Email push and pull (was Re: matthew dillon)
Message-ID:  <3E4D7702.8EE51F54@mindspring.com>
References:  <20030211032932.GA1253@papagena.rockefeller.edu> <a05200f2bba6e8fc03a0f@[10.0.1.2]> <3E498175.295FC389@mindspring.com> <a05200f37ba6f50bfc705@[10.0.1.2]> <3E49C2BC.F164F19A@mindspring.com> <a05200f43ba6fe1a9f4d8@[10.0.1.2]> <3E4A81A3.A8626F3D@mindspring.com> <a05200f4cba70710ad3f1@[10.0.1.2]> <3E4B11BA.A060AEFD@mindspring.com> <a05200f5bba7128081b43@[10.0.1.2]> <3E4BC32A.713AB0C4@mindspring.com> <a05200f07ba71ee8ee0b6@[10.0.1.2]> <3E4CB9A5.645EC9C@mindspring.com> <a05200f14ba72aae77b18@[10.0.1.2]>

Brad Knowles wrote:
>         You're still limited by disk devices that may be used temporarily
> on the local server, as well as the disk devices on the other end of
> that network connection.  Putting them on the network does not
> magically solve the problem that disk I/O is still many orders of
> magnitude slower than any other thing we ever do on computer systems.

I've got to say that on any mail server I've ever worked on, the
limitation on what it could handle was *never* disk I/O, unless
it was before trivial tuning had been done.  It was always network
I/O, bus bandwidth, or CPU.

As far as I'm concerned, for most applications, "disks are fast
enough": even an IDE disk doing thermal recalibration can keep
up with full frame rate digitized video for simultaneous record
and playback.  Five of those streams saturate 100Mbit Ethernet
in and out, and five disks is what you need for RAID anyway.

FWIW, since you really don't give a damn about timestamps on the
queue files in question, etc., you can relax the POSIX guarantees
on certain metadata updates.  Those guarantees were put there to
make the DEC VMS engineers happy by slowing down UNIX relative to
VMS, so that they would not file a lawsuit over POSIX being
required for government contracts.
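
For instance, here is a minimal sketch of what "turning that off"
can look like on FreeBSD, assuming a UFS queue partition on a
hypothetical device; it is the programmatic equivalent of
"mount -u -o noatime,async /var/spool/mqueue":

    /*
     * Hedged sketch: relax metadata guarantees on a queue FS by
     * disabling access-time updates and deferring metadata writes.
     * The device and mount point are assumptions for illustration.
     */
    #include <sys/param.h>
    #include <sys/mount.h>
    #include <ufs/ufs/ufsmount.h>   /* struct ufs_args */
    #include <err.h>
    #include <string.h>

    int
    main(void)
    {
            struct ufs_args args;

            memset(&args, 0, sizeof(args));
            args.fspec = "/dev/da0s1e";     /* hypothetical device */

            /*
             * MNT_UPDATE changes the flags on an already-mounted
             * FS; MNT_NOATIME stops access-time metadata writes;
             * MNT_ASYNC lets the FS defer the remaining metadata
             * I/O instead of doing it synchronously.
             */
            if (mount("ufs", "/var/spool/mqueue",
                MNT_UPDATE | MNT_NOATIME | MNT_ASYNC, &args) == -1)
                    err(1, "mount");
            return (0);
    }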


> >  Disagree.  These locking issues are an artifact of the system
> >  design (FS, application, or both).
> 
>         And you have magically solved all these problems in what way?

By writing an appropriate filesystem for the task at hand, and
by using the stacking proxy described in Heidemann's Master's
Thesis from the FICUS project, the source of FreeBSD stacking
vnode code.


> >  Simple answer: Don't use a metadata intensive storage mechanism.
> 
>         So, use what -- a pure memory-based file system for hundreds of
> gigabytes or even multiple terabytes of storage?  Even that will
> still have synchronous meta-data update issues with regards to the
> in-memory directory structure, even if those operations do take place
> much faster.

No, not "pure memory".  Survey all the metadata you update.  Then
survey all the metadata that you *ned* to update, and subtract the
one from the other, and turn the rest off.  Trivially, look at the
"noasync" mount option, or the "inode FS".


> >  In other words, the message takes up your disk space, no matter
> >  what.
> 
>         In other words, I can protect the entire system from being taken
> down by a concerted DOS attack on a single user.  They're going to
> have to work harder than that if they want to take down my entire
> system.

Like that's frigging hard.


> >>          SIS increases SPOFs, reduces reliability, increases complexity,
> >>  increases the probability of hot-spots and other forms of contention,
> >>  and all for very little possible benefit.
> >
> >  The only one of these I agree with is that it increases complexity.
> 
>         In what way does SIS *not* increase SPOFs, reduce reliability,
> increase the probability of hot-spots and other forms of contention,

Because those are not magically a consequence of increased complexity.
Complexity can be managed.

> and in what way does it magically solve all the storage problems of
> the system?

It doesn't solve *all* of them.  As I stated, you have to do
in-depth modification of the software involved.  Turning off
mailboxes and turning on maildirs in the software hardly
qualifies as "in depth".


> >  This discussion *started* because there was a set of list floods,
> >  and someone made a stupid remark about an important researcher
> >  indicating he was cancelling his subscription to the -hackers
> >  mailing list over it, and I pointed out to the person belittling
> >  the important researcher that such flooding has consequences that
> >  depend on the mail transport technology over and above "just having
> >  to delete a bunch of identical email".
> 
>         Okay, so let's say that you've got this magical SIS which solves
> all storage problems, and you let your users have unlimited disk
> space.  All it takes is someone applying trivial changes to the
> messages so that they are not all actually identical, and you're back
> to storing at least one copy of each.

And they are back to transmitting 1 copy each, and they lose their
amplification effect in any attack.
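
To make the mechanics concrete, here is a hedged sketch of one
way a single-instance store's delivery path could work:
fingerprint the body, store one object per unique body, and
hard-link the same inode into each recipient's directory.  The
spool layout, the FNV-1a hash, and the function names are
illustrative assumptions; a real store would use a cryptographic
digest and handle the create/link races.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Cheap 64-bit content fingerprint (FNV-1a), for illustration. */
    static uint64_t
    fnv1a(const unsigned char *p, size_t len)
    {
            uint64_t h = 14695981039346656037ULL;

            while (len-- > 0) {
                    h ^= *p++;
                    h *= 1099511628211ULL;
            }
            return (h);
    }

    /*
     * Deliver one message: write the body once under its
     * fingerprint, then hard-link it into the recipient's
     * directory.  N recipients cost N links and one copy of
     * the data.
     */
    int
    sis_deliver(const unsigned char *msg, size_t len,
        const char *rcpt_dir, const char *unique_name)
    {
            char obj[1024], dst[1024];
            FILE *fp;

            snprintf(obj, sizeof(obj), "/var/spool/sis/%016llx",
                (unsigned long long)fnv1a(msg, len));
            snprintf(dst, sizeof(dst), "%s/%s", rcpt_dir, unique_name);

            if (access(obj, F_OK) == -1) {  /* first copy seen */
                    if ((fp = fopen(obj, "w")) == NULL)
                            return (-1);
                    fwrite(msg, 1, len, fp);
                    fclose(fp);
            }
            return (link(obj, dst));
    }

    int
    main(void)
    {
            const char msg[] = "Subject: test\r\n\r\nhello\r\n";

            /* Paths are assumptions; directories must exist. */
            return (sis_deliver((const unsigned char *)msg,
                sizeof(msg) - 1, "/var/spool/sis/user", "msg.1"));
    }

A sender who perturbs every copy defeats the sharing, but, as
argued above, then gives up the amplification.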


>         Such transformations are typically found in message headers
> (message-ids are supposed to be unique, and combinations of date/time
> stamps and process ids will probably be unique, especially when taken
> over the entire message and the multiple hops it might have
> traversed).

Only if the messages came in on separate sessions, and they are back
to transmitting 1 copy each, and they lose their amplification effect
in any attack.

>         Such transformations are becoming much more typical with spam,
> where the recipient's name is part of the message body.

And they are back to transmitting 1 copy each, and they lose their
amplification effect in any attack.


>         So, you're right back where you started, and yet you've paid such
> a very high price.

It's a price you have to pay anyway.

What's the difference between a hard RT system and a soft RT system?
The major difference is that a hard RT system achieves bounded time
processing for kernel operations, and does so by supporting kernel
preemption, which requires function reentrancy.

What's the difference between a UP system and a single system image
shared memory SMP system?  The major difference is that a shared
memory SMP system supports kernel reentrancy, which requires function
reentrancy.

Solve 100% of one problem, and you've solved 90% of the other.

You have to solve 90% of the SIS problem anyway, you might as well
solve the remaining 10%.


> >  As far as "dealing with DOS", in for a penny, in for a pound: if
> >  you are willing to burn CPU cycles, then implement Sieve or some
> >  other technology to permit server-side filtering.
> 
>         We're doing that, too.  However, server-side filtering can only
> do so much.  Yes, it can eliminate duplicates that have the same
> message-id (although there is some risk that you'll eliminate unique
> messages that have colliding ids), and there is the possibility to
> program it so that it can actually inspect the content and eliminate
> additional messages that have the same message body fingerprint as
> previously seen.
> 
>         But even that can only go so far.  See above.

And it can do all the spam filtering that people keep saying
the user's mail client should do, because they think everyone
has broadband.  If a customer has broadband, and leaves the
polling interval at Outlook's default, then all of the problems
with server side storage, for both the customer and the provider,
simply move to the customer's machine instead.


> >  We also know that, for most DOS cases on maildrops, the user
> >  simply loses, and that's that.
> 
>         True enough.  But I don't have to throw out all of my users
> simply because just one of them was the target of a DOS.

You mean "simply because we, the provider, failed to protect them
from a DOS".


> >  The replication model is actually a pretty profound issue.  Prior
> >  to replication, if you connect to one of the replicas, the message
> >  can be seen as "in transit".  Post deletion on an original prior to
> >  the replication, and the deletion can be seen as "in transit".  The
> >  worst case failure modes are that a message has increased apparent
> >  delivery latency, or the message "comes back" after it's deleted.
> 
>         Yes, at another level, the particular replication model chosen
> will be important.  However, at this level what we really care about
> is the fact that the message/mailbox is replicated, and we don't
> really care how.

I think you still care how.  I think you care because create
event propagation has to be more reliable than delete event
propagation, because of the failure cases.



[ ... ]
>         So, when defining "recipient system", it makes perfect sense that
> this would be the point at which the mail is accumulated into some
> sort of a mailbox or queue and held on their behalf, regardless of
> whether that mailbox/queue is downloaded/retrieved with UUCP, POP3,
> IMAP4, or some other protocol.

By this definition, DJB is right, and the SMTP server that the
original sender contacts to send the mail is a "recipient system".
I don't buy this definition.

I think the problem here is that you think your customer is the
person who will own the mail server, while I'm thinking the
customer is the person for whom the mail is being transported.


> >  The majority of that latency is an artifact of the FS technology,
> >  not an artifact of the disk technology, except as it impacts the
> >  ability of the FS technology to be implemented without stall
> >  barriers (e.g. IDE write data transfers not permitting disconnect
> >  ruin your whole day).
> 
>         Again, I'd like to know where you get this magic filesystem
> technology that solves all disk I/O performance issues and makes them
> as fast as a RAM disk, while also being 100% perfectly safe.

At one point Matt Dillon was working on a system that did replication
into RAM on multiple nodes, and defined that as "stable storage",
since a system failure or two would not damage the ability to take
responsibility for final delivery; that's one potential implementation.

But the easiest implementation is to use an inode FS.


[ ... ]
>         Correct.  But with only ~1.3 recipients per message (on average),
> there isn't much duplication to be had anyway.  The whole replication
> issue is a different matter.

OK: at ~1.3 recipients per message, SIS stores one copy instead
of 1.3, i.e. roughly a 25% reduction in the metadata overhead
required, which is what you claim is the bottleneck.  That doesn't
look insignificant to me.


>         No, I don't panic "...at the idea of throwing it into the user
> mailbox...".  I have defined queueing & buffering mechanisms that
> function system-wide, which help me resist problems with even
> large-scale DOS attacks, and help ensure that all the rest of my
> customers continue to receive service even if a single user has an
> overflowing mailbox.

My argument would be that this should be handled out-of-band
via a feedback mechanism, rather than in-band by using an
EQUOTA as the feedback mechanism.

IMO, quotas are useful in IMAP4 servers, where the tendency is
to leave data on the server.  But the value in the quota applies
only to *old* mail, not to unread mail, or newly arrived mail.


>         But it's easier to solve this problem at the system-wide level
> where I can allocate relatively large buffers, as opposed to
> inflicting it on the end user and letting them try to deal with it
> across their slow dial-up line (or whatever).

You're going to do that to the user anyway.  Worse, you are going
to give them a mailbox full of DOS crap, and drop good messages
in the toilet.  You've taken responsibility for the delivery, so
the sender may not even have the messages any more; when you drop
them after the 4 days, they are screwed, and you are especially
screwed if the things you are dropping are DSN's from someone
*just like you*.



>         Bait not taken.  The customer is paying me to implement quotas.
> This is a basic requirement.

This is likely the source of the disconnect.  I view the person
whose mail I'm taking responsibility for as the customer.


>         Moreover, even if it wasn't a basic requirement, I'd go back to
> the customer and make sure that they understood that they're placing
> the entire mail system for all thousands of users at risk if there is
> a single mail loop or a large DOS attack on a single user, where I
> have better tools to constrain these issues at a system-wide level.

But you don't.  You are relying on the feedback from an EQUOTA.
Worse, the tools you are using don't turn a quota overage into
a protocol-level refusal, e.g. "451 recipient over quota", on
attempts to send the user messages.

Actually, that error would be incredibly telling: by returning it
to the remote system, you are blaming the user for being over quota,
when it's probably not the user who's at fault.

Instead, what happens is the messages pile up in your queue.
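
As a hedged sketch of what that protocol-level refusal could look
like at RCPT time (the quota lookup is a stub, and the reply text
is illustrative):

    #include <stdio.h>

    /* Stub: nonzero if the mailbox is over its quota. */
    static int
    over_quota(const char *user)
    {
            /* ... compare mailbox usage against its limit ... */
            return (0);
    }

    /* Called for each RCPT TO:<user> in the SMTP dialogue. */
    static void
    smtp_rcpt(FILE *out, const char *user)
    {
            if (over_quota(user)) {
                    /*
                     * 4xx tempfail: the *sender* keeps
                     * responsibility for the message and retries,
                     * instead of it piling up in the local queue.
                     */
                    fprintf(out, "451 4.2.2 <%s>: recipient over "
                        "quota, try again later\r\n", user);
                    return;
            }
            fprintf(out, "250 2.1.5 <%s>: recipient ok\r\n", user);
    }

    int
    main(void)
    {
            smtp_rcpt(stdout, "user@example.com");
            return (0);
    }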


>         If they still said that they didn't want quotas, then I'd let
> someone else build the system for them -- I wouldn't want my name on
> it.

You wouldn't implement an out-of-band mechanism instead?  You'd
insist on the in-band mechanism of an MDA error, after you've
already accepted responsibility for the message you aren't going
to be able to deliver?


>         I don't drop the stuff in /dev/null.  I just put some limits on
> things so that I've got brakes that will automatically kick in and
> start slowing the train down if there is an excessive overspeed
> problem for an excessive period of time.

You *will* drop stuff in /dev/null.  Any queue entries you remove
are dropped in /dev/null.  You've accepted responsibility for the
delivery.  In some cases, you'll be able to generate a bounce
message, but not for DSN's.  Basically, if you are talking to
someone who implements as you do, then information gets lost.

[ ... ]
>         Well, we're not talking about FreeBSD.  I wish we were.  However,

Probably ought to take the discussion off this mailing list, then.
;^).

> I can assure you that UFS+Logging definitely has synchronous
> meta-data update issues -- making them ordered or putting them into a
> commit log and doing them in larger chunks does not eliminate them.

Matt Dillon was working on this problem at one point in time;
he defined "committed to stable storage" as "replicated in RAM
on some number of hosts with fault tolerant features".  Even if
you lost one, you didn't lose the data.  That's one approach.

My recommendation would be to use an inode FS as a variable
granularity block store, and use that for storing messages.


>         However, there's nothing I can do about synchronous meta-data
> issues with the network & filesystem implementation of the NFS
> server, and any related problems with the NFS client.

Not if you constrain yourself to NFS, there isn't, I agree.

[ ... ]
> >  Maildir is a kludge around NFS locking.  Nothing more, and nothing
> >  less.
> 
>         Yup.  And I'm convinced that it introduces more problems than it
> solves.  But I still don't have much choice.

If you're convinced, then you should be doing something else.  8-(.


> >  MS Exchange does, and so does Lotus Notes.  I know they suck, but
> >  they are examples.
> 
>         They're not IMAP servers.  They are proprietary LAN e-mail
> systems that may happen to have an interface to this alien IMAP
> protocol.


They both have "IMAP connectors", actually.


> >  Who's using mmap?!?
> 
>         Cyrus.  All those databases it keeps to help inform it what the
> status is of the various messages, etc... are using mmap to access
> the information inside the database files.  Or are you not familiar
> with the method of operation of tools like Berkeley DB?

This is an artifact of using the new Sleepycat code.  You can
actually compile it to use the older code, which can be made to
not use mmap.

[ ... I, too, have fond/nightmarish memories of MMDF ... ]

>         SIMS and Netscape/iPlanet mail server are dead-end products.
> Scott McNealy was very unpleasantly surprised when the Sun Europe
> guys sprung SIMS on him, and it is definitely going the way of the
> dodo.  Note that Sun is a major investor in Sendmail, Inc. and they
> have on their payroll one of the key members of the Sendmail
> Consortium.

I like sendmail, and I like their people.  In general, though, I
would say that they are still looking for their commercial market,
so this is less impressive to me than it would be otherwise.


> >  40 seconds to transfer on a Gigabit ethernet... assuming you can get
> >  it off the disks.  8-).  Do you really expect them all simultaneously?
> 
>         Not a one of these machines has GigaBit Ethernet.  They all have
> 100Base-TX FastEthernet, and the front-end machines may also have a
> second 100Base-TX FastEthernet interface (if I can scrounge a couple
> of NICs).

That's all to the good: by pushing it from 40 seconds to ~8 minutes,
you favor my argument that the operation is network bound.


>         The big problem is that most of the users will also have
> 100Base-TX FastEthernet.  It won't take too many of them trying to
> access the server at once to completely swamp it.

That's a server stack implementation issue, if it's an issue for
you.  There are boxes you can buy or build to perform QoS that
will deal with that issue.


> >  You don't need to assert a lock over NFS, if the only machine doing
> >  the reading is the one doing the writing, and it asserts the lock
> >  locally (this was more talking about the Cyrus cache files, not
> >  maildir).
> 
>         This assumes that there is only one machine ever writing to a
> particular mailbox.  This is not a valid assumption.

Yes, it is.  If you read previous postings, I suggested that the
bastion SMTP server would forward the messages to the IMAP server
that will in the future serve them, in order to permit local
delivery.  It doesn't solve the replication issue, but it solves
your locking issue.  8-).
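
As a hedged sketch of "assert the lock locally", assuming the BSD
O_EXLOCK extension to open(2) and a hypothetical cache path: since
only this machine's processes ever touch the file, a local
flock(2)-style lock arbitrates among them without ever involving
the NFS lock manager.

    #include <sys/file.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    int
    main(void)
    {
            int fd;

            /*
             * O_EXLOCK takes an exclusive flock(2)-style lock
             * atomically with the open; the lock is asserted
             * locally even when the file itself lives on NFS.
             */
            fd = open("/var/imap/user/cyrus.cache", O_RDWR | O_EXLOCK);
            if (fd == -1)
                    err(1, "open");

            /* ... read or update the cache file ... */

            flock(fd, LOCK_UN);     /* also dropped on close */
            close(fd);
            return (0);
    }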

-- Terry
