Date:      Wed, 5 Nov 2008 02:16:25 -0800
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        Ian Smith <smithi@nimnet.asn.au>
Cc:        questions@FreeBSD.org
Subject:   Re: Apache environment variables - logical AND
Message-ID:  <20081105101625.GA6494@icarus.home.lan>
In-Reply-To: <20081105194002.N70117@sola.nimnet.asn.au>
References:  <20081105170631.O70117@sola.nimnet.asn.au> <20081105072752.GA4079@icarus.home.lan> <20081105194002.N70117@sola.nimnet.asn.au>

On Wed, Nov 05, 2008 at 08:24:16PM +1100, Ian Smith wrote:
> On Tue, 4 Nov 2008, Jeremy Chadwick wrote:
>  > On Wed, Nov 05, 2008 at 05:33:45PM +1100, Ian Smith wrote:
>  > > I know this isn't FreeBSD specific - but I am, so crave your indulgence.
>  > > 
>  > > Running Apache 1.3.27, using a fairly extensive access.conf to beat off 
>  > > the most rapacious robots and such, using mostly BrowserMatch[NoCase] 
>  > > and SetEnvIf to moderate access to several virtual hosts.  No problem.
>  > > 
>  > > OR conditions are of course straightforward:
>  > > 
>  > >   SetEnvIf <condition1> somevar
>  > >   SetEnvIf <condition2> somevar
>  > >   SetEnvIf <exception1> !somevar
>  > > 
>  > > What I can't figure out is how to set a variable3 if and only if both 
>  > > variable1 AND variable2 are set.  Eg:
>  > > 
>  > >   SetEnvIf Referer "^$" no_referer
>  > >   SetEnvIf User-Agent "^$" no_browser
>  > > 
>  > > I want the equivalent for this (invalid and totally fanciful) match: 
>  > > 
>  > >   SetEnvIf (no_browser AND no_referer) go_away
>  > 
>  > Sounds like a job for mod_rewrite.  The SetEnvIf stuff is such a hack.
> 
> It may be a hack, but I've found it an extremely useful one so far.
>
>  > This is what we use on our production servers (snipped to keep it
>  > short):
>  > 
>  > RewriteEngine on
>  > RewriteCond %{HTTP_REFERER} ^XXXX:                      [OR]
>  > RewriteCond %{HTTP_REFERER} ^http://forums.somethingawful.com/  [OR]
>  > RewriteCond %{HTTP_REFERER} ^http://forums.fark.com/    [OR]
>  > RewriteCond %{HTTP_USER_AGENT} ^Alexibot                [OR]
>  > RewriteCond %{HTTP_USER_AGENT} ^asterias                [OR]
>  > RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot             [OR]
>  > RewriteCond %{HTTP_USER_AGENT} ^Black.Hole              [NC,OR]
>  > RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE                [OR]
>  > RewriteCond %{HTTP_USER_AGENT} ^Xaldon.WebSpider
>  > RewriteRule ^.* - [F,L]
>  > 
>  > You need to keep something in mind however: blocking by user agent is
>  > basically worthless these days.  Most "leeching" tools now let you
>  > spoof the user agent to show up as Internet Explorer, essentially
>  > defeating the checks.
> 
> While that's true, I've found most of the more troublesome robots are 
> too proud of their 'brand' to spoof user agent, and those that do are a) 
> often consistent enough in their Remote_Addr to exclude by subnet and/or 
> b) often make obvious errors in spoofed User_Agent strings .. especially 
> those pretending to be some variant of MSIE :)

I haven't found this to be true at all, and I've been doing web hosting
since 1993.  In the past 2-3 years, the number of leeching tools that
spoof their User-Agent has increased dramatically.

But step back for a moment and look at it from a usability perspective,
because this is what really happens.

A user tries to leech a site you host, using FruitBatLeecher, which your
Apache server blocks based on User-Agent.  The user has no idea why the
leech program doesn't work.  Does the user simply give up his quest?
Absolutely not -- the user then goes and finds BobsBandwidthZilla which
pretends to be Internet Explorer, Firefox, or lynx, and downloads the
site.

Now, if you're trying to block robots/scrapers which aren't honouring
robots.txt, oh yes, that almost always works, because those rarely spoof
their User-Agent (I think to date I've only seen one site which did
that, and it was some Russian search engine).

If you feel I'm just doing burn-outs arguing, a la "BSD style", let me
give you some insight into how often I deal with this problem: daily.

We host a very specific/niche site that contains over 20 years of
technical information on the Famicom / Nintendo Entertainment System.
The site has hundreds of megabytes of information, and a very active
forum.  Some jackass comes along and decides "Wow, this has all the info
I want!" and fires off a leeching program against the entire
domain/vhost.  Let's say the program he's using is blocked by our
User-Agent blocks; there is a 6-7 minute delay as the user goes off to
find another program to leech with, installs it, and attempts it again.
Pow, it works, and we find nice huge spikes in our logs for the vhost
indicating someone got around it.  I later dig through our access_log and
find that he tried to use FruitBatLeecher, which got blocked, but then
6-7 minutes later came back with a leeching client that spoofs itself
as IE.

And it gets worse.

Many of these leeching programs get stuck in infinite loops when it
comes to forum software, so they sit there pounding on the webserver
indefinitely.  It requires administrator intervention to stop it; in my
case, I don't even bother with Apache ACLs, because ~70% of the time
the client ignores 403s and keeps bashing away (yes really!) -- I go
straight for a pf-based block in a table called <web-leechers>.  These
guys will hit that block for *days* -- that should give you some idea
how long they'll let that program run.

But it gets worse -- again.

Recently, I found two examples of very dedicated leechers.  One was an
individual out of China (or using Chinese IPs -- take your pick), and
another was at an Italian university.  These individuals got past the
User-Agent blocks, and I caught their leeching software stuck in a loop
on the site forum.  I blocked their IPs with pf, thinking it would be
enough, then went to sleep.  I woke up the following evening to find
they were back at it again.  How?

The Chinese individual literally got another IP somehow, in a completely
different netblock; possibly a DHCP release/renew, possibly some friend
of his, whatever.

The Italian university individual was successful in his leech attempts
exactly 50% of the time -- because their university used a transparent
HTTP proxy that was balanced between two IPs.  I had only blocked one
of them.

Starting to get the picture now?  :-)

The only effective way to deal with all of this is rate-limiting.  I do
not advocate per-IP "queues", "buckets", or "dynamic buckets", where
each IP is allocated X simultaneous sockets and gets rate-limited once
it exceeds that.  I also do not advocate "shared queues", where Z
amount of bandwidth is allowed while there are X sockets, but only Z/2
is allowed once X goes above, say, 200 sockets.

The tuning is simply not worth it -- people will go to great lengths
to screw you.  And if your stuff is in a 95th-percentile billing
environment, believe me, you DO NOT want to wake up one morning to
find that someone has cost you thousands of dollars.

Also, I recommend using ipfw dummynet or pf ALTQ for rate-limiting.  The
few Apache bandwidth-limiting modules I've tried have bizarre side
effects.  Here's a forum post of mine (on the above site) explaining
why we moved away from mod_cband and went with pf ALTQ.

http://nesdev.parodius.com/bbs/viewtopic.php?t=4184

>  > If you're that concerned about bandwidth (which is why a lot of people
>  > do the above), consider rate-limiting.  It's really, quite honestly, the
>  > only method that is fail-safe.
> 
> Thanks Jeremy.  Certainly time to take the time to have another look at 
> mod_rewrite, especially regarding redirection, alternative pages etc, 
> but I still tend to glaze over about halfway through all that section.

Yeah, I agree, the mod_rewrite documentation is overwhelming, and that
turns a lot of people off.  The examples I gave you should let you look
up each piece of the directives one at a time, and once you do that,
it'll all make sense.

> And unless I've completely missed it, your examples don't address my 
> question, being how to AND two or more conditions in a particular test?
>
> If I really can't do this with mod_setenvif I'll have to take that time.

You can't do it with mod_setenvif.  You can do it with mod_rewrite,
because consecutive RewriteCond lines are combined with an implicit
"AND" by default.  The [OR] you see in my rules explicitly overrides
that, since any one of those conditions should trigger the block.

Open the Apache 1.3 mod_rewrite docs and search for "implicit AND".
It'll all make sense then.  :-)
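
As a minimal sketch of the AND case from your original question (not
tested against your setup): two RewriteCond lines with no [OR] flag
must both match before the RewriteRule fires.  The variable name
go_away comes from your example; putting this in the server or virtual
host config rather than in .htaccess is an assumption on my part.

```apache
RewriteEngine on
# No [OR] flag here, so both conditions must match (implicit AND).
RewriteCond %{HTTP_REFERER}    ^$
RewriteCond %{HTTP_USER_AGENT} ^$
# "-" leaves the URL untouched; the E flag sets the go_away variable.
RewriteRule ^.* - [E=go_away:1]
```

If all you want is to refuse the request, use [F,L] as the flags, as in
my earlier rules, and skip the variable entirely.  Whether other
modules see go_away in time depends on where in request processing the
rule runs, so check that before relying on it for access control.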

I hope some of what I've said above gives you something to think about.
Hosting environments are a real pain in the ass; when it's "just you and
your own personal box" it's easy, but when it's larger scale and
involves users (customers or friends, doesn't matter), it's a totally
different game.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



