From owner-freebsd-questions@FreeBSD.ORG Sat Nov 8 17:02:20 2008
Return-Path:
Delivered-To: questions@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A51F61065670;
	Sat, 8 Nov 2008 17:02:20 +0000 (UTC)
	(envelope-from smithi@nimnet.asn.au)
Received: from sola.nimnet.asn.au (paqi.nimnet.asn.au [220.233.188.227])
	by mx1.freebsd.org (Postfix) with ESMTP id BD5868FC13;
	Sat, 8 Nov 2008 17:02:19 +0000 (UTC)
	(envelope-from smithi@nimnet.asn.au)
Received: from localhost (localhost [127.0.0.1])
	by sola.nimnet.asn.au (8.14.2/8.14.2) with ESMTP id mA8H2HLP090770;
	Sun, 9 Nov 2008 04:02:17 +1100 (EST)
	(envelope-from smithi@nimnet.asn.au)
Date: Sun, 9 Nov 2008 04:02:17 +1100 (EST)
From: Ian Smith
To: Jeremy Chadwick
In-Reply-To: <20081105101625.GA6494@icarus.home.lan>
Message-ID: <20081109012957.R70117@sola.nimnet.asn.au>
References: <20081105170631.O70117@sola.nimnet.asn.au>
	<20081105072752.GA4079@icarus.home.lan>
	<20081105194002.N70117@sola.nimnet.asn.au>
	<20081105101625.GA6494@icarus.home.lan>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: questions@FreeBSD.org
Subject: [SOLVED] Apache environment variables - logical AND
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 08 Nov 2008 17:02:20 -0000

On Wed, 5 Nov 2008, Jeremy Chadwick wrote:
 > On Wed, Nov 05, 2008 at 08:24:16PM +1100, Ian Smith wrote:
 > > On Tue, 4 Nov 2008, Jeremy Chadwick wrote:
 > > > On Wed, Nov 05, 2008 at 05:33:45PM +1100, Ian Smith wrote:
 > > > > I know this isn't FreeBSD specific - but I am, so crave your
 > > > > indulgence.
 > > > >
 > > > > Running Apache 1.3.27, using a fairly extensive access.conf to
 > > > > beat off the most rapacious robots and such, using mostly
 > > > > BrowserMatch[NoCase] and SetEnvIf to moderate access to several
 > > > > virtual hosts.
 > > > > No problem.
 > > > >
 > > > > OR conditions are of course straightforward:
 > > > >
 > > > > SetEnvIf somevar
 > > > > SetEnvIf somevar
 > > > > SetEnvIf !somevar
 > > > >
 > > > > What I can't figure out is how to set a variable3 if and only if
 > > > > both variable1 AND variable2 are set. Eg:
 > > > >
 > > > > SetEnvIf Referer "^$" no_referer
 > > > > SetEnvIf User-Agent "^$" no_browser
 > > > >
 > > > > I want the equivalent for this (invalid and totally fanciful) match:
 > > > >
 > > > > SetEnvIf (no_browser AND no_referer) go_away
 > > >
 > > > Sounds like a job for mod_rewrite.  The SetEnvIf stuff is such a hack.

That's true.  Thanks for your considered and helpful tutorial.  I do use
ipfw+dummynet for bandwidth limiting, and ipfw table 80 to house bogons.

But I finally figured out how to make such a hack work .. it just kept
on bugging me until I woke up remembering some very basic logic; quite
embarrassing really ..

# 9/11/8: preset env vars to be tested by value
SetEnvIf Referer ".*" no_ref=0 no_bro=0 both=1
SetEnvIf Referer "^$" no_ref=1
SetEnvIf User-Agent "^$" no_bro=1
# duh, logic 101: a AND b = NOT ( (NOT a) OR (NOT b) )
SetEnvIf no_ref 0 both=0
SetEnvIf no_bro 0 both=0
SetEnvIf both 1 go_away

It's a bit roundabout and awkward but seems to work fine, and this was
just one example of several combination conditions I'd like to test.

cheers, Ian

 > > It may be a hack, but I've found it an extremely useful one so far.
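The same De Morgan trick extends to any number of ANDed conditions. A
sketch only, building on the block above -- the third test, matching
Remote_Addr against a made-up netblock, is purely illustrative:

```apacheconf
# Sketch: a AND b AND c = NOT( (NOT a) OR (NOT b) OR (NOT c) )
# The Remote_Addr condition is a made-up third test for illustration.
SetEnvIf Referer ".*" no_ref=0 no_bro=0 bad_net=0 all_three=1
SetEnvIf Referer "^$" no_ref=1
SetEnvIf User-Agent "^$" no_bro=1
SetEnvIf Remote_Addr "^10\.99\." bad_net=1
# any single failed condition clears the combined result:
SetEnvIf no_ref 0 all_three=0
SetEnvIf no_bro 0 all_three=0
SetEnvIf bad_net 0 all_three=0
SetEnvIf all_three 1 go_away
```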
 > >
 > > > This is what we use on our production servers (snipped to keep it
 > > > short):
 > > >
 > > > RewriteEngine on
 > > > RewriteCond %{HTTP_REFERER} ^XXXX: [OR]
 > > > RewriteCond %{HTTP_REFERER} ^http://forums.somethingawful.com/ [OR]
 > > > RewriteCond %{HTTP_REFERER} ^http://forums.fark.com/ [OR]
 > > > RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
 > > > RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
 > > > RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
 > > > RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [NC,OR]
 > > > RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
 > > > RewriteCond %{HTTP_USER_AGENT} ^Xaldon.WebSpider
 > > > RewriteRule ^.* - [F,L]
 > > >
 > > > You need to keep something in mind however: blocking by user agent
 > > > is basically worthless these days.  Most "leeching" tools now let
 > > > you spoof the user agent to show up as Internet Explorer,
 > > > essentially defeating the checks.
 > >
 > > While that's true, I've found most of the more troublesome robots are
 > > too proud of their 'brand' to spoof user agent, and those that do are
 > > a) often consistent enough in their Remote_Addr to exclude by subnet
 > > and/or b) often make obvious errors in spoofed User_Agent strings ..
 > > especially those pretending to be some variant of MSIE :)
 >
 > I haven't found this to be true at all, and I've been doing web hosting
 > since 1993.  In the past 2-3 years, the amount of leeching tools which
 > spoof their User-Agent has increased dramatically.
 >
 > But step back for a moment and look at it from a usability perspective,
 > because this is what really happens.
 >
 > A user tries to leech a site you host, using FruitBatLeecher, which your
 > Apache server blocks based on User-Agent.  The user has no idea why the
 > leech program doesn't work.  Does the user simply give up his quest?
 > Absolutely not -- the user then goes and finds BobsBandwidthZilla which
 > pretends to be Internet Explorer, Firefox, or lynx, and downloads the
 > site.
 >
 > Now, if you're trying to block robots/scrapers which aren't honouring
 > robots.txt, oh yes, that almost always works, because those rarely spoof
 > their User-Agent (I think to date I've only seen one site which did
 > that, and it was some Russian search engine).
 >
 > If you feel I'm just doing burn-outs arguing, a la "BSD style", let me
 > give you some insight into how often I deal with this problem: daily.
 >
 > We host a very specific/niche site that contains over 20 years of
 > technical information on the Famicom / Nintendo Entertainment System.
 > The site has hundreds of megabytes of information, and a very active
 > forum.  Some jackass comes along and decides "Wow, this has all the
 > info I want!" and fires off a leeching program against the entire
 > domain/vhost.  Let's say the program he's using is blocked by our
 > User-Agent blocks; there is a 6-7 minute delay as the user goes off to
 > find another program to leech with, installs it, and attempts it again.
 > Pow, it works, and we find nice huge spikes in our logs for the vhost
 > indicating someone got around it.  I later dig through our access_log
 > and find that he tried to use FruitBatLeecher, which got blocked, but
 > then 6-7 minutes later came back with a leeching client that spoofs
 > itself as IE.
 >
 > And it gets worse.
 >
 > Many of these leeching programs get stuck in infinite loops when it
 > comes to forum software, so they sit there pounding on the webserver
 > indefinitely.  It requires administrator intervention to stop it; in my
 > case, I don't even bother with Apache ACLs, because ~70% of the time
 > the client ignores 403s and keeps bashing away (yes really!) -- I go
 > straight for a pf-based block in a table called .  These guys will hit
 > that block for *days* -- that should give you some idea how long
 > they'll let that program run.
 >
 > But it gets worse -- again.
 >
 > Recently, I found two examples of very dedicated leechers.
 > One was an individual out of China (or using Chinese IPs -- take your
 > pick), and another was at an Italian university.  These individuals got
 > past the User-Agent blocks, and I caught their leeching software stuck
 > in a loop on the site forum.  I blocked their IPs with pf, thinking it
 > would be enough, then went to sleep.  I woke up the following evening
 > to find they were back at it again.  How?
 >
 > The Chinese individual literally got another IP somehow, in a
 > completely different netblock; possibly a DHCP release/renew, possibly
 > some friend of his, whatever.
 >
 > The Italian university individual was successful in his leech attempts
 > exactly 50% of the time -- because their university used a transparent
 > HTTP proxy that was balanced between two IPs.  I had only blocked one
 > of them.
 >
 > Starting to get the picture now?  :-)
 >
 > The only effective way to deal with all of this is rate-limiting.  I do
 > not advocate "queues" or "buckets", or "dynamic buckets" where each IP
 > is allocated X number of simultaneous sockets, and if they exceed that,
 > they get rate-limited.  I also do not advocate "shared queues", where
 > if there are X number of sockets, allow Z amount of bandwidth, but if
 > X is more than, say, 200 sockets, allow Z/2 amount of bandwidth.
 >
 > The tuning is simply not worth it -- people will go to great lengths
 > to screw you.  And if your stuff is in a 95th-percentile billing
 > environment, believe me, you DO NOT want to wake up one morning to
 > find that someone has cost you thousands of dollars.
 >
 > Also, I recommend using ipfw dummynet or pf ALTQ for rate-limiting.
 > The few Apache bandwidth-limiting modules I've tried have bizarre side
 > effects.  Here's a forum post of mine (on the above site) explaining
 > why we moved away from mod_cband and went with pf ALTQ.
 >
 > http://nesdev.parodius.com/bbs/viewtopic.php?t=4184
 >
 > > > If you're that concerned about bandwidth (which is why a lot of
 > > > people do the above), consider rate-limiting.
 > > > It's really, quite honestly, the only method that is fail-safe.
 > >
 > > Thanks Jeremy.  Certainly time to take the time to have another look
 > > at mod_rewrite, especially regarding redirection, alternative pages
 > > etc, but I still tend to glaze over about halfway through all that
 > > section.
 >
 > Yeah, I agree, the mod_rewrite documentation is overwhelming, and that
 > turns a lot of people off.  The examples I gave you should allow you to
 > look up each piece of the directive at a time, and once you do that,
 > it'll all make sense.
 >
 > > And unless I've completely missed it, your examples don't address my
 > > question, being how to AND two or more conditions in a particular
 > > test?
 > >
 > > If I really can't do this with mod_setenvif I'll have to take that
 > > time.
 >
 > You can't do it with mod_setenvif.  You can do it with mod_rewrite,
 > because all mod_rewrite rules default to an operator type of "AND".
 > The [OR] you see in my rules is an explicit override for obvious
 > reasons.
 >
 > Open the Apache 1.3 mod_rewrite docs and search for "implicit AND".
 > It'll all make sense then.  :-)
 >
 > I hope some of what I've said above gives you something to think about.
 > Hosting environments are a real pain in the ass; when it's "just you
 > and your own personal box" it's easy, but when it's larger scale and
 > involves users (customers or friends, doesn't matter), it's a totally
 > different game.
 >
 > --
 > | Jeremy Chadwick                                jdc at parodius.com |
 > | Parodius Networking                       http://www.parodius.com/ |
 > | UNIX Systems Administrator                  Mountain View, CA, USA |
 > | Making life hard for others since 1977.              PGP: 4BD6C0CB |
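For reference, the "implicit AND" approach described above can be
sketched for the original question -- the no_referer/no_browser test
from the start of the thread -- as a mod_rewrite rule under Apache 1.3:

```apacheconf
# Sketch: RewriteCond lines without an [OR] flag are implicitly ANDed,
# so this rule fires only when BOTH the Referer AND the User-Agent are
# empty -- the equivalent of the fanciful
# "SetEnvIf (no_browser AND no_referer) go_away".
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F,L]
```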