From: Chuck Swiger <cswiger@mac.com>
Date: Tue, 06 Apr 2010 14:26:29 -0700
To: Maciej Jan Broniarz
Cc: freebsd-stable@freebsd.org
Subject: Re: fault tolerant web servers on freebsd

On Apr 5, 2010, at 2:10 PM, Maciej Jan Broniarz wrote:
> On 2010-04-05 22:43, jfarmer@goldsword.com wrote:
>> Quoting Maciej Jan Broniarz:
>> So first you have to define your workload, then define what errors
>> you must avoid or allow, and then define how to deal with failures,
>> errors, etc. Then you can start talking about High Availability vs.
>> level of Fault Tolerance, vs. ...
>
> Let's say I need to run a few PHP/SQL-based web sites and I would like
> to maintain an uptime of about 99.99% per month. No matter how good
> the hardware is, it will always fail at some point. My goal is to
> build a system that can maintain that uptime.

You're attempting to move from ~1 hour of downtime per month to ~1 hour
of downtime per year, or less than 5 minutes per month.

To begin with, you must implement adequate monitoring to detect,
notify, and track service outages (e.g., Nagios, Big Brother, or
commercial test services like SiteScope), and you need a 24/7/365 team
available to respond immediately to pages/email/etc. in order to
minimize outage duration. With 168 hours in a week divided by a nominal
40-hour workweek, that requires a team of 4.2 people, and it also
implies that the cost of downtime should exceed roughly $35K per hour
to justify keeping such a team available. (Four people can do it with
one working an extra 8-hour shift per week; however, at least local to
me, California state law mandates that people on call for pager duty be
paid hourly overtime if they are expected to respond to issues.)
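To make those numbers concrete, here is a quick back-of-the-envelope
calculation (a minimal Python sketch of the arithmetic above; the
30-day month is my own simplifying assumption):

    # downtime.py -- allowed downtime for a given availability target.
    # Back-of-the-envelope only; assumes a 30-day month.

    def downtime_budget(availability, period_hours):
        """Return the allowed downtime, in minutes, for one period."""
        return (1.0 - availability) * period_hours * 60

    for target in (0.999, 0.9999):
        per_month = downtime_budget(target, 30 * 24)
        per_year = downtime_budget(target, 365 * 24)
        print("%.2f%%: %5.1f min/month, %6.1f min/year"
              % (target * 100, per_month, per_year))

That prints roughly 43.2 min/month for 99.9% and 4.3 min/month for
99.99% -- a budget so small that automated failover, rather than a
human answering a page, is what actually buys you the last nine.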
> From what you say, I need some level of HA system to maintain the
> required uptime.
>
> So, as I've said earlier (correct me if I'm wrong), the setup could
> look something like this:
>
> - 2 web servers with CARP
> - 2 storage servers with an on-line sync mechanism running
> - 2 MySQL servers with on-line database replication
>
> (I'm skipping power and network issues at the moment.)

Do you already know what your causes of downtime have been? To my mind,
you must consider all parts of the system, gather data, and resolve the
problems with the greatest downtime cost in a cost-effective fashion.

The most common sources of machine failure are hard drives and power
supplies; setting up RAID-1 mirrors on all machines and getting
redundant power supplies on separate breakers is the minimal starting
point for avoiding the likely single points of failure within an
individual machine.

Beyond that, the suggestion to have at least two of every component of
the system is the right notion, but you need to include the glue which
implements the failover. That can be RFC 2391-style NAT to round-robin
requests onto multiple webservers, but a hardware load balancer
(ServerIrons, NetScalers, etc.) with aliveness or health checks will do
a better job. The better ones also support full redundancy, so you want
a pair of those -- or, if you prefer software-based load balancing,
perhaps a pair of router/firewall/NAT boxes using VRRP or similar for
the network connectivity. A toy example of such a health check follows
below.

Of course, all of this assumes that the software is more reliable than
the hardware it runs on. Good software can be, but software failures,
mistakes by admins, and the like just as commonly contribute a large
share of total downtime.
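As an illustration of the aliveness check that such glue performs, here
is a toy sketch in Python (hypothetical: the hostnames and the
TCP-connect-only test are my own assumptions; a real balancer would
fetch a known URL and verify the response):

    # healthcheck.py -- toy aliveness check of the kind an LSNAT box or
    # hardware load balancer runs against its pool of webservers.
    # The backend hostnames below are made up for illustration.
    import socket

    BACKENDS = ["www1.example.org", "www2.example.org"]
    PORT = 80

    def is_alive(host, timeout=2.0):
        """A backend counts as alive if it accepts a TCP connection."""
        try:
            sock = socket.create_connection((host, PORT), timeout)
            sock.close()
            return True
        except socket.error:
            return False

    def alive_backends():
        """Only hosts that pass the check stay in the rotation."""
        return [host for host in BACKENDS if is_alive(host)]

    if __name__ == "__main__":
        pool = alive_backends()
        if pool:
            print("in rotation: %s" % ", ".join(pool))
        else:
            print("all backends down -- page the on-call team")

The point is simply that a dead backend gets pulled out of the pool
before clients notice; whether that logic lives in a NetScaler, an
LSNAT router, or a CARP-backed software balancer is a cost decision.

Regards,
--
-Chuck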