Date: Wed, 04 Mar 1998 15:58:30 -0800 (PST)
From: Simon Shapiro <shimon@simon-shapiro.org>
To: sbabkin@dcn.att.com
Cc: wilko@yedi.iaf.nl, tlambert@primenet.com, jdn@acp.qiv.com, blkirk@float.eli.net, hackers@FreeBSD.ORG, grog@lemis.com, karl@mcs.net
Subject: RE: SCSI Bus redundancy...
Message-ID: <XFMail.980304155830.shimon@simon-shapiro.org>
In-Reply-To: <C50B6FBA632FD111AF0F0000C0AD71EE4132D6@dcn71.dcn.att.com>
On 04-Mar-98 sbabkin@dcn.att.com wrote:

...
>> What you describe here is application-level mirroring. It works after
> Yes, with the difference that the second copy may be located
> on a machine in another building connected by something like FDDI,
> so you are protected against things like a fire in the computer room.
> And it is not quite a mirror; they are out of sync by something
> like 10 minutes all the time. Of course, the primary system should have
> all the hardware mirroring and like things (or may not; in my exact
> case it was not done, for political reasons. Personally I would
> prefer having mirroring, even instead of this scheme, but there
> were political reasons), so you can lose these
> 10 minutes of operational data only if you have the primary system
> significantly destroyed.

I have a better solution, which was implemented here at my work (and
I'll repeat it if my employer and the guy that wrote it do not
contribute it). This is NOT my original idea, but so old I do not
remember who did it first.

You start with two identical databases. You modify the (Postgres)
libpq, or (Oracle) SQL*Net interface, to intercept all data-modifying
SQL statements (you do not care about SELECT and such). You cache
those until you see a COMMIT. If you see a ROLLBACK, you discard all
that cache. When you capture the SQL statements, you stamp each with a
high-precision timestamp (it does not have to be accurate, but it has
to be precise). When you see a COMMIT, you ship the whole thing to a
remote machine. The remote machine can simply log these, or apply them
against a reference database. If you just log them, you sort them by
the timestamp before you apply them.

The quality of the resultant database is surprisingly good, especially
for an OLTP system. The advantages are obvious.

...
> They can not go far out of sync if everything is working. One of
> them is master, it generates the database archive logs during the
> operation and these logs get applied to the secondary database.
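(As an aside, the interception scheme I describe above can be sketched
in a few lines of Python. This is purely illustrative; the real thing
patches libpq or SQL*Net, and every name below is made up:)

```python
import time

class ReplicatingInterceptor:
    """Buffers data-modifying SQL until COMMIT, then ships the
    timestamped batch to a remote machine (here, any callable)."""

    DML = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, ship):
        self.ship = ship      # transport to the remote logger
        self.pending = []     # statements of the open transaction

    def execute(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.DML:
            # Stamp need only be precise (consistent ordering),
            # not accurate wall-clock time.
            self.pending.append((time.time(), sql))
        elif verb == "COMMIT":
            self.ship(self.pending)   # ship the whole transaction
            self.pending = []
        elif verb == "ROLLBACK":
            self.pending = []         # discard the cached statements
        # SELECT and everything else pass through untouched

def apply_log(batches):
    """Remote side: merge logged batches, replay in timestamp order."""
    stmts = sorted(s for batch in batches for s in batch)
    return [sql for _, sql in stmts]

# Usage sketch:
log = []
ic = ReplicatingInterceptor(log.append)
ic.execute("INSERT INTO t VALUES (1)")
ic.execute("COMMIT")
ic.execute("UPDATE t SET x = 2")
ic.execute("ROLLBACK")        # this update never reaches the replica
print(apply_log(log))         # prints ['INSERT INTO t VALUES (1)']
```

Only committed work crosses the wire, which is why the replica trails
the primary but never sees a half-finished transaction.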
> They are all the time out of sync by the time necessary to
> generate, transfer and apply these logs, but it can't become worse.

This will only work if you can switch the database clients to the
alternative system. Otherwise you will have a long interruption of
service; something your employer does not routinely like. I normally
classify these schemes as part of a disaster recovery plan, not
routine operation. In my terminology, backup is part of routine
operation. Truly hot databases cannot be routinely backed up, nor
restored, without unacceptable disruption of service. Your scheme,
which is good for disaster recovery, is not acceptable for a hot,
non-stop operation, unless modified as indicated above.

...
> It does not try to sync. It is just an auxiliary backup system. If
> your primary system goes completely down, you can start
> the secondary system in 10 minutes as primary. Yes, you will lose
> something like the last 0...10 minutes of operation. But you will
> still be able to provide service.

The last time AT&T lost service for 10 minutes, it ended up on TV.
Besides, if you promise 1 minute and demonstrate 10 minutes, in a real
disaster it will be 4 hours. What is the revenue loss, per 5ESS
switch, for a 10-minute loss of service? What is the contractual
obligation for downtime? One of the readers of this list (works for
Sprint, I think) reminded me of the 5 minutes/year, or something on
that scale.

...
> Agreed. For these cases Oracle has an option named online backup.
> You tell the RDBMS that you are going to do backup, after that
> copy the database files (the RDBMS is still running, only performance
> is degraded due to competition for disks). Later you can apply
> the archived logs to this image and get a working database.

I know of that option. I also had to listen to a Telco customer who
detailed, in public, how this feature takes 18 hours to bring the
database back up after a software-induced crash, using exactly this
mechanism. It sounds good in a brochure.
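(For anyone who has not seen it, the mechanism being discussed is
roughly the following sequence; this is a sketch from memory, and the
tablespace name is just an example:)

```sql
-- Tell the RDBMS a backup is starting; datafiles stay writable.
ALTER TABLESPACE users BEGIN BACKUP;

-- ...copy the datafiles with cp/dd while the instance keeps running...

ALTER TABLESPACE users END BACKUP;

-- Later, to turn the copied image back into a working database,
-- replay the archived logs against it:
RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL;
```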
Not worth a damn for non-stop operations in real life. Besides, Oracle
does not support a FreeBSD port, costs a yearly salary per copy, and
does not provide source yet.

...
> Yes, I had it too. But don't forget that booting changes
> files, at least logs, utmp/wtmp, pipes, etc. If you just
> mount some filesystem and don't touch it after this, it
> can not get corrupted.

If you say so. Are you willing to bet your salary, career or life on
that statement? In FreeBSD, I have lost /usr/src twice, and /usr/local
three times, in the last three or six months. (Each of these is on a
separate F/S, of course.) None of them is modifiable by the boot
process (other than the clean-umount bit), but I lost them all the
same. Once it was attributed to a bug/glitch in the
fdisk/disklabel/partitions/slices logic; the other times, I have no
clue. I did not bitch about it as it is under -current, which has no
warranty, etc. I had similar losses under other O/Ss and versions.

Finally, even if you were totally right (which I do not think you
are), no technical executive will allow a critical database on a Unix
filesystem. Databases get corrupted all the time, on and off Unix
filesystems. But to allow mission-critical databases on Unix
filesystems is profane to these people. Reality notwithstanding.

...
> Don't know about ccf, never saw it. But if I create the database,
> all the blocks are allocated during the creation and later the
> file sizes never change (and no, they don't have gaps inside).
> And as far as I know, if I write to some block in a file that is
> already allocated, the data will go to this block and it will never
> be reallocated by the filesystem. So you do not have any blocks
> allocated or deallocated during normal operation and the filesystem
> can not get corrupted.

The reason you see these nice solid files is that ccf (which used to
be a stand-alone utility up to Oracle version 4.1.3) is now part of
the program which creates the database.
It still does the same exact thing: it goes and writes every byte in
the Unix file. If you have a Unix filesystem with three or four files
that were totally pre-written, and nothing else, and then go through
the gyrations the Oracle OSD does to circumvent the caching and such
of the filesystem, you are in effect on a raw device. The only
difference is that you are executing 19,438 lines of ufs code, plus
who knows how many lines of VFS, FFS, whatever, in addition to the
code required to run to the device itself. How that can be faster, or
more reliable, than not running that code, I do not grasp. Remember:
not executing logic is more reliable and faster than executing it.
The contents of that logic is immaterial.

Simon

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message