From: Terry Lambert <tlambert2@mindspring.com>
Date: Fri, 18 Apr 2003 18:28:39 -0700
To: Marko Zec
Cc: David Schultz, freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject: Re: PATCH: Forcible delaying of UFS (soft)updates

Marko Zec wrote:
> > The main technical (not philosophical) problem with the patch
> > as it sits is that you can cause the soft updates wheel to wrap
> > around.
>
> No, that just cannot happen.  You are probably confusing rushjob
> with syncer_delayno, which gets reset to 0 each time it reaches
> the value of syncer_maxdelay.  The rushjob variable simply tells
> the syncer how many times it should iterate _sequentially_ through
> the softupdates queues before going to sleep on lbolt.

Obviously I am not explaining myself correctly.  I guess the next
step would be to offer my own patch set for doing what you are
trying to do.  Before I do that, let me try one more time.

I think it is important that the value of syncer_delayno continue
to be incremented once a second, and that the modified sched_sync(),
which with your patch no longer does this, use its own counter.

In other words, I think that you need to implement a two-handed
clock algorithm, to keep the buckets from getting too deep with
work items, and in case there is some dependency which is not being
accounted for, but which has been working only because there is an
implicit delay of one second or more in vn_syncer_add_to_worklist()
calls (your patch would break this, so without taking it into
account, we would have to retest all of soft updates).
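As an illustration of the two-handed clock idea, here is a minimal
userspace sketch (this is not the actual syncer code; the names
WHEEL_SIZE, insert_hand, flush_hand, add_to_worklist and so on are
made up for the example).  The insertion hand keeps advancing once
per simulated second even while flushing is being delayed, so
relative offsets stay meaningful, and the flush hand trails behind
and drains the buckets when it finally runs:

/*
 * Illustrative userspace sketch of a two-handed clock over a
 * syncer-style work wheel.  Not FreeBSD kernel code; all names are
 * hypothetical.
 */
#include <stdio.h>

#define WHEEL_SIZE	128		/* must be a power of 2 */
#define WHEEL_MASK	(WHEEL_SIZE - 1)

static int wheel[WHEEL_SIZE];		/* pending work items per bucket */
static unsigned insert_hand;		/* advances once per "second" */
static unsigned flush_hand;		/* trails behind; drains buckets */

/* Enqueue an item "delay" seconds in the future, relative to insert_hand. */
static void
add_to_worklist(unsigned delay)
{
	if (delay > WHEEL_SIZE / 2)	/* bound the relative offset */
		delay = WHEEL_SIZE / 2;
	wheel[(insert_hand + delay) & WHEEL_MASK]++;
}

/* Called once per "second", even while flushing is being delayed. */
static void
tick(void)
{
	insert_hand = (insert_hand + 1) & WHEEL_MASK;
}

/* Drain every bucket between the flush hand and the insert hand. */
static void
flush_to_hand(void)
{
	while (flush_hand != insert_hand) {
		if (wheel[flush_hand] != 0)
			printf("bucket %3u: flushing %d items\n",
			    flush_hand, wheel[flush_hand]);
		wheel[flush_hand] = 0;
		flush_hand = (flush_hand + 1) & WHEEL_MASK;
	}
}

int
main(void)
{
	/* Simulate 30 seconds of delayed syncing, then catch up. */
	for (int sec = 0; sec < 30; sec++) {
		add_to_worklist(sec % 8);	/* varying future offsets */
		tick();				/* insert hand never stalls */
	}
	flush_to_hand();
	return (0);
}

Because the insert hand never stalls, items enqueued during the delay
spread across many buckets instead of piling into one.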
> > Then when you write things out, they write out of order.
>
> Uhh.. NO!

Uh, yes; potentially they do.  See the implicit dependency situation
described above.  There are other cases, too, but they are much more
complicated.  I wish Kirk would speak up in more technical detail
about the problems you are potentially introducing; they require a
deep understanding of the soft updates code.

> > The purpose of the wheel is to allow placing of operations at
> > some relative offset in the future to an outstanding operation,
> > to ensure ordering.
>
> True.  And this has not changed with my patch.

No, it has changed.  It has changed in the depth of the queue
entries, in the relative spacing of implicit dependencies, and in the
relative depth of two or more dependent operations with future
offsets.

In the depth case, when the code finally runs, it is going to stall
the system for a relatively long time, because a number of worklists
will be *substantially* deep: vn_syncer_add_to_worklist() was using a
syncer_delayno that it assumed was being incremented once a second,
but which never changed during your stall.  This means that the
worklist represented by syncer_workitem_pending[syncer_delayno] is
going to contain *almost all work* that was enqueued in the interim.

The problem with this is that in the for(;;) loop in sched_sync(), in
the "if (LIST_FIRST(slp) == vp)" code block, you are likely to run
yourself into a panic.  See the comment about "sync_fsync() moves it
to a different slot so we are safe"?  That comment is no longer true.

> > No matter what else you do, you can not allow the wheel to
> > "wrap".  Because the offsets are "future relative", that means
> > that you have to flush at some number of wheel entries equal
> > to:
> >
> >	wrap_boundary - the_largest_potential_future_offset - 1.
> >
> > Making the wheel bigger is probably acceptable, but then you
> > will exacerbate the memory problem that rushjob was invented
> > to resolve (please do a "cvs log" and look at the checkin
> > comments; I still believe it was "dillon" who made the change).
>
> Where from did you get the idea I'm making the wheel bigger?  The
> size of the softupdates "wheel" is determined by the value of
> syncer_maxdelay, which not only I haven't touched at all, but is
> also completely unrelated to the rushjob variable.

I didn't get the idea you were making the wheel bigger.  That's the
problem: you probably need to make the wheel bigger, so that the
[new!] second hand on the two-handed clock has more time until it
runs into the first hand.  You will have to do this so you can bound
the vn_syncer_add_to_worklist() add delay to something less than
"syncer_maxdelay - 2"; I suggest "syncer_maxdelay / 2" as a first
approximation (remember this needs to be a power of 2, due to
syncer_mask).

You also want to count workitem insertions and removals, so you have
a total count.  This is easy: it's already protected by sync_mtx, so
all you need is a static global counter.  When the counter gets to a
certain size (configurable), you have too much memory tied up in the
work queue -- so you flush it.
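To make that bookkeeping concrete, here is a minimal userspace sketch
of such a counter.  The names (worklist_count, worklist_threshold,
work_mtx) are made up for the example; in the kernel the count would
simply live under sync_mtx next to the existing worklist
manipulation:

/*
 * Illustrative sketch of "count insertions and removals, flush past
 * a configurable threshold".  A pthread mutex stands in for sync_mtx.
 */
#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>

static pthread_mutex_t work_mtx = PTHREAD_MUTEX_INITIALIZER;
static long worklist_count;		/* items currently on the wheel */
static long worklist_threshold = 4096;	/* configurable (sysctl-style) limit */

/* Returns true if the caller should kick off an early flush. */
static bool
worklist_insert_one(void)
{
	bool need_flush;

	pthread_mutex_lock(&work_mtx);
	worklist_count++;
	need_flush = (worklist_count >= worklist_threshold);
	pthread_mutex_unlock(&work_mtx);
	return (need_flush);
}

static void
worklist_remove_one(void)
{
	pthread_mutex_lock(&work_mtx);
	worklist_count--;
	pthread_mutex_unlock(&work_mtx);
}

int
main(void)
{
	worklist_threshold = 3;		/* tiny limit, just for illustration */
	for (int i = 0; i < 5; i++) {
		if (worklist_insert_one())
			printf("insert %d: over threshold, flush now\n", i);
	}
	worklist_remove_one();
	return (0);
}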
> If it is of any relevance for this discussion, I want to add that
> I've been running my system with extended delaying all the time for
> the last two weeks (even when on AC power).  I have had absolutely
> no problems nor have I lost a single bit of data, even during the
> most stressful tests such as untarring huge archives, or making the
> kernel etc.  Not to mention this is also my primary and
> "production" machine, with all my e-mail on it etc.

Write some code that specifically stresses a specific FS dependency
on a set of files, iteratively, over and over again.  Then close all
the files, call "sync", and wait.

Or run your test, then unmount the FS on which the test was running
before your delayed fsync gets a chance to run, and then do a
shutdown.  When the system comes back up, check the data to see if
it's what it's supposed to be.

Basically, you are going to have to provide something *other than*
"rushjob" to allow unmounts and other "special" code to force the
fsync (consider removable media, like flash, if nothing else).

-- Terry
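As an illustration of that last point (a forced drain that unmount or
removable-media handling could call, independent of rushjob), here is
a minimal userspace sketch.  The names (syncer_force_drain,
flush_bucket, delay_writes) are made up for the example; a real
change would have to drain the wheel under sync_mtx before the
unmount proceeds:

/*
 * Illustrative sketch of a forced drain, distinct from the rushjob
 * "speed the syncer up a little" mechanism.  Not kernel code.
 */
#include <stdio.h>
#include <stdbool.h>

#define WHEEL_SIZE	128

static int  wheel[WHEEL_SIZE];		/* pending items per bucket (stand-in) */
static bool delay_writes = true;	/* the patch's "delay updates" switch */

/* Stand-in: in the kernel this would push one bucket's worklist to disk. */
static void
flush_bucket(unsigned bucket)
{
	if (wheel[bucket] != 0)
		printf("bucket %u: flushing %d items\n", bucket, wheel[bucket]);
	wheel[bucket] = 0;
}

/*
 * Forced drain, independent of rushjob: stop delaying and push every
 * bucket out.  Something like this would be called from unmount or
 * before ejecting removable media.
 */
static void
syncer_force_drain(void)
{
	delay_writes = false;
	for (unsigned bucket = 0; bucket < WHEEL_SIZE; bucket++)
		flush_bucket(bucket);
}

int
main(void)
{
	wheel[3] = 7;			/* pretend delayed work is queued */
	wheel[42] = 1;
	if (delay_writes)		/* writes are normally held back... */
		syncer_force_drain();	/* ...but unmount forces them out */
	return (0);
}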