Date: Sun, 23 Aug 1998 03:19:16 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: dyson@iquest.net
Cc: tlambert@primenet.com, sos@FreeBSD.ORG, croot@btp1da.phy.uni-bayreuth.de, regnauld@deepo.prosa.dk, current@FreeBSD.ORG, smp@FreeBSD.ORG
Subject: Re: softupdates and smp crash
Message-ID: <199808230319.UAA21616@usr04.primenet.com>
In-Reply-To: <199808220523.AAA19739@dyson.iquest.net> from "John S. Dyson" at Aug 22, 98 00:23:34 am

> > I'm also not convinced that this is the only possible cause of
> > the problem; the VM code is hardly "assert" protected everywhere,
> > so diagnosing this thing is not trivial.  Look at the VM fixes
> > I recently did, which killed the bugs Karl Denninger was seeing
> > in 75% of the cases, leaving 25% of the cases "clustered" (in his
> > words), indicating a separate problem, in addition to the ones I
> > fixed, in a periodically executing code path.  I had suspected
> > that this would be the case when I made the fix, since it doesn't
> > account for the buggy behaviour I'm personally seeing. 8-(.
> >
> I have to chime in here -- some of the "fixes" are work-arounds, and
> there are still underlying VM problems.  It might be "good enough"
> for 3.0, but I would suggest preparing for some rework to find the
> root cause for the problem.

Can you identify which of the "better fixes" are workarounds?

The two fixes I have done, and now have enough confidence in to want
them committed, are:

o	The "valid = 0 at wrong time" that you told me about.

o	The "setting the recorded size of a backing object to a page
	boundary instead of to the actual size".

You could argue that this second, which promiscuously sets the vnode
object size after instancing the object, is a workaround which should
be repaired by adding a "real_size" parameter to the allocator, but
the fact is that the setsize code path is not a problem at the only
time when it is called (i.e., it can't be called at interrupt level as
a result of a disk I/O completion interrupt); so the window I noted
has been analyzed, and is not there.  The code is ugly, but it does
the intended job, without side effects.

The other "fix", the "back up one", is indeed a kludge that happens
to work for some cases, but I would not want that one committed (I
explicitly posted that it should be tried as a diagnostic).
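
As a rough illustration of the second fix, the C sketch below contrasts
recording a page-rounded size with recording the actual size.  It is not
the FreeBSD code: the struct, the function names, and the 4096-byte page
size are all assumptions made up for the example, and it shows only the
arithmetic, not how the kernel handles vnode objects.

```c
#include <assert.h>

/* Hypothetical page size for illustration; the real value is machine-dependent. */
#define SKETCH_PAGE_SIZE 4096UL

/* Round a byte count up to the next page boundary. */
static unsigned long
sketch_round_page(unsigned long nbytes)
{
	return (nbytes + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Hypothetical stand-in for a backing object; not the kernel's structure. */
struct sketch_object {
	unsigned long recorded_size;	/* size recorded for the backing file */
};

/* The behaviour described as the bug: record the page-rounded size. */
static void
sketch_setsize_rounded(struct sketch_object *obj, unsigned long file_size)
{
	obj->recorded_size = sketch_round_page(file_size);
}

/* The behaviour described as the fix: record the actual size. */
static void
sketch_setsize_actual(struct sketch_object *obj, unsigned long file_size)
{
	obj->recorded_size = file_size;
}

/* Bytes by which the recorded size exceeds the actual file size. */
static unsigned long
sketch_excess_bytes(const struct sketch_object *obj, unsigned long file_size)
{
	/* Assert-style check, in the spirit of the diagnostic panics. */
	assert(obj->recorded_size >= file_size);
	return obj->recorded_size - file_size;
}
```

With these assumed definitions, a 5000-byte file gets a recorded size of
8192 under the rounded variant and 5000 under the actual-size variant.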

The only other changes packaged with the two real changes, above, are
panics in the diagnostic case, which are basically an "assert" that map
contents aren't being stomped on page insertion, and lock acquisition
logging that was arguably erroneously missing.  I haven't been able to
get anyone to run with the "DIAGNOSTIC" flag to test the first, or with
the "MAP_LOCK_DIAGNOSTIC" flag to test the second (but they run without
error here, where I can't trigger the failures at will).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.