Date: Sat, 07 May 2011 12:43:16 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Doug Barton <dougb@dougbarton.us>
Cc: freebsd-current@FreeBSD.org
Subject: Re: My problems with stability on -current
Message-ID: <4DC51434.3000501@FreeBSD.org>
In-Reply-To: <4DC50804.6000809@dougbarton.us>
References: <4DC25396.1070909@dougbarton.us> <4DC30EC5.3090703@FreeBSD.org> <4DC50804.6000809@dougbarton.us>

Doug Barton wrote:
> On 05/05/2011 13:55, Alexander Motin wrote:
>> I see several possibly unrelated problems there:
>> - crashes are always crashes. They should be debugged.
>> - calcru going backwards could have the same roots as lost wall clock
>> time.
>
> I think you're right about that. What usually happens when the load
> maxes out is that the system visibly freezes for a minute or 2, and when
> it comes back to life the log is flooded with calcru messages. If it
> stays up long enough after that the wall clock drift becomes noticeable.
> This is in spite of running ntpd.

These system freezes are very suspicious. Most timecounters need only a
few seconds to overflow, some even less, so a freeze of a few minutes
will easily overflow most of them. The freezes are therefore probably
the cause of the time problems, and the question now is what is causing
the freezes. You should try to investigate what is going on during
them. Does the system do anything? Are any interrupts still being
delivered (compare `vmstat -i` just before and after)? Are there any
interrupt storms, etc.?

>> If there are some problems with timer interrupts, timecounters
>> could wrap unnoticed that will cause random time jumps.
>> - interactivity problems. I can't prove it is unrelated, but have no
>> real ideas now.
>>
>> I would start from most obvious problems. I need to know more about
>> crashes. As usual: how to trigger, stack backtraces, etc.
>
> Triggering is easy, I can start a buildworld with -j2, and a build of
> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
> will reboot. I posted a panic message relative to r220282, (-current
> archives, 4/4) but kib said it didn't make any sense. Usually I don't
> get a panic at all.

Could you point me to the thread?

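To put the overflow remark above into rough numbers, here is a small sketch of the arithmetic. The counter widths and frequencies are nominal, assumed values for common hardware, not measurements from the machine in this thread; `sysctl kern.timecounter` shows the real ones.

```python
# Rough wrap-around times for a few free-running hardware counters.
# Widths and frequencies below are nominal, ASSUMED values; they are
# not taken from the system discussed in this thread.

def wrap_seconds(bits: int, freq_hz: float) -> float:
    """Seconds until a free-running counter of `bits` width wraps."""
    return (2 ** bits) / freq_hz

# name -> (counter width in bits, frequency in Hz)
COUNTERS = {
    "16-bit counter at 1.193182 MHz (i8254-style)": (16, 1_193_182),
    "ACPI PM timer, 24-bit at 3.579545 MHz": (24, 3_579_545),
    "HPET, 32-bit at 14.31818 MHz": (32, 14_318_180),
}

if __name__ == "__main__":
    for name, (bits, freq) in COUNTERS.items():
        print(f"{name}: wraps after {wrap_seconds(bits, freq):.2f} s")
```

If nothing updates the timecounter state for longer than the wrap period, the elapsed time becomes ambiguous, which is one way a freeze of a minute or two could turn into lost wall clock time.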
>> What's about time problems, I would try to collect more data:
>> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
>> dmesg outputs;
>
> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
>
>> - what eventtimer is used now and does it helps to switch to another
>> one with kern.eventtimer.timer sysctl?
>
> When I was trying to track down the problems last summer I vaguely
> remember trying RTC, but eventually we realized that the real problem
> was throttling, so I stopped specifying RTC and let it go back to the
> default. What do you suggest I try?

As far as I can see, you are now using HPET (chosen automatically). I
would try switching to the LAPIC. Just make sure to disable C-states,
if you have enabled them, so that the LAPIC timer won't stop.

>> - does the timer runs in periodic or one-shot mode and does it helps to
>> switch to another one?
>
> How could I tell, and how would I switch?

Use `sysctl kern.eventtimer.periodic`. And please read eventtimers(4).

>> - if full CPU load makes time to stop, try to track what is going on
>> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
>> CPU load in one-shot mode you should have stable timer interrupt rate
>> about hz+stathz.
>
> Ok, I'll do that tomorrow, tired now.
>
>> - if timer interrupts are not working well, you can build kernel with
>> options KTR
>> options ALQ
>> options KTR_ALQ
>> options KTR_COMPILE=(KTR_SPARE2)
>> options KTR_ENTRIES=131072
>> options KTR_MASK=(KTR_SPARE2)
>> to track event timers operation and use ktrdump to save the trace when
>> problem exist (preferably when it begins).
>>
>> And let's experiment with fresh CURRENT.
>
> Done and done. I'm up to r221566, and I added those options to my kernel
> config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
> http://people.freebsd.org/~dougb/ktrdumpfile.txt This was shortly after
> boot, with no load. Not sure if it helps, but there you go.

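The `vmstat -i` comparison suggested above can be done by hand, but a small script makes the per-source rates easier to read. This is only a sketch: it assumes the usual three-column `interrupt / total / rate` layout, and the sample snapshots, interrupt names, and the `hz`/`stathz` values are made up for illustration.

```python
# Compare two `vmstat -i` snapshots taken a known number of seconds apart.
# The sample text is HYPOTHETICAL; it is not output from the machine
# discussed in this thread.

def parse_vmstat_i(text: str) -> dict:
    """Map each interrupt source name to its cumulative total."""
    totals = {}
    for line in text.strip().splitlines():
        parts = line.rsplit(None, 2)  # name, total, rate
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # header line or something unexpected
        totals[parts[0].strip()] = int(parts[1])
    return totals

def rates(before: str, after: str, seconds: float) -> dict:
    """Interrupts per second for every source present in `after`."""
    b, a = parse_vmstat_i(before), parse_vmstat_i(after)
    return {name: (a[name] - b.get(name, 0)) / seconds for name in a}

BEFORE = """\
interrupt                          total       rate
irq1: atkbd0                         500          0
irq20: hpet0                     1000000       1100
Total                            1000500       1100
"""

AFTER = """\
interrupt                          total       rate
irq1: atkbd0                         510          0
irq20: hpet0                     1011270       1100
Total                            1011780       1100
"""

if __name__ == "__main__":
    hz, stathz = 1000, 127  # assumed values; check kern.clockrate
    for name, per_sec in rates(BEFORE, AFTER, 10.0).items():
        print(f"{name}: {per_sec:.1f}/s")
    print(f"expected timer rate under load: about {hz + stathz}/s")
```

A timer source whose rate drops to zero across a freeze, or jumps far above hz+stathz, would be the thing to report.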
The dump looks fine, but I need a dump taken specifically at the time
of the problem. Since time probably can't be trusted here, it would be
nice to make the dump as localized as possible: clear the buffer with
`sysctl debug.ktr.clear=1`, trigger a freeze for a few seconds, stop
collecting with `sysctl debug.ktr.mask=0`, and do the dump.

-- 
Alexander Motin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4DC51434.3000501>