Date: Thu, 22 Sep 2022 13:40:08 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mark Johnston <markj@freebsd.org>, freebsd-current@freebsd.org
Subject: Re: A panic a day
Message-ID: <YyzIKJMEtUPs3a15@troutmask.apl.washington.edu>
In-Reply-To: <CAGudoHGhsj2__OFwSrm4=8_f0FKirRaj%2BjtfYEx5LMPLaJkMwQ@mail.gmail.com>
References: <YyyqDEPL3X3esFYl@troutmask.apl.washington.edu>
    <Yyyw5bnWO1y6veYl@nuc>
    <YyyyCh32i8LfGhqS@troutmask.apl.washington.edu>
    <CAGudoHGhsj2__OFwSrm4=8_f0FKirRaj%2BjtfYEx5LMPLaJkMwQ@mail.gmail.com>

On Thu, Sep 22, 2022 at 09:07:08PM +0200, Mateusz Guzik wrote:
> On 9/22/22, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
> > On Thu, Sep 22, 2022 at 03:00:53PM -0400, Mark Johnston wrote:
> >> On Thu, Sep 22, 2022 at 11:31:40AM -0700, Steve Kargl wrote:
> >> > All,
> >> >
> >> > I updated my kernel/world/all ports on Sept 19 2022.
> >> > Since then, I have had daily panics and hard lock-up
> >> > (no panic, keyboard, mouse, network, ...). The one
> >> > panic I did witness sent text scrolling off the screen.
> >> > There is no dump, or at least, I haven't figured out
> >> > a way to get a dump.
> >> >
> >> > Using ports/graphics/tesseract and then hand editing
> >> > the OCR result, the last visible portion is
> >> >
> >> >
> >
> > (panic messages removed).
> >
> >> It looks like you use the 4BSD scheduler? I think there's a bug in
> >> kick_other_cpu() in that it doesn't make sure that the remote CPU's
> >> curthread lock is held when modifying thread state. Because 4BSD has a
> >> global scheduler lock, this is often true in practice, but doesn't have
> >> to be.
> >
> > Yes, I use 4BSD. ULE has very poor performance for HPC type work with
> > OpenMPI.
> >
>
> Is there an easy way to set it up for testing purposes?
>

I reported this years ago. One instance is here

https://lists.freebsd.org/pipermail/freebsd-hackers/2008-October/026375.html

and I've tested ULE a few times since. An HPC program, compiled
with openmpi, can spawn multiple images. The gist of the problem
is that under ULE, if one gets into an over-subscribed situation
(e.g., N+1 images and only N cpus), then ULE's cpu affinity will
place two images on 1 cpu. Those images ping-pong. The other N-1
images run happily. An image that completes its task will then
wait on the ping-pong match before getting its next quantum of work.
Under 4BSD, the N+1 images simply run on the N cpus where each
gets a cpu slice.

Note, you don't need an openmpi program to get this situation.
Simply use a numerically intensive code that takes 5 or so minutes
to complete. Start N+1 jobs. You'll get 2 jobs competing for 1 CPU.

--
Steve
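
The reproduction described above (start N+1 CPU-bound jobs on an N-cpu
machine and watch how long each takes) might be scripted roughly as
follows. This is only a sketch and not something from the original
message: the names run_oversubscribed and BURN_SNIPPET, the busy loop
used as a stand-in for a numerically intensive code, and the default
work size are all assumptions. The work size would need tuning until a
single job runs for several minutes.

```python
import os
import subprocess
import sys
import time

# Hypothetical stand-in for a "numerically intensive code": a pure
# CPU-bound loop run in a separate interpreter process.
BURN_SNIPPET = (
    "import sys\n"
    "acc = 0\n"
    "for i in range(int(sys.argv[1])):\n"
    "    acc = (acc * 31 + i) % 1000003\n"
)


def run_oversubscribed(ncpu=None, extra=1, work_units=50_000_000):
    """Start ncpu + extra identical CPU-bound jobs and return a dict
    mapping job id to wall-clock seconds until that job finished."""
    ncpu = ncpu or os.cpu_count() or 1
    start = time.monotonic()
    procs = {
        job_id: subprocess.Popen(
            [sys.executable, "-c", BURN_SNIPPET, str(work_units)]
        )
        for job_id in range(ncpu + extra)
    }
    finished = {}
    # Poll until every job has exited, recording each completion time.
    while procs:
        for job_id, proc in list(procs.items()):
            if proc.poll() is not None:
                finished[job_id] = time.monotonic() - start
                del procs[job_id]
        time.sleep(0.05)
    return finished


if __name__ == "__main__":
    times = run_oversubscribed()
    for job_id in sorted(times):
        print(f"job {job_id}: {times[job_id]:.1f} s")
```

Comparing the per-job times from a run on a ULE kernel with a run on a
4BSD kernel would show whether two of the jobs lag well behind the
others, which is the symptom reported above.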