Date: Wed, 14 Jun 2017 16:08:02 -0700
From: Mark Millard <markmi@dsl-only.net>
To: andreast@FreeBSD.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r319722 - in head: sys/cam/ctl sys/dev/iscsi sys/kern sys/netgraph sys/netgraph/bluetooth/socket sys/netinet sys/ofed/drivers/infiniband/core sys/ofed/drivers/infiniband/ulp/sdp sys/rpc...
Message-ID: <C5A3E8D4-8E04-48F0-A3F5-EA2AF383C647@dsl-only.net>
Andreas Tobler andreast at FreeBSD.org wrote on Wed Jun 14 08:00:03 UTC 2017:

> Hi Gleb,
>
> with this revision I get either a kernel panic or a hang. This happens
> on powerpc (32-bit). The powerpc64 looks stable.
>
> Here you can see the backtrace in case of the panic:
>
> https://people.freebsd.org/~andreast/r319722_ppc32_1.jpg
>
>
> In the source code I see a comment with XXXGL...
> Is this powerpc specific or do you think that there are some issues in
> the uipc_socket.c code?

I'm not so sure that the specific change in question will turn out to be the cause. Below is why I say that: I saw similar problems back in the likes of -r317820 and before. (I'd frozen at -r317820 for weeks. That is why I make no claims about later revisions.)

TARGET=powerpc TARGET_ARCH=powerpc context. . . (Not observed anywhere else. Also only being used on an old PowerMac G5 so-called "Quad Core".)

I've spent weeks trying to get evidence about crashes that include jumps to non-code (and so illegal instructions and such). This would happen whether the machine was busy or sitting idle. It usually took hours to happen but could happen within minutes of booting.

This goes back to -r317820, where I finally froze the status for a while to focus on attempted problem isolation or at least evidence. It goes back farther as well, but most of my effort was on -r317820.

I found that the results were very memory layout dependent. Inserting:

    void HACKISH_EXTRA_CODE(void) {}

into any one of a variety of source files would change the resultant behavior. (No calls to the routine, but it is externally accessible and so not eliminated by the tool chain.)

Adding any code to detect an observed failure earlier also changed the type of failure seen, making the change not directly effective.

In some cases the result was that I was not able to identify a problem as happening even with waiting well over 24 hours. (Longest time to observed failure: 11 hours. A couple around 8. The rest under 7).
But something still might have been trashed, just with less obvious consequences.

In other cases other addressing errors occurred, or other out-of-bounds accesses occurred, or locks would spin too long, or . . . You probably get the idea.

All my effort basically only seemed to show one thing: occasionally something stomps on register values. It almost has to be some interrupt activity that does not restore context correctly. But I never found anything that I could identify as evidence of the prior interrupt that might have happened.

I was completely unable to come up with any useful identification of what specific code was doing that trashing.

I recently gave up and am starting to work on taking the machines that I have access to past -r317820. That will eventually include TARGET_ARCH=powerpc .

Note: I eventually modified the kernel to prevent execution of most kernel pages that come from loading the file but have no code in the page. This was PowerMac G5 specific, but it at least prevented executing most potential garbage and should catch jumps out of code areas more reliably and sooner. (Not that it got me the answer I was looking for.)

===
Mark Millard
markmi at dsl-only.net