Date:      Wed, 14 Jun 2017 16:08:02 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        andreast@FreeBSD.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r319722 - in head: sys/cam/ctl sys/dev/iscsi sys/kern sys/netgraph sys/netgraph/bluetooth/socket sys/netinet  sys/ofed/drivers/infiniband/core sys/ofed/drivers/infiniband/ulp/sdp  sys/rpc...
Message-ID:  <C5A3E8D4-8E04-48F0-A3F5-EA2AF383C647@dsl-only.net>

Andreas Tobler andreast at FreeBSD.org wrote on
Wed Jun 14 08:00:03 UTC 2017:

> Hi Gleb,
> 
> with this revision I get either a kernel panic or a hang. This happens 
> on powerpc (32-bit). The powerpc64 looks stable.
> 
> Here you can see the backtrace in case of the panic:
> 
> https://people.freebsd.org/~andreast/r319722_ppc32_1.jpg
> 
> 
> In the source code I see a comment with XXXGL...
> Is this powerpc specific or do you think that there are some issues in 
> the uipc_socket.c code?

I'm not so sure that the specific change in
question will turn out to be the cause. The
reason: I saw similar problems back at
-r317820 and before. (I had frozen at -r317820
for weeks, which is why I make no claims about
later revisions.)

TARGET=powerpc TARGET_ARCH=powerpc context. . .
(Not observed anywhere else. Also only
being used on an old PowerMac G5 so-called
"Quad Core".)

I've spent weeks trying to get evidence about
crashes that include jumps to non-code (and so
illegal instructions and such). These happened
whether the machine was busy or sitting idle,
usually after hours but sometimes within
minutes of booting.

This goes back to -r317820 where I finally froze
the status for a while to focus on attempted
problem isolation or at least evidence. It goes
back farther as well but most of my effort was
on -r317820.

I found that the results were very memory layout
dependent. Inserting:

void HACKISH_EXTRA_CODE(void) {}

into any one of a variety of source files
would change the resultant behavior. (There
are no calls to the routine, but it is
externally accessible and so is not eliminated
by the tool chain.)
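
A minimal sketch of that kind of filler, assuming plain C.
The helper hackish_extra_code_addr is hypothetical and is
here only to show that the filler survives into the linked
image; it is not part of what was actually inserted:

```c
#include <stdint.h>

/*
 * Filler routine: never called, but it has external linkage, so the
 * compiler must still emit it into the object file. Its presence
 * shifts the placement of whatever is laid out after it.
 */
void HACKISH_EXTRA_CODE(void) {}

/*
 * Hypothetical helper (an assumption, for illustration only):
 * returns the filler's address as an integer so its presence
 * can be checked.
 */
uintptr_t hackish_extra_code_addr(void)
{
    return (uintptr_t)&HACKISH_EXTRA_CODE;
}
```

A static version would not serve the same purpose: the
compiler is free to drop an unused static function, which
would leave the layout unchanged.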

Adding any code to detect an observed failure
earlier also changed the type of failure seen,
so the added check did not directly help.

In some cases the result was that I was
not able to identify a problem as happening
even with waiting well over 24 hours. (Longest
time to observed failure: 11 hours. A couple
around 8. The rest under 7). But something still
might have been trashed, just with less obvious
consequences.

In other cases other addressing errors occurred
or other out of bounds accesses occurred or
locks would spin too long or . . . You
probably get the idea.

All my effort basically only seemed to
show one thing: occasionally something
stomps on register values. It almost
has to be some interrupt activity that
does not restore context correctly. But
I never found anything that I could
identify as evidence of the prior
interrupt that might have happened.

I was completely unable to come up with
any useful identification of what specific
code was doing that trashing.

I recently gave up and am starting to work
on taking the machines that I have access
to past -r317820. That will eventually
include TARGET_ARCH=powerpc . 




Note: I eventually modified the kernel to
prevent execution of most kernel pages that
come from loading the file but contain no
code. This was PowerMac G5 specific, but it
at least prevented executing most potential
garbage and should catch jumps out of code
areas more reliably and sooner. (Not that it
got me the answer I was looking for.)


===
Mark Millard
markmi at dsl-only.net



