Date: Tue, 27 Apr 2021 18:41:24 +0000
From: bugzilla-noreply@freebsd.org
To: python@FreeBSD.org
Subject: maintainer-feedback requested: [Bug 255445] lang/python 3.8/3.9 SIGSEV core dumps in libthr TrueNAS
Message-ID: <bug-255445-21822-YuYPUhPP2v@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-255445-21822@https.bugs.freebsd.org/bugzilla/>
References: <bug-255445-21822@https.bugs.freebsd.org/bugzilla/>
Bugzilla Automation <bugzilla@FreeBSD.org> has asked freebsd-python (Nobody) <python@FreeBSD.org> for maintainer-feedback:

Bug 255445: lang/python 3.8/3.9 SIGSEV core dumps in libthr TrueNAS
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255445

--- Description ---
Seeing many TrueNAS (previously FreeNAS) users dump core on the main
middlewared process (python) starting with our version 12.0 release.

Relevant OS information:
12.2-RELEASE-p6 FreeBSD 12.2-RELEASE-p6 f2858df162b(HEAD) TRUENAS amd64

Python versions that experience the core dump:
Python 3.8.7
Python 3.9.4

When initially researching this, I found a regression with threading on
Python 3.8 on FreeBSD and was able to resolve that particular problem by
backporting these commits:
https://github.com/python/cpython/commit/4d96b4635aeff1b8ad41d41422ce808ce0b971c8
and
https://github.com/python/cpython/commit/9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6

I backported those commits because all of the core dumps that I've analyzed
are panicking in the same spot (or very close to it). For example, here are
two backtraces showing a null-pointer dereference.

Core was generated by `python3.8: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  cond_signal_common (cond=<optimized out>)
    at /truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:457
warning: Source file is more recent than executable.
457             mp = td->mutex_obj;
[Current thread is 1 (LWP 100733)]
(gdb) list
452                     _sleepq_unlock(cvp);
453                     return (0);
454             }
455
456             td = _sleepq_first(sq);
457             mp = td->mutex_obj;
458             cvp->__has_user_waiters = _sleepq_remove(sq, td);
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
460                     if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
461                             _thr_wake_all(curthread->defer_waiters,
(gdb) p *td
Cannot access memory at address 0x0

and another one

Core was generated by `python3.8: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  cond_signal_common (cond=<optimized out>)
    at /truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:459
warning: Source file is more recent than executable.
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
[Current thread is 1 (LWP 101105)]
(gdb) list
454             }
455
456             td = _sleepq_first(sq);
457             mp = td->mutex_obj;
458             cvp->__has_user_waiters = _sleepq_remove(sq, td);
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
460                     if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
461                             _thr_wake_all(curthread->defer_waiters,
462                                 curthread->nwaiter_defer);
463                             curthread->nwaiter_defer = 0;
(gdb) p *mp
Cannot access memory at address 0x0

I'm trying to instrument a program to "stress" test threading (tearing down
and recreating threads, etc.) but I've been unsuccessful at tickling this
particular problem. The end users who have seen this core dump sometimes go
a month or more without a problem. I'm hoping someone more knowledgeable can
at least give me a pointer or help me figure this one out. I have access to
my VM that has all of the relevant core dumps available, so if someone needs
remote access to it to "poke" around, please let me know. You can reach me
at caleb [at] ixsystems.com
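
For reference, the stress test I've been attempting is roughly the shape of
the sketch below: spin up threads that block on a condition variable, signal
them, join them, and repeat, so that pthread_cond_signal/broadcast paths in
libthr get exercised under constant thread churn. This is only a minimal
sketch; the names, thread counts, and structure are placeholders and not the
actual middlewared code.

#!/usr/bin/env python3
# Minimal thread-churn stress sketch: repeatedly create threads that block
# on a condition variable, wake them all, and tear them down again.
import threading

def churn_once(num_threads=16):
    cond = threading.Condition()
    done = []

    def waiter():
        with cond:
            # Block until the main thread flips the flag, then exit.
            cond.wait_for(lambda: done)

    threads = [threading.Thread(target=waiter) for _ in range(num_threads)]
    for t in threads:
        t.start()
    with cond:
        done.append(True)
        cond.notify_all()  # drives the condvar signal path in libthr
    for t in threads:
        t.join()

if __name__ == "__main__":
    iteration = 0
    while True:
        churn_once()
        iteration += 1
        if iteration % 1000 == 0:
            print(f"{iteration} iterations without a crash")

So far this kind of loop has not reproduced the crash for me, which is why
I suspect the trigger needs a more specific interleaving than simple churn.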