Date: Tue, 27 Apr 2021 18:41:24 +0000
From: bugzilla-noreply@freebsd.org
To: python@FreeBSD.org
Subject: maintainer-feedback requested: [Bug 255445] lang/python 3.8/3.9 SIGSEV core dumps in libthr TrueNAS
Message-ID: <bug-255445-21822-YuYPUhPP2v@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-255445-21822@https.bugs.freebsd.org/bugzilla/>
References: <bug-255445-21822@https.bugs.freebsd.org/bugzilla/>
Bugzilla Automation <bugzilla@FreeBSD.org> has asked freebsd-python (Nobody) <python@FreeBSD.org> for maintainer-feedback:

Bug 255445: lang/python 3.8/3.9 SIGSEV core dumps in libthr TrueNAS
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255445

--- Description ---
Seeing many TrueNAS (previously FreeNAS) users dump core on the main
middlewared process (python) starting with our version 12.0 release.

Relevant OS information:
12.2-RELEASE-p6 FreeBSD 12.2-RELEASE-p6 f2858df162b(HEAD) TRUENAS amd64

Python versions that experience the core dump:
Python 3.8.7
Python 3.9.4

When initially researching this, I found a regression with threading on
Python 3.8 on FreeBSD and was able to resolve that particular problem by
backporting these commits:
https://github.com/python/cpython/commit/4d96b4635aeff1b8ad41d41422ce808ce0b971c8
and
https://github.com/python/cpython/commit/9ad58acbe8b90b4d0f2d2e139e38bb5aa32b7fb6

I backported those commits because all of the core dumps that I've analyzed
are panicking in the same spot (or very close to it). For example, here are
two backtraces showing a null-pointer dereference.

Core was generated by `python3.8: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  cond_signal_common (cond=<optimized out>)
    at /truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:457
warning: Source file is more recent than executable.
457             mp = td->mutex_obj;
[Current thread is 1 (LWP 100733)]
(gdb) list
452                     _sleepq_unlock(cvp);
453                     return (0);
454             }
455
456             td = _sleepq_first(sq);
457             mp = td->mutex_obj;
458             cvp->__has_user_waiters = _sleepq_remove(sq, td);
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
460                     if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
461                             _thr_wake_all(curthread->defer_waiters,
(gdb) p *td
Cannot access memory at address 0x0

and another one

Core was generated by `python3.8: middlewared'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  cond_signal_common (cond=<optimized out>)
    at /truenas-releng/freenas/_BE/os/lib/libthr/thread/thr_cond.c:459
warning: Source file is more recent than executable.
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
[Current thread is 1 (LWP 101105)]
(gdb) list
454             }
455
456             td = _sleepq_first(sq);
457             mp = td->mutex_obj;
458             cvp->__has_user_waiters = _sleepq_remove(sq, td);
459             if (PMUTEX_OWNER_ID(mp) == TID(curthread)) {
460                     if (curthread->nwaiter_defer >= MAX_DEFER_WAITERS) {
461                             _thr_wake_all(curthread->defer_waiters,
462                                 curthread->nwaiter_defer);
463                             curthread->nwaiter_defer = 0;
(gdb) p *mp
Cannot access memory at address 0x0

I'm trying to instrument a program to "stress" test threading (tearing down
and recreating threads, etc.) but I've been unsuccessful at tickling this
particular problem. The end users who have seen this core dump sometimes go
a month or more without a problem. I'm hoping someone more knowledgeable can
at least give me a pointer or help me figure this one out. I have access to
my VM that has all of the relevant core dumps available, so if someone needs
remote access to it to "poke" around, please let me know. You can reach me
at caleb [at] ixsystems.com
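
For reference, the stress test I've been attempting is roughly the shape of
the sketch below: spin up threads that block on a condition variable, signal
them, join them, and repeat, so that pthread_cond_signal/broadcast paths in
libthr get exercised under constant thread churn. This is only a minimal
sketch; the names, thread counts, and structure are placeholders and not the
actual middlewared code.

#!/usr/bin/env python3
# Minimal thread-churn stress sketch: repeatedly create threads that block
# on a condition variable, wake them all, and tear them down again.
import threading

def churn_once(num_threads=16):
    cond = threading.Condition()
    done = []

    def waiter():
        with cond:
            # Block until the main thread flips the flag, then exit.
            cond.wait_for(lambda: done)

    threads = [threading.Thread(target=waiter) for _ in range(num_threads)]
    for t in threads:
        t.start()
    with cond:
        done.append(True)
        cond.notify_all()  # drives the condvar signal path in libthr
    for t in threads:
        t.join()

if __name__ == "__main__":
    iteration = 0
    while True:
        churn_once()
        iteration += 1
        if iteration % 1000 == 0:
            print(f"{iteration} iterations without a crash")

So far this kind of loop has not reproduced the crash for me, which is why
I suspect the trigger needs a more specific interleaving than simple churn.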