From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov
CC: "freebsd-current@freebsd.org", Alexander Motin, Doug Rabson
Subject: Re: nfsd kernel threads won't die via SIGKILL
Date: Wed, 27 Jun 2018 01:05:09 +0000
References: <20180624093330.GX2430@kib.kiev.ua> <20180625205614.GI2430@kib.kiev.ua>
In-Reply-To: <20180625205614.GI2430@kib.kiev.ua>
Konstantin Belousov wrote:
On Mon, Jun 25, 2018 at 02:04:32AM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Sat, Jun 23, 2018 at 09:03:02PM +0000, Rick Macklem wrote:
> >> During testing of the pNFS server I have been frequently killing/restarting
> >> the nfsd. Once in a while, the "slave" nfsd process doesn't terminate and
> >> a "ps axHl" shows:
> >>   0 48889    1   0  20  0  5884  812 svcexit  D  -  0:00.01 nfsd: server
> >>   0 48889    1   0  40  0  5884  812 rpcsvc   I  -  0:00.00 nfsd: server
> >> ... more of the same
> >>   0 48889    1   0  40  0  5884  812 rpcsvc   I  -  0:00.00 nfsd: server
> >>   0 48889    1   0  -8  0  5884  812 rpcsvc   I  -  1:51.78 nfsd: server
> >>   0 48889    1   0  -8  0  5884  812 rpcsvc   I  -  2:27.75 nfsd: server
> >>
> >> You can see that the top thread (the one that was created with the process)
> >> is stuck in "D" on "svcexit".
> >> The rest of the threads are still servicing NFS RPCs.
[lots of stuff snipped]
>Signals are put onto a signal queue between the time a signal is
>generated and the time the thread actually consumes it. I.e., the signal
>queue is a container for the signals that have not yet been acted upon.
>There is one signal queue per process, and one signal queue for each
>thread belonging to the process. When you signal the process, the signal
>is put onto some thread's signal queue, where the only criterion for the
>selection of the thread is that the signal is not blocked. Since
>SIGKILL is never blocked, it can be put anywhere.
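
(Just so I'm sure I follow, here's a minimal sketch of the kind of
PCATCH-enabled sleep this affects, assuming the usual tsleep(9) interface;
the flag and wait-channel names are made up for illustration, not taken
from the actual nfsd code:

    /*
     * Hypothetical sketch -- "nfsd_stopping" and "nfsd_chan" are
     * invented names.  While a signal such as SIGKILL sits on the
     * thread's signal queue, a PCATCH sleep returns EINTR or
     * ERESTART instead of blocking; the signal itself is only
     * acted upon later, at the AST handler.
     */
    int error;

    while (nfsd_stopping == 0) {
            /* Sleep up to 1 second; wake early if a signal is queued. */
            error = tsleep(&nfsd_chan, PSOCK | PCATCH, "rpcsvc", hz);
            if (error == EINTR || error == ERESTART)
                    break;          /* Signal pending; go clean up. */
    }

If that's right, the sleeps keep failing with EINTR/ERESTART while the
SIGKILL just sits queued.)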
>Until the signal is delivered by the cursig()/postsig() loop, typically at the
>AST handler, the only consequences of its presence are the EINTR/ERESTART
>errors returned from PCATCH-enabled sleeps.
Ok, now I think I understand how this works. Thanks a lot for the explanation.

> >Your description at the start of the message of the behaviour after
> >SIGKILL, where other threads continued to serve RPCs, exactly matches
> >above explanation. You need to add some global 'stop' flag, if it is not
I looked at the code and there is already basically a "global stop flag".
It's done by setting the sg_state variable to CLOSING for all thread groups
in a function called svc_exit(). (I missed this when I looked before, so I
didn't understand how all the threads normally terminate.)

So, when I looked at svc_run_internal(), I found a loop, roughly
    while (state != closing)
that calls cv_wait_sig()/cv_timedwait_sig(); when these return EINTR/ERESTART,
svc_exit() is called to make all the threads return from the function.
--> The only way it can get into the broken situation I see sometimes is if the
    top thread (called "ismaster" by the code) somehow returns from
    svc_run_internal() without calling svc_exit(), so that the state isn't
    set to "closing".
Turns out there is only one place this can happen. It's this line:
    if (grp->sg_threadcount > grp->sg_maxthreads)
            break;
I wouldn't have thought that sg_threadcount would become ">" sg_maxthreads,
but when I looked at the output of "ps" that I pasted into the first message,
there are 33 threads. (When I started the nfsd, I specified 32 threads, so I
think it did the "break;" at this place to get out of the loop and return from
svc_run_internal() without calling svc_exit().)

I think changing the above line to:
    if (!ismaster && grp->sg_threadcount > grp->sg_maxthreads)
will fix it. I'll test this and see if I can get it to fail.

Thanks again for your help, rick
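
ps: To be explicit about the one-line change, here's the condition I plan to
test in svc_run_internal() (sys/rpc/svc.c), with a comment on why I think it
matters. This is just my reading of the code, so treat it as a sketch rather
than a final patch:

    /*
     * Only a dynamically added worker thread should leave the loop
     * because the group has more threads than sg_maxthreads.  If the
     * "ismaster" thread takes this break, it returns from
     * svc_run_internal() without calling svc_exit(), sg_state never
     * gets set to CLOSING, and the remaining threads keep servicing
     * RPCs -- which is exactly the hung state in the "ps axHl"
     * output above.
     */
    if (!ismaster && grp->sg_threadcount > grp->sg_maxthreads)
            break;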