From: Mark Millard
Date: Thu, 7 Jan 2016 16:10:26 -0800
To: Ian Lepore, Hans Petter Selasky
Cc: freebsd-arm
Subject: Re: FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm (rpi2): a description of sorts

On 2016-Jan-7, at 2:28 PM, Ian Lepore wrote:
>
> On Thu, 2016-01-07 at 14:04 -0800, Mark Millard wrote:
>> On 2016-Jan-7, at 1:31 PM, Hans Petter Selasky wrote:
>>>
>>> On 01/07/16 22:26, Hans Petter Selasky wrote:
>>>> On 01/07/16 21:20, Mark Millard wrote:
>>>>>
>>>>> On 2016-Jan-7, at 12:04 PM, Hans Petter Selasky wrote:
>>>>>>
>>>>>> On 01/07/16 20:48, Ian Lepore wrote:
>>>>>>> If the filesystems and swap space are on a usb drive, then maybe
>>>>>>> it's the usb subsystem that's hanging. The wait states you showed
>>>>>>> for those processes are consistent with what I've seen when all
>>>>>>> buffers get backed up in a queue on one non-responsive or slow
>>>>>>> device. It may be that there's a way to get the system deadlocked
>>>>>>> when it's low on buffers and there is memory pressure causing the
>>>>>>> swap to be used (I generally run arm systems without any swap
>>>>>>> configured).
>>>>>>>
>>>>>>> Running gstat in another window while this is going on may give
>>>>>>> you some insight into the situation. Beyond that I don't know what
>>>>>>> to look at, especially since you generally can't launch any new
>>>>>>> tools once the system gets into this kind of state.
>>>>>>>
>>>>>>> -- Ian
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> All USB transfers towards disk devices have timeouts, so if
>>>>>> something is hanging at USB level, you'll get a printout eventually.
>>>>>
>>>>> What sort of timescale after deadlock/live-lock is observed to
>>>>> apparently have started does one have to wait in order to conclude
>>>>> that the timeouts would have happened and so they do not apply to
>>>>> the deadlock/live-lock?
>>>>>
>>>>>> The USB kernel processes needed for doing I/O transfers are not
>>>>>> pinned to RAM. Can it happen if a USB process is swapped to disk,
>>>>>> that the system cannot wake up a swapped out process to get more
>>>>>> swap?
>>>>>>
>>>>>> --HPS
>>>>>
>>>>
>>>> Hi,
>>>>
>>>>> Wow. Could I use ddb to somehow check on the "USB kernel processes"
>>>>> swap status when the overall context is deadlocked/live-locked?
>>>>
>>>> Are you able to run something like:
>>>>
>>>> ps auxwwH | grep usb
>>>>
>>>>> If yes, how? Otherwise something in top or some such display that
>>>>> I'd left running over the serial console would have to present
>>>>> useful information on the subject. Is there anything that would?
>>>>
>>>
>>> Are you able to SSH into the box or ping it?
>>>
>>> --HPS
>>
>> Once the live-lock condition is reached no new processes can be
>> created as far as I can tell: the attempt will hang any process that
>> attempts the creation.
>>
>> I'd need "ps auxwwH" to be internally repeating to even get that
>> much: I'd have to start it before the live-lock happened and it would
>> have to be still running when the hang occurs, no on-going process
>> creations involved.
>>
>> I'm not so sure that two communicating processes (ps and grep over a
>> pipe) would work but I can not get to even one new process so far.
>>
>> ssh sessions also hang, input and output stop for them fairly
>> generally. (Sometimes the context is such that ^t still works but
>> shows no progress in what it reports.) No new ssh connections are
>> possible: "Operation timed out".
>>
>> ping does respond normally: it is more of a live-lock status than a
>> true deadlock one overall.
>>
>> The serial console still outputs what it was already running if that
>> process does nothing that locks up. Changing what it is doing
>> generally locks it up too.
>>
>> Doing something like unplugging a usb keyboard or mouse or plugging
>> one in does show the expected messages via the console: it is more of
>> a live-lock status than a true deadlock one overall.
>>
>> I can get to ddb after the hang. But I do not know what I'd do with
>> it to find any useful information.
>>
>>
>> As noted in another message: I used gstat instead of top on the
>> serial console:
>>
>>> gstat shows everything zero during a hang, even L(q) column.
>>> (Length of queue?)
>>>
>>> I used:
>>>
>>> gstat -cod
>>>
>>> and had it running over the serial console port during the
>>> attempted portmaster activity.
>
> All of those symptoms sound consistent with the deadlock being
> IO-related. You can't ssh in because creating an ssh session for you
> requires reading a variety of files and it locks at that point. USB
> insert/remove events lead to devd events which can lead to doing IO (to
> load driver modules for example) so that might lead to lockups or not.
>
> Since ddb is still usable when the hangs occur, you can break into that
> and use its 'ps' command (no args) to find out what various threads are
> waiting for (wmesg column). The fact that your original output
> included processes in a 'wswbuf' state is what makes me think it's
> swap-related IO that's causing everything else to back up behind it.
> (Unfortunately, there are 'wswbuf0' and 'wswbuf1' waits in the kernel
> that really should be named 'wsw0buf' and 'wsw1buf' to allow for the
> 6-char truncation of the display).
>
> There are probably ddb commands to look at a variety of other
> interesting things (the 'show' command has a lot of options), but I
> don't know what to look at really, other than some guesses (show pageq
> might be interesting, show freepages maybe?).
>
> -- Ian

FYI. . .

ddb's "ps" showed (my presentation order and formatting):

[pagedaemon] had wmesg wswbuf0 and state D
[swapper] had wmesg vmwait and state D
[md0] had wmesg vmwait and state DL
[usb]'s threads:
  [usb0] had wmesg - and state D (all 5 such lines did)
  [smsc0] had wmesg - and state D

"show pageq" listed:

pq_free 2 pq_cache 0
dom 0 page_cnt 234761 free 2 pq_act 164873 pq_inact 18563 pass 2

"show freepages" listed only one non-zero "NUMBER POOL 0":

ORDER (SIZE)    NUMBER POOL 0
01 (000008k)    000001

===
Mark Millard
markmi at dsl-only.net
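
For scale, the "show pageq" counts above can be converted to sizes. This is only a rough sketch: it assumes 4 KiB pages (not stated in the ddb output, though it agrees with the freepages listing, where the one order-01 block is 8k, i.e. 2 pages). The function name and dictionary keys are made up for illustration.

```python
# Rough conversion of the "show pageq" page counts into sizes.
# ASSUMPTION: 4 KiB pages; the ddb output above does not state the page size.
PAGE_KIB = 4


def summarize_pageq(page_cnt, free, active, inactive, page_kib=PAGE_KIB):
    """Return sizes for the page counts reported by ddb's "show pageq"."""
    # Pages that are not on the free, active or inactive queues.
    other = page_cnt - (free + active + inactive)
    return {
        "total_mib": page_cnt * page_kib / 1024,
        "active_mib": active * page_kib / 1024,
        "inactive_mib": inactive * page_kib / 1024,
        "free_kib": free * page_kib,
        "other_pages": other,
    }


# Values copied from the "dom 0" line of the "show pageq" output.
summary = summarize_pageq(page_cnt=234761, free=2, active=164873, inactive=18563)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under that assumption the 2 free pages come to 8 KiB, which matches the single 8k block that "show freepages" reports.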