From owner-freebsd-arm@freebsd.org  Fri Jan  8 05:49:56 2016
Return-Path: <owner-freebsd-arm@freebsd.org>
Delivered-To: freebsd-arm@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 82DC9A6786A
 for <freebsd-arm@mailman.ysv.freebsd.org>;
 Fri,  8 Jan 2016 05:49:56 +0000 (UTC)
 (envelope-from markmi@dsl-only.net)
Received: from asp.reflexion.net (outbound-mail-210-3.reflexion.net
 [208.70.210.3])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 45AE1158B
 for <freebsd-arm@freebsd.org>; Fri,  8 Jan 2016 05:49:55 +0000 (UTC)
 (envelope-from markmi@dsl-only.net)
Received: (qmail 26565 invoked from network); 8 Jan 2016 05:49:53 -0000
Received: from unknown (HELO rtc-sm-01.app.dca.reflexion.local) (10.81.150.1)
 by 0 (rfx-qmail) with SMTP; 8 Jan 2016 05:49:53 -0000
Received: by rtc-sm-01.app.dca.reflexion.local
 (Reflexion email security v7.80.0) with SMTP;
 Fri, 08 Jan 2016 00:49:56 -0500 (EST)
Received: (qmail 23405 invoked from network); 8 Jan 2016 05:49:56 -0000
Received: from unknown (HELO iron2.pdx.net) (69.64.224.71)
 by 0 (rfx-qmail) with SMTP; 8 Jan 2016 05:49:56 -0000
X-No-Relay: not in my network
Received: from [192.168.1.8] (c-76-115-7-162.hsd1.or.comcast.net
 [76.115.7.162])
 by iron2.pdx.net (Postfix) with ESMTPSA id 551521C43BC;
 Thu,  7 Jan 2016 21:49:47 -0800 (PST)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
Subject: Re: FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm
 (rpi2): a description of sorts
From: Mark Millard <markmi@dsl-only.net>
In-Reply-To: <97E0840E-987C-4893-9E63-EA51741CFC75@dsl-only.net>
Date: Thu, 7 Jan 2016 21:49:52 -0800
Cc: freebsd-arm <freebsd-arm@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <5D7C239F-6D43-4C7A-B7F2-88E7488418B2@dsl-only.net>
References: <E0379BE9-308A-4219-A8AE-A5FFE828BA93@dsl-only.net>
 <1452183170.1215.4.camel@freebsd.org>
 <FB0D5486-AD27-44A7-86CA-68989AE08EC7@dsl-only.net>
 <1452196099.1215.12.camel@freebsd.org> <568EC4D8.7010106@selasky.org>
 <8B728C93-9C90-4821-A607-5D157F028812@dsl-only.net>
 <568ED810.8010309@selasky.org> <568ED92C.9070602@selasky.org>
 <B7E8D0FD-B3A9-40DF-B0ED-9D3041F8B2A2@dsl-only.net>
 <D44C4EF3-0976-45E7-944A-A8F23D3D89BF@dsl-only.net>
 <CANCZdfqGUJ19Gbu=ermSGh1LJ5N9OPEyRYH9kPEAoaUmTuObdw@mail.gmail.com>
 <97E0840E-987C-4893-9E63-EA51741CFC75@dsl-only.net>
To: Hans Petter Selasky <hps@selasky.org>, Ian Lepore <ian@freebsd.org>,
 Warner Losh <imp@bsdimp.com>
X-Mailer: Apple Mail (2.2104)
X-BeenThere: freebsd-arm@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: "Porting FreeBSD to ARM processors." <freebsd-arm.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arm>,
 <mailto:freebsd-arm-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arm/>
List-Post: <mailto:freebsd-arm@freebsd.org>
List-Help: <mailto:freebsd-arm-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arm>,
 <mailto:freebsd-arm-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 08 Jan 2016 05:49:56 -0000

Top post of a major conclusion:

I have isolated a working vs. failing context for the hangs issue:
(Note that everything for world and ports and such is on the SSD root
partition in my examples.)

A) Using a swap file on the root partition as the swap space leads to
hangs when the space gets sufficient activity

vs.

B) Using a swap partition as the swap space works without hangs

It is the same SSD as before both ways. (I had to dump, repartition, and
restore since I had not provided space for a swap partition earlier.)

A swap partition on the sdcard as the swap space also works.
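
For anyone wanting to compare the two contexts, a minimal sketch of how
they would typically differ in /etc/fstab follows. It is an illustration
only, not the configuration from this system: the swap file path
/usr/swap0 and the md99 unit number are assumptions borrowed from the
usual FreeBSD Handbook form (here the swap file lived on the root file
system and showed up as md0). Only the /dev/gpt/RPI2swap label comes
from the swapinfo output in this message.

```
# (A) swap file backed by an md(4) device -- the context that hangs.
#     The path and the md unit number are hypothetical.
md99	none	swap	sw,file=/usr/swap0,late	0	0

# (B) swap partition -- the context that works.
/dev/gpt/RPI2swap	none	swap	sw	0	0
```

Only one of the two entries would be active at a time.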


So it appears there is a problem with using swapfiles, at least when
they are on file systems that are sometimes busy with other activity,
but possibly more generally. (As the SSD is connected to the RPI2
through a USB interface, swapfiles do not provide trim support any more
than swap partitions would.)

The SSD is likely noticeably faster in various respects and so may be
more of a challenge for swapfile handling in some way (via extra
file-system/IO/resource load with less time between various activities),
at least on rpi2's.


I now have for the SSD context:

$ df -m
Filesystem          1M-blocks  Used  Avail Capacity  Mounted on
/dev/ufs/RPI2rootfs    440365 11179 393957     3%    /
devfs                       0     0      0   100%    /dev
/dev/mmcsd0s1              49     7     42    15%    /boot/msdos

$ swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/RPI2swap   3282756   905876  2376880    28%
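
The Capacity column above is Used divided by 1K-blocks. A small sketch
that recomputes it from swapinfo-style text follows. The function name
swap_pct and the embedded sample are illustrative assumptions; on a live
system the input would come from running swapinfo itself.

```shell
#!/bin/sh
# Sample swapinfo output copied from this message (stand-in for a
# live "swapinfo" invocation).
sample='Device          1K-blocks     Used    Avail Capacity
/dev/gpt/RPI2swap   3282756   905876  2376880    28%'

# Skip the header line, then print each device with used/total
# expressed as a rounded percentage.
swap_pct() {
    awk 'NR > 1 { printf "%s %.0f\n", $1, 100 * $3 / $2 }'
}

printf '%s\n' "$sample" | swap_pct
# prints: /dev/gpt/RPI2swap 28
```

Here 100 * 905876 / 3282756 is about 27.6, which matches the 28% that
swapinfo reports.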


===
Mark Millard
markmi at dsl-only.net

On 2016-Jan-7, at 3:24 PM, Mark Millard <markmi at dsl-only.net> wrote:
>
>
> On 2016-Jan-7, at 2:28 PM, Warner Losh <imp at bsdimp.com> wrote:
>>
>> 4 page requests shouldn't hang the whole system. That should be more
>> like hundreds or thousands depending on the tuning you've done.
>>
>> Warner
>>
>
> FYI: I do not remember doing any explicit tuning. Other than having an
> SSD for the root file system (via fstab content) and using cortex-a7
> related compile options, things are default, with ssh and little else
> enabled as I remember. I'm even currently running KERNCONF=RPI2
> instead of my RPI2-NODBG variant.
>
> For my note about L(q)==4 for md0: "SWAP/swap/md0" showed 0. The
> only "name" showing a non-zero value was "md0" --and only for L(q).
>
>
>
> It does look like the latest hang finally produced some messages: 3
> copies of
>
> smsc0: warning: failed to create new mbuf
>
> but these messages do not normally appear.
>
>
>
>> On Thu, Jan 7, 2016 at 3:16 PM, Mark Millard <markmi@dsl-only.net> wrote:
>> I'm top posting this change of information about the hang status seen
>> via gstat:
>>
>> After a long time the gstat -cod is showing a non-zero value in one
>> place:
>>
>> L(q) for md0 is showing 4 now.
>>
>> (I've no clue when it changed. I do not expect that I missed the 4
>> before.)
>>
>> md0 is for the file-system based page file. That file is on the SSD,
>> not the sdcard.
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>>=20
>> On 2016-Jan-7, at 2:04 PM, Mark Millard <markmi@dsl-only.net> wrote:
>>
>>>
>>> On 2016-Jan-7, at 1:31 PM, Hans Petter Selasky <hps at selasky.org> wrote:
>>>>
>>>> On 01/07/16 22:26, Hans Petter Selasky wrote:
>>>>> On 01/07/16 21:20, Mark Millard wrote:
>>>>>>
>>>>>> On 2016-Jan-7, at 12:04 PM, Hans Petter Selasky <hps at selasky.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 01/07/16 20:48, Ian Lepore wrote:
>>>>>>>> If the filesystems and swap space are on a usb drive, then maybe it's
>>>>>>>> the usb subsystem that's hanging.  The wait states you showed for those
>>>>>>>> processes are consistent with what I've seen when all buffers get
>>>>>>>> backed up in a queue on one non-responsive or slow device.  It may be
>>>>>>>> that there's a way to get the system deadlocked when it's low on
>>>>>>>> buffers and there is memory pressure causing the swap to be used (I
>>>>>>>> generally run arm systems without any swap configured).
>>>>>>>>
>>>>>>>> Running gstat in another window while this is going on may give you
>>>>>>>> some insight into the situation.  Beyond that I don't know what to look
>>>>>>>> at, especially since you generally can't launch any new tools once the
>>>>>>>> system gets into this kind of state.
>>>>>>>>
>>>>>>>> -- Ian
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> All USB transfers towards disk devices have timeouts, so if something
>>>>>>> is hanging at USB level, you'll get a printout eventually.
>>>>>>
>>>>>> How long after the deadlock/live-lock appears to have started does
>>>>>> one have to wait in order to conclude that the timeouts would have
>>>>>> happened, and so that they do not apply to the deadlock/live-lock?
>>>>>>
>>>>>>> The USB kernel processes needed for doing I/O transfers are not
>>>>>>> pinned to RAM. Can it happen if a USB process is swapped to disk,
>>>>>>> that the system cannot wakeup a swapped out process to get more swap?
>>>>>>>
>>>>>>> --HPS
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>> Wow. Could I use ddb to somehow check on the "USB kernel processes"
>>>>>> swap status when the overall context is deadlocked/live-locked?
>>>>>
>>>>> Are you able to run something like:
>>>>>
>>>>> ps auxwwH | grep usb
>>>>>
>>>>>> If yes, how? Otherwise something in top or some such display that I'd
>>>>>> left running over the serial console would have to present useful
>>>>>> information on the subject. Is there anything that would?
>>>>>
>>>>
>>>> Are you able to SSH into the box or ping it?
>>>>
>>>> --HPS
>>>
>>> Once the live-lock condition is reached no new processes can be
>>> created as far as I can tell: the attempt will hang any process that
>>> attempts the creation.
>>>
>>> I'd need "ps auxwwH" to be internally repeating to even get that
>>> much: I'd have to start it before the live-lock happened and it would
>>> have to be still running when the hang occurs, with no on-going
>>> process creations involved.
>>>
>>> I'm not so sure that two communicating processes (ps and grep over a
>>> pipe) would work but I cannot get to even one new process so far.
>>>
>>> ssh sessions also hang: input and output stop for them fairly
>>> generally. (Sometimes the context is such that ^t still works but shows
>>> no progress in what it reports.) No new ssh connections are possible:
>>> "Operation timed out".
>>>
>>> ping does respond normally: it is more of a live-lock status than a
>>> true deadlock one overall.
>>>
>>> The serial console still outputs what it was already running if that
>>> process does nothing that locks up. Changing what it is doing generally
>>> locks it up too.
>>>
>>> Doing something like unplugging a usb keyboard or mouse or plugging
>>> one in does show the expected messages via the console: it is more of a
>>> live-lock status than a true deadlock one overall.
>>>
>>> I can get to ddb after the hang. But I do not know what I'd do with
>>> it to find any useful information.
>>>
>>>
>>> As noted in another message: I used gstat instead of top on the
>>> serial console:
>>>
>>>> gstat shows everything zero during a hang, even L(q) column.
>>>> (Length of queue?)
>>>>
>>>> I used:
>>>>
>>>> gstat -cod
>>>>
>>>> and had it running over the serial console port during the
>>>> attempted portmaster activity.
>>>
>>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
===
Mark Millard
markmi at dsl-only.net