Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 25 Mar 2020 01:18:36 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        bob prohaska <fbsd@www.zefox.net>
Cc:        freebsd-arm@freebsd.org
Subject:   Re: Belated out of swap kill on rpi3 at r359216
Message-ID:  <0FF6BC4C-296F-49F3-8FB8-AA87A49349E2@yahoo.com>
In-Reply-To: <20200325015633.GA93057@www.zefox.net>
References:  <20200324155753.GA91922@www.zefox.net> <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com> <20200324185518.GA92311@www.zefox.net> <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com> <20200324224658.GA92726@www.zefox.net> <764D5A86-6A42-44E0-A706-F1C49BB198DA@yahoo.com> <20200325015633.GA93057@www.zefox.net>

next in thread | previous in thread | raw e-mail | index | archive | help


On 2020-Mar-24, at 18:56, bob prohaska <fbsd at www.zefox.net> wrote:

> [this thread needs a haircut, new comments start near line 288]
> On Tue, Mar 24, 2020 at 05:21:47PM -0700, Mark Millard wrote:
>> On 2020-Mar-24, at 15:46, bob prohaska <fbsd at www.zefox.net> wrote:
>>=20
>>> . . .
>>=20
>> . . .
>>=20
>>=20
>> The text before "Mar 24 04:20:50" messages, with notes
>> and blank lines mixed in to group things:
>>=20
>> . . .
>> Tue Mar 24 04:20:38 PDT 2020

Note the time above.

>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 3/773/776/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   54180 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.031s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>>   18      1      0      0    0.0      1     31  26686      0      0   =
 0.0  287.1  mmcsd0
>>   18      1      0      0    0.0      1     31  26686      0      0   =
 0.0  287.1  mmcsd0s2
>>   10      1      0      0    0.0      1     31  26686      0      0   =
 0.0  287.1  mmcsd0s2a
>>   10      1      0      0    0.0      1     31  26686      0      0   =
 0.0  287.1  ufs/rootfs
>> Tue Mar 24 04:20:40 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 6/770/776/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> Note that the above has 26sec+ ms/w figures. No reads.
>> No mmcsd0s2b.
>>=20
>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   54252 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.064s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>> Tue Mar 24 04:20:42 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>=20
>=20
> Same thing happens when gstat -a is run on an idle machine.
> I thought it normal, the CPU had all the data it needed.=20
>=20
>=20
>=20
>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   53668 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.046s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>>   17      1      0      0    0.0      1     31  30042      0      0   =
 0.0  320.7  mmcsd0
>>   17      1      0      0    0.0      1     31  30042      0      0   =
 0.0  320.7  mmcsd0s2
>>    9      1      0      0    0.0      1     31  30042      0      0   =
 0.0  320.7  mmcsd0s2a
>>    9      1      0      0    0.0      1     31  30042      0      0   =
 0.0  320.7  ufs/rootfs
>> Tue Mar 24 04:20:43 PDT 2020

Note the time above compared to the earlier one that
I referenced: around 5sec later (=3D=3D 43sec-38sec),
despite the ms/w figures involved (over 26sec/w and
over 30sec/w with a nothing-listed in the middle).


>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> Note that the above has 30sec+ ms/w figures. No reads.
>>=20
>> Also note that the L(q) figures dropped by only 1 over
>> the about 04:20:40 to 04:20:43 interval (18->17 and
>> 10->9). (I only take these deltas as suggestive. They
>> might be accidental near matches.) No mmcsd0s2b.
>>=20
>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   53624 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.003s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>> Tue Mar 24 04:20:45 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>=20
>=20
> It's tempting, but maybe not correct, the think the I/O caught up.

On this side of the 30sec+ example, things are less clear
than the above note, so maybe. (Same below.)

>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   53600 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.063s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>> Tue Mar 24 04:20:47 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>=20
>=20
> Can't that be interpreted as gstat -a having nothing to report?=20
> Just _why_ it might have nothing to report never crossed my mind....
>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   53592 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.006s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>> Tue Mar 24 04:20:48 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>=20
>>=20
>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 0  0 30 2034696   53672 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.002s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>> Tue Mar 24 04:20:50 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   403784  4000468     9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 1033854, size: 4096
>> 15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> For the above, note the list of any lines with ms/r or ms/w figures,
>> just the title rows.
>>=20
>>=20
>>=20
>> procs     memory       page                      disks     faults     =
  cpu
>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   =
cs us sy id
>> 2  0 30 2045028   54160 12942   7   2   4 13432 1374   0   0 14074  =
8102  2849 65 29  5
>> dT: 1.001s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   =
ms/d   %busy Name
>>    2    329    236   3839   13.0     93   1262   18.6      0      0   =
 0.0   99.4  mmcsd0
>>    2    329    236   3839   13.2     93   1262   18.7      0      0   =
 0.0  101.1  mmcsd0s2
>>    1    153    152   2816   12.5      1     32  366.5      0      0   =
 0.0   95.2  mmcsd0s2a
>>    1    176     84   1023   14.5     92   1230   14.9      0      0   =
 0.0   66.8  mmcsd0s2b
>>    1    153    152   2816   12.5      1     32  366.6      0      0   =
 0.0   95.3  ufs/rootfs
>> Tue Mar 24 04:20:51 PDT 2020
>> Device          1K-blocks     Used    Avail Capacity
>> /dev/mmcsd0s2b    4404252   296976  4107276     7%
>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 21842, size: 12288
>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: =
bufobj: 0, blkno: 25477, size: 4096
>> 0/1032/1032/19177 mbuf clusters in use (current/cache/total/max)
>>=20
>> For the above, the lines with ms/r and ms/w figures are back
>> but the indefinite wait buffer for "Mar 24 04:20:50" happened
>> too, before the "Tue Mar 24 04:20:51 PDT 2020" above. I'd guess
>> they are from before the data that the gstat output is based on.
>>=20
>> Also there are lots of reads and some writes in the above
>> gstat output. mmcsd0s2b is again showing as well.
>>=20
>> Note: It looks like  ufs/rootfs might be a label
>> identifying mmcsd0s2a, so they track up to slight
>> time-of-information differences. I'm guessing that
>> mmcsd0s2b is just a swap/paging partition.
>>=20
>=20
> That's correct. =20
>=20
>> So it looks like activity for ufs/rootfs may have
>> blocked/delayed activity for paging/swaping
>> (mmcsd0s2b), at least for a time, even if it is
>> not directly the cause of the specific OOM
>> activity.
>>=20
>=20
> How would one distinguish slow swap i/o from slow
> filesystem i/o ?=20

The 26sec+ and 30sec+ examples do not list mmcsd0s2b
but do list mmcsd0s2a (and ufs/rootfs). Presumably
that means no reas/writes to mmcsd0s2b in the time
frame spanned. That is what I was referring to. I
was not comparing against mmcsd0s2b I/O activity.

> Would moving swap to a USB device avoid blocking=20
> by writes to microSD? That's not hard to try, but
> it puts a whole new kettle of fish on the fire.

Presuming the USB device is well performing,
splitting the load across buses might well
help. A good USB SSD might handle both
activities better than involving the microsd
card at all(?).

It is possible to have only /boot/ materials
on ufs on the microsd card and so to have the
(ufs) root file system on the USB.

For example, in the Pine64+2GB's /boot/loader.conf
on the microsd card I have:

vfs.root.mountfrom=3D"ufs:/dev/gpt/PINE642Groot"

That label identifies the USB SSD's ufs partition
holding the root file system. On that file system
is an empty directory /microsd_ufs used as a mount
point to get to the microsd card's ufs-based /boot/ .

# gpart show -p
=3D>       63  249737153    mmcsd0  MBR  (119G)
         63      16380            - free -  (8.0M)
      16443     131040  mmcsd0s1  fat32lba  [active]  (64M)
     147483        997            - free -  (499K)
     148480  241172480  mmcsd0s2  freebsd  (115G)
  241320960    8416256            - free -  (4.0G)

=3D>        0  241172480   mmcsd0s2  BSD  (115G)
          0  230686720  mmcsd0s2a  freebsd-ufs  (110G)
  230686720   10485760             - free -  (5.0G)

=3D>       40  468862048    da0  GPT  (224G)
         40       2008         - free -  (1.0M)
       2048  413138944  da0p1  freebsd-ufs  (197G)
  413140992    6291456  da0p2  freebsd-swap  (3.0G)
  419432448    6291456  da0p4  freebsd-swap  (3.0G)
  425723904   43138184         - free -  (21G)

# gpart show -pl
=3D>       63  249737153    mmcsd0  MBR  (119G)
         63      16380            - free -  (8.0M)
      16443     131040  mmcsd0s1  (null)  [active]  (64M)
     147483        997            - free -  (499K)
     148480  241172480  mmcsd0s2  (null)  (115G)
  241320960    8416256            - free -  (4.0G)

=3D>        0  241172480   mmcsd0s2  BSD  (115G)
          0  230686720  mmcsd0s2a  (null)  (110G)
  230686720   10485760             - free -  (5.0G)

=3D>       40  468862048    da0  GPT  (224G)
         40       2008         - free -  (1.0M)
       2048  413138944  da0p1  PINE642Groot  (197G)
  413140992    6291456  da0p2  PINE642Gswap  (3.0G)
  419432448    6291456  da0p4  PINE642Gswp2  (3.0G)
  425723904   43138184         - free -  (21G)

After booting it looks like:

# df -m
Filesystem               1M-blocks  Used  Avail Capacity  Mounted on
/dev/gpt/PINE642Groot       195378 34613 145135    19%    /
devfs                            0     0      0   100%    /dev
/dev/label/PINE64P2Groot    109101    75 100297     0%    /microsd_ufs
/dev/label/PINE642GAboot        63    43     20    69%    /boot/efi

(The Pine64+2GB uses a /boot/efi/ instead of a /boot/msdos/ .)

# swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/PINE642Gswap   3145728        0  3145728     0%
/dev/gpt/PINE642Gswp2   3145728        0  3145728     0%
Total             6291456        0  6291456     0%

(I'm not suggesting that having 2 swap partitions
is important, it just is what I happen to have.)

# dumpon -l
gpt/PINE642Gswp2,gpt/PINE642Gswp2

(The duplication is a result of something I
report later.)

I also have in the USB SSD's ufs file system's
/etc/fstab :

/dev/label/PINE64P2Groot        /microsd_ufs    ufs rw,noatime          =
1 1

(Again, I used labeling, in this case glabel
based on the microsd card. Having a /boot/ is
a subset of having a full root file system and
I still used "root" terminology in the label.)

I keep a /boot/ on the USB SSD and update the
microsd card's copy from the USB copy:

# more /boot/copy_boot_to_microsd.sh=20
rsync -axHh --info=3Dprogress2 --exclude=3D/boot/entropy --delete -r =
/boot /microsd_ufs/

(I do not have /boot/entropy on the USB SSD,
just on the microsd card at its /boot/entropy .
The exclude prevents the --delete from touching
the /boot/entropy file on the microsd card.)

I also have in /etc/rc.conf (so: on the USB SSD):

entropy_boot_file=3D"/microsd_ufs/boot/entropy"

For reference:

# ls -ldT /microsd_ufs/*
-r--r--r--   1 root  wheel  6170 Feb  1 04:48:34 2020 =
/microsd_ufs/COPYRIGHT
drwxr-xr-x  11 root  wheel  1024 Mar 24 19:06:27 2020 /microsd_ufs/boot

(The COPYRIGHT file is, by content, the normal one
that is on a root filesystem.)

# ls -ldT /microsd_ufs/*/*
-r--r--r--  1 root  wheel    3547 Mar 14 15:05:54 2020 =
/microsd_ufs/boot/beastie.4th
. . .
-rw-------  1 root  wheel    4096 Dec 31 16:00:25 2009 =
/microsd_ufs/boot/entropy
. . .
drwxr-xr-x  2 root  wheel     512 Dec 23 22:17:51 2019 =
/microsd_ufs/boot/zfs

(The date/time for /microsd_ufs/boot/entropy and /entropy
are from before time is adjusted via ntp, so the dates
can be odd.)

I also use:

dumpdev=3D"/dev/gpt/PINE642Gswp2"

in /boot/loader.conf (both media) and in /etc/rc.conf .
(This leads to the duplication noted earlier.)

The same media-pair (microsd card plus USB SSD) can
boot the RPi3. For the Pine64+2GB and RPi3, these days
I use a small powered USB hub instead of directly
plugging in the USB SSD to the arm-based board. I
had power problems otherwise, at least on the
Pine64+2GB in recent times.

Technically, I use a USB3 hub and a USB3 SSD, both of
which happen to allow USB2 use as well.

The OPi+2e (armv7) and Rock64 (aarch64) are set up
similarly. In all cases the USB SSDs used happen to
perform better than the microsd cards used. (On the
Rock64, the USB SSD also out performs the e.MMC that
is in use --and the e.MMC outperforms the microsd
card.)

(The Rock64 USB3 is not operational yet. As stands,
it too is a USB2 based context in my use.)

I do not have other external USB devices involved.
Mulitple external USB devices may share an internal
hub on some arm boards, leading to a lack of
multi-channel performance gains for multiple drives.


> It should be added that the original buildworld has been restarted =
with
> vm.pfault_oom_attempts: -1 and so far, though it still generates the
> "indefinite wait" warnings, the ms/w times look better. It's not done =
yet,
> though. How much overhead comes with OOMA?

I doubt the ms/w figures span execution of the code
that deals with OOM kill decisions or retries logic.
But vm.pfault_oom_attempts=3D=3D-1 would avoid any
retries, if I understand right. That in turn might
put less of a load on the microsd card? (Guess
work on my part.)


=3D=3D=3D
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0FF6BC4C-296F-49F3-8FB8-AA87A49349E2>