Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 24 Apr 2024 12:01:26 +0200
From:      Alexander Leidinger <Alexander@Leidinger.net>
To:        Gleb Smirnoff <glebius@freebsd.org>
Cc:        Current <current@freebsd.org>
Subject:   Re: Strange network/socket anomalies since about a month
Message-ID:  <d915e7b730f59a531d03b027215f81e0@Leidinger.net>
In-Reply-To: <ZiaMgRH-8vecPfSt@cell.glebi.us>
References:  <1fe609f252e7fae6d746530d5035ec0e@Leidinger.net> <ZiaMgRH-8vecPfSt@cell.glebi.us>

next in thread | previous in thread | raw e-mail | index | archive | help
This is an OpenPGP/MIME signed message (RFC 4880 and 3156)

--=_6b985b0e8816937697b536ffdd21ad4a
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII;
 format=flowed

Am 2024-04-22 18:12, schrieb Gleb Smirnoff:

> There were several preparatory commits that were not reverted and one 
> of them
> had a bug.  The bug manifested itself as failure to send(2) zero bytes 
> over
> unix/stream.  It was fixed with 
> e6a4b57239dafc6c944473326891d46d966c0264. Can
> you please check you have this revision? Other than that there are no 
> known
> bugs left.

Yes, I have this fix in my running kernel.

> A> Any ideas how to track this down more easily than running the entire
> A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?
> 
> I don't have any better idea than ktrace over failing application.  
> Yep, I
> understand that poudriere will produce a lot.  But first we need to 
> determine

Yes, it does. 4.4 GB just for the start of poudriere until the first 
package build fails due to a failed sccache start (luckily in the first 
builder, but I had at least 2 builders automatically spin up by 
poudriere at the time when I validated the failure in the logs and 
disabled the tracing).

> what syscall fails and on what type of socket.  After that we can scope 
> down to
> using dtrace on very particular functions.

I'm not sure I manage to find the cause of the failure... the only thing 
which remotely looks like an issue is "Resource temporarily 
unavailable", but this is from the process which waits for the server to 
have started:
---snip---
  58406 sccache  1713947887.504834367 RET   __sysctl 0
  58406 sccache  1713947887.505521884 CALL  
rfork(0x80000000<><invalid>2147483648)
  58406 sccache  1713947887.505757777 CAP   system call not allowed: 
rfork
  58406 sccache  1713947887.505774176 RET   rfork 58426/0xe43a
  58406 sccache  1713947887.507304865 CALL  
compat11.kevent(0x3,0x371d360f89e8,0x2,0x371d360f89e8,0x2,0)
  58406 sccache  1713947887.507657906 STRU  struct freebsd11_kevent[] = { 
{ ident=11, filter=EVFILT_READ, flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>, 
fflags=0, data=0, udata=0x0 }
              { ident=11, filter=EVFILT_WRITE, 
flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>, fflags=0, data=0, udata=0x0 } }
  58406 sccache  1713947887.507689980 STRU  struct freebsd11_kevent[] = { 
{ ident=11, filter=EVFILT_READ, flags=0x4000<EV_ERROR>, fflags=0, 
data=0, udata=0x0 }
              { ident=11, filter=EVFILT_WRITE, flags=0x4000<EV_ERROR>, 
fflags=0, data=0, udata=0x0 } }
  58406 sccache  1713947887.507977155 RET   compat11.kevent 2
  58406 sccache  1713947887.508015751 CALL  write(0x5,0x371515685bcc,0x1)
  58406 sccache  1713947887.508086434 GIO   fd 5 wrote 1 byte
        0x0000 01                                       |.|

  58406 sccache  1713947887.508145930 RET   write 1
  58406 sccache  1713947887.508183140 CALL  
compat11.kevent(0x7,0,0,0x5a5689ab0c40,0x400,0)
  58406 sccache  1713947887.508396614 STRU  struct freebsd11_kevent[] = { 
  }
  58406 sccache  1713947887.508156537 STRU  struct freebsd11_kevent[] = { 
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>, 
fflags=0, data=0x1, udata=0xffffffffffffffff } }
  58406 sccache  1713947887.508530888 RET   compat11.kevent 1
  58406 sccache  1713947887.508563736 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947887.508729102 GIO   fd 4 read 1 byte
        0x0000 01                                       |.|

  58406 sccache  1713947887.508967661 RET   read 1
  58406 sccache  1713947887.508996753 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947887.509028311 RET   read -1 errno 35 Resource 
temporarily unavailable
  58406 sccache  1713947887.509068838 CALL  
compat11.kevent(0x3,0,0,0x5a5689a97540,0x400,0x371d3a2887c8)
...
  58406 sccache  1713947897.514352552 CALL  
_umtx_op(0x5a5689a3d290,0x10<UMTX_OP_WAKE_PRIVATE>,0x7fffffff,0,0)
  58406 sccache  1713947897.514383653 RET   _umtx_op 0
  58406 sccache  1713947897.514421273 CALL  write(0x5,0x371515685bcc,0x1)
  58406 sccache  1713947897.515050967 STRU  struct freebsd11_kevent[] = { 
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>, 
fflags=0, data=0x1, udata=0xffffffffffffffff } }
  58406 sccache  1713947897.515146151 RET   compat11.kevent 1
  58406 sccache  1713947897.515178978 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947897.515368070 GIO   fd 4 read 1 byte
        0x0000 01                                       |.|

  58406 sccache  1713947897.515396600 RET   read 1
  58406 sccache  1713947897.515426523 CALL  read(0x4,0x371d3a2887c0,0x80)
  58406 sccache  1713947897.515457073 RET   read -1 errno 35 Resource 
temporarily unavailable
  58406 sccache  1713947897.515004494 GIO   fd 5 wrote 1 byte
        0x0000 01                                       |.|
---snip---

https://www.leidinger.net/test/sccache.tar.bz2 contains the parts of the 
trace of the sccache processes (in case someone wants to have a look).

The poudriere run had several builders in parallel, at least 2 were 
running at that point in time. What the overlay does is to startup 
(sccache --start-server) the sccache server process (forks to return 
back on the command line) which creates a file system socket, and then 
it queries the stats (sccache --show-stats). So some of the traces in 
the tarball are the server start (those with "Timed out waiting for 
server startup" are maybe the processes which fork to start the server 
and wait for it to be started), some are the stat-query, and some seem 
to be a successful start in another poudriere-builder (those with a 
successful /root/.ccache/sccache/5/4/<hashvalue> access look like from a 
successful start in another jail). Maybe there is also a --stop-server 
from poudriere somewhere.

What I noticed (except that printing the new CAP stuff for non-CAP 
enabled processes by default is disturbing) is, that compat11 stuff is 
called (seems the rust ecosystem is not keeping up with our speed of 
development...). Not sure if it matters here that some compat stuff is 
called.

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF

--=_6b985b0e8816937697b536ffdd21ad4a
Content-Type: application/pgp-signature;
 name=signature.asc
Content-Disposition: attachment;
 filename=signature.asc;
 size=833
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEER9UlYXp1PSd08nWXEg2wmwP42IYFAmYo2IQACgkQEg2wmwP4
2IbLtRAAnQduZ2y4IiSCZmbfWHVePBv4ZX8tAjezWkMig6tpM3qPmpqBBEDrpsNJ
l2mNMNPEeUpx7ThU4pNqBvVkYUkHGrXpI/w/i4xHO5cj/tOHANkRlQnf9WZWEqKM
muvYsCz5ACvduiqYBkP2Z/Y+BD2ZoYljt7ALVUenlqGvmdWyCtDPXVP7/4ZqCCAS
gveuWUY6ClX9AKxDTh46HjY1Nx3fiM8iNtP4ZTciHrXt++wmTJoPGODMxJeP0/Vd
DvRTXM/zqeDKzQAhOcaa9CFoVXGq81WGgTSLU8NBHCS5Z/jClp3eiCfTtunL2Cn3
TQpvQEg2Vqz/eLWEn16d1OuMtd4PcY7aQSkMnyoVRPcg2Vu3I6rO/G74AGYdpKER
btfIgpovbtBMz/TOF4gzBni+TsPpmWGjzXppbyRjKB7/Yw4vT3fUvtjMBMe7Iflt
CERjAM9DVwaAx22xLPYhxaSoKe/dW7efYRVzshXTHRfr6rpt6w97OxG82C8DMJcd
RAnqt2Gz62tVpjKrvyYtzeOpiOPV4u3MYmttebSjWPiNR7MAGPQISeq1pF9a7zK3
IVFVGru0ptCOKtdAEk0MMa647WbRCX1V2+fpsFiuk7qxyIZ6HvqGhNxU7LmmMeI/
q0dqRQm3952Rm5SwllDouPMC77Gmnd5rJYrR/Mh3vQakIS2SbY0=
=Ql0c
-----END PGP SIGNATURE-----

--=_6b985b0e8816937697b536ffdd21ad4a--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?d915e7b730f59a531d03b027215f81e0>