Date: Mon, 22 Apr 2024 09:12:49 -0700
From: Gleb Smirnoff <glebius@freebsd.org>
To: Alexander Leidinger <Alexander@leidinger.net>
Cc: Current <current@freebsd.org>
Subject: Re: Strange network/socket anomalies since about a month
Message-ID: <ZiaMgRH-8vecPfSt@cell.glebi.us>
In-Reply-To: <1fe609f252e7fae6d746530d5035ec0e@Leidinger.net>
References: <1fe609f252e7fae6d746530d5035ec0e@Leidinger.net>

  Alexander,

On Mon, Apr 22, 2024 at 09:26:59AM +0200, Alexander Leidinger wrote:
A> For a while now I have been seeing a higher failure rate in
A> socket/network related operations. The failures are transient: running
A> the same thing again right away may or may not succeed. I'm not able to
A> reproduce this at will. Sometimes they show up.
A>
A> Examples:
A>  - poudriere runs with the sccache overlay (like ccache but also works for
A>    rust) sometimes fail to create the communication socket, and as such
A>    the build fails. I have 3 different poudriere bulk runs one after
A>    another in my build script, and when the first one fails, the second
A>    and third still run. If the first fails due to the sccache issue, the
A>    second and third may or may not fail. Sometimes the first fails and the
A>    rest are ok. Sometimes all fail, and if I then run one by hand it works
A>    (the script does the same as the manual run; the script is simply a
A>    "for type in A B C; do poudriere bulk -O sccache -j $type -f
A>    ${type}.pkglist; done" which I execute from the same shell, and the
A>    script doesn't do env-sanitizing).
A>  - A webmail interface (inet / local net -> nginx (rev-proxy) -> nginx
A>    (webmail service) -> php -> imap) sees intermittent issues.
A>    Opening the same email again directly afterwards normally works. I've
A>    also seen transient issues with pgp signing (webmail interface -> gnupg
A>    / gpg-agent on the server); simply hitting send again after a failure
A>    works fine.
A>
A> Gleb, could this be related to the socket stuff you did 2 weeks ago? My
A> world is from 2024-04-17-112537. I have noticed this since at least then,
A> but I'm not sure if the failures were there before that and I simply
A> didn't notice them. They are surely "new recently"; I hadn't seen that
A> amount of issues in January. The last two updates of current I did before
A> the last one were on 2024-03-31-120210 and 2024-04-08-112551.

What I pushed 2 weeks ago was a large rewrite of unix/stream, but it was
reverted because it turned out to need more work with respect to aio(4) and
nfs/rpc, and it also turned out that sendfile(2) over unix(4) has some
non-zero use.

There were several preparatory commits that were not reverted, and one of
them had a bug. The bug manifested itself as a failure to send(2) zero bytes
over unix/stream. It was fixed with
e6a4b57239dafc6c944473326891d46d966c0264. Can you please check that you have
this revision? Other than that there are no known bugs left.

A> I could also imagine that some memory related transient failure could cause
A> this, but with >3 GB free I do not expect this. Important here may be that
A> I have https://reviews.freebsd.org/D40575 in my tree, which is memory
A> related, but it's only a metric to quantify memory fragmentation.
A>
A> Any ideas how to track this down more easily than running the entire
A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?

I don't have any better idea than running ktrace over the failing
application. Yes, I understand that poudriere will produce a lot of output.
But first we need to determine which syscall fails and on what type of
socket. After that we can narrow things down to dtrace on very particular
functions.

-- 
Gleb Smirnoff
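
The zero-byte send(2) symptom mentioned above can be probed from userland with a few lines of Python. This is only a minimal sketch, not a regression test from the FreeBSD tree: the function name zero_byte_send_ok is made up for illustration, and it assumes a zero-length send on a unix/stream socket pair should return 0 without raising.

```python
import socket


def zero_byte_send_ok() -> bool:
    """Return True if a zero-length send over a unix/stream pair succeeds."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        try:
            sent = a.send(b"")  # zero-byte send(2) on unix/stream
        except OSError as exc:
            print(f"zero-byte send failed: {exc}")
            return False
        # Follow up with a real payload to confirm the socket is still usable.
        a.sendall(b"ping")
        return sent == 0 and b.recv(4) == b"ping"
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print("ok" if zero_byte_send_ok() else "FAILED")
```

A pass here says only that this one symptom is absent on the running kernel; it does not confirm that the fix revision is in the tree.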
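
To get from a large ktrace dump to "which syscall fails and on what type of socket", the failing returns can be filtered out of the kdump(1) text. The sketch below is an illustration under assumptions: it assumes kdump prints failed returns as "<pid> <command> RET <syscall> -1 errno <n> <message>", the syscall list is a hand-picked, incomplete set, and the name summarize_failures is hypothetical.

```python
import re
from collections import Counter

# Assumed kdump(1) shape for a failed return:
#   " 1234 sccache  RET   connect -1 errno 61 Connection refused"
RET_FAIL = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+RET\s+(\S+)\s+-1\s+errno\s+(\d+)"
)

# Illustrative, incomplete set of socket-related syscalls to keep.
SOCKET_SYSCALLS = {
    "socket", "socketpair", "bind", "listen", "connect", "accept",
    "accept4", "sendto", "sendmsg", "recvfrom", "recvmsg", "sendfile",
    "shutdown",
}


def summarize_failures(lines):
    """Count failed socket syscalls, keyed by (command, syscall, errno)."""
    counts = Counter()
    for line in lines:
        m = RET_FAIL.match(line)
        if m and m.group(3) in SOCKET_SYSCALLS:
            counts[(m.group(2), m.group(3), int(m.group(4)))] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Reads kdump text from stdin and prints the most frequent failures.
    for key, n in summarize_failures(sys.stdin).most_common():
        print(n, *key)
```

The input would come from something like "ktrace -i -f FILE" around the failing command followed by "kdump -f FILE" piped into the script. The socket type itself still has to be read from the matching socket(2) or socketpair(2) CALL lines in the dump.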