Date: Mon, 20 Jun 2005 10:53:19 +0200
From: Eirik Øverby <ltning@anduin.net>
To: Robert Watson <rwatson@FreeBSD.org>
Subject: Re: NFS-related hang in 5.4?
Message-ID: <A9D88C9D-B3F4-4FD3-A210-06A59EA15787@anduin.net>
Resent-Message-ID: <D6049B6F-31D3-42E9-ADDC-C5C092C9AC78@anduin.net>

On 20. jun. 2005, at 10.38, Robert Watson wrote:

> On Mon, 20 Jun 2005, Eirik Øverby wrote:
>
>>> Hmm. Looks like a bug in dummynet. ipfw should not be directly
>>> re-injecting UDP traffic back into the input path from an outbound
>>> path, or it risks re-entering, generating lock order problems, etc.
>>> It should be getting dropped into the netisr queue to be processed
>>> from the netisr context.
>>
>> This problem would exist across all 5.4 installations, both i386 and
>> amd64? Would it depend on heavy load, or could it theoretically
>> happen at any time when there's traffic? All three of my fbsd5
>> servers (dual opteron, dual p3-1ghz, dual p3-700mhz) are experiencing
>> random hangs a few weeks apart; my impression is that when running in
>> single-cpu mode they are all stable. All use dummynet in a
>> comparable manner. Ideas?
>
> Yes. Basically, the network stack avoids recursion in processing for
> "complicated" packets by deferring processing of an offending packet
> to a thread called the 'netisr'. Whenever the stack reaches a possible
> recursion point on a packet, it's supposed to queue the packet for
> processing 'later' in a per-protocol queue, unwind, and then when the
> netisr runs, pick up and continue processing. In the stack trace you
> provide, dummynet appears to immediately invoke the in-bound network
> path from the out-bound network path, walking back into the network
> stack from the outbound path. This is generally forbidden, for a
> variety of reasons:
>
> - We do allow the in-bound path to call the out-bound path, so that
>   protocols like TCP, and services like NFS, can turn around packets
>   without a context switch. If further recursion is permitted, the
>   stack may overflow.
>
> - Both paths may hold network stack locks over calls in either
>   direction -- specifically, we allow protocol locks to be held over
>   calls into the socket layer, as the protocol layer drives operation;
>   if a recursive call is made, deadlocks can occur due to violating
>   the lock order. This is what is happening in your case.
>
> Pretty much all network code is entirely architecture-independent, so
> bugs typically span architectures, although race conditions can
> sometimes be hard to reproduce if they require precise timing and
> multiple processors.

So I'm lucky to have seen this one... Great ;)

>>> Is it possible to configure dummynet out of your configuration, and
>>> see if the problem goes away?
>>
>> I'm running a test right now, will let you know in the morning.
>
> Thanks.

I know enough not to call this a "confirmation", but disabling dummynet
did indeed allow me to finish the backup. I never made it past 15GB
before; now the full 19GB tar.gz file is done, and the boxes are both
still running.

The funny thing is - I only disabled dummynet on one of the boxes now -
the source of the backup, the box that pushes data. The other box has
pretty much 100% the same setup, and is also i386. But as traffic
shaping can only happen on outgoing packets, I suppose that makes sense.

I can try re-running the test if you wish, in order to gain more
statistics. It's just too bad it takes a while ;)

/Eirik

> Robert N M Watson
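
The queue-and-unwind rule Robert describes can be illustrated with a small user-space sketch. This is a toy model only, not FreeBSD kernel code: every name in it (`proto_lock_held`, `output_path`, `input_path`, `defer_enqueue`, `netisr_drain`) is hypothetical, and the lock-order violation is simplified to a single non-recursive lock that the output path holds when the packet reaches the shaper. Direct re-injection runs the input path with that lock still held; deferral queues the packet and lets the output path unwind first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8

/* Hypothetical stand-in for a non-recursive protocol lock. */
static bool proto_lock_held = false;
static int lock_violations = 0; /* input path entered with the lock held */
static int delivered = 0;       /* packets the input path fully processed */

/* Deferred-work queue, playing the role of a per-protocol netisr queue. */
static int queue[QUEUE_CAP];
static size_t q_head = 0, q_tail = 0;

static bool defer_enqueue(int pkt) {
    if (q_tail - q_head == QUEUE_CAP)
        return false;               /* queue full: packet is dropped */
    queue[q_tail++ % QUEUE_CAP] = pkt;
    return true;
}

/* Input path: needs the protocol lock for itself. */
static void input_path(int pkt) {
    (void)pkt;
    if (proto_lock_held) {
        /* With a real non-recursive lock this would deadlock;
         * the sketch only counts the violation and bails out. */
        lock_violations++;
        return;
    }
    proto_lock_held = true;
    delivered++;
    proto_lock_held = false;
}

/* Output path: holds the lock while the packet passes the shaper hook. */
static void output_path(int pkt, bool reinject_directly) {
    assert(!proto_lock_held);
    proto_lock_held = true;
    if (reinject_directly)
        input_path(pkt);            /* forbidden: re-enters the stack */
    else
        (void)defer_enqueue(pkt);   /* queue the packet, then unwind */
    proto_lock_held = false;
}

/* Runs later with no locks held, like the netisr thread.
 * Returns the number of packets handed to the input path. */
static int netisr_drain(void) {
    int n = 0;
    while (q_head != q_tail) {
        input_path(queue[q_head++ % QUEUE_CAP]);
        n++;
    }
    return n;
}
```

Calling `output_path(pkt, true)` records a lock violation and delivers nothing, while `output_path(pkt, false)` followed by `netisr_drain()` delivers the packet cleanly once the output path has released its lock.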
