Date:      Fri, 19 Sep 2008 12:48:48 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        "Oleg V. Nauman" <oleg@opentransfer.com>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: RELENG_7: something is very wrong with UDP?
Message-ID:  <alpine.BSF.1.10.0809191241050.3922@fledge.watson.org>
In-Reply-To: <20080919143636.p661cjfopw44osco@webmail.opentransfer.com>
References:  <20080918180543.pt7s2zmaio48ww8g@webmail.opentransfer.com> <alpine.BSF.1.10.0809182005570.16464@fledge.watson.org> <20080919143636.p661cjfopw44osco@webmail.opentransfer.com>

On Fri, 19 Sep 2008, Oleg V. Nauman wrote:

>> (1) Start by deleting all but one nameserver entry in /etc/resolv.conf.
>>    Confirm that you can still reproduce the problem.
>
> Due to various reasons my laptop running local caching DNS server ( named ) 
> without any forwarders assigned. My /etc/resolv.conf contains nameserver 
> 127.0.0.1

This simplifies matters in some respects but complicates them in others.  In 
particular, it raises the question of whether the problem lies in the DNS 
resolver or in the name server.  A tcpdump of DNS traffic on lo0 would be 
quite interesting, since we could look at the timestamps and try to place the 
blame a bit more precisely.
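
As a complement to the capture, a small probe that bypasses the libc resolver 
can help separate the two suspects.  The following is only a sketch: the 
function names, the default port 53 and the 5 second timeout are assumptions 
for illustration, not something taken from this thread.  It hand-builds a 
minimal A-record query and times the reply from the local name server.

```python
import socket
import struct
import time


def build_dns_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts zero.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(name, server="127.0.0.1", port=53, timeout=5.0):
    """Return the seconds until any reply arrives, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(build_dns_query(name), (server, port))
        try:
            sock.recvfrom(4096)
        except socket.timeout:
            return None
        return time.monotonic() - start
```

Calling time_query() with a host name while the tcpdump is running gives a 
second set of timings.  If this probe is answered promptly while dig or 
logger stalls, suspicion shifts toward the resolver side; if it stalls too, 
the name server or the loopback UDP path looks more likely.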

>> Could you
>>    also use procstat -k on the dig process to generate a kernel stack trace
>>    for it?

Let's add to this list: when the problem happens, could you also run 
procstat -k on the name server process(es)?

> And procstat -kk output for logger process waiting:
>
> PID    TID COMM             TDNAME           KSTACK
> 1421 100095 logger           -                mi_switch+0x2c8 
> sleepq_switch+0xd9 sleepq_catch_signals+0x239 sleepq_wait_sig+0x14 
> _sleep+0x35f pipe_read+0x389 dofileread+0x96 kern_readv+0x58 read+0x4f 
> syscall+0x2b3 Xint0x80_syscall+0x20

Interesting -- logger is blocked on reading from a pipe, likely standard 
input.  So it sounds like something else is failing to complete in a timely 
manner -- perhaps due to DNS.
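
The reading of that trace can be sketched mechanically: skip the generic 
sleep and scheduler frames and report the first function below them.  This is 
an illustration only; the helper names and the list of prefixes treated as 
"generic" are assumptions chosen to match the trace quoted above.

```python
def parse_kstack(kstack):
    """Split a procstat -kk KSTACK string into (function, offset) pairs."""
    frames = []
    for token in kstack.split():
        func, _, offset = token.partition("+")
        frames.append((func, int(offset, 16) if offset else 0))
    return frames


def first_frame_outside(frames, prefixes=("mi_switch", "sleepq_", "_sleep")):
    """Return the first function that is not generic sleep machinery.

    The prefix list is a heuristic assumption, not an exhaustive set.
    """
    for func, _ in frames:
        if not func.startswith(prefixes):
            return func
    return None


# KSTACK column for the logger process, as quoted in this message.
LOGGER_KSTACK = (
    "mi_switch+0x2c8 sleepq_switch+0xd9 sleepq_catch_signals+0x239 "
    "sleepq_wait_sig+0x14 _sleep+0x35f pipe_read+0x389 dofileread+0x96 "
    "kern_readv+0x58 read+0x4f syscall+0x2b3 Xint0x80_syscall+0x20"
)

blocked_in = first_frame_outside(parse_kstack(LOGGER_KSTACK))
```

For the logger trace, blocked_in comes out as pipe_read, which is the basis 
for saying the process is waiting on a pipe.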

>> This is approximately the date of my last UDP MFC.  Could you try backing 
>> out just src/sys/netinet6/udp6_usrreq.c revision 1.81.2.7 and see if that 
>> helps? (specifically, restore the use of sosend_generic instead of 
>> sosend_dgram)

If you can show that it's definitely a problem with the change to sosend_dgram 
for UDPv6 socket send, then it might suggest that the problem is related to 
the UDPv46 code there.  In that case I will propose that we back out that 
portion of the change in the 7-stable branch until it's known to be 
resolved -- I don't want other people tripping over this.

>> Could you try compiling your kernel with WITNESS to see if we get any 
>> extended debugging information?
>
> Have added WITNESS ( and STACK required by procstat ) options but it is not 
> producing any output ( so no LORs or something like this )

OK.  Could you try adding INVARIANT_SUPPORT and INVARIANTS if they aren't 
there?  Be aware: this may convert the wedging you are experiencing into a 
kernel panic.

>>> Is anybody experiencing the same issues with fresh RELENG_7? Unsure it is 
>>> my local issues though
>> 
>> I'm not experiencing them, but these sorts of things can be quite subtle 
>> and workload-dependent.
>
> Well experiencing this issue during the system boot even..

OK.  So there must be something a bit different about your setup -- perhaps 
there's something specific about the way things are interacting over the 
loopback address for the name server.  Is this the stock system BIND9 or 
something else?  Are you able to temporarily switch to an external name server 
and see if that changes things?

Robert N M Watson
Computer Laboratory
University of Cambridge


