From owner-freebsd-stable@FreeBSD.ORG  Thu Sep 30 13:59:47 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4225B1065679
	for <freebsd-stable@freebsd.org>; Thu, 30 Sep 2010 13:59:47 +0000 (UTC)
	(envelope-from 000.fbsd@quip.cz)
Received: from elsa.codelab.cz (elsa.codelab.cz [94.124.105.4])
	by mx1.freebsd.org (Postfix) with ESMTP id ABEC78FC1B
	for <freebsd-stable@freebsd.org>; Thu, 30 Sep 2010 13:59:46 +0000 (UTC)
Received: from elsa.codelab.cz (localhost.codelab.cz [127.0.0.1])
	by elsa.codelab.cz (Postfix) with ESMTP id 5E20819E030;
	Thu, 30 Sep 2010 15:59:45 +0200 (CEST)
Received: from [192.168.1.2] (ip-86-49-61-235.net.upcbroadband.cz
	[86.49.61.235])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by elsa.codelab.cz (Postfix) with ESMTPSA id AEC8919E02D;
	Thu, 30 Sep 2010 15:59:41 +0200 (CEST)
Message-ID: <4CA497CD.1020609@quip.cz>
Date: Thu, 30 Sep 2010 15:59:41 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
	rv:1.9.1.13) Gecko/20100914 SeaMonkey/2.0.8
MIME-Version: 1.0
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
References: <4CA22FF0.8060303@quip.cz>
	<20100928184343.GA70384@icarus.home.lan>	<4CA25718.2000101@quip.cz>
	<20100929072316.GA82514@icarus.home.lan>
In-Reply-To: <20100929072316.GA82514@icarus.home.lan>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-stable@freebsd.org
Subject: Re: fetch: Non-recoverable resolver failure
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Sep 2010 13:59:47 -0000

Jeremy Chadwick wrote:
> On Tue, Sep 28, 2010 at 10:59:04PM +0200, Miroslav Lachman wrote:
>> Jeremy Chadwick wrote:
>>> On Tue, Sep 28, 2010 at 08:12:00PM +0200, Miroslav Lachman wrote:
>>>> Hi,
>>>>
>>>> we are using fetch command from cron to run PHP scripts periodically
>>>> and sometimes cron sends error e-mails like this:
>>>>
>>>> fetch: https://hiden.example.com/cron/fiveminutes: Non-recoverable
>>>> resolver failure
>>
>> [...]
>>
>>>> Note: target domains are hosted on the server it-self and named too.
>>>>
>>>> The system is FreeBSD 7.3-RELEASE-p2 i386 GENERIC
>>>>
>>>> Can somebody help me to diagnose this random fetch+resolver issue?

[...]

>> There is PF with some basic rules, mostly blocking incomming
>> packets, allowing all outgoing and scrubbing:
>>
>> scrub in on bge1 all fragment reassemble
>> scrub out on bge1 all no-df random-id min-ttl 24 max-mss 1492
>> fragment reassemble
>>
>> pass out on bge1 inet proto udp all keep state
>> pass out on bge1 inet proto tcp from 1.2.3.40 to any flags S/SA
>> modulate state
>> pass out on bge1 inet proto tcp from 1.2.3.41 to any flags S/SA
>> modulate state
>> pass out on bge1 inet proto tcp from 1.2.3.42 to any flags S/SA
>> modulate state
>>
>> modified PF options:
>>
>> set timeout { frag 15, interval 5 }
>> set limit { frags 2500, states 5000 }
>> set optimization aggressive
>> set block-policy drop
>> set loginterface bge1
>> # Let loopback and internal interface traffic flow without restrictions
>> set skip on lo0
>
> Please also provide "pfctl -s info" output, in addition to uname -a
> output (you can hide the hostname), since the pf stack differs depending
> on what FreeBSD version you're using.

# pfctl -s info
No ALTQ support in kernel
ALTQ related functions disabled
Status: Enabled for 32 days 11:31:02          Debug: Urgent

Interface Stats for bge1              IPv4             IPv6
   Bytes In                     37064314787                0
   Bytes Out                   279633869976                0
   Packets In
     Passed                       214057477                0
     Blocked                        1180125                0
   Packets Out
     Passed                       272266744                0
     Blocked                         128777                0

State Table                          Total             Rate
   current entries                      181
   searches                       518860439          184.9/s
   inserts                         16608172            5.9/s
   removals                        16607991            5.9/s
Counters
   match                           17951131            6.4/s
   bad-offset                             0            0.0/s
   fragment                              23            0.0/s
   short                                  0            0.0/s
   normalize                              4            0.0/s
   memory                                 0            0.0/s
   bad-timestamp                          0            0.0/s
   congestion                             0            0.0/s
   ip-option                              0            0.0/s
   proto-cksum                         3095            0.0/s
   state-mismatch                     16707            0.0/s
   state-insert                           0            0.0/s
   state-limit                            0            0.0/s
   src-limit                              0            0.0/s
   synproxy                               0            0.0/s


uname:
7.3-RELEASE-p2 FreeBSD 7.3-RELEASE-p2 #0: Mon Jul 12 19:04:04 UTC 2010 
    root@i386-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  i386

> Things that catch my eye as potential problems -- I don't have a way to
> confirm these are responsible for your issue (DNS resolver lookups are
> UDP-based, not TCP), but I want to point them out anyway.
>
> 1) "modulate state" is broken on FreeBSD.  Taken from our pf.conf notes:
>
> # Filtering (public interface only; see "set skip")
> #
> # NOTE: Do not use "modulate state", as it's known to be broken on FreeBSD.
> # http://lists.freebsd.org/pipermail/freebsd-pf/2008-March/004227.html
>
> 2) "optimization aggressive" sounds dangerous given what pf.conf(5) says
> about it.  I'd like to know what it considers "idle".
>
> 3) I would also remove many of the options you have set in your "scrub
> out" rule.  Starting with a clean slate to see if things improve is
> probably a good idea.  As you'll see below, sometimes pf does things
> which may be correct per IP specification but don't work quite right
> with other vendors' IP stacks.
>
> 4) Your "set timeout" values look to be extreme.  I would recommend
> leaving these at their defaults given your situation.
>
> 5) This feature is not in use in your pf.conf, but I want to point out
> regardless.  "reassemble tcp" is also broken in some way.  Again taken
> from our pf.conf notes:
>
> # Normalization -- resolve/reduce traffic ambiguities.
> #
> # NOTE: Do NOT use 'reassemble tcp' as it definitely causes breakage.
> # Issue may be related to other vendors' IP stacks, so let's leave it
> # disabled.

Thank you for all your hints about PF! Maybe it's time to consider 
refactoring our standard pf.conf which was made years ago...


The original problem seems to be problem of how resolver on FreeBSD 7.3 
works. This machine was upgraded from 7.2 few weeks ago and we had not 
this problem before.

I added '|| dig hiden.example.com' to the crontab so I get dig output in 
the case of fetch failure:

*/5  * * * * fetch -qo /dev/null 
"https://hiden.example.com/cron/fiveminutes"  || dig hiden.example.com

The domain has TTL set to 360 seconds and each fetch "Non-recoverable 
resolver failure" is exactly in the time when TTL was expired and new 
query to authoritative nameservers must be done:

; <<>> DiG 9.4.-ESV <<>> hiden.example.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30191
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;hiden.example.com.	IN	A

;; ANSWER SECTION:
hiden.example.com. 360 IN	CNAME	server.example.com.
server.example.com.	360	IN	A	1.2.3.49

;; AUTHORITY SECTION:
example.com.		224	IN	NS	ns1.ignum.com.
example.com.		224	IN	NS	ns2.ignum.cz.

;; Query time: 395 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 30 11:30:16 2010
;; MSG SIZE  rcvd: 135

Note: real domains and IPs were replaced with example.com / 1.2.3.49


I made some easy script to run dig queries to affected domains each 3 
minutes from cron with logging to file. The script is in use for one day 
and did not log any error response (resolving by dig command works fine) 
and we got only one occurence of fetch "Non-recoverable resolver 
failure" in the time when cached DNS entry expired (the above one), this 
is coincidence where diq query from script was made in the same time as 
fetch job. The same DNS answere was e-mailed from cron and loggend in to 
file by the script.

So my thought is that DNS cache server (locally running BIND) is working 
fine, authoritative nameservers too, but resolving the domain for the 
first time and passing the reply to the fetch fails for unknown reason. 
I will try to use curl or wget instead of fetch to see if the symptoms 
persist or not.

Miroslav Lachman