Date: Thu, 30 Sep 2010 15:59:41 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: fetch: Non-recoverable resolver failure
Message-ID: <4CA497CD.1020609@quip.cz>
In-Reply-To: <20100929072316.GA82514@icarus.home.lan>
References: <4CA22FF0.8060303@quip.cz> <20100928184343.GA70384@icarus.home.lan>
 <4CA25718.2000101@quip.cz> <20100929072316.GA82514@icarus.home.lan>
Jeremy Chadwick wrote:
> On Tue, Sep 28, 2010 at 10:59:04PM +0200, Miroslav Lachman wrote:
>> Jeremy Chadwick wrote:
>>> On Tue, Sep 28, 2010 at 08:12:00PM +0200, Miroslav Lachman wrote:
>>>> Hi,
>>>>
>>>> we are using the fetch command from cron to run PHP scripts periodically
>>>> and sometimes cron sends error e-mails like this:
>>>>
>>>> fetch: https://hiden.example.com/cron/fiveminutes: Non-recoverable
>>>> resolver failure
>>
>> [...]
>>
>>>> Note: target domains are hosted on the server itself and named too.
>>>>
>>>> The system is FreeBSD 7.3-RELEASE-p2 i386 GENERIC
>>>>
>>>> Can somebody help me diagnose this random fetch+resolver issue?

[...]

>> There is PF with some basic rules, mostly blocking incoming
>> packets, allowing all outgoing and scrubbing:
>>
>> scrub in on bge1 all fragment reassemble
>> scrub out on bge1 all no-df random-id min-ttl 24 max-mss 1492
>>     fragment reassemble
>>
>> pass out on bge1 inet proto udp all keep state
>> pass out on bge1 inet proto tcp from 1.2.3.40 to any flags S/SA
>>     modulate state
>> pass out on bge1 inet proto tcp from 1.2.3.41 to any flags S/SA
>>     modulate state
>> pass out on bge1 inet proto tcp from 1.2.3.42 to any flags S/SA
>>     modulate state
>>
>> modified PF options:
>>
>> set timeout { frag 15, interval 5 }
>> set limit { frags 2500, states 5000 }
>> set optimization aggressive
>> set block-policy drop
>> set loginterface bge1
>> # Let loopback and internal interface traffic flow without restrictions
>> set skip on lo0
>
> Please also provide "pfctl -s info" output, in addition to uname -a
> output (you can hide the hostname), since the pf stack differs depending
> on what FreeBSD version you're using.

# pfctl -s info
No ALTQ support in kernel
ALTQ related functions disabled
Status: Enabled for 32 days 11:31:02          Debug: Urgent

Interface Stats for bge1              IPv4             IPv6
  Bytes In                   37064314787                0
  Bytes Out                 279633869976                0
  Packets In
    Passed                     214057477                0
    Blocked                      1180125                0
  Packets Out
    Passed                     272266744                0
    Blocked                       128777                0

State Table                          Total             Rate
  current entries                      181
  searches                       518860439          184.9/s
  inserts                         16608172            5.9/s
  removals                        16607991            5.9/s
Counters
  match                           17951131            6.4/s
  bad-offset                             0            0.0/s
  fragment                              23            0.0/s
  short                                  0            0.0/s
  normalize                              4            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                             0            0.0/s
  ip-option                              0            0.0/s
  proto-cksum                         3095            0.0/s
  state-mismatch                     16707            0.0/s
  state-insert                           0            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s

uname: 7.3-RELEASE-p2 FreeBSD 7.3-RELEASE-p2 #0: Mon Jul 12 19:04:04 UTC 2010
root@i386-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC i386

> Things that catch my eye as potential problems -- I don't have a way to
> confirm these are responsible for your issue (DNS resolver lookups are
> UDP-based, not TCP), but I want to point them out anyway.
>
> 1) "modulate state" is broken on FreeBSD. Taken from our pf.conf notes:
>
> # Filtering (public interface only; see "set skip")
> #
> # NOTE: Do not use "modulate state", as it's known to be broken on FreeBSD.
> # http://lists.freebsd.org/pipermail/freebsd-pf/2008-March/004227.html
>
> 2) "optimization aggressive" sounds dangerous given what pf.conf(5) says
> about it. I'd like to know what it considers "idle".
>
> 3) I would also remove many of the options you have set in your "scrub
> out" rule. Starting with a clean slate to see if things improve is
> probably a good idea. As you'll see below, sometimes pf does things
> which may be correct per IP specification but don't work quite right
> with other vendors' IP stacks.
>
> 4) Your "set timeout" values look to be extreme. I would recommend
> leaving these at their defaults given your situation.
>
> 5) This feature is not in use in your pf.conf, but I want to point it out
> regardless. "reassemble tcp" is also broken in some way. Again taken
> from our pf.conf notes:
>
> # Normalization -- resolve/reduce traffic ambiguities.
> #
> # NOTE: Do NOT use 'reassemble tcp' as it definitely causes breakage.
> # Issue may be related to other vendors' IP stacks, so let's leave it
> # disabled.

Thank you for all your hints about PF! Maybe it's time to consider
refactoring our standard pf.conf, which was made years ago...
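Just to have something concrete to start from, here is a rough, untested
sketch of how the outgoing rules could look with your suggestions applied:
same bge1 interface and addresses, "keep state" instead of "modulate state",
the timeout / limit / optimization overrides left at their defaults, and
the "scrub out" options trimmed down to a clean slate:

# sketch only, not our production pf.conf
set block-policy drop
set loginterface bge1
# Let loopback and internal interface traffic flow without restrictions
set skip on lo0

scrub in on bge1 all fragment reassemble
scrub out on bge1 all fragment reassemble

pass out on bge1 inet proto udp all keep state
pass out on bge1 inet proto tcp from 1.2.3.40 to any flags S/SA keep state
pass out on bge1 inet proto tcp from 1.2.3.41 to any flags S/SA keep state
pass out on bge1 inet proto tcp from 1.2.3.42 to any flags S/SA keep state

If this behaves, the no-df / min-ttl / max-mss tweaks could be re-added one
at a time if they turn out to be needed.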
The original problem seems to be in how the resolver on FreeBSD 7.3 works.
This machine was upgraded from 7.2 a few weeks ago and we did not have this
problem before.

I added '|| dig hiden.example.com' to the crontab so I get dig output in
case of a fetch failure:

*/5 * * * * fetch -qo /dev/null "https://hiden.example.com/cron/fiveminutes" || dig hiden.example.com

The domain has its TTL set to 360 seconds, and each fetch "Non-recoverable
resolver failure" occurs exactly at the moment the TTL has expired and a
new query to the authoritative nameservers must be made:

; <<>> DiG 9.4.-ESV <<>> hiden.example.com
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30191
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;hiden.example.com.                 IN      A

;; ANSWER SECTION:
hiden.example.com.      360     IN      CNAME   server.example.com.
server.example.com.     360     IN      A       1.2.3.49

;; AUTHORITY SECTION:
example.com.            224     IN      NS      ns1.ignum.com.
example.com.            224     IN      NS      ns2.ignum.cz.

;; Query time: 395 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 30 11:30:16 2010
;; MSG SIZE  rcvd: 135

Note: real domains and IPs were replaced with example.com / 1.2.3.49

I made a simple script that runs dig queries against the affected domains
every 3 minutes from cron and logs the results to a file. The script has
been in use for one day and has not logged a single error response
(resolving with dig works fine), while we got exactly one occurrence of the
fetch "Non-recoverable resolver failure", at the moment the cached DNS
entry expired (the one above); by coincidence, the dig query from the
script ran at the same time as the fetch job, so the same DNS answer was
both e-mailed from cron and logged to the file by the script.

So my thought is that the caching DNS server (locally running BIND) is
working fine, and the authoritative nameservers too, but resolving the
domain for the first time and passing the reply to fetch fails for an
unknown reason.

I will try to use curl or wget instead of fetch to see whether the symptoms
persist or not.
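Something along these lines should do for that test; it is only an untested
sketch, the log path and script location are placeholders, and the curl
fallback assumes curl from ports (ftp/curl) is installed:

#!/bin/sh
# Try fetch(1) first; on failure cross-check with curl and log what the
# local resolver returns at that moment. URL and paths are examples only.

URL="https://hiden.example.com/cron/fiveminutes"
LOG="/var/log/cron-fetch-debug.log"

stamp() {
        date "+%Y-%m-%d %H:%M:%S"
}

# normal case: fetch works, stay quiet so cron does not send mail
if fetch -qo /dev/null "$URL"; then
        exit 0
fi

# fetch failed -- try curl to see whether only fetch's resolver call is hit
if curl -sf -o /dev/null "$URL"; then
        echo "$(stamp) fetch failed but curl succeeded for $URL" >> "$LOG"
        exit 0
fi

# both failed -- record the local resolver's answer for later comparison
{
        echo "$(stamp) fetch and curl both failed for $URL"
        dig hiden.example.com @127.0.0.1
} >> "$LOG"
exit 1

The crontab entry would then call the wrapper instead of fetch directly,
e.g. */5 * * * * /usr/local/scripts/fiveminutes-wrapper.sh (the path is
only an example).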
Miroslav Lachman