Date: Fri, 17 Jul 2009 19:12:06 GMT
From: pmc@citylink.dinoex.sub.org (Peter Much)
To: freebsd-stable@FreeBSD.ORG
Subject: Can an app crash from a single TCP packet lost in transmission?
Message-ID: <KMxxC7.32y@citylink.dinoex.sub.org>

The first thing I noticed was that my nameserver had gone. I searched
for the reason and found:

>Jul 15 04:04:52 <kern.crit> edge kernel: swap_pager_getswapspace(3): failed
< ... hundreds more of these ... >
>Jul 15 04:05:07 <kern.err> edge kernel: pid 47113 (named), uid 53, was killed: out of swap space

That didn't make sense - the machine has enough swap space. But since
this repeated every other night, I started logging ps output every
minute. And so I found a postgres database backup going weird:

03:23 70 78433 78432 0 96 0 8220 4196 - R ?? 0:22.84 pg_dump -b
< ... >
03:49 70 78433 78432 0 96 0 8220 4024 - R ?? 17:06.61 pg_dump -b
03:50 70 78433 78432 0 96 0 8220 4024 - R ?? 17:46.15 pg_dump -b
03:51 70 78433 78432 0 96 0 8220 4024 - R ?? 18:26.69 pg_dump -b
03:52 70 78433 78432 0 47 0 139292 57888 select S ?? 18:37.65 pg_dump -b
03:53 70 78433 78432 0 48 0 139292 57828 select S ?? 18:40.36 pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49 pg_dump -b
03:55 70 78433 78432 0 -20 0 401436 63232 swread DL ?? 18:43.99 pg_dump -b

That process starts with 8 MB of memory and runs like that for half an
hour; then suddenly, between 03:51 and 03:52, memory usage explodes.
And in that night it did not run out of swap space - instead it gave
an error message:

>pg_dump: Error message from server: lost synchronization with server:
> got message type "0", length 154143043
>pg_dump: The command was: COPY public.file (fileid, fileindex, jobid,
> pathid, filenameid, markid, lstat, md5) TO stdout;

But at that time the database backup is right in the middle of dumping
a db table containing lots of small records - there is no reason why
a 154 MB "message" should be transferred between server and client
while copying records of ~60 bytes each.

One other thing did happen between 03:51 and 03:52 - the DSL internet
connection disconnected, reconnected, and obtained a new IP address.
Afterwards, a script flushes and reloads an ipfw table() with the new
local addresses - and during this process one(!) packet of the database
session was dropped.

I could verify that relation: every night when there were memory
problems, a few packets from the database backup were lost during the
firewall reconfiguration - in nights when no packets were lost, there
were no memory problems.

I will now change the firewall handling to get rid of that packet
loss, but I also need a refresher on how TCP works: I thought TCP would
not be disturbed by a lost packet, but would automatically resend that
packet until an ACK is received; and I thought this would happen below
the application, so practically the application CANNOT go weird from
a lost packet...

Is there any reason why this would not be true on a localhost
connection?

rgds,
PMc
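
One observation on the error message quoted above, as a rough sketch rather than a diagnosis. It assumes the PostgreSQL frontend/backend protocol version 3 framing, in which every message starts with a 1-byte type followed by a 4-byte big-endian length. Under that assumption, the reported length 154143043 is just the four bytes TAB, "0", TAB, "C" read as an integer, and the reported type "0" is a plain ASCII digit. That would be consistent with the client having taken a piece of tab-separated COPY row data for a message header. The function names below are hypothetical and exist only for illustration.

```python
import struct

def length_as_bytes(length: int) -> bytes:
    # Assumption: the length field is a 4-byte big-endian signed integer.
    return struct.pack("!i", length)

def parse_header(buf: bytes):
    # Interpret 5 bytes as (type, length), the way a client reading
    # the stream at this position would (hypothetical helper).
    msg_type = chr(buf[0])
    (length,) = struct.unpack("!i", buf[1:5])
    return msg_type, length

# The length from the pg_dump error message, turned back into raw bytes.
raw = length_as_bytes(154143043)
print(raw)                        # b'\t0\tC'

# The same bytes, preceded by the reported type "0", parsed as a header.
print(parse_header(b"0" + raw))   # ('0', 154143043)
```

If that reading is right, the client was no longer aligned with the message boundaries in the byte stream when it tried to read the next header, and then attempted to buffer a 154 MB "message". The sketch does not say how the stream got out of step.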