From: Krzysztof Jędruczyk
To: freebsd-stable@freebsd.org
Date: Tue, 04 Nov 2008 13:10:40 +0100
Subject: PostgreSQL stats collector eats all CPU time

Recently PostgreSQL on our database server started misbehaving: after running for some time, the stats collector process uses 100% of the CPU time, exactly as someone reported here:
http://groups.google.com/group/pgsql.general/browse_thread/thread/6dfea591d243e987

No solution is provided there, though; a kernel or libc bug is suggested.

I'm not sure how relevant it is, but the problem first appeared a day or two after the server was upgraded with an additional processor: it is now 2x dual-core Opteron with 8GB of RAM. For some reason we didn't see this problem back when it was just one dual-core Opteron with 4GB of RAM. It is the amd64 version of FreeBSD, of course.

As the person who previously reported the problem on the PostgreSQL mailing list showed, the stats collector busy-loops in an interrupted poll call. kdump contains output like this:

   878 postgres 0.009643 CALL poll(0x7fffffffd4e0,0x1,0x7d0)
   878 postgres 0.009671 RET poll -1 errno 4 Interrupted system call
   878 postgres 0.009675 CALL poll(0x7fffffffd4e0,0x1,0x7d0)
   878 postgres 0.009687 RET poll -1 errno 4 Interrupted system call
   878 postgres 0.009691 CALL poll(0x7fffffffd4e0,0x1,0x7d0)
   878 postgres 0.009700 RET poll -1 errno 4 Interrupted system call

I also grabbed a core dump of the postmaster process, and the backtrace seems a little weird to me:

   #0 0x00000008012186cc in poll () from /lib/libc.so.7
   [New Thread 0x801601120 (LWP 100209)]
   [New LWP 54785]
   (gdb) bt
   #0 0x00000008012186cc in poll () from /lib/libc.so.7
   #1 0x000000080107c85e in poll () from /lib/libthr.so.3
   #2 0x0000000000578bd0 in pgstat_start ()
   #3 0x000000000057d2b5 in PostmasterMain ()
   #4
   #5 0x0000000801268cdc in select () from /lib/libc.so.7
   #6 0x000000080107c574 in select () from /lib/libthr.so.3
   #7 0x000000000057aaa3 in ClosePostmasterPorts ()
   #8 0x000000000057be9e in PostmasterMain ()
   #9 0x00000000005358fe in main ()

If I'm reading it right, the constantly interrupted poll function is being called from the signal handler?

Any suggestions on what else to do to identify the problem? It seems that the situation is reproducible: after a server restart it happened again within one day.

-- 
Best regards,
Krzysztof Jędruczyk
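
P.S. To put numbers on the busy loop, here is a minimal sketch in Python that pairs each poll CALL in a kdump excerpt with its RET and reports how long the call actually blocked. It is only an illustration, not a tool from PostgreSQL or FreeBSD: the function name summarize_poll and the regular expressions are hypothetical, and it assumes the third column is a timestamp in seconds and that the line layout matches the excerpt above. The third poll argument is the timeout in milliseconds, so 0x7d0 means 2000 ms.

```python
import re

# Lines copied from the kdump excerpt in this message.
KDUMP_SAMPLE = """\
878 postgres 0.009643 CALL poll(0x7fffffffd4e0,0x1,0x7d0)
878 postgres 0.009671 RET poll -1 errno 4 Interrupted system call
878 postgres 0.009675 CALL poll(0x7fffffffd4e0,0x1,0x7d0)
878 postgres 0.009687 RET poll -1 errno 4 Interrupted system call
878 postgres 0.009691 CALL poll(0x7fffffffd4e0,0x1,0x7d0)
878 postgres 0.009700 RET poll -1 errno 4 Interrupted system call
"""

# Assumed layout: pid, command, timestamp (seconds), then CALL or RET.
CALL_RE = re.compile(
    r"^\s*\d+\s+\S+\s+([\d.]+)\s+CALL\s+"
    r"poll\(0x[0-9a-f]+,0x[0-9a-f]+,(0x[0-9a-f]+)\)"
)
RET_RE = re.compile(
    r"^\s*\d+\s+\S+\s+([\d.]+)\s+RET\s+poll\s+(-?\d+)(?:\s+errno\s+(\d+))?"
)

EINTR = 4  # errno 4 is reported as "Interrupted system call" in the trace


def summarize_poll(text):
    """Pair each poll CALL with the next RET and describe the call."""
    calls = []
    pending = None  # (timestamp, timeout_ms) of a CALL awaiting its RET
    for line in text.splitlines():
        m = CALL_RE.match(line)
        if m:
            pending = (float(m.group(1)), int(m.group(2), 16))
            continue
        m = RET_RE.match(line)
        if m and pending is not None:
            start, timeout_ms = pending
            errno = int(m.group(3)) if m.group(3) else 0
            calls.append({
                "timeout_ms": timeout_ms,
                # Time spent inside poll, in whole microseconds.
                "elapsed_us": round((float(m.group(1)) - start) * 1e6),
                "eintr": m.group(2) == "-1" and errno == EINTR,
            })
            pending = None
    return calls


if __name__ == "__main__":
    for call in summarize_poll(KDUMP_SAMPLE):
        print(call)
```

On the excerpt above, every call asks for a 2000 ms timeout but comes back interrupted after a few tens of microseconds at most, which is consistent with the 100% CPU usage. Running the same summary over a longer trace would show whether any poll call ever completes normally.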