From: "Russell L. Carter"
Date: Wed, 28 Oct 2015 15:21:38 -0700
To: Bryan Drewery, FreeBSD Ports ML
Subject: Re: hung poudriere bulk recovery
Message-ID: <56314A72.9020005@pinyon.org>
In-Reply-To: <563147BE.2070604@FreeBSD.org>
References: <562A6185.5000305@pinyon.org> <563147BE.2070604@FreeBSD.org>

Hi Bryan,

On 10/28/15 15:10, Bryan Drewery wrote:
> On 10/23/2015 9:34 AM, Russell L. Carter wrote:
>>
>> Greetings,
>>
>> Recently my nightly cron poudriere builds have been occasionally
>> hanging.  For instance, here's last night's, with apparently no
>> progress for over 10 hours:
>>
>> root@terpsichore> poudriere status
>> SET PORTS   JAIL            BUILD                STATUS         QUEUE
>> BUILT FAIL SKIP IGNORE REMAIN TIME     LOGS
>> -   default 10-stable-amd64 2015-10-22_22h30m08s parallel_build   488
>>    34    0    0      0    454 10:45:56
>> /ssd1/poudriere/data/logs/bulk/10-stable-amd64-default/2015-10-22_22h30m08s
>> root@terpsichore>
>>
>
> Also check 'poudriere status -b' to see per-builder status. Something
> may be actually doing something. Poudriere will timeout builds after a
> long time. I forget the default but it may be up to 24 hours.

Good to know.  I will try that out, probably tomorrow morning.  The
last two nights' poudriere bulk builds have hung, but as I mentioned
before, when run from the console the exact same script succeeds and
poudriere shuts down cleanly.  'poudriere jail -k' seems to mostly work
ok for recovering.  This just started last week after nearly a year of
flawless cron'd jobs.  (poudriere was flawless, ports are another
matter).

>> htop now shows no significant activity for the specified 3 builders:
>>
>> root@terpsichore> ps xa | grep poud
>> 72482  -  Is     0:00.01 /bin/sh /root/poudriere/run-poudriere-bulk
>> 73202  -  S      0:04.24 sh -e /usr/local/share/poudriere/bulk.sh -f
>> /root/poudriere/ports -j 10-stable-amd64
>> 73347  -  S      1:55.38 sh -e /usr/local/share/poudriere/bulk.sh -f
>> /root/poudriere/ports -j 10-stable-amd64
>> 73352  -  I      0:00.08 sh -e /usr/local/share/poudriere/bulk.sh -f
>> /root/poudriere/ports -j 10-stable-amd64
>>  6119  1  S+     0:00.00 grep poud
>> root@terpsichore>
>>
>> If I reboot, so that the tmp zfs filesystems are unmounted, and
>> manually rerun the exact same script as the previous cron'd, hung
>> instance, poudriere has (so far) run to completion.
>
> Please record 'procstat -kka' before rebooting in case this is some kind
> of deadlock.

Will do.  Many thanks for the suggestions.  It sure smells like
luser fail but I don't see it yet...

Best,
Russell

>>
>> I'm not sure how to debug this, but in the interim, I'm very curious
>> how I can stop the hung bulk run, and either restart it, or clean up
>> the various mounted zfs filesystems and manually restart from the
>> beginning w/o rebooting.  Studying the man page, it's not clear at all
>> the Right Way to do this, so any pointers here would be appreciated.
>
> Kill -TERM the main poudriere process. It will clean up children.
>
> Beyond that you can 'poudriere jail -j NAME -p TREE -z SET -k' to clean
> up any mounts leftover from a previous build.
>
> Adding a 'poudriere kill' command is on the todo list.
>
>>
>> I'm leaving the system untouched for now so that I can try out any
>> suggestions for cleanup and restart.
>
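
For reference, the steps suggested in this thread can be strung together
roughly as below.  This is only a sketch, not something poudriere ships:
the function names (run, recover_hung_bulk) and the variables (JAIL,
TREE, SET, STACKS_FILE, DRY_RUN) are made up for illustration, and the
jail and tree defaults are just the values from the 'poudriere status'
output quoted above.  The PID of the main poudriere process has to be
supplied by hand; the thread does not say which of the listed bulk.sh
processes it is.  By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch of the recovery sequence discussed in this thread.
# All names here (JAIL, TREE, SET, STACKS_FILE, DRY_RUN, run,
# recover_hung_bulk) are illustrative and not part of poudriere.

JAIL="${JAIL:-10-stable-amd64}"    # jail name from 'poudriere status'
TREE="${TREE:-default}"            # ports tree from 'poudriere status'
SET="${SET:-}"                     # empty: no -z argument is passed
STACKS_FILE="${STACKS_FILE:-/var/tmp/procstat-kka.txt}"
DRY_RUN="${DRY_RUN:-1}"            # 1: only print what would be run

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_hung_bulk() {
    main_pid="$1"
    if [ -z "$main_pid" ]; then
        echo "usage: recover_hung_bulk PID" >&2
        return 1
    fi

    # 1. Per-builder status: a builder may still be doing real work.
    run poudriere status -b

    # 2. Save kernel stacks of all threads in case this is a deadlock.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ procstat -kka > $STACKS_FILE"
    else
        procstat -kka > "$STACKS_FILE"
    fi

    # 3. TERM the main poudriere process so it can clean up its children.
    #    In real use, wait for it to exit before moving on to step 4.
    run kill -TERM "$main_pid"

    # 4. Clean up any mounts left over from the previous build.
    if [ -n "$SET" ]; then
        run poudriere jail -j "$JAIL" -p "$TREE" -z "$SET" -k
    else
        run poudriere jail -j "$JAIL" -p "$TREE" -k
    fi
}
```

After sourcing the file, 'recover_hung_bulk PID' shows the four commands
without touching anything; setting DRY_RUN=0 beforehand makes it run
them for real.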