From: Borja Marcos
To: Alan Somers
Cc: freebsd-hackers@freebsd.org, freebsd-security@freebsd.org
Subject: Re: Periodic jobs lockf timeout
Date: Tue, 24 Oct 2017 17:06:04 +0200

> On 24 Oct 2017, at 16:41, Alan Somers wrote:
>
> On Tue, Oct 24, 2017 at 3:07 AM, Borja Marcos wrote:
> Are you talking about the lockf in /usr/sbin/periodic?  It already has
> a timeout of 0, which should prevent overlapping periodic jobs.  Or is
> there some other lockf involved?  Without knowing which lockf you're
> talking about, I can't understand your problem.

Sorry, my explanation was awful now that I read it again. Yes, I mean
the lockf in /usr/sbin/periodic.
And no, I didn’t mean that jobs overlap (they certainly don’t, thanks to
the lockf), but they can pile up. Today I had a machine with three daily
jobs waiting to start because the first one had been running for four
days (a combination of lots of files and datasets, heavy system load,
ZFS pool almost full…)

The problem with a timeout of 0 is that it’s unlimited. In case
something is wrong you can end up with a growing queue of daily periodic
jobs waiting to run. Imagine you have a very high system load for
several days and for some reason the daily job won’t complete. The next
day a new daily job will try to start, but it will have to wait for the
first one to finish. And so on.

The proposal is to replace the “0” timeout for lockf with a sane timeout
so that it will attempt to run the job, but give up if that can’t be
done in a reasonable time. The timeout shouldn’t actually be long. If
periodic must wait in order to start a job, it means that you have a
serious performance problem and it’s pointless to keep your machine
doing “find” 24/7.

Given the nature of the periodic jobs, I don’t think it should be a
problem to attempt to run them on a best-effort basis rather than
guaranteeing that they will eventually run, even if awfully late.

I would add a configurable timeout for /usr/sbin/periodic. I think it’s
better done with a different variable for each class, and their default
values can be 0 so that nothing changes.

daily_start_timeout
weekly_start_timeout
monthly_start_timeout

> The anticongestion_sleeptime variable is unrelated to lockf.

Understood, I stand corrected. I assumed it was.

Hope it’s better now. It’s pretty easy to do, but I’m interested in the
opinions on this matter :)

Thank you!

Borja.
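
The per-class variables proposed above could be wired in roughly as
follows. This is only a sketch under stated assumptions: the
*_start_timeout variables are the ones suggested in this message and are
not existing periodic.conf settings, and start_timeout_for is a
hypothetical helper name chosen for illustration.

```sh
# Sketch only: these variables are the proposal from this message, not
# part of the stock periodic configuration. Following the proposal, each
# one defaults to 0 unless it has already been set elsewhere.
: "${daily_start_timeout:=0}"
: "${weekly_start_timeout:=0}"
: "${monthly_start_timeout:=0}"

# Hypothetical helper: map a periodic class name to its configured
# start timeout, falling back to 0 for any other argument.
start_timeout_for() {
    case "$1" in
        daily)   echo "$daily_start_timeout" ;;
        weekly)  echo "$weekly_start_timeout" ;;
        monthly) echo "$monthly_start_timeout" ;;
        *)       echo 0 ;;
    esac
}

# The value would then be passed to lockf(1) through its -t option in
# place of the hard-coded 0, for example (lockfile is a placeholder):
#   lockf -t "$(start_timeout_for daily)" "$lockfile" command ...
```

Keeping one variable per class lets a slow daily run be given a
different limit from the weekly or monthly runs without touching the
others.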