From owner-freebsd-hackers@freebsd.org  Tue Oct 24 15:06:09 2017
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 11.0 \(3445.1.7\))
Subject: Re: Periodic jobs lockf timeout
From: Borja Marcos <borjam@sarenet.es>
In-Reply-To: <CAOtMX2hb_Ur8XtTdoPju3ZQGMfJ_pApUKsZiaocxaG9n+DVycA@mail.gmail.com>
Date: Tue, 24 Oct 2017 17:06:04 +0200
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>,
 freebsd-security@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <EAE33C61-BC70-4A09-86A0-0C5F62D993ED@sarenet.es>
References: <AEF2CF7D-BFAC-4ACE-95F2-EF5026E89959@sarenet.es>
 <CAOtMX2hb_Ur8XtTdoPju3ZQGMfJ_pApUKsZiaocxaG9n+DVycA@mail.gmail.com>
To: Alan Somers <asomers@freebsd.org>
X-Mailer: Apple Mail (2.3445.1.7)
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
 <freebsd-hackers.freebsd.org>


> On 24 Oct 2017, at 16:41, Alan Somers <asomers@freebsd.org> wrote:
>
> On Tue, Oct 24, 2017 at 3:07 AM, Borja Marcos <borjam@sarenet.es> wrote:
> Are you talking about the lockf in /usr/sbin/periodic?  It already has
> a timeout of 0, which should prevent overlapping periodic jobs.  Or is
> there some other lockf involved?  Without knowing which lockf you're
> talking about, I can't understand your problem.

Sorry, my explanation was awful now that I read it again. Yes, I mean the
lockf in /usr/sbin/periodic. And no, I didn’t mean that jobs overlap
(certainly they don’t thanks to the lockf) but they can pile up. Today I
had a machine with three daily jobs waiting to start because the first one
had been running for four days (a combination of lots of files and
datasets, heavy system load, ZFS pool almost full…)

The problem with a timeout of 0 is that it’s unlimited. In case something
is wrong you can end up with a growing queue of daily periodic jobs waiting
to run. Imagine you have a very high system load for several days and for
some reason the daily job won’t complete. Next day a new daily job will try
to start but it will have to wait for the first one to finish. And so on.

The proposal is to replace the “0” timeout for lockf with a sane timeout,
so that periodic will attempt to run the job but give up if it can’t start
within a reasonable time. The timeout shouldn’t be long, actually. If
periodic must wait in order to start a job, it means that you have a
serious performance problem and it’s pointless to keep your machine doing
“find” 24/7.

Given the nature of the periodic jobs I don’t think it should be a problem
to attempt to run them on a best-effort basis rather than guaranteeing that
they will eventually run, even if awfully late.

I would add a configurable timeout for /usr/sbin/periodic. I think it’s
better done with a separate variable for each class, and their default
values can be 0 so that nothing changes.

daily_start_timeout
weekly_start_timeout
monthly_start_timeout
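
As a rough sketch of how these could look in /etc/periodic.conf and be
picked up by the script: the three variables are only the proposal above
(they do not exist today), the values are assumed to be seconds, and
start_timeout_for is a hypothetical helper name used for illustration.

```shell
# Proposed (not yet existing) knobs; values assumed to be in seconds.
# 0 is the proposed default, so that nothing changes.
daily_start_timeout=600
weekly_start_timeout=0
monthly_start_timeout=0

# Hypothetical helper: print the start timeout for a periodic class.
# $1 is the class name (daily, weekly or monthly); falls back to 0
# when the matching variable is unset or empty.
start_timeout_for() {
    eval "echo \"\${${1}_start_timeout:-0}\""
}

start_timeout_for daily
start_timeout_for weekly
```

periodic could then hand the result to lockf’s -t option, which takes a
timeout in seconds, instead of the hardcoded 0.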


> The anticongestion_sleeptime variable is unrelated to lockf.

Understood, I stand corrected. I assumed it was.

Hope it’s better now. It’s pretty easy to do but I’m interested in
opinions on this matter :)


Thank you!


Borja.