Date:      Wed, 10 Jan 2024 15:33:45 +0100
From:      Olivier Certner <olce@freebsd.org>
To:        Xin LI <delphij@gmail.com>
Cc:        Xin LI <delphij@freebsd.org>, Mike Karels <mike@karels.net>, src-committers@freebsd.org, dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org
Subject:   Re: git: 2f036705f337 - main - Document the two recent newsyslog(8) change (-c option and <compress> configuration option).
Message-ID:  <3130778.jP0jbBhz4e@ravel>
In-Reply-To: <CAGMYy3tzXv%2Bp7CCAvNU5YQxoia6Thn3pazkc_xSZYfHN=tctEw@mail.gmail.com>
References:  <202312290846.3BT8kOiO029918@gitrepo.freebsd.org> <2683023.poxlI1A5LX@ravel> <CAGMYy3tzXv%2Bp7CCAvNU5YQxoia6Thn3pazkc_xSZYfHN=tctEw@mail.gmail.com>

Hi Xin,

Thanks for responding.

There were several ideas in my mail, some of them contradictory, and at times not grouped properly.  I hope it was still intelligible enough.

I was mostly concerned about the (future) change of default value, and still am.  But I'm also surprised by the premises some of your choices (including the default value) are based on.  To me, they look generally weak, and some do not even seem to make sense.  This is (also) what I would like to discuss.  I'm probably far from having the most stringent or intensive use of log files in this community, and I'm not an expert in SSD wear-leveling either.  So maybe it's just me, but then I'd ask for the minimum education to understand your reasoning and learn from it.

> I am open to removing '-c'.

An alternative I developed later in my initial mail (it was not apparent at the point you responded to) is to have '-c' (on the command-line) override <compress> (in some configuration file), and I think this is what you've done (and responded to Mike).  I'm fine with it since I have the feeling it's the general rule for most utilities where it's possible to request the same behavior on the command-line and from configuration files (in other words, it respects POLA).  My main concern here is that, if you keep '-c', you document it, as well as its relation to <compress>.  I'm saying this because you raised the possibility of not documenting it on purpose in some other message, which I think can't be justified here.

> Could you please clarify what you mean by "make it enable compression" --
> did you mean that we mark all log files to be compressible?  (It's probably
> not a good idea as some "log" files may be binary and not really
> compressible).

Yes, I meant exactly that.  In this alternative, you simply ignore compression letters but also their absence, and compress everything the same.  I understand your point about binary files, but I would be surprised if logs, even those formatted as binary files, weren't significantly compressible (albeit less than text) in most cases, and even if they aren't, it would only be a very minor annoyance (files are not going to get longer; for other (non-)annoyances, see below).  Moreover, all log files in base are text files, and that is also the case for all ports/applications I use, so I find it strange not to cater to what is probably the vast majority of use cases (or do you disagree with that?).

Doing so would also have the benefit that application writers don't have to bother wondering whether their logs should be compressed or not.  What would that decision be based on?  Basing it on format (text or binary) is most probably flawed, as I've just said above.  I don't think it can be based on content either, which I suspect will always be compressible for log files (there will be redundancy, like timestamps, identifiers, etc.).  And I see this more as an administrative decision (e.g., do I have plenty of disk space or not?), which is independent.  So shifting that decision to the administrator once and for all makes sense.  If you don't like this way to make it happen, I'm suggesting another one next.

> Changing the meaning of all four legacy compression type letters to "file
> is compressible" is part of the intention.  The goal is to discourage using
> them as a way to specify a compression type, in favor of using the
> administrator configured value.

As I've just explained, I see a lot of value in having an administrator decide on a global behavior.  I will most likely use this functionality.

I had been hesitating between preserving the current meaning of the compression letters, for POLA in general, and having the configuration directive override them.  That's why I mentioned an alternative where the override would have to be explicit, through an additional, different directive.  This idea could be reused like this: Have '<compress>' affect only files without compression letters, and have '<compress_override>' affect only those with them, and perhaps also have the specified value of one of them used as the default for the other (e.g., if '<compress_override>' is set, it also affects by default files without compression letters).  I'm mentioning this for completeness in case it fulfills the needs of others.  I probably won't use this refinement personally.  And, concerning POLA, there are different levels of it.  Leaving aside for a moment the change in default value, being able to override compression letters with a directive in the configuration file is a bit surprising, but after more pondering I no longer consider it terribly annoying, provided it is sufficiently publicized.
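
The '<compress>' / '<compress_override>' refinement described above can be sketched in a few lines of Python.  This is only an illustration of the proposal, not newsyslog(8) code: '<compress_override>' is a hypothetical directive, the function name is made up, the letter table reflects the four legacy flags as commonly documented (J, X, Y, Z), and keeping the letter's own method when no override is set is an assumption of the sketch.  Only the direction given as an example in the text (the override also serving as default for files without letters) is implemented.

```python
from typing import Optional

# Legacy per-file flag letters and the method each one selects
# (assumed table; check newsyslog.conf(5) on your system).
LEGACY_LETTERS = {"J": "bzip2", "X": "xz", "Y": "zstd", "Z": "gzip"}


def effective_method(flags: str,
                     compress: Optional[str] = None,
                     compress_override: Optional[str] = None) -> str:
    """Return the compression method for one log file entry.

    flags             -- the entry's flag field, e.g. "JC" or "" (hypothetical input)
    compress          -- value of '<compress>', if set
    compress_override -- value of the hypothetical '<compress_override>', if set
    """
    letters = [c for c in flags if c in LEGACY_LETTERS]
    if letters:
        # Entry has a compression letter: only the explicit override changes it.
        if compress_override is not None:
            return compress_override
        # Assumption: without an override, the letter keeps its legacy meaning.
        return LEGACY_LETTERS[letters[0]]
    # Entry has no compression letter: '<compress>' applies, and the override
    # value is used as its default when '<compress>' itself is not set.
    if compress is not None:
        return compress
    if compress_override is not None:
        return compress_override
    return "none"
```

With this reading, an entry flagged 'J' stays on bzip2 when only '<compress>' is set, and switches only when the explicit override is given.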

> That's said, 'none' is a reasonable default in many ways as explained
> before (it makes grep'ing easier, compression is not really that helpful in
> the modern world because hard drives are larger than the 90's and it
> reduces the times data gets rewritten to SSDs and avoids hourly CPU load
> bursts for busy systems).

This is where my main disagreement is currently.  Most arguments have been addressed in my previous mails, so for each I'll do a small wrap-up and add a few new thoughts.

"it makes grep'ing easier": Our zgrep(1) works on any compressed file, and even on uncompressed ones, so it is a drop-in replacement for grep(1).  I fail to see anything hard about using it.  Scripts already using grep(1) don't even need to be modified, thanks to PATH or symlink tweaking.  We could even go so far as having grep(1) itself behave like zgrep(1), which could be a great usability win for newcomers as well.
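
To illustrate why searching compressed and uncompressed logs alike is not hard, here is a minimal Python sketch of the same idea: detect the format from the file's leading magic bytes and fall back to plain text.  It is not how zgrep(1) is implemented; the function names are made up for the example, and zstd is left out only because older Python standard libraries have no zstd module.

```python
import bz2
import gzip
import lzma
import re
from typing import Iterator

# Leading magic bytes of each supported format and the matching opener.
_OPENERS = (
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"BZh", bz2.open),            # bzip2
    (b"\xfd7zXZ\x00", lzma.open),  # xz
)


def open_log(path: str):
    """Open a log file as text, decompressing transparently if needed."""
    with open(path, "rb") as raw:
        head = raw.read(6)
    for magic, opener in _OPENERS:
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8", errors="replace")
    # No known magic: treat the file as uncompressed text.
    return open(path, "rt", encoding="utf-8", errors="replace")


def grep_any(pattern: str, path: str) -> Iterator[str]:
    """Yield the lines of 'path' matching 'pattern', compressed or not."""
    regex = re.compile(pattern)
    with open_log(path) as log:
        for line in log:
            if regex.search(line):
                yield line.rstrip("\n")
```

A caller would use it the same way on a live log and on its rotated, compressed siblings, e.g. `list(grep_any("error", path))` for each path.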

"compression is not really that helpful in the modern world because hard drives are larger than the 90's": I certainly don't think so.  I manipulate GBs of (text) log files.  On build logs, I typically see ratios of 1/10, which is huge.  The space I'm saving is not only used to save more logs, but also for unrelated purposes, and prevents me from having to buy or dedicate more hard disks to this use.  And I'm not even talking about embedded systems, which are much more constrained, or virtual machines.

"it reduces the times data gets rewritten to SSDs": Surely, but does it matter?  I don't think so.  A single rewrite of log data in most use cases shouldn't have any visible effect on wear-leveling, except for SSDs where this is the only and continuous job, but then you can have your equivalent to 'syslog' compressing on the fly, or can use ZFS with compression.  If you're really reaching the disk I/O limits on your machine and can't afford the extra bandwidth for reading and compressing, shouldn't you be sending the logs via network to another machine doing exactly that processing?  And is this a use case common enough to warrant making non-compression newsyslog(8)'s new default?  I don't think so.

"avoids hourly CPU load bursts for busy systems": That can, and should, be solved by configuration.  You're free to choose a higher frequency, to avoid busy hours if there are less loaded ones, and to rotate logs on a smaller size limit, all of which will mitigate the problem to the point of almost non-existence.  And if the "almost" is still significant to your workload, then see the previous point.  Again, is this common or important enough?  For now, I doubt it.  And there is an advantage to having application-controlled compression: At least you can control exactly when the bursts occur, which you can't with ZFS (which has to compress blocks also).

> 'bzip2' could be a good second best default (because for most
> configurations it's how the log files are compressed with today's
> defaults), but if the administrator has already configured their systems to
> use a different method, this would break their configuration anyways.

Yes for 'bzip2' as a good default, for POLA.  If the administrator configured their system, then the best default would be 'legacy'.  That's why I was hesitating about always keeping the original meaning of the compression letters.

> There are other benefits of not compressing rotated logs.  For busy
> systems, the hourly newsyslog run would process larger logs and cause CPU
> workload bursts.
>
> And when logs are compressed, the data is read back and compressed data is
> rewritten to disk / SSDs, causing additional wear of the flash storage, and
> all that comes with no significant benefit for modern hardware.
>
> (I don't think it's common to have log files indexed after rotation; a more
> common use case would be to use [u]grep to look up for a certain pattern).

I think I've already addressed most of these points in the previous mail and above.

I've read and, I think, understood your points.  So please save us time and refrain from repeating them.  This is not going to make me change my current opinion that they are all weak at best.

On the other hand, please, after a careful reading of my objections, respond with comments, critiques or rebuttals as you see fit.  I may learn things in the process, and you might as well too.

> Yes, and that's not a big concern.  Achieving the maximum compression ratio
> is probably never the goal for most scenarios (not limited to logs, but
> also other places) where compression is used, and one always has to balance
> between the cost and benefit.

We are talking about logs, or at least use cases for newsyslog(8).  A frequent use case for it (it's certainly the primary one for me) is long-term storage of old logs that are infrequently read/processed.  Achieving a high compression ratio is important here, to save the space used in absolute terms *and* with respect to the expected (in a statistical sense) utility of these (i.e., low).

> If the person is distributing a release image to many thousands of users
> over the Internet, it would make a lot of sense to try the best compression
> for an 5% reduction of size because that adds up to the bandwidth cost and
> optimizes the experience for users, but it doesn't make as much sense to
> save, let's say a few MBs of disk space at the expense of spending a few
> more minutes every hour, the added "bursts" of slower response time for a
> server, and that's usually undesirable for production.

Really, I don't see where these figures can come from.  Here is a very quick example on a typical (for me) build log file of about 70MB:

  Method          | Space saved (size divided by) | Elapsed time (s)
  ----------------+-------------------------------+-----------------
  gzip (default)  | 95.3% (/ 21.2)                | 0.426
  xz (default)    | 96.9% (/ 32.6)                | 5.619
  zstd (default)  | 95.6% (/ 22.5)                | 0.088

I could multiply such examples to convince you in a more statistically serious manner.  But already, I think you can agree that "a few MBs of disk space at the expense of spending a few more minutes" is way, way off, even if you're still using xz(1).
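
For anyone wanting to reproduce this kind of measurement, the two ways of expressing the result in the table (percentage of space saved, and the factor the size is divided by) are related by saved = 1 - 1/factor.  The Python sketch below shows that conversion and a simple timing helper.  It uses the compressors from Python's standard library rather than the gzip(1), xz(1) and zstd(1) commands behind the figures above, so absolute numbers will differ; the function names are made up for the example.

```python
import gzip
import lzma
import time
from typing import Callable, Tuple


def saved_from_factor(factor: float) -> float:
    """Fraction of space saved when the size is divided by 'factor'."""
    return 1.0 - 1.0 / factor


def measure(data: bytes,
            compress: Callable[[bytes], bytes]) -> Tuple[float, float]:
    """Return (division factor, elapsed seconds) for one compressor."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


if __name__ == "__main__":
    # Stand-in for a real log file: highly repetitive, timestamped lines.
    sample = b"2024-01-10 15:33:45 build: compiling foo.c\n" * 100000
    for name, fn in (("gzip", gzip.compress), ("xz", lzma.compress)):
        factor, seconds = measure(sample, fn)
        print(f"{name}: {saved_from_factor(factor):.1%} saved, "
              f"/ {factor:.1f}, {seconds:.3f} s")
```

Applied to the table, dividing the size by 21.2, 32.6 and 22.5 corresponds to 95.3%, 96.9% and 95.6% saved respectively.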

Thanks and regards.

-- 
Olivier Certner