Date: Wed, 10 Jan 2024 15:33:45 +0100
From: Olivier Certner <olce@freebsd.org>
To: Xin LI <delphij@gmail.com>
Cc: Xin LI <delphij@freebsd.org>, Mike Karels <mike@karels.net>, src-committers@freebsd.org, dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org
Subject: Re: git: 2f036705f337 - main - Document the two recent newsyslog(8) change (-c option and <compress> configuration option).
Message-ID: <3130778.jP0jbBhz4e@ravel>
In-Reply-To: <CAGMYy3tzXv%2Bp7CCAvNU5YQxoia6Thn3pazkc_xSZYfHN=tctEw@mail.gmail.com>
References: <202312290846.3BT8kOiO029918@gitrepo.freebsd.org> <2683023.poxlI1A5LX@ravel> <CAGMYy3tzXv%2Bp7CCAvNU5YQxoia6Thn3pazkc_xSZYfHN=tctEw@mail.gmail.com>

Hi Xin,

Thanks for responding.

There were several ideas in my mail, some of them contradictory, and at times not grouped properly. I hope it still was intelligible enough.

I was mostly concerned about the (future) change of default value, and still am. But I'm also surprised by the premises some of your choices (including the default value) are based on. To me, they look generally weak, and some do not even seem to make sense. This is (also) what I would like to discuss. I'm probably far from having the most stringent or intensive use of log files in this community, and I'm not an expert on SSD wear-leveling either. So maybe it's just me, but then I'd ask for the minimum education to understand your reasoning and learn from it.

> I am open to removing '-c'.

An alternative I developed later in my initial mail (it was not apparent at the point you responded to) is to have '-c' (on the command line) override <compress> (in some configuration file), and I think this is what you've done (and responded to Mike). I'm fine with it, since I have the feeling it's the general rule for most utilities where it's possible to request the same behavior on the command line and from configuration files (in other words, it respects POLA). My main concern here is that, if you keep '-c', you document it, as well as its relation to <compress>. I'm saying this because you mentioned the possibility of not documenting it on purpose in some other message, which I think can't be justified here.

> Could you please clarify what you mean by "make it enable compression" --
> did you mean that we mark all log files to be compressible?
> (It's probably not a good idea as some "log" files may be binary and not
> really compressible).

Yes, I meant exactly that. In this alternative, you simply ignore compression letters but also their absence, and compress everything the same. I understand your point about binary files, but I would be surprised if logs, even those formatted as binary files, weren't significantly compressible (albeit less than text) in most cases, and even if they aren't, it would only be a very minor annoyance (files are not going to get longer; for other (non-)annoyances, see below). Moreover, all log files in base are text files, and that is also the case for all ports/applications I use, so I find it strange not to cater to what is probably the vast majority of use cases (or do you disagree with that?).

Doing so would also have the benefit that application writers just don't have to bother wondering whether their logs should be compressed or not. What would that decision be based on? Basing it on format (text or binary) is most probably flawed, as I've just said above. I don't think it can be based on content either, which I suspect will always be compressible for log files (there will be redundancy, like timestamps, identifiers, etc.). And I see this more as an administrative decision (e.g., do I have plenty of disk space or not?), which is independent. So shifting that decision to the administrator once and for all makes sense. If you don't like this way to make it happen, I'm suggesting another one next.

> Changing the meaning of all four legacy compression type letters to "file
> is compressible" is part of the intention. The goal is to discourage using
> them as a way to specify a compression type, in favor of using the
> administrator configured value.

As I've just explained, I see a lot of value in having an administrator decide on a global behavior. I will most likely use this functionality.
I had been hesitating between preserving the current meaning of the compression letters, for POLA in general, and having the configuration directive override them. That's why I mentioned an alternative where the override would have to be explicit, through an additional, different directive. This idea could be reused like this: Have '<compress>' affect only files without compression letters, and have '<compress_override>' affect only those with them, and perhaps also have the specified value of one of them used as the default for the other (e.g., if '<compress_override>' is set, it also affects by default files without compression letters). I'm mentioning this for completeness in case it fulfills the needs of others. I probably won't use this refinement personally. And, concerning POLA, there are different levels of it. Forgetting for a moment the change in default value, being able to override compression letters with a directive in the configuration file is a bit surprising, but after more pondering I no longer consider it terribly annoying if sufficiently publicized.

> That's said, 'none' is a reasonable default in many ways as explained
> before (it makes grep'ing easier, compression is not really that helpful in
> the modern world because hard drives are larger than the 90's and it
> reduces the times data gets rewritten to SSDs and avoids hourly CPU load
> bursts for busy systems).

This is where my main disagreement currently lies. Most arguments have been addressed in my previous mails, so for each I'll do a small wrap-up and add a few new thoughts.

"it makes grep'ing easier": Our zgrep(1) works on any compressed file, and even on uncompressed ones, so it is a drop-in replacement for grep(1). I fail to see anything hard about using it. Scripts already using grep(1) don't even need to be modified, via a combination of PATH or symlink tweaking.
We could even go so far as having grep(1) itself behave like zgrep(1), which could be a great usability win for newcomers as well.

"compression is not really that helpful in the modern world because hard drives are larger than the 90's": I certainly don't think so. I manipulate GBs of (text) log files. On build logs, I typically see ratios of 1/10, which is huge. The space I'm saving is not only used to save more logs, but also for unrelated purposes, and prevents me from having to buy or dedicate more hard disks to this use. And I'm not even talking about embedded systems, which are much more constrained, or virtual machines.

"it reduces the times data gets rewritten to SSDs": Surely, but does it matter? I don't think so. A single rewrite of log data in most use cases shouldn't have any visible effect on wear-leveling, except for SSDs where this is the only and continuous job, but then you can have your equivalent to 'syslog' compressing on the fly, or can use ZFS with compression. If you're really reaching the disk I/O limits on your machine and can't afford the extra bandwidth for reading and compressing, shouldn't you be sending the logs via network to another machine doing exactly that processing? And is this a use case common enough to warrant making non-compression newsyslog(8)'s new default? I don't think so.

"avoids hourly CPU load bursts for busy systems": That can, and should, be solved by configuration. You're free to choose a higher frequency, to avoid busy hours if there are less loaded ones, and to rotate logs on a smaller size limit, all of which will mitigate the problem to the point of almost non-existence. And if the "almost" is still significant to your workload, then see the previous point. Again, is this common or important enough? For now, I doubt it.
And there is an advantage of having application-controlled compression: At least you can control exactly when the bursts occur, which you can't with ZFS (which has to compress blocks also).

> 'bzip2' could be a good second best default (because for most
> configurations it's how the log files are compressed with today's
> defaults), but if the administrator has already configured their systems to
> use a different method, this would break their configuration anyways.

Yes for 'bzip2' as a good default, for POLA. If the administrator configured their system, then the best default would be 'legacy'. That's why I was hesitating about always keeping the original meaning of the compression letters.

> There are other benefits of not compressing rotated logs. For busy
> systems, the hourly newsyslog run would process larger logs and cause CPU
> workload bursts.
>
> And when logs are compressed, the data is read back and compressed data is
> rewritten to disk / SSDs, causing additional wear of the flash storage, and
> all that comes with no significant benefit for modern hardware.
>
> (I don't think it's common to have log files indexed after rotation; a more
> common use case would be to use [u]grep to look up for a certain pattern).

I think I've already addressed most of these points in the previous mail and above.

I've read and, I think, understood your points. So please save us time and refrain from repeating them. This is not going to make me change my current opinion that they all are weak at best.

On the other hand, please, after a careful reading of my objections, respond with comments, critiques or rebuttals as you see fit. I may learn things in the process, and you might as well.

> Yes, and that's not a big concern.
> Achieving the maximum compression ratio
> is probably never the goal for most scenarios (not limited to logs, but
> also other places) where compression is used, and one always has to balance
> between the cost and benefit.

We are talking about logs, or at least use cases for newsyslog(8). A frequent use case for it (it's certainly the primary one for me) is long-term storage of old logs that are infrequently read/processed. Achieving a high compression ratio is important here, to save the space used in absolute terms *and* with respect to the expected (in a statistical sense) utility of these (i.e., low).

> If the person is distributing a release image to many thousands of users
> over the Internet, it would make a lot of sense to try the best compression
> for an 5% reduction of size because that adds up to the bandwidth cost and
> optimizes the experience for users, but it doesn't make as much sense to
> save, let's say a few MBs of disk space at the expense of spending a few
> more minutes every hour, the added "bursts" of slower response time for a
> server, and that's usually undesirable for production.

Really, I don't see where these figures can come from. Here is a very quick example on a typical (for me) build log file of about 70MB:

* Method         * Compression ratio * Elapsed time (s) *
*********************************************************
  gzip (default) | 95.3%, or / 21.2  | 0.426
  xz (default)   | 96.9%, or / 32.6  | 5.619
  zstd (default) | 95.6%, or / 22.5  | 0.088

I could multiply such measurements to convince you in a more statistically serious manner. But already, I think you can agree that "a few MBs of disk space at the expense of spending a few more minutes" is way, way off, even if you're still using xz(1).

Thanks and regards.

-- 
Olivier Certner
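
A comparison like the one in the table above could be reproduced roughly as follows. This is only a minimal sketch, not the procedure used in the message: it assumes Python's standard `gzip`, `bz2` and `lzma` modules as stand-ins for the gzip(1), bzip2(1) and xz(1) tools, leaves zstd out because no standard-library binding is assumed, and uses a synthetic log (the hypothetical `sample_log` helper) in place of a real build log. The numbers it prints will therefore differ from those quoted. It also shows how the two figures in the "Compression ratio" column relate: the percentage is the fraction of space saved, and the "/ N" factor is the original size divided by the compressed size.

```python
import bz2
import gzip
import lzma
import time


def measure(name, compress, data):
    """Return (name, fraction of space saved, size reduction factor, seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    saved = 1 - len(packed) / len(data)   # e.g. 0.953 means "95.3%"
    factor = len(data) / len(packed)      # e.g. 21.2 means "/ 21.2"
    return name, saved, factor, elapsed


def sample_log(lines=20_000):
    """Build a synthetic, repetitive log (hypothetical stand-in for a build log)."""
    return b"".join(
        b"2024-01-10T15:%02d:%02d host cc[%d]: compiling file_%d.c -O2 -pipe\n"
        % (i // 60 % 60, i % 60, 1000 + i % 50, i)
        for i in range(lines)
    )


if __name__ == "__main__":
    data = sample_log()
    methods = (("gzip", gzip.compress), ("bzip2", bz2.compress), ("xz", lzma.compress))
    for method, compress in methods:
        name, saved, factor, secs = measure(method, compress, data)
        print(f"{name:6} | {saved:6.1%}, or / {factor:5.1f} | {secs:.3f}")
```

To try it on a real file instead, replace `sample_log()` with the bytes read from a rotated log of your own.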