From owner-freebsd-ports@FreeBSD.ORG Sun Jun 20 15:23:07 2010
From: Lasse Collin
To: Matthias Andree
Cc: ports@freebsd.org, Christian Weisgerber, portmgr@freebsd.org
Date: Sun, 20 Jun 2010 18:23:03 +0300
Subject: Re: FreeBSD ports USE_XZ critical issue on low-RAM computers
Message-Id: <201006201823.03817.lasse.collin@tukaani.org>
In-Reply-To: <4C1E18C4.5020303@FreeBSD.org>
References: <4C1BA4D4.9000205@FreeBSD.org> <201006191641.26301.lasse.collin@tukaani.org> <4C1E18C4.5020303@FreeBSD.org>

On 2010-06-20 Matthias Andree wrote:
> On 19.06.2010 15:41, Lasse Collin wrote:
> > Perhaps FreeBSD provides a good working way to limit the amount of
> > memory that a process actually can use. I don't see such a way e.g.
> > in Linux, so having some method in the application to limit memory
> > usage is definitely nice. It's even more useful in the compression
> > library, because a virtual-memory-hog application on a busy server
> > doesn't necessarily want to use tons of RAM for decompressing data
> > from untrusted sources.
>
> Even there the default should be "max", and the library SHOULD NOT
> second-guess what trust level of data the application might
> process with libxz's help.

There is no default value for the memory limit in liblzma (not libxz,
for historical reasons). You can specify UINT64_MAX if you want. Please
don't complain about how the library sucks without looking at its API
first. Don't confuse the limiter _feature_ with its _default value_;
there is a default value only in the command line tools.

> Expose the limiter interface in the API if you want, but particularly
> for the library in particular, any other default than "unlimited
> memory" is a nuisance. And there's still an application, and unlike
> the xz library, the application should know what kind of data from
> what sources it is processing, and if - for instance - a virus
> inspector wants to impose memory limits and quarantine an attachment
> with what looks like a zip bomb.

Yes, this is exactly what I have done in liblzma, except that there is
no default value (typing UINT64_MAX isn't too much to ask).

> >> For compression, it's less critical because service is degraded,
> >> not denied, but I'd still think -M max would be the better
> >> default. I can always put "export XZ_OPT=-3" in
> >> /etc/profile.d/local.sh or wherever it belongs on the OS of the
> >> day.
> >
> > If a script has "xz -9", it overrides XZ_OPT=-3.
>
> I know. This isn't a surprise for me. The memory limiting however is.
> And the memory limiting overrides xz -9 to something lesser, which
> may not be what I want either.

I have only one computer with over 512 MiB RAM (this one has 8 GiB).
Thus "xz -9" is usable only on one of my computers. I cannot go and fix
all scripts so that they first check how much RAM I have and then pick
a reasonable compression level. Nor does it look good to make "xz -9"
so low that it would be usable on all systems with e.g. 256 MiB RAM or
more (settings higher than the current "xz -9" are possible; they just
aren't usually that useful, and even -9 is not always much better than
slightly lower settings).

What do you think is the best solution to the above problem without
putting a default memory usage limit in xz? Setting something in XZ_OPT
might work in many cases, but sometimes scripts set it themselves, e.g.
to pass compression settings to some other script calling xz. Maybe xz
should support a config file? Or maybe another environment variable,
which one could assume that scripts won't touch? These are honest
questions, and answering them would help much more than long
descriptions of how the current method is bad.

> >> I still think utilities and applications should /not/ impose
> >> arbitrarily lower limits by default though.
> >
> > There's no multithreading in xz yet, but when there is, do you want
> > xz to use as many threads as there are CPU cores _by default_? If
> > so, do you mind if compressing with "xz -9" used around 3.5 GiB of
> > memory on a four-core system no matter how much RAM it has?
>
> Multithreading in xz is worth discussion if the tasks can be
> parallelized, which is apparently not the case. You would be
> duplicating effort, because we have tools to run several xz on
> distinct files at the same time, for instance BSD portable make or
> GNU make with a "-j" option.

That's a nice way to avoid answering the question. xargs works too when
you have multiple small files (there's even an example in the recent xz
man page). Please explain how any of these help with a multigigabyte
file. That's where people want xz to use threads.
There is more than one way to parallelize the compression, and some of
them increase encoder memory usage quite a lot.

> > I think it is quite obvious that you want the number of threads to
> > be limited so that xz won't accidentally exceed the total amount
> > of physical RAM, because then it is much slower than using fewer
> > threads.
>
> This tells me xz cannot fully parallelize its effort on the CPUs, and
> should be single-threaded so as not to waste the parallelization
> overhead.

Sure, it cannot "fully" parallelize, whatever that means. But the
amount of parallelization that is possible is welcomed by many others
(you are the very first person to think it's useless). For example,
7-Zip can use any number of threads with .xz files and there are some
liblzma-based experimental tools too.

The next question could be how to determine how many threads would be
OK for multithreaded decompression. It doesn't "fully" parallelize
either, and would be possible only in certain situations. There too the
memory usage grows quickly when threads are added. To me, a memory
usage limit together with a limit on the number of threads looks good;
with no limits, the decompressor could end up reading the whole file
into RAM (and swap). Threaded decompression isn't so important though,
so I'm not even sure if I will ever implement it.

> If I specify -9 or --best, but no memory option, that means "compress
> as hard as you can".

With xz it doesn't and never will. The "compress as hard as you can"
option would currently use a 1.5 GiB dictionary, which is a waste of
memory when compressing files that are a lot smaller than that. The
format of the LZMA2 algorithm used in .xz supports dictionaries up to
4 GiB. The decoder supports all dictionary sizes, but the current
encoder is limited to 1.5 GiB for implementation reasons. You would
need a little bit over 16 GiB of memory to compress with a 1.5 GiB
dictionary using the BT4 match finder. I don't think you honestly want
-9 to be that.
Instead, -9 is set to an arbitrary point, a 64 MiB dictionary, which
still can make sense in many common situations. That currently uses
674 MiB of memory to compress and a little more than the dictionary
size to decompress, so I round it up to 65 MiB.

The dictionary size is only one factor in getting high compression. It
depends on the file. Some files benefit a lot when the dictionary size
increases, while others benefit mostly from spending more CPU cycles.
That's why there is the --extreme option. It allows improving the
compression ratio by spending more time without requiring so much RAM.

The existence of --extreme (-e) naturally makes things slightly more
complicated for a user than a single linear one-digit scale of
compression levels, but it makes it easier to specify what is wanted
without requiring the user to read about the advanced options.

Note that I plan to revise exactly which settings are bound to the
different compression levels before the 5.0.0 release.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
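
The distinction drawn earlier in this message between the limiter
feature and its default value can be sketched roughly as follows. This
is only an illustration, not code from the thread: it uses Python's
standard lzma module (which wraps liblzma) instead of the C API, and
the helper name try_decompress, the sample payload, and the 256 KiB
limit are hypothetical choices. To my understanding, passing
memlimit=None asks for no limit at all, which is the moral equivalent
of passing UINT64_MAX in C.

```python
import lzma


def try_decompress(blob, memlimit=None):
    """Decompress blob; return (True, data) or (False, error message).

    memlimit=None means "no limit"; an integer is a cap in bytes that
    the caller (not the library) has decided on.
    """
    try:
        return True, lzma.decompress(blob, memlimit=memlimit)
    except lzma.LZMAError as exc:
        return False, str(exc)


# Hypothetical sample data, compressed with a low preset so that the
# example itself stays cheap to run.
payload = b"ports tree " * 10000
blob = lzma.compress(payload, preset=1)

# The application chooses: no limit at all ...
ok_unlimited, out = try_decompress(blob)

# ... or a deliberately tiny limit (256 KiB), which is smaller than
# what the decoder needs for this stream's dictionary.
ok_limited, message = try_decompress(blob, memlimit=256 * 1024)

print(ok_unlimited, ok_limited)
```

The point of the sketch is that the limit is always an explicit
decision by the caller; the library itself does not pick one.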