Date: 23 Jan 2000 20:21:12 +0100 From: Dag-Erling Smorgrav <des@flood.ping.uio.no> To: Ollivier Robert <roberto@keltia.freenix.fr> Cc: freebsd-current@FreeBSD.ORG Subject: Re: bzip2 in src tree (Was Re: ports/16252: bsd.port.mk: Add bzip2 support for distribution patches) Message-ID: <xzp7lh0lhlj.fsf@flood.ping.uio.no> In-Reply-To: Ollivier Robert's message of "Sun, 23 Jan 2000 14:14:13 %2B0100" References: <Pine.BSF.4.21.0001221703020.4454-100000@localhost> <200001230222.SAA18355@salsa.gv.tsc.tdk.com> <20000123141413.A49567@keltia.freenix.fr>
next in thread | previous in thread | raw e-mail | index | archive | help
Ollivier Robert <roberto@keltia.freenix.fr> writes: > According to Don Lewis: > > Doesn't bzip2 require a lot more memory for decompression? As I > > recall, someone mentioned that this would cause problems for installing > > releases on machines with only a small amount of RAM. > Yes and is much slower than gzip. Having bzip2 as a port is enough IMO. Not only is it slower (and more memory-hungry), but it was designed and optimized for compressing text. The less text-like the input is, the worse bzip2 fares; in my tests, when compressing the output of /dev/urandom, it runs seven times slower than gzip, and produces a slightly larger output file: des@des ~/test% dd if=/dev/urandom of=random bs=1024 count=$((16*1024)) 16384+0 records in 16384+0 records out 16777216 bytes transferred in 21.406185 secs (783756 bytes/sec) des@des ~/test% time gzip -c random >random.gz gzip -c random > random.gz 12.62s user 0.63s system 88% cpu 14.997 total des@des ~/test% time bzip2 -c random >random.bz bzip2 -c random > random.bz 83.16s user 1.02s system 80% cpu 1:44.42 total des@des ~/test% ll total 49256 -rw-r--r-- 1 des des 16777216 Jan 23 18:48 random -rw-r--r-- 1 des des 16850433 Jan 23 18:54 random.bz -rw-r--r-- 1 des des 16779801 Jan 23 18:52 random.gz Of course, this is a worst-case scenario for both gzip and bzip2. Let's see how they fare with a tarball of the -CURRENT kernel tree: des@des ~/test% lcvs co sys >/dev/null 2>&1 des@des ~/test% tar cf sys.tar sys des@des ~/test% rm -rf sys des@des ~/test% time gzip -c sys.tar >sys.tar.gz gzip -c sys.tar > sys.tar.gz 37.46s user 1.04s system 89% cpu 43.062 total des@des ~/test% time bzip2 -c sys.tar >sys.tar.bz bzip2 -c sys.tar > sys.tar.bz 140.02s user 1.21s system 96% cpu 2:26.18 total des@des ~/test% ll total 59360 -rw-r--r-- 1 des des 43673600 Jan 23 19:10 sys.tar -rw-r--r-- 1 des des 7492474 Jan 23 19:22 sys.tar.bz -rw-r--r-- 1 des des 9554566 Jan 23 19:15 sys.tar.gz des@des ~/test% echo $((7492474*100/9554566)) 78 Much better! a 22% improvement over gzip - but it took nearly four times as long... How about my mail archive? des@des ~/test% tar cf mail.tar ~/Mail tar: Removing leading / from absolute path names in the archive. des@des ~/test% time gzip -c mail.tar >mail.tar.gz gzip -c mail.tar > mail.tar.gz 10.99s user 0.80s system 93% cpu 12.665 total des@des ~/test% time bzip2 -c mail.tar >mail.tar.bz bzip2 -c mail.tar > mail.tar.bz 64.16s user 0.77s system 92% cpu 1:10.40 total des@des ~/test% ll total 31048 -rw-r--r-- 1 des des 26204160 Jan 23 19:33 mail.tar -rw-r--r-- 1 des des 2329178 Jan 23 19:36 mail.tar.bz -rw-r--r-- 1 des des 3210845 Jan 23 19:34 mail.tar.gz des@des ~/test% echo $((2329178*100/3210845)) 72 Even better than source code, yet even slower. Mail is less repetitive than source code, but the repeated strings are longer - source code has much more punctuation (including operators) than text. This is the kind of data bzip2 excels at. How about a kernel? des@des ~/test% cp /kernel . des@des ~/test% time gzip -c kernel >kernel.gz gzip -c kernel > kernel.gz 2.02s user 0.02s system 98% cpu 2.068 total des@des ~/test% time bzip2 -c kernel >kernel.bz bzip2 -c kernel > kernel.bz 4.62s user 0.09s system 98% cpu 4.766 total des@des ~/test% ll total 2936 -r-xr-xr-x 1 des des 1597538 Jan 23 19:42 kernel* -rw-r--r-- 1 des des 669729 Jan 23 19:42 kernel.bz -rw-r--r-- 1 des des 696809 Jan 23 19:42 kernel.gz des@des ~/test% echo $((669729*100/1597538)) 41 des@des ~/test% echo $((696809*100/1597538)) 43 des@des ~/test% echo $((669729*100/696809)) 96 Very little improvement, and a fairly large time penalty, though the sample is too small for the comparison to make much sense (we don't know enough about overhead). Let's strip the kernel: des@des ~/test% strip kernel des@des ~/test% time gzip -c kernel >kernel.gz gzip -c kernel > kernel.gz 1.77s user 0.04s system 96% cpu 1.884 total des@des ~/test% time bzip2 -c kernel >kernel.bz bzip2 -c kernel > kernel.bz 5.80s user 0.04s system 97% cpu 5.989 total des@des ~/test% ll total 2488 -r-xr-xr-x 1 des des 1328176 Jan 23 19:44 kernel* -rw-r--r-- 1 des des 575566 Jan 23 19:45 kernel.bz -rw-r--r-- 1 des des 605411 Jan 23 19:44 kernel.gz des@des ~/test% echo $((575566*100/1328176)) 43 des@des ~/test% echo $((605411*100/1328176)) 45 des@des ~/test% echo $((575566*100/605411)) 95 Both gzip and bzip2 do worse with a stripped kernel than with an unstripped one, because what we stripped away (function and variable names) is repetitive and covers a small portion of the alphabet, and therefore compresses well. Now, here's the killer: a 2.5 MB text file which gzip compresses significantly better than bzip2: des@des ~/test% cp /usr/share/dict/web2 . des@des ~/test% time gzip -c web2 >web2.gz gzip -c web2 > web2.gz 4.25s user 0.06s system 99% cpu 4.319 total des@des ~/test% time bzip2 -c web2 >web2.bz bzip2 -c web2 > web2.bz 5.86s user 0.07s system 98% cpu 5.998 total des@des ~/test% ll total 4048 -r--r--r-- 1 des des 2493066 Jan 23 19:24 web2 -rw-r--r-- 1 des des 857838 Jan 23 19:24 web2.bz -rw-r--r-- 1 des des 754540 Jan 23 19:24 web2.gz des@des ~/test% echo $((754540*100/857838)) 87 Not much of a difference in compression time, though we still see bzip2 running slower than gzip. But gzip compressed 13% better than bzip2, because the file doesn't contain many repetitions of the kind that bzip2 exploits. bzip2 needs lots of fairly long repeated strings to work well, because it's designed to compress text, which contains lots of fairly long repeated strings. To summarize, bzip2 compresses certain types of data very well, but has more degenerate cases than gzip. Using bzip2 for source code or text (even marked-up text) is good. Using it for binaries is not-so-good. Using it for one-shot compression (e.g. compressing backup archives, or in a gather-compress-transfer-decompress-process pipeline to save bandwidth) is downright stupid unless the cost of storing or transferring the compressed data is unusually high. Note that all published benchmarks (at least, those I've seen) comparing the performance of block-sorting compression with other compression algorithms use plain text as sample input, and are therefore mostly meaningless unless all your data is plain text. DES -- Dag-Erling Smorgrav - des@flood.ping.uio.no To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?xzp7lh0lhlj.fsf>