Skip site navigation (1)Skip section navigation (2)
Date:      23 Jan 2000 20:21:12 +0100
From:      Dag-Erling Smorgrav <des@flood.ping.uio.no>
To:        Ollivier Robert <roberto@keltia.freenix.fr>
Cc:        freebsd-current@FreeBSD.ORG
Subject:   Re: bzip2 in src tree (Was Re: ports/16252: bsd.port.mk: Add bzip2 support for distribution patches)
Message-ID:  <xzp7lh0lhlj.fsf@flood.ping.uio.no>
In-Reply-To: Ollivier Robert's message of "Sun, 23 Jan 2000 14:14:13 %2B0100"
References:  <Pine.BSF.4.21.0001221703020.4454-100000@localhost> <200001230222.SAA18355@salsa.gv.tsc.tdk.com> <20000123141413.A49567@keltia.freenix.fr>

next in thread | previous in thread | raw e-mail | index | archive | help
Ollivier Robert <roberto@keltia.freenix.fr> writes:
> According to Don Lewis:
> > Doesn't bzip2 require a lot more memory for decompression?  As I
> > recall, someone mentioned that this would cause problems for installing
> > releases on machines with only a small amount of RAM.
> Yes and is much slower than gzip. Having bzip2 as a port is enough IMO.

Not only is it slower (and more memory-hungry), but it was designed
and optimized for compressing text. The less text-like the input is,
the worse bzip2 fares; in my tests, when compressing the output of
/dev/urandom, it runs seven times slower than gzip, and produces a
slightly larger output file:

des@des ~/test% dd if=/dev/urandom of=random bs=1024 count=$((16*1024))
16384+0 records in
16384+0 records out
16777216 bytes transferred in 21.406185 secs (783756 bytes/sec)
des@des ~/test% time gzip -c random >random.gz 
gzip -c random > random.gz  12.62s user 0.63s system 88% cpu 14.997 total
des@des ~/test% time bzip2 -c random >random.bz 
bzip2 -c random > random.bz  83.16s user 1.02s system 80% cpu 1:44.42 total
des@des ~/test% ll
total 49256
-rw-r--r--  1 des  des  16777216 Jan 23 18:48 random
-rw-r--r--  1 des  des  16850433 Jan 23 18:54 random.bz
-rw-r--r--  1 des  des  16779801 Jan 23 18:52 random.gz

Of course, this is a worst-case scenario for both gzip and bzip2.
Let's see how they fare with a tarball of the -CURRENT kernel tree:

des@des ~/test% lcvs co sys >/dev/null 2>&1
des@des ~/test% tar cf sys.tar sys
des@des ~/test% rm -rf sys
des@des ~/test% time gzip -c sys.tar >sys.tar.gz 
gzip -c sys.tar > sys.tar.gz  37.46s user 1.04s system 89% cpu 43.062 total
des@des ~/test% time bzip2 -c sys.tar >sys.tar.bz 
bzip2 -c sys.tar > sys.tar.bz  140.02s user 1.21s system 96% cpu 2:26.18 total
des@des ~/test% ll
total 59360
-rw-r--r--  1 des  des  43673600 Jan 23 19:10 sys.tar
-rw-r--r--  1 des  des   7492474 Jan 23 19:22 sys.tar.bz
-rw-r--r--  1 des  des   9554566 Jan 23 19:15 sys.tar.gz
des@des ~/test% echo $((7492474*100/9554566))
78

Much better! a 22% improvement over gzip - but it took nearly four
times as long...

How about my mail archive?

des@des ~/test% tar cf mail.tar ~/Mail
tar: Removing leading / from absolute path names in the archive.
des@des ~/test% time gzip -c mail.tar >mail.tar.gz
gzip -c mail.tar > mail.tar.gz  10.99s user 0.80s system 93% cpu 12.665 total
des@des ~/test% time bzip2 -c mail.tar >mail.tar.bz 
bzip2 -c mail.tar > mail.tar.bz  64.16s user 0.77s system 92% cpu 1:10.40 total
des@des ~/test% ll
total 31048
-rw-r--r--  1 des  des  26204160 Jan 23 19:33 mail.tar
-rw-r--r--  1 des  des   2329178 Jan 23 19:36 mail.tar.bz
-rw-r--r--  1 des  des   3210845 Jan 23 19:34 mail.tar.gz
des@des ~/test% echo $((2329178*100/3210845))
72

Even better than source code, yet even slower. Mail is less repetitive
than source code, but the repeated strings are longer - source code
has much more punctuation (including operators) than text. This is the
kind of data bzip2 excels at.

How about a kernel?

des@des ~/test% cp /kernel .
des@des ~/test% time gzip -c kernel >kernel.gz   
gzip -c kernel > kernel.gz  2.02s user 0.02s system 98% cpu 2.068 total
des@des ~/test% time bzip2 -c kernel >kernel.bz 
bzip2 -c kernel > kernel.bz  4.62s user 0.09s system 98% cpu 4.766 total
des@des ~/test% ll
total 2936
-r-xr-xr-x  1 des  des  1597538 Jan 23 19:42 kernel*
-rw-r--r--  1 des  des   669729 Jan 23 19:42 kernel.bz
-rw-r--r--  1 des  des   696809 Jan 23 19:42 kernel.gz
des@des ~/test% echo $((669729*100/1597538)) 
41
des@des ~/test% echo $((696809*100/1597538))
43
des@des ~/test% echo $((669729*100/696809))
96

Very little improvement, and a fairly large time penalty, though the
sample is too small for the comparison to make much sense (we don't
know enough about overhead).

Let's strip the kernel:

des@des ~/test% strip kernel
des@des ~/test% time gzip -c kernel >kernel.gz 
gzip -c kernel > kernel.gz  1.77s user 0.04s system 96% cpu 1.884 total
des@des ~/test% time bzip2 -c kernel >kernel.bz
bzip2 -c kernel > kernel.bz  5.80s user 0.04s system 97% cpu 5.989 total
des@des ~/test% ll
total 2488
-r-xr-xr-x  1 des  des  1328176 Jan 23 19:44 kernel*
-rw-r--r--  1 des  des   575566 Jan 23 19:45 kernel.bz
-rw-r--r--  1 des  des   605411 Jan 23 19:44 kernel.gz
des@des ~/test% echo $((575566*100/1328176))
43
des@des ~/test% echo $((605411*100/1328176))
45
des@des ~/test% echo $((575566*100/605411))
95

Both gzip and bzip2 do worse with a stripped kernel than with an
unstripped one, because what we stripped away (function and variable
names) is repetitive and covers a small portion of the alphabet, and
therefore compresses well.

Now, here's the killer: a 2.5 MB text file which gzip compresses
significantly better than bzip2:

des@des ~/test% cp /usr/share/dict/web2 .
des@des ~/test% time gzip -c web2 >web2.gz 
gzip -c web2 > web2.gz  4.25s user 0.06s system 99% cpu 4.319 total
des@des ~/test% time bzip2 -c web2 >web2.bz 
bzip2 -c web2 > web2.bz  5.86s user 0.07s system 98% cpu 5.998 total
des@des ~/test% ll
total 4048
-r--r--r--  1 des  des  2493066 Jan 23 19:24 web2
-rw-r--r--  1 des  des   857838 Jan 23 19:24 web2.bz
-rw-r--r--  1 des  des   754540 Jan 23 19:24 web2.gz
des@des ~/test% echo $((754540*100/857838))
87

Not much of a difference in compression time, though we still see
bzip2 running slower than gzip. But gzip compressed 13% better than
bzip2, because the file doesn't contain many repetitions of the kind
that bzip2 exploits. bzip2 needs lots of fairly long repeated strings
to work well, because it's designed to compress text, which contains
lots of fairly long repeated strings.

To summarize, bzip2 compresses certain types of data very well, but
has more degenerate cases than gzip. Using bzip2 for source code or
text (even marked-up text) is good. Using it for binaries is
not-so-good. Using it for one-shot compression (e.g. compressing
backup archives, or in a gather-compress-transfer-decompress-process
pipeline to save bandwidth) is downright stupid unless the cost of
storing or transferring the compressed data is unusually high.

Note that all published benchmarks (at least, those I've seen)
comparing the performance of block-sorting compression with other
compression algorithms use plain text as sample input, and are
therefore mostly meaningless unless all your data is plain text.

DES
-- 
Dag-Erling Smorgrav - des@flood.ping.uio.no


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?xzp7lh0lhlj.fsf>