Date: Mon, 1 Oct 2012 22:16:53 -0700 From: Tim Kientzle <tim@kientzle.com> To: Brandon Falk <bfalk_bsd@brandonfa.lk> Cc: freebsd-hackers@freebsd.org Subject: Re: SMP Version of tar Message-ID: <87549776-9051-4B4B-8D53-DAE6D51C2A94@kientzle.com> In-Reply-To: <5069C9FC.6020400@brandonfa.lk> References: <5069C9FC.6020400@brandonfa.lk>
next in thread | previous in thread | raw e-mail | index | archive | help
On Oct 1, 2012, at 9:51 AM, Brandon Falk wrote: > I would be willing to work on a SMP version of tar (initially just = gzip or something). >=20 > I don't have the best experience in compression, and how to = multi-thread it, but I think I would be able to learn and help out. >=20 > Note: I would like to make this for *BSD under the BSD license. I am = aware that there are already tools to do this (under GPL), but I would = really like to see this existent in the FreeBSD base. >=20 > Anyone interested? Great! First rule: be skeptical. In particular, tar is so entirely disk-bound = that many performance optimizations have no impact whatsoever. If you = don't do a lot of testing, you can end up wasting a lot of time. There are a few different parallel command-line compressors and = decompressors in ports; experiment a lot (with large files being read = from and/or written to disk) and see what the real effect is. In = particular, some decompression algorithms are actually faster than = memcpy() when run on a single processor. Parallelizing such algorithms = is not likely to help much in the real world. The two popular algorithms I would expect to benefit most are bzip2 = compression and lzma compression (targeting xz or lzip format). For = decompression, bzip2 is block-oriented so fits SMP pretty naturally. = Other popular algorithms are stream-oriented and less amenable to = parallelization. Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 = implementation that's already under a BSD license. You should be able = to get a lot of ideas about how to implement a parallel compression = algorithm. Better yet, you might be able to reuse a lot of the existing = pbzip2 code. Mark Adler's pigz is also worth studying. It's also license-friendly, = and is built on top of regular zlib, which is a nice technique when it's = feasible. There are three fundamentally different implementation approaches with = different complexity/performance issues: * Implement as a stand-alone executable similar to pbzip2. This makes = your code a lot simpler and makes it reasonably easy for people to reuse = your work. This could work with tar, though it could be slightly slower = than the in-process version due to the additional data-copying and = process-switch overhead. * Implement within libarchive directly. This would benefit tar and a = handful of other programs that use libarchive, but may not be worth the = complexity. * Implement as a standalone library with an interface similar to zlib = or libbz2 or liblzma. The last would be my personal preference, though it's probably the most = complex of all. That would easily support libarchive and you could = create a simple stand-alone wrapper around it as well, giving you the = best of all worlds. If you could extend the pigz technique, you might be able to build a = multi-threaded compression library where the actual compression was = handled by an existing single-threaded library. Since zlib, bzlib, and = liblzma already have similar interfaces, your layer might require only a = thin adapter to handle any of those three. *That* would be very = interesting, indeed. Sounds like a fun project. I wish I had time to work on something like = this. Cheers, Tim
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?87549776-9051-4B4B-8D53-DAE6D51C2A94>