From: Tim Kientzle
To: Brandon Falk
Cc: freebsd-hackers@freebsd.org
Date: Mon, 1 Oct 2012 22:16:53 -0700
Subject: Re: SMP Version of tar
Message-Id: <87549776-9051-4B4B-8D53-DAE6D51C2A94@kientzle.com>
In-Reply-To: <5069C9FC.6020400@brandonfa.lk>

On Oct 1, 2012, at 9:51 AM, Brandon Falk wrote:

> I would be willing to work on a SMP version of tar (initially just
> gzip or something).
>
> I don't have the best experience in compression, and how to
> multi-thread it, but I think I would be able to learn and help out.
>
> Note: I would like to make this for *BSD under the BSD license.
> I am aware that there are already tools to do this (under GPL), but I
> would really like to see this existent in the FreeBSD base.
>
> Anyone interested?

Great!

First rule: be skeptical.

In particular, tar is so entirely disk-bound that many performance
optimizations have no impact whatsoever. If you don't do a lot of
testing, you can end up wasting a lot of time.

There are a few different parallel command-line compressors and
decompressors in ports; experiment a lot (with large files being read
from and/or written to disk) and see what the real effect is. In
particular, some decompression algorithms are actually faster than
memcpy() when run on a single processor. Parallelizing such algorithms
is not likely to help much in the real world.

The two popular algorithms I would expect to benefit most are bzip2
compression and lzma compression (targeting xz or lzip format). For
decompression, bzip2 is block-oriented so fits SMP pretty naturally.
Other popular algorithms are stream-oriented and less amenable to
parallelization.

Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2
implementation that's already under a BSD license. You should be able
to get a lot of ideas about how to implement a parallel compression
algorithm. Better yet, you might be able to reuse a lot of the existing
pbzip2 code.

Mark Adler's pigz is also worth studying. It's also license-friendly,
and is built on top of regular zlib, which is a nice technique when it's
feasible.

There are three fundamentally different implementation approaches with
different complexity/performance issues:

* Implement as a stand-alone executable similar to pbzip2. This makes
  your code a lot simpler and makes it reasonably easy for people to
  reuse your work. This could work with tar, though it could be slightly
  slower than the in-process version due to the additional data-copying
  and process-switch overhead.

* Implement within libarchive directly.
  This would benefit tar and a handful of other programs that use
  libarchive, but may not be worth the complexity.

* Implement as a standalone library with an interface similar to zlib
  or libbz2 or liblzma.

The last would be my personal preference, though it's probably the most
complex of all. That would easily support libarchive and you could
create a simple stand-alone wrapper around it as well, giving you the
best of all worlds.

If you could extend the pigz technique, you might be able to build a
multi-threaded compression library where the actual compression was
handled by an existing single-threaded library. Since zlib, bzlib, and
liblzma already have similar interfaces, your layer might require only a
thin adapter to handle any of those three. *That* would be very
interesting, indeed.

Sounds like a fun project. I wish I had time to work on something like
this.

Cheers,

Tim
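
For illustration only, here is a minimal sketch of the block-oriented
approach that makes bzip2 a natural fit for SMP: split the input into
independent chunks, hand each chunk to an existing single-threaded
compressor in a worker thread, and concatenate the resulting streams in
order. It assumes Python's standard bz2 module as a stand-in for libbz2.
The function name parallel_bz2_compress, the chunk size, and the worker
count are hypothetical choices for this sketch and are not taken from
pbzip2 or pigz.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data: bytes, chunk_size: int = 900_000,
                          workers: int = 4) -> bytes:
    """Compress data as a series of independent, concatenated bzip2 streams.

    Hypothetical helper for illustration; not part of any existing library.
    """
    # Split the input into fixed-size chunks. An empty input still yields
    # one (empty) chunk so the output always contains a bzip2 stream.
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the compressed streams
        # are joined in the same order as the original chunks.
        streams = list(pool.map(bz2.compress, chunks))

    return b"".join(streams)


if __name__ == "__main__":
    original = b"tar is mostly disk-bound. " * 200_000
    packed = parallel_bz2_compress(original)
    # bz2.decompress accepts concatenated streams (Python 3.3 and later).
    assert bz2.decompress(packed) == original
```

Each chunk becomes a complete bzip2 stream, so a decompressor that
handles multi-stream input recovers the original bytes, at the cost of a
small per-stream overhead. Whether this is any faster in practice depends
on the workload, so it should be measured against large files actually
read from and written to disk.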