From owner-freebsd-questions@FreeBSD.ORG Wed Jan 3 12:55:52 2007
Date: Wed, 3 Jan 2007 03:34:17 +0200
From: Giorgos Keramidas <keramida@ceid.upatras.gr>
To: Kurt Buff
Cc: freebsd-questions@freebsd.org
Message-ID: <20070103013416.GA1161@kobe.laptop>
Subject: Re: Batch file question - average size of file in directory
On 2007-01-02 10:20, Kurt Buff wrote:
> All,
>
> I don't even have a clue how to start this one, so am looking for a
> little help.
>
> I've got a directory with a large number of gzipped files in it (over
> 110k) along with a few thousand uncompressed files.
>
> I'd like to find the average uncompressed size of the gzipped files,
> and ignore the uncompressed files.
>
> How on earth would I go about doing that with the default shell (no
> bash or other shells installed), or in perl, or something like that.
> I'm no scripter of any great expertise, and am just stumbling over
> this trying to find an approach.

You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).  The gzip(1) utility, when run with the -cd options, will
uncompress the compressed files and send the uncompressed data to
standard output, without actually affecting the on-disk copy of the
compressed data.  It is then easy to pipe the uncompressed data to
wc(1) to count the bytes of the uncompressed data:

    for fname in *.Z *.z *.gz; do
        if test -f "${fname}"; then
            gzip -cd "${fname}" | wc -c
        fi
    done

This will print the byte size of the uncompressed output of gzip(1),
for each of the files which are currently compressed.  Something like
the following could be its output:

    220381
    3280920

This can be piped into awk(1) for further processing, with something
like this:

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1;
        max = 0;
        total = 0;
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
    }
    END {
        if (NR > 0) {
            printf "min/avg/max file size = %d/%d/%d\n",
                min, total / NR, max;
        }
    }'

With the same files as above, the output of this would be:

    min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count after each line.
Modified lines are marked with a pipe character (`|') in their leftmost
column below.  The '|' characters are *not* part of the script itself.

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1;
        max = 0;
        total = 0;
|       printf "%10s %10s %10s %10s\n",
|           "SIZE", "MIN", "AVERAGE", "MAX";
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
|       printf "%10d %10d %10d %10d\n",
|           $1, min, total/NR, max;
    }
    END {
        if (NR > 0) {
|           printf "%10s %10d %10d %10d\n",
|               "TOTAL", min, total / NR, max;
        }
    }'

When run with the same set of two compressed files, this will print:

          SIZE        MIN    AVERAGE        MAX
        220381     220381     220381     220381
       3280920     220381    1750650    3280920
         TOTAL     220381    1750650    3280920

Please note, though, that with a sufficiently large set of files,
awk(1) may fail to count the total number of bytes correctly.  If this
is the case, it should be easy to write an equivalent Perl or Python
script, to take advantage of their big-number support.
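For what it's worth, a Python version of the same min/average/max
calculation could look like the sketch below.  This is only an
illustration of the idea, not a drop-in replacement for the shell
pipeline above: the stats() helper and the chunked-read loop are my
own choices, and file selection and error handling are left out.
Python integers are arbitrary-precision, so the running total cannot
overflow no matter how many files you feed it.

    #!/usr/bin/env python
    # Sketch of the "equivalent Python script" mentioned above.
    # Decompresses each gzip file in a stream and counts the bytes,
    # like `gzip -cd file | wc -c`, then reports min/avg/max.
    import gzip
    import sys

    def stats(filenames):
        """Return (min, avg, max) of the uncompressed sizes,
        or None if no files were given."""
        sizes = []
        for name in filenames:
            with gzip.open(name, "rb") as f:
                nbytes = 0
                # Read in fixed-size chunks so a huge file does not
                # have to fit in memory all at once.
                while True:
                    chunk = f.read(65536)
                    if not chunk:
                        break
                    nbytes += len(chunk)
            sizes.append(nbytes)
        if not sizes:
            return None
        # Integer average, matching awk's %d formatting above.
        return (min(sizes), sum(sizes) // len(sizes), max(sizes))

    if __name__ == "__main__":
        result = stats(sys.argv[1:])
        if result is not None:
            print("min/avg/max file size = %d/%d/%d" % result)

You would run it as e.g. `python gzsizes.py *.gz`, letting the shell
expand the glob, so only the aggregation moves into Python.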