From owner-freebsd-questions@FreeBSD.ORG Wed Jan 3 18:55:22 2007
Date: Wed, 3 Jan 2007 10:28:50 -0800
From: "Kurt Buff" <kurt.buff@gmail.com>
To: "Giorgos Keramidas"
Cc: freebsd-questions@freebsd.org
Subject: Re: Batch file question - average size of file in directory
In-Reply-To: <20070103013416.GA1161@kobe.laptop>
References: <20070103013416.GA1161@kobe.laptop>
List-Id: User questions

On 1/2/07, Giorgos Keramidas wrote:
> On 2007-01-02 10:20, Kurt Buff wrote:
>
> You can probably use awk(1) or perl(1) to post-process the output of
> gzip(1).
>
> The gzip(1) utility, when run with the -cd options, will uncompress the
> compressed files and send the uncompressed data to standard output,
> without actually affecting the on-disk copy of the compressed data.
>
> It is easy then to pipe the uncompressed data to wc(1) to count the
> 'bytes' of the uncompressed data:
>
>     for fname in *.Z *.z *.gz; do
>         if test -f "${fname}"; then
>             gzip -cd "${fname}" | wc -c
>         fi
>     done
>
> This will print the byte-size of the uncompressed output of gzip, for
> all the files which are currently compressed. Something like the
> following could be its output:

I put together this one-liner after perusing 'man zcat':

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l >> out.txt

It puts out multiple instances of stuff like this:

    compressed  uncompr.  ratio  uncompressed_name
          1508      3470  57.0%  stuff-7f+BIOFX1-qX
          1660      3576  54.0%  stuff-bsFK-yGcWyCm
          9113     17065  46.7%  stuff-os1MKlKGu8ky
           ...       ...    ...
      10214796  17845081  42.7%  (totals)
    compressed  uncompr.  ratio  uncompressed_name
          7790     14732  47.2%  stuff-Z3UO7-uvMANd
          1806      3705  51.7%  stuff-9ADk-DSBFQGQ
          9020     16638  45.8%  stuff-Caqfgao-Tc5F
          7508     14361  47.8%  stuff-kVUWa8ua4zxc

I'm thinking that piping the output like so:

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | \
        grep -v compress | grep -v totals

will do to suppress the extraneous header/footer info.

> This can be piped into awk(1) for further processing, with something
> like this:
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>         min = -1; max = 0; total = 0;
>     }
>     {
>         total += $1;
>         if ($1 > max) {
>             max = $1;
>         }
>         if (min == -1 || $1 < min) {
>             min = $1;
>         }
>     }
>     END {
>         if (NR > 0) {
>             printf "min/avg/max file size = %d/%d/%d\n",
>                 min, total / NR, max;
>         }
>     }'
>
> With the same files as above, the output of this would be:
>
>     min/avg/max file size = 220381/1750650/3280920
>
> With a slightly modified awk(1) script, you can even print a running
> min/average/max count, following each line. Modified lines are marked
> with a pipe character (`|') in their leftmost column below. The '|'
> characters are *not* part of the script itself.
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>         min = -1; max = 0; total = 0;
> |       printf "%10s %10s %10s %10s\n",
> |           "SIZE", "MIN", "AVERAGE", "MAX";
>     }
>     {
>         total += $1;
>         if ($1 > max) {
>             max = $1;
>         }
>         if (min == -1 || $1 < min) {
>             min = $1;
>         }
> |       printf "%10d %10d %10d %10d\n",
> |           $1, min, total/NR, max;
>     }
>     END {
>         if (NR > 0) {
> |           printf "%10s %10d %10d %10d\n",
> |               "TOTAL", min, total / NR, max;
>         }
>     }'
>
> When run with the same set of two compressed files this will print:
>
>           SIZE        MIN    AVERAGE        MAX
>         220381     220381     220381     220381
>        3280920     220381    1750650    3280920
>          TOTAL     220381    1750650    3280920
>
> Please note though that with a sufficiently large set of files, awk(1)
> may fail to count the total number of bytes correctly. If this is the
> case, it should be easy to write an equivalent Perl or Python script,
> to take advantage of their big-number support.

I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.

Many thanks for the help!

Kurt
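
[Editor's sketch: the thread ends by suggesting a Perl or Python equivalent of the
awk script, to sidestep awk's floating-point totals. A minimal Python version might
look like the following; it reads one byte count per line on standard input, as
produced by the gzip loop above. The script name and helper function are my own,
not something from the original thread.]

    #!/usr/bin/env python
    # sizestats.py -- min/avg/max of one integer per input line.
    # Python integers are arbitrary precision, so a very large total
    # cannot lose precision the way awk's floating-point sums can.
    import sys

    def size_stats(lines):
        """Return (min, avg, max, count) for the integers in lines, or None."""
        sizes = [int(line) for line in lines if line.strip()]
        if not sizes:
            return None
        total = sum(sizes)
        return min(sizes), total // len(sizes), max(sizes), len(sizes)

    if __name__ == "__main__":
        stats = size_stats(sys.stdin)
        if stats is not None:
            lo, avg, hi, n = stats
            print("min/avg/max file size = %d/%d/%d" % (lo, avg, hi))

It would be fed from the same pipeline as the awk version, e.g.:

    for f in *.gz; do gzip -cd "$f" | wc -c; done | python sizestats.py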