From owner-freebsd-questions@FreeBSD.ORG Wed Jan 3 18:55:22 2007
Date: Wed, 3 Jan 2007 10:28:50 -0800
From: "Kurt Buff" <kurt.buff@gmail.com>
To: "Giorgos Keramidas"
Cc: freebsd-questions@freebsd.org
Subject: Re: Batch file question - average size of file in directory
In-Reply-To: <20070103013416.GA1161@kobe.laptop>
References: <20070103013416.GA1161@kobe.laptop>
List-Id: User questions

On 1/2/07, Giorgos Keramidas wrote:
> On 2007-01-02 10:20, Kurt Buff wrote:
>
> You can probably use awk(1) or perl(1) to post-process the output of
> gzip(1).
>
> The gzip(1) utility, when run with the -cd options, will uncompress the
> compressed files and send the uncompressed data to standard output,
> without actually affecting the on-disk copy of the compressed data.
>
> It is easy then to pipe the uncompressed data to wc(1) to count the
> 'bytes' of the uncompressed data:
>
>     for fname in *.Z *.z *.gz; do
>         if test -f "${fname}"; then
>             gzip -cd "${fname}" | wc -c
>         fi
>     done
>
> This will print the byte-size of the uncompressed output of gzip, for
> all the files which are currently compressed. Something like the
> following could be its output:

I put together this one-liner after perusing 'man zcat':

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l >> out.txt

It puts out multiple instances of stuff like this:

    compressed  uncompr.  ratio  uncompressed_name
          1508      3470  57.0%  stuff-7f+BIOFX1-qX
          1660      3576  54.0%  stuff-bsFK-yGcWyCm
          9113     17065  46.7%  stuff-os1MKlKGu8ky
           ...       ...    ...
      10214796  17845081  42.7%  (totals)
    compressed  uncompr.  ratio  uncompressed_name
          7790     14732  47.2%  stuff-Z3UO7-uvMANd
          1806      3705  51.7%  stuff-9ADk-DSBFQGQ
          9020     16638  45.8%  stuff-Caqfgao-Tc5F
          7508     14361  47.8%  stuff-kVUWa8ua4zxc

I'm thinking that piping the output like so:

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | \
        grep -v compress | grep -v totals

will do to suppress the extraneous header/footer info.

> This can be piped into awk(1) for further processing, with something
> like this:
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>         min = -1; max = 0; total = 0;
>     }
>     {
>         total += $1;
>         if ($1 > max) {
>             max = $1;
>         }
>         if (min == -1 || $1 < min) {
>             min = $1;
>         }
>     }
>     END {
>         if (NR > 0) {
>             printf "min/avg/max file size = %d/%d/%d\n",
>                 min, total / NR, max;
>         }
>     }'
>
> With the same files as above, the output of this would be:
>
>     min/avg/max file size = 220381/1750650/3280920
>
> With a slightly modified awk(1) script, you can even print a running
> min/average/max count, following each line. Modified lines are marked
> with a pipe character (`|') in their leftmost column below. The '|'
> characters are *not* part of the script itself.
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>         min = -1; max = 0; total = 0;
> |       printf "%10s %10s %10s %10s\n",
> |           "SIZE", "MIN", "AVERAGE", "MAX";
>     }
>     {
>         total += $1;
>         if ($1 > max) {
>             max = $1;
>         }
>         if (min == -1 || $1 < min) {
>             min = $1;
>         }
> |       printf "%10d %10d %10d %10d\n",
> |           $1, min, total/NR, max;
>     }
>     END {
>         if (NR > 0) {
> |           printf "%10s %10d %10d %10d\n",
> |               "TOTAL", min, total / NR, max;
>         }
>     }'
>
> When run with the same set of two compressed files this will print:
>
>           SIZE        MIN    AVERAGE        MAX
>         220381     220381     220381     220381
>        3280920     220381    1750650    3280920
>          TOTAL     220381    1750650    3280920
>
> Please note though that with a sufficiently large set of files, awk(1)
> may fail to count the total number of bytes correctly. If this is the
> case, it should be easy to write an equivalent Perl or Python script,
> to take advantage of their big-number support.

I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.

Many thanks for the help!

Kurt
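
[Editor's sketch: the thread ends by suggesting a Perl or Python equivalent of the
awk script, to sidestep awk's floating-point totals. A minimal Python version might
look like the following; it reads one byte count per line on standard input, as
produced by the gzip loop above. The script name and helper function are my own,
not something from the original thread.]

    #!/usr/bin/env python
    # sizestats.py -- min/avg/max of one integer per input line.
    # Python integers are arbitrary precision, so a very large total
    # cannot lose precision the way awk's floating-point sums can.
    import sys

    def size_stats(lines):
        """Return (min, avg, max, count) for the integers in lines, or None."""
        sizes = [int(line) for line in lines if line.strip()]
        if not sizes:
            return None
        total = sum(sizes)
        return min(sizes), total // len(sizes), max(sizes), len(sizes)

    if __name__ == "__main__":
        stats = size_stats(sys.stdin)
        if stats is not None:
            lo, avg, hi, n = stats
            print("min/avg/max file size = %d/%d/%d" % (lo, avg, hi))

It would be fed from the same pipeline as the awk version, e.g.:

    for f in *.gz; do gzip -cd "$f" | wc -c; done | python sizestats.py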