Date: Wed, 3 Jan 2007 10:28:50 -0800
From: "Kurt Buff" <kurt.buff@gmail.com>
To: "Giorgos Keramidas" <keramida@ceid.upatras.gr>
Cc: freebsd-questions@freebsd.org
Subject: Re: Batch file question - average size of file in directory
Message-ID: <a9f4a3860701031028w2af80416k20f7abf46eaa9a81@mail.gmail.com>
In-Reply-To: <20070103013416.GA1161@kobe.laptop>
References: <a9f4a3860701021020g1468af4ah26c8a5fe90610719@mail.gmail.com> <20070103013416.GA1161@kobe.laptop>

On 1/2/07, Giorgos Keramidas <keramida@ceid.upatras.gr> wrote:
> On 2007-01-02 10:20, Kurt Buff <kurt.buff@gmail.com> wrote:
> You can probably use awk(1) or perl(1) to post-process the output of
> gzip(1).
>
> The gzip(1) utility, when run with the -cd options, will uncompress the
> compressed files and send the uncompressed data to standard output,
> without actually affecting the on-disk copy of the compressed data.
>
> It is easy then to pipe the uncompressed data to wc(1) to count the
> 'bytes' of the uncompressed data:
>
>     for fname in *.Z *.z *.gz; do
>         if test -f "${fname}"; then
>             gzip -cd "${fname}" | wc -c
>         fi
>     done
>
> This will print the byte-size of the uncompressed output of gzip, for
> all the files which are currently compressed.  Something like the
> following could be its output:

I put together this one-liner after perusing 'man zcat':

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l >> out.txt

It puts out multiple instances of stuff like this:

    compressed  uncompr. ratio uncompressed_name
          1508      3470 57.0% stuff-7f+BIOFX1-qX
          1660      3576 54.0% stuff-bsFK-yGcWyCm
          9113     17065 46.7% stuff-os1MKlKGu8ky
           ...       ...   ...
      10214796  17845081 42.7% (totals)
    compressed  uncompr. ratio uncompressed_name
          7790     14732 47.2% stuff-Z3UO7-uvMANd
          1806      3705 51.7% stuff-9ADk-DSBFQGQ
          9020     16638 45.8% stuff-Caqfgao-Tc5F
          7508     14361 47.8% stuff-kVUWa8ua4zxc

I'm thinking that piping the output like so:

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | grep -v compress | grep -v totals

will do to suppress the extraneous header/footer info.

> This can be piped into awk(1) for further processing, with something
> like this:
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>         min = -1; max = 0; total = 0;
>     }
>     {
>         total += $1;
>         if ($1 > max) {
>             max = $1;
>         }
>         if (min == -1 || $1 < min) {
>             min = $1;
>         }
>     }
>     END {
>         if (NR > 0) {
>             printf "min/avg/max file size = %d/%d/%d\n",
>                 min, total / NR, max;
>         }
>     }'
>
> With the same files as above, the output of this would be:
>
>     min/avg/max file size = 220381/1750650/3280920
>
> With a slightly modified awk(1) script, you can even print a running
> min/average/max count, following each line.  Modified lines are marked
> with a pipe character (`|') in their leftmost column below.  The '|'
> characters are *not* part of the script itself.
>
>     for fname in *.Z *.gz; do
>         if test -f "$fname"; then
>             gzip -cd "$fname" | wc -c
>         fi
>     done | \
>     awk 'BEGIN {
>         min = -1; max = 0; total = 0;
>   |     printf "%10s %10s %10s %10s\n",
>   |         "SIZE", "MIN", "AVERAGE", "MAX";
>     }
>     {
>         total += $1;
>         if ($1 > max) {
>             max = $1;
>         }
>         if (min == -1 || $1 < min) {
>             min = $1;
>         }
>   |     printf "%10d %10d %10d %10d\n",
>   |         $1, min, total/NR, max;
>     }
>     END {
>         if (NR > 0) {
>   |         printf "%10s %10d %10d %10d\n",
>   |             "TOTAL", min, total / NR, max;
>         }
>     }'
>
> When run with the same set of two compressed files this will print:
>
>           SIZE        MIN    AVERAGE        MAX
>         220381     220381     220381     220381
>        3280920     220381    1750650    3280920
>          TOTAL     220381    1750650    3280920
>
> Please note though that with a sufficiently large set of files, awk(1)
> may fail to count the total number of bytes correctly.  If this is the
> case, it should be easy to write an equivalent Perl or Python script,
> to take advantage of their big-number support.

I'll try to parse and understand this, and see if I can modify it to
suit the output I'm currently generating.

Many thanks for the help!

Kurt
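
One way the two approaches above might be combined, as a minimal sketch only: feed the `zcat -l` listing straight into an awk(1) filter modelled on the quoted script, so the header and "(totals)" lines are skipped inside awk instead of with two grep(1) calls. The function name `summarize_gzip_list` and the exact wording of the summary line are made up for illustration; the sketch also assumes the four-column listing layout shown above, with the uncompressed size in the second column.

```shell
# Hypothetical helper: reads a `zcat -l` style listing on standard input
# and prints the count, minimum, average and maximum uncompressed size.
summarize_gzip_list() {
    awk '
        # Skip the repeated header lines and the per-batch "(totals)" lines.
        $1 == "compressed" || $NF == "(totals)" { next }

        # Use only rows whose second column (uncompressed size) is a number.
        $2 ~ /^[0-9]+$/ {
            size = $2 + 0          # force a numeric value
            n++
            total += size
            if (n == 1 || size < min) min = size
            if (size > max) max = size
        }

        END {
            if (n > 0)
                printf "files=%d min/avg/max uncompressed size = %d/%d/%d\n",
                    n, min, total / n, max
        }
    '
}
```

It could then be used in place of the grep(1) stages:

    find /local/amavis/virusmails -name "*.gz" -print | xargs zcat -l | summarize_gzip_list

The average is truncated to a whole number by `%d`, as in the quoted script, and the caveat about awk(1) and very large totals applies here too.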