Date:        Wed, 3 Jan 2007 03:34:17 +0200
From:        Giorgos Keramidas <keramida@ceid.upatras.gr>
To:          Kurt Buff <kurt.buff@gmail.com>
Cc:          freebsd-questions@freebsd.org
Subject:     Re: Batch file question - average size of file in directory
Message-ID:  <20070103013416.GA1161@kobe.laptop>
In-Reply-To: <a9f4a3860701021020g1468af4ah26c8a5fe90610719@mail.gmail.com>
References:  <a9f4a3860701021020g1468af4ah26c8a5fe90610719@mail.gmail.com>
On 2007-01-02 10:20, Kurt Buff <kurt.buff@gmail.com> wrote:
> All,
>
> I don't even have a clue how to start this one, so am looking for a
> little help.
>
> I've got a directory with a large number of gzipped files in it (over
> 110k) along with a few thousand uncompressed files.
>
> I'd like to find the average uncompressed size of the gzipped files,
> and ignore the uncompressed files.
>
> How on earth would I go about doing that with the default shell (no
> bash or other shells installed), or in perl, or something like that.
> I'm no scripter of any great expertise, and am just stumbling over
> this trying to find an approach.

You can probably use awk(1) or perl(1) to post-process the output of
gzip(1).  The gzip(1) utility, when run with the -cd options, will
uncompress the compressed files and send the uncompressed data to
standard output, without affecting the on-disk copy of the compressed
data.  It is then easy to pipe the uncompressed data to wc(1) to count
the bytes of the uncompressed data:

    for fname in *.Z *.z *.gz; do
        if test -f "${fname}"; then
            gzip -cd "${fname}" | wc -c
        fi
    done

This will print the byte size of the uncompressed output of gzip for
all the files which are currently compressed.  Its output could look
something like this:

     220381
    3280920

This can be piped into awk(1) for further processing, with something
like this:

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
    }
    END {
        if (NR > 0) {
            printf "min/avg/max file size = %d/%d/%d\n",
                min, total / NR, max;
        }
    }'

With the same files as above, the output of this would be:

    min/avg/max file size = 220381/1750650/3280920

With a slightly modified awk(1) script, you can even print a running
min/average/max count after each line.  The modified lines are marked
with a pipe character (`|') in their leftmost column below; the '|'
characters are *not* part of the script itself.

    for fname in *.Z *.gz; do
        if test -f "$fname"; then
            gzip -cd "$fname" | wc -c
        fi
    done | \
    awk 'BEGIN {
        min = -1; max = 0; total = 0;
  |     printf "%10s %10s %10s %10s\n",
  |         "SIZE", "MIN", "AVERAGE", "MAX";
    }
    {
        total += $1;
        if ($1 > max) {
            max = $1;
        }
        if (min == -1 || $1 < min) {
            min = $1;
        }
  |     printf "%10d %10d %10d %10d\n",
  |         $1, min, total/NR, max;
    }
    END {
        if (NR > 0) {
  |         printf "%10s %10d %10d %10d\n",
  |             "TOTAL", min, total / NR, max;
        }
    }'

When run with the same set of two compressed files, this will print:

          SIZE        MIN    AVERAGE        MAX
        220381     220381     220381     220381
       3280920     220381    1750650    3280920
         TOTAL     220381    1750650    3280920

Please note, though, that with a sufficiently large set of files awk(1)
may fail to count the total number of bytes correctly, since awk stores
numbers as double-precision floating point and a very large total can
lose precision.  If this is the case, it should be easy to write an
equivalent Perl or Python script that takes advantage of their
big-number support.
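For example, here is a minimal Python sketch of that idea.  It assumes
only that gzip(1) is in the PATH and that the compressed files match
the same *.Z, *.z and *.gz patterns as above; Python integers grow as
needed, so the running total stays exact however many files there are:

    #!/usr/bin/env python
    # Minimal sketch: min/avg/max uncompressed size of the gzipped
    # files in the current directory, with exact integer arithmetic.
    import glob
    import subprocess

    sizes = []
    for fname in glob.glob('*.Z') + glob.glob('*.z') + glob.glob('*.gz'):
        # Like `gzip -cd "$fname" | wc -c`: count the bytes gzip writes
        # to stdout, without touching the on-disk compressed file.
        proc = subprocess.Popen(['gzip', '-cd', fname],
                                stdout=subprocess.PIPE)
        nbytes = 0
        while True:
            chunk = proc.stdout.read(65536)
            if not chunk:
                break
            nbytes += len(chunk)
        proc.wait()
        sizes.append(nbytes)

    if sizes:
        print('min/avg/max file size = %d/%d/%d'
              % (min(sizes), sum(sizes) // len(sizes), max(sizes)))

Run from the directory with the compressed files, this prints the same
min/avg/max line as the first awk script, but the total is never
rounded.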