From: Jonathan McKeown <jonathan+freebsd-questions@hst.org.za>
Organization: Health Systems Trust
To: freebsd-questions@freebsd.org
Date: Fri, 14 Sep 2007 09:30:20 +0200
User-Agent: KMail/1.7.2
References: <a9f4a3860709131016w54c12b6fy94fc2b0f286aea3d@mail.gmail.com>
	<20070913183504.GC11683@slackbox.xs4all.nl>
In-Reply-To: <20070913183504.GC11683@slackbox.xs4all.nl>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200709140930.21142.jonathan+freebsd-questions@hst.org.za>
Cc: Kurt Buff <kurt.buff@gmail.com>
Subject: Re: Scripting question

On Thursday 13 September 2007 20:35, Roland Smith wrote:
> On Thu, Sep 13, 2007 at 10:16:40AM -0700, Kurt Buff wrote:
> > I'm trying to do some text file manipulation, and it's driving me nuts.
[snip]
> > I've looked at sort and uniq, and I've googled a fair bit but can't
> > seem to find anything that would do this.
> >
> > I don't have the perl skills, though that would be ideal.
> >
> > Any help out there?
>
> #!/usr/bin/perl
> while (<>) {
>     # Assuming no whitespace in addresses; kill everything after the first
>     # space 
>     s/ .*$//; 
>     # Store the name & count in a hash
>     $names{$_}++;
> }
> # Go over the hash
> while (($name,$count) = each(%names)) {
>   if ($count == 1) {
>       # print unique names.
>       print $name, "\n";
>   }
> }

Another approach in Perl would be:

#!/usr/bin/perl
use strict;
use warnings;

my (%names, %dups);
while (<>) {
    my ($key) = split;          # first whitespace-separated field
    next unless defined $key;   # skip blank lines
    $dups{$key} = 1 if $names{$key};
    $names{$key} = 1;
}
delete @names{keys %dups};
#
# keys %names is now an unordered list of only non-repeated elements
# keys %dups is an unordered list of only repeated elements
print "$_\n" for keys %names;

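With no arguments, split splits $_ on whitespace (ignoring any leading
whitespace) and returns a list of fields, which can be assigned to a list
of variables. Here we only want to capture the first field: split is more
efficient for this than using a regex. The first occurrence of $key is in
parens because it's actually a list of one variable name: that puts the
assignment in list context, so $key receives the first field. Without the
parens, split would be called in scalar context and $key would get the
number of fields instead.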
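
To make the list-versus-scalar point concrete, here is a small sketch. The
helper names first_field and field_count and the sample line are mine, purely
for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first whitespace-separated field.
# The parens around $first put the assignment in list context.
sub first_field {
    my ($line) = @_;
    my ($first) = split ' ', $line;
    return $first;
}

# Hypothetical helper: the same split in scalar context returns the
# number of fields rather than the first one.
sub field_count {
    my ($line) = @_;
    my $count = split ' ', $line;
    return $count;
}

# Made-up sample line, not taken from the original poster's data.
my $line = "alice\@example.com 2007-09-13 ok";
print "first=", first_field($line), " count=", field_count($line), "\n";
# prints: first=alice@example.com count=3
```

On the Perl versions current in 2007, the scalar-context form also splits
into @_ and warns about it under "use warnings", which is one more reason to
keep the parens.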

We build two hashes. The first, %names, is keyed by the original names
(this is the classic way to reduce duplicates to single occurrences, since
the duplicated keys overwrite the originals). The second, %dups, has as its
keys those names that were already in %names when we saw them again: the
duplicated entries. Having done that, we use a hash slice to delete from
%names all the keys of %dups. That leaves the keys of %names holding all
the entries which only appear once (and the keys of %dups holding all the
duplicated entries, if that's useful).
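
As a rough illustration of the hash-slice delete, the same logic can be
wrapped in a sub and run on a few made-up lines. The sub name partition_keys
and the sample addresses are assumptions of mine, and the sorting is only
there to make the output predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the two-hash approach: takes a list of
# lines, returns two array refs (names seen once, names seen repeatedly),
# each sorted.
sub partition_keys {
    my @lines = @_;
    my (%names, %dups);
    for my $line (@lines) {
        my ($key) = split ' ', $line;
        next unless defined $key;          # skip blank lines
        $dups{$key} = 1 if $names{$key};   # seen before: it's a duplicate
        $names{$key} = 1;
    }
    # Hash slice: delete every key of %dups from %names in one statement.
    delete @names{ keys %dups };
    return ([sort keys %names], [sort keys %dups]);
}

# Made-up sample input, not from the original thread.
my ($unique, $repeated) = partition_keys(
    "a\@example.com first",
    "b\@example.com second",
    "a\@example.com third",
    "c\@example.com fourth",
);
print "unique: @$unique\n";      # unique: b@example.com c@example.com
print "repeated: @$repeated\n";  # repeated: a@example.com
```

Deleting a slice for which some keys don't exist is harmless, so the delete
needs no special case when there are no duplicates at all.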

Jonathan