From owner-freebsd-chat  Wed Jun 27 11:58:45 2001
Delivered-To: freebsd-chat@freebsd.org
Received: from heechee.tobez.org (254.adsl0.ryv.worldonline.dk [213.237.10.254])
	by hub.freebsd.org (Postfix) with ESMTP id 60A9437B419
	for <freebsd-chat@FreeBSD.ORG>; Wed, 27 Jun 2001 11:58:39 -0700 (PDT)
	(envelope-from tobez@tobez.org)
Received: by heechee.tobez.org (Postfix, from userid 1001)
	id 80FD4543D; Wed, 27 Jun 2001 20:58:35 +0200 (CEST)
Date: Wed, 27 Jun 2001 20:58:35 +0200
From: Anton Berezin <tobez@tobez.org>
To: Dag-Erling Smorgrav <des@ofug.org>
Cc: Ryan Thompson <ryan@sasknow.com>, freebsd-chat@FreeBSD.ORG
Subject: Re: most complex code in BSD?
Message-ID: <20010627205835.A72506@heechee.tobez.org>
Mail-Followup-To: Anton Berezin <tobez@tobez.org>,
	Dag-Erling Smorgrav <des@ofug.org>, Ryan Thompson <ryan@sasknow.com>,
	freebsd-chat@FreeBSD.ORG
References: <Pine.BSF.4.21.0106260042370.82644-100000@ren.sasknow.com> <xzpzoavadi6.fsf@flood.ping.uio.no>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <xzpzoavadi6.fsf@flood.ping.uio.no>; from des@ofug.org on Tue, Jun 26, 2001 at 09:47:45AM +0200
Sender: owner-freebsd-chat@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-chat.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo?subject=subscribe%20freebsd-chat>
List-Unsubscribe: <mailto:majordomo?subject=unsubscribe%20freebsd-chat>
X-Loop: FreeBSD.org

On Tue, Jun 26, 2001 at 09:47:45AM +0200, Dag-Erling Smorgrav wrote:
> Ryan Thompson <ryan@sasknow.com> writes:
> > Classic problem with this is, even if you can understand it (or at
> > least figure it out in less than a minute or so) who the hell could
> > hazard a guess at the efficiency of that "algorithm"? (Before going
> > to the trouble of testing it on a few million lines). I bet it isn't
> > O(n) ;-)
> 
> I strongly suspect it's O(n*log(k)), where k is the number of distinct
> lines in the input.  In most cases k will probably be a significant
> fraction of n, so call it O(n*log(n)).

I do think that in fact it is closer to O(n), if one replaces s,.,,sg
with $_="", which is the obvious optimization, and which you already
mentioned.  The hash accesses are essentially O(1) for random enough
keys and in case there are enough buckets - and there is always enough
buckets in Perl since it rebalances its hashes when necessary.  So it
should be close to O(n).

> The other (not possible in Perl) is to keep only the hash value of
> each previously seen line, and discard the actual line to conserve
> memory.  With a good hash function, the likelihood of finding two
> distinct input lines that hash to the same value should be pretty low
> (though non-zero).

This does not improve the time characteristics of the algorithm.  It
saves memory, though.

\Anton.
-- 
May the tuna salad be with you.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message