From owner-freebsd-chat Wed Jun 27 11:58:45 2001 Delivered-To: freebsd-chat@freebsd.org Received: from heechee.tobez.org (254.adsl0.ryv.worldonline.dk [213.237.10.254]) by hub.freebsd.org (Postfix) with ESMTP id 60A9437B419 for ; Wed, 27 Jun 2001 11:58:39 -0700 (PDT) (envelope-from tobez@tobez.org) Received: by heechee.tobez.org (Postfix, from userid 1001) id 80FD4543D; Wed, 27 Jun 2001 20:58:35 +0200 (CEST) Date: Wed, 27 Jun 2001 20:58:35 +0200 From: Anton Berezin To: Dag-Erling Smorgrav Cc: Ryan Thompson , freebsd-chat@FreeBSD.ORG Subject: Re: most complex code in BSD? Message-ID: <20010627205835.A72506@heechee.tobez.org> Mail-Followup-To: Anton Berezin , Dag-Erling Smorgrav , Ryan Thompson , freebsd-chat@FreeBSD.ORG References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from des@ofug.org on Tue, Jun 26, 2001 at 09:47:45AM +0200 Sender: owner-freebsd-chat@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org On Tue, Jun 26, 2001 at 09:47:45AM +0200, Dag-Erling Smorgrav wrote: > Ryan Thompson writes: > > Classic problem with this is, even if you can understand it (or at > > least figure it out in less than a minute or so) who the hell could > > hazard a guess at the efficiency of that "algorithm"? (Before going > > to the trouble of testing it on a few million lines). I bet it isn't > > O(n) ;-) > > I strongly suspect it's O(n*log(k)), where k is the number of distinct > lines in the input. In most cases k will probably be a significant > fraction of n, so call it O(n*log(n)). I do think that in fact it is closer to O(n), if one replaces s,.,,sg with $_="", which is the obvious optimization, and which you already mentioned. The hash accesses are essentially O(1) for random enough keys and in case there are enough buckets - and there is always enough buckets in Perl since it rebalances its hashes when necessary. So it should be close to O(n). > The other (not possible in Perl) is to keep only the hash value of > each previously seen line, and discard the actual line to conserve > memory. With a good hash function, the likelihood of finding two > distinct input lines that hash to the same value should be pretty low > (though non-zero). This does not improve the time characteristics of the algorithm. It saves memory, though. \Anton. -- May the tuna salad be with you. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-chat" in the body of the message