From: Rolf Nielsen
Date: Sat, 07 May 2011 13:14:53 +0200
To: Robert Bonomi
Cc: freebsd-questions@FreeBSD.org
Subject: Re: Comparing two lists
Message-ID: <4DC529AD.5080906@lazlarlyricon.com>
In-Reply-To: <201105070528.p475SvZ8093849@mail.r-bonomi.com>

2011-05-07 07:28, Robert Bonomi wrote:
>> From listreader@lazlarlyricon.com Fri May 6 20:14:09 2011
>> Date: Sat, 07 May 2011 03:13:39 +0200
>> From: Rolf Nielsen
>> To: Robert Bonomi
>> CC: freebsd-questions@freebsd.org
>> Subject: Re: Comparing two lists
>>
>> 2011-05-07 02:54, Robert Bonomi wrote:
>>>> From owner-freebsd-questions@freebsd.org Fri May 6 19:27:54 2011
>>>> Date: Sat, 07 May 2011 02:09:26 +0200
>>>> From: Rolf Nielsen
>>>> To: FreeBSD
>>>> Subject: Comparing two lists
>>>>
>>>> Hello all,
>>>>
>>>> I have two text files, quite extensive ones. They have some lines in
>>>> common and some lines are unique to one of the files. The lines that do
>>>> exist in both files are not necessarily in the same location. Now I need
>>>> to compare the files and output a list of lines that exist in both
>>>> files. Is there a simple way to do this? diff? awk? sed? cmp? Or a
>>>> combination of two or more of them?
>>>
>>> If the files have only 'minor' differences -- i.e. no long runs of lines
>>> that are in only one file -- *and* the common lines are in the same order
>>> in each file, you can use diff(1), without any other shenanigans.
>>>
>>> If the above is -not- true, and if you need _only_ the common lines, AND
>>> order is not important, then sort(1) both files, and use diff(1) on the
>>> two sorted versions.
>>>
>>> Beyond that it depends on what you mean by 'extensive' ones. Megabytes?
>>> Gigabytes? Or what??
>>
>> Some 10,000 to 20,000 lines each. I do need only the common lines. Order
>> is not essential, but would make life easier. I've tried a little with
>> uniq, as suggested by Polyptron, but I guess 3am is not quite the right
>> time to do these things. Anyway, thanks.
>
> Ok, 20k lines is only a medium-size file. There's no problem in fitting
> the entire file 'in memory'. ('big' files are ones that are larger than
> available memory. :)

By "quite extensive" I was referring to the number of lines rather than
the byte size, and 20k lines is, by my standards, quite a lot for a plain
text file. :P But that's beside the point. :)

> Using uniq:
>    sort {{file1}} {{file2}} |uniq -d

Yes, I found that solution on

http://www.catonmat.net/blog/set-operations-in-unix-shell

which is mainly about comm, but also lists other ways of doing things. I
also found

   grep -xF -f file1 file2

there, and I've tested that one too. Both seem to be doing what I want.

> to maintain order, put the following in a file, call it 'common.awk'
>
>    NR==FNR { array[$0]=1; next; }
>    { if (array[$0] == 1) print $0; }
>
> then use the command:
>
>    awk -f common.awk {{file1}} {{file2}}
>
> This will output common lines, in the order they occur in _file2_.

I took the liberty of sending a copy of this to the list although you
replied privately.
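
For reference, here is a minimal sketch that runs the three approaches
from this thread side by side on two tiny made-up files. The file
contents, the temporary directory and the function names (common_uniq,
common_grep, common_awk) are assumptions for illustration only and are
not from the thread. One thing it tries to show: "sort file1 file2 |
uniq -d" also reports a line that is merely repeated inside one file, so
the sketch deduplicates each file first.

```shell
#!/bin/sh
# Hypothetical sample data; "pear" is repeated inside file1 only.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' pear apple kiwi pear > "$tmp/file1"
printf '%s\n' kiwi plum apple      > "$tmp/file2"

# uniq -d prints one copy of every line that occurs more than once in its
# (sorted) input. Deduplicating each file first keeps a line repeated
# within a single file from being reported as common. Output is sorted.
common_uniq() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

# -f reads patterns from the first file, -F treats them as fixed strings,
# -x requires a whole-line match. Output follows the second file's order.
common_grep() {
    grep -xF -f "$1" "$2"
}

# Same idea as common.awk above: remember every line of the first file,
# then print lines of the second file that were seen.
common_awk() {
    awk 'NR == FNR { seen[$0] = 1; next } $0 in seen' "$1" "$2"
}

sort "$tmp/file1" "$tmp/file2" | uniq -d   # apple, kiwi, pear (pear is a false hit)
common_uniq "$tmp/file1" "$tmp/file2"      # apple, kiwi
common_grep "$tmp/file1" "$tmp/file2"      # kiwi, apple
common_awk  "$tmp/file1" "$tmp/file2"      # kiwi, apple
```

The grep and awk variants keep the order of the second file, while the
uniq variant returns the common lines in sorted order, so which one fits
depends on whether order matters.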