From owner-freebsd-questions@freebsd.org Sat Apr 8 19:03:51 2017
Return-Path:
Delivered-To: freebsd-questions@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 916F3D34BDA
 for ; Sat, 8 Apr 2017 19:03:51 +0000 (UTC)
 (envelope-from freebsd@edvax.de)
Received: from mailrelay15.qsc.de (mailrelay15.qsc.de [212.99.187.254])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "*.antispameurope.com", Issuer "TeleSec ServerPass DE-2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 18939171
 for ; Sat, 8 Apr 2017 19:03:50 +0000 (UTC)
 (envelope-from freebsd@edvax.de)
Received: from mx01.qsc.de ([213.148.129.14])
 by mailrelay15.qsc.de; Sat, 08 Apr 2017 21:03:41 +0200
Received: from r56.edvax.de (port-92-195-127-117.dynamic.qsc.de [92.195.127.117])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx01.qsc.de (Postfix) with ESMTPS id 554F33C77D;
 Sat, 8 Apr 2017 21:03:40 +0200 (CEST)
Received: from r56.edvax.de (localhost [127.0.0.1])
 by r56.edvax.de (8.14.5/8.14.5) with SMTP id v38J3doo002444;
 Sat, 8 Apr 2017 21:03:39 +0200 (CEST)
 (envelope-from freebsd@edvax.de)
Date: Sat, 8 Apr 2017 21:03:39 +0200
From: Polytropon
To: Ernie Luzar
Cc: RW , freebsd-questions@freebsd.org
Subject: Re: Is there a database built into the base system
Message-Id: <20170408210339.b3517d6a.freebsd@edvax.de>
In-Reply-To: <58E91F4D.90005@gmail.com>
References: <58E696BD.6050503@gmail.com>
 <69607026-F68C-4D9D-A826-3EFE9ECE12AB@mac.com>
 <58E69E59.6020108@gmail.com>
 <20170406210516.c63644064eb99f7b60dbd8f4@sohara.org>
 <58E6AFC0.2080404@gmail.com>
 <20170407001101.GA5885@tau1.ceti.pl>
 <20170407210629.GR2787@mailboy.kipshouse.net>
 <58E83E19.8010709@gmail.com>
 <20170408145503.69ddf649@gumby.homeunix.com>
 <58E9171F.3060405@gmail.com>
 <20170408191633.70d1f303.freebsd@edvax.de>
 <58E91F4D.90005@gmail.com>
Reply-To: Polytropon
Organization: EDVAX
X-Mailer: Sylpheed 3.1.1 (GTK+ 2.24.5; i386-portbld-freebsd8.2)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-cloud-security-sender: freebsd@edvax.de
X-cloud-security-recipient: freebsd-questions@freebsd.org
X-cloud-security-Virusscan: CLEAN
X-cloud-security-disclaimer: This E-Mail was scanned by E-Mailservice on mailrelay15.qsc.de with 93D5569FCF3
X-cloud-security-connect: mx01.qsc.de[213.148.129.14], TLS=1, IP=213.148.129.14
X-cloud-security: scantime:.2521
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: User questions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 08 Apr 2017 19:03:51 -0000

On Sat, 08 Apr 2017 13:35:09 -0400, Ernie Luzar wrote:
> Polytropon wrote:
> > On Sat, 08 Apr 2017 13:00:15 -0400, Ernie Luzar wrote:
> >> Here is my first try at using awk to Read every record in the input
> >> file and drop duplicates records from output file.
> >>
> >>
> >> This what the data looks like.
> >> /etc >cat /ip.org.sorted
> >> 1.121.136.228;
> >> 1.186.172.200;
> >> 1.186.172.210;
> >> 1.186.172.218;
> >> 1.186.172.218;
> >> 1.186.172.218;
> >> 1.34.169.204;
> >> 101.109.155.81;
> >> 101.109.155.81;
> >> 101.109.155.81;
> >> 101.109.155.81;
> >> 104.121.89.129;
> >
> > Why not simply use "sort | uniq" to eliminate duplicates?
> >
> >
> >
> >> /etc >cat /root/bin/ipf.table.awk.dup
> >> #! /bin/sh
> >>
> >> file_in="/ip.org.sorted"
> >> file_out="/ip.no-dups"
> >>
> >> awk '{ in_ip = $1 }'
> >> END { (if in_ip = prev_ip)
> >> next
> >> else
> >> prev_ip > $file_out
> >> prev_ip = in_ip
> >> } $file_in
> >>
> >> When I run this script it just hangs there. I have to ctrl/c to break
> >> out of it. What is wrong with my awk command?
> >
> > For each line, you store the 1st field (in this case, the entire
> > line) in in_ip, and you overwrite (!) that variable with each new
> > line.
> > At the end of the file (!!!) you make a comparison and even
> > request the next data line. Additionally, keep an eye on the quotes
> > you use: '...' will keep the $ in $file_out, that's now a variable
> > inside awk which is empty. The '...' close before END, so outside
> > of awk. Remember that awk reads from standard input, so your
> > redirection for the input file would need to be "< $file_in",
> > or useless use of cat, "cat $file_in | awk > $file_out".
> >
> > In your specific case, I'd say not that awk is the wrong tool.
> > If you simply want to eliminate duplicates, use the classic
> > UNIX approach "sort | uniq". Both tools are part of the OS.
> >
>
> The awk script I posted is a learning tool. I know about "sort | uniq"
>
> I though "end" was end of line not end of file. So how should that awk
> command look to drop dups from the out put file?

In that case, I'd suggest dropping the sh wrapper and putting
everything into the awk file, for learning purposes. :-)

Here is a suggestion with comments.

    #!/usr/bin/awk -f

    BEGIN {
        # input file name
        ARGV[1] = "/ip.org.sorted"
        ARGC = 2

        # output file name
        file_out = "/ip.no-dups"

        # reset output file
        printf("") > file_out

        # temporary ip
        temp = ""
    }

    # process all lines which are not empty
    (length > 0) {
        # remove ; at the end of the line
        gsub(";", "", $0)

        # new ip?
        if (temp != $1) {
            printf("%s\n", $1) >> file_out
            temp = $1
        }
        # ip already known, do not output anything, "empty else branch"
    }

As you can see, you don't need an END block when you're not going
to do anything at the end of the processing, i.e., after EOF on
input.

You can match lines against several patterns. Example:

    { ... }           -> process all lines

    (cond) { ... }    -> process line if condition "cond" is true

    /regex/ { ... }   -> process line if regular expression "regex" matches

You can of course combine patterns, for example:

    /^[^#]/ && (length > 100) { ... }

That would process all lines not starting with a # which are longer
than 100 characters.
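To see the pattern forms side by side, here is a minimal sketch. The
four sample lines and the function name show_patterns are made up for
illustration, and the length threshold is lowered from 100 to 10 so
that the sample stays short:

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): feeds four
# made-up sample lines through awk and reports which pattern matched.
show_patterns() {
    printf '%s\n' '# a comment' 'short' 'a rather long line' '' |
    awk '
    # no pattern: runs for every line, including the empty one
    { total++ }

    # condition pattern: runs only for non-empty lines
    (length > 0) { nonempty++ }

    # regex pattern: runs for lines whose first character is "#"
    /^#/ { print "comment: " $0 }

    # combined pattern: not starting with "#" AND longer than 10 characters
    /^[^#]/ && (length > 10) { print "long:    " $0 }

    # END runs once, after EOF on input
    END { print "total=" total " nonempty=" nonempty }
    '
}

show_patterns
```

With that sample input, the comment line and the 18-character line
are reported, and the END block prints total=4 nonempty=3. Note that
several pattern-action pairs can fire for the same input line; awk
tries each of them in order.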
In my example, I wanted to show how it is possible to have awk
"instead of" sh. But keep in mind using awk as a filter is
much better. For illustration:

    #!/bin/sh

    file_in="/ip.org.sorted"
    file_out="/ip.no-dups"

    cat ${file_in} | awk '
    BEGIN {
        # temporary ip
        temp = ""
    }

    # process all lines which are not empty
    (length > 0) {
        # remove ; at the end of the line
        gsub(";", "", $0)

        # new ip?
        if (temp != $1) {
            printf("%s\n", $1)
            temp = $1
        }
        # ip already known, do not output anything, "empty else branch"
    }
    ' > ${file_out}

As you can see now, the awk code is inside the sh wrapper. If it
was in a separate file, you'd probably do something like this:

    #!/bin/sh

    file_in="/ip.org.sorted"
    file_out="/ip.no-dups"

    cat ${file_in} | awk -f remove_dup_ip.awk > ${file_out}

This again is a nice illustration of "useless use of cat". ;-)

The form

    awk -f remove_dup_ip.awk < ${file_in} > ${file_out}

is probably more efficient, but it "breaks" the idea of the
pipeline where you can add or remove steps, supply testing
data instead of the real data, or create temporary results
for later comparison.

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...