From: jin guojun
Date: Mon, 20 Feb 2023 17:13:27 -0800
Subject: Re: BSD-awk print() Behavior
To: Sysadmin Lists
Cc: Freebsd Questions
List-Archive: https://lists.freebsd.org/archives/freebsd-questions

Without knowing what hidden character(s) are in those files, how can one guess what happened?

hexdump -C file_{1,2} will show the real difference, which may help explain what is going on with awk's print.
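
As a minimal sketch of what to look for, assume (purely for illustration; nothing in the thread confirms it) that the hidden character is a trailing carriage return. The two files below are hypothetical stand-ins for file_1 and file_2, created in a scratch directory, and od -c is used as the POSIX counterpart of hexdump -C.

```sh
# Hypothetical stand-ins for file_1 and file_2. The trailing \r in the
# first one is an assumption made only for this illustration.
tmp=$(mktemp -d)
printf 'https://github.com/\r\n' > "$tmp/file_1"
printf 'https://github.com/\n'   > "$tmp/file_2"

# od -c prints control characters as escapes, so a carriage return is
# shown as \r instead of being acted on by the terminal.
od -c "$tmp/file_1"
od -c "$tmp/file_2"

# cmp reports the first byte offset at which the two files differ;
# it exits non-zero when they differ, hence the "|| true".
cmp "$tmp/file_1" "$tmp/file_2" || true
```

If the first dump ends in \r \n and the second in \n alone, the files differ by one byte per line. On the real files, any other unexpected byte would show up the same way.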

-Jin

On Mon, Feb 20, 2023 at 4:25 PM Sysadmin Lists <sysadmin.lists@mailfence.com> wrote:

> Trying to wrap my head around what BSD awk is doing here. Although the behavior
> is unwanted for this exercise, it seems like a possibly useful feature or hack
> for future projects. Either way I'd like to understand what's going on.
>
> I extracted a list of URLs from my browser's history sql file, and when
> iterating over the list with awk got some strange results.
>
> file_1 has the sql-extracted URLs, and file_2 is a copy-paste of that file's
> contents using vim's yank-and-paste.
>
> $ cat file_{1,2}
> https://github.com/
> https://github.com/
> https://github.com/
> https://github.com/
>
> $ diff file_{1,2}
> 1,2c1,2
> < https://github.com/
> < https://github.com/
> ---
> > https://github.com/
> > https://github.com/
>
> $ awk '{ print $0 " abc " }' file_{1,2}
>  abc ://github.com/
>  abc ://github.com/
> https://github.com/ abc
> https://github.com/ abc
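
One explanation consistent with this output, offered as an assumption and not something confirmed in the thread, is that each line of file_1 ends in a carriage return. awk would then print the URL, the \r, and the suffix; the terminal moves the cursor back to column 0 at the \r and draws " abc " over the first five characters. A quick sketch with a made-up input line:

```sh
# Assumed input: one URL terminated by \r\n instead of a bare \n.
line='https://github.com/\r\n'

# Pipe awk's output through od -c so the terminal cannot reinterpret it;
# the \r appears between the URL and the appended text.
printf "$line" | awk '{ print $0 " abc " }' | od -c

# Dropping a trailing \r before printing leaves the suffix at the end.
printf "$line" | awk '{ sub(/\r$/, ""); print $0 " abc " }'
```

If the real file_1 shows no \r under od -c or hexdump -C, this explanation does not apply and the dump should point at whatever byte is responsible.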

> The sql-extracted URLs cause awk's print() to replace the front of the string
> with text following $0. file_2 does not. I used vim's `:set list' option to
> view hidden chars, but there's no apparent difference between the two --
> although `diff' clearly thinks so. Both files show this when `list' is set:
>
> https://github.com/$
> https://github.com/$
>
>
> Here's more background if needed:
>
> I extracted the URLs using sqlite3 like so:
> for f in History-16768665*
> do
> 	sqlite3 --bail $f <<-HEREDOC
> 		.mode csv
> 		.output ${f}.csv
> 		select * from urls where url like '%github%';
> HEREDOC
> done
>
> Then tried to create a list of unique URLs using `sort -u' but it broke because
> of special chars in the extracted lines (so it claimed). I used awk to get a
> unique list instead:
>
> for f in *.csv; do [[ -s $f ]] && list="${list} $f"; done; echo $list
> awk '{ u[$0] } END { for (e in u) print e > "file_1" }' $list
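
If the .csv files do turn out to contain carriage returns (an assumption here; as far as I know sqlite3's csv mode ends rows with \r\n, but that is unverified for these files), one option is to strip them while de-duplicating. This is a sketch of a different technique from the END loop above, not a drop-in replacement: the array name `seen' and the sample input are made up, and it prints lines in first-seen order to stdout.

```sh
# Hypothetical input standing in for the concatenated .csv files:
# three rows ending in \r\n, one of them a duplicate.
printf 'https://github.com/\r\nhttps://github.com/x\r\nhttps://github.com/\r\n' |
    awk '{ sub(/\r$/, "") }   # strip one trailing carriage return, if any
         !seen[$0]++'         # print each distinct line the first time it appears
```

On the real data the printf would be replaced by the list of .csv files as awk arguments, with the output redirected to file_1.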

> --
> Sent with https://mailfence.com
> Secure and private email
