Date:      Mon, 20 Feb 2023 17:13:27 -0800
From:      jin guojun <jguojun@gmail.com>
To:        Sysadmin Lists <sysadmin.lists@mailfence.com>
Cc:        Freebsd Questions <freebsd-questions@freebsd.org>
Subject:   Re: BSD-awk print() Behavior
Message-ID:  <CAE6yT5unTF5S=gt3oFy2-MhAdv-rDO660Dw4Y0O_AFQwSLnp%2Bw@mail.gmail.com>
In-Reply-To: <1600449078.170379.1676939080787@fidget.co-bxl>
References:  <1600449078.170379.1676939080787@fidget.co-bxl>

Without knowing which hidden character(s) are in those files, one can only
guess what happened.

hexdump -C file_{1,2} will show the real difference between the two files,
which may help explain what awk's print is doing.
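
As a sketch of what that inspection might reveal: one common cause of this
symptom is a carriage return (\r) before each newline, which a terminal
renders by moving the cursor back to column 0 so that later text overwrites
the start of the line. That is only an assumption here, since the thread has
not yet shown the actual bytes. The example below builds two throwaway files
(with_cr and without_cr are hypothetical names, not the poster's files) and
uses od -c, the POSIX counterpart to hexdump -C, to make the difference
visible.

```shell
#!/bin/sh
# Hypothetical sample files in a scratch directory, so the real
# file_1 and file_2 are never touched.
tmp=$(mktemp -d)
printf 'https://github.com/\r\n' > "$tmp/with_cr"     # assumed: CRLF line ending
printf 'https://github.com/\n'   > "$tmp/without_cr"  # plain LF line ending

# od -c prints every byte; a carriage return appears as \r before \n.
od -c "$tmp/with_cr"
od -c "$tmp/without_cr"

# Pipe awk's output through od -c instead of the terminal: the \r is
# still in $0, so it sits between the URL and the appended " abc ".
awk '{ print $0 " abc " }' "$tmp/with_cr" "$tmp/without_cr" | od -c

# One possible workaround: strip a trailing CR before printing.
fixed=$(awk '{ sub(/\r$/, ""); print $0 " abc" }' "$tmp/with_cr")
printf '%s\n' "$fixed"

rm -rf "$tmp"
```

If hexdump -C on the real file_1 shows 0d 0a at the end of each line, the same
sub() call would be one way to strip it before printing.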

-Jin

On Mon, Feb 20, 2023 at 4:25 PM Sysadmin Lists <sysadmin.lists@mailfence.com>
wrote:

> Trying to wrap my head around what BSD awk is doing here. Although the
> behavior is unwanted for this exercise, it seems like a possibly useful
> feature or hack for future projects. Either way I'd like to understand
> what's going on.
>
> I extracted a list of URLs from my browser's history sql file, and when
> iterating over the list with awk got some strange results.
>
> file_1 has the sql-extracted URLs, and file_2 is a copy-paste of that
> file's contents using vim's yank-and-paste.
>
> $ cat file_{1,2}
> https://github.com/
> https://github.com/
> https://github.com/
> https://github.com/
>
> $ diff file_{1,2}
> 1,2c1,2
> < https://github.com/
> < https://github.com/
> ---
> > https://github.com/
> > https://github.com/
>
> $ awk '{ print $0 " abc " }' file_{1,2}
>  abc ://github.com/
>  abc ://github.com/
> https://github.com/ abc
> https://github.com/ abc
>
> The sql-extracted URLs cause awk's print() to replace the front of the
> string with text following $0. file_2 does not. I used vim's `:set list'
> option to view hidden chars, but there's no apparent difference between
> the two -- although `diff' clearly thinks so. Both files show this when
> `list' is set:
>
> https://github.com/$
> https://github.com/$
>
>
> Here's more background if needed:
>
> I extracted the URLs using sqlite3 like so:
> for f in History-16768665*
> do
>         sqlite3 --bail $f <<-HEREDOC
>                 .mode csv
>                 .output ${f}.csv
>                 select * from urls where url like '%github%';
> HEREDOC
> done
>
> Then tried to create a list of unique URLs using `sort -u' but it broke
> because of special chars in the extracted lines (so it claimed). I used
> awk to get a unique list instead:
>
> for f in *.csv; do [[ -s $f ]] && list="${list} $f"; done; echo $list
> awk '{ u[$0] } END { for (e in u) print e > "file_1" }' $list
>



