Date: Mon, 20 Feb 2023 17:13:27 -0800
From: jin guojun <jguojun@gmail.com>
To: Sysadmin Lists <sysadmin.lists@mailfence.com>
Cc: Freebsd Questions <freebsd-questions@freebsd.org>
Subject: Re: BSD-awk print() Behavior
Message-ID: <CAE6yT5unTF5S=gt3oFy2-MhAdv-rDO660Dw4Y0O_AFQwSLnp%2Bw@mail.gmail.com>
In-Reply-To: <1600449078.170379.1676939080787@fidget.co-bxl>
References: <1600449078.170379.1676939080787@fidget.co-bxl>

Without knowing what hidden character(s) are in those files, how can one
guess what happened?

hexdump -C file_{1,2} will show the real difference between the two files,
which may help explain what awk's print is doing.

-Jin

On Mon, Feb 20, 2023 at 4:25 PM Sysadmin Lists <sysadmin.lists@mailfence.com> wrote:

> Trying to wrap my head around what BSD awk is doing here. Although the behavior
> is unwanted for this exercise, it seems like a possibly useful feature or hack
> for future projects. Either way I'd like to understand what's going on.
>
> I extracted a list of URLs from my browser's history sql file, and when
> iterating over the list with awk got some strange results.
>
> file_1 has the sql-extracted URLs, and file_2 is a copy-paste of that file's
> contents using vim's yank-and-paste.
>
> $ cat file_{1,2}
> https://github.com/
> https://github.com/
> https://github.com/
> https://github.com/
>
> $ diff file_{1,2}
> 1,2c1,2
> < https://github.com/
> < https://github.com/
> ---
> > https://github.com/
> > https://github.com/
>
> $ awk '{ print $0 " abc " }' file_{1,2}
>  abc ://github.com/
>  abc ://github.com/
> https://github.com/ abc
> https://github.com/ abc
>
> The sql-extracted URLs cause awk's print() to replace the front of the string
> with text following $0. file_2 does not. I used vim's `:set list' option to
> view hidden chars, but there's no apparent difference between the two --
> although `diff' clearly thinks so. Both files show this when `list' is set:
>
> https://github.com/$
> https://github.com/$
>
>
> Here's more background if needed:
>
> I extracted the URLs using sqlite3 like so:
> for f in History-16768665*
> do
>         sqlite3 --bail $f <<-HEREDOC
>                 .mode csv
>                 .output ${f}.csv
>                 select * from urls where url like '%github%';
> HEREDOC
> done
>
> Then tried to create a list of unique URLs using `sort -u' but it broke because
> of special chars in the extracted lines (so it claimed). I used awk to get a
> unique list instead:
>
> for f in *.csv; do [[ -s $f ]] && list="${list} $f"; done; echo $list
> awk '{ u[$0] } END { for (e in u) print e > "file_1" }' $list
>
> --
> Sent with https://mailfence.com
> Secure and private email