From owner-freebsd-questions  Mon May 15  8:49:36 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from hecate.webcom.com (hecate.webcom.com [209.1.28.39])
	by hub.freebsd.org (Postfix) with ESMTP id 73B6537B892
	for <questions@freebsd.org>; Mon, 15 May 2000 08:49:28 -0700 (PDT)
	(envelope-from graeme@echidna.com)
Received: from eresh (eresh.webcom.com [209.1.28.49])
	by hecate.webcom.com (8.9.1/8.9.1) with SMTP id IAA23558;
	Mon, 15 May 2000 08:49:27 -0700
Received: from [63.83.153.218] by inanna.webcom.com (WebCom SMTP 1.2.1)
	with SMTP id 74147824; Mon May 15 08:48 PDT 2000
Message-Id: <39201D4D.B7482D55@echidna.com>
Date: Mon, 15 May 2000 11:52:45 -0400
From: Graeme Tait <graeme@echidna.com>
X-Mailer: Mozilla 4.51 [en] (WinNT; U)
X-Accept-Language: en,pdf
Mime-Version: 1.0
To: Dan Larsson <dl@tyfon.net>
Cc: questions@freebsd.org
Subject: Re: regexp driving me nuts, help needed! (followup)
References: <NEBBJANJCNNAKCPFKHHFGEHMCCAA.dl@tyfon.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

You can certainly do what you want with regexps - the downside is that getting a
regexp right can involve more pain and suffering than other, perhaps less
compact approaches. I'm accustomed to using regexps in Perl, not with sed (there
are differences in regexp syntax). So forgive me if this is not applicable to
your usage.

Your regexp

s/\([\.a-zA-Z0-9]+[a-zA-Z]{2,3}\)/\1 /g

is fundamentally flawed, in that it doesn't strip unmatched characters - all it
would do (at best) is add a space. That is, you are matching a substring, then
replacing it by itself plus a space, leaving the unmatched parts of the string
unchanged.
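
To make that concrete, here is a minimal sketch of the substitution at work. It
assumes a sed that accepts -E (extended syntax), so that "+" and "{2,3}" act as
quantifiers and bare parentheses group; the function name add_space and the
sample URL are made up for the illustration.

```shell
#!/bin/sh
# Extended-syntax rendering of the substitution quoted above.
# add_space is a hypothetical name; -E is assumed to be available.
add_space() {
    printf '%s\n' "$1" | sed -E 's/([.a-zA-Z0-9]+[a-zA-Z]{2,3})/\1 /g'
}

# Every matched run comes back followed by a space; the "://" and "/"
# between the matches pass through untouched, so nothing is stripped.
add_space 'http://www.domain.com/file.html'
```

Removing the added spaces from the output gives back the original input, which
is why this form of substitution cannot extract anything on its own.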

Also, why do you have the parentheses escaped? In Perl that would make them
literal parens, so that you would be trying to match a string like
"http://www.(domain.com)/"; you would get no match, and the input string would
be unchanged by the regexp. In sed's basic regexp syntax, though, "\(" and "\)"
are the grouping operators, so the escaping there comes from sed, not from the
shell.
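
What the escaping means depends on the regexp flavour rather than on the shell.
A small sketch of the difference, assuming a sed that accepts -E for extended
syntax (GNU and BSD sed do); the sample strings are invented:

```shell
#!/bin/sh
# Basic syntax (sed's default): \( \) group, bare ( ) are ordinary characters.
printf '%s\n' 'abc'   | sed 's/\(b\)/[\1]/'    # prints a[b]c
printf '%s\n' 'a(b)c' | sed 's/(b)/X/'         # prints aXc

# In basic syntax a bare + is an ordinary character too, not a quantifier.
printf '%s\n' 'aa+'   | sed 's/a+/X/'          # prints aX

# Extended syntax (sed -E): bare ( ) group, much as they do in Perl.
printf '%s\n' 'abc'   | sed -E 's/(b)/[\1]/'   # prints a[b]c
```

So in plain sed the escaped parentheses do group, but the unescaped "+" and
"{2,3}" in the same expression are taken literally unless -E is used.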

Also, you are not allowing for "-" characters in the domain name, and since you
have a match component "[\.a-zA-Z0-9]+", you will match more than the
second-level domain (that match component will match say "aaa.bbb.ccc.ddd" for
an arbitrary number of sub-domains). And why the "g" at the end of the regexp -
are you expecting multiple URLs in the input line? And in Perl (at least), you
should properly use "$1", not "\1" in the replacement specification.
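
The reach of that character class can be seen by bracketing whatever it matches.
This is only a sketch: mark_run is a hypothetical helper, sed -E is assumed, and
the host names are invented.

```shell
#!/bin/sh
# Wrap the first run matched by the class in <...>; & is the whole match.
mark_run() {
    printf '%s\n' "$1" | sed -E 's/[.a-zA-Z0-9]+/<&>/'
}

mark_run 'aaa.bbb.ccc.ddd'   # prints <aaa.bbb.ccc.ddd>
mark_run 'my-host.tld'       # prints <my>-host.tld
```

The first call swallows every sub-domain level in one go, and the second stops
at the "-" because the class does not include it.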


In Perl, to extract the second-level domain name from a string consisting of a
correctly-formed URL like yours (and nothing else), I would do something like
(following other users' suggestions)

echo [URL] | perl -ne 'print $2,"\n" if
(m/^[a-zA-Z]+:\/(\/|\/[^\/]+\.)([a-zA-Z0-9-]+\.[a-zA-Z]{2,3})[\/:].*/)'

This assumes that the hostname part of the URL can end in either "/" or ":"
(allowing for a port number to be present, since that is not uncommon), and that
the TLD must be 2 or 3 alphabetic characters, and that the domain is at least
second level (i.e., the above code will skip a URL referencing a host directly
by name, like "http://localhost/"; also the regexp will tolerate certain
malformed URLs). Remember that the syntax of a URL can get more complicated
than your samples, and apart from a port number can (e.g.) contain a
username[:password] in the host portion; to accommodate the full syntax would
require more work.

BTW, I don't believe "telnet://domain.tld" is a valid URL - I think the closing
"/" is strictly required. But if you want to include that case, do

echo [URL] | perl -ne 'print $2,"\n" if
(m/^[a-zA-Z]+:\/(\/|\/[^\/]+\.)([a-zA-Z0-9-]+\.[a-zA-Z]{2,3})([\/:].*|$)/)'


Dan Larsson wrote:
> 
> Thanks for all the help guys (you know who you are)!
> Now I can extract 'sub.domain.tld/file.html' to 'domain.tld'. Thanks!
> 
> However I need to extract and return second level
> domainname with top level domainname from any combination:
> 
> https://foo.bar.sub.domain.tld/anythingornothing/file.html
> gopher://anything.domain.tld/nothingoranything/foo.file
> anything://domain.tld/file.php3
> http://domain.tld/
> telnet://domain.tld
> 
> should all return: 'domain.tld'.
> 
> Maybe I'm reaching in the wrong direction when trying to use regexps
> for this. Any other method is also welcome.
> 
> Regards
> ------------
> Dan Larsson
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-questions" in the body of the message

