Date: Mon, 15 May 2000 11:52:45 -0400
From: Graeme Tait
To: Dan Larsson
Cc: questions@freebsd.org
Subject: Re: regexp driving me nuts, help needed! (followup)

You can certainly do what you want with regexps - the downside is that
getting a regexp right can involve more pain and suffering than other,
perhaps less compact, approaches.

I'm accustomed to using regexps in Perl, not with sed (there are
differences in regexp syntax). So forgive me if this is not applicable
to your usage.

Your regexp

    s/\([\.a-zA-Z0-9]+[a-zA-Z]{2,3}\)/\1 /g

is fundamentally flawed, in that it doesn't strip unmatched characters -
all it would do (at best) is add a space. That is, you are matching a
substring, then replacing it by itself plus a space, leaving the
unmatched parts of the string unchanged.

Also, why do you have the parentheses escaped? Won't that mean they are
taken as literal parens, so that you are trying to match a string like
"http://www.(domain.com)/"? In this case you will get no match, and the
input string will be unchanged by the regexp. Or is that escaping
required by the shell?
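
On the sed side, the questions above can be checked directly. The following is a minimal sketch, assuming a sed that accepts -E for extended regexps; the sample URL is made up for illustration. In sed's default basic regexp syntax, \( and \) are the grouping operators, while + and {2,3} are ordinary characters, so the expression as written finds nothing to match in a typical URL. With -E the same pattern (parentheses unescaped) does match, and, as described above, only adds a space after each matched run.

```shell
#!/bin/sh
# Illustrative input; this URL is an assumption, not taken from the thread.
url='http://www.domain.com/file.html'

# Basic regexps (plain sed): \( \) group, but + and {2,3} are literal
# characters here, so nothing matches and the line passes through unchanged.
bre_out=$(printf '%s\n' "$url" | sed 's/\([\.a-zA-Z0-9]+[a-zA-Z]{2,3}\)/\1 /g')

# Extended regexps (sed -E): unescaped ( ) group, + and {2,3} are operators.
# Each matched run is replaced by itself plus a space; nothing is stripped.
ere_out=$(printf '%s\n' "$url" | sed -E 's/([.a-zA-Z0-9]+[a-zA-Z]{2,3})/\1 /g')

# Brackets make the added spaces visible.
printf 'BRE: [%s]\n' "$bre_out"
printf 'ERE: [%s]\n' "$ere_out"
```

The first line printed should show the URL unchanged, and the second should show "http ://www.domain.com /file.html " with a space added after each of the three matched runs.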

Also, you are not allowing for "-" characters in the domain name, and
since you have a match component "[\.a-zA-Z0-9]+", you will match more
than the second-level domain (that match component will match, say,
"aaa.bbb.ccc.ddd" for an arbitrary number of sub-domains).

And why the "g" at the end of the regexp - are you expecting multiple
URLs in the input line?

And in Perl (at least), you should properly use "$1", not "\1", in the
replacement specification.

In Perl, to extract the second-level domain name from a string
consisting of a correctly-formed URL like yours (and nothing else), I
would do something like this (following other users' suggestions):

    echo [URL] | perl -ne 'print $2,"\n" if (m/^[a-zA-Z]+:\/(\/|\/[^\/]+\.)([a-zA-Z0-9-]+\.[a-zA-Z]{2,3})[\/:].*/)'

This assumes that the hostname part of the URL can end in either "/" or
":" (allowing for a port number to be present, since that is not
uncommon), that the TLD must be 2 or 3 alphabetic characters, and that
the domain is at least second level (i.e., the above code will skip a
URL referencing a host directly by name, like "http://localhost/"; also,
the regexp will tolerate certain malformed URLs).

Remember that the syntax of a URL can get more complicated than your
samples, and apart from a port number can (e.g.) contain a
username[:password] in the host portion; to accommodate the full syntax
would require more work.

BTW, I don't believe "telnet://domain.tld" is a valid URL - I think the
closing "/" is strictly required. But if you want to include that case,
do

    echo [URL] | perl -ne 'print $2,"\n" if (m/^[a-zA-Z]+:\/(\/|\/[^\/]+\.)([a-zA-Z0-9-]+\.[a-zA-Z]{2,3})([\/:].*|$)/)'

Dan Larsson wrote:
>
> Thanks for all the help guys (you know who you are)!
> Now I can extract 'sub.domain.tld/file.html' to 'domain.tld'. Thanks!
>
> However I need to extract and return second level
> domainname with top level domainname from any combination:
>
> https://foo.bar.sub.domain.tld/anythingornothing/file.html
> gopher://anything.domain.tld/nothingoranything/foo.file
> anything://domain.tld/file.php3
> http://domain.tld/
> telnet://domain.tld
>
> should all return: 'domain.tld'.
>
> Maybe I'm reaching in the wrong direction when trying to use regexps
> for this. Any other method is also welcome.
>
> Regards
> ------------
> Dan Larsson
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-questions" in the body of the message