Date: Tue, 6 Jul 1999 11:55:26 +0100
From: Nik Clayton <nclayton@lehman.com>
To: chris@calldei.com, Bill Fumerola <billf@chc-chimes.com>, doc@freebsd.org
Cc: hackers@freebsd.org
Subject: Searching the Handbook (was Re: 'rtfm script')
Message-ID: <19990706115526.Z15628@lehman.com>
In-Reply-To: <19990705141635.D97224@holly.dyndns.org>; from Chris Costello on Mon, Jul 05, 1999 at 02:16:36PM -0500
References: <Pine.HPP.3.96.990705100523.26110A-100000@hp9000.chc-chimes.com> <19990705141635.D97224@holly.dyndns.org>

I've added doc@freebsd.org to the distribution list, for obvious reasons.

On Mon, Jul 05, 1999 at 02:16:36PM -0500, Chris Costello wrote:
> On Mon, Jul 5, 1999, Bill Fumerola wrote:
> > I'm in favor of the rtfm script. It's amazing the positive
> > things that come out of an offhand IRC comment.
> >
> > [ from http://www.emsphone.com/FreeBSD/log.cgi/19990704.txt ]
> >
> > [15:33] <cmc> First it'll search the man pages. Then the handbook. Then
> > the FAQ. Then, maybe see if I can find out if they start bitching, and if
> > so, email Jesus Monroy.
>
> Note that I can't figure out a decent way to search the
> Handbook at this point, but I'm open to ideas.

There are a couple of ways you could do it, some better than others.

Executive summary: sgrep is probably your best choice for now. It can be
found at <URL:http://www.cs.helsinki.fi/~jjaakkol/sgrep.html>. But read on
for more.

The simplest way is to assume that the user has the plain text Handbook
installed, and do a simple grep through that for what you're looking for.

This is nice and easy to do, but doesn't take advantage of the additional
smarts built in to the Handbook's native format. Doing that requires some
additional work.

A brief recap for those not au fait with how the Handbook is organised in
source form.

The Handbook is 'marked up' in a language called DocBook. DocBook was
designed specifically for marking up technical documentation, and looks a
lot like HTML. However, instead of tags like <em>, <b>, <ul>, and so on,
DocBook has tags like <example>, <screen>, <userinput>, <devicename>,
<filename>, and so forth.

A document that is marked up in DocBook therefore contains a lot of
additional semantic information about the content (and very little
formatting information).

When the Handbook is converted to HTML, some of this semantic information
is retained.
For example, the DocBook source for an example that the user might want to
copy verbatim would look like

    <screen><prompt>#</prompt> <userinput>rm -rf /</userinput></screen>

and might be converted to HTML that looks like

    <blockquote class="screen">
      <tt><span class="prompt">#</span>
      <span class="userinput">rm -rf /</span></tt>
    </blockquote>

Lots more information can be found at
http://www.freebsd.org/tutorials/docproj-primer/.

A smart searching mechanism will be able to use this additional semantic
information to reject (or lower the rankings of) results that don't match
what the user wanted.

For example, suppose you're searching the Handbook for examples of the
make(1) command in action. The simple string "make" occurs lots of times
in the Handbook. However, you're only interested in those places where it
occurs *inside* a <userinput> element; all the other occurrences can be
ignored.

For a simple rtfm(1) style search most of this can probably be ignored, and
you can just search the plain text Handbook. But even then you might want
to provide switches that allow the user to specify:

  - Only match this word if found in an example

  - Only match this word if found in a title

  - Only match this word if found in a command name

and so on.

How do you do that? Good question. This has been on my list of things to
investigate (at the back of my mind) for a while, but more important things
have taken my time. If anyone's interested in doing this, here's what I've
discovered.

You could go the full SGML route. This would involve building an
application that can parse the DocBook source of the Handbook (and other
articles, and soon the FAQ) and allow the user to do their queries
using this application. This is probably the most 'correct' route from a
purist point of view, but is an awful lot of work.

You could go the XML route. XML is the buzzword of the moment, and can be
thought of as being SGML-Lite.
Writing an XML parser is much easier than writing an SGML parser, and you
could write an XML-aware application that parses the Handbook and other
docs, returning only those results that appear inside certain elements.
This is still a chunk of work, and the end user will need to keep an XML
copy of the documentation somewhere on their disk. Converting from SGML to
XML is not a hard problem for our documents though, so at least that hurdle
is easy to clear.

For an example of this, check out SCOOBS, at <URL:http://www.scoobs.com/>.

This is still probably too heavyweight a solution though.

*Much* simpler is to build a grep-alike that understands structured
documents, but that doesn't care how those documents are structured. This
is such a great idea that someone's already done it -- sgrep, which can be
found at <URL:http://www.cs.helsinki.fi/~jjaakkol/sgrep.html>, can search
structured text (such as DocBook, HTML, or even mail files).

Some examples of sgrep queries:

    sgrep 'start or "\n" .. (end or "\n") containing "Hello World"'

You can define macros in sgrep, so the above could be simplified to

    sgrep 'LINE containing "Hello World"'

If you wanted to find all the From: fields in a Unix mbox file:

    sgrep '"\nFrom: " .. "\n" extracting ("\n" in "\nFrom: ")'

or with macros

    sgrep 'MAIL_FROM'

Print out the title from a collection of HTML documents in which the word
"SGML" is mentioned more than 12 times, or which have the word "SGML"
inside H1 or H2 elements:

    sgrep 'HTML_TITLE in (start .. end containing (\
           join(12,"SGML") or (HTML_H1 or HTML_H2 containing "SGML") ) )' *.html

rtfm(1) could provide a simpler front-end to a series of canned sgrep
searches, selected by the switches passed to rtfm(1).

As you can probably tell, I'm in favour of the sgrep(1) approach, simply
because you'll get something working much faster.

One caveat though -- the sgrep query language is not standard, and is only
implemented by sgrep.
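
To make the "canned searches" idea a bit more concrete, here is a rough
sketch of how a front-end might turn a switch into an sgrep query. It is
only an illustration: the switch names, the function names and the file
name handbook.sgml are invented for this example, and the query shape is
modelled on the sgrep examples above rather than taken from any existing
rtfm(1) code.

```python
"""Sketch: turn rtfm-style switches into canned sgrep queries."""

# Hypothetical switch names mapped to DocBook element names.
# The elements exist in DocBook; the switch names are made up here.
SWITCH_TO_ELEMENT = {
    "example": "userinput",
    "title": "title",
    "command": "command",
}


def build_query(word, where):
    """Return an sgrep query matching `word` inside the chosen element."""
    # Keep the sketch simple: refuse words that would need escaping
    # inside an sgrep string literal.
    if '"' in word or "\\" in word:
        raise ValueError("word must not contain quotes or backslashes")
    element = SWITCH_TO_ELEMENT[where]
    return f'"<{element}>" .. "</{element}>" containing "{word}"'


def build_command(word, where, files):
    """Return an argv list suitable for subprocess.run(); nothing is run."""
    return ["sgrep", build_query(word, where), *files]


if __name__ == "__main__":
    # handbook.sgml is a placeholder file name.
    print(build_command("make", "example", ["handbook.sgml"]))
```

Note that this matches literal start tags only, so an element written with
attributes (say, a <title> with an id) would be missed. A real front-end
would more likely lean on sgrep macros for that.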
There is a proposal going through for something called XQL, the XML Query
Language. In the long run, something that supported searching using XQL is
likely to be most useful. But in the short-term, sgrep will probably get
you up and running quickly.

More information about XQL can be found at
<URL:http://www.w3.org/TandS/QL/QL98/pp/xql.html>. If you do a search for
"xql" at Google (<URL:http://www.google.com/>) then you'll turn up all sorts
of goodies, including various Perl and Python interfaces to XQL, which might
make writing an XQL search system easier.

HTH,

N
--
--+==[ Systems Administrator, Year 2000 Test Lab, Lehman Brothers, Inc. ]==+--
--+==[ 1 Broadgate, London, EC2M 7HA 0171-601-0011 x5514 ]==+--
--+==[ Year 2000 Testing: It's about time. . . ]==+--

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-doc" in the body of the message