From: Robert Huff
Date: Fri, 7 Nov 2008 08:14:53 -0500
To: Giorgos Keramidas
Cc: freebsd-doc@freebsd.org, freebsd-questions@freebsd.org
Message-ID: <18708.16205.131542.449645@jerusalem.litteratus.org>
In-Reply-To: <87d4h884c3.fsf@kobe.laptop>
References: <4913C74C.80606@gmail.com> <87d4h884c3.fsf@kobe.laptop>
Subject: Re: spell check - how to?

Giorgos Keramidas writes:

>  The main drawback of being unable to use the `freebsd' wordlist
>  is that you will get many false positives for words that are
>  perfectly valid for FreeBSD documentation but are not standard
>  English words.

	I have a script which does something similar, using ispell.
It's based on the Perl script (found on-line) appended below.
	I pseudo-fixed the false-positive problem by running the output
through sort and starting with the least frequent hits.  Attempts to
build a project-specific dictionary proved too confusing, and it was
ultimately not worth the effort.


				Robert Huff


#!/usr/local/bin/perl -W
# WordFreq.pl -- Count word frequency in a text file

$ver = "v1.0"; # 05-Dec-2001 JP Vossen <jp@jpsdomain.org>

# Basics from 8.3, page 280 of _Perl_Cookbook_
# Added stop words

(($myname = $0) =~ s/^.*(\/|\\)|\..*$//ig); # remove up to last "\" or "/" and after any "."
$Greeting  = ("$myname $ver Copyright 2001 JP Vossen (http://www.jpsdomain.org/)\n");
$Greeting .= (" Licensed under the GNU GENERAL PUBLIC LICENSE:\n");
$Greeting .= (" See http://www.gnu.org/copyleft/gpl.html for full text and details.\n"); # Version and copyright info

%seen = ();	# Create the hash

# Define the stopwords
@stopwords = ("a", "an", "and", "are", "as", "at", "be", "but", "by",
              "does", "for", "from", "had", "have", "her", "his", "if",
              "in", "is", "it", "not", "of", "on", "or", "that", "the",
              "this", "to", "was", "which", "with", "you");

if (("@ARGV" =~ /\?/) || (@ARGV > 5) || (@ARGV < 0)) { # if wrong # of args, or a ? in args - die
    print STDERR ("\n$Greeting\n\tUsage: $myname -i {infile} [-o {outfile}] [-s]\n");
    print STDERR ("\nIf -s is used, the list of stop words will NOT be used.\n");
    print STDERR ("The stopwords currently defined are:\n\n ");
    foreach $stopword (@stopwords) {
        print STDERR ("$stopword ");
    } # end of foreach stopword
    die ("\n");
}

use Getopt::Std;	# Use Perl5 built-in program argument handler
getopts('i:o:s');	# Define possible args.

if (! $opt_i) { $opt_i = "-"; }	# If no input file specified, use STDIN
if (! $opt_o) { $opt_o = "-"; }	# If no output file specified, use STDOUT

open (INFILE, "$opt_i")   || die "$myname: error opening $opt_i $!\n";
open (OUTFILE, ">$opt_o") || die "$myname: error opening $opt_o $!\n";

print STDERR ("\n$Greeting\n");

while (<INFILE>) {			# Read the input file
    while ( /(\w['\w-]*)/g ) {		# If we have a "word"
        $seen{lc $1}++;			# Count it in the hash
    } # end of while words
} # end of while input

if (! $opt_s) {				# If we're using stopwords
    foreach $stopword (@stopwords) {	# for each stopword
        delete($seen{$stopword});	# Remove it from the hash
    } # end of foreach stopword	# This way we only test once for each
} # end of if using stopwords		# stopword, rather than in a loop!

# Print the results, sorted most frequent words at the top
foreach $word ( sort { $seen{$b} <=> $seen{$a} } keys %seen) {
    printf OUTFILE ("%6d  %s\n", $seen{$word}, $word);
} # end of foreach word
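
The "run the output through sort and start with the least frequent hits" step might be sketched in plain shell as below. This is only an illustration, not the script referred to above: count_rare_first is a hypothetical helper name, and it assumes its input is one flagged word per line (for example from ispell -l, which lists misspelled words read from standard input).

```shell
#!/bin/sh
# Sketch only: count_rare_first is a hypothetical helper, not part of
# WordFreq.pl. It reads one word per line on stdin and prints
# "count word" lines, least frequent first.
count_rare_first() {
    tr '[:upper:]' '[:lower:]' |   # fold case, as WordFreq.pl does with lc
        sort |                     # group identical words together
        uniq -c |                  # prefix each word with its count
        sort -n                    # numeric sort: rarest words come first
}

# Assumed usage (needs ispell installed; chapter.sgml is a placeholder):
#   ispell -l < chapter.sgml | count_rare_first | head -20
```

The reasoning behind reading the list from the rare end is that a genuine typo tends to show up once or twice, while a valid FreeBSD term that the dictionary does not know tends to recur, so it sinks to the bottom of the list.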