From owner-freebsd-doc@FreeBSD.ORG  Fri Nov  7 13:43:27 2008
From: Robert Huff <roberthuff@rcn.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <18708.16205.131542.449645@jerusalem.litteratus.org>
Date: Fri, 7 Nov 2008 08:14:53 -0500
To: Giorgos Keramidas <keramida@freebsd.org>
In-Reply-To: <87d4h884c3.fsf@kobe.laptop>
References: <4913C74C.80606@gmail.com>
	<87d4h884c3.fsf@kobe.laptop>
X-Mailer: VM 7.17 under 21.5  (beta28) "fuki" XEmacs Lucid
X-Junkmail-Whitelist: YES (by domain whitelist at mr02.lnh.mail.rcn.net)
Cc: freebsd-doc@freebsd.org, freebsd-questions@freebsd.org
Subject: Re: spell check - how to?


Giorgos Keramidas writes:

>  The main drawback of being unable to use the `freebsd' wordlist
>  is that you will get many false positives for words that are
>  perfectly valid for FreeBSD documentation but are not standard
>  English words.

	I have a script which does something similar, using ispell.
It's based on a Perl script I found online, appended below.
	I pseudo-fixed the false positives by running the output
through sort and starting with the least frequent hits.
	Attempts to build a project-specific dictionary proved too
confusing, and it was ultimately not worth the effort.
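
	As a rough illustration of the "least frequent first" idea, here
is a minimal sketch. It is not the script I actually use; the helper
name rarest_first and the sample word list are made up for the
example, and it assumes you already have a list of flagged words
(one per entry) from your spell checker.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WordFreq.pl): count each flagged word
# case-insensitively and return [count, word] pairs, rarest first.
sub rarest_first {
    my @words = @_;
    my %seen;
    $seen{lc $_}++ for @words;
    # Ascending by count; ties are broken alphabetically for stable output.
    return map  { [ $seen{$_}, $_ ] }
           sort { $seen{$a} <=> $seen{$b} or $a cmp $b } keys %seen;
}

# Made-up sample standing in for a spell checker's list of flagged words.
my @flagged = qw(sysctl teh sysctl jail sysctl teh recieve);

# Same "count word" layout that WordFreq.pl prints.
printf "%6d %s\n", @$_ for rarest_first(@flagged);
```

	The reasoning is that words flagged only once or twice are more
likely to be real typos, while project jargon tends to show up many
times and sinks to the bottom of the list.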


				Robert Huff


#!/usr/local/bin/perl -W

# WordFreq.pl -- Count word frequency in a text file
$ver = "v1.0"; # 05-Dec-2001 JP Vossen <jp@jpsdomain.org>

# Basics from 8.3, page 280 of _Perl_Cookbook_
# Added stop words

(($myname = $0) =~ s/^.*(\/|\\)|\..*$//ig); # remove up to last "\" or "/" and after any "."
$Greeting =  ("$myname $ver Copyright 2001 JP Vossen (http://www.jpsdomain.org/)\n");
$Greeting .= ("    Licensed under the GNU GENERAL PUBLIC LICENSE:\n");
$Greeting .= ("    See http://www.gnu.org/copyleft/gpl.html for full text and details.\n"); # Version and copyright info

%seen = ();   # Create the hash

# Define the stopwords
@stopwords = ("a", "an", "and", "are", "as", "at", "be", "but", "by", 
"does", "for", "from", "had", "have", "her", "his", "if", "in", "is",
"it", "not", "of", "on", "or", "that", "the", "this", "to", "was",
"which", "with", "you");


if (("@ARGV" =~ /\?/) || (@ARGV > 5)) { # if too many args, or a ? in args - die
    print STDERR ("\n$Greeting\n\tUsage: $myname [-i {infile}] [-o {outfile}] [-s]\n");
    print STDERR ("\nIf -s is used, the list of stop words will NOT be used.\n");
    print STDERR ("The stopwords currently defined are:\n\n ");
    foreach $stopword (@stopwords) {
        print STDERR ("$stopword ");
    } # end of foreach stopword
    die ("\n");
}

use Getopt::Std;                 # Use Perl5 built-in program argument handler
getopts('i:o:s');                # Define possible args.

if (! $opt_i) { $opt_i = "-"; }  # If no input file specified, use STDIN
if (! $opt_o) { $opt_o = "-"; }  # If no output file specified, use STDOUT

open (INFILE, "$opt_i") || die "$myname: error opening $opt_i $!\n";
open (OUTFILE, ">$opt_o") || die "$myname: error opening $opt_o $!\n";

print STDERR ("\n$Greeting\n");

while (<INFILE>) {                # Read the input file
    while ( /(\w['\w-]*)/g ) {    # If we have a "word"
        $seen{lc $1}++;           # Count it in the hash
    } # end of while words
} # end of while input

if (! $opt_s) {                       # If we're using stopwords
    foreach $stopword (@stopwords) {  # for each stopword
        delete($seen{$stopword});     # Remove it from the hash
    } # end of foreach stopword       # This way we only test once for each
} # end of if using stopwords           stopword, rather than in a loop!


# Print the results, sorted most frequent words at the top
foreach $word ( sort { $seen{$b} <=> $seen{$a} } keys %seen) {
    printf OUTFILE ("%6d %s\n", $seen{$word}, $word);
} # end of foreach word