From: Chuck Swiger
Date: Fri, 29 Jul 2005 11:22:32 -0400
To: Michael Sharp
Cc: freebsd-questions@freebsd.org
Message-ID: <42EA49B8.4070804@mac.com>
In-Reply-To: <1784.192.168.1.1.1122647757.squirrel@probsd.org>
Subject: Re: Need a good Unix script that..
Michael Sharp wrote:
> I need a simple sh script that will daily (via cron) crawl a website
> looking for multiple keywords, then reporting those keyword results and
> URL to an email address.
>
> Anyone know of a pre-written script that does this, or point me in the
> right direction in using the FreeBSD core commands that can accomplish
> this?

If you feed the webserver's access log into a log analyzer such as
analog, it will report the keywords people searched for when following
a link into the site. (This is not quite what you asked for, but I
mention it because the suggestion might be closer to what you want to
see... :-)

Anyway, if you do not own the site and do not have access to the
logfiles, you ought to honor things like /robots.txt and the site's
policies with regard to copyright and data mining. With that caveat, you
could easily use lynx, wget, or anything similar which supports a
recursive/web-spider download capability (curl only fetches the URLs you
hand it rather than following links, so it needs a URL list or a wrapper
loop), and then grep for keywords, build histograms, or whatever else
you like on the content you download.

-- 
-Chuck
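P.S. The "download, then grep" step could look roughly like the sketch
below. It is only an illustration under some assumptions: the function
name scan_keywords is made up for this example, the pages are assumed to
have been mirrored into a local directory already, and the site,
keywords, and mail address in the comments are placeholders. It relies
on grep -r, which both FreeBSD grep and GNU grep provide.

```shell
#!/bin/sh
# Sketch only: list which already-downloaded pages mention each keyword.
# scan_keywords is a hypothetical helper name, not an existing tool.
#
# Usage: scan_keywords DIR KEYWORD...
# Prints one "keyword<TAB>file" line per file that contains the keyword.
scan_keywords() {
    dir=$1
    shift
    for kw in "$@"; do
        # -r recurse into DIR, -i ignore case, -l print matching file
        # names only, -F treat the keyword as a fixed string, not a regex.
        grep -r -i -l -F -e "$kw" "$dir" 2>/dev/null | sort |
            while IFS= read -r file; do
                printf '%s\t%s\n' "$kw" "$file"
            done
    done
}

# A daily cron job might then do something like this (not run here;
# the URL, keywords, and address are placeholders):
#   wget -q -r -l 2 -P /tmp/mirror http://www.example.com/
#   scan_keywords /tmp/mirror keyword1 keyword2 |
#       mail -s "keyword report" you@example.com
```

Since a recursive wget saves pages under a directory named after the
host, the file names in the report map back to URLs fairly directly.
wget also honors /robots.txt by default when recursing.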