Date: Wed, 20 Jun 2001 20:19:27 -0400 (EDT)
From: Pete Fritchman <petef@databits.net>
To: FreeBSD-gnats-submit@freebsd.org
Subject: ports/28304: New port: www/crawl
Message-ID: <200106210019.f5L0JRn54781@electron.databits.net>
>Number:         28304
>Category:       ports
>Synopsis:       New port: www/crawl
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-ports
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Wed Jun 20 17:30:01 PDT 2001
>Closed-Date:
>Last-Modified:
>Originator:     Pete Fritchman
>Release:        FreeBSD 4.3-STABLE i386
>Organization:
Databits Network Services, Inc.
>Environment:
System: FreeBSD electron.databits.net 4.3-STABLE FreeBSD 4.3-STABLE #7: Mon Jun 11 10:15:45 EDT 2001 root@electron.databits.net:/usr/obj/usr/src/sys/ELECTRON i386
>Description:
The crawl utility starts a depth-first traversal of the web at the
specified URLs.  It stores all JPEG images that match the configured
constraints.  Crawl is fairly fast and allows for graceful termination.
After terminating crawl, it is possible to restart it at exactly
the same spot where it was terminated.  Crawl keeps a persistent
database that allows multiple crawls without revisiting sites.

The main reason for writing crawl was the lack of simple open source
web crawlers.  Crawl is only a few thousand lines of code and fairly
easy to debug and customize.

Some of the main features:
 - Saves encountered JPEG images
 - Image selection based on regular expressions and size constraints
 - Resume previous crawl after graceful termination
 - Persistent database of visited URLs
 - Very small and efficient code
 - Supports robots.txt

WWW: http://www.monkey.org/~provos/crawl/

Note: This port depends on devel/libevent, in PR ports/28302
>How-To-Repeat:
>Fix:

# This is a shell archive.  Save it in a file, remove anything before
# this line, and then unpack it by entering "sh file".  Note, it may
# create directories; files and directories will be owned by you and
# have default permissions.
#
# This archive contains:
#
#	crawl
#	crawl/distinfo
#	crawl/pkg-descr
#	crawl/pkg-plist
#	crawl/pkg-comment
#	crawl/Makefile
#	crawl/files
#	crawl/files/patch-configure.in
#
echo c - crawl
mkdir -p crawl > /dev/null 2>&1
echo x - crawl/distinfo
sed 's/^X//' >crawl/distinfo << 'END-of-crawl/distinfo'
XMD5 (crawl-0.1.tar.gz) = 93df9d0e6534bc4fc462950c023ec2e7
END-of-crawl/distinfo
echo x - crawl/pkg-descr
sed 's/^X//' >crawl/pkg-descr << 'END-of-crawl/pkg-descr'
XThe crawl utility starts a depth-first traversal of the web at the
Xspecified URLs.  It stores all JPEG images that match the configured
Xconstraints.  Crawl is fairly fast and allows for graceful termination.
XAfter terminating crawl, it is possible to restart it at exactly
Xthe same spot where it was terminated.  Crawl keeps a persistent
Xdatabase that allows multiple crawls without revisiting sites.
X
XThe main reason for writing crawl was the lack of simple open source
Xweb crawlers.  Crawl is only a few thousand lines of code and fairly
Xeasy to debug and customize.
X
XSome of the main features:
X - Saves encountered JPEG images
X - Image selection based on regular expressions and size constraints
X - Resume previous crawl after graceful termination
X - Persistent database of visited URLs
X - Very small and efficient code
X - Supports robots.txt
X
XWWW: http://www.monkey.org/~provos/crawl/
X
X- Pete
Xpetef@databits.net
END-of-crawl/pkg-descr
echo x - crawl/pkg-plist
sed 's/^X//' >crawl/pkg-plist << 'END-of-crawl/pkg-plist'
Xbin/crawl
END-of-crawl/pkg-plist
echo x - crawl/pkg-comment
sed 's/^X//' >crawl/pkg-comment << 'END-of-crawl/pkg-comment'
XA small, efficient web crawler with advanced features
END-of-crawl/pkg-comment
echo x - crawl/Makefile
sed 's/^X//' >crawl/Makefile << 'END-of-crawl/Makefile'
X# New ports collection makefile for:	crawl
X# Date created:				20 June 2001
X# Whom:					Pete Fritchman <petef@databits.net>
X#
X# $FreeBSD$
X#
X
XPORTNAME=	crawl
XPORTVERSION=	0.1
XCATEGORIES=	www
XMASTER_SITES=	http://www.monkey.org/~provos/
X
XMAINTAINER=	petef@databits.net
X
XBUILD_DEPENDS=	${LOCALBASE}/lib/libevent.a:${PORTSDIR}/devel/libevent
X
XWRKSRC=		${WRKDIR}/${PORTNAME}
X
XUSE_AUTOCONF=	yes
XGNU_CONFIGURE=	yes
XCONFIGURE_ARGS=	--with-libevent=${LOCALBASE}
X
XMAN1=		crawl.1
X
X.include <bsd.port.mk>
END-of-crawl/Makefile
echo c - crawl/files
mkdir -p crawl/files > /dev/null 2>&1
echo x - crawl/files/patch-configure.in
sed 's/^X//' >crawl/files/patch-configure.in << 'END-of-crawl/files/patch-configure.in'
X--- configure.in.orig	Wed Jun 20 14:41:44 2001
X+++ configure.in	Wed Jun 20 17:30:07 2001
X@@ -38,11 +38,11 @@
X 	;;
X     *)
X 	AC_MSG_RESULT($withval)
X-	if test -f $withval/event.h -a -f $withval/libevent.a; then
X+	if test -f $withval/include/event.h -a -f $withval/lib/libevent.a; then
X 	   owd=`pwd`
X 	   if cd $withval; then withval=`pwd`; cd $owd; fi
X-	   EVENTINC="-I$withval"
X-	   EVENTLIB="-L$withval -levent"
X+	   EVENTINC="-I$withval/include"
X+	   EVENTLIB="-L$withval/lib -levent"
X 	else
X 	   AC_ERROR(event.h or libevent.a not found in $withval)
X 	fi
END-of-crawl/files/patch-configure.in
exit
>Release-Note:
>Audit-Trail:
>Unformatted:
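
For reference, a sketch of how the shar in the >Fix: section would
typically be unpacked and tested (the shar file name and its location
are illustrative; this assumes a standard ports tree under /usr/ports
and that devel/libevent from PR ports/28302 is already in the tree):

	# Unpack the archive into the www category; it creates the
	# crawl/ port directory with the files listed above.
	cd /usr/ports/www
	sh /path/to/pr-28304.shar

	# Build and install the port the usual way.
	cd crawl
	make install clean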
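The files/patch-configure.in patch adjusts configure's --with-libevent
test for the split include/ and lib/ layout used when libevent is
installed under ${LOCALBASE}.  A sketch of what the patched check
amounts to, assuming LOCALBASE=/usr/local:

	# With CONFIGURE_ARGS= --with-libevent=/usr/local, the patched
	# configure.in effectively performs:
	test -f /usr/local/include/event.h -a -f /usr/local/lib/libevent.a \
	    && echo "libevent found"
	# and then sets:
	#   EVENTINC="-I/usr/local/include"
	#   EVENTLIB="-L/usr/local/lib -levent"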
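Once installed, crawl is started with one or more URLs on the command
line, per the description above (the URL is illustrative; see crawl(1)
for available options such as image size constraints):

	# Begin a depth-first traversal at the given URL; matching JPEG
	# images are saved and visited URLs recorded in the persistent
	# database, so a later run can resume where this one stopped.
	crawl http://www.example.com/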