Date:      Wed, 20 Jun 2001 20:19:27 -0400 (EDT)
From:      Pete Fritchman <petef@databits.net>
To:        FreeBSD-gnats-submit@freebsd.org
Subject:   ports/28304: New port: www/crawl
Message-ID:  <200106210019.f5L0JRn54781@electron.databits.net>


>Number:         28304
>Category:       ports
>Synopsis:       New port: www/crawl
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-ports
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Wed Jun 20 17:30:01 PDT 2001
>Closed-Date:
>Last-Modified:
>Originator:     Pete Fritchman
>Release:        FreeBSD 4.3-STABLE i386
>Organization:
Databits Network Services, Inc.
>Environment:
System: FreeBSD electron.databits.net 4.3-STABLE FreeBSD 4.3-STABLE #7: Mon Jun 11 10:15:45 EDT 2001 root@electron.databits.net:/usr/obj/usr/src/sys/ELECTRON i386

>Description:

The crawl utility starts a depth-first traversal of the web at the
specified URLs. It stores all JPEG images that match the configured
constraints.  Crawl is fairly fast and allows for graceful termination.
After terminating crawl, it is possible to restart it at exactly
the same spot where it was terminated. Crawl keeps a persistent
database that allows multiple crawls without revisiting sites.

The main reason for writing crawl was the lack of simple open source
web crawlers. Crawl is only a few thousand lines of code and fairly
easy to debug and customize.

Some of the main features:
 - Saves encountered JPEG images
 - Image selection based on regular expressions and size constraints
 - Resume previous crawl after graceful termination
 - Persistent database of visited URLs
 - Very small and efficient code
 - Supports robots.txt

WWW: http://www.monkey.org/~provos/crawl/
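
A rough usage sketch (the exact command-line syntax is an assumption based on
the description above, not taken from the crawl documentation):

  crawl http://www.example.com/   # start a depth-first crawl, saving matching JPEGs
  (interrupt for a graceful stop)
  crawl http://www.example.com/   # re-running is expected to resume from the persistent database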

Note:  This port depends on devel/libevent, in PR ports/28302

>How-To-Repeat:

>Fix:

# This is a shell archive.  Save it in a file, remove anything before
# this line, and then unpack it by entering "sh file".  Note, it may
# create directories; files and directories will be owned by you and
# have default permissions.
#
# This archive contains:
#
#	crawl
#	crawl/distinfo
#	crawl/pkg-descr
#	crawl/pkg-plist
#	crawl/pkg-comment
#	crawl/Makefile
#	crawl/files
#	crawl/files/patch-configure.in
#
echo c - crawl
mkdir -p crawl > /dev/null 2>&1
echo x - crawl/distinfo
sed 's/^X//' >crawl/distinfo << 'END-of-crawl/distinfo'
XMD5 (crawl-0.1.tar.gz) = 93df9d0e6534bc4fc462950c023ec2e7
END-of-crawl/distinfo
echo x - crawl/pkg-descr
sed 's/^X//' >crawl/pkg-descr << 'END-of-crawl/pkg-descr'
XThe crawl utility starts a depth-first traversal of the web at the
Xspecified URLs. It stores all JPEG images that match the configured
Xconstraints.  Crawl is fairly fast and allows for graceful termination.
XAfter terminating crawl, it is possible to restart it at exactly
Xthe same spot where it was terminated. Crawl keeps a persistent
Xdatabase that allows multiple crawls without revisiting sites.
X
XThe main reason for writing crawl was the lack of simple open source
Xweb crawlers. Crawl is only a few thousand lines of code and fairly
Xeasy to debug and customize.
X
XSome of the main features:
X - Saves encountered JPEG images
X - Image selection based on regular expressions and size constraints
X - Resume previous crawl after graceful termination
X - Persistent database of visited URLs
X - Very small and efficient code
X - Supports robots.txt
X
XWWW: http://www.monkey.org/~provos/crawl/
X
X- Pete
Xpetef@databits.net
END-of-crawl/pkg-descr
echo x - crawl/pkg-plist
sed 's/^X//' >crawl/pkg-plist << 'END-of-crawl/pkg-plist'
Xbin/crawl
END-of-crawl/pkg-plist
echo x - crawl/pkg-comment
sed 's/^X//' >crawl/pkg-comment << 'END-of-crawl/pkg-comment'
XA small, efficient web crawler with advanced features
END-of-crawl/pkg-comment
echo x - crawl/Makefile
sed 's/^X//' >crawl/Makefile << 'END-of-crawl/Makefile'
X# New ports collection makefile for:	crawl
X# Date created:				20 June 2001
X# Whom:					Pete Fritchman <petef@databits.net>
X#
X# $FreeBSD$
X#
X
XPORTNAME=	crawl
XPORTVERSION=	0.1
XCATEGORIES=	www
XMASTER_SITES=	http://www.monkey.org/~provos/
X
XMAINTAINER=	petef@databits.net
X
XBUILD_DEPENDS=	${LOCALBASE}/lib/libevent.a:${PORTSDIR}/devel/libevent
X
XWRKSRC=		${WRKDIR}/${PORTNAME}
X
XUSE_AUTOCONF=	yes
XGNU_CONFIGURE=	yes
XCONFIGURE_ARGS=	--with-libevent=${LOCALBASE}
X
XMAN1=	crawl.1
X
X.include <bsd.port.mk>
END-of-crawl/Makefile
echo c - crawl/files
mkdir -p crawl/files > /dev/null 2>&1
echo x - crawl/files/patch-configure.in
sed 's/^X//' >crawl/files/patch-configure.in << 'END-of-crawl/files/patch-configure.in'
X--- configure.in.orig	Wed Jun 20 14:41:44 2001
X+++ configure.in	Wed Jun 20 17:30:07 2001
X@@ -38,11 +38,11 @@
X      ;;
X   *)
X      AC_MSG_RESULT($withval)
X-     if test -f $withval/event.h -a -f $withval/libevent.a; then
X+     if test -f $withval/include/event.h -a -f $withval/lib/libevent.a; then
X         owd=`pwd`
X         if cd $withval; then withval=`pwd`; cd $owd; fi
X-        EVENTINC="-I$withval"
X-        EVENTLIB="-L$withval -levent"
X+        EVENTINC="-I$withval/include"
X+        EVENTLIB="-L$withval/lib -levent"
X      else
X         AC_ERROR(event.h or libevent.a not found in $withval)
X      fi
END-of-crawl/files/patch-configure.in
exit
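
To try the port, the shell archive above can be saved to a file and unpacked
into the ports tree; a rough sketch, assuming a standard /usr/ports layout,
a placeholder filename, and that devel/libevent (ports/28302) is already
installed:

  cd /usr/ports/www
  sh /path/to/saved-message.shar   # creates the crawl/ port skeleton
  cd crawl
  make && make install             # fetch, patch configure.in, build, install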

>Release-Note:
>Audit-Trail:
>Unformatted:
