From: Giorgos Keramidas
To: Alexander Best
Cc: Dan Nelson, freebsd-questions@freebsd.org
Subject: Re: mailing list archive as mbox
Date: Sun, 07 Mar 2010 15:20:37 +0200
In-Reply-To: (Alexander Best's message of "Sun, 07 Mar 2010 12:08:32 +0100 (CET)")
Message-ID: <87bpf01d5m.fsf@kobe.laptop>

On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best wrote:
> Dan Nelson wrote on 2010-03-07:
>> In the last episode (Mar 07), Alexander Best said:
>> > hi there,
>> >
>> > what are the steps i need to perform to get a copy of the entire
>> > mailingslist archive of lets say freebsd-current@ in mbox format?
>
>> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
>> where you can download weekly gzipped archives of all the mailing
>> lists since their creation.
>
> thanks for the hint, but it would take hours to download all those gzipped
> files, extract them and merge them.
>
> i really need ALL the messages of a mailinglist. of course i could use the
> gzipped files you mentioned if i had some script for downloading extracting
> and merging all those files for me.

It's relatively easy to hack one together.  You can get a list of year
names from the /archive/ directory itself with curl(1) and a small
amount of Python plumbing around curl:

    >>> from subprocess import Popen as popen, PIPE
    >>> import re
    >>> yre = re.compile(r'^d.*\s(\d+)$')
    >>> devnull = open("/dev/null", "w")
    >>> def years():
    ...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
    ...     ylist = []
    ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
    ...         m = yre.match(line)
    ...         if m:
    ...             ylist.append(int(m.group(1)))
    ...     return ylist
    ...
    >>> years()
    [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]

Then you can grab a list of the freebsd-current archives by looping
through the list of years and looking for the files that match the
pattern:

    ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+\.freebsd-current\.gz)

Using a pipe to parse the output of curl you can collect a list of all
the files that match this pattern, e.g.:

    >>> def yearfiles(year):
    ...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
    ...     curl = "curl -o /dev/stdout %s/" % base
    ...     flist = []
    ...     fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')
    ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
    ...         m = fre.match(line)
    ...         if m:
    ...             flist.append("%s/%s" % (base, m.group(1)))
    ...     return flist
    ...
    >>> yearfiles(1994)
    []
    >>> yearfiles(1995)
    ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz', ...]

Concatenating the file lists of all years is then trivial:

    >>> ylist = years()
    >>> ylist
    [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
    >>> flist = []
    >>> for y in ylist:
    ...     flist.extend(yearfiles(y))
    ...
    >>> len(flist)
    785

Once you have the list of all the remote gzipped files, you can loop
through it once more and fetch each file locally with curl.  I'm only
going to fetch the first two files here, but feel free to fetch all of
them in your version of the script:

    >>> flist = flist[:2]
    >>> flist
    ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
     'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']
    >>>
    >>> import os
    >>> from subprocess import call
    >>> def getfile(url):
    ...     out = os.path.basename(url)
    ...     retcode = call(["curl", "-o", out, url], stderr=devnull)
    ...     if retcode == 0:
    ...         print "fetched %s" % url
    ...     return (url, out, retcode)
    ...
    >>> map(getfile, flist)
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
    ...
    [('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz', '19950101.freebsd-current.gz', 0),
     ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz', '19950226.freebsd-current.gz', 0)]
    >>>

A slightly hackish script that collects all this into a more usable
whole, but lacks LOTS of error checking, is the following:

    #!/usr/bin/env python

    from subprocess import call, Popen as popen, PIPE
    import os
    import re
    import sys

    # Sink for curl's progress output; opened for writing because it
    # is passed to the child processes as their stderr.
    devnull = open(os.devnull, "w")
    # Directory entries of the FTP listing whose name is a number.
    yre = re.compile(r'^d.*\s(\d+)$')
    # Archive names such as 19950101.freebsd-current.gz.
    fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')

    def years():
        """Return the years that have a directory under /archive/."""
        curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
        ylist = []
        for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
            m = yre.match(line)
            if m:
                ylist.append(int(m.group(1)))
        return ylist

    def yearfiles(year):
        """Return the URLs of the freebsd-current archives of one year."""
        base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
        curl = "curl -o /dev/stdout %s/" % base
        flist = []
        for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
            m = fre.match(line)
            if m:
                flist.append("%s/%s" % (base, m.group(1)))
        return flist

    def getfile(url):
        """Fetch url into the current directory; return (url, file, status)."""
        out = os.path.basename(url)
        retcode = call(["curl", "-o", out, url], stderr=devnull)
        if retcode == 0:
            print "fetched %s" % url
        return (url, out, retcode)

    if __name__ == "__main__":
        print "Fetching year list."
        ylist = years()
        if len(ylist) == 0:
            print "No yearly archives found."
            sys.exit(1)

        print "Fetching file lists for %d years." % len(ylist)
        flist = []
        for y in ylist:
            flist.extend(yearfiles(y))
        if len(flist) == 0:
            print "No archives found."
            sys.exit(1)

        print "Fetching %d archives." % len(flist)
        fresult = map(getfile, flist)

        # Split the results by curl's exit status.
        fok = [fentry[1] for fentry in fresult if fentry[2] == 0]
        ferr = [fentry[1] for fentry in fresult if fentry[2] != 0]
        if len(fok) > 0:
            print ""
            print "Successfully downloaded %d archives" % len(fok)
            for f in fok:
                print " %s" % f
        if len(ferr) > 0:
            print ""
            print "Failed to download %d archives" % len(ferr)
            for f in ferr:
                print " %s" % f

Running this with a couple of extra lines that limit the FTP
connections and fetch only part of the freebsd-current mail archives
produces the following output on my laptop:

    keramida@kobe:/tmp$ python foo.py
    Fetching year list.
    Fetching file lists for 3 years.
    Fetching 5 archives.
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950305.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950312.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950319.freebsd-current.gz

    Successfully downloaded 5 archives
     19950101.freebsd-current.gz
     19950226.freebsd-current.gz
     19950305.freebsd-current.gz
     19950312.freebsd-current.gz
     19950319.freebsd-current.gz

Without the limiting code, which I removed from the example, the script
will try to fetch all the archive files for all 17 years.  Then you can
simply type:

    gzip -cd *.freebsd-current.gz > freebsd-current.mbox

to produce a single UNIX mbox file with all the messages.
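
If you would rather do the final merge from Python too, something along
these lines should work.  It is only a minimal sketch: the helper name
merge_archives() is made up for this example, and it assumes that
sorting the file names gives the order you want, which holds for the
YYYYMMDD-prefixed archive names shown above.

```python
import glob
import gzip
import shutil


def merge_archives(pattern, mbox_path):
    """Decompress every gzip file matching `pattern` into one mbox file.

    merge_archives is a hypothetical helper, not part of the script above.
    Files are appended in sorted name order and the list of merged file
    names is returned.
    """
    names = sorted(glob.glob(pattern))
    with open(mbox_path, "wb") as mbox:
        for name in names:
            # gzip.open() yields the decompressed bytes of each archive.
            with gzip.open(name, "rb") as archive:
                shutil.copyfileobj(archive, mbox)
    return names
```

Calling merge_archives("*.freebsd-current.gz", "freebsd-current.mbox")
in the download directory should then have the same effect as the
gzip(1) command line.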