Date: Sun, 07 Mar 2010 15:20:37 +0200
From: Giorgos Keramidas <keramida@ceid.upatras.gr>
To: Alexander Best <alexbestms@wwu.de>
Cc: Dan Nelson <dnelson@allantgroup.com>, freebsd-questions@freebsd.org
Subject: Re: mailing list archive as mbox
Message-ID: <87bpf01d5m.fsf@kobe.laptop>
In-Reply-To: <permail-2010030711083280e26a0b000037c1-a_best01@message-id.uni-muenster.de> (Alexander Best's message of "Sun, 07 Mar 2010 12:08:32 +0100 (CET)")
References: <permail-2010030711083280e26a0b000037c1-a_best01@message-id.uni-muenster.de>

On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best <alexbestms@wwu.de> wrote:
> Dan Nelson wrote on 2010-03-07:
>> In the last episode (Mar 07), Alexander Best said:
>> > hi there,
>> >
>> > what are the steps i need to perform to get a copy of the entire
>> > mailingslist archive of lets say freebsd-current@ in mbox format?
>>
>> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
>> where you can download weekly gzipped archives of all the mailing
>> lists since their creation.
>
> thanks for the hint, but it would take hours to download all those gzipped
> files, extract them and merge them.
>
> i really need ALL the messages of a mailinglist. of course i could use the
> gzipped files you mentioned if i had some script for downloading extracting
> and merging all those files for me.

It's relatively easy to hack one.  You can get a list of year names from
the /archive/ directory itself with curl(1) and a small amount of Python
plumbing around curl:

    >>> from subprocess import Popen as popen, PIPE
    >>> import re
    >>> yre = re.compile(r'^d.*\s(\d+)$')
    >>> devnull = open("/dev/null", "w")
    >>> def years():
    ...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
    ...     ylist = []
    ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
    ...         m = yre.match(line)
    ...         if m:
    ...             ylist.append(int(m.group(1)))
    ...     return ylist
    ...
    >>> years()
    [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]

Then you can grab a list of the freebsd-current archives by looping
through the list of years and looking for the files that match the
pattern:

    ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+\.freebsd-current\.gz)

Using a pipe to parse the output of curl you can collect a list of all
the files that match this pattern, e.g.:

    >>> def yearfiles(year):
    ...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
    ...     curl = "curl -o /dev/stdout %s/" % base
    ...     flist = []
    ...     fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')
    ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
    ...         m = fre.match(line)
    ...         if m:
    ...             flist.append("%s/%s" % (base, m.group(1)))
    ...     return flist
    ...
    >>> yearfiles(1994)
    []
    >>> yearfiles(1995)
    ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/1.freebsd-current.gz', ...]

Concatenating the file lists of all years is then trivial:

    >>> ylist = years()
    >>> ylist
    [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
    >>> flist = []
    >>> for y in ylist:
    ...     f = yearfiles(y)
    ...     flist = flist + f
    ...
    >>> len(flist)
    785

Once you have the list of all the remote gzipped files, you can loop
through it once more and fetch each file locally.  I'm only going to
fetch the first two files here, but feel free to fetch all of them in
your version of the script:

    >>> flist = flist[:2]
    >>> flist
    ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz', 'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']
    >>>
    >>> import os
    >>> from subprocess import call
    >>> def getfile(url):
    ...     out = os.path.basename(url)
    ...     retcode = call(["curl", "-o", out, url], stderr=devnull)
    ...     if retcode == 0:
    ...         print "fetched %s" % url
    ...     return tuple([url, out, retcode])
    ...
    >>> map(getfile, flist)
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
    ...
    [('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz', '19950101.freebsd-current.gz', 0), ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz', '19950226.freebsd-current.gz', 0)]
    >>>

A slightly hackish script that collects all this into a more usable
whole, but lacks LOTS of error checking, is the following:

    #!/usr/bin/env python

    from subprocess import call, Popen as popen, PIPE
    import os
    import re
    import sys

    # Sink for curl's progress output on stderr.
    devnull = open("/dev/null", "w")

    # Directory entries whose name is a year, and weekly archive file names.
    yre = re.compile(r'^d.*\s(\d+)$')
    fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')

    def years():
        # List the top-level archive directory and keep the year names.
        curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
        ylist = []
        for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
            m = yre.match(line)
            if m:
                ylist.append(int(m.group(1)))
        return ylist

    def yearfiles(year):
        # Return the URLs of all freebsd-current archives of one year.
        base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
        curl = "curl -o /dev/stdout %s/" % base
        flist = []
        for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
            m = fre.match(line)
            if m:
                flist.append("%s/%s" % (base, m.group(1)))
        return flist

    def getfile(url):
        # Fetch one archive into the current directory.
        out = os.path.basename(url)
        retcode = call(["curl", "-o", out, url], stderr=devnull)
        if retcode == 0:
            print "fetched %s" % url
        return tuple([url, out, retcode])

    if __name__ == "__main__":
        print "Fetching year list."
        ylist = years()
        if len(ylist) == 0:
            print "No yearly archives found."
            sys.exit(1)

        print "Fetching file lists for %d years." % len(ylist)
        flist = []
        for y in ylist:
            f = yearfiles(y)
            flist = flist + f
        if len(flist) == 0:
            print "No archives found."
            sys.exit(1)

        print "Fetching %d archives." % len(flist)
        fresult = map(getfile, flist)

        # Split the results by curl's exit code.
        fok = [fentry[1] for fentry in fresult if fentry[2] == 0]
        ferr = [fentry[1] for fentry in fresult if fentry[2] != 0]
        if len(fok) > 0:
            print ""
            print "Successfully downloaded %d archives" % len(fok)
            for f in fok:
                print "    %s" % f
        if len(ferr) > 0:
            print ""
            print "Failed to download %d archives" % len(ferr)
            for f in ferr:
                print "    %s" % f

Running this with a couple of extra lines that limit the FTP connections
a bit and fetch only part of the freebsd-current mail archives produces
the following output on my laptop:

    keramida@kobe:/tmp$ python foo.py
    Fetching year list.
    Fetching file lists for 3 years.
    Fetching 5 archives.
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950305.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950312.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950319.freebsd-current.gz

    Successfully downloaded 5 archives
        19950101.freebsd-current.gz
        19950226.freebsd-current.gz
        19950305.freebsd-current.gz
        19950312.freebsd-current.gz
        19950319.freebsd-current.gz

Without the limiting code that I removed from the example, it will try
to fetch all the archive files for all 17 years.

Then you can simply type:

    gzip -cd *.freebsd-current.gz > freebsd-current.mbox

to produce a single UNIX mbox file with all the messages.
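If gzip(1) is not at hand, the final merge step could also be done from Python. The following is only a rough sketch and not part of the script above: the function name `merge_archives` and its two parameters are made up for this illustration. It uses only standard-library modules, and it assumes that the downloaded files keep their date-prefixed names, so that sorting by name gives chronological order, much like the shell glob does.

```python
import glob
import gzip
import shutil


def merge_archives(pattern, out_path):
    """Decompress every file matching `pattern` and append it to `out_path`.

    Hypothetical helper for illustration only. Returns the list of input
    files in the order they were merged.
    """
    # Sorting by name puts archives with a YYYYMMDD prefix in date order.
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for path in paths:
            # gzip.open() yields the decompressed bytes of each archive;
            # copyfileobj() streams them without loading a file into memory.
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return paths
```

Calling `merge_archives("*.freebsd-current.gz", "freebsd-current.mbox")` in the download directory should then be roughly equivalent to the gzip one-liner, with the caveat that nothing here checks whether each archive is a valid mbox fragment.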