Date: Mon, 08 Mar 2010 02:24:17 +0100 (CET)
From: Alexander Best <alexbestms@wwu.de>
To: Giorgos Keramidas <keramida@ceid.upatras.gr>
Cc: Dan Nelson <dnelson@allantgroup.com>, freebsd-questions@freebsd.org
Subject: Re: mailing list archive as mbox
Message-ID: <permail-2010030801241780e26a0b0000466b-a_best01@message-id.uni-muenster.de>
In-Reply-To: <87bpf01d5m.fsf@kobe.laptop>

Giorgos Keramidas wrote on 2010-03-07:
> On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best <alexbestms@wwu.de> wrote:
> > Dan Nelson wrote on 2010-03-07:
> >> In the last episode (Mar 07), Alexander Best said:
> >> > hi there,
> >> >
> >> > what are the steps i need to perform to get a copy of the entire
> >> > mailing list archive of let's say freebsd-current@ in mbox format?
> >>
> >> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/
> >> where you can download weekly gzipped archives of all the mailing
> >> lists since their creation.
> >
> > thanks for the hint, but it would take hours to download all those
> > gzipped files, extract them and merge them.
> >
> > i really need ALL the messages of a mailing list. of course i could
> > use the gzipped files you mentioned if i had some script for
> > downloading, extracting and merging all those files for me.
>
> It's relatively easy to hack one.

wow!!! thanks a billion. that's a great script. i pointed the vars containing
the ftp sites at mirrors near me, which give me better download speed, and will
run the script for freebsd-current@ tonight (~850 archives to pull).

thanks again. great job. :-)

alex

> You can get a list of year names from the /archive/ directory itself
> with curl(1) and a small amount of Python plumbing around curl:
>
>     >>> from subprocess import Popen as popen, PIPE
>     >>> import re
>     >>> yre = re.compile('^d.*\s(\d+)$')
>     >>> devnull = file("/dev/null")
>     >>> def years():
>     ...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
>     ...     ylist = []
>     ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>     ...         m = yre.match(line)
>     ...         if m:
>     ...             ylist.append(int(m.group(1)))
>     ...     return ylist
>     ...
>     >>> years()
>     [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
>      2004, 2005, 2006, 2007, 2008, 2009, 2010]
>
> Then you can grab a list of the freebsd-current archives by looping
> through the list of years and looking for the list of files that match
> the pattern:
>
>     ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+.freebsd-current.gz)
>
> Using a pipe to parse the output of curl you can collect a list of all
> the files that match this pattern, e.g.:
>
>     >>> def yearfiles(year):
>     ...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
>     ...     curl = "curl -o /dev/stdout %s/" % base
>     ...     flist = []
>     ...     fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')
>     ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>     ...         m = fre.match(line)
>     ...         if m:
>     ...             flist.append("%s/%s" % (base, m.group(1)))
>     ...     return flist
>     ...
>     >>> yearfiles(1994)
>     []
>     >>> yearfiles(1995)
>     ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/1.freebsd-current.gz',
>      ...]
>
> Concatenating the file lists of all years and fetching each one of them
> with curl is then trivial:
>
>     >>> ylist = years()
>     >>> ylist
>     [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
>      2004, 2005, 2006, 2007, 2008, 2009, 2010]
>     >>> flist = []
>     >>> for y in ylist:
>     ...     f = yearfiles(y)
>     ...     flist = flist + f
>     ...
>     >>> len(flist)
>     785
>
> Once you have the list of all the remote gzipped files, you can loop
> through the list of files once more and fetch them locally.
> I'm only going to fetch the first two files here, but feel free to
> fetch all of them in your version of the script:
>
>     >>> flist = flist[:2]
>     >>> flist
>     ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>      'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']
>     >>> import os
>     >>> from subprocess import call
>     >>> def getfile(url):
>     ...     out = os.path.basename(url)
>     ...     retcode = call(["curl", "-o", out, url], stderr=devnull)
>     ...     if retcode == 0:
>     ...         print "fetched %s" % url
>     ...     return tuple([url, out, retcode])
>     ...
>     >>> map(getfile, flist)
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
>     ...
>     [('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
>       '19950101.freebsd-current.gz', 0),
>      ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz',
>       '19950226.freebsd-current.gz', 0)]
>
> A slightly hackish script that collects all this to a more usable whole
> but lacks LOTS of error checking is the following:
>
>     #!/usr/bin/env python
>
>     from subprocess import call, Popen as popen, PIPE
>     import os
>     import re
>     import sys
>
>     devnull = file("/dev/null")
>     yre = re.compile('^d.*\s(\d+)$')
>     fre = re.compile(r'^.*\D(\d+.freebsd-current.gz).*$')
>
>     def years():
>         curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
>         ylist = []
>         for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>             m = yre.match(line)
>             if m:
>                 ylist.append(int(m.group(1)))
>         return ylist
>
>     def yearfiles(year):
>         base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
>         curl = "curl -o /dev/stdout %s/" % base
>         flist = []
>         for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
>             m = fre.match(line)
>             if m:
>                 flist.append("%s/%s" % (base, m.group(1)))
>         return flist
>
>     def getfile(url):
>         out = os.path.basename(url)
>         retcode = call(["curl", "-o", out, url], stderr=devnull)
>         if retcode == 0:
>             print "fetched %s" % url
>         return tuple([url, out, retcode])
>
>     if __name__ == "__main__":
>         print "Fetching year list."
>         ylist = years()
>         if len(ylist) == 0:
>             print "No yearly archives found."
>             sys.exit(1)
>
>         print "Fetching file lists for %d years." % len(ylist)
>         flist = []
>         for y in ylist:
>             f = yearfiles(y)
>             flist = flist + f
>         if len(flist) == 0:
>             print "No archives found."
>             sys.exit(1)
>
>         print "Fetching %d archives." % len(flist)
>         fresult = map(getfile, flist)
>         fok = [fentry[1] for fentry in fresult if fentry[2] == 0]
>         ferr = [fentry[1] for fentry in fresult if fentry[2] != 0]
>         if len(fok) > 0:
>             print ""
>             print "Successfully downloaded %d archives" % len(fok)
>             for f in fok:
>                 print " %s" % f
>         if len(ferr) > 0:
>             print ""
>             print "Failed to download %d archives" % len(ferr)
>             for f in ferr:
>                 print " %s" % f
>
> Running this with a couple of lines to limit the FTP connections a bit
> and fetch only parts of the freebsd-current mail archives produces the
> following output on my laptop:
>
>     keramida@kobe:/tmp$ python foo.py
>     Fetching year list.
>     Fetching file lists for 3 years.
>     Fetching 5 archives.
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950305.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950312.freebsd-current.gz
>     fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950319.freebsd-current.gz
>
>     Successfully downloaded 5 archives
>      19950101.freebsd-current.gz
>      19950226.freebsd-current.gz
>      19950305.freebsd-current.gz
>      19950312.freebsd-current.gz
>      19950319.freebsd-current.gz
>
> Without the limiting code that I removed from the example, it will try
> to fetch all the archive files for all 17 years.
>
> Then you can simply type:
>
>     gzip -cd *.freebsd-current.gz > freebsd-current.mbox
>
> to produce a single UNIX mbox file with all the messages.
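
The final merge step can also be done from Python instead of the shell. What
follows is only a minimal Python 3 sketch and is not part of the script quoted
above: the helper name merge_archives and its two parameters are made up for
illustration. It assumes the downloaded files sit in the current directory and
that their names start with a YYYYMMDD date, as in the listing above, so that
sorting by name keeps the messages in chronological order (the same order the
shell glob gives to gzip -cd).

```python
import glob
import gzip
import shutil


def merge_archives(pattern="*.freebsd-current.gz",
                   out_path="freebsd-current.mbox"):
    """Decompress every archive matching `pattern` into one mbox file.

    Hypothetical helper: roughly equivalent to
    `gzip -cd *.freebsd-current.gz > freebsd-current.mbox`.
    Returns the list of archive names in the order they were merged.
    """
    # sorted() reproduces the shell's lexical glob order, which is
    # chronological for names that begin with YYYYMMDD.
    names = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for name in names:
            # Stream each archive so a large list never has to fit in memory.
            with gzip.open(name, "rb") as src:
                shutil.copyfileobj(src, out)
    return names


if __name__ == "__main__":
    merged = merge_archives()
    print("merged %d archives" % len(merged))
```

Streaming with shutil.copyfileobj keeps memory use flat no matter how many
weekly archives are merged; the output name deliberately does not end in .gz,
so it can never match the input pattern.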