Date:      Sun, 07 Mar 2010 15:20:37 +0200
From:      Giorgos Keramidas <keramida@ceid.upatras.gr>
To:        Alexander Best <alexbestms@wwu.de>
Cc:        Dan Nelson <dnelson@allantgroup.com>, freebsd-questions@freebsd.org
Subject:   Re: mailing list archive as mbox
Message-ID:  <87bpf01d5m.fsf@kobe.laptop>
In-Reply-To: <permail-2010030711083280e26a0b000037c1-a_best01@message-id.uni-muenster.de> (Alexander Best's message of "Sun, 07 Mar 2010 12:08:32 +0100 (CET)")
References:  <permail-2010030711083280e26a0b000037c1-a_best01@message-id.uni-muenster.de>

On Sun, 07 Mar 2010 12:08:32 +0100 (CET), Alexander Best <alexbestms@wwu.de> wrote:
> Dan Nelson wrote on 2010-03-07:
>> In the last episode (Mar 07), Alexander Best said:
>> > hi there,
>> >
>> > what are the steps i need to perform to get a copy of the entire
>> > mailingslist archive of lets say freebsd-current@ in mbox format?
>>
>> Go to ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/ where
>> you can download weekly gzipped archives of all the mailing lists since
>> their creation.
>
> thanks for the hint, but it would take hours to download all those gzipped
> files, extract them and merge them.
>
> i really need ALL the messages of a mailinglist. of course i could use the
> gzipped files you mentioned if i had some script for downloading extracting
> and merging all those files for me.

It's relatively easy to hack one together.

You can get the list of year directories from the /archive/ directory
itself with curl(1) and a small amount of Python plumbing around it:

    >>> from subprocess import Popen as popen, PIPE
    >>> import re
    >>> yre = re.compile(r'^d.*\s(\d+)$')
    >>> devnull = open("/dev/null", "w")
    >>> def years():
    ...     curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
    ...     ylist = []
    ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
    ...         m = yre.match(line)
    ...         if m:
    ...             ylist.append(int(m.group(1)))
    ...     return ylist
    ...
    >>> years()
    [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
     2006, 2007, 2008, 2009, 2010]
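
If you want to see what the yre pattern does without touching the
network, here is a minimal sketch that runs the same idea over a made-up
listing.  It is written for Python 3, unlike the session above, and the
sample lines and the helper name parse_years are assumptions for
illustration only; a real FTP server may format its listing differently.

```python
import re

# Same idea as yre above: a directory entry ("d...") whose last field is
# all digits.  The trailing \s* tolerates a stray carriage return.
YEAR_RE = re.compile(r'^d.*\s(\d+)\s*$')

# Hypothetical listing text, in the "ls -l" style many FTP servers emit.
SAMPLE_LISTING = """\
drwxr-xr-x    2 1006     1006          512 Jan 01  1995 1994
drwxr-xr-x   40 1006     1006         1024 Jan 01  1996 1995
-rw-r--r--    1 1006     1006          120 Mar 07  2010 README
drwxr-xr-x    2 1006     1006          512 Mar 07  2010 tmp
"""

def parse_years(listing):
    """Return the numeric directory names found in an ls-style listing."""
    years = []
    for line in listing.splitlines():
        m = YEAR_RE.match(line)
        if m:
            years.append(int(m.group(1)))
    return years

print(parse_years(SAMPLE_LISTING))
```

Plain files and directories with non-numeric names are skipped, so only
the year directories end up in the result.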

Then you can build the list of freebsd-current archives by looping
through the years and picking the files whose names match the pattern:

    ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/{year}/freebsd-current/(\d+\.freebsd-current\.gz)

Using a pipe to parse the output of curl you can collect a list of all
the files that match this pattern, e.g.:

    >>> def yearfiles(year):
    ...     base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
    ...     curl = "curl -o /dev/stdout %s/" % base
    ...     flist = []
    ...     fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')
    ...     for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
    ...         m = fre.match(line)
    ...         if m:
    ...             flist.append("%s/%s" % (base, m.group(1)))
    ...     return flist
    ...
    >>> yearfiles(1994)
    []
    >>> yearfiles(1995)
    ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
     ...]
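
The file-name matching can be tried offline in the same way.  The
following is only a sketch, in Python 3, under the assumption that the
listing looks like "ls -l" output; the function name archive_urls, its
listname parameter and the sample lines are all made up for this
example.

```python
import re

BASE = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive"

def archive_urls(listing, year, listname="freebsd-current"):
    """Build full URLs for every '<digits>.<listname>.gz' entry in a listing."""
    # re.escape() keeps the dots in the file name literal, so a name like
    # "19950226xfreebsd-current.gz" is not accepted by accident.
    suffix = re.escape("." + listname + ".gz")
    pattern = re.compile(r'(?:^|\D)(\d+' + suffix + r')\s*$')
    base = "%s/%4d/%s" % (BASE, year, listname)
    urls = []
    for line in listing.splitlines():
        m = pattern.search(line)
        if m:
            urls.append("%s/%s" % (base, m.group(1)))
    return urls

# Hypothetical listing lines; the real server output may differ.
SAMPLE = """\
-rw-r--r--    1 1006     1006        48211 Jan 08  1995 19950101.freebsd-current.gz
-rw-r--r--    1 1006     1006        51877 Mar 05  1995 19950226.freebsd-current.gz
-rw-r--r--    1 1006     1006          300 Mar 05  1995 19950226xfreebsd-current.gz
"""

for url in archive_urls(SAMPLE, 1995):
    print(url)
```

Passing a different listname should be all it takes to point the same
code at another list, assuming its archives follow the same naming.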

Concatenating the file lists of all years is then trivial:

    >>> ylist = years()
    >>> ylist
    [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
    >>> flist = []
    >>> for y in ylist:
    ...     f = yearfiles(y)
    ...     flist = flist + f
    ...
    >>> len(flist)
    785

Once you have the list of all the remote gzipped files, you can loop
through the list of files once more and fetch them locally.  I'm only
going to fetch the first two files here, but feel free to fetch all of
them in your version of the script:

    >>> flist = flist[:2]
    >>> flist
    ['ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz',
     'ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz']
    >>>

    >>> import os
    >>> from subprocess import call
    >>> def getfile(url):
    ...     out = os.path.basename(url)
    ...     retcode = call(["curl", "-o", out, url], stderr=devnull)
    ...     if retcode == 0:
    ...         print "fetched %s" % url
    ...     return tuple([url, out, retcode])
    ...
    >>> map(getfile, flist)
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
    ...
    [('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz', '19950101.freebsd-current.gz', 0),
     ('ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz', '19950226.freebsd-current.gz', 0)]
    >>>
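
With several hundred files to fetch, it may be worth skipping the ones
that an earlier, interrupted run already downloaded.  Here is one
possible sketch of that, in Python 3.  The names fetch, destdir and run
are hypothetical and not part of the session above; run exists only so
that the curl call can be replaced by a stub when testing.

```python
import os
import subprocess

def fetch(url, destdir=".", run=subprocess.call):
    """Fetch url into destdir with curl unless a non-empty copy exists.

    Returns (url, local_path, retcode); retcode is 0 on success or skip.
    """
    out = os.path.join(destdir, os.path.basename(url))
    if os.path.exists(out) and os.path.getsize(out) > 0:
        return (url, out, 0)          # already fetched by an earlier run
    # -s hides the progress meter, -S still prints real errors.
    retcode = run(["curl", "-sS", "-o", out, url])
    if retcode != 0 and os.path.exists(out):
        os.remove(out)                # do not leave a truncated file behind
    return (url, out, retcode)
```

Calling fetch(url) for every url in flist would then behave much like
getfile() above, except that a second run only retries what is missing.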

A slightly hackish script that collects all this into a more usable
whole, but lacks LOTS of error checking, is the following:

    #!/usr/bin/env python

    from subprocess import call, Popen as popen, PIPE
    import os
    import re
    import sys

    devnull = open("/dev/null", "w")
    yre = re.compile(r'^d.*\s(\d+)$')
    fre = re.compile(r'^.*\D(\d+\.freebsd-current\.gz).*$')

    def years():
        curl = "curl -o /dev/stdout ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/"
        ylist = []
        for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
            m = yre.match(line)
            if m:
                ylist.append(int(m.group(1)))
        return ylist

    def yearfiles(year):
        base = "ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/%4d/freebsd-current" % year
        curl = "curl -o /dev/stdout %s/" % base
        flist = []
        for line in popen(curl, shell=True, stdout=PIPE, stderr=devnull).stdout.readlines():
            m = fre.match(line)
            if m:
                flist.append("%s/%s" % (base, m.group(1)))
        return flist

    def getfile(url):
        out = os.path.basename(url)
        retcode = call(["curl", "-o", out, url], stderr=devnull)
        if retcode == 0:
            print "fetched %s" % url
        return tuple([url, out, retcode])

    if __name__ == "__main__":
        print "Fetching year list."
        ylist = years()
        if len(ylist) == 0:
            print "No yearly archives found."
            sys.exit(1)
        print "Fetching file lists for %d years." % len(ylist)

        flist = []
        for y in ylist:
            f = yearfiles(y)
            flist = flist + f
        if len(flist) == 0:
            print "No archives found."
            sys.exit(1)
        print "Fetching %d archives." % len(flist)
        fresult = map(getfile, flist)

        fok = [fentry[1] for fentry in fresult if fentry[2] == 0]
        ferr = [fentry[1] for fentry in fresult if fentry[2] != 0]
        if len(fok) > 0:
            print ""
            print "Successfully downloaded %d archives" % len(fok)
            for f in fok:
                print "    %s" % f
        if len(ferr) > 0:
            print ""
            print "Failed to download %d archives" % len(ferr)
            for f in ferr:
                print "    %s" % f

Running this with a couple of lines to limit the FTP connections a bit
and fetch only parts of the freebsd-current mail archives produces the
following output on my laptop:

    keramida@kobe:/tmp$ python foo.py
    Fetching year list.
    Fetching file lists for 3 years.
    Fetching 5 archives.
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950101.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950226.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950305.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950312.freebsd-current.gz
    fetched ftp://ftp.freebsd.org/pub/FreeBSD/doc/mailing-lists/archive/1995/freebsd-current/19950319.freebsd-current.gz

    Successfully downloaded 5 archives
        19950101.freebsd-current.gz
        19950226.freebsd-current.gz
        19950305.freebsd-current.gz
        19950312.freebsd-current.gz
        19950319.freebsd-current.gz

Without the limiting code that I removed from the example, it will try
to fetch all the archive files for all 17 years.

Then you can simply type:

    gzip -cd *.freebsd-current.gz > freebsd-current.mbox

to produce a single UNIX mbox file with all the messages.
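
If you would rather do the merge from Python too, something along these
lines should work.  It is a sketch in Python 3 using only the standard
library; merge_archives and count_messages are names made up for this
example.  The files are processed in sorted order, which is
chronological here because the archive names start with the date.

```python
import glob
import gzip
import mailbox
import shutil

def merge_archives(pattern, mbox_path):
    """Decompress every file matching pattern, in sorted order, into mbox_path."""
    names = sorted(glob.glob(pattern))
    with open(mbox_path, "wb") as out:
        for name in names:
            with gzip.open(name, "rb") as src:
                shutil.copyfileobj(src, out)
    return names

def count_messages(mbox_path):
    """Return the number of messages the mailbox module finds in the file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

For example, merge_archives("*.freebsd-current.gz", "freebsd-current.mbox")
followed by count_messages("freebsd-current.mbox") gives a quick sanity
check that the merged file still parses as an mbox.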



