Date: Mon, 02 Jul 2007 08:10:14 -0700
From: Garrett Cooper <youshi10@u.washington.edu>
To: Alexander Leidinger <Alexander@Leidinger.net>
Cc: ports@FreeBSD.org, "[LoN]Kamikaze" <LoN_Kamikaze@gmx.de>
Subject: Re: +CONTENTS files
Message-ID: <46891556.4090209@u.washington.edu>
In-Reply-To: <20070702115733.3fotau92scgs4g4s@webmail.leidinger.net>
References: <46887FD3.3080307@u.washington.edu> <46889F5D.70801@gmx.de>
    <4688AF6D.90904@u.washington.edu>
    <20070702115733.3fotau92scgs4g4s@webmail.leidinger.net>

Alexander Leidinger wrote:
> Quoting Garrett Cooper <youshi10@u.washington.edu> (from Mon, 02 Jul
> 2007 00:55:25 -0700):
>
>> [LoN]Kamikaze wrote:
>>> Garrett Cooper wrote:
>>>
>>>> Pardon me for being naive, but wouldn't it be wiser for all of the
>>>> data in the +CONTENTS file to be aggregated into sections instead
>>>> of having line by line info?
>>>>
>>>> Example (net/samba_3.0.25a):
>>>>
>>>> @comment MD5:9e94560ac5e757d3bc5f922dcf3ab4fb
>>>> man/man1/log2pcap.1.gz
>>>> [~100 lines of repetitive data...]
>>>> @comment MD5:9f5fc8df2a1383a175e165ef2e0b10cc
>>>> man/man8/vfs_notify_fam.8.gz
>>>>
>>>> Could be aggregated into:
>>>>
>>>> @MD5
>>>> 9e94560ac5e757d3bc5f922dcf3ab4fb man/man1/log2pcap.1.gz
>>>> c58f068d603a12d4af867c15cf77e636 man/man1/nmblookup.1.gz
>>>> [etc..]
>>>> @end MD5
>>>>
>>>> or something similar to XML.
>>>>
>>>> This would reduce the filesize from n bytes to n - (9 + 4 - 1) *
>>>> i_entries + 8. In larger package files this would reduce the
>>>> amount of data parsing by a long shot. Also, more powerful
>>>> scripting languages like Perl, Python, or smart parsers in C could
>>>> make short work of this data and just extract the MD5 elements for
>>>> comparison.
>>>>
>>>> Also, by doing a little extra work when creating packages by
>>>> organizing all the sections together, I think that the file size
>>>> could be reduced by a large degree.
>>>>
>>>> Similar fields to @comment MD5 could be reduced I believe, but with
>>>> less benefit maybe, other than just the @unexec rmdir, etc lines.
>>>>
>>>> Either that, or the data should be organized into separate files I
>>>> think (increases number of files, but reduces overall processing
>>>> time IMO).
>
>>> In some cases the order of data stored is important and thus it
>>> cannot be separated into sections. Also, this layout allows for
>>> very simple parsing with usual UNIX tools (sed, cut, awk, perl,
>>> simply everything). Unlike XML, which is rather complex and thus
>>> does not belong in base, in my opinion.
>
> We have libbsdxml in the base already (an old version of one in the
> ports).

Ok.

>> I didn't say XML exactly. I said XML-like, with implied end and begin
>> tags, but keeping with the Makefile-like syntax of @MD5 ... @end MD5,
>> or something similar.
>
> The problem is that a change would break existing installations, as
> they cannot cope with such a new format. Feel free to propose
> improvements, but you need to keep in mind that any supported
> FreeBSD release has to be able to install packages with only the
> package tools available in the base system.

The point, though, is that there's a lot of unnecessary bloat, which
makes the text files larger and thus slows down even smarter parsers
written in C, Perl, or Python.

>> My point is that the +CONTENTS file is bloated a lot by useless
>> lines, and it would help speed up package processing if it was
>> clipped or reduced somehow, I would think.
>
> You need to provide numbers. Without them this is pure speculation.
>
> And you have to explain why the current parsing routines cannot be
> sped up for the current format; maybe the implementation is just a
> little bit outdated compared to today's parsing knowledge...
>
> Bye,
> Alexander.
>

Ok. I take your challenge and will have preliminary results in 2-3 days.
Are Excel-formatted spreadsheets OK (thinking graphs)?

Thanks,
-Garrett
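
To make the proposed aggregation concrete, here is a minimal Python sketch that rewrites per-file `@comment MD5:` lines into a single `@MD5 ... @end MD5` block and measures the size difference instead of estimating it. The function names `aggregate_md5` and `size_in_bytes` are hypothetical, and the pairing of each hash with the path line that follows it is an assumption taken from the example in the message; real +CONTENTS files may order these lines differently. The sketch also moves the block to the end of the output, so it ignores the ordering concern raised in the thread.

```python
MD5_PREFIX = "@comment MD5:"


def aggregate_md5(lines):
    """Collect '@comment MD5:' lines into one '@MD5 ... @end MD5' block.

    Assumption: each hash line is paired with the next path line, as in
    the example quoted in the message. All other lines are kept in their
    original order, and the aggregated block is appended at the end.
    """
    pairs = []      # (digest, path) tuples in order of appearance
    others = []     # every line that is not part of an MD5 pair
    pending = None  # digest waiting for its path line
    for line in lines:
        if line.startswith(MD5_PREFIX):
            pending = line[len(MD5_PREFIX):]
        elif pending is not None and not line.startswith("@"):
            pairs.append((pending, line))
            pending = None
        else:
            others.append(line)
    block = ["@MD5"]
    block += [f"{digest} {path}" for digest, path in pairs]
    block.append("@end MD5")
    return others + block


def size_in_bytes(lines):
    """Size of the lines when written out with one newline per line."""
    return sum(len(line.encode("utf-8")) + 1 for line in lines)


if __name__ == "__main__":
    # Sample lines copied from the example in the message.
    sample = [
        "@comment MD5:9e94560ac5e757d3bc5f922dcf3ab4fb",
        "man/man1/log2pcap.1.gz",
        "@comment MD5:9f5fc8df2a1383a175e165ef2e0b10cc",
        "man/man8/vfs_notify_fam.8.gz",
    ]
    merged = aggregate_md5(sample)
    print("\n".join(merged))
    print("bytes saved:", size_in_bytes(sample) - size_in_bytes(merged))
```

In this sketch each aggregated entry drops the 13-byte `@comment MD5:` prefix, while the two marker lines add 14 bytes in total, so the measured saving need not match the estimate given in the message. It says nothing about parsing speed, which would still need the timing numbers requested in the thread.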