Date: Tue, 8 Dec 2009 14:17:11 -0600
From: Mark Linimon <linimon@lonesome.com>
To: oren.almog@gmail.com
Cc: freebsd-ports@freebsd.org
Subject: Re: Pointyhat packages
Message-ID: <20091208201711.GE3057@lonesome.com>
In-Reply-To: <4B1E4351.2030004@gmail.com>
References: <4B1E4351.2030004@gmail.com>

On Tue, Dec 08, 2009 at 09:15:13AM -0300, oren.almog@gmail.com wrote:
> For the last couple of days I have been following the pointyhat build
> statistics provided at
> http://pointyhat.freebsd.org/errorlogs/packagestats.html

Brave man :-)  That's one I set up.

> As seen on that page, the building process started on Dec 3rd but had
> not been completed yet.

Apparently the build for www/p5-Gtk2-WebKit is now hanging on all
buildenvs.  Pav already marked it broken on amd64.

It will continue to run until a reaper process kills it off (or one
of us portmgrs does it manually).  I'd like to see the error log so
I'm going to let it run for now.  The reaper timeout is, IIRC, 24 hours.

> Why is there such a large difference between the build times on amd64
> and i386? Are the i386 machines really that underpowered?

Two data points: one, it looks like Pav having marked www/p5-Gtk2-WebKit
as broken had already been taken into account for the amd64 build, so it
didn't have that problem.

And two, some of our i386 machines are indeed underpowered.  We've added
several new, more modern, ones this year that were donated to us: these
are dual 2.4 or 2.8GHz machines, mostly with 2G of RAM.  (One of my
background tasks is to try to characterize performance on the nodes with
various setups; my intuition is that 4G would allow us to raise
throughput, but I need to make a 'use case' for that before I go ask for
funding.)

fwiw, I continually look for new ways to scrounge more package building
nodes (I seem to have inherited the task of looking after them).

> Next I found this page which keeps track of the upload status of
> packages to the various ftp sites
> http://portsmon.freebsd.org/portsuploadstatus.py

That's mine too :-)

> If the statistics on that page are correct then it seems to me that
> there is a lot of inefficiency in the build and upload process.

With 11 active buildenvs, we have saturated the amount of data that the
sites can upload.
We've discussed the matter before but no one has come up with a
solution.  We try not to upload different package sets at the same
time, as a workaround.

> Some sites are rarely updated

Not all of the sites carry all of the buildenvs, and some of those that
do can run days behind.

Also, I don't have up-to-date contact information for the various sites.
If anyone has that, please let me know.

> and some pointyhat build runs are never uploaded.

Hmm, they should be.  I'll forward this on to pav.

(The way we have the work divided up is that pav does amd64; erwin does
i386; I do sparc64 and the nascent ia64; and various portmgrs, including
miwi, do the *-exp runs which are intentionally not uploaded, but do
constitute a load on both pointyhat and the nodes.)

> I simply want to make sense of all this and understand how it all fits
> together

I've been trying to understand it for several years, so don't worry :-)
And I'm one of the people "in charge".

Longer explanation: pointyhat throughput depends on a lot of factors,
some of which I am in the early stages of understanding.

 - if a node hangs (but only in certain ways), the dispatch scheduler
   can get into a state where it still tries to schedule builds on that
   node, over and over.  This causes an overall slowdown in build
   dispatch.

   I'm not exactly sure of the root cause of the hangs, but one of them
   is likely to be swap exhaustion which leads to sshd being killed.
   (The most recent -current fixes this.)  Since the failures are
   statistical, they are hard to catch.  I have added some error logging
   code to try to figure this out.

   As for the scheduler, there is some missing functionality there.  The
   code is complex so it's not trivial to fix.

 - pointyhat itself is a very heavily loaded machine.  The most recent
   problems we have been chasing are a) disk space exhaustion, and
   b) disk controller saturation.

   For the former, we keep finding things to evict.  OTOH, with 16
   buildenvs (counting the *-exp ones) there is only so much we can do.
   When space is low, the rate of builds slows down significantly, for
   reasons I do not understand yet.

   For the latter, there are two processes that busy the controller:
   1) compression of saved logfiles, and 2) the ZFS backup process.
   I think I may have an idea of how to fix 1); I will have to learn
   more about the way ZFS is set up on pointyhat to fix 2).

 - pointyhat can get into situations where nfs timeouts from nfs mounted
   filesystems (such as /home) crash the system.  I don't know much
   about this.  Once that happens, we have to restart all the builds.
   Sometimes it can take a little while for one of us to notice the
   crash.

 - the scheduler has a bug where it occasionally crashes.  I am actively
   investigating this and have added a bunch of debug code to catch it
   in the act.  Again, when this happens, all the builds have to be
   restarted.  This was happening a lot in the first few days of
   December, but seems to have settled down now.

The code that runs pointyhat is hundreds of lines of sh, awk, perl, and
python, and quite complex.  Although these days I understand most of it
from a static sense, I'm still learning about its dynamic
characteristics.

But now you know the contents of (part of) my todo list.

mcl