From: Rich Kulawiec <rsk@gsp.org>
To: chat@freebsd.org
Date: Sun, 9 Dec 2012 12:58:50 -0500
Subject: Re: Google spyware on FreeBSD Web site?

I often disagree with Brett, sometimes sharply, but on this issue I
strongly concur. Some points to augment/add, briefly:

- Requiring "opt-out" is always an explicit admission that what's being
  done is being inflicted on users without their prior, informed, express
  consent. It's disrespectful and abusive. The FreeBSD project should be
  better than that and should never require opt-out of anything, ever.

- As has been pointed out, open-source analytic software that runs on
  FreeBSD is not only available but vastly preferable.

- But it won't work either, because most of the input data is crud.
  (This isn't FreeBSD's fault: most of the data accumulated in most of
  the web server logs globally is crud.) [1] Conclusions drawn from
  processing mostly-crud data won't have much validity, if any.

- This also presumes that the right questions are being asked. It's not
  clear, at this point, whether they are. If, for example, the question
  is "will page Z on the FreeBSD web site work with browser X on
  operating system Y?", then resources such as BrowserShots
  (http://www.browsershots.org) will provide credible answers. If the
  question is "how long does a user spend on page Z?", then no tool will
  provide a credible answer, and the question itself is pointless. So if
  the goal is to improve the web site (and that is a good goal), I
  suggest an open public debate over which questions should be asked
  before moving on to the question of which software tools might be able
  to answer them. (And yes, since I'm arguing that it should happen,
  I'll contribute to the effort.)

- If you want active feedback from users, then maintaining a proper role
  address (webmaster@) is the best way to get it. I see that's already
  in place, and that's excellent.

- Standards compliance and cross-browser/cross-platform testing are, of
  course, great ways to ensure that the site is as usable as possible by
  as many people as possible. Based on what I see via tools like the W3C
  validator, as well as trying the site in multiple browsers on multiple
  operating systems, it appears that considerable work has already been
  done on this: the site is viewable and navigable without issue on any
  of them. Once again, that's excellent. A sketch of how that kind of
  check might be scripted follows.
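As a concrete illustration (not something the project runs today), here
is a minimal sketch of scripting a standards check against the W3C
validator's machine-readable interface. The endpoint and JSON response
format are assumptions based on the validator's Nu checker, and the
page list is purely illustrative:

    #!/usr/bin/env python3
    # Hypothetical sketch: batch-check a few pages against the W3C
    # validator's JSON interface (assumes the validator.w3.org/nu
    # endpoint and its out=json response format).
    import json
    import urllib.parse
    import urllib.request

    PAGES = [
        "https://www.freebsd.org/",           # illustrative, not a test plan
        "https://www.freebsd.org/releases/",
    ]

    for page in PAGES:
        query = urllib.parse.urlencode({"doc": page, "out": "json"})
        req = urllib.request.Request(
            "https://validator.w3.org/nu/?" + query,
            # the W3C service expects a descriptive User-Agent
            headers={"User-Agent": "freebsd-www-check-sketch/0.1"},
        )
        with urllib.request.urlopen(req) as resp:
            report = json.load(resp)
        errors = [m for m in report.get("messages", [])
                  if m.get("type") == "error"]
        print("%s: %d error(s)" % (page, len(errors)))
        for m in errors[:5]:                  # first few, for brevity
            print("  line %s: %s" % (m.get("lastLine", "?"), m.get("message")))

Something like this could run periodically and mail webmaster@ when a
page regresses, keeping the checking on project-run infrastructure
rather than on a third-party analytics service.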
---rsk

[1] The majority of data found in a typical public web server's logs is
crud because it doesn't originate from human action: it originates from
software. In the case of many common web crawlers, this activity is
relatively easy to isolate (a minimal sketch of that easy half follows
below). But that leaves all the crud originating from malicious or
surreptitious software agents, such as those running on a few hundred
million compromised/botted/zombied systems. That data is (mostly)
functionally indistinguishable from data originating from humans, and
for many sites it dwarfs the latter. So while it can certainly be fed
to analytic software (along with all of the actual human-originated
data), what emerges is quite often useless. Techniques *do* exist to
isolate and filter this spurious data out, but they're unreliable,
tedious, manual, and they don't scale well. "GIGO" is an old acronym,
not used much any more, but it certainly applies here.
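To make the "easy half" concrete, here is a minimal sketch that drops
hits from self-identified crawlers in an Apache/nginx combined-format
log. The filename and bot-token list are illustrative assumptions, and
note what it does *not* do: nothing about the hard half, botted machines
that send browser-like User-Agent strings.

    #!/usr/bin/env python3
    # Sketch: filter self-identified crawlers out of a combined-format
    # access log.  Reads the log named on the command line (or a
    # hypothetical "access_log"), writes "human-looking" lines to stdout.
    import re
    import sys

    # Combined format ends: ... "referer" "user-agent"; grab the last
    # quoted field as the user agent.
    AGENT_RE = re.compile(r'"([^"]*)"\s*$')

    # Illustrative token list; real crawler inventories are much longer.
    BOT_TOKENS = ("bot", "crawler", "spider", "slurp", "curl", "wget")

    kept = dropped = 0
    with open(sys.argv[1] if len(sys.argv) > 1 else "access_log") as log:
        for line in log:
            m = AGENT_RE.search(line)
            agent = (m.group(1) if m else "").lower()
            if any(tok in agent for tok in BOT_TOKENS):
                dropped += 1      # self-identified crawler: the easy case
            else:
                kept += 1         # may still be a zombie in disguise
                sys.stdout.write(line)
    sys.stderr.write("kept %d, dropped %d self-identified crawlers\n"
                     % (kept, dropped))

The point of the sketch is exactly its limitation: everything it keeps
still has to be treated as suspect, which is why the remaining filtering
is the unreliable, tedious, manual part.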