From: "Dieter BSD"
Date: Tue, 25 Dec 2012 21:51:14 -0500
Message-ID: <20121226025115.91860@gmx.com>
To: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD for serious performance?

> Which device drivers? We can't fix problems we don't know about.

ata(4) completely hung the system for 19 minutes (at which point I
manually intervened, see the PR), probably an infinite loop.
http://www.freebsd.org/cgi/query-pr.cgi?pr=170675

siis(4) and ahci(4) have also caused data loss, presumably by
blocking interrupts for too long.

Improving these drivers would be wonderful. But better yet, can we
please find a way to fix the underlying problem?

When a device driver handles an interrupt, it needs to block further
interrupts while it modifies its data structures. Otherwise another
interrupt coming in might cause it to mangle the data. Right?

But! Why does it need to block interrupts for everything? Why does a
disk driver need to block interrupts from Ethernet? Why does Ethernet
need to block Firewire? Why does Firewire need to block USB? And so on.
Can't the disk driver block just its own interrupts and leave the other
devices alone? That way, when some device driver writer puts in
DELAY(TOO_LONG), at least the other devices will still work.

Alternately, why couldn't the data structures be protected with a
mutex? Then the drivers shouldn't have to block even themselves.

Alternately, why can't drivers have a polling option? Yes, the extra
overhead of polling sucks, but losing incoming data sucks a lot more.
I am not suggesting that polling should be the default, just an option
for those who need it.

Current machines can have multiple disks, multiple Ethernets, multiple
pretty-much-any-device, multiple CPUs, etc. We have an SMP kernel to
juggle those multiple CPUs. But we still have this absurd bottleneck
where the device drivers bring everything to a screeching halt every
time an interrupt happens. And if the driver has a bug, or thinks there
is a problem and decides to keep DELAY()ing over and over, the entire
machine just locks up and stays locked up, often forever.

It isn't just me. I have seen quite a few threads where other people
are having the same problem. This needs to be fixed.

(Fixing this is at *least* a Usenix paper.)
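
To make the per-device mutex idea above concrete, here is a rough userland sketch. It is not FreeBSD kernel code: Python threads stand in for interrupt handlers, an Event stands in for a driver stuck in DELAY(), and the Device class, handle_interrupt() and demo() are names made up for this illustration. It only shows that when each device guards its own state with its own lock, a stalled disk handler does not stop Ethernet from being serviced.

```python
import threading


class Device:
    """Hypothetical device whose state is guarded by its own lock."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # protects only this device's state
        self.handled = 0              # interrupts serviced so far

    def handle_interrupt(self, entered=None, stall=None):
        # Take only this device's lock; no other device is blocked.
        with self.lock:
            if entered is not None:
                entered.set()         # signal that the lock is now held
            if stall is not None:
                stall.wait()          # stand-in for a driver stuck in DELAY()
            self.handled += 1


def demo():
    disk = Device("disk")
    net = Device("ethernet")
    entered = threading.Event()
    release = threading.Event()

    # The disk handler gets stuck while holding the disk lock.
    stuck = threading.Thread(target=disk.handle_interrupt,
                             args=(entered, release))
    stuck.start()
    entered.wait()                    # wait until the disk lock is held

    # Ethernet is still serviced, because nothing global is held.
    for _ in range(3):
        net.handle_interrupt()
    result = (net.handled, disk.lock.locked())

    release.set()                     # let the disk handler finish
    stuck.join()
    return result + (disk.handled,)
```

Calling demo() returns (3, True, 1): three Ethernet interrupts were handled while the disk lock was still held, and the disk handler completed once it was released. With a single global lock in place of the per-device locks, the Ethernet loop would instead wait for the disk handler.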