From: Warner Losh
Date: Thu, 23 Aug 2012 17:26:19 -0600
To: Ian Lepore
Cc: freebsd-arm@freebsd.org, freebsd-mips@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Partial cacheline flush problems on ARM and MIPS
In-Reply-To: <1345763393.27688.578.camel@revolution.hippie.lan>
References: <1345757300.27688.535.camel@revolution.hippie.lan> <3A08EB08-2BBF-4B0F-97F2-A3264754C4B7@bsdimp.com> <1345763393.27688.578.camel@revolution.hippie.lan>

On Aug 23, 2012, at 5:09 PM, Ian Lepore wrote:

> On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote:
>> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
>>> A recent innocuous change to the USB driver code caused intermittent
>>> errors in the umass(4) driver on ARM and MIPS platforms, and this
>>
>> I think the proper solution is to segregate DMA and non-DMA parts of
>> structures so that you don't have both sharing a cache line.
>>
>> I also wonder why we don't allocate the DMA memory for these
>> structures separately from the non-DMA parts.  This would eliminate
>> the USB_CACHE_BYTES kludge (which is CPU dependent, not arch
>> dependent) and move the knowledge of this junk into busdma layer
>> where it belongs.  From my understanding of the issue, this would
>> completely eliminate the problem forever!
>>
>> Sharing a cacheline between something that is DMA aware and something
>> that isn't is just begging for trouble.  We're doing more work than we
>> need to to support this dubious feature and we'd be miles ahead if we
>> could not share at all.
>>
>> Warner
>>
>
> It seems to me that what we have here is a new type of constraint on DMA
> operations, and we need a way to communicate that constraint from the
> part of the platform support code that knows about it to the drivers and
> driver support code that needs to know.
> The way we communicate DMA constraints is with a busdma tag, but right
> now that tag only communicates constraints that were needed for ISA and
> PCI busses, namely buffer alignment, boundary-crossing restrictions, and
> exclusion regions.
>
> Now we have a new type of constraint; I think of it as "granularity".
> In effect, we have a DMA system that can only do DMA in cacheline-sized
> chunks.  Even when the IO size -- and thus the number of "bits on the
> wire" -- is less than the cacheline size, at the end of the DMA
> operation (which includes the software-assisted coherency operations)
> the number of bytes in memory that may be modified is the size of a
> cacheline.  This is because "the DMA system" is not just the engine that
> moves bytes around, it's the combination of hardware and software that
> work together to maintain cache coherency.

But this isn't new.  It is an alignment requirement, which carries with
it an implicit size requirement.  If you enforce the alignment, and
force all 'sub buffers' to have this alignment, you don't need the new
thing.

> Ideally we'd find a way to communicate this new constraint using the
> existing mechanism, the busdma tag, and ideally we'd not have to change
> every existing call to bus_dma_tag_create() to add a new parm.  As I
> understand it, parent tags are now passed down through the newbus
> hierarchy consistently, such that a tag at the nexus level could
> describe a platform requirement such as granularity, and all devices and
> the helper code they use will have access to that constraint via
> inheritance from ancestors' tags.  Maybe we could have a new flavor of
> bus_dma_tag_create() that takes a struct of parms, and existing calls
> wouldn't have to be changed.

Wouldn't a simpler solution be to just make this alignment requirement
be part of the global parent tag on these platforms and to make sure all
drivers on those platforms use it and don't cop out and pass NULL?

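To make the "alignment carries an implicit size requirement" point concrete, here is a minimal userland sketch in C. It is not busdma code: the function name `example_affected_bytes` and the 32-byte line size used in the usage note are assumptions made up for illustration. It computes how many bytes of memory cache maintenance may touch for a given transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes of memory that cache maintenance
 * may touch for a transfer of len bytes starting at addr, given a cache
 * line of `line` bytes.  `line` is supplied by the caller (an assumption
 * of this sketch) and must be a power of two.
 */
static size_t
example_affected_bytes(uintptr_t addr, size_t len, size_t line)
{
	uintptr_t mask, start, end;

	assert(line != 0 && (line & (line - 1)) == 0);
	if (len == 0)
		return (0);
	mask = (uintptr_t)line - 1;
	start = addr & ~mask;			/* round start down to a line */
	end = (addr + len + mask) & ~mask;	/* round end up to a line */
	return ((size_t)(end - start));
}
```

With a 32-byte line, a 13-byte transfer at an aligned address still covers a whole 32-byte line, and an 8-byte transfer that straddles a line boundary covers two lines. Aligning the start of a buffer is therefore only enough if the buffer is also padded out to whole lines.
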
> Communicating the constraint is only part of the problem; it also has to
> be easy for drivers to work with that constraint, especially drivers
> that are not targeted specifically at platforms with granular DMA.  I
> think we can achieve a huge chunk of that purely within the arm/mips
> implementation of bus_dmamem_alloc(), but even so there would be a new
> conceptual limitation on using that routine: it is specifically for
> allocating DMA buffers, and that means that there will be a new rule
> that the CPU cannot access any memory within that buffer while an IO
> operation is in progress.

I don't think we should pander to drivers that don't know how to do DMA
properly.  We get it almost right in bus_dma now.  However, going from
almost right to completely right is hard and we keep uncovering edge
cases that bite us.  Wouldn't it be better to eliminate all these weird
edge cases?

> I'd also like to say there's a new rule that you cannot subdivide a
> buffer obtained from bus_dmamem_alloc() into multiple buffers, or into a
> combination of DMA and CPU-accessed data.  That would be bad news for
> the USB subsystem, and perhaps other drivers.  If this idea is either
> impossible or particularly contentious, then I guess we'd need some new
> helper routines so that a driver can subdivide the memory in a way that
> doesn't violate any constraints implied by the tag used to allocate the
> buffer.

When the USB subsystem went into the tree, this was one of the
criticisms that was ignored.  It has come back to bite us time and time
again.  Perhaps it is time to fix it once and for all.

> Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and I
> doubt we can reasonably ever require that it be so.

True, but the I/O that's not in memory from bus_dmamem_alloc() is page
aligned.

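The kind of subdivision helper mentioned above could be sketched roughly as follows. This is plain userland C, not an existing busdma or USB routine; `example_layout_sub_buffers` and its parameters are hypothetical names. It lays sub-buffers out inside one allocation so that each starts on a line boundary and owns whole lines, which means no two of them can share a cache line.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of line (line must be a power of two). */
static size_t
example_round_to_line(size_t n, size_t line)
{
	assert(line != 0 && (line & (line - 1)) == 0);
	return ((n + line - 1) & ~(line - 1));
}

/*
 * Hypothetical helper: compute the offset of each of `count` sub-buffers
 * inside a single allocation whose base is assumed to be line aligned.
 * Every sub-buffer is padded to a whole number of lines.  Returns the
 * total number of bytes the allocation must provide.
 */
static size_t
example_layout_sub_buffers(const size_t *sizes, size_t *offsets,
    size_t count, size_t line)
{
	size_t i, off;

	off = 0;
	for (i = 0; i < count; i++) {
		offsets[i] = off;
		off += example_round_to_line(sizes[i], line);
	}
	return (off);
}
```

For sub-buffers of 13, 32 and 33 bytes with a 32-byte line (an illustrative value only), the offsets come out as 0, 32 and 64, and the allocation needs 128 bytes. The cost is padding, which is the price of never sharing a line between two buffers.
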
> I think the only
> hope we have of handling that problem is to bounce the requests that
> don't meet the granularity constraint, just as we'd have to do if the
> caller-supplied buffer fell into an exclusion zone or violated an
> alignment or boundary constraint.  When I've tossed this idea out in the
> past there was instant resistance.  Yeah, bounce buffers are massively
> inefficient, but my experience has been that most of the IO that isn't
> aligned and sized to a multiple of a cacheline is small IO (a few to a
> few dozen bytes).  I've never seen a case of page-sized or larger IO
> requests that required partial-cacheline handling.  I'm sure some
> examples exist, but they're probably more the exception than the rule.
> (And the bad performance you'd get from bouncing and copying massive
> bulk data flow would be a powerful incentive to track down the culprit
> and improve the driver.)

That's also the underlying idea in the bus_dma stuff.  You give the
constraints, you get the buffers, and if you have a buffer that's outside
the constraints it gets bounced.  That's why the sync operations are on
DMA items, not on cache line items.  While cache lines are one issue,
memory placement can be another.  Floppy drives, for example, couldn't
DMA past the first 16MB, and if you have a buffer passed in that's
outside of that, it bounces.  If this bouncing produces slower code,
then the drivers should be updated to have better alignment.

The USB subsystem is making assumptions about the underlying cache
mechanisms that aren't really true.  Ideally, we could get it to stop
doing that.

Warner
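
The "give the constraints, bounce what falls outside them" idea can be illustrated with a small userland sketch. This is not the busdma implementation: `struct example_constraints`, its fields, and `example_needs_bounce` are hypothetical names, and the check ignores address overflow and boundary-crossing rules for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the constraints a tag carries. */
struct example_constraints {
	uint64_t max_addr;	/* highest bus address the device can reach */
	size_t	 alignment;	/* required start alignment, power of two */
	size_t	 size_unit;	/* length must be a multiple, power of two */
};

/*
 * Returns true when a caller-supplied buffer violates a constraint and
 * would therefore have to be copied through a bounce buffer.
 */
static bool
example_needs_bounce(const struct example_constraints *c, uint64_t addr,
    size_t len)
{
	assert(c->alignment != 0 && (c->alignment & (c->alignment - 1)) == 0);
	assert(c->size_unit != 0 && (c->size_unit & (c->size_unit - 1)) == 0);

	if (len == 0)
		return (false);
	if (addr + (len - 1) > c->max_addr)	/* placement, e.g. 16MB limit */
		return (true);
	if ((addr & (c->alignment - 1)) != 0)	/* start not aligned */
		return (true);
	if ((len & (c->size_unit - 1)) != 0)	/* partial trailing unit */
		return (true);
	return (false);
}
```

Setting `max_addr` to 0xFFFFFF models the 16MB floppy limit from the text, and setting `alignment` and `size_unit` to a cache line size (32 here, purely as an example) treats a partial-cacheline request the same way as a misplaced one: it gets bounced rather than handled as a special case.
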