From: John Baldwin
To: freebsd-current@freebsd.org
Cc: Sergey Zaharchenko
Date: Wed, 10 Jan 2007 09:10:12 -0500
Subject: Re: nve related LOR triggered by lots of small packets, and a hard hang
Message-Id: <200701100910.13167.jhb@freebsd.org>
In-Reply-To: <20070110120731.GA1515@shark.localdomain>

On Wednesday 10 January 2007 07:07, Sergey Zaharchenko wrote:
> Hello -current,
>
> While chasing that smbfs recursive locking thing, I decided to try
> copying a large amount of small files (/usr/src actually) to an SMB
> share to which I am connected by an NVIDIA nForce MCP2 card. I have come
> across a lock order reversal which seems related to the card. First,
> some files are copied, then I see the following kernel messages, some
> more files are copied, and then the system hangs without responding to
> the keyboard or anything.
>
> : lock order reversal:
> : 1st 0xc3629f00 inp (tcpinp) @ /src/usr.src/sys/netinet/tcp_usrreq.c:801
> : 2nd 0xc0a9feec tcp (tcp) @ /src/usr.src/sys/netinet/tcp_input.c:626
> : KDB: stack backtrace:
> : db_trace_self_wrapper(c0950c60) at db_trace_self_wrapper+0x25
> : kdb_backtrace(0,ffffffff,c0a612a8,c0a612d0,c09f8e84,...) at kdb_backtrace+0x29
> : witness_checkorder(c0a9feec,9,c095ec63,272) at witness_checkorder+0x586
> : _mtx_lock_flags(c0a9feec,0,c095ec63,272,0,...) at _mtx_lock_flags+0x84
> : tcp_input(c32df800,14,c3300800,100a8c0,0,...) at tcp_input+0x432
> : ip_input(c32df800) at ip_input+0x5a6
> : netisr_dispatch(2,c32df800,0,c32c5000,c3300800,...) at netisr_dispatch+0x58
> : ether_demux(c32c5000,c32df800,c32caed8,c32df800,dd1757d4,...) at ether_demux+0x28a
> : ether_input(c32c5000,c32df800,c32caed8,0,c0970133,...) at ether_input+0x202
> : nve_ospacketrx(c32cae00,dd175810,1,0,0,...) at nve_ospacketrx+0xd9
> : UpdateReceiveDescRingData(c08981a4,c08981c4,c0898260,c089828c,c08982a4,...) at UpdateReceiveDescRingData+0x2f8
> : nve_osalloc(c32cb200,dd391010,c32cae00,c0898108,c08981a4,...) at nve_osalloc
> : _end(c33a5c00,c0a9e784,3065766e,0,0,...) at 0xc32aa600
> : _end(c32cb200,dd391010,c32cae00,c0898108,c08981a4,...) at 0xc3327680
> : _end(c33a5c00,c0a9e784,3065766e,0,0,...) at 0xc32aa600
> : _end(c32cb200,dd391010,c32cae00,c0898108,c08981a4,...) at 0xc3327680
>
> The last 2 strings repeat themselves a lot of times (kdb seems to have a
> limit of 1024 stack trace strings, which came in very helpful). No info
> about the actual hang... The LOR looks like #009
> (http://sources.zabbadoz.net/freebsd/lor/009.html), but is different
> actually. Any ideas? BTW, what is _end?

_end may hint to being out in a kernel module, though ddb usually can
handle those fine. I think your stack is busted somehow though as
nve_osalloc() doesn't call UpdateReceiveDescRingData(), and the first
lock is acquired in tcp_usr_send() (userland is sending data on a tcp
socket). Somehow the nve driver has decided to handle receiving a packet
and re-entering the stack leading to the LOR.

Have you tried using nfe(4)? :)

-- 
John Baldwin
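
For readers unfamiliar with the term, the kind of lock order reversal reported above can be illustrated with a small userland sketch. It is not the kernel's witness(4) code: the names lock_class, order_checker, checker_init, checker_acquire and checker_release are hypothetical, only direct pairwise orders are tracked, and the two classes merely stand in for the "tcp" and "inp" locks named in the trace. The idea is to record which class was held when another was acquired, and to flag a later acquisition that happens in the opposite order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lock classes standing in for the "tcp" and "inp" locks. */
enum lock_class { LC_TCP, LC_INP, LC_COUNT };

#define MAX_HELD 8

struct order_checker {
    /* seen_before[a][b]: a was held at some point while b was acquired. */
    bool seen_before[LC_COUNT][LC_COUNT];
    enum lock_class held[MAX_HELD];   /* classes currently held, oldest first */
    size_t nheld;
};

void checker_init(struct order_checker *c)
{
    memset(c, 0, sizeof(*c));
}

/* Returns true if acquiring lc reverses a previously recorded order. */
bool checker_acquire(struct order_checker *c, enum lock_class lc)
{
    bool reversal = false;

    for (size_t i = 0; i < c->nheld; i++) {
        enum lock_class h = c->held[i];

        if (h == lc)
            continue;                     /* ignore same-class recursion */
        if (c->seen_before[lc][h])
            reversal = true;              /* earlier: lc then h; now: h then lc */
        else
            c->seen_before[h][lc] = true; /* remember the order h then lc */
    }
    if (c->nheld < MAX_HELD)
        c->held[c->nheld++] = lc;
    return reversal;
}

/* Drops the most recently acquired instance of lc from the held list. */
void checker_release(struct order_checker *c, enum lock_class lc)
{
    for (size_t i = c->nheld; i > 0; i--) {
        if (c->held[i - 1] == lc) {
            memmove(&c->held[i - 1], &c->held[i],
                (c->nheld - i) * sizeof(c->held[0]));
            c->nheld--;
            return;
        }
    }
}
```

With this sketch, acquiring LC_TCP and then LC_INP records the order; after both are released, acquiring LC_INP and then LC_TCP makes the second checker_acquire() call return true, which mirrors the "1st inp, 2nd tcp" report in the trace: a send path holding the inp lock ends up in the receive path, which takes the tcp lock.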