From owner-freebsd-current@FreeBSD.ORG Wed Dec 24 20:38:24 2003 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 11FFA16A4CE; Wed, 24 Dec 2003 20:38:24 -0800 (PST) Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by mx1.FreeBSD.org (Postfix) with ESMTP id A202F43D39; Wed, 24 Dec 2003 20:38:20 -0800 (PST) (envelope-from bde@zeta.org.au) Received: from gamplex.bde.org (katana.zip.com.au [61.8.7.246]) by mailman.zeta.org.au (8.9.3p2/8.8.7) with ESMTP id PAA15030; Thu, 25 Dec 2003 15:38:17 +1100 Date: Thu, 25 Dec 2003 15:38:16 +1100 (EST) From: Bruce Evans X-X-Sender: bde@gamplex.bde.org To: current@freebsd.org Message-ID: <20031225145427.J16564@gamplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Subject: deadlock in ata_queue_request() X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Dec 2003 04:38:24 -0000 ata_queue_request() sleeps in an interrupt handler here: % void % ata_queue_request(struct ata_request *request) % { % /* mark request as virgin (it might be a reused one) */ % request->result = request->status = request->error = 0; % request->flags &= ~ATA_R_DONE; % % /* put request on the locked queue at the specified location */ % mtx_lock(&request->device->channel->queue_mtx); % if (request->flags & ATA_R_AT_HEAD) % TAILQ_INSERT_HEAD(&request->device->channel->ata_queue, request, chain); % else % TAILQ_INSERT_TAIL(&request->device->channel->ata_queue, request, chain); % mtx_unlock(&request->device->channel->queue_mtx); % % /* should we skip start ? */ % if (!(request->flags & ATA_R_SKIPSTART)) % ata_start(request->device->channel); % % /* if this was a requeue op callback/sleep already setup */ % if (request->flags & ATA_R_REQUEUE) % return; % % /* if this is not a callback and we havn't seen DONE yet -> sleep */ % if (!request->callback) { % while (!(request->flags & ATA_R_DONE)) % tsleep(request, PRIBIO, "atareq", hz/10); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ % } % } when it is called from an interrupt handler. It is called from an interrupt handler as part of timeout processing: ... msleep(...) ata_queue_request(...) ata_via_family_setmode(...) ata_identify_devices(...) ata_reinit(...) ata_timeout(...) softclock(...) ithread_loop(...) ... The timeout was called here shortly after ad2 hung: Dec 25 12:28:27 besplex kernel: ad2: TIMEOUT - READ_DMA retrying (2 retries left) Dec 25 12:28:27 besplex kernel: ata1: resetting devices .. Dec 25 12:28:27 besplex kernel: ad2: FAILURE - already active DMA on this device Dec 25 12:28:27 besplex kernel: ad2: setting up DMA failed ATA_R_DONE was never set and wakeup_request() was never called either, so softclock() was deadlocked and tsleep() never returned. The system ran surprisingly well with softclock() deadlocked. ad0 worked and everything that didn't use timeouts worked. Examples of things that didn't work because they use timeouts: - syscons screen updates. - statistics in top and systat. - sleep 1 in shells. - mbmon (shows the status, then never repeats). I tried the following to recover: - call wakeup(request) using ddb. This worked, but ATA_R_DONE was never set so ata_queue_request() just looped. - also ignore the ATA_R_DONE check using ddb. This un-deadlocked softclock(), but ata1 remained wedged. - then call "atacontrol reinit 1". This partly worked: Dec 25 14:31:12 besplex kernel: ata1: resetting devices .. Dec 25 14:31:44 besplex kernel: ad2: WARNING - removed from configuration Dec 25 14:31:44 besplex kernel: done but the ata driver caused a null pointer panic an instant later. - ad2 didn't come back after a hard reset. - ad2 came back after a power cycle. This was on an undermydesktop. Problems resuming on laptops may be similar. The hardware may really be wedged. Then the software shouldn't make things worse by sleeping or spinning in the timeout handler. Bruce