From owner-freebsd-bugs Fri Mar 22 9: 0:35 2002 Delivered-To: freebsd-bugs@hub.freebsd.org Received: from freefall.freebsd.org (freefall.FreeBSD.org [216.136.204.21]) by hub.freebsd.org (Postfix) with ESMTP id 1876737B423 for ; Fri, 22 Mar 2002 09:00:08 -0800 (PST) Received: (from gnats@localhost) by freefall.freebsd.org (8.11.6/8.11.6) id g2MH07J43217; Fri, 22 Mar 2002 09:00:07 -0800 (PST) (envelope-from gnats) Received: from freefall.freebsd.org (freefall.FreeBSD.org [216.136.204.21]) by hub.freebsd.org (Postfix) with ESMTP id 0227437B41C for ; Fri, 22 Mar 2002 08:59:42 -0800 (PST) Received: (from nobody@localhost) by freefall.freebsd.org (8.11.6/8.11.6) id g2MGxfu43089; Fri, 22 Mar 2002 08:59:41 -0800 (PST) (envelope-from nobody) Message-Id: <200203221659.g2MGxfu43089@freefall.freebsd.org> Date: Fri, 22 Mar 2002 08:59:41 -0800 (PST) From: Todd Hayton To: freebsd-gnats-submit@FreeBSD.org X-Send-Pr-Version: www-1.0 Subject: kern/36209: read() system call never returns in some cases Sender: owner-freebsd-bugs@FreeBSD.ORG Precedence: bulk List-ID: List-Archive: (Web Archive) List-Help: (List Instructions) List-Subscribe: List-Unsubscribe: X-Loop: FreeBSD.org >Number: 36209 >Category: kern >Synopsis: read() system call never returns in some cases >Confidential: no >Severity: serious >Priority: low >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Fri Mar 22 09:00:06 PST 2002 >Closed-Date: >Last-Modified: >Originator: Todd Hayton >Release: FreeBSD 2.2.2 >Organization: Ericsson IPI >Environment: >Description: Essentially the problem is this - about 1 in every 100 copies using scp (scp somefile user@remosthost:somefile), the entire machine will hang. Pings will work for a short time, but you can't telnet into the machine. Wait long enough and the pings will eventually fail as well. I started doing ktrace's of each scp, and the last system call to get recorded is always read() and it's always a read on a file descriptor created by pipe(): ktrace 1 -------- cognac> kdump -f scp_ktrace.out | tail -20 3725 scp CALL old.sigaction(0xd,0x7fbfdd00,0x7fbfdcf4) 3725 scp RET old.sigaction 0 3725 scp CALL pipe 3725 scp RET pipe 9 3725 scp CALL pipe 3725 scp RET pipe 11/0xb 3725 scp CALL pipe 3725 scp RET pipe 13/0xd 3725 scp CALL close(0x9) 3725 scp RET close 0 3725 scp CALL close(0xa) 3725 scp RET close 0 3725 scp CALL fork 3725 scp RET fork 3726/0xe8e 3725 scp CALL close(0xb) 3725 scp RET close 0 3725 scp CALL close(0xe) 3725 scp RET close 0 3725 scp CALL read(0xd,0x7fbfdcbf,0x1) 3725 scp PSIG SIGINT SIG_DFL ktrace 2 -------- cognac> kdump -f scp_ktrace2.out | tail -20 2456 scp RET chdir 0 2456 scp CALL old.sigaction(0xd,0x7fbfdd30,0x7fbfdd24) 2456 scp RET old.sigaction 0 2456 scp CALL pipe 2456 scp RET pipe 9 2456 scp CALL pipe 2456 scp RET pipe 11/0xb 2456 scp CALL pipe 2456 scp RET pipe 13/0xd 2456 scp CALL close(0x9) 2456 scp RET close 0 2456 scp CALL close(0xa) 2456 scp RET close 0 2456 scp CALL fork 2456 scp RET fork 2457/0x999 2456 scp CALL close(0xb) 2456 scp RET close 0 2456 scp CALL close(0xe) 2456 scp RET close 0 2456 scp CALL read(0xd,0x7fbfdcef,0x1) ktrace 3 -------- cognac> kdump -f scp_ktrace3.out | tail -20 727 scp RET chdir 0 727 scp CALL old.sigaction(0xd,0x7fbfdd30,0x7fbfdd24) 727 scp RET old.sigaction 0 727 scp CALL pipe 727 scp RET pipe 9 727 scp CALL pipe 727 scp RET pipe 11/0xb 727 scp CALL pipe 727 scp RET pipe 13/0xd 727 scp CALL close(0x9) 727 scp RET close 0 727 scp CALL close(0xa) 727 scp RET close 0 727 scp CALL fork 727 scp RET fork 728/0x2d8 727 scp CALL close(0xb) 727 scp RET close 0 727 scp CALL close(0xe) 727 scp RET close 0 727 scp CALL read(0xd,0x7fbfdcef,0x1) We're using F-secure's version of SSH. I looked through the code and it looks as if SSH actually gets some random noise from the output of commands: it forks() off some children to run some commands, and then uses the pipes to read the results of the commands from the child back into the parent. It's these reads that seem to never return (and eventually hang the system). I glanced around on the web and have seen references to pipe_read() and pipe_write() containing race conditions: http://www.geocrawler.com/mail/msg.php3?msg_id=2172760&list=159 So, I'm wondering if this is a manifestation of that...the fact that it takes so many repeated attempts for the lockup to occur would seem to be characteristic of a race condition... >How-To-Repeat: >Fix: >Release-Note: >Audit-Trail: >Unformatted: To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-bugs" in the body of the message