From: "Tim Chen"
To: freebsd-stable@freebsd.org
Date: Mon, 15 Sep 2008 23:57:02 +0800
Message-ID: <1f51039c0809150857l50b6be8eu848e21189a4175d6@mail.gmail.com>
Subject: Suddenly frozen fcntl/stat call on NFS over TCP with MTU 9000

I am currently running a mail server that uses a NetApp filer as backend
storage. From time to time the whole system gets stuck for 3-5 minutes,
after which everything recovers normally. While it is stuck, ps auxw shows
200-300 mail delivery agent (MDA) processes in "D" state, and df does not
respond either.

System configuration:

1. NFS server: NetApp FAS3020

2. NFS client: acting as an SMTP/POP3/IMAP server
   OS: FreeBSD 7.0-STABLE (almost 7.1-PRERELEASE)
   hardware: IBM x3550 server
   network interface:
     bce1: mem 0xc8000000-0xc9ffffff irq 18 at device 0.0 on pci4
     miibus0: on bce0
     brgphy0: PHY 1 on miibus0
     brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
     bce1: Ethernet address: 00:1a:64:--:--:--
     bce1: [ITHREAD]
     bce1: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W (0x04000305); Flags( MFW MSI )
   ifconfig:
     bce1: flags=8843 metric 0 mtu 9000
           options=1bb
           ether 00:1a:64:--:--:--
           inet 192.168.1.166 netmask 0xffffff00 broadcast 192.168.1.255
           media: Ethernet autoselect (1000baseTX )
           status: active
   software:
     postfix 2.5.4
     courier-imap 4.4.1
     maildrop 2.0.4

After further investigation, I found that the problem is most severe with
NFS over TCP and MTU 9000. If the NFS mount is changed to either UDP with
MTU 9000 or TCP with MTU 1500, things improve significantly: the frequency
of the sudden hangs drops from every 10-15 minutes to once every several
hours.

Another observation is that the freeze happens more often when server load
is high, especially during working hours, so I believe it is closely
related to server load (or NFS load).

I modified the source code of the MDA (maildrop), adding some debug code to
identify the problem.
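
For illustration, here is a minimal sketch of this kind of timing probe. It is written in Python rather than as a patch to maildrop, so it is not the actual debug code; the function names time_lock_and_stat and slow_samples, the default threshold, and the probe path are all assumptions made for this example.

```python
import fcntl
import os
import time


def time_lock_and_stat(path):
    """Time one exclusive lock and one fstat on `path`.

    Returns (lock_seconds, stat_seconds).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        t0 = time.monotonic()
        # lockf() wraps the fcntl() locking calls; without LOCK_NB it
        # blocks until the lock is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        t1 = time.monotonic()
        os.fstat(fd)
        t2 = time.monotonic()
        fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1


def slow_samples(path, rounds=10, threshold=1.0):
    """Probe `path` `rounds` times and return the (lock, stat) timings
    where either call took longer than `threshold` seconds."""
    slow = []
    for _ in range(rounds):
        lock_s, stat_s = time_lock_and_stat(path)
        if lock_s > threshold or stat_s > threshold:
            slow.append((lock_s, stat_s))
    return slow
```

Calling slow_samples() on a scratch file that lives on the NFS mount (the path is up to you; none is assumed here) returns only the rounds in which the lock or the stat exceeded the threshold, which is enough to tell which of the two calls is stalling.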

What I found is:

1) MDA processing time is always approximately 0 seconds, or under 1
   second, when things work normally.
2) MDA processing time can reach 30 seconds when the system is stuck. If
   incoming mail keeps arriving, later messages can take up to 200 seconds
   to complete. At this point ps auxw shows the MDAs in "D" state.
3) A detailed trace shows that the time is spent waiting in the fcntl
   (lock) and stat (fstat) calls.

One more thing to note: I have tried turning rpc.statd and rpc.lockd on and
off, mounting with -L, and even compiling NFSLOCKD into the kernel. All in
vain; things still get stuck when using NFS over TCP with MTU 9000.

We already have many mail servers with different hardware running FreeBSD
6-STABLE. The software is the same, but at earlier versions. Those servers
do not show any of the strange behavior described above.

Based on all of the above experiments and observations, I guess there might
be something wrong with:

1) the NFS or network stack of FreeBSD 7
2) fcntl/stat over NFS
3) the bce driver

I need your help and suggestions to solve this problem. Thanks very much.

Sincerely,
Tim Chen