From owner-freebsd-current@FreeBSD.ORG  Mon Jun  7 04:48:38 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A677E16A4CE
	for <freebsd-current@FreeBSD.org>;
	Mon,  7 Jun 2004 04:48:38 +0000 (GMT)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id E98E143D4C
	for <freebsd-current@FreeBSD.org>;
	Mon,  7 Jun 2004 04:48:37 +0000 (GMT)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.11/8.12.11) with ESMTP id i574mTpP009303;
	Sun, 6 Jun 2004 21:48:33 -0700 (PDT)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200406070448.i574mTpP009303@gw.catspoiler.org>
Date: Sun, 6 Jun 2004 21:48:29 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: avatar@mmlab.cse.yzu.edu.tw
In-Reply-To: <040607103216B.85106@www.mmlab.cse.yzu.edu.tw>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: freebsd-current@FreeBSD.org
cc: dl@leo.org
Subject: Re: LOR No 9 and strange other kernel messages
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 07 Jun 2004 04:48:38 -0000

On  7 Jun, Tai-hwa Liang wrote:
> On Sun, 6 Jun 2004, Don Lewis wrote:
> [...]
>> >> Maybe your sound client software (or something else on the system) is
>> >> querying for the number of vchans.  Even "sysctl -a" will invoke the
>> >> handler.
>> >
>> > I'm using the stock mpg123 0.59r to play MP3 files. Since I've never read
>> > the source before, I'm not sure whether it utilised vchan related sysctls
>> > or not.
>>
>> The code that calls pcm_inprog() and prints the "x: 2" debug message
>> appears to be an attempt at implementing a reader/writer lock.  I'm
>> pretty sure the failure that triggers the debug message is harmless,
>> other than causing the sysctl() call to return EWOULDBLOCK.
> 
> I'm glad to know that the message is harmless. However, the "x: %d" is
> a little too obscured to endusers IMHO. Shouldn't that be protected by
> #ifdef DIAGNOSTIC or something like PCM_DEBUG?

This diagnotic message should probably just go away.  Also the locking
should be fixed to avoid the EWOULDBLOCK error.

>> The most likely trigger for this event is if the client software is
>> writing data to sound device and is blocked in the write() syscall when
>> another thread calls sysctl().  You could find out which process is
>> calling sysctl() by printing curproc->p_pid in place of or in addition
>> to x.
> 
> Ah, that explained everything. I've added the diagnostic code in
> sys/dev/sound/pcm/sound.c, and can 100% trigger this while mpg123 is
> playing and having a "sysctl -a" running in another xterm at the same time.
> 
> x: 2, current pid: 39234 (sysctl)
> x: 2, current pid: 39234 (sysctl)
> 
> It also looks like gcc will use sysctl() to retrieve physical memory size
> before compilation. I guess the "x: 2" I ran into was just a result of
> running gcc too frequently(make buildworld buildkernel).

Gcc only looks up hw.physmem and hw.usermem, so it shouldn't trigger
this message.

>>
> [...]
>> >> Do your sound hardware and em0 share the same irq?  If they do, that
>> >
>> > Nope.
>> >
>> > interrupt                          total       rate
>> > irq0: clk                         356127         99
>> > irq1: atkbd0                        6006          1
>> > irq4: cbb0 em0++                 1378364        386
>> > irq6: cbb1 pcm0                   110211         30
>> > irq7: ppc0                             3          0
>> > irq8: rtc                         455852        127
>> > irq12: psm0                        16785          4
>> > irq13: npx0                            1          0
>> > irq14: ata0                       322853         90
>> > irq15: ata1                           55          0
>> > Total                            2646257        742
>>
>> In that case, the em0 problem is likely to be something else.  Does em0
>> lock up even if you don't use the sound hardware?
> 
> Yes. It always locked up if the network load is high(w/ ACPI enabled).
> 
>>
>> >> might be a clue that the problems are related.  Can you boot with ACPI
>> >> disabled, and if so, does it change the symptoms?
>> >
>> > Whoops! My first trial on booting without ACPI -- kernel panic in
>> > usb_get_next_event(). Apparently I have to disconnect the USB mouse
>> > prior to kernel booting. See final section for the backtrace.
>>
>> Sounds like bug #3.
> 
> Any idea about how bug #3 can happen? I've tried to comment out the
> #ifdef DIAGNOSTIC code around sys/dev/usb/usb.c:752 to see if the extra
> checking for NULL ue helps. Unfortunately that didn't work for me: the
> kernel always panic at usb_get_next_event+0x5e after starting usbd.

I'd recommend posting a separate message about this panic.  That way you
are more likely to catch the attention of a USB expert.

>> > And, yes, the em0 didn't hang after booting to the non-ACPI environment.
>> > However, there're still something weird happening at that moment:
>> >
>> > 	1. There're still a couple of "x: 2" dumped on console. Test case:
>> > 	"make buildworld buildkernel -DNOCLEAN" in one xterm, mpg123 in
>> > 	another xterm, and providing file downloading to another Windoze
>> > 	box.
>> >
>> > 	2. While the remote box was downloading(Intel eepro 100/VE) files,
>> > 	the number of interrupt on em0 was dramatically reduced to 300 ~ 500+
>> > 	per second. It was 3000+ when ACPI was enabled(no device polling in
>> > 	kernel).
>> >
>> > 	3. The system average load is surprising low(I'm using SCHED_4BSD),
>> > 	about 0.06 ~ 0.30; however, the download was awfully slooowwwww.
>> > 	It took my brother's Windoze XP about 25 minutes to complete the
>> > 	whole downloading(3 ISO files, about 2GB in total). I'm pretty sure
>> > 	that the mediaopt on em0 was 100baseTX + full-duplex.
>>
>> I wonder if the em0 interrupt is not getting enabled or is getting
>> misrouted and the em0 interrupt service routing is getting triggered by
>> another interrupt source that is happening at a lower rate.  Print out
>> the irq mapping with ACPI disabled to see what em0 is sharing its irq
>> with.  This is probably something that jhb and njl will need to look at.
> 
> Oops, it looks like I accidentally enabled the DEVICE_POLLING in kernel
> configuration file and forgot to turn kern.polling.enable on.
> 
> After disabling DEVICE_POLLING and booting w/o ACPI, I can get rid of
> the aforementioned em0 sluggish problem. The interrupt rate on em0 is
> now 5700 ~ 7200 per second while downloading a couple of files from this
> laptop, and the average speed is 10+ MBytes.
> 
> Hmm... it seems that I have to disable USB mouse and ACPI to get em0
> back to work at this moment.
> 
> I'm not quite sure about where the "IRQ mapping" you're talking about to
> retrieve from; therefore, I assume that came from dmesg:
> 
> Found $PIR table, 15 entries at 0xc00fdea0
> PCI-Only Interrupts: none
> Location  Bus Device Pin  Link  IRQs
> embedded    0    0    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    0    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    0    C   0x62  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    0    D   0x63  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    2    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    2    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    1    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    0    1    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    1    0    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    1    0    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   30    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   30    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   30    C   0x62  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   30    D   0x63  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    0    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    0    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> slot 1      2    2    A   0x62  3 4 5 6 7 9 10 11 12 14 15
> slot 1      2    2    B   0x63  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    3    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    3    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    3    D   0x63  3 4 5 6 7 9 10 11 12 14 15
> slot 2      9    0    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> slot 2      9    0    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> slot 2      9    0    C   0x62  3 4 5 6 7 9 10 11 12 14 15
> slot 2      9    0    D   0x63  3 4 5 6 7 9 10 11 12 14 15
> embedded    9    1    A   0x61  3 4 5 6 7 9 10 11 12 14 15
> embedded    9    2    A   0x62  3 4 5 6 7 9 10 11 12 14 15
> embedded    9    2    B   0x62  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    1    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    2    8    A   0x68  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   29    A   0x60  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   29    B   0x63  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   29    C   0x62  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   29    D   0x6b  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   31    A   0x62  3 4 5 6 7 9 10 11 12 14 15
> embedded    0   31    B   0x61  3 4 5 6 7 9 10 11 12 14 15
> pcib0: <Host to PCI bridge> at pcibus 0 on motherboard
> pir0: <PCI Interrupt Routing Table: 15 Entries> on motherboard
> $PIR: Links after initial probe:
> Link  IRQ  Rtd  Ref  IRQs
> 0x60  255   N    10  3 4 5 6 7 9 10 11 12 14 15
> 0x61  255   N    10  3 4 5 6 7 9 10 11 12 14 15
> 0x62  255   N     8  3 4 5 6 7 9 10 11 12 14 15
> 0x63  255   N     6  3 4 5 6 7 9 10 11 12 14 15
> 0x68  255   N     1  3 4 5 6 7 9 10 11 12 14 15
> 0x6b  255   N     1  3 4 5 6 7 9 10 11 12 14 15
> $PIR: Found matching pin for 1.0.INTA at func 0: 4
> $PIR: Found matching pin for 2.0.INTA at func 0: 4
> $PIR: Found matching pin for 2.0.INTB at func 1: 6
> $PIR: Found matching pin for 2.2.INTA at func 0: 11
> $PIR: Found matching pin for 2.1.INTA at func 0: 4
> $PIR: Found matching pin for 0.29.INTA at func 0: 4
> $PIR: Found matching pin for 0.29.INTB at func 1: 10
> $PIR: Found matching pin for 0.29.INTC at func 2: 11
> $PIR: Found matching pin for 0.29.INTD at func 7: 11
> $PIR: Found matching pin for 0.31.INTA at func 1: 255
> $PIR: Found matching pin for 0.31.INTB at func 3: 6
> $PIR: Links after initial IRQ discovery:
> Link  IRQ  Rtd  Ref  IRQs
> 0x60    4   Y    10  3 4 5 6 7 9 10 11 12 14 15
> 0x61    6   Y    10  3 4 5 6 7 9 10 11 12 14 15
> 0x62   11   Y     8  3 4 5 6 7 9 10 11 12 14 15
> 0x63   10   Y     6  3 4 5 6 7 9 10 11 12 14 15
> 0x68  255   N     1  3 4 5 6 7 9 10 11 12 14 15
> 0x6b   11   Y     1  3 4 5 6 7 9 10 11 12 14 15
> $PIR: IRQs used by BIOS: 4 6 10 11
> $PIR: Interrupt Weights:
> [    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 ]
> [    0   0   0   0  10   0  10   0   0   0   6   9   0   0   0   0 ]
> 
> For the complete dmesg, please consult:
> 
> 	http://www.mmlab.cse.yzu.edu.tw/~avatar/dmesg-noacpi.txt
> 	http://www.mmlab.cse.yzu.edu.tw/~avatar/dmesg-acpi.txt

Wierd ... in the ACPI case, em0 gets attached, gets detached, and then
gets attached again.

em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.25> port 0x8000-0x803f
 mem 0xc0200000-0xc020ffff,0xc0220000-0xc023ffff irq 4 at device 1.0 on pci2
em0: Reserved 0x20000 bytes for rid 0x10 type 3 at 0xc0220000
em0: Reserved 0x40 bytes for rid 0x18 type 4 at 0x8000
em0: [GIANT-LOCKED]
em0: bpf attached  
em0:  Speed:N/A  Duplex:N/A
[ snip ]
em0: Link is up 100 Mbps Full Duplex
[ snip ]
em0: detached
[ snip ]
em0: <Intel(R) PRO/1000 Network Connection, Version - 1.7.25> port 0x8000-0x803f
 mem 0xc0200000-0xc020ffff,0xc0220000-0xc023ffff irq 4 at device 1.0 on pci2
pcib2: device em0 requested decoded memory range 0xc0220000-0xc023ffff
pcib2: device em0 requested decoded I/O range 0x8000-0x803f
em0: [GIANT-LOCKED]
em0: bpf attached
em0:  Speed:N/A  Duplex:N/A
em0: Link is up 100 Mbps Full Duplex

It looks like em0 shares irq 4 with the exact same set of devices in
both cases.  It does seem strange to me that it is sharing irq 4 with
sio0.  Typically the sio ports get exclusive use of irqs.

The whole part of the dmesg starting with "em0: detached" and "pci0:
driver added" is something that I haven't seen before, but I'm not
running bleeding edge -CURRENT at the moment.

I'd recommend starting a separate thread to discuss your em0 problem.