Date:      Thu, 27 Dec 2012 14:14:30 +1100
From:      Peter Jeremy <peter@rulingia.com>
To:        Dieter BSD <dieterbsd@engineer.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: FreeBSD for serious performance?
Message-ID:  <20121227031430.GD82100@server.rulingia.com>
In-Reply-To: <20121226084805.91840@gmx.com>
References:  <20121226084805.91840@gmx.com>

On 2012-Dec-25 21:51:14 -0500, Dieter BSD <dieterbsd@engineer.com> wrote:
>ata(4) completely hung the system for 19 minutes (at which point
>I manually intervened, see the PR), probably an infinite loop.
>
>http://www.freebsd.org/cgi/query-pr.cgi?pr=170675

Which contains no useful information.  You've even edited out
system details that are automatically inserted by send-pr.

Please provide a dmesg from a verbose boot of that machine.  What
brand/model motherboard?  What add-in cards do you have?  What do you
mean by "completely hung"?  What did you try to do to provoke a
response?  Are you running a GENERIC kernel?  If not, please provide
your kernel configuration.  Please provide the SMART data for ad6
(smartctl -a /dev/ad6).  Where does ad6 connect to the controller?  Do
you use any port-multipliers?  What was the system doing when ad6
detached?  Since the system ran for 24 hours, apparently without you
noticing that ad6 had detached, is ad6 part of a RAID?  If so, what is
the RAID configuration and technology?

>Siis(4) and ahci(4) have also caused data loss, presumably by
>blocking interrupts for too long.

You're still refusing to provide any useful information that might
allow us to locate the supposed problem.

>Improving these drivers would be wonderful. But better yet,
>can we please find a way to fix the underlying problem?

What underlying problem?

>When a device driver handles an interrupt, it needs to block
>further interrupts while it modifies its data structures. Otherwise
>another interrupt coming in might cause it to mangle the data.
>Right? But! Why does it need to block interrupts for everything?

That depends on how the interrupts are laid out in the hardware.  One
popular approach on cheap motherboards is to have lots of different
devices sharing the same interrupt.  In this case, an interrupt
generated by one device can block interrupts by all other devices
sharing that interrupt.
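
As a rough illustration only (a toy model, not FreeBSD code; the
device names and IRQ numbers are made up), this sketch lists which
devices are held off while one device's interrupt is being serviced:

```python
# Toy model, not FreeBSD code: device names and IRQ numbers are hypothetical.

def blocked_devices(irq_map, raising_device):
    """Return the devices whose interrupts are held off while
    raising_device's interrupt is being serviced."""
    line = irq_map[raising_device]
    return sorted(dev for dev, irq in irq_map.items()
                  if irq == line and dev != raising_device)

# Cheap board: disk controller, NIC and USB all wired to IRQ 16.
shared = {"ata0": 16, "re0": 16, "uhci0": 16, "ahci0": 19}
# Friendlier layout: every device has its own line.
separate = {"ata0": 16, "re0": 17, "uhci0": 18, "ahci0": 19}

print(blocked_devices(shared, "ata0"))    # ['re0', 'uhci0']
print(blocked_devices(separate, "ata0"))  # []
```

With the shared layout, a slow handler for ata0 delays re0 and uhci0
as well; with separate lines it delays nobody else.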

>Alternately, why couldn't the data structures be protected with
>a mutex? Then the drivers shouldn't have to block even themselves.
>
>Alternately, why can't drivers have a polling option?

Your patches implementing this functionality appear to have gotten
detached from your mail.  Could you please resend them?  Note that
several ethernet drivers already have a polling option (intended to
avoid livelock issues at high traffic levels on primitive NICs).
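
For what it's worth, the idea behind polling can be sketched as
follows.  This is a simplified model, not code from any driver: the
ring, the poll_once() helper and the budget value are all assumptions
for illustration.  Each pass handles a bounded number of packets and
then returns, so a flood of traffic cannot monopolise the CPU:

```python
from collections import deque

def poll_once(rx_ring, budget):
    """Take at most `budget` packets off the receive ring in one pass."""
    handled = []
    while rx_ring and len(handled) < budget:
        handled.append(rx_ring.popleft())
    return handled

# Ten packets waiting on a hypothetical NIC's receive ring.
rx_ring = deque(range(10))
first = poll_once(rx_ring, budget=4)
print(first, len(rx_ring))  # [0, 1, 2, 3] 6
```

The remaining packets wait for the next pass, which is the trade-off:
bounded work per pass in exchange for higher latency.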

>Current machines can have multiple disks, multiple Ethernets,
>multiple pretty-much-any-device, multiple CPUs, etc. etc.

Which is why it's important to have complete details of the system
when reporting issues since the problem may be caused by an
unexpected interaction between the components.

>have this absurd bottleneck where the device drivers bring
>everything to a screeching halt every time an interrupt happens.

So you keep claiming without producing any evidence.  Can you please
point to the code that does this?

On 2012-Dec-26 03:48:04 -0500, Dieter BSD <dieterbsd@engineer.com> wrote:
>They are doing *something* that completely locks out everything else.
>It is always a device driver.

So far, you have failed to provide any details to back this claim up.

>Hard to imagine locking everything out for 19 minutes without being
>in a loop.

I can think of several possibilities:
- broken controller locking up the bus
- deadlock
- clocks stopping (I've seen this in a different scenario)

>Would several different drivers have this same bug?

You haven't provided any evidence of a software bug.  If you're seeing
the same problem across lots of different devices, it suggests a
hardware problem.

>I've only caught it hanging forever once. It only takes a few
>milliseconds to cause incoming data to be lost,

I'm not sure what you mean by this.  FreeBSD is not a real-time
operating system and so offers no guarantees on how long it will
take before incoming data will be processed.  If you have an
application that relies on incoming data being processed within
milliseconds, you may need to do some redesign.

>BTW, how do I break into the debugger and gather data when all of
>the devices are locked out, including the console?

Firewire?  Have you verified that the console is locked up and
you can't enter the debugger?

>The ata controller is soldered to the mainboard, a gazillion pins
>I'm sure, and no doubt requires very specialized equipment to replace,
>and I don't know of any pin-compatible replacements. Besides the
>hardware itself has never caused any problems. The problem is caused
>by the software, it is the software that needs to be fixed.

The limited information you have provided points to a hardware fault,
not a software bug.  If you have evidence that it's a software bug,
please provide it.

>Ata isn't maintained? Why the bleep not? Disk drivers are essential.

ata(4) _is_ maintained.  Your particular obsolete ATA controller may
not be.

>I was under the impression that siis(4) and ahci(4) were actively
>maintained? I'm running four sata controllers using three different
>drivers and all three drivers lock out other drivers for too long
>when something unusual happens.

I'm using both siis & ahci and have never seen anything that points to
a bug in those device drivers causing the system to lock up.  And I
don't recall (offhand) seeing other reports of it.  This again points
to a problem with your particular configuration, rather than FreeBSD.

>And other, non-disk drivers have the same problem of locking out
>other drivers, even during normal operation. And this happens on
>yet other drivers on other people's hardware, not just mine.

Can you provide mailing list or PR references to these?

-- 
Peter Jeremy