From owner-freebsd-hackers  Tue Jun  3 19:15:24 1997
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id TAA10801
          for hackers-outgoing; Tue, 3 Jun 1997 19:15:24 -0700 (PDT)
Received: from spoon.beta.com (root@[199.165.180.33])
          by hub.freebsd.org (8.8.5/8.8.5) with ESMTP id TAA10790
          for <hackers@freebsd.org>; Tue, 3 Jun 1997 19:15:19 -0700 (PDT)
Received: from spoon.beta.com (mcgovern@localhost [127.0.0.1])
	by spoon.beta.com (8.8.5/8.8.5) with ESMTP id WAA21318
	for <hackers@freebsd.org>; Tue, 3 Jun 1997 22:29:07 -0400 (EDT)
Message-Id: <199706040229.WAA21318@spoon.beta.com>
To: hackers@freebsd.org
Subject: Need help with fastest way to move data...
Date: Tue, 03 Jun 1997 22:29:06 -0400
From: "Brian J. McGovern" <mcgovern@spoon.beta.com>
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

I've completed a prototype device driver for the Cyclades Cyclom-Z card, and
I'm hoping to make it available to everyone by the end of the week.
Unfortunately, it's under-performing a 16550 UART, and I suspect that it's due
to how I'm doing I/O with the card.

Currently, I'm moving a byte at a time to the card (i.e., for each TX or RX
interrupt I get, I loop, moving a byte at a time until the Xmit/Recv buffer
is full/empty). Obviously, on a PCI bus (not to mention internally), this is
terribly inefficient. With the 32-bit bus, I'm hoping to be able to move
4 characters at a time, and thereby increase performance of this chunk of
code by 3-4 times.

The question I have is what is the best way to do this? I'm having some
problems with q_to_b() locking up the system (I'm not quite sure why, I
just know it is), but I'm not even sure if this is the best way to move the
data.

A port on a card (there are 8 ports per card) has a 4KB receive buffer, and
a 2KB transmit buffer. Both are "ring buffers", with transmit and receive
pointer pairs (head and tail) in a separate structure just ahead of the
ring buffers.
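
To make the layout above concrete, here is a minimal sketch of head/tail
bookkeeping for one such ring. The struct, field names, and sizes as macros
are hypothetical (the real Cyclom-Z control block is not shown here), and it
assumes the common convention of leaving one slot unused so that a full ring
can be told apart from an empty one.

```c
#include <stdint.h>

#define TX_RING_SIZE 2048u   /* 2KB transmit ring, per the description above */
#define RX_RING_SIZE 4096u   /* 4KB receive ring */

/* Hypothetical control block; the card's real layout is an assumption. */
struct ring_ctl {
    uint32_t head;   /* offset where the producer writes next */
    uint32_t tail;   /* offset where the consumer reads next */
};

/* Number of bytes currently queued in the ring. */
static uint32_t ring_used(const struct ring_ctl *rc, uint32_t size)
{
    return (rc->head + size - rc->tail) % size;
}

/* Number of bytes that can still be queued (one slot is kept empty). */
static uint32_t ring_free(const struct ring_ctl *rc, uint32_t size)
{
    return size - 1 - ring_used(rc, size);
}

/* Free bytes available at head before the offset wraps back to 0. */
static uint32_t ring_contig_free(const struct ring_ctl *rc, uint32_t size)
{
    uint32_t to_end = size - rc->head;
    uint32_t avail = ring_free(rc, size);
    return avail < to_end ? avail : to_end;
}
```

The contiguous count matters for any multi-byte copy: a block transfer has
to stop at the end of the ring and restart at offset 0.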

Due to the nature of the clists that I'm moving the data in and out of,
I'm making no assumption as to their positioning in memory. I'm also under
the assumption that things will be "most efficient" if the destination on
the PCI bus (and locally) is aligned on a 32-bit boundary. Therefore, what
I was considering doing was moving a character at a time until the buffer
offset and'ed with 0x3 was 0 (i.e., (offset & 0x03) == 0), which would mean
that my buffer pointer was now on a long boundary. I would then use q_to_b()
to move up to 4 bytes (fewer if fewer than 4 remain) into an unsigned long,
with something like:
bytesmvd = q_to_b(&tp->t_outq, (unsigned char *)&longint, MIN(bytes_left, 4));

Then, I'd transfer the 4 bytes with something like:

memcpy((char *)buffer_base + offset, &longint, bytesmvd);
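
The align-then-copy-words idea can be sketched in plain C as follows. This
is only an illustration under assumptions: copy_to_card() is a made-up name,
a flat source buffer stands in for the clist (so there is no q_to_b() call),
the card buffer base is assumed to be 32-bit aligned, and the caller is
assumed to have limited len so the copy does not run past the end of the
ring.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes from src into the card buffer at base + offset.
 * Single bytes are used until the destination offset is 32-bit aligned,
 * then whole 32-bit words, then single bytes for whatever is left.
 * Returns the number of bytes copied.
 */
static size_t copy_to_card(volatile uint8_t *base, size_t offset,
                           const uint8_t *src, size_t len)
{
    size_t done = 0;

    /* Head: byte stores until (offset & 0x03) == 0. */
    while (done < len && ((offset + done) & 0x03) != 0) {
        base[offset + done] = src[done];
        done++;
    }

    /* Body: one aligned 32-bit store per 4 source bytes. */
    while (len - done >= 4) {
        uint32_t word;
        /* src may be unaligned, so assemble the word with memcpy. */
        memcpy(&word, src + done, sizeof(word));
        *(volatile uint32_t *)(base + offset + done) = word;
        done += 4;
    }

    /* Tail: fewer than 4 bytes remain. */
    while (done < len) {
        base[offset + done] = src[done];
        done++;
    }
    return done;
}
```

Assembling the word with memcpy and storing it whole keeps the bytes in the
same order in memory as in the source, regardless of host byte order.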

Receive would be something similar: move a byte at a time to l_rint until
(offset & 0x03) was 0, then memcpy up to 4 bytes at a time into the long int,
then loop over those one to N bytes, passing each to l_rint in turn.

Now, the big question... Is this the most efficient way to do this? Do memcpy
and the like work best on long-aligned values? Would it be even MORE efficient
to use larger structures, given sufficient data to move?

I'm curious to hear comments, and see if anyone has any truly cool ideas.

	-Brian