Date: Sun, 3 Sep 2000 00:51:39 -0400 (EDT)
From: Garance A Drosehn <gad@freefour.acs.rpi.edu>
To: FreeBSD-gnats-submit@freebsd.org
Cc: gad@eclipse.acs.rpi.edu
Subject: bin/21008: Fix for lpr's handling of lots of jobs in a queue
Message-ID: <200009030451.AAA73984@freefour.acs.rpi.edu>
>Number: 21008
>Category: bin
>Synopsis: Fix for lpr's handling of lots of jobs in a queue
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Sat Sep 02 22:00:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator: Garance A Drosehn
>Release: FreeBSD 4.-stable and 5.-current i386
>Organization:
RPI ; Troy, NY
>Environment:
Using freebsd's lpr/lpd on print servers in a busy setting.
>Description:
When spooling jobs in a queue, lpr uses a counter that runs from 0
to 999 and then cycles back to 0. The assumption is that by the time
the counter cycles around, the earlier jobs are long gone. If that
assumption is wrong, then the error is handled in a way which
is pretty painful. (Note that I have only checked how lpd's
recvjob routine handles it when accepting jobs from some other
host; I haven't looked at what lpr does if a client is "full up".)
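The wraparound described above can be sketched roughly as follows.
This is an illustration only, not lpr's actual code: the helper names
next_seq and spool_name, and the "dfA<nnn><host>" name layout used
here, are assumptions made for the example.

```c
#include <stdio.h>

#define SEQ_LIMIT 1000	/* counter runs 0..999, as described above */

/* Hypothetical helper: advance the per-queue job counter, wrapping to 0. */
static int
next_seq(int seq)
{
	return (seq + 1) % SEQ_LIMIT;
}

/*
 * Hypothetical helper: build a spool file name such as "dfA042client".
 * The exact name layout is an assumption for this sketch.
 */
static void
spool_name(char *buf, size_t len, const char *prefix, int seq,
    const char *host)
{
	snprintf(buf, len, "%sA%03d%s", prefix, seq, host);
}
```

After 1000 calls to next_seq the counter is back where it started, so
the 1001st job from one host gets the same file names as the 1st. If
the 1st job is still sitting in the queue, the names collide.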
What happens is that recvjob (readfile) notices that the datafile
already exists, so it calls frecverr. That goes to cleanup, and
the cleanup routine assumes that the INCOMING file was a problem,
and thus removes it. This means you have removed the datafile for
an EARLIER job, and told the sending host an error occurred. The
sending host may respond to this error by waiting a bit, and then
resending the same file. Now the datafile will not already exist,
because it was destroyed, so the datafile will transfer successfully.
However, the control file for the earlier job still exists, and when
the control file for the incoming job arrives, IT errors. The
receiving host again sends an error to the sending host, and it seems
that the sending host decides this is a good reason to just remove
the job on its side.
So, you've gone from two (or more) datafiles and two control files
to one control file and no datafiles.
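The sequence above can be walked through with a toy model. This is
only a sketch of the behavior as described in this report, not code
from lpd: the struct and function names are made up for the example,
and the removal of the datafile on the control-file error is inferred
from the end state given above.

```c
#include <stdbool.h>

/* Toy model of the server's spool directory for ONE colliding job name. */
struct spool {
	bool df;	/* a datafile with this name exists */
	bool cf;	/* a control file with this name exists */
};

/*
 * Incoming datafile under the old behavior.  If the name is taken, the
 * error path removes the file of that name, which is the EARLIER job's
 * datafile.  Returns true if the transfer was accepted.
 */
static bool
recv_datafile(struct spool *s)
{
	if (s->df) {
		s->df = false;
		return false;
	}
	s->df = true;
	return true;
}

/*
 * Incoming control file under the old behavior.  If the name is taken,
 * the error path removes the datafile that was just received (inferred),
 * and the earlier control file stays in place.
 */
static bool
recv_controlfile(struct spool *s)
{
	if (s->cf) {
		s->df = false;
		return false;
	}
	s->cf = true;
	return true;
}
```

Starting from { df = true, cf = true } for the earlier job, the calls
recv_datafile (fails), recv_datafile (resend, succeeds) and
recv_controlfile (fails) leave the model with cf set and df clear,
which matches the "one control file and no datafiles" outcome.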
We should (optionally) allow a larger range for the counter, but
I'm not ready to write that right now. In any case, there needs
to be a fix for how recvjob behaves when an overflow of the counter's
range does occur.
>How-To-Repeat:
You could send over a thousand jobs to a queue, but that's a bit
unwieldy. Instead, I suggest:
On "server":
    lpc stop <printer>
On "client":
    lpc stop <printer>
    lpr -P<printer> somefile
    go to spool directory, and save a copy of the cf and df files.
    lpc start <printer>
    (the files for the job go from client to server, and sit there)
    recreate cf and df files, with the exact same name, from copies.
    lpc start <printer>
then watch what happens...
>Fix:
The real fix, in my opinion, would be a pretty significant rewrite
of recvjob.c.
The interim fix is to change recvjob such that the receiving host
will tell the sending host that it is "out of space" if a file is
being sent which already exists on the server. Assuming datafiles
are being correctly removed as each job finishes printing, this
works well. NOTE: there do seem to be some situations where a
datafile is left behind (not-removed) even though the job has in
fact printed. I do not know if those are due to other changes I
have made in my lpr, or if everyone sees them. In any case, those
leftover data files could now cause queues to "stall" with this
update. Still, no data is lost, and both the server and the client
will have some information as to why the stall has happened. So,
I still think this is a reasonable fix, even if it isn't foolproof.
Here is the update:
--- recvjob.c.orig Sat Sep 2 23:39:25 2000
+++ recvjob.c Sat Sep 2 23:35:45 2000
@@ -58,6 +58,7 @@
 #include <signal.h>
 #include <fcntl.h>
 #include <dirent.h>
+#include <errno.h>
 #include <syslog.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -239,8 +240,18 @@
 	int fd, err;
 
 	fd = open(file, O_CREAT|O_EXCL|O_WRONLY, FILMOD);
-	if (fd < 0)
-		frecverr("readfile: %s: illegal path name: %m", file);
+	if (fd < 0) {
+		if (errno != EEXIST)
+			frecverr("readfile: %s: illegal path name: %m", file);
+		/* the open() failed because the file already exists. This
+		 * may just mean that there already are 1000 jobs in the queue
+		 * from the sending host. Treat this as if we are temporarily
+		 * out-of-space for new jobs */
+		syslog(LOG_INFO, "returning 'no-space' to %s because %s already exists", fromb, file);
+		sleep(2);
+		(void) write(1, "\2", 1);
+		return(0);
+	}
 	ack();
 	err = 0;
 	for (i = 0; i < size; i += BUFSIZ) {
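
The errno test at the heart of the patch can be tried outside of lpd
with a small standalone helper. This is a sketch only: create_exclusive
is a hypothetical name and is not part of recvjob.c. It shows nothing
more than how open() with O_CREAT|O_EXCL lets the caller tell "file
already exists" apart from any other failure.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from recvjob.c): try to create `path`
 * exclusively.  Returns 1 if the file was created, 0 if it already
 * exists (errno == EEXIST), and -1 on any other error.
 */
static int
create_exclusive(const char *path, mode_t mode)
{
	int fd;

	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, mode);
	if (fd >= 0) {
		(void) close(fd);
		return 1;
	}
	return (errno == EEXIST) ? 0 : -1;
}
```

In terms of the patch, the 0 case is the one that now gets reported to
the sending host as "out of space", while the -1 case still goes
through frecverr as before.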
>Release-Note:
>Audit-Trail:
>Unformatted:
