Date:      Sun, 3 Sep 2000 00:51:39 -0400 (EDT)
From:      Garance A Drosehn <gad@freefour.acs.rpi.edu>
To:        FreeBSD-gnats-submit@freebsd.org
Cc:        gad@eclipse.acs.rpi.edu
Subject:   bin/21008: Fix for lpr's handling of lots of jobs in a queue
Message-ID:  <200009030451.AAA73984@freefour.acs.rpi.edu>

>Number:         21008
>Category:       bin
>Synopsis:       Fix for lpr's handling of lots of jobs in a queue
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 02 22:00:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Garance A Drosehn
>Release:        FreeBSD 4.-stable and 5.-current i386
>Organization:
RPI ; Troy, NY
>Environment:

	Using FreeBSD's lpr/lpd on print servers in a busy setting.

>Description:

	Lpr numbers the jobs it spools in a queue with a counter that runs
	from 0 to 999 and then cycles back to 0.  The assumption is that by
	the time the counter cycles around, the earlier jobs are long gone.
	If that assumption is wrong, the error is handled in a way which
	is pretty painful.  (Note that I have only checked how lpd's
	recvjob routine handles it when accepting jobs from some other
	host.  I haven't looked at what lpr does if a client is "full up".)
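
	As a rough illustration of why the names collide, here is a
	minimal sketch (not lpr's actual code).  The helpers next_seq and
	spool_name are hypothetical, and the name layout (prefix, "A",
	three digits, host name) is an assumption made for the example.

```c
#include <stdio.h>

#define SEQ_LIMIT 1000	/* the counter covers 0..999, as described above */

/* Hypothetical helper: advance the sequence number, wrapping back to 0. */
static int
next_seq(int seq)
{
	return (seq + 1) % SEQ_LIMIT;
}

/*
 * Hypothetical helper: build a spool file name from a prefix ("df" or
 * "cf"), the sequence number, and the sending host's name.  The exact
 * layout is assumed for illustration only.
 */
static int
spool_name(char *buf, size_t len, const char *prefix, int seq,
    const char *host)
{
	return snprintf(buf, len, "%sA%03d%s", prefix, seq, host);
}
```

	With this scheme, if job 7 from a host is still in the queue when
	that host has submitted another 1000 jobs, the new job is given
	the same file names as job 7.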

	What happens is that recvjob (readfile) notices that the datafile
	already exists, so it calls frecverr.  That goes to cleanup, and
	the cleanup routine assumes that the INCOMING file was a problem,
	and thus removes it.  This means you have removed the datafile for
	an EARLIER job, and told the sending host an error occurred.  The
	sending host may respond to this error by waiting a bit, and then
	resending the same file.  Now the datafile will not already exist,
	because it was destroyed, so the datafile will transfer successfully.
	However, the control file for the earlier job still exists, so when
	the control file for the incoming job arrives, IT hits the same
	error.  The receiving host again sends an error to the sending host,
	and it seems that the sending host decides this is a good reason to
	just remove the job on its side.
	So, you've gone from two (or more) datafiles and two control files
	to one control file and no datafiles.

	We should (optionally) allow a larger range for the counter, but
	I'm not ready to write that right now.  In any case, there needs
	to be a fix for how recvjob behaves when an overflow of the counter's
	range does occur.

>How-To-Repeat:

	You could send over a thousand jobs to a queue, but that's a bit
	unwieldy.  Instead, I suggest:
	    On "server":
	        lpc stop <printer>
	    On "client":
	        lpc stop <printer>
	        lpr -P<printer> somefile
	    Go to the spool directory, and save a copy of the cf and df files.
	        lpc start <printer>
	    (The files for the job go from client to server, and sit there.)
	    Recreate the cf and df files, with the exact same names, from the
	    copies.
	        lpc start <printer>
	    Then watch what happens...

>Fix:
	
	The real fix, in my opinion, would be a pretty significant rewrite
	of recvjob.c.

	The interim fix is to change recvjob such that the receiving host
	will tell the sending host that it is "out of space" if a file is
	being sent which already exists on the server.  Assuming datafiles
	are being correctly removed as each job finishes printing, this
	works well.  NOTE: there do seem to be some situations where a
	datafile is left behind (not removed) even though the job has in
	fact printed.  I do not know if those are due to other changes I
	have made in my lpr, or if everyone sees them.  In any case, those
	leftover data files could now cause queues to "stall" with this
	update.  Still, no data is lost, and both the server and the client
	will have some information as to why the stall has happened.  So,
	I still think this is a reasonable fix, even if it isn't foolproof.

	Here is the update:

--- recvjob.c.orig	Sat Sep  2 23:39:25 2000
+++ recvjob.c	Sat Sep  2 23:35:45 2000
@@ -58,6 +58,7 @@
 #include <signal.h>
 #include <fcntl.h>
 #include <dirent.h>
+#include <errno.h>
 #include <syslog.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -239,8 +240,18 @@
 	int fd, err;
 
 	fd = open(file, O_CREAT|O_EXCL|O_WRONLY, FILMOD);
-	if (fd < 0)
-		frecverr("readfile: %s: illegal path name: %m", file);
+	if (fd < 0) {
+		if (errno != EEXIST)
+			frecverr("readfile: %s: illegal path name: %m", file);
+		/* the open() failed because the file already exists.  This
+		 * may just mean that there already are 1000 jobs in the queue
+		 * from the sending host.  Treat this as if we are temporarily
+		 * out-of-space for new jobs */
+		syslog(LOG_INFO, "returning 'no-space' to %s because %s already exists", fromb, file);
+		sleep(2);
+		(void) write(1, "\2", 1);
+		return(0);
+	}
 	ack();
 	err = 0;
 	for (i = 0; i < size; i += BUFSIZ) {
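
	The check the patch relies on can be tried outside of lpd.  The
	following is a minimal standalone sketch, not part of the patch:
	the function create_spool_file, the enum spool_result, and the
	0660 mode are all hypothetical names and values chosen for the
	example.  It only shows how open() with O_CREAT|O_EXCL lets the
	caller tell "file already exists" apart from other failures.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum spool_result { SPOOL_CREATED, SPOOL_EXISTS, SPOOL_ERROR };

/*
 * Try to create a new spool file without clobbering an existing one.
 * O_EXCL makes open() fail with EEXIST if the path is already present,
 * which is the case the patch reports as "out of space" instead of
 * treating it as a fatal error.  The mode 0660 is an assumption.
 */
static enum spool_result
create_spool_file(const char *path)
{
	int fd;

	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0660);
	if (fd >= 0) {
		(void) close(fd);	/* a real receiver would write data here */
		return SPOOL_CREATED;
	}
	if (errno == EEXIST)
		return SPOOL_EXISTS;	/* an earlier job still owns this name */
	return SPOOL_ERROR;		/* bad path, permissions, and so on */
}
```

	Calling create_spool_file twice with the same path should give
	SPOOL_CREATED and then SPOOL_EXISTS, and the second call leaves
	the first file untouched.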


>Release-Note:
>Audit-Trail:
>Unformatted:





