Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 4 Jan 2000 22:59:34 -0800 (PST)
From:      Paul Eggert <eggert@twinsun.com>
To:        ru@freebsd.org
Cc:        obrien@NUXI.com, alainm@rcsm.ece.mcgill.ca, freebsd-current@freebsd.org, bug-gnu-utils@gnu.org
Subject:   Re: Bad 'grep' behaviour in -CURRENT, faulty binary detection?
Message-ID:  <200001050659.WAA10612@set.twinsun.com>
In-Reply-To: <20000104143320.D10970@relay.ucb.crimea.ua> (ru@freebsd.org)
References:   <20000104143320.D10970@relay.ucb.crimea.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
Thanks for implementing the result of our earlier discussion.  I
documented what you wrote, and as a result I have the following
suggestions for further improvements:

  * The assignment to not_text should depend on out_quiet
    if no --binary-files option is given.
  * The operands of --binary-files are confusing.  E.g. --binary-files=read
    implies that --binary-options=summarize and --binary-options=skip
    don't read binary files, which isn't true.  And --binary-options=skip
    implies that binary files aren't mentioned on output, which also isn't true
    (a counterexample is --files-without-match --binary-options=skip).
    I propose that the operands be renamed from (read, skip, summarize)
    to (text, without-match, binary).
  * There's no need for a short equivalent to the --binary-files option.
  * Is the -I option really needed?  Can't users just use
    --binary-files=without-match instead?  I have the impression that
    this option won't be typed interactively much.

Could you please try the patch enclosed below instead?
It implements all the above suggestions.

2000-01-04  Paul Eggert  <eggert@twinsun.com>

	Add --binary-files option.
	* NEWS, doc/grep.1, doc/grep.texi: Document it.
	* src/grep.c (BINARY_FILES_OPTION): New constant.
	(long_options, grep, usage, main): New --binary-files option.
	(binary_files): New var.
	* src/system.h (TYPE_SIGNED, TYPE_MINIMUM, TYPE_MAXIMUM, CHAR_MAX):
	New macros.
	(INT_MAX, UCHAR_MAX): Define in terms of TYPE_MAXIMUM.

===================================================================
RCS file: doc/grep.1,v
retrieving revision 2.4
retrieving revision 2.4.0.1
diff -pu -r2.4 -r2.4.0.1
--- doc/grep.1	1999/10/12 20:41:22	2.4
+++ doc/grep.1	2000/01/05 06:35:00	2.4.0.1
@@ -28,6 +28,7 @@ grep, egrep, fgrep \- print lines matchi
 .IR FILE ]
 .RB [ \-d
 .IR ACTION ]
+.RB [ \-\^\-binary-files=\fITYPE\fP ]
 .RB [ \-\^\-directories=\fIACTION\fP ]
 .RB [ \-\^\-extended-regexp ]
 .RB [ \-\^\-fixed-strings ]
@@ -143,6 +144,41 @@ Print the version number of
 to standard error.  This version number should
 be included in all bug reports (see below).
 .TP
+.BI \-\^\-binary-files= TYPE
+If the first few bytes of a file indicate that the file contains binary
+data, assume that the file is of type
+.IR TYPE .
+By default,
+.I TYPE
+is
+.BR binary ,
+and
+.B grep
+normally outputs either
+a one-line message saying that a binary file matches, or no message if
+there is no match.
+If
+.I TYPE
+is
+.BR without-match ,
+.B grep
+assumes that a binary file does not match.
+If
+.I TYPE
+is
+.BR text ,
+.B grep
+processes a binary file as if it were text; this is equivalent to the
+.B \-a
+or
+.B \-\^\-text
+option.
+.I Warning:
+.B "grep \-\^\-binary-files=text"
+might output binary garbage,
+which can have nasty side effects if the output is a terminal and if the
+terminal driver interprets some of it as commands.
+.TP
 .BR \-b ", " \-\^\-byte-offset
 Print the byte offset within the input file before
 each line of output.
@@ -257,15 +293,9 @@ and
 and should redirect output to /dev/null instead.
 .TP
 .BR \-a ", " \-\^\-text
-Do not suppress output lines that contain binary data.
-Normally, if the first few bytes of a file indicate that
-the file contains binary data,
-.B grep
-outputs only a message saying that the file matches the pattern.
-This option causes
-.B grep
-to act as if the file is a text file,
-even if it would otherwise be treated as binary.
+Process a binary file as if it were text; this is equivalent to the
+.B \-\^\-binary-files=text
+option.
 .TP
 .BR \-v ", " \-\^\-invert-match
 Invert the sense of matching, to select non-matching lines.
@@ -323,9 +353,9 @@ system call to read input, instead of
 the default
 .BR read (2)
 system call.  In some situations,
-.B -\^-mmap
+.B \-\^\-mmap
 yields better performance.  However,
-.B -\^-mmap
+.B \-\^\-mmap
 can cause undefined behavior (including core dumps)
 if an input file shrinks while
 .B grep
===================================================================
RCS file: doc/grep.texi,v
retrieving revision 2.4
retrieving revision 2.4.0.1
diff -pu -r2.4 -r2.4.0.1
--- doc/grep.texi	1999/11/13 21:20:24	2.4
+++ doc/grep.texi	2000/01/05 06:35:00	2.4.0.1
@@ -275,6 +275,21 @@ This version number should be included i
 Print a usage message briefly summarizing these command-line options
 and the bug-reporting address, then exit.
 
+@itemx --binary-files=@var{type}
+@opindex --binary-files
+@cindex binary files
+If the first few bytes of a file indicate that the file contains binary
+data, assume that the file is of type @var{type}.  By default,
+@var{type} is @samp{binary}, and @command{grep} normally outputs either
+a one-line message saying that a binary file matches, or no message if
+there is no match.  If @var{type} is @samp{without-match},
+@command{grep} assumes that a binary file does not match.  If @var{type}
+is @samp{text}, @command{grep} processes a binary file as if it were
+text; this is equivalent to the @samp{-a} or @samp{--text} option.
+@emph{Warning:} @samp{--binary-files=text} might output binary garbage,
+which can have nasty side effects if the output is a terminal and if the
+terminal driver interprets some of it as commands.
+
 @item -b
 @itemx --byte-offset
 @opindex -b
@@ -329,16 +344,8 @@ The scanning of every file will stop on 
 @opindex --text
 @cindex suppress binary data
 @cindex binary files
-Do not suppress output lines that contain binary data.
-Normally, if the first few bytes of a file indicate
-that the file contains binary data, grep outputs only a
-message saying that the file matches the pattern.  This
-option causes grep to act as if the file is a text
-file, even if it would otherwise be treated as binary.
-@emph{Warning:} the result might be binary garbage
-printed to the terminal, which can have nasty
-side-effects if the terminal driver interprets some of
-it as commands.
+Process a binary file as if it were text; this is equivalent to the
+@samp{--binary-files=text} option.
 
 @item -w
 @itemx --word-regexp
===================================================================
RCS file: RCS/NEWS,v
retrieving revision 2.4
retrieving revision 2.4.0.1
diff -pu -r2.4 -r2.4.0.1
--- NEWS	1999/11/08 00:42:29	2.4
+++ NEWS	2000/01/05 06:35:00	2.4.0.1
@@ -1,3 +1,10 @@
+  - The new option --binary-files=TYPE makes grep assume that a binary input
+    file is of type TYPE.
+    --binary-files='binary' (the default) outputs a 1-line summary of matches.
+    --binary-files='without-match' assumes binary files do not match.
+    --binary-files='text' treats binary files as text
+	(equivalent to -a or --text).
+
 Version 2.4:
 
   - egrep is now equivalent to `grep -E' as required by POSIX,
===================================================================
RCS file: src/grep.c,v
retrieving revision 2.4
retrieving revision 2.4.0.1
diff -pu -r2.4 -r2.4.0.1
--- src/grep.c	1999/11/13 16:31:55	2.4
+++ src/grep.c	2000/01/05 06:35:00	2.4.0.1
@@ -62,12 +62,19 @@ static int mmap_option;
 static char const short_options[] =
 "0123456789A:B:C::EFGHUVX:abcd:e:f:hiLlnqrsuvwxyZz";
 
+/* Non-boolean long options that have no corresponding short equivalents.  */
+enum
+{
+  BINARY_FILES_OPTION = CHAR_MAX + 1
+};
+
 /* Long options equivalences. */
 static struct option long_options[] =
 {
   {"after-context", required_argument, NULL, 'A'},
   {"basic-regexp", no_argument, NULL, 'G'},
   {"before-context", required_argument, NULL, 'B'},
+  {"binary-files", required_argument, NULL, BINARY_FILES_OPTION},
   {"byte-offset", no_argument, NULL, 'b'},
   {"context", optional_argument, NULL, 'C'},
   {"count", no_argument, NULL, 'c'},
@@ -476,7 +483,12 @@ fillbuf (save, stats)
 }
 
 /* Flags controlling the style of output. */
-static int always_text;		/* Assume the input is always text. */
+static enum
+  {
+    BINARY_BINARY_FILES,
+    TEXT_BINARY_FILES,
+    WITHOUT_MATCH_BINARY_FILES
+  } binary_files;		/* How to handle binary files.  */
 static int filename_mask;	/* If zero, output nulls after filenames.  */
 static int out_quiet;		/* Suppress all normal output. */
 static int out_invert;		/* Print nonmatching stuff. */
@@ -729,11 +741,14 @@ grep (fd, file, stats)
     {
       if (! (is_EISDIR (errno, file) && suppress_errors))
 	error (filename, errno);
-      return nlines;
+      return 0;
     }
 
-  not_text = (! (always_text | out_quiet)
+  not_text = (((binary_files == BINARY_BINARY_FILES && !out_quiet)
+	       || binary_files == WITHOUT_MATCH_BINARY_FILES)
 	      && memchr (bufbeg, eol ? '\0' : '\200', buflim - bufbeg));
+  if (not_text && binary_files == WITHOUT_MATCH_BINARY_FILES)
+    return 0;
   done_on_match += not_text;
   out_quiet += not_text;
 
@@ -993,7 +1008,9 @@ Output control:\n\
   -H, --with-filename       print the filename for each match\n\
   -h, --no-filename         suppress the prefixing filename on output\n\
   -q, --quiet, --silent     suppress all normal output\n\
-  -a, --text                do not suppress binary output\n\
+  -a, --text                equivalent to --binary-files=text\n\
+      --binary-files=TYPE   assume that binary files are TYPE\n\
+                            TYPE is 'binary', 'text', or 'without-match'.\n\
   -d, --directories=ACTION  how to handle directories\n\
                             ACTION is 'read', 'recurse', or 'skip'.\n\
   -r, --recursive           equivalent to --directories=recurse.\n\
@@ -1276,7 +1293,7 @@ main (argc, argv)
 	setmatcher (optarg);
 	break;
       case 'a':
-	always_text = 1;
+	binary_files = TEXT_BINARY_FILES;
 	break;
       case 'b':
 	out_byte = 1;
@@ -1370,6 +1387,16 @@ main (argc, argv)
       case 'z':
 	eolbyte = '\0';
 	break;
+      case BINARY_FILES_OPTION:
+	if (strcmp (optarg, "binary") == 0)
+	  binary_files = BINARY_BINARY_FILES;
+	else if (strcmp (optarg, "text") == 0)
+	  binary_files = TEXT_BINARY_FILES;
+	else if (strcmp (optarg, "without-match") == 0)
+	  binary_files = WITHOUT_MATCH_BINARY_FILES;
+	else
+	  fatal (_("unknown binary-files type"), 0);
+	break;
       case 0:
 	/* long options */
 	break;
===================================================================
RCS file: src/system.h,v
retrieving revision 2.4
retrieving revision 2.4.0.1
diff -pu -r2.4 -r2.4.0.1
--- src/system.h	1999/10/12 20:41:25	2.4
+++ src/system.h	2000/01/05 06:35:00	2.4.0.1
@@ -127,11 +127,20 @@ void free();
 #ifndef CHAR_BIT
 # define CHAR_BIT 8
 #endif
+/* The extra casts work around common compiler bugs.  */
+#define TYPE_SIGNED(t) (! ((t) 0 < (t) -1))
+#define TYPE_MINIMUM(t) ((t) (TYPE_SIGNED (t) \
+			      ? ~ (t) 0 << (sizeof (t) * CHAR_BIT - 1) \
+			      : (t) 0))
+#define TYPE_MAXIMUM(t) ((t) (~ (t) 0 - TYPE_MINIMUM (t)))
+#ifndef CHAR_MAX
+# define CHAR_MAX TYPE_MAXIMUM (char)
+#endif
 #ifndef INT_MAX
-# define INT_MAX 2147483647
+# define INT_MAX TYPE_MAXIMUM (int)
 #endif
 #ifndef UCHAR_MAX
-# define UCHAR_MAX 255
+# define UCHAR_MAX TYPE_MAXIMUM (unsigned char)
 #endif
 
 #if !defined(STDC_HEADERS) && defined(HAVE_STRING_H) && defined(HAVE_MEMORY_H)


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200001050659.WAA10612>