Date:      Tue, 16 Jun 2009 18:32:43 -0500
From:      Jeffrey Goldberg <jeffrey@goldmark.org>
To:        Gary Kline <kline@thought.org>
Cc:        FreeBSD Questions <freebsd-questions@freebsd.org>
Subject:   Re: feedback, comments on this php-delimiter scrubbing program?
Message-ID:  <57E07CDA-AA9E-4A8F-91BC-3BF90177CA3A@goldmark.org>
In-Reply-To: <20090616170244.GA40934@thought.org>
References:  <20090616012114.GA38011@thought.org> <200906151857.45945.mel.flynn%2Bfbsd.questions@mailing.thruhere.net> <20090616153040.GA40540@thought.org> <20090616170244.GA40934@thought.org>

--Apple-Mail-1486-599991389
Content-Type: text/plain;
	charset=US-ASCII;
	format=flowed;
	delsp=yes
Content-Transfer-Encoding: 7bit

On Jun 16, 2009, at 12:02 PM, Gary Kline wrote:

> 	this works, but still gives a warning.  it's sloppy coding, but
> 	as a second version...

You've got some superfluous tests for EOF in some places, and you may  
also be missing some.

Your approach has been to "look ahead" with an extra getc() when you
come across an interesting character.  I recommended that, rather than
looking ahead, you keep a variable "state" to track where you are
(and where you have very recently been).

I haven't tried your code, but I suspect that it behaves incorrectly  
with input

   (1) that has a '<' as a final character
   (2) that includes things like "<<<<?"
   (3) that includes things like "??>"

There is a systematic (if a bit tedious) way to make sure that you  
check every condition.  When you've worked enough on this, you can  
peek at an answer which I've attached.

(For the rest of you, I know that it would be more efficient to make
the big switch on state instead of on input character, but for
pedagogical reasons I did it the other way around.  I deliberately
avoided other available tunings.)

The extensive comments in the code should make it clear what is going  
on.  Once you understand the concepts here it should be very easy to  
write code to do similar things in the future.

-j



-- 
Jeffrey Goldberg                        http://www.goldmark.org/jeff/


--Apple-Mail-1486-599991389
Content-Disposition: attachment;
	filename=gkline.c
Content-Type: application/octet-stream;
	x-unix-mode=0644;
	name="gkline.c"
Content-Transfer-Encoding: 7bit

/* Simple filter that strips everything between <? and ?>, delimiters
 * included.  Nesting is not handled.  Exits with 0 if end of input
 * occurs outside of the PHP stuff, and with 1 otherwise.
 *
 * Copyright 2009 Jeffrey Goldberg <jeffrey@goldmark.org>
 * free to use and distribute under the BSD license
 */

/* Special cases need to be considered:
 *   (1) input ends with "<" as the last character
 *   (2) Input contains "<<<<?" 
 *   (3) Input contains "???>"
 *   and many more 
 *
 *   Overall strategy is to recognize that we are interested in five
 *   types of characters and four different "states"
 *   The character types are
 *    1. "<"
 *    2. "?"
 *    3. ">"
 *    4. "EOF" (yes, it's not a character, but let me call it that)
 *    5. any other character
 *
 *    The states keep track of what we have found already.
 *    There are 4 of them:
 *
 *    (1) Outside PHP with nothing special going on
 *    (2) We were outside PHP but found a '<'
 *    (3) We are inside the PHP
 *    (4) We are inside the PHP and have found a '?' 
 *
 *    It's easy to forget, but important to remember, that in each state
 *    we might encounter any character.  This gives us 5 X 4 different
 *    combinations, which each require their own behavior.  The way to
 *    make sure that you take care of all 20 possibilities is to
 *    explicitly spell them out.
 */

#include <stdlib.h>
#include <stdio.h>

/* define the various values for state
 * We could use "enum state {...}", but I'm old fashioned. */
#define OUTSIDE		(1)	// in normal text
#define AFTER_LT	(2)	// found "<", looking for "?"
#define INSIDE		(3)	// between <? and ?> 
#define AFTER_Q		(4)	// was INSIDE and after "?"

int main(void) {
  int c;	// because of EOF we need to make this an integer
  unsigned char state;

  state = OUTSIDE; /* set starting state */
  for (;;) /* loop forever; a NUL byte must not end it, only the EOF case exits */
    switch (c = getchar()) {
      case '<':
	  if (state == OUTSIDE) state = AFTER_LT;
	  else if (state == AFTER_LT) putchar('<'); // print previous '<'
	  else if (state == INSIDE) ; /* stay in state, don't print */
	  else if (state == AFTER_Q) state = INSIDE;
	  break;

      case '?':
	  if (state == AFTER_LT) state = INSIDE;
	  else if (state == INSIDE) state = AFTER_Q;
	  else if (state == AFTER_Q) ; /* stay in same state, don't print */ 
	  else if (state == OUTSIDE) putchar(c); /* and stay in same state */
	  break;

      case '>':
	  if (state == AFTER_Q) state = OUTSIDE;
	  else if (state == AFTER_LT) {
	    putchar('<'); putchar(c);
	    state = OUTSIDE; }
	  else if (state == INSIDE) ; /* same state, don't print */
	  else if (state == OUTSIDE) putchar(c);
	  break;

      case EOF:
	  if (state == OUTSIDE) exit(0);
	  else if (state == AFTER_LT) {
	    putchar('<');
	    exit(0); }
	  else exit(1);

      default: /* for normal characters */
	  if (state == OUTSIDE) putchar(c);
	  else if (state == INSIDE) ; /* same state, don't print */
	  else if (state == AFTER_LT) {
	    putchar('<');
	    putchar(c);
	    state = OUTSIDE; }
	  else if (state == AFTER_Q) state = INSIDE;
    }

  exit(3); /* this should never be reached, but it eliminates a gcc warning */
}

--Apple-Mail-1486-599991389
Content-Type: text/plain;
	charset=US-ASCII;
	format=flowed
Content-Transfer-Encoding: 7bit




--Apple-Mail-1486-599991389--


