Date:      Thu, 14 Oct 1999 20:03:11 -0700
From:      Darryl Okahata <darrylo@sr.hp.com>
To:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Search a symbol in the source tree 
Message-ID:  <199910150303.UAA04572@mina.sr.hp.com>
In-Reply-To: Your message of "Thu, 14 Oct 1999 07:38:21 +1000."

Peter Jeremy <jeremyp@gsmx07.alcatel.com.au> wrote:

> I use id-utils (/usr/ports/devel/id-utils).  It builds a single database
> file and has a variety of tools (including e-lisp) to search the database.
> 
> Since global(1) was mentioned in this thread, I decided to have a look
> at it.  It seems much slower and my sample (samba-2.0.5a) database was
> nearly 20 times larger.

     Well, as a longtime user of mkid, mkid2, and mkid3 (the
predecessors to id-utils), here are some comments on the various
packages:

[ Note: in the following, I'm not quite comparing apples and apples.
  However, I'm too lazy to do a strict comparison, but this should still
  give people a vague idea of each package's performance.  Take the
  following as you will, with a grain of salt. ]

* As a baseline, let's look at plain grep.  First, generate a list of
  files to search (this assumes that we don't want to look through all
  files, including Makefiles, man pages, etc.):

	cd /usr/src
	find * -type f | time grep '\.[chsSly][cxp]*$' > /tmp/foo

  Now, on my system (-current from Aug. 21, a PII 300MHz w/128MB & a F/W
  SCSI disk), this takes around 50 seconds (real time):

	xargs grep ptrace < /tmp/foo

  Not too bad, but not great, either.  Let's try looking for utmp.h:

	xargs grep 'utmp\.h' < /tmp/foo

  This takes around a minute.

  Now, let's look at "grep -R":

	cd /usr/src
	grep -R ptrace .			# 2 minutes 42 seconds
	grep -R 'utmp\.h' .			# 2 minutes 40 seconds

  In other words, with grep, you need to limit your searches.  Also,
  "grep -R" doesn't work very well if you also happen to have glimpse,
  global, or mkid/id-utils indices under /usr/src.
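
  The list-then-search approach above can be wrapped in two small shell
  functions.  This is only a sketch: the names make_list and search_list
  are made up for illustration, and it assumes file names without spaces
  or newlines (plain xargs splits on whitespace).  Because only the
  listed files are searched, index files left under the tree by glimpse,
  global, or mkid are never touched.

```shell
# make_list DIR LISTFILE
# Write the names of the C/C++/assembly/lex/yacc sources under DIR
# to LISTFILE, using the same suffix pattern as the grep above.
make_list() {
    (cd "$1" && find . -type f | grep '\.[chsSly][cxp]*$') > "$2"
}

# search_list DIR LISTFILE PATTERN
# Grep every listed file for PATTERN.  The extra /dev/null argument
# makes grep print the file name even if xargs hands it a single file.
search_list() {
    (cd "$1" && xargs grep -n -- "$3" /dev/null) < "$2"
}
```

  Usage would look something like "make_list /usr/src /tmp/foo" once,
  followed by "search_list /usr/src /tmp/foo 'utmp\.h'" for each search.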

* Global is OK (does not appear to support C++, though), but generates
  HUGE databases (by default).  For /usr/src, the databases are around
  as large as the total size of the indexed source files (the gtags "-c"
  option was not used).  Indexing is slow, but searching seems to be
  quite fast.  In particular, "global -x name" is nice, because it just
  returns where "name" is defined, as opposed to a plain grep which can
  also return matches on "fooname" and "namebar", as well as where
  "name" is used.  However, global appears to be optimized for locating
  where a function is defined.  It appears to be difficult to locate,
  for example, where a preprocessor macro is defined; except for "global
  -g" (which is often too slow to be usable), I haven't found a way of
  getting global to search through .h header files.
  
  On my system, indexing /usr/src took around an hour, and the indices
  took up around 240MB+ (this was with "gtags" and not "gtags -c").
  This is 20+ times larger than a glimpse or mkid/id-utils database.

  It's interesting to note that "global -x -g ptrace" takes over two
  minutes, more than twice as long as the source-file-only grep (around
  50 seconds).  However, "global -x -s ptrace" is very fast (under 1
  second).

  Searching for ptrace generates two (2) lines of output, in well under
  one second:

	global -x ptrace

  as do these:

	global -x -s ptrace
	global -x -s uap

  Looking for where "utmp.h" is used:

	global -x -s utmp.h

  This takes more than 2212 seconds (over 36 minutes!), and outputs
  nothing.  Well, let's try this instead:

	global -x -g utmp.h

  This works, taking a bit over a minute and a half.  However, plain
  grep is faster (note that, as global searches through source files
  only, you have to compare it to the source-file-only grep, and not
  "grep -R").

  However, looking for the definition of a preprocessor macro is a
  pain.  Try looking for KBD_DATA_PORT:

	global -x KBD_DATA_PORT

  This runs quickly, but displays nothing.  Next, try:

	global -x -s KBD_DATA_PORT

  This runs quickly, and shows where this is used in .c source files.
  However, where's the definition?  It's not shown.

  This works:

	global -x -g KBD_DATA_PORT

  However, this takes around two minutes to run, which is much slower
  than a plain grep.
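
  For the macro case, one workaround is a grep that matches only #define
  lines.  A minimal sketch, assuming the macro is defined in a .c or .h
  file; the function name find_define is hypothetical, and it costs about
  as much as any other grep over the same set of files:

```shell
# find_define NAME [DIR]
# Print file:line:text for each "#define NAME" under DIR (default ".").
# The trailing group rejects longer names such as NAME_FOO.
find_define() {
    find "${2:-.}" -type f -name '*.[ch]' -print |
        xargs grep -n -E \
            "^[[:space:]]*#[[:space:]]*define[[:space:]]+$1([^[:alnum:]_]|\$)" \
            /dev/null
}
```

  For example, "find_define KBD_DATA_PORT /usr/src/sys" would print only
  the defining line(s), not the places where the macro is used.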

* Glimpse is a general-purpose text indexer which can be used to index
  source files.  It's basically an intelligent grep, but it works quite
  well.  Unlike mkid, you can search through comments and non-source
  files (like Makefiles, man pages, README's, etc.).

  On my system, indexing /usr/src took around 6 minutes (using the "-M
  20" option), and the indices took up around 10MB.

  On my system, searching for ptrace took 35 seconds, with 505 lines of
  output (ChangeLogs, man pages, etc. account for the extra lines):

	glimpse -w ptrace

  Searching for uap takes around 21 seconds:

	glimpse -w uap

  Looking for utmp.h:

	glimpse -y -w utmp.h

  This takes a bit over 45 seconds.  However, glimpse searched through
  (and displayed hits in) non-source files, like configure,
  configure.in, Makefiles, etc.

  It is possible to have glimpse exclude certain files and index only
  those files you want indexed.  However, I don't have the time to
  configure and test this.  Perhaps someone else will do this.

* Mkid/mkid2/mkid3/id-utils appear to generate the smallest index
  databases, and they run quickly.  They're great for looking up where a
  particular identifier is used (e.g., "gid ptrace", which is an
  intelligent grep), but they can't tell you only where something is
  defined.  The place where something is defined is
  output along with every place that it's used.  You're basically doing
  a very intelligent grep.  However, grep'ing via gid is *MUCH* faster
  than "global -g" (it's like 100X faster); on the other hand, "global
  -s" is often comparable to gid.

  Mkid and friends can also (supposedly, as I've never tried it) tell
  you where a number occurs, in any base.  If you know the number 100 is 
  somewhere in your source code, mkid can show you where it occurs, as
  "100" (decimal), "64" (hex), or "144" (octal).

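  The three spellings in that example are the same value, which is easy
  to confirm with printf.  This is only an arithmetic check (show_bases
  is a made-up helper name); it says nothing about how mkid matches
  numbers internally:

```shell
# show_bases N
# Print N (given in decimal) in decimal, hexadecimal, and octal.
show_bases() {
    printf '%d (decimal), %x (hex), %o (octal)\n' "$1" "$1" "$1"
}

show_bases 100    # prints: 100 (decimal), 64 (hex), 144 (octal)
```
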
  Only source files are indexed, as mkid & friends only know about
  certain languages (C, C++, & assembly being a few).  Also, comments
  aren't indexed, although gid will display hits in comments (because
  the file being grep'd contains a hit in a non-comment line).

  However, the "id-utils-3.2" package for -current dumps core when used
  to index /usr/src.  I don't have the time to track this down.

  On my system, indexing /usr/src using mkid3 took a bit over 2 minutes,
  and the indices took up around 9.1MB.  The index was built using:

	find . -type f | grep '\.[chsSly][cxp]*$' | time mkid -

  (Note: id-utils is further broken, since it cannot take the list of
  files to index from stdin or a file -- this example is for mkid3.)
  Both glimpse and global index more files by default (in the case of
  glimpse, Makefiles, CVS/Root, CVS/Repository, COPYRIGHT files,
  etc. were indexed).

  It's VERY fast.  On my system, searching for ptrace takes under 0.5 sec.:

	gid ptrace

  Yup, that's under one-half second, with 195 lines of output.

  Let's try looking for where "utmp.h" is used:

	gid utmp.h

  This takes around 2.5 seconds.


***** Bottom line:

     For general-purpose use, mkid and friends are best, as long as you
don't need to search through comments or non-source files (Makefiles,
README's, etc.).  The database index is reasonably small, the
indexing time is relatively quick, and the search times are often
comparable to or better than those of global.  However, mkid and friends 
can't just tell you where something is defined; they can only show where 
it is defined and used.

     If you need to search through comments, or need to search
non-source files, glimpse is good.  The index is slightly larger than
that of mkid/id-utils (around 10MB versus 9.1MB), and the search speed
is decent, but not great.  For many
searches, it's faster than plain grep, although it can be comparable to
grep in some cases.

     I've got mixed feelings about global.  On the one hand, you can't
beat it for locating where a function is defined, and it's very good at
showing where a variable is used.  However, for best results, you have
to remember to use different options when searching for function
definitions, identifier usage, preprocessor definitions, etc., and you
may still have to resort to doing a full grep because, for some
searches, global is too slow.  The indices for global are HUGE, and
indexing takes much longer than other approaches.  I'm surprised that
global is part of the base distribution, instead of being a port.

--
	Darryl Okahata
	darrylo@sr.hp.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Hewlett-Packard, or of the
little green men that have been following him all day.





