Date: Sun, 03 Jul 2011 17:16:21 +0100 From: Gabor Kovesdan <gabor@FreeBSD.org> To: soc-status@freebsd.org Subject: regex status report #6 Message-ID: <4E1095D5.6020900@FreeBSD.org>
next in thread | raw e-mail | index | archive | help
Hi, although the optimization code does not work yet, I think I've made good progress this week. As I said before, I took the quick search algorithm code from BSD grep and started to refactor it for TRE. It became much more complex because of the more general domain. It has to support more POSIX regex features and support single-byte, multi-byte and wide characters. When wide character support is enabled, the alphabet is much bigger than for single bytes so the bad character shift table cannot be an array any more because it may take 4 bytes x 1 bytes at least. Fortunately, there are only as many distinct values as the distinct characters in the pattern and this is not so many, so I wrote a hashtable to store and quickly look up these when running the quick search algorithm. I'll have to see how it performs with this overhead. Then the idea is to check Boyer-Moore algorithm and maybe Apostolico-Giancarlo that theoretically uses the less comparisons among all of the fixed string matching algorithms. This is quite an experiment and I don't know if this shortcut will practically beat the full regex engine for fixed string regexes because of the wchar-related overhead but still it will be used later as a heuristic to match more complex patterns and only apply the heavy algorithm on smaller contexts so this code will surely be useful. To summarize, at the moment I'm close to finish the fixed string matching but some debugging is missing. Gabor
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4E1095D5.6020900>