From owner-soc-status@FreeBSD.ORG  Sun Jul  3 16:32:59 2011
Return-Path: <owner-soc-status@FreeBSD.ORG>
Delivered-To: soc-status@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DC83F1065672
	for <soc-status@freebsd.org>; Sun,  3 Jul 2011 16:32:59 +0000 (UTC)
	(envelope-from gabor@FreeBSD.org)
Received: from server.mypc.hu (server.mypc.hu [87.229.73.95])
	by mx1.freebsd.org (Postfix) with ESMTP id 6507A8FC15
	for <soc-status@freebsd.org>; Sun,  3 Jul 2011 16:32:59 +0000 (UTC)
Received: from server.mypc.hu (localhost [127.0.0.1])
	by server.mypc.hu (Postfix) with ESMTP id CD48114E59DD
	for <soc-status@freebsd.org>; Sun,  3 Jul 2011 18:16:16 +0200 (CEST)
X-Virus-Scanned: amavisd-new at server.mypc.hu
Received: from server.mypc.hu ([127.0.0.1])
	by server.mypc.hu (server.mypc.hu [127.0.0.1]) (amavisd-new, port 10024)
	with LMTP id 23PyCgUHVmbm for <soc-status@freebsd.org>;
	Sun,  3 Jul 2011 18:16:14 +0200 (CEST)
Received: from [193.137.158.160] (unknown [193.137.158.160])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by server.mypc.hu (Postfix) with ESMTPSA id 8E2C914E59DA
	for <soc-status@freebsd.org>; Sun,  3 Jul 2011 18:16:14 +0200 (CEST)
Message-ID: <4E1095D5.6020900@FreeBSD.org>
Date: Sun, 03 Jul 2011 17:16:21 +0100
From: Gabor Kovesdan <gabor@FreeBSD.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
	rv:5.0) Gecko/20110624 Thunderbird/5.0
MIME-Version: 1.0
To: soc-status@freebsd.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: regex status report #6
X-BeenThere: soc-status@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Summer of Code Status Reports and Discussion <soc-status.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/soc-status>,
	<mailto:soc-status-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/soc-status>
List-Post: <mailto:soc-status@freebsd.org>
List-Help: <mailto:soc-status-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/soc-status>,
	<mailto:soc-status-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 03 Jul 2011 16:33:00 -0000

Hi,

although the optimization code does not work yet, I think I've made good 
progress this week. As I said before, I took the quick search algorithm 
code from BSD grep and started to refactor it for TRE. It became much 
more complex because of the more general domain. It has to support more 
POSIX regex features and support single-byte, multi-byte and wide 
characters. When wide character support is enabled, the alphabet is much 
bigger than for single bytes so the bad character shift table cannot be 
an array any more because it may take 4 bytes x 1 bytes at least. 
Fortunately, there are only as many distinct values as the distinct 
characters in the pattern and this is not so many, so I wrote a 
hashtable to store and quickly look up these when running the quick 
search algorithm. I'll have to see how it performs with this overhead. 
Then the idea is to check Boyer-Moore algorithm and maybe 
Apostolico-Giancarlo that theoretically uses the less comparisons among 
all of the fixed string matching algorithms. This is quite an experiment 
and I don't know if this shortcut will practically beat the full regex 
engine for fixed string regexes because of the wchar-related overhead 
but still it will be used later as a heuristic to match more complex 
patterns and only apply the heavy algorithm on smaller contexts so this 
code will surely be useful.

To summarize, at the moment I'm close to finish the fixed string 
matching but some debugging is missing.

Gabor