From owner-soc-status@FreeBSD.ORG Sun Jul 3 16:32:59 2011 Return-Path: Delivered-To: soc-status@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DC83F1065672 for ; Sun, 3 Jul 2011 16:32:59 +0000 (UTC) (envelope-from gabor@FreeBSD.org) Received: from server.mypc.hu (server.mypc.hu [87.229.73.95]) by mx1.freebsd.org (Postfix) with ESMTP id 6507A8FC15 for ; Sun, 3 Jul 2011 16:32:59 +0000 (UTC) Received: from server.mypc.hu (localhost [127.0.0.1]) by server.mypc.hu (Postfix) with ESMTP id CD48114E59DD for ; Sun, 3 Jul 2011 18:16:16 +0200 (CEST) X-Virus-Scanned: amavisd-new at server.mypc.hu Received: from server.mypc.hu ([127.0.0.1]) by server.mypc.hu (server.mypc.hu [127.0.0.1]) (amavisd-new, port 10024) with LMTP id 23PyCgUHVmbm for ; Sun, 3 Jul 2011 18:16:14 +0200 (CEST) Received: from [193.137.158.160] (unknown [193.137.158.160]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by server.mypc.hu (Postfix) with ESMTPSA id 8E2C914E59DA for ; Sun, 3 Jul 2011 18:16:14 +0200 (CEST) Message-ID: <4E1095D5.6020900@FreeBSD.org> Date: Sun, 03 Jul 2011 17:16:21 +0100 From: Gabor Kovesdan User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20110624 Thunderbird/5.0 MIME-Version: 1.0 To: soc-status@freebsd.org Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: regex status report #6 X-BeenThere: soc-status@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Summer of Code Status Reports and Discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 03 Jul 2011 16:33:00 -0000 Hi, although the optimization code does not work yet, I think I've made good progress this week. As I said before, I took the quick search algorithm code from BSD grep and started to refactor it for TRE. It became much more complex because of the more general domain. It has to support more POSIX regex features and support single-byte, multi-byte and wide characters. When wide character support is enabled, the alphabet is much bigger than for single bytes so the bad character shift table cannot be an array any more because it may take 4 bytes x 1 bytes at least. Fortunately, there are only as many distinct values as the distinct characters in the pattern and this is not so many, so I wrote a hashtable to store and quickly look up these when running the quick search algorithm. I'll have to see how it performs with this overhead. Then the idea is to check Boyer-Moore algorithm and maybe Apostolico-Giancarlo that theoretically uses the less comparisons among all of the fixed string matching algorithms. This is quite an experiment and I don't know if this shortcut will practically beat the full regex engine for fixed string regexes because of the wchar-related overhead but still it will be used later as a heuristic to match more complex patterns and only apply the heavy algorithm on smaller contexts so this code will surely be useful. To summarize, at the moment I'm close to finish the fixed string matching but some debugging is missing. Gabor