Message-Id: <200407120735.i6C7Zx2f005903@repoman.freebsd.org>
From: "Tim J. Robbins"
Date: Mon, 12 Jul 2004 07:35:59 +0000 (UTC)
To: src-committers@FreeBSD.org, cvs-src@FreeBSD.org, cvs-all@FreeBSD.org
Subject: cvs commit: src/lib/libc/regex engine.c regcomp.c regex2.h regexec.c regfree.c
X-FreeBSD-CVS-Branch: HEAD

tjr         2004-07-12 07:35:59 UTC

  FreeBSD src repository

  Modified files:
    lib/libc/regex       engine.c regcomp.c regex2.h regexec.c regfree.c
  Log:
  Make regular expression matching aware of multibyte characters. The general
  idea is that we perform multibyte->wide character conversion while parsing
  and compiling, then convert byte sequences to wide characters when they're
  needed for comparison and stepping through the string during execution.

  As with tr(1), the main complication is to efficiently represent sets of
  characters in bracket expressions.
  The old bitmap representation is replaced by a bitmap for the first 256
  characters combined with a vector of individual wide characters, a vector
  of character ranges (for [A-Z] etc.), and a vector of character classes
  (for [[:alpha:]] etc.).

  One other point of interest is that although the Boyer-Moore algorithm had
  to be disabled in the general multibyte case, it is still enabled for UTF-8
  because of its self-synchronizing nature. This greatly speeds up matching
  by reducing the number of multibyte conversions that need to be done.

  Revision  Changes    Path
  1.14      +92 -40    src/lib/libc/regex/engine.c
  1.32      +253 -259  src/lib/libc/regex/regcomp.c
  1.8       +57 -17    src/lib/libc/regex/regex2.h
  1.6       +64 -3     src/lib/libc/regex/regexec.c
  1.6       +10 -3     src/lib/libc/regex/regfree.c
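
  For illustration only, the bracket-expression representation described in
  the log (a bitmap for the first 256 characters plus vectors of wide
  characters, ranges, and classes) might be sketched as below. The names
  struct cset_sketch, struct wrange, cset_add_byte, and cset_contains are
  hypothetical and are not taken from regex2.h; the lookup order and the
  handling of negated sets are assumptions, not the committed code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical names throughout; not the definitions from regex2.h. */
struct wrange {
    wint_t min;                     /* inclusive lower bound */
    wint_t max;                     /* inclusive upper bound */
};

struct cset_sketch {
    unsigned char bmp[256 / 8];     /* one bit per character below 256 */
    const wint_t *wides;            /* individual wide characters */
    size_t nwides;
    const struct wrange *ranges;    /* ranges such as [A-Z] */
    size_t nranges;
    const wctype_t *types;          /* classes such as [[:alpha:]] */
    size_t ntypes;
    bool invert;                    /* true for a negated set, [^...] */
};

/* Mark a character below 256 as a member by setting its bitmap bit. */
static void cset_add_byte(struct cset_sketch *cs, unsigned char c)
{
    cs->bmp[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* Return true if ch belongs to the set, honouring negation. */
static bool cset_contains(const struct cset_sketch *cs, wint_t ch)
{
    bool hit = false;
    size_t i;

    assert(cs != NULL);

    /* Fast path: small characters are looked up in the bitmap. */
    if ((unsigned long)ch < 256)
        hit = (cs->bmp[ch >> 3] & (1u << (ch & 7))) != 0;

    /* Otherwise fall back to linear scans of the three vectors. */
    for (i = 0; !hit && i < cs->nwides; i++)
        hit = (ch == cs->wides[i]);
    for (i = 0; !hit && i < cs->nranges; i++)
        hit = (cs->ranges[i].min <= ch && ch <= cs->ranges[i].max);
    for (i = 0; !hit && i < cs->ntypes; i++)
        hit = iswctype(ch, cs->types[i]) != 0;

    return hit != cs->invert;
}
```

  In this sketch a caller would fill the bitmap with cset_add_byte for
  characters below 256, point wides, ranges, and types at arrays built while
  parsing the bracket expression, and then test each decoded wide character
  with cset_contains. The bitmap keeps the common single-byte case constant
  time, while the vectors avoid a bitmap sized for the whole wide character
  range.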