To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
From: Baptiste Daroussin
Subject: git: a74c77cc7bed - main - grep(1): optimize -w/--word-regexp word boundary check
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
Sender: owner-dev-commits-src-all@FreeBSD.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: bapt
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: a74c77cc7bed8dba50e976a7be2aa0094ee27b61
Auto-Submitted: auto-generated
Date: Sun, 14 Jun 2026 18:14:35 +0000
Message-Id:
 <6a2eef8b.1f3da.117cc007@gitrepo.freebsd.org>

The branch main has been updated by bapt:

URL: https://cgit.FreeBSD.org/src/commit/?id=a74c77cc7bed8dba50e976a7be2aa0094ee27b61

commit a74c77cc7bed8dba50e976a7be2aa0094ee27b61
Author:     Baptiste Daroussin
AuthorDate: 2026-06-10 14:41:39 +0000
Commit:     Baptiste Daroussin
CommitDate: 2026-06-14 18:14:31 +0000

    grep(1): optimize -w/--word-regexp word boundary check

    The -w option checks word boundaries before and after each potential
    match by decoding the adjacent character. This was done via the
    heavyweight sscanf(3) with "%lc", which goes through the full scanf
    parser and locale-aware mbrtowc(3) machinery even for simple ASCII.

    Replace with a three-tier fast path:
    1. ASCII bytes (< 0x80): simple isalnum(3) / '_' comparison
    2. UTF-8 continuation bytes (0x80-0xBF): interior bytes of a
       multi-byte character are always word characters -> no further
       decoding needed
    3. Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly
       instead of sscanf(3)/%lc, avoiding scanf parser overhead

    Benchmark with ministat(1) (10 runs each):

    Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):
      Difference at 95.0% confidence: -15.3% +/- 3.1%

    Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):
      Difference at 95.0% confidence: -11.2% +/- 4.7%

    Normal -w (500k lines, -w 'the'):
      Difference at 95.0% confidence: -18.1% +/- 3.6%

    French text (100k lines, -w accented 'ete'):
      Difference at 95.0% confidence: -18.0% +/- 4.1%

    Non -w case shows no regression.

    Reviewed by:            kevans
    Differential Revision:  https://reviews.freebsd.org/D57587
---
 usr.bin/grep/util.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/usr.bin/grep/util.c b/usr.bin/grep/util.c
index dbb21dcfd78e..bbb174370bd5 100644
--- a/usr.bin/grep/util.c
+++ b/usr.bin/grep/util.c
@@ -490,6 +490,35 @@ litexec(const struct pat *pat, const char *string, size_t nmatch,
 
 #define	iswword(x)	(iswalnum((x)) || (x) == L'_')
 
+/*
+ * Check if the byte at the given offset in the line is a word character
+ * (alphanumeric or _). Handles ASCII fast path, UTF-8 continuation bytes,
+ * and multi-byte decoding via mbrtowc(3).
+ */
+static bool
+iswordchar(const char *dat, size_t len, size_t offset)
+{
+	unsigned char ch;
+	mbstate_t mbstate;
+	wchar_t wc;
+	size_t n;
+
+	if (offset >= len)
+		return (false);
+
+	ch = (unsigned char)dat[offset];
+	if (ch < 0x80)
+		return (isalnum(ch) || ch == '_');
+	if ((ch & 0xC0) == 0x80)
+		/* Continuation byte: part of a word */
+		return (true);
+
+	/* Multi-byte start byte: decode with mbrtowc */
+	memset(&mbstate, 0, sizeof(mbstate));
+	n = mbrtowc(&wc, &dat[offset], MB_CUR_MAX, &mbstate);
+	return (n == (size_t)-1 || n == (size_t)-2 || iswword(wc));
+}
+
 /*
  * Processes a line comparing it with the specified patterns.  Each pattern
  * is looped to be compared along with the full string, saving each and every
@@ -501,7 +530,6 @@ static bool
 procline(struct parsec *pc)
 {
 	regmatch_t pmatch, lastmatch, chkmatch;
-	wchar_t wbegin, wend;
 	size_t st, nst;
 	unsigned int i;
 	int r = 0, leflags = eflags;
@@ -567,18 +595,14 @@ procline(struct parsec *pc)
 				continue;
 			/* Check for whole word match */
 			if (wflag) {
-				wbegin = wend = L' ';
 				if (pmatch.rm_so != 0 &&
-				    sscanf(&pc->ln.dat[pmatch.rm_so - 1],
-				    "%lc", &wbegin) != 1)
+				    iswordchar(pc->ln.dat, pc->ln.len,
+				    pmatch.rm_so - 1))
 					r = REG_NOMATCH;
-				else if ((size_t)pmatch.rm_eo !=
+				if (r == 0 && (size_t)pmatch.rm_eo !=
 				    pc->ln.len &&
-				    sscanf(&pc->ln.dat[pmatch.rm_eo],
-				    "%lc", &wend) != 1)
-					r = REG_NOMATCH;
-				else if (iswword(wbegin) ||
-				    iswword(wend))
+				    iswordchar(pc->ln.dat, pc->ln.len,
+				    pmatch.rm_eo))
 					r = REG_NOMATCH;
 				/*
 				 * If we're doing whole word matching and we
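
For experimenting with the boundary check outside grep, the three-tier test described in the commit message can be sketched as a standalone pair of helpers. This is only an illustration under stated assumptions: the names word_byte_at() and is_whole_word() are hypothetical and do not appear in util.c, and the sketch bounds mbrtowc(3) by the bytes remaining in the buffer instead of MB_CUR_MAX, which is a choice made for this example rather than something taken from the commit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not in util.c): report whether the byte at
 * `offset` belongs to a word, using the three tiers from the commit.
 */
static bool
word_byte_at(const char *dat, size_t len, size_t offset)
{
	mbstate_t st;
	wchar_t wc;
	size_t n;
	unsigned char ch;

	assert(dat != NULL);
	if (offset >= len)
		return (false);

	ch = (unsigned char)dat[offset];
	if (ch < 0x80)			/* Tier 1: plain ASCII */
		return (isalnum(ch) != 0 || ch == '_');
	if ((ch & 0xC0) == 0x80)	/* Tier 2: UTF-8 continuation byte */
		return (true);

	/* Tier 3: multi-byte start byte, decoded in the current locale. */
	memset(&st, 0, sizeof(st));
	/* Assumption of this sketch: bound the read by the remaining bytes. */
	n = mbrtowc(&wc, dat + offset, len - offset, &st);
	if (n == (size_t)-1 || n == (size_t)-2)
		return (true);	/* undecodable: counted as a word byte, as in the diff */
	return (iswalnum((wint_t)wc) != 0 || wc == L'_');
}

/*
 * Hypothetical helper (not in util.c): true when the match covering
 * bytes [so, eo) has no word character immediately before or after it.
 */
static bool
is_whole_word(const char *dat, size_t len, size_t so, size_t eo)
{
	if (so != 0 && word_byte_at(dat, len, so - 1))
		return (false);
	if (eo != len && word_byte_at(dat, len, eo))
		return (false);
	return (true);
}
```

With these helpers, a match on "cat" at bytes 4..7 of "the cat" counts as a whole word, while the same pattern matched at bytes 3..6 of "concat" does not, because the preceding byte 'n' is alphanumeric. The result for a multi-byte start byte depends on the active locale, since that tier goes through mbrtowc(3).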