From nobody Sun Jun 14 18:14:35 2026
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
From: Baptiste Daroussin
Subject: git: a74c77cc7bed - main - grep(1): optimize -w/--word-regexp word boundary check
List-Id: Commit messages for the main branch of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-main
Sender: owner-dev-commits-src-main@FreeBSD.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: bapt
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: a74c77cc7bed8dba50e976a7be2aa0094ee27b61
Auto-Submitted: auto-generated
Date: Sun, 14 Jun 2026 18:14:35 +0000
Message-Id: 
<6a2eef8b.1f3da.117cc007@gitrepo.freebsd.org>

The branch main has been updated by bapt:

URL: https://cgit.FreeBSD.org/src/commit/?id=a74c77cc7bed8dba50e976a7be2aa0094ee27b61

commit a74c77cc7bed8dba50e976a7be2aa0094ee27b61
Author:     Baptiste Daroussin
AuthorDate: 2026-06-10 14:41:39 +0000
Commit:     Baptiste Daroussin
CommitDate: 2026-06-14 18:14:31 +0000

    grep(1): optimize -w/--word-regexp word boundary check

    The -w option checks word boundaries before and after each potential
    match by decoding the adjacent character.  This was done via the
    heavyweight sscanf(3) with "%lc", which goes through the full scanf
    parser and locale-aware mbrtowc(3) machinery even for simple ASCII.

    Replace with a three-tier fast path:

    1. ASCII bytes (< 0x80): simple isalnum(3) / '_' comparison
    2. UTF-8 continuation bytes (0x80-0xBF): interior bytes of a
       multi-byte character are always word characters -> no further
       decoding needed
    3. Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly
       instead of sscanf(3)/%lc, avoiding scanf parser overhead

    Benchmark with ministat(1) (10 runs each):

    Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):
      Difference at 95.0% confidence: -15.3% +/- 3.1%

    Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):
      Difference at 95.0% confidence: -11.2% +/- 4.7%

    Normal -w (500k lines, -w 'the'):
      Difference at 95.0% confidence: -18.1% +/- 3.6%

    French text (100k lines, -w accented 'ete'):
      Difference at 95.0% confidence: -18.0% +/- 4.1%

    Non -w case shows no regression.
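
The three tiers listed above can be sketched outside of grep as a small standalone C fragment. This is only an illustration under stated assumptions, not the committed code: the names word_byte_at() and is_whole_word() are hypothetical, and the sketch bounds mbrtowc(3) by the bytes remaining in the buffer (len - offset), an assumption made here so the fragment stays within its input.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical stand-in for the helper described in the commit message.
 * Reports whether the byte at 'offset' belongs to a word character.
 */
bool
word_byte_at(const char *dat, size_t len, size_t offset)
{
	unsigned char ch;
	mbstate_t mbs;
	wchar_t wc;
	size_t n;

	assert(dat != NULL);
	if (offset >= len)
		return (false);		/* past the end: not a word character */

	ch = (unsigned char)dat[offset];
	if (ch < 0x80)			/* tier 1: plain ASCII */
		return (isalnum(ch) != 0 || ch == '_');
	if ((ch & 0xC0) == 0x80)	/* tier 2: UTF-8 continuation byte */
		return (true);

	/* tier 3: lead byte, decode one character in the current locale */
	memset(&mbs, 0, sizeof(mbs));
	n = mbrtowc(&wc, dat + offset, len - offset, &mbs);
	if (n == (size_t)-1 || n == (size_t)-2)
		return (true);		/* undecodable: counted as a word byte */
	return (iswalnum((wint_t)wc) != 0 || wc == L'_');
}

/*
 * Hypothetical wrapper: a match [so, eo) is a whole word when neither
 * neighbouring byte is a word character.
 */
bool
is_whole_word(const char *dat, size_t len, size_t so, size_t eo)
{
	if (so != 0 && word_byte_at(dat, len, so - 1))
		return (false);
	if (eo != len && word_byte_at(dat, len, eo))
		return (false);
	return (true);
}
```

For example, is_whole_word("the cat", 7, 4, 7) accepts "cat", while is_whole_word("concat", 6, 3, 6) rejects it because the preceding byte 'n' is a word character. Tier 2 classifies by byte rather than by character, so in this sketch a match preceded by any multi-byte character, including multi-byte punctuation, is treated as preceded by a word character. Tier 3 depends on the active locale, so results for lead bytes may differ between the C locale and a UTF-8 locale.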
    Reviewed by:            kevans
    Differential Revision:  https://reviews.freebsd.org/D57587
---
 usr.bin/grep/util.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/usr.bin/grep/util.c b/usr.bin/grep/util.c
index dbb21dcfd78e..bbb174370bd5 100644
--- a/usr.bin/grep/util.c
+++ b/usr.bin/grep/util.c
@@ -490,6 +490,35 @@ litexec(const struct pat *pat, const char *string, size_t nmatch,
 
 #define iswword(x)	(iswalnum((x)) || (x) == L'_')
 
+/*
+ * Check if the byte at the given offset in the line is a word character
+ * (alphanumeric or _).  Handles ASCII fast path, UTF-8 continuation bytes,
+ * and multi-byte decoding via mbrtowc(3).
+ */
+static bool
+iswordchar(const char *dat, size_t len, size_t offset)
+{
+	unsigned char ch;
+	mbstate_t mbstate;
+	wchar_t wc;
+	size_t n;
+
+	if (offset >= len)
+		return (false);
+
+	ch = (unsigned char)dat[offset];
+	if (ch < 0x80)
+		return (isalnum(ch) || ch == '_');
+	if ((ch & 0xC0) == 0x80)
+		/* Continuation byte: part of a word */
+		return (true);
+
+	/* Multi-byte start byte: decode with mbrtowc */
+	memset(&mbstate, 0, sizeof(mbstate));
+	n = mbrtowc(&wc, &dat[offset], MB_CUR_MAX, &mbstate);
+	return (n == (size_t)-1 || n == (size_t)-2 || iswword(wc));
+}
+
 /*
  * Processes a line comparing it with the specified patterns.  Each pattern
  * is looped to be compared along with the full string, saving each and every
@@ -501,7 +530,6 @@ static bool
 procline(struct parsec *pc)
 {
 	regmatch_t pmatch, lastmatch, chkmatch;
-	wchar_t wbegin, wend;
 	size_t st, nst;
 	unsigned int i;
 	int r = 0, leflags = eflags;
@@ -567,18 +595,14 @@ procline(struct parsec *pc)
 				continue;
 			/* Check for whole word match */
 			if (wflag) {
-				wbegin = wend = L' ';
 				if (pmatch.rm_so != 0 &&
-				    sscanf(&pc->ln.dat[pmatch.rm_so - 1],
-				    "%lc", &wbegin) != 1)
+				    iswordchar(pc->ln.dat, pc->ln.len,
+				    pmatch.rm_so - 1))
 					r = REG_NOMATCH;
-				else if ((size_t)pmatch.rm_eo !=
+				if (r == 0 && (size_t)pmatch.rm_eo !=
 				    pc->ln.len &&
-				    sscanf(&pc->ln.dat[pmatch.rm_eo],
-				    "%lc", &wend) != 1)
-					r = REG_NOMATCH;
-				else if (iswword(wbegin) ||
-				    iswword(wend))
+				    iswordchar(pc->ln.dat, pc->ln.len,
+				    pmatch.rm_eo))
 					r = REG_NOMATCH;
 				/*
 				 * If we're doing whole word matching and we