From: Kyle Evans
Date: Thu, 3 May 2018 09:41:25 -0500
In-Reply-To: <08d32caa-aa44-cff7-d09c-af2444674958@freebsd.org>
Subject: Re: grep extremely slow for LC_CTYPE=C?
To: Stefan Esser
Cc: FreeBSD Current

On Thu, May 3, 2018 at 9:08 AM, Stefan Esser wrote:
> Hi all,
>
> while working on a new portmaster version, I found that bsdgrep is much
> faster in an UTF-8 locale than in the C locale, much to my surprise.
>
> I have uploaded a small shell-script with test data that can be fetched
> from:
>
> https://people.freebsd.org/~se/grep-test.txz
>
> The script uses "grep -v -f patternfile datafile" to select from datafiles
> the lines that are not matched by the contents of patternfile:
>
> #-------------------------------------------------------------------
> #!/bin/sh
>
> LANG=en_US.UTF-8
> LC_CTYPE=en_US.UTF-8
>
> export LANG LC_CTYPE
>
> time grep -v -f grep-test-pattern grep-test-data
>
> LANG=C
> LC_CTYPE=C
> #unset LANG LC_CTYPE # is an alternative leading to the same result ...
>
> time grep -v -f grep-test-pattern grep-test-data
> #-------------------------------------------------------------------
>
> The first "grep" needs 3.5 seconds to finish on my system, but the second
> one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
> bother to check whether it finishes at all).
>
> Is this a bug in grep?
>
> Maybe there is something odd in the data file (loading the pattern is not
> slower with LC_CTYPE=C, it takes 0.8 seconds on my system), but this is a
> problem that was observed with "real" data, not a specifically constructed
> worst case.
>
> Any ideas what's causing this behavior?
>
> I'm currently setting the UTF-8 locale as in the first invocation above
> to make grep run in reasonable time, but I'd expect it to be faster in
> the C locale ...
>
> Regards, STefan

Hmm... what does `grep -V` look like, just to confirm? These are the
results on my local system:

root@viper:/tmp/grep# ./grep-test.sh
All/mpfr-3.1.7.tgz
        0.10 real         0.10 user         0.00 sys
All/mpfr-3.1.7.tgz
        0.09 real         0.08 user         0.00 sys

But I don't immediately recall if I have local modifications in
regex(3)/bsdgrep that might have affected this. =(

Thanks,

Kyle Evans
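
For anyone repeating the comparison, it may help to pin the locale per
invocation and to record the grep version in the same run. What follows
is only a minimal sketch, not the script from the archive. It assumes
that grep-test-pattern and grep-test-data sit in the current directory
and that an external time(1) utility is installed. run_in_locale is a
hypothetical helper name introduced here for illustration.

```sh
#!/bin/sh
# Sketch: run one command with every locale category forced to one value.
# LC_ALL takes precedence over both LANG and LC_CTYPE, so a variable that
# is still exported in the calling shell cannot leak into the measurement.
# run_in_locale is a hypothetical helper, not part of the original script.
run_in_locale() {
    locale_name=$1
    shift
    env LC_ALL="$locale_name" "$@"
}

# File names as used in the quoted test script; adjust if yours differ.
pattern=grep-test-pattern
data=grep-test-data

if [ -r "$pattern" ] && [ -r "$data" ]; then
    grep -V    # record which grep implementation is being measured
    for loc in en_US.UTF-8 C; do
        echo "== LC_ALL=$loc"
        # Matched lines are discarded; time(1) reports on stderr.
        run_in_locale "$loc" time grep -v -f "$pattern" "$data" >/dev/null
    done
fi
```

Using env(1) rather than a plain assignment keeps the setting local to
the one command, so the two timings cannot influence each other.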