Date: Thu, 3 May 2018 16:08:24 +0200
From: Stefan Esser <se@freebsd.org>
To: FreeBSD Current <freebsd-current@freebsd.org>
Subject: grep extremely slow for LC_CTYPE=C?
Message-ID: <08d32caa-aa44-cff7-d09c-af2444674958@freebsd.org>

Hi all,

while working on a new portmaster version, I found that bsdgrep is much
faster in a UTF-8 locale than in the C locale, much to my surprise.

I have uploaded a small shell script with test data that can be fetched
from:

https://people.freebsd.org/~se/grep-test.txz

The script uses "grep -v -f patternfile datafile" to select from the data
file the lines that are not matched by the contents of patternfile:

#-------------------------------------------------------------------
#!/bin/sh

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
export LANG LC_CTYPE

time grep -v -f grep-test-pattern grep-test-data

LANG=C
LC_CTYPE=C
#unset LANG LC_CTYPE # is an alternative leading to the same result ...

time grep -v -f grep-test-pattern grep-test-data
#-------------------------------------------------------------------

The first "grep" needs 3.5 seconds to finish on my system, but the second
one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
bother to check whether it finishes at all).

Is this a bug in grep?

Maybe there is something odd in the data file (loading the pattern is not
slower with LC_CTYPE=C, it takes 0.8 seconds on my system), but this is a
problem that was observed with "real" data, not a specifically constructed
worst case.

Any ideas what's causing this behavior?

I'm currently setting the UTF-8 locale as in the first invocation above to
make grep run in reasonable time, but I'd expect it to be faster in the C
locale ...

Regards, STefan