From: Stefan Esser
To: FreeBSD Current
Subject: grep extremely slow for LC_CTYPE=C?
Date: Thu, 3 May 2018 16:08:24 +0200
Message-ID: <08d32caa-aa44-cff7-d09c-af2444674958@freebsd.org>

Hi all,

while working on a new portmaster version, I found, much to my surprise,
that bsdgrep is much faster in a UTF-8 locale than in the C locale.

I have uploaded a small shell script with test data that can be fetched
from:

	https://people.freebsd.org/~se/grep-test.txz

The script uses "grep -v -f patternfile datafile" to select from the data
file the lines that are not matched by any pattern in the pattern file:

#-------------------------------------------------------------------
#!/bin/sh

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
export LANG LC_CTYPE
time grep -v -f grep-test-pattern grep-test-data

LANG=C
LC_CTYPE=C
#unset LANG LC_CTYPE # is an alternative leading to the same result ...
time grep -v -f grep-test-pattern grep-test-data
#-------------------------------------------------------------------

The first "grep" needs 3.5 seconds to finish on my system, but the second
one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
bother to check whether it finishes at all).

Is this a bug in grep? Maybe there is something odd in the data file
(loading the patterns is not slower with LC_CTYPE=C; it takes 0.8 seconds
on my system), but this problem was observed with "real" data, not with a
specifically constructed worst case.

Any ideas what's causing this behavior? For now I am setting the UTF-8
locale as in the first invocation above to make grep run in reasonable
time, but I'd expect it to be faster in the C locale ...

Regards, STefan
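
For anyone trying to reproduce this without waiting for minutes, the
comparison above could be wrapped in a time limit. The following is only a
sketch under stated assumptions: the helper name run_case, the PATTERNS,
DATA and LIMIT variables and the 60-second default are made up for
illustration, and it assumes timeout(1) is installed. It sets LC_ALL
instead of LANG and LC_CTYPE, which overrides both of them.

```shell
#!/bin/sh
# Sketch: run the same "grep -v -f" under several locales with a time
# limit, so a slow case is reported instead of running for minutes.
# Hypothetical names (not from the original report): run_case, PATTERNS,
# DATA, LIMIT. Assumes timeout(1) is available on the system.

PATTERNS=${PATTERNS:-grep-test-pattern}
DATA=${DATA:-grep-test-data}
LIMIT=${LIMIT:-60}

run_case() {
    # $1 = locale to test; LC_ALL overrides both LANG and LC_CTYPE
    loc=$1
    start=$(date +%s)
    env LC_ALL="$loc" timeout "$LIMIT" \
        grep -v -f "$PATTERNS" "$DATA" >/dev/null
    status=$?
    end=$(date +%s)
    # timeout(1) exits with 124 when the limit was reached; otherwise
    # the status is grep's own (0 = lines selected, 1 = none selected)
    if [ "$status" -eq 124 ]; then
        echo "$loc: no result after ${LIMIT}s"
    else
        echo "$loc: finished in $((end - start))s (grep exit status $status)"
    fi
}

# Only run the comparison when both input files are present.
if [ -r "$PATTERNS" ] && [ -r "$DATA" ]; then
    for loc in en_US.UTF-8 C; do
        run_case "$loc"
    done
fi
```

Run from the directory that holds the unpacked test files, this prints one
line per locale. The elapsed time has one-second resolution only, which is
coarse but sufficient to tell seconds from minutes.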