From: Gabor Kovesdan
To: hackers@FreeBSD.ORG
Cc: Roman Divacky
Date: Sun, 27 Sep 2009 15:27:00 +0200
Subject: BSDL texttools status and further thoughts...
Message-ID: <4ABF6824.9090601@FreeBSD.org>

Hello all,

recently, I've had a discussion with rdivacky@ about the status of these tools. It's about bc, dc, grep, sort and iconv. He has persuaded me to write a summary here in case someone else is interested in contributing to these tools.
So here I come with a little summary.

BSD bc/dc will come just after 8.0-RELEASE. They are quite mature, and delphij@ offered to help me get them into the tree by reviewing and approving my changes (I only have doc/ports commit bits).

BSD grep is also quite mature; I fixed the last critical bug recently. My only concern is performance. GNU grep is fast but has ~8 KSLOC. BSD grep is slightly slower but has only ~1.5 KSLOC. That is a huge difference in complexity. GNU grep is very hard to read, but it uses a lot of custom optimizations to get this performance. I think we should go another way and have a well-optimized and mature regex library. The current one is very old and doesn't have wchar support, it's slow like hell, and it doesn't support the custom GNU extensions, which are unfortunately necessary to maintain compatibility (e.g. "(a|)" is considered invalid in strict POSIX regex, but GNU accepts it!). Because of this, BSD grep is linked to the GNU regex library at the moment, but because of the custom magic in GNU grep, BSD grep is still a bit slower. If we can live with this slight performance hit, I think we can commit it, because it's quite feature-complete. You know, I'm a beginner, but I think the code of BSD grep is so tiny and simple that there is almost no way to optimize it further by simplifying the code, so further optimization should be done in the regex library.

As for the regex library, NetBSD's SoC project is worth a look. I'm interested in this, but I have too many things in the queue to start another one...

As for sort, it isn't so mature yet.
I've just made a TODO list of the known missing features or bugs:

- sometimes it segfaults when reading huge files
- the -k option isn't implemented yet
- the -n option doesn't work correctly
- preproc() optimization (I don't know what it refers to actually, but I had it on my previous TODO list; will have to check)
- polishing the man page
- adding some more test cases to the regression test
- checking performance (in this case it really matters, because sorting is an algorithm-heavy task and I'm not an algorithmic guru... And this version of sort was written by me from scratch. The OpenBSD one isn't wchar-clean and can't be fixed by design. This sort is much smaller, but it seems the algorithm isn't optimal.)

As for iconv, I'll keep working on it in my BSc thesis. The forward (foo -> utf32) conversions are almost completely GNU-compatible, the reverse ones not so much. GNU has optional transliteration, while BSD iconv uses it by default, so I compared the output to GNU's transliterated output, and GNU has some more advanced mappings for this. Apart from this, almost all encodings that we have in locale(1) charmaps are supported, but the Big5 module segfaults. I hope I'll be able to solve these issues and check performance as part of my BSc thesis.

Regards,

--
Gabor Kovesdan
FreeBSD Volunteer

EMAIL: gabor@FreeBSD.org .:|:. gabor@kovesdan.org
WEB: http://people.FreeBSD.org/~gabor .:|:. http://kovesdan.org