From owner-freebsd-questions@FreeBSD.ORG Sat Feb 18 16:28:19 2006
Date: Sat, 18 Feb 2006 11:28:18 -0500
From: Gerard Seibert
To: freebsd-questions@freebsd.org
Subject: Removing BOM from UTF-8
Message-Id: <20060218111849.15E6.GERARD@seibercom.net>

I have a large number of text files created in MS Word and saved in
UTF-8 format. Unfortunately, MS Word adds the BOM to each file. I need
to remove the BOM.

Information regarding BOM and UTF-8 can be found here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.w3.org/International/questions/qa-utf8-bom

A brief excerpt:

    It has also been suggested to use the UTF-8 encoded BOM
    (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8
    file. This practice should definitely not be used on POSIX systems
    for several reasons:

    * On POSIX systems, the locale and not magic file type codes
      define the encoding of plain text files. Mixing the two concepts
      would add a lot of complexity and break existing functionality.

    * Adding a UTF-8 signature at the start of a file would interfere
      with many established conventions such as the kernel looking for
      “#!” at the beginning of a plaintext executable to locate the
      appropriate interpreter.

    * Handling BOMs properly would add undesirable complexity even to
      simple programs like cat or grep that mix contents of several
      files into one.

It has been suggested that a script could be written to eliminate the
BOM from a file(s). My script writing skills suck. I have been unable
to locate one using Google, so I was hoping that someone might know
where I could either locate such a program, or perhaps give me an idea
on how to script one.

Thanks!
-- 
Gerard Seibert
gerard@seibercom.net

I'm interested in the fact that the less secure a man is, the more
likely he is to have extreme prejudice.

        Clint Eastwood
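
One possible way to script this, sketched in Python (the post does not
name a language, so that choice is an assumption). The idea is simply to
check whether a file begins with the three bytes 0xEF 0xBB 0xBF quoted in
the excerpt and, if so, rewrite the file without them. The function names
strip_bom and strip_bom_in_file are hypothetical, chosen for illustration.

```python
import codecs
import sys
from pathlib import Path


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (0xEF 0xBB 0xBF), if present."""
    # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: Path) -> bool:
    """Rewrite path in place without its BOM. Return True if it was changed."""
    # Assumption: files are small enough to read into memory in one go.
    original = path.read_bytes()
    cleaned = strip_bom(original)
    if cleaned == original:
        return False  # no BOM, leave the file untouched
    path.write_bytes(cleaned)
    return True


if __name__ == "__main__":
    # Every command-line argument is treated as a file to clean.
    for name in sys.argv[1:]:
        changed = strip_bom_in_file(Path(name))
        print(f"{name}: {'BOM removed' if changed else 'no BOM found'}")
```

Saved under a name such as strip_bom.py (hypothetical), it could be run as
`python strip_bom.py *.txt`. It overwrites files in place, so trying it on
copies first would be prudent. Only a BOM at the very start of a file is
removed; the remaining bytes are left exactly as they were.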