Date: Sat, 18 Feb 2006 11:28:18 -0500
From: Gerard Seibert <gerard@seibercom.net>
To: freebsd-questions@freebsd.org
Subject: Removing BOM from UTF-8
Message-ID: <20060218111849.15E6.GERARD@seibercom.net>
I have a large number of text files created in MS Word and saved in UTF-8 format. Unfortunately, MS Word adds the BOM to each file. I need to remove the BOM.

Information regarding BOM and UTF-8 can be found here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.w3.org/International/questions/qa-utf8-bom

A brief excerpt:

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a signature to mark the beginning of a UTF-8 file. This practice should definitely not be used on POSIX systems for several reasons:

* On POSIX systems, the locale and not magic file type codes define the encoding of plain text files. Mixing the two concepts would add a lot of complexity and break existing functionality.

* Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter.

* Handling BOMs properly would add undesirable complexity even to simple programs like cat or grep that mix contents of several files into one.

It has been suggested that a script could be written to eliminate the BOM from one or more files. My script writing skills suck. I have been unable to locate such a script using Google, so I was hoping that someone might know where I could find such a program, or perhaps give me an idea of how to script one.

Thanks!

-- 
Gerard Seibert
gerard@seibercom.net

I'm interested in the fact that the less secure a man is, the more likely he is to have extreme prejudice.
Clint Eastwood
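One possible way to script this, offered as a minimal sketch rather than a tested tool: read each file as raw bytes, check whether it starts with the three-byte UTF-8 BOM (0xEF 0xBB 0xBF) quoted in the excerpt, and write the file back without those bytes. Python is an assumption here (it is not part of the FreeBSD base system and would have to be installed), and the function names `strip_bom` and `strip_bom_from_file` are made up for this illustration.

```python
import codecs
import sys
from pathlib import Path

# The UTF-8 encoded BOM: the bytes 0xEF 0xBB 0xBF.
BOM = codecs.BOM_UTF8


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(BOM):
        return data[len(BOM):]
    return data


def strip_bom_from_file(path: Path) -> bool:
    """Rewrite the file in place without its BOM.

    Returns True if the file was changed, False if it had no BOM.
    The whole file is read into memory, which is assumed to be
    acceptable for ordinary text files.
    """
    data = path.read_bytes()
    if not data.startswith(BOM):
        return False
    path.write_bytes(data[len(BOM):])
    return True


def main(names):
    # Process every file named on the command line.
    for name in names:
        changed = strip_bom_from_file(Path(name))
        print(f"{name}: {'BOM removed' if changed else 'no BOM found'}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as strip_bom.py, it could be run as `python strip_bom.py *.txt`. Because the files are rewritten in place, it would be prudent to try it on copies first. Only a BOM at the very start of a file is removed; files without one are left untouched.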