Date:      Sat, 18 Feb 2006 11:28:18 -0500
From:      Gerard Seibert <gerard@seibercom.net>
To:        freebsd-questions@freebsd.org
Subject:   Removing BOM from UTF-8
Message-ID:  <20060218111849.15E6.GERARD@seibercom.net>

I have a large number of text files created in MS Word and saved in
UTF-8 format. Unfortunately, MS Word adds the BOM to each file. I need
to remove the BOM.

Information regarding BOM and UTF-8 can be found here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.w3.org/International/questions/qa-utf8-bom

A brief excerpt:

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
as a signature to mark the beginning of a UTF-8 file. This practice
should definitely not be used on POSIX systems for several reasons:

    * On POSIX systems, the locale and not magic file type codes define
     the encoding of plain text files. Mixing the two concepts would add a
     lot of complexity and break existing functionality.

    * Adding a UTF-8 signature at the start of a file would interfere
     with many established conventions such as the kernel looking for “#!” at
     the beginning of a plaintext executable to locate the appropriate
     interpreter.

    * Handling BOMs properly would add undesirable complexity even to
     simple programs like cat or grep that mix contents of several files into
     one.

It has been suggested that a script could be written to eliminate the
BOM from one or more files. My script-writing skills suck. I have been
unable to locate such a script using Google, so I was hoping that someone
might know where I could find one, or perhaps give me an idea of how to
write one.
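
For illustration only, here is a rough sketch of what such a script might
look like in Python. It assumes the BOM is exactly the three bytes quoted
in the excerpt above (0xEF 0xBB 0xBF) at the very start of the file, and
that each file is small enough to read into memory. The names strip_bom
and strip_bom_in_file are made up for this example, and the files are
rewritten in place, so it should be tried on copies first.

```python
#!/usr/bin/env python3
"""Strip a leading UTF-8 BOM (EF BB BF) from the files named on the command line."""
import sys

# The UTF-8 encoded byte order mark, as quoted in the excerpt.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; all other bytes are untouched."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Rewrite the file at path without its BOM. Return True if it was changed."""
    # Binary mode, so no decoding or newline translation takes place.
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_bom(data)
    if stripped == data:
        return False  # no BOM present, leave the file alone
    with open(path, "wb") as f:
        f.write(stripped)
    return True


if __name__ == "__main__":
    for name in sys.argv[1:]:
        if strip_bom_in_file(name):
            print(f"removed BOM: {name}")
```

Saved under a name of your choosing (strip_bom.py here is only a
placeholder), it could be run as "python3 strip_bom.py *.txt". Files
without a BOM are left unmodified.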

Thanks!

-- 
Gerard Seibert
gerard@seibercom.net


     I'm interested in the fact that the less secure a man is, the more
     likely he is to have extreme prejudice.

          Clint Eastwood


