From: Nikos Vassiliadis
Date: Fri, 28 May 2010 23:45:58 +0300
To: Polytropon
Cc: Gary Kline, FreeBSD Mailing List
Subject: Re: any shortcuts to doc to ascii?
Message-ID: <4C002B86.5090007@gmx.com>
In-Reply-To: <20100528090057.87144ef4.freebsd@edvax.de>

Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline wrote:
>> i don't see any ascii suffix [for OOo]. i saved as .txt.
>
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
>
>
>
>> same krap. the \x94, x9d, \x9c... same with catdoc. i'll
>> try antiword.
>> [forgot about that. ]
>
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.
>

It is valid UTF-8 encoded text:

[nik@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text

You'll be able to see the character if you fire up a UTF-8 capable
terminal with proper locale settings:

[nik@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:

python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'

and use copy & paste to pass it to tr to translate it to something
else, for example:

tr '’' "'" < "$file" > "$output"

> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in german
> texts with misplaces apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
>
>     We don`t
>
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.

Indeed, the standard tool for encoding translations, iconv, chokes
on this character. Yet, it worked when I tried to convert from UTF-8
to the Greek encoding ('iconv -f utf-8 -t iso-8859-7'), which makes
sense, since ISO-8859-7 does include this character.

Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm

HTH, Nikos
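
P.S. The same inspection and replacement can be sketched in a few
lines of Python 3, in case copy & paste into tr is awkward. This is
only a sketch under assumptions: the helper name to_ascii_apostrophe
and the sample bytes are made up for illustration, and it assumes the
right single quotation mark (U+2019) is the only non-ASCII character
in the input.

```python
import unicodedata

# The same three bytes the python one-liner above prints inside "Don...t".
RAW = b"Don\xe2\x80\x99t"


def to_ascii_apostrophe(data: bytes) -> bytes:
    """Hypothetical helper: decode UTF-8, swap U+2019 for a plain '.

    Raises UnicodeEncodeError if other non-ASCII characters remain.
    """
    text = data.decode("utf-8")
    return text.replace("\u2019", "'").encode("ascii")


text = RAW.decode("utf-8")

# Identify the character that merely looks like an apostrophe.
print(unicodedata.name(text[3]))   # RIGHT SINGLE QUOTATION MARK

# Replace it, which is what the tr command is meant to do.
print(to_ascii_apostrophe(RAW))    # b"Don't"

# ISO-8859-7 has a slot for U+2019, so this conversion succeeds ...
print(text.encode("iso-8859-7"))

# ... while plain ASCII has none, so a strict conversion fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("ascii cannot represent it:", err.reason)
```

To convert a whole file, read it in binary mode and write the result
of the helper back out; characters other than U+2019 would need their
own replacements first.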