From owner-freebsd-questions@FreeBSD.ORG  Fri May 28 20:46:06 2010
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3C1CD1065676
	for <freebsd-questions@freebsd.org>;
	Fri, 28 May 2010 20:46:06 +0000 (UTC)
	(envelope-from nvass9573@gmx.com)
Received: from mailout-eu.gmx.com (mailout-eu.gmx.com [213.165.64.42])
	by mx1.freebsd.org (Postfix) with SMTP id 9AF458FC17
	for <freebsd-questions@freebsd.org>;
	Fri, 28 May 2010 20:46:05 +0000 (UTC)
Received: (qmail invoked by alias); 28 May 2010 20:46:03 -0000
Received: from adsl-78.79.107.71.tellas.gr (EHLO moby.local) [79.107.71.78]
	by mail.gmx.com (mp-eu002) with SMTP; 28 May 2010 22:46:03 +0200
X-Authenticated: #46156728
X-Provags-ID: V01U2FsdGVkX1/8VZgdJgX8RRsuNA9Cqan+htJF0yCJVDburjPliI
	ChxndhoFCRTeym
Message-ID: <4C002B86.5090007@gmx.com>
Date: Fri, 28 May 2010 23:45:58 +0300
From: Nikos Vassiliadis <nvass9573@gmx.com>
User-Agent: Thunderbird 2.0.0.23 (X11/20100313)
MIME-Version: 1.0
To: Polytropon <freebsd@edvax.de>
References: <20100527013843.GA40751@thought.org>	<20100527050302.da39c258.freebsd@edvax.de>	<20100527233607.GD19297@thought.org>
	<20100528090057.87144ef4.freebsd@edvax.de>
In-Reply-To: <20100528090057.87144ef4.freebsd@edvax.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Y-GMX-Trusted: 0
Cc: Gary Kline <kline@thought.org>,
	FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: any shortcuts to doc to ascii?
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 May 2010 20:46:06 -0000

Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline <kline@thought.org> wrote:
>> 	i don't see any ascii suffix [for OOo].  i saved as .txt.
> 
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
> 
> 
> 
>> 	same krap.  the \x94, x9d, \x9c...  same with catdoc.  i'll
>> 	try antiword.  [forgot about that.  ]
> 
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.
> 

It is valid UTF-8 encoded text:
[nik@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text

You'll be able to see the character if you fire up a UTF-8 capable 
terminal with proper locale settings.
[nik@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:
python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'
and use copy & paste to pass it to tr to translate it to something else, 
for example:
tr '’' "'" < $file > $output
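
If copy & paste is awkward, the same translation can be sketched in
Python 3 (the one-liners above use Python 2 syntax). This is only an
illustration, not something from the thread: the helper name
to_ascii_punctuation and the mapping table are my own, and I am assuming
the \x94, \x9d and \x9c bytes mentioned earlier are the last bytes of
three-byte E2 80 xx sequences like the E2 80 99 one shown here.

```python
# Hypothetical helper (not from the original thread): map common
# typographic punctuation to plain ASCII equivalents.
ASCII_EQUIVALENTS = {
    "\u2018": "'",   # left single quotation mark  (UTF-8: E2 80 98)
    "\u2019": "'",   # right single quotation mark (UTF-8: E2 80 99)
    "\u201c": '"',   # left double quotation mark  (UTF-8: E2 80 9C)
    "\u201d": '"',   # right double quotation mark (UTF-8: E2 80 9D)
    "\u2014": "-",   # em dash                     (UTF-8: E2 80 94)
}


def to_ascii_punctuation(text):
    """Return text with the characters above replaced by ASCII ones."""
    return text.translate(str.maketrans(ASCII_EQUIVALENTS))


# The three bytes from the one-liner, decoded as UTF-8.
raw = b"Don\xe2\x80\x99t"
decoded = raw.decode("utf-8")

print(hex(ord(decoded[3])))           # code point of the odd character
print(to_ascii_punctuation(decoded))  # same word with an ASCII apostrophe
```

To run it over a whole file, read the file with encoding="utf-8", pass
the contents through to_ascii_punctuation() and write the result back
out. Extend the table if other characters turn up.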

> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in German
> texts with misplaced apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
> 
> 	We don`t
> 
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.

Indeed, the standard tool for encoding translations, iconv, chokes on 
this. Yet, it worked when I tried to convert from UTF-8 to the Greek 
encoding ('iconv -f utf-8 -t iso-8859-7'). Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm
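
A likely explanation, sketched with Python 3's codecs standing in for
iconv (so this is an analogy, not iconv's actual behaviour): a
conversion can only succeed if the target charset contains the
character. ASCII and ISO-8859-1 have no U+2019, while ISO-8859-7 does.
The function name can_encode is made up for this example.

```python
def can_encode(ch, encoding):
    """Return True if the named charset can represent ch."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

for encoding in ("ascii", "iso-8859-1", "iso-8859-7"):
    if can_encode(quote, encoding):
        print(encoding, "->", quote.encode(encoding))
    else:
        print(encoding, "-> cannot represent U+2019")
```

So for a plain ASCII result the character has to be replaced by
something else first; a straight charset conversion will not do it.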

HTH, Nikos