Date: Thu, 27 Mar 1997 10:13:56 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: rssh@cki.ipri.kiev.ua
Cc: terry@lambert.org, leisner@sdsp.mc.xerox.com, msmith@atrad.adelaide.edu.au, johnp@lodgenet.com, se@freebsd.org, spaz@u.washington.edu, jkh@time.cdrom.com, hackers@freebsd.org
Subject: Re: MSWord docs...
Message-ID: <199703271713.KAA01589@phaeton.artisoft.com>
In-Reply-To: <333A6507.3FB7@cki.ipri.kiev.ua> from "Ruslan Shevchenko" at Mar 27, 97 03:16:06 pm

> > Nothing public.  You can obtain documentation under NDA from
> > Microsoft, provided you agree not to implement anything useful
> > with the information (like a word processor).
> >
>
> Hm, can I right to do a reingeneering of word file format,
> created with my word ?

You are only bound to not build a word processor if Microsoft
tells you the file format.  If you find out on your own, I think
you are free to do what you want (you can't copyright a file
format, only a document describing it or a program that implements
it).

Microsoft may have attempted to patent the file format; I doubt
it, since GIF format is only in trouble because of the LZW
technology patents.  There may be similar patents, however, under
Microsoft's belt, if they have patented their tiny modifications
to LZ77 for their "compress/expand" technique, and if they use
this technique on the data stored in the files.  They may also
have a patent on the encryption algorithm (a friend of mine, while
employed at Word Perfect, actually cracked their encryption).

The MS-Word format is actually documented in:

        The File Formats Handbook
        Gunter Born
        International Thomson Computer Press
        ISBN 0-442-01995-5

But WinWord format (which is what we are really discussing here)
is not documented in the book, though some gross hints are given:

o       It's in three sections which are, in order, a header,
        text data, and formatting data
o       The header and format structure depend on the version
        of WinWord [1.0, 2.0, 6.0]
o       The total header size is 384 bytes
o       The text is stored as DOS ANSI
o       The first 36 bytes are:

        Offset  Size
        (hex)   (bytes)  Contents
        00      2        Signature (0x9BA5=1.0, 0x9DA5=2.0, 0xD0CF=6.0)
        02      2        version (major)
        04      2        version (minor)
        06      2        Language ID
        08      2        Next page number
        0A      1        Flags
        0B      1        Encryption (1=Yes)
        0C      6        Internal use (hah -- yeah, right)
        12      1        Platform (0=Windows, 1=Mac)
        13      1        Reserved
        14      2        Character set (0 = ANSI)
        16      2        Internal character set
        18      4        absolute offset 1st character of text
        1C      4        absolute offset end character of text + 1
        20      4        Offset to end of file
        ...              Other file pointers

I'd guess that most of the files following the header are what
are called "Internal files" or are "Index files", and are probably
stored in BTREE format, the same as the .HLP (Help) files.

I'm not really interested in hacking this out; I don't own a copy
of Word to use to generate test data sets of known content, and
that's probably prohibited in the license if I were to go buy a
copy.


                                        Regards,
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.