From owner-freebsd-questions@FreeBSD.ORG  Mon Jan 26 23:39:56 2009
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F1392106566C
	for <freebsd-questions@freebsd.org>;
	Mon, 26 Jan 2009 23:39:56 +0000 (UTC)
	(envelope-from freebsd@edvax.de)
Received: from mx01.qsc.de (mx01.qsc.de [213.148.129.14])
	by mx1.freebsd.org (Postfix) with ESMTP id ADE068FC12
	for <freebsd-questions@freebsd.org>;
	Mon, 26 Jan 2009 23:39:56 +0000 (UTC)
	(envelope-from freebsd@edvax.de)
Received: from r55.edvax.de (port-92-196-68-197.dynamic.qsc.de [92.196.68.197])
	by mx01.qsc.de (Postfix) with ESMTP id E7D0F3CB7F;
	Tue, 27 Jan 2009 00:39:40 +0100 (CET)
Received: from r55.edvax.de (localhost [127.0.0.1])
	by r55.edvax.de (8.14.2/8.14.2) with SMTP id n0QNdYdb005827;
	Tue, 27 Jan 2009 00:39:34 +0100 (CET)
	(envelope-from freebsd@edvax.de)
Date: Tue, 27 Jan 2009 00:39:34 +0100
From: Polytropon <freebsd@edvax.de>
To: Gary Kline <kline@thought.org>
Message-Id: <20090127003934.3d828210.freebsd@edvax.de>
In-Reply-To: <20090126220623.GA76673@thought.org>
References: <20090126001822.GA38314@thought.org>
	<20090126005156.GJ66858@comcast.net> <497D0FF3.6090402@telenix.org>
	<20090126080618.GA51983@thought.org>
	<20090126091623.a0b50f64.freebsd@edvax.de>
	<20090126220623.GA76673@thought.org>
Organization: EDVAX
X-Mailer: Sylpheed 2.4.7 (GTK+ 2.12.1; i386-portbld-freebsd7.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Chuck Robey <chuckr@telenix.org>,
	FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: can i split a pdf file?
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Polytropon <freebsd@edvax.de>
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 26 Jan 2009 23:39:57 -0000

On Mon, 26 Jan 2009 14:06:23 -0800, Gary Kline <kline@thought.org> wrote:
> 	So what kind of moron is going to photograph pages --or maybe just
> 	get-screenshot-of-this-page" and upload it? 

The PDF serves as a container for pictorial images in this context.
Another idea would be to have separate image files, one file per
page, that you could view with your favourite image viewer.

The advantage of the PDF container is that you can easily print
a bunch of pages (or, a book).


>  Or a Real question:
> 	I read an online pdf of "The Art of War" from the 1880's [?], and
> 	it was in an old-English or olden-Deutsch type font.  In PDF.  i
> 	have other p.d. texts in pdf and am wondering in there is some
> 	sort of scanner than can take a book-length script and create a
> 	pdf file.  Anybody know?  

It's very complicated to handle old fonts using OCR techniques.
It's even quite complicated with today's standard fonts. Although
there are (usually expensive) OCR programs with good algorithms,
most documents need some work afterwards. It's not only about
correcting misrecognized characters; you have to handle hyphenation
and paragraph typesetting as well.
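
To give an idea of that cleanup work, here is a minimal Python sketch
that rejoins words hyphenated across line breaks and reflows the
paragraphs. The function name clean_ocr_text and the sample text are
made up for illustration, and the approach is deliberately naive: it
assumes paragraphs are separated by blank lines, and it also merges
genuine hyphens that happen to sit at a line end.

```python
import re
import textwrap

def clean_ocr_text(raw, width=72):
    """Rejoin words hyphenated across line breaks and reflow paragraphs."""
    # Join "hyphen-\nation" into "hyphenation". Naive: a genuine hyphen
    # at a line end (e.g. "well-\nknown") is merged as well.
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", joined)
    # Collapse whitespace inside each paragraph and wrap it again.
    reflowed = [textwrap.fill(" ".join(p.split()), width=width)
                for p in paragraphs if p.strip()]
    return "\n\n".join(reflowed)

sample = "It is compli-\ncated to handle\nold fonts.\n\nNext para-\ngraph here."
print(clean_ocr_text(sample))
```

Misrecognized characters are a separate problem and still need a
human (or a spell checker) to look at the result.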

I know that there are scanners that can pull a stack of paper
sheets through an automatic feeder, scan them, and finally have
a PDF file ready for FTP download. But there's no OCR involved,
of course.


> I got a bunch of ^L bytes and nothing
> 	else. 

The Ctrl-L (^L) is the page break character (FF = form feed). The
rest of the file then contains images that are not transformable
into characters.
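
If you want to check which pages of an extracted text file came out
empty, a small Python sketch along these lines would do. It assumes
the extractor writes one form feed after every page; the function
names are made up for illustration.

```python
def split_pages(text):
    """Split extracted text into pages at form feed (0x0C) characters."""
    pages = text.split("\f")
    # Assumption: a form feed follows every page, which leaves one
    # empty trailing piece after the split; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages

def empty_pages(text):
    """Return the 1-based numbers of pages without any visible text."""
    return [n for n, page in enumerate(split_pages(text), start=1)
            if not page.strip()]

sample = "\f\f\f"   # what a three-page, image-only PDF might yield
print(len(split_pages(sample)), empty_pages(sample))
```

A file where every page shows up in the second list is a strong hint
that the PDF holds scanned images only.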


> Now I'm looking at the file with od -c and, yup, it's and
> 	image. The parts inbetween pages are in ASCII.  Do you know what
> 	"MediaBox" is?

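Not an image container: in PDF, "MediaBox" is a page attribute,
a rectangle of four numbers that gives the size of the page in
points (1/72 inch). The scanned image itself is stored in a
separate image object that the page refers to.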
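
To see what the "MediaBox" entries in such a file actually hold, a
short Python sketch can pull them out of the raw bytes. This is not a
real PDF parser: it assumes the page dictionaries are stored
uncompressed (as the od -c output suggests), and since a page may
inherit its MediaBox from a parent node, the number of matches need
not equal the number of pages. The name media_boxes and the sample
bytes are made up for illustration.

```python
import re

# Matches e.g. "/MediaBox [0 0 612 792]" in uncompressed PDF bytes.
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]")

def media_boxes(data):
    """Return (width, height) in points for each /MediaBox found."""
    sizes = []
    for m in MEDIABOX.finditer(data):
        # The four numbers are the lower-left and upper-right corners.
        llx, lly, urx, ury = (float(v.decode("ascii")) for v in m.groups())
        sizes.append((urx - llx, ury - lly))
    return sizes

sample = b"<< /Type /Page /MediaBox [0 0 595 842] /Contents 4 0 R >>"
for width, height in media_boxes(sample):
    # 1 point = 1/72 inch = 25.4/72 mm
    print(f"{width * 25.4 / 72:.0f} x {height * 25.4 / 72:.0f} mm")
```

For a real file you would pass in the result of
open("book.pdf", "rb").read(), with book.pdf standing in for whatever
your file is called.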


> 	At least the web article was not an image! 

Don't mind, I know "important" web pages where the text content
actually IS an image, and of course there's no alt= or longdesc=
attribute because they're for weenies. :-)


-- 
Polytropon
>From Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...