FW: Producing PDFs (was Re: Scanned (PDF) original descriptions)

Fabio Moretzsohn fmoretzsohn at HOTMAIL.COM
Wed May 15 15:11:08 CDT 2002


Rich Pyle wrote:
>Do you (or anyone else on this list) have any knowledge about or experience
>with Adobe Acrobat "Capture" software?  It's an extension of Acrobat that
>is intended specifically for capturing paper-copy materials into PDF format.

No, Rich, I am not familiar with this program, but it sounds promising. I
would certainly like to hear about your (and others') experience with
"Capture".

Eric Dumbar wrote:
>For the creation of PDFs without OCR I suggest using JPEG, saved at 75%
>quality ("very high" in Photoshop) instead of TIFF. The files will be 5-20x
>smaller than the TIFF (2-10x smaller than LZW compressed TIFF) and you will
>hardly lose image quality.

My OCR program, at least (an old Xerox TextBridge), does not accept
LZW-compressed TIFF, and JPEG is lossy, which may compromise character
recognition, as Curtis points out below. But JPEG is a good choice if you
have grayscale or color photographs that you want to include in the PDF
file. Some of the newer OCR programs should do a good job of recognizing the
text and saving the figures automatically. I know that Acrobat can do OCR
from images, but Acrobat is not installed properly on any of the several
computers in my department, and OCR just won't work from Acrobat there,
hence my using another program.

Curtis Clark wrote:
>JPEG is inappropriate for black/white text: it saves 16 million colors,
>whether you use them or not, and introduces artifacts around the letters at
>any useful level of compression. A 1-bit (b/w) TIFF with lossless
>compression will almost always be smaller than the equivalent jpeg.

I agree with Curtis: JPEG stores 24-bit color (or 8-bit grayscale), so the
file is usually larger than a 1-bit b/w TIFF. And once OCR is done, the size
of the original image file no longer matters, because the program saves the
letters as text, which takes just a fraction of the space.
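
As a rough illustration of the bit-depth argument, the sketch below computes
the uncompressed raster size of one scanned page at three bit depths. The
page size (8.5 x 11 inches) and the 600 dpi resolution are assumptions chosen
for illustration, and the function name is made up. Real TIFF or JPEG files
will be smaller once compressed, so this shows only the starting ratio.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size, in bytes, of a page scanned at the given depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each pixel row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size and resolution (not taken from any particular scan).
PAGE_INCHES = (8.5, 11.0)
DPI = 600

sizes = {
    label: raw_page_bytes(*PAGE_INCHES, DPI, bits)
    for label, bits in [
        ("1-bit b/w", 1),
        ("8-bit grayscale", 8),
        ("24-bit color", 24),
    ]
}

for label, size in sizes.items():
    print(f"{label:16s} {size / 1e6:7.1f} MB")
```

Before any compression, the 24-bit page is about 24 times the size of the
1-bit page, which is why the choice of bit depth matters so much for text.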

Two other advantages of doing OCR are: 1) the text can be searched (which is
not possible in an image file), and 2) OCR'd text in a PDF prints as well as
your printer can output, provided you saved it at high resolution (e.g. 600
dpi) when converting the file to PDF. The same cannot be said of figures if
they are compressed.

BUT:

Curtis Clark also wrote:
>A PDF containing only the OCRed text of a protologue is potentially
>harmful--it should also contain the original scan. Even a small, overlooked
>mistake can change meaning, and if a web-based PDF is available, Murphy's
>Law states that no one will look at the original again until it can cause
>the most nomenclatural disruption.
>For protologues in languages without good spell-checkers (such as Latin),
>mistakes are even more likely...

This is a good point, which I had not thought about. OCR programs can and
sometimes do introduce misspellings, and even a small mistake can alter the
meaning considerably. I suspect that some programs also "correct" apparent
misspellings automatically without asking the user, so the auto mode should
not be used.
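
One way to catch such small errors is to compare the OCR output word by word
against a second, independently proofread transcription. What follows is
only a minimal sketch, assuming both texts are available as plain strings.
The function name and the sample Latin phrase are made up for illustration,
and Python's standard difflib module stands in for whatever comparison tool
one actually has.

```python
import difflib


def ocr_discrepancies(ocr_text, reference_text):
    """Return (reference, ocr) pairs for each stretch of words that differs."""
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing words from both versions side by side.
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return diffs


# Hypothetical example: one misread letter turns one valid Latin word
# into another, so a spell-checker alone would not flag it.
reference = "foliis ovatis, floribus albis"
ocr = "foliis ovalis, floribus albis"

for expected, got in ocr_discrepancies(ocr, reference):
    print(f"reference: {expected!r}   OCR: {got!r}")
```

A comparison like this only flags differences; someone still has to check
each one against the original scan.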

The question, then, is what the best compromise is between accuracy (no OCR,
just an image file, which needs good resolution) and a modest file size for
online use. Some of the image-based PDFs that I've seen are compressed so
much that they are hardly readable by human eyes and nearly useless. In those
cases, just a text file, not a facsimile, would be better.

Aloha, Fabio

----------------------------------------
Fabio Moretzsohn
PhD candidate in Zoology
Department of Zoology
University of Hawaii
2538 The Mall
Honolulu, Hawaii 96822
fmoretzsohn at hotmail.com (preferred)
fmoretz at hawaii.edu








More information about the Taxacom mailing list