NomenclatorZoologicus
Robin Leech
releech at TELUSPLANET.NET
Mon Jan 3 14:13:31 CST 2005
Lemme know when you guys have things solved to the point
where I can help out, eh?
Robin
----- Original Message -----
From: "David Remsen" <dremsen at MBL.EDU>
To: <TAXACOM at LISTSERV.NHM.KU.EDU>
Sent: Monday, January 03, 2005 2:01 PM
Subject: Re: NomenclatorZoologicus
> Rich - We are working on a web-based interface for online review now.
> We would certainly appreciate any input you or others might have. Our
> idea is, as you suggested, to have the page image visible along with
> the editable data record. We are playing with image manipulation
> functions in php to dynamically/interactively crop and navigate the
> page image so that the current record is at the top of the image just
> under the converted data record. One of the most physically exhausting
> exercises in reviewing these data is moving the eye between the page
> and the record, particularly in alphabetical lists where the first
> characters are identical. Compound this with the fact that much of the
> record consists of abbreviations and non-text words that require a lot
> of this back and forth motion. We are going to be actively working on
> these issues in the next couple of weeks and will likely try out
> several ideas or offer several options before we get operational.
>
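
A rough sketch of the cropping idea described above, written in Python rather than the PHP the project mentions. It assumes the pixel y-coordinate of each record on the page scan is already known; that coordinate, the function name, and the margin parameter are all hypothetical. It only computes the crop rectangle that places the current record at the top of the visible strip.

```python
def crop_box_for_record(page_width, page_height, record_top,
                        window_height, margin=0):
    """Return (left, upper, right, lower) for a strip of the page image
    whose top edge sits just above the current record.

    record_top is the assumed y pixel where the record's line begins;
    margin leaves a little space above it. Values are clamped to the page.
    """
    upper = max(0, min(record_top - margin, page_height))
    lower = min(page_height, upper + window_height)
    return (0, upper, page_width, lower)


if __name__ == "__main__":
    # Hypothetical 1200x2000 scan, record starting 850 px down the page.
    print(crop_box_for_record(1200, 2000, 850, 400, margin=10))
```

The tuple uses the same (left, upper, right, lower) order that Pillow's Image.crop expects, so it could be passed to an imaging library directly if one is available.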
>
>> As for eliminating the need for the second of these, I think the best
>> way to achieve this would be to develop a dead-simple web-based
>> user-interface that allows an online reviewer to easily go
>> page-by-page, line-by-line, and see the electronic version directly
>> next to the original scanned version (ideally with the electronic
>> version formatted with the same font, etc. as the printed version, so
>> discrepancies are more obvious); and a simple way for the reviewer to
>> make the necessary adjustments right there on the web page, then click
>> a "Submit" button to go to the next name. Dave -- I've got a number of
>> ideas & suggestions, if you're interested (although I suspect that
>> you're already WAY ahead of me on this).
>
>
>
>
>> P.S. Dave -- out of curiosity, have you tried Adobe's Acrobat Capture
>> 3.0 software? It's their industrial-strength OCR product, which has
>> impressive tools for collaborative scanning/OCRing/proofing of printed
>> documents. Was this among the tools you tried when determining that
>> in-house OCRing was not feasible?
>
>
> We tried this and several other OCR tools prior to out-sourcing to a
> data conversion company. We had volumes 1-9 professionally unbound so
> that the page feeds would go smoothly. Our library has several
> high-end scanner/copiers that bundle scanning and OCR together and the
> resultant conversion was a good 99% accurate but unfortunately this was
> not good enough. 99% gives an error approximately every 2 records.
> We also tried a different OCR package and then wrote some scripts to
> compare the two outputs, hoping that they would narrow the mistakes
> down by disagreeing on the errors and agreeing on the successfully
> converted words. This is the approach used at the NLM for some of the Turning
> the Pages conversions they have done where they actually use something
> like 5 different OCR packages and then sort through the results for an
> editorial pass. It might be worth chatting with them sometime on this
> methodology. Our two-pass effort, however, created more problems than
> it solved.
>
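
For illustration, a minimal sketch of the "compare two OCR outputs" idea, using Python's standard difflib. This is not the script described above (which, as noted, created more problems than it solved); the function name and the sample record are made up. It lists only the spans where two conversions of the same record disagree, so that a reviewer looks at those and skips the rest.

```python
from difflib import SequenceMatcher


def disagreements(text_a, text_b):
    """Return (offset_in_a, segment_a, segment_b) for every span where
    the two OCR outputs differ. Spans that agree are skipped."""
    matcher = SequenceMatcher(None, text_a, text_b)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((a0, text_a[a0:a1], text_b[b0:b1]))
    return diffs


if __name__ == "__main__":
    # Hypothetical record as converted by two different OCR packages.
    ocr_one = "Examplus Smith 1980"
    ocr_two = "Examplus Smith l980"
    for offset, seg_a, seg_b in disagreements(ocr_one, ocr_two):
        print(f"offset {offset}: {seg_a!r} vs {seg_b!r}")
```

Agreement between the two outputs does not prove a span is correct, since both packages can make the same mistake on the same glyph.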
> It's interesting to note that the conversion service we chose that
> purported to do double-keying actually initially used OCR tools as
> well. I had been told that these guys had some 'tricks' for
> high-accuracy OCR where they use double-keying only sporadically to
> check the OCR. While this might work for some text it didn't work for
> NZ where everything is abbreviated, punctuated, and very few of the
> strings are actual words. We went through 4 different iterations with
> the company before they threw their hands up and actually had it
> double-keyed. The fifth pass was the one we kept. They never actually
> admitted they used OCR but the errors I found were consistent with OCR
> and not typographical errors (things like 198o instead of 1980, or
> l980 instead of 1980). So I think future efforts should keep this in
> mind. We will summarize this and additional tips and tools that we
> have developed for parsing and QC-ing large volumes of nomenclatural
> text like this on the NZ site.
>
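
A small sketch of how the OCR-style errors mentioned above (198o, l980) might be flagged automatically. The pattern and the two-digit threshold are assumptions chosen for illustration, not the QC tools referred to in the message. It looks for four-character tokens shaped like a year in which one or two digits have been replaced by look-alike letters (o, O, l, I).

```python
import re

# Four characters that start like a year (1 or 2, or a look-alike of 1)
# and continue with digits or the letters OCR commonly confuses with them.
YEAR_LIKE = re.compile(r"\b[12lI][0-9oOlI]{3}\b")


def suspect_years(text):
    """Return year-shaped tokens that mix digits with look-alike letters."""
    hits = []
    for match in YEAR_LIKE.finditer(text):
        token = match.group()
        digit_count = sum(ch.isdigit() for ch in token)
        # Keep tokens that are mostly digits but not entirely digits;
        # this skips clean years and ordinary short words.
        if 2 <= digit_count < 4:
            hits.append(token)
    return hits


if __name__ == "__main__":
    print(suspect_years("Smith 198o; Jones l980; Brown 1980"))
```

A human typist rarely substitutes a lowercase L for the digit 1 in the middle of a year, which is why hits of this kind point toward OCR rather than keying.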
> David Remsen
>