[Taxacom] Centrally supported electronic archive

Donat Agosti agosti at amnh.org
Wed May 27 07:50:40 CDT 2009


Proofreading seems to me a typical candidate for community
involvement, and so it needs mechanisms that let people download a
text, improve it and upload it again. Their benefit would then be much
more accurate search results.

Of course, if you look at very modern publications, then OCR is not a
problem. But with almost every year you go back the issues become larger,
and then there are not just English texts...

donat


> We were worried about how to go about the OCR process, standards, etc.
> but to our surprise (and relief), the software used to make PDFs 'just
> did it', and the text content of the PDFs was searchable without any
> further processing.  (It also tried to straighten the page visually,
> with interesting side effects as it tried to reorient images of
> diagonally mounted herbarium specimens vertically.)  The text
> recognition algorithms seem to be really robust these days and we
> have not bothered to go to the extent of preparing and proofing a
> separate text version of the PDF document.  Perhaps we should, but
> time and funds are limited.
>
> jim
>
> On Wed, May 27, 2009 at 9:56 PM, Mary Barkworth <Mary at biology.usu.edu>
> wrote:
>> The OCRing is useful. I am not sure that "discovering the treatments"
>> is. The point was made that parts of a protologue may be widely
>> scattered (consisting of several "fragments") which is why access to the
>> whole of a work is desirable. Unless by "discovering the treatments" you
>> mean such things as identifying starts of chapters, articles, or
>> sections. Am I right in assuming that the part of OCRing that is
>> time-consuming is verification/proof-reading?
>>
>> -----Original Message-----
>> From: Donat Agosti [mailto:agosti at amnh.org]
>> Sent: Wednesday, May 27, 2009 5:48 AM
>> To: Mary Barkworth; taxacom at mailman.nhm.ku.edu
>> Cc: Peter B. Phillipson; taxacom at mailman.nhm.ku.edu
>> Subject: Re: [Taxacom] Centrally supported electronic archive
>>
>>
>> Overall, the scanning itself is the least expensive part. OCR-ing and
>> extracting the treatments takes much more time and expertise, even
>> though scanning properly is an art in itself...
>> I think there is nothing insignificant in this process. At the same
>> time, a huge number of colleagues are scanning their documents
>> independently rather than sharing this burden. If there were a way to
>> discover these scans, so that they could then be used for further
>> processing, we would have resolved one of the first bottlenecks in the
>> transformation process. BHL, if I am right, is looking into this sort
>> of archive - aren't you, Chris?
>>
>> Donat
>>
>>
>>> The problem, at least with articles and books that are not already
>>> scanned, is surely the cost of scanning, particularly if the work is
>>> old or rare. That is not insignificant.
>>>
>>>
>>> -----Original Message-----
>>> From: taxacom-bounces at mailman.nhm.ku.edu
>>> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Peter B.
>>> Phillipson
>>> Sent: Wednesday, May 27, 2009 3:33 AM
>>> To: taxacom at mailman.nhm.ku.edu
>>> Subject: Re: [Taxacom] Centrally supported electronic archive
>>>
>>> I do want to deal with the whole article.....
>>>
>>> We should all be encouraged to read the entire paper or chapter in
>>> which a protologue (or any nomenclatural change) is published; there
>>> is often crucial information about the whereabouts of specimens the
>>> author has cited, and other valuable information in an introduction,
>>> illustration or elsewhere in an article, that can aid interpretation
>>> of the original author's intentions, especially in older publications.
>>>
>>> I have often been frustrated in the past by requesting the page
>>> numbers cited for a particular protologue through inter-library
>>> loans, only to discover that essential parts of a protologue and its
>>> context were missing. With electronic media it doesn't usually cost
>>> more to send or download all the pages of an article than just the
>>> 1 page that contains the bare minimum, so why cut corners?
>>>
>>> We should also encourage authors (and databasers) to be as
>>> comprehensive as possible in citing earlier taxonomic references, so
>>> that it is easier for future generations to obtain all the relevant
>>> pages of a publication - citing both the entire publication and the
>>> specific pages that contain all of the elements of a protologue.
>>>
>>> Pete Phillipson
>>>
>>> -----Original Message-----
>>> From: taxacom-bounces at mailman.nhm.ku.edu
>>> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Paul van
>>> Rijckevorsel
>>> Sent: 27 May 2009 08:44
>>> To: taxacom at mailman.nhm.ku.edu
>>> Subject: Re: [Taxacom] Centrally supported electronic archive
>>>
>>>
>>> From: "Jim Croft" <jim.croft at gmail.com>
>>> Sent: Tuesday, May 26, 2009 11:59 PM
>>>
>>>> When
>>>> someone calls [f]or the protologue, we do not want to send them the
>>>> whole article.  With limited resources we can not afford to scan
>>>> an[d] store the whole article when all we want is one page of it...
>>>
>>> ***
>>> Yes, an important issue: if all you want is the protologue, you do not
>>> want to have to deal with a whole article. However, a complicating
>>> factor is that from a nomenclatural perspective it is not necessarily
>>> immediately apparent what the protologue is; in fact it needs to be
>>> 'circumscribed' from case to case. In the modern literature this will
>>> (almost always) be straightforward, but the introduction, etc. to a
>>> book or article may also contain material that belongs to the
>>> protologue. Say, the Acknowledgements may comment: "we are deeply
>>> grateful for the hospitality of Mr Przilowsky; in acknowledgement we
>>> have named our third species in honour of his eldest daughter".
>>> Theoretically, there may be a separation of hundreds of pages between
>>> one part of the protologue and another.
>>>
>>> ["Protologue ...: everything associated with a name at its valid
>>> publication, i.e. description or diagnosis, illustrations, references,
>>> synonymy, geographical data, citation of specimens, discussion, and
>>> comments."]
>>>
>>> It is not required that all the requirements of valid publication are
>>> met in a single publication; the final 'validating' publication only
>>> needs to refer to all the required parts, which need to have been
>>> effectively published earlier. For example the final publication may
>>> be a few lines only, but refer to a page-filling illustration
>>> elsewhere. So a protologue can be spread over more than one
>>> publication. All in all, 'circumscribing' a protologue is not a
>>> trivial matter. However, if the result goes into an accessible
>>> database, it need be done only once.
>>>
>>> Paul
>>>
>>>
>>> _______________________________________________
>>>
>>> Taxacom Mailing List
>>> Taxacom at mailman.nhm.ku.edu
>>> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>>>
>>> The Taxacom archive going back to 1992 may be searched with either of
>>> these methods:
>>>
>>> (1) http://taxacom.markmail.org
>>>
>>> Or (2) a Google search specified as:
>>> site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
>>>
>>>
>>
>>
>> --
>> Dr. Donat Agosti
>> Research Associate, American Museum of Natural History and Smithsonian
>> Institution
>>
>> Email: agosti at amnh.org
>> Web: http://antbase.org
>> CV:
>> http://research.amnh.org/entomology/social_insects/agosticv_2003.html
>>
>> Swiss Residence
>> Elahieh
>> Ave. Khazer no. 74
>> 19649 Teheran
>> Iran
>>
>> +98-21-2200 8765 (office)
>> +98-21-2260 6160 (home)
>> +98-919-489 2744 (mobile)
>> +1-202-558 0330 (skype-in US)
>> +41-44-5862911 (skype-in Switzerland)
>>
>
>
>
> --
> _________________
> Jim Croft ~ jim.croft at gmail.com ~ +61-2-62509499 ~
> http://www.google.com/profiles/jim.croft
>
> "Words, as is well known, are the great foes of reality."
> - Joseph Conrad, author (1857-1924)
>
> "I know that you believe that you understood what you think I said,
> but I am not sure you realize that what you heard is not what I meant."
>  - attributed to Robert McCloskey, US State Department spokesman
>

