[Taxacom] iSpecies with Wikipedia

Thu Mar 27 09:42:06 CDT 2008

Brian Tindall wrote: 

> Many end users would like one answer.

And if they're "Feeling Lucky", then there's no reason they can't have one
answer.  An algorithm to assess the existing set of classifications and
allow some sort of consensus to emerge is relatively trivial, compared to
existing ranking algorithms used by search engines and such.  Such an
algorithm would track things like where different classifications are
congruent, and to what extent; weighting based on how many different
knowledgeable users have individually ranked the different classifications;
weighting based on how many publications have emulated which classifications
(and where those were published, and when); and a bunch of other various
factors that shouldn't be too hard for a group of clever algorithm writers
and clever bioinformatics folks to hash out.  And, of course, the algorithm could be
tweaked iteratively over time.  With such an algorithm, you could not only
provide an "I'm Feeling Lucky" classification, but could also provide a
confidence metric for each node based on homogeneity/heterogeneity of the
various existing classifications.
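
To make the idea concrete, here is a minimal sketch of one way such a
consensus could be computed. It is only an illustration under stated
assumptions, not a worked-out design: each classification is reduced to a
taxon -> parent mapping, each asserter gets a single numeric weight, and the
names consensus, classifications, and weights are all hypothetical. The
confidence value for a node is simply the share of total weight that agrees
with the winning placement.

```python
from collections import defaultdict


def consensus(classifications, weights):
    """Return {taxon: (parent, confidence)} from weighted parent votes.

    classifications: {source: {taxon: parent}}  (assumed input shape)
    weights:         {source: float}            (missing sources count as 1.0)
    """
    # votes[taxon][parent] accumulates the weight of every source that
    # places `taxon` under `parent`.
    votes = defaultdict(lambda: defaultdict(float))
    for source, parents in classifications.items():
        w = weights.get(source, 1.0)
        for taxon, parent in parents.items():
            votes[taxon][parent] += w

    result = {}
    for taxon, tally in votes.items():
        total = sum(tally.values())
        # Highest weight wins; ties are broken alphabetically so the
        # output is deterministic.
        parent, score = min(tally.items(), key=lambda kv: (-kv[1], kv[0]))
        # Confidence is 1.0 when all sources agree, lower when they differ.
        result[taxon] = (parent, score / total)
    return result


if __name__ == "__main__":
    # Placeholder sources and taxa, purely for illustration.
    sources = {
        "A": {"Genus1": "FamilyX", "Genus2": "FamilyY"},
        "B": {"Genus1": "FamilyX", "Genus2": "FamilyX"},
        "C": {"Genus1": "FamilyY", "Genus2": "FamilyX"},
    }
    for taxon, (parent, conf) in sorted(
        consensus(sources, {"A": 2.0, "B": 1.0, "C": 1.0}).items()
    ):
        print(taxon, parent, conf)
```

Voting on each taxon independently keeps the sketch short, but it does not
check that the winning placements together form a valid tree; a real
implementation would need to handle that, along with synonymy and differing
ranks.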

On the scale of obstacles separating us from bioinformatic utopia, I suspect
that developing and implementing such an algorithmic approach is down near
the "solvable" end of the spectrum.

> But what is that consensus based upon and what happens if the 
> experts generally agree that "the preferred classification" 
> proposed by these providers is misleading? 

Hmmm...who are the "experts", if not the asserters of classifications?  I
only listed the "big" classification asserters because they are broad in
scope, and familiar to most Taxacom readers.  As I said, I don't see why
there can't be as many classifications as there are
individuals/organizations willing to assert them.  Experts are individuals,
and if they are willing to assert classifications, then they are all part of
the mix.  And their "expertness" would be reflected in the weighting
algorithm for generating the "IFL" classification (by whatever metric one
deems appropriate for representing "expertness").
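
As a rough illustration of how "expertness" might enter the weighting, the
sketch below combines a few of the factors mentioned above into a single
weight per asserter. The function name, its parameters, and the formula
itself are assumptions made up for this example; the point is only that
raw popularity can be damped so that it does not automatically outvote a
highly rated expert.

```python
import math


def source_weight(endorsements, citing_publications, expertness=1.0):
    """Hypothetical weight for one classification asserter.

    endorsements:        number of knowledgeable users who ranked it favorably
    citing_publications: number of publications that followed it
    expertness:          multiplier from whatever expertise metric is chosen
    """
    # log1p grows slowly, so a tenfold increase in counts adds only a
    # modest amount to the weight.
    popularity = math.log1p(endorsements) + math.log1p(citing_publications)
    return expertness * (1.0 + popularity)
```

With this particular formula, an asserter with no endorsements but a high
expertness multiplier can still outweigh a widely copied classification,
which is one way to address the worry that the majority simply drowns out
the specialists.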

> Harvesting and 
> indexing across multiple websites doesn't serve to correct 
> such problems. It just multiplies them and gives the 
> impression that the majority is correct and the minority (in 
> this case the experts) is wrong!

That's why the algorithm isn't at the "super easy" end of the spectrum --
just near the "solvable" end.  Besides, there is no reason why any
individual expert's assertions of classifications couldn't be made available
through the internet, provided an appropriate platform and the appropriate
software -- exactly the sort of platform and software that (I hope) EoL,
GBIF, and others are planning to develop (based on TDWG standards and
protocols that are already in development).

> I just wonder what the ratio is of the figures "knowledgeable 
> users" to "less-knowledgeable users" to "average users who 
> are looking to the web to provide an answer"? Is it 1:1:1, 
> 100:10:1 or 1:10:100?

I don't think it matters what the ratio is -- as long as tools to
accommodate the different needs are in place.

> I share some of Mary Barkworth's reservations.

So do I -- which is exactly why I advocate the approach that I just
described.  Such a system helps tear down the "authoritarian wall" that
inevitably forms under traditional approaches to establishing "accepted"
classifications.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html