Google for Internet Database of all life, and existing initiatives already doing this

Tue Mar 21 10:29:51 CST 2006

I'm scrambling to get ready for a trip in a few hours, so I don't have the
time to go into the detail that I would like on this thread -- but here is a
quick summary:

1) There have been several, mostly disconnected efforts by various people in
the biodiversity informatics community to establish a dialog with Google
about possibilities for collaboration -- some achieving more traction than
others.

2) There has been a recent effort to coordinate these different initiatives
somewhat -- not so much to unify them into a single conversation, but to at
least spread mutual awareness (both within the biodiversity community, and
within Google) so that the different initiatives can harmonize wherever they
can, and avoid competing with or impeding each other's progress.

3) Over the next few months, more clarity should emerge as to what role(s)
Google might be able to play in assisting our community -- both directly in
terms of technological expertise, and indirectly in terms of adding taxonomic
"intelligence" to Google searches. (Wish I had more time to describe what I
mean by "taxonomic intelligence" -- but I know that if I get started on that
topic now, there's a good chance I'll miss my plane, or perhaps fail to pack
enough underwear for my trip...)

As to the discussion between John/Ken/Doug on the "right" taxonomy, one of
my hopes is to tap Google for their algorithmic ranking expertise to apply
an analog of "PageRank" to any given taxon name -- that is, an objective,
algorithmically derived ranking for each name based on some sort of metric
incorporating the number, recency, and "quality" of each assertion/citation
about each named node in any classification.  The general public could be
presented with the "I'm feeling lucky" equivalent of each name's status,
thus presenting the appearance of a "single" classification.  The more nerdy
types (like us) can be presented with some sort of numeric value
representing the algorithmic "stability" of each name, so we can get a sense
about the extent to which consensus has been reached by the community.
Obviously the ranking algorithm would require a lot of careful thought and
testing (e.g., so it couldn't be "gamed"); and perhaps even more thought
needs to go into quantifying the "quality" metric.  And...obviously...we'd
need a system where historical and current research is reliably added
through a public-access name service or services (GBIF/ITIS/SP2K/uBio/etc.)
so that the algorithm has sufficient raw data to work on.  No easy feat --
but I believe it to be fundamentally possible (and useful), and I imagine an
organization like Google could go a long way to helping our community
develop such a system -- if they are willing (that's one of the "if"s we
hope to sort out over the coming months).
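
To make the ranking idea a bit more concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration, not a worked-out
proposal: the Assertion record, the 0-to-1 "quality" score, the 20-year
half-life used for recency, and the made-up name "Aus bus". It is a simple
weighted vote, not anything like the link analysis behind PageRank. It sums a
quality-and-recency weight for each asserted status of a name, returns the
top-weighted status (the "I'm feeling lucky" answer), and reports that
status's share of the total weight as a rough "stability" value.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Assertion:
    """One published statement about a name (hypothetical record layout)."""
    name: str       # the taxon name being discussed
    status: str     # what the source asserts, e.g. "accepted"
    year: int       # publication year of the assertion
    quality: float  # assumed 0..1 score; how to derive it is the open question


def assertion_weight(a: Assertion, current_year: int,
                     half_life: float = 20.0) -> float:
    """Quality discounted by age: the weight halves every `half_life` years."""
    age = max(0, current_year - a.year)
    return a.quality * 0.5 ** (age / half_life)


def rank_name(assertions: Iterable[Assertion], name: str,
              current_year: int) -> Tuple[Optional[str], float]:
    """Return (top-weighted status, its share of the total weight) for a name."""
    totals = defaultdict(float)
    for a in assertions:
        if a.name == name:
            totals[a.status] += assertion_weight(a, current_year)
    total = sum(totals.values())
    if total == 0:
        return None, 0.0  # no usable assertions for this name
    best = max(totals, key=totals.get)
    return best, totals[best] / total


# Made-up data: one recent assertion and one 20-year-old competing assertion.
demo = [
    Assertion("Aus bus", "accepted", 2006, 1.0),
    Assertion("Aus bus", "synonym of Aus cus", 1986, 1.0),
]
status, stability = rank_name(demo, "Aus bus", current_year=2006)
print(status, round(stability, 2))
```

A stability value near 1.0 would suggest the cited sources largely agree on
the name's status; a value near 0.5 would flag a name still in dispute. A
scheme this simple is trivially "gamed" by piling on assertions, which is
exactly why the real weighting would need the careful thought mentioned above.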

Damn...look at the time.  Gotta go pack my underwear....

Aloha,
Rich