ASC Collectors File

Peter D. Wilson dflora!peter at WPA.COM
Fri Apr 16 10:58:39 CDT 1993


Yesterday I downloaded the ASC Collectors file from Taxacom so as to
incorporate it into the new Tropicos database, but I am having a lot of
problems with the quality of the data. Is this data for general release,
or have I jumped the gun a little? I also tried to e-mail David Boufford
about the problems, but the address I have (boufford.oeb at mhsgw.harvard.edu)
bounced. Does anybody have a more current address for him?

One of the things I observe in the data is that it is essentially manual;
ie. it is designed to be used to generate listings, and those listings
must be interpreted by humans. I think it is time that we established
machine-readable standards for our data. By this I don't mean in the
sense of something like the TDWG Standards; I mean it in the sense of
having datasets which are decodable by reasonably simple programs and
which don't involve applications of artificial intelligence. Adding
little textual comments to data fields, inserting private 'codes' like
=, ?, ???, ! or !!!, and using ambiguous delimiters in lists or
multivalued fields all work against this.
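
To make that concrete (a purely illustrative sketch; I am assuming fields
delimited with '$', as in the ASC file the script below reads), a handful
of lines of perl is enough to flag every field carrying one of these
private codes, provided the delimiters themselves are unambiguous:

while (<>) {
        chop;
        @fields = split(/\$/,$_);
        foreach $f (@fields) {
                # flag the =, ?, ???, ! and !!! style annotations
                print "line $.: suspect code in \"$f\"\n" if $f =~ /[=?!]/;
        }
}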

One of the ways in which we can achieve this is to actually clean the
data with the assistance of programs, rather than relying simply on human
scanning. I use perl scripts such as the one below. They take only a few
minutes to clone and modify, and in a few seconds they show up problems
which would take immense amounts of time to catch by manual scanning.
Admittedly the resulting listing is still scanned by a human, but at
least it is in a form which shows up problems. The scripts can also be
rapidly modified to scan only for particular problems, like numbers in
fields which are 'supposed' to be alphabetic (a small variant along these
lines follows the script below). More sophisticated scripts can look for
acceptably formatted person names, plant names, dates, lat-long
coordinates etc. I can provide perl subroutines which do some of this;
maybe other perlers have useful data-cleaning scripts we could use.
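
To give a flavour of the more sophisticated checks (the formats here are
my own assumptions, not any agreed standard), subroutines along these
lines can vet a date field or a decimal lat-long pair, and a loop like
the one in the script below can then flag every record that fails:

# accept dates like 1993, 1993-04 or 1993-04-16; anything else is suspect
sub ok_date {
        local($d) = @_;
        return $d =~ /^\d{4}(-\d{2}(-\d{2})?)?$/;
}

# accept decimal degree pairs like "38.6270 -90.1994"
sub ok_latlong {
        local($lat,$long) = split(' ',$_[0]);
        return 0 unless defined($lat) && defined($long);
        return 0 unless $lat =~ /^-?\d+(\.\d+)?$/ && $long =~ /^-?\d+(\.\d+)?$/;
        return $lat >= -90 && $lat <= 90 && $long >= -180 && $long <= 180;
}

# e.g.  print "bad date for $collector_id: $dates\n" unless &ok_date($dates);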

We have got to get out of this manual mindedness. Just because we put
things on computers does not mean that we have in any meaningful way
computerized anything !!!

'nough said.


PERL SCRIPT:

#!/LocalApps/perl
# This script prints out the collector_id and the value of a field, sorts
# on the value then prints out the value and all id's which have that
#  value. Each new value is introduced with an asterisk

# usage:    script.pl <asc_collector.data >problems

open(PRESORTFILE,"| sort -t@ +1 >/tmp/Tmp$$") || die "Can't open sort file.";


while (<>) {
        chop;

        @fields = split(/\$/,$_);
        foreach (@fields) {s/ +/ /g;s/^ //; s/ $//;}

        ($collector_id,$alt_name,$abbrev,$country,$dates,$institution,
         $full_name,$created,$edited) = @fields;


         $field = $country;            # the field to scan; change this to check another field



        printf PRESORTFILE "%s@%s\n",$collector_id,$field  if $field;

}

close(PRESORTFILE);                             # wait for sort to finish
die "sort failed." if $?;                       # sordid sort
open(SORTEDFILE, "</tmp/Tmp$$") || die "Can't open sorted file.";

$maxids_perline = 5;            # format 5 specimen id's per line
$id_count = 0;                  # id's output for current collector
$old_problem = "";              # switch variable

while (<SORTEDFILE>) {
        ($specimen_id, $new_problem) = split(/@/,$_);
        if($old_problem ne $new_problem) {
#               printf "\n";
                $old_problem = $new_problem;
                printf "\n*\n%s", $new_problem;
#               printf "%s", $specimen_id; # for representative id
                $id_count = 0;
        }
#       for all id's
        if($id_count >= $maxids_perline) { print "\n"; $id_count = 0;}
        printf " %6s", $specimen_id; $id_count++;
}
printf "\n";

# remove the temp sort file
unlink "/tmp/Tmp$$";
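
As mentioned above, the same script is easily narrowed to one particular
problem. For instance (just a sketch; which fields ought to be purely
alphabetic is my guess), replacing the unconditional printf in the first
loop with the lines below turns it into a scan for digits in a name
field:

        # in place of the unconditional printf in the first while loop:
        $field = $full_name;                    # or $institution, $country ...
        printf PRESORTFILE "%s@%s\n",$collector_id,$field
                if $field && $field =~ /\d/;    # flag any digits in the field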


--
--------------------------------------------------------------------------
peter d. wilson
   NeXTMail preferred -- dflora!peter at wpa.com
   no NeXTMail       -- wilsonp at mobot.org     (Missouri Botanical Garden)
--------------------------------------------------------------------------



