data sharing

Peter Rauch anamaria at GRINNELL.BERKELEY.EDU
Fri Dec 4 15:45:22 CST 1998


On Fri, 4 Dec 1998, Julian Humphries wrote:
> At 11:07 AM 12/4/98 -0800, Peter Rauch wrote:
> >If used data are to be well-understood (by the user), then I'd argue
> >that the audit trail for those data should be made available to the user
> >of the data.
>
> I think Peter and others arguing for a strong audit trail and public
> access to such a feature might not be fully appreciative of the
> complexities involved at both the data management and public
> access points in the system.

I can't speak for others, but Peter is fully appreciative of the
complexities involved. Peter is also fully appreciative of the
consequences of building systems that hide/ignore/create junk data.

> ...  But to implement an audit trail as defined by U Smith
   ...
> is a monumental programming effort.

"Monumental" is relative, and building audit trailing is a cost, but why
does the concept of "audit trail" exist, and why have many information
systems been built with audit trailing? People make mistakes, and people
have determined that the cost of those mistakes is significant enough to
justify keeping track of them. Do mistakes happen in "collections world"? Are those
mistakes of any consequence? Should users of collections data be
concerned about their prior use of data with mistakes (and if not
mistakes, then of changed opinions and updated knowledge, like when the
identity of an organism is changed)?

> It is unclear how Peter would handle the delay in acquisition and
> display of data that would result if we waited on a system that
> behaves as above.

I thought the data being put, at considerable expense, into information
systems derived some part of their value from their reliability and
integrity. If you are saying that it's unnecessary to document the
integrity of the data, then I wonder why we are expending the funds to
capture the data --only presumably correct, and in fact carrying some
degree of error, uncertainty, or limited knowledge-- in the first place.

As far as "delays" go, I think the first step --adding the marginal cost
to archive transactional (change) data-- would be prudent and not create
significant delays in something we've been waiting decades, if not
centuries, for. Providing public access to those audit archives, as
needed, on demand, can be placed second in priority.
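
For concreteness, here is a minimal sketch of what I mean by archiving
change transactions --in Python with SQLite, with hypothetical table and
column names, not a prescription for any particular system. The point is
only that each change carries its own who/when/what/old/new record:

    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("collection.db")

    # Hypothetical specimen table plus a change-log table that archives
    # every change transaction: who, when, which field, old and new value.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS specimen (
            id INTEGER PRIMARY KEY, taxon TEXT, country TEXT);
        CREATE TABLE IF NOT EXISTS change_log (
            log_id     INTEGER PRIMARY KEY,
            changed_at TEXT, changed_by TEXT,
            table_name TEXT, record_id INTEGER, field TEXT,
            old_value  TEXT, new_value TEXT, reason TEXT);
    """)

    def update_field(record_id, field, new_value, who, reason):
        """Apply a change and archive the transaction in the same step."""
        # Field name is interpolated directly for brevity; a real system
        # would validate it against the list of known columns.
        (old_value,) = conn.execute(
            f"SELECT {field} FROM specimen WHERE id = ?", (record_id,)).fetchone()
        conn.execute(f"UPDATE specimen SET {field} = ? WHERE id = ?",
                     (new_value, record_id))
        conn.execute(
            "INSERT INTO change_log (changed_at, changed_by, table_name,"
            " record_id, field, old_value, new_value, reason)"
            " VALUES (?,?,?,?,?,?,?,?)",
            (datetime.now(timezone.utc).isoformat(), who, "specimen",
             record_id, field, old_value, new_value, reason))
        conn.commit()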

>  What Hugh
> seems to be saying is that if we wait on these features until we start
> the process of computerization and online access, we will have
> missed the boat.

Ironically, the project proposal that Hugh pointed to earlier is loaded
with strong implications for the need for audit trailing. I don't know
how to reconcile those implications with the argument that such audits
are unnecessary.

> You have to decide what is a "change" and where to log
> it.  For example if a data entry person mistypes a record and goes
> back to change it right away, is that a change?

Yes. Why wouldn't it be? Does that "change" matter? Well, that's a
question for the person who used the datum _before_ the change was made
to answer. Why would others presume to know how important a particular
change is with respect to a particular use? (In this example, it appears
that the data did not have a chance to actually be used, so in effect
you might say it never existed, and there'd be no need to document the
change. But this is a trivial case. More interesting cases involve
data that have already been used, or at least were likely to have
been used.)

>  If not, how and where do
> you distinguish?

Just record the transaction and remember it. Let those who will possibly
be impacted by the change be concerned with whether it is "important".

> (I can tell you focus groups responded that they wanted
> control over what was audited but didn't want it to take any extra effort;
> how's that for unrealistic.)

Audit trailing can be automated, so I'm not sure what (significant)
"extra effort" you refer to, beyond that required to design the system.
Recovery of specific states of the system can be costly. Whether to
expend the cost is a matter of judgement, by the user, isn't it?
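
To illustrate the "automated" part --a sketch only, assuming the
hypothetical specimen and change_log tables above-- a database trigger
can capture the change with no extra keystrokes from the person doing
the editing:

    import sqlite3

    conn = sqlite3.connect("collection.db")

    # Hypothetical trigger: any change to the taxon field is logged
    # automatically; data entry staff do nothing beyond their normal edit.
    conn.executescript("""
        CREATE TRIGGER IF NOT EXISTS log_taxon_change
        AFTER UPDATE OF taxon ON specimen
        WHEN old.taxon IS NOT new.taxon
        BEGIN
            INSERT INTO change_log
                (changed_at, table_name, record_id, field, old_value, new_value)
            VALUES
                (datetime('now'), 'specimen', old.id, 'taxon', old.taxon, new.taxon);
        END;
    """)
    conn.commit()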

>  Are changes recorded at the record or field level?

At the "important" level, of course. Now, what's important? The datum
that was used by someone, surely. Do we wish to help that user find the
changed datum, or do we want to just point them in the general direction
(at the "record level")? I don't think we can do the latter because it
makes it very difficult for the user to determine whether their "datum"
is the target of the change, among other reasons. There are other
reasons as well for preferring to identify the changed element
(regardless of how that change transaction is ultimately recorded for
posterity).

> Do you distinguish between changes that are mass updates
> (e.g. changing Brasil to Brazil) from individual record changes?

I think you are referring to two different issues here --lots of records
being changed, versus a single tag being changed that is pointed to by
lots of records. The distinction is important with respect to how the
user goes about determining which of his data were affected by the
change, but the fact remains that the user's data were changed, and the
user should be the one to worry about whether the change is important.
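
A sketch of how that might be archived (same hypothetical tables as
before): the mass update is logged once per affected record, but all the
rows share a batch identifier, so a user can later ask exactly which of
their records were touched by that one decision.

    import sqlite3, uuid
    from datetime import datetime, timezone

    conn = sqlite3.connect("collection.db")

    # Hypothetical mass update: 'Brasil' -> 'Brazil', one log row per
    # affected record, all grouped under a single batch id.
    batch = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    for (rec_id,) in conn.execute(
            "SELECT id FROM specimen WHERE country = 'Brasil'").fetchall():
        conn.execute(
            "INSERT INTO change_log (changed_at, table_name, record_id,"
            " field, old_value, new_value, reason) VALUES (?,?,?,?,?,?,?)",
            (now, "specimen", rec_id, "country", "Brasil", "Brazil",
             "mass update, batch " + batch))
    conn.execute("UPDATE specimen SET country = 'Brazil' WHERE country = 'Brasil'")
    conn.commit()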

One other aspect of your example --a change in spelling that does not
substantively change the _meaning_ of the datum-- is also of concern to
the user: did the user retrieve only the Brazil data and fail to search
for that mass of Brasil data? While this is really a faulty search, it's
not reasonable to fault the user for failing to know that "Brasil" might
be in the database.

> Rollback systems in commercial databases produce
> huge (really huge) files if you don't commit the changes, so
> the goal of "easy" undo's for the life of a database is not
> realistic.

Before we talk about rollbacks, commits, and other such practices, let's
at least admit to errors and changes! (In fact, let's ask why those
concepts even exist.) And let's ask and answer honestly whether changes
in used data should be divulged to the user (or at least be discoverable
by the user).

> Do searches look at the audit trail records?

Doesn't that depend on the needs of the user? I would assume that most
uses would be for data believed to be _currently_ correct, so in most
cases, I would assume that searching audit trails would not be
interesting. Being able to look back, at some point in the future, to
determine if the data I used have been changed is certainly of some
interest to a data user.
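
As a sketch of that look-back (same hypothetical change_log as before,
and assuming the user kept the ids and the date of the records they
retrieved):

    import sqlite3

    conn = sqlite3.connect("collection.db")

    # Hypothetical: the records I downloaded, and when I downloaded them.
    my_record_ids = [101, 102, 347]
    retrieved_on = "1998-12-04"

    placeholders = ",".join("?" * len(my_record_ids))
    rows = conn.execute(
        "SELECT record_id, field, old_value, new_value, changed_at"
        " FROM change_log"
        " WHERE record_id IN (" + placeholders + ") AND changed_at > ?",
        (*my_record_ids, retrieved_on)).fetchall()
    for row in rows:
        print("changed since my retrieval:", row)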

If one were trying to understand an existing interpretation of data
accessed at some point in the past (esp. if the interpretation doesn't
seem to correlate with current data), then indeed one may wish to invest
in reconstructing older representations of the database.
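
Reconstructing an older representation is then a matter of unwinding the
log --a rough sketch, same hypothetical tables, and certainly a cost one
would only pay when it matters:

    import sqlite3

    conn = sqlite3.connect("collection.db")

    def taxon_as_of(record_id, as_of):
        """Reconstruct what the taxon field held at time 'as_of'."""
        (current,) = conn.execute(
            "SELECT taxon FROM specimen WHERE id = ?", (record_id,)).fetchone()
        later = conn.execute(
            "SELECT old_value FROM change_log"
            " WHERE record_id = ? AND field = 'taxon' AND changed_at > ?"
            " ORDER BY changed_at",
            (record_id, as_of)).fetchall()
        # The earliest change made *after* 'as_of' carries, in old_value,
        # the value that was current at 'as_of'.
        return later[0][0] if later else current

    print(taxon_as_of(101, "1998-12-04"))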

> How do you distinguish between specimens literally reidentified
> from simple nomenclatural updates (a very important issue
> for the "quality of data" issue)?

Is this important to distinguish? If so, then should it not be made
possible to do? If not, then the question is moot.

> How do you display the record history on a 10 table joined
> view?  This might be a typical edit form for a catalog entry,
> and is already overloaded with information.  Where do you
> add previous versions of dozens of fields?

Again, are you suggesting that there are constructs so complex that no
one should dare to ask whether the data are or were valid? Hmmmm.

> The critical issue seems to be a disagreement on the costs/benefits
> of more complex, more featured data acquisition and management
> software.

Aha! We can agree here :>)

>   It's very easy (we might all agree, even) to say that we want
> these 100 features in our curatorial system.  But if we put a time
> and dollar cost on each (both development and staff time) and then
> add a data quality/data retrieval/data value benefit to each feature
> we would only come to a single solution if we all agreed on the numbers.

Right.

> Does the audit trail feature increase development and curatorial
> staff costs by 5%, 15%, or 25%?  I am not sure, but the answer to that
> would certainly help focus the debate.

Yes, I can go along with your plea. It would also be useful, to give the
other part of the equation, to assess the value (or damage) of using bad
data, the likelihood that a certain amount of bad data exist, and the
likelihood that _those_ particular bad data are really of any consequence
if used.

So, I still think that any provider of primary data needs to take
responsibility for documenting the changes in data they've served up.

The alternative is for primary data providers to argue that their data,
correct or not, are of no consequential use to anyone, so it doesn't
matter which data are correct and which are not. Or perhaps
the primary data provider would like to argue that the data served up
are so robust that errors and changes are simply unimportant. I will
leave it for the users of their data to argue the value of that
position.

Peter
