Channel: WP7: Biodiversity literature access and data mining

M7.19 - Review of pilot of reference de-duplication software

Date: 31/07/2013
Deliverable or Milestone: Milestone

Because RefBank accepts community-contributed references, our repository will contain a large number of duplicate references. These arise from:

  • users loading textual, as opposed to marked-up, references, so that a reference can be entered in any style (Harvard, Chicago, etc.) as laid out in the published source.
  • near-identical references that vary only by a comma or space, caused by the individual stylistic quirks of the contributors.
  • near-identical references that vary only by typographical errors, whether present in the original source or introduced later through re-keying the reference.

We consider it important to RefBank's success that there are as few barriers as possible to user contributions: users should simply upload references as they are, without having to reformat them to suit RefBank. This design decision leads to the problem of duplicate references; however, we consider it preferable to resolve the duplicates within RefBank rather than to prevent these references from being loaded at all, which would hinder the workflow of our potential contributors.

The problem of de-duplication is still unresolved within bibliographic reference management. We will need to develop a tool that automatically identifies the canonical form of a reference from among the many references loaded into RefBank. Our approach is based on graph theory: each reference forms a node in a graph, and the emergent centroid is taken to be the canonical form. Each reference will be decomposed into its parts so that the most appropriate algorithm can be applied to each part when calculating the centroid, for example Jaro-Winkler for author names. This canonical form of a reference will be returned in future searches. The other references will not be deleted, but simply marked as unavailable to general searches. Manual curation will be enabled so that a user can override RefBank's canonical form if necessary.
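As an illustration only, the centroid idea can be sketched in a few lines of Python. This is not RefBank's implementation: the function names are hypothetical, the whole reference string is compared rather than its decomposed parts, and the standard library's difflib ratio stands in for field-specific measures such as Jaro-Winkler. The sketch treats a cluster of references as a fully connected graph weighted by pairwise similarity and picks the node with the highest total similarity to all the others.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1].

    Assumption: difflib's ratio is a stand-in here; a real tool would
    split the reference into fields and use a measure suited to each
    (for example Jaro-Winkler for author names).
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonical_form(references: Sequence[str]) -> str:
    """Return the reference most similar, in total, to all the others."""
    if not references:
        raise ValueError("need at least one reference")

    def total_similarity(i: int) -> float:
        # Sum of edge weights from node i to every other node in the cluster.
        return sum(
            similarity(references[i], references[j])
            for j in range(len(references))
            if j != i
        )

    # On a tie, max() keeps the earliest reference in the input order.
    best = max(range(len(references)), key=total_similarity)
    return references[best]


# Invented example strings, not real citations.
cluster = [
    "Smith, J. (2001). A study of beetles. Journal of Examples, 1, 1-10.",
    "Smith J (2001) A study of beetles. Journal of Examples 1: 1-10",
    "Smith, J. (2001). A stduy of beetles. Journal of Examples, 1, 1-10.",
]
chosen = canonical_form(cluster)
```

In line with the approach described above, the references not chosen would be kept and flagged as hidden from general searches rather than deleted, so that a curator can later promote a different one.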

Reporter: Dauvit King
Completed: Ongoing
