A Standardized Reference Data Set for Vertebrate Taxon Name Resolution
Abstract
Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or otherwise problematic, because they were often captured only once, during accessioning of specimens, or have a history of label changes without clear provenance. These issues must be addressed before records can be used reliably in research. Still missing, however, are an assessment of the scope of the problem, an estimate of the effort needed to solve it, and a way to improve the effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as fewer than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide a feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. Based on model selection approaches, a set of candidate variables (geographic region, year collected, higher-level clade, and each institution's volume of digitally accessible data) and their 2-way interactions all predict the probability that a record has taxon name issues. We strongly encourage further experiments that use this reference data set to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
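As a rough illustration of how a reference data set of this kind could be used to compare name-resolution tools, the sketch below scores one tool's output against human-vetted answers. The function name `score_resolver`, the dictionary layout, and the four example names are assumptions made for illustration; they are not taken from the data set or from any particular tool.

```python
def score_resolver(reference, resolved):
    """Return the fraction of reference names that a tool resolved correctly.

    reference: dict mapping verbatim name string -> human-vetted valid name
    resolved:  dict mapping verbatim name string -> name returned by the tool
    (Both layouts are hypothetical, chosen only for this sketch.)
    """
    if not reference:
        raise ValueError("reference set is empty")
    correct = sum(
        1
        for verbatim, valid in reference.items()
        if resolved.get(verbatim) == valid
    )
    return correct / len(reference)


# Illustrative entries only; not drawn from the published data set.
reference = {
    "Canis lupis": "Canis lupus",              # misspelling
    "Felis concolor": "Puma concolor",         # synonym
    "Passer domesticus": "Passer domesticus",  # already valid
    "Mus musculus": "Mus musculus",            # already valid
}

# Output of a hypothetical tool that fixes spelling but misses synonymy.
tool_output = {
    "Canis lupis": "Canis lupus",
    "Felis concolor": "Felis concolor",
    "Passer domesticus": "Passer domesticus",
    "Mus musculus": "Mus musculus",
}

accuracy = score_resolver(reference, tool_output)
print(f"Resolved correctly: {accuracy:.0%}")
```

Running the same scoring function over the output of several tools would give directly comparable figures, provided each tool is fed the same verbatim strings.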
Journal title:
Volume 11, issue
Pages -
Publication date: 2016