ProMiner: rule-based protein and gene entity recognition
Description
19 years ago
Background: Identifying gene and protein names in biomedical text
is challenging because the corresponding nomenclature has evolved
over time. This has led to multiple synonyms for individual genes
and proteins, as well as names that are ambiguous with other gene
names or with general English words. The Gene List Task of the
BioCreAtIvE challenge evaluation enables comparison of systems that
address protein and gene name identification on common benchmark
data.

Methods: The ProMiner system uses a pre-processed synonym
dictionary to identify potential name occurrences in biomedical
text and to associate protein and gene database identifiers with
the detected matches. It follows a rule-based approach, and its
search algorithm is geared towards recognition of multi-word names
[1]. To account for the large number of ambiguous synonyms in the
considered organisms, the system has been extended to use specific
variants of the detection procedure for highly ambiguous and
case-sensitive synonyms. Based on all synonyms detected in one
abstract, the most plausible database identifiers are associated
with the text. Organism specificity is addressed by a simple
procedure based on organism names that are additionally detected in
the abstract.

Results: The extended ProMiner system has been applied to the test
cases of the BioCreAtIvE competition with highly encouraging
results. In blind predictions, the system achieved an F-measure of
approximately 0.8 for the organisms mouse and fly and about 0.9 for
the organism yeast.
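
The dictionary lookup described under Methods can be illustrated with a minimal sketch. This is not ProMiner's actual search algorithm, which the abstract does not specify. It is a generic greedy longest-match lookup that prefers multi-word names and treats flagged synonyms as case-sensitive. The dictionary contents, the identifiers ("ID:1" and so on), and the names SYNONYMS, tokenize, and find_matches are all hypothetical and chosen only for illustration. Disambiguation across an abstract and organism filtering are left out.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: synonym -> (database identifier, case_sensitive).
# The identifiers are made up for this sketch.
SYNONYMS: Dict[str, Tuple[str, bool]] = {
    "tumor necrosis factor alpha": ("ID:1", False),
    "tumor necrosis factor": ("ID:2", False),
    "WAS": ("ID:3", True),  # ambiguous with the English word "was"
}


def tokenize(text: str) -> List[str]:
    """Split text into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def find_matches(
    text: str,
    synonyms: Dict[str, Tuple[str, bool]],
    max_len: int = 5,
) -> List[Tuple[str, str]]:
    """Return (matched text, identifier) pairs using greedy longest match."""
    # Case-insensitive synonyms are indexed in lower case;
    # case-sensitive synonyms must match exactly as written.
    insensitive = {s.lower(): i for s, (i, cs) in synonyms.items() if not cs}
    sensitive = {s: i for s, (i, cs) in synonyms.items() if cs}

    tokens = tokenize(text)
    matches: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(tokens):
        matched_len = 0
        # Try the longest candidate first so multi-word names win
        # over shorter names contained in them.
        for n in range(min(max_len, len(tokens) - pos), 0, -1):
            candidate = " ".join(tokens[pos:pos + n])
            if candidate in sensitive:
                matches.append((candidate, sensitive[candidate]))
                matched_len = n
                break
            if candidate.lower() in insensitive:
                matches.append((candidate, insensitive[candidate.lower()]))
                matched_len = n
                break
        # Skip past a match, or advance by one token if nothing matched.
        pos += matched_len if matched_len else 1
    return matches


if __name__ == "__main__":
    sentence = "Tumor necrosis factor alpha was induced, but WAS was not."
    for surface, identifier in find_matches(sentence, SYNONYMS):
        print(surface, identifier)
```

In this sketch the four-word name is preferred over the shorter "tumor necrosis factor" entry, and the lower-case word "was" is ignored because the synonym "WAS" is flagged as case-sensitive.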