A simple approach for protein name identification: prospects and limits
Description
19 years ago
Background: Significant parts of biological knowledge are available
only as unstructured text in articles of biomedical journals. By
automatically identifying gene and gene product (protein) names and
mapping these to unique database identifiers, it becomes possible
to extract and integrate information from articles and various data
sources. We present a simple and efficient approach that identifies
gene and protein names in texts and returns database identifiers
for matches. It has been evaluated in the recent BioCreAtIvE entity
extraction and mention normalization task by an independent jury.
Methods: Our approach is based on the use of synonym lists that map
the unique database identifiers for each gene/protein to the
different synonym names. For yeast and mouse, synonym lists were
used as provided by the organizers who generated them from public
model organism databases. The synonym list for fly was generated
directly from the corresponding organism database. The lists were
then extensively curated in a largely automated procedure and matched
against MEDLINE abstracts by exact text matching. Rule-based and
support vector machine-based post filters were designed and applied
to improve precision. Results: Our procedure showed high recall and
precision with F-measures of 0.897 for yeast and 0.764/0.773 for
mouse in the BioCreAtIvE assessment (Task 1B) and 0.768 for fly in
a post-evaluation. Conclusion: The results were close to the best
among all submissions. Depending on the properties of the synonyms, it
can be crucial to consider context and to filter out erroneous matches.
This is especially important for fly, which has a very challenging
nomenclature for the protein name identification task. Here, the
support vector machine-based post filter proved to be very
effective.
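
The matching step described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the identifiers, the synonym entries, the stop-word list and the minimum-length rule are all hypothetical, and the support vector machine-based post filter is left out. It only shows the general idea of exact matching against a synonym list, followed by a simple rule-based filter, plus the usual F-measure (harmonic mean of precision and recall).

```python
import re

# Hypothetical synonym list: database identifier -> synonym names.
# Identifiers and entries are made up for illustration only.
SYNONYMS = {
    "GENE:0001": ["CDC28", "cyclin-dependent kinase 1"],
    "GENE:0002": ["white", "w"],
}

# Hypothetical rule-based filter data: synonyms that are also common
# English words tend to produce erroneous matches.
COMMON_WORDS = {"white", "was", "for"}


def build_index(synonyms):
    """Invert the synonym list: synonym name -> set of identifiers."""
    index = {}
    for db_id, names in synonyms.items():
        for name in names:
            index.setdefault(name, set()).add(db_id)
    return index


def find_identifiers(text, index, min_length=3, stop_words=COMMON_WORDS):
    """Return identifiers whose synonyms occur verbatim in the text."""
    found = set()
    for name, ids in index.items():
        # Rule-based filter (assumed rules): drop very short names
        # and names that are common English words.
        if len(name) < min_length or name.lower() in stop_words:
            continue
        # Exact, case-sensitive match that must not sit inside a longer word.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text):
            found.update(ids)
    return found


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    abstract = "Mutations in CDC28 arrest the cell cycle; the white band was visible."
    print(find_identifiers(abstract, build_index(SYNONYMS)))
```

In this toy input the word "white" appears as ordinary English, so the filter suppresses it. This is the kind of erroneous match that the abstract says must be filtered out, especially for fly.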