Gene and protein nomenclature in public databases
Description
18 years ago
Background: Frequently, several alternative names are in use for
biological objects such as genes and proteins. Applications like
manual literature search, automated text-mining, named entity
identification, gene/protein annotation, and linking of knowledge
from different information sources require knowledge of all
names in use for a given gene or protein. Various
organism-specific or general public databases aim at organizing
knowledge about genes and proteins. These databases can be used for
deriving gene and protein name dictionaries. So far, little is
known about how these databases differ in terms of size,
ambiguity, and overlap.

Results: We compiled five gene and protein
name dictionaries for each of five model organisms (yeast,
fly, mouse, rat, and human) from different organism-specific and
general public databases. We analyzed the degree of ambiguity of
gene and protein names within and between dictionaries, as well as
against a lexicon of common English words and domain-related non-gene
terms, and we compared the data sources in terms of the size of the
extracted dictionaries and the overlap of synonyms between them. The
study shows that the number of genes/proteins and synonyms covered
in individual databases varies significantly for a given organism,
and that the degree of ambiguity of synonyms varies significantly
between different organisms. Furthermore, it shows that, despite
considerable efforts of co-curation, the overlap of synonyms in
different data sources is rather moderate and that the degree of
ambiguity of gene names with common English words and
domain-related non-gene terms varies depending on the considered
organism.

Conclusion: These results indicate that
the combination of data contained in different databases allows the
generation of gene and protein name dictionaries that contain
significantly more names in use than dictionaries obtained from
individual data sources. Furthermore, curation of combined
dictionaries considerably increases size and decreases ambiguity.
The entries of the curated synonym dictionary are available for
manual querying, editing, and PubMed or Google search via the
ProThesaurus-wiki. For automated querying via custom software, we
offer a web service and an example client application.
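
To make the measures named in the abstract concrete, here is a minimal Python sketch of how ambiguity within a dictionary, ambiguity against an English lexicon, and synonym overlap between two sources might be computed. It is not the authors' method: the function names, the case-insensitive matching, the use of the Jaccard index as the overlap measure, and the toy gene IDs and synonyms are all assumptions made for illustration.

```python
from collections import defaultdict


def invert(dictionary):
    """Map each lower-cased synonym to the set of gene IDs that use it."""
    index = defaultdict(set)
    for gene_id, synonyms in dictionary.items():
        for name in synonyms:
            index[name.lower()].add(gene_id)
    return index


def ambiguous_within(dictionary):
    """Synonyms that refer to more than one gene in the same dictionary."""
    return {name for name, ids in invert(dictionary).items() if len(ids) > 1}


def ambiguous_with_lexicon(dictionary, lexicon):
    """Synonyms that are also entries in a lexicon of ordinary words."""
    return set(invert(dictionary)) & {word.lower() for word in lexicon}


def synonym_overlap(dict_a, dict_b):
    """Jaccard index of the two synonym sets (an assumed overlap measure)."""
    names_a, names_b = set(invert(dict_a)), set(invert(dict_b))
    union = names_a | names_b
    return len(names_a & names_b) / len(union) if union else 0.0


def merge(dict_a, dict_b):
    """Combine two dictionaries by taking the union of synonyms per gene ID."""
    merged = defaultdict(set)
    for source in (dict_a, dict_b):
        for gene_id, synonyms in source.items():
            merged[gene_id] |= set(synonyms)
    return dict(merged)


# Invented toy data: IDs and names are placeholders, not database content.
source_a = {"G1": {"abc1", "white"}, "G2": {"xyz2", "abc1"}}
source_b = {"G1": {"ABC1", "wht"}, "G3": {"xyz2"}}
lexicon = {"white", "was"}

combined = merge(source_a, source_b)
print(ambiguous_within(source_a))
print(ambiguous_with_lexicon(source_a, lexicon))
print(synonym_overlap(source_a, source_b))
print(len(invert(combined)), "distinct names after merging")
```

In this toy setup the merged dictionary holds more distinct names than either source alone, which mirrors the abstract's conclusion in miniature. A real pipeline would also need the curation step the abstract mentions, which this sketch does not attempt.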