| PhenomicDB is a multi-species genotype-phenotype
database for comparative phenomics. The current release
unites the data for many species from several different
sources: MGI, OMIM, FlyBase, WormBase, MAtDB, ZFIN, flyrnai.org,
Phenobank, and CYGD.
PhenomicDB is more than a mechanical collection of primary
sources. Because the data are semantically mapped and integrated, and
HomoloGene is used as the source of orthology data,
PhenomicDB allows direct comparison of orthologous genes and
their phenotypes within each orthology group. During the semantic
mapping process the information in PhenomicDB had to be
generalized to some extent, but the hyperlinks provided to the
original sources allow further in-depth data mining.
Database contents
Because of the data diversity, each entry can contain different
types of information, such as annotation of the gene locus,
functional annotation, known orthologues and/or phenotypic
information. The classes of phenotypic information offered
are phenotypic classification, information about phenotypic
alleles, RNAi phenotypes, disease associations, etc. Some
entries lack phenotype information but are nevertheless
stored in the database because they have an orthologous relationship
to a gene with known phenotype(s). Other records contain
phenotype information only, because the phenotype description
is not associated with a known gene locus.
Search page
The search page includes:
- Search text box, where the user can type the keywords
to be looked for.
- Search examples for the individual search categories.
- Search sections drop-down menu, where the user can specify
the search category.
The options are:
- All - search in all available categories; this is
the default-selected category;
- Symbol and name - search in both gene and phenotype
names and symbols;
- Description - search in both gene and phenotype
descriptions;
- External IDs - search in external identifiers of
primary phenotype and genotype sources;
- NCBI Gene IDs - search in NCBI Gene numbers;
- Gene product ID - search in identifiers of the mRNA,
protein and genomic sequences;
- GO term - search in Gene Ontology keywords;
- GO ID - search in Gene Ontology identifiers;
- Phenotype Ontology ID - search in Phenotype Ontology
identifiers in phenotype keywords;
- Phenotype keyword - search in phenotype keywords;
- Experiment name - search in names of the experiments;
- Experiment description - search in descriptions of
the experiments by which the phenotypes were determined;
- Cell Line - search in all available cell lines;
- Phenotype description - search in phenotype descriptions;
- RNAi accession - search in RNAi identifiers;
- Reference ID - search in both gene and phenotype
reference identifiers;
- Tax ID - search by NCBI Taxonomy ID;
- Organism Name - search by scientific or common organism
name.
- ‘Use wildcard’ checkbox: if it is checked, the
search engine will add ‘*’ to the beginning
and the end of the typed phrase; if it is not
checked, the user can manually type ‘*’ at
the beginning and/or the end of the string.
- ‘Select organisms’ list box, where the user
can specify the organism(s) of interest.
- ‘Restrict query to’ section, which allows the
search to be refined. If the ‘genotype’
or ‘phenotype’ radio button is chosen, the
search is limited to the corresponding part of the database.
With ‘only
genotypes with associated phenotypes’ or ‘only
phenotypes with associated genotypes’ the search is restricted
in the same way and, in addition, only full entries that hold both a genotype
and a phenotype section are returned. The default value
is ‘no restriction’.
- ‘Select data fields to show’ section, which allows
the search results page to be customized. The default
view includes organism, gene symbol and name,
and NCBI Gene ID for genotype objects, and
organism, phenotype symbol, name and description, and external
phenotype ID for phenotype objects.
After defining the search criteria, the user should click
the ‘Search’ button. Clicking the ‘Clear’
button resets all entered values.
How to search
The database can be queried by typing keywords or phrases
in the search text box. Search terms that contain special
symbols (e.g. "#", "%", "()") can only be used in an
exact phrase search, meaning that they must be enclosed in
quotation marks. If several words are entered, the search looks for
their co-occurrence within the specified search section,
e.g. prostate cancer will retrieve the
entries in which the words prostate and cancer
both occur, although they may appear in different text sections.
Placing the query between quotation marks results in
an exact phrase search, e.g. "prostate cancer".
If the ‘Use wildcard’ checkbox is checked, * will
be appended to the end of each word, e.g. "prostat*
cance*" for phrases and prostat*
and cance* for single words. An
asterisk written anywhere other than at the end
of a word is not interpreted as a special symbol
but as a regular character. The operators and,
or, and and not allow
more complex queries to be constructed, e.g. ‘nuclear
lamina’ or nucleoplasm will retrieve the
entries containing at least one of the specified terms.
Note that Boolean operators placed within a
quoted phrase are not interpreted as operators
but are searched for as part of the string. If
multiple Boolean operators are used, round brackets should
be placed around the terms to specify the order
in which the operators are evaluated, e.g. ("Adenylate
kinase" or ADK*) and "cell proliferation"
will retrieve the entries containing the phrases Adenylate
kinase and cell proliferation,
or words that begin with ADK and the phrase
cell proliferation, whereas "Adenylate
kinase" or (ADK* and "cell proliferation")
will retrieve the entries containing either the phrase Adenylate
kinase, or both words that begin with ADK
and the phrase cell proliferation. If no
brackets are used, the order of precedence
is and not before and
before or, i.e. "Adenylate
kinase" or ADK* and "cell proliferation"
= "Adenylate kinase" or (ADK* and "cell
proliferation"). Search terms shorter than three
characters are ignored.
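The precedence rule described above can be illustrated with a short Python sketch. It is not the parser used by PhenomicDB; it is a minimal local model that takes a query without brackets and makes the implied grouping explicit. The function names (tokenize, split_on, add_brackets) are hypothetical, and the sketch assumes double-quoted phrases and no brackets in the input.

```python
import re

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def tokenize(query):
    """Split a bracket-free query into terms and operators, merging 'and not'."""
    raw = TOKEN.findall(query)
    tokens, i = [], 0
    while i < len(raw):
        is_and = raw[i].lower() == "and"
        if is_and and i + 1 < len(raw) and raw[i + 1].lower() == "not":
            tokens.append("and not")
            i += 2
        else:
            tokens.append(raw[i])
            i += 1
    return tokens


def split_on(tokens, op):
    """Split a token list into groups separated by the operator op."""
    groups, current = [], []
    for tok in tokens:
        if tok.lower() == op:
            groups.append(current)
            current = []
        else:
            current.append(tok)
    groups.append(current)
    return groups


def _render(tokens, ops):
    # ops is ordered from the loosest-binding operator to the tightest.
    if not ops:
        return " ".join(tokens)
    groups = split_on(tokens, ops[0])
    if len(groups) == 1:
        return _render(tokens, ops[1:])
    parts = []
    for group in groups:
        text = _render(group, ops[1:])
        # Bracket any operand that consists of more than one token.
        parts.append(f"({text})" if len(group) > 1 else text)
    return f" {ops[0]} ".join(parts)


def add_brackets(query):
    """Make the grouping implied by 'and not' > 'and' > 'or' explicit."""
    return _render(tokenize(query), ("or", "and", "and not"))


print(add_brackets('"Adenylate kinase" or ADK* and "cell proliferation"'))
```

Operators inside a quoted phrase stay part of the phrase token, which mirrors the rule that quoted operators are searched for literally.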
Search results page
This page is loaded after the user performs a search.
There are two types of records on the search results page:
genotype records, colored in mauve, followed by phenotype records,
colored in fawn. A red line marks records where
no phenotype is associated with a given genotype, or
where there is only a phenotype for which the genotype is unknown.
By default the search results page shows 20 entries;
further pages can be accessed through the ‘Page’
menu at the top of the page. The number and kind of
data fields shown depend on the
fields selected on the search page. The ‘Show entry’
buttons give access to the full information
for a given genotype/phenotype. The ‘Orthologies’
button indicates whether orthology information exists for a given
genotype. It takes the user to the ‘Orthologies’
page, where more detailed information is given. To
view several records at once, the user can select one or more
checkboxes and click the ‘Show Orthologies’
button at the bottom of the page. All selected records
are then transferred to the ‘Orthologies’ page,
regardless of whether they have orthologues.
Orthologies page
The available orthologues of the selected gene, together with their phenotypes
(if any), are displayed here. The first
line shows the ancestor taxonomy group to which the
orthologous genes belong, as well as a hyperlink to the
HomoloGene database. IDs of curated homology relationships
may be listed at the bottom of the table. They are hyperlinked
to the corresponding sources, where detailed information
is available. Clicking the ‘Show entry’
buttons gives access to the full database record.
Final result page
The final result page has two sections: genotype, colored
in mauve, and phenotype, colored in fawn. Some entries
lack genotype information, because the phenotype description
is not associated with a known gene locus. Others lack phenotype
information. By default this page shows only one phenotype
record. If a genotype has multiple phenotypes,
they can be viewed on separate pages or as a complete
list using the paging options in the phenotype header.
- Genotype section contains the following fields:
- Internal Genotype ID - internal identifier of PhenomicDB;
- Symbol - official or preferred gene symbol;
- Name - official or preferred gene name;
- Type - the category of the locus. It could be: genes
that encode proteins, genes that encode untranslated
RNA, mapped phenotypes, anonymous DNA segments, models,
etc.;
- Organism Name - species name;
- External ID - external identifier, hyperlinked to
the genotype source;
- Alias symbols - alias gene symbols;
- Alias names - alias gene names;
- NCBI Gene ID - with hyperlink to NCBI Gene;
- Map position - includes chromosome number, cM-position
and pq-position;
- Sequences - includes mRNA, protein and genomic accessions,
hyperlinked to the corresponding databases;
- Gene ontology - includes GO terms, evidences and
identifiers (hyperlinked to GO database) for the three
basic branches of the GO tree: molecular function,
biological process and cellular component;
- Orthologues - includes genes, which are known to
be orthologues of the reported gene;
- Description - gene description;
- References - includes references related to reported
gene, hyperlinked to the reference database.
- Phenotype section can contain many records associated
with a single genotype. In this case the individual phenotypes
are listed on separate pages, which can be accessed from
the heading line of the section.
- Internal Phenotype ID - internal identifier of PhenomicDB;
- Symbol - official or preferred phenotype symbol;
- Name - official or preferred phenotype name;
- Organism Name - species name;
- External ID - external identifier, hyperlinked to
the phenotype source;
- Alias symbols - alias phenotype symbols;
- Alias names - alias phenotype names;
- Experiment - description of the experiment by which
the phenotype was determined, as well as the external
experiment reference ID (if available);
- Experimental Conditions - conditions under which the
experiment was performed;
- Keywords - includes phenotype keywords with their
definitions and examples;
- Descriptions - phenotype description;
- RNA - includes RNAi information such as loop overhang,
si/shRNA sequence and its length as well as external
identifiers;
- References - includes references related to reported
phenotype, hyperlinked to the reference database.
Cluster page
The cluster page presents the PhenomicDB data processed
by a text clustering algorithm, which groups the genotypes
based on the properties of their associated phenotypes (description, name,
symbol, etc.).
The phenotype clustering procedure relies on the notion
of the PhenoDoc entity. A PhenoDoc is a unique combination of a genotype
with an associated phenotype (thus orphaned genotypes or phenotypes
cannot form PhenoDocs). If a genotype has multiple associated
phenotypes within a single PhenomicDB entry, then this entry
produces multiple PhenoDocs, repeating the genotypic data
for each one.
For more details on the theory behind the PhenoDoc clusters, please refer to the article "Mining
phenotypes for gene function prediction".
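As an illustration of the PhenoDoc notion described above, the following Python sketch expands database entries into genotype-phenotype pairs. The entry layout (a dictionary with "genotype" and "phenotypes" keys) and the function name make_phenodocs are assumptions made for this example only; they do not reflect the internal PhenomicDB schema.

```python
def make_phenodocs(entries):
    """Expand entries into (genotype, phenotype) pairs.

    Each entry is assumed to be a dict with a 'genotype' value (or None)
    and a 'phenotypes' list (possibly empty). This layout is hypothetical.
    """
    phenodocs = []
    for entry in entries:
        genotype = entry.get("genotype")
        phenotypes = entry.get("phenotypes") or []
        if genotype is None:
            continue  # orphaned phenotype: no genotype to pair it with
        # An orphaned genotype has an empty list and contributes nothing.
        for phenotype in phenotypes:
            # The genotype is repeated once per associated phenotype.
            phenodocs.append((genotype, phenotype))
    return phenodocs


entries = [
    {"genotype": "gene A", "phenotypes": ["phenotype 1", "phenotype 2"]},
    {"genotype": "gene B", "phenotypes": []},          # orphaned genotype
    {"genotype": None, "phenotypes": ["phenotype 3"]},  # orphaned phenotype
]
print(make_phenodocs(entries))
```

In this example only the first entry yields PhenoDocs, one for each of its two phenotypes.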
Interface
For every phenotype that participates in a cluster (non-orphaned
phenotypes in clusters of 500 members or fewer) there
is a “Show Cluster” button on the middle results page, which
opens the cluster page.
The cluster page consists of six separate tabs, each of which can present the cluster data as a graph, in tabular form, or both.
In each graph there is a root node, which is the cluster member that the user clicked on the middle results page.
Note, however, that nodes (including the root node) with no connections
that satisfy the cutoff are omitted
from the graph. The tabs that make up the cluster page are as follows:
Cluster Overview – Displays all phenotypes as nodes that have been clustered together based on their textual similarity. With each phenotype node, the name of the associated genotype is displayed. The root node has a distinctive style. Click on a node to learn more. Data is based on calculations using PhenomicDB textual phenotype descriptions and CLUTO v 2.1.2 (Ref: (1))
Cosine Similarity – Displays cosine similarity as edges between all phenotype nodes that are part of the cluster. Similarities below a threshold of 0.3 are not shown as edges. Phenotype nodes without any edges are not shown. Click on nodes or edges for further information. Edges are colored according to a 3-scale range: (<0.6 – light grey; 0.6-0.9 - grey; >0.9 – dark grey). The similarity values are based on calculations using PhenomicDB textual phenotype descriptions and the cosine similarity measure.
The following tabs contain only gene members, not phenodocs:
PPI - All genes associated with a phenotype from the current cluster are shown as blue nodes, and their direct protein-protein interaction (PPI) partners are added in red (if no known phenotype is associated) or green (if an associated phenotype exists that is not in the current cluster). Nodes without any edges are not shown. External interactors (red and green nodes) must be connected to at least two cluster members (blue nodes) to be shown. Node relations are based on data from the NCBI Entrez Gene database (Ref: (2)).
GO Similarity - Displays the genes of the phenocluster and the similarity of their Gene Ontology (GO) annotations. Edges between gene nodes indicate similarity between their GO annotations. No edge is shown if the similarity measure is below a threshold of 0.3. Nodes without any edges are not shown. Edges are colored according to a 3-scale range: (<0.6 – light grey; 0.6-0.9 - grey; >0.9 – dark grey). GO associations are based on data from the NCBI Entrez Gene database (Ref: (2)). Calculations are based on the semantic similarity measure by Lin (Ref: (3)).
Orthology – Displays those genes of the phenocluster that are orthologs. The genes displayed here provide evidence of functional similarity through both their orthology and their phenotype similarity. Data is derived from NCBI HomoloGene (Ref: (4)).
AA Sequence Similarity - Displays the genes whose associated phenotypes are found in the current cluster, together with their amino acid sequence similarity (by Needleman-Wunsch optimal global alignment, Ref: (5)). An edge between two nodes indicates the amino acid sequence similarity of the two genes. No edges are shown for similarities below 0.3. Nodes without any edges are not shown. A value of 1 is equivalent to 100% identity. Edges are colored according to a 3-scale range: (<0.6 – light grey; 0.6-0.9 - grey; >0.9 – dark grey).
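To give a feel for the cosine similarity measure mentioned for the Cosine Similarity tab, here is a minimal Python sketch that compares two phenotype descriptions. It assumes plain word counts as vector components and a simple lowercase tokenizer; the actual term weighting and preprocessing used for PhenomicDB are not described on this page, so the numbers it produces are illustrative only. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase alphanumeric words (an assumed, simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = term_vector(text_a), term_vector(text_b)
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


score = cosine_similarity("small eye", "small wing")
print(round(score, 3))
```

Texts that share no words score 0, identical texts score 1, and the two descriptions in the example share one of their two words.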
References:
1. Zhao, Y. and G. Karypis, Data clustering in life sciences. Mol Biotechnol, 2005. 31(1): p. 55-80.
2. Maglott, D., J. Ostell, K.D. Pruitt, and T. Tatusova, Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res, 2007. 35(Database issue): p. D26-31.
3. Lin, D., An information-theoretic definition of similarity, in Proc. 15th International Conf. on Machine Learning. 1998, Morgan Kaufmann, San Francisco, CA. p. 296-304.
4. Wheeler, D.L., T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. Dicuccio, R. Edgar, S. Federhen, M. Feolo, L.Y. Geer, W. Helmberg, Y. Kapustin, O. Khovayko, D. Landsman, D.J. Lipman, T.L. Madden, D.R. Maglott, V. Miller, J. Ostell, K.D. Pruitt, G.D. Schuler, M. Shumway, E. Sequeira, S.T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, R.L. Tatusov, T.A. Tatusova, L. Wagner, and E. Yaschenko, Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 2008. 36(Database issue): p. D13-21.
5. Needleman, S.B. and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.
The cutoffs are:
- For overview tab – equal or below 500 members
- For Cosine similarity tab - >0.3
- For PPI tab – external interactors (red and green nodes)
must be connected to at least two cluster members (blue
nodes)
- For GO similarity tab - >0.3
- For orthology tab – no cutoff
- For AA sequence similarity tab - >0.3
- For all tabs (excluding overview tab) - >1000 edges
then only table view is available
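The cutoffs listed above can be summarized in a small Python sketch. It is only a model of the documented rules, not the code behind the cluster page. The function names are hypothetical, and two boundary choices are assumptions: a score of exactly 0.3 is treated as hidden, and a graph with exactly 1000 edges is still shown as a graph.

```python
MIN_SCORE = 0.3          # scores at or below this are hidden (assumed boundary)
MAX_GRAPH_EDGES = 1000   # above this only the table view is offered


def edge_color(score):
    """Return the edge color for a similarity score, or None if it is hidden."""
    if score <= MIN_SCORE:
        return None
    if score < 0.6:
        return "light grey"
    if score <= 0.9:
        return "grey"
    return "dark grey"


def graph_view_available(edge_count):
    """The graph view is offered only up to the edge threshold."""
    return edge_count <= MAX_GRAPH_EDGES


def visible_external_interactors(interactions, cluster_genes):
    """Keep external PPI partners linked to at least two cluster members.

    interactions is an iterable of (gene_a, gene_b) pairs.
    """
    partners = {}
    for a, b in interactions:
        for inside, outside in ((a, b), (b, a)):
            if inside in cluster_genes and outside not in cluster_genes:
                partners.setdefault(outside, set()).add(inside)
    return {gene for gene, members in partners.items() if len(members) >= 2}


cluster = {"A", "B", "C"}
ppi = [("A", "X"), ("B", "X"), ("C", "Y")]
print(edge_color(0.75), visible_external_interactors(ppi, cluster))
```

In the example, X interacts with two cluster members and would be displayed, while Y interacts with only one and would be dropped.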
Each node label in the graph view contains two rows: 1)
Genotype Symbol or ID; 2) Phenotype Symbol or ID (for gene-centric
tabs this is the symbol of a randomly chosen phenotype from the same cluster).
In addition to the graph view, a table view is also available.
It is the only option when the number of edges
in a graph exceeds the threshold of 1000.
Tooltips
There are two kinds of tooltips (both are activated on left mouse click):
Node tooltips – display additional information
about each node (symbol, name, organism, etc.)
Edge tooltips – display participants symbols/ids and their
score value
If a gene has more than one associated phenotype belonging to the
same cluster, this is indicated in the tooltip (where, analogous to the graph node labels, one
random phenotype from the cluster is displayed, along with
a message explaining that more phenotypes are available for
the particular gene).
Actions and information pane
The pane is situated on the right side and contains the following
sections:
K – allows the user to change the K value
Magnification – allows the user to zoom in and out of the graph
Switch view – allows the user to change the data representation
to graph or table view. Also from here the user can show
or hide the taxonomy coloring
Cluster details – contains cluster primary information
Download – contains links for exporting the current graph in xdot and png file formats (for more details about the xdot format please read here...). Please note that the taxonomy coloring and the marked root node are not exported as part of the xdot/png files, because they are generated separately. The most appropriate tool for displaying xdot files is Graphviz.
Nodes legend – shows the nodes coloring explanations
Edges legend – shows the edge coloring values
Taxonomy legend – shows the organisms colors
PhenomicDB Query Specification
http://217.91.40.111/query.asp?term=[...]&sections=[...]&wildcard=[yes/no]&organisms=[...]&fields=[...]&gp=[0..4]&form=[search|orthologs]
Short description
Elements can be placed in the query in any order.
The query is not case-sensitive.
All elements except Term are optional. The Term, Organisms and
Fields elements can be repeated in the query.
Examples:
http://217.91.40.111/query.asp?term='tumor necrosis factor'&organisms=human&organisms=mouse
http://217.91.40.111/query.asp?term='tumor-suppressor gene region'&organisms=human
http://217.91.40.111/query.asp?term=('Adenylate kinase' or ADK*) and 'cell proliferation'&organisms=human
http://217.91.40.111/query.asp?term='prostate canc antimetastasis'&wildcard=yes
http://217.91.40.111/query.asp?term=1291&sections=NCBI Gene IDs
http://217.91.40.111/query.asp?term=1291&sections=NCBI Gene IDs&form=orthologs
http://217.91.40.111/query.asp?term=89&sections=HomoloGene ID
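Query URLs like the examples above could also be assembled programmatically. The following Python sketch is one possible way to do so with the standard library. The function name build_query_url is hypothetical, the parameter names follow the query specification, and the sketch assumes that the server accepts standard URL encoding (spaces as "+", quotes as "%27"), which this page does not state.

```python
from urllib.parse import urlencode

BASE_URL = "http://217.91.40.111/query.asp"


def build_query_url(term, sections=None, wildcard=None,
                    organisms=(), fields=(), gp=None, form=None):
    """Assemble a query URL; only term is required.

    term may be a single string or a list of strings, since the Term
    element can be repeated.
    """
    terms = [term] if isinstance(term, str) else list(term)
    if not terms:
        raise ValueError("at least one term is required")
    params = [("term", t) for t in terms]
    if sections is not None:
        params.append(("sections", sections))
    if wildcard is not None:
        params.append(("wildcard", "yes" if wildcard else "no"))
    # Organisms and Fields can be repeated, one key=value pair each.
    params.extend(("organisms", organism) for organism in organisms)
    params.extend(("fields", field) for field in fields)
    if gp is not None:
        if gp not in (0, 1, 2, 3, 4):
            raise ValueError("gp must be a one-digit code from 0 to 4")
        params.append(("gp", str(gp)))
    if form is not None:
        if form not in ("search", "orthologs"):
            raise ValueError("form must be 'search' or 'orthologs'")
        params.append(("form", form))
    return BASE_URL + "?" + urlencode(params)


print(build_query_url("'tumor necrosis factor'", organisms=["human", "mouse"]))
print(build_query_url("1291", sections="NCBI Gene IDs", form="orthologs"))
```

Building the parameter list as ordered pairs keeps repeated elements such as organisms intact, which a plain dictionary would collapse into a single value.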
Term
A single word, part of a word, or several words (as a phrase or not)
can be used as query terms. The search is not case-sensitive.
Term is a required element of the query.
Sections
If not specified the search will be performed in all sections.
- All (default)
- Symbol and name
- Description
- External IDs
- NCBI Gene IDs
- Gene product ID
- GO term
- GO ID
- Phenotype Ontology ID
- Phenotype keyword
- Experiment name
- Experiment description
- Cell line
- Phenotype description
- RNAi accession
- Reference ID
- Tax ID
- Organism name
Wildcard
Accepts yes or no and corresponds to the ‘Use wildcard’ checkbox
on the search page.
Organisms
Search can be performed for one, for several or for all
organisms (default).
- All (default)
- Caenorhabditis elegans
- Dictyostelium discoideum
- Fruit fly
- Human
- Mouse
- Yeast
- Zebrafish
- Other
Fields
Multiple data fields can be selected in the query:
- Default (default)
- Official gene symbol
- Gene description
- External genotype ID
- NCBI Gene ID
- mRNA ID
- GO term
- GO ID
- Alias gene symbols
- Official gene name
- Alias gene names
- Organism name
- Protein ID
- Genomic ID
- Chromosome
- Localization
- Phenotype keyword
- Phenotype name
- Phenotype symbol
- Phenotype description
- External phenotype ID
- Experiment description
- RNAi accession
- sh/siRNA sequence
GP
The query can be restricted to genotype and/or phenotype information
by using a one-digit code:
- 0 - no restriction (default)
- 1 - genotypes
- 2 - phenotypes
- 3 - only genotypes with associated phenotypes
- 4 - only phenotypes with associated genotypes
Form
Choose between the following two options:
- search - to show the list of all entries specified by
the term
- orthologs - to show the list of the orthologs of all
entries specified by the term
If this tag is missing, a search form is assumed. |