PhenomicDB is a multi-species genotype-phenotype
database for comparative phenomics. The current release
unites the data for many species from several different
sources: MGI, OMIM, FlyBase, WormBase, MAtDB, ZFIN, flyrnai.org,
Phenobank, and CYGD.
PhenomicDB is more than a mechanical collection of primary
sources. Because the data are semantically mapped and integrated,
and because HomoloGene is used as the source of orthology data,
PhenomicDB allows direct comparison of phenotypes within groups of
orthologous genes. During semantic mapping the information
in PhenomicDB had to be generalized to some extent, but
hyperlinks to the original sources allow more in-depth
data mining.
Because of the data diversity, each entry can contain different
types of information, such as annotation of the gene locus,
functional annotation, known orthologues and/or phenotypic
information. The classes of phenotypic information offered
are: phenotypic classification, information about phenotypic
alleles, RNAi phenotypes, disease association, etc. Some
entries lack phenotype information but are stored in the
database because they have an orthologous relationship
to a gene with known phenotype(s). Other records contain
phenotype information only, because the phenotype description
is not associated with a known gene locus.
The search page includes:
Search text box, where the user can type the keywords
to be looked for.
Search examples for the individual search categories.
Search sections drop-down menu, where the user can specify
the search category.
The options are:
All - search in all available categories; this is
the default-selected category;
Symbol and name - search in both gene and phenotype
names and symbols;
Description - search in both gene and phenotype descriptions;
External IDs - search in external identifiers of
primary phenotype and genotype sources;
NCBI Gene IDs - search in NCBI Gene numbers;
Gene product ID - search in identifiers of the mRNA,
protein and genomic sequences;
GO term - search in Gene Ontology keywords;
GO ID - search in Gene Ontology identifiers;
Phenotype Ontology ID - search in Phenotype Ontology
identifiers in phenotype keywords;
Phenotype keyword - search in phenotype keywords;
Experiment name - search in names of the experiments;
Experiment description - search in description of
the experiment, by which the phenotype had been determined;
Cell Line - search in all available cell lines;
Phenotype description - search in phenotype descriptions;
RNAi accession - search in RNAi identifiers;
Reference ID - search in both gene and phenotype reference identifiers;
Tax ID - search by NCBI Taxonomy ID;
Organism Name - search by scientific or common organism name.
If the ‘Use wildcard’ checkbox is checked, the
search engine will add ‘*’ to the beginning
and the end of the typed phrase; if the checkbox is not
checked, the user can manually type ‘*’ at
the beginning and/or the end of the string.
‘Select organisms’ list box, where the user
can specify the organism(s) of interest.
The ‘Restrict query to’ section allows the search
to be refined. If the ‘genotype’
or ‘phenotype’ radio button is chosen, the
search is limited to the corresponding part of the database
only. In addition, if ‘only
genotypes with associated phenotypes’ or ‘only
phenotypes with associated genotypes’ is chosen,
only full entries that hold both a genotype
and a phenotype section are returned. The default value
is "no restriction".
The ‘Select data fields to show’ section
allows the search results page to be customized. The default
view includes organism, gene symbol and name,
as well as NCBI Gene IDs for the genotype objects, and
organism, phenotype symbol, name and description and external
phenotype ID for the phenotype objects.
After defining the search criteria, the user should click
the ‘Search’ button. Clicking the ‘Clear’
button resets all of the entered values.
The database can be queried by typing keywords or phrases
in the search text box. Search terms that contain special
symbols (e.g. "#", "%", "()") can only be used in an
exact phrase search, meaning that they must be enclosed in
quotation marks. If several words are entered, their co-occurrence
will be looked for within the specified search section,
e.g. prostate cancer will retrieve the
entries where the words prostate and cancer
both occur, although they may appear in different text sections.
Placing the query between quotation marks will result in
the exact phrase search, e.g. "prostate cancer".
If the ‘Use wildcard’ checkbox is checked, * will
be appended to the end of the words, e.g. "prostat*
cance*" for phrases and prostat*
and cance* for single words. If the
asterisk is written anywhere other than the end
of a word, it is no longer interpreted as a special symbol
but as a regular character. The operators and,
or, and and not allow the
construction of more complex queries, e.g. "nuclear
lamina" or nucleoplasm will retrieve the
entries containing at least one of the specified terms.
Note that when Boolean operators are placed within a
quoted phrase they are no longer interpreted as operators
and are searched as part of the string. If
multiple Boolean operators are used, round brackets should
be placed around the terms to specify the order in which
the Boolean logic is evaluated, e.g. ("Adenylate
kinase" or ADK*) and "cell proliferation"
will retrieve the entries containing the phrases Adenylate
kinase and cell proliferation,
or words that begin with ADK and the phrase
cell proliferation, whereas "Adenylate
kinase" or (ADK* and "cell proliferation")
will retrieve the entries containing either the phrase Adenylate
kinase, or words that begin with ADK
together with the phrase cell proliferation. If no
brackets are placed, the order of precedence
is and not before and
before or, i.e. "Adenylate
kinase" or ADK* and "cell proliferation"
= "Adenylate kinase" or (ADK* and "cell
proliferation"). Search terms shorter than three
characters are dropped from the search query.
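The precedence rule described above can be illustrated with a small sketch. The code below is not PhenomicDB's actual query parser; it is a hypothetical helper (`tokenize`, `group`) that only shows how explicit operators would be bracketed when "and not" binds tighter than "and", and "and" tighter than "or". It assumes well-formed input and does not model wildcard handling, implicit co-occurrence of adjacent words, or the dropping of short terms.

```python
import re

# A token is a quoted phrase, a round bracket, or a bare word.
# Operators inside a quoted phrase stay part of the phrase.
TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()"]+')

# Binding strength taken from the help text: "and not" > "and" > "or".
PRECEDENCE = {"or": 1, "and": 2, "and not": 3}


def tokenize(query):
    raw = TOKEN.findall(query)
    tokens, i = [], 0
    while i < len(raw):
        word = raw[i].lower()  # the query is not case-sensitive
        if word == "and" and i + 1 < len(raw) and raw[i + 1].lower() == "not":
            tokens.append(("op", "and not"))  # merge into one operator
            i += 2
            continue
        if word in ("and", "or"):
            tokens.append(("op", word))
        elif raw[i] in ("(", ")"):
            tokens.append((raw[i], raw[i]))
        else:
            tokens.append(("term", raw[i]))
        i += 1
    return tokens


def group(query):
    """Return the query with every operator explicitly bracketed."""
    tokens = tokenize(query)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        kind, text = tokens[pos]
        pos += 1
        if kind == "(":
            inner = parse_expr(1)
            if pos >= len(tokens) or tokens[pos][0] != ")":
                raise ValueError("missing closing bracket")
            pos += 1
            return inner
        if kind != "term":
            raise ValueError(f"unexpected token: {text}")
        return text

    def parse_expr(min_prec):
        nonlocal pos
        left = parse_primary()
        while (pos < len(tokens) and tokens[pos][0] == "op"
               and PRECEDENCE[tokens[pos][1]] >= min_prec):
            op = tokens[pos][1]
            pos += 1
            right = parse_expr(PRECEDENCE[op] + 1)
            left = f"({left} {op} {right})"
        return left

    result = parse_expr(1)
    if pos != len(tokens):
        raise ValueError("only explicit operators are handled in this sketch")
    return result


print(group('"Adenylate kinase" or ADK* and "cell proliferation"'))
```

Calling `group` on the unbracketed example from the text yields the same grouping as the bracketed form given there, with "and" evaluated before "or".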
This page will be loaded after the user performs a search.
There are two types of records on the search results page:
genotype records, colored in mauve, followed by phenotype records,
colored in fawn. A red line marks records where
no phenotype is associated with a given genotype, or where
there is only a phenotype for which the genotype is unknown.
By default the search results page shows 20 entries;
further pages can be accessed through the ‘Page’
menu at the top of the page. The number and kind of data
fields shown may vary according to the fields selected
on the search page. The ‘Show entry’
buttons give access to the full information
for the given genotype/phenotype. The ‘Orthologies’
button indicates whether orthology information exists for a given
genotype. This button takes the user to the ‘Orthologies’
page, where more detailed information is given. To select
multiple records, the user can tick one or more
checkboxes and click the ‘Show Orthologies’
button at the bottom of the page. All selected records
are then transferred to the ‘Orthologies’ page
regardless of whether they have orthologues.
Via the "Export" button the user can export manually selected entries or all entries returned by the search query. Two export formats are available: XML and TSV. The XML format exports all available categories; for TSV the user has to select manually which categories to export. If the export exceeds the threshold (more than 100 entries), the request is added to the export queue, and as soon as the export has finished the user receives a notification e-mail with a URL for downloading the exported entries.
NB: The export dialog counts either the selected entries or all entries returned by the search. There are three types of entries: a genotype with all its phenotypes; a genotype without phenotypes; a phenotype without a genotype. For example, a gene with 10 connected phenotypes is counted by the export as one entry. A discrepancy between the result statistics on the "Search result page" (where genotype and phenotype objects are counted) and the Export form (where entries are counted) is therefore normal.
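The difference between the two counts can be made concrete with a short sketch. The function name and the input layout below are assumptions made for illustration only (they are not part of PhenomicDB), and the sketch assumes that each phenotype is linked to at most one genotype.

```python
def count_export_entries(genotype_ids, phenotype_links):
    """Return (objects, entries) for a result set.

    genotype_ids:    genotype IDs in the result set (hypothetical input).
    phenotype_links: (phenotype_id, genotype_id or None) pairs; None marks
                     a phenotype without a known genotype.
    """
    genotypes = set(genotype_ids)
    phenotypes = 0
    orphan_phenotypes = 0
    for _phenotype_id, genotype_id in phenotype_links:
        phenotypes += 1
        if genotype_id is None:
            orphan_phenotypes += 1
        else:
            genotypes.add(genotype_id)
    # Search result page: every genotype and phenotype object is counted.
    objects = len(genotypes) + phenotypes
    # Export form: a genotype with all its phenotypes is one entry, and
    # each phenotype without a genotype is an entry of its own.
    entries = len(genotypes) + orphan_phenotypes
    return objects, entries


# One gene with 10 connected phenotypes: 11 objects, but a single entry.
print(count_export_entries(["g1"], [(f"p{i}", "g1") for i in range(10)]))
```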
The available orthologues together with their phenotypes
(if any) for the selected gene are displayed here. The first
line indicates the ancestor taxonomy group to which the
orthologous genes belong, as well as a hyperlink to the
HomoloGene database. At the bottom of the table, the IDs of curated
homology relationships may be listed. They are hyperlinked
to the corresponding sources, where detailed information
is available. Clicking a ‘Show entry’
button opens the full database record.
The final result page has two sections: genotype colored
in mauve and phenotype, which is in fawn. Some entries may
lack genotype information, since the phenotype description
is not associated with a known gene locus. Others lack phenotype
information. By default this page contains only one phenotype
record. If there are multiple phenotypes for a given genotype,
they can be retrieved as separate pages as well as a complete
list using the phenotype header paging options.
Genotype section contains the following fields:
Internal Genotype ID - internal identifier of PhenomicDB;
Symbol - official or preferred gene symbol;
Name - official or preferred gene name;
Type - the category of the locus. It could be: genes
that encode proteins, genes that encode untranslated
RNA, mapped phenotypes, anonymous DNA segments, models,
Organism Name - species name;
External ID - external identifier, hyperlinked to
the genotype source;
Alias symbols - alias gene symbols;
Alias names - alias gene names;
NCBI Gene ID - with hyperlink to NCBI Gene;
Map position - includes chromosome number, cM-position
Sequences - includes mRNA, protein and genomic accessions,
hyperlinked to the corresponding databases;
Gene ontology - includes GO terms, evidences and
identifiers (hyperlinked to GO database) for the three
basic branches of the GO tree: molecular function,
biological process and cellular component;
Orthologues - includes genes, which are known to
be orthologues of the reported gene;
Description - gene description;
References - includes references related to reported
gene, hyperlinked to the reference database.
The phenotype section may contain many records associated
with a single genotype. In this case the individual phenotypes
are listed on separate pages, which can be accessed from
the heading line of the section.
Internal Phenotype ID - internal identifier of PhenomicDB;
Symbol - official or preferred phenotype symbol;
Name - official or preferred phenotype name;
Organism Name - species name;
External ID - external identifier, hyperlinked to
the phenotype source;
Alias symbols - alias phenotype symbols;
Alias names - alias phenotype names;
Experiment - description of the experiment, by which
the phenotype had been determined as well as external
experiment reference ID (if available);
Experimental Conditions - conditions in which the
experiment is performed;
Keywords - includes phenotype keywords with their
definitions and examples;
Descriptions - phenotype description;
RNA - includes RNAi information such as loop overhang,
si/shRNA sequence and its length as well as external
References - includes references related to reported
phenotype, hyperlinked to the reference database.
The cluster page represents the PhenomicDB data processed
by a text clustering algorithm which groups the genotypes
based on their associated phenotypes' properties (description, name,
The phenotype clustering procedure relies on the notion
of the PhenoDoc entity. A PhenoDoc is a unique combination of a genotype
with an associated phenotype (thus orphaned genotypes or phenotypes
cannot form phenodocs). If a genotype has multiple associated
phenotypes within a single PhenomicDB entry, then this entry
produces multiple phenodocs by repeating the genotypic data
For all phenotypes that participate in a cluster (non-orphaned
phenotypes in clusters of 500 members or fewer) there
is a "Show Cluster" button on the middle results page, which
opens the cluster page.
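The PhenoDoc notion described above can be sketched as follows. The `PhenoDoc` class and the `build_phenodocs` function are hypothetical names chosen for this illustration, and the input layout (one tuple per entry) is an assumption, not the database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenoDoc:
    # One genotype paired with one of its associated phenotypes.
    genotype_id: str
    phenotype_id: str


def build_phenodocs(entries):
    """entries: list of (genotype_id or None, list of phenotype IDs)."""
    docs = []
    for genotype_id, phenotype_ids in entries:
        # Orphaned genotypes or phenotypes cannot form phenodocs.
        if genotype_id is None or not phenotype_ids:
            continue
        # An entry with several phenotypes yields several phenodocs,
        # each repeating the same genotype.
        docs.extend(PhenoDoc(genotype_id, p) for p in phenotype_ids)
    return docs


docs = build_phenodocs([("g1", ["p1", "p2"]), ("g2", []), (None, ["p3"])])
print(docs)
```

In this example only the first entry contributes phenodocs, one per phenotype.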
The cluster page consists of six separate tabs, each of which can present the cluster data as a graph and/or as a table.
In each graph the root node is the cluster member on which the user clicked on the middle results page.
Note, however, that nodes (including the root node) with no connections
that satisfy the cutoff will be missing
from the graph. The tabs that make up the cluster page are as follows:
Cluster Overview - Displays all phenotypes as nodes that have been clustered together based on their textual similarity. With each phenotype node, the name of the associated genotype is displayed. The root node has a distinctive style. Click on a node to learn more. Data is based on calculations using PhenomicDB textual phenotype descriptions and CLUTO v 2.1.2 (Ref: (1))
Cosine Similarity - Displays cosine similarity as edges between all phenotype nodes that are part of the cluster. Similarities below a threshold of 0.3 are not shown as edges. Phenotype nodes without any edges are not shown. Click on nodes or edges for further information. Edges are colored according to a 3-scale range: (<0.6 - light grey; 0.6-0.9 - grey; >0.9 - dark grey). The similarity values are based on calculations using PhenomicDB textual phenotype descriptions and the cosine similarity measure.
The following tabs contain only gene members, not phenodocs:
PPI - All genes associated with a phenotype from the current cluster are shown as blue nodes, and their direct protein-protein interaction (PPI) partners are added in red or green (red if there is no known associated phenotype, green if there is an associated phenotype that is not in the current cluster). Nodes without any edges are not shown. External interactors (red and green nodes) must be connected to at least two cluster members (blue nodes) to be shown. Node relations are based on data from the NCBI Entrez Gene database (Ref: (2)).
GO Similarity - Genes of the phenocluster and the similarity of their Gene Ontology (GO) annotations are displayed. Edges between the gene nodes indicate a similarity between their GO annotations. No edge is shown if the similarity measure is below a threshold of 0.3. Nodes without any edges are not shown. Edges are colored according to a 3-scale range: (<0.6 - light grey; 0.6-0.9 - grey; >0.9 - dark grey). GO associations are based on data from the NCBI Entrez Gene database (Ref: (2)). Calculations are based on the semantic similarity measure by Lin (Ref: (3)).
Orthology - Displays those genes of the phenocluster that are orthologs. The genes displayed here provide evidence of functional similarity through both their orthology and their phenotype similarity. Data is derived from NCBI HomoloGene (Ref: (4)).
AA Sequence Similarity - Displays genes whose associated phenotypes are found in the current cluster, together with their amino acid sequence similarity (computed by Needleman-Wunsch optimal global alignment, Ref: (5)). Edges between two nodes indicate the amino acid sequence similarity of the two genes. No edges are shown for similarities below 0.3. Nodes without any edges are not shown. A value of 1 is equivalent to 100% identity. Edges are colored according to a 3-scale range: (<0.6 - light grey; 0.6-0.9 - grey; >0.9 - dark grey).
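As an illustration of the cosine similarity measure named in the Cosine Similarity tab, the sketch below compares two phenotype descriptions as plain word-count vectors. How PhenomicDB tokenizes and weights terms is not stated here, so the raw term counts and the `term_vector` helper are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    # Assumption: lower-cased alphanumeric words with raw counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    va, vb = term_vector(text_a), term_vector(text_b)
    # Dot product over the words the two descriptions share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("small eye", "small wing"))
```

Two descriptions that share no words score 0, and identical descriptions score 1, which matches the 0 to 1 range used for the edge colors.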
References: 1. Zhao, Y. and G. Karypis, Data clustering in life sciences. Mol Biotechnol, 2005. 31(1): p. 55-80.
2. Maglott, D., J. Ostell, K.D. Pruitt, and T. Tatusova, Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res, 2007. 35(Database issue): p. D26-31.
3. Lin, D., An information-theoretic definition of similarity, in Proc. 15th International Conf. on Machine Learning. 1998, Morgan Kaufmann, San Francisco, CA. p. 296-304.
4. Wheeler, D.L., T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. Dicuccio, R. Edgar, S. Federhen, M. Feolo, L.Y. Geer, W. Helmberg, Y. Kapustin, O. Khovayko, D. Landsman, D.J. Lipman, T.L. Madden, D.R. Maglott, V. Miller, J. Ostell, K.D. Pruitt, G.D. Schuler, M. Shumway, E. Sequeira, S.T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, R.L. Tatusov, T.A. Tatusova, L. Wagner, and E. Yaschenko, Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 2008. 36(Database issue): p. D13-21.
5. Needleman, S.B. and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.
The cutoffs are:
For the overview tab - 500 members or fewer
For the cosine similarity tab - >0.3
For the PPI tab - external interactors (red and green nodes)
must be connected to at least two cluster members (blue nodes)
For the GO similarity tab - >0.3
For the orthology tab - no cutoff
For the AA sequence similarity tab - >0.3
For all tabs (excluding the overview tab) - if there are more
than 1000 edges, only the table view is available
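The PPI cutoff listed above can be sketched as a simple filter. The function name and the pair-based input are hypothetical; the sketch only demonstrates the rule that an external interactor is kept when it is connected to at least two cluster members.

```python
def visible_external_interactors(cluster_genes, interactions, min_links=2):
    """Return the external interactors that satisfy the PPI cutoff.

    cluster_genes: genes of the current cluster (the blue nodes).
    interactions:  (gene_a, gene_b) pairs, treated as undirected.
    """
    cluster = set(cluster_genes)
    partners = {}  # external gene -> set of cluster members it touches
    for a, b in interactions:
        for inside, outside in ((a, b), (b, a)):
            if inside in cluster and outside not in cluster:
                partners.setdefault(outside, set()).add(inside)
    return {g for g, members in partners.items() if len(members) >= min_links}


interactions = [("A", "X"), ("B", "X"), ("C", "Y"), ("A", "B")]
print(visible_external_interactors(["A", "B", "C"], interactions))
```

Here X is kept because it interacts with two cluster members, while Y touches only one and is dropped.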
Each node label in the graph view contains two rows: 1)
Genotype symbol or ID; 2) Phenotype symbol or ID (for gene-centric
tabs this is the symbol of one randomly chosen phenotype from the same cluster).
In addition to the graph view, a table view is also available.
It is the only option when the number of edges
in a graph exceeds the threshold of 1000.
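The similarity cutoff, the three-step edge coloring and the 1000-edge limit can be combined in one small sketch. The constant and function names are hypothetical, and the treatment of a similarity of exactly 0.3 (dropped here, following the ">0.3" cutoff) is an assumption, since the text does not say how that boundary is handled.

```python
SIMILARITY_CUTOFF = 0.3   # edges must exceed this value to be shown
MAX_GRAPH_EDGES = 1000    # above this, only the table view is offered


def edge_colour(similarity):
    # 3-scale range: <0.6 light grey; 0.6-0.9 grey; >0.9 dark grey.
    if similarity < 0.6:
        return "light grey"
    if similarity <= 0.9:
        return "grey"
    return "dark grey"


def prepare_view(edges):
    """edges: list of (node_a, node_b, similarity) tuples."""
    kept = [(a, b, s, edge_colour(s))
            for a, b, s in edges if s > SIMILARITY_CUTOFF]
    view = "table" if len(kept) > MAX_GRAPH_EDGES else "graph"
    return view, kept


print(prepare_view([("a", "b", 0.2), ("a", "c", 0.7), ("b", "c", 0.95)]))
```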
There are two kinds of tooltips (both are activated on left mouse click):
Node tooltips - display additional information
about each node (symbol, name, organism, etc.)
Edge tooltips - display participants symbols/ids and their
If a gene has more than one associated phenotype belonging to the
same cluster, this is indicated by the tooltip (where, analogous to the graph node labels, one
random phenotype from the cluster is displayed, along with
a message explaining that more phenotypes are available for
the particular gene).
Actions and information pane
The pane is situated on the right side and contains the following
K - allows the user to change the K value
Magnification - allows the user to zoom in and out of the graph
Switch view - allows the user to change the data representation
to graph or table view. Also from here the user can show
or hide the taxonomy coloring
Cluster details - contains cluster primary information
Download - contains links for exporting the current graph in the xdot and png file formats (for more details about the xdot format please read here...). Please note that the taxonomy coloring and the marked root node will not be exported as part of the xdot/png files, as they are generated separately. The most appropriate tool for displaying xdot files is Graphviz.
Nodes legend - shows the nodes coloring explanations
Elements can be placed in the query in different orders
according to the user's preferences. The query is not case-sensitive.
All elements except Term are optional. Term, Organisms and
Fields elements can be repeated in the query.