General description

PhenomicDB is a multi-species genotype-phenotype database for comparative phenomics. The current release unites the data for many species from several different sources: MGI, OMIM, FlyBase, WormBase, MAtDB, ZFIN, flyrnai.org, Phenobank, and CYGD.

PhenomicDB is more than a mechanical collection of primary sources. Because the source data are semantically mapped and integrated, and because HomoloGene is used as the source of orthology data, PhenomicDB allows direct comparison of phenotypes within groups of orthologous genes. During the semantic mapping process the information in PhenomicDB had to be generalized to some extent, but the hyperlinks provided to the original sources allow further in-depth data mining.

Database contents

Because of the diversity of the data, each entry can contain different types of information, such as annotation of the gene locus, functional annotation, known orthologues and/or phenotypic information. The classes of phenotypic information offered include phenotypic classification, information about phenotypic alleles, RNAi phenotypes, disease association, etc. Some entries lack phenotype information but are stored in the database because they have an orthologous relationship to a gene with known phenotype(s). Other records contain phenotype information only, since the phenotype description is not associated with a known gene locus.

Search page

The search page includes:

  • Search text box, where the user can type the keywords to search for.
  • Search examples for the individual search categories.
  • Search sections drop-down menu, where the user can specify the search category.
    The options are:
    • All - search in all available categories; this is the default-selected category;
    • Symbol and name - search in both gene and phenotype names and symbols;
    • Description - search in both gene and phenotype descriptions;
    • External IDs - search in external identifiers of primary phenotype and genotype sources;
    • NCBI Gene IDs - search in NCBI Gene numbers;
    • Gene product ID - search in identifiers of the mRNA, protein and genomic sequences;
    • GO term - search in Gene Ontology keywords;
    • GO ID - search in Gene Ontology identifiers;
    • Phenotype Ontology ID - search in Phenotype Ontology identifiers in phenotype keywords;
    • Phenotype keyword - search in phenotype keywords;
    • Experiment name - search in names of the experiments;
    • Experiment description - search in descriptions of the experiments by which the phenotypes were determined;
    • Cell Line - search in all available cell lines;
    • Phenotype description - search in phenotype descriptions;
    • RNAi accession - search in RNAi identifiers;
    • Reference ID - search in both gene and phenotype reference identifiers;
    • Tax ID - search by NCBI Taxonomy ID;
    • Organism Name - search by scientific or common organism name.
  • ‘Use wildcard’ checkbox. If it is checked, the search engine will append ‘*’ to the beginning and the end of the typed phrase; if it is not checked, the user can manually type ‘*’ at the beginning and/or the end of the string.
  • ‘Select organisms’ list box, where the user can specify the organism(s) of interest.
  • ‘Restrict query to’ section, which allows the search to be refined. If the ‘genotype’ or ‘phenotype’ radio button is chosen, the search is limited to the corresponding part of the database. If ‘only genotypes with associated phenotypes’ or ‘only phenotypes with associated genotypes’ is chosen, only full entries that hold both a genotype and a phenotype section are returned. The default value is "no restriction".
  • ‘Select data fields to show’ section, which allows the search results page to be customized. By default, genotype objects are shown with organism, gene symbol and name, and NCBI Gene ID; phenotype objects are shown with organism, phenotype symbol, name and description, and external phenotype ID.

After defining the search criteria, the user should click the ‘Search’ button. Clicking the ‘Clear’ button resets all entered values.

How to search

The database can be queried by typing keywords or phrases in the search text box. Search terms that contain special symbols (e.g. "#", "%", "()") can only be used in an exact phrase search, meaning that they must be enclosed in quotation marks.

If several words are entered, their co-occurrence is searched for within the specified search section, e.g. prostate cancer will retrieve the entries where the words prostate and cancer both occur, although they may appear in different text sections. Placing the query between quotation marks results in an exact phrase search, e.g. "prostate cancer".

If the ‘Use wildcard’ checkbox is checked, * is appended to the end of the words, e.g. "prostat* cance*" for phrases and prostat* and cance* for single words. An asterisk written anywhere other than at the end of a word is not interpreted as a special symbol but as a regular character.

The operators and, or, and and not allow the construction of more complex queries, e.g. "nuclear lamina" or nucleoplasm will retrieve the entries containing at least one of the specified terms. Note that Boolean operators placed within a quoted phrase are not interpreted as operators but are searched as part of the string.

If multiple Boolean operators are used, round brackets should be placed around the terms to specify the order of evaluation. For example, ("Adenylate kinase" or ADK*) and "cell proliferation" will retrieve the entries containing either the phrase Adenylate kinase together with the phrase cell proliferation, or words that begin with ADK together with the phrase cell proliferation. In contrast, "Adenylate kinase" or (ADK* and "cell proliferation") will retrieve the entries containing either the phrase Adenylate kinase, or words that begin with ADK together with the phrase cell proliferation. If no brackets are used, the order of precedence is and not before and before or, i.e. "Adenylate kinase" or ADK* and "cell proliferation" = "Adenylate kinase" or (ADK* and "cell proliferation").

Search terms of fewer than three characters are ignored.

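The operator precedence described in this section (and not before and before or, with round brackets overriding it) can be illustrated with a small parser. This is only a sketch: the grammar actually used by the PhenomicDB search engine is not documented here, the function names `tokenize` and `parse` are hypothetical, and treating the operators as case-insensitive is an assumption.

```python
import re

# Quoted phrases stay whole (quotes included); brackets are separate tokens.
TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()]+')

def tokenize(query):
    """Split a query into terms, brackets and the operators OR, AND, AND NOT."""
    raw = TOKEN.findall(query)
    tokens, i = [], 0
    while i < len(raw):
        low = raw[i].lower()
        if low == "and" and i + 1 < len(raw) and raw[i + 1].lower() == "not":
            tokens.append("AND NOT")
            i += 2
        elif low in ("and", "or"):
            tokens.append(low.upper())
            i += 1
        else:
            tokens.append(raw[i])
            i += 1
    return tokens

def parse(query):
    """Return a nested (operator, left, right) tuple for the query."""
    tokens = tokenize(query)
    pos = 0
    order = ["OR", "AND", "AND NOT"]  # loosest binding first

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        tok = take()
        if tok == "(":
            node = level(0)
            take()  # consume the closing ")"
            return node
        return tok

    def level(idx):
        # Parse operands of the next tighter level, then fold left on this operator.
        tighter = (lambda: level(idx + 1)) if idx + 1 < len(order) else primary
        node = tighter()
        while peek() == order[idx]:
            take()
            node = (order[idx], node, tighter())
        return node

    return level(0)

unbracketed = parse('"Adenylate kinase" or ADK* and "cell proliferation"')
bracketed = parse('"Adenylate kinase" or (ADK* and "cell proliferation")')
print(unbracketed == bracketed)
```

Because and binds more tightly than or, the unbracketed query and the explicitly bracketed one produce the same tree, which matches the equivalence stated in the text.
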
Search results page

This page is loaded after the user performs a search. There are two types of records on the search results page: genotype records, colored in mauve, followed by phenotype records, colored in fawn. A red line marks records where no phenotype is associated with a given genotype, or where there is only a phenotype for which the genotype is unknown. By default the search results page shows 20 entries; further pages can be accessed through the ‘Page’ menu at the top of the page. The number and kind of data fields shown may vary according to the fields selected on the search page. The ‘Show entry’ buttons give access to the complete information for a given genotype/phenotype. An ‘Orthologies’ button indicates that orthology information exists for a given genotype; it takes the user to the ‘Orthologies’ page, where more detailed information is given. To view several records at once, the user can select one or more checkboxes and click the ‘Show Orthologies’ button at the bottom of the page. All selected records are then transferred to the ‘Orthologies’ page regardless of whether they have orthologues.

Via the "Export" button the user can export manually selected entries or all entries produced by the search query. Two export formats are available: XML and TSV. In XML format all possible categories are exported; for TSV the user has to define manually which categories are exported. If the export exceeds the threshold (more than 100 entries), the request is added to the export queue, and as soon as the export has finished the user receives a notification e-mail with a URL for downloading the exported entries. NB: The export dialog counts selected entries or all entries returned by the search. There are three types of entries: a genotype with all its phenotypes; a genotype without phenotypes; a phenotype without a genotype. For example, one gene with 10 connected phenotypes is counted as one entry. A discrepancy between the result statistics on the search results page (where genotype and phenotype objects are counted) and the export form (where entries are counted) is therefore normal.
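
The difference between counting objects and counting entries can be sketched as follows. This is an illustration only, not PhenomicDB code: the function names and the input layout are hypothetical, and it assumes that every phenotype linked to a genotype refers to a genotype that is itself part of the result set.

```python
EXPORT_QUEUE_THRESHOLD = 100  # exports with more entries than this are queued

def count_results(genotype_ids, phenotype_links):
    """Return (entries, objects) for a result set.

    genotype_ids: genotype identifiers in the result set.
    phenotype_links: (phenotype_id, genotype_id or None) pairs; None marks a
    phenotype without a known genotype.
    """
    genotypes = set(genotype_ids)
    links = list(phenotype_links)
    # Only phenotypes without a genotype form entries of their own.
    orphan_phenotypes = sum(1 for _, gid in links if gid is None)
    entries = len(genotypes) + orphan_phenotypes
    objects = len(genotypes) + len(links)
    return entries, objects

def is_queued(entries):
    """True if an export of this size would go to the export queue."""
    return entries > EXPORT_QUEUE_THRESHOLD

# One gene with 10 connected phenotypes: one entry, eleven objects.
entries, objects = count_results(["g1"], [(f"p{i}", "g1") for i in range(10)])
print(entries, objects)
```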

Orthologies page

The available orthologues of the selected gene, together with their phenotypes (if any), are displayed here. The first line indicates the ancestor taxonomy group to which the orthologous genes belong and provides a hyperlink to the HomoloGene database. IDs of curated homology relationships may be listed at the bottom of the table. They are hyperlinked to the corresponding sources, where detailed information is available. By clicking on the ‘Show entry’ buttons the user can access the full database record.

Final result page

The final result page has two sections: the genotype section, colored in mauve, and the phenotype section, colored in fawn. Some entries lack genotype information, since the phenotype description is not associated with a known gene locus. Others lack phenotype information. By default this page shows only one phenotype record. If there are multiple phenotypes for a given genotype, they can be retrieved as separate pages or as a complete list using the paging options in the phenotype header.

  • Genotype section contains the following fields:
    • Internal Genotype ID - internal identifier of PhenomicDB;
    • Symbol - official or preferred gene symbol;
    • Name - official or preferred gene name;
    • Type - the category of the locus. It could be: genes that encode proteins, genes that encode untranslated RNA, mapped phenotypes, anonymous DNA segments, models, etc.;
    • Organism Name - species name;
    • External ID - external identifier, hyperlinked to the genotype source;
    • Alias symbols - alias gene symbols;
    • Alias names - alias gene names;
    • NCBI Gene ID - with hyperlink to NCBI Gene;
    • Map position - includes chromosome number, cM-position and pq-position;
    • Sequences - includes mRNA, protein and genomic accessions, hyperlinked to the corresponding databases;
    • Gene ontology - includes GO terms, evidences and identifiers (hyperlinked to GO database) for the three basic branches of the GO tree: molecular function, biological process and cellular component;
    • Orthologues - includes genes, which are known to be orthologues of the reported gene;
    • Description - gene description;
    • References - includes references related to reported gene, hyperlinked to the reference database.
  • Phenotype section can contain many records associated with a single genotype. In this case the individual phenotypes are listed on separate pages, which can be accessed from the heading line of the section. It contains the following fields:
    • Internal Phenotype ID - internal identifier of PhenomicDB;
    • Symbol - official or preferred phenotype symbol;
    • Name - official or preferred phenotype name;
    • Organism Name - species name;
    • External ID - external identifier, hyperlinked to the phenotype source;
    • Alias symbols - alias phenotype symbols;
    • Alias names - alias phenotype names;
    • Experiment - description of the experiment by which the phenotype was determined, as well as the external experiment reference ID (if available);
    • Experimental Conditions - conditions under which the experiment was performed;
    • Keywords - includes phenotype keywords with their definitions and examples;
    • Descriptions - phenotype description;
    • RNA - includes RNAi information such as loop overhang, si/shRNA sequence and its length as well as external identifiers;
    • References - includes references related to reported phenotype, hyperlinked to the reference database.

Cluster page

The cluster page represents the PhenomicDB data processed by a text clustering algorithm which groups the genotypes based on their associated phenotypes' properties (description, name, symbol, etc.).

The phenotype clustering procedure relies on the notion of the PhenoDoc entity. A PhenoDoc is a unique combination of a genotype with an associated phenotype (thus orphaned genotypes or phenotypes cannot form PhenoDocs). If a genotype has multiple associated phenotypes within a single PhenomicDB entry, then this entry produces multiple PhenoDocs by repeating the genotypic data for each phenotype.
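
As a minimal sketch of this definition, the function below expands entries into genotype-phenotype pairs. The dictionary layout and the name `build_phenodocs` are hypothetical and chosen only for illustration; the actual PhenomicDB data model is not described here.

```python
def build_phenodocs(entries):
    """Expand entries into one genotype-phenotype pair per associated phenotype.

    Each entry is assumed to be a dict with an optional "genotype" value and
    an optional "phenotypes" list.
    """
    phenodocs = []
    for entry in entries:
        genotype = entry.get("genotype")
        phenotypes = entry.get("phenotypes") or []
        if genotype is None:
            continue  # orphaned phenotypes cannot form a pair
        for phenotype in phenotypes:
            # The genotypic data is repeated for every associated phenotype.
            phenodocs.append({"genotype": genotype, "phenotype": phenotype})
    return phenodocs

entries = [
    {"genotype": "geneA", "phenotypes": ["ph1", "ph2", "ph3"]},
    {"genotype": "geneB", "phenotypes": []},      # orphaned genotype
    {"genotype": None, "phenotypes": ["ph4"]},    # orphaned phenotype
]
print(len(build_phenodocs(entries)))
```

Only the first entry contributes, once per phenotype; the two orphaned entries produce nothing.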

For more details on the theory behind the PhenoDoc clusters please refer to the article "Mining phenotypes for gene function prediction".


For all phenotypes that participate in a cluster (non-orphaned phenotypes in clusters with at most 500 members) there is a "Show Cluster" button on the middle results page which opens the cluster page.
The cluster page consists of six separate tabs, each of which can represent the cluster data as a graph and/or in tabular form. In each graph there is a root node, which is the cluster member on which the user clicked on the middle results page. Note, however, that nodes (including the root node) that have no connections satisfying the cutoff are missing from the graph. The tabs that make up the cluster page are as follows:

  • Cluster Overview - Displays all phenotypes that have been clustered together based on their textual similarity as nodes. With each phenotype node, the name of the associated genotype is displayed. The root node has a distinctive style. Click on a node to learn more. Data is based on calculations using PhenomicDB textual phenotype descriptions and CLUTO v 2.1.2 (Ref: (1)).

  • Cosine Similarity - Displays cosine similarity as edges between all phenotype nodes that are part of the cluster. Similarities below a threshold of 0.3 are not shown as edges. Phenotype nodes without any edges are not shown. Click on nodes or edges for further information. Edges are colored according to a 3-scale range: (<0.6 - light grey; 0.6-0.9 - grey; >0.9 - dark grey). The similarity values are based on calculations using PhenomicDB textual phenotype descriptions and the cosine similarity measure.
  • The following tabs contain only gene members, not phenodocs:

  • PPI - All genes associated with a phenotype from the current cluster are shown as blue nodes, and their direct protein-protein interaction (PPI) partners are added in red or green (red if there is no known associated phenotype, green if there is an associated phenotype that is not in the current cluster). Nodes without any edges are not shown. External interactors (red and green nodes) must be connected to at least two cluster members (blue nodes) to be shown. Node relations are based on data from the NCBI Entrez Gene database (Ref: (2)).
  • GO Similarity - Displays the genes of the phenocluster and the similarity of their Gene Ontology (GO) annotations. Edges between the gene nodes indicate a similarity between their GO annotations. No edge is shown if the similarity measure is below a threshold of 0.3. Nodes without any edges are not shown. Edges are colored according to a 3-scale range: (<0.6 - light grey; 0.6-0.9 - grey; >0.9 - dark grey). GO associations are based on data from the NCBI Entrez Gene database (Ref: (2)). Calculations are based on the semantic similarity measure by Lin (Ref: (3)).
  • Orthology - Displays those genes of the phenocluster that are orthologs. The genes displayed here provide evidence of functional similarity through both their orthology and their phenotype similarity. Data is derived from NCBI HomoloGene (Ref: (4)).
  • AA Sequence Similarity - Displays the genes whose associated phenotypes are found in the current cluster, together with their amino acid sequence similarity (by Needleman-Wunsch optimal global alignment (Ref: (5))). Edges between two nodes indicate the amino acid sequence similarity of the two genes. No edges are shown for similarities below 0.3. Nodes without any edges are not shown. A value of 1 is equivalent to 100% identity. Edges are colored according to a 3-scale range: (<0.6 - light grey; 0.6-0.9 - grey; >0.9 - dark grey).

  • References:

    1. Zhao, Y. and G. Karypis, Data clustering in life sciences. Mol Biotechnol, 2005. 31(1): p. 55-80.

    2. Maglott, D., J. Ostell, K.D. Pruitt, and T. Tatusova, Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res, 2007. 35(Database issue): p. D26-31.

    3. Lin, D., An information-theoretic definition of similarity, in Proc. 15th International Conf. on Machine Learning. 1998, Morgan Kaufmann, San Francisco, CA. p. 296-304.

    4. Wheeler, D.L., T. Barrett, D.A. Benson, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. Dicuccio, R. Edgar, S. Federhen, M. Feolo, L.Y. Geer, W. Helmberg, Y. Kapustin, O. Khovayko, D. Landsman, D.J. Lipman, T.L. Madden, D.R. Maglott, V. Miller, J. Ostell, K.D. Pruitt, G.D. Schuler, M. Shumway, E. Sequeira, S.T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, R.L. Tatusov, T.A. Tatusova, L. Wagner, and E. Yaschenko, Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 2008. 36(Database issue): p. D13-21.

    5. Needleman, S.B. and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.

    The cutoffs are:

    • For overview tab - clusters with 500 members or fewer
    • For Cosine similarity tab - >0.3
    • For PPI tab - external interactors (red and green nodes) must be connected to at least two cluster members (blue nodes)
    • For GO similarity tab - >0.3
    • For orthology tab - no cutoff
    • For AA sequence similarity tab - >0.3
    • For all tabs (excluding overview tab) - if there are more than 1000 edges, only the table view is available
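
The similarity cutoff, the three-step edge colour scale and the 1000-edge limit can be sketched together as below. This is a hedged illustration: plain term counts are assumed for the cosine measure (the weighting actually used for the PhenomicDB calculations is not stated here), the function names are hypothetical, and the treatment of scores lying exactly on 0.3, 0.6 or 0.9 is an assumption, since the text does not say which side those values fall on.

```python
import math
from collections import Counter

EDGE_CUTOFF = 0.3        # similarities below this are not drawn as edges
MAX_GRAPH_EDGES = 1000   # above this, only the table view is available

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def edge_color(score):
    """Return the edge colour for a similarity score, or None if hidden."""
    if score < EDGE_CUTOFF:
        return None
    if score < 0.6:
        return "light grey"
    if score <= 0.9:
        return "grey"
    return "dark grey"

def graph_view_available(edge_count):
    """True if the graph view is offered in addition to the table view."""
    return edge_count <= MAX_GRAPH_EDGES

score = cosine_similarity(["short", "tail"], ["short", "tail", "kinked"])
print(round(score, 3), edge_color(score))
```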

    Each node label in the graph view contains two rows: 1) genotype symbol or ID; 2) phenotype symbol or ID (for gene-centric tabs this is the symbol of one randomly chosen phenotype from the same cluster).
    In addition to the graph view, a table view is also available. It is the only option when the number of edges in a graph exceeds the threshold of 1000.


    There are two kinds of tooltips (both are activated on left mouse click):

  • Node tooltips - display additional information about each node (symbol, name, organism, etc.)
  • Edge tooltips - display participants symbols/ids and their score value

  • If a gene has more than one associated phenotype belonging to the same cluster, this is indicated in the tooltip: analogous to the graph node labels, one random phenotype from the cluster is displayed, along with a message explaining that more phenotypes are available for that gene.

    Actions and information pane

    The pane is situated on the right side and contains the following sections:

  • K - allows the user to change the K value

  • Magnification - allows the user to zoom in and out of the graph

  • Switch view - allows the user to switch the data representation between graph and table view. The taxonomy coloring can also be shown or hidden from here

  • Cluster details - contains cluster primary information

  • Download - contains links for exporting the current graph in xdot and png file formats (for more details about the xdot format please read here...). Please note that the taxonomy coloring and the marked root node are not exported as part of the xdot/png files, as they are generated separately. The most appropriate tool for displaying xdot files is Graphviz

  • Nodes legend - shows the nodes coloring explanations

  • Edges legend - shows the edge coloring values

  • Taxonomy legend - shows the organisms colors

PhenomicDB Query Specification

    query.asp?term=[...]&sections=[...]&wildcard=[yes/no]&organisms=[...]&fields=[...]&gp=[0..4]&form=[search|orthologs]

    Short description

    The elements can be placed in the query in any order. The query is not case-sensitive. All elements except Term are optional. The Term, Organisms and Fields elements can be repeated in the query.


    Examples

    query.asp?term='tumor necrosis factor'&organisms=human&organisms=mouse

    query.asp?term='tumor-suppressor gene region'&organisms=human

    query.asp?term=('Adenylate kinase' or ADK*) and 'cell proliferation'&organisms=human

    query.asp?term='prostate canc antimetastasis'&wildcard=yes

    query.asp?term=1291&sections=NCBI Gene IDs

    query.asp?term=1291&sections=NCBI Gene IDs&form=orthologs

    query.asp?term=89&sections=HomoloGene ID


    Term

    A single word, part of a word, or several words (as a phrase or not) can be used as the query term. The search is not case-sensitive. Term is a required element of the query.


    Sections

    If no section is specified, the search is performed in all sections.

    • All (default)
    • Symbol and name
    • Description
    • External IDs
    • NCBI Gene IDs
    • Gene product ID
    • GO term
    • GO ID
    • Phenotype Ontology ID
    • Phenotype keyword
    • Experiment name
    • Experiment description
    • Cell line
    • Phenotype description
    • RNAi accession
    • Reference ID
    • Tax ID
    • Organism name


    Wildcard

    • No (default)
    • Yes


    Organisms

    The search can be performed for one, several or all organisms (default).

    • All (default)
    • Caenorhabditis elegans
    • Dictyostelium discoideum
    • Fruit fly
    • Human
    • Mouse
    • Rat
    • Yeast
    • Zebrafish
    • Other


    Fields

    Multiple data fields can be selected in the query:

    • Default (default)
    • Official gene symbol
    • Gene description
    • External genotype ID
    • NCBI Gene ID
    • mRNA ID
    • GO term
    • GO ID
    • Alias gene symbols
    • Official gene name
    • Alias gene names
    • Organism name
    • Protein ID
    • Genomic ID
    • Chromosome
    • Localization
    • Phenotype keyword
    • Phenotype name
    • Phenotype symbol
    • Phenotype description
    • External phenotype ID
    • Experiment description
    • RNAi accession
    • sh/siRNA sequence


    Gp

    The query is restricted to genotype and/or phenotype information by means of a one-digit code:

    • 0 - no restriction (default)
    • 1 - genotypes
    • 2 - phenotypes
    • 3 - only genotypes with associated phenotypes
    • 4 - only phenotypes with associated genotypes


    Form

    Choose between the following two options:

    • search - to show the list of all entries specified by the term
    • orthologs - to show the list of the orthologs of all entries specified by the term

    If this element is missing, search is assumed.
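
    A query string following this specification can be assembled programmatically. The sketch below uses Python's standard `urllib.parse.urlencode`; the helper name `build_query_url` is hypothetical, the base is left as the relative path `query.asp` because no host is given here, and it is an assumption that the server accepts percent-encoded values with spaces written as "+". For simplicity the helper takes a single term, although the specification allows the element to be repeated.

```python
from urllib.parse import urlencode

def build_query_url(term, base="query.asp", sections=None, wildcard=None,
                    organisms=(), fields=(), gp=None, form=None):
    """Assemble a query URL from the elements of the specification."""
    if not term:
        raise ValueError("term is the only required element")
    params = [("term", term)]
    if sections is not None:
        params.append(("sections", sections))
    if wildcard is not None:
        params.append(("wildcard", "yes" if wildcard else "no"))
    # organisms and fields may be repeated, one key=value pair per item.
    params.extend(("organisms", name) for name in organisms)
    params.extend(("fields", name) for name in fields)
    if gp is not None:
        if gp not in range(5):
            raise ValueError("gp must be an integer from 0 to 4")
        params.append(("gp", str(gp)))
    if form is not None:
        if form not in ("search", "orthologs"):
            raise ValueError("form must be 'search' or 'orthologs'")
        params.append(("form", form))
    return base + "?" + urlencode(params)

print(build_query_url("1291", sections="NCBI Gene IDs", form="orthologs"))
print(build_query_url("'tumor necrosis factor'", organisms=["human", "mouse"]))
```

    Passing a list of pairs rather than a dictionary keeps repeated elements such as organisms in the output, and optional elements that are left unset are simply omitted so that the server-side defaults apply.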