The Gene Database

A GenMAPP Gene Database is a species-specific library of all genes used by GenMAPP and its accessory programs. Specifically, the Gene Database stores all the gene identifiers required to link gene objects on MAPPs to expression data contained in an Expression Dataset. The Gene Database also stores annotations for each gene, derived from public databases. The GenMAPP program consults the Gene Database when a user creates or modifies a MAPP, when a user clicks on a gene object on a MAPP to open a Backpage, when a user imports an Expression Dataset, and when a user calculates new results with MAPPFinder.

The master copy of the Gene Database for each species is prepared and maintained at www.GenMAPP.org. A local copy of the Gene Database resides on the user's computer. The Gene Databases for each species are updated on a regular basis, at which time the user may download a new version for use with GenMAPP and its accessory programs.

Each Gene Database contains annotation for genes from the supported gene ID systems. The Gene Database also relates the ID for a gene from one gene ID system to an ID for that same gene in another gene ID system. For example, the MGI ID for the gene Trp53, MGI:98834, is related to the UniGene ID Mm.222 and Ensembl ID ENSMUSG00000059552.

These relationships between gene IDs are important because your MAPPs may use identifiers from one gene ID system and the Expression Dataset may use ones from another system. GenMAPP will make the connection between the IDs. For example, a gene on a MAPP could be identified by MGI:98834 and a record in an Expression Dataset by the UniGene ID Mm.222. As long as you have specified that GenMAPP will color with "All Related Gene IDs" in the Options menu (the default setting), GenMAPP will associate the Expression Dataset data with the gene object on the MAPP.
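
The matching described above can be pictured as a lookup over groups of related IDs. The following Python sketch is illustrative only: the RELATED_IDS structure and the related_ids and ids_match functions are hypothetical names and do not reflect how GenMAPP stores or queries its Gene Database. The IDs are the Trp53 examples from this section.

```python
# Hypothetical in-memory model of related gene IDs (not GenMAPP's actual storage).
# Each set groups the IDs that a Gene Database relates to one gene.
RELATED_IDS = [
    {"MGI:98834", "Mm.222", "ENSMUSG00000059552"},  # Trp53 IDs from the text
]


def related_ids(gene_id):
    """Return every ID related to gene_id, including gene_id itself."""
    for group in RELATED_IDS:
        if gene_id in group:
            return set(group)
    return {gene_id}  # an unknown ID is related only to itself


def ids_match(mapp_id, dataset_id, all_related=True):
    """Decide whether a dataset record applies to a gene object on a MAPP."""
    if all_related:
        return dataset_id in related_ids(mapp_id)
    # Assumption: a stricter setting would require identical IDs.
    return mapp_id == dataset_id


print(ids_match("MGI:98834", "Mm.222"))
print(ids_match("MGI:98834", "Mm.222", all_related=False))
```

The all_related flag loosely mirrors the "All Related Gene IDs" option. What a stricter setting does in GenMAPP itself is not described here, so the exact-match branch is an assumption.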

Furthermore, when you double-left-click on a gene object to open the Backpage, GenMAPP shows you the annotations for all related gene IDs: for example, the MGI ID MGI:98834, UniGene ID Mm.222, Ensembl ID ENSMUSG00000059552, and any other related gene IDs from any other included gene ID systems in the Gene Database.

There are a number of functions that you can perform on the Gene Database. To change the Gene Database that you are using with the GenMAPP program, choose Data > Choose Gene Database from the main Drafting Board window. To change the Gene Database from MAPPFinder, select File > Choose Gene Database from various windows within the MAPPFinder program. Since the Gene Databases are species-specific, you will need to change the Gene Database whenever you wish to work with MAPPs and Expression Datasets for different species.

To find out which gene ID systems and relationships between systems exist in the Gene Database you are using, select Data > Gene Database Information from the main GenMAPP Drafting Board window.

You may also update your Gene Database, add remarks to the annotation for a particular gene ID, add your own gene ID system or relationships to a system, or create your own Gene Database. Detailed instructions for these tasks can be found in the Gene Database Manager section.

Gene Tables

The heart of a Gene Database is a set of Gene Tables, each of which contains data for a particular species from one of the gene ID systems. These tables are created from data extracted from the Ensembl resource. The Ensembl gene table contains the unique gene identifier as well as Symbol, Description and Chromosome. All other gene tables contain only the unique identifiers for each gene record.

Some Gene Tables also contain secondary identifiers that may be searched when identifying genes. A UniProt entry, for example, is uniquely identified by a gene ID (CALM_HUMAN) but can also be searched by accession number (P62158).

Most importantly, each Gene Table contains a web link to that organization's search site. If you click on a gene identifier on a GenMAPP Backpage, a web browser window opens with data for that specific gene on the organization's website. Using this browser window, you can navigate to any public part of the organization's site.

Gene Tables come in two varieties: general tables and Model Organism Database (MOD) tables. In GenMAPP-supported Gene Databases, the MOD table is normally the Ensembl or the UniProt table. The yeast Gene Database is an exception: its MOD system is SGD. For a custom database, the MOD Gene Table could be a Gene Table from a different system. Each species-specific database also contains a number of general tables, such as UniProt, UniGene, and Entrez Gene. The general tables vary by species so that species-specific gene ID systems can be included. For example, only the mouse database contains an MGI general table and only the rat database contains an RGD general table.

One of the possible Gene Tables is that from Gene Ontology. GO identifiers would not be used for genes on a MAPP or in an Expression Dataset but the GO data can provide valuable information on the GenMAPP Backpage as well as when creating new MAPPs using MAPPFinder and MAPPBuilder.

Another table that exists in all Gene Databases is "Other." In this table you can add any genes from any cataloging systems, along with a name for each gene, annotations, and remarks. When loading an Expression Dataset into GenMAPP, you can elect to have any genes that the Expression Dataset Manager cannot identify in your Gene Database automatically added to your Other table.

In addition, as mentioned, you can create your own gene tables with virtually whatever characteristics you wish to include.

Relationship Tables

Relationship Tables store the relationships between genes in different Gene Tables. It is through these tables that GenMAPP is able to match UniGene ID Hs.401835 on a MAPP with Entrez Gene ID 140597 in an Expression Dataset, or vice versa.

A Gene Database will contain Relationship Tables between not only the included Gene Tables but also those Gene Tables and cataloging systems not in the Gene Database. For example, a Gene Database may contain an MGI Gene Table but not an Entrez Gene Table. However, the Gene Database maintains the MGI-to-Entrez Gene Relationship Table. This allows GenMAPP to match Entrez Gene identifiers on a MAPP and/or an Expression Dataset but does not provide any Entrez Gene annotation data.
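
The distinction between matching and annotation can be sketched in Python. This is a simplified, hypothetical model: the GENE_TABLES and RELATIONSHIP_TABLES structures and both function names are assumptions made for illustration and are not the actual Gene Database schema. The example database has a UniGene Gene Table and a UniGene-to-Entrez Gene Relationship Table, but no Entrez Gene Gene Table, and uses the ID pair given above.

```python
# Hypothetical model: Gene Tables hold IDs (and annotation), Relationship
# Tables hold pairs of related IDs. Not the real Gene Database schema.
GENE_TABLES = {
    "UniGene": {"Hs.401835": {}},
    # Deliberately no "Entrez Gene" Gene Table in this example database.
}

RELATIONSHIP_TABLES = {
    ("UniGene", "Entrez Gene"): [("Hs.401835", "140597")],
}


def find_related(system, gene_id, other_system):
    """Return IDs in other_system related to gene_id, searching both directions."""
    forward = RELATIONSHIP_TABLES.get((system, other_system), [])
    reverse = RELATIONSHIP_TABLES.get((other_system, system), [])
    hits = [b for a, b in forward if a == gene_id]
    hits += [a for a, b in reverse if b == gene_id]
    return hits


def has_gene_table_entry(system, gene_id):
    """Annotation can exist only if the system has its own Gene Table."""
    return gene_id in GENE_TABLES.get(system, {})


print(find_related("Entrez Gene", "140597", "UniGene"))
print(has_gene_table_entry("Entrez Gene", "140597"))
```

In this sketch the Entrez Gene ID can still be matched to its UniGene counterpart through the Relationship Table, even though the lookup for an Entrez Gene Gene Table entry fails.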

Gene ID Systems

The systems that currently exist in the GenMAPP Gene Databases are the following: Ensembl, UniProt/TrEMBL, UniGene, Entrez Gene, RefSeq, MGI, RGD, Gene Ontology (GO), ZFIN, FlyBase, SGD, InterPro, WormBase, Affymetrix, SNP, PDB, Pfam, EMBL and Other. More will be added from time to time, and you can add your own. The ID systems and their System Codes are listed in the table below:

ID System (Species)          System Code
Affymetrix Probe Set ID      X
EMBL                         Em
Ensembl                      En
Entrez Gene                  L
FlyBase (D. melanogaster)    F
Gene Ontology                T
HUGO                         H
InterPro                     I
MGI (M. musculus)            M
OMIM                         Om
PDB                          Pd
Pfam                         Pf
RefSeq                       Q
RGD (R. norvegicus)          R
SGD (S. cerevisiae)          D
UniProt/TrEMBL               S
UniGene                      U
WormBase (C. elegans)        W
ZFIN (D. rerio)              Z
Other                        O
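
When preparing data outside GenMAPP, it can be convenient to keep the codes from the table above in a lookup. The Python sketch below simply transcribes the table; the SYSTEM_CODES dictionary and the system_name function are hypothetical helpers and are not part of GenMAPP. The lookup is case-sensitive because the table mixes one-letter and two-letter codes such as Em, En, Om, Pd and Pf.

```python
# System Codes transcribed from the table above (hypothetical helper, not part of GenMAPP).
SYSTEM_CODES = {
    "X": "Affymetrix Probe Set ID",
    "Em": "EMBL",
    "En": "Ensembl",
    "L": "Entrez Gene",
    "F": "FlyBase (D. melanogaster)",
    "T": "Gene Ontology",
    "H": "HUGO",
    "I": "InterPro",
    "M": "MGI (M. musculus)",
    "Om": "OMIM",
    "Pd": "PDB",
    "Pf": "Pfam",
    "Q": "RefSeq",
    "R": "RGD (R. norvegicus)",
    "D": "SGD (S. cerevisiae)",
    "S": "UniProt/TrEMBL",
    "U": "UniGene",
    "W": "WormBase (C. elegans)",
    "Z": "ZFIN (D. rerio)",
    "O": "Other",
}


def system_name(code):
    """Return the ID system for a System Code, or None if the code is unknown."""
    return SYSTEM_CODES.get(code)


print(system_name("M"))
print(system_name("Om"))
```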

Affymetrix

The Affymetrix category is based on Probe Set IDs from Affymetrix GeneChip arrays. The Affymetrix Probe Set ID is a unique identifier generated by Affymetrix, and is the recommended primary identifier for analyzing data generated on Affymetrix arrays. The Affymetrix table in the Gene Database contains Probe Set IDs for all currently available GeneChip arrays.

EMBL

The European Molecular Biology Laboratory Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.

The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

Ensembl

Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on metazoan genomes. Ensembl is primarily funded by the Wellcome Trust. Access to all the data produced by the project, and to the software used to analyse and present it, is provided free and without constraints.

The Ensembl category of the GenMAPP Gene Databases serves as the MOD for several species-specific gene databases, such as human, mouse and rat.  

Entrez Gene

The Entrez Gene database contains curated sequence and descriptive information about genetic loci, and provides an integrated cross-referencing system between sources. The Entrez Gene table of the Gene Database has information on the symbol and name in addition to the Entrez Gene ID.

FlyBase

FlyBase is a database of genetic and molecular data for Drosophila. FlyBase includes data on all species from the family Drosophilidae; the primary species represented is Drosophila melanogaster. FlyBase is produced by a consortium of researchers funded by the National Institutes of Health, U.S.A., and the Medical Research Council, London.

Gene Ontology

The Gene Ontology database provides consistent descriptions of gene products present in other public databases. The GO consortium is developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. The GO table in the Gene Database stores the individual GO terms and their relationship to each other. Since the GO ID does not refer to a gene, it is not an appropriate gene ID system to use as primary ID on a MAPP or in a Gene Expression Dataset.

HUGO

The Human Genome Organisation (HUGO) is the international organization of scientists involved in human genetics. For each known human gene, the HUGO Gene Nomenclature Committee (HGNC) approves a gene name and symbol (short-form abbreviation). All approved symbols are stored in the HGNC database. Each symbol is unique, and each gene is given only one approved gene symbol.

InterPro

InterPro is a database with curated information about protein families, domains and functional sites. Entries in the database are created through sequence comparisons with the UniProt and TrEMBL databases. The InterPro database also stores the relationship between entries. The InterPro table of the Gene Database stores a name for each InterPro entry, in addition to the ID. Since the InterPro ID does not refer to a gene, it is not an appropriate gene ID system to use as primary ID.

MGI

The Mouse Genome Informatics (MGI) Database is a collaboration between several mouse resources and provides integrated access to data on the genetics, genomics and biology of the laboratory mouse. The MGI table in the Gene Database also contains the symbol for each MGI ID.

OMIM

The Online Mendelian Inheritance in Man (OMIM) database is a catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and developed for the World Wide Web by NCBI. The database contains textual information and references. It also contains links to MEDLINE and sequence records in the Entrez system, and links to additional related resources at NCBI and elsewhere. Since the OMIM ID does not refer to a gene, it is not an appropriate gene ID system to use as primary ID on a MAPP or in a Gene Expression Dataset.

PDB

The Protein Data Bank (PDB) is a repository of 3-D structural data for large biological molecules, such as proteins and nucleic acids.

Pfam

Pfam (Protein families database of alignments and HMMs) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Since the Pfam ID does not refer to a gene, it is not an appropriate gene ID system to use as primary ID on a MAPP or in a Gene Expression Dataset.

RefSeq

The RefSeq database provides a non-redundant set of reference sequences, including genomic DNA, transcript (RNA) and protein products. The RefSeq sequences are created in a semi-automated process that also creates the Entrez Gene entries.

RGD

The Rat Genome Database is created through a collaboration between several institutes involved in rat genetic and genomic research. The current RGD database contains curated data on rat genes, quantitative trait loci (QTL), microsatellite markers and rat strains used in genetic and genomic research.

SGD

The Saccharomyces Genome Database (SGD) stores nucleotide sequences for Saccharomyces cerevisiae, as well as other genomic and biological information. The SGD category is the MOD for the yeast Gene Database.

UniProt/TrEMBL

The UniProt database contains curated protein sequence information with good annotations for each sequence and a minimal level of redundancy. TrEMBL is a computer-generated supplement to the UniProt database that contains translations of DNA sequences from the EMBL (GenBank) database that are awaiting full annotation and inclusion in the UniProt database.

The UniProt/TrEMBL category of the Gene Database uses UniProt and TrEMBL accession numbers and entry names (referred to as UniProt/TrEMBL names) as gene identifiers. P16220 is an example of a UniProt accession number; CREB_HUMAN is an example of a UniProt name.
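
If you need to tell the two kinds of UniProt identifier apart before importing data, a pattern check is one option. The Python sketch below is illustrative and not part of GenMAPP: the accession pattern follows the accession number format published by UniProt, while the entry-name pattern is a deliberate simplification (uppercase letters or digits, an underscore, then a species code of up to five characters). The function name classify_uniprot_id is hypothetical.

```python
import re

# Accession number format as published by UniProt.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified entry-name pattern (an assumption, not an official definition).
ENTRY_NAME = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_uniprot_id(identifier):
    """Return 'accession', 'name', or 'unknown' for a candidate UniProt ID."""
    if ACCESSION.fullmatch(identifier):
        return "accession"
    if ENTRY_NAME.fullmatch(identifier):
        return "name"
    return "unknown"


for example in ("P16220", "CREB_HUMAN", "Mm.222"):
    print(example, classify_uniprot_id(example))
```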

UniProt serves as the MOD category for the worm, dog, zebrafish and fruit fly Gene Databases.

UniGene

The UniGene database contains sets of non-redundant gene-oriented sequence clusters. It is created through automatic partitioning of GenBank sequences, and each UniGene cluster represents a unique gene. In addition to the UniGene table, the Gene Database also contains relationship tables relating the UniGene clusters to GenBank Accession numbers, Entrez Gene IDs and so on. For example, the UniGene-GenBank relationship table stores all the GenBank Accession numbers associated with each UniGene cluster.

WormBase

WormBase is a database containing nucleotide sequence and other genetic and genomic information for Caenorhabditis elegans.

ZFIN

The Zebrafish Information Network (ZFIN) is a database containing gene and other information about the zebrafish, Danio rerio.

Other

The Other category contains gene identifiers added by the user that are not represented in any other Gene Table or Relationship Table. Situations where a user might add a gene identifier to the Other category include adding genes with accession numbers from a gene ID system not included in the GenMAPP Gene Database, or adding specific IDs for commercial cDNA clone sets. There are several ways to add a gene identifier to the Other category. You can use the Gene Finder window to add IDs, or you can add them by processing exceptions. Another option is to edit the Other table directly through the Gene Database Manager.

Note: If you add IDs to the Other category, they will not color existing MAPPs in GenMAPP. In order to color MAPPs using Expression Datasets with IDs added to the Other category, you will need to create your own MAPPs containing the newly added IDs.

Gene Databases

Overview

Choosing the correct Gene Database for use with GenMAPP is a very important decision since the Gene Database is the backbone of GenMAPP and MAPPFinder.

Species-specific databases

GenMAPP.org currently supports and actively updates Gene Databases for the following species: Human (H. sapiens), Mouse (M. musculus), Rat (R. norvegicus), Yeast (S. cerevisiae), Worm (C. elegans), Dog (C. familiaris), Chicken (G. gallus), Cattle (B. taurus), Fruit Fly (D. melanogaster) and Zebrafish (D. rerio).

The following species-specific databases are available on request: Mosquito (A. gambiae), Honey Bee (A. mellifera), Fugu (F. rubripes), Chimpanzee (P. troglodytes) and Pufferfish (T. nigroviridis).

GenMAPP Converter Databases

The Converter Databases (for example Mm-Converter_20040411.gdb) are available for Mouse and Human and were designed to work with the Converter application to convert GenBank IDs in GenMAPP 1.0 formatted MAPPs and Expression Datasets to other ID types. The Converter Databases were designed specifically for the Converter and should not be used for regular GenMAPP purposes.

Other Species

GenMAPP.org will periodically create Gene Databases for additional species. These databases are listed in the GenMAPP download area. If your ID system of choice is not supported by GenMAPP, you can use the Gene Database Manager to add new gene ID systems and relationships.