GeneZilla: the eukaryotic gene finder formerly known as TIGRscan

About GeneZilla

GeneZilla is a state-of-the-art program for the computational prediction of protein-coding genes in eukaryotic DNA. It is based on the Generalized Hidden Markov Model (GHMM) framework, like GENSCAN and GENIE. It is highly reconfigurable and includes software for retraining by the end user. It is written in highly optimized C++ and runs on most UNIX/Linux platforms. Run time and memory requirements are linear in the sequence length and are in general much better than those of competing systems, owing to GeneZilla's novel decoding algorithm. Graph-theoretic representations of the high-scoring open reading frames are provided, allowing exploration of suboptimal gene models. GeneZilla uses Interpolated Markov Models (IMMs) and Maximal Dependence Decomposition (MDD), includes states for signal peptides, branch points, TATA boxes, and CAP sites, and will soon model CpG islands as well.
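
For readers new to GHMMs, the sketch below shows what "generalized" means in practice: each state emits a whole segment with an explicit duration rather than a single base. It is the naive textbook Viterbi recursion, not GeneZilla's decoding algorithm, and its cost grows with the maximum segment length rather than staying linear. The two-state model, its probabilities, and all function names are hypothetical and chosen only for illustration.

```python
import math

NEG_INF = float("-inf")

def ghmm_viterbi(seq, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Return the best parse of seq as a list of (state, start, end) segments.

    Naive recursion: every (end, state, duration, previous state) combination
    is scored, so the cost grows with len(seq) * max_dur * len(states)**2.
    """
    n = len(seq)
    # best[j][q]: best log score of a parse of seq[:j] whose last segment is in state q
    best = [dict.fromkeys(states, NEG_INF) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for end in range(1, n + 1):
        for q in states:
            for d in range(1, min(max_dur, end) + 1):
                start = end - d
                segment = log_dur(q, d) + log_emit(q, seq[start:end])
                if start == 0:
                    candidates = [(log_init[q] + segment, None)]
                else:
                    candidates = [
                        (best[start][p] + log_trans[p].get(q, NEG_INF) + segment, p)
                        for p in states
                    ]
                for score, prev in candidates:
                    if score > best[end][q]:
                        best[end][q] = score
                        back[end][q] = (start, prev)
    # Trace back from the best-scoring final state.
    state = max(states, key=lambda s: best[n][s])
    end, parse = n, []
    while end > 0:
        start, prev = back[end][state]
        parse.append((state, start, end))
        end, state = start, prev
    return parse[::-1]

# Hypothetical toy model: AT-rich and GC-rich regions that must alternate.
STATES = ["AT_rich", "GC_rich"]
BASE_PROB = {
    "AT_rich": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
    "GC_rich": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45},
}
MAX_DUR = 8
log_init = {s: math.log(0.5) for s in STATES}
log_trans = {"AT_rich": {"GC_rich": 0.0}, "GC_rich": {"AT_rich": 0.0}}

def log_dur(state, d):
    # Uniform duration distribution over 1..MAX_DUR (an assumption).
    return math.log(1.0 / MAX_DUR)

def log_emit(state, segment):
    # Content score: independent per-base log probabilities.
    return sum(math.log(BASE_PROB[state][base]) for base in segment)

parse = ghmm_viterbi("AAAATTGGGCCCAATT", STATES, log_init, log_trans,
                     log_dur, log_emit, MAX_DUR)
print(parse)
```

On this toy input the best parse is three segments: AT-rich over [0, 6), GC-rich over [6, 12), and AT-rich over [12, 16). A real gene finder replaces the per-base content score with sensors such as IMMs and adds signal states at segment boundaries.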

GeneZilla is an open-source project hosted at bioinformatics.org and currently consists of ~20,000 lines of code. GeneZilla evolved out of the ab initio eukaryotic gene finder TIGRscan, which was developed at The Institute for Genomic Research over a 3-year period under NIH grants R01-LM06845 and R01-LM007938, and which served as the basis for the comparative gene finder TWAIN.

Accuracy - Human

Results on a test set of 458 human RefSeq genes selected randomly from all 24 chromosomes, non-overlapping with the training set, with 1000 bp margins:


Level      nucleotide          splice site   start/stop codon    exons               exact genes
Program    Sn    Sp    F       Sn    Sp      Sn       Sp         Sn    Sp    F
GeneZilla  83%   73%   78%     75%   79%     61%      46%        70%   71%   71%     23%
GENSCAN    85%   67%   75%     75%   68%     49%      33%        68%   59%   63%     14%

F=2×Sn×Sp/(Sn+Sp) incorporates sensitivity (Sn) and specificity (Sp) into a single score.
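
As a quick check of the formula, the sketch below recomputes the nucleotide-level F values from the Sn and Sp figures reported above for the 458-gene set. The function name `f_score` is an illustrative choice, not part of GeneZilla.

```python
def f_score(sn, sp):
    """Harmonic mean of sensitivity and specificity: F = 2*Sn*Sp / (Sn + Sp)."""
    if sn + sp == 0:
        return 0.0
    return 2 * sn * sp / (sn + sp)

# Nucleotide-level Sn and Sp for the 458-gene human test set.
genezilla_f = f_score(0.83, 0.73)
genscan_f = f_score(0.85, 0.67)
print(f"GeneZilla F = {genezilla_f:.0%}, GENSCAN F = {genscan_f:.0%}")
```

Because F is a harmonic mean, it is pulled toward the lower of the two inputs, so a predictor cannot score well by trading away specificity for sensitivity.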

Results on an additional set of 481 genes selected randomly from all 24 chromosomes, non-overlapping with the training set (and disjoint from the previous set of 458 genes), with 1000 bp margins:


Level      nucleotide          splice site   start/stop codon    exons               exact genes
Program    Sn    Sp    F       Sn    Sp      Sn       Sp         Sn    Sp    F
GeneZilla  86%   78%   82%     79%   78%     58%      48%        73%   71%   72%     18%
GENSCAN    88%   71%   79%     79%   66%     48%      39%        70%   59%   64%     12%

Results on a third set of 420 genes selected randomly from all 24 chromosomes, non-overlapping with the training set (and disjoint from the previous two sets of test genes), with 50,000 bp margins:


Level      nucleotide          splice site   start/stop codon    exons               exact genes
Program    Sn    Sp    F       Sn    Sp      Sn       Sp         Sn    Sp    F
GeneZilla  82%   63%   71%     72%   56%     45%      27%        65%   48%   55%     13%
GENSCAN    85%   59%   70%     73%   43%     39%      18%        64%   37%   47%     9%

The training set was 8000 human RefSeq genes. 
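
The page does not define Sn and Sp. The sketch below assumes the convention common in gene-finding evaluations, where Sn = TP/(TP+FN) and Sp = TP/(TP+FP), and shows how the nucleotide-level and exon-level figures differ for the same prediction. The coordinates are made-up, 0-based, half-open intervals, and the helper names are hypothetical.

```python
def coding_positions(exons):
    """Expand (start, end) half-open exon intervals into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}

def sn_sp(true_items, predicted_items):
    """Sn = TP / (TP + FN), Sp = TP / (TP + FP), computed over two sets."""
    tp = len(true_items & predicted_items)
    sn = tp / len(true_items) if true_items else 0.0
    sp = tp / len(predicted_items) if predicted_items else 0.0
    return sn, sp

# Made-up example: one exon predicted exactly, one with a shifted boundary,
# and one spurious exon.
annotated = [(100, 200), (300, 400)]
predicted = [(100, 200), (310, 400), (500, 550)]

# Nucleotide level: compare individual coding positions.
nuc_sn, nuc_sp = sn_sp(coding_positions(annotated), coding_positions(predicted))

# Exon level: an exon counts only if both boundaries match exactly.
exon_sn, exon_sp = sn_sp(set(annotated), set(predicted))

print(f"nucleotide Sn={nuc_sn:.2f} Sp={nuc_sp:.2f}")
print(f"exon       Sn={exon_sn:.2f} Sp={exon_sp:.2f}")
```

The shifted boundary costs only ten positions at the nucleotide level but an entire exon at the exon level, which is why exon and exact-gene figures are much lower than nucleotide figures for every program in these comparisons.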

Accuracy - Arabidopsis thaliana

Results on a test set of 800 Arabidopsis thaliana genes are shown below; the training and test sets were disjoint:


Level                                        Nucleotide          Exon          Gene
Program                                      Sn    Sp    Acc     Sn    Sp      Acc
GeneZilla                                    95%   98%   96%     77%   81%     43%
GENSCAN+ (trained for Arabidopsis)           93%   99%   95%     75%   82%     35%
GENSCAN (trained for human)                  69%   98%   80%     22%   43%     19%
UNVEIL (a pure HMM trained for A. thaliana)  95%   84%   87%     40%   36%     7%


Accuracy - vertebrate genes

Results on the classic Burset/Guigó test set of 558 vertebrate genes, non-overlapping with the training set (GeneZilla was trained on the original GENSCAN training data, courtesy of C. Burge):


Level      nucleotide          splice site   start/stop codon    exons               exact genes
Program    Sn    Sp    F       Sn    Sp      Sn       Sp         Sn    Sp    F
GeneZilla  95%   96%   96%     89%   87%     82%      79%        82%   80%   81%     51%
GENSCAN    96%   97%   97%     90%   89%     72%      89%        81%   84%   83%     43%


Efficiency

Time and memory usage on a 922 Kb Aspergillus fumigatus contig are shown below:


Program    Memory (MB)   Time (min:sec)
GeneZilla  29            1:28
GENSCAN    445           2:57

These results show that on this contig GeneZilla used about one-fifteenth of GENSCAN's memory (29 vs. 445) and about half its run time (1:28 vs. 2:57). GeneZilla has successfully processed contigs as large as 4 Mb on an ordinary laptop computer. This efficiency makes GeneZilla well suited for use as a component of more sophisticated systems such as comparative gene finders, in which faster GHMM decoding leaves more resources available for further computation. The software underlying GeneZilla (formerly called "TIGRscan") was used in the construction of the GPHMM ("generalized pair HMM") comparative gene finder TWAIN, and is now being used in the construction of the phylogenetic gene finder EPIC.
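
The comparison above reduces to two ratios, recomputed in the short sketch below from the reported memory and time figures. The dictionary layout and the helper `to_seconds` are illustrative only.

```python
def to_seconds(min_sec):
    """Convert a 'min:sec' string such as '1:28' to seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)

# Figures reported for the 922 Kb Aspergillus fumigatus contig.
runs = {
    "GeneZilla": {"memory": 29, "time": "1:28"},
    "GENSCAN": {"memory": 445, "time": "2:57"},
}

memory_ratio = runs["GENSCAN"]["memory"] / runs["GeneZilla"]["memory"]
time_ratio = to_seconds(runs["GENSCAN"]["time"]) / to_seconds(runs["GeneZilla"]["time"])

print(f"GENSCAN / GeneZilla memory: {memory_ratio:.1f}x")
print(f"GENSCAN / GeneZilla time:   {time_ratio:.2f}x")
```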

Architecture

GeneZilla's state-transition diagram is essentially the same as that of GENSCAN.  Unlike many GHMM-based gene finders, GeneZilla can model different types of exons (i.e., initial/internal/final/single) using different content sensors.  The state diagram shown below models only forward-strand genes; reverse-strand genes are handled by a mirror image of this model, and are not permitted to overlap with forward-strand predictions.



Not shown in this diagram are the signal peptide, CAP site, and branch point models.  More information about GeneZilla's software architecture can be found here.
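
The exon-type distinction and the mirror-image treatment of the reverse strand can be pictured with a small transition table. The sketch below is a heavily simplified illustration, not GeneZilla's actual state set: it omits UTRs, promoters, intron phases, and all signal states, and the state names and the "-" suffix for reverse-strand states are hypothetical.

```python
# Simplified forward-strand topology with one state per exon type.
FORWARD = {
    "intergenic": {"initial_exon", "single_exon"},
    "initial_exon": {"intron"},
    "intron": {"internal_exon", "final_exon"},
    "internal_exon": {"intron"},
    "final_exon": {"intergenic"},
    "single_exon": {"intergenic"},
}

def mirror(transitions, shared="intergenic", suffix="-"):
    """Reverse every edge and rename all states except the shared one."""
    def name(state):
        return state if state == shared else state + suffix
    mirrored = {name(state): set() for state in transitions}
    for src, dsts in transitions.items():
        for dst in dsts:
            mirrored[name(dst)].add(name(src))
    return mirrored

REVERSE = mirror(FORWARD)

# Both strands share only the intergenic state, so a parse must return to
# intergenic before starting a gene on the other strand (no overlaps).
BOTH = {
    state: FORWARD.get(state, set()) | REVERSE.get(state, set())
    for state in FORWARD.keys() | REVERSE.keys()
}

def is_valid_parse(labels, transitions, start="intergenic"):
    """Check that a label sequence starts and ends intergenic and uses only allowed edges."""
    if not labels or labels[0] != start or labels[-1] != start:
        return False
    return all(b in transitions.get(a, ()) for a, b in zip(labels, labels[1:]))

forward_gene = ["intergenic", "initial_exon", "intron", "internal_exon",
                "intron", "final_exon", "intergenic"]
reverse_gene = ["intergenic", "final_exon-", "intron-", "initial_exon-", "intergenic"]
print(is_valid_parse(forward_gene, BOTH), is_valid_parse(reverse_gene, BOTH))
```

Reading left to right along the forward strand, a reverse-strand gene is encountered final exon first, which is why mirroring reverses the edges rather than merely renaming the states.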

GeneZilla (formerly "TIGRscan") is briefly described in:

Majoros W, et al. (2004) TIGRscan and GlimmerHMM: two open-source ab initio eukaryotic gene finders, Bioinformatics 20, 2878-2879.

The novel decoding algorithm used by GeneZilla is described in:

Majoros W, et al. (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders, BMC Bioinformatics 6:16.

Downloads

GeneZilla is available for download as OSI Certified Open Source Software under the Artistic License. Pre-trained model files are also available.

File                            Description
genezilla.tar.gz                Complete source code and documentation
genezilla-arabidopsis.tar.gz    Model files for Arabidopsis thaliana
genezilla-tetrahymena.tar.gz    Model files for Tetrahymena thermophila (trained on Plasmodium falciparum and on long Tetrahymena ORFs)
genezilla-plasmodium.tar.gz     Model files for Plasmodium falciparum
genezilla-toxoplama.tar.gz      Model files for Toxoplasma gondii
genezilla-aspergillus.tar.gz    Model files for Aspergillus fumigatus








Contact: Bill Majoros (bmajoros@tigr.org)

Also available on CD: the official GeneZilla Soundtrack!