GeneZilla: the eukaryotic gene finder formerly known as TIGRscan

About GeneZilla

GeneZilla is a state-of-the-art gene finder based on the Generalized Hidden Markov Model (GHMM) framework, similar to Genscan and Genie. It is highly reconfigurable and includes software for retraining by the end user. It is written in highly optimized C++. Its run time and memory requirements are linear in the sequence length and are in general much better than those of competing systems, due to GeneZilla's novel decoding algorithm. Graph-theoretic representations of the high-scoring open reading frames are provided, allowing for exploration of sub-optimal gene models. GeneZilla uses Interpolated Markov Models (IMMs) and Maximal Dependence Decomposition (MDD), includes states for signal peptides, branch points, TATA boxes, and CAP sites, and will soon model CpG islands as well.
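
To make the idea of a graph of high-scoring open reading frames more concrete, the following is a minimal sketch of highest-scoring-path dynamic programming on a directed acyclic graph. It is not GeneZilla's decoding algorithm or its graph file format: the function name `best_path`, the node numbering, and all scores are assumptions made up for illustration. Nodes stand for signals (start codon, donor, acceptor, stop codon) ordered left to right, and edges stand for the scored regions between them.

```python
def best_path(num_nodes, edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path.

    Nodes are numbered left to right along the sequence, so every edge
    (u, v, score) satisfies u < v and the graph is acyclic.
    Returns None if the sink cannot be reached from the source.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes   # best score of any path ending at each node
    back = [None] * num_nodes      # predecessor of each node on that best path
    best[source] = 0.0
    # Visiting edges in order of their left endpoint guarantees that
    # best[u] is final before any edge leaving u is relaxed.
    for u, v, score in sorted(edges):
        if best[u] > neg_inf and best[u] + score > best[v]:
            best[v] = best[u] + score
            back[v] = u
    if best[sink] == neg_inf:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    path.reverse()
    return best[sink], path


# Hypothetical toy graph: 0 = left end, 1 = start codon, 2 = donor,
# 3 = acceptor, 4 = stop codon. Scores are made-up values.
edges = [
    (0, 1, -1.0),  # intergenic region
    (1, 4, 2.0),   # single exon
    (1, 2, 1.5),   # initial exon
    (2, 3, 0.5),   # intron
    (3, 4, 1.5),   # final exon
]
print(best_path(5, edges, 0, 4))

# Forbid the intron edge to look at a sub-optimal alternative model.
alternative = [e for e in edges if e[:2] != (2, 3)]
print(best_path(5, alternative, 0, 4))
```

In this toy graph the two-exon model wins over the single-exon model. Removing an edge and re-running the search is one simple way to look at the next-best alternatives.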

Accuracy

Results on 800 Arabidopsis thaliana genes are shown below:


                                              Nucleotide        Exon        Gene
                                              Sn    Sp    Acc   Sn    Sp    Acc
GeneZilla                                     95%   98%   96%   77%   81%   43%
Genscan+ (trained for Arabidopsis)            93%   99%   95%   75%   82%   35%
Genscan (trained for human)                   69%   98%   80%   22%   43%   19%
Unveil (a pure HMM trained for A. thaliana)   95%   84%   87%   40%   36%   7%

Note that the training and test sets were disjoint for all results reported on this page. On the "standard" Burset/Guigó test set of 558 vertebrate genes, GeneZilla performed very similarly to Genscan:


Level               Measure   GeneZilla   Genscan
Nucleotide          Sn        95%         96%
Nucleotide          Sp        96%         97%
Nucleotide          F         96%         97%
Splice site         Sn        89%         90%
Splice site         Sp        87%         89%
Start/stop codon    Sn        82%         72%
Start/stop codon    Sp        79%         89%
Exons               Sn        82%         81%
Exons               Sp        80%         84%
Exons               F         81%         82%
Exact genes                   51%         43%
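
This page does not define Sn, Sp, or F. The sketch below assumes the definitions conventional in gene-finder evaluation: sensitivity Sn = TP / (TP + FN), specificity Sp = TP / (TP + FP), and F as the harmonic mean of the two. It computes them at the nucleotide level from annotated and predicted exon coordinates. The function names and the example coordinates are hypothetical, and this is not the evaluation code used to produce the results above.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals (0-based, end-exclusive) into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (Sn, Sp, F) at the nucleotide level.

    Assumed definitions:
      Sn = TP / (TP + FN)   fraction of annotated coding bases that were predicted
      Sp = TP / (TP + FP)   fraction of predicted coding bases that are annotated
      F  = 2 * Sn * Sp / (Sn + Sp)
    """
    truth = coding_positions(annotated_exons)
    pred = coding_positions(predicted_exons)
    tp = len(truth & pred)   # coding in both
    fn = len(truth - pred)   # annotated coding, missed by the prediction
    fp = len(pred - truth)   # predicted coding, not annotated
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f


annotated = [(100, 200), (300, 400)]   # hypothetical annotated exons
predicted = [(100, 200), (310, 420)]   # hypothetical predicted exons
sn, sp, f = nucleotide_accuracy(annotated, predicted)
print(f"Sn={sn:.2f} Sp={sp:.2f} F={f:.2f}")
```

Exon-level and gene-level measures are stricter: under the usual convention a predicted exon or gene counts as correct only if its boundaries match the annotation exactly, which is consistent with those columns being much lower than the nucleotide-level ones.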

Efficiency

Time and memory usage on a 922 Kb Aspergillus fumigatus contig are shown below:


             Memory (MB)   Time (min:sec)
GeneZilla    29            1:28
Genscan      445           2:57

These results demonstrate that GeneZilla is extremely memory efficient, using roughly 15 times less memory than Genscan on this contig (29 versus 445), while also running about twice as fast (1:28 versus 2:57). GeneZilla has successfully processed contigs as large as 2 Mb on an ordinary laptop computer. The excellent space efficiency of GeneZilla allows it to be used as a component of more sophisticated systems such as comparative gene finders, while leaving more memory for the comparative analyses.

Architecture

GeneZilla's state-transition diagram is essentially the same as that of Genscan. Unlike most GHMM-based gene finders, GeneZilla can model different types of exons (i.e., initial/internal/final/single) using different content sensors. The state diagram shown below models only forward-strand genes; reverse-strand genes are handled by a mirror image of this model, and are not permitted to overlap with forward-strand predictions.



Not shown in this diagram are the signal peptide, CAP site, and branch point models.  GeneZilla will also soon possess an (optional) CpG island state upstream from the start codon for use when applied to vertebrate genomes.
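
The four exon types mentioned above depend only on an exon's position within its gene. A minimal sketch of that classification follows, assuming nothing about GeneZilla's internals: the function names and the sensor names in `CONTENT_SENSORS` are hypothetical placeholders, not identifiers from the GeneZilla source or model files.

```python
# Hypothetical sensor names, one per exon type; real model files may differ.
CONTENT_SENSORS = {
    "initial": "initial_exon_sensor",
    "internal": "internal_exon_sensor",
    "final": "final_exon_sensor",
    "single": "single_exon_sensor",
}


def exon_types(num_exons):
    """Label each exon of a gene, in transcription order, by its type."""
    if num_exons < 1:
        raise ValueError("a gene needs at least one exon")
    if num_exons == 1:
        # A gene with one exon runs from start codon to stop codon.
        return ["single"]
    # Otherwise: one initial exon, zero or more internal exons, one final exon.
    return ["initial"] + ["internal"] * (num_exons - 2) + ["final"]


def sensors_for_gene(num_exons):
    """Pick the (hypothetical) content sensor used to score each exon."""
    return [CONTENT_SENSORS[t] for t in exon_types(num_exons)]


print(exon_types(1))
print(exon_types(4))
print(sensors_for_gene(2))
```

Scoring each type with its own content sensor lets the model capture compositional differences between, for example, the first and last coding exons of a gene, at the cost of needing enough training examples of each type.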

More information about GeneZilla's software architecture can be found in the Architecture documentation.

GeneZilla (formerly TIGRscan) is briefly described in:

Majoros W, et al. (2004) TIGRscan and GlimmerHMM: two open-source ab initio eukaryotic gene finders, Bioinformatics 20, 2878-2879.

The novel decoding algorithm used by GeneZilla is described in:

Majoros W, et al. (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders, BMC Bioinformatics 6:16.

Downloads

GeneZilla is available for download as OSI Certified Open Source Software under the Artistic License. Pre-trained model files are also available.

File                            Description
genezilla.tar.gz                Complete source code and documentation
genezilla-arabidopsis.tar.gz    Model files for Arabidopsis thaliana
genezilla-tetrahymena.tar.gz    Model files for Tetrahymena thermophila (trained on Plasmodium falciparum and on long Tetrahymena ORFs)
genezilla-plasmodium.tar.gz     Model files for Plasmodium falciparum
genezilla-toxoplama.tar.gz      Model files for Toxoplasma gondii
genezilla-aspergillus.tar.gz    Model files for Aspergillus fumigatus




Contact: Bill Majoros (bmajoros@tigr.org)