GRAPE
The Automatic Parameter Estimation Program for GeneZilla

Bill Majoros (bmajoros@tigr.org)
The Institute for Genomic Research




Introduction

GRAPE (GRadient Ascent Parameter Estimation) is the automatic training program for the ab initio eukaryotic gene finder GeneZilla.  The program uses gradient ascent optimization (also called "hill-climbing") to fine-tune the parameters of the gene finder, as described in:

Majoros W. and Salzberg S.L. (2004) An empirical analysis of training protocols for probabilistic gene finders, BMC Bioinformatics 5:206.

The procedure is illustrated schematically below:

[Figure: schematic of the GRAPE training procedure, showing the training and test sets]

The program works by iteratively refining the current parameters of the gene finder so as to incrementally improve accuracy on a set of test genes. The relative sizes of the training and test sets shown in the figure are suggestive only. Note that the final accuracy measured on this test set should not be reported in publications: because the parameters were tuned against these same genes, doing so would constitute a form of post hoc prediction, which is generally invalid for objective accuracy assessment.
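
The iterative refinement described above can be sketched as a greedy, coordinate-wise hill climb. This is a minimal illustration in Python, not GRAPE's Perl implementation: the function hill_climb, the parameter names "margin" and "threshold", and the toy_accuracy scoring function are all hypothetical stand-ins. In GRAPE the score would come from running the gene finder on the test genes.

```python
from typing import Callable, Dict


def hill_climb(params: Dict[str, float],
               steps: Dict[str, float],
               evaluate: Callable[[Dict[str, float]], float],
               max_rounds: int = 100) -> Dict[str, float]:
    """Greedy coordinate-wise hill climbing (illustrative sketch only).

    Each round tries moving every parameter up or down by its step size
    and keeps a move only if it strictly improves the score.
    """
    best = dict(params)
    best_score = evaluate(best)
    for _ in range(max_rounds):
        improved = False
        for name, step in steps.items():
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] = best[name] + delta
                score = evaluate(candidate)
                if score > best_score:
                    best, best_score = candidate, score
                    improved = True
                    break  # accept the move, go on to the next parameter
        if not improved:
            break  # no single-step change helps: a local optimum
    return best


def toy_accuracy(p: Dict[str, float]) -> float:
    """Hypothetical stand-in for gene finder accuracy; peaks at (20, 50)."""
    return -((p["margin"] - 20) ** 2) - ((p["threshold"] - 50) ** 2)


start = {"margin": 10.0, "threshold": 45.0}
result = hill_climb(start, {"margin": 1.0, "threshold": 1.0}, toy_accuracy)
print(result)  # {'margin': 20.0, 'threshold': 50.0}
```

Because only improving moves are accepted, a procedure of this kind stops at a local optimum, which need not be the best parameter setting overall.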

The GRAPE program, being a research tool, is continually undergoing modifications.  Currently, the program is designed to optimize the following parameters:
Using GRAPE to optimize parameters other than these requires modifying the GRAPE source code.  GRAPE is written in Perl and is provided as open-source software.  Modifications are welcome, as are contributions and enhancements to the continually evolving code base of the GeneZilla and GRAPE projects.  Contact bmajoros@tigr.org for information on contributing to the software development of these projects.

GRAPE is currently configured to process only a single isochore at a time.  If you wish to use the gene finder in multiple-isochore mode, you will have to apply GRAPE to each isochore separately; it is recommended that this be done in a separate directory for each isochore.

The usage statement for the program is:

GRAPE.pl <logfile.txt> <max-num-test-genes> <genome-GC-content> [-d]
         where -d = optimize exon duration distributions
               <genomic-GC-content> is between 0 and 1 (ex: 0.48)

Prerequisites:
  1) Must have training set in ./iso0-100.gff
  2) Must have test set in ./test.gff
  3) Must have contigs in ./iso0-100.fasta
  4) Run get-examples.pl iso0-100.gff iso0-100.fasta TAG,TGA,TAA notrim
  5) Run get-duration-distr.pl for each type of exon
     (and view results using xgraph to ensure smoothness)

The program overwrites <logfile.txt> with a record of the optimization steps performed during hill-climbing. <max-num-test-genes> limits the number of test genes used, which accelerates the evaluation step of the hill-climbing procedure. <genome-GC-content> must be a real number between 0 and 1.  The prerequisites for running the program are given above. Information on performing these prerequisite steps can be found here.
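
Since a run can take hours, it may be worth checking the inputs before launching it. The following Python sketch is not part of GRAPE: the helper names missing_prerequisites and parse_gc_content are hypothetical. It only checks that the three files named in the prerequisites exist and that the GC content is a number in the stated range. Treating the endpoints 0 and 1 as valid is an assumption.

```python
import os
from typing import List

# File names taken from the prerequisites listed in the usage statement.
REQUIRED_FILES = ("iso0-100.gff", "test.gff", "iso0-100.fasta")


def missing_prerequisites(workdir: str) -> List[str]:
    """Return the required input files that are absent from workdir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(workdir, name))]


def parse_gc_content(text: str) -> float:
    """Parse <genome-GC-content> and check that it lies between 0 and 1.

    Accepting exactly 0 or 1 is an assumption; the usage statement only
    says "between 0 and 1".
    """
    value = float(text)  # raises ValueError if text is not a number
    if not 0.0 <= value <= 1.0:
        raise ValueError("GC content must be a fraction such as 0.48, got " + text)
    return value
```

For example, parse_gc_content("48") raises an error, which catches the common mistake of giving a percentage instead of a fraction.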

A number of additional parameters are defined within the GRAPE program:

my $MIN_SAMPLE_SIZE=175;
my $EXON_POOLING_THRESHOLD=50;# we pool if fewer than this
my $FORCE_EXON_POOLING=1;     # always pool exons?
my $MAX_BWM_SAMPLE_SIZE=40;   # use BWM if <40, otherwise use WAM
my $MIN_NONCODING_MARGIN=10;  # min size for noncoding signal margin
my $MAX_NONCODING_MARGIN=45;  # max size for noncoding signal margin
my $MIN_CODING_MARGIN=0;      # min size for coding signal margin
my $MAX_CODING_MARGIN=10;     # max size for coding signal margin

BWM denotes a Binomial Weight Matrix, which applies a binomial test to decide whether to use background positional nucleotide frequencies when the sample size is small.  WAM denotes a Weight Array Matrix.
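
The idea behind the binomial test can be illustrated with a short Python sketch. This is one plausible reading of the description above, not the actual GeneZilla or GRAPE code: the function names, the exact two-sided test, and the significance level alpha=0.05 are all assumptions. Only the threshold of 40 comes from $MAX_BWM_SAMPLE_SIZE.

```python
from math import comb

MAX_BWM_SAMPLE_SIZE = 40  # mirrors $MAX_BWM_SAMPLE_SIZE above


def choose_matrix_type(sample_size: int) -> str:
    """Use a BWM for fewer than 40 examples, otherwise a WAM."""
    return "BWM" if sample_size < MAX_BWM_SAMPLE_SIZE else "WAM"


def binomial_two_sided_p(k: int, n: int, p: float) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k under success probability p.
    """
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


def position_frequency(count: int, n: int, background: float,
                       alpha: float = 0.05) -> float:
    """Frequency to use for one nucleotide at one matrix position.

    The observed frequency is kept only when it differs significantly
    from the background; otherwise the background value is used.
    alpha is an assumed significance level, not a GRAPE parameter.
    """
    if binomial_two_sided_p(count, n, background) < alpha:
        return count / n
    return background
```

In this sketch a nucleotide seen 5 times in 20 examples against a background of 0.25 keeps the background value, while one seen 20 times in 20 takes its observed frequency. A real implementation would also need to renormalize the four frequencies at each position, which is omitted here.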

The GRAPE.pl program may take several hours to complete, depending on the hardware on which it is executed.  The result of running the program will be a grape.iso file and a set of corresponding *.cfg files and model files having extensions *.model and *.trans. The grape.iso file can be used directly by the GeneZilla program, though modifications to the path names may be necessary before deployment on another filesystem.
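
After a run, a quick inventory of the output directory can confirm that the expected files were produced. The Python sketch below is a hypothetical helper, not part of GRAPE. It relies only on the file names and extensions mentioned above and makes no assumption about the contents of the files.

```python
import glob
import os
from typing import Dict, List


def list_grape_outputs(workdir: str) -> Dict[str, List[str]]:
    """Group the expected GRAPE output files found in workdir by kind."""
    found: Dict[str, List[str]] = {}
    iso_path = os.path.join(workdir, "grape.iso")
    found["grape.iso"] = ["grape.iso"] if os.path.isfile(iso_path) else []
    for ext in ("cfg", "model", "trans"):
        pattern = os.path.join(workdir, "*." + ext)
        found["*." + ext] = sorted(os.path.basename(p) for p in glob.glob(pattern))
    return found
```

An empty list for any of the four keys suggests that the run did not finish, or that it was started in a different directory.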