Protocol for Hypothesis Generation by BioInformatics


The Förster et al. metabolic network (IFF708) includes 229 reactions for
which there is biochemical evidence in S. cerevisiae but no gene/ORF
annotation. The bioinformatics method of hypothesis generation uses
sequence similarity searches to identify likely candidates for the ORFs
that catalyse these reactions, thereby allowing the Robot Scientist to
discover novel biology. The method consists of the following steps:

1. Identify Enzyme Commission (E.C.) numbers corresponding to enzymes
which participate in yeast metabolism but have no known ORF assigned
to them. This is achieved by identifying all ORFs from IFF708 that are
labelled with u_<id> (where <id> represents an integer) and have known
E.C. numbers.

2. For each E.C. number find the ORFs in other organisms that code for
that enzyme. Use all organisms from the KEGG genome database for this 
search. Collect all amino acid sequences for these ORFs. These are
known as the "query sequences".

3. For each query sequence use a sequence similarity search (PSI-BLAST
or FASTA) to identify the most similar sequences/ORFs in
S. cerevisiae. For FASTA the search is against the S. cerevisiae
genome alone (sequences from KEGG as of July 2004) and for PSI-BLAST
the search is against all databases at NCBI (downloaded on 2nd October
2006). Because PSI-BLAST searches iteratively, a similarity between a
query sequence and an S. cerevisiae sequence may be found indirectly,
via intermediate sequences picked up in earlier iterations. Initially
the FASTA sequence similarity search algorithm was used to test the
feasibility of the method. For a detailed summary of the use of FASTA
and PSI-BLAST see the Appendix.

4. Use the E-value to rank the S. cerevisiae ORFs for each enzyme class
(lowest E-value first).

5. A single hypothesis is the mapping of one S. cerevisiae ORF to one
E.C. class - e.g. YER152C -> 2.6.1.39. There are typically many
hypotheses for each enzyme class.
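
Steps 1, 4 and 5 can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not the code used for the
protocol: the record layouts (label/E.C. pairs and E.C./ORF/E-value
triples), the function names, the placeholder ORF "ORF_X" and all
E-values in the example are invented for illustration.

```python
import re
from collections import defaultdict

# Assumption: reactions without an ORF are labelled u_<id>, <id> an integer.
UNASSIGNED = re.compile(r"^u_\d+$")


def unassigned_ec_numbers(reactions):
    """Step 1: E.C. numbers of u_<id> reactions that have a known E.C. number.

    `reactions` is a hypothetical list of (label, ec_number_or_None) pairs.
    """
    return sorted({ec for label, ec in reactions
                   if UNASSIGNED.match(label) and ec})


def rank_hypotheses(hits):
    """Steps 4-5: rank candidate ORFs per E.C. class by ascending E-value.

    `hits` is a hypothetical list of (ec_number, orf, evalue) triples, one
    per query-sequence match. Each (ec_number, orf) pair is one hypothesis;
    only its best (lowest) E-value is kept.
    """
    best = {}
    for ec, orf, evalue in hits:
        key = (ec, orf)
        if key not in best or evalue < best[key]:
            best[key] = evalue

    ranked = defaultdict(list)
    for (ec, orf), evalue in best.items():
        ranked[ec].append((orf, evalue))
    for ec in ranked:
        # Lowest E-value first; ORF name breaks ties deterministically.
        ranked[ec].sort(key=lambda pair: (pair[1], pair[0]))
    return dict(ranked)


if __name__ == "__main__":
    reactions = [("u_12", "2.6.1.39"), ("YAL001C", "1.1.1.1"), ("u_13", None)]
    print(unassigned_ec_numbers(reactions))

    # Invented E-values, for illustration only.
    hits = [("2.6.1.39", "ORF_X", 1e-12),
            ("2.6.1.39", "YER152C", 1e-8),
            ("2.6.1.39", "YER152C", 1e-30)]
    for ec, candidates in rank_hypotheses(hits).items():
        for orf, evalue in candidates:
            print(f"{orf} -> {ec}  (E-value {evalue:g})")
```

Keeping only the best E-value per ORF is one possible design choice;
other ways of combining several query-sequence hits for the same ORF
would fit the steps above equally well.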


Appendix to Hypothesis Generation - Discussion of FASTA and PSI-BLAST


FASTA works by identifying short regions of similarity (words) between
a query sequence and target sequences, joining adjacent matches and
re-scoring the highest matching regions. It is a rapid way of finding
related sequences in a database search but, unlike the slower
Smith-Waterman dynamic programming algorithm, it is not guaranteed to
find the optimal alignment (p.283, Bioinformatics by D. Mount).
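
The word-matching idea can be illustrated with a minimal Python sketch.
It covers only the first stage (finding shared words and grouping them
by diagonal, i.e. by the offset between query and target positions); the
joining and re-scoring stages of the real FASTA program are omitted. The
function names, the word length k=2 and the example sequences are
assumptions made for illustration.

```python
from collections import defaultdict


def word_index(seq, k):
    """Map every k-letter word in `seq` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=2):
    """Count shared k-letter words on each diagonal.

    A diagonal is identified by (target position - query position); many
    word matches on one diagonal suggest an ungapped region of similarity.
    """
    index = word_index(query, k)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


def best_diagonal(query, target, k=2):
    """Return (diagonal, word_count) for the richest diagonal, or None."""
    counts = diagonal_hits(query, target, k)
    if not counts:
        return None
    return max(counts.items(), key=lambda item: (item[1], -abs(item[0])))


if __name__ == "__main__":
    # Made-up sequences: the query is embedded in the target at offset 2.
    print(best_diagonal("MKVLAT", "GGMKVLATGG"))
```

Because only exact word matches are indexed, the lookup is fast, which
is also why a heuristic of this kind cannot guarantee the optimal
alignment.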

BLAST works in a similar way to FASTA but words in the query sequence
are first expanded into lists of similar words according to a
pre-defined substitution matrix (e.g. BLOSUM62), prior to comparison
against database sequences. This helps increase sensitivity in finding
potential matches. The scores of matches are also calculated using the
substitution matrix.
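
The expansion of a query word into similar words can be sketched as
follows. This is a toy illustration, not BLAST itself: the four-letter
alphabet, the scoring function (which treats D and E as similar) and the
threshold are all invented for the example and stand in for a real
scoring matrix and BLAST's own parameters.

```python
from itertools import product

# Assumption: a toy four-letter alphabet instead of the 20 amino acids.
ALPHABET = "ACDE"


def toy_score(a, b):
    """Toy residue score: identical > similar (D/E) > different."""
    if a == b:
        return 4
    if {a, b} == {"D", "E"}:
        return 1
    return -2


def neighbourhood(word, threshold):
    """List every word scoring at least `threshold` against `word`.

    Returns (candidate_word, score) pairs, highest score first. These
    candidates, not just the original word, are then looked up in the
    database sequences.
    """
    found = []
    for letters in product(ALPHABET, repeat=len(word)):
        total = sum(toy_score(a, b) for a, b in zip(word, letters))
        if total >= threshold:
            found.append(("".join(letters), total))
    return sorted(found, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for candidate, total in neighbourhood("DE", threshold=5):
        print(candidate, total)
```

Lowering the threshold enlarges the word list and so increases
sensitivity at the cost of more comparisons.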

For DNA searches FASTA outperforms BLAST, but BLAST is more sensitive
for protein sequences. BLAST is also faster and better at ignoring
regions of low complexity. It also has the advantage of ready access to
a large range of sequence databases, such as nrprot and swissprot, to
search in.

FASTA provided a good prototype for the method, with the experiments
corresponding to hypotheses for E.C. 2.6.1.39 (L-2-aminoadipate
aminotransferase) being of particular interest. However, a FASTA search
can often miss useful candidate ORFs that have diverged over a longer
evolutionary time.

In the current version the alternative search algorithm PSI-BLAST was
used to address this problem. PSI-BLAST performs consecutive BLAST
searches in which, at each iteration, a position-specific scoring
matrix (PSSM) used to score matches is updated according to the results
of the previous searches. The PSSM is first created from a multiple
alignment of the highest scoring hits in an initial BLAST search, by
calculating a score for each residue at each position in the alignment,
and is refined in later iterations. This helps to find a protein
family, as each search looks for protein sequences that align with the
variations found in the previous one. PSI-BLAST is greedy in the sense
that newly found sequences influence the finding of more sequences like
them, and there is no guarantee that starting from a related but
slightly different sequence would give the same final set of related
sequences. It is nevertheless a worthwhile way to increase the chance
of finding evolutionarily more distant related sequences, but
similarities that first appear in later iterations should be treated
with more suspicion.
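
How position-specific scores can be derived from a multiple alignment is
sketched below. This is a deliberately simplified log-odds calculation
with a flat pseudocount and a uniform background; PSI-BLAST itself uses
sequence weighting and a more elaborate pseudocount scheme, so the
function names, parameters and the toy alignment are assumptions for
illustration only.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column.

    `alignment` is a list of equal-length aligned sequences; characters
    outside `alphabet` (e.g. gaps) are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    if background is None:
        # Assumption: uniform background frequencies.
        background = {a: 1.0 / len(alphabet) for a in alphabet}

    pssm = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount) / total
            column[a] = math.log2(freq / background[a])
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of `seq` (ungapped, same length)."""
    return sum(column[res] for column, res in zip(pssm, seq))


if __name__ == "__main__":
    # Toy alignment over a toy alphabet; position 3 tolerates D or E.
    pssm = build_pssm(["ACD", "ACE", "ACD"], alphabet="ACDE")
    for candidate in ("ACD", "ACE", "CAA"):
        print(candidate, round(score_sequence(pssm, candidate), 3))
```

In an iterative search the sequences that score well against the matrix
would be added to the alignment and the matrix rebuilt, which is how
variation seen in one round influences what is found in the next.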

We allowed a maximum of 20 PSI-BLAST iterations, which resulted in a
long processing time of several months on a Beowulf cluster for the
most prolific of the genes. Most of the S. cerevisiae genes that show
up as matches are returned before round 10, and in many cases before
round 6.
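
One simple way to examine at which round matches first appear is to
record, for each S. cerevisiae ORF, the first iteration in which it is
reported. The sketch below assumes the per-round hit lists have already
been extracted from the search output; the input layout, the function
names and the placeholder ORF identifiers are hypothetical.

```python
def first_round_seen(rounds):
    """Map each ORF to the first iteration (1-based) in which it appears.

    `rounds` is a hypothetical list with one collection of ORF ids per
    PSI-BLAST iteration, in iteration order.
    """
    first = {}
    for number, hits in enumerate(rounds, start=1):
        for orf in hits:
            first.setdefault(orf, number)
    return first


def fraction_found_by(first, cutoff):
    """Fraction of ORFs whose first appearance is at or before `cutoff`."""
    if not first:
        return 0.0
    return sum(1 for rnd in first.values() if rnd <= cutoff) / len(first)


if __name__ == "__main__":
    # Placeholder ids, not real results.
    rounds = [{"ORF_A"}, {"ORF_A", "ORF_B"}, {"ORF_A", "ORF_B", "ORF_C"}]
    first = first_round_seen(rounds)
    print(first)
    print(fraction_found_by(first, cutoff=2))
```

The first-appearance round could also be kept alongside each hypothesis,
so that matches found only in late iterations can be flagged.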

Relevant code and data can be found in ../informatics/bioinformaticsData/