Protocol for Hypothesis Generation by BioInformatics

The Foster et al. metabolic network (IFF708) includes 229 reactions for which there is biochemical evidence in S. cerevisiae but no gene/ORF annotation. The bioinformatics method of hypothesis generation uses sequence similarity methods to identify likely candidates for the ORFs encoding the enzymes that catalyse these reactions, thereby allowing the Robot Scientist to discover novel biology. The method consists of the following steps:

1. Identify the Enzyme Commission (E.C.) numbers of enzymes that participate in yeast metabolism but have no known ORF assigned to them. This is achieved by identifying all ORFs from IFF708 that are labelled with u_ (where represents an integer) and have known E.C. numbers.

2. For each E.C. number, find the ORFs in other organisms that code for that enzyme. Use all organisms from the KEGG genome database for this search. Collect the amino acid sequences of all these ORFs. These are known as the "query sequences".

3. For each query sequence, use a sequence similarity search (PSI-BLAST or FASTA) to identify the most similar sequences/ORFs in S. cerevisiae. For FASTA the search is against the S. cerevisiae genome alone (sequences from KEGG as of July 2004); for PSI-BLAST the search is against all databases at NCBI (downloaded on 2nd October 2006). Because PSI-BLAST searches iteratively, a similarity between a query sequence and an S. cerevisiae sequence may arise indirectly, through sequences found in earlier iterations. FASTA was used initially to test the feasibility of the method. For a detailed summary of the use of FASTA and PSI-BLAST see the Appendix.

4. Use the e-value to rank the S. cerevisiae ORFs for each enzyme class.

5. A single hypothesis is the mapping of one S. cerevisiae ORF to one E.C. class, e.g. YER152C -> 2.6.1.39. There are typically many hypotheses for each enzyme class.
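Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the `Hit` record, the `rank_hypotheses` function, the placeholder ORF name `ORF_B` and all e-values are hypothetical. Only the pairing YER152C -> 2.6.1.39 comes from the protocol. Keeping the lowest e-value when several query sequences hit the same ORF is also an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One similarity-search hit (hypothetical record layout)."""
    ec_number: str   # enzyme class of the query sequence
    yeast_orf: str   # S. cerevisiae ORF returned by the search
    e_value: float   # e-value reported for this hit


def rank_hypotheses(hits):
    """Return {ec_number: [(yeast_orf, best_e_value), ...]} sorted by e-value.

    Several query sequences can hit the same ORF, so only the lowest
    e-value per (E.C. number, ORF) pair is kept before ranking
    (an assumption made for this sketch).
    """
    best = defaultdict(dict)
    for hit in hits:
        previous = best[hit.ec_number].get(hit.yeast_orf)
        if previous is None or hit.e_value < previous:
            best[hit.ec_number][hit.yeast_orf] = hit.e_value
    return {
        ec: sorted(orfs.items(), key=lambda pair: pair[1])
        for ec, orfs in best.items()
    }


# Illustrative values only: the e-values and ORF_B are invented.
example_hits = [
    Hit("2.6.1.39", "YER152C", 1e-40),
    Hit("2.6.1.39", "ORF_B", 1e-5),
    Hit("2.6.1.39", "YER152C", 1e-12),
]
ranking = rank_hypotheses(example_hits)
for ec, ranked in ranking.items():
    for orf, e_value in ranked:
        # Each printed line is one hypothesis: ORF -> E.C. class.
        print(f"{orf} -> {ec} (e-value {e_value:g})")
```

Each (ORF, E.C. number) pair in the ranked lists corresponds to one hypothesis in the sense of step 5, with the most significant candidates first.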
Appendix to Hypothesis Generation - Discussion of FASTA and PSI-BLAST

FASTA works by identifying short regions of similarity (words) between a query sequence and target sequences, joining adjacent matches and re-scoring the highest-matching regions. It is a rapid way of finding related sequences in a database search but, unlike the slower Smith-Waterman dynamic programming algorithm, it is not guaranteed to find an optimal alignment (p.283 Bioinformatics by D. Mount).

BLAST works in a similar way to FASTA, but words in the query sequence are expanded according to a pre-defined substitution matrix prior to comparison against database sequences. This helps increase sensitivity in finding potential matches. The scores of matches are also calculated from this matrix. For DNA searches FASTA outperforms BLAST, but BLAST is more sensitive for protein sequences. BLAST is also faster and is better at ignoring regions of low complexity. It also has the advantage of ready access to a large range of sequence databases, such as nrprot and swissprot, to search.

FASTA represented a good prototype for the method, with experiments corresponding to hypotheses for E.C. 2.6.1.39 (L-2 Aminoadipate Aminotransferase) of particular interest. However, a FASTA search can often miss useful candidate ORFs that are separated from the query by a longer evolutionary divergence. The current version therefore uses the alternative search algorithm PSI-BLAST. PSI-BLAST performs consecutive BLAST searches in which, at each iteration, the position-specific substitution matrix (PSSM) used to score matches is updated according to the results of the previous searches. That is, the PSSM for PSI-BLAST is created from a multiple alignment of the highest scoring hits in an initial BLAST search.
The PSSM is generated by calculating position-specific scores for each position in the alignment and is refined in later iterations. This helps to find a protein family, because each round looks for protein sequences that align with the variation found in the previous rounds. PSI-BLAST is greedy in the sense that newly found sequences influence which further sequences are found, and there is no guarantee that starting the search from a related but slightly different sequence would yield the same final set of related sequences. Nevertheless, it is a worthwhile way to increase the chance of finding evolutionarily more distant related sequences, although similarities that appear only in later iterations should be treated with more suspicion.

We allowed a maximum of 20 PSI-BLAST iterations, which resulted in a long processing time of several months on a Beowulf cluster for the most prolific of the genes. Most of the S. cerevisiae genes that show up as matches are returned before round 10, and in many cases before round 6.

Relevant code and data can be found in ../informatics/bioinformaticsData/
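The way a PSSM is derived from a multiple alignment, as described in this appendix, can be illustrated with a simplified Python sketch. It is not PSI-BLAST's actual procedure, which also applies sequence weighting, data-dependent pseudocounts and empirical background frequencies. The function names, the uniform background, the fixed pseudocount and the toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length aligned sequences.

    Returns one dict per alignment column mapping residue -> score.
    Gap characters ('-') and non-standard residues are not counted.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        counts = Counter(
            seq[col] for seq in alignment if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column_scores = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column_scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the PSSM length")
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Toy alignment, invented for illustration.
toy_alignment = ["MKVL", "MKIL", "MRVL"]
pssm = build_pssm(toy_alignment)
family_like = score_sequence(pssm, "MKVL")
unrelated = score_sequence(pssm, "WWWW")
print(family_like > unrelated)
```

In an iterative search, the hits accepted in one round would be added to the alignment and the matrix rebuilt before the next round. This is why sequences found early influence which sequences are found later.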