/************************************************************************/
/* Prof v1.0                                                            */
/* Author : Mohammed OUALI                                              */
/* A full description of the method is given in the paper entitled      */
/* Ouali, M., & King, R.D. (2000) Cascaded multiple classifiers for     */
/* secondary structure prediction. Prot. Sci 9, 1162-1176               */
/*                                                                      */
/* Copyright (C) 1999 Mohammed Ouali, Ross D. King                      */
/* and the University of Wales, Aberystwyth                             */
/* Penglais, Aberystwyth                                                */
/* Ceredigion SY23 3DB WALES                                            */
/* U.K.                                                                 */
/*                                                                      */
/* This software can be copied and used freely providing it is not      */
/* resold in any form and its use is acknowledged.                      */
/* This software is provided by ``as is'' and any express or implied    */
/* warranties, including, but not limited to, the implied warranties of */
/* merchantability and fitness for a particular purpose are disclaimed. */
/* In no event shall the regents or contributors be liable for any      */
/* direct, indirect, incidental, special, exemplary, or consequential   */
/* damages (including, but not limited to, procurement of substitute    */
/* goods or services; loss of use, data, or profits; or business        */
/* interruption) however caused and on any theory of liability, whether */
/* in contract, strict liability, or tort (including negligence or      */
/* otherwise) arising in any way out of the use of this software, even  */
/* if advised of the possibility of such damage.                        */
/*                                                                      */
/* Questions and queries: rdk@aber.ac.uk                                */
/************************************************************************/

This software is for protein secondary structure prediction.

The main code for Prof can be found at
http://www.aber.ac.uk/~dcswww/Research/bio/dss/prof/Prof.tar.gz

Prof uses a program called trimmer10 (included in this tar file).
Prof also requires PSI-BLAST, a protein sequence database, and clustalw.

-------------
PSI-BLAST

PSI-BLAST (blastpgp, formatdb) can be found at:
http://www.aber.ac.uk/~dcswww/Research/bio/dss/prof/ncbi.tar.gz
Note that posit.c and blastpgp.c are slightly modified.
Upgrades to PSI-BLAST can be found at:
http://www.ncbi.nlm.nih.gov/BLAST/

-------------
Sequence Database

The sequence database can be found at:
http://www.aber.ac.uk/~dcswww/Research/bio/dss/prof/nr.gz
The database used for the detection of homologous sequences should be
in fasta format with gi numbers. This database may easily be changed
(upgraded). Monthly upgraded versions of nr are available at:
ftp://ncbi.nlm.nih.gov/blast/db/
Before blast can use the database, it has to be formatted with the
program formatdb (from the blast package). Prof, on the other hand,
simply uses the flat file nr.

-------------
ClustalW

The program for multiple alignment is currently clustalw1.74. It can be
found at:
http://www.aber.ac.uk/~dcswww/Research/bio/dss/prof/clustalw.tar.gz
A compiled version for SUN is included.
Upgrades for clustalw can be found at:
ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
If a later version is used, care is needed to ensure that the default
parameters are set correctly: we use blosum matrices instead of Gonnet
matrices, and we keep the output order the same as the input order.
We did that because Prof expects the first sequence of the multiple
alignment to be the target one.

======
Below is some documentation on how to compile Prof, followed by a
technical description of it:

How to compile :
----------------
At present Prof has been tested on SGI machines and on Linux PCs.
To compile Prof, type in the directory /dcs/mho/Prof/src :

make

You have to compile trimmer10.c as well, using

gcc trimmer10.c -o trimmer10

=============
Before running Prof, you will need to set up some environment variables :

BLAST_DAT   : The full path of the directory containing the sequence database.
BLAST_DIR   : The full path of the directory containing the 'blastpgp' executable.
PROF_DIR    : The full path of the directory containing the *.param files.
CLUSTAL_DIR : The full path of the directory containing the 'clustalw' executable.

=============
There are basically two modes :

1) Automatic mode (-m 0): Prof does each step for you, starting from a
   sequence in fasta format.
2) Semi-automatic mode (-m 1): you supply the multiple alignment (aln
   format) and the profile from blastpgp.

Examples of set up :
--------------------
setenv BLAST_DAT /dcs/mho/database3 #(location of the nr database in fasta format)
setenv BLAST_DIR /dcs/mho/ncbi/build #(location of blastpgp)
setenv PROF_DIR /dcs/mho/Prof/src #(location of the files *.param (for gor) which can be found in the src directory of Prof)
setenv CLUSTAL_DIR /dcs/mho/clustalw1.7 #(location of clustalw1.74)
OR
setenv CLUSTAL_DIR /dcs/mho/clustalw1.8 #(location of clustalw1.8)

You can put these commands in your .cshrc file, then insert the command
"source .cshrc" in your .login or .bashrc file. (setenv is csh/tcsh
syntax; bash users need the equivalent export lines instead.)

Options for the program :
-------------------------
-v : verbose mode.
-d : delete all the intermediate files generated in automatic mode.
-A : give a global, step-by-step analysis of the results.
-m : set the mode: 1 for the semi-automatic mode, 0 for the automatic one.
-i : if m is set to 0, name of the file containing the sequence in fasta format.
-a : if m is set to 1, name of the file containing the multiple alignment in aln format.
-p : if m is set to 1, name of the file containing the profile from blastpgp.
-b : if m is set to 0, name of the database used by blastpgp for homology
     searching; by default it is nr (in fasta format).
-c : output the prediction in casp format (not very useful; should be used
     only for CASP trials).
-o : name of the output file.

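The environment variables and option rules above can be sketched as a
small Python helper. This is only an illustration, not part of Prof:
the names REQUIRED_ENV, missing_env and build_prof_command are
hypothetical, the sketch assumes the Prof executable sits in $PROF_DIR,
and it only assembles the argument list without running anything.

```python
# Hypothetical helper, not shipped with Prof: checks the four environment
# variables and assembles a Prof argument list according to the mode rules.

REQUIRED_ENV = ("BLAST_DAT", "BLAST_DIR", "PROF_DIR", "CLUSTAL_DIR")


def missing_env(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def build_prof_command(env, mode, output, sequence=None, alignment=None,
                       profile=None, database="nr", verbose=False,
                       delete_intermediate=False, analysis=False, casp=False):
    """Build the argument list for one Prof run (the run itself is not started).

    mode 0 (automatic) needs a fasta sequence file (-i) and a database (-b);
    mode 1 (semi-automatic) needs an aln alignment (-a) and a profile (-p).
    """
    missing = missing_env(env)
    if missing:
        raise ValueError("unset environment variables: " + ", ".join(missing))
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (automatic) or 1 (semi-automatic)")

    # Assumption: the executable is named Prof and lives in PROF_DIR.
    cmd = [env["PROF_DIR"].rstrip("/") + "/Prof"]
    if casp:
        cmd.append("-c")
    if analysis:
        cmd.append("-A")
    if verbose:
        cmd.append("-v")
    if delete_intermediate:
        cmd.append("-d")
    cmd += ["-m", str(mode)]

    if mode == 0:
        if sequence is None:
            raise ValueError("automatic mode needs a fasta sequence file (-i)")
        cmd += ["-i", sequence, "-b", database]
    else:
        if alignment is None or profile is None:
            raise ValueError("semi-automatic mode needs -a and -p files")
        cmd += ["-a", alignment, "-p", profile]

    cmd += ["-o", output]
    return cmd
```

Passing os.environ as env checks the real environment; the returned
list can be handed to subprocess.run once the inputs exist. Rejecting a
missing -i, -a or -p file up front is a design choice of this sketch and
may differ from how Prof itself reports such errors.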
Examples of use :
-----------------
semi-automatic mode : (relatively fast)

$PROF_DIR/Prof -A -v -d -m 1 -a test.aln -p test.matrix -o test.out

or automatic mode : (does each step for you, but slow)

$PROF_DIR/Prof -A -v -d -m 0 -i test.fasta -b nr -o test.out ----> native Prof output format
or :
$PROF_DIR/Prof -c -v -d -m 0 -i test.fasta -b nr -o test.out ----> casp output format

Here follow some examples of the format used by our program.

Contents of Prof
----------------
/**********************************************************************************/
All_net.h                /* contains variables and functions for all the networks */
Prof.h                   /* contains the global variables and the function declarations used by main */
alloue_2.c               /* Allocate memory for a double pointer */
call_clustalw.c          /* call clustalW */
call_gor_nets.c          /* call networks using the output of gor(s) as input(s) */
call_phd_nets.c          /* call phd-like networks */
call_profils_nets.c      /* call networks using profiles computed with gaps */
call_psi_nets.c          /* call networks using psi-blast profile */
call_psiblast.c          /* call psi-blast */
call_trimmer.c           /* call trimmer (for trimming the sequences before clustalW ...) */
compute_profils.c        /* Compute different profiles */
del_int_files.c          /* delete all the intermediate files before the full prediction */
extract.c                /* extract the homologous sequences */
gor.h                    /* contains the variables and some macros for GOR(s) */
gorI.c                   /* GOR I algorithm using both Information based decision and probabilities */
gorI.param               /* Tables used by GOR I */
gorIII.c                 /* GOR III algorithm using both Information based decision and probabilities */
gorIII.param             /* Tables used by GOR III */
gorIV.c                  /* GOR IV algorithm */
gorIV.param              /* Tables used by GOR IV */
linear_gor.c             /* linear discrimination over GOR algorithm outputs */
main.c                   /* Main options and Prof algorithm */
net_gor_consensus.c      /* network using GOR output with consensus, trained in a balanced way */
net_gor_consensus111.c   /* network using GOR output with consensus, trained in an unbalanced way */
net_gor_noconsensus.c    /* network using GOR output without consensus, trained in a balanced way */
net_gor_noconsensus111.c /* network using GOR output without consensus, trained in an unbalanced way */
net_line_step3.c         /* linear discrimination, call network for step 3 of the algorithm, and write outputs */
net_phd.c                /* phd-like network, balanced */
net_phd111.c             /* phd-like network, unbalanced */
net_profil.c             /* network using profile with gaps, balanced */
net_profil111.c          /* network using profile with gaps, unbalanced */
net_psi.c                /* network using psi-blast profile output, balanced */
net_psi111.c             /* network using psi-blast profile output, unbalanced */
net_step4.c              /* network using output of step 3 as inputs */
quadratic_gor.c          /* quadratic discrimination over GOR linear */
read_align.c             /* read ALN format from clustalW (see examples directory) */
read_param_gorI.c        /* read GOR I parameters */
read_param_gorIII.c      /* read GOR III parameters */
read_param_gorIV.c       /* read GOR IV parameters */
read_seq.c               /* read sequence in fasta format */
rematch_align.c          /* rematch to the multiple alignment after prediction with GOR */
step3_net.c              /* network of step 3 using step 2 outputs, balanced */
step3_net111.c           /* network of step 3 using step 2 outputs, unbalanced */
step4_net.c              /* network of step 4 using step 3 outputs as inputs, balanced */
step4_net111.c           /* network of step 4 using step 3 outputs as inputs, unbalanced */
translate_align.c        /* translate the multiple alignment for subsequent computations with GOR */
trimmer10.c              /* trimmer program obtained from Mansoor Squi and modified slightly by M.O. */
truncate_name.c          /* truncate name to construct some intermediate file names */
usage.c                  /* usage */
vote.c                   /* vote procedure for analysing some results */
write_all_gor.c          /* write intermediate output from step 1 (GOR outputs) */
write_step2.c            /* write intermediate output from step 3 (net outputs) */
/**********************************************************************************/

Thanks to Francesco Bettella for pointing out that the strings spread
over multiple lines in usage.c were causing compilation problems, and
for sending us a modified usage.c file, which is now included in this
tarfile. (Feb 2009)