
University of Wales, Aberystwyth
Computational Biology Group.
Department of Computer Science, Aberystwyth SY23 3DB, Wales, UK.


 

Predicting Protein Function from Sequence using Machine Learning

This page contains the data referred to in the papers:


Rules and predictions for M. tuberculosis and E. coli

These are the rules generated to predict the function of the Open Reading Frames (ORFs) in each genome at each level of the functional hierarchy, together with the predictions these rules made on the unseen test data and on the unclassified ORFs. A dash ('-') after an ORF name indicates that the prediction was wrong, and the correct classification is then given (except for the unclassified ORFs listed under "Application to new data", whose true function is unknown). ORFs listed under the "Application to new data" sections are our predictions for ORFs that do not yet have a good functional classification.

Each M. tuberculosis ORF is numbered as in the Sanger Centre's gene list, but the prefix "Rv" is replaced by our prefix "tb", and any trailing "c" is removed.
Each E. coli ORF is numbered from 1 to 4289, with the prefix "ecoli", and can be mapped back to its Blattner number and ORF name by this file.
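
The M. tuberculosis renaming described above can be sketched in Python as follows. This is only an illustration: the function name rv_to_tb is hypothetical, and dropping leading zeros is an assumption inferred from identifiers such as tb217 in the example output on this page (the text itself only mentions the prefix change and the removal of the trailing "c").

```python
import re

# Sanger-style names are assumed to look like "Rv0217c": the prefix "Rv",
# a run of digits, and an optional trailing "c".
_RV_NAME = re.compile(r"Rv(\d+)(c?)")


def rv_to_tb(rv_name: str) -> str:
    """Convert a Sanger Centre ORF name to the "tb" naming used on this page."""
    match = _RV_NAME.fullmatch(rv_name)
    if match is None:
        raise ValueError(f"unrecognised ORF name: {rv_name!r}")
    # Assumption: leading zeros are dropped (e.g. "0217" becomes "217").
    return "tb" + str(int(match.group(1)))


print(rv_to_tb("Rv0217c"))  # tb217
print(rv_to_tb("Rv0706"))   # tb706
```

Because the trailing "c" is discarded, the conversion cannot be reversed from the "tb" name alone.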

Each ORF is annotated with its true class number at each level in the hierarchy, followed by the names of the classes from most general to most specific, and then by the name and specific function of the ORF if they exist (or 'null' if they do not).
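
An annotated ORF line of this kind could be read with a short Python sketch such as the one below. It assumes the fields are separated by tabs, as they appear to be in the rule files; the class and function names (OrfAnnotation, parse_annotation) are hypothetical, and the sample line is copied from the Rule 25 example on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OrfAnnotation:
    orf: str
    mispredicted: bool          # True when a dash follows the ORF name
    class_numbers: List[int]    # one class number per hierarchy level
    class_names: List[str]      # most general class first
    gene_name: Optional[str]    # None when the file says 'null'
    function: Optional[str]     # None when the file says "null"


def parse_annotation(line: str) -> OrfAnnotation:
    # Assumption: fields are tab-separated and may carry trailing spaces.
    fields = [field.strip() for field in line.strip().split("\t")]
    if len(fields) < 4:
        raise ValueError(f"too few fields: {line!r}")
    mispredicted = fields[0].endswith("-")
    orf = fields[0].rstrip("-").strip()
    class_numbers = [int(n) for n in fields[1].split(",")]
    class_names = fields[2:-2]
    gene_name = fields[-2].strip("'")
    function = fields[-1].strip('"')
    return OrfAnnotation(
        orf,
        mispredicted,
        class_numbers,
        class_names,
        None if gene_name == "null" else gene_name,
        None if function == "null" else function,
    )


sample = "\t".join([
    "tb706 -", "2,1,1,0",
    "Macromolecule metabolism ",
    "Synthesis and modification of macromolecules ",
    "Ribosomal protein synthesis and modification ",
    "Ribosomal protein synthesis and modification ",
    "'rplV'", '"50S ribosomal protein L22"',
])
record = parse_annotation(sample)
print(record.orf, record.mispredicted, record.class_numbers)
# tb706 True [2, 1, 1, 0]
```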

Rules from the paper: Genome scale prediction of protein functional class from sequence using data mining.
M. tuberculosis:
tb_rules_level1.txt [690K]
tb_rules_level2.txt [256K]
tb_rules_level3.txt [111K]
tb_rules_level4.txt [14K]

E. coli:
ecoli_rules_level1.txt [221K]
ecoli_rules_level2.txt [178K]
ecoli_rules_level3.txt [84K]

Rules from the paper: Prediction of Protein Functional Class from Sequence in E. Coli
SEQ:
ecoli_SEQ_level1.txt [130K]
ecoli_SEQ_level2.txt [110K]
ecoli_SEQ_level3.txt [21K]

SIM:
ecoli_SIM_level1.txt [230K]
ecoli_SIM_level2.txt [206K]
ecoli_SIM_level3.txt [93K]

STR:
ecoli_STR_level1.txt [67K]
ecoli_STR_level2.txt [11K]
ecoli_STR_level3.txt [34K]

SEQ_SIM:
ecoli_SEQ_SIM_level1.txt [155K]
ecoli_SEQ_SIM_level2.txt [193K]
ecoli_SEQ_SIM_level3.txt [87K]

SEQ_STR:
ecoli_SEQ_STR_level1.txt [120K]
ecoli_SEQ_STR_level2.txt [129K]
ecoli_SEQ_STR_level3.txt [21K]

SIM_STR:
ecoli_SIM_STR_level1.txt [146K]
ecoli_SIM_STR_level2.txt [196K]
ecoli_SIM_STR_level3.txt [104K]

SEQ_SIM_STR (this is the same as the ecoli_rules_level*.txt files listed above):
ecoli_SEQ_SIM_STR_level1.txt [221K]
ecoli_SEQ_SIM_STR_level2.txt [178K]
ecoli_SEQ_SIM_STR_level3.txt [84K]

The rules are of the form:

IF
Condition AND
Condition AND
...
THEN Class

A "1" indicates the condition is true, a "0" that it is false.
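
Read this way, a rule is a conjunction: it applies to an ORF only when every condition has the stated value. A minimal Python sketch of that reading is given below. The names Condition and rule_fires are hypothetical, treating a missing feature as 0 is an assumption, and the two ORF feature dictionaries are invented for illustration (the two conditions are taken from the Rule 25 example on this page).

```python
from typing import Dict, List, Tuple

# A condition pairs a feature name with the value (1 or 0) the rule requires.
Condition = Tuple[str, int]


def rule_fires(conditions: List[Condition], features: Dict[str, int]) -> bool:
    """Return True when every condition matches the ORF's feature values."""
    # Assumption: a feature absent from the dictionary counts as 0 (false).
    return all(features.get(name, 0) == required for name, required in conditions)


MEMBRANE = "[hom(A),keyword(A,membrane)]"
LIGHT_GLU = "[hom(A),mol_wt_rule(A,1),amino_acid_ratio_rule(e,4),e_val_rule(A,3)]"

# Two of the conditions from Rule 25: the first must hold, the second must not.
rule = [(MEMBRANE, 1), (LIGHT_GLU, 0)]

# Invented feature values for two hypothetical ORFs.
orf_a = {MEMBRANE: 1}
orf_b = {MEMBRANE: 1, LIGHT_GLU: 1}

print(rule_fires(rule, orf_a))  # True
print(rule_fires(rule, orf_b))  # False
```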

To further explain the form of these rules we provide a glossary and describe an example rule. Example from level 2 of the M. tuberculosis rules, Rule 25:

Rule 25: (11/1, lift 24.9)
	[hom( A ),keyword( A ,membrane)] = 1
	[hom( A ),species( A ,bacillus_subtilis)] = 1
	[hom( A ),mol_wt_rule( A ,1),amino_acid_ratio_rule(e,4),e_val_rule( A ,3)] = 0
	[hom( A ),e_val_rule( A ,2),amino_acid_ratio_rule(c,1),keyword( A ,transmembrane),mol_wt_rule( A ,4)] = 0
	[hom( A ),keyword( A ,transmembrane),amino_acid_ratio_rule(x,5),classification( A ,bacteria),mol_wt_rule( A ,3),classification( A ,firmicutes)] = 0
	[hom( A ),classification( A ,eukaryota),classification( A ,metazoa),classification( A ,chordata),classification( A ,vertebrata),keyword( A ,repeat),mol_wt_rule( A ,5),classification( A ,mammalia)] = 1
        ->  class 'function2(Degradation of macromolecules )'  [0.846]

	Evaluation on proper test data (811 items):
	tb217	2,2,5,0	Macromolecule metabolism 	Degradation of macromolecules 	Esterases and lipases 	Esterases and lipases 	'lipW'	"probable esterase"
	tb220	2,2,5,0	Macromolecule metabolism 	Degradation of macromolecules 	Esterases and lipases 	Esterases and lipases 	'lipC'	"probable esterase"
	tb706 -	2,1,1,0	Macromolecule metabolism 	Synthesis and modification of macromolecules 	Ribosomal protein synthesis and modification 	Ribosomal protein synthesis and modification 	'rplV'	"50S ribosomal protein L22"
	tb1399	2,2,5,0	Macromolecule metabolism 	Degradation of macromolecules 	Esterases and lipases 	Esterases and lipases 	'lipH'	"probable lipase"
	tb1426	2,2,5,0	Macromolecule metabolism 	Degradation of macromolecules 	Esterases and lipases 	Esterases and lipases 	'lipO'	"probable esterase"
	tb1566 -	4,1,0,0	Other 	Virulence 	Virulence 	Virulence 	'null'	"null"
	tb2485	2,2,5,0	Macromolecule metabolism 	Degradation of macromolecules 	Esterases and lipases 	Esterases and lipases 	'lipQ'	"probable carboxlyesterase"
	tb3682 -	2,3,3,0	Macromolecule metabolism 	Cell envelope 	Murein sacculus and peptidoglycan 	Murein sacculus and peptidoglycan 	'ponA2'	"class A penicillin binding protein"
	Proper test Accuracy: 5/8 (62.50%)

	Application to new data (498 items):
	tb996 -	6,0,0,0	Unknowns 	Unknowns 	Unknowns 	Unknowns 	'null'	"null"
	Total: 1
This rule can be translated as:
IF
there exists a homologous protein in SwissProt with the keyword "membrane" AND
there exists a homologous protein in Bacillus subtilis AND
there does not exist a homologous protein with very low molecular weight, a large percentage of glutamic acid, and medium sequence similarity AND
there does not exist a homologous protein in SwissProt with good sequence similarity, a low percentage of cysteine, the keyword "transmembrane" and a fairly high molecular weight AND
there does not exist a firmicutes sp. protein in SwissProt with the keyword "transmembrane", with medium molecular weight, and a very high amount of low entropy sequence AND
there exists a homologous mammalian protein in SwissProt with the keyword "repeat" and a very high molecular weight
THEN
the ORF has the function "Degradation of macromolecules".
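
The figures attached to Rule 25 can be checked with a little arithmetic, assuming the rules were printed in the standard C5.0 ruleset format. In that format "(n/m, lift x)" gives the number n of training cases the rule covers and the number m of those that do not belong to the predicted class, and the value in square brackets is the Laplace ratio (n - m + 1) / (n + 2). The Python function names below are hypothetical.

```python
import re

_HEADER = re.compile(r"Rule (\d+): \((\d+)(?:/(\d+))?, lift ([\d.]+)\)")


def parse_rule_header(header: str):
    """Return (rule number, cases covered, cases misclassified, lift)."""
    match = _HEADER.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"unrecognised rule header: {header!r}")
    number, covered, wrong, lift = match.groups()
    # The "/m" part is optional in this format; it is taken as 0 when absent.
    return int(number), int(covered), int(wrong or 0), float(lift)


def laplace_confidence(covered: int, wrong: int) -> float:
    """Laplace ratio (n - m + 1) / (n + 2)."""
    return (covered - wrong + 1) / (covered + 2)


number, covered, wrong, lift = parse_rule_header("Rule 25: (11/1, lift 24.9)")
print(round(laplace_confidence(covered, wrong), 3))  # 0.846

# Accuracy on the proper test data: 5 of the 8 listed ORFs carry no dash.
print(f"{5 / 8:.2%}")  # 62.50%
```

The computed 0.846 agrees with the [0.846] printed after the class in the example, which supports reading the header in this way.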

Facts lists for feature generation (WARMR)

The first stage in creating these rules was to use WARMR to find frequent ILP patterns in the data. These files contain the facts collected from the sequence data in a Prolog-readable format for this feature generation stage.

Example:

begin(model(ecoli2)).
ecoli_orf(ecoli2).
ecoli_mol_wt(178222.4).
ecoli_theo_pI(5.47).
ecoli_atomic_comp(c,7880).
ecoli_atomic_comp(h,12642).
ecoli_atomic_comp(n,2184).
ecoli_atomic_comp(o,2375).
ecoli_atomic_comp(s,70).
ecoli_aliphatic_index(99.71).
ecoli_hydro(0.035). 
......
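
Facts in this form can also be read outside Prolog. The Python sketch below is one hedged way to do so: it assumes one fact per line and simple arguments without embedded commas or quoting, which holds for the example above but has not been checked against the full files. The function names (open_facts, parse_fact, iter_facts) are hypothetical.

```python
import bz2
import gzip
import re
from typing import Iterable, Iterator, List, Optional, TextIO, Tuple

# Matches one-line facts such as "ecoli_atomic_comp(c,7880)."
_FACT = re.compile(r"^\s*(\w+)\((.*)\)\.\s*$")


def open_facts(path: str) -> TextIO:
    """Open a facts file as text, choosing the decompressor by file extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_fact(line: str) -> Optional[Tuple[str, List[str]]]:
    """Split a fact into (predicate, arguments); return None for other lines."""
    match = _FACT.match(line)
    if match is None:
        return None
    predicate, body = match.groups()
    # Assumption: arguments contain no commas of their own.
    return predicate, [arg.strip() for arg in body.split(",")]


def iter_facts(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield every line that parses as a fact, skipping the rest."""
    for line in lines:
        fact = parse_fact(line)
        if fact is not None:
            yield fact


example = [
    "begin(model(ecoli2)).",
    "ecoli_mol_wt(178222.4).",
    "ecoli_atomic_comp(c,7880).",
]
for predicate, args in iter_facts(example):
    print(predicate, args)
# begin ['model(ecoli2)']
# ecoli_mol_wt ['178222.4']
# ecoli_atomic_comp ['c', '7880']
```

With a downloaded file, the same loop could be run over the handle returned by open_facts instead of the example list.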
Warning: these files are large! bzip2-compressed versions are provided alongside the gzip-compressed ones for slightly easier downloading.
M. tuberculosis:
tb_facts_training.pl.gz [18.9M]
tb_facts_training.pl.bz2 [11.8M]
tb_facts_test.pl.gz [9.5M]
tb_facts_test.pl.bz2 [6.0M]

E. coli:
ecoli_facts_training.pl.gz [32.8M]
ecoli_facts_training.pl.bz2 [19.9M]
ecoli_facts_test.pl.gz [16.4M]
ecoli_facts_test.pl.bz2 [10.1M]

Rule induction data

At the next stage, rule induction, the input is a randomly chosen 2/3 of the data generated by WARMR. The remaining 1/3 is held back as validation data. The data was split into separate parts for each level of the classification hierarchy. C4.5 and C5.0 were used to generate rules to classify the data.
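
A random two-thirds/one-third split of this kind can be sketched in Python as below. The page does not say how the random choice was made, so the fixed seed, the rounding down of the cut point, and the function name split_two_thirds are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_two_thirds(orfs: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the ORFs and return (two-thirds for training, the rest for validation)."""
    shuffled = list(orfs)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


orfs = [f"ecoli{i}" for i in range(1, 10)]
training, validation = split_two_thirds(orfs)
print(len(training), len(validation))  # 6 3
```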

.data = C4.5/C5.0 training data file.
.validation = validation data file.
.test = held-out test data file.
.unknown = data for ORFs of unknown function.
.names = C4.5/C5.0 names file.
.out = used to translate features in *.names into a readable format afterwards.
.struc.out = as above, but for features related to secondary structure.
.hom.out = as above, but for features related to homology.
.orfs = list of the ORF names used in each case.

M. tuberculosis:
tb.level1.data.gz [1.9M]
tb.level2.data.gz [1.9M]
tb.level3.data.gz [1.9M]
tb.level4.data.gz [1.9M]
tb.level1.validation.gz [1.0M]
tb.level2.validation.gz [1.0M]
tb.level3.validation.gz [1.0M]
tb.level4.validation.gz [1.0M]
tb.level1.test.gz [1.5M]
tb.level2.test.gz [1.5M]
tb.level3.test.gz [1.5M]
tb.level4.test.gz [1.5M]
tb.unknown.gz [1.3M]
tb.level1.names [0.2M]
tb.level2.names [0.2M]
tb.level3.names [0.2M]
tb.level4.names [0.2M]
tb.out [3.6M]
tb.train.orfs
tb.validation.orfs
tb.test.orfs
tb.unknown.orfs

E. coli:
ecoli.level1.data.gz [2.8M]
ecoli.level2.data.gz [2.8M]
ecoli.level3.data.gz [2.8M]
ecoli.level1.validation.gz [1.4M]
ecoli.level2.validation.gz [1.4M]
ecoli.level3.validation.gz [1.4M]
ecoli.level1.test.gz [2.2M]
ecoli.level2.test.gz [2.2M]
ecoli.level3.test.gz [2.2M]
ecoli.unknown.gz [5.3M]
ecoli.level1.names [0.6M]
ecoli.level2.names [0.6M]
ecoli.level3.names [0.6M]
ecoli.struc.out [1.2M]
ecoli.hom.out [2.2M]
ecoli.train.orfs
ecoli.validation.orfs
ecoli.test.orfs
ecoli.unknown.orfs

And finally the classifications used: tb_gene_list_full.pl and ecoli_gene_list_full.pl
 



Enquiries, contact Dr. Ross King.
Updated: 18th October 2000