Predicting Protein Function from Sequence using Machine Learning

This page contains the data referred to in the papers:

  • King, R.D., Karwath, A., Clare, A., & Dehaspe, L. (2000) Genome scale prediction of protein functional class from sequence using data mining. In: The Sixth International Conference on Knowledge Discovery and Data Mining (KDD 2000). pdf
  • King, R.D., Karwath, A., Clare, A., & Dehaspe, L. (2000) Accurate prediction of protein functional class in the M. tuberculosis and E. coli genomes using data mining. Comparative and Functional Genomics 17 283-293 (nb: volume 1 of CFG was volume 17 of Yeast). actual article, gzipped preprint postscript
  • King, R.D., Karwath, A., Clare, A., & Dehaspe, L. (2001) The Utility of Different Representations of Protein Sequence for Predicting Functional Class. Bioinformatics 17(5) 445-454. abstract, gzipped pdf

Rules and predictions for M. tuberculosis and E. coli

These are the rules generated to predict the function of the Open Reading Frames (ORFs) in each genome at each level of the functional hierarchies, together with the predictions these rules made on the unseen test data and on the unclassified ORFs. A dash ('-') next to an ORF name indicates that the prediction was wrong; the correct classification is given (except for the unclassified ORFs listed under "Application to new data", whose true function is unknown). ORFs listed under the "Application to new data" sections are our predictions for ORFs that do not yet have a good functional classification.

Each M. tuberculosis ORF is numbered as in the Sanger Centre's gene list, but the prefix "Rv" is replaced by our prefix "tb", and any trailing "c" is removed.
Each E. coli ORF is numbered from 1 to 4289, with the prefix "ecoli", and can be mapped back to its Blattner number and ORF name by this file.
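
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the dropping of leading zeros is an assumption inferred from identifiers such as "tb217" in the example further down, not something the page states. The mapping from "ecoli" numbers back to Blattner numbers needs the mapping file and is not attempted here.

```python
import re

def rv_to_tb(rv_name: str) -> str:
    """Turn a Sanger-style name like 'Rv0217c' into the 'tb...' form used here."""
    m = re.fullmatch(r"Rv(\d+)c?", rv_name)  # optional trailing "c" is discarded
    if m is None:
        raise ValueError(f"unexpected ORF name: {rv_name!r}")
    # ASSUMPTION: leading zeros are dropped (inferred from ids like tb217).
    return f"tb{int(m.group(1))}"

def ecoli_id(index: int) -> str:
    """Build an E. coli identifier; the page numbers ORFs from 1 to 4289."""
    if not 1 <= index <= 4289:
        raise ValueError(f"index out of range: {index}")
    return f"ecoli{index}"

if __name__ == "__main__":
    print(rv_to_tb("Rv0217c"), ecoli_id(2))
```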

Each ORF is annotated with its true class number at each level in the hierarchy, followed by the names of the classes from most general to most specific. These are followed by the name and specific function of the ORF if they exist (or 'null' if they do not).
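
As a rough illustration of this annotation layout, the following Python sketch splits one prediction line into its parts and computes the fraction of correct predictions. The regular expression, the field names and the helper functions are assumptions made for this example; the sample lines are taken from the Rule 25 listing further down the page. The class names are kept as a single string because the listing does not delimit them individually.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

LINE = re.compile(
    r"^(?P<orf>\S+)\s+"
    r"(?P<wrong>-\s+)?"                 # a dash marks a wrong prediction
    r"(?P<classes>\d+(?:,\d+)*)\s+"     # true class number at each level
    r"(?P<names>.*?)\s+"                # class names, general to specific
    r"'(?P<gene>[^']*)'\s+"
    r'"(?P<function>[^"]*)"\s*$'
)

class Prediction(NamedTuple):
    orf: str
    wrong: bool
    classes: Tuple[int, ...]
    class_names: str
    gene: Optional[str]
    function: Optional[str]

def parse_line(line: str) -> Prediction:
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    none_if_null = lambda s: None if s == "null" else s
    return Prediction(
        m["orf"], m["wrong"] is not None,
        tuple(int(c) for c in m["classes"].split(",")),
        m["names"], none_if_null(m["gene"]), none_if_null(m["function"]),
    )

def accuracy(preds: List[Prediction]) -> float:
    """Fraction of predictions not marked with a dash."""
    return sum(not p.wrong for p in preds) / len(preds)

SAMPLE = [
    "tb217 2,2,5,0 Macromolecule metabolism Degradation of macromolecules "
    "Esterases and lipases Esterases and lipases 'lipW' \"probable esterase\"",
    "tb1566 - 4,1,0,0 Other Virulence Virulence Virulence 'null' \"null\"",
    "tb1399 2,2,5,0 Macromolecule metabolism Degradation of macromolecules "
    "Esterases and lipases Esterases and lipases 'lipH' \"probable lipase\"",
]

if __name__ == "__main__":
    preds = [parse_line(s) for s in SAMPLE]
    print(preds[1].orf, preds[1].wrong, accuracy(preds))
```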

Rules from the paper: Genome scale prediction of protein functional class from sequence using data mining.

M. tuberculosis:
  tb_rules_level1.txt [690K]
  tb_rules_level2.txt [256K]
  tb_rules_level3.txt [111K]
  tb_rules_level4.txt [14K]

E. coli:
  ecoli_rules_level1.txt [221K]
  ecoli_rules_level2.txt [178K]
  ecoli_rules_level3.txt [84K]

Rules from the paper: Prediction of Protein Functional Class from Sequence in E. Coli

SEQ:
  ecoli_SEQ_level1.txt [130K]
  ecoli_SEQ_level2.txt [110K]
  ecoli_SEQ_level3.txt [21K]

SIM:
  ecoli_SIM_level1.txt [230K]
  ecoli_SIM_level2.txt [206K]
  ecoli_SIM_level3.txt [93K]

STR:
  ecoli_STR_level1.txt [67K]
  ecoli_STR_level2.txt [11K]
  ecoli_STR_level3.txt [34K]

SEQ_SIM:
  ecoli_SEQ_SIM_level1.txt [155K]
  ecoli_SEQ_SIM_level2.txt [193K]
  ecoli_SEQ_SIM_level3.txt [87K]

SEQ_STR:
  ecoli_SEQ_STR_level1.txt [120K]
  ecoli_SEQ_STR_level2.txt [129K]
  ecoli_SEQ_STR_level3.txt [21K]

SIM_STR:
  ecoli_SIM_STR_level1.txt [146K]
  ecoli_SIM_STR_level2.txt [196K]
  ecoli_SIM_STR_level3.txt [104K]

SEQ_SIM_STR (this is the same as the ecoli_rules_level*.txt in the table above):
  ecoli_SEQ_SIM_STR_level1.txt [221K]
  ecoli_SEQ_SIM_STR_level2.txt [178K]
  ecoli_SEQ_SIM_STR_level3.txt [84K]

The rules are of the form:

IF
Condition AND
Condition AND
...
THEN Class

A "1" indicates the condition is true, a "0" that it is false.
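
One way to picture this rule form is as a conjunction of (feature, required value) pairs. The Python sketch below is illustrative only: the class name, the feature names and the decision to treat a missing feature as false are assumptions for the example, not part of the published rule files.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Rule:
    # Each condition pairs a feature name with the value it must take:
    # 1 means the condition must be true, 0 means it must be false.
    conditions: List[Tuple[str, int]]
    predicted_class: str

    def fires(self, features: Dict[str, int]) -> bool:
        """True when every condition holds (IF c1 AND c2 AND ... THEN class)."""
        # ASSUMPTION: a feature absent from the dict counts as false (0).
        return all(features.get(name, 0) == required
                   for name, required in self.conditions)

# Hypothetical feature names, loosely modelled on the example rule.
rule = Rule(
    conditions=[("hom_keyword_membrane", 1), ("hom_low_mol_wt", 0)],
    predicted_class="Degradation of macromolecules",
)

if __name__ == "__main__":
    orf = {"hom_keyword_membrane": 1, "hom_low_mol_wt": 0}
    if rule.fires(orf):
        print("predicted:", rule.predicted_class)
```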

To further explain the form of these rules we provide a Glossary and describe an example rule. Example from level 2 of the M. tuberculosis rules, Rule 25:

Rule 25: (11/1, lift 24.9)
[hom( A ),keyword( A ,membrane)] = 1
[hom( A ),species( A ,bacillus_subtilis)] = 1
[hom( A ),mol_wt_rule( A ,1),amino_acid_ratio_rule(e,4),e_val_rule( A ,3)] = 0
[hom( A ),e_val_rule( A ,2),amino_acid_ratio_rule(c,1),keyword( A ,transmembrane),mol_wt_rule( A ,4)] = 0
[hom( A ),keyword( A ,transmembrane),amino_acid_ratio_rule(x,5),classification( A ,bacteria),mol_wt_rule( A ,3),classification( A ,firmicutes)] = 0
[hom( A ),classification( A ,eukaryota),classification( A ,metazoa),classification( A ,chordata),classification( A ,vertebrata),keyword( A ,repeat),mol_wt_rule( A ,5),classification( A ,mammalia)] = 1
-> class 'function2(Degradation of macromolecules )' [0.846]

Evaluation on proper test data (811 items):
tb217 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipW' "probable esterase"
tb220 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipC' "probable esterase"
tb706 - 2,1,1,0 Macromolecule metabolism Synthesis and modification of macromolecules Ribosomal protein synthesis and modification Ribosomal protein synthesis and modification 'rplV' "50S ribosomal protein L22"
tb1399 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipH' "probable lipase"
tb1426 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipO' "probable esterase"
tb1566 - 4,1,0,0 Other Virulence Virulence Virulence 'null' "null"
tb2485 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipQ' "probable carboxlyesterase"
tb3682 - 2,3,3,0 Macromolecule metabolism Cell envelope Murein sacculus and peptidoglycan Murein sacculus and peptidoglycan 'ponA2' "class A penicillin binding protein"
Proper test Accuracy: 5/8 (62.50%)

Application to new data (498 items):
tb996 - 6,0,0,0 Unknowns Unknowns Unknowns Unknowns 'null' "null"
Total: 1

This rule can be translated as:
IF
there exists a homologous protein in SwissProt with the keyword "membrane" AND
there exists a homologous protein in Bacillus subtilis AND
there does not exist a homologous protein with very low molecular weight, a large percentage of glutamic acid, and medium sequence similarity AND
there does not exist a homologous protein in SwissProt with good sequence similarity, low percentage of cysteine, the keyword "transmembrane" and a fairly high molecular weight AND
there does not exist a firmicutes sp. protein in SwissProt with the keyword "transmembrane", with medium molecular weight, and a very high amount of low entropy sequence AND
there exists a homologous mammalian protein in SwissProt with the keyword "repeat" with very high molecular weight
THEN
the ORF has the function "Degradation of macromolecules".
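
The numbers in the rule header can be checked with a short calculation. The sketch below assumes the header follows the C5.0 ruleset output convention, in which "(n/m, lift x)" gives the number of training cases covered (n), the number of those not in the predicted class (m), and the lift, and in which the bracketed confidence is the Laplace ratio (n - m + 1) / (n + 2). The page does not spell this out, so treat it as an assumption; the function names are hypothetical.

```python
import re
from typing import Tuple

HEADER = re.compile(r"\((\d+)(?:/(\d+))?, lift ([\d.]+)\)")

def parse_header(header: str) -> Tuple[int, int, float]:
    """Extract (covered, errors, lift) from e.g. 'Rule 25: (11/1, lift 24.9)'."""
    m = HEADER.search(header)
    if m is None:
        raise ValueError(f"unrecognised rule header: {header!r}")
    return int(m.group(1)), int(m.group(2) or 0), float(m.group(3))

def laplace(covered: int, errors: int) -> float:
    """Laplace-corrected accuracy estimate, assumed to be C5.0's confidence."""
    return (covered - errors + 1) / (covered + 2)

if __name__ == "__main__":
    covered, errors, lift = parse_header("Rule 25: (11/1, lift 24.9)")
    conf = laplace(covered, errors)
    # Under the same assumption, lift = confidence / class frequency,
    # so the implied training frequency of the class is conf / lift.
    print(round(conf, 3), round(conf / lift, 3))
```

Under these assumptions, (11 - 1 + 1) / (11 + 2) rounds to 0.846, which agrees with the value printed in brackets after the class for Rule 25.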

Facts lists for feature generation (WARMR)

The first stage in creating these rules was to use WARMR to find frequent ILP patterns in the data. These files contain the facts collected from the sequence data in a Prolog-readable format for this feature generation stage.

Example:

begin(model(ecoli2)).
ecoli_orf(ecoli2).
ecoli_mol_wt(178222.4).
ecoli_theo_pI(5.47).
ecoli_atomic_comp(c,7880).
ecoli_atomic_comp(h,12642).
ecoli_atomic_comp(n,2184).
ecoli_atomic_comp(o,2375).
ecoli_atomic_comp(s,70).
ecoli_aliphatic_index(99.71).
ecoli_hydro(0.035).
......

Warning: These files are large! bzip2-ed versions are given for slightly easier downloading.

M. tuberculosis:
  tb_facts_training.pl.gz [18.9M]
  tb_facts_training.pl.bz2 [11.8M]
  tb_facts_test.pl.gz [9.5M]
  tb_facts_test.pl.bz2 [6.0M]

E. coli:
  ecoli_facts_training.pl.gz [32.8M]
  ecoli_facts_training.pl.bz2 [19.9M]
  ecoli_facts_test.pl.gz [16.4M]
  ecoli_facts_test.pl.bz2 [10.1M]
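
For readers who want to inspect the facts without a Prolog system, here is a minimal Python sketch that groups facts by model and streams the compressed files line by line. It is a naive reader written for the simple clauses shown in the example above: it assumes one fact per line, splits arguments on commas (so nested terms or quoted commas elsewhere in the full files would need a real Prolog parser), and assumes any end(model(...)) marker can be ignored. The function names are hypothetical.

```python
import bz2
import gzip
import re
from collections import defaultdict

MODEL = re.compile(r"^begin\(model\((\w+)\)\)\.\s*$")
FACT = re.compile(r"^(\w+)\((.*)\)\.\s*$")

def _atom(token):
    """Convert a Prolog argument to int or float where possible."""
    token = token.strip()
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token

def parse_facts(lines):
    """Return {model name: [(predicate, args), ...]} for simple facts."""
    models = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        start = MODEL.match(line)
        if start:
            current = start.group(1)
            continue
        fact = FACT.match(line)
        if fact is None or current is None or fact.group(1) == "end":
            continue  # skips blanks, elisions and end(model(...)) markers
        args = tuple(_atom(t) for t in fact.group(2).split(","))
        models[current].append((fact.group(1), args))
    return dict(models)

def open_facts(path):
    """Open a .gz or .bz2 facts file as text, without unpacking it first."""
    opener = bz2.open if path.endswith(".bz2") else gzip.open
    return opener(path, "rt")

SAMPLE = """\
begin(model(ecoli2)).
ecoli_orf(ecoli2).
ecoli_mol_wt(178222.4).
ecoli_atomic_comp(c,7880).
"""

if __name__ == "__main__":
    print(parse_facts(SAMPLE.splitlines()))
```

Because the files are large, passing the open file object directly to parse_facts (rather than reading it into memory first) keeps only the parsed facts in memory.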

Rule induction data

At the next stage, rule induction, the input is a randomly chosen 2/3 of the data generated by WARMR. The remaining 1/3 is held back as validation data. The data was split into separate parts for each level of the classification hierarchy. C4.5 and C5.0 were used to generate rules to classify the data.
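
A random two-thirds/one-third split of this kind can be sketched as follows. This is not the authors' procedure, which the page does not give in detail; the function name, the fixed seed and the rounding of the cut point are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_two_thirds(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly assign about 2/3 of the items to training, the rest to validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = (2 * len(shuffled)) // 3          # ASSUMPTION: round the cut down
    return shuffled[:cut], shuffled[cut:]

if __name__ == "__main__":
    orfs = [f"ecoli{i}" for i in range(1, 10)]
    train, validation = split_two_thirds(orfs)
    print(len(train), len(validation))
```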

.data = C4.5/C5.0 training data file
.validation = validation data file
.test = held-out test data file
.unknown = data for ORFs of unknown function
.names = C4.5/C5.0 names file
.out = used to translate features in *.names into readable format afterwards
.struc.out = as above but for features related to secondary structure
.hom.out = as above but for features related to homology
.orfs = list of ORF names used in each case

M. tuberculosis:
  tb.level1.data.gz [1.9M]
  tb.level2.data.gz [1.9M]
  tb.level3.data.gz [1.9M]
  tb.level4.data.gz [1.9M]
  tb.level1.validation.gz [1.0M]
  tb.level2.validation.gz [1.0M]
  tb.level3.validation.gz [1.0M]
  tb.level4.validation.gz [1.0M]
  tb.level1.test.gz [1.5M]
  tb.level2.test.gz [1.5M]
  tb.level3.test.gz [1.5M]
  tb.level4.test.gz [1.5M]
  tb.unknown.gz [1.3M]
  tb.level1.names [0.2M]
  tb.level2.names [0.2M]
  tb.level3.names [0.2M]
  tb.level4.names [0.2M]
  tb.out [3.6M]
  tb.train.orfs
  tb.validation.orfs
  tb.test.orfs
  tb.unknown.orfs

E. coli:
  ecoli.level1.data.gz [2.8M]
  ecoli.level2.data.gz [2.8M]
  ecoli.level3.data.gz [2.8M]
  ecoli.level1.validation.gz [1.4M]
  ecoli.level2.validation.gz [1.4M]
  ecoli.level3.validation.gz [1.4M]
  ecoli.level1.test.gz [2.2M]
  ecoli.level2.test.gz [2.2M]
  ecoli.level3.test.gz [2.2M]
  ecoli.unknown.gz [5.3M]
  ecoli.level1.names [0.6M]
  ecoli.level2.names [0.6M]
  ecoli.level3.names [0.6M]
  ecoli.struc.out [1.2M]
  ecoli.hom.out [2.2M]
  ecoli.train.orfs
  ecoli.validation.orfs
  ecoli.test.orfs
  ecoli.unknown.orfs
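
The .names and .data files above are inputs for C4.5/C5.0. As a rough guide to reading them outside those tools, the Python sketch below assumes the classic C4.5 layout: the first entry of the .names file lists the classes, each later entry has the form "attribute: values.", a "|" starts a comment, and each .data line holds comma-separated attribute values with the class last. The real files should be checked against this, since C5.0 also accepts other .names layouts. The attribute names, class names and values in the sample are invented.

```python
from typing import Dict, List, Tuple

def _strip_comment(line: str) -> str:
    """Drop anything after '|' (the C4.5 comment marker) and trim whitespace."""
    return line.split("|", 1)[0].strip()

def parse_names(text: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (class names, [(attribute name, value spec), ...])."""
    entries = [_strip_comment(l) for l in text.splitlines()]
    entries = [e.rstrip(".").strip() for e in entries if e]
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, _, values = entry.partition(":")
        attributes.append((name.strip(), values.strip()))
    return classes, attributes

def parse_data(text: str, attributes) -> List[Tuple[Dict[str, str], str]]:
    """Return [(attribute -> value, class label), ...], one item per case."""
    names = [name for name, _ in attributes]
    cases = []
    for line in text.splitlines():
        line = _strip_comment(line).rstrip(".")
        if not line:
            continue
        *values, label = [v.strip() for v in line.split(",")]
        cases.append((dict(zip(names, values)), label))
    return cases

# Invented sample content, for illustration only.
NAMES = "| hypothetical names file\nclassA, classB.\n\nf1: 0, 1.\nf2: 0, 1.\n"
DATA = "1,0,classA\n0,1,classB.\n"

if __name__ == "__main__":
    classes, attributes = parse_names(NAMES)
    print(classes, parse_data(DATA, attributes))
```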

Finally, the classifications used: tb_gene_list_full.pl and ecoli_gene_list_full.pl.