Aberystwyth Computer Science: Computational Biology: Protein Function
University of Wales, Aberystwyth
Computational Biology Group, Department of Computer Science, Aberystwyth SY23 3DB, Wales, UK.
Each M. tuberculosis ORF is numbered as in the Sanger Centre's gene list, but the prefix "Rv" is replaced by our prefix "tb", and any trailing "c" is removed.
Each E. coli ORF is numbered from 1 to 4289, with the prefix "ecoli",
and can be mapped back to its Blattner number and ORF name by this file.
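The M. tuberculosis naming convention described above can be sketched in a few lines of Python. This is a minimal illustration, not the group's own tooling: the function name `rv_to_tb` and the `strip_zeros` option are hypothetical. Stripping leading zeros is an assumption based on identifiers such as tb217 and tb706 that appear in the rule listings on this page; the text itself only mentions replacing the prefix and dropping a trailing "c".

```python
import re

# Sanger-style identifiers such as "Rv0217c": prefix, digits, optional trailing "c".
_RV_PATTERN = re.compile(r"Rv(\d+)(c?)")


def rv_to_tb(rv_id: str, strip_zeros: bool = True) -> str:
    """Convert e.g. 'Rv0217c' to 'tb217' (hypothetical helper).

    strip_zeros=True is an assumption inferred from IDs like 'tb217';
    pass False to keep the digits exactly as given.
    """
    match = _RV_PATTERN.fullmatch(rv_id)
    if match is None:
        raise ValueError(f"not a recognised Rv identifier: {rv_id!r}")
    digits = match.group(1)
    if strip_zeros:
        digits = str(int(digits))
    return "tb" + digits


if __name__ == "__main__":
    for example in ("Rv0217c", "Rv0706", "Rv1399c"):
        print(example, "->", rv_to_tb(example))
```

Identifiers that do not fit the simple "Rv, digits, optional c" shape raise an error rather than being guessed at.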
Each ORF is annotated with its true class number at each level in the hierarchy, followed by the names of the classes from most general to most specific. These are followed by the name and specific function of the ORF if they exist (or 'null' if they don't).
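An annotation line in this layout can be pulled apart with a regular expression. The sketch below assumes the layout seen in the rule listings on this page: ORF identifier, an optional "-" marker, comma-separated class numbers, the class names, then the name in single quotes and the function in double quotes. The function name `parse_annotation` and the dictionary keys are hypothetical. The "-" marker is captured but not interpreted, and the class names are kept as one string because multi-word names cannot be split reliably on whitespace.

```python
import re

ANNOTATION = re.compile(
    r"^(?P<orf>\S+)\s+"
    r"(?:(?P<flag>-)\s+)?"              # optional "-" marker, meaning not assumed here
    r"(?P<classes>\d+(?:,\d+)*)\s+"     # class number at each level, e.g. 2,2,5,0
    r"(?P<names>.*?)\s+"                # class names, most general to most specific
    r"'(?P<gene>[^']*)'\s+"             # ORF name, or 'null'
    r'"(?P<function>[^"]*)"\s*$'        # specific function, or "null"
)


def parse_annotation(line: str) -> dict:
    """Split one annotated ORF line into its parts (hypothetical helper)."""
    match = ANNOTATION.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised annotation line: {line!r}")
    gene = match.group("gene")
    function = match.group("function")
    return {
        "orf": match.group("orf"),
        "flagged": match.group("flag") is not None,
        "classes": [int(n) for n in match.group("classes").split(",")],
        "class_names": match.group("names"),
        "gene": None if gene == "null" else gene,
        "function": None if function == "null" else function,
    }


if __name__ == "__main__":
    line = ("tb217 2,2,5,0 Macromolecule metabolism Degradation of macromolecules "
            "Esterases and lipases Esterases and lipases 'lipW' \"probable esterase\"")
    print(parse_annotation(line))
```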
Rules from the paper: Genome scale prediction of protein functional class from sequence using data mining.
M. tuberculosis | tb_rules_level1.txt [690K] tb_rules_level2.txt [256K] tb_rules_level3.txt [111K] tb_rules_level4.txt [14K] |
E. coli | ecoli_rules_level1.txt [221K] ecoli_rules_level2.txt [178K] ecoli_rules_level3.txt [84K] |
Rules from the paper: Prediction of Protein Functional Class from Sequence in E. coli.
SEQ | ecoli_SEQ_level1.txt [130K] ecoli_SEQ_level2.txt [110K] ecoli_SEQ_level3.txt [21K] |
SIM | ecoli_SIM_level1.txt [230K] ecoli_SIM_level2.txt [206K] ecoli_SIM_level3.txt [93K] |
STR | ecoli_STR_level1.txt [67K] ecoli_STR_level2.txt [11K] ecoli_STR_level3.txt [34K] |
SEQ_SIM | ecoli_SEQ_SIM_level1.txt [155K] ecoli_SEQ_SIM_level2.txt [193K] ecoli_SEQ_SIM_level3.txt [87K] |
SEQ_STR | ecoli_SEQ_STR_level1.txt [120K] ecoli_SEQ_STR_level2.txt [129K] ecoli_SEQ_STR_level3.txt [21K] |
SIM_STR | ecoli_SIM_STR_level1.txt [146K] ecoli_SIM_STR_level2.txt [196K] ecoli_SIM_STR_level3.txt [104K] |
SEQ_SIM_STR (this is the same as the ecoli_rules_level*.txt in the table above) | ecoli_SEQ_SIM_STR_level1.txt [221K] ecoli_SEQ_SIM_STR_level2.txt [178K] ecoli_SEQ_SIM_STR_level3.txt [84K] |
The rules are of the form:
IF
Condition AND
Condition AND
...
THEN Class
A "1" indicates the condition is true, a "0" that it is false.
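The rule form above amounts to a conjunction of 0/1 tests. The following is a minimal sketch of how such a rule could be checked against one ORF's feature values. The `Rule` class, the `rule_fires` function, and the feature names are hypothetical and used only for illustration; treating a missing feature as 0 (false) is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rule:
    # Each condition pairs a feature name with the value it must take:
    # 1 means the condition must be true, 0 means it must be false.
    conditions: Tuple[Tuple[str, int], ...]
    predicted_class: str


def rule_fires(rule: Rule, features: Dict[str, int]) -> bool:
    """Return True when every condition matches (IF c1 AND c2 AND ... THEN class)."""
    # Assumption: a feature absent from the dictionary counts as 0 (false).
    return all(features.get(name, 0) == expected
               for name, expected in rule.conditions)


# Hypothetical rule with made-up feature names.
example_rule = Rule(
    conditions=(
        ("hom_keyword_membrane", 1),
        ("hom_species_bacillus_subtilis", 1),
        ("hom_low_mol_wt_high_glu", 0),
    ),
    predicted_class="Degradation of macromolecules",
)

if __name__ == "__main__":
    orf_features = {"hom_keyword_membrane": 1, "hom_species_bacillus_subtilis": 1}
    if rule_fires(example_rule, orf_features):
        print("predicted class:", example_rule.predicted_class)
```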
To further explain the form of these rules we provide a glossary and describe an example rule. Example from level 2 of the M. tuberculosis rules, Rule 25:
Rule 25: (11/1, lift 24.9)
    [hom( A ),keyword( A ,membrane)] = 1
    [hom( A ),species( A ,bacillus_subtilis)] = 1
    [hom( A ),mol_wt_rule( A ,1),amino_acid_ratio_rule(e,4),e_val_rule( A ,3)] = 0
    [hom( A ),e_val_rule( A ,2),amino_acid_ratio_rule(c,1),keyword( A ,transmembrane),mol_wt_rule( A ,4)] = 0
    [hom( A ),keyword( A ,transmembrane),amino_acid_ratio_rule(x,5),classification( A ,bacteria),mol_wt_rule( A ,3),classification( A ,firmicutes)] = 0
    [hom( A ),classification( A ,eukaryota),classification( A ,metazoa),classification( A ,chordata),classification( A ,vertebrata),keyword( A ,repeat),mol_wt_rule( A ,5),classification( A ,mammalia)] = 1
    -> class 'function2(Degradation of macromolecules )' [0.846]

Evaluation on proper test data (811 items):

tb217 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipW' "probable esterase"
tb220 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipC' "probable esterase"
tb706 - 2,1,1,0 Macromolecule metabolism Synthesis and modification of macromolecules Ribosomal protein synthesis and modification Ribosomal protein synthesis and modification 'rplV' "50S ribosomal protein L22"
tb1399 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipH' "probable lipase"
tb1426 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipO' "probable esterase"
tb1566 - 4,1,0,0 Other Virulence Virulence Virulence 'null' "null"
tb2485 2,2,5,0 Macromolecule metabolism Degradation of macromolecules Esterases and lipases Esterases and lipases 'lipQ' "probable carboxlyesterase"
tb3682 - 2,3,3,0 Macromolecule metabolism Cell envelope Murein sacculus and peptidoglycan Murein sacculus and peptidoglycan 'ponA2' "class A penicillin binding protein"

Proper test Accuracy: 5/8 (62.50%)

Application to new data (498 items):

tb996 - 6,0,0,0 Unknowns Unknowns Unknowns Unknowns 'null' "null"
Total: 1

This rule can be translated as
IF
there exists a homologous protein in SwissProt with the keyword "membrane" AND
there exists a homologous protein in Bacillus subtilis AND
there does not exist a homologous protein with very low molecular weight, a large percentage of glutamic acid, and medium sequence similarity AND
there does not exist a homologous protein in SwissProt with good sequence similarity, a low percentage of cysteine, the keyword "transmembrane", and a fairly high molecular weight AND
there does not exist a homologous firmicutes sp. protein in SwissProt with the keyword "transmembrane", medium molecular weight, and a very high amount of low entropy sequence AND
there exists a homologous mammalian protein in SwissProt with the keyword "repeat" and very high molecular weight
THEN the ORF has the function "Degradation of macromolecules".
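The figures in the header of Rule 25, "(11/1, lift 24.9)" and "[0.846]", can be checked with a short calculation. This sketch assumes the rules were produced by C5.0, as the .data and .names files listed on this page suggest. In C5.0's output, (n/m) gives the number of training cases covered by the rule and the number of those not in the predicted class, the bracketed value is the Laplace ratio (n - m + 1) / (n + 2), and the lift is that value divided by the relative frequency of the predicted class in the training set. The function names are hypothetical.

```python
def laplace_confidence(covered: int, errors: int) -> float:
    """Laplace ratio (n - m + 1) / (n + 2), as used for C5.0 rule confidence."""
    return (covered - errors + 1) / (covered + 2)


def implied_class_frequency(confidence: float, lift: float) -> float:
    """Relative frequency of the predicted class implied by lift = confidence / frequency."""
    return confidence / lift


def accuracy(correct: int, total: int) -> float:
    """Fraction of covered test ORFs whose true class matches the rule's class."""
    return correct / total


if __name__ == "__main__":
    conf = laplace_confidence(covered=11, errors=1)   # Rule 25: (11/1, ...)
    print(f"confidence: {conf:.3f}")
    print(f"implied class frequency: {implied_class_frequency(conf, 24.9):.4f}")
    print(f"proper test accuracy: {accuracy(5, 8):.2%}")
```

Under these assumptions the Laplace ratio for 11 covered cases with 1 error is 11/13, which rounds to the 0.846 printed after the rule, and the lift of 24.9 implies that the class makes up roughly 3.4% of the training set.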
Example:
begin(model(ecoli2)).
ecoli_orf(ecoli2).
ecoli_mol_wt(178222.4).
ecoli_theo_pI(5.47).
ecoli_atomic_comp(c,7880).
ecoli_atomic_comp(h,12642).
ecoli_atomic_comp(n,2184).
ecoli_atomic_comp(o,2375).
ecoli_atomic_comp(s,70).
ecoli_aliphatic_index(99.71).
ecoli_hydro(0.035).
......

Warning: These files are large! bzip2-ed versions are given for slightly easier downloading.
M. tuberculosis | tb_facts_training.pl.gz [18.9M] tb_facts_training.pl.bz2 [11.8M] tb_facts_test.pl.gz [9.5M] tb_facts_test.pl.bz2 [6.0M] |
E. coli | ecoli_facts_training.pl.gz [32.8M] ecoli_facts_training.pl.bz2 [19.9M] ecoli_facts_test.pl.gz [16.4M] ecoli_facts_test.pl.bz2 [10.1M] |
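Facts in the format of the example above can be grouped by ORF with a small reader. This is a rough sketch under stated assumptions: one fact per line, each model opened by begin(model(...)), and a matching end(model(...)) line closing it (the closing line is an assumption, since the example is truncated). The names `split_args` and `read_models` are hypothetical.

```python
import re
from collections import defaultdict

FACT = re.compile(r"^\s*(\w+)\((.*)\)\.\s*$")    # functor(arguments).
MODEL = re.compile(r"^model\((\w+)\)$")          # argument of begin(...)/end(...)


def split_args(text):
    """Split an argument string on commas that are not inside parentheses."""
    args, depth, current = [], 0, []
    for ch in text:
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    args.append("".join(current).strip())
    return args


def read_models(lines):
    """Group facts by model: {orf_id: [(functor, [args...]), ...]}."""
    models = defaultdict(list)
    current = None
    for line in lines:
        match = FACT.match(line)
        if match is None:
            continue                      # skip blanks, comments, elisions
        functor, args = match.group(1), split_args(match.group(2))
        if functor == "begin":
            model = MODEL.match(args[0])
            current = model.group(1) if model else None
        elif functor == "end":            # assumed closing marker
            current = None
        elif current is not None:
            models[current].append((functor, args))
    return dict(models)


if __name__ == "__main__":
    sample = ["begin(model(ecoli2)).", "ecoli_orf(ecoli2).",
              "ecoli_mol_wt(178222.4).", "ecoli_atomic_comp(c,7880)."]
    print(read_models(sample))
```

Because `read_models` accepts any iterable of lines, a compressed download could be streamed rather than unpacked first, for example with `bz2.open("ecoli_facts_test.pl.bz2", "rt")` from the Python standard library.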
.data = C4.5/C5.0 training data file.
.validation = validation data file.
.test = held-out test data file.
.unknown = data for ORFs of unknown function.
.names = C4.5/C5.0 names file.
.out = used to translate features in *.names into a readable format afterwards.
.struc.out = as above, but for features related to secondary structure.
.hom.out = as above, but for features related to homology.
.orfs = list of ORF names used in each case.
And finally the classifications used: tb_gene_list_full.pl and ecoli_gene_list_full.pl