Data for the yeast genome (S. cerevisiae)

This is the data used in experiments to predict the functional class of yeast ORFs. These experiments are reported in

  • the PhD thesis "Machine learning and data mining for yeast functional genomics", Amanda Clare, UWA, February 2003, pdf
  • Clare, A. and King, R.D. (2003) Predicting gene function in Saccharomyces cerevisiae. 2nd European Conference on Computational Biology (ECCB '03). (Published as a journal supplement in Bioinformatics 19: ii42-ii49.)

The predictions that were made from this data can be found here.

Classes

The classes were taken from the MIPS functional catalog on 24/4/02, and are listed in the file classes.txt. The actual functional assignments we used are in the file yeast_list_full.24.4.02.pl.

Sequence

This data was collected from a variety of sources, including ProtParam and MIPS. The attributes are as follows:

 

Attribute           Type        Description
aa_rat_X            real        Percentage of amino acid X in the protein
seq_len             integer     Length of the protein sequence
aa_rat_pair_X_Y     real        Percentage of the pair of amino acids X and Y occurring consecutively in the protein
mol_wt              integer     Molecular weight of the protein
theo_pI             real        Theoretical pI (isoelectric point)
atomic_comp_X       real        Atomic composition of X, where X is c (carbon), o (oxygen), n (nitrogen), s (sulphur) or h (hydrogen)
aliphatic_index     real        The aliphatic index
hydro               real        Grand average of hydropathicity
strand              'w' or 'c'  The DNA strand on which the ORF lies
position            integer     Number of exons (the number of start positions in the ORF's coordinates list)
cai                 real        Codon adaptation index, calculated according to Sharp and Li (Sharp1987)
motifs              integer     Number of motifs, according to PROSITE dictionary release 13 of Nov. 1995 (Bairoch1996)
transmembraneSpans  integer     Number of transmembrane spans: calculation follows Klein et al. (Klein1985) using the ALOM program. P:I threshold value of 0.1 is used for ORF products which have at least only one transmembrane span. P:I threshold value of 0.15 is used for all TM-calculated proteins. (Goffeau1993)
chromosome          1..16, mit  Chromosome number for this ORF
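
As an illustration of the composition attributes in the table, the following is a minimal sketch of how seq_len, aa_rat_X and aa_rat_pair_X_Y could be computed from a protein sequence. It is not the code used to build the data set. The function name, the lower-case attribute keys, the treatment of X followed by Y as an ordered pair, and the choice of the number of adjacent pairs as the denominator are all assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_features(sequence):
    """Return seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    features = {"seq_len": len(seq)}

    # aa_rat_X: percentage of residues in the sequence that are amino acid X.
    counts = Counter(seq)
    for aa in AMINO_ACIDS:
        features[f"aa_rat_{aa.lower()}"] = 100.0 * counts[aa] / len(seq)

    # aa_rat_pair_X_Y: percentage of adjacent residue pairs that are X then Y.
    # Assumption: the denominator is the number of adjacent pairs (len - 1).
    pairs = Counter(zip(seq, seq[1:]))
    n_pairs = max(len(seq) - 1, 1)
    for (x, y), n in pairs.items():
        features[f"aa_rat_pair_{x.lower()}_{y.lower()}"] = 100.0 * n / n_pairs
    return features


feats = composition_features("MKKLA")
print(feats["seq_len"], feats["aa_rat_k"], feats["aa_rat_pair_k_k"])
```

Only pairs that occur in the sequence get a key in this sketch; a pair that never occurs would correspond to a value of 0.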

The data is available here in C4.5 format. The .names files give the attribute names. The .train, .valid and .propertest files hold the training, validation and propertest data respectively. The .unknown files hold data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or that had no function listed at this level in the hierarchy. The digit in each file name is the level in the MIPS functional classification: 1 is the most general and 4 the most specific, and 0 represents function classification at all appropriate levels.
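
The relationship between class codes, levels and the .unknown files can be sketched as follows. This is only an illustration under stated assumptions: it assumes every class is written as four comma-separated numbers padded with zeros (as in 99,0,0,0), and the function name class_at_level is hypothetical.

```python
UNKNOWN_TOP_LEVEL = {98, 99}  # 98,0,0,0 and 99,0,0,0 in the text above


def class_at_level(code, level):
    """Truncate a four-part class code to `level` (1-4).

    Returns None when the ORF would count as unknown at that level.
    """
    parts = tuple(int(p) for p in code.split(","))
    if parts[0] in UNKNOWN_TOP_LEVEL:
        return None  # unclassified, or classification not yet clear cut
    if parts[level - 1] == 0:
        return None  # no function listed at this depth of the hierarchy
    # Keep the first `level` components and pad the rest with zeros.
    return parts[:level] + (0,) * (4 - level)


print(class_at_level("1,2,0,0", 1))   # (1, 0, 0, 0)
print(class_at_level("1,2,0,0", 3))   # None
print(class_at_level("99,0,0,0", 1))  # None
```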

seq0.names  -  seq0.train.bz2  -  seq0.valid.bz2  -  seq0.propertest.bz2  -  seq0.unknown.bz2
seq1.names  -  seq1.train.bz2  -  seq1.valid.bz2  -  seq1.propertest.bz2  -  seq1.unknown.bz2
seq2.names  -  seq2.train.bz2  -  seq2.valid.bz2  -  seq2.propertest.bz2  -  seq2.unknown.bz2
seq3.names  -  seq3.train.bz2  -  seq3.valid.bz2  -  seq3.propertest.bz2  -  seq3.unknown.bz2
seq4.names  -  seq4.train.bz2  -  seq4.valid.bz2  -  seq4.propertest.bz2  -  seq4.unknown.bz2

The files can be up to 1.5M each in size.
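
For readers who want to load these files without C4.5 itself, here is a minimal sketch of a reader for the general C4.5 layout: a .names file whose first entry lists the classes and whose later entries are "attribute: type." lines, and a comma-separated data file with the class in the last column. It assumes one entry per line and a single class label per row; the real files may differ, for example in how several classes for one ORF are encoded. The toy file contents and class names below are invented.

```python
import bz2
import csv
import os
import tempfile


def read_names(text):
    """Parse C4.5 .names text into (class_values, attribute_names)."""
    entries = []
    for line in text.splitlines():
        line = line.split("|", 1)[0].strip()  # '|' starts a comment
        if line:
            entries.append(line.rstrip("."))
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = [e.split(":", 1)[0].strip() for e in entries[1:]]
    return classes, attributes


def read_data(path, attributes):
    """Yield (attribute dict, class label) per row of a bzip2-compressed file."""
    with bz2.open(path, "rt") as handle:
        for row in csv.reader(handle):
            if not row:
                continue
            values = [v.strip() for v in row]
            # '?' marks a missing value in C4.5 data files.
            record = {name: (None if v == "?" else v)
                      for name, v in zip(attributes, values)}
            yield record, values[-1]


# Toy example: write a tiny compressed data file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.train.bz2")
    with bz2.open(path, "wt") as out:
        out.write("431,w,classA\n?,c,classB\n")
    classes, attrs = read_names(
        "classA, classB.\nseq_len: continuous.\nstrand: w, c.\n"
    )
    rows = list(read_data(path, attrs))

print(classes, attrs)
print(rows)
```

Values are left as strings here; converting continuous attributes to numbers would need the types from the .names file.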

Phenotype

See also this page about learning with phenotype data.

Original sources of data:

Reformatted for C4.5 (large files are bzip2-ed):

Description of attributes (growth media).

Expression

Download data as a single gzipped tar file. See also this page about results from clustering expression data.

Original data sources:

cellcycle http://genome-www.stanford.edu/cellcycle/data/rawdata/
church http://arep.med.harvard.edu/mrnadata/expression.html
derisi http://cmgm.stanford.edu/pbrown/explore/additional.html
eisen http://rana.stanford.edu/clustering/
gasch1 http://genome-www.stanford.edu/yeast_stress/data/rawdata/complete_dataset.txt
gasch2 http://genome-www.stanford.edu/Mec1/data/DNAcomplete_dataset/DNAcomplete_dataset.cdt
spo http://cmgm.stanford.edu/pbrown/sporulation/additional/

Homology

The patterns discovered by PolyFARM and used as boolean attributes: hompatterns.gz (378K)
The list of dbref terms used: dbrefnames
The species hierarchy/taxonomy: specieshierarchy
The homology data in a relational format, split into pieces for easier download and maintenance (each piece is about 7-10M in size as a bzipped file):

yeasthomA.tar.bz2
yeasthomB.tar.bz2
yeasthomC.tar.bz2
yeasthomD.tar.bz2
yeasthomE.tar.bz2
yeasthomF.tar.bz2
yeasthomG.tar.bz2
yeasthomH.tar.bz2
yeasthomI.tar.bz2
yeasthomJ.tar.bz2
yeasthomK.tar.bz2
yeasthomL.tar.bz2
yeasthomM.tar.bz2

Associations are constructed from the following set of predicates:

 

Fact                            Description
eval(Orf, SPId, EVal)           The e-value of the similarity between the ORF and the SWISSPROT protein.
yeast_to_yeast(Orf, Orf, EVal)  The e-value between this ORF and another ORF in the yeast genome.
sq_len(SPId, Len)               The sequence length of the SWISSPROT protein.
mol_wt(SPId, MWt)               The molecular weight of the SWISSPROT protein.
classification(SPId, Classfn)   The classification of the organism the SWISSPROT protein belonged to. This is part of a hierarchical species taxonomy. The top level of the hierarchy contains classes such as "bacteria" and "viruses" and the lower levels contain specific organisms such as "escherichia" and "saccharomyces".
keyword(SPId, KWord)            Any keywords listed for the SWISSPROT protein. Only keywords which could be directly ascertained from sequence were used. These were the following: transmembrane, inner_membrane, plasmid, repeat, outer_membrane, membrane.
db_ref(SPId, DBName)            The names of any databases that the SWISSPROT protein had references to. For example: PROSITE, EMBL, FlyBase, PDB.

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not?
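
To make the conversion concrete, the sketch below evaluates one hypothetical association over a handful of toy facts and turns it into a boolean attribute per ORF. The association used ("the ORF has a SWISSPROT hit with e-value at most 1e-8 that carries a given keyword"), the threshold, the SWISSPROT ids and the e-values are all invented for illustration. This is not PolyFARM, which discovers such patterns rather than taking them as input.

```python
# Toy facts in the style of the eval/3 and keyword/2 predicates.
eval_facts = [
    ("yal001c", "p12345", 1e-20),
    ("yal001c", "q99999", 0.5),
    ("yal002w", "q99999", 1e-30),
]
keyword_facts = {
    ("p12345", "transmembrane"),
    ("q99999", "repeat"),
}


def has_association(orf, max_eval, kword):
    """True if some SWISSPROT hit of `orf` within `max_eval` has keyword `kword`."""
    return any(
        o == orf and e <= max_eval and (sp, kword) in keyword_facts
        for o, sp, e in eval_facts
    )


# One boolean attribute: a value for every ORF that appears in the facts.
orfs = sorted({o for o, _, _ in eval_facts})
attribute = {orf: has_association(orf, 1e-8, "transmembrane") for orf in orfs}
print(attribute)
```

Each discovered pattern would give one such column, so an ORF ends up described by a vector of true/false values.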

Boolean (c4.5) data:
The data is available here in C4.5 format. The file suffixes (.names, .train, .valid, .propertest, .unknown) and the level numbers 0-4 have the same meaning as described in the Sequence section above.

hom0.names  -  hom0.train.bz2  -  hom0.valid.bz2  -  hom0.propertest.bz2  -  hom0.unknown.bz2
hom1.names  -  hom1.train.bz2  -  hom1.valid.bz2  -  hom1.propertest.bz2  -  hom1.unknown.bz2
hom2.names  -  hom2.train.bz2  -  hom2.valid.bz2  -  hom2.propertest.bz2  -  hom2.unknown.bz2
hom3.names  -  hom3.train.bz2  -  hom3.valid.bz2  -  hom3.propertest.bz2  -  hom3.unknown.bz2
hom4.names  -  hom4.train.bz2  -  hom4.valid.bz2  -  hom4.propertest.bz2  -  hom4.unknown.bz2

Predicted Secondary Structure

The structure data in a relational format: struct.discretised.kb.gz (2.6M)
The patterns discovered by PolyFARM and used as boolean attributes: structpatterns.gz (133K)

Secondary structure was predicted by the program Prof. Associations are constructed from the following set of predicates:

 

Predicate                 Description
ss(Orf, Num, Type)        This ORF has a secondary structure prediction of type Type (alpha, beta or coil) at relative position Num. For example, ss(yal001c,3,alpha) would mean that the third prediction made for yal001c was alpha.
alpha_len(Num, AlphaLen)  The alpha prediction at position number Num was of length AlphaLen.
beta_len(Num, BetaLen)    The beta prediction at position number Num was of length BetaLen.
coil_len(Num, CoilLen)    The coil prediction at position number Num was of length CoilLen.
alpha_dist(Orf, Percent)  The percentage of alphas for this ORF is Percent.
beta_dist(Orf, Percent)   The percentage of betas for this ORF is Percent.
coil_dist(Orf, Percent)   The percentage of coils for this ORF is Percent.
nss(Num1, Num2, Type)     The prediction at position Num2 is of type Type (we used Num2 = Num1+1, i.e. Num1 and Num2 are neighbouring positions).

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not?
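
As a rough illustration of how facts of this shape relate to a raw prediction, the sketch below turns a per-residue prediction string into ss, length and distribution facts. The one-letter codes (a, b, c), the function name, and the reading of "percentage" as a percentage of residues are assumptions. The numbers are left undiscretised, unlike the published file.

```python
from itertools import groupby

TYPES = {"a": "alpha", "b": "beta", "c": "coil"}  # assumed one-letter codes


def structure_facts(orf, prediction):
    """Turn a per-residue prediction string into ss/len/dist facts."""
    facts = []
    # Each run of identical letters is one prediction, numbered from 1.
    for num, (code, run) in enumerate(groupby(prediction), start=1):
        kind = TYPES[code]
        facts.append(("ss", orf, num, kind))
        facts.append((f"{kind}_len", num, len(list(run))))
    # Assumption: the dist facts are percentages of residues, not of runs.
    for code, kind in TYPES.items():
        percent = 100.0 * prediction.count(code) / len(prediction)
        facts.append((f"{kind}_dist", orf, percent))
    return facts


facts = structure_facts("yal001c", "aaaaccbbbb")
for fact in facts:
    print(fact)
```

Here the third run is beta, so the sketch emits ("ss", "yal001c", 3, "beta") together with ("beta_len", 3, 4).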

Boolean (c4.5) data:
The data is available here in C4.5 format. The file suffixes (.names, .train, .valid, .propertest, .unknown) and the level numbers 0-4 have the same meaning as described in the Sequence section above.

struc0.names  -  struc0.train.bz2  -  struc0.valid.bz2  -  struc0.propertest.bz2  -  struc0.unknown.bz2
struc1.names  -  struc1.train.bz2  -  struc1.valid.bz2  -  struc1.propertest.bz2  -  struc1.unknown.bz2
struc2.names  -  struc2.train.bz2  -  struc2.valid.bz2  -  struc2.propertest.bz2  -  struc2.unknown.bz2
struc3.names  -  struc3.train.bz2  -  struc3.valid.bz2  -  struc3.propertest.bz2  -  struc3.unknown.bz2
struc4.names  -  struc4.train.bz2  -  struc4.valid.bz2  -  struc4.propertest.bz2  -  struc4.unknown.bz2

Predictions

The predictions that were made from this data can be found here