Data for the yeast genome (S. cerevisiae)

This is the data used in experiments to predict the functional class of yeast ORFs. These experiments are reported in

the PhD thesis "Machine learning and data mining for yeast functional genomics", Amanda Clare, UWA, February 2003, pdf
Clare, A. and King, R.D. (2003) Predicting gene function in Saccharomyces cerevisiae. 2nd European Conference on Computational Biology (ECCB '03). (published as a journal supplement in Bioinformatics 19: ii42-ii49).

The predictions that were made from this data can be found at Yeast Preds.

Classes

The classes were taken from the MIPS functional catalog on 24/4/02, and are listed in the file classes.txt. The actual functional assignments we used are in the file yeast_list_full.24.4.02.pl.

Sequence

This data was collected from a variety of sources, including ProtParam and MIPS. The attributes are as follows:

Attribute	Type	Description
aa_rat_X	real	Percentage of amino acid X in the protein
seq_len	integer	Length of the protein sequence
aa_rat_pair_X_Y	real	Percentage of the pair of amino acids X and Y consecutively in the protein
mol_wt	integer	Molecular weight of the protein
theo_pI	real	Theoretical pI (isoelectric point)
atomic_comp_X	real	Atomic composition of X where X is c (carbon), o (oxygen), n (nitrogen), s (sulphur) or h (hydrogen)
aliphatic_index	real	The aliphatic index
hydro	real	Grand average of hydropathicity
strand	'w' or 'c'	The DNA strand on which the ORF lies
position	integer	Number of exons (the number of start positions in the ORF's coordinates list)
cai	real	Codon adaptation index: calculated according to Sharp and Li (Sharp1987)
motifs	integer	Number of motifs: according to PROSITE dictionary release 13 of Nov. 1995 (Bairoch1996)
transmembraneSpans	integer	Number of transmembrane spans: calculation follows Klein et al. (Klein1985) using the ALOM program. P:I threshold value of 0.1 is used for ORF products which have at least only one transmembrane span. P:I threshold value of 0.15 is used for all TM-calculated proteins. (Goffeau1993)
chromosome	1..16, mit	Chromosome number for this ORF
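
As an illustration of the composition attributes in the table above, here is a minimal Python sketch that derives seq_len, aa_rat_X and aa_rat_pair_X_Y from a protein sequence. It is not the code used to build the dataset. The function name, the lower-case attribute suffixes, treating pairs as ordered, and normalising pair counts by the number of adjacent positions (length minus one) are all assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sequence_attributes(seq):
    """Return seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence.

    Assumption: percentages are relative to the sequence length (single
    residues) and to the number of adjacent positions (ordered pairs).
    """
    seq = seq.upper()
    n = len(seq)
    attrs = {"seq_len": n}

    # Percentage of each amino acid in the protein.
    singles = Counter(seq)
    for aa in AMINO_ACIDS:
        attrs[f"aa_rat_{aa.lower()}"] = 100.0 * singles[aa] / n if n else 0.0

    # Percentage of each ordered pair of consecutive amino acids.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    n_pairs = max(n - 1, 1)  # avoid division by zero for very short input
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            key = f"aa_rat_pair_{a.lower()}_{b.lower()}"
            attrs[key] = 100.0 * pairs[a + b] / n_pairs
    return attrs
```

For a toy sequence such as "MKKA", this gives seq_len 4, 50% lysine, and one K-K pair out of three adjacent positions.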

The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS functional classification (1 is the most general, 4 is most specific), and 0 represents function classification at all appropriate levels.

seq0.names - seq0.train.bz2 - seq0.valid.bz2 - seq0.propertest.bz2 - seq0.unknown.bz2

seq1.names - seq1.train.bz2 - seq1.valid.bz2 - seq1.propertest.bz2 - seq1.unknown.bz2

seq2.names - seq2.train.bz2 - seq2.valid.bz2 - seq2.propertest.bz2 - seq2.unknown.bz2

seq3.names - seq3.train.bz2 - seq3.valid.bz2 - seq3.propertest.bz2 - seq3.unknown.bz2

seq4.names - seq4.train.bz2 - seq4.valid.bz2 - seq4.propertest.bz2 - seq4.unknown.bz2

The files can be up to 1.5M each in size.
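
The files listed above can be read with a few lines of Python. The sketch below assumes the conventional C4.5 layout: the first entry of the .names file lists the classes, each later entry has the form "name: type.", "|" starts a comment, and each data row is comma-separated with the class label last. It also assumes that class labels contain no commas, which should be checked against the real files. The function names are made up for this example.

```python
import bz2
import csv


def parse_names(text):
    """Parse the text of a C4.5 .names file into (classes, attributes)."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()  # drop '|' comments
        if line:
            entries.append(line.rstrip("."))  # entries end with a period
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, _, spec = entry.partition(":")
        attributes.append((name.strip(), spec.strip()))
    return classes, attributes


def open_text(path):
    """Open a data file as text, decompressing .bz2 files transparently."""
    if str(path).endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "rt")


def read_rows(lines, attributes):
    """Yield (attribute dict, class label) for each non-empty data row.

    Values are kept as strings; '?' is the usual C4.5 missing-value marker.
    """
    names = [name for name, _ in attributes]
    for row in csv.reader(lines):
        if not row:
            continue
        cells = [cell.strip() for cell in row]
        values, label = cells[:-1], cells[-1].rstrip(".")
        yield dict(zip(names, values)), label
```

A typical use would be to call parse_names on the contents of seq1.names, then pass the handle returned by open_text("seq1.train.bz2") to read_rows together with the parsed attribute list.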

Phenotype

See also this page about learning with phenotype data.

Original sources of data:

Reformatted for C4.5 (large files are bzip2-ed):

level 1: known - unknown - names
level 2: known - unknown - names
level 3: known - unknown - names
level 4: known - unknown - names

Description of attributes (growth media).

Expression

Download data as a single gzipped tar file. See also this page about results from clustering expression data.

Original data sources:

cellcycle	http://genome-www.stanford.edu/cellcycle/data/rawdata/
church	http://arep.med.harvard.edu/mrnadata/expression.html
derisi	http://cmgm.stanford.edu/pbrown/explore/additional.html
eisen	http://rana.stanford.edu/clustering/
gasch1	http://genome-www.stanford.edu/yeast_stress/data/rawdata/complete_dataset.txt
gasch2	http://genome-www.stanford.edu/Mec1/data/DNAcomplete_dataset/DNAcomplete_dataset.cdt
spo	http://cmgm.stanford.edu/pbrown/sporulation/additional/

Homology

The patterns discovered by PolyFARM and used as boolean attributes hompatterns.gz (378K)
The list of dbref terms used dbrefnames
The species hierarchy/taxonomy specieshierarchy
The homology data in a relational format (broken into pieces for easier download/maintenance) - each piece is about 7-10M in size as a bzipped file.

Associations are constructed from the following set of predicates:

Fact	Description
eval(Orf, SPId, EVal)	The e-value of the similarity between the ORF and the SWISSPROT protein
yeast_to_yeast(Orf, Orf, EVal)	The e-value between this ORF and another ORF in the yeast genome.
sq_len(SPId, Len)	The sequence length of the SWISSPROT protein
mol_wt(SPId, MWt)	The molecular weight of the SWISSPROT protein
classification(SPId, Classfn)	The classification of the organism the SWISSPROT protein belonged to. This is part of a hierarchical species taxonomy. The top level of the hierarchy contains classes such as "bacteria" and "viruses" and the lower levels contain specific organisms such as "escherichia" and "saccharomyces".
keyword(SPId, KWord)	Any keywords listed for the SWISSPROT protein. Only keywords which could be directly ascertained from sequence were used. These were the following: transmembrane, inner_membrane, plasmid, repeat, outer_membrane, membrane.
db_ref(SPId, DBName)	The names of any databases that the SWISSPROT protein had references to. For example: PROSITE, EMBL, FlyBase, PDB.

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not?
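
To make the conversion to boolean attributes concrete, here is a deliberately simplified Python sketch. The patterns found by PolyFARM are first-order associations over the predicates above and can be richer than what is shown here. The Pattern type, its fields, the e-value threshold and the SWISSPROT identifiers in the usage note are hypothetical and only serve the illustration.

```python
from typing import NamedTuple, Optional


class Pattern(NamedTuple):
    """Hypothetical association: the ORF is similar to a SWISSPROT protein
    (e-value at most max_eval) that optionally has a keyword and a db_ref."""
    max_eval: float
    keyword: Optional[str] = None
    db_ref: Optional[str] = None


def matches(orf, pattern, facts):
    """Return True if at least one SWISSPROT hit of the ORF satisfies the pattern."""
    for fact_orf, sp_id, e_val in facts["eval"]:
        if fact_orf != orf or e_val > pattern.max_eval:
            continue
        if pattern.keyword is not None and (sp_id, pattern.keyword) not in facts["keyword"]:
            continue
        if pattern.db_ref is not None and (sp_id, pattern.db_ref) not in facts["db_ref"]:
            continue
        return True
    return False


def boolean_row(orf, patterns, facts):
    """One boolean attribute per pattern: does this ORF have the association?"""
    return [matches(orf, pattern, facts) for pattern in patterns]
```

With facts stored as sets of tuples, for example an eval fact ("yal001c", "p11111", 1e-20) and a keyword fact ("p11111", "transmembrane"), the pattern Pattern(1e-10, keyword="transmembrane") yields True for yal001c, while Pattern(1e-10, db_ref="PDB") yields False unless a matching db_ref fact exists.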

Boolean (c4.5) data:
The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS functional classification (1 is the most general, 4 is most specific), and 0 represents function classification at all appropriate levels.

hom0.names - hom0.train.bz2 - hom0.valid.bz2 - hom0.propertest.bz2 - hom0.unknown.bz2

hom1.names - hom1.train.bz2 - hom1.valid.bz2 - hom1.propertest.bz2 - hom1.unknown.bz2

hom2.names - hom2.train.bz2 - hom2.valid.bz2 - hom2.propertest.bz2 - hom2.unknown.bz2

hom3.names - hom3.train.bz2 - hom3.valid.bz2 - hom3.propertest.bz2 - hom3.unknown.bz2

hom4.names - hom4.train.bz2 - hom4.valid.bz2 - hom4.propertest.bz2 - hom4.unknown.bz2

Predicted Secondary Structure

The structure data in a relational format struct.discretised.kb.gz (2.6M)
The patterns discovered by PolyFARM and used as boolean attributes structpatterns.gz (133K)

Secondary structure is predicted by the program Prof. Associations are constructed from the following set of predicates:

Predicate	Description
ss(Orf, Num, Type)	This Orf has a secondary structure prediction of type Type (alpha, beta or coil) at relative position Num. For example, ss(yal001c,3,alpha) would mean that the third prediction made for yal001c was alpha.
alpha_len(Num, AlphaLen)	The alpha prediction at position number Num was of length AlphaLen
beta_len(Num, BetaLen)	The beta prediction at position number Num was of length BetaLen
coil_len(Num, CoilLen)	The coil prediction at position number Num was of length CoilLen
alpha_dist(Orf, Percent)	The percentage of alphas for this ORF is Percent
beta_dist(Orf, Percent)	The percentage of betas for this ORF is Percent
coil_dist(Orf, Percent)	The percentage of coils for this ORF is Percent
nss(Num1, Num2, Type)	The prediction at position Num2 is of type Type (we used Num2 = Num1+1, i.e. Num1 and Num2 are neighbouring positions)
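
The predicates in the table above can be illustrated with a short Python sketch that turns a per-residue prediction into facts. This is not the original extraction code. The input encoding (a string of 'a', 'b' and 'c' for alpha, beta and coil), the function name, and computing the percentages per residue rather than per segment are assumptions. The file name struct.discretised.kb.gz also suggests that the real values were discretised, which this sketch does not do.

```python
from itertools import groupby

TYPES = {"a": "alpha", "b": "beta", "c": "coil"}  # assumed input encoding


def structure_facts(orf, prediction):
    """Turn a per-residue prediction string into predicate tuples."""
    facts = []

    # Collapse runs of identical residues into numbered segments.
    segments = [(TYPES[ch], len(list(run))) for ch, run in groupby(prediction)]
    for num, (kind, length) in enumerate(segments, start=1):
        facts.append(("ss", orf, num, kind))        # ss(Orf, Num, Type)
        facts.append((f"{kind}_len", num, length))  # alpha_len / beta_len / coil_len
        if num > 1:
            facts.append(("nss", num - 1, num, kind))  # neighbouring positions

    # Per-residue percentage of each structure type (an assumption).
    total = len(prediction)
    for ch, kind in TYPES.items():
        percent = 100.0 * prediction.count(ch) / total if total else 0.0
        facts.append((f"{kind}_dist", orf, percent))
    return facts
```

For the toy input structure_facts("yal001c", "aaaabbcc"), the third segment is a coil of length 2, so the result contains ("ss", "yal001c", 3, "coil") and ("coil_len", 3, 2), and the alpha percentage is 50.0.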

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not?

struc0.names - struc0.train.bz2 - struc0.valid.bz2 - struc0.propertest.bz2 - struc0.unknown.bz2

struc1.names - struc1.train.bz2 - struc1.valid.bz2 - struc1.propertest.bz2 - struc1.unknown.bz2

struc2.names - struc2.train.bz2 - struc2.valid.bz2 - struc2.propertest.bz2 - struc2.unknown.bz2

struc3.names - struc3.train.bz2 - struc3.valid.bz2 - struc3.propertest.bz2 - struc3.unknown.bz2

struc4.names - struc4.train.bz2 - struc4.valid.bz2 - struc4.propertest.bz2 - struc4.unknown.bz2

Predictions

The predictions that were made from this data can be found here.