Data for the Arabidopsis genome

This is the data used in experiments to predict the functional class of Arabidopsis ORFs. These experiments are reported in

The predictions that were made from this data can be found here.

Classes

The classes were taken from the MIPS functional catalog on 3/3/04 and GeneOntology on 2/3/04 and are listed as follows:

Sequence

This data was collected from a variety of sources, including ProtParam and MIPS. The attributes are as follows:

Attribute     Type         Description
strand        'w' or 'c'   The DNA strand on which the gene lies
chromo        1,2,3,4,5    The chromosome on which the gene lies
startpos      integer      Start position of the sequence
endpos        integer      End position of the sequence
numpos        integer      Number of exons
numaa         integer      Number of amino acids
mol_wt        integer      Molecular weight of the protein
theo_pI       float        Theoretical pI (isoelectric point)
percentneg    float        Percent of negatively charged residues
percentpos    float        Percent of positively charged residues
carbon        float        Atomic composition of carbon
hydrogen      float        Atomic composition of hydrogen
nitrogen      float        Atomic composition of nitrogen
oxygen        float        Atomic composition of oxygen
sulphur       float        Atomic composition of sulphur
aliphatic     float        The aliphatic index
instability   float        The instability index
gravy         float        Grand average of hydropathicity
X_ratio       float        Percentage of amino acid X in the protein
seq_len       integer      Length of the protein sequence
XYN_ratio     float        Percentage of the pair of amino acids X and Y separated by N-1 amino acids in the protein. That is, XY1 is X and Y adjacent; XY2 is X and Y separated by 1 other amino acid.
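
As an illustration of the X_ratio and XYN_ratio attributes, the following is a minimal Python sketch, not the code that produced the data. The function names are hypothetical, and the denominator used for the pair percentage (the number of positions at which a pair with that gap could start) is an assumption, since the description above does not state it.

```python
from collections import Counter


def residue_ratios(seq):
    """Percentage of each amino acid in seq (one X_ratio value per residue type)."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def pair_ratio(seq, x, y, n):
    """Percentage of positions i with seq[i] == x and seq[i + n] == y.

    n = 1 means x and y are adjacent; n = 2 means one other residue lies
    between them. Dividing by the number of candidate start positions is
    an assumption made for this sketch.
    """
    positions = len(seq) - n
    if positions <= 0:
        return 0.0
    hits = sum(1 for i in range(positions) if seq[i] == x and seq[i + n] == y)
    return 100.0 * hits / positions
```

For the toy sequence "MAAKA", residue_ratios gives 60% for A, and pair_ratio(seq, "A", "A", 1) counts the single adjacent AA pair out of four possible positions.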

The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins), 98,0,0,0 (classification not yet clear cut) or GO:0005554 (molecular_function unknown) or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classifications (1 is the most general, 4 is most specific).
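
A rough Python sketch of how the .names and gzipped data files might be read is given below. It assumes the usual C4.5 layout (class list first in the .names file, then one "name: type." declaration per line, with "|" starting a comment, and "?" marking a missing value in the data). The exact layout of these particular files has not been checked here, and the function names are hypothetical. Because class labels such as 99,0,0,0 contain commas, the sketch takes the first fields of each data line as attribute values and treats whatever remains as the class field.

```python
import gzip


def read_attribute_names(names_path):
    """Return the attribute names declared in a C4.5 .names file.

    Assumes one declaration per line: the class list first, then
    'name: type.' lines. Text after '|' is treated as a comment.
    """
    declarations = []
    with open(names_path) as handle:
        for line in handle:
            line = line.split("|", 1)[0].strip()
            if line:
                declarations.append(line)
    # Skip the first declaration (the class list); keep the text before ':'.
    return [d.split(":", 1)[0].strip() for d in declarations[1:]]


def read_examples(data_path, attribute_names):
    """Yield (attribute dict, class field) pairs from a gzipped data file."""
    n = len(attribute_names)
    with gzip.open(data_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            fields = [f.strip() for f in line.split(",")]
            # '?' is the C4.5 marker for a missing value.
            values = [None if f == "?" else f for f in fields[:n]]
            # Everything after the attribute values is the class field.
            label = ",".join(fields[n:])
            yield dict(zip(attribute_names, values)), label
```

Values are returned as strings; converting them to numbers according to the declared types is left out to keep the sketch short.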

GO including IEA annotations

seq1.names  -  seq1.train.gz  -  seq1.valid.gz  -  seq1.propertest.gz  -  seq1.unknown.gz
seq2.names  -  seq2.train.gz  -  seq2.valid.gz  -  seq2.propertest.gz  -  seq2.unknown.gz
seq3.names  -  seq3.train.gz  -  seq3.valid.gz  -  seq3.propertest.gz  -  seq3.unknown.gz
seq4.names  -  seq4.train.gz  -  seq4.valid.gz  -  seq4.propertest.gz  -  seq4.unknown.gz

GO excluding IEA annotations

seq1.names  -  seq1.train.gz  -  seq1.valid.gz  -  seq1.propertest.gz  -  seq1.unknown.gz
seq2.names  -  seq2.train.gz  -  seq2.valid.gz  -  seq2.propertest.gz  -  seq2.unknown.gz
seq3.names  -  seq3.train.gz  -  seq3.valid.gz  -  seq3.propertest.gz  -  seq3.unknown.gz
seq4.names  -  seq4.train.gz  -  seq4.valid.gz  -  seq4.propertest.gz  -  seq4.unknown.gz

MIPS automatic annotations

seq1.names  -  seq1.train.gz  -  seq1.valid.gz  -  seq1.propertest.gz  -  seq1.unknown.gz
seq2.names  -  seq2.train.gz  -  seq2.valid.gz  -  seq2.propertest.gz  -  seq2.unknown.gz
seq3.names  -  seq3.train.gz  -  seq3.valid.gz  -  seq3.propertest.gz  -  seq3.unknown.gz
seq4.names  -  seq4.train.gz  -  seq4.valid.gz  -  seq4.propertest.gz  -  seq4.unknown.gz

MIPS manual annotations

seq1.names  -  seq1.train.gz  -  seq1.valid.gz  -  seq1.propertest.gz  -  seq1.unknown.gz
seq2.names  -  seq2.train.gz  -  seq2.valid.gz  -  seq2.propertest.gz  -  seq2.unknown.gz
seq3.names  -  seq3.train.gz  -  seq3.valid.gz  -  seq3.propertest.gz  -  seq3.unknown.gz
seq4.names  -  seq4.train.gz  -  seq4.valid.gz  -  seq4.propertest.gz  -  seq4.unknown.gz

Each file can be up to 59 MB in size.

Predicted Secondary Structure

Secondary structure is predicted by Prof. Associations are constructed from the following set of predicates:

Predicate Description
ss(Orf, Num, Type, FollowingNum) This Orf has a secondary structure prediction of type Type (alpha, beta or coil) at relative position Num (position FollowingNum is the next position after this). For example, ss(at1g01010,3,a,4) would mean that the third prediction made for at1g01010 was alpha.
a_len(Num, AlphaLen) The alpha prediction at position number Num was of length AlphaLen
b_len(Num, BetaLen) The beta prediction at position number Num was of length BetaLen
c_len(Num, CoilLen) The coil prediction at position number Num was of length CoilLen
alpha_dist(Orf, Percent) The percentage of alphas for this ORF is Percent
beta_dist(Orf, Percent) The percentage of betas for this ORF is Percent
coil_dist(Orf, Percent) The percentage of coils for this ORF is Percent
notequal(X,Y) Variables X and Y should not unify
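
To make the predicates above concrete, here is a small Python sketch that derives ss, a_len/b_len/c_len and the three _dist facts from a per-residue prediction string. The string encoding ('a', 'b', 'c', one character per residue), the function name, and the choice to compute the percentages over residues rather than over predicted segments are all assumptions made for illustration; they are not taken from the original pipeline.

```python
from itertools import groupby


def structure_facts(orf, prediction):
    """Turn a per-residue string over 'a', 'b', 'c' into predicate-like tuples."""
    facts = []
    # Each run of identical characters is one prediction, numbered from 1.
    for num, (ss_type, run) in enumerate(groupby(prediction), start=1):
        length = len(list(run))
        facts.append(("ss", orf, num, ss_type, num + 1))
        facts.append((ss_type + "_len", num, length))
    total = len(prediction)
    for ss_type, name in (("a", "alpha_dist"), ("b", "beta_dist"), ("c", "coil_dist")):
        # Percentage computed over residues (an assumption of this sketch).
        percent = 100.0 * prediction.count(ss_type) / total if total else 0.0
        facts.append((name, orf, percent))
    return facts
```

For the toy string "ccaaaabb", the second run produces ("ss", orf, 2, "a", 3) and ("a_len", 2, 4).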

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not? The list of associations and their corresponding attribute numbers is given in strucAllnondup.nums.
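
The conversion to boolean attributes can be pictured with the deliberately simplified Python sketch below. Real associations contain variables and need unification to test; here each association is reduced to a collection of ground facts, and an ORF "has" the association when all of those facts hold for it. The data structures and the function name are hypothetical.

```python
def boolean_attributes(orf_facts, associations):
    """Map attribute number -> True/False for a single ORF.

    orf_facts: set of ground facts (tuples) known for the ORF.
    associations: dict of attribute number -> collection of ground facts.
    Treating associations as ground fact sets is a simplification.
    """
    return {
        number: all(fact in orf_facts for fact in required)
        for number, required in associations.items()
    }
```

Each ORF then becomes one row of True/False values, ordered by attribute number.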

Boolean (c4.5) data:
The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classification (1 is the most general, 4 is most specific).
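
One way to read the level numbering and the rule for .unknown files is sketched below in Python for MIPS-style classes. It assumes a class is held as a 4-tuple such as (1, 2, 0, 0), that a zero at a given level means no function is listed at that level, and that categories 98 and 99 are identified by their first component. These representation choices and the function names are assumptions for illustration only.

```python
UNKNOWN_TOP_LEVEL = {98, 99}  # classification not clear cut / unclassified


def class_at_level(mips_class, level):
    """Truncate a 4-tuple class to the given level (1..4).

    Returns None when the class has no entry at that level.
    """
    if mips_class[level - 1] == 0:
        return None
    return mips_class[:level] + (0,) * (4 - level)


def goes_in_unknown(mips_classes, level):
    """True if none of the ORF's classes gives a usable label at this level."""
    usable = [
        class_at_level(c, level)
        for c in mips_classes
        if c[0] not in UNKNOWN_TOP_LEVEL
    ]
    return all(label is None for label in usable)
```

Under these assumptions an ORF annotated only as (1, 2, 0, 0) has the label (1, 0, 0, 0) at level 1, and would fall into the .unknown file at levels 3 and 4.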

GO including IEA annotations

struc1.names  -  struc1.train.gz  -  struc1.valid.gz  -  struc1.propertest.gz  -  struc1.unknown.gz
struc2.names  -  struc2.train.gz  -  struc2.valid.gz  -  struc2.propertest.gz  -  struc2.unknown.gz
struc3.names  -  struc3.train.gz  -  struc3.valid.gz  -  struc3.propertest.gz  -  struc3.unknown.gz
struc4.names  -  struc4.train.gz  -  struc4.valid.gz  -  struc4.propertest.gz  -  struc4.unknown.gz

GO excluding IEA annotations

struc1.names  -  struc1.train.gz  -  struc1.valid.gz  -  struc1.propertest.gz  -  struc1.unknown.gz
struc2.names  -  struc2.train.gz  -  struc2.valid.gz  -  struc2.propertest.gz  -  struc2.unknown.gz
struc3.names  -  struc3.train.gz  -  struc3.valid.gz  -  struc3.propertest.gz  -  struc3.unknown.gz
struc4.names  -  struc4.train.gz  -  struc4.valid.gz  -  struc4.propertest.gz  -  struc4.unknown.gz

MIPS automatic annotations

struc1.names  -  struc1.train.gz  -  struc1.valid.gz  -  struc1.propertest.gz  -  struc1.unknown.gz
struc2.names  -  struc2.train.gz  -  struc2.valid.gz  -  struc2.propertest.gz  -  struc2.unknown.gz
struc3.names  -  struc3.train.gz  -  struc3.valid.gz  -  struc3.propertest.gz  -  struc3.unknown.gz
struc4.names  -  struc4.train.gz  -  struc4.valid.gz  -  struc4.propertest.gz  -  struc4.unknown.gz

MIPS manual annotations

struc1.names  -  struc1.train.gz  -  struc1.valid.gz  -  struc1.propertest.gz  -  struc1.unknown.gz
struc2.names  -  struc2.train.gz  -  struc2.valid.gz  -  struc2.propertest.gz  -  struc2.unknown.gz
struc3.names  -  struc3.train.gz  -  struc3.valid.gz  -  struc3.propertest.gz  -  struc3.unknown.gz
struc4.names  -  struc4.train.gz  -  struc4.valid.gz  -  struc4.propertest.gz  -  struc4.unknown.gz

Sequence similarity (Homology)

Sequence similarity (usually implying homology) is detected by a PSI-BLAST (blastpgp) search against NRDB. Associations are constructed from the following set of predicates:

Predicate Description
eval(Orf, SPID, EVal) Orf is similar to the SwissProt protein with accession SPID, with e-value EVal.
desc(SPID,X) SwissProt protein SPID had description word X
db_ref(SPID,X) SwissProt protein SPID had a database reference to the X database
keyword(SPID,X) SwissProt protein SPID had keyword X
species(SPID,X) SwissProt protein SPID belonged to species X
classification(SPID,X) SwissProt protein SPID belonged to classification X in the species taxonomy
sq_len(SPID,X) SwissProt protein SPID had sequence length X
mol_wt(SPID,X) SwissProt protein SPID had molecular weight X

Each association is converted into a boolean attribute representing the answer to: does this ORF have this association or not? The list of associations and their corresponding attribute numbers is given in homAllnondup.nums.

Boolean (c4.5) data:
The data is available here in C4.5 format. .names files give attribute names. .train, .valid and .propertest are training, validation and propertest data respectively. .unknown is data about ORFs whose functions were listed in categories 99,0,0,0 (unclassified proteins) or 98,0,0,0 (classification not yet clear cut), or those that had no function listed at this level in the hierarchy. The numbers 1-4 correspond to levels in the MIPS/GO functional classification (1 is the most general, 4 is most specific).

GO including IEA annotations

hom1.names  -  hom1.train.gz  -  hom1.valid.gz  -  hom1.propertest.gz  -  hom1.unknown.gz
hom2.names  -  hom2.train.gz  -  hom2.valid.gz  -  hom2.propertest.gz  -  hom2.unknown.gz
hom3.names  -  hom3.train.gz  -  hom3.valid.gz  -  hom3.propertest.gz  -  hom3.unknown.gz
hom4.names  -  hom4.train.gz  -  hom4.valid.gz  -  hom4.propertest.gz  -  hom4.unknown.gz

GO excluding IEA annotations

hom1.names  -  hom1.train.gz  -  hom1.valid.gz  -  hom1.propertest.gz  -  hom1.unknown.gz
hom2.names  -  hom2.train.gz  -  hom2.valid.gz  -  hom2.propertest.gz  -  hom2.unknown.gz
hom3.names  -  hom3.train.gz  -  hom3.valid.gz  -  hom3.propertest.gz  -  hom3.unknown.gz
hom4.names  -  hom4.train.gz  -  hom4.valid.gz  -  hom4.propertest.gz  -  hom4.unknown.gz

MIPS automatic annotations

hom1.names  -  hom1.train.gz  -  hom1.valid.gz  -  hom1.propertest.gz  -  hom1.unknown.gz
hom2.names  -  hom2.train.gz  -  hom2.valid.gz  -  hom2.propertest.gz  -  hom2.unknown.gz
hom3.names  -  hom3.train.gz  -  hom3.valid.gz  -  hom3.propertest.gz  -  hom3.unknown.gz
hom4.names  -  hom4.train.gz  -  hom4.valid.gz  -  hom4.propertest.gz  -  hom4.unknown.gz

MIPS manual annotations

hom1.names  -  hom1.train.gz  -  hom1.valid.gz  -  hom1.propertest.gz  -  hom1.unknown.gz
hom2.names  -  hom2.train.gz  -  hom2.valid.gz  -  hom2.propertest.gz  -  hom2.unknown.gz
hom3.names  -  hom3.train.gz  -  hom3.valid.gz  -  hom3.propertest.gz  -  hom3.unknown.gz
hom4.names  -  hom4.train.gz  -  hom4.valid.gz  -  hom4.propertest.gz  -  hom4.unknown.gz

SCOP

SCOP superfamily predictions, as made by the Superfamily server. Attributes are the classes in the SCOP hierarchy. Values are the e-values of a match to that family. Values of 10 are recorded where there is no match.
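
The encoding described above, with 10 standing in for "no match", could look like the following Python sketch. The input format (a list of (superfamily, e-value) hits per ORF), the function name, and keeping the smallest e-value when a superfamily is hit more than once are assumptions for illustration.

```python
NO_MATCH = 10.0  # value recorded where there is no match


def scop_attributes(superfamilies, hits):
    """Return one e-value per superfamily, in the order given.

    hits: iterable of (superfamily, e_value) pairs for a single ORF.
    Keeping the best (smallest) e-value per superfamily is an assumption.
    """
    best = {}
    for superfamily, e_value in hits:
        if superfamily not in best or e_value < best[superfamily]:
            best[superfamily] = e_value
    return [best.get(sf, NO_MATCH) for sf in superfamilies]
```

An ORF with no hits at all would therefore get a row consisting entirely of 10s.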

GO including IEA annotations

scop1.names  -  scop1.train.gz  -  scop1.valid.gz  -  scop1.propertest.gz  -  scop1.unknown.gz
scop2.names  -  scop2.train.gz  -  scop2.valid.gz  -  scop2.propertest.gz  -  scop2.unknown.gz
scop3.names  -  scop3.train.gz  -  scop3.valid.gz  -  scop3.propertest.gz  -  scop3.unknown.gz
scop4.names  -  scop4.train.gz  -  scop4.valid.gz  -  scop4.propertest.gz  -  scop4.unknown.gz

GO excluding IEA annotations

scop1.names  -  scop1.train.gz  -  scop1.valid.gz  -  scop1.propertest.gz  -  scop1.unknown.gz
scop2.names  -  scop2.train.gz  -  scop2.valid.gz  -  scop2.propertest.gz  -  scop2.unknown.gz
scop3.names  -  scop3.train.gz  -  scop3.valid.gz  -  scop3.propertest.gz  -  scop3.unknown.gz
scop4.names  -  scop4.train.gz  -  scop4.valid.gz  -  scop4.propertest.gz  -  scop4.unknown.gz

MIPS automatic annotations

scop1.names  -  scop1.train.gz  -  scop1.valid.gz  -  scop1.propertest.gz  -  scop1.unknown.gz
scop2.names  -  scop2.train.gz  -  scop2.valid.gz  -  scop2.propertest.gz  -  scop2.unknown.gz
scop3.names  -  scop3.train.gz  -  scop3.valid.gz  -  scop3.propertest.gz  -  scop3.unknown.gz
scop4.names  -  scop4.train.gz  -  scop4.valid.gz  -  scop4.propertest.gz  -  scop4.unknown.gz

MIPS manual annotations

scop1.names  -  scop1.train.gz  -  scop1.valid.gz  -  scop1.propertest.gz  -  scop1.unknown.gz
scop2.names  -  scop2.train.gz  -  scop2.valid.gz  -  scop2.propertest.gz  -  scop2.unknown.gz
scop3.names  -  scop3.train.gz  -  scop3.valid.gz  -  scop3.propertest.gz  -  scop3.unknown.gz
scop4.names  -  scop4.train.gz  -  scop4.valid.gz  -  scop4.propertest.gz  -  scop4.unknown.gz

InterPro

This data was derived using InterProScan.

The mapping from associations to the corresponding attribute numbers is in interprotrain.out.s100.d4.namesmap.

GO including IEA annotations

interpro1.names  -  interpro1.train.gz  -  interpro1.valid.gz  -  interpro1.propertest.gz  -  interpro1.unknown.gz
interpro2.names  -  interpro2.train.gz  -  interpro2.valid.gz  -  interpro2.propertest.gz  -  interpro2.unknown.gz
interpro3.names  -  interpro3.train.gz  -  interpro3.valid.gz  -  interpro3.propertest.gz  -  interpro3.unknown.gz
interpro4.names  -  interpro4.train.gz  -  interpro4.valid.gz  -  interpro4.propertest.gz  -  interpro4.unknown.gz

GO excluding IEA annotations

interpro1.names  -  interpro1.train.gz  -  interpro1.valid.gz  -  interpro1.propertest.gz  -  interpro1.unknown.gz
interpro2.names  -  interpro2.train.gz  -  interpro2.valid.gz  -  interpro2.propertest.gz  -  interpro2.unknown.gz
interpro3.names  -  interpro3.train.gz  -  interpro3.valid.gz  -  interpro3.propertest.gz  -  interpro3.unknown.gz
interpro4.names  -  interpro4.train.gz  -  interpro4.valid.gz  -  interpro4.propertest.gz  -  interpro4.unknown.gz

MIPS automatic annotations

interpro1.names  -  interpro1.train.gz  -  interpro1.valid.gz  -  interpro1.propertest.gz  -  interpro1.unknown.gz
interpro2.names  -  interpro2.train.gz  -  interpro2.valid.gz  -  interpro2.propertest.gz  -  interpro2.unknown.gz
interpro3.names  -  interpro3.train.gz  -  interpro3.valid.gz  -  interpro3.propertest.gz  -  interpro3.unknown.gz
interpro4.names  -  interpro4.train.gz  -  interpro4.valid.gz  -  interpro4.propertest.gz  -  interpro4.unknown.gz

MIPS manual annotations

interpro1.names  -  interpro1.train.gz  -  interpro1.valid.gz  -  interpro1.propertest.gz  -  interpro1.unknown.gz
interpro2.names  -  interpro2.train.gz  -  interpro2.valid.gz  -  interpro2.propertest.gz  -  interpro2.unknown.gz
interpro3.names  -  interpro3.train.gz  -  interpro3.valid.gz  -  interpro3.propertest.gz  -  interpro3.unknown.gz
interpro4.names  -  interpro4.train.gz  -  interpro4.valid.gz  -  interpro4.propertest.gz  -  interpro4.unknown.gz

Expression

This is a subset of the microarray data from NASC: the results of 43 experiments from cds between Dec 2002 and Jan 2004, using signal, detection call and detection P-values.

GO including IEA annotations

exprindiv1.names  -  exprindiv1.train.gz  -  exprindiv1.valid.gz  -  exprindiv1.propertest.gz  -  exprindiv1.unknown.gz
exprindiv2.names  -  exprindiv2.train.gz  -  exprindiv2.valid.gz  -  exprindiv2.propertest.gz  -  exprindiv2.unknown.gz
exprindiv3.names  -  exprindiv3.train.gz  -  exprindiv3.valid.gz  -  exprindiv3.propertest.gz  -  exprindiv3.unknown.gz
exprindiv4.names  -  exprindiv4.train.gz  -  exprindiv4.valid.gz  -  exprindiv4.propertest.gz  -  exprindiv4.unknown.gz

GO excluding IEA annotations

exprindiv1.names  -  exprindiv1.train.gz  -  exprindiv1.valid.gz  -  exprindiv1.propertest.gz  -  exprindiv1.unknown.gz
exprindiv2.names  -  exprindiv2.train.gz  -  exprindiv2.valid.gz  -  exprindiv2.propertest.gz  -  exprindiv2.unknown.gz
exprindiv3.names  -  exprindiv3.train.gz  -  exprindiv3.valid.gz  -  exprindiv3.propertest.gz  -  exprindiv3.unknown.gz
exprindiv4.names  -  exprindiv4.train.gz  -  exprindiv4.valid.gz  -  exprindiv4.propertest.gz  -  exprindiv4.unknown.gz

MIPS automatic annotations

exprindiv1.names  -  exprindiv1.train.gz  -  exprindiv1.valid.gz  -  exprindiv1.propertest.gz  -  exprindiv1.unknown.gz
exprindiv2.names  -  exprindiv2.train.gz  -  exprindiv2.valid.gz  -  exprindiv2.propertest.gz  -  exprindiv2.unknown.gz
exprindiv3.names  -  exprindiv3.train.gz  -  exprindiv3.valid.gz  -  exprindiv3.propertest.gz  -  exprindiv3.unknown.gz
exprindiv4.names  -  exprindiv4.train.gz  -  exprindiv4.valid.gz  -  exprindiv4.propertest.gz  -  exprindiv4.unknown.gz

MIPS manual annotations

exprindiv1.names  -  exprindiv1.train.gz  -  exprindiv1.valid.gz  -  exprindiv1.propertest.gz  -  exprindiv1.unknown.gz
exprindiv2.names  -  exprindiv2.train.gz  -  exprindiv2.valid.gz  -  exprindiv2.propertest.gz  -  exprindiv2.unknown.gz
exprindiv3.names  -  exprindiv3.train.gz  -  exprindiv3.valid.gz  -  exprindiv3.propertest.gz  -  exprindiv3.unknown.gz
exprindiv4.names  -  exprindiv4.train.gz  -  exprindiv4.valid.gz  -  exprindiv4.propertest.gz  -  exprindiv4.unknown.gz

Composite

To come.