University of Wales, Aberystwyth
Computational Biology Group.
Department of Computer Science, Aberystwyth SY23 3DB, Wales, UK.

Glossary of terms used in rules

This is a glossary of the terms used in the rules generated for "Predicting Protein Function from Sequence using Machine Learning".

hom(A) refers to a homologous protein found by PSI-BLAST.

keyword(A, Word) refers to a SwissProt keyword found in A.

classification(A, Class) refers to the phylogenetic classification of the organism A came from, taken from SwissProt.

species(A, Species) refers to the species of A, taken from SwissProt.

mol_wt_rule(A, Weight) refers to the molecular weight of A: 1 very low, 2 low, 3 medium, 4 high, and 5 very high.

amino_acid_ratio_rule(Residue, Weight) refers to the percentage composition of the residue in the sequence.

e_val_rule(A, Weight) refers to the PSI-BLAST sequence similarity measure (note that a low value means a high sequence similarity).

e_val_gt, e_val_lteq refer to the PSI-BLAST sequence similarity measure being greater than, or less than or equal to, a certain value.

mol_wt_gt(A, Weight), mol_wt_lteq(A, Weight) refer to the molecular weight of A being greater than, or less than or equal to, some value.

amino_acid_pairs_wg (and others similar) refers to the number of pairs of these two amino acids, in this case tryptophan (w) and glycine (g).

amino_acid_pair_ratio_qh (and others similar) refers to the ratio of one amino acid to another in the ORF, in this case the ratio of glutamine (q) to histidine (h). The value is expressed per thousand rather than per hundred, so it is not a percentage: for example, 2.8 means 0.28%.

amino_acid_ratio_g (and others similar) refers to the percentage composition of the residue in the sequence of the ORF, in this case the percentage of glycine (g).

psi_iter_gt, psi_iter_lteq refer to the number of iterations of the PSI-BLAST search being greater than, or less than or equal to, some number.

ecoli_theo_pI refers to the ORF's theoretical pI value

ss(SS, X) * The ORF has a secondary structure prediction at position SS of a certain type X (either alpha helix, beta strand, or coil).

nss(SS1, SS2, X) * The ORF has a secondary structure prediction at position SS1 and position SS2 of a certain type X (either alpha helix, beta strand, or coil).

ss_alpha(SS, gt, B) * The ORF has an alpha helix secondary structure prediction at position SS with a residue length greater than B.

ss_beta(SS, gt, B) * The ORF has a beta strand secondary structure prediction at position SS with a residue length greater than B.

ss_coil(SS, gt, B) * The ORF has a coil secondary structure prediction at position SS with a residue length greater than B.

nss_alpha(SS1, SS2, gt, B) * The ORF's SS1th and SS2th (where SS2 = SS1 + 2) alpha helix predictions have a residue length greater than B (similarly with lteq instead of gt).

nss_beta(SS1, SS2, gt, B) * The ORF's SS1th and SS2th (where SS2 = SS1 + 2) beta strand predictions have a residue length greater than B (similarly with lteq instead of gt).

nss_coil(SS1, SS2, gt, B) * The ORF's SS1th and SS2th (where SS2 = SS1 + 2) coil predictions have a residue length greater than B (similarly with lteq instead of gt).

ecoli_aliphatic_index refers to the ORF's aliphatic index

ecoli_atomic_comp_s refers to the ORF's atomic composition of sulphur (or carbon, nitrogen, hydrogen, oxygen if _s is replaced by _c, _n, _h, _o respectively)
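
The composition-based attributes above can be illustrated with a short Python sketch. It is not the code used to generate the rules: the function names are hypothetical, the toy sequence is made up, and the aliphatic index follows the standard Ikai definition (mole percentages of alanine, valine, isoleucine and leucine weighted by 1, 2.9, 3.9 and 3.9), which is an assumption, since this glossary does not say how ecoli_aliphatic_index was computed.

```python
def residue_percentage(sequence, residue):
    """Percentage of the sequence made up of one residue (one-letter code),
    in the spirit of amino_acid_ratio_g."""
    seq = sequence.lower()
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * seq.count(residue.lower()) / len(seq)


def per_thousand_to_percent(value):
    """Convert a per-thousand value to a percentage.

    The glossary states that 2.8 (per thousand) means 0.28%.
    """
    return value / 10.0


def aliphatic_index(sequence):
    """Aliphatic index, ASSUMING the standard Ikai formula:
    X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    where X is the mole percentage of each residue."""
    return (
        residue_percentage(sequence, "a")
        + 2.9 * residue_percentage(sequence, "v")
        + 3.9 * (residue_percentage(sequence, "i")
                 + residue_percentage(sequence, "l"))
    )


toy = "ggavil"  # made-up sequence, for illustration only
print(f"glycine: {residue_percentage(toy, 'g'):.2f}%")
print(f"2.8 per thousand = {per_thousand_to_percent(2.8):.2f}%")
print(f"aliphatic index: {aliphatic_index(toy):.2f}")
```

Sequences are lower-cased first so that the one-letter codes match the lower-case convention used in the attribute names.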

(*) Note about secondary structure attributes:

Positions in this text refer to the order in which segments of each type occur in the predicted secondary structure. If, for example, an ORF has the following predicted secondary structure:

aaaabbbbbbaaacccccccbbbaaaaa

it would translate into

the 1st alpha helix secondary structure prediction is of length 4.
the 1st beta strand secondary structure prediction is of length 6.
the 2nd alpha helix secondary structure prediction is of length 3.
the 1st coil secondary structure prediction is of length 7.
the 2nd beta strand secondary structure prediction is of length 3.
the 3rd alpha helix secondary structure prediction is of length 5.

(where length is the number of residues)
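
The position numbering can be reproduced with a short Python sketch. It assumes, as the example string suggests, that a, b and c stand for alpha helix, beta strand and coil; the function names are hypothetical and are not part of the rule language itself.

```python
from itertools import groupby


def segment_lengths(prediction):
    """Map each structure type to the lengths of its segments, in order.

    The 1st entry for a type is its 1st segment, the 2nd entry its 2nd, etc.
    """
    lengths = {}
    for sstype, run in groupby(prediction):
        lengths.setdefault(sstype, []).append(len(list(run)))
    return lengths


def segment_longer_than(prediction, sstype, position, bound):
    """True if the position-th (1-based) segment of sstype exists and has
    more than `bound` residues, in the spirit of ss_alpha(SS, gt, B)."""
    runs = segment_lengths(prediction).get(sstype, [])
    return 1 <= position <= len(runs) and runs[position - 1] > bound


example = "aaaabbbbbbaaacccccccbbbaaaaa"
print(segment_lengths(example))
# {'a': [4, 3, 5], 'b': [6, 3], 'c': [7]}

# Is the 2nd alpha helix longer than 2 residues?
print(segment_longer_than(example, "a", 2, 2))  # True
```

Grouping consecutive identical characters gives one run per predicted segment, so counting runs separately for each type yields the "1st alpha helix", "2nd beta strand" numbering listed above.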

Amino acids

Alanine	a
Arginine	r
Asparagine	n
Aspartic acid	d
Cysteine	c
Glutamine	q
Glutamic acid	e
Glycine	g
Histidine	h
Isoleucine	i
Leucine	l
Lysine	k
Methionine	m
Phenylalanine	f
Proline	p
Serine	s
Threonine	t
Tryptophan	w
Tyrosine	y
Valine	v
Aspartic acid/Asparagine	b
Glutamine/Glutamic acid	z
residue that was passed through a low complexity filter	x

Enquiries, contact Dr. Ross King.

Updated: 14 March 2000
