Procedure to automatically interpret trial results for novel biology

A research (positive) hypothesis such as affects_growth(Metabolite,Strain) is confirmed iff the random forest bootstrap discrimination is significant. Otherwise the hypothesis is denied.

A negative hypothesis such as not affects_growth(Metabolite,Strain) is confirmed iff the random forest bootstrap discrimination is significant. Otherwise the hypothesis is denied.

The discrimination is as follows:

Data

Discrimination is done on a per-plate basis. One plate contains 24 replicates each of "wildtype without nutrient", "wildtype with nutrient", "knockout without nutrient" and "knockout with nutrient". 96 wells in total.

The attributes used are the following 13 parameters:

"lag_time", "miy_lag_time", "start_linear", "end_linear", "dur_linear", "linear_slope", "max_od_emg", "max_od_time_emg", "double_time", "global_max_od", "global_max_od_time", "sn_ratio", "threshold"

The values used for each attribute are the differences between the with-nutrient and the without-nutrient values.

We want to discriminate between the knockout differences and the wildtype differences (a binary classification problem).

Each plate has 24 wildtype without nutrient (w_m), 24 wildtype with nutrient (w_m_n), 24 knockout without nutrient (k_m) and 24 knockout with nutrient (k_m_n). In order to create the differences (w_m_n - w_m) and (k_m_n - k_m) we wanted to make the cartesian products of the sets of 24 (for example each w_m value paired with each w_m_n value in turn).

However, in order to have held-out test data that was not contaminated we first have to split the data into training and test set, and then form the cartesian products.

So we take a sample of size n of the w_m's and use these to make the training set. The remainder of the w_m's are used to make the test set. We also take a sample of size n of the w_m_n's for training and the remainder of the w_m_n's are used to make the test set. We take all of the training w_m's and all of the training w_m_n's and form the differences of their cross product.

training_wildtype_items = { y - x | x <- training_w_m_set, y <- training_w_m_n_set }

We also form the cross products of the test sets.

test_wildtype_items = { y - x | x <- test_w_m_set, y <- test_w_m_n_set }

This process is repeated for the knockouts. The samples are random with replacement (sample size n=24) in the case of the bootstrap process, or created from the initial 8 fold partition of wells for the 8 fold cross validation.

Each item in the dataset for input to the decision tree therefore has a class (wildtype or knockout), and has 13 attributes, representing the differences for each of the 13 parameters (lag_time, miy_lag_time, etc) for that pair of wells. For each run of the 8 fold cross validation there will be 21*21=441 training instances and 3*3=9 test instances. For each bootstrap run there will be 24*24=576 training instances representing 576 pairs of wells (perhaps with duplicates from the process of sampling with replacement) and some number of test instances that varies depending on how large the sets of remaining unsampled items were left after the training sample was taken.

Classifier

The classifiers are the decision tree and random forest algorithm as implemented in the Orange datamining libraries for Python. We used orngTree.TreeLearner with parameters binarization=True (to enforce binary trees) and minSubset=5 (to enforce a minimum number of examples per non-null leaf of 5). The attribute selection measure used is the default measure, "gainRatio". The random forests used orngEnsemble.RandomForestLearner with trees=100.

The aim is to discriminate knockout differences from wildtype differences.

Accuracy

Accuracy was measured as (TP+TN)/(TP+TN+FP+FN).

In order to predict accuracy we report figures from both the .632 bootstrap [Efron,1983] (100 bootstrap samples were taken) and 8 fold cross validation.

Code

Code for the discrimination can be found in the following two files:

Significance

How do we know what accuracy is significant? Is for example 70% good enough?

Binomial tests are no good, because our data is not independent (pairs are not independent).

We can use Monte Carlo type experiments to tell us the answer. If we create enough "random" data we can see whether the results we have are significantly different to what we'd expect by chance.

We generate Monte Carlo samples by the following procedure: For each plate:

Repeat 100 times:

sample with replacement any 24 wells without nutrient. Label these "w"
sample with replacement any 24 wells without nutrient. Label these "k"
sample with replacement any 24 wells with nutrient. Label these "wn"
sample with replacement any 24 wells with nutrient. Label these "kn"
Generate and test a random forest from this sample, exactly as before. Record the bootstrap accuracy.

Sort the 100 accuracy results and locate the real accuracy for this plate within the list. Is it in the top 5%?