Pomelo II is a new and much improved incarnation of Pomelo, a tool for finding differentially expressed genes, and genes that are of potential interest because they are related to an outcome of interest (e.g., type of cancer, survival).
The main features that differentiate Pomelo II from its ancestor, Pomelo, are:
So as to control multiple testing effects, we provide FDR-adjusted p-values, which we calculate using the approach of Benjamini & Hochberg (1995; J. Royal Statistical Society B, 57:289-300). FWER-correction is no longer provided. We think that control of the FDR (False Discovery Rate) is probably more relevant for most genomic and proteomics research than control of the FWER (Family-Wise Error Rate); in addition, use of the maxT FWER control procedure requires that the subset pivotality assumption holds, which is not the case with some of the tests available in Pomelo.
Finally, even if the assumptions of the Benjamini and Hochberg's FDR procedure are not satistified, the error is often small (Reiner, A., D. Yekutieli, and Y. Benjamini, 2003, "Identifying differentially expressed genes using false discovery rate controlling procedures", Bioinformatics 19, 368-375) and, overall, this method is competitive to other alternatives such as Benjamini & Yekutieli's or permutation-based approaches.
More extensive details on hypothesis testing and multiple testing correction are provided in the help for Pomelo I.
The all-famous t-test. Used to compare an interval variable (one where distances between levels make sense) between two groups. For example, we could compare gene expression data between two types of patients. We use here the test statistic for the case where we do not assume equal variances in the two groups. We compute the usual t-statistic, but p-values are obtained by permutation testing.
Analysis of variance. We compare between more than two groups the value of an interval data, such as the gene expression among five types of cancer. The statistic is your usual ANOVA F-ratio and p-values are obtained using a permutation test.
Linear regression. We try to predict the (interval scaled) values of a dependent variable based on the values of an independent, interval scaled variable. A typical example is predicting the expression levels of a protein using gene expression data. (Thus, note here that the "independent" variable is gene expression). We compute the usual ordinary least-squares regression coefficient and obtain the p-value using a permutation test.
Non permutation method for contingency tables. It obtains the unadjusted p-value using Fisher's exact test. This test would be the one to use if we have, say, different classes of patients (e.g., six types of cancers) and for a set of 1000 markers we can have either presence/absence of each marker in each patient (this would yield 1000 contingency tables of dimensions 6x2 ---each marker by each cancer type), or we can have several levels of (unordered) expression, say four types of expression (which would yield 1000 6x4 tables).
If the markers have more than two levels of expression and these are ordered (say, from less to more) other tests could be more powerful, such as the Cochran-Armitage test (which we have not implemented). Another alternative with categorical response data are logistic and multinomial models (which we have not implemented).
Please note that FisherIxJ is a test that is rarely used, almost never with gene-expression arrays. We implemented FisherIxJ for some people who were using tissue arrays (that yielded pressence/absence values).
A widely used model for survival data. With survival data we often have censored observations (e.g., a patient that is not yet dead, and all we know is that it lived for at least 100 days after initiation of the treatment).
Currently, the Cox model included here works with interval-scaled covariates; if you pass it a categorical covariate, it can ONLY have two possible values; otherwise, you will get meaningless results (it will be treated as an interval scaled covariate).
The Cox model implemented only works with right-censored observations, not left-censored or doubly-censored ones. As well, it is of course your responsibility to make sure that assumptions about censoring (e.g., lifetimes and censoring times are independent, etc) are reasonable. Statistics and p-values are computed in the "usual way"; the statistic we return is the Wald statistic and the p-value is from a Wald test.
The limma t-test is a non-permutation method used to compare gene expression data between two groups. When performing a test on one particular gene, this test borrows information from all the other genes to obtain better estimates of the variance, using an empirical bayes method.
Please note that all three limma-based tests make several assumptions; most of them are fairly reasonable and seem to hold in practice. However, notice that, for the empirical bayes procedure to work, it is assumed that the error variance of all genes comes from the same distribution; this assumption is clearly violated if the expression levels for different genes are the result of merging (averaging) over different number of spots. For instance, if you plan to use the limma-based tests, you do not want to use merge in preP if the number of replicates per gene (or clone) is different among genes (clones).
This test is the same test as the limma t-test, but it is to be used when samples are paired. If samples are paired, you will want to use a paired t-test (or a linear model, as explained below). Suppose we want to perform a t-test to compare tumor vs non-tumor samples from the same subjects. Each of the samples belonging to the same person should be similar, as they are from the same subject and, of course, the samples are naturally paired (each pair tumor/non-tumor comes from the same subject). By using a paired t-test, we take pairing into account decreasing the variance of our estimates (generally, taking pairing into account leads to smaller estimates of the error variance, and thus larger test statistics).
Pairing comes in many forms and flavors. In general, you have pairing whenever there is a feature of your design or sampling scheme that leads to pairs of your two conditions to share some (generally random) characteristic. This shared characteristic will lead to the pairs showing correlation in expression. For instance, in the two samples from the same subject example, each pair of samples shares that they come from the same subject, where the subject characteristics themselves are not something that interest us per se, but will induce a correlation between the two observations in the same subject. Similarly, if we sample families, and we choose pairs of siblings, one with cancer and one without, we have again pairing (brothers or sisters are likely to resemble each other by virtue of them being, well, brothers/sisters). Another example of pairing would arise if we take samples from two cell lines every hour. If there are any circadian patterns in gene expression, then the two samples (one from each cell line) at each hour will show correlated gene expression; thus, the two samples from every hour are paired, and the data ought to be analyzed with a paired t-test.
Please, see caveat about merging.
Non permutation method used to compare gene expression data between two or more groups. Linear models are a more general type of models that include, as particular cases, ANOVA and multiple regression. We use here the functionality provided by limma, and thus return empirical bayes adjusted estimates (i.e., we use information from all genes to obtain better estimates of variances).
Once the test results have been calculated (i.e., after obtaining the F-statistic and associated p-value for the test of global differences between groups), this method gives us the option to individually compare classes (t-test) and draw venn diagrams with the class comparisons.
Please, see caveat about merging.
With permutation tests we obtain the p-value by comparing the observed statistic (e.g., the observed t-statistic) from our data against a null distribution obtained from the actual data themselves (e.g., we randomly permute the class labels, and compute the t-statistic for the permuted data, and we repeat this, for instance, 50000 times, so that we can obtain the null distribution of the statistic under the hypothesis that the class labels were assigned randomly, regardless of gene expression).
In contrast, when using limma we use the standard distributions for computing the p-values (i.e., we compare the observed t-statistic agains the expected t-statistic under the null hypothesis by comparing with the t-distribution with a given degrees of freedom). As explained above, limma does not directly compute the t-statistic with the usual formula, but rather uses a "moderated statistic", where the estimate of the variance for every gene is "moderated" using information from all other genes. This is done using a specific model for the variances with an empirical bayes procedure as explained in Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3. What you will notice is that extremely small variances are pulled towards the much more frequent larger values and extremely larger variances are pulled towards smaller values. Moderated statistics have been shown repeatedly to work better (in several senses) with microarray data. The differences are most noticeable with small sample sizes.
If you use a permutation test, you need to choose the number of permutations (or use the default of 200000). The actual number of permutations that Pomelo II will use is often slightly larger than this one. Why? Becasue, for computational reasons, each of the slave CPUs (remember that permutation tests are parallelized in Pomelo II) does the same number of tests. Therefore, the number of permutation tests used must be an integer multiple of the number of CPUs that will be used in this particular run.
In more detail: At any time, the number of computing nodes available in our cluster is often between 28 and 30: sometimes a few nodes might not be available because of maintenance, network problems, etc. And each node has four CPUs available so, at any given time, there will generally be between 111 (i.e., 28 * 4 - 1) and 119 (i.e., 30 * 4 - 1) slave CPUs available. Thus, when you request a given number of permutations, say 200000, we find out the closest integer to 200000 that is also larger than 200000 and so that it is a multiple of the available number of slave nodes. In other words, we find the smallest integer no less than 200000 (more precisely, we use the mathematical function "ceiling") that is a multiple of the available number of slave nodes. For instance, I just requested a run of 200000 permutations, but the actual number used was 200039; at that time, 30 nodes were available, meaning Pomelo II could use 119 slave CPUs, and 200039 is the smallest integer no less than 200000 that is a multiple of 119 (200039 = 119 * 1681; but 119 * 1680 = 199920, which is less than 200000).
Number of permutations relative to the number of available number of permutations in the data set. With small sample sizes, the number of possible permutations of the labels is often small. For instance, with two classes, each of 5 subjects, there are only a total of 252 possible permutations of labels. In this case, you can find the "null distribution" by exhaustively enumerating all permutations, and computing the corresponding test statistic. The permutation tests in Pomelo II (and in virtually all permutation-based statistical software), however, are not exhaustive: we use random permutations. We do not attempt to exhaustively list all the possible permutations, but sample (randomly) over the space of all possible permutations. This is the only strategy available with moderate to large sample sizes; for example, with 30 individuals in two groups of 15, there are over 150 million permutations of labels. Listing all possible permutations is not a reasonable strategy, and we will instead use random permutations, for instance taking a random sample of 200000 permutations of the labels. What this means is that, if the number of possible permutations is small (say, 252) and you use 200000 permutations with Pomelo II, you will be repeatedly drawing many identical permutations of the labels. This is not a problem, and you will find that the p-values you observe quickly stabilize as you increase the number of permutations, without having to use very large numbers of permutations. In this case, since you are repeatedly sampling on a set that has a total of 252 events, a moderate number of permutations (say, 2520 permutations) should give you a very good picture of the actual exhaustive list of permutations (you should have drawn, on average, about 10 of each possible permutation); an even larger number of permutations (say 5000 permutations) will give you an even better picture.
Regardless of what is the actual number of possible permutations of your data set, the estimated p-value is a binomial variate, and thus has a variance that is inversely proportional to the number of permutations; using a number of permutations over 200000 ought to provide you with estimates of the p-value that are reasonable precise even in the tails of the distribution, and regardless of whether the actual number of permutations is 252 or several billions. Of course, if you need estimates of the p-values with extremely variance, you should increase the number of permutations.
What Pomelo tries to do is "detect" genes that are different using as a criteria gene expression data difference among two or more classes. Since most likely our data is not homogeneous (homogeneous meaning all subjects of the same age, country, ... ), we may find that the differences due to different classes are "blurred" by other differences. By using additional covariables, we can add this information to our study and thus enable a more complete analysis. (If there is a factor, say age, that accounts for some relevant portion of the variation in the gene expression data, but you don't use it in your analyses, then what you are doing is adding that variation, that is explained by age, to the error term. And, among other consequences, the larger the error term, the smaller the statistic).
Although additional covariables can be very useful, if the covariable you try to add is the same as the class file, pomelo can do nothing with this. Eg. If you are comparing two classes and you add a covariable indicating that all subjects of one class are male and all subjects of the other are female, pomelo has no way of knowing which differences belong to the covariable and which to the class difference. In this case, sex and class are completely confounded. Any differences in expression could be explained either by differences in the class variable you are interested, or by differences in sex.
Additional covariables can only be used with "Anova, linear models (limma)".
The files "Gene expression data" and "Class labels or dependent variable" are required for all tests. Files which are only necessary for certain tests are: "Censored indicator file" (Cox test), "Paired indicator file" (paired t-test) and "Additional covariables file" (optional for anova-limma).
The file for the gene expression should be formated as:
Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever. Of course, if you are using regression or Cox model, your dependent variable must be a number. And if you are using a t-test there can only be two classes.
There should be no empty spaces or trailing tabs at the end of the file. And we read only a single line, so place all labels in a single line. We will silently drop anything beyond the first line!!
This is a simple example of class labels file:
The class labels, as above, can be any arbitrary coding. The values in the gene expression data file should be consecutive integers that start at 0 (i.e., do NOT use negative intergers, or you'll get errors). This is an example file:
As you can see, most of these rows would yield (if we used the first row as the class labels) 3x2 tables or 3x3 tables (e.g., 3rd or 5th rows).
The format should be the same as for class labels, except that we will place the same integer at the places where the two paired subjects are. We can choose any integer we want.
This is a simple example of a paired indicator file:
In the example file, the first and the third subject are paired (integer used 4, although we could have used any other), the second subject is associated to the fifth (integer used 32) and so on.
Each row of the file represents a subject and each column an additional covariable. The format must be as follows: tab (\t) separated fields, a first row with the names of the columns (covariables) and the following rows with each subject's covariable value.
This is a simple example of a additional covariables file:
The order of the subjects must be exactly the same as the order in the expression data file. That means, the covariable values (age, country,...) of the first subject (i.e. row), must belong to the subject in the first column of the expression, etc.
The data can be of two types: numeric (age, weight, ...) or non-numeric (country, ...). When the data is read, you will be taken to a page where you will be able to see if it has been read correctly. Be careful, if your data does not have the correct format, it can be incorrectly read, meaning a non-numeric variable can be read as numeric or vice versa. If this happens all further analyses will be wrong!
Once the program has finished running, you will get a table with the output from the run and a heatmap.
The results table contains a header indicating the test you have used, number of permutations and which covariables where used (if any).
The table shows an index corresponding to the original ordering in the data file, gene names, p-values (undajusted), FDR-adjusted p-values, and statistics (and the absolute value of the statistic). For Fisher's IxJ tests the columns names statistic and abs(statistic) really have no meaning.
The figure is a heatmap where you can filter how and which genes to plot. For now the color scale goes from yellow to blue (green is the mix of yellow and blue); missing values are shown as white.
If you have run an "Anova, linear models (limma)" test, the output will also contain a Class compare section containing a button. By clicking on the button we will be taken to a class compare page, where we will be able to compare classes individually.
The observed test statistics are:
Running times are variable and will depend of course on the size of your data set and the analysis. Running times do depend too on the load of the cluster (e.g., if many other users are running Pomelo with very large data sets). Occasionally, with very large data sets and under heavy load a computation might fail (because one or more nodes run out of memory); Pomelo II detects these problems, and will keep trying to relaunch the computations. The maximum time we allow Pomelo II to run (or try to run) any give computation is 8 hours. If any process lasts longer than 8 hours, it is stopped and an error message indicating that limits is given as final result.
Most computations, however, last a lot less than 8 hours. To give you an idea, these are computing times (i.e., from the time the data arrive at the server until the HTML file with the results is produced) for a few cases. 100000 genes, 500 subjects in two groups of 250 each, using a permutation t-test: 3 hours (this run had Pomelo automatically relaunch itself after an MPI failure related to limited memory). The same data using limma: under 7 minutes. However, for the above two cases the expression data set was over 250 MB; thus, data transfer from the remote location to the server took in excess of 15 minutes. And then, once results were obtained, transferring back the HTML page with the 100000 rows of output took over 5 minutes. A smaller data set with 100000 genes and 100 subjects in two groups of 50 each (a 50 MB file), using a permutation t-test: 30 minutes. With a Cox model, about 12 minutes. These same data using using limma: under 3 minutes. However, as before, a very time consuming step was displaying the actual results (an HTML file with 100000 rows).
Thus, as a rule of thumb, after your data are uploaded to the server (i.e., when you receive the message "This is an autorefreshing page") you should expect to wait between a few minutes and a couple of hours, unless your data set is really huge. But, in the later case, most of your waiting time will be spent uploading the data and receiving back the results.
In this page there are a few examples and data sets to play with Pomelo II.
The code underlying these tests is written in C++ and R. The C++ code was originally written by Ramón Díaz-Uriarte and has been parallelized with LAM/MPI by Edward Morrissey. All of the multiple testing functions have been written from scratch, although some algorithms have been based on Westfall & Young (1993) or the documentation for the "multtest" package for R, some inspiration has been obtained from the above package (and a lot of testing has been done using multtest as a benchmark). Moreover, we have taken C code from R for Fisher's exact tests (fexact.c; the latter is based on Mehta and Patel's algorithm). The code for Cox model is from R, library survival.
All three limma tests have been written using limma R package [Smyth, G. K. (2005). Limma: linear models for microarray data. In: 'Bioinformatics and Computational Biology Solutions using R and Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, pages 397--420].
We have also used the GNU scientific library, GSL.
The code from R and the survival, limma, GDD, and imagemap packages, and the GSL are all free software released under GNU's General Public License. Our C++ is thus also released under the GPL. The CGI code is released under the Affero GPL.
All the code (including revision history) is available from The Launchpad.
This application is running on a cluster of machines using Debian GNU/Linux as operating system, Apache as web server, Linux Virtual Server for web server load-balancing, with heartbeat and drbd for high-availability of both services and storage, and LAM/MPI for parallelization.
Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.
If you use Pomelo II for any publication, we would appreciate if you could let us know and that you cite our program (you know, "credit where credit is due"). Also, please cite it in your papers. The reference for Pomelo II is Morrissey E, Diaz-Uriarte R. 2009. Pomelo II: finding differentially expressed genes. Nucleic Acids Research, 37, No. suppl_2 W581-W586.
Uploaded data set are saved in temporary directories in the server. These data are accessible from the internet until they are erased (five days now). The same as you do, anybody can access those directories, if they known their names, which are made of 20 random digits, so they should not be easy to guess.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encripted at all, thus it is also possible for somebody else to look at your data while you are uploading or dowloading them.
Again, all responsibility for access to your data is yours. We do not offer any warranties that your data will be kept confidential or that nobody will break into our system. We suggest you rename the rows, columns, and other identifiers, so that they are meaningless to anybody but you.
(The manner we specify for attribution is that you write out our names, that you link to our web pages, and that you link to this page itself. Please, give credit where credit is due).
|Send comments to the webmaster. Last rev. 8-April-2009|