POMELO II
(THE RETURN OF THE POMELO)
Help index for Pomelo II




Introduction to Pomelo II


(Or, for old-time Pomelo users: Why you should switch to Pomelo II)

Pomelo II is a new and much improved incarnation of Pomelo, a tool for finding differentially expressed genes, and genes that are of potential interest because they are related to an outcome of interest (e.g., type of cancer, survival).

The main features that differentiate Pomelo II from its ancestor, Pomelo, are:



FDR and adjusted p-values

So as to control multiple testing effects, we provide FDR-adjusted p-values, which we calculate using the approach of Benjamini & Hochberg (1995; J. Royal Statistical Society B, 57:289-300). FWER-correction is no longer provided. We think that control of the FDR (False Discovery Rate) is probably more relevant for most genomic and proteomics research than control of the FWER (Family-Wise Error Rate); in addition, use of the maxT FWER control procedure requires that the subset pivotality assumption holds, which is not the case with some of the tests available in Pomelo.

Finally, even if the assumptions of the Benjamini and Hochberg's FDR procedure are not satistified, the error is often small (Reiner, A., D. Yekutieli, and Y. Benjamini, 2003, "Identifying differentially expressed genes using false discovery rate controlling procedures", Bioinformatics 19, 368-375) and, overall, this method is competitive to other alternatives such as Benjamini & Yekutieli's or permutation-based approaches.

More extensive details on hypothesis testing and multiple testing correction are provided in the help for Pomelo I.



Tests available

t-test (Permutations)

The all-famous t-test. Used to compare an interval variable (one where distances between levels make sense) between two groups. For example, we could compare gene expression data between two types of patients. We use here the test statistic for the case where we do not assume equal variances in the two groups. We compute the usual t-statistic, but p-values are obtained by permutation testing.

Anova (Permutations)

Analysis of variance. We compare between more than two groups the value of an interval data, such as the gene expression among five types of cancer. The statistic is your usual ANOVA F-ratio and p-values are obtained using a permutation test.

Regression (Permutations)

Linear regression. We try to predict the (interval scaled) values of a dependent variable based on the values of an independent, interval scaled variable. A typical example is predicting the expression levels of a protein using gene expression data. (Thus, note here that the "independent" variable is gene expression). We compute the usual ordinary least-squares regression coefficient and obtain the p-value using a permutation test.

FisherIxJ

Non permutation method for contingency tables. It obtains the unadjusted p-value using Fisher's exact test. This test would be the one to use if we have, say, different classes of patients (e.g., six types of cancers) and for a set of 1000 markers we can have either presence/absence of each marker in each patient (this would yield 1000 contingency tables of dimensions 6x2 ---each marker by each cancer type), or we can have several levels of (unordered) expression, say four types of expression (which would yield 1000 6x4 tables).

If the markers have more than two levels of expression and these are ordered (say, from less to more) other tests could be more powerful, such as the Cochran-Armitage test (which we have not implemented). Another alternative with categorical response data are logistic and multinomial models (which we have not implemented).

Please note that FisherIxJ is a test that is rarely used, almost never with gene-expression arrays. We implemented FisherIxJ for some people who were using tissue arrays (that yielded pressence/absence values).

Cox model

A widely used model for survival data. With survival data we often have censored observations (e.g., a patient that is not yet dead, and all we know is that it lived for at least 100 days after initiation of the treatment).

Currently, the Cox model included here works with interval-scaled covariates; if you pass it a categorical covariate, it can ONLY have two possible values; otherwise, you will get meaningless results (it will be treated as an interval scaled covariate).

The Cox model implemented only works with right-censored observations, not left-censored or doubly-censored ones. As well, it is of course your responsibility to make sure that assumptions about censoring (e.g., lifetimes and censoring times are independent, etc) are reasonable. Statistics and p-values are computed in the "usual way"; the statistic we return is the Wald statistic and the p-value is from a Wald test.

t-test (limma)

The limma t-test is a non-permutation method used to compare gene expression data between two groups. When performing a test on one particular gene, this test borrows information from all the other genes to obtain better estimates of the variance, using an empirical bayes method.

Please note that all three limma-based tests make several assumptions; most of them are fairly reasonable and seem to hold in practice. However, notice that, for the empirical bayes procedure to work, it is assumed that the error variance of all genes comes from the same distribution; this assumption is clearly violated if the expression levels for different genes are the result of merging (averaging) over different number of spots. For instance, if you plan to use the limma-based tests, you do not want to use merge in preP if the number of replicates per gene (or clone) is different among genes (clones).

Paired t-test (limma)

This test is the same test as the limma t-test, but it is to be used when samples are paired. If samples are paired, you will want to use a paired t-test (or a linear model, as explained below). Suppose we want to perform a t-test to compare tumor vs non-tumor samples from the same subjects. Each of the samples belonging to the same person should be similar, as they are from the same subject and, of course, the samples are naturally paired (each pair tumor/non-tumor comes from the same subject). By using a paired t-test, we take pairing into account decreasing the variance of our estimates (generally, taking pairing into account leads to smaller estimates of the error variance, and thus larger test statistics).

Pairing comes in many forms and flavors. In general, you have pairing whenever there is a feature of your design or sampling scheme that leads to pairs of your two conditions to share some (generally random) characteristic. This shared characteristic will lead to the pairs showing correlation in expression. For instance, in the two samples from the same subject example, each pair of samples shares that they come from the same subject, where the subject characteristics themselves are not something that interest us per se, but will induce a correlation between the two observations in the same subject. Similarly, if we sample families, and we choose pairs of siblings, one with cancer and one without, we have again pairing (brothers or sisters are likely to resemble each other by virtue of them being, well, brothers/sisters). Another example of pairing would arise if we take samples from two cell lines every hour. If there are any circadian patterns in gene expression, then the two samples (one from each cell line) at each hour will show correlated gene expression; thus, the two samples from every hour are paired, and the data ought to be analyzed with a paired t-test.

Please, see caveat about merging.

Anova, linear models (limma)

Non permutation method used to compare gene expression data between two or more groups. Linear models are a more general type of models that include, as particular cases, ANOVA and multiple regression. We use here the functionality provided by limma, and thus return empirical bayes adjusted estimates (i.e., we use information from all genes to obtain better estimates of variances).

Once the test results have been calculated (i.e., after obtaining the F-statistic and associated p-value for the test of global differences between groups), this method gives us the option to individually compare classes (t-test) and draw venn diagrams with the class comparisons.

Please, see caveat about merging.



Limma tests vs permutation tests

With permutation tests we obtain the p-value by comparing the observed statistic (e.g., the observed t-statistic) from our data against a null distribution obtained from the actual data themselves (e.g., we randomly permute the class labels, and compute the t-statistic for the permuted data, and we repeat this, for instance, 50000 times, so that we can obtain the null distribution of the statistic under the hypothesis that the class labels were assigned randomly, regardless of gene expression).

In contrast, when using limma we use the standard distributions for computing the p-values (i.e., we compare the observed t-statistic agains the expected t-statistic under the null hypothesis by comparing with the t-distribution with a given degrees of freedom). As explained above, limma does not directly compute the t-statistic with the usual formula, but rather uses a "moderated statistic", where the estimate of the variance for every gene is "moderated" using information from all other genes. This is done using a specific model for the variances with an empirical bayes procedure as explained in Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3. What you will notice is that extremely small variances are pulled towards the much more frequent larger values and extremely larger variances are pulled towards smaller values. Moderated statistics have been shown repeatedly to work better (in several senses) with microarray data. The differences are most noticeable with small sample sizes.



Permutation tests: number of permutations

If you use a permutation test, you need to choose the number of permutations (or use the default of 200000). The actual number of permutations that Pomelo II will use is often slightly larger than this one. Why? Becasue, for computational reasons, each of the slave CPUs (remember that permutation tests are parallelized in Pomelo II) does the same number of tests. Therefore, the number of permutation tests used must be an integer multiple of the number of CPUs that will be used in this particular run.

In more detail: At any time, the number of computing nodes available in our cluster is often between 28 and 30: sometimes a few nodes might not be available because of maintenance, network problems, etc. And each node has four CPUs available so, at any given time, there will generally be between 111 (i.e., 28 * 4 - 1) and 119 (i.e., 30 * 4 - 1) slave CPUs available. Thus, when you request a given number of permutations, say 200000, we find out the closest integer to 200000 that is also larger than 200000 and so that it is a multiple of the available number of slave nodes. In other words, we find the smallest integer no less than 200000 (more precisely, we use the mathematical function "ceiling") that is a multiple of the available number of slave nodes. For instance, I just requested a run of 200000 permutations, but the actual number used was 200039; at that time, 30 nodes were available, meaning Pomelo II could use 119 slave CPUs, and 200039 is the smallest integer no less than 200000 that is a multiple of 119 (200039 = 119 * 1681; but 119 * 1680 = 199920, which is less than 200000).

Number of permutations relative to the number of available number of permutations in the data set. With small sample sizes, the number of possible permutations of the labels is often small. For instance, with two classes, each of 5 subjects, there are only a total of 252 possible permutations of labels. In this case, you can find the "null distribution" by exhaustively enumerating all permutations, and computing the corresponding test statistic. The permutation tests in Pomelo II (and in virtually all permutation-based statistical software), however, are not exhaustive: we use random permutations. We do not attempt to exhaustively list all the possible permutations, but sample (randomly) over the space of all possible permutations. This is the only strategy available with moderate to large sample sizes; for example, with 30 individuals in two groups of 15, there are over 150 million permutations of labels. Listing all possible permutations is not a reasonable strategy, and we will instead use random permutations, for instance taking a random sample of 200000 permutations of the labels. What this means is that, if the number of possible permutations is small (say, 252) and you use 200000 permutations with Pomelo II, you will be repeatedly drawing many identical permutations of the labels. This is not a problem, and you will find that the p-values you observe quickly stabilize as you increase the number of permutations, without having to use very large numbers of permutations. In this case, since you are repeatedly sampling on a set that has a total of 252 events, a moderate number of permutations (say, 2520 permutations) should give you a very good picture of the actual exhaustive list of permutations (you should have drawn, on average, about 10 of each possible permutation); an even larger number of permutations (say 5000 permutations) will give you an even better picture.

Regardless of what is the actual number of possible permutations of your data set, the estimated p-value is a binomial variate, and thus has a variance that is inversely proportional to the number of permutations; using a number of permutations over 200000 ought to provide you with estimates of the p-value that are reasonable precise even in the tails of the distribution, and regardless of whether the actual number of permutations is 252 or several billions. Of course, if you need estimates of the p-values with extremely variance, you should increase the number of permutations.



Using additional covariables

What Pomelo tries to do is "detect" genes that are different using as a criteria gene expression data difference among two or more classes. Since most likely our data is not homogeneous (homogeneous meaning all subjects of the same age, country, ... ), we may find that the differences due to different classes are "blurred" by other differences. By using additional covariables, we can add this information to our study and thus enable a more complete analysis. (If there is a factor, say age, that accounts for some relevant portion of the variation in the gene expression data, but you don't use it in your analyses, then what you are doing is adding that variation, that is explained by age, to the error term. And, among other consequences, the larger the error term, the smaller the statistic).

Although additional covariables can be very useful, if the covariable you try to add is the same as the class file, pomelo can do nothing with this. Eg. If you are comparing two classes and you add a covariable indicating that all subjects of one class are male and all subjects of the other are female, pomelo has no way of knowing which differences belong to the covariable and which to the class difference. In this case, sex and class are completely confounded. Any differences in expression could be explained either by differences in the class variable you are interested, or by differences in sex.

Additional covariables can only be used with "Anova, linear models (limma)".



Program input

Necessary data files for each test

The files "Gene expression data" and "Class labels or dependent variable" are required for all tests. Files which are only necessary for certain tests are: "Censored indicator file" (Cox test), "Paired indicator file" (paired t-test) and "Additional covariables file" (optional for anova-limma).

Gene expression data
The file with the expression data (e.g., you microarray data). In this file, rows represent variables (generally genes) and columns are subjects or arrays. We want to find those variables that are most distinctly expressed among groups (e.g., a t-test or Anova) or that are most related to, say, survival (e.g., Cox model).

Class labels or dependent variable
These are generally the class labels (e.g., healthy or affected, or different types of cancer) that group the samples, or the survival times of patients, or another dependent continuous variable (if regression models). In our analyses we want to find which of the genes shows differential expression between the classes given here, or if a given dependent variable (e.g., survival time, expression of certain protein) is significantly related to the expression of some gene.

Censored indicator
For survival data only. An observation is censored if the time of occurrence of the event (usually death) has not yet been observed. We will represent uncensored observations with a 1 (because we have observed the time of death) and censored observations with a 0 (meaning that the survival time given is only a lower bound).

Paired indicator
This file is only necessary for paired t-test. It is used to indicate which subjects are paired. Two subjects are paired if they share some common characteristic that we think may imply similar gene expression traits (e.g., two samples from the same subject would be paired).

Additional covariables
You may use this file if you wish to add more information to your analysis (such as subject age, weight, ...). You will be offered this option after having launched a limma anova test.

Data format for gene expression data

The file for the gene expression should be formated as:

Data format for class labels and censored indicators

Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever. Of course, if you are using regression or Cox model, your dependent variable must be a number. And if you are using a t-test there can only be two classes.

There should be no empty spaces or trailing tabs at the end of the file. And we read only a single line, so place all labels in a single line. We will silently drop anything beyond the first line!!

This is a simple example of class labels file:

CL1CL2CL1CL4CL2CL2 CL1CL4

Data format for Fisher's exact test on contingency tables

The class labels, as above, can be any arbitrary coding. The values in the gene expression data file should be consecutive integers that start at 0 (i.e., do NOT use negative intergers, or you'll get errors). This is an example file:

#genec1c1c2c2c3c3
gene1001110
gene2NA2NA100
gene3NA12NA00
gene4112220
gene5100212
gene6211200
gene7211020
gene82200NANA

As you can see, most of these rows would yield (if we used the first row as the class labels) 3x2 tables or 3x3 tables (e.g., 3rd or 5th rows).

Data format for paired indicator

The format should be the same as for class labels, except that we will place the same integer at the places where the two paired subjects are. We can choose any integer we want.

This is a simple example of a paired indicator file:

  432  4  832   82222

In the example file, the first and the third subject are paired (integer used 4, although we could have used any other), the second subject is associated to the fifth (integer used 32) and so on.

Data format for additional covariables

Each row of the file represents a subject and each column an additional covariable. The format must be as follows: tab (\t) separated fields, a first row with the names of the columns (covariables) and the following rows with each subject's covariable value.

This is a simple example of a additional covariables file:

Age    CountryWeight
20  Spain76  
46  Germany100  
33  England88  
59  Spain79  
61  England65  
24  Spain58  
80  Germany90  
59  England72  

The order of the subjects must be exactly the same as the order in the expression data file. That means, the covariable values (age, country,...) of the first subject (i.e. row), must belong to the subject in the first column of the expression, etc.

The data can be of two types: numeric (age, weight, ...) or non-numeric (country, ...). When the data is read, you will be taken to a page where you will be able to see if it has been read correctly. Be careful, if your data does not have the correct format, it can be incorrectly read, meaning a non-numeric variable can be read as numeric or vice versa. If this happens all further analyses will be wrong!



Output files

Once the program has finished running, you will get a table with the output from the run and a heatmap.

The results table contains a header indicating the test you have used, number of permutations and which covariables where used (if any).

The table shows an index corresponding to the original ordering in the data file, gene names, p-values (undajusted), FDR-adjusted p-values, and statistics (and the absolute value of the statistic). For Fisher's IxJ tests the columns names statistic and abs(statistic) really have no meaning.

The figure is a heatmap where you can filter how and which genes to plot. For now the color scale goes from yellow to blue (green is the mix of yellow and blue); missing values are shown as white.

If you have run an "Anova, linear models (limma)" test, the output will also contain a Class compare section containing a button. By clicking on the button we will be taken to a class compare page, where we will be able to compare classes individually.

The observed test statistics are:

t-test (Permutations)
The difference of means divided by the square root of the sum of the sample variances of each of the means. (So just the usual t for the unequal variance case).
Anova (Permutations)
The usual F-statistic (mean squares model/mean squares error).
Regression (Permutations)
The coefficient divided by its standard error (i.e., the typical t-statistic).
Cox model
The Wald statistic (the estimate of the coefficient divided by its standard error).
t-test (limma)
Moderated t-statistic.
Paired t-test (limma)
Moderated t-statistic for paired subjects.
Anova, linear models (limma)
Moderated F-statistic.


How long does it take?

Running times are variable and will depend of course on the size of your data set and the analysis. Running times do depend too on the load of the cluster (e.g., if many other users are running Pomelo with very large data sets). Occasionally, with very large data sets and under heavy load a computation might fail (because one or more nodes run out of memory); Pomelo II detects these problems, and will keep trying to relaunch the computations. The maximum time we allow Pomelo II to run (or try to run) any give computation is 8 hours. If any process lasts longer than 8 hours, it is stopped and an error message indicating that limits is given as final result.

Most computations, however, last a lot less than 8 hours. To give you an idea, these are computing times (i.e., from the time the data arrive at the server until the HTML file with the results is produced) for a few cases. 100000 genes, 500 subjects in two groups of 250 each, using a permutation t-test: 3 hours (this run had Pomelo automatically relaunch itself after an MPI failure related to limited memory). The same data using limma: under 7 minutes. However, for the above two cases the expression data set was over 250 MB; thus, data transfer from the remote location to the server took in excess of 15 minutes. And then, once results were obtained, transferring back the HTML page with the 100000 rows of output took over 5 minutes. A smaller data set with 100000 genes and 100 subjects in two groups of 50 each (a 50 MB file), using a permutation t-test: 30 minutes. With a Cox model, about 12 minutes. These same data using using limma: under 3 minutes. However, as before, a very time consuming step was displaying the actual results (an HTML file with 100000 rows).

Thus, as a rule of thumb, after your data are uploaded to the server (i.e., when you receive the message "This is an autorefreshing page") you should expect to wait between a few minutes and a couple of hours, unless your data set is really huge. But, in the later case, most of your waiting time will be spent uploading the data and receiving back the results.



Examples and data sets

In this page there are a few examples and data sets to play with Pomelo II.



VIDEO TUTORIALS

It is now possible to view video tutorials of pomelo tests.
You can find the video tutorials here.


Code, authors and acknowledgements

The code underlying these tests is written in C++ and R. The C++ code was originally written by Ramón Díaz-Uriarte and has been parallelized with LAM/MPI by Edward Morrissey. All of the multiple testing functions have been written from scratch, although some algorithms have been based on Westfall & Young (1993) or the documentation for the "multtest" package for R, some inspiration has been obtained from the above package (and a lot of testing has been done using multtest as a benchmark). Moreover, we have taken C code from R for Fisher's exact tests (fexact.c; the latter is based on Mehta and Patel's algorithm). The code for Cox model is from R, library survival.

All three limma tests have been written using limma R package [Smyth, G. K. (2005). Limma: linear models for microarray data. In: 'Bioinformatics and Computational Biology Solutions using R and Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, pages 397--420].

We have also used the GNU scientific library, GSL.

The CGI has been written by Edward Morrissey and Ramón Díaz-Uriarte using Python. The heatmaps use R and the R packages GDD by Simon Urbanek and imagemap by Barry Rowlingson.

The code from R and the survival, limma, GDD, and imagemap packages, and the GSL are all free software released under GNU's General Public License. Our C++ is thus also released under the GPL. The CGI code is released under the Affero GPL.

All the code (including revision history) is available from The Launchpad

.

This application is running on a cluster of machines using Debian GNU/Linux as operating system, Apache as web server, Linux Virtual Server for web server load-balancing, with heartbeat and drbd for high-availability of both services and storage, and LAM/MPI for parallelization.

Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.



Terms of use: NO WARRANTY




Terms of use: citation

If you use Pomelo II for any publication, we would appreciate if you could let us know and that you cite our program (you know, "credit where credit is due"). Also, please cite it in your papers. The reference for Pomelo II is Morrissey E, Diaz-Uriarte R. 2009. Pomelo II: finding differentially expressed genes. Nucleic Acids Research, 37, No. suppl_2 W581-W586.




Privacy and Security

Uploaded data set are saved in temporary directories in the server. These data are accessible from the internet until they are erased (five days now). The same as you do, anybody can access those directories, if they known their names, which are made of 20 random digits, so they should not be easy to guess.

In any case, you should keep in mind that communications between the client (your computer) and the server are not encripted at all, thus it is also possible for somebody else to look at your data while you are uploading or dowloading them.

Again, all responsibility for access to your data is yours. We do not offer any warranties that your data will be kept confidential or that nobody will break into our system. We suggest you rename the rows, columns, and other identifiers, so that they are meaningless to anybody but you.




Copyright and license


This document is copyrighted. Copyright © 2005 to 2014 Ramón Díaz-Uriarte, Edward R. Morrissey. But you can do many things with this document, because of its license.


Creative Commons License
This document is licensed under a Creative Commons License.

(The manner we specify for attribution is that you write out our names, that you link to our web pages, and that you link to this page itself. Please, give credit where credit is due).

 Send comments to the webmaster. Last rev. 8-April-2009