Wednesday, May 13, 2015

Finding differentially expressed genes in MIV - Choosing a cutoff

Background

I'm trying to find which genes get turned on/off in MIV and to do this, I need to determine some way of classifying genes as either differentially expressed or not differentially expressed.

When using the edgeR or DESeq2 packages to compare two treatments, the software returns two important values: the log-fold change and the adjusted p-value (or false discovery rate, FDR). The log fold change tells you how differently expressed a gene is between two treatment. For example, if gene X is expressed twice as much in treatment B than it is in treatment A then the logFC would be log_2(2) = 1 if we were to compare B to A. Since treatments have multiple replicates, there is some amount of error associated with the logFC reported.

The packages compute a p-value corresponding to the question "what is the probability that the true logFC is 0?". Since this test is being done once for every gene, the problem of multiple comparisons arises. To accommodate this, the packages compute the FDR corrected p-values and this is what is used to determine significance.

For the data shown in this post, I used DESeq2 only. The DESeq2 workflow tutorial suggests using an adjusted p-value (padj) cutoff of 0.1, meaning any gene with a padj <0.1 should be considered significant. But this value only offers insight on statistical significance. Consider the case where gene X is consistently up 1.1 fold between treatments B and A. The padj for gene X may be extremely low if the replicates are very consistent, but is a 1.1 fold change biologically significant?

The problem

The issue I'm working on is deciding what genes are affected by transfer into MIV. If we had two treatments, say in sBHI and in MIV, the task would be relatively easy. I could just take all genes that pass a certain cut-off and say they are significant. One large complicating factor is that we have MIV timecourse data.

For example, I can take KW20 MIV data and compare timepoint M0 to M1, M2 and M3. When I do this and use a cut-off of padj <0.1 and |logFC| > 1, I get 371 differentially expressed genes comparing M0 to M1, 501 for M0/M2 and 479 for M0/M3. If I pool all these genes together and consider them all significantly different, I get almost 700 genes. Here is a visual:


































This multiplot takes the ~700 significant genes and groups them by which timepoints were significantly. For example, 133 genes showed a significant increase at timepoints M1,  M2 and M3 compared to M0 (bottom right). This is where most of the competence genes cluster.

To take any gene that is significantly different in at least one timepoint as important is probably not a good idea. There are likely many false positives in this set. But how many? and how can I reduce them? I thought that identifying a more stringent cutoff could work

Fortunately, the microarray paper our lab has published includes a list of genes changed in MIV as identified by the microarray data. As a quick test, I took the list of genes upregulated and used them to run a test: I wanted to see how many of these genes could be identified by a differential expression test given different cut-offs. Here is the result:


























I ran 3 differential expression tests: M0/M1, M0/M2, M0/M3. I then pooled together all the genes considered significant in at least one timepoint; this is plotted as the red line. Each plot shows a different padj used as a cut-off value. As a reminder, DESeq recommends 0.1. The x-axis shows the logFC value used as a cut-off. The black line is the number of genes that were identified as significant that also show up in the list from the microarray paper (n=138).

For instance, there appears to be 46 genes that are upregulated by at least 2^4 = 16 fold in at least one timepoint and have a padj < 1e-10. 43 of these were reported in the microarray paper.

The optimal cut-off values would lower the red line (many of which may not be truly significant or false positives) as much as possible and keep the black line as close to 138 as possible. For example, using a padj cut-off of 0.01 and a logFC cut-off of 1 would reduce the number of candidate genes to 341 and only miss 8 of the genes reported in the microarray paper (which, for all I know may be false positives reported by the microarray paper).

Solution?

Still working on it. It looks like taking a logFC cut-off of 1 to 2 (2 to 4 fold changes) is a good idea and varying the padj cut-off depending on how conservative I want to be. I think tomorrow I'll try to run a differential expression test that uses all timepoints at the same time to see if that works better.

No comments: