Wednesday, February 18, 2015

Exploration of the RNA-Seq data using DESeq2

First steps in DESeq analysis

The analysis presented here was done by following the Bioconductor document RNA-Seq workflow: gene-level exploratory analysis and differential expression. The example in this document differs from our experiment in that there are only 2 treatments (treated and untreated) and we have 7 treatments (3 growth stages in BHI and 4 in MIV). The example is similar to ours in that it looks at the effect of treatments on different cell lines.

I am skipping over the first part of the Bioconductor example; aligning and counting the reads starting from the FASTQ files. I’m going to start with a matrix of raw counts that Scott created (thank-you Scott). The DEseq people say it is important to use raw counts, not normalized counts as the input to DEseq2, you can read more about why at Bioconductor.

About the samples

In this post I am only looking at a subset of the 90+ samples we have. I created a smaller table of counts that only includes the kw20 samples and the rpod samples (I did this in Excel by simply deleting the columns I did not want).
kw20 = KW20 Rd -- our wild-type Haemophilus influenzae strain; a.k.a the control.

rpod = RR753 -- this hyper-competent strain has a single nucleotide difference from H. influenzae. The point mutation is in the gene rpoD, which encodes the housekeeping sigma factor for H. influenzae. This is a mutation I'm interested in so that is why I kept it in the data set.

M samples -  in MIV media for 0 min (M0), 10 min (M1), 30 min (M2) and 100 min (M3). 
B samples - grown in sBHI and sampled at OD600 of 0.02 (B1), 0.6 (B2) and 1.0 (B3).
Note that the M0 sample was taken immediately after transferring the cells from sBHI to MIV. This was done when the culture density was 0.2, so the M0 sample is really an sBHI sample at OD600 = 0.2.

Getting the counts into the format DEseq2 wants (a.k.a. creating the needed DEseq2 object)

This was the hardest part and required a lot of Goggling and some help from Scott. I tried to follow the Bioconductor example in the section “Starting from count matrices” but this did not have enough information for what I was doing. Usually I will not provide my R code in the blog, but I am going to include it here because it is very different from the example.

I read the table in using this code
countData <- read.table("dataKW753.txt", header=TRUE, row.names=1, sep="\t")
I added, header=TRUE and row.names=1 because if I didn’t the resulting table did not have names associated with the data and my graphs had useless axis labels like 1 vs 2. 
The resulting data frame looks like this


I then read in the information about the experiment using
design <- read.table("design_KW753.txt", header=TRUE,  sep="\t")
rownames(design) <- design[ ,1]
I had to add in rownames or the resulting plots were labeled by row numbers, not sample names.



Note for Mac users. The original design table included gene descriptions. In the process of reading this design table into R, I discovered that R on the Mac does not like the symbol (most people would see this as an apostrophe, but a molecular biologist sees this as a symbol called prime). R treated the ‘ as an escape character and this resulted in sub-setting a bunch of rows inside of other rows. To prevent this, simply do a search and replace to remove the ‘ characters with something else before attempting to read the information into R. I removed the gene description column from the design table before I used it.


Finally, I used these two data frames to create the data object need for DESeq2.
ddsRep <- DESeqDataSetFromMatrix(countData = countData, colData=design, design=~ Strain + Treatment)
I patterned the design after the Bioconductor example (I tried several variations and this seemed to work). Note that when you look at the data object, you don’t see a table but instead get back a description of the object

Looks good to me – there are 1709 genes in the H. influenzae genome and I kept 24 experiments in my downsized data table. 

First look at the data

Before determining differential expression, the counts need to be corrected so that neither the strongest expressed genes, not the weakest expressed genes will skew the results (see the Bioconductor example for more details). I used the suggested DESeq2 function rlog to do this.
I then made some simple scatter plots comparing the overall differences in gene expression between 2 samples. If our experiment worked as expected, we should see more differences between 2 different growth conditions then between 2 replicates in the same growth condition. This is the result we see in the figure below where I have plotted KW20 in sBHI. The increased point dispersion in the right panel (each point represents the expression of 1 gene in both samples) shows there are more expression differences between 2 samples taken at different ODs then 2 samples taken at the same OD.
Left panel is 2 replicates taken at OD600 = 0.02.
Right panel is KW20 at 2 different ODs (B1 = OD 0.02 and B2 = OD 0.6). 
I plan on getting deeper into the analysis of the KW20 samples in a future post. For now, I just want to show a few more ways to visualize the data.

Sample Distances

Again, following the Bioconductor example, I looked at the overall similarity between the samples. First I calculated the Euclidean distances between each sample pair using the R function dist (here is where I pretend to know what a Euclidean distance is).  Then I created a heatmap to visualize the distances (there are a few steps in this process). For this analysis I used all the kw20 samples and the rpod samples. 
This graph gives a nice visual showing which samples are most similar to which. The first thing to notice is all samples in sBHI cluster together (B1, B2, B3, M0) and all samples in MIV cluster together (M1, M2, M3). The second thing I noticed is that in the MIV samples, replicates A and B cluster closer together than either do with replicate C; this happens for all time points. I plan to investigate this further in another post. 

Using Poisson Distance you get a similar result.



Principal-components analysis from DEseq2 is another way of clustering the samples. I don't totally understand what this method does but I think it is supposed to find the variables that encapsulate the most differences between the samples. PC1 is the component that explains most of the variation and PC2 is the next most variable component. However, the graph does not tell you what PC1 and PC2 are.  I'm guessing that because the x-axis seems to separates the treatments by media type (sBHI vs MIV) that the type of media explains 64% of the variance. I think PC2 is showing that 12% of the variance between samples is due to the different amounts of time the samples were grown in a particular media. 
I think the important thing to note is that the samples in sBHI cluster together and the samples in MIV cluster together. In general the replicates seem to cluster together, except for 1 replicate of the MIV treatments which sits a fair distance from the other 2 replicates. I couldn't figure out how to label the replicates but based on the heat-map result, I'm guessing it is replicate C that sits by itself.
Principle Component Analysis of all KW20 and rpoD samples.



Sunday, February 15, 2015

Learning R basics for RNA-Seq analysis

I have tried a few approaches to learning R basics in the past month, some with more success then others. I tried a Coursera course but found it too advanced; it focused on creating functions but without having some experience actually using R this was simply too frustrating for me.  I tried some online R manuals but again I found them a little over my head. Code School’s Try R was useful for learning the basics of working at the R command line (thanks to Scott for pointing that one out). I also used the book “R for Everyone: Advanced Analytics and Graphics”. This book was especially helpful because it walks you through many graphing examples including several using GGplot2, an R package we will likely use a lot for RNA-Seq analysis.

I started by working through some of the book’s examples using data that comes with R. I then used the examples as templates to analyze some relatively simple expression data that I previously analyzed with Excel. Since I already knew what this data looked like after analysis, I could easily figure out what wasn’t working in R and then search for solutions. I must say that even though it took me two full days to figure out how to use R to analyze and generate graphs, I could see that R was considerably more powerful then Excel and that once I got the hang of it, I could use R to create customized graphs much more quickly then I currently can with Excel.

Another tool that will be useful for our keeping track of our analyses is GitHub. I am currently working through a short Udacity course to get a better feel for how, why and when to use Git and GitHub. This course is working very well for me, if you are interested you can find it here.

In my next few posts I will summarize how I have used R to analyze RNA-Seq data. I have experience analyzing old-fashioned Northerns and Real-Time PCR data (back when this was a new and hot technique) but I have zero experience working with RNA-Seq data. I’m hoping that other newbie’s will find my novice approach helpful in figuring out how to use R for RNA-Seq analysis. Likely I have made several mistakes in my approach but hopefully someone will both point them out and explain the problems to us in the comments.