Tuesday, July 21, 2015

New RNA-seq samples (quality)

The new RNA-seq data arrived about a week and a half ago. Since then I've aligned it to the KW20 genome and looked a bit into the antitoxin knockout samples.

But I wanted to mention the quality of the data. This analysis was done partially using FASTQC:


These plots show the distribution of quality at each base position. (Phred score on y axis, position in read on x axis) In Illumina data, we expect the quality to decrease the longer the read gets. In the old data, you can see the quick drop of quality at the end of each read. These plots are very representative of both old and new datasets and overall I'd say the sequencing quality is higher for the new dataset. This doesn't say anything about the content of the samples though.


Here are some interesting numbers:

Average # of reads of older samples: 7,643,000
Average # of reads of new samples: 11,620,000

On average, the new set of samples are much larger. But there are a few samples that are on the low side:

- toxx_M3_K ( 2,311,395 reads )
- hfqx_M2_J ( 5,416,787 reads )
- hfqx_M3_J ( 2,285,198 reads )

Average % GC in older samples: 40.6%
Average % GC in new samples: 48.8 %

This seems a little alarming at first. Let's look at the distribution more closely: 



This plot shows the number of reads with a given GC content (red line) and a smoothed theoretical bell-curve (blue line). I added the green bars to help compare peaks in the plots. The old data is what I expect a good distribution of KW20 mRNA to look like. The large peaks in the new dataset lines up with tiny bumps in the old data. These are likely from ribosomal RNA. The alignment data also supports the idea that there is much more KW20 rRNA in the new dataset (probably because we used suboptimal amounts of RiboZero when extracting the RNA). Is this a problem? I'll get to that soon.

% Aligned to KW20 genome in older samples (average):  98.1%
% Aligned to KW20 genome in new samples (average): 81.1%

It seems like there is a lot more contamination in the new datasets. In fact, ~20% of reads do not appear to come from KW20. I blasted a few of these unaligned pieces and they all aligned to 23S ribosomal RNA in other bacterial species (E. coli, Bacillus subtilis, etc.). This is a shame, but again, is it a problem?

Average count data in older samples: 8,223,000
Average count data in new samples: 3,000,000

I should explain what this number means. This is a count of the number of reads that align to some gene in the KW20 genome (other than rRNA genes), averaged across all samples. (note: reads that overlap two genes may be double counted). In other words, this number is proportional to the number of useful reads (useful for differential expression).

So there are about 2.75x more useful reads in older samples than the new samples. Although this isn't optimal, I don't think it is a big problem (and I haven't run into an obvious problem yet). Perhaps a few samples could use a couple million more reads, overall the data looks usable.