Saturday, January 10, 2015

An in-depth look at data quality (Part 1)



I’ve talked about FastQC in a previous post but now I’ve actually use it to analyze our data. We want to make sure the data we have is of sufficient quality before we do anything with it, so it’s important to do this analysis early.

Note: I put all relevant documents on the Google Drive in the folder Scott>Quality Analysis

Here’s what I did:
1)      Run FastQC on all .fastq files (96 x 2 reads per sample = 192 files)
2)      Run Sambamba  to align files to reference genome (to be discussed in future post)
3)      Run Sambamba  to extract alignment statistics
4)      Made a script that runs in the terminal to extract important statistics from the output of 1) and 3) (uploaded to the drive as fastqc_extraction.sh)

The output of my script is called fastq_results.xlsx on the drive. On the first worksheet of this file I have listed useful values and whether or not a sample passes or fails the standard FastQC tests. The second worksheet is just a list of every sequence FastQC has labelled as “overrepresented”.

The first thing I did was compare some numbers between the old and new RNA seq datasets in R. The script that makes this plot is called quality_analysis_script.R (and requires the script multiplot.R)


Number of reads (i.e. number of DNA fragments sequenced in each sample)
The graph suggests that the older samples tend to have more reads. Older samples have, on average, 1.5 million more reads. A t-test says that there is a significant difference with a p-value of 1e-08. Although 1.5 million sounds like a lot, keep in mind that some of the older samples differ by close to 6 million reads.

The top 4 files for reads are:
toxx_M0_A.R1.fq.gz – 10651839
toxx_M0_A.R2.fq.gz – 10651839
antx_M0_A.R1.fq.gz – 11311458
antx_M0_A.R2.fq.gz – 11311458

The bottom 4 files are:
antx_M2_C.R1.fq.gz – 1781164
antx_M2_C.R2.fq.gz – 1781164
antx_M2_E.R1.fq.gz – 4035388
antx_M2_E.R2.fq.gz – 4035388

Number of duplicates (i.e. reads that do not have a unique sequence in the sample)
The graph suggests that the newer samples tend to have more duplicated reads. Newer samples have, on average, 4% more duplicates. A t-test says that there is a significant difference with a p-value of 4e-11.

Duplicate reads are caused by sequencing two PCR products derived from the same fragment (technical duplicates) or by sequencing the same region of two distinct RNA molecules (biological duplicates). Technical duplicates are unwanted whereas biological duplicates are pretty much the basis of differential expression analysis. It is impossible to discriminate the two types of duplicates with the data we have.

So it looks like most samples fall above 80% duplicates – which sounds like a lot (only 20% of reads are actually unique). However, remember that only one of the two paired end reads is actually being analyzed here so all reads that begin at the same nucleotide are considered duplicates. Because we’re doing RNA seq here, most reads come from the transcriptome (a subset of the whole genome) and in some samples, there are 3x more reads than nucleotides in the genome; extremely high duplication (80-90%) is reasonable.

The top 4 files for duplicated % are:
kw20_M3_C.R1.fq.gz – 89.1
toxx_M1_D.R1.fq.gz – 88.6
kw20_M3_C.R2.fq.gz – 88.0
toxx_M1_D.R2.fq.gz – 87.9

The bottom 4 files are:
kw20_M0_B.R2.fq.gz – 75.8
sxyx_M1_B.R2.fq.gz – 75.5
antx_M2_C.R1.fq.gz – 60.4
antx_M2_C.R2.fq.gz – 58.7

Note the bottom two. They have a lot of contaminated sequences, so it’s not surprising that almost half of sequences are unique.

GC Percent (i.e. % of nucleotides that are G or C)
There is on average 1% more GC in the old data than the new (p=2e-08). H. influenzae has a GC % of 38% and Rosie has claimed that genes tend to be higher. It seems that the older dataset is much more variable (except for 4, 2 of which have been shown to be contaminated so far)

The top 4 files for GC % are:
kw20_M3_C.R1.fq.gz – 46
kw20_M3_C.R2.fq.gz – 45
antx_M2_C.R1.fq. gz – 45
antx_M2_C.R2.fq.gz – 45

The bottom 4 files are:
Most files are lowest at 40%

% Alignment (i.e. % of reads that can be mapped back to the H. influenzae genome)
Note that for antx_M2_C, only 20% reads aligned to the reference genome. This sample was not included in the graph. On average, reads from the new samples get aligned better than reads from the old sample by 0.5%. This is significant (p=1e-10) once the poorly aligned sample is removed.

This means that the newer samples could be purer or the quality of sequencing was higher in the new sample (base calling errors make it harder to align back to a reference sequence) or maybe something else I haven’t thought of.

The top 4 files for GC % are:
sxy1_B2_G.R1.fq.gz – 99.42
sxy1_B2_G.R2.fq.gz – 99.42
sxyx_M0_D.R1.fq.gz – 99.43
sxyx_M0_D.R2.fq.gz – 99.43

The bottom 4 files are:
antx_M2_C.R1.fq.gz – 20.1
antx_M2_C.R2.fq.gz – 20.1
toxx_M2_A.R1.fq.gz – 97.1
toxx_M2_A.R2.fq.gz – 97.1

Overall:
-The old data and the new data are different. Statistics can tell me that, however I don't see any major problem. Sample antx_M2_C is what I would expect a problem would look like. When samples are prepared by different people on different days, there is bound to be technical variation.
- Sample antx_M2_C is bad. I feel it may cause problems to keep in in any future analysis steps (but we can try anyway)
- Looking at the graphs, my prediction is that the newer data is either less contaminated or contains more PCR duplicates, because the graphs suggest that there is more H. influenzae DNA in the newer sample (better alignment, GC content has low variability)

So far, this analysis suggests that the new data is probably as good as (if not better) than the old data. In the next post I will look at the FastQC test results.

3 comments:

Rosie Redfield said...

You wrote "Note that one sample was aligned to only 20% of the genome and is not on the graph." This doesn't make any biological sense and it doesn't fit its context either, so I think you probably mean that 20% of the reads aligned to the genome.

If this is the correct interpretation then we shouldn't discard the data without more measures of whether or not it will be informative.

Rosie Redfield said...

I don't think we know the 'expected value' for %G+C of the pooled mRNAs.

Scott M. said...

1)The way I worded the "aligned to only 20% of the genome" was indeed awkward, and I've updated the post to correct it.

2)I'm not discarding that sample (yet), I just didn't include it in that graph because it caused scaling problems to the graph as a whole.

3)You're right, we don't know what the GC% of the transcriptome is. Intuitively, 40% is what I would expect but I realize now I really have no evidence to support that.