Thursday, April 23, 2015

Ribosomal RNA in the RNA-seq samples

To get an estimate of the amount of ribosomal RNA in each sample, I took all the FASTQ files and aligned them to a FASTA file which contains all of the rRNA genes in Haemophilus influenzae KW20 (which can be found here).
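For anyone wanting to repeat this, here is a minimal sketch of the counting step. It assumes the aligner wrote standard SAM output (this post doesn't record which aligner was used), and the function name `percent_aligned` is just an illustrative choice. It relies only on the SAM FLAG bits: 0x4 marks an unmapped read, and 0x100/0x800 mark secondary and supplementary lines that should not be counted as separate reads.

```python
def percent_aligned(sam_lines):
    """Return the percentage of reads in a SAM stream that are mapped.

    Assumes plain-text SAM; header lines start with '@'.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # secondary/supplementary: same read, already counted
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Tiny made-up example: two mapped reads, two unmapped, one secondary line.
example = [
    "@HD\tVN:1.6\n",
    "r1\t0\trRNA\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r3\t16\trRNA\t50\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t256\trRNA\t90\t0\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
example_pct = percent_aligned(example)
```

With a real file this would be called as `percent_aligned(open("sample.sam"))`, where the file name is hypothetical.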

Then I just looked to see what percentage of the reads aligned to these regions:
File         %Aligned
antx_M0_A    0.07%
antx_M1_A    0.06%
antx_M2_A    0.06%
antx_M3_A    0.07%
cmnx_M0_C    4.99%
cmnx_M0_E    1.25%
cmnx_M1_C    0.22%
cmnx_M1_E    0.48%
cmnx_M2_C    13.04%
cmnx_M2_E    0.51%
cmnx_M3_C    2.50%
cmnx_M3_E    0.78%
crpx_B1_F    0.12%
crpx_B2_F    0.08%
crpx_M0_E    0.47%
crpx_M1_E    0.43%
crpx_M2_E    0.13%
crpx_M3_E    0.51%
hfqx_M0_D    0.19%
hfqx_M1_D    0.45%
hfqx_M2_D    0.05%
hfqx_M3_D    0.74%
kw20_B1_F    0.81%
kw20_B1_G    0.30%
kw20_B1_H    0.18%
kw20_B2_F    0.32%
kw20_B2_G    0.55%
kw20_B2_H    0.17%
kw20_B3_F    0.31%
kw20_B3_G    0.56%
kw20_B3_H    0.40%
kw20_M0_A    0.04%
kw20_M0_B    0.11%
kw20_M0_C    0.14%
kw20_M1_A    0.11%
kw20_M1_B    0.17%
kw20_M1_C    1.57%
kw20_M2_A    0.10%
kw20_M2_B    0.17%
kw20_M2_C    0.26%
kw20_M3_A    0.07%
kw20_M3_B    0.06%
kw20_M3_C    1.40%
mure_B1_G    0.55%
mure_B1_H    0.26%
mure_B2_G    0.55%
mure_B2_H    0.44%
mure_B3_G    1.16%
mure_B3_H    0.42%
rpod_B1_X    2.60%
rpod_B2_X    0.58%
rpod_B3_X    0.39%
sxy1_B1_G    0.30%
sxy1_B1_H    0.09%
sxy1_B2_G    0.41%
sxy1_B2_H    0.22%
sxy1_B3_G    0.46%
sxy1_B3_H    0.34%
sxyx_B1_G    0.19%
sxyx_B2_G    0.34%
sxyx_M0_A    0.04%
sxyx_M0_B    0.08%
sxyx_M0_D    0.12%
sxyx_M1_A    0.13%
sxyx_M1_B    0.23%
sxyx_M1_D    0.13%
sxyx_M2_A    0.18%
sxyx_M2_B    0.04%
sxyx_M2_D    0.79%
sxyx_M3_A    0.21%
sxyx_M3_B    0.04%
sxyx_M3_D    0.41%
taxx_M0_C    1.28%
taxx_M0_D    0.15%
taxx_M0_E    0.96%
taxx_M1_C    0.27%
taxx_M1_D    0.20%
taxx_M1_E    0.39%
taxx_M2_C    0.36%
taxx_M2_D    0.15%
taxx_M2_E    1.06%

Ignoring the cmnx 'C' replicates, several of which appear to have a lot of rRNA in them (up to 13.04%), the remaining samples have on average 0.4% rRNA, ranging from 0.04% on the low end to 2.60% on the high end.
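The summary numbers can be recomputed from the table with a few lines of Python. This is only a sketch: the function name `summarize` and the `exclude` argument are my own illustrative choices, and the excerpt below holds just five rows of the table, so its mean is not the 0.4% quoted for the full set.

```python
def summarize(table_text, exclude=lambda name: False):
    """Parse 'name  value%' rows and return count, mean, min and max."""
    values = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not 'name  12.34%'.
        if len(parts) != 2 or not parts[1].endswith("%"):
            continue
        values[parts[0]] = float(parts[1].rstrip("%"))
    kept = [v for name, v in values.items() if not exclude(name)]
    return {
        "n": len(kept),
        "mean": sum(kept) / len(kept),
        "min": min(kept),
        "max": max(kept),
    }


def is_cmnx_c(name):
    """True for the cmnx 'C' replicate series (cmnx_M0_C ... cmnx_M3_C)."""
    return name.startswith("cmnx_") and name.endswith("_C")


# A five-row excerpt of the table above, for illustration only.
excerpt = """File         %Aligned
antx_M0_A    0.07%
cmnx_M2_C    13.04%
cmnx_M2_E    0.51%
rpod_B1_X    2.60%
kw20_M0_A    0.04%
"""
stats = summarize(excerpt, exclude=is_cmnx_c)
```

Pasting the full table into `excerpt` gives the statistics for all samples outside the cmnx 'C' series.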

Monday, April 13, 2015

Checking strains, growing cultures

I think this post really belongs on RRResearch, so that's where you'll find it.

Thursday, April 9, 2015

What's wrong with the RNAseq reads for sample kw20_M3_C?

In his March 20 post, Scott pointed out that the kw20_M3_C sample appears to be very odd.  This is a replicate of samples kw20_M3_A and kw20_M3_B; they are all the wildtype strain KW20 after 100 minutes in the MIV competence induction medium.

The 'C' replicate doesn't cluster with the others in Scott's principal component analysis (see figures in Scott's post), and its quality control analysis gave very weird results, with sharp spikes of reads with much higher G+C than expected for H. influenzae.   But 98% of the reads aligned to the H. influenzae genome, so these high-GC reads aren't due to contamination. Instead they must be very high coverage of GC-rich parts of the genome.

The red line shows the distribution of %G+C in the reads of this sample.  The genome is 38% G+C, and from the QC analysis of the other samples we expect to see a smooth red bell curve peaking at about 40-41%, with some minor spikes at the high G+C end.  Instead we see a series of jagged peaks, one at 38%, one at 51% and one at 55%, with a smaller peak at 60%.  From the heights and widths of the peaks I'd estimate that about half of the reads have anomalously high %G+C.

Scott decided that many of the reads must be ribosomal RNA, but further analysis suggests that they are other abundant high-G+C noncoding RNAs, not rRNAs.

Many of the reads in the 'Overrepresented sequences' list are 'transfer-messenger RNA' (tmRNA).  This ~350 nt RNA has important functions in translation.  Its gene is at 1357929-1357564, and the nucleotide coverage of this region in our RNAseq sample shows that more than 2,000,000 of the reads are from this gene.  Even more reads are from RNase P.  The gene for this 360 nt ribozyme is at 1746450-1746790.  Coverage over most of this region is more than 1,000,000, so I estimate that about 3,000,000 of our reads are from RNase P.  Together that is about 5,000,000 reads, which suggests that about 50% of the ~10,000,000 reads from this sample are from RNase P and tmRNA.

To confirm this I directly checked whether half the reads in the sample do have anomalously high %G+C.  I pulled out the first 28 reads from the original fastq file and directly checked their %G+C.  Thirteen of these have the low G+C (<41%) expected of most H. influenzae mRNAs.  The rest have a range of very high G+C (47.5-59.3%), and BLAST tells me that all but three of these reads are either RNase P or tmRNA.  One of the others is 16S rRNA, and the other two are mRNAs that are also abundant in the replicate samples.  This suggests that about half of the reads are indeed from RNase P and tmRNA.
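This kind of spot check is easy to script. The sketch below is not how the check was actually done. It assumes an uncompressed FASTQ file with exactly four lines per record (no wrapped sequences), and the function names are illustrative. Ambiguous bases such as N are left out of the denominator.

```python
from itertools import islice


def gc_percent(seq):
    """Percent G+C among the called bases (A, C, G, T) of one read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def first_read_gc(fastq_lines, n_reads=28):
    """%G+C of the first n_reads records of a four-line-per-record FASTQ."""
    # The sequence is the second line of each record: indices 1, 5, 9, ...
    seqs = islice(fastq_lines, 1, 4 * n_reads, 4)
    return [gc_percent(s.strip()) for s in seqs]


# Made-up three-read example (not real data from this sample).
demo = [
    "@r1\n", "GGCC\n", "+\n", "IIII\n",
    "@r2\n", "ATAT\n", "+\n", "IIII\n",
    "@r3\n", "ATGC\n", "+\n", "IIII\n",
]
demo_gc = first_read_gc(demo, n_reads=2)
n_low = sum(gc < 41 for gc in first_read_gc(demo, n_reads=3))
```

On a real file, `first_read_gc(open("reads.fastq"))` (hypothetical file name) returns the 28 values, and counting how many fall below 41 gives the low-G+C tally.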

Why would one sample have this problem?  Perhaps it's just that much of the mRNA was degraded to fragments so small that they were lost in the library creation step?  RNase P and tmRNA molecules might have survived because of their highly folded structures.  To test this hypothesis, the Co-op tech is looking through the RNAseq notebook to find the gel photos that go with these RNA preps.

Tuesday, April 7, 2015

Back to planning the new RNA-seq samples

I just taught my last class of the term, so now I can get back to experiments.  The highest priority is preparing the new samples for RNA-seq analysis.  The preliminary planning is described in the previous post, which has a table summarizing both the runs we've done and the provisional plans.  Here it is again.

The plan is to get the RNA samples for these TO DO runs finished this month, before our Co-op tech finishes her work term and leaves us.

First step:  Decide what samples to collect and check what supplies we'll need to order (RNA-Protect, RNeasy kit, Trizol, DNA-free, RNeasy MinElute cleanup kit, Ribo-zero?).  We may collect a few more samples than the 24 we will be able to sequence.

Second step:  Streak out all the strains.  Make especially sure that the antx and toxx strains are the unmarked ones, by checking their StrR and competence phenotypes.  In fact, check the StrR and competence phenotypes of everything.  Maybe use PCR to check the genotypes?

Third step:  Inoculate an overnight culture of each strain.  Grow each strain as specified and collect RNA-protect treated cell pellets (store at -80 °C).

Fourth step:  Purify RNA from all samples except the 'kw20 Trizol' ones using the Qiagen RNeasy kit.  Check nucleic acid concentration using the nanodrop.  

Fifth step:  Purify RNA from the 'kw20 Trizol' samples using Trizol.  Check nucleic acid concentration using the nanodrop.

Sixth step:  Treat each RNA sample with DNase ('DNA-free'?).  Repurify using the Qiagen RNeasy cleanup kit.  Check RNA concentration using the nanodrop.  If we have more than 24 samples, decide which ones to not sequence.

Seventh step:  Give samples to Sunita (?), who will treat them with Ribo-zero using half as much RNA and 1/4 as much Ribo-zero as recommended (thus using only a 6-sample kit rather than a 24-sample kit).  (Or maybe we will do this step ourselves.)

Eighth step:  Sunita will prepare these samples for sequencing.

So, first step.

Which samples to collect?  Has anything changed?  Well, before the sabbatical visitor left she did isolate a new rpoD mutation causing hypercompetence.  (Here I'll refer to it as rpo2.)  Should we include it in the RNA-seq samples?  If we do, we'll have to cut out something else.  Its competence phenotype (from a sBHI time course) is very similar to that of the rpoD mutant we are including, so I think we won't include this one.

What do we need to order?

  • RNA-Protect?  We have 86 ml of RNA-Protect; this is enough for 21.5 2 ml samples.  Rather than buying more RNA-Protect ($278 for 200 ml), we'll just reduce our sample sizes slightly (1.6 ml) so we have enough for all of them.  This should still give us plenty of RNA.
  • Qiagen RNeasy purification kit:  We have a nearly complete kit with 47 of its 50 columns left.
  • Trizol:  We have a nearly full bottle of Trizol, though it's at least several years old.  It's been stored in the dark at 4 °C so it's probably still OK.  A new bottle would cost about $230, so if we decide ours is too old we'll just scrounge the few ml we'll need.
  • RNeasy MinElute cleanup kit:  our old kit has 23 columns remaining, almost enough for one treatment of each of our samples (before Ribo-zero treatment).  But if we do the Ribo-zero work ourselves we'll need to also do a second cleanup after the Ribo-zero, so we'd need to buy another kit (about $400).
  • Ribo-zero:  If we do the treatment ourselves we'll need to order a 6-sample kit ($650).  However our collaborators (the former RA's new lab) might do this for us.  I'll check with her.
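The RNA-Protect arithmetic in the first bullet can be checked quickly. The 2:1 reagent-to-culture ratio used here is inferred from the numbers in the list (86 ml covering 21.5 samples of 2 ml means 4 ml of reagent per sample), not taken from the kit instructions, so treat it as an assumption. The function name is illustrative.

```python
def samples_supported(reagent_ml, sample_ml, reagent_per_sample_volume=2.0):
    """How many samples a reagent stock covers.

    reagent_per_sample_volume is ASSUMED to be 2.0 (two volumes of
    RNA-Protect per volume of culture), inferred from 86 ml / 21.5 samples.
    """
    return reagent_ml / (sample_ml * reagent_per_sample_volume)


at_2_ml = samples_supported(86, 2.0)    # the original 2 ml sample size
at_1_6_ml = samples_supported(86, 1.6)  # the reduced 1.6 ml sample size
print(f"2.0 ml samples: {at_2_ml:.1f}")
print(f"1.6 ml samples: {at_1_6_ml:.1f}")
```

Under that assumption, dropping to 1.6 ml samples stretches the 86 ml stock to about 26 samples, which covers the 24 we can sequence plus a couple of extras.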

The tech is streaking out all the strains on plain sBHI plates.  We'll then restreak them on antibiotic plates to check their resistances, and confirm their transformation phenotypes when we grow them for the RNA preps.


  • KW20:  our wildtype strain (RR722, in freezer box 12): sensitive to all antibiotics, normal transformation.
  • HI0659 (antitoxin) knockout (unmarked) (RR3158, in freezer box 39): should be sensitive to all antibiotics, not transformable at all.
  • HI0660 (toxin) knockout (unmarked) (RR3159, in freezer box 39): should be sensitive to all antibiotics, normal transformation.
  • rpoD753 (RR753, in freezer box 16): resistant to streptomycin, moderately hypercompetent in sBHI.
  • hfq (marked deletion) (RR3187, in freezer box 40): resistant to spectinomycin, transformation down ~ tenfold.
  • crp::miniTn10kan (RR540, in freezer box 3): resistant to kanamycin and streptomycin, not transformable at all. 
  • sxy::miniTn10kan (RR648, in freezer box 8): resistant to kanamycin, not transformable at all.