Wednesday, May 27, 2015

CRP-S site near mobB

Out of curiosity, I decided to use a motif searcher to find CRP-S sites in the Haemophilus influenzae genome. It identified most if not all of the currently known sites (I was focusing more on the results that weren't known). 

I just wanted to record a few things that may be interesting. The results can be found here. If those results expire, I also put them on pastebin here.

Of interest is this result:
start stop strand score p-value q-value matched sequence
897196 897217 + 15.1 8.7e-08 0.013 TACTGCGATTTAGATCGCAAAC
 
This fall in a region between HI0850 and HI0851 (mobB). The microarray paper we did mentions that this is a potential CRP-S site but claims it is too far away from mobB to regulated it's transcription. But why should I believe that the closest gene is mobB? In my thesis I reported this region as a candidate small RNA based on expression from our RNA seq data (which means something's being transcribed here, it could encode a protein).
Doing a BLAST-N search (genome id: gi|16271976, from: 897186 to: 897517) returns results from mostly Haemophilus influenzae and Mannheimia haemolytica and some from Aggregatibacter actinomycetemcomitans and one Haemophilus bacteriophage: Aaphi23

There is a region ~100 bp that is aligned between these sequences (presumably some functional part) and in the H. influenzae strains KR494 and R2866, the CRP-S site also appears to be conserved:


Also in the KR494 genome, there appears this region appears to be annotated as a 73 amino acid protein sequence. Grabbing that sequence in running it though PFAM gets me a single PFAM-b result that isn't very helpful.

In summary, it looks like there is something beside mobB with a good CRP-S site. This is something that will probably require some groundwork biological data. I'll be sure to check in the RNA-seq data if it's expression is Sxy-dependent.

Tuesday, May 26, 2015

Predicting CRP sites


 I’ve been trying to identify CRP-regulated genes using the RNA-seq data (comparing crpx and KW20 and looking for differently expressed genes). Along with this, I’ve been trying to identify CRP binding motifs in the H. influenzae genome. For finding CRP motifs, I’ve put together an iterative processes that uses motif-finding software.

1) A sequence motif is created by giving MEME a set of known/predicted 22bp CRP motifs.

2) Using the motif created above, FIMO is used to find occurrences of the motif in a genome.

3) A custom R script takes all the genes in the genome and finds predicted CRP sites upstream of the gene (currently set to between 0 and 300 bp upstream). Predicted CRP sites that are found upstream of genes thought to be CRP-regulated are used as input for step 1. The the process repeats for as long as desired.

To test if this works, I tried it with CRP sites in E. coli since it is much better studied.

RegulonDB gives >200 candidate CRP binding sites. I chose 8 randomly:

TTGTGTGATCTGCATCACGCAT
TATCGTGACCTGGATCACTGTT
ATCTGCATCGGAATTTGCAGGC
GACCTCGGTTTAGTTCACAGAA
AATCTTTATCTTTGTAGCACTT
AATTTTACTTTTGGTTACATAT
TTATATATGTCAAGTTGTTAAA
TTACCCGCTTTAAAACACGCTA

I put these 8 sequences into the pipeline I created. Here are the results (note: scale on histogram is not constant):

ITERATION 1:
34/139 genes with known CRP start sites correctly predicted
24/181 CRP sites predicted correctly
286 other genes predicted to have CRP sites (7.1% of genome) – treat as false positives







ITERATION 2:
62/139 genes with known CRP start sites correctly predicted
46/181 CRP sites predicted correctly
431 other genes predicted to have CRP sites (10.8% of genome) – treat as false positives




ITERATION 3:
78/139 genes with known CRP start sites correctly predicted
63/181 CRP sites predicted correctly
530 other genes predicted to have CRP sites (13.2% of genome) – treat as false positives




ITERATION 4:
86/139 genes with known CRP start sites correctly predicted
69/181 CRP sites predicted correctly
552 other genes predicted to have CRP sites (13.8% of genome) – treat as false positives




From 8 sequences alone, I am able to predict ~ one third of known E. coli CRP sites (albeit with a pretty high false positive rate). After each iteration, the motif logo begins to look more and more like the CRP binding motif (in particular after the first iteration). It also seems that predicted CRP sites tend to cluster more between 0 and 300 bp upstream of the gene after each iteration.

Starting off with the following 30 sequences gives better results:

TTGTGTGATCTGCATCACGCAT
TAAAGTGATGGTAGTCACATAA
AAACTGAGACTAGTACGACTTT
AGTGGGATTAATTTCCACATTA
TAATTTCCACATTAAAACAGGG
GGCTGTGCTGCGCATAATACTT
TATCGTGACCTGGATCACTGTT
AATTTTGCGCTAAAGCACATTT
ATCTGCATCGGAATTTGCAGGC
AATAGTGACCTCGCGCAAAATG
AAACGTGATTTAACGCCTGATT
TCTCGTGATCAAGATCACATTC
ATTTGTGATGAAGATCACGTCA
TAACGTGATGTGCCTTGTAATT
AAACGTGATCAACCCCTCAATT
TTTTGCAAGCAACATCACGAAA
GACCTCGGTTTAGTTCACAGAA
TTATGTGACAGATAAAACGTTT
TAATGTGATTTATGCCTCACTA
ATTTGAGAGTTGAATCTCAAAT
TTATGTGATTGATATCACACAA
TCTTATGACGCTCTTCACACTC
CTTTGTGATGTGCTTCCTGTTA
AACAGTTATTTTTAACAAATTT
AAATGTATGACAGATCACTATT
AAATGTTATCCACATCACAATT
AATCTTTATCTTTGTAGCACTT
AATTTTACTTTTGGTTACATAT
TTATATATGTCAAGTTGTTAAA
AGTTGTTAAAATGTGCACAGTT
AAATGTGCACAGTTTCATGATT
TTTCATGATTTCAATCAAAACC
TTACCCGCTTTAAAACACGCTA

ITERATION 1:
87/139 genes with known CRP start sites correctly predicted
85/181 CRP sites predicted correctly
488 other genes predicted to have CRP sites (12.2% of genome) – treat as false positives




ITERATION 6:
98/139 genes with known CRP start sites correctly predicted
92/181 CRP sites predicted correctly
532 other genes predicted to have CRP sites (13.3% of genome) – treat as false positives




The pipeline appears to work in Haemophilus too. For example, I was able to start with the CRP-S sequences and end up with a CRP-N motif using this process:

Iteration 1:


Iteration 5:

Wednesday, May 13, 2015

Finding differentially expressed genes in MIV - Choosing a cutoff

Background

I'm trying to find which genes get turned on/off in MIV and to do this, I need to determine some way of classifying genes as either differentially expressed or not differentially expressed.

When using the edgeR or DESeq2 packages to compare two treatments, the software returns two important values: the log-fold change and the adjusted p-value (or false discovery rate, FDR). The log fold change tells you how differently expressed a gene is between two treatment. For example, if gene X is expressed twice as much in treatment B than it is in treatment A then the logFC would be log_2(2) = 1 if we were to compare B to A. Since treatments have multiple replicates, there is some amount of error associated with the logFC reported.

The packages compute a p-value corresponding to the question "what is the probability that the true logFC is 0?". Since this test is being done once for every gene, the problem of multiple comparisons arises. To accommodate this, the packages compute the FDR corrected p-values and this is what is used to determine significance.

For the data shown in this post, I used DESeq2 only. The DESeq2 workflow tutorial suggests using an adjusted p-value (padj) cutoff of 0.1, meaning any gene with a padj <0.1 should be considered significant. But this value only offers insight on statistical significance. Consider the case where gene X is consistently up 1.1 fold between treatments B and A. The padj for gene X may be extremely low if the replicates are very consistent, but is a 1.1 fold change biologically significant?

The problem

The issue I'm working on is deciding what genes are affected by transfer into MIV. If we had two treatments, say in sBHI and in MIV, the task would be relatively easy. I could just take all genes that pass a certain cut-off and say they are significant. One large complicating factor is that we have MIV timecourse data.

For example, I can take KW20 MIV data and compare timepoint M0 to M1, M2 and M3. When I do this and use a cut-off of padj <0.1 and |logFC| > 1, I get 371 differentially expressed genes comparing M0 to M1, 501 for M0/M2 and 479 for M0/M3. If I pool all these genes together and consider them all significantly different, I get almost 700 genes. Here is a visual:


































This multiplot takes the ~700 significant genes and groups them by which timepoints were significantly. For example, 133 genes showed a significant increase at timepoints M1,  M2 and M3 compared to M0 (bottom right). This is where most of the competence genes cluster.

To take any gene that is significantly different in at least one timepoint as important is probably not a good idea. There are likely many false positives in this set. But how many? and how can I reduce them? I thought that identifying a more stringent cutoff could work

Fortunately, the microarray paper our lab has published includes a list of genes changed in MIV as identified by the microarray data. As a quick test, I took the list of genes upregulated and used them to run a test: I wanted to see how many of these genes could be identified by a differential expression test given different cut-offs. Here is the result:


























I ran 3 differential expression tests: M0/M1, M0/M2, M0/M3. I then pooled together all the genes considered significant in at least one timepoint; this is plotted as the red line. Each plot shows a different padj used as a cut-off value. As a reminder, DESeq recommends 0.1. The x-axis shows the logFC value used as a cut-off. The black line is the number of genes that were identified as significant that also show up in the list from the microarray paper (n=138).

For instance, there appears to be 46 genes that are upregulated by at least 2^4 = 16 fold in at least one timepoint and have a padj < 1e-10. 43 of these were reported in the microarray paper.

The optimal cut-off values would lower the red line (many of which may not be truly significant or false positives) as much as possible and keep the black line as close to 138 as possible. For example, using a padj cut-off of 0.01 and a logFC cut-off of 1 would reduce the number of candidate genes to 341 and only miss 8 of the genes reported in the microarray paper (which, for all I know may be false positives reported by the microarray paper).

Solution?

Still working on it. It looks like taking a logFC cut-off of 1 to 2 (2 to 4 fold changes) is a good idea and varying the padj cut-off depending on how conservative I want to be. I think tomorrow I'll try to run a differential expression test that uses all timepoints at the same time to see if that works better.

Tuesday, May 12, 2015

Microarray vs RNAseq data in rich medium

To determine if the sBHI RNAseq data we have is really an issue, I pulled out some old microarray data from this paper. The paper claims the following:

Cultures become modestly competent in colonies on sBHI agar plates and when liquid sBHI cultures approach stationary phase. With the exception of ssb, all genes in the CRE regulon were also modestly induced  (4–20  times) as  stationary phase approached during growth in rich medium (data not shown). This suggests that the low level of competence seen at this stage is not due to failure of a particular competence function, but to a general low induction of all components. This is supported by the modest increases in expression of sxy and of the CRP-regulon genes induced in MIV.

The paper reports that all genes in the competence regulon were induced 4 to 20 fold in rich media. But is this true? Josh wrote an R script that made a rough plot of what the microarray says about the competence regulon. Red is MIV and blue is sBHI. He took the gene expression value from each timepoint and divided it by the expression value from the first timepoint (as a rough normalization). Therefore, the y-axis corresponds to gene expression level relative to the value from t=-70 (which is why all the t=-70 values are 1). It looks like the y-axis is on a log-scale according to the R script.

This plot is a little rough but that is the consequence of using the R base package for plotting. As a testament to the superiority of ggplot2 to the R base package, I have some new plots that are hopefully more informative. Note that these are not on a log-scale.

Here is the data from the microarray (sBHI only). It looks like a large subset of the competence genes fail to be induced even 2-fold and the highest is only up 11-fold.

I've done a similar plot for the RNAseq data, except using count values instead of the intensity measurements taken by the microarray:



















This only shows data from one replicate (A) but I the other two are very similar. Here we see that many genes appear to be slightly downregulated, but a handful are upregulated up to 2 fold.

As it is currently plotted, these two graphs cannot be compared directly since their x-axis differ. To correct for this, I estimated what the OD600 of the microarray data would be assuming that the OD600 at t=0 was 0.2 (when cells were transferred into MIV) and a doubling time of ~32 minutes. Here's what I get when combining both plots:

Interestingly, this plot shows that both datasets appear to behave similarly for many genes. But there appear to be ~10 genes that are induced in the microarray that show no signs of induction in the RNAseq data.

Overall, I would like to know why the paper claimed that these genes are induced 4-20 fold when in fact they appear to be induced 0-10 fold. Keep in mind though, that this normalization was very rough and the OD values are estimated. But overall, this result makes me a little bit more comfortable with the RNAseq data. I would still like to see in vivo transformation frequencies for the culture used in the RNAseq data.

Monday, May 11, 2015

Competence in rich medium?

It was pointed out to me that the last post I made shows that competence genes do not get turned on in rich medium for our KW20 samples. For example:


This is strange; expression of competence genes should increase as cell density increases. As for Sxy, I mentioned in the last post that it seems like sxy-1 had much higher expression levels than KW20. It could be the case that in fact the expression seen is sxy-1 is normal and there is an issue with what is believed to be KW20.

Looking at Sxy expression in other strains in sBHI, here is what I get:

As expected, sxyx has no Sxy expression. The three KW20 replicates are very reproducible and appear to be similar to the CRP knockout strain for the two CRP samples we have. sxy-1 and rpoD appear to have similar Sxy levels except at the last timepoint where sxy transcript appears to decrease in rpoD and slightly increase in sxy-1. The two murE replicates also appear to differ greatly. One has higher expression than all other strains and the other appears to only modestly turn on sxy, similar to the KW20 samples but at a higher baseline. I can look into this later, but right now the KW20 samples are the most concerning.

Looking at coverage of CRP and adenylate cyclase, there does not seem to be any sign of a deletion in these genes for the KW20 replicates. On Monday, I'll be sure to take a closer look at the raw reads in IGV.

I've been trying a lot of things to figure out what is going on. Consider the KW20 samples in MIV at time t=0. These data points correspond to cells in sBHI at an optical density of about 0.2 - in other words these are cells between timepoints B1 and B2. The KW20 MIV replicates we have appear to behave as expected in terms of competence development so I am pretty confident that this is how cells in sBHI behave.




I made plots like these for all the competence genes and they suggest that the behaviour of the M0 samples is consistent with the sBHI samples. That is, expression levels of competence genes in M0 is in line with the levels seen in sBHI. Sxy levels in M0 also appear to be similar to B1.

Finally, I wanted to check to see if there was any antibiotic resistance genes in the KW20 samples suggesting that there may be a gene knocked out or something. I took all the unaligned/unmapped reads from the KW20 sBHI samples and pooled them together. Then I used a de novo genome assembly tool to assemble the unmapped reads into contigs. Then I looked for long contigs with high coverage that would indicate some genetic element in the sample that could not find a place in the KW20 reference genome. Most of the hits I got were things liks 23S ribosomal RNA from a whole bunch of different bacteria. It may be interesting to see exactly what kinds of contaminants were present.

But I found some reads that when BLASTed aligned to specR cassettes on plasmids. For example:
http://www.ncbi.nlm.nih.gov/nuccore/814617358?report=graph&from=4760&to=4898
http://www.ncbi.nlm.nih.gov/nuccore/29242913?report=graph&from=3730&to=4374

I should mention that the contigs that aligned to these regions had low coverage of about 4 (that is, only 4 reads in all the pooled samples made up that contig). Although there were ~95000 contigs and I only BLASTed about 10, so it's very probable that there may be more pieces of specR cassettes that I am missing. As a positive control, I took a single toxx sample (which has specR) and did this assembly process and found a specR contig with a coverage of about 40 (which is extremely high compared to the other contigs). In comparison, there are much fewer specR reads in the KW20 samples, suggesting that they are from contaminants or they are in a gene that is not as highly expressed as the toxin is in MIV.

In summary, there is some evidence that the KW20 samples may have spec resistance. We have these cells frozen so it is possible to check. It would also be useful to do a competence timecourse with these cells to see if they have a normal competence phenotype.

Wednesday, May 6, 2015

Sxy-regulated genes

I've been working to identify all Sxy-regulated genes using the following procedure:


Briefly, I found genes downregulated in sxyx relative to KW20 and genes upregulated in sxy1 relative to KW20. I took the intersection of these two sets to be Sxy-dependent genes. In total, 36 genes were identified; a list of their IDs can be found here.

Prior to this, 26 competence genes were thought to exist. This list contains 25 of those. The single stranded binding protein (ssb) is missing from the list. This is not surprising because this gene is essential (knocking it out leads to a non-viable strain) and is only weakly regulated by Sxy.

Unsuprisingly, sxy appears on the list:

Virtually no reads align to the sxy gene in the sxyx samples as expected. Interestingly, it appears the sxy-1 strain has higher levels of sxy transcript. Not entirely sure why; is Sxy self-regulating?

As for the 25 known Sxy-dependent gene (Note that MIV data is being shown on the left and sBHI data on the right):
(slightly higher resolution versions: here and here)




Ten other genes were identified. I'll first go though the ones that appear to be false positives.

Likely false positives

HI0609.2 tRNA-Arg:

This gene doesn't appear to behave similarly to any of the competence genes. It looks like the only reason it was flagged was because it's significantly higher than sxyx in MIV at M3 (probably due to the one sample that is an outlier) and slightly (but significatly) higher in sxy1 at the last sBHI timepoint.

HI1281 pseudogene:

The expression pattern is actually very similar to the tRNA above, so for the same reasons, I'll consider this another false positive.

HI1572 pseudogene:


It appears that one of the KW20 replicates behaves oddly in MIV for this gene but other than that, the expression appears to be mostly independent of Sxy.

More interesting results  

HI0984 dprB


This gene is downstream of dprA and both are likely part of the dprABC operon. dprA is a known competence gene and the graph above suggests that dprB may also be Sxy-dependent. This paper claims that dprABC are all coordinately expressed and competence inducible but our microarray paper claims that dprBC showed little induction by Sxy and were not coordinately expressed with dprA. So I looked closely at the actual RNA-seq reads:

 It appears that dprA is the only gene in the operon that is Sxy-dependent and dprBC appear to be expressed independent of Sxy. The reason dprB was flagged by this analysis is likely because some of the reads from dprA probably overlapped and "contaminated" drpB (i.e. the dprB read counts were inflated when dprA was expressed).


HI0654, HI0656, HI0657, HI0658

This appears to be an operon adjacent to HI0659-HI0660 and moderately regulated by Sxy. The microarray paper also identified this operon, but noted that these genes lacked a CRP-S site and could not identify a terminator between HI0659 and HI0658, suggesting that this may be due to readthrough. It appears to be a good explanation but it comes short of showing that this operon doesn't have a competence-specific role. I would like to see if this these two operons (HI0658 and the toxin-antitoxin operons) were always adjacent in species that have both. I believe Hailey has looked into this, so I'll look into what she has found later.

HI1456-HI1457


I think this one is really interesting and will require a much more in-depth look. There appears to be a very large difference between sxy1 and KW20 for this operon. As for the sxy knockout, there doesn't appear to be such a dramatic difference. However, the sxyx graphs above are for cells in MIV. In for cells in sBHI it looks more like this:


This suggests that this operon is upregulated in the sxy-1 mutant and downregulated in the sxy knockout (compared to KW20) only in sBHI. I need to look into this one a little more, maybe it will get its own blogpost.

HI0352 lic3A


This one is really weird. In sxy-1, it is clearly expressed at a lower level. No difference can be seen between wild type and sxyx though (though there does appear to be a replicate that is an outlier). Again, this is sxyx in MIV though. Looking at cells in sBHI:


In sBHI, it appears that any deviation from wildtype sxy causes this gene to be downregulated. This gene encodes an alpha-2,3-sialyltransferase which adds N-acetylneuraminic acid (Neu5Ac) to a lactose moiety on the LPS. More information can be found in this paper. I think this gene also deserves its own blog post.

Summary

The RNA-seq data verifies 25 of the known 26 competence genes with this analysis. dprB and an operon next to HI0659 may be Sxy-regulated, but I would tend to agree with the conclusion from the microarray paper that this is caused by readthrough (and dprA reads leaking into dprB). The genes
HI1456, HI1457 and HI0352 appear to behave very differently when Sxy protein levels are high/low - but this appears to be limited to cells growing in sBHI. These genes demand further investigation.