Tuesday, May 26, 2015

Predicting CRP sites


I’ve been trying to identify CRP-regulated genes using the RNA-seq data (comparing crpx and KW20 and looking for differentially expressed genes). Along with this, I’ve been trying to identify CRP binding motifs in the H. influenzae genome. To find CRP motifs, I’ve put together an iterative process that uses motif-finding software.

1) A sequence motif is created by giving MEME a set of known/predicted 22bp CRP motifs.

2) Using the motif created above, FIMO is used to find occurrences of the motif in a genome.

3) A custom R script takes all the genes in the genome and finds predicted CRP sites upstream of each gene (currently set to between 0 and 300 bp upstream). Predicted CRP sites that are found upstream of genes thought to be CRP-regulated are used as input for step 1. The process then repeats for as long as desired.
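Step 3 can be sketched roughly as follows. This is a Python illustration, not the R script used here, and the names (`Gene`, `Site`, `sites_upstream`, `next_seed_set`) are hypothetical. It assumes that "upstream" is measured from the gene start to the nearest edge of the site, on the gene's own strand, and that a site overlapping the gene start does not count; the real script may define the window differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based leftmost coordinate
    end: int     # 1-based rightmost coordinate, inclusive
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class Site:
    start: int     # 1-based leftmost coordinate of the predicted site
    end: int       # 1-based rightmost coordinate, inclusive
    sequence: str  # the matched 22 bp sequence


def upstream_distance(gene, site):
    """Gap in bp between the gene start and the nearest edge of the site.

    Returns None when the site is not entirely upstream of the gene start.
    (Assumed definition; the original script may measure this differently.)
    """
    if gene.strand == "+":
        gap = gene.start - site.end - 1   # site lies to the left of the start
    else:
        gap = site.start - gene.end - 1   # site lies to the right of the start
    return gap if gap >= 0 else None


def sites_upstream(gene, sites, max_dist=300):
    """Predicted sites located 0..max_dist bp upstream of one gene."""
    hits = []
    for site in sites:
        dist = upstream_distance(gene, site)
        if dist is not None and dist <= max_dist:
            hits.append(site)
    return hits


def next_seed_set(genes, sites, regulated_names, max_dist=300):
    """Sequences to feed back into step 1: predicted sites that sit
    upstream of genes already thought to be CRP-regulated."""
    seeds = set()
    for gene in genes:
        if gene.name in regulated_names:
            for site in sites_upstream(gene, sites, max_dist):
                seeds.add(site.sequence)
    return sorted(seeds)
```

In this sketch, `sites` would be parsed from the FIMO output of step 2 and `regulated_names` would come from whatever evidence is available (for example the RNA-seq comparison); both are left to the caller.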

To test whether this works, I tried it with CRP sites in E. coli, since CRP regulation is much better studied there.

RegulonDB gives >200 candidate CRP binding sites. I chose 8 randomly:

TTGTGTGATCTGCATCACGCAT
TATCGTGACCTGGATCACTGTT
ATCTGCATCGGAATTTGCAGGC
GACCTCGGTTTAGTTCACAGAA
AATCTTTATCTTTGTAGCACTT
AATTTTACTTTTGGTTACATAT
TTATATATGTCAAGTTGTTAAA
TTACCCGCTTTAAAACACGCTA

I put these 8 sequences into the pipeline I created. Here are the results (note: scale on histogram is not constant):
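For reference, here is a small sketch of how the 8 seed sequences could be prepared as input for step 1. MEME reads FASTA, so the sequences are written out with made-up headers (`seed_1`, `seed_2`, ...). The per-position base counts are only a quick sanity check on the seed alignment; they are not a substitute for the motif MEME builds. The function names are hypothetical and this is Python rather than the R used in the actual pipeline.

```python
from collections import Counter

# The 8 seed sequences listed above, each 22 bp long.
SEEDS = [
    "TTGTGTGATCTGCATCACGCAT",
    "TATCGTGACCTGGATCACTGTT",
    "ATCTGCATCGGAATTTGCAGGC",
    "GACCTCGGTTTAGTTCACAGAA",
    "AATCTTTATCTTTGTAGCACTT",
    "AATTTTACTTTTGGTTACATAT",
    "TTATATATGTCAAGTTGTTAAA",
    "TTACCCGCTTTAAAACACGCTA",
]


def to_fasta(seqs, prefix="seed"):
    """Format sequences as FASTA text with numbered placeholder headers."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(seqs, start=1))


def position_counts(seqs):
    """Count the bases seen at each position of equal-length sequences."""
    width = len(seqs[0])
    if any(len(seq) != width for seq in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) walks the alignment one column at a time
    return [Counter(column) for column in zip(*seqs)]


fasta_text = to_fasta(SEEDS)
counts = position_counts(SEEDS)

for pos, column in enumerate(counts, start=1):
    print(pos, dict(column))
```

Writing `fasta_text` to a file gives something that can be passed to MEME; the exact MEME options used for this post are not shown here.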

ITERATION 1:
34/139 genes with known CRP start sites correctly predicted
24/181 CRP sites predicted correctly
286 other genes predicted to have CRP sites (7.1% of genome) – treat as false positives







ITERATION 2:
62/139 genes with known CRP start sites correctly predicted
46/181 CRP sites predicted correctly
431 other genes predicted to have CRP sites (10.8% of genome) – treat as false positives




ITERATION 3:
78/139 genes with known CRP start sites correctly predicted
63/181 CRP sites predicted correctly
530 other genes predicted to have CRP sites (13.2% of genome) – treat as false positives




ITERATION 4:
86/139 genes with known CRP start sites correctly predicted
69/181 CRP sites predicted correctly
552 other genes predicted to have CRP sites (13.8% of genome) – treat as false positives




From 8 sequences alone, I am able to predict more than one third of known E. coli CRP sites (69/181 by iteration 4), albeit with a pretty high false positive rate. With each iteration, the motif logo looks more and more like the CRP binding motif, most noticeably after the first iteration. It also seems that predicted CRP sites cluster more tightly between 0 and 300 bp upstream of the gene after each iteration.
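The counts reported for the four iterations can be turned into rates with a few lines of arithmetic. The sketch below is Python with hypothetical names, and the "precision" it prints follows the convention used above of treating every other predicted gene as a false positive. Since the list of known CRP-regulated genes is unlikely to be complete, that figure is best read as a pessimistic estimate.

```python
KNOWN_GENES = 139  # genes with known CRP sites
KNOWN_SITES = 181  # known CRP sites

# (iteration, known genes recovered, known sites recovered, other genes predicted)
RUNS = [
    (1, 34, 24, 286),
    (2, 62, 46, 431),
    (3, 78, 63, 530),
    (4, 86, 69, 552),
]


def summarize(iteration, genes_hit, sites_hit, other_genes):
    """Convert raw counts for one iteration into recall and precision."""
    return {
        "iteration": iteration,
        "gene_recall": genes_hit / KNOWN_GENES,
        "site_recall": sites_hit / KNOWN_SITES,
        # fraction of all predicted genes that are in the known set
        "gene_precision": genes_hit / (genes_hit + other_genes),
    }


for run in RUNS:
    s = summarize(*run)
    print(
        f"iteration {s['iteration']}: "
        f"gene recall {s['gene_recall']:.1%}, "
        f"site recall {s['site_recall']:.1%}, "
        f"gene precision {s['gene_precision']:.1%}"
    )
```

Laid out this way, it is easier to see that recall keeps rising across iterations while the share of predictions that hit known genes stays low.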

Starting off with the following 33 sequences gives better results:

TTGTGTGATCTGCATCACGCAT
TAAAGTGATGGTAGTCACATAA
AAACTGAGACTAGTACGACTTT
AGTGGGATTAATTTCCACATTA
TAATTTCCACATTAAAACAGGG
GGCTGTGCTGCGCATAATACTT
TATCGTGACCTGGATCACTGTT
AATTTTGCGCTAAAGCACATTT
ATCTGCATCGGAATTTGCAGGC
AATAGTGACCTCGCGCAAAATG
AAACGTGATTTAACGCCTGATT
TCTCGTGATCAAGATCACATTC
ATTTGTGATGAAGATCACGTCA
TAACGTGATGTGCCTTGTAATT
AAACGTGATCAACCCCTCAATT
TTTTGCAAGCAACATCACGAAA
GACCTCGGTTTAGTTCACAGAA
TTATGTGACAGATAAAACGTTT
TAATGTGATTTATGCCTCACTA
ATTTGAGAGTTGAATCTCAAAT
TTATGTGATTGATATCACACAA
TCTTATGACGCTCTTCACACTC
CTTTGTGATGTGCTTCCTGTTA
AACAGTTATTTTTAACAAATTT
AAATGTATGACAGATCACTATT
AAATGTTATCCACATCACAATT
AATCTTTATCTTTGTAGCACTT
AATTTTACTTTTGGTTACATAT
TTATATATGTCAAGTTGTTAAA
AGTTGTTAAAATGTGCACAGTT
AAATGTGCACAGTTTCATGATT
TTTCATGATTTCAATCAAAACC
TTACCCGCTTTAAAACACGCTA

ITERATION 1:
87/139 genes with known CRP start sites correctly predicted
85/181 CRP sites predicted correctly
488 other genes predicted to have CRP sites (12.2% of genome) – treat as false positives




ITERATION 6:
98/139 genes with known CRP start sites correctly predicted
92/181 CRP sites predicted correctly
532 other genes predicted to have CRP sites (13.3% of genome) – treat as false positives




The pipeline appears to work in Haemophilus too. For example, I was able to start with the CRP-S sequences and end up with a CRP-N motif using this process:

Iteration 1:


Iteration 5:
