Aligning bisulfite-converted DNA reads to a genome

Bisulfite is used to detect methylated cytosines. It converts unmethylated Cs to Ts, but it leaves methylated Cs intact. If we then sequence the DNA and align it to a reference genome, we can infer cytosine methylation.

To align the DNA accurately, we should take the C->T conversion into account. Here is how to do it with LAST.

Let's assume we have bisulfite-converted DNA reads in a file called "reads.fastq" (in fastq-sanger format), and the genome is in "mygenome.fa" (in fasta format). We will also assume that all the reads are from the converted strand, and not its reverse-complement (i.e. they have C->T conversions and not G->A conversions).

First, we need to run lastdb twice, for forward-strand and reverse-strand alignments:

lastdb -u bisulfite_f.seed my_f mygenome.fa
lastdb -u bisulfite_r.seed my_r mygenome.fa

Then find alignments, one strand at a time:

lastal -p bisulfite_f.mat -s1 -Q1 -e120 my_f reads.fastq > temp_f
lastal -p bisulfite_r.mat -s0 -Q1 -e120 my_r reads.fastq > temp_r

Take care not to mistype these commands, especially f/r and s1/s0. Finally, merge the alignments and estimate which one represents the genomic source of each read:

last-merge-batches.py temp_f temp_r | last-map-probs.py -s150 > myalns.maf

These commands refer to files (bisulfite_f.seed etc), which are in the examples directory. You need to specify exactly where they are (e.g. "-u examples/bisulfite_f.seed").

Avoiding biased methylation estimates

Imagine that one genomic cytosine is methylated in 50% of cells in your sample, so that 50% of reads covering it have C and 50% have T. It is possible that the reads with C are easier to align, so we align more of them. The methylation rate would then look higher than 50%.

We can avoid this bias by converting all Cs in the reads to Ts, before aligning them:

perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_f - > temp_f
perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_r - > temp_r

This perl command assumes that the fastq file has uppercase sequences, and no line-wrapping or blank lines. It converts Cs to lowercase Ts: lowercase has no effect on the alignment, but it lets you see where the Cs were in the output.

Tips

Paired-end DNA reads

You can align paired-end reads by combining the recipe in this document with the one in last-pair-probs.html. For large datasets, you would want to go faster by using multiple CPUs. This gets a bit complicated, so we provide a lastal-bisulfite-paired script in the examples directory. Typical usage:

lastdb -w2 -u bisulfite_f.seed my_f mygenome.fa
lastdb -w2 -u bisulfite_r.seed my_r mygenome.fa

lastal-bisulfite-paired.sh my_f my_r reads1.fastq reads2.fastq > results.maf

This assumes that reads1.fastq are all from the converted strand (i.e. they have C->T conversions) and reads2.fastq are all from the reverse-complement (i.e. they have G->A conversions). It requires GNU parallel to be installed. You are encouraged to customize this script.