A story about how multi-maps are handled by bowtie2.
In the default behavior of bowtie2, even if the array is hit multiple times, only one of them is reported. Then, how to choose one of them is usually randomly selected from the ones with the highest alignment score. The document also says so [^ 1]. … But in some cases it seems that it will not be random and will solidify [^ 2], so I would like to clarify the conditions.
Most of the conclusions are written here [^ 2], but this problem is due to the random number generation of bowtie2. The --non-deterministic
option section of the Bowtie2 instruction manual [^ 3] has the following explanation.
Normally, Bowtie 2 re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the
--seed
option. This means that if two reads are identical (same name, same nucleotides, same qualities) Bowtie 2 will find and report the same alignment(s) for both, even if there was ambiguity. When--non-deterministic
is specified, Bowtie 2 re-initializes its pseudo-random generator for each read using the current time. This means that Bowtie 2 will not necessarily report the same alignment for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.
In summary, bowtie2 initializes a random number generator for each read, and its seed value seems to depend on the following four.
--Lead name
--Base sequence
--Quality array
--- The value of the --seed
option
The point is that leads with the same information, including the __name, are mapped to the same location __.
Below, let's confirm that the mapping result will be different if the read name
, quality array
, and --seed option
are different.
Prepare the following FASTQ based on the multi-mapped base sequence. (Arrangements etc. are created based on the verification example here [^ 2])
--FASTQ with the same read name, base sequence, and quality sequence --FASTQ has the same base sequence, but the name and quality sequence {either or both} are different for each read.
Since it is troublesome to prepare each time, the next generator will be used instead.
gen_testseq.py
#!/usr/bin/env python
import argparse
import random
parser = argparse.ArgumentParser()
parser.add_argument("-r", "--reads", type=int, default=10000)
parser.add_argument("-n", "--name", action="store_true")
parser.add_argument("-q", "--quality", action="store_true")
args = parser.parse_args()
for i in range(args.reads):
print "@test_sequence{}".format(random.random() if args.name else '')
print "CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA"
print '+'
print ''.join([chr(random.randrange(33, 127)) for _ in range(101)]) if args.quality else 'I' * 101
When the array on the second line of print is mapped to hg19, 9 places are multi-hit, and the following 3 places have the highest alignment score and are map candidates.
chrom | position | strand |
---|---|---|
chr1 | 11635 | + |
chr2 | 114359380 | - |
chr15 | 102519535 | - |
bowtie2 used version 2.2.4.
Like this, only the position is taken out from the map result and the number is counted.
$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c
result
10000 reads; of these:
10000 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
10000 (100.00%) aligned >1 times
100.00% overall alignment rate
10000 chr15 102519435
All were mapped in one place.
--seed
option$ bowtie2 --no-hd --seed 17320508 -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
10000 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
10000 (100.00%) aligned >1 times
100.00% overall alignment rate
10000 chr1 11635
The location changes depending on the value, but only one location is mapped.
--non-deterministic
option$ bowtie2 --no-hd --non-deterministic -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
10000 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
10000 (100.00%) aligned >1 times
100.00% overall alignment rate
3302 chr1 11635
3379 chr15 102519435
3319 chr2 114359280
This time it was randomly mapped. Of course, the result will be different each time you run it.
$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py -n) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
10000 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
10000 (100.00%) aligned >1 times
100.00% overall alignment rate
3379 chr1 11635
3294 chr15 102519435
3327 chr2 114359280
This time it was randomly mapped. here,
$ ./gen_testseq.py -n > test.fastq
$ bowtie2 --no-hd -x hg19 -U test.fastq | cut -f 3-4 | sort | uniq -c
If you drop it to a file as described above, you will get the same result at each runtime. Of course, this does not apply if the value of the --seed
option is different or if the --non-deterministic
option is specified.
Randomly mapped even if the quality array is different
$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py -q) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
10000 (100.00%) were unpaired; of these:
0 (0.00%) aligned 0 times
0 (0.00%) aligned exactly 1 time
10000 (100.00%) aligned >1 times
100.00% overall alignment rate
3351 chr1 11635
3312 chr15 102519435
3337 chr2 114359280
--Usually, you can think that the map destination is randomly selected (it is unlikely that the lead name will be duplicated in addition to the array).
--If the same lead is included, it will be mapped randomly by using the --non-deterministic
option.
--However, reproducibility is lost.
--Use the --seed
option if you need results with different multimap destinations and want to ensure reproducibility.
It seems that there was a case where it did not vary well [^ 2], but was it a bug?
[^ 1]: See the -k
option section [^ 3]
Recommended Posts