Let's write a Python function that converts a DNA sequence to its complementary strand (the reverse complement). We will implement it four ways and compare which one is fastest.
Convert A to T, C to G, G to C, and T to A, then reverse the order. For example, the complementary sequence of ACGTTTTT is AAAAACGT.
A sequence may also contain ambiguity codes: R stands for A or G, Y stands for C or T, D is anything other than C (that is, A, G, or T), and so on. Both uppercase and lowercase letters can appear.
The length of the DNA sequence is, for example, 10 Mb (10 million bases).
First, create a dictionary that maps each base code to its complement.
compDic = {
    "R": "Y", "M": "K", "W": "W", "S": "S", "Y": "R", "K": "M",
    "H": "D", "B": "V", "D": "H", "V": "B", "N": "N",
    "A": "T", "C": "G", "G": "C", "T": "A",
    "r": "y", "m": "k", "w": "w", "s": "s", "y": "r", "k": "m",
    "h": "d", "b": "v", "d": "h", "v": "b", "n": "n",
    "a": "t", "c": "g", "g": "c", "t": "a",
}
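As a quick sanity check of the rule described above, here is a minimal sketch that applies it to the example sequence. The names COMP and reverse_complement are hypothetical helpers introduced only for this check, and the table covers uppercase codes only.

```python
# Hypothetical uppercase-only table, introduced just for this check.
COMP = {"A": "T", "C": "G", "G": "C", "T": "A",
        "R": "Y", "Y": "R", "M": "K", "K": "M",
        "W": "W", "S": "S", "H": "D", "D": "H",
        "B": "V", "V": "B", "N": "N"}

def reverse_complement(dna):
    """Complement each base while walking the sequence backwards."""
    return "".join(COMP[base] for base in reversed(dna))

# Complementing a code twice must give back the original code.
assert all(COMP[COMP[base]] == base for base in COMP)

print(reverse_complement("ACGTTTTT"))  # AAAAACGT
```

Applying the function twice should return the original sequence, which is a handy test for any of the implementations that follow.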
First way: preallocate a list of empty strings with the same length as the DNA sequence, read the sequence from the beginning, and write each complementary base into the list starting from the end. Finally, concatenate with join().
def comp1(dna):
    l = len(dna)
    # Preallocate one slot per base
    c = ["" for num in range(l)]
    # Fill the list from the last position backwards
    index = l-1
    for i in dna:
        c[index] = compDic[i]
        index -= 1
    return ''.join(c)
Second way: start with an empty list. Read the DNA sequence one base at a time from the front and add each complementary base to the front of the list with insert().
def comp2(dna):
    c = []
    for i in dna:
        # Inserting at index 0 puts the newest base first
        c.insert(0, compDic[i])
    return ''.join(c)
Third way: start with an empty list. Read the DNA sequence one base at a time from the back and add each complementary base to the end of the list with append().
def comp3(dna):
    l = len(dna)
    c = []
    for i in range(l):
        # dna[l-i-1] walks the sequence from the last base to the first
        c.append(compDic[dna[l-i-1]])
    return ''.join(c)
Fourth way: start with an empty string. Read the DNA sequence one base at a time from the back and append each complementary base to the string.
def comp4(dna):
    l = len(dna)
    # Named s rather than str to avoid shadowing the built-in str type
    s = ""
    for i in range(l):
        s += compDic[dna[l-i-1]]
    return s
I read a DNA sequence of about 5 million bases from a file and computed its complementary strand with each of the four functions above. The timing code is borrowed from @fantm21, with thanks.
import time

# sequence holds the DNA string read from the file

# Time comp1
start = time.time()
comp1(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")

# Time comp2
start = time.time()
comp2(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")

# Time comp3
start = time.time()
comp3(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")

# Time comp4
start = time.time()
comp4(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")
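The timing blocks above repeat the same pattern four times. As a side note, that pattern could be factored into a small helper. This is only a sketch: time_call and reverse_join are hypothetical names, the sequence is placeholder data rather than the file used in this post, and time.perf_counter() is swapped in for time.time() because it is the clock the standard library provides for measuring short intervals.

```python
import time

def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; in this post it would be comp1, comp3, or comp4.
def reverse_join(text):
    return "".join(reversed(text))

sequence = "ACGT" * 250_000  # 1 million bases of placeholder data
elapsed = time_call(reverse_join, sequence)
print("elapsed_time:{0:.4f}[sec]".format(elapsed))
```

Taking the best of a few runs reduces noise from other processes, at the cost of running each function several times.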
The results were as follows. comp2 did not finish in a reasonable time, so it has no entry.
elapsed_time:1.2188289165496826[sec]  # comp1()
elapsed_time:1.3529019355773926[sec]  # comp3()
elapsed_time:1.5209426879882812[sec]  # comp4()
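The reason comp2 is so slow is that list.insert(0, x) has to shift every element already in the list, so the total work grows with the square of the sequence length. One possible fix, sketched here and not timed in this post, is collections.deque, whose appendleft() adds to the front in constant time. The names comp2_deque and comp_table are hypothetical, and the table is cut down to four bases for illustration.

```python
from collections import deque

# Small table for illustration; compDic in this post covers all codes.
comp_table = {"A": "T", "C": "G", "G": "C", "T": "A"}

def comp2_deque(dna, table=comp_table):
    """Same idea as comp2, but prepend to a deque instead of a list."""
    c = deque()
    for base in dna:
        # appendleft is O(1); list.insert(0, ...) is O(len(list))
        c.appendleft(table[base])
    return "".join(c)

print(comp2_deque("ACGTTTTT"))  # AAAAACGT
```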
Preallocating a list of the required size and writing the complementary bases into it (comp1) was the fastest at about 1.22 seconds, although comp3 (1.35 seconds) and comp4 (1.52 seconds) were not far behind. These times are acceptable for now. If anyone knows a faster way, I would appreciate it if you could tell me.
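One candidate worth trying is the built-in str.translate() combined with slicing, which moves the per-base loop out of Python code. The sketch below has not been timed against the four functions above, so treat any speedup as an expectation rather than a result. COMP_TABLE and comp_translate are hypothetical names, and the table is built from the same pairs as compDic.

```python
# Each code in _CODES maps to the character at the same position in _COMPS.
_CODES = "ACGTRYMKWSHDBVN"
_COMPS = "TGCAYRKMWSDHVBN"

# Build the translation table once, covering upper and lower case.
COMP_TABLE = str.maketrans(_CODES + _CODES.lower(),
                           _COMPS + _COMPS.lower())

def comp_translate(dna):
    """Complement every base with translate(), then reverse by slicing."""
    return dna.translate(COMP_TABLE)[::-1]

print(comp_translate("ACGTTTTT"))  # AAAAACGT
```

One behavioral difference to keep in mind: translate() leaves characters that are missing from the table unchanged, whereas a lookup in compDic raises KeyError for them.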