Introduction

Let's write a Python function that converts a DNA sequence into its complementary strand. We'll implement it four ways and compare which one is fastest.

How to make a complementary strand

Convert A to T, C to G, G to C, and T to A, then reverse the order. The complementary sequence of ACGTTTTT is AAAAACGT.
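The two steps above can be sketched as follows. This is a minimal illustration that handles only the four uppercase bases; the names BASIC_COMPLEMENT and reverse_complement_basic are made up for this example and are not used elsewhere in the article.

```python
# Minimal sketch: only the four plain bases, uppercase.
# BASIC_COMPLEMENT and reverse_complement_basic are illustrative names.
BASIC_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement_basic(dna):
    # Step 1: swap each base for its partner (A<->T, C<->G).
    complemented = [BASIC_COMPLEMENT[base] for base in dna]
    # Step 2: reverse the order.
    complemented.reverse()
    return "".join(complemented)


print(reverse_complement_basic("ACGTTTTT"))  # AAAAACGT
```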

There are letters other than ACGT

R stands for A or G. Y stands for C or T. D is anything other than C (that is, A, G, or T). Sequences may also contain both uppercase and lowercase letters.

A DNA sequence can be long, for example 10 Mb (10 million bases).

Dictionary

Create a dictionary that specifies the replacement for each letter.

compDic = {
    "R": "Y", "M": "K", "W": "W", "S": "S", "Y": "R", "K": "M",
    "H": "D", "B": "V", "D": "H", "V": "B", "N": "N",
    "A": "T", "C": "G", "G": "C", "T": "A",
    "r": "y", "m": "k", "w": "w", "s": "s", "y": "r", "k": "m",
    "h": "d", "b": "v", "d": "h", "v": "b", "n": "n",
    "a": "t", "c": "g", "g": "c", "t": "a",
}
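A table like this is easy to mistype, so one possible safeguard is to generate it from base pairs and check that complementing twice gives back the original letter. The sketch below is only an illustration. PAIRS, build_complement_table, and comp_table are hypothetical names, and the result is intended to hold the same 30 entries as compDic.

```python
# Hypothetical helper: build the table from two-letter pairs
# instead of typing every entry by hand.
PAIRS = "AT CG RY MK HD BV WW SS NN"


def build_complement_table(pairs=PAIRS):
    table = {}
    for pair in pairs.split():
        x, y = pair[0], pair[1]
        # Register both directions, in uppercase and lowercase.
        for a, b in ((x, y), (y, x)):
            table[a] = b
            table[a.lower()] = b.lower()
    return table


comp_table = build_complement_table()

# Sanity check: the complement of the complement is the original letter.
assert all(comp_table[comp_table[k]] == k for k in comp_table)
print(len(comp_table))  # 30
```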

Trying various implementations

First way: Make a list of empty strings with the same length as the DNA sequence to hold the complementary sequence. Read the DNA sequence from the beginning and store each complementary base in the list, starting from the last position and moving toward the front. Finally, concatenate with join().

def comp1(dna):
    l = len(dna)
    c = ["" for num in range(l)]
    index = l-1
    for i in dna:
        c[index] = compDic[i]
        index -= 1
    return ''.join(c)

Second way: Make an empty list. Read the DNA sequence one base at a time from the front and add each complementary base to the front of the list with insert().

def comp2(dna):
    c = []
    for i in dna:
        # insert at position 0 so the result ends up reversed
        c.insert(0, compDic[i])
    return ''.join(c)

Third way: Make an empty list. Read the DNA sequence one base at a time from the back and add each complementary base to the end of the list with append().

def comp3(dna):
    l = len(dna)
    c = []
    for i in range(l):
        c.append(compDic[dna[l-i-1]])
    return ''.join(c)

Fourth way: Make an empty string. Read the DNA sequence one base at a time from the back and add each complementary base to the end of the string.

def comp4(dna):
    l = len(dna)
    s = ""  # avoid naming this "str", which would shadow the built-in type
    for i in range(l):
        s += compDic[dna[l-i-1]]
    return s

Measuring the time it takes

I read a DNA sequence of about 5 million bases from a file and converted it to its complementary strand with each of the four functions above. The timing code is borrowed from @fantm21. Thank you.

import time

# sequence is the DNA string read from the file beforehand.
start = time.time()
comp1(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")

start = time.time()
comp2(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")

start = time.time()
comp3(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")

start = time.time()
comp4(sequence)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")
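For readers without the input file, the same kind of measurement can be tried on a randomly generated sequence. The sketch below is an assumption-laden stand-in rather than the setup used for the results in this article: make_random_dna, time_call, and comp_append are hypothetical helpers, the sequence contains only A, C, G, and T, and time.perf_counter() is used in place of time.time() because it is intended for measuring short intervals.

```python
import random
import time


def make_random_dna(length, seed=0):
    # Hypothetical helper: random test data instead of a real genome file.
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def time_call(func, arg, repeats=3):
    # Run func(arg) several times and keep the best time
    # to reduce noise from other processes.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


def comp_append(dna):
    # Stand-in for the functions above: four bases only, read from the back.
    table = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join([table[base] for base in reversed(dna)])


sequence = make_random_dna(100_000)
print("elapsed_time:{0:.4f}[sec]".format(time_call(comp_append, sequence)))
```

Any of the functions in this article could be passed to time_call in place of comp_append, as long as the sequence only uses letters present in its lookup table.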

Result

The results were as follows. comp2 did not finish.

elapsed_time:1.2188289165496826[sec]  # comp1()
elapsed_time:1.3529019355773926[sec]  # comp3()
elapsed_time:1.5209426879882812[sec]  # comp4()
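A likely reason comp2 did not finish is that list.insert(0, x) has to shift every element already in the list, so the total work grows roughly with the square of the sequence length. If prepending is the goal, collections.deque offers appendleft(), which does not shift the existing elements. The sketch below is an untimed illustration of that idea; comp2_deque and COMPLEMENT are hypothetical names, and the table covers only the four uppercase bases.

```python
from collections import deque

# Illustrative four-base table; the full dictionary could be passed instead.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def comp2_deque(dna, table=COMPLEMENT):
    c = deque()
    for base in dna:
        # appendleft adds to the front without moving existing items
        c.appendleft(table[base])
    return "".join(c)


print(comp2_deque("ACGTTTTT"))  # AAAAACGT
```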

In closing

Creating a list of the required size in advance and filling it with the complementary bases appeared to be the fastest of the approaches that finished, although the three times were fairly close. This speed is probably good enough for now. If anyone knows a faster way, I would appreciate it if you could teach me.
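One commonly suggested alternative, which was not measured in this article, is to let the built-in str.translate() do the replacement and reverse the result with a slice, so that no Python-level loop is needed. The sketch below assumes the same 30 letters as the dictionary above; COMP_TABLE and comp_translate are hypothetical names. One behavioral difference to be aware of: letters missing from the table are passed through unchanged instead of raising KeyError.

```python
# Translation table covering the same letters as the dictionary above.
# Characters not listed here are left unchanged by translate().
COMP_TABLE = str.maketrans(
    "ACGTRYMKHDBVWSNacgtrymkhdbvwsn",
    "TGCAYRKMDHVBWSNtgcayrkmdhvbwsn",
)


def comp_translate(dna):
    # Replace every letter in one call, then reverse with a slice.
    return dna.translate(COMP_TABLE)[::-1]


print(comp_translate("ACGTTTTT"))  # AAAAACGT
```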

Python code that makes DNA sequences into complementary strands: which method is faster?