While investigating protein sequence-function relationships, I wanted an easy way to generate mutant sequences in which an arbitrary amino acid substitution is applied at an arbitrary site of a given protein sequence, so I wrote this tool. I am also using it to practice Python, so I would appreciate hearing about alternative or better approaches.
GitHub https://github.com/kyo46n/Mutant-Library-Generator
Environment: Jupyter Notebook, Python 3.7.4 (Anaconda), pandas 0.25.3, Biopython 1.74
This assumes the parent protein sequence is stored in a FASTA file. Biopython is used to read and write the files.
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pandas as pd
Read the sequence name (id) and sequence (seq) from the FASTA file into a dictionary, convert the seq part into a DataFrame with one residue (a one-character string) per column, and insert the sequence name as a 'name' column at the left end. Left as is, the residue columns would be numbered from 0, so site positions would be off by one from the usual numbering; the column index is therefore corrected to start at 1.
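# FASTA to DataFrame (expects a single parent sequence in the file)
def fd(p):
    # Map each record id to its sequence split into single characters
    d = {rec.id: list(str(rec.seq)) for rec in SeqIO.parse(p, "fasta")}
    df = pd.DataFrame(list(d.values()))
    # With one record in the file, name ends up as that record's id
    for i in d.keys():
        name = i
    df.insert(0, 'name', name)
    # Renumber the residue columns from 0..n-1 to 1..n; 'name' is unaffected
    plus_one = {i: i + 1 for i in range(len(df.columns))}
    df = df.rename(columns=plus_one)
    return df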
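The 1-based numbering can also be illustrated without pandas. The following is a minimal plain-Python sketch of the same idea, not the code used in this article: read_fasta() and residue_at() are hypothetical helper names, and the parser assumes simple FASTA text in which each record starts with a ">" header line.

```python
import io

def read_fasta(handle):
    """Parse FASTA text into {id: sequence}; the id is the first word of each header."""
    records = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped, so collect them and join later
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

def residue_at(seq, site):
    """Return the residue at a 1-based site (site 1 is the first residue)."""
    if not 1 <= site <= len(seq):
        raise ValueError(f"site {site} is outside 1..{len(seq)}")
    return seq[site - 1]

records = read_fasta(io.StringIO(">seq1 sample parent\nMTI\nKE\n"))
print(records)                         # {'seq1': 'MTIKE'}
print(residue_at(records["seq1"], 2))  # T
```

Because the result is a dictionary keyed by id, this sketch also keeps several parent sequences apart if the file contains more than one record.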
As basic building blocks, create sdm(), which converts one specific site to a specific amino acid, and ssm(), which converts one specific site to each of the 20 amino acids.
# Site-directed mutagenesis: replace the residue at one site with one amino acid
def sdm(df, site, mut):
    df_mut = df.copy()
    # Name the mutant <parent>_<wild type><site><mutant>, e.g. seq1_T2L
    df_mut.iat[0, 0] = df_mut.iat[0, 0] + "_" + df_mut.iat[0, site] + str(site) + mut
    df_mut.iat[0, site] = mut
    return df_mut
# Site-saturation mutagenesis: replace the residue at one site with each of the 20 amino acids
def ssm(df, site):
    aa_list = ['R', 'H', 'K', 'D', 'E', 'S', 'T', 'N', 'Q', 'C', 'G', 'P', 'A', 'V', 'I', 'L', 'M', 'F', 'Y', 'W']
    # Row 0 keeps the parent; rows 1-20 are copies that are mutated below.
    # pd.concat is used here because DataFrame.append was removed in pandas 2.0.
    df_mut = pd.concat([df.iloc[[0]]] * (len(aa_list) + 1), ignore_index=True)
    for j, aa in enumerate(aa_list, start=1):
        df_mut.iat[j, 0] = df_mut.iat[0, 0] + "_" + df_mut.iat[0, site] + str(site) + aa
        df_mut.iat[j, site] = aa
    return df_mut
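For comparison, the same naming convention and substitution can be written on plain strings. This is only a sketch under my own assumptions, not part of the repository: point_mutant() and saturation_mutants() are hypothetical names, and unlike ssm() the saturation helper skips the substitution that would reproduce the wild-type residue.

```python
AMINO_ACIDS = "RHKDESTNQCGPAVILMFYW"  # the same 20 residues as aa_list in ssm()

def point_mutant(name, seq, site, mut):
    """Return (mutant name, mutant sequence) for one substitution at a 1-based site."""
    wild_type = seq[site - 1]
    new_name = f"{name}_{wild_type}{site}{mut}"
    new_seq = seq[:site - 1] + mut + seq[site:]
    return new_name, new_seq

def saturation_mutants(name, seq, site):
    """Return the 19 substitutions at one site that differ from the wild type."""
    return [point_mutant(name, seq, site, aa)
            for aa in AMINO_ACIDS if aa != seq[site - 1]]

print(point_mutant("seq1", "MTIKE", 2, "L"))       # ('seq1_T2L', 'MLIKE')
print(len(saturation_mutants("seq1", "MTIKE", 2)))  # 19
```

Working on strings avoids copying a DataFrame for every mutant, which may matter for long sequences, at the cost of losing the one-residue-per-column view.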
Create isdm() and issm() so that multiple mutants can be generated in one call. Both functions return one row per mutant, and each mutant carries a mutation at only one site.
# Individual site-directed mutagenesis: one single-point mutant per (site, mutation) pair
def isdm(df, site_list, mut_list):
    mylist = [sdm(df, site, mut) for site, mut in zip(site_list, mut_list)]
    df = pd.concat(mylist)
    df = df.drop_duplicates(subset='name')
    df.reset_index(drop=True, inplace=True)
    return df
# Individual site-saturation mutagenesis: saturate each site separately
def issm(df, site_list):
    mylist = [ssm(df, site) for site in site_list]
    df = pd.concat(mylist)
    # Each ssm() result repeats the parent row, so keep it only once
    df = df.drop_duplicates(subset='name')
    df.reset_index(drop=True, inplace=True)
    return df
Create ssdm() to generate a single sequence that carries mutations at multiple sites at the same time.
# Simultaneous site-directed mutagenesis: apply all mutations to one sequence
def ssdm(df, site_list, mut_list):
    for site, mut in zip(site_list, mut_list):
        df = sdm(df, site, mut)
    return df
Create sssm() to generate all combinations of saturation mutations at multiple sites.
# Simultaneous site-saturation mutagenesis: all combinations across the given sites
def sssm(df, site_list):
    for site in site_list:
        # Saturate the current site in every sequence generated so far
        templist = [ssm(df.iloc[i:i+1], site) for i in range(len(df))]
        df = pd.concat(templist)
        df = df.drop_duplicates(subset='name')
        df.reset_index(drop=True, inplace=True)
    return df
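The combinatorial expansion can also be expressed with itertools.product from the standard library. The sketch below is an alternative I am assuming for illustration, not the method used above: combinatorial_library() is a hypothetical name, and it works on plain strings and returns sequences only, without mutant names.

```python
from itertools import product

AMINO_ACIDS = "RHKDESTNQCGPAVILMFYW"  # the 20 standard residues

def combinatorial_library(seq, sites):
    """Yield every sequence with any of the 20 residues at each 1-based site."""
    for combo in product(AMINO_ACIDS, repeat=len(sites)):
        residues = list(seq)
        for site, aa in zip(sites, combo):
            residues[site - 1] = aa
        yield "".join(residues)

library = list(combinatorial_library("MTIKE", [2, 3]))
print(len(library))        # 400
print("MTIKE" in library)  # True
```

The library size is 20 to the power of the number of sites, so two sites give 400 sequences and four sites already give 160,000. Because the function is a generator, sequences can be written out one at a time instead of being held in memory.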
input.txt (FASTA format, id: seq1, seq: MTIKE) is prepared as a sample parent sequence. Read it in.
#read fasta
file = "input.txt"
df_wt = fd(file)
df_wt
To generate single-point mutants at several sites, one mutant per site ↓
site_list = [2,4,5]
mut_list = ["L","A","R"]
df_isdm = isdm(df_wt, site_list, mut_list)
df_isdm
To generate saturation mutants at several sites, one site at a time ↓ (for example, saturation mutants at sites 2 to 5)
site_list = range(2,6)
df_issm = issm(df_wt, site_list)
df_issm
To generate one sequence that carries mutations at several sites ↓
site_list = [2,4,5]
mut_list = ["L","A","R"]
df_ssdm = ssdm(df_wt, site_list, mut_list)
df_ssdm
To generate all combinations of saturation mutations at several sites ↓
site_list = [2,3]
df_sssm = sssm(df_wt, site_list)
df_sssm
One residue per column is hard to use in research, so join the columns into a single sequence string and remove duplicate sequences.
df = df_sssm.copy()
df.insert(1, 'sequence', "-")
df.loc[:,'sequence'] = df[1].str.cat(df[range(2,len(df.columns)-1)])
df_seq = df.iloc[:,0:2]
df_seq = df_seq.drop_duplicates(subset='sequence')
df_seq.reset_index(drop=True,inplace=True)
df_seq
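Duplicates arise because saturation also produces substitutions that restore the wild-type residue (for example T2T), whose sequence is identical to the parent. The following plain-Python sketch shows the same keep-the-first-occurrence behavior on a list of (name, sequence) pairs; unique_by_sequence() is a hypothetical helper, not part of the code above.

```python
def unique_by_sequence(named_seqs):
    """Keep the first (name, sequence) pair seen for each distinct sequence."""
    first_name = {}
    for name, seq in named_seqs:
        # setdefault stores the name only if this sequence has not been seen yet
        first_name.setdefault(seq, name)
    return [(name, seq) for seq, name in first_name.items()]

pairs = [("seq1", "MTIKE"), ("seq1_T2T", "MTIKE"), ("seq1_T2L", "MLIKE")]
print(unique_by_sequence(pairs))
# [('seq1', 'MTIKE'), ('seq1_T2L', 'MLIKE')]
```

Dictionaries preserve insertion order in Python 3.7 and later, so the output keeps the order in which the sequences first appeared.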
Export to csv and fasta respectively.
df_seq.to_csv("output.csv")
with open("output.txt", "w") as handle:
    for i in range(len(df_seq.index)):
        seq = Seq(df_seq.iloc[i, 1])
        rec = SeqRecord(seq, description="")
        rec.id = df_seq.iloc[i, 0]
        SeqIO.write(rec, handle, "fasta")
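If Biopython is not available, FASTA output is simple enough to write by hand. This is a minimal sketch under my own assumptions: write_fasta() is a hypothetical helper, and the line width of 60 is just a value chosen here for wrapping long sequences.

```python
import io

def write_fasta(named_seqs, handle, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping sequences at `width` characters."""
    for name, seq in named_seqs:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")

buffer = io.StringIO()
write_fasta([("seq1", "MTIKE"), ("seq1_T2L", "MLIKE")], buffer)
print(buffer.getvalue(), end="")
# >seq1
# MTIKE
# >seq1_T2L
# MLIKE
```

Passing an open file object instead of the StringIO buffer writes the same text to disk.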
I feel there is a lot of waste in how I wrote the functions and the for loops, but for the time being I can get the output I want.
Future tasks:
- Mixing single mutations and saturation mutations (if the mutation is "X", perform saturation mutagenesis)
- Processing multiple parent sequences
- Processing huge sequences (shortening execution time)
- Code refinement
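As a possible starting point for the mixed single/saturation item, here is a sketch on plain strings. It is my own assumption about how "X" could be handled, not something implemented in the repository: mixed_library() is a hypothetical name, and it treats the mutations as simultaneous, so every output sequence carries a change at all listed sites.

```python
from itertools import product

AMINO_ACIDS = "RHKDESTNQCGPAVILMFYW"  # the 20 standard residues

def mixed_library(seq, sites, muts):
    """Yield sequences where 'X' at a site means all 20 residues, otherwise a fixed residue."""
    if len(sites) != len(muts):
        raise ValueError("sites and muts must have the same length")
    # One set of choices per site: 20 residues for "X", a single residue otherwise
    choices = [AMINO_ACIDS if mut == "X" else mut for mut in muts]
    for combo in product(*choices):
        residues = list(seq)
        for site, aa in zip(sites, combo):
            residues[site - 1] = aa
        yield "".join(residues)

library = list(mixed_library("MTIKE", [2, 4], ["L", "X"]))
print(len(library))  # 20
print(library[0])    # MLIRE
```

Site 2 is fixed to L while site 4 is saturated, so the example yields 20 sequences.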