Create a protein sequence mutation library in pandas

Introduction

Purpose

When investigating protein sequence-function relationships, I wanted an easy way to generate mutant sequences in which an arbitrary amino acid substitution is applied at an arbitrary site of a given protein sequence, so I wrote this. I am also still practicing Python, so I would appreciate hearing about alternative or better approaches.

GitHub https://github.com/kyo46n/Mutant-Library-Generator

Execution environment

Jupyter Notebook, Python 3.7.4 (Anaconda), pandas 0.25.3, Biopython 1.74

Import

The parent protein sequence is assumed to be in a FASTA file. Biopython is used to read and write files.


from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pandas as pd

Function definition

Read the sequence name (id) and sequence (seq) from the FASTA file into a dictionary, convert the seq part into a DataFrame with one residue per column, and insert the sequence name as a name column at the left end. Left as it is, the residue columns would be labeled from 0 and site positions would be off by one from the intuitive numbering, so relabel the columns to start at 1.

#fasta to dataframe
def fd(p):
    # map each record id to its sequence split into single residues
    d = {rec.id : list(str(rec.seq)) for rec in SeqIO.parse(p, "fasta")}
    df = pd.DataFrame(list(d.values()))
    # a single parent sequence is assumed; with several records the last id is used
    name = list(d.keys())[-1]
    df.insert(0, 'name', name)
    # relabel residue columns 0..N-1 as 1..N so labels match site numbers
    df = df.rename(columns={i: i + 1 for i in range(len(df.columns) - 1)})
    return df
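
To make the resulting layout concrete, here is a minimal sketch that builds the same kind of one-row table directly from a string, without reading a file. The helper name seq_to_frame is hypothetical and not part of the original code, and it assumes a single parent sequence.

```python
import pandas as pd

def seq_to_frame(name, seq):
    """Hypothetical helper: one row, a 'name' column, then residues in columns 1..len(seq)."""
    # Label the residue columns 1..N up front so no renaming step is needed
    df = pd.DataFrame([list(seq)], columns=range(1, len(seq) + 1))
    df.insert(0, "name", name)
    return df

df_demo = seq_to_frame("seq1", "MTIKE")
print(df_demo)
```

Because the name column sits at position 0, position site and label site point at the same residue column, which is why the functions below can use iat[0, site] with 1-based site numbers.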

As the basic building blocks, create sdm(), which converts one specific site to a specific amino acid, and ssm(), which converts one specific site to each of the 20 amino acids.

#for calculation of site directed mutagenesis
def sdm(df, site, mut):
    df_mut = df.copy()
    df_mut.iat[0, 0] = df_mut.iat[0,0] + "_" + df_mut.iat[0,site] + str(site) + mut
    df_mut.iat[0,site] = mut
    return df_mut

#for calculation of site saturation mutagenesis
def ssm(df, site):
    aa_list = ['R', 'H', 'K', 'D', 'E', 'S', 'T', 'N', 'Q', 'C', 'G', 'P', 'A', 'V', 'I', 'L', 'M', 'F', 'Y', 'W']
    # row 0 stays as the parent; rows 1-20 become the 20 substitutions
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    df_mut = pd.concat([df.iloc[[0]]] * (len(aa_list) + 1), ignore_index=True)
    for j, aa in enumerate(aa_list, start=1):
        df_mut.iat[j, 0] = df_mut.iat[0,0] + "_" + df_mut.iat[0,site] + str(site) + aa
        df_mut.iat[j,site] = aa
    return df_mut
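
The names built by sdm() and ssm() follow the usual wild-type residue + position + new residue notation (for example T2L). As a pandas-free illustration of the same idea, a plain-string sketch could look like the following. The names point_mutant and saturate are hypothetical, and the residue order simply copies aa_list above.

```python
AA_LIST = "RHKDESTNQCGPAVILMFYW"  # same 20 residues and order as aa_list in ssm()

def point_mutant(name, seq, site, mut):
    """Return (label, sequence) with the 1-based `site` replaced by `mut`."""
    if not 1 <= site <= len(seq):
        raise ValueError(f"site {site} is outside 1..{len(seq)}")
    wt = seq[site - 1]  # convert the 1-based site to a 0-based string index
    label = f"{name}_{wt}{site}{mut}"
    return label, seq[:site - 1] + mut + seq[site:]

def saturate(name, seq, site):
    """Return the 20 (label, sequence) pairs for one site, wild type included."""
    return [point_mutant(name, seq, site, aa) for aa in AA_LIST]

print(point_mutant("seq1", "MTIKE", 2, "L"))   # ('seq1_T2L', 'MLIKE')
print(len(saturate("seq1", "MTIKE", 2)))       # 20
```

As in ssm(), the wild-type residue is among the 20 substitutions, so one of the saturated entries has the same sequence as the parent.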

Create isdm() and issm() so that several mutants can be generated at once. Both functions return one row per mutant, and each mutant carries a mutation at exactly one site.

#individual site directed mutagenesis
def isdm(df, site_list, mut_list):
    mylist = []
    j = 0
    for i in site_list:
        mylist.insert(j, sdm(df, i, mut_list[j]))
        j = j+1
    df = pd.concat(mylist)
    df = df.drop_duplicates(subset='name')
    df.reset_index(drop=True,inplace=True)
    return df

#individual site site saturation mutagenesis
def issm(df, site_list):
    mylist = []
    j = 0
    for i in site_list:
        mylist.insert(j, ssm(df,i))
        j = j+1
    df = pd.concat(mylist)
    df = df.drop_duplicates(subset='name')
    df.reset_index(drop=True,inplace=True)
    return df
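
isdm() pairs site_list and mut_list by position, so the two lists need the same length and every site must exist in the sequence. The original code does not check this. A small pre-check, sketched here under the hypothetical name check_mutations, is one way to catch mismatches before building the library.

```python
VALID_AA = set("RHKDESTNQCGPAVILMFYW")  # the 20 residues used in ssm()

def check_mutations(seq_len, site_list, mut_list):
    """Hypothetical pre-check for paired site/mutation lists (1-based sites)."""
    sites, muts = list(site_list), list(mut_list)
    if len(sites) != len(muts):
        raise ValueError(f"{len(sites)} sites but {len(muts)} mutations")
    for site, mut in zip(sites, muts):
        if not 1 <= site <= seq_len:
            raise ValueError(f"site {site} is outside 1..{seq_len}")
        if mut not in VALID_AA:
            raise ValueError(f"unknown amino acid {mut!r}")
    return list(zip(sites, muts))

print(check_mutations(5, [2, 4, 5], ["L", "A", "R"]))
```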

Create ssdm() to generate a single sequence that carries mutations at multiple sites simultaneously.

#simultaneous site directed mutagenesis
def ssdm(df, site_list, mut_list):
    j = 0
    for i in site_list:
        df = sdm(df, i, mut_list[j])
        j = j+1
    return df

Create sssm() to generate all combinations of saturation mutations at multiple sites.

#simultaneous site saturation mutagenesis
def sssm(df, site_list):
    for x in range(len(site_list)):
        df_mut = df.copy()
        templist = []
        j = 0
        for i in range(len(df_mut)):
            dftemp = ssm(df_mut[i:i+1], site_list[x])
            templist.insert(j, dftemp)
            j = j + 1
        df = pd.concat(templist)
        df = df.drop_duplicates(subset='name')
        df.reset_index(drop=True,inplace=True)
    return df
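
For k distinct sites, the combinatorial library contains 20 ** k unique sequences, so it grows quickly. As an alternative way to see what sssm() enumerates, the sketch below uses itertools.product on plain strings. The function name combinatorial_library is hypothetical, and unlike sssm() it labels only real substitutions (no T2T-style entries), so no duplicate removal is needed afterwards.

```python
from itertools import product

AA_LIST = "RHKDESTNQCGPAVILMFYW"  # the 20 residues used in ssm()

def combinatorial_library(name, seq, site_list):
    """Hypothetical sketch: every residue combination at the given distinct 1-based sites."""
    sites = list(site_list)
    library = {}
    for combo in product(AA_LIST, repeat=len(sites)):
        residues = list(seq)
        labels = []
        for site, aa in zip(sites, combo):
            wt = seq[site - 1]
            if aa != wt:  # label only real substitutions
                labels.append(f"{wt}{site}{aa}")
            residues[site - 1] = aa
        library["_".join([name] + labels)] = "".join(residues)
    return library

lib = combinatorial_library("seq1", "MTIKE", [2, 3])
print(len(lib))               # 20 ** 2 = 400
print(lib["seq1_T2L_I3V"])    # MLVKE
```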

Execution example

input.txt (FASTA format, id: seq1, seq: MTIKE) is prepared as a sample parent sequence. Read it in.

#read fasta
file = "input.txt"
df_wt = fd(file)
df_wt

To generate single-site mutants at several sites, one mutant per site ↓

site_list = [2,4,5]
mut_list = ["L","A","R"]
df_isdm = isdm(df_wt, site_list, mut_list)
df_isdm

To generate saturation mutants at several sites, one site at a time ↓ (for example, saturation mutants at positions 2 to 5)

site_list = range(2,6)
df_issm = issm(df_wt, site_list)
df_issm

To generate one sequence that carries mutations at several sites ↓

site_list = [2,4,5]
mut_list = ["L","A","R"]
df_ssdm = ssdm(df_wt, site_list, mut_list)
df_ssdm

To generate every combination of saturation mutants at several sites ↓

site_list = [2,3]
df_sssm = sssm(df_wt, site_list)
df_sssm

Sequence joining and duplicate removal

With one residue per column the data is awkward to use in research, so join the columns into a single sequence string and drop duplicate sequences.

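df = df_sssm.copy()
# placeholder column right after 'name'; it is filled with the joined sequence below
df.insert(1, 'sequence', "-")
# concatenate residue column 1 with residue columns 2..N
df.loc[:,'sequence'] = df[1].str.cat(df[list(range(2,len(df.columns)-1))])
# keep only 'name' and 'sequence', then drop rows whose sequence already appeared
df_seq = df.iloc[:,0:2]
df_seq = df_seq.drop_duplicates(subset='sequence')
df_seq.reset_index(drop=True,inplace=True)
df_seq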
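
As one possible alternative to the placeholder column and str.cat, the residue columns can be joined row by row with apply. The sketch below uses a small hand-made table in the same layout (name plus residue columns 1 to 5). The names df_demo and df_joined are only for illustration.

```python
import pandas as pd

# Toy table in the same layout as df_sssm: 'name' plus residue columns 1..5
df_demo = pd.DataFrame(
    [["seq1", "M", "T", "I", "K", "E"],
     ["seq1_T2T", "M", "T", "I", "K", "E"],   # same sequence as the parent
     ["seq1_T2L", "M", "L", "I", "K", "E"]],
    columns=["name", 1, 2, 3, 4, 5],
)

# Join every residue column (everything after 'name') row by row
sequence = df_demo.iloc[:, 1:].apply("".join, axis=1)
df_joined = pd.DataFrame({"name": df_demo["name"], "sequence": sequence})

# Keep the first row for each distinct sequence
df_joined = df_joined.drop_duplicates(subset="sequence").reset_index(drop=True)
print(df_joined)
```

The T2T row disappears because its sequence is identical to the parent, which is the same effect drop_duplicates(subset='sequence') has in the code above.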

Export

Export the result to a CSV file and to a FASTA file.

df_seq.to_csv("output.csv")

with open("output.txt", "w") as handle: 
    for i in range(len(df_seq.index)):
        seq = Seq(df_seq.iloc[i,1])
        rec = SeqRecord(seq, description="")
        rec.id = df_seq.iloc[i,0]
        SeqIO.write(rec, handle, "fasta")
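
FASTA is a simple text format: a header line starting with ">" followed by the sequence, so the same file can also be produced without Biopython. The helper to_fasta below is a hypothetical sketch, and the 60-character line width is an assumed convention rather than a requirement.

```python
def to_fasta(records, width=60):
    """Hypothetical helper: format (name, sequence) pairs as FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # split long sequences into chunks of at most `width` characters
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

text = to_fasta([("seq1", "MTIKE"), ("seq1_T2L", "MLIKE")])
print(text)
```

With a DataFrame like df_seq, the pairs could be taken from zip(df_seq["name"], df_seq["sequence"]) and the returned text written to a file with a plain open(..., "w").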

Conclusion

I feel there is a lot of inefficiency in how the functions and the for loops are written, but for now I get the output I want.

Future tasks

- Mixing single and saturation mutations (run saturation mutagenesis when the mutation is given as "X")
- Handling multiple parent sequences
- Handling very large sequences (shorter execution time)
- Code refinement