Create a protein sequence mutation library in pandas

Introduction

Purpose

When investigating the relationship between protein sequence and function, I wanted an easy way to generate mutant sequences in which arbitrary amino acid mutations are introduced at arbitrary sites of a given protein sequence, so I wrote this tool. I am also still practicing Python, so I would appreciate hearing about other or better approaches.

GitHub https://github.com/kyo46n/Mutant-Library-Generator

Execution environment

- Jupyter Notebook
- Python 3.7.4 (Anaconda)
- pandas 0.25.3
- Biopython 1.74

Contents

Import

The parent protein sequence is assumed to be stored in a FASTA file. Biopython is used to read and write the files.


from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pandas as pd

Function definition

Read the sequence name (id) and the sequence (seq) from the FASTA file into a dictionary, convert the seq part into a DataFrame with one residue per column, and insert the sequence name as a name column at the left end. Left as is, the site columns would start at 0 and the site positions would not match the usual residue numbering, so the column index is shifted to start at 1.

#fasta to dataframe
def fd(p):
    # map each record id to its sequence, split into single residues
    d = {rec.id : list(str(rec.seq)) for rec in SeqIO.parse(p, "fasta")}
    df = pd.DataFrame(list(d.values()))
    # number the site columns from 1 instead of 0
    df.columns = range(1, len(df.columns) + 1)
    # insert the sequence name as the leftmost column
    df.insert(0, 'name', list(d.keys()))
    return df
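
To make the resulting layout concrete, here is a minimal sketch that builds the same kind of table directly from a string, without a FASTA file. The helper name seq_to_df is hypothetical and is not part of the repository; the sample sequence MTIKE is the one used in the execution example further down.

```python
import pandas as pd

def seq_to_df(name, seq):
    # hypothetical helper: one row per sequence, one column per residue
    df = pd.DataFrame([list(seq)])
    # number the site columns from 1 so that they match residue positions
    df.columns = range(1, len(seq) + 1)
    # the sequence name goes in the leftmost column
    df.insert(0, "name", name)
    return df

df_demo = seq_to_df("seq1", "MTIKE")
print(df_demo)
print(df_demo.iat[0, 2])  # T
```

Because the name column occupies position 0, the positional index used by iat and the 1-based site number coincide, which is what the functions below rely on.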

As basic building blocks, define sdm(), which converts one specific site to a specific amino acid, and ssm(), which converts one specific site to each of the 20 amino acids.

#for calculation of site directed mutagenesis
def sdm(df, site, mut):
    df_mut = df.copy()
    df_mut.iat[0, 0] = df_mut.iat[0,0] + "_" + df_mut.iat[0,site] + str(site) + mut
    df_mut.iat[0,site] = mut
    return df_mut

#for calculation of site saturation mutagenesis
def ssm(df, site):
    aa_list = ['R', 'H', 'K', 'D', 'E', 'S', 'T', 'N', 'Q', 'C', 'G', 'P', 'A', 'V', 'I', 'L', 'M', 'F', 'Y', 'W']
    # row 0 keeps the parent sequence; rows 1-20 become one mutant per amino acid
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    df_mut = pd.concat([df.iloc[[0]]] * (len(aa_list) + 1), ignore_index=True)
    for j, aa in enumerate(aa_list, start=1):
        df_mut.iat[j, 0] = df_mut.iat[0,0] + "_" + df_mut.iat[0,site] + str(site) + aa
        df_mut.iat[j, site] = aa
    return df_mut
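
The names built by sdm() and ssm() follow the pattern wild-type residue, 1-based site, mutant residue (for example T2L). The sketch below reproduces that naming on a plain string so the indexing is easy to check. mutation_label and mutate are hypothetical helpers, not part of the code above.

```python
def mutation_label(seq, site, mut):
    # site is 1-based, so the wild-type residue sits at index site - 1
    return seq[site - 1] + str(site) + mut

def mutate(seq, site, mut):
    # replace exactly one residue and leave the rest untouched
    return seq[:site - 1] + mut + seq[site:]

parent = "MTIKE"
print(mutation_label(parent, 2, "L"))  # T2L
print(mutate(parent, 2, "L"))          # MLIKE
```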

Create isdm() and issm() so that multiple mutants can be generated at once. These two functions generate sequences that each carry a mutation at a single site, one sequence per row.

#individual site directed mutagenesis
def isdm(df, site_list, mut_list):
    # one single-site mutant per (site, mutation) pair
    mylist = [sdm(df, site, mut) for site, mut in zip(site_list, mut_list)]
    df = pd.concat(mylist)
    df = df.drop_duplicates(subset='name')
    df.reset_index(drop=True,inplace=True)
    return df

#individual site saturation mutagenesis
def issm(df, site_list):
    # saturate each site separately, starting from the parent each time
    mylist = [ssm(df, site) for site in site_list]
    df = pd.concat(mylist)
    df = df.drop_duplicates(subset='name')
    df.reset_index(drop=True,inplace=True)
    return df
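
As a rough size check: ssm() returns the parent row plus 20 mutant rows, so issm() over k sites gives 20k mutant names plus one parent after the duplicate names are dropped. One mutant per site is identical to the parent (for example T2T), so only 19k + 1 sequences are distinct. A minimal string-based sketch of the same idea, with the hypothetical function name individual_saturation:

```python
AA = "RHKDESTNQCGPAVILMFYW"  # same 20 residues as aa_list in ssm()

def individual_saturation(seq, sites):
    # hypothetical string-based counterpart of issm(): one mutation per variant
    variants = {"wt": seq}
    for site in sites:
        wt = seq[site - 1]
        for aa in AA:
            variants[f"{wt}{site}{aa}"] = seq[:site - 1] + aa + seq[site:]
    return variants

lib = individual_saturation("MTIKE", range(2, 6))
print(len(lib))                # 81 names: 1 parent + 20 per site
print(len(set(lib.values())))  # 77 distinct sequences: 1 parent + 19 per site
```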

Create ssdm() to generate a single sequence that carries mutations at multiple sites simultaneously.

#simultaneous site directed mutagenesis
def ssdm(df, site_list, mut_list):
    # apply the mutations one after another to the same sequence
    for site, mut in zip(site_list, mut_list):
        df = sdm(df, site, mut)
    return df
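
The difference from isdm() is that every mutation lands in the same sequence and the name accumulates one label per mutation. A minimal sketch of that behavior on a plain string, assuming the same naming scheme as sdm(); combine_mutations is a hypothetical name:

```python
def combine_mutations(name, seq, sites, muts):
    # hypothetical string-based counterpart of ssdm(): all mutations in one sequence
    residues = list(seq)
    for site, mut in zip(sites, muts):
        # append wild-type residue, 1-based site, and mutant residue to the name
        name += f"_{residues[site - 1]}{site}{mut}"
        residues[site - 1] = mut
    return name, "".join(residues)

print(combine_mutations("seq1", "MTIKE", [2, 4, 5], ["L", "A", "R"]))
# ('seq1_T2L_K4A_E5R', 'MLIAR')
```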

Create sssm() to generate all combinations of saturation mutations at multiple sites.

#simultaneous site saturation mutagenesis
def sssm(df, site_list):
    for site in site_list:
        # saturate the current site in every sequence generated so far
        templist = [ssm(df[i:i+1], site) for i in range(len(df))]
        df = pd.concat(templist)
        df = df.drop_duplicates(subset='name')
        df.reset_index(drop=True,inplace=True)
    return df
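
Because sssm() saturates one site after another and ssm() keeps the unmutated row each time, the result also contains the parent and the lower-order mutants. For two sites that is 21 x 21 = 441 rows, which reduce to 400 distinct sequences once duplicates are removed by sequence. For comparison, here is a sketch of a different approach that enumerates the combinations directly with itertools.product on plain strings. This is not how sssm() works, and the function name is hypothetical.

```python
from itertools import product

AA = "RHKDESTNQCGPAVILMFYW"  # same 20 residues as aa_list in ssm()

def combinatorial_saturation(seq, sites):
    # every combination of the 20 residues at the given 1-based sites
    variants = []
    for combo in product(AA, repeat=len(sites)):
        residues = list(seq)
        for site, aa in zip(sites, combo):
            residues[site - 1] = aa
        variants.append("".join(residues))
    return variants

lib = combinatorial_saturation("MTIKE", [2, 3])
print(len(lib))       # 400 = 20 ** 2
print(len(set(lib)))  # 400, all distinct (the parent MTIKE is one of them)
```

The library grows as 20 to the power of the number of sites, so this is only practical for a handful of sites.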

Execution example

input.txt (FASTA format, id: seq1, seq: MTIKE) is prepared as a sample parent sequence. Read it as follows.

#read fasta
file = "input.txt"
df_wt = fd(file)
df_wt

To generate single mutants at several sites (one mutation per sequence):

site_list = [2,4,5]
mut_list = ["L","A","R"]
df_isdm = isdm(df_wt, site_list, mut_list)
df_isdm

To generate saturation mutants at each of several sites individually (for example, sites 2 through 5):

site_list = range(2,6)
df_issm = issm(df_wt, site_list)
df_issm

To generate one sequence with mutations at several sites:

site_list = [2,4,5]
mut_list = ["L","A","R"]
df_ssdm = ssdm(df_wt, site_list, mut_list)
df_ssdm

To generate all combinations of saturation mutations at several sites:

site_list = [2,3]
df_sssm = sssm(df_wt, site_list)
df_sssm

Sequence joining and duplicate removal

With one residue per column the table is hard to use in research, so join the residues into a single sequence string and remove duplicate sequences.

df = df_sssm.copy()
# join the per-site columns (everything after 'name') into one string per row
df.insert(1, 'sequence', df.iloc[:, 1:].apply(''.join, axis=1))
df_seq = df.iloc[:,0:2]
df_seq = df_seq.drop_duplicates(subset='sequence')
df_seq.reset_index(drop=True,inplace=True)
df_seq
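
The duplicates removed here are mainly the synonymous mutants, where a residue is replaced by itself (for example T2T), which have the same sequence as an earlier row. A toy sketch of the join and the duplicate removal on a two-residue table in the same layout; the table contents are made up for illustration:

```python
import pandas as pd

# toy table in the same layout: 'name' plus one column per site
df_demo = pd.DataFrame(
    [["seq1", "M", "T"], ["seq1_T2T", "M", "T"], ["seq1_T2L", "M", "L"]],
    columns=["name", 1, 2],
)
# concatenate the site columns row by row into a single string
df_demo["sequence"] = df_demo[[1, 2]].apply("".join, axis=1)
# keep the first row for each distinct sequence
df_unique = df_demo.drop_duplicates(subset="sequence").reset_index(drop=True)
print(df_unique[["name", "sequence"]])
```

drop_duplicates keeps the first occurrence by default, so the parent name survives and the later seq1_T2T row is dropped.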

Export

Export to CSV and FASTA, respectively.

df_seq.to_csv("output.csv")

with open("output.txt", "w") as handle:
    for name, sequence in zip(df_seq['name'], df_seq['sequence']):
        # build one FASTA record per row
        rec = SeqRecord(Seq(sequence), id=name, description="")
        SeqIO.write(rec, handle, "fasta")
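
FASTA is a simple text format: a header line starting with > followed by the sequence. As a dependency-free sketch of what such a file contains, assuming short sequences that fit on one line (Biopython's writer wraps long sequences by default, which this sketch does not do). to_fasta is a hypothetical helper:

```python
def to_fasta(records):
    # records: iterable of (name, sequence) pairs
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

text = to_fasta([("seq1", "MTIKE"), ("seq1_T2L", "MLIKE")])
print(text)
```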

Conclusion

I suspect there is a lot of waste in how the functions are written and in the for loops, but for the time being I can get the output I want.

Future tasks

- Mixing single mutations and saturation mutations (run saturation mutagenesis when "X" is given)
- Handling multiple parent sequences
- Handling very large sequences (shortening execution time)
- Code refinement
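
For the first item, one possible direction is to expand "X" into all 20 residues before calling the existing functions. The following is only a sketch under that assumption, not an implemented feature; expand_mutations is a hypothetical helper.

```python
AA = "RHKDESTNQCGPAVILMFYW"  # same 20 residues as aa_list in ssm()

def expand_mutations(sites, muts):
    # "X" at a site means "try all 20 residues there"
    pairs = []
    for site, mut in zip(sites, muts):
        choices = AA if mut == "X" else mut
        pairs.extend((site, aa) for aa in choices)
    return pairs

pairs = expand_mutations([2, 4], ["L", "X"])
print(len(pairs))  # 21: one directed mutation plus 20 saturation mutations
print(pairs[:2])   # [(2, 'L'), (4, 'R')]
```

Each (site, residue) pair could then be passed to sdm() to build the individual mutants.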
