There are times when you want to randomly sample a large number of sequences. This time, I will introduce the code to use in such a case.
The method is simply to open the file, select an array with random numbers and access it.
Select an array with random numbers and access
#Get array
seqnum=random.randint(1,seqnumMAX)
name = linecache.getline(infn, seqnum*2-1) #Array name
seq = linecache.getline(infn, seqnum*2) #Array
#writing
outfdl = open(outfn, 'w')
outfdl.writelines(name+seq)
outfdl.close()
linecache.clearcache() #Clear cache
At this time, `linecache.getline (input file name, number of lines)`
is used for reading.
This will internally optimize the reading of the specified line.
Bioinformatics tends to handle large array files, which may be useful.
Random restoration extraction can be performed by looping as many times as you want to acquire this process.
The above can be summarized as a proper process as follows.
randsampleFasta.py
#!/usr/local/bin/python3
# -*- coding: utf-8 -*-
"""
Randomly restore and extract an array from a fasta file.
"""
__author__ = "Kazuki Nakamae <[email protected]>"
__version__ = "0.00"
__date__ = "29 May 2017"
import sys
import mmap
import linecache
import random
def randsampleFasta(infn,outfn,n):
"""
Randomly restore and extract an array from a fasta file.
@param infn fasta file name to read{string}
@param outfn output fasta file name{string}
@param n Number of arrays to restore and extract{int}
"""
infdl = open(infn, 'r')
#Check the number of arrays in the input file
print('Checking how many sequences are in a FASTA format sequence file....(1/2)')
buf = mmap.mmap(infdl.fileno(), 0, prot=mmap.PROT_READ)
seqnumMAX= 0
readline = buf.readline
while readline():
seqnumMAX += 1
seqnumMAX = int(seqnumMAX / 2)
infdl.close()
print(str(seqnumMAX)+' sequences')
#Random restoration extraction of the array
print('random sampling from a sequence with replacement....(2/2)')
outfdl = open(outfn, 'w')
random.seed(a='hoge', version=2) #Seed setting
seqi=1
while seqi<=n:
seqnum=random.randint(1,seqnumMAX)
name = linecache.getline(infn, seqnum*2-1)
seq = linecache.getline(infn, seqnum*2)
outfdl.writelines(name+seq)
seqi += 1
outfdl.close()
linecache.clearcache()
print('done.')
if __name__ == '__main__':
argvs = sys.argv #Command line arguments
argc = len(argvs) #Number of arguments
if (argc != 4): #Argument check
print("USAGE : python3 randsampleFasta.py <INPUT.fa> <OUTPUT.fa> <NUMBER_OF_SEQUENCES>")
quit()
randsampleFasta(argvs[1],argvs[2],int(argvs[3]))
quit()
File to enter
test.fa
>1
CCGTATTGGAAAGCTC
>2
AGGATTATCGGATACT
>3
ATCCGGACGGGGGGTT
>4
GACCTCGTTATCATCC
>5
AGTCAGGTTACCCGCA
Input on Bash
input
python3 randsampleFasta.py test.fa out.fa 4
Output on Bash
Standard output
Checking how many sequences are in a FASTA format sequence file....(1/2)
5 sequences
random sampling from a sequence with replacement....(2/2)
done.
Output file
out.fa
>3
ATCCGGACGGGGGGTT
>2
AGGATTATCGGATACT
>4
GACCTCGTTATCATCC
>3
ATCCGGACGGGGGGTT
You are free.
Recommended Posts