4.3 Feature, location and position objects
4.3.1 SeqFeature objects
Sequence features are an essential part of describing a sequence.
Once you get beyond the sequence itself, you need some way to organize and easily get at the more “abstract” information that is known about the sequence.
The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you’ll probably have an easier time grasping the structure of the Biopython classes.
The key idea about each SeqFeature object is to describe a region on a parent sequence, typically a SeqRecord object. That region is described with a location object, typically a range between two positions (see Section 4.3.2 below).
4.3.2 Positions and locations
position – This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions.
location – A location is a region of sequence bounded by some positions. For instance, 5..20 (i.e. 5 to 20) is a location.
I just mention this because sometimes I get confused between the two.
4.3.2.1 FeatureLocation object
Unless you work with eukaryotic genes, most SeqFeature locations are extremely simple: you just need start and end coordinates and a strand. That’s essentially all the basic FeatureLocation object does.
In practice, of course, things can be more complicated. First of all, we have to handle compound locations made up of several regions. Secondly, the positions themselves may be fuzzy (inexact).
4.3.2.2 CompoundLocation object
Biopython 1.62 introduced the CompoundLocation as part of a restructuring of how complex locations made up of multiple regions are represented. The main usage is for handling ‘join’ locations in EMBL/GenBank files.
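As a minimal sketch of what a ‘join’ looks like in code, the following builds a two-part location by hand. The coordinates are invented for illustration, and the sketch assumes a Biopython release (1.62 or later) in which FeatureLocation and CompoundLocation can both be imported from Bio.SeqFeature.

```python
from Bio.SeqFeature import CompoundLocation, FeatureLocation

# Two hypothetical exons on the forward strand (0-based, end-exclusive).
exon1 = FeatureLocation(0, 4, strand=1)
exon2 = FeatureLocation(8, 12, strand=1)

# Roughly what a GenBank location such as join(1..4,9..12) corresponds to.
join_location = CompoundLocation([exon1, exon2])

print(len(join_location.parts))  # number of regions in the join
print(int(join_location.start))  # leftmost coordinate of any part
print(int(join_location.end))    # rightmost coordinate of any part
```

The individual regions stay available through the parts attribute, so code that only understands simple locations can still loop over them one at a time.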
4.3.2.3 Fuzzy Positions
So far we’ve only used simple positions. One complication in dealing with feature locations comes in the positions themselves. In biology many times things aren’t entirely certain (as much as we wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are several types of fuzzy positions, so we have the following classes to deal with them:
ExactPosition – As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.
BeforePosition – This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere less than 13. To get the specified upper boundary, look at the position attribute of the object.
AfterPosition – Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and like BeforePosition, you get the boundary number by looking at the position attribute of the object.
WithinPosition – Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as ‘(1.5)’, to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.
OneOfPosition – Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist. For instance, you could use this if the start codon was unclear and there were two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.
UnknownPosition – This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt.
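To make the notation concrete, here is a small sketch that creates three of these position classes and shows both their GenBank-like string form and the plain integer behind them. The coordinates 5 and 13 are arbitrary, and the sketch assumes the classes can be imported directly from Bio.SeqFeature.

```python
from Bio.SeqFeature import AfterPosition, BeforePosition, ExactPosition

exact = ExactPosition(5)
before = BeforePosition(13)
after = AfterPosition(13)

for pos in (exact, before, after):
    # str() gives the GenBank-like notation, int() the plain coordinate.
    print(type(pos).__name__, str(pos), int(pos))
```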
Here’s an example where we create a location with fuzzy end points:
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
Note that the details of some of the fuzzy locations changed in Biopython 1.59. In particular, for BetweenPosition and WithinPosition you must now make it explicit which integer position should be used for slicing etc. For a start position this is generally the lower (left) value, while for an end position this would generally be the higher (right) value.
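A brief sketch of that convention, using made-up coordinates: the first argument to WithinPosition is the integer the object behaves as, while left and right record the uncertain range. This assumes the post-1.59 constructor signature with left and right keyword arguments.

```python
from Bio.SeqFeature import FeatureLocation, WithinPosition

# A start somewhere in 10..13: use the left value as the integer position.
fuzzy_start = WithinPosition(10, left=10, right=13)
# An end somewhere in 20..23: use the right value as the integer position.
fuzzy_end = WithinPosition(23, left=20, right=23)

location = FeatureLocation(fuzzy_start, fuzzy_end)
print(int(location.start), int(location.end))
```

Choosing the outer values this way means that slicing with the location covers every base that might belong to the feature.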
If you print out a FeatureLocation object, you can get a nice representation of the information:
>>> print(my_location)
[>5:(8^9)]
We can access the fuzzy start and end positions using the start and end attributes of the location:
>>> my_location.start
AfterPosition(5)
>>> print(my_location.start)
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print(my_location.end)
(8^9)
If you don’t want to deal with fuzzy positions and just want numbers, they are actually subclasses of integers, so they should work like integers:
>>> int(my_location.start)
5
>>> int(my_location.end)
9
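Because the positions behave like integers, they can also be used directly as slice bounds. The following sketch uses a made-up twelve-base sequence and assumes the position classes are importable from Bio.SeqFeature.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import AfterPosition, ExactPosition, FeatureLocation

parent = Seq("ACGTACGTACGT")
location = FeatureLocation(AfterPosition(5), ExactPosition(9))

# The positions are int subclasses, so they can be used as slice bounds.
fragment = parent[location.start:location.end]
print(fragment)
```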
For compatibility with older versions of Biopython you can ask for the nofuzzy_start and nofuzzy_end attributes of the location, which are plain integers:
>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9
Notice that this just gives you back the position attributes of the fuzzy locations.
Similarly, to make it easy to create a position without worrying about fuzzy positions, you can just pass numbers to the FeatureLocation constructor, and you’ll get back ExactPosition objects:
>>> exact_location = SeqFeature.FeatureLocation(5, 9)
>>> print(exact_location)
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5
That is most of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!
4.3.2.4 Location testing
You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not.
For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and let’s suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:
>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get("db_xref")))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons; they do not cover any introns.
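The point about introns can be checked on a toy feature. This is only a sketch with invented coordinates, assuming CompoundLocation is available (Biopython 1.62 or later): a coordinate that falls in the gap between the two parts of a join is not reported as being inside the feature.

```python
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# A hypothetical CDS with two exons separated by an intron at 4..7.
exons = CompoundLocation(
    [FeatureLocation(0, 4, strand=1), FeatureLocation(8, 12, strand=1)]
)
cds = SeqFeature(exons, type="CDS")

print(2 in cds)  # inside the first exon
print(6 in cds)  # in the gap (intron) between the exons
print(9 in cds)  # inside the second exon
```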
4.3.3 Sequence described by a feature or location
A SeqFeature or location object doesn’t directly contain a sequence; instead the location (see Section 4.3.2) describes how to get this from the parent sequence. For example, consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18, strand=-1), type="gene")
You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement. If you are using Biopython 1.59 or later, the feature location’s start and end are integer-like, so this works:
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print(feature_seq)
AGCCTTTGCCGTC
This is a simple example so this isn’t too bad. However, once you have to deal with compound features (joins) this is rather messy. Instead, the SeqFeature object has an extract method to take care of all this:
>>> feature_seq = example_feature.extract(example_parent)
>>> print(feature_seq)
AGCCTTTGCCGTC
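The relationship between the GenBank notation complement(6..18) and the Python-style 5:18 used in this section comes down to shifting the start by one: GenBank counts from 1 and includes the end, while Python counts from 0 and excludes it. The helper below is hypothetical, written only to illustrate that arithmetic, and is not part of Biopython.

```python
def genbank_to_python(start, end):
    """Convert a 1-based, inclusive GenBank range to a 0-based Python slice.

    Hypothetical helper for illustration; not part of Biopython.
    """
    return start - 1, end


py_start, py_end = genbank_to_python(6, 18)
print(py_start, py_end)   # slice bounds
print(py_end - py_start)  # number of bases covered
```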
The length of a SeqFeature or location matches that of the region of sequence it describes.
>>> print(example_feature.extract(example_parent))
AGCCTTTGCCGTC
>>> print(len(example_feature.extract(example_parent)))
13
>>> print(len(example_feature))
13
>>> print(len(example_feature.location))
13
For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.
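A short sketch of the compound case, using an invented sixteen-base parent sequence and assuming CompoundLocation is available (Biopython 1.62 or later): the length of the join is the sum of its two parts, and it matches the length of the extracted sequence.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation

parent = Seq("AAAACCCCGGGGTTTT")
part1 = FeatureLocation(0, 4, strand=1)
part2 = FeatureLocation(8, 12, strand=1)
joined = CompoundLocation([part1, part2])

# Each simple part is end minus start; the join is the sum of its parts.
print(len(part1), len(part2), len(joined))
# Extracting concatenates the parts in order, skipping the gap between them.
print(joined.extract(parent))
```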