4.3 Feature, location and position objects
4.3.1 SeqFeature objects
Sequence features are an essential part of describing a sequence.
Once you get beyond the sequence itself, you need some way to organize and easily get at the more “abstract” information that is known about the sequence.
The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you’ll probably have an easier time grasping the structure of the Biopython classes.
The key idea about each SeqFeature object is to describe a region on a parent sequence, typically a SeqRecord object. That region is described with a location object, typically a range between two positions (see Section 4.3.2 below).
4.3.2 Positions and locations
To clarify the terminology used here:
position – This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions.
location – A location is a region of sequence bounded by some positions. For instance, 5..20 (i.e. 5 to 20) is a location.
I just mention this because sometimes I get confused between the two.
4.3.2.1 FeatureLocation object
Unless you work with eukaryotic genes, most SeqFeature locations are extremely simple: you just need start and end coordinates and a strand. That’s essentially all the basic FeatureLocation object does.
In practice, of course, things can be more complicated. First of all, we have to handle compound locations made up of several regions. Secondly, the positions themselves may be fuzzy (inexact).
4.3.2.2 CompoundLocation object
Biopython 1.62 introduced the CompoundLocation as part of a restructuring of how complex locations made up of multiple regions are represented. The main usage is for handling ‘join’ locations in EMBL/GenBank files.
4.3.2.3 Fuzzy Positions
So far we’ve only used simple positions. One complication in dealing with feature locations comes in the positions themselves. In biology, things often aren’t entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are several types of fuzzy positions, so we have six classes to deal with them:
ExactPosition – As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.
BeforePosition – This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere less than 13. To get the specified upper boundary, look at the position attribute of the object.
AfterPosition – Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and like BeforePosition, you get the boundary number by looking at the position attribute of the object.
WithinPosition – Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as ‘(1.5)’, to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be 1. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.
OneOfPosition – Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist. For instance, you could use this if the start codon was unclear and there were two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.
UnknownPosition – This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt.
Here’s an example where we create a location with fuzzy end points:
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
Note that the details of some of the fuzzy locations changed in Biopython 1.59. In particular, for BetweenPosition and WithinPosition you must now make it explicit which integer position should be used for slicing etc. For a start position this is generally the lower (left) value, while for an end position this would generally be the higher (right) value.
If you print out a FeatureLocation object, you can get a nice representation of the information:
>>> print(my_location)
[>5:(8^9)]
We can access the fuzzy start and end positions using the start and end attributes of the location:
>>> my_location.start
AfterPosition(5)
>>> print(my_location.start)
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print(my_location.end)
(8^9)
If you don’t want to deal with fuzzy positions and just want numbers, they are actually subclasses of integers, so they should work like integers:
>>> int(my_location.start)
5
>>> int(my_location.end)
9
For compatibility with older versions of Biopython you can ask for the nofuzzy_start and nofuzzy_end attributes of the location, which are plain integers:
>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9
Notice that this just gives you back the position attributes of the fuzzy locations.
Similarly, to make it easy to create a location without worrying about fuzzy positions, you can just pass plain numbers to the FeatureLocation constructor, and you’ll get back ExactPosition objects:
>>> exact_location = SeqFeature.FeatureLocation(5, 9)
>>> print(exact_location)
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5
That is most of the nitty-gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!
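To illustrate why a position that subclasses int can still print in fuzzy notation, here is a minimal pure-Python sketch. The class AfterPos is a hypothetical stand-in written for this example; it is not Biopython's AfterPosition and makes no claim about how Biopython implements it.

```python
class AfterPos(int):
    """Hypothetical stand-in for a fuzzy 'after' position (not Biopython's class)."""

    def __repr__(self):
        return "AfterPos(%d)" % int(self)

    def __str__(self):
        # GenBank-style notation for "somewhere after this coordinate"
        return ">%d" % int(self)


start = AfterPos(5)
text = "ACGTACGTACGT"  # made-up example sequence

print(start)         # fuzzy notation via __str__
print(int(start))    # plain integer value
print(text[start:])  # an int subclass can be used directly as a slice index
```

Because the object is an integer, arithmetic, comparison and slicing need no special handling; only the printed form carries the fuzziness.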
4.3.2.4 Location testing
You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not.
For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and let’s suppose this SNP is at index 4350 (Python counting!). Here is a simple brute-force solution where we just check all the features one by one in a loop:
>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get("db_xref")))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons: they do not cover any introns.
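The point about joins can be sketched in plain Python, without Biopython. The exon coordinates below are made up, and position_in_join is a hypothetical helper name used only for illustration; the sketch assumes half-open, 0-based (start, end) pairs as in Python slicing.

```python
# Each exon is a half-open (start, end) pair in Python 0-based coordinates.
exons = [(10, 20), (30, 40)]  # made-up example values


def position_in_join(position, parts):
    """Return True if position falls inside any part of the join."""
    return any(start <= position < end for start, end in parts)


print(position_in_join(15, exons))  # inside the first exon
print(position_in_join(25, exons))  # in the intron between the two exons
```

A coordinate in the gap between two parts is not a member, even though it lies between the overall start and end of the feature.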
4.3.3 Sequence described by a feature or location
A SeqFeature or location object doesn’t directly contain a sequence; instead, the location (see Section 4.3.2) describes how to get this from the parent sequence. For example, consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18, strand=-1), type="gene")
You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement. If you are using Biopython 1.59 or later, the feature location’s start and end are integer-like, so this works:
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print(feature_seq)
AGCCTTTGCCGTC
This is a simple example so this isn’t too bad. However, once you have to deal with compound features (joins) this is rather messy. Instead, the SeqFeature object has an extract method to take care of all this:
>>> feature_seq = example_feature.extract(example_parent)
>>> print(feature_seq)
AGCCTTTGCCGTC
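To make the conversion between GenBank/EMBL coordinates and Python slices concrete, here is a minimal pure-Python sketch that does not use Biopython. The helper names genbank_to_python and reverse_complement are hypothetical, and the sketch assumes an unambiguous upper-case DNA string.

```python
def genbank_to_python(start_1based, end_1based):
    """Convert an inclusive 1-based range to a half-open 0-based slice."""
    return start_1based - 1, end_1based


def reverse_complement(dna):
    """Reverse complement of an unambiguous upper-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


parent = "ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC"

# complement(6..18) in GenBank/EMBL notation corresponds to slice 5:18
start, end = genbank_to_python(6, 18)
print(start, end)
print(reverse_complement(parent[start:end]))
```

Only the start coordinate shifts by one: the inclusive 1-based end and the exclusive 0-based end are the same number.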
The length of a SeqFeature or location matches that of the region of sequence it describes.
>>> print(example_feature.extract(example_parent))
AGCCTTTGCCGTC
>>> print(len(example_feature.extract(example_parent)))
13
>>> print(len(example_feature))
13
>>> print(len(example_feature.location))
13
For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.
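The two length rules can be written out as a small pure-Python sketch. The function names and the join coordinates are made up for illustration and are not part of Biopython; regions are assumed to be half-open, 0-based (start, end) pairs.

```python
def simple_length(start, end):
    """Length of a single region: end minus start."""
    return end - start


def compound_length(parts):
    """Length of a join: the sum of the lengths of its parts."""
    return sum(simple_length(start, end) for start, end in parts)


print(simple_length(5, 18))                   # the 5:18 gene from this section
print(compound_length([(10, 20), (30, 40)]))  # made-up two-part join
```

Note that the length of the join is not the span from its first start to its last end, since the gap between the parts is not counted.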