Following on from "Python variables and data types learned from chemoinformatics", this article explains data structures using lipidomics (the comprehensive analysis of lipids) as the theme. It focuses on practical chemoinformatics examples, so if you want to check the basics first, please read the following article before this one.
Pharmaceutical researcher summarized the data structure of Python
A list is a data type that stores multiple elements and can be created with `list name = [element 1, element 2, ...]`.
The following example is a list that stores only character strings, but a list can also hold numbers and boolean values, duplicate values, and a mixture of data types.
Elements in the list can be referenced with `list name[index number]`.
Note that index numbers start at 0, not 1.
fatty_acids = ['FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3']
print(fatty_acids[0]) # First element
print(fatty_acids[1]) # Second element
print(fatty_acids[-1]) # Last element (first from the back)
print(fatty_acids[2:4]) # 3rd to 4th elements (indices 2 and 3)
print(fatty_acids[3:]) # 4th and subsequent elements
print(fatty_acids[:3]) # First 3 elements (indices 0 to 2)
print(fatty_acids[:-2]) # All elements except the last two
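As noted above, a list is not limited to strings. The following is a small sketch with made-up variable names (`carbon_numbers`, `is_saturated`, and `mixed` do not appear elsewhere in this article) showing numbers, boolean values, duplicate values, and mixed data types in lists.

```python
# Illustrative values only; these variable names are assumptions for this sketch.
carbon_numbers = [16, 18, 18, 18, 18]             # numbers, including duplicates
is_saturated = [True, True, False, False, False]  # boolean values
mixed = ['FA 18:2', 18, 2, False]                 # name, carbons, double bonds, saturated?

print(carbon_numbers[1])  # 18
print(is_saturated[-1])   # False
print(mixed[0], mixed[1]) # FA 18:2 18
print(type(mixed[0]), type(mixed[1]), type(mixed[3]))
```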
You can update the value of the element at a given index number with `list name[index number] = value`.
fatty_acids[3] = 'FA 18:2 (6Z, 9Z)'
print(fatty_acids)
By the way, `(6Z, 9Z)` represents the positions and geometry of the double bonds. `6` and `9` indicate which carbon atoms, counted from the carbon atom at the end opposite the carboxylic acid, form the double bonds, and `Z` means that the double bond is *cis*. If it were `E`, the double bond would be *trans*.
For details on the structure of linoleic acid, please see the link below.
Linoleic acid (FA 18:2)
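As a rough sketch, an annotation such as `(6Z, 9Z)` can be split into positions and geometries with ordinary string operations. The helper name `parse_double_bonds` is hypothetical, and the code assumes the annotation is written exactly in the `(6Z, 9Z)` style used above.

```python
def parse_double_bonds(name):
    """Hypothetical helper: 'FA 18:2 (6Z, 9Z)' -> [(6, 'Z'), (9, 'Z')]."""
    if '(' not in name:
        return []  # no double-bond annotation present
    # Text between the parentheses, e.g. '6Z, 9Z'
    annotation = name[name.index('(') + 1 : name.index(')')]
    bonds = []
    for item in annotation.split(','):
        item = item.strip()        # '6Z'
        position = int(item[:-1])  # 6
        geometry = item[-1]        # 'Z' (cis) or 'E' (trans)
        bonds.append((position, geometry))
    return bonds

print(parse_double_bonds('FA 18:2 (6Z, 9Z)'))  # [(6, 'Z'), (9, 'Z')]
print(parse_double_bonds('FA 16:0'))           # []
```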
Strictly speaking, when there is a double bond its position and geometry should be specified as above, but that makes the names longer, so they are omitted from here on.
You can check the number of elements in a list with `len`.
By the way, `len` is short for "length".
print(len(fatty_acids))
The most commonly used list operators are `+` and `*`.
`+` concatenates lists, and `*` creates a list in which the elements are repeated the specified number of times.
saturated_fatty_acids = ['FA 16:0', 'FA 18:0'] #Saturated fatty acids (fatty acids without double bonds)
unsaturated_fatty_acids = ['FA 18:1', 'FA 18:2', 'FA 18:3'] #Unsaturated fatty acids (fatty acids with double bonds)
fatty_acids = saturated_fatty_acids + unsaturated_fatty_acids #Join list
print(fatty_acids)
number_carbons = [16] + [18]*4 #List join and duplicate
print(number_carbons)
`number_carbons` holds the number of carbon atoms of each fatty acid in the list `fatty_acids`.
Since `fatty_acids` contains four molecular species with 18 carbon atoms, `[18]` is repeated four times with `*`.
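To confirm that the two lists line up element by element, they can be paired with the built-in `zip`. This is only a quick check and uses the same values as above.

```python
fatty_acids = ['FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3']
number_carbons = [16] + [18] * 4

# Both lists should have one entry per fatty acid.
print(len(fatty_acids) == len(number_carbons))  # True

# zip pairs the elements that share the same index number.
for name, n_carbon in zip(fatty_acids, number_carbons):
    print(name, n_carbon)
```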
Frequently used list methods are introduced below.
fatty_acids = ['FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3']
fatty_acids_copy = fatty_acids.copy() #Make a copy
print(fatty_acids_copy)
fatty_acids.append('FA 20:4') #Add element at the end
print(fatty_acids)
fatty_acids.extend(['FA 20:5', 'FA 22:6']) #Add multiple elements at the end
print(fatty_acids)
fatty_acids.insert(1, 'FA 16:1') #Add element to specified index number
print(fatty_acids)
fatty_acids.remove('FA 18:3') #Delete the specified element
print(fatty_acids)
print(fatty_acids.pop()) #Delete the last element and output the deleted element
print(fatty_acids.pop(2)) #Delete the third element and output the deleted element
fatty_acids.sort(key=None, reverse=True) #Sort elements in descending order
print(fatty_acids)
fatty_acids.sort(key=None, reverse=False) #Sort elements in ascending order
print(fatty_acids)
print(fatty_acids.index('FA 18:2')) #Index number of the specified element
print(fatty_acids.count('FA 18:2')) #Number of specified elements
In the example above, `.extend(['FA 20:5', 'FA 22:6'])` could also be written as `.append(['FA 20:5', 'FA 22:6'])`, but the result is different.
With `extend`, the two elements `'FA 20:5'` and `'FA 22:6'` are added, whereas with `append`, the list `['FA 20:5', 'FA 22:6']` is added as a single element. In other words, `append` puts a list inside the list. Take care to choose between `append` and `extend` appropriately.
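The difference between the two methods can be checked directly. The following sketch uses two short lists (`with_extend` and `with_append` are names made up for this comparison).

```python
with_extend = ['FA 18:3', 'FA 20:4']
with_extend.extend(['FA 20:5', 'FA 22:6'])  # adds two separate elements
print(with_extend)       # ['FA 18:3', 'FA 20:4', 'FA 20:5', 'FA 22:6']
print(len(with_extend))  # 4

with_append = ['FA 18:3', 'FA 20:4']
with_append.append(['FA 20:5', 'FA 22:6'])  # adds one element, which is itself a list
print(with_append)       # ['FA 18:3', 'FA 20:4', ['FA 20:5', 'FA 22:6']]
print(len(with_append))  # 3
print(with_append[-1])   # ['FA 20:5', 'FA 22:6']
```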
Strings can be handled much like lists. You can think of a string as a list of single characters, so you can, for example, refer to the fifth character from the front by its index number.
palmitic_acid = fatty_acids[0] # First element of the list fatty_acids
print(palmitic_acid) # FA 16:0
print(palmitic_acid[0]) # First character of the string 'FA 16:0', that is, 'F'
print(len(palmitic_acid)) # Number of characters
lipid_class = palmitic_acid[0:2]
print(lipid_class) # FA
Cn = int(palmitic_acid[3:5])
print(Cn) # 16 (numerical value)
Un = int(palmitic_acid[6])
print(Un) # 0 (numerical value)
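Fixed slice positions such as `[3:5]` only work when every name has the same layout. As a hedged alternative, the name can be split on the space and the colon instead. The helper `parse_fatty_acid` is hypothetical and assumes names of the form `'FA 16:0'` without a double-bond annotation. The same helper can also be passed as the `key` of a sort so that chain lengths are compared as numbers.

```python
def parse_fatty_acid(name):
    """Hypothetical helper: 'FA 16:0' -> ('FA', 16, 0)."""
    lipid_class, chain = name.split()  # 'FA', '16:0'
    cn, un = chain.split(':')          # '16', '0'
    return lipid_class, int(cn), int(un)

print(parse_fatty_acid('FA 16:0'))  # ('FA', 16, 0)
print(parse_fatty_acid('FA 8:0'))   # ('FA', 8, 0); the slice [3:5] would give '8:'

# Without a key, strings are compared character by character ('1' < '8').
names = ['FA 18:1', 'FA 8:0', 'FA 16:0']
print(sorted(names))                        # ['FA 16:0', 'FA 18:1', 'FA 8:0']
print(sorted(names, key=parse_fatty_acid))  # ['FA 8:0', 'FA 16:0', 'FA 18:1']
```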
As an application, consider counting the number of carbon atoms and double bonds of fatty acids using the SMILES notation.
smiles_la = r'OC(CCCCCCC/C=C\C/C=C\CCCCC)=O' # SMILES of linoleic acid (raw string, so the backslashes are kept as-is)
Cn = smiles_la.count('C') # Number of carbon atoms
Un = smiles_la.count('=') - 1 # Number of double bonds in the carbon chain (excluding the C=O of the carboxyl group)
linoleic_acid = f'FA {Cn}:{Un}' # f-string
print(linoleic_acid)
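The same counting can be applied to several SMILES strings at once. In the sketch below, the SMILES for palmitic acid and oleic acid are written by analogy with the linoleic acid string above and are illustrative inputs, and `smiles_to_label` is a hypothetical helper. Simple character counting is only valid for plain fatty acids like these; the last line shows one way it can go wrong.

```python
def smiles_to_label(smiles):
    """Hypothetical helper: count 'C' and '=' characters in a simple fatty acid SMILES."""
    cn = smiles.count('C')
    un = smiles.count('=') - 1  # subtract the C=O of the carboxyl group
    return f'FA {cn}:{un}'

# Raw strings (r'...') keep the backslashes of the SMILES as-is.
smiles_list = [
    'OC(' + 'C' * 15 + ')=O',          # palmitic acid (saturated, 16 carbons)
    r'OC(CCCCCCC/C=C\CCCCCCCC)=O',     # oleic acid
    r'OC(CCCCCCC/C=C\C/C=C\CCCCC)=O',  # linoleic acid
]

labels = [smiles_to_label(smiles) for smiles in smiles_list]
print(labels)  # ['FA 16:0', 'FA 18:1', 'FA 18:2']

# Caveat: 'Cl' (chlorine) also contains the character 'C'.
print('CCl'.count('C'))  # 2, although chloromethane has only one carbon atom
```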
A tuple is a list-like data type and can be created with `tuple name = (element 1, element 2, ...)`.
As with a list, you can read a value with `tuple name[index number]`, but you cannot update it.
So tuples are useful for sequences of data that should not be rewritten.
fatty_acids = ('FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3')
print(fatty_acids[0]) # First element
print(fatty_acids[1]) # Second element
print(fatty_acids[-1]) # Last element (first from the back)
print(fatty_acids[2:4]) # 3rd to 4th elements (indices 2 and 3)
print(fatty_acids[3:]) # 4th and subsequent elements
print(fatty_acids[:3]) # First 3 elements (indices 0 to 2)
print(fatty_acids[:-2]) # All elements except the last two
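The following sketch shows what happens when you try to update a tuple, and one possible way to obtain a modified copy by converting to a list first. The names `as_list` and `updated` are made up for this example.

```python
fatty_acids = ('FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3')

try:
    fatty_acids[3] = 'FA 18:2 (6Z, 9Z)'  # allowed for a list, not for a tuple
except TypeError as error:
    print('Cannot update a tuple:', error)

# To get a modified version, convert to a list, edit it, and build a new tuple.
as_list = list(fatty_acids)
as_list[3] = 'FA 18:2 (6Z, 9Z)'
updated = tuple(as_list)
print(updated)
print(fatty_acids)  # the original tuple is unchanged
```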
A dictionary is a collection of key/value pairs in which each key corresponds to exactly one value.
It can be created with `dictionary name = {key 1: value 1, key 2: value 2, ...}`.
Cn = 18 #Number of carbon atoms (chain length) of fatty acid
Un = 2 #Number of double bonds (degree of unsaturation)
num_C = Cn #Number of carbon atoms in the entire molecule
num_H = Cn * 2 - Un * 2 #Number of hydrogen atoms in the whole molecule
num_O = 2 #Number of oxygen atoms in the whole molecule
molecular_formula = {'C': num_C, 'H': num_H, 'O': num_O}
In the above example, a dictionary is created with the number of atoms as the value, using the element symbol as the key. To see all the keys and values in the dictionary:
print(molecular_formula.keys()) #List of keys
print(molecular_formula.values()) #List of values
print(molecular_formula.items()) #List of tuples of keys and values
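As one possible use of `items()`, the molecular weight can be estimated by combining the formula with a second dictionary of atomic masses. The `atomic_mass` dictionary and its rounded values are added here for illustration and are not part of the original example.

```python
# Average atomic masses (g/mol), rounded to three decimals; illustrative values.
atomic_mass = {'C': 12.011, 'H': 1.008, 'O': 15.999}

molecular_formula = {'C': 18, 'H': 32, 'O': 2}  # linoleic acid, FA 18:2

molecular_weight = 0.0
for element, count in molecular_formula.items():
    molecular_weight += atomic_mass[element] * count  # look up the mass by key

print(round(molecular_weight, 2))  # 280.45
```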
With `dictionary name[key] = value`, the value is updated if the key already exists in the dictionary, and a new key/value pair is added if it does not.
molecular_formula['C'] = 16 #Value rewriting
molecular_formula['H'] = 32 #Value rewriting
molecular_formula['N'] = 0 #Add new keys and values
print(molecular_formula)
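A dictionary like this can be turned back into a formula string. The sketch below assumes the keys are already in the desired order (dictionaries keep insertion order in Python 3.7 and later) and skips elements whose count is 0.

```python
molecular_formula = {'C': 16, 'H': 32, 'O': 2, 'N': 0}

parts = []
for element, count in molecular_formula.items():
    if count == 0:
        continue  # skip elements that are not present in the molecule
    parts.append(f'{element}{count}')

formula = ''.join(parts)
print(formula)  # C16H32O2
```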
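A set can be created with `set name = {element 1, element 2, ...}`. Note that empty braces `{}` create an empty dictionary, so an empty set is created with `set()`.
A set has no concept of order and does not hold duplicate elements. Rather than accessing elements by index number, it is used to check whether a particular element is present.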
fatty_acids = {'FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3'}
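A short sketch of typical set usage follows: membership tests with `in`, removal of duplicates, and set operations. The list `detected` is a hypothetical measurement result invented for this example.

```python
fatty_acids = {'FA 16:0', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3'}

# Membership test without any index number
print('FA 18:1' in fatty_acids)  # True
print('FA 20:4' in fatty_acids)  # False

# Duplicates collapse when a list is converted to a set.
detected = ['FA 16:0', 'FA 18:1', 'FA 18:1', 'FA 20:4']  # hypothetical data
detected_set = set(detected)
print(len(detected_set))  # 3

# Set operations; sorted() is used only to print in a fixed order.
print(sorted(fatty_acids & detected_set))  # ['FA 16:0', 'FA 18:1']
print(sorted(detected_set - fatty_acids))  # ['FA 20:4']
```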
Here, I explained Python data structures, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points.
- A list can store multiple elements, and strings can be handled much like lists. This can be applied to compound structures written in SMILES notation.
- A dictionary stores multiple pieces of data by associating keys with values. It can be used to store information such as the composition formula of a compound.
Next, the following article explains conditional branching in Python.
Conditional branching of Python learned by chemoinformatics
Surprisingly few!? "Minimum" knowledge required for programming in a pharmaceutical company