Work at pharmaceutical companies often involves the structures of compounds. Skills in measuring the concentration and analyzing the structure of new drug candidates synthesized in-house, their metabolites, and endogenous metabolites naturally present in the body (amino acids, sugars, lipids, etc.) are important.
Therefore, this article explains Python programming using lipidomics, the comprehensive analysis of lipids present in living organisms, as the subject. Lipids in the body comprise molecular species with a wide variety of structures, and more than 1 million molecular species can have their structures generated *in silico*. Writing out the structure of each of these molecular species by hand and calculating physical property values such as molecular weight and polarity is not practical, so programming is essential.
Handling the structures and physical property values of lipids by programming also carries over to chemoinformatics for new drug candidate compounds, so I hope you will learn it.
This time, I will explain "variables and data types". The focus is on practical examples from chemoinformatics, so if you want to check the basics, please read the following articles first.
- Pharmaceutical researcher summarized the basic description rules of Python
- Pharmaceutical researcher summarized variables in Python
- Pharmaceutical researcher summarized the operators used in Python
Please read the following article about environment construction.
How to install Anaconda for pharmaceutical company researchers
You can create a new variable by writing `variable = value`.
You can also use `print()` to output the objects in parentheses (variable values, program execution results, etc.).
Any text after `#` is treated as a comment and is not executed.
You can use comments to write notes in a script or to temporarily disable a part that raises an error.
lipid_class = 'FA'  # String
Cn = 16  # Number
Un = 0  # Number
print(lipid_class)  # Output the value of the variable "lipid_class"
print(Cn)  # Output the value of the variable "Cn"
print(Un)  # Output the value of the variable "Un"
You can also create multiple variables on one line, as shown below.
Also, `print(object1, object2)` outputs `object1 object2`, with the two values separated by a single space.
To change the separator from a space to another character or symbol, specify it with `sep=`.
With `sep=''`, `object1` and `object2` are concatenated without a space.
Cn, Un = 16, 0  # Means "Cn = 16, Un = 0"
print(Cn)  # "16" is output
print(Un)  # "0" is output
print(Cn, Un)  # "16 0" is output
print(Cn, Un, sep='When')  # "16When0" is output
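As a small illustration of `sep` in a lipidomics setting, the sketch below joins `Cn` and `Un` with a colon directly in `print`. The `end` argument is not covered in this article; it is a standard `print` parameter that replaces the trailing newline, and it appears here only as a side note.

```python
Cn, Un = 16, 0

# sep replaces the default single space between the printed objects
print(Cn, Un)            # 16 0
print(Cn, Un, sep=':')   # 16:0
print(Cn, Un, sep='')    # 160

# end replaces the newline that print normally adds after the last object
print('FA', end=' ')
print(Cn, Un, sep=':')   # together, the two calls print: FA 16:0
```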
First of all, variable names basically use English words, and a variable name cannot start with a digit (digits can be used from the second character onward). Choose a name that shows at a glance what value the variable stores.
If a name consists of multiple words, separate them with `_` (underscore) and, as a rule, write each word in lowercase.
When naming variables, be careful not to use keywords (reserved words) that Python defines in advance. You can check the reserved words with the following script. (You don't need to know the details of `import` for now. It is enough to copy and paste the script below and run it.)
import keyword
import pprint
pprint.pprint(keyword.kwlist, compact=True)
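The execution result is as follows. The exact list depends on the Python version; for example, Python 3.7 and later also include `async` and `await`. Do not use any of these words as variable names.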
['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue',
'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global',
'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise',
'return', 'try', 'while', 'with', 'yield']
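The naming rules above can also be checked programmatically. The following is a minimal sketch using the standard `str.isidentifier()` method and `keyword.iskeyword()` function; the candidate names are hypothetical examples chosen for illustration, and the `for` loop is a construct this article does not cover.

```python
import keyword

# Hypothetical candidate names, chosen only for illustration
candidates = ['lipid_class', 'class', '16_carbon', 'Cn']

for name in candidates:
    # isidentifier(): the text follows the naming rules (e.g. no leading digit)
    # iskeyword(): the text is a reserved word
    usable = name.isidentifier() and not keyword.iskeyword(name)
    print(name, usable)
```

Here `class` is rejected because it is a reserved word, and `16_carbon` is rejected because it starts with a digit.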
A string must be enclosed in `'` (quotation marks).
Conversely, numbers enclosed in quotation marks are also treated as strings.
Cn_int = 16 #"16" as a numerical value
print(type(Cn_int)) # <class 'int'>
Cn_str = '16' #"16" as a character string
print(type(Cn_str)) # <class 'str'>
You can check the data type of the object in parentheses by using `type`.
A special character string (escape sequence) is provided in case you want to write quotation marks or start a new line in quotation marks. An example is shown below.
print('molecular species: FA 16:0')
print('\'molecular species: FA 16:0\'') #Put quotes
print('molecular species: \nFA 16:0') #Insert a line break
In the above example, `\'` represents `'` itself. Therefore, these quotation marks are output as text, separately from the quotation marks that enclose the string.
Also, `\n` indicates a line break. Therefore, in the output of `print`, a line break is inserted after `molecular species: `.
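As a supplement to the escape sequences above, the sketch below shows two related points that the article does not cover: enclosing a string in double quotes so that single quotes need no escaping, and the `\t` (tab) escape sequence. Both are standard Python; the printed values are only illustrative.

```python
# Enclosing the string in double quotes lets single quotes appear unescaped
print("'molecular species: FA 16:0'")

# \t inserts a tab, which is handy for tab-separated output
print('FA 16:0\t256.2402')

# len() counts an escape sequence as a single character
print(len('\n'))  # 1
```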
Here, `FA` stored in the variable `lipid_class` is an abbreviation for "fatty acid".
`lipid_class` stands for the lipid class, that is, the category of the lipid.
`Cn` is the number of carbon atoms (the length of the carbon chain).
`Un` is the degree of unsaturation (the number of double bonds).
Fatty acids have the simplest structure among the lipid classes, and their structure is almost determined by specifying `Cn` and `Un`. Many of the other lipid classes have a chemical structure in which fatty acids are bound to a backbone such as glycerol, and the backbone characterizes the lipid class. By combining the lipid class, the number of carbon atoms, and the degree of unsaturation, the molecular species of a lipid is almost determined. So, let's consider combining `lipid_class`, `Cn`, and `Un` into a string.
By the way, the molecular species of fatty acids with 16 carbon atoms and 0 double bonds is palmitic acid. The chemical structure of palmitic acid is posted on the linked page below, so please refer to it as appropriate. Palmitic acid (FA 16:0) | LIPID MAPS Structure Database
lipid_class = 'FA'
Cn = 16
Un = 0
molecular_species = lipid_class + ' ' + str(Cn) + ':' + str(Un)
print(molecular_species)  # "FA 16:0" is output
You can combine strings by using `+`.
Here, `' '` is a space.
Also, `str` is an abbreviation for "string" and converts the object in parentheses to string data.
The conversion is needed because `Cn` and `Un` are numbers here, so they cannot be combined with strings as they are. Furthermore, in the field of lipidomics, it is customary to connect `Cn` and `Un` with `:`.
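To show why `str` is needed, the following sketch tries to join a string and an integer directly. It uses `try` and `except`, which this article does not cover, only to catch the resulting `TypeError` so that the script keeps running.

```python
lipid_class = 'FA'
Cn = 16

try:
    label = lipid_class + ' ' + Cn   # str + int is not allowed
except TypeError as error:
    print('TypeError:', error)

label = lipid_class + ' ' + str(Cn)  # convert first, then join
print(label)  # FA 16
```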
Use `int` or `float` to convert a number stored as a string back to numeric data. `int` is an abbreviation for "integer", and `float` refers to a number with a decimal point (floating-point number).
Cn = 16
Cn_str = str(Cn)
Cn_int = int(Cn_str)
print(type(Cn_str)) # <class 'str'>
print(type(Cn_int)) # <class 'int'>
exact_mass = 256.2402  # Exact mass of palmitic acid
exact_mass_str = str(exact_mass)
exact_mass_float = float(exact_mass_str)
print(type(exact_mass_str)) # <class 'str'>
print(type(exact_mass_float)) # <class 'float'>
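One practical use of `int` is recovering the carbon number and the degree of unsaturation from a molecular species name. The sketch below assumes names of the form `'FA 18:2'` (class, space, `Cn:Un`) and uses the standard `str.split` method, which this article does not otherwise cover.

```python
molecular_species = 'FA 18:2'

# Split at the space: 'FA' and '18:2'
lipid_class, chain = molecular_species.split(' ')

# Split at the colon: '18' and '2' (still strings at this point)
Cn_str, Un_str = chain.split(':')

# Convert to integers so they can be used in calculations
Cn = int(Cn_str)
Un = int(Un_str)

print(lipid_class, Cn, Un)   # FA 18 2
print(type(Cn), type(Un))    # <class 'int'> <class 'int'>
```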
The differences between using the `+` operator on numeric data and on string data are summarized below.
Cn = 16
Un = 0
print(Cn + Un) #Numerical value "16"
print(type(Cn + Un))
print(str(Cn) + str(Un))  # String "160" (the characters 1, 6, and 0 in a row)
print(type(str(Cn) + str(Un)))
If you add numeric variables, ordinary addition is performed, but if you add string variables, the strings are concatenated.
(Adding `Cn` and `Un` has no chemical meaning; it is given here only as an example to show how the program works.)
You can also combine strings (embed variable values in strings) by writing as follows.
lipid_class = 'FA'
Cn = 16
Un = 0
molecular_species = '{0} {1}:{2}'.format(lipid_class, Cn, Un)
print(molecular_species)  # "FA 16:0" is output here as well
Write `{}` placeholders inside the quotation marks and list the variables in the parentheses of `format`. Put 0, 1, and 2 inside the `{}` in order, and the variables are embedded starting from the leftmost argument of `format`.
In the world of programming, serial numbers often start from 0 instead of 1, so be careful when you start programming.
Furthermore, in Python 3.6 or later, it is possible to embed variables in a character string in a simpler way, as shown below, called "f-string".
lipid_class = 'FA'
Cn = 16
Un = 0
molecular_species = f'{lipid_class} {Cn}:{Un}'
print(molecular_species)
Write the string you want to create as `f'string'` and put variable names in `{}`; each `{}` part is then replaced with the value of the specified variable.
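f-strings can also control how a value is displayed. The sketch below uses the standard format specification `:.2f` to show two decimal places, and an expression inside the braces; the variable name `summary` is introduced here only for illustration.

```python
lipid_class = 'FA'
Cn = 16
Un = 0
exact_mass = 256.2402  # exact mass of palmitic acid

# :.2f formats a float with two digits after the decimal point
summary = f'{lipid_class} {Cn}:{Un} (exact mass: {exact_mass:.2f})'
print(summary)  # FA 16:0 (exact mass: 256.24)

# Expressions can also be evaluated inside the braces
print(f'{Cn - 2}:{Un}')  # 14:0
```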
You can use `replace` to replace a particular string with another.
molecular_species = 'FA 16:0'
print(molecular_species.replace(':', '_'))  # Replace ":" (colon) with "_" (underscore)
print(molecular_species.replace(' ', ''))  # Delete the space
There is a method called "SMILES (simplified molecular input line entry system) notation" as a method of describing the structure of a compound. As shown below, the chemical structure can be described only with a character string.
smiles_pa = 'OC(' + 'C' * (Cn - 1) + ')=O'  # 'pa' is an abbreviation for 'palmitic acid'
print(smiles_pa)
As above, SMILES notation describes the chemical structure without writing hydrogen atoms (H) explicitly.
Because the SMILES string is built from `Cn`, the molecular structure is generated automatically even if the value of `Cn` changes.
`*` can be used not only for multiplying numbers but also for repeating the same string.
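The same expression can be reused for other chain lengths. The sketch below wraps it in a function; `saturated_fa_smiles` is a hypothetical helper name, and `def` is a construct this article does not cover, so treat it as an illustration rather than a required pattern.

```python
def saturated_fa_smiles(Cn):
    """Return a SMILES string for a straight-chain saturated fatty acid.

    Hypothetical helper for illustration; assumes Cn >= 2.
    """
    # The C in 'OC(' is the carboxyl carbon, so Cn - 1 carbons are repeated
    return 'OC(' + 'C' * (Cn - 1) + ')=O'

smiles_pa = saturated_fa_smiles(16)  # palmitic acid, FA 16:0
smiles_sa = saturated_fa_smiles(18)  # stearic acid, FA 18:0
print(smiles_pa)
print(smiles_sa)

# The number of 'C' characters equals the number of carbon atoms
print(smiles_pa.count('C'), smiles_sa.count('C'))  # 16 18
```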
Next, consider a molecular species called linoleic acid, which has 18 carbon atoms and 2 double bonds. Linoleic acid (FA 18:2) | LIPID MAPS Structure Database
smiles_la = 'OC(CCCCCCC/C=C\C/C=C\CCCCC)=O' #Linoleic acid
Double bonds are written with `=`.
`/` and `\` indicate whether a double bond is *cis* or *trans*: it is *cis* if the symbols before and after the carbon atoms forming the double bond point in different directions, and *trans* if they point in the same direction.
smiles_la = 'OC(CCCCCCC/C=C\C/C=C\CCCCC)=O'  # Linoleic acid (same SMILES as above)
smiles_la_oxidized = smiles_la.replace('/C=C\C', 'C(O)CC')
print(smiles_la_oxidized)
In this way, the `replace` method mentioned above can be used to express oxidation at the double bonds.
By the way, it may seem that `.replace('/C=C\', 'C(O)C')` would also work, but the `\'` at the end of `'/C=C\'` is the escape sequence mentioned above. The quotation mark meant to close the string is treated as a quotation mark inside the string, and a syntax error occurs.
So here, one more carbon atom is included on the right, giving `.replace('/C=C\C', 'C(O)CC')`.
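Another way around the backslash problem, not used in this article, is to write the backslash explicitly. In the sketch below, `'\\'` stands for one literal backslash, and the `r'...'` prefix (a raw string) keeps every backslash in the SMILES as written. The variable name `pattern` is introduced only for illustration.

```python
# Linoleic acid written as a raw string, so each backslash is kept literally
smiles_la = r'OC(CCCCCCC/C=C\C/C=C\CCCCC)=O'

# '\\' is one literal backslash, so the pattern below is the text /C=C\
# (a raw string cannot end with a single backslash, hence '\\' here)
pattern = '/C=C\\'
print(pattern)  # /C=C\

smiles_la_oxidized = smiles_la.replace(pattern, 'C(O)C')
print(smiles_la_oxidized)
```

This gives the same string as the approach with the extra carbon atom, because the carbon after each `\` is simply left in place.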
A boolean is a data type whose value is either `True` or `False`.
It can be used to check whether multiple variables are equal or whether a certain condition is met.
palmitic_acid = 'FA 16:0' #Palmitic acid (saturated fatty acid with 16 carbon atoms)
stearic_acid = 'FA 18:0' #Stearic acid (saturated fatty acid with 18 carbon atoms)
print(molecular_species == palmitic_acid) # True
print(molecular_species == stearic_acid) # False
A fatty acid with 16 carbon atoms and 0 double bonds is "palmitic acid", not "stearic acid".
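Booleans also result from numeric comparisons, and they can be combined with `and`, `or`, and `not`. The sketch below is an illustration only: the variable names are hypothetical, and the cutoff of 13 carbon atoms for a "long chain" is an assumption made for this example, not a definition from this article.

```python
Cn, Un = 16, 0

# Comparison operators return a boolean
is_saturated = (Un == 0)     # no double bonds
is_long_chain = (Cn >= 13)   # hypothetical cutoff, used here only for illustration

print(is_saturated)                    # True
print(type(is_saturated))              # <class 'bool'>
print(is_saturated and is_long_chain)  # True
print(not is_saturated)                # False
```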
Here, we have explained Python variables and data types, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points again.
- You can create a variable by writing `variable name = value`.
- Strings can be combined with the `+` operator. You can also use f-strings and similar methods to embed variables in a string. This is useful for generating compound names mechanically.
- You can repeat the same string a specified number of times with `string * number`. This is useful for specifying the number of carbon atoms in SMILES notation.
Next, the following articles explain Python data structures (lists, dictionaries, etc.).
Python data structure learned by chemoinformatics
Surprisingly few!? "Minimum" knowledge required for programming in a pharmaceutical company