Introduction

Work at pharmaceutical companies often involves the structures of compounds. It is important to be able to measure the concentrations and analyze the structures of in-house synthesized drug candidates, their metabolites, and endogenous metabolites (amino acids, sugars, lipids, etc.) that are naturally present in the body.

In this article, we therefore explain Python programming using lipidomics, the comprehensive analysis of lipids in living organisms, as the subject. Lipids in the body comprise molecular species with a wide variety of structures, and more than 1 million molecular species can be generated *in silico*. Writing out the structure of each molecular species by hand and calculating physical properties such as molecular weight and polarity is impractical, so programming is essential.

Once you can handle lipid structures and physical properties programmatically, the same skills apply to chemoinformatics for new drug candidate compounds, so I hope you will find this worth learning.

This time, I will explain variables and data types. The focus is on practical chemoinformatics examples, so if you want to check the basics, please read the following articles first.

- Pharmaceutical researcher summarized the basic description rules of Python
- Pharmaceutical researcher summarized variables in Python
- Pharmaceutical researcher summarized the operators used in Python

For setting up the environment, please see the following article.

How to install Anaconda for pharmaceutical company researchers

Strings and numbers

Create and output variables

You can create a new variable by writing variable = value. print() outputs the objects in parentheses (variable values, results of expressions, etc.). Any text after # is treated as a comment and is not executed. Comments are useful for writing notes in a script or for temporarily disabling a line that causes an error.

lipid_class = 'FA'  # String
Cn = 16  # Numerical value
Un = 0  # Numerical value

print(lipid_class)  # Output the value of the variable "lipid_class"
print(Cn)  # Output the value of the variable "Cn"
print(Un)  # Output the value of the variable "Un"

You can also create multiple variables on one line, as shown below. In addition, print(object1, object2) outputs object1 and object2 separated by a single space. To use a different separator, specify it with sep=. With sep='', object1 and object2 are output joined together with no space.

Cn, Un = 16, 0  # Means "Cn = 16, Un = 0"
print(Cn)  # "16" is output
print(Un)  # "0" is output
print(Cn, Un)  # "16 0" is output
print(Cn, Un, sep='')  # "160" is output
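As a small illustration of sep, the sketch below prints a lipidomics-style name directly with print. The variable values are the ones used in this article; the choice of ':' as a separator is only an example.

```python
lipid_class = 'FA'
Cn, Un = 16, 0

# sep=':' puts a colon between the two numbers instead of a space
print(Cn, Un, sep=':')                        # 16:0

# With sep='' nothing is inserted automatically, so every space and
# colon has to be passed explicitly as its own object
print(lipid_class, ' ', Cn, ':', Un, sep='')  # FA 16:0
```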

How to name variables and precautions

Variable names are generally written in English. A name cannot start with a digit, although digits can be used from the second character onward. Choose names that make it clear at a glance what value is stored in the variable. If a name consists of multiple words, separate them with _ (underscore) and, as a rule, write each word in lowercase.

When naming variables, be careful not to use keywords (reserved words) that Python defines in advance. You can check the reserved words with the following script. (You do not need to understand details such as import for now; just copy and paste the script and run it.)

import keyword
import pprint


pprint.pprint(keyword.kwlist, compact=True)

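The execution result is as follows. Do not use any of the words below as variable names. (The exact list depends on the Python version; for example, Python 3.7 and later also list 'async' and 'await'.)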

['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue',
 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global',
 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise',
 'return', 'try', 'while', 'with', 'yield']
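The naming rules and the reserved-word list can also be checked in code. The following is a minimal sketch, not part of the original article: the candidate names are made up for illustration, and the for loop and dictionary it uses are covered in later articles.

```python
import keyword

# Hypothetical candidate names, chosen only for illustration
candidates = ['lipid_class', 'Cn', '16Cn', 'class', 'exact mass']

results = {}
for name in candidates:
    # isidentifier() checks the spelling rules (no leading digit, no spaces);
    # iskeyword() checks the name against the reserved-word list
    results[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(name, results[name])
```

Only lipid_class and Cn pass both checks: 16Cn starts with a digit, class is a reserved word, and exact mass contains a space.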

Precautions regarding character strings and how to check the data type

A string must be enclosed in quotation marks ('). Conversely, a number enclosed in quotation marks is also treated as a string.

Cn_int = 16 #"16" as a numerical value
print(type(Cn_int)) # <class 'int'>

Cn_str = '16' #"16" as a character string
print(type(Cn_str)) # <class 'str'>

You can check the data type of an object with type(), passing the object in parentheses.

Escape sequence

Special character sequences (escape sequences) are provided for cases where you want to include a quotation mark or a line break inside a quoted string. Examples are shown below.

print('molecular species: FA 16:0')
print('\'molecular species: FA 16:0\'') #Put quotes
print('molecular species: \nFA 16:0') #Insert a line break

In the above example, \' represents the character ' itself, so these quotation marks are output as part of the text rather than being treated as the marks that enclose the string. \n represents a line break, so in the output of print a line break is inserted after molecular species:.
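Two related points are worth a quick sketch, although the article itself does not cover them: Python also accepts double quotes around a string, and a backslash that should appear literally is written as \\ or placed in a raw string (prefix r). The variable names below are illustrative only.

```python
# Enclosing the string in double quotes lets single quotes appear as-is
quoted = "'molecular species: FA 16:0'"
print(quoted)
print(quoted == '\'molecular species: FA 16:0\'')  # True

# A literal backslash is written as \\ ...
escaped = 'C\\C'
# ... or the string is marked as raw with an r prefix
raw = r'C\C'
print(escaped, raw, escaped == raw)

# In a raw string, \n is two characters, not a line break
print(len('a\nb'), len(r'a\nb'))  # 3 4
```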

String concatenation

Here, FA stored in the variable lipid_class is an abbreviation for "fatty acid". lipid_class is the lipid class, that is, the category of the lipid. Cn is the number of carbon atoms (the length of the carbon chain), and Un is the degree of unsaturation (the number of double bonds).

Fatty acids have the simplest structure of all the lipid classes, and their structure is almost fully determined by specifying Cn and Un. Many other lipid classes have a chemical structure in which fatty acids are bound to a backbone such as glycerol, and the backbone characterizes the lipid class. The combination of lipid class, number of carbon atoms, and degree of unsaturation therefore almost fully determines the lipid molecular species. So, let's combine lipid_class, Cn, and Un into a single string.

Incidentally, the fatty acid with 16 carbon atoms and 0 double bonds is palmitic acid. Its chemical structure is shown on the page linked below, so please refer to it as needed.

Palmitic acid (FA 16:0) | LIPID MAPS Structure Database

lipid_class = 'FA'
Cn = 16
Un = 0

molecular_species = lipid_class + ' ' + str(Cn) + ':' + str(Un)
print(molecular_species)  # "FA 16:0" is output

You can join strings with +. Here, ' ' is a string containing a single space. str is short for "string" and converts the object in parentheses to string data. The conversion is needed because Cn and Un are numbers and cannot be joined to strings as they are. In lipidomics, it is customary to join Cn and Un with : (colon).
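To see why str is needed, the sketch below tries to join a string and a number directly. It uses try and except, which this article does not cover, only to keep the script running after the error; the variable names name and name_ok are illustrative.

```python
lipid_class = 'FA'
Cn = 16

try:
    name = lipid_class + ' ' + Cn  # str + int raises TypeError
except TypeError as error:
    name = None
    print('TypeError:', error)

# Converting the number first makes the concatenation work
name_ok = lipid_class + ' ' + str(Cn)
print(name_ok)  # FA 16
```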

Use int or float to convert a number stored as a string back to numeric data. int is short for "integer", and float refers to a number with a decimal point (floating-point number).

Cn = 16

Cn_str = str(Cn)
Cn_int = int(Cn_str)
print(type(Cn_str)) # <class 'str'>
print(type(Cn_int)) # <class 'int'>


exact_mass = 256.2402  # Exact mass of palmitic acid
exact_mass_str = str(exact_mass)
exact_mass_float = float(exact_mass_str)
print(type(exact_mass_str)) # <class 'str'>
print(type(exact_mass_float)) # <class 'float'>
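A typical use of int is to recover Cn and Un from a molecular species name. The following is a minimal sketch that assumes the name always has the form 'class Cn:Un'; it relies on the string method split, which is not introduced in this article.

```python
molecular_species = 'FA 16:0'

# Split 'FA 16:0' at the space, then split '16:0' at the colon
lipid_class, chain = molecular_species.split(' ')
Cn_str, Un_str = chain.split(':')

# The pieces are still strings, so convert them before calculating
Cn = int(Cn_str)
Un = int(Un_str)

print(lipid_class, Cn, Un)     # FA 16 0
print(type(Cn_str), type(Cn))  # <class 'str'> <class 'int'>
```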

The differences between using the + operator for numeric data and using it for string data are summarized below.

Cn = 16
Un = 0

print(Cn + Un) #Numerical value "16"
print(type(Cn + Un))

print(str(Cn) + str(Un)) #Character string "160" (character string in which 1s, 6s, and 0s are lined up)
print(type(str(Cn) + str(Un)))

Adding numeric values performs ordinary addition, whereas adding strings joins them. (Adding Cn and Un has no chemical meaning; it is used here only to show how the program behaves.)
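For a calculation that does have chemical meaning, the sketch below estimates the exact mass of a free fatty acid from Cn and Un, assuming the molecular formula C(Cn) H(2 x Cn - 2 x Un) O2. The rounded monoisotopic atomic masses and the constant names are assumptions added here, not values from the article.

```python
Cn = 16
Un = 0

# Assumed monoisotopic atomic masses (rounded)
MASS_C = 12.0
MASS_H = 1.007825
MASS_O = 15.994915

# Each double bond removes two hydrogen atoms from the saturated formula
n_hydrogen = 2 * Cn - 2 * Un
exact_mass = Cn * MASS_C + n_hydrogen * MASS_H + 2 * MASS_O

print(n_hydrogen)            # 32
print(round(exact_mass, 4))  # 256.2402
```

The result agrees with the exact mass of palmitic acid used earlier (256.2402).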

You can also combine strings (embed variable values in strings) by writing as follows.

lipid_class = 'FA'
Cn = 16
Un = 0

molecular_species = '{0} {1}:{2}'.format(lipid_class, Cn, Un)
print(molecular_species)  # This also outputs "FA 16:0"

Write {} placeholders inside the quoted string and list the variables in the parentheses of format. The numbers 0, 1, and 2 inside the {} refer to the arguments of format in order from the left. In programming, numbering often starts from 0 rather than 1, so take care when you are starting out.

Furthermore, Python 3.6 and later provide a simpler way to embed variables in a string, called the "f-string", as shown below.

lipid_class = 'FA'
Cn = 16
Un = 0

molecular_species = f'{lipid_class} {Cn}:{Un}'
print(molecular_species)

Write the string you want to create as f'...' and put a variable name inside {}; each {} is replaced with the value of the named variable.
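f-strings can also format numbers and evaluate expressions inside the braces. The sketch below goes slightly beyond the article: the format specification :.2f and the variable name label are additions for illustration.

```python
lipid_class = 'FA'
Cn = 16
Un = 0
exact_mass = 256.2402

# :.2f rounds the value to two decimal places when it is embedded
label = f'{lipid_class} {Cn}:{Un} (exact mass {exact_mass:.2f})'
print(label)  # FA 16:0 (exact mass 256.24)

# An expression inside {} is evaluated before it is embedded
print(f'{lipid_class} {Cn + 2}:{Un}')  # FA 18:0
```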

String replacement

You can use replace to replace a particular string with another.

molecular_species = 'FA 16:0'

print(molecular_species.replace(':', '_'))  # Replace ":" (colon) with "_" (underscore)
print(molecular_species.replace(' ', ''))  # Delete the space

Application: SMILES notation

SMILES (simplified molecular input line entry system) notation is one way of describing the structure of a compound. As shown below, a chemical structure can be written as a plain character string.

Cn = 16  # Number of carbon atoms in palmitic acid
smiles_pa = 'OC(' + 'C' * (Cn - 1) + ')=O'  # 'pa' is an abbreviation for 'palmitic acid'
print(smiles_pa)

As shown above, SMILES notation describes a chemical structure without writing the hydrogen atoms (H). Because the string is built from Cn, the structure is generated automatically even when the value of Cn changes. The * operator can be used not only to multiply numbers but also to repeat a string.
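To illustrate how the same expression follows changes in Cn, here is a minimal sketch that wraps it in a function and applies it to a few chain lengths. The function name saturated_fa_smiles is hypothetical, and def and for are covered in later articles.

```python
def saturated_fa_smiles(cn):
    """Return a SMILES string for a saturated fatty acid with cn carbons."""
    # One carbon belongs to the carboxyl group, so 'C' repeats cn - 1 times
    return 'OC(' + 'C' * (cn - 1) + ')=O'


for cn in [14, 16, 18]:
    print(f'FA {cn}:0', saturated_fa_smiles(cn))
```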

Next, consider linoleic acid, a molecular species with 18 carbon atoms and 2 double bonds.

Linoleic acid (FA 18:2) | LIPID MAPS Structure Database

smiles_la = 'OC(CCCCCCC/C=C\\C/C=C\\CCCCC)=O'  # Linoleic acid (\\ is a single backslash)

Double bonds are written with =. The symbols / and \ indicate whether a double bond is *cis* or *trans*: it is *cis* if the symbols before and after the carbon atoms forming the double bond point in different directions, and *trans* if they point in the same direction.
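Because every atom and double bond appears as a character, Cn and Un can be read back from a simple fatty acid SMILES by counting characters. This is only a sketch under a strong assumption: the string contains no other element symbols with the letter C (such as Cl) and no double bond other than C=C and the carboxyl C=O. It uses the string method count, which is not introduced in this article.

```python
# Linoleic acid; each backslash is written as \\ inside the Python string
smiles_la = 'OC(CCCCCCC/C=C\\C/C=C\\CCCCC)=O'

Cn = smiles_la.count('C')      # every 'C' character is a carbon atom here
Un = smiles_la.count('=') - 1  # subtract the C=O of the carboxyl group

print(f'FA {Cn}:{Un}')         # FA 18:2
```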

smiles_la = 'OC(CCCCCCC/C=C\\C/C=C\\CCCCC)=O'  # Linoleic acid

smiles_la_oxidized = smiles_la.replace('/C=C\\C', 'C(O)CC')
print(smiles_la_oxidized)

In this way, the replace method described above can be used to express oxidation of the double-bond moieties. It might seem that .replace('/C=C\', 'C(O)C') would work, but the \' at the end of '/C=C\' is the escape sequence described above: the quotation mark meant to close the string is read as a quotation mark inside the string, and a syntax error occurs. For that reason, one more carbon atom is included on the right of both the search pattern and the replacement, so that the pattern no longer ends in a backslash. A literal backslash can also be written as \\, which Python reads as a single backslash.
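The two ways of writing the pattern can be compared directly. In the sketch below, every backslash is written as \\, and the names oxidized_a and oxidized_b are introduced only for this comparison.

```python
smiles_la = 'OC(CCCCCCC/C=C\\C/C=C\\CCCCC)=O'  # linoleic acid

# Pattern with one extra carbon after the backslash
oxidized_a = smiles_la.replace('/C=C\\C', 'C(O)CC')

# Pattern without the extra carbon; '\\' is a single backslash,
# so the closing quotation mark is no longer escaped
oxidized_b = smiles_la.replace('/C=C\\', 'C(O)C')

print(oxidized_a)
print(oxidized_a == oxidized_b)  # True
```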

Boolean value

A Boolean value (bool) is a data type that is either True or False. It is used to check whether variables are equal or whether a condition is met.

molecular_species = 'FA 16:0'
palmitic_acid = 'FA 16:0'  # Palmitic acid (saturated fatty acid with 16 carbon atoms)
stearic_acid = 'FA 18:0'  # Stearic acid (saturated fatty acid with 18 carbon atoms)

print(molecular_species == palmitic_acid)  # True
print(molecular_species == stearic_acid)  # False

The fatty acid with 16 carbon atoms and 0 double bonds is palmitic acid, not stearic acid, so the first comparison is True and the second is False.
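Boolean values also result from numeric comparisons, which is how a condition such as "is this fatty acid saturated?" can be expressed. In the sketch below, the variable names and the chain-length cutoff of 20 are arbitrary choices for illustration.

```python
Cn = 16
Un = 0

is_saturated = (Un == 0)    # True: there are no double bonds
is_long_chain = (Cn >= 20)  # False: 16 is below the example cutoff

print(is_saturated, type(is_saturated))    # True <class 'bool'>
print(is_long_chain)                       # False
print(is_saturated and not is_long_chain)  # True
```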

Summary

Here, we have explained Python variables and data types, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points again.

- You can create a variable by writing variable name = value.
- Strings can be joined with the + operator. You can also embed variables in a string with f-strings and similar methods. This is useful for generating compound names mechanically.
- You can repeat a string a specified number of times with string * number. This is useful for specifying the number of carbon atoms in SMILES notation.

The next article explains Python data structures (lists, dictionaries, etc.).

Python data structure learned by chemoinformatics

Reference materials / links

Surprisingly few!? "Minimum" knowledge required for programming in a pharmaceutical company

Python variables and data types learned in chemoinformatics