Following "NumPy learned from chemoinformatics", this article explains "Pandas", one of Python's representative libraries, using lipidomics (the comprehensive analysis of lipids) as the theme. It focuses on practical examples from chemoinformatics, so if you want to check the basics first, please read the following article before this one.
Pharmaceutical researcher summarized Pandas
Pandas makes it easy to work with spreadsheet-like tabular data.
To use the library, first load it with `import`. By convention, it is usually abbreviated as `pd`.
Pandas handles two types of data structures, "Series" and "DataFrame".
Series is one-dimensional data, which has a data structure similar to a list or dictionary.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
print(numbers_carbon)
print(numbers_unsaturation)
In the above example, you can think of `index_fatty_acids` as the names (labels) of the data. You can also create a Series from a dictionary, as shown below. In that case, the entries of `index_fatty_acids` become the keys of the dictionary.
However, since the code becomes long, it is usually better to pass the list of names with `index=`.
import pandas as pd
numbers_carbon = pd.Series({
'FA 16:0': 16,
'FA 16:1': 16,
'FA 18:0': 18,
'FA 18:1': 18,
'FA 18:2': 18,
'FA 18:3': 18,
'FA 18:4': 18,
'FA 20:0': 20,
'FA 20:3': 20,
'FA 20:4': 20,
'FA 20:5': 20
})
print(numbers_carbon)
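The list-like and dictionary-like behaviour of a Series can be seen in a minimal sketch such as the following. The three-entry Series and the name `carbons` are made up for this illustration only.

```python
import pandas as pd

# Hypothetical three-entry Series, used only for this illustration
carbons = pd.Series({'FA 16:0': 16, 'FA 18:1': 18, 'FA 20:4': 20})

by_label = carbons['FA 18:1']       # dictionary-like access by label
by_position = carbons.iloc[0]       # list-like access by position
has_label = 'FA 20:4' in carbons    # membership is tested against the labels, as in a dict

print(by_label, by_position, has_label)
print(list(carbons.index))          # the labels, comparable to dict keys
print(carbons.tolist())             # the values as a plain Python list
```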
On the other hand, DataFrame is two-dimensional data created by combining Series.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
print(df_fatty_acids)
In the above example, the dictionary keys passed to `pd.DataFrame` become the column names of the table.
The `index` specified when each Series was created becomes the row names. By the way, the `df` in `df_fatty_acids` is an abbreviation for "data frame".
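Because the row names come from the index of each Series, pandas matches the Series by label rather than by position when it builds the DataFrame. The following is a small sketch of that behaviour; the short Series `cn` and `un` are made up for the illustration.

```python
import pandas as pd

# Two made-up Series whose labels are in a different order and do not fully overlap
cn = pd.Series([16, 18], index=['FA 16:0', 'FA 18:1'])
un = pd.Series([1, 0, 4], index=['FA 18:1', 'FA 16:0', 'FA 20:4'])

df = pd.DataFrame({'Cn': cn, 'Un': un})
print(df)

# Values are matched by label: 'FA 18:1' gets Un = 1 although it comes first in `un`.
# 'FA 20:4' has no entry in `cn`, so its Cn value is missing (NaN).
print(df.loc['FA 18:1', 'Un'])
print(df['Cn'].isna())
```

When all Series share the same index, as in this article, no missing values appear.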
Use `index` and `columns` to refer to the row and column names of the DataFrame, respectively.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
print(df_fatty_acids.index)  # Row names
print(df_fatty_acids.columns)  # Column names
To access a specific element of a DataFrame, write:
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
print(df_fatty_acids['Cn'])  # Specify a column name
print(df_fatty_acids.Cn)  # Specify a column name (attribute style)
print(df_fatty_acids['Cn'].iloc[0])  # Specify a column name, then a row number (iloc avoids ambiguity with the string row names)
print(df_fatty_acids[2:5])  # Specify row numbers with a slice
print(df_fatty_acids[5:])  # Extract rows from the specified row number onward
print(df_fatty_acids[:5])  # Extract rows before the specified row number
print(df_fatty_acids[-5:])  # Row numbers counted from the end
print(df_fatty_acids[2:5]['Cn'])  # Specify row numbers and a column name
print(df_fatty_acids.loc['FA 16:0', 'Cn'])  # Specify a row name and a column name
print(df_fatty_acids.loc['FA 16:0'])  # Specify a row name
print(df_fatty_acids.loc[:, 'Cn'])  # Specify a column name
print(df_fatty_acids.iloc[0, 0])  # Specify a row number and a column number
print(df_fatty_acids.iloc[0])  # Specify a row number
print(df_fatty_acids.iloc[:, 0])  # Specify a column number
print(df_fatty_acids.iloc[-1, -1])  # Element in the last column of the last row
Whether names (`loc`) or numbers (`iloc`) are the better way to specify rows and columns depends on the situation, so choose whichever is more convenient each time.
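One practical difference is worth keeping in mind when choosing: a slice of row numbers excludes its end, while a slice of row names includes it. The following is a minimal sketch on a shortened version of the table; `df`, `by_number`, and `by_name` are illustrative names.

```python
import pandas as pd

index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2']
df = pd.DataFrame({'Cn': [16, 16, 18, 18, 18], 'Un': [0, 1, 0, 1, 2]}, index=index_fatty_acids)

by_number = df.iloc[2:4]                # positions 2 and 3; the end position 4 is excluded
by_name = df.loc['FA 18:0':'FA 18:1']   # label slice; the end label is included

print(by_number)
print(by_name)
print(by_number.equals(by_name))        # both select the same two rows
```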
You can also extract data that meets the specified conditions.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18])  # Extract the rows that satisfy the condition
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18]['Cn'])  # Extract the rows that satisfy the condition, then specify a column name
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18].iloc[:, 0])  # Extract the rows that satisfy the condition, then specify a column number
print(df_fatty_acids[(df_fatty_acids['Cn'] >= 18) & (df_fatty_acids['Un'] >= 2)])  # Multiple conditions (and)
print(df_fatty_acids[(df_fatty_acids['Cn'] >= 18) | (df_fatty_acids['Un'] >= 1)])  # Multiple conditions (or)
When specifying multiple conditions, each condition must be enclosed in parentheses `()`. Don't forget them.
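The parentheses are needed because, in Python, `&` and `|` bind more tightly than comparison operators such as `>=`. As a sketch of an alternative that this article does not otherwise use, `DataFrame.query` accepts the condition as a string in which `and` / `or` can be written directly. The shortened table `df` is only for illustration.

```python
import pandas as pd

index_fatty_acids = ['FA 16:0', 'FA 18:0', 'FA 18:2', 'FA 20:4']
df = pd.DataFrame({'Cn': [16, 18, 18, 20], 'Un': [0, 0, 2, 4]}, index=index_fatty_acids)

# Boolean masks: each comparison is wrapped in parentheses
with_mask = df[(df['Cn'] >= 18) & (df['Un'] >= 2)]

# The same condition written as a query string
with_query = df.query('Cn >= 18 and Un >= 2')

print(with_mask)
print(with_query.equals(with_mask))
```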
You can refer to a specific column with `DataFrame name['column name']`, and if you assign values to a column name that is not yet in the DataFrame, a new column is created.
You can also easily calculate based on the data in another column.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2
print(df_fatty_acids)
In the above example, the number of carbon atoms `C` and the number of hydrogen atoms `H` of each fatty acid molecular species are calculated from the values in the columns `Cn` and `Un`, and the number of oxygen atoms `O` is set to 2 for every row. If you output `df_fatty_acids`, you can see that the new columns `C`, `H`, and `O` have been added.
Next, consider finding the exact mass of each fatty acid molecular species.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2
df_fatty_acids['Exact mass'] = pd.Series([0.0] * len(index_fatty_acids), index=index_fatty_acids)  # Start with 0 in every row
exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})
for atom in exact_masses.index:
    df_fatty_acids['Exact mass'] += exact_masses[atom] * df_fatty_acids[atom]  # Add (exact mass of the atom) x (number of atoms)
print(df_fatty_acids)
In the above example, the exact mass of each fatty acid molecule is calculated by multiplying the exact mass of each atom by the number of those atoms and summing the results.
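The same sum of "number of atoms × exact mass of the atom" can also be written without an explicit loop. The following is a sketch using `DataFrame.dot`, which is not the approach used in this article; it relies on the labels of `exact_masses` matching the column names `C`, `H`, and `O`. The two-row table `df` is made up for the illustration.

```python
import pandas as pd

index_fatty_acids = ['FA 16:0', 'FA 18:1']
df = pd.DataFrame({'C': [16, 18], 'H': [32, 34], 'O': [2, 2]}, index=index_fatty_acids)
exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

# Select the atom-count columns and take the row-wise dot product with the atomic masses
atom_counts = df[list(exact_masses.index)]
df['Exact mass'] = atom_counts.dot(exact_masses)
print(df)
```

For FA 16:0 (C16H32O2) this gives 16 × 12 + 32 × 1.00783 + 2 × 15.99491 = 256.24038.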
Next, let's build the molecular formula (composition formula) of each species.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2
df_fatty_acids['Molecular formula'] = pd.Series([''] * len(index_fatty_acids), index=index_fatty_acids)  # Start with an empty string in every row
exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})
for atom in exact_masses.index:
    df_fatty_acids['Molecular formula'] += atom + df_fatty_acids[atom].astype(str)  # Append the element symbol and its count
print(df_fatty_acids)
The element symbol and the number of atoms can be joined as strings, but since the data contained in `C`, `H`, and `O` are numerical values, they must be converted to strings before joining. Therefore, in the above example, `astype(str)` converts the numerical values into strings, which are then joined.
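As a small illustration of the conversion, the sketch below checks the data type of a count column and then builds the formula in a single expression instead of a loop. The two-row table `df` is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'C': [16, 18], 'H': [32, 34], 'O': [2, 2]}, index=['FA 16:0', 'FA 18:1'])

print(df['C'].dtype)                   # an integer dtype: the counts are numbers, not text
print(df['C'].astype(str).tolist())    # the same counts converted to strings

# Adding a string to a Series of strings concatenates element by element
df['Molecular formula'] = (
    'C' + df['C'].astype(str)
    + 'H' + df['H'].astype(str)
    + 'O' + df['O'].astype(str)
)
print(df['Molecular formula'].tolist())
```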
Next, consider outputting the completed data as an external file.
import pandas as pd
index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']
numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)
df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})
df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2
df_fatty_acids['Exact mass'] = 0.0  # Start with 0 in every row
df_fatty_acids['Molecular formula'] = ''  # Start with an empty string in every row
exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})
for atom in exact_masses.index:
    df_fatty_acids['Exact mass'] += exact_masses[atom] * df_fatty_acids[atom]  # Exact mass
    df_fatty_acids['Molecular formula'] += atom + df_fatty_acids[atom].astype(str)  # Molecular formula
df_fatty_acids.to_csv('fatty_acids.csv')  # Output as a CSV file
df_fatty_acids.to_csv('fatty_acids.txt', sep='\t')  # Output as a tab-delimited text file
df_fatty_acids.to_excel('fatty_acids.xlsx', sheet_name='fatty_acids')  # Output as an Excel file (needs an Excel engine such as openpyxl installed)
Conversely, to read an external file, do the following:
import pandas as pd
df_csv = pd.read_csv('fatty_acids.csv', index_col=0) #Read CSV file
df_text = pd.read_csv('fatty_acids.txt', sep='\t', index_col=0) #Read tab-delimited text file
df_excel = pd.read_excel('fatty_acids.xlsx', index_col=0) #Read excel file
print(df_csv)
print(df_text)
print(df_excel)
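The role of `index_col=0` can be checked with a quick round trip. This sketch writes a small made-up table to a temporary directory (so it does not touch the `fatty_acids.csv` created above) and reads it back with and without the option.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Cn': [16, 18], 'Un': [0, 1]}, index=['FA 16:0', 'FA 18:1'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fatty_acids.csv')
    df.to_csv(path)                            # row names are written as the first column
    df_back = pd.read_csv(path, index_col=0)   # the first column becomes the row names again
    df_plain = pd.read_csv(path)               # without index_col it stays an ordinary column

print(df_back.equals(df))
print(df_plain.columns.tolist())
```

Without `index_col=0`, the row names end up in an extra data column and the rows are numbered from 0 instead.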
To display only the first or last few rows of a DataFrame:
import pandas as pd
df_csv = pd.read_csv('fatty_acids.csv', index_col=0) #Read CSV file
df_text = pd.read_csv('fatty_acids.txt', sep='\t', index_col=0) #Read tab-delimited text file
df_excel = pd.read_excel('fatty_acids.xlsx', index_col=0) #Read excel file
print(df_csv.head())  # Show the first 5 rows
print(df_csv.head(3))  # Show the first 3 rows
print(df_csv.tail())  # Show the last 5 rows
`head` extracts the specified number of rows from the beginning, and `tail` extracts the specified number of rows from the end.
If you do not specify the number of rows, 5 rows are returned by default.
The same applies to `df_text` and `df_excel`.
As described above, the basic flow of data analysis is to read an external file as a DataFrame, check how the data is stored, extract specific rows or columns, add new columns with calculated values, and finally output the resulting table.
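That flow can be sketched end to end as follows. To keep the example self-contained, an in-memory string (`csv_text`, made up for this illustration) stands in for the external file, and the result is written to an in-memory buffer instead of to disk.

```python
import io

import pandas as pd

# Made-up CSV text standing in for an external file
csv_text = "name,Cn,Un\nFA 16:0,16,0\nFA 18:1,18,1\nFA 20:4,20,4\n"

# 1. Read the data as a DataFrame, using the first column as row names
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# 2. Extract the rows that satisfy a condition (copy so the result can be modified safely)
unsaturated = df[df['Un'] >= 1].copy()

# 3. Add a new column calculated from existing columns
unsaturated['H'] = unsaturated['Cn'] * 2 - unsaturated['Un'] * 2

# 4. Output the table
buffer = io.StringIO()
unsaturated.to_csv(buffer)
print(buffer.getvalue())
```

With real data, the `io.StringIO` objects would simply be replaced by file names, as in the examples above.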
In this article, I explained Pandas with a focus on practical knowledge that can be used in chemoinformatics. Let's review the main points.
- Pandas can handle two types of data structures, Series and DataFrame.
- You can perform database-like operations, such as extracting specific rows and columns or extracting only the data that meets given conditions.
- You can also read and output external files.
Next, Matplotlib is explained in the following article.
Learn Matplotlib with Cheminformatics