Introduction

Following the NumPy article in this chemoinformatics series, this article explains Pandas, one of Python's representative libraries, using lipidomics (the comprehensive analysis of lipids) as the theme. It focuses on practical chemoinformatics examples, so if you want to check the basics first, please read the following article before this one.

Pharmaceutical researcher summarized Pandas

Creating Series and DataFrame

Pandas makes it easy to work with spreadsheet-like tabular data.

To use the library, first load it with `import`. By convention, it is abbreviated as `pd`.

Pandas handles two types of data structures, "Series" and "DataFrame".

A Series is one-dimensional data with a structure similar to a list or a dictionary.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

print(numbers_carbon)
print(numbers_unsaturation)

In the above example, you can think of `index_fatty_acids` as the labels (names) of the data. You can also create a Series from a dictionary, as shown below, in which case the dictionary keys take the place of `index_fatty_acids`. However, since the code becomes longer, it is usually better to pass a list with `index=`.

import pandas as pd


numbers_carbon = pd.Series({
    'FA 16:0': 16,
    'FA 16:1': 16,
    'FA 18:0': 18,
    'FA 18:1': 18,
    'FA 18:2': 18,
    'FA 18:3': 18,
    'FA 18:4': 18,
    'FA 20:0': 20,
    'FA 20:3': 20,
    'FA 20:4': 20,
    'FA 20:5': 20
})

print(numbers_carbon)
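
As a quick check, the two construction styles should give the same Series, and an element can then be looked up by its label (like a dictionary) or by its position (like a list). The following is a minimal sketch; the shortened three-entry data and the names `labels`, `from_list`, and `from_dict` are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Shortened data for illustration (assumption: three entries are enough to show the idea)
labels = ['FA 16:0', 'FA 18:1', 'FA 20:4']

from_list = pd.Series([16, 18, 20], index=labels)
from_dict = pd.Series({'FA 16:0': 16, 'FA 18:1': 18, 'FA 20:4': 20})

print(from_list.equals(from_dict))  # Compare the values and index of both Series
print(from_list['FA 18:1'])         # Look up by label, like a dictionary key
print(from_list.iloc[0])            # Look up by position, like a list
```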

On the other hand, DataFrame is two-dimensional data created by combining Series.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids)

In the above example, the dictionary keys passed to `pd.DataFrame` become the column names of the table, while the `index` specified when the Series were created becomes the row names. By the way, `df` in `df_fatty_acids` is an abbreviation for "data frame".
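
If you do not need the intermediate Series, the same kind of table can probably be built in one step from a dictionary of lists. This is only a sketch with a shortened data set; the name `df_direct` is a hypothetical variable introduced for illustration.

```python
import pandas as pd

index_fatty_acids = ['FA 16:0', 'FA 18:1', 'FA 20:4']  # shortened list for illustration

# Dictionary keys become column names; index= supplies the row names
df_direct = pd.DataFrame(
    {'Cn': [16, 18, 20], 'Un': [0, 1, 4]},
    index=index_fatty_acids,
)

print(df_direct)
```

Building from Series, as in the article, is convenient when the Series already exist; building from lists is shorter when they do not.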

Data reference

Use `index` and `columns` to refer to the row names and column names of the DataFrame, respectively.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids.index)  # Row names
print(df_fatty_acids.columns)  # Column names
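
The row and column names are returned as Index objects rather than plain lists. A small sketch of a few things you might do with them follows; the shortened data and the variable name `df` are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 18, 20], 'Un': [0, 1, 4]},
    index=['FA 16:0', 'FA 18:1', 'FA 20:4'],
)

row_names = df.index.tolist()       # Convert the row names to a plain list
column_names = df.columns.tolist()  # Convert the column names to a plain list

print(row_names)
print(column_names)
print('FA 18:1' in df.index)  # Check whether a row name exists
print(df.shape)               # (number of rows, number of columns)
```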

To access a specific element of a DataFrame, write:

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids['Cn'])  # Select a column by name
print(df_fatty_acids.Cn)  # Select a column by name (attribute access)

print(df_fatty_acids['Cn'].iloc[0])  # Select a column by name, then an element by row position

print(df_fatty_acids[2:5])  # Select rows by position with a slice
print(df_fatty_acids[5:])  # Rows from position 5 onward
print(df_fatty_acids[:5])  # Rows before position 5
print(df_fatty_acids[-5:])  # Last 5 rows
print(df_fatty_acids[2:5]['Cn'])  # Row positions and a column name

print(df_fatty_acids.loc['FA 16:0', 'Cn'])  # Row name and column name
print(df_fatty_acids.loc['FA 16:0'])  # Row name
print(df_fatty_acids.loc[:, 'Cn'])  # Column name

print(df_fatty_acids.iloc[0, 0])  # Row and column positions
print(df_fatty_acids.iloc[0])  # Row position
print(df_fatty_acids.iloc[:, 0])  # Column position
print(df_fatty_acids.iloc[-1, -1])  # Element in the last column of the last row

Whether to use row and column names or row and column numbers depends on the situation, so choose whichever is easier in each case.
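
One difference worth keeping in mind when choosing between names and numbers is how slices behave: a label slice with `loc` includes the end label, while a position slice with `iloc` excludes the end position. The sketch below uses a shortened table; the names `df`, `by_label`, and `by_position` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 16, 18, 18], 'Un': [0, 1, 0, 1]},
    index=['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1'],
)

by_label = df.loc['FA 16:1':'FA 18:0']  # Label slice: the end label is included
by_position = df.iloc[1:2]              # Position slice: the end position is excluded

print(by_label)     # Two rows: FA 16:1 and FA 18:0
print(by_position)  # One row: FA 16:1
```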

You can also extract data that meets the specified conditions.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids[df_fatty_acids['Cn'] >= 18])  # Rows that satisfy the condition
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18]['Cn'])  # Rows that satisfy the condition, then a column by name
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18].iloc[:, 0])  # Rows that satisfy the condition, then a column by position

print(df_fatty_acids[(df_fatty_acids['Cn'] >= 18) & (df_fatty_acids['Un'] >= 2)])  # Multiple conditions (and)
print(df_fatty_acids[(df_fatty_acids['Cn'] >= 18) | (df_fatty_acids['Un'] >= 1)])  # Multiple conditions (or)

When specifying multiple conditions, each condition must be enclosed in parentheses `()`, because `&` and `|` bind more tightly than comparison operators such as `>=`. Don't forget them.
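
If the parentheses become hard to read, two alternatives may help: give each condition a name before combining them, or write the filter as a string with `query`. This is a sketch on a shortened table; the names `is_long_chain`, `is_polyunsaturated`, `selected`, and `selected_by_query` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 18, 18, 20], 'Un': [0, 1, 2, 4]},
    index=['FA 16:0', 'FA 18:1', 'FA 18:2', 'FA 20:4'],
)

# Naming each condition keeps the combined expression readable
is_long_chain = df['Cn'] >= 18
is_polyunsaturated = df['Un'] >= 2
selected = df[is_long_chain & is_polyunsaturated]

# The same filter written with query(); "and" is allowed inside the query string
selected_by_query = df.query('Cn >= 18 and Un >= 2')

print(selected)
print(selected.equals(selected_by_query))
```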

Add data

You can select a specific column with `DataFrame name['column name']`; if you assign to a column name that does not exist in the DataFrame, a new column is created. You can also easily calculate values based on the data in other columns.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

print(df_fatty_acids)

In the above example, the number of carbon atoms `C` and the number of hydrogen atoms `H` of each fatty acid molecular species are calculated from the values in the columns `Cn` and `Un`. If you output `df_fatty_acids`, you can see that the new columns `C`, `H`, and `O` have been added.
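
Assigning with `df['new column'] = ...` changes the DataFrame in place. If you would rather keep the original table untouched, `assign` returns a new DataFrame with the extra columns. The following is a sketch on a shortened table; `df` and `df_with_atoms` are names assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 18, 20], 'Un': [0, 1, 4]},
    index=['FA 16:0', 'FA 18:1', 'FA 20:4'],
)

# assign() returns a copy with the new columns added
df_with_atoms = df.assign(
    C=df['Cn'],
    H=df['Cn'] * 2 - df['Un'] * 2,
    O=2,
)

print(df_with_atoms)
print(df.columns.tolist())  # The original DataFrame still has only Cn and Un
```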

Arithmetic calculation

Next, consider finding the exact mass of each fatty acid molecular species.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

df_fatty_acids['Exact mass'] = pd.Series([0] * len(index_fatty_acids), index=index_fatty_acids)  # Start with 0 in every row

exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

for atom in exact_masses.index:
    df_fatty_acids['Exact mass'] += exact_masses[atom] * df_fatty_acids[atom]  # Add this element's contribution to the exact mass

print(df_fatty_acids)

In the above example, the exact mass of each fatty acid molecule is calculated by multiplying the exact mass of each atom by the number of atoms and adding up the results for all elements.
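
The same sum can also be written without an explicit loop. The sketch below is one possible alternative, not the article's method: it multiplies each atom-count column by the matching atomic mass and then adds across each row. The table is shortened, and the names `df_atoms` and `masses` are assumptions for illustration.

```python
import pandas as pd

df_atoms = pd.DataFrame(
    {'C': [16, 18], 'H': [32, 34], 'O': [2, 2]},
    index=['FA 16:0', 'FA 18:1'],
)
exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

# Multiply each column by the matching atomic mass (aligned on column names),
# then add the contributions across each row
masses = df_atoms.mul(exact_masses, axis='columns').sum(axis=1)

print(masses)
```

For FA 16:0 (C16H32O2) this works out to 16 × 12 + 32 × 1.00783 + 2 × 15.99491 = 256.24038.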

String concatenation

Next, let's build the molecular formula.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

df_fatty_acids['Molecular formula'] = pd.Series([''] * len(index_fatty_acids), index=index_fatty_acids)  # Start with an empty string in every row

exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

for atom in exact_masses.index:
    df_fatty_acids['Molecular formula'] += atom + df_fatty_acids[atom].astype(str)  # Append the element symbol and the number of atoms

print(df_fatty_acids)

You can concatenate the element symbol and the number of atoms as strings, but since the data in `C`, `H`, and `O` are numerical values, they must be converted to strings before concatenation. In the above example, `astype(str)` converts the numerical values to strings, which are then concatenated.
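
As an alternative to accumulating the string in a loop, the pieces can be built first and then joined row by row with `str.cat`. This is only a sketch on a shortened table; the names `df_atoms`, `parts`, and `formulas` are assumptions for illustration.

```python
import pandas as pd

df_atoms = pd.DataFrame(
    {'C': [16, 18], 'H': [32, 34], 'O': [2, 2]},
    index=['FA 16:0', 'FA 18:1'],
)

# One "symbol + count" string Series per element; astype(str) is still required
parts = [atom + df_atoms[atom].astype(str) for atom in ['C', 'H', 'O']]

# Join the pieces row by row
formulas = parts[0].str.cat(parts[1:])

print(formulas)
```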

Output to an external file

Next, consider outputting the completed data as an external file.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

df_fatty_acids['Exact mass'] = 0.0  # Start with 0 in every row
df_fatty_acids['Molecular formula'] = ''  # Start with an empty string in every row

exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

for atom in exact_masses.index:
    df_fatty_acids['Exact mass'] += exact_masses[atom] * df_fatty_acids[atom]  # Exact mass
    df_fatty_acids['Molecular formula'] += atom + df_fatty_acids[atom].astype(str)  # Molecular formula

df_fatty_acids.to_csv('fatty_acids.csv') #Output as CSV file
df_fatty_acids.to_csv('fatty_acids.txt', sep='\t') #Output as tab-delimited text file
df_fatty_acids.to_excel('fatty_acids.xlsx', sheet_name='fatty_acids') #Output as an excel file
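
To check what will be written before creating a file, you can call `to_csv` without a file path, in which case it returns the CSV text as a string. The sketch below uses a shortened table; the header name `'Fatty acid'` passed to `index_label` and the variable `csv_text` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 18], 'Un': [0, 1]},
    index=['FA 16:0', 'FA 18:1'],
)

# With no file path, to_csv returns the CSV text instead of writing a file;
# index_label gives the row-name column a header
csv_text = df.to_csv(index_label='Fatty acid')

print(csv_text)
```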

Reading an external file

Conversely, to read an external file, do the following:

import pandas as pd


df_csv = pd.read_csv('fatty_acids.csv', index_col=0) #Read CSV file
df_text = pd.read_csv('fatty_acids.txt', sep='\t', index_col=0) #Read tab-delimited text file
df_excel = pd.read_excel('fatty_acids.xlsx', index_col=0) #Read excel file

print(df_csv)
print(df_text)
print(df_excel)
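
The `index_col=0` argument used above tells Pandas to use the first column of the file as the row names. The following sketch shows the difference with and without it, reading from a string through `io.StringIO` so that no file is needed; the CSV text, the header `name`, and the variable names are assumptions for illustration.

```python
import io

import pandas as pd

csv_text = "name,Cn,Un\nFA 16:0,16,0\nFA 18:1,18,1\n"

# io.StringIO lets read_csv treat a string as if it were a file
df_default = pd.read_csv(io.StringIO(csv_text))
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())  # 'name' is an ordinary column
print(df_indexed.index.tolist())    # The 'name' values are now the row names
```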

To display only the first or last few rows of a DataFrame:

import pandas as pd


df_csv = pd.read_csv('fatty_acids.csv', index_col=0) #Read CSV file
df_text = pd.read_csv('fatty_acids.txt', sep='\t', index_col=0) #Read tab-delimited text file
df_excel = pd.read_excel('fatty_acids.xlsx', index_col=0) #Read excel file

print(df_csv.head())  # Show the first 5 rows
print(df_csv.head(3))  # Show the first 3 rows
print(df_csv.tail())  # Show the last 5 rows

`head` extracts the specified number of rows from the beginning, and `tail` extracts the specified number of rows from the end. If you do not specify the number of rows, 5 rows are returned by default. The same applies to `df_text` and `df_excel`.

As described above, the basic flow of data analysis is to read an external file as a DataFrame, check how the data is stored, extract specific rows or columns, add new columns and perform calculations, and output the finished table.

Summary

Here, I explained Pandas, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points again.

- Pandas can handle two types of data structures, Series and DataFrame.
- You can perform processing like database operations, such as extracting specific rows and columns and extracting only data that meets the conditions.
- You can also read and output external files.

Next, Matplotlib is explained in the following article.

Learn Matplotlib with Cheminformatics
