ENZYME is a database of information on enzyme nomenclature. Each entry in its data file is made up of lines that start with one of these two-letter codes:
ID Identification (Begins each entry; 1 per entry)
DE Description (official name) (>=1 per entry)
AN Alternate name(s) (>=0 per entry)
CA Catalytic activity (>=1 per entry)
CF Cofactor(s) (>=0 per entry)
CC Comments (>=0 per entry)
PR Cross-references to PROSITE (>=0 per entry)
DR Cross-references to Swiss-Prot (>=0 per entry)
This is the information stored for each enzyme.
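As a minimal sketch of how these line codes can be handled in Python, the snippet below splits one line into its two-letter code and its content. It assumes the usual enzyme.dat layout, in which the code occupies the first two characters and the content starts at index 5; `LINE_CODES` and `split_line` are names made up for this illustration.

```python
# Hypothetical helper names. The layout (two-letter code, content from
# index 5 onward) is an assumption about the enzyme.dat flat-file format.
LINE_CODES = {
    "ID": "Identification",
    "DE": "Description (official name)",
    "AN": "Alternate name(s)",
    "CA": "Catalytic activity",
    "CF": "Cofactor(s)",
    "CC": "Comments",
    "PR": "Cross-references to PROSITE",
    "DR": "Cross-references to Swiss-Prot",
}


def split_line(line):
    """Return (code, content) for one line of enzyme.dat."""
    code = line[:2]
    content = line[5:].rstrip("\n")
    return code, content


code, content = split_line("DE   Alcohol dehydrogenase.\n")
print(code, "=", LINE_CODES[code], "->", content)
```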
Contents of enzyme.dat (excerpt)
ID   1.1.1.1
DE   Alcohol dehydrogenase.
AN   Aldehyde reductase.
CA   (1) A primary alcohol + NAD(+) = an aldehyde + NADH.
CA   (2) A secondary alcohol + NAD(+) = a ketone + NADH.
CF   Zn(2+) or Fe cation.
CC   -!- Acts on primary or secondary alcohols or hemi-acetals with very broad
CC       specificity; however the enzyme oxidizes methanol much more poorly
CC       than ethanol.
CC   -!- The animal, but not the yeast, enzyme acts also on cyclic secondary
CC       alcohols.
PR   PROSITE; PDOC00058;
PR   PROSITE; PDOC00059;
PR   PROSITE; PDOC00060;
DR   P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU; Q5RBP7, ADH1A_PONAB;
DR   P25405, ADH1A_SAAHA; P25406, ADH1B_SAAHA; P00327, ADH1E_HORSE;
As part of my research, I needed a correspondence table between EC numbers (the ID above), which classify enzymes by function, and the UniProt entries of the individual proteins (the DR above). I therefore decided to extract the ID, the DR, and the description of each EC number (the DE above) from enzyme.dat and build a table that links them.
- enzyme.dat (obtained from ftp://ftp.expasy.org/databases/enzyme)
- pandas (used to create the DataFrame)
Extract the lines that start with ID, DE, and DR into lists, build a table from those lists as a DataFrame, and export it as a CSV file.
Open file
path = "enzyme.dat"
with open(path) as f:
    s = f.readlines()  # Read the file as a list of lines
s = s[24:]  # Exclude the explanatory header at the top of the file
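The fixed offset of 24 depends on the length of the header in the particular release of the file. A possible alternative, sketched here with a made-up helper name (`skip_header`) and a tiny made-up sample, is to drop everything before the first line that starts with the ID code:

```python
def skip_header(lines):
    """Drop everything before the first line that starts with the ID code."""
    for index, line in enumerate(lines):
        if line.startswith("ID "):
            return lines[index:]
    return []  # No entry was found at all


# Tiny made-up sample standing in for the real file contents.
sample = [
    "CC   Header text that explains the file\n",
    "CC   More header text\n",
    "//\n",
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "//\n",
]
entries = skip_header(sample)
print(entries[0])
```

With the real file this would be called as `s = skip_header(s)` in place of the slice.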
id_list = []
for i in s:
    if i.startswith("ID "):  # Find the lines that start with the ID code
        x = i[5:-1]  # Content starts at index 5; drop the trailing newline
        id_list.append(x)  # Add to the list
id_list[:10]
['1.1.1.1',
'1.1.1.2',
'1.1.1.3',
'1.1.1.4',
'1.1.1.5',
'1.1.1.6',
'1.1.1.7',
'1.1.1.8',
'1.1.1.9',
'1.1.1.10']
Because DE and DR may span two or more lines, each line is processed while looking ahead at the next one. Strings are concatenated for as long as the lines start with "DE", and the result is appended to the list once the last DE line of the entry is reached.
description_list = []
name = ""
for i in range(len(s)):
    if s[i].startswith("DE "):
        x = s[i][5:-1]
        name += x  # Append this DE line to the name being built
        if not s[i + 1].startswith("DE "):  # Last DE line of the entry
            description_list.append(name)
            name = ""
description_list[:10]
['Alcohol dehydrogenase.',
'Alcohol dehydrogenase (NADP(+)).',
'Homoserine dehydrogenase.',
'(R,R)-butanediol dehydrogenase.',
'Transferred entry: 1.1.1.303 and 1.1.1.304.',
'Glycerol dehydrogenase.',
'Propanediol-phosphate dehydrogenase.',
'Glycerol-3-phosphate dehydrogenase (NAD(+)).',
'D-xylulose reductase.',
'L-xylulose reductase.']
accession_list = []
name = ""
for i in range(len(s)):
    if s[i].startswith("DR "):
        x = s[i][5:-1]
        name += x  # Append this DR line to the string being built
        if not s[i + 1].startswith("DR "):  # Last DR line of the entry
            accession_list.append(name)
            name = ""
accession_list[1]
'Q6AZW2, A1A1A_DANRE; Q568L5, A1A1B_DANRE; Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ; P25377, ADH7_YEAST ; O57380, ADH8_PELPE ;Q9F282, ADHA_THEET ; P0CH36, ADHC1_MYCS2; P0CH37, ADHC2_MYCS2;P0A4X1, ADHC_MYCBO ; P9WQC4, ADHC_MYCTO ; P9WQC5, ADHC_MYCTU ;P27250, AHR_ECOLI ; Q3ZCJ2, AK1A1_BOVIN; Q5ZK84, AK1A1_CHICK;O70473, AK1A1_CRIGR; P14550, AK1A1_HUMAN; Q9JII6, AK1A1_MOUSE;P50578, AK1A1_PIG ; Q5R5D5, AK1A1_PONAB; P51635, AK1A1_RAT ;Q6GMC7, AK1A1_XENLA; Q28FD1, AK1A1_XENTR; Q9UUN9, ALD2_SPOSA ;P27800, ALDX_SPOSA ; P75691, YAHK_ECOLI ;'
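The DE and DR loops differ only in the line code, so they could be folded into one helper. The following is a sketch rather than the code used in this article; `collect_field` is a made-up name, and it joins continuation lines without a separator, as the loops above do.

```python
def collect_field(lines, code):
    """Join the content of consecutive lines that share the same line code.

    Returns one string per run of such lines.
    """
    prefix = code + " "
    values = []
    current = None  # None means "not inside a run of matching lines"
    for line in lines:
        if line.startswith(prefix):
            current = (current or "") + line[5:].rstrip("\n")
        elif current is not None:
            values.append(current)
            current = None
    if current is not None:  # A run that reaches the end of the input
        values.append(current)
    return values


# Small sample in the enzyme.dat layout.
sample = [
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "DR   P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU;\n",
    "DR   P25405, ADH1A_SAAHA;\n",
    "//\n",
]
print(collect_field(sample, "DE"))
print(collect_field(sample, "DR"))
```

Unlike the index-based loops, this version never reads `s[i + 1]`, so it cannot run past the end of the list.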
These three lists should now be enough to build a DataFrame, but first compare the number of elements in each list:
len(id_list), len(description_list), len(accession_list)
(7876, 7876, 5001)
Only accession_list has a different length.
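One way to see which entries cause the difference is to walk through the lines and record every ID whose entry ends without a DR line. This is only a sketch: `ids_without_dr` is a made-up name, and it assumes that every entry starts with an ID line and ends with `//`.

```python
def ids_without_dr(lines):
    """Return the IDs of entries that contain no DR line."""
    missing = []
    current_id = None
    has_dr = False
    for line in lines:
        if line.startswith("ID "):
            current_id = line[5:].rstrip("\n")
            has_dr = False
        elif line.startswith("DR "):
            has_dr = True
        elif line.startswith("//") and current_id is not None:
            if not has_dr:
                missing.append(current_id)
            current_id = None
    return missing


# Small sample in the enzyme.dat layout: one entry with DR, one without.
sample = [
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "DR   P07327, ADH1A_HUMAN;\n",
    "//\n",
    "ID   1.14.13.42\n",
    "DE   Deleted entry.\n",
    "//\n",
]
print(ids_without_dr(sample))
```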
A closer look at the dat file shows why:
//
ID   1.14.13.42
DE   Deleted entry.
//
ID   1.14.13.43
DE   Questin monooxygenase.
AN   Questin oxygenase.
CA   Questin + NADPH + O(2) = demethylsulochrin + NADP(+).
CC   -!- The enzyme cleaves the anthraquinone ring of questin to form a
CC       benzophenone.
CC   -!- Involved in the biosynthesis of the seco-anthraquinone (+)-geodin.
//
Quite a few IDs have no DR lines. Therefore:
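# Use PR, CC, DE, CA, and CF to find enzymes without DR:
# if a line with one of these codes is directly followed by "//",
# the entry has no DR line.
for name in ("PR", "CC", "DE", "CA", "CF"):
    print("start", name)
    no_dr_enzyme = []
    for i in range(len(s)):
        if s[i].startswith(f"{name} "):
            if s[i + 1].startswith("//"):
                no_dr_enzyme.append(i)
    x = 1  # Offset that accounts for the lines inserted so far
    for i in no_dr_enzyme:
        # Three spaces after DR, so that [5:-1] later yields "none ;"
        s.insert(i + x, "DR   none ;\n")
        x += 1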
This adds a "DR none" line to every ID that has no DR.
After building accession_list again, compare the number of elements:
len(id_list), len(description_list), len(accession_list)
(7876, 7876, 7876)
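As an aside, the same alignment could be reached without inserting placeholder lines by collecting the fields entry by entry. The sketch below is an alternative, not the approach of this article: `parse_entries` is a made-up name, an entry without DR simply keeps an empty Accession string, and the code assumes that every entry starts with an ID line and ends with `//`.

```python
def parse_entries(lines):
    """Yield one dict per entry, using "//" as the entry terminator."""
    entry = None
    for line in lines:
        if line.startswith("ID "):
            entry = {"ID": line[5:].rstrip("\n"), "Description": "", "Accession": ""}
        elif line.startswith("//"):
            if entry is not None:
                yield entry
            entry = None
        elif entry is not None and line[:2] in ("DE", "DR"):
            key = "Description" if line[:2] == "DE" else "Accession"
            entry[key] += line[5:].rstrip("\n")
    if entry is not None:  # Last entry when the final "//" is missing
        yield entry


# Small sample in the enzyme.dat layout: one entry with DR, one without.
sample = [
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "DR   P07327, ADH1A_HUMAN;\n",
    "//\n",
    "ID   1.14.13.42\n",
    "DE   Deleted entry.\n",
    "//\n",
]
rows = list(parse_entries(sample))
print(rows)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`.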
Now that all three lists have the same length, we can create a DataFrame.
import pandas as pd

df = pd.DataFrame(
    {"ID": id_list, "Description": description_list, "Accession": accession_list}
)
# Export as a CSV file
df.to_csv("enzyme.csv", index=False)
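For lookups by UniProt accession, a long table with one row per accession may be easier to work with than one long string per EC number. The following sketch uses a made-up two-row table in the same shape as the DataFrame built here; `split_accessions` and `long_df` are hypothetical names, and `DataFrame.explode` requires pandas 0.25 or later.

```python
import pandas as pd

# Made-up two-row table in the same shape as the DataFrame built above.
df = pd.DataFrame(
    {
        "ID": ["1.1.1.4", "1.1.1.5"],
        "Description": [
            "(R,R)-butanediol dehydrogenase.",
            "Transferred entry: 1.1.1.303 and 1.1.1.304.",
        ],
        "Accession": [
            "P14940, ADH_CUPNE ; Q0KDL6, ADH_CUPNH ;O34788, BDHA_BACSU ;",
            "none ;",
        ],
    }
)


def split_accessions(text):
    """Turn 'P14940, ADH_CUPNE ; Q0KDL6, ...' into a list of accessions."""
    accessions = []
    for item in text.split(";"):
        item = item.strip()
        if "," in item:  # Skips empty pieces and the "none" placeholder
            accession, _entry_name = item.split(",", 1)
            accessions.append(accession.strip())
    return accessions


long_df = (
    df.assign(Accession=df["Accession"].map(split_accessions))
    .explode("Accession")  # One row per accession; empty lists become NaN
    .dropna(subset=["Accession"])  # Drop entries that had no accession
    .reset_index(drop=True)
)
print(long_df[["ID", "Accession"]])
```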
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# make_enzyme_table.py
#
import pandas as pd


def main():
    # Read the file
    path = "enzyme.dat"
    with open(path) as f:
        s = f.readlines()
    s = s[24:]
    print(s[:10])

    # Create the ID column
    id_list = []
    for i in s:
        if i.startswith("ID "):
            x = i[5:-1]
            id_list.append(x)

    # Create the description column
    description_list = []
    name = ""
    for i in range(len(s)):
        if s[i].startswith("DE "):
            x = s[i][5:-1]
            name += x
            if not s[i + 1].startswith("DE "):
                description_list.append(name)
                name = ""

    # Use PR, CC, DE, CA, and CF to find enzymes without DR and fill them in
    for name in ("PR", "CC", "DE", "CA", "CF"):
        print("start", name)
        no_dr_enzyme = []
        for i in range(len(s)):
            if s[i].startswith(f"{name} "):
                if s[i + 1].startswith("//"):
                    no_dr_enzyme.append(i)
        x = 1
        for i in no_dr_enzyme:
            # Three spaces after DR, so that [5:-1] below yields "none ;"
            s.insert(i + x, "DR   none ;\n")
            x += 1

    # Create the accession column
    accession_list = []
    name = ""
    for i in range(len(s)):
        if s[i].startswith("DR "):
            x = s[i][5:-1]
            name += x
            if not s[i + 1].startswith("DR "):
                accession_list.append(name)
                name = ""

    # Create the DataFrame
    df = pd.DataFrame(
        {"ID": id_list, "Description": description_list, "Accession": accession_list}
    )

    # Write the CSV file
    df.to_csv("enzyme.csv", index=False)


if __name__ == "__main__":
    main()
Contents of enzyme.csv (excerpt)
ID,Description,Accession
1.1.1.1,Alcohol dehydrogenase.,"P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU; Q5RBP7, ADH1A_PONAB;P25405, ADH1A_SAAHA; P25406, ADH1B_SAAHA; P00327, ADH1E_HORSE;P00326, ADH1G_HUMAN; O97959, ADH1G_PAPHA; P00328, ADH1S_HORSE;P80222, ADH1_ALLMI ; P30350, ADH1_ANAPL ; P49645, ADH1_APTAU ;P06525, ADH1_ARATH ; P41747, ADH1_ASPFN ; Q17334, ADH1_CAEEL ;P43067, ADH1_CANAX ; P85440, ADH1_CATRO ; P14219, ADH1_CENAM ;P48814, ADH1_CERCA ; Q70UN9, ADH1_CERCO ; P23991, ADH1_CHICK ;P86883, ADH1_COLLI ; P19631, ADH1_COTJA ; P23236, ADH1_DROHY ;P48586, ADH1_DROMN ; P09370, ADH1_DROMO ; P22246, ADH1_DROMT ;P07161, ADH1_DROMU ; P12854, ADH1_DRONA ; P08843, ADH1_EMENI ;P26325, ADH1_GADMC ; Q9Z2M2, ADH1_GEOAT ; Q64413, ADH1_GEOBU ;Q64415, ADH1_GEOKN ; P12311, ADH1_GEOSE ; P05336, ADH1_HORVU ;P20369, ADH1_KLULA ; Q07288, ADH1_KLUMA ; P00333, ADH1_MAIZE ;P86885, ADH1_MESAU ; P00329, ADH1_MOUSE ; P80512, ADH1_NAJNA ;Q9P6C8, ADH1_NEUCR ; Q75ZX4, ADH1_ORYSI ; Q2R8Z5, ADH1_ORYSJ ;P12886, ADH1_PEA ; P22797, ADH1_PELPE ; P41680, ADH1_PERMA ;P25141, ADH1_PETHY ; O00097, ADH1_PICST ; Q03505, ADH1_RABIT ;P06757, ADH1_RAT ; P14673, ADH1_SOLTU ; P80338, ADH1_STRCA ;P13603, ADH1_TRIRP ; P00330, ADH1_YEAST ; Q07264, ADH1_ZEALU ;P20368, ADH1_ZYMMO ; O45687, ADH2_CAEEL ; O94038, ADH2_CANAL ;P48815, ADH2_CERCA ; Q70UP5, ADH2_CERCO ; Q70UP6, ADH2_CERRO ;P27581, ADH2_DROAR ; P25720, ADH2_DROBU ; P23237, ADH2_DROHY ;P48587, ADH2_DROMN ; P09369, ADH2_DROMO ; P07160, ADH2_DROMU ;P24267, ADH2_DROWH ; P37686, ADH2_ECOLI ; P54202, ADH2_EMENI ;Q24803, ADH2_ENTHI ; P42327, ADH2_GEOSE ; P10847, ADH2_HORVU ;P49383, ADH2_KLULA ; Q9P4C2, ADH2_KLUMA ; P04707, ADH2_MAIZE ;Q4R1E8, ADH2_ORYSI ; Q0ITW7, ADH2_ORYSJ ; O13309, ADH2_PICST ;P28032, ADH2_SOLLC ; P14674, ADH2_SOLTU ; F2Z678, ADH2_YARLI ;P00331, ADH2_YEAST ; F8DVL8, ADH2_ZYMMA ; P0DJA2, ADH2_ZYMMO ;P07754, ADH3_EMENI ; P42328, ADH3_GEOSE ; P10848, ADH3_HORVU ;P49384, ADH3_KLULA ; P14675, ADH3_SOLTU ; P07246, ADH3_YEAST ;P49385, ADH4_KLULA ; Q09669, ADH4_SCHPO ; A6ZTT5, 
ADH4_YEAS7 ;P10127, ADH4_YEAST ; Q6XQ67, ADH5_SACPS ; P38113, ADH5_YEAST ;P28332, ADH6_HUMAN ; P41681, ADH6_PERMA ; Q5R7Z8, ADH6_PONAB ;Q5XI95, ADH6_RAT ; P40394, ADH7_HUMAN ; Q64437, ADH7_MOUSE ;P41682, ADH7_RAT ; P9WQC0, ADHA_MYCTO ; P9WQC1, ADHA_MYCTU ;O31186, ADHA_RHIME ; Q7U1B9, ADHB_MYCBO ; P9WQC6, ADHB_MYCTO ;P9WQC7, ADHB_MYCTU ; P9WQB8, ADHD_MYCTO ; P9WQB9, ADHD_MYCTU ;P33744, ADHE_CLOAB ; P0A9Q8, ADHE_ECO57 ; P0A9Q7, ADHE_ECOLI ;P81600, ADHH_GADMO ; P72324, ADHI_RHOS4 ; Q9SK86, ADHL1_ARATH;Q9SK87, ADHL2_ARATH; A1L4Y2, ADHL3_ARATH; Q8VZ49, ADHL4_ARATH;Q0V7W6, ADHL5_ARATH; Q8LEB2, ADHL6_ARATH; Q9FH04, ADHL7_ARATH;P81601, ADHL_GADMO ; P39451, ADHP_ECOLI ; O46649, ADHP_RABIT ;O46650, ADHQ_RABIT ; Q96533, ADHX_ARATH ; Q3ZC42, ADHX_BOVIN ;Q17335, ADHX_CAEEL ; Q54TC2, ADHX_DICDI ; P46415, ADHX_DROME ;P19854, ADHX_HORSE ; P11766, ADHX_HUMAN ; P93629, ADHX_MAIZE ;P28474, ADHX_MOUSE ; P80360, ADHX_MYXGL ; P81431, ADHX_OCTVU ;A2XAZ3, ADHX_ORYSI ; Q0DWH1, ADHX_ORYSJ ; P80572, ADHX_PEA ;O19053, ADHX_RABIT ; P12711, ADHX_RAT ; P80467, ADHX_SAAHA ;P86884, ADHX_SCYCA ; P79896, ADHX_SPAAU ; Q9NAR7, ADH_BACOL ;P14940, ADH_CUPNE ; Q0KDL6, ADH_CUPNH ; Q00669, ADH_DROAD ;P21518, ADH_DROAF ; P25139, ADH_DROAM ; Q50L96, ADH_DROAN ;P48584, ADH_DROBO ; P22245, ADH_DRODI ; Q9NG42, ADH_DROEQ ;P28483, ADH_DROER ; P48585, ADH_DROFL ; P51551, ADH_DROGR ;Q09009, ADH_DROGU ; P51549, ADH_DROHA ; P21898, ADH_DROHE ;Q07588, ADH_DROIM ; Q9NG40, ADH_DROIN ; Q27404, ADH_DROLA ;P10807, ADH_DROLE ; P07162, ADH_DROMA ; Q09010, ADH_DROMD ;P00334, ADH_DROME ; Q00671, ADH_DROMM ; P25721, ADH_DROMY ;Q00672, ADH_DRONI ; P07159, ADH_DROOR ; P84328, ADH_DROPB ;P37473, ADH_DROPE ; P23361, ADH_DROPI ; P23277, ADH_DROPL ;Q6LCE4, ADH_DROPS ; Q9U8S9, ADH_DROPU ; Q9GN94, ADH_DROSE ;Q24641, ADH_DROSI ; P23278, ADH_DROSL ; Q03384, ADH_DROSU ;P28484, ADH_DROTE ; P51550, ADH_DROTS ; B4M8Y0, ADH_DROVI ;Q05114, ADH_DROWI ; P26719, ADH_DROYA ; P17648, ADH_FRAAN ;P48977, ADH_MALDO ; P81786, ADH_MORSE ; P9WQC2, 
ADH_MYCTO ;P9WQC3, ADH_MYCTU ; P39462, ADH_SACS2 ; P25988, ADH_SCAAL ;Q00670, ADH_SCACA ; P00332, ADH_SCHPO ; Q2FJ31, ADH_STAA3 ;Q2G0G1, ADH_STAA8 ; Q2YSX0, ADH_STAAB ; Q5HI63, ADH_STAAC ;Q99W07, ADH_STAAM ; Q7A742, ADH_STAAN ; Q6GJ63, ADH_STAAR ;Q6GBM4, ADH_STAAS ; Q8NXU1, ADH_STAAW ; Q5HRD6, ADH_STAEQ ;Q8CQ56, ADH_STAES ; Q4J781, ADH_SULAC ; P50381, ADH_SULSR ;Q96XE0, ADH_SULTO ; P51552, ADH_ZAPTU ; Q5AR48, ASQE_EMENI ;A5JYX5, DHS3_CAEEL ; P32771, FADH_YEAST ; A7ZIA4, FRMA_ECO24 ;Q8X5J4, FRMA_ECO57 ; A7ZX04, FRMA_ECOHS ; A1A835, FRMA_ECOK1 ;Q0TKS7, FRMA_ECOL5 ; Q8FKG1, FRMA_ECOL6 ; B1J085, FRMA_ECOLC ;P25437, FRMA_ECOLI ; B1LIP1, FRMA_ECOSM ; Q1RFI7, FRMA_ECOUT ;P44557, FRMA_HAEIN ; P39450, FRMA_PHODP ; Q3Z550, FRMA_SHISS ;P73138, FRMA_SYNY3 ; E1ACQ9, NOTN_ASPSM ; N4WE73, OXI1_COCH4 ;N4WE43, RED2_COCH4 ; N4WW42, RED3_COCH4 ; P33010, TERPD_PSESP;O07737, Y1895_MYCTU;"
1.1.1.2,Alcohol dehydrogenase (NADP(+)).,"Q6AZW2, A1A1A_DANRE; Q568L5, A1A1B_DANRE; Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ; P25377, ADH7_YEAST ; O57380, ADH8_PELPE ;Q9F282, ADHA_THEET ; P0CH36, ADHC1_MYCS2; P0CH37, ADHC2_MYCS2;P0A4X1, ADHC_MYCBO ; P9WQC4, ADHC_MYCTO ; P9WQC5, ADHC_MYCTU ;P27250, AHR_ECOLI ; Q3ZCJ2, AK1A1_BOVIN; Q5ZK84, AK1A1_CHICK;O70473, AK1A1_CRIGR; P14550, AK1A1_HUMAN; Q9JII6, AK1A1_MOUSE;P50578, AK1A1_PIG ; Q5R5D5, AK1A1_PONAB; P51635, AK1A1_RAT ;Q6GMC7, AK1A1_XENLA; Q28FD1, AK1A1_XENTR; Q9UUN9, ALD2_SPOSA ;P27800, ALDX_SPOSA ; P75691, YAHK_ECOLI ;"
1.1.1.3,Homoserine dehydrogenase.,"P00561, AK1H_ECOLI ; P27725, AK1H_SERMA ; P00562, AK2H_ECOLI ;Q9SA18, AKH1_ARATH ; P49079, AKH1_MAIZE ; O81852, AKH2_ARATH ;P49080, AKH2_MAIZE ; P57290, AKH_BUCAI ; Q8K9U9, AKH_BUCAP ;Q89AR4, AKH_BUCBP ; P37142, AKH_DAUCA ; P44505, AKH_HAEIN ;P19582, DHOM_BACSU ; P08499, DHOM_CORGL ; Q5B998, DHOM_EMENI ;Q9ZL20, DHOM_HELPJ ; P56429, DHOM_HELPY ; Q9CGD8, DHOM_LACLA ;P52985, DHOM_LACLC ; P37143, DHOM_METGL ; Q58997, DHOM_METJA ;P63630, DHOM_MYCBO ; P46806, DHOM_MYCLE ; P9WPX0, DHOM_MYCTO ;P9WPX1, DHOM_MYCTU ; P29365, DHOM_PSEAE ; O94671, DHOM_SCHPO ;P52986, DHOM_SYNY3 ; P31116, DHOM_YEAST ; P37144, DHON_METGL ;"
1.1.1.4,"(R,R)-butanediol dehydrogenase.","P14940, ADH_CUPNE ; Q0KDL6, ADH_CUPNH ; P39714, BDH1_YEAST ;O34788, BDHA_BACSU ; Q00796, DHSO_HUMAN ;"
1.1.1.5,Transferred entry: 1.1.1.303 and 1.1.1.304.,none ;
1.1.1.6,Glycerol dehydrogenase.,"A4IP64, ADH1_GEOTN ; O13702, GLD1_SCHPO ; P45511, GLDA_CITFR ;P0A9S6, GLDA_ECOL6 ; P0A9S5, GLDA_ECOLI ; P32816, GLDA_GEOSE ;P50173, GLDA_PSEPU ; Q9WYQ4, GLDA_THEMA ; Q92EU6, GOLD_LISIN ;"
1.1.1.7,Propanediol-phosphate dehydrogenase.,none ;
After that, by collating BLAST results against this table, you can obtain the list of EC numbers of the enzymes contained in the sample used for BLAST (in other words, you can see what kinds of enzyme functions are present).
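The collation step could look roughly like the sketch below. Both tables are made up for illustration: `ec_table` stands for a long-format version of the table (one row per UniProt accession), and `hits` stands for BLAST results already reduced to a query name and the accession of its best hit. The column names `query` and `Accession` in `hits` are placeholders, not names produced by BLAST itself.

```python
import pandas as pd

# Made-up long-format EC table: one row per (EC number, UniProt accession).
ec_table = pd.DataFrame(
    {
        "ID": ["1.1.1.1", "1.1.1.1", "1.1.1.6"],
        "Description": [
            "Alcohol dehydrogenase.",
            "Alcohol dehydrogenase.",
            "Glycerol dehydrogenase.",
        ],
        "Accession": ["P07327", "P00330", "P0A9S5"],
    }
)

# Made-up BLAST hits; X00000 is a dummy accession with no EC number.
hits = pd.DataFrame(
    {
        "query": ["contig_1", "contig_2", "contig_3"],
        "Accession": ["P00330", "P0A9S5", "X00000"],
    }
)

# A left merge keeps every hit; hits without an EC number get NaN.
annotated = hits.merge(ec_table, on="Accession", how="left")
print(annotated)
```

Counting `annotated["ID"]` with `value_counts()` would then give a rough picture of which EC numbers appear in the sample.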