Part 5 Plot of ChemTHEATER data on a map (Part 1)

Chap.0 Overall flow

In Part 5, the measured chemical substance concentrations are plotted on a map using the ChemTHEATER data. So to speak, it is a simple GIS.
In the first part, we aim to create an animation by connecting data processing to plotting on a map, and in the second part, connecting the output image data.

Loading the Chap.1 library

If you have not installed Cartopy, install it in advance.

conda install -c conda-forge cartopy

Then, load the following library.

%matplotlib inline
import os
import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt
from cartopy import crs as ccrs

Again, start by importing the libraries needed for processing.
The first line is, as usual, a command to display the output of matplotlib.
In Part 5, the following libraries will be used. (Refer to the second and subsequent lines)
Both are already installed on Anaconda3 (Windows 64bit version).
This time, we will use cartopy to display the map on the graph of matplotlib.

Library	Overview	Purpose of use this time	Official URL
os	Standard library	Directory manipulation	https://docs.python.org/ja/3/library/os.html
NumPy	Numerical calculation library	Used for numerical calculation in statistical processing	https://www.numpy.org
itertools	Generate iterator ^{1 Standard library}	Used to streamline loop processing	https://docs.python.org/ja/3/library/itertools.html
pandas	Data analysis library	Used for data processing and formatting	https://pandas.pydata.org
Matplotlib	Graph drawing library	Used for data visualization	https://matplotlib.org
cartopy	Map drawing library	Used for visualization of map data	https://scitools.org.uk/cartopy/docs/latest

Reading Chap.2 data

Now that the library is ready, let's load the data. Since we will be using data from marine mammals this time, search for Sample Type as "Biotic --Mammals --Marine mammals" in the Sample Search of ChemTHEATER. From the search results, click "Export samples" and "Export measured data" to download the sample list and measured value list files, respectively. Save the file in any directory and read it from your notebook as follows.

data_file = "measureddata_20191002074038.tsv"    #Change the string entered in the variable to the tsv file name of your measured data
data = pd.read_csv(data_file, delimiter="\t")
data = data.drop(["ProjectID", "ScientificName", "RegisterDate", "UpdateDate"], axis=1)    #Remove duplicate columns when joining with samples later
data

	MeasuredID	SampleID	ChemicalID	ChemicalName	ExperimentID	MeasuredValue	AlternativeData	Unit	Remarks
0	1026	SAA000173	CH0000034	PCB4+PCB10	EXA000001	0.010	<1.00E-2	ng/g lipid	NaN
1	1027	SAA000173	CH0000035	PCB8	EXA000001	0.010	<1.00E-2	ng/g lipid	NaN
2	1028	SAA000173	CH0000037	PCB19	EXA000001	0.010	<1.00E-2	ng/g lipid	NaN
3	1029	SAA000173	CH0000038	PCB22	EXA000001	0.010	<1.00E-2	ng/g lipid	NaN
4	1030	SAA000173	CH0000039	PCB28	EXA000001	32.000	NaN	ng/g lipid	NaN
...	...	...	...	...	...	...	...	...	...
7098	27705	SAA002002	CH0000094	PCB208	EXA000001	77.249	NaN	ng/g lipid	NaN
7099	27706	SAA002002	CH0000088	PCB194	EXA000001	512.160	NaN	ng/g lipid	NaN
7100	27707	SAA002002	CH0000092	PCB205	EXA000001	3.000	<3.00E+0	ng/g lipid	NaN
7101	27708	SAA002002	CH0000093	PCB206	EXA000001	81.947	NaN	ng/g lipid	NaN
7102	27709	SAA002002	CH0000095	PCB209	EXA000001	127.064	NaN	ng/g lipid	NaN

7103 rows × 9 columns

sample_file = "samples_20191002074035.tsv"    #Change the string entered in the variable to the tsv file name of your samples
sample = pd.read_csv(sample_file, delimiter="\t")
sample

	ProjectID	SampleID	SampleType	TaxonomyID	UniqCodeType	UniqCode	SampleName	ScientificName	...
0	PRA000003	SAA000173	ST004	34892	es-BANK	EW00884	M32582	Neophocaena phocaenoides	...
1	PRA000003	SAA000174	ST004	34892	es-BANK	EW00888	M32588	Neophocaena phocaenoides	...
2	PRA000003	SAA000175	ST004	34892	es-BANK	EW00932	M32580	Neophocaena phocaenoides	...
3	PRA000003	SAA000176	ST004	34892	es-BANK	EW00929	M33556	Neophocaena phocaenoides	...
4	PRA000003	SAA000177	ST004	34892	es-BANK	EW00934	M32548	Neophocaena phocaenoides	...
...	...	...	...	...	...	...	...	...	...
197	PRA000036	SAA002159	ST004	103596	es-BANK	EW04779	060301-1	Peponocephala electra	...
198	PRA000036	SAA002160	ST004	103596	es-BANK	EW00115	M32625	Peponocephala electra	...
199	PRA000036	SAA002161	ST004	103596	es-BANK	EW00122	M32633	Peponocephala electra	...
200	PRA000036	SAA002162	ST004	103596	es-BANK	EW00116	M32626	Peponocephala electra	...
201	PRA000036	SAA002163	ST004	103596	es-BANK	EW00117	M32627	Peponocephala electra	...

202 rows × 66 columns

Chap.3 Data preparation

Prepare to be able to plot the DataFrame read in Chap.2 on the map. First, combine measured data and samples and delete the N / A column.

df = pd.merge(data, sample, on="SampleID")

df = df.dropna(how='all', axis=1)
df

	MeasuredID	SampleID	ChemicalID	ChemicalName	ExperimentID	MeasuredValue	AlternativeData	Unit	...
0	1026	SAA000173	CH0000034	PCB4+PCB10	EXA000001	0.010	<1.00E-2	ng/g lipid	...
1	1027	SAA000173	CH0000035	PCB8	EXA000001	0.010	<1.00E-2	ng/g lipid	...
2	1028	SAA000173	CH0000037	PCB19	EXA000001	0.010	<1.00E-2	ng/g lipid	...
3	1029	SAA000173	CH0000038	PCB22	EXA000001	0.010	<1.00E-2	ng/g lipid	...
4	1030	SAA000173	CH0000039	PCB28	EXA000001	32.000	NaN	ng/g lipid	...
...	...	...	...	...	...	...	...	...	...
7098	27705	SAA002002	CH0000094	PCB208	EXA000001	77.249	NaN	ng/g lipid	...
7099	27706	SAA002002	CH0000088	PCB194	EXA000001	512.160	NaN	ng/g lipid	...
7100	27707	SAA002002	CH0000092	PCB205	EXA000001	3.000	<3.00E+0	ng/g lipid	...
7101	27708	SAA002002	CH0000093	PCB206	EXA000001	81.947	NaN	ng/g lipid	...
7102	27709	SAA002002	CH0000095	PCB209	EXA000001	127.064	NaN	ng/g lipid	...

7103 rows × 39 columns

Next, extract the necessary data. This time, we will handle data on four types of chemical substances (ΣPCBs, ΣDDTs, ΣCHLs, ΣHCHs) with the unit [ng / g lipid].

data_lipid = df[df["Unit"] == "ng/g lipid"]
data_lipid = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") | (data_lipid["ChemicalName"] == "ΣDDTs") |
                        (data_lipid["ChemicalName"] == "ΣCHLs") | (data_lipid["ChemicalName"] == "ΣHCHs")]

This completes the extraction of the necessary data. For confirmation, obtain a list of species and chemical names included in this dataset.

lipid_species = data_lipid["ScientificName"].unique()
lipid_chemicals = data_lipid["ChemicalName"].unique()

lipid_species, lipid_chemicals

(array(['Peponocephala electra', 'Neophocaena phocaenoides',
        'Sousa chinensis', 'Stenella coeruleoalba'], dtype=object),
 array(['ΣPCBs', 'ΣDDTs', 'ΣCHLs', 'ΣHCHs'], dtype=object))

Also, here, DataFrame / data_lipid is output in CSV format so that the same data can be used in the second part. CSV output uses Pandas' to_csv method.

data_lipid.to_csv("data.csv")

Chap.4 Plot on map

Sec.4-1 Base diagram

Next, we will plot on the map, which is the main part of this time. But first, let's carefully check how to create a map.
Since the graph this time will be output on the map, the map will be output first. In Python, you can use cartopy to output a map on matplotlib.

ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
plt.show()

Next, check the glag that is overlaid on the map. This time, the points in the two-dimensional space of latitude and longitude are plotted as measurement points, so the data is drawn as a scatter plot.
Before drawing, select the data for each species included in data_lipid.

df_0 = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") & (data_lipid["ScientificName"] == lipid_species[0])]    #Melon-headed whale ΣPCBs
df_1 = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") & (data_lipid["ScientificName"] == lipid_species[1])]    #Finless porpoise ΣPCBs
df_2 = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") & (data_lipid["ScientificName"] == lipid_species[2])]    #Sinaus Iro Dolphin ΣPCBs
df_3 = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") & (data_lipid["ScientificName"] == lipid_species[3])]    #Striped dolphin ΣPCBs

When the data is more divided, try scatter plotting. The graph drawing of matplotlib can be overlaid on top, so you can add images one by one.

fig = plt.figure()
ax = plt.axes()

#CollectionLongitudeFrom,Extract latitude / longitude information from CollectionLatitudeFrom
ax.scatter(x = np.array(df_0["CollectionLongitudeFrom"]), y = np.array(df_0["CollectionLatitudeFrom"]), c = "red", alpha=0.5)
ax.scatter(x = np.array(df_1["CollectionLongitudeFrom"]), y = np.array(df_1["CollectionLatitudeFrom"]), c = "blue", alpha=0.5)
ax.scatter(x = np.array(df_2["CollectionLongitudeFrom"]), y = np.array(df_2["CollectionLatitudeFrom"]), c = "yellow", alpha=0.5)
ax.scatter(x = np.array(df_3["CollectionLongitudeFrom"]), y = np.array(df_3["CollectionLatitudeFrom"]), c = "green", alpha=0.5)
plt.show()

Sec.4-2 Scatter plot + equirectangular projection

Now that we have drawn the base figure, let's draw it by superimposing them.

fig = plt.figure()
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
ax.scatter(x = np.array(df_0["CollectionLongitudeFrom"]), y = np.array(df_0["CollectionLatitudeFrom"]), c = "red", alpha=0.5)
ax.scatter(x = np.array(df_1["CollectionLongitudeFrom"]), y = np.array(df_1["CollectionLatitudeFrom"]), c = "blue", alpha=0.5)
ax.scatter(x = np.array(df_2["CollectionLongitudeFrom"]), y = np.array(df_2["CollectionLatitudeFrom"]), c = "yellow", alpha=0.5)
ax.scatter(x = np.array(df_3["CollectionLongitudeFrom"]), y = np.array(df_3["CollectionLatitudeFrom"]), c = "green", alpha=0.5)
plt.show()

Sec.4-3 Improvement of Sec.4-2

I was able to output for the time being with Sec.4-2. However, in the current state, the map range is set automatically. Since this is only automatically processed so that the entire range in which the data exists is included, the range changes depending on the data to be handled. Here, the map is fixed within an appropriate range.

By using the set_xlim and set_ylim methods of matplotlib, the range of the X-axis direction and Y-axis direction of the graph to be drawn can be determined respectively.

fig = plt.figure()
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
ax.scatter(x = np.array(df_0["CollectionLongitudeFrom"]), y = np.array(df_0["CollectionLatitudeFrom"]), c = "red", alpha=0.5)
ax.scatter(x = np.array(df_1["CollectionLongitudeFrom"]), y = np.array(df_1["CollectionLatitudeFrom"]), c = "blue", alpha=0.5)
ax.scatter(x = np.array(df_2["CollectionLongitudeFrom"]), y = np.array(df_2["CollectionLatitudeFrom"]), c = "yellow", alpha=0.5)
ax.scatter(x = np.array(df_3["CollectionLongitudeFrom"]), y = np.array(df_3["CollectionLatitudeFrom"]), c = "green", alpha=0.5)
ax.set_xlim(90,180)    #Range in the X-axis (longitude) direction to draw
ax.set_ylim(15, 60)    #Range in the Y-axis (latitude) direction to draw
ax.set_title("ΣPCBs")
plt.show()

Chap.5 Output in one sheet

As you can see in

Chap.3, this DataFrame contains data of 4 kinds of chemical substances, so let's output 4 of them side by side in a single sheet. </ p>

First, as we did in Chap.4, we sort out the data for each chemical substance, but since there are 16 combinations (4 types of biological and chemical substances each) in total, we list them here. Collectively, it takes the form of reading sequentially when drawing. At this time, it is necessary to generate each combination (for example, ΣPCBs of finless porpoise), so use the function of Python iterator ^{1 to directly product. We will ask for ^{2 . In other words, the direct product of the list of chemical substances (lipid_chemicals) and the list of species names (lipid_species) is obtained, and all 16 combination patterns are derived.}}

df_list = []
for k1, k2 in itertools.product(lipid_chemicals, lipid_species):
    df_list.append(data_lipid[(data_lipid["ChemicalName"] == k1) & (data_lipid["ScientificName"] == k2)])    #Extract data for each combination

ax = [0]*4
fig = plt.figure(figsize=(16, 9))
for i in range(4): 
    ax[i] = fig.add_subplot(2, 2, i+1, projection=ccrs.PlateCarree())    #Added area to draw graph
    ax[i].coastlines()
    ax[i].scatter(x = np.array(df_list[4*i+0]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+0]["CollectionLatitudeFrom"]),
                  c = "red", alpha=0.5)    #Melon-headed whale
    ax[i].scatter(x = np.array(df_list[4*i+1]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+1]["CollectionLatitudeFrom"]), 
                  c = "blue", alpha=0.5)    #Finless porpoise
    ax[i].scatter(x = np.array(df_list[4*i+2]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+2]["CollectionLatitudeFrom"]), 
                  c = "yellow", alpha=0.5)    #Cinaus dolphin
    ax[i].scatter(x = np.array(df_list[4*i+3]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+3]["CollectionLatitudeFrom"]), 
                  c = "green", alpha=0.5)    #Striped dolphin
    ax[i].set_xlim(90,180)
    ax[i].set_ylim(15, 60)
    ax[i].set_title(lipid_chemicals[i])
plt.show()

This only reflects the measurement points, so plot the measured values on the map as well. Let's reflect it in the radius of each scatter point. All you have to do is assign the data in the MeasuredValue column to the s parameter of the scatter method that draws the scatter plot.

fig = plt.figure(figsize=(16, 9))
for i in range(4): 
    ax[i] = fig.add_subplot(2, 2, i+1, projection=ccrs.PlateCarree())
    ax[i].coastlines()
    ax[i].scatter(x = np.array(df_list[4*i+0]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+0]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+0]["MeasuredValue"]), c = "red", alpha=0.5)
    ax[i].scatter(x = np.array(df_list[4*i+1]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+1]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+1]["MeasuredValue"]), c = "blue", alpha=0.5)
    ax[i].scatter(x = np.array(df_list[4*i+2]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+2]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+2]["MeasuredValue"]), c = "yellow", alpha=0.5)
    ax[i].scatter(x = np.array(df_list[4*i+3]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+3]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+3]["MeasuredValue"]), c = "green", alpha=0.5)
    ax[i].set_xlim(90,180)
    ax[i].set_ylim(15, 60)
    plt.title(lipid_chemicals[i])
plt.show()

If the value of

MeasuredValue is substituted as it is into the radius of the scatter point, the circle of the graph of ΣDDTs will become large and all will overlap, so adjust so that the radius is small. Here, the size is uniformly set to 1/10. </ p>

fig = plt.figure(figsize=(16, 9))
rate = [10,10,10,10]    #Variable indicating scale
for i in range(4): 
    ax[i] = fig.add_subplot(2, 2, i+1, projection=ccrs.PlateCarree())
    ax[i].coastlines()
    ax[i].scatter(x = np.array(df_list[4*i+0]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+0]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+0]["MeasuredValue"])/rate[i], c = "red", alpha=0.3)
    ax[i].scatter(x = np.array(df_list[4*i+1]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+1]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+1]["MeasuredValue"])/rate[i], c = "blue", alpha=0.3)
    ax[i].scatter(x = np.array(df_list[4*i+2]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+2]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+2]["MeasuredValue"])/rate[i], c = "yellow", alpha=0.3)
    ax[i].scatter(x = np.array(df_list[4*i+3]["CollectionLongitudeFrom"]), y = np.array(df_list[4*i+3]["CollectionLatitudeFrom"]), s=np.array(df_list[4*i+3]["MeasuredValue"])/rate[i], c = "green", alpha=0.3)
    ax[i].set_xlim(90,180)
    ax[i].set_ylim(15, 60)
    plt.title(lipid_chemicals[i])
plt.show()

Now you can plot the measurement results on the map. However, since all the data are displayed in an overlapping manner, it is very difficult to see, and the time-series changes cannot be seen in this figure. In the second part, we will animate to solve those problems.

footnote

^{1 An iterator is a mechanism that repeatedly accesses the next element in a type that collects data such as a list.}

^{2 Cartesian product is to create a new set by extracting one element from each set and making a set as shown in the figure below. With Cartesian product.}

Learning Python with ChemTHEATER 05-1