In an analysis project of Cloud Pak for Data (CP4D), a Notebook or a Data Refinery Flow can be turned into a Job and executed in batch. What I want to do this time comes down to two things: running a Job with environment variables set from the CP4D screen, and starting a Job via the API while passing environment variables to it.
Strictly speaking, for a Job the expression "start it with environment variables set" seems more accurate than "pass arguments at runtime". I presume the environment variables are handled as an OpenShift ConfigMap, probably because the Job is launched internally as an OpenShift pod.
Let's actually start a Job via the API, give it environment variables at that time, and pass them through to the processing logic.
Create a Notebook and turn it into a Job. The environment variables handled this time are MYENV1, MYENV2, and MYENV3; their values are put into a pandas DataFrame and written out as CSV to the data assets of the analysis project. Of course, these environment variables are not defined by default, so set default values with the default argument of os.getenv.
import os
myenv1 = os.getenv('MYENV1', default='no MYENV1')
myenv2 = os.getenv('MYENV2', default='no MYENV2')
myenv3 = os.getenv('MYENV3', default='no MYENV3')
print(myenv1)
print(myenv2)
print(myenv3)
# -output-
# no MYENV1
# no MYENV2
# no MYENV3
Next, put these three values into a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'myenv1' : [myenv1], 'myenv2' : [myenv2], 'myenv3' : [myenv3]})
df
# -output-
# myenv1 myenv2 myenv3
# 0 no MYENV1 no MYENV2 no MYENV3
Then export it as a data asset of the analysis project, adding a time stamp to the file name. Saving data to the data assets of an analysis project is described in [this article](https://qiita.com/ttsuzuku/items/eac3e4bedc020da93bc1#%E3%83%87%E3%83%BC%E3%82%BF%E8%B3%87%E7%94%A3%E3%81%B8%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E4%BF%9D%E5%AD%98-%E5%88%86%E6%9E%90%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88).
from project_lib import Project
project = Project.access()
import datetime
now = datetime.datetime.now(datetime.timezone(datetime.timedelta(hours=9))).strftime('%Y%m%d_%H%M%S')
project.save_data("jov_env_test_"+now+".csv", df.to_csv(),overwrite=True)
From the Notebook menu, select File > Save Versions to save a version; this is required when creating a Job. Then click the Job button at the top right of the Notebook screen and select Create Job. Give the Job a name and click Create.
Let's run the created Job from the CP4D screen. First, just click the "Run Job" button and execute it without defining any environment variables.
The run is fine once it finishes and the status becomes "Completed".
Looking at the data assets of the analysis project, a CSV file has been generated,
and clicking the file name to see the preview shows that the default values set in the Notebook were stored.
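As a side check, the saved CSV can also be read back programmatically with project_lib. A minimal sketch, run in the Notebook on CP4D; the file name here is only an example of one generated above and needs to be replaced with the actual name:

```python
import pandas as pd
from project_lib import Project

project = Project.access()

# Read back the CSV that was written to the project's data assets
# (replace the file name with the one actually generated)
buf = project.get_file("jov_env_test_20200531_222018.csv")
df_check = pd.read_csv(buf)
print(df_check)
```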
Next, set the environment variables and run the Job again. Click "Edit" next to "Environment Variables" on the Job screen and set the following three lines.
MYENV1=1
MYENV2=hoge
MYENV3=10.5
The settings look like this.
After submitting the settings, run the Job again. The contents of the resulting CSV file look like this. Since these are environment variables, even numerical values are treated as strings.
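If the Notebook needs these values as numbers, they have to be converted back from strings explicitly. A minimal sketch, using the variable names from the Notebook above:

```python
# Environment variables always arrive as strings, so convert explicitly when numbers are needed
try:
    myenv1_num = int(myenv1)     # e.g. "1"    -> 1
    myenv3_num = float(myenv3)   # e.g. "10.5" -> 10.5
except ValueError:
    # The default "no MYENVx" strings cannot be converted
    myenv1_num = myenv3_num = None
print(myenv1_num, myenv3_num)
```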
Next, use Python requests to kick the created Job via the API. Run the following code from a Python environment outside CP4D.
To get a token, perform Basic authentication with a user name and password and obtain an accessToken. For authentication, there is an [example of running it with curl in the CP4D v2.5 product manual](https://www.ibm.com/support/knowledgecenter/ja/SSQNUZ_2.5.0/wsj/analyze-data/ml-authentication-local.html).
url = "https://cp4d.hostname.com"
uid = "username"
pw = "password"
import requests
#Authentication
response = requests.get(url+"/v1/preauth/validateAuth", auth=(uid,pw), verify=False).json()
token = response['accessToken']
The verify=False option in requests skips certificate verification; it is a workaround for when CP4D uses a self-signed certificate.
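Note that each request with verify=False also prints an InsecureRequestWarning; if that is distracting, it can be suppressed with urllib3 (a small optional snippet):

```python
import urllib3

# Optional: silence the InsecureRequestWarning emitted for every verify=False request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```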
Next, get the Job list of the analysis project. As preparation, find out the ID of the analysis project to be used on CP4D in advance. Display and check the environment variable PROJECT_ID in a Notebook within the analysis project.
Checking the project ID (run in a Notebook on CP4D):
import os
os.environ['PROJECT_ID']
# -output-
# 'f3110316-687e-450a-8f17-57296c907973'
Set the project ID found above and get the Job list via the API. The API used here is the Watson Data API; the API reference is Jobs / Get list of jobs under a project.
project_id = 'f3110316-687e-450a-8f17-57296c907973'
headers = {
'Authorization': 'Bearer ' + token,
'Content-Type': 'application/json'
}
# Job list
response = requests.get(url+"/v2/jobs?project_id="+project_id, headers=headers, verify=False).json()
response
# -output-
#{'total_rows': 1,
# 'results': [{'metadata': {'name': 'job_env_test',
# 'description': '',
# 'asset_id': 'b05d1214-d684-4bd8-b1fa-cc05a8ccee81',
# 'owner_id': '1000331001',
# 'version': 0},
# 'entity': {'job': {'asset_ref': '6e0b450e-2f9e-4605-88bf-d8a5e2bda4a3',
# 'asset_ref_type': 'notebook',
# 'configuration': {'env_id': 'jupconda36-f3110316-687e-450a-8f17-57296c907973',
# 'env_type': 'notebook',
# 'env_variables': ['MYENV1=1', 'MYENV2=hoge', 'MYENV3=10.5']},
# 'last_run_initiator': '1000331001',
# 'last_run_time': '2020-05-31T22:20:18Z',
# 'last_run_status': 'Completed',
# 'last_run_status_timestamp': 1590963640135,
# 'schedule': '',
# 'last_run_id': 'ebd1c2f1-f7e7-40cc-bb45-5e12f4635a14'}}}]}
The above asset_id is the ID of Job "job_env_test". Store it in a variable.
job_id = "b05d1214-d684-4bd8-b1fa-cc05a8ccee81"
Run the above Job via the API. The API reference is Job Runs / Start a run for a job. At run time you need to pass a job_run value as JSON, and the runtime environment variables are included in it.
jobrunpost = {
"job_run": {
"configuration" : {
"env_variables" : ["MYENV1=100","MYENV2=runbyapi","MYENV3=100.0"]
}
}
}
Pass the above job_run as JSON and run the Job. The run ID is stored in 'asset_id' of the response's 'metadata'.
# Run job
response = requests.post(url+"/v2/jobs/"+job_id+"/runs?project_id="+project_id, headers=headers, json=jobrunpost, verify=False).json()
# Job run id
job_run_id = response['metadata']['asset_id']
job_run_id
# -output-
# 'cedec57a-f9a7-45e9-9412-d7b87a04036a'
After starting the run, check its status. The API reference is Job Runs / Get a specific run of a job.
# Job run status
response = requests.get(url+"/v2/jobs/"+job_id+"/runs/"+job_run_id+"?project_id="+project_id, headers=headers, verify=False).json()
response['entity']['job_run']['state']
# -output-
# 'Starting'
If you run this requests.get several times, the state changes from 'Starting' to 'Running' to 'Completed'. When it becomes 'Completed', the run has finished.
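Instead of re-running the GET by hand, the state can also be polled in a small loop until the run finishes. A minimal sketch, reusing the url, headers, project_id, job_id, and job_run_id variables defined above:

```python
import time

# Poll the run state until the Job is no longer starting or running
while True:
    response = requests.get(
        url + "/v2/jobs/" + job_id + "/runs/" + job_run_id + "?project_id=" + project_id,
        headers=headers, verify=False).json()
    state = response['entity']['job_run']['state']
    print(state)
    if state not in ('Starting', 'Running'):
        break
    time.sleep(10)
```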
Return to the CP4D screen and check the contents of the CSV file generated in the data assets of the analysis project.
It confirms that the environment variables specified in job_run are properly reflected in the result data.
(Bonus) Double-byte characters can also be used in the values of the job_run environment variables.
job_run containing double-byte characters:
jobrunpost = {
"job_run": {
"configuration" : {
"env_variables" : ["MYENV1=AIUEO","MYENV2=a-I-U-E-O","MYENV3=Aio"]
}
}
}
Execution result:
After that, you can do whatever you like with the environment variable values (strings) received in the Job's Notebook.
(Reference material) https://github.ibm.com/GREGORM/CPDv3DeployML/blob/master/NotebookJob.ipynb This repository contains useful samples of Notebooks that can be used with CP4D.