In Use Azure ML Python SDK 1: Use dataset as input - Part 1, the input dataset was specified by the caller of script.py. Since Azure Machine Learning Workspace lets you register a dataset, it is natural to want to retrieve that registered dataset from inside script.py itself. This time, I will show how to do that.
The items that will appear this time are as follows.
- CSV file (assumed to be located on Azure Blob Storage)
- Here, the CSV file is registered as a dataset from the Azure Machine Learning Studio UI
- Remote virtual machine (hereinafter "compute cluster", following Azure ML terminology)
Last time I used a local PC (with Visual Studio Code and the Azure ML Python SDK installed) instead of a Jupyter Notebook, but both work the same way when it comes to running a script on a remote compute cluster. A Jupyter Notebook launched from a compute instance in Azure Machine Learning Studio is convenient because you can recreate the compute instance to pick up the latest Azure ML Python SDK version.
The description below assumes the notebook has the following folder structure. You don't need to worry about config.json because the notebook runs inside the Azure ML Workspace environment.
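Based on the file names used later in this article, the layout is presumably:

```
./
├── HelloWorld1.1.ipynb
├── config.json          # provided by the Azure ML Workspace environment
└── script_folder/
    └── script1.1.py
```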
In this example as well, script1.1.py simply reads the CSV file on Blob Storage and writes it to the outputs directory. As before, HelloWorld1.1.ipynb's job is to send script1.1.py to the compute cluster for execution.
The procedure in HelloWorld1.1.ipynb is as follows. Unlike last time, there is no step that specifies the CSV file path on Blob Storage.
Let's take a look at the steps in order.
Load the package
First, load the package.
```python
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, Datastore, ScriptRunConfig
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration, DEFAULT_CPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies

# Connect to the workspace described by config.json
workspace = Workspace.from_config()
```
Specifying a computing cluster
You can also create remote compute resources with the Python SDK (a sketch follows the code below), but here I've created a compute cluster in the Azure ML Studio workspace in advance so the overall picture is easier to see.
```python
aml_compute_target = "demo-cpucluster"  # <== The name of the cluster being used
try:
    aml_compute = ComputeTarget(workspace, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("no compute target with the specified name found")
```
Specifying the container environment
Here, specify the execution environment. Pass the compute cluster variable and specify the packages to install in the container image. Only pip_packages is used here, but you can also specify conda_packages, as shown after the code below.
```python
run_config = RunConfiguration()
run_config.target = aml_compute
run_config.environment.docker.enabled = True
run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
run_config.environment.python.user_managed_dependencies = False
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['azureml-defaults'],
    pin_sdk_version=False
)
```
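As mentioned, conda_packages can be specified alongside pip_packages. A hypothetical variation (the pandas entry is purely illustrative):

```python
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas'],          # illustrative conda package
    pip_packages=['azureml-defaults'],
    pin_sdk_version=False
)
```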
Specifying the executable file name
In source_directory, specify the folder that contains the set of scripts to be executed remotely (script_folder here). In script, specify the name of the script file that serves as the entry point for remote execution.
In remote execution, all the files and subdirectories in script_folder are passed to the container, so be careful not to place unnecessary files.
Since the input file is fetched by script1.1.py, it is not specified here.
```python
src = ScriptRunConfig(source_directory='script_folder', script='script1.1.py',
                      run_config=run_config)
```
Run the experiment
experiment_name is used as the display name for the experiment.
```python
experiment_name = 'ScriptRunConfig2'
experiment = Experiment(workspace=workspace, name=experiment_name)
run = experiment.submit(config=src)
run
```
This cell finishes asynchronously, so if you want to wait for execution to complete, run the following statement.
```python
%%time
run.wait_for_completion(show_output=True)
```
script1.1.py
The contents of the script executed remotely are shown below.
Run.get_context() returns the run context of the currently executing script. The experiment is available as an attribute of this run, and the workspace in turn as an attribute of the experiment. Once you have the workspace, you can retrieve a dataset registered in it with Dataset.get_by_name. This get_by_name call is written in the same form as the snippet shown on the "Consume" tab of a registered dataset in Azure Machine Learning Studio.
Finally, this script writes the file to the outputs folder. The outputs folder is created automatically without any extra steps, and its contents can be viewed from the experiment's "Outputs + logs" after execution.
```python
from azureml.core import Run, Dataset, Workspace

# Get the context of the current run, then walk up to the workspace
run = Run.get_context()
exp = run.experiment
workspace = exp.workspace

# Retrieve the registered dataset and load it into a pandas DataFrame
dataset = Dataset.get_by_name(workspace, name='hello_ds')
df = dataset.to_pandas_dataframe()

HelloWorld = df.iloc[0, 1]
print('*******************************')
print('********* ' + HelloWorld + ' *********')
print('*******************************')

# Anything written to ./outputs is uploaded with the run automatically
df.to_csv('./outputs/HelloWorld.csv', mode='w', index=False)
```
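As a side note, once the run has completed, you can also pull the result file back to the notebook side with Run.download_file. A minimal sketch (the local file name is an arbitrary choice for this example):

```python
# Back in HelloWorld1.1.ipynb: fetch the result file from the outputs folder
run.download_file(name='outputs/HelloWorld.csv', output_file_path='HelloWorld.csv')
```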
[Reference] Contents of HelloWorld.txt
The CSV file used here is simple.
```
0,Hello World
1,Hello World
2,Hello World
```
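For reference, the registration done through the Studio UI at the beginning could also be done from the SDK. A sketch under assumed names (the datastore name and file path are placeholders, not values from this article):

```python
from azureml.core import Dataset, Datastore

datastore = Datastore.get(workspace, 'workspaceblobstore')  # assumed datastore name
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, 'data/HelloWorld.txt'),  # assumed path on Blob Storage
    header=False                              # the sample file has no header row
)
dataset.register(workspace=workspace, name='hello_ds')
```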
What do you think? There are several variations of input/output handling in the Azure ML Python SDK. Next time, I would like to introduce the output side.
- What is Azure Machine Learning SDK for Python (Microsoft Docs)
- azureml.core.experiment.Experiment class (Microsoft Docs)
- Use Azure ML Python SDK 1: Use dataset as input - Part 1
- Use Azure ML Python SDK 3: Write output to Blob storage - Part 1
- Use Azure ML Python SDK 4: Write output to Blob storage - Part 2 (https://qiita.com/notanaha/items/655290670a83f2a00fdc)