In Use Azure ML Python SDK: Use dataset as input - Part 1 and [Use Azure ML Python SDK: Use dataset as input - Part 2](https://qiita.com/notanaha/items/30d57590c92b03bc953c), I described how to handle input data. There I used the outputs folder provided by default for the output; this time I will write the output to arbitrary blob storage.
The items that appear this time are:

- A CSV file (assumed to be placed in Azure Blob Storage for this exercise)
- In the figure below, only HelloWorld.txt in the work folder exists at the start; output.csv in work_out is the file copied there as a result of running this script
- A remote virtual machine (hereinafter "compute cluster", following Azure ML terminology)
Check the Python SDK version:
```python
import azureml.core
print("SDK version:", azureml.core.VERSION)
```
The rest of this article assumes the notebook folder structure shown below.
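The original folder-structure figure is an image and is not reproduced here; judging from the names used later in this article (HelloWorld2.0.ipynb, script_folder2, script2.py), the layout is presumably something like the following (config.json is my assumption, implied by Workspace.from_config()):

```
HelloWorld2.0.ipynb
config.json          # assumed: read by Workspace.from_config()
script_folder2/
    script2.py
```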
As before, script2.py simply reads a CSV file from blob storage and writes it to the work_out folder. Likewise, the role of HelloWorld2.0.ipynb is to send script2.py to the compute cluster for execution.
The procedure in HelloWorld2.0.ipynb is as follows; the output folder is specified in step ③.

① Load the packages
② Specify the compute cluster
③ Specify the CSV file path and output folder
④ Specify the container environment
⑤ Specify the executable file name
⑥ Run the experiment
Let's take a look at the steps in order.
## Load the packages

First, load the required packages.
```python
from azureml.core import Workspace, Experiment, Dataset, Datastore, ScriptRunConfig, Environment
from azureml.data import OutputFileDatasetConfig
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.conda_dependencies import CondaDependencies

workspace = Workspace.from_config()
```
## Specify the compute cluster
You can also create remote compute resources with the Python SDK, but here I created a compute cluster in Azure ML Studio in advance to keep the overall picture easier to follow. For completeness, a sketch of creating one from the SDK appears after the next cell.
```python
aml_compute_target = "demo-cpucluster"  # <== the name of the cluster being used

try:
    aml_compute = ComputeTarget(workspace, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("no compute target with the specified name found")
```
## Specify the CSV file path and output folder
demostore is the name of a datastore registered in the Azure ML workspace. I pass the file path inside the datastore's blob container to the dataset class.
Unlike last time, the input is passed by file name with File.from_files(). Tabular.from_delimited_files() is used to pass tabular data such as CSV files, whereas File.from_files() can be used to pass arbitrary files and folders.
```python
ds = Datastore(workspace, 'demostore')
input_data = Dataset.File.from_files(ds.path('work/HelloWorld.txt')).as_named_input('input_ds').as_mount()

output = OutputFileDatasetConfig(destination=(ds, 'work_out'))
```
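For comparison, the Tabular variant mentioned above would look like the sketch below. It is not used in this article's pipeline; it is only meant to show the alternative call.

```python
# Alternative (not used here): load the same path as a tabular dataset instead of a file dataset
tabular_ds = Dataset.Tabular.from_delimited_files(path=ds.path('work/HelloWorld.txt'))
```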
## Specify the container environment
As mentioned above, this time we use Environment() instead of RunConfiguration(). With RunConfiguration(), the compute cluster was specified at this point; with Environment(), it is not specified here, and the compute cluster is instead specified in ScriptRunConfig() below.
Only pip_packages is used here, but conda_packages can be specified in the same way as with RunConfiguration(); see the sketch after the next cell.
```python
myenv = Environment("myenv")
myenv.docker.enabled = True
myenv.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-defaults'])
```
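As a hedged illustration of the conda_packages option mentioned above (the package name scikit-learn is a placeholder, not taken from this article):

```python
# Sketch: mixing conda and pip packages in one set of dependencies
myenv.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['scikit-learn'],        # illustrative placeholder
    pip_packages=['azureml-defaults'])
```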
## Specify the executable file name
In source_directory, specify the folder containing the set of scripts to be executed remotely. In script, specify the name of the script file that serves as the entry point for remote execution.
During remote execution, all files and subdirectories in source_directory are passed to the container, so be careful not to place unnecessary files there.
In the previous articles I introduced the Azure ML Python SDK-specific way of receiving datasets; this time, the script instead receives the values specified in arguments via argparse.
Pass input_data under the argument name datadir, and output under the argument name output.
We also specify the compute cluster in compute_target and pass myenv, the Environment instance created above, in environment.
```python
src = ScriptRunConfig(source_directory='script_folder2',
                      script='script2.py',
                      arguments=['--datadir', input_data, '--output', output],
                      compute_target=aml_compute,
                      environment=myenv)
```
## Run the experiment
Run the script.
```python
exp = Experiment(workspace, 'InOutSample')
run = exp.submit(config=src)
```
This cell returns asynchronously; if you want to wait for the run to finish, execute the following statement.
```python
%%time
run.wait_for_completion(show_output=True)
```
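If you are working in a Jupyter notebook, the run can also be monitored interactively. A minimal sketch, assuming the azureml-widgets package is installed:

```python
from azureml.widgets import RunDetails

# Shows an interactive progress and log widget for the submitted run
RunDetails(run).show()
```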
## script2.py
The contents of the script that is executed remotely.
The datadir and output arguments are parsed with argparse. args.datadir receives the full path to the input file. args.output, on the other hand, receives only the folder path, so os.path.join is used here to append the file name output.csv.
```python
import argparse
import os

print("*********************************************************")
print("*************        Hello World!       *************")
print("*********************************************************")

parser = argparse.ArgumentParser()
parser.add_argument('--datadir', type=str, help="data directory")
parser.add_argument('--output', type=str, help="output")
args = parser.parse_args()

print("Argument 1: %s" % args.datadir)
print("Argument 2: %s" % args.output)

# Read the mounted input file and copy its contents to the output folder
with open(args.datadir, 'r') as f:
    content = f.read()

with open(os.path.join(args.output, 'output.csv'), 'w') as fw:
    fw.write(content)
```
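After the run finishes, you can check that output.csv actually landed in the work_out folder of the datastore. A minimal sketch of one way to do this from the notebook, downloading from the blob datastore (the local target path is an arbitrary choice of mine):

```python
# Download everything under work_out from the 'demostore' datastore for inspection
ds.download(target_path='./work_out_check', prefix='work_out', show_progress=True)
```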
That's all for this article. This time I showed how to write output to arbitrary blob storage. In addition, unlike last time, File.from_files() was used to specify the input file and Environment() was used to specify the container environment. Next time, I will introduce a variation that specifies a folder with File.from_files().
## References

- Azure/MachineLearningNotebooks: scriptrun-with-data-input-output
- Use Azure ML Python SDK 1: Use dataset as input - Part 1
- [Use Azure ML Python SDK 2: Use dataset as input - Part 2](https://qiita.com/notanaha/items/30d57590c92b03bc953c)
- [Use Azure ML Python SDK 4: Write output to Blob storage - Part 2](https://qiita.com/notanaha/items/655290670a83f2a00fdc)
- azureml.data.OutputFileDatasetConfig class - Microsoft Docs
- azureml.core.Environment class - Microsoft Docs